TranscribeMe uses automatic speech recognition (ASR) technology to automatically convert audio into text transcripts.

When the audio is of very high quality and the requirement is less than 100% word accuracy, an ASR can produce a fairly accurate finished transcript in a short amount of time. This accuracy is usually achieved when there’s a single speaker, or a dialogue between two speakers who each have a separate mic (which could simply be two phones). You might think this would be the best way to create data sets for more accurate ASR, but that is often not the case, for several reasons. The primary one is audio quality.


Audio Quality Limitations

Audio quality is not always straightforward. Many factors matter beyond whether the recording itself is clear. The audio may be very clear, yet multiple speakers talk over each other, the speakers have accents that confuse the ASR, or the recording contains significant background noise.

These examples, as well as other quality issues, can limit ASR usability.

ASRs need to do more than simple word transcription. Many use cases require additional features that most ASRs just don’t have. The two most common requirements are timestamping (per word and/or per speaker change) and what is known in ASR terms as diarization: identifying which speaker is talking, typically not by name but simply as speaker 1, speaker 2, speaker 3, etc.
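To make the two requirements concrete, here is a minimal sketch of what a timestamped, diarized transcript might look like as data. The `Word` and `Turn` types and the `group_into_turns` helper are hypothetical names chosen for illustration, not the output format of any particular ASR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float   # word start time, in seconds
    end: float     # word end time, in seconds
    speaker: int   # anonymous diarization label (1 means "Speaker 1")


@dataclass
class Turn:
    speaker: int
    start: float
    end: float
    text: str


def group_into_turns(words: List[Word]) -> List[Turn]:
    """Merge consecutive words from the same speaker into one turn.

    A new turn starts at every speaker change, so each turn carries
    the timestamp of that change.
    """
    turns: List[Turn] = []
    for w in words:
        if turns and turns[-1].speaker == w.speaker:
            turns[-1].end = w.end
            turns[-1].text += " " + w.text
        else:
            turns.append(Turn(w.speaker, w.start, w.end, w.text))
    return turns


# Hypothetical per-word output for a short two-speaker exchange.
words = [
    Word("Hello", 0.00, 0.40, 1),
    Word("there.", 0.45, 0.80, 1),
    Word("Hi.", 1.10, 1.35, 2),
]

for turn in group_into_turns(words):
    print(f"[{turn.start:.2f}] Speaker {turn.speaker}: {turn.text}")
```

The grouping step is only as good as the speaker labels it receives, which is why unreliable diarization undermines speaker-change timestamps as well.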

Since TranscribeMe does not create its own ASR, it constantly tests all available options; the one common failure is diarization. TranscribeMe has yet to find a speech model that can do this consistently.

Speech Technology Design

A quick word about speech technology. TranscribeMe recently met with a potential partner who asked whether technology from the “Bigs” (Google, Amazon, Microsoft, IBM) was used. The potential partner assumed that these companies have the resources to provide the best technology and that smaller companies like TranscribeMe could not be competitive. The thing is, the use case matters, and these companies mostly design for a specific niche: query responses, i.e. “OK Google”, “Alexa, play my tunes”, etc. These companies are not trying to autocomplete six hours of legal deposition.

TranscribeMe constantly tests ASRs so that it can offer customers the best options. The “Bigs” are not at the top of the list for that reason. In fact, no single ASR is consistently at the top of the list for all requirements. Engines vary in language support; in the ability to understand English in various dialects or accents; in the ability to provide a runtime that lives in TranscribeMe’s own domain for customer security requirements; and in the ability to add a dictionary of expected terms for niche audio. Tuning matters too: is the ASR tuned for call centers, for business dialogue (e.g. earnings calls), for management consultants, and so on?
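One way to picture these criteria is as a capability filter over engine profiles. The sketch below is purely illustrative: `EngineProfile`, `candidates`, the field names, and the engine names are all hypothetical, and it does not describe how TranscribeMe actually selects engines.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class EngineProfile:
    name: str
    languages: FrozenSet[str]    # supported language codes
    on_premises: bool            # runtime can live in your own domain
    custom_dictionary: bool      # accepts a list of expected terms
    tuned_for: FrozenSet[str]    # e.g. {"call_center", "earnings_call"}


def candidates(
    engines: List[EngineProfile],
    language: str,
    domain: str,
    need_on_premises: bool = False,
    need_dictionary: bool = False,
) -> List[EngineProfile]:
    """Return the engines that satisfy every stated requirement."""
    return [
        e
        for e in engines
        if language in e.languages
        and domain in e.tuned_for
        and (e.on_premises or not need_on_premises)
        and (e.custom_dictionary or not need_dictionary)
    ]


# Made-up engine profiles, for illustration only.
engines = [
    EngineProfile("engine_a", frozenset({"en", "es"}), False, True,
                  frozenset({"call_center"})),
    EngineProfile("engine_b", frozenset({"en"}), True, True,
                  frozenset({"earnings_call", "legal"})),
]

matches = candidates(engines, language="en", domain="legal",
                     need_on_premises=True)
print([e.name for e in matches])
```

Because different projects filter down to different engines, a filter like this naturally leads to keeping several engines available rather than standardizing on one.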

TranscribeMe has yet to find a single ASR that works best in all use cases, so it employs multiple engines. That said, as alluded to before, an ASR can’t do the job on its own except in highly constrained cases; it also needs help from humans.

Why ASR Technology Still Needs Human Help

TranscribeMe calls the process of helping ASRs with human input “Blend”. You might also hear the phrase “human in the loop”. Whatever it’s called, it simply means that an audio file is first processed through an ASR and then sent to a human for correction and completion.

But wait! There’s more! Back to the quality issue. Poor-quality audio processed by an ASR produces a transcript so poor that it takes longer to correct than it would take a transcriptionist to transcribe from scratch. To limit ASR processing to “good enough” audio, a confidence score is used: a snippet is run through the ASR to estimate the overall audio quality. If that confidence is at or above a certain threshold, the full audio is processed through Blend; otherwise, it’s sent to a manual workflow.
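The routing rule described above can be sketched in a few lines. The function name `route_audio`, the 0 to 1 confidence scale, and the default threshold of 0.8 are assumptions made for illustration; the article does not state the actual scale or cutoff.

```python
def route_audio(snippet_confidence: float, threshold: float = 0.8) -> str:
    """Pick a workflow from the ASR confidence measured on a short snippet.

    Returns "blend" (ASR first, then human correction) when the confidence
    is at or above the threshold, and "manual" (human transcription from
    scratch) otherwise. The 0.8 default is an assumed value.
    """
    if not 0.0 <= snippet_confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return "blend" if snippet_confidence >= threshold else "manual"


for score in (0.93, 0.80, 0.41):
    print(score, "->", route_audio(score))
```

In practice the threshold would be tuned so that files routed to Blend are, on average, faster to correct than to transcribe from scratch.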

So, now there’s ASR only and Blend. In some cases that’s still not enough to build a good enough data set. Additional processing is required, which can include timestamping per word and per speaker change. Per-word microsecond stamping, which is not possible with any ASR, is accomplished through a dedicated QA UI tool built by the TranscribeMe crowd.

Customer requirements and style guides call for further post-processing which, again, can’t be done by an ASR. A style may require numbers to be spelled out or left as digits. “Ahs” and “ums” may need to be included or removed. For these styles, TranscribeMe adds per-project scripting to fine-tune the transcript before it’s returned.
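As a rough illustration of per-project style scripting, the sketch below applies two style options to a transcript line. Everything here is hypothetical: the `StyleGuide` fields, the `apply_style` function, the filler list, and the choice to spell out only zero through ten are assumptions, not TranscribeMe’s actual scripts.

```python
import re
from dataclasses import dataclass


@dataclass
class StyleGuide:
    keep_fillers: bool = False       # keep "um", "uh", "ah" in the text
    spell_out_numbers: bool = True   # write 0-10 as words


# Standalone fillers, optionally followed by a comma or period.
# Hyphenated or contracted forms such as "uh-huh" are left alone.
_FILLERS = re.compile(r"(?<![-'])\b(?:um+|uh+|ah+)\b(?![-'])[,.]?\s*",
                      re.IGNORECASE)

# Whole integers only; skips decimals such as 3.5 and tokens such as 4x.
_INTEGERS = re.compile(r"(?<![\w.])\d+(?!\w|\.\d)")

_SMALL_NUMBERS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
    "10": "ten",
}


def apply_style(text: str, style: StyleGuide) -> str:
    """Apply one project's style options to a transcript line."""
    if not style.keep_fillers:
        text = _FILLERS.sub("", text)
    if style.spell_out_numbers:
        # Numbers outside the small table are left as digits.
        text = _INTEGERS.sub(
            lambda m: _SMALL_NUMBERS.get(m.group(), m.group()), text
        )
    return re.sub(r"\s{2,}", " ", text).strip()


line = "So, um, we have 3 options."
print(apply_style(line, StyleGuide()))
print(apply_style(line, StyleGuide(keep_fillers=True,
                                   spell_out_numbers=False)))
```

Keeping the options in a small per-project object means the same transcript can be rendered differently for two customers without touching the ASR or the human correction step.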

Why Automated Speech Recognition (ASR) is Still So Limited

The question someone might ask is, why is an ASR so limited? The answer lies in what is required to create an ASR. There’s a term, unsupervised learning, which is one of those nirvana terms, the ultimate goal: the ASR trains itself, just like any artificial intelligence you see in movies. (They learn on their own and eventually take over the world!)

In real life, and in each limited niche, an AI must be exhaustively taught every possible case known to humans in order to function. In the case of speech automation, annotated datasets must be created and fed into deep learning algorithms to produce an ASR. Then it needs to be done again, iteratively, until it’s good enough, and then it needs to be done some more.

TranscribeMe has been employed to build these types of structured data sets, which gives it insight into what’s required to build an accurate ASR. A dataset can be tuned to a specific niche, dialogue, or dictionary set, or in some cases to a specific customer.

TranscribeMe has actually been employed by customers who want to create their own ASR trained specifically on their own audio, because a generic engine has serious limitations in what it can provide to users looking for specific results. But regardless of how sophisticated the engines become, for the foreseeable future humans will continue to be involved, either in creating training data or in transcription, to make the final product more accurate.