As much as technology has evolved in the last few years, and even with the advancements in speech recognition software, transcription services still require human review and intervention to ensure close to 100% accuracy. Although lower accuracy may be an acceptable tradeoff for immediate results, most people who use transcription professionally need publishing-ready transcripts with minimal review time in order to justify the expense.
To get around this hurdle, our transcription process uses a hybrid approach that combines, in one package, quicker turnaround and lower prices than human transcription alone with higher accuracy than speech processing alone. Will speech-processing software ever reach human transcriber accuracy? Realistically, this will not happen for at least another decade, and here’s why:
Speech patterns and accents
People in different regions, and even different people within the same region, have their own ways of speaking, which makes training a computer to recognize accents and speech patterns very difficult, even when it is tested with varied sample groups. Moreover, some people slur their words or blend them together when speaking very quickly, which can cause additional errors in the transcription of the audio. People may also stutter or pause to think, which means that the software may include filler words like “um,” “ah,” and “hmm” that should be omitted from a clean transcript.
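As a rough illustration of the clean-up step described above, the sketch below strips standalone filler words from a verbatim transcript. It is only a minimal example and not a description of any particular product: the filler list, the `FILLERS` name and the `clean_verbatim` function are assumptions made for this illustration.

```python
import re

# Hypothetical filler list for illustration; a real style guide would define its own.
FILLERS = ("um", "uh", "ah", "hmm")

# Match a filler as a whole word, plus an optional trailing comma and whitespace.
# The word boundaries keep words such as "umbrella" intact.
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(FILLERS) + r")\b,?\s*",
    flags=re.IGNORECASE,
)


def clean_verbatim(text: str) -> str:
    """Remove standalone filler words and tidy the leftover spacing."""
    without_fillers = _FILLER_RE.sub("", text)
    # Collapse any runs of whitespace left behind by the removals.
    return re.sub(r"\s{2,}", " ", without_fillers).strip()


print(clean_verbatim("So, hmm, the umbrella is ah broken"))
```

A simple pattern like this cannot restore capitalization at the start of a sentence, and it cannot tell a filler “ah” from a meaningful one, which is part of the reason a human reviewer still makes the final pass.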
Grammar and punctuation
Speech recognition software also requires that you verbalize punctuation rather than inferring it automatically (for example, by saying “comma,” “period” or “colon” instead of implying it through the tone of your voice). This makes audio from professional speeches, meetings and interviews difficult to transcribe automatically, because the output requires human review to add the appropriate punctuation and to fix any grammar mistakes.
Homonyms and unusual words
Speech-processing software can only recognize words and phrases that it has specifically been trained to recognize. As such, whenever a speaker uses slang or made-up terms that are not in the program’s vocabulary, the machine will not recognize them. The same applies to brand names, last names, and other unusual words, such as acronyms or highly technical vocabulary.
Another possible problem is the use of homonyms, or more precisely homophones: words that sound the same but differ in spelling and meaning, such as there/their/they’re and air/heir. A computer cannot tell which word to use without understanding the context of the sentence, which requires extensive programming and further advancements in the technology.
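To make the role of context concrete, here is a toy sketch that chooses between sound-alike words by looking at the word that follows. Everything in it is an assumption made for illustration: the cue words in `NEXT_WORD_CUES` are invented rather than drawn from real data, and `pick_homophone` is a hypothetical helper, not how any real speech engine works.

```python
# Toy context cues, invented purely for illustration (not real corpus statistics).
# Each homophone maps to a few words that plausibly follow it.
NEXT_WORD_CUES = {
    "their": {"house", "car", "team", "idea"},
    "there": {"is", "are", "was", "were"},
    "they're": {"going", "coming", "not", "happy"},
}


def pick_homophone(candidates, next_word):
    """Return the candidate whose cue set contains next_word.

    Falls back to the first candidate when no cue matches, which is
    exactly the situation where a human reviewer would be needed.
    """
    for word in candidates:
        if next_word.lower() in NEXT_WORD_CUES.get(word, set()):
            return word
    return candidates[0]


options = ["there", "their", "they're"]
print(pick_homophone(options, "house"))
print(pick_homophone(options, "going"))
```

Even this tiny example shows why the problem is hard: one neighboring word is rarely enough, and covering every sentence a speaker might say takes far more context than a short lookup table can hold.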
Ambient noise, overlapping speech, and number of speakers
When multiple speakers are present, they will frequently interrupt each other or speak at the same time, which can be challenging to transcribe for even the most experienced human transcribers. Because computers require clear speech, it is nearly impossible for them to deduce the words accurately and then attribute the text to the correct speaker. A human transcriber may at least be able to figure out who spoke and what they said based on the sound of the speaker’s voice, as well as the preceding context.
Additionally, ambient noise from music, background conversation and even wind will affect the accuracy of the transcription, because the computer analyzes short segments of sound to identify each word and these other sounds can cause inaccuracies. Although ambient noise can also be an issue for human transcribers, they can at least use context to work out what is being said.
Speech recognition software, while a valuable tool in transcription, is still a long way from achieving close to 100% accuracy for most users, especially without some form of human review. Thus, for professional and enterprise clients, a hybrid approach such as TranscribeMe’s is the best option for generating speedy transcripts with high accuracy at low cost.
Please get in touch with our sales team if you’re interested in transcribing audio to text!
This article was originally published on November 28, 2016.