At TranscribeMe, we offer speech recognition built on state-of-the-art technology and expertise to provide businesses with highly accurate automated transcription. Speech recognition is the technology by which a machine or program identifies spoken words or phrases and converts them to text or another machine-readable format. Here are some quotes from experts that explain how this in-demand technology actually works, some of its many applications, and the challenges the field currently faces:
How Does Speech Recognition Work?
“The computer takes in the waveform of your speech. Then it breaks that up into words, which it does by looking at the micro pauses you take in between words as you talk.”
– Meredith Broussard, Data Journalist and Professor at NYU
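A toy sketch of that idea: treat stretches of low energy as the "micro pauses" between words and cut the waveform there. This is a deliberately simplified illustration with synthetic audio (real recognizers model acoustics far more richly, and in fluent speech words often run together with no pause at all):

```python
import numpy as np

def segment_by_pauses(signal, frame_len=160, energy_thresh=0.01):
    """Toy illustration: split a waveform into chunks wherever
    frame energy drops below a threshold (a 'micro pause')."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = (frames ** 2).mean(axis=1)
    voiced = energies > energy_thresh  # True where someone is speaking

    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i * frame_len          # a word begins
        elif not v and start is not None:
            segments.append((start, i * frame_len))  # a word ends
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments

# Synthetic example: two "words" (tone bursts) separated by silence.
t = np.linspace(0, 1, 16000)
word = 0.5 * np.sin(2 * np.pi * 440 * t[:4000])
silence = np.zeros(2000)
audio = np.concatenate([silence, word, silence, word, silence])
print(segment_by_pauses(audio))  # two (start, end) sample ranges
```

The frame length and energy threshold here are made-up values for the synthetic signal, not settings from any production system.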
“Let’s say we have a particular speech sound, like the word ‘one.’ If I have a couple thousand examples of ‘one,’ I can compute the statistics of its acoustic properties, and the more data — the more samples of ‘one’ — I have, the more precise the description becomes. And once I have that, I can build fairly powerful recognition systems.”
– Alexander Rudnicky, Research Professor with the Carnegie Mellon Speech Group
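The statistical idea Rudnicky describes can be sketched in a few lines: fit a Gaussian to many example feature vectors per word, then recognize a new sample by which word's statistics explain it best. The two-dimensional "acoustic features" and the words "one"/"two" below are synthetic stand-ins for real spectral measurements:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend acoustic feature vectors for two words. In a real system these
# would be spectral measurements from thousands of recorded examples.
examples_one = rng.normal(loc=[1.0, 2.0], scale=0.3, size=(2000, 2))
examples_two = rng.normal(loc=[2.0, 1.0], scale=0.3, size=(2000, 2))

def fit_gaussian(samples):
    """More samples -> more precise mean/variance estimates."""
    return samples.mean(axis=0), samples.var(axis=0)

def log_likelihood(x, mean, var):
    # Diagonal-Gaussian log density, summed over feature dimensions.
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

models = {"one": fit_gaussian(examples_one), "two": fit_gaussian(examples_two)}

def recognize(x):
    # Pick the word whose statistics make the observation most likely.
    return max(models, key=lambda w: log_likelihood(x, *models[w]))

print(recognize(np.array([1.1, 1.9])))  # -> one
```

With 2,000 samples per word the estimated means land very close to the true ones, which is exactly the "more data, more precise description" point.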
“So the lexical models are built by stringing together acoustic models, the language model is built by stringing together word models, and it all gets compiled into one enormous representation of spoken English, let’s say, and that becomes the model that gets learned from data, and that recognizes or searches when some acoustics come in and it needs to find out what’s my best guess at what just got said.”
– Mike Cohen, Manager of Speech Technologies at Google
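Cohen's layering — phones strung into words, words strung into sentences — can be illustrated with a toy lexicon and bigram language model. The phone symbols and probabilities below are invented for the example; real systems compile this composition into enormous search graphs:

```python
# Toy composition: a lexicon strings phone models into word models,
# and a bigram language model strings word models into sentences.
lexicon = {
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
}
bigram = {  # P(next word | previous word); <s> marks sentence start
    ("<s>", "one"): 0.6, ("<s>", "two"): 0.4,
    ("one", "two"): 0.5, ("one", "one"): 0.5,
    ("two", "one"): 0.7, ("two", "two"): 0.3,
}

def compile_sentence(words):
    """Flatten a word sequence into the phone sequence the recognizer
    would search over, plus its language-model probability."""
    phones, prob, prev = [], 1.0, "<s>"
    for w in words:
        phones.extend(lexicon[w])      # lexical model: word -> phones
        prob *= bigram[(prev, w)]      # language model: word transitions
        prev = w
    return phones, prob

phones, prob = compile_sentence(["one", "two"])
print(phones)  # ['W', 'AH', 'N', 'T', 'UW']
print(prob)    # 0.6 * 0.5 = 0.3
```

Decoding then amounts to searching over all such compiled paths for the one that best matches the incoming acoustics — the "best guess at what just got said."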
“Let’s start with speech recognition. Before we go and train a speech system, what we have to do is collect a whole bunch of audio clips. So for example, if we wanted to build a new voice search engine, I would need to get lots of examples of people speaking to me, giving me little voice queries. And then I would actually need human annotators, or some kind of system that can give me ground truth — it can tell me, for a given audio clip, what was the correct transcription. And so once you’ve done that, you can ask a deep learning algorithm to learn the function that predicts the correct text transcript from the audio clip.”
– Adam Coates, Director of Baidu’s Silicon Valley AI Lab
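The pipeline Coates outlines — collect clips, attach ground-truth labels, learn a clip-to-text function — can be sketched with synthetic data. The nearest-neighbor rule below is a stand-in for the deep network a production system would train; the "clips," labels, and feature extractor are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def features(clip):
    """Stand-in feature extractor (real systems use spectrograms)."""
    return np.array([clip.mean(), clip.std()])

# Step 1: collect audio clips; step 2: annotators supply ground truth.
# Here the "clips" are synthetic arrays standing in for recorded queries.
dataset = []
for label, (mu, sigma) in {"weather": (0.0, 1.0), "news": (3.0, 0.5)}.items():
    for _ in range(50):
        dataset.append((rng.normal(mu, sigma, 400), label))

# Step 3: learn a function clip -> transcript. Nearest-neighbor lookup
# stands in for the deep learning algorithm Coates describes.
train_feats = np.array([features(clip) for clip, _ in dataset])
train_labels = [label for _, label in dataset]

def predict(clip):
    dists = np.linalg.norm(train_feats - features(clip), axis=1)
    return train_labels[int(np.argmin(dists))]

print(predict(rng.normal(3.0, 0.5, 400)))  # -> news
```

The essential structure — labeled (audio, transcript) pairs in, a predictive function out — is the same whether the learner is this toy rule or a deep network.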
What Are Some of the Ways It Can Be Applied?
“From a person’s voice alone most people can tell if someone is angry or nervous, but there are a ton of subtle things that are not perceivable by the human ear that are also connected to your thoughts. In our work, we measure thousands of aspects of speech and language, and many of them go beyond human hearing. We certainly can’t objectively measure them, but machines can, and those features are often highly correlated with one’s cognitive status and can indicate whether someone has Alzheimer’s, dementia, depression or anxiety.”
– Dr. Frank Rudzicz, Toronto Rehabilitation Institute-UHN
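A flavor of what "objectively measurable aspects of speech" means: even crude quantities like the fraction of time spent pausing can be computed from a waveform, where a listener could only guess. The features and thresholds below are toy versions invented for this sketch, not the measurements any clinical system actually uses:

```python
import numpy as np

def speech_features(signal, frame_len=160, energy_thresh=0.01):
    """A few objectively measurable speech features — toy versions of the
    kinds of quantities such systems compute by the thousands."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).mean(axis=1)
    voiced = energy > energy_thresh
    return {
        "pause_ratio": float(1 - voiced.mean()),       # fraction of silence
        "mean_energy": float(energy[voiced].mean()),   # loudness while speaking
        "energy_variability": float(energy[voiced].std()),
    }

# Synthetic signal: speech, a long pause, then speech again.
t = np.linspace(0, 1, 8000)
speech = 0.3 * np.sin(2 * np.pi * 220 * t)
audio = np.concatenate([speech, np.zeros(4000), speech])
print(speech_features(audio))
```

Research systems track far subtler quantities (pitch dynamics, articulation rate, lexical measures), but each is, like these, a number a machine can extract consistently from every recording.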
“In the past decade, voice based solutions were mostly used in banking and telecom call centers as well as in healthcare, but this was largely an experimentation stage, considering the issues of accuracy and business relevance. Only in the past few years, we noted a significant increase in demand and preparedness for speech technologies in financial services, insurance, and other sectors. There are many positive implementation examples across these industries: e.g. Barclays, Citibank, ING, Wells Fargo and others in banking.”
– Alexey Popov, CEO at Spitch
“It’s nice that ASR [Automatic Speech Recognition] is actually starting to be useful now. When I started out, the most visible ASR product was Dragon Dictate, which few people actually used — I believe it was marketed as the ideal Christmas present, which was deceptive. These days we have Amazon Alexa and Google Home, which people actually use — not to mention call center dialog systems. They are annoying, but that’s often a limitation of the dialog management rather than the ASR.”
– Daniel Povey, Associate Research Professor at the Center for Language and Speech Processing at Johns Hopkins University
What Are Some Challenges the Field Currently Faces?
“Speech recognition and the understanding of language is core to the future of search and information, but there are lots of hard problems such as understanding how a reference works, understanding what ‘he’, ‘she’ or ‘it’ refers to in a sentence. It’s not at all a trivial problem to solve in language and that’s just one of the millions of problems to solve in language.”
– Ben Gomes, Head of Search at Google
“With the rise of speech as a primary interface element, one of Mozilla’s main challenges is putting speech-to-text and text-to-speech into Firefox and voice assistants, and opening these technologies up for broader innovation. Speech has gone from being a “nice-to-have” browser feature to being “table stakes.” It is all but required.”
– Kelly Davis, Machine Learning Researcher at Mozilla
“The ability to recognize speech as well as humans do is a continuing challenge, since human speech, especially during spontaneous conversation, is extremely complex. It’s also difficult to define human performance since humans also vary in their ability to understand the speech of others. When we compare automatic recognition to human performance it’s extremely important to take both these things into account: the performance of the recognizer and the way human performance on the same speech is estimated.”
– Julia Hirschberg, Professor and Chair of the Department of Computer Science at Columbia University
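The comparison Hirschberg describes is usually quantified with word error rate (WER): word-level edit distance between a hypothesis and a reference transcript, normalized by reference length. The same metric can score a recognizer against ground truth, or one human transcriber against another, which is how "human performance" itself gets estimated. A minimal implementation:

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein distance over words, divided by reference length —
    the standard metric for comparing a recognizer (or a human
    transcriber) against a ground-truth transcript."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting i words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting j words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat",
                      "the cat sat on a mat"))  # one substitution -> 1/6
```

Hirschberg's caveat matters here: the human WER baseline on the same audio depends on which humans transcribe it and how disagreements are adjudicated, so a "human parity" claim is only as solid as that estimate.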
Our ASR models can be applied across multiple languages, accents, and other data points, constantly evolving and improving over time. Get in touch with our sales team today to request a demo of a solution customized to fit your enterprise ASR needs!