Why Annotated Data is So Important to Machine Learning

TranscribeMe creates structured data sets for customers to use to create or enhance machine learning models.

Before getting to case studies illustrating this work, some terms need to be either defined or clarified, i.e., “structured data” and “AI.”

I consider AI to be a misnomer. Intelligence is intelligence; excluding all other flora and fauna, it divides into human or machine. So for me, there’s nothing artificial about an intelligent machine. It’s simply not human.

Learning Through Structured Data

Consider how humans learn. A newborn is pretty much helpless, but from birth it packs an enormously powerful and complex brain that, from day one, is collecting, integrating, and assimilating environmental data, including speech. Without speech, the child is in stealth mode, but the right brain is hyper-engaged in an activity that data scientists would call unsupervised learning.

As the child grows, structured data is introduced in the form of books. Initially, a parent may read to the child and point out elements in the story. For example, while reading “Goodnight Moon,” the parent might say, “Moon,” then point to its picture, tying the word to a visual. That is data annotation!

As children continue to learn, the enormous capacity of the brain to log, store, and collate data comes into play and the children become, for the most part, autonomous learners.

A newborn machine has neither a right brain, nor the nearly unlimited data capacity of a human brain to begin learning and storing data. It’s estimated that a human brain can store 2.5 petabytes of information. That would be equivalent to a DVR recording continuously for 300 years!

A newborn machine begins its quest for intelligence at the Goodnight Moon stage where a pairing takes place: an audio recording of the word “moon” with the written word, or an image of the moon with an audio recording of the word.

As is the case with the child learner, this is data annotation.
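To make the pairing concrete, here is a minimal sketch of what one annotated example might look like in code. The class name, field names, and file paths are hypothetical, chosen only for illustration; they are not TranscribeMe's actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedExample:
    """One raw item paired with the label a human assigned to it."""
    media_path: str  # hypothetical file location, for illustration only
    media_type: str  # "audio" or "image"
    label: str       # the written word tied to the recording or picture


# The "Goodnight Moon" stage: the same word paired with two kinds of media.
examples = [
    AnnotatedExample("recordings/moon.wav", "audio", "moon"),
    AnnotatedExample("pictures/moon.png", "image", "moon"),
]


def labels_by_media_type(items):
    """Group the labels by the kind of media they were attached to."""
    grouped = {}
    for item in items:
        grouped.setdefault(item.media_type, []).append(item.label)
    return grouped


print(labels_by_media_type(examples))
```

Each record is nothing more than raw data plus the human-supplied label, which is the whole idea behind annotation.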

An example of structured data could be, let’s say, a complex set of data defining all North American songbirds to the exclusion of all else. This would produce an intelligent machine that could identify every single songbird on the continent. But it couldn’t tell us a thing about butterflies! And there would be nothing in its database or algorithmic logic to take it from songbird to butterfly.

A new set of structured data must be created and assimilated for every new thing we want our machine to learn. It’s always been this way from the beginning of time, machine learning time, that is.

Here’s a quote from the Wikipedia article “Expert system”: “In the late 1950s… biomedical researchers started creating computer-aided systems for diagnostic applications in medicine and biology. These early diagnostic systems used patients’ symptoms and laboratory test results as inputs to generate a diagnostic outcome.” Even for the first machines, data annotation was required.

From the 1950s until now, all machine learning has required data annotation to build the structured data sets that create or enhance models. There have been many claims of unsupervised learning, but in the cases we’ve seen, those claims have not held up. The machines have become more sophisticated at collecting data, but overall a machine still needs to be trained for a specific use.

Use Cases for Annotated Data

Every day, AI and machine learning technologies deliver astounding accomplishments that benefit a broad spectrum of fields and people around the world, in areas such as software development, cybersecurity, medicine, engineering, customer service, finance, manufacturing, and more.

But scientists, technologists, and huge industries are not the only ones reaping the benefits of machine learning. Small businesses and individuals alike are beginning to understand that data collection and analysis are now the norm, so it is no wonder that AI and machine learning are among the fastest growing technologies globally.

The data behind these technologies includes audio, images, videos, podcasts, and more. Simply put, data is labeled to make it comprehensible to AIs. The key is the accuracy of the data sets. Their quantity is also very important, because more data brings more variety in verbiage and context.

This is where TranscribeMe comes in. We have been asked to provide annotated data for a variety of use cases. And we have teams that are specially trained to label and process data appropriately for any given project. Here are just a few examples:

Medical Services

Topic: Medical Emergency Screening
Form of Data Annotation: Audio
Process: Annotators listen to recordings of agonal breathing and mark the beginning and end of each agonal breath on the audio waveform.
Purpose: To teach the provider’s automated system to screen patient calls for agonal breathing in order to identify callers who are experiencing a heart attack or stroke.
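
As a rough illustration of what marking beginnings and ends can produce, the sketch below stores each marked stretch of audio as a start time, an end time, and a label. The names, the "agonal_breath" label, and the timestamps are all made up for this example and do not come from the actual project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start_s: float  # where the annotator marked the beginning, in seconds
    end_s: float    # where the annotator marked the end, in seconds
    label: str      # hypothetical label name


def validate(segments, clip_length_s):
    """Return the segments sorted by start time.

    Raises ValueError if a segment is empty, runs past the clip,
    or overlaps the segment before it.
    """
    ordered = sorted(segments, key=lambda s: s.start_s)
    previous_end = 0.0
    for seg in ordered:
        if not (0.0 <= seg.start_s < seg.end_s <= clip_length_s):
            raise ValueError(f"segment out of bounds: {seg}")
        if seg.start_s < previous_end:
            raise ValueError(f"segment overlaps the previous one: {seg}")
        previous_end = seg.end_s
    return ordered


def labeled_seconds(segments, label):
    """Total duration, in seconds, of all segments carrying the label."""
    return sum(s.end_s - s.start_s for s in segments if s.label == label)


# Invented timestamps for a 10-second clip.
marks = [
    Segment(4.0, 5.5, "agonal_breath"),
    Segment(1.0, 2.5, "agonal_breath"),
]
clean = validate(marks, clip_length_s=10.0)
print(clean[0].start_s, labeled_seconds(clean, "agonal_breath"))
```

Checks like these catch slips such as an end mark placed before a start mark before the data reaches a model.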

Fast Food Industry

Topic: Accuracy of Automated Orders
Form of Data Annotation: Audio/text
Process: Customers’ drive-thru orders are transcribed.
Purpose: To train the restaurant’s automated system to recognize menu items in drive-thru orders regardless of customers’ accents and despite high levels of surrounding noise.

Telephony Company

Topic: Customer Service Analysis
Form of Data Annotation: Text
Process: Specific labels are used to tag words or phrases in pre-transcribed customer service conversations.
Purpose: To build custom speech models for call center use cases by identifying customer sentiment, logging why customers call, as well as how the calls end, and by qualifying the agents’ responses.
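
Tagging words or phrases in a transcript can be sketched as character spans plus a label. The example sentence, the label names, and the helper functions below are invented for illustration; a real project would follow the client's own label set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # index of the first tagged character
    end: int    # index one past the last tagged character
    label: str  # hypothetical label name


def tag_phrase(text, phrase, label):
    """Tag the first occurrence of phrase in text (exact, case-sensitive)."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"phrase not found: {phrase!r}")
    return Span(start, start + len(phrase), label)


def tagged_text(text, spans):
    """Pair each label with the exact text it covers."""
    return [(text[s.start:s.end], s.label) for s in spans]


# An invented line from a pre-transcribed customer service call.
transcript = "I was charged twice and I am really frustrated."
spans = [
    tag_phrase(transcript, "charged twice", "call_reason:billing"),
    tag_phrase(transcript, "really frustrated", "sentiment:negative"),
]
print(tagged_text(transcript, spans))
```

Storing offsets rather than copies of the words keeps every tag tied to one exact place in the conversation, even when a phrase appears more than once.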

Court Stenography Company

Topic: Annotation via Keywords
Form of Data Annotation: Keyword spotting
Process: Words and phrases from notices of depositions are tagged according to keywords per the clients’ instructions.
Purpose: To compile data sets from deposition notices using keywords that identify plaintiffs, defendants, witnesses, attorneys, deposition location, date, time, and other similar information.

Self-Driving Vehicle Manufacturer

Topic: Passenger Safety
Form of Data Annotation: Image tagging
Process: Annotators use special software to draw a shape around specific objects in photos and videos.
Purpose: Tagged images are used to teach self-driving vehicles to avoid obstacles in the road such as potholes, cracks, water, etc.
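
For a sense of what a drawn shape becomes as data, here is a minimal sketch that assumes the shape is a simple rectangle in pixel coordinates. Real annotation tools also support polygons and other shapes. The overlap score at the end (intersection over union) is one common way to compare two boxes drawn around the same object; it is our illustrative addition, and all names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: int
    top: int
    right: int
    bottom: int
    label: str  # hypothetical label, e.g. "pothole"

    def area(self):
        return (self.right - self.left) * (self.bottom - self.top)


def fits_image(box, width, height):
    """True if the box is non-empty and lies entirely inside the image."""
    return 0 <= box.left < box.right <= width and 0 <= box.top < box.bottom <= height


def overlap_score(a, b):
    """Intersection over union: 0.0 for no overlap, 1.0 for identical boxes."""
    inter_w = min(a.right, b.right) - max(a.left, b.left)
    inter_h = min(a.bottom, b.bottom) - max(a.top, b.top)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    intersection = inter_w * inter_h
    return intersection / (a.area() + b.area() - intersection)


# Two annotators outline the same pothole in a 100 x 100 pixel image.
first = Box(10, 10, 50, 50, "pothole")
second = Box(30, 30, 70, 70, "pothole")
print(fits_image(first, 100, 100), round(overlap_score(first, second), 3))
```

A low overlap score between two annotators is a cue to review that image before it goes into a training set.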

We Train ASRs

As technology advances and as more general transcribed audio becomes available on the net, ASR systems can scrape this data and self-train to a degree. We’re currently working with a company that is actively doing this and has produced very good results, but not great results. Consequently, they have come to us to acquire what is considered the gold standard in training data: human-transcribed and annotated audio-to-text data. That human factor is what it takes to make a good ASR a much better ASR.

Ledley RS, Lusted LB (1959). “Reasoning foundations of medical diagnosis”. Science. 130 (3366): 9–21. Bibcode:1959Sci...130....9L. doi:10.1126/science.130.3366.9. PMID 13668531

Weiss SM, Kulikowski CA, Amarel S, Safir A (1978). “A model-based method for computer-aided medical decision-making”. Artificial Intelligence. 11 (1–2): 145–172. doi:10.1016/0004-3702(78)90015-2
