How to Compare Speech Engine Accuracy

by Emily Nave


Speech technology is a game of ‘what is most likely to have been said here,’ and the winner is the speech engine that predicts the result most accurately. There are two main approaches to building speech recognition software: phonetic-based engines and text-based (fixed vocabulary) engines. The first thing to determine when comparing the accuracy of speech engines is the type you are assessing. Phonetic-based engines are built with a smaller grammar set and use phonemes as the basis for recognition and search, while fixed-vocabulary engines are built with a larger, predefined vocabulary. To learn more about the difference between these two approaches, check out our previous blog: Phonetic vs. Fixed Vocabulary Speech Technology.

Generally, phonetic-based engines tend to be used only in very niche use cases, while text-based, large-vocabulary engines tend to be more advanced. Every use case is different, but keep in mind the features that will matter most to you when starting this process.


‘WER’ (Word Error Rate) and ‘Word Accuracy’ are the best measurements to use when comparing accuracy. Both are typically expressed as percentages and are derived by comparing a reference transcript with the ASR transcript (or hypothesis) for the audio. The underlying algorithm is the Levenshtein distance: the reference is aligned with the hypothesis, and the words are counted as insertions, deletions, substitutions, or correct. In short, you use this method to compare a machine transcript from each speech engine against a perfect human transcription of the same file.
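As an illustration, here is a minimal Python sketch of that calculation: a word-level Levenshtein distance divided by the number of reference words. The example sentences are invented for demonstration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution or match
            dp[i][j] = min(dp[i - 1][j] + 1,              # deletion
                           dp[i][j - 1] + 1,              # insertion
                           dp[i - 1][j - 1] + cost)       # substitution / correct
    return dp[len(ref)][len(hyp)] / len(ref)

# One deletion ("me") and one insertion ("the") over 5 reference words -> WER 0.40
print(wer("please transfer me to billing", "please transfer to the billing"))
```

Word Accuracy is then simply 1 − WER, so a WER of 0.40 corresponds to 60% Word Accuracy.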

For keyword spotting accuracy, which is important to measure since that is what many people use transcription for, you should be using precision and recall. These are standard measures used in information retrieval science. Recall is the % of the words you are looking for that were found (so 80% means 8 out of 10 were found and 2 were missed). Precision is the % of the hits found that were actually valid (so 90% means 9 out of 10 results in the list were true and one was a false positive). This is important to measure in addition to WER and Word Accuracy, as the most important words to transcribe correctly are the terms you need to spot or search for. If a speech engine cannot recognize Xfinity or Comcast and those are important terms for your use case, the other accuracy measures are irrelevant.
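A minimal sketch of those two measures, counting keyword occurrences in a reference transcript versus a hypothesis transcript (the keyword set and example sentences are invented; real keyword spotting would also account for word position):

```python
from collections import Counter

def keyword_precision_recall(reference: str, hypothesis: str, keywords: set):
    """Precision and recall for keyword spotting, counted per occurrence."""
    ref_hits = Counter(w for w in reference.lower().split() if w in keywords)
    hyp_hits = Counter(w for w in hypothesis.lower().split() if w in keywords)
    true_pos = sum(min(ref_hits[w], hyp_hits[w]) for w in keywords)
    found = sum(hyp_hits.values())    # everything the engine claimed to hear
    actual = sum(ref_hits.values())   # everything that was really said
    precision = true_pos / found if found else 1.0
    recall = true_pos / actual if actual else 1.0
    return precision, recall

# "xfinity" was misrecognized as "infinity" (a miss) and "comcast" was
# hallucinated twice (one hit, one false positive)
p, r = keyword_precision_recall(
    "i want to cancel my xfinity and comcast accounts",
    "i want to cancel my infinity and comcast comcast accounts",
    {"xfinity", "comcast"},
)
print(p, r)   # 0.5 0.5
```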


Optimize Your Transcripts

When comparing transcripts, there is some pre-processing you can do of the text in both the reference transcript and the hypothesis transcript to make them easier to compare. For example, converting everything to lowercase and removing speaker turns and punctuation can help the raw accuracy comparison, especially when the results are very close. The accuracy of the reference transcripts becomes more of a factor as accuracy levels increase; at low accuracy levels these errors are small enough to get lost in the noise.

There can also be issues with word forms that are small factors when accuracy levels are low but become more of an issue when comparing higher accuracy levels.  For example:

  • Number formats (10 or ten)
  • Acronym formats (ATT or AT&T)
  • Word forms/spellings (voicebase.com or voicebase dot com)

The best thing to do is to identify all of these possible terms in your recording (or reference transcript) and do a search and replace on all of the identified terms to make them a uniform format.
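A simple normalization pass like the one above can be sketched in a few lines of Python. The term map below is a hypothetical starting point; you would extend it with the terms identified in your own recordings. Note that plain substring replacement is naive (e.g. "10" inside "100" would also be rewritten), so treat this as a sketch rather than a production normalizer.

```python
import re

# Map of variant spellings to one canonical form; extend with your own terms.
CANONICAL = {
    "at&t": "att",
    "10": "ten",
    "voicebase dot com": "voicebase.com",
}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and unify term spellings before scoring."""
    text = text.lower()
    text = re.sub(r"[^\w\s&.]", " ", text)   # drop punctuation, keep & and .
    for variant, canonical in CANONICAL.items():
        text = text.replace(variant, canonical)
    return " ".join(text.split())            # collapse runs of whitespace

print(normalize("Call AT&T at 10!"))   # call att at ten
```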


Once you’ve gotten past those hurdles and you know what to look for, you’re ready to get started testing with these 6 steps:

Step 1: Identify The Right Recordings

Find a set of recordings that are representative of the audio you will be working with. Be sure this content has all of the unique terms and numbers that you will need to spot in order to get the best comparison for your business case. This data set should be the best representation of what your real calls sound like.

What to include in sample calls:

  • Account Numbers
  • Phone Number
  • PCI data / SSN / Address
  • Acronyms
  • Brand Names


Step 2: Don’t Compress Your Audio

For the best results, do not compress, up-sample, or down-sample your audio. Compression will dilute the accuracy of each engine and give you poor results. The higher the data rate and the higher the sampling frequency, the better the results. For example, recordings under 8 kHz tend to yield much poorer results. We recommend telephony and recording parameters be set to 16 kHz if possible for best results.

Learn more about audio file compression and codec types here.
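Before running a test, it can be worth verifying the sample rate of each file. A small sketch using Python's standard-library `wave` module (WAV files only; the 16 kHz threshold follows the recommendation above):

```python
import wave

def check_sample_rate(path: str, minimum_hz: int = 16000) -> bool:
    """Warn if a WAV file is below the recommended sample rate for testing."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
    if rate < minimum_hz:
        print(f"{path}: {rate} Hz is below {minimum_hz} Hz; expect degraded accuracy")
        return False
    return True
```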

Step 3: Human Transcripts To Compare

For each test file, you’ll need plain text reference transcripts. Note that this is different from human tagging or scoring. These will need to be fully readable transcriptions, not just check marks of what was said. There are many vendor options for this service, such as Call Criteria.

Step 4: Individual Machine Transcripts

For all ASR (Automatic Speech Recognition) engines under test, you’ll need to obtain plain text transcripts for each test file. Basically you need to run each file through every speech engine you’re testing and download a plain .TXT file of the results.

Step 5: Run Test Comparisons

This can be done using SCLITE, a public-domain scoring tool from NIST. SCLITE is part of the NIST Speech Recognition Scoring Toolkit (SCTK). If you do not have access to that software, VoiceBase sales engineers can run your speech results from different vendors through our assessment systems and provide you with the results.

Step 6: Review Results

Compare the pros and cons of the data points we outlined earlier: WER, Word Accuracy, speed of results, and cost, to determine which speech engine best fits the needs of your content.

Here are some other features you may want to compare as well:

  • Redaction (The ability to remove sensitive data such as PCI, PII, SSN)
  • Custom Vocabulary (The ability to add acronyms, proper nouns and names to a unique dictionary on the fly)
  • Auto Call Classification/Disposition (The ability to spot events in a recording such as a hot lead, upset customer, appointment made or an agent that needs training).
  • Ability to Query – Can you create custom queries to manipulate the data?
  • Number formatting (phone numbers, addresses, zip codes, SSN, etc)


Many businesses look for transcription and speech to text in order to unearth something else in their recordings: angry customers, appointments made, rude agents, etc. Transcription is a means to an end, a means to find the word, phrase or event you’re really interested in. If this is the case, instead of measuring accuracy, measure how well the speech technology can spot the important events in your spoken content, such as ‘customers about to cancel’ or ‘hot leads’. After all, it doesn’t matter how good a transcript is when what you really care about spotting are events that are difficult to find in any transcript. Curious how this works? Here’s a quick video describing Predictive Insights.

Powerful insights occur when you are able to surface the WHYS behind keywords and phone call events. From there, you can start to understand and predict customer behavior, optimize your processes, and make better business decisions.

Interested in learning more about VoiceBase’s speech recognition and speech analytics solutions? Contact us here to arrange a demo call today, or visit our website for more information.
