Questions to Ask When Searching for an ASR Vendor

Introduction

How should you choose which automatic speech recognition (ASR) solution to invest in? With so many options available, it can be a challenging decision. That’s why Rev.ai has prepared a guide that outlines the key questions to ask when deciding on an ASR vendor. 

By considering these questions, you’ll have a better idea of the strengths and weaknesses of each ASR product. How much value you place on each answer will depend on your own needs. So it’s also important to understand exactly what you need ASR for, and whether your products or the pain points you’re trying to solve come with any particular must-haves.

How much uptime can you ensure?

Uptime is one of the most important aspects of any ASR solution, so it follows that this should be the first question you ask of potential ASR vendors. Each ASR vendor will have a guaranteed percentage of uptime for their customers, as laid out in their service level agreement (SLA). 

Generally, higher uptimes come at higher costs. But if speech recognition is critical to your work, a high level of uptime is priceless. 

Rev.ai guarantees 99.9% uptime in its SLA, which corresponds to roughly 8 hours and 45 minutes of downtime per year.
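To compare SLAs from different vendors, it can help to convert each uptime percentage into the downtime it allows. The following is a minimal Python sketch, assuming a 365-day year; the function name is illustrative and not part of any vendor's tooling.

```python
def allowed_downtime_hours(uptime_percent: float, period_hours: float = 365 * 24) -> float:
    """Return the maximum downtime, in hours, permitted by an uptime guarantee.

    Assumes a 365-day year by default; pass a different period_hours
    to evaluate a monthly or quarterly SLA window instead.
    """
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for uptime in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(uptime)
        print(f"{uptime}% uptime allows about {hours:.2f} hours of downtime per year")
```

For 99.9% this works out to about 8.76 hours per year, and each additional nine in the guarantee cuts the allowance by a factor of ten.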

 

How fast is it?

There are two scenarios to think about when asking how quickly an ASR can turn speech into text. The first is real-time streaming applications, such as live-captioning meetings, lectures, or broadcasts. Here, the key measure of speed is latency: the delay between the time something is spoken and the time the corresponding text is returned to the user. You can expect a good ASR to have a latency of under one second. 
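If you trial a streaming product, you can check latency yourself by logging when each phrase was spoken and when its text came back. The sketch below is one possible way to summarize such a log in Python. The input format (pairs of timestamps in seconds) and the function name are assumptions made for illustration, not any vendor's API.

```python
import math
import statistics


def latency_summary(events):
    """Summarize streaming latency from (spoken_at, text_returned_at) pairs.

    Both timestamps are in seconds on the same clock. How you capture them
    depends on your own test harness; this function only does the arithmetic.
    """
    if not events:
        raise ValueError("events must not be empty")
    latencies = sorted(returned - spoken for spoken, returned in events)
    # Nearest-rank 95th percentile: the smallest value that at least
    # 95% of the measurements do not exceed.
    rank = max(1, math.ceil(0.95 * len(latencies)))
    return {
        "median": statistics.median(latencies),
        "p95": latencies[rank - 1],
        "max": latencies[-1],
    }


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    sample = [(0.0, 0.4), (1.0, 1.6), (2.0, 2.5), (3.0, 4.2)]
    print(latency_summary(sample))
```

Looking at a high percentile as well as the median is a deliberate choice here: an occasional multi-second delay is more noticeable in live captions than a slightly slower typical response.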

The second scenario is asynchronous applications, such as generating transcripts from recordings. The key measure here is overall turnaround time. This is usually less critical than latency, as you aren’t relying on your ASR in real time. However, turnaround time can still be an important consideration: waiting a long time for a transcript can be frustrating.

 

How accurate is it?

Accuracy is an important consideration when choosing an ASR vendor. Inaccurate speech recognition software can potentially require many time-consuming corrections.

Different ASR vendors use different benchmarking approaches to measure accuracy. The primary metric Rev uses, and suggests you use too, is the Word Error Rate (WER): the number of words the ASR got wrong, expressed as a percentage of the words in a ground-truth transcript. Errors include omitting a word that was spoken, inserting a word that was not, and substituting one word for another.
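To make the metric concrete, here is a minimal Python sketch of the standard WER calculation: the smallest number of substitutions, deletions, and insertions needed to turn the ASR output into the reference, divided by the number of reference words. It only lowercases and splits on whitespace, which is a simplifying assumption; production tools typically normalize text much more carefully, so treat this as an illustration rather than a replacement for them.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: (substitutions + deletions + insertions) / reference words.

    Uses word-level edit distance. Normalization here is deliberately naive
    (lowercase and whitespace split); punctuation is not stripped.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words against nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion: reference word missing from output
                curr[j - 1] + 1,                  # insertion: extra word in output
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    truth = "the quick brown fox jumps"
    output = "the quick brown dog jumps"
    print(f"WER: {word_error_rate(truth, output):.0%}")
```

Because insertions count as errors but do not add to the reference length, WER can exceed 100% on very poor output.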

Another consideration here is whether the ASR vendor includes verbatim words. These include filler words, false starts, and self-corrections. Verbatim words are verbal cues that provide helpful context and set the scene of a recording. Rev is currently the only ASR vendor that can offer fully verbatim transcripts. This is thanks to the millions of hours of Rev’s unique verbatim training data from professional transcriptionists.

Because ASR vendors report WER numbers from their own benchmarking processes, Rev recommends running some of your own audio through several ASR products. To compare the results, Rev offers a free, open-source command-line tool called FSTAlign. You provide FSTAlign with the output text file from an ASR along with a ground-truth transcript, and it reports back the WER. FSTAlign is available on GitHub.

When Rev ran a publicly available, 39-hour unedited long-form audio dataset called Earnings-21 through FSTAlign, they found they had the lowest WER value compared to Google and Amazon, among other competitors.

How quickly can I implement it?

There are two main components to consider when it comes to the implementation speed of an ASR. The first is the API and its documentation, and the second is how much fine-tuning the ASR machine learning model will require for your purposes.

Several factors affect how long it will take to achieve the first successful API call in production. Well-designed APIs with comprehensive and clear documentation can significantly shorten the development life cycle. Some APIs are ready to use straight away, whereas others will require a conversation with a representative from the vendor organization to get production keys. It is also worth finding out whether API error codes are clearly defined and which programming languages have software development kits (SDKs) available.

Implementation speed can also be greatly affected by how much adjustment the ASR machine learning model requires before it’s fully operational. Will you have to fully train a model from scratch or can you tweak a prebuilt model for your own use cases? How much of your own data will you need to provide before the model operates at suitable performance levels? Will you need data science expertise in your team?

 

Further questions to ask potential ASR vendors

Rev encourages you to get in touch with multiple vendors, ask for the details above, and compare the results. Other questions you may want to ask include:

  • How easy is it to customize? 
  • How does it perform across multiple demographics?
  • Is there support for multiple languages?
  • How well does it separate speakers?
  • Does it include automatic punctuation?
  • What level of developer support is offered? 
  • Who owns the data and where is it stored?
  • How much does it cost and what’s the contract length?
  • Can I try before I buy?
  • What does a true partnership look like?

Many of these considerations are nuanced, especially when it comes to culture. For more details, read Rev’s guide. 

Sian Page

Sian Williams Page is a freelance science and technology writer based in Edinburgh, UK.
