Speech-Recognition Software


Software Engineering; Applications; Algorithms


Speech-recognition software records, analyzes, and responds to human speech. The earliest such systems were used in speech-to-text programs. Speech recognition became commonplace in the 2010s through automated personal assistants. It depends on complex algorithms that analyze speech patterns and predict the most likely word from various possibilities.


Speech-recognition software consists of computer programs that can recognize and respond to human speech. Applications include speech-to-text software that translates speech into digital text for text messaging and document dictation. This software is also used by automated personal assistants such as Apple's Siri and Microsoft's Cortana, which can respond to spoken commands. Speech-recognition software development draws on the fields of linguistics, machine learning, and software engineering. Researchers first began investigating the possibility of speech-recognition software in the 1950s. However, the first such programs only became available to the public in the 1990s.

Speech-recognition software must perform a number of processes to convert the spoken word to the written word. A spoken sentence first goes through spectral analysis to identify its unique sound waves; the sound waves are then decoded into potential words; finally, the words are run through probability algorithms that estimate their likelihood based on the rules of grammar and pronunciation models. The end result is the most likely transcription of what was spoken.
EBSCO illustration.
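The three stages in the caption above can be sketched, very loosely, in Python. Every function here is a hypothetical stub standing in for real signal-processing, acoustic-modeling, and language-modeling components; the names, placeholder features, and hard-coded candidate words are all invented for illustration.

```python
# A high-level sketch of the three-stage pipeline: spectral analysis,
# decoding into candidate words, and probabilistic rescoring.

def spectral_analysis(audio):
    """Stage 1: turn raw audio samples into acoustic features.
    (Placeholder: real systems compute spectral features, not magnitudes.)"""
    return [abs(sample) for sample in audio]

def decode_candidates(features):
    """Stage 2: map acoustic features to candidate words at each position.
    (Placeholder: the candidates below are hard-coded, not decoded.)"""
    return [["wreck", "recognize"], ["a", "uh"], ["nice", "speech"]]

def rescore(candidates):
    """Stage 3: pick the candidate a language model scores highest.
    (Placeholder: simply takes the first option at each position.)"""
    return [options[0] for options in candidates]

transcript = rescore(decode_candidates(spectral_analysis([0.1, -0.2])))
print(" ".join(transcript))  # prints "wreck a nice"
```

The value of the pipeline structure is that each stage can be improved independently: better acoustic features, a better decoder, or a better language model each raise overall accuracy.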

Some speech-recognition software uses training, in which the speaker first reads text or a list of vocabulary words to help the program learn particularities of their voice. Training increases accuracy and decreases the error rate. Software that requires training is described as speaker dependent. Speaker-independent software does not require training, but it may be less accurate. Speaker-adaptive systems can alter some operations in response to new users.


Research into speech-recognition software began in the 1950s. The first functional speech-recognition programs were developed in the 1960s and 1970s. An early innovation in speech-recognition technology was the development of dynamic time warping (DTW), an algorithm that can align and compare two auditory sequences that occur at different rates.
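The idea behind DTW can be shown in a minimal sketch. This pure-Python function and its variable names are illustrative, not taken from any library; it uses the standard cumulative-cost recurrence to compare two numeric sequences of different lengths.

```python
def dtw_distance(a, b):
    """Return the dynamic-time-warping distance between two sequences.

    DTW stretches or compresses the time axis so that two utterances
    spoken at different rates can still be compared point by point.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance between samples
            # extend the cheapest of the three allowed alignment moves
            cost[i][j] = d + min(cost[i - 1][j],      # a advances
                                 cost[i][j - 1],      # b advances
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# The same "utterance" at two speaking rates still aligns perfectly:
slow = [0, 0, 1, 1, 2, 2, 3, 3]
fast = [0, 1, 2, 3]
print(dtw_distance(slow, fast))  # prints 0.0
```

Because the warping path may advance either sequence independently, the slow and fast renditions above align with zero cost even though one is twice as long as the other.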

Speech recognition advanced rapidly with the invention of the hidden Markov model (HMM). The HMM is an algorithm that evaluates a series of potential outcomes to a problem and estimates the probability of each one. Given the options suggested by a speaker's phonemes, it determines the “most likely explanation” of the phoneme sequence, and thus the most likely word. Together, HMMs and DTW are used to predict the most likely word or words intended by an utterance.
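Finding that "most likely explanation" is typically done with the Viterbi algorithm over the HMM. The toy decoder below is a standard textbook Viterbi implementation; the phoneme labels ("AH", "IY"), the observation symbols, and all probabilities are invented for illustration.

```python
def viterbi(states, start_p, trans_p, emit_p, observations):
    """Return (probability, path) of the most likely hidden-state sequence."""
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p] * trans_p[p][s] * emit_p[s][observations[t]], p)
                for p in states)
            V[t][s] = prob
            back[t][s] = prev
    # Trace the best final state back to the start.
    prob, last = max((V[-1][s], s) for s in states)
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return prob, path

# Two hidden phonemes and a crude acoustic cue ("low" vs. "high" pitch):
states = ("AH", "IY")
start_p = {"AH": 0.5, "IY": 0.5}
trans_p = {"AH": {"AH": 0.7, "IY": 0.3}, "IY": {"AH": 0.3, "IY": 0.7}}
emit_p = {"AH": {"low": 0.9, "high": 0.1}, "IY": {"low": 0.1, "high": 0.9}}

prob, path = viterbi(states, start_p, trans_p, emit_p, ["low", "low", "high"])
print(path)  # prints ['AH', 'AH', 'IY']
```

The decoder weighs both how well each phoneme explains the acoustic evidence (emission probabilities) and how plausible each phoneme-to-phoneme transition is, which is exactly the balancing act the HMM formalizes.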

Speech recognition is based on predictive analysis. An important part of developing a predictive algorithm is feature engineering, the process of teaching a computer to recognize features, or the characteristics relevant to solving a problem. In raw form, speech features appear as waveforms: two-dimensional representations of the sonic signals produced when phonemes are spoken.
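A very simple hand-engineered feature is short-time energy: split the waveform into frames and measure how loud each frame is, which separates silence from voiced speech. The function below is a minimal sketch with invented names and a synthetic waveform, not a real front end.

```python
def frame_energies(signal, frame_size=4):
    """Split a raw waveform into fixed-size frames and compute each
    frame's energy (sum of squared samples), a basic engineered feature."""
    energies = []
    for start in range(0, len(signal) - frame_size + 1, frame_size):
        frame = signal[start:start + frame_size]
        energies.append(sum(x * x for x in frame))
    return energies

# A synthetic waveform: near-silence, then a louder voiced segment.
wave = [0.0, 0.1, 0.0, -0.1, 0.8, -0.9, 0.7, -0.8]
print(frame_energies(wave))  # first frame tiny (~0.02), second large (~2.58)
```

Real systems compute richer spectral features per frame, but the framing step shown here is the same.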

An emerging technique in speech recognition is the use of neural networks, computing systems designed to mimic the way the brain handles computations. Though only beginning to affect speech recognition, neural networks are being combined with deep-learning algorithms, which work from raw features, to analyze data.
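The basic unit such networks stack into layers is a single artificial neuron: a weighted sum of inputs passed through a nonlinear activation. This sketch, with invented weights, shows just that one unit, not a trained speech model.

```python
import math

def neuron(inputs, weights, bias):
    """One artificial neuron: a weighted sum of inputs plus a bias,
    squashed through a sigmoid activation into the range (0, 1)."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))

# Two acoustic-feature inputs with hand-picked (untrained) weights:
out = neuron([0.5, 0.2], [1.0, -1.0], 0.0)
print(out)  # sigmoid(0.3), roughly 0.574
```

A deep network chains many layers of these units, and training adjusts the weights and biases so the final layer's outputs score candidate phonemes or words.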


Experts predict that speech-recognition apps and devices will likely become ubiquitous. Fast, accented, or impeded speech and slang words pose much less of a challenge than they once did. Speech-recognition software has become a basic feature in many new versions of the Mac and Windows operating systems. These programs also help make digital technology more accessible for people with disabilities. In the future, as voice recognition improves and becomes commonplace, a wider range of users will be able to take advantage of advanced computing features.

—Micah L. Issitt

Gallagher, Sean. “Cortana for All: Microsoft's Plan to Put Voice Recognition behind Anything.” Ars Technica. Condé Nast, 15 May 2015. Web. 21 Mar. 2016.

“How Speech-Recognition Software Got So Good.” Economist. Economist Newspaper, 22 Apr. 2014. Web. 21 Mar. 2016.

Information Resources Management Association, ed. Assistive Technologies: Concepts, Methodologies, Tools, and Applications. Vol. 1. Hershey: Information Science Reference, 2014. Print.

Kay, Roger. “Behind Apple's Siri Lies Nuance's Speech Recognition.” Forbes. Forbes.com, 24 Mar. 2014. Web. 21 Mar. 2016.

Manjoo, Farhad. “Now You're Talking!” Slate. Slate Group, 6 Apr. 2011. Web. 21 Mar. 2016.

McMillan, Robert. “Siri Will Soon Understand You a Whole Lot Better.” Wired. Condé Nast, 30 June 2014. Web. 21 Mar. 2016.

Pinola, Melanie. “Speech Recognition through the Decades: How We Ended Up with Siri.” PCWorld. IDG Consumer & SMB, 2 Nov. 2011. Web. 21 Mar. 2016.