Speech recognition, whereby a computer listens to and interacts with a person, is spreading into applications far beyond those seen in sci-fi movies. In fact, it is becoming commonplace to encounter silicon speech personalities when we deal with businesses, or when we choose to buy business and consumer products that contain the technology.
What is all this talk about?
Automatic Speech Recognition (ASR), also referred to simply as speech recognition, is the technology that allows a machine to understand human speech. The technology takes human speech input, digitises it, and converts it into a machine-readable string. A technology component called a recogniser then processes this string to identify what the speaker said. This requires a thorough linguistic understanding of speech combined with statistical analysis, plus a healthy dose of electrical engineering and digital signal processing. With more than three decades of public and private R&D behind us, commercially viable speech recognition solutions are here.
Early researchers found that there were numerous needs requiring command and control of devices and applications in a hands-free environment, either out of safety concerns or because of physical limitations of the user. Potential applications include using speech recognition to control a computer, dictate words and paragraphs, or control an action such as dropping fire retardant from a helicopter when your hands are busy with flight control. The resulting development of speech technologies has evolved into three main areas: PC software, such as dictation or command and control applications; over-the-phone applications; and embedded systems applications.
How does it work?
Without any refinements to the technology, a basic speech recognition interaction with a caller develops as follows:
Caller input
A caller states a phrase or sentence using an input device such as a telephone, which is captured in the form of an acoustic signal.
Digitisation
The system converts the words from an analogue signal into a digital signal it can process, in a form that closely approximates the way the human ear responds to sound.
Phonetic breakdown
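As a rough illustration of the analogue-to-digital step, the sketch below samples a pure tone (standing in for a caller's voice) and quantises each sample to a signed 16-bit integer. The 8 kHz sampling rate, the 440 Hz tone and the function names `digitise` and `tone` are assumptions chosen for illustration, not a description of any particular recogniser.

```python
import math

SAMPLE_RATE_HZ = 8000   # assumed: a common rate for telephone-quality audio
MAX_INT16 = 32767       # largest value of a signed 16-bit sample


def digitise(signal, duration_s, sample_rate_hz=SAMPLE_RATE_HZ):
    """Sample a continuous signal and quantise it to 16-bit integers.

    `signal` is a function of time (in seconds) returning an amplitude
    between -1.0 and 1.0.
    """
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                      # time of the n-th sample
        amplitude = max(-1.0, min(1.0, signal(t)))  # clip to the valid range
        samples.append(round(amplitude * MAX_INT16))
    return samples


def tone(t, freq_hz=440.0):
    """Hypothetical stand-in for the analogue input: a pure tone."""
    return math.sin(2 * math.pi * freq_hz * t)


pcm = digitise(tone, duration_s=0.01)
print(len(pcm), pcm[:5])
```

Ten milliseconds of audio at 8 kHz yields 80 integer samples, which is the kind of raw material the later steps work from.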
The speech recognition software breaks the digitally converted words down into the basic components of speech.
Statistical modelling
The system then tries to match these sounds to its phonetic representations.
Matching
The speech recognition application tries to map the possible phonetic representations to words or phrases defined in the grammar of that application.
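The matching step can be pictured as a nearest-neighbour search: given the sequence of sounds the recogniser believes it heard, find the grammar entry whose pronunciation is closest. The sketch below uses edit distance for that comparison. The grammar, the phonetic spellings and the choice of edit distance are all illustrative assumptions; commercial recognisers rely on statistical models rather than this simple comparison.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical grammar: the only words this application accepts,
# each with an assumed phonetic spelling.
GRAMMAR = {
    "sales":   ["s", "ey", "l", "z"],
    "support": ["s", "ah", "p", "ao", "r", "t"],
    "billing": ["b", "ih", "l", "ih", "ng"],
}


def match(heard):
    """Return (word, distance) for the grammar entry closest to `heard`."""
    return min(((word, edit_distance(heard, phones))
                for word, phones in GRAMMAR.items()),
               key=lambda pair: pair[1])


# A slightly misheard "sales" still maps to the right grammar entry.
print(match(["s", "ey", "l", "s"]))  # ('sales', 1)
```

Restricting the search to words defined in the grammar is what keeps the problem tractable: the application only has to choose among the responses it expects.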
To make speech recognition work, the software has to take into account the acoustics of the words being spoken, the vocabulary being used, and the language model. The language model contains all the statistical information about usage of the vocabulary so that the recogniser can make a reasonable guess at what is being said. Recognisers can be either hardware- or software-based and various methodologies exist to transform the digitised text string into words.
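One common way to combine these sources of knowledge is to add an acoustic score to a language-model score in the log domain and keep the candidate with the highest total. The sketch below does this for three words that sound alike. All of the probabilities are invented for illustration, and `best_word` is a hypothetical helper rather than part of any real recogniser.

```python
import math

# Assumed acoustic scores: the three candidates sound almost identical,
# so the audio alone barely separates them.
ACOUSTIC_PROB = {"by": 0.34, "buy": 0.33, "bye": 0.33}

# Assumed language-model probabilities for the word that follows
# "I would like to" in this application's vocabulary.
LANGUAGE_MODEL_PROB = {"by": 0.05, "buy": 0.90, "bye": 0.05}


def best_word(acoustic, language_model):
    """Pick the word with the highest combined log-probability."""
    def score(word):
        return math.log(acoustic[word]) + math.log(language_model[word])
    return max(acoustic, key=score)


print(best_word(ACOUSTIC_PROB, LANGUAGE_MODEL_PROB))  # buy
```

On acoustics alone "by" would narrowly win; the usage statistics in the language model tip the decision to "buy", which is the reasonable guess in context.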
Components of speech recognition
In a lab environment or over the phone in ideal conditions it is easy to make a simple call flow work. However, under normal conditions there are many factors that can make it difficult for the recogniser to be accurate. The type of input device, ambient noise, differences in caller accent, tone and gender, as well as regional differences in terminology are all examples of challenges that the speech developer faces. Further, the ultimate goal for speech recognition is to bring a human touch to the interaction by creating a natural dialogue with the caller.
Accuracy
The accuracy rate refers to the percentage of time that a recogniser will accurately recognise what a caller said.
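Taken literally, this definition is a simple percentage. The sketch below computes it from a list of test calls; the function name `accuracy_rate` and the sample data are hypothetical.

```python
def accuracy_rate(results):
    """Percentage of utterances the recogniser got right.

    `results` is a list of (what the caller said, what was recognised) pairs.
    """
    if not results:
        raise ValueError("need at least one utterance to compute accuracy")
    correct = sum(1 for said, recognised in results if said == recognised)
    return 100.0 * correct / len(results)


# Hypothetical test calls: one of the four names was misrecognised.
test_calls = [
    ("smith", "smith"),
    ("jones", "jones"),
    ("smith", "smythe"),
    ("brown", "brown"),
]
print(accuracy_rate(test_calls))  # 75.0
```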
The way in which an application is designed to handle error conditions is critical to bolstering accuracy rates, and there are a number of ways to do it. For example, with N-Best recognition, when the system doesn't understand the utterance, it searches for alternatives and presents them back to the caller. In response to ambiguous input a caller might hear, "I did not quite understand what you said. Did you say 'Smith'?" Alternatively, the application might reply with, "I did not understand what you just said. Could you please repeat?" or simply repeat the question a second time.
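The error-handling choices described above can be sketched as a small decision function over an N-best list. The two thresholds, the confidence scores and the "Connecting you to" wording are assumptions made for illustration; real applications tune such values and prompts for their own callers.

```python
CONFIRM_THRESHOLD = 0.80  # assumed: at or above this, accept without asking
REJECT_THRESHOLD = 0.30   # assumed: below this, ask the caller to repeat


def next_prompt(n_best):
    """Decide what to say next from an N-best list.

    `n_best` is a list of (hypothesis, confidence) pairs sorted from most
    to least confident; an empty list means nothing was recognised.
    """
    if not n_best or n_best[0][1] < REJECT_THRESHOLD:
        return "I did not understand what you just said. Could you please repeat?"
    best, confidence = n_best[0]
    if confidence >= CONFIRM_THRESHOLD:
        return f"Connecting you to {best}."
    # Ambiguous input: present the leading alternative back to the caller.
    return f"I did not quite understand what you said. Did you say '{best}'?"


print(next_prompt([("Smith", 0.55), ("Smythe", 0.40)]))
```

If the caller answers "no", the application could offer the next entry in the list ("Smythe") before falling back to asking the question again.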
Usability and user-friendliness
As important as accuracy is the usability and user-friendliness of the application. Obviously, the system has to be accurate or users won't use it; however, there are many other things that developers have created to enhance user-friendliness of these applications. These include the following:
User interface design
The most critical component in the success of the application is the user interface, known as call flow or dialogue. The beauty of applying speech recognition to an application is to transform the user experience from automated input to something more closely resembling that of interaction with a human. Therefore, if the application is poorly designed, the system has failed.
Setting a voice and style for the application is critical for its success. The design must encourage interaction with the caller to obtain the proper response while being socially appropriate for the type of caller using it. There are three types of dialogue available:
Customised speech applications
Although many turnkey speech recognition applications are being sold, others require varying degrees of development and customisation. This customisation ranges from adding names to a list to designing telephone or even web-based applications that use speech. In addition, the industry as a whole is creating standards for such things as how speech should be incorporated into an application and how to speech-enable web development so that applications can access information from intranets and the internet. At the same time, just as the technology itself has been refined, so have the tools used to develop an application. From the developer's perspective, there have been great improvements in the ease with which a speech recognition application can be developed and deployed. This is particularly true for speech-enabled call centre and IVR applications because of the parallel development of GUI development tools being used in those industries. On the voice processing side, the process of creating and updating directory entries has been simplified as well.
The market drivers for speech
The spoken word is the most natural user interface in the world. Most of us use it all the time. Speech technologies have grown beyond the lab into mainstream business and consumer applications. Speech recognition technology is being deployed in four main business areas:
Benefits of speech as a technology interface
In the past two decades telephony applications have benefited businesses and consumers alike. These technologies have resulted in expanded business hours, increased speed of information delivery, and enhanced transactions for both groups.
Interactive voice response systems, predictive diallers, computer telephony integration software, automatic call distribution and auto-attendants demonstrate some of the ways in which businesses can offload repetitive and menial tasks so that employees can be used for more complex and challenging projects. Similarly, they enable customers to initiate self-service applications 24 hours a day, providing them with access to information and allowing them to make transactions without the aid of a CSR.
The application of speech recognition in telephony markets
Voice processing
Voice processing, which includes auto-attendants, voice messaging and unified messaging, is primed for the addition of speech recognition.
The creation of speech-recognition driven auto-attendants for the enterprise market has brought both reduced costs and expanded coverage for businesses, as an auto-attendant works 24 hours a day without pay.
With the depth of research done in command and control applications, speech recognition is a natural choice for controlling telephony applications that require navigation. For this reason one of the most obvious targets for deployment is in voice and unified messaging applications. As with the auto-attendant function, the addition of speech recognition allows users to navigate through voice messages without remembering command codes or a series of keystrokes on the telephone or keyboard. It also levels the playing field for user interfaces in that any combination of words can be recognised as a command.
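The difference between keypad codes and spoken commands can be sketched as two lookup tables. With a keypad, each action has exactly one code to memorise; with speech, several natural phrasings can map to the same action. The codes, phrases and the `interpret` helper below are hypothetical.

```python
# Hypothetical keypad codes a caller would otherwise have to memorise.
KEYPAD_COMMANDS = {"7": "delete", "9": "save", "6": "next"}

# Several spoken phrasings map to the same messaging action.
SPOKEN_COMMANDS = {
    "delete": "delete", "erase": "delete", "get rid of it": "delete",
    "save": "save", "keep it": "save",
    "next": "next", "next message": "next", "skip": "next",
}


def interpret(user_input):
    """Map a keypress or a recognised phrase to an action, or None if unknown."""
    text = user_input.strip().lower()
    return KEYPAD_COMMANDS.get(text) or SPOKEN_COMMANDS.get(text)


print(interpret("7"))              # delete
print(interpret("Get rid of it"))  # delete
```

Adding a new phrasing is a one-line change to the table, whereas the keypad offers only as many commands as there are keys and sequences a user is willing to remember.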
An offshoot of the messaging market is that of virtual assistants. Also named personal telephony or unified communications, virtual assistants are speech-recognition driven personal call assistants that allow a user to access and control any number of communication applications from a telephone. Designed with personality in mind, these agents act on behalf of the caller to access and utilise applications such as unified messaging, voice messaging, voice activated dialling to internal and external numbers, integration with personal information manager (PIM) applications, and calendar functions.
Call centres
There are ample opportunities in the call centre market for speech as well. In fact it is a natural enhancement to many applications in a call centre including:
The internet
Finally, one of the most exciting emerging opportunities is the speech empowerment of the internet. Speech is a natural fit in this scenario. In addition to call centre and IVR applications, voice portals (just like web portals) are appearing everywhere, allowing callers to voice surf the web over the phone, without benefit of a computer.
This is occurring because of the development of a number of emerging standards, including the most promising one to date: voice eXtensible mark-up language (VoiceXML, sometimes abbreviated to VXML), which is an XML-based mark-up language that defines a spoken dialogue (voice page) just as HTML defines a graphical web page. This and other developing standards will eventually allow the interconnection of various speech sites to form a speech web, the way that individual graphic sites have made up the world wide web.
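To give a feel for what a voice page looks like, the sketch below uses Python's standard XML library to build a minimal VoiceXML document: one form containing one field, with a spoken prompt and a short list of accepted answers. The element names follow the VoiceXML 2.0 specification as far as this sketch goes; the form id, field name, prompt wording and department list are hypothetical.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"  # VoiceXML 2.0 namespace


def build_voice_page():
    """Build a one-question voice page and return it as an XML string."""
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NS})
    form = ET.SubElement(vxml, "form", {"id": "main_menu"})       # hypothetical id
    field = ET.SubElement(form, "field", {"name": "department"})  # hypothetical name

    # What the caller hears.
    prompt = ET.SubElement(field, "prompt")
    prompt.text = "Which department would you like?"

    # What the caller is allowed to say in reply.
    for choice in ("sales", "support", "billing"):
        option = ET.SubElement(field, "option", {"value": choice})
        option.text = choice

    return ET.tostring(vxml, encoding="unicode")


page = build_voice_page()
print(page)
```

Where an HTML page describes what a browser should draw, this page describes what a voice browser should say and which spoken replies it should listen for.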