"You wouldn't do it to us," intones a recent television commercial for an insurance company, as one of its rival's customer's blood boils in speech-recognition-induced frustration. "So we won't do it to you," it concludes, smugly implying that your calls will be answered by a real, caring, intelligent and possibly even beautiful person within microseconds of your call being connected.
The advertisement is wrong on two counts. The people who do answer the company in question's phones sound like defeated drones; and the service they offer could almost certainly be improved by judicious use of the speech recognition technology the company so readily ridicules. Speech recognition could even do it cheaper, faster and deliver better ROI.
Moore's Law makes speech viable
How has this ugly duckling of technology turned into a swan? As ever, Moore's Law is part of the answer, because the CPU cycles available on the desktop or server now exceed the numbers required to perform fast and accurate speech recognition.
"Speech recognition is now quick to learn, easy to use and the computing power needed to deliver 95 to 98% accuracy is there," says Bob Anderson, regional director for Asia-Pacific and Japan at ScanSoft, one of the world's leading speech software companies.
The two key technologies at the heart of speech recognition have also improved over time. The first, acoustic recognition, involves analysing a sound wave to translate it into a word and benefits most from the extra CPU cycles available to modern machines. Better analysis algorithms and smarter ways of learning the quirks of individual voices help too. Yet acoustic recognition can't work alone: a bad microphone, some background noise or an accent means that for the foreseeable future 'acoustic models' will also be needed.
Acoustic models comprise a dictionary so that speech recognition software can first try acoustic recognition and then refer its guess about the word being spoken to the model. The models also help to recognise the spoken word by using some contextual information, so an Australian model should at least have an inkling that the word "John" is likely to be followed by the word "Howard".
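The two-step lookup described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's engine: the candidate words, their acoustic scores, the dictionary and the word-pair probabilities are all invented for the example, and `rescore` is a hypothetical helper. Speech researchers usually call the contextual part of such a model a language model.

```python
# Hypothetical dictionary of words the model knows about.
DICTIONARY = {"John", "Howard", "Coward", "Hayward"}

# Hypothetical probabilities that the second word follows the first.
WORD_PAIR_PROBABILITY = {
    ("John", "Howard"): 0.30,
    ("John", "Hayward"): 0.01,
    ("John", "Coward"): 0.001,
}


def rescore(previous_word, candidates, floor=1e-4):
    """Pick the candidate that best fits both the sound and the context.

    candidates is a list of (word, acoustic_score) guesses produced by
    acoustic recognition. Guesses missing from the dictionary are discarded,
    and the rest are weighted by how likely they are to follow previous_word.
    """
    best_word, best_score = None, 0.0
    for word, acoustic_score in candidates:
        if word not in DICTIONARY:
            continue  # not a known word, so ignore the guess
        context = WORD_PAIR_PROBABILITY.get((previous_word, word), floor)
        combined = acoustic_score * context
        if combined > best_score:
            best_word, best_score = word, combined
    return best_word


# Raw acoustic guesses for the word spoken after "John" (invented numbers).
guesses = [("Owad", 0.45), ("Coward", 0.40), ("Howard", 0.35), ("Hayward", 0.25)]
chosen = rescore("John", guesses)
```

Here the best-sounding dictionary word, "Coward", loses to "Howard" once the preceding "John" is taken into account.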
Acoustic models have advanced in recent years to add thousands of words while staying small in size. A few hundred megabytes now suffices for a reasonably complex model, a size which drastically cuts the 'enrolment period', the time required before speech recognition software can interpret an individual voice to an acceptable level of accuracy.
XML spells ease of use
Another important innovation is VoiceXML, a standard that makes it possible for any application capable of consuming XML to interface with speech-recognition software. VoiceXML also handles text-to-speech output and, by making this all possible in a web programming environment, removes the need for proprietary applications to handle every aspect of a speech-enabled system.
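To give a feel for what a VoiceXML document contains, the sketch below uses Python's standard XML library to generate a one-question dialogue. The element names (`vxml`, `form`, `field`, `prompt`, `option`, `nomatch`, `filled`, `value`) come from the W3C VoiceXML 2.0 specification. The department menu, the wording of the prompts and the `build_menu_document` helper are hypothetical, and a production dialogue would need considerably more than this.

```python
import xml.etree.ElementTree as ET

VXML_NAMESPACE = "http://www.w3.org/2001/vxml"


def build_menu_document(field_name, question, choices):
    """Return a minimal VoiceXML 2.0 document that asks one question."""
    vxml = ET.Element("vxml", {"version": "2.0", "xmlns": VXML_NAMESPACE})
    form = ET.SubElement(vxml, "form", {"id": "main"})
    field = ET.SubElement(form, "field", {"name": field_name})

    # What the caller hears (rendered by text-to-speech).
    ET.SubElement(field, "prompt").text = question

    # The words the recogniser should listen for.
    for choice in choices:
        option = ET.SubElement(field, "option", {"value": choice})
        option.text = choice

    # Spoken when the caller says something outside the listed options.
    nomatch = ET.SubElement(field, "nomatch")
    ET.SubElement(nomatch, "prompt").text = "Sorry, I did not catch that."

    # Spoken once a listed option has been recognised.
    filled = ET.SubElement(field, "filled")
    confirmation = ET.SubElement(filled, "prompt")
    confirmation.text = "Connecting you to "
    ET.SubElement(confirmation, "value", {"expr": field_name})

    return ET.tostring(vxml, encoding="unicode")


document = build_menu_document(
    "department", "Which department would you like?", ["claims", "renewals"]
)
```

Because the result is ordinary XML, any web application that can emit markup can serve it to a VoiceXML-capable voice platform.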
"VoiceXML is revolutionary stuff because you used to have to hard code speech applications," says Steve Kelly, a Pervasive Computing sales specialist in IBM's software group. "Now we have a GUI development tool," which in concert with the other new speech technologies makes it possible to link all manner of applications into speech recognition systems with unprecedented ease.
This simplicity means VoiceXML has quickly become a developer favourite. ScanSoft, for example, integrates its speech recognition engine into many IVR products from the likes of Genesys by VoiceXML-enabling many of its products. IBM has done likewise with its WebSphere Voice Server.
New applications
New applications for speech recognition are following these innovations.
Dictation is one speech recognition hot spot. Desktop dictation has long been popular with professionals like doctors and lawyers, whose specialised vocabularies could easily be worked into acoustic models. Shorter enrolment periods and cheap 3 GHz PCs are now making speech recognition a profoundly better experience for individuals who use it to replace keyboard input.
Improved dictation is also challenging the role of typing pools everywhere through network dictation, an idea that sees personal tape recorders replaced with digital voice recorders.
Workers who dictate letters continue to do so but, instead of handing tapes to a typist, upload their recordings to a server, where speech recognition software turns them into text files. Typists then edit those files rather than transcribing them, a task that takes less time and can therefore place downward pressure on staff numbers in ways that dictation-intensive workplaces like law firms appreciate.
Biometrics is another emerging application. Every voice is unique and recognition can now pick out the nuances in a voice to provide unusually accurate identification and authentication via voice. Microsoft Research, for example, has worked on the 'Cepstrum', the measure of the change in frequency of your voice relative to changes in volume. Our anatomies determine the qualities of the Cepstrum, making it a potentially useful identifier.
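Signal-processing texts usually define the cepstrum more precisely than the gloss above: it is the inverse Fourier transform of the logarithm of a signal's magnitude spectrum. The sketch below computes it with a deliberately naive transform so that it needs only the Python standard library. The test signal, a single click followed by a fainter echo, is invented for the example, and nothing here is meant to represent how Microsoft Research or any biometric product actually works.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive discrete Fourier transform, adequate for short signals."""
    n = len(values)
    sign = 1 if inverse else -1
    result = []
    for k in range(n):
        total = sum(
            values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t in range(n)
        )
        result.append(total / n if inverse else total)
    return result


def real_cepstrum(signal):
    """Inverse transform of the log magnitude spectrum of the signal."""
    spectrum = dft(signal)
    # The tiny constant guards against taking the logarithm of zero.
    log_magnitude = [math.log(abs(x) + 1e-12) for x in spectrum]
    return [c.real for c in dft(log_magnitude, inverse=True)]


def click_with_echo(length=64, delay=8, strength=0.5):
    """Toy signal: one click at sample 0 and a fainter copy `delay` samples later."""
    signal = [0.0] * length
    signal[0] = 1.0
    signal[delay] += strength
    return signal


cepstrum = real_cepstrum(click_with_echo())
# Ignore index 0 and search the first half for the strongest component.
peak = max(range(1, 33), key=lambda q: cepstrum[q])
```

The strongest component sits at the echo's delay of eight samples. The cepstrum exposes regular structure in a signal in the same way, which is why it is widely used to pull the characteristics of an individual voice out of a recording.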
Speech recognition is even becoming a control mechanism for air traffic control systems and fighter planes, applications Melbourne company Adacel is working on for the US military.
The customer service challenge
But it is in customer self-service that speech is really taking off. VoiceXML not only enables voice input; it also makes it much easier to provide text-to-speech output, creating what some now call a voice user interface (VUI), in which users speak to an application and are spoken to in return.
Building these applications poses two challenges, the first of which is the familiar integration chore of connecting applications to produce the required results, then hooking them into a speech recognition engine. Expertise to perform this kind of work is burgeoning, as telephony integrators and IT services firms alike enter the market wielding tools like IBM's WebSphere Voice Application Access, middleware specifically designed to voice-enable enterprise applications.
The second challenge - designing a VUI - is more specialised because the skills required to design a voice interface are totally different to those needed for a point-and-click interface.
Enter people like Jane Curtain, a user interface and linguistic consultant for Dimension Data. Curtain has qualifications in linguistics and psychology, a combination she says is important when creating VUIs because "speech recognition is all about human language interactions and getting people to respond to questions that we ask. We have all been speaking since we were very young and we have expectations of language."
Fail to meet those expectations, she says, and you quickly take users outside their comfort zone and into a place where they find it uncomfortable to use a VUI.
"That's why people don't like answering machines and IVRs: they use very formal language and don't offer the response you would get during a natural language conversation."
Every word used in a VUI is important, she says. "Everyone has an opinion about what they think an application should say. A linguist has a more scientific approach. We don't design on whim or opinion, we use scientifically-based rules that use the subtle differences in human languages to make something more usable."
Even with the need for a scientific approach, Curtain advises that the first step is to create a persona for a speech application, to reflect a business's brand and set the tone for voice interactions.
Curtain's skills were recently pressed into Vodafone's service to create a voice-driven application for pre-paid mobile phone recharges. The project saw Vodafone audition twelve different voice talents before conducting an internal 'Idol' competition to pick the winner that best reflected its brand values. The company even went so far as to design a personality for the application, dubbed 'Lara', that includes eye colour and the car 'she' drives, all of which are reflected in the phrases she utters.
Another chore is designing a 'call flow', a formal map of the interactions required that is created by detailed analysis of existing call centres and the queries they process. Scripting follows, and employs socio-linguistic skills so that the options offered to callers steer them towards the outcome they want using a pre-defined vocabulary the system has been set to recognise.
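A call flow of this kind can be pictured as a small state machine in which each step has its own prompt and its own accepted vocabulary. The sketch below is a hypothetical recharge dialogue written for illustration; the states, prompts and word lists are invented and are not taken from Vodafone's application. It assumes that anything outside a step's vocabulary is handed to a human agent.

```python
# Each state has a prompt and a vocabulary mapping recognised words to the
# next state. States with an empty vocabulary end the automated call.
CALL_FLOW = {
    "start": {
        "prompt": "Would you like to recharge or check your balance?",
        "vocabulary": {"recharge": "amount", "balance": "balance"},
    },
    "amount": {
        "prompt": "How much would you like to add: twenty, thirty or fifty dollars?",
        "vocabulary": {"twenty": "confirm", "thirty": "confirm", "fifty": "confirm"},
    },
    "balance": {"prompt": "Here is your balance.", "vocabulary": {}},
    "confirm": {"prompt": "Thanks, your recharge is done.", "vocabulary": {}},
}

HUMAN_AGENT = "agent"


def next_state(state, utterance):
    """Return the next state, or HUMAN_AGENT if no known word was heard."""
    vocabulary = CALL_FLOW[state]["vocabulary"]
    for word in utterance.lower().split():
        if word in vocabulary:
            return vocabulary[word]
    return HUMAN_AGENT


def run_call(utterances):
    """Walk the call flow and return the list of states visited."""
    state = "start"
    path = [state]
    for utterance in utterances:
        state = next_state(state, utterance)
        path.append(state)
        # Stop once the call is handed to a person or reaches an end state.
        if state == HUMAN_AGENT or not CALL_FLOW[state]["vocabulary"]:
            break
    return path


completed = run_call(["I want to recharge please", "fifty"])
escalated = run_call(["my phone was stolen"])
```

The first call stays inside the vocabulary and finishes without human help. The second falls outside it at the first step and is routed to an agent.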
Lara has been a runaway success for Vodafone and generated ROI within months by averting the need to hire more than one hundred call centre agents while keeping existing agents happy because the most common and boring tasks are now handled by machines.
Still room for improvement
Yet when customers go beyond the accepted vocabulary, they are still put through to a human.
Speech scientists, however, have already recognised the next steps they must take to let speech recognition replace even more human interaction.
"We need much smarter technologies to really predict what is being said," says Robert Dale, professor and director of Macquarie University's Centre for Language Technology.
"If you were dictating a letter to a secretary and said 'Dear John I can do lunch on Thursday. Hang on. Make that Wednesday,' the secretary would know what to type." Today's speech recognition would transcribe the entire sentence to create an embarrassing mess.
"There is a gap between transcription and interpreting the language," Dale says. "The next hurdle is to get that interpretation going, so speech recognition can draw the distinction between recognising the word and recognising what to do with it."
When will we see such technologies? Dale believes it will be at least a decade. But between now and whenever this new wave of technology arrives, only the extremely cynical would ignore speech recognition, which is now a mainstream if imperfect technology more than ready for use alongside any enterprise application.
Getting multimodal
Using a mobile computing device is frustrating. Styluses for writing on PDAs are too thin and learning special alphabets is a chore. SMS just isn't feasible for messages more than a few words long, while the 'thumboards' offered by devices like the BlackBerry can literally be painful to use.
Multi-modal interfaces are advanced as the way out of the mobile device's size constraints. As the name implies, multi-modal interfaces offer more than one way to use a device and blend them so that a mobile computer's functions are as easy as possible to access.
Researchers at the Smart Internet Technology Cooperative Research Centre at the University of New South Wales have already created such a system. 'Amanda' and 'Joshua' are speech-recognising PDAs that use Wi-Fi links to send your voice to a server for recognition. The pair present a human face and respond to natural language questions by operating the PDA's software. Ask about your "appointment on Tuesday afternoon with Bob", for example, and they will open the appointment in the PDA's calendar. Using speech recognition for this kind of task reduces the amount of interaction that requires unnatural styluses or mini-keyboards and steers users towards the best mode of interaction for the task they wish to complete.
Multi-modal interfaces are just starting to become commercially available too. Smartphones running Windows Mobile offer the service today, while Adelaide's Clipsal offers a voice-powered home automation system that lets its owners turn on lights by speaking into a power switch. When the time comes to reprogram the automation regime, users change modes and sit down at a PC where the richer interface of keyboard, mouse and monitor is far better suited to the task.