News

Long-Promised, Voice Commands Are Finally Going Mainstream


From Wired: Speech technology has long languished in the no-man's land between sci-fi fantasy ("Computer, engage warp drive!") and disappointing reality ("For further assistance, please say or press 1 ..."). But that's about to change, as advances in computing power make voice recognition the next big thing in electronic security and user-interface design. A whole host of highly advanced speech technologies, including emotion and lie detection, are moving from the lab to the marketplace.

"This is not a new technology," says Daniel Hong, an analyst at Datamonitor who specializes in speech technology. "But it took a long time for Moore's Law to make it viable." Hong estimates that the speech technology market is worth more than $2 billion, with plenty of growth in embedded and network apps.

It's about time. Speech technology has been around since the 1950s, but only recently have computer processors grown powerful enough to handle the complex algorithms that are required to recognize human speech with enough accuracy to be useful. There are already several capable voice-controlled technologies on the market. You can issue spoken commands to devices like Motorola's Mobile TV DH01n, a mobile TV with navigation capabilities, and TomTom's GO 920 GPS navigation boxes. Microsoft recently announced a deal to slip voice-activation software into cars manufactured by Hyundai and Kia, and its TellMe division is investigating voice-recognition applications for the iPhone. And Indesit, Europe's second-largest home appliances manufacturer, just introduced the world's first voice-controlled oven.

Yet as promising as this year's crop of voice-activated gadgets may be, they're just the beginning. Speech technology comes in several flavors, including the speech recognition that drives voice-activated mobile devices; network systems that power automated call centers; and PC applications like the MacSpeech Dictate transcription software I'm using to write this article.

Voice biometrics is a particularly hot area. Every individual has a unique voice print that is determined by the physical characteristics of his or her vocal tract. By analyzing speech samples for telltale acoustic features, voice biometrics can verify a speaker's identity either in person or over the phone, without the specialized hardware required for fingerprint or retinal scanning.
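In rough outline, verification of this kind reduces to comparing a fresh speech sample's acoustic features against an enrolled voice print. The sketch below is illustrative only: the tiny four-element feature vectors, the cosine-similarity measure and the 0.85 threshold are stand-ins for the far richer models real systems use, not details of any production product.

```python
import math

def cosine_similarity(a, b):
    """Compare two feature vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def verify(enrolled_print, sample_print, threshold=0.85):
    """Accept the speaker only if the sample is close enough to the enrolled print."""
    return cosine_similarity(enrolled_print, sample_print) >= threshold

# Toy feature vectors standing in for measured acoustic characteristics.
enrolled = [12.1, 3.4, 8.7, 5.2]   # stored voice print from enrollment
genuine  = [11.8, 3.6, 8.5, 5.0]   # same speaker, slight natural variation
impostor = [2.0, 9.9, 1.1, 7.3]    # different vocal tract, different features
```

The point of the sketch is the shape of the decision, not the features: the system never stores or replays speech, only a compact numeric print, which is what lets it run over an ordinary phone line.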

The technology can also have unanticipated consequences. When the Australian social services agency Centrelink began using voice biometrics to authenticate users of its automated phone system, the software started to identify welfare fraudsters who were claiming multiple benefits -- something a simple password system could never do. The Federal Financial Institutions Examination Council has issued guidance requiring stronger security than simple ID and password combinations, which is expected to drive widespread adoption of voice verification by U.S. financial institutions in coming years. Ameritrade, Volkswagen and European banking giant ABN AMRO all employ voice-authentication systems already.

Speech recognition systems that can tell if a speaker is agitated, anxious or lying are also in the pipeline. Computer scientists have already developed software that can identify emotional states and even truthfulness by analyzing acoustic features like pitch and intensity, and lexical ones like the use of contractions and particular parts of speech. And they are honing their algorithms using the massive amounts of real-world speech data collected by call centers.
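As a toy illustration of how acoustic cues (pitch, intensity) and lexical cues might be folded into one score, consider the sketch below. The feature weights and keyword list are invented for the example and carry no empirical weight; a real classifier would be trained on labeled call-center data.

```python
def frustration_score(pitch_values, intensities, transcript):
    """Toy score: higher pitch variance, louder speech and angry keywords
    all push the score up. Weights are illustrative, not empirical."""
    mean_pitch = sum(pitch_values) / len(pitch_values)
    pitch_var = sum((p - mean_pitch) ** 2 for p in pitch_values) / len(pitch_values)
    mean_intensity = sum(intensities) / len(intensities)
    keywords = {"ridiculous", "unacceptable", "again", "supervisor"}
    lexical_hits = sum(1 for w in transcript.lower().split()
                       if w.strip(".,!?") in keywords)
    return 0.01 * pitch_var + 0.02 * mean_intensity + 0.5 * lexical_hits

# Invented sample measurements: pitch in Hz, intensity in dB.
calm = frustration_score([110, 112, 111], [55, 56, 54],
                         "thank you for your help")
angry = frustration_score([140, 200, 120, 250], [78, 82, 85],
                          "this is unacceptable I want a supervisor")
```

Even this crude combination separates the two samples, which is why researchers lean on both channels at once: either cue alone is easy to fool.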

A reliable, speech-based lie detector would be a boon to law enforcement and the military. But broader emotion detection could be useful as well. For example, a virtual call center agent that could sense a customer's mounting frustration and route her to a live agent would save time and money and help retain customer loyalty. "It's not quite ready, but it's coming pretty soon," says James Larson, an independent speech application consultant who co-chairs the W3C Voice Browser Working Group. Companies like Autonomy eTalk claim to have functioning anger and frustration detection systems already, but experts are skeptical. According to Julia Hirschberg, a computer scientist at Columbia University, "The systems in place are typically not ones that have been scientifically tested." According to Hirschberg, lab-grade systems are currently able to detect anger with accuracy rates in "the mid-70s to the low 80s." They are even better at detecting uncertainty, which could be helpful in automated training contexts. (Imagine a computer-based tutorial that was sufficiently savvy to drill you in areas you seemed unsure of.)

Lie detection is a tougher nut to crack, but progress is being made. In a study funded by the National Science Foundation and the Department of Homeland Security, Hirschberg and several colleagues used software tools developed by SRI to scan statements that were known to be either true or false. Scanning for 250 different acoustic and lexical cues, "We were getting accuracy maybe around the mid- to upper-60s," she says. That may not sound so hot, but it's a lot better than the commercial speech-based lie detection systems currently on the market. According to independent researchers, such "voice stress analysis" systems are no more reliable than a coin toss.

It may be a while before industrial-strength emotion and lie detection come to a call center near you. But make no mistake: They are coming. And they will be preceded by a mounting tide of gadgets that you can talk to -- and argue with. Don't be surprised if, some day soon, your Bluetooth headset tells you to calm down. Or informs you that your last caller was lying through his teeth.

Device: Arabic In, English Out


From Wired: Soldiers can't prevent the diplomatic misunderstandings that breed warfare, but the Pentagon hopes a handheld electronic interpreter in GIs' packs can prevent language barriers from claiming lives on the battlefield. To be successful, such a gadget has to go way beyond the electronic phrase books and generic tourist directories available today.

A new device being tested at the Office of Naval Research shows a lot of promise, according to Joel Davis, a neurobiologist there. "We have good ones now; they'll be better in a few years, and eventually fantastic," he said.

Over the past several years, the Navy has pumped about $4 million into Davis' program to develop simultaneous machine translation and interpretation. On Friday, the Senate Armed Services Committee will see a demonstration of the choice fruit of that effort, a blend of voice recognition, speech synthesis and translation technologies called Interact.

"There are really a lot of competitors to this, and I've funded them, but no one has come quite close to this," Davis said of Interact, created by SpeechGear, a small startup in Northfield, Minnesota. Interact lets someone talk into the device in one language -- then it spits out an audio translation with just a two-second delay and no need for the speaker to pause.

After a demonstration for military brass last week and field tests in December at Navy Central Command in Bahrain, "we had to pry our demo model out of their hands," said Robert Palmquist, president and CEO of SpeechGear. Unfortunately, Palmquist said, the new Iraq war came a year too early for his product. In the current conflict, military personnel will have to rely on human interpreters and weathered pocket dictionaries to communicate with refugees, wounded civilians, prisoners and combatants.

The secret to Interact is not that it's a brand-new technology, but rather an amalgam of existing solutions. The hardware comes from any electronics store: a Linux or Windows XP tablet PC with a microphone and speaker. When a user speaks into the Interact system, a voice-recognition program generates text that is then passed on to translation software. That program then bridges the two languages, and a voice synthesizer "reads" the translation out loud.
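The pipeline just described can be sketched as three stages wired in sequence. Everything below is a hypothetical placeholder: the function names, the tiny phrase table and the audio tagging stand in for real recognition, translation and synthesis engines, whose identities SpeechGear does not disclose.

```python
def recognize(audio):
    """Stub: a real system would run acoustic and language models here."""
    return audio["spoken_text"]

def translate(text, src, dst):
    """Stub translation engine: a one-entry phrase table stands in for real MT;
    unknown phrases pass through unchanged."""
    phrase_table = {("en", "ar"): {"where is the hospital": "ayna al-mustashfa"}}
    return phrase_table[(src, dst)].get(text.lower(), text)

def synthesize(text):
    """Stub: a real synthesizer would produce audio; here we just tag the text."""
    return f"<audio:{text}>"

def interpret(audio, src="en", dst="ar"):
    # The pipeline the article describes: speech -> text -> translation -> speech.
    return synthesize(translate(recognize(audio), src, dst))
```

The design point the article makes is visible in the sketch: because the stages only exchange text, any one engine can be swapped out for a better one without touching the others.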

SpeechGear bundles Interact with a traditional electronic dictionary, called "Interprete," but specialized vocabulary for the military, hospitals, firefighters and others can also be added, as can personal word lists. "SpeechGear doesn't care where the software comes from; they'll use whatever translation engine or word recognition program they can find," Davis said. "The genius is to put that together in a seamless fashion."

Palmquist agreed. "We love going out and finding stuff we can use," he said. "Inevitably there are always portions we have to develop ourselves, and that last 10 percent can be tough and very involved." Palmquist declined to identify the tools SpeechGear used to develop Interact, not only for proprietary reasons, but also "because it's always changing. If we find a better component on the market, we'll integrate that and drop the old one. We tell our clients that."

And there is always room for improvement. "For this to work, you need two people with a desire to communicate. A recalcitrant interviewee can screw things up," Davis said.

The system mangles grammar at times and would also stumble over the contents of a wiretap, Palmquist said, because it would be confused by idioms and slang. "You really have to stick to a neutral way of speaking," he said. Users must also hold the unit close to the mouth, and the mortar blasts and machine gun rattle of the battlefield can interfere.

SpeechGear plans to sell a consumer version in a year, Palmquist said. Eventually, he said, customers will be able to dial up from a cell phone to a service that will interpret for them at, say, a local market in Peru or on a conference call with speakers of several languages. People toting camera/cell-phone combos can snap a quick pic of a Chinese menu -- in China -- and get a quick translation into their native tongue thanks to SpeechGear's "Camara" system.

Interact, Interprete and Camara are marketed as a package by SpeechGear under the name Compadre Language Technologies. A good part of the heavy lifting to make speech-to-speech machine translation a reality will be done in university labs supported by the Pentagon, National Science Foundation, European Union and Japan.

"The hottest areas of research right now are being able to port rapidly to new languages and getting these things to run well on very small devices like PDAs," said Robert Frederking, senior systems scientist at Carnegie Mellon University's Language Technologies Institute. His group tested a system called Tongues in Croatia in 2001. "We have in fact built experimental systems on both laptops and high-end PDAs," he wrote in an e-mail interview. "There are many issues; it's still quite hard to do. Limited memory size, the quality of the soundboards and the lack of real-number arithmetic on some PDAs are, [for] example, difficult issues."

Diana Liao, chief of the United Nations interpretation service, said she won't put her faith in machine translators anytime soon. "We always rely on the human solution. The human voice is very difficult to work with. In diplomacy you have to get the nuance in your language and pay attention to inflection and even body language," she said.

But skilled interpreters are always in short supply. Liao's division has started trials using remote interpretation -- routing conversations through an interpreter at another location by telephone or videoconferencing -- but she cites problems with equipment and time differences in that approach as well. "These meetings are usually set up way ahead of time, and sometimes it might even be more expensive than just sending a person over."