December/January 2000
Increased Computing Power, Better Speech Recognition and Improvement in Directed Dialogue Make for a New Class of Communications Devices
By James A. Larson
A new class of communications devices is emerging that will enable users to access the World Wide Web with more ease and convenience than a desktop PC allows. These devices, called Internet appliances, will deliver useful information and interesting entertainment from the Web. Users will be able to converse with these inexpensive, intuitive appliances to access services that make their lives easier and more enjoyable. These new electronic devices will listen to users and understand what they say.
Users will speak and listen to these appliances, read small displays, and press physical buttons. Some will have embedded processors to provide limited speech recognition and synthesis. Many appliances will be connected to large servers that provide powerful speech recognition, speech synthesis, natural language processing, and dialog management.
With the advances in communications and computing, Internet appliances are being designed for a variety of specific activities. The following table summarizes the functions of some examples.
Appliance | Use
--- | ---
Multimedia Player | Download entertainment files from the WWW, especially music and videos, for replay at any time
Information Pad | Portable WWW browser with a small display for presenting requested information such as the weather, stock quotes or sports scores
Smart telephone | Connect and communicate with people via voice, fax and e-mail
Set-top box | Enhance broadcast television with additional content and activities, such as product descriptions and other forms, biographical information about the speaker, games and puzzles related to the broadcast, and URLs for additional information
While these Internet appliances are designed for different purposes, they all exhibit these basic qualities:
The main method of communication is speech. The user speaks simple instructions to a continuously listening software agent. The appliance verbally prompts the user to clarify when the software agent cannot recognize the user's spoken words (a speech recognition error) or cannot interpret the user's command (an error in command meaning).
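The two error cases above can be sketched as a simple decision rule. This is a minimal illustration, assuming a hypothetical recognizer that returns a transcript with a confidence score; all command names and the threshold value are invented for the example.

```python
# Directed-dialogue error handling: act on an utterance, re-prompt on a
# likely recognition error, or ask for clarification when the words were
# heard clearly but the command is not understood. Names are illustrative.

KNOWN_COMMANDS = {"play music", "check weather", "read email"}
CONFIDENCE_THRESHOLD = 0.6  # below this, treat as a recognition error

def handle_utterance(transcript, confidence):
    """Return (action, detail) for one recognized utterance."""
    if confidence < CONFIDENCE_THRESHOLD:
        # Speech recognition error: the words themselves are in doubt.
        return "reprompt", "Sorry, I didn't catch that. Please repeat."
    if transcript not in KNOWN_COMMANDS:
        # Error in command meaning: words recognized, command unknown.
        return "clarify", f"I heard '{transcript}', but I don't know that command."
    return "execute", transcript

print(handle_utterance("play music", 0.9))  # high confidence, known command
print(handle_utterance("play music", 0.3))  # low confidence: re-prompt
print(handle_utterance("play muzak", 0.9))  # clear words, unknown command
```

A real appliance would loop this decision until an "execute" action results, which is what makes the dialogue "directed."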
Alternative interface methods, such as small displays and physical buttons, may be available for the few situations when the user's speech is not recognized correctly.
Speech Recognition
Speech recognition has become practical as a means of operating such devices. Speech has been a very challenging problem for developers for decades, but three broad technical advances have recently made the problem tractable: increased computing power, better speech recognition algorithms, and improvements in directed dialogue.
One of the fastest growing speech recognition market segments is found in the telephony world. Interactive Voice Response (IVR) systems that prompt the user with multiple levels of voice menus of the form, "For 'x' press 1, for 'y' press 2," are being replaced by systems that recognize the user's goal immediately from the user's spoken request. Speech recognition enables users to conduct short, directed conversations about what to do.
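The shift from touch-tone menu traversal to immediate goal recognition can be illustrated with a toy intent router. This is only a sketch of the idea; the keywords, intent names, and fallback behavior are all invented for illustration.

```python
# A touch-tone IVR walks the caller through nested menus; a speech-enabled
# system maps the caller's first utterance directly to a goal. This toy
# router spots keywords in the request; keywords and intents are illustrative.
import string

INTENTS = {
    "balance": "account_balance",
    "transfer": "funds_transfer",
    "hours": "branch_hours",
}

def route_request(utterance):
    """Map a spoken request straight to a goal, skipping menu levels."""
    words = [w.strip(string.punctuation) for w in utterance.lower().split()]
    for keyword, intent in INTENTS.items():
        if keyword in words:
            return intent
    # Unrecognized goal: fall back to a short directed menu.
    return "fallback_to_menu"

print(route_request("What is my account balance?"))  # → account_balance
```

Production systems use statistical grammars rather than keyword lists, but the design point is the same: one utterance replaces several levels of "press 1" menus.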
Advances that have been made and implemented in IVR systems will soon be found in Internet appliances.
Connection Continuum
How such appliances connect to the Internet varies greatly. The devices can be grouped along a continuum, with stand-alone and always-connected appliances as the endpoints; the majority fall somewhere in between, in the category of occasionally connected appliances.
Stand-alone Appliance. Factors that determine if a stand-alone appliance can perform speech processing include available processor MIPS, memory size, and power consumption. If the processor has sufficient MIPS and memory size and does not use much power, then the appliance can perform its own speech processing, including speech recognition and text-to-speech synthesis. Low power consumption is extremely important for stand-alone devices. High power consumption means that the batteries must be replaced frequently or the appliance must be docked while its batteries are recharged.
Currently, an impressive number of appliances with embedded speech processors is available. While these devices do not access the Internet, they illustrate the small-vocabulary, limited-dialog capability that a stand-alone appliance can support.
Connected Appliance. An example of a connected appliance is a smart VCR, which would always be connected to the Internet. The smart VCR captures entertainment from the Internet as it is broadcast and stores it for replay to the user at a later time. If the smart VCR can be disconnected from the Internet to replay content from a videotape or disk, then it is instead an occasionally connected appliance.
Embedded Speech Processors
An impressive number of products with embedded speech processors are now available. Some of these include:
Sensory [www.voiceactivation.com] also sells a toolkit for hobbyists and inventors to add voice recognition to consumer electronics, toys, motors, appliances, lights, games, model railroads and other electronic products.
Occasionally-Connected Appliance. The majority of Internet appliances fall into the occasionally connected category. These appliances can be used in stand-alone mode, but connect to the Internet only to access data or another processor.
Appliance Occasionally Connects to Data. The Internet appliance must be connected in order to upload data to, or download data from, an Internet source, such as a database or a Web site. Once that is accomplished, however, it may be used in stand-alone mode. Examples of these appliances include:
VoiceTIMES is a consortium founded by IBM to coordinate the technical requirements needed to integrate voice into mobile devices and enterprise solution software. The organization plans to conduct highly targeted research studies to identify enterprise solutions where mobile devices will change the way companies do business.
MIT's Laboratory for Computer Science has created the experimental Handy21. This handheld computer will incorporate the functions of a variety of communications gadgets, including a television, a pager, an AM/FM radio, a cellular telephone, and a wireless Internet connection. The device will be equipped with an antenna to transmit and receive communication signals, but all signal processing will be done on a general-purpose microprocessor, which allows the user to run many applications. Using the conversational speech technology developed by Dr. Victor Zue and his fellow researchers, the Handy21 will perform speech recognition and text-to-speech processing locally and connect via telephone to the Internet for additional remote processing.
Enabling Web-based Applications
Microsoft's Internet Explorer and Netscape's Navigator have enabled millions of PC users to render data and execute applications encoded in a standard language called the HyperText Markup Language (HTML). While HTML and its extensions enable PC users to see GUI-based renderings of data and applications, they are not adequate for representing voice-based applications for Internet appliances.
Capturing Speech for PC Processing
Rather than including an expensive chip, other vendors have elected to off-load the speech processing to a readily available powerful processor: the PC. These devices include hand-held microphones and recorders.
Recently, the World Wide Web Consortium (W3C) established the Voice Browser Working Group to define a language for representing voice data and applications. When Internet appliances are connected to the WWW, they can download and execute Web pages in much the same way as GUI-based Web browsers download and execute HTML Web pages. Several candidate languages for voice have been specified. They include: VoxML from the VoxML Forum (IBM, AT&T, Motorola, and Lucent); Java Speech Markup Language (JSML) from Sun; Voice eXtensible Markup Language (VxML) from the VxML Forum (Bell Laboratories, AT&T, Motorola, IBM); and Speech Markup Language (SpeechML) from IBM. Currently, voice specification languages are being defined and implemented to encode data and applications for rendering and execution on these inexpensive, intuitive conversational Internet appliances.
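As a rough sketch of the direction these languages take, a weather-query page in a VoiceXML-style markup might look like the following. The element names follow the early VoiceXML proposals; the grammar file, field name, and URL are invented for illustration.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- One form with one field: prompt, listen, then act on the answer -->
  <form id="weather">
    <field name="city">
      <prompt>For which city would you like the weather?</prompt>
      <grammar src="cities.gram"/>
      <filled>
        <prompt>Getting the weather for <value expr="city"/>.</prompt>
        <submit next="http://example.com/weather"/>
      </filled>
    </field>
  </form>
</vxml>
```

The page plays the same role for a voice browser that an HTML form plays for a GUI browser: it declares the prompts, the expected user input, and where the result is sent.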
James Larson is the manager of Human I/O at the Intel Architecture Labs in Hillsboro, OR. He can be reached at jim@larson-tech.com .