May/June 2001

Cover Story: The Internet's Information Highway Runs Right Through Voice Portals

By James A. Larson

Access to the Internet's wealth of information is becoming more than just a convenient tool - it is becoming a necessity. Users are demanding more ways to go online than just a PC and modem. Mobile connectivity has become key to productive communications.

To answer the demand for uninterrupted access to the Internet, phones have become popular Internet terminals, connecting through voice portals to access a large number and variety of information services. Developers use speech-oriented markup languages such as VoiceXML to create voice portals. Designers now are facing the challenge of coordinating content designed for GUI browsers, such as that written in HTML, with content designed for voice browsers.

Motivation
The Internet has dramatically affected the way people access information and conduct business. The number of Web sites is growing geometrically. Most major retailers have Web sites to complement their traditional brick-and-mortar sites. "E-tailers" have Web sites containing catalogs of products and services for sale. Many users now have their own Web sites, on which they publish a variety of information.

Users reach Web sites from PCs connected to the Internet, using browsers such as Microsoft's Internet Explorer and Netscape's Navigator. New devices and appliances entering the market, such as WebTV, let users browse the Web on their televisions. All of these interfaces require displays for rendering Web content to the user.

Telephones let users access the Internet by speaking and listening, skills that every user learned during childhood and that make for a very natural user interface. About 200 million PCs can be connected to the Internet, but more than 2 billion people have access to telephones. Mobile devices - cell phones, handhelds and laptop computers - are portable. They can be used anywhere - at home, at work and on the road. These devices provide the user with time-sensitive data - data when the user needs it. An example might be a late-breaking news item about a company in which the user owns stock. Mobile devices provide the user with location-sensitive data as well, such as changing traffic conditions.

Handsets and their close relative, the headset - consisting of a microphone and speaker worn on the user's head - present a pragmatic interface to people with visual impairments and to people who need Web access while keeping their hands and eyes free for other tasks. Asian users whose languages do not lend themselves to entry with traditional QWERTY keyboards also will benefit from the voice interface enabled by handsets or headsets.

IVR versus ASR
Two major technologies have emerged to capture user input in telephone applications: interactive voice response, or IVR, and automatic speech recognition, or ASR. In an interactive voice response system, a user responds to hierarchies of voice menus by pressing buttons on the telephone keypad. Major problems with IVR systems include:

Lost in space: The user accidentally proceeds down the wrong menu branch and cannot reach the desired information. For example, the user mistakenly enters "1" for "northwest" instead of "3" for "northeast" and becomes frustrated when the option "Roxy Theater" (actually located in the northeast) is neither heard nor available for selection. To retreat from an incorrect choice, the user must either press a key corresponding to "home menu" or repeatedly press a key corresponding to "undo" to back up to the menu where the incorrect choice was made.
Time-consuming menus: The user must listen to voiced options instead of asking directly for the information desired. For example, forcing the user to enter "3" for "northeast" and, then, "2" for "Act III Theaters" before hearing the option "5" for "Roxy Theater" can be frustrating for impatient users. Users experienced with the IVR system may use type-ahead to partially overcome this problem. Type-ahead enables the user to press keys before listening to the complete menu option. Experienced users who are familiar with the number options may press a sequence of keys to go immediately to the desired option. For example, the user might press the "3-2-5" keys to go directly to the Roxy Theater option.
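
The menu hierarchy and the type-ahead shortcut described above can be sketched in a few lines of Python. This is only an illustration: the MENU structure and the navigate function are hypothetical names, and the key assignments simply follow the theater example in the text.

```python
# Hypothetical menu tree; keys and labels follow the theater example above.
MENU = {
    "prompt": "Choose a region",
    "options": {
        "1": {"prompt": "Northwest theaters", "options": {}},
        "3": {
            "prompt": "Northeast theaters",
            "options": {
                "2": {
                    "prompt": "Act III Theaters",
                    "options": {
                        "5": {"prompt": "Roxy Theater", "options": {}},
                    },
                },
            },
        },
    },
}


def navigate(menu, keys):
    """Follow a sequence of keypad presses and return the prompts on the path.

    Raises KeyError when a key has no matching option at the current level,
    which is the "lost in space" situation.
    """
    node = menu
    path = [node["prompt"]]
    for key in keys:
        node = node["options"][key]
        path.append(node["prompt"])
    return path


# Type-ahead: an experienced caller presses 3-2-5 without waiting for prompts.
print(navigate(MENU, "325"))
# ['Choose a region', 'Northeast theaters', 'Act III Theaters', 'Roxy Theater']
```

A caller who presses "1" first lands in the northwest branch, where no further key sequence leads to the Roxy Theater.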

Improvements in algorithms and increased processor speed have enabled ASR to hear what the user speaks rather than forcing the user to respond only by pressing keys on the telephone keypad. By using automatic speech recognition, the user says "Roxy Theater" rather than listening to a hierarchy of voice menus and pressing keys. Major problems with interactive ASR systems include:

Understandability: The ASR does not understand each user perfectly. When ASR fails, the user may spell the desired option verbally or press keys that correspond to the spelling of the option (i.e., "2" for A, B, or C; "3" for D, E, or F; and so on).
Time-consuming dialogs: The sequential verbalization of options is time-consuming. Barge-in is useful for experienced users who can remember the options and can say them before hearing the full set of options. Mixed initiative dialogs enable the user to ask questions, as well as respond to the system's questions, to home in on the desired information faster.
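
The keypad-spelling fallback mentioned above might look like the following Python sketch. It assumes the standard telephone keypad, on which letters begin on the "2" key; the function names to_keys and match_options are hypothetical and not taken from any product.

```python
# Standard telephone keypad letter groups (an assumption of this sketch).
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
LETTER_TO_KEY = {
    letter: key for key, letters in KEYPAD.items() for letter in letters
}


def to_keys(text):
    """Convert text to the key presses that spell it, skipping other characters."""
    return "".join(
        LETTER_TO_KEY[ch] for ch in text.upper() if ch in LETTER_TO_KEY
    )


def match_options(keys, options):
    """Return the options whose keypad spelling starts with the pressed keys."""
    return [option for option in options if to_keys(option).startswith(keys)]


options = ["Roxy Theater", "Rose Cinema"]
print(to_keys("Roxy"))                 # 7699
print(match_options("76", options))    # both options still match
print(match_options("7699", options))  # ['Roxy Theater']
```

Because several letters share each key, a short key sequence can match more than one option; the system would then need more keys or a confirmation prompt.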

How Voice Portals Work: User Perspective
Voice portals are Web sites that support ASR, as well as IVR, and enable user interaction with multiple services. The user dials a number that connects a telephone to the voice portal. After listening to a voice greeting (and possibly an advertisement depending upon the business model of the voice portal), the user selects a service. Example services include:

Accessing self-service queries and transactions, which includes a corporate "front desk" asking callers who or what they want, automated telephone ordering services, support desks, order tracking, airline arrival and departure information, cinema and theater booking services, and home banking services.
Accessing public information, which includes community information such as weather, traffic conditions, school closures, events and their locations, addresses and driving directions; local, national and international news, such as sports scores, movie reviews and horoscopes; Internet radio stations; national and international stock market information; yellow pages; and business and e-commerce transactions. Telephone calls also may be routed to human operators at specific businesses.
Accessing personal information, such as e-mail, calendars, address and telephone lists, to-do lists, shopping lists and calorie counters.
Helping the user communicate with other people, such as initiating telephone calls via voice dialing and sending and receiving voice-mail messages.

The user interacts with the service by speaking commands or pressing keys on the telephone keypad. The service responds with prerecorded speech or text-to-speech synthesis.

The user also can configure personal settings by using either the telephone or a PC-based visual user interface. For example, the user might specify parameters, such as:

Initial service upon connection - much like specification of the home page on a traditional Web browser;
Novice or experienced - terseness or verboseness of the prompt and help messages presented to the user;
Synthesized voice parameters - gender, age, and speaking style;
Timeout period - length of time the system listens for a response before offering help;
Password and security management - password and security policies, changes, and alterations;
Speaker profile - voice characteristics of the user that enable improved ASR performance. The speaker profile is created when the user "trains" the system to recognize the user's personal voice patterns.
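
As a rough illustration, the personal settings listed above could be held in a small record such as the one below. The class name, field names and default values are all hypothetical; an actual voice portal would define its own profile format.

```python
from dataclasses import dataclass


@dataclass
class PortalSettings:
    """Hypothetical per-user settings mirroring the parameters listed above."""
    initial_service: str = "main menu"     # like a browser's home page
    experienced: bool = False              # terse prompts when True
    voice_gender: str = "female"           # synthesized voice parameters
    voice_age: str = "adult"
    speaking_style: str = "neutral"
    timeout_seconds: float = 5.0           # wait before offering help
    speaker_profile_trained: bool = False  # set once the user trains the ASR


def prompt_for(settings, terse, verbose):
    """Pick the terse or the verbose wording based on the user's experience level."""
    return terse if settings.experienced else verbose


novice = PortalSettings()
expert = PortalSettings(experienced=True, timeout_seconds=2.0)
print(prompt_for(novice, "City?", "Please say the name of the city."))
print(prompt_for(expert, "City?", "Please say the name of the city."))
```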

How Voice Portals Work: System Perspective

Figure 1 illustrates a voice portal's modules. The data manager stores and retrieves each user's personal data, including user profile, calendar, e-mail and to-do lists. The script manager stores and retrieves dialog scripts for execution by the script execution engine. The script execution engine interprets dialog scripts, such as the VoiceXML script illustrated in Figure 2. The engine also controls the exchange of data between the user and the voice portal and switches among dialog scripts whenever the user switches among services. The script execution engine uses several input and output modules for presenting information to the user and obtaining information from the user.
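
The role of the script execution engine can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the architecture of any particular portal: the ScriptEngine class, the step format ("prompt", "field", "choices") and the listen and speak callables are hypothetical stand-ins for the script manager and for the input and output modules.

```python
class ScriptEngine:
    """Toy script execution engine: interprets dialog scripts step by step."""

    def __init__(self, scripts, listen, speak):
        self.scripts = scripts  # stands in for the script manager
        self.listen = listen    # stands in for the input modules
        self.speak = speak      # stands in for the output modules

    def run(self, service):
        """Run a service's script, switching scripts when the user picks another."""
        answers = {}
        while service is not None:
            next_service = None
            for step in self.scripts[service]:
                self.speak(step["prompt"])
                if "field" in step:
                    # Collect one piece of information from the user.
                    answers[step["field"]] = self.listen()
                elif "choices" in step:
                    # Switch to the dialog script the user asked for, if any.
                    next_service = step["choices"].get(self.listen())
                    break
            service = next_service
        return answers


scripts = {
    "main": [{"prompt": "Which service?", "choices": {"weather": "weather"}}],
    "weather": [{"prompt": "Which city?", "field": "city"}],
}
user_input = iter(["weather", "Portland"])  # scripted caller responses
spoken = []
engine = ScriptEngine(scripts, listen=lambda: next(user_input), speak=spoken.append)
print(engine.run("main"))  # {'city': 'Portland'}
print(spoken)              # ['Which service?', 'Which city?']
```

Passing the input and output modules in as plain callables keeps the engine independent of whether a response arrives as speech or as key presses.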

Input modules include an audio recorder for recording user verbalizations, a DTMF interpreter for decoding sequences of telephone keypad presses and an ASR for interpreting the user's words. Output modules include audio replay for playing recorded voice or music and TTS for converting text to synthesized speech for the user to hear.

Services may be classified into either or both of two categories:

Services paid for by the user, including "sticky" services to encourage the user to return to the voice portal frequently. Sticky services might include e-mail, personal information services and Web browsing. Sticky services generate revenue for the portal provider through customer subscription or transaction fees.
Services that generate revenue for the portal providers from non-customer sources, including short, six-second commercials (equivalent to banner advertisements) targeted at specific users and revenue-sharing with e-commerce businesses.

All of the voice portal modules illustrated in Figure 1 reside on the server. Any device containing a telephone can be a client. While telephones and cell phones are readily available, the verbal interface to the Web is quite restrictive. Some liken the experience of accessing Web data via a verbal interface to drinking water from a lake through a straw. Other voice-centric clients will emerge to support graphical user interfaces, including small screens on WAP-compatible cell phones, medium-size screens on handheld devices and large screens on desktop personal computers.

Authoring Web Services
Developers create dialog scripts - voice programs - to specify the sequence of information exchanged between the user and a service. Dialog scripts may be written in any of several languages, including VoxML (Motorola, www.voxml.com), OmniView XML (Indicast, www.indicast.com), VA-XML (Conita Technologies, www.conita.com) and VoiceXML (AT&T, IBM, Lucent and Motorola, www.voicesml.com). VoiceXML, at least initially, seems to be the most widely used voice dialog markup language, since it has been adopted by the W3C as a model for the dialog language within the W3C's Speech Interface Framework, which will include integrated markup languages for dialog, grammar, speech synthesis, natural language semantics and multimodal dialogs. Figure 2 illustrates a VoiceXML script that solicits the user's e-mail address and telephone number, and then invites the user to select another script depending upon where the user wants to fly. Figure 3 illustrates a typical dialog showing the prompts produced by the service and the user's verbal responses.

Voice services are basically sequential: The user listens to a sequence of options and then makes a selection. Each voice menu usually consists of no more than seven options, which is approximately the number of information pieces that people can store in their short-term memories and save for later processing. On the other hand, visual content is usually splashed on the screen as a large menu of headlines and icons from which the user selects to view in greater detail. There is no need to keep the number of choices to seven or fewer because the screen itself acts as an extension of the human memory.
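
One way to respect the seven-option guideline when a long visual menu is carried over to voice is to split it into several spoken menus. The Python sketch below is only one possible approach; the paginate function and the "more options" wording are hypothetical.

```python
def paginate(options, limit=7):
    """Split options into voice menus of at most `limit` spoken choices each.

    When further options remain, the last slot of a menu is reserved for a
    "more options" choice that leads to the next menu.
    """
    if limit < 2:
        raise ValueError("limit must be at least 2")
    pages = []
    remaining = list(options)
    while remaining:
        if len(remaining) <= limit:
            pages.append(remaining)
            break
        pages.append(remaining[:limit - 1] + ["more options"])
        remaining = remaining[limit - 1:]
    return pages


headlines = ["headline %d" % i for i in range(1, 16)]  # 15 visual choices
for page in paginate(headlines):
    print(len(page), page)
```

With 15 headlines this yields three menus of 7, 7 and 3 spoken choices, so no single menu exceeds the limit.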

The presentation of graphical and voice content is inherently different. Figure 4 illustrates the HTML for the simple screen illustrated in Figure 5. The size, font and color attributes have all been eliminated to improve readability. The corresponding VoiceXML is illustrated in Figure 2, which was used to produce the voice dialog shown in Figure 3. Note that the two descriptions are quite different, even though they involve nearly identical content.

How can developers support both forms of content delivery - graphical and voice - to the user? There are four general approaches: dual-script authoring, script translation, single-script authoring and multiple views of the same content.

Dual-script authoring. The developer creates a visual description of the content using a graphical markup language, such as HTML, and a separate verbal description of the content using a voice markup language, such as VoiceXML. However, when a change is made to one description, the content provider also must make the corresponding change to the other description. Keeping the two descriptions in sync can become a burdensome task.

Script translation. A developer creates a single description of the content to be rendered by a GUI browser. Then, a translator (or interpreter) performs the following tasks:

Extracts data from the HTML description to be represented in the voice description;
Applies hints and cues for data missing in the HTML, such as captions for pictures. This missing data can be embedded into the HTML by the GUI authors or added manually before or during the translation;
Determines the sequence of data presentation;
Generates appropriate timeout parameters and help messages;
Generates grammars to represent the synonyms that might be spoken by the user in response to a voice prompt;
Reformats the data into the format required by the voice browser.

Then, a voice browser can render a voice dialog. Automatic and semi-automatic translators would enable voice portals to access existing Web sites written in HTML - a major way to reuse existing content. Currently, several developers are creating these translators. For example, Vocal Point (www.vocalpoint.com) and Internet Speech (www.internetspeech.com) have technology that enables callers to browse existing HTML Web sites and avoids expensive re-coding of existing content.
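
A heavily simplified version of the translation steps listed above can be sketched with Python's standard html.parser module. This is not how any of the products named here work; the LinkExtractor class, the html_to_voice_menu function and the synonyms table are hypothetical. The sketch extracts link labels, uses image alt text as the caption hint, keeps document order as the presentation sequence and builds a small grammar of spoken phrases.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (label, href) pairs in document order from HTML links."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the link being read, if any
        self._text = []    # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self._href = attrs["href"]
            self._text = []
        elif tag == "img" and self._href is not None and attrs.get("alt"):
            # Hint for data missing from the visible text: the picture caption.
            self._text.append(attrs["alt"])

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = " ".join(" ".join(self._text).split())
            if label:
                self.links.append((label, self._href))
            self._href = None


def html_to_voice_menu(html, synonyms=None):
    """Return a spoken prompt and a grammar mapping phrases to link targets."""
    synonyms = synonyms or {}
    parser = LinkExtractor()
    parser.feed(html)
    prompt = "Say one of: " + ", ".join(label for label, _ in parser.links) + "."
    grammar = {}
    for label, href in parser.links:
        for phrase in [label] + synonyms.get(label, []):
            grammar[phrase.lower()] = href
    return prompt, grammar


page = '<p><a href="/weather">Weather</a> <a href="/news"><img src="n.gif" alt="News"></a></p>'
prompt, grammar = html_to_voice_menu(page, synonyms={"Weather": ["forecast"]})
print(prompt)   # Say one of: Weather, News.
print(grammar)  # {'weather': '/weather', 'forecast': '/weather', 'news': '/news'}
```

The sketch leaves out timeout parameters, help messages and the final reformatting into a voice markup language, all of which a real translator would also need to generate.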

Single-script authoring. A developer creates a single description of the content to be rendered by both a visual browser and a voice browser. A single language, supporting both graphical and voice descriptions, would be quite complex. It would need to describe the graphical aspects - font size, color and type - as well as the placement of each item on the display screen. Also, the language needs to describe the sequential nature of the voice interface and the appropriate timeout parameters, help messages and grammars representing synonyms of words selected by the user in the graphical user interface. Designing a scripting language that can describe both GUI and voice dialogs and is also easy to use is an active research area.

Multiple views of the same content. The basic information content is represented once and is mapped to multiple views. The mapping to the GUI view includes the layout, format and color information. The mapping to the voice view includes the sequence, pronunciation and verbal emphasis information. Minor changes to the basic information content can be made without modifying the mappings. However, insertion, deletion or major changes to the information content may require changes to one or more of the mappings. Mapping information content into both visual and audio presentation formats is also an area for research.
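
The idea of representing content once and mapping it to multiple views can be illustrated with a short Python sketch. Everything here is hypothetical: the CONTENT list, the two mapping tables and the use of upper case as a stand-in for verbal emphasis are assumptions made for the example.

```python
# Basic information content, represented once.
CONTENT = [
    {"id": "weather", "title": "Weather"},
    {"id": "traffic", "title": "Traffic conditions"},
]

# Mapping to the GUI view: layout, format and color information.
GUI_VIEW = {"weather": {"color": "blue"}, "traffic": {"color": "red"}}

# Mapping to the voice view: sequence and verbal emphasis information.
VOICE_VIEW = {"order": ["traffic", "weather"], "emphasis": {"traffic"}}


def render_gui(content, view):
    """Render the content as HTML links, one per line, in content order."""
    lines = []
    for item in content:
        color = view[item["id"]]["color"]
        lines.append(
            '<a href="/%s" style="color:%s">%s</a>'
            % (item["id"], color, item["title"])
        )
    return "\n".join(lines)


def render_voice(content, view):
    """Render the content as a spoken prompt in the voice view's own order."""
    by_id = {item["id"]: item for item in content}
    parts = []
    for item_id in view["order"]:
        title = by_id[item_id]["title"]
        # Upper case stands in for verbal emphasis in this sketch.
        parts.append(title.upper() if item_id in view["emphasis"] else title)
    return "Choose from: " + ", ".join(parts) + "."


print(render_gui(CONTENT, GUI_VIEW))
print(render_voice(CONTENT, VOICE_VIEW))  # Choose from: TRAFFIC CONDITIONS, Weather.
```

Rewording a title in CONTENT changes both renderings without touching either mapping, whereas adding or deleting an item would require updating GUI_VIEW and VOICE_VIEW as well.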

Some sources for technologies and infrastructures for building voice-enabled Web sites
Airtrac's Everywhere Office lets employees use the telephone to connect to their offices (voice-driven scheduling, e-mail access and dictation) and to the Web. www.airtrac.net

BeVocal has extended this philosophy to reusable applications called VocalSuites. For example, BeVocal's VocalLocator application provides callers with information about nearby businesses, as well as directions to the locations. The VocalLocator solution directs people to hotels, restaurant chains, retailers, banks and other businesses that have a large number of local, regional or national locations. Available VocalSuites applications include VocalConcierge, VocalDelivery, VocalFinance, VocalLocator, VocalNews, VocalTraffic, VocalTravel and VocalWeather. BeVocal aims to provide consumers with a wide range of voice-enabled applications via a single, toll-free number accessible by any telephone, anywhere. www.bevocal.com

Conita Technologies' Personal Virtual Assistant enables callers to access enterprise data via the telephone. PVA is integrated with Microsoft's Outlook/Exchange, which enables users to access their voice mail, e-mail, calendar, contact list data and task list from any telephone. PVA can be programmed to provide almost any information accessible via the Internet. www.conita.com

NetByTel Connected provides a proprietary platform and a network center so users connect to Web sites quickly and perform e-business tasks. Functions available to inbound callers include order purchase, order status, lead capture, contest entry, information access and customer service. Outbound call functions include auction notification, market research, reminder and direct response. www.NetByTel.com

Nuance provides a suite of developer tools including V-builder, which is a graphics tool that lets developers easily create voice interfaces for Web sites or any application using VoiceXML and SpeechObjects. SpeechObjects is a set of open, reusable components that encapsulates the best practices of voice interface design. Developers use SpeechObjects to reduce the time required to build high quality speech recognition applications, as well as provide consistent voice user interfaces that access multiple applications. www.nuance.com

OneVoice Technologies' VoiceSite enables Webmasters to map Web sites and create interactive scripts for each page of the site. When using IVAN, OneVoice's Intelligent Voice Animated Navigator, the scripts are invoked so the user hears personal greetings such as yes/no questions, FAQs or statements of general information. VoiceSite also voice enables all the hotlinks on the site or page, so an IVAN user simply states a hotlink and navigates the site. www.onevoicetech.com

SpeechWorks' SpeechSite is a packaged application solution that includes a built-in 5,000-name auto attendant with a personality that the user defines. SpeechSite directs calls, delivers company information, provides fax-back services and supports transactions. SpeechWorks also supplies a set of reusable dialogs. www.speechworks.com

Talkingweb voice browser connects to the desired Web page and plays the content back to the user. The pages must be registered previously with Talkingweb, for reasons similar to registering a URL address on the Internet. The browser supports various voice commands, including go back, go forward, hang up and go to the top. Hobbyists can create a Talkingweb site at home for free (a bit of time is required) and host it where they host their conventional Web pages. www.talkingweb.com

The VoiceXML Forum (AT&T, IBM, Lucent and Motorola) has published Version 2.0 of the VoiceXML markup language. VoiceXML Forum members are developing tools for the creation and use of this language. The VoiceXML markup language has been accepted as the model dialog markup language to be used within the W3C Speech Interface Framework by the W3C Voice Browser