May/June 2001

Cover Story: The Internet's Information Highway Runs Right Through Voice Portals

By James A. Larson

Access to the Internet's wealth of information is becoming more than just a convenient tool - it is becoming a necessity. Users are demanding more ways to go online than a PC and modem alone can offer. Mobile connectivity has become key to productive communications.

To answer the demand for uninterrupted access to the Internet, phones have become popular Internet terminals, connecting through voice portals to access a large number and variety of information services. Developers use speech-oriented markup languages such as VoiceXML to create voice portals. Designers now are facing the challenge of coordinating content designed for GUI browsers, such as that written in HTML, with content designed for voice browsers.

Motivation
The Internet has dramatically changed the way people access information and conduct business. The number of Web sites is growing geometrically. Most major retailers have Web sites to complement their traditional brick-and-mortar sites. "E-tailers" have Web sites containing catalogs of products and services for sale. Many users now have their own Web sites, on which they publish a variety of information.

Web sites are accessed from PCs connected to the Internet, using browsers such as Microsoft's Internet Explorer and Netscape's Navigator. New devices and appliances entering the market also provide access to Internet information; WebTV, for example, lets users browse the Web using their televisions. All of these interfaces require displays for rendering Web content to the user.

Telephones can be used to access the Internet by speaking and listening, a very natural user interface that relies on skills every user learned during childhood. About 200 million PCs can be connected to the Internet, but more than 2 billion people have access to telephones. Mobile devices - cell phones, handhelds and laptop computers - are portable. They can be used anywhere - at home, at work and on the road. These devices provide the user with time-sensitive data - data when the user needs it. An example might be a late-breaking news item about a company in which the user owns stock. Mobile devices provide the user with location-sensitive data as well, such as changing traffic conditions.

Handsets and their close relative, the headset - consisting of a microphone and speaker worn on the user's head - present a pragmatic interface to people who have visual impairments or who need Web access while keeping their hands and eyes free for other tasks. Asian users whose languages do not lend themselves to entry with traditional QWERTY keyboards will also benefit from the voice interface enabled by handsets or headsets.

IVR versus ASR
Two major technologies have emerged to capture user input in telephone applications: interactive voice response, or IVR, and automatic speech recognition, or ASR. In an interactive voice response system, a user responds to hierarchies of voice menus by pressing buttons on the telephone keypad. Major problems with IVR systems include:

Improvements in algorithms and increased processor speed have enabled ASR to recognize what the user says rather than forcing the user to respond only by pressing keys on the telephone keypad. By using automatic speech recognition, the user says "Roxy Theater" rather than listening to a hierarchy of voice menus and pressing keys. Major problems with interactive ASR systems include:

How Voice Portals Work: User Perspective
Voice portals are Web sites that support ASR, as well as IVR, and enable user interaction with multiple services. The user dials a number that connects a telephone to the voice portal. After listening to a voice greeting (and possibly an advertisement depending upon the business model of the voice portal), the user selects a service. Example services include:

The user interacts with the service by speaking commands or pressing keys on the telephone keypad. The service responds with prerecorded speech or text-to-speech synthesis.

The user also can configure personal settings by using either the telephone or a PC-based visual user interface. For example, the user might specify parameters, such as:

How Voice Portals Work: System Perspective


Figure 1 illustrates a voice portal's modules. The data manager stores and retrieves each user's personal data, including user profile, calendar, e-mail and to-do lists. The script manager stores and retrieves dialog scripts for execution by the script execution engine. The script execution engine interprets dialog scripts, such as the VoiceXML script illustrated in Figure 2. The engine also controls the exchange of data between the user and the voice portal and switches among dialog scripts whenever the user switches among services. The script execution engine uses several input and output modules for presenting information to the user and obtaining information from the user.
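As a rough illustration of how the script manager and the script execution engine might cooperate, the following Python sketch models each dialog script as a plain function. The class names, the method names and the rule that saying a service's name switches scripts are all assumptions made for illustration; an actual voice portal would interpret a markup language such as VoiceXML and route input through its ASR and DTMF modules.

```python
from typing import Callable, Dict

# Assumption: a dialog script is modeled as a function that takes one
# user utterance and returns the prompt to play back.
DialogScript = Callable[[str], str]


class ScriptManager:
    """Stores and retrieves dialog scripts by service name (hypothetical API)."""

    def __init__(self) -> None:
        self._scripts: Dict[str, DialogScript] = {}

    def store(self, service: str, script: DialogScript) -> None:
        self._scripts[service.lower()] = script

    def has(self, service: str) -> bool:
        return service.lower() in self._scripts

    def retrieve(self, service: str) -> DialogScript:
        return self._scripts[service.lower()]


class ScriptExecutionEngine:
    """Runs the current script and switches scripts when a service is named."""

    def __init__(self, manager: ScriptManager, start_service: str) -> None:
        self.manager = manager
        self.current = start_service.lower()

    def handle(self, utterance: str) -> str:
        word = utterance.strip().lower()
        # Assumed switching rule: saying a service name changes services.
        if self.manager.has(word) and word != self.current:
            self.current = word
            return f"Switching to {word}."
        # Otherwise the utterance goes to the script for the current service.
        return self.manager.retrieve(self.current)(utterance)


manager = ScriptManager()
manager.store("weather", lambda u: "Weather heard: " + u)
manager.store("news", lambda u: "News heard: " + u)
engine = ScriptExecutionEngine(manager, "weather")
```

In this sketch, a caller who says "Boston" is answered by the weather script, and a caller who then says "news" is handed to the news script for the rest of the call.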

Input modules include an audio recorder for recording user verbalizations, a DTMF interpreter for decoding sequences of telephone keypad presses and an ASR for interpreting the user's words. Output modules include an audio player for replaying recorded voice or music and a text-to-speech (TTS) synthesizer for converting text to synthesized speech for the user to hear. Services may be classified into either or both of two categories:

All of the voice portal modules illustrated in Figure 1 reside on the server. Any device containing a telephone can be a client. While telephones and cell phones are readily available, the verbal interface to the Web is quite restrictive. Some liken the experience of accessing Web data via a verbal interface to drinking water from a lake through a straw. Other voice-centric clients will emerge to support graphical user interfaces, including small screens on WAP-compatible cell phones, medium-size screens on handheld devices and large screens on desktop personal computers.

Authoring Web Services
Developers create dialog scripts - voice programs - to specify the sequence of information exchanged between the user and a service. Dialog scripts may be written in any of several languages, including VoxML (Motorola, www.voxml.com), OmniView XML (Indicast, www.indicast.com), VA-XML (Conita Technologies, www.conita.com) and VoiceXML (AT&T, IBM, Lucent and Motorola, www.voicexml.com). VoiceXML, at least initially, seems to be the most widely used voice dialog markup language. The W3C has adopted it as a model for the dialog language within its Speech Interface Framework, which will include integrated markup languages for dialog, grammar, speech synthesis, natural language semantics and multimodal dialogs. Figure 2 illustrates a VoiceXML script that solicits the user's e-mail address and telephone number, and then invites the user to select another script depending upon where the user wants to fly. Figure 3 illustrates a typical dialog showing the prompts produced by the service and the user's verbal responses.


Voice services are basically sequential: The user listens to a sequence of options and then makes a selection. Each voice menu usually consists of no more than seven options, which is approximately the number of information pieces that people can store in their short-term memories and save for later processing. Visual content, on the other hand, is usually splashed on the screen as a large menu of headlines and icons from which the user selects items to view in greater detail. There is no need to keep the number of choices to seven or fewer because the screen itself acts as an extension of the human memory.
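One way a service might honor the seven-option guideline is to split a long list of choices into a chain of shorter voice menus. The Python sketch below does this under stated assumptions: the function names, the "more options" label and the prompt wording are hypothetical, and the limit of seven is simply the rule of thumb quoted above.

```python
from typing import List

MAX_OPTIONS = 7  # rule of thumb for the size of one voice menu


def paginate_menu(options: List[str],
                  max_options: int = MAX_OPTIONS,
                  more_label: str = "more options") -> List[List[str]]:
    """Split options into menus of at most max_options entries.

    Every menu except the last gives up its final slot to more_label,
    which leads the caller to the next menu.
    """
    if max_options < 2:
        raise ValueError("max_options must be at least 2")
    pages: List[List[str]] = []
    remaining = list(options)
    while len(remaining) > max_options:
        pages.append(remaining[:max_options - 1] + [more_label])
        remaining = remaining[max_options - 1:]
    pages.append(remaining)
    return pages


def render_prompt(page: List[str]) -> str:
    """Build the text a TTS module would speak for one menu."""
    return "Please say one of the following: " + "; ".join(page) + "."
```

With ten services, for example, the first menu offers six services plus "more options," and the second menu offers the remaining four.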

The presentation of graphical and voice content is inherently different. Figure 4 illustrates the HTML for the simple screen illustrated in Figure 5. The size, font and color attributes have all been eliminated to improve readability. The corresponding VoiceXML is illustrated in Figure 2, which was used to produce the voice dialog shown in Figure 3. Note that the two descriptions are quite different, even though they involve nearly identical content.


How can developers support both forms of content delivery - graphical and voice - to the user? There are four general approaches: dual-script authoring, script translation, single-script authoring and multiple views of the same content.

Dual-script authoring. The developer creates a visual description of the content using a graphical markup language, such as HTML, and a separate verbal description of the content using a voice markup language, such as VoiceXML. However, when a change is made to one description, the content provider also must make the corresponding change to the other description. Keeping the two descriptions in sync can become a burdensome task.


Script translation. A developer creates a single description of the content to be rendered by a GUI browser. Then, a translator (or interpreter) performs the following tasks:


Then, a voice browser can render a voice dialog. Automatic and semi-automatic translators would enable voice portals to access existing Web sites written in HTML - a major way to reuse existing content. Currently, several developers are creating these translators. For example, Vocal Point (www.vocalpoint.com) and Internet Speech (www.internetspeech.com) have technology that enables callers to browse existing HTML Web sites and avoids expensive re-coding of existing content.
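To give a feel for what an automatic translator might do, the Python sketch below pulls the hyperlinks out of an HTML page and turns them into a spoken, numbered menu. This is only one assumed translation step, not a description of any vendor's product; the class and function names are hypothetical, and the parsing relies on the standard library's html.parser module.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class LinkExtractor(HTMLParser):
    """Collects (label, href) pairs for each <a href="..."> element."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = " ".join("".join(self._text).split())
            if label:
                self.links.append((label, self._href))
            self._href = None


def html_to_voice_menu(html: str) -> str:
    """Return the prompt text for a numbered menu built from the page's links."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(
        f"For {label}, press {number}."
        for number, (label, _href) in enumerate(parser.links, start=1)
    )
```

Visual formatting inside a link, such as bold text, is discarded, which mirrors the way a voice rendering has no use for font or color information.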

Single-script authoring. A developer creates a single description of the content to be rendered by both a visual browser and a voice browser. A single language, supporting both graphical and voice descriptions, would be quite complex. It would need to describe the graphical aspects - font size, color and type - as well as the placement of each item on the display screen. The language also would need to describe the sequential nature of the voice interface and the appropriate timeout parameters, help messages and grammars representing synonyms of words selected by the user in the graphical user interface. Designing a scripting language that can describe both GUI and voice dialogs and is also easy to use is an active research area.

Multiple views of the same content. The basic information content is represented once and is mapped to multiple views. The mapping to the GUI view includes the layout, format and color information. The mapping to the voice view includes the sequence, pronunciation and verbal emphasis information. Minor changes to the basic information content can be made without modifying the mappings. However, insertion, deletion or major changes to the information content may require changes to one or more of the mappings. Mapping information content into both visual and audio presentation formats is also an area for research.
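The idea can be sketched in a few lines of Python, assuming the basic content is nothing more than a title and a list of items. The two mapping functions and the "#" dialog anchors are hypothetical; the voice view emits a VoiceXML-style menu fragment using the menu, prompt, enumerate and choice elements, and it is meant as an illustration rather than a complete, validated document.

```python
from html import escape
from typing import List


def to_gui_view(title: str, items: List[str]) -> str:
    """Map the content to an HTML fragment (layout only, no fonts or colors)."""
    lines = [f"<h1>{escape(title)}</h1>", "<ul>"]
    lines += [f"  <li>{escape(item)}</li>" for item in items]
    lines.append("</ul>")
    return "\n".join(lines)


def to_voice_view(title: str, items: List[str]) -> str:
    """Map the same content to a VoiceXML-style menu fragment."""
    lines = [
        "<menu>",
        f"  <prompt>{escape(title)}. Say one of: <enumerate/></prompt>",
    ]
    for item in items:
        # Hypothetical anchor naming the dialog that handles this choice.
        target = "#" + item.lower().replace(" ", "-")
        lines.append(f'  <choice next="{escape(target)}">{escape(item)}</choice>')
    lines.append("</menu>")
    return "\n".join(lines)


# The basic information content is represented once...
title = "Services"
items = ["Weather", "News", "Stock Quotes"]

# ...and mapped to two views.
gui_view = to_gui_view(title, items)
voice_view = to_voice_view(title, items)
```

Changing an item's wording touches only the shared content, while adding a new kind of content, such as a price table, would require extending both mapping functions.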

Some sources for technologies and infrastructures for building voice-enabled Web sites

Airtrac's Everywhere Office lets employees use the telephone to connect to their offices (voice-driven scheduling, e-mail access and dictation) and to the Web. www.airtrac.net

BeVocal has extended this philosophy to reusable applications called VocalSuites. For example, BeVocal's VocalLocator application provides callers with information about nearby businesses, as well as directions to the locations. The VocalLocator solution directs people to hotels, restaurant chains, retailers, banks and other businesses that have a large number of local, regional or national locations. Available VocalSuites applications include VocalConcierge, VocalDelivery, VocalFinance, VocalLocator, VocalNews, VocalTraffic, VocalTravel and VocalWeather. BeVocal aims to provide consumers with a wide range of voice-enabled applications via a single, toll-free number accessible by any telephone, anywhere. www.bevocal.com

Conita Technologies' Personal Virtual Assistant enables callers to access enterprise data via the telephone. PVA is integrated with Microsoft's Outlook/Exchange, which enables users to access their voice mail, e-mail, calendar, contact list data and task list from any telephone. PVA can be programmed to provide almost any information accessible via the Internet. www.conita.com

NetByTel Connected provides a proprietary platform and a network center so users connect to Web sites quickly and perform e-business tasks. Functions available to inbound callers include order purchase, order status, lead capture, contest entry, information access and customer service. Outbound call functions include auction notification, market research, reminder and direct response. www.NetByTel.com

Nuance provides a suite of developer tools including V-builder, a graphical tool that lets developers easily create voice interfaces for Web sites or any application using VoiceXML and SpeechObjects. SpeechObjects is a set of open, reusable components that encapsulate the best practices of voice interface design. Developers use SpeechObjects to reduce the time required to build high-quality speech recognition applications, as well as to provide consistent voice user interfaces that access multiple applications. www.nuance.com

OneVoice Technologies' VoiceSite enables Webmasters to map Web sites and create interactive scripts for each page of the site. When using IVAN, OneVoice's Intelligent Voice Animated Navigator, the scripts are invoked so the user hears personal greetings such as yes/no questions, FAQs or statements of general information. VoiceSite also voice-enables all the hotlinks on the site or page, so an IVAN user simply states a hotlink and navigates the site. www.onevoicetech.com

SpeechWorks' SpeechSite is a packaged application solution that includes a built-in 5,000-name auto attendant with a personality that the user defines. SpeechSite directs calls, delivers company information, provides fax-back services and supports transactions. SpeechWorks also supplies a set of reusable dialogs. www.speechworks.com

The Talkingweb voice browser connects to the desired Web page and plays the content back to the user. The pages must be registered previously with Talkingweb, for reasons similar to registering a URL address on the Internet. The browser supports various voice commands, including go back, go forward, hang up and go to the top. Hobbyists can create a Talkingweb site at home for free (a bit of time is required) and host it where they host their conventional Web pages. www.talkingweb.com

The VoiceXML Forum (AT&T, IBM, Lucent and Motorola) has published Version 2.0 of the VoiceXML markup language. VoiceXML Forum members are developing tools for the creation and use of this language. The VoiceXML markup language has been accepted as the model dialog markup language to be used within the W3C Speech Interface Framework by the W3C Voice Browser