April/May 1999

APPLICATION DEVELOPMENT
Parts of Speech: Building Dialogs

Computer-Driven Conversational Dialogs Present Challenges for Speech Application Designers

By James Larson

Many speech developers spent long hours trying to eliminate the "talking robot" phenomenon that experts once thought was the major logjam for the acceptance of speech-centric applications. As a result, speech applications now "sound" right. Today’s text-to-speech applications are sometimes mistaken for human beings, and speech recognition applications no longer require that users speak "discretely" but instead allow users to speak "continuously" without pausing between individual words.

While that success has certainly cleared a major hurdle toward the acceptance of speech, true interactive dialogs, as when two people converse over coffee, or more importantly, as when two people discuss a purchase, have been difficult for computers to duplicate. In retrospect, it is probably not a surprise to find that speech is easier to attain than conversation.

Conversational speech occurs when two participants, either two people or a person and a computer, take turns speaking with each other. Examples of conversational speech between a person and a computer include:

Human-driven conversational dialogs, in which the person repeatedly asks a question or speaks a command and the computer responds. In command and control systems, the user repeatedly instructs the computer to perform a task. The computer performs the instruction and informs the user of the outcome.
Computer-driven conversational dialogs, where the computer repeatedly asks questions to solicit answers and instructions from a user.
Mixed initiative dialogs, where human-driven and computer-driven dialogs are combined. The user and computer take turns driving the conversations.

Many new speech applications are being developed using computer-driven dialogs. Computer-driven conversational dialogs are especially popular in telephony applications, where the user is asked a sequence of questions, each requiring that the user respond with a simple word or phrase. This type of dialog has two technical benefits. First, it forces the user to concentrate on answering a specific question rather than leaving the user to guess how to inform the system of his/her request. Second, it enables the speech recognition system to recognize, at each point in the dialog, a small number of words and phrases, rather than all words in a large vocabulary. A third benefit results from recent improvements in speech recognition: spoken answers can now replace DTMF (touchtone) input and dialogs in the form of "for accounting, press one; for customer service, press two,…"

Computer-driven conversational dialogs present some interesting challenges for the speech application designer, including how to phrase a question so it can be easily and quickly answered with a word or short phrase, and how to assist the user in determining an acceptable answer to each question asked.

Dialog Problems

"Would you like to buy it?" is a simple question that users should be able to answer. But in applications using speech recognition systems, obtaining an answer to a question like this can be difficult. Here are some problems an application developer may encounter:

User barges in. The user responds before the system finishes asking a question. In effect, the system must listen while it speaks. As soon as the system detects an utterance from the user, the system should stop asking the question.
User does not respond. When the user fails to respond within a prescribed period of time, called the time-out period, the question should be repeated to encourage the user to respond.
User utters an unexpected response. Rather than answering with a simple "yes" or "no," the user may utter words such as "yeah," "of course," "sure," "nope," "negative," "help!," "what?," "I didn’t understand," or "huh?" One way of dealing with this problem is to specify each possible response and its interpretation. In effect, the developer must specify the grammar for all possible responses. For example, affirmative responses include "yes," "yeah," "of course," and "sure." Negative responses include "no" and "nope."
User asks for help. When the user responds with an inappropriate answer or asks for help, then rephrasing the question may clarify the possible confusion in the user’s mind. For example, "Would you like to buy the book, War and Peace, for $19.95 now?" is more specific than "Would you like to buy it?" and might solicit a more meaningful response from the user. Another way to encourage a user’s reaction is to suggest specific responses; for example, "Would you like to buy the book, War and Peace, for $19.95 now; please say ‘yes’ or ‘no.’"
Avoidance of hyperarticulation. When the system repeatedly fails to understand the user’s responses, the user may become annoyed and tense. Unfortunately, this changes the characteristics of the user’s speech. This in turn makes it more difficult for the system to recognize what the user is saying, causing even more frustration and additional change in the user’s speech pattern. The user may slowly and carefully pronounce each syllable, an effect called hyperarticulation. Ways of dealing with hyperarticulation and other changes to the user’s voice include:
1. Rephrase the question in such a manner that the user is likely to respond using a different vocabulary
2. Replace the question by a series of questions, each of which solicits a simple response that the user is more likely to be able to provide
3. Prompt the user with possible responses
4. As a last resort, ask the user to spell his or her response

To obtain the answer to a yes/no question, the application developer needs to specify a dialog that allows interrupts, accepts alternative phrases as responses, prompts users who do not respond, rephrases questions when users ask for help or enter inappropriate responses, and avoids hyperarticulation. User testing is required to fine-tune dialog components to find the length of the time-out period, capture the phrases used to respond to the question, refine the follow-up questions, and develop an appropriate strategy for dealing with hyperarticulation.
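The requirements above can be sketched in a few lines of Python. This is only an illustration, not the API of any speech product: `listen` is a hypothetical hook assumed to play a prompt (stopping playback if the user barges in) and to return the recognized text, or `None` when the time-out period expires. The grammar entries and prompts are taken from the examples in this article; `max_turns` is an assumed tuning parameter of the kind user testing would settle.

```python
from typing import Callable, Optional

# Grammar: each accepted phrase maps to its interpretation.
GRAMMAR = {
    "yes": "yes", "yeah": "yes", "of course": "yes", "sure": "yes",
    "no": "no", "nope": "no", "negative": "no",
    "help": "help", "what": "help", "huh": "help",
    "i didn't understand": "help",
}

# Prompts escalate from the short question to one that suggests answers.
PROMPTS = [
    "Would you like to buy it?",
    "Would you like to buy the book, War and Peace, for $19.95 now?",
    "Would you like to buy the book, War and Peace, for $19.95 now; "
    "please say 'yes' or 'no.'",
]


def normalize(utterance: str) -> str:
    """Lower-case and strip punctuation so that "Help!" matches "help"."""
    return utterance.lower().strip(" .!?")


def ask_yes_no(listen: Callable[[str], Optional[str]],
               max_turns: int = 5) -> Optional[bool]:
    """Ask until the user answers yes or no, or max_turns is used up.

    `listen(prompt)` is a hypothetical recognizer hook (see the text above).
    """
    level = 0
    for _ in range(max_turns):
        heard = listen(PROMPTS[level])
        if heard is None:
            continue  # time-out: repeat the same question
        meaning = GRAMMAR.get(normalize(heard))
        if meaning == "yes":
            return True
        if meaning == "no":
            return False
        # Help request or out-of-grammar response: rephrase more specifically.
        level = min(level + 1, len(PROMPTS) - 1)
    return None  # no usable answer; the caller decides what to do next
```

Returning `None` after `max_turns` leaves the last-resort strategy (spelling, or a hand-off to a person) to the calling application.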

Reusable Dialog Components

Fortunately, much of the dialog component for "Do you want to buy it?" can be reused in other yes/no questions. However, some specifications may change. For example, the initial question and its various re-phrasings will need to be changed, as well as some of the acceptable user responses. But most of the structure of the dialog component remains the same.

While it is tempting simply to copy and edit the code for the question, "Do you want to buy it?," a better approach is to modularize the code. This is done by isolating the initial question and its various re-phrasings, confirmations, and acceptable responses as parameters for a module that is invoked for every yes/no question in the application. This module is an example of a reusable dialog component.
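A minimal sketch of such a module follows, assuming a hypothetical `listen` hook that plays a prompt and returns the recognized text (or `None` on time-out). The names `YesNoQuestion` and `run_yes_no`, and the second "gift wrap" question, are invented for illustration. Only the wording and the question-specific responses are passed in as parameters; the dialog structure is written once.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

BASE_YES = {"yes", "yeah", "of course", "sure"}
BASE_NO = {"no", "nope"}


@dataclass
class YesNoQuestion:
    """Per-question parameters; the dialog logic lives in run_yes_no."""
    prompts: list[str]                                  # question, then re-phrasings
    extra_yes: set[str] = field(default_factory=set)    # question-specific "yes"
    extra_no: set[str] = field(default_factory=set)     # question-specific "no"


def run_yes_no(q: YesNoQuestion,
               listen: Callable[[str], Optional[str]],
               max_turns: int = 4) -> Optional[bool]:
    """Shared dialog structure, invoked for every yes/no question."""
    yes, no = BASE_YES | q.extra_yes, BASE_NO | q.extra_no
    level = 0
    for _ in range(max_turns):
        heard = listen(q.prompts[level])
        if heard is None:
            continue  # no response: repeat the question
        phrase = heard.lower().strip(" .!?")
        if phrase in yes:
            return True
        if phrase in no:
            return False
        level = min(level + 1, len(q.prompts) - 1)  # rephrase
    return None


# Two questions reuse the same module with different parameters.
buy = YesNoQuestion(
    prompts=["Do you want to buy it?",
             "Would you like to buy the book, War and Peace, for $19.95 now?"],
    extra_yes={"buy it"},
)
gift_wrap = YesNoQuestion(  # hypothetical second question
    prompts=["Should it be gift wrapped?",
             "Please say 'yes' or 'no': should it be gift wrapped?"],
    extra_no={"no wrapping"},
)
```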

Controls, such as pull-down menus, radio buttons, and editing boxes, have made a tremendous impact on graphical user interfaces. They provide user interface consistency, incorporate proven user interface techniques, and improve ease of application development. Reusable dialog components provide the same benefits to voice-user interfaces, specifically:

Consistent user interfaces. Each yes/no question is handled consistently. Once familiar with the protocol for answering yes/no questions, the user can respond quickly and confidently to any yes/no question. This consistency enables the user to concentrate on answering the question rather than struggling with an unfamiliar user interface.
Proven user interface techniques. Useful protocols for dealing with non-responses, requests for help, error handling, and avoidance of hyperarticulation are encoded into reusable dialog components and are available when the user needs them.
Easy development. It is much easier for developers to specify parameters for a reusable dialog component than to develop the code from scratch. Developers save time and effort by incorporating reusable dialog components within their speech applications.

Just as several companies create and sell controls, speech vendors will create and sell reusable dialog components. For several years, SpeechWorks International (formerly named ALTech) has supported several reusable dialog components, called DialogModules™, that are a part of the SpeechWorks product. DialogModules™ are high-level building blocks that implement frequently used dialog components. Developers can quickly assemble and integrate DialogModules™ into conversational speech applications. Currently, the following SpeechWorks DialogModules™ are available:

Yes/No
ItemList
VoiceMenu
ContinuousDigits
AlphanumericString
TelephoneNumber
ZipCode
Spelling
CustomContext
Currency
Date

Recently, Nuance introduced SpeechObjects™, which are reusable dialog components consisting of three key elements: dialogs, grammars, and prompts. Dialogs describe the flow of interaction between a user and an application; grammars are lists of valid words and phrases the user might say in a dialog; prompts are messages presented to a user to elicit a response.

Some SpeechObjects™ are relatively simple, such as a Date SpeechObject. The Date SpeechObject recognizes when a person says a date in a variety of different formats. Other SpeechObjects™ are compound components that incorporate several SpeechObjects™, such as a Flight Information SpeechObject, which recognizes flight numbers and dates and provides users with the latest schedule of a flight.

Potential problems with reusable dialog components include:

Incompatibility among reusable dialog components from different vendors. The reusable dialog component may be tightly integrated with the speech vendor’s speech recognition engine. It may be impossible to "mix and match" or "select the best-of-breed" from the libraries of reusable dialog components from different vendors.
Inconsistent user interfaces. Two different reusable dialog components may use different protocols for non-responses, unexpected responses, requests for help, inappropriate responses, and avoidance of hyperarticulation. Different protocols for different types of questions within the same application will be confusing for users.

The latter problem can be overcome by using traditional object-oriented design and programming techniques.

Inheritance

Rather than have each and every reusable dialog component implement its own protocol for non-responses, unexpected responses, inappropriate responses, requests for help, and hyperarticulation, developers should factor out common protocol elements from the reusable dialog components and place them into a higher-level module. Also, individual reusable dialog components should be designed to inherit common protocols from their common ancestors. These types of reusable dialog components have the following advantages:

Consistent dialog component protocols across all types of questions. Because common protocol elements apply to all types of questions, the dialog component execution is consistent across all questions.
Easy specification of reusable dialog components. The developer only needs to specify the "what" for each question: its wording, paraphrases, and acceptable responses. In effect, each dialog component becomes a specification of the "what," while the "how" is inherited from its ancestor.
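One way to picture this factoring is a small class hierarchy. The sketch below is an assumption-laden illustration, not the design of any vendor's library: `DialogComponent`, `YesNo`, `Digits`, the account-number prompts, and the `listen` hook (which plays a prompt and returns recognized text, or `None` on time-out) are all hypothetical names. The ancestor owns the protocol; each subclass supplies only prompts and an interpretation of acceptable responses.

```python
from typing import Callable, Optional

Listen = Callable[[str], Optional[str]]


class DialogComponent:
    """Common ancestor: owns the "how" (time-outs, rephrasing, retries)."""
    prompts: list[str] = []   # the "what": wording and paraphrases
    max_turns = 4

    def interpret(self, text: str):
        """The "what": map recognized text to a value, or None if unacceptable."""
        raise NotImplementedError

    def run(self, listen: Listen):
        level = 0
        for _ in range(self.max_turns):
            heard = listen(self.prompts[level])
            if heard is None:
                continue  # non-response: repeat the question
            value = self.interpret(heard.lower().strip(" .!?"))
            if value is not None:
                return value
            # Help request or unexpected response: move to the next paraphrase.
            level = min(level + 1, len(self.prompts) - 1)
        return None


class YesNo(DialogComponent):
    prompts = ["Would you like to buy it?",
               "Would you like to buy it; please say 'yes' or 'no.'"]
    _answers = {"yes": True, "yeah": True, "sure": True,
                "no": False, "nope": False}

    def interpret(self, text):
        return self._answers.get(text)


class Digits(DialogComponent):
    prompts = ["What is your account number?",
               "Please say the digits of your account number one at a time."]
    _names = {w: str(i) for i, w in enumerate(
        "zero one two three four five six seven eight nine".split())}

    def interpret(self, text):
        words = text.split()
        if words and all(w in self._names for w in words):
            return "".join(self._names[w] for w in words)
        return None
```

Because both subclasses inherit `run`, a change to the time-out or rephrasing policy is made once and applies to every type of question.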

Complex Dialog Components

Sometimes it is convenient to combine several reusable dialog components into a more complicated dialog. For example, a banking application has a transfer operation requiring three parameters: the source account, the target account, and the amount to be transferred. A composite dialog component would accept values for these three parameters (from, to, and amount) in any sequence. For example:

"Transfer $200 from my savings account to my checking account"

"Transfer $200 to my checking account from my savings account"

"Transfer from my savings account $200 to my checking account"

A composite dialog component also solicits any values the user omits from the initial response. For example:

User: Transfer $200 to my checking account.

Computer: From which account should $200 be transferred?

User: From my savings account.

Researchers at The Oregon Graduate Institute are designing composite dialog components called task boxes. A task box enables users to accomplish a specific task by specifying multiple values in any sequence with a single user utterance.
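The transfer example above can be approximated with a simple slot-filling loop. This is a rough sketch under stated assumptions, not a description of the task box design: it assumes the recognizer returns text in which amounts appear as "$200", it uses regular expressions in place of a real grammar, and the `listen` hook, the opening prompt, and the follow-up wordings other than the one quoted above are hypothetical.

```python
import re
from typing import Callable, Optional

# One pattern per parameter; order in the utterance does not matter.
SLOT_PATTERNS = {
    "amount": re.compile(r"\$(\d+(?:\.\d{2})?)"),
    "source": re.compile(r"\bfrom (?:my )?(\w+) account"),
    "target": re.compile(r"\bto (?:my )?(\w+) account"),
}

# Follow-up questions for missing values. "amount" is requested first,
# so it is always known by the time the other two prompts are formatted.
FOLLOW_UPS = {
    "amount": "How much should be transferred?",
    "source": "From which account should ${amount} be transferred?",
    "target": "To which account should ${amount} be transferred?",
}


def parse(utterance: str) -> dict[str, str]:
    """Pick out whichever parameters appear, regardless of their order."""
    text = utterance.lower()
    found = {}
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[slot] = match.group(1)
    return found


def transfer_dialog(listen: Callable[[str], Optional[str]],
                    max_turns: int = 6) -> Optional[dict[str, str]]:
    """Collect amount, source, and target, asking only for what is missing."""
    slots: dict[str, str] = {}
    prompt = "What would you like to do?"
    for _ in range(max_turns):
        heard = listen(prompt)
        if heard is not None:
            slots.update(parse(heard))
        missing = [s for s in SLOT_PATTERNS if s not in slots]
        if not missing:
            return slots
        prompt = FOLLOW_UPS[missing[0]].format(**slots)
    return None
```

With the replies "Transfer $200 to my checking account." and "From my savings account.", the second prompt produced by this sketch is "From which account should $200 be transferred?", matching the exchange shown above.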

Multi-modal Dialog Components

Another generalization of reusable dialog components is the multi-modal dialog component. Two examples of reusable multi-modal dialog components include:

A message is presented both verbally and visually to the user, who then responds either by speaking or typing the appropriate response.
Users enter some parameters of a composite dialog component by voice and some by stylus gesture. For example:
- User says: "Transfer $200 from here."
- The user points the stylus to the savings account. User says: "to here."
- The user points the stylus to the checking account.

Multi-modal dialog components enable people to use multiple I/O devices to converse with the computer. In the past, speech vendors have marketed their speech engines by emphasizing their vocabulary size, accuracy, performance, and memory requirements. As developers begin to create conversational speech applications, they will also choose speech engines based on development tools and reusable dialog components.
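The voice-plus-stylus transfer above requires some way of combining the two input streams. The following sketch shows one simple possibility, pairing each spoken "here" with the next stylus tap in time order. It is an assumption for illustration only: the event format, the `fuse` function, and the idea that a tap arrives already resolved to an account name are all hypothetical.

```python
import re


def fuse(events):
    """Resolve each spoken "here" with the next stylus tap, in time order.

    `events` is a hypothetical time-ordered list of ("speech", text) and
    ("tap", account_name) tuples.
    """
    slots = {}
    pending = []  # roles ("source" or "target") still waiting for a tap
    for kind, value in events:
        if kind == "speech":
            text = value.lower()
            amount = re.search(r"\$(\d+(?:\.\d{2})?)", text)
            if amount:
                slots["amount"] = amount.group(1)
            # Queue the references in the order they were spoken.
            for word in re.findall(r"\b(from|to) here\b", text):
                pending.append("source" if word == "from" else "target")
        elif kind == "tap" and pending:
            slots[pending.pop(0)] = value
    return slots


events = [
    ("speech", "Transfer $200 from here."),
    ("tap", "savings"),
    ("speech", "to here."),
    ("tap", "checking"),
]
transfer = fuse(events)
```

A tap that arrives when no spoken reference is waiting is ignored in this sketch; a real component would need a policy for such cases, as well as for taps that arrive slightly before the matching words.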

Reusable composite dialog components and reusable multi-modal dialogs are just around the corner. Just as controls have been reused in hundreds of GUI applications, reusable dialog components have the potential to be reused in hundreds of speech applications. The focus of competition between speech toolkit vendors could soon shift from speech recognition engines to libraries of reusable dialog components.

Jim Larson is the manager of Advanced Human I/O at the Intel Architecture Labs in Hillsboro, OR. He can be reached at jim@larson-tech.com