April/May 1999
APPLICATION DEVELOPMENT
Computer-Driven Conversational Dialogs Present Challenges for Speech Application Designers
By James Larson
Many speech developers spent long hours trying to eliminate the "talking robot" phenomenon that experts once thought was the major logjam for the acceptance of speech-centric applications. As a result, speech applications now "sound" right. Today’s text-to-speech applications are sometimes mistaken for human beings, and speech recognition applications no longer require that users speak "discretely" but instead allow users to speak "continuously" without pausing between words.
While that success has certainly cleared a major hurdle toward the acceptance of speech, true interactive dialogs, as when two people converse over coffee or, more importantly, discuss a purchase, have been difficult for computers to duplicate. In retrospect, it is probably not a surprise to find that speech is easier to attain than conversation.
Conversational speech occurs when two participants, either two people or a person and a computer, take turns speaking with each other. Examples of conversational speech between a person and a computer include:
Many new speech applications are being developed using computer-driven dialogs. Computer-driven conversational dialogs are especially popular in telephony applications, where the user is asked a sequence of questions, each requiring that the user respond with a simple word or phrase. This type of dialog has two technical benefits. First, it forces the user to concentrate on answering a specific question rather than leaving the user to guess how to inform the system of his/her request. Second, it enables the speech recognition system to recognize, at each point in the dialog, a small number of words and phrases, rather than all words in a large vocabulary. A third benefit results from recent improvements in speech recognition: spoken dialogs can replace DTMF (touchtone) input and menus of the form "for accounting, press one; for customer service, press two,…"
Computer-driven conversational dialogs present some interesting challenges for the speech application designer, including how to phrase a question so it can be easily and quickly answered with a word or short phrase, and how to assist the user in determining an acceptable answer to each question asked.
Dialog Problems
"Would you like to buy it?" is a simple question that users should be able to answer. But in applications using speech recognition systems, obtaining an answer to a question like this can be difficult. Here are some problems an application developer may encounter:
To obtain the answer to a yes/no question, the application developer needs to specify a dialog that allows interrupts, accepts alternative phrases as responses, prompts users who do not respond, rephrases questions when users ask for help or enter inappropriate responses, and avoids hyperarticulation. User testing is required to fine-tune dialog components to find the length of the time-out period, capture the phrases used to respond to the question, refine the follow-up questions, and develop an appropriate strategy for dealing with hyperarticulation.
Reusable Dialog Components
Fortunately, much of the dialog component for "Do you want to buy it?" can be reused in other yes/no questions. However, some specifications may change. For example, the initial question and its various re-phrasings will need to be changed, as well as some of the acceptable user responses. But most of the structure of the dialog component remains the same.
While it is tempting simply to copy and edit the code for the question "Do you want to buy it?", a better approach is to modularize the code. This is done by isolating the initial question and its various re-phrasings, confirmations, and acceptable responses as parameters for a module that is invoked for every yes/no question in the application. This module is an example of a reusable dialog component.
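The modularization described above might look something like the following Python sketch. It is only an illustration under stated assumptions: `YesNoQuestion`, `ask_yes_no`, and the `listen` and `speak` callbacks are hypothetical names standing in for whatever recognizer and text-to-speech interface an application actually uses, and the default phrases and time-out are placeholders that user testing would refine.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical I/O hooks (assumptions, not a real speech API):
# listen(timeout) returns the recognized text, or None if the user stays silent.
Listen = Callable[[float], Optional[str]]
Speak = Callable[[str], None]


@dataclass
class YesNoQuestion:
    """The parts of a yes/no dialog that change from question to question."""
    prompt: str
    rephrasings: list[str]
    help_text: str
    yes_phrases: frozenset[str] = frozenset({"yes", "yeah", "sure", "ok"})
    no_phrases: frozenset[str] = frozenset({"no", "nope", "no thanks"})
    timeout_seconds: float = 3.0   # placeholder; tune through user testing
    max_attempts: int = 3


def ask_yes_no(q: YesNoQuestion, listen: Listen, speak: Speak) -> Optional[bool]:
    """Run the shared dialog structure; return True, False, or None if it gives up."""
    speak(q.prompt)
    for attempt in range(q.max_attempts):
        heard = listen(q.timeout_seconds)
        answer = (heard or "").strip().lower()
        if answer in q.yes_phrases:
            return True
        if answer in q.no_phrases:
            return False
        if attempt == q.max_attempts - 1:
            break  # out of attempts; do not prompt again
        if answer == "help":
            speak(q.help_text)
        elif q.rephrasings:
            # Silence or an unexpected response: try the next rephrasing.
            speak(q.rephrasings[min(attempt, len(q.rephrasings) - 1)])
        else:
            speak(q.prompt)
    return None
```

Each yes/no question in an application would then supply only its own `YesNoQuestion` values, while the handling of time-outs, help requests, and unexpected responses stays in one place.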
Controls, such as pull-down menus, radio buttons, and editing boxes, have made a tremendous impact on graphical user interfaces. They provide user interface consistency, incorporate proven user interface techniques, and improve ease of application development. Reusable dialog components provide the same benefits to voice-user interfaces, specifically:
Just as several companies create and sell controls, speech vendors will create and sell reusable dialog components. For several years, SpeechWorks International (formerly named ALTech) has supported several reusable dialog components, called DialogModules™, that are part of the SpeechWorks product. DialogModules™ are high-level building blocks that implement frequently used dialog components. Developers can quickly assemble and integrate DialogModules™ into conversational speech applications. Currently, the following SpeechWorks DialogModules™ are available:
Some SpeechObjects™ are relatively simple, such as a Date SpeechObject. The Date SpeechObject recognizes when a person says a date in a variety of different formats. Other SpeechObjects™ are compound components that incorporate several SpeechObjects™, such as a Flight Information SpeechObject, which recognizes flight numbers and dates and provides users with the latest schedule of a flight.
Potential problems with reusable dialog components include:
Inheritance
Rather than have each and every reusable dialog component implement its own protocol for non-responses, unexpected responses, inappropriate responses, requests for help, and hyperarticulation, developers should factor out common protocol elements from the reusable dialog components and place them into a higher-level module. Also, individual reusable dialog components should be designed to inherit common protocols from their common ancestors. These types of reusable dialog components have the following advantages:
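As a rough illustration of factoring the common protocol into an ancestor, consider the Python sketch that follows. All names in it (`DialogComponent`, `YesNo`, `Digits`, and the `listen` and `speak` callbacks) are hypothetical and do not come from any vendor's toolkit. The base class owns the handling of non-responses, help requests, and unacceptable responses; each descendant supplies only its own interpretation of what the user said.

```python
from abc import ABC, abstractmethod
from typing import Callable, Optional


class DialogComponent(ABC):
    """Hypothetical ancestor holding the protocol every component shares."""
    max_attempts = 3

    def __init__(self, prompt: str, help_text: str):
        self.prompt = prompt
        self.help_text = help_text

    @abstractmethod
    def interpret(self, heard: str):
        """Return the parsed value, or None if the response is not acceptable."""

    def run(self, listen: Callable[[], Optional[str]], speak: Callable[[str], None]):
        """Common protocol: listen() returns text, or None when the user is silent."""
        speak(self.prompt)
        for _ in range(self.max_attempts):
            heard = listen()
            if heard is None:                      # non-response
                speak("Sorry, I did not hear you. " + self.prompt)
                continue
            text = heard.strip().lower()
            if text == "help":                     # request for help
                speak(self.help_text)
                continue
            value = self.interpret(text)
            if value is not None:
                return value
            speak("Sorry, I did not understand. " + self.prompt)
        return None                                # caller decides what to do next


class YesNo(DialogComponent):
    def interpret(self, heard: str):
        if heard in {"yes", "yeah", "sure"}:
            return True
        if heard in {"no", "nope"}:
            return False
        return None


class Digits(DialogComponent):
    def interpret(self, heard: str):
        return int(heard) if heard.isdecimal() else None
```

A change to the shared protocol, such as a different apology or a different attempt limit, is then made once in the ancestor and picked up by every component that inherits from it.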
Complex Dialog Components
Sometimes it is convenient to combine several reusable dialog components into a more complicated dialog. For example, a banking application has a transfer operation requiring three parameters: the source account, the target account, and the amount to be transferred. A composite dialog component would accept values for these three parameters (to, from, and amount) in any sequence. For example:
"Transfer $200 from my savings account to my checking account"
"Transfer $200 to my checking account from my savings account"
"Transfer from my savings account $200 to my checking account"
A composite dialog component also solicits any values the user omits from the initial response. For example:
User: Transfer $200 to my checking account.
Computer: From which account should $200 be transferred?
User: From my savings account.
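The exchange above can be approximated with a small slot-filling sketch in Python. It is an assumption-laden illustration, not a description of any existing product: the regular expressions stand in for a real recognition grammar, and `extract_slots`, `transfer_dialog`, and the `ask` callback are hypothetical names. The component pulls whatever values it can find from the first utterance, in any order, and then asks one follow-up question for each value still missing.

```python
import re
from typing import Callable

# Toy patterns standing in for a recognition grammar (assumption).
SLOT_PATTERNS = {
    "amount": re.compile(r"\$(\d+(?:\.\d\d)?)"),
    "source": re.compile(r"\bfrom my (\w+) account"),
    "target": re.compile(r"\bto my (\w+) account"),
}

FOLLOW_UPS = {
    "amount": "How much should be transferred?",
    "source": "From which account should the money be transferred?",
    "target": "To which account should the money be transferred?",
}


def extract_slots(utterance: str) -> dict[str, str]:
    """Find every slot value present in the utterance, regardless of order."""
    text = utterance.lower()
    found = {}
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            found[slot] = match.group(1)
    return found


def transfer_dialog(first_utterance: str, ask: Callable[[str], str]) -> dict[str, str]:
    """Fill the three transfer slots, asking only for the ones still missing.

    ask(question) speaks the question and returns the user's reply as text.
    This sketch asks once per missing slot; a fuller component would re-prompt.
    """
    slots = extract_slots(first_utterance)
    for slot, question in FOLLOW_UPS.items():
        if slot not in slots:
            for name, value in extract_slots(ask(question)).items():
                slots.setdefault(name, value)  # keep values already supplied
    return slots
```

With the utterance "Transfer $200 to my checking account." the sketch asks only for the source account, mirroring the dialog shown above.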
Researchers at The Oregon Graduate Institute are designing composite dialog components called task boxes. A task box enables users to accomplish a specific task by specifying multiple values in any sequence with a single user utterance.
Multi-modal Dialog Components
Another generalization of reusable dialog components is the multi-modal dialog component. Two examples of reusable multi-modal dialog components include:
Multi-modal dialog components enable people to use multiple I/O devices to converse with the computer. In the past, speech vendors have marketed their speech engines by emphasizing their vocabulary size, accuracy, performance, and memory requirements. As developers begin to create conversational speech applications, they will also choose speech engines based on development tools and reusable dialog components.
Reusable composite dialog components and reusable multi-modal dialogs are just around the corner. Just as controls have been reused in hundreds of GUI applications, reusable dialog components have the potential to be reused in hundreds of speech applications. The focus of competition between speech toolkit vendors could soon shift from speech recognition engines to libraries of reusable dialog components.
Jim Larson is the manager of Advanced Human I/O at the Intel Architecture Labs in Hillsboro, OR. He can be reached at jim@larson-tech.com.