Voice User Interface Design for Novice and Experienced Users
|
|
James A. Larson |
|
Intel Corporation, 16055 S. W. Walker Rd., #402, Beaverton OR 97006, www.larson-tech.com |
Abstract: With careful design and testing, voice application designers can develop efficient and enjoyable VoiceXML applications for both novice and experienced callers. VoiceXML lends itself to a system-directed dialog for voice users, guiding them through the unfamiliar application and helping callers achieve their desired results. However, as callers become more experienced, they learn shortcuts to increase the pace of the dialog. Experienced users familiar with the structure and commands of the application can direct the application to perform actions quickly and efficiently using a type of mixed-initiative dialog.
Telephony applications enable callers to speak and listen to the computer using a telephone or cell phone in order to achieve computational tasks. Most telephony applications have two general goals:
(1) Enable anyone who can speak and listen to use telephony applications without previous training.
(2) Enable all callers to perform tasks quickly and efficiently.
To achieve the first goal, many developers design telephony applications to be application-directed (also called application-initiative, machine-directed, system-directed, or directed dialogs): the application leads the caller by asking questions to which the caller responds. With no previous training, callers who are unfamiliar with telephony applications are able to interact with the application.
However, users who frequently use the application may become impatient listening to and answering the same questions in the same order each time they use the application. Average callers save time by volunteering the answers to the questions without waiting for the questions to be asked. Many average callers desire to take a more active role by leading the interaction with the application so that they can complete their tasks more quickly and efficiently. A mixed-initiative telephony application enables callers to lead as well as be led by the application.
This paper describes how to develop telephony applications that can be both application-directed and mixed-initiative. Novice users answer detailed questions in response to verbal prompts to achieve their desired results, while average users speed up the interaction by volunteering the answers to questions without waiting to listen to some of the prompts.
VoiceXML [Andersson, Eve Astrid, Stephen Breitenback, Tyler Burd, Nimal Chidambaram, Paul Houle, Daniel Newsome, Xiaofei Tang, and Xiaolan Zhu, Early Adopter VoiceXML, Wrox Press: Birmingham, UK. ISBN 1-861005-672-8] [Beasley, Rick, Kenneth Michael Farley, John O’Reilly, and Leon Henry Squire, Voice Application Development with VoiceXML, Sams: Indianapolis, 2002] [Edgar, Bob, VoiceXML: The Telephony-enabled Web, CMP Books: NY, 2001] [Larson, James A., VoiceXML: An Introduction to Developing Voice Applications, Prentice Hall: Upper Saddle River, NJ, 2002] is a markup language for building telephony applications. VoiceXML can be thought of as the “HTML for telephony applications.” Because VoiceXML hides many low-level details, developers create telephony applications by specifying high-level menus and forms rather than procedural programming code. Decreasing the programming time and effort enables developers to perform additional iterations of usability testing and design refinement. VoiceXML is lowering the entry barrier for creating telephony applications.
While VoiceXML makes it easy to create speech applications, it is difficult to create a good one. An HTML programmer easily learns how to write VoiceXML documents, but designing a usable VoiceXML application is still more of an art than a science. Most VoiceXML language manuals do not offer much advice for how to phrase a prompt, what to include in the grammar describing what a caller might say in response to a prompt, and what to do if the caller does not respond appropriately.
Figure 1 illustrates a fragment of a telephony application expressed using the VoiceXML language. This fragment illustrates a VoiceXML form—the equivalent of a paper form—in which the caller supplies a date. The form consists of fields whose values are spoken by the caller and converted to text by a speech recognition engine. Each field has a name that acts as a variable for the field’s value, a prompt that is presented to the caller as speech (either by replaying a pre-recorded audio file or by a speech synthesis engine that converts a text string into spoken words), and a grammar used by the speech recognition engine to convert spoken words into text. The speech recognition engine recognizes only words specified by the grammar and places the corresponding text string into the field variable.
For example, in Figure 1, the date form has three fields: month, day, and year. The month field has the prompt “What month?” that is converted to speech by a speech synthesizer and presented to the user. The user responds by speaking the name of one of the twelve months. Using the grammar that specifies the valid user responses to the prompt, the speech recognition engine converts the spoken month name into text, which is placed into the variable named “month.”
<?xml version="1.0"?>
<vxml version="2.0" xmlns = "http://www.w3.org/2001/vxml">
<form id = "date">
<field name = "month">
<prompt> What month? </prompt>
<grammar root="month" version="1.0" type= "application/grammar+xml" >
<rule id = "month">
<one-of>
<item>January</item>
<item>February</item>
<item>March</item>
<item>April</item>
<item>May</item>
<item>June</item>
<item>July</item>
<item>August</item>
<item>September</item>
<item>October</item>
<item>November</item>
<item>December</item>
</one-of>
</rule>
</grammar>
</field><field name = "day">
<prompt> What day of the month? </prompt>
<grammar root="day" version = "1.0" type = "application/grammar+xml">
<rule id = "day">
<one-of>
<item>one</item>
<item>two</item>
<!-- items three through thirty omitted to save space -->
<item>thirty-one</item>
</one-of>
</rule>
</grammar>
</field><field name = "year">
<prompt> What year? </prompt>
<grammar root = "year" version = "1.0" type = "application/grammar+xml">
<rule id = "year" >
<one-of>
<item>nineteen hundred</item>
<item>nineteen hundred one</item>
<!-- items omitted to save space -->
<item>two thousand two</item>
</one-of>
</rule>
</grammar>
</field>
</form>
</vxml>

Figure 1: Example VoiceXML dialog fragment
The form of Figure 1 contains three fields—month, day, and year. VoiceXML processors use the Form Interpretation Algorithm (FIA) to sequence the fields for processing. The FIA selects the next field in the form whose field variable has no value. In general, the FIA selects first the month field, then the day field, and finally the year field.
The caller’s previous experience with speech applications is an important factor in the caller’s ability to use a new speech application. At one extreme, a caller may be a novice—someone who has never used a speech interface. Novice callers may not know that they may barge in before the end of a prompt, may not know how to ask for help, and may not know that they can skip over fields within a verbal form. At the other extreme, a caller may be experienced—someone who has used voice interfaces many times. Experienced callers are able to leverage what they know about voice applications to explore and learn new voice applications. They realize that it is not necessary to listen to a complete prompt, that they can ask for help at any time simply by saying “help,” and that they may speak the field values of a voice form in any convenient order.
Because most callers have encountered a touch-tone menu system, they probably are one notch above a basic novice caller. However, most have never encountered voice-enabled applications and fall short of being experienced voice-application callers.
Figure 2 illustrates the typical flow of an application-directed dialog. The application prompts the caller by asking a question or giving instructions and then waits for the caller to respond. The caller responds by speaking or pressing the buttons on a touch-tone phone. The application then performs the appropriate action, and the cycle begins again. When interpreted by a VoiceXML browser, the VoiceXML fragment of Figure 1 might produce the application-directed dialog illustrated in Figure 2.
Application: What month?
Caller: February
Application: What day of the month?
Caller: Twelve
Application: What year?
Caller: Nineteen ninety-seven

Figure 2: Example application-directed dialog produced by the VoiceXML fragment in Figure 1.
Application-directed dialogs are easy to code and maintain. Many callers like application-directed dialogs because they guide the caller through the application. The caller is not required to remember any commands or options (although with experience, callers do remember commands and options and use barge-in to speed up the dialog). Many callers have experienced application-directed dialogs when using the touch-tone buttons on telephones. Most novice callers feel comfortable with application-directed dialogs. On the negative side, callers may complain that the dialog is too structured and rigid, and takes too much time to complete. Some callers feel that the computer becomes their master and they become mere slaves to the computer.
With user-directed dialogs (also called user-initiative dialogs), the caller speaks to the application and instructs it what to do. The caller speaks a request; the application performs the appropriate action, confirms the result to the caller, and then waits for the caller to speak the next request. Then the cycle starts again. A user-directed dialog is illustrated in Figure 3.
Caller: Set month to February
Application: Month is February
Caller: Set day to twelve
Application: Day is twelve
Caller: Set year to nineteen ninety-seven
Application: Year is nineteen ninety-seven

Figure 3: Example of a user-directed dialog.
The caller “drives” the dialog by initiating each dialog segment without explicit prompts. User-directed dialogs generally require the caller to remember the names of commands and parameters. While this is seldom a problem for an experienced caller, it may be problematic for a novice caller. This problem can be minimized with carefully designed help messages or a cue card listing the commands that the caller can carry in his wallet or her purse. After learning the command set, callers generally like user-directed dialogs because callers do not have to listen to lengthy menus. They feel in control of the application.
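The user-directed style of Figure 3 can be approximated in VoiceXML by giving a field a command-phrase grammar and only a minimal prompt. The following sketch is illustrative only; the form id "set_date," the terse "Ready" prompt, and the abbreviated month list are assumptions rather than part of Figure 3.

<?xml version="1.0"?>
<vxml version="2.0" xmlns = "http://www.w3.org/2001/vxml">
<form id = "set_date">
<field name = "month">
<!-- Deliberately terse prompt; the caller, not the application, drives the dialog -->
<prompt> Ready. </prompt>
<grammar root = "set_month" version = "1.0" type = "application/grammar+xml">
<rule id = "set_month">
set month to
<one-of>
<item>January</item>
<item>February</item>
<!-- remaining months omitted to save space -->
<item>December</item>
</one-of>
</rule>
</grammar>
<filled>
<!-- Confirm the result, as in Figure 3; without semantic tags (see Figure 9),
the month variable holds the full recognized phrase rather than just the month name -->
<prompt> Month is <value expr = "month"/> </prompt>
</filled>
</field>
</form>
</vxml>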
Mixed-initiative dialogs are a mixture of application-directed and user-directed dialogs. Figure 4 illustrates a type of mixed-initiative dialog.
Application: What month?
Caller: February twelve nineteen ninety-seven

Figure 4: A type of mixed-initiative dialog in which the user takes the initiative to enter more data than was requested.
Some callers become confused when they first encounter a mixed-initiative dialog because they do not know what to do or when to barge in. Perhaps this is because many callers are already familiar with application-directed dialogs, especially touch-tone-based applications, and have little experience with mixed-initiative dialogs. Mixed-initiative dialogs feel more natural to most callers after they pass through the initial learning period. On the negative side, mixed-initiative dialogs can be more time-consuming to build, test, and maintain and may be more compute-intensive.
The general process for developing a voice application is similar to the process for developing general application software.
However, there are some unique tasks that must be performed when developing voice user interfaces, including specifying prompt messages to solicit information from the caller, writing grammars to describe the words and phrases a caller may speak in response to prompts, and writing error handlers to help the caller resolve problems that occur when the speech recognition engine fails to recognize the caller’s utterance.
Several guidelines for developing voice user interfaces are described below, along with examples of how to implement the guidelines using VoiceXML 2.0, the most recent version of the language. The guidelines are grouped into three clusters: guidelines for guiding novice callers through an application, guidelines for helping average callers accelerate the dialog, and guidelines for enabling experienced callers to use a type of mixed-initiative dialog.
See [ Balentine, Bruce and David P. Morgan, How to Build a Speech Recognition Application: A Style Guide for Telephony Dialogues, Enterprise Integration Group, San Ramon, CA, 1999.] for additional human-factors guidelines for speech applications.
These guidelines are designed for developing voice user interfaces for novice users who need to be guided through the application. Developers use these guidelines to design a system-directed dialog that can easily be implemented in VoiceXML 2.0.
Callers frequently use a voice user interface to achieve specific results. For example, callers access a voice mail system to review and possibly create and send voice mail messages. The best way to organize a voice user interface is by a task flow that matches the sequence of tasks normally performed by the user. For example, the caller first reviews a list of received messages, selects a message for review, and may then record, edit and send a response. The voice user interface should be designed to facilitate this typical sequence of tasks.
In North America, callers usually say the date in month-day-year sequence. The VoiceXML fragment in Figure 1 illustrates how a voice user interface asks for date information in the same sequence that North American callers usually write dates. Europeans frequently write dates in the sequence of day/month/year. For European callers, designers should consider switching the position of the month and day fields so that the user interface more closely matches how Europeans write dates.
Most callers have experienced touch-tone applications, which are generally of the form “for accounting, press one; for sales, press two; …” Callers who are not familiar with the application find these system-directed dialogs an easy and natural way to interact with a voice application over a telephone or cell phone, despite the cognitive task of mapping a request onto a numbered key on the keypad. Currently, many callers have never experienced a user-directed dialog and will not know what to do if the voice application asks an open-ended question, such as “What would you like to do next?” (This situation may change as more users begin to experience mixed-initiative dialogs.)
Designers should create applications that guide the caller toward speaking responses that the speech recognition engine can recognize. It is much better to use “suggestive” prompts than to require the user to memorize and recall words.
Phrase each question to solicit a simple answer. Either enumerate the words that the user may speak in response to the question, “How do you want to travel? By air, boat, or car,” or identify a class of words well known to the user, such as “Which day of the month?” An experienced caller may speed up the dialog by barging in during the question and speaking a response.
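A field that both asks the question and enumerates the acceptable answers might look like the following sketch; the field name "travel_mode" is an assumption used only for illustration.

<field name = "travel_mode">
<prompt> How do you want to travel? By air, boat, or car? </prompt>
<grammar root = "travel_mode" version = "1.0" type = "application/grammar+xml">
<rule id = "travel_mode">
<one-of>
<item>air</item>
<item>boat</item>
<item>car</item>
</one-of>
</rule>
</grammar>
</field>

If usability testing shows that callers answer with phrases such as “by air,” those variants can be added to the grammar, as discussed later under extending grammars with synonyms.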
Voice tutorials presented at the beginning of an application have not proven to be helpful to users. Users have difficulty remembering the details presented in voice tutorials. Also, many users resent the time required to listen to them. Instead of presenting wordy tutorials, provide just-in-time help. When the user becomes confused and does not know how to respond, the user asks for help. Help is provided as a short verbal message expressed in the context of the user’s current point in the dialog. If the user fails to respond to a question or responds with a word not recognized by the speech recognition engine, present a help message to the caller explaining appropriate responses.
When errors occur, the application responds according to instructions specified by the dialog designer in event handlers—special snippets of VoiceXML code designed to help callers resolve errors. VoiceXML detects several types of caller errors, including:
(1) noinput—the caller does not respond before a timeout expires.
(2) nomatch—the caller says something that is not in the active grammar.
The caller may also ask for assistance explicitly, which raises a help event.
Figure 5 illustrates event handlers for assisting users when they ask for help or encounter noinput or nomatch events. Event handlers typically use the <catch> tag to intercept the event. In these examples, the user is prompted to try again. Because no value has been placed into the month field, the FIA continues to try to obtain a value for the month field.
<?xml version="1.0"?>
<vxml version="2.0" xmlns = "http://www.w3.org/2001/vxml">
<form id = "date">
<field name = "month">
<prompt count = "1"> What month? </prompt><catch event = "nomatch">
<prompt>
I did not understand you. Which month of the year?
</prompt>
</catch><catch event = "noinput">
<prompt>
I did not hear you. Which month of the year?
</prompt>
</catch><catch event ="help">
<prompt>
Which month of the year? For example, January.
</prompt>
</catch>
<grammar root = "month" type = "application/grammar+xml" version = "1.0">
<rule id = "month">
<one-of>
<item>January</item>
<!-- items omitted to save space -->
<item>December</item>
</one-of>
</rule>
</grammar>
</field>
</form>
</vxml>

Figure 5: Example event handlers for helping callers resolve errors.
Sometimes a simple help message is not enough—the caller fails to respond, responds incorrectly, or asks for additional help. In this situation, reveal additional information and instructions to the caller each time the caller fails to respond correctly. Designers frequently write several levels of help messages, providing additional assistance at each level. Figures 6 and 7 illustrate four levels of progressive assistance:
(1) The initial prompt: “What month?”
(2) A reworded prompt: “Which month of the year?”
(3) The reworded prompt plus an example: “Which month of the year? For example, January.”
(4) Transfer to a human operator.
If for any reason a value is not inserted into the month field, the FIA attempts to solicit the missing value from the user. After each failure, the event count is incremented by one, and the next level of prompt wording is presented to the user.
<?xml version="1.0"?>
<vxml version="2.0" xmlns = "http://www.w3.org/2001/vxml">
<form id = "date"><field name = "month">
<prompt count = "1"> What month? </prompt>
<catch event = "help nomatch noinput" count = "2">
<prompt>Which month of the year? </prompt>
</catch>
<catch event = "help nomatch noinput" count = "3">
<prompt>
Which month of the year? For example, January.
</prompt>
</catch>
<catch event = "help nomatch noinput" count = "4">
<throw event = "transfer"/>
</catch>
</field><catch event="transfer">
<prompt> Transferring. Please hold. </prompt>
<!-- goto code fragment that transfers user -->
<exit/>
</catch>
<grammar root = "month" version = "1.0" type = "application/grammar+xml">
<rule id = "month">
<one-of>
<item>January</item>
<!-- items omitted to save space -->
<item>December</item>
</one-of>
</rule>
</grammar>
</form>
</vxml>

Figure 6: Example of four levels of progressive assistance
Application: What month?
Caller: Huh?
Application: What month of the year?
Caller: Uh, aa. . .
Application: What month of the year? For example, January.
Caller: Aa…
Application: Transferring. Please hold.

Figure 7: Dialog illustrating progressive assistance
Callers will tolerate only a limited number of failures before they give up and abandon the application. The application should connect the caller to a human operator before this happens. Only user testing will determine the approximate number of caller failures before the caller gives up. Adjust the prompts so the caller is transferred to the operator before the caller experiences this number of failures.
As callers use the system and listen to the short help messages, they will gradually learn how to use that portion of the voice application. The more the caller uses the voice application, the more expertise the caller obtains. This type of incremental learning enables the user to progress from novice to expert level at the user’s own rate. The user only learns about the portions of the application that the caller frequently uses. If the expert user forgets a detail, the expert can easily ask for help.
Average callers may complain that application-directed dialogs are too time-consuming, especially when callers must listen to prompt messages that they have heard many times. When applied to a VoiceXML 2.0 application, the following guidelines enable callers to accelerate the interaction so they can complete their desired tasks faster and more easily.
Barge-in enables callers to interrupt and speak before the completion of a voice prompt. As soon as the caller speaks, the prompt stops, and the caller does not hear the remainder of the prompt. For average callers who are familiar with the telephony application, barge-in increases the pace of the interaction.
Barge-in speeds up a verbal conversation for experienced callers. However, barge-in is a double-edged sword. While barge-in increases the pace of interaction, callers may miss important information. Dialog designers should design prompts so that important information appears early in the prompt. Figure 8 illustrates the same dialog as Figure 2 except the caller is allowed to barge-in. In this example, the caller anticipates the question “What year?” and barges-in before hearing the question.
Application: What month?
Caller: February
Application: What day of….
Caller: (barges in) twelve
Application: What …
Caller: (barges in) nineteen ninety-seven

Figure 8. Example dialog illustrating barge-in
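Barge-in is enabled by default on most VoiceXML 2.0 platforms, and it can be controlled explicitly with the bargein property or the bargein attribute of <prompt>. The sketch below shows both mechanisms; disabling barge-in at the form level and re-enabling it for the month question is an illustrative assumption, not part of Figure 8.

<form id = "date">
<!-- Disable barge-in for all prompts in this form by default -->
<property name = "bargein" value = "false"/>
<field name = "month">
<!-- Re-enable barge-in for this question so experienced callers can interrupt it -->
<prompt bargein = "true"> What month? </prompt>
<!-- grammar as in Figure 1 -->
</field>
</form>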
When average callers barge in, they may not hear the choices presented at the end of a prompt. Average callers may not remember the exact words and phrases and may utter synonyms instead. Application designers should include these synonyms in the grammar associated with the field containing the prompt. For example, Figure 9 illustrates the grammar used in the day field of Figure 1. The grammar is extended to enable the caller to say “first,” “second,” … “thirty-first” in addition to “one,” “two,” … “thirty-one.”
<?xml version="1.0"?>
<vxml version="2.0" xmlns = "http://www.w3.org/2001/vxml">
<form id = "date">
<field name = "day">
<prompt> What day of the month? </prompt>
<grammar root = "day" version = "1.0" type = "application/grammar+xml">
<rule id = "day">
<one-of>
<item>one</item>
<item>two</item>
<!-- items three through thirty omitted to save space -->
<item>thirty-one</item>
<item> <tag>$.day = "one"</tag>first</item>
<item> <tag>$.day = "two" two</tag>second</item>
<!-- items three through thirty omitted to save space -->
<item><tag>$.day= "thirty-one</tag>thirty-first</item>
</one-of>
</rule>
</grammar>
</field>
</form>
</vxml>

Figure 9. Grammar from the day field in Figure 1 extended to enable the caller to say both ordinal and cardinal numbers.
Semantic interpretation tags enable a text string to be substituted for words spoken by the caller. In the example above, “one” is substituted for “first,” “two” is substituted for “second,” and so on.
During user testing, developers should record alternate words and synonyms spoken by callers in response to prompts and then consider adding these words to the grammar associated with the field containing the prompt.
Alternative utterances enable callers to speak words not included in the prompts, as well as words included in the prompts but not heard when the caller barges in. Enabling the caller to speak synonyms makes the dialog more flexible and results in fewer nomatch events and fewer requests for help by the caller. However, larger grammars may slow the speech recognition engine, which may result in increased latency for the application.
Experienced users are familiar with the structure and content of the application. They need shortcuts that accelerate the interaction. The following guidelines result in enhancements to a system-directed VoiceXML application enabling it to support a type of mixed-initiative dialog:
Many visual applications contain a navigation bar. Clicking a word or icon on the navigation bar causes the user to jump to another document or application.
The VoiceXML equivalent of a navigation bar is called a root document. A root document acts as an index to other verbal documents and applications. The root document presents callers with a verbal menu of universal commands that the caller can speak at any time to be automatically transferred to the corresponding application. The designer includes the name of each target application as a grammar in the <link> tag. For example, if the dialog designer includes “deposit” as a link in the root document, then whenever a caller says “deposit,” the caller is transferred immediately to the deposit document.
Novice callers listen to a list of services before selecting and speaking the name of the desired application. Experienced callers simply say the name of the desired application.
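Figure 10 (below) shows links placed within a form; a root document can also declare a link at the document level so that the command stays active across every document in the application. The following sketch illustrates the idea; the file names banking_root.vxml and deposit.vxml are assumptions used only for illustration.

<?xml version="1.0"?>
<!-- banking_root.vxml: an application root document (the file names are assumptions) -->
<vxml version="2.0" xmlns = "http://www.w3.org/2001/vxml">
<link next = "deposit.vxml">
<grammar root = "deposit" version = "1.0" type = "application/grammar+xml">
<rule id = "deposit"> <item> deposit </item> </rule>
</grammar>
</link>
<!-- Other documents in the application reference this root document, for example
<vxml version="2.0" application="banking_root.vxml">, and thereby inherit the link -->
</vxml>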
Figure 10 illustrates a fragment of a root document that enables the caller to select from among several verbal forms: date, address, and credit card. The caller may say “arrival date,” “address,” or “credit card” at any time to be transferred to the corresponding application.
<initial>
<prompt>
Welcome to your reservation profile.
Do you want to update your arrival date, address, or credit card?
</prompt><link next="#date1">
<grammar root = "day" version = "1.0" type = "application/grammar+xml">
<rule id = "day"> <item> arrival date </item> </rule>
</grammar>
</link>
<link next="#address1">
<grammar root = "address" version = "1.0" type = "application/grammar+xml">
<rule id = "address"> <item> address </item> </rule>
</grammar>
</link>
<link next="#card1">
<grammar root = "card" version = "1.0" type = "application/grammar+xml">
<rule id = "card"> <item> credit card </item> </rule>
</grammar>
</link>
</initial>
Figure 10. Transfer to another application
Callers can bypass prompts by speaking multiple words in the same utterance. A form-level grammar is required to do this. A form-level grammar contains grammar rules for how grammars for individual fields (called field-level grammars) may be combined and sequenced within a single utterance. For example, the form-level grammar shown in Figure 11 enables an American caller to say “January thirteen two thousand two” in a single utterance in response to the prompt, “What date?” If the caller’s utterance matches the entire sequence, then all fields receive their respective values. If a user does not respond to the form-level prompt (which is only presented once, when the FIA first enters the form) before the timeout period, then the FIA will visit each individual field, presenting its prompt and listening for words in each field-level grammar.
<form id = "date">
<initial>
<prompt> What date? </prompt>
</initial>
<grammar root = "American_date" version = "1.0" type = "application/grammar+xml">
<!-- the following is a form-level grammar -->
<rule id = "American_date">
<ruleref uri = "#month"/>
<ruleref uri = "#day"/>
<ruleref uri = "#year"/>
</rule>
<!-- the following are field-level grammars -->
<rule id = "month">
<one-of>
<item>January</item>
<!-- February through November omitted to save space -->
<item>December</item>
</one-of>
</rule>
<rule id = "day">
<one-of>
<item>one</item>
<item>two</item>
<!-- items three through thirty omitted to save space -->
<item>thirty-one</item>
</one-of>
</rule>
<rule id = "year">
<one-of>
<item>nineteen hundred</item>
<item>nineteen hundred one</item>
<!-- items omitted to save space -->
<item>two thousand two</item>
</one-of>
</rule>
</grammar><field name = "month">
…
<field name = "day" >
…
<field name = "year">
…
</form>Figure 11: Example form-level grammar
By specifying multiple form-level grammar rules as illustrated in Figure 12, designers enable callers to utter alternative sequences of words. For example, the following form-level grammar rules enable an American caller to say “January thirteen, two thousand two,” and enable a European caller to say “Thirteen January, two thousand two.”
<rule id = "date">
<one-of>
<ruleref name = "#American_date">
<ruleref name = "#European_date">
</one-of><rule id = "American_date">
<ruleref name= "#month">
<ruleref name= "#day">
<ruleref name= "#year"></rule>
<rule id = "European_date">
<ruleref name= "#day">
<ruleref name= "#month">
<ruleref name= "#year">
</rule>Figure 12: Example of a complex form-level grammar enabling multiple sequences of field-level grammars
When a grammar is defined at the form level, it is active throughout the entire form—not just while the FIA visits a particular field. The speech recognition engine listens for words in all active grammars and places recognized words into the appropriate field variables. This increases the flexibility of the dialog by enabling the caller to speak values in any order and have those values placed into the proper fields, as illustrated in Figure 13.
Application: What date?
Caller: (no response)
Application: What month?
Caller: Twelfth
Application: What month?
Caller: February
Application: What year?
Caller: Nineteen ninety-seven

Figure 13: Example dialog in which the user speaks the values out of sequence
Suppose a caller first speaks the month and day and is then silent for the year. The Form Interpretation Algorithm (FIA) will prompt the caller for values for the unfilled fields—in this case, the year field. The resulting dialog is illustrated in Figure 14.
Application: What date?
Caller: February twelfth
Application: What year?
Caller: Nineteen ninety-seven

Figure 14: Example dialog in which the user omits a value and is prompted to supply the missing value.
The FIA sequences through the empty fields of a form one at a time, prompting the caller for values. The FIA sets a guard variable associated with each form field when the caller supplies a value for that field. If the caller enters a value out of sequence, the FIA accepts that value and marks the field’s guard variable. The FIA cycles through the form again and again until all guard variables are marked—and all of the fields have values.
Design form-level prompts to encourage the caller to respond with multiple parameters. Design field-level prompts to solicit individual parameters not spoken in response to the form level prompt.
Suppose the caller has two types of accounts—checking and savings. The transfer form first asks for the source account and then asks for the target account. Suppose further that the caller answers the first prompt by saying “checking.” Because “checking” is a value for both the source_account and the target_account, it is not clear whether the caller means that “checking” is intended to be the source or the target. This can be resolved by adding the words “to” and “from” to the grammar. The caller speaks the words “to” and “from” to indicate whether “checking” is the source or target. Specifically, if the caller says, “to checking,” then the “checking” becomes the value of target_account. However, if the caller says “from checking,” then “checking” becomes the value of the source_account. The <tag> element contains semantic interpretation instructions that specify which variable, source or target, should receive the value “checking.” As illustrated in Figure 15, the semantic interpretation that assigns the value “checking” to target_account is attached to the grammar rule containing the word “to,” while the semantic attachment that assigns the value “checking” to source_account is attached to the grammar rule containing the word “from.”
<rule id = " source_account" >
from
<one-of>
<item> <tag>$.source_account= " savings" </tag> savings </item>
<item><tag>$.source_account= " checking" </tag>checking </item>
</one-of>
</rule><rule id = " target_account" >
to
<one-of>
<item> <tag> $.target_account= " savings" </tag> savings </item>
<item> <tag>$.target_account= " checking" </tag> checking </item>
</one-of>
</rule>Figure 15: Overlapping grammars resolved by semantic interpretation commands
To understand a caller’s utterance, VoiceXML and the underlying speech recognition system use information from the following sources:
Additional information can be used to improve the understanding of what the caller says. This information can be extracted from the utterances of typical users, and the context in which the caller’s utterance appears.
Two statistical techniques that can be used with VoiceXML are N-grams and classifiers.
A large vocabulary of words may be represented by N-grams. N-grams encode sequences of words that appear together in a spoken language such as English, Spanish, or Chinese, restricted to a specific domain such as the medical, legal, or scientific area. If N=2, then pairs of words are used to describe utterances in the domain. If N=3, then triples of words are used. To create an N-gram language model, the developer collects large samples of text from the specific domain. Statistical algorithms construct the collection of N-grams and their frequencies within the text samples and represent the language model in a data structure that can be searched quickly. Rather than trying to write a grammar for a large collection of user utterances, developers can generate and use an N-gram language model.
Classifiers have long been used in information storage and retrieval applications to classify documents into categories or topics for easy retrieval. Classifiers have also been used in vision applications to categorize—that is, sort—graphical objects based on color or visual patterns. AT&T has introduced a speech classification system [http://www.research.att.com/~algor/hmihy/papers.html] called HMIHY—How May I Help You—that applies classification technology to convert a caller’s initial request into one of several topics and route the call accordingly.
Unlike typical speech dialog systems where callers must navigate through menu choices by saying prespecified words, HMIHY callers simply say what they want. The HMIHY system recognizes key words and phrases and routes the call to a category-specific subsystem, either another artificial agent or a human operator. Classifiers enable speakers to verbalize a request in their own words rather than select prespecified words from menu hierarchies. This shifts the burden from callers to the computer. Callers find the system much easier to use than traditional menu-based speech systems because they spend less time getting what they need. The HMIHY system has been successfully managing hundreds of calls each day to AT&T customer care centers in a trial that began in mid-November of last year.
No longer must a developer structure the menu hierarchy and write all of the prompts, grammars, and error handlers. But the developer must pay a price. The developer must (1) collect large corpora of typical caller utterances, (2) annotate each with the appropriate topic, and (3) train the classifier to classify (match) caller utterances to topics. Furthermore, if the topics requested by callers change, then the whole process may need to be repeated.
Grammars with embedded tags use context within the caller’s utterance to extract and translate words. Understanding a caller’s utterance can be improved by using additional sources of context, including:
Advanced artificial intelligence algorithms can replace a simple VoiceXML grammar to provide an improved understanding of a user’s utterance. The use of caller profiles and context is an ongoing research topic that will soon provide additional power to VoiceXML.
Coding voice applications for both novice and experienced callers is easy. Specifying the right prompts, grammars, and event handlers is difficult. Many designers use iterative usability testing—a repeated cycle in which users test the system and designers refine prompts and grammars.
As the number of words and phrases in a grammar grows, more time is required by the speech recognition engine to recognize the words spoken by the user. Developers manage the trade-off between vocabulary size and the number of nomatch events by conducting usability tests to measure the speed of the speech recognition engine and the number of nomatch events. Developers attempt to minimize both by fine-tuning the prompt wordings and vocabulary words. In addition, universal commands may cause performance problems because the speech recognition engine must also consider universal commands during each recognition phase. Developers try to minimize the number of universal commands to improve the recognition engine’s response time, as well as to decrease the number of commands that an experienced user must remember.
During each testing phase, the developers carefully analyze the dialog. For each field in which the caller asks for help or causes a nomatch or noinput event, consider revising the prompt to be more specific and consider adding event handlers. For each field in which a nomatch event occurs, consider adding the words spoken by the caller to the grammar for the field.
Developers use the <log> tag to record timestamped events and values to a history log. A software log analysis application summarizes and calculates performance measurements. Performance measurements are objective measurements of how many actions the caller is able to complete in a specified period of time and how many times a caller attempted to perform an action without success. Without performance measurements, it is difficult to know if code changes actually improve performance.
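The sketch below illustrates one way to instrument the month field of Figure 1 with <log> elements; the label "perf" and the message wording are assumptions about what a particular log analysis application might expect.

<field name = "month">
<prompt> What month? </prompt>
<!-- grammar as in Figure 1 -->
<catch event = "nomatch noinput">
<!-- Record each failed attempt so the log analysis application can count failures per field -->
<log label = "perf"> month field failure </log>
<reprompt/>
</catch>
<filled>
<!-- Record successful completion and the recognized value -->
<log label = "perf"> month field filled with <value expr = "month"/> </log>
</filled>
</field>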
Developers interview callers after each test to collect caller preference measurements. Preference measurements are subjective measurements of how users like various aspects of the voice application. Preference measurements indicate how likely a caller is to use the voice application again.
It is easy to learn how to use VoiceXML syntax to create a working voice application, but effort is required to develop a voice application that is both efficient and enjoyable for novice, average, and experienced users. Novice users are presented with a system-directed dialog that guides them through the application and helps them achieve the desired goals. By extending grammars and enabling barge-in, average callers can increase the pace of the dialog. By using form-level grammars and encouraging callers to take a more active role in the dialog, experienced callers direct the application to achieve the desired results using a type of mixed-initiative dialog. Finally, iterative testing and user interface refinement are necessary for developing efficient and enjoyable voice applications.
Acknowledgements: The author acknowledges the many contributors to the VoiceXML language. Thanks first to the founders of the VoiceXML Forum who specified VoiceXML 1.0, and then to the members of the World Wide Web Consortium who specified VoiceXML 2.0.