Designing dialogs for verbal user-computer interactions is still an art as much as it is a science. Speech application developers test applications iteratively, refining dialogs to minimize communication breakdowns, decrease the time callers need to complete tasks, and improve the naturalness of the conversation.
From the limited experience to date, some guiding principles for designing dialogs have become apparent. These principles are used to generate human factors guidelines for verbal user interfaces. While guidelines often serve as a checklist of what to do and what not to do in a verbal user interface, guiding principles motivate and encourage good dialog designs.
Two early guiding principles have emerged: consistency and symmetry. Presentation consistency refers to the similar structure and format of prompts presented to the caller. Response consistency refers to the similar phrasing of words spoken by the caller in response to the prompts. Symmetry is a type of consistency between prompts and responses, in which callers mimic or parrot the structure, format, and words they hear in the prompt.
Presentation Consistency
Unlike creative writing, in which consistency can sometimes be boring, presentation consistency enables callers to predict what they will hear, so they can be ready to respond appropriately. More experienced callers can bypass much of the prompt to accelerate the dialog. Here are two examples of presentation consistency.
-
Use parallel syntactic constructs in prompts. It is easier for callers to select words from a menu when the menu's options are presented consistently. It is easier for callers to provide values for multiple fields in a form if the prompts are presented consistently. For example, it may not be a good idea to use the following inconsistent syntactic constructs:
Say your name.
What is your birth date?
Say your street address.
In which city do you live?
Please enter your Zip code now.
Instead, use parallel syntactic constructs such as the following:
Your name?
Your birth date?
Your street address?
Your city?
Your Zip code?
-
Use a consistent format for a prompt in a menu. Presentation consistency enables callers to predict what will be spoken and how to react proactively. Using a consistent format for prompts also will help callers select options more quickly. Structure all menu prompts so that experienced callers can barge in (interrupt the voice menu options) as soon as they hear the menu name, average callers can barge in after they hear a question, and novice callers respond after they hear the menu option list. A menu prompt should consist of the following sequence:
-
Speak the menu name. For important menus, the dialog designer should include the menu name in the menu's prompt. The menu name serves as a landmark. A landmark is a speech or non-speech cue that marks a specific location within the dialog structure. By providing the menu name, such as "main menu" or "thermostat," callers can jump to this menu by speaking the menu name, or they can return to this menu if they get confused or lost. Also, repeating the menu name to the caller confirms that the caller has reached the correct menu.
-
Ask a question. Often, this can be achieved with two or three words. This should be enough for experienced callers to say one of the commands without listening to the enumerated options. Novice callers will listen to the enumerated options before speaking their selection.
-
Enumerate options. List the options so novice callers can hear and select their desired options.
-
For example:
Color.
(This is the name of the menu. Expert callers can barge in here because they know the question and the allowable options.)
Say the color you want.
(An average caller can listen to the question and barge in if they know the answer.)
Green, red, or blue?
(A novice caller listens to all of the menu options before responding.)
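Both examples of presentation consistency can be generated rather than written by hand, which makes it harder for an inconsistent prompt to slip in. The following is a minimal Python sketch, not part of any VoiceXML platform; the function names form_prompts, enumerate_options, and menu_prompt are hypothetical and chosen only for illustration.

```python
def form_prompts(field_labels):
    """Build one prompt per form field, all with the same "Your X?" shape."""
    return [f"Your {label}?" for label in field_labels]


def enumerate_options(options):
    """Render the option list the same way for every menu."""
    if len(options) < 2:
        raise ValueError("a menu needs at least two options")
    if len(options) == 2:
        text = f"{options[0]} or {options[1]}"
    else:
        text = ", ".join(options[:-1]) + f", or {options[-1]}"
    return text[0].upper() + text[1:] + "?"


def menu_prompt(name, question, options):
    """Return the prompt segments in a fixed order: name, question, options.

    Keeping the segments separate marks the points at which expert,
    average, and novice callers are expected to barge in or respond.
    """
    return [f"{name}.", question, enumerate_options(options)]


if __name__ == "__main__":
    for line in form_prompts(["name", "birth date", "street address", "city", "Zip code"]):
        print(line)
    for segment in menu_prompt("Color", "Say the color you want.", ["green", "red", "blue"]):
        print(segment)
```

Because every menu passes through the same function, the name-question-options order cannot vary from one menu to the next.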
Other types of presentation consistency include consistent formats for help and error messages, formats for items in a form, validation of the caller's input, and the use of audio icons to indicate that "it is the caller's turn to speak," "the computer is working," and "this is a hyperlink that can be invoked by speaking its name."
If the style and format of dialogs are consistent, then an application's persona (the personality of the application) will emerge as consistent and helpful rather than random and confused.
Response Consistency
Callers can complete their tasks more quickly if they can respond consistently. The response becomes more automatic if callers can use the same words and phrases in similar situations. Here are several ways to increase response consistency:
-
Consistent use of commands. Use the same command for the same task in all applications. For example, use the command goodbye in all applications instead of using goodbye in one application, stop in another application, and hang up in a third application.
-
Consistent format for similar data. In addition to presentation consistency, callers prefer to respond to similar prompts by speaking the same words in the same order. For example, when asked for a birth date, start date, end date, or any other kind of date, the caller should be encouraged to respond by using the same format. In North America, this sequence is usually the month, the day, and finally the year: the name of the month, the day in ordinal format ("thirteenth"), and then the year (often only its last two digits, such as "02").
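One way to keep a date format consistent is to route every date the application speaks, whether in an example, a confirmation, or a help message, through a single helper. The sketch below is a hedged illustration in Python; spoken_date is a hypothetical name, and the output layout simply follows the North American order described above.

```python
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]

ORDINALS = [
    "first", "second", "third", "fourth", "fifth", "sixth", "seventh",
    "eighth", "ninth", "tenth", "eleventh", "twelfth", "thirteenth",
    "fourteenth", "fifteenth", "sixteenth", "seventeenth", "eighteenth",
    "nineteenth", "twentieth", "twenty-first", "twenty-second",
    "twenty-third", "twenty-fourth", "twenty-fifth", "twenty-sixth",
    "twenty-seventh", "twenty-eighth", "twenty-ninth", "thirtieth",
    "thirty-first",
]


def spoken_date(d):
    """Render a date as month name, ordinal day, and two-digit year."""
    return f"{MONTHS[d.month - 1]} {ORDINALS[d.day - 1]}, {d.year % 100:02d}"


if __name__ == "__main__":
    # The same helper serves birth dates, start dates, and end dates.
    print(spoken_date(date(2002, 3, 13)))
```

Callers who hear dates in one format throughout the application are more likely, by the principle of symmetry, to speak dates in that format.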
If applications reuse standard dialog modules, then callers will learn how to interact with these modules when they are first encountered, and use that knowledge whenever the modules are re-encountered. For example, speaking a credit card number in four chunks of four digits or speaking a North American telephone number in chunks of three, three, and four digits accelerates the dialog and makes the experience more natural for the caller. Most VoiceXML vendors have collections of dialog modules that, when reused, accelerate not only the callers' dialogs, but also the time to implement the application.
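The chunking patterns mentioned above can be captured once and reused, in the spirit of a standard dialog module. This is a minimal Python sketch under stated assumptions: chunk_digits, CREDIT_CARD, and NANP_PHONE are hypothetical names, and the sample numbers are placeholders rather than real account or telephone numbers.

```python
CREDIT_CARD = (4, 4, 4, 4)  # four chunks of four digits
NANP_PHONE = (3, 3, 4)      # North American number: three, three, four


def chunk_digits(digits, sizes):
    """Split a digit string into chunks of the given sizes.

    Non-digit characters such as spaces and hyphens are ignored, so the
    same helper accepts input however it was originally written down.
    """
    cleaned = "".join(ch for ch in digits if ch.isdigit())
    if len(cleaned) != sum(sizes):
        raise ValueError(f"expected {sum(sizes)} digits, got {len(cleaned)}")
    chunks, start = [], 0
    for size in sizes:
        chunks.append(cleaned[start:start + size])
        start += size
    return chunks


if __name__ == "__main__":
    print(chunk_digits("4111 1111 1111 1111", CREDIT_CARD))
    print(chunk_digits("617-555-0123", NANP_PHONE))
```

An application could read the chunks back with a short pause between them when confirming the number, so that the confirmation follows the same rhythm the caller used.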
Symmetry
The principle of symmetry suggests that callers respond to prompts in the same style and wording used by the prompt. If the prompt is lengthy, the caller's responses tend to be lengthy. If the prompt is vague, then the caller tends to respond with vague answers. In order to encourage callers to say the words and phrases covered by a grammar, the prompt should be short, precise, and instructive. Paraphrasing the age-old Golden Rule: "Do unto the caller as you would have the caller do unto you."
Designers specify two key items for each turn of an application-directed dialog:
-
Specify the grammar to contain the words and phrases that minimize the errors made by the automatic speech recognizer.
-
Formulate the prompts to encourage the caller to speak the words and phrases in the grammar.
After specifying the grammar, apply the principle of symmetry by encouraging the caller to parrot words, phrases, and patterns used within the prompt.
-
Prompt the caller with a menu of words from the grammar. The caller can respond by selecting one of the words from the menu rather than guessing what words are in the grammar. For example, suppose the grammar contains two words: "green" and "gray." Avoid using the following prompt, which invites a response that is not in the grammar:
Prompt: What color?
Response: Chartreuse.
Instead, enumerate the key words from the grammar:
Prompt: What color? Green or gray?
Response: Green.
Selecting from a verbal menu is cognitively easier for the caller than recalling a word from memory. Verbal menus also decrease speech-recognition engine mismatches, which happen when callers say words not in the current grammar.
-
Prompt the caller with a pattern that the caller can mimic. Consider the following prompt to confirm values previously spoken by the caller:
Prompt: Do you want to fly from Boston to Chicago on Thursday?
Response: No, I want to fly from Austin to Chicago on Thursday.
Words in the pattern are useful for assigning spoken values to field names. In the above example, the spoken word "Austin" is assigned to the departure_city field because of the pattern word "from," whereas "Chicago" is assigned to the arrival_city field because of the pattern word "to."
-
Prompt the caller with a short question. This encourages the caller to respond with a short response. For example, avoid using the following:
Prompt: Please say your first name.
Response: My first name is Jim.
Instead, use the following:
Prompt: Your first name?
Response: Jim.
Short responses are easier for the speech-recognition engine to recognize, because the engine does not have to pick keywords out of a longer utterance. A short prompt may also avoid false expectations about what the speech-recognition engine can recognize.
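The grammar mismatch and the pattern-word assignment described above can be illustrated with a few lines of Python. This is a simplified sketch operating on text, not on recognizer output: the names is_in_grammar, SLOT_PATTERNS, and fill_slots are hypothetical, the city and day vocabularies are invented for the example, and only single-word values are handled. The field names departure_city and arrival_city come from the example above; the day field is an added assumption.

```python
CITIES = {"austin", "boston", "chicago"}
DAYS = {"monday", "tuesday", "wednesday", "thursday", "friday",
        "saturday", "sunday"}

# Pattern word -> (field name, words allowed for that field).
SLOT_PATTERNS = {
    "from": ("departure_city", CITIES),
    "to": ("arrival_city", CITIES),
    "on": ("day", DAYS),
}


def is_in_grammar(response, grammar):
    """Return True if a one-word response is covered by the grammar."""
    word = response.strip().strip(".,?!").lower()
    return word in {entry.lower() for entry in grammar}


def fill_slots(utterance):
    """Assign values to fields using the pattern word that precedes them.

    A value is accepted only if it belongs to the field's vocabulary, so
    the "to" in "want to fly" does not fill the arrival_city field.
    """
    words = [w.strip(".,?!") for w in utterance.split()]
    slots = {}
    for marker, value in zip(words, words[1:]):
        rule = SLOT_PATTERNS.get(marker.lower())
        if rule and value.lower() in rule[1]:
            slots[rule[0]] = value
    return slots


if __name__ == "__main__":
    print(is_in_grammar("Chartreuse.", ["green", "gray"]))
    print(is_in_grammar("Green.", ["green", "gray"]))
    print(fill_slots("No, I want to fly from Austin to Chicago on Thursday."))
```

The sketch works only because the caller mirrored the "from ... to ... on ..." pattern of the prompt, which is the behavior the principle of symmetry tries to encourage.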
Other Guiding Principles
Consistency and symmetry are two guiding principles for designing voice user interfaces. Other guiding principles include the following:
-
Callers learn by doing. Psychologists tell us that one of the best methods for learning how to do something is to do it. This "sink-or-swim" approach is especially effective when callers are assured that if they start to sink, they will receive help to get back on track. Do not present long tutorials at the beginning of a telephony application. Callers remember little of the specific instructions, and resent the time required to listen to the tutorial. Instead, encourage the callers to explore. Dialog designers encourage callers to explore the speech application in at least two ways:
-
Prompts suggest words that the caller may say to explore new portions of the application. This encourages the caller to jump off into "unknown waters." For example, the following informs the user of physical objects that can be controlled from the telephone:
"What action? Thermostat, lights, security, kitchen appliances, or alarm clock?"
-
Specify event handlers to help the caller when the speech-recognition system fails to understand what the caller says. This pulls the "sinking" caller back to the surface.
-
Establish specific performance criteria. Developers need to conduct usability tests and measure performance criteria to determine whether changes to an application have an overall beneficial or harmful effect. It is easy to be misled by a change that solves a specific problem yet has an overall harmful effect on a speech application.
-
Leverage the caller's knowledge. A caller familiar with the application domain should find the application easy to learn and natural to use. This occurs when the caller's conceptual model of the domain (the caller's understanding of the domain objects, relationships, commands, and constraints from real life) matches the application's conceptual model. Talk with several prospective callers before constructing the application, and use the terms, phrases, and names that they frequently use.
-
Optimize the frequent case. Make it easy to do simple things and possible to do complex things. Consider airport security: fliers with no suspicious objects go to an express line, whereas fliers with suspicious objects are asked to go to a "trouble" line. Frequent fliers quickly learn how to avoid the trouble lines so they can get through the security check faster. Organize the call flow so callers can perform the simple and frequent tasks quickly and be routed to specific paths in the call flow for infrequent and difficult tasks.
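The event handlers mentioned under learning by doing, which pull a "sinking" caller back to the surface, are often written so that each consecutive recognition failure produces more detailed help. In VoiceXML this role is played by event handlers such as nomatch and noinput; the Python sketch below only simulates the idea over scripted responses. The names recovery_prompt and run_turn, the help wording, and the limit of three failures are all assumptions made for illustration.

```python
def recovery_prompt(failure_count, help_prompts):
    """Pick the help text for the Nth consecutive failure (1-based).

    Once the list is exhausted, the most detailed prompt is repeated.
    """
    if failure_count < 1:
        raise ValueError("failure_count starts at 1")
    index = min(failure_count, len(help_prompts)) - 1
    return help_prompts[index]


def run_turn(prompt, grammar, responses, help_prompts, max_failures=3):
    """Simulate one dialog turn; return (recognized word or None, prompts spoken)."""
    spoken = [prompt]
    failures = 0
    for response in responses:
        word = response.strip().lower()
        if word in grammar:
            return word, spoken
        failures += 1
        if failures >= max_failures:
            break  # a real application might hand the caller to an operator here
        spoken.append(recovery_prompt(failures, help_prompts))
    return None, spoken


if __name__ == "__main__":
    grammar = {"thermostat", "lights", "security", "kitchen appliances", "alarm clock"}
    help_prompts = [
        "Sorry, what action?",
        "Please say thermostat, lights, security, kitchen appliances, or alarm clock.",
    ]
    result, spoken = run_turn(
        "What action? Thermostat, lights, security, kitchen appliances, or alarm clock?",
        grammar,
        ["um", "the lamp", "lights"],
        help_prompts,
    )
    print(result)
    for line in spoken:
        print(line)
```

Keeping the first recovery prompt short optimizes the frequent case, in which a caller simply misspoke, and reserves the full option list for callers who are actually lost.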
NOTE
For additional suggestions on how to improve the quality and experience of using telephony applications, see the author's book, VoiceXML: An Introduction to Building Voice Applications (Prentice Hall, 2002, ISBN: 0130092622).