Entering Data by Speaking

Telephones and cell phones are everywhere. People call from wherever they are, whenever they want to. But when calling businesses, callers are frequently put on hold. To shorten long hold queues, several businesses have installed interactive voice response (IVR) systems that collect data by asking callers carefully formatted questions. Some systems require callers to answer questions by pressing the buttons on touchtone phones. These systems have several disadvantages:

- Callers must constantly reposition the phone between their ears and the front of their faces so that they can see the buttons they must press. Speech recognition removes the need to keep moving the telephone.
- Callers must translate options to numbers. For example, "For accounting, press 1; for human resources, press 2; for sales, press 3..." This requires callers to select the appropriate option and then translate that option to a number before pressing the corresponding button on the telephone keypad. Speech recognition simplifies this process. Callers simply speak the answer rather than remembering the options, selecting the best one, and, finally, translating the selection to a number.
- Because of the limitations of human short-term memory, developers structure menus to be deep and narrow (few options at each level, but many levels) rather than shallow and broad. Callers often "get lost" in these deep menu hierarchies and cannot find their way to the desired option.

Speech-recognition technology solves these problems but, like any new technology, creates some new ones. This article discusses three problems with using telephones to enter data by speaking, and what you can do about them.

Problem 1: Callers Don't Know When to Speak and What to Say

Currently, many callers do not have experience using a telephony application with automatic speech recognition. They don't know that they can speak. Often, they are "tongue-tied" about what to say. Here are some hints to help callers say the right thing at the right time:

Inform callers that they may speak. At the beginning of the application, tell callers that the application can understand human speech and that they should respond to questions by speaking the answers. For example:

"Welcome to the Ajax banking application. You may answer questions by speaking directly into your phone."
Encourage the caller to respond to a prompt by speaking. Phrase the prompt to encourage the caller to speak. Use words in the prompt such as "say" or "speak" instead of "enter." For example:

"Say your name."

rather than

"Enter your name."
Tell the caller what to say. Prompts should lead the callers to say words and phrases in the corresponding grammar. If the caller is not familiar with the appropriate words and phrases, include them as part of the prompt. For example:

"Which account? Savings or checking?"

If the caller is already familiar with the individual words, shorten the prompt. For example:

"Which day of the week?"

instead of:

"Which day? Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, or Saturday?"
Encourage experienced callers to barge in. Novice callers usually listen to the entire prompt. They need to hear all of the instructions and options before making their selection. However, experienced callers may resent listening to complete prompts, especially if they use the application frequently. In conversations between people, barging-in may be rude. However, computers are never insulted when callers barge in. Inform callers that they may bypass lengthy prompts by "barging-in"—speaking before the prompt ends. For example:

"You may speak at any time, even if the computer is speaking."

Insert pauses in the prompt wording where expert, average, and novice callers may speak. A pause signals speakers that they should speak. Callers with different experience levels may barge in during different pauses.

"Color?"

(Pause, so experts can barge in here. They know the question and the appropriate responses.)

"Say the color you want."

(Pause, so an average caller who already knows the allowable options can barge in.)

"Green, red, or blue?"

(The novice caller responds after hearing the allowable options.)

Callers quickly learn that barging in speeds up the conversation and lets them complete their tasks sooner.
Continue to encourage the caller to speak. Sometimes, the caller says a word that is not in the grammar of allowable words. In these cases, a useful strategy is to reveal more information and instruction each time the caller is prompted again for the same information. For example:
- Level 1. Present a short prompt, asking the caller to respond.
- Level 2. Present a short description of what the caller should say.
- Level 3. Present an example of what the caller should do.
- Level 4. Offer to present short segments of a verbal tutorial to the caller, or transfer the caller to a human operator to resolve the caller's problem.
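
The four levels above can be sketched as a list of prompts indexed by the number of failed attempts. This is only an illustrative Python sketch: the prompt wording, the two-word grammar, and the names `ESCALATING_PROMPTS`, `next_prompt`, and `collect` are assumptions made for the example, and a real application would obtain the caller's words from a speech-recognition engine rather than from a list of strings.

```python
# Hypothetical prompt wording for one field. Only the four-level structure
# comes from the article; the text of each prompt is an assumption.
ESCALATING_PROMPTS = [
    "Which account?",                                  # Level 1: short prompt
    "Say the name of the account you want.",           # Level 2: what to say
    "For example, say 'savings' or 'checking'.",       # Level 3: an example
    "Say 'help' for a short tutorial, or 'operator' "  # Level 4: tutorial or
    "to speak with a person.",                         #          human operator
]

# Hypothetical grammar of allowable words for this field.
GRAMMAR = {"savings", "checking"}


def next_prompt(failed_attempts):
    """Return the prompt for this attempt, staying at the last level once reached."""
    level = min(failed_attempts, len(ESCALATING_PROMPTS) - 1)
    return ESCALATING_PROMPTS[level]


def collect(utterances):
    """Simulate the dialog over a sequence of caller utterances.

    Returns (matched_word_or_None, list_of_prompts_played).
    """
    prompts_played = []
    for failed_attempts, utterance in enumerate(utterances):
        prompts_played.append(next_prompt(failed_attempts))
        word = utterance.strip().lower()
        if word in GRAMMAR:
            return word, prompts_played
    return None, prompts_played


if __name__ == "__main__":
    value, played = collect(["hmm", "my money", "Savings"])
    for prompt in played:
        print("System:", prompt)
    print("Recognized:", value)
```

Keeping the prompts in one ordered list makes it easy to reword a single level after usability testing without touching the dialog logic.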

Problem 2: Speech Recognition Errors Break the Dialog's Rhythm and Slow Down the Dialog Between the Caller and the Computer

Occasionally, automatic speech-recognition systems make errors, just as humans make speech-recognition errors in daily conversations. However, the number of errors made by a speech recognition engine can be minimized:

- Avoid words in the grammar that are confusing for the speech recognition engine. If two words are frequently confused by the speech-recognition engine, modify the grammar to use different words or phrases that are distinguished easily. For example, change the vocabulary from {"green", "gray"} to {"forest green", "charcoal gray"}.
- The grammar should contain words frequently spoken by callers. Wizard of Oz tests and usability tests are necessary to determine which words callers frequently speak. (During a Wizard of Oz test, the developer pretends to be the computer, asks the caller questions, and records the caller's responses.)
- Keep the grammar as small as possible. If the grammar is too large, the speech-recognition system may become confused about which of several words matches the caller's utterance. This confusion may result in a mismatch event, with the computer asking the user clarifying questions. Smaller grammars enable the speech-recognition engine to return accurate results more quickly.
- Validate low-confidence recognitions. When the speech recognizer returns a result with a low confidence rating, confirm the result by asking the caller a yes/no question. For example, if the speech-recognition engine returns a low confidence score for the word "Austin," ask the caller to confirm: "Did you say Austin?"

If the confidence scores are similar for two words in the vocabulary, ask a yes/no question about the more frequently used word. For example, suppose that green is the most frequent answer to the question "Which color? Blue or green?" and that the confidence scores for blue and green are about the same. Validate by prompting the caller with the yes/no question: "Did you say green?"

Frequently, people ask these types of questions in daily conversations, and think nothing of answering these questions with a simple "yes" or "no." Most callers are never aware that a real or suspected speech-recognition error occurred. Also, callers often say the correct value after responding to a yes/no question.
- The caller mumbles or speaks a word not in the vocabulary. Prompt the caller again, but use a different wording to encourage the caller to say one of the words in the vocabulary. For example:
  
  Prompt: Which account? Savings or checking?
  
  Caller: Hmmm.
  
  Prompt: Do you want to access your savings or checking account?
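
The validation rules above (confirm a low-confidence result, and break a near tie in favor of the more frequently used word) can be sketched as a small decision function. This is a minimal Python sketch under stated assumptions: the 0-to-1 score scale, the threshold values `CONFIRM_BELOW` and `TIE_MARGIN`, and the function name `choose_action` are all hypothetical, and real recognizers report confidence in their own formats.

```python
# Hypothetical thresholds; tune them against real recognition data.
CONFIRM_BELOW = 0.6  # best score below this triggers a yes/no confirmation
TIE_MARGIN = 0.05    # top two scores this close count as "about the same"


def choose_action(scores, usage_counts):
    """Decide how to handle a recognition result.

    scores:       dict mapping each candidate word to its confidence score
    usage_counts: dict mapping words to how often callers have chosen them

    Returns ("accept", word), ("confirm", word), or ("reprompt", None).
    """
    if not scores:
        # Nothing matched the grammar: prompt again with different wording.
        return ("reprompt", None)

    ranked = sorted(scores, key=scores.get, reverse=True)
    best = ranked[0]

    if len(ranked) > 1 and scores[best] - scores[ranked[1]] <= TIE_MARGIN:
        # Near tie: ask about the more frequently used of the top two words.
        candidate = max(ranked[:2], key=lambda word: usage_counts.get(word, 0))
        return ("confirm", candidate)

    if scores[best] < CONFIRM_BELOW:
        # Clear winner, but low confidence: confirm it with a yes/no question.
        return ("confirm", best)

    return ("accept", best)


def confirmation_prompt(word):
    """Build the yes/no question used to validate a result."""
    return f"Did you say {word}?"


if __name__ == "__main__":
    action, word = choose_action({"blue": 0.52, "green": 0.50},
                                 {"green": 120, "blue": 40})
    if action == "confirm":
        print(confirmation_prompt(word))
```

Returning an action rather than playing the prompt directly keeps the decision logic separate from the wording, so either can be adjusted independently during usability testing.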

Problem 3: When Detected, Callers Cannot Easily Correct Misunderstandings

Callers often cannot correct an incorrect value spoken earlier in the dialog. Form fill-in dialogs prompt callers for values for a sequence of fields, summarize the values in a final prompt, and then ask: "Is that correct?" For example, after soliciting values for departure city, arrival city, and travel date, the application summarizes them as follows:

"You want to travel from Boston to Chicago on July 15; is that correct?"

What happens if the caller says no? The summary has an invalid value, either because the caller spoke the wrong answer or because the speech-recognition engine misunderstood what the caller said.

Accept verbal corrections. Listen for keywords to identify the corrected values. For example, the caller could say the following:

"No, from Austin."

Note that callers frequently respond with the same phrasing that the system used. In the example above, prepositions such as "from," "to," and "on" identify each value. Callers frequently speak these words to indicate which values should be corrected.

Sometimes, the caller will repeat the entire summary and emphasize the corrected word and its preposition. For example, the caller could say the following:

"No, I want to travel from Austin to Chicago on July 15."

Here the caller emphasizes the words "from Austin." By comparing the volume of words and their corresponding prepositions, it may be possible to guess which word is being corrected.
Escape. Provide callers with rescue navigational commands. Enable the caller to say "back up" or "go home" rather than force the caller to "reboot" by hanging up and redialing.
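
The keyword-based correction described above can be sketched by scanning the caller's reply for the prepositions used in the summary and updating only the fields that follow them. The Python sketch below is illustrative only: the field names, the tiny city and month vocabularies, and the function `apply_correction` are assumptions, and it works on already-recognized text rather than audio, so it does not attempt the volume comparison.

```python
import re

# Hypothetical vocabularies; a real application would use its own grammars.
CITIES = {"austin", "boston", "chicago"}
MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}


def _parse_city(tokens):
    """Return the next token if it is a known city, else None."""
    if tokens and tokens[0].lower() in CITIES:
        return tokens[0]
    return None


def _parse_date(tokens):
    """Return 'Month day' if the next two tokens look like a date, else None."""
    if len(tokens) >= 2 and tokens[0].lower() in MONTHS and tokens[1].isdigit():
        return f"{tokens[0]} {tokens[1]}"
    return None


# Each preposition from the summary prompt identifies one field.
PREPOSITIONS = {
    "from": ("departure_city", _parse_city),
    "to": ("arrival_city", _parse_city),
    "on": ("travel_date", _parse_date),
}


def apply_correction(itinerary, utterance):
    """Return a copy of itinerary with any corrected values applied."""
    tokens = re.findall(r"[A-Za-z0-9]+", utterance)
    updated = dict(itinerary)
    for index, token in enumerate(tokens):
        entry = PREPOSITIONS.get(token.lower())
        if entry is None:
            continue
        field, parse = entry
        value = parse(tokens[index + 1:])
        # Ignore prepositions not followed by a valid value,
        # such as the "to" in "I want to travel".
        if value is not None:
            updated[field] = value
    return updated


if __name__ == "__main__":
    trip = {"departure_city": "Boston", "arrival_city": "Chicago",
            "travel_date": "July 15"}
    print(apply_correction(trip, "No, from Austin."))
```

Checking each candidate value against the field's vocabulary is what keeps an incidental "to" or "on" in the caller's sentence from overwriting a field.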

Other Problems

Sometimes, callers say too much. They forget that they are talking to a machine, and speak as if they are talking to a human who can understand their every word. Although dictation systems can translate long sentences to text, the system may not "understand" the meaning of the text. Callers may need to be encouraged to answer the question and to not volunteer additional information. For example:

System: "Color?" (pause) "Say the color you want." (pause) "Green, red, or blue?"

Caller: "I think I like green better than red or blue."

This phrasing confuses the speech-recognition system, which hears the words "green," "red," and "blue" and cannot tell which color the caller wants. Encourage the caller to answer the question simply and directly:

System: "I'm sorry, I didn't understand you. Just say the color you want: green, red, or blue."
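
One simple way to detect this situation is to count how many distinct grammar words appear in the recognized text and to reprompt unless exactly one is present. The Python sketch below assumes that the recognizer returns the utterance as text; the function names `interpret` and `reprompt_for_color` are hypothetical.

```python
import re


def interpret(utterance, grammar):
    """Accept the utterance only if it contains exactly one grammar word.

    Returns ("accept", word) or ("reprompt", None).
    """
    tokens = re.findall(r"[a-z]+", utterance.lower())
    # dict.fromkeys removes repeated words while preserving their order.
    heard = [word for word in dict.fromkeys(tokens) if word in grammar]
    if len(heard) == 1:
        return ("accept", heard[0])
    # Zero matches or several competing matches: ask again.
    return ("reprompt", None)


def reprompt_for_color(options):
    """Build a reprompt that lists the allowable answers in order."""
    listed = ", ".join(options[:-1]) + ", or " + options[-1]
    return ("I'm sorry, I didn't understand you. "
            f"Just say the color you want: {listed}.")


if __name__ == "__main__":
    colors = ["green", "red", "blue"]
    action, word = interpret("I think I like green better than red or blue.",
                             set(colors))
    if action == "reprompt":
        print("System:", reprompt_for_color(colors))
```

A reply such as "green, please" still passes this check, because only one word from the grammar is present.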

The key to successful speech data entry programs is careful dialog design and iterative usability testing. The best practice techniques for dialog design can both accelerate the data entry process and make entering data by speaking into a phone more enjoyable. But there are no guarantees. Developers must test with a vengeance, iteratively modifying the dialog, and verifying that the modification does indeed improve the process.

NOTE

For additional suggestions on how to improve the quality and experience of using telephony applications, see the author's book, VoiceXML: An Introduction to Building Voice Applications (Prentice Hall, 2002, ISBN: 0130092622).
