
VoiceXML: Chapter 3 Exercises

 


VoiceXML: Introduction to Developing Speech Applications


3-1 Section 3.2 identifies two frequent problems with touch-tone input: lost in space and time-consuming menus.

A. List five ways to minimize the lost-in-space problem.

  1. Provide a standard key or a voice command that takes the caller back to the main menu.
  2. Provide a standard key or a voice command that takes the caller to the previous menu.
  3. Minimize the number of menu levels.
  4. Confirm the user's choice before jumping to the child document.
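
Items 1 and 2 can be sketched in a few lines of Python. This is only an illustration, not code from the book: the class name `MenuNavigator`, the command wordings "main menu" and "previous menu", and the sample menu tree are all assumptions made for the example. The idea is that the application keeps a stack of the menus visited so far, so both commands work from any menu.

```python
class MenuNavigator:
    """Tracks the caller's position in a menu tree (illustrative sketch)."""

    # Hypothetical wording for the two universal commands.
    MAIN_MENU = "main menu"
    PREVIOUS = "previous menu"

    def __init__(self, tree, root="main"):
        self.tree = tree      # maps a menu name to {choice: child menu name}
        self.root = root
        self.path = [root]    # stack of menus visited so far

    @property
    def current(self):
        return self.path[-1]

    def handle(self, utterance):
        """Apply one caller utterance and return the resulting menu name."""
        word = utterance.strip().lower()
        if word == self.MAIN_MENU:
            self.path = [self.root]              # jump straight to the root
        elif word == self.PREVIOUS:
            if len(self.path) > 1:
                self.path.pop()                  # step back one level
        elif word in self.tree.get(self.current, {}):
            self.path.append(self.tree[self.current][word])
        # Unrecognized input leaves the caller where they are.
        return self.current


# Hypothetical two-level menu tree used only for this example.
MENUS = {
    "main": {"banking": "banking", "weather": "weather"},
    "banking": {"balance": "balance", "transfer": "transfer"},
}

nav = MenuNavigator(MENUS)
nav.handle("banking")        # caller is now in the "banking" menu
nav.handle("balance")        # caller is now in the "balance" menu
nav.handle("previous menu")  # back in "banking"
nav.handle("main menu")      # back in "main"
```

Because the path is kept as a stack, "previous menu" never needs to know which menu the caller came from, and "main menu" simply resets the stack.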

B. List five ways to minimize the time-consuming menu problem.

  1. Avoid multiple menu levels.
  2. Avoid menus with many options.
  3. Provide detailed instructions only for the first menu.
  4. Provide detailed instructions only when the user asks for them.
  5. Enable the caller to say the names of several choices in sequence so the caller can bypass listening to each prompt along the menu path. (Similar to "type-ahead" for touch-tone IVR systems.)
  6. Position the most frequently used options near the root.
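
Item 5 can be illustrated with a short Python sketch. It is an assumption-laden example rather than anything prescribed by the book: the function name `follow_choices` and the sample menu tree are hypothetical, and the utterance is assumed to have already been split into a list of choice names by the recognizer.

```python
def follow_choices(tree, start, spoken_choices):
    """Walk the menu tree using several choices spoken in one utterance.

    Returns the menu reached and the list of choices that could not be
    applied (empty if every choice matched).
    """
    menu = start
    for i, choice in enumerate(spoken_choices):
        children = tree.get(menu, {})
        if choice not in children:
            # Stop at the last menu that matched; the application would
            # play that menu's prompt and ask again from there.
            return menu, spoken_choices[i:]
        menu = children[choice]
    return menu, []


# Hypothetical three-level menu tree used only for this example.
MENU_TREE = {
    "main": {"banking": "banking", "weather": "weather"},
    "banking": {"transfer": "transfer"},
    "transfer": {"savings": "transfer-savings"},
}

# The caller says "banking transfer savings" at the main menu and skips
# two intermediate prompts.
reached, leftover = follow_choices(MENU_TREE, "main", ["banking", "transfer", "savings"])
```

When a choice does not match, the caller lands on the deepest menu that did match instead of being sent back to the root, which keeps the shortcut from becoming a new way to get lost.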

3-2 Even in the best of situations, speech recognition engines sometimes fail to correctly convert speech to text.

List five ways application designers can minimize this problem.

  1. Increase processing power.
  2. Remove background noise.
  3. Use speaker adaptation.
  4. Increase grammar size.
  5. Improve the language model.
  6. Improve the acoustic model (by having the caller train the speech system).
  7. Use the caller's profile to locate the caller's acoustic model.

3-3 Even in the best of situations, speaker identification and verification sometimes fail to correctly identify/verify the caller.

List five ways application designers can minimize this problem.

Use the same techniques that are used to improve speech recognition.

3-4 Construct a language identification prompt by doing the following:

A. Choose an area somewhere in the world and identify the primary spoken national language and two secondary languages used in that area.

In Minneapolis, Minnesota, U.S.A., English is the main language. Several individuals also speak Norwegian.

B. Construct an explicit selection menu that asks callers which of the three languages they prefer to speak. Assume that a native speaker of the language will prerecord each prompt, so you only need to specify what the native speaker should say; that is, write the words for the prompt. Assume that the caller will respond with the English name of the spoken national language.

C. Repeat (b). Assume that an English speech synthesizer will speak each prompt. You will need to use a phoneme markup command to describe the phonemes in the prompt. The phoneme command is described in the Speech Synthesis Markup Language document on the W3C Voice Browser Working Group web site: http://www.w3.org/voice. The phoneme markup command requires the use of a phonetic language. Use the Worldbet phonetic language described on the Oregon Health & Science University web site at http://cslu.cse.ogi.edu/tutordemos/SpectrogramReading/ipa/ipachars.html.

D. Repeat (b). Assume that the speech recognition engine recognizes the local pronunciation of the names of the three spoken national languages. Write down the local pronunciation of the names of the two spoken national languages using the Worldbet phonetic language. (In practice, this notation is placed in a lexicon, which is used by the speech recognition engine to recognize words.) For example, "English" would be denoted as "E N l I S".

The Norwegian word for the Norwegian language is norsk, pronounced "n u s k".

The Norwegian word for English is engelsk, pronounced "e N e l e s kh".
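
As a rough illustration of how the Worldbet strings above might be attached to a synthesized prompt (part C), the following Python sketch builds an SSML fragment with one `phoneme` element per language name. Several things here are assumptions made for the example: the function name `language_prompt`, the prompt wording, and especially the alphabet value `x-worldbet`, which stands in for whatever vendor-specific alphabet name a given synthesizer actually accepts for Worldbet.

```python
import xml.etree.ElementTree as ET

# Worldbet pronunciations quoted from the answers above, keyed by the
# English name of each language.
LEXICON = {
    "Norwegian": "n u s k",
    "English": "e N e l e s kh",
}


def language_prompt(lexicon, alphabet="x-worldbet"):
    """Return an SSML string that names each language via <phoneme>.

    `alphabet` is a hypothetical vendor-specific alphabet name; check the
    synthesizer's documentation for the value it really expects.
    """
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": "en-US",
    })
    speak.text = "Please say the language you prefer: "
    names = list(lexicon)
    for i, name in enumerate(names):
        ph = ET.SubElement(speak, "phoneme",
                           {"alphabet": alphabet, "ph": lexicon[name]})
        ph.text = name  # fallback text if the phoneme string is ignored
        ph.tail = "." if i == len(names) - 1 else ", or "
    return ET.tostring(speak, encoding="unicode")


ssml = language_prompt(LEXICON)
```

The same dictionary could also serve as the starting point for the recognition lexicon described in part D, since it already maps each word to its phonetic spelling.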

3-5 PDAs currently have displays for presenting information to the user, and both keys and a stylus for obtaining information from the user. Telephones have speakers for presenting information to the caller, and both keys and a microphone for obtaining information from the caller. Suppose that a new device is constructed having all the modes of input and output of both current PDAs and telephones. What additions should be made to Figure 3.1 to support these multiple modes of input and output?

 

 
