
VoiceXML: Appendix B Exercises

 


VoiceXML: Introduction to Developing
Speech Applications

B-1 Extend the W3C Speech Interface Framework (Figure B.1) as follows:

A. Insert languages for voice identification and verification, word spotting, language generation, classification, machine translation, summarization, music synthesis, and history.

B. Insert SPIs for ASR, TTS, touch tone recognition, audio capture, and audio.

C. Insert multimodal technologies for handwriting recognition, motion detection, and vision.

D. What else can be inserted into the W3C Speech Interface Framework to make it more complete?

B-2 Review the W3C Voice Browser web site at http://www.w3.org/voice to determine the status of each of the W3C Speech Interface Framework languages. Which languages are at Working Draft, Last Call Working Draft, Candidate Recommendation, Proposed Recommendation, and Full Recommendation status?

  • N-gram - working draft
  • Speech Grammar - candidate recommendation
  • Semantic Attachments - working draft
  • Natural Language Semantics - working draft (now part of Multimodal Interaction Working Group)
  • VoiceXML 2 - last call working draft
  • Lexicon - requirements
  • Speech Synthesis - last call working draft
  • Reusable components - requirements

B-4 For each of the following applications, briefly define the application and list its major functions. List the components of the W3C Speech Interface Framework required to implement the application. List any additional technology required to implement the application. Which of these applications would you recommend be implemented now, and which would you recommend be postponed until additional standards are in place? Explain why.

A. Foreign language training

Goal: Train users to speak a second language. It should be able to converse with the user in the user's first language as well as the second language.

Requirements: The application should be able to listen to the user speak the second language and detect and correct mispronunciations. There may be a visual component, which supports language training, for example, objects which the user may manipulate and discuss using the second language.

  • N-gram - not required
  • Speech Grammar - required
  • Semantic Attachments - probably not required because of the simplicity of the language structures being used.
  • Natural Language Semantics - required
  • VoiceXML 2 - required
  • Lexicon - required
  • Speech Synthesis - prerecorded voice is preferred because of improved pronunciation
  • Reusable components - not required

Detecting mispronounced words is not easily achieved with VoiceXML. Additional programming may be necessary to achieve this.

B. Catalogue sales

Goal: Browse a visual catalogue, ask and listen to answers about products, and order goods and services.

Requirements: A multimodal user interface in which the user can browse a visual catalogue, and then ask simple questions and listen to the answers. The application requires a "shopping cart" facility and a transaction capability to pay for goods and services purchased.

There are several possible versions of this application. The first uses application-directed dialogs to lead the user to the desired product. If the product is audio, the user may listen to "samples" and decide whether to purchase.

  • N-gram - not required
  • Speech Grammar - required
  • Semantic Attachments - required
  • Natural Language Semantics - not required
  • VoiceXML 2 - required
  • Lexicon - required to insert names of new recording artists and their music
  • Speech Synthesis - required because system must respond to unanticipated answers
  • Reusable components - very useful (such as a shopping cart object)
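
As a rough illustration of how the Speech Grammar and Semantic Attachments components fit into an application-directed dialog, the sketch below is a small grammar in the XML form of the W3C Speech Recognition Grammar Specification. The category names and returned values are invented for this example. The tag syntax assumes the Semantic Interpretation for Speech Recognition 1.0 format (tag-format="semantics/1.0"); platforms built on earlier drafts may expect a different tag syntax.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" mode="voice" root="category"
         tag-format="semantics/1.0">
  <!-- Product categories the caller may say (hypothetical catalogue) -->
  <rule id="category" scope="public">
    <one-of>
      <!-- Each tag attaches the value the application receives -->
      <item>jazz <tag>out = "JAZZ";</tag></item>
      <item>rock <tag>out = "ROCK";</tag></item>
      <item>rock and roll <tag>out = "ROCK";</tag></item>
      <item>classical <tag>out = "CLASSICAL";</tag></item>
    </one-of>
  </rule>
</grammar>
```

Two different phrases ("rock" and "rock and roll") return the same value here, so the dialog only has to handle one result per category.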

The next version involves natural language, so that the user can ask questions about a product and discuss its features. True natural language is not supported by current standards.

  • N-gram - required for natural language
  • Speech Grammar - required
  • Semantic Attachments - required for natural language
  • Natural Language Semantics - required
  • VoiceXML 2 - required
  • Lexicon - required
  • Speech Synthesis - required because system must respond to unanticipated answers
  • Reusable components - very useful (such as a shopping cart object)

C. Receive, create, and send e-mail messages

Goal: Browse, listen, create and send e-mail messages.

Requirements: Browse, listen, create and send e-mail messages. To create messages to forward to other users, the user may record voice messages or dictate text messages.

  • N-gram - not required
  • Speech Grammar - required
  • Semantic Attachments - required
  • Natural Language Semantics - required
  • VoiceXML 2 - required
  • Lexicon - required
  • Speech Synthesis - required to read received text messages.
  • Reusable components - very useful (such as a reusable browsable list object)

Browsing a long list of messages is difficult to do with a voice-only interface, but possible with a browsable list object. Creating and editing a text message is very difficult to do with a voice-only interface. A dictation engine will be necessary, and editing the text will be difficult to achieve. A better solution is to record the outgoing message and play it back to the recipient.
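
The record-and-play-back approach might look like the following VoiceXML 2.0 sketch. The submit URL is a placeholder, and the time limits are arbitrary assumptions rather than recommended values.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="reply">
    <!-- Capture the outgoing message as audio instead of dictated text -->
    <record name="msg" beep="true" maxtime="60s" finalsilence="3s"
            dtmfterm="true">
      <prompt>Record your message after the beep. Press any key when you are finished.</prompt>
      <noinput>
        I did not hear anything.
        <reprompt/>
      </noinput>
    </record>

    <!-- Let the caller review the recording before it is sent -->
    <field name="confirm" type="boolean">
      <prompt>Your message is <audio expr="msg"/>. Do you want to send it?</prompt>
      <filled>
        <if cond="confirm">
          <!-- Hypothetical server script that delivers the recording -->
          <submit next="http://example.com/send" method="post"
                  enctype="multipart/form-data" namelist="msg"/>
        <else/>
          <!-- Discard the recording and start over -->
          <clear namelist="msg confirm"/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

Because the recording is submitted as audio, no dictation engine or text editing is involved on the sender's side.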

D. Personal calendar

Goal: Access and update information on a calendar.

Requirements: Browse, edit, and update calendar entries.

  • N-gram - required for short messages placed into calendar entries
  • Speech Grammar - required
  • Semantic Attachments - required for flexibility in user responses
  • Natural Language Semantics - required
  • VoiceXML 2 - required
  • Lexicon - required
  • Speech Synthesis - required to present text to the user
  • Reusable components - very useful (such as browsable list)

Record each user entry and store with the text generated by the speech recognition engine. If the engine doesn't recognize the text correctly, the user can listen to the recording for the information. Because this is a personal calendar, others will not hear the ASR mistakes.

E. Group calendar

Same as personal calendar except that users may not want ASR mistakes presented to other users. There needs to be some way to edit text created by ASR for this application to be successful.

F. Track shipping

Goal: The user enters a package number using DTMF to find out the status of a package en route for delivery.

Requirements

  • N-gram - not required
  • Speech Grammar - not required
  • Semantic Attachments - not required
  • Natural Language Semantics - not required
  • VoiceXML 2 - required
  • Lexicon - required
  • Speech Synthesis - not required. The few responses generated by the system are known ahead of time and can be prerecorded.
  • Reusable components - very useful (such as telephone number, address, etc.)

This is possible today, especially if the package number is a digit string rather than an alphanumeric string.
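
A minimal VoiceXML 2.0 sketch of a DTMF-only package tracking dialog follows. It assumes a 12-digit package number, prerecorded prompt files with invented names, and a hypothetical server URL that returns the status page; none of these come from a real shipping service.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="track">
    <!-- Accept touch-tone input only; no speech grammar is needed -->
    <property name="inputmodes" value="dtmf"/>

    <!-- Built-in digits type; the length of 12 is an assumption -->
    <field name="package" type="digits?length=12">
      <prompt>
        <audio src="prompts/enter_number.wav"/>
      </prompt>
      <noinput>
        <audio src="prompts/no_input.wav"/>
        <reprompt/>
      </noinput>
      <nomatch>
        <audio src="prompts/invalid_number.wav"/>
        <reprompt/>
      </nomatch>
      <filled>
        <!-- Hypothetical script that looks up the package and
             returns a VoiceXML page playing the matching status prompt -->
        <submit next="http://example.com/status" method="get"
                namelist="package"/>
      </filled>
    </field>
  </form>
</vxml>
```

Every prompt is an audio file, so the dialog runs without a speech synthesizer. An alphanumeric package number would not fit the built-in digits type and would need a custom grammar.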

G. Audio jukebox

Goal: Enable user to select and listen to musical selections.

Requirements: System-directed menus guide the user to select tunes. The user can construct a playlist of tunes.

  • N-gram - not required
  • Speech Grammar - not required, but may be used to speak artist or tune name
  • Semantic Attachments - not required
  • Natural Language Semantics - not required
  • VoiceXML 2 - required
  • Lexicon - required
  • Speech Synthesis - not required, responses can be prerecorded
  • Reusable components - very useful (such as a shopping cart object, which can be redesigned into a playlist)

This is similar to (B) Catalogue sales, above.
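
A system-directed menu for the jukebox could be sketched in VoiceXML 2.0 as follows. The genre names and audio file paths are placeholders chosen for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Top-level menu; dtmf="true" assigns keys 1, 2, ... in order -->
  <menu id="genres" dtmf="true">
    <prompt>
      <audio src="prompts/choose_genre.wav"/>
    </prompt>
    <choice next="#jazz">jazz</choice>
    <choice next="#blues">blues</choice>
    <noinput>
      <reprompt/>
    </noinput>
  </menu>

  <!-- Play a prerecorded tune, then return to the menu -->
  <form id="jazz">
    <block>
      <prompt>
        <audio src="tunes/jazz_sample.wav"/>
      </prompt>
      <goto next="#genres"/>
    </block>
  </form>

  <form id="blues">
    <block>
      <prompt>
        <audio src="tunes/blues_sample.wav"/>
      </prompt>
      <goto next="#genres"/>
    </block>
  </form>
</vxml>
```

The text inside each choice also lets callers on a speech-enabled platform say the genre name, which matches the optional use of a speech grammar noted in the list above.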

H. "To do" lists and reminders

Goal: Enable the user to record, search, and listen to a list of things to do.

Requirements: Dictation engine to record arbitrary list of things to do. Speech synthesis to present list of things to do to the user.

  • N-gram - required for dictation engine
  • Semantic Attachments - required for natural language
  • VoiceXML 2 - required
  • Lexicon - required
  • Speech Synthesis - required because system must respond to unanticipated answers
  • Reusable components - very useful (such as browsable list)

It may be possible to develop this application using just record and playback (no speech recognition or synthesis), but searching the to-do list would then be difficult.
