VoiceXML: Chapter 11 Exercises

VoiceXML: Introduction to Developing
Speech Applications

11-1 Explain the features of a VoiceXML dialog that support callers in each of the following stages of general knowledge of the subject.

A. Orientation stage

Directed dialogs assist users during the orientation stage by directing them to answer questions. Event handlers also guide users in this stage by helping them answer prompts correctly.

B. Exploration stage

Event handlers, especially the help event handler, instruct callers by telling them what they can do at each point in the dialog.

C. Manipulation stage

Callers in the manipulation stage can use the mixed-initiative features of VoiceXML to perform their tasks quickly. Callers in this stage make heavy use of barge-in.

11-2 Design and implement a VoiceXML application for collecting preference data. It should be easy to change the questions when the application is used to collect preference data for a new application.

See Figure 11.3

11-3 Design and implement a VoiceXML application for measuring performance data for the pizza application developed in exercise 8-3.

A. Determine what performance data is needed for this application

To measure word error rate, someone must listen to what the user says and transcribe it. Instead, we will use a weaker performance measure that can be completely automated: whether the system understood one of the prompt words. In effect, this is a measure of both the system and the user.

Caller Task	Measure	Typical Criteria	Calculation
The caller speaks one of "small, medium, large" to the prompt "what size?"	System understands the words "small, medium, large"	More than 95% of the time	(number of times size was understood)/ (number of times size was understood + number of times size was not understood)
The caller speaks one of "coke, pepsi, diet pepsi, lemonade, water" in response to "what drink?"	System understands the words "coke, pepsi, diet pepsi, lemonade, water"	More than 95% of the time	(number of times drink was understood)/ (number of times drink was understood + number of times drink was not understood)
The caller confirms that the drink order is correct	The caller says "yes" in response to "is that correct?"	More than 95% of the time	(number of times user said yes)/(number of times user said yes + number of times user said no)
Caller orders a drink	Time taken to order drink		Average of (timestamp of end of order - timestamp of beginning of order)

B. Modify the VoiceXML application to create a log file containing start and stop times required to calculate the performance data needed for this application.

<?xml version="1.0"?>
<!DOCTYPE vxml PUBLIC "-//BeVocal Inc//VoiceXML 2.0//EN" "http://cafe.bevocal.com/libraries/dtd/vxml2-0-bevocal.dtd">
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <var name="time_stamp1"/>
  <var name="time_stamp2"/>
  <var name="order_time"/>

  <script>
    <![CDATA[
      var drink_size = new Array();
      var drink_content = new Array();
      // set listindex
      var drink_list_index = 0;
      var output_list_index = 0;
    ]]>
  </script>

  <form id="get_drink">
    <!-- Record the time at which the order starts -->
    <script>
      var d=new Date();
      var time_stamp1=Date.parse(d);
    </script>

    <field name="current_drink_size">
      <prompt>
        Drink size: <break/> small, medium, or large?
      </prompt>
      <grammar type="application/grammar+xml" root="drink_glass">
        <rule id="drink_glass" scope="public">
          <one-of>
            <item> small </item>
            <item> medium </item>
            <item> large </item>
          </one-of>
        </rule>
      </grammar>
      <catch event="noinput help nomatch">
        <log label="1">size NOT understood</log>
        <reprompt/>
      </catch>
    </field>

    <block>
      <log label="2">size understood</log>
    </block>

    <field name="current_drink_content">
      <prompt>
        Which drink? <break/> coke, pepsi, dietpepsi, lemonade, water?
      </prompt>
      <grammar type="application/grammar+xml" root="drink_kind">
        <rule id="drink_kind" scope="public">
          <one-of>
            <item> coke </item>
            <item> pepsi </item>
            <item> dietpepsi </item>
            <item> lemonade </item>
            <item> water </item>
          </one-of>
        </rule>
      </grammar>
      <catch event="noinput help nomatch">
        <log label="3">drink NOT understood</log>
        <reprompt/>
      </catch>
    </field>

    <filled>
      <!-- Record the time at which the order ends and log the elapsed time -->
      <script>
        var d=new Date();
        time_stamp2=Date.parse(d);
        order_time = time_stamp2 - time_stamp1;
      </script>
      <log label="99"><value expr="order_time"/></log>
      <log label="4">drink understood</log>
      <script>
        <![CDATA[
          drink_size[drink_list_index] = current_drink_size;
          drink_content[drink_list_index] = current_drink_content;
          drink_list_index = drink_list_index + 1;
        ]]>
      </script>
      <goto next="#another_drink"/>
    </filled>
  </form>

  <form id="another_drink">
    <field name="more">
      <prompt>do you want another drink?</prompt>
      <grammar type="application/grammar+xml" root="yes_no">
        <rule id="yes_no" scope="public">
          <one-of>
            <item> yes </item>
            <item> no </item>
          </one-of>
        </rule>
      </grammar>
      <catch event="noinput help nomatch">
        <log label="5">confirmation NOT understood</log>
        <reprompt/>
      </catch>
      <filled>
        <log label="6">confirmation understood</log>
        <if cond="more == 'yes'">
          <goto next="#get_drink"/>
        </if>
        <prompt> Here is your order </prompt>
        <goto next="#navchoice"/>
      </filled>
    </field>
  </form>

  <form id="navchoice">
    <!-- Read back one drink per pass through this form -->
    <block>
      <prompt>
        <value expr="drink_size[output_list_index]"/>
        <value expr="drink_content[output_list_index]"/>
      </prompt>
    </block>
    <block>
      <script>
        <![CDATA[
          output_list_index++;
        ]]>
      </script>
      <if cond="output_list_index &lt; drink_list_index">
        <goto next="#navchoice"/>
      </if>
    </block>
    <field>
      <prompt>
        is this correct?
      </prompt>
      <grammar type="application/grammar+xml" root="yes_no2">
        <rule id="yes_no2" scope="public">
          <one-of>
            <item> yes </item>
            <item> no </item>
          </one-of>
        </rule>
      </grammar>
      <catch event="noinput help nomatch">
        <log label="7">user did NOT confirm order</log>
        <reprompt/>
      </catch>
      <filled>
        <log label="8">user confirmed order</log>
      </filled>
    </field>
  </form>

</vxml>

C. Create an application that reads the log file and calculates the performance data. You may use any programming language, scripting language, or even a spreadsheet application for this purpose.
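
One possible approach is sketched below in Python. The exercise does not specify the log file format, which depends on the voice platform, so this sketch assumes each log entry appears on its own line as `label: message` (for example `2: size understood` or `99: 12000`). That format, and the function names `parse_log`, `rate`, and `performance`, are assumptions made for illustration. The labels are the ones used in the application for part B: 1 and 2 for the size prompt, 3 and 4 for the drink prompt, 7 and 8 for the order confirmation, and 99 for the order time. Labels 5 and 6 are counted but not reported, because the table in part A has no row for them.

```python
from collections import Counter


def parse_log(lines):
    """Return (label counts, order times) from lines of the assumed form 'label: message'."""
    counts = Counter()
    order_times = []
    for line in lines:
        label, sep, message = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # skip lines that do not match the assumed format
        if label == "99":
            # Label 99 carries the elapsed order time computed by the application
            order_times.append(float(message.strip()))
        else:
            counts[label] += 1
    return counts, order_times


def rate(successes, failures):
    """successes / (successes + failures), or None when there is no data."""
    total = successes + failures
    return successes / total if total else None


def performance(lines):
    """Compute the measures from the table in part A."""
    counts, order_times = parse_log(lines)
    average_time = sum(order_times) / len(order_times) if order_times else None
    return {
        "size_understood": rate(counts["2"], counts["1"]),
        "drink_understood": rate(counts["4"], counts["3"]),
        "order_confirmed": rate(counts["8"], counts["7"]),
        "average_order_time_ms": average_time,
    }


if __name__ == "__main__":
    # Hypothetical log lines; a real run would read them from the platform's log file.
    sample = [
        "1: size NOT understood",
        "2: size understood",
        "4: drink understood",
        "99: 12000",
        "8: user confirmed order",
    ]
    for name, value in performance(sample).items():
        print(name, value)
```

Each rate can then be compared against the criterion in the table, for example by checking that `size_understood` exceeds 0.95.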

11-4 Normally a human must transcribe the words and phrases spoken by a caller in order to calculate the word error rate. Explain how it may be possible to use a second speech recognition system to replace the human transcriber. Specify a high-level architecture of a system for automatically estimating the word error rate.

The usual approach for calculating the word error rate is to have a human annotator listen to the audio file and transcribe it into text. The annotator's text is then compared to the ASR's text, the differences noted, and the word error rate calculated.

It is possible to replace the human annotator with a more powerful speech recognition engine (different from the production ASR). The text from this second engine serves as the reference transcript for estimating the word error rate. Because the second ASR is not perfect, it will itself produce errors, but because it works offline and is not bound by real-time constraints, its errors should be fewer than those of the production ASR.
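
The architecture can be sketched in Python as follows. Recorded caller audio is passed to both recognizers, the offline recognizer's text is treated as the reference, and the word-level edit distance between the two transcripts is accumulated over all recordings. The parameters `production_asr` and `reference_asr` are hypothetical callables that map a recording to text; they stand in for whatever engines are actually available and are not part of any real API. The result is only an estimate, since the reference transcripts may themselves contain errors.

```python
def edit_distance(ref_words, hyp_words):
    """Minimum number of word substitutions, insertions, and deletions
    needed to turn ref_words into hyp_words."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def estimate_wer(recordings, production_asr, reference_asr):
    """Estimated word error rate: total errors / total reference words.

    production_asr and reference_asr are hypothetical callables that
    take one recording and return its transcript as a string.
    """
    errors = 0
    reference_words = 0
    for audio in recordings:
        ref = reference_asr(audio).lower().split()
        hyp = production_asr(audio).lower().split()
        errors += edit_distance(ref, hyp)
        reference_words += len(ref)
    return errors / reference_words if reference_words else None


if __name__ == "__main__":
    # Stand-in recognizers backed by fixed transcripts, for illustration only.
    reference = {"call1": "one large coke", "call2": "small lemonade please"}
    production = {"call1": "one large coat", "call2": "small lemonade"}
    print(estimate_wer(["call1", "call2"], production.get, reference.get))
```

Pooling errors and reference words across recordings, rather than averaging per-recording rates, keeps short utterances from dominating the estimate.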