Ever since the VoiceXML Forum published VoiceXML Version 0.9, the telephony world has embraced VoiceXML as a high-level declarative language for developing speech applications. Recently, the SALT Forum has published Speech Application Language Tags (SALT) for developing both telephony and multimodal applications by embedding tags into scripting languages such as XHTML, HTML, and so on. This article addresses the strengths and weaknesses of these two approaches. For a description of some of the technical differences, see "VoiceXML and SALT: How Are They Different, and Why?" (Speech Technology Magazine, May/June, 2002).
VoiceXML Enables Telephony Applications.
VoiceXML has had a revolutionary impact on how voice applications have been written. The VoiceXML Forum (founded by AT&T, IBM, Lucent, and Motorola) specified Version 1.0 of VoiceXML. The W3C Voice Browser Working Group took over the job of evolving and standardizing the language, resulting in Version 2.0 and additional languages for developing speech applications:
- Speech Synthesis Markup Language (SSML) describes how a speech synthesis engine pronounces words and phrases.
- Speech Recognition Grammar Specification (SRGS) describes what words and phrases a speech recognition engine can recognize at each point in the dialog.
- Semantic Interpretation (based on ECMAScript) extracts and translates recognized words.
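To make the grammar references in the examples below concrete, here is a minimal, illustrative SRGS grammar of the kind a file such as "city.xml" might contain. The city names are placeholders for illustration, not from the original application:

```xml
<?xml version="1.0"?>
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
         root="city" mode="voice">
  <!-- A single public rule listing the city names the engine may recognize -->
  <rule id="city" scope="public">
    <one-of>
      <item>Boston</item>
      <item>Denver</item>
      <item>San Francisco</item>
    </one-of>
  </rule>
</grammar>
```

Both the VoiceXML and SALT examples that follow reference a grammar file like this one.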
No longer must a programmer deal with all of the details of invoking the speech synthesis engine, speech recognition engine, audio subsystem, and other subsystems, as well as coordinating the data exchanged among them. Now developers need only specify verbal menus and forms. The following VoiceXML code illustrates a verbal form with two fields that solicit the names of origin and destination cities for a travel application. The <prompt> elements contain text presented to the user via a speech synthesis engine. The <grammar> elements identify grammars used by a speech recognition engine to listen for the caller's response; in this example, the actual grammars are stored in files external to the application. The <nomatch> elements are event handlers that take over when the caller fails to respond appropriately. The <filled> element illustrates how the solicited data is transferred to a backend database.
<?xml version="1.0"?>
<vxml version="2.0">
  <form id="TravelForm">
    <field name="OriginCity">
      <grammar src="city.xml"/>
      <prompt>Where would you like to leave from?</prompt>
      <nomatch>I didn't understand</nomatch>
    </field>
    <field name="DestCity">
      <grammar src="city.xml"/>
      <prompt>Where would you like to go to?</prompt>
      <nomatch>I didn't understand</nomatch>
    </field>
    <filled>
      <submit ... />
    </filled>
  </form>
</vxml>
VoiceXML owes its declarative nature to its Forms Interpretation Algorithm. This algorithm sequences through the fields of a form, presenting each field's prompt to the user, listening for the caller's response, and invoking the appropriate event handler if the caller fails to respond appropriately. In the previous example, the Forms Interpretation Algorithm performs the following actions for the "OriginCity" field:
- Invokes the speech synthesis engine and sends it the text "Where would you like to leave from?"
- Invokes the speech recognition engine and sends it the grammar "city.xml" so it can listen for the names of cities.
- If the user fails to say one of the city names listed in the grammar, invokes an event handler that encourages the user to try again.
Next, the Forms Interpretation Algorithm performs similar tasks for the "DestCity" field.
The Forms Interpretation Algorithm performs many of the control and coordination activities; this leaves the programmer to specify just the prompts, grammars, and event handlers using a declarative style of programming.
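The select-and-collect loop just described can be sketched in a few lines of Python. This is a simplified illustration, not an actual interpreter: run_form and the listen callback are hypothetical stand-ins, and a real Forms Interpretation Algorithm also handles guard conditions, events, and mixed-initiative input.

```python
def run_form(fields, listen):
    """Visit each unfilled field in document order until the form is filled.

    fields: list of dicts with "name", "prompt", and "grammar" keys.
    listen: callable(grammar) -> recognized text, or None on a no-match.
    """
    results = {}
    while True:
        # Select phase: pick the first field whose variable is still unset.
        pending = [f for f in fields if f["name"] not in results]
        if not pending:
            return results                  # a <filled> element would submit here
        field = pending[0]
        print(field["prompt"])              # play the prompt via speech synthesis
        answer = listen(field["grammar"])   # recognize against the field's grammar
        if answer is None:
            print("I didn't understand")    # <nomatch> handler; field is re-visited
        else:
            results[field["name"]] = answer # fill the field's variable

fields = [
    {"name": "OriginCity", "prompt": "Where would you like to leave from?",
     "grammar": "city.xml"},
    {"name": "DestCity", "prompt": "Where would you like to go to?",
     "grammar": "city.xml"},
]
```

Because the loop re-selects the first unfilled field on every pass, a no-match simply causes the same field to be prompted again, which is exactly the behavior the <nomatch> handler produces in the VoiceXML example.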
More than 50 vendors have announced products based on VoiceXML. Developers have written hundreds of VoiceXML applications that handle thousands of telephone calls every day.
SALT Enables Telephony and Multimodal Applications.
In March 2002, the SALT Forum—founded by Cisco, Comverse, Intel, Microsoft, Philips, and SpeechWorks—published a specification for speech tags called Speech Application Language Tags (SALT).
The following illustrates a SALT implementation of the same application shown previously, so you can easily compare the two. With both approaches, the developer specifies prompts (the <salt:prompt> elements), grammars (the <salt:grammar> elements), and event handlers (the onNoReco attributes), and submits results to a backend database (the travelForm.submit() call).
<!-- HTML -->
<html xmlns:salt="urn:saltforum.org/schemas/020124">
  <body onload="RunAsk()">
    <form id="travelForm">
      <input name="txtBoxOriginCity" type="text" />
      <input name="txtBoxDestCity" type="text" />
    </form>

    <!-- Speech Application Language Tags -->
    <salt:prompt id="askOriginCity">
      Where would you like to leave from?
    </salt:prompt>
    <salt:prompt id="askDestCity">
      Where would you like to go to?
    </salt:prompt>
    <salt:prompt id="sayDidntUnderstand" onComplete="RunAsk()">
      Sorry, I didn't understand.
    </salt:prompt>

    <salt:reco id="recoOriginCity" onReco="procOriginCity()"
               onNoReco="sayDidntUnderstand.Start()">
      <salt:grammar src="city.xml" />
    </salt:reco>
    <salt:reco id="recoDestCity" onReco="procDestCity()"
               onNoReco="sayDidntUnderstand.Start()">
      <salt:grammar src="city.xml" />
    </salt:reco>

    <!-- script -->
    <script>
      function RunAsk() {
        if (travelForm.txtBoxOriginCity.value == "") {
          askOriginCity.Start();
          recoOriginCity.Start();
        } else if (travelForm.txtBoxDestCity.value == "") {
          askDestCity.Start();
          recoDestCity.Start();
        }
      }
      function procOriginCity() {
        travelForm.txtBoxOriginCity.value = recoOriginCity.text;
        RunAsk();
      }
      function procDestCity() {
        travelForm.txtBoxDestCity.value = recoDestCity.text;
        travelForm.submit();
      }
    </script>
  </body>
</html>
Unlike VoiceXML, SALT has no Forms Interpretation Algorithm. Instead, SALT tags must be embedded within a host language such as XHTML, SMIL, or a scripting language. (The preceding example uses XHTML.) Developers must specify all control and coordination functions using the host language. This approach uses a procedural style of programming, as opposed to VoiceXML's declarative style. In effect, developers implement their own version of the Forms Interpretation Algorithm, which is embedded within the code specified by the host language.
SALT can be used to develop telephony applications that callers reach from their existing telephones and cell phones. It can also be used to develop multimodal applications: by embedding SALT tags into a GUI language such as XHTML, an existing GUI application can speak and listen to the user while it displays text and graphics and accepts input from a keyboard or keypad and a mouse or stylus.
While the VoiceXML browser generally executes on a speech server connected to a telephone, different versions of SALT browsers execute on a server or on a client, such as a handheld computer or an advanced cell phone. SALT Forum members are expected to announce the availability of SALT browsers, development tools, and other products, now that the SALT 1.0 specification has been made public this summer.
The following table summarizes the principal characteristics of both VoiceXML and SALT.
|  | VoiceXML | SALT |
| --- | --- | --- |
| Purpose | Authoring system-driven and mixed-initiative voice dialogs over telephones and cell phones | Authoring system-driven, user-driven, and mixed-initiative voice dialogs and multimodal applications |
| Target Device | Interpreted on a speech server for use by telephones and cell phones | Executed on either a client or a server |
| Language Style | Simple, high-level dialog markup language plus ECMAScript | Lightweight markup tag extension of existing (X)HTML, WML, and scripting languages |
| Major Language Features | <menu>, <form>, <grammar>, and control flow elements (<if>, <elseif>, <else>, <goto>) | <listen>, <prompt>, <dtmf>, <smex>, plus call control and prompt queue objects |
| Control Flow | Implicit in the Forms Interpretation Algorithm | Developer-specified using HTML plus ECMAScript, WML, SMIL, and so on |
| Number of Tags | Many | Few |
| Grammar Languages | Speech Recognition Grammar Specification, proprietary languages | Speech Recognition Grammar Specification, proprietary languages |
| Speech Synthesis | Speech Synthesis Markup Language, proprietary languages | Speech Synthesis Markup Language, proprietary languages |
| Semantic Representation | Natural Language Semantics Markup Language | Natural Language Semantics Markup Language |
| Call Control | Call Control XML (CCXML) | Call Control XML (CCXML), call control object, <smex> |
Two Approaches: Which Is Best?
It depends...
Programmers endlessly debate the advantages and disadvantages of declarative and procedural languages. But the key fact is that programmers now have a choice. They may choose between the declarative and procedural approaches based on the needs of the application, as well as the developer's skill set.
For traditional telephony applications, VoiceXML enables developers to create system-directed dialogs, in which the application prompts the caller to say requested parameter values. This style of interface is well-suited for telephone callers, who have no visual screen from which to select options. By using some clever programming techniques, system-directed dialogs can also support mixed-initiative dialogs, in which the user can deviate from the system-imposed dialog structure. A variety of VoiceXML tools are available, including GUI tools that generate VoiceXML code, VoiceXML editors, debuggers, and test result generators. There is a well-established infrastructure for hosting and deploying VoiceXML applications.
For complex telephony applications, developers may find it necessary to override the Forms Interpretation Algorithm. In these situations, it may be better for developers to bypass the standard Forms Interpretation Algorithm and write their own control and synchronization code. Until the VoiceXML language provides a programming interface, developers should use SALT tags embedded into a host language. A future version of VoiceXML may provide a programming interface, perhaps one that resembles the SALT tags.
SALT really shines when developers need to voice-enable GUI applications. Developers specify the graphical user interfaces using a host language such as HTML or XHTML, and add SALT tags to enable the application to speak and listen to the user.
CAUTION
Simply adding voice to an existing Web application may not improve the user interface. However, multimodal user interfaces may add value to new applications—especially applications involving handhelds, cell phones, and other portable devices. (See article 4 in this series, Should You Build a Multimodal Interface for Your Web Site?, which presents suggestions for when to add new input modes to an application.)
SALT will be especially useful when developing point and speak applications. These applications are ideal for small devices that do not have a QWERTY keyboard for alphabetic data entry. The caller points to a labeled slot on the screen, and speaks the word or phrase, which is converted into text and placed into the correct slot. These applications can be downloaded into a PDA or executed on a server wirelessly connected to a cell phone with a screen and stylus.
Will SALT replace VoiceXML? No. VoiceXML is a mature technology being standardized by the W3C. It is widely deployed and used by thousands of programmers. Almost every VoiceXML vendor has a collection of dialog modules or speech objects that enable programmers to reuse existing VoiceXML code for common tasks such as soliciting a date, credit card number, or dollar amount. VoiceXML is here to stay.
However, SALT will become the new hot technology for developing multimodal user interfaces, especially for devices that are so physically small that there is no room for a traditional keyboard. Developers will also insert SALT tags into a host programming language to develop complex telephony applications that go beyond today's typical "ask a question and listen for the user's response" cycle of dialog.