January/February 2002

Speaking Standards

VoiceXML Update: VoiceXML 2.0

By Dr. James A. Larson

Since the release of the VoiceXML 1.0 specifications by the VoiceXML Forum in July 2000, developers have implemented and deployed dozens of VoiceXML browsers. VoiceXML 1.0 has enabled application developers to create and deploy hundreds of conversational speech applications.

The W3C Voice Browser Working Group has recently released draft specifications of four languages that revise and extend VoiceXML 1.0. The first, VoiceXML 2.0, clarifies several points of VoiceXML 1.0 syntax and semantics; the other three are the related languages named in Figure 1. All four are XML languages that use tags to describe language elements.

VoiceXML 2.0 specifies the basic structure of conversational dialogs. In Figure 2, VoiceXML tags (shown in black) describe a field consisting of a prompt to be read to the user by a speech synthesis engine and a grammar to be used by a speech recognition engine to listen to the user's response. The W3C Voice Browser Working Group examined nearly 400 change requests and made many clarifications and minor enhancements to VoiceXML 1.0. Probably the most significant addition is the <log> tag, which enables developers to create debugging messages and collect data for performance analysis. The real change from VoiceXML 1.0, however, is the addition of three new languages.
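As a rough illustration of the field structure described above, the following sketch shows a VoiceXML 2.0 document with one field, a prompt, a grammar reference and a logging message. It follows the element names of the VoiceXML 2.0 specification as later published by the W3C, which may differ in detail from the draft discussed here. The form id, field name, prompt wording and the grammar file name store.grxml are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal sketch; ids, wording and file names are illustrative assumptions -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="find_store">
    <field name="store">
      <!-- Read to the user by the speech synthesis engine -->
      <prompt>Would you like the drugstore or the supermarket?</prompt>
      <!-- Used by the speech recognition engine to listen for the response -->
      <grammar src="store.grxml" type="application/srgs+xml"/>
      <filled>
        <!-- Writes a message to the platform log for debugging or analysis -->
        <log>User chose <value expr="store"/></log>
        <prompt>Connecting you to <value expr="store"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

Where the log message ends up (a file, a console, a reporting tool) is left to the VoiceXML platform.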

The Speech Synthesis Markup Language enables developers to instruct the speech synthesizer how to pronounce words and phrases. Based on the Java Speech Markup Language (JSML) specification, the Speech Synthesis Markup Language provides tags for specifying structure, pronunciation, dynamics and the use of prerecorded audio files.
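The four kinds of tags can be sketched in a short document. The example below uses element names from the Speech Synthesis Markup Language 1.0 specification as later published by the W3C; the draft described in this column may have used different names for some of them. The spoken text and the audio file name welcome.wav are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal sketch; text and file name are illustrative assumptions -->
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- Prerecorded audio, with fallback text if the file cannot be played -->
  <audio src="welcome.wav">Welcome.</audio>
  <!-- Structure: a paragraph made up of sentences -->
  <p>
    <s>This service is described by the
      <!-- Pronunciation: speak the alias instead of the written form -->
      <sub alias="World Wide Web Consortium">W3C</sub>.</s>
    <s>Please say <emphasis>drugstore</emphasis> or supermarket.</s>
  </p>
  <!-- Dynamics: a pause, then a slower speaking rate -->
  <break time="500ms"/>
  <prosody rate="slow">Take your time.</prosody>
</speak>
```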

The Speech Recognition Grammar Specification enables developers to describe the words and phrases that can be recognized by the speech recognition engine. The syntax of the grammar format has two forms. The form shown in red text in Figure 2 uses XML elements to represent a grammar. This format can be analyzed and verified by traditional XML syntax checkers. The form shown in red text in Figure 3 uses an augmented BNF format, which is similar to many existing BNF-like representations commonly used in the field of speech recognition. The two forms are equivalent in the sense that a grammar that is specified in one format may also be specified in the other. Both forms are modeled after the JSpeech Grammar Format.
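To make the equivalence of the two forms concrete, here is a small sketch of a grammar that recognizes either "drugstore" or "supermarket." It is written in the XML form, with what should be the equivalent augmented BNF form shown in a trailing comment. Both follow the syntax of the Speech Recognition Grammar Specification 1.0 as later published by the W3C, which may differ from the draft. The rule name "store" is hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal sketch; the rule name is an illustrative assumption -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US"
         mode="voice" root="store">
  <rule id="store" scope="public">
    <!-- The recognizer accepts exactly one of these alternatives -->
    <one-of>
      <item>drugstore</item>
      <item>supermarket</item>
    </one-of>
  </rule>
</grammar>
<!--
  The same grammar in the augmented BNF form:

  #ABNF 1.0 UTF-8;
  language en-US;
  mode voice;
  root $store;
  public $store = drugstore | supermarket;
-->
```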

Semantic Interpretation tags enable developers to compute the information returned to an application, using the grammar rules and tokens matched by the speech recognition engine. For example, if the speech recognition engine recognizes the phrase "Drugstore," the green semantic interpretation tags in Figures 2 and 3 calculate the value to be returned as "Ajax." Likewise, "Stanley" is returned in place of "Supermarket."

Semantic interpretation tags can also be used to extract information from a user utterance, perform calculations, and assign values to multiple variables. The semantic interpretation tags were designed to be very similar to a subset of the ECMAScript language.
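The drugstore and supermarket mapping can be sketched as follows. This example uses the tag syntax of the Semantic Interpretation for Speech Recognition 1.0 specification as later published by the W3C. The tag syntax changed between drafts, so the figures in this column may look different. The rule name "store" is hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal sketch; the rule name is an illustrative assumption -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US"
         mode="voice" root="store"
         tag-format="semantics/1.0">
  <rule id="store" scope="public">
    <one-of>
      <!-- Each tag holds an ECMAScript-like statement; "out" is the
           value returned to the application for the matched phrase -->
      <item>drugstore <tag>out = "Ajax";</tag></item>
      <item>supermarket <tag>out = "Stanley";</tag></item>
    </one-of>
  </rule>
</grammar>
```

With this grammar, the application receives the store name rather than the words the caller actually spoke, so the dialog logic does not need to know which phrasings the grammar accepts.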

Status. A status of "working draft" means that the Voice Browser Working Group is actively soliciting suggestions for changes and enhancements. "Last call working draft" means that this is the last chance to suggest changes and enhancements before the specification proceeds through the W3C stages of "candidate recommendation," "proposed recommendation" and finally "recommendation." Anyone in the speech community is welcome to comment on the languages by sending an e-mail to www-voice@w3.org.

The VoiceXML languages currently have a reasonable and non-discriminatory licensing policy. This policy is currently under review by the W3C.

The VoiceXML Forum (www.VoiceXMLForum.org) is starting a conformance program to verify that language implementations conform to the specified standard.

VoiceXML 1.0 lacked standards for speech recognition, speech synthesis and semantic interpretation. The new languages listed in Figure 1 fill these gaps and also provide numerous clarifications to VoiceXML 1.0. The new languages will enable developers to create conversational applications faster and with a greater degree of portability than ever before.


Dr. James A. Larson is chairman of the W3C Voice Browser Working Group. He is the author of "Developing Speech Applications Using VoiceXML" and teaches courses in user interfaces and speech applications at Portland State University and Oregon Health and Sciences University. He may be contacted at www.larson-tech.com.