Twenty Multimodal Projects Using X+V on the Opera Browser

James A. Larson

Portland State University
Oregon Health & Science University
Oregon Institute of Technology

Student projects

Opera, with its new voice feature, enables developers to create and deploy multimodal applications. To experience the multimodal applications below, you will need the Opera version 8.0 browser. See the Voice Quick Start-up Guide: How to talk with your browser.

Opera can support three types of user interfaces:

Dialog Type	URI	Comments
Verbal only	jim/paymentVerbal.xml	VoiceXML code embedded into HTML so Opera browser users can hear the verbal-only dialog
GUI only	jim/paymentGUI.xml	A traditional HTML application
Multimodal	jim/paymentMM.xml	VoiceXML code embedded into HTML code using X+V

During the last two weeks of the school term, students in the Oregon Health & Science University course CSE 564 and the Portland State University course CS 410/510 created several XHTML plus Voice (X+V) applications. Students had a working knowledge of XHTML and VoiceXML but no previous knowledge of X+V. I presented a one-hour overview of X+V syntax and two sample applications to students in both courses. Students then formed teams of two to design and implement a multimodal application within two weeks. The table below contains some of the student projects. If you install the Opera browser and the voice plug-in, you can experience these multimodal applications. By choosing "source" from the "view" pull-down menu, you can examine the source code to see how students implemented the multimodal user interface.

Project number	Project Name	Author	URI	Comments
1	Kids' holiday craft projects	Ashley Irving	ashley/home.xml	Choose the project by saying its number. Page through the instructions by saying "next." Do you think that the content should be spoken in addition to being displayed on the screen?
2	Making a peanut butter sandwich	Emerson Murphy-Hill	No longer available	Page through the instructions by saying "next," "back," or "read."
3	Buying a car	Glenn Diviney	glenn/index.xml
4	Origami	Khanh Duong	No longer available
5	Collecting health data	Sunil Lahudia	sunil/healthData.xml
6	Buy movie tickets	Oindrila Mukherjee	qindrila/movieSel_test.xml	Use the name "James Bond" on the second screen to reserve tickets.
7	National park tours	Medha Nirguide	medha/Main.xml
8	Restaurant menu	Quang Nguyen	quang/project3.xml
9	Banking	Rajeshwari Patil	No longer available	Account = "10"; Passcode = "hello". Amounts must be in increments of 100.
10	Library	Frank Adrian and Ken Anderson	No longer available
11	Quick finder	Driss Takir	driss/voice.xml
12	Order computer	Ashwini Kulkarni	ashwini/login.xml	Use "david" for the user name and "capital" for the password.
13	Personal travel pictures	Chris Holm	chris/index.vxml	Say "next" or "previous" to view the cities.
14	Animal shelter	David Graves and Dina Suehiro	dina/index.xml
15	Cyber reader	Tom Feliz	tom/index.xml	Hear how much better a recording is than TTS.
16	Picture album	Ricky Cancro	No longer available	PHP to generate XML code for a picture album.
17	Flash cards (addition)		jim/flashCardAddition.xml
18	Tune your violin	Based on a SALT program developed by Deborah Dahl, chair of the W3C Multimodal Interaction Working Group	dahl/tune.xml	Both hands are busy as a violin player requests to hear tones so the violin player can tune the violin.
19	Music world	Doan Ng	doan/mainpage.xml	This application simulates a hardware device with a push-to-speak button. You must click the "push to speak" soft button as you press the (default) ScrLk button on your computer.
20	The game of "go"	Sean Pearson	http://web.pdx.edu/~seanp/school/cs410/capturego.cgi

I have grouped the student projects into the following categories.

Hands-busy instruction. This category of project provides incremental instruction while users manipulate real-world artifacts with their hands. It includes applications in which the user diagnoses a problem (determining what is wrong with a car's motor), repairs a device (fixing a leaky faucet), or constructs artifacts (project 1: a talking holiday project book for children; project 2: a talking recipe book; and project 4: creating a complex origami artifact). Applications in this category could be used to diagnose or repair products without calling a live-help agent. The application may be available by connecting to an application server, or it may be supplied on a CD-ROM packaged with the product. Project 17 (not necessarily a hands-busy instruction application) illustrates the multimodal equivalent of flash cards, which some children use to learn their addition tables. Repetitive drills, such as flash cards, can be used not only for math skills but also for language training, spelling, and other rote memory training.

Entertainment. This category includes audio poems, stories (project 3: a child's fairy tale book), music (an audio-controlled juke box), and games. Gaming enthusiasts can use voice as a "third hand" while manipulating controls with both real hands.

Data collection. These applications are primarily verbal forms. For each slot or menu in an electronic form, the user speaks an answer to a question that is presented verbally, visually, or both. As a fallback, voice users may also enter information directly into a GUI-based form. Projects in this category include:

3: buy a car
5: collecting health data
6: buy movie tickets
8: order meals from a restaurant menu
9: banking
11: quick finder for locating local businesses
12: ordering a computer
14: reviewing homeless pets at an animal shelter

These applications can be very useful if your hands are busy, or if these applications are implemented on a handheld device. Usability testing is needed to determine if voice really benefits these applications when used on a desktop PC in an office environment.

Photo album tour. Another popular class of applications is a photo album tour. Some projects consist of an ordered sequence of pictures that the user navigates by saying "next," "previous," or "home" (project 13: personal travel pictures). More complicated projects support a hierarchical structure in which users move up or down a hierarchy of pictures and navigate within sequential sets of pictures at the leaves of the hierarchy (project 7: national parks tour). Project 19 (music world) replaces the photos with audio clips, which enables users to create their own playlist of downloaded tunes. These are simple applications to construct. One student constructed a PHP program to generate a photo album tour (project 16: picture album). Almost anyone can use such a generation tool to create their own personal photo album tour. Just as enabling users to author text, spreadsheets, presentations, and e-mail made GUI-based office applications popular, enabling users to be authors may be the key to popularizing multimodal applications.
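
The sequential navigation described above can be sketched as a small state function. This is a minimal JavaScript sketch, not code from any student project: the function name `navigateAlbum` and the choice to stop at the ends of the album (rather than wrap around) are assumptions made for illustration.

```javascript
// Hypothetical helper: compute the next picture index for a spoken command.
// "next" and "previous" stop at the ends of the album (an assumption; a real
// application might wrap around instead). "home" returns to the first picture.
function navigateAlbum(index, command, pictureCount) {
  switch (command) {
    case "next":
      return Math.min(index + 1, pictureCount - 1);
    case "previous":
      return Math.max(index - 1, 0);
    case "home":
      return 0;
    default:
      // Unrecognized command: stay on the current picture.
      return index;
  }
}

// Example: walk through a three-picture album with a series of commands.
let position = 0;
for (const spoken of ["next", "next", "next", "previous", "home"]) {
  position = navigateAlbum(position, spoken, 3);
}
```

In an X+V page, the word recognized by the voice grammar would presumably be passed as `command`, and the returned index would select which picture to display.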

Multimodal interface to a traditional Web page. Project 10 (library application) is an example of a GUI application that also supports verbal input. The user may switch between the traditional GUI and the multimodal user interface at any time. Because of factors such as users' typing speed, background noise, and privacy concerns, extensive usability testing is needed to determine whether multimodal user interfaces, such as the library application, will become popular.

Novel applications. These applications are not possible with a traditional GUI. For example, project 18 illustrates an interesting "hands busy" application: the user asks for musical notes to be played while using both hands to tune a violin or other musical instrument. Project 19 illustrates how to speak the name of a tune to be played. Imagine speaking to your iPod to select tunes!

Students' experience with Opera's implementation of X+V

While most students were able to complete their projects within two weeks, they experienced several types of problems:

Lack of documentation. While the Opera documentation was useful to get started, it lacked a good description of the <sync> element. The IBM documentation described the syntax of the <sync> element, but did not provide a good example.
Browser problems. Several students reported that the Opera browser's speech recognition system stopped after several errors. Even when the document was refreshed, the Opera browser still did not recognize user speech. Students worked around the problem by closing the browser and restarting it.
Debug tools. Experienced student programmers complained about the lack of development and debug tools. Several times students received the message, "an error has occurred," with no further information about the error.
X+V language. Students complained about several limitations of X+V, including: (1) no support for the W3C Speech Synthesis Markup Language (SSML); (2) no support for the W3C Speech Recognition Grammar Specification (SRGS); and (3) several missing VoiceXML 2.0 elements including the <goto> element.

I spoke with an Opera representative who indicated that the Opera documentation will be improved, and W3C SSML and SRGS will be supported in the future. He also indicated that Opera is considering development tools.

IBM recently announced the Multimodal Tools Project for Eclipse [http://www.alphaworks.ibm.com/tech/mmtp], which avoids or overcomes the problems listed above. Many papers and articles are referenced at http://www-306.ibm.com/software/pervasive/multimodal/ and http://www-128.ibm.com/developerworks/. There is also an X+V Programmer's Reference Guide in the IBM Multimodal Tools Package. For debugging, the IBM browser has an X+V debugger and a voice log window where programmers can trace their application. The IBM version also supports the W3C Speech Synthesis Markup Language (SSML) and both versions of the W3C Speech Recognition Grammar Specification (SRGS).

If you are inspired by these examples and have a novel multimodal application in mind, I invite you to build a demo using X+V. See getting started.

Conclusion

Clever students are able to create interesting and novel applications as long as some training in VoiceXML, HTML, XPath, JavaScript, and X+V is provided. Usability testing is needed to (1) refine and improve the user interfaces of the initial prototypes and (2) determine whether each application has both wide appeal and use.

I challenge students to use X+V to create other new and novel multimodal applications. I'll post them along with the applications shown above if they (1) are error free, (2) conform to general moral standards (no pornography, please), and (3) represent a new use of multimodal technology.

Update

Fall quarter students at Portland State University and Oregon Institute of Technology developed the following X+V applications:

Project number	Project Name	Author	URI	Comments
21	Repot an orchid	Dianna Carroll	Caroll/1_main.xml	A useful hands-busy application that uses prerecorded voice in place of speech synthesis.
22	Multimodal favorites	John Chee	Chee/favorites.xml	Speak the name of the web site you wish to visit. Question: Can you think of a way to keep the voice active after going to a site, so that the user can always choose one of the sites at any time by speaking? John's answer: I thought about this and I think the simplest way would be to have the website that the user chose inside a frame, although I don't know how Opera would like that. My thought was that a user of this application would likely get used to the Opera built-in voice commands fairly quickly and use phrases such as "opera back," although that still leaves something to be desired. Users could also use the application as a landing page, so that every time they open a window (or tab) the application would automatically start; users could then create a new tab and close the one that navigated away from the application.
23	World Guide	Chris Roberts	http://dev.modspox.com/cgi-bin/guide.rb	Provides speech access to Wikipedia content. Chris's comments: The VoiceXML engine that Opera uses is very particular about what it will let you do, and the fact that it doesn't support the 2.0 version of VoiceXML made it very hard to get things working. I finally broke down and split it into two pages and used a Ruby script to load the actual pages. This way I can pre-insert the dynamic options that can be matched. I tried many things to allow a dynamic grammar, but everything I tried, from using JavaScript to physically change the field during runtime to messing around with the src to load the correct dynamic page, just crashed the browser. So, attached is the finished script.
24	Online dictionary	Yasuko (Katharine) K Horiguchi	Kathy/onlineDictionary.xml	For the traveler, translate English phrases to Chinese
25	Internet phone system	Hwa Young Lee	Lee/phone.xml	Multimodal user interface for a VoIP phone system
26	Maze game	Michael A. Laskowski	mal/mazeGame/mazeGame.xhtml	Trains children to pronounce difficult-to-say sounds by repeating words containing the sound as part of a maze game. The user navigates the maze by voice, saying a word that lies in the path of the user icon (a kid on a tricycle). At each location in the maze, the words that can be said are outlined in yellow. Say "help" to get the list of available words. Say "stop" to exit the game. Say "end" to navigate to the end of the maze (you cannot say "end" until it is available).
27	Nagoya Scenic Spots	Michiko I Luther	michiko/proj3.xml
28	Online unit conversion	Sidy	Sidy/index.xml	Author's comments: The JavaScript code <vxml:script><![CDATA[ function convertV(m) {return Math.floor((m - 32) / 1.8);} ]]></vxml:script> works fine, but Opera just won't let me call the convertV function for some reason. From all the research that I have done, it seems that Opera's coverage of the JavaScript language is kind of spotty. I guess this is how the Opera browser is kept so small and fast. One more thing: even though I used the <xv:sync xv:input="..." xv:field="..."> tags to synchronize my VoiceXML form with my graphical (XHTML) form, the functionality that I got was not exactly what I expected. The <xv:sync> tag did the right thing in triggering an event to stop the prompt (i.e., barge-in) whenever I clicked on the input corresponding to a field or simply barged in. But the entered value was NOT reflected in the HTML form until I threw another event. Fortunately, I read in the W3C specification that since a VoiceXML form is contained within an XHTML form, elements of the XHTML document are accessible from within VoiceXML. This allowed me to update my XHTML field after collecting my voice input.
29	Computer museum	Michael A. Turner	Turner/index.xhtml	Say the number of the link you want to visit. Say "refresh" to refresh the page, "back" to go to the previous page, "next" to go to the next page in the numbered sequence, "home" to go to the home page, and "read" to have the computer read the specs.
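
The conversion function quoted in the project 28 comments can be exercised on its own as plain JavaScript, outside any X+V page. The sketch below copies the student's arithmetic, which matches the Fahrenheit-to-Celsius formula rounded down to a whole number. The `describeConversion` helper is hypothetical and added only for illustration. The sketch does not reproduce or diagnose the Opera behavior the author describes.

```javascript
// The student's function as quoted in the table: subtract 32, divide by 1.8,
// and round down to a whole number (the Fahrenheit-to-Celsius formula).
function convertV(m) {
  return Math.floor((m - 32) / 1.8);
}

// Hypothetical helper (not part of the original project) that formats a result.
function describeConversion(fahrenheit) {
  return fahrenheit + " F is about " + convertV(fahrenheit) + " C";
}

console.log(describeConversion(32));
console.log(describeConversion(212));
```

Because `Math.floor` rounds toward negative infinity, an input of 31 yields -1 rather than 0. A real application might prefer `Math.round`.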