Multimodal interfaces promise to let users enter requests and information using any of several input modes, such as mouse and keyboard, handwriting, and speech. They are the current hot technology: everyone is talking about how natural multimodal interfaces are and how much users like them. It seems as if they offer solutions to many of our user interface problems and enable new classes of applications.
But wait. While multimodal user interfaces promise to improve the naturalness and usability of Web applications, the incorrect application of a new mode might make a user interface cumbersome and awkward.
Three general questions will help you decide whether a new mode of input should be added to your application:
- Does the new input mode add value to the Web application?
- Does the application leverage the strengths of the new mode and avoid its weaknesses?
- Does the user have access to the hardware and software required by the new mode?
Each of these questions will be discussed in detail in the following sections.
Does the New Input Mode Add Value to the Web Application?
Using a new mode just because it is available does not make a useful or even usable user interface. Each additional mode must add its own value to the Web application. Here are examples of new input modes that offer additional value:
- Offload overloaded existing modes. A new mode offloads content from existing modes. For example, a keyboard and mouse are traditionally used to manage a display. Managing the location and visibility of multiple windows within a display can be offloaded from the mouse and keyboard to spoken commands, freeing the keyboard and mouse to manage the application content within the windows.
- Replace traditional modes that are not practical in specific situations. In-car computing has practical constraints for the driver, whose eyes are busy watching the road and whose hands are busy manipulating the steering wheel, gear shift, and controls for lights, horn, and windshield wipers. Speech is a reasonable mode to replace the mouse and keyboard for drivers. Conversely, if the environment is too noisy for automatic speech recognition to recognize what you say, use a keyboard or handwriting instead.
- Increase bandwidth by using modes in parallel. New modes present new opportunities to increase the rate of communication between the user and the computer. For example, voice commands can serve as a "third hand" in games that require both hands to manipulate game controls. Voice also can control the parameters of drawing tools in a drawing application, such as changing the line width and color of a line tool while it is being used.
- Use the most expressive mode. While the keyboard can be used to enter text, voice can be used to speak text and convey emotion. Drawing a map often conveys a route more accurately than verbal directions.
- Use mixed modes. Some requests are easier to express using multiple modes rather than a single mode, as sketched in the example below. For example, the combined verbal and pointing request:
"Put that" (point to the third icon in the fourth row) "there" (point to the second container from the left edge of the display)
is easier than saying:
"Put the third icon in the fourth row into the second container from the left edge of the display."
Each new mode should be used to provide additional value. If not, then the new mode will just get in the way of the user performing desired tasks.
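To make the "Put that there" example concrete, here is a minimal TypeScript sketch of one way a Web application might fuse a spoken request with pointing gestures. The `PointEvent` shape, the `fusePutThatThere` function, and the rule that deictic words ("that," "there") pair with pointing gestures in order of occurrence are illustrative assumptions, not a prescribed fusion algorithm.

```typescript
// Minimal sketch of fusing a "Put that there" utterance with pointing gestures.
// Assumes pointing gestures are timestamped and that the first deictic word
// pairs with the first gesture, the second deictic with the second gesture.

interface PointEvent {
  target: string; // id of the object or container that was pointed at
  time: number;   // timestamp in milliseconds
}

interface FusedMove {
  object: string;      // what to move ("that")
  destination: string; // where to put it ("there")
}

function fusePutThatThere(transcript: string, points: PointEvent[]): FusedMove | null {
  const deictics = transcript.toLowerCase().match(/\b(that|there)\b/g) ?? [];
  if (deictics.length < 2 || points.length < 2) {
    return null; // not enough information to resolve both references
  }
  const ordered = [...points].sort((a, b) => a.time - b.time);
  return { object: ordered[0].target, destination: ordered[1].target };
}

// Example: "Put that there" accompanied by two pointing gestures.
const move = fusePutThatThere("put that there", [
  { target: "icon-4-3", time: 1200 },
  { target: "container-2", time: 2100 },
]);
console.log(move); // { object: "icon-4-3", destination: "container-2" }
```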
Does the Application Leverage the Strengths of the New Mode and Avoid Its Weaknesses?
The following text summarizes how users manipulate content using three popular input modes:
- Voice. The user speaks into a microphone.
- Pen. The user manipulates a pen to write, draw, or point.
- Key. The user manipulates a keyboard or keypad by pressing a key.
Developers should choose the input mode that best enables the user to enter the desired information. Users perform various tasks—including entering text, selecting words or objects, sketching simple illustrations and maps, and entering symbols such as mathematical equations. Each of these input modes can be used for text entry—the user speaks words into a microphone, writes the words using a pen, or presses keys on a keypad to spell the words. Most users easily speak and write. Training and practice may be necessary to use a keyboard efficiently.
Users frequently need to edit previously entered text by first selecting the word or phrase to be corrected and then entering the modified word or phrase. To select the word or phrase to be corrected, the user may speak the word, point to the word, or press keys to position a cursor to select the word or phrase.
Each task should dictate the input method. Menu selection is easy with a pen—just point to or circle the desired option. Menu selection with a keyboard is more difficult—press keys to position the cursor to the desired option and then press a select key. When using speech, just say the name of the desired option. Using a mouse is very much like using a pen, except that handwriting is much easier with a pen than with a mouse.
Entering a sketch—drawing simple illustrations and maps—is easy with a pen, but it is awkward with a mouse, and impossible with speech. When speaking, users must instead verbally describe the illustration or map. Entering symbols—mathematical equations, special characters, and signatures—is also easy with a pen, awkward with a mouse, and very awkward with speech.
The following table summarizes tasks and input modes. Developers should choose the input modes that best enable users to perform each task supported by a multimodal application. For example, if the application supports editing, a pen lets the user point directly to the object to be modified, while a keypad requires pressing keys to position the cursor on it.
Content-Manipulation Task      | Voice               | Pen                              | Key
-------------------------------|---------------------|----------------------------------|--------------------------------------------------------------------------------
Enter text                     | Speak the text      | Handwrite the text               | Press keys to spell each word in the text
Select a word to edit          | Speak the selection | Point to or circle the selection | Press keys to position the cursor within the selection
Select a menu option           | Speak the option    | Point to or circle the option    | Press keys to position the cursor to the option and then press the select key
Enter a sketch or illustration | Describe verbally   | Draw the illustration            | Awkward
Enter a symbol                 | Awkward             | Draw the symbol                  | Select from a menu
Does the User Have Access to the Hardware and Software Required by the New Mode?
Most telephones do not have full QWERTY keyboards, so they cannot be used for typing alphabetic data. Many PCs do not have the high-quality microphones needed for speech recognition. Most PDAs do not have microphones at all. 3G phones promise the communication bandwidth to support multimodal applications, but they have been slow to appear in the marketplace.
It does no good to enable your application to support a mode if users cannot use the mode because of hardware limitations.
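A practical consequence is that the application should detect which modes are actually available before offering them. The following TypeScript sketch shows one way a current browser-based application might perform such a check; it assumes the Web Speech API, `navigator.maxTouchPoints`, and `navigator.mediaDevices.enumerateDevices()`, which play the role that device-capability checks would play on the phones and PDAs discussed here.

```typescript
// Illustrative capability check: report which input modes this browser/device
// can plausibly support, so the application only offers modes the user can use.

interface ModeSupport {
  speech: boolean; // speech recognition available and a microphone is present
  pen: boolean;    // touch or stylus input available
  key: boolean;    // keyboard input is assumed to always be available
}

async function detectModes(): Promise<ModeSupport> {
  const speechApi =
    "SpeechRecognition" in window || "webkitSpeechRecognition" in window;

  // enumerateDevices() reveals whether any audio input (microphone) exists.
  let hasMicrophone = false;
  if (navigator.mediaDevices?.enumerateDevices) {
    const devices = await navigator.mediaDevices.enumerateDevices();
    hasMicrophone = devices.some((d) => d.kind === "audioinput");
  }

  return {
    speech: speechApi && hasMicrophone,
    pen: navigator.maxTouchPoints > 0,
    key: true,
  };
}

// Usage: only enable the speech button if speech is actually usable.
detectModes().then((modes) => {
  if (!modes.speech) {
    document.getElementById("speak-button")?.setAttribute("disabled", "true");
  }
});
```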
Where Will Multimodal Applications Be Used?
Users will buy two general types of devices for accessing multimodal Web applications: point and speak devices and multimodal-enabled PCs. Additional devices are possible, but currently these appear the most promising.
Point and Speak Devices
Cell phones and PDAs are converging into a single device. Like cell phones, PDAs can now be connected wirelessly to a backend server. Like PDAs, cell phones now have small displays and the computational power to support a browser.
The combined PDA/cell phone device will likely support a new style of user interface called point and speak. Users will manipulate a stylus to point to a field on an electronic form displayed on the screen, and speak to enter a value into the field. The pointing turns on the speech recognizer, which uses the grammar specific to that field to listen to the user's utterance and places the result into the field. These point and speak forms can be used for data entry (fill in the values of a patient history form), query (specify the query parameters and constraints), and transactions (specify the amount to be paid, the items being ordered, and the required credit card information and user-verification criteria).
Speech Application Language Tags (SALT) was designed by the SALT Forum to add speech into languages that support GUIs. SALT tags enable the user to speak and listen to applications on point and speak devices.
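SALT itself is markup embedded in the page, but the point and speak flow can also be sketched in script. The TypeScript sketch below uses the browser's Web Speech API as a stand-in for a SALT listen element: tapping a form field starts the recognizer, and the result is placed into that field. The field id, the word-list approximation of a field-specific grammar, and the `attachPointAndSpeak` helper are assumptions made for illustration only.

```typescript
// Sketch of a "point and speak" form field: tapping the field starts a
// recognizer, and the recognized value is placed into the field.
// Uses the Web Speech API as a stand-in for a SALT listen element;
// a per-field "grammar" is approximated by a list of expected words.

const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

function attachPointAndSpeak(fieldId: string, expectedWords: string[]): void {
  const field = document.getElementById(fieldId) as HTMLInputElement | null;
  if (!field || !SpeechRecognitionImpl) return; // no field or no speech support

  field.addEventListener("pointerdown", () => {
    const recognizer = new SpeechRecognitionImpl();
    recognizer.lang = "en-US";

    recognizer.onresult = (event: any) => {
      const heard: string = event.results[0][0].transcript.trim();
      // Keep the result only if it matches the field's expected vocabulary,
      // loosely mimicking a field-specific grammar.
      if (expectedWords.some((w) => heard.toLowerCase().includes(w))) {
        field.value = heard;
      }
    };

    recognizer.start();
  });
}

// Example: a patient-history form field that expects a yes/no answer.
attachPointAndSpeak("smoker", ["yes", "no"]);
```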
Multimodal PCs
The PC is inexpensive and has plenty of computing power for multimodal user interfaces. Peripherals such as pens, handwriting pads, microphones, and cameras are readily available and can be added to most PCs. Microsoft Internet Explorer will soon be SALT-enabled to support speech input and output. Applications that benefit from speech input on the PC include the following:
- Data entry. Speak the value for each field rather than typing it with a keyboard. With a portable microphone, the user can leave the "office position" (sitting in front of the PC, one hand on the mouse and the other on the keyboard) and move around the office.
- Information query. Speak a query and review the results. Refine the query as necessary to home in on the right information.
- Web browsing. Speak the names of links rather than moving the mouse to a hyperlink and clicking. If the hyperlink is a graphic, insert a label next to the graphic that is easily pronounced. Conversay's Web browser supports these features very nicely.
- Window management. Manage the many windows on a PC screen by name, rather than hunting for and clicking the desired window (see the sketch following this list). Warnings and error messages can be presented by voice instead of cluttering the screen with message boxes. Windows managed by voice leave your hands free to enter data and requests into the various windows. In a sense, your voice becomes a "third hand."
- Eyes busy, hands busy. Users can speak and listen to an application when their eyes and/or hands are busy (for example, a recipe application in which cooks listen to a recipe while their hands prepare the dishes). Customers can assemble a toy while they listen to instructions spoken from a PC. In both cases, the user's hands and eyes are busy, yet the user can hear and speak with the computer.
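As a concrete illustration of voice-driven window management from the list above, the following hypothetical TypeScript sketch maps a spoken command such as "show mail" onto a named window. The `windows` map, the small command vocabulary, and the assumption that utterances arrive as plain strings from a recognizer are all illustrative; wiring in a real speech engine is out of scope here.

```typescript
// Sketch of voice-driven window management: a spoken command such as
// "show mail" or "hide calendar" is mapped to an action on a named window.

const windows = new Map<string, { visible: boolean }>([
  ["mail", { visible: true }],
  ["calendar", { visible: false }],
]);

function handleWindowCommand(utterance: string): string {
  const match = utterance.toLowerCase().match(/^(show|hide|close)\s+(\w+)$/);
  if (!match) return "Sorry, say 'show', 'hide', or 'close' and a window name.";

  const action = match[1];
  const name = match[2];
  const win = windows.get(name);
  if (!win) return `There is no window named ${name}.`;

  if (action === "show") win.visible = true;
  if (action === "hide") win.visible = false;
  if (action === "close") windows.delete(name);
  return `OK, ${action} ${name}.`;
}

// Example commands, as they might arrive from a recognizer:
console.log(handleWindowCommand("show calendar")); // OK, show calendar.
console.log(handleWindowCommand("open mail"));     // Sorry, say 'show', ...
```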
Not all PC applications will benefit from speech. Speech increases noise pollution in the workplace, and some PC users do not want to be heard speaking to their computer. However, where speech adds value to an application, developers embed SALT tags into the application code. As with all speech applications, usability testing is necessary to refine the verbal prompts to the user, the grammars the speech recognizer uses to interpret what the user says, and the event handlers that help the user recover from speech recognition problems.
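To suggest what such recovery-oriented event handlers might look like, here is a TypeScript sketch that reprompts with more specific guidance after a recognition failure. It uses the Web Speech API's `nomatch` and `error` events as a rough analogue of SALT's recognition events; the prompt wording, the `alert` stand-in for spoken output, and the retry limit are exactly the kind of details usability testing would refine.

```typescript
// Sketch of recovery from recognition problems: after a no-match or error,
// reprompt the user with more specific guidance, up to a small retry limit.

const Recognition =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

function listenWithRecovery(prompts: string[], onValue: (text: string) => void): void {
  let attempt = 0;

  const tryOnce = () => {
    // A real application would speak this prompt; alert() keeps the sketch short.
    alert(prompts[attempt]);

    const recognizer = new Recognition();
    recognizer.onresult = (event: any) => onValue(event.results[0][0].transcript);

    const retry = () => {
      attempt += 1;
      if (attempt < prompts.length) tryOnce(); // reprompt with more guidance
    };
    recognizer.onnomatch = retry; // speech was heard but matched nothing useful
    recognizer.onerror = retry;   // no speech, no microphone, and so on

    recognizer.start();
  };

  tryOnce();
}

// Example: increasingly specific prompts, refined through usability testing.
listenWithRecovery(
  ["What city?", "Please say just the city name, for example 'Boston'."],
  (city) => console.log("Heard:", city)
);
```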
Conclusion
To be successful, each mode of a multimodal application must add value, be supported on widely available devices, and be supported by a widely available browser. Leverage the strengths of each new mode and avoid its weaknesses. Many promising multimodal applications are waiting to be implemented on integrated cell phone/PDA devices and PCs.
NOTE
For a further discussion of multimodal user interfaces, see the last chapter in the author's book, VoiceXML: An Introduction to Building Voice Applications (Prentice Hall, 2002, ISBN: 0130092622).