Should You Build a Multimodal Interface for Your Web Site?

Will multimodal user interfaces improve the usability of your application, or just make your users more frustrated? Make an informed decision by answering these three questions.

Multimodal interfaces promise to enable users to enter requests and information using any of several input modes such as a mouse and keyboard, handwriting, and speech. Multimodal interfaces are the current hot technology. Everyone is talking about multimodal interfaces—how natural they are, and how users like them. It seems as if multimodal interfaces offer solutions to many of our user interface problems, as well as enabling new classes of applications.

But wait. While multimodal user interfaces promise to improve the naturalness and usability of Web applications, the incorrect application of a new mode might make a user interface cumbersome and awkward.

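Three general questions will help you decide whether a new mode of input should be added to your application:

  • Does the new input mode add value to the Web application?

  • Does the application leverage the strengths of the new mode and avoid its weaknesses?

  • Does the user have access to the hardware and software required by the new mode?

Each of these questions is discussed in detail in the following sections.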

Does the New Input Mode Add Value to the Web Application?

Using a new mode just because it is available does not make a useful or even usable user interface. Each additional mode must add its own value to the Web application. Here are examples of new input modes that offer additional value:

Each new mode should be used to provide additional value. If not, then the new mode will just get in the way of the user performing desired tasks.


Does the Application Leverage the Strengths of the New Mode and Avoid Its Weaknesses?

The following text summarizes how users manipulate content using three popular input modes:

  • Voice. The user speaks into a microphone.

  • Pen. The user manipulates a pen to write, draw, or point.

  • Key. The user manipulates a keyboard or keypad by pressing a key.

Developers should choose the input mode that best enables the user to enter the desired information. Users perform various tasks—including entering text, selecting words or objects, sketching simple illustrations and maps, and entering symbols such as mathematical equations. Each of these input modes can be used for text entry—the user speaks words into a microphone, writes the words using a pen, or presses keys on a keypad to spell the words. Most users easily speak and write. Training and practice may be necessary to use a keyboard efficiently.

Users frequently need to edit previously entered text by first selecting the word or phrase to be corrected and then entering the modified word or phrase. To select the word or phrase to be corrected, the user may speak the word, point to the word, or press keys to position a cursor to select the word or phrase.

Each task should dictate the input method. Menu selection is easy with a pen—just point to or circle the desired option. Menu selection with a keyboard is more difficult—press keys to position the cursor to the desired option and then press a select key. When using speech, just say the name of the desired option. Using a mouse is very much like using a pen, except that handwriting is much easier with a pen than with a mouse.

Entering a sketch—drawing simple illustrations and maps—is easy with a pen, but it is awkward with a mouse, and impossible with speech. When speaking, users must instead verbally describe the illustration or map. Entering symbols—mathematical equations, special characters, and signatures—is also easy with a pen, awkward with a mouse, and very awkward with speech.

The following table summarizes tasks and input modes. Developers should choose the input modes that best enable users to perform each task supported by a multimodal application. For example, if the application supports editing, then a pen or keypad should be used to enable the user to point to the object to be modified.

| Content-Manipulation Task | Voice | Pen | Key |
|---|---|---|---|
| Enter text | Speak the text | Handwrite the text | Press keys to spell each word in the text |
| Select a word to edit | Speak the selection | Point to or circle the selection | Press keys to position the cursor into the selection |
| Select a menu option | Speak the option | Point to or circle the option | Press keys to position the cursor to the option and then press the select key |
| Enter a sketch or illustration | Describe verbally | Draw the illustration | - |
| Enter a symbol | - | Draw the symbol | Select from a menu |
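
One way to apply the table when planning an application is to encode it as data and ask which modes offer a technique for every task the application supports. The following is a minimal sketch, not a prescribed method: the dictionary layout, the lowercase mode names, and the function name `modes_covering` are assumptions made for illustration, and an empty table cell is represented as `None`.

```python
from typing import Optional

MODES = ("voice", "pen", "key")

# Hypothetical encoding of the task/mode table; None marks an empty cell.
TECHNIQUES: dict[str, dict[str, Optional[str]]] = {
    "enter text": {
        "voice": "Speak the text",
        "pen": "Handwrite the text",
        "key": "Press keys to spell each word in the text",
    },
    "select a word to edit": {
        "voice": "Speak the selection",
        "pen": "Point to or circle the selection",
        "key": "Press keys to position the cursor into the selection",
    },
    "select a menu option": {
        "voice": "Speak the option",
        "pen": "Point to or circle the option",
        "key": "Position the cursor on the option and press the select key",
    },
    "enter a sketch or illustration": {
        "voice": "Describe verbally",
        "pen": "Draw the illustration",
        "key": None,
    },
    "enter a symbol": {
        "voice": None,
        "pen": "Draw the symbol",
        "key": "Select from a menu",
    },
}


def modes_covering(tasks: list[str]) -> list[str]:
    """Return the modes that offer a technique for every listed task."""
    return [
        mode for mode in MODES
        if all(TECHNIQUES[task][mode] is not None for task in tasks)
    ]


# An application that needs text entry and symbol entry.
print(modes_covering(["enter text", "enter a symbol"]))
```

A mode that appears in the result only has some technique for each task; whether that technique is comfortable (describing a sketch verbally, for example) still has to be judged against the strengths and weaknesses discussed above.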

Does the User Have Access to the Hardware and Software Required by the New Mode?

Most telephones do not have full QWERTY keyboards, so they cannot be used for typing alphabetic data. Many PCs do not have high-quality microphones used for speech recognition. Most PDAs do not have microphones at all. 3G phone devices promise to have the communication bandwidth to support multimodal applications, but they are slow to appear in the marketplace.

It does no good to enable your application to support a mode if users cannot use the mode because of hardware limitations.
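
The hardware question can be treated as a simple filter over device capabilities. The sketch below is illustrative only: the `Device` class, its three capability flags, and the example device profiles are assumptions that loosely follow the observations above, not measurements of real products. It also simplifies the key mode by requiring a full keyboard, since a telephone keypad cannot be used for typing alphabetic data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical device profile; the flags are illustrative assumptions."""
    name: str
    has_microphone: bool
    has_stylus: bool
    has_full_keyboard: bool


def usable_modes(device: Device) -> set[str]:
    """Return the input modes this device's hardware can support."""
    modes = set()
    if device.has_microphone:
        modes.add("voice")
    if device.has_stylus:
        modes.add("pen")
    if device.has_full_keyboard:
        modes.add("key")
    return modes


# Example profiles, assumed for illustration.
telephone = Device("telephone", has_microphone=True, has_stylus=False,
                   has_full_keyboard=False)
pda = Device("PDA", has_microphone=False, has_stylus=True,
             has_full_keyboard=False)

for device in (telephone, pda):
    print(device.name, sorted(usable_modes(device)))
```

Intersecting this result with the modes an application would like to offer shows quickly which modes are worth building for a given audience.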

Where Will Multimodal Applications Be Used?

Users will buy two general types of devices for accessing multimodal Web applications: point and speak devices and multimodal-enabled PCs. Additional devices are possible, but currently these appear the most promising.

Point and Speak Devices

Cell phones and PDAs are converging into a single device. Like cell phones, PDAs can be wirelessly connected to a backend server. Like PDAs, cell phones now have small displays and the computational power to support a browser.

The combined PDA/cell phone device will likely support a new style of user interface called point and speak. Users will manipulate a stylus to point to a field on an electronic form displayed on the screen, and then speak to enter a value into the field. Pointing to the field turns on the speech recognizer, which uses the grammar specific to that field to listen to the user's utterance and places its result into the field. These point and speak forms can be used for data entry (fill in the values of a patient history form), query (specify the query parameters and constraints), and transactions (specify the amount to be paid, the items being ordered, and the required credit card information and user-verification criteria).
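
The point and speak interaction described above can be sketched as a small state machine: pointing selects the active field, and speaking is interpreted only against that field's grammar. This is a toy model under stated assumptions, not SALT markup or any real recognizer API. The class names, the `recognize` stand-in (which matches text against a set of phrases instead of processing audio), and the sample patient-history fields are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FormField:
    name: str
    grammar: set[str]            # phrases accepted for this field (assumed)
    value: Optional[str] = None


def recognize(utterance: str, grammar: set[str]) -> Optional[str]:
    """Stand-in for a speech recognizer: returns the matched phrase or None."""
    phrase = utterance.strip().lower()
    return phrase if phrase in grammar else None


class PointAndSpeakForm:
    def __init__(self, fields: list[FormField]) -> None:
        self.fields = {f.name: f for f in fields}
        self.active: Optional[FormField] = None

    def point(self, field_name: str) -> None:
        """Pointing at a field activates the recognizer with its grammar."""
        self.active = self.fields[field_name]

    def speak(self, utterance: str) -> bool:
        """Place a recognized result into the active field; report success."""
        if self.active is None:
            return False
        result = recognize(utterance, self.active.grammar)
        if result is None:
            return False         # utterance is outside the field's grammar
        self.active.value = result
        return True


form = PointAndSpeakForm([
    FormField("blood type", {"a", "b", "ab", "o"}),
    FormField("smoker", {"yes", "no"}),
])
form.point("blood type")
form.speak("AB")
form.point("smoker")
form.speak("no")
print({name: f.value for name, f in form.fields.items()})
```

Restricting each utterance to a single field's grammar is what keeps the recognition problem small: the recognizer only has to distinguish among the few values that field allows.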

Speech Application Language Tags (SALT) was designed by the SALT Forum to add speech into languages that support GUIs. SALT tags enable the user to speak and listen to applications on point and speak devices.

Multimodal PCs

The PC is inexpensive, and has plenty of computing power for multimodal user interfaces. Peripherals such as pens, handwriting pads, microphones, and cameras are readily available—and can be added to most PCs. Microsoft Internet Explorer will soon be SALT-enabled to support speech input and output. Applications that benefit from speech input on the PC include the following:

  • Data entry. Speak the value for each field rather than input the value with a keyboard. By using a portable microphone, the user moves out of the "office position" (sitting in front of the PC, with the mouse in one hand and the other on the keyboard), and moves around the office.

  • Information query. Speak a query, and review the results. Refine the query as necessary to home in on the right information.

  • Web browsing. Speak the names of the links rather than move a mouse to the hyperlink and click. If the hyperlink is a graphic, insert a label next to the graphic that is pronounced easily. Conversay's Web browser supports these features very nicely.

  • Window management. Manage the many windows on a PC screen by name, rather than hunting and clicking the desired window. Warnings and error messages can be presented by voice instead of cluttering the screen with message boxes. Windows managed by voice leave your hands free to enter data and requests into the various windows. In a sense, your voice becomes a "third hand."

  • Eyes busy, hands busy. Users can speak and listen to an application when their eyes are busy and/or when their hands are busy (for example, a recipe application in which cooks listen to a recipe as they use their hands to prepare the dishes). Customers can assemble a toy while they listen to instructions spoken from a PC. In both cases, the user's hands and eyes are busy, yet the user can hear and speak with the computer.

Not all PC applications will benefit from speech. Speech increases noise pollution in the workplace, and some PC users do not want to be heard speaking to their computers. However, where speech adds value to an application, developers can embed SALT tags into the PC application code. As with all applications involving speech, usability testing is necessary to refine the verbal prompts presented to the user, the grammar the speech recognizer uses to interpret what the user says, and the event handlers that help the user recover from speech recognition problems.
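
The interplay of prompts, grammar, and recovery handling mentioned above can be illustrated with a reprompting loop: when nothing is heard or the utterance falls outside the grammar, the application tries again with a more explicit prompt. This is a hedged sketch, not SALT or VoiceXML code. The function name `ask_with_reprompts`, the `hear` callback (a stand-in for playing a prompt and returning recognized text, or `None` for silence), and the sample prompts are assumptions for illustration.

```python
from typing import Callable, Optional


def ask_with_reprompts(
    prompts: list[str],
    grammar: set[str],
    hear: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Try each prompt in turn until an utterance matches the grammar."""
    for prompt in prompts:
        heard = hear(prompt)
        if heard is None:
            continue             # no input: move on to the next prompt
        phrase = heard.strip().lower()
        if phrase in grammar:
            return phrase
        # no match: fall through to the next, more explicit prompt
    return None                  # give up; the caller picks a fallback mode


# Scripted stand-in for a user: silence, an out-of-grammar reply, then "Yes".
replies = iter([None, "maybe", "Yes"])
asked: list[str] = []


def scripted_hear(prompt: str) -> Optional[str]:
    asked.append(prompt)
    return next(replies)


answer = ask_with_reprompts(
    ["Pay now?",
     "Please say yes or no.",
     "Say yes to pay the bill, or no to cancel."],
    {"yes", "no"},
    scripted_hear,
)
print(answer, len(asked))
```

Usability testing is what determines how the later prompts should be worded and how many attempts are reasonable before the application falls back to another input mode.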


To be successful, each mode of a multimodal application must add value, be supported on widely available devices, and be supported by a widely available browser. Leverage the strengths of each new mode and avoid its weaknesses. Many promising multimodal applications are waiting to be implemented on integrated cell phone/PDA devices and PCs.


For a further discussion of multimodal user interfaces, see the last chapter in the author's book, VoiceXML: An Introduction to Building Voice Applications (Prentice Hall, 2002, ISBN: 0130092622).