February/March 1999
By James A. Larson
In their book, The Media Equation, Byron Reeves and Clifford Nass suggest that people treat computers as if they are real people. When people interact with each other, they normally see and hear each other as well as speak with each other. It is also natural for users to want to see and hear their computers, as well as speak with them. For users to hear a computer, it must produce synthesized speech.
Synthesized speech can be categorized as realistic or fantasy, depending on the goal of the application with which it is used. Realistic synthesized speech attempts to be as human-like as possible, while fantasy synthesized speech often sounds non-human but is still understandable.
Two widely different approaches for creating realistic synthesized speech are model-based and concatenated.
In the model-based approach, a model of the human vocal tract is constructed to produce synthesized speech using the electronic equivalents of the lips, mouth, tongue, and throat. By changing model parameters, the same model can produce a wide variety of voices for different situations, and it can dynamically vary the tone and speed of speech, or prosody, to convey emotion. However, the speech still sounds artificial: various users have likened model-based speech to a Swedish moose, an intoxicated Martian, or a radio with a loose paper clip in its speaker. Researchers believe that improving the simulated vocal tract will improve the quality of the speech.
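To make the idea concrete, the following is a minimal source-filter sketch in Python: an impulse train stands in for the glottal source, and three resonant filters stand in for the vocal tract's formants. The formant frequencies and bandwidths are rough textbook values for the vowel /a/, not any product's actual model.

    import numpy as np
    from scipy.signal import lfilter

    def resonator(signal, freq, bw, fs):
        """Second-order IIR resonator: one formant of the vocal tract."""
        r = np.exp(-np.pi * bw / fs)                 # pole radius from bandwidth
        theta = 2 * np.pi * freq / fs                # pole angle from center frequency
        return lfilter([1 - r], [1, -2 * r * np.cos(theta), r * r], signal)

    fs = 16000                                       # sample rate (Hz)
    pitch = 120                                      # glottal pulse rate -> perceived pitch
    t = np.arange(fs // 2)                           # half a second of samples
    source = (t % (fs // pitch) == 0).astype(float)  # impulse train approximates the glottis

    voice = source
    for freq, bw in [(700, 130), (1220, 70), (2600, 160)]:  # rough formants for /a/
        voice = resonator(voice, freq, bw, fs)
    voice /= np.abs(voice).max()                     # normalize for playback

Raising or lowering pitch, or shifting the formant values over time, is exactly the kind of parameter change that gives such a model different voices and prosody.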
With the concatenated approach, words and phrases are extracted from a person's recorded speech and spliced together to replay the message for the listener. For example, telephone users frequently hear concatenated speech after they dial an invalid telephone number. The prerecorded words and phrases are arranged to present a specific message to the telephone user: "The number you have reached - nine, five, seven, four, four, four, three - is not a working number."
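A minimal sketch of this kind of word-level splicing, assuming a set of prerecorded WAV clips (the file names here are hypothetical) that all share the same sample format:

    import wave

    def splice(clip_names, out_path="message.wav"):
        """Concatenate prerecorded clips (same format) into one message."""
        frames, params = [], None
        for name in clip_names:
            with wave.open(name + ".wav", "rb") as clip:
                params = clip.getparams()            # assumes all clips share a format
                frames.append(clip.readframes(clip.getnframes()))
        with wave.open(out_path, "wb") as out:
            out.setparams(params)
            out.writeframes(b"".join(frames))

    # "The number you have reached ... is not a working number."
    splice(["number_you_reached", "nine", "five", "seven",
            "four", "four", "four", "three", "not_working"])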
To improve concatenated synthesized speech, researchers have identified and extracted small units of speech, called phonemes, which can be joined and replayed to produce words the original speaker never uttered. Creating a new voice requires recording words spoken by another human speaker, and large amounts of disk space are needed to store the phonemes and phoneme combinations to be stitched together.
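Joining units this small is harder than splicing whole words: abrupt joins produce audible clicks, so systems smooth each seam. A simplified sketch using a short linear crossfade, assuming each unit is a NumPy array of samples longer than the fade:

    import numpy as np

    def join_units(units, fs=16000, fade_ms=10):
        """Concatenate speech units with a short linear crossfade at each join."""
        n = int(fs * fade_ms / 1000)                 # crossfade length in samples
        ramp = np.linspace(0.0, 1.0, n)
        out = units[0]
        for unit in units[1:]:
            seam = out[-n:] * (1 - ramp) + unit[:n] * ramp
            out = np.concatenate([out[:-n], seam, unit[n:]])
        return out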
While concatenated synthesized speech sounds more human-like than speech produced by the model-based approach, it often has little prosody, and some users may tire of listening to it over long periods of time.
Three approaches are used to create fantasy speech. In addition to the model-based and concatenated approaches, human speech can be altered by passing it through various types of filters. Voxware (http://www.voxware.com) offers software filters, called VoiceFonts™, that produce a variety of voices. For example, the pitch of a speaker's voice can be changed to sound like a mechanized robot voice.
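One classic way to get a mechanized robot timbre is ring modulation: multiplying the speech waveform by a low-frequency sine wave. The sketch below illustrates the general filtering idea only; it is not Voxware's actual VoiceFonts algorithm.

    import numpy as np

    def robotize(speech, fs=16000, mod_hz=50):
        """Ring-modulate speech with a sine wave for a robotic timbre."""
        t = np.arange(len(speech)) / fs
        return speech * np.sin(2 * np.pi * mod_hz * t)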
MikeTalk, developed by Tony Ezzat and Tomaso Poggio at MIT, is an example of the concatenated approach. Ezzat and Poggio videotaped a person (Figure 2) enunciating a set of key words containing a collection of visemes, the visual equivalents of phonemes. Images for each viseme are identified and extracted from the video sequence. The extracted images can then be sequenced and morphed together to produce a string of images that looks like a person saying a word he or she never actually uttered. Combined with realistic synthesized speech, the result is a video-realistic text-to-audiovisual speech synthesizer.
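A crude stand-in for the morphing step is a linear cross-dissolve between two viseme images. MikeTalk's actual morphing also warps the image geometry, but a dissolve shows the flavor; the sketch assumes both frames are same-sized NumPy image arrays:

    import numpy as np

    def dissolve(frame_a, frame_b, steps=8):
        """Generate intermediate frames blending viseme A into viseme B."""
        a, b = frame_a.astype(float), frame_b.astype(float)
        return [((1 - w) * a + w * b).astype(np.uint8)
                for w in np.linspace(0.0, 1.0, steps)]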
Researchers at Interval Research Corporation have developed an experimental system called Video Rewrite. This system uses existing video to create a new video of a person mouthing words not spoken in the original footage. In Figure 3, Video Rewrite reorders the mouth images to match the phoneme sequence of a new audio track, which may be a recorded human voice or realistic synthesized speech generated by either the model-based or the concatenated approach.
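The reordering step can be pictured as a table lookup: each phoneme in the new audio track indexes a library of mouth images harvested from the original footage. A hypothetical sketch (the frame names and phoneme labels are illustrative only):

    # Mouth images harvested from the original footage, keyed by phoneme.
    mouth_library = {
        "M":  "frame_0057.png",
        "AA": "frame_0142.png",
        "T":  "frame_0310.png",
    }

    def rewrite(phoneme_track):
        """Order mouth images to match the phoneme sequence of the new audio."""
        return [mouth_library[p] for p in phoneme_track]

    new_mouth_frames = rewrite(["M", "AA", "T"])     # mouth shapes for a new word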
If realistic human talking heads are not the goal, animation techniques can be used to produce fantasy talking heads and bodies. Cartoons can be more expressive than human heads and may be more appropriate in some situations. Microsoft has created several cartoon-like animated agents, two of which are shown in Figure 4. In addition to synchronizing the animated cartoon's mouth to synthesized speech, Microsoft provides software for controlling the cartoon's actions and positions.
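From a script, such an agent is driven through a small command vocabulary. The sketch below drives the Microsoft Agent ActiveX control from Python via win32com; the character file path and the animation name are assumptions based on the characters shipped with the control, so treat it as a sketch rather than verified code.

    import win32com.client

    # Load the Microsoft Agent ActiveX control and a character file.
    agent = win32com.client.Dispatch("Agent.Control.2")
    agent.Connected = True
    agent.Characters.Load("Genie", r"C:\Windows\msagent\chars\genie.acs")  # assumed path
    genie = agent.Characters.Character("Genie")

    genie.Show()
    genie.MoveTo(400, 300)                           # position on screen
    genie.Play("Greet")                              # named animation (assumed)
    genie.Speak("Hello! I am an animated agent.")    # lip-synced synthesized speech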
Haptek, Inc. has developed a collection of animated talking heads. Developers embed commands in the text to control the placement, orientation, size, and emotion of the talking head. It is even possible to cause one talking head to morph into another.
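Hypothetically, such embedded commands might look like tags interleaved with the words to be spoken. The bracket syntax below is invented for illustration; it is not Haptek's actual command language.

    import re

    script = "[emotion happy] Hello there! [size 1.5] I can grow while I talk."

    def parse(script):
        """Split an annotated script into (commands, spoken text)."""
        commands = re.findall(r"\[(\w+) ([^\]]+)\]", script)
        text = re.sub(r"\[[^\]]+\]\s*", "", script).strip()
        return commands, text

    commands, text = parse(script)   # -> [('emotion', 'happy'), ('size', '1.5')]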
Figure 5 is a screen shot of Haptek's VirtualFriend™ Roswell. Haptek has created several 3-D, morphing, speaking, moving, and emoting fanciful VirtualFriends.
Animated Conversational Agents
Adding speech recognition to a talking head with speech synthesis enables a new input-output paradigm: the animated conversational agent. Many applications can be improved with a carefully designed animated conversational agent; a skeleton of the interaction loop is sketched below.
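Reduced to its essentials, a conversational agent is a loop: listen, decide, speak. The interfaces below (recognizer, agent_head, respond) are hypothetical stand-ins rather than a real API.

    def conversation_loop(recognizer, agent_head, respond):
        """Listen, choose a reply, and speak it through the animated head."""
        while True:
            utterance = recognizer.listen()      # speech recognition (input)
            if utterance is None:                # silence or hang-up ends the session
                break
            reply = respond(utterance)           # application dialogue logic
            agent_head.speak(reply)              # speech synthesis + lip-synced animation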
Cutsey Creatures
When fonts were first introduced, users went wild, overusing multiple fonts and producing gaudy documents with less emphasis on content and more emphasis on layout. When color was introduced, documents were suddenly filled with inappropriate colors. Eventually, users learned the appropriate use of fonts and colors.
It follows that initially there will be a plethora of visual agents, with many annoying, "cutsey" creatures saying inappropriate things at inappropriate times. Often new technology is unwieldy and easily misused. Visual agents require that developers not only be computer savvy, but also serve as author, artistic director, and content provider - all difficult tasks that even experts sometimes fail to do well.
Animated conversational agents are a natural extension of the Reeves and Nass observation that users treat computers as if they are people.
Jim Larson is the Manager of Advanced Human I/O at the Intel Architecture Labs in Hillsboro, OR. He can be reached at jim@larson-tech.com.