November/December 2004

Technology Trends: MRCP Enables New Speech Applications

By Dr. James Larson

Have you ever wished you could change your VoiceXML platform to use a speech synthesizer or speech recognizer from a different vendor?  Have you ever wanted to move your speech synthesizer or speech recognizer to a different server?  The Internet Engineering Task Force is proposing a new standard that will provide this flexibility.

Media Resource Control Protocol Version 2 (MRCPv2) [1] is a network protocol that provides a vendor-independent interface between speech media servers and speech application platforms. MRCPv2, from the Internet Engineering Task Force [2] Speech Services Control working group [3], is based on an earlier version developed jointly by Cisco, Nuance, and SpeechWorks (now ScanSoft).
    
The MRCPv2 protocol controls media service resources over a network. It depends on a session management protocol, such as the Session Initiation Protocol (SIP), to establish a separate MRCPv2 control session between the client and the media server. MRCPv2 defines several types of media processing resources, including resources for speech recognition, speech synthesis, and speaker authentication.
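To make the reliance on SIP concrete, here is a minimal sketch of the session description a client might place in the body of a SIP INVITE to request a control channel to a speech synthesizer. The attribute names are modeled on the later published MRCPv2 specification (RFC 6787) and are an assumption here; the draft cited in this article may differ in detail. The function name, addresses, session identifiers, and port are made up for illustration.

```python
def build_mrcpv2_sdp_offer(client_ip: str, resource: str, audio_port: int,
                           direction: str = "recvonly") -> str:
    """Return an SDP offer asking for one MRCPv2 control channel and one audio stream.

    Hypothetical helper: the layout follows RFC 6787 conventions (an assumption),
    not necessarily the 2004 draft.
    """
    lines = [
        "v=0",
        f"o=client 2890844526 2890844526 IN IP4 {client_ip}",
        "s=-",
        f"c=IN IP4 {client_ip}",
        "t=0 0",
        # Control channel. Port 9 is a placeholder because the client
        # (a=setup:active) opens the TCP connection to the server.
        "m=application 9 TCP/MRCPv2 1",
        "a=setup:active",
        "a=connection:new",
        f"a=resource:{resource}",   # which media resource type is requested
        "a=cmid:1",                 # ties the control channel to the audio stream below
        # Audio stream carried separately over RTP.
        f"m=audio {audio_port} RTP/AVP 0",
        "a=rtpmap:0 pcmu/8000",
        f"a={direction}",           # a synthesizer client only receives audio
        "a=mid:1",
    ]
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(build_mrcpv2_sdp_offer("192.0.2.12", "speechsynth", 49170))
```

The point of the sketch is the division of labor: SIP and its session description negotiate where the control channel and the audio go, and MRCPv2 requests then travel over the control channel that results.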

MRCP is designed to support two important capabilities to make speech platforms more flexible:

  1. Service provider independence: Developers can switch between service providers.  For example, a developer might switch from a public-domain speaker recognition engine to a higher-quality (and more expensive) proprietary speaker recognition engine.
  2. Service location independence: Developers can move services among servers.  For example, if a server becomes saturated, another server can be installed and some of the services from the first server can be moved to the second server.
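The two capabilities above can be pictured with a small sketch. It is not MRCP itself, only an illustration of the idea: the application asks for a resource by type, and the vendor and server behind that type are configuration. Every class, function, vendor, and host name below is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceBinding:
    vendor: str  # which engine implements the resource
    host: str    # which server the engine runs on


class MediaResourceDirectory:
    """Maps a resource type to whichever vendor and server currently provide it."""

    def __init__(self) -> None:
        self._bindings: dict[str, ResourceBinding] = {}

    def bind(self, resource_type: str, vendor: str, host: str) -> None:
        self._bindings[resource_type] = ResourceBinding(vendor, host)

    def locate(self, resource_type: str) -> ResourceBinding:
        return self._bindings[resource_type]


def describe_request(directory: MediaResourceDirectory,
                     resource_type: str, method: str) -> str:
    """Application-side code: it names only the resource type and the method."""
    binding = directory.locate(resource_type)
    return f"{method} -> {resource_type} ({binding.vendor} on {binding.host})"


if __name__ == "__main__":
    directory = MediaResourceDirectory()
    directory.bind("speaker-verification", "open-engine", "media1.example.com")
    print(describe_request(directory, "speaker-verification", "VERIFY"))

    # Provider independence: swap the engine, application code unchanged.
    directory.bind("speaker-verification", "proprietary-engine", "media1.example.com")
    print(describe_request(directory, "speaker-verification", "VERIFY"))

    # Location independence: move the service to a second server.
    directory.bind("speaker-verification", "proprietary-engine", "media2.example.com")
    print(describe_request(directory, "speaker-verification", "VERIFY"))
```

In each case only the binding changes; the call the application makes stays the same, which is the flexibility a vendor-independent control protocol is meant to allow.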

Developers can use MRCP to bring this flexibility to any application or platform that uses speech recognition, speech synthesis, or speaker authentication.

MRCP can support remote media services for today's speech applications and for others that have not been invented yet.  MRCP can do for the entire speech industry what VoiceXML did for the telephony industry: provide a standard platform on which to write applications, one that lets media resources be accessed remotely and lets developers choose the technology vendors that best support their applications within their budgets.


1: http://www.ietf.org/internet-drafts/draft-ietf-speechsc-mrcpv2-04.txt
2: http://www.ietf.org/
3: http://www.ietf.org/html.charters/speechsc-charter.html
4: http://www.saltforum.org/
5: http://www.w3.org/TR/2004/WD-css3-speech-20040727/
6: For a catalog of personal agents from Microsoft and other vendors, see http://www.iva-user-center.com/.  For a short tutorial on various implementations of animated agents, see http://www.speechtechmag.com/issues/4_2/cover/298-1.html.


Dr. James A. Larson is manager of Advanced Human Input/Output at Intel Corporation and author of the book VoiceXML: Introduction to Developing Speech Applications. He can be reached at jim@larson-tech.com and his Web site is www.larson-tech.com.