VoiceXML and Next-Generation Voice Services

Adam Hocek <ahocek@broadstrokesinc.com>

Abstract

This paper introduces the VoiceXML framework and the markup languages it builds on (SSML, GRXML, and CCXML), covering dialogs, IP integration, and event handling. It then surveys technologies for enhanced voice services, including XHTML+Voice, SALT, XForms, XSLT, and NLSML, natural dialog systems, and multimodal interfaces.

Table of Contents

1. Voice services
2. VoiceXML framework
   2.1. VoiceXML dialogs
   2.2. SSML
   2.3. GRXML
   2.4. CCXML
   2.5. VoiceXML IP services
   2.6. VoiceXML Events
   2.7. Mixed initiative
3. Introduction to enhanced voice services
   3.1. XHTML + Voice
   3.2. SALT
   3.3. XForms
   3.4. XSLT
   3.5. NLSML
   3.6. Natural dialogs
   3.7. Multimodal (multi-interface)
   3.8. Conclusion
Bibliography
Glossary

1. Voice services

Applications that can interact with users through more than one interface have advantages. Consider a city directory service: saying the name of a restaurant is much easier than working through drop-down menus and getting the spelling right. Conversely, a web browser is better suited to presenting directions than a sequential, vocalized list. Each user interface has its own criteria for how best to represent information, which implies a certain level of design customization for each interface type supported. On the backend, however, there can be a common data model representing the information presented to and collected from users.

Voice services are one form of user interface. What makes them interesting is the naturalness (or sometimes unnaturalness) with which we interact through speech. Providing voice services, however, requires listening and speaking capabilities, both of which are complex technologies. The first part of this paper, The VoiceXML framework, introduces VoiceXML and its underlying components. VoiceXML is by nature also complex: the language allows for the scripting of voice dialogs, with speech recognition and speech synthesis capabilities, as well as integration with other processes through events and IP connectivity. Covering VoiceXML covers the basics of voice services.
The second part, Introduction to enhanced voice services, looks at technologies to consider for modeling and leveraging existing web browser technology, and for building voice dialog systems that interact more naturally. The section ends with a look at multiple user interfaces and how applications might change the way users interact with them.

2. VoiceXML framework

One of the motivations behind VoiceXML was to address Interactive Voice Response (IVR) applications and to standardize the different voice processing components required. IVR applications typically present telephone users with questions and, based on their responses, eventually direct them to the correct information or to an agent. To accomplish this, VoiceXML must provide basic telephony control, speech synthesis control, and control of speech recognition through grammars. Figure 1 shows a typical IVR system. These traditional IVRs had limitations in performance and integration. The problems with traditional IVR systems can be summarized as follows:

- All components (text-to-speech, voice recognition, telephony connectivity) are non-standard.
- Voice menus have little flexibility in handling user variations.
- They are difficult to integrate with other IP services (database, web, etc.).
- Application development is costly and uses proprietary vendor APIs.
- There is no dynamic content for personalized menus.

Figure 1. Architecture of a traditional IVR

Next we look at how VoiceXML addresses the shortcomings of traditional IVR systems. Shown in Figure 2 is a VoiceXML-based IVR. Functionality is separated into components for Text-To-Speech (TTS), Automatic Speech Recognition (ASR), and telephony control. VoiceXML 2.0 [VoiceXML 2.0] defers the functionality of these three components by referencing the Speech Synthesis Markup Language (SSML) [SSML] for TTS control, the Grammar Markup Language (GRXML) [SRGS] for controlling ASR, and the Call Control Markup Language (CCXML) [CCXML] for telephony control. This architecture allows each of the sub-components (ASR, TTS, and telephony control) to have a specific XML language associated with it. Complete and sophisticated voice dialogs can be authored with VoiceXML and the underlying markups: GRXML, SSML, and CCXML. These XML documents allow for dynamic content creation, improved dialog handling, and simplified component integration. The remainder of this section briefly looks at these markups, followed by an in-depth look at VoiceXML's dialog, IP integration, and event handling capabilities.
2.1. VoiceXML dialogs

Figure 2. Architecture of a VoiceXML-based IVR

One of the important benefits of VoiceXML is the ability to author dialogs. Inherent to the VoiceXML specification is the Form Interpretation Algorithm (FIA), the logic for processing forms. A form is a collection of form items. The FIA is responsible for entry into a form, selecting the next form item to visit, the management of prompts, the activation and deactivation of grammars, and exiting a form. VoiceXML defines a "form item" as an element that is a child of form and is one of the following element types: field, block, initial, subdialog, object, record, or transfer. The FIA uses form item variables to determine dialog flow. The form item variables are:

- result variable - An ECMAScript variable whose name is defined by the form item's name attribute and whose scope is that of the containing dialog. When a form item has been successfully visited, its result variable contains the result of that visit. For example, a field's result variable contains the answer collected from the caller.
- guard condition - An ECMAScript expression specified by the form item's cond attribute, which must evaluate to true in order for the form item to be visited.
- count variable - An internal variable that keeps track of how many times the VoiceXML interpreter has attempted to fill a form item in a given invocation of the dialog.
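To illustrate how these variables shape dialog flow, the sketch below uses a count-sensitive prompt and a guard condition. The field names and the grammar URI are invented for the example; it is a minimal sketch, not an excerpt from any specification.

<form id="shipping">
  <field name="method">
    <!-- The count variable selects a more explicit prompt on a retry -->
    <prompt count="1">How should we ship your order?</prompt>
    <prompt count="2">Please say ground, express, or overnight.</prompt>
    <option>ground</option>
    <option>express</option>
    <option>overnight</option>
  </field>
  <!-- Guard condition: the FIA visits this field only when the cond
       expression evaluates to true -->
  <field name="arrivalday" cond="method == 'overnight'">
    <prompt>What day should the package arrive?</prompt>
    <grammar src="day.grxml" type="application/grammar+xml"/>
  </field>
</form>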
A simple example shows a form that collects two field values and posts them to a CGI (Common Gateway Interface) script.

<form>
  <block>
    To complete your order we need the following information.
  </block>
  <field name="color">
    <prompt>Select from one of the following colors <enumerate/></prompt>
    <option>red</option>
    <option>blue</option>
    <option>green</option>
  </field>
  <field name="size">
    <prompt>Select a size. You can select <enumerate/></prompt>
    <option>small</option>
    <option>medium</option>
    <option>large</option>
  </field>
  <filled>
    Thank you. Your order is being processed.
    <submit next="/cgi/details.cgi" namelist="color size"/>
  </filled>
</form>

2.2. SSML

Developers use SSML to specify how speech should be rendered. The language has elements for controlling the pronunciation, tone, inflection, and other characteristics of spoken words. Two elements for producing speech, which are part of VoiceXML, are the prompt and audio elements. SSML elements can occur within either of these elements. SSML elements include the following:

- emphasis - text spoken with emphasis
- prosody - allows control of pitch, rate, duration, and volume
- sentence - identifies a sentence
- paragraph - identifies a paragraph
- say-as - uses a type construct to render text
- phoneme - specifies a phonetic pronunciation
- voice - specifies a voice characteristic
- mark - used for asynchronous notification
- break - a pause

An example that uses SSML elements to control speech and audio output is shown:

<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="2.0">
  <form id="audiotest">
    <block>
      Your <emphasis>total</emphasis> is
      <say-as class="currency">$299.95</say-as>
      <audio src="http://205.188.234.65:8006">
        I'm sorry. The audio stream is not available today.
      </audio>
    </block>
  </form>
</vxml>

2.3. GRXML

Grammars are a way to define the domain of active spoken words or DTMF (Dual Tone Multi-Frequency) tones that are listened for. Grammars use rules and weights to specify the recognition logic. For describing grammars, VoiceXML will accept any of the following formats: GRXML, ABNF, and JSGF [SRGS]. Though each of the grammar languages has a similar purpose, GRXML is the only format that is an XML application. The grammar element is used by VoiceXML to specify a grammar; its type attribute states which grammar format is being used. The essential elements used to define GRXML grammars are:

- rule - a rule expansion declaration
- ruleref - a local or external rule reference
- item - defines an entity
- one-of - a set of alternatives
- tag - a string associated with a rule expansion
- grammar - the root element

Grammars can be defined externally using GRXML, or fields can use the built-in grammar types: digits, boolean, currency, date, number, phone, and time.

2.3.1. How recognition results get processed

Grammars take a user response as input and return a string value that represents the match. Using grammars, complex word patterns can be defined and tested for. In the example, the field named "favcolor" defines an inline grammar of acceptable answers to a prompt.

<?xml version="1.0" encoding="iso-8859-1"?>
<vxml version="2.0">
  <form id="test">
    <field name="favcolor">
      <prompt>What is your favorite color?</prompt>
      <grammar xml:lang="en-US" version="1.0" root="example1">
        <rule id="example1" scope="public">
          <one-of>
            <item><tag>'red'</tag>red</item>
            <item><tag>'green'</tag>green</item>
            <item><tag>'blue'</tag>blue</item>
            <item><tag>'red'</tag>burgundy</item>
            <item><tag>'blue'</tag>indigo</item>
          </one-of>
        </rule>
      </grammar>
      <filled>
        <prompt>
          You said your favorite color is <value expr="favcolor"/>.
        </prompt>
      </filled>
    </field>
  </form>
</vxml>

This grammar, for the sake of simplicity, defines five colors to choose from with the item element. The interpreter is responsible for setting up the ASR to recognize the five words. Each item in the grammar has a tag element whose value is returned from the ASR to the VoiceXML interpreter upon recognizing one of the five colors. The returned value is then bound to the field "favcolor". When the field's filled element is activated, the prompt speaks the user's selected color.

2.4. CCXML

Using CCXML, applications can manage all inbound and outbound call connectivity and control audio mixing and splitting. VoiceXML provides very little in terms of call control; in fact, it provides only two elements, disconnect and transfer. On its own, VoiceXML has no ability to place outbound calls, nor does it offer any of the conferencing features of CCXML. Another important shortcoming of VoiceXML that CCXML addresses is the lack of a framework for managing multiple instances of VoiceXML interpreters and events between interpreter instances. The main features of CCXML are:

- Outbound calls
- Support for multi-party calls
- Selective inbound call routing
- Asynchronous "external" event handling
- Conference objects for joining and unjoining participants
- Audio objects for splitting and mixing audio resources
- Control of and connectivity to one or more VoiceXML interpreter instances
- VoiceXML control to start, kill, or suspend a process
- Support for multiple CCXML programs and interconnection through events
- Coaching, flooring, and delegation control
- Web server connectivity
- Whisper transfer
- Supervised transfer

2.5. VoiceXML IP services

Prior to VoiceXML, IP (Internet Protocol) connectivity was virtually non-existent in IVRs; those vendors that did implement it used proprietary calls. IP connectivity can be thought of as two distinct issues. One is the ability to access documents by providing a URI, similar to accessing web pages over the Internet. The second is the ability to post values to an Internet service.
VoiceXML provides two elements for this variable/value passing: submit and subdialog. A submit uses an HTTP GET or POST to pass key/value pairs to a specified URL. The namelist attribute lists the variables whose values are passed, as shown in the code excerpt:

<if cond="selection=='menu'">
  <submit next="http://www.jimmyspizza.com/servlet/menu" method="post"
          namelist="userlevel orderstatus status" fetchtimeout="180s"/>
</if>

A subdialog element can be used to call another form as if it were a subroutine. The form being called as a subdialog must end with a return element. A subdialog form can also accept input parameters. Here we use the param element to pass a variable to the subdialog and use the results it returns (result.username and result.status) to output an audio prompt.

<form>
  <subdialog name="result" src="#getuseraccesslevel">
    <param name="userid" expr="'null'"/>
    <filled>
      <audio>The subdialog returned the name <value expr="result.username"/>
        and <value expr="result.status"/></audio>
      <submit namelist="result"
              next="http://myservice.example.com/cgi-bin/process"/>
    </filled>
  </subdialog>
</form>

<!-- subdialog to get user access level -->
<form id="getuseraccesslevel">
  <var name="userid"/>
  <field name="username">
    <grammar src="http://grammarlib/namegrammar.grxml"
             type="application/grammar+xml"/>
    <prompt> Please say your name. </prompt>
    <filled>
      <if cond="validuseraccess(username,userid)">
        <var name="status" expr="true"/>
      <else/>
        <var name="status" expr="false"/>
      </if>
      <return namelist="username status"/>
    </filled>
  </field>
</form>

2.6. VoiceXML Events

As per the specification, VoiceXML interpreters have limited event-handling capabilities. An event is thrown if the interpreter encounters a semantic document error or a throw element. The inherent event handlers of VoiceXML are the elements:

- noinput
- nomatch
- catch
- error
- help

One drawback of VoiceXML's event handling is that it is single-threaded. Consequently, only events that are explicitly handled by the VoiceXML application will be handled. Another point to consider is that a real-time telephony environment produces many asynchronous events, and as an application evolves it may need to handle some of them. VoiceXML is limited here; a CCXML interpreter is a good alternative for providing a multi-threaded event framework.

2.7. Mixed initiative

The term mixed initiative refers to the ability of either the computer or the user to drive the conversation. This is an important feature for making better voice interfaces. If a form contains an initial element, it is visited before all other form items. After visiting an initial, the interpreter waits for a form-level grammar to be satisfied. Once an answer is given that satisfies the form-level grammar, the interpreter attempts to fill any remaining unfilled fields using the standard Form Interpretation Algorithm. The trick to making a form mixed initiative is to provide a grammar that can answer any of the questions represented by the fields of the form. With a single utterance, a user's response can return results for multiple fields. The example below is a mixed-initiative form with a GRXML grammar for scheduling a flight. The user can say "I'd like to fly from city A to city B" or "I'd like to fly to city B from city A"; both responses are acceptable, and the depart and arrive fields are each filled by either utterance.

<vxml version="2.0">
  <form id="airlines">
    <initial name="itinerary">
      Where would you like to fly?
      <catch event="nomatch noinput">
        <prompt>I didn't get that.</prompt>
        <assign name="itinerary" expr="undefined"/>
        <reprompt/>
      </catch>
    </initial>
    <grammar xml:lang="en-US" version="1.0" root="flight">
      <rule id="flight" scope="public">
        <one-of>
          <item>
            <item repeat="0-1">I'd like to fly</item>
            from
            <ruleref uri="#city">
              <tag>depart=city.returnvalue;</tag>
            </ruleref>
            to
            <ruleref uri="#city">
              <tag>arrive=city.returnvalue;</tag>
            </ruleref>
          </item>
          <item>
            <item repeat="0-1">I'd like to fly</item>
            to
            <ruleref uri="#city">
              <tag>arrive=city.returnvalue;</tag>
            </ruleref>
            from
            <ruleref uri="#city">
              <tag>depart=city.returnvalue;</tag>
            </ruleref>
          </item>
        </one-of>
      </rule>
      <rule id="city" scope="public">
        <one-of>
          <item><tag>returnvalue='New York';</tag>New York</item>
          <item><tag>returnvalue='Los Angeles';</tag>Los Angeles</item>
          <item><tag>returnvalue='Los Angeles';</tag>L A</item>
          <item><tag>returnvalue='Toronto';</tag>Toronto</item>
          <item><tag>returnvalue='London';</tag>London</item>
          <item><tag>returnvalue='Paris';</tag>Paris</item>
        </one-of>
      </rule>
    </grammar>
    <field name="depart">
    </field>
    <field name="arrive">
    </field>
    <filled>
      <prompt>
        I have you flying from <value expr="depart"/> to <value expr="arrive"/>.
      </prompt>
    </filled>
  </form>
</vxml>

3. Introduction to enhanced voice services

Voice browsers offer a viable alternative to HTML browsers, not only where device real estate is limited, as in ever smaller mobile phones and PDAs, but also for natural dialogs, where users are no longer confined to directed menus or forms and can describe their objectives naturally through speech. Here we look at some other technologies that can be combined with VoiceXML, or that offer alternatives to it, for providing enhanced voice services.

3.1. XHTML + Voice

XHTML+Voice [XHTML+Voice] is a current technology that is ready for developing voice-enabled applications. By adapting XHTML for voice input and output, and leveraging its event model, XHTML+Voice offers a good transitional technology, and extending existing XHTML applications to support voice is greatly simplified. XHTML 1.0 is a reformulation of HTML 4.0 into XML, with presentation deferred to style sheets; XHTML 1.1 structured the language into modular components. Combining this with the DOM2 Event Model is what allows voice dialogs to be added to XHTML. Event handlers that implement VoiceXML actions can process events received by the event listener. The result is to give XHTML-based languages an event syntax that enables an interoperable way of associating behaviors with document-level markup. The XML event types supported by the XHTML+Voice profile include all intrinsic event types defined for HTML 4.01, plus the VoiceXML 2.0 events (noinput, nomatch, error, and help), as well as an additional filled event for field- or form-level filled elements. An XHTML element associates one of the event types with an ID reference to the VoiceXML form that will handle the event. To include voice dialogs with XHTML+Voice, the voice handlers are placed within the XHTML header. Within the XHTML body, an input element, for example, would listen for an "onfocus" event and then pass control to the handler. The result of the handler then needs to be assigned to the XHTML form variable; this can be accomplished with a script in the body that defines the assignment.
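A rough sketch of this structure is shown below. The ids, grammar URI, and form action are invented for illustration, and the exact event name and the mechanism for handing the recognized value back to the visual field depend on the profile version, so treat this as an outline rather than a working XHTML+Voice page.

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:vxml="http://www.w3.org/2001/vxml"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>Voice-enabled city field</title>
    <!-- Voice handler: a VoiceXML form placed in the XHTML header -->
    <vxml:form id="voice_city">
      <vxml:field name="vcity">
        <vxml:prompt>Which city?</vxml:prompt>
        <vxml:grammar src="city.grxml" type="application/grammar+xml"/>
        <!-- On filled, the recognized value would be copied into the
             XHTML input below, e.g. with an ECMAScript assignment -->
      </vxml:field>
    </vxml:form>
  </head>
  <body>
    <form action="/cgi/itinerary.cgi">
      <!-- Focusing this input raises a DOM event that activates the
           voice handler declared in the header -->
      <input type="text" id="city" name="city"
             ev:event="focus" ev:handler="#voice_city"/>
    </form>
  </body>
</html>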
3.2. SALT

SALT [SALT] is probably more an alternative to VoiceXML than an enhancing technology that would coexist with it. Work on SALT (Speech Application Language Tags) started in October 2001, and the specification was submitted to the W3C for review in August 2002. SALT is similar to XHTML+Voice in that it also leverages the event-based DOM execution model to integrate with specific interfaces. SALT defines "speech tags" that can be treated as extensions to HTML, enabling developers to add a spoken dialog interface to Web applications. Speech tags are a set of XML elements, not unlike VoiceXML, that provide dialog, speech interface, and call control services. In general there are fewer elements to contend with than in VoiceXML. There is also no inherent FIA; writing this flow logic is left to the application developer. The five main element types SALT uses are:

- prompt - configures the speech synthesizer and plays out prompts
- reco - configures the speech recognizer, executes recognition, and handles recognition events
- grammar - specifies input grammar resources
- bind - processes recognition results into the page
- dtmf - configures and controls DTMF

3.3. XForms

XForms [XForms] is an XML application that represents the next generation of forms for the Web. By splitting traditional XHTML forms into three parts - the XForms model, instance data, and user interface - presentation and content are separated. This allows for reuse and provides strong typing, which in turn reduces the number of calls to a server. With XForms, device-independent modeling can be accomplished. XForms is intended to be integrated into other markup languages, such as XHTML or SVG, rather than to stand alone as a freestanding document type; as such it may take time before XForms is ready for implementation. There are two issues to consider when combining XForms with VoiceXML. One benefit is that XForms provides a data model with specifiable data types; this model is consistent across different browsers and facilitates the sharing of data. The other consideration is the user interface and control. The control portion can be rendered as an integral part of another markup, which could be difficult for existing interpreters. Another possibility is to transform the control, with an XSLT, into a native, device-specific markup - in this case VoiceXML.

3.4. XSLT

Shown in Figure 3 is a flexible approach to generating dynamic VoiceXML, HTML, or other markup languages. In this approach the data model and the presentation of data are separated. The data model referred to here is not the same as the data model in XForms; instead, the XForms user interface portion, which describes the purpose of the interface, forms the data model of the transformation.
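A minimal sketch of such a transformation is shown here; it assumes the element names used in the XForms-style fragment that appears later in this section (selectone, caption, choices, item, value) and turns a selection into a VoiceXML field with options. This is an illustrative stylesheet, not part of any specification.

<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Wrap the generated fields in a minimal VoiceXML document -->
  <xsl:template match="/">
    <vxml version="2.0">
      <form>
        <xsl:apply-templates select="//selectone"/>
      </form>
    </vxml>
  </xsl:template>

  <!-- Each selection becomes a field whose prompt is the caption and
       whose options are the item values -->
  <xsl:template match="selectone">
    <field name="{@ref}">
      <prompt><xsl:value-of select="caption"/></prompt>
      <xsl:for-each select="choices/item">
        <option><xsl:value-of select="value"/></option>
      </xsl:for-each>
    </field>
  </xsl:template>
</xsl:stylesheet>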
Figure 3. Generating VoiceXML and other markup languages with XSLT

The XSLT [XSLT] takes the source XML document and, using different XSL templates, generates HTML, VoiceXML, WML, or other document types. Once generated, these documents are interpreted by an appropriate browser. In the case of HTML, the document is served through a Web server to the client's browser, where the actual rendering of the Web page takes place. For VoiceXML documents, a VoiceXML interpreter renders the document to the client telephone. The XForms user interface representing the data model might look like the following:

<selectone ref="as">
  <caption>Select Payment Method</caption>
  <choices>
    <item>
      <caption>Cash</caption>
      <value>cash</value>
    </item>
    <item>
      <caption>Credit</caption>
      <value>credit</value>
    </item>
  </choices>
</selectone>
<input ref="cc">
  <caption>Credit Card Number</caption>
</input>
<input ref="exp">
  <caption>Expiration Date</caption>
</input>
<submit submitinfo="submit">
  <caption>Submit</caption>
</submit>

Notice that the user interface does not dictate how the interface should look; only its purpose is stated. The transformation is responsible for producing the document that renders the device-specific interface.

3.5. NLSML

The W3C working draft on the Natural Language Semantics Markup Language [NLSML] attempts to formalize the results of semantic interpreters. The intent is that semantic interpreters will generate NLSML documents extracted from the user's utterances and the machine-determined meaning. These documents provide results on interpretation meaning and mapping to a data model. The first-order results of meaning represented by an NLSML document can be used to further direct the dialog into context categories. Because certain words or phrase fragments are more likely to occur in a given context, these expectations can be used to classify phrase fragments into context categories [DESP].
The next section on natural dialogs explores this process further. NLSML also has support for managing multiple input devices (multimodality). The true benefit of NLSML is that intermediate results can be integrated with further processing logic to provide natural dialog systems. Another value of NLSML is for testing and evaluating such highly interdependent and adaptive systems: using intermediate results of meaning is helpful for isolating system factors and for performing comparative vendor tests.

3.6. Natural dialogs

Here we look at systems that understand meaning based on context. Instead of using word grammars to define the domain of acceptable phrases, we use task models to describe the acceptable tasks that can be performed. The acceptable tasks change depending on the state a dialog has taken. To date, language processing has successfully provided understanding, though often constrained to the grammars defined. As a result, most current systems use a loosely coupled, unidirectional interface, such as grammars within VoiceXML or n-best word lists, with natural language constraints applied as a post-process for filtering the recognizer output. Context provides a level of discourse that places significant constraints on what people can talk about and how things can be referred to. In other words, knowing the context narrows down what the speaker is trying to say. Dialog systems, as shown in Figure 4, use the current context, user input, and task model to determine the system response and the new context.

Figure 4. Dialog management

To achieve reasonable coverage of meaning, language-processing research has developed techniques based on "partial analysis" - the ability to find meaning-bearing phrases in the input and to construct meaning representations out of them without requiring a complete analysis of the entire string [USS]. To build a dialog management system, the first phase would be an analysis of the different ways in which individuals can express the finite tasks of the system. To accomplish this, a system would need to record, analyze, and categorize acceptable user utterances. One outcome of this analysis is a requirement for the recognition grammars needed. A second outcome is the identification of partial categories that can be represented with some meaning; NLSML is a good candidate language for representing them. The system interactively uses these partial categories to further narrow the scope until a final task is reached. The process is not as daunting as it may seem at first, especially considering that most real-world customer support systems ultimately have a finite number of performable tasks, usually on the order of a few dozen.

3.7. Multimodal (multi-interface)

Multimodal access enables users to interact with an application in a variety of ways: they can input data using speech, a keyboard, keypad, mouse, and/or stylus, and receive data as synthesized speech, audio, plain text, motion video, and/or graphics. Each of these modes can be used independently or concurrently. Modality considers more than one input and output, and the interaction can be sequential or synchronous for both inputs and outputs:
- sequential input and output - By sequential multimodality we mean that only one of the different modalities is active at a given time, for both input and output modes. For example, a user may have a bi-modal device with a visual and a voice interface. At any time the device can accept, as input, either a button click or a spoken response; however, only the first input will be used for processing. For sequential modal output, again, only one mode is used for output at any time. One scenario is a directory service that the user dials into from a mobile phone:

  Mobile user dials directory service.
  Directory service: How may I help you?
  Mobile phone: I need the phone number and directions for Ray's Pizza.
  Directory service: Which one? Is that the one located on 215 33rd Street, or at 202 Lexington Avenue, or...? [user barge-in]
  Mobile phone: The one on 33rd Street.
  Directory service: How would you like the phone number and directions: via voice, visual, or email?
  Mobile phone: Visual.

  The directions and number are sent as an SVG document. Another scenario is web and voice access to the same shopping transaction. Here there are two actors: one does the shopping and the other holds the credit card. The shopper goes onto the web and purchases several items. When asked for credit card information, they choose to defer the transaction until later and are given a transaction ID and a security code. The shopper can then ask the credit card holder to call in to the company with the transaction ID and security code. The credit card holder can review the shopping basket contents and, if they choose, proceed with the transaction by supplying their credit card information.

- synchronous input - Here, more than one input mode is accepted simultaneously. There is, of course, a finite window of time in which the application will be "listening" for the user's input. Disambiguation of inputs needs to be provided: if the user clicks on one item and verbally selects another, there must be a way to resolve the conflict, whether through markup or as part of the application logic. An example is a mobile user who has a map displayed on their phone and is engaged in a voice dialog with the directory service. They can say, "Give me all the bookstores in this area," while simultaneously moving their stylus onto the map and encircling a region.

- synchronous output - Multimodal output refers to more than one modality being used as output simultaneously in a coordinated manner. There needs to be a mechanism to synchronize output; the W3C Working Draft "Multimodal Requirements for Voice Markup Languages" specifies SMIL [SMIL] for synchronization (a rough sketch follows this list). An example is again the mobile user, this time viewing a simple slide presentation on their mobile device while listening to a synchronized narration through their earpiece. Here synchronization of the two streams needs to be managed locally at the device. It is also conceivable that the slide presentation and the audio are delivered to two separate devices, such as a phone and a desktop browser; in this case, synchronization can be accomplished by letting the audio playback trigger events that control the visual slide presentation.
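As a rough illustration of the SMIL-based approach to synchronized output, the fragment below plays an audio narration in parallel with a timed sequence of slides. The resource names, durations, and layout are invented; this is a sketch of the general idea rather than a complete presentation.

<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <layout>
      <root-layout width="320" height="240"/>
      <region id="slides" width="320" height="240"/>
    </layout>
  </head>
  <body>
    <!-- par plays its children in parallel: narration and slides together -->
    <par>
      <audio src="narration.wav"/>
      <!-- seq plays its children one after another -->
      <seq>
        <img src="slide1.svg" region="slides" dur="20s"/>
        <img src="slide2.svg" region="slides" dur="20s"/>
      </seq>
    </par>
  </body>
</smil>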
3.8. Conclusion

For complex architectures that integrate multiple interpreter types and connect to external real-time events, a few considerations covered above should be weighed for voice and multimodal applications. As new and more specialized markup languages emerge, there will likely be some overlap with existing markup languages. We saw this, for example, with CCXML and VoiceXML: both provide call management functionality. The preferred approach is to use the richer language for the task; in this case, CCXML should be used for an application that requires outbound calls or sophisticated call handling.
Another point to consider when integrating multiple markup languages and real-time events is how the events get managed. CCXML offers an improved framework for managing multiple asynchronous events. Also consider how each interpreter context can pass variables to another context; again CCXML is a good choice, as CCXML events allow variables to be passed. For multimodal applications, the XForms data model is valuable for abstracting data elements from different interfaces into a common data representation. For now it is still necessary to transform the user interface control logic for each browser type: an XForms control will need to be transformed with an XSLT into the appropriate markup, e.g. VoiceXML or HTML. Lastly, when designing natural dialog systems, a language like NLSML will simplify integration with other decision-making logic and provide a manageable mechanism for monitoring and testing results.

Bibliography

[SSML] Speech Synthesis Markup Language Specification. Available at http://www.w3.org/tr/speech-synthesis/.
[CCXML] Call Control Markup Language. Available at http://www.w3.org/tr/ccxml/.
[VoiceXML 2.0] VoiceXML 2.0 Specification. Available at http://www.w3.org/tr/voicexml20/.
[SRGS] Speech Recognition Grammar Specification. Available at http://www.w3.org/tr/speech-grammar/.
[XHTML+Voice] XHTML+Voice Specification. Available at http://www.w3.org/tr/xhtml+voice/.
[SALT] Speech Application Language Tags. Available at http://www.saltforum.org/.
[XForms] XForms Specification. Available at http://www.w3.org/tr/xforms/.
[XSLT] XSL Transformations. Available at http://www.w3.org/tr/xslt/.
[NLSML] Natural Language Semantics Markup Language. Available at http://www.w3.org/tr/semantic-interpretation/.
[DESP] S. Abdou and M. Scordilis. Improved Speech Understanding Using Dialogue Expectations in Sentence Parsing. Proceedings of ICSLP 2000.
[USS] W. Ward. Understanding Spontaneous Speech: The Phoenix System. Proceedings of ICASSP 1991.
[SMIL] Synchronized Multimedia Integration Language. Available at http://www.w3.org/tr/rec-smil/.

Glossary

ASR - Automatic Speech Recognition
CCXML - Call Control Markup Language
GRXML - Grammar Markup Language
IVR - Interactive Voice Response
SSML - Speech Synthesis Markup Language
TTS - Text-To-Speech

Biography

Adam Hocek
Broadstrokes, Inc.
New York, United States of America
ahocek@broadstrokesinc.com

Adam is President/CTO of a New York based startup company developing XML-based device technologies. He is also co-author of "Definitive VoiceXML".