Thin Client Development and Wireless Markup Languages cont. VoiceXML and Voice Portals

David Tipper
Associate Professor
Department of Information Science and Telecommunications
University of Pittsburgh
tipper@tele.pitt.edu
http://www.sis.pitt.edu/~dtipper/2727.html
Slides 12

VoiceXML and Voice Portals
- VoiceXML together with a voice portal provides speech-enabled access to automated text, web, and voice information
  - Allows the user to navigate through voice web pages
- Why VoiceXML?
  - Remember: the device is a phone first and a computer/web device second
- Advantages
  - Device independence: works with any digital phone (wired or wireless)
  - Easier, more natural I/O
  - There are times when voice interaction is more appropriate or easier: while driving a car, obtaining directions, accessing email over the phone, entering info/data
  - Low cost

Standards-Based VoiceXML
- VoiceXML Forum
  - Industry group (Motorola, Lucent, AT&T, etc.)
  - Developed VXML 1.0, released in 2000
  - Based on XML
- W3C Voice Browser Working Group
  - Developed VoiceXML 2.0
  - VoiceXML 2.1: June 2007
- Current focus
  - Improved speech and grammar recognition and text-to-speech translation
  - Multi-modal applications
    - Voice + Web applications: call for directions, get a map plus voice directions

VoiceXML Applications
- Predicted boom in VoiceXML applications, especially as a replacement for human operators
- Sample applications
  - Information retrieval: check weather, sports scores, directions (Cingular Voice Dial Service), stock prices, etc.
  - Directory assistance: AT&T uses this
  - E-commerce: catalog ordering, tickets, bill payment, etc.
  - Telephone services: voice mail management, teleconferencing, secure phone calls
  - Unified messaging
    - Browse and listen to email messages over the phone
    - Record voice and have it sent via email, SMS, or voice mail

VoiceXML Architecture
- User connects to a voice portal that contains a VoiceXML browser
- The VoiceXML browser
  - Handles interaction with the user (I/O)
  - Fetches information from web servers
  - Transforms VoiceXML content for delivery to the user
- The portal contains several technology components that the browser uses to handle communication and process VoiceXML documents
[Figure: Client, WWAN, Voice Portal/Gateway with VoiceXML Browser, Internet, Server holding VoiceXML documents]

Portal Technology Components
- Automatic Speech Recognition (ASR)
  - Converts the speech signal to text or numbers
  - Strives to be speaker independent or speaker adaptive
  - Matches speech against a given set of words or phrases (called a grammar)
  - Much less computationally intensive than unconstrained speech recognition
- Text-to-Speech Synthesis (TTS)
  - Converts text/numeric input to synthesized speech
  - Older systems sound robotic
  - Newer systems use waveform concatenation
  - Examples: http://research.att.com/projects/tts.demo.html
[Figure: VoiceXML gateway components: Voice Browser, ASR, TTS, Audio, Telephony, TCP/IP]

Portal Technology Components
- Audio resource
  - Plays prerecorded audio files
  - Records user input for post-processing
- Telephony resource
  - Call processing
  - Dual Tone Multi-Frequency (DTMF) keypad input
  - Call transfer to a third party
  - Etc.
- TCP/IP resource
  - Provides communication with web servers

VoiceXML Session
1. User calls the application phone number
2. The VXML gateway converts the input to an HTTP request to the web server
3. The server responds to the VXML gateway with content (VoiceXML documents, audio files)
4. The gateway converts the content to an interactive audio session with the user
[Figure: the caller reaches the VoiceXML gateway over the cellular network (1); the gateway sends an HTTP request over the Internet (2) to the web server hosting VoiceXML documents and audio resources; the server responds (3); the gateway carries on the interactive audio session (4), e.g. "The score of the game is.."]

VoiceXML Input/Output
- In a typical session the user and the application take turns speaking and listening, so I/O is crucial
- Methods for user input
  1. Spoken commands
     - Interpreted by ASR; accuracy is improved by specifying a grammar
  2. DTMF (Dual Tone Multi-Frequency) key input
     - User enters data on the keypad; accuracy is improved by specifying the expected input
  3. Recorded speech for post-processing
     - Saved in a standard format (e.g., a .wav file)

VoiceXML Input/Output
- Methods for output to the user
  1. Text to Speech (TTS)
     - Speech is synthesized on the fly; can sound machine-like
     - Markup can control how the TTS is played
  2. Prerecorded audio files
     - Downloaded from the server and played by the portal
     - Sound more natural to the user and are easier to understand
     - Often recorded by a professional

VoiceXML Concepts
- Session
  - Begins when the user connects to the portal and interacts with the browser
  - VoiceXML documents are loaded and unloaded as the session continues
  - Session end is controlled by the user, the gateway, or the document
- Application
  - A set of VoiceXML documents that share the same root document

VoiceXML Concepts
- Dialogs
  - A conversation with the user; two basic types
    - Form: presents information and collects user input; contains fields
    - Menu: gives the user options to select from and changes the dialog state based on the input
  - Sub-dialogs are possible: like a function call to commonly used forms/menus
  - The dialog between the user and the application needs to be carefully designed; typically the application prompts the user and the user responds in turn
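The function-call analogy for sub-dialogs can be made concrete with a small sketch. This is an illustration written for these notes rather than an example from the slides: the form ids (main, confirm), the variable names (answer, ok), and the prompt wording are all made up, and support for the built-in boolean field type may vary between platforms.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <!-- Calling form: invokes the reusable dialog, then acts on its result -->
  <form id="main">
    <subdialog name="answer" src="#confirm">
      <filled>
        <!-- Values returned by the sub-dialog appear as properties of "answer" -->
        <if cond="answer.ok">
          <prompt> Your order is confirmed. </prompt>
        <else/>
          <prompt> Your order was cancelled. </prompt>
        </if>
      </filled>
    </subdialog>
  </form>

  <!-- Reusable dialog: asks a yes/no question and returns the answer -->
  <form id="confirm">
    <field name="ok" type="boolean">
      <prompt> Do you want to place the order? </prompt>
      <filled>
        <return namelist="ok"/>
      </filled>
    </field>
  </form>
</vxml>
```

The sub-dialog runs in its own context, so the only thing the caller sees is what the return element hands back, much like a function's return value.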

VoiceXML Concepts
- Grammars
  - The expected user input, either spoken or DTMF key presses
    - For example: "say or enter your 5 digit zip code"
  - For spoken input a grammar library is often specified to help interpret the input correctly
  - Specifying a grammar library greatly increases the accuracy of automatic speech recognition
  - Should always include error checking and reprompting of the user to handle mistakes in input

VoiceXML Documents
- VoiceXML documents define one or more dialogs
- VoiceXML documents can contain
  - Spoken prompts (synthetic speech or recorded)
  - Output of audio files and streams
  - Recognition of spoken words and phrases
  - Recognition of touch-tone key presses
  - Recording of spoken input
  - Control of dialog flow
    - Links to other VoiceXML documents
  - Events
    - Response to interruption or incorrect input
  - Telephony control
    - Call transfer to a third party, hang up, etc.
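The zip code prompt mentioned above could be written roughly as follows. This is a minimal sketch that assumes the platform supports the built-in digits field type defined in VoiceXML 2.0; the form id, field name, and prompt wording are illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="zipcode">
    <!-- Built-in "digits" type restricted to exactly five digits;
         it covers both spoken digits and DTMF key presses -->
    <field name="zip" type="digits?length=5">
      <prompt> Say or enter your five digit zip code. </prompt>
      <!-- Error checking and reprompting, as recommended above -->
      <noinput> I did not hear anything. <reprompt/> </noinput>
      <nomatch> That was not a five digit number. <reprompt/> </nomatch>
      <filled>
        <prompt> You entered <value expr="zip"/>. </prompt>
      </filled>
    </field>
  </form>
</vxml>
```

Constraining the input to five digits is what gives the recognizer its accuracy: anything else raises a nomatch event instead of being guessed at.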

VoiceXML Concepts
- The basic concepts are inter-related
  - A session invokes 1 or more applications
  - An application involves 1 or more documents
  - A document can contain 0 to many dialogs

Basic VoiceXML Elements
- Follows XML format
  - Basic elements start and end with tags

    <element_name attribute_name="attribute value"> </element_name>

- Main elements
  - <form>: dialog for presenting and collecting data
  - <object>: platform-specific script that may gather user input and return a result
  - <grammar>: set of valid expressions that a user can say or type when interacting with an application
  - <block>: a piece of non-interactive executable code
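One way to picture how several documents form one application is through the shared root document. The sketch below shows two separate files in one listing; the file name app-root.vxml, the variable caller_name, and the prompt text are hypothetical, and the XML declarations are left out so that both files fit in a single block.

```xml
<!-- File 1: app-root.vxml (hypothetical name), the application root document -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Visible to every document of the application as application.caller_name -->
  <var name="caller_name" expr="'guest'"/>
  <!-- Default help handler inherited by every dialog in the application -->
  <help> You can ask for news, weather, or sports. </help>
</vxml>

<!-- File 2: a leaf document that names File 1 as its root -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
      application="app-root.vxml">
  <form id="greeting">
    <block>
      <prompt> Hello <value expr="application.caller_name"/>. </prompt>
    </block>
  </form>
</vxml>
```

Every document that points at the same root through the application attribute belongs to the same application, so variables and handlers defined in the root survive as the session moves from one leaf document to the next.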

VoiceXML Output Elements
- <prompt>: outputs computer-generated speech (TTS) or audio files
- Text for TTS can be marked up to improve quality
  - <break>: inserts a pause
  - <emphasis>: increases volume (provides emphasis)
  - <say-as>: specifies a particular speaking style
    - Still-ers
    - <say-as type="phone">014126249421</say-as>
- <audio>: plays a prerecorded file (.wav)
  - <audio src="file.wav"/>
  - Common audio files are cached at the portal
- <reprompt>: sends processing back to the original prompt

VoiceXML Example

    <?xml version="1.0"?>
    <vxml version="2.0">
      <form>
        <block>
          <prompt> Pitt is it </prompt>
        </block>
      </form>
    </vxml>

- All VoiceXML files (.vxml) begin with the xml and vxml prolog
- This document has a single form, which contains a block that synthesizes "Pitt is it" and plays it to the user
- Since a successor dialog is not specified, the conversation ends
[Figure: "Pitt is it" output; labels: VoiceXML, xhtml-mp, WML, chtml]
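The output elements above can be combined in a single prompt. The following is a small sketch rather than an example from the slides: welcome.wav is a hypothetical file name, the score is invented, and how strongly a given TTS engine honors break and emphasis is platform dependent.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      <prompt>
        <!-- Recorded greeting; the enclosed text is spoken by TTS
             only if the audio file cannot be played -->
        <audio src="welcome.wav"> Welcome to the score line. </audio>
        <!-- Half-second pause before the synthesized part -->
        <break time="500ms"/>
        The final score was <emphasis> twenty one </emphasis> to seven.
      </prompt>
    </block>
  </form>
</vxml>
```

Putting fallback text inside the audio element is a common way to keep a dialog usable when a recording is missing or the fetch fails.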

Basic VoiceXML Elements
- Additional elements
  - <menu>: dialog for selecting among several options
  - <choice>: an alternative in a menu dialog
  - <field>: gathers user input as defined by a specified grammar
  - <filled>: block of executable code that is run after the user input field is filled
  - <record>: records an audio file from the user
  - <if> <elseif> <else>: conditional logic
  - <goto>: control flow from a form, within and between documents; like links in HTML
  - <var>: declares variables
  - <transfer>: transfers the phone call to another number
- Can add scripting with JavaScript

VoiceXML Examples

    <menu>
      <prompt>
        This is the main menu. Please choose a service: news, weather, or sports.
      </prompt>
      <choice next="news.vxml"> news </choice>
      <choice next="weather.vxml"> weather </choice>
      <choice next="sports.vxml"> sports </choice>
    </menu>
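The menu above only needs <choice>; the other elements in the list come into play once a form has to make decisions. The sketch below combines <var>, <field>, <filled>, <if>/<elseif>/<else>, and <goto>. It assumes the built-in number field type is available on the platform; the form ids, the variable adult_age, and the target adult.vxml are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Document-level variable used in the conditions below -->
  <var name="adult_age" expr="18"/>

  <form id="age_check">
    <field name="age" type="number">
      <prompt> How old are you? </prompt>
      <!-- Runs once the field has been filled by the caller -->
      <filled>
        <if cond="age &lt; 13">
          <prompt> Sorry, you are too young for this service. </prompt>
          <exit/>
        <elseif cond="age &lt; adult_age"/>
          <!-- Jump to another dialog in the same document -->
          <goto next="#teen"/>
        <else/>
          <!-- Jump to a different document -->
          <goto next="adult.vxml"/>
        </if>
      </filled>
    </field>
  </form>

  <form id="teen">
    <block>
      <prompt> Welcome to the teen line. </prompt>
    </block>
  </form>
</vxml>
```

Note that the less-than sign inside cond has to be written as &lt; because the condition sits inside an XML attribute.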

VoiceXML Examples

    <menu>
      <prompt>
        This is the main menu. For news press 6; for weather press 9; for sports press 7.
      </prompt>
      <choice dtmf="6" next="news.vxml"> news </choice>
      <choice dtmf="9" next="weather.vxml"> weather </choice>
      <choice dtmf="7" next="sports.vxml"> sports </choice>
    </menu>

- Note: real applications need error checking and timeouts in place to deal with user input errors
- Special VoiceXML elements exist for this: <noinput>, <nomatch>, etc.

VoiceXML Error Handling
- <noinput>: catches a noinput event when the user gives no input within the timeout period

    <noinput> I'm sorry. I didn't hear anything. <reprompt/> </noinput>

- <nomatch>: catches a nomatch event when the input doesn't match a specified grammar

    <nomatch> I didn't get that. <reprompt/> </nomatch>

- <help>: executed when the user says "help"; can be made universal to the whole document or local to various parts

    <property name="universals" value="all"/>
    <help>
      <block> Now taking you to Customer Services. </block>
      <transfer name="services" bridge="true" connecttimeout="300" dest="phone://14088502255"/>
    </help>

- <property>: can control platform features, for example how long the application waits for input; to time out after 10 secs:

    <property name="timeout" value="10s"/>
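In practice the handlers above are often escalated so that a caller who stays silent is not stuck in an endless loop. The sketch below uses the count attribute of the event handlers for that purpose; the prompts, key assignments, and the five-second timeout are illustrative choices, not values taken from the slides.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="service_menu">
    <field name="service">
      <!-- Wait five seconds for input before raising a noinput event -->
      <property name="timeout" value="5s"/>
      <prompt> Please choose news, weather, or sports. </prompt>
      <!-- Each option contributes a spoken phrase and a key press to the grammar -->
      <option dtmf="6" value="news"> news </option>
      <option dtmf="9" value="weather"> weather </option>
      <option dtmf="7" value="sports"> sports </option>

      <!-- First silence: short apology and reprompt -->
      <noinput count="1"> I'm sorry. I didn't hear anything. <reprompt/> </noinput>
      <!-- Second silence: spell out what the caller can do -->
      <noinput count="2">
        You can say news, weather, or sports, or press 6, 9, or 7.
        <reprompt/>
      </noinput>
      <!-- Third silence: give up politely -->
      <noinput count="3">
        <prompt> We seem to be having trouble. Goodbye. </prompt>
        <exit/>
      </noinput>
      <nomatch> I didn't get that. <reprompt/> </nomatch>

      <filled>
        <prompt> You chose <value expr="service"/>. </prompt>
      </filled>
    </field>
  </form>
</vxml>
```

The handler with the highest count not exceeding the number of times the event has occurred is the one that runs, so the three noinput handlers fire in order.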

Grammars
- A grammar specifies the natural language words or phrases that will be matched
- Can be included in the document, or can reference a separate file or a standard dictionary; several formats are available

    <grammar>
      ;GSL2.0
      grammar definition text
    </grammar>

    <grammar src="filename.gram" type="grammar type"/>

- Most VoiceXML portals specify a grammar type, for example one based on Nuance speech technology

    <grammar type="application/x-nuance-gsl">
      [ news weather sports ]
    </grammar>

VoiceXML Example

    <block>
      <prompt> This is the BeVocal calculator. </prompt>
    </block>
    <field name="op">
      <prompt> Choose add, subtract, multiply, or divide. </prompt>
      <grammar type="application/x-nuance-gsl">
        [add subtract multiply divide]
      </grammar>
      <help>
        Please say what you want to do.
        <reprompt/>
      </help>
      <filled>
        <prompt> Okay, let's <value expr="op"/> two numbers. </prompt>
      </filled>
    </field>
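For comparison with the Nuance GSL form above, the same three-word grammar might be written in the W3C XML grammar format (SRGS), which VoiceXML 2.0 platforms are expected to support. This is a sketch under that assumption; the rule id, field name, and prompt are illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="pick_service">
    <field name="service">
      <prompt> Say news, weather, or sports. </prompt>
      <!-- Inline XML grammar: one rule that accepts any one of three words -->
      <grammar type="application/srgs+xml" version="1.0"
               root="service" mode="voice">
        <rule id="service">
          <one-of>
            <item> news </item>
            <item> weather </item>
            <item> sports </item>
          </one-of>
        </rule>
      </grammar>
      <filled>
        <prompt> Getting the <value expr="service"/> for you. </prompt>
      </filled>
    </field>
  </form>
</vxml>
```

Writing grammars in the standard format rather than a vendor format is one way to reduce the cross-platform problems that non-uniform grammars cause.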

VoiceXML Applications
- Pros
  - Easy to develop and implement: don't need a service provider
    - Several hosting services are available: BeVocal, Tellme, VoiceGenie, etc.
  - Easy to use and cost effective (according to Goldman Sachs, an average of $3/call if human assisted vs. $0.20/call if automated)
  - Easy to upgrade/modify
- Cons
  - Dialogs need to be carefully constructed or users get frustrated
  - Non-uniform grammars and document types can lead to cross-platform problems

VoiceXML Example

    <?xml version="1.0" encoding="utf-8"?>
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.w3.org/2001/vxml http://www.w3.org/TR/voicexml20/vxml.xsd">
      <form>
        <property name="bargein" value="true"/>
        <block>
          <prompt> Welcome to Mad Libs. Press the pound key after you say each word. </prompt>
        </block>
        <record name="one" beep="true" maxtime="5s" finalsilence="4000ms" dtmfterm="true" type="audio/x-wav">
          <prompt timeout="5s"> Say a verb. </prompt>
          <noinput> I didn't hear anything, please try again. </noinput>
        </record>
        <record name="two" beep="true" maxtime="5s" finalsilence="4000ms" dtmfterm="true" type="audio/x-wav">
          <prompt timeout="5s"> Say a noun. </prompt>
          <noinput> I didn't hear anything, please try again. </noinput>
        </record>

(continued on the next slide)

VoiceXML Example (continued)

    <block>
      <prompt>
        To be, or not to <audio expr="one"/>
        that is the <audio expr="two"/>
        Whether 'tis nobler in the <audio expr="three"/> to suffer
        the slings and <audio expr="four"/> of <audio expr="five"/> fortune,
        Or to take <audio expr="six"/> against a sea of <audio expr="seven"/>,
        And by <audio expr="eight"/> end them. To die, to <audio expr="nine"/>
        No more; and by a <audio expr="nine"/> to say we end
        the <audio expr="ten"/> and the <audio expr="eleven"/> natural shocks
        that flesh is <audio expr="twelve"/>
      </prompt>

Markup Language Future
- Multi-modal markup languages have been proposed to combine features
- For example, the X+V language proposed to the W3C by Motorola, Opera Software ASA, and IBM