Language Technology II: Language-Based Interaction. Dialogue design, usability, evaluation.




Language Technology II: Language-Based Interaction
Dialogue design, usability, evaluation
Manfred Pinkal, Ivana Kruijff-Korbayová
Course website: www.coli.uni-saarland.de/courses/late2

Basic Architecture of a Dialog System (3)
Components: ASR, TTS, NL Analysis, NL Generation, Dialog Manager, Dialog Model, Database, Device Interface.

Evaluation of ASR Systems
Word-error rate (WER)
Speed (real-time performance)
Size of lexicon
Perplexity

The standard performance measure is the word error rate (WER): the minimum edit distance between the best hypothesis and the correct string, i.e. the number of insertions, substitutions and deletions, divided by the total number of words in the correct string:
WER = (Substitutions + Deletions + Insertions) / N, with N the number of words in the correct (reference) string.
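As a minimal illustrative sketch (not part of the original slides), WER can be computed with a standard dynamic-programming edit distance over word sequences. The function name is invented, and the example treats the first German sentence from the next slide as the reference and the second as the recognizer output:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Illustrative WER: minimum word-level edit distance divided by the
    number of words in the reference (correct) string."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# The German example used on the next slide: one substitution ("das" -> "dann")
# and one deletion ("dann"), i.e. 2 edits over 11 reference words.
ref = "Ja das wäre eine gute Idee Das könnten wir dann machen"
hyp = "Ja dann wäre eine gute Idee Das könnten wir machen"
print(word_error_rate(ref, hyp))  # ~0.18
```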

A Problem for Measurement
WER is an objective, quantitative measure, but does it really measure the relevant properties of the system?
Example (all words vs. content words vs. concepts):
"Ja, das wäre eine gute Idee. Das könnten wir dann machen." (Yes, that would be a good idea. We could do that then.)
"Ja, dann wäre eine gute Idee. Das könnten wir machen." (Yes, then would be a good idea. We could do that.)

Another Problem
Confidence values: a measure (usually a probability value between 0 and 1) of the degree to which the system believes in its best hypothesis (at sentence level or word level).
Problem: How can the assignment of confidence values be assessed at all?

Evaluation of TTS
Intuitive evaluation by users with respect to intelligibility, pleasantness and naturalness.
No objective (though quantitative) criteria, but extremely important for user satisfaction.

Measuring Dialogue Model Quality
"Query density": the number of concepts per user query/turn.
"Concept efficiency": the number of user turns per concept understood by the system.
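A rough sketch of how these two measures could be computed from an annotated dialogue log; the data structure and the concept annotations are invented for illustration and assume each user turn is labelled with the concepts it introduces and the concepts the system actually understood:

```python
def query_density(turns):
    """Slide definition: number of concepts per user query/turn."""
    total = sum(len(t["concepts_introduced"]) for t in turns)
    return total / len(turns)

def concept_efficiency(turns):
    """Slide definition: number of user turns per concept understood
    by the system."""
    understood = sum(len(t["concepts_understood"]) for t in turns)
    return len(turns) / understood

# Hypothetical annotated log of three user turns in an MP3 domain
turns = [
    {"concepts_introduced": ["artist=REM", "album=New Adventures in Hi-Fi"],
     "concepts_understood": ["artist=REM"]},
    {"concepts_introduced": ["album=New Adventures in Hi-Fi"],
     "concepts_understood": ["album=New Adventures in Hi-Fi"]},
    {"concepts_introduced": ["song=Believe"],
     "concepts_understood": ["song=Believe"]},
]
print(query_density(turns))       # 4 concepts / 3 turns ≈ 1.33
print(concept_efficiency(turns))  # 3 turns / 3 understood concepts = 1.0
```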

Different levels of evaluation
Technical evaluation
Usability evaluation
Customer evaluation
According to: L. Dybkjaer, N. Bernsen, W. Minker, "Overview of evaluation and usability", in: W. Minker et al., Spoken Multimodal Human-Computer Dialogue in Mobile Environments, Springer, 2005.

Technical evaluation:
Typically component evaluation (ASR, TTS, grammar, but also e.g. system robustness).
To some extent quantitative and objective.
Possibly glass-box evaluation.

Usability evaluation:
Evaluation of user satisfaction.
Typically end-to-end evaluation.
Mostly subjective and qualitative measures.

A user satisfaction questionnaire
Was the system easy to understand?
Did the system understand what you said?
Was it easy to find the information you wanted?
Was the pace of interaction with the system appropriate?
Did you know what you could say at each point in the dialogue?
How often was the system sluggish and slow to reply to you?
Did the system work the way you expected it to?
From your current experience with using the system, do you think you would use the system regularly?

PARADISE
An attempt to provide an objective, quantitative, operational basis for qualitative user assessments, in terms of a weighted linear combination of a task-based success measure and dialogue costs.
M. Walker, C. Kamm, D. Litman: "Towards developing general models of usability with PARADISE", in: Natural Language Engineering 6, 2000.

PARADISE Structure
Maximise user satisfaction: maximise task success and minimise costs (efficiency measures and qualitative measures).

Efficiency and Quality Measures
Efficiency measures: elapsed time, system turns, user turns.
Quality measures: # of timeout prompts, # of rejects, # of helps, # of cancels, # of barge-ins, mean ASR score.

User Satisfaction
Measured by adding up the scores assigned to 8 questions by the subjects.
Systems tested: e-mail message access, train information.
Training set of 500 dialogues with efficiency, quality and task success features plus the satisfaction score.

Problems
Is the user satisfaction score a good measure of user satisfaction?
The criterion for feature selection is the easy availability of features through log files; is it really the interesting features that are selected?
Does the methodology extend to more complex dialogue applications in real-world environments?
One application: reinforcement learning.
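Putting the measures above together, PARADISE scores dialogues with a weighted linear combination: roughly, performance = α · N(task success) − Σ wᵢ · N(costᵢ), where N is a normalisation (z-scores in Walker et al.) and the weights are estimated by regressing the user satisfaction scores on the normalised features. The sketch below is illustrative only; the weights and the mini-corpus are invented:

```python
from statistics import mean, stdev

def z_normalise(values):
    """z-score normalisation, so success and cost measures on different
    scales become comparable."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def paradise_performance(task_success, costs, alpha, weights):
    """Per-dialogue performance = alpha * N(task success)
       - sum_i weights[i] * N(cost_i).
    alpha and weights are invented here; PARADISE estimates them by
    regressing user satisfaction on the normalised features."""
    ns = z_normalise(task_success)
    nc = [z_normalise(c) for c in costs]
    return [alpha * ns[d] - sum(w * nc[i][d] for i, w in enumerate(weights))
            for d in range(len(task_success))]

# Invented mini-corpus of three dialogues
task_success = [0.9, 0.6, 0.8]   # e.g. kappa-style task success per dialogue
elapsed_time = [120, 240, 150]   # efficiency cost: seconds
num_rejects  = [1, 5, 2]         # quality cost: # of rejects
print(paradise_performance(task_success, [elapsed_time, num_rejects],
                           alpha=0.5, weights=[0.3, 0.2]))
```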

Practical Usability Testing
Implementing a dialogue system and testing it afterwards is usually very inefficient.
First empirical access: data collection via human-human dialogue.
Of limited usefulness, because human dialogue behaviour differs from dialogue system behaviour, and human attitudes towards humans differ from their attitudes towards machines.

Wizard-of-Oz Experiments
Experimental WoZ systems allow a dialogue design to be tested (to some extent) before it has been implemented.
The WoZ is not just a person in a box. WoZ setups should constrain the options of the wizard in an appropriate way, support real-time reaction by the wizard, constrain the wizard's access to information to bring him on a par with the artificial dialogue manager, and simulate shortcomings of speech and language front-ends, if required.

Wizard-of-Oz Experiments
Ideally, a WoZ environment or tool is set up in a modular way, allowing the functions contributed by humans to be replaced successively in the course of system implementation: a gradual transition between WoZ and a fully artificial system.
The difficulty is to find the balance between the required degree of freedom and the necessary restriction of choices.

Challenges
Flexible dialogue modelling in the ISU framework.
Adaptivity as a feature of advanced dialogue systems.
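To make the idea of a modular WoZ setup above more concrete, here is a minimal sketch; the class and function names are invented for illustration and are not taken from any actual WoZ tool. If the human wizard sits behind the same interface as the later dialogue manager, human-provided functions can be replaced one by one as implemented components become available:

```python
from abc import ABC, abstractmethod

class DialogueManager(ABC):
    """Common interface for the human wizard and the later automated
    dialogue manager."""
    @abstractmethod
    def next_action(self, user_input: str) -> str: ...

class WizardDialogueManager(DialogueManager):
    """Early WoZ phase: a (suitably constrained) human chooses the action."""
    def next_action(self, user_input: str) -> str:
        # In a real setup this would come from the wizard GUI, restricted
        # to a fixed inventory of dialogue moves.
        return input(f"[wizard] user said {user_input!r}, choose action: ")

class AutomatedDialogueManager(DialogueManager):
    """Later phase: the implemented component replaces the wizard."""
    def next_action(self, user_input: str) -> str:
        return "show_list" if "zeige" in user_input.lower() else "clarify"

def run_turn(dm: DialogueManager, user_input: str) -> str:
    # The rest of the system does not care whether dm is human or machine,
    # which is what makes the gradual transition possible.
    return dm.next_action(user_input)
```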

A shift of motivation for WoZ experiments
Originally: collection of realistic data for speech and language processing (sound files and transliterations; lexical and grammatical variation); studying user behaviour and satisfaction in spite of bottlenecks with speech and language front-ends.
Currently: a shift of focus to the exploration of the wizard's/system's behaviour.

An example: the TALK project MP3 player
Multi-modal dialogue, language German.
In-car/in-home scenario.
Saarland University, DFKI, CLT.

Goals of the WOZ MP3 Experiment
Gather pilot data on human multi-modal turn planning.
Collect wizard dialogue strategies.
Collect wizard media allocation decisions.
Collect wizard speech data.
Collect user data (speech signals and spontaneous speech).

Tasks for the Subjects
MP3 domain: in-car with the primary task Lane Change Task (LCT), and in-home domain without LCT.
Tasks for the subject: Play a song from the album "New Adventures in Hi-Fi" by REM. Find a song with "believe" in the title and play it.
Task for the wizard: Help the user reach their goals (deliberately vague!).

A Walk Through One Exchange
Wizard: "Ich zeige Ihnen die Liste an." (I am displaying the list.)
User: "Ok. Zeige mir bitte das Lied aus dem ausgewählten Album und spiel das vor." (Ok. Please show me that song ("Believe") from the selected album and play it.)

User View
Primary task: driving.
Secondary task on a second screen: MP3 player.
Video recording.

DFKI/USAAR WOZ system
System features: 14 communicating components (via OAA) distributed over 5 machines (3 Windows, 2 Linux), plus the LCT on a separate machine.
People involved in running an experiment: 5 (1 experiment leader, 1 wizard, 1 subject, 2 typists).

Data Flow
(Diagram: speech, graphics and synthesized speech passed between wizard and subject.)

A Walk Through the Final Turns
Wizard: "Ich zeige Ihnen die Liste an." (I am displaying the list.) The audio is sent to a typist; the text is sent to speech synthesis.
User: "Ok. Zeige mir bitte das Lied aus dem ausgewählten Album und spiel das vor." (Ok. Please show me that song ("Believe") from the selected album and play it.)

Example (1): Wizard
Database search; selects the album presentation (vs. songs or artists); selects the list presentation (vs. tables or a summary); says "Ich zeige Ihnen die Liste an." (I am displaying the list.) and clicks on the list presentation.
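The routing in this walkthrough (the wizard's audio goes to a typist, the typed text goes to speech synthesis, and only the synthesized voice plus the graphics reach the subject) can be sketched schematically as below; the function names are invented stand-ins for the actual OAA agents:

```python
def typist(speech: str) -> str:
    """A human typist transcribes what was said (simulated as identity)."""
    return speech

def synthesize(text: str) -> str:
    """Stand-in for the speech synthesizer (Mary in the actual setup)."""
    return "<synthesized> " + text

def wizard_turn(wizard_speech: str, screen_choice: str):
    """The wizard speaks and picks a presentation; the subject receives
    the synthesized utterance plus the graphical output."""
    text = typist(wizard_speech)      # audio is sent to the typist
    audio = synthesize(text)          # typed text is sent to speech synthesis
    return audio, screen_choice       # both are routed to the subject

def user_turn(user_speech: str) -> str:
    """The user speaks; the typed transcript appears in the wizard's
    TextBox window."""
    return typist(user_speech)

audio, screen = wizard_turn("Ich zeige Ihnen die Liste an.", "list_presentation")
wizard_textbox = user_turn("Zeige mir bitte das Lied aus dem ausgewählten Album.")
```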

Options presenter with User-Tab (screenshot).

Example (2): Wizard
The wizard's spoken utterance is typed: "I am displaying the list."

Example (3): User
Listens to the wizard's utterance, synthesized by Mary, and receives the selected list presentation.

Automatically updated wizard screen (screenshot).

Example (4): User
Selects one album and says: "Ok. Zeige mir bitte das Lied aus dem ausgewählten Album und spiel das vor." (Ok. Please show me that song ("Believe") from the selected album and play it.)

Example (5): User
The user's spoken utterance is typed: "Ok. Please show me that song ("Believe") from the selected album and play it."

Example (6): Wizard
Gets a correspondingly updated TextBox window.

Usability testing of System Prototypes
Debugging (in the system and in the dialogue model): first tests of a full prototype system always reveal drastic mistakes.
Polishing the dialogue design. Needed: experimental test runs with developers, inexperienced subjects and dialogue designers.
Dialogue design is more art than engineering.