SPEECH DATA MINING, SPEECH ANALYTICS, VOICE BIOMETRY. www.phonexia.com, 1/41



Similar documents
ARMORVOX IMPOSTORMAPS HOW TO BUILD AN EFFECTIVE VOICE BIOMETRIC SOLUTION IN THREE EASY STEPS

Speech Analytics. Whitepaper

C E D A T 8 5. Innovating services and technologies for speech content management

SIPAC. Signals and Data Identification, Processing, Analysis, and Classification

SPRACH - WP 6 & 8: Software engineering work at ICSI

Applying Deep Learning to Car Data Logging (CDL) and Driver Assessor (DA) October 22-Oct-15

1. Introduction to Spoken Dialogue Systems

Unlocking Value from. Patanjali V, Lead Data Scientist, Tiger Analytics Anand B, Director Analytics Consulting,Tiger Analytics

GRAPHICAL USER INTERFACE, ACCESS, SEARCH AND REPORTING

NICE MULTI-CHANNEL INTERACTION ANALYTICS

Call Recording and Speech Analytics Will Transform Your Business:

Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System

Establishing the Uniqueness of the Human Voice for Security Applications

Advanced analytics at your hands

presentation Our customers & Partners AE

Ericsson T18s Voice Dialing Simulator

A secure face tracking system

New Features for Remote Monitoring & Analysis using the StreamScope (RM-40)

Augmented Search for Web Applications. New frontier in big log data analysis and application intelligence

IVR CRM Integration. Migrating the Call Center from Cost Center to Profit. Definitions. Rod Arends Cheryl Yaeger BenchMark Consulting International

Reading Assistant: Technology for Guided Oral Reading

Knowledge Discovery from patents using KMX Text Analytics

In: Proceedings of RECPAD th Portuguese Conference on Pattern Recognition June 27th- 28th, 2002 Aveiro, Portugal

Integration of Voice over Internet Protocol Experiment in Computer Engineering Technology Curriculum

Deploying Cisco Unified Contact Center Express Volume 1

Automatic Speech Recognition and Hybrid Machine Translation for High-Quality Closed-Captioning and Subtitling for Video Broadcast

Performance analysis and comparison of virtualization protocols, RDP and PCoIP

Customer Interaction Analytics Speech Analytics The Next Frontier

The effect of mismatched recording conditions on human and automatic speaker recognition in forensic applications

FREE TV AUSTRALIA OPERATIONAL PRACTICE OP- 59 Measurement and Management of Loudness in Soundtracks for Television Broadcasting

Effective Java Programming. efficient software development

STINGA ISDN. Monitor & Simulator. Protocol Analyzers & Simulators for Traditional, Fixed Mobile, Converged and Next Generation Networks

Engage your customers

The Cross-Media Contact Center

Security and Risk Analysis of VoIP Networks

SoMA. Automated testing system of camera algorithms. Sofica Ltd

Voice Driven Animation System

Workforce Management IVR. A multi-service voice platform

Data Mining Solutions for the Business Environment

Testing Intelligent Device Communications in a Distributed System

What do Big Data & HAVEn mean? Robert Lejnert HP Autonomy

An Introduction to astercc Call Center ver /06/09

Agile speech analytics: a simple and effective way to use speech analytics in contact centres

Avaya Aura Orchestration Designer

How To Recognize Voice Over Ip On Pc Or Mac Or Ip On A Pc Or Ip (Ip) On A Microsoft Computer Or Ip Computer On A Mac Or Mac (Ip Or Ip) On An Ip Computer Or Mac Computer On An Mp3

Develop Software that Speaks and Listens

EAGLE EYE IP TAP. 1. Introduction

Web Analytics Understand your web visitors without web logs or page tags and keep all your data inside your firewall.

Large-Scale Test Mining

VoIP Technologies Lecturer : Dr. Ala Khalifeh Lecture 4 : Voice codecs (Cont.)

Masters in Human Computer Interaction

IVR Primer Introduction

White paper. Axis Video Analytics. Enhancing video surveillance efficiency

Information Leakage in Encrypted Network Traffic

The Big Data methodology in computer vision systems

IBM Cloud Security Draft for Discussion September 12, IBM Corporation

The LENA TM Language Environment Analysis System:

Automated quality assurance for contact centers

Auto-Classification for Document Archiving and Records Declaration

ERICSSON SOLIDUS ecare AGENT, MANAGEMENT AND SELF-SERVICE APPLICATIONS BUILT FOR BETTER BUSINESS

Azure Machine Learning, SQL Data Mining and R

Envox CDP 7.0 Performance Comparison of VoiceXML and Envox Scripts

Application Compatibility Best Practices for Remote Desktop Services

PSG College of Technology, Coimbatore Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

How To Monitor And Test An Ethernet Network On A Computer Or Network Card

Speech Signal Processing: An Overview

Web Traffic Capture Butler Street, Suite 200 Pittsburgh, PA (412)

Voice. listen, understand and respond. enherent. wish, choice, or opinion. openly or formally expressed. May Merriam Webster.

RevoScaleR Speed and Scalability

White paper. Axis Video Analytics. Enhancing video surveillance efficiency

IT services for analyses of various data samples

COPYRIGHT 2011 COPYRIGHT 2012 AXON DIGITAL DESIGN B.V. ALL RIGHTS RESERVED

Next Generation. Surveillance Solutions. Cware. The Advanced Video Management & NVR Platform

Testing IVR Systems White Paper

DATA MANAGEMENT FOR THE INTERNET OF THINGS

Global Value 7. Productivity Suite for GammaVision. Optimizing Gamma Spectrometry Processes through Secure Data Management and Measurement Automation.

Volume 3, Issue 6, June 2015 International Journal of Advance Research in Computer Science and Management Studies

A Near Real-Time Personalization for ecommerce Platform Amit Rustagi

Introduction to Data Mining

Using Voice Biometrics in the Call Center. Best Practices for Authentication and Anti-Fraud Technology Deployment

Automatic Evaluation Software for Contact Centre Agents voice Handling Performance

Masters in Advanced Computer Science

OPNET Network Simulator

Masters in Artificial Intelligence

Pattern-Aided Regression Modelling and Prediction Model Analysis

Liferay Portal Performance. Benchmark Study of Liferay Portal Enterprise Edition

Masters in Information Technology

Applications of speech-to-text in customer service. Dr. Joachim Stegmann Deutsche Telekom AG, Laboratories

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Transcription:

SPEECH DATA MINING, SPEECH ANALYTICS, VOICE BIOMETRY www.phonexia.com, 1/41

OVERVIEW How to move speech technology from research labs to the market? What are the current challenges is speech recognition research? text Phonexia introduction Technology deployment use cases Technologies and what is behind Speech core and application interfaces Grand challenges www.phonexia.com, 2/41

WHAT IS IN SPEECH? Speaker Gender, age Speaker identity Emotion, speaker origin Education, relation When speaker speaks Environment Where speakers speaks To whom speakers speaks (dialog, reading, public talk) Other sounds (music, vehicles, animals ) Content Language, dialect Keywords, phrases Speech transcription Topic Data mining Equipment Device (phone/mike/...) Transmit channels (landline/cell phone/skype) Codecs (gsm/mp3/ ) Speech quality www.phonexia.com, 3/41

PHONEXIA Goal: help clients to extract automatically maximum of valuable information from spoken speech. Based in 2006 as spin-off of Brno University of Technology Seat and main office in Brno, Czech Republic, active worldwide Customers in more than 20 countries governmental agencies, call centers, banks, telco operators, broadcast service companies ) Profitable, no external funding www.phonexia.com, 4/41

FROM RESEARCH TO MARKET text Research Technologies Scientific papers, reports, experimental code (Matlab, Python, C++, lots of glue (shell scripts), data files The goal is accuracy Stability, speed, reproducibility and documentation less important Openness The goal is stability (error handling, code verification, testing cycles at various levels) and speed Regular development cycles and planning Well defined application interfaces (API) Documentation, licensing Products Development of new applications Integration with client s technologies and systems The goal is functionality o integrated solution User interfaces www.phonexia.com, 5/41

USE CASES Call centers Banks Intelligence agencies www.phonexia.com, 6/41

CALL CENTERS Two main application areas: 1) Quality control 2) Data mining from voice traffic www.phonexia.com, 7/41

CALL CENTERS QUALITY CONTROL Supervisor is responsible for: Team leading Rating of calls Evaluation of operators Analysis of results Reporting Only 3% of calls are analyzed by listening 100% of calls are analyzed using speech technologies, new statistics lower staff costs, lower operating costs Higher satisfaction of customers www.phonexia.com, 8/41

CALL CENTERS QUALITY CONTROL II Technologies: VAD + discourse analysis To get important statistics about call progress (start time, speaker turns, speech speed, reaction times ) Diarization Separation of summed conversation to two channels Keyword/phase detector Detection of obligatory phrases, rough words, call script compliance Speech transcription + search To search for important places in calls www.phonexia.com, 9/41

DATA MINING FROM INCOMING CALLS Use cases: Prevention of call center from overloading (for example large power outage) Added value information for business (big data) Technology: Speech transcription Data mining tool Search engine www.phonexia.com, 10/41

BANKS Use cases: Banks have call centers text Quality control Data mining from incoming traffic Authentication of people using voice biometry Using key phrase (text dependent speaker identification) Authentication on background (text independent speaker identification) Identification of frauds People with fake identities calls repeatedly to request loans www.phonexia.com, 11/41

INTELIGENCE AGENCIES Huge amount of information, can not be processed manually text - public news, telecommunication networks, air communication, internet... Search for a needle in haystack Combination of all technologies - language identification, gender identification, speaker identification, diarization, keyword spotting, speech transcription - data mining tools - correlation with other metadata Operational and forensic speaker identification www.phonexia.com, 12/41

TECHNOLOGIES Voice activity detection Language identification Gender recognition Speaker identification Diarization Keyword spotting Speech transcription Dialog analysis Emotion recognition www.phonexia.com, 13/41

VOICE ACTIVITY DETECTION Higher accuracy, lower speed energy based VAD technical signal removal VAD based on f0 tracking neural network VAD Energy based VAD fast removal of low energy parts Technical signal removal and noise filtering - removal of tones, removal of flat spectra signal, removal of stationary signals, filtering of pulse noise VAD based on f0 tracking removal of other non-speech signals neural network VAD very accurate VAD based on phoneme recognition www.phonexia.com, 14/41

VAD CHALLENGES Important area of research, not fully solved, VAD is a key part of other technologies and directly affects accuracy of these technologies Music/singing detector Detectors of non-speech speaker sounds (cough, laugh) Detectors of other environment sounds (transport vehicles, animals, electric tools, door slam) Technical signal detectors VAD for high noisy speech (SNR lower than 0 db) VADs or distorted channels Non-parametric VADs Distant mike VADs www.phonexia.com, 15/41

LANGUAGE IDENTIFICATION Automatic recognition of the language spoken. 50 languages + user can add new ones themselves Can be used also as dialect recognition ivector based technology, discriminative training, < 1kB language prints Acoustic channel independent Usage: Crime is caused by small groups speaking specific languages very often Call record forwarding (to operator / other technologies / archive...) Analysis of the audio archive Insertion of advertisement to media Language verification in broadcast signal distribution x x x >> x x x x x x www.phonexia.com, 16/41

LID SYSTEM ARCHITECTURE (IVECTOR BASED SYSTEM) UBM Projection parameters Language parameters Calibration parameters feature extraction collection of UBM statistics projection to ivectors language classifier - MLR score calib. / transform language scores Prepared by Phonexia Fully trainable by client Language prints (ivectors) can be easily transferred over low capacity links www.phonexia.com, 17/41

SPEAKER RECOGNITION Several scenarios: speaker verification, speaker search, speaker spotting, link/pattern analysis Text independent or text dependent mode ivectors based technology, < 1kB voiceprints Voiceprint extraction and scoring Millions of comparisons in fraction of seconds Diarization (speaker segmentation) User-based system training, user-based calibration x x >> x x www.phonexia.com, 18/41

SYSTEM ARCHITECTURE - VOICE PRINT EXTRACTION UBM Projection parameters Projection parameters Norm. parameters feature extraction collection of UBM statistics projection to ivectors extraction of spk info. - LDA user normalization voiceprint prepared by Phonexia trainable by user ivector describes total variability inside speech record LDA removes non-speaker variability User normalization helps user to normalize to unseen channels (mean subtraction) www.phonexia.com, 19/41

SYSTEM ARCHITECTURE - VOICEPRINT COMPARISON Model parameters Calibration parameters voiceprint 1 voiceprint comparator WCCN+PLDA length dep. calibration piecewise LR score transform Logistic func. score trainable by user voiceprint 2 Voiceprint comparer returns log likelihood Calibration ensures probabilistic interpretation of the score under different speech lengths Score transform enables to selects log likelihood ratio or percentage score www.phonexia.com, 20/41

LID AND SID CHALLENGES LID/SID on very short records (< 3s) while keeping training at user side How to ensure accuracy over large number of acoustic channels and languages (SID) Graphical tools for system training/calibration and evaluation at user side LID/SID on Voice over IP networks LID/SID form distant mikes www.phonexia.com, 21/41

SPEAKER DIARIZATION random alignment of frames to speakers collection of GMM stats for each speaker estimation of spk. factors for each speaker conversion of spk. factors to spk. GMM, new align. generating of speaker labels spk1, spk2, spk1 Fully Bayesian approach with eigenvoice priors (Valente, Kenny) Initial number of speakers is higher than expected number of speakers A duration model is used to prevent fast jumps among speakers Target number of speaker can be chosen based on minimal speaker posterior probability, or pre-set by user Very accurate but slower and more memory consuming www.phonexia.com, 22/41

DIARIZATION CHALLENGES Diarization is a technology that still needs a lot of research Very sensitive to initialization Very sensitive to non-speech sounds (laugh, cough, environment sounds), integration of a good VAD is necessary Very sensitive to changes in transmit channels and language Even with DER close to 1% there are recordings where current algorithms fail completely (often two women speaking with high pitch) Uses iterative approach, one iteration is equivalent to one run of SID, users expect much faster run than SID Diarization from distant mikes Beam-forming and diarization from microphone arrays How to accurately estimate number of speakers www.phonexia.com, 23/41

KEYWORD SPOTTING Two approaches: KWS based on LVCSR Very accurate Slower Expensive for development Acoustic KWS Fast Less accurate Cheap development Speech transcription based on state-of-the-art acoustic and language models Posteriors from confusion network used as confidences About 100h of training data Simple neural network based acoustic model Simple language model (phone loop as background) About 20h of training data Phoneme-based calibration www.phonexia.com, 24/41

SPEECH TRANSCRIPTION PLP + bottle-neck features, HLDA fast VTLN estimated using a set of GMMs GMM or NN based system Discriminative training Speaker adaptation 3-gram language model strings / lattices / confusion networks www.phonexia.com, 25/41

SPEECH TRANSCRIPTION CHALLENGES Rather engineering then research challenges: Accuracy Speed Lower memory consumption How to train new system fully automatically How to run hundreds of recognizers in parallel How to do channel normalization and speaker adaptation for any length of speech utterance www.phonexia.com, 26/41

CONNECTION TO TEXT BASED DATA MINING TOOLS It is much easier to sell speech transcription with a higher-level data mining tool There is too much text to read The text has to many errors (users will never be happy unless the text is 100% correct) This can be overcame by integration with existing text-based datamining tools: Categorization of recordings Indexing and search using complex queries Exploration of new topics Content analysis Reaction on trends Integration is done on confusion networks (alternative hypothesis, increased probability to find specific information) www.phonexia.com, 27/41

TOVEK TOOLS CONTENT ANALYSIS www.phonexia.com, 28/41

HOW TO MOVE OUR SOFTWARE TO USERS? Decision to write new speech core in 2007 Brno Speech Core Focus on stability, speed and proper error handling Object oriented design, proper interfaces, no dependency among modules except through regular interfaces One C++ compiler, binary compatibility of libraries with others One code base for all technologies www.phonexia.com, 29/41

BRNO SPEECH CORE More than 250 objects covering large range of speech algorithms (feature extraction, acoustic models, decoders, transforms, grammar compilers, ) More than million of source code lines Code versioning, automatic builds, test suits, licensing Still easy maintainable extension of functionality inside objects splitting of functionality to more objects replacement of objects (fixed interfaces) www.phonexia.com, 30/41

FAST PROTOTYPING Research to product transfer time is essential for commercial success Research done using standard toolkits STK, TNET, KALDI, Python scripts, text For production systems: New system can be implemented in few days Often no single line of C/C++ code is written Only objects for new algorithms are implemented Objects are connected through one configuration file to form data streams www.phonexia.com, 31/41

BSCORE CONFIG [source:sfilewaveformsourcei] [fconvertor:swaveformformatconvertori] input_format_str=lin16 output_format_str=float nchannels=1 [melbanks:smelbanksi] sample_freq=8000 vector_size=200 preem_coef=0.97 nbanks=15 [posteriors:snnetposteriorestimatori] [decoder:sphndecoderi] [output:stranscriptionnodei]... [links] source->fconvertor fconvertor->melbanks melbanks->posteriors posteriors->decoder decoder->output www.phonexia.com, 32/41

APPLICATION INTERFACES Customers are used to work with specific programming tools and do not want to change their habits GUI/command line/sdks C/C++ API binary compatibility with many compilers Java API a middle layer created using Java Native Interface (JNI) C# API automatically generated using SWIG MRCPv2 network interface for integration to telephone infrastructure (IVRs) - through UniMRCP open source project REST server platform, simple network interface for each technology Supported OS Windows/Linux, 32/64 bits, Android www.phonexia.com, 33/41

USUAL DESIGN OF SYSTEMS WITH WEB BASED GUI net Data storage Speech server Application server Web server Speech Server (REST), application server, web serer, database Speech technology inside application server (TomCat, JBoss) through Java API www.phonexia.com, 34/41

CLIENT FOR REST SERVER text www.phonexia.com, 35/41

GRAND CHALLENGES Training data collection Guarantee of accuracy Reduction of hardware cost www.phonexia.com, 36/41

TRAINING DATA COLLECTION We would like to offer cheap speech technology to anyone on this planet. Hundreds of languages and thousands of dialects The data is costly - about 30 000 EUR for existing language, more than 100 000 EUR for collection and annotation of new corpora The existing data often do not match the target dialect and acoustic channels We can add only few languages to our offer per year Can we find a smarter and cheaper way to collect data? www.phonexia.com, 37/41

DATA COLLECTION PROJECT 1. Collection of data from public sources (broadcast) 2. Automatic detection of phone calls within broadcast (high variability in speakers, dialects and speaking style) 3. Language identification to verify language, speaker identification to ensure speaker variability 4. Annotation through crowd sourcing platform 5. Fully automatic process including training of ASR, unsupervised adaptation Experience from data collection for language identification (LDC and NIST adopted this process, now mainstream in Language ID) SpokenData.com ready for online annotation (ReplayWell company) Cost of collection of 100h of data, annotation and system training could be reduced bellow 15 000 EUR per language Interested? Please write to info@phonexia.com www.phonexia.com, 38/41

GUARANTEE OF ACCURACY More and more customers buy speech solutions. But each new installation brings new risks. Speaker identification is not fully language independent Language identification is not dialect independent Speech transcription is not domain independent All technologies are not channel independent How to estimate in advance if the installation is going to be successful? What will be the target accuracy of each technology? Data collection project can help Can be extended to World Map of Spoken Languages www.phonexia.com, 39/41

HARDWARE COST Speech solution is not only software, but also computational hardware, data storage, physical planning of the HW etc. Computers and cooling consume electricity The additional cost can be about 50% of total project cost Most research is directed to reach maximal accuracies Any improvement in speed can have large effect on the success of your technology Perfect optimization of software Use of HW acceleration (GPU cards etc.) www.phonexia.com, 40/41

Q & A THANKS! Phonexia s.r.o. info@phonexia.com www.phonexia.com, 41/41