Applications of Portuguese Speech and Language Technologies - Propor 2008 Special Session. Promoted by: Microsoft Language Development Center



Similar documents
Evaluation of a Segmental Durations Model for TTS

Thirukkural - A Text-to-Speech Synthesis System

T 2 O Recycling Thesauri into a Multilingual Ontology

An Open Source HMM-based Text-to-Speech System for Brazilian Portuguese

Victims Compensation Claim Status of All Pending Claims and Claims Decided Within the Last Three Years

THE ISSUE OF PERMANENCE AT AN ONLINE CLASS: an analysis based upon Moodle platform s accesses

Guedes de Oliveira, Pedro Miguel

An Approach to Natural Language Equation Reading in Digital Talking Books

José M. F. Moura, Director of ICTI at Carnegie Mellon Carnegie Mellon Victor Barroso, Director of ICTI in Portugal

Bigorna a toolkit for orthography migration challenges

Índice. Introduction. First Steps. Access Channels. Products and Services

URL encoding uses hex code prefixed by %. Quoted Printable encoding uses hex code prefixed by =.

Portuguese Research Institutions in History

Points of Interference in Learning English as a Second Language

Text-To-Speech Technologies for Mobile Telephony Services

SPEECH SYNTHESIZER BASED ON THE PROJECT MBROLA

From Portuguese to Mirandese: Fast Porting of a Letter-to-Sound Module Using FSTs

An open-source rule-based syllabification tool for Brazilian Portuguese

CGI::Auto Automatic Web-Service Creation

2. Affiliation. University type. University. Department. University site. Position. Address. Telephone. . Fax

How to work with a video clip in EJA English classes?

MULTIDIMENSÃO E TERRITÓRIOS DE RISCO

TEXT TO SPEECH SYSTEM FOR KONKANI ( GOAN ) LANGUAGE

How To Study Prosody And Intonation In Portuguese

DIOGO SOUSA BATISTA. Sousa Batista & Associados Sociedade de Advogados, R.L.

Acquiring, Elaborating and Expressing Knowledge: A Study with Portuguese University Students 1

An Arabic Text-To-Speech System Based on Artificial Neural Networks

TTP User Guide. MLLP Research Group. Wednesday 2 nd September, 2015

VERSION 1.1 SEPTEMBER 14, 2014 IGELU 2014: USE.PT UPDATE REPORT NATIONAL/REGIONAL USER GROUP REPRESENTATIVES MEETING PRESENTED BY: PAULO LOPES

Web Based Maltese Language Text to Speech Synthesiser

Open-Source, Cross-Platform Java Tools Working Together on a Dialogue System

training programme in pharmaceutical medicine Clinical Data Management and Analysis

How To Write A Tree Processor In A Tree Tree In A Program In A Language That Is Not A Tree Or Directory Tree (Perl)

Vitor Emanuel de Matos Loureiro da Silva Pereira

The ASCII Character Set

Portugal. Sub-Programme / Key Activity. HOSPITAL GARCIA DE ORTA PAR 2007 ERASMUS Erasmus Network European Pathology Assessment & Learning System

Chapter 12 Programming Concepts and Languages

A FORMULATION FOR HYDRAULIC FRACTURE PROPAGATION IN A POROELASTIC MEDIUM

Corpus Design for a Unit Selection Database

Process Simulation Support in BPM Tools: The Case of BPMN

Longman English Interactive

Representação de Caracteres

In: Proceedings of RECPAD th Portuguese Conference on Pattern Recognition June 27th- 28th, 2002 Aveiro, Portugal

EasyVoice: Breaking barriers for people with voice disabilities

Strategic Planning in Universities from Pará, Brazil. Contributions to the Achievement of Institutional Objectives

SONAE PREPARING FUTURE GROWTH

Copy in your notebook: Add an example of each term with the symbols used in algebra 2 if there are any.

A Movement Tracking Management Model with Kalman Filtering Global Optimization Techniques and Mahalanobis Distance

PRINCIPLES AND PRACTICES OF MEDICAL DEVICE DEVELOPMENT

Lexical analysis FORMAL LANGUAGES AND COMPILERS. Floriano Scioscia. Formal Languages and Compilers A.Y. 2015/2016

MASTER IN HISTORY OF THE PORTUGUESE EMPIRE MESTRADO EM HISTÓRIA DO IMPÉRIO PORTUGUÊS (e-learning) 2013/2014

Construction Economics and Management. Department of Surveying, University of Salford, England, UK.

Luís Paulo Peixoto dos Santos. Curriculum Vitae

Available online at ScienceDirect

Streaming Lossless Data Compression Algorithm (SLDC)

António José Vasconcelos Franco Gomes de Menezes

CLASS TEST GRADE 11. PHYSICAL SCIENCES: CHEMISTRY Test 6: Chemical change

Europass Curriculum Vitae

Europass Curriculum Vitae

Memory is implemented as an array of electronic switches

Oncology Meetings: Gastric Cancer State of Art March 27 and 28th, 2014

Master of Arts in Linguistics Syllabus

Data Analysis 1. SET08104 Database Systems. Napier University

Chem 115 POGIL Worksheet - Week 4 Moles & Stoichiometry

The Musibraille Project Enabling the Inclusion of Blind Students in Music Courses

SIGNAL PROCESSING & SIMULATION NEWSLETTER

Carlos Fernando Loureiro Martins. Rua Sacra Família, 7º Drt. Trás Póvoa de Varzim Portugal.

Curriculum Vitae. January, 2005

Over the years the Catholic University of Portugal has established its position as one of the leading academic communities in the country through its

Sample and Data Collection

Base Conversion written by Cathy Saxton

XML::DT - a Perl down translation module

Crosswalk Directions:

Common Core Unit Summary Grades 6 to 8

Linguística Aplicada das Profissões

EDP ENERGIAS DE PORTUGAL, S.A. General Shareholders Meeting PROPOSAL OF RESOLUTION

WHY TRANSLATOR-FRIENDLY?

Unit 2.1. Data Analysis 1 - V Data Analysis 1. Dr Gordon Russell, Napier University

Algebra 1 Course Information

Montessori Academy of Owasso

Efficient diphone database creation for MBROLA, a multilingual speech synthesiser

Post-graduation course in Administrative Disputes from the Faculty of Law of the Catholic University of Portugal in 2005.

How To Build A Portuguese Web Search Engine

Wednesday 4 th November Y1/2 Parent Workshop Phonics & Reading

Ana Luísa Nunes Raposo

Measurement of Power in single and 3-Phase Circuits. by : N.K.Bhati

to Voice. Tips for to Broadcast

Telecommunication (120 ЕCTS)

Analysis of prosodic features: towards modelling of emotional and pragmatic attributes of speech

Interpreting areading Scaled Scores for Instruction

Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg

Keyboards for inputting Japanese language -A study based on US patents

Word Completion and Prediction in Hebrew

B I N G O B I N G O. Hf Cd Na Nb Lr. I Fl Fr Mo Si. Ho Bi Ce Eu Ac. Md Co P Pa Tc. Uut Rh K N. Sb At Md H. Bh Cm H Bi Es. Mo Uus Lu P F.

General Information Package. Internacional Applied Military Psychology Symposium May 2015

PREPLANNING METHOD FOR MULTI-STORY BUILDING CONSTRUCTION USING LINE OF BALANCE

Workshop Perceptual Effects of Filtering and Masking Introduction to Filtering and Masking

Software Design Document (SDD) Template

Xi2000 Series Configuration Guide

Seminar on. Publishing Doctoral Research

Transcription:

1 Applications of Portuguese Speech and Language Technologies - Propor 2008 Special Session Hosted by: Universidade de Aveiro Promoted by: Microsoft Language Development Center

2 Propor 2008 Special Session Commitee Special Session Chair António Teixeira - DETI/IEETA, Universidade de Aveiro, Portugal Organising Committee Daniela Braga, Microsoft Language Development Center, Portugal Miguel Sales Dias, Microsoft Language Development Center, Portugal António Teixeira - DETI/IEETA, Universidade de Aveiro, Portugal Programme Committee António Teixeira - DETI/IEETA, Universidade de Aveiro, Portugal Daniela Braga, Microsoft Language Development Center, Portugal Vera Strube de Lima, Pontifícia Universidade Católica do Rio Grande do Sul, Brasil Luís Caldas de Oliveira, INESC-ID/IST, Portugal Editorial Board Daniela Braga, Microsoft Language Development Center, Portugal Miguel Sales Dias, Microsoft Language Development Center, Portugal Luanda Braga Batista, Microsoft Language Development Center, Portugal

35 A Textual Rewriting system for NLP José João Dias de Almeida, Alberto Simões Universidade Do Minho Largo do Paço 4704-553 Braga-Portugal {jj, ambs}@di.uminho.pt Abstract In this document we describe the use of Text::RewriteRules in several tasks related to Speech. 1.Introduction In this document we will present a textual rewriting system tool - Text::RewriteRules[3] - and discuss its use for several tasks related to Natural Language Processing specially for tasks related to TTS. Some of the examples presented are included in the tool Lingua::PT::Speaker[4,5] - a Perl module that implements a TTS for Mbrola[1]. During the presentation, a demonstration of Portuguese synthesis with Lingua::PT::Speaker will be done. Both Text::RewriteRules and Lingua::PT::Speaker are available from CPAN[9]. In order to completely understand the examples, some knowledge of Perl programming language is needed. In order to understand the rules, knowledge of regular expression is enough (see also ). Also, some knowledge of SAMPA[2] phonetic alphabet can help understanding rules. We will start with a description of the rewriting system (section). After that, we will take a rewriting system by example approach (sections and ). 2.Short presentation of Text::RewriteRules Text::RewriteRules is a Perl embedded Domain Specific Language that helps writing rules to process text. Basically, Text::RewriteRules is used to include rule-sets in a Perl program: each set of rules generates an independent Perl function; each rewriting step is a substitution: either a Perl regular expression substitution; or a substitution by the result of an expression. 2.1 "Hello world"-like example The following example performs the expansion of some informal English constructions: use Text::RewriteRules; RULES formalizer don't==>do not doesn't==>does not while(<>) { print formalizer($_) } 2.2 Rule types Text::RewriteRules support several types of rules. The most relevant are: simple string substitution, regexp ==> substitution string

36 conditional substitutions, regexp ==> substitution string!! condition rules with code evaluation (with optional conditions), regexp =e=> Perl expression regexp =e=> Perl expression!! condition initialization rules (begin rule), =b=> Perl initialization statement 2.3 Basic rewriting algorithms There are two kind of rule-set: Point fix rewriting system (RULES f... ), that performs all possible substitutions until no rule can be applied; Sliding rewriting system (RULES/m f... ) that performs substitutions at a cursor position, that slides from the beginning to the end of the text. 3.Text::RewriteRules by example 3.1 PT Syllable In the following example we define three rewriting systems: divide - separate words in syllable (regato - re ga to) syllableaccent - marks the tonic syllable (re ga to - re "ga to) vowelaccent - marks the tonic vowel (re "ga to - re ga: to) $v = qr{[aeiouáéíóúàãõâêôäëïöüyw]}i; ## vowels $c = qr{[bcçdfghjklmñnpqrstvwyxz]}i; ## consonant $f = qr{(?:[çdfgjkqtv] ss rr [nlcs]h bs cc)}i; ## strong consonant $ac = qr{[áéíóúãõâêô]}i; ## accents @br = qw{sl sm sn sc rn bc lr bc bd bj bp pt pc dj pç zm tp}; while(<>){ s/(\w+)/vowelaccent(syllableaccent(divide($1)))/ge; } RULES/i divide ($v)($c)($c)($v)==>$1$2 $3$4!! $br{"$2$3"} ($v)($c)($c)([lr])==>$1$2 $3$4!! $br{"$2$3"} ($v)($f)(?![ ])==>$1 $2 (.[bclnprsx])($f)(?![ ])==>$1 $2 ($v)([bclmnprsxzh]$v)==>$1 $2 (\w)([lmnrsx])([bclmnprszx])($v)==>$1$2 $3$4!!"$2$3" ne "ss" && "$2$3" ne rr ($v [lmnrsx])([bcp][lr]$v)==>$1 $2 ($v)($c)($c)($v)==>$1 $2$3$4 ([a])(i[ru])==>$1 $2 ## quebra de ditongos / tritongos ([ioeê])([aoe])==>$1 $2 u(ai ou)==>u $1 ([^qg]u)(ei iu ir $ac)==>$1 $2 ([aeio])($ac)==>$1 $2 ([íúô])($v)==>$1 $2

37 RULES/i syllableaccent ## mark the tonic syllable "=last=> ## leave if we have a tonic syllable (\w*$ac)==>"$1 ## mark a syllable when it has accent character (\w*([zlr] [iu]s?))$==>"$1 ## mark last syllable if... (\w+\ \w+)$==>"$1 (^\w+$)==>"$1 RULES/i vowelaccent :($ac)==>$1: "(\w*?($v [yw]))==>$1: ([gq]u):($v [yw])==>$1$2: "==> :=last=> ## mark the tonic vowel in the tonic syllable 3.2 Numbers to words The following rules convert numbers to words. %fixnum=( 0=> "zero", 1=> "um", 2=> "dois", 3=> "três",... 10=> "dez", 11=> "onze", 12=> "doze",... 20=> "vinte", 30=> "trinta", 40=> "quarenta",... 100=> "cem", 200=> "duzentos", 300=> "trezentos",... 1000=> "mil", 1000000=> "um milhão"); RULES num2words (\d+)[ee](-?\d+)==>$1 vezes 10 levantado a $2 ## 12E-24 scientific -(\d+)==>menos $1 ## -34 negatives (\d+)\s*\%==>$1 por cento ## 12% percentual (\d+)\.(\d{1,3})\b==>$1 ponto $2 ## 12.21 decimals (\d+)\.(\d+)==>$1 ponto digits$2. ## 12.00324 digits(\d+)\.=e=>join(" ",split(//,$1)) ## 12.00324 \b(\d+)\b==>$fixnum{$1}!!defined $fixnum{$1} ## base (0 11 100 1000) (\d+)(000000)\b==>$1 milhões ## 7000000 (\d+)(000)(\d{3})==>$1 milhão e $3!! $1 == 1 ## 1000123 (\d+)(\d{3})(000)==>$1 milhão e $2 mil!! $1 == 1 ## 1123000 (\d+)(\d{6})==>$1 milhão, $2!! $1 == 1 ## 1123123 (\d+)(000)(\d{3})==>$1 milhões e $3 ## 7000123 (\d+)(\d{3})(000)==>$1 milhões e $2 mil ## 7123000 (\d+)(\d{6})==>$1 milhões, $2 ## 7123123 (\d+)(000)\b==>$1 mil (\d+)0(\d{2})==>mil e $2!! $1 == 1 (\d+)(\d00)==>mil e $2!! $1 == 1 (\d+)(\d{3})==>mil $2!! $1 == 1 (\d+)0(\d{2})==>$1 mil e $2 (\d+)(\d00)==>$1 mil e $2 (\d+)(\d{3})==>$1 mil, $2 1(\d\d)==>cento e $1 0(\d\d)==>$1 (\d)(\d\d)==>${1}00 e $2 0(\d)==>$1 (\d)(\d)==>${1}0 e $2 0$==>zero

38 0==> ### num2words("123.12"); # returns "cento e vinte e três ponto doze" Each set of rules generate a Perl function. As standard functions, these can be composed. For example, in order to read a telephone number we can split the number and convert it to words: RULES telefone (\d{3})(\d{3})(\d{3})=e=>num2words("$1 $2 $3") 4.Lingua::PT::Speaker: a TTS with Text::RewriteRules A simple TTS was built by composition of several textual rewriting systems[5]. The following diagram 1 1 describes the rewriting systems pipeline used to transform a text into a Mbrola text: In [8] we can find several detailed algoritms regarding similar problems. 1 This is in fact a simplified version - some blocks like the nonacentuatedwords dictionary and the exception dictionary are not presented in order to make the text less complex.

39 4.1 NonWords to words In order to turn a real text into speech there is a large variety of situations to take care. For instance, we need to know how to read: numerals (see sec 3.2), ordinals, acronyms, emails, urls, math notation, etc. 4.1.1 Acronyms %letra=( a => "á", b => "bê", c =>"cê", h => "agá", k => "kapa"...); RULES/m sigla ([a-z])=e=> $letra{$1} 4.1.2 Reading emails and URL The following set of rules is used to read URL and Emails: RULES/m email \.==> ponto \@==> arroba :\/\/==> doispontos barra barra :==> doispontos (net)\b==> néte (www)\b==> dablidablidabliw (http)\b==> gátêtêpê (com)\b==> cóme (org)\b==> órg ([a-za-z]{1,3}?)\b=e=> sigla($1) (.+?)\b==>$1 ### jj@di.uminho.pt jota jota arroba dê i ponto uminho ponto pê tê 4.1.3 Reading Math expressions In the following example we present some rules used in the task of translating math to text. This example is not complete. RULES/m math \(\s*(\d+)\s*\)==> $1 \(\s*(\w)\s*\)==> $letra{$1} \(==> abre, \)==>, fecha ([a-z])(?=\(\s*(\w+)\s*\))==> $letra{$1} de ([a-z])(?=\(\s*\w+\s*(,\s*\w+\s*)*\))==> $letra{$1} de, (\d+)(?=\s*\()==> $1 vezes (\d+)\/(\d+)==> $1 sobre $2 (\w+)\/(\d+)==> $letra{$1} sobre $2 (\d+)\/(\w+)==> $1 sobre $letra{$2} (\w+)\/(\w+)==> $letra{$1} sobre $letra{$2} ([a-z])\s+2\b==> $letra{$1} ao quadrado ([a-z])\s+3\b==> $letra{$1} ao cubo ([a-z])\s+(\d)\b==> $letra{$1} à $2ª \^\s*2\b==> $letra{$1} ao quadrado

40 \^\s*3\b==> $letra{$1} ao cubo \^\s*(\d)\b==> $letra{$1} à $2ª \^\s*(\d+)\b==> $letra{$1} elevado a $2 (\d+)\s+2\b==> $1 ao quadrado (\d+)\s+3\b==> $1 ao cubo (\d+)\s+(\d)\b==> $1 à $2ª sqrt\b==> reiís de ([^\w\s]+)==> $mat{$1}!! defined $mat{$1} ([a-z])\b==> $letra{$1} log\b==>logaritmo de exp\b==>exponencial de cos\b==>cosseno de s[ie]n\b==>seno de mod\b==>módulo de rand\d==>randome de ([a-za-z]+)==>$1 ### math("f(x)=4x 2 + exp(x)") returns ### éfe de xis igual a quatro vezes xis ao quadrado, mais exponencial de xis 4.2 Words to SAMPA rr==>r ^r==>r ([nls])r==>$1r Some trivial rules to convert from our alphabet to SAMPA: ## r ass==>6ss ## s s$==>s ($vogal)s($vogal)==>$1z$2 e$==>@ ^h==> ## e ## h 4.3 Adjacent words The following set of rules is used to deal with adjacent words interference. The frontiers between words is represented by '/' character. (e a)/\1==>/$1 ## à/água -> /água 6/6(?!~)==>/a ## port6/6bert6 -> port/abert6 6/a==>/a ## 6/agu6 -> /agu6 S/([a\@eA6iouOE])==>z/$1 ## 6S/agu6S -> 6z/agu6S \@/([\@eaui6])==>/$1 ## @St@/ursu -> @St/ursu

41 4.4 Prosody We will not present the prosody rewriting system. In the tool Lingua::PT::Speaker the strategy used is to add simple prosodic markers that indicates pauses, duration and frequency variations. The markers are replaced by specific Mbrola syntax in the MBrolaGenerator rewriting system. 4.5 Local Accent The following example makes a transformation of phonemes in order to naively simulate the voice traditionally associated with Viseu[6]. These rules are presented in order to show that is possible and easy to model and discuss certain phonetic phenomena. RULES/m viseu v==>b s==>s z==>z S==>Z ### viv6 6 sidad@ d@ vizeu -> bib6 6 Sidad@ d@ BiZeu 5.Conclusion We believe that Text::RewriteRules transfers algorithmic complexity to a set of rules, that are simpler to read and constitute a way for discussion and contribution of improvements. Several other writing systems have been tested and are currently in use. References 1. The MBROLA Project: Towards a Freely Available Multilingual Speech Synthesizer. http://tcts.fpms.ac.be/synthesis/mbrola.html. 2. SAMPA: computer readable phonetic alphabet. http://www.phon.ucl.ac.uk/home/sampa/home.htm. 3. J.J. Almeida and Alberto Simoes. Text::RewriteRules - a system to rewrite text using regexp-based rules. (Perl module) Text::RewriteRules, CPAN - Comprehensive Perl Archive Network, 2007. 4. J.J. Almeida and Alberto Simoes. Lingua::PT::Speaker - perl extension for text to speech of portuguese. (Perl module) Lingua::PT::Speaker, CPAN - Comprehensive Perl Archive Network, 2008. 5. J.J. Almeida and A. M. Simões. Text to speech - a rewriting system approach. Procesamiento del Lenguaje Natural, 27:247-255, Sep. 2001. 6. J.J. Almeida and Alberto Manuel Simões. Geração de voz com sotaque. In Actas do XVIII Encontro da Associação Portuguesa de Linguística, Porto 2002, 2003. 7. Alan W Black and Kevin A. Lenzo. Building synthetic voices. Technical report, Language Technologies Institute, Carnegie Mellon University, 2007. http://www.festvox.org/festvox/festvox_toc.html. 8. Daniela Braga. Algoritmos de PLN para conversão texto-fala em Português. Tese de doutoramento, Universidade da Corunha, 2008. 9. Andreas Koenig. CPAN - Comprehensive Perl Archive Network, 1995. http://www.perl.org.

42 A Regular expressions \s space (white space, tab) [ \t\n] \w letters, digits _ [a-za-z0-9_áéíóúç...] \d digit [0-9] \b word boundary ^ at the beggining of _$ _ at the end of _* _+ _{8} _{2,8} _? (?=_) (?!_) 0 or more repetitions of _ 1 or more repetitions of _ 8 repetitions of _ between 2 and 8 repetitions of _ 0 or 1 repetitions of positive lookahead _ negative lookahead