

ISSN: 2321-7782 (Online) Volume 2, Issue 2, February 2014
International Journal of Advance Research in Computer Science and Management Studies
Research Article. Available online at: www.ijarcsms.com

Speech to Text Technology: A Phonemic Approach for Devanagari Script

Nanekar Priyanka S, Kohli Urvashi, Bhosale Komal, Prof. Yeolekar Rishikesh

Abstract: This paper presents a brief study of Automatic Speech Recognition (ASR) and discusses a phonemic approach for the Devnagari script. Despite years of research and impressive progress in accuracy, ASR remains one of the most important research challenges and calls for further work. The work presented here is inspired by the Devnagari script, which arranges its 46 phonemes according to the process of their generation. Devnagari is based on phonetic principles that consider the articulation of the consonants and vowels its characters represent. The basic unit of any language is the phoneme; Devnagari's phonemes are divided into two types, vowel phonemes (swara varna) and consonant phonemes (vyanjan varna). This paper discusses the fundamentals of speech recognition, the various kinds of speech, and further aspects relating to the Devnagari script.

Keywords: Automatic Speech Recognition (ASR), Devnagari Script, Phonemes, Swara Varna, Vyanjan Varna.

I. INTRODUCTION

Speech recognition is the translation of spoken words into text: ASR is the process in which an acoustic speech signal is mapped to text. The process is highly complex, since the spoken words must be matched against stored sound samples and analysed further, because not every incoming sound matches a pre-existing sample; determining an unrecognized piece of sound requires considerable computing power. The recognition program is designed to enable the computer to recognize the input and produce text, guided by a large set of parameters and a grammar that models human speech.
The process begins when a speaker (user) utters a sentence. The software produces a speech waveform, which captures the words of the sentence along with the background noise and pauses in the spoken input, and then attempts to decode this waveform into the best estimate of the sentence. Three features are therefore essential for an ASR system: a large vocabulary, continuous-speech capability, and speaker independence.

Phonemes are the basic units of speech in any language. Each language has its own set of phonemes, typically numbering between 30 and 50; the Devnagari script has a set of 46. Speech is produced as different phonemes are articulated. Although speech varies from person to person, the way a given phoneme is articulated is essentially the same, so the extracted signal is similar across speakers because the vibrations produced for the same phoneme must be similar. In Devnagari, every letter and its pronunciation are unique: no letter's sound can be represented by a combination of other letters, which gives every word a unique representation. This property is absent in non-phonetic languages such as English, where one pronunciation can be written in more than one way; for example, "there" and "their" are both pronounced the same.

© 2014, IJARCSMS. All Rights Reserved.
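The one-to-one letter-to-sound property claimed above can be illustrated with a toy pronunciation table. This is a minimal sketch: the lexicons and ARPAbet-style phoneme strings below are illustrative assumptions, not a real pronunciation dictionary.

```python
# Toy pronunciation lookup illustrating the paper's point: in English,
# distinct spellings can share one pronunciation, so mapping sound back
# to text is ambiguous; in a phonemic script each letter has one sound.
english_lexicon = {
    "there": "DH EH R",   # illustrative ARPAbet-style phoneme strings
    "their": "DH EH R",
    "cat":   "K AE T",
}

def texts_for_sound(phonemes):
    """Return every spelling that produces the given phoneme string."""
    return sorted(w for w, p in english_lexicon.items() if p == phonemes)

# Two different spellings map to the same sound: the inverse is ambiguous.
print(texts_for_sound("DH EH R"))  # ['their', 'there']

# A phonemic mapping (here, romanized Devnagari consonants) is one-to-one,
# so the inverse mapping is unambiguous.
devnagari_lexicon = {"क": "ka", "ख": "kha", "ग": "ga"}
sounds = list(devnagari_lexicon.values())
assert len(sounds) == len(set(sounds))  # every letter has a unique sound
```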

II. SYSTEM WORKING

Fig. 1 Depiction of how the system works: voice input → speech engine → text output

The basic flow of the architecture is as follows:

A. VOICE INPUT: The voice is taken as input with the help of a microphone or headset. Whatever the user speaks must be captured and correctly identified.

B. CONVERSION: The sound card captures the sound waves received through the microphone and produces an equivalent digital representation of the audio.

C. DIGITIZATION: The process of converting the analog signal into digital form is known as digitization. It involves both sampling and quantization: sampling converts a continuous signal into a discrete one, while quantization approximates a continuous range of values with a finite set of levels.

D. SPEECH ENGINE (ASR): The job of the speech recognition engine is to convert the input audio into text. To accomplish this it uses data, software algorithms, and statistics. Once the audio signal is in the proper format (after digitization), the engine searches for the best match among the words it knows; once the signal is recognized, it returns the corresponding text string.

CLASSIFICATION OF SPEECH: Speech recognition systems fall into the following categories based on their ability to recognize words. The categories reflect the different kinds of issues an ASR faces while determining the sentence the speaker speaks.

ISOLATED SPEECH: involves a pause between two utterances. The recognizer requires a single utterance at a time, although an utterance may contain more than one word. Often these systems require the speaker to wait between utterances (processing usually happens during the pauses).

CONNECTED SPEECH: similar to isolated speech, but separate utterances are allowed with minimal pause between them. An utterance is a spoken word, statement, or vocal sound.
An utterance can be a single word, a collection of a few words, a single sentence, or even multiple sentences.

CONTINUOUS SPEECH: allows the user to speak almost naturally. This is more difficult than the other two categories because the utterance boundaries must be determined by the system.
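The digitization step (C) above can be sketched numerically: sample a continuous tone at discrete time points, then quantize each sample to a finite set of integer levels. The sample rate and bit depth below are common choices assumed for illustration, not values from the paper.

```python
import math

# Digitization sketch: sample a continuous 440 Hz tone at 8 kHz, then
# quantize each sample to 8-bit signed integers (256 discrete levels).
SAMPLE_RATE = 8000   # samples per second (assumed typical telephony rate)
BITS = 8             # bit depth (assumed)

def sample(duration_s):
    """Sampling: evaluate the continuous signal at discrete time points."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(n)]

def quantize(samples):
    """Quantization: map each real-valued sample to one of 2**BITS levels."""
    scale = 2 ** (BITS - 1) - 1          # 127 for 8-bit signed
    return [round(s * scale) for s in samples]

digital = quantize(sample(0.01))         # 10 ms of audio -> 80 integers
assert len(digital) == 80
assert all(-127 <= v <= 127 for v in digital)
```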

SPONTANEOUS SPEECH: natural-sounding and unrehearsed. An ASR system with spontaneous-speech ability should be able to handle a variety of natural speech features, such as words being run together.

APPROACHES TO SPEECH RECOGNITION

Recognition tasks are commonly grouped by vocabulary size:

Small vocabulary: 1 to 100 words
Medium vocabulary: 101 to 1,000 words
Large vocabulary: 1,001 to 10,000 words
Very large vocabulary (out of vocabulary): more than 10,000 words

A. The acoustic-phonetic approach: Acoustic-phonetic theory proposes that the finite, smallest unit of speech in any spoken language is the phoneme, and that all phonetic units (phonemes) manifest in the speech signal.

B. The pattern recognition approach: Speech patterns are matched directly, without the explicit determination or segmentation required by the phonetic approach. It involves two steps: training of speech patterns, and recognition of patterns.

MATCHING TECHNIQUES

Speech-recognition engines use one of the following techniques to match a word.

A. WHOLE-WORD MATCHING: The analog speech signal is converted into digital form, and the engine maps this digital signal against a prerecorded template of the word. This technique takes much less processing than sub-word matching, but it requires the user to prerecord every word that will be recognized (potentially several hundred thousand words) and demands large amounts of storage.

B. SUB-WORD MATCHING: The engine looks for phonemes and then performs pattern recognition on them. This takes more processing than whole-word matching but requires much less storage.

III. THE DEVANAGARI SCRIPT

The basic unit of any language is the phoneme. Devnagari's phonemes are divided into two types: vowel phonemes (swara varna) and consonant phonemes (vyanjan varna). Together they constitute the Varnamala, which has been referred to as a varnasamamnaya.
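The whole-word matching technique described earlier amounts to a nearest-template search. A minimal sketch follows; the four-sample "templates", the Hindi words, and the plain squared-distance measure are illustrative assumptions (real engines compare spectral feature vectors, often with dynamic time warping).

```python
# Whole-word matching sketch: compare an incoming digital signal against
# prerecorded word templates and return the closest one.
def distance(a, b):
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

templates = {            # hypothetical prerecorded word templates
    "haan":  [0.1, 0.9, 0.8, 0.2],
    "nahin": [0.7, 0.1, 0.3, 0.9],
}

def recognize(signal):
    """Return the template word with the smallest distance to the input."""
    return min(templates, key=lambda w: distance(templates[w], signal))

print(recognize([0.2, 0.8, 0.7, 0.3]))  # haan
```

Sub-word matching replaces the per-word templates with a much smaller phoneme inventory, which is why it trades extra processing for far less storage.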
The combination of a consonant phoneme and a vowel phoneme produces a syllable (akshara). The basic Devnagari characters can be combined to indicate combinations of sounds; conjunct consonants can be divided into six different groups depending on the type of modification a consonant undergoes. Combining the two forms (consonant and vowel) into a syllable at times creates a new integrated shape, and at times retains partial identity of both forms. The syllables formed by adding vowel phonemes /a/, /aa/, /i/, etc. to a consonant phoneme are written by creating aksharas. Adding all swara phonemes, one by one, to a single consonant phoneme is a concept called a baaraakhadi. Syllables can also be formed by adding vowel phonemes to a sequence of more than one consonant phoneme.
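The baaraakhadi idea — attaching each vowel sign to one consonant to form aksharas — can be reproduced directly with Devanagari Unicode code points. A minimal sketch, assuming the traditional twelve-form recitation order:

```python
# Baaraakhadi sketch: combining the consonant क (ka, U+0915) with each
# vowel sign (matra) yields the twelve traditional syllable forms.
KA = "\u0915"  # क
MATRAS = [
    "",        # inherent /a/: the bare consonant
    "\u093E",  # ा  /aa/
    "\u093F",  # ि  /i/
    "\u0940",  # ी  /ii/
    "\u0941",  # ु  /u/
    "\u0942",  # ू  /uu/
    "\u0947",  # े  /e/
    "\u0948",  # ै  /ai/
    "\u094B",  # ो  /o/
    "\u094C",  # ौ  /au/
    "\u0902",  # ं  anusvara
    "\u0903",  # ः  visarga
]
baaraakhadi = [KA + m for m in MATRAS]
print(" ".join(baaraakhadi))  # क का कि की कु कू के कै को कौ कं कः
assert len(baaraakhadi) == 12
```

The same loop applied to every consonant of the Varnamala would enumerate the full syllable inventory that a phoneme-level recognizer for Devnagari has to cover.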

Fig. 2 Devnagari vowels classified into five types

IV. CONCLUSION

Speech recognition is an important task today. A notable use is on cellular phones, for typing-free messaging and calling; a document or an email can likewise be dictated and the system will type it. Our main aim, therefore, is to produce a typing-free system. Implementing one is a genuinely difficult task and requires a great deal of research: maintaining recognition accuracy and reducing noise in the voice signal are the two key challenges. By working with phonemes we can build a small local database for mapping voice signals to words, removing the internet-connection requirement found in other software such as Dragon.

Acknowledgement

We would like to express our sincere gratitude for the assistance and support of the many people who helped us. We are thankful to Prof. Rishikesh Yeolekar, Department of Computer Engineering, our internal guide, for the valuable guidance he provided at various stages of the project work; he has been a source of motivation, enabling us to give our best efforts. We are also grateful to Prof. Dr. Prasanna Joeg, Head of the Computer Department, and the other staff members for encouraging us and rendering all possible help, assistance, and facilities.

AUTHOR(S) PROFILE

Priyanka Nanekar, MIT College Of Engineering
Komal K. Bhosale, MIT College Of Engineering
Urvashi Kohli, MIT College Of Engineering
Prof. Yeolekar Rishikesh