L2: Speech production and perception
- Anatomy of the speech organs
- Models of speech production
- Anatomy of the ear
- Auditory psychophysics

Introduction to Speech Processing, Ricardo Gutierrez-Osuna, CSE@TAMU

Anatomy of the speech organs
The speech organs can be broadly divided into three groups:
- Lungs: serve as a power supply and provide airflow to the larynx
- Vocal cords (larynx): modulate the airflow into either a periodic sequence of puffs or a noisy airflow source
  - A third type of source is impulsive
  - Exercise: say the word "shop" and determine where each sound occurs
- Vocal tract: converts the modulated airflow into a spectrally colored signal
[Quatieri, 2002]

The vocal tract
The vocal tract can be further divided into:
- Velum (soft palate): controls airflow through the nasal cavity. In its open position it is used for nasals (e.g., [n], [m])
- Hard palate: hard surface at the roof of the mouth. Pressing the tongue against it produces consonants
- Tongue: away from the palate it produces vowels; close to or pressing against the palate it produces consonants
- Teeth: used to brace the tongue for certain consonants
- Lips: can be rounded or spread to shape consonant quality, or closed completely to produce certain consonants (i.e., [p], [b], [m])
[Huang, Acero & Hon, 2001]

[Taylor, 2009]

The vocal folds
- Two masses of flesh, ligament and muscle across the larynx
  - Fixed at the front of the larynx but free to move at the back and sides
- Can be in one of three primary states
  - Breathing: the glottis is wide, the muscles are relaxed, and air flows with minimal obstruction
  - Voicing: the vocal folds are tense and are brought together. Pressure builds up behind them, leading to an oscillatory opening of the folds (video)
  - Unvoiced: similar to the breathing state, but the folds are closer together, which leads to turbulence (i.e., aspiration, as in the sound [h] in "he") or whispering
http://academic.kellogg.edu/herbrandsonc/bio201_mckinley/f25-5b_vocal_folds_lary_c.jpg
http://biorobotics.harvard.edu/research/heather2.gif

Other (minor) forms of voicing include:
- Hoarse voice: the voicing period (pitch) jitters, as happens with laryngitis or a cold
- Breathy voice: aspiration occurs simultaneously with voicing (audio)
- Creaky voice: the vocal folds are very tense and only a portion oscillates. The result is a harsh-sounding voice (audio)
- Vocal fry: the folds are very relaxed, which leads to secondary glottal pulses (video)
- Diplophonic: secondary pulses occur, but during the closed phase
[Figure panels: Vocal fry; Diplophonia] [Quatieri, 2002]

Models of speech production
Acoustic theory of speech production
- Speech occurs when a source signal passing through the glottis is modified by the vocal tract acting as a filter
- Models of this kind are generally known as source-filter models
- Using the theory of linear time-invariant (LTI) systems, the overall process can be modeled in the z-domain (see lecture 4) as
  $Y(z) = U(z)\,P(z)\,O(z)\,R(z)$
  where U(z) is the glottal source, and P(z), O(z), R(z) are the transfer functions at the pharynx, oral cavity and lips
- This can be simplified as
  $Y(z) = U(z)\,V(z)\,R(z)$
  by combining P(z) and O(z) into a single vocal-tract transfer function V(z), which represents the filter component of the model
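The source-filter decomposition can be illustrated with a minimal time-domain sketch. It is not taken from the lecture: the impulse-train source, the single two-pole resonator standing in for V(z), the first-difference stand-in for R(z), and all numeric values (8 kHz sampling rate, 100 Hz pitch, a 500 Hz resonance with 100 Hz bandwidth) are assumptions chosen only for illustration.

```python
import math

def impulse_train(n_samples, period):
    """Hypothetical glottal source u[n]: one unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(x, freq_hz, bandwidth_hz, fs):
    """Two-pole all-pole filter standing in for V(z) (one resonance only)."""
    r = math.exp(-math.pi * bandwidth_hz / fs)   # pole radius from bandwidth
    theta = 2.0 * math.pi * freq_hz / fs         # pole angle from center frequency
    a1, a2 = -2.0 * r * math.cos(theta), r * r
    y = []
    for n, xn in enumerate(x):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(xn - a1 * y1 - a2 * y2)         # y[n] = x[n] - a1*y[n-1] - a2*y[n-2]
    return y

def radiation(x):
    """First difference, used here as a rough stand-in for lip radiation R(z)."""
    return [xn - (x[n - 1] if n >= 1 else 0.0) for n, xn in enumerate(x)]

fs = 8000                                  # assumed sampling rate (Hz)
source = impulse_train(800, fs // 100)     # U(z): pulses at an assumed F0 of 100 Hz
speech = radiation(resonator(source, 500.0, 100.0, fs))   # Y = U * V * R
```

Because each stage is LTI, the order of the resonator and radiation stages could be swapped without changing the output, which is what the product form of Y(z) expresses.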

Resonant systems
The physics of sounds
- Consider the mass-spring system shown in the figure
  - If you displace the mass, the system will try to return to its rest position
  - In the process, it will oscillate around the rest position
  - Due to friction, the mass will eventually settle at the rest position
- Now consider a periodic forcing function being applied
  - At a certain frequency $f_R$, the size of the oscillations will increase over time rather than decrease
  - Eventually, and in the absence of other factors, the system will break
- The frequency $f_R$ is known as the resonant frequency of the system
  $f_R = \frac{1}{2\pi}\sqrt{\frac{k}{m}}$
  where k is the spring stiffness and m is the mass
[Taylor, 2009]
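As a quick worked example of the resonant-frequency formula, the sketch below evaluates it for a hypothetical spring; the stiffness and mass values are illustrative assumptions, not values from the lecture.

```python
import math

def resonant_frequency(k, m):
    """Resonant frequency in Hz of an ideal mass-spring system.

    k: spring stiffness in N/m, m: mass in kg.
    """
    return math.sqrt(k / m) / (2.0 * math.pi)

# Assumed values: k = 100 N/m, m = 0.25 kg, so sqrt(k/m) = 20 rad/s.
f_r = resonant_frequency(k=100.0, m=0.25)
print(f"f_R = {f_r:.2f} Hz")
```

A stiffer spring or a lighter mass raises the resonant frequency, which is the same qualitative behavior that makes shorter tubes resonate at higher frequencies.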

The Tacoma Narrows Bridge

Travelling and standing waves
- Consider a person holding a rope that is fixed to a wall at the opposite end
  - The person gives the rope a jerk, and as a result a wave forms and starts travelling down the rope
  - When the wave reaches the wall, it is reflected and begins to travel back towards the person
  - When it reaches the person, the wave is reflected back towards the wall
  - This process goes on until all the energy in the rope dissipates
[Taylor, 2009]

- Now consider that the person performs a repetitive movement
  - The forward and backward waves will interfere, which may lead to full cancellation (if the two waves are in anti-phase)
  - At some frequencies, the reflected wave will reinforce the forward wave, and the rope settles into a fixed pattern
  - The resulting wave will appear not to be moving at all (a standing wave)
- Thus the rope acts as a resonator: it amplifies some waves and attenuates others
http://www.cccc.edu/instruction/slympany/eln/236/Mod7/loet01-07-06new.gif
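The standing-wave pattern can be checked numerically. The sketch below assumes two lossless sinusoidal waves of equal amplitude travelling in opposite directions, with the fixed end placed at x = 0; the wavelength and frequency are arbitrary illustrative values. It verifies that their sum equals $2\sin(kx)\cos(\omega t)$, a shape that stays in place while only its amplitude changes over time.

```python
import math

wavelength = 1.0                      # assumed wavelength (m)
freq = 2.0                            # assumed driving frequency (Hz)
k = 2.0 * math.pi / wavelength        # wavenumber
w = 2.0 * math.pi * freq              # angular frequency

def forward(x, t):
    """Wave travelling in the +x direction."""
    return math.sin(k * x - w * t)

def backward(x, t):
    """Wave travelling in the -x direction."""
    return math.sin(k * x + w * t)

def standing(x, t):
    """Closed form of the sum: a fixed spatial shape scaled by cos(w t)."""
    return 2.0 * math.sin(k * x) * math.cos(w * t)

xs = [i / 20.0 for i in range(21)]    # positions along one wavelength
ts = [j / 50.0 for j in range(26)]    # sample times over one period

max_error = max(abs(forward(x, t) + backward(x, t) - standing(x, t))
                for x in xs for t in ts)
# Node at half a wavelength: the rope never moves there.
node_motion = max(abs(forward(0.5, t) + backward(0.5, t)) for t in ts)
# Antinode at a quarter wavelength: the motion reaches twice the amplitude.
antinode_motion = max(abs(forward(0.25, t) + backward(0.25, t)) for t in ts)
```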

What determines the behavior of the system?
- The frequency of the oscillations is determined entirely by the hand
- The rate of travel of the wave is determined entirely by the rope
- Boundary conditions: whether the rope is fixed or free at each end

What is the relationship with the speech signal?
- The hand acts as the source (the glottal pulses)
- The rope acts as the filter (the vocal tract)

Acoustic waves
- The properties of sound waves travelling in a tube (e.g., the vocal tract) are similar to those of waves moving down a rope
  - A sound source causes air particles to move back and forth, so the wave spreads from the source
  - In some areas, particles come close together (compression), whereas in others they move further apart (rarefaction)
- One difference is that sound waves are longitudinal whereas those in a rope are transverse, but otherwise the same mathematical model can be used for both systems
[Taylor, 2009]

Acoustic reflection and tube models
- As with the rope, boundary conditions in the tube will determine how acoustic waves are reflected at the end of the tube
  - At certain frequencies, determined by the length of the tube and the speed of sound, the backward and forward waves will reinforce each other and cause resonances
- We can model the volume velocity (i.e., particle velocity × area) at position x and time t as
  $u(x,t) = u^{+}(t - x/c) - u^{-}(t + x/c)$
  where $u^{+}(t)$ and $u^{-}(t)$ are the forward- and backward-travelling waves, and c is the speed of sound
- The pressure then becomes
  $p(x,t) = \frac{\rho c}{A}\left[u^{+}(t - x/c) + u^{-}(t + x/c)\right]$
  where $\rho c / A$ is the characteristic impedance of the tube, with $\rho$ the density of air and A the cross-sectional area
  - Notice how in this case the two waves add up as they meet at point x
[Taylor, 2009]
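To make the dependence of the resonances on tube length and the speed of sound concrete, the sketch below uses the standard result for an idealized lossless uniform tube that is closed at one end and open at the other, where resonances fall at odd multiples of c/(4L). This special case, the 17 cm length and the 340 m/s speed of sound are assumptions introduced for illustration and are not stated in the lecture.

```python
def tube_resonances(length_m, n_resonances, c=340.0):
    """Resonant frequencies (Hz) of an ideal uniform tube, closed at one end
    and open at the other: F_n = (2n - 1) * c / (4 * L), n = 1, 2, ...
    """
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, n_resonances + 1)]

# Assumed tube length of 0.17 m and assumed speed of sound of 340 m/s.
formants = tube_resonances(0.17, 3)
print([round(f) for f in formants])
```

Halving the tube length doubles every resonance, which is consistent with the statement that the length of the tube and the speed of sound set the resonant frequencies.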

- If the area of the tube remains constant, the wave propagates through the tube
- However, if the area changes, then the impedance changes, which causes reflection, which in turn leads to standing waves, which then cause resonances
- Thus, the impedance pattern along the tube determines the resonance properties of the model
http://www.livingcontrolsystems.com/festschrift/nevin_files/image015.jpg

- The vocal tract can then be modeled as a sequence of short uniform tubes connected in series
- By increasing the number of tubes, the vocal tract can be modeled with arbitrary accuracy
- As we will see in linear predictive coding (lecture 7), a tube model with N sections leads to N/2 resonances, so in practice only a few tube sections are needed to model the main formants in the speech signal
http://www.gregandmel.net/burnett_thesis/2_3.ht7.jpg
[Taylor, 2009]

Anatomy of the ear
There are two major components in the auditory system:
- The peripheral auditory organs (the ear)
  - Convert sound pressure into mechanical vibration patterns, which are then transformed into neuron firings
- The auditory nervous system (the brain)
  - Extracts perceptual information in various stages
We will focus on the peripheral auditory organs.

The ear can be further divided into:
- Outer ear
  - Encompasses the pinna (outer cartilage), auditory canal, and eardrum
  - Transforms sound pressure into vibrations
- Middle ear
  - Consists of three bones: malleus, incus and stapes
  - Transports eardrum vibrations to the inner ear
- Inner ear
  - Consists of the cochlea
  - Transforms vibrations into spike trains at the basilar membrane

[Figure label: Oval window]
http://www.bissy.scot.nhs.uk/master_code/pilsinl/042.gif

The cochlea
- A tube coiled in a snail-shaped spiral
  - The inside is filled with gelatinous fluid
  - Running along its length is the basilar membrane (BM)
  - Along the BM are located approx. 10,000 inner hair cells
- Signal transduction
  - Vibrations of the eardrum cause movement in the oval window
  - This causes a compression sound wave in the cochlear fluid
  - This causes vertical vibration of the basilar membrane
  - This causes deflections in the inner hair cells, which then fire
- Frequency tuning
  - The BM is stiff/thin at the basal end (stapes), but compliant/massive at the apex
  - Thus, travelling waves peak at different positions along the BM
  - As a result, the BM can be modeled as a filter bank (video)

http://cobweb.ecn.purdue.edu/~ee649/notes/figures/basilar_membrane.gif
[Figure label: (apex)]
http://upload.wikimedia.org/wikipedia/commons/6/65/uncoiled_cochlea_with_basilar_membrane.png

http://hyperphysics.phy-astr.gsu.edu/hbase/sound/cochimp.html

[Rabiner & Schafer, 2007]

Auditory psychophysics
- Psychoacoustics is concerned with quantitative modeling of human auditory perception
  - How does the ear respond to different intensities and frequencies?
  - How well does it focus on a sound of interest in the presence of interfering sounds?

Thresholds
- The ear is capable of hearing sounds in the range of 16 Hz to 18 kHz
  - Intensity is measured in terms of sound pressure level (SPL), in units of decibels (dB)
- Hearing threshold: the minimum intensity at which a sound is perceived
  - Sounds below 1 kHz or above 5 kHz have increasingly higher thresholds
  - The threshold is nearly constant across most speech frequencies (700 Hz to 7 kHz)
http://msis.jsc.nasa.gov/images/section04/image126.gif
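For readers unfamiliar with the decibel scale, the sketch below converts a sound pressure into SPL using the conventional definition $\text{SPL} = 20\log_{10}(p/p_0)$ with the standard reference pressure $p_0 = 20\,\mu\text{Pa}$. The definition and reference value are standard acoustics conventions rather than something stated in the lecture, and the example pressures are illustrative.

```python
import math

P_REF = 20e-6   # standard reference pressure in pascals (20 micropascals)

def spl_db(pressure_pa):
    """Sound pressure level in dB SPL for an RMS pressure given in pascals."""
    return 20.0 * math.log10(pressure_pa / P_REF)

quiet = spl_db(20e-6)     # pressure equal to the reference
ten_times = spl_db(200e-6)  # ten times the reference pressure
one_pascal = spl_db(1.0)    # an RMS pressure of 1 Pa
print(f"{quiet:.1f} dB, {ten_times:.1f} dB, {one_pascal:.1f} dB")
```

Each tenfold increase in pressure adds 20 dB, which is why the audible range of pressures fits into a scale of roughly a hundred or so decibels.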

SPL and loudness
- As with other sensory systems (seeing, smelling), auditory sensations increase logarithmically with the intensity of the stimulus
  - The relation between sound pressure p, sound intensity I and loudness S follows Stevens' power law
    $S \propto p^{0.6} \propto I^{0.3}$
    where the unit of S is the sone, and the proportionality constant implied by the equation is frequency dependent
- The ear is most sensitive to tones around 4 kHz
- Each equal-loudness contour corresponds to a loudness level in phons (the SPL in dB of an equally loud 1 kHz tone)
http://replaygain.hydrogenaudio.org/pics/equal_loudness.gif
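A short worked example shows what the power law implies in practice. Assuming the exponents given above and the usual relations between level differences in dB and pressure or intensity ratios, the sketch computes how much louder a tone becomes when its level is raised by a given number of decibels; the function names are hypothetical.

```python
def loudness_ratio_from_pressure(delta_db):
    """Loudness ratio for a level increase of delta_db, via S ~ p^0.6."""
    pressure_ratio = 10.0 ** (delta_db / 20.0)
    return pressure_ratio ** 0.6

def loudness_ratio_from_intensity(delta_db):
    """Same quantity via S ~ I^0.3; the two forms agree because I ~ p^2."""
    intensity_ratio = 10.0 ** (delta_db / 10.0)
    return intensity_ratio ** 0.3

ratio_10db = loudness_ratio_from_pressure(10.0)
print(f"+10 dB gives about {ratio_10db:.2f} times the loudness")
```

Under these assumptions a 10 dB increase roughly doubles the loudness in sones, even though the intensity has grown tenfold.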

Masking
- A phenomenon whereby the perception of a sound is obscured by the presence of another (i.e., the latter raises the threshold of the former)
- Masking is the major non-linear phenomenon that prevents treating the perception of speech sounds as a summation of responses

Two types of masking phenomena
- Frequency masking
  - A lower-frequency sound generally masks a higher-frequency one
  - Leads to the concept of critical bands (next)
- Temporal masking
  - Sounds delayed with respect to one another can cause masking of either sound
  - Pre-masking tends to last 5 ms; post-masking can last up to 50-300 ms

[Figure panels: Frequency masking; Temporal masking]
http://homepage.mac.com/marc.heijligers/audio/ipod/compression/encoding/encoding.html

Critical bands
- For a given frequency, the critical band (CB) is the smallest band of frequencies around it which activate the same part of the BM
- Critical bandwidths correspond to about 1.5 mm spacing along the BM
  - This suggests that a set of 24 bandpass filters (with bandwidth increasing with frequency) would model the BM well
- If a signal and masker are presented simultaneously, only the masker frequencies within the CB contribute to masking of the signal
  - The amount of masking is equal to the total energy of the masker within the CB of the probe
[Rabiner & Schafer, 2007]

How can you test a critical band experimentally?
- Take a band-limited noise signal with a center frequency of 2 kHz, and play it alongside a sinusoidal 2 kHz tone
- Make the tone very quiet relative to the noise
  - You will not be able to detect the tone because the noise signal will mask it
- Now, turn up the level of the tone until you can hear it, and write down its level
- Increase the bandwidth of the noise (without turning up its level) and repeat
  - You'll find that your threshold for detecting the tone will be higher
  - In other words, if the bandwidth of the masking signal is increased, you have to turn up the tone more in order to be able to hear it
- Increase the bandwidth and do the experiment over and over
  - As you increase the bandwidth of the masker, the detection threshold of the tone will increase up to a certain bandwidth. Then it won't increase any more!
  - This means that, for a given frequency, once you get far enough away in frequency, the noise does not contribute to the masking of the tone
- The bandwidth at which the threshold for the detection of the tone stops increasing is the critical bandwidth
http://www.tonmeister.ca/main/textbook/node331.html
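The outcome of this experiment can be mimicked with a toy model. The sketch assumes that the noise has a flat spectral density (so widening it adds energy), that only the noise energy inside the critical band masks the tone, and that the tone is detected once its level matches that in-band noise level. The 300 Hz critical bandwidth and the 40 dB per Hz density are illustrative assumptions, not measurements from the lecture.

```python
import math

def masked_threshold_db(noise_bw_hz, critical_bw_hz, density_db_per_hz=40.0):
    """Toy detection threshold (dB) for a tone centered in band-limited noise.

    Only the part of the noise inside the critical band contributes, so the
    effective masking bandwidth saturates at critical_bw_hz.
    """
    effective_bw = min(noise_bw_hz, critical_bw_hz)
    return density_db_per_hz + 10.0 * math.log10(effective_bw)

assumed_cb_hz = 300.0                              # assumed critical bandwidth near 2 kHz
bandwidths = [50, 100, 200, 300, 600, 1200]        # masker bandwidths to try (Hz)
thresholds = [masked_threshold_db(bw, assumed_cb_hz) for bw in bandwidths]

for bw, th in zip(bandwidths, thresholds):
    print(f"masker bandwidth {bw:5d} Hz -> threshold {th:.1f} dB")
```

In this model the threshold climbs by about 3 dB per doubling of the masker bandwidth and then flattens; the knee of that curve is where an experimenter would read off the critical bandwidth.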

Two perceptual scales have been derived from critical bands
- Bark scale
  - Relates acoustic frequency to perceptual frequency resolution
  - One Bark equals one critical band
    $z = 13\tan^{-1}\left(0.76\,\frac{f}{\text{kHz}}\right) + 3.5\tan^{-1}\left(\left(\frac{f}{7.5\,\text{kHz}}\right)^{2}\right)$
- Mel scale (more in lecture 9)
  - Linear mapping up to 1 kHz, then logarithmic at higher frequencies
    $m = 2595\log_{10}\left(1 + \frac{f}{700}\right)$
    with f in Hz
[Figure: normalized Mel and Bark scales versus frequency (Hz), from 0 to 8000 Hz]
[Rabiner & Schafer, 2007]
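Both mappings are easy to evaluate directly. The sketch below implements the two formulas as hypothetical helper functions taking frequency in Hz; the Bark version assumes the commonly used form in which the second arctangent argument is squared.

```python
import math

def hz_to_bark(f_hz):
    """Bark value for a frequency in Hz (second arctan argument squared)."""
    f_khz = f_hz / 1000.0
    return 13.0 * math.atan(0.76 * f_khz) + 3.5 * math.atan((f_khz / 7.5) ** 2)

def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

for f in (100, 500, 1000, 4000, 8000):
    print(f"{f:5d} Hz -> {hz_to_bark(f):5.2f} Bark, {hz_to_mel(f):7.1f} mel")
```

With these constants 1000 Hz maps to very nearly 1000 mel, and equal steps in Hz yield progressively smaller steps on both scales as frequency rises, reflecting the widening of critical bands.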

Pitch perception
- Like loudness, pitch is a subjective attribute, in this case related to the fundamental frequency (F0) of a periodic signal
- The relationship between pitch and F0 is nonlinear and can be described by the Mel scale
[Rabiner & Schafer, 2007]