Best Practices in Sociophonetics: Field Recording (Digital Tools, Transcription)
Christopher Cieri
Linguistic Data Consortium
Background
  LDC creates and shares corpora & other language resources
    19 years, >78,000 copies, >1,300 titles, >3,100 orgs, 70 countries
  Consortium = mutual aid society
    members contribute membership fee (library), or sometimes data
    members receive ongoing rights to data created in years member contributes
    grants in data for students; other arrangements possible for junior faculty, underfunded groups
  Large-scale data collection
    billions of words of text
    thousands of hours of video
    tens of thousands of hours of audio
    >500 subjects * 2-4 sociolinguistic interviews + 12-24 calls
    tens of millions of annotation decisions
    coded >41,000 tokens for >150 dependent variables
Cieri, Field Recording, Linguistics Institute, July 17, 2011, Boulder, CO
LDC: Data Collection
  news text
  web text: newsgroups, blogs, zines
  biomedical text & abstracts
  printed, handwritten & hybrid documents
  broadcast news
  broadcast conversation
  conversational telephone speech
  lectures
  meetings
  interviews
  read & prompted speech
  role play
  web video
  animal vocalizations
LDC: Annotation
  data scouting, selection, triage
  audio-audio alignment, bandwidth, signal quality, language, dialect, program, speaker
  quick and careful transcription, aligned at the turn, sentence, word level
  orthographic & phonetic script normalization
  phonetic, dialect, sociolinguistic feature & supralexical
  document zoning
  tokenization and tagging of morphology, part-of-speech, gloss
  syntactic, semantic, discourse function, disfluency, sense disambiguation
  relevance
  identification, classification of mentions in text of entities, relations, events & coreference
  knowledgebase population
  time & location
  summarization of various lengths from 200 words down to titles
  translation, multiple translation, edit distance, translation post-editing, translation quality control
  alignment of translated text at document, sentence & word levels
  physics of gesture
  identification, classification of entities and events in video
History
  1999 Gregory Guy's workshop on publicly available corpora
  2001 LDC DASL project, t/d deletion study
  2002 William Labov's SLx Corpus and the DASLTrans
  2003 Workshop at Penn on robust sociolinguistic methodology
  2007 DiPaolo & Yaeger-Dror workshop with USSS, MIT-LL, Phanotics
  2008 LREC paper on Phanotics project
  2009 Update on methodology, resulting paper
  2010 2nd DiPaolo & Yaeger-Dror workshop
  2011 Chapter 3 in Sociophonetics: A Student's Guide
  2011 NWAV workshop (?) on demographics for sociolinguistic fieldwork
    normalize core, familiarize readers with range of possible independent variables
  2012 LSA workshop (?) demographics, situation, attitude
    define terms in data category registry, publish modules
    program committee accepted, await decisions from NSF, LSA executive committee
Approach
  support many research communities
  promiscuous about standards
  knowledge & techniques for appropriate field recording
    maximize quality relative to situation
    no magic bullet, no single configuration
  skipping question modules, interaction with subjects (DiPaolo & Yaeger-Dror 2011, Tagliamonte 2006)
  numerous collection methods other than field interviews
  speech corpora that already exist
  progress orientation
    minimize task avoidance
Phonation, Sound Propagation
  1. source, modified by glottis and oral and nasal cavities
     1. unfortunately other sounds as well
  2. air molecules jostling each other
  3. waves of pressure change expanding outward like a cone
  4. bouncing off walls, ceiling, floor, furniture, people
     1. which may themselves be moving (see 2)
  5. bouncing off eardrums and microphone diaphragms
Speaker
  Speaker Selection Criteria
    active & lifelong participation in the speech community under study
    place in a sample stratified by sex, age, socioeconomic class, ethnicity
    sometimes religious, political affiliation, membership in social group
  Typically these speakers do not also
    sit still, look straight ahead, avoid fiddling with papers, microphone cables
  Approach
    ameliorate situation
    accept subjects even if misbehaved
    oversample population, select best recordings
Uncertainty Principle
  "the aim of linguistic research in the community must be to find out how people talk when they are not being systematically observed; yet we can only obtain this data by systematic observation" (Labov 1972)
  techniques to reduce interviewer impact are in tension with optimizing recording quality
  changes to conversational situation affect the resulting speech
    correcting subject speech
    head-mounted mics
    recording booths
  sociolinguists in the field generally forgo the very best recording conditions in favor of (nearly) vernacular speech
Environment: Noise
  We find potential subjects in poor locations for recording
    sub-optimal recording versus lost speakers
  modern world filled with noise we ignore; fieldworkers can become sensitive to noise and minimize it
  Sources
    indoor noise: televisions, radios, music players, but also refrigerator, faucet, lighting (particularly fluorescent/neon), HVAC, computers, clocks, phones
    outdoor noise: traffic, outdoor play & intermittent noise near school, field, hospital, police station
  Location
    prepare
    select
    mitigate
Environment: Reflection
  sound reflects from a surface at the angle of incidence, in an amount related to the size, shape, material properties of the surface
  generally, large, flat, smooth, hard surfaces reflect more than small, irregularly shaped, textured, soft surfaces
  short walls, coverings, carpets, curtains are better than long, empty walls, flat ceilings, bare floors, big windows, mirrors, long tables
  rooms that are squares or cubes are problematic for recording
  right-angle corners increase reflection
Environment: Distance
  inverse square law: energy of wave front decreases as a function of the square of the distance from source
  optimal distance from mouth to mic depends principally upon the mic
    follow operations manual (online)
    closer = better, except avoid proximity effect for directional microphones
    avoid placing microphone directly in airstream from mouth/nose
    avoid placing lavaliers in shadow of chin
  Interviewer at normal conversational distance
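
The inverse square law on the Distance slide can be made concrete with a short calculation. This is a minimal sketch, assuming an idealized point source in a free field with no reflections; the function name and the example distances are hypothetical and do not come from the slides.

```python
import math


def level_change_db(d_ref_cm: float, d_new_cm: float) -> float:
    """Level change in dB when mouth-to-mic distance goes from d_ref_cm to d_new_cm.

    Assumption: ideal point source, free field. Intensity falls as 1/d^2,
    so the change is 10*log10((d_ref/d_new)^2) = 20*log10(d_ref/d_new).
    """
    if d_ref_cm <= 0 or d_new_cm <= 0:
        raise ValueError("distances must be positive")
    return 20.0 * math.log10(d_ref_cm / d_new_cm)


# Hypothetical distances: a lavalier at 20 cm versus mics placed farther away.
for distance_cm in (20, 40, 80, 160):
    change = level_change_db(20, distance_cm)
    print(f"{distance_cm:>4} cm: {change:+.1f} dB relative to 20 cm")
```

Under these assumptions each doubling of distance costs about 6 dB of speech level. Real rooms add reflections, so measured values will differ.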
Equipment: Sensors
  One for each subject + one for the interviewer
  Types
    condenser (frequently electret)
      better frequency response, sensitivity, louder output
      small form factor, low cost
    dynamic
      rugged
      don't require their own power supply
  Power
    batteries inconvenient, risky
    plug-in power specifications vary across microphones, recorders
    battery packs add bulk
  Confirm compatibilities in advance of purchase
  Stock adequate supplies
Equipment: Sensors
  Polar pattern
    omnidirectional
      capture sound with same sensitivity from every direction: front, back, sides, above, and below
      robust to placement, movement
    directional
      robust to noise, especially from behind
      susceptible to proximity effect: boosting of lower frequencies when <15 cm from source
  directional preferred when noise is principal concern; requires proper placement, well-behaved subjects
  omnidirectional more flexible, better fidelity when noise is not primary concern
Polar Pattern
Cieri, Strassel: Robust, Digital, Empirical, Reproducible Sociolinguistic Methodology, NWAV 39, November 4-6, 2010, San Antonio, Texas
Equipment: Microphone: Mounting
  lavalier
    good quality recordings through near, unobtrusive placement
    attached via a clip to a jacket lapel
    worn around neck on a lanyard, stabilized with pin or tape
    <= 20 cm from mouth (>= 15 cm if directional)
    not directly in airstream from the speaker's mouth or nose
    not placed in the shadow of the chin
    not attached to the collar or placket of a shirt or blouse
  head mounted
    typically avoided because obtrusive
    however, current popularity headgear suggests stronger
  others
    stand-mounted, hand-held microphones obvious, risk making subjects self-conscious
Equipment: Microphone: Frequency Response
  different materials & manufacturing processes yield different sensitivities to different frequencies
  ideal might be a mic with flat frequency response across the frequencies in which human speech is produced
  very few mics, and even fewer low-cost microphones, have very flat response across all speech frequencies
  having identified a range of mics that meet other criteria, compare frequency response to provide the best quality for intended use
Frequency Response
Equipment: Recorders
  Sampling Rate
Equipment: Recorders
  Sample Size
Equipment: Recorders
  Aliasing
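
Aliasing occurs when a signal contains energy above half the sampling rate (the Nyquist frequency): that energy is folded back and appears at a lower frequency. The sketch below shows the folding for a pure tone. It assumes ideal sampling with no anti-aliasing filter; the function name and example frequencies are hypothetical illustrations, not material from the slides.

```python
def alias_frequency(f_hz: float, fs_hz: float) -> float:
    """Apparent frequency of a pure tone at f_hz after ideal sampling at fs_hz.

    Assumption: no anti-aliasing filter in front of the sampler.
    """
    if fs_hz <= 0 or f_hz < 0:
        raise ValueError("fs_hz must be positive and f_hz non-negative")
    nyquist = fs_hz / 2
    folded = f_hz % fs_hz            # sampling cannot tell f from f + k*fs
    return fs_hz - folded if folded > nyquist else folded


# Hypothetical tones sampled at a telephone-like 8 kHz and at 16 kHz.
for tone_hz, rate_hz in ((3000, 8000), (5000, 8000), (5000, 16000), (9000, 16000)):
    print(f"{tone_hz} Hz tone at {rate_hz} Hz sampling appears at "
          f"{alias_frequency(tone_hz, rate_hz)} Hz")
```

In practice recorders apply a low-pass filter before sampling, so energy above the Nyquist frequency is attenuated rather than folded. The sketch only illustrates why the sampling rate bounds the usable frequency range.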
Equipment: Recorders
  Sampling Rate
    16 kHz, if appropriate given source, e.g. less needed for telephone
  Sample Size
    16 bits
  Compression
    Why risk it?
  Storage
    sampling rate * sample size/8 bytes per second
    96,000 * 24/8 * 60 * 60 = ~1 GB/hour
  Analytic Software Requirements
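
The storage formula on this slide (sampling rate * sample size / 8, per second) can be checked with a few lines of code. This is a minimal sketch for uncompressed PCM audio; the function name and the `channels` parameter are assumptions added for illustration, and file header overhead is ignored.

```python
def bytes_per_hour(sample_rate_hz: int, sample_size_bits: int, channels: int = 1) -> int:
    """Uncompressed PCM storage for one hour of audio, ignoring file headers."""
    if sample_size_bits % 8 != 0:
        raise ValueError("sample size is assumed to be a whole number of bytes")
    bytes_per_second = sample_rate_hz * (sample_size_bits // 8) * channels
    return bytes_per_second * 60 * 60


# The slide's worked example (96 kHz, 24-bit) and the recommended 16 kHz, 16-bit.
for rate, bits, chans in ((96_000, 24, 1), (16_000, 16, 1), (16_000, 16, 2)):
    total = bytes_per_hour(rate, bits, chans)
    print(f"{rate} Hz, {bits}-bit, {chans} ch: {total:,} bytes/hour "
          f"({total / 1e6:.1f} MB)")
```

The 96 kHz, 24-bit case reproduces the slide's figure of roughly 1 GB per channel-hour. At 16 kHz and 16 bits the same hour needs about 115 MB per channel.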
Equipment: Recorders
  Desiderata
    adequate quality @ affordable price
    standard digital format, 16-bit samples, 16 kHz sampling
    uncompressed, nonproprietary, allowing universal random access
    standard data interface for moving speech files to computer
    small, unobtrusive, very portable
    simple to use
    adequate storage and battery life for 1 entire day in the field
    monitors for battery life, remaining storage, level, clipping
    2 channels with separate adjustments
    solid-state
    compatible with the microphones: connector type (TRS, XLR), power protocol (plug-in, phantom)
Equipment: Recorders
  H2, SP-CMC-2, PMD620, H4, DR-100
Optimal Methods
  make coding efficient, allowing researchers to
    consider greater percentage of tokens/variable
    investigate more variables
  minimize misses
  improve accuracy and balance
  improve consistency
  retain accurate time and sequence information
  retain mapping among sound, transcript, tokens, coding, analysis and examples in publication
  encourage re-use of data
    each additional pass requires less effort than original
    re-use & reanalysis profits from previous preparation
Model
Segmentation
  Divides corpus into manageable units
    indicates structural boundaries in recording
    provides time-alignment for transcripts and other annotations
      transcript becomes index to audio
    simplifies subsequent transcription, token selection, processing, analysis
      8 seconds for transcription, FA runs better, Praat can display
  Preserve integrity of original signal
    virtual, not actual, chopping of digital signal
    allows multiple segmentations of the same event
  Speech Activity Detection (SAD)
    technology exists for some audio types (LDC has telephone, BUT has broadcast)
    segments by pause group
    need training material (segmented, representative sociolinguistic data)
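
The idea of "virtual, not actual, chopping" can be sketched as time-stamped records that point into an untouched recording. This is an illustrative sketch only: the `Segment` class, its field names, and the helper functions are hypothetical and do not reflect the file format of any tool named in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A virtual segment: times in seconds into the original, unmodified recording."""
    start_s: float
    end_s: float
    speaker: str
    text: str = ""


def sample_range(seg: Segment, sample_rate_hz: int) -> tuple[int, int]:
    """Convert a segment's times to sample indices in the original signal."""
    return round(seg.start_s * sample_rate_hz), round(seg.end_s * sample_rate_hz)


def excerpt(samples: list, seg: Segment, sample_rate_hz: int) -> list:
    """Return a copy of the segment's samples; the original list is not altered."""
    first, last = sample_range(seg, sample_rate_hz)
    return samples[first:last]


# Stand-in for 4 seconds of 16 kHz audio, with two segmentations of the same event.
rate = 16_000
audio = [0] * (rate * 4)
by_turn = [Segment(0.0, 4.0, "subject")]
by_pause_group = [Segment(0.0, 1.5, "subject"), Segment(1.5, 3.0, "subject")]

for seg in by_turn + by_pause_group:
    print(seg.speaker, sample_range(seg, rate), len(excerpt(audio, seg, rate)))
```

Because segments are only pointers, several segmentations at different granularities can coexist over one recording, and the transcript text attached to each segment serves as an index to the audio.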
Segmentation
  Segmentation for a specific purpose
    speaker turn, breath/pause group (1xRT), utterance, SU (5xRT)
    word level, phone level
      best handled as additional pass
      imparts additional level of analysis
      more difficult/costly, requires specialists
      free with forced alignment
  Issues
    levels of granularity
    multiple speakers on one channel
    overlapping speech, even across channels
    how long is a pause?
    additional features: background, non-speaker noise, SID, style
Time as Variable
  Time is on the horizontal axis.
  Conversational situation (style) is on the vertical axis. Larger numbers mean greater formality.
    4+ are elicited styles
    3 is the default interview situation
    2 is for narratives and extended descriptions
    1 is for speech to another party
  The longer interview clearly provides greater opportunities to study style shifting!
Transcription
  Stoker '97 provides early justification for transcription in a related field
    "He accordingly set the phonograph at a slow pace, and I began to typewrite from the beginning of the seventeenth cylinder."
    "He thinks that in the meantime I should see Renfield, as hitherto he has been a sort of index to the coming and going of the Count. I hardly see this yet, but when I get at the dates I suppose I shall. What a good thing that Mrs. Harker put my cylinders into type! We never could have found the dates otherwise."
    Stoker, Bram (1897) Dracula
Transcription
  Why transcribe?
    index to audio, intermediary to later coding
    searchable
    learn about session
  How to transcribe?
    verbatim
    no correction
    standard orthography, punctuation
    conventions for
      unintelligible speech
      non-standard variants
      speaker restarts, disfluencies, hesitations
    7-10xRT using Transcriber, XTrans
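
The 7-10xRT figure (seven to ten hours of work per hour of audio) translates directly into a planning estimate. This is a minimal sketch: the function name is hypothetical, and the corpus size used in the example is invented for illustration, not taken from any LDC project.

```python
def transcription_hours(audio_hours: float,
                        low_xrt: float = 7.0,
                        high_xrt: float = 10.0) -> tuple[float, float]:
    """Estimated range of transcriber hours for a given amount of audio.

    xRT means "times real time": 7xRT is 7 hours of work per hour of audio.
    """
    if audio_hours < 0 or low_xrt > high_xrt:
        raise ValueError("need non-negative audio_hours and low_xrt <= high_xrt")
    return audio_hours * low_xrt, audio_hours * high_xrt


# Hypothetical project: 40 interviews of 1.5 hours each.
audio_total = 40 * 1.5
low, high = transcription_hours(audio_total)
print(f"{audio_total:.0f} hours of audio: {low:.0f} to {high:.0f} hours of transcription")
```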
Transcription
  Multiple passes focusing on different tasks
    limit cognitive load of any one pass
  tasks
    basic text
    disfluencies
    conversational situation
    dialect phenomena
    personal identifying information
    phonetics (inter-annotator agreement 70-90%)
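
The slide reports inter-annotator agreement of 70-90% for the phonetics pass but does not say how it was computed. The sketch below assumes the simplest measure, raw percent agreement over tokens coded by two annotators. The function name and the labels are hypothetical examples.

```python
def percent_agreement(labels_a: list, labels_b: list) -> float:
    """Raw percent agreement between two annotators over the same tokens.

    Assumption: both lists code the same tokens in the same order.
    This measure does not correct for agreement expected by chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must code the same tokens")
    if not labels_a:
        raise ValueError("no tokens to compare")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


# Hypothetical codings of four tokens of a binary variable.
annotator_1 = ["deleted", "retained", "retained", "deleted"]
annotator_2 = ["deleted", "retained", "deleted", "deleted"]
print(f"{percent_agreement(annotator_1, annotator_2):.0f}% agreement")
```

Chance-corrected statistics are often preferred when categories are unevenly distributed; raw agreement is used here only because it is the simplest reading of the slide's figure.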
Transcriber
  http://trans.sourceforge.net/en/presentation.php
  fastest segmentation
  more user-friendly than XTrans
  Linux, Windows, OSX
  open-source
  multiple audio, text formats
  requires full segmentation of audio
  built for single-channel broadcast news
  handling of overlapping speech
XTrans
  http://www.ldc.upenn.edu/tools/xtrans/
  fast segmenting, multi-channel, multi-speaker, overlaps
  reads Transcriber, SPH
  Linux, Windows, OSX (in emulation)
Elan
  http://www.lat-mpi.eu/tools/elan
  video
  reads Transcriber, SPH
  interacts with Praat
  Linux, Windows, OSX
  segmentation a bit more complex than the others