Transcriptions in the CHAT format

CHILDES workshop winter 2010
Mirko Hanke
University of Oldenburg, Dept of English
mirko.hanke@uni-oldenburg.de
10-12-2010

1 The CHILDES project

CHILDES is the CHIld Language Data Exchange System. It is part of a larger project called TalkBank, which is dedicated to collecting and exchanging samples of spontaneous speech from different populations. The CHILDES project was founded in 1984 by Brian MacWhinney and Catherine Snow. The project provides three things: a large database of transcriptions (about 44 million words from 32 languages); tools for creating, editing and analysing these data; and a transcription standard that makes it easy for researchers to exchange data.

2 CHILDES online resources

Almost everything you need to know about using CHILDES is on the project's web page: http://childes.psy.cmu.edu. In the "System" section you'll find some general information about the database. Among other things, you can find legal information that will be relevant if you want to publish using data from the database, or if you want to contribute your own data to the corpus. Of more immediate importance to most users are the mailing lists, where you can get answers to questions about the project and the software by email. It is always advisable, though, to search the mailing list archives for an answer before posting to the list.

The "Programs and data" section contains the core features of the project: downloadable transcripts and the CLAN software to view, search, create and edit transcripts. Perhaps the most important resources for all users are the manuals. The CHAT manual explains the transcription standard, e.g. how the files are structured and how to code utterance boundaries, special forms, errors, pauses, etc. The CLAN manual explains how to operate the software tools, for instance how to transcribe and link transcription files to audio or video files with a time index, how to calculate MLU (mean length of utterance) measures, how to search for combinations of words, etc. The "Manuals" section also contains a number of training videos by Brian MacWhinney that explain the use of CLAN. Also of interest might be the tutorials on recording audio or video files and transcribing from them. The tutorials can be found under "Special procedures". The CLAN software and the CHAT and CLAN manuals are all updated regularly. While the changes between versions might not be drastic, it is worth checking for updates from time to time.

3 The CLAN software

CLAN stands for Computerised Language ANalysis. Start by downloading the installation files from the CHILDES project web page, and install the CLAN software on your computer. When you open the software, you will see an empty editor window and a dialog box titled "commands". We don't need the command dialog for the moment, so you can simply close it. We will first get to know how we can use CLAN as an editor for making transcriptions. More functions of the software will follow in the next part of the course. In principle you don't actually need CLAN to transcribe, since the transcription files are simple text files with a .cha file ending, and you can open and edit them with any text editor that you're comfortable with. However, the CLAN editing function was specifically designed for transcribing speech or video, and it offers some very helpful features for this task. It can help you link transcripts with audio or video files and manage several tiers or layers of annotation, and it will check for you whether the file conforms to the CHAT transcription standard.

Let's start by looking at an existing transcript. Download the Kerstin subcorpus from the /Germanic/German section of the database. Unzip the corpus and open the file ke010322.cha with the CLAN editor. With this example, we'll start to get familiar with the CHAT transcription format and the structure of a transcript file.

4 CHAT format

CHAT is an acronym for Codes for the Human Analysis of Transcripts. The CHAT format specifies a set of rules for transcription and specifies which additional (meta-)information should be provided with a transcript file.

4.1 CHAT files

The content of each CHAT file is placed between two markers for the beginning and the end of the file:

    @Begin
    ...
    @End

Note that the markers are spelt with an initial capital letter and that there is no dot (.) after either of them. The two markers are required for every CHAT file; leaving them out will result in an error message when you use the syntax checking function.

4.1.1 Header

Apart from the actual transcript, each file contains a header with metainformation about the transcript file. Here you find, for instance, the date when the recording was made, the language(s) of conversation, which persons were recorded, the name of a linked audio or video file, comments about the recording situation, and more. All header lines of a file start with an @ symbol. Let's look at the header of our Kerstin example:

    @Languages:	deu
    @Participants:	CHI Kerstin Target_Child, MUT Mutter Mother, MAX Max Observer, VAT Vater Father
    @ID:	deu|kerstin|CHI|1;3.22||||Target_Child||
    @ID:	deu|kerstin|MUT|||||Mother||
    @ID:	deu|kerstin|MAX|||||Observer||
    @ID:	deu|kerstin|VAT|||||Father||
    @Date:	25-SEP-1971
    @Comment:	KER und MUT spielen auf dem Boden mit Spielzeug; Beobachter MAX und VAT sind anwesend

(The @Comment line reads: "KER and MUT are playing on the floor with toys; observers MAX and VAT are present.")

Each line contains a header variable and a specification of its value or a list of values. A CHAT file must always include at least the @Languages, @Participants, and @ID variables; all other header entries are optional. Section 5 of the CHAT manual provides an overview of all available header entry types. Note again that the header variables are spelt with a capital letter and that header entry lines never end with a dot. A header entry line contains the entry name, a colon, a tab (don't use spaces) and the respective value.

In the example above, the @Languages entry specifies that this file contains a transcript of a conversation in German. The available language codes are listed under section 5.2 of the CHAT manual.

The @Participants entry lists all individuals participating in the transcribed communication. Each speaker is assigned a three-letter speaker ID, based either on the first name (e.g. MAX) or on the role the person plays in the conversational setting, for instance CHI for the (target) child. The list of participants contains an entry for each person specifying speaker ID, name and role, formatted as in the example above. The speaker IDs will be used throughout the transcript to identify the speaker of every single utterance. Speaker roles are predefined; check part 5.2 of the manual for a list.

Once the set of speakers has been specified in the @Participants header, an @ID header is required for each participant, providing additional information about the speaker. The data is entered in a tabular format, with the pieces of information separated by a pipe (|) character. You can leave some of the fields empty, as in the Kerstin example above.

    @ID:	language|corpus|code|age|sex|group|SES|role|education|

The CLAN syntax checker will create @ID entries for each of the participants listed under @Participants if you don't specify them yourself.

The other header entries in the example are optional. One particularly useful optional header is @Media. In an @Media header you specify an audio or video file that is to be linked with the transcript. The header contains the media file name without the file extension, and a specification of the media type (sound or video). Say we had an audio file of the recording for our ke010322.cha example called ke010322.wav (MP3 works too), and we had placed this file in the same folder as the transcript file. We would then specify the following @Media header entry:

    @Media:	ke010322 sound

We will come back later to why this link between audio file and transcript is useful. For the complete reference on header information, look at section 5 of the CHAT manual. Next, we will look at the transcription proper.
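The header layout described above (entry name, colon, tab, value; @ID fields separated by pipes) can be illustrated with a short Python sketch. This is only an illustration, not part of CLAN: the function names and the ID_FIELDS list are made up for this example, and the field order simply follows the @ID template given in the text.

```python
# Field names in the order given by the @ID template in the text.
# The list and both function names are hypothetical helpers for illustration.
ID_FIELDS = ["language", "corpus", "code", "age", "sex",
             "group", "SES", "role", "education"]


def parse_header_line(line):
    """Split a header line into (name, value).

    Markers without a value, such as @Begin and @End, yield (name, None).
    """
    if not line.startswith("@"):
        raise ValueError("header lines must start with '@'")
    if ":" not in line:
        return line[1:].strip(), None
    name, value = line[1:].split(":", 1)
    if not value.startswith("\t"):
        raise ValueError("expected a tab (not spaces) after the colon")
    return name, value.strip()


def parse_id(value):
    """Turn the pipe-separated value of an @ID header into a dictionary."""
    parts = value.split("|")
    # Pad short entries so that every field name gets a (possibly empty) value.
    parts += [""] * (len(ID_FIELDS) - len(parts))
    return dict(zip(ID_FIELDS, parts))


name, value = parse_header_line("@ID:\tdeu|kerstin|CHI|1;3.22||||Target_Child||")
child = parse_id(value)
print(name, child["code"], child["age"], child["role"])
```

Empty fields come back as empty strings, which mirrors the fact that CHAT allows you to leave @ID fields blank. For real work, rely on the CHECK function in CLAN rather than on a sketch like this one.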

4.1.2 Tiers

If you look at the transcription part of the CHAT file, you will notice two kinds of lines: those starting with an asterisk (*) and those starting with a percentage sign (%). The actual spoken material goes on the main lines, which start with an asterisk, followed by the speaker ID, a colon, a tab (again, don't use spaces), the text of the utterance and an utterance delimiter.

4.1.3 Exercise: A sample CHAT transcription

1. Open the sample audio file frog sample.wav.
2. Listen to it: how many speakers are there?
3. Open CLAN, start with an empty file and create all necessary headers for our new sample transcript. You can leave out the @ID headers for the moment.
4. Run the check function to check for errors (press Esc, then L; or select "Check opened file" from the Mode menu).
5. Add an @Media header.
6. If you press F5 (or click on "Transcribe sound or movie" on the Mode menu), CLAN should start playing the audio file.
7. Since very few people can type as fast as the conversation is going on, we'll break down the sound file into chunks, or utterances. A rough approximation will do for the moment; we will refine the utterance breaks later. CLAN helps us chunk up the audio file. To do so, start playback of the audio file with F5, and press the space bar each time you think an utterance has ended. This will create so-called bullets, chunks of audio that are linked to a line of the transcript.
8. Now that you have created chunks, place the cursor on any one line and press F4 (Mode > Play bullet media).
9. Transcribe the short audio file chunk by chunk.
10. Check the file again and marvel at the error messages for a moment.

What matters in this exercise is that you figure out what the required header(s) should look like and that you get started using bullets for your transcription.
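To make the layout of a main line concrete, here is a small Python sketch that tests a line against the description in 4.1.2. It is a rough illustration and much less thorough than the CHECK function in CLAN. It rests on several assumptions: speaker IDs are taken to be exactly three capital letters, only the three basic delimiters (. ? !) are accepted, and bullets after the delimiter are not handled. The names MAIN_LINE, BASIC_DELIMITERS and check_main_line are invented for this example.

```python
import re

# '*', a three-letter speaker ID, a colon, a tab, then the utterance text.
# Assumption: speaker IDs consist of exactly three capital letters.
MAIN_LINE = re.compile(r"^\*([A-Z]{3}):\t(.+)$")

# Assumption: only the three basic utterance delimiters are accepted here.
BASIC_DELIMITERS = (".", "?", "!")


def check_main_line(line):
    """Return a list of problems found on a main line (empty if none)."""
    match = MAIN_LINE.match(line)
    if match is None:
        return ["expected '*', a three-letter speaker ID, a colon and a tab"]
    text = match.group(2).rstrip()
    if not text.endswith(BASIC_DELIMITERS):
        return ["utterance delimiter expected"]
    return []


for line in ["*CHI:\twau@o .", "*CHI: wau@o .", "*MUT:\twas ist das"]:
    print(repr(line), check_main_line(line))
```

The second sample line uses a space instead of a tab after the colon, and the third lacks a delimiter, so both are reported as problems.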

4.2 Bullets

Let's now have a closer look at the bullets. There is a command in the Mode menu called "Expand bullets" (Esc, A). If you execute it, each bullet is expanded to an audio index enclosed between two bullet points. The audio index is given in milliseconds, and you can change the start and end of any audio chunk simply by editing the index. Play with it a little, and you will soon get a feeling for how far you have to shift the borders of a chunk if necessary. However, make sure that adjacent indices don't overlap. By altering the audio indices you can easily split or merge linked audio chunks when you refine your separation into utterances during the transcription. If you have made more drastic changes to your initial bulleting, you can also re-bullet after transcribing. In order to do so, place the cursor in the line before the spot where you made changes and press F5 again. The current utterance will be highlighted and the sound starts playing. Each time you press the space bar now, CLAN will add a new audio index/bullet and jump to the next utterance.

5 Transcription principles

Your first exercise transcription might have yielded a couple of error messages from the CHECK function on the actual transcription lines. Now we will deal with some basic principles of transcription in the CHAT format.

5.1 Pitfalls of standard orthography

As Brian MacWhinney points out in section 3.2.1 of the CHAT manual, transcribers must be aware that coding spoken language in a written form always involves some kind of interpretation process. Especially with child language, assuming that the child possesses the same lexical or grammatical knowledge as an adult (e.g. the transcriber) might lead to misrepresentations of the child's speech. In order to cope with this problem, CHAT allows transcriptions to be linked with the audio recordings they are based on. Also, incomprehensible or doubtful material can be specially marked and annotated in the transcript.
It is in principle possible to transcribe every line phonetically as well, on a dependent tier marked %pho:. However, not every research question requires this kind of information about utterances, and given the huge amount of work an IPA transcription would require, annotating only the debatable cases inline might do the job. As a further tool to do justice to the special status of spoken language, the CHAT transcription conventions come with a system to mark special forms, e.g. words invented by the child, or dialect forms.

The manual also warns against using standard capitalisation rules. Transcriptions are all lowercase, the only exception being proper names. Do not use capital letters, not even at the beginning of an utterance (see footnote 1).

Finally, CHAT has its own punctuation rules. Every main line utterance needs to be terminated, otherwise the checking function will complain: "Utterance delimiter expected". The most frequently used utterance delimiters are the full stop (.), the question mark (?) and the exclamation mark (!), which are used in (very) roughly the same way as in written language. Before we learn which other options exist to delimit utterances, we have to deal with an important question: What is an utterance?

5.2 Utterances

Section 7.1 of the CHAT manual deals with the question of how to identify utterances in a case where a child is producing many repetitions of a single word. If you read the section, it should become clear that chunking speech already requires some assumptions about the form of utterances. It is important to make these assumptions as explicit as possible, depending on one's research goal. The task still remains difficult, and the CHAT manual's recommendations are mere heuristics, based on prosodic, syntactic and semantic indicators. Once utterances become syntactically and semantically more complex, the different levels of linguistic representation might not align as nicely anymore. Generally, long pauses and falling intonation are useful markers, and if they coincide with syntactically and semantically cohesive units, you have a good indication of an utterance boundary.

5.2.1 Utterance terminators

Each utterance is represented on a single main line. The three basic utterance terminators are described in section 7.3 of the CHAT manual. Apart from that, there are special provisions for cases where speakers get interrupted (see section 7.7 of the manual).
In the case of an interruption it is necessary to mark both the end of the interrupted utterance and the beginning of the continuation, if there is one. Such a continuation is marked with one of the linkers described in section 7.8 of the manual.

(1) The Kerstin example above does not conform to this rule. The reason might be that the transcript is in German, and the manual doesn't provide explicit rules for transcriptions in a language with systematic capitalisation of nouns. I'd still follow the manual's suggestion to keep everything in lower case, since capitalising only proper names makes it possible to search for or exclude these special parts of the vocabulary.

5.2.2 Scoped symbols

So far we have used markers for special points in the transcription, like the end of an utterance. There are, however, cases in which we might want to mark an entire stretch, perhaps a number of words, in a particular way. A good example is the overlap marker, used when we want to show that two speakers are speaking at the same time. Look at the example in section 8.4 of the manual. SAR's first utterance overlaps with the "stop doing that" part of her mother's preceding utterance. The markers [>] and [<] come after the stretch of words they have scope over. Text marked this way is enclosed in angle brackets (<...>).

Scoped marking of text is particularly relevant for the coding of planning errors or repetitions of words. There are six different symbols to mark so-called retracings (section 8.4). Notice the difference between coding a repetition multiple times and coding a multiple repetition. This difference will later affect the calculation of measures like MLU, and only with the first option do you have the choice of later including the repeated words in the calculation. Again, notice that the repeated, reformulated or retraced material is enclosed in angle brackets.

Marking an error on the main line is also done with a scoped symbol: [* error code]. Section 8.5 lists a number of error codes that have been used for some of the English language data in the CHILDES corpus. There is no fixed set of error codes, however, and which errors you want to mark again depends on your research question. The error symbol immediately follows the site of the error.

Other uses include comments on the main line ([% comment text]), explanations ([= explanation]) and replacements of non-standard or contracted forms, for instance wie gehts [: geht es] dir. For some measures you might wanna [: want to] calculate later, or if you want to create an annotation tier, the contracted string is replaced by the material in brackets.
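The effect of the replacement symbol can be sketched in a few lines of Python. This is not how CLAN implements it; it is a simplified illustration that assumes the replacement applies to the single word directly before the bracket and does not handle scope over a stretch in angle brackets. The names REPLACEMENT, target_form and surface_form are invented for the example.

```python
import re

# One word, a space, then '[: replacement]'.
# Assumption: the symbol has scope over the single preceding word only.
REPLACEMENT = re.compile(r"(\S+) \[: ([^\]]+)\]")


def target_form(utterance):
    """Keep the material in brackets and drop the form that was spoken."""
    return REPLACEMENT.sub(r"\2", utterance)


def surface_form(utterance):
    """Keep the form that was spoken and drop the bracketed replacement."""
    return REPLACEMENT.sub(r"\1", utterance)


utterance = "wie gehts [: geht es] dir"
print(target_form(utterance))   # wie geht es dir
print(surface_form(utterance))  # wie gehts dir
```

Keeping both readings available is the point of the notation: the transcript records what was actually said, while a count of words can work on the replaced form.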
Most importantly, words or stretches of text marked with a comment or an explanation can be searched for later.

5.3 Words

Finally, here are some remarks about the word forms used in transcripts. Remember that the CHAT manual specifies that only proper names be capitalised. For special forms, there are a couple of codes that are attached to the transcript of a word in the form wordform@code. Take a look at the Kerstin example again. You will see a couple of word forms like wau@o or ticktack@c. The code @o marks an onomatopoeic word form, and forms marked with @c are considered child-invented. Table 2 in section 6.3 of the CHAT manual provides an overview of the special form markers that have been specified in the standard.

Material that is incomprehensible is marked xx (one word) or xxx (entire utterance). You can always add an explanation marker (see scoped symbols, section 8.3 of the manual) to indicate your best guess ([?]) at what was said. If you don't comprehend a word but want to give a phonological approximation, you can either provide a phonological transcription on a separate %pho: tier, or transcribe phonological fragments preceded by &:

    *MAR:	&t &t &k can't you go?

Incomplete word forms can be completed by adding the omitted material in parentheses: haste [: hast du] mal (ei)ne mark. The search and calculation tools will later ignore the parentheses and treat the form as if it were complete.

When you have sufficient grounds to assume that a particular word has been omitted, you can include it in the transcript, preceded by a zero sign:

    *EVE:	I want 0to go.

However, the manual warns that marking an omission like this involves a lot of interpretation and should be done carefully and consistently.

5.4 Exercise: Using CHECK to find formal errors

After this much input it's time for another exercise. Open the file ke030406 sample.cha and look for errors. A good way to start is to use the CHECK function of CLAN. Once you have taken care of all the formal issues, the CHECK function should say "Success! No more errors found." Now look over the transcript again: do you find other errors that the CHECK function couldn't find?

5.5 Final words and outlook

What you should take home from this tutorial is that transcribing always involves making several decisions about the text. The CHAT standard sets guidelines and provides options, but there are often multiple ways to express the same thing.
Sometimes it depends on your interpretation whether, for instance, a repetition of material should be treated as retracing or as reformulation. These decisions also depend strongly on the focus of your research. Finally, this brief tutorial by no means covers all the coding options available in the CHAT standard. The manual will serve you as a valuable resource, especially while you're carrying out your first transcriptions.

Transcribing is a laborious process: according to some estimates, it takes about 10-15 minutes to transcribe one minute of audio recording. Even if it takes you longer in the beginning, rest assured that with practice you will get better.