Leo Corpus

Heike Behrens
Department of Linguistics
University of Basel


Participants: 1
Type of Study: case study
Location: Leipzig, Germany
Media type: password to audio
DOI: doi:10.21415/T5N01B

Behrens, H. (2006). The input-output relationship in first language acquisition. Language and Cognitive Processes, 21, 2-24.

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

The Leo-corpus was collected in Leipzig, Germany by the Max-Planck-Institute for Evolutionary Anthropology. Heike Behrens was in charge of the coordination of the recordings, the transcription guidelines, and the procedures for establishing cohesion among transcribers, as well as format updates due to changes in the CHAT-conventions. Solveig Kühnert assisted in taking the recordings and acted as the expert to disambiguate unclear passages. Solveig Kühnert, Jana Jurkat, Susanne Mauritz, Antje Paulsen, Romy Elrich and Yvonne Daiber transcribed the data.


Leo (CHI), a monolingual German boy, grew up in Leipzig, Germany. Both parents have a higher education. His father Thorsten (FAT) is an academic, his mother Karen (MOT) a bookseller who worked part-time during the investigation period. They speak dialect-free, clearly articulated standard High German. At age 2;10, Leo started to go to Kindergarten. When Leo was 3;3, his baby sister Wilhelmine (WIL) was born. During the investigation period, the mother was the primary caretaker of the child, and was paid as a research assistant for taking the diary notes and making the audio recordings. Between age 2 and 3, a research assistant (MEC) from the MPI came at least once a week to help with babysitting and allow the mother some time off. She took part in the recording sessions and sometimes also did the recordings on her own. Once, sometimes twice a week, the father did the recordings when the mother was working part-time in a bookshop. Thus, the recordings between 2;0 and 3;0 depict Leo in interaction with both his parents and our research assistant, who became a friend of the family and spent a considerable amount of time with Leo and the family.

Filenames and Metadata

All filenames start with "le" for LEO, followed by 6 digits representing his age in YYMMDD. Thus, le020314.cha stands for Leo at age 2;3.14. In the speaker-ID-tiers, the ages for Leo and his sister are computed on a daily basis, whereas the ages for the adults stay the same and give their average age during the recording period (i.e. 30 for the mother, 35 for the father). SES is indicated by the highest degree earned in the German system (e.g., university = university degree, Abitur+Lehre = high school diploma and vocational training).
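
The naming scheme can be decoded mechanically. The following is a minimal sketch in Python, assuming every transcript follows the "le" + YYMMDD pattern with the .cha extension used for the files named in this document; the function names are hypothetical and not part of any TalkBank tool.

```python
import re

# Assumed pattern: "le" + YYMMDD + ".cha", as described in the text above.
FILENAME_RE = re.compile(r"^le(\d{2})(\d{2})(\d{2})\.cha$")


def age_from_filename(filename):
    """Return (years, months, days) encoded in a Leo corpus filename."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a Leo corpus filename: {filename!r}")
    years, months, days = (int(part) for part in match.groups())
    return years, months, days


def format_age(years, months, days):
    """Format an age the way this document writes it, e.g. 2;3.14 or 2;0.00."""
    return f"{years};{months}.{days:02d}"


print(format_age(*age_from_filename("le020314.cha")))
```

Sorting the filenames alphabetically also sorts them by age, because every age component is zero-padded to two digits.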


Two weeks before his second birthday, Leo’s parents completed a vocabulary checklist modelled after the MacArthur CDI for English (Fenson, Dale, Reznick, Thal, Bates, Hartung, Pethick & Reilly, 1993) since a German CDI did not exist at the time. These data are compiled in the file le011114.cha. The @Comment-tiers indicate the categories from the CDI.

The boy’s language development was recorded from age 1;11.13, the onset of multiword speech, up to age 4;11. Between the ages of 1;11.13 and 3;0, daily parental diaries were kept to note the 10-30 most innovative and complex utterances of the child. Diary notes were spoken into a small dictaphone at the time and place of the action to avoid misrepresentation by having to memorize them. The caretakers typed up the utterances plus contextual information in CHAT-format in the evening. In the transcripts, all diary notes have the code [- diary].
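
Because diary notes and recorded speech share the same transcripts, analyses may need to separate them. The sketch below assumes that the [- diary] code appears on the main speaker line (a line starting with "*") of each diary utterance and that utterances are not wrapped onto continuation lines; both are simplifying assumptions, and the sample lines are invented for illustration only.

```python
DIARY_CODE = "[- diary]"


def split_diary_utterances(lines):
    """Separate main-tier lines into diary notes and recorded speech."""
    diary, recorded = [], []
    for line in lines:
        if not line.startswith("*"):
            continue  # skip headers (@...) and dependent tiers (%...)
        if DIARY_CODE in line:
            diary.append(line)
        else:
            recorded.append(line)
    return diary, recorded


# Hypothetical sample lines, not taken from the corpus.
sample = [
    "@Comment:\texample header",
    "*CHI:\t[- diary] Mama Auto .",
    "*MOT:\tja , das ist ein Auto .",
    "%exp:\texample dependent tier",
]
diary, recorded = split_diary_utterances(sample)
print(len(diary), len(recorded))
```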

Between 1;11.12 and 1;11.29, several test recordings of varying length were made to test the equipment and the procedure. The main study started at 2;0. Between 2;0.00 and 2;11.29 the daily diary notes were supplemented by five 60-minute recordings each week. Once a week, the session was also video-taped. Between age 3;0 and 4;11, there were five audio recordings per week every 4th week.

After 2;0.00 all recordings are 60 minutes long, with very few exceptions due to the child not feeling well. Often, the caregivers split up the session into two segments, e.g., taping half an hour in the morning and the other half in the afternoon, because keeping a conversation or play session going for 60 minutes proved to be quite exhausting for child and parents, or because other activities or the demands of other family members intervened. Between 2;6.00 and 2;6.11 the recording equipment malfunctioned, which was only noticed when we started to transcribe the tapes. Therefore, only diary data are available for this period, with the exception of le020608.cha, which was transcribed from the video.

The sessions were recorded with a Sony Minidisc recorder MZ-R35 using two wireless and portable Shure BG4.1 Unidirectional Condenser Microphones, and a Shure ETPD-NB Marcad Diversity Receiver. All recordings took place in the family home or, when the family was on holiday, in a hotel. Since the microphones were wireless, they could be placed wherever the family wanted; the only request was to avoid background music and the vicinity of washing machines, blenders and other noisy gadgets. With this setup, the family had full control over the situations they wanted to tape and was also given the right to withhold tapes they considered too private. They never made use of this possibility but delivered a 60-minute recording every day.

Leo’s language development

The parental CDI and diary notes allow us to exactly determine the state of Leo’s language development at the onset of our study. He produced his first word combination at 1;11.13 with an active vocabulary of about 340 word forms. He also produced his first morphological contrast (a singular-plural distinction) in the same week. This means that the recordings document his language development from the very onset of multiword speech. Compared to the data of English and Italian children presented in Bates & Goodman (1999), Leo is a late talker and his vocabulary is relatively large before grammatical development sets in. But Leo turned out to be a very quick learner, acquiring sophisticated vocabulary and morphosyntax rather rapidly.

Transcription guidelines

Each recording was digitized and transcribed in SONIC-Chat (cf. MacWhinney, 2000) with transcription guidelines developed for German by Heike Behrens. The data were transcribed by a total of 6 research assistants at the Max-Planck-Institute for Evolutionary Anthropology at Leipzig under the supervision of Heike Behrens. Initially, several team sessions were held to discuss and refine the transcription guidelines, and to check for transcriber reliability by transcribing the same passages, comparing, discussing and resolving differences in the transcription until a high degree of reliability was reached. The major problems concerned the handling of disfluencies (see below) and the agreement on the end of an utterance because the parents would often produce long turns without prosodic closure of clauses or sentences. Preference was given to treat turns without a noticeable pause between sentence units as one utterance and delimit clauses by commas. If a transcriber felt the need to double check a transcription, they inserted a special character and these utterances were discussed later on. As a rule of thumb, the transcribers were instructed not to listen to an utterance more than 3-5 times to avoid pure over-interpretation. The %exp-tier was used to give additional context information or other hints to interpret the situation.

In order to facilitate data retrieval and coding, the word stems were transcribed in standard orthography to avoid having the same word in more than one orthographic form. The CHAT conventions were used to be as faithful to reductions or alternative pronunciations of the word stem as possible.

@o was used to code not only onomatopoeics, but also interjections and other discourse markers, as well as forms whose meaning could not be inferred and which therefore did not qualify as a child- or family-form. As a result, searches that exclude @o forms return standard vocabulary.

@c was used for child forms, which include short forms made up by the child as well as names he invented for places, people or animals. They are marked with @c when there is a risk of confusion with existing words, e.g.: Einfach@c 'simple' and Doppel@c 'double' are regularly used as shorthand for a regular ('simple') bus and a double-decker bus.

Other unique word forms which occur frequently are not marked:
Tschutschu 'train'
Eichi 'squirrel' ( = stuffed animal)
echen / achen = dummy nonce words that Leo uses in all kinds of circumstances, e.g. if he cannot or does not want to answer a question.

@f was used for family forms such as nicknames for the children, as well as made-up adjectives and manner words used as forms of word play.

@d was used for dialect words such as Luelle@d 'saliva' and luellen@d 'drool', 'slaver'

@t was used for test words such as glorpen@t, tammen@t, dotzen@t, seiken@t, Bral@t, Muhne@t
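
The special-form markers listed above can be used to filter word lists. The following is a minimal sketch, assuming tokens have already been extracted from the main tier and that a marker is whatever lowercase suffix follows the last "@"; the function names are hypothetical, and the marker descriptions simply paraphrase this document.

```python
import re

# Marker meanings as described in this document.
MARKERS = {
    "o": "onomatopoeia, interjection, discourse marker or uninterpretable form",
    "c": "child form",
    "f": "family form",
    "d": "dialect word",
    "t": "test word",
}

MARKER_RE = re.compile(r"^(.+)@([a-z]+)$")


def split_marker(token):
    """Return (form, marker); marker is None for unmarked tokens."""
    match = MARKER_RE.match(token)
    if match is None:
        return token, None
    return match.group(1), match.group(2)


def exclude_markers(tokens, excluded=("o",)):
    """Drop tokens whose marker is in `excluded`, keeping all others."""
    return [tok for tok in tokens if split_marker(tok)[1] not in excluded]


tokens = ["Luelle@d", "Tschutschu", "Einfach@c", "ach+du+lieber+Gott@o"]
print(exclude_markers(tokens))
```

Passing `excluded=tuple(MARKERS)` instead removes every marked form, which may be preferable when only standard vocabulary is wanted.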

Common compounds like "Apfelbaum" 'apple tree' were transcribed as a single word; new compounds or very long and hard-to-parse units were transcribed with "+", e.g., Baby+Giraffe, Mama+Auto, Arno+Bett, Mini+Lokomotive, Super+Zug.

"+" was also used to link names ("Tante+Ida"; "Rasender+Roland", the name of a train on the island of Ruegen), fixed phrases and interjections (ach+du+lieber+Gott@o 'oh+my+God@o'), titles of books or songs (Winnie+der+Baer, Stille+Nacht), acronyms (L+K+W), and complex numbers (neunzehnhundert+vierzehn). Consequently, the "+" sign cannot be taken as an indicator of noun-compounds, but rather serves to unite sequences of words that should be treated as one constituent in syntactic analysis. Care was taken that each combination of words is represented in just one form, but there may be variation with the same stem ("Babysachen" but "Baby+Teile").

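For lexical counts it can be useful to treat a "+"-linked sequence as one unit while still being able to inspect its parts. The sketch below is one possible way to do this, under two assumptions: any special-form marker after "@" is stripped first, and tokens that begin with "+" are left untouched because they may be CHAT utterance symbols rather than linked words. The function names are hypothetical.

```python
def plus_parts(token):
    """Return the words joined by "+" in a token, ignoring any @ marker."""
    form = token.split("@", 1)[0]
    if form.startswith("+") or "+" not in form:
        return [form]  # plain word, or a symbol that merely starts with "+"
    return form.split("+")


def is_linked_unit(token):
    """True if the token unites several words into one constituent."""
    return len(plus_parts(token)) > 1


for tok in ["Baby+Giraffe", "ach+du+lieber+Gott@o", "Apfelbaum"]:
    print(tok, plus_parts(tok), is_linked_unit(tok))
```

Since "+" also links names, titles, acronyms and numbers, `is_linked_unit` identifies multiword constituents in general, not noun compounds specifically.
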
Because Leo showed different forms of disfluencies and went through phases of onset stammering where it took him several attempts to finally produce the word or utterance he wanted, extra conventions had to be established to depict these phenomena while not inflating the lexical counts by transcribing the same element several times.

[MA] was introduced as a scoped symbol and stands for multiple attempts at producing a word or phrase. &=vocalizes indicates that a sequence of mumbling preceded the articulation of the utterance. This way of representing disfluencies was preferred over xxx because in most cases Leo succeeded in producing an intelligible utterance in the end. As a result, these utterances do not have to be discarded from analyses for containing unintelligible elements.