CHILDES German Szagun Corpus

Gisela Szagun
Institut für Kognitionsforschung
University of Oldenburg


Participants: 22
Type of Study: naturalistic
Location: Germany
Media type: audio
DOI: doi:10.21415/T5KG7T

Citation information

Szagun, G. (2001). Learning different regularities: The acquisition of noun plurals by German-speaking children. First Language, 21, 109-141.

Szagun, G. (2004). Learning by ear: On the acquisition of case and gender marking by German-speaking children with cochlear implants and with normal hearing. Journal of Child Language, 31, 1-30.

Szagun, G., Stumper, B., Sondag, N., & Franik, M. (2007). The acquisition of gender marking by young German-speaking children: Evidence for learning guided by phonological regularities. Journal of Child Language, 34, 445-471.

Szagun, G. (2010). Regular/irregular is not the whole story: The role of frequency and generalization in the acquisition of German past participle inflection. Journal of Child Language, 37, 1-32.

Szagun, G. & Stumper, B. (2012). Age or experience? The influence of age at implantation, social and linguistic environment on language development in children with cochlear implants. Journal of Speech, Language, and Hearing Research, 55, 1640-1654.

Szagun, G. & Schramm, S. A. (2016). Sources of variability in language development of children with cochlear implants: age at implantation, parental language, and early features of children’s language construction. Journal of Child Language, 43, 505-536.

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

This data set comprises two large corpora of German child language: 1) a corpus of 22 typically developing children (TD) with 212 data files; 2) a corpus of 22 deaf children with cochlear implants (CI) with 210 data files. The data were collected between 1996 and 2000. In addition to documenting language development in these two groups, this corpus is the first comprehensive data collection of child-directed adult speech in German. Each of the 422 data files is a transcript of a two-hour session. Children and adults were recorded during play in a university playroom. Researchers contributing to the project include Sonja Arnhold-Kerri, Kitty Boosman, Tanja Hampf, Elfrun Klauke, Stefanie Kraft, Dorit Pefferkorn, Dagmar Roesner, Claudia Steinbrink, Gisela Szagun, Bettina Timmermann, and Sylke Wilken.

In the following description, the two samples are treated separately.

1) Typically developing children (TD)

All 22 children were recorded at ages 1;4, 1;8, 2;1, 2;5, and 2;10. Originally, the 22 typically developing children served as a control group for the 22 children with CI (see below), and data points had to conform to those children’s timetables at their rehabilitation centre. Of the 22 children, 6 were recorded every 5 to 6 weeks: ANN, EME, FAL, LIS, RAH, and SOE. Thus, there are 16 children with 5 data points, yielding 80 two-hour recordings of spontaneous speech, and 6 children with 22 data points, yielding 132 two-hour recordings of spontaneous speech.
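The recording counts above can be checked with a few lines of arithmetic. The following is only a minimal sketch: the group labels `five_data_points` and `dense_sampling` are hypothetical names chosen for illustration, while the numbers are taken from the description.

```python
# Recording counts for the TD sample, as stated in the description.
# The group labels are hypothetical names used only in this sketch.
groups = {
    "five_data_points": {"children": 16, "data_points": 5},
    "dense_sampling": {"children": 6, "data_points": 22},
}

# One two-hour recording per child per data point.
recordings = {
    name: group["children"] * group["data_points"]
    for name, group in groups.items()
}

total_children = sum(group["children"] for group in groups.values())
total_recordings = sum(recordings.values())

print(recordings)
print(total_children, total_recordings)
```

The total agrees with the 212 TD data files mentioned in the project description.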

At data points 1;4, 1;8, 2;1, and 2;5 a minimum of around 500 utterances of parental child-directed speech were transcribed verbatim. However, there is considerable variation, and often more adult speech is transcribed verbatim. In each transcript a comment on the %com line indicates where verbatim transcription of parental utterances stops. Thereafter, parental utterances were reported when necessary to understand the context of the child's utterance, but transcription may not be verbatim. The same applies to the reporting of the investigator’s speech. At the beginning of each transcript there is a comment on the %com line stating whether this is a data point with verbatim parental speech or not. In this new update some important changes have been made. These concern:

Note on the new spelling rules: The rules allow more flexibility in spelling compound words. Spelling as one word or as two words is now often accepted. Also, spelling is closer to spoken language. Thus, clitics may now often be written as one word instead of with an apostrophe. This concerns many contractions of preposition + article, e.g. aufm (auf’m) or aufn (auf’n). Note that case marking is preserved.

Some forms we have written as they are pronounced and put the standard written form in square brackets. Examples:

The changes should make it possible to use German MOR. CHAT notations were used throughout.

Play sessions took place in a large playroom at the University of Oldenburg. Toy sets included cars with a garage and car park, a zoo with zoo animals, a farm with farm animals, a school with children and teachers, dolls and a buggy, a doll’s house, picture books, puzzles, a medical kit, a fire station with firemen, an ambulance, a wooden railway set, a large blackboard with chalk for drawing, a cooking set, a shop with goods to sell, a box with a variety of animals and small objects, and a toy car to ride. It was up to the children which toy to choose at any time. The child played with either the parent or the investigator, or both. The situation was made to resemble a situation at home as closely as possible. Initially, an investigator was present for a short time, then left for at least half an hour. The investigator usually returned with coffee, sat down with the parent, and played with the child. Times of investigator presence were adjusted to each situation and varied in length.

2) Children with cochlear implants (CI)

All of the 22 children with CI were deaf before the onset of language, 20 from birth and 2 after meningitis during the first year of life. All the children were implanted before 4 years of age. Mean age at implantation was 2;5, with an SD of 0;9 and a range of 1;2 to 3;10. The 22 CI children varied in chronological age at the beginning of the study. They were matched with the 22 TD children for initial language level (MLU and mean number of words). Each of the CI children is given a hearing age, which starts at 0 at tune-up (the first fitting of the device to the child’s comfortable level of hearing), which occurred 6 weeks after implantation. The ages at the data points here are hearing ages. All 22 children were recorded every 4½ months, at hearing ages 0;5, 0;9.14, 1;2, 1;6.14 and 1;11. This covered a period of 18 months from first tune-up. This paralleled the 18-month period of the 22 TD children and the five data points. In addition, 11 CI children were recorded more frequently, with data points in between the main ones and data points thereafter, yielding between 10 and 14 data files. The other 9 children were also recorded after 1;11, yielding between 6 and 9 data points. Altogether, there are 210 data files from children with CI. The recording sessions were between 1½ and 2 hours long.
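The ages above are written as years;months, with an optional .days part (for example 0;9.14). As a minimal sketch of how such ages can be converted to months for comparison across the two samples, the hypothetical helper `age_to_months` below assumes 30 days per month; that constant is an illustrative assumption and not part of the corpus documentation.

```python
import re

# Assumption for illustration only: a month is counted as 30 days.
DAYS_PER_MONTH = 30

# Matches ages such as "1;4" or "0;9.14" (years;months, optional .days).
AGE_PATTERN = re.compile(r"^(\d+);(\d+)(?:\.(\d+))?$")


def age_to_months(age: str) -> float:
    """Convert a years;months(.days) age string to months."""
    match = AGE_PATTERN.match(age)
    if match is None:
        raise ValueError(f"not a years;months(.days) age: {age!r}")
    years, months, days = match.groups()
    return int(years) * 12 + int(months) + int(days or 0) / DAYS_PER_MONTH


# Main data points as listed in the description.
ci_hearing_ages = ["0;5", "0;9.14", "1;2", "1;6.14", "1;11"]
td_ages = ["1;4", "1;8", "2;1", "2;5", "2;10"]

ci_months = [age_to_months(age) for age in ci_hearing_ages]
td_months = [age_to_months(age) for age in td_ages]

# Gaps between consecutive CI data points, and the span from the
# first to the last main data point in each sample.
ci_intervals = [later - earlier for earlier, later in zip(ci_months, ci_months[1:])]
ci_span = ci_months[-1] - ci_months[0]
td_span = td_months[-1] - td_months[0]

print(ci_intervals)
print(ci_span, td_span)
```

Under the 30-day assumption, the gaps between consecutive CI data points come out close to 4½ months, and both samples span 18 months between their first and last main data points.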

At data points 0;5, 0;9.14, 1;2 and 1;6.14 a minimum of around 500 utterances of parental child-directed speech were transcribed verbatim. For speech beyond this minimum, the same procedure was followed as for the TD children (see above).

Transcripts of the CI sample have not been updated to the new format yet. This means that there is no capitalization of nouns and i-dialect is still used. Clitics and forms which leave out end consonants in pronunciation are spelled with an apostrophe or just without the consonant. Examples are:

Some forms are spelled according to pronunciation. Examples:

A full list of the transcription rules we followed can be obtained from Gisela Szagun, e-mail:

An update in line with the present one for TD children is planned.

Play sessions took place in a playroom at the Cochlear Implant Center Hannover. The toy sets paralleled those in the playroom at Oldenburg University, except for the larger ones, such as the wooden railway set, the large blackboard with chalk for drawing, the shop with goods to sell, and the toy car to ride. It was up to the children which toy to choose at any time. The child played with either the parent or the investigator, or both. The investigator was present throughout the session.

Audio files are available for all children, but they are not linked to the transcripts. The audio files are labelled with the three letters of the child’s name followed by three digits for the years and months of the child’s age. Thus, sil202-2.wav is the second half of the tape for Silja at 2 years and 2 months. The files for TD children are labelled with the child’s actual age; the files for CI children are labelled with their hearing ages. The tape for CI adr00500 is missing due to technical failure. Some of the early audio files for ANN, FAL and EME are also missing, because the digital recording equipment had not yet arrived and transcription was done from analogue tapes.
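The labelling scheme above can be read off a file name with a small parser. This is only a sketch based on the single example sil202-2.wav: the split into one digit for years and two digits for months is inferred from that example, and the names `AudioName` and `parse_audio_name` are hypothetical. Names with a different digit count, such as adr00500, are deliberately not interpreted and yield `None`.

```python
import re
from typing import NamedTuple, Optional


class AudioName(NamedTuple):
    child: str            # three-letter child code, lower-cased
    years: int            # age in years (actual age for TD, hearing age for CI)
    months: int           # additional months
    part: Optional[int]   # tape part, e.g. 2 for the second half; None if absent


# Assumed layout, inferred from "sil202-2.wav":
# three letters, one digit for years, two digits for months, optional "-part".
NAME_PATTERN = re.compile(r"^([a-z]{3})(\d)(\d{2})(?:-(\d+))?\.wav$", re.IGNORECASE)


def parse_audio_name(filename: str) -> Optional[AudioName]:
    """Parse an audio file name; return None if it does not fit the assumed layout."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        return None
    child, years, months, part = match.groups()
    return AudioName(child.lower(), int(years), int(months), int(part) if part else None)


print(parse_audio_name("sil202-2.wav"))
print(parse_audio_name("adr00500.wav"))
```

Returning `None` rather than guessing keeps files with other labelling patterns visible for manual inspection.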



The research was funded by Deutsche Forschungsgemeinschaft (DFG) grants Sz 41/5-1 (1996-98) and Sz 41/5-2 (1999-2000).