CHILDES English Lara Corpus

Caroline Rowland
Max-Planck Institute, Nijmegen
crowland@liverpool.ac.uk

Participants:	1
Type of Study:	naturalistic
Location:	England
Media type:	audio
DOI:	doi:10.21415/T5R88P

Browsable transcripts

Download transcripts

Citation information

Jones, G., & Rowland, C. F. (2017). Diversity not quantity in caregiver speech: Using computational modeling to isolate the effects of the quantity and the diversity of the input on vocabulary growth. Cognitive Psychology, 98, 1-21. doi:10.1016/j.cogpsych.2017.07.002.

Rowland, C. F. (2007). Explaining errors in children’s questions. Cognition, 104(1), 106-134. doi:10.1016/j.cognition.2006.05.011.

Rowland, C. F. & Fletcher, S. L. (2006). The effect of sampling on estimates of lexical specificity and error rates. Journal of Child Language, 33, 859-877.

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

The Lara corpus consists of 120 hours of audio-recorded speech from one child interacting with her caregivers between the ages of 1;9.13 and 3;3.25, and a written diary record of the child’s wh-questions produced between 2;7.21 and 3;3.30. The project had three aims:

To provide a unique longitudinal corpus of sampled and diary data from one child.
To assess the impact of sampling constraints on descriptions of child data and on constructivist theories of acquisition.
To investigate the extent to which children make errors in their early language acquisition, in particular in their wh-question acquisition, and the implications of these errors for theories of language acquisition.

Biographical details

The child (Lara) was the first-born monolingual English daughter of two white university graduates. She was born on 16 May 1994 and brought up in Nottinghamshire, England. She was an only child until the birth of her sister halfway through the recording sessions (at age 2;4). Neither her parents nor her grandparents were from the local area: her mother and maternal grandparents (who looked after her two days a week) were from the south east of the UK (with south east regional dialects), her father was from the West Midlands (with a local dialect), and her paternal grandmother (who looked after her one day a week) was from the North East (with a strong north east accent and many dialectal items in her vocabulary). Lara attended a nursery for two full days a week and, from 2;6, a playgroup for up to two mornings a week. She developed a local (north Nottinghamshire) accent, but did not use many regional dialectal terms. Consent was gained from Lara’s parents and all other caregivers who were recorded. The name Lara is a pseudonym.

Transcripts

Lara was recorded in her home in conversation with her caregivers once or twice a week over a period of approximately 18 months. During recording, Lara engaged in everyday play activities with her regular caregivers (mother, father, grandmothers and grandfather). No researcher was present. The aim was to record at least one hour of data per week, and more if time allowed, so the amount of data varies considerably from week to week. Intensive data collection was not possible at 2;4 and 2;5 because the birth of the child’s younger sister meant that caregivers were unable to devote as much time to recording. Data collection was intensified after 2;6 in order to capture in greater detail the child’s acquisition of more complex structures such as wh-questions. In total, nearly 49,000 child utterances and over 97,000 caregiver utterances were transcribed.

Situational description

Most of the audio recordings took place downstairs in the dining room or living room of the child’s home. Occasional recordings took place in the bathroom or bedroom. Unusual situational details are given in each transcript in @Comment tiers.

Transcription

Each transcript filename indicates the age of the child in years, months and days, followed by the duration of the recording in minutes (e.g. Lara.1-09-13.45.cha indicates that Lara was 1 year, 9 months and 13 days old at the time of recording and that 45 minutes of conversation were recorded).

The data were orthographically transcribed in the CHILDES system (MacWhinney, 2000), following transcription conventions similar to those used for the Manchester corpus (Theakston, Lieven, Pine & Rowland, 2001) and detailed in the readme.doc for that corpus. The transcriber was trained to recognize common types of grammatical errors made by young children and to mark them with error codes. All transcripts were then checked for accuracy against the audio recordings by the principal investigator.

A morphological coding line was added to all transcripts using the MOR and POST programs provided with the CHILDES system. These programs code words into grammatical categories and add a dependent MOR tier with a syntactic parse. In all transcripts, the MOR line was checked for accuracy by hand and a number of coding errors were corrected. However, the accuracy of all coding cannot be guaranteed.

Error codes were used whenever the child produced an ungrammatical utterance, using the [*] sign on the main line. The error was then described on a %err: dependent tier. For example:
*CHI: threwed [*] it .
%err: threwed = threw .
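Returning to the filename convention given at the start of this section, the following is a minimal sketch of how the age and duration might be recovered from a transcript filename. It assumes that every file follows the single example shown (Lara.1-09-13.45.cha); the pattern and the names RecordingInfo, parse_transcript_filename and childes_age are illustrative and are not part of the corpus or of the CHILDES tools.

```python
import re
from typing import NamedTuple


class RecordingInfo(NamedTuple):
    years: int
    months: int
    days: int
    minutes: int


# Assumed pattern, inferred only from the example "Lara.1-09-13.45.cha":
# Lara.<years>-<months>-<days>.<minutes>.cha
FILENAME_RE = re.compile(r"^Lara\.(\d+)-(\d{2})-(\d{2})\.(\d+)\.cha$")


def parse_transcript_filename(name: str) -> RecordingInfo:
    """Extract the child's age and the recording length from a filename."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected transcript filename: {name!r}")
    years, months, days, minutes = (int(group) for group in match.groups())
    return RecordingInfo(years, months, days, minutes)


def childes_age(info: RecordingInfo) -> str:
    """Format the age in the years;months.days notation used in this document."""
    return f"{info.years};{info.months}.{info.days}"


info = parse_transcript_filename("Lara.1-09-13.45.cha")
print(childes_age(info), info.minutes)
```

Filenames that do not match the assumed pattern raise a ValueError rather than being silently skipped, so any deviation from the convention is noticed early.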

Omitted words and bound morphemes were transcribed, with a 0 placed before the omitted word or morpheme to indicate that it was not produced. For example:
*CHI: I 0have [*] got
%err: 0have=have
*CHI: you are make0ing a cake
%err: make0ing=making
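The %err: lines shown above pair the form the child produced with its target. As a rough sketch, assuming each %err: line holds exactly one "produced = target" pair and may end in a full stop (as in the examples here), the pairs and the omitted material could be extracted as follows. The function names parse_err_tier and omitted_parts are hypothetical and do not belong to the CHILDES programs.

```python
import re
from typing import List, Tuple


def parse_err_tier(line: str) -> Tuple[str, str]:
    """Split a %err: tier line into (produced form, target form)."""
    prefix = "%err:"
    if not line.startswith(prefix):
        raise ValueError(f"not a %err: tier: {line!r}")
    body = line[len(prefix):].strip()
    # Drop a trailing utterance terminator such as " ." if present.
    body = re.sub(r"\s*\.$", "", body)
    produced, separator, target = body.partition("=")
    if not separator:
        raise ValueError(f"no '=' found in %err: tier: {line!r}")
    return produced.strip(), target.strip()


def omitted_parts(produced: str) -> List[str]:
    """Return the words or morphemes marked as omitted with a leading 0."""
    return re.findall(r"0([A-Za-z]+)", produced)


for tier in ["%err: threwed = threw .", "%err: 0have=have", "%err: make0ing=making"]:
    produced, target = parse_err_tier(tier)
    print(produced, "->", target, "| omitted:", omitted_parts(produced))
```

The sketch accepts spaces around the = sign or none, since both spellings occur in the examples above.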
Postcodes were used on the main line in the following circumstances:
[+ SR] self-repetition of one of the previous 5 utterances, as long as the utterance was not more than 10 seconds removed in time. Only the target child’s utterances were coded for self-repetition.
[+ I] imitation of one of the previous 5 utterances, as long as the utterance was not more than 10 seconds removed in time. Only the target child’s utterances were coded for imitation.
[+ PI] partially intelligible utterance. All utterances were coded.
[+ IN] incomplete utterance. All utterances were coded.
[+ R] routines, including counting, songs and nursery rhymes. All utterances were coded.
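Analyses often need to set aside utterances carrying some of these postcodes (for example, routines or imitations). The sketch below shows one way to separate the postcodes listed above from the rest of a main line. The sample utterance in the usage lines is invented for illustration and is not taken from the corpus, and the names POSTCODES and split_postcodes are hypothetical.

```python
import re
from typing import List, Tuple

# The five postcodes described in this section.
POSTCODES = {
    "SR": "self-repetition",
    "I": "imitation",
    "PI": "partially intelligible",
    "IN": "incomplete",
    "R": "routine",
}

# Matches e.g. "[+ SR]"; the closing bracket keeps "I" from matching inside "IN".
POSTCODE_RE = re.compile(r"\[\+ (SR|PI|IN|I|R)\]")


def split_postcodes(main_line: str) -> Tuple[str, List[str]]:
    """Return the main line without postcodes, plus the postcodes found."""
    codes = POSTCODE_RE.findall(main_line)
    text = POSTCODE_RE.sub("", main_line)
    # Collapse runs of whitespace (including tabs) left behind by the removal.
    text = " ".join(text.split())
    return text, codes


# Invented example line, for illustration only.
text, codes = split_postcodes("*CHI: one two three . [+ R]")
print(text, codes, [POSTCODES[code] for code in codes])
```

Note that the whitespace normalisation also replaces any tab after the speaker code with a single space, which may or may not be wanted depending on the downstream tool.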

Diary data

The diary began when caregivers informally reported that Lara was starting to produce a variety of wh-questions with different auxiliaries and ended when approximately 90% of her wh-questions were correct. The diary was filled in by caregivers, who were provided with notebooks to record all wh-questions produced both inside and outside the home. The diary-keepers were trained to record the exact speech of the child (e.g., to omit auxiliaries when not pronounced, to indicate contractions), to recognize common error types, and to recognize the different types of wh-question. As no notes were made when the child was at nursery, it is estimated that the diary contains approximately 80% of the wh-questions that were produced by Lara during this period. The data from the diary were then re-transcribed into CHILDES format on a computer. The transcription conventions were identical to those used for the audio-tape transcripts.

Acknowledgements

The project was funded by the Economic and Social Research Council, Grant No. RES000220241.