CHILDES English Lara Corpus
Caroline Rowland
Max-Planck Institute, Nijmegen
crowland@liverpool.ac.uk
Participants: 1
Type of Study: naturalistic
Location: England
Media type: audio
DOI: doi:10.21415/T5R88P
Citation information
Jones, G., & Rowland, C. F. (2017). Diversity not quantity in caregiver
speech: Using computational modeling to isolate the effects of the
quantity and the diversity of the input on vocabulary growth.
Cognitive Psychology, 98, 1-21. doi:10.1016/j.cogpsych.2017.07.002.
Rowland, C. F. (2007). Explaining errors in children’s questions.
Cognition, 104(1), 106-134. doi:10.1016/j.cognition.2006.05.011.
Rowland, C. F., & Fletcher, S. L. (2006). The effect of sampling on
estimates of lexical specificity and error rates. Journal of Child
Language, 33, 859-877.
In accordance with TalkBank rules, any use of data from this corpus
must be accompanied by at least one of the above references.
Project Description
The Lara corpus consists of 120 hours of audio-recorded speech from one
child interacting with her caregivers between the ages of 1;9.13 and
3;3.25, and a written diary record of the child’s wh-questions produced
between 2;7.21 and 3;3.30. The project had three aims:
- To provide a unique longitudinal corpus of sampled and diary data
from one child.
- To assess the impact of sampling constraints on descriptions of child
data and on constructivist theories of acquisition.
- To investigate the extent to which children make errors in their
early language acquisition, in particular their wh-question
acquisition, and the implications of these errors for theories of
language acquisition.
Biographical details
The child (Lara) was the first-born monolingual English daughter of two
white university graduates, and was born 16-MAY-1994 and brought up in
Nottinghamshire, England. She was an only child until the birth of her
sister halfway through the recording sessions (at age 2;4). Neither her
parents nor her grandparents were from the local area: her mother and
maternal grandparents (who looked after her two days a week) were from
the south east of the UK (with south east regional dialects), her father
was from the West Midlands (with a local dialect), and her paternal
grandmother (who looked after her one day a week) was from the North
East (with a strong north east accent and many dialectal items in her
vocabulary). Lara attended a nursery for two full days a week and, from
2;6, a playgroup for up to two mornings a week. She developed a local
(north Nottinghamshire) accent, but did not use many regional
dialectal terms. Consent was gained from Lara’s parents and all other
caregivers who were recorded. The name Lara is a pseudonym.
Transcripts
Lara was recorded in her home in conversation with her caregivers once
or twice a week over a period of approximately 18 months. During
recording, Lara engaged in everyday play activities with her regular
caregivers (mother, father, grandmothers and grandfather). No researcher
was present.
The aim was to record at least one hour’s data per week, but to record
more if time allowed, so the amount of data varies significantly from
week to week. Intensive data collection was not possible at 2;4 and 2;5
because the birth of the child’s younger sister meant that caregivers
were unable to devote as much time to recording. Data collection was
intensified after 2;6 in order to capture in greater detail the child’s
acquisition of more complex structures such as wh-questions. In total,
nearly 49,000 child utterances and over 97,000 caregiver utterances were
transcribed.
Situational description
Most of the audio recordings took place
downstairs in the dining room or living room of the child’s home.
Occasional recordings took place in the bathroom or bedroom. Unusual
situational details are given in each transcript in @Comment tiers.
Transcription
Each transcript filename indicates the age of the child in years, months
and days and the duration of the recording (e.g., Lara.1-09-13.45.cha
indicates that Lara was 1 year, 9 months, 13 days old at the time of
recording and that 45 minutes of conversation were recorded). The data
were orthographically transcribed using the CHILDES system (MacWhinney,
2000), following transcription conventions similar to those used for the
Manchester corpus (Theakston, Lieven, Pine & Rowland, 2001) and detailed
in the readme.doc for that corpus. The transcriber was trained to
recognize common types of grammatical errors made by young children and
to note them with error codes. All transcripts were then checked for
accuracy against the audio recordings by the principal investigator. A
morphological coding line was added to all transcripts using the MOR and
POST programs provided with the CHILDES system. These programs code
words into grammatical categories, adding a dependent MOR tier with a
syntactic parse. In all transcripts, the MOR line was checked for
accuracy by hand and a number of coding errors were corrected. However,
the accuracy of all coding cannot be guaranteed. Whenever the child
produced an ungrammatical utterance, it was marked with the [*] sign on
the main line. The error was then described on a %err: dependent tier.
For example:
*CHI: threwed [*] it .
%err: threwed = threw .
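The filename convention described at the start of this section can be
unpacked programmatically. The following is a minimal Python sketch,
not part of the corpus tools. The pattern is inferred from the single
documented example (Lara.1-09-13.45.cha), and the names
parse_transcript_filename, RecordingInfo and childes_age are
hypothetical. Zero-padding the day in the age string is an assumption.

```python
import re
from typing import NamedTuple


class RecordingInfo(NamedTuple):
    years: int
    months: int
    days: int
    minutes: int  # duration of the recording


# Assumed pattern: Lara.<years>-<months>-<days>.<minutes>.cha
FILENAME_RE = re.compile(r"^Lara\.(\d+)-(\d{2})-(\d{2})\.(\d+)\.cha$")


def parse_transcript_filename(name: str) -> RecordingInfo:
    """Split a transcript filename into the child's age and the duration."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected transcript filename: {name!r}")
    years, months, days, minutes = (int(part) for part in match.groups())
    return RecordingInfo(years, months, days, minutes)


def childes_age(info: RecordingInfo) -> str:
    """Format the age as years;months.days, e.g. 1;9.13."""
    return f"{info.years};{info.months}.{info.days:02d}"


info = parse_transcript_filename("Lara.1-09-13.45.cha")
print(childes_age(info), info.minutes)
```

Filenames that do not follow the assumed pattern raise a ValueError
rather than being silently skipped.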
Omitted words and bound morphemes were still transcribed, but were
preceded by 0 to indicate that they had been omitted. For example:
*CHI: I 0have [*] got
%err: 0have=have
*CHI: you are make0ing a cake
%err: make0ing=making
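The error coding described above lends itself to simple automatic
extraction. The sketch below, in Python, pairs each %err: tier with the
main line it follows and splits it into the produced form and the target
form. It assumes one "error = target" pair per %err: line and one
utterance per main line, as in the examples above; real transcripts may
contain continuation lines and other tiers that this does not handle.
The function names are hypothetical.

```python
import re

# One "error = target" pair, with or without spaces around "=".
ERR_RE = re.compile(r"^%err:\s*([^\s=]+)\s*=\s*([^\s=]+)")


def parse_err_tier(line):
    """Return (produced_form, target_form) from a %err: dependent tier."""
    match = ERR_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a recognised %err: line: {line!r}")
    return match.group(1), match.group(2)


def is_omission(produced_form):
    """True if the produced form carries the 0 omission marker."""
    return "0" in produced_form


def errors_in(lines):
    """Yield (main_line, produced_form, target_form) for each %err: tier."""
    main_line = None
    for line in lines:
        if line.startswith("*"):
            main_line = line
        elif line.startswith("%err:") and main_line is not None:
            yield (main_line, *parse_err_tier(line))


sample = [
    "*CHI: threwed [*] it .",
    "%err: threwed = threw .",
    "*CHI: I 0have [*] got",
    "%err: 0have=have",
]
for main_line, produced, target in errors_in(sample):
    print(produced, "->", target, "(omission)" if is_omission(produced) else "")
```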
Postcodes were used on the main line in the following circumstances:
[+ SR] self-repetition of one of the previous 5 utterances, as long as
the utterance was not more than 10 seconds removed in time. Only the
target child’s utterances were coded for self-repetition.
[+ I] imitation of one of the previous 5 utterances, as long as the
utterance was not more than 10 seconds removed in time. Only the target
child’s utterances were coded for imitation.
[+ PI] partially intelligible utterance. All utterances were coded.
[+ IN] incomplete utterance. All utterances were coded.
[+ R] routines, including counting, songs and nursery rhymes. All
utterances were coded.
Diary data
The diary began when caregivers informally reported
that Lara was starting to produce a variety of wh-questions with
different auxiliaries and ended when approximately 90% of her
wh-questions were correct. The diary was filled in by caregivers, who
were provided with notebooks to record all wh-questions produced both
inside and outside the home. The diary-keepers were trained to record
the exact speech of the child, to recognize common error types (e.g.,
to omit auxiliaries when not pronounced, to indicate contractions), and
to recognize the different types of wh-question. As no notes were made
when the child was at nursery, it is estimated that the diary contains
approximately 80% of the wh-questions that were produced by Lara during
this period. The diary data were then re-transcribed on a computer in
CHILDES format. The transcription conventions were
identical to those used for the audio-tape transcripts.
Acknowledgements
The project was funded by the Economic and
Social Research Council, Grant No. RES000220241.