CHILDES English Thomas Corpus

Elena Lieven
MPI-Leipzig
Manchester University
lieven@eva.mpg.de

Jeannine Goh
MPI Child Study Centre
University of Manchester
jeannine.goh@manchester.ac.uk

Participants:	1
Type of Study:	longitudinal, naturalistic
Location:	England
Media type:	audio
DOI:	doi:10.21415/T5JG64

Citation information

Publications using these data should cite:

Lieven, E., Salomo, D. & Tomasello, M. (2009). Two-year-old children’s production of multiword utterances: A usage-based analysis. Cognitive Linguistics, 20, 3, 481-508.

Other publications based on the use of these data include:

Maslen, R., Theakston, A., Lieven, E. & Tomasello, M. (2004). A Dense Corpus Study of Past Tense and Plural Overregularization in English. Journal of Speech, Language and Hearing Research, 47, 1319-1333.

Dąbrowska, E. & Lieven, E. (2005). Towards a lexically specific grammar of children’s question constructions. Cognitive Linguistics, 16, 3, 437-474.

Lieven, E. (2006). Producing multiword utterances. In B. Kelly & E. Clark (eds.) Constructions in Acquisition. Stanford, CA: CSLI Publications, pp. 83-110.

Cameron-Faulkner, T., Lieven, E. & Theakston, A. (2007). What part of no do children not understand? A usage-based account of multiword negation, Journal of Child Language, 34, 251-282.

Chang, F., Lieven, E. & Tomasello, M. (2008). Automatic evaluation of syntactic learners in typologically-different languages. Cognitive Systems Research, 9 (3), 198-213.

Bannard, C. & Lieven, E. (2009). Repetition and Reuse in Child Language Learning. In Roberta Corrigan, Edith Moravcsik, Hamid Ouali & Kathleen Wheatley (eds.) Formulaic Language: Volume II: Acquisition, Loss, Psychological Reality, Functional Explanations. Amsterdam: John Benjamins, pp. 297-321.

Bannard, C. & Matthews, D. (2008). Stored word sequences in language learning: The effect of familiarity on children's repetition of four-word sequences. Psychological Science, 19 (3), 241-248.

Ph.D. dissertations (largely based on these data): Cameron-Faulkner, Maslen, Kirjavainen

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

This corpus contains the data from a longitudinal naturalistic study of one child over a period of three years. The child is called Thomas. He was born 03-APR-1997 into a middle-class family. His primary caregiver is his mother. This large dataset is best considered in three sections (Sections A, B, C). Section A differs from B and C in the frequency of recordings, and Section C differs from A and B in its use of an updated transcription and morphosyntactic coding system. More details of these differences are given below.

THE FREQUENCY OF DATA

Section A (Thomas aged 2-00-12 to 3-02-12) A VERY INTENSIVE PERIOD

Thomas is recorded for one hour, five times a week, every week for the entire period. One of every five recordings is a video. There are 279 scripts and 49 videos.

Section B (Thomas aged 3-03-02 to 3-11-06) AN INTENSIVE PERIOD

Thomas is recorded for one hour at a time, during one week in every month. During this week there are five recordings, one of which is a video. There are 43 scripts and 12 videos.

Section C (Thomas aged 4-00-02 to 4-11-20) AN INTENSIVE PERIOD

Thomas is recorded for one hour at a time, during one week in every month. During this week there are five recordings, one of which is a video. There are 57 scripts and 12 videos.

Procedure

Over the three-year period, the audio of a total of 379 sessions was recorded using a standard Sony mini-disc recorder and Sennheiser evolution radio microphones. The microphones were positioned around the downstairs of the house, allowing Thomas to move freely during his play whilst still capturing his speech. For 73 of these recordings a video recording was also made using a standard video camera. These videos are now in DVD format, but permission was not gained for submission to the CHILDES database. All of the audio recordings took place in Thomas’s home, where he was engaged in normal play activities with his mother. In most of the video recordings the investigator is also present and is engaged in play with Thomas. The videos were mainly recorded in Thomas’s home, although a number were recorded in the laboratory at the Max Planck Child Study Centre at the University of Manchester. Most of the recordings are 60 minutes long.
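As a quick consistency check, the per-section counts given above can be tallied against the totals stated in the procedure (379 sessions, 73 videos). This is only an illustrative sketch; the dictionary layout and the function name `corpus_totals` are assumptions made for the example, not part of the corpus distribution.

```python
# Counts copied from the descriptions of Sections A, B and C above.
SECTIONS = {
    "A": {"scripts": 279, "videos": 49},
    "B": {"scripts": 43, "videos": 12},
    "C": {"scripts": 57, "videos": 12},
}


def corpus_totals(sections):
    """Sum the per-section script and video counts."""
    scripts = sum(s["scripts"] for s in sections.values())
    videos = sum(s["videos"] for s in sections.values())
    return scripts, videos


print(corpus_totals(SECTIONS))  # (379, 73)
```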

Known inconsistencies in the data

The corpus was gathered over a number of years, during which time CLAN was updated, the experience of the transcribers increased, transcribers came and went, and problems were identified and rectified along the way. This has inevitably led to some inconsistencies in transcription, some of which are listed below.

A decision was made after Thomas A that Pluses (+) should only be used with compound nouns (e.g. fire+engine, washing+machine, fishing+rod, snip+snip@f, quack+quack@f, etc.) and NOT be used when transcribing repetitions such as no+no+no+no, jumpity+jump+jump, wait+wait+wait. Repetitions are instead coded as <no no no> [/] no, <jumpity jump> [//] jump, <wait wait> [/] wait. During the changeover the coding of repetition is not always consistent.
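Because the changeover is not fully consistent, it may help to flag plus-joined tokens that look like old-style repetitions rather than compounds. The sketch below is a rough heuristic under stated assumptions: it treats any plus-joined token with a repeated part and no special-form marker (such as @f or @o) as suspect. The function name `suspect_plus_repetitions` is hypothetical, and reduplicated names such as Nin+Nin would be false positives.

```python
import re

# A token made of two or more parts joined by "+", e.g. fire+engine.
PLUS_TOKEN = re.compile(r"[^\s+]+(?:\+[^\s+]+)+")


def suspect_plus_repetitions(utterance):
    """Return plus-joined tokens that repeat a part and carry no @ marker."""
    suspects = []
    for token in PLUS_TOKEN.findall(utterance):
        if "@" in token:
            # Forms such as snip+snip@f are accepted compounds; skip them.
            continue
        parts = [part.lower() for part in token.split("+")]
        if len(set(parts)) < len(parts):
            suspects.append(token)
    return suspects


print(suspect_plus_repetitions(
    "no+no+no+no fire+engine snip+snip@f jumpity+jump+jump"
))  # ['no+no+no+no', 'jumpity+jump+jump']
```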

When Thomas was two years old he omitted many words. The transcribers were asked to mark errors where Thomas missed auxiliaries and where the missing words made the utterance confusing. The transcribers also marked overextensions. Some examples are provided below.

Missing auxiliary:	Mummy 0is [*] come-ing
Overextension:	brokened [*]
Omissions:	David 0and [*] Sharon
	Mummy-0’s [*] watch
	Lots of train-0s [*]
Confusions:

You will however find no error coding in utterances such as:
           Fallen all down
           Watch postman
           One tree blow off
           tree-s fallen on the leaves all down
           Thomas smell it
Note: The transcribers did initially struggle with error coding and their use of codes becomes more accurate and consistent as the study goes on.
The marker @sc is used to mark schwas. A word is marked as a schwa whenever the child does not fully pronounce the target word, e.g. I@sc play. In the early files Thomas tends to use the sounds a or o in the place of prepositions, adverbs, pronouns etc. In most of these cases the target word is not identifiable and these are therefore coded as a@sc and o@sc. Later in the study, as Thomas’s language becomes clearer, the transcribers try to place the word they think Thomas is trying to say before the @sc sign, e.g. the@sc. They may also transcribe the actual sound they hear, e.g. pwosh@sc. The transcribers’ interpretation of @sc varies, so searches and analyses must be undertaken with this in mind. Moreover, the way in which the MOR program codes @sc varies, and care must also be taken when using the MOR line. The sound files are provided with these data, which allows the @sc codes to be listened to again if required.
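Since the form written before @sc varies between transcribers and over time, a first step before any analysis might be to tabulate which forms actually occur. The following is a minimal sketch, assuming the transcript has already been read into a list of tier lines; the function name `count_schwa_forms` is hypothetical, and only lines on the *CHI tier are counted.

```python
import re
from collections import Counter

# Captures whatever is written directly before the @sc marker.
SCHWA_TOKEN = re.compile(r"(\S+)@sc\b")


def count_schwa_forms(lines):
    """Count the forms marked with @sc on the child's main tier."""
    counts = Counter()
    for line in lines:
        if line.startswith("*CHI:"):
            counts.update(SCHWA_TOKEN.findall(line))
    return counts


sample = [
    "*CHI:\ta@sc do .",
    "*MOT:\tdo it again ?",
    "*CHI:\tthe@sc train a@sc there .",
]
print(count_schwa_forms(sample))  # Counter({'a': 2, 'the': 1})
```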
There are some inconsistencies in the codes @c (child invented form), @o (onomatopoeia) and @f (family invented form). For example, miaow@c, miaow@o, miaow@f or mmm@o, mmm@c, mmmm@f.
The transcribers vary in the way they spell Mrs. It is mostly transcribed as Mrs; however, Misses is also used. Care must be taken not to confuse this second spelling with the third person verb misses. Mrs and Misses are, however, always coded with capital letters, and on most occasions a + joins the title to a name, e.g. Mrs+Platford, Misses+Platford. Similarly, Mr may also be spelt Mister, although this does not have the possible confusion with a verb. Other possible spelling variations are listed below:
• Purdy, Purdie, Purdey (cat)
• Granddad, Grandad
• Beilbie, Bilbey (name)
• Nee+naa@o, nee+naw, nee+nah (sound of police car)
• Play+doh, Play+dough
• Teletubbie, telytubby (television program)
• Incy+wincy+spider, Incey+wincey+spider
• Miaow, miaou, meeiow, meow
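When searching the transcripts for one of these words, it may be safer to search for every listed spelling at once. The sketch below shows one way to do that; the list `VARIANT_GROUPS` and the function `expand_spellings` are hypothetical names, the groups are copied from the list above, and the capitalisation found in the transcripts may differ from the forms shown here.

```python
# Variant groups copied from the list above. Matching is case-sensitive
# on purpose, so the title "Misses" stays apart from the verb "misses".
VARIANT_GROUPS = [
    ["Mrs", "Misses"],
    ["Mr", "Mister"],
    ["Purdy", "Purdie", "Purdey"],
    ["Granddad", "Grandad"],
    ["Beilbie", "Bilbey"],
    ["Nee+naa@o", "nee+naw", "nee+nah"],
    ["Play+doh", "Play+dough"],
    ["Teletubbie", "telytubby"],
    ["Incy+wincy+spider", "Incey+wincey+spider"],
    ["Miaow", "miaou", "meeiow", "meow"],
]


def expand_spellings(word):
    """Return every listed spelling of word, or just word if none is listed."""
    for group in VARIANT_GROUPS:
        if word in group:
            return list(group)
    return [word]


print(expand_spellings("Purdie"))  # ['Purdy', 'Purdie', 'Purdey']
print(expand_spellings("misses"))  # ['misses']
```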
Some common transcription errors:
• whose with who’s
• your with you-’re
• have with of (e.g. might of instead of might have)
• it-’s (verb) with its (poss)
• let-’us with let’s

More Notes on transcription

Phonological forms: The focus in this study is early grammatical development and not the specific phonological forms that Thomas uses. Therefore, unless Thomas uses what appear to be child-specific forms, the target word is transcribed rather than an approximation of the child’s phonological form.

Thomas’s early language

ah+phss@o: expression that Thomas uses to refer to sleep/sleeping/snoring etc.
alander@c: lad (also says “land” instead of “lad”)
apple: Thomas’s name for Jeannine
a@sc do: Thomas uses this expression when he wants an action to be repeated - asking mummy to do something again
a@sc: he uses it very often in his speech, usually in the place of pronouns, prepositions and adverbs
backside@c: the back garden; also used as “back outside” or “back inside”
backways@c: backwards
bang+a+drum+time@c: music lesson
bee+ba@c: a police car, ambulance or a fire engine -real or toy; sometimes used for other types of cars as well
Beechy@c: -Dimitra- for a short while
big splash: bath
black juice: blackcurrant juice
Bow: this is how he refers to their cat
bow@c: for other cats or other animals
Bow+Wow: a dog - one of Thomas’s toys
choc+choc@f: chocolate
choo+choo@f: train
crane-ing@c: lifting things up - usually using a toy crane to do so
done and gone: he uses them often in his speech but it’s almost impossible to tell one from the other and he may not have yet distinguished one from the other. When transcribing we make the choice between gone and done mainly in terms of context
doc+doc@c: doctor
dot+dot@f: uses this expression to refer to little scratches
Hat: (actually sounds something between shat, sat and hat. Because of the “s” sound it’s also been transcribed as is@sc Hat or &s Hat)

He uses it in three different ways:

hat - to refer to an actual hat
Hat - in order to refer to Dipsy -the green teletubby which usually wears a hat
hat@c - when he wants to say “green”

mumm+mumm@c: car
nap+nap@c: nappy
Nin+Nin: this is what he calls his mother
nip, nip-s, nip-s@c: nipples
Noo+Noo or noo+noo@c: this is what he calls vacuum cleaners - Noo+Noo is the vacuum cleaner on the teletubby show
pap+pap@c: parrot
Po: the name of the red teletubby - he uses it:

Po- to refer to the teletubby
po@c - when he wants to say “red”
po@c - to refer to red objects in general

poo: poo -actual poo
pooh: smell -either good or bad
shining@c or shiny@c: has frequently used this expression to refer to the sun
snip+snip@o or snip+snip@c : uses this expression to refer to anything that may make this sound -e.g. scissors, cutting, chopping
snipsnip+man@c: hairdresser , barber
o@sc: as a@sc above - he doesn’t use it as often
quack+quack@f: ducks or birds in general
ta@d: thanks
ta@d much (in)deed: thank you very much indeed
what-’is this: Thomas often says a@sc this, wo’this or wo’dis. The way we transcribe it is:
*CHI: what-’is this [= actually says wo’dis]
The MOR programme will code “what’s this”.
Wodar@c: (it’s also been transcribed in the following ways: a@sc there, &wo there, wada@c, a@sc dar@c); Thomas tends to use this expression when he wants to be given something.

It is possible that wodar is used as a different expression from a@sc there or &wo there, but we do not yet have the contextual information to make any distinctions.
wow+wow@c: dog

Error Coding

Errors that are coded during transcription are as follows (APP 3: Error coding more guidelines)

Missing morphemes e.g. ‘two dog-0s’, ‘He’s go-0ing’, ‘Mummy-0’s sock’ etc.
Case errors e.g. ‘Her do it’, ‘Me get it’
Missing or incorrect auxiliaries and copulas e.g. ‘It 0is going there’, ‘I 0am getting a drink’
Word Class Errors e.g. double determiners ‘a that one’
Agreement errors e.g. ‘a bricks’, ‘these penguin’, ‘Does she likes it?’, ‘It don’t go there’.
Pronominal Errors e.g. ‘Carry you’ when the child wants to be carried
Wrong word e.g. ‘I put it off’ - where the context indicates ‘take’ is appropriate.
Overgeneralisation e.g. ‘it broke-ed’

Not all errors are easy to identify. In utterances such as “what doing trucks”, it is difficult to pinpoint the type of error that has been made. In such cases an error marker [*] is placed on the main tier and a question mark on the error line.

When to use an error code

An error code should be used whenever what the child says is grammatically incorrect. If there is something wrong with the sentence, you as the transcriber, need to flag it up using the [*] sign. You should place the [*] sign straight after the word that is the problem. If we do not flag up the errors then the researcher may not know what the child intended to say, for example:

*CHI: me Mummy stopped

You may know from listening to the recording whether Mummy has stopped, whether the child has stopped, or whether Mummy has stopped the child, and perhaps whether a has or had has been omitted. These are all useful things for the researcher to know.

If you know there is an error but there is ambiguity surrounding it, then it is best to use a [?] on the error line. You can use angle brackets to show whether it is the whole sentence or only some words in the sentence that you are unsure about.

*CHI: <me Mummy stopped> [*]
%err: [?]

Omitted/missing words

These are generally transcribed correctly, but to revise: a ‘0’ (zero) is used to indicate that a word has been omitted, and the omitted word is written directly after the 0. Words like have and has (auxiliaries) are often omitted, as are parts of words, for example:

*CHI: I 0have [*] got
%err: 0have=have

*CHI: I am go0ing [*]
%err: go0ing=going

*CHI: I want two sweet0s [*]
%err: sweet0s=sweets

What is said after the ‘0’ is taken out when we run the grammar program and what is left behind should read exactly what the child actually said. Anything after the 0 is what you have corrected.
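The effect of removing the 0-marked material can be illustrated with a short sketch. This is not the grammar program itself, only an approximation under stated assumptions: the function name `what_child_said` is hypothetical, the input is the utterance text without the speaker code, and any token part starting with 0 (optionally preceded by a hyphen) is treated as a correction.

```python
import re

ERROR_MARK = re.compile(r"\s*\[\*\]")   # the [*] flag on the main tier
OMITTED = re.compile(r"-?0\S*")         # 0 plus the material supplied after it


def what_child_said(utterance):
    """Strip error flags and 0-marked corrections from a main-tier utterance."""
    text = ERROR_MARK.sub("", utterance)
    text = OMITTED.sub("", text)
    # Collapse the gaps left behind by removed words.
    return " ".join(text.split())


print(what_child_said("I 0have [*] got"))           # I got
print(what_child_said("I want two sweet0s [*]"))    # I want two sweet
print(what_child_said("Mummy 0is [*] come-ing"))    # Mummy come-ing
```

Note that this would also strip any other token containing the digit 0, so it is only suitable for transcripts where numbers are written out as words.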

Additions and overextensions

The following is VERY important: if the child has wrongly added an ‘ed’ ending to a word, it should be coded like this:

*CHI: threwed [*] it .
%err: threwed = threw .

If in the next example you are sure that they mean one sweet:

*CHI: I want a sweets [*]
%err: sweets=sweet

If you are not sure if it was one sweet:

*CHI: I want <a sweets> [*]
%err: [?]

More than one error on a line

Any number of errors can be coded on a single %err line as long as there is one [*] symbol for each error and each coding on the %err line is separated by a semi-colon.

*CHI: I am go0ing [*] homes [*]
%err: go0ing=going; homes=home

Please note that what is on the left side of the equals sign is what is in the transcript; what is on the right side is what it should be.
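The rule of one [*] per coding can be checked mechanically. The sketch below assumes the main tier and its %err line are available as two strings; the function names `parse_err_tier` and `error_counts_match` are hypothetical, and a coding without an equals sign (such as [?]) is returned with no target.

```python
def parse_err_tier(err_tier):
    """Split a %err line into (transcribed form, target form) pairs."""
    body = err_tier.split(":", 1)[1]
    pairs = []
    for coding in body.split(";"):
        # Drop surrounding spaces and an optional final full stop.
        coding = coding.strip().rstrip(".").strip()
        if not coding:
            continue
        if "=" in coding:
            said, target = (side.strip() for side in coding.split("=", 1))
        else:
            said, target = coding, None
        pairs.append((said, target))
    return pairs


def error_counts_match(main_tier, err_tier):
    """True when the main tier has one [*] for each coding on the %err line."""
    return main_tier.count("[*]") == len(parse_err_tier(err_tier))


print(parse_err_tier("%err: go0ing=going; homes=home"))
# [('go0ing', 'going'), ('homes', 'home')]
print(error_counts_match("*CHI: I am go0ing [*] homes [*]",
                         "%err: go0ing=going; homes=home"))  # True
```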

Using [= actually says]

We use [= actually says ] quite a lot in the transcript. It should only be used if the child makes a mistake in a word. For example, the following are fine:

*CHI: hitting [= actually says higging]
*CHI: spaghetti [= actually says getti]
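For anyone who does want the child's actual pronunciations, the [= actually says ] notes can be pulled out with a regular expression. This is a minimal sketch; the function name `actual_forms` is hypothetical, and it pairs the note only with the single word immediately before it, so multi-word targets such as what-’is this would need separate handling.

```python
import re

# The word before the note, then the form given inside [= actually says ...].
ACTUALLY_SAYS = re.compile(r"(\S+)\s*\[=\s*actually says\s+([^\]]+?)\s*\]")


def actual_forms(line):
    """Return (transcribed word, actual form) pairs found on one tier line."""
    return ACTUALLY_SAYS.findall(line)


print(actual_forms("*CHI: hitting [= actually says higging]"))
# [('hitting', 'higging')]
print(actual_forms("*CHI: spaghetti [= actually says getti]"))
# [('spaghetti', 'getti')]
```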

Acknowledgements

Funding was supplied by these sources:

The Department of Comparative and Developmental Psychology, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany.