Thomas Corpus

Jeannine Goh
MPI Child Study Centre
University of Manchester


Participants: 1
Type of Study: longitudinal, naturalistic
Location: England
Media type: audio
DOI: doi:10.21415/T5JG64

Project Description

This corpus contains the data from a longitudinal naturalistic study of one child over a period of three years. The child is called Thomas. He was born 03-APR-1997 into a middle class family. His primary care-giver is his mother. This large dataset is best considered in three sections (Sections A, B, C). Section A differs from B and C in the frequency of recordings, and Section C differs from A and B in its use of an updated transcription and morphosyntactic coding system. More details of these differences are given below.


Section A (Thomas aged 2-00-12 to 3-02-12) A VERY INTENSIVE PERIOD

  • Thomas is recorded for one hour, five times a week, every week for the entire period. One of the five recordings each week is a video. There are 279 scripts and 49 videos.

    Section B (Thomas aged 3-03-02 to 3-11-06) AN INTENSIVE PERIOD

  • Thomas is recorded for one hour, one week in every month. During this week there are five recordings, one of which is a video. There are 43 scripts and 12 videos.

    Section C (Thomas aged 4-00-02 to 4-11-20) AN INTENSIVE PERIOD

  • Thomas is recorded for one hour, one week in every month. During this week there are five recordings, one of which is a video. There are 57 scripts and 12 videos.

    Procedure

    Over the three-year period the audio of a total of 379 sessions was recorded using a standard Sony mini-disc recorder and Sennheiser evolution radio microphones. The microphones were positioned around the downstairs of the house, allowing Thomas to move freely during his play whilst still capturing his speech. For 73 of these recordings a video recording was also taken using a standard video-camera. These videos are now in DVD format, but permission was not gained for submission to the CHILDES database. All of the audio recordings took place in Thomas’s home, where he was engaged in normal play activities with his mother. In most of the video recordings the investigator is also present and is engaged in play with Thomas. The videos were mainly recorded in Thomas’s home, although a number were recorded in the laboratory at the Max Planck Child Study Centre at the University of Manchester. Most of the recordings are 60 minutes long.

    Known inconsistencies in the data

    The corpus was gathered over a number of years during which time CLAN was updated, the experience of the transcribers increased, transcribers came and went, and problems were identified and rectified along the way. This has inevitably led to some inconsistencies in transcription, some of which are listed below.

    1. A decision was made after Thomas A that Pluses (+) should only be used with compound nouns (e.g. fire+engine, washing+machine, fishing+rod, snip+snip@f, quack+quack@f, etc.) and NOT be used when transcribing repetitions such as no+no+no+no, jumpity+jump+jump, wait+wait+wait. Repetitions are instead coded as <no no no> [/] no, <jumpity jump> [//] jump, <wait wait> [/] wait. During the changeover the coding of repetition is not always consistent.

    2. When Thomas was two years old he omitted many words. The transcribers were asked to mark errors where Thomas missed auxiliaries and where the missing words made the utterance confusing. The transcribers also marked overextensions. Some examples are provided below.
             Missing auxiliary: Mummy 0is [*] come-ing
             Overextension: brokened [*]
             Omissions: David 0and [*] Sharon
                        Mummy-0’s [*] watch
                        Lots of train-0s [*]

    3. You will however find no error coding in utterances such as:
                   Fallen all down
                   Watch postman
                   One tree blow off
                   tree-s fallen on the leaves all down
                   Thomas smell it
       Note: The transcribers did initially struggle with error coding and their use of codes becomes more accurate and consistent as the study goes on.

    4. The marker @sc is used to mark schwas. A word is marked as a schwa whenever the child does not fully pronounce the target word, e.g. I@sc play. In the early files Thomas tends to use the sounds a or o in the place of prepositions, adverbs, pronouns etc. In most of these cases the target word is not identifiable and these are therefore coded as a@sc and o@sc. Later in the study, as Thomas’s language becomes clearer, the transcribers try to place the word they think Thomas is trying to say before the @sc sign, e.g. the@sc. They may also transcribe the actual sound they hear, e.g. pwosh@sc. The transcribers’ interpretation of @sc does vary, so searches and analyses must be undertaken with this in mind. Moreover, the way in which the MOR program codes @sc varies, and care must also be taken when using the MOR line. The sound files are provided with this data, which will allow the @sc codes to be listened to again if required.

    5. There are some inconsistencies in the use of the codes @c (child invented form), @o (onomatopoeia) and @f (family invented form). For example, miaow@c, miaow@f, miaow@f or mmm@o, mmm@c, mmmm@f.

    6. The transcribers vary in the way they spell Mrs. It is mostly transcribed as Mrs, but Misses is also used. Care must be taken not to confuse this second spelling with the third person verb misses. Mrs and Misses are, however, always coded with capital letters, and on most occasions a + joins them to a name, e.g. Mrs+Platford, Misses+Platford. Similarly, Mr may also be spelt Mister, although this does not have the possible confusion with a verb. Other possible spelling variations are listed below:
      • Purdy, Purdie, Purdey (cat)
      • Granddad, Grandad
      • Beilbie, Bilbey (name)
      • Nee+naa@o, nee+naw, nee+nah (sound of police car)
      • Play+doh, Play+dough
      • Teletubbie, telytubby (television program)
      • Incy+wincy+spider, Incey+wincey+spider
      • Miaow, miaou, meeiow, meow

    7. Some common transcription errors:
      whose with who’s
      your with you-’re
      have with of (e.g. might of instead of might have)
      it-’s (verb) with its (poss)
      let-’us with let’s
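
    The inconsistencies listed above affect automated searches of the transcripts. The following Python sketch shows one possible way to flag tokens for manual review. It is not part of CLAN or of the corpus tooling: the function names (split_marker, is_plus_repetition, marker_variants, normalise_title) and the TITLE_VARIANTS table are hypothetical, and the sketch assumes the main tier has already been split into whitespace-separated tokens.

```python
from collections import defaultdict

# Special-form markers mentioned in the notes above.
MARKERS = ("@sc", "@c", "@o", "@f")

# Hypothetical lookup table for the title spellings described in note 6.
TITLE_VARIANTS = {"Misses": "Mrs", "Mister": "Mr"}


def split_marker(token):
    """Return (base, marker); marker is '' when the token carries none."""
    for marker in MARKERS:
        if token.endswith(marker):
            return token[: -len(marker)], marker
    return token, ""


def is_plus_repetition(token):
    """True for repetitions joined with + (e.g. no+no+no), which note 1
    says should instead be coded with [/] or [//].
    Marked forms such as snip+snip@f are treated as accepted."""
    base, marker = split_marker(token)
    if marker:
        return False
    parts = base.lower().split("+")
    # Flag any pair of adjacent, non-empty, identical parts.
    return any(a and a == b for a, b in zip(parts, parts[1:]))


def marker_variants(tokens):
    """Map each base form to the markers it occurs with, keeping only
    base forms that occur with more than one marker (note 5)."""
    seen = defaultdict(set)
    for token in tokens:
        base, marker = split_marker(token)
        if marker:
            seen[base.lower()].add(marker)
    return {base: marks for base, marks in seen.items() if len(marks) > 1}


def normalise_title(token):
    """Rewrite Misses+Name / Mister+Name as Mrs+Name / Mr+Name.
    Only plus-joined, capitalised forms are changed, so the verb
    'misses' is left alone; titles written without + are not handled."""
    head, sep, rest = token.partition("+")
    if sep and head in TITLE_VARIANTS:
        return TITLE_VARIANTS[head] + sep + rest
    return token
```

    Tokens flagged by is_plus_repetition or reported by marker_variants are candidates for checking against the sound files rather than confirmed errors.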

    More Notes on transcription

    Phonological forms: The focus in this study is early grammatical development and not the specific phonological forms that Thomas uses. Therefore, unless Thomas uses what appear to be child-specific forms, the target word is transcribed rather than an approximation of the child’s phonological form.

       Thomas’s early language

          He uses hat in three different ways:
    1. hat - to refer to an actual hat
    2. Hat - to refer to Dipsy, the green teletubby which usually wears a hat
    3. hat@c - when he wants to say “green”

          He uses Po in a similar way:
    1. Po - to refer to the teletubby
    2. po@c - when he wants to say “red”
    3. po@c - to refer to red objects in general
    Error Coding

    Errors that are coded during transcription are as follows (APP 3: Error coding more guidelines)

    Missing morphemes: e.g. ‘two dog-0s’, ‘He’s go-0ing’, ‘Mummy-0’s sock’ etc.
    Case errors: e.g. ‘Her do it’, ‘Me get it’
    Missing or incorrect auxiliaries and copulas: e.g. ‘It 0is going there’, ‘I 0am getting a drink’
    Word class errors: e.g. double determiners ‘a that one’
    Agreement errors: e.g. ‘a bricks’, ‘these penguin’, ‘Does she likes it?’, ‘It don’t go there’
    Pronominal errors: e.g. ‘Carry you’ when the child wants to be carried
    Wrong word: e.g. ‘I put it off’ - where the context indicates ‘take’ is appropriate
    Overgeneralisation: e.g. ‘it broke-ed’

    Not all errors are easy to identify. In utterances such as “what doing trucks” it is difficult to pinpoint the type of error that has been made. In such cases an error marker [*] is placed on the main tier and a question mark on the error line.

    When to use an error code

    An error code should be used whenever what the child says is grammatically incorrect. If there is something wrong with the sentence, you, as the transcriber, need to flag it up using the [*] sign. You should place the [*] sign straight after the word that is the problem. If we do not flag up the errors then the researcher may not know what the child intended to say, for example:

    *CHI: me Mummy stopped

    You may know from listening to the recording whether Mummy has stopped, the child has stopped, or Mummy has stopped the child. You may also know whether a has or had has been omitted. These are all useful things for the researcher to know.

    If you know there is an error but there is ambiguity surrounding it, then it is best to use a [?] on the error line. You can use angle brackets to show whether it is the whole sentence or only some words in the sentence that you are unsure about.

    *CHI: [*]
    %err: [?]

    Omitted/missing words

    These are generally transcribed correctly, but to revise: a ‘0’ (zero) is used to indicate that a word has been omitted, and the omitted word is written immediately after the 0. Words like have and has (auxiliaries) are often omitted, as are parts of words, for example:

    *CHI: I 0have [*] got
    %err: 0have=have

    *CHI: I am go0ing [*]
    %err: go0ing=going

    *CHI: I want two sweet0s [*]
    %err: sweet0s=sweets

    What is said after the ‘0’ is taken out when we run the grammar program and what is left behind should read exactly what the child actually said. Anything after the 0 is what you have corrected.
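
    As an illustration of the rule above, the following Python sketch removes the 0-marked material and the [*] flags from a main line to recover what the child said. It is not the grammar program itself. The function name child_said is hypothetical, and the sketch assumes that everything from a 0 to the end of a token is an inserted correction, so a bare 0 used for any other purpose would also be removed.

```python
import re


def child_said(main_line):
    """Return the words the child actually produced on a main line."""
    # Drop a speaker label such as '*CHI:' or 'CHI:' if present.
    text = re.sub(r"^\*?[A-Z]{3}:\s*", "", main_line.strip())
    # Drop the error flags.
    text = text.replace("[*]", " ")
    # Drop everything from a 0 (and a preceding hyphen, as in train-0s)
    # to the end of the token: this is what the transcriber supplied.
    text = re.sub(r"-?0\S*", "", text)
    # Collapse the whitespace left behind.
    return " ".join(text.split())
```

    For the examples given earlier, child_said("*CHI: I 0have [*] got") returns "I got" and child_said("CHI: I want two sweet0s [*]") returns "I want two sweet".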

    Additions and overextensions

    The following is VERY important: if the child has wrongly added an ‘ed’ ending to a word, it should be coded like this:

    *CHI: threwed [*] it .
    %err: threwed = threw .

    If in the next example you are sure that they mean one sweet:

    *CHI: I want a sweets [*]
    %err: sweets=sweet

    If you are not sure if it was one sweet:

    *CHI: I want [*]
    %err: [?]

    More than one error on a line

    Any number of errors can be coded on a single %err line as long as there is one [*] symbol for each error and each coding on the %err line is separated by a semi-colon.

    *CHI: I am go0ing [*] homes [*]
    %err: go0ing=going; homes=home

    Please note that what is on the left side of the equals sign is what is in the transcript; what is on the right side is what it should be.
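
    The conventions for the %err line can be checked mechanically. The following Python sketch splits an %err line into (transcribed form, intended form) pairs and compares the number of codings with the number of [*] flags on the main line. The function names parse_err_line and flags_match are hypothetical and are not CLAN commands; a coding without an equals sign, such as [?], is returned with None as its intended form.

```python
def parse_err_line(err_line):
    """Split '%err: a=b; c=d' into [('a', 'b'), ('c', 'd')]."""
    # Everything after the first colon is the body of the tier.
    _, _, body = err_line.partition(":")
    pairs = []
    for item in body.split(";"):
        # Tolerate surrounding spaces and a trailing full stop.
        item = item.strip().rstrip(".").strip()
        if not item:
            continue
        if "=" in item:
            said, target = (part.strip() for part in item.split("=", 1))
        else:
            said, target = item, None  # e.g. an uncertain error marked [?]
        pairs.append((said, target))
    return pairs


def flags_match(main_line, err_line):
    """True when there is one [*] on the main line per %err coding."""
    return main_line.count("[*]") == len(parse_err_line(err_line))
```

    For example, parse_err_line("%err: go0ing=going; homes=home") returns [("go0ing", "going"), ("homes", "home")], and flags_match reports a mismatch when a main line carries fewer [*] flags than its %err line has codings.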

    Using [= actually says]

    We use [= actually says ] quite a lot in the transcript. It should only be used if the child makes a mistake in a word. For example, the following are fine:

    *CHI: hitting [= actually says higging]
    *CHI: spaghetti [= actually says getti]
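
    Because the target word is transcribed and the child’s form is kept inside the brackets, the two can be paired up automatically. The following Python sketch is one possible way to do this; the function name actually_says is hypothetical, and the sketch assumes the annotation immediately follows the single word it applies to.

```python
import re

# A word, then '[= actually says FORM]' with flexible spacing.
ACTUALLY_SAYS = re.compile(r"(\S+)\s+\[=\s*actually says\s+([^\]]+?)\s*\]")


def actually_says(main_line):
    """Return (target word, form the child produced) pairs from a line."""
    return ACTUALLY_SAYS.findall(main_line)
```

    For example, actually_says("*CHI: hitting [= actually says higging]") returns [("hitting", "higging")].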