CHILDES Welsh CIG2 Corpus

Bob Jones
Department of Education
University of Wales


Participants: 500
Type of Study: naturalistic
Location: Wales
Media type: no longer available
DOI: doi:10.21415/T53G7G

Citation information

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

This project ran from the 1st of July 1999 until the 30th of June 2000. It was directed by Bob Morris Jones and staffed by two researchers, Merris Griffiths and Mared Roberts, in the Department of Education, University of Wales, Aberystwyth, Ceredigion SY23 2AX, Wales, UK.

The data is based on spontaneous recordings of Welsh-speaking children between three and seven years of age. They were recorded in schools throughout Wales in undirected play situations, mainly playing in pairs with various toys in a box of sand. The children are from different school, socio-economic, regional, and linguistic backgrounds.

The original recordings were collected during the period 1974-1977 by a project which was located in the same department, funded by the Welsh Office, directed by Professor C.J. Dodson, run by Bob Morris Jones, and staffed at various times by Brec'hed Piette, Hefin Jones, John Jones, Wyn James, Christine James, and Nesta Dodson.


There are two cohorts: children from three to five, and children from five to seven. The first digit in the names of the files that make up the database gives the age of the children. The file names of the five year olds of the older cohort are distinguished by the letter 'a' after the first digit. The remaining digits complete the file name in all cases.
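The naming scheme above can be sketched in Python as follows. This is only an illustration: the function name parse_file_name and the FileInfo type are hypothetical, and the pattern (a leading 'c', one age digit, an optional 'a', then the remaining digits) is inferred from the file names listed in this document, without any file extension.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from names such as c3001 and c5a044 (an assumption).
_FILE_NAME = re.compile(r"^c([3-7])(a?)(\d+)$")


class FileInfo(NamedTuple):
    age: int            # age of the children in years (first digit)
    older_cohort: bool  # True for the five-to-seven cohort
    serial: int         # remaining digits of the file name


def parse_file_name(name: str) -> Optional[FileInfo]:
    """Return age, cohort and serial number, or None if the name does not fit."""
    match = _FILE_NAME.match(name)
    if match is None:
        return None
    age = int(match.group(1))
    has_a = match.group(2) == "a"
    if has_a and age != 5:
        # The letter 'a' is only described for the five year olds.
        return None
    # Five year olds are in the older cohort only when marked with 'a';
    # six and seven year olds always belong to it.
    older = has_a or age >= 6
    return FileInfo(age, older, int(match.group(3)))


print(parse_file_name("c5a044"))
print(parse_file_name("c5001"))
```
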

The scale of the database can be indicated by the following summary:
three year olds: 25 files (c3001 - c3025), 418 KB, 42 children
four year olds: 31 files (c4001 - c4031), 498 KB, 62 children
five year olds: 39 files (c5001 - c5039), 859 KB, 77 children
five 'a' year olds: 44 files (c5a001 - c5a044), 855 KB, 87 children
six year olds: 48 files (c6001 - c6048), 1.00 MB, 96 children
seven year olds: 52 files (c7001 - c7052), 1.14 MB, 104 children
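
As a quick check on the scale, the file and child counts in the summary above can be added up. The sketch below simply copies the figures from the list; the variable names are illustrative, and the totals cover only the children counted there.

```python
# (group, number of files, number of children), copied from the summary above
SUMMARY = [
    ("three", 25, 42),
    ("four", 31, 62),
    ("five", 39, 77),
    ("five 'a'", 44, 87),
    ("six", 48, 96),
    ("seven", 52, 104),
]

total_files = sum(files for _, files, _ in SUMMARY)
total_children = sum(children for _, _, children in SUMMARY)

print(total_files, total_children)  # 239 468
```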


Personal names, local place-names, and local places-of-work have been made anonymous by using random nonsense-strings of letters: all begin with an initial capital, and the place names have a final 0. The names of public figures, fictional characters, and more distant places have been retained. Making names anonymous loses some information about word-forms, especially about mutations - where they occur - and word-play.

The children produced many noises while playing, and some attempt has been made to transcribe these, although they are not intended to capture the phonetic details. They have the suffix @i. Nonsense forms, in word-play for instance, have the suffix @wp.

English is also spoken by various children to different degrees in the database. Single English words - either by themselves or within a Welsh utterance - are not marked. But phrases or sentences of English words are enclosed in scope symbols < ... >, and are followed by the comment [% Saesneg] - 'Saesneg' being the Welsh word for 'English'.

Similarly, phrases and sentences which are from songs, nursery rhymes, and similar material are enclosed within < ... > and are followed by the comment [% ca:n] - 'ca:n' (or 'cân', to use the circumflex - see below) is the Welsh for 'song'.
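
The scoped English and song material described above could be pulled out of a transcript line with a regular expression. The following is a minimal sketch, assuming the scoped text and its comment appear on the same line with no nested angle brackets; the function name scoped_spans and the sample utterance are invented for illustration and are not taken from the corpus.

```python
import re

# Matches "<some words> [% Saesneg]" or "<some words> [% ca:n]".
SCOPED = re.compile(r"<([^<>]*)>\s*\[%\s*(Saesneg|ca:n)\]")


def scoped_spans(utterance):
    """Return (comment, text) pairs for each scoped stretch in one line."""
    return [(m.group(2), m.group(1).strip()) for m in SCOPED.finditer(utterance)]


# Invented example line, not corpus data.
line = "*CHI:\tsbia <look at that> [% Saesneg] ."
print(scoped_spans(line))
```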

Unfinished words (that is, fragments and not shortened words) are indicated by an initial &.
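
The marking conventions in the preceding paragraphs (initial capital, final 0, the suffixes @i and @wp, and the initial &) can be summarised as a small token classifier. This is a rough sketch only: the function name classify_token, the category labels, and the sample tokens are invented, and the suffixes are passed as parameters because other parts of this documentation name different suffixes.

```python
def classify_token(token, noise_suffixes=("@i",), wordplay_suffixes=("@wp",)):
    """Classify one transcript token by the surface conventions described above."""
    if token.startswith("&"):
        return "fragment"              # unfinished word
    if token.endswith(noise_suffixes):
        return "noise"                 # transcribed play noise
    if token.endswith(wordplay_suffixes):
        return "word-play"             # nonsense form
    if token[:1].isupper():
        # Anonymised place names carry a final 0; other capitalised forms are
        # names, whether anonymised or retained (the two cannot be told apart here).
        return "anonymised place name" if token.endswith("0") else "name"
    return "word"


# Invented tokens, not corpus data.
for tok in ["&ca", "brrm@i", "bibl@wp", "Frib0", "Togl", "ci"]:
    print(tok, "->", classify_token(tok))
```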

There are many homonyms, many of which come about through phonological processes of elision and assimilation in spontaneous speech. Digits and the apostrophe are used to distinguish different word-forms which otherwise have the same spelling. The lexicon gives the lexeme to which they belong. The apostrophe is declared in the 00depadd.cut file to cater for word-initial occurrences.

In spontaneous speech, patterns of a Welsh copula followed by a personal subject pronoun occur as a pronoun only. Such pronouns are indicated by a final apostrophe. There are instances, mainly of directive-like utterances within the context of a game, where it is not entirely clear what the pattern is. But these instances have likewise been given a final apostrophe.

The data files contain utterances by children and adults. The former are identified as Target_child or Child on the @Participant header line in the data files; the latter are identified as Investigators and Teachers. The utterances of the adults have been transcribed in full, but not as painstakingly as those of the children; in particular, homonyms have not all been disambiguated through transcription.

The lexicon contains the word-forms produced by the children. It does not contain word-forms produced by adult participants. The lexicon contains all the Welsh words and single English words which occur within a Welsh utterance or by themselves. It does not contain English words which are in English phrases or sentences. It does not contain proper names, the spellings of noises, or nonsense words - they can be identified in the data by an initial capital, the suffix @sn, and the suffix @gl, respectively. Neither does it contain xxx (for indecipherable material) or unfinished fragments which begin with &.

The categories and their codes in the lexicon are as follows:

Multi-membership, if found in the corpus, is indicated by the Childes convention for this, that is, a backward slash after the first entry, followed on the succeeding line(s) by another entry. These categories serve only to identify data which can be recovered for analysis. They are not intended to represent probing analyses.


This database in Childes format was produced by a project that was funded by the Economic and Social Research Council (ESRC) of the UK with an award of £60,611 (R000237978).