CHILDES Welsh CIG2 Corpus

Bob Jones
Department of Education
University of Wales
bm.jones123@btinternet.com

Participants:	500
Type of Study:	naturalistic
Location:	Wales
Media type:	no longer available
DOI:	doi:10.21415/T53G7G

Citation information

Some citation information here.

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

This project ran from the 1st of July 1999 until the 30th of June 2000. It was directed by Bob Morris Jones and staffed by two researchers, Merris Griffiths and Mared Roberts, in the Department of Education, University of Wales, Aberystwyth, Ceredigion SY23 2AX, Wales, UK.

The data is based on spontaneous recordings of Welsh-speaking children between three and seven years of age. They were recorded in schools throughout Wales in undirected play situations, mainly playing in pairs with various toys in a box of sand. The children are from different school, socio-economic, regional, and linguistic backgrounds.

The original recordings were collected during the period 1974-1977 by a project which was located in the same department, funded by the Welsh Office, directed by Professor C.J. Dodson, run by Bob Morris Jones, and staffed at various times by Brec'hed Piette, Hefin Jones, John Jones, Wyn James, Christine James, and Nesta Dodson.

Participants

There are two cohorts: children from three to five, and children from five to seven. The first digit in the names of the files that make up the database gives the age of the children. The file names of the five year olds of the older cohort are distinguished by the letter 'a' after the first digit. The remaining digits complete the file name in all cases.

The scale of the database can be indicated by the following summary:
three year olds: 25 files (c3001 - c3025), 418 KB, 42 children
four year olds: 31 files (c4001 - c4031), 498 KB, 62 children
five year olds: 39 files (c5001 - c5039), 859 KB, 77 children
five 'a' year olds: 44 files (c5a001 - c5a044), 855 KB, 87 children
six year olds: 48 files (c6001 - c6048), 1.00 MB, 96 children
seven year olds: 52 files (c7001 - c7052), 1.14 MB, 104 children
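
The naming scheme and the summary figures can be sketched in Python as follows. This is only an illustration: the regular expression is inferred from the file ranges listed above, it assumes file names are given without any extension, and the names FILE_NAME, SUMMARY, and parse_file_name are hypothetical rather than part of the corpus.

```python
import re

# Assumed pattern, inferred from the ranges listed above: 'c', one age digit,
# an optional 'a' (five year olds of the older cohort), then three digits.
FILE_NAME = re.compile(r"c([3-7])(a?)(\d{3})")

# Figures copied from the summary above: label -> (files, children).
SUMMARY = {
    "3": (25, 42),
    "4": (31, 62),
    "5": (39, 77),
    "5a": (44, 87),
    "6": (48, 96),
    "7": (52, 104),
}


def parse_file_name(name):
    """Return the age, cohort, and running number encoded in a file name."""
    match = FILE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognised file name: {name!r}")
    age = int(match.group(1))
    has_a = match.group(2) == "a"
    if has_a and age != 5:
        raise ValueError(f"'a' is only expected after the digit 5: {name!r}")
    # Younger cohort: 3, 4, and plain 5. Older cohort: 5a, 6, and 7.
    cohort = "older" if has_a or age >= 6 else "younger"
    return {"age": age, "cohort": cohort, "number": int(match.group(3))}


# Totals across the six groups in the summary.
total_files = sum(files for files, _ in SUMMARY.values())
total_children = sum(children for _, children in SUMMARY.values())
```

With the figures above, total_files comes to 239 and total_children to 468.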

Notation

Personal names, local place-names, and local places-of-work have been made anonymous by using random nonsense-strings of letters: all begin with an initial capital, and the place names have a final 0. The names of public figures, fictional characters, and more distant places have been retained. Making names anonymous loses some information about word-forms, especially about mutations - where they occur - and word-play.

The children produced many noises while playing, and some attempt has been made to transcribe these, although they are not intended to capture the phonetic details. They have the suffix @i. Nonsense forms, in word-play for instance, have the suffix @wp.

English is also spoken by various children to different degrees in the database. Single English words - either by themselves or within a Welsh utterance - are not marked. But phrases or sentences of English words are enclosed in scope symbols < ... >, and are followed by the comment [% Saesneg] - 'Saesneg' being the Welsh word for 'English'.

Similarly, phrases and sentences which are from songs, nursery rhymes, and similar material are enclosed within < ... > and are followed by the comment [% ca:n] - 'ca:n' (or 'cân', to use the circumflex - see below) is the Welsh for 'song'.
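
As a rough illustration, scoped English and song material could be pulled out of a transcribed line with a regular expression. The sketch below assumes the scope and its comment sit on the same line and that scopes are not nested. The function name scoped_spans and the sample utterance are invented for the example.

```python
import re

# A scope < ... > followed by one of the two comments described above.
# The comment label is accepted as 'Saesneg', 'ca:n', or 'cân'.
SCOPED = re.compile(r"<([^<>]*)>\s*\[%\s*(Saesneg|ca:n|cân)\s*\]")


def scoped_spans(utterance):
    """Return (label, text) pairs for each scoped English or song span."""
    return [
        (match.group(2), match.group(1).strip())
        for match in SCOPED.finditer(utterance)
    ]


# Invented sample line, not taken from the corpus.
sample = "dw i isio <come on then> [% Saesneg] ."
spans = scoped_spans(sample)
```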

Unfinished words (that is, fragments and not shortened words) are indicated by an initial &.
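
The word-level markers described in this section can be summarised in a small classifier. This is a minimal sketch under the assumption that a line has already been split into whitespace-separated tokens. The function name, the category labels, and the sample tokens are hypothetical and do not come from the corpus.

```python
def classify_token(token):
    """Classify one token by the markers described in this section."""
    if token.startswith("&"):
        return "fragment"      # unfinished word
    if token.endswith("@i"):
        return "noise"         # transcribed play noise
    if token.endswith("@wp"):
        return "nonsense"      # nonsense form, e.g. in word-play
    if token[:1].isupper():
        # Names start with a capital; anonymised place names also end
        # in the digit 0. Retained public names are capitalised too.
        return "place_name" if token.endswith("0") else "proper_name"
    return "word"


# Invented tokens for illustration only.
labels = [classify_token(t) for t in ["&ca", "brrm@i", "bibi@wp", "Tarvo0", "Lodri", "ci"]]
```

The order of the checks matters only for the capitalised cases: the final 0 is tested after the initial capital, because digits are also used elsewhere to distinguish homonyms.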

There are many homonyms, many of which come about through phonological processes of elision and assimilation in spontaneous speech. Digits and the apostrophe are used to distinguish different word-forms which otherwise have the same spelling. The lexicon gives the lexeme to which they belong. The apostrophe is declared in the 00depadd.cut file to cater for word-initial occurrences.

In spontaneous speech, patterns of a Welsh copula followed by a personal subject pronoun occur as a pronoun only. Such pronouns are indicated by a final apostrophe. There are instances, mainly of directive-like utterances within the context of a game, where it is not entirely clear what the pattern is. But these instances have likewise been given a final apostrophe.

The data files contain utterances by children and adults. The former are identified as Target_child or Child on the @Participant header line in the data files; the latter are identified as Investigators and Teachers. The utterances of the adults have been transcribed in full, but not as painstakingly as those of the children; in particular, homonyms have not all been disambiguated through transcription.

The lexicon contains the word-forms produced by the children. It does not contain word-forms produced by adult participants. The lexicon contains all the Welsh words and single English words which occur within a Welsh utterance or by themselves. It does not contain English words which are in English phrases or sentences. It does not contain proper names, the spellings of noises, or nonsense words - they can be identified in the data by an initial capital, the suffix @sn, and the suffix @gl, respectively. Neither does it contain xxx (for indecipherable material) or unfinished fragments which begin with &.

The categories and their codes in the lexicon are as follows:

?? = multi-category form which is ambiguous in context
a1 = pro-form place adjuncts like FANNA 'there', FAMA 'here', FANCW 'yonder'
ab = conjuncts and disjuncts like HEFYD 'also', FELLY 'therefore'
ad = other adjuncts
ag = aspect markers YN 'progressive', WEDI 'perfective'
an = adjectives
ar = prepositions
as = adverbs ALLAN 'out', YMLAEN 'onwards', I-FFWRDD 'away', I-LAWR 'down', etc.
at = adverbs beginning with TU - TU-ALLAN 'outside', TU-OL 'behind', etc.
b4 = Welsh finite verb with English inflection
bd = English verbs in "-ed", "-en" or equivalent e.g. 'crashed', 'drunk'
be = verbnoun forms (compare English plain infinitive) including auxiliaries but not BOD 'be'
bf = finite-verb forms (including the imperative forms) except BOD 'be'
bg = English verbs in "-ing"
bp = English plain infinitive forms
cd = co-ordinating conjunctions
ce = verbnoun (compare English plain infinitive) of BOD 'be'
cf = finite forms of BOD 'be'
cm = MWY 'more' as a comparative particle before adjectives
cn = greetings and farewells
cy = subordinating conjunctions like ACHOS 'because'
eb = standard exclamations like AA 'ah', OO 'oh'
en = nouns
er = the post-modifying words ARALL 'other' and ERAILL 'others'
es = EISIAU 'wants, needs' - a nominal form
g1 = nominal wh- words - BETH 'what', PWY 'who'
g2 = adverbial wh- words - PRYD 'when', PAM 'why', SUT 'how'
g3 = the wh- word PA 'which'
g4 = compounds involving wh- words like BETH+BYNNAG 'whatever', PRYD+BYNNAG 'whenever'
g5 = the wh- word FAINT 'how much/many'
ga = grammatically invariant answer words IE 'yes', NAGE 'no', DO 'yes', and NADDO 'no'
gc = the comparative particle NA 'than'
gd = demonstrative words DYNA 'there/that is', DYMA 'here/this is', DACW 'yonder is'
gg = intensifiers like RHY 'too', GO 'fairly', MOR 'so'
gm = quantifiers like DIGON 'enough', LLAWER 'much/many', MWY 'more'
gr = preverbal particles like MI, FE, NI and focussing particles like MAI, AI
gt = the predicatival particle YN
ll = pro-form adjuncts YNA 'there', YMA 'here' and ACW 'yonder'
ly = letters of the alphabet
mo = words indicating epistemic modality EFALLAI 'perhaps', HWYRACH 'perhaps'
ne = the negator DIM 'no/not' both as quantifier and adverb
on = onomatopoeic-type forms
pa = politeness expressions
pe = determiners
pi = forms of PIAU, used to indicate ownership
qq = for obscure forms
r1 = personal pronouns
r2 = demonstrative pronouns
r3 = indefinite pronouns like RHYWUN 'someone'
r4 = negative pronouns
r5 = reflexive pronouns
r6 = reciprocal pronouns
r7 = conjunctive pronouns like FINNAU 'me too'
r8 = prefixed (possessive) pronouns
r9 = the 'alternative' pronoun LLALL 'other', LLEILL 'others'
rd = RHAID 'must, necessity'
ri = numbers
rp = universal pronouns like PAWB 'everyone'
rq = indefinite phrases like BETH+'NA 'thingie', LLE+'NA, BE+TI'+'N+GALW 'what do you call it'
sg = standard verbal pauses like YMM 'uhm'
sy = standard paralinguistic forms like HY-HY 'uh-uh', MM-MM 'uhm-uhm'
ya = manner-adverbial particle YN e.g. YN GYFLYM 'quickly'
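
When tallying categories, codes that share a first letter do not necessarily belong together: r1 to r9 and rp are pronouns, but rd, ri, and rq are not. The sketch below therefore groups codes by explicit lookup and not by prefix. The dictionary names, the family labels, and the function code_family are hypothetical groupings made for this example, using only the codes listed above.

```python
# Pronoun codes and their sub-types, copied from the list above.
PRONOUN_CODES = {
    "r1": "personal",
    "r2": "demonstrative",
    "r3": "indefinite",
    "r4": "negative",
    "r5": "reflexive",
    "r6": "reciprocal",
    "r7": "conjunctive",
    "r8": "prefixed (possessive)",
    "r9": "alternative",
    "rp": "universal",
}

# wh- word codes, copied from the list above.
WH_CODES = {
    "g1": "nominal",
    "g2": "adverbial",
    "g3": "PA 'which'",
    "g4": "compound",
    "g5": "FAINT 'how much/many'",
}


def code_family(code):
    """Return a coarse family label for a lexicon category code."""
    if code in PRONOUN_CODES:
        return "pronoun"
    if code in WH_CODES:
        return "wh-word"
    return "other"


# rd (RHAID) and ga (answer words) share a first letter with the
# families above but fall outside them.
families = [code_family(c) for c in ["r7", "rd", "g3", "ga"]]
```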

Multi-membership, if found in the corpus, is indicated by the CHILDES convention for this, that is, a backslash after the first entry, followed on the succeeding line(s) by another entry. These categories serve only to identify data which can be recovered for analysis. They are not intended to represent probing analyses.

Acknowledgements

This database in CHILDES format was produced by a project that was funded by the Economic and Social Research Council (ESRC) of the UK with an award of £60,611 (R000237978).