CHILDES Welsh CIG1 Corpus

Bob Jones
Department of Education
University of Wales


Participants: 7
Type of Study: naturalistic
Location: Wales
Media type: no longer available
DOI: doi:10.21415/T5N593

Publications using these data should cite:
Aldridge, M., Borsley, R. D., Clack, S., Creunant, G., and Jones, B. M. (1998). The acquisition of noun phrases in Welsh. In Language acquisition: Knowledge representation and processing. Proceedings of GALA '97. Edinburgh: University of Edinburgh Press.

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

This project (C.I.G, for Caffael yr Iaith Cymraeg, “Acquisition of the Welsh Language”) was based in the Linguistics Department of the University College of Wales, Bangor and Aberystwyth. The Principal Researchers were Dr. Robert Borsley, Dr. Michelle Aldridge, Prof. Ian Roberts (in Bangor), and R. Morris-Jones (in Aberystwyth).

The full-time research assistant in Bangor was Susan Clack; in Aberystwyth, Gwennan Creunant was employed on a part-time basis. The project was initially to run for 12 months (from January 1996), but a subsequent extension on the basis of unspent monies gave another three months' employment for both research assistants. This documentation file was written by Susan Clack on July 17, 1997.

The aims of the project as outlined in the ESRC grant application were:

Transcriptions were taken from audiotapes only, usually 45 minutes long. Roughly 9 months of recordings for each of six children (plus 4 months for another) are included, from approximately 18–21 months to 28–30 months of age. The purpose was to tape naturalistic, spontaneous utterances, not to do specific elicitation.

Factors of sex of child and position in the family were not considered in choosing participants. The area from which the children were drawn is predominantly Welsh speaking (60–70% and higher in some of the villages). All the parents (apart from one, who learned Welsh from 3;0) were first-language speakers of Welsh.

There are two components to this corpus. One is the Bangor dataset and the other is the Aberystwyth dataset. In Bangor, three participants (one female first child, one male second child, one female second child) were forthcoming. The first dropped out after 4 months. Tapes of a further child (first-born male) previously recorded in 1994 and roughly transcribed for a pilot study (Borsley and Aldridge) were totally redone in CHAT format. The Bangor files were prepared with the CHILDES editor.

Transcripts of a further three children (all first-born female) were prepared simultaneously in Aberystwyth by Gwennan Creunant. These files have been converted to CHAT format by the Principal Researcher in Aberystwyth, who had previously worked with his own computer programs (for morphological tagging and glossing) for the analysis of the speech of older children. Tapes were made fortnightly in the initial stages (mainly one-wordish) and weekly (holidays and illness permitting) in the later stages.

In Bangor there are 27 tapes in the ALAW Corpus, 26 in the RHYS Corpus. Taping of ELIN ceased after 11 sessions. It was decided to redo 29 of the 42 transcriptions of DEWI into CHAT. This means there are 93 transcriptions of Bangor area children. Transcriptions on the whole follow standard orthography (see below for exceptions) with occasional phonographic representations. Standard Welsh orthography is basically phonetic which makes a phonographic representation feasible. All speakers spoke basically Northern dialects.
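The Bangor total can be checked with a short tally. This is only a sketch of the arithmetic; the dictionary name and layout are illustrative and are not part of the corpus distribution.

```python
# Transcription counts per Bangor child, as stated in the paragraph above.
bangor_transcripts = {
    "ALAW": 27,
    "RHYS": 26,
    "ELIN": 11,  # taping ceased after 11 sessions
    "DEWI": 29,  # 29 of the 42 earlier transcriptions were redone in CHAT
}

total = sum(bangor_transcripts.values())
print(total)  # 93
```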

The Aberystwyth corpus comprises 75 files. Transcriptions are generally phonographic. In Aberystwyth, on the mid-Wales coast, Southern and Northern dialects are heard and this is reflected in these phonographic transcriptions.

The Bangor Dataset

All the Bangor children live in the Arfon area of Gwynedd, North Wales. In this area Welsh is spoken by approximately 70% of the population, but this figure is higher in some of the villages where the children live. The education policy of the area is largely monolingual Welsh until at least 7;0. Otherwise, bilingual policies of administration are the norm in the public sector.

For the Bangor tapes the same toys were used in all sessions. These included a suitcase of animals, cars, a Fisher-Price™ parade of shops, Barbie dolls and Action Men with clothes, plus odds and sods. In some transcripts, the child’s mother, grandmother, grandfather, father and/or siblings are present. The investigator was present on all occasions (except one of Elin and a few of Dewi, which were mainly tapes made by parents for various reasons). The participants vary from corpus to corpus and day to day. In the majority the investigator and the child are alone for most of the time. In the case of Dewi, the same toys were also used but the taping was usually done in the investigator’s home, Dewi living close by in the same village.

All Bangor transcripts represent at least 30 minutes of tape time (45 minutes of most Kevin tapes). Some are longer for a variety of reasons, such as quality, ease of transcription, unusual quietness of child, sibling dominance in parts of tape, and so forth. All but a handful of tapes were made between 9:30 and 10:30 A.M. Times of the few that were not are noted in the initial headers. All but a handful of tapes were transcribed on the same day as taping, so the context was fresh. Generally, an attempt has been made to add background and contextual information, especially where utterances may be ambiguous. Efforts have also been made to add specific remarks about potentially ambiguous or odd utterances on %com lines, for example, the shaking of a head (when recalled) for negation where the form is declarative.

About 70% of the tapes have been listened to by an independent checker with the transcript available. Comments were made on transcripts. The tapes were then listened to again in full by the transcriber with the checked transcripts. Amendments and corrections were then made, elevating best guesses to full status, interpreting xxx’s, and adding and reiterating %com lines, especially in relation to the intonational status of an utterance. All transcripts not independently checked (except those of Dewi, which were second transcriptions) have been checked against the tape a second (or in some cases third) time by the transcriber. Here we have tried to be true to what is heard rather than what we know to be prescriptively correct. This comment is particularly relevant with regard to mutations (on the nonchild lines).

In the documentation of each corpus, which follows at the end of this file, there are comments for all children except Dewi. These comments were usually made straight after transcribing and are of a general background nature with some impressions as to development.

The Bangor research assistant would like to thank the following: First and foremost the parents of the children for their unstinting cooperation. Also, Bill Hicks of Cysyll for installing the Welsh Spellchecker, Professor Cathail O’Dochartaigh (formerly of Cysyll, Bangor University and now of Glasgow University), Dr. Margaret Deuchar for advice in the early stages, Dr. Michelle Aldridge for her patience and good sense, Gwennan in Aberystwyth for sharing the lows of transcription work, Vivienne Pritchard for cheerfully checking, and many others including the members of the Manchester/Bangor reading group into Child Language, in particular Ginnie Gathercole, Marilyn Vihman, and Elena Lieven.

CHAT Usage

Generally, CHAT conventions have been followed (hopefully) but there are some divergences. These are noted here together with some general points.

Disambiguation Devices

All Welsh/Welsh and Welsh/English homonyms have been disambiguated in the Bangor corpus (for the purpose of aiding glossing in Aber and making word lists). A variety of ways have been adopted to do this and details can be found in the following introduction to the lexicon and in the lexicon itself. The only disambiguation that has not been done is that of the predicate marker “yn” on the non-CHAT lines. The disambiguation codes for common words are indicated below:

Representation of English

The representation of English words in this corpus has posed some problems. Many English words have orthographic Welsh forms that might well be considered part of the Welsh language, for example, doli>doll. Welsh orthography is used where it seems appropriate. In this sense the transcripts are not strictly phonographic. For example, ice cream is represented by ice+cream@s and not eis+crim (see Aber corpus) whether or not it is pronounced in the English way or the Welsh way.

The specific conventions adopted to address these concerns were as follows:

This notation has been used partly to eliminate English for the purposes of analysis, to identify chunks of code-switching and to eliminate homophonous (with Welsh) forms. It will be the case that some words marked with @s will also appear in xs or xxs strings. The main motivation for the policy adopted here is to maintain a constant representation of Welsh phonography/orthography/phonology in contrast to that of English. In mind is the fact that not everyone who may look at this corpus will be as well versed in English as Welsh speakers tend to be in Wales. Often, decisions as to how to mark words have felt arbitrary but the preceding guidelines have been followed as far as possible.
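As a rough illustration of how the @s marking can be used to set English aside for analysis, the sketch below separates marked from unmarked tokens. It assumes tokens are whitespace-separated and carry no trailing punctuation. The function name is hypothetical and the sample utterance is invented for illustration (only ice+cream@s is taken from the text above). The xs and xxs strings are not handled here.

```python
def split_by_marker(tokens, marker="@s"):
    """Split tokens into (marked, unmarked) by a trailing marker such as @s."""
    marked = [t for t in tokens if t.endswith(marker)]
    unmarked = [t for t in tokens if not t.endswith(marker)]
    return marked, unmarked

# Invented sample utterance, not taken from the corpus.
sample = "dw i isio ice+cream@s".split()
english, rest = split_by_marker(sample)
print(english)  # ['ice+cream@s']
print(rest)     # ['dw', 'i', 'isio']
```

The same function could be called with marker="@o" to pull out onomatopoeic forms.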

Notations and Orthography

The symbols @s, xs, xxs, @o (onomatopoeic words) and markings for homonyms have been added after initial transcription in the later stages. As far as is possible the context in each case has been checked for accuracy. Sometimes something like woofwoof may appear marked with either @o or @c. This is relevant for categorization purposes. The same applies to the words “bang” and “bwm.”

Conventional spellings are used for the most part. There are a very few exceptions: is-da>eistedd=sit; isio>eisiau=want; plus verbal and prepositional forms noted below.

In the lexicon, the usual dialect form or spoken form occurs after “@”. Most of these cases are subject to regular rules: 1) words with “e” in final syllable going to “a”; 2) dropping of silent “f” in words like nesaf, af, and so forth.

In the case of inflected prepositions (especially inflected forms of “gan”), there are different orthographic representations. The same applies to a few verbs such as rhoi/rhoid/rhaed = give.

In the cases where conventional spellings are not used, the conventional form (or root form) occurs after a dash. Alternatives, either English or Welsh, occur after slashes. English translations occur after equals signs. Alternative Welsh forms for English words appear after >. If they do not occur in the corpus they are marked with *. If they do occur, % follows. Welsh words that appear with the English “s” plural appear in a category: nple. This notation is usually used where a Welsh plural could (and may otherwise) be expected. There will be a handful of plurals where the English plural morpheme is well established. The categories used are: n=noun, vn=verbnoun, a=adjective, av=adverb, wh=wh word, fb=finite be, p=preposition, ip=inflected preposition, fv=finite verb, loc=locative adverb, g=greeting (or like). Soft mutations are indicated by ^, nasal by ^^, and aspirate by ^^^.
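The category codes and mutation markers listed above can be held in simple lookup tables. The following is a minimal sketch, not a tool shipped with the corpus: the names CATEGORIES, MUTATIONS and mutation_of are hypothetical, and because the text does not say where in a lexicon entry the carets are placed, the function simply looks for the longest run of carets anywhere in the string.

```python
import re

# Category codes as listed in the paragraph above.
CATEGORIES = {
    "n": "noun", "vn": "verbnoun", "a": "adjective", "av": "adverb",
    "wh": "wh word", "fb": "finite be", "p": "preposition",
    "ip": "inflected preposition", "fv": "finite verb",
    "loc": "locative adverb", "g": "greeting (or like)",
    "nple": "Welsh word with English 's' plural",
}

# Number of carets -> mutation type.
MUTATIONS = {1: "soft", 2: "nasal", 3: "aspirate"}

def mutation_of(entry):
    """Return the mutation named by the longest caret run in entry, or None."""
    runs = re.findall(r"\^+", entry)
    if not runs:
        return None
    return MUTATIONS.get(max(len(run) for run in runs))

# "word" is a placeholder, not a corpus entry.
print(mutation_of("word^^"))  # nasal
print(CATEGORIES["ip"])       # inflected preposition
```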

The number of repeats of a word appears first. In later files these have almost totally been replaced with retracing symbols. This seemed more appropriate after the early one and two word stages. Proper nouns appear with capital letters. English forms have not been marked on these yet.

In the CHAT files, only pronunciation forms follow the target form in brackets [:]. This is not the usual (text replacement) use of these brackets with CHILDES. These phonographic representations, made possible because Welsh orthography is phonetic to a high degree, are not consistently done but have been added to give a flavor of the child’s phonological competence.
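Because the bracket order here is target form first and pronunciation second, anyone extracting forms automatically may want to read the pairs explicitly. The sketch below assumes the pattern is a single word followed by [: ...] on the same line. The function name and the sample line are invented for illustration and are not taken from the corpus.

```python
import re

# One word (the target form) followed by "[: pronunciation]".
PRON = re.compile(r"(\S+)\s+\[:\s*([^\]]+?)\s*\]")

def pronunciation_pairs(utterance):
    """Return (target, pronunciation) pairs found in one transcript line."""
    return PRON.findall(utterance)

# Invented sample line.
print(pronunciation_pairs("mae ceffyl [: ceffl] yma"))
# [('ceffyl', 'ceffl')]
```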


This project was funded by a grant from the Economic and Social Research Council (ESRC) of the UK.