York Corpus

Bernadette Plunkett
Department of Linguistics
University of York


Cécile De Cat
Department of Linguistics
University of Leeds


Participants: 3
Type of Study: naturalistic
Location: France, Belgium, Canada
Media type: audio
DOI: doi:10.21415/T5NK63

Browsable transcripts

Download transcripts

Link to media folder

Citation information

Publications using these data should cite at least one of the following articles:

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Additional studies that make use of the York corpus include:

Project Description

This directory contains transcripts from a study of three children acquiring French. They were collected and compiled during a project entitled “The Syntactic Acquisition of Wh-Questions in French: a cross-dialectal comparison”, run from the University of York (UK). Data collection began in early 1997.

The project involved an 18-month study of three children, each one a speaker of a different dialect of French. The children were taped fortnightly for approximately half an hour in a familiar environment. The sessions were videotaped and separately audio-recorded using Sony professional cassette recorders. The three fieldworkers collecting the data were all native speakers of French. Initial transcriptions were in most cases done by these investigators on the basis of the audiotape, then checked against the video and coded by the research assistant on the project, Cécile De Cat, a native speaker of Belgian French. The names used for the target children in these corpora are all pseudonyms. The data are in French, without English glosses. Comments are in English.

Researchers who require more information, as well as any using data from the York corpus, are asked to contact Bernadette Plunkett by email and to send her copies of any research papers using these data. The conventions used in this corpus are under constant re-evaluation; users with comments, or anyone who notices inconsistent application of the conventions listed below, are also asked to contact her with details. The corpus has recently been digitised and the digital sound stream has been used to double-check the consistency of certain aspects of transcription, but since permission for public release of the audio corpus was not originally sought from participants, only the transcripts have been donated.

The Belgium corpus contains 36 chat files, Liea001.cha-Liea036, which correspond to the transcripts of the Belgian child (Léa, Liège) from 2;8.22 to 4;3.21. The Canada corpus contains 36 chat files, Mona001.cha-Mona037, which correspond to the transcripts of the Canadian child (Max, Montréal) from 1;9.19 to 3;2.23. The France corpus contains 35 chat files, Para001.cha-Para035, which correspond to the transcripts of the French child (Anne, Paris) from 1;10.12 to 3;5.4. Other children were also present during some recording sessions, but only two of them have a significant presence: Pol (born on 21-AUG-1992), who is Max's brother, and Lore (born on 6-MAR-1995), who was at the same childminder’s as Anne. The sessions during which they were present are represented in a table in the Canadian and French sections respectively, together with a calculation of their age in those sessions.
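Ages in this documentation are written in the years;months.days notation (for example 2;8.22). The following is a minimal sketch of how such an age could be computed from a birth date and a session date. It is not the procedure used by the project, and the function names `add_months` and `chat_age` are hypothetical. Léa's birth date is taken from the Belgian section; the session date is an assumption chosen for illustration only.

```python
import calendar
from datetime import date


def add_months(d: date, n: int) -> date:
    """Return d shifted by n months, clamping the day to the month length."""
    total = d.year * 12 + (d.month - 1) + n
    year, month0 = divmod(total, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))


def chat_age(birth: date, session: date) -> str:
    """Format the age at `session` as years;months.days (no zero padding)."""
    if session < birth:
        raise ValueError("session date precedes birth date")
    # Count whole months elapsed, stepping back one if we overshoot.
    months = (session.year - birth.year) * 12 + (session.month - birth.month)
    if add_months(birth, months) > session:
        months -= 1
    # Remaining days are counted from the last whole-month anniversary.
    days = (session - add_months(birth, months)).days
    years, months = divmod(months, 12)
    return f"{years};{months}.{days}"


# Birth date from the documentation; the session date is hypothetical.
print(chat_age(date(1994, 5, 17), date(1997, 2, 8)))  # 2;8.22
```

Counting whole months first and then the leftover days avoids negative day counts when the birth day does not exist in the month before the session (for instance a child born on the 31st).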

Warnings and Codes

The following are some specific warnings regarding these transcriptions.

In some instances, the child utters a vowel in between [e] and [a] as a word. When it appears in front of a noun, we interpret this vowel as the precursor of le / la, i.e. not yet a real determiner, but not a missing determiner either. When used before a verb it is most likely to represent a pronoun (either subject or object). The transcription of such elements using an 'e' on its own should not be confused with the /e/ which appears in any phonological transcription preceded by the %pho symbol; in that case ‘e’ represents a half close front vowel.

Missing and uninterpretable elements

  • 0 : Used in the York corpora only where the identity of a missing word is uncertain.
    0 vient. [= il vient or elle vient]
  • (xx) : Used where the missing word is clearly identifiable; the word itself appears inside parentheses.
    (tu) veux. [where the interpretation is clearly not je veux]
  • (xx_xx) : Used to indicate material which has been syntactically elided rather than erroneously omitted, in order to aid understanding, especially when utterances found by syntactic searches may be looked at out of context.
    (tu_veux_acheter) une poupée?
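The three codes for missing material described above can be told apart mechanically. The sketch below is an illustration only, assuming whitespace-separated tokens; the names `classify` and `omissions` are hypothetical and not part of any corpus tool. Tokens that are only partly parenthesised, such as c(e), are deliberately ignored because they do not mark a missing word.

```python
import re

# A whole token in parentheses containing at least one underscore: elided material.
ELIDED = re.compile(r"^\(([^()\s]+_[^()\s]+)\)$")
# A whole token in parentheses without underscores: an identifiable missing word.
OMITTED = re.compile(r"^\(([^()\s_]+)\)$")


def classify(token):
    """Return (kind, text) for an omission code, or None for ordinary tokens."""
    if token == "0":
        return ("uncertain", None)
    match = ELIDED.match(token)
    if match:
        return ("elided", match.group(1).replace("_", " "))
    match = OMITTED.match(token)
    if match:
        return ("identifiable", match.group(1))
    return None


def omissions(utterance):
    """List (token, kind, text) for every omission code in an utterance."""
    found = []
    for token in utterance.split():
        result = classify(token)
        if result is not None:
            found.append((token, result[0], result[1]))
    return found


print(omissions("(tu_veux_acheter) une poupée?"))
print(omissions("c(e) est (tu) veux ."))
```

The second call reports only (tu): the token c(e) contains parentheses but is not wholly enclosed in them.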

French motherese

Onomatopoeia (marked @o)

The corpus from France

Anne is the first-born child (18-JUN-1995) of a couple living in Paris who are originally from the south of France (Pyrénées region). Her mother (Anne-Gaël, born SEP-1967) is a landscape architect, and her father (Denis, born 30-DEC-1963) was temping in show business at the time of the recordings. Anne’s little brother (Jonas) was born during the study (born 8-JAN-1998).

The sessions in this corpus take place in different locations and occasionally are spread over more than one day. Most are in the family home in Paris but during the day, Anne often goes to a child-minder and some sessions take place at her house. The child-minder also looks after another little girl called Lore (born 6-MAR-1995). Some sessions were recorded by the family themselves when they were away from Paris. The only language Anne is exposed to is French.

Several nicknames marked @n are used for Anne, in particular Nounette and Bibiche. The family also uses a word of its own, marked @f: papouille (F), meaning a kiss, which would be câlin, baiser in standard French.

Some special attention was paid to the transcriptions of Anne's pronunciation in the first few files since she was rather difficult to understand. In many cases, information is given as to the way she pronounced words, in case these had been misinterpreted; when her utterances were felt to be unintelligible, a rough phonetic transcription was given. However, this information has to be used with care. Since pronunciation was not a focus of our study, only when a note of the pronunciation is made can it be inferred that a word was not pronounced in the target form. Even after some development had taken place, the changes from the adult pronunciation were too numerous to list. When no phonetic transcription or note appears, the interpretation was considered to be clear, but nothing should be inferred as to the way Anne pronounced the string in question.

Forms have not been marked as dialectal in this corpus because the French spoken in France is usually taken as the standard against which other varieties are compared. Where non-standard forms were used, they are followed by $SF, with a rendering in standard French.

The corpus from Canada

Max is the second child of Martine (27-AUG-1965) and Dominique (2-MAR-1965). He has an older brother called Pol (born 21-AUG-1992). At the time of recording Dominique was a financial controller; his highest diploma is a Baccalauréat (degree) in administration. Martine worked in a shop and taught literacy part-time; her highest diploma is a Baccalauréat (degree) in education.

Catherine, the interviewer, who was the initial transcriber for about half of the corpus, comes from Acadia, where a slightly different dialect is spoken. Where possible, dialectal traits were tagged (either with @d or with [$D]). However, the first-pass transcription for part of the corpus was done by a speaker of Swiss French, and the research assistant who checked and coded the transcriptions is a native speaker of Belgian French, not Canadian French; as a result, not all of the dialectal traits may have been identified as such. Where the transcriber was unsure of the status of a non-standard form, this may be merely coded with an indication of the standard French form following it [$SF: ], rather than [$D: *].
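Since dialectal and non-standard forms are flagged with bracketed [$D: ...] and [$SF: ...] annotations, they can be retrieved with a simple pattern search. The sketch below is one possible way to do this in Python and is not a corpus tool; the function name `annotations` is hypothetical, and the sample lines are taken from the examples in this documentation.

```python
import re

# Matches [$D: ...] and [$SF: ...], capturing the tag and the enclosed text.
ANNOTATION = re.compile(r"\[\$(D|SF):\s*([^\]]*?)\s*\]")


def annotations(line):
    """Return (tag, text) pairs for every [$D: ...] or [$SF: ...] in a line."""
    return [(m.group(1), m.group(2)) for m in ANNOTATION.finditer(line)]


lines = [
    "*CAT: je [$D: vais] t' aider un peu .",
    "*CAT: i(ls) ont [$SF: sont] tombé ?",
    "*MAX: mais je (ne) sais pas il est où .",
]
for line in lines:
    print(line, "->", annotations(line))
```

Other bracketed material, such as [= cars] or [% pho: fEt], is left alone because the pattern requires the `$` prefix.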

Max is not a particularly voluble child but his brother Pol is very talkative. Efforts were made to tape on occasions when Pol was not in the house since he tended to dominate the conversation. When present Pol was encouraged to participate in activities with another adult. Max is referred to by several nicknames, marked @n: Lou, Loulou, Tutu (only used by Pol). Max sometimes refers to Catherine as Tine.

The following words, marked @d in the corpus, belong to Canadian French. An explanation of their use or corresponding form in standard French is given below, as well as a translation in English.

Aside from lexical variation, there are numerous syntactic differences between the Canadian and standard varieties of French, of which the major ones seen in the corpus are as follows:


The names of certain means of transport which are masculine in standard French are feminine in Canadian French (bus,...). Sometimes, this “feminisation” seems to be extended to other words of this class, with hélicoptère being referred to as elle. However, due to the velar characteristic of the ‘l’ in il, the preceding vowel sounds more open than in standard French (so that elle may have been transcribed when in fact the speaker meant il). But in a passage where the participants are playing with toy cars, the singular (of voiture/auto) is treated as feminine before the following utterance, where the nouns are masculine but are used with a feminine-sounding adjective.
*CAT: [$SF: elles] sont toutes prêtes, [$SF: celles+là] [= cars] Là!
Here, though, the use of ceux-là is quite clear.
The quantifier tout is pronounced [tut] in Canadian French, even when it refers to masculine entities. When the final ‘t’ was pronounced in a context in which it would not have been in standard French, the sign @d was usually added to it; sometimes the pronunciation is mentioned explicitly, but in most cases it has simply been transcribed with the feminine form instead.
Fait is often pronounced [fEt], as if it were feminine, even though it is masculine:
*CAT: il s' est fait@d [% pho: fEt] mal où ?
Aller variably takes the apparently third person form va in the first person singular which is vais in standard French:
*CAT: je [$D: vais] t' aider un peu .
In clefts, where the null operator has a first singular referent, the agreement is as in English clefts, with a default third person form, while in standard French, full agreement is expected.
*CAT: c(e) est TU moi qui [$SF: ai] Crocro ?


The strong pronominals lui and elle, which in standard French would not be used deictically if referring to inanimates, are sometimes used this way.
*MAX: pi # et (1)elle@d, c(e) est à moi .
*CAT: apporte LE, (1)lui@d !
Certain groups of postverbal clitics are ordered differently from standard French:
*CAT: envoie [$D: LE MOI] !

Markers of embedding

Complementiser que is often absent. Unlike in other varieties, its absence is not phonologically conditioned. To find these cases it is necessary to search for $D.
*CAT: [$D: tu veux que] je t' aide?
*MOT: [$D: est ce que] c(e) est Papa ?
In other cases, que is used where it would be absent in standard French:
*CAT: dessine l' habit à Donald, toi, pour voir quelle couleur [$D: 0] il est !
In the Acadian variety spoken by Catherine, si ‘whether’ is often followed by que:
*CAT: je (ne) sais pas [$D: si] ça marche comme ça .


Both direct and indirect interrogatives commonly have different forms from the standard. In Yes-No questions, the marker TU may appear, as mentioned in section One.
*MOT: il y a TU [$D: autre chose] là+dedans ?
In standard French Wh+est-ce que is essentially limited to matrix contexts; this is not true for Canadian varieties where it is standardly used in indirect questions and where it can even appear inside a PP apparently replacing a simple wh-word (see Plunkett, 2000).
*CAT: [$D: c(e) est pour quoi faire] ?
*CAT: à [$D: quoi] est ce qu' il joue, pol ?
*CAT: c(e) est [$D: QUI qui] fait coin@o coin@o ?
*CAT: [$D: là où] c(e) est rouge, ça s' attache probablement .
*CAT: je pense sur son doigt qu' il montrait [$D: où] c(e) était sale .
When the embedded clause is copular, a wh-word can stay in situ in an indirect question:
*MAX: mais je (ne) sais pas il est où .

Auxiliary choice

*CAT: ah ben il [$D: a] disparu
*CAT: tu T(E) [$D: es] trouvé une petite chaise ?
*CAT: i(ls) ont [$SF: sont] tombé ?

The corpus from Belgium

This corpus focuses on Léa (born 17-MAY-1994), who is the first child of Marc (born 7-APR-1963) and Dominique (born 12-SEP-1962). Léa has a little brother, Luc (born 27-DEC-1995). Marc is a physiologist, and was attending a course in osteopathy at the time of the project. Dominique has a secondary school diploma in marketing. She stopped working when she had Léa. During the project she was attending a course to become a chiropodist. The only speaker who is not from Belgium is Parrain, Léa's grandfather, who is French but has lived for many years in Belgium.

Léa is a very talkative child, with a powerful imagination. She often invents games or little stories, thinking that everybody believes what she says. Sometimes it is very difficult to make logical sense of what she is saying, because she is making things up as she goes along. From the 22nd session onwards, she got in the habit of calling her grandparents Monsieur and Madame, and using the vous form with them, possibly to emphasise the game-like character of the taped sessions.

All the recording sessions were conducted by the maternal grandmother with the collaboration of either her husband or her daughter for the filming. The sessions took place either at Léa’s home or at the grandmother’s. Léa was free to choose what she wanted to do, but was usually encouraged to play; often Léa's younger brother Luc was present but in the earlier sessions he is not yet speaking and in the later ones he rarely says anything.

Several nicknames marked @n are used for Léa: Blabla (this is also the name of a character in a TV program for kids), Divine, Loulou, Louloute, Minouche, Nouche, Nounou, Nounouche, Pimprinelle.

Many of the child-invented forms marked @c in the corpus have no clear meaning and are invented by Léa for fun. The meanings that can be discerned are listed below.

  • audruche : autruche (deformed by dialectal pronunciation: word-final [t] + word-initial [r] (sonorant) is pronounced [dr], as in petite robe, petite rue)
  • magique (une) : une manique (Belgian French)
  • pianer : faire du piano
  • refaudrait : faudrait encore / de nouveau
  • suitais : étais
  • tiendre : tenir
There are two family-peculiar forms marked @f:
  • bidou : bidon (tummy)
  • loup (un) : une crotte de nez : 'a bogie'
The words below marked @d in the corpus are given in Belgian French, followed by their counterpart in standard French, and a translation into English. When an example is needed for the sake of clarity, it is given in italics on the following line.
  • abille dépêche-toi 'hurry up'
  • baffe claque 'slap'
  • bidon ventre 'belly'
  • binette frimousse 'cute face'
  • bouger enlever
  • bouler envoyer (ailleurs), lancer 'throw'
  • carabistouilles bêtises 'nonsense or white lies'
  • chemisette maillot de corps 'vest'
  • cent et un cent un
  • chique (une) un bonbon 'a sweet'
  • craque (une) une sottise, une blague 'joke, nonsense'
  • d’abord tag similar to donc, alors
  • dîner (le) le déjeuner 'lunch'
  • donc (not used in the same contexts)
  • drache forte pluie 'downpour'
  • essuie (un) un torchon 'a tea towel'
  • histoires choses
  • lavette (une) un torchon 'cloth, used for cleaning'
  • licher lécher 'to lick'
  • mallette cartable 'briefcase or satchel'
  • manique (une) torchon ou poignée 'cloth', 'dishcloth', 'face cloth'
  • mémère commère 'gossip'
  • palette petite pelle 'shovel'
  • paraît (not used as a tag in this way)
  • pet / pette un pet, un derrière 'backside'
  • plasticine pâte à modeler 'plasticine'
  • quatre heures (le) le goûter 'tea', 'snack for kids'
  • ramassette pelle à poussières 'dustpan'
  • rattaquer recommencer 'give it another go'
  • rou(m)doudoum hop, voilà
  • s’il te plait voici 'here you are'
  • s’il vous plait s’il te plaît 'please', or 'here you are'
  • souper (le) dîner 'dinner'
  • tantôt tout à l’heure 'earlier' or 'in a moment'
  • tchinisse un rien du tout, 'little bit of fluff'
  • tenture(s) rideau(x) 'curtain(s)'
  • tévé télé(vision) 'TV'
  • tiens donc! alors! (exclamation)
  • toi! (not used as a tag in this way)
  • torchon serpillère 'dish cloth', 'floor cloth'
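The special-form markers mentioned in this documentation (@n for nicknames, @f for family-specific words, @d for dialectal forms, @c for child-invented forms and @o for onomatopoeia) are suffixed to the word they describe, so marked words can be collected with a short script. The following is a sketch under that assumption; the function name `marked_forms` and the label wording are hypothetical, and only the five markers named here are handled.

```python
import re
from collections import defaultdict

# Labels for the markers named in this documentation (wording is ours).
MARKERS = {
    "n": "nickname",
    "f": "family-specific form",
    "d": "dialectal form",
    "c": "child-invented form",
    "o": "onomatopoeia",
}

# A word followed by @ and one marker letter, not followed by another letter.
MARKED = re.compile(r"([^\s@\[\]]+)@([ncfdo])(?![A-Za-z])")


def marked_forms(line):
    """Group the marked words in a transcript line by marker label."""
    found = defaultdict(list)
    for word, code in MARKED.findall(line):
        found[MARKERS[code]].append(word)
    return dict(found)


print(marked_forms("*CAT: c(e) est [$D: QUI qui] fait coin@o coin@o ?"))
print(marked_forms("*MAM: essaie un peu@d celui+là pour voir !"))
```

Note that in multi-word expressions such as un petit peu@d, the marker is attached to the last word only, so only that word is returned.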
Aside from this lexical variation, there are a number of syntactic peculiarities and fixed expressions that are found.


A number of phrases, un (petit) peu, (pour) une fois, (pour) voir, seem to have a special intensifying use after verbs in the imperative in this variety. They can occur in combination with each other.
*MAM: raconte MOI un petit peu@d pour une fois !
*MAM: essaie un peu@d celui+là pour voir !
*MAM: va un peu@d voir !
*PAR: aha attends voir que je regarde un petit peu !
*PAR: [$D: mets LES (pour) qu' on voie de quoi tu as l' air] !


A number of verbs which take infinitival complements with à or with a bare infinitival may have their complement introduced by de in Belgian French.
*MAM: Léa, elle [$D: aime bien courir] aussi .
*LEA: mais moi, j(e) [$D: aime mieux te voir], Mamy .
*MAM: t(u) (n') [$D: aimes pas nettoyer] ?
*MAM: bon si tu [$D: continuais à laver] ta poupée, maintenant
*MAM: [$D: continue à faire] ce que tu faisais Là !
Direct quotation is often introduced by the finite complementiser que.
*LEA: non elle (n') est plus fatiguée, [$D: elle a dit] .

Modals and auxiliaries

Causative faire is usually replaced by mettre.
*MOT: tu LA [$D: fais sécher] où ?
*MOT: tourne le bouton pour [$D: faire cuire] les oeufs !
Pouvoir to indicate ability is usually replaced by savoir.
*LEA: c(e) est QUI qui [$D: peux] ME (LE) [= the balloon] gonfler ?
*LEA: tu [$D: peux] avancer un peu ta chaise ? [= so she can go past]
Pouvoir has a special use in the expression ne pouvoir mal (de):
*LEA: [$D: je ne risque rien] ?
*MOT: en manger, elle ne [$D: risque pas] .
Avoir has a special use in the expressions avoir facile, avoir difficile, avoir bon:
*MAM: tu [$D: y arriveras mieux] .
*MAM: et [$D: on est bien], tous ensemble après .
*MAM: [$D: tu aimes bien ... dire ] qu' il ne faut pas dire ouais ["] .
Faire has a special use in which it is used as an auxiliary followed by a past participle:
*MAM: ah voilà ce [$D: qu' il est écrit] .


The study was funded by an Economic and Social Research Council grant to Bernadette Plunkett, #R000221972.