Yip-Matthews Bilingual Corpus


Virginia Yip
Dept. of Modern Languages & Intercultural Studies
Chinese University of Hong Kong


Stephen Matthews
School of Humanities (Linguistics)
University of Hong Kong


Huang Yue-Yuan
Language Centre
Hong Kong Baptist University


Participants: 8
Type of Study: longitudinal
Location: China
Media type: audio, video
DOI: doi:10.21415/T5MS3Q


Citation information

Articles using the Timmy corpus should cite:
Yip, V. and S. Matthews. (2000) Syntactic transfer in a bilingual child. Bilingualism: Language and Cognition 3.3, 193-208.

Articles using the Sophie corpus should cite:
Matthews, S. & V. Yip. (Forthcoming) Relative clauses in early bilingual development: transfer and universals. In Giacalone, A. (ed.) Typology and Second Language Acquisition. Mouton de Gruyter.

Additional references, and publications based on the Hong Kong Bilingual Child Language Corpus:

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Here is a lovely press release on this project:

Project Description

This corpus investigates the language acquisition of five children exposed to both Cantonese and English regularly from birth. The corpus currently contains files from Alicia, Timmy, Sophie, Kathryn, and Llywelyn. Investigation of Alicia's bilingual development was undertaken as part of the project “Multimedia Perspectives On Bilingual Development” funded by the Research Grants Council of the Hong Kong Special Administrative Region, China (ref. no. CUHK4014/02H) awarded to Virginia Yip (Chinese University of Hong Kong) and Stephen Matthews (University of Hong Kong) and the project “Language Differentiation in Bilingual Acquisition” funded by a CUHK Direct Grant (2003-2004). We gratefully acknowledge the support and help of Chen Ee San, Michelle Li and Uta Lam: a dedicated team who became part of the family and friends of the children. Brian MacWhinney's support saw through every technical aspect of the construction of the first fully video-linked set of transcripts in the Hong Kong Bilingual Child Language Corpus. We are indebted to him for his expert advice and innovative ideas to make the multimedia corpus a reality.

Alicia

Born in Hong Kong on 28 May 2000, Alicia is the third of three siblings, the younger sister (mui4mui2) of Timmy (7 years older) and Sophie (4 years older). The family background is described under the entries for Timmy and Sophie. Apart from parental input following the one parent—one language pattern, interaction with her brother and sister took place in both Cantonese and English. While there is the possibility of non-target English input from the siblings, by this time they were both attending an English primary school and speaking increasingly standard British English. Alicia began attending a local Chinese kindergarten in the morning at 2;03, and an English-speaking kindergarten in the afternoon from 3;03. The kindergartens were each monolingual in the respective language, apart from basic English lessons at the Chinese kindergarten.

Regular recording of Alicia began in September 2001 at age 1;03 and still continued at the time of release of her data when she was 4;04. Video and audio recordings were made regularly for approximately half an hour in each language. Unlike the Timmy and Sophie corpora which feature interaction with two research assistants, in Alicia’s corpus she interacts in English with her father and in Cantonese with a research assistant who speaks Cantonese natively. This practice eliminates non-native input, and also facilitates the investigation of questions involving parental input. As in the cases of Timmy and Sophie, a language diary was kept throughout the recording period.

The corpus in the current release covers transcripts from age 1;03.10 up to 3;00.24, on an approximately biweekly basis. Alicia lived in Hong Kong continuously throughout this period of recording, apart from brief visits to Singapore (three-day visits at 1;03.02, 2;04.27, and 4;02.03) and Hawaii (two-week visits at 3;02.17 and 4;02.20). Her caretakers were her maternal grandmother, who spoke Chiu Chow (a Southern Min dialect) as her native tongue and Cantonese when addressing the children; a Filipino domestic helper, Belma, who spoke fluent English and some Cantonese, from birth to 2;11; and, from age 2;10 onwards, an Indonesian helper, Siti, who spoke fluent Cantonese and some English.

In her pattern of language development Alicia resembles Sophie, with Cantonese developing faster than English during the recording period and extensive transfer from Cantonese to English. This is the result of less than balanced input, with relatives other than her father speaking predominantly Cantonese. In personality she is more like Timmy, being reserved and shy of strangers. As can be observed in the recordings, she often needs time to ‘warm up’. On the other hand, when necessary she will stand up for herself vociferously: when a friend boasted of being able to speak English, she exclaimed ngo5 dou1 zi1dou6 gong2 Ing1man4 ge3 ‘I know how to speak English too’ (2;11.28), where in place of the target verb sik1 ‘know (how)’ she uses the verb zi1dou6 ‘know (a fact)’, no doubt under English influence.

The 80 transcripts are coded following the same system used for the other children in the corpus.
File no.  File name (Acyymmdd)  File no.  File name (Aeyymmdd)  Age of CHI
1.        Ac010907              41.       Ae010907              1;03.10
2.        Ac010921              42.       Ae010921              1;03.24
3.        Ac011005              43.       Ae011005              1;04.07
4.        Ac011019              44.       Ae011019              1;04.21
5.        Ac011102              45.       Ae011102              1;05.05
6.        Am011119                                              1;05.22
7.        Ac011211              46.       Ae011211              1;06.13
8.        Ac011226              47.       Ae011226              1;06.28
9.        Ac020112              48.       Ae020112              1;07.15
10.       *Ac020127             49.       Ae020127              1;07.30
11.       Ac020210              50.       Ae020210              1;08.13
12.       Am020214                                              1;08.17
13.       Am020303                                              1;09.03
14.       Ac020310              51.       Ae020310              1;09.10
15.       Ac020413              52.       Ae020413              1;10.16
16.       Ac020426              53.       Ae020426              1;10.29
17.       Ac020503              54.       Ae020503              1;11.05
18.       Ac020524              55.       Ae020524              1;11.26
19.       Ac020610              56.       Ae020610              2;00.13
20.       Ac020623              57.       Ae020623              2;00.26
21.       Ac020628              58.       Ae020628              2;01.00
22.       Ac020713              59.       *Ae020713             2;01.15
23.       Ac020809              60.       Ae020809              2;02.12
24.       Ac020816              61.       Ae020816              2;02.19
25.       Ac020830              62.       Ae020830              2;03.02
26.       Ac020913              63.       Ae020913              2;03.16
27.       Ac021011              64.       Ae021011              2;04.13
28.       Ac021022              65.       Ae021022              2;04.24
29.       Ac021101              66.       Ae021101              2;05.04
30.       Ac021115              67.       Ae021115              2;05.18
31.       Ac021130              68.       Ae021130              2;06.02
32.       Ac021207              69.       Ae021207              2;06.09
33.       Ac030107              70.       Ae030107              2;07.10
34.       Ac030125              71.       Ae030125              2;07.28
35.       Ac030222              72.       Ae030222              2;08.25
36.       Ac030315              73.       Ae030315              2;09.15
37.       Ac030322              74.       Ae030322              2;09.22
38.       Ac030412              75.       Ae030412              2;10.15
39.       Ac030426              76.       Ae030426              2;10.29
40.       Ac030503              77.       Ae030503              2;11.05
41.       Ac030517              78.       Ae030517              2;11.19
42.       Ac030607              79.       Ae030607              3;00.10
43.       Ac030621              80.       Ae030621              3;00.24
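
The Age of CHI column uses the years;months.days notation. As a rough sketch only, an age in this notation can be derived from a date of birth and a recording date in Python. The function names and the rule for counting leftover days (days elapsed since the most recent monthly "anniversary" of the birth date) are assumptions made for illustration; the procedure actually used to compute the ages in this corpus is not documented here, and individual table entries may follow a different convention.

```python
import calendar
from datetime import date


def month_anniversary(birth: date, months: int) -> date:
    """Date falling `months` whole months after `birth` (day clamped to month length)."""
    year, month_index = divmod(birth.year * 12 + (birth.month - 1) + months, 12)
    last_day = calendar.monthrange(year, month_index + 1)[1]
    return date(year, month_index + 1, min(birth.day, last_day))


def childes_age(birth: date, recorded: date) -> str:
    """Format the age at `recorded` as years;months.days (assumed counting rule)."""
    if recorded < birth:
        raise ValueError("recording date precedes date of birth")
    months = (recorded.year - birth.year) * 12 + (recorded.month - birth.month)
    # Step back one month if this month's anniversary has not been reached yet.
    if month_anniversary(birth, months) > recorded:
        months -= 1
    days = (recorded - month_anniversary(birth, months)).days
    years, months = divmod(months, 12)
    return f"{years};{months:02d}.{days:02d}"


# Alicia: born 28 May 2000, file Ac010907 recorded on 7 September 2001.
print(childes_age(date(2000, 5, 28), date(2001, 9, 7)))  # 1;03.10
```

Under this rule the first row of Alicia's table (1;03.10) is reproduced from her date of birth and the date embedded in the file name.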

Timmy

The regular audio recording of Timmy took place between November 1994 and December 1996. The subject is the first-born of three siblings, born in Hong Kong on 21 May 1993. His first sister was born when he was 2;09.07 and his second sister when he was 7;00.07. His mother is a native speaker of Hong Kong Cantonese and his father of British English. Both are professors of linguistics at universities in Hong Kong.

Cantonese is the native language of some 90% of the population of Hong Kong. Although widely considered a dialect of Chinese, Cantonese is not mutually intelligible with Mandarin. It differs substantially from other forms of Chinese in grammar, as well as phonology and lexicon. There are six tones in Hong Kong Cantonese. The romanization used is the JyutPing system developed by the Linguistic Society of Hong Kong. IPA and Yale romanization equivalents are given in Matthews and Yip (1994:400-401).

Timmy’s exposure to Cantonese and English began from birth. His father took sabbatical leave in the USA when he was three months old, during which time English tapes were played to Timmy. The primary caretakers in this period were Timmy’s maternal grandmother, his mother and a Mandarin-speaking domestic helper. His parents took him to Los Angeles from age seven months to one year. He then spent the summer of 1994 in Canada, the UK and briefly in France. By the time regular audio recording started at 1;05.20, the live-in domestic helper was a Filipino woman who spoke fluent English. A trip to Australia was made at 3;01.17 and he visited his paternal relatives in England for three weeks at 3;02.28.

The parents followed the one parent—one language principle when addressing the child. The language between the parents is mainly Cantonese with a great deal of English mixed in, as is characteristic of the speech of Hong Kong middle class families. Despite the one parent – one language principle, the quantity of input from the two languages is by no means balanced: on the whole, Timmy had more Cantonese than English input in his first three years. The language of the community is Cantonese and the extended family (maternal grandmother and relatives) also speak Cantonese (and in some cases Chiu Chow). Regular input in English came solely from the father and the family's Filipino domestic helper, while English-speaking relatives visited only occasionally. In a number of recording sessions he showed a preference for using Cantonese, even when the research assistants tried to induce him to speak English.

In addition to Cantonese, Chiu Chow (or Chaozhou) was spoken by the child’s grandmother and some relatives. The ancestral language of a sizeable minority in Hong Kong, Chiu Chow is spoken in eastern Guangdong province and belongs to the southern Min dialect group. Although diverging from Cantonese in many respects, it shares the broad typological characteristics that contrast with English. The child showed some passive knowledge of Chiu Chow but never produced more than occasional words.

The recording continued on a weekly basis until Timmy was 3;06.25, except when he was away from home on a trip. Transcriptions are initially available on an approximately biweekly basis, as a number of tapes still await transcription. Unless otherwise stated, most of the files contain transcription of one side of the tape, i.e. about thirty minutes of recorded interaction between the child and other participants. Certain recordings are unusable due to various reasons such as technical failure of the recording instruments or failure to elicit the less preferred language on a few occasions. The subject was reserved and sensitive as a child, which is reflected in some of his transcripts as he at times became taciturn.

The files are classified as (I) mixed (File nos. 1-13), (II) Cantonese (File nos. 14-47) and (III) English (File nos. 48-85). The early mixed files involve natural interaction between the child, investigators and members of the family, without conscious prompting of either language. In these mixed files, a great deal of code-switching occurs during the course of conversation both on the part of the child and adult speakers. Subsequently, we tried to elicit one language at a time, e.g. in the first half hour of recording, English was spoken by one research assistant (RA) in order to elicit English, while the other RA used Cantonese in the second half hour to elicit Cantonese. The RAs who interacted with Timmy were all native speakers of Cantonese except Linda Peng Ling Ling, who is a native speaker of Mandarin and used primarily English in the later recording sessions. All the RAs speak English as their second language.

In practice, this one person-one language strategy did not always work as intended for elicitation purposes. As a result, one or more adults present at the recording may be speaking both English and Cantonese to the child, who in turn code-mixes from time to time. Hence some files, especially the early ones under the category Cantonese, actually contain a considerable amount of English and language mixing. As the child’s languages develop, the division into Cantonese and English files can be made more easily.

Spontaneous speech data were recorded at the child’s home, where the routines included activities such as role-playing, playing with toys and reading story books. The parents also kept a diary to supplement the audio-recording data. This enabled the researchers to address a wider range of phenomena, as certain structures (such as relative clauses) scarcely appear in the longitudinal corpus data (see Yip & Matthews 2000).

To facilitate comparison with monolingual Cantonese and English data, the data collection and corpus creation were modeled on previous works: Cancorp, created by Lee et al. (1996), which contains eight monolingual Cantonese-speaking children’s data from 1;5 to 3;8, and various English-speaking corpora (see MacWhinney 2000).

Sophie

Born on 28 February 1996, Sophie is the younger sister of Timmy, the first bilingual subject to be included in the Hong Kong bilingual corpus, born two years and nine months earlier (a younger sister was born when she was 4;03). Sophie's mother is a native speaker of Hong Kong Cantonese and her father of British English, and her exposure to Cantonese and English started from birth. The one parent-one language principle was adopted in principle, especially when addressing the child, but code-mixing occurred when the parents conversed with each other, which formed part of the child’s input. Apart from parental input, interaction with her brother took place in both Cantonese and English. She was regularly video-taped and audio-taped by two research assistants in each recording session, one responsible for each language, from 28 August 1997 to 28 February 2000 (1;06 - 4;00) on a weekly basis. In each recording session one research assistant interacted with the child for approximately half an hour in English and the other for half an hour in Cantonese.

The corpus as initially released covers transcriptions of her data from age 1;06 up to 3;00.09, on an approximately biweekly basis. Pictures chronicling Sophie and Timmy at different stages from infancy to primary school can be viewed at http://www.cuhk.edu.hk/ils/home/bilingual.htm

Sophie lived in Hong Kong continuously throughout the period of recording. She did not take her first trip abroad (to Australia) until 4;04. Her caretakers were her maternal grandmother, who spoke Cantonese and Chiu Chow, and a Filipino domestic helper, Belma, who spoke English and some Cantonese. She started attending a local Chinese kindergarten in the morning at 2;06 and, in addition, attended an English-speaking kindergarten in the afternoon from 3;02. She continued to attend both schools until 5;01. The kindergartens were each monolingual in the respective language.

While the circumstances are similar overall to those prevailing in Timmy's case, Sophie’s different personality and character lead to differences in the data. While her brother was reserved and passive, she was typically lively and talkative in recording sessions, even becoming argumentative as she grew older. In addition, being cared for primarily by her grandmother and remaining in Hong Kong exclusively during her preschool years means that the predominance of Cantonese input is even greater in her case than in Timmy's. This is reflected in the fact that while recordings eliciting both languages are available from age 1;06.00, these early recordings are dominated by Cantonese with occasional English words, and she only began to produce English sentences after age 2. While in many respects her development recapitulates that described for Timmy (such as wh in situ, null objects and prenominal relative clauses: see Yip & Matthews 2000), her English also shows some forms of transfer which are not evident in Timmy, such as extension of the verb give to permissive and even passive usages.

Since her grandmother speaks Chaoyang (Chaozhou) dialect as well as Cantonese, Sophie developed some passive knowledge of this dialect. She learnt that producing occasional phrases in Chaozhou was a source of amusement, but did not produce full sentences. There is also the possibility of syntactic influence from Chaozhou, for example in the ordering of double objects.

The parents kept a diary of Sophie’s utterances to supplement the audio and video-recording data. The diary continued beyond age 4, when regular recording ceased, in order to follow up some of the features.

The format of the English and Cantonese data is as described for Timmy in the first installment of the Hong Kong corpus: the grammatical category labels for the English corpus are based on the MOR grammars for English in the CHILDES Windows Tools, while the Cantonese data were tagged using a program developed by Lawrence Cheung on the basis of the grammatical categories used in the Hong Kong Cantonese child language corpus (Cancorp) created by Lee et al. (1996), which contains eight monolingual Cantonese-speaking children’s data from 1;5 to 3;8. There are three tiers, with the main tier showing Cantonese in the JyutPing romanization, and Cantonese characters and grammatical categories shown on separate tiers. The thirty-three grammatical categories used for tagging the corpus are listed below in Table 1. Details of the Morpheme tier (%mor) and Cantonese tier (%can) as well as instructions for downloading and viewing the Cantonese characters can be found in the readme file accompanying the data for Timmy.

Together with this corpus, a total of three sample audio-linked transcripts and two video-linked transcripts are available for access. The three audio-linked transcripts feature Cantonese and English as well as some Cantonese-English code-switching. Two of the audio-linked transcripts have video-linked counterparts. The shorter of these transcripts has a video-linked counterpart, with a sound track that is less clear than in the audio-linked one. In the shorter video excerpt (3:00) sound quality may be improved by adjusting the balance to turn down the right channel. The video-linked files feature each language respectively as the base language as well as code-switching in the longer one (4:13).

For this release, there is a total of 80 files, half in Cantonese and half in English. There are two files for each date, one per language, since both were recorded on the same day. Though there seems to be a perfect symmetry in terms of the files in each language, it should be noted that in the early English files before Sophie turned 2, she did not yet speak English fluently despite the investigators’ elicitation in English. The file name is made up of Sophie’s initial S, followed by the initial that stands for the language, either c for Cantonese or e for English, followed by the year, month and date of recording, e.g. Sc970828 refers to the Cantonese file containing the recording made on 28 August 1997 and Se970828 refers to the English file for the recording made on the same date. Thus each of the 80 files has a unique file name.
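
The naming scheme just described (child's initial, language letter, then yymmdd) is regular enough to decode mechanically. The following Python fragment is a minimal sketch under stated assumptions: the names `TranscriptName` and `parse_transcript_name` are hypothetical, and the rule that two-digit years from 90 upward belong to the 1900s is an assumption chosen to fit the recording years mentioned in this document, not part of the corpus specification.

```python
import re
from datetime import date
from typing import NamedTuple

# Language letters as described in the text: c = Cantonese, e = English.
LANGUAGES = {"c": "Cantonese", "e": "English"}

# Child's initial, language letter, then yymmdd.
NAME_PATTERN = re.compile(r"([A-Za-z])([ce])(\d{2})(\d{2})(\d{2})")


class TranscriptName(NamedTuple):
    child_initial: str
    language: str
    recorded: date


def parse_transcript_name(name: str) -> TranscriptName:
    """Split a file name such as 'Sc970828' into its three components."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a <initial><c|e><yymmdd> file name: {name!r}")
    initial, letter, yy, mm, dd = match.groups()
    # Assumption: two-digit years 90-99 are 19yy, all others 20yy.
    year = (1900 if int(yy) >= 90 else 2000) + int(yy)
    return TranscriptName(initial, LANGUAGES[letter], date(year, int(mm), int(dd)))


print(parse_transcript_name("Sc970828"))
```

Names that do not fit the pattern, such as files with a different second letter, are rejected with a `ValueError` rather than guessed at.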

Kathryn

The corpus of Kathryn’s bilingual development constitutes the third installment of the Hong Kong Bilingual Child Language Corpus. Kathryn was born in Hong Kong on January 23, 1992. Her siblings are James (four years and eight months older) and Alasdair (one year and nine months older), who also feature in the recordings occasionally. Her father, a neurosurgeon at a university hospital, is a native speaker of Cantonese and her mother of British English. The mother, a housewife at the time of study, was the principal caregiver. The family employed a Filipina domestic helper for a brief period until Kathryn was around age 3, and subsequently a part-time Cantonese cleaner who also spoke fluent English.

Kathryn attended the Cantonese section of an international kindergarten from 2;07. According to her mother’s observations, this set her subsequent pattern of language use, with her friends being mostly Cantonese-speaking. As of April 2003, Kathryn was eleven years old and was reported to use more Cantonese as the language of social interaction, while English was the language for academic settings.

Audio recording was conducted by two research assistants in each recording session, one responsible for each language, from 15 November 1994 (2;09.23) to 30 July 1996 (4;06.07) on a bi-weekly basis. The data initially released total 26 files (13 in each language) on a monthly basis from 3;06.18 to 4;06.07. In each recording session one research assistant interacted with the child for approximately half an hour in English and the other for half an hour in Cantonese.

Of the five Cantonese-English bilingual children studied to date, Kathryn shows the most balanced pattern of development, with relatively little evidence of language dominance or concomitant transfer compared to the siblings Timmy and Sophie, who show dominance in Cantonese over English.

A perhaps unintentionally funny remark:

Ngo5 de1di4 hai6 zing2 tau4 gaa3 (3;11.27) file Kc960119
My daddy is do head PRT
‘My daddy fixes heads.’

In Cantonese this would normally denote a hairdresser, but here describes her father’s work as a neurosurgeon. Photographs of Kathryn from infancy to primary school can be viewed at http://www.cuhk.edu.hk/ils/home/kathryn.htm

All the transcripts released in this corpus are linked to the audio files. One can read the transcript while listening to the interaction between the child and the research assistants.

Investigation of Kathryn's bilingual development was undertaken as part of the project "The Development of Bilingual Competence in Hong Kong Children" funded by the Research Grants Council (RGC ref. no. HKU 336/94H) and subsequently funded by the project "A Cantonese-English Bilingual Child Language Corpus" (RGC ref. no. CUHK4002/97H) and the current project "Multimedia Perspectives on Bilingual Development" (CUHK 4014/02H). The recording of Kathryn could not have been so successful and enjoyable without the generous support of her parents and siblings. We are especially indebted to Kathryn’s mother Gillian for all the help she rendered over the years of our investigation. We also gratefully acknowledge the support and help of the colleagues and students who have been friends and supporters of our work over the years. Among them, special thanks are due to Winnie Chan, Linda Peng Ling Ling, Bella Leung, Lawrence Cheung, Gene Chu, Betty Chan, Chen Ee San, Michelle Li, Emily Ma, Uta Lam, Richard Wong and Angel Chan: a dedicated team who became part of the family and friends of the children. We thank Mary MacWhinney for digitizing a substantial portion of Kathryn’s tapes for us in Spring 2001. Brian MacWhinney's impressive technical know-how and practical tips have greatly facilitated the completion of the corpus and production of the entire audio-linked corpus. His input during and after his sabbatical in Hong Kong in 2000-2001 has made all the difference to every aspect of the corpus.

File no.  File name (Kcyymmdd)  File no.  File name (Keyymmdd)  Age of CHI
1         Kc950810              14        Ke950810              3;06.18
2         Kc950905              15        Ke950905              3;07.13
3         Kc951020              16        Ke951020              3;08.27
4         Kc951117              17        Ke951117              3;09.25
5         Kc951220              18        Ke951220              3;10.27
6         Kc960119              19        Ke960119              3;11.27
7         Kc960207              20        Ke960207              4;00.15
8         Kc960304              21        Ke960304              4;01.09
9         Kc960409              22        Ke960409              4;02.17
10        Kc960508              23        Ke960508              4;03.15
11        Kc960621              24        Ke960621              4;04.29
12        Kc960703              25        Ke960703              4;05.10
13        Kc960730              26        Ke960730              4;06.07

Llywelyn

The corpus of Llywelyn’s bilingual development constitutes the fourth installment of the Hong Kong Bilingual Child Language Corpus. Llywelyn was born on 21 June 1993, the second of two children. His English name is of Welsh origin, while his Chinese name Wai Lung means “great dragon”. His father is a native speaker of British English and a professional linguist. He made frequent conference trips for professional reasons and was occasionally absent from home during Llywelyn's early years, including six months' sabbatical leave in Australia (Jan 95-July 95), one absence of a month (14 July-12 August 95) and one of 3 weeks (Jan 96-29 Jan 96). His mother is a native Cantonese speaker and an accountant by profession. The family employed two Filipino domestic helpers. Another important role in Llywelyn’s language ecology was played by his brother Kenny, three years and eight months older, who was very advanced in terms of language and cognitive development. A gifted student, he excelled at a prestigious Chinese-medium school. His metalinguistic awareness was also notable: he coined words like jyu2sh [a blend of Cantonese jyu2+English fish], and even invented his own private language, ‘sill-ish’. Kenny also features in some of the recordings, sometimes commenting on his little brother's language, as in the following exchange from Le960328 (2;09.07):

*BRO: Llywelyn , what's this ?
*CHI: it's racing+car .
*BRO: very good .

Llywelyn's English shows some of the same features observed in previous children in the corpus, such as null objects (Yip & Matthews 2000; Yip, to appear), while his early Cantonese shows some phonological influence from English.

Audio recording was conducted by two research assistants in each recording session, one responsible for each language, from 21 December 1994 (1;06) to 19 December 1996 (3;05.28) on a bi-weekly basis. In each recording session one research assistant interacted with the child for approximately half an hour in English and the other for half an hour in Cantonese.

Investigation of Llywelyn’s bilingual development was undertaken as part of the project "The Development of Bilingual Competence in Hong Kong Children" funded by the Research Grants Council (RGC ref. no. HKU 336/94H) and subsequently funded by the project "A Cantonese-English Bilingual Child Language Corpus" (RGC ref. no. CUHK4002/97H) and the current project "Multimedia Perspectives on Bilingual Development" (CUHK 4014/02H). The recording of Llywelyn could not have been so successful and enjoyable without the generous support of his family. We are especially indebted to Llywelyn’s parents Gregory and Zelinda for their support over the years of our investigation. We also gratefully acknowledge the support and help of the colleagues and students who have contributed to our work over the years. Among them, special thanks are due to Winnie Chan, Linda Peng Ling Ling, Bella Leung, Lawrence Cheung, Gene Chu, Betty Chan, Chen Ee San, Michelle Li, Uta Lam, Richard Wong and Angel Chan: a dedicated team who became part of the family and friends of the children. We thank Brian MacWhinney for his continued support in facilitating the completion of the corpus and production of the entire audio-linked corpus.

File no.  File name (Leyymmdd)  File no.  File name (Lcyymmdd)  Age of CHI
1.        Le950703              18.       Lc950703              2;00.12
2.        Le950803              19.       Lc950803              2;01.13
3.        Le950824              20.       Lc950824              2;02.03
4.        Le950919              21.       Lc950919              2;02.29
5.        Le951005              22.       Lc951005              2;03.14
6.        Le951102              23.       Lc951102              2;04.12
7.        Le951201              24.       Lc951201              2;05.10
8.        Le960110              25.       Lc960110              2;06.20
9.        Le960125              26.       Lc960125              2;07.04
10.       Le960208              27.       Lc960208              2;07.18
11.       Le960229              28.       Lc960229              2;08.08
12.       Le960328              29.       Lc960328              2;09.07
13.       Le960425              30.       Lc960425              2;10.04
14.       Le960619              31.       Lc960619              2;11.29
15.       Le960718              32.       Lc960718              3;00.27
16.       Le960725              33.       Lc960725              3;01.04
17.       Le961104              34.       Lc961104              3;04.14

Charlotte

The corpus of Charlotte’s bilingual development constitutes the fifth installment of the Hong Kong Bilingual Child Language Corpus. Charlotte was born on 23 August 1996, the second of two children. Charlotte’s elder sister Claire is 2 years and 9 months older. The siblings are very close to each other and often appear together in the taping sessions. Claire is referred to as gaa1gaa1 ‘big sister’ or (when speaking English) gaa1gaa4 with falling intonation, a coinage based on the Cantonese kinship term gaa1ze1 ‘elder sister.’ Pictures of Charlotte from 1;06 to 3;05 can be viewed in Charlotte’s gallery at http://www.cuhk.edu.hk/ils/home/Charlotte.htm

Charlotte’s mother, a teacher, is a native speaker of Cantonese. Her father is a professor from the UK, who was on sabbatical leave in New Zealand when Charlotte was born. At four and a half months she moved to Hong Kong, where she was cared for by a Filipina domestic helper, first Margie and then (from age 2 onwards) Erlina.

Throughout the period of study, Charlotte was more dominant in English, making an interesting contrast with Cantonese-dominant children such as Timmy (Yip and Matthews 2000) and Sophie (Matthews and Yip 2003).

Charlotte was observed during the period when she was 1;05.10 - 3;06.14. For this release, there is a total of 38 files, half in Cantonese and half in English. The data initially released cover the period from 1;08.28 to 3;00.03. Audio and video recording was conducted by two research assistants in each recording session, one responsible for each language, from 12 February 1998 to 2 March 2000 on a bi-weekly basis. In each recording session one research assistant interacted with the child for approximately half an hour in English and the other for half an hour in Cantonese.

There are two files for each date, one per language, since both were recorded on the same day. The file name is made up of Charlotte's initial c, followed by the initial that stands for the language, either c for Cantonese or e for English, followed by the year, month and date of recording, e.g. cc980521 refers to the Cantonese file containing the recording made on 21 May 1998 and ce980521 refers to the English file for the recording made on the same date. Thus each of the 38 files has a unique file name.
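
Because the Cantonese and English files of one session share the same date stamp, the two can be matched up by grouping file names on their last six characters. The snippet below is an illustrative sketch only; the function name `pair_by_session` is hypothetical, and it assumes every name follows the eight-character pattern described above.

```python
from collections import defaultdict
from typing import Dict, Iterable


def pair_by_session(file_names: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Group file names by their yymmdd stamp, keyed by language letter (c or e)."""
    sessions: Dict[str, Dict[str, str]] = defaultdict(dict)
    for name in file_names:
        # Position 0: child's initial; position 1: language; positions 2-7: yymmdd.
        language, stamp = name[1].lower(), name[2:]
        sessions[stamp][language] = name
    return dict(sessions)


sessions = pair_by_session(["cc980521", "ce980521", "cc980604", "ce980604"])
for stamp, files in sorted(sessions.items()):
    print(stamp, files.get("c"), files.get("e"))
```

A session for which only one language was transcribed would show up with a single entry, which makes gaps in the pairing easy to spot.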

Investigation of Charlotte's bilingual development was undertaken as part of the project "A Cantonese-English Bilingual Child Language Corpus" (RGC ref. no. CUHK4002/97H) and the current project "Multimedia Perspectives on Bilingual Development" (CUHK 4014/02H). The recording of Charlotte could not have been so successful and enjoyable without the generous support of her family. We are especially indebted to Charlotte’s parents Polly and Paul for their support over the years of our investigation. We also gratefully acknowledge the support and help of the colleagues and students who have contributed to our work over the years. Among them, special thanks are due to Linda Peng Ling Ling, Bella Leung, Lawrence Cheung, Gene Chu, Betty Chan, Chen Ee San, Michelle Li, Uta Lam, Richard Wong, Angel Chan, Thomas Tsang and Joan Huang: a dedicated team who became part of the family and friends of the children. We thank Brian MacWhinney for his continued support in facilitating the completion of the corpus and production of the entire audio-linked corpus.

File no.  File name (ccyymmdd)  File no.  File name (ceyymmdd)  Age of CHI
1.        cc980521              20.       ce980521              1;08.28
2.        cc980604              21.       ce980604              1;09.12
3.        cc980702              22.       ce980702              1;10.09
4.        cc980728              23.       ce980728              1;11.05
5.        cc980917              24.       ce980917              2;00.25
6.        cc981015              25.       ce981015              2;01.22
7.        cc981029              26.       ce981029              2;02.06
8.        cc981110              27.       ce981110              2;02.18
9.        cc981210              28.       ce981210              2;03.17
10.       cc990112              29.       ce990112              2;04.20
11.       cc990211              30.       ce990211              2;05.19
12.       cc990311              31.       ce990311              2;06.16
13.       cc990415              32.       ce990415              2;07.23
14.       cc990429              33.       ce990429              2;08.06
15.       cc990527              34.       ce990527              2;09.04
16.       cc990611              35.       ce990611              2;09.19
17.       cc990708              36.       ce990708              2;10.15
18.       cc990722              37.       ce990722              2;10.29
19.       cc990826              38.       ce990826              3;00.03

Janet

Janet was born on May 06, 2000. Her mother is a native speaker of Cantonese and her father a native speaker of British English. They were both school teachers. Janet was exposed to the two languages from birth. The Janet corpus covers the age from 2;09.16 to 3;11.11. Due to the imbalance of input, her Cantonese was very much ahead of English. There was little production of English words before 2;09 though our recording started at an earlier age. She exemplifies a case of childhood bilingualism with an extended silent period before the weaker language was active in production. It was not until 2;09 that she began to produce more English.

Investigation of Janet’s bilingual development was undertaken as part of the project “Childhood Bilingualism and Second Language Acquisition in Hong Kong Children” funded by the Research Grants Council of the Hong Kong Special Administrative Region, China (ref. no. CUHK4692/05H). We thank Janet for her participation and gratefully acknowledge the support and help of her parents, our team members especially Angel Chan, Andrew Chau, Uta Lam and Michelle Li. Brian MacWhinney’s painstaking efforts and incredible support saw through every technical aspect of the construction of this fully video-linked set of transcripts in the Hong Kong Bilingual Child Language Corpus. We are indebted to him for his expert advice and innovative ideas to make the multimedia corpus a reality.

File no.  File name (Jcyymmdd)  File no.  File name (Jeyymmdd)  Age of CHI
1.        JC030322              23.       JE030322              2;09.16
2.        JC030405              24.       JE030405              2;10.30
3.        JC030503              25.       JE030503              2;11.27
4.        JC030517              26.       JE030517              3;00.11
5.        JC030524              27.       JE030524              3;00.18
6.        JC030607              28.       JE030607              3;01.01
7.        JC030628              29.       JE030628              3;01.17
8.        JC030712              30.       JE030712              3;02.06
9.        JC030719              31.       JE030719              3;02.14
10.       JC030802              32.       JE030802              3;02.27
11.       JC030830              33.       JE030830              3;03.24
12.       JC030927              34.       JE030927              3;04.21
13.       JC031018              35.       JE031018              3;05.12
14.       JC031108              36.       JE031108              3;06.02
15.       JC031122              37.       JE031122              3;06.16
16.       JC031213              38.       JE031213              3;07.07
17.       JC031227              39.       JE031227              3;07.21
18.       JC040110              40.       JE040110              3;08.04
19.       JC040131              41.       JE040131              3;08.25
20.       JC040214              42.       JE040214              3;09.08
21.       JC040403              43.       JE040403              3;10.28
22.       JC040417              44.       JE040417              3;11.11
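
The file names and ages in the tables above follow regular patterns, so they can be unpacked mechanically. The following Python sketch is only an illustration and not part of the corpus tools: the function names are hypothetical, and it assumes that the second letter of a file name marks the language (c for Cantonese, e for English), that the six digits are the recording date as yymmdd, and that ages are written as years;months.days.

```python
import re
from datetime import date

# Assumed layout: child initial, language letter (c or e), then yymmdd.
FILE_NAME = re.compile(r"^([A-Za-z])([CcEe])(\d{2})(\d{2})(\d{2})$")
# Age notation as it appears in the tables: years;months.days
AGE = re.compile(r"^(\d+);(\d{1,2})\.(\d{1,2})$")

LANGUAGES = {"c": "Cantonese", "e": "English"}


def parse_file_name(name):
    """Split a file name such as 'JE030405' into child, language and date."""
    match = FILE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    child, lang, yy, mm, dd = match.groups()
    # Assumption: two-digit years from 90 upward are 199x, all others 20xx.
    year = (1900 if int(yy) >= 90 else 2000) + int(yy)
    return {
        "child": child.upper(),
        "language": LANGUAGES[lang.lower()],
        "recorded": date(year, int(mm), int(dd)),
    }


def age_in_months(age):
    """Convert an age such as '2;10.30' to completed months (days dropped)."""
    match = AGE.match(age)
    if match is None:
        raise ValueError(f"unexpected age: {age!r}")
    years, months, _days = (int(part) for part in match.groups())
    return years * 12 + months


print(parse_file_name("JE030405"))
print(age_in_months("2;10.30"))  # 34
```

Converting ages to completed months makes it easy to group files into developmental intervals without losing the original notation.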

Grammatical categories

The grammatical category labels for the English corpus are based on the MOR grammars for English in the CHILDES Windows Tools while those for the Cantonese corpus are based on those of Cancorp with thirty-three categories distinguished, as shown in Table 1 (see MacWhinney 2000:364-365). These are as used in Cancorp apart from the following modifications:

Grammatical categories for the Cantonese corpus
No.  Label  Category                      Examples                                Gloss
1.   adj    adjective                     sau3 leng faai hou2teng1                thin, pretty, fast, good to listen to
2.   advf   focus adverb                  dou1 sin1 jau6 zung6                    also, first, again, still
3.   advi   adverb of intensity           gam3 hou2 taai3 zeoi3                   so, very, too, most
4.   advm   adverb of manner              gwaai1gwaai1dei2 maan6maan2             obediently, slowly
5.   advs   sentential adverb             jan1wai so2ji5 bat1jyu4                 because, therefore, how about
6.   asp    aspectual marker              zo2 gwo3 gan2 hoi1 haa5                 PFV, EXP, PROG, HAB, DEL
7.   aux    auxiliary/modal verb          jing1goi1 wui5 m4hou2                   should, would, don't
8.   cl     classifier                    bun2 go3 gaa3 tiu4                      CL
9.   com    comparative morpheme          leng3 di1                               more beautiful, prettier than her
10.  conj   connective                    ding6hai6 tung4maai4 waak6ze2           or, and, or
11.  corr   correlative                   jat1lou6 jat1lou6 jyut6 V jyut6         while, the more…the more
12.  det    determiner                    li1 go2 dai6                            this, that, number
13.  dir    directional verb              lei4/lai4 heoi3 ceot1 jap6 soeng5 lok6  come, go, out, in, go up, go down
14.  ex     expressive utterance          ai1jaa3 m4goi                           oops, well, please/thanks
15.  gen    genitive marker               Timmy ge3 pang4jau5                     Timmy's friends
16.  ins    emphatic inserted marker      gam3 gwai2 lyun6                        what a mess!
17.  loc    localizer                     zoeng1 toi2 dou6 soeng6min6             on the table, up there
18.  nn     noun                          ce1 wun6geoi6 sing1sing1 kau3fu2        car, toy, star, uncle
19.  nnpr   pronoun                       ngo5 lei5 keoi5 ngo5dei6                I/me, you, s/he, we/us, you(pl), they/them
20.  nnpp   proper noun                   ciu1jan4 je4sou1 jing1gwok3             Superman, Jesus, Britain
21.  neg    negative morpheme             m4 mai6 mou5                            not, not, not have
22.  onoma  onomatopoeic expression       wou1wou1 baang4 gok6go6                 ONOMA
23.  prt    (postverbal) particle         dak1 dou3 saai3 maai4 jyun4             can, until, all, as well, finish
24.  prep   preposition                   hai2 bei2                               at, for
25.  q      quantifier                    jat1 sap6saam1 mui5                     one, thirteen, each
26.  rfl    reflexive pronoun             zi6gei2                                 self
27.  sfp    sentence-final particle       aa3 laa1 gaa3 ho2                       SFP
28.  vd     ditransitive verb             bei2 sung3                              give, give (as a gift)
29.  ver    ergative (unaccusative) verb  dit3 tyun5                              fall, break
30.  vf     function verb                 hai6 jau5                               be, have
31.  vi     intransitive verb             siu3 jau1sik1 kei4tou2                  smile, rest, pray
32.  vt     transitive verb               sik6 gong2 zi1dou3                      eat, say, know
33.  wh     wh phrases                    bin1go3 mat1je5                         who, what, where, why

Morpheme tier %mor: The %mor tier was generated using a tagging program developed by Lawrence Cheung. Since Cantonese has many homophonous morphemes, it was necessary to carry out disambiguation with respect to word class. The disambiguation and checking were performed by Gene Chu and Simon Huang for both Cantonese and English files.
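
Once the %mor tier is disambiguated, category labels can be tallied directly from it. The Python sketch below is a minimal illustration under stated assumptions: it assumes each token on a %mor line has the shape label|form, as in general CHAT practice, and the sample line is constructed from forms in the category table rather than taken from the corpus. The function name count_categories is hypothetical.

```python
from collections import Counter


def count_categories(mor_line):
    """Tally category labels on one %mor line (assumed token shape: label|form)."""
    # Drop the tier name if it is present.
    if mor_line.startswith("%mor:"):
        mor_line = mor_line.split(":", 1)[1]
    counts = Counter()
    for token in mor_line.split():
        if "|" not in token:
            continue  # skip punctuation and other tokens without a label
        label = token.split("|", 1)[0]
        counts[label] += 1
    return counts


# Constructed example, not a line from the corpus.
sample = "%mor:\tnnpr|ngo5 vt|sik6 asp|zo2 sfp|laa1 ."
print(count_categories(sample))
```

Summing such counters over all lines of a file gives a rough profile of category use per session, which can then be checked against the thirty-three labels listed above.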

Cantonese tier %can: The child’s Cantonese was first transcribed using romanized Cantonese instead of Chinese characters. The %can tier was generated at a later stage to provide readers who can read Chinese characters with quicker access to the speakers' utterances. Fonts for Cantonese characters are available at the Hong Kong SAR government website, http://www.5c.org/, as well as through Microsoft. The same characters are used for allophonic representations of a morpheme. Due to ongoing sound changes, there is variation especially between n/l and ng/Ø (Matthews and Yip 1994: 29-30). For example, the first person pronoun is represented as ngo5 in the corpus but is often pronounced o5. The second person pronoun is represented as lei5 although the prescribed form is nei5. For the demonstrative there are several variant forms: li1/ni1/ji1/nei1/lei1 ‘this’. The experiential aspect marker may appear as gwo3 or go3. Other alternative forms result from contraction, for example mat1je5 'what' becomes me1 and hou2 m4 hou2 'is it okay?' becomes hou2 mou2.
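
When searching the romanized transcripts, it can help to fold the variant forms described above onto a single form. The Python sketch below is a hypothetical helper, not part of the corpus tools. It maps only the single-word variants named in this section, and it takes li1 as the target form for the demonstrative, which is an assumption based on the determiner examples in the category table. It deliberately leaves go3 alone, because go3 is also listed as a classifier, so that variant cannot be resolved without context.

```python
# Variant forms named in this section, mapped to one search form.
# Assumption: li1 is used as the target form for the demonstrative.
VARIANTS = {
    "o5": "ngo5",      # first person pronoun, often pronounced o5
    "nei5": "lei5",    # second person pronoun, prescribed form nei5
    "ni1": "li1",      # demonstrative 'this'
    "ji1": "li1",
    "nei1": "li1",
    "lei1": "li1",
    "me1": "mat1je5",  # contraction of mat1je5 'what'
}


def normalize(tokens):
    """Replace known variant forms; all other tokens pass through unchanged."""
    return [VARIANTS.get(token, token) for token in tokens]


print(normalize(["o5", "sik6", "me1"]))  # ['ngo5', 'sik6', 'mat1je5']
```

A context-free lookup like this is a simplification: a syllable such as ji1 or me1 may also stand for an unrelated morpheme, so results should be checked against the %mor tier.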

Sound-linked files: As an initial demonstration of how transcripts can be read and heard simultaneously using CLAN, three sample audio files (two English, one Cantonese) linked to excerpts of transcripts are provided. Subject to sufficient funding, we hope to make further audio files available at a later date and to provide English glosses for the Cantonese transcripts.

Acknowledgments

Longitudinal data on Timmy's language development were collected as part of two projects funded by the Research Grants Council of Hong Kong: (1) RGC ref. no. HKU336/94H to Stephen Matthews (University of Hong Kong), Virginia Yip (Chinese University of Hong Kong) and Huang Yue-Yuan (Hong Kong Baptist University) and (2) RGC ref. no. CUHK4002/97H "A Cantonese-English Bilingual Child Language Corpus" to Virginia Yip and Stephen Matthews. We gratefully acknowledge the help of our students and colleagues, especially Linda Peng Ling Ling, Bella Leung, Lawrence Cheung, Simon Huang, Gene Chu, Patricia Man, Winnie Chan, Betty Chan, Tommi Leung, Peggy Leung, Chen Ee San, Shirley Sung, Uta Lam, Richard Wong, Huang Yue Yuan, Emily Ma, and Angel Chan: a dedicated team who became part of the family and friends of the children. The advice of Brian MacWhinney in the last stages of the preparation of the Timmy corpus and the audio and video digitization was most timely and indispensable.