CHILDES Chinese-English CHCC Corpus (Child Heritage Chinese Corpus)

Ziyin Mai
Department of Linguistics and Translation
City University of Hong Kong


Virginia Yip
Linguistics and Modern Languages
Chinese University of Hong Kong


Stephen Matthews
Department of Linguistics
University of Hong Kong

Participants: 3
Type of Study: longitudinal, naturalistic
Location: United States, Hong Kong
Media type: audio
DOI: doi:10.21415/T5SD6M

Browsable transcripts

Download transcripts

Link to media folder

Citation information

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

This corpus documents the language development of American-born Chinese children, who were exposed to their heritage language Chinese (Mandarin and/or Cantonese) at home, and the societal majority language English at school and at home. The corpus currently contains data from three children: Luna, Avia and Winston. File names give the child's age.
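Ages throughout this description are written in the CHILDES years;months notation, so 2;00 means two years and zero months and 0;09 means nine months. The following is a minimal Python sketch for converting such ages to months, which can help when comparing recording periods across the three children. The helper names (age_to_months, span_in_months) are hypothetical and are not part of CLAN or any TalkBank tool. The optional ".days" suffix is accepted but ignored, which is an assumption of this sketch.

```python
import re

# Matches ages such as "2;00", "0;09" or "3;02.15" (years;months with optional .days).
AGE_PATTERN = re.compile(r"^(\d+);(\d{1,2})(?:\.(\d{1,2}))?$")


def age_to_months(age: str) -> int:
    """Convert a years;months age string to a whole number of months."""
    match = AGE_PATTERN.match(age.strip())
    if match is None:
        raise ValueError(f"not a years;months age: {age!r}")
    years = int(match.group(1))
    months = int(match.group(2))
    if months > 11:
        raise ValueError(f"months must be between 0 and 11: {age!r}")
    return years * 12 + months


def span_in_months(start: str, end: str) -> int:
    """Number of months between two ages, for example a recording period."""
    return age_to_months(end) - age_to_months(start)


if __name__ == "__main__":
    # Recording period of Luna's initial release as stated in this description.
    print(span_in_months("2;00", "4;11"))
```

Working in months avoids mistakes that arise from treating 3;10 as a decimal number.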


Luna was born in New York in 2011. She is the first-born of the family. She has been exposed to Mandarin Chinese at home since birth and to English in nursery and preschool since she was 0;09. Her parents are professionals who are native speakers of Mandarin and speak English fluently as their second language. Both of them come from mainland China, the mother from a Cantonese-speaking city in southern China, and the father from the north. They had completed their first degrees in natural sciences and engineering at prestigious universities in Beijing before they arrived in the United States to pursue doctoral degrees. At home, the parents address Luna almost exclusively in Mandarin with only occasional switches to English. Before the age of 0;09, Luna was primarily taken care of by her parents and her maternal grandparents, who also addressed Luna in Mandarin. From 0;09 to 3;00, Luna spent approximately seven hours a day on weekdays at a local daycare, where English was the language used among staff members and between staff and the infants. The family relocated to Maryland when Luna was 3;01, where Luna attended an English-medium pre-school from 3;02 to 5;08, and she was particularly close to the Chinese-speaking children in her class. Spanish was introduced to Luna by her pre-school teachers when she was 3;02, with one hour of exposure per week. Luna is now attending first grade (Kindergarten) in an elementary school in Maryland. She took a few month-long trips outside the US before 4;11 (age of the last recording in this release), most of them to China visiting friends and family in major cities. Luna's little sister was born when Luna was 4;03, and the parents have adhered to the Mandarin-only policy in addressing both girls at home. Mandarin is the language of communication between Luna and her sister.

Regular recording of Luna began in 2013 at age 2;00 and was still ongoing at the time of writing. The initial release of Luna’s corpus contains the transcripts of the speech data from 2;00 to 4;11. From 2;00 to 4;01, Luna was video-recorded by her parents at home at weekly or bi-weekly intervals. From 3;10 to 4;11, Luna was invited to interact with our research staff at CUHK via Skype at regular intervals. The Skype sessions, both audio and video, were recorded using Camtasia by the researchers. The sessions were scheduled for early mornings in Hong Kong and early evenings in Maryland. Most of them were 40-60 minutes in duration, evenly divided into a Mandarin session, where Luna interacted with a Mandarin-speaking researcher in Mandarin, and an English session, where the same or another research assistant, who spoke English as their second language, elicited English from Luna. The Skype sessions consisted of child-friendly activities including story-telling, role play, games with toys and free chat. The Mandarin and English sessions were later split into different files for transcription and tagging. To compensate for data loss due to unstable connections in web-based calls, Luna’s mother, who observed all of the Skype calls, audio-recorded the Skype conversations on another device (e.g. smartphone, sound recorder). The Skype sessions were transcribed from the audio recordings provided by the parents and checked against the Camtasia recordings where the speakers’ utterances were unintelligible due to an unstable internet connection. To address the parents’ requests to protect the family’s privacy, only the transcripts and corresponding audio recordings, some of them converted from video recordings, are released.
In sum, the initial release of the Luna corpus contains 13 hours and 22 minutes of recordings of Luna interacting with her parents at home from 2;00 to 4;01, and 13 hours and 29 minutes of recordings of Luna interacting with researchers via Skype from 3;10 to 4;11 (4 hours and 56 minutes in Mandarin, 8 hours and 33 minutes in English), totalling 26 hours and 51 minutes.
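The totals above can be checked with a short calculation. The Python sketch below uses only the durations stated in this description; the variable names and the format_duration helper are illustrative and do not come from any corpus tool.

```python
from datetime import timedelta

# Durations as stated in this description of the Luna corpus.
home_recordings = timedelta(hours=13, minutes=22)
skype_mandarin = timedelta(hours=4, minutes=56)
skype_english = timedelta(hours=8, minutes=33)


def format_duration(duration: timedelta) -> str:
    """Render a duration as whole hours and minutes."""
    total_minutes = int(duration.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} h {minutes:02d} min"


skype_total = skype_mandarin + skype_english
corpus_total = home_recordings + skype_total

if __name__ == "__main__":
    print("Skype sessions:", format_duration(skype_total))   # 13 h 29 min
    print("Whole release:", format_duration(corpus_total))   # 26 h 51 min
```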

To our knowledge, this is the first longitudinal corpus documenting a child’s language development drawing on web-based data collection methods. In Mai and Yip (2017), Zhu, Mai and Yip (2018) and Yip, Mai and Matthews (2018), we demonstrated how the Skype sessions were conducted and discussed the methodological issues that this method gives rise to. Compared to traditional home recordings, web-based data collection incurs lower costs and offers greater flexibility. It is transformational for the study of heritage bilingualism, as a critical question in the field is how knowledge of one's mother tongue is acquired under less-than-optimal conditions in childhood and subsequently maintained or attrited later in life. Web-based recording enables us to collect heritage children’s speech data across location barriers: Hong Kong-based researchers are able to interact with US-based heritage children and record the data, supplementing traditional recordings normally made during home visits.


Avia was born in Ann Arbor, MI in 2012, and currently lives with her parents in Seattle, WA. Her mother, originally from northern China, arrived in the United States with her parents when she was five, and spoke Chinese on a daily basis with her parents until she attended college. At the time of recording, Avia’s mother was fluent in both Mandarin and English, with stronger abilities in English than in Mandarin, especially on professional topics and in formal registers. Avia’s father is American, with English as his first and dominant language. Both parents are professionals with doctoral degrees in engineering. Strongly convinced of the necessity and possibility of passing on the Chinese heritage to Avia, and of the linguistic and cognitive benefits of early childhood bilingualism, the parents adopt the “one parent-one language” principle in their interaction with Avia, with the mother exclusively speaking Mandarin, and the father speaking English. The language between the parents is English. Avia has a brother who is three years younger than her. The parents observed that Avia mainly spoke Mandarin with her brother, mixed with some English. Before the age of 2;05, Avia was taken care of mainly by her parents and a Mandarin-speaking au pair. From 2;05 to 4;11, Avia attended a bilingual daycare, where Mandarin was the dominant language, three days per week for eight hours a day, followed by an English-medium local elementary school after 4;11. Avia lived in the US continuously throughout the period of recording, apart from brief visits to mainland China (4 weeks in total by 4;11).

The Avia corpus contains 19 hours of video recordings documenting Avia’s interaction with her parents from 2;00 to 3;11 in the home setting (15 hours in Mandarin with the mother and 4 hours in English with the father). They were recorded by Avia’s parents at 2-3 week intervals from 2014 to 2016, and transcribed and tagged in CLAN (MacWhinney, 2000) by our research team.


Winston was born in Boston, MA in 2012, and moved to Seattle, WA with his parents at the age of 0;04. Like Luna’s parents, Winston’s parents had completed their first degrees in science and engineering at prestigious universities in Beijing before they arrived in the United States to pursue doctoral degrees. Both are highly proficient in English and use English on a daily basis in the workplace. Unlike Luna’s parents, Winston’s parents grew up bilingually in Guangzhou, a Cantonese-speaking city in southern China, speaking both Cantonese (Guangzhou) and Mandarin with family and friends, and are determined to pass on both Cantonese and Mandarin to Winston. At home, Winston’s mother speaks Cantonese to Winston while his father speaks Mandarin most of the time. Winston’s younger brother was born when Winston was 3;08. The maternal and paternal grandparents played a major role in taking care of the children. They addressed the children in Cantonese and Mandarin 70% and 30% of the time respectively. In addition, the family hired a Mandarin-speaking domestic helper for four months (0;07-0;10) and a Cantonese-speaking one for ten months (2;05-3;02), who had daily interactions with Winston in the respective languages. Winston also met with an English tutor for 2-3 hours per week starting from 5;00.

Winston attended a monolingual English nursery for a month from 1;08 to 1;09, and switched to an international Montessori nursery at 1;09. There, Winston was first enrolled in a “quadrilingual programme”, in which he was exposed to Mandarin, Japanese, Spanish and English for fourteen months from 1;10 to 3;00, and then in a “bilingual programme” (English and Mandarin) from 3;00 to 4;04. At 4;04, Winston started attending a preschool where English is used as the sole language of instruction. At each school, Winston spent six hours per day on weekdays. For Winston’s parents, developing relatively balanced abilities in Cantonese, Mandarin and English is one of the most important considerations in choosing Winston’s school and after-school activities, as well as domestic helpers and tutors. The parents have observed and monitored Winston’s language development closely and adjusted the amount of input in English, Mandarin, and Cantonese accordingly. Winston is attending an English-medium elementary school for gifted children at the time of writing.

Winston’s parents started to record their interactions with Winston in Cantonese and Mandarin weekly at home when Winston was 1;00. As with Luna, we collected Winston’s speech data in English via Skype interactions, which started when Winston was 3;00 and took place weekly and then bi-weekly. This release contains a total of 22 hours and 24 minutes of recordings, covering Winston’s development in Cantonese, Mandarin and English from 1;07 to 3;07. In Mai, Wu, Wong, Law and Yip (2018), we discussed the combined roles of cross-linguistic influence and input frequency by examining the development of the zoeng-construction.


We would like to express our gratitude to Brian MacWhinney, Director of CHILDES, for his expertise, advice and technical support in constructing the Child Heritage Chinese Corpus (CHCC).

Our special thanks go to the Research Assistants who participated in the Skype recording sessions and/or transcribed the speech data: Shanrong Xie, Lana Yinyin Liang, Sophia Zishu Yu, Kay Wong, Riki Yuqi Wu, Julia Yujing Fan, Scarlet Wan Yee Li, Hannah Lam, and especially Vaness Tsz Yan Law, Joy Jieyu Zhou and Alice Yanxin Zhu. We gratefully acknowledge the support and help of our lab members: Xiangjun Deng, Emily Haoyan Ge and Jiangling Zhou.

The research was supported by a start-up grant to set up the University of Cambridge-Chinese University of Hong Kong Joint Laboratory for Bilingualism at CUHK, CUHK funding for the CUHK-Peking University-University System of Taiwan Joint Research Centre for Language and Human Complexity, and a General Research Fund from the Hong Kong Research Grants Council (“Childhood bilingualism and heritage language development”, Project no. 14632016).