CHILDES English-Portuguese LOBILL Corpus

Catherine Lonngren-Sampaio
English Language and Linguistics
University of Hertfordshire, UK

Participants: 2
Type of Study: longitudinal, naturalistic
Location: Brazil, UK
Media type: audio
DOI: doi:10.21415/20XC-4T46

Citation information

Lonngren-Sampaio, C. (2015) The investigation of code-switching in a computerised corpus of child bilingual language. Unpublished doctoral dissertation, University of Hertfordshire.

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by the above reference.

Project Description

The LOBILL (LOnngren BILingual Language) Corpus is longitudinal in nature and is composed of the spoken language of two bilingual children in their interactions with monolingual and bilingual interlocutors, in diverse family situations. The main subjects of the corpus are MEG and JAM, a sister and brother who were 5;10 and 3;5 years old at the beginning of data collection. Both were born in Fortaleza, Brazil and attended a Brazilian school from the age of 1;6 until they moved to England in 2004, when they were 8;7 and 6;3 respectively. Their mother (identified as MOT), who is the researcher, is English and married to a Brazilian (identified as PAI) who speaks English fluently. MOT is a near-native speaker of Portuguese, having studied the language at university and lived in Brazil for twelve years before returning to England in 2004.

The bilingual siblings’ language experience can be divided into two major phases, which correspond to before and after moving to England. Before the birth of their children, Portuguese formed the basis of all daily interaction between MOT and PAI, although code-switching did occur. From the birth of MEG in 1995 the family language dynamics changed: MOT spoke exclusively English to her daughter while PAI used Portuguese when addressing her. This daily use of English at home led to greater use of English between the parents, mostly in their code-switching practices. This pattern was further consolidated when JAM was born in 1998, with MOT continuing to speak English to both siblings while PAI interacted with them mostly in Portuguese.

Other daily inputs of English were restricted to television programmes (Cartoon Network and Discovery Channel) and English story books (read by the mother). Occasional visits from English relatives provided another important source of contact with English, and in most years (1996, 1998, 2000 and 2003) both children spent short periods on holiday in England with their mother, where they stayed with their English grandmother.

Despite the mother’s use of English to both children, the interaction between the siblings was predominantly in Portuguese, following the model of interaction experienced with their peers at their Brazilian school. Whilst in England on holiday, there was more use of English between the siblings, especially when in the presence of English cousins. When the family moved to England in June 2004, MEG was 8;7 and JAM was 6;3. MEG had been reading and writing in Portuguese for 2 years while JAM had only just learnt to read and write in Portuguese. Although MEG was able to read in English, her written English showed clear influence from Portuguese. JAM was able to read some English but there was no evidence that he was able (or unable) to write English words.

Immediately after moving to England in June 2004, the mother, MEG and JAM stayed with the children’s grandmother (GRA) and their auntie (BEC). Their father, PAI, was due to arrive in August, two months later. The children began primary school three days after arriving, and thus both at home and at school they were immersed in English. For the next two months the children’s only sources of Portuguese were their interactions with each other and telephone calls to their father in Brazil. With the arrival of their father at the end of August and a move into a family home of their own, Portuguese again began to feature in their interactions at home on a daily basis. As described below, the transcriptions contained in the corpus cover the period up to the December after the family's arrival in England.

Data collection procedures

The data collection began in August 2001 and finished in December 2004, totalling approximately 24 hours of recordings. Originally collected using a mini-cassette recorder, the recordings were later digitized to enable their contribution to TalkBank. The major part of the corpus consists of naturalistic data involving dinner table conversations between family members, free play situations and telephone calls between the children and Brazilian relatives. Recordings were also made of ‘interviews’ which took place mostly between the mother and MEG and JAM.

To aid the selection of specific files when using CLAN, the file names were designed to contain the following information: file number/interaction code/main language of interaction/which sibling(s) was (were) present/the month and year of recording. For example, the file name 001CHenJ&MAUG01 tells us that it was the first recording (001), it involves chatting (CH), English was the main language of interaction (en), both JAM and MEG were present (J&M), and it was recorded in August 2001 (AUG01). The interactions were broadly categorized as one of six types, although naturally some crossover occurs. The codes for these interaction types are as follows:

More specific information about each file can be found in this file, which lists the interlocutors for each recording, the ages of the siblings and the location and activity of each interaction.
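The file-naming scheme can be unpacked programmatically when selecting files outside CLAN. The following is a minimal Python sketch, not part of the corpus tooling. The function name `parse_file_name` and the `RecordingName` record are hypothetical, and the pattern assumes that every file name has the same shape as the one documented example (001CHenJ&MAUG01): a three-digit number, an upper-case interaction code, a lower-case language code, `J`, `M` or `J&M` for the siblings, a three-letter month and a two-digit year in the 2000s.

```python
import re
from typing import NamedTuple, Tuple

# Assumed shape, generalised from the single documented example 001CHenJ&MAUG01.
FILE_NAME = re.compile(
    r"^(?P<number>\d{3})"            # file number, e.g. 001
    r"(?P<interaction>[A-Z]+)"       # interaction code, e.g. CH (chatting)
    r"(?P<language>[a-z]+)"          # main language of interaction, e.g. en
    r"(?P<siblings>[JM](?:&[JM])?)"  # J, M or J&M
    r"(?P<month>[A-Z]{3})"           # month, e.g. AUG
    r"(?P<year>\d{2})$"              # two-digit year, e.g. 01
)


class RecordingName(NamedTuple):
    number: int
    interaction: str
    language: str
    siblings: Tuple[str, ...]
    month: str
    year: int


def parse_file_name(name: str) -> RecordingName:
    """Split a LOBILL-style file name into its labelled components."""
    match = FILE_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised file name: {name!r}")
    return RecordingName(
        number=int(match["number"]),
        interaction=match["interaction"],
        language=match["language"],
        siblings=tuple(match["siblings"].split("&")),
        month=match["month"],
        # Recordings run from 2001 to 2004, so two-digit years are read as 20xx.
        year=2000 + int(match["year"]),
    )
```

Calling `parse_file_name("001CHenJ&MAUG01")` yields recording number 1, interaction code `CH`, language `en`, siblings `J` and `M`, month `AUG` and year 2001. Names that do not fit the assumed shape raise a `ValueError` rather than being guessed at.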

Transcription and coding

All 119 recordings were transcribed by the researcher (and mother of the siblings). For practical and financial reasons there was no second transcriber to assist with the transcription process. However, each transcription was checked again during a second, and often third, listening of each recording, carried out at a later date. Due to the nature of the research being carried out on the LOBILL corpus (on code-switching), it was crucial that the main line utterances be coded according to the language(s) used and that each utterance be followed by an addressee code on the %add dependent line. The following example illustrates the use of these codes:

*MEG: mas@s [//] the water is very very cold ? [+ pe]
%add: MOT

In such code-switched utterances, postcodes (here [+ pe]) were used to code the direction and number of switches within each utterance. By using 'p' to represent a Portuguese word or sequence of words in Portuguese and 'e' for the English equivalent, the postcode can contain any number of ps and es and therefore can cover any number of switches which may occur within one utterance. The error code [*] was used to mark any perceived errors and, where possible, further information was included on the %err dependent line.
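As an illustration of how the postcodes can be read, the sketch below extracts a postcode from a main line and lists the switches it encodes. It is a hedged Python example rather than the CLAN procedure used in the dissertation. The function names are hypothetical, and it assumes that a postcode has the form `[+ ` followed by a run of `p` and `e` letters and `]`, with each change of letter marking one switch.

```python
import re
from typing import List, Tuple

LANGUAGES = {"p": "Portuguese", "e": "English"}

# Assumed postcode shape: "[+ pe]", "[+ pep]", and so on.
POSTCODE = re.compile(r"\[\+ ([pe]+)\]")


def switches_in_postcode(letters: str) -> List[Tuple[str, str]]:
    """Return one (from_language, to_language) pair per change of letter."""
    return [
        (LANGUAGES[a], LANGUAGES[b])
        for a, b in zip(letters, letters[1:])
        if a != b
    ]


def switches_in_utterance(main_line: str) -> List[Tuple[str, str]]:
    """Find the postcode on a main line and list its switches (empty if none)."""
    match = POSTCODE.search(main_line)
    return switches_in_postcode(match.group(1)) if match else []
```

For the example utterance above, `switches_in_utterance` returns a single switch from Portuguese to English. A postcode such as `[+ pep]` would give two switches, Portuguese to English and then back to Portuguese.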

CLAN analyses

A detailed description of how the corpus was analysed both quantitatively and qualitatively can be found in the doctoral dissertation cited above. However, it is important to note that the language coding used to code the corpus in 2015 differs from that used in the current version (2022). This means that the command lines, which can be found in the footnotes of the dissertation, can no longer be replicated.


This work was supported in part by grants VARIAD (FFI2012-35058) and Contact (FFI2016-75082) from the Spanish Ministry of Education to Dr. Aurora Bel Gaya.