MCF Bilingual Corpus

Madalena Cruz-Ferreira
Independent Scholar

Participants: 3
Type of Study: naturalistic
Location: Sweden, Portugal
Media type: audio
DOI: doi:10.21415/T52W2D

Citation information

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Other references include:

Project Description

This corpus contains longitudinal and cross-sectional data from three children, two girls and one boy, primary bilinguals in Portuguese and Swedish, who acquired English as the language of schooling.

The children

Karin, Sofia and Mikael are siblings, from an upper middle-class family background. The father is a native speaker of (Central Standard) Swedish and the mother, who is also the researcher and a trained phonetician, is a native speaker of (European) Portuguese.

Karin and Sofia were born in Sweden, in September 1986 and July 1988, respectively; Mikael was born in Portugal in October 1990. From birth, the children have been exposed to Portuguese and Swedish according to the one-person, one-language principle, to which the parents have adhered ever since. The parents are otherwise fluent in one another’s language as well as in English. In all exchanges between the children and Portuguese or Swedish relatives and friends the one-person, one-language principle is easily maintained. The children have been exposed to several accents of Swedish and Portuguese, the latter including Brazilian Portuguese.

Due to the father’s professional commitments, the family has had several moves to different countries since the children’s birth. A schematic indication follows, in order to highlight the extent of the children’s exposure to different languages.

When in Europe, the family traveled to Sweden for the summer and to Portugal for Christmas or vice-versa. In Asia, the family travels to both countries for either the summer or Christmas. Before 1993, the children also had irregular exposure to English, through exchanges between the parents and foreign guests to the home, or from social gatherings involving Swedish and Portuguese relatives or friends.

At the age of 6, all three children started attending a once-weekly Swedish Supply School in the countries where the family has lived, where they learn about the language and the country. The children never had any formal tuition of this type in Portuguese, although they are comfortably familiar with the culture of both Sweden and Portugal.

As far as exposure to languages other than those mentioned above is concerned, Karin (10;0) and Sofia (9;2) started curricular lessons in Mandarin at school from grade 5. Both girls have Latin at school, and Sofia has French. They are, of course, exposed to the local languages spoken in Singapore, the main ones being Mandarin, as well as other Chinese languages, Malay and Tamil. They are also familiar with different accents of English, including non-native accents.

Sofia, the last of the three to start speaking, was diagnosed at age 4 with 40% deafness due to recurrent middle-ear infections for which she had been receiving regular medication since babyhood. She underwent grommet and adenoid surgery twice, first in Portugal at 4;9 and later in Singapore at 6;2, when the problem was solved. The noteworthy consequence of this problem was that up to the age of 10 her delivery was rather slurred in both Portuguese and Swedish, whereas her delivery in English, which she started learning with normal hearing, was faster and clearer from the very start. Mikael had a lisp, which he spontaneously corrected at age 5;9. The children are otherwise healthy and their development is normal.

The children have always lived with both parents, and have always taken an active part in the family’s life. The mother is the main caregiver, having stayed at home during the children’s first years. The children are therefore mostly exposed to Portuguese at home. In order to counterbalance this asymmetry, compounded by the regular absences of the father due to business travel, the parents chose to address one another mostly in Swedish in the presence of the children. While consistently using either Portuguese or Swedish in exchanges with each parent, the children started by using Portuguese among themselves, except when recalling or discussing events specifically related to Sweden, like skiing or the midsummer celebration, for which they used Swedish. From the start of their regular schooling in English, they gradually started using more English among themselves, and English is now almost exclusively the language of their exchanges. None of the children has ever felt self-conscious about using Portuguese or Swedish with their parents in front of non-speakers of the languages, including other children.

Data collection

Data have been collected since the birth of each child through audio recordings, video recordings and diary notes made by the mother.

Audio and video tapes are reviewed soon after recording, and supplemented by diary notes wherever clarification is needed. Otherwise, extensive diary notes are used to record each child’s progress, both linguistic and in other developmental areas. Recordings are typically made whenever a new linguistic trait appears in the children’s speech, in the same way that progress in other areas is noted down in the diaries, that is, on no regular chronological basis. The data in this corpus concern the children’s Portuguese, Swedish and, from 1993, their English. Most of the data reflect spontaneous speech, except in cases where the child was specifically asked to speak (or sing, or read) ‘for the record’, for example, to say the colour or animal names in a picture book.

Typical recording sessions took place, in the first months of the children’s life, with the child safely lying down and playing on its own or interacting with one parent or relative. Later, the tape recorder was turned on in an inconspicuous place where the children were busying themselves or being attended to. The children were obviously aware of the camera during video recordings, but its presence soon became an uninteresting detail of their routine. Recordings encompass a broad spectrum of situations. Aside from the recordings made to capture specific progress, which were usually made at home, recordings include daily routines, solitary play or play with other children, festive gatherings with family and friends, and outings. The data therefore give a broad view of each child’s full (socio)linguistic ability, including making acquaintance with adults and children, voice modulations and strategies to call the attention of distant hearers, or strategies to overcome background noise. For the recordings of spontaneous interaction with children outside of the family, parental permission to use the data was duly requested and obtained.

One possible shortcoming of the recorded data is that the mother was regularly present during collection, except in those cases when the tape recorder was left on with the children on their own. Other shortcomings of spontaneous child speech collection are well-known to researchers in this area, from the children’s unwillingness to cooperate, to disruptions from siblings or equipment during recording of one particular child. The detail included in the diaries therefore constitutes an invaluable complementary resource.

Transcription and coding

Data were transcribed and coded by the researcher, who is competent in all three languages. Transcription was made as soon as possible after recording, and rechecked when coding into CHAT format, from January 2000.

All files in the corpus include a %pho: tier and a %int: tier. Both tiers are also used to transcribe adult utterances with characteristic features of child-directed speech, or otherwise non-standard.

The %pho: tier.

The %int: tier.
This tier transcribes uses of pitch, adapting the principles of nuclear notation described in the CHAT Manual, and includes indication of voice quality and paralinguistic features, e.g., creak, tempo.

Adult speech and target-like child speech are transcribed by means of abbreviated paired symbols. In simple falling, rising or level tones, the first symbol denotes the high, mid or low pitch at which the tone starts, and the second symbol denotes the type of pitch movement, falling, rising or level. The one exception is the Portuguese extra-low fall, see below. ‘High’, ‘mid’ and ‘low’ are relative terms: a ‘mid’ pitch level denotes the speaker’s average tone range, as it is impressionistically detected in regular contact with any speaker, ‘high’ and ‘low’ being accordingly defined in relation to ‘mid’ for each speaker. In complex tones, the successive symbols indicate the type of pitch movement. The conventions are as follows:

Simple falls:

Simple rises:

Level tones:

Complex tones:

Complementary indication of where the pitch ends is added where relevant, e.g., “HF to mid”.
Other conventions are:

These symbols always follow symbols indicating pitch start or type, so that confusion between the H denoting ‘high’ and the H denoting ‘head’ is avoided. Examples of their use are:

Transcription of each tone group (tg) is given on successive lines of the %int: tier. Prehead, head and tone are separated by + signs in the transcription, e.g. (file ptgsw.K880500, lines 865 and 867-869):
*DAD: vad heter //det för nåt # vad är //det för nåt # vad //heter det.
%int: 1tg, MH+ML; 2tg, MH+LR; 3tg, LH+MF.

In babbled speech, no assumption is made concerning the existence of an intonational nucleus. Transcription of babble concerns pitch height and movement on each babbled syllable, according to similar conventions. The main difference is that + signs here indicate syllable boundaries, e.g. (file ptgsw.M901215, lines 77-81):
*MIK: yyy.
%int: 1tg, ML+long MF; 2tg, LL; 3tg, HL+short LF.
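The layout of the %int: tier shown in the two examples above can be read mechanically. The following is a minimal sketch in Python, not part of the corpus tools: the function names `parse_int_tier` and `describe_simple_tone` are hypothetical, and the letter values (H, M, L for the starting pitch; F, R, L for fall, rise and level) are assumptions inferred from the examples and the prose description of paired symbols. The sketch does not cover heads, preheads, complex tones or the Portuguese extra-low fall.

```python
import re

# Assumed letter values, inferred from the examples in this description.
PITCH = {"H": "high", "M": "mid", "L": "low"}
MOVEMENT = {"F": "fall", "R": "rise", "L": "level"}


def parse_int_tier(line):
    """Split a %int: tier line into (tone-group number, segments) pairs.

    Segments are the items separated by + signs: prehead, head and tone
    in adult or target-like speech, syllables in babbled speech.
    """
    # Drop the tier label and the final full stop.
    body = line.split(":", 1)[1].strip().rstrip(".")
    groups = []
    for chunk in body.split(";"):
        label, _, transcription = chunk.strip().partition(",")
        match = re.fullmatch(r"(\d+)tg", label.strip())
        if match is None:
            raise ValueError(f"unexpected tone-group label: {label!r}")
        segments = [part.strip() for part in transcription.split("+")]
        groups.append((int(match.group(1)), segments))
    return groups


def describe_simple_tone(code):
    """Spell out a two-letter simple tone such as 'MF' (assumed letters)."""
    start, movement = code
    return f"{PITCH[start]} {MOVEMENT[movement]}"


adult = parse_int_tier("%int: 1tg, MH+ML; 2tg, MH+LR; 3tg, LH+MF.")
babble = parse_int_tier("%int: 1tg, ML+long MF; 2tg, LL; 3tg, HL+short LF.")
print(adult)
print(babble)
print(describe_simple_tone("LR"))
```

Whether a segment is a head, a tone or a babbled syllable depends on the kind of speech transcribed, so the sketch returns the segments untouched and leaves that interpretation to the user.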

Other conventions and symbols

Orthography - adult utterances, and children’s utterances recognised as (renderings of) target forms, are given in standard orthography. A form of ad-hoc ‘baby orthography’ is also used for child connected speech that, although replicating target utterances, distorts segments and prosody beyond any readable use of CHAT conventions for truncated child utterances. In these cases, standard orthography is given in the %gls: tier. It is hoped that ‘baby orthography’ will be easily understandable by native users of the database. One example is in file ptgsw.SM910100, lines 24-25:
*SOF: a/mã # k/lhi klh/kó?
%gls: mamã, a Karin está na escola?

Ptg, Sw, Eng - indicate quotation of data in Portuguese, Swedish and English, respectively, in %com, %exp or %lan tiers. Notations of the type PtgEng are used in the same way for multilingual mixes, with the first symbol indicating host language (the language accepting an intrusion) and the second guest language (the intruding language). In the %lan tier, the use of one language symbol on its own indicates a probable rendition of a target in the language.
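As an illustration of the host and guest ordering described above, the following Python sketch splits a language notation into its parts. The function name `parse_language_tag` is hypothetical, and the sketch assumes that a mix is written as exactly two of the symbols Ptg, Sw and Eng run together, as in PtgEng.

```python
LANGUAGES = {"Ptg": "Portuguese", "Sw": "Swedish", "Eng": "English"}


def parse_language_tag(tag):
    """Return (host, guest) for a notation such as 'PtgEng'.

    A single symbol such as 'Sw' yields (language, None).
    """
    if tag in LANGUAGES:
        return LANGUAGES[tag], None
    for host in LANGUAGES:
        guest = tag[len(host):]
        # The first symbol is the host language, the remainder the guest.
        if tag.startswith(host) and guest in LANGUAGES:
            return LANGUAGES[host], LANGUAGES[guest]
    raise ValueError(f"unrecognised language notation: {tag!r}")


print(parse_language_tag("PtgEng"))
print(parse_language_tag("Sw"))
```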

The files contain monolingual and/or mixed production by one or more of the children. The filenames include a language prefix, the child(ren)’s initial(s) and the date of recording, given as yymmdd. An indication of 00 for the day means that the exact day of recording is unknown. Files containing all three languages are prefixed ptswen.
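The filename convention can be unpacked as in the following Python sketch, which is offered as an illustration only. The names `RecordingName` and `parse_filename` are hypothetical. The sketch assumes the layout of the filenames cited in this description (prefix, full stop, initials, six-digit date), that the initials K, S and M stand for Karin, Sofia and Mikael, and that two-digit years from 86 upwards belong to the 1900s and lower values to the 2000s. It also treats only the day as possibly unknown.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Assumed mapping from initials to children.
CHILDREN = {"K": "Karin", "S": "Sofia", "M": "Mikael"}

# Assumed layout: prefix, full stop, initials, then yymmdd.
NAME = re.compile(
    r"(?P<prefix>[a-z]+)\.(?P<initials>[KSM]+)"
    r"(?P<yy>\d{2})(?P<mm>\d{2})(?P<dd>\d{2})"
)


class RecordingName(NamedTuple):
    prefix: str
    children: Tuple[str, ...]
    year: int
    month: int
    day: Optional[int]  # None when the day is given as 00


def parse_filename(name):
    """Split a corpus filename into prefix, children and recording date."""
    match = NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected filename: {name!r}")
    yy = int(match["yy"])
    # Assumption: the earliest recordings date from 1986.
    year = 1900 + yy if yy >= 86 else 2000 + yy
    day = int(match["dd"])
    return RecordingName(
        prefix=match["prefix"],
        children=tuple(CHILDREN[initial] for initial in match["initials"]),
        year=year,
        month=int(match["mm"]),
        day=day or None,
    )


print(parse_filename("ptgsw.K880500"))
print(parse_filename("ptgsw.SM910100"))
```

The language prefix is returned as a plain string because only two prefixes, ptgsw and ptswen, are attested in this description.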