CHILDES Swedish Lund Corpus

Sven Strömqvist
Department of Linguistics
Lund University
sven.stromqvist@ling.lu.se

Participants:	5
Type of Study:	naturalistic
Location:	Sweden
Media type:	media on request
DOI:	doi:10.21415/T5V02K

Citation information

Plunkett, K., & Strömqvist, S. The acquisition of Scandinavian languages. In D. I. Slobin (Ed.), The crosslinguistic study of language acquisition: Volume 3, pp. 457-556. Hillsdale, NJ: Lawrence Erlbaum Associates.

Strömqvist, S., Richthoff, U., & Andersson, A. B. (1993). Strömqvist’s and Richthoff’s corpora: A guide to longitudinal data from four Swedish children. Gothenburg Papers in Theoretical Linguistics, 66.

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

The 74 computerized transcription files contained in this second release of the Swedish corpus relate to the project “Databasorienterade studier i svensk barnspraaksutveckling” (Database-oriented studies of Swedish child language development), in which the language development of five monolingual Swedish children is analyzed.

The five children under study grew up in middle-class families on the west coast of Sweden. The families speak standard Swedish with a modest touch of the regional variant. The recorded material covers a wide range of activity types: everyday activities in the home (such as meals, bedtime procedures, cooking, and washing), free play, and storytelling, as well as adult–child interaction, child–child interaction, and soliloquy.

Data collection is already completed for two of the children: a boy, Markus, from 1;3.19 to 6;0.09, and a girl, Eva, from 1;0.21 to 3;9.23. The data from Markus and Eva, who are siblings, constitute one component of the larger Swedish corpus. The second component includes data from Harry and Thea, who are siblings, and from Anton. These files constitute “Richthoff’s corpus.” Data collection started at 1;11.08 for Anton, at 1;5.26 for Harry, and at 1;0.02 for Thea.

Index

The name of each of the computerized transcription files reflects the name of the child and his or her age (in months and days) at the time of the recording. The present release of the corpus contains 74 transcription files: 28 from Markus (ma15_19.cha to ma33_29.cha), 20 from Anton (ant23_08.cha to ant34_04.cha) and 26 from Harry (har18_20.cha to har35_07.cha).
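
The naming scheme above can be decoded mechanically. The following is a minimal sketch in Python, not part of the corpus tooling: the function name `parse_transcript_name` and the returned dictionary keys are hypothetical, and the pattern assumes that every file name follows the prefix, months, underscore, days layout seen in the names listed above (e.g., ma15_19.cha). It also converts the age to the years;months.days notation used in the project description.

```python
import re

# Assumed layout: lowercase child prefix, age in months, "_", age in days,
# then a .cha (transcript) or .lex (lexicon) extension.
FILENAME_RE = re.compile(r"^([a-z]+)(\d+)_(\d+)\.(?:cha|lex)$")


def parse_transcript_name(filename):
    """Split a file name such as 'ma15_19.cha' into child prefix and age."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    prefix = match.group(1)
    months = int(match.group(2))
    days = int(match.group(3))
    # Convert months to years;months.days, e.g. 15 months 19 days -> 1;3.19
    years, rest_months = divmod(months, 12)
    return {
        "child": prefix,
        "months": months,
        "days": days,
        "age": f"{years};{rest_months}.{days:02d}",
    }


if __name__ == "__main__":
    for name in ("ma15_19.cha", "ant23_08.cha", "har35_07.cha"):
        print(name, parse_transcript_name(name))
```

Under these assumptions, ma15_19.cha corresponds to the age 1;3.19 and ant23_08.cha to 1;11.08, which agrees with the starting ages given for Markus and Anton in the project description.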

Transcription Conventions

All main tiers (both child and adult) have been morphologically segmented by means of the symbols # (prefix), + (lexical compound) and - (suffix). The utterance delimiters ! and ? indicate exclamation and question, respectively. A full stop is used as a default utterance delimiter but has no specific linguistic meaning. It should be read as ambiguous with respect to functions like statement, request, and so forth. Utterances have been identified on intonational criteria. In the present release, only the 28 Markus files have been checked for reliability. The reliability check indicates a breaking point at 18;10 (months;days, i.e., the file ma18_10.cha). In the transcripts before ma18_10.cha, the two project transcribers agreed on utterance segmentation in 80-85% of the cases, whereas from ma18_10.cha onward they agreed in 96-99% of the cases.
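
The segmentation symbols described above can be used to split a word form into its morphemes. The sketch below is illustrative only: the function name `segment` is hypothetical, the example word forms are invented for illustration rather than taken from the transcripts, and the code assumes it is given a single plain word token (not a whole main-tier line, where symbols such as + may also occur in other CHAT codes).

```python
import re

# Boundary symbols as described in the transcription conventions.
BOUNDARY_LABELS = {"#": "prefix", "+": "compound", "-": "suffix"}


def segment(word):
    """Return (morphemes, boundary labels) for one segmented word form."""
    # The capturing group keeps the boundary symbols in the split result,
    # so morphemes sit at even indices and symbols at odd indices.
    pieces = re.split(r"([#+-])", word)
    morphemes = pieces[0::2]
    boundaries = [BOUNDARY_LABELS[symbol] for symbol in pieces[1::2]]
    return morphemes, boundaries


if __name__ == "__main__":
    # Invented examples, not corpus data.
    print(segment("brand+bil-en"))
    print(segment("o#glad"))
```

A word form without any boundary symbol is returned as a single morpheme with an empty list of boundaries.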

Lexicon Files

The transcripts are morphologically oriented and take Swedish orthography as a point of departure, but they allow for deviations from the orthographic norm in order to capture qualities of spoken Swedish. In particular, we have tried to avoid the fallacy of overrepresenting or underrepresenting the child’s knowledge of morphology in terms of the adult norm. The three children transcribed so far vary considerably in acquisition structure and way of speaking, and this is reflected in the transcripts. The word forms in the transcripts of Markus are, as a rule, sufficiently transparent to be successfully interpreted by a speaker of Swedish. In contrast, several of the early transcripts of Harry are less transparent, and the majority of the transcripts of Anton are rather opaque. As a guide to these opaque word forms, we have constructed a set of lexicon files for Harry and Anton. Each of Harry’s 26 and Anton’s 20 transcript files is matched with a lexicon file that lists, for each opaque word form in the transcript file, the transcriber’s interpretation in terms of the closest adult/target word form (the child’s form is often ambiguous, in which case several interpretations/target forms are given) and the token frequency of the opaque form. In general, ambiguous forms tend strongly to be among the most frequent forms. For example, the file har32_25.cha has a matching lexicon file har32_25.lex, which contains, among many other entries, the line “27 e aer/en/ett”. This means: 27 tokens of the transcribed form “e”, which the child uses sometimes as “aer” (copula:PRES), sometimes as “en” (indefinite article:common gender), and sometimes as “ett” (indefinite article:neuter gender).
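
A lexicon line of the kind quoted above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `parse_lexicon_line` is hypothetical, and the code assumes that every entry has the same layout as the quoted example, namely a token frequency, the transcribed form, and one or more target forms separated by slashes, with whitespace between the three fields. The actual .lex files may contain other kinds of lines that this sketch does not handle.

```python
def parse_lexicon_line(line):
    """Parse one entry such as '27 e aer/en/ett'.

    Returns (token frequency, transcribed form, list of target forms).
    """
    # Split on whitespace into at most three fields.
    count, form, targets = line.split(None, 2)
    return int(count), form, targets.strip().split("/")


if __name__ == "__main__":
    frequency, form, targets = parse_lexicon_line("27 e aer/en/ett")
    print(frequency, form, targets)
```

An entry with more than one target form, as in the example, marks the transcribed form as ambiguous between those targets.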

Coding

In the present version of the text files, three things are coded: time, word accents, and feedback. First, a %tim tier indicates the temporal location of an utterance in minutes and seconds from the start of the recording (e.g., “32:12” means 32 minutes and 12 seconds). Second, a %wac (word accent) tier is used to code word accents. So far, the marked word accent, “accent 2” (grave), is coded only when it occurs in utterance focus position. The code used for marking accent 2 in focus position is “WAC2:FOC”; unclear cases are marked “WAC2:FOC?”. The auditory identification of accent 2 contours is far from unproblematic. The presence of only a %wac tier indicates an instance of accent 2 on which the two transcribers agreed; where they disagreed, an additional %wan tier indicates the conflicting judgment. Third, a %nfb tier is used to code so-called narrow feedback morphemes. Only feedback-giving morphemes (such as hm, naehae) have been coded so far. The code used for marking feedback givers is “FBG”; unclear cases are marked “FBG?”. In addition to these three coding tiers, a fourth tier, %aaf, indicates that one or several word forms on the main tier have been subjected to acoustic analysis and are stored in an acoustic analysis file. The %aaf tier provides the information necessary for identifying the matching aaf file(s). Whereas %tim: is a standard option from the CHILDES manual, %wac:, %nfb:, and %aaf: are not; these three tiers have been used only for project-internal purposes.
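
The minutes:seconds values on the time tier are easy to convert to a single offset in seconds. The following sketch is not part of the corpus tooling: the function name `tim_to_seconds` is hypothetical, and the code assumes values of exactly the minutes:seconds form described above, optionally preceded by the tier label %tim: as it would appear on a tier line.

```python
def tim_to_seconds(tier_line):
    """Convert a time value such as '32:12' (or '%tim:\t32:12') to seconds."""
    value = tier_line.strip()
    # Drop the tier label if the whole tier line was passed in.
    if value.startswith("%tim:"):
        value = value[len("%tim:"):].strip()
    # Assumes exactly two fields; any other shape raises ValueError.
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + int(seconds)


if __name__ == "__main__":
    print(tim_to_seconds("32:12"))         # 32 minutes 12 seconds
    print(tim_to_seconds("%tim:\t32:12"))  # same value with the tier label
```

For the example given in the text, 32 minutes and 12 seconds correspond to 1932 seconds from the start of the recording.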

An Acoustic Archive

In addition to the computerized transcription files, we have created a computerized acoustic archive containing a sample of slightly more than 500 disyllabic word forms from Markus 18;10 to 26;10. The archive was created in the MacSpeech Lab environment. The sample contains both monomorphemic and dimorphemic word forms, the latter being either lexical compounds or stems plus an inflectional suffix. Further, the sample contains word forms that make up one-word utterances as well as word forms from the initial, medial, or final position in multi-word utterances. Copies of the acoustic archive can be obtained from Sven Strömqvist, who welcomes comments and questions relating to the Swedish corpus.

Acknowledgements

The project is supported by the Swedish Research Council for the Humanities and Social Sciences (HSFR), grants F 783/91 and F 517/92 to the Department of Linguistics, University of Göteborg, Sweden. A comprehensive guide to the Swedish corpus is presented in Strömqvist, Richthoff, and Andersson (1993).