CHILDES Swedish Lund Corpus
Sven Strömqvist
Department of Linguistics
Lund University
sven.stromqvist@ling.lu.se
Participants: 5
Type of Study: naturalistic
Location: Sweden
Media type: media on request
DOI: doi:10.21415/T5V02K
Plunkett, K., & Strömqvist, S. (1992). The acquisition of Scandinavian
languages. In D. I. Slobin (Ed.), The crosslinguistic study of language
acquisition: Volume 3, pp. 457-556. Hillsdale, NJ: Lawrence Erlbaum
Associates.
Strömqvist, S., Richthoff, U., & Andersson, A. B. (1993).
Strömqvist’s and Richthoff’s corpora: A guide to longitudinal data from
four Swedish children. Gothenburg Papers in Theoretical Linguistics, 66.
In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.
Project Description
The 74 computerized transcription files contained in this second release
of the Swedish corpus relate to the project “Databasorienterade studier
i svensk barnspraaksutveckling” (Database oriented studies of Swedish
child language development), in which the language development in five
monolingual Swedish children is analyzed.
The five children under study grew up in middle-class families on the
west coast of Sweden. The families speak standard Swedish with a modest
touch of the regional variant. The recorded material relates to a wide
range of activity types: everyday activities in the home (such as meals,
bedtime procedures, cooking, washing, etc.); free play; storytelling; as
well as adult–child interaction, child–child interaction, and soliloquy.
Data collection has already been completed for two of the children: a
boy, Markus (1;3.19 to 6;0.09), and a girl, Eva (1;0.21 to 3;9.23). The
data from Markus and Eva, who are siblings, constitute one component of
the larger Swedish corpus. The second component includes data from
Harry and Thea, who are siblings, and from Anton. These files
constitute “Richthoff’s corpus.” Data collection started at 1;11.08 for
Anton, at 1;5.26 for Harry, and at 1;0.02 for Thea.
Index
The name of each of the computerized transcription files reflects the
name of the child and his or her age (in months and days) at the time of
the recording. The present release of the corpus contains 74
transcription files: 28 from Markus (ma15_19.cha to ma33_29.cha), 20
from Anton (ant23_08.cha to ant34_04.cha) and 26 from Harry
(har18_20.cha to har35_07.cha).
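
As an illustration of the naming scheme described above, the following Python sketch splits a file name into child and age, and converts the age to the years;months.days notation used elsewhere in this guide. The function names and the prefix table are hypothetical; the prefixes are simply read off the file names listed above, and the pattern assumes every name follows the form prefix, months, underscore, days, ".cha".

```python
import re

# Assumed mapping, inferred from the file names listed in this guide.
CHILD_PREFIXES = {"ma": "Markus", "ant": "Anton", "har": "Harry"}

# prefix + months + "_" + days + ".cha", e.g. "ma15_19.cha"
FILENAME_RE = re.compile(r"^([a-z]+)(\d+)_(\d+)\.cha$")


def parse_transcript_name(filename):
    """Return (child, months, days) for a name such as 'ma15_19.cha'."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    prefix, months, days = match.groups()
    # Fall back to the raw prefix if the child is not in the table.
    child = CHILD_PREFIXES.get(prefix, prefix)
    return child, int(months), int(days)


def childes_age(months, days):
    """Format an age given in months and days as years;months.days."""
    years, rest = divmod(months, 12)
    return f"{years};{rest}.{days:02d}"


child, months, days = parse_transcript_name("ma15_19.cha")
print(child, childes_age(months, days))
```

For ma15_19.cha this yields Markus at 1;3.19, which agrees with the starting age given for Markus in the project description.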
Transcription Conventions
All main tiers (both child and adult) have been morphologically
segmented by means of the symbols # (prefix), + (lexical compound) and -
(suffix). The utterance delimiters ! and ? indicate exclamation and
question, respectively. A full stop is used as a default utterance
delimiter but has no specific linguistic meaning. It should be read as
ambiguous with respect to functions like statement, request, and so
forth. Utterances have been identified on intonational criteria. In the
present release, only the 28 Markus files have been checked for
reliability. The reliability check indicates a breaking point at 18
months and 10 days (ma18_10.cha). In the transcripts before ma18_10.cha
the two project transcribers agreed on utterance segmentation in 80-85%
of the cases, whereas after ma18_10.cha they agreed in 96-99% of the
cases.
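
The segmentation symbols can be processed mechanically. The Python sketch below splits a segmented word into morphemes and labels each one. It assumes that # follows a prefix, - precedes a suffix, and + separates the members of a compound. The function name and the sample words are hypothetical illustrations, not forms taken from the transcripts, and the sketch ignores any other symbols that may occur on a main tier.

```python
import re


def segment(word):
    """Split a segmented word into (morpheme, role) pairs.

    Assumes '#' follows a prefix, '-' precedes a suffix and '+' joins
    the members of a lexical compound.
    """
    # The capturing group keeps the boundary symbols in the result:
    # "barn+vagn-ar" -> ["barn", "+", "vagn", "-", "ar"]
    parts = re.split(r"([#+\-])", word)
    morphemes = parts[0::2]
    markers = parts[1::2]

    labelled = []
    for i, morpheme in enumerate(morphemes):
        if i < len(markers) and markers[i] == "#":
            role = "prefix"   # morpheme is followed by '#'
        elif i > 0 and markers[i - 1] == "-":
            role = "suffix"   # morpheme is preceded by '-'
        else:
            role = "stem"     # plain stem or compound member
        labelled.append((morpheme, role))
    return labelled


# Illustrative words only; not taken from the corpus.
print(segment("barn+vagn-ar"))
print(segment("o#lycklig"))
```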
Lexicon Files
The transcripts are morphologically oriented and take Swedish
orthography as a point of departure but allow for deviations from the
orthographic norm in order to capture qualities of spoken Swedish. In
particular, we have tried to avoid the fallacy of overrepresenting or
underrepresenting the child’s knowledge of morphology in terms of the
adult norm. The three children so far transcribed vary considerably in
acquisition structure and way of speaking and this is reflected in the
transcripts. The word forms in the transcripts of Markus are, as a rule,
sufficiently transparent to be successfully interpreted by a speaker of
Swedish. In contrast, several of the early transcripts of Harry are
less transparent and the majority of the transcripts of Anton are rather
opaque. As a guide to these opaque word forms we have constructed a set
of lexicon files for Harry and Anton. Each of Harry’s 26 and Anton’s 20
transcript files is matched with a lexicon file containing a list of the
opaque word forms in the transcript file, the transcriber’s
interpretation of the opaque word form in terms of the closest
adult/target word form (the child’s form is often ambiguous and several
interpretations/target forms are rendered) and the token frequency of
the opaque word form. There is a strong general tendency for ambiguous
forms to be among the most frequent forms. For example, the file
har32_25.cha has a matching lexicon file har32_25.lex, which, among many
other entries, contains the line “27 e aer/en/ett”. This means: 27
tokens of the transcribed form “e”, which the child uses sometimes as
“aer” (copula:PRES), sometimes as “en” (indefinite article:common
gender), and sometimes as “ett” (indefinite article:neuter gender).
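
A lexicon line of the kind quoted above can be read with a few lines of Python. This is a minimal sketch that assumes every entry has the same layout as the quoted example: token frequency, child form, and slash-separated target forms, separated by whitespace. The function name and the dictionary keys are hypothetical, and actual .lex files may contain other kinds of lines.

```python
def parse_lexicon_line(line):
    """Parse a line such as '27 e aer/en/ett' into its three fields."""
    # Split on whitespace into at most three fields:
    # frequency, child form, slash-separated target forms.
    count, form, targets = line.split(maxsplit=2)
    return {
        "tokens": int(count),
        "form": form,
        "targets": targets.strip().split("/"),
    }


entry = parse_lexicon_line("27 e aer/en/ett")
print(entry["form"], entry["tokens"], entry["targets"])
```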
Coding
In the present version of the text files, three things are coded: time,
word accents, and feedback. First, a %tim tier is used to indicate the
temporal location of an utterance in minutes and seconds from the start
of the recording (e.g., “32:12” means 32 minutes and 12 seconds).
Second, a %wac tier is used to code word accents. So far,
the marked word accent, “accent 2” (grave), is coded only when it occurs
in utterance focus position. The code used for marking accent 2 in focus
position is “WAC2:FOC.” Unclear cases are marked “WAC2:FOC?” (The auditory
identification of accent 2 contours is far from unproblematic. The
presence of only a %wac tier indicates an instance of accent 2 on which
the two transcribers agreed. For cases where there was a disagreement
between the two transcribers, an additional %wan tier is used to
indicate a conflicting judgment.) Third, a %nfb tier is used to code
so-called narrow feedback morphemes. Only feedback-giving morphemes
(such as hm, naehae) have been coded so far. The code used for marking
feedback givers is “FBG.” Unclear cases are marked “FBG?” In addition
to the three coding tiers mentioned, a fourth tier, %aaf, is used to
indicate that one or several word forms on the main tier have been
subjected to acoustic analysis and are stored in an acoustic analysis
file. The acoustic analysis tier provides information necessary for the
identification of the matching aaf file(s). Whereas %tim: is a standard
option from the CHILDES manual, %wac:, %nfb:, and %aaf: are not. The
three latter codes have been used only for project-internal purposes.
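
To compare utterance positions across a recording, the %tim values can be converted to seconds. The Python sketch below assumes that each %tim line holds a single minutes:seconds value after the tier label, as in the “32:12” example above. Both function names are hypothetical.

```python
def tim_to_seconds(value):
    """Convert a minutes:seconds string such as '32:12' to seconds."""
    minutes, seconds = value.strip().split(":")
    return int(minutes) * 60 + int(seconds)


def read_tim_tier(line):
    """Return the offset in seconds for a '%tim:' line, else None."""
    if not line.startswith("%tim:"):
        return None
    # Drop the tier label; tim_to_seconds strips the whitespace after it.
    return tim_to_seconds(line[len("%tim:"):])


print(read_tim_tier("%tim:\t32:12"))  # 32 * 60 + 12 seconds
```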
An Acoustic Archive
In addition to the computerized transcription files, we have created a
computerized acoustic archive containing a sample of a little more than
500 disyllabic word forms from Markus between the ages of 18;10 and
26;10 (months;days). The archive was created in the MacSpeech Lab
environment. The sample contains both
monomorphemic and dimorphemic word forms, the latter being either
lexical compounds or stems plus an inflectional suffix. Further, the
sample contains word forms that make up one-word utterances as well as
word forms from the initial, medial or final position in multi-word
utterances. Copies of the acoustic archive can be obtained from Sven
Strömqvist, who welcomes comments and questions relating to the Swedish
corpus.
Acknowledgements
The project is supported by the Swedish Research Council for the Humanities and Social Sciences
(HSFR), grants F 783/91 and F 517/92 to the Department of Linguistics,
University of Göteborg, Sweden. A comprehensive guide to the Swedish
corpus is presented in Strömqvist, Richthoff, and Andersson (1993).