CHILDES Polish CDS (Child-Directed Speech) Corpus

Ewa Haman
Department of Psychology
University of Warsaw

The Excel spreadsheet can be downloaded from here and the description can be downloaded from here.

Project Description

The Polish Frequency List of Child-Directed Speech is based on utterances of parents, grandparents, experimenters and other people (older than 8 years) talking in the presence of children. Although not all of these utterances are child-directed, they still represent the type of speech that children are exposed to (for example storytelling, poem recitation, and conversations between parents). The Polish Frequency List of CDS compiles data from seven corpora of child-directed speech and child language, the oldest from the 1940s and the newest from the 2000s.

The data include: (1) the Polish part of the CHILDES database: Szuman Corpus and Weist Corpus; (2) Polish Child Speech Corpus (Joanna Szwabe, Adam Mickiewicz University, Poland); (3) Gdańsk corpus, a speech diary of one girl (Ewa Dąbrowska, Northumbria University, UK & Elena Lieven, Manchester University); (4) speech diary of two brothers (Marta Szreder, York University); (5) narrative mother-child data (Ewa Haman, University of Warsaw & Andrea Zevenbergen, State University of New York, Fredonia); (6) speech diary of twin sisters (Ewa Haman, University of Warsaw).

All data are from naturalistic conversations, except for the narrative mother-child data from the crosslinguistic project of Zevenbergen and Haman (e.g. Zevenbergen, Haman, Olszańska & Thielges, 2008). Those data were included as well, since the participants' task was to talk about any three past events freely chosen from their own experience.

Although only the Szuman and Weist corpora are currently available from CHILDES, all other data are intended to be added there in the future.

Table 1
Author      Recording dates    No. of children   Children’s age   Children’s gender   All word tokens   Word tokens in CDS
Szuman      1945-46, 1953-61   12                0;10 - 6;11      female, male        1 035 691         375 168
Weist       1980-1981          4                 1;7 - 3;2        female, male        73 745            44 003
Dąbrowska   1999               1                 2;0 - 2;1        female              101 529           68 833
Haman       2002-2003          2                 1;11 - 2;23      female              105 280           72 740
Haman &     2006-2007          45                3;1 - 6;1        female, male        50 393            36 265
Szreder     2008-2009          2                 2;2 - 3;7        male                29 424            16 641
Szwabe      2006-2009          62                3;0 - 6;11       female, male        383 429           185 789
TOTAL                          128                                                    1 779 491         799 439

As Table 1 shows, our data include transcriptions of speech directed to 128 children aged 0;10-6;11. The data include more than 799 thousand word tokens in CDS.

Polish is a highly inflected language. It has seven cases and three genders. Declension endings in Polish depend on case (nominative, genitive, dative, accusative, locative, instrumental, vocative), number (singular or plural), gender (masculine, feminine, neuter), animacy (animate or inanimate) and whether a particular word denotes a human or not. Thus one Polish word may have more than ten forms. For example, the adjective goły (bare/naked) appears in 11 distinct forms: goły, gołego, gołemu, gołym, goła, gołą, gołej, gołe, goli, gołych, gołymi. Between them these mark 70 inflectional forms: 7 cases x 2 numbers x 5 genders, where the animacy and human/non-human distinctions are counted as separate genders (comparative and superlative forms are not listed). Because we wanted information about, for example, the variety of words in CDS (rather than of their forms), it was necessary to prepare a frequency list that counts all forms of the same word together and lists the basic form of the word as well as its inflected forms (e.g. goły 46, comprising goła 16, goli 14, gołe 9, goły 6, gołego 1). The Polish Frequency List of CDS is prepared in this way.
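
The counting of inflected forms under one basic form can be illustrated with a minimal Python sketch. The form counts are those of the goły example above; the form-to-lemma table is a hypothetical stand-in for the lemmatizer, whose actual interface is not described in this document.

```python
from collections import defaultdict

# Hypothetical form -> lemma table (an assumption for illustration only;
# the project used a dedicated lemmatizing program instead).
FORM_TO_LEMMA = {
    "goły": "goły",
    "goła": "goły",
    "goli": "goły",
    "gołe": "goły",
    "gołego": "goły",
}

# Form counts taken from the goły example in the text.
form_counts = {"goła": 16, "goli": 14, "gołe": 9, "goły": 6, "gołego": 1}


def lemma_frequencies(counts, form_to_lemma):
    """Sum the counts of all inflected forms under their shared lemma."""
    totals = defaultdict(int)
    for form, count in counts.items():
        # Forms missing from the table are kept under their own spelling.
        totals[form_to_lemma.get(form, form)] += count
    return dict(totals)


lemma_totals = lemma_frequencies(form_counts, FORM_TO_LEMMA)
print(lemma_totals)  # {'goły': 46}
```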

To obtain the frequency list of CDS, a database (PostgreSQL 8.3) comprising all data available in CHAT format was prepared. The database contains all corpora except for the Polish Child Speech Corpus, which is in TEI format. The database reflects the structure of the CHAT format. Additionally, each utterance has codes assigned for its author, its listeners, and their age and gender. The database structure makes it possible to retrieve the utterances and words from the whole corpus that fulfill specified criteria, such as the author’s age and/or gender, the listeners’ age and/or gender, or the recording date of the utterance. Several criteria can also be combined. It is possible, for example, to get the frequency list of speech of people older than 10 years directed to two-year-old girls, or the frequency list of speech of adult men directed to boys aged between 5 and 7.

A first raw frequency list for all CDS data available in CHAT format was obtained from the database. We selected the utterances that fulfill the specified criteria (author older than 8;0, at least one of the listeners younger than 7;0), lower-cased them, and generated the raw frequency list from them. We merged this list with the raw frequency list of CDS in the Polish Child Speech Corpus, which was generated outside the database by the corpus author, Joanna Szwabe, and was already lower-cased. The merged raw frequency list contained all word forms, not yet classified according to the Polish inflection system. The merged list was then lemmatized by a lemmatizing program prepared by Jarosław Strojek on the basis of the Polish Language Corpus (PWN, 2009).
Additionally, because of the relatively high number of homonyms in Polish, frequencies of homonymic forms were divided according to a homonym list which contains information on the proportion of the distribution of each homonymic form in the Polish Language Corpus (PWN, 2009). Only homonymy between different lemmas has been taken into account. The string goli, for example, occurs 19 times altogether and is counted 14 times as a form of the lemma goły and 5 times as a form of the lemma golić (się) (shave). Homonymy of two word forms within one lemma, as in gołym (the instrumental singular or the dative plural of goły), does not affect our statistics.
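
The selection criteria for the raw frequency list (author older than 8;0, at least one listener younger than 7;0, lower-casing) can be sketched in Python as follows. This is only an illustration under stated assumptions: the project ran its queries in PostgreSQL, whereas the Utterance record, the whitespace tokenization, and the sample utterances below are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


def age_in_months(age: str) -> int:
    """Convert a 'years;months' age such as '2;6' to months.

    Only the plain years;months notation is handled in this sketch.
    """
    years, _, months = age.partition(";")
    return int(years) * 12 + (int(months) if months else 0)


@dataclass
class Utterance:  # hypothetical record, not the project's database schema
    text: str
    speaker_age: str
    listener_ages: List[str]


def is_cds(utt: Utterance, min_speaker: str = "8;0", max_listener: str = "7;0") -> bool:
    """True if the speaker is older than 8;0 and some listener is younger than 7;0."""
    speaker_ok = age_in_months(utt.speaker_age) > age_in_months(min_speaker)
    listener_ok = any(
        age_in_months(a) < age_in_months(max_listener) for a in utt.listener_ages
    )
    return speaker_ok and listener_ok


def raw_frequency_list(utterances: List[Utterance]) -> Counter:
    """Count lower-cased word forms in the utterances that qualify as CDS."""
    counts = Counter()
    for utt in utterances:
        if is_cds(utt):
            # Whitespace tokenization is a simplifying assumption.
            counts.update(utt.text.lower().split())
    return counts


# Invented sample utterances, for illustration only.
sample = [
    Utterance("Gdzie jest miś", "32;0", ["2;1"]),
    Utterance("Miś jest tutaj", "60;0", ["2;1", "32;0"]),
    Utterance("Daj misia", "5;0", ["2;1"]),     # speaker too young
    Utterance("Jest późno", "32;0", ["60;0"]),  # no listener under 7;0
]
freq = raw_frequency_list(sample)
print(freq.most_common())
```

Only the first two sample utterances pass the filter, so the forms from the last two do not enter the counts.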

All corpora used include more than 1,179,000 word tokens with more than 794,000 word tokens in CDS (speech directed to children aged between 0;10 and 6;11), about 44,000 word types, and 21,000 different lexemes. The 46 most frequent lexemes cover more than half of the CDS items in the corpora and 90% of the CDS items are covered by the first 1,811 lexemes.
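
Coverage figures of this kind can be computed from a lemmatized frequency list by sorting the lexemes by frequency and accumulating their counts. The Python sketch below shows one way to do this; the function name and the toy frequencies are assumptions for illustration and are not taken from the corpus.

```python
def lexemes_needed(frequencies, share):
    """Return how many of the most frequent lexemes cover `share` of all tokens."""
    total = sum(frequencies)
    running = 0
    for rank, freq in enumerate(sorted(frequencies, reverse=True), start=1):
        running += freq
        if running >= share * total:
            return rank
    return len(frequencies)


# Toy frequencies (invented), summing to 100 tokens.
toy = [40, 25, 15, 12, 5, 3]
half = lexemes_needed(toy, 0.5)    # 40 + 25 = 65 tokens reach 50%
ninety = lexemes_needed(toy, 0.9)  # 40 + 25 + 15 + 12 = 92 tokens reach 90%
print(half, ninety)  # 2 4
```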

The lemmatized list of Polish CDS is available in xls format. Columns show:

There are some fractions in the column showing occurrences. They result from dividing the occurrences of homonyms according to the proportion of the distribution of each homonymic form in the Polish Language Corpus (PWN, 2009) and rounding the result. For example, for the homonymic form byli the probability of being an inflected form of the word być (to be) is 0.966 and the probability of being an inflected form of the word były (previous) is 0.034. The form byli occurs 68 times in our corpora. According to the probabilities above, the occurrence of byli as a form of być is 65.69 (68 x 0.966) and as a form of były is 2.31 (68 x 0.034).
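
The division of a homonymic form's count can be reproduced with a few lines of Python. The count (68) and the proportions (0.966 and 0.034) are those given for byli; the function name and the rounding to two decimal places are assumptions made for this sketch.

```python
def split_homonym(count, proportions):
    """Divide a form's count among its lemmas according to the given proportions."""
    return {lemma: round(count * p, 2) for lemma, p in proportions.items()}


byli = split_homonym(68, {"być": 0.966, "były": 0.034})
print(byli)  # {'być': 65.69, 'były': 2.31}
```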


Additional contributors to this project include Bartłomiej Etenkowski, Magdalena Łuniewska, Joanna Szwabe, Ewa Dąbrowska, Marta Szreder, and Marek Łaziński.

Inquiries should be directed to:
Ewa Haman
Bartłomiej Etenkowski
Magdalena Łuniewska