Derived Corpora and Counts
Researchers have constructed several derived corpora and frequency counts based on segments of the CHILDES database.
- BabySRL Corpus: Cynthia Fisher, Dan Roth, and Christos Christodoulopoulos contributed
this version of the Brown corpus, parsed and labelled for semantic roles.
- Brent_Ratner Corpus: Michael Brent at Washington University contributed this corpus
derived from the child-directed speech (CDS) of the CHILDES Bernstein-Ratner corpus. It is designed to train an automatic segmenter.
The current version of this derived corpus was contributed by Sharon Goldwater.
- Determiners: Counts of the emergence of the determiner category across several
CHILDES corpora as analyzed in a forthcoming Psychological Science paper from Meylan, Frank, Roy, and Levy.
- French phonologized IDS: Maria Julia Carbajal, Camillia Bouchon, Emmanuel Dupoux, and
Sharon Peperkamp contributed this corpus of phonologized French infant-directed speech,
based on eight French CHILDES corpora and taking into account several French phonological rules.
It was created using a combination of a phonological dictionary (Lexique 3.80) and a script designed to apply these rules.
- Hungarian-Italian IDS: Judit Gervain contributed this phonological transcription of
the infant-directed speech in the Hungarian and Italian segments of CHILDES.
- Johnson Sesotho Corpus: Mark Johnson contributed this corpus of child-directed speech (CDS)
from the CHILDES Sesotho corpus. The goal of the corpus was to train an automatic segmenter.
The available materials include the Python script that can be run on the Sesotho corpus, along with its
output in the form of CDS sentences.
- Pearl_Sprouse Corpus: Lisa Pearl and Jon Sprouse contributed this corpus of
Penn Treebank-style parses for selected corpora from the American English segment of the CHILDES database.
- Polish IDS: Luc Borota contributed this corpus of phonological transcriptions of
the infant-directed speech in the Polish segments of CHILDES.
- UCI_Brent_Syl Corpus: Lisa Pearl and Lawrence Phillips at UC Irvine contributed
this corpus derived from the CDS of the CHILDES Brent corpus. The goal was to train an automatic segmenter.
The corpus comes with the scripts and dictionary used to produce it.
- Ping Li of Penn State contributed frequency counts of child-directed speech,
available for viewing directly or zipped for downloading,
along with the documentation.
- Portuguese Word Frequency: Ângela Maria Vieira Pinheiro contributed this
count of word frequency in the writings of Brazilian school children.