Derived Corpora

CHILDES

Derived Corpora and Counts

Researchers have constructed several derived corpora and frequency counts based on segments of the CHILDES database.

Derived Corpora

BabySRL Corpus: Cynthia Fisher, Dan Roth, and Christos Christodoulopoulos contributed this version of the Brown corpus that has been parsed and labelled for semantic roles.
Brent_Ratner Corpus: Michael Brent at Washington University contributed this corpus derived from the child-directed speech (CDS) of the CHILDES Bernstein Ratner corpus. It is designed to train an automatic segmenter. The current version of this derived corpus was contributed by Sharon Goldwater.
Gaskins Metaphor Corpus: This corpus lists and codes the metaphors found in the Lara, Thomas, and MPI-EVA-Manchester English corpora and the Szuman and Weist-Jarosz Polish corpora.
Determiners: Counts of the emergence of the determiner category across several CHILDES corpora as analyzed in a forthcoming Psychological Science paper from Meylan, Frank, Roy, and Levy.
Johnson Sesotho Corpus: Mark Johnson contributed this corpus of child-directed speech (CDS) from the CHILDES Sesotho corpus. The goal of the corpus was to train an automatic segmenter. The available materials include the Python script that can be run on the Sesotho corpus, along with its output in the form of CDS sentences.
Pearl_Sprouse Corpus: Lisa Pearl and Jon Sprouse contributed this corpus of Penn Treebank-style parses for selected corpora from the American English segment of the CHILDES database.
UCI_Brent_Syl Corpus: Lisa Pearl and Lawrence Phillips at UC Irvine contributed this corpus derived from the CDS of the CHILDES Brent corpus. The goal was to train an automatic segmenter.
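
Several of the corpora listed above are built from the child-directed speech of a CHILDES transcript. The step of pulling CDS out of a transcript can be sketched roughly as follows. This is only an illustration, not the script distributed with any of these corpora: the function name `extract_cds` is hypothetical, the input is assumed to be CHAT-format text in which main-tier lines start with `*` and a speaker code, the target child is assumed to be coded `CHI`, and every utterance from another speaker is treated as CDS, which is a simplification.

```python
import re

# Main-tier lines in CHAT look like "*MOT:<tab>you want the ball ?"
MAIN_TIER = re.compile(r"^\*([A-Z0-9]+):\s*(.*)$")


def extract_cds(chat_text, child_code="CHI"):
    """Return the utterances of every speaker other than child_code.

    Hypothetical helper: it assumes that all non-child speech is
    directed at the child, which real transcripts do not guarantee.
    """
    utterances = []
    speaker = None  # speaker of the main tier currently being read
    for line in chat_text.splitlines():
        match = MAIN_TIER.match(line)
        if match:
            speaker = match.group(1)
            if speaker != child_code:
                utterances.append(match.group(2).strip())
        elif line.startswith("\t") and speaker not in (None, child_code):
            # A tab-initial line continues the current main tier.
            utterances[-1] += " " + line.strip()
        else:
            # Headers (@), dependent tiers (%), and child continuations
            # end the utterance being collected.
            speaker = None
    return utterances


sample = (
    "@Begin\n"
    "*MOT:\tyou want the ball ?\n"
    "%mor:\tpro|you v|want det|the n|ball ?\n"
    "*CHI:\tball .\n"
    "*FAT:\tlook at\n"
    "\tthe doggy .\n"
    "@End\n"
)
print(extract_cds(sample))
# prints ['you want the ball ?', 'look at the doggy .']
```

A segmentation study would typically process these utterances further, for example by removing word boundaries to create unsegmented input; that step is left out here because it differs from corpus to corpus.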