RusLan-M Corpus


Valeriia Lelik
Center for Language and Brain
HSE University

Anastasiya Lopukhina
Rastle lab
Royal Holloway, University of London

Mariia Diachkova
Center for Language and Brain
HSE University

Olga Dragoy
Institute of Linguistics
Russian Academy of Sciences

Svetlana V. Dorofeeva
Center for Language and Brain
HSE University

Irina A. Sekerina
Department of Psychology and Ph.D. Program in Linguistics
College of Staten Island and The Graduate Center, City University of New York

Browsable transcripts

Download transcripts

Link to media folder
Participants: 2
Type of Study: longitudinal
Location: Russian Federation
Media type: video, audio
DOI: doi:10.21415/1XMG-D508

Citation Information

Lelik, V., Diachkova, M., Dorofeeva, S. V., Lopukhina, A., Dragoy, O., & Sekerina, I. A. (2025). RusLan-M Corpus. Retrieved from https://childes.talkbank.org doi:10.21415/1XMG-D508

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Corpus Description

The RusLan-M corpus comprises longitudinal video recordings of spontaneous speech from two monolingual Russian-speaking children: a boy, Yasha, and a girl, Tosya. To protect the participants’ privacy, the multimedia sources were anonymized by blurring all faces in the videos, muting or removing personal details (e.g., addresses and last names) in the recordings and transcripts, and replacing some videos with audio recordings.

Tosya was recorded by her mother and caregiver from the age of 10 months, with the final recording made at 3;10. She lived in Russia and had an older sister. The total duration of Tosya’s recordings was approximately 29 h (21,421 child and 40,811 adult utterances), with approximately 36,275 and 126,944 tokens, respectively, where tokens are all occurrences of word forms, including repeated ones.

Yasha was recorded by his father from the age of 1 year and 5 months, with the final recording made at 3;0. He also lived in Russia and had an older brother. The total duration of Yasha’s recordings was approximately 12 h (13,965 child and 12,278 adult utterances), with approximately 27,592 and 41,141 tokens, respectively.
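As a rough orientation, the utterance and token counts reported above can be combined into a mean number of tokens per utterance for each speaker group. The sketch below is only an illustration: the dictionary layout and the function names are assumptions made here, and the figures are the approximate counts from this description, not values recomputed from the transcripts.

```python
# Approximate counts copied from the corpus description:
# (number of utterances, number of tokens) per child and speaker group.
COUNTS = {
    "Tosya": {"child": (21421, 36275), "adult": (40811, 126944)},
    "Yasha": {"child": (13965, 27592), "adult": (12278, 41141)},
}


def tokens_per_utterance(utterances: int, tokens: int) -> float:
    """Mean number of tokens per utterance."""
    return tokens / utterances


def summarize(counts):
    """Map (child, speaker group) to tokens per utterance, rounded to 2 places."""
    return {
        (child, speaker): round(tokens_per_utterance(u, t), 2)
        for child, speakers in counts.items()
        for speaker, (u, t) in speakers.items()
    }


if __name__ == "__main__":
    for (child, speaker), ratio in summarize(COUNTS).items():
        print(f"{child} ({speaker}): {ratio} tokens per utterance")
```

Because the token counts are approximate, the resulting ratios should be read as coarse summaries rather than as a substitute for a proper mean length of utterance computed from the transcripts.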

The transcripts of both child and child-directed speech were lemmatized and morphologically annotated using Mystem, an automatic tool for Russian morphological analysis (https://yandex.ru/dev/mystem). Trained research assistants then manually verified the annotations and resolved cases of homonymy. Annotated tables containing the nouns and verbs for both children are available on the OSF page (https://osf.io/6zdkc/).

For each participating child, one of the parents gave informed consent for the use of the data. The target children and other participants are referred to by their original first names. Surnames and addresses are replaced with the code xxx and noted on the %com tier.
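For readers who want to locate the replaced details, a transcript can be scanned for main-tier lines containing xxx together with the %com line that follows them. The sketch below assumes the usual CHAT layout (main tiers start with `*`, dependent tiers with `%`, and a tab separates the tier label from the text); the function name `find_xxx` and the sample lines are made up for illustration and are not taken from the corpus. Note that xxx is also the standard CHAT code for unintelligible speech, so a hit is not necessarily a replaced surname or address.

```python
def find_xxx(lines):
    """Return (main_line, comment) pairs for utterances that contain xxx.

    comment is the text of the first %com tier following the utterance,
    or None if the utterance has no %com tier.
    """
    results = []
    current = None  # most recent main-tier line containing xxx, awaiting a %com
    for line in lines:
        if line.startswith("*"):
            # A new utterance starts: record any pending hit without a comment.
            if current is not None:
                results.append((current, None))
            text = line.partition("\t")[2]
            current = line if "xxx" in text.split() else None
        elif line.startswith("%com:") and current is not None:
            results.append((current, line.partition("\t")[2]))
            current = None
    if current is not None:
        results.append((current, None))
    return results


# Hypothetical sample lines in CHAT-like layout (not from the corpus).
SAMPLE = [
    "*MOT:\tмы живём на улице xxx .",
    "%com:\taddress replaced",
    "*CHI:\tда .",
    "*FAT:\tэто дядя xxx ?",
]

if __name__ == "__main__":
    for main_line, comment in find_xxx(SAMPLE):
        print(repr(main_line), "->", comment)
```

Pairing each hit with its %com text makes it possible to separate replaced personal details from ordinary unintelligible stretches by reading the comment.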

Acknowledgements

This work is an output of a research project implemented as part of the Basic Research Program at HSE University. This research was supported in part through computational resources of HPC facilities at HSE University.

We are grateful to the parents and children who participated in the study. We thank Irina Korkina and Tatyana Masumi for their major contribution to the transcription of the recordings, and Anastasiya Sycheva for her involvement in the final preparation of the material and for checking the transcripts. We thank Pavel Pashentsev and Konstantin Lopukhin for their help with writing the code. We are grateful to the following students and researchers who participated in the transcription of recordings, the revision of transcripts, and the manual checking of the automatic morphological annotation (in alphabetical order): Anastasia Andreeva, Angelica Dzhioeva, Anna Elagina, Tatiana Eremicheva, Sofia Geren, Elizaveta Klykova, Maria Kozlova, Polina Kozlova, Anastasia Kudrina, Daria Morozova, Veronika Prigorkina, Nadezhda Psaryova, Ksenia Revak, Ivan Shirokov, Alexandra Trepalenko, Vladislava Staroverova, Julia Vorobyova, Nina Zdorova.

Usage Restrictions

There are no restrictions on the use of the transcripts.