Demuth Corpus

Katherine Demuth
Macquarie University


Participants: 4
Type of Study: naturalistic
Location: Lesotho
Media type: audio
DOI: doi:10.21415/T57P4N

Citation information

Demuth, K. 1992. Acquisition of Sesotho. In D. Slobin (ed.), The Cross-Linguistic Study of Language Acquisition, vol 3, 557-638. Hillsdale, N.J.: Lawrence Erlbaum Associates.

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.

Project Description

The Demuth Sesotho Corpus was compiled by Katherine Demuth in the southern African country of Lesotho from 1980-82. Data were collected in a small Lesotho mountain village of 550 people in the district of Mokhotlong, where it was possible to establish close rapport with both the children and their families. The corpus contains a longitudinal study of four target children’s language development as they interacted with members of the extended family, including mothers and/or grandmothers, an uncle and occasionally the father (in one family), and especially older siblings, cousins, and peers. These target children are: Hlobohang (boy) 2;1-3;0, Litlhare (girl) 2;1-3;2, ‘Neuoe (girl) 2;4-2;9, and Tsebo (girl) 3;8-4;1. The two older girls were cousins living in the same household, and were therefore recorded together. Monthly recordings of spontaneous speech consisted of 3-4 hours each over approximately one year, resulting in a corpus of 98 hours of speech with approximately 13,250 utterances containing lexical verbs, or approximately 1/2 million morphemes.

Transcription and Organization of the Corpus

Broad phonemic transcription was conducted by Katherine Demuth with the assistance of the mothers and grandmothers as soon as recording sessions were complete. These transcripts were then verified independently by a researcher at the National University of Lesotho. The original transcription was done by hand. The data were subsequently computerized, and one third of the corpus was hand-tagged by Sesotho speakers at Brown University in the 1990s. A computational morphological parser was then developed (with the assistance of Mark Johnson, Brown University) to tag the remaining part of the corpus. The files were then converted to CHILDES format, and the audiotapes were digitized. The digitized audio still remains to be linked to the transcripts.

The data contain a transcription of the utterance, using a modified form of disjunctive Lesotho orthography in which orthographic ‘e’, ‘o’, and ‘l’ are rendered as the phonetically more transparent ‘w’, ‘y’ and ‘d’ respectively. This is followed by a morphologized ‘adult’ (i.e. grammatical) equivalent, where hyphens correspond to morphemes within a prosodic word. Since Sesotho, like most Bantu languages, is a null subject language with a rich grammatical agreement and morphological system, an entire utterance may consist of one morphologically complex ‘word’. This is in turn followed by morphological tags (see below), an English ‘running’ gloss, and situational comments.

The corpus should therefore be accessible to and of broad interest for researchers of child language and Bantu linguistic structures, as well as computational linguists interested in morphological parsing and/or machine translation. Collection of this corpus was supported in part by Fulbright and SSRC (Social Science Research Council) dissertation funding. Computerization and tagging of the corpus have been supported in part by NSF grants BNS-08709938 and SBR-9727897. Thanks to all who have assisted with this research over the years, and to the children and families who provided the data.


There are three groups of files, organized by target child: H, L and T & N. The name of each file indicates the target child of that file, the session from which it was taken, and the audiotape on which it was recorded. There are a total of 131 transcript files, each corresponding to 30 minutes of original tape. For example, file ‘L7C’ is from the set of files with L as the target child (“L files”), from session number 7, and the letter C indicates that the file is transcribed from the third tape. Since there are two target children in the T files, the older child T has been given the designation CHI, and the younger target child is identified by her three-character name abbreviation NAI. Thus, researchers wanting to extract all information on these 4 target children will want to include NAI along with CHI in their search process.
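The naming scheme can be unpacked mechanically. The following is a minimal Python sketch, assuming every file name consists of one group letter, a session number, and one tape letter, as in ‘L7C’, with no file extension. The function name `parse_file_name` and the field names are illustrative and not part of any corpus tool.

```python
import re
from typing import NamedTuple


class TranscriptName(NamedTuple):
    child: str        # file group letter, e.g. "L"
    session: int      # session number, e.g. 7
    tape: str         # tape letter, e.g. "C"
    tape_index: int   # position of the tape letter in the alphabet (C -> 3)


# Assumed pattern: one letter, one or more digits, one letter.
_NAME = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_file_name(name: str) -> TranscriptName:
    """Split a name such as 'L7C' into child group, session, and tape."""
    match = _NAME.match(name.strip().upper())
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    child, session, tape = match.groups()
    return TranscriptName(child, int(session), tape, ord(tape) - ord("A") + 1)


print(parse_file_name("L7C"))
```

Names that do not fit the assumed pattern raise a `ValueError` rather than being guessed at.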

Since these data were collected in spontaneous home and neighborhood situations, many of the files include many speakers, especially once the target children are over the age of 2;6. These include younger peers (2;6 to 4 years old), older siblings (5 to 7 years old), and adults (teen-aged cousins, parents, grandparents, visitors). Thus, researchers wanting to examine typical Sesotho adult speech can extract data from these speakers, all of whom are identified at the top of each file. About 40% of the corpus consists of utterances from the 4 target children, about 40% are adult utterances, and about 20% are from peers or older siblings.
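Selecting the target children's utterances amounts to filtering main-tier lines by speaker code. The sketch below assumes the standard CHAT layout, in which a main-tier line starts with `*`, a speaker code, a colon, and a tab. The sample lines and the `MOT` code are hypothetical and are not copied from the corpus files; continuation lines are not handled.

```python
from typing import Iterable, Iterator, Set, Tuple

# CHI covers the target child of each file; NAI is the younger target child
# in the T files, so both codes are needed to cover all four children.
TARGET_SPEAKERS: Set[str] = {"CHI", "NAI"}


def main_tier_utterances(
    lines: Iterable[str], speakers: Set[str] = TARGET_SPEAKERS
) -> Iterator[Tuple[str, str]]:
    """Yield (speaker, utterance) for main-tier lines of the chosen speakers."""
    for line in lines:
        if not line.startswith("*"):
            continue  # skip headers (@...) and dependent tiers (%...)
        code, sep, text = line[1:].partition(":")
        if sep and code in speakers:
            yield code, text.strip()


# Hypothetical lines for illustration only.
sample = [
    "@Begin",
    "*CHI:\tke-a-e-rek-a .",
    "*MOT:\tu-a-re-sheb-a .",
    "*NAI:\tke-j-a se-wete .",
    "@End",
]
print(list(main_tier_utterances(sample)))
```

Passing a different set of codes as `speakers` would select adult or peer speakers instead, using the codes listed at the top of each file.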

Morphological Tagging and Codes

• Morphemes that constitute ‘words’ are connected with a hyphen in the adult form:
      ke-a-e-rek-a      sm1s-t^p-om9-v^buy-m^in         I’m buying it.
• Fusion of two morphemes is indicated with a slash ‘/’ in the morphological tags:
      ke-u-rek-ets-e    sm1s-om2s-v^buy-ap/t^pf-m^in    I bought for you.
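Because hyphens separate both the morphemes and their tags, a word can be paired with its tags slot by slot. The sketch below is one possible reading of the conventions, not the parser used to build the corpus. It assumes one tag slot per morpheme, treats ‘/’ as fusion, and treats a leading ‘t^p_’ as the non-overt present tense linked to the following tag (a convention described under the Verbal System). The names `split_slot` and `align` are hypothetical.

```python
from typing import List, Tuple


def split_slot(slot: str) -> List[str]:
    """Split one tag slot into the individual tags it carries."""
    tags: List[str] = []
    # Only a leading "t^p_" is split on the underscore, because glosses such as
    # id^fall_over_and_over also contain underscores.
    if slot.startswith("t^p_"):
        tags.append("t^p")
        slot = slot[len("t^p_"):]
    tags.extend(slot.split("/"))  # '/' marks fused morphemes
    return tags


def align(word: str, tags: str) -> List[Tuple[str, List[str]]]:
    """Pair each hyphen-separated morpheme with the tags in its slot."""
    morphemes = word.split("-")
    slots = tags.split("-")
    if len(morphemes) != len(slots):
        raise ValueError(f"{len(morphemes)} morphemes but {len(slots)} tag slots")
    return [(m, split_slot(s)) for m, s in zip(morphemes, slots)]


for pair in align("ke-u-rek-ets-e", "sm1s-om2s-v^buy-ap/t^pf-m^in"):
    print(pair)
```

A mismatch between the number of morphemes and tag slots raises an error, which can help flag lines that need manual checking.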

Nominal System

• Noun-class prefixes are separated from the base noun by a hyphen. The class to which the noun belongs is indicated in parentheses:
    n^        noun class marker        se-kolo    n^7-school(7,8)
• Names for people, places, games and songs are indicated as follows:
    n^name    person’s name
    n^place   name of a place
    n^game    name of a game
    n^song    name of a song
• Other nominal concords are indicated by the number of the corresponding noun class:
    ps        possessive
                  ya-ka            ps9-1s
                  tsa lona         ps10 pn2p
    ps/home   hahabo, hahao, hae, heso, etc.
                  beso             ps2/home
    pn        independent pronoun
                  wena             pn2s
                  tsona            pn10
    d         demonstrative pronoun
                  tse, tsena       d10
                  eo, ena, etc.    d9
    aj        adjective
                  tse-ngata        10-aj
                  le-le-phutswa    5-5-aj
                  e-mo-holo        1-1-aj
                  e-nyenyane       9-aj
                  e-ngwe           9-aj
                  se-se-ng         7-7-aj
                  tse-di-ng        10-10-aj
                  ba-ba-ng         2-2-aj
    nm        numeral
                  ele-ngwe         9-nm
                  tse-pedi         10-nm
                  di-le tharo      10-cp nm
    lc        locative marker
                  -tafole-ng       table(9,10)-lc    On the table.
    dim       diminutive           -na, -ne, -ana
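A prefixed noun tag such as n^7-school(7,8) can be broken into the prefix class, the gloss, and the classes listed in parentheses. This is a minimal sketch under the assumption that such tags always have the shape n^CLASS-gloss(CLASSES). The name `parse_noun_tag` is illustrative, and class labels are kept as strings in case non-numeric labels occur.

```python
import re
from typing import List, NamedTuple


class NounTag(NamedTuple):
    prefix_class: str    # class of the noun-class prefix, e.g. "7"
    gloss: str           # English gloss of the base noun, e.g. "school"
    classes: List[str]   # classes listed in parentheses, e.g. ["7", "8"]


# Assumed shape: n^<class>-<gloss>(<class>,<class>)
_NOUN = re.compile(r"^n\^(\w+)-([^()]+)\(([^()]*)\)$")


def parse_noun_tag(tag: str) -> NounTag:
    """Parse a tag such as 'n^7-school(7,8)'."""
    match = _NOUN.match(tag)
    if match is None:
        raise ValueError(f"not a prefixed noun tag: {tag!r}")
    prefix_class, gloss, listed = match.groups()
    classes = [c.strip() for c in listed.split(",") if c.strip()]
    return NounTag(prefix_class, gloss, classes)


print(parse_noun_tag("n^7-school(7,8)"))
```

Tags without a class prefix, such as n^name, do not match the assumed shape and raise a `ValueError`.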

Verbal System

• Copulas are not coded as verbs and may or may not reflect the noun class of their argument:
    cp    copula
              ke se-sepa          cp n^7-soap(7,8)      It’s soap.
              nna ke e-mo-holo    pn1s cp1s 1-1-aj      Me, I’m big.
              e ka moo            cp9 pr loc            It’s over there.
              u le-shano          cp2s n^5-liar(5,6)    You lie.
              neg: ha se nna      ng cp pn1s            It’s not me.
• Ideophones are denoted by ‘id^’ and their semantic meaning. They may be preceded by the verb ho re ‘to say’ or they may stand on their own. On occasion an ideophone may occur with tense and a subject marker, similar to a verb:
              o-tla-r-e polopoqo    sm1-t^f1-v^say-m^in id^fall_over_and_over    He will fall.
              u-tla-polopoqo        sm2s-t^f1-id^fall_over_and_over              You will fall.
              yare chobe            cj id^enter                                  It enters.
• Verbs are denoted by ‘v^’ and their semantic meaning. When there is no overt (phonologically realized) tense marker, the present tense marker is linked to the verb (or object marker, see below) using an underscore (‘_’):
              ke-a-u-shap-a     sm1s-t^p-om2s-v^thrash-m^in            I’m thrashing you.
              ke-j-a se-wete    sm1s-t^p_v^eat-m^in n^7-carrot(7,8)    I’m eating a carrot.
• Subject markers prefix to the verb. First and second person subject markers are marked singular or plural; other subject markers are marked according to the noun class to which they belong:
    1st pers. sing.    ke-    ke-rek-il-e     sm1s-v^buy-t^pf-m^in     I bought.
    1st pers. pl.      re-    re-y-a          sm1p-t^p_v^go-m^in       We go.
    2nd pers. sing.    u-     u-a-ultw-a      sm2s-t^p-v^hear-m^in     You hear.
    2nd pers. pl.      le-    le-pheh-il-e    sm2p-v^cook-t^pf-m^in    You cooked.
    Class 1            o-     o-a-j-a         sm1-t^p-v^eat-m^in       He eats.
    Class 2            ba-    ba-tl-a         sm2-t^p_v^come-m^in      They come.
    Class 9            e-     e-w-el-e        sm9-v^fall-t^pf-m^in     It fell.
• Object markers/reflexives prefix to the verb stem. If an object marker is present and there is no overt tense marker, the present tense is attached to the object marker so as not to come between the object marker and the verb. Sesotho permits only one object marker:
    1st pers. sing.    -n-     u-n-hat-a        sm2s-t^p_om1s-v^step-m^in    You step on me.
    1st pers. pl.      -re-    u-a-re-sheb-a    sm2s-t^p-om1p-v^look-m^in    You look at us.
    2nd pers. sing.    -u-     ke-u-rat-a       sm1s-t^p_om2s-v^like-m^in    I like you.
    2nd pers. pl.      -le-    ke-le-pheh-a     sm2p-v^cook-t^pf-m^in        You cooked.
    Class 1            -mo-    ba-mo-otl-a      sm2-t^p_om1-v^beat-m^in      They beat him.
    Class 3            -o-     ke-a-o-ten-a     sm1s-t^p-om3-v^wear-m^in     I wear it.
    Class 9            -e-     ke-e-tseb-       sm1s-t^p_om9-v^know-m^in     I know it.
    Reflexive          -i-     ke-a-i-pat-a     sm1s-t^p-rf-v^hide-m^in      I hide myself.
• Verbal extensions are infixed after the verb stem and before the final vowel:
    ap    applicative       -el-, -ets- (ap/t^prf)
    rc    reciprocal        -an-
    c     causative         -is-, -es- (chesa), -y- (kenya)
    p     passive           -w-
    nt    neuter/stative    -ahal-    bon-ahal-a    v^see-nt-m^in     visible
                            -eh-      lahl-eh-a     v^lose-nt-m^in    get lost
    xt    extensive         -aka-
    rv    reversive         -oll-     fas-oll-a     v^tie-rv-m^in     untie
    cl    completive        -ell-
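The difference between person markers (sm1s, om2s) and noun-class markers (sm1, om9) lies in the trailing s or p. The sketch below decodes a single marker tag on that basis. It assumes the tag has already been isolated, so a slot such as t^p_om1s must be split first, and it does not cover the reflexive tag rf. The name `decode_marker` is hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Marker(NamedTuple):
    role: str               # "subject" or "object"
    kind: str               # "person" or "class"
    value: int              # person (1 or 2) or noun class number
    number: Optional[str]   # "singular" or "plural" for persons, None for classes


_MARKER = re.compile(r"^(sm|om)(\d+)([sp]?)$")


def decode_marker(tag: str) -> Marker:
    """Decode a subject or object marker tag such as 'sm1s' or 'om9'."""
    match = _MARKER.match(tag)
    if match is None:
        raise ValueError(f"not a subject/object marker tag: {tag!r}")
    prefix, digits, suffix = match.groups()
    role = "subject" if prefix == "sm" else "object"
    if suffix:  # a trailing s/p means first or second person
        number = "singular" if suffix == "s" else "plural"
        return Marker(role, "person", int(digits), number)
    return Marker(role, "class", int(digits), None)


print(decode_marker("sm1s"))  # first person singular subject
print(decode_marker("sm1"))   # class 1 subject
```

Keeping `kind` explicit avoids confusing sm1s (first person singular) with sm1 (class 1), which differ only in the suffix.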

Tense (t^)

• Every verb must have a tense marker except for the imperative or infinitive. ‘_’ is used where t^ is not an isolated morpheme.
• Simple tenses do not require a second subject marker. Most tense markers are infixed after the subject marker and before the verb stem (before an object marker if one is present), except for the perfective (t^pf), which is infixed after the verb stem and before the final vowel:
    p        present               -a-, -ø-
    pf       perfective            -il-, -ets-
    f1       future                tla
                 neg: ha ke-na-ho-rek-a    ng sm1s-t^f1-if-v^buy-m^in
    f2       future                -tlilo-, -tlo-
    f3       future                -ya-
    f4       future                -ilo-
    f5       future                -yo-
    if       infinitive            ho-
    sa       persistive            -sa-
                 ke-sa-y-a                 sm1s-t^sa-v^go-m^in
                 ke-sa-ilo-rek-a           sm1s-t^sa-t^f4-v^buy-m^in
    np       narrative past        -a
                 ka-bon-a                  sm1s/t^np-v^see-m^in
    po       potential             -ka-
                 n-ka-rek-a                sm1s-t^po-v^buy-m^in
    po/ng    negative potential    -kebe-, -keke-
    be                             -be-
    ba                             -ba-
    ts       recent past           -tswa-
                 ke-tswa-rek-a             sm1s-t^ts-v^buy-m^in
    ke                             -ke-
    tloha                          -tloha-
    tena                           -tena-, -tene
• Compound Tense Forms require a second subject marker after the tense marker. With compound tense forms all morphemes are written conjunctively. In the affirmative, the mood of the verb is participial (m^pt); in the negative it is x (m^x).
    ne       past continuous       -ne-
                 ke-ne-ke-rek-a            sm1s-t^ne-sm1s-v^buy-m^pt
                 ke-ne-ke-rek-il-e         sm1s-t^ne-sm1s-v^buy-t^pf-m^pt
                 neg: u-ne-u-sa-thus-a     sm2s-t^ne-sm2s-ng-v^help-m^x
    pp       past progressive      -ntse-
                 o-ntse-a-y-a              sm1-t^pp-sm1-v^go-m^pt
    ps       past                  -ile-
                 o-ile-a-bon-a             sm1-t^ps-sm1-v^see-m^pt
    ny       not yet               -eso-, -so-
                 ha-ke-eso-j-e             ng-sm1s-t^ny-v^eat-m^x
    se       exclusive             -se-
                 re-se-re-qet-il-e         sm1p-t^se-sm1p-v^finish-t^pf-m^pt
                 o-ne-o-se-o-ents-e        sm2s-t^ne-sm2s-t^se-v^do/t^pf-m^pt
                 neg: ke-se-ke-sa-bin-e    sm1s-t^se-sm1s-ng-v^sing-t^pf/m^x
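The rule that compound tense forms carry a second subject marker after the tense marker suggests a simple check on the tag string. The sketch below is a heuristic built only on that rule and is not how the corpus was tagged. Some listed compound forms, such as the ‘not yet’ example, show a single subject marker and would not be detected. The function names are hypothetical.

```python
import re
from typing import List


def tense_codes(tags: str) -> List[str]:
    """Return all tense codes in a tag string, e.g. ['ne', 'pf']."""
    return re.findall(r"t\^([a-z0-9]+)", tags)


def has_second_subject_marker(tags: str) -> bool:
    """True if a subject marker follows a tense marker that follows a subject marker."""
    state = 0  # 0: awaiting first sm, 1: awaiting tense, 2: awaiting second sm
    for piece in re.split(r"[-/]", tags):
        is_sm = re.fullmatch(r"sm\d+[sp]?", piece) is not None
        if state == 0 and is_sm:
            state = 1
        elif state == 1 and piece.startswith("t^"):
            state = 2
        elif state == 2 and is_sm:
            return True
    return False


example = "sm1s-t^ne-sm1s-v^buy-t^pf-m^pt"
print(tense_codes(example), has_second_subject_marker(example))
```

Subject markers are matched as whole tags, so a gloss that merely begins with the letters "sm" is not mistaken for one.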

Relative Clauses

• Subject and object relative markers are coded according to the noun class of the nouns to which they refer. Relative clauses are always in the participial mood (m^pt):
    sr       subject rel. marker
                 ... mo-tho ya il-e-ng ...      n^1-person(1,2) sr1-v^go/t^pf-m^pt-rl
    or       object rel. marker
                 ntho eo ke-e-bon-e-ng          thing(9,10) or9 sm1s-om9-v^see-t^pf/m^pt-rl
    lr       locative relative
                 ... moo ke-ba-bon-e-ng teng    ... lr sm1s-om2-v^see-t^pf/m^pt-rl
    obr      oblique relative (object of by-phrase or time)
                 mo-hla oo re-y-a-ng ...        n^3-day(3,4) obr3 sm1p-t^p_v^go-m^pt-rl ...
    -rl      relative suffix
                 moo ke-ba-bon-e-ng teng        lr sm1s-om2-v^see-t^pf/m^pt-rl loc
    sr/cp    relative copula
                 mo-nna ya le-venkele-ng        n^1-man(1,2) sr1/cp n^5-shop(5,6)-lc

Mood (m^)

• The mood of infinitives is m^in if affirmative indicative, m^x if neg.
    in    indicative            assertive, declarative, affirmative statements
    pt    participial           in compound tenses, subordinate clauses, relative clauses
    x     negative              in negative utterances
              ha ke-tseb-e    ng sm1s-t^p_v^know-m^x    I don’t know.
    i     imperative
              m-ph-e          om1s-v^give-m^i           Give me.
    ip    imperative plural
              bon-ang         v^see-m^ip                Look!
    s     subjunctive
              ere ke-bon-e    ht sm1s-t^p_v^see-m^s     Let me see.
    sp    plural subjunctive
              ha re-y-eng     ht sm1p-t^p_v^go-m^sp     Let’s go.


    cm     complementizer        hore
    cj     conjunction           le, kanthe, ebile, kapa, hobane, etc.
    cd     conditional           ha, haeba
    pr     preposition           ka, ha, ho, etc.
    ht     hortative particle    ha, ere
    ng     negatives             ha, sa, ska, skana, skaba, etc.
    loc    locatives             mane, moo, nqa, fatse, kwana, mona, pela, hodima, etc.
    av     adverb                pele, maobane, etc.
    ij     interjection          ee, ako, hei, oo, etc.
    wh     question              eng, kae, neng, mang, hobaneng, etc.
    -wh    question marker
               o-bin-a-ng ?      sm1-t^p_v^sing-m^in-wh ?
               di-ng             10-wh
               ho-kae            17-wh
    q      question markers      hakere, na, hana
    ?      questions             o-batl-a papa ?

Other issues

    ako
        ako bone => ake u-bon-e    ij sm2s-t^p_v^see-m^s    Please (you) look.
        ako bona => ako bon-a      ij v^see-m^i             Please look.
    ska (seka, skaba, skana, etc.)    ng    always with imperative verb
        ska n-qal-a                ng om1s-v^attack-m^i     Don’t pick on me.
        u-skaba-bolel-a            sm2s-ng-v^tell-m^i       Don’t tell.


    x2, x3, etc.    Identical consecutive renditions of the same utterance are indicated under the situational comments.