CHILDES Sesotho Demuth Corpus
Participants: 3
Type of Study: naturalistic
Location: Lesotho
Media type: audio
DOI: doi:10.21415/T57P4N
Demuth, K. (1992). Acquisition of Sesotho. In D. Slobin (ed.), The Cross-Linguistic Study of Language Acquisition, 3, 557-638. Hillsdale, N.J.: Lawrence Erlbaum Associates.
In accordance with TalkBank rules, any use of data from this corpus must be accompanied by at least one of the above references.
The Demuth Sesotho Corpus was compiled by Katherine Demuth in the
southern African country of Lesotho from 1980-82. Data was collected in
a small Lesotho mountain village of 550 people in the district of
Mokhotlong, where it was possible to establish close rapport with both
the children and their families. The Corpus contains a longitudinal
study of four target children’s language development as they interacted
with members of the extended family including mothers and/or
grandmothers, an uncle and occasionally the father (in one family), and
especially older siblings, cousins, and peers. These target children
are: Hlobohang (boy) 2;1-3;0, Litlhare (girl) 2;1-3;2, ‘Neuoe (girl)
2;4-2;9, and Tsebo (girl) 3;8-4;1. The two older girls were cousins
living in the same household, and were therefore recorded together.
Monthly recordings of spontaneous speech consisted of 3-4 hours each
over approximately one year, resulting in a corpus of 98 hours of speech
with approximately 13,250 utterances containing lexical verbs, or
approximately half a million morphemes.
Transcription and Organization of the Corpus
Broad phonemic transcription was conducted by Katherine Demuth with the
assistance of the mothers and grandmothers as soon as recording sessions
were complete. These transcripts were then verified independently by a
researcher at the National University of Lesotho. The original
transcription was done by hand. The data were subsequently computerized, and
one third of the corpus was hand-tagged by Sesotho speakers at Brown
University in the 1990s. A computational morphological parser was then
developed (with the assistance of Mark Johnson, Brown University) to tag
the remaining part of the corpus. The files were then converted to CHILDES
format, and the audiotapes were digitized. The digitized audio still
remains to be linked to the transcripts.
The data contain a transcription of the utterance, using a modified form
of disjunctive Lesotho orthography in which orthographic ‘e’, ‘o’, and ‘l’
are rendered as the phonetically more transparent ‘w’, ‘y’ and ‘d’
respectively. This is followed by a morphologized ‘adult’ (i.e.
grammatical) equivalent, where hyphens correspond to morphemes within a
prosodic word. Since Sesotho, like most Bantu languages, is a null
subject language with rich grammatical agreement and a rich morphological
system, an entire utterance may consist of one
morphologically complex ‘word’. This is then also followed by
morphological tags (see below), an English ‘running’ gloss, and
The corpus should therefore be accessible to and of broad interest for
researchers of child language, Bantu linguistic structures, and
computational linguists interested in morphological parsing and/or
There are three groups of files, organized by target child: H, L and T
& N. The name of each file indicates the target child of that file, the
session from which it was taken, and the audiotape on which it was
recorded. There are a total of 131 transcript files, each corresponding
to 30 minutes of original tape. For example, file ‘L7C’ is from the
set of files with L as the target child (“L files”), from session number
7, and the letter C indicates that it is the third tape from which the
file is transcribed. Since there are two target children in the T
files, the older child T has been given the designation CHI, and the
younger target child is identified by her three-character name
abbreviation NAI. Thus, researchers wanting to extract all information
on these 4 target children will want to include NAI along with CHI in
their search process.
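The naming scheme described above can be decoded mechanically. The following is a minimal sketch, assuming every file name has the shape of the ‘L7C’ example (one child letter, a session number, one tape letter); the function name parse_file_name is hypothetical and not part of any CHILDES tool.

```python
import re
import string

# Assumed pattern, based only on the 'L7C' example: child letter,
# session number, tape letter.
FILE_NAME = re.compile(r"([A-Z])(\d+)([A-Z])")


def parse_file_name(name):
    """Split a transcript file name into (child, session, tape_index)."""
    match = FILE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    child, session, tape = match.groups()
    # Tape letters count from A = 1, so C is the third tape.
    tape_index = string.ascii_uppercase.index(tape) + 1
    return child, int(session), tape_index


print(parse_file_name("L7C"))  # ('L', 7, 3)
```

File names that do not follow the assumed shape raise a ValueError rather than being guessed at.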
Since these data were collected in spontaneous home and neighborhood
situations, many of the files include many speakers, especially once the
target children are over the age of 2;6. These include younger peer
children (2;6-4-year-olds), older sibling children (5-7-year-olds), and
adults (teen-aged cousins, parents, grandparents, visitors). Thus,
researchers wanting to examine typical Sesotho adult speech can extract
data from these speakers, all identified at the top of each file. About
40% of the corpus consists of utterances from the 4 target children, about 40%
are adult utterances, and about 20% are from peers or older siblings.
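Speaker proportions like these can be checked per file by counting main-tier lines for each speaker code. The sketch below assumes the usual CHAT layout, in which a main tier starts with an asterisk, a speaker code, and a colon. The sample lines are placeholders written for illustration, not corpus data, and the code MOT is a hypothetical non-target speaker.

```python
from collections import Counter

# Both codes are needed to cover all target children (see the T & N files).
TARGET_CODES = {"CHI", "NAI"}


def count_utterances(lines):
    """Count main-tier utterances per speaker code."""
    counts = Counter()
    for line in lines:
        # Assumption: main tiers look like "*CHI:<tab>utterance";
        # header lines start with "@" and dependent tiers with "%".
        if line.startswith("*") and ":" in line:
            counts[line[1:line.index(":")]] += 1
    return counts


# Placeholder lines for illustration only.
sample = [
    "@Begin",
    "*CHI:\tplaceholder utterance",
    "%mor:\tplaceholder tags",
    "*NAI:\tplaceholder utterance",
    "*MOT:\tplaceholder utterance",
    "*CHI:\tplaceholder utterance",
]

counts = count_utterances(sample)
target_total = sum(counts[code] for code in TARGET_CODES)
print(counts["CHI"], counts["NAI"], target_total)  # 2 1 3
```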
Morphological Tagging and Codes
• Morphemes that constitute ‘words’ are connected with a hyphen in
the adult: ke-a-e-rek-a sm1s-t^p-om9-v^buy-m^in I’m buying it.
• Fusion of two morphemes is indicated with a slash ‘/’ in the
morphological tags: ke-u-rek-ets-e sm1s-om2s-v^buy-ap/t^pf-m^in I
bought for you.
• Noun-class prefixes are separated from the base noun by a hyphen. The class to which the noun belongs is indicated in parentheses:
n^ noun class marker se-kolo n^7-school(7,8)
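A noun tag of this form can be taken apart with a regular expression. This is a minimal sketch that handles only tags shaped like the n^7-school(7,8) example; the function name parse_noun_tag is hypothetical.

```python
import re

# n^<prefix class>-<gloss>(<class>,<class>), as in n^7-school(7,8).
NOUN_TAG = re.compile(r"n\^(\d+)-([^()]+)\((\d+),(\d+)\)")


def parse_noun_tag(tag):
    """Return (prefix_class, gloss, classes) for a noun tag."""
    match = NOUN_TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a noun tag of the expected shape: {tag!r}")
    prefix_class, gloss, first, second = match.groups()
    return int(prefix_class), gloss, (int(first), int(second))


print(parse_noun_tag("n^7-school(7,8)"))  # (7, 'school', (7, 8))
```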
• Names for people, places, games and songs are indicated as follows:
n^name person’s name
n^place name of a place
n^game name of a game
n^song name of a song
• Other nominal concords are indicated by the number of the corresponding noun class:
ps possessive ya-ka ps9-1s
tsa lona ps10 pn2p
ps/home hahabo, hahao, hae, heso, etc.
pn independent pronoun wena pn2s
d demonstrative pronoun tse, tsena d10
eo, ena, etc. d9
aj adjective tse-ngata 10-aj
nm numeral ele-ngwe 9-nm
di-le tharo 10-cp nm
lc locative marker -tafole-ng table(9,10)-lc On the table.
dim diminutive -na, -ne, -ana
• Copulas are not coded as verbs and may or may not reflect the noun class of their argument:
cp copula ke se-sepa cp n^7-soap(7,8)
nna ke e-mo-holo pn1s cp1s 1-1-aj Me, I’m big.
e ka moo cp9 pr loc It’s over there.
u le-shano cp2s n^5-liar(5,6) You lie.
neg: ha se nna ng cp pn1s It’s not me.
• Ideophones are denoted by ‘id^’ and their semantic meaning. They may be preceded by the verb ho re ‘to say’ or they may stand on their own. On occasion an ideophone may occur with tense and a subject marker, similar to a verb:
o-tla-r-e polopoqo sm1-t^f1-v^say-m^in id^fall_over_and_over He will fall.
u-tla-polopoqo sm2s-t^f1-id^fall_over_and_over You will fall.
yare chobe cj id^enter It enters.
• Verbs are denoted by ‘v^’ and their semantic meaning. When there is no overt (phonologically realized) tense marker, the present tense marker is linked to the verb (or object marker, see below) using an underscore (‘_’):
ke-a-u-shap-a sm1s-t^p-om2s-v^thrash-m^in I’m thrashing you.
ke-j-a se-wete sm1s-t^p_v^eat-m^in n^7-carrot(7,8) I’m eating a carrot.
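The hyphen, slash, and underscore conventions described so far make it possible to line up each morpheme of the adult form with its tag or tags. The sketch below is one way to do this, not the parser used to build the corpus; the function names split_slot and align are hypothetical. Only the t^p_ link is treated as an underscore separator, because ideophone glosses such as id^fall_over_and_over contain underscores of their own.

```python
def split_slot(slot):
    """Split one hyphen-delimited tag slot into its individual tags."""
    parts = []
    # A covert present tense is linked to the next tag with "_".
    if slot.startswith("t^p_"):
        parts.append("t^p")
        slot = slot[len("t^p_"):]
    # Fused morphemes share one slot and are separated by "/".
    parts.extend(slot.split("/"))
    return parts


def align(word, tags):
    """Pair each morpheme of a word with the list of tags it carries."""
    morphemes = word.split("-")
    slots = tags.split("-")
    if len(morphemes) != len(slots):
        raise ValueError(f"{word!r} and {tags!r} do not line up")
    return [(m, split_slot(s)) for m, s in zip(morphemes, slots)]


for pair in align("ke-u-rek-ets-e", "sm1s-om2s-v^buy-ap/t^pf-m^in"):
    print(pair)
print(align("ke-j-a", "sm1s-t^p_v^eat-m^in"))
```

In the first example the morpheme ets carries both ap and t^pf; in the second, j carries the covert present tense together with v^eat.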
• Subject markers prefix to the verb. First and second person subject markers are marked singular or plural; other subject markers are marked according to the noun class to which they belong:
1st pers. sing. ke- ke-rek-il-e sm1s-v^buy-t^pf-m^in I bought.
1st pers. pl. re- re-y-a sm1p-t^p_v^go-m^in We go.
2nd pers. sing. u- u-a-utlw-a sm2s-t^p-v^hear-m^in You hear.
2nd pers. pl. le- le-pheh-il-e sm2p-v^cook-t^pf-m^in You cooked.
Class 1 o- o-a-j-a sm1-t^p-v^eat-m^in He eats.
Class 2 ba- ba-tl-a sm2-t^p_v^come-m^in They come.
Class 9 e- e-w-el-e sm9-v^fall-t^pf-m^in It fell.
• Object markers/reflexives prefix to the verb stem. If an object marker is present and there is no overt tense marker, the present tense is attached to the object marker so as not to come between the object marker and the verb. Sesotho permits only one object marker:
1st pers. sing. -n- u-n-hat-a sm2s-t^p_om1s-v^step-m^in You step on me.
1st pers. pl. -re- u-a-re-sheb-a sm2s-t^p-om1p-v^look-m^in You look at us.
2nd pers. sing. -u- ke-u-rat-a sm1s-t^p_om2s-v^like-m^in I like you.
2nd pers. pl. -le- ke-le-pheh-a sm1s-t^p_om2p-v^cook-m^in I cook you (pl.).
Class 1 -mo- ba-mo-otl-a sm2-t^p_om1-v^beat-m^in They beat him.
Class 3 -o- ke-a-o-ten-a sm1s-t^p-om3-v^wear-m^in I wear it.
Class 9 -e- ke-e-tseb-a sm1s-t^p_om9-v^know-m^in I know it.
Reflexive -i- ke-a-i-pat-a sm1s-t^p-rf-v^hide-m^in I hide myself.
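Subject and object marker tags follow a regular pattern: a trailing s or p marks first or second person, while a bare number is a noun class. A minimal sketch of a decoder follows, assuming only the tag shapes shown in the tables above; the function name decode_marker and the dictionary keys are hypothetical.

```python
import re

# sm/om + digits + optional s (singular) or p (plural).
MARKER = re.compile(r"(sm|om)(\d+)([sp]?)")


def decode_marker(tag):
    """Describe a subject marker, object marker, or reflexive tag."""
    if tag == "rf":
        return {"role": "reflexive"}
    match = MARKER.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a subject/object marker tag: {tag!r}")
    kind, digits, number = match.groups()
    role = "subject" if kind == "sm" else "object"
    if number:
        # sm1s, om2p, ...: person plus singular/plural.
        return {"role": role, "person": int(digits),
                "number": "singular" if number == "s" else "plural"}
    # sm1, om9, ...: noun class agreement.
    return {"role": role, "noun_class": int(digits)}


print(decode_marker("sm1s"))
print(decode_marker("sm1"))
print(decode_marker("om9"))
```

Note that sm1s (first person singular) and sm1 (noun class 1) differ only in the final letter, so the decoder keeps the two cases separate.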
• Verbal extensions are infixed after the verb stem and before the final vowel:
ap applicative -el-, -ets- (ap/t^pf)
rc reciprocal -an-
c causative -is-, -es- (chesa) , -y- (kenya)
p passive -w-
nt neuter/stative -ahal- bon-ahal-a v^see-nt-m^in visible
-eh- lahl-eh-a v^lose-nt-m^in get lost
xt extensive -aka-
rv reversive -oll- fas-oll-a v^tie-rv-m^in untie
cl completive -ell-
• Every verb must have a tense marker except for the imperative or infinitive. ‘_’ is used where t^ is not an isolated morpheme.
• Simple tenses do not require a second subject marker. Most tense markers are infixed after the subject marker and before the verb stem (before an object marker if one is present), except for the perfective (t^pf) which is infixed after the verb stem and before the final vowel:
p present -a-, -ø-
pf perfective -il-, -ets-
f1 future tla
neg: ha ke-na-ho-rek-a ng sm1s-t^f1-if-v^buy-m^in
f2 future -tlilo-, -tlo-
f3 future -ya-
f4 future -ilo-
f5 future -yo-
if infinitive ho-
sa persistive -sa- ke-sa-y-a sm1s-t^sa-v^go-m^in
np narrative past -a ka-bon-a sm1s/t^np-v^see-m^in
po potential -ka- n-ka-rek-a sm1s-t^po-v^buy-m^in
po/ng negative potential -kebe-, -keke-
ts recent past -tswa- ke-tswa-rek-a sm1s-t^ts-v^buy-m^in
tena -tena-, -tene
• Compound Tense Forms require a second subject marker after the tense marker. With compound tense forms all morphemes are written conjunctively. In the affirmative, the mood of the verb is participial (m^pt); in the negative it is negative (m^x).
ne past continuous -ne-
neg: u-ne-u-sa-thus-a sm2s-t^ne-sm2s-ng-v^help-m^x
pp past progressive -ntse-
ps past -ile-
ny not yet -eso-, -so-
se exclusive -se-
re-se-re-qet-il-e sm1p-t^se-sm1p-v^finish-t^pf-m^pt
neg: ke-se-ke-sa-bin-e sm1s-t^se-sm1s-ng-v^sing-t^pf/m^x
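The tense codes listed above can be collected into a lookup table for reading tag strings. The sketch below copies the code names from the simple and compound tense lists; the names TENSES and tenses_in are hypothetical. The infinitive (if) is left out because the example above tags it without the t^ prefix, and tena is left out because no gloss is given for it.

```python
import re

TENSES = {
    "p": "present", "pf": "perfective",
    "f1": "future", "f2": "future", "f3": "future",
    "f4": "future", "f5": "future",
    "sa": "persistive", "np": "narrative past", "po": "potential",
    "ts": "recent past",
    # Compound tense forms.
    "ne": "past continuous", "pp": "past progressive", "ps": "past",
    "ny": "not yet", "se": "exclusive",
}

# "t^" followed by a code; the lookbehind avoids matching inside a longer tag.
TENSE_TAG = re.compile(r"(?<![a-z])t\^([a-z0-9]+)")


def tenses_in(tags):
    """Return the tense names found in a tag string, in order."""
    # Unknown codes are returned unchanged rather than guessed at.
    return [TENSES.get(code, code) for code in TENSE_TAG.findall(tags)]


print(tenses_in("sm1p-t^se-sm1p-v^finish-t^pf-m^pt"))  # ['exclusive', 'perfective']
print(tenses_in("sm1s-t^p_v^eat-m^in"))                # ['present']
```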
• Subject and object relative markers are coded according to the noun class of the nouns to which they refer. Relative clauses are always participial mood (m^pt):
sr subject rel. marker
... mo-tho ya il-e-ng ... n^1-person(1,2) sr1-v^go/t^pf-m^pt-rl
or object rel. marker
ntho eo ke-e-bon-e-ng thing(9,10) or9 sm1s-om9-v^see-t^pf/m^pt-rl
lr locative relative
... moo ke-ba-bon-e-ng teng lr sm1s-om2-v^see-t^pf/m^pt-rl loc
obr oblique relative (object of by-phrase or time)
mo-hla oo re-y-a-ng ... n^3-day(3,4) obr3 sm1p-t^p_v^go-m^pt-rl ...
-rl relative suffix
moo ke-ba-bon-e-ng teng lr sm1s-om2-v^see-t^pf/m^pt-rl loc
sr/cp relative copula
mo-nna ya le-venkele-ng n^1-man(1,2) sr1/cp n^5-shop(5,6)-lc
• The mood of infinitives is m^in if affirmative indicative, m^x if negative:
in indicative assertive, declarative, affirmative statements
pt participial in compound tenses, subordinate clauses, relative clauses
x negative in negative utterances
ha ke-tseb-e ng sm1s-t^p_v^know-m^x I don’t know.
i imperative
m-ph-e om1s-v^give-m^i Give me.
ip plural imperative
bon-ang v^see-m^ip Look!
s subjunctive
ere ke-bon-e ht sm1s-t^p_v^see-m^s Let me see.
sp plural subjunctive
ha re-y-eng ht sm1p-t^p_v^go-m^sp Let’s go.
cm complementizer hore
cj conjunction le, kanthe, ebile, kapa, hobane, etc.
cd conditional ha, haeba
pr preposition ka, ha, ho, etc.
ht hortative particle ha, ere
ng negatives ha, sa, ska, skana, skaba, etc.
loc locatives mane, moo, nqa, fatse, kwana, mona, pela, hodima, etc.
av adverb pele, maobane, etc.
ij interjection ee, ako, hei, oo, etc.
wh question eng, kae, neng, mang, hobaneng, etc.
-wh question marker o-bin-a-ng ? sm1-t^p_v^sing-m^in-wh ?
q question markers hakere, na, hana
? questions o-batl-a papa ?
ako ako bone => ake u-bon-e ij sm2s-t^p_v^see-m^s Please (you) look.
ako bona => ako bon-a ij v^see-m^i Please look.
ska (seka, skaba, skana, etc.) ng always with imperative verb
ska n-qal-a ng om1s-v^attack-m^i Don’t pick on me.
u-skaba-bolel-a sm2s-ng-v^tell-m^i Don’t tell.
x2, x3, etc. Identical consecutive renditions of the same utterance are indicated under the situational comments
Collection of this corpus was supported in part by
Fulbright and SSRC (Social Science Research Council) dissertation
funding. Computerization and tagging of the corpus has been supported
in part by NSF grants BNS-08709938 and SBR-9727897. Thanks to all who
have assisted with this research over the years, and the children and
families who provided the data.