Thai-Zlatev Corpus

Peerapat Yangklang
Assistant to the President
Silpakorn University


Jordan Zlatev
Division of Cognitive Semiotics
Lund University


Participants: 50 transcripts -- ages 4, 6, 9, 11, and 20
Type of Study: narrative
Location: Thailand
Media type: no longer available
DOI: doi:10.21415/T5NK5Q

Citation information

If you publish any paper based on this data, please send an MS-Word or PDF-formatted version of your paper as an attachment to

Zlatev, J. and Yangklang, P. (2001). Frog stories in Thai: Transcription and Analysis of 50 Thai narratives from 5 age groups. (forthcoming)

In accordance with TalkBank rules, any use of data from this corpus must be accompanied by the above reference.

Project Description

Though our project focused on the development of spatial expressions in Thai, we made a serious effort to make the data as consistent and general as possible so that it could be used for other studies as well. We would also like to thank everyone who helped us carry out the collection and transcription of this data: Janich Feangfu, Maneeya Sangjan, Mingmit Sriprasit, Soraya Osathanonda, Martha Karrebaek Hentze and Katarina Lindblom.

The child data was collected in three Bangkok schools and the adult data was collected from students of Chulalongkorn University. The interviewer, always a native Thai speaker, first showed the Frog Story book to the subject and let him scan through it by himself for about 5 minutes. For the children, the instructions were approximately as follows: This story is about a boy, his dog, and a frog. I’ll let you take a look at the pictures of the story, first. Then, I will ask you to tell me the story, picture by picture.

The interviewer sometimes encouraged the child to proceed with the story. These utterances of the interviewer have not been transcribed. Even though we tried to keep the elicitation conditions as uniform as possible, there were inevitable differences due to the fact that five different interviewers collected the data. (The name of the interviewer appears first in the @Transcriber list.)

1.1.1 Transcription

Each recorded narrative was transcribed in standard Thai orthography, in almost all cases by the person who performed the interview. The Thai transcription was then converted into a phonemic notation via the semi-automatic Thai Transcription program, developed at the Department of Linguistics, Chulalongkorn University. The consonants are as follows:

Table 1: Thai Consonants
labial  postdental  palatal   velar  glottal

Table 2: Thai Vowels

Tones were marked as: Mid: 0, Low: 1, Falling: 2, High: 3, Rising: 4. Due to requirements of CHAT, the ? for glottal stop was omitted. The presence of the glottal stop is nevertheless derivable from the data since Thai syllables cannot begin with a vowel or end with a short vowel. Whenever that seems to be the case in the data, there is an “invisible” glottal stop before the initial vowel or after the final short vowel.
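The tone digits above lend themselves to a simple lookup. The following is a minimal sketch, not part of the corpus tooling: the function name `syllables` and the regular expression are assumptions made for illustration, and the pattern assumes that each syllable is a run of letters (optionally containing the lengthening mark ":") followed by exactly one tone digit.

```python
import re

# Tone digits as listed in the corpus description.
TONE_NAMES = {"0": "mid", "1": "low", "2": "falling", "3": "high", "4": "rising"}

# Assumption: a syllable is a run of letters (":" marks an extra-long vowel)
# followed by a single tone digit. Word-internal "+" joiners are skipped.
SYLLABLE = re.compile(r"([A-Za-z:]+)([0-4])")


def syllables(word):
    """Return a list of (segments, tone name) pairs for one transcribed word."""
    return [(segs, TONE_NAMES[digit]) for segs, digit in SYLLABLE.findall(word)]
```

For instance, `syllables("naa2taaN1")` would yield one pair per syllable, so a two-syllable word such as naa2taaN1 gives two tone labels.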

1.1.2 Segmentation

Thai orthography does not place spaces between words: the space within the utterance corresponds to a hesitation pause. The transliteration program does not perform word segmentation either. Therefore, in order to allow the CLAN programs to perform automatic analyses (MLU, frequency counts, etc.), the phonemic transcription needed to be segmented into words manually. This was straightforward in most cases since the vast majority of Thai words, especially in the colloquial register, are monosyllabic. However, it is not always clear if certain multi-syllabic expressions should be treated as (a) mono-morphemic words, (b) multi-morphemic words including lexical compounds or (c) phrases consisting of one or more words. In deciding how to analyze particular examples, we used the following criteria, disregarding diachronic evidence:

1. Mono-morphemic word IFF at least one of the syllables in the expression does not have a transparent separate meaning, e.g. naa2taaN1 (‘window’). Even though this expression is probably a compound diachronically, the compounding is not transparent for present-day speakers, so we decided to treat it as mono-morphemic.

2. Multi-morphemic word (“+” between the syllables) IFF all the syllables have transparent separate meanings, but the meaning of the whole is not derivable by combining that of the parts, e.g. phuu2+jaj1 (‘person’+‘big’ = ‘adult’). Lexical compounds are one subclass of this category. Derivations such as khwaam0+suk1 (PROPERTY + ‘happy’ = ‘happiness’) are also included in this category, even though their derivation is semantically regular.

3. Phrase (SPACE between the syllables) IFF the syllables have separate meaning, and combine systematically to give the meaning of the whole, e.g. maa4 (‘dog’) noj4 (‘little’) = ‘little dog’; raN4 (‘nest’) phUN2 (‘bee’) = ‘beehive’.

Expressions that appeared to be formulaic, e.g. may0+pen0+raj0 (‘never mind’), were placed in the same group as the second category, and thus marked in the same way (with a “+” connecting the parts). Reduplications (cf. Luksaneeyanawin 1984) were marked by connecting the parts with a double plus sign “++”, e.g. dek1++dek1 (‘children’).

Using this notation, compounds and other multi-morphemic words, formulaic expressions and reduplications can be treated as single lexical items (which is intuitively correct) by the CLAN programs, while analysis can still easily be performed on their parts if required. For example, by adding the switch +b+ in the command line of the program MLU the constituent morphemes will be counted separately.
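The "+" and "++" notation can also be processed outside CLAN. The sketch below is only an illustration under stated assumptions: the names `analyze_word` and `count_units` are hypothetical, and the counting option loosely mirrors the idea of counting constituent morphemes separately rather than reproducing the behaviour of the MLU program.

```python
import re


def analyze_word(word):
    """Classify one space-delimited item and split it into its parts."""
    parts = re.split(r"\++", word)  # split on "+" or "++"
    if "++" in word:
        kind = "reduplication"
    elif "+" in word:
        kind = "multi-morphemic"  # compounds, derivations, formulaic expressions
    else:
        kind = "simple"
    return kind, parts


def count_units(words, split_parts=False):
    """Count lexical items, or their constituent parts if split_parts is True."""
    if not split_parts:
        return len(words)
    return sum(len(analyze_word(w)[1]) for w in words)
```

With `split_parts=False` an item such as may0+pen0+raj0 counts once; with `split_parts=True` it counts as three parts.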

1.1.3 CHAT Formatting

The rough phonemic transcription was then checked against the original tape recordings and corrections were made. Deviations from standard pronunciation were included, using the convention offered by CHAT of placing the standard form in square brackets after the pronounced form, e.g. laN0 [: raN0]. We then listened through the tape once more in order to mark all pauses, short (#) and long (##), and extra-long vowels, e.g. maa:4. Repetitions and re-tracings were marked using the CHAT conventions, i.e. the repeated material was surrounded by <> and followed by [/], [//] or [///].
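As an illustration of the replacement convention, the following sketch extracts either the standard or the pronounced form from a main line. It is not a CHAT parser: the function names are hypothetical, and the pattern assumes that the pronounced form is a single space-delimited item immediately followed by a bracketed "[: ...]" replacement.

```python
import re

# Assumption: "pronounced [: standard]" with a single pronounced item.
REPLACEMENT = re.compile(r"(\S+)\s+\[:\s*([^\]]+?)\s*\]")


def standard_forms(line):
    """Keep the standard form given inside the square brackets."""
    return REPLACEMENT.sub(r"\2", line)


def pronounced_forms(line):
    """Keep the form as actually pronounced and drop the bracketed standard."""
    return REPLACEMENT.sub(r"\1", line)
```

Applied to the example above, `standard_forms("laN0 [: raN0]")` keeps raN0 and `pronounced_forms("laN0 [: raN0]")` keeps laN0; lines without a replacement are returned unchanged.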

Following the CHAT convention, each main line was made to include only one utterance – defined with a combination of phonetic and grammatical criteria. Thus, a line/utterance ends when both conditions are met:

  1. There is a short pause (#), a long pause (##), or a “vowel lengthening”, and
  2. This coincides with the end of a clause, marked as [c].
If only (1) is met, the pause is marked within the utterance/line. If only (2) is met, [c] marks the end of the clause but not the utterance/line. However, we sometimes allow a line/utterance to end even if there is a word between the pause and clause boundary.

The operational definition of a clause provided by Berman and Slobin (1994:660) (“a unit that contains … a predicate that expresses a single situation (activity, event, state). Predicates include finite and non-finite verbs, as well as predicative adjectives.”) could not be used since serial verb constructions, which are ubiquitous in Thai, can involve up to six verbs in what is arguably a representation of a “single situation”. While knowing that our decisions will not satisfy all scholars of Thai linguistics, we have tried hard to provide a clear set of criteria for identifying clause boundaries in the corpus, and by marking them with the conventional CHAT symbol [c], allow a first-order surface representation of grammatical complexity. According to our criteria [c] was used:

  1. Before the introduction of a new explicit or implicit subject.
  2. Before the relative-clause markers (RCMs) thii2 and sUN2 (‘which’). If there is only a noun phrase between the previous [c] and the RCM, the clause boundary [c] is instead placed at the end of the relative clause. It is also used in other places where a relative-clause marker may be inserted.
  3. Where clause boundaries are indicated by the presence of conjunctions such as lx3, lxxw3 (‘and’), lxxw3 kO2 (‘and then’), kO2 (‘then’), thxx1 (‘but’), phrO3 (‘because’), mUa2, phOO0 (‘when’), con0 (‘until’), mxx3 (‘though’) or in other places where a conjunction may be inserted.
  4. After wa2 (‘that’), if it is both preceded and followed by text segments with main verbs (excluding cases where wa2 is a main verb, and where it has nouns and other non-verb expressions as complements).

Given this way of marking clauses, we were faced with a dilemma as to how to satisfy the CHAT convention of having only one utterance per line (tier). If we chose only phonetic criteria, i.e. pauses and intonation contours, to define utterance boundaries, we would have to break up clauses – as previously defined – into many lines, and thus decrease readability and analyzability. On the other hand, if we neglected prosodic criteria and only segmented the text into clauses, we would miss the information that some clause boundaries coincided with pauses, and thus seemed to constitute processing units, while others did not. We resolved this dilemma with a compromise, operationally defining an “utterance”, i.e. the unit to be placed on a single line/tier, through a combination of phonetic and grammatical criteria:

U1. An utterance boundary (.) occurs when there is BOTH a phonetic indication of utterance closure – a short pause (#), a long pause (##) or a vowel lengthening (:) – AND a clause boundary, marked with [c].

This means that if there is only a pause but no clause boundary, then the pause is marked within the utterance. If the utterance seems to terminate without a clause being completed – the speaker “trails off” – this is marked by ending the line with the symbol “+…” instead of the utterance delimiter “.”. Likewise, if there is a clause boundary but no pause or vowel lengthening, the utterance is assumed to continue until a clause boundary and an utterance boundary coincide. An exception to condition U1 was made when there was only a single word (or short phrase) serving as a “filler” between the clause boundary and the actual pause; in this case the utterance was terminated after this filler, as in the following authentic example:
lxxw3 dek1 tUUn1 khUn2 maa0 # phrOOm3 maa4 [c] lxxw3 # .
and child wake up come together dog and
‘And the child woke up, with the dog and.’
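Condition U1 and its filler exception can be approximated programmatically. The sketch below is a rough reading of the rule, not the procedure used by the transcribers, who segmented the data by hand. The function name `segment`, the `max_filler` parameter, and the decision to treat a cue directly before [c] or a pause shortly after [c] as "coinciding" are all assumptions.

```python
PAUSES = {"#", "##"}


def is_lengthened(token):
    # Assumption: ":" inside an ordinary word marks an extra-long vowel.
    return ":" in token and not token.startswith("[")


def segment(tokens, max_filler=1):
    """Split a token list into utterances following an approximation of U1."""
    utterances, current = [], []
    i, n = 0, len(tokens)
    while i < n:
        tok = tokens[i]
        current.append(tok)
        i += 1
        if tok != "[c]":
            continue
        # Phonetic cue directly before the clause boundary.
        prev = current[-2] if len(current) >= 2 else ""
        if prev in PAUSES or is_lengthened(prev):
            utterances.append(current)
            current = []
            continue
        # Otherwise look ahead: up to max_filler words, then a pause.
        j = i
        while j < n and j - i < max_filler and tokens[j] not in PAUSES and tokens[j] != "[c]":
            j += 1
        if j < n and tokens[j] in PAUSES:
            current.extend(tokens[i:j + 1])
            utterances.append(current)
            current = []
            i = j + 1
    if current:
        utterances.append(current)  # remainder without a closing boundary
    return utterances
```

On the example above, the pause after maa0 does not close the utterance because no clause boundary accompanies it, whereas [c] followed by the filler lxxw3 and a pause does.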

Finally, each of the 50 narratives was read through once again by at least two different checkers, correcting for any inconsistencies. Furthermore, a listing of all the words in the corpus was produced using the CLAN command freq +k *.cha +u +r6, and we went through this list word by word, making sure that each word was transcribed consistently throughout the corpus. In addition to standard CHAT codes, we used the %tai dependent tier for the Thai transcription and a double ++ to indicate reduplication as in “luuk2++luuk2.”

1.1.4 Files

The files are summarized in the following table. The subjects’ names do not appear in the transcripts.

Table 3: Thai Frog Files
File  Age  Sex  Date
Please see this link for a general description of the Frog Story methods.


This data was collected as part of the First Language Acquisition of Thai project, funded by The Swedish Foundation for International Cooperation in Research and Higher Education (STINT) and hosted by the Department of Linguistics, Chulalongkorn University, Thailand during 2000.