Each audio file is provided with its corresponding orthographic transcription, TIMIT-style
word-level segmentation, and morpho-syntactic annotation files. All data processing
to obtain these files was performed manually by trained linguists.
The transcription files contain the exact orthographic transcription of the utterances, which may
differ from the original text. For example, numbers, abbreviations, foreign words and dates are
expanded according to how they were uttered by the speakers. In addition, the transcriptions of the
stories have the sentence boundaries labeled with and .
The segmentation was performed using WaveSurfer [8], an open-source tool for sound
visualization and manipulation, which supports the TIMIT word-level transcription format. Although it
supports Unicode, it does not support Kazakh symbols well. Therefore, we used an ASCII version
of the Kazakh letters. We also used the # symbol for pauses and silence, and the ^ symbol for other
non-speech events.
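To illustrate how such segmentation files might be consumed, the following minimal Python sketch reads word-level labels and separates spoken words from the # and ^ markers. It assumes the standard TIMIT word-file layout, in which each line holds a start sample, an end sample and a label. The function names, the 16 kHz sampling rate and the example words are hypothetical and are not taken from the corpus.

```python
from typing import List, NamedTuple

PAUSE = "#"       # pauses and silence, as described in the text
NON_SPEECH = "^"  # other non-speech events, as described in the text


class Segment(NamedTuple):
    start: int  # first sample of the segment
    end: int    # last sample of the segment
    label: str  # ASCII-transliterated word, or one of the markers above


def parse_wrd(text: str) -> List[Segment]:
    """Parse TIMIT-style word labels: one 'start end label' triple per line."""
    segments = []
    for line in text.splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        segments.append(Segment(int(parts[0]), int(parts[1]), parts[2].strip()))
    return segments


def spoken_words(segments: List[Segment]) -> List[str]:
    """Return only the word labels, dropping pause and non-speech markers."""
    return [s.label for s in segments if s.label not in (PAUSE, NON_SPEECH)]


def duration_seconds(segment: Segment, sample_rate: int = 16000) -> float:
    """Convert a segment length from samples to seconds.

    The 16 kHz default is an assumption; use the corpus's actual rate.
    """
    return (segment.end - segment.start) / sample_rate


# Hypothetical file contents, for illustration only.
example = "0 3200 #\n3200 9600 salem\n9600 11200 ^\n11200 20800 alem\n"
segments = parse_wrd(example)
words = spoken_words(segments)
```

Keeping the markers as ordinary labels, rather than discarding them while parsing, preserves the timing of pauses and non-speech events for later analysis.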
The morpho-syntactic annotation includes the part-of-speech and morpheme segmentation of each
word as well as syntactic information for each sentence.