Proceedings of the I International Conference

Transcripts and annotation

Date: 07.01.2022

Each audio file is accompanied by its orthographic transcription, a TIMIT-style word-level
segmentation, and a morpho-syntactic annotation file. All data processing required to produce
these files was performed manually by trained linguists.
The transcription files contain the exact orthographic transcription of the utterances, which may
differ from the original text. For example, numbers, abbreviations, foreign words and dates are
expanded according to how the speakers uttered them. In addition, the transcriptions of the
stories have their sentence boundaries labeled with <s> and </s>.
The segmentation was performed using WaveSurfer [8], an open-source tool for sound
visualization and manipulation, which supports the TIMIT word-level transcription format. Although
WaveSurfer supports Unicode, it does not handle Kazakh symbols well. Therefore, we used an ASCII
version of the Kazakh letters. We also used the # symbol for pauses and silence, and the ^ symbol
for other non-speech events.
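
To illustrate how such a segmentation file might be read, the following is a minimal sketch and not the authors' tooling. It assumes the usual TIMIT word-file layout, one segment per line as "<start sample> <end sample> <label>", and a 16 kHz sampling rate (the rate of TIMIT itself; the rate of this corpus is not stated here). The names Segment and parse_segmentation and the example labels are hypothetical.

```python
from dataclasses import dataclass

# Assumption: 16 kHz, as in TIMIT; the corpus sampling rate is not stated here.
SAMPLE_RATE_HZ = 16_000

PAUSE = "#"       # pauses and silence, as described in the text
NON_SPEECH = "^"  # other non-speech events, as described in the text


@dataclass
class Segment:
    start: int  # start position in samples
    end: int    # end position in samples
    label: str  # ASCII-transliterated word, "#" or "^"

    @property
    def kind(self) -> str:
        """Classify the segment by its label."""
        if self.label == PAUSE:
            return "pause"
        if self.label == NON_SPEECH:
            return "non-speech"
        return "word"

    @property
    def duration_s(self) -> float:
        """Segment length in seconds under the assumed sampling rate."""
        return (self.end - self.start) / SAMPLE_RATE_HZ


def parse_segmentation(text: str) -> list[Segment]:
    """Parse lines of the form '<start> <end> <label>', skipping blank lines."""
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        start, end, label = line.split(maxsplit=2)
        segments.append(Segment(int(start), int(end), label))
    return segments


# Made-up example content, not taken from the corpus.
example = """\
0 3200 #
3200 11200 salem
11200 12800 ^
12800 24000 alem
"""

segments = parse_segmentation(example)
words = [s.label for s in segments if s.kind == "word"]
```

Keeping pauses and non-speech events as ordinary segments with a derived kind, rather than dropping them at parse time, preserves the full timeline so that word timings can still be checked against the audio.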
The morpho-syntactic annotation includes the part of speech and morpheme segmentation of each
word, as well as syntactic information for each sentence.
