Recognition Results

All the audio data were split into training and test sets. The test set is balanced by gender
and includes one representative from each region. Quantitative information about both sets is
given in Table 2. The overall recognition performance on the test data is a word error rate
(WER) of 6.9%.
Table 2. Distribution of data in training and test sets.

                     Train set   Test set
# of speakers              153         16
# of audio files         11367       1176
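
For readers unfamiliar with the metric, the sketch below shows how a corpus-level WER is conventionally computed: the word-level edit distance (substitutions, deletions and insertions) summed over all utterances, divided by the total number of reference words. It is a minimal illustration, not the scoring tool used in these experiments; the function names and the toy transcripts are hypothetical.

```python
from typing import Iterable, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[len(hyp)]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER over (reference, hypothesis) transcript pairs, as a fraction."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()
        errors += edit_distance(ref, hypothesis.split())
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("reference transcripts contain no words")
    return errors / ref_words


# Toy example (hypothetical transcripts, not data from this work).
toy_pairs = [
    ("a b c d", "a b c d"),
    ("e f g h i j", "e f x h i"),
]
toy_wer = corpus_wer(toy_pairs)
```

Under this definition, a WER of 6.9% corresponds to roughly seven word errors per hundred reference words.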
Conclusions and Future Work

In this work we conducted experiments on the large vocabulary continuous speech recognition
task for Kazakh. First, we built the first acoustic database of Kazakh speech, balanced with
respect to gender, region and age group. Next, we built the acoustic and language models using
the CMU Sphinx toolkits. Finally, we evaluated our system on the test data, obtaining a word
error rate of 6.9%.
Although the system we built is state-of-the-art, we regard it as a baseline for our future
work on speech recognition. Our next step will be to reduce the WER by exploiting class-based
language models with morphological cues. This kind of approach seems more effective for
inflectional languages such as Kazakh, Turkish and Russian.
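
For reference, a class-based bigram model is commonly factored as shown below. This is the standard textbook formulation and not necessarily the one that will be adopted in the future work; the class map $c(\cdot)$ is an assumption introduced here for illustration. With morphological cues it might, for example, group words that share a suffix or a part-of-speech tag.

```latex
P(w_i \mid w_{i-1}) \approx P\bigl(w_i \mid c(w_i)\bigr)\, P\bigl(c(w_i) \mid c(w_{i-1})\bigr)
```

Because the number of classes is much smaller than the vocabulary, the class transition term can be estimated from far fewer observations than a word bigram, which matters when rich morphology produces many rare word forms.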