Acoustic Database
Most modern speech processing systems require large amounts of audio and text data for
training the acoustic and language models. Depending on the application, the data needed
ranges from high-quality microphone read speech (WSJ0 [1]) to conversational telephone speech
(Switchboard [2] or CALLHOME [3]), and from continuous speech (TIMIT [4]) to connected
(TIDIGITS [5]) and isolated words (PhoneBook [6]). In our current work, we collected a corpus of
28 hours of high-quality microphone read Kazakh speech from 169 native speakers for large
vocabulary continuous speech recognition tasks. The acoustic database was initiated as part of the
Kazakh Language Corpus compiled in [7].
Text materials
The text materials to be uttered were carefully selected from the primary text corpus and divided
into two parts: short sentences and stories.
The “sentences” part has more than 12K different sentences drawn at random and in equal numbers
from the five stylistic genres mentioned above. The sentences were chosen so that they contain more
than 120K words that appear in the list of the most frequent words covering 95% of all the texts
in the primary corpus. Additionally, the sentences were grouped according to their length in words.
This gives 10 groups of sentences, one for each length from 6 to 15 words.
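The grouping by length can be illustrated with a minimal sketch. It assumes that words are separated by whitespace, which the text does not specify, and the function name `group_by_length` is hypothetical.

```python
from collections import defaultdict


def group_by_length(sentences, min_len=6, max_len=15):
    """Map each word count in [min_len, max_len] to the sentences of that length."""
    groups = defaultdict(list)
    for sentence in sentences:
        # Assumption: words are delimited by whitespace.
        n_words = len(sentence.split())
        if min_len <= n_words <= max_len:
            groups[n_words].append(sentence)
    return dict(groups)
```

Sentences shorter than 6 or longer than 15 words are simply left out, so at most 10 groups are produced.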
The “stories” part contains short online news items extracted from the mass media section of the
primary text corpus. Each story has no more than 300 words.
All the text materials were subdivided into small, numbered, non-intersecting sets to be uttered
by the speakers. A standard set for one speaker has exactly 75 sentences (10 sentences from each of
the five shorter groups and 5 sentences from each of the five longer groups) and 1 story.
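The composition of a standard set (5 × 10 + 5 × 5 = 75 sentences plus 1 story) can be sketched as follows. This is an illustration only, not the procedure actually used: it assumes that the "five shorter groups" are the lengths 6 to 10 and the "five longer groups" are the lengths 11 to 15, and the names `build_speaker_sets`, `groups`, and `stories` are hypothetical.

```python
import random

# Assumption: shorter groups are lengths 6-10, longer groups are lengths 11-15.
QUOTA = {n: 10 for n in range(6, 11)}
QUOTA.update({n: 5 for n in range(11, 16)})


def build_speaker_sets(groups, stories, seed=0):
    """Split sentences and stories into disjoint sets of 75 sentences and 1 story."""
    rng = random.Random(seed)
    # Shuffle private copies so that the caller's data is not modified.
    pools = {n: list(groups.get(n, [])) for n in QUOTA}
    for pool in pools.values():
        rng.shuffle(pool)
    story_pool = list(stories)
    rng.shuffle(story_pool)

    sets = []
    # Stop as soon as any length group or the story pool cannot fill another set.
    while story_pool and all(len(pools[n]) >= k for n, k in QUOTA.items()):
        sentences = []
        for n, k in QUOTA.items():
            # Popping removes the sentence, so no sentence is reused across sets.
            sentences.extend(pools[n].pop() for _ in range(k))
        sets.append({"sentences": sentences, "story": story_pool.pop()})
    return sets
```

Because every sentence and story is removed from its pool once assigned, the resulting sets do not intersect.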
Speakers
The speakers who took part in the recordings are volunteers recruited through advertisements in
local newspapers and by personal referral. The main selection criteria were the region where
the speaker learned Kazakh or spent most of his or her life, age, gender, and the ability to read Kazakh.
The first criterion helped to capture the variability in speech due to where the speakers settled,
both inside and outside the country. In total there are 15 region groups: the 14 official regions
(“oblast”) of Kazakhstan and one group for those who lived outside the country.
The speakers are divided into four age groups not including children and school students:
I group – 18-27 years;
II group – 28-37 years;
III group – 38-47 years;
IV group – 48 years and above.
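The age boundaries above can be written as a small lookup. This is a sketch only; the function name `age_group` is hypothetical, and returning `None` for anyone under 18 is an assumption about how excluded speakers would be handled.

```python
def age_group(age):
    """Return the age group label ("I" to "IV"), or None for ages under 18."""
    if age < 18:
        return None  # Assumption: children and school students are not assigned a group.
    if age <= 27:
        return "I"
    if age <= 37:
        return "II"
    if age <= 47:
        return "III"
    return "IV"
```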
We did not strictly balance the speakers by gender because volunteers were difficult to find,
but we still tried to keep the number of speakers of one gender per profile to no more than 3.
Female and male speakers make up 57% and 43% of the total, respectively.
The other important criterion was the ability to read Kazakh, since not all the interviewees could
read Kazakh fluently enough, which is a common issue in a bilingual country such as
Kazakhstan. Additionally, we recorded each speaker's education level: whether they last graduated
from school, college, or university.
In total, we recorded 169 speakers. Table 1 presents the distribution of the
speakers across regions, genders, and age groups. Blank cells mark the speaker profiles that
we could not recruit. These mostly correspond to the distant regions and older male groups.