G. Altenbek, X.L. Wang
Вестник Карагандинского университета
The following table gives high-frequency word statistics for a Kazakh masterwork. Scholars in Kazakhstan have carried out a statistical analysis of the vocabulary of the Kazakh masterwork Road of Abai and listed its 500 most frequent words, the first 15 of which are shown in Table 7.
Table 7
High-frequency words in «Road of Abai»
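| Rank | Word | Frequency | Rank | Word | Frequency | Rank | Word | Frequency |
|---|---|---|---|---|---|---|---|---|
| 1 | ەد(v) | 9828 | 6 | اد | 5653 | 11 | ىسو | 4379 |
| 2 | لوب | 9341 | 7 | لەك | 5175 | 12 | تيا | 4301 |
| 3 | ە | 7844 | 8 | لﯚب | 4696 | 13 | لوس | 4175 |
| 4 | يابا | 6747 | 9 | لا | 4546 | 14 | لو | 4117 |
| 5 | زۇ | 5747 | 10 | ەد | 4506 | 15 | رﯩب | 3896 |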
A comparison of Tables 6 and 7 shows that five high-frequency words occur in both tables; only their frequencies differ. The highest-frequency word is «ر ٮب» (meaning: one). On this basis, we further confirm the stability and universality of the Kazakh vocabulary, which reflects the unity of the Kazakh language.
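A comparison of this kind, between the top entries of two frequency tables, can be sketched in a few lines of Python. This is only an illustration and not the procedure used in the study: the function name `shared_top_words` and the toy word lists are hypothetical placeholders, and the numbers are not the values of Tables 6 and 7.

```python
def shared_top_words(freq_a, freq_b, top_n=15):
    """Return the words found among the top_n entries of both tables.

    freq_a, freq_b: dicts mapping word -> frequency (hypothetical inputs).
    Result: list of (word, freq_in_a, freq_in_b), ordered by rank in freq_a.
    """
    top_a = sorted(freq_a, key=freq_a.get, reverse=True)[:top_n]
    top_b = set(sorted(freq_b, key=freq_b.get, reverse=True)[:top_n])
    return [(w, freq_a[w], freq_b[w]) for w in top_a if w in top_b]


# Toy data only; these are not the values from Tables 6 and 7.
corpus_a = {"w1": 900, "w2": 700, "w3": 500, "w4": 100}
corpus_b = {"w2": 80, "w5": 60, "w1": 40, "w4": 5}
common = shared_top_words(corpus_a, corpus_b, top_n=3)
```

With real data, `freq_a` and `freq_b` would hold the full word counts of the two sources, and `common` would list the shared high-frequency words together with their differing frequencies.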
Conclusion
This paper focuses on corpus-based word frequency statistics for Kazakh at the level of morphological features, one of the most difficult aspects of statistical natural language processing for Kazakh written in the Arabic script in P.R. China. We standardized the encoding and storage scheme of our corpus and then constructed the Kazakh Language Corpus (KzLC). The Kazakh word-tagging corpus is annotated with various kinds of word-level information derived from the language materials. To address POS tagging in Kazakh lexical analysis, this study proposed a Kazakh word-tagging standard that covers POS tagging, stem tagging, affix tagging, and multi-class word POS tagging as an attribute.
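To make the layers of such a tagging standard concrete, one annotated word could be represented roughly as in the sketch below. The class `TaggedWord`, its field names, and the tag labels `N` and `ADJ` are assumptions made for illustration only; they are not the tag set or the storage format of KzLC.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaggedWord:
    """One annotated word; all field names here are hypothetical."""
    surface: str                                        # word form as it appears in the text
    stem: str                                           # stem tag
    pos: str                                            # POS tag assigned in this context
    affixes: List[str] = field(default_factory=list)    # affix tags, in order
    alt_pos: List[str] = field(default_factory=list)    # other POS classes of a multi-class word

    def is_multi_class(self) -> bool:
        # A multi-class word can take more than one POS class.
        return bool(self.alt_pos)


# Placeholder strings instead of real Kazakh forms and real tag labels.
word = TaggedWord(surface="STEM+AFFIX", stem="STEM", pos="N",
                  affixes=["AFFIX"], alt_pos=["ADJ"])
```

Storing the alternative POS classes as an attribute of the word keeps a single record per token while still marking it as multi-class.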
Our research project has demonstrated a corpus-based approach to processing word frequencies. The completed statistics cover the initial letters of Kazakh words, Kazakh word frequencies, Kazakh word lengths, and the relationship between word length and word frequency. The experimental results illustrate the inner links among Kazakh word frequencies and, at the same time, verify that Kazakh word frequencies comply with Zipf's law, which is a power law.
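Zipf's law states that the frequency f of a word is approximately proportional to r^(-s), where r is the word's frequency rank and s is close to 1. A common way to check this is to fit a straight line to log f against log r. The sketch below shows that check under stated assumptions: the function names are hypothetical, the token list is synthetic, and an ordinary least-squares fit is used, which is not necessarily the method applied in the study.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return [(rank, frequency), ...] with rank 1 for the most frequent word."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def zipf_exponent(rank_freq):
    """Least-squares slope of log(frequency) against log(rank), negated.

    Needs at least two ranks; a value near 1 is consistent with Zipf's law.
    """
    xs = [math.log(r) for r, _ in rank_freq]
    ys = [math.log(f) for _, f in rank_freq]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Synthetic tokens: word i occurs 120 // i times, i.e. frequency = 120 / rank.
tokens = [f"w{i}" for i in range(1, 7) for _ in range(120 // i)]
s = zipf_exponent(rank_frequency(tokens))
```

On real corpus data the fitted exponent will deviate from 1, and the fit is usually less reliable at the lowest ranks and in the long tail of rare words.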
In the experiment, we used data from the 2008 issues of Xinjiang Daily (Kazakh version) and from Kazakh textbooks used in Xinjiang from primary school through junior high school to high school, as collected in our Kazakh Language Corpus (KzLC).
In future work, we plan to construct a syntactic annotation treebank of the Kazakh language, which has been continuously improved, modified and expanded in recent years. We then plan to construct a semantic annotation treebank of the Kazakh language.
This research was supported by the Natural Science Foundation of P.R. China (NSFC) under grants No. 61363062, No. 61063025 and No. 61572151, and by another grant, No. NMLR 201601.