Some of the highest-frequency words in textbooks at each school stage

| Word (primary school) | Frequency | Word (junior high school) | Frequency | Word (high school) | Frequency |
|---|---|---|---|---|---|
| رﯩبٔ | 1972 | رﯩبٔ | 2143 | رﯩبٔ | 3263 |
| نەم | 1334 | اد | 1949 | اد | 2783 |
| اد | 1303 | پەد | 1724 | نەم | 2442 |
| پەد | 1166 | نەم | 1626 | پەد | 2329 |
| لو | 1000 | ەد | 1494 | ەد | 2182 |
| ەد | 990 | لو | 1221 | لﯘب | 1509 |
| نەكە | 742 | لﯘب | 917 | لو | 1473 |
| لﯘب | 707 | ىدە | 913 | ىدلاوب | 1197 |
| سو ى | 603 | نەكە | 842 | نەگەد | 1192 |
| ﯔﯩنو | 576 | ىسو | 824 | ىسو | 1178 |
[Figure: counts of word tokens, word types, suffix types, and stem types in primary school, junior high school, and high school textbooks; count axis from 0 to 400,000.]
G. Altenbek, X.L. Wang
Вестник Карагандинского университета
Table 7 gives high-frequency word statistics for a Kazakh masterwork. Scholars in Kazakhstan have
published a statistical analysis of the vocabulary of the Kazakh masterwork Road of Abai. They list the top
500 high-frequency words, the first 15 of which are shown in Table 7.
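Table 7
High-frequency words in Road of Abai

| Rank | Word | Frequency | Rank | Word | Frequency | Rank | Word | Frequency |
|---|---|---|---|---|---|---|---|---|
| 1 | ەد(v) | 9828 | 6 | اد | 5653 | 11 | ىسو | 4379 |
| 2 | لوب | 9341 | 7 | لەك | 5175 | 12 | تيا | 4301 |
| 3 | ە | 7844 | 8 | لﯚب | 4696 | 13 | لوس | 4175 |
| 4 | يابا | 6747 | 9 | لا | 4546 | 14 | لو | 4117 |
| 5 | زۇ | 5747 | 10 | ەد | 4506 | 15 | رﯩب | 3896 |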
A comparison of Tables 6 and 7 shows that five high-frequency words appear in both tables, although
their frequencies differ. The highest-frequency word in the textbooks is «ر ٮب» (meaning: one). This overlap
further supports the stability and universality of the Kazakh vocabulary and reflects the unity of the Kazakh
language.
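The kind of comparison described above can be sketched as a lookup of the words that two frequency tables share. This is only an illustration: the function name `compare_frequencies`, the placeholder tokens `w1` to `w4`, and the counts are hypothetical and are not taken from Tables 6 and 7.

```python
def compare_frequencies(freq_a, freq_b):
    """Return {word: (count in A, count in B)} for words found in both tables."""
    return {w: (freq_a[w], freq_b[w]) for w in freq_a if w in freq_b}


# Hypothetical placeholder data, not the Kazakh words or counts from the paper.
textbook = {"w1": 30, "w2": 20, "w3": 10}
novel = {"w3": 50, "w1": 40, "w4": 5}

shared = compare_frequencies(textbook, novel)
for word, (count_a, count_b) in shared.items():
    # Same word in both tables, usually with a different count in each.
    print(word, count_a, count_b)
```

With real data, tokens would likely need to be normalized to one spelling before comparison, since the same word can be encoded slightly differently in two sources.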
Conclusion
This paper focuses on corpus-based frequency statistics of Kazakh words and their morphological
features, which are among the most difficult aspects of statistical natural language processing for Kazakh
written in the Arabic script in P. R. China. We standardized the encoding, processing, and storage scheme of
our corpus and then constructed the Kazakh Language Corpus (KzLC). The Kazakh word tagging corpus is
annotated with various kinds of word-level information based on the corpus material. To address POS
tagging in Kazakh lexical analysis, this study proposed a Kazakh word tagging standard that covers
POS tagging, stem tagging, affix tagging, and POS tagging of multi-class words as an attribute.
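As a rough illustration of what one record under such a tagging standard might hold, the sketch below stores a word form with its stem, affixes, POS tag, and alternative POS classes. The class name `TaggedWord`, its field names, the tag labels `N` and `NUM`, and the Latin transliterations are all assumptions made for this example; they are not the KzLC format or tag set.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TaggedWord:
    surface: str                                       # word form as it appears in the text
    stem: str                                          # stem tag
    affixes: List[str] = field(default_factory=list)   # affix tags, in order
    pos: str = ""                                      # POS tag chosen in this context
    alt_pos: Tuple[str, ...] = ()                      # other POS classes of a multi-class word


def stem_frequencies(words):
    """Count how often each stem occurs in a sequence of TaggedWord records."""
    return Counter(w.stem for w in words)


# Hypothetical sample: 'kitap' (book) with plural and accusative affixes, 'bir' (one).
sample = [
    TaggedWord("kitaptar", "kitap", ["tar"], "N"),
    TaggedWord("kitapty", "kitap", ["ty"], "N"),
    TaggedWord("bir", "bir", [], "NUM"),
]
print(stem_frequencies(sample).most_common())
```

Keeping the stem and affixes as separate fields is what allows stem-type and suffix-type counts to be derived from the same annotated records.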
Our research project demonstrates a corpus-based approach to processing word frequencies. The
statistics completed in this study cover the initial letters of Kazakh words, word frequencies, word
lengths, and the relationship between word length and word frequency. The experimental results reveal
the internal relationships among Kazakh word frequencies and, at the same time, verify that Kazakh word
frequencies follow Zipf's law, a power law.
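One common way to check a frequency list against Zipf's law is to fit a straight line to log frequency versus log rank; a slope near -1 is consistent with the law. The sketch below is an assumption about how such a check could be done and is not the procedure used in this study. The function names `zipf_exponent` and `mean_frequency_by_length` are hypothetical, and the input is synthetic.

```python
import math
from collections import defaultdict


def zipf_exponent(frequencies):
    """Estimate s in f(r) ~ C / r**s by least squares on log f versus log r."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # negative slope of the fitted line


def mean_frequency_by_length(word_counts):
    """Map each word length to the average frequency of words of that length."""
    totals = defaultdict(list)
    for word, count in word_counts.items():
        totals[len(word)].append(count)
    return {length: sum(c) / len(c) for length, c in totals.items()}


# Synthetic frequencies that follow f(r) = 1000 / r exactly.
ideal = [1000 / rank for rank in range(1, 51)]
print(round(zipf_exponent(ideal), 3))  # 1.0
```

In practice the input would be the full frequency list from a corpus rather than only its top-ranked words, since a fit over a handful of ranks says little about the overall distribution.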
The experiments used the 2008 issues of Xinjiang Daily (Kazakh edition) and Kazakh textbooks for
primary school, junior high school, and high school in Xinjiang, both taken from our Kazakh Language
Corpus (KzLC).
In future work, we plan to construct a syntactically annotated treebank of the Kazakh language,
which has been continuously improved, revised, and expanded in recent years. We then plan to
construct a semantically annotated treebank of the Kazakh language.
This research was supported by the National Natural Science Foundation of P. R. China (NSFC) under
grants No. 61363062, No. 61063025, and No. 61572151, and by another grant, No. NMLR 201601.
References
1 Altenbek, G., Dawel, A., & Muheyat, N. (2009). A Study of Word Tagging Corpus for the Modern Kazakh Language. Journal of Xinjiang University, 26(4), 394–401.
2 Makhambetov, O., Makazhanov, A., Yessenbayev, Zh., Matkarimov, B., Sabyrgaliyev, I., & Sharafudinov, A. (2013). Assembling the Kazakh Language Corpus. In Proceedings of the Conference on Empirical Methods in Natural Language Processing 2013 (EMNLP 2013). Association for Computational Linguistics, 1022–1031.
3 Doszhan, G. (2013). Problems of Creation of the All-Turkic National Corpus. International Conference on Information, Business and Education Technology, 1018–1023.
4 Aksan, Y., & Aksan, M. (2009). Building a national corpus of Turkish: Design and implementation. Working papers in corpus-based linguistics and language education. Tokyo: Tokyo University of Foreign Studies, 299–310.
А corpus-based frequency statistic…
Серия «Филология». № 2(86)/2017
Г. Алтынбек, Кс.Л. Ванг