ISSN 2518-198X. Index 74623. Bulletin of the Karaganda University


A corpus-based frequency statistic of Kazakh language



Kazakh is an agglutinative language belonging to the Turkic language group. Kazakh written in Arabic
script, as used in China, is a low-resource language, and many serious challenges remain in natural
language processing research on it. This paper standardizes the encoding and storage scheme for a Kazakh
corpus and then constructs the Kazakh Language Corpus (KzLC), which lays the foundation for further
research on syntactic analysis and other areas of Kazakh language processing. Addressing the frequency
question for Kazakh, the paper examines how Zipf's law, a power law, applies to Kazakh words on the basis
of word frequency statistics. Using frequency statistics of Kazakh words from Kazakh textbooks, this
research proposes a corpus-based method for analyzing and computing statistics on word information, which
reveals rules and phenomena of the language in Kazakh word information.
Keywords: Kazakh language, statistics, corpus linguistics, word frequency, information retrieval,
morphological analysis.
Introduction 
Natural language processing has become one of the significant technologies in the development of
information technology across countries and nations. Word frequency statistics are an important task in
Kazakh natural language processing research. Corpus-based approaches to natural language processing have
become very popular, and corpus-based morphological analyzers have been developed for different
languages.
Kazakh belongs to the Turkic language group of the Altaic language family and is an agglutinative
language whose word structures are formed by adding derivational or inflectional affixes to root words.
Kazakh is written in three different scripts around the world: the Cyrillic script is used in Kazakhstan,
the Arabic script is used in P. R. China, and the Latin script is widely used in other countries.
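As an illustration of how the three scripts can be told apart when assembling a corpus, the following minimal sketch guesses the dominant script of a text from Unicode character names. It is not part of the method described in this paper; the function name `detect_script` and the majority-vote rule are assumptions made for the example.

```python
import unicodedata
from collections import Counter


def detect_script(text: str) -> str:
    """Guess the dominant script of `text` (hypothetical helper).

    Each alphabetic character votes for the script that its Unicode
    name starts with; the script with the most votes wins.
    """
    votes = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation and whitespace
        name = unicodedata.name(ch, "")  # "" if the character is unnamed
        for script in ("CYRILLIC", "ARABIC", "LATIN"):
            if name.startswith(script):
                votes[script] += 1
                break
    if not votes:
        return "UNKNOWN"
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    for sample in ("Qazaq tili", "Қазақ тілі", "قازاق"):
        print(sample, "->", detect_script(sample))
```

A majority vote keeps the sketch robust to a few foreign-script characters (for example Latin abbreviations inside an Arabic-script text), but it does not attempt to distinguish Kazakh from other languages that share a script.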
Previous work on Kazakh language processing includes the following. Altenbek designed a Kazakh corpus
for Arabic-script Kazakh [1]. Makhambetov et al. described the compilation process of a Cyrillic-script
Kazakh corpus (Makhambetov et al. 2013). Washington et al. built finite-state morphological transducers
for Cyrillic-script Kazakh [2]. Doszhan discussed the problem of creating an all-Turkic national
corpus [3]. To our knowledge, there is no research on frequency statistics by morphological feature for
Arabic-script Kazakh. This paper is the first to study corpus-based frequency statistics of Arabic-script
Kazakh words by morphological feature.
This paper focuses on corpus-based frequency statistics of Kazakh words at the level of morphological
features, which is the most difficult aspect of statistical natural language processing for Arabic-script
Kazakh in P. R. China. Our research project presents a corpus-based approach to processing word
frequencies, followed by the design considerations and construction of the Kazakh corpus. The results
also express the frequency relation of Kazakh words, based on a corpus drawn from mainstream newspaper
media and Kazakh school textbooks, and the resulting Kazakh word frequency distribution accords with
Zipf's law. Kazakh differs from English and other languages that have been studied with corpora, and word
frequency statistics are an important part of Kazakh language processing. The results of this research
not only create a corpus resource for further information processing of Kazakh, but also provide corpus
data for Kazakh linguistics research. This exploration is a basic task for machine translation, speech
recognition, information retrieval and many other application developments in the Kazakh language.
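Zipf's law states that the frequency f of a word is roughly proportional to r^(-s), where r is the word's frequency rank and s is close to 1. The following minimal sketch shows one common way to check this on a corpus: count word frequencies, then fit a straight line to log(frequency) against log(rank). It is an illustration only and not the procedure used in this paper; the function names, the `\w+` tokenization and the least-squares fit are assumptions made for the example.

```python
import math
import re
from collections import Counter
from typing import Sequence


def word_frequencies(text: str) -> list[tuple[str, int]]:
    """Return (word, count) pairs sorted by descending count.

    Assumption: tokens are runs of Unicode word characters; a real
    Kazakh corpus would need a proper tokenizer.
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common()


def zipf_exponent(counts: Sequence[float]) -> float:
    """Estimate s in f(r) ~ r^(-s) by least squares in log-log space.

    `counts` must be sorted in descending order, contain at least two
    values, and all values must be positive.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var  # the fitted slope is negative, so negate it


if __name__ == "__main__":
    freqs = word_frequencies("a b a c a b")
    print(freqs)
    print(zipf_exponent([count for _, count in freqs]))
```

An estimated exponent near 1 on a large corpus is consistent with Zipf's law; on small samples the estimate is noisy, because the many words that occur only once dominate the tail of the fit.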
Related work 
Since the Brown University corpus, the first of its kind, was compiled in 1963 to 1964, corpus
linguistics has become an important area of natural language processing. The American Brown corpus was
built to study modern American English: it systematically collected 1 million words of English text and
was automatically POS-tagged with a rule-based method using 86 grammatical tags. The LOB corpus of modern
British English, built in the 1970s, also collected 1 million words and used 133 grammatical tags; its
part-of-speech tagging was done automatically from statistical information by CLAWS (Constituent
Likelihood Automatic Word-tagging System). In recent years, many researchers have constructed corpora for
their own languages, such as the Korean National Corpus, the Turkish National Corpus [4], the Russian
National Corpus and the Chinese Peking University Corpus.

Series «Philology». No. 2(86)/2017

Corpus-based morphological analysis not only draws on large-scale real language data but also supports
qualitative explanation of specific language phenomena. Corpus analysis provides a new research platform
grounded in language use, on which language can be analyzed in terms of word frequency and syntactic
characteristics. Dictionaries compiled from a corpus make it possible to search for the collocations of
specific words; corpus-based vocabulary development makes it possible to survey the grammatical and usage
features of the vocabulary; and corpus-based techniques can provide language learners with examples for
language analysis.
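The collocation search mentioned above can be sketched as a simple window count: for every occurrence of a target word, tally the words that appear within a few positions of it. This is a minimal illustration under stated assumptions, not a description of any particular corpus tool; the function name `collocates`, the pre-tokenized input and the default window size are hypothetical.

```python
from collections import Counter
from typing import Sequence


def collocates(tokens: Sequence[str], target: str, window: int = 2) -> Counter:
    """Count words occurring within `window` positions of `target`.

    Assumption: `tokens` is an already tokenized corpus; the target
    occurrence itself is not counted as its own neighbour.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts


if __name__ == "__main__":
    corpus = "the cat sat on the mat the cat ran".split()
    print(collocates(corpus, "cat", window=1).most_common())
```

Raw co-occurrence counts favour very frequent words, so practical collocation tools usually rank candidates with an association measure rather than by count alone.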

