ISSN 2518-198X. Index 74623. Bulletin of the Karaganda University

A corpus-based frequency statistic of Kazakh language

Kazakh is an agglutinative language belonging to the Turkic language group. In China, where it is
written in the Arabic script, Kazakh is a low-resource language, and natural language processing for it
still faces many serious challenges. This paper standardizes the encoding, processing, and storage
scheme of a Kazakh corpus and then constructs the Kazakh Language Corpus (KzLC), which lays the
foundation for further research in Kazakh language processing, such as syntactic analysis. Addressing
the question of word frequency in Kazakh, the paper focuses on how Kazakh word frequencies relate to
Zipf's law, a power law, on the basis of word frequency statistics. Using frequency statistics of
Kazakh words drawn from Kazakh textbooks, the research proposes a corpus-based method of word
information analysis and statistics, which reveals regularities and phenomena in Kazakh word information.
Keywords: Kazakh language, statistics, corpus linguistics, word frequency, information retrieval,
morphological analysis.
Introduction
Natural language processing has become one of the significant technologies in the development of
information technology across countries and nations. Word frequency statistics are an important task
in Kazakh natural language processing research. Corpus-based approaches to natural language processing
have become very popular, and corpus-based morphological analyzers have been developed for different
languages.
Kazakh belongs to the Turkic language group of the Altaic language family. It is an agglutinative
language whose word structures are formed by adding derivational or inflectional affixes to root words.
Kazakh is written in three different scripts around the world: the Cyrillic script is used in
Kazakhstan, the Arabic script is used in the P. R. China, and the Latin script is widely used in other
countries.
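Because the same language appears in three scripts, a corpus pipeline usually needs to know which script a given text uses before further processing. The following is a minimal sketch in Python, not part of the paper's method: it assumes that the leading word of each character's Unicode name (CYRILLIC, ARABIC, LATIN) is a good enough indicator of script, and the function name dominant_script is hypothetical.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script among the letters of `text`.

    Assumption: the Unicode character name starts with the script name,
    which holds for the basic Cyrillic, Arabic, and Latin letters.
    Returns None when the text contains no letters from these scripts.
    """
    scripts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, and punctuation
        name = unicodedata.name(ch, "")
        for script in ("CYRILLIC", "ARABIC", "LATIN"):
            if name.startswith(script):
                scripts[script] += 1
                break
    return scripts.most_common(1)[0][0] if scripts else None


if __name__ == "__main__":
    # The word "Kazakh" written in each of the three scripts.
    for sample in ("қазақ", "قازاق", "qazaq"):
        print(sample, dominant_script(sample))
```

Counting letters rather than checking only the first character keeps the result stable when a text contains a few foreign words or embedded Latin abbreviations.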
Previous work on Kazakh language processing includes the following. Altenbek designed a Kazakh corpus
for the Arabic script [1]. Makhambetov et al. described the compilation process of a Kazakh corpus in
the Cyrillic script (Makhambetov et al. 2013). Washington et al. built finite-state morphological
transducers for Kazakh in the Cyrillic script [2]. Doszhan discussed the problem of creating an
all-Turkic national corpus [3]. To our knowledge, there has been no research on frequency statistics by
morphological feature for Kazakh in the Arabic script. This paper is the first to study corpus-based
word frequency statistics by morphological feature for Arabic-script Kazakh.
This paper focuses on corpus-based frequency statistics of Kazakh words at the level of morphological
features, which is among the most difficult aspects of statistical natural language processing for
Kazakh written in the Arabic script in the P. R. China. Our research project presents a corpus-based
approach to processing word frequencies and then describes the design considerations and construction
of the Kazakh corpus. The results describe the frequency relations of Kazakh words in a corpus drawn
from mainstream newspaper media and Kazakh school textbooks, and the resulting Kazakh word frequency
distribution accords with Zipf's law. Kazakh differs from English and other languages that have been
studied through corpora, and frequency statistics of words are an important part of Kazakh language
processing. The results of this research not only create a corpus resource for further information
processing of Kazakh, but also provide corpus data for Kazakh linguistics research. This exploration is
a basic task for machine translation, speech recognition, information retrieval, and many other
applications for the Kazakh language.
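The agreement between a word frequency distribution and Zipf's law can be checked with a short script. The following is a minimal sketch in Python, not the authors' procedure: it assumes Zipf's law in the common form f(r) ≈ C / r^s, where f(r) is the frequency of the word at rank r, tokenizes with a simple \w+ pattern, and estimates s by least squares on the log-log rank-frequency data. The function names word_frequencies and zipf_exponent are hypothetical.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Count word tokens. Splitting on \\w+ is a simplifying assumption."""
    return Counter(re.findall(r"\w+", text.lower()))


def zipf_exponent(counts):
    """Estimate s in f(r) ~ C / r**s by least squares on log-log data."""
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is negative for a decaying distribution; s is its negation.
    return -sxy / sxx


if __name__ == "__main__":
    # Synthetic tokens whose counts follow 120 / rank exactly (illustrative data only).
    sample = " ".join(f"w{r} " * (120 // r) for r in range(1, 7))
    print(zipf_exponent(word_frequencies(sample)))
```

An estimate of s close to 1 is what is usually described as agreement with Zipf's law. A real test would use the full corpus rather than this toy input, and for an agglutinative language the result may depend on whether frequencies are counted over surface word forms or over roots.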
Related work
Since the first corpus was established at Brown University in 1963 to 1964, corpus linguistics has
become an important task in the field of natural language processing. The purpose of the American Brown
Corpus was to study modern American English; it systematically collected 1 million words of English
text and applied rule-based automatic POS tagging with 86 kinds of grammatical markers. The LOB corpus
of modern British English, built in the 1970s, also collected 1 million words and used 133 kinds of
grammatical markers; its POS tagging used CLAWS (Constituent Likelihood Automatic Word-tagging System),
which performs automatic part-of-speech tagging from statistical information. In past years, many
researchers have constructed corpora for their own languages, such as the Korean National Corpus, the
Turkish National Corpus [4], the Russian National Corpus, and the Chinese Peking University Corpus.

Series "Philology". No. 2(86)/2017
Corpus-based morphological analysis not only draws on large-scale real language data but also supports
qualitative explanation of language-specific phenomena. Corpus analysis provides a new research
platform based on language use: a language can be analyzed through the word-frequency and syntactic
characteristics it exhibits. Dictionaries can be compiled on the basis of a corpus, and collocations of
specific words can be searched; corpus-based vocabulary development makes it possible to survey the
grammatical features and usage of vocabulary; and corpus-based techniques can provide language learners
with analyzed examples of language.
