The Problem of Normalizing Kazakh Language Words. D. R. Rakhimova a,b, A. O. Turganbaeva a



Practical significance. The results can be applied to text analysis, text normalization (lemmatization), 
information retrieval systems, machine translation of Kazakh, and other applied tasks.
Keywords
natural language processing, Kazakh language, ending system, normalization, stemming algorithm
Acknowledgments
This research was supported by the Ministry of Education and Science of the Republic of Kazakhstan under 
research project AP 05132950.
doi: 10.17586/2226-1494-2020-20-4-545-551
NORMALIZATION OF KAZAKH LANGUAGE WORDS
D.R. Rakhimova a,b, A.O. Turganbaeva a
a Al-Farabi Kazakh National University, Almaty, 050040, Republic of Kazakhstan
b Institute of Information and Computing Technologies, Almaty, 050000, Republic of Kazakhstan
Corresponding author: diana.rakhimova@kaznu.kz 
Article info
Received 01.06.20, accepted 25.06.20
Article in Russian
For citation: Rakhimova D.R., Turganbaeva A.O. Normalization of Kazakh language words. Scientific and Technical 
Journal of Information Technologies, Mechanics and Optics, 2020, vol. 20, no. 4, pp. 545–551 (in Russian). doi: 
10.17586/2226-1494-2020-20-4-545-551


Abstract
Subject of Research. The paper reviews existing models and algorithms for normalizing natural language words. 
It describes algorithms for automatic stem extraction in a number of natural languages and possible ways of 
synthesizing the normal word form for Kazakh. The aim is to build a complete classification of the Kazakh 
ending system and to develop a word normalization algorithm based on the proposed classification of endings 
and suffixes.
Method. Word formation by attaching endings was analyzed for all Kazakh parts of speech, and a classification 
of endings and suffixes is presented. The paper considers all possible placements of endings and suffixes: 
26 526 suffix variants and 3 565 ending variants in total. All of these variants are lexically and 
semantically valid, but some are not used in practice, so only the most commonly used ones are added to the 
affix base. The order in which affixes attach to the stem is described using sets, which allows the stem to 
be extracted correctly. Word-forming suffixes, which attach mainly to nouns, are not examined, because they 
change the word stem and its contextual interpretation.
Main Results. A complete classification system for Kazakh endings and suffixes has been developed. 
Deterministic finite automata were built for the various parts of speech; they cover all possible ways of 
attaching suffixes and endings and take into account the morphological and lexical features of Kazakh 
grammar. A lexicon-free stemming algorithm was developed on the basis of the proposed classification of 
endings. A normalization system was implemented, demonstrating that the algorithm works without a 
dictionary. The implementation was tested on a Kazakh language corpus from which punctuation and stop words 
had first been removed.
Practical Relevance. The results can be applied in text analysis and normalization (lemmatization), in 
information retrieval systems, in machine translation from Kazakh, and in other applied problems.
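To make the idea of lexicon-free stemming concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a tiny, hand-picked subset of Kazakh noun endings (case and plural only), whereas the affix base described in the abstract is far larger and is driven by deterministic finite automata. The names `CASE_ENDINGS`, `PLURAL_ENDINGS`, `MIN_STEM`, `strip_one`, and `stem` are hypothetical. Endings are removed from right to left in a fixed order (case, then plural), which mimics in a very reduced way the ordered attachment of affixes.

```python
# Illustrative subset of Kazakh noun endings (an assumption for this sketch,
# not the classification from the paper).
CASE_ENDINGS = [
    "ның", "нің", "дың", "дің", "тың", "тің",   # genitive
    "дан", "ден", "тан", "тен", "нан", "нен",   # ablative
    "мен", "бен", "пен",                        # instrumental
    "ға", "ге", "қа", "ке",                     # dative
    "ны", "ні", "ды", "ді", "ты", "ті",         # accusative
    "да", "де", "та", "те",                     # locative
]
PLURAL_ENDINGS = ["лар", "лер", "дар", "дер", "тар", "тер"]

MIN_STEM = 2  # never strip an ending if fewer than 2 letters would remain


def strip_one(word: str, endings: list[str]) -> str:
    """Remove at most one ending from the list, preferring the longest match."""
    for ending in sorted(endings, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= MIN_STEM:
            return word[: -len(ending)]
    return word


def stem(word: str) -> str:
    """Strip a case ending, then a plural ending, without any dictionary."""
    word = word.lower()
    for layer in (CASE_ENDINGS, PLURAL_ENDINGS):
        word = strip_one(word, layer)
    return word


if __name__ == "__main__":
    for w in ["кітаптарда", "балалардың", "мектептерге", "қала"]:
        print(w, "->", stem(w))
```

Because no dictionary is consulted, a sketch like this can over-strip words whose stem happens to end in a string that looks like an ending; handling possessive and personal endings, vowel harmony, and such collisions is exactly what a fuller classification of endings would be needed for.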

