The Problem of Normalizing Kazakh Language Words. D. R. Rakhimova a,b, A. O. Turganbaeva a


Practical Relevance. The results can be applied in text analysis, text normalization (lemmatization), information retrieval systems, machine translation of the Kazakh language, and other applied tasks.
Keywords
natural language processing, Kazakh language, ending system, normalization, stemming algorithm
Acknowledgments
The research was supported by the Ministry of Education and Science of the Republic of Kazakhstan under research project AP 05132950.
doi: 10.17586/2226-1494-2020-20-4-545-551
NORMALIZATION OF KAZAKH LANGUAGE WORDS
D.R. Rakhimova a,b, A.O. Turganbaeva a
a Al-Farabi Kazakh National University, Almaty, 050040, Republic of Kazakhstan
b Institute of Information and Computing Technologies, Almaty, 050000, Republic of Kazakhstan
Corresponding author: diana.rakhimova@kaznu.kz
Article info
Received 01.06.20, accepted 25.06.20
Article in Russian
For citation: Rakhimova D.R., Turganbaeva A.O. Normalization of Kazakh language words. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2020, vol. 20, no. 4, pp. 545–551 (in Russian). doi: 10.17586/2226-1494-2020-20-4-545-551

Abstract
Subject of Research. The paper considers models and existing algorithms for normalizing natural language words. It describes algorithms for automatic stem extraction in a number of natural languages and possible ways of synthesizing the normal word form for the Kazakh language. The research aims to create a complete classification of the Kazakh language ending system and to develop a word normalization algorithm based on the proposed classification of endings and suffixes.
Method. Word formation by attaching endings was analyzed for all parts of speech of the Kazakh language, and a classification of endings and suffixes is presented. The paper discusses all possible placement options for endings and suffixes. The total number of distinct suffixes is 26 526 and of endings is 3 565. All considered types are lexically and semantically valid, but some of them are not applicable. Only the most commonly used ones are added to the affix base. The order in which affixes are attached to the stem is described using sets, which allows the stem to be extracted correctly. The study does not examine word-forming suffixes, since they change the word stem and its contextual interpretation; such suffixes are mainly added to nouns.
Main Results. A complete classification system for endings and suffixes of the Kazakh language has been developed. Deterministic finite automata for various parts of speech have been built that cover all possible ways of attaching suffixes and endings, taking into account the morphological and lexical features of Kazakh grammar. A lexicon-free stemming algorithm has been developed on the basis of the proposed classification of Kazakh endings. A normalization system has been implemented, demonstrating that the algorithm works without a dictionary. The implementation was tested on a Kazakh language corpus from which punctuation and stop words had first been removed.
Practical Relevance. The results can be applied in text analysis and normalization (lemmatization), as well as in information retrieval systems, machine translation from the Kazakh language, and other applied problems.
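The idea of lexicon-free stemming mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the ending lists below are a tiny hypothetical sample (a few case and plural endings), the two-layer ordering stands in for the paper's finite automata, and the names `CASE_ENDINGS`, `PLURAL_ENDINGS`, `MIN_STEM_LEN`, and `stem` are assumptions made for illustration only.

```python
# Hypothetical, deliberately tiny ending inventory. The paper's classification
# (thousands of ending and suffix variants) is not reproduced here.
CASE_ENDINGS = ["дан", "ден", "тан", "тен", "нан", "нен",
                "ға", "ге", "қа", "ке",
                "да", "де", "та", "те"]
PLURAL_ENDINGS = ["лар", "лер", "дар", "дер", "тар", "тер"]

# Endings are stripped right to left, so the outermost layer (case) comes first.
LAYERS = [CASE_ENDINGS, PLURAL_ENDINGS]

MIN_STEM_LEN = 2  # assumption: never leave a stem shorter than this


def strip_layer(word, endings):
    """Remove the longest matching ending of one layer, if any."""
    for ending in sorted(endings, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= MIN_STEM_LEN:
            return word[: -len(ending)]
    return word


def stem(word):
    """Strip at most one ending per layer, outermost layer first."""
    word = word.lower()
    for endings in LAYERS:
        word = strip_layer(word, endings)
    return word


if __name__ == "__main__":
    for w in ["кітаптарда", "балаларға", "мектептен", "бала"]:
        print(w, "->", stem(w))
```

With this sample inventory, "кітаптарда" ("in the books") is reduced to "кітап" by removing the locative ending and then the plural ending. Because no dictionary is consulted, such a stripper can also over-stem: "апта" ("week") is cut to "ап" since it happens to end in "та". A full ending classification with ordering constraints, as the abstract describes, is what is meant to limit such errors.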
