Scientific and Technical Journal of Information Technologies, Mechanics and Optics,
2020, vol. 20, no. 4
THE PROBLEM OF WORD NORMALIZATION FOR THE KAZAKH LANGUAGE
Abstract
Subject of Research. The paper considers models and existing algorithms for normalizing natural language words.
It describes algorithms for automatic stem extraction for a number of natural languages and
possible ways of synthesizing the normal word form for the Kazakh language. The research aims to create a
complete classification of the Kazakh ending system and to develop a word normalization algorithm
based on the proposed classification of endings and suffixes.
Method. Word formation by attaching endings was analyzed for all Kazakh parts of speech, and a classification of
endings and suffixes is presented. The paper discusses all possible placements of endings and suffixes. In total
there are 26 526 distinct suffixes and 3 565 distinct endings. All of the considered types are lexically and
semantically valid, but some of them are not used in practice. Only the most commonly used ones are added to the
affix base. The order in which affixes attach to the stem is represented using sets, so that the stem is identified
correctly. The study does not examine word-forming suffixes, since they change the word stem and its contextual
interpretation. Word-forming suffixes are mainly added to nouns.
Main Results. A complete classification system for the endings and suffixes of the Kazakh language has
been developed. Deterministic finite automata were built for the various parts of speech, covering all possible ways
of adding suffixes and endings and taking into account the morphological and lexical features of Kazakh grammar.
A lexicon-free stemming algorithm was developed using the proposed classification of Kazakh endings.
A normalization system was implemented, demonstrating that the developed algorithm works without
a dictionary. The implementation was tested on a Kazakh language corpus from which punctuation and stop words
had first been removed.
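
The idea of lexicon-free stemming driven by ordered affix classes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three noun slots (plural, possessive, case), the handful of affixes in each set, the names `SLOTS`, `strip_slot` and `stem`, and the `MIN_STEM_LEN` guard are all assumptions chosen for illustration, and the real affix base described above is far larger.

```python
# Illustrative affix slots for Kazakh nouns, listed in the order they attach
# to the stem: plural, then possessive, then case. These small sets are
# assumptions for demonstration only, not the paper's affix base.
PLURAL = {"лар", "лер", "дар", "дер", "тар", "тер"}
POSSESSIVE = {"ым", "ім", "ың", "ің"}
CASE = {
    "ның", "нің", "дың", "дің", "тың", "тің",  # genitive
    "ға", "ге", "қа", "ке",                    # dative
    "да", "де", "та", "те",                    # locative
    "дан", "ден", "тан", "тен", "нан", "нен",  # ablative
}

SLOTS = [PLURAL, POSSESSIVE, CASE]  # attachment order, from the stem outward
MIN_STEM_LEN = 2  # hypothetical guard against stripping too much


def strip_slot(word, affixes):
    """Remove the longest affix from `affixes` that ends `word`, if any."""
    for affix in sorted(affixes, key=len, reverse=True):
        if word.endswith(affix) and len(word) - len(affix) >= MIN_STEM_LEN:
            return word[:-len(affix)]
    return word


def stem(word):
    """Strip endings from right to left, visiting each slot at most once."""
    word = word.lower()
    for affixes in reversed(SLOTS):
        word = strip_slot(word, affixes)
    return word


print(stem("кітаптарымда"))  # кітап  (кітап + тар + ым + да)
print(stem("балалар"))       # бала   (бала + лар)
print(stem("мектепке"))      # мектеп (мектеп + ке)
```

Walking the slots in reverse attachment order plays the role of a very small automaton: each slot is either matched once or skipped, so an ending cannot be stripped out of order. Because no dictionary is consulted, a stem that happens to end in an affix-like string can still be over-stripped, which is why the sketch keeps a minimum stem length.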
Practical Relevance. The results can be applied in
text analysis and normalization (lemmatization), in information retrieval systems, in machine translation
from the Kazakh language, and in other applied problems.