ON THE MOSES PROGRAM USED IN MACHINE TRANSLATION SYSTEMS
Sundetova A., M.L. Forcada, A. Shormakova, A. Aitkulova, Al-Farabi Kazakh National University, Almaty, Kazakhstan STRUCTURAL TRANSFER RULES FOR ENGLISH-TO-KAZAKH MACHINE TRANSLATION IN THE FREE/OPEN-SOURCE PLATFORM APERTIUM
Kamanur U., Andasova B.Z., Baigusheva B.M., L.N. Gumilyov Eurasian National University, Astana DEVELOPMENT OF A KAZAKH-ENGLISH-CHINESE AUDIO DICTIONARY
INTELLIGENT SYSTEMS AND TECHNOLOGIES FOR LEARNING TURKIC LANGUAGES
Omarbekova A.S., Sharipbay A.A., L.N. Gumilyov Eurasian National University, Research Institute "Artificial Intelligence", Astana, Kazakhstan TECHNOLOGY FOR CREATING ELECTRONIC EDUCATIONAL PUBLICATIONS IN THE LATIN SCRIPT
Alseitova A.T., Niyazova R.S., L.N. Gumilyov Eurasian National University, Astana DEVELOPMENT OF A MULTIMEDIA TEACHING TOOL FOR AUTOMATA THEORY
A.A. SHARIPBAY
Scientific-Research Institute "Artificial Intelligence", Astana, Kazakhstan
PROBLEMS AND PROSPECTS OF COMPUTER PROCESSING OF THE KAZAKH LANGUAGE
1. The current state of the Kazakh language.
The Kazakh language is the national language of the Kazakh people, who are the indigenous population of the Republic of Kazakhstan and also reside in many other countries around the world. Like any national language that serves as the main means of communication for its native speakers, it should
The Kazakh language is the state language of the Republic of Kazakhstan, and demand for it has increased both in the country's internal affairs and in international relations. Like any state language, it should be used in all areas of intellectual activity in our society and should receive support from the state.
However, the actual state of the Kazakh language is different. First, Kazakhs living in different countries cannot communicate with one another in writing, because they use different alphabets and spelling rules. Second, Kazakhs living in their home country cannot make full use of the Kazakh script because of the wide variation in terminology and spelling in printed matter. Third, the current Cyrillic-based Kazakh alphabet is not always supported by modern computers and electronic means of communication, and it continues to produce linguistic inconsistencies.
The main reason for this state of affairs is the lack of basic standards for phonetics, spelling, grammar rules, terminology, etc. Textbooks on the Kazakh language contain many contradictory statements about the linguistic foundations of the language.
To illustrate how this situation arose, we give a short history of Kazakh writing.
The origin of Kazakh writing can be assumed to be the ancient Turkic script preserved in the runic monuments, which was first deciphered in 1893 by the Danish scientist Thomsen (more can be learned about it in the museum of writing at the L.N. Gumilyov ENU).
A new method of Arabic-based Turkic writing, called usul jadid, was invented by the prominent Tatar enlightener Ismail Gaspirali (1851-1914). Using this method, Akhmet Baytursynuly converted the Kazakh writing system to the Arabic alphabet in 1912. He refined the Kazakh phonetic system and developed a new Arabic-based Kazakh alphabet consisting of 28 letters. This alphabet was used in our country until 1929, and Kazakhs living abroad (particularly in China) still use it.
Soviet rule changed the Kazakh alphabet twice: first to a 29-letter Latin-based alphabet in 1929, then to a 42-letter Cyrillic-based alphabet in 1940. The latter reform preserved all 33 letters of the Russian alphabet and added 9 letters for specifically Kazakh sounds.
The latter reform is absurd from the point of view of progressive linguistics: it was imposed by force without regard for the features of the Kazakh language, and it destroyed the language's internal unity and grammatical patterns. This reform not only prevented the language from developing, but still damages it. One need not be an expert in linguistics to verify this, because the reform requires Kazakh text to preserve the grammatical rules of the Russian language; that is, Russian words in a Kazakh text must be written, and even read, in accordance with the rules of Russian. As a result, Kazakh orthography has accumulated a significant number of contradictions that affect the culture of writing and interfere with learning to spell, and the Kazakh language is being deformed day by day, losing its distinctive sound and grammatical patterns. For clarity, imagine a proposal to enrich the Russian language by adding English sounds to it while keeping the English rules of writing and reading unchanged. Such a proposal would clearly be complete nonsense. Yet despite the absurdity of this kind of enrichment of the Kazakh language, some believe that pronouncing and spelling Russian words correctly in a Kazakh text is a sign of literacy.
Thus, the forced Cyrillization of Kazakh writing has left the phonetic system and spelling norms of the Kazakh language incorrect and harmful. It is therefore necessary to correct the situation today, so that the Kazakh language does not lose its identity in the future
The failure of experts and public figures in linguistics and language policy to understand these problems is bewildering. Textbooks and scientific books are still being published whose authors cannot distinguish between the concepts of "sound" and "letter", do not know what a "phoneme" is, etc. It is especially surprising that many textbooks and training manuals on the Kazakh language contradict
2. Problems of reforming the Kazakh language and changing the alphabet.
The current state of the Kazakh language not only fails to let it function fully as a state language, but directly prevents it from doing so. The Kazakh language is being deformed day by day, changing its sound and disrupting its natural grammatical rules. It is therefore necessary to reform the Kazakh language in order to eliminate these and other contradictions.
To organize and unify the sound system, a standard phonetics of the Kazakh language must be developed and approved, which would require changing the alphabet, developing spelling norms, revising the grammatical rules, etc.
It is important to note that changing the alphabet should not mean modifying the current Cyrillic-based alphabet. Instead, it is necessary to move directly to the Latin alphabet, so that all the linguistic problems are left behind in Cyrillic and the new rules are not confused with the old ones.
We have now developed a draft national standard for the phonetics of the Kazakh language which, based on the theory of A. Baitursynov, determines the number of sounds of the Kazakh language, classifies them, and defines symbols for the sounds in Kazakh phonetic transcription. The proposed phonetic alphabet of Kazakh sounds will be submitted for examination to the International Phonetic Association. This work is being carried out within the 2012-2014 project “Creation of acoustic corpus of the Kazakh language and revision of its phonetic structure, representation of Kazakh phonemes in the International Phonetic Alphabet”, financed by the MES under the priority “Fundamental and applied research in the field of economic, social and human sciences”.
In this work we have confirmed that the Kazakh language has only 28 sounds: 9 vowels and 19 consonants. Among the vowels, the 5 sounds а, о, ұ, ы, е are phonemes, and the 4 sounds ә, ө, ү, і are their allophones. All 19 consonants б, ғ, г, д, ж, з, й, қ, к, л, м, н, ң, п, р, с, т, у, ш are phonemes. In 1929, before the Latinized alphabet was developed, the number of consonants was increased to 20 by adding the sound хы, denoted by the Latin letter h. This sound, like the sounds в and ф borrowed from the Russian language in 1940, does not violate the phonetic and phonological patterns of the Kazakh language, since these sounds help to pronounce correctly such international terms as: валюта, вакуум, вакцина, вариант, вектор, вексель, вето, викторина, вирус, виртуал, вице, вокал, вольт, хадис, хаки, халат, хаос, химия, хлор, хор, хром, хроника, хрусталь, хунта, факт, факультет, фаза, файл, фауна, федерация, фельетон, физика, филармония, фильм, фонетика, формула, форфор, фосфор, фотон, фракция, функция.
Thus, we can assume that the sound system of the Kazakh language will contain 31 sounds: а, ә, б, в, ғ, г, д, ж, з, е, й, к, қ, л, м, н, ң, о, ө, п, р, с, т, у, ұ, ү, ф, х, ш, ы, і. This confirms the need to change the alphabet of the Kazakh language.
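The count above can be checked mechanically. The following is a minimal sketch that only restates the groups of sounds given in the text; the variable and function names are illustrative assumptions, not part of the proposed standard.

```python
# Sound groups exactly as listed in the text (Cyrillic notation).
VOWEL_PHONEMES = list("аоұые")              # 5 vowel phonemes
VOWEL_ALLOPHONES = list("әөүі")             # 4 allophones of those vowels
CONSONANTS = list("бғгджзйқклмнңпрстуш")    # 19 consonant phonemes
BORROWED = list("хвф")                      # х (added in 1929), в and ф (added in 1940)


def inventory():
    """Return the full sound inventory, checking that no sound is listed twice."""
    sounds = VOWEL_PHONEMES + VOWEL_ALLOPHONES + CONSONANTS + BORROWED
    if len(sounds) != len(set(sounds)):
        raise ValueError("a sound is listed in more than one group")
    return sounds


native = VOWEL_PHONEMES + VOWEL_ALLOPHONES + CONSONANTS
print(len(native))       # 28 native sounds: 9 vowels + 19 consonants
print(len(inventory()))  # 31 sounds once х, в and ф are included
```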
3. The problems of computer processing of the Kazakh language.
At present the development of any natural language is unthinkable without information technologies. Multimedia dictionaries, automated teaching programs, and systems for machine translation, speech recognition and synthesis, text generation, text content interpretation, etc. have already been developed for English, French, Russian, Japanese, Chinese and other languages. This level of development is attainable for the Kazakh language too.
Computer processing of the Kazakh language involves the creation and use of appropriate information resources in the Kazakh language: databases of spelling and terminological dictionaries, Kazakh language training systems, programs that translate from Kazakh into other languages and vice versa, systems that convert Kazakh text from one script to another, speech recognition and synthesis systems, etc. In this regard we are carrying out the project “Automation of recognition and generation of the Kazakh written and spoken languages”, which has also been financed by the MES since 2011 under the priority “Information and telecommunication
It is known that all computers and telecommunications equipment manufactured in the world support, and will continue to support, the English alphabet. Inputting and processing characters of other alphabets requires special code tables and drivers, whose development and installation demand considerable money and intellectual effort.
It is known that the development of information resources is a science-intensive and costly process. It is carried out using modern information technologies that support the English alphabet by default. Today many countries, including Germany and Russia, are discussing the development of information resources in English-based alphabets, which does not require developing fonts, software drivers, or special sorting, search and recognition programs (e.g. for scanners), all of which demand large extra costs and highly qualified personnel. For this reason, in order to considerably facilitate the development of information resources in Kazakh and reduce their cost, it is necessary to convert the Kazakh writing system to the Latin script. We must also retain the same sequence of letters, which will enable us to use the built-in sorting, search and recognition programs included in any system software package, even though such packages change rapidly with computer hardware specifications. Otherwise, continuing to work with the Cyrillic-based Kazakh alphabet creates a secondary and unjustified problem of supporting that alphabet, and without solving it modern information technologies cannot be used at all. Some drivers are incompatible with each other, which forces previously created information resources to be redone and requires additional labor and material costs.
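The sorting problem mentioned above can be illustrated with a minimal sketch. It relies only on the standard 42-letter Cyrillic Kazakh alphabet order and on the fact that the added Kazakh letters have Unicode code points above the basic Russian letters; the names `KAZAKH_ALPHABET` and `kazakh_key` are illustrative assumptions. A plain built-in sort places ә after all basic letters, so a custom collation key is needed to obtain alphabet order.

```python
# Cyrillic Kazakh alphabet in its standard order (42 letters).
KAZAKH_ALPHABET = "аәбвгғдеёжзийкқлмнңоөпрстуұүфхһцчшщъыіьэюя"
RANK = {letter: position for position, letter in enumerate(KAZAKH_ALPHABET)}


def kazakh_key(word):
    """Sort key that ranks each letter by its position in the Kazakh alphabet.

    Characters outside the alphabet are placed after all letters.
    """
    return [RANK.get(ch, len(RANK)) for ch in word.lower()]


words = ["бал", "әке", "ана"]
print(sorted(words))                  # code-point order: ['ана', 'бал', 'әке']
print(sorted(words, key=kazakh_key))  # alphabet order:   ['ана', 'әке', 'бал']
```

With an alphabet whose letters already sort correctly by code point, the first call alone would suffice, which is the practical advantage the paragraph above argues for.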
At present, because such drivers are missing or inconsistent, even transferring a Kazakh text from one computer to another requires retyping and reformatting it. As computers with different configurations and software systems become part of everyday life, the volume of such unnecessary work will grow. If drastic and reasonable measures are not taken now, the problem will move from difficult to solve to unsolvable, and resources will be wasted: first, effort will often be spent in vain on recreating information resources that already exist; second, developing and tuning the necessary drivers demands considerable funds and intellectual effort.
For those who worry that the change of alphabet will leave us unable to read previously published literary, cultural and scientific works, we can say that there is no serious problem. It is only necessary to enter the required text, in whatever script it is written, into computer memory once. Conversion of Kazakh text from one script to another can be automated, and the text can then be printed for reading in any script [www.alphabet.kz]. Accordingly, in 2013 we prepared the project "Scientific, methodological, technological and methodological base for the conversion of the Kazakh writing to the Latin alphabet" and won grant funding from the Ministry of Education and Science of the Republic of Kazakhstan for 2013-2015.
Under my guidance, a great deal of research has been carried out on the computer processing of the Kazakh language. Strong theoretical and practical groundwork has been laid in the following areas:
the number and classification of the sounds of the Kazakh language, their linguistic characteristics and methods of notation;
a mathematical model of the phonetic system of the Kazakh language, which separates sounds into natural classes suitable for automatic processing of
an applied model of an automatic transcriber has been constructed, which converts the alphabetic form of a word into a phonetic transcription.
a mathematical model of the morphological rules of the Kazakh language has been constructed;
methods and algorithms for the analysis and synthesis of words and word forms of the Kazakh language have been developed.
a mathematical model of the syntactic rules of the Kazakh language has been constructed;
methods and algorithms for the analysis and synthesis of phrases and sentences of the Kazakh language have been developed.
methods and algorithms for detecting speech boundaries and segmenting a speech signal within a broad phonetic classification have been developed;
methods and algorithms for recognizing pairs of sounds using two-threshold scalar recognizers, and for recognizing some specific sounds of the Kazakh language, have been developed;
an algorithm for recognizing isolated Kazakh words from a limited dictionary using a dynamic time number has been developed;
for fast data search, a tree of transcriptions has been constructed from a large dictionary;
a concept for recognizing syntactically related sentences has been developed.
algorithms for transcribing Kazakh words and sentences have been developed;
algorithms for synthesizing Kazakh speech from free text have been developed;
an algorithm for training the Kazakh speech synthesis model on a particular speaker has been developed.
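The list above mentions a tree of transcriptions built from a large dictionary for fast search. The source does not describe its structure; one common reading is a prefix tree (trie) keyed by transcription symbols, sketched below under that assumption. The class name, its methods and the Latin placeholder transcriptions are hypothetical and do not reproduce the authors' notation or implementation.

```python
class TranscriptionTrie:
    """Prefix tree mapping phonetic transcriptions to dictionary words."""

    def __init__(self):
        self.children = {}   # next transcription symbol -> child node
        self.words = []      # words whose transcription ends at this node

    def insert(self, transcription, word):
        """Store a word under its transcription, one symbol per tree level."""
        node = self
        for symbol in transcription:
            node = node.children.setdefault(symbol, TranscriptionTrie())
        node.words.append(word)

    def _find(self, symbols):
        """Walk down the tree; return the node reached, or None if absent."""
        node = self
        for symbol in symbols:
            node = node.children.get(symbol)
            if node is None:
                return None
        return node

    def lookup(self, transcription):
        """Return the words stored under exactly this transcription."""
        node = self._find(transcription)
        return list(node.words) if node else []

    def with_prefix(self, prefix):
        """Return all words whose transcription starts with the prefix."""
        start = self._find(prefix)
        if start is None:
            return []
        found, stack = [], [start]
        while stack:
            current = stack.pop()
            found.extend(current.words)
            stack.extend(current.children.values())
        return sorted(found)


trie = TranscriptionTrie()
# Placeholder transcriptions, for illustration only.
for transcription, word in [("bala", "бала"), ("bal", "бал"), ("ana", "ана")]:
    trie.insert(transcription, word)

print(trie.lookup("bal"))        # ['бал']
print(trie.with_prefix("ba"))    # ['бал', 'бала']
print(trie.lookup("qala"))       # []
```

Lookup time in such a tree depends on the length of the transcription rather than on the size of the dictionary, which is what makes it suitable for large vocabularies.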
The methods and algorithms listed above have been implemented as software applications, which demonstrate the fruitfulness of the approaches and mathematical models used. Three participants have defended candidate dissertations in the specialty 05.13.11 – Mathematical and software support of computers, systems and computer networks.
Three participants are working on PhD theses on the topic of the project.
Five participants are working on master's theses on the topic of the project.
Progress in automating the processing of Kazakh writing and the recognition and synthesis of Kazakh speech will bring direct and immediate results in all spheres of intellectual activity in our country, which confirms the prospects for the computerization of the Kazakh language.
4. Prospects for computer processing of the Kazakh language.
To carry out the computerization and development of the Kazakh language correctly, we propose to solve 49 problems, 7 in each of the following 7 directions:
I. Standardization of the Kazakh language:
I.1. Standard phonetics;
I.2. Standard alphabet and letter encoding;
I.3. Standard orthography;
I.4. Standard grammar (morphology and syntax);
I.5. Standards for industry terminology;
I.6. Standards for onomastics and toponymy;
I.7. Standard for measuring knowledge of the Kazakh language.
II. Electronic dictionaries of the Kazakh language:
II.1. Electronic orthographic dictionary;
II.2. Electronic orthoepic dictionary;
II.3. Electronic semantic dictionary of word forms;
II.4. Electronic phraseological dictionary;
II.5. Electronic dictionaries;
II.6. Multilingual electronic dictionaries;
II.7. Electronic dictionary of diphones.
III. Formalization of the Kazakh language:
III.1. Formalization of the phonetic and phonological rules of sounds;
III.2. Formalization of the rules of word formation with endings;
III.3. Formalization of the rules of word formation with suffixes;
III.4. Formalization of the rules of phrase formation;
III.5. Formalization of the rules of formation of simple sentences;
III.6. Formalization of the rules of formation of complex sentences;
III.7. Formalization of the rules for producing transcriptions and diphones.
IV. Automation of the processing of writing:
IV.1. Drivers and converters from any script (encoding) to another script (encoding) and
IV.2. Morphological analyzer;
IV.3. Word form generator;
IV.4. Parser;
IV.5. Sentence generator;
IV.6. Converter of text into a semantic network;
IV.7. Semantic search engine.
V. Speech technologies for the Kazakh language:
V.1. Creation of an acoustic database;
V.2. Creation of an acoustic database;
V.3. Recognition of continuous speech;
V.4. Synthesis of individual words;
V.5. Speech synthesis;
V.6. Hardware implementation of speech recognition;
V.7. Hardware implementation of speech synthesis.
VI. E-learning of the Kazakh language:
VI.1. E-learning and certification of the Kazakh language for the simple level;
VI.2. E-learning and certification of the Kazakh language for the basic level;
VI.3. E-learning and certification of the Kazakh language for the intermediate level;
VI.4. E-learning and certification of the Kazakh language for the good level;
VI.5. E-learning and certification of the Kazakh language for the higher level;
VI.6. E-learning and certification of the Kazakh language for the proficiency level;
VI.7. Educational web portal of the Kazakh language.
VII. Automatic translators:
VII.1. Creation of a Kazakh-Russian-English-Chinese translation and audio dictionary;
VII.2. Creation of a Kazakh-Russian and Russian-Kazakh translator;
VII.3. Creation of a Kazakh-Russian and Russian-Kazakh audio translator;
VII.4. Creation of a Kazakh-English and English-Kazakh translator;
VII.5. Creation of a Kazakh-English and English-Kazakh audio translator;
VII.6. Creation of a Kazakh-Chinese and Chinese-Kazakh translator;
VII.7. Creation of a Kazakh-Chinese and Chinese-Kazakh audio translator.
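Item IV.1 above calls for converters between scripts, and Section 3 notes that such conversion can be automated. A minimal sketch of a table-driven converter follows. The mapping table is hypothetical and deliberately incomplete: it is not the alphabet proposed in this paper or any official standard, and the names `ILLUSTRATIVE_TABLE` and `convert` are assumptions made for the example.

```python
# Hypothetical, incomplete Cyrillic-to-Latin table, for illustration only.
ILLUSTRATIVE_TABLE = {
    "а": "a", "б": "b", "д": "d", "з": "z", "к": "k", "қ": "q",
    "л": "l", "м": "m", "н": "n", "р": "r", "с": "s", "т": "t", "ш": "sh",
}


def convert(text, table=ILLUSTRATIVE_TABLE):
    """Convert text letter by letter using the given table.

    Characters missing from the table (spaces, digits, unmapped letters)
    are passed through unchanged; an upper-case source letter produces a
    capitalized replacement.
    """
    out = []
    for ch in text:
        mapped = table.get(ch.lower())
        if mapped is None:
            out.append(ch)                    # unknown character: keep as-is
        elif ch.isupper():
            out.append(mapped.capitalize())   # preserve the capital letter
        else:
            out.append(mapped)
    return "".join(out)


print(convert("Бала"))     # Bala
print(convert("қазақ"))    # qazaq
print(convert("тас шам"))  # tas sham
```

A real converter would need a complete, standardized table and rules for context-dependent letters; the point of the sketch is only that, once the table is fixed, conversion of existing texts is a mechanical step.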
Introduction
The demand for a language as a cognitive and communicative tool, as well as its development and, accordingly, its continued preservation as a cultural phenomenon, evidently depend to a large extent on how actively the language functions in computer
We develop this thesis using the example of introducing the Tatar language into so-called cyberspace (that is, the space of interaction between humans and computer systems and technologies). As will be shown below, ensuring that the Tatar language functions in computer systems and technologies is important not only for increasing its activity and competitiveness among other languages as a means of accumulating information and communicating with the computer, but also for creating new technologies for storing and processing information on the basis of the Tatar language, owing to a number of cognitive features of its structure and
It is also evident that, to ensure that Tatar and Russian function equally as state languages in the Republic of Tatarstan, the Tatar language must become a working language of computers just as Russian is. Accordingly, alongside the task of using the Tatar language in infocommunication technologies and creating special programs for processing Tatar, there is also the task of Tatar localization of their interface shell, i.e. the means by which the computer communicates with a human.
Research and development on introducing the Tatar language into computer technologies in the Republic of Tatarstan began practically at the end of the 1980s, with the development of the first peripheral device drivers, a text editor and a Tatar spell checker, which were needed for computer-based publishing of Tatar books, newspapers and magazines and for record keeping. In 1993, the Joint Research Laboratory "Problems of Artificial Intelligence" of the Academy of Sciences of the RT and Kazan State University was established to address tasks within the applied research program of the Academy of Sciences of the RT "Computer support for the functioning of the Tatar language as a state language" and to develop computer support tools for Tatar as a state language within the State Program of the RT for the preservation, study and development of the languages of the peoples of the Republic of Tatarstan.
Fundamental research and applied development in support of the Tatar language in information technologies have from the outset proceeded in three main directions:
1) introducing the Tatar language into information technologies ("Tatar language in IT"),
2) developing and adapting information technologies for the Tatar language ("IT for
3) using the cognitive capabilities of the Tatar language to create new information technologies ("Tatar language for IT").