ISSN 2518-198X. Index 74623. Bulletin of the Karaganda University

Corpus Design Considerations
A corpus is a collection of language texts stored in an electronic database. It is the basic resource for natural language processing with statistical language models. The corpus plays an important role in knowledge acquisition, and corpus construction depends chiefly on corpus size and corpus selection.
The normative principles of constructing the Kazakh Language Corpus.
Since 2009, we have been constructing a word-level corpus of the Kazakh language, which has been continuously improved, modified and expanded over the past years. It is particularly important to establish standards at the initial construction of the corpus, because they affect many later research results. We therefore examine the following principles.
(i) The objectives and collection criteria for source material: Quality is the lifeline of the corpus. The criteria emphasize universality and normativity, so that the resulting corpus, as a representative corpus, can summarize all or part of the specified characteristics of the Kazakh language. This step therefore identifies the principal aspects of corpus creation and the main decisions to be made.
(ii) Corpus size: The reliability of statistical data depends on the size of the corpus. Resources for the Kazakh language are scarce: the fewer the sources and the larger the scale, the greater the difficulty and cost of construction. Because Kazakh is a low-resource language, we select the corpus from mainstream newspaper media and Kazakh school textbooks.
(iii) The processing and depth principles: Processing of the materials includes text format processing and text description. In accordance with the characteristics of Kazakh, and to provide users with deeply processed data for Kazakh linguistics, the corpus processing specification includes letter frequency statistics, word frequency statistics, word segmentation, the additional components (affixes) used in word formation, stem or POS tagging, etc.
(iv) The encoding, description and storage format: Because the sources, input methods and formats of the Kazakh language material are inconsistent, and to facilitate unified management and sharing of resources, we adopt XML with Unicode encoding as the corpus storage format. The corpus description information implements the complete word tagging specification, which benefits the understanding, application and sharing of language resources. The tagged corpus is stored separately in XML files and in TXT form.
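The letter and word frequency statistics named in principle (iii) can be illustrated with a minimal Python sketch. It is not the authors' tooling: the function names, the sample sentence, and the tokenization rule (a word is a maximal run of letters) are assumptions made only for illustration, and a real Kazakh pipeline would need proper word segmentation.

```python
from collections import Counter
import re


def letter_frequencies(text):
    """Count letters, case-folded; digits, spaces and punctuation are skipped."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def word_frequencies(text):
    """Count word tokens. Assumption: a token is a maximal run of letters."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)


# Made-up sample text, used only to exercise the two functions.
sample = "бала кітап оқыды. бала мектепке барды."

letters = letter_frequencies(sample)
words = word_frequencies(sample)

print(words.most_common(3))
print(letters.most_common(3))
```

Both functions operate on Unicode strings, so the same code would apply to text in any script once it has been decoded.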
Text documents description. In the initial version of our corpus, the Kazakh Language Corpus (KzLC), we used a whole year of the Xinjiang Daily (Kazakh version), starting from January 1, 2008. The corpus consists of raw texts and POS-tagged texts in XML format. The XML annotation content and text structure information are as follows:
Title, subtitle
<subtitle></subtitle>
Paragraph
Sentence
Word
Word annotation specification in the corpus: each word is given properties such as part of speech (POS), stem, affix, unknown word, etc. In this experiment, the corpus storage format is as follows: Unicode as the character encoding and XML as the storage format, alongside pure text or database form.
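A minimal sketch of such a storage format, using Python's standard `xml.etree.ElementTree`, might look as follows. Apart from `subtitle`, the element and attribute names (`document`, `title`, `paragraph`, `sentence`, `word`, `pos`, `stem`, `affix`) are hypothetical, since the paper's actual schema is not reproduced here, and the two annotated words are illustrative examples rather than corpus data.

```python
import xml.etree.ElementTree as ET


def build_document(title, subtitle, paragraphs):
    """Build a title / subtitle / paragraph / sentence / word tree.

    `paragraphs` is a list of paragraphs; each paragraph is a list of
    sentences; each sentence is a list of (form, pos, stem, affix) tuples.
    All element and attribute names except `subtitle` are assumptions.
    """
    doc = ET.Element("document")
    ET.SubElement(doc, "title").text = title
    ET.SubElement(doc, "subtitle").text = subtitle
    for sentences in paragraphs:
        p = ET.SubElement(doc, "paragraph")
        for sentence in sentences:
            s = ET.SubElement(p, "sentence")
            for form, pos, stem, affix in sentence:
                w = ET.SubElement(s, "word", pos=pos, stem=stem, affix=affix)
                w.text = form
    return doc


# Illustrative words only: surface form, POS tag, stem, affix.
sentence = [("мектепке", "n", "мектеп", "ке"), ("барды", "v", "бар", "ды")]
doc = build_document("title text", "subtitle text", [[sentence]])

# Serialize as UTF-8 bytes, then parse back to check the round trip.
xml_bytes = ET.tostring(doc, encoding="utf-8")
parsed = ET.fromstring(xml_bytes)
tagged = [(w.text, w.get("pos"), w.get("stem")) for w in parsed.iter("word")]
print(tagged)
```

Keeping the annotations as attributes on each word element leaves the running text recoverable by concatenating element text, which is one way to derive the plain TXT form from the same source.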
The Kazakh word corpus POS annotation: The Kazakh word-tagged corpus labels the language material with various kinds of word-level information. To address POS tagging in Kazakh lexical analysis, this study proposes a Kazakh word tagging standard that includes POS tagging, stem tagging, affix tagging, and multi-class word POS tagging as attributes. With this annotation information, the language material can be analysed in more depth and more knowledge about the Kazakh language can be obtained, which lays a foundation for information processing as well as for Kazakh language construction.

G. Altenbek, X.L. Wang
Part-of-speech (POS) tagging is the process of assigning a part-of-speech label to each word in a sequence of words. There are many different tag sets for the parts of speech of a language; ours includes n, num, int, v, adj, adv, pron, part, ono and q. Table 1 presents our POS tag set.
Table 1
