Fig. 3 Kazakh Tree Bank
Fig. 4 Kazakh verb phrase identification system
9 Experiment results and analysis
9.1 Dataset
The data set used in this paper is the Xinjiang Daily corpus for the 31 days of January 2008.
The corpus consists of the raw texts and the POS-tagged texts in XML format.
The experiments were carried out on phrase extraction.
Fig. 5 Verb phrase annotated corpus
9.2 Experiment results
The accuracy of the experiments is evaluated using the following standard evaluation
measures:
recall = a/(a+b) × 100%
precision = a/(a+c) × 100%
leakage = b/(a+b) × 100%
error = c/(a+c) × 100%
Note: recall + leakage = 100% and precision + error = 100%, where a is the number of correctly
identified phrases, b is the number of missed phrases, and c is the number of wrongly identified phrases.
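The four measures can be computed directly from the three counts. The following is a minimal sketch, not the evaluation script used in the experiments; the function name phrase_metrics and the counts in the usage example are hypothetical and chosen only for illustration.

```python
def phrase_metrics(a, b, c):
    """Compute the four evaluation measures as percentages.

    a: number of correctly identified phrases
    b: number of missed phrases
    c: number of wrongly identified phrases
    """
    if a + b == 0 or a + c == 0:
        raise ValueError("a+b and a+c must both be non-zero")
    return {
        "recall": a / (a + b) * 100,
        "precision": a / (a + c) * 100,
        "leakage": b / (a + b) * 100,
        "error": c / (a + c) * 100,
    }


# Hypothetical counts, for illustration only (not taken from the paper).
metrics = phrase_metrics(a=80, b=20, c=20)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}%")
```

By construction, recall and leakage share the denominator a+b, and precision and error share the denominator a+c, so each pair sums to 100%.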
The corpus contains 3000 correctly tagged sentences, used as training data and for the closed test,
and another 1000 sentences used for the open test.
Table 3. Phrase identification test

Method  Test type    Precision (%)  Recall (%)  Error (%)  Leakage (%)
Rule    Closed test  81.58          72.51       18.42      27.49
Rule    Open test    78.22          70.01       21.78      29.99
ME      Closed test  91.62          87.33       8.81       15.67
ME      Open test    87.89          83.13       12.11      16.87
10 Conclusion
This paper identified Kazakh phrases using rules and the maximum entropy method. Feature
templates for the maximum entropy model were designed from contextual information about Kazakh
words, parts of speech, and affixes. The GIS algorithm was used to estimate the parameters of the
feature set, and the optimal phrase recognition result was then output. With the statistical method,
we obtain higher accuracy in the closed test, but the results in the open test are less satisfactory;
improving them requires training on larger corpora.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (NSFC) under Grant
No. 61063025.