A corpus-based frequency statistic…
Серия «Филология». № 2(86)/2017
Following the corpus evaluation principles of being standard, typical, structured and balanced, our work combines human-computer interaction with statistical methods to construct the annotated corpus. The construction tasks include building dictionary resources and annotating words and phrases at the syntactic level in the Kazakh language corpus.
The analysis of word statistics information based on corpus
Corpus-based analysis of word statistics covers many aspects: statistical analysis of natural language information, analysis of lexical distribution, keyword-based analysis in various contexts, and corpus-based vocabulary research, including frequency and collocation research, dictionary compilation, and the study of new and popular words. This paper focuses on word frequency, on Kazakh word length at different stages, and on the consistency of Kazakh word frequency with Zipf's law.
To carry out a corpus-based statistical analysis of Kazakh word information, we introduce the following terms. Frequency analysis describes the distribution of data by means of frequency distribution tables and charts. The frequency of a word is the ratio of its number of occurrences to the total number of word occurrences in the corpus, expressed as a percentage. The formula is as follows:
$$F_i = \frac{n_i}{N} \times 100\%.$$
Here $n_i$ is the number of occurrences of word $i$; $N$ is the total number of word occurrences in the corpus; $F$ is frequency; $F_i$ is the frequency of word $i$.
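The frequency formula can be illustrated with a minimal Python sketch. The function name `relative_frequencies` and the toy token list are assumptions made for illustration only; they are not taken from the paper's tooling or data, and real corpus text would first need to be split into word tokens.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Return {word: F_i}, where F_i = n_i / N * 100 (a percentage)."""
    counts = Counter(tokens)        # n_i for every distinct word i
    total = sum(counts.values())    # N, the total number of word occurrences
    if total == 0:
        return {}
    return {word: n / total * 100 for word, n in counts.items()}


# Placeholder tokens, not corpus data.
tokens = ["a", "b", "a", "c", "a", "b", "d", "a"]
freqs = relative_frequencies(tokens)
for word, f in sorted(freqs.items(), key=lambda item: -item[1]):
    print(f"{word}: {f:.2f}%")
```

Because every $F_i$ shares the same denominator $N$, the percentages over all distinct words sum to 100.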
Data of Corpus. Our experiments use two corpora. The first is the electronic text of the 2008 Kazakh edition of «Xinjiang Daily»; the second, used for the analysis of word frequency, consists of Kazakh textbooks. Kazakh has a complex morphological structure, similar to that of typical agglutinative languages: words can be formed by long concatenations of morphemes that follow a certain order or carry certain semantic features. Statistical analysis of the letters in Kazakh words: first, we use the first experimental data for a statistical analysis of the initial letters of Kazakh words.
Figure 1. Initial-letter statistics of Kazakh words
Figure 1 shows that words beginning with the letter «ا» account for 9.71% of our total vocabulary, followed by «ق» and «ب» with 9.09% and 8.66%. The letter «ھ» begins the fewest words, only 0.019% of the total vocabulary, and occurs only in interjections.
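The initial-letter percentages reported above could be computed along the following lines. This is a sketch under stated assumptions: "total vocabulary" is taken to mean the set of distinct words, the function name `initial_letter_shares` is hypothetical, and the transliterated sample words are placeholders, not data from the corpus.

```python
from collections import Counter


def initial_letter_shares(vocabulary):
    """Percentage of distinct words that begin with each letter."""
    words = {w for w in vocabulary if w}     # distinct, non-empty words (assumption)
    total = len(words)
    if total == 0:
        return {}
    firsts = Counter(w[0] for w in words)    # first character of each distinct word
    return {letter: n / total * 100 for letter, n in firsts.most_common()}


# Placeholder vocabulary in Latin transliteration, not corpus data.
vocabulary = ["ana", "alma", "bala", "qala", "ana"]
shares = initial_letter_shares(vocabulary)
for letter, share in shares.items():
    print(f"{letter}: {share:.2f}%")
```

Note that `w[0]` takes a single Unicode code point. Some letters listed in the figure appear to be written with more than one character, so a real implementation for Arabic-script Kazakh would probably need a letter-level tokenizer first.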
Next, we use the first experimental data for a statistical analysis of all the letters in Kazakh words. According to Figure 2, three letters appear very frequently in the text, whereas three others, among them «ھ», appear only rarely. This shows how unevenly the Kazakh letters are used.
Figure 2. All letter statistics of Kazakh words
[Figure 1 chart. Horizontal axis: the 33 basic letters of Kazakh (Arabic script). Vertical axis: percentage of words beginning with each letter (%), scaled from 0 to 12.]
[Figure 2 chart. Horizontal axis: Kazakh letters (Arabic script). Vertical axis: number of occurrences of each letter, scaled from 0 to 300,000.]
G. Altenbek, X.L. Wang
Вестник Карагандинского университета
Finally, for further explanation, we compare our letter statistics for Kazakh written in the Arabic script with the Cyrillic Kazakh letter frequency statistics reported by the Kazakhstani scholar Makhambetov (Makhambetov et al., 2013), which are shown in Figure 3. The four letters that rank first in our data, «а, е, ы, н», also occupy the first four places in the Kazakh of Kazakhstan, and the letters «ə, һ» likewise rank low there. These results suggest that the frequencies of the letters for native Kazakh sounds are basically consistent all over the world.
Figure 3. Kazakh Cyrillic letter statistics (Kazakhstan)
Kazakh word frequency complies with Zipf's power law. Zipf's law, named after the American linguist George Kingsley Zipf, states that, given a corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table (Table 3). The formula of Zipf's law is as follows:
$$f_r \cdot r = c$$
or
$$p_r = c\,r^{-\gamma},$$
here $f$ is frequency; $r$ is rank; $c$ is a constant; $\gamma$ is a parameter.
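One common way to check a word list against Zipf's law is to fit a straight line to log-frequency against log-rank. The sketch below does this with ordinary least squares; the fitting method, the function names `rank_frequency` and `fit_zipf`, and the synthetic tokens are all assumptions for illustration, since the paper does not state how its parameters were estimated.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return [(rank, frequency)], with rank 1 for the most frequent word."""
    ordered = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(ordered, start=1))


def fit_zipf(rank_freq):
    """Least-squares fit of log f = log c - gamma * log r; returns (c, gamma).

    Requires at least two distinct ranks.
    """
    xs = [math.log(r) for r, _ in rank_freq]
    ys = [math.log(f) for _, f in rank_freq]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic tokens built so that f_r = 120 / r exactly (not corpus data).
tokens = []
for r, f in [(1, 120), (2, 60), (3, 40), (4, 30), (5, 24)]:
    tokens += [f"w{r}"] * f

c, gamma = fit_zipf(rank_frequency(tokens))
print(f"c = {c:.2f}, gamma = {gamma:.2f}")
```

On real text the points rarely lie on a perfect line, so the fitted $\gamma$ is only a summary of the overall trend; a value near 1 corresponds to the first form, $f_r \cdot r = c$.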
We use the first experimental data to analyse the relationship between the frequency and the length of Kazakh words. The length of a word is measured as the number of letters in the word.
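A grouping of words by length, as described above, might look like the following sketch. The function name `length_statistics` and the sample tokens are hypothetical, and word length is approximated here by `len(word)`, that is, the number of Unicode characters, which is an assumption that may not match the letter count for every script.

```python
from collections import Counter, defaultdict


def length_statistics(tokens):
    """Map word length -> (number of distinct words, total occurrences)."""
    counts = Counter(tokens)
    types_by_len = defaultdict(int)
    tokens_by_len = defaultdict(int)
    for word, n in counts.items():
        types_by_len[len(word)] += 1     # one more distinct word of this length
        tokens_by_len[len(word)] += n    # all occurrences of that word
    return {k: (types_by_len[k], tokens_by_len[k]) for k in sorted(types_by_len)}


# Placeholder tokens in Latin transliteration, not corpus data.
tokens = ["at", "at", "su", "bala", "at"]
stats = length_statistics(tokens)
for length, (n_types, n_tokens) in stats.items():
    print(f"length {length}: {n_types} distinct words, {n_tokens} occurrences")
```

Keeping distinct words and total occurrences apart makes it possible to see whether short words are merely numerous or are also used more often.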
Table 3