А corpus-based frequency statistic…
«Philology» series. No. 2(86)/2017
Figure 4. The relationship between word length and word frequency
Figure 4 shows the relationship between word length and frequency in the textbook corpus. A remarkable characteristic of this relationship is that the distribution is concentrated on the left side of the plot, which means that most Kazakh words are short. On the one hand, words of length 5 to 8 account for 51.3936 % of all words, while words of length 13 to 30 account for only 4.9289 %. On the other hand, the distribution drags a long «tail», which is another major characteristic of a power law.
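The percentages quoted above can be reproduced from any tokenized corpus with a short script. The following is a minimal sketch, not the authors' procedure: the function names `length_distribution` and `share_in_range` and the toy token list are hypothetical, and word length is assumed to be the number of Unicode code points in a token.

```python
from collections import Counter


def length_distribution(tokens):
    """Return {word length: percentage of all tokens with that length}."""
    counts = Counter(len(t) for t in tokens)
    total = sum(counts.values())
    return {length: 100.0 * n / total for length, n in sorted(counts.items())}


def share_in_range(distribution, low, high):
    """Sum the percentages for lengths in the inclusive range [low, high]."""
    return sum(p for length, p in distribution.items() if low <= length <= high)


# Hypothetical toy token list, not the textbook corpus used in the paper.
tokens = ["бала", "мектеп", "кітап", "оқушы", "мен", "мұғалімдер", "үй", "қала"]

dist = length_distribution(tokens)
mid_share = share_in_range(dist, 5, 8)     # share of tokens of length 5 to 8
tail_share = share_in_range(dist, 13, 30)  # share of tokens in the long tail
```

Calling `share_in_range(dist, 5, 8)` and `share_in_range(dist, 13, 30)` on the real corpus would correspond to the two figures reported in the paragraph, provided the same definition of word length is used.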
Statistical analysis of Kazakh words, stems, suffixes: We use the second experimental data set for the following statistical analysis of Kazakh words, stems and suffixes in primary school, junior high school and high school textbooks. Here the word, stem and suffix are the word segmentation units of our corpus and are counted as non-recurring terms (that is, as distinct types). The counts therefore give the number of distinct Kazakh words, stems and suffixes, excluding punctuation marks and English text, as shown in Figure 5.
Figure 5. The relationship between word length and frequency in Kazakh textbooks
Figure 5 shows that words in the three textbooks are composed of 1 to 20 characters. Words composed of fewer than 3 or more than 15 characters are used less commonly. In the primary school, junior high school, and high school textbooks, words of fewer than 3 characters and words of more than 15 characters account for 2.07 % and 0.7 %, 1.91 % and 0.67 %, and 1.69 % and 0.9 %, respectively.
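Counting non-recurring terms by length, as described above, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: a token is treated as English when it consists only of ASCII letters, as punctuation (or another non-word) when it contains no alphabetic character, and words are lowercased before deduplication. The names `is_countable`, `type_length_counts` and `tail_shares` are hypothetical.

```python
import re

# Assumption: a token made up only of ASCII letters is treated as English.
ENGLISH = re.compile(r"[A-Za-z]+")


def is_countable(token):
    """Keep tokens that are neither English nor free of alphabetic characters."""
    if ENGLISH.fullmatch(token):
        return False
    return any(ch.isalpha() for ch in token)


def type_length_counts(tokens):
    """Count distinct (non-recurring) words for each word length."""
    # Assumption: case variants of a word are the same type.
    types = {t.lower() for t in tokens if is_countable(t)}
    counts = {}
    for t in types:
        counts[len(t)] = counts.get(len(t), 0) + 1
    return dict(sorted(counts.items()))


def tail_shares(counts, short=3, long=15):
    """Percentage of types shorter than `short` and longer than `long`."""
    total = sum(counts.values())
    below = sum(n for length, n in counts.items() if length < short)
    above = sum(n for length, n in counts.items() if length > long)
    return 100.0 * below / total, 100.0 * above / total


# Hypothetical toy tokens standing in for one textbook level.
tokens = ["Бала", "бала", "үй", ",", "book", "мектеп", "мектеп", ".", "кітап"]

counts = type_length_counts(tokens)
short_pct, long_pct = tail_shares(counts)
```

Running `type_length_counts` once per textbook level would give the three series plotted in Figure 5, and `tail_shares` gives the pair of percentages reported for each level.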
To analyze Kazakh stems as non-recurring terms, we examine the length of stems in the primary school, junior high school and high school textbooks, as shown in Figure 6.
Figure 6. The relationship between stem length and frequency in Kazakh textbooks
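[Figure 4 plot: «Relationship between Kazakh word frequency and length»; x-axis: Kazakh word length (1 to 29); y-axis: percentage of Kazakh word frequency (%), 0 to 15; series: word frequency]
[Figure 5 plot: x-axis: word length (1 to 20, >20); y-axis: number of word types, 0 to 12000; series: primary school, junior high school, high school]
[Figure 6 plot: x-axis: stem length (1 to 11, 12-17); y-axis: number of stem types, 0 to 4000; series: primary school, junior high school, high school]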
G. Altenbek, X.L. Wang
Bulletin of the Karaganda University
Figure 6 shows that commonly used word stems in the three textbooks are composed of 4 to 8 characters. In the primary school Kazakh language textbook, 96.67 % (1505) of the word stems are composed of 4 to 8 characters. In the junior high school Kazakh language textbook, 94.84 % (11,167) of the word stems are composed of 3 to 9 characters. In the high school Kazakh language textbook, 93.13 % (14,976) of the word stems are composed of 3 to 9 characters. Word stems composed of fewer than 3 or more than 9 characters are used less commonly.
This study also summarizes the lengths of the word suffixes used in the three textbooks, counted as non-recurring terms, as shown in Figure 7.
Figure 7. The relationship between ending length and frequency in Kazakh textbooks
Figure 7 shows that the common word suffixes in the three textbooks are usually composed of 1 to 5 characters. Such suffixes account for 95.49 %, 94.97 % and 93.92 % of the suffix types in the primary school, junior high school, and high school textbooks, respectively. The number of suffix types differs little across textbook levels.
Figure 7 also shows that the distribution of word endings is similar across the three Kazakh textbooks, which indicates that Kazakh word endings form a closed set and that word choice in the Kazakh language is relatively stable. Finally, an analysis of the longest Kazakh words found in the three textbooks demonstrates that their lengths differ little, as shown in Table 4.
Table 4