Language Modeling
For the language model, we used our text materials to build a standard trigram
model with Good-Turing smoothing [15], compiled into the ARPA format with the
CMU-Cambridge Language Model Toolkit 0.7 [16]. The language model file has the
following format:
\data\
ngram 1=nr # number of 1-grams
ngram 2=nr # number of 2-grams
ngram 3=nr # number of 3-grams
\1-grams:
p_1 wd_1 bo_wt_1
\2-grams:
p_2 wd_1 wd_2 bo_wt_2
\3-grams:
p_3 wd_1 wd_2 wd_3
\end\
where ngram k=nr gives the number of n-grams of order k, p_k is the base-10
logarithm of the conditional probability of the n-gram, wd_k is the k-th word
of the n-gram, and bo_wt_k is the base-10 logarithm of the back-off weight
assigned to the n-gram (the highest-order entries carry no back-off weight).
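To make the interaction between p_k and bo_wt_k concrete, the following Python
sketch (our illustration, not part of the toolkit) evaluates a trigram log
probability with the standard Katz back-off rule used by ARPA models; the
dictionaries uni, bi, tri, uni_bo, and bi_bo are assumed to map word tuples to
the p_k and bo_wt_k values read from the corresponding sections of the file:

def bigram_logprob(w1, w2, bi, uni, uni_bo):
    if (w1, w2) in bi:                          # exact bigram entry found
        return bi[(w1, w2)]
    # Back off to the unigram, scaled by the stored back-off weight
    # of w1 (0.0 in log space if the file lists none).
    return uni_bo.get(w1, 0.0) + uni[w2]

def trigram_logprob(w1, w2, w3, tri, bi, uni, bi_bo, uni_bo):
    """Base-10 log probability of w3 following (w1, w2), Katz back-off."""
    if (w1, w2, w3) in tri:                     # exact trigram entry found
        return tri[(w1, w2, w3)]
    # Otherwise back off: add the bo_wt of the (w1, w2) bigram and
    # fall back to the bigram estimate.
    return bi_bo.get((w1, w2), 0.0) + bigram_logprob(w2, w3, bi, uni, uni_bo)

When a trigram is absent, its probability is taken from the bigram estimate,
scaled by the stored back-off weight; this is why the highest-order n-grams in
the file need no bo_wt field.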
In total, our experiments used over 12500 sentences, which produced 29586
unigrams, 100354 bigrams, and 120755 trigrams.
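For reference, a model of this kind is built with the toolkit's standard
command pipeline. The sketch below drives it from Python; the tool names
text2wfreq, wfreq2vocab, text2idngram, and idngram2lm are those of the
CMU-Cambridge toolkit, but the exact flags may differ between versions, and
the file names are hypothetical:

import subprocess

corpus = "corpus.txt"  # hypothetical file, one training sentence per line

steps = [
    # 1. Word-frequency counts -> vocabulary.
    "text2wfreq < {c} | wfreq2vocab > corpus.vocab",
    # 2. Text -> id n-grams over that vocabulary.
    "text2idngram -vocab corpus.vocab < {c} > corpus.idngram",
    # 3. id n-grams -> trigram model in ARPA format
    #    (Good-Turing discounting is the toolkit's default).
    "idngram2lm -idngram corpus.idngram -vocab corpus.vocab "
    "-arpa corpus.arpa -n 3",
]

for cmd in steps:
    subprocess.run(cmd.format(c=corpus), shell=True, check=True)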