Apertium-based MT systems are transfer systems implemented as text pipelines (see Figure 1)
consisting of the following modules:
1. A deformatter that separates the text to be translated from the formatting tags. Formatting
tags are encapsulated as “superblanks” that are placed between words in such a way that the
remaining modules see them as regular blanks (for instance, the tags in the HTML text
“I see the sky” are encapsulated in square brackets, as in “I see [ ]the sky[]”, and everything
in square brackets is treated just like a regular blank).
2. A morphological analyser, yielding, for each surface form (SF), that is, for each lexical unit
as it appears in the text, one or more lexical forms (LFs), each composed of a lemma (dictionary
or citation form), a lexical category (or “part of speech”), and inflection information. For
instance, the English SF “books” would yield two LFs: (book, noun, plural), as in “I have bought
some books”, and (book, verb, present tense, 3rd person), as in “He books a ticket”. The
morphological analyser executes a finite-state transducer generated by compiling a morphological
dictionary for the source language (SL).
3. A constraint-grammar (Karlsson 2005) module based on CG3 is used to discard some LFs
using simple rules based on context (this module is not depicted in the figure).
4. A part-of-speech tagger based on hidden Markov models (Cutting et al. 1992) selects one of
the remaining LFs. The statistical models may be trained in a supervised way on an annotated SL
monolingual text corpus, or in an unsupervised way, either on an unannotated monolingual SL
corpus or on two unrelated, unannotated source language and target language corpora (as in
Sánchez-Martínez et al. 2008). The Apertium part-of-speech tagger can also read linguistically
motivated constraints (much more rudimentary than the constraint grammar rules of the previous
module) that forbid specific sequences of two LFs.
5. A lexical transfer module adds, to each source language LF (SL LF), one or more
corresponding target language LFs (TL LFs). This module executes a finite-state transducer
generated by compiling a bilingual SL–TL dictionary.
6. An (optional) lexical selection module (currently not active in the English→Kazakh system)
reads in rules that allow for the selection of one of the TL LFs according to context. When this
module is absent, the TL LF given as default in the dictionaries is used.
7. A structural transfer module processes the stream of SL LF–TL LF pairs produced by the
lexical transfer module and transforms it into a new sequence of TL LFs; a more detailed
description is found in section 0 as this is the main subject of this paper.
8. A morphological generator takes the sequence of TL LFs and generates a corresponding
sequence of TL SFs. The morphological generator executes a finite-state transducer generated by
compiling a morphological dictionary for the TL.
9. A post-generator takes care of some minor orthographical operations such as
apostrophations and contractions in the target language (this module is not used for English to
Kazakh).
10. Finally, the reformatter opens the square-bracketed superblanks and places the formatting
tags back into the text so that its format is preserved.
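
The first, second and last steps of the pipeline above can be illustrated with a toy sketch in Python. It is not Apertium's implementation: the names `deformat`, `analyse`, `reformat` and `TOY_ANALYSES` are hypothetical, the dictionary is a hand-written stand-in for a compiled finite-state transducer, and escaping of literal square brackets in the input is ignored. It only shows how tags can be hidden as superblanks, how an ambiguous SF such as “books” receives several LFs, and how the tags can be restored at the end.

```python
import re

# Hypothetical toy dictionary (surface form -> candidate lexical forms).
# A real analyser would run a finite-state transducer compiled from a
# morphological dictionary instead of a hand-written table.
TOY_ANALYSES = {
    "books": [
        ("book", "noun", "plural"),
        ("book", "verb", "present tense, 3rd person"),
    ],
}

HTML_TAG = re.compile(r"<[^<>]+>")          # a single formatting tag
SUPERBLANK = re.compile(r"\[(<[^<>]+>)\]")  # a tag wrapped in square brackets
TOKEN = re.compile(r"\[[^\]]*\]|\w+")       # a superblank or a word


def deformat(html):
    """Wrap every formatting tag in square brackets (a superblank)."""
    return HTML_TAG.sub(lambda m: "[" + m.group(0) + "]", html)


def analyse(text):
    """List (surface form, candidate LFs) pairs, skipping superblanks."""
    pairs = []
    for token in TOKEN.findall(text):
        if token.startswith("["):
            continue  # superblanks are treated like ordinary blanks
        pairs.append((token, TOY_ANALYSES.get(token.lower(), [])))
    return pairs


def reformat(text):
    """Open the superblanks and put the formatting tags back."""
    return SUPERBLANK.sub(r"\1", text)


source = "He <b>books</b> a ticket"
encapsulated = deformat(source)
print(encapsulated)
for surface_form, lexical_forms in analyse(encapsulated):
    print(surface_form, lexical_forms)
print(reformat(encapsulated) == source)
```

In this sketch the ambiguity of “books” is left unresolved on purpose: choosing between the noun and the verb reading is the job of the constraint-grammar module and the part-of-speech tagger described in steps 3 and 4.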