In this chapter, I will introduce basic concepts and techniques in the field of translation corpus processing. The chapter includes definitions of basic terminology, as they are used in the thesis, and a summary of standard methods for the compilation of parallel corpora and their application to computational lexicography and machine translation. Figure 1 gives an overview of the compilation and use of parallel corpora within the field of natural language processing.
The following section (0.4) defines the term parallel corpus as used in the thesis in relation to other concepts of computational corpus linguistics. This is followed by a short overview of common sentence alignment techniques in section 0.5. The main part of this chapter is devoted to word alignment (section 0.6). In this part, I will briefly introduce the concept of word alignment, review two general approaches thereof and finally, discuss evaluation methodologies of alignment results. Some applications of these techniques are listed in section 0.7. Finally, the last part of this chapter links background information to other parts of the thesis.
In computational linguistics, a corpus is a (in some respect representative) collection of spoken or written utterances of natural language usually accessible in electronic form. Often, corpora represent a particular genre of text or speech. Other corpora contain a large variety of types and genres to represent language use in a more general way. However, a corpus is always just a sample and can never completely represent a whole language. The expressive power of natural language cannot be captured by a finite data set.
There are several ways of classifying corpora into different types and categories according to their properties. One way is to distinguish between corpora that include only one language (monolingual corpora) and corpora that include several languages (multilingual corpora). Multilingual corpora can be divided into parallel and non-parallel corpora. Parallel corpora are referred to as natural language utterances and their translations with alignments between corresponding segments in different languages2.1. The alignment distinguishes parallel corpora from other multilingual corpora such as general translation corpora or so-called comparable corpora. Parallel corpora usually contain a common source document (the original) and one or more translations of this source (target documents). Sometimes the original language is unknown (mixed source corpora) or the original document is not included at all (multi-target corpora) [Mer99b]. Bilingual parallel corpora are sometimes called bitexts (see e.g., [Isa92]) and corresponding parts within these corpora are called bitext segments (see e.g., [AMST99]).
Parallel corpora have been exploited in many studies. Many applications use parallel corpora for translation studies and for tasks in multilingual natural language processing (NLP). Bilingual concordances have been used for some years in order to support human translation. In recent years, parallel corpora have become more widely available and serve as a source for data-driven NLP tasks. Automatic extraction of multilingual term databases, statistical machine translation, corpus-based bilingual lexicography are just some research fields that have been developed in connection with a growing number of large parallel corpora.
The most widely used parallel corpora are derived from the English and
French records of the Canadian Parliament, the so-called Hansards
corpora.
A compilation of these records are available from, for instance,
http://www.isi.edu/natural-language/download/hansard/index.html.
Like the Hansards, most
parallel corpora contain only two languages, a source and a target
language. However, multilingual
parallel corpora with translations into more than one language are
available and became very popular in recent studies.
Examples of such corpora
are the Multext East ``1984'' corpus2.2
for central and eastern
European languages, the multilingual parallel corpus of
European Parliament proceedings
EUROPARL2.3
in eleven
languages
and the multilingual OPUS corpus2.4, which is
briefly described in section 0.10.5.
Source language documents in a translation corpus can be split into segments that correspond monotonically to segments in translated documents. Common segmentation units are paragraphs and sentences. Establishing links between corresponding segments is called alignment . In particular, linking corresponding sentences is called sentence alignment . Such an alignment essentially creates segmentally searchable parallel corpora of collections of documents and their translations.
Sentence alignment is a well established task which does not exclusively refer to 1-to-1 alignments. Sentence boundaries may vary in different translations. However, it usually assumes that information at the sentence level is expressed in the same order in the original document as in its translations. With this assumption, sentence alignment can be modeled as a monotonic mapping process, i.e. an alignment without crossing links. A sample of a sentence aligned bitext is given in figure 2.
![]() |
Several approaches to automatic sentence alignment have been proposed. The main approaches apply either length based models using correlations between the lengths of corresponding sentences, dictionary based models using correspondences between words and other lexical units, or combinations of both. Additionally, information about the document structure can be used [TRA03] to identify corresponding segments.
Length-based sentence alignment was introduced in [GC91b]. The authors found a significant correlation between character lengths of corresponding sentences for several language pairs. They proposed an alignment model based on a statistical distance measure and a dynamic programming approach. The length-based approach has been applied to a large variety of language pairs (see e.g., [Wu94,TKS96,TN03]) and has proven to be highly accurate for most of them. Some researchers applied sentence lengths in terms of words instead of characters for sentence alignment [BLM91]. However, in [GC91b], the authors demonstrated experimentally that character based models are superior to word based models.
Sentence alignment using lexical information was introduced in [KR88]. There, selected words with similar distributions serve as anchor words for establishing sentence alignments. A geometric approach to sentence alignment using such points of correspondence was proposed in [Mel96b]. In this study, the author introduces an algorithm for finding geometric patterns in a bitext mapped on a 2-dimensional bitext space . Chains of corresponding points in this bitext space are used to align sentences in the parallel corpus. This method has been successfully ported to other language pairs [Mel97b].
Combining lexical information with length-based sentence alignment has been suggested by several researchers (see e.g., [SFI92,JH94]). Various techniques have been proposed for finding corresponding words that may serve as anchor points in a parallel corpus. String similarity measures can be used to find possible cognates [SFI92]. Machine-readable dictionaries can also be utilized for identifying corresponding words [Mel96b]. Distributional models find anchor points by relating word occurrence frequencies. In these approaches, corresponding words are found using measures such as point-wise mutual information2.5 in combination with t-score filters [FC94] or using similarity measures between so-called recency vectors of word occurrences [FM94]. In the first approach, K-vec, each half of the bitext (source and target language) is split into a number of equally long segments. Frequencies are counted for word pairs that co-occur in corresponding segments, which is the basis for calculating an association measure such as point-wise mutual information. The second approach (DK-vec) applies a pattern matching technique called dynamic time warping for comparing so-called recency vectors from both halves of the bitext. These vectors contain word position distances and are used to describe the ``distributional signals'' of words occurring in the text. Both techniques are used to identify corresponding words that can be utilized as anchor words in sentence alignment.
Further enhancements and combinations of sentence alignment techniques can be found in the literature, see e.g., [SFI92]. The use of more than two languages is explored in [Sim99]. Automatic sentence alignment is known as a task that can be accomplished with high accuracy, above 90%. However, improvements are still possible in the most difficult cases, especially in connection with ``noisy corpora'' including divergent and incomplete translations.
An important implicit resource in parallel corpora is the huge number of translational relations between words and phrases included in the documents. However, such relations cannot easily be defined as monotonic mappings in a bitext space as in the case of sentence alignment. Word order is in general not identical for most language pairs, and boundaries of lexical units are not as easy to detect as sentence boundaries. There is no consistent correlation between the character lengths of corresponding words, at least not for most language pairs. Furthermore, the notion of lexical equivalence presumes that there are identical lexical concepts in both languages and that they behave identically in the given context. This is clearly not always the case. However, natural languages must be compositional in some sense for translation to be possible at all [Isa92]. Thus, many translation relations between these compositional components, i.e. words and phrases, can be found in documents and their translations. Linking corresponding words and phrases in parallel corpora is usually called word alignment, a process which can be used as the basis for the extraction of bilingual lexicons.
|
The type of relation between words varies in parallel texts. Furthermore, the strategy of aligning words and phrases in parallel corpora depends on the task to be accomplished. Usually, word alignment aims at a complete alignment of all lexical items in the corpus, i.e. the goal is to break each bitext segment into sets of corresponding lexical items. This often leads to ``fuzzy'' translation relations between certain words [MAA02,Vér98,ON00a] due to lexical differences, structural and grammatical differences, paraphrased translations, spelling errors, and other divergent translations. The degree of correspondence can be expressed in terms of alignment probabilities, which is useful for many tasks, e.g. statistical machine translation. Bilingual lexicon extraction aims at the identification of lexical word type links in parallel corpora. These links can be inferred from word alignments.
There are generally two approaches to word alignment, the association approach using measures of correspondence of some kind, and the estimation approach using probabilistic translation models. Association approaches are also referred to as heuristic approaches [ON03] or hypothesis testing approaches [Hie98]. Estimation approaches are often called statistical alignment, e.g. in [ON00a]. Both approaches use some kind of statistics. Hence, I will use the terms estimation and association in order to avoid any confusion between them. In the next two sections, both approaches are briefly introduced and discussed. Some alternative alignment models and a description of evaluation techniques are given thereafter.
Association approaches to word alignment originate mainly from early studies on lexical analysis of parallel data. Lexicographers investigated the use of parallel corpora for the creation of bilingual lexicons and for the support of cross-lingual lexical disambiguation. The relation between word alignment and lexicon extraction is discussed in the following section. Following this, common association measures and resources for such an extraction are briefly introduced, and finally, approaches for handling multi-word units are described.
The task of bilingual lexicon extraction differs from full-text word alignment in so far as identified translation relations will be used outside their context. Grammatical functions, uncertain relations, and translation irregularities are most likely to be excluded from an extracted lexicon. Furthermore, in a lexicon extraction task it is not necessary to align all occurrences of a lexical item to corresponding items in the translation. The important part is to find at least one instance of each translation relation to be included in the resulting lexicon; inflectional variants can usually be inferred from other forms. Sometimes, only a sub-set of lexical items has to be considered, e.g. domain-specific terminology.
In general, the following steps have to be taken for aligning bitexts using an association approach:
In spite of the fact that sentence and word alignment are very different from each other, many ideas from research on sentence alignment can be applied to word alignment as well. Previously, several techniques of anchor based sentence alignment have been discussed. Finding anchor points, which is usually defined as the identification of corresponding words, is in fact nothing other than aligning words. Thus, early work on extracting bilingual lexicons was often based on lexical approaches to sentence and paragraph alignment in combination with empirical analysis of lexical co-occurrence, as, for example, reported in [CGHH91]. The K-vec algorithm is one of the early techniques for extracting lists of corresponding words using point-wise mutual information scores and t-score filters [FC94]. Similarly, the DK-vec algorithm based on similarity measures between word recency vectors can be applied for extracting corresponding words from parallel corpora [FM94]. The word_align program [DCG93], which is based on the char_align aligner [Chu93], was developed for identifying technical terms and their translations in parallel texts using similarities between character N-grams. It was used in the semi-automatic extraction tool Termight [DC94]. Melamed [Mel95] used a cascade of heuristic filters for inducing translation lexicons. He applied string similarity measures and word order heuristics among other things. The same author introduced the notion of competitive linking for one-to-one word alignments [Mel96a] which brings up the idea of iterative size reduction mentioned in [Tie97].
Many word alignment approaches presume sentence aligned corpora. Sentence alignment, as discussed in the previous section, is a reasonably reliable task providing aligned regions that can be used to count frequencies of word pairs co-occurring in these regions. Such co-occurrence frequencies can be used in association measures for the identification of word correspondences.
A common idea behind statistical association measures
is to test if two words co-occur significantly more often than it
would be expected if they would co-occur purely by chance.
The t-score
is an example of such a test metric. It
is derived from the t-test,
a common hypothesis
test in statistics.
The general form of this test is
the following:
where
is the
mean of the observed values,
the expected value
according to the
hypothesis, and
the standard error of the
observations.
Using the central limit theorem,
the
standard error
is defined as the
sample
deviation (
) divided by the square-root of the number of experiments
:
.
The
-value gives the distance from the mean of
Student's t-distribution2.7.
The value of
is associated with a p-value, the probability
mass of the distribution outside the interval from the mean to
2.8.
The p-value corresponds to the statistical significance
of the
difference between observation and hypothesis. A low p-value
indicates
statistical evidence for rejecting the hypothesis.
Now, the
-test can be applied as an association measure
between translated word pairs as follows: The joint probability of
co-occurring words can be observed in a parallel
corpus (
).
Using the hypothesis that both words co-occur purely by chance, i.e.
the distributions of both words are independent of each other
(
), we can apply the t-test to find
out if there
is any statistical evidence for rejecting this hypothesis,
i.e. evidence for a dependence between
and
.
The t-test is often used as an association measure because its value
becomes larger
when there is stronger evidence for rejecting the independence
hypothesis. This measure is usually called the t-score.
In the case of
bilingual co-occurrence, the standard error is estimated as defined above,
where
is the number of aligned bitext segments and the sample
deviation is approximated using the square root of the observation
mean2.9(
)
[FC94]. Probabilities and
standard deviations are estimated from the corpus using relative
frequencies.
![]() |
(2.1) |
Another association measure based on co-occurrence is the
Dice coefficient.
This coefficient can be used to measure the correlation between two
events (the occurrences of
and
)
as follows:
The formula above shows that the Dice coefficient is in fact the
harmonic mean
of the two conditional probabilities
and
. It therefore produces
values between 0 and 1, where 1 refers to the strongest correspondence.
A third statistical association measure is point-wise mutual information,
derived from information theory.
Mutual information
measures the amount of information common to two random
variables
and
. It is defined as the difference between the
entropy
of one variable and the entropy
of another variable given the first
one. Entropy is a measure of the information content of a random
variable:
. Using this
definition, mutual information
is calculated as follows:
.
Now, we can
assume that words that have a lot of information in common are likely
to be mutual translations. Applying mutual information, we can set the
following parameters:
is a random variable that produces the
events
(the word
occurs) or
(the word
does not occur).
is a random variable that produces
and
. The joint probability
describes the probability of
and
to
co-occur in the bitext.
and
are the
probabilities of
and
to occur in the
corpus, respectively. All probabilities can be estimated from the
corpus. Point-wise
mutual information considers only ``one point'' in the distributions of
and
, namely
and
. Hence, the
definition of point-wise mutual information for co-occurrence is as follows:
| (2.3) |
In this way, point-wise mutual information ``measures the reduction of uncertainty about the occurrence of one word when we are told about the occurrence of the other'' [MS99, p. 183].
Many other measures of correspondence have been applied to parallel
corpora with slightly different results.
Two examples are the
coefficient
[GC91a], and the log-likelihood measure
[TB02].
Advantages and weaknesses of specific
measures have been discussed much in the literature (see e.g.,
[CG91,Dun93,SMH96,MS99]).
A comparison of association measures can be found in
[RLM00]. We
restrict this description to the three measures
introduced above, i.e. the t-score, the Dice coefficient and point-wise
mutual information, as they are used in our word alignment approaches.
Other empirical alignment techniques than the ones mentioned above are based on measures of string similarity. Many cognates can be found especially in bitexts of closely related languages. The term cognate denotes here etymologically related words across languages. Many cognates can be identified by way of their spelling. Simple string matching algorithms can be used to exploit this property. Initial character sequences are one simple form of cognate identification [SFI92]. Variants of the Dice coefficient can also be used to compare common character n-grams in order to find a level of similarity between strings [BM96]. For example, using character bigrams, the Dice coefficient for string similarity can be formulated as follows:
| (2.4) |
Another approximate string matching algorithm is the longest common sub-sequence ratio (LCSR) as described, for instance, in [Mel95]. LCSR is defined as the ratio of the longest common sub-sequence (LCS) of characters of two strings and the length of the longest of both strings. LCSs can be found using dynamic programming approaches for arbitrary pairs of strings [Ste92]. LCSR is a measure with values between 0 (completely distinct, i.e. LCS=0) and 1 (identical). An example is given in figure 4.
Note that string matching algorithms are just a tool for finding
possible cognates using the assumption that cognates are similar
in spelling. This presumes a similar alphabet and similar
spellings of etymologically related words. Algorithms for the
construction of weighted string similarity measures that include mappings
between non-identical characters are described in
[Tie99a] and in section 0.16.1 on page
.
Furthermore, the problem of false
friends
is common and is usually faced with string length thresholds
(e.g.
characters). A comparison of string matching techniques
and linguistically motivated methods for cognate extraction has been
presented in [Bor98].
There are many external resources that can be employed in word alignment, for instance, collections of machine-readable bilingual dictionaries (MRBD). External resources define common relations between words and phrases that can be used in word alignment. The use of bilingual dictionaries is, for instance, discussed in [ON03] and [Mel95]. The impact of such resources on the performance of automatic word alignment depends very much on their size and appropriateness with respect to the corpus and its domain.
Another type of external alignment resource is language-specific expert knowledge, which can be put into alignment systems by means of heuristics. Word order constraints and position relations are commonly used for boosting word alignment. Such constraints can be expressed in terms of monotonicity assumptions [OTN99] or in terms of position weights [AMA98]. Relations between words of similar word classes (using, e.g., part-of-speech labels) can also be used as pre-defined heuristics. This is, for example, applied in the alignment experiments presented in [TB02] and in the lexicon extraction approach in [Mel95]. Similarly, syntactic relations can be integrated in word alignment systems. Morphosyntactic information and syntactic function are used, for instance, in the interactive word aligner I*Link [AMA02]. The combination of such resources with statistical alignment techniques is usually not straightforward. The adjustment of parameters is a common problem in word alignment. Section 0.16.2 describes a sequential combination of alignment techniques and resources, and section 0.16.3 presents a probabilistic approach.
The one-to-one alignment assumption for word alignment is insufficient for most language pairs, as already mentioned in the introductory part of section 0.6. Several studies have been made on the integration of so-called multi-word units (MWUs) in word alignment. By MWUs we refer to word sequences and word groups, which express structural and conceptual units such as complex noun phrases, phrasal verbs, idiomatic expressions, and other phrasal constructions that should not be split up in an alignment process. Two general techniques are used for dealing with MWUs: prior identification of collocations and dynamic construction of MWUs during the alignment process. Exhaustive research on the identification of collocations has been carried out in studies of monolingual terminology extraction. Statistical association measures have been proposed for finding MWUs. The use of such measures for monolingual lexical analysis has been presented in, e.g., [SM90,CGHH91,Dun93,MNA94]. Another way of identifying complex terms is to use linguistic knowledge such as part-of-speech information or phrase-structure analyzes. Typically, noun phrases are emphasized in work on automatic multi-word term extraction [JK95,Arp95]. A common way of noun phrase identification is to match language-specific part-of-speech tag patterns. Hybrid systems using statistical as well as linguistic information have been proposed, e.g., in [Dai95,MA00]. Parsing techniques can also be applied for finding relevant MWUs. In particular, shallow parsing techniques have recently been developed for shallow analyzes of unrestricted texts [Abn91]. Shallow parsing can be modeled as a classification task [RM95] and has become very popular in the machine-learning community (see e.g., [Rat98,FHN00,KM01,Meg02,TKS02]).
The use of structural information in bilingual term extraction has been investigated in a number of studies. In [vdE93], the author argued that noun phrases represent a better level to compare than words for the alignment for Dutch and English. He used a simple pattern-based extraction of noun phrases and association measures for selecting phrase translations. Other studies use statistical or hybrid approaches for the identification of phrasal units in bitext segments before proceeding with word alignment. Ahrenberg et al [AMA98] apply an n-gram based phrase retrieval tool in their word alignment system in order to identify recurrent MWUs in both languages of each bitext segment. They improved the system by adding language filters using classified function word lists and entropy thresholds to their frequency-based phrase extractor [MA00]. A similar approach has been applied in [Tie99c] using point-wise mutual information for the identification of MWUs prior to alignment. Alternatively, such units may be searched for ``on-the-fly'' during the alignment process. Such an approach has been introduced in [SMH96]. The authors produce MWU alignments using a statistical collocation extractor [SM90] for the source language and an iterative alignment procedure for the identification of possible translations in the target language. The system starts the alignment of collocations with a link to single words and compares the association score iteratively with scores for links to larger units. Another ``dynamic'' segmentation approach has been used in the experiments presented in [Tie99c]. In this study, experiments using an iterative segment expansion procedure for both the source language and the target language have been carried out and compared to alignment results with prior MWU detection. This dynamic segmentation approach favors MWU alignments in cases where their association is stronger than associations between parts of the MWUs. Melamed [Mel97a] investigated another iterative approach for the automatic discovery of non-compositional compounds (as one type of MWUs) in parallel corpora. In his algorithm, alignments including MWU candidates are compared to alignments without them. He uses mutual information scores as the ``objective function'' for the ``net gain'' when including the candidate MWUs. Candidate MWUs are adjacent words excluding previously defined stop words. This technique can be used for both languages by swapping the direction of the alignment. This, finally, results in a translation model for bag-of-word alignments [Mel00].
By estimation approaches we refer to word alignment using probabilistic alignment models that are estimated from parallel corpora. Most work in this field has been inspired by the work on statistical machine translation introduced in [BCD$^+$90]. In the next sections, I will briefly review the principles of statistical machine translation and describe the application of such models to word alignment and lexicon extraction tasks.
SMT is an application of the noisy channel
model
from information theory
[Sha48] to the task of machine
translation. The source language
and the target language
are
considered to be random variables that produce strings such as
sentences. Translation is modeled
as a
transmission of a source
language string2.10
through a noisy channel that transforms it
into a string
in the target language. The
probability
is then interpreted as the
probability of
being a proper translation of
. In a noisy channel
model the
target string (
) is considered to be the observable part of the
system and the task of the model is to find the original input string
(
) that has been transmitted through the channel in order to
produce
.
Using Bayes' rule
the
probability of the input string
given the observation
is defined as follows:
| (2.5) |
According to the model, the most likely solution can be
found by using
the
function, which returns the argument
out of all
possible values for
that maximizes the given function:
| (2.6) |
Due to the fact that
is independent of
and,
consequently, is
constant for all possible strings
; it can be ignored in the
maximization procedure. The fundamental equation of the statistical
machine translation model is therefore expressed as the following
search problem:
is called the language model
and
the
translation model,
which is to be estimated
from sentence-aligned parallel corpora.
However, estimating
directly from a corpus is impossible
because of the sparse data problem. The majority of segments
in any parallel corpus, let it be as big as possible, will be unique.
Even worse, most of the possible sentences
and
of two languages will not occur in any
training corpus and
therefore according parameters for a translation model are
impossible to estimate.
Consequently, the translation model has to be
decomposed into distributions of smaller units, which recur more
frequently in the training data and are more likely to appear
again in unseen data.
Most decompositions are based on the
translation models
proposed in
[BDDM93]. The first step in decomposing the general
translation model is to introduce another random variable
denoting the alignment between sub-strings (i.e. words) of the
source and the target language strings. Using all possible alignments
a between s and t, the
translation model can be re-written as follows:
The alignment in SMT is usually modeled as a sequence of hidden
connections between words in the target language string and words in
the source language string. More specifically, each word in the target
language string
is connected to exactly one word in the
source language string
, which can be expressed as a natural
number representing the position of the connected source language word
in the sentence.
In order to handle words that do not have a possible
equivalent in the other language, a special empty word
is
introduced at position
in the source language string.
Considering the string
of
source language words (plus the empty word
) and the translation
of
target
language words, an alignment is a sequence of
connections in the
form
with
.
This alignment model is also called a directional alignment model
because it is not symmetric. It disallows multiple connections from
one target language token to several source language tokens. However, the
authors of [BDDM93] argue that this problem can be
overcome by using (possibly overlapping) multi-word units, which they
call cepts,
in order to find appropriate
alignments. This idea
is similar to the principle of prior identification of collocations as
discussed in section 0.6.1.
Using the directional definition of word alignment, the translation model in equation 8 can be decomposed into the following form:
![]() |
(2.9) |
In other words, the joint probability of the target language
string and its
alignment sequence, given the source language string, can be expressed as
the product of the probabilities of all target language words
and
their alignments
, given the previous words
and their
alignments
and
given the source language string
. The sum of all joint probabilities
is multiplied with the probability of the length of the target
language sentence, given the source language string
. In
this way, the translation
probability has been decomposed into one parameter per target language
word for each possible alignment position and alignment context.
It is still not feasible to estimates these parameters directly from
corpora due to the large number of
dependencies in the parameters, which still cause a large data
sparseness problem.
In SMT research, approximations of the translation model above are
applied using different independence assumptions.
Most translation models that have been used in the SMT community are based on the five models introduced in [BDDM93] by researchers at IBM2.11. All their models build on the directional word-to-word alignment model discussed previously. The idea behind their five-model-scheme is to start with a very simple model before progressing to more complex ones. The output of simpler models can be used in this way to initialize the following models.
Model 1 is a simple word translation model, which basically makes use of co-occurrence of corresponding words in sentence aligned bitext segments. It is initialized with a uniform distribution of word translation probabilities. Model 2 adds local dependencies by introducing position parameters (distortion) to the translation model. In model 3, so-called fertility parameters are introduced. Fertility parameters represent the probability of words to be aligned to a certain number of corresponding words. Modeling fertility copes with the fact that certain words tend to be aligned to multiple words and other words do not. Using fertility probabilities, multiple connections to one word can be penalized or supported. Model 4 includes additional dependencies on the previous alignment and on the word classes of surrounding words in order to handle MWUs, which tend to stick together. Word classes can be learned automatically from bilingual corpora using clustering techniques [Och99]. Model 5, finally, gets rid of the deficiency problem of models 3 and 4. Deficiency of these models means that parts of the probability distributions are reserved for impossible events such as alignments to word positions outside of the sentence boundaries. However, removing deficiency is a rather expensive task and complicates the model. Och and Ney [ON00b] present adjustments for handling the deficient models, which makes it possible to skip the computation of model 5 as it does not provide any significant improvement.
Several variants of the IBM models
have been proposed in the literature. One way to model bilingual
alignment is to use hidden Markov models (HMMs)
as described in
[VNT96]. IBM's translation models 1, 2 and 3 can be
formalized in a zero-order HMM (where model 3 has additional fertility
parameters) and model 4 can be expressed as a first-order HMM
with additional fertility parameters
[ON00a].
The HMM based translation model is decomposed into an
alignment model
using a chain of hidden
alignment parameters
and a lexical model
with
lexical translation probabilities
.
Statistical translation models can be improved using
dependencies on word classes [TIM02], smoothing
techniques for the estimation of probabilities [ON03], and
external dictionaries [ON00b,ON00a]. Additional
contextual dependencies can be integrated into the lexical
probabilities using log-linear models and a maximum entropy approach
for training [BDD96,GVONC01]. Investigations
on adding string similarity measures for cognate identification
[GKK03] and on integrating syntactic information
[CKY03] have recently been carried out.
Another problem that has to be addressed in statistical alignment models is the handling of multi-word units (MWUs) in both languages. Directional alignment models allow the connection of source language words with multiple target language words but not vice versa. Alignment templates have been introduced in [OTN99] in order to handle MWUs in a more symmetric way. Alignment templates are bilingual sets of word classes that are internally linked using a two-dimensional alignment matrix. These templates are used to link sequences of source language words with sequences of target language words, i.e. to perform a phrase level alignment [VOT$^+$00].
Translation models are trained using the expectation-maximization algorithm (EM), which is an iterative optimization strategy for approaching a local maximum of the likelihood of given training data according to the parameters of a probabilistic model. EM is useful if there are ``hidden'' parameters which cannot be estimated directly from data. Alignment probabilities are typical examples of such parameters because links between words are not present in the training data. EM starts with an initial guess for all free parameters in the model and updates them iteratively by maximizing the likelihood function until the process converges at a local maximum2.12.
A translation model can be used together with a monolingual language
model for
(for instance based on n-grams) to translate unseen sentences
from language
to language
using the estimated parameters and
equation 7. A translation model can also be used to
find the most likely alignment between words in the training data
according to the model. This alignment is called the Viterbi alignment
and can be used to extract bilingual lexical
data from the bitext.
There are alternative proposals for modeling translation relations statistically besides the IBM models that were described above.
The author of [Kup93] presents a method for mapping noun phrases using an iterative estimation approach based on the EM algorithm. He employs part-of-speech taggers and noun phrase recognizers for both languages in a bitext, and re-estimates the joint probability of source language noun phrases and target language noun phrases at certain positions in the bitext using EM.
Gaussier [Gau98] describes an approach using alignment graphs, which he calls alignment flow networks . An alignment network describes directional word-to-word connections as edges with attached flow costs between nodes in the graph. A network has a source node and a sink node and the best alignment is defined as the path through the network for which the total cost flow is minimal. Costs are defined as inverse association probabilities, i.e. a minimal cost flow is defined as the connection for which the association probability is at its maximum. Probabilities of word connections (association probabilities) are modeled as two independent probabilities, the one of linking two positions and the one of linking two words. Association probabilities are assumed to be independent of each other, which leads to the following model of the joint probability2.13:
![]() |
(2.10) |
The alignment flow network model is trained using an approximate EM algorithm. The system has been applied to the task of bilingual terminology extraction. Fertility graphs can be included to handle multi-word terms. The general procedure is similar to the ideas of Smadja [SMH96]: First, candidate terms are identified for one language. Secondly, possible translations of these terms are identified using the flow network model.
Another estimation approach to word alignment is described in [Hie98]. The author uses a two-dimensional contingency table for representing translation frequencies between word types of each language in a given bitext. The free parameters of the model are the probabilities that are connected with each table cell, i.e. probabilities that the words, which correspond to the table cell, are translations of one another in the corpus. Translation pairs are assumed to be independent of each other, which makes the probability estimations a function of the translation frequencies in the table. Now, the free parameters are estimated using the EM algorithm . First, the cells in the contingency table are filled with initial estimates of the probability parameters. Secondly, the expected cell frequencies are calculated for each aligned sentence pair in the corpus using the observed words and the current probability parameters (E-step). Then, new probability estimates are calculated using the maximum likelihood estimator (M-step). Finally, the E-step and the M-step are repeated until the parameters converge at a local maximum. Two different variants of the model are introduced in the paper, one with an explicit one-to-one word assumption and a second model with the possibility of many-to-one alignments. The second model is more flexible. However, it runs into maximization problems because of alignments of different lengths. Both models have a large search space which causes efficiency problems in the training process. Therefore, approximate EM techniques are used to approach the local maximum. The author also ran experiments with pre-processing steps such as compound splitting, which lead to a better performance than both models without pre-proccessing.
Another word-to-word alignment model has been introduced by Melamed
[Mel97c]. This model is a mixture of
a standard association approach and a statistical estimation
approach. The author initially applies the competitive linking
algorithm,
as already mentioned in
section 0.6.1, using
log-likelihood tests as a measure of association. Melamed introduces two
hidden parameters, which have to be learned from the
data. One parameter
represents ``true positives'', i.e. the
probability of a link given that co-occurring word pairs are mutual
translations of each other. The other parameter
represents ``false positives'', i.e. the probability of links given
that co-occurring
word pairs are not mutual translations. Using these parameters, the
likelihood of two word types being mutual translations is defined as
the ratio of the probabilities of two words being
linked, given the co-occurrence frequency and the positive parameter
, and the probability of the same two words being linked,
given the co-occurrence frequency and the negative parameter
. In other words, the model for linking word types
depends on the co-occurrence frequencies and the two hidden parameters
which have to be estimated in a maximization procedure. This can also
be called a re-estimation of association measures using
false and true positives as hidden parameters.
The reader might have observed that evaluation has not been mentioned
in the previous sections, and no comparison in terms of performance
has been made.
Evaluation of alignment is a tricky part. Most studies refer to recall
and precision measures, which have been derived from information
retrieval.
Precision (
), giving the number of correctly aligned items
(
) in
proportion to the number of obtained items (
), and recall
(
), giving
the number of
correct results in proportion to the number of correct items in total
(
),
seem to be reasonable measures for comparing system
performances. A balanced
F-value2.14 is often used to combine both measures for a
comparison of the overall performance:
However, precision and recall values are not as straightforward to estimate in the case of alignment as in information retrieval. Usually, results are not easily judgeable as completely correct or completely wrong. Partiality is a common phenomenon in word alignment results. The possibility of MWU links causes the system to return partially correct links in many cases. Link proposals including at least one correct word on both sides of the link (source and target language) are called partially correct links. These links are not captured by standard precision and recall. Therefore, the degree of correctness has to be integrated in evaluation measures in word alignment. However, this is not as straightforward as one might expect. Several approaches will be discussed later on.
In general, complete correctness of any word alignment cannot be expected. This is due to the nature of translation of natural language. Translations are correspondences in context and word alignment tries to break parallel texts into related units at the word level, which is not always feasible. This has been experienced in several attempts of manually creating word aligned reference material [Mel98,Vér98,Mer99a]. Word alignment also depends on various corpus characteristics. There will be differences depending on genre, language pair and the style of individual translators. Another crucial factor is the purpose of an alignment experiment, either lexicon extraction or full-text word alignment. In the first case, the extracted lexicon is to be evaluated, in the latter case, aligned tokens in the corpus have to be judged. The focus in lexicon extraction is set on content words whereas function words can be neglected. A word alignment system aiming at the creation of aligned bitexts has to evaluate token links within the corpus even for divergent translations2.15 and highly ambiguous function words. Translation spotting is another type of application that has been studied in an alignment competition [VL00], in which translations of a number of given source language terms are sought in bilingual corpora.
Automatic evaluation using a reference alignment (=gold standard) is often preferred over manual a posteriori evaluation. The main advantage of reference alignments is their re-usability once they are created. The main difficulty is to produce representative samples of reliable reference alignments.
In some studies sample bitext segments have been completely aligned by hand in order to create gold standards [Mel98,ON00b]. In other studies word samples from the corpus [VL00,AMST00] were used. The word sampling approach has the advantage that the evaluation can be focused on certain word types such as content words or words from certain frequency ranges [Mer99b]. Segment based alignment has the advantage that linking decisions are often easier to make when surrounding context is to be aligned as well. Furthermore, recall and precision measures are more straightforward for completely aligned segments than for a sampled gold standard.
Links between MWUs can be expressed in different ways in gold standards. In [AMST99], MWUs are treated as complex units and links between them are established in the same way as between any other word pair. [Mel98], [ON00b] and [MP03] treat MWUs as sets of words and a link between two MWUs is expressed as the exhaustive set of one-to-one word links between the members of both MWUs. A consequence of splitting links between MWUs into one-to-one word links is that the number of links increases compared to the complex unit approach, which certainly has an impact on the evaluation measures precision and recall. Consider the example in table 1.
Let us assume that an alignment system finds the links ``patient
tålamod'' and ``very
mycket''. A
restrictive2.16 evaluation system would score 1 out of 2 correct links
using the complex MWU notation. In the second approach, the same
alignment would yield only 2 out of 7 links. Recall differs between
0.5 for the restrictive MWU approach and about 0.29 for the splitting
approach. Precision differs between 0.5 and 1. The F-value for the
first approach is 0.5 and ca 0.36 for the second approach. The MWU
splitting approach has the advantage that it counts the partially
correct link ``patient
tålamod'', which is not
captured by the restrictive complex unit approach. However, the scores may be
blurred as in the example because of the increasing number of links when
splitting MWU links.
Another problem that has to be solved when producing gold standards is the alignment of divergent translations. Previous studies have demonstrated how difficult it is to agree on manual word alignments [Mel98,Vér98,Mer99a]. Approaches to handle uncertain links are quite similar in these studies. [Vér98] uses confidence levels as a degree of certainty on a scale between 0 and 3. [Mer99a] uses ``fuzzy'' markers for labeling uncertain links. [ON00b] uses the marker 'P' for ``probable'' alignments.
A last group of links is commonly called null links referring to not translated words. Certain words do not have any correspondence in another language such as the auxiliary verb 'do' in English questions or negations. Other words are simply not translated and therefore cannot be aligned.
In some cases it is hard to decide if words should be aligned as fuzzy links, non-aligned null links, or if they should be included in a larger link unit. Therefore, detailed guidelines are necessary for manual annotators when creating gold standards [Mel98,Vér98,Mer99a].
Precision and recall can be defined in several ways according to the gold standard and the representation of MWU links, fuzzy links and null links. The main difference is to be found in the treatment of partially correct links. Partiality is measured in different ways. The evaluation measures of the ARCADE word alignment track were tailored towards the task of translation spotting, i.e. the search for proper translations of given source language terms. Therefore, the measures consider only tokens in the target language. They are defined as follows:
is the set of target language words found
by the alignment system for the source language term in link
of
the gold standard.
is the set of correct target
language words of link
in the gold standard. Null responses are
counted as links to a special ``empty'' word, i.e. if the system does
not find any link for reference
the set
will
be set to
. Consequently, null responses
(i.e. missing alignments) have an
impact on both, precision and recall.
Normally, word alignment also has to cope with the task of finding the
correct source items (words and phrases) to be aligned. Therefore,
other measures are needed for general
word alignment systems that do not have access to given source
language terms. The MWU splitting approach is one way to handle
partiality in a symmetric way. In [ON00b], the following
metrics have been proposed (henceforth the
measures):
The set of
one-to-one word links is compared with
the set of
one-to-one word links from the gold standard for measuring
recall. Links that have been marked to be
are a sub-set of all
one-to-one word links in the gold standard, which are referred to as
links. In contrast to recall, the complete set of
(
) links is used for
measuring precision as they can be ``correctly'' proposed by the
system.
The authors of [ON00b] also suggested a combination of both
measures (essentially a complementary F-value) which they call the alignment error rate.
The
-measures above are designed for word-to-word
alignment approaches
using the split type of MWU links.
Measures for word alignment
evaluation, using complex MWU link references, are presented in section
0.17.
In the previous sections common techniques for processing
parallel data were presented. Several applications of aligned data
have been mentioned already. In this
section, I will list some additional applications, tools and projects
that have been based on parallel corpora and techniques from above.
See also the illustration in figure 1 on page
for the relation between parallel corpora,
alignment approaches, and the applications which are mentioned here.
Note that this description is not
intended as a comprehensive list of tools and projects on this
subject.
Sentence aligned parallel corpora are directly applicable for supporting translators in their daily work. Translation memories have been used for a long time by human translators and sentence aligned bitexts can be used as such without any further processing. Extending the functionality of translation memories by aligning even sub-sentential parts leads to the idea of example-based machine translation (EBMT) [Bro96]. Several techniques have been proposed for generalizing aligned segments [Bro00] and putting ``bits and pieces'' together that have been derived from old translation examples.
Statistical machine translation was introduced above. SMT systems become ever more popular due to recent improvements of translation models and increased power of today's computer technology. SMT systems have the advantage that they can be developed very fast once there are tools and sufficient training data available for the particular language pair. SMT systems have the disadvantage that they rely on training and the statistical model. Corrections and improvements are hard to integrate in the set of estimated parameters which are usually not human readable. However, SMT has been used in several applications starting with the Candide system at IBM [BBP$^+$94] and moving to the VERBMOBILE project on speech translation [VOT$^+$00]. The flexibility of SMT systems has been proven by the ``MT in a Day'' experiment which was carried out at the NSF Workshop on statistical machine translation at Johns Hopkins University [AOCJ$^+$99]. Many teams work on the improvement of SMT systems. Co-training of SMT models using more than two languages is one way to boost the performance of a translation system [CBO03].
Recently, interactive machine translation (IMT) has been studied in connection with statistical translation approaches. The idea of translation predictions for IMT has been suggested in [FIP96] and implemented in the TransType system [LFL00]. The SMT framework has been integrated in the system using automatically constructed word hypothesis graphs for the efficient search of possible translation completions [OZN03].
Another obvious application of parallel corpora is the extraction of bilingual terminology. Several systems have been developed using word alignment techniques as described above. Termight uses Church's character-based alignment approach char_align [DC94], TransSearch uses IBM's model 2 [MH96], and Champollion uses Smadja's collocation aligner [SMH96]. Terminology extraction techniques have successfully been ported to a variety of language pairs among them less related languages such as English and Japanese [FM97] or English and Chinese [WX94].
Related to terminology extraction is the field of lexicography. The use of bilingual data for building translation dictionaries has been investigated in several projects. BICORD is one example of an attempt to combine bitexts and machine-readable dictionaries for building and extending bilingual dictionaries [KT90]. Dilemma is another lexicographic tool that re-uses existing translations [KKN$^+$94]. Many more projects aim at the automatic or semi-automatic extraction of bilingual lexicons for different language pairs (see e.g., [RM97,ARM01,AMA02]).
Furthermore, extracted translation dictionaries can be applied to machine translation as lexical resource in, for instance, direct and transfer based machine translation systems (see e.g. [Ahr99,MR01,SFT$^+$02,ISM03]), or example-based machine translation (see e.g.,[Bro97].
Another field of research where parallel data can help is the field of word sense disambiguation. Ambiguities are distributed differently in natural languages. This fact can be used for cross-lingual comparisons, which may help to disambiguate words and to identify concepts in context [GCY92,DR02]. Dyvik explores translations as semantic mirrors of ambiguous words [Dyv98]. Translation alternatives of a language sign (i.e. word) describe a so-called t-image of the sign in this language. t-images can be reversed. i.e. sets of translational correspondences in the original language can be found for all words in the t-image. Intersection between these sets in inverse t-images represent conceptual distinctions of the original sign. Dyvik uses multiple inversions and several heuristics for producing semantic networks for ambiguous words in a parallel corpus [Dyv02]. Furthermore, these networks can be linked between the two languages.
A last application of parallel corpora to be mentioned here is the adaptation of language tools to new languages with the help of parallel data. Robust text analysis tools, which exist for one language, can be ported to other languages by projecting analyzes (such as part-of-speech and chunks) from one language to another in a parallel corpus [Bor99,YNW01,Bor02]. Similarly, a third language may be used to induce word alignments between two other languages [Bor00].
In this chapter, basic concepts and techniques of the work with parallel corpora have been presented. The compilation of parallel corpora using alignment techniques such as automatic sentence alignment has been introduced. Common methods for the alignment of sentences have been discussed in section 0.5. Word alignment techniques have been described in detail as they are essential for the extraction of lexical correspondences from bilingual parallel corpora. Two main approaches to word alignment have been discussed, which we refer to as association approaches and estimation approaches. Association measures are widely used for the identification of translational correspondences. Section 0.6.1 describes common measures used in association approaches to word alignment. Their application in our alignment systems is presented in chapter 0.15. In statistical machine translation, probabilistic alignment models are used to estimate alignment parameters between words in parallel corpora. This approach is referred to as the estimation approach to word alignment in the present thesis. Statistical translation models, as the ones described in section 0.6.2, are mainly studied for the purpose of machine translation. However, research on word alignment and bilingual lexicon extraction using such models has recently become very intense. Translation models have been improved in various ways and tools for statistical machine translation have become available in recent years. Estimation approaches using translation models have been incorporated into our word alignment approach described in section 0.16.3.