- ... corpora1.1
- The concept of
parallel corpora will be explained in section
0.4.
- ... VINNOVA1.2
- Swedish Agency for Innovation
Systems
- ... languages2.1
- Here, we refer to parallel corpora
exclusively in terms of multilingual parallel corpora. Other types
of parallel corpora include diachronic corpora (different versions
of the same document from different periods of time) and
transcription corpora (e.g. textual representations of spoken
language or dialects aligned to a corresponding standard language
text). [Mer99b]
- ... corpus2.2
-
http://nl.ijs.si/ME/CD/docs/1984.html
- ...
EUROPARL2.3
- http://www.isi.edu/~koehn/publications/europarl/
- ... corpus2.4
- http://logos.uio.no/opus/
- ...
information2.5
- Point-wise mutual information
differs from the standard measure of mutual information $I(X;Y)$
in information theory. Mutual information $I(X;Y)$ measures how well
one random variable predicts another, i.e. how much information
about a random variable $X$ is included in another random variable $Y$
and vice versa. It is defined as the weighted sum over all possible
event combinations:
$$I(X;Y) = \sum_{x} \sum_{y} P(x,y) \log \frac{P(x,y)}{P(x)\,P(y)}$$
Point-wise mutual information considers only one specific
``point'' of the probability distribution [MS99]. The
random variables involved here are binary, i.e. their distribution
includes only two probabilities, one that a certain event occurs
(e.g. a word occurs in a corpus) and the other that the event does
not occur. In this case, point-wise mutual information considers only the
point where the event (or the joint event) actually happens and
discards the other combinations:
$$I(x,y) = \log \frac{P(x,y)}{P(x)\,P(y)}$$
Point-wise mutual information is
sometimes referred to as specific mutual information, whereas
the mutual information from information theory is called average mutual information
[SMH96]. In computational linguistics, the term
mutual information has often been used to denote point-wise mutual
information. The reader should be aware of this when consulting
the literature.
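The difference between the two measures can be illustrated with a minimal sketch for two binary events. The function names and the choice of base-2 logarithms are assumptions made for this illustration; they are not taken from the thesis.

```python
import math


def pointwise_mi(p_xy, p_x, p_y):
    """Point-wise MI: only the point where both events occur is considered."""
    return math.log2(p_xy / (p_x * p_y))


def average_mi(p_xy, p_x, p_y):
    """Average MI of two binary variables.

    p_xy is P(X=1, Y=1); p_x and p_y are the marginals P(X=1) and P(Y=1).
    All four event combinations are summed, each weighted by its joint
    probability.
    """
    joint = {
        (1, 1): p_xy,
        (1, 0): p_x - p_xy,
        (0, 1): p_y - p_xy,
        (0, 0): 1.0 - p_x - p_y + p_xy,
    }
    marg_x = {1: p_x, 0: 1.0 - p_x}
    marg_y = {1: p_y, 0: 1.0 - p_y}
    total = 0.0
    for (x, y), p in joint.items():
        if p > 0.0:  # combinations with zero probability contribute nothing
            total += p * math.log2(p / (marg_x[x] * marg_y[y]))
    return total


# Two rare events that always occur together (hypothetical probabilities).
rare_pmi = pointwise_mi(0.01, 0.01, 0.01)
rare_mi = average_mi(0.01, 0.01, 0.01)
```

For rare events that always co-occur, the point-wise score is far larger than the average score, which is one reason the two measures should not be confused.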
- ... languages2.6
- Note that lexical
items may be single words as well as
phrases or even whole sentence fragments. Note also that lexical
boundaries may have to change for different language
pairs. This is often the case when the segmentation into lexical
concepts differs between languages. For example, a large concept may
be bound to one particular word in one language, whereas a second
language may require a whole phrase to
express the same meaning. A third language, however, may use a set of
sub-concepts similar to the ones in the second language.
In this case,
lexical boundaries should probably differ when aligning words of
languages two and three compared with aligning words of language one
with words of either of the other two. Similar problems appear
with morphological and derivational differences between
languages. For example, in one of our parallel corpora the Swedish
compound ``regeringsförklaring'' is translated into the English noun
phrase ``statement of government policy'' and into the French
``declaration de politique générale du gouvernement''.
An English-French word alignment with links between
(statement - declaration), (of - de),
(government policy - politique générale du gouvernement) is perfectly
acceptable, whereas a Swedish-English alignment requires a link
between the Swedish compound and the complete English noun phrase
(similarly for Swedish-French).
- ... t-distribution2.7
- The t-distribution
is used
instead of the normal distribution
for hypothesis tests on
random variables with unknown standard deviations. Student's t-distributions
depend on the number
of observations, which determines the degrees of freedom. The
distribution approaches the standard normal distribution
for high degrees of
freedom.
- ...
2.8
- One distinguishes between one-tail and two-tail tests
depending on whether the hypothesis is directional or not.
- ...
mean2.9
- The random process of generating bigrams is modeled as a
Bernoulli trial with $p$ for the probability of the
bigram in question to be produced and $1-p$ for the probability
of any other outcome.
Variances of such distributions can be
approximated as $\sigma^2 = p(1-p) \approx p$
if $p$ is small, which is the
case for most bigrams in a corpus [MS99].
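A short numerical sketch shows how good this approximation is. It writes p for the bigram probability (a symbol chosen here for illustration) and uses the standard Bernoulli variance p(1-p); the function names are hypothetical.

```python
def bernoulli_variance(p):
    """Exact variance of a Bernoulli trial with success probability p."""
    return p * (1.0 - p)


def relative_error_of_approximation(p):
    """Relative error made when the variance p(1-p) is replaced by p."""
    exact = bernoulli_variance(p)
    return (p - exact) / exact


# Hypothetical bigram probabilities, from frequent to rare.
errors = {p: relative_error_of_approximation(p) for p in (0.1, 0.01, 0.0001)}
```

The relative error equals p/(1-p), so it shrinks roughly in proportion to p and is negligible for the very small probabilities typical of bigrams.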
- ... string2.10
- Boldface variables such as $\mathbf{s}$
denote strings of outcomes of a random variable such
as the source language $S$. Probabilities such as $P(\mathbf{s})$
denote probabilities of events $S = \mathbf{s}$, i.e. $P(\mathbf{s})$ is a
short form for $P(S = \mathbf{s})$.
- ... IBM2.11
- Translation
models from this study are often referred to as the IBM models 1 to
5.
- ... maximum2.12
- For efficiency
reasons, approximate estimation techniques have to be used when
running EM on fertility-based models.
- ... probability2.13
- The notation
follows the one used in section 0.6.2.
- ... F-value2.14
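- The balanced
F-value is derived from the weighted $F_\beta$
measure, which is
defined as the ratio
$F_\beta = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 \cdot P + R}$
of precision $P$ and recall $R$. Setting $\beta = 1$
``balances'' precision and recall, i.e. both rates are weighted to be
equally important.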
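As a minimal sketch, the weighted measure can be computed as follows. The code assumes the common F-beta parameterization, (beta^2 + 1) * P * R / (beta^2 * P + R); whether the thesis uses exactly this form is an assumption, and the function name is hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted F-measure; beta = 1 gives the balanced F-value."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        # No meaningful score when the denominator vanishes.
        return 0.0
    return (b2 + 1.0) * precision * recall / denominator


# Hypothetical precision and recall values.
balanced = f_measure(0.8, 0.4)                # harmonic mean of 0.8 and 0.4
recall_heavy = f_measure(0.8, 0.4, beta=2.0)  # recall weighted more strongly
```

With beta = 1 the score reduces to the harmonic mean of precision and recall; with beta > 1 the lower recall in this example pulls the score down further.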
- ... translations2.15
- By
divergent translations we mean insertions,
deletions, errors or other unexpected parts in translated text.
- ... restrictive2.16
- Restrictive evaluation refers to evaluation
disregarding partly correct
alignments.
- ... TEI3.1
- TEI is the
Text Encoding Initiative (http://www.tei-c.org/).
- ... MATS3.2
- http://stp.ling.uu.se/mats
- ... DTD3.3
- The prefixes ``liu'' (figure
6) and ``lin'' (figure 7) refer to
Linköping University.
- ...
OpenOffice.org3.4
- OpenOffice.org is an open source office suite.
- ...
PHP3.5
- PHP: Hypertext Preprocessor (PHP) is a
widely used general-purpose scripting language which is available as
open source.
- ... KDE3.6
- The K Desktop
Environment (KDE) is a free graphical desktop environment for UNIX
workstations.
- ...
segments5.1
- Single-word bitext segments denote bitext segments
with only one word on either the source or target language
segment. Sentence alignment may produce many of them by aligning,
for instance, table cells.
- ... words5.2
- We
use sets in order to include MWUs in the definition. Word order is
not explicitly defined but may be used as a feature.
- ... tri-grams5.3
- A short list of stop
words has been used to recognize common phrase boundaries.
- ... SUC-tags5.4
- SUC is the
Stockholm-Umeå corpus of 1 million running Swedish words
[EKÅ92].
- ...declarativeclue5.5
- The first column contains English
part-of-speech tags and the second column contains parts of
Swedish part-of-speech tags, produced by the substitution
pattern shown in the header of the example (target).
- ... proposals5.6
- Partially correct links include at least one
correct source language word and at least one correct target
language word, i.e.
and
. In all other
cases the link is called incorrect and
by
definition.
- ... links5.7
- Again,
for
incorrect links for both precision and recall.
- ...
standard5.8
- Two links in
the MWU-splitting gold standard
have been marked as probable (P) for the sake of
explanation. This might not exactly meet the guidelines
for creating such references.
- ...Ahrenberg.Merkel.ea:995.9
- Note
that the evaluation metrics used in this report differ from the ones used in
this thesis.
- ... GIZA++5.10
- GIZA++ implements IBM's translation models
1 to 5 and is
freely available from
http://www-i6.informatik.rwth-aachen.de/web/Software/GIZA++.html
provided by Franz Josef Och. The system implements several
refinements of the statistical alignment models
discussed in section 0.6.2 [ON00b].
- ... corpus5.11
- The figure illustrates a typical problem
with manual alignment of word samples. The example of the ``fuzzy
link'' may seem odd because it does not include ``to him'' in the
English part of
the link corresponding to the Swedish translation ``inte tillhör
hans släkt'' [does not belong to his family]. Alignment decisions have
to be made and mistakes are always possible. This particular
alignment would probably have been different if the complete sentence had been
aligned instead of only the sampled source language word (``unrelated'').
- ...
respectively5.12
- Weights are chosen
intuitively rather than using
empirical investigations. Scores are simply truncated in cases where
they exceed 1.
- ...
filters6.1
- The remaining entries are alignment errors.
- ...precision6.2
- Precision gives the percentage of correct terms
among all extracted term candidates.
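This definition can be sketched in a few lines. The function name and the example term lists below are hypothetical and serve only to illustrate the calculation.

```python
def term_precision(extracted, correct):
    """Percentage of extracted term candidates that are correct terms."""
    extracted = set(extracted)
    if not extracted:
        return 0.0  # nothing extracted, so no meaningful precision
    return 100.0 * len(extracted & set(correct)) / len(extracted)


# Hypothetical candidates and hypothetical judgements of correctness.
candidates = ["word alignment", "parallel corpus", "of the", "in order"]
judged_correct = {"word alignment", "parallel corpus"}
score = term_precision(candidates, judged_correct)
```

In this made-up example two of the four candidates are judged correct, giving a precision of 50 percent.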
- ...
lexemes6.3
- A similar project has recently been carried out in
co-operation with Systran. The lexical components of the
Swedish-English and Swedish-Danish engines of the EC Systran machine
translation system have been built from automatically extracted word
type links using our Clue Aligner. This project is part of the
European Commission Contract, SDT/MT 2003-1: Extension of EC Systran to
Danish and Swedish into English.