... corpora1.1
The concept of parallel corpora will be explained in section 0.4.
... VINNOVA1.2
Swedish Agency for Innovation Systems
... languages2.1
Here, the term parallel corpora refers exclusively to multilingual parallel corpora. Other types of parallel corpora include diachronic corpora (different versions of the same document from different periods of time) and transcription corpora (e.g. textual representations of spoken language or dialects aligned to a corresponding standard-language text) [Mer99b].
... corpus2.2
http://nl.ijs.si/ME/CD/docs/1984.html
... EUROPARL2.3
http://www.isi.edu/~koehn/publications/europarl/
... corpus2.4
http://logos.uio.no/opus/
... information2.5
Point-wise mutual information differs from the standard measure of mutual information in information theory. Mutual information $I(X;Y)$ measures how well one random variable predicts another, i.e. how much information about a random variable $Y$ is contained in another random variable $X$ and vice versa. It is defined as the weighted sum over all possible event combinations: $I(X;Y)=\sum_{x}\sum_{y}p(x,y)\log_{2}\frac{p(x,y)}{p(x)p(y)}$. Point-wise mutual information considers only one specific ``point'' of the probability distribution [MS99]. The random variables involved here are binary, i.e. their distribution includes only two probabilities: one that a certain event occurs (e.g. a word occurs in a corpus) and one that it does not. In this case, point-wise mutual information considers only the point where the event (or the joint event) actually happens and discards the other combinations: $I(x,y)=\log_{2}\frac{p(x,y)}{p(x)p(y)}$. Point-wise mutual information is sometimes referred to as specific mutual information, whereas the mutual information from information theory is called average mutual information [SMH96]. In computational linguistics, the term mutual information has often been used to denote point-wise mutual information; the reader should keep this in mind when consulting the literature.
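The difference between the two measures can be illustrated with a minimal Python sketch. The function names `pointwise_mi` and `average_mi` and the dictionary representation of the joint distribution are assumptions made for this illustration only; they are not taken from any tool described in the thesis.

```python
import math


def pointwise_mi(p_xy, p_x, p_y):
    """Point-wise (specific) mutual information of one event pair, in bits."""
    return math.log2(p_xy / (p_x * p_y))


def average_mi(joint):
    """Average mutual information I(X;Y) in bits.

    `joint` maps (x, y) outcome pairs to their joint probability p(x, y).
    The marginals p(x) and p(y) are obtained by summing over the joint table.
    """
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    # Weighted sum over all event combinations; zero-probability cells add 0.
    return sum(
        p * math.log2(p / (p_x[x] * p_y[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


# Two binary variables that always take the same value.
perfectly_correlated = {(0, 0): 0.5, (1, 1): 0.5}
mi = average_mi(perfectly_correlated)   # all four cells are considered
pmi = pointwise_mi(0.5, 0.5, 0.5)       # only the point (1, 1) is considered
```

For rare events the two values can differ considerably, because the point-wise score ignores the cells in which one or both events do not occur.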
... languages2.6
Note that lexical items may be single words as well as phrases or even whole sentence fragments. Note also that it may be necessary to change lexical boundaries for different language pairs. This is often the case when the segmentation into lexical concepts differs between languages. For example, a broad concept may be bound to one particular word in one language, whereas a second language may require a whole phrase to express the same meaning. A third language, however, may use a set of sub-concepts similar to those of the second language. In this case, lexical boundaries should probably differ when aligning words of languages two and three, compared with an alignment of words of language one with words of either of the other two languages. Similar problems arise from morphological and derivational differences between languages. For example, in one of our parallel corpora the Swedish compound ``regeringsförklaring'' is translated into the English noun phrase ``statement of government policy'' and into the French ``déclaration de politique générale du gouvernement''. An English-French word alignment with links between (statement - déclaration), (of - de), (government policy - politique générale du gouvernement) is perfectly acceptable, whereas a Swedish-English alignment requires a link between the Swedish compound and the complete noun phrase in English (similarly for Swedish-French).
... t-distribution2.7
The t-distribution is used instead of the normal distribution for hypothesis tests on random variables with unknown standard deviations. Student's t-distribution depends on the number of observations, which determines the degrees of freedom. The distribution approaches the standard normal distribution for high degrees of freedom.
...$t$2.8
One distinguishes between one-tailed and two-tailed tests, depending on whether the hypothesis is directional or not.
... mean2.9
The random process of generating bigrams is modeled as a Bernoulli trial with $p=p(w_{s},w_{t})$ as the probability that the bigram $w_{s} w_{t}$ is produced and $(1-p)$ as the probability of any other outcome. The variance of such a distribution can be approximated as $\sigma^{2}=p(1-p)\approx p$ if $p$ is small, which is the case for most bigrams in a corpus [MS99].
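The approximation can be checked with a short Python sketch. The t-score function below is an assumption: it follows the common textbook formulation of the t-test for collocations (null hypothesis of independence, sample variance replaced by the observed bigram probability) and is not necessarily the exact computation used in this thesis. All function and parameter names are hypothetical.

```python
import math


def bernoulli_variance(p):
    """Exact variance of a Bernoulli trial with success probability p."""
    return p * (1.0 - p)


def bigram_t_score(count_bigram, count_ws, count_wt, n):
    """t-score of a bigram under the null hypothesis of independence.

    count_bigram: frequency of the bigram (w_s, w_t)
    count_ws, count_wt: frequencies of the two words
    n: total number of bigrams in the corpus
    The sample variance is approximated by the observed probability p,
    which is justified when p is small.
    """
    if count_bigram <= 0:
        raise ValueError("the bigram must occur at least once")
    p_observed = count_bigram / n
    p_expected = (count_ws / n) * (count_wt / n)
    return (p_observed - p_expected) / math.sqrt(p_observed / n)


p = 1e-6
# The absolute error of the approximation is p squared.
approximation_error = p - bernoulli_variance(p)
```

The error of the approximation is $p^{2}$, which is negligible for the small probabilities that are typical of bigrams.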
... string2.10
Boldface variables such as ${\bf s}$ denote strings of outcomes of a random variable such as the source language $S$. Probabilities such as $P({\bf s})$ denote probabilities of events $\{S={\bf s}\}$, i.e. $P({\bf s})$ is a short form for $P(S={\bf s})$.
... IBM2.11
Translation models from this study are often referred to as the IBM models 1 to 5.
... maximum2.12
For efficiency reasons, approximate estimation techniques have to be used when running EM on fertility-based models.
... probability2.13
The notation follows that used in section 0.6.2.
... F-value2.14
The balanced F-value is derived from the weighted $F_{\beta}$ measure, which is defined as the ratio $F_{\beta}=\frac{(\beta^{2}+1)\cdot P\cdot R}{\beta^{2}\cdot P+R}$. Setting $\beta=1$ ``balances'' precision and recall, i.e. both rates are weighted to be equally important.
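The definition translates directly into code. The following Python function is a minimal sketch; the name `f_beta` and the convention of returning 0 when the denominator is zero are assumptions made for this illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted F-measure of precision P and recall R.

    With beta = 1 this is the balanced F-value, i.e. the harmonic
    mean 2PR / (P + R) of precision and recall.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Assumed convention: F is 0 when both precision and recall are 0.
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denominator


balanced = f_beta(0.5, 0.5)                  # balanced F-value
recall_heavy = f_beta(0.9, 0.3, beta=2.0)    # beta > 1 emphasises recall
```

Values of $\beta$ greater than 1 give more weight to recall, and values below 1 give more weight to precision.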
... translations2.15
By divergent translations we mean insertions, deletions, errors, or other unexpected parts in translated text.
... restrictive2.16
Restrictive evaluation is evaluation that disregards partially correct alignments.
... TEI3.1
TEI is the Text Encoding Initiative (http://www.tei-c.org/).
... MATS3.2
http://stp.ling.uu.se/mats
... DTD3.3
The prefixes ``liu'' (figure 6) and ``lin'' (figure 7) refer to Linköping University.
... OpenOffice.org3.4
OpenOffice.org is an open-source office suite.
... PHP3.5
PHP: Hypertext Preprocessor (PHP) is a widely used general-purpose scripting language that is available as open source.
... KDE3.6
The K Desktop Environment (KDE) is a free graphical desktop environment for UNIX workstations.
... segments5.1
Single-word bitext segments are bitext segments with only one word in either the source or the target language segment. Sentence alignment may produce many of them, for instance by aligning table cells.
... words5.2
We use sets in order to include multi-word units (MWUs) in the definition. Word order is not explicitly defined but may be used as a feature.
... tri-grams5.3
A short list of stop words has been used to recognize common phrase boundaries.
... SUC-tags5.4
SUC is the Stockholm-Umeå corpus of 1 million running Swedish words [EKÅ92].
...declarativeclue5.5
The first column contains English part-of-speech tags; the second column contains the parts of Swedish part-of-speech tags that are produced by the substitution pattern shown in the header of the example (target).
... proposals5.6
Partially correct links include at least one correct source language word and at least one correct target language word, i.e. $\vert aligned_{src}^{x}\cap correct_{src}^{x}\vert>0$ and $\vert aligned_{trg}^{x}\cap correct_{trg}^{x}\vert>0$. In all other cases the link is called incorrect and $Q_{x}\equiv 0$ by definition.
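The overlap condition can be sketched with Python sets. The function name `shares_correct_words` and the representation of a link as sets of word tokens are assumptions made for this illustration; the example words are hypothetical.

```python
def shares_correct_words(aligned_src, aligned_trg, correct_src, correct_trg):
    """Check the overlap condition for a proposed link x.

    Returns True if the proposed link contains at least one correct source
    language word and at least one correct target language word. Links that
    fail this test are incorrect, and their score Q_x is 0 by definition.
    """
    src_overlap = set(aligned_src) & set(correct_src)
    trg_overlap = set(aligned_trg) & set(correct_trg)
    return len(src_overlap) > 0 and len(trg_overlap) > 0


gold_src = {"regeringsförklaring"}
gold_trg = {"statement", "of", "government", "policy"}

# Covers part of the gold standard on the target side: the condition holds.
partial = shares_correct_words(
    {"regeringsförklaring"}, {"statement", "of"}, gold_src, gold_trg
)

# No correct target language word: the link is incorrect.
incorrect = shares_correct_words(
    {"regeringsförklaring"}, {"the"}, gold_src, gold_trg
)
```

Note that a completely correct link also satisfies this condition, so the test only separates incorrect links from all others.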
... links5.7
Again, $Q_{x}\equiv 0$ for incorrect links, for both precision and recall.
... standard5.8
Two links in the MWU-splitting gold standard have been marked as probable (P) for the sake of explanation. This might not exactly meet the guidelines for creating such references.
...Ahrenberg.Merkel.ea:995.9
Note that the evaluation metrics used in this report differ from the ones used in this thesis.
... GIZA++5.10
GIZA++ implements IBM's translation models 1 to 5 and is freely available from http://www-i6.informatik.rwth-aachen.de/web/Software/GIZA++.html, provided by Franz Josef Och. The system implements several refinements of the statistical alignment models discussed in section 0.6.2 [ON00b].
... corpus5.11
The figure illustrates a typical problem with manual alignment of word samples. The example of the ``fuzzy link'' may seem odd because it does not include ``to him'' in the English part of the link corresponding to the Swedish translation ``inte tillhör hans släkt'' [does not belong to his family]. Alignment decisions have to be made, and mistakes are always possible. This particular alignment would probably have been different if the complete sentence had been aligned rather than only the sampled source language word (``unrelated'').
... respectively5.12
Weights are chosen intuitively rather than on the basis of empirical investigations. Scores are simply truncated to 1 in cases where they exceed 1.
... filters6.1
The remaining entries are alignment errors.
...precision6.2
Precision gives the percentage of correct terms among all extracted term candidates.
... lexemes6.3
A similar project has recently been carried out in co-operation with Systran. The lexical components of the Swedish-English and Swedish-Danish engines of the EC Systran machine translation system have been built from automatically extracted word type links using our Clue Aligner. This project is part of the European Commission Contract, SDT/MT 2003-1: Extension of EC Systran to Danish and Swedish into English.