Alpino is a collection of tools and programs for parsing Dutch sentences into dependency structures.
The binary distribution is available only for the x86-linux-glibc2.3 platform (specific requirements are listed below). Installing involves the following steps:
Download the file Alpino.tar.gz. The binaries are now included in the source distribution.
Extract the contents of that file in a place of your liking, e.g., in your home-directory, and do the following:
tar xzf Alpino.tar.gz
The binary file depends on the Tcl/Tk libraries as well as a number of system libraries. The specific dependencies of the current release are as follows (subject to change - created with a SuSE 9 system):
libstdc++.so.5
libtk8.4.so
libtcl8.4.so
libSM.so.6
libICE.so.6
libX11.so.6
libdl.so.2
libm.so.6
libpthread.so.0
libc.so.6
libgcc_s.so.1
/lib/ld-linux.so.2
In addition, some of the tools and programs that are shipped with Alpino require additional packages. Here is the beginning of a list. Typically you need a recent version ;-)
tcl/tk
libxml2
libxslt
python
python library libxml2
python library libxslt
dictzip (sometimes packaged with dictd)
java (for Thistle)
libpopt
libz
In order to use Alpino and the related tools, you need to define the ALPINO_HOME environment variable. This variable should point to the directory that contains the Alpino distribution. A typical way to do this is to add the following lines to the file .bashrc in your home directory, assuming that you use bash and that you extracted Alpino into a directory called Alpino in your home directory:
export ALPINO_HOME=$HOME/Alpino
export PATH=$PATH:$ALPINO_HOME/bin
The various usages of Alpino can be summarized as follows:
Alpino [Options]
Alpino [Options] -parse
Alpino [Options] -parse W1 .. Wn
In the first form, the system will work in interactive mode. By default, it will start the graphical user interface. Use the options -tk or -notk to enable/disable the gui explicitly.
In interactive mode, you are in effect using the Hdrug command interpreter and (optionally) the Hdrug graphical user interface. Refer to the Hdrug documentation for more information.
In the second form, the system will read lines from standard input. Each line is then parsed, and various output is produced (the actual output depends on the various options that are set).
Finally, in the third form, the system will parse the single sentence given on the command line, provide output (again, depending on the options), and quit. In this form, the input sentence is given as a sequence of tokens W1 .. Wn on the command line.
There are a whole bunch of options which affect the way in which the system runs. For most options, there is documentation available from the graphical user interface. In order to have access to this, run:
Alpino
Some frequently used options are given here.
Flag=Value
Assigns Value to global variable Flag. See below for a list of flags with suggested values.
The remaining options all start with a dash:
-flag Flag Value
Similar to the above. The difference is that in the Flag=Value syntax, the Value is parsed as a Prolog term. In the -flag Flag Value syntax, on the other hand, the Value is parsed as a Prolog atom. In many cases this does not make a difference. Remember: for path names, you need the second format.
-tk
Use the graphical user interface. Only makes sense for interactive operation.
-notk
Do not use the graphical user interface.
Evaluates Prolog Goal; Goal is parsed as Prolog term.
Loads the Prolog file, using the Prolog goal use_module(File).
-fast
With this option, a number of global variables are set in such a way that the system uses the part-of-speech pre-processor, and only produces a single (what should be the best) parse. This is the default.
The alternative to the -fast option. The system will find all possible parses for the input. No part-of-speech pre-processor is applied. No beams are used in search.
This option uses an even more aggressive set of options to improve the speed of the parser, at the cost of reduced accuracy.
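The two flag syntaxes and the three usage forms can be combined mechanically. The following Python sketch only assembles an argument list; it does not run Alpino. The function name alpino_args and the path /home/user/tmp are hypothetical, and the flags used (treebank, end_hook) are the ones that appear in the examples of this guide.

```python
def alpino_args(flags=None, atom_flags=None, options=(), tokens=None):
    """Build an argument list for the Alpino command (illustrative helper).

    flags      -- dict written as Flag=Value (Value is parsed as a Prolog term)
    atom_flags -- dict written as -flag Flag Value (Value is parsed as a
                  Prolog atom; this is the form to use for path names)
    options    -- dashed options such as "-fast" or "-notk"
    tokens     -- None for interactive mode, [] for "-parse" reading from
                  standard input, or a token list for "-parse W1 .. Wn"
    """
    args = ["Alpino"]
    args.extend(options)
    for name, value in (atom_flags or {}).items():
        args.extend(["-flag", name, str(value)])
    for name, value in (flags or {}).items():
        args.append(f"{name}={value}")
    if tokens is not None:
        args.append("-parse")
        args.extend(tokens)
    return args


# Hypothetical treebank directory; the path is passed with -flag because
# it must be read as an atom, not as a Prolog term.
cmd = alpino_args(
    options=["-fast"],
    atom_flags={"treebank": "/home/user/tmp"},
    flags={"end_hook": "xml"},
    tokens=[],
)
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run together with the input sentences on standard input, provided Alpino is installed and on the PATH.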
There are (way too) many global variables (called flags) which will alter the behavior of Alpino. These flags typically already have a default value. The value can be changed using command line options. The values can also be changed in interactive mode using either the command interpreter or the graphical user interface. In the graphical user interface, documentation for each option is provided. For users of the binary distribution, the following options appear relevant:
debug
Integer (typically 0, 1 or 2) which determines the number and detail of debug and continuation messages.
Boolean flag which determines if a visual impression of the parse is produced (either to standard output, or the graphical interface if it is running).
disambiguation_beam
Integer which determines during unpacking how many best analyses are kept for each 'maximal projection'. A larger value implies slower and more accurate processing. The value 0 is special: in that case the system performs a full search (hence maximal accuracy and minimal speed). The value of this flag is ignored in case unpack_bestfirst=off.
Integer which determines during parsing how many parses are produced for any given maximal projection. From these, only the best are kept for further processing later (using the disambiguation_beam flag). This flag can be used to limit the number of parses that are computed in the first place. A value of 0 means that all parses are produced. If the value is N>0, then only the first N parses are computed.
Integer which determines during parsing (first phase) how many sub-parses are produced for any given sub-goal. A value of 0 means that no limit on this number is enforced.
Integer which you can use to enforce that Alpino first ignores a number of lines, and then starts parsing after that many lines. Normally, the value of this flag will be 0 (parse all lines), but in some cases it is useful to ignore the first part of an input file (because perhaps those lines already were parsed earlier).
end_hook
See below for some examples.
Integer which determines minimum sentence length. Sentences with fewer words are ignored.
Integer which determines maximum sentence length. Sentences with more words are ignored.
number_analyses
This flag determines the number of analyses that is passed on by the robustness / disambiguation component. If the value is 0, then the system simply finds all solutions.
pos_tagger
Boolean flag to determine whether to use a POS-tagger to filter the result of lexical lookup. The POS-tagger is based on unigram, bigram and trigram frequencies. This filter can be made more or less strict using the pos_tagger_n flag.
pos_tagger_n
This flag takes a numerical value which determines how much filtering should be done by the POS-tagger which filters the result of lexical lookup (if pos_tagger=on). For each position in the string the tagger compares the combined forward and backward probability of a tag with the best score. If the score of a tag is greater than the best score plus the value of this flag, then the tag is removed. Thus, a lower value indicates that more filtering is done.
Boolean flag which determines whether or not dependency graphs should be mapped to a format which is closer to the format used in the CGN treebank.
This flag takes a numerical value which is then used as the maximum number of milliseconds that the system is given for each sentence. So, if the value is set to 60000 and a parse for a given sentence requires more than a minute of CPU time, that parse is aborted. Because the system can sometimes spend a very long time on a single (long, very ambiguous) sentence, it is often a good idea to use this time-out.
Boolean flag that indicates whether the system should produce verbose xml output (containing various detailed lexical features).
Perhaps the most important flag is the end_hook flag which determines what the system should do with a parse that it has found. Typically this involves printing certain information concerning the parse to standard error, standard output or a file. Various examples are provided below.
Suppose you want to save the dependency structure of the best parse of each sentence as xml. In that case, the following command might be what you want:
Alpino -fast -flag treebank $HOME/tmp \
  end_hook=xml -parse
For each sentence read from standard input, an xml file will be created in the directory $HOME/tmp containing the CGN dependency structure. The files are named 1.xml, 2.xml, … etc (if you want to start counting from N+1, then you can use the current_ref=N flag to initialize the counter at N).
For browsing and querying such xml files, check out the scripts that come with the Alpino Treebank.
Rather than a full dependency structure for each sentence, it might be that you only care for the dependency triples (two words, and the name of the dependency relation). For this, you might try:
Alpino -fast end_hook=triples -parse
For an input consisting of the three lines:
ik houd van spruitjes
ik ben gek op spruitjes
Mijn minimal brain dysfunction speelt weer op
this writes out:
houd/[1,2]|su|ik/[0,1]|1
houd/[1,2]|pc|van/[2,3]|1
van/[2,3]|obj1|spruitje/[3,4]|1
ben/[1,2]|su|ik/[0,1]|2
ben/[1,2]|predc|gek/[2,3]|2
gek/[2,3]|pc|op/[3,4]|2
op/[3,4]|obj1|spruitje/[4,5]|2
speel_op/[4,5]|su|minimal brain dysfunction/[1,4]|3
minimal brain dysfunction/[1,4]|det|mijn/[0,1]|3
speel_op/[4,5]|mod|weer/[5,6]|3
speel_op/[4,5]|svp|op/[6,7]|3
Each line is a single dependency triple. The line contains four fields separated by the | character. The first field is the head word, the second field is the dependency name, the third field is the dependent word, and the fourth field is the sentence number. The words are represented as Stem/[Start,End] where Stem is the stem of the word, and Start and End are string positions. If you also want two additional fields with the POS-tags of each word, then use the end_hook=triples_with_frames option. Typical output then looks like:
houd/[1,2]|verb|mod|niet/[2,3]|adv|1
houd/[1,2]|verb|pc|van/[3,4]|prep|1
houd/[1,2]|verb|su|ik/[0,1]|pron|1
van/[3,4]|prep|obj1|smurfen/[4,5]|noun|1
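Reading the four-field end_hook=triples output back into a program takes only a few lines. The Python sketch below is not part of Alpino; the names Word, Triple, parse_word and parse_triple are made up for this illustration, and it assumes that stems never contain the | character. It handles the four-field format only, not the six-field triples_with_frames variant.

```python
import re
from typing import NamedTuple


class Word(NamedTuple):
    stem: str
    start: int
    end: int


class Triple(NamedTuple):
    head: Word
    relation: str
    dependent: Word
    sentence: str  # kept as a string, since sentence keys need not be numbers


# Stem/[Start,End]; the stem itself may contain spaces (multi-word units).
_WORD = re.compile(r"^(.*)/\[(\d+),(\d+)\]$")


def parse_word(text):
    match = _WORD.match(text)
    if match is None:
        raise ValueError(f"not a Stem/[Start,End] field: {text!r}")
    return Word(match.group(1), int(match.group(2)), int(match.group(3)))


def parse_triple(line):
    fields = line.rstrip("\n").split("|")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    head, relation, dependent, sentence = fields
    return Triple(parse_word(head), relation, parse_word(dependent), sentence)


triple = parse_triple("speel_op/[4,5]|su|minimal brain dysfunction/[1,4]|3")
print(triple.relation, triple.dependent.stem, triple.dependent.start, triple.dependent.end)
```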
Perhaps your interest lies in the syntactic structure assigned by Alpino. The following writes out the syntactic structure for each parse that Alpino finds:
Alpino number_analyses=1000 end_hook=syntax -parse
Each parse is written as a bracketed string, where each opening bracket is followed by the category (preceded by the @ sign). For the phrase
de leuke en vervelende kinderen
the (rather verbose) output is as follows. Each line starts with the sentence number, followed by the separation character | and the annotated sentence:
1| [ @top_cat [ @start [ @max [ @np [ @det de ] [ @n [ @a [ @a leuke ] [ @clist [ @optpunct ] [ @conj en ] [ @a vervelende ] ] ] [ @n kinderen ] ] ] ] ] [ @optpunct ] ]
1| [ @top_cat [ @start [ @max [ @np [ @det de ] [ @n [ @n [ @a leuke ] ] [ @clist [ @optpunct ] [ @conj en ] [ @n [ @a vervelende ] [ @n kinderen ] ] ] ] ] ] ] [ @optpunct ] ]
1| [ @top_cat [ @start [ @max [ @np [ @np [ @det de ] [ @n [ @a leuke ] ] ] [ @clist [ @optpunct ] [ @conj en ] [ @np [ @n [ @a vervelende ] [ @n kinderen ] ] ] ] ] ] ] [ @optpunct ] ]
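Because every bracket and every category label is a separate whitespace-delimited token, the end_hook=syntax output can be turned into a tree with a small recursive-descent reader. The sketch below is an illustration in Python, not code shipped with Alpino; the function names and the (category, children) tuple representation are choices made here. It assumes one parse per line in the form Key| [ @cat ... ].

```python
def parse_syntax_line(line):
    """Split 'Key| [ @cat ... ]' into the key and a (category, children) tree."""
    key, _, bracketed = line.partition("|")
    tokens = bracketed.split()
    tree, pos = _parse_node(tokens, 0)
    if pos != len(tokens):
        raise ValueError("trailing tokens after the closing bracket")
    return key.strip(), tree


def _parse_node(tokens, pos):
    # A node is: "[" "@category" (node | word)* "]"
    if pos + 1 >= len(tokens) or tokens[pos] != "[" or not tokens[pos + 1].startswith("@"):
        raise ValueError(f"expected '[ @category' at token {pos}")
    category = tokens[pos + 1][1:]
    pos += 2
    children = []
    while True:
        if pos >= len(tokens):
            raise ValueError("missing closing bracket")
        if tokens[pos] == "]":
            return (category, children), pos + 1
        if tokens[pos] == "[":
            child, pos = _parse_node(tokens, pos)
        else:
            child, pos = tokens[pos], pos + 1
        children.append(child)


def leaves(tree):
    """Return the words of the tree from left to right."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


line = ("1| [ @top_cat [ @start [ @max [ @np [ @det de ] [ @n [ @a [ @a leuke ] "
        "[ @clist [ @optpunct ] [ @conj en ] [ @a vervelende ] ] ] [ @n kinderen ] ] ] ] ] "
        "[ @optpunct ] ]")
key, tree = parse_syntax_line(line)
print(key, tree[0], " ".join(leaves(tree)))
```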
If you only care for the part-of-speech tags that Alpino used to derive the best parse, then you can use the following command:
Alpino end_hook=frames -parse
In that case, the system prints lines like the following to standard output:
ik|pronoun(nwh,fir,sg,de,nom,def)|1|0|1|normal|pre
vertrek|verb(zijn,sg1,intransitive)|1|1|2|normal|post
houdt|verb(hebben,sg3,pc_pp(van))|2|1|2|normal|post
van|preposition(van,[af,uit,vandaan,[af,aan]])|2|2|3|normal|post
Marietje|proper_name(both)|2|3|4|name(not_begin)|post
Each line represents the information of a word. The information is contained in various fields separated by the | character. The fields represent from left to right: the word, the part-of-speech tag, the sentence number, the begin position of the word, the end position of the word, and two further fields that you probably want to ignore.
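A reader for the end_hook=frames lines can follow the field description directly. This Python sketch is illustrative only: the names Frame and parse_frame are invented here, the remaining fields are kept uninterpreted, and it assumes that neither the word nor the tag contains the | character.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    word: str
    tag: str
    sentence: str  # sentence number or key, kept as a string
    begin: int
    end: int
    rest: tuple    # any further fields, left uninterpreted


def parse_frame(line):
    fields = line.rstrip("\n").split("|")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields: {line!r}")
    word, tag, sentence, begin, end = fields[:5]
    return Frame(word, tag, sentence, int(begin), int(end), tuple(fields[5:]))


frame = parse_frame("van|preposition(van,[af,uit,vandaan,[af,aan]])|2|2|3|normal|post")
print(frame.word, frame.tag, frame.sentence, frame.begin, frame.end)
```

Keeping the trailing fields in a tuple means the same function still works if a given Alpino version prints more fields per line.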
It is fairly easy to define further output formats. Please contact the author if you have specific requests.
If for some reason Alpino does not work, the first thing you should try is to alter in the Alpino initialization script the line
debug=0
into
debug=1
If you are lucky this implies that you get told what is going wrong.
If this doesn't help, send a bug-report to the author. Please.
The sentences that you give to Alpino are assumed to be already tokenized: a single sentence on a line, and each token (word) separated by a single space. Alpino currently only supports the latin1 encoding (iso-8859-1) and will also work with the latin9 (iso-8859-15) encoding. Here are some properly tokenized sentences:
Dat is nog niet duidelijk . '
We zien hier niet het breken van tandenstokers .
Dodelijke wapens worden gesloopt . '
Slechts enkele nieuwe documenten zijn aan het licht gekomen . '
Er is nog geen bewijs van verboden activiteiten gevonden .
If you want to assign an identifier for each sentence, then you write the identifier in front of the sentence, with a | separation character. For instance:
volkskrant20030308_12|Dat is nog niet duidelijk . '
volkskrant20030308_14|We zien hier niet het breken van tandenstokers .
volkskrant20030308_15|Dodelijke wapens worden gesloopt . '
volkskrant20030308_17|Slechts enkele nieuwe documenten zijn aan het licht gekomen . '
volkskrant20030308_20|Er is nog geen bewijs van verboden activiteiten gevonden . '
It follows that the vertical bar is a special character: if a line has no identifier, the sentence itself must not contain a vertical bar.
Lines which start with a percentage sign are ignored: the percentage sign is used to introduce comments:
volkskrant20030308_12|Dat is nog niet duidelijk . '
volkskrant20030308_14|We zien hier niet het breken van tandenstokers .
%% the next sentence is easy:
volkskrant20030308_15|Dodelijke wapens worden gesloopt . '
volkskrant20030308_17|Slechts enkele nieuwe documenten zijn aan het licht gekomen . '
volkskrant20030308_20|Er is nog geen bewijs van verboden activiteiten gevonden . '
If you have corpus material that has not been tokenized yet, then you can also use Alpino to tokenize your input. Use the flag assume_input_is_tokenized for this purpose. If the value is on, then Alpino won't try to tokenize the input. If the value is off, it will tokenize the input.
The output of Alpino uses identifiers. For instance, if the end_hook=xml option has been set, the names of the XML files are constructed out of these identifiers. The sentence identifier is either defined explicitly as part of the input sentence, or it is set implicitly by Alpino (simply counting the sentences of a specific session starting from 1). If there are multiple outputs for a given sentence, for instance if you have defined the option number_analyses=10, then each of the results is indexed with the rank of the analysis. The resulting identifier then consists of the sentence identifier and the rank of the analysis, separated by a dash. Example:
echo "example1|de mannen kussen de vrouwen" | \
  Alpino -notk number_analyses=0 end_hook=frames -parse
This yields:
de|determiner(de)|example1|0|1|normal(normal)|pre|de|0|1
mannen|meas_mod_noun(de,count,pl)|example1|1|2|normal(normal)|pre|man|0|1
kussen|verb(hebben,pl,transitive)|example1|2|3|normal(normal)|post|kus|0|1
de|determiner(de)|example1|3|4|normal(normal)|post|de|0|1
vrouwen|noun(de,count,pl)|example1|4|5|normal(normal)|post|vrouw|0|1
de|determiner(de)|example1-2|0|1|normal(normal)|pre|de|0|1
mannen|meas_mod_noun(de,count,pl)|example1-2|1|2|normal(normal)|pre|man|0|1
kussen|verb(hebben,pl,transitive)|example1-2|2|3|normal(normal)|post|kus|0|1
de|determiner(de)|example1-2|3|4|normal(normal)|post|de|0|1
vrouwen|noun(de,count,pl)|example1-2|4|5|normal(normal)|post|vrouw|0|1
de|determiner(de)|example1-3|0|1|normal(normal)|pre|de|0|1
mannen|meas_mod_noun(de,count,pl)|example1-3|1|2|normal(normal)|pre|man|0|1
kussen|verb(hebben,pl,intransitive)|example1-3|2|3|normal(normal)|post|kus|0|1
de|determiner(de)|example1-3|3|4|normal(normal)|post|de|0|1
vrouwen|noun(de,count,pl)|example1-3|4|5|normal(normal)|post|vrouw|0|1
The following symbols which might occur in input lines are special for Alpino:
| [ ] % ^P
The vertical bar | is used to separate the sentence key from the sentence. Only the first occurrence of a vertical bar on a given line is special.
The square brackets are used for annotating sentences with syntactic information. This is described below.
The percentage sign is only special if it is the first character of a line. In that case the line is treated as a comment, and is ignored.
A line consisting only of ^P (control-P) is treated as an instruction to escape to the Prolog command-line interface. This is probably only useful for interactive usage. Typing "halt." to the Prolog command-line reader will return to the state before ^P was typed.
If you need to parse a line starting with a %-sign, then the easiest solution is to use a key (for interactive usage, perhaps use the empty key):
3|% is een procentteken
|% is een procentteken
The same trick can be used to interpret the vertical bar literally:
4|ik bereken P(x|y)
|ik bereken P(x|y)
Note that it is currently not possible to use keys that start with the symbol % or |.
If you have square brackets in your input, you need to escape these using the backslash. Currently, the backslashes will also be present in the output (this needs to be corrected!).
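The rules for special symbols can be mirrored when preparing or checking an input file. The Python sketch below is only an approximation of what is described here, not Alpino's own reader: the function names are hypothetical, control-P lines are not handled, and an absent key is reported as None, whereas Alpino itself would assign a counter value.

```python
def split_input_line(line):
    """Return None for a comment line, else a (key, sentence) pair.

    key is None when the line carries no explicit key, and the empty
    string when the line starts with a vertical bar (the empty key).
    """
    line = line.rstrip("\n")
    if line.startswith("%"):
        return None  # % as first character: the whole line is a comment
    if "|" in line:
        # Only the first vertical bar is special.
        key, sentence = line.split("|", 1)
        return key, sentence
    return None, line


def escape_brackets(sentence):
    """Backslash-escape square brackets that are meant literally."""
    return sentence.replace("[", "\\[").replace("]", "\\]")


print(split_input_line("4|ik bereken P(x|y)"))
print(split_input_line("%% the next sentence is easy:"))
print(escape_brackets("zie [ 1 ]"))
```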
Alpino supports brackets in the input to indicate:
syntactic structure
lexical assignment
requirement to skip words
include phantom words
A large part of the computation time is spent on finding the correct constituents. For interactive annotation, it is possible to give Alpino hints about the correct constituent structure by putting square brackets around constituents. Each bracket must be surrounded by a single space on both sides; otherwise POS-tagging will go wrong, because a stray space is then treated as a word. This feature is especially useful for attaching modifiers at the correct location, for enumerations, and for complex nestings in syntactic structures. Brackets are normally associated with a category name, using the @-operator. The usual syntactic symbols can be used here, such as @np, @pp, etc. Examples:
Hij fietst [ @pp op zijn gemakje ] [ door de straten van De Baarsjes ] .
FNB zet teksten van [ [ kranten en tijdschriften ] , boeken , [ studie- en vaklectuur ] , bladmuziek , folders , brochures ] om in gesproken vorm .
[ @np De conferentie die betrekking heeft op ondersteunende technologie voor gehandicapten in het algemeen ] , bood [ @np een goed platform [ om duidelijk te maken hoe uitgevers zelf toegankelijke structuren in hun informatie kunnen aanbrengen ] ] .
You can also force a lexical assignment for a token or a series of tokens, if the Alpino lexicon does not contain the proper assignment. The label @postag followed by an Alpino lexical category forces the assignment of the corresponding lexical category to the words contained in the brackets. This comes in handy when part of the sentence is written in a foreign language or when a spelling mistake has occurred. Of course, the result may need adjusting with the editor afterwards.
Hij heeft een beetje [ @postag adjective(no_e(adv)) curly ] haar .
Op een [ @postag adjective(e) mgooie ] dag gingen ze fietsen .
Mijn [ @postag noun(de,count,sg) body mass index ] laat te wensen over .
Also, if the annotator can predict that a certain token or series of tokens will make the annotation a mess, it can be skipped with @skip. In such a case, the sentence is parsed as if the word(s) marked with @skip were not there, except that the numbering of the words in the resulting dependency structure is still correct. Clearly, in many cases an additional editing phase is required to obtain the fully correct analysis, but this method might reduce efforts considerably.
Ik wil [ @skip ??? ] naar huis
Ten opzichte [ @skip echter ] van deze bezwaren willen wij ....
A related trick that is useful in particular cases, is to add a word to the sentence in order to ensure that the parse can succeed, but instruct Alpino that this word should not be part of the resulting dependency structures. Such words are labeled phantom as follows:
Ik aanbad [ @phantom hem ] dagelijks in de kerk
Ik kocht boeken en Piet [ @phantom kocht ] platen
Ik heb [ @phantom meer ] boeken gezien dan hem
Limitations: a phantom bracketed string can only contain a single word. The technique does not work yet for words that are part of a multi-word unit.
Warning: the resulting dependency structure is most likely not well-formed and often needs manual editing.
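When bracket annotations are generated by a script rather than typed by hand, working on the token list avoids spacing mistakes, since every bracket and label becomes a token of its own. The helper below is a hypothetical Python sketch (the name bracket and its parameters are not part of Alpino); it reproduces two of the annotated examples above.

```python
def bracket(tokens, start, end, label=None, argument=None):
    """Wrap tokens[start:end] in '[' ... ']' and return a new token list.

    label    -- optional operator name without the @, e.g. "np", "skip"
    argument -- optional extra token after the label, e.g. the lexical
                category that follows @postag
    """
    opening = ["["]
    if label:
        opening.append("@" + label)
    if argument:
        opening.append(argument)
    return tokens[:start] + opening + tokens[start:end] + ["]"] + tokens[end:]


skipped = bracket("Ik wil ??? naar huis".split(), 2, 3, "skip")
print(" ".join(skipped))

tagged = bracket("Op een mgooie dag gingen ze fietsen .".split(), 2, 3,
                 "postag", "adjective(e)")
print(" ".join(tagged))
```

Joining the tokens with single spaces yields lines in the format that the bracket syntax requires.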
Note that Alpino currently is not really implemented as a library. Yet, you might find it useful to know that you can call the Alpino parser from Prolog using the following two predicates:
alpino_parse_line(Ref,String)
alpino_parse_tokens(Ref,Tokens)
Here, Ref is an atomic identifier which is used for administrative purposes, so that results for different sentences can be easily distinguished. In the first case, the input is a string which may need to be tokenized (see discussion above). In the second case, Tokens is a list of atoms where each atom is a token (word, punctuation mark) of the sentence. Here are some examples of using these predicates:
?- alpino_parse_line(s2323,"ik besta").
?- alpino_parse_tokens(s2324,[ik,besta]).
If you want to extend the functionality of Alpino, you may be interested to learn about the following (Prolog-) hooks.
The hook predicate
alpino_start_hook(Key,Sentence)
is called before the parse of each sentence. Key is instantiated as the atomic identifier of the sentence. Such identifiers are set using the current_ref Hdrug flag.
The hook predicate
alpino_result_hook(Key,Sentence,No,Result)
is called for each parse. Key and Sentence are as before, No is an integer indicating the number of the parse (parses for a given sentence are numbered from 1 to n), and Result is an internal representation of the full parse itself.
The hook predicate
alpino_end_hook(Key,Sentence,Status,NumberOfSolutions)
is called after the parser has found all solutions. NumberOfSolutions is an integer indicating the number of solutions. Status is an atom indicating whether parsing was successful, or not. Possible values are the following - their meaning is supposed to be evident:
success
failure
time_out
out_of_memory
Other values describe more exotic possibilities; please consult the hdrug_status flag in Hdrug for further details.
TODO