CoNLL-X Shared Task: Multi-lingual Dependency Parsing

Tenth Conference on Computational Natural Language Learning - New York City, June 8-9, 2006

Contents

eval.pl

validateFormat.py

tabs2blanks.py

blanks2tabs.py

conlltab2dot.py

Treebank conversion software

External software

Kaarel Kaljurand's tree visualizer

Joakim Nivre's (de)projectivization

UTF-8 to ASCII and back

Download software

Note: Now that the shared task is over, we (the organizers) will probably not develop this software any further. However, everybody is invited to make improvements to it and to release those on the Depparse Wiki page.

eval.pl

  CoNLL-X evaluation script:

   [perl] eval.pl [OPTIONS] -g <gold standard> -s <system output>

  This script evaluates a system output with respect to a gold standard.
  Both files should be in UTF-8 encoded CoNLL-X tabular format.

  Punctuation tokens (those where all characters have the Unicode
  category property "Punctuation") are ignored for scoring (unless the
  -p flag is used).

  The output breaks down the errors according to their type and context.

  Optional parameters:
     -o FILE : output: print output to FILE (default is standard output)
     -q : quiet:       only print overall performance, without the details
     -b : evalb:       produce output in a format similar to evalb
                       (http://nlp.cs.nyu.edu/evalb/); use together with -q
     -p : punctuation: also score on punctuation (default is not to score on it)
     -v : version:     show the version number
     -h : help:        print this help text and exit

Download latest release of eval.pl (version 1.8)

This is the official CoNLL-X shared task evaluation script. It computes the official scoring metric "labeled attachment score" and also provides details useful for error analysis. It was first released on 9 January 2006.

An improved version was released on 22 January 2006. The first release required Perl v5.8. However, that version of Perl still contains bugs with respect to Unicode handling. The new release of the evaluation script therefore requires at least Perl v5.8.1. If you have at least Perl v5.8.1, the new release of the evaluation script gives identical scores to the first release. However, it provides more output.

A version with additional output for error analysis was released on 8 February 2006.

A version with a new option for significance testing (-b), with "label accuracy score" and one more error analysis table (thanks to Prokopis Prokopidis) was released on 12 March 2006. Use the -b option as follows:

perl eval.pl -b -q -g GOLD_FILE -s SYSTEM_FILE1 > system1.txt
perl eval.pl -b -q -g GOLD_FILE -s SYSTEM_FILE2 > system2.txt
perl compare.pl system1.txt system2.txt
where compare.pl is Dan Bikel's Randomized Parsing Evaluation Comparator (Statistical Significance Tester for evalb Output). Its output talks about "recall" and "precision" but for the output of eval.pl these are really "unlabeled attachment" and "labeled attachment" respectively.
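
The three scores mentioned above (labeled attachment, unlabeled attachment and label accuracy) can be illustrated with a minimal Python sketch. This is not the logic of eval.pl itself. It assumes that each token has already been reduced to a (HEAD, DEPREL) pair, that gold and system tokens are aligned one-to-one, and that non-scoring tokens have already been dropped. The function name attachment_scores is hypothetical.

```python
def attachment_scores(gold, system):
    """Return (labeled attachment, unlabeled attachment, label accuracy).

    gold, system: equal-length lists of (head, deprel) pairs, one pair
    per scoring token. All three scores are fractions between 0 and 1.
    """
    if len(gold) != len(system):
        raise ValueError("gold and system must have the same number of tokens")
    if not gold:
        raise ValueError("no tokens to score")
    total = len(gold)
    pairs = list(zip(gold, system))
    # Labeled attachment: HEAD and DEPREL are both correct.
    labeled = sum(g == s for g, s in pairs)
    # Unlabeled attachment: HEAD is correct, DEPREL is ignored.
    unlabeled = sum(g[0] == s[0] for g, s in pairs)
    # Label accuracy: DEPREL is correct, HEAD is ignored.
    label_only = sum(g[1] == s[1] for g, s in pairs)
    return labeled / total, unlabeled / total, label_only / total


# Hypothetical four-token sentence.
gold = [(2, "SBJ"), (0, "ROOT"), (2, "OBJ"), (3, "NMOD")]
system = [(2, "SBJ"), (0, "ROOT"), (2, "NMOD"), (2, "NMOD")]
las, uas, la = attachment_scores(gold, system)
print(las, uas, la)
```

In this toy example the third token has the right head but the wrong label, and the fourth has the right label but the wrong head, so the labeled score is lower than the other two.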

Here are some questions and answers about the definition of non-scoring tokens:

How do you exclude tokens from scoring?

The evaluation script defines a non-scoring token as a token where all characters have the Unicode category property "Punctuation" (see "man perlunicode"). As the underscore character also has the "Punctuation" property, the non-last inflection groups in our Turkish data (with FORM '_') and the punctuation tokens in our Arabic data (with FORMs such as '._.' where the part before the underscore is in Arabic script while the part after the underscore is in Latin script) are all non-scoring tokens, as intended. There might be a few cases where one does not agree with the Unicode category. For example, Unicode classifies the percentage sign (%) as punctuation. We chose not to try to improve upon the Unicode definition and do not think that this will substantially influence results. However, a detailed study of this issue might be useful in the future.
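
As a rough illustration of this definition, here is a minimal Python sketch. It assumes that the "Punctuation" property corresponds to the Unicode general categories whose code starts with "P" (Pc, Pd, Ps, Pe, Pi, Pf, Po). The function name is_non_scoring is hypothetical, and the Unicode version behind Python's unicodedata module may differ from that of a given Perl installation, so borderline characters could be classified differently than by eval.pl.

```python
import unicodedata


def is_non_scoring(form):
    """True if FORM is non-empty and all its characters are punctuation."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


# The underscore (Pc) and the percent sign (Po) count as punctuation;
# the dollar sign (Sc, a currency symbol) does not.
for form in ["_", "._.", "%", "U.S.", "$"]:
    print(repr(form), is_non_scoring(form))
```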

Why is the definition of non-scoring token not based on the part of speech?

That was indeed our first idea. However, there are two problem cases. The Swedish treebank has a POS tag for punctuation but it is not always used for punctuation. For example, punctuation that is part of a multiword gets the multiword tag. Also, some punctuation (e.g. some commas) is tagged as "coordinating conjunction". The Arabic treebank also has a POS tag for punctuation but it is not used in one of the four subcorpora (instead punctuation together with numbers and others has the POS tag "non-alphabetic material").

In addition, there is a more fundamental issue. Parsing as defined for this shared task assumes that the gold-standard POS tags are given. However, this is rarely the case in real applications. Typically, input text is untagged (and untokenized, but that is a different issue). One would therefore either apply a POS tagger to the data and use the output of the tagger as the input to the parser, or use a parser that assigns POS tags directly during parsing (possibly preceded by a morphological component for highly inflected languages). We would like to be able to compare the results for the shared task definition of parsing to results on this more general definition of parsing. However, a meaningful comparison requires that the same set of tokens is scored in both set-ups. If the definition of a non-scoring token relied on the POS tags, it would mean that during parsing in the general set-up the parser does not know which tokens it will be scored upon. It would also mean that each gold standard for parsing must include POS tags. Both consequences might be problematic.

Should punctuation really be excluded from scoring?

This is the most fundamental question and we do not have a satisfactory answer. Some treebanks (e.g. Alpino for Dutch) do not attach punctuation at all. Although we could (and in fact did) attach punctuation during the conversion to the shared task format, it does not sound like a good idea to then score on it, as parsers would be punished if they fail to reproduce an attachment which we introduced. In addition, punctuation is also normally ignored for the scoring of constituent-based parsers, and it might not even exist for (parts of) treebanks that are based on transcriptions of speech. Finally, the attachment of most punctuation is irrelevant for most parsing applications.

However, there are a few important exceptions. Some treebanks (e.g. the Prague Dependency Treebanks) analyze some punctuation tokens as the syntactic heads of other tokens. For example, in the absence of a coordinating conjunction, a comma might be analyzed as the coordinator, and therefore as the head of the conjuncts. The attachment of the comma therefore encodes the information about what the whole coordinated phrase refers to. If we ignore this comma for scoring purposes, a parser might get an attachment score of 100% despite the fact that it misattached the coordinated phrase. Clearly, this is problematic.

We feel that the problem of whether or not to score some or all punctuation for some or all treebanks is a complicated one and we will leave it to the research community to find a satisfactory answer in the future. For the shared task, we exclude punctuation from scoring. It remains to be seen whether this influences results in any way.

validateFormat.py

usage:
        validateFormat.py [options] FILES

purpose:
        checks whether files are in CoNLL-X shared task format

args:
        FILES              input files


options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -d STRING, --discard_problems=STRING
  -e STRING, --encoding=STRING
                        output character encoding (default is utf-8)
  -i STRING, --input_separator=STRING
                        regular expression for column separator in
                        input (default is one tab, i.e. '\t')
  -p STRING, --punctuation=STRING
                        use given regular expression to identify
                        punctuation (by matching with the CPOSTAG column)
                        and check that nothing links to and that a sentence
                        contains more than just punctuation (default: turned
                        off)
  -r STRING, --root_deprel=STRING
                        designated root label: check that there is exactly
                        one token with that label and that it's HEAD is 0
                        (default: not specified)
  -s STRING, --silence_warnings=STRING
                        don't warn about certain types of
                        problems (default is to warn about every problem);
                        possible choices:cycle punct whitespace root other
  -t STRING, --type=STRING
                        type of the data to be tested: train,
                        test_blind, system (default: train)
This script can be used to check files for compliance with the CoNLL-X shared task format. It prints detailed warnings and error messages to STDERR. The returned status code indicates whether the files passed the test (status 0) or not (status 1). The requirements for training data (-t train) are stricter than for system produced output (-t system): Errors in the (P)HEAD and (P)DEPREL columns cause status 1 for training data but not for system output (STDERR messages are the same). System output is allowed but not required to have the PHEAD and PDEPREL column.
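
If you want to act on the status code from another program, a small wrapper along the following lines might help. It is only a sketch: the function names build_command and passes_validation are hypothetical, the script is assumed to be executable in the current directory, and only the documented -t option and the documented status codes (0 for passed, 1 for failed) are relied upon.

```python
import subprocess


def build_command(conll_file, data_type="train", script="./validateFormat.py"):
    """Build the argument list for one validateFormat.py call."""
    if data_type not in ("train", "test_blind", "system"):
        raise ValueError("unknown data type: %s" % data_type)
    return [script, "-t", data_type, conll_file]


def passes_validation(conll_file, data_type="train"):
    """Run the validator and return True if it exits with status 0."""
    result = subprocess.run(
        build_command(conll_file, data_type),
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # Warnings and error messages arrive on STDERR whatever the status is.
    if result.stderr:
        print(result.stderr, end="")
    return result.returncode == 0
```

For example, passes_validation("systems_output.conll", "system") would apply the more lenient checks for system output.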

You can suppress warnings by using the -s option. E.g. if you already know that your system sometimes predicts cycles in the dependency structure, you could call the script with:

./validateFormat.py -t system -s cycle systems_output.conll
You cannot suppress error messages.
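
To see what a cycle in the dependency structure means in practice, here is a minimal Python sketch that looks for one. It is not taken from validateFormat.py. It assumes that the HEAD column of one sentence has been read into a dictionary from token ID to head ID, with 0 standing for the artificial root. The function name find_cycle is hypothetical.

```python
def find_cycle(heads):
    """Return the token IDs of one cycle, or None if there is none.

    heads: dict mapping each token ID (1-based) to its HEAD (0 = root).
    """
    state = {}  # token -> "active" while on the current path, then "done"
    for start in heads:
        path = []
        node = start
        # Follow HEAD links until we reach the root, an unknown ID,
        # or a token that has been seen before.
        while node != 0 and node in heads and node not in state:
            state[node] = "active"
            path.append(node)
            node = heads[node]
        if node != 0 and state.get(node) == "active":
            # We came back to a token on the current path: a cycle.
            return path[path.index(node):]
        for visited in path:
            state[visited] = "done"
    return None


print(find_cycle({1: 2, 2: 0, 3: 2}))        # a proper tree
print(find_cycle({1: 2, 2: 3, 3: 1, 4: 0}))  # tokens 1, 2 and 3 form a cycle
```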

Download validateFormat.py (version 1.4)

Download SharedTaskCommon.py which is needed by validateFormat.py

Note: I have fixed a bug in version 1.2 that caused validateFormat.py to complain if a file followed the Windows end-of-line convention (of using \r\n)

tabs2blanks.py

usage: 
    tabs2blanks.py [options] <INFILE >OUTFILE

purpose:
    Converts tabs to blanks in an attempt to align the column content.
    Reads from standard input and writes to standard output.
    Expects input in tabular format with columns separated by tabs.
    Spaces in column content are replaced by tabs (by default).

options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -b STRING, --blank-replace=STRING
                        replacement for blanks in column content (default is
                        tab)
  -e STRING, --encoding=STRING
                        input and output character encoding (default is utf-8)
  -m INT, --max-width=INT
                        maximum width of a column (default is unlimited)

Download tabs2blanks.py
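
The basic idea of aligning columns can be sketched in a few lines of Python. This is an illustration, not the code of tabs2blanks.py: it assumes that column widths are computed over the whole input, that columns are joined by a single blank after padding, and it leaves out the --max-width and --encoding options. The function name align_columns is hypothetical.

```python
def align_columns(lines, blank_replace="\t"):
    """Pad tab-separated columns with blanks so that they line up."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            rows.append([])  # empty line between sentences
            continue
        # Blanks inside a column would be confused with padding,
        # so they are replaced first.
        rows.append([cell.replace(" ", blank_replace)
                     for cell in line.split("\t")])
    # Widest cell per column position.
    widths = {}
    for row in rows:
        for i, cell in enumerate(row):
            widths[i] = max(widths.get(i, 0), len(cell))
    aligned = []
    for row in rows:
        padded = [cell.ljust(widths[i]) for i, cell in enumerate(row)]
        aligned.append(" ".join(padded).rstrip(" "))
    return aligned


for out_line in align_columns(["1\tThe\tthe\n", "10\tcat\tcat\n", "\n"]):
    print(out_line)
```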

blanks2tabs.py

usage: 
    blanks2tabs.py [options] <INFILE >OUTFILE

purpose:
    Converts each sequence of blanks to a single tab.
    Reads from standard input and writes to standard output.
    Expects input in tabular format with columns separated by one or more blanks.
    Tabs in column content are replaced by blanks (by default).

options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -e STRING, --encoding=STRING
                        input and output character encoding (default is utf-8)
  -t STRING, --tab-replace=STRING
                        replacement for tabs in column content (default is a
                        single blank)

Download blanks2tabs.py
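
The core of the reverse conversion can be sketched with one regular expression. Again this is only an illustration under assumptions, not the code of blanks2tabs.py: it treats only the space character as a blank and ignores the --tab-replace and --encoding options. The function name blanks_to_tabs is hypothetical.

```python
import re


def blanks_to_tabs(line):
    """Replace each run of one or more spaces by a single tab."""
    return re.sub(r" +", "\t", line.rstrip("\n"))


print(repr(blanks_to_tabs("1  The   the\n")))
```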

conlltab2dot.py

usage: 
    conlltab2dot.py [options] <INFILE >OUTFILE
     
purpose:
    Converts dependency structures in CoNLL-X tabular format 
    to dot graph specifications.
    Reads from standard input and writes to standard output.

examples:

   To generate postscript output on Linux, try:
 
      ./conlltab2dot.py <sample.conll | \
      dot -Tps2 > /tmp/sample.ps && \
      gv /tmp/sample.ps

    To generate PDF output on Mac OS X, try:
    
      ./conlltab2dot.py <sample.conll | \
      /Applications/Graphviz.app/Contents/MacOS/dot -Tepdf | \
      open -f -a preview
    

options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -e STRING, --encoding=STRING
                        character encoding (default is utf-8)
  -r STRING, --range=STRING
                        sentence range as comma-separated list of sentence
                        numbers, optionally using n-, -n, or n-m to denote
                        inclusive ranges (default is all sentences)
  -s CHAR, --shape=CHAR
                        shape of graph where 'h' means hierarchical and 'l'
                        means linear (default is 'h')

Download conlltab2dot.py
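
To give an idea of what such a conversion involves, here is a minimal Python sketch that turns one sentence into a dot graph specification. It is not the code of conlltab2dot.py and does not reproduce its layout options. It assumes the CoNLL-X column order ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL, and draws one edge from each HEAD to its dependent, labeled with the DEPREL. The function names are hypothetical.

```python
def dot_escape(text):
    """Escape backslashes and double quotes for a quoted dot string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def conll_to_dot(sentence_lines, name="sentence"):
    """Build a dot digraph for the tokens of one sentence."""
    nodes = ['  0 [label="ROOT"];']
    edges = []
    for line in sentence_lines:
        cols = line.rstrip("\n").split("\t")
        # Columns 1, 2, 7 and 8 are ID, FORM, HEAD and DEPREL.
        tok_id, form, head, deprel = cols[0], cols[1], cols[6], cols[7]
        nodes.append('  %s [label="%s"];' % (tok_id, dot_escape(form)))
        edges.append('  %s -> %s [label="%s"];'
                     % (head, tok_id, dot_escape(deprel)))
    return "\n".join(["digraph %s {" % name] + nodes + edges + ["}"])


# Hypothetical two-token sentence.
sample = [
    "1\tJohn\tjohn\tN\tNNP\t_\t2\tSBJ\t_\t_",
    "2\tsleeps\tsleep\tV\tVBZ\t_\t0\tROOT\t_\t_",
]
dot = conll_to_dot(sample)
print(dot)
```

The printed text can be piped into dot in the same way as the output of conlltab2dot.py in the examples above.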

Treebank conversion software

For the shared task, 13 treebanks were converted from their original formats to the data format used in the shared task. The software to do that was developed by several different people and for practical reasons, no effort was made to standardize it. We provide this software here without any warranty but hope that it will be useful to other researchers. For general questions about this page, please contact conll06st@uvt.nl. For questions about specific software, however, please contact the respective author directly.

Note: Now that the shared task is over, we (the organizers) will probably not develop this software any further. However, everybody is invited to make improvements to it and to release those on the Depparse Wiki page.

Amit Dubey's software to convert the Cast3LB, Sinica and TIGER treebanks

Download tarred and zipped software: dubey-software.tar.bz2.
This tarball is compressed using bzip2, and can be unpacked with either 'tar xjf filename' or 'tar xyf filename' (depending on your version of tar).

This software is written in OCaml. The tarball contains library functions for the conversion of other treebanks as well.

See tools/nlX4/README.CONLL for general information.

You can contact Amit Dubey at "adubey at inf dot ed dot ac dot uk"

Erwin Marsi's software to convert the DDT, Alpino, Japanese Verbmobil and Talbanken05 treebanks

Download tarred and zipped software: erwins-software.tar.bz2.
This tarball is compressed using bzip2, and can be unpacked with either 'tar xjf filename' or 'tar xyf filename' (depending on your version of tar).

The software is organized in the same way as the training and test data.

See the files data/<language>/<treebank>/README for general information.

The treebank-specific conversion software is in data/<language>/<treebank>/tools/.

Some general software is in tools/.

The shell scripts that control the conversion processes are data/<language>/<treebank>/tools/build.sh (go there and execute build.sh).

The main conversion scripts are data/<language>/<treebank>/tools/<treebank>2tab.py (written in Python).

The treebank files are expected in data/<language>/<treebank>/treebank/.

The resulting files can be found in data/<language>/<treebank>/dist/.

There are also some log files that were created during the original conversion. They might be useful for comparing against your log files as a sanity check that your conversion worked the same way as ours.

The Dutch Alpino treebank was not only converted but also retagged. The software used for tagging is MBT. You will need to get the following files and put them into these locations:

data/dutch/alpino/tools/mbt-nl/gen-optimal-mbt
data/dutch/alpino/tools/mbt-nl/Mbt
data/dutch/alpino/tools/mbt-nl/wotan.all.tag.5paxes
data/dutch/alpino/tools/mbt-nl/wotan.all.tag.known.ddwfWawa
data/dutch/alpino/tools/mbt-nl/wotan.all.tag.lex
data/dutch/alpino/tools/mbt-nl/wotan.all.tag.lex.ambi.20
data/dutch/alpino/tools/mbt-nl/wotan.all.tag.settings
data/dutch/alpino/tools/mbt-nl/wotan.all.tag.top50
data/dutch/alpino/tools/mbt-nl/wotan.all.tag.unknown.chnppddwFawasss

Other software to convert the PADT, PDT, Bosque, Metu-Sabanci and SDT treebanks, and for making the training-test-split for the BulTreeBank

Download tarred and zipped software: other-software.tar.bz2.
This tarball is compressed using bzip2, and can be unpacked with either 'tar xjf filename' or 'tar xyf filename' (depending on your version of tar).

The software is organized in the same way as the training and test data.

The treebank-specific conversion software is in data/<language>/<treebank>/tools/.

Some general software is in tools/.

The Makefiles that control the conversion processes are data/<language>/<treebank>/Makefile (go there and type "make"). They also contain some comments about how the training-test-split was determined.

The conversion scripts for PDT, Bosque, and Metu-Sabanci are data/<language>/<treebank>/tools/<treebank>2MALT.py (written in Python). The other conversions were done by different people, so there is no pattern in the naming.

The Bosque, SDT and BulTreeBank treebank files are expected in data/<language>/<treebank>/treebank/. The Metu-Sabanci treebank files are expected in data/<language>/<treebank>/tb_corrected because that's what "tb_corrected_versionConll.zip" (the version of the treebank that we used) expands to. For PDT, you have to modify the Makefile to point it to the location of your PDT CD. The PADT Makefile does not control the complete conversion process: you will have to convert the treebank files individually (using data/<language>/<treebank>/tools/padt2tab.py) and put the results into the directories data/<language>/<treebank>/train/ and data/<language>/<treebank>/test/, respectively.

The resulting files can be found in data/<language>/<treebank>/dist/.

There are also some log files that were created during the original conversion. They might be useful for comparing against your log files as a sanity check that your conversion worked the same way as ours.

Atanas Chanev's software for converting the BulTreeBank

The CoNLL-X shared task conversion of the BulTreeBank is based on, but not identical to, Atanas' scripts. The final software used in the shared task will hopefully be released later. You can contact Atanas Chanev at "artanisz at gmail dot com". See the beginning of "BTB_HPSG2Dep.pl" for a short explanation.