The shared task of CoNLL-X will be multi-lingual dependency parsing. Following previous CoNLL shared tasks (NP bracketing, chunking, clause identification, language independent named-entity recognition, and semantic role labeling), this task aims to define and extend the current state of the art in dependency parsing - a technology which complements the previous tasks by producing a different kind of syntactic description of input text.
Ideally, a parser should be trainable for any language, possibly by adjusting a small number of hyperparameters. The CoNLL-X shared task will provide the community with a benchmark for evaluating their parsers across different languages. Because of the variety of languages and the interest in parser performance across languages, a special focus of the CoNLL-X shared task will be on a qualitative evaluation (along with the quantitative scores as before). We will require the participants to provide an informative error analysis and will ourselves perform a cross-system comparison. This, we expect, will result in a clear picture of the problems that lie ahead for multilingual parsing and the kind of work necessary for adapting existing parsing architectures across languages.
This page provides a detailed description of the shared task and further information regarding scheduling, datasets, paper submission, etc.
Head-modifier dependency relations have been employed as a useful representation in several language modelling tasks. These include unsupervised acquisition of argument structure (Briscoe and Carroll, 1997; McCarthy, 2000), generating word classes (Clark and Weir, 2000) and cooccurrence probabilities (Dagan et al., 1999) for disambiguation, and extracting collocations (Lin, 1999; Pearce, 2001). Palmer et al. (1993) and Yeh (2000) used dependencies as intermediate representations for information retrieval. In addition, some of the recent parser evaluation schemes (Carroll et al., 2000; Lin, 2000; Srinivas, 2000) use dependencies instead of phrases.
Several statistical approaches to full phrase structure parsing model the probability of dependencies between pairs of words (Collins, 1997; Charniak, 1999). Brants et al. (1997) report on a Markov model for adding grammatical functions to a phrase structure. Other approaches are directly aimed at unlabelled (Eisner, 1996; McDonald et al., 2005) or labelled dependency parsing (Nivre and Scholz, 2004).
Until a few years ago, parsers had mostly been applied to only one or two languages: often English and/or the author's native language. Recently, however, several parsers have been applied to more languages, e.g. Collins' parsing model has been tested for English, Czech (Collins et al., 1999), German (Dubey and Keller, 2003), Spanish (Cowan and Collins, 2005) and French (Arun and Keller, 2005), while Nivre's parser has been tested for English (Nivre and Scholz, 2004), Swedish (Nivre et al., 2004) and Czech (Nivre and Nilsson, 2005). This has resulted in interesting insights into how the properties of a language or a treebank influence parser performance, or conversely, how one should best approach parsing for that language/treebank (see e.g. the Keller references above). However, different parsers have been applied to different subsets of languages.
Ideally, a parser should be trainable for any language, possibly by adjusting parameters. The CoNLL 2006 shared task will provide the community members with the possibility of testing their parsers across different languages. The training and test data for the languages differ in size, granularity and quality, but we will try to at least even out differences in the markup format.
The shared task is to assign labeled dependency structures for a range of languages by means of a fully automatic dependency parser. Some gold standard dependency structures against which systems are scored will be non-projective. A system that produces only projective structures will nevertheless be scored against the partially non-projective gold standard. The input consists of (minimally) tokenized and part-of-speech tagged sentences. Each sentence is represented as a sequence of tokens plus additional features such as lemma, part-of-speech, or morphological properties. For each token, the parser must output its head and the corresponding dependency relation (secondary relations are not taken into consideration). Although data and settings may vary per language, the same parser should handle all languages. The parser must therefore be able to learn from training data, to generalize to unseen test data, and to handle multiple languages. Participants are expected to submit parsing results for all languages involved.
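Since projective-only systems are scored against partially non-projective gold structures, it can help to measure how many training sentences are affected. The following is a minimal sketch, not part of the official task software. It assumes the common arc-crossing formulation of projectivity with an artificial root at position 0 (the task description itself does not define the term), and the function name `is_projective` is hypothetical.

```python
from itertools import combinations


def is_projective(heads):
    """Return True if no two dependency arcs cross.

    `heads` lists the HEAD value of each token in sentence order, so
    heads[0] is the head of token 1. A head of 0 denotes the artificial
    root, which is assumed to sit at position 0, left of the sentence.
    """
    # Represent every arc by its (left, right) end points.
    arcs = [tuple(sorted((head, dep))) for dep, head in enumerate(heads, start=1)]
    for (a, b), (c, d) in combinations(arcs, 2):
        # Two arcs cross if exactly one end point of one arc lies
        # strictly inside the span of the other.
        if a < c < b < d or c < a < d < b:
            return False
    return True


# Token 2 is the root; tokens 1 and 3 attach to it: no crossing arcs.
print(is_projective([2, 0, 2]))
# The arc 3 -> 1 crosses the arc 4 -> 2.
print(is_projective([3, 4, 0, 3]))
```

The pairwise check is quadratic in sentence length, which is usually acceptable for a one-off corpus statistic.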
What about pre- and post-processing steps not integrated in the parser itself?
That is fine, provided that those steps constitute general approaches (e.g. feature construction, tree manipulation) and not just manual hacks to fix certain errors (e.g. replace relation X in context Y with relation Z).
Data adheres to the following rules:
|Field number:||Field name:||Description:|
|1||ID||Token counter, starting at 1 for each new sentence.|
|2||FORM||Word form or punctuation symbol.|
|3||LEMMA||Lemma or stem (depending on particular data set) of word form, or an underscore if not available.|
|4||CPOSTAG||Coarse-grained part-of-speech tag, where tagset depends on the language.|
|5||POSTAG||Fine-grained part-of-speech tag, where the tagset depends on the language, or identical to the coarse-grained part-of-speech tag if not available.|
|6||FEATS||Unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar (|), or an underscore if not available.|
|7||HEAD||Head of the current token, which is either a value of ID or zero ('0'). Note that depending on the original treebank annotation, there may be multiple tokens with a HEAD of zero.|
|8||DEPREL||Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.|
|9||PHEAD||Projective head of current token, which is either a value of ID or zero ('0'), or an underscore if not available. Note that depending on the original treebank annotation, there may be multiple tokens with a PHEAD of zero. The dependency structure resulting from the PHEAD column is guaranteed to be projective (but is not available for all languages), whereas the structures resulting from the HEAD column will be non-projective for some sentences of some languages (but are always available).|
|10||PDEPREL||Dependency relation to the PHEAD, or an underscore if not available. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.|
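The ten fields above can be read with a few lines of code. The sketch below is illustrative only. It assumes one token per line, fields separated by a single tab, and a blank line between sentences; the table does not state these layout details. The names `Token`, `parse_token` and `read_sentences` are hypothetical, and the sample sentence and its tags are invented for illustration, not taken from any of the data sets.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional


class Token(NamedTuple):
    id: int
    form: str
    lemma: Optional[str]    # None if the field is an underscore
    cpostag: str
    postag: str
    feats: frozenset        # empty set if the field is an underscore
    head: int
    deprel: str
    phead: Optional[int]    # None if the field is an underscore
    pdeprel: Optional[str]  # None if the field is an underscore


def _optional(value: str) -> Optional[str]:
    return None if value == "_" else value


def parse_token(line: str) -> Token:
    """Parse one token line (assumed to hold ten tab-separated fields)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 fields, got {len(cols)}: {line!r}")
    id_, form, lemma, cpostag, postag, feats, head, deprel, phead, pdeprel = cols
    return Token(
        id=int(id_),
        form=form,
        lemma=_optional(lemma),
        cpostag=cpostag,
        postag=postag,
        feats=frozenset() if feats == "_" else frozenset(feats.split("|")),
        head=int(head),
        deprel=deprel,
        phead=None if phead == "_" else int(phead),
        pdeprel=_optional(pdeprel),
    )


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence (blank lines assumed as separators)."""
    sentence: List[Token] = []
    for line in lines:
        if line.strip():
            sentence.append(parse_token(line))
        elif sentence:
            yield sentence
            sentence = []
    if sentence:
        yield sentence


# Invented sample sentence; tags and relation labels are placeholders.
SAMPLE_ROWS = [
    ["1", "Dogs", "dog", "N", "N", "num=pl", "2", "subj", "_", "_"],
    ["2", "bark", "bark", "V", "V", "_", "0", "ROOT", "_", "_"],
    ["3", ".", "_", "Punc", "Punc", "_", "2", "punct", "_", "_"],
]
SAMPLE = "\n".join("\t".join(row) for row in SAMPLE_ROWS) + "\n\n"

sentences = list(read_sentences(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token.id, token.form, token.head, token.deprel)
```

Only LEMMA, FEATS, PHEAD and PDEPREL are mapped from an underscore to an empty value, because those are the fields the table describes as possibly unavailable; an underscore in FORM is kept as a literal token.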
Here is an example from the Dutch data set.
We will provide training and test data for about 10 languages from several language families. The data format is the same for all data sets. The details, especially of the features, vary among languages. The size of the training sets also varies with language, but the test sets will be of approximately the same size for all the languages: about 5000 "scoring" tokens. Punctuation is not a scoring token.
Whether we can make data freely available for download depends on the license type of the source treebank. Some data sets can be made freely available under an open source license; other data sets require participants to sign and send a license agreement first. Therefore, if you are interested in participating, please register as early as possible. We will then inform you about the details of the licensing procedure. No fee will be required for any data.
In order to allow participants to start developing systems early on, some preliminary data sets will be made available by December 12. We cannot yet provide final information as to which languages will be represented in the shared task as we are still in the process of searching for appropriate treebanks, assessing suitability, negotiating licenses (if necessary) and converting to the common format. Final data sets will be made available by 9 January 2006 at the latest.
Unparsed test data for all languages will be released on March 6. This data will contain the first six columns, although only the ID, FORM, CPOSTAG and POSTAG columns are guaranteed to contain non-underscore values. Participants are given three days to run their parsers on the test data. Parsed test data must be returned on March 9. (Details to be supplied later...) Parsed data must include all original columns from the unparsed data plus the HEAD and DEPREL columns. Only the score on the HEAD and DEPREL columns counts for the evaluation; the PHEAD and PDEPREL columns will be ignored. Optionally, participants can submit additional results obtained by using external data or knowledge sources.
The evaluation metric is labeled attachment score: the proportion of "scoring" tokens that are assigned both the correct head and the correct dependency relation label.
Punctuation tokens are non-scoring. In very exceptional cases, and depending on the original treebank annotation, some additional types of tokens might also be non-scoring. This will clearly be marked in the README accompanying such data.
The overall score of a system is its labeled attachment score on all test sets taken together.
The unlabeled attachment score will be provided as well, but the overall ranking of a system will be based on the labeled attachment score. Unlabeled attachment score is the proportion of "scoring" tokens that are assigned the correct head (regardless of the dependency relation label).
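The two scores defined above can be computed as in the following sketch. It is not the official evaluation script. It assumes that a token counts as punctuation when every character of its FORM has a Unicode punctuation category, which is only an approximation of the "scoring token" rule; the names `is_punctuation` and `attachment_scores` are hypothetical, and the sample sentence is invented.

```python
import unicodedata


def is_punctuation(form):
    """Assumed rule: every character belongs to a Unicode 'P*' category."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def attachment_scores(gold, predicted):
    """Return (labeled, unlabeled) attachment score.

    Both arguments are equally long lists of (form, head, deprel) tuples
    in the same token order.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    scoring = labeled_hits = unlabeled_hits = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, predicted):
        if is_punctuation(form):
            continue  # punctuation tokens are non-scoring
        scoring += 1
        if g_head == p_head:
            unlabeled_hits += 1
            if g_rel == p_rel:
                labeled_hits += 1
    if scoring == 0:
        raise ValueError("no scoring tokens")
    return labeled_hits / scoring, unlabeled_hits / scoring


# Invented example: three scoring tokens plus one punctuation token.
gold = [("The", 2, "det"), ("cat", 3, "subj"), ("sleeps", 0, "ROOT"), (".", 3, "punct")]
pred = [("The", 2, "det"), ("cat", 3, "obj"), ("sleeps", 2, "ROOT"), (".", 1, "punct")]
las, uas = attachment_scores(gold, pred)
print(f"LAS = {las:.3f}, UAS = {uas:.3f}")
```

To obtain an overall score over all test sets taken together, the token lists of all languages would be concatenated before calling the function, rather than averaging the per-language scores.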
We will return the score on the test data, as well as the gold standard answers, on March 10. Participants are then invited to submit a paper describing their approach and results. Because of the variety of languages and the interest in parser performance across languages, a special focus of the CoNLL 2006 shared task will be on a qualitative evaluation: we will prefer papers with an informative error analysis and ask the participants to discuss performance on two languages. In support, we include the documentation of the original treebanks that the data were derived from, as well as some typological information in order to give participants a flavour of languages with which they may not be familiar. We will also supply an evaluation tool to discover structural parser errors.
We will perform a cross-system comparison which we expect to result in a clear picture of the problems that lie ahead for multilingual parsing and the kind of work necessary for adapting existing parsing architectures across languages.
Our Framework page aims to provide an overview of existing work on dependency parsing. We hope that it will be useful as a starting point for participants who have not yet decided their approach and as inspiration for improvements for those who have.
|December 12||Preliminary release of training data and software|
|January 9||Final release of training data and software|
|March 6||Release of the test data|
|March 10 at 8:00 am GMT (Greenwich Mean Time)||Submission of results on the test data|
|March 10||Scoring and release of gold standard answers for the test data|
|March 20 at 8:00 am GMT (Greenwich Mean Time)||Deadline for paper submissions|
|April 10||Notification of acceptance|
|April 24 at 8:00 am GMT (Greenwich Mean Time)||Deadline for camera ready copy|
If you are interested in participating, please send an email with your name and affiliation to firstname.lastname@example.org. We would also like to know whether your institution is a member of LDC (and for which years) as this information might help in negotiating license-restricted data.
Toshiba Research Europe Ltd (UK)
sabine dot buchholz at crl dot toshiba dot co dot uk
Tilburg University (The Netherlands)
e dot c dot marsi at uvt dot nl
University of Haifa (Israel)
yuval at cl.haifa.ac.il
University of Edinburgh (UK)
adubey at inf dot ed dot ac dot uk
If you have an existing dependency parser that cannot support Unicode, you can convert the data to ASCII and back without losing any information by using numerical character references (e.g. Ş becomes &#350;). See the Useful Perl one-liners for working with UTF-8.
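The same round trip can also be done outside Perl. The following Python sketch is one possible way to do it, not the conversion used by the organizers; the function names `to_ascii` and `from_ascii` are hypothetical. It assumes that the original data contains no literal sequences of the form `&#NNN;`, since those would be indistinguishable from generated references when converting back.

```python
import re

# Decimal numerical character references such as "&#350;".
_NCR = re.compile(r"&#(\d+);")


def to_ascii(text):
    """Replace every non-ASCII character by a decimal character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


def from_ascii(text):
    """Turn decimal character references back into the original characters."""
    return _NCR.sub(lambda match: chr(int(match.group(1))), text)


original = "Ş"
ascii_form = to_ascii(original)
print(ascii_form)
print(from_ascii(ascii_form) == original)
```

Only decimal references are decoded, so named entities such as `&amp;` that may occur in the data are left untouched.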
However, many programming and scripting languages nowadays offer Unicode support. Make sure though that you are using a new version of the software as earlier versions often contain bugs in their treatment of Unicode that can result in totally unexpected results. If in any doubt about the cause of unexpected behaviour, replace Unicode with ASCII characters and see whether it works then.
Also, many editors can now render UTF-8 encoded text. E.g. in emacs, a little "u" in the left corner of the status bar indicates that the currently displayed file is UTF-8 encoded.
An interactive evaluation menu for viewing complete evaluations or result tables.
More space for reporting your findings, read Treebank references
Support for significance testing in eval.pl
Release of the unparsed test data (other parts password protected)
A change in the procedure: participants are free to decide on the two languages for which they will present a detailed error analysis.
Corrected version of Chinese data (password restricted)
Some minor changes in the data format:
Preliminary data sets released.
First call for papers for CoNLL-X