CoNLL-X Shared Task: Multi-lingual Dependency Parsing

Tenth Conference on Computational Natural Language Learning - New York City, June 8-9, 2006

Contents

Abstract
Register
Background
Task definition
Data format
Procedure
Important dates
Organizers
Unicode help

Internal links

Obtain data
Download software
Upload results (closed)
Results
Submit paper (closed)
Framework
Workshop programme

External Links

Depparse Wiki
CoNLL Homepage

Contact

conll06st@uvt.nl


Abstract

The shared task of CoNLL-X will be multi-lingual dependency parsing. Following previous CoNLL shared tasks (NP bracketing, chunking, clause identification, language independent named-entity recognition, and semantic role labeling), this task aims to define and extend the current state of the art in dependency parsing - a technology which complements the previous tasks by producing a different kind of syntactic description of input text.

Ideally, a parser should be trainable for any language, possibly by adjusting a small number of hyperparameters. The CoNLL-X shared task will provide the community with a benchmark for evaluating their parsers across different languages. Because of the variety of languages and the interest in parser performance across languages, a special focus of the CoNLL-X shared task will be on a qualitative evaluation (along with the quantitative scores as before). We will require the participants to provide an informative error analysis and will ourselves perform a cross-system comparison. This, we expect, will result in a clear picture of the problems that lie ahead for multilingual parsing and the kind of work necessary for adapting existing parsing architectures across languages.

This page provides a detailed description of the shared task and further information regarding scheduling, datasets, paper submission, etc.

Register

As the main part of the shared task has now passed, you can no longer register for participation. If you want to find out about the results, please attend the CoNLL-X workshop at HLT-NAACL 2006 or check the online proceedings afterwards.

Background

Head-modifier dependency relations have been employed as a useful representation in several language modelling tasks. These include unsupervised acquisition of argument structure (Briscoe and Carroll, 1997; McCarthy, 2000), generating word classes (Clark and Weir, 2000) and co-occurrence probabilities (Dagan et al., 1999) for disambiguation, and extracting collocations (Lin, 1999; Pearce, 2001). Palmer et al. (1993) and Yeh (2000) used dependencies as intermediate representations for information retrieval. In addition, some of the recent parser evaluation schemes (Carroll et al., 2000; Lin, 2000; Srinivas, 2000) use dependencies instead of phrases.

Several statistical approaches to full phrase structure parsing model the probability of dependencies between pairs of words (Collins, 1997; Charniak, 1999). Brants et al. (1997) report on a Markov model for adding grammatical functions to a phrase structure. Other approaches are directly aimed at unlabelled (Eisner, 1996; McDonald et al., 2005) or labelled dependency parsing (Nivre and Scholz, 2004).

Until a few years ago, parsers had mostly been applied to only one or two languages: often English and/or the author's native language. Recently, however, several parsers have been applied to more languages. For example, Collins' parsing model has been tested for English, Czech (Collins et al., 1999), German (Dubey and Keller, 2003), Spanish (Cowan and Collins, 2005) and French (Arun and Keller, 2005), while Nivre's parser has been tested for English (Nivre and Scholz, 2004), Swedish (Nivre et al., 2004) and Czech (Nivre and Nilsson, 2005). This has resulted in interesting insights into how the properties of a language or a treebank influence parser performance, or conversely, how one should best approach parsing for that language/treebank (see e.g. the above references for Keller). However, different parsers have been applied to different subsets of languages.

Ideally, a parser should be trainable for any language, possibly by adjusting a small number of parameters. The CoNLL 2006 shared task will give members of the community the opportunity to test their parsers across different languages. The training and test data for the languages differ in size, granularity and quality, but we will try to at least even out differences in the markup format.

Task definition

The shared task is to assign labeled dependency structures for a range of languages by means of a fully automatic dependency parser. Some gold standard dependency structures against which systems are scored will be non-projective. A system that produces only projective structures will nevertheless be scored against the partially non-projective gold standard. The input consists of (minimally) tokenized and part-of-speech tagged sentences. Each sentence is represented as a sequence of tokens plus additional features such as lemma, part-of-speech, or morphological properties. For each token, the parser must output its head and the corresponding dependency relation (secondary relations are not taken into consideration). Although data and settings may vary per language, the same parser should handle all languages. The parser must therefore be able to learn from training data, to generalize to unseen test data, and to handle multiple languages. Participants are expected to submit parsing results for all languages involved.
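To make the projectivity issue concrete, the sketch below (a function of our own; the list-of-heads representation is not part of the task definition) checks whether a dependency tree given as 1-based HEAD values is projective, i.e. whether every token between a head and its dependent is dominated by that head. It assumes a well-formed tree without cycles.

    def is_projective(heads):
        """Return True if the tree encoded by `heads` is projective.

        heads[i] is the head of token i+1 (1-based token IDs); 0 denotes
        the artificial root. An arc (h, d) is projective if every token
        strictly between h and d is dominated by h.
        """
        n = len(heads)
        for d in range(1, n + 1):
            h = heads[d - 1]
            lo, hi = min(h, d), max(h, d)
            for k in range(lo + 1, hi):
                a = k
                while a not in (0, h):   # climb from k towards the root
                    a = heads[a - 1]
                if a != h:
                    return False         # k is not dominated by h: arc (h, d) is non-projective
        return True

    print(is_projective([2, 0, 2]))      # True: a simple projective tree
    print(is_projective([0, 4, 1, 1]))   # False: token 3 lies between 4 and its dependent 2 but is not dominated by 4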

Clarifications (added 22 January 2006)

What is meant by "the same parser should handle all languages"?
For example, if a parsing software package allows the use of several different parsing algorithms or of several different machine learners, are those still "the same parser"? No. So in general, you should choose one parsing algorithm and one learner. However, we realize that it would still be an interesting result of this shared task if it turned out that algorithm/learner X is much better for language/treebank Y while algorithm/learner W is much better for language/treebank Z. So, in line with our focus on an informative error analysis, we allow deviation from your "default" algorithm/learner if you can explain WHY the non-default algorithm/learner is better in some cases. We hope that this will restrict deviations to the really important ones.

What about pre- and post-processing steps not integrated in the parser itself?
That is fine, provided that those steps constitute general approaches (e.g. feature construction, tree manipulation) and not just manual hacks to fix certain errors (e.g. replace relation X in context Y with relation Z).

Data format

Data adheres to the following rules:


Field number, field name, and description:

1. ID: Token counter, starting at 1 for each new sentence.
2. FORM: Word form or punctuation symbol.
3. LEMMA: Lemma or stem (depending on the particular data set) of the word form, or an underscore if not available.
4. CPOSTAG: Coarse-grained part-of-speech tag; the tagset depends on the language.
5. POSTAG: Fine-grained part-of-speech tag; the tagset depends on the language. Identical to the coarse-grained part-of-speech tag if a fine-grained tagset is not available.
6. FEATS: Unordered set of syntactic and/or morphological features (depending on the particular language), separated by a vertical bar (|), or an underscore if not available.
7. HEAD: Head of the current token, which is either a value of ID or zero ('0'). Note that depending on the original treebank annotation, there may be multiple tokens with a HEAD of zero.
8. DEPREL: Dependency relation to the HEAD. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.
9. PHEAD: Projective head of the current token, which is either a value of ID or zero ('0'), or an underscore if not available. Note that depending on the original treebank annotation, there may be multiple tokens with a PHEAD of zero. The dependency structure resulting from the PHEAD column is guaranteed to be projective (but is not available for all languages), whereas the structure resulting from the HEAD column will be non-projective for some sentences of some languages (but is always available).
10. PDEPREL: Dependency relation to the PHEAD, or an underscore if not available. The set of dependency relations depends on the particular language. Note that depending on the original treebank annotation, the dependency relation may be meaningful or simply 'ROOT'.

Here is an example from the Dutch data set.
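As an illustration of how such files can be processed, here is a minimal Python sketch of a reader for this format. It assumes blank lines between sentences and tab-separated fields, as in the distributed data sets; the function name and the dictionary-per-token representation are just illustrative choices.

    FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
              "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]

    def read_conll(path):
        """Yield one sentence at a time as a list of token dictionaries."""
        sentence = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:              # a blank line ends the sentence
                    if sentence:
                        yield sentence
                        sentence = []
                    continue
                fields = line.split("\t")
                sentence.append(dict(zip(FIELDS, fields)))
        if sentence:                      # file may not end with a blank line
            yield sentence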

Procedure

We will provide training and test data for about 10 languages from several language families. The data format is the same for all data sets. The details, especially of the features, vary among languages. The size of the training sets also varies with language, but the test sets will be of approximately the same size for all the languages: about 5000 "scoring" tokens. Punctuation is not a scoring token.

Whether we can make data freely available for download depends on the license type of the source treebank. Some data sets can be made freely available under an open source license, while other data sets require participants to sign and send a license agreement first. Therefore, if you are interested in participating, please register as early as possible. We will then inform you about the details of the licensing procedure. No fee will be required for any data.

In order to allow participants to start developing systems early on, some preliminary data sets will be made available by December 12. We cannot yet provide final information as to which languages will be represented in the shared task as we are still in the process of searching for appropriate treebanks, assessing suitability, negotiating licenses (if necessary) and converting to the common format. Final data sets will be made available by 9 January 2006 at the latest.

Unparsed test data for all languages will be released on March 6. This data will contain the first six columns, although only the ID, FORM, CPOSTAG and POSTAG columns are guaranteed to contain non-underscore values. Participants are given three days to run their parsers on the test data. Parsed test data must be returned on March 9. (Details to be supplied later...) Parsed data must include all original columns from the unparsed data plus the HEAD and DEPREL columns. Only the HEAD and DEPREL columns count for the evaluation; the PHEAD and PDEPREL columns will be ignored. Optionally, participants can submit additional results obtained by using external data or knowledge sources.

The evaluation metric is labeled attachment score: the proportion of "scoring" tokens that are assigned both the correct head and the correct dependency relation label.

Punctuation tokens are non-scoring. In very exceptional cases, and depending on the original treebank annotation, some additional types of tokens might also be non-scoring. This will clearly be marked in the README accompanying such data.

The overall score of a system is its labeled attachment score on all test sets taken together.

The unlabeled attachment score will be provided as well, but the overall ranking of a system will be based on the labeled attachment score. The unlabeled attachment score is the proportion of "scoring" tokens that are assigned the correct head (regardless of the dependency relation label).
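To make the relation between the two scores explicit, here is a small Python sketch that computes both over parallel lists of gold and system tokens. The function names and the Unicode-based punctuation test are our own assumptions; the official scorer uses the treebank-specific notion of non-scoring tokens.

    import unicodedata

    def is_punctuation(form):
        """Approximate test for non-scoring tokens: all characters are punctuation."""
        return all(unicodedata.category(ch).startswith("P") for ch in form)

    def attachment_scores(gold, system):
        """Return (LAS, UAS) from parallel lists of (FORM, HEAD, DEPREL) tuples."""
        scoring = labeled = unlabeled = 0
        for (form, g_head, g_rel), (_, s_head, s_rel) in zip(gold, system):
            if is_punctuation(form):
                continue                  # punctuation tokens do not score
            scoring += 1
            if g_head == s_head:
                unlabeled += 1
                if g_rel == s_rel:
                    labeled += 1
        return labeled / scoring, unlabeled / scoring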

We will return the scores on the test data, as well as the gold standard answers, on March 10. Participants are then invited to submit a paper describing their approach and results. Because of the variety of languages and the interest in parser performance across languages, a special focus of the CoNLL 2006 shared task will be on a qualitative evaluation: we will prefer papers with an informative error analysis and ask the participants to discuss performance on two languages. To support this, we include the documentation of the original treebanks that the data were derived from, as well as some typological information, in order to give participants a flavour of languages with which they may not be familiar. We will also supply an evaluation tool to discover structural parser errors.

We will perform a cross-system comparison which we expect to result in a clear picture of the problems that lie ahead for multilingual parsing and the kind of work necessary for adapting existing parsing architectures across languages.

Our Framework page aims to provide an overview of existing work on dependency parsing. We hope that it will be useful as a starting point for participants who have not yet decided their approach and as inspiration for improvements for those who have.

Important dates


December 12: Preliminary release of training data and software
January 9: Final release of training data and software
March 6: Release of the test data
March 10, 8:00 am GMT (Greenwich Mean Time): Submission of results on the test data
March 10: Scoring and release of gold standard answers for the test data
March 20, 8:00 am GMT (Greenwich Mean Time): Deadline for paper submissions
April 10: Notification of acceptance
April 24, 8:00 am GMT (Greenwich Mean Time): Deadline for camera ready copy

Register

If you are interested in participating, please send an email with your name and affiliation to conll06st@uvt.nl. We would also like to know whether your institution is a member of the LDC (and for which years), as this information might help in negotiating license-restricted data.

Organizers

Sabine Buchholz
Toshiba Research Europe Ltd (UK)
sabine dot buchholz at crl dot toshiba dot co dot uk

Erwin Marsi
Tilburg University (The Netherlands)
e dot c dot marsi at uvt dot nl

Yuval Krymolowski
University of Haifa (Israel)
yuval at cl.haifa.ac.il

Amit Dubey
University of Edinburgh (UK)
adubey at inf dot ed dot ac dot uk

Unicode help

Much information about Unicode can be found on the internet, see e.g. the Wikipedia article on Unicode or Markus Kuhn's UTF-8 and Unicode FAQ for Unix/Linux.

If you have an existing dependency parser that does not support Unicode, you can convert the data to ASCII and back without losing any information by using numerical character references (e.g. Ş becomes &#350;). See the Useful Perl one-liners for working with UTF-8.
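The same round trip can be sketched in Python (the helper names are ours; this is not one of the Perl one-liners referred to above): non-ASCII characters are replaced by decimal character references on the way out and expanded again on the way back.

    import html

    def to_ascii(text):
        """Replace every non-ASCII character by a numeric character
        reference, e.g. 'Ş' becomes '&#350;'."""
        return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

    def from_ascii(text):
        """Invert to_ascii by expanding the character references again."""
        return html.unescape(text)

    assert from_ascii(to_ascii("Şey")) == "Şey"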

However, many programming and scripting languages nowadays offer Unicode support. Make sure, though, that you are using a recent version of the software, as earlier versions often contain bugs in their treatment of Unicode that can lead to completely unexpected results. If in any doubt about the cause of unexpected behaviour, replace the Unicode characters with ASCII characters and see whether it works then.

Also, many editors can now render UTF-8 encoded text. For example, in Emacs a little "u" in the left corner of the status bar indicates that the currently displayed file is UTF-8 encoded.