CoNLL-X Shared Task: Multi-lingual Dependency Parsing

Tenth Conference on Computational Natural Language Learning - New York City, June 8-9, 2006

Contents

Number of pages

Treebank references

Style files

Additional guidelines

BibTeX entries

Data overview table

Submit paper

How to submit

Please submit a PDF file by email to conll06st@uvt.nl by April 24, 08:00 A.M. GMT.

Number of pages

You can submit up to 5 pages for the camera-ready version.

The reference for the organizers' paper is now in the BibTeX file. Please refer to this paper instead of explaining the details of the CoNLL-X shared task (a summary is fine).

Treebank references

In order to allow more space for participants to report their findings, the references to the treebanks will appear on a separate page of the CoNLL proceedings just before the shared task papers. Paper authors will therefore not be required to include these references in their camera-ready 5-page versions.

If your paper, including the treebank references, exceeds five pages, you may remove these references from the camera-ready version, which must in any case not exceed five pages. If you do so, please also send us a full version that includes all the references. This full version will appear on the CoNLL web site, and it is the version that should be distributed further should you choose to place it, for example, on your own web site.

In short: i) the treebank references may be omitted only for the camera-ready version, ii) the text of the camera-ready version must be similar to the text of the full version and in particular contain citations for all the treebanks.

In order to create a shortened camera-ready version, those of you who use a word processor may simply delete the relevant entries in the reference list. Please rename this version with ".camera" preceding the original file extension (e.g. ".camera.pdf"). We advise the following procedure for LaTeX users:

  1. Process the file as usual with all the references. It may exceed 5 pages, as long as removing the treebank references brings it down to 5 pages or fewer.
  2. This is the online full version of the paper. Please email it to us so that we can put it on our web site.
  3. Copy the ".bbl" file to a backup file.
  4. Remove all the treebank references from the ".bbl" file.
  5. Run LaTeX again (without rerunning BibTeX, which would regenerate the ".bbl" file). The output will be a paper whose text is similar to that of the full version, but the treebank entries will no longer appear in the reference list. The citations will still appear as usual.
  6. This is the camera-ready version of the paper. Please send this output to us, using the extension ".camera.pdf" to distinguish the file from the full version.
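
Steps 3 and 4 of the LaTeX procedure can also be scripted. The following is only a minimal sketch, not an official tool: the function names (`bibitem_key`, `strip_entries`, `make_camera_bbl`) and the `TREEBANK_KEYS` set are hypothetical, the keys are simply copied from the `\cite` command given in the BibTeX section, and the parser assumes a conventional ".bbl" layout in which every entry starts with `\bibitem` and any optional `[label]` contains no unbalanced brackets. Please check the resulting file by hand before running LaTeX again.

```python
import re
import shutil
from pathlib import Path

# Assumption: these are the keys from the \cite command on this page;
# extend the set if you cite additional treebank references.
TREEBANK_KEYS = {
    "PADT", "BulTreeBank", "BulTreeBank-2", "Sinica", "PDT", "DDT", "Alpino",
    "TIGER", "Verbmobil", "Bosque", "SDT", "Cast3LB", "Talbanken05",
    "MetuSabanci", "MetuSabanci-2",
}


def bibitem_key(entry):
    """Return the citation key of one entry that starts with '\\bibitem'."""
    i = len("\\bibitem")
    while i < len(entry) and entry[i].isspace():
        i += 1
    # Skip an optional [label]; a ']' nested inside braces does not close it.
    if i < len(entry) and entry[i] == "[":
        depth = 0
        i += 1
        while i < len(entry):
            c = entry[i]
            i += 1
            if c == "{":
                depth += 1
            elif c == "}":
                depth -= 1
            elif c == "]" and depth == 0:
                break
    match = re.compile(r"\s*\{([^}]*)\}").match(entry, i)
    return match.group(1).strip() if match else ""


def strip_entries(bbl_text, keys_to_drop=TREEBANK_KEYS):
    """Return bbl_text without the \\bibitem entries whose key is in keys_to_drop."""
    # Split just before every \bibitem and before the closing \end line.
    parts = re.split(r"(?=\\bibitem\b)|(?=\\end\{thebibliography\})", bbl_text)
    kept = [
        part for part in parts
        if not (part.startswith("\\bibitem") and bibitem_key(part) in keys_to_drop)
    ]
    return "".join(kept)


def make_camera_bbl(bbl_path, keys_to_drop=TREEBANK_KEYS):
    """Back up the .bbl file (step 3), then remove the treebank entries (step 4)."""
    path = Path(bbl_path)
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    # latin-1 maps every byte to one character, so other entries pass through unchanged.
    text = path.read_text(encoding="latin-1")
    path.write_text(strip_entries(text, keys_to_drop), encoding="latin-1")
```

For a paper whose main file is "paper.tex" (an example name), calling `make_camera_bbl("paper.bbl")` would leave the original entries in "paper.bbl.bak" and the shortened list in "paper.bbl".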

Style files

Please follow the HLT-NAACL 2006 style except for the paper length.

Additional guidelines

Here is the form we intend to use for reviewing papers. We thought it might help you to know what we expect.

BibTeX entries

This is the BibTeX file used by the shared task organizers for their introduction paper. It now contains references for all treebanks used, so we thought it might be useful for those of you who use LaTeX (and maybe even the others). It comes with no warranty :-) Please report errors/typos.

Please cite all treebanks that you worked on (even if you do not discuss them in detail). The Bulgarian treebank providers agreed that for the shared task papers, it is OK to cite only one or two references instead of all five listed in the license. In LaTeX, all treebanks can be cited with:

\cite{PADT,BulTreeBank,BulTreeBank-2,Sinica,PDT,DDT,Alpino,TIGER,Verbmobil,Bosque,SDT,Cast3LB,Talbanken05,MetuSabanci,MetuSabanci-2}

The BibTeX file also contains additional references for some treebanks.

Data overview table

This table lists various properties of the CoNLL shared task training data for all 13 languages (12 required plus Bulgarian). We hope it will be helpful for explaining differences in performance across languages. The table will be reproduced in similar form in the organizers' introduction paper, so you can refer to it in your papers. We hope that the figures are correct but if you see anything that looks like a bug, please let us know. Please also tell us if you think another statistic should be included here. Some details are explained below.

                                                                                        Ar    Ch    Cz    Da    Du    Ge    Ja    Po    Sl    Sp    Sw    Tu    Bu
number of tokens (*1000)                                                                54   337  1249    94   195   700   151   207    29    89   191    58   190
% of non-scoring tokens                                                                8.8   0.8  14.9  13.9  11.3  11.5  11.6  14.2  17.3  12.6  11.0  33.1  14.4
number of sents (*1000)                                                                1.5  57.0  72.7   5.2  13.3  39.2  17.0   9.1   1.5   3.3  11.0   5.0  12.8
tokens per sentence                                                                   37.2   5.9  17.2  18.2  14.6  17.8   8.9  22.8  18.7  27.0  17.3  11.5  14.8
LEMMA present                                                                          Yes    No   Yes    No   Yes    No    No   Yes   Yes   Yes    No   Yes    No
number of different ...
... CPOSTAG values for scoring tokens                                                   14    22    12    10    13    52    20    15    11    15    37    14    11
... POSTAG values for scoring tokens                                                    19   303    63    24   302    52    77    21    28    38    37    30    53
... parts (separated by '|') in FEATS for scoring tokens                                19     0    61    47    81     0     4   146    51    33     0    82    50
... DEPREL values for scoring tokens                                                    27    82    78    52    26    46     7    55    25    21    56    25    18
... DEPREL values of scoring tokens with HEAD=0                                         15     1    14     1     1     1     1     6     6     1     1     1     1
% of scoring tokens with HEAD=0                                                        5.5  16.9   6.7   6.4   8.9   6.3  18.6   5.1   5.9   4.2   6.5  13.4   7.9
% of scoring tokens with HEAD to the left                                             82.9  24.8  50.9  75.0  46.5  50.9   8.9  60.3  47.2  60.8  52.8   6.2  62.9
% of scoring tokens with HEAD to the right                                            11.6  58.2  42.4  18.6  44.6  42.7  72.5  34.6  46.9  35.1  40.7  80.4  29.2
number of scoring tokens with HEAD=0 per sentence                                      1.9   1.0   1.0   1.0   1.2   1.0   1.5   1.0   0.9   1.0   1.0   1.0   1.0
% of relations that are non-projective (incl. non-scoring tokens)                      0.4   0.0   1.9   1.0   5.4   2.3   1.1   1.3   1.9   0.1   1.0   1.5   0.4
% of sentences with at least one non-projective relation (incl. non-scoring tokens)   11.2   0.0  23.2  15.6  36.4  27.8   5.3  18.9  22.2   1.7   9.8  11.6   5.4
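
If you want to recompute figures like the three HEAD-direction rows for your own data splits, something along the following lines might serve as a starting point. It is only a sketch under stated assumptions: the function names `head_direction_counts` and `percentages` are made up for this example, the input is assumed to be in the tab-separated CoNLL-X format with ID in the first column and HEAD in the seventh, and, unlike the table, it counts all tokens rather than scoring tokens only, so its output will not match the table exactly.

```python
from collections import Counter


def head_direction_counts(lines):
    """Count tokens whose HEAD is 0, to the left, or to the right of the token."""
    counts = Counter({"root": 0, "left": 0, "right": 0})
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # a blank line separates sentences
        cols = line.split("\t")
        token_id = int(cols[0])  # ID column
        head = int(cols[6])      # HEAD column
        if head == 0:
            counts["root"] += 1
        elif head < token_id:
            counts["left"] += 1
        else:
            counts["right"] += 1
    return counts


def percentages(counts):
    """Convert the counts into percentages of all counted tokens."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}
```

A file could then be processed with, for example, `percentages(head_direction_counts(open("train.conll", encoding="utf-8")))`, where "train.conll" is a placeholder file name.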

Note the following points: