The reference for the organizers' paper is now in the BibTeX file. Please refer to this paper instead of explaining the details of the CoNLL-X shared task (a summary is fine).
If your paper, including the treebank references, exceeds five pages, you may remove these references from the camera-ready version, which must in any case not exceed five pages. If you do so, please also send us a full version that includes all the references. This full version will appear on the CoNLL web site, and it is the version that should be distributed further should you choose to place it, for example, on your own web site.
In short: i) the treebank references may be omitted only for the camera-ready version, ii) the text of the camera-ready version must be similar to the text of the full version and in particular contain citations for all the treebanks.
To create a shortened camera-ready version, those of you who use a word processor may simply delete the relevant entries in the reference list. Please rename this version with ".camera" preceding the original file extension (e.g. ".camera.pdf"). We advise the following procedure for LaTeX users:
Please cite all treebanks that you worked on (even if you do not discuss them in detail). The Bulgarian treebank providers agreed that, for the shared task papers, it is OK to cite only one or two references instead of all five listed in the license. In LaTeX, all treebanks can be cited with:
\cite{PADT,BulTreeBank,BulTreeBank-2,Sinica,PDT,DDT,Alpino,TIGER,Verbmobil,Bosque,SDT,Cast3LB,Talbanken05,MetuSabanci,MetuSabanci-2}

The BibTeX file also contains additional references for some treebanks.
The table below lists various properties of the CoNLL shared task training data for all 13 languages (12 required plus Bulgarian). We hope it will be helpful for explaining differences in performance across languages. The table will be reproduced in similar form in the organizers' introduction paper, so you can refer to it in your papers. We hope that the figures are correct, but if you see anything that looks like a bug, please let us know. Please also tell us if you think another statistic should be included here. Some details are explained below.
| | Ar | Ch | Cz | Da | Du | Ge | Ja | Po | Sl | Sp | Sw | Tu | Bu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| number of tokens (*1000) | 54 | 337 | 1249 | 94 | 195 | 700 | 151 | 207 | 29 | 89 | 191 | 58 | 190 |
| % of non-scoring tokens | 8.8 | 0.8 | 14.9 | 13.9 | 11.3 | 11.5 | 11.6 | 14.2 | 17.3 | 12.6 | 11.0 | 33.1 | 14.4 |
| number of sentences (*1000) | 1.5 | 57.0 | 72.7 | 5.2 | 13.3 | 39.2 | 17.0 | 9.1 | 1.5 | 3.3 | 11.0 | 5.0 | 12.8 |
| tokens per sentence | 37.2 | 5.9 | 17.2 | 18.2 | 14.6 | 17.8 | 8.9 | 22.8 | 18.7 | 27.0 | 17.3 | 11.5 | 14.8 |
| LEMMA present | Yes | No | Yes | No | Yes | No | No | Yes | Yes | Yes | No | Yes | No |
| number of different CPOSTAG values for scoring tokens | 14 | 22 | 12 | 10 | 13 | 52 | 20 | 15 | 11 | 15 | 37 | 14 | 11 |
| number of different POSTAG values for scoring tokens | 19 | 303 | 63 | 24 | 302 | 52 | 77 | 21 | 28 | 38 | 37 | 30 | 53 |
| number of different parts (separated by '\|') in FEATS for scoring tokens | 19 | 0 | 61 | 47 | 81 | 0 | 4 | 146 | 51 | 33 | 0 | 82 | 50 |
| number of different DEPREL values for scoring tokens | 27 | 82 | 78 | 52 | 26 | 46 | 7 | 55 | 25 | 21 | 56 | 25 | 18 |
| number of different DEPREL values of scoring tokens with HEAD=0 | 15 | 1 | 14 | 1 | 1 | 1 | 1 | 6 | 6 | 1 | 1 | 1 | 1 |
| % of scoring tokens with HEAD=0 | 5.5 | 16.9 | 6.7 | 6.4 | 8.9 | 6.3 | 18.6 | 5.1 | 5.9 | 4.2 | 6.5 | 13.4 | 7.9 |
| % of scoring tokens with HEAD to the left | 82.9 | 24.8 | 50.9 | 75.0 | 46.5 | 50.9 | 8.9 | 60.3 | 47.2 | 60.8 | 52.8 | 6.2 | 62.9 |
| % of scoring tokens with HEAD to the right | 11.6 | 58.2 | 42.4 | 18.6 | 44.6 | 42.7 | 72.5 | 34.6 | 46.9 | 35.1 | 40.7 | 80.4 | 29.2 |
| number of scoring tokens with HEAD=0 per sentence | 1.9 | 1.0 | 1.0 | 1.0 | 1.2 | 1.0 | 1.5 | 1.0 | 0.9 | 1.0 | 1.0 | 1.0 | 1.0 |
| % of relations that are non-projective (incl. non-scoring tokens) | 0.4 | 0.0 | 1.9 | 1.0 | 5.4 | 2.3 | 1.1 | 1.3 | 1.9 | 0.1 | 1.0 | 1.5 | 0.4 |
| % of sentences with at least one non-projective relation (incl. non-scoring tokens) | 11.2 | 0.0 | 23.2 | 15.6 | 36.4 | 27.8 | 5.3 | 18.9 | 22.2 | 1.7 | 9.8 | 11.6 | 5.4 |
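
For readers who want to check some of these figures against their own copy of the training data, the following is a minimal Python sketch of how the token-count and head-direction rows could be computed. It assumes the ten-column, tab-separated CoNLL-X file format with blank lines between sentences. The function names (`read_conllx`, `is_scoring`, `head_stats`) are hypothetical, and the rule used in `is_scoring` (a token is non-scoring when its FORM consists only of Unicode punctuation characters) is an assumption that may not match the definition used to produce the table. "HEAD to the left" is read here as the head preceding the token (0 < HEAD < ID), which is consistent with the three HEAD percentages in each column summing to 100.

```python
import unicodedata


def read_conllx(lines):
    """Yield sentences as lists of (token_id, form, head) tuples."""
    sentence = []
    for line in lines:
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.rstrip("\n").split("\t")
        # Columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
        sentence.append((int(cols[0]), cols[1], int(cols[6])))
    if sentence:
        yield sentence


def is_scoring(form):
    """Assumption: tokens made up only of punctuation are non-scoring."""
    return not all(unicodedata.category(ch).startswith("P") for ch in form)


def head_stats(sentences):
    """Compute token counts and head-direction percentages."""
    n_sents = n_tokens = n_scoring = root = left = right = 0
    for sent in sentences:
        n_sents += 1
        for tok_id, form, head in sent:
            n_tokens += 1
            if not is_scoring(form):
                continue
            n_scoring += 1
            if head == 0:
                root += 1
            elif head < tok_id:
                left += 1   # head precedes the token
            else:
                right += 1  # head follows the token

    def pct(part, total):
        return 100.0 * part / total if total else 0.0

    return {
        "tokens": n_tokens,
        "sentences": n_sents,
        "tokens_per_sentence": n_tokens / n_sents if n_sents else 0.0,
        "pct_non_scoring": pct(n_tokens - n_scoring, n_tokens),
        "pct_head_root": pct(root, n_scoring),
        "pct_head_left": pct(left, n_scoring),
        "pct_head_right": pct(right, n_scoring),
        "roots_per_sentence": root / n_sents if n_sents else 0.0,
    }
```

A typical call would be `head_stats(read_conllx(open(path, encoding="utf-8")))`, where `path` points to one of the training files. The sketch does not cover the tag-inventory or non-projectivity rows.
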
Note the following points: