About Tadpole
Tadpole, which stands for Tagger, Dependency Parser, and
Other Linguistic Engines, is an integration of memory-based
natural language processing (NLP) modules developed for Dutch. All NLP
modules are based on Timbl 6.3.
Tadpole's current version will tokenize, tag, lemmatize, and
morphologically segment word tokens in incoming Dutch text files, and
will assign a dependency graph to each sentence.
Without the dependency parser, Tadpole will process about 1,600 words
per second, and consume 530 MB of memory on a 64-bit architecture. With
the parser, Tadpole's speed drops to about 220 words per second, taking
1950 MB of memory; you have been warned.
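To get a feel for what these figures mean for a given corpus, the
following Python sketch turns them into a rough running-time estimate.
The function name and the dictionary are made up for illustration, and
the rates are simply the approximate figures quoted above; actual speed
will depend on your hardware and your texts.

```python
# Approximate throughput figures quoted above, in words per second.
WORDS_PER_SECOND = {"without_parser": 1600, "with_parser": 220}


def estimated_seconds(word_count: int, with_parser: bool) -> float:
    """Return a rough processing-time estimate in seconds."""
    key = "with_parser" if with_parser else "without_parser"
    return word_count / WORDS_PER_SECOND[key]


if __name__ == "__main__":
    corpus_size = 1_000_000  # hypothetical corpus of one million words
    for with_parser in (False, True):
        minutes = estimated_seconds(corpus_size, with_parser) / 60
        label = "with parser" if with_parser else "without parser"
        print(f"{label}: about {minutes:.0f} minutes")
```

At these rates, a one-million-word corpus would take roughly ten minutes
without the parser and well over an hour with it.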
Tadpole expects UTF-8 texts as input, and will produce tab-delimited
column-formatted output, one line per token, that looks as follows:
1 De de [de] LID(bep,stan,rest) 2 det
2 oprichter oprichter [op][richt][er] N(soort,ev,basis,zijd,stan) 8 su
3 van van [van] VZ(init) 2 mod
4 Wikipedia Wikipedia [Wikipedia] SPEC(deeleigen) 3 obj1
5 , , [,] LET() 4 punct
6 Jimmy_Wales Jimmy_Wales [Jimmy]_[Wales] SPEC(deeleigen) 2 app
7 , , [,] LET() 6 punct
8 wil willen [wil] WW(pv,tgw,ev) 0 ROOT
9 een een [een] LID(onbep,stan,agr) 11 det
10 nieuwe nieuw [nieuw][e] ADJ(prenom,basis,met-e,stan) 11 mod
11 zoekmachine zoekmachine [zoek][machine] N(soort,ev,basis,zijd,stan) 8 su
12 lanceren lanceren [lanceren] WW(inf,vrij,zonder) 8 vc
13 . . [.] LET() 12 punct
The seven columns contain the following information:
- Token number (resets every sentence)
- Token
- Lemma (according to MBLEM)
- Morphological segmentation (according to MBMA)
- PoS tag (CGN tagset; according to MBT)
- Token number of head word (according to CSI-DP)
- Type of dependency relation with head word
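As an illustration of how this output could be read back in, here is a
minimal Python sketch that turns the seven columns into records and
groups them into sentences. It is not part of Tadpole. The `Token`
class and the function names are hypothetical, and the sketch assumes
that the parser is enabled (so all seven columns are present) and that
a new sentence starts whenever the token number resets to 1. Blank
lines, if any occur, are also treated as sentence boundaries.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Token:
    index: int      # token number within the sentence
    form: str       # token as it appears in the text
    lemma: str      # lemma (MBLEM)
    morphemes: str  # morphological segmentation (MBMA)
    pos: str        # PoS tag (CGN tagset, MBT)
    head: int       # token number of the head word (0 for ROOT)
    relation: str   # type of dependency relation with the head word


def parse_line(line: str) -> Token:
    """Parse one tab-delimited output line into a Token."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}")
    index, form, lemma, morphemes, pos, head, relation = fields
    return Token(int(index), form, lemma, morphemes, pos, int(head), relation)


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of Tokens."""
    sentence: List[Token] = []
    for line in lines:
        if not line.strip():
            # Assumption: a blank line, if present, ends the sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        token = parse_line(line)
        # The token number resets every sentence, so 1 starts a new one.
        if token.index == 1 and sentence:
            yield sentence
            sentence = []
        sentence.append(token)
    if sentence:
        yield sentence
```

With a file opened as UTF-8, `read_sentences(handle)` can then be
iterated over directly; the root of each sentence is the token whose
`head` is 0.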
References
If you use Tadpole for your own work, please cite the following paper:
Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S.
(2007). An efficient memory-based morphosyntactic tagger and parser
for Dutch. In F. van Eynde, P. Dirix, I. Schuurman, and
V. Vandeghinste (Eds.), Selected Papers of the 17th Computational
Linguistics in the Netherlands Meeting, Leuven, Belgium, pp. 99-114.