Tadpole
About Tadpole

Tadpole, which stands for Tagger, Dependency Parser, and Other Linguistic Engines, is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl 6.3.

Tadpole's current version will tokenize, tag, lemmatize, and morphologically segment word tokens in incoming Dutch text files, and will assign a dependency graph to each sentence.

Without the dependency parser, Tadpole processes about 1,600 words per second and uses about 530 MB of memory on a 64-bit architecture. With the parser enabled, throughput drops to about 220 words per second and memory use rises to about 1,950 MB; you have been warned.
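To get a feel for what these throughput figures mean in practice, the following minimal Python sketch estimates processing time for a corpus of a given size. The function name and the one-million-word corpus are illustrative assumptions; only the two words-per-second figures come from the text above, and real speed will depend on your hardware.

```python
# Approximate throughput figures quoted above (words per second).
WORDS_PER_SECOND = {"without_parser": 1600, "with_parser": 220}


def estimated_minutes(num_words, mode):
    """Return a rough processing-time estimate in minutes for num_words tokens."""
    return num_words / WORDS_PER_SECOND[mode] / 60.0


corpus_size = 1000000  # hypothetical corpus size, for illustration only
for mode in ("without_parser", "with_parser"):
    print("%s: about %.1f minutes" % (mode, estimated_minutes(corpus_size, mode)))
```

Under these assumptions, a corpus that takes roughly ten minutes without the parser would take well over an hour with it.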

Tadpole expects UTF-8 texts as input, and will produce tab-delimited column-formatted output, one line per token, that looks as follows:


  1  De de [de] LID(bep,stan,rest) 2 det
  2  oprichter oprichter [op][richt][er] N(soort,ev,basis,zijd,stan) 8 su
  3  van van [van] VZ(init) 2 mod
  4  Wikipedia Wikipedia [Wikipedia] SPEC(deeleigen) 3 obj1
  5  , , [,] LET() 4 punct
  6  Jimmy_Wales Jimmy_Wales [Jimmy]_[Wales] SPEC(deeleigen) 2 app
  7  , , [,] LET() 6 punct
  8  wil willen [wil] WW(pv,tgw,ev) 0 ROOT
  9  een een [een] LID(onbep,stan,agr) 11 det
  10 nieuwe nieuw [nieuw][e] ADJ(prenom,basis,met-e,stan) 11 mod
  11 zoekmachine zoekmachine [zoek][machine] N(soort,ev,basis,zijd,stan) 8 su
  12 lanceren lanceren [lanceren] WW(inf,vrij,zonder) 8 vc
  13 . . [.] LET() 12 punct

The seven columns contain the following information:
  1. Token number (resets every sentence)
  2. Token
  3. Lemma (according to MBLEM)
  4. Morphological segmentation (according to MBMA)
  5. PoS tag (CGN tagset; according to MBT)
  6. Token number of head word (according to CSI-DP)
  7. Type of dependency relation with head word
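
The column layout above is straightforward to read back into a program. The following is a minimal Python sketch, not part of Tadpole itself: the `Token` field names and the helper functions are hypothetical and chosen for illustration. It assumes that fields are separated by tabs and that a token number of 1 marks the start of a new sentence, as the column description suggests. The sample is the example output shown earlier, with the displayed whitespace converted back to tabs.

```python
from collections import namedtuple

# Hypothetical field names mirroring the seven columns described above.
Token = namedtuple("Token", "index token lemma morphemes pos head deprel")


def parse_tadpole_output(lines):
    """Group tab-delimited output lines into sentences (lists of Token)."""
    sentences, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 7:
            raise ValueError("expected 7 columns, got %d: %r" % (len(fields), raw))
        index, token, lemma, morphemes, pos, head, deprel = fields
        tok = Token(int(index), token, lemma, morphemes, pos, int(head), deprel)
        # Token numbering resets every sentence, so index 1 starts a new one.
        if tok.index == 1 and current:
            sentences.append(current)
            current = []
        current.append(tok)
    if current:
        sentences.append(current)
    return sentences


def dependents_of(sentence, head_index):
    """Return the tokens whose head column points at head_index."""
    return [t for t in sentence if t.head == head_index]


SAMPLE = """
1 De de [de] LID(bep,stan,rest) 2 det
2 oprichter oprichter [op][richt][er] N(soort,ev,basis,zijd,stan) 8 su
3 van van [van] VZ(init) 2 mod
4 Wikipedia Wikipedia [Wikipedia] SPEC(deeleigen) 3 obj1
5 , , [,] LET() 4 punct
6 Jimmy_Wales Jimmy_Wales [Jimmy]_[Wales] SPEC(deeleigen) 2 app
7 , , [,] LET() 6 punct
8 wil willen [wil] WW(pv,tgw,ev) 0 ROOT
9 een een [een] LID(onbep,stan,agr) 11 det
10 nieuwe nieuw [nieuw][e] ADJ(prenom,basis,met-e,stan) 11 mod
11 zoekmachine zoekmachine [zoek][machine] N(soort,ev,basis,zijd,stan) 8 su
12 lanceren lanceren [lanceren] WW(inf,vrij,zonder) 8 vc
13 . . [.] LET() 12 punct
"""

# Rebuild tab-delimited lines; no field in this sample contains a space.
sample_lines = ["\t".join(l.split()) for l in SAMPLE.strip().splitlines()]

sentences = parse_tadpole_output(sample_lines)
sentence = sentences[0]
root = [t for t in sentence if t.head == 0][0]
print("root:", root.token, "(lemma: %s)" % root.lemma)
print("dependents of root:", [t.token for t in dependents_of(sentence, root.index)])
```

Splitting on tabs rather than on arbitrary whitespace is a deliberate choice here, since it stays correct if a field ever contains a space; the whitespace-to-tab conversion is only needed for the sample as displayed on this page.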

References

If you use Tadpole for your own work, please cite the following paper:

Download

Tadpole is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.

Installation

Please consult the README file in the package for installation instructions. In a nutshell:

  • The tarball unpacks (tar zxvf tadpole-0.8.tar.gz) into a directory called 'tadpole-0.8'.
  • When in the tadpole-0.8 directory, issue a ./configure --prefix=<installdir> command, followed by make and make install.

Tadpole relies on other software, so before installing Tadpole, check the following list of dependencies:

  • Tadpole assumes installed versions of Timbl 6.3, TimblServer 1.0, and Mbt 3.2. Tadpole will not work with older versions of Timbl and Mbt.
  • Tadpole is also dependent on Python 2.5 or higher, libboost 1.33 or higher, and ICU 3.6 or higher. You may also need to install fresh versions of pkgconfig, libxml2, and the autoconf toolkit.
  • Mac users are advised to install the latest version of Xcode, and to use either Fink or MacPorts to install the above libraries.

Credits and contact information

Tadpole was coded by Bertjan Busser, Ko van der Sloot, Maarten van Gompel, and Peter Berck, subsuming code by Sander Canisius (constraint satisfaction inference-based dependency parser), Antal van den Bosch (MBMA, MBLEM, tagger-lemmatizer integration), Jakub Zavrel (MBT), and Sabine Buchholz (tokenization). The development of Tadpole further relies on earlier work and ideas from Ko van der Sloot (lead programmer of MBT and TiMBL and the TiMBL API), Walter Daelemans, Jakub Zavrel, Peter Berck, Gert Durieux, and Ton Weijters.

The development of Tadpole relies on your bug reports, suggestions, and comments. Please send them to Antal.vdnBosch (at) uvt.nl.
