About Frog
Frog, formerly known as Tadpole, is an integration of memory-based
natural language processing (NLP) modules developed for Dutch. All NLP
modules are based on TiMBL, the Tilburg memory-based learning software
package. Most modules were created in the 1990s at the ILK Research
Group (Tilburg University, the Netherlands) and the CLiPS Research
Centre (University of Antwerp, Belgium). Over the years they have been
integrated into a single text processing tool. More recently, a
dependency parser, a base phrase chunker, and a named-entity recognizer
were added.
Various (re)programming rounds have been made possible through funding
by NWO, the Netherlands Organisation for Scientific Research,
particularly under the CGN project, the IMIX programme, the Implicit
Linguistics project, and the CLARIN-NL programme.
What does it do?
The current version of Frog tokenizes, tags, lemmatizes, and
morphologically segments word tokens in Dutch text files. It also
assigns a dependency graph to each sentence, identifies the base phrase
chunks in each sentence, and attempts to find and label all named
entities.
Frog produces either FoLiA XML or tab-delimited column-formatted
output with one line per token.
The ten columns contain the following information:
- Token number (resets every sentence)
- Token
- Lemma (according to MBLEM)
- Morphological segmentation (according to MBMA)
- PoS tag (CGN tagset; according to MBT)
- Confidence in the PoS tag: a number between 0 and 1 representing the probability mass assigned to the best-guess tag in the tag distribution
- Named entity type, identifying person (PER), organization (ORG), location (LOC), product (PRO), event (EVE), and miscellaneous (MISC), using a BIO (or IOB2) encoding
- Base (non-embedded) phrase chunk in BIO encoding
- Token number of head word in dependency graph (according to CSI-DP)
- Type of dependency relation with head word
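The column layout above can be read with a few lines of Python. The following is a minimal sketch, not part of Frog itself: the names `Token`, `parse_frog_columns`, and `bio_spans` are hypothetical helpers, and the rows in `SAMPLE` are hand-written placeholders that only mimic the ten-column layout (they are not real Frog output, and their tag values are simplified). The sketch assumes that all ten columns are present on every token line and that a token number of 1 marks the start of a new sentence, as described in the list above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Token:
    """One line of the ten-column output (field names are illustrative)."""
    index: int             # token number, resets every sentence
    text: str              # token
    lemma: str             # lemma
    morph: str             # morphological segmentation
    pos: str               # PoS tag
    pos_confidence: float  # confidence in the PoS tag, between 0 and 1
    ner: str               # named entity type in BIO encoding
    chunk: str             # base phrase chunk in BIO encoding
    head: Optional[int]    # token number of the head word, if numeric
    deprel: str            # type of dependency relation with the head


def parse_frog_columns(lines: Iterable[str]) -> List[List[Token]]:
    """Group tab-delimited token lines into sentences."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Treat a blank line as a sentence boundary as well.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        token = Token(
            index=int(fields[0]), text=fields[1], lemma=fields[2],
            morph=fields[3], pos=fields[4], pos_confidence=float(fields[5]),
            ner=fields[6], chunk=fields[7],
            head=int(fields[8]) if fields[8].isdigit() else None,
            deprel=fields[9],
        )
        # Token numbers reset every sentence, so index 1 starts a new one.
        if token.index == 1 and current:
            sentences.append(current)
            current = []
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


def bio_spans(tokens: List[Token], attr: str = "ner") -> List[Tuple[str, str]]:
    """Collect (label, text) spans from a BIO-encoded column."""
    spans: List[Tuple[str, str]] = []
    label: Optional[str] = None
    words: List[str] = []
    for tok in tokens:
        tag = getattr(tok, attr)
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            if words:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [tok.text]
        elif tag.startswith("I-"):
            words.append(tok.text)
        else:  # "O" or any other tag closes the open span
            if words:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if words:
        spans.append((label, " ".join(words)))
    return spans


# Hand-written placeholder rows (NOT real Frog output).
SAMPLE = (
    "1\tJan\tJan\t[Jan]\tSPEC\t0.99\tB-PER\tB-NP\t2\tsu\n"
    "2\twoont\twonen\t[woon][t]\tWW\t0.98\tO\tB-VP\t0\tROOT\n"
    "3\tin\tin\t[in]\tVZ\t0.97\tO\tB-PP\t2\tmod\n"
    "4\tDen\tDen\t[Den]\tSPEC\t0.95\tB-LOC\tB-NP\t5\tmwp\n"
    "5\tHaag\tHaag\t[Haag]\tSPEC\t0.95\tI-LOC\tI-NP\t3\tobj1\n"
    "1\tHij\thij\t[hij]\tVNW\t0.99\tO\tB-NP\t2\tsu\n"
    "2\tslaapt\tslapen\t[slaap][t]\tWW\t0.99\tO\tB-VP\t0\tROOT\n"
)

sentences = parse_frog_columns(SAMPLE.splitlines())
for sentence in sentences:
    print(bio_spans(sentence, "ner"), bio_spans(sentence, "chunk"))
```

Passing `"chunk"` instead of `"ner"` to `bio_spans` reuses the same grouping logic for the base phrase chunk column, since both columns use BIO encoding.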
References
If you use Frog for your own work, please cite the following paper:
Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S.
(2007). An efficient memory-based morphosyntactic tagger and parser for
Dutch. In F. van Eynde, P. Dirix, I. Schuurman, and V. Vandeghinste
(Eds.), Selected Papers of the 17th Computational Linguistics in the
Netherlands Meeting, Leuven, Belgium, pp. 99-114.
Credits and contact information
Frog, formerly known as Tadpole and before that as MB-TALPA, was coded
by Bertjan Busser, Ko van der Sloot, Maarten van Gompel, and Peter
Berck, subsuming code by Sander Canisius (constraint satisfaction
inference-based dependency parser), Antal van den Bosch (MBMA, MBLEM,
tagger-lemmatizer integration), Jakub Zavrel (MBT), and Maarten van
Gompel (Ucto). In the context of the CLARIN-NL infrastructure project
TTNWW, Frederik Vaassen (CLiPS, Antwerp) created the base phrase
chunking module, and Bart Desmet (LT3, Ghent) provided the data for the
named-entity module.
The development of Frog relies on earlier work and ideas from Ko van
der Sloot (lead programmer of MBT and TiMBL and the TiMBL API), Walter
Daelemans, Jakub Zavrel, Peter Berck, Gert Durieux, and Ton Weijters.
The development of Frog also relies on your bug reports, suggestions, and comments. Please send them to a.vandenbosch (at) let.ru.nl.