Frog: Dutch morpho-syntactic analyzer and dependency parser
  

About Frog

Frog, formerly known as Tadpole, is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package. Most modules were created in the 1990s at the ILK Research Group (Tilburg University, the Netherlands) and the CLiPS Research Centre (University of Antwerp, Belgium). Over the years they have been integrated into a single text processing tool. More recently, a dependency parser, a base phrase chunker, and a named-entity recognizer module were added.

Various (re)programming rounds have been made possible through funding by NWO, the Netherlands Organisation for Scientific Research, particularly under the CGN project, the IMIX programme, the Implicit Linguistics project, and the CLARIN-NL programme.

What does it do?

Frog's current version will tokenize, tag, lemmatize, and morphologically segment word tokens in Dutch text files, will assign a dependency graph to each sentence, will identify the base phrase chunks in the sentence, and will attempt to find and label all named entities.

Frog produces either FoLiA XML or tab-delimited, column-formatted output with one line per token.

The ten columns contain the following information:

  1. Token number (resets every sentence)
  2. Token
  3. Lemma (according to MBLEM)
  4. Morphological segmentation (according to MBMA)
  5. PoS tag (CGN tagset; according to MBT)
  6. Confidence in the PoS tag: a number between 0 and 1 representing the probability mass assigned to the best-guess tag in the tag distribution
  7. Named entity type, identifying person (PER), organization (ORG), location (LOC), product (PRO), event (EVE), and miscellaneous (MISC), using a BIO (or IOB2) encoding
  8. Base (non-embedded) phrase chunk in BIO encoding
  9. Token number of head word in dependency graph (according to CSI-DP)
  10. Type of dependency relation with head word
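
The column layout above can be read with a few lines of Python. The following is a minimal sketch, not part of Frog itself. It assumes that every line carries all ten tab-separated columns, that BIO tags are written as B-PER, I-PER, and O, and that a new sentence starts whenever the token number returns to 1. The names FrogToken, parse_frog_output, and bio_spans, as well as the sample lines, are invented for illustration and are not actual Frog output.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class FrogToken:
    index: int             # column 1: token number within the sentence
    token: str             # column 2: token
    lemma: str             # column 3: lemma
    morphology: str        # column 4: morphological segmentation
    pos: str               # column 5: PoS tag (CGN tagset)
    pos_confidence: float  # column 6: confidence in the PoS tag
    named_entity: str      # column 7: named entity type, BIO encoded
    chunk: str             # column 8: base phrase chunk, BIO encoded
    head: str              # column 9: token number of the head word (kept as text)
    relation: str          # column 10: dependency relation with the head word


def parse_frog_output(text: str) -> Iterator[List[FrogToken]]:
    """Yield one list of tokens per sentence."""
    sentence: List[FrogToken] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines carry no information
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        index = int(cols[0])
        # The token number resets every sentence, so 1 marks a new sentence.
        if index == 1 and sentence:
            yield sentence
            sentence = []
        sentence.append(FrogToken(index, cols[1], cols[2], cols[3], cols[4],
                                  float(cols[5]), cols[6], cols[7], cols[8], cols[9]))
    if sentence:
        yield sentence


def bio_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Turn BIO tags into (type, start, end) spans; end is exclusive."""
    spans: List[Tuple[str, int, int]] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        prefix, _, kind = tag.partition("-")
        continues = prefix == "I" and label is not None and kind == label
        if not continues:
            if label is not None:
                spans.append((label, start, i))  # close the open span
            if prefix in ("B", "I") and kind:
                start, label = i, kind           # open a new span
            else:
                start, label = None, None        # outside any span
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Invented sample lines, for illustration only.
SAMPLE = (
    "1\tJan\tJan\t[Jan]\tSPEC(deeleigen)\t0.99\tB-PER\tB-NP\t2\tsu\n"
    "2\tslaapt\tslapen\t[slaap][t]\tWW(pv,tgw,met-t)\t0.98\tO\tB-VP\t0\tROOT\n"
    "1\tHij\thij\t[hij]\tVNW(pers,pron,nomin,vol,3,ev,masc)\t0.97\tO\tB-NP\t2\tsu\n"
    "2\tdroomt\tdromen\t[droom][t]\tWW(pv,tgw,met-t)\t0.96\tO\tB-VP\t0\tROOT\n"
)

sentences = list(parse_frog_output(SAMPLE))
entities = bio_spans([t.named_entity for t in sentences[0]])
```

The head column is kept as text rather than converted to an integer, because this sketch makes no assumption about how the root of the dependency graph is encoded or what Frog writes when the parser is skipped.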

References

If you use Frog for your own work, please cite the following paper:

Credits and contact information

Frog, formerly known as Tadpole and before that as MB-TALPA, was coded by Bertjan Busser, Ko van der Sloot, Maarten van Gompel, and Peter Berck, subsuming code by Sander Canisius (constraint satisfaction inference-based dependency parser), Antal van den Bosch (MBMA, MBLEM, tagger-lemmatizer integration), Jakub Zavrel (MBT), and Maarten van Gompel (Ucto). In the context of the CLARIN-NL infrastructure project TTNWW, Frederik Vaassen (CLiPS, Antwerp) created the base phrase chunking module, and Bart Desmet (LT3, Ghent) provided the data for the named-entity module.

The development of Frog relies on earlier work and ideas from Ko van der Sloot (lead programmer of MBT and TiMBL and the TiMBL API), Walter Daelemans, Jakub Zavrel, Peter Berck, Gert Durieux, and Ton Weijters.

The development of Frog relies on your bug reports, suggestions, and comments. Please send them to a.vandenbosch (at) let.ru.nl.

Download

Frog is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.

Consult these installation instructions for details on how to install this software if you are using a Debian, Ubuntu, or Fedora-based system (recommended). If you want to build the code from source yourself, download

Because of their file size, the configuration files for Frog are NO LONGER included in the packages or in the tarball. To get them, download

Follow the same installation instructions as for Frog (see below); this will unpack the data into the Frog configuration directory.

Installation

Installing Frog in a nutshell:

  • The tarball will unpack (tar zxvf frog-latest.tar.gz) into a directory called 'frog-[version]'.
  • When in the frog-[version] directory, issue a ./configure --prefix=<installdir> command, followed by make and make install.

Frog relies on other software, so before installing Frog, check the following list of dependencies and make sure this software is installed:

  • Frog requires current versions of
    • Timbl (the memory-based classifier engine),
    • TimblServer (for server functionality around Timbl),
    • Mbt (the memory-based tagger),
    • Ucto (Unicode tokenizer), and
    • libfolia (a FoLiA library).
    Frog will not work with versions of Timbl before 6.4 or Mbt before 3.2; please make sure you have the latest versions installed before installing Frog.
  • Frog also depends on Python 2.5 or higher and ICU 3.6 or higher. You may also need to install fresh versions of pkgconfig, libxml2 and/or libxml2-dev, and the autoconf toolkit.
  • Mac users are advised to install the latest version of Xcode, and use Fink, MacPorts, or Homebrew to install the above libraries.
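
The minimum versions named in this list can be checked mechanically. The sketch below is an illustration only: it assumes you have already collected the installed version strings yourself (how to query each package is not covered here), and that they are plain dotted numbers such as 6.4.2. The names MINIMUM_VERSIONS, parse_version, and too_old are hypothetical.

```python
from typing import Dict, List, Tuple

# Minimum versions as stated in the dependency list above.
MINIMUM_VERSIONS: Dict[str, Tuple[int, ...]] = {
    "timbl": (6, 4),
    "mbt": (3, 2),
    "python": (2, 5),
    "icu": (3, 6),
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Convert a dotted version string such as '6.4.2' into (6, 4, 2)."""
    return tuple(int(part) for part in version.split("."))


def too_old(installed: Dict[str, str]) -> List[str]:
    """Return one message per dependency that is missing or below its minimum."""
    problems: List[str] = []
    for name, minimum in MINIMUM_VERSIONS.items():
        found = installed.get(name)
        wanted = ".".join(str(n) for n in minimum)
        if found is None:
            problems.append(f"{name}: not found")
        elif parse_version(found) < minimum:
            problems.append(f"{name}: {found} is older than {wanted}")
    return problems


# Hypothetical installed versions, for illustration only.
report = too_old({"timbl": "6.3.1", "mbt": "3.2", "python": "2.7", "icu": "4.8"})
```

Version strings with non-numeric suffixes (for example a release-candidate marker) would need extra handling that this sketch leaves out.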

Making Frog Leap

To let Frog leap, invoke frog without arguments; this produces a list of the available command-line options. Some of the main options are:

  • frog -t <file> will run all modules on the text in <file>.
  • frog --testdir=<dir> will let Frog process all files in the directory <dir>.
  • frog -S <port> starts up a Frog server listening on port number <port>.
  • With --skip=[mptnc] you can tell Frog to skip tokenization (t), base phrase chunking (c), named-entity recognition (n), multi-word unit chunking for the parser (m), or parsing (p).
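
When Frog is driven from a script, the options above can be assembled into an argument list first. The following sketch uses only the four options listed here. The function name frog_command and its parameters are hypothetical, and the rule that exactly one of the three input modes must be chosen is an assumption of this sketch, not something Frog is documented here to enforce.

```python
from typing import List, Optional

# Module letters accepted by --skip, as listed above.
SKIPPABLE = set("mptnc")


def frog_command(text_file: Optional[str] = None,
                 test_dir: Optional[str] = None,
                 server_port: Optional[int] = None,
                 skip: str = "") -> List[str]:
    """Build an argv list for frog from the options described above."""
    chosen = [x for x in (text_file, test_dir, server_port) if x is not None]
    if len(chosen) != 1:
        # Assumption of this sketch: pick exactly one input mode.
        raise ValueError("give exactly one of text_file, test_dir, server_port")
    unknown = set(skip) - SKIPPABLE
    if unknown:
        raise ValueError(f"unknown module letters for --skip: {sorted(unknown)}")

    argv = ["frog"]
    if text_file is not None:
        argv += ["-t", text_file]
    elif test_dir is not None:
        argv.append(f"--testdir={test_dir}")
    else:
        argv += ["-S", str(server_port)]
    if skip:
        argv.append(f"--skip={skip}")
    return argv


# Example: process one file without the dependency parser.
command = frog_command(text_file="input.txt", skip="p")
```

The resulting list can be passed to subprocess.run(command, check=True) from the Python standard library, provided frog is installed and on the PATH.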

Without the dependency parser, Frog will process about 900 words per second, and consume 542 MB on a 64-bit Linux architecture. With the parser, Frog's speed reduces to about 200 words per second, taking just under 1200 MB of memory; you have been warned.
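
As a rough planning aid, the throughput figures above translate directly into an estimated running time. This is simple arithmetic on the numbers quoted here; actual speed will depend on hardware and text, and the function name estimated_seconds is hypothetical.

```python
# Approximate throughput in words per second, taken from the figures above.
WORDS_PER_SECOND = {False: 900.0, True: 200.0}  # key: is the parser enabled?


def estimated_seconds(word_count: int, with_parser: bool = True) -> float:
    """Estimate processing time for word_count words."""
    return word_count / WORDS_PER_SECOND[with_parser]


# A one-million-word corpus with the parser enabled.
seconds_with_parser = estimated_seconds(1000000, with_parser=True)
```

For one million words this gives 5000 seconds (about 83 minutes) with the parser, against roughly 1111 seconds (under 19 minutes) without it.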

Notice: we are in the process of writing a reference guide for Frog that explains all options in detail.

Archived versions

You can find archived versions of Frog in our public software repository:

Antal.vdnBosch@uvt.nl