Frog: Dutch morpho-syntactic analyzer and dependency parser
  

About Frog

Frog, formerly known as Tadpole, is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package. Most modules were created in the 1990s at the ILK Research Group (Tilburg University, the Netherlands) and the CLiPS Research Centre (University of Antwerp, Belgium). Over the years they have been integrated into a single text processing tool. More recently, a dependency parser, a base phrase chunker, and a named-entity recognizer module were added. Where possible, Frog makes use of multi-processor support to run subtasks in parallel.

Various (re)programming rounds have been made possible through funding by NWO, the Netherlands Organisation for Scientific Research, particularly under the CGN project, the IMIX programme, the Implicit Linguistics project, and the CLARIN-NL programme.

What does it do?

Frog's current version will tokenize, tag, lemmatize, and morphologically segment word tokens in Dutch text files, will assign a dependency graph to each sentence, will identify the base phrase chunks in the sentence, and will attempt to find and label all named entities.

Frog produces either FoLiA XML or tab-delimited, column-formatted output with one line per token.

The ten columns contain the following information:

  1. Token number (resets every sentence)
  2. Token
  3. Lemma (according to MBLEM)
  4. Morphological segmentation (according to MBMA)
  5. PoS tag (CGN tagset; according to MBT)
  6. Confidence in the PoS tag: a number between 0 and 1 representing the probability mass assigned to the best-guess tag in the tag distribution
  7. Named entity type, identifying person (PER), organization (ORG), location (LOC), product (PRO), event (EVE), and miscellaneous (MISC), using a BIO (or IOB2) encoding
  8. Base (non-embedded) phrase chunk in BIO encoding
  9. Token number of head word in dependency graph (according to CSI-DP)
  10. Type of dependency relation with head word
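
As a minimal sketch of how the ten columns listed above could be read in Python, the block below splits each tab-delimited line into named fields and then groups the BIO-encoded named-entity tags into spans. The names FrogToken, parse_frog_line, and entity_spans are hypothetical helpers, not part of Frog. The sample lines are hand-written for illustration only and are not actual Frog output; the sketch also assumes that all ten columns are present on every line.

```python
from typing import List, NamedTuple, Optional, Tuple


class FrogToken(NamedTuple):
    """One line of column output; field order follows the ten columns above."""
    index: int
    token: str
    lemma: str
    morphology: str
    pos: str
    pos_confidence: float
    named_entity: str
    chunk: str
    head: Optional[int]
    relation: str


def parse_frog_line(line: str) -> FrogToken:
    """Split one tab-delimited line into its ten fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError("expected 10 columns, got %d" % len(fields))
    # Assumption: the head column may be empty when no parse is available.
    head = int(fields[8]) if fields[8].isdigit() else None
    return FrogToken(int(fields[0]), fields[1], fields[2], fields[3],
                     fields[4], float(fields[5]), fields[6], fields[7],
                     head, fields[9])


def entity_spans(tokens: List[FrogToken]) -> List[Tuple[str, str]]:
    """Group BIO tags (B-TYPE, I-TYPE, O) into (type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_words: List[str] = []
    for tok in tokens:
        tag = tok.named_entity
        if tag.startswith("B-"):
            # A B- tag always closes any open span and starts a new one.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [tok.token]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            current_words.append(tok.token)
        else:
            # "O" (or an inconsistent I- tag) closes any open span.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans


# Hand-written illustrative lines; every value here is made up.
SAMPLE_LINES = [
    "1\tJan\tJan\t[Jan]\tSPEC(deeleigen)\t0.99\tB-PER\tB-NP\t3\tsu",
    "2\tJansen\tJansen\t[Jansen]\tSPEC(deeleigen)\t0.98\tI-PER\tI-NP\t1\tmwp",
    "3\twoont\twonen\t[woon][t]\tWW(pv,tgw,met-t)\t0.97\tO\tB-VP\t0\tROOT",
    "4\tin\tin\t[in]\tVZ(init)\t0.99\tO\tB-PP\t3\tld",
    "5\tGent\tGent\t[Gent]\tSPEC(deeleigen)\t0.95\tB-LOC\tB-NP\t4\tobj1",
]

if __name__ == "__main__":
    tokens = [parse_frog_line(line) for line in SAMPLE_LINES]
    print(entity_spans(tokens))
```

Because the token number resets every sentence, a caller that processes a whole file would typically start a new token list whenever the first column returns to 1.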

References

If you use Frog for your own work, please cite the following paper:

Credits and contact information

Frog, formerly known as Tadpole and before that as MB-TALPA, was coded by Bertjan Busser, Ko van der Sloot, Maarten van Gompel, and Peter Berck, subsuming code by Sander Canisius (constraint satisfaction inference-based dependency parser), Antal van den Bosch (MBMA, MBLEM, tagger-lemmatizer integration), Jakub Zavrel (MBT), and Maarten van Gompel (Ucto). In the context of the CLARIN-NL infrastructure project TTNWW, Frederik Vaassen (CLiPS, Antwerp) created the base phrase chunking module, and Bart Desmet (LT3, Ghent) provided the data for the named-entity module.

Maarten van Gompel designed the FoLiA XML output format that Frog produces, and also wrote a Frog client in Python. Wouter van Atteveldt wrote a Frog client in R.

The development of Frog relies on earlier work and ideas from Ko van der Sloot (lead programmer of MBT and TiMBL and the TiMBL API), Walter Daelemans, Jakub Zavrel, Peter Berck, Gert Durieux, and Ton Weijters.

The development and improvement of Frog also relies on your bug reports, suggestions, and comments. Please send them to a.vandenbosch (at) let.ru.nl.

Memory and speed considerations

Without the dependency parser, Frog will process about 900 words per second and consume 542 MB of memory on a 64-bit Linux architecture. With the parser, Frog's speed drops to about 200 words per second, and memory use rises to just under 1200 MB; you have been warned.
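
To get a feel for what these figures mean for a larger corpus, the following back-of-the-envelope sketch converts a word count into an estimated processing time. It simply divides by the approximate rates quoted above; the function name estimated_seconds is hypothetical, and actual throughput will depend on hardware and input.

```python
# Approximate rates quoted above (words per second).
WORDS_PER_SECOND = {"without_parser": 900, "with_parser": 200}


def estimated_seconds(word_count: int, with_parser: bool) -> float:
    """Rough processing time for word_count words at the quoted rate."""
    key = "with_parser" if with_parser else "without_parser"
    return word_count / WORDS_PER_SECOND[key]


if __name__ == "__main__":
    corpus_size = 1_000_000  # illustrative corpus size in words
    for with_parser in (False, True):
        minutes = estimated_seconds(corpus_size, with_parser) / 60
        label = "with parser" if with_parser else "without parser"
        print("%s: about %.1f minutes" % (label, minutes))
```

For a corpus of one million words this works out to roughly 18.5 minutes without the parser and roughly 83.3 minutes with it.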

Archived versions

You can find archived versions of Frog in our public software repository:

Download

Frog is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.

Consult these installation instructions for details on how to install this software if you are using a Debian, Ubuntu, or Fedora-based system (recommended). If you want to build the code from source yourself, download

Because of file sizes and to cleanly separate code from data, the data and configuration files for the modules of Frog have been packaged separately. To get them, download

Follow the same installation instructions as for Frog (see below); this will unpack the data into the Frog configuration directory.

Installing and running Frog

If you downloaded Frog as a tarball, proceed as follows:

  • Unpack the tarball (tar zxvf frog-latest.tar.gz); this creates a directory called 'frog-[version]'.
  • In the frog-[version] directory, run ./configure --prefix=<installdir>, followed by make and make install.
  • Repeat the same procedure for the frogdata tarball.

Frog relies on other software, so before installing Frog, check the following list of dependencies and make sure this software is installed:

  • Frog assumes that current versions of Timbl and Mbt are installed. Frog will not work with versions of Timbl before 6.4 or Mbt before 3.2; please make sure you have the latest versions installed before installing Frog.
  • Frog also depends on Python 2.5 or higher and ICU 3.6 or higher. You may also need to install fresh versions of pkgconfig, libxml2 and/or libxml2-dev, and the autoconf toolkit.
  • Mac users are advised to install the latest version of Xcode and to use Homebrew to install the above libraries. Frog compiles successfully (albeit without OpenMP support) on Mac OS X Yosemite 10.10 with Xcode 6.0.1, LLVM version 6.0, and a brew install of icu4c.
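
The minimum versions named above can be checked with a small comparison helper. The sketch below is only an illustration: the names MINIMUM_VERSIONS, parse_version, and meets_minimum are hypothetical, and the installed version strings are assumed to be supplied by the caller (the sketch does not query Timbl, Mbt, or ICU itself). It compares versions numerically, so that for example 6.10 counts as newer than 6.4.

```python
from typing import Tuple

# Minimum versions taken from the dependency list above.
MINIMUM_VERSIONS = {
    "timbl": (6, 4),
    "mbt": (3, 2),
    "python": (2, 5),
    "icu": (3, 6),
}


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '6.4.2' into (6, 4, 2).

    Assumes purely numeric components; suffixes like 'rc1' are not handled.
    """
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(name: str, installed: str) -> bool:
    """True if the installed version is at least the required minimum."""
    return parse_version(installed) >= MINIMUM_VERSIONS[name]


if __name__ == "__main__":
    # Illustrative version strings, not queried from a real installation.
    for name, version in [("timbl", "6.3.9"), ("timbl", "6.10"), ("mbt", "3.2")]:
        status = "ok" if meets_minimum(name, version) else "too old"
        print("%s %s: %s" % (name, version, status))
```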

Making Frog Leap

To let Frog leap, invoke frog with the options you need; invoking frog without arguments produces a list of available command-line options. Some of the main options are:

  • frog -t <file> will run all modules on the text in <file>.
  • frog --testdir=<dir> will let Frog process all files in the directory <dir>.
  • frog -S <port> starts up a Frog server listening on port number <port>.
  • With --skip=[mptnc] you can tell Frog to skip tokenization (t), base phrase chunking (c), named-entity recognition (n), multi-word unit chunking for the parser (m), or parsing (p).
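
When Frog is driven from a script, the options above can be assembled into an argument list before the program is started. The following is a minimal sketch that uses only the options listed here; build_frog_command is a hypothetical helper, and the file name in the usage example is a placeholder. Actually running the command requires frog to be installed and on the PATH.

```python
from typing import List, Optional

# Module letters accepted by --skip, as listed above.
SKIPPABLE_MODULES = set("mptnc")


def build_frog_command(text_file: Optional[str] = None,
                       test_dir: Optional[str] = None,
                       server_port: Optional[int] = None,
                       skip: str = "") -> List[str]:
    """Assemble a frog command line from the options described above."""
    unknown = set(skip) - SKIPPABLE_MODULES
    if unknown:
        raise ValueError("unknown module letters for --skip: %s"
                         % "".join(sorted(unknown)))
    command = ["frog"]
    if text_file is not None:
        command += ["-t", text_file]
    if test_dir is not None:
        command.append("--testdir=%s" % test_dir)
    if server_port is not None:
        command += ["-S", str(server_port)]
    if skip:
        command.append("--skip=%s" % skip)
    return command


if __name__ == "__main__":
    # Skip the parser and the multi-word unit chunker to save time and memory.
    command = build_frog_command(text_file="input.txt", skip="pm")
    print(" ".join(command))
    # To run it (requires an installed frog):
    #   import subprocess
    #   subprocess.run(command, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.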

Calling the Frog server from Python and R

Python: The pynlpl toolkit (see documentation), written by Maarten van Gompel, contains a Frog client through which a Frog server running on a port can be called, and its output processed. To install pynlpl, invoke

$ easy_install pynlpl

Communication with Frog can be established as follows:

from pynlpl.clients.frogclient import FrogClient

# Connect to a Frog server that is already listening on this port
port = 8020
frogclient = FrogClient('localhost', port)

for data in frogclient.process("Een voorbeeldbericht om te froggen"):
    word, lemma, morph, pos = data[:4]
    # TODO: further processing per frogged word
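
As one possible form of the further processing, the sketch below collects the lemmas of each sentence. It is written against any object that offers a process(text) method yielding tuples whose first four fields are word, lemma, morph, and pos, so it can be tried without a running server. Two things here are assumptions rather than documented behaviour: that the client marks a sentence boundary with a tuple whose word is None, and the canned output of StubClient, which is made up for illustration (including its placeholder PoS tags). Please check both against the pynlpl documentation before relying on them.

```python
from typing import Iterable, List, Optional, Tuple

FrogTuple = Tuple[Optional[str], Optional[str], Optional[str], Optional[str]]


def lemmas_per_sentence(client, text: str) -> List[List[str]]:
    """Collect lemmas per sentence from a client's process() output."""
    sentences: List[List[str]] = [[]]
    for data in client.process(text):
        word, lemma, morph, pos = data[:4]
        if word is None:
            # Assumed sentence-boundary marker: start a new sentence.
            if sentences[-1]:
                sentences.append([])
            continue
        sentences[-1].append(lemma)
    return [sentence for sentence in sentences if sentence]


class StubClient:
    """Stand-in for a real client; returns made-up tuples."""

    def process(self, text: str) -> Iterable[FrogTuple]:
        return [
            ("Een", "een", "[een]", "LID"),
            ("voorbeeld", "voorbeeld", "[voorbeeld]", "N"),
            (None, None, None, None),
            ("Nog", "nog", "[nog]", "BW"),
            ("een", "een", "[een]", "LID"),
        ]


if __name__ == "__main__":
    print(lemmas_per_sentence(StubClient(), "Een voorbeeld. Nog een."))
```

With a real server, the StubClient instance would be replaced by the FrogClient object created above.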

R: Wouter van Atteveldt has developed a Frog client for R, frogr. This package contains functions for connecting to a Frog server from R and creating a document-term matrix from the resulting tokens. Since this yields a standard document-term matrix, it can be used with other R packages, for example for corpus analysis or for text classification using RTextTools.

Notice: we are in the process of writing a reference guide for Frog that explains all options in detail.
