Universiteit van Tilburg

The Pilot Project
Cleaning and Enriching Research Data on Reptiles and Amphibians

The MITCH pilot project ran from 1 September to 31 December 2005. In the pilot project, we looked at various data clean-up and enrichment techniques for textual databases, such as named-entity tagging, automatic data completion, and error detection and correction. The techniques we developed make use of machine learning and were tested on a database containing information about some of the reptile and amphibian specimens in Naturalis's collection. They could easily be ported to other textual databases, however, because all methods are knowledge-lean and data-driven: they do not require the user to provide background knowledge about the data or to manually annotate example data for training. Instead, they exploit the semi-structured nature of the data to derive training material automatically.

(For more information about the pilot project see the technical report and try out the demo of the error corrector developed by MITCH.)

The Reptiles and Amphibians Database
  • manually created archive of specimens
  • 16,870 entries (rows), 47 columns
  • entries provide information about specimens (e.g., taxonomic information, circumstances of collection, storage information)
  • database fields in a variety of formats (dates, numbers, free text)
  • multi-lingual, mainly Dutch

Named-Entity Tagging

Named-entity tagging refers to the task of determining whether a text string contains named entities, such as M.S. Hoogmoed or Surinam, and, if so, determining which entity category they belong to, e.g., person or location.

Named-entity tagging is useful for many information extraction tasks. Textual databases typically contain several "free text" fields, such as Special Remarks, that could benefit from named-entity tagging. They also frequently contain entity-specific fields, such as Collector, which normally contains expressions of type person. These fields can be exploited to automatically extract lists of named-entity expressions (so-called gazetteers). The gazetteers can then be used in a simple look-up tagger or to extract training data for a trained tagger. We investigated how successful such an approach is and obtained good results for both the look-up tagger and the machine-learned tagger. We also found that a generic named-entity tagger may be of only limited use for this type of data.
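The gazetteer idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the example records, the mapping from columns to entity categories, and the function names `build_gazetteers` and `lookup_tag` are hypothetical and are not taken from the MITCH implementation. The tagger uses plain case-sensitive substring matching, which a real system would probably refine.

```python
from collections import defaultdict

# Hypothetical example records; field names and values are illustrative only.
records = [
    {"Collector": "M.S. Hoogmoed", "Country": "Surinam", "Special Remarks": ""},
    {"Collector": "J. Smith", "Country": "South Africa", "Special Remarks": ""},
]

# Assumed mapping from entity-specific columns to entity categories.
FIELD_TO_CATEGORY = {"Collector": "person", "Country": "location"}


def build_gazetteers(records, field_to_category):
    """Collect the values of entity-specific fields into one set per category."""
    gazetteers = defaultdict(set)
    for record in records:
        for field, category in field_to_category.items():
            value = record.get(field, "").strip()
            if value:
                gazetteers[category].add(value)
    return gazetteers


def lookup_tag(text, gazetteers):
    """Return sorted (start, end, category, surface) tuples for gazetteer hits.

    Longer entries are matched first and overlapping matches are skipped.
    """
    entries = sorted(
        ((entry, category) for category, es in gazetteers.items() for entry in es),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    taken = [False] * len(text)
    matches = []
    for entry, category in entries:
        start = text.find(entry)
        while start != -1:
            end = start + len(entry)
            if not any(taken[start:end]):
                matches.append((start, end, category, entry))
                for i in range(start, end):
                    taken[i] = True
            start = text.find(entry, start + 1)
    return sorted(matches)


gazetteers = build_gazetteers(records, FIELD_TO_CATEGORY)
tags = lookup_tag("collected in Surinam by M.S. Hoogmoed", gazetteers)
```

The same gazetteers could also be used to label free-text fields automatically and thereby produce training data for a learned tagger, which is the second use mentioned above.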

(For more information see the LREC-06 paper.)

Automatic Data Completion

Automatic data completion refers to the task of automatically filling in incomplete information. Databases frequently contain empty cells. Not all empty cells indicate missing information; a Special Remarks field, for example, does not always have to be filled. In many cases, however, empty cells should be filled.

We investigated whether it is possible to fill empty fields in our database automatically by exploiting the fact that the field values within a record are often not independent. For example, if a field entitled Location contains the value Tafel Mountain, it is likely that the field Country contains the value South Africa. We trained a classifier to predict the value of a field given the values of the other fields in a database record. This strategy turned out to be surprisingly successful for our database, even for the free-text fields, with accuracies of 63% and higher. This indicates that our data contains many interdependencies that can be exploited in this way.
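As a rough illustration of predicting one field from the others, the sketch below uses a simple co-occurrence voting scheme. This is an assumption chosen for brevity: the text does not say which classifier was used, and the function names `train_completer` and `predict_field`, the example records, and the collector names are all hypothetical.

```python
from collections import Counter, defaultdict


def train_completer(records, target):
    """Count how often each (field, value) pair co-occurs with each target value."""
    votes = defaultdict(Counter)
    for record in records:
        label = record.get(target)
        if not label:
            continue  # records with an empty target field give no training signal
        for field, value in record.items():
            if field != target and value:
                votes[(field, value)][label] += 1
    return votes


def predict_field(votes, record, target):
    """Predict the target value by summing the votes of the record's other fields."""
    tally = Counter()
    for field, value in record.items():
        if field != target and value:
            tally.update(votes.get((field, value), {}))
    return tally.most_common(1)[0][0] if tally else None


# Hypothetical training records; only complete rows contribute to training.
training = [
    {"Location": "Tafel Mountain", "Collector": "Collector A", "Country": "South Africa"},
    {"Location": "Tafel Mountain", "Collector": "Collector B", "Country": "South Africa"},
    {"Location": "Paramaribo", "Collector": "Collector A", "Country": "Surinam"},
]

votes = train_completer(training, target="Country")
incomplete = {"Location": "Tafel Mountain", "Collector": "Collector A", "Country": ""}
prediction = predict_field(votes, incomplete, target="Country")
```

Because the training material is taken from the filled cells of the database itself, no manual annotation is needed, in line with the knowledge-lean approach of the project.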

Error Detection and Correction

Real-life data is rarely error-free. Even well-maintained databases typically contain a small number of errors. These can range from typos to content errors (i.e., database fields which contain a wrong value). Such errors usually have a negative effect on information retrieval; typos, for example, can prevent information from being found and retrieved.

While error detection is an active research area, most work is not geared towards textual databases. Most work on outlier detection, for instance, requires that the data is numerical or at least categorical (i.e., it is assumed that data values are atomic). However, textual databases typically contain free-text fields. The values of these fields are relatively long text strings which should not be treated as atoms.

Given these shortcomings of existing error detection methods, we developed two new error detection techniques that are specifically aimed at detecting errors in textual databases. The first method, which we called horizontal error detection, aims to detect fields which contain a wrong value, e.g., South America instead of South Africa in a Country field. These errors are detected by comparing the value of a field to the values of the other fields in the record. The second method, which we called vertical error detection, aims to detect whether a text string was entered in the wrong column. For example, the string died in captivity might occur in a Location column, but it would be better placed in a Special Remarks column. To detect this type of error, we recast the problem as a text classification task, i.e., we train a classifier to predict which column a text string should be in and signal a potential error if the predicted column deviates from the original column.
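Vertical error detection as text classification can be sketched as follows. The sketch assumes a small multinomial naive Bayes classifier over lower-cased word tokens with add-one smoothing; the text does not name the classifier or the features that were actually used, so this choice, the function names, and the example records are illustrative assumptions only.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Split a cell value into lower-cased whitespace-separated tokens."""
    return text.lower().split()


def train_column_classifier(records):
    """Collect per-column word counts and cell counts from the database itself."""
    word_counts = defaultdict(Counter)
    cell_counts = Counter()
    for record in records:
        for column, text in record.items():
            if text:
                cell_counts[column] += 1
                word_counts[column].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, cell_counts, vocabulary


def predict_column(model, text):
    """Return the column with the highest naive Bayes log-score for the text."""
    word_counts, cell_counts, vocabulary = model
    total_cells = sum(cell_counts.values())
    best_column, best_score = None, -math.inf
    for column in cell_counts:
        total_words = sum(word_counts[column].values())
        score = math.log(cell_counts[column] / total_cells)
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out a column.
            score += math.log(
                (word_counts[column][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_column, best_score = column, score
    return best_column


def flag_vertical_errors(model, record):
    """List (column, text, predicted_column) where the prediction disagrees."""
    flagged = []
    for column, text in record.items():
        if text:
            predicted = predict_column(model, text)
            if predicted != column:
                flagged.append((column, text, predicted))
    return flagged


# Hypothetical training records; column names follow the examples in the text.
training = [
    {"Location": "Tafel Mountain", "Special Remarks": "died in captivity"},
    {"Location": "near Paramaribo", "Special Remarks": "specimen died in transit"},
    {"Location": "Cape Town", "Special Remarks": "damaged tail"},
]
model = train_column_classifier(training)
suspect = {"Location": "died in captivity", "Special Remarks": ""}
flags = flag_vertical_errors(model, suspect)
```

A flagged cell is only a potential error: the classifier signals that the string looks more typical of another column, and a curator would still need to confirm the correction.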

(For more information see the ATEM-06 paper and try the demo.)

© 2005 Tilburg University, Caroline Sporleder | Last update: 24 August 2006