Cleaning and Enriching Research Data on Reptiles and Amphibians. The MITCH Pilot Project and "nulmeting".
Author(s): Caroline Sporleder, Marieke van Erp, Tijn Porcelijn, Antal van den Bosch, Pim Arntzen and Erik van Nieukerken
Reference: Technical Report, ILK 06-01, Tilburg University, 2006.
This document describes a pilot study undertaken as part of the MITCH project,
in which we developed several data clean-up
and enrichment techniques for textual databases. In particular, we
looked at (i) named entity tagging, (ii) automatic data completion,
and (iii) error correction. All of the techniques we developed are
knowledge-lean and data-driven, i.e., they do not require the
provision of background knowledge or of manually labelled training
data. Instead, our techniques exploit the semi-structured nature of
textual databases to automatically derive suitable training
material. Hence, they can easily be ported to textual databases from
other domains. We tested the techniques on a zoological specimen