Cleaning and Enriching Research Data on Reptiles and Amphibians. The MITCH Pilot Project and "nulmeting".

Author(s): Caroline Sporleder, Marieke van Erp, Tijn Porcelijn, Antal van den Bosch, Pim Arntzen and Erik van Nieukerken

Reference: Technical Report, ILK 06-01, Tilburg University, 2006.

Abstract: This document describes a pilot study undertaken as part of the MITCH project. In the pilot study, we developed several data clean-up and enrichment techniques for textual databases. In particular, we looked at (i) named entity tagging, (ii) automatic data completion, and (iii) error correction. All of the techniques we developed are knowledge-lean and data-driven, i.e., they do not require the provision of background knowledge or of manually labelled training data. Instead, our techniques exploit the semi-structured nature of textual databases to automatically derive suitable training material. Hence, they can easily be ported to textual databases from other domains. We tested the techniques on a zoological specimen database.

