Correcting taxonomic errors in a specimen database
Author(s): Marieke van Erp, Caroline Sporleder, Tijn Porcelijn and Antal van den Bosch
Reference: Presented at The 16th Meeting of Computational Linguistics in the Netherlands (CLIN 2005), Amsterdam, The Netherlands, December 16, 2005.
Taxonomic names are of vital importance in biology: they are the classifiers of biological objects such as animals and plants. A database describing biological entities can contain errors such as a name entered at the wrong level (e.g. the name of a SPECIES in a SUBSPECIES field) or a field left blank because its value is obvious to biologists.
The aims of this experiment are to find out (i) whether taxonomic labels in an animal specimen database can be predicted from its textual fields, including the other taxonomic fields, and (ii) whether this method can be used to correct errors automatically. A generic approach to automatically correcting errors in databases would be a practically useful NLP application, given the variety of semi-textual databases that document objects, such as the Dutch database of zoological specimens, the Netherlands Soortenregister (http://www.nederlandsesoorten.nl) or the Internet Movie Database (http://www.imdb.com). In some domains, such as the biological domain, databases are large, rich in information and created over a long time span, which makes it hard to maintain high accuracy and consistency. Given the size of these databases, empty or incorrect fields cannot always be identified and filled or corrected manually; we therefore propose an automatic method based on memory-based learning (MBL) as a general approach to this problem.
We applied MBL to a database containing 16,870 records that document reptile and amphibian specimens. For each level in the taxonomy (i.e. KLASSE, ORDE, FAMILIE, GENUS, SPECIES and SUBSPECIES) a classifier was trained to predict the field's value from the other taxonomic fields and the author field. The author field can be a useful cue for predicting the higher levels of the hierarchy because certain taxonomists work primarily on, for example, turtles and others on frogs.
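The per-field prediction described above can be sketched as a nearest-neighbour lookup, which is the core idea of memory-based learning. This is a minimal illustration only: the abstract does not state the MBL configuration, so the plain overlap similarity, the value k=1, the function names and the four toy records are all assumptions made for the example.

```python
from collections import Counter

# Field names follow the abstract; AUTHOR stands in for the author field.
FIELDS = ["KLASSE", "ORDE", "FAMILIE", "GENUS", "SPECIES", "AUTHOR"]

def overlap(a, b, features):
    """Count the features on which two records share a non-empty value."""
    return sum(1 for f in features if a.get(f) and a.get(f) == b.get(f))

def predict(record, memory, target, k=1):
    """Predict `target` by majority vote among the k most similar stored records."""
    features = [f for f in FIELDS if f != target]  # the target itself is never a feature
    ranked = sorted(memory, key=lambda m: overlap(record, m, features), reverse=True)
    votes = Counter(m[target] for m in ranked[:k] if m.get(target))
    return votes.most_common(1)[0][0] if votes else None

# Toy memory (illustrative records, not taken from the database in the abstract).
memory = [
    {"KLASSE": "Reptilia", "ORDE": "Testudines", "FAMILIE": "Testudinidae",
     "GENUS": "Testudo", "SPECIES": "graeca", "AUTHOR": "Linnaeus"},
    {"KLASSE": "Amphibia", "ORDE": "Anura", "FAMILIE": "Ranidae",
     "GENUS": "Rana", "SPECIES": "temporaria", "AUTHOR": "Linnaeus"},
    {"KLASSE": "Amphibia", "ORDE": "Anura", "FAMILIE": "Bufonidae",
     "GENUS": "Bufo", "SPECIES": "bufo", "AUTHOR": "Linnaeus"},
    {"KLASSE": "Amphibia", "ORDE": "Anura", "FAMILIE": "Ranidae",
     "GENUS": "Rana", "SPECIES": "arvalis", "AUTHOR": "Nilsson"},
]

# A record whose ORDE field was left blank.
query = {"KLASSE": "Reptilia", "ORDE": "", "FAMILIE": "Testudinidae",
         "GENUS": "Testudo", "SPECIES": "hermanni", "AUTHOR": "Gmelin"}

predicted_orde = predict(query, memory, "ORDE")
```

In this sketch one call to `predict` per target field plays the role of the six separate classifiers; a real memory-based learner would typically also weight features by how informative they are.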
The baseline accuracy of always predicting the majority value falls roughly linearly as the predicted field becomes more specific, ranging from 54.98% for KLASSE to 7.67% for SPECIES. The machine learning algorithm performed significantly better: the highest accuracy is 99.87% for KLASSE and the lowest 89.93% for SPECIES. Automatic prediction of taxonomic values in this database is therefore possible. To check whether this approach can be used to correct errors in the database, precision was calculated by manually analysing the disagreements between the original database values and the machine learning predictions for the four highest levels in the taxonomic hierarchy. This analysis showed that in 38% of the disagreements for ORDE the machine learning algorithm had predicted a correct value for an incorrect database entry. Another 19% of the disagreements between the database and the classifier concerned synonyms. To gauge the recall of this error-correction method, a small experiment was carried out in which 10% of the database fields in the test sets were given an incorrect value, MBL was run and the number of corrected errors was counted. The lowest recall was 93.09% for the GENUS field and the highest was 96.82% for the ORDE field, from which we may conclude that memory-based learning is well suited to finding and correcting taxonomic errors in a database.
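The recall experiment can be sketched as follows: corrupt a fixed share of one field, run a predictor that ignores that field, and count how many corrupted values are restored. The helper names, the random seed, the toy records and the genus-to-family lookup used as a stand-in predictor are hypothetical; the abstract does not describe how the incorrect values were chosen.

```python
import random

def corrupt(records, target, rate, seed=0):
    """Copy `records`, replacing a share `rate` of the `target` values with wrong ones.

    Assumes the target field has at least two distinct values.
    Returns the corrupted copies and the set of corrupted indices.
    """
    rng = random.Random(seed)
    values = sorted({r[target] for r in records})
    chosen = set(rng.sample(range(len(records)), round(rate * len(records))))
    corrupted = []
    for i, rec in enumerate(records):
        rec = dict(rec)  # leave the original record untouched
        if i in chosen:
            rec[target] = rng.choice([v for v in values if v != rec[target]])
        corrupted.append(rec)
    return corrupted, chosen

def recall(original, corrupted, chosen, predict_fn, target):
    """Share of the injected errors for which the prediction restores the original value."""
    if not chosen:
        return 0.0
    fixed = sum(1 for i in chosen
                if predict_fn(corrupted[i], target) == original[i][target])
    return fixed / len(chosen)

# Toy test set (illustrative): ten records over two genera.
original = ([{"GENUS": "Testudo", "FAMILIE": "Testudinidae"}] * 5
            + [{"GENUS": "Rana", "FAMILIE": "Ranidae"}] * 5)

# Stand-in predictor: looks the family up from the genus and ignores the stored family.
LOOKUP = {"Testudo": "Testudinidae", "Rana": "Ranidae"}

def lookup_predict(record, target):
    return LOOKUP[record["GENUS"]]

corrupted, chosen = corrupt(original, "FAMILIE", rate=0.10)
score = recall(original, corrupted, chosen, lookup_predict, "FAMILIE")
```

Because the stand-in predictor never reads the corrupted field, it restores every injected error in this toy set; a learned classifier would be plugged in through the same `predict_fn` argument.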