Correcting 'Wrong-Column' Errors in Text Databases.
Author(s): Caroline Sporleder, Marieke van Erp, Tijn Porcelijn and Antal van den Bosch
Reference: Proceedings of the Annual Machine Learning Conference of Belgium and The Netherlands (Benelearn-06), pp. 49-56, Ghent, Belgium, 2006.
We present a novel data-driven approach for detecting and correcting
errors in text databases. We focus on information that was
accidentally entered in an incorrect column. Unlike
machine-learning approaches to data cleaning that assume
database cells contain atomic or numeric content, our method
takes into account substrings of textual cells and treats error
detection and correction as a text categorisation task. Errors are
detected at points where the classifier disagrees with the data;
corrections are the suggestions put forward by the classifier. We
demonstrate that the method is suited for high-recall detection of
errors in free-text columns of a zoological database and that it
also achieves high correction accuracy.
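
The detect-by-disagreement idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name a classifier, so a small multinomial naive Bayes model over word tokens is swapped in here, and the column names ("location", "remarks"), the toy rows, and the helper names (ColumnClassifier, find_wrong_column) are all hypothetical. The sketch also assumes a separate set of rows that is presumed clean for training.

```python
import math
from collections import Counter, defaultdict


def tokens(text):
    """Split a cell into lowercase word tokens (the substrings used as features)."""
    return text.lower().split()


class ColumnClassifier:
    """Multinomial naive Bayes that predicts which column a piece of text belongs to.

    Hypothetical stand-in: the abstract does not specify the classifier.
    """

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # column -> token frequencies
        self.cell_counts = Counter()             # column -> number of training cells
        self.vocab = set()

    def train(self, rows):
        for row in rows:
            for column, text in row.items():
                if not text:
                    continue
                self.cell_counts[column] += 1
                for tok in tokens(text):
                    self.word_counts[column][tok] += 1
                    self.vocab.add(tok)

    def predict(self, text):
        total_cells = sum(self.cell_counts.values())
        best_column, best_score = None, -math.inf
        for column in sorted(self.cell_counts):
            # Log prior plus add-one smoothed log likelihood of each token.
            score = math.log(self.cell_counts[column] / total_cells)
            denom = sum(self.word_counts[column].values()) + len(self.vocab)
            for tok in tokens(text):
                score += math.log((self.word_counts[column][tok] + 1) / denom)
            if score > best_score:
                best_column, best_score = column, score
        return best_column


def find_wrong_column(rows, clf):
    """Flag cells where the classifier disagrees with the column the text sits in.

    Returns (row index, current column, suggested column, text) tuples;
    the suggested column is the proposed correction.
    """
    flagged = []
    for i, row in enumerate(rows):
        for column, text in row.items():
            if not text:
                continue
            predicted = clf.predict(text)
            if predicted != column:
                flagged.append((i, column, predicted, text))
    return flagged


# Toy data, invented for illustration only.
train_rows = [
    {"location": "near river bank Suriname", "remarks": "specimen damaged tail missing"},
    {"location": "forest edge Suriname", "remarks": "juvenile specimen preserved in alcohol"},
    {"location": "river mouth Brazil", "remarks": "tail damaged label missing"},
]
suspect_rows = [
    {"location": "specimen preserved in alcohol", "remarks": ""},
    {"location": "river bank Brazil", "remarks": "juvenile tail missing"},
]

clf = ColumnClassifier()
clf.train(train_rows)
flagged = find_wrong_column(suspect_rows, clf)
for row_index, current, suggested, text in flagged:
    print(f"row {row_index}: '{text}' is in '{current}', suggested column '{suggested}'")
```

On this toy data, only the first suspect row is flagged: its "location" cell contains remark-like text, and the classifier's prediction doubles as the suggested correction. A realistic setting would need far more data and some way to train on the database itself despite the errors it contains.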