Correcting 'Wrong-Column' Errors in Text Databases.

Author(s): Caroline Sporleder, Marieke van Erp, Tijn Porcelijn and Antal van den Bosch

Reference: Proceedings of the Annual Machine Learning Conference of Belgium and The Netherlands (Benelearn-06), pp. 49-56, Ghent, Belgium, 2006.

Abstract: We present a novel data-driven approach for detecting and correcting errors in text databases. We focus on information that was accidentally entered in an incorrect column. Unlike machine-learning approaches to data cleaning that assume the database cells to contain atomic or numeric content, our method takes into account substrings of textual cells, and treats error detection and correction as a text categorisation task. Errors are detected at points where the classifier disagrees with the data; corrections are the suggestions put forward by the classifier. We demonstrate that the method is suited for high-recall detection of errors in free-text columns of a zoological database, with a high correction accuracy as well.
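The detection-and-correction idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name a classifier or feature set, so a small bag-of-words naive Bayes model with add-one smoothing stands in here. The column names (`habitat`, `collector`), the toy rows, and the function names are all invented for illustration. The sketch learns which words are typical of each column, predicts a column for every non-empty cell, and flags cells where the prediction disagrees with the column the text actually sits in; the predicted column doubles as the suggested correction.

```python
import math
from collections import Counter, defaultdict

def train(rows):
    """Count tokens per column; rows are dicts of column name -> cell text."""
    token_counts = defaultdict(Counter)
    cell_counts = Counter()
    vocab = set()
    for row in rows:
        for col, text in row.items():
            tokens = text.lower().split()
            if not tokens:
                continue  # empty cells carry no evidence
            cell_counts[col] += 1
            token_counts[col].update(tokens)
            vocab.update(tokens)
    return token_counts, cell_counts, vocab

def predict(model, text):
    """Return the most probable column for a cell (naive Bayes, add-one smoothing)."""
    token_counts, cell_counts, vocab = model
    total_cells = sum(cell_counts.values())
    best_col, best_score = None, -math.inf
    for col in cell_counts:
        score = math.log(cell_counts[col] / total_cells)
        denom = sum(token_counts[col].values()) + len(vocab)
        for tok in text.lower().split():
            score += math.log((token_counts[col][tok] + 1) / denom)
        if score > best_score:
            best_col, best_score = col, score
    return best_col

def flag_wrong_columns(model, rows):
    """List (row index, actual column, suggested column, text) where the classifier disagrees."""
    flagged = []
    for i, row in enumerate(rows):
        for col, text in row.items():
            if not text.strip():
                continue
            guess = predict(model, text)
            if guess != col:
                flagged.append((i, col, guess, text))
    return flagged

# Toy training data (invented, assumed to be mostly clean).
training_rows = [
    {"habitat": "under stones near river", "collector": "J. Smith"},
    {"habitat": "in wet forest near stream", "collector": "A. Jones"},
    {"habitat": "on tree trunk in forest", "collector": "J. Jones"},
    {"habitat": "under log near river bank", "collector": "A. Smith"},
]
model = train(training_rows)

# The second row has a habitat description typed into the collector column.
suspect_rows = [
    {"habitat": "in forest near river", "collector": "J. Smith"},
    {"habitat": "", "collector": "under stones near stream"},
]
print(flag_wrong_columns(model, suspect_rows))
# prints [(1, 'collector', 'habitat', 'under stones near stream')]
```

In a real database the training rows would themselves contain errors, so some form of held-out prediction (for example cross-validation over the rows) would likely be needed; that step is omitted here for brevity.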
