Data correction through imputation
Timpute is a Perl package based on the TiMBL software that
corrects the contents of each cell in a database based on the
rest of the database. Timpute is essentially a wrapper that processes
the database and passes it piece by piece to TiMBL, whose output is
then parsed back into a csv file.
Features
- Accepts csv databases
- Auto-detection of zero or maximal entropy features
- Choice between the generation of corrected columns vs. "arrogant" feature value replacement
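To illustrate what zero and maximal entropy mean for a feature, here is a minimal shell sketch. It is not Timpute's own detection code. It assumes that a zero-entropy feature is a column holding a single value in every row, and that a maximal-entropy feature is a column in which every row holds a different value. The file name entropy_demo.csv and the function column_entropy_class are hypothetical names chosen for this illustration.

```shell
# Hypothetical demo file (not part of the Timpute package).
cat > entropy_demo.csv <<'EOF'
"id","class","colour"
"1","reptile","green"
"2","reptile","brown"
"3","reptile","green"
EOF

# Classify column COL (1-based) of a simple quoted csv file.
# Assumption: no commas inside cells; the header row is skipped.
column_entropy_class() {
    awk -F',' -v c="$2" '
        NR > 1 { rows++; if (!($c in seen)) { seen[$c] = 1; distinct++ } }
        END {
            if (rows == 0)             print "empty"
            else if (distinct == 1)    print "zero"
            else if (distinct == rows) print "maximal"
            else                       print "informative"
        }' "$1"
}

for col in 1 2 3; do
    echo "column $col: $(column_entropy_class entropy_demo.csv "$col")"
done
```

On the demo file this reports the id column as maximal, the class column as zero, and the colour column as informative. Columns of the first two kinds carry no usable signal for predicting other cells, which is presumably why they are worth detecting.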
Timpute is free software; you can
redistribute it and/or modify it under the terms of the GNU General Public
License as published by the Free
Software Foundation.
About
Timpute is being developed as part of the MITCH (Mining for
Information in Texts from Cultural Heritage) project, within ongoing
research into automated cleaning and enrichment of textual
databases. Timpute was originally conceived by Antal van den Bosch and Caroline
Sporleder, and is programmed by Steve Hunt. The MITCH project is funded by NWO, the Netherlands Organisation for
Scientific Research, as part of the CATCH programme.
You can read about Timpute in the following paper:
Download and installation
To install, please follow these basic instructions:
- Timpute requires an installed copy of TiMBL
version 6.1 (preferably 6.1.2).
- The tarball unpacks ('tar zxvf timpute-0.3.tar.gz') into a
directory called 'timpute-0.3'.
The easiest way to invoke Timpute is to run it with default
settings and have it replace every cell in every column with its
corrections. The required file format at this time is comma-separated
values, with the first row containing column names, and every cell
enclosed in double quotes. To run Timpute on the sample file
included, invoke
./timpute.pl -f reptile.csv -o reptile_timputed.csv -p
The command above specifies reptile.csv as the input file and reptile_timputed.csv as the output file, which will contain the data with the cells that Timpute changed. The -p option specifies that Timpute should replace the contents of a cell if Timpute disagrees with the original value.
More options are listed by typing ./timpute.pl --help. See also the following files included in the package:
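Before running Timpute on your own data, it can help to check that the file matches the required format. The following is a minimal sketch, not part of the Timpute package: the file names quoted_ok.csv and quoted_bad.csv and the function check_quoted_csv are hypothetical. As a simplification, it assumes that no cell contains a double quote or a line break, and it does not check that every row has the same number of cells.

```shell
# Hypothetical sample files (not part of the Timpute package).
cat > quoted_ok.csv <<'EOF'
"species","habitat","diet"
"gecko","forest","insects"
"iguana","desert","plants"
EOF

cat > quoted_bad.csv <<'EOF'
"species","habitat","diet"
gecko,"forest","insects"
EOF

# Succeeds only if every line is a comma-separated list of double-quoted cells.
# Simplification: cells containing a double quote or a line break are rejected.
check_quoted_csv() {
    [ -r "$1" ] || return 2
    # grep -v selects lines that do NOT match; any such line means a bad row.
    ! grep -Evq '^"[^"]*"(,"[^"]*")*$' "$1"
}

for f in quoted_ok.csv quoted_bad.csv; do
    if check_quoted_csv "$f"; then
        echo "$f: ok"
    else
        echo "$f: needs fixing"
    fi
done
```

The second file is rejected because the cell gecko is not enclosed in double quotes.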
This is very much a beta version and as such may contain bugs or improperly working features. Comments or bug reports are welcome at: s.j.hunt@uvt.nl