The Reptiles and Amphibians Database
- manually created archive of specimens
- 16,870 entries (rows), 47 columns
- entries provide information about specimens (e.g., taxonomic
information, circumstances of collection, storage information)
- database fields in a variety of formats (dates, numbers, free
text)
- multi-lingual, mainly Dutch
Named-Entity Tagging
Named-entity tagging refers to the task of determining whether a text
string contains named entities, such as M.S. Hoogmoed or
Surinam, and if so, determining which entity category these
belong to, e.g., person or location.
Named-entity
tagging is useful for many information extraction tasks. Textual
databases typically contain several "free text" fields, such as
Special Remarks, that could benefit from named-entity
tagging. They also frequently contain entity-specific fields, such as
Collector, which normally contains expressions of type
person. These fields can be exploited to automatically extract
lists of named-entity expressions (so-called gazetteers). These
gazetteers can then be used in a simple look-up tagger or to extract
training data for a trained tagger. We investigated how successful
such an approach is and obtained good results for both the look-up
tagger and the machine-learned tagger. We also found that a generic
named-entity tagger may be of only limited use for this type of data.
(For more
information see the LREC-06 paper.)
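The gazetteer idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the project: the records, the `FIELD_TO_CATEGORY` mapping, and the function names `build_gazetteers` and `lookup_tag` are all hypothetical, and the matching strategy (exact string match, longest entry first) is an assumption.

```python
import re

# Hypothetical records; the field names follow those mentioned above,
# the values are invented for illustration.
records = [
    {"Collector": "M.S. Hoogmoed", "Country": "Surinam",
     "Special Remarks": "Collected by M.S. Hoogmoed near the coast of Surinam."},
    {"Collector": "J. Smith", "Country": "Brazil",
     "Special Remarks": "died in captivity"},
]

# Assumed mapping from entity-specific fields to entity categories.
FIELD_TO_CATEGORY = {"Collector": "person", "Country": "location"}


def build_gazetteers(records, field_to_category):
    """Collect the values of entity-specific fields into one set per category."""
    gazetteers = {}
    for record in records:
        for field, category in field_to_category.items():
            value = record.get(field, "").strip()
            if value:
                gazetteers.setdefault(category, set()).add(value)
    return gazetteers


def lookup_tag(text, gazetteers):
    """Return (start, end, category, surface) for each gazetteer entry found in text.

    Longer entries are tried first and overlapping matches are skipped.
    """
    entries = [(entry, cat) for cat, items in gazetteers.items() for entry in items]
    entries.sort(key=lambda pair: len(pair[0]), reverse=True)
    taken = []
    matches = []
    for entry, category in entries:
        # Match the entry only when it is not embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(entry) + r"(?!\w)"
        for m in re.finditer(pattern, text):
            if any(s < m.end() and m.start() < e for s, e in taken):
                continue
            taken.append((m.start(), m.end()))
            matches.append((m.start(), m.end(), category, m.group()))
    return sorted(matches)


gazetteers = build_gazetteers(records, FIELD_TO_CATEGORY)
for match in lookup_tag(records[0]["Special Remarks"], gazetteers):
    print(match)
```

The same matches could also serve as automatically labelled training data for a trained tagger, which is the second use of the gazetteers mentioned above.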
Automatic Data Completion
Automatic data completion refers to the task of automatically
completing incomplete information. Databases frequently contain empty
cells. Not all empty cells indicate missing information; for example, a
Special Remarks field does not always have to be
filled. In many cases, however, empty cells should be
filled.
We investigated whether it is possible to fill empty fields in our
database automatically by exploiting the fact that the field values
within a record are often not independent. For example, if a field
entitled Location contains the value Tafel Mountain, it
is likely that the field Country contains the value South
Africa. We trained a classifier to predict the value of a field
given the values of the other fields in a database record. This
strategy turned out to be surprisingly successful for our database,
even for the free-text fields, with accuracies of 63% and higher.
This indicates that our data contains many interdependencies that can
be exploited in this way.
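To make the idea concrete, the following sketch predicts the value of one field from the other fields of a record. The text does not say which classifier was used, so a small naive Bayes model over (field, value) pairs stands in here as an assumption; the records and the names `train` and `predict` are likewise invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training records; values are invented for illustration.
records = [
    {"Location": "Tafel Mountain", "Country": "South Africa", "Collector": "A"},
    {"Location": "Tafel Mountain", "Country": "South Africa", "Collector": "B"},
    {"Location": "Paramaribo", "Country": "Surinam", "Collector": "A"},
    {"Location": "Paramaribo", "Country": "Surinam", "Collector": "C"},
]


def train(records, target):
    """Count how often each (field, value) pair co-occurs with each target value."""
    prior = Counter()
    cooc = defaultdict(Counter)
    vocab = set()
    for record in records:
        label = record.get(target)
        if not label:
            continue  # records with an empty target cell cannot be used for training
        prior[label] += 1
        for field, value in record.items():
            if field != target and value:
                cooc[label][(field, value)] += 1
                vocab.add((field, value))
    return prior, cooc, vocab


def predict(record, target, model):
    """Return the most probable target value given the other fields of the record."""
    prior, cooc, vocab = model
    total = sum(prior.values())
    best_label, best_score = None, -math.inf
    for label, count in prior.items():
        score = math.log(count / total)
        # Add-one smoothing; the extra +1 reserves mass for unseen pairs.
        denom = sum(cooc[label].values()) + len(vocab) + 1
        for field, value in record.items():
            if field != target and value:
                score += math.log((cooc[label][(field, value)] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(records, "Country")
incomplete = {"Location": "Tafel Mountain", "Country": "", "Collector": "D"}
print(predict(incomplete, "Country", model))
```

In this toy data the Location value alone decides the prediction; on real records the model would combine evidence from all filled fields.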
Error Detection and Correction
Real-life data is rarely error-free. Even well-maintained databases
typically contain a small number of errors. These can range from typos to
content errors (i.e., database fields which contain a wrong
value). These errors usually have a negative effect on information
retrieval; for example, typos can prevent information from being found
and retrieved.
While error detection is an active research area, most work is not
geared towards textual databases. Work on outlier
detection, for instance, typically requires that the data be numerical or
at least categorical (i.e., data values are assumed to be atomic).
However, textual databases typically contain free-text fields. The
values of these fields are relatively long text strings which should
not be treated as atoms.
Given these shortcomings of existing error detection methods, we
developed two new error detection techniques that are specifically
aimed at detecting errors in textual databases. The first method,
which we called horizontal error detection, aims to detect
fields which contain a wrong value, e.g., South America instead
of South Africa in a Country field. These errors are
detected by comparing the value of a field to the values
of other fields in a record. The second method, which we called
vertical error detection, aims at detecting whether a text
string was entered in the wrong column. For example, the string
died in captivity might occur in a Location column, but
it would be better placed in a Special Remarks column. To
detect this type of error, we recast the problem as a text
classification task, i.e., we train a classifier to predict which
column a text string should be in and signal a potential error if the
predicted column deviates from the original column.
(For more information see the ATEM-06 paper and try the demo.)
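The vertical method can be illustrated with a small text classifier that predicts the column of a string and flags a cell when the prediction differs from the column it actually sits in. This is only a sketch under stated assumptions: the classifier (a word-level multinomial naive Bayes), the training strings, and the names `train_column_classifier`, `predict_column` and `flag_vertical_errors` are hypothetical and are not taken from the ATEM-06 paper.

```python
import math
import re
from collections import Counter

# Hypothetical training strings per column, invented for illustration.
column_values = {
    "Location": ["Tafel Mountain", "near Paramaribo", "coast of Surinam",
                 "river bank near Paramaribo"],
    "Special Remarks": ["died in captivity", "specimen damaged",
                        "died in transport", "kept in captivity"],
}


def tokens(text):
    """Lower-case word tokens of a text string."""
    return re.findall(r"\w+", text.lower())


def train_column_classifier(column_values):
    """Count word frequencies per column."""
    word_counts = {col: Counter() for col in column_values}
    doc_counts = Counter()
    vocab = set()
    for col, values in column_values.items():
        for value in values:
            doc_counts[col] += 1
            for tok in tokens(value):
                word_counts[col][tok] += 1
                vocab.add(tok)
    return word_counts, doc_counts, vocab


def predict_column(text, model):
    """Return the column whose word distribution best explains the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_col, best_score = None, -math.inf
    for col, counts in word_counts.items():
        score = math.log(doc_counts[col] / total_docs)
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        for tok in tokens(text):
            score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_col, best_score = col, score
    return best_col


def flag_vertical_errors(records, model):
    """Return (row, column, value, predicted column) for every suspicious cell."""
    flagged = []
    for row, record in enumerate(records):
        for col, value in record.items():
            if not value:
                continue  # empty cells are not checked
            predicted = predict_column(value, model)
            if predicted != col:
                flagged.append((row, col, value, predicted))
    return flagged


model = train_column_classifier(column_values)
suspect = [
    {"Location": "died in captivity", "Special Remarks": ""},
    {"Location": "Tafel Mountain", "Special Remarks": "specimen damaged"},
]
print(flag_vertical_errors(suspect, model))
```

A flagged cell is only a potential error: the classifier's suggestion still has to be checked, since a string can legitimately fit more than one column.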