Bootstrapping Information Extraction from Field Books
Author(s): Sander Canisius and Caroline Sporleder
Reference: Presented at CLIN 17 - Computational Linguistics in the Netherlands, Leuven, Belgium, January 12, 2007.
Zoological field books are semi-structured texts consisting of a number of records, each describing an animal specimen and the circumstances of its collection, e.g., where and when it was collected and the biotope in which it was found. These books are a potentially very valuable resource for researchers in the field, especially if they can be queried systematically. However, systematic querying usually presupposes that the data is explicitly structured, e.g., represented in a database or at least annotated with meta-information indicating which parts of a record contain which type of information.
In this talk, we present a method for extracting information from field book records. We model the task as a sequence labelling problem and propose a bootstrapping approach that does not require manually annotated field book entries for training. Instead, we use information from a specimen database to model the contents of the relevant text segments, and apply unsupervised learning to construct the sequential model required for extracting those segments from actual field book entries.
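
The general idea can be illustrated with a minimal sketch. The abstract does not name the models used, so everything below is an assumption made for illustration: the content models are smoothed unigram models estimated from database column values, the sequential model is a hidden Markov model, and its transitions are learned from unlabelled records by Viterbi training (hard EM). The database contents, the field labels, the records, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical specimen-database columns: field label -> example values.
DATABASE = {
    "SPECIES": ["rana temporaria", "bufo bufo", "hyla arborea"],
    "PLACE": ["leiden", "near utrecht", "amsterdam"],
    "DATE": ["12 may 1923", "3 june 1931", "27 may 1925"],
}
LABELS = sorted(DATABASE)

def tokenise(text):
    """Lowercase, split on whitespace, and map digit strings to one symbol."""
    return ["<num>" if t.isdigit() else t.lower() for t in text.split()]

def train_emissions(database, alpha=0.1):
    """Content models: one smoothed unigram model per database field."""
    vocab = {t for vals in database.values() for v in vals for t in tokenise(v)}
    models = {}
    for label, vals in database.items():
        counts = Counter(t for v in vals for t in tokenise(v))
        # One extra vocabulary slot is reserved for unseen tokens.
        total = sum(counts.values()) + alpha * (len(vocab) + 1)
        models[label] = (counts, total, alpha)
    return models

def emit_logp(models, label, token):
    counts, total, alpha = models[label]
    return math.log((counts[token] + alpha) / total)

def viterbi(tokens, models, trans, start):
    """Most likely label sequence under the current parameters."""
    scores = [{l: start[l] + emit_logp(models, l, tokens[0]) for l in LABELS}]
    back = []
    for tok in tokens[1:]:
        prev, col, ptr = scores[-1], {}, {}
        for l in LABELS:
            best = max(LABELS, key=lambda p: prev[p] + trans[p][l])
            col[l] = prev[best] + trans[best][l] + emit_logp(models, l, tok)
            ptr[l] = best
        scores.append(col)
        back.append(ptr)
    label = max(LABELS, key=lambda l: scores[-1][l])
    path = [label]
    for ptr in reversed(back):
        label = ptr[label]
        path.append(label)
    return path[::-1]

def learn_transitions(records, models, iterations=5, alpha=0.5):
    """Viterbi training: decode unlabelled records, recount, repeat."""
    n = len(LABELS)
    uniform = math.log(1.0 / n)
    start = {l: uniform for l in LABELS}
    trans = {p: {l: uniform for l in LABELS} for p in LABELS}
    for _ in range(iterations):
        s_counts, t_counts = Counter(), defaultdict(Counter)
        for tokens in records:
            path = viterbi(tokens, models, trans, start)
            s_counts[path[0]] += 1
            for p, l in zip(path, path[1:]):
                t_counts[p][l] += 1
        start = {l: math.log((s_counts[l] + alpha) / (len(records) + alpha * n))
                 for l in LABELS}
        trans = {p: {l: math.log((t_counts[p][l] + alpha)
                                 / (sum(t_counts[p].values()) + alpha * n))
                     for l in LABELS} for p in LABELS}
    return start, trans

# Hypothetical unlabelled field book records; no manual annotation is used.
RECORDS = [tokenise(r) for r in [
    "Rana temporaria Leiden 14 May 1927",
    "Bufo bufo near Utrecht 2 June 1930",
    "Hyla arborea Amsterdam 9 May 1929",
]]

models = train_emissions(DATABASE)
start, trans = learn_transitions(RECORDS, models)
print(viterbi(tokenise("Bufo bufo Leiden 3 June 1925"), models, trans, start))
```

In this sketch the database supplies the only supervision: it tells the model what a species name, a place, or a date tends to look like, while the order in which these fields typically occur in a record is estimated from the unlabelled records themselves.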