Information extraction

In the context of the ROLAQUAD project (an NWO IMIX programme), I am working on domain-specific information extraction. The application domain of the project is general (i.e. non-expert) medical information written in Dutch. Within the project, machine learning components are developed for analysing unstructured texts on multiple levels of semantic annotation. These levels are:

token level: Domain-specific named entity recognition; for example, identifying occurrences of illnesses or treatments.
sentence level: Detecting and identifying sentence topics, roughly corresponding to domain-specific relations; for example, whether a sentence gives information on the cause or prevention of an illness, or on the side-effects of a treatment.
section level: High-level classification of the topic(s) of a document section. Possible section topics are those commonly found in medical encyclopedias, such as definition, cause, and treatment.

One of the aims of the ROLAQUAD project is to exploit (partially) redundant information stored on these levels of semantic annotation. For example, in the correct annotation, the sentence below is assigned the sentence topic causes and the word influenza is labelled as referring to an illness.

Influenza is a contagious infection of the airways caused by the influenza virus.
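
To make the three annotation levels concrete, the annotation of this sentence could be represented roughly as follows. This is only an illustrative sketch: the class names (Token, Sentence, Section), the "O" label for non-entities, and the helper entity_mentions are hypothetical and not taken from the ROLAQUAD software. Only the first occurrence of the word is labelled here, which is an assumption, and the section topic is left empty because the text does not specify one.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Token:
    """Token level: a word plus its (hypothetical) entity label."""
    text: str
    entity: str = "O"  # "O" marks a token that is not part of any entity


@dataclass
class Sentence:
    """Sentence level: tokens plus the set of sentence topics."""
    tokens: List[Token]
    topics: Set[str] = field(default_factory=set)


@dataclass
class Section:
    """Section level: sentences plus the set of section topics."""
    sentences: List[Sentence]
    topics: Set[str] = field(default_factory=set)


def entity_mentions(sentence: Sentence, label: str) -> List[str]:
    """Return the text of all tokens in the sentence carrying the given label."""
    return [tok.text for tok in sentence.tokens if tok.entity == label]


words = ("Influenza is a contagious infection of the airways "
         "caused by the influenza virus").split()

# Assumption: only the sentence-initial occurrence is labelled as an illness.
tokens = [Token(w, "illness" if i == 0 else "O") for i, w in enumerate(words)]

sentence = Sentence(tokens=tokens, topics={"causes"})

# The section topic is not given in the text, so it is left empty.
section = Section(sentences=[sentence])

print(entity_mentions(sentence, "illness"), sentence.topics)
```

The token-level label and the sentence-level topic are stored side by side, which is what allows one level to be consulted when predicting the other.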

The presence of an illness in the sentence may be highly informative for determining whether the sentence expresses a causes relation. Likewise, knowing that a sentence has causes as one of its topics strongly suggests the presence of at least one illness in the sentence. Effectively exploiting this redundancy for classification may require architectures more complex than a sequential pipeline, in which the semantic annotation levels are predicted one after the other. Developing architectures optimised for predicting multiple interacting levels of information is central to the ROLAQUAD project.
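
The difference between a sequential pipeline and an architecture with interacting levels can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the word lists, the rule-based "classifiers", and the function names are hypothetical stand-ins for the machine learning components, and the iterative scheme shown is just one simple way of letting the levels inform each other, not the architecture developed in the project.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical toy resources standing in for trained classifiers.
STRONG_ILLNESS = {"influenza"}   # words always labelled as an illness
WEAK_ILLNESS = {"infection"}     # words labelled only with topic support
CAUSE_CUES = {"caused"}          # cue words for the "causes" topic


def tag_entities(tokens: List[str],
                 topics: FrozenSet[str] = frozenset()) -> Tuple[str, ...]:
    """Token level: label each token, optionally using known sentence topics."""
    labels = []
    for tok in tokens:
        word = tok.lower()
        if word in STRONG_ILLNESS:
            labels.append("illness")
        elif word in WEAK_ILLNESS and "causes" in topics:
            # Weak evidence is accepted only when the sentence topic backs it up.
            labels.append("illness")
        else:
            labels.append("O")
    return tuple(labels)


def classify_topics(tokens: List[str],
                    labels: Tuple[str, ...]) -> FrozenSet[str]:
    """Sentence level: assign "causes" if a cue word and an illness co-occur."""
    has_cue = any(tok.lower() in CAUSE_CUES for tok in tokens)
    has_illness = "illness" in labels
    return frozenset({"causes"}) if has_cue and has_illness else frozenset()


def pipeline(tokens: List[str]):
    """Sequential pipeline: entities first, then topics, no feedback."""
    labels = tag_entities(tokens)
    return labels, classify_topics(tokens, labels)


def iterative(tokens: List[str], max_rounds: int = 5):
    """Re-predict each level from the other until nothing changes."""
    labels, topics = pipeline(tokens)
    for _ in range(max_rounds):
        new_labels = tag_entities(tokens, topics)
        new_topics = classify_topics(tokens, new_labels)
        if (new_labels, new_topics) == (labels, topics):
            break
        labels, topics = new_labels, new_topics
    return labels, topics


tokens = ("Influenza is a contagious infection of the airways "
          "caused by the influenza virus").split()

pipeline_labels, pipeline_topics = pipeline(tokens)
iter_labels, iter_topics = iterative(tokens)
print(list(zip(tokens, pipeline_labels)))
print(list(zip(tokens, iter_labels)))
```

In this toy setting both variants assign the causes topic, but only the iterative variant can use that topic to accept the weaker token-level evidence, since the pipeline has already fixed its entity labels before the topic is known.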

Related publications

Sander Canisius and Caroline Sporleder (2007)
Bootstrapping Information Extraction from Field Books
In Proceedings of the 2007 Joint Conference on Empirical
Methods in Natural Language Processing and Computational
Natural Language Learning (EMNLP-CoNLL), Prague, Czech
Republic.

Sander Canisius and Caroline Sporleder (2007)
Learning to Segment and Label Semi-Structured Documents with
Little or No Supervision
In Proceedings of Benelearn 2007, Amsterdam, The Netherlands.

Sander Canisius, Antal van den Bosch, and Walter Daelemans (2006)
Discrete versus Probabilistic Sequence Classifiers for
Domain-specific Entity Chunking
In Proceedings of BNAIC 2006, Namur, Belgium.