Program | Abstracts


Eduard Hovy (Information Sciences Institute of the University of Southern California, Marina del Rey, CA, U.S.A.)

Learning by Reading: From Information Extraction to Machine Reading

Creating computer systems that educate themselves by reading text was one of the original dreams of Artificial Intelligence. Researchers in Natural Language Processing (NLP) have made initial steps in this direction, especially with Information Extraction and Text Mining, which derive information from large sets of data. Can one, however, build a system that learns by reading just one, or a small number, of texts about a given topic?

Starting in 2002, three research groups in an experiment called Project Halo manually converted the information in one chapter of a high school chemistry textbook into knowledge representation statements, and then had a knowledge representation system take the US high school standardized (AP) exam. Surprisingly, all three systems passed, albeit not very well. Could one do the same, automatically? In late 2005, DARPA funded several small pilot projects in NLP, Knowledge Representation and Reasoning (KR&R), and Cognitive Science to take up this challenge, which grew into Project Möbius, a collaboration of SRI, USC/ISI, University of Texas Austin, Boeing, and BBN Inc. The Möbius prototype learning‐by‐reading system read paragraph‐length Wikipedia‐level texts about the human heart and about engines, built up enough knowledge to draw inferences, generate its own further reading requests, and answer unseen questions. Results were encouraging. In 2009, DARPA funded a new 5‐year program called Machine Reading, which funds three large teams that include many of the top NLP and KR&R research scientists in the USA.

This talk describes the Machine Reading program and provides details about one of the three teams, RACR, which is led by IBM's IE / QA team and includes researchers at USC/ISI, University of Texas Austin, CMU, and the University of Utah. The system contains several reading engines that are being composed into a single large framework, with access to a cluster of several thousand computers for large‐scale experiments. The reading engines include traditional Information Extraction engines, parsers, converters to various logical form representations, abstract semantic models of world knowledge, and various kinds of abductive and other reasoning engines. I will focus on how large repositories of background knowledge are used to support reading and inference, and describe the experiments currently being done.

Piek Vossen (Computational Lexicology and Terminology Lab, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands)

KYOTO: A Community Platform for Knowledge Modeling and Text Mining

The Asian-European project KYOTO develops an open system for knowledge modeling and text mining. KYOTO operates in two cycles. First, we derive a domain model from text by learning terms and term relations. Terms are automatically mapped to wordnets, which are anchored to a central ontology. Next, the domain model is used to extract events and facts from text through a process of incremental annotation of semantic layers. These layers are extracted through simple profiles, each of which can take any previous step as input and generate the next layer as output. The KYOTO system uses an open text representation format and a central ontology to enable extraction of knowledge and facts from large volumes of text in many different languages. We implemented a semantic tagging approach that performs off-line reasoning. Mining of facts and knowledge is achieved through a flexible pattern matching module that works in much the same way across languages, can efficiently handle large volumes of documents, and is not restricted to a specific domain. We applied the system to an English database on estuaries.
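The incremental layering described above can be sketched as a chain of simple profile functions, each consuming the layers produced so far and adding one new layer. All names, the toy lexicon, and the layer contents here are illustrative assumptions, not the actual KYOTO API:

```python
# Sketch of incremental semantic-layer annotation: each "profile" takes
# the text plus all previously produced layers and adds one new layer.
# The profiles, lexicon, and ontology identifiers are illustrative only.

def tokenize(text, layers):
    return {"tokens": text.split()}

def map_terms(text, layers):
    # Toy term lexicon standing in for wordnet / ontology anchoring.
    lexicon = {"estuary": "ont:Estuary", "salinity": "ont:Salinity"}
    return {"terms": {t: lexicon[t.lower().strip(".,")]
                      for t in layers["tokens"]
                      if t.lower().strip(".,") in lexicon}}

def extract_facts(text, layers):
    # A trivially simple profile: every anchored term yields a fact.
    return {"facts": [f"mention({c})" for c in layers["terms"].values()]}

def annotate(text, profiles):
    layers = {}
    for profile in profiles:
        layers.update(profile(text, layers))
    return layers

result = annotate("Salinity varies across the estuary.",
                  [tokenize, map_terms, extract_facts])
print(result["facts"])  # ['mention(ont:Salinity)', 'mention(ont:Estuary)']
```

Because every profile sees all earlier layers, new extraction steps can be appended without changing the existing ones, which is the property the abstract highlights.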


Marieke van Erp (Dept. of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands)

Accessing Natural History: Discoveries in Data Cleaning, Structuring, and Retrieval

Cultural heritage institutions harbour a vast treasure of information. However, this treasure is often confined to the walls of the archive, museum, or library. This thesis is about improving access to cultural heritage collections through digitisation and enrichment. In this thesis, three themes that improve information access in a digital information collection from Naturalis, the Dutch National Museum for Natural History, were investigated: data cleaning, information structuring, and object retrieval.

Two methods for automatic cleanup of databases are presented: a data-driven and a knowledge-driven method. Both methods detect a large number of inconsistencies in the data, but the experiments show that they also detect different types of errors and are thus complementary.
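The complementarity of the two approaches can be illustrated with a toy example: a data-driven check flags statistically rare values within a column, while a knowledge-driven check flags values that contradict an external resource. The records, field names, and rules below are hypothetical, not the thesis' actual data or methods:

```python
from collections import Counter

# Toy specimen records; ids 2 and 3 each contain a different error type.
records = [
    {"id": 1, "genus": "Conus", "country": "Indonesia"},
    {"id": 2, "genus": "Conus", "country": "Indonsia"},   # spelling error
    {"id": 3, "genus": "conus", "country": "Indonesia"},  # casing error
]

def data_driven_check(records):
    # Flag records whose genus value occurs only once (outlier heuristic).
    counts = Counter(r["genus"] for r in records)
    return {r["id"] for r in records if counts[r["genus"]] == 1}

def knowledge_driven_check(records, gazetteer=frozenset({"Indonesia"})):
    # Flag country values absent from an external gazetteer.
    return {r["id"] for r in records if r["country"] not in gazetteer}

dd = data_driven_check(records)       # catches the casing outlier: {3}
kd = knowledge_driven_check(records)  # catches the misspelling: {2}
print(dd | kd)                        # {2, 3}
```

Each check misses the error the other finds, so the union of their flags covers more of the data, mirroring the complementarity reported in the abstract.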

Next, an automatic ontology construction method is presented. This method makes implicit domain information present in the database from Naturalis explicit by linking it to the online encyclopaedia Wikipedia.
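A minimal sketch of such linking, assuming a simple normalised-title lookup (the titles and the matching strategy are illustrative stand-ins, not the method the thesis actually uses):

```python
# Sketch of linking database terms to encyclopedia articles via
# normalised title matching. The article inventory is hypothetical.

wikipedia_titles = ["Cone snail", "Indonesia", "Estuary"]

def normalise(term):
    return term.strip().lower()

# Index from normalised form back to the canonical article title.
index = {normalise(t): t for t in wikipedia_titles}

def link(term):
    return index.get(normalise(term))  # None if no article matches

print(link("cone snail"))  # Cone snail
print(link("Unknownia"))   # None
```

Each successful link makes a piece of implicit domain knowledge in the database explicit by tying it to an external, richly described resource.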

Finally, Mira, a system for data retrieval, is presented, which uses three different types of domain knowledge in three different stages of the retrieval process. First, knowledge from external resources and rules is used to interpret queries and formulate more precise ones. Then, the same types of knowledge are used to expand queries with synonyms to increase recall. To rank results by relevance, knowledge from the domain ontologies and query analysis is used. Mira provides a significant improvement in data access, as it decreases the number of unanswered queries.
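The expansion and ranking stages can be sketched as follows; the synonym table and the overlap-based score are illustrative assumptions standing in for the domain resources the thesis actually uses:

```python
# Sketch of synonym-based query expansion plus knowledge-based ranking.
# The synonym table and scoring function are hypothetical.

synonyms = {"shell": ["conch", "exoskeleton"], "snail": ["gastropod"]}

def expand(query_terms):
    # Add known synonyms to the query to increase recall.
    expanded = list(query_terms)
    for t in query_terms:
        expanded += synonyms.get(t, [])
    return expanded

def rank(documents, query_terms):
    # Score each document by how many (expanded) query terms it contains.
    terms = set(expand(query_terms))
    return sorted(documents,
                  key=lambda d: len(terms & set(d.lower().split())),
                  reverse=True)

docs = ["a conch found on the beach", "field notes on weather"]
print(rank(docs, ["shell"]))
```

Without expansion the query "shell" matches neither document; with the synonym "conch" added, the first document is retrieved and ranked highest, which is the recall gain the abstract describes.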

