Research Areas



Question Answering

The Webclopedia is designed to answer questions posed in various languages, drawing its answers from multilingual text collections and/or the web.

Examples of Webclopedia in operation on the TREC corpus of 1 million documents.

In the TREC-9 QA competition, Webclopedia tied for second place with a score of 31%.

The Webclopedia interface is still under development. The architecture includes the following stages:

- The CONTEX parser parses the user's question and identifies the question operator (called the Qtarget), the desired topic, and additional specification details (called the Qargs and Qwords).
- Webclopedia then creates a query and retrieves documents from the source corpus using the MG IR engine. Several increasingly general queries can be created (using stemming, query expansion lists, etc.).
- One of several Segmenters splits the retrieved documents into topically cohesive segments.
- The Ranker module ranks the segments according to their likelihood of containing an answer.
- The sentences in the top-ranked segments are parsed by CONTEX.
- The Matcher applies several independent matching heuristics to each candidate sentence in order to pinpoint the answer(s). One set of heuristics employs general question-answer patterns that express how portions of questions and answers relate within a CONTEX parse tree. Another set computes the degree of overlap between the question tree and each candidate answer tree, taking into account the Qtarget, Qargs, and Qwords. Other heuristics implement a fallback strategy of scoring a fixed-length window of words for its contents (overlap with the important words in the question).
- Finally, the Answer module compares and rates the candidate answers' scores, deciding whether an acceptable answer (or set of answers) has been found.
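The pipeline above can be sketched in miniature. The code below is an illustrative toy, not the actual Webclopedia implementation: `segment`, `rank_segments`, and `window_score` stand in for the Segmenters, the Ranker, and the fallback word-window heuristic, and the Qword extraction is deliberately crude.

```python
import re

# Toy sketch of the Webclopedia pipeline; all names and heuristics are
# illustrative stand-ins, not the actual CONTEX/MG-based implementation.

def tokens(text):
    """Lowercase word tokens, punctuation stripped."""
    return re.findall(r"[a-z0-9]+", text.lower())

def segment(doc):
    """Split a document into topically cohesive segments (here: paragraphs)."""
    return [p for p in doc.split("\n\n") if p.strip()]

def rank_segments(segments, qwords):
    """Rank segments by overlap with the question's content words."""
    return sorted(segments, key=lambda s: len(set(tokens(s)) & qwords),
                  reverse=True)

def window_score(sentence, qwords, width=5):
    """Fallback heuristic: best overlap of a fixed-length window with qwords."""
    toks = tokens(sentence)
    return max(len(set(toks[i:i + width]) & qwords)
               for i in range(max(1, len(toks) - width + 1)))

def answer_question(question, docs):
    # Crude Qword extraction: content words longer than three characters.
    qwords = {w for w in tokens(question) if len(w) > 3}
    top = rank_segments([s for d in docs for s in segment(d)], qwords)[:3]
    sentences = [s for seg in top for s in seg.split(". ")]
    # Return the sentence the fallback heuristic scores highest.
    return max(sentences, key=lambda s: window_score(s, qwords))
```

A real system would substitute parse-tree matching for the bag-of-words overlap used here, as the Matcher description above makes clear.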

The Webclopedia is based on the theory that questions fall into a natural typology, based on the semantics of their desired answers. That is, "Who discovered America?", "What is the name of the person who discovered America?", and "What was the discoverer of America called?" are all essentially the same question, and all require a Named-Person as an answer. In contrast, "Who was Columbus?" is a different type of question (one we call Why-Famous), and requires a different type of answer. We have developed a typology of over 100 question-answer types and a corresponding set of over 500 answer patterns, based on an analysis of several thousand questions.
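The idea can be illustrated with a few pattern rules. These rules and type names are invented for illustration; the actual QA Typology contains over 100 types and 500 patterns derived from parse trees, not surface regular expressions.

```python
import re

# Illustrative surface rules (not the actual QA Typology): map question
# forms to the semantic type of the expected answer (the Qtarget).
# Order matters: the more specific Named-Person forms come first.
QTARGET_RULES = [
    (r"^who (discovered|invented|wrote|painted)\b", "NAMED-PERSON"),
    (r"^(what is the name of the person|what was .* called)", "NAMED-PERSON"),
    (r"^who (is|was) [a-z]", "WHY-FAMOUS"),
    (r"^where\b", "LOCATION"),
    (r"^when\b", "DATE"),
]

def qtarget(question):
    """Return the answer type a question calls for, or UNKNOWN."""
    for pattern, qtype in QTARGET_RULES:
        if re.search(pattern, question, re.IGNORECASE):
            return qtype
    return "UNKNOWN"
```

Under these toy rules, the three Named-Person paraphrases above all map to the same Qtarget, while "Who was Columbus?" maps to Why-Famous.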

The CONTEX parser is used to parse both questions and candidate answers. The question parse yields a list of semantic types of the likely answer(s), the Qtargets, as defined in the QA Typology, which are then matched against the parsed answer candidates in order to pinpoint answers. See below for details.

Publications:


Automated grammar learning and parsing

CONTEX is a parser that produces syntactic-semantic analyses of sentences. It consists of two major parts: a grammar learner and a parser. The grammar learner uses machine learning techniques to induce a grammar (represented as parsing actions) from a set of training examples (sentences paired with their parse trees, produced by a human).

By having a human supervisor assist with training, CONTEX cuts down on the number of training examples it requires. A grammar of Korean was learned from scratch over a three-month period with the help of two graduate students (one created a training set of 1100 trees; the other put in place a part-of-speech tagger and other auxiliary software). The system achieves approximately 86% labeled bracketing precision and recall on unseen sentences. The Japanese version of CONTEX achieves approximately 91%. The English version currently achieves about 92% labeled bracketing precision and recall when trained on 2048 sentences. This figure is a few percent lower than that of the best English parsers today; however, those systems require more than 100 times as much training data.
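Since the learned grammar is represented as parsing actions, a minimal sketch can show what replaying such actions looks like. The action format below (`SHIFT` and `REDUCE label n`) is a common shift-reduce convention chosen for illustration, not CONTEX's actual representation, and a real learner would predict each action from features of the parse state.

```python
# Minimal sketch of a deterministic parser driven by a learned action
# sequence; the action format is illustrative, not CONTEX's own.

def parse(words, actions):
    """Replay SHIFT / 'REDUCE label n' actions over a list of words."""
    stack, buffer = [], list(words)
    for act in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))       # move next word onto the stack
        else:                                 # e.g. "REDUCE NP 2"
            _, label, n = act.split()
            n = int(n)
            children = stack[-n:]             # pop n items...
            del stack[-n:]
            stack.append((label, children))   # ...and push a labeled subtree
    return stack

# Builds the tree (S (NP the dog) (VP barks)).
tree = parse(
    ["the", "dog", "barks"],
    ["SHIFT", "SHIFT", "REDUCE NP 2", "SHIFT", "REDUCE VP 1", "REDUCE S 2"],
)
```

In this framing, learning a grammar amounts to learning which action to take at each parse state, which is why human-corrected training examples reduce the data requirement so sharply.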

In Webclopedia, continued development of the CONTEX parser supports several goals:

Demo of CONTEX.

Publications:


Text summarization

Earlier work at ISI focused on the development of SUMMARIST, a single-document multilingual text summarizer. SUMMARIST can summarize texts in English, Chinese, Arabic, Spanish, French, Italian, Japanese, and Bahasa Indonesia, and was used in the MuST system by the Pacific Command (PACOM) to monitor events in Indonesia in 1998--2000.

We build on SUMMARIST in Webclopedia, focusing on multi-document summarization. In collaboration with Daniel Marcu of the Rewrite project at ISI, we are starting to investigate the range of types of multi-document summaries (event stories, object descriptions, biographies, etc.), as well as methods for producing summaries of the more tractable types. A recent comparison of several methods found that, for newspaper text, several simple baseline methods work about as well as more sophisticated methods involving sentence clustering and filtering.
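A baseline of the kind that comparison refers to can be stated in a few lines. The sketch below is a generic lead-sentence baseline with naive duplicate filtering, written for illustration; it is not one of the specific systems compared.

```python
# Illustrative lead-sentence baseline for multi-document summarization:
# take the first sentence of each document, skipping near-duplicates.

def lead_baseline(docs, max_sentences=3):
    summary, seen = [], set()
    for doc in docs:
        first = doc.split(". ")[0].strip()
        key = frozenset(first.lower().split())
        # Skip a lead sentence that heavily overlaps one already chosen.
        if any(len(key & k) > 0.5 * min(len(key), len(k)) for k in seen):
            continue
        seen.add(key)
        summary.append(first)
        if len(summary) == max_sentences:
            break
    return summary
```

For clustered news stories, where documents repeat the same lead facts, even this crude overlap check removes much of the redundancy that sentence clustering targets.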

We have also built SEE, an interface with which summaries can be evaluated. SEE allows an assessor to compare the system's summary to a human's at any level of granularity and to tabulate findings, which are then tallied and converted to recall and precision scores. SEE is likely to be used by NIST in assessing the quality of summaries in the new Document Understanding Conference (DUC).
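The conversion from tallied judgments to scores is the standard one: precision over the system summary's units and recall over the human (model) summary's units. A minimal sketch, with unit counts as inputs:

```python
def summary_scores(num_matched, num_system_units, num_model_units):
    """Precision = matched / system units; recall = matched / model units.

    'Units' are whatever granularity the assessor judged at
    (e.g. sentences or clauses).
    """
    precision = num_matched / num_system_units if num_system_units else 0.0
    recall = num_matched / num_model_units if num_model_units else 0.0
    return precision, recall
```

For example, if 3 of a system summary's 4 units match content in a 6-unit model summary, precision is 0.75 and recall is 0.5.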

Demo of MuST.

Demo of SUMMARIST.

Publications:


Rapid ramp-up of new languages

In order to support our focus on multilingual language processing, we continue to explore methods to incorporate new languages rapidly. We have recently collected a large amount of Chinese text, several Chinese dictionaries, and a treebank of clauses. We are working with both simplified and traditional Chinese character sets.

CONTEX has been used to learn grammars of English, Japanese, and Korean automatically. SUMMARIST can summarize texts in English, Chinese, Arabic, Spanish, French, Italian, Japanese, and Bahasa Indonesia. With postdoctoral visitors during 2001--2002, we are developing Korean named-entity taggers, part-of-speech taggers, and other software tools.