Text mining: Assignment 1

The memory-based shallow parser (MBSP; Daelemans, Buchholz, and Veenstra, 1999; Buchholz, Veenstra, and Daelemans, 1999) is a modular system that performs tokenization, part-of-speech tagging, constituent chunking, and grammatical relation finding. It has been trained on the Wall Street Journal (WSJ) part of the Penn Treebank, and is known to be quite accurate on WSJ-style texts: news articles, particularly on financial topics. An online demo can be found here; the page features more references and explanations of the tagsets and color codes used to encode the syntactic information.

Like most parsers trained on the WSJ treebank, MBSP is known to be less accurate on text from other genres. For example, specialized medical texts contain many words that MBSP will not recognize. When the part-of-speech tagger makes an error on such a word, it is quite possible that the rest of the shallow parse is incorrect too, because the later modules build on the tagger's output.

With some effort, MBSP can be retrained and tuned to a new domain by training certain components (most importantly the part-of-speech tagger) on domain-specific vocabulary, such as medical terms.

The assignment is to write a report on a series of experiments, in which you start by selecting experimental data from one of the following two sources (the choice is yours):

Select 20 random sentences from plain news text on an English-language news site (e.g. the New York Times).
Select 20 random sentences from medical texts, by querying for the words "protein" or "hiv" in MedLine abstracts here.
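
If you have pasted your source text into a single string, the random selection can be done reproducibly with a few lines of Python. This is only a minimal sketch and not part of the assignment requirements: the function names and the naive sentence splitter are assumptions for illustration, and the splitter will mis-split abbreviations such as "e.g.", so check the selected sentences by hand.

```python
import random
import re


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sample_sentences(text, k=20, seed=None):
    """Return k distinct sentences drawn at random from text.

    Passing a fixed seed makes the selection repeatable, so the same
    sample can be reported and re-created later.
    """
    sentences = split_sentences(text)
    if len(sentences) < k:
        raise ValueError(f"need at least {k} sentences, found {len(sentences)}")
    rng = random.Random(seed)
    return rng.sample(sentences, k)
```

Calling `sample_sentences(text, k=20, seed=1)` twice on the same text returns the same 20 sentences, which makes it easy to document in the report exactly which data you used.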

Next, go to this page (created and maintained at the University of Antwerp's CNTS research group), which features a generic version of MBSP and an MBSP tuned on medical text.

Process all 20 sentences through both parsers, and study the parsing results closely. Note down the following:

Which sentences are correctly parsed, and which analyses contain errors?
What kind of errors occur? In what module do they originate?
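
If you choose the broad empirical style, it may help to record your observations in a structured form and tally them. The sketch below assumes you note each error by hand as a (sentence number, parser, module) triple; the parser labels "generic" and "medical" and the module labels are hypothetical names chosen for illustration, following the four modules described above.

```python
from collections import Counter

# Hypothetical labels for the four MBSP modules described in the text.
MODULES = ("tokenizer", "tagger", "chunker", "relation finder")


def summarize(annotations, n_sentences=20, parsers=("generic", "medical")):
    """Tally hand-made error annotations.

    annotations: iterable of (sentence_id, parser, module) tuples,
                 one tuple per observed error.
    Returns (correct, errors_by_module) where correct maps each parser to
    its number of error-free sentences, and errors_by_module counts the
    errors per (parser, module) pair.
    """
    errors_by_module = Counter()
    bad_sentences = {parser: set() for parser in parsers}
    for sentence_id, parser, module in annotations:
        if parser not in bad_sentences:
            raise ValueError(f"unknown parser: {parser!r}")
        if module not in MODULES:
            raise ValueError(f"unknown module: {module!r}")
        errors_by_module[(parser, module)] += 1
        bad_sentences[parser].add(sentence_id)
    correct = {p: n_sentences - len(ids) for p, ids in bad_sentences.items()}
    return correct, errors_by_module
```

A sentence with several errors counts once as incorrect but contributes every one of its errors to the per-module counts, so the two tables answer the two questions above separately.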

In your report you can take either a case-based (casuistic) approach, discussing individual examples in depth, or a broad empirical approach; again, the choice is yours. The report should contain the following elements:

Title and author (please also mention your administration number and email address)
Abstract (a brief summary of what you did and what you found)
Introduction (here, on the memory-based shallow parser)
Experiments (what you did, and what the results were; optionally divided over two different sections, Experiments and Results)
Conclusions
References

Send your assignments to Walter Daelemans before 14 March, 2005.