The memory-based shallow parser (MBSP; Daelemans, Buchholz, and Veenstra, 1999; Buchholz, Veenstra, and Daelemans, 1999) is a modular system that performs tokenization, part-of-speech tagging, constituent chunking, and grammatical relation finding. It has been trained on the Wall Street Journal (WSJ) part of the Penn Treebank and is quite accurate on WSJ-style texts: news articles, particularly on financial topics. An online demo can be found here; the page offers further references and explains the tagsets and color codes used to encode the syntactic information.

Like most parsers trained on the WSJ treebank, MBSP can be less accurate on texts from other genres. For example, specialized medical texts contain many words that MBSP will not recognize. When the part-of-speech tagger makes an error on such a word, the rest of the shallow parse is likely to be incorrect as well. With some effort, MBSP can be retrained and tuned to a new domain by training certain components (such as the essential part-of-speech tagger) on words specific to that domain, such as medical terms.

The assignment is to write a report on a series of experiments, in which you start by selecting experimental data from one of the following two sources (the choice is yours):
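To make the pipeline stages concrete, here is a minimal toy sketch in Python of how tokenization, tagging, and chunking feed into one another, and how an out-of-vocabulary word can derail the later stages. The lexicon and rules below are invented for illustration only; MBSP itself uses memory-based (k-nearest-neighbour) classifiers trained on the Penn Treebank, not a lookup table.

```python
import re

# Hypothetical mini-lexicon: word -> Penn Treebank tag (illustration only).
LEXICON = {
    "the": "DT", "a": "DT", "stock": "NN", "market": "NN",
    "rose": "VBD", "sharply": "RB", ".": ".",
}

def tokenize(text):
    """Stage 1: split text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def tag(tokens):
    """Stage 2: assign a part-of-speech tag to each token. Unknown words
    default to NN, mimicking how out-of-vocabulary domain terms cause
    tagging errors that then propagate into the chunker."""
    return [(t, LEXICON.get(t.lower(), "NN")) for t in tokens]

def chunk_np(tagged):
    """Stage 3: group determiner/noun runs into noun-phrase (NP) chunks."""
    chunks, current = [], []
    for word, pos in tagged:
        if pos in ("DT", "NN"):
            current.append(word)
        else:
            if current:
                chunks.append(("NP", current))
                current = []
            chunks.append((pos, [word]))
    if current:
        chunks.append(("NP", current))
    return chunks

tokens = tokenize("The stock market rose sharply.")
print(chunk_np(tag(tokens)))
# [('NP', ['The', 'stock', 'market']), ('VBD', ['rose']), ('RB', ['sharply']), ('.', ['.'])]
```

Note how a single wrong tag changes the chunking: if "rose" were mis-tagged NN (as happens with unknown words here), the whole sentence would collapse into one long NP, which is exactly the error-propagation effect described above.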
Next, go to this page (created and maintained by the CNTS research group at the University of Antwerp), which offers a generic version of MBSP alongside a version of MBSP tuned on medical text. Process all 20 sentences with both parsers and study the parsing results closely. Note down the following:
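When comparing the two parsers' outputs sentence by sentence, a small token-level diff can help you spot exactly where they disagree. The helper below is a sketch under the assumption that you have transcribed each parser's tag sequence into a list; the example tokens and tags are invented for illustration, not real MBSP output.

```python
def tag_disagreements(tokens, tags_generic, tags_medical):
    """Return (token, generic_tag, medical_tag) triples for every position
    where the two parsers assigned different tags."""
    return [(tok, g, m)
            for tok, g, m in zip(tokens, tags_generic, tags_medical)
            if g != m]

# Hypothetical example sentence with hand-made tag sequences.
tokens  = ["The", "patient", "received", "cisplatin", "infusions", "."]
generic = ["DT", "NN", "VBD", "JJ", "NNS", "."]   # drug name mis-tagged JJ
medical = ["DT", "NN", "VBD", "NN", "NNS", "."]   # tuned parser gets NN

diffs = tag_disagreements(tokens, generic, medical)
print(diffs)  # [('cisplatin', 'JJ', 'NN')]
```

Collecting these triples over all 20 sentences gives you a compact inventory of the disagreements to discuss in your report.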
In your report you may adopt either a casuistic (case-by-case) style or a broad empirical style; again, the choice is yours. The report should contain the following elements:
Send your assignments to Walter Daelemans before 14 March, 2005.