|10.00 - 11.00
||Valentin Jijkoun (Invited speaker)
TiMBL in Amsterdam: semantic parsing and question analysis
ILPS, the Information and Language Processing Systems group at the University of Amsterdam, works on different aspects of intelligent information access: Information Extraction and Retrieval, Applied NLP, Automated Reasoning, and Question Answering. In this talk I will describe our recent experiments with MBL on several NLP tasks: recovery of predicate-argument structure as a post-parsing step, FrameNet-based extraction of semantic roles, and question classification for Question Answering. The evaluation results (e.g., our scores at the Senseval-3 Semantic Roles task) suggest that MBL is indeed a good choice for these problems.
|11.00 - 11.30
Automatic Sentence Compression in the MUSA project
We describe ongoing work on sentence compression in the MUSA project. This project aims at automatically creating subtitles for English TV programs in three target languages: English, Greek, and French. It involves automatic speech recognition, sentence compression and machine translation. This talk will focus on the architecture of the sentence compression module, responsible for reducing English sentences according to a given compression rate while preserving grammaticality and ensuring that no relevant information gets lost.
While there has been a lot of research on document summarization, the field of sentence compression is relatively unexplored. Essentially, there are two ways of shortening a sentence: deleting irrelevant parts of it, or replacing phrases with shorter paraphrases that express the same content.
The current architecture of the MUSA subtitling module combines both: a parallel corpus of TV transcripts and subtitles has been used to create a lexicon of common paraphrases. Given an input sentence, the subtitling engine first checks whether it can be reduced by looking up paraphrases in the lexicon and replacing the parts of the sentence for which a shorter paraphrase can be found. If further compression is required, the input sentence is linguistically annotated with a shallow parser, which assigns lemmas and part-of-speech tags to the words, groups adjacent syntactically related words together, and performs basic relation finding (identification of verbs and their subjects and objects). A set of hand-crafted linguistic rules is then used to determine which parts of the sentence can be deleted without loss of grammaticality. Different relevance measures, based on word frequency, word duration and deletion probabilities, have been implemented to decide which substring to delete first. Starting with the least important, we delete substrings until the target compression rate is met. The evaluation of the different system setups is based on the BLEU method, i.e., the automatically generated subtitles are compared to five reference subtitles created by professional subtitlers.
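The deletion stage described above can be sketched as a greedy loop: rule-licensed substrings are removed in order of increasing relevance until the target compression rate is met. This is an illustrative sketch, not the MUSA implementation; the span representation, relevance scores and function names are all assumptions.

```python
def compress(words, deletable_spans, relevance, target_rate):
    """Greedily delete rule-licensed spans, least relevant first.

    words: list of tokens in the input sentence.
    deletable_spans: (start, end) token index pairs the grammar rules
        allow to delete.
    relevance: dict mapping each span to an importance score.
    target_rate: desired output/input length ratio, e.g. 0.7.
    """
    kept = list(words)
    # Delete the least relevant spans first.
    for span in sorted(deletable_spans, key=lambda s: relevance[s]):
        # Stop as soon as the sentence is short enough.
        if len([w for w in kept if w is not None]) <= target_rate * len(words):
            break
        start, end = span
        for i in range(start, end):
            kept[i] = None  # mark tokens in this span as deleted
    return [w for w in kept if w is not None]
```

A real system would additionally re-check grammaticality after each deletion; here the rules are assumed to license only safe spans.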
|11.30 - 12.00
Towards fully automatic Text Induced Spelling Correction
We present new developments concerning tisc, a language-independent and context-sensitive spelling checking and correction system designed to facilitate the automatic removal of non-word spelling errors in large corpora. Its lexicon is derived, without supervision, from a very large corpus of raw text and contains word unigrams and word bigrams. It is stored in a novel representation based on a purpose-built hashing function, which provides a fast and computationally tractable way of checking whether a particular word form is likely to constitute a spelling error and of retrieving correction candidates. We will present our findings on the nature and scope of non-word spelling errors, and discuss recent evaluations showing that tisc outperforms the state-of-the-art systems available today. Its performance indicates that fully automatic spelling checking and correction is gradually becoming a real possibility.
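The abstract leaves the hashing function unspecified. Purely as an illustration of how hashing can make candidate retrieval tractable, the sketch below uses an order-independent character-sum hash (in the spirit of anagram hashing): a single character edit changes the hash by a predictable amount, so correction candidates can be found with a bounded number of lexicon lookups instead of a scan. The exponent and all names are assumptions, not tisc's actual scheme.

```python
def anagram_hash(word):
    # Order-independent hash: identical for all anagrams of the word.
    return sum(ord(c) ** 5 for c in word)

def build_lexicon(words):
    """Map each hash value to the set of lexicon words that share it."""
    lex = {}
    for w in words:
        lex.setdefault(anagram_hash(w), set()).add(w)
    return lex

def candidates(word, lex, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Retrieve in-lexicon words reachable by one insertion, deletion,
    or substitution, using hash arithmetic instead of string edits."""
    h = anagram_hash(word)
    deltas = {0}
    for c in alphabet:
        v = ord(c) ** 5
        deltas.add(v)        # one character inserted
        deltas.add(-v)       # one character deleted
        for c2 in alphabet:  # one character substituted
            deltas.add(ord(c2) ** 5 - v)
    found = set()
    for d in deltas:
        found |= lex.get(h + d, set())
    return found
```

Because the hash is order-independent, retrieved candidates include anagrams and would still need ranking, e.g. by edit distance and corpus frequency, before correction.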
|12.00 - 12.30
||Guy De Pauw
Revisiting Memory-Based Morphological Analysis
In this talk, we revisit the well-known memory-based method for morphological analysis described in Van den Bosch et al. (1999) using a cascaded approach in which each processing step can be evaluated and optimized separately without risking percolation of bad classification decisions. We report scores on the alternation, segmentation, tagging and bracketing subtasks on a cleaned-up version of the Dutch morphological database of CELEX and find that the memory-based approach stands up well against an independently developed fine-grained finite state method for morphological segmentation. Both systems were developed as autonomous language models for use in the modular speech recognition system developed within the FLaVoR project.
|13.30 - 14.00
Learning coreference resolution
We provide a thorough empirical study of the behaviour of two well-known machine learning techniques, viz. memory-based learning and rule induction, on the task of coreference resolution. Applied to this specific task, we determine which factors contribute to the success or failure of a machine learning experiment. We consider the effect of algorithm bias, feature selection, algorithm parameter optimization, the combined variation of both, and the effect of sample selection on the performance of both learning techniques. On the basis of results on the MUC-6 coreference resolution data sets, we show that the initial differences between the two learning techniques are easily overcome, or even reversed, when all these factors are taken into account. We also introduce the first coreferentially annotated corpus of Dutch texts.
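The "combined variation" of feature selection and parameter optimization amounts to searching a cross-product of settings. A minimal, hypothetical sketch, not the authors' experimental code; `train_fn` and `score_fn` are placeholders for training a classifier and scoring it on held-out data:

```python
from itertools import combinations, product

def exhaustive_search(train_fn, score_fn, features, param_grid):
    """Try every feature subset with every parameter setting and
    return the best (features, params, score) triple."""
    best = (None, None, -1.0)
    # All non-empty feature subsets.
    subsets = [list(c) for r in range(1, len(features) + 1)
               for c in combinations(features, r)]
    # Cartesian product of all parameter values.
    names = sorted(param_grid)
    settings = [dict(zip(names, vals))
                for vals in product(*(param_grid[n] for n in names))]
    for subset, params in product(subsets, settings):
        model = train_fn(subset, params)
        s = score_fn(model)
        if s > best[2]:
            best = (subset, params, s)
    return best
```

For realistic feature counts this exhaustive search explodes combinatorially, which is precisely why the interaction of these factors is worth studying empirically.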
|14.00 - 14.30
Antal van den Bosch
Robust Interpretation of Recognised Speech
We describe a series of experiments in which shallow processing approaches are combined to extract pragmatic-semantic information from spoken user input in Dutch human-machine dialogues. In the machine learning experiments the task is to interpret user utterances to a spoken dialogue system in terms of four aspects: the basic dialogue act of the input, the type of query slot(s) the user fills, whether the user reacts to system error, and whether the user input is going to be erroneously processed by the system. We conduct all experiments both with a memory-based and a rule induction classifier that draw on a variety of features representing the spoken input (speech signal measurements and word recognition information) as well as the interaction context (the dialogue history).
We investigate whether directly processing N-best recognition hypotheses improves performance on the interpretation task. Processing is carried out by three approaches that aim at filtering out certain tokens from the N-best list of each user utterance: (i) disfluencies, (ii) words that are not the head of their syntactic chunk, and (iii) words that do not belong to the set of 15 most frequent words in the corpus.
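Filter (iii) above is the simplest to sketch: keep only tokens that belong to the corpus's 15 most frequent words. A hypothetical illustration; function names and data layout are assumptions:

```python
from collections import Counter

def top_k_filter(nbest_lists, corpus_tokens, k=15):
    """Remove from each N-best hypothesis every token that is not among
    the k most frequent words in the corpus (filter (iii) above)."""
    keep = {w for w, _ in Counter(corpus_tokens).most_common(k)}
    return [[tok for tok in hyp if tok in keep] for hyp in nbest_lists]
```

In a dialogue corpus the most frequent words are largely function words and domain keywords, so this filter keeps the tokens most predictive of dialogue acts and slot fills.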
|14.30 - 15.00
||Erik Tjong Kim Sang
Automatic Reduction of Dutch Sentences
Apart from speech recognition, automatic subtitling of TV programmes requires reducing the length of sentences so that they fit in the available space. Such a reduction can be accomplished by removing words or by replacing them with shorter variants. This talk presents two methods for performing this task: the first is based on hand-crafted phrase deletion rules, and the second is a machine-learning approach based on a corpus of sentences and their reduced counterparts. The two approaches will be evaluated and compared.
|15.00 - 15.30
||Antal van den Bosch
How hard can word prediction be?
Word prediction is an intriguing language engineering semi-product; arguably it is the archetypal prediction problem in natural language processing. It can be an asset in higher-level proofing or authoring tools, e.g. to be able to discern among confusibles, or to suggest words in a word processing environment, both under normal circumstances and in special cases such as language learning or augmentative communication. It could alleviate problems with low-frequency and unknown words in natural language processing and information retrieval, by replacing them with likely alternatives that carry similar information; it could provide answers to some questions in question answering systems by filling in blanks. And finally, since it is also a very direct interpretation of language modeling, a word prediction system could provide useful information for language modeling components in speech recognition systems.
We present an automatically learned word prediction model based on IGTree, a decision-tree approximation of the memory-based learning algorithm IB1-IG, which uses information gain as the criterion to order tests in the tree. We define the word prediction task both as a "blank filling" task with left and right context words, and as a "next word" task with only left context words. Through experiments, in which we train on Reuters newswire text and test either on the same type of data or on fictional text, we demonstrate that the system can scale up to predicting among hundreds of thousands of words, can be trained on up to twenty million training examples, and can predict words at rates ranging from tens to thousands of words per second. Prediction accuracy ranges between 12% on the fictional text and 49% on the newswire text, suggesting that the proposed method is highly sensitive to genre differences in training and test material.
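As an illustration of the task setup, not of IGTree itself, the sketch below frames "next word" prediction as lookup in a table of left contexts with back-off from the longest matching context. Unlike IGTree, it orders features by recency rather than information gain; all names are assumptions.

```python
from collections import Counter, defaultdict

def train(tokens, n=3):
    """Count which word follows each left context of length 1..n."""
    counts = defaultdict(Counter)
    for i in range(len(tokens)):
        for k in range(1, n + 1):
            if i - k >= 0:
                counts[tuple(tokens[i - k:i])][tokens[i]] += 1
    return counts

def predict(counts, context, n=3):
    """Predict the next word, backing off from the longest context
    to shorter ones when the full context was never seen in training."""
    for k in range(min(n, len(context)), 0, -1):
        key = tuple(context[-k:])
        if key in counts:
            return counts[key].most_common(1)[0][0]
    return None
```

The back-off behaviour mirrors what a decision tree gives for free: when a deeper test fails, the prediction stored at the last matching node is returned.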
|15.30 - 16.00
Lexical Representations for Prepositional Phrase Attachment Disambiguation
The problem of prepositional phrase (PP) attachment disambiguation consists of determining whether a PP in a Verb-Object-PP pattern attaches to the verb - as in (1) - or to the noun phrase in the object position - as in (2).
(1) I [VP eat [NP pizza]] [PP with a fork]
(2) I [VP eat [NP pizza [PP with anchovies]]]
In this talk, we will report on our machine learning experiments with various representations of words as features in PP attachment disambiguation. The main idea is to explore various levels of generalization between two extremes, the words themselves as the most specific level, and POS-tags as the most general level. The in-between levels investigated are: a +/- morphological representation of low-frequency words (HAPAX-ing), a representation derived from automatically clustering words into (semantic) classes, and a representation derived from WordNet.
We use the Ratnaparkhi data for our PP attachment disambiguation experiments: this data set contains 20,000 training items and 4,000 test items, each item being a four-tuple [verb, head noun of object, preposition, head noun of PP], extracted from the Wall Street Journal corpus and labeled with its correct class (N for attachment to the noun, V for attachment to the verb).
The machine learning algorithm used in our experiments is TiMBL. Besides reporting results for the various levels of generalization in the lexical representation of the words, we will also investigate the effects of parameter optimization and of training set size.
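Memory-based classification of such four-tuples can be sketched as nearest-neighbour voting over feature overlap, roughly in the spirit of TiMBL's IB1 but without its information-gain feature weighting; this is an illustrative sketch, not TiMBL itself.

```python
from collections import Counter

def classify(instance, memory, k=3):
    """Label a [verb, noun1, prep, noun2] tuple by majority vote among
    the k stored instances with the highest feature overlap.

    memory: list of (features, label) pairs, features a 4-tuple.
    """
    # Rank stored instances by how many feature positions match.
    scored = sorted(memory,
                    key=lambda m: sum(a == b for a, b in zip(instance, m[0])),
                    reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

Replacing the raw word features with the more general representations discussed above (POS tags, HAPAX-ed forms, cluster or WordNet classes) only changes what the tuples contain, not the classification procedure.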
Questions? Contact Iris Hendrickx or Erwin Marsi.