On 7 December 2011 the Tilburg center for Cognition and Communication organizes a symposium on language modelling on the occasion of the Ph.D. thesis defense of Herman Stehouwer later that day.
Language modelling has grown to be a pervasive technology in various information processing fields where the matching and ranking of linguistic objects such as documents, user profiles, and questions and answers is based on similarity of the text contained in them. The goal of the symposium is to provide an overview of different approaches in language modelling, exhibiting the large variety of work in this area. There will be talks by Colin de la Higuera, Antal van den Bosch, and Louis ten Bosch; Herman Stehouwer will additionally present the work described in his thesis.
Program:
With the growing availability of text corpora from different time periods, it is of increasing interest to be able to automatically detect diachronic changes in e.g. the usage of lexical items, text styles, and grammatical constructions. We will describe structure discovery methods to reveal such changes in grammatical constructions in texts. The methods have been applied to subcorpora of the Penn-Helsinki Parsed Corpus of Middle English (PPCME2), a collection of text samples based largely on the Middle English section of the Diachronic Part of the Helsinki Corpus of English Texts.
Memory-based language modeling is a class of statistical language modeling in which features are unordered, and next-word prediction is based on k-nearest neighbor classification (or approximations thereof). The approach is closely linked to standard back-off n-gram models, but offers more flexibility. For instance, the approach allows for testing whether adding right context makes for better language models than the traditional left-context-only models. I critically compare memory-based language models to baseline statistical n-gram language models in various terms, such as perplexity, word prediction accuracy and mean reciprocal rank, as well as external evaluations on tasks such as text completion, spelling correction, and machine translation.
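As a rough illustration of the idea in this abstract, the following minimal sketch predicts the next word by k-nearest-neighbour voting over stored context instances, and scores a prediction by its reciprocal rank. It is not the speaker's implementation: the function names, the simple overlap metric, the padding symbols, and the toy corpus are all assumptions made for this example. The `right` parameter shows how right context could be added as extra features.

```python
from collections import Counter

def make_instances(tokens, left=2, right=0):
    """Turn a token list into (context features, target word) pairs.

    `left` and `right` are the number of context words taken on each side
    (hypothetical parameter names for this sketch).
    """
    padded = ["<s>"] * left + tokens + ["</s>"] * right
    instances = []
    for i in range(left, left + len(tokens)):
        features = tuple(padded[i - left:i] + padded[i + 1:i + 1 + right])
        instances.append((features, padded[i]))
    return instances

def overlap(a, b):
    """Count matching feature positions; every position weighs the same."""
    return sum(x == y for x, y in zip(a, b))

def predict(memory, features, k=1):
    """Rank candidate words by votes from the k nearest distance levels."""
    scored = [(overlap(features, f), w) for f, w in memory]
    best = sorted({s for s, _ in scored}, reverse=True)[:k]
    votes = Counter(w for s, w in scored if s in best)
    return [w for w, _ in votes.most_common()]

def reciprocal_rank(ranking, target):
    """1/rank of the target in the ranking, or 0.0 if it is absent."""
    return 1.0 / (ranking.index(target) + 1) if target in ranking else 0.0

# Toy corpus (assumed data, for illustration only).
memory = make_instances("the cat sat . the cat sat . the cat ate .".split())
ranking = predict(memory, ("the", "cat"), k=1)
```

Averaging `reciprocal_rank` over a held-out set would give the mean reciprocal rank mentioned above; a real system would replace the linear scan over `memory` with an indexed or approximate search.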
Grammatical inference aims at learning grammars given information about languages. It should therefore be able to build models of probabilistic automata or grammars from a corpus, and use these learnt models to estimate the probability of a next word or obtain good perplexity results. Yet the preliminary results, which were promising in the early 1990s, have failed to turn into off-the-shelf technologies today. During this talk we will comment on some of the attempts made in the field and the reasons for their relative failure. We shall also give some ideas as to why we hope the future of grammatical inference will be brighter for language modelling.
Is there a need to limit certain aspects of statistical language models? Is it necessary to pre-limit the size of the n-gram? Is it useful to use linguistic annotation within alternative sequence selection tasks? We compare the ability of a language model to select the correct alternative from sets of alternatives in hundreds of experiments. These experiments were performed for three different alternative sequence selection tasks, for four different annotations (and also for no annotation), and for four different ways to combine the annotation with the text.
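To make the notion of alternative sequence selection concrete, here is a minimal sketch in which a language model scores each candidate sequence and the highest-scoring one is selected. It is a deliberate simplification and not the method of the thesis: it uses a fixed-size bigram model with add-one smoothing and no linguistic annotation, and the function names, training sentences, and confusible pair (than/then) are assumptions chosen for this example.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Collect unigram and bigram counts from whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        toks = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a sentence."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    vocab = len(unigrams)
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(toks, toks[1:])
    )

def select(alternatives, unigrams, bigrams):
    """Return the alternative the model considers most probable."""
    return max(alternatives, key=lambda s: log_prob(s, unigrams, bigrams))

# Toy training data (assumed, for illustration only).
unigrams, bigrams = train_bigram([
    "it is better than that",
    "he is taller than her",
    "then we went home",
])
choice = select(
    ["she is taller than him", "she is taller then him"],
    unigrams,
    bigrams,
)
```

Here the model prefers the alternative containing the bigram it has seen in training ("taller than"). Accuracy over a test set is then the fraction of alternative sets for which `select` returns the correct sequence.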
The symposium will be at the University of Tilburg, Warande building, room WZ104.