ATILA 2009


Printable version of programme and abstracts here. (pdf)

Monday 9 November 2009

09.00 - 10.00 Registration
10.00 - 10.30 Welcome Coffee & Opening
10.30 - 11.00 Mike Kestemont
21 grams, the weight of the author? The acclaimed robustness of n-grams in authorship verification and the case of medieval rhymed epics.


In this talk I shall deal with authorship verification in medieval rhymed epics on the basis of several machine learning methods. N-grams, in particular bigrams, will be the main research focus. Though these small linguistic units are known to be a very predictive feature for present day authors, it is as yet unclear how 'telling' they are for Middle Dutch authors.

Recent studies in computational stylometry have convincingly demonstrated that modern Machine Learning methods are able to recognize or verify the authorship of texts. These methods are far from flawless but seem to perform with high degrees of accuracy when confronted with a realistic and confined classification task. Such text classification is normally based on the extraction of a wide range of features (such as frequency counts) from texts. Character n-grams and in particular bigrams seem to make up an essential part of the so-called stylome of present day authors. The robustness of these small linguistic units is extraordinary, especially because they seem rather unmeaningful at first sight. However, when it comes to historical literary data, one could wonder whether the acclaimed robustness of n-grams still holds. In this paper I shall report on some exploratory experiments into the n-gram based authorship verification of some well-known Middle Dutch rhymed epics, such as Van den vos Reynaerde, de Roman van Moriaen and Karel ende Elegast. In fact, the main focus will lie with (dis)proving the following two hypotheses:

  • Character n-grams will perform poorly in this task because of the enormous spelling and spacing variation, so typical of medieval manuscripts. Moreover, the most 'telling' n-grams for modern texts often include punctuation marks, a graphemic category that medieval epics generally lack.
  • In rhymed epics, n-gram extraction restricted to tokens that appear in rhyme position will be superior to n-gram extraction from words outside rhyme position.
Three Machine Learning frameworks will be compared for this task: one lazy instance-based learner (TiMBL) and two eager learners (MaxEnt & LibSVM).
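
As an illustration of the second hypothesis, the following minimal Python sketch extracts character n-gram counts either from the tokens in rhyme position or from all other tokens. It assumes that the rhyme position is simply the last whitespace-separated token of each verse line; the function names and the two toy verse lines are illustrative only and are not taken from the talk.

```python
from collections import Counter


def char_ngrams(token, n=2):
    """Return all overlapping character n-grams of a token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def ngram_profile(lines, n=2, rhyme_only=False):
    """Count character n-grams over verse lines.

    Assumption: the last token of a line is the rhyme word.
    rhyme_only=True  -> use only the last token of each line.
    rhyme_only=False -> use every token except the last one.
    """
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()
        if not tokens:
            continue
        selected = tokens[-1:] if rhyme_only else tokens[:-1]
        for token in selected:
            counts.update(char_ngrams(token, n))
    return counts


# Toy input, used purely for illustration.
verses = ["het was in enen tsinxen daghe", "dat bede bosch ende haghe"]
rhyme_profile = ngram_profile(verses, rhyme_only=True)
other_profile = ngram_profile(verses, rhyme_only=False)
```

The two resulting count profiles could then serve as separate feature sets for any of the learners mentioned above.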

11.00 - 11.30 Herman Stehouwer
Token merging in language model-based confusible disambiguation


In the context of confusible disambiguation (spelling correction that requires context), the synchronous back-off strategy combined with traditional n-gram language models performs well. However, when alternatives consist of a different number of tokens, this classification technique cannot be applied directly, because the computation of the probabilities is skewed. Previous work already showed that probabilities based on different order n-grams should not be compared directly.

In this article, we propose new probability metrics in which the size of the n is varied according to the number of tokens of the confusible alternative. This requires access to n-grams of variable length. Results show that the synchronous back-off method is extremely robust.

We discuss the use of suffix trees as a technique to store variable length n-gram information efficiently.

11.30 - 12.00 Bram Vandekerckhove
Simulating the selective impairment of semantic constraints on adjective order with a memory-based language processing model.


Recent work in neurolinguistics (Kemmerer et al., 2009) reports aphasics that are selectively impaired in their knowledge about the semantic constraints that govern pre-nominal adjective order (e.g. 'a big brown dog' vs. *'a brown big dog'). At the same time, knowledge of the semantic categories to which these constraints apply (e.g. size, color, etc.) and of syntactic word order constraints ('a big brown dog' vs. *'dog brown big a') seems to be still intact. By varying the number of neighbors taken into account for extrapolation in a simple memory-based language processing model, we show how these patients can be characterized as 'overeager abstractors'. Their impairments might affect not so much abstract linguistic *knowledge* as the level of abstraction they employ during linguistic *processing*.

12.00 - 13.00 Lunch break
13.00 - 13.30 Philip van Oosten, Dries Tanghe & Véronique Hoste
Towards a Gold Standard for Readability Prediction.


Readability is an ill-defined property of text. A possible description is the common perception the community of language speakers has about it. We intend to estimate it by means of the assessments of a panel of expert readers. We describe related work. Readability formulas assign a holistic score to any text, based on shallow language characteristics. Other research successfully shows that deeper characteristics have an influence on readability, but fails to turn those conclusions into a holistic score. Based on research on the Eindhoven Corpus, we show that the readability formulas strongly correlate with shallow text characteristics and that strong correlations exist among different formulas. Further, we argue that there is a necessity for a new Gold Standard corpus for readability prediction, based on a clean methodology. Finally, we present such a methodology and web applications that are applicable to construct the corpus.

13.30 - 14.00 Martin Reynaert
Parallel identification of the spelling variants in OCR-ed, historical corpora.


We present a new approach based on anagram hashing to globally handle the typographical variation in large and possibly noisy text collections. Typographical variation is typically handled in a local fashion: given one particular text string some system of retrieving near-neighbours is applied, where near-neighbours are other text strings that differ from the particular string by a given number of characters. The difference in characters between the original string and one of its retrieved near-neighbours we call a particular character confusion. We present a global way of performing this action: given a possible particular character confusion, we identify - in parallel, i.e. in one single operation on anagram-hash derived bit vectors - all the pairs of text strings in the text collection to which the particular confusion applies. The algorithm was evaluated on about 23,000 English attested typos from the Reuters RCV1 text collection. We further explored its usefulness for unsupervised linking of a historical Dutch word list to its contemporary counterpart. In this talk we present results of applying the approach to large, historical collections of Dutch OCR-ed Acts of Parliament.
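
The global lookup described above can be sketched in a few lines of Python. The abstract does not give the hash function, so the sum of fifth powers of character code points used here is an assumption, as are all function names and the toy word list. The sketch also uses a plain dictionary lookup rather than the bit-vector operation mentioned in the abstract.

```python
from collections import defaultdict


def anagram_key(word, power=5):
    """Order-independent key: all anagrams of a word share the same value.
    Assumption: sum of code points raised to a fixed power."""
    return sum(ord(ch) ** power for ch in word)


def build_index(words):
    """Map each anagram key to the set of words that carry it."""
    index = defaultdict(set)
    for word in words:
        index[anagram_key(word)].add(word)
    return index


def confusion_value(removed, added):
    """Numeric value of a character confusion: key(added) - key(removed)."""
    return anagram_key(added) - anagram_key(removed)


def pairs_for_confusion(index, delta):
    """Find all word pairs whose keys differ by exactly this confusion value."""
    pairs = []
    for key, words in index.items():
        targets = index.get(key + delta)
        if targets:
            pairs.extend((a, b) for a in sorted(words) for b in sorted(targets))
    return sorted(pairs)


# Toy word list, invented for illustration.
words = ["goverment", "government", "enviroment", "environment", "parliament"]
index = build_index(words)
delta = confusion_value("", "n")  # the confusion "a missing n"
pairs = pairs_for_confusion(index, delta)
```

One pass over the index retrieves every pair to which the confusion applies, instead of querying near-neighbours string by string. Because the key ignores character order, candidate pairs would still need a check such as edit distance in a realistic setting.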

14.00 - 14.30 Coffee break
14.30 - 15.00 Sander Wubben
Monolingual Machine Translation.


In this talk I will present an approach which regards paraphrasing as a translation task, with the source and translation languages being the same. For this task a phrase based machine translation system is used. Pairs of paraphrases can simply be fed to such a system.

These pairs of paraphrases are obtained by acquiring headline clusters from Google News, and then for each cluster selecting the available paraphrase candidates by using simple surface similarities. This dataset is used to train a phrase-based translation model using the MOSES package and a language model. Using these models we can generate paraphrases for any given new headline.

This system is compared to a baseline word substitution model, which utilizes WordNet to substitute words with their synonyms. We presented nine test subjects with 160 headlines and the output from both systems and asked them to rate the generated paraphrases. We show that the human judges prefer the paraphrases generated by the phrase based machine translation system. We will show how this system may possibly be improved and how it can be used for a related task: sentence compression. We will also try to formulate a way to automatically rate paraphrase quality.

15.00 - 15.30 Lieve Macken & Walter Daelemans
A chunk-driven bootstrapping approach to extracting translation patterns.


We present a linguistically-motivated sub-sentential alignment system that extends the intersected IBM Model 4 word alignments. The alignment system is chunk-driven and requires only shallow linguistic processing tools for the source and the target languages, i.e. part-of-speech taggers and chunkers. We conceive the sub-sentential aligner as a cascade model consisting of two phases. In the first phase, anchor chunks are linked based on the intersected word alignments and syntactic similarity. In the second phase, we use a bootstrapping approach to extract more complex translation patterns. The results show an overall AER reduction and competitive F-Measures in comparison to the commonly used symmetrized IBM Model 4 predictions (intersection, union and grow-diag-final) on different text types for English-Dutch. More in particular, in comparison with the intersected word alignments, the proposed method improves recall, without sacrificing precision. Moreover, the system is able to align discontinuous chunks.

15.30 - 16.00 Maarten van Gompel, Peter Berck & Antal van den Bosch
Extending Memory-Based Machine Translation to Phrases.


We present a phrase-based extension to memory-based machine translation. This form of example-based machine translation employs lazy-learning classifiers to translate fragments of the source sentence to fragments of the target sentence. Source-side fragments consist of variable-length phrases in a local context of neighboring words, translated by the classifier to a target-language phrase. We compare three methods of phrase extraction, and present a new decoder that re-assembles the translated fragments into one final translation. Results show that one of the proposed phrase-extraction methods, the one used in Moses, leads to a translation system that outperforms context-sensitive word-based approaches. The differences, however, are small, arguably because the word-based approaches already capture phrasal context implicitly due to their source-side and target-side context sensitivity.

16.00 - 16.30 Coffee break
16.30 - 17.00 Peter Berck & Antal van den Bosch
WOPR: A Memory-Based Swiss Army Knife.


WOPR is a software package that wraps around TiMBL, and offers a choice of functions all related to language modeling. Built-in functionality includes word prediction with user-determinable context: neighboring words, a frequency-filtered context word memory with decay, and document-global features. Besides "all words" prediction, WOPR can be set to zoom in on specific prediction subsets (such as confusibles), or specific contexts. It can test language models on new text, reporting perplexities, prediction distributions, and word-level entropy, and can export ARPA-formatted language model files. It can filter its output to produce spelling correction candidates. It has a server mode. Finally, WOPR is scriptable. We demonstrate WOPR's functionalities by presenting new results on memory-based language modeling.

17.00 - 17.20 Menno van Zaanen
Associative Memory-Based Storage.


As we all know, machine learning techniques have been developed and used to generalize over training examples, allowing classification of unseen instances. The aim of these techniques is to improve (according to some measure) on a certain task by using the (training) experience.

Memory-based learning is a particular machine learning technique that generalizes decisions by comparing new, unseen instances to training instances of which the classification is known. A useful aspect of this approach is that the number of classes is essentially unlimited (although practically limited by the number of training instances).

The idea we are investigating at the moment makes use of the fact that all training instances are stored, combined with the fact that the approach can handle a large number of output classes. Classifying with k=1 and requesting only instances at distance 0 from the test instance returns only instances that were seen during training. Essentially, this reduces k-NN to an associative memory-based storage.

We have applied this associative storage in a simple context-free parser (allowing TiMBL to store a context-free grammar). By relaxing the distance and k, it is possible to find similar grammar rules when exactly matching grammar rules cannot be found. This results in a robust parser. Also, additional (context-sensitive) information can easily be incorporated in the feature vectors.
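
A minimal Python sketch of this idea follows. It does not use TiMBL; it is a stand-alone nearest-neighbour store with an overlap distance, and the class name, the padding symbol "_" and the toy grammar rules are assumptions made for illustration.

```python
def overlap_distance(a, b):
    """Number of feature positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


class MemoryStore:
    """Stores (features, label) pairs and retrieves the nearest ones."""

    def __init__(self):
        self.instances = []

    def store(self, features, label):
        self.instances.append((tuple(features), label))

    def lookup(self, features, max_distance=0):
        """Return the labels of the nearest stored instances (k=1).

        With max_distance=0 this behaves as an associative memory:
        only exact matches are returned, otherwise an empty list.
        """
        features = tuple(features)
        scored = [(overlap_distance(features, f), label)
                  for f, label in self.instances]
        if not scored:
            return []
        best = min(d for d, _ in scored)
        if best > max_distance:
            return []
        return [label for d, label in scored if d == best]


# Toy context-free rules: right-hand side (padded with "_") -> left-hand side.
grammar = MemoryStore()
grammar.store(("Det", "N", "_"), "NP")
grammar.store(("NP", "VP", "_"), "S")
grammar.store(("V", "NP", "_"), "VP")

exact = grammar.lookup(("Det", "N", "_"))                    # stored rule
missing = grammar.lookup(("Det", "Adj", "_"))                # no exact match
relaxed = grammar.lookup(("Det", "Adj", "_"), max_distance=1)
```

Raising `max_distance` is the relaxation step: when no exactly matching rule is stored, the closest rule is returned instead, which is what makes the parser robust.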

18.00 Dinner

Tuesday 10 November 2009

10.00 - 10.30 Morning Coffee
10.30 - 11.30 Invited talk by Sophia Ananiadou - National Text Mining Centre (NaCTeM)
Linking text to knowledge: text mining techniques for knowledge discovery



11.30 - 12.00 Isabelle Delaere, Véronique Hoste & Peter Velaerts
ABOP, an Automatic optimizer for Patient Information Leaflets.
(System Demo)


ABOP, an Automatic optimizer for Patient Information Leaflets (PILs), aims to improve the readability of Dutch PILs by tackling three of the issues that make a PIL hard to read: the scientific terminology used, the redundancy which makes a PIL needlessly lengthy and the overlap between illocutionary acts which often make no distinction between an instruction, a warning and mere informative text. ABOP combines a highly accurate learning-based terminology extraction web service with an application that can be plugged into Microsoft Word and easily used by PIL authors.

12.00 - 13.00 Lunch break
13.00 - 13.30 Antal van den Bosch & Roser Morante
Dependency parsing and semantic role labeling: Variants of a single task.


In this presentation we provide arguments for the assertion that syntactic dependency parsing and semantic role labeling are merely exponents of one and the same underlying task. This syntacto-semantic task can be defined in a more syntactic way (as dependency parsing), a more semantic way (as semantic role labeling), or as a joint task that maps to a joint syntacto-semantic label set. Using the constraint-satisfaction inference approach of Sander Canisius, and the CoNLL 2009 shared task datasets of Spanish and Catalan, we show that learning the joint task leads to performances that are at least on par with the results of learning the tasks in isolation, and in fact often improve over the isolated approaches. We venture into explaining this remarkable result by analysing the various components of the joint learning system: why is it that a more complicated label system, with fewer examples per label on average, leads to better parses and semantic role assignments?

13.30 - 14.00 Frederik Vaassen
deLearyous - Training Interpersonal Communication Skills through Natural Language Interaction with Autonomous Virtual Characters.


The deLearyous project is a collaboration between Groep T in Leuven, the e-learning company Opikanoba and the Computational Linguistics and Psycholinguistics (CLiPS) group at the University of Antwerp. Its goal is to develop a program that allows trainees to improve their communication skills by interacting with a virtual character through written natural language. The project is currently still in its preliminary stages.

CLiPS' part in the deLearyous project involves automatic emotion extraction from written text input, so that these emotions can be used as cues to steer the virtual character down the correct dialogue paths. Sentences are classified into four possible emotion classes. Classification is done based on different types of features, including word tokens, syntactic functions, part-of-speech tags and characters. Different machine learning and statistical techniques have been tried, most notably TiMBL (a memory-based learner) and a Naive Bayes classifier. On a small corpus of 120 sentences, the best results seem to be achieved by presenting a set of 200 character trigrams and syntactic functions to a Naive Bayes classifier. Work on this task is ongoing, and an approach based on a Dutch emotion keyword list is currently in development.

14.00 - 14.30 Roser Morante, Vincent Van Asch & Walter Daelemans
Comparison of two memory-based approaches to event extraction.


In this talk we compare two memory-based machine learning systems that extract event frames from biomedical texts according to the guidelines of the BioNLP Shared Task 2009. The task consists of finding event triggers and event participants. The main difference between the systems is that one consists of a pipeline of classifiers for finding event triggers and for finding event participants, whereas the other integrates classifiers that learn the event triggers and the event participants jointly. We will show that the second system obtains better overall results, although the first system is more precise.

14.30 - 15.00 Coffee break
15.00 - 15.30 Erwin Marsi
An algorithm for detecting semantic similarity using syntactic tree alignment.


Our goal is to analyze to what extent two Dutch sentences have similar meaning. Our approach is to align their syntactic trees, where each node in the source tree may be aligned to another node in the target tree. In addition, alignments are labeled with a semantic similarity relation. Given the sentences "Arbour ging vanuit de Macedonische hoofdstad Skopje op weg naar Racak" and "Ze was op weg van Macedonie naar Racak", for example, the NP "de Macedonische hoofdstad Skopje" is aligned to the NP "Macedonie" with the relation "Specifies", because the former contains more specific information than the latter.

Over the last two years we have constructed a substantial mono-lingual treebank containing manually aligned syntax trees of similar sentences (the DAESO corpus). This resource allows us to cast the tasks of alignment and labeling as a classification problem. Our algorithm takes the following steps. First, we use TiMBL to exhaustively predict an alignment relation (if any) for every possible pair of source and target nodes. Features employed range from low-level string similarity to dependency relations and lexical-semantic relations. Next, we associate a cost with these predictions: the entropy of the class distribution in the nearest neighbors set. We then proceed to delete alignments so that only one-to-one node alignments are left, while at the same time minimizing the overall alignment cost. This is in fact a well-known global optimization problem, the "assignment problem", for which polynomial-time algorithms such as the "Hungarian algorithm" exist. Finally, we filter alignments above a certain maximum cost.
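
The cost, assignment and filtering steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the cost matrix is square, every source node receives exactly one target node before filtering, and the optimal assignment is found by brute-force search over permutations rather than by the Hungarian algorithm, which is only practical for very small trees. All function names and numbers are invented for the example.

```python
import math
from itertools import permutations


def entropy(class_counts):
    """Entropy (in bits) of a class distribution given as {label: count}."""
    total = sum(class_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts.values() if c > 0)


def best_assignment(cost):
    """One-to-one assignment with minimal total cost (brute force).

    cost[i][j] is the cost of aligning source node i to target node j.
    """
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(enumerate(best))


def filter_alignments(assignment, cost, max_cost):
    """Keep only alignments whose cost does not exceed max_cost."""
    return [(i, j) for i, j in assignment if cost[i][j] <= max_cost]


# A unanimous neighbour set has zero entropy; an even split costs one bit.
certain = entropy({"Equals": 4})
uncertain = entropy({"Equals": 2, "Specifies": 2})

# Toy 3x3 cost matrix (rows: source nodes, columns: target nodes).
cost = [[0.0, 1.5, 1.2],
        [1.4, 0.3, 1.0],
        [1.1, 0.9, 1.3]]
assignment = best_assignment(cost)
kept = filter_alignments(assignment, cost, max_cost=1.0)
```

Here the third source node is still assigned a target in the optimal one-to-one solution, but that alignment is dropped by the final filter because its cost is too high.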

In this talk, we will introduce the corpus data, explain the alignment algorithm, present preliminary results, and discuss remaining issues.

15.30 - 16.00 Els Lefever & Véronique Hoste
SemEval-2: Cross-lingual Word Sense Disambiguation.


We propose a multilingual unsupervised Word Sense Disambiguation (WSD) task for a sample of English nouns. Instead of providing manually sense-tagged examples for each sense of a polysemous noun, our sense inventory is built up on the basis of the Europarl parallel corpus. The multilingual setup involves the translations of a given English polysemous noun in five supported languages, viz. Dutch, French, German, Spanish and Italian.

Organizing this task consists in: (a) the manual creation of a multilingual sense inventory for a lexical sample of English nouns and (b) the evaluation of systems on their ability to disambiguate new occurrences of the selected polysemous nouns. For the creation of the hand-tagged gold standard, all translations of a given polysemous English noun are retrieved in the five languages and clustered by meaning. Human annotators label each instance with the appropriate cluster and their top-3 translations from this cluster. The frequencies of these translations are used to assign weights to all translations in the gold standard. Systems can participate in some of the five bilingual evaluation subtasks and in a multilingual subtask covering all language pairs. To score the system output, we perform a "best" evaluation (where the credit for each correct guess is divided by the number of guesses) and a more "relaxed" evaluation of maximum 10 system guesses (where systems are not penalized for a higher number of guesses).
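
The difference between the two evaluation modes can be illustrated with a small Python sketch for a single test instance. The abstract does not give the official scoring formula, so the normalisation by the total gold weight is an assumption, and the function names, translations and weights are invented for the example.

```python
def best_score(guesses, gold_weights):
    """'Best' evaluation: credit is divided by the number of guesses."""
    if not guesses:
        return 0.0
    credit = sum(gold_weights.get(g, 0) for g in guesses)
    return credit / (len(guesses) * sum(gold_weights.values()))


def relaxed_score(guesses, gold_weights, max_guesses=10):
    """'Relaxed' evaluation: up to max_guesses guesses, no penalty for more guesses."""
    considered = set(guesses[:max_guesses])
    credit = sum(gold_weights.get(g, 0) for g in considered)
    return credit / sum(gold_weights.values())


# Toy gold standard: translations weighted by annotator frequency.
gold = {"oever": 3, "bank": 1, "kant": 1}
guesses = ["oever", "dijk"]

best = best_score(guesses, gold)        # the wrong second guess halves the credit
relaxed = relaxed_score(guesses, gold)  # the wrong second guess costs nothing
```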

16.00 - 16.30 Discussion / Closing


List of participants
  • Sophia Ananiadou (NaCTeM)
  • Vincent Van Asch (CLiPS)
  • Peter Berck (ILK)
  • Antal van den Bosch (ILK)
  • Walter Daelemans (CLiPS)
  • Orphee Declercq (LT3)
  • Isabelle Delaere (LT3)
  • Bart Desmet (LT3)
  • Véronique Hoste (LT3)
  • Steve Hunt (ILK)
  • Mike Kestemont (CLiPS)
  • Els Lefever (LT3)
  • Kim Luyckx (CLiPS)
  • Lieve Macken (LT3)
  • Erwin Marsi (ILK)
  • Guy De Pauw (CLiPS)
  • Martin Reynaert (ILK)
  • Herman Stehouwer (ILK)
  • Dries Tanghe (LT3)
  • Frederik Vaassen (CLiPS)
  • Bram Vandekerckhove (CLiPS)
  • Philip van Oosten (LT3)
  • Klaar Vanopstal (LT3)
  • Sander Wubben (ILK)
  • Menno van Zaanen (ILK)
  • Kalliopi Zervanou (ILK)
