Wednesday 15 October

09.00 - 10.00 Registration
10.00 - 10.30 Welcome Coffee & Opening

Chair: Walter Daelemans
10.30 - 11.00 Guy De Pauw & Gilles-Maurice de Schryver Improving the Morphological Analysis of a Swahili Corpus for Lexicographic Purposes

We present the first ever attempt at building a comprehensive data-driven morphological analyzer for a Bantu language. We demonstrate how this can be achieved with relatively little manual effort, and experimental results show that the method compares favourably to a meticulously designed rule-based technique, even when it is trained on the basis of its output. Defining the problem of data-driven morphological analysis on the level of the syllable, rather than on the character level, furthermore shows how techniques typically designed with Indo-European language processing in mind can be adjusted to work for Bantu languages as well.
11.00 - 11.30 Sander Wubben A semantic relatedness metric based on free link structure

A metric for semantic relatedness is presented based on shortest path computation in networks of three kinds: WordNet, ConceptNet, and Wikipedia. While shortest paths in WordNet are known to correlate well with semantic similarity, an is-a hierarchy is less suited for estimating semantic relatedness. We demonstrate this by using two networks with a free link structure, ConceptNet (with typed, directed relations) and Wikipedia (with untyped hyperlinks). Using the Finkelstein-353 dataset, a benchmark set of human-judged semantic relatedness of word pairs, we show that a shortest path metric (based on breadth-first search) run on Wikipedia attains a better correlation than WordNet-based metrics. ConceptNet attains a good correlation as well, but suffers from a low concept coverage.
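
As a rough illustration of the shortest-path idea in this abstract, the following minimal sketch runs a breadth-first search over a small link graph. The toy graph, the function names and the 1 / (1 + distance) scoring are assumptions made for this example only; the abstract does not say how path lengths are turned into relatedness scores.

```python
from collections import deque


def shortest_path_length(graph, source, target):
    """Return the number of links on a shortest path, or None if unreachable."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def relatedness(graph, a, b):
    # Hypothetical scoring: shorter paths score higher, unreachable pairs score 0.
    dist = shortest_path_length(graph, a, b)
    return 0.0 if dist is None else 1.0 / (1.0 + dist)


# Toy, made-up link graph (not real Wikipedia or ConceptNet data).
toy_links = {
    "cat": ["pet", "mammal"],
    "dog": ["pet", "mammal"],
    "pet": ["cat", "dog"],
    "mammal": ["cat", "dog", "whale"],
    "whale": ["mammal", "ocean"],
    "ocean": ["whale"],
}

print(relatedness(toy_links, "cat", "dog"))
print(relatedness(toy_links, "cat", "ocean"))
```

Breadth-first search visits nodes in order of increasing distance, so the first time the target is reached the path is guaranteed to be a shortest one in an unweighted graph.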
11.30 - 12.00 Roser Morante, Anthony Liekens & Walter Daelemans Learning the Scope of Negation in Biomedical Texts

In this talk we present a machine learning system that finds the scope of negation in biomedical texts. The system consists of two memory-based engines, one that decides if the tokens in a sentence are negation signals, and another that finds the full scope of these negation signals. Our approach to negation detection differs in two main aspects from existing research on negation. First, we focus on finding the scope of negation signals, instead of determining whether a term is negated or not. Second, we apply supervised machine learning techniques, whereas most existing systems apply rule-based algorithms. As far as we know, this way of approaching the negation scope finding task is novel.
12.00 - 13.30 Lunch

Chair: Véronique Hoste
13.30 - 14.00 Els Lefever, Lieve Macken & Véronique Hoste Language independent bilingual terminology extraction from a parallel corpus

We present a language independent terminology extraction module that is based on a sub-sentential alignment system that links linguistically motivated phrases in parallel texts. Statistical filters are applied on the bilingual list of candidate terms that is extracted from the alignment output. We investigate the performance of both the alignment and terminology extraction module for three different language pairs (French-English, French-Italian, French-Dutch) and highlight the specific problems for each language pair (e.g. morphological decomposition that increases the performance of the French-Dutch term extraction). Comparison with standard terminology extraction programs shows encouraging results and reveals that the linguistically based alignment module is particularly well suited for the extraction of complex multiword terms.
14.00 - 14.30 Erik Tjong Kim Sang Shallow Parsing and Full Parsing: Which is the Better Preprocessor?

We compare two processing methods for a single natural language processing task. One applies shallow parsing techniques while the other uses a full dependency parser. We show that for the task under investigation, hypernym extraction from text, the former allows for better performance. We compare the output of the two approaches and look for an explanation for this unexpected result.
14.30 - 15.00 Iris Hendrickx Re-Using Sources for Multi-Document Summarization

In this talk I discuss our participation in the TAC 2008 Summarization Track. We took part in the update task. The update task assumes that a user has already read several articles about a certain topic. The aim is to produce short (100-word) update summaries that contain only the new information not yet known by the user. The core of the presented system is the graph-based system developed by Wauter Bosma for DUC 2006. We present an improved version which uses coreference resolution and sentence compression as added sources.
15.00 - 15.30 Coffee break

Chair: Erwin Marsi
15.30 - 16.00 Antal van den Bosch Memory-based word compl

Word completion, the prediction of a word when a number of its characters have already been given, is typically employed in text editing software operated in circumstances where an above-average amount of effort is required for typing, either because the entry medium is limited (e.g. mobile phone keypads) or because the user is less able to type than average. A successful word completion algorithm is able to predict the intended word as soon as possible, so that the user can press an "accept" button and thereby save keystrokes. Word completion is typically evaluated by measuring keystroke savings in simulation experiments. We present a series of experiments in which we explore two issues: (1) the role of context beyond the characters of the word currently being entered, and (2) the relation between training and test data. Context is shown to boost savings, but it takes some feature engineering to limit the memory footprint of the algorithm. With respect to training and test data, a strong boosting effect is shown when training and test data are sampled from the same source. Results are presented both for normal QWERTY keyboards and mobile phone keypads.
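
To make the keystroke-savings measure mentioned in this abstract concrete, here is a minimal simulation sketch. It uses a plain prefix-frequency baseline, not the memory-based method of the talk, and it ignores context; the training sentence, the function names and the cost of one keystroke for "accept" are assumptions for illustration.

```python
from collections import Counter


def predict(prefix, freqs):
    """Most frequent known word starting with prefix (ties broken alphabetically)."""
    candidates = [w for w in freqs if w.startswith(prefix)]
    if not candidates:
        return None
    return min(candidates, key=lambda w: (-freqs[w], w))


def keystrokes_needed(word, freqs):
    # Type one character at a time and accept as soon as the prediction is right.
    for k in range(1, len(word)):
        if predict(word[:k], freqs) == word:
            return k + 1  # k typed characters plus one "accept" keystroke
    return len(word)  # never predicted in time: the whole word is typed


def keystroke_savings(test_words, freqs):
    """Fraction of keystrokes saved relative to typing every character."""
    baseline = sum(len(w) for w in test_words)
    used = sum(keystrokes_needed(w, freqs) for w in test_words)
    return 1.0 - used / baseline


# Made-up training text, for illustration only.
train = "the cat sat on the mat the cat completed the completion".split()
freqs = Counter(train)

print(keystroke_savings(["the", "cat", "completion"], freqs))
```

In this toy setting the frequent short words are predicted after one character, while "completion" is only predicted once the prefix rules out "completed", which shows why additional context can increase savings.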
16.00 - 16.30 Timur Fayruzov, Martine De Cock, Chris Cornelis & Véronique Hoste Features for Protein Interaction Extraction: is More Always Better?

To improve the quality of protein interaction extraction, many approaches employ not only lexical data (i.e., the text itself), but also a lot of additional information, such as the output of shallow and full parsers. This information usually includes part-of-speech tags (shallow syntactic information) as well as parse and dependency trees (deep syntactic information). However, there have been very few attempts to evaluate the impact of different types of features for this task. In our research, we take a state-of-the-art approach that utilizes a support vector machine with a tree kernel, and derive from it several other methods that take into account only subsets of the initial feature set.
17.30 Dinner

Thursday 16 October

10.00 - 10.30 Morning Coffee

Chair: Antal van den Bosch
10.30 - 11.30 Invited talk by Khalil Sima'an How Memory Should be Used: On Well-Behaved Statistical Models of Translation and Parsing

In this talk I will briefly review some of our recent work on phrase-based statistical machine translation and try to relate it to our earlier work on memory-based models of parsing. In particular, the talk will look at improved statistical estimation for phrase-based SMT and discuss how it relates to consistent statistical estimation for data-oriented parsing. We will consider the modeling issues and motivate the estimation of these models by statistical memory-based/nonparametric methods (mainly smoothing). The nature of the talk will be mostly intuitive, which I hope will allow us to see and discuss possible connections to earlier work on statistical smoothing using memory-based learning.
11.30 - 12.00 Kim Luyckx & Walter Daelemans Authorship Attribution and Verification with Many Authors and Limited Data

Most studies in statistical or machine learning based authorship attribution focus on two or a few authors. This leads to an overestimation of the importance of the features extracted from the training data and found to be discriminating for these small sets of authors. Most studies also use sizes of training data that are unrealistic for situations in which stylometry is applied (e.g., forensics), and thereby overestimate the accuracy of their approach in these situations. A more realistic interpretation of the task is as an authorship verification problem that we approximate by pooling data from many different authors as negative examples. In this paper, we show, on the basis of a corpus with 145 authors, what the effect is of many authors on feature selection and learning, and show robustness of a memory-based learning approach in doing authorship attribution and verification with many authors and limited training data when compared to eager learning methods such as SVMs and maximum entropy learning.
12.00 - 13.30 Lunch

Chair: Martin Reynaert
13.30 - 14.00 Erwin Marsi Sentence alignment in comparable text using shallow features

DAESO (Detecting And Exploiting Semantic Overlap) is a Stevin project which involves the collection and annotation of a large corpus of parallel/comparable text (>500k words, Dutch) from various sources (from book translations to press releases about the same news topic). All text material is preprocessed, parsed, aligned at the sentence level, and finally aligned at the fine-grained level of syntactic nodes. In this talk, I will focus on the subtask of sentence alignment for comparable text. The main question is: how far can we get with only shallow textual features? I will present experimental results covering a range of features and metrics in combination with simple thresholding. The suggested answer is: surprisingly far, although the approach is inherently limited, as I will show by means of examples.
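
The combination of a shallow feature with simple thresholding can be sketched as follows. This is only an assumed example: the abstract does not name the features or metrics used, so word-overlap (Jaccard) similarity, the threshold of 0.5, the function names and the toy English sentences are all illustrative choices.

```python
def jaccard(sentence_a, sentence_b):
    """Word-overlap similarity between two sentences (lowercased, whitespace tokens)."""
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def align(source_sentences, target_sentences, threshold=0.5):
    """Return (source index, target index, score) for every pair at or above the threshold."""
    pairs = []
    for i, src in enumerate(source_sentences):
        for j, tgt in enumerate(target_sentences):
            score = jaccard(src, tgt)
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


# Made-up sentences standing in for two comparable texts.
text_a = ["the minister resigned on monday", "rain is expected tomorrow"]
text_b = ["on monday the minister resigned", "the match was cancelled"]

print(align(text_a, text_b))
```

A measure like this rewards reordered sentences that share their vocabulary but gives no credit to paraphrases that use different words, which is one way in which a purely shallow approach is limited.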
14.00 - 14.30 Marieke van Erp Employing Wikipedia to ontologise domain specific information

An approach is presented that utilises the content and link structure of Wikipedia to build domain-specific ontologies. Since Wikipedia is a growing, community-maintained resource, it is richer in domain-specific information than other encyclopaedic resources. Another advantage of Wikipedia is its link structure, which provides useful clues on the relatedness of concepts; this is exploited to narrow down the search space for semantic relations. Once snippets from Wikipedia pages that may contain information on the type of relation between two concepts are identified, these snippets are parsed by a shallow parser to extract relations. Results are evaluated by two human experts.
14.30 - 15.00 Coffee Break

Chair: Guy De Pauw
15.00 - 15.30 Véronique Hoste, Klaar Vanopstal, Els Lefever & Isabelle Delaere Classification-based scientific term detection in patient information

Despite the legislative efforts to improve the readability of patient information, different surveys have shown that respondents still feel distressed by reading the patient information leaflet. One of the main sources of distress is the use of scientific terminology. In order to assess the scale of the problem, we collected a Dutch-English parallel corpus of European Public Assessment Reports (EPARs) which was annotated by two linguists. This corpus was used to evaluate and train an automatic approach to scientific term detection. As an alternative to the dictionary-based approaches, which suffer from low coverage, we present a classification-based approach relying on a wide variety of information sources, such as local context and lexical information, termhood and unithood information, cognate identification and morphological information. We present the experimental findings and show some first results of the automatic replacement of the detected terms by their popular counterpart.
15.30 - 16.00 Menno van Zaanen Machine learning with sequences

In the field of natural language processing, we often deal with problems that involve sequences of symbols, such as sentences. On the other hand, in the field of machine learning, many problems are treated as classification problems. A classifier takes an event and assigns a (pre-defined) class to it. When we combine these two types of problems, i.e. when events are in the form of sequences, the classification may be based on aspects of the structure of these events. For instance, the fact that certain symbols co-occur in a sequence may be an indication that these sequences belong to a certain class. In this talk, I will describe some research I have been doing in this field recently. This work is still in progress. I will show some results on the task of question classification (assign the type of answer to a question), and composer classification (given a musical piece, assign its composer).
16.00 - 16.30 Lieve Macken & Walter Daelemans Aligning complex translational correspondences via bootstrapping

In this talk, we describe how we try to align more complex translational correspondences via bootstrapping. We start from a sentence-aligned parallel corpus in which straightforward translational correspondences have been aligned on the basis of lexical clues and syntactic similarity. In a first step, candidate rules are extracted from all sentence pairs that only contain 1:1, 1:n and n:1 unlinked chunks. In a second step, the rules are applied on the whole training corpus, resulting in new sentence pairs containing 1:1, 1:n and n:1 unlinked chunks. The bootstrapping process is repeated several times.

List of participants

CNTS (UA) ILK (UvT) Language Technology and Computational Intelligence (Associatie UGent) UGroningen UVA
