In my PhD project MEMPHIX (memory based paraphrasing using implicit and explicit semantics), I aim to develop a way to generate paraphrases automatically. The ability to paraphrase, that is, to say the same thing in another way, can serve a variety of purposes. It can serve to explain something or to provide feedback in dialogue. Generating shorter paraphrases is useful for subtitles or news feeds. Paraphrasing can also change the register of a text: from formal language to street language, or from old-fashioned prose to present-day language. It might also be valuable for strengthening question answering, dialogue systems and machine translation. In the MEMPHIX project, I am building a system that learns to generate paraphrases from examples. This can be done by training a memory-based or phrase-based machine translation (MT) system on pairs of paraphrases.

While the generation of paraphrases can be driven primarily by surface similarities (leaving semantics completely implicit), explicit semantic information may also play a role: the semantic roles of noun phrases (NPs), the coreference relations between NPs and pronouns, and the semantic relatedness between certain constituents. Such information can be computed automatically through parsing, semantic role labeling and coreference resolution. The project compares the direct, implicit route with the use of explicitly computed semantics.

In my project I have access to a Dutch corpus of 1 million words developed in the DAESO project. It consists of pairs of texts from various domains that express paraphrased, or at least comparable, information. However, the DAESO data alone might not be enough, so the first step in this project is to collect more data (i.e., aligned paraphrases) automatically by mining the web. Headline clusters can be acquired from Google News, and within each cluster paraphrase candidates can be selected using surface similarities. These aligned paraphrases can then be used to train an MT system.
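The selection of paraphrase candidates within a headline cluster can be sketched as follows. This is only a minimal illustration, not the method used in the project: it assumes word-overlap (Jaccard) similarity as the surface measure, and the function names (`tokenize`, `jaccard`, `paraphrase_candidates`) and the thresholds `low` and `high` are hypothetical choices made for the example.

```python
from itertools import combinations


def tokenize(headline):
    """Lowercase a headline and return its set of word tokens."""
    words = {word.strip(".,:;!?\"'()") for word in headline.lower().split()}
    return words - {""}


def jaccard(a, b):
    """Word-overlap (Jaccard) similarity between two token sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def paraphrase_candidates(cluster, low=0.3, high=0.9):
    """Return headline pairs from one cluster whose similarity is in [low, high].

    The lower bound (an assumed value) discards unrelated headlines; the
    upper bound (also assumed) discards near-duplicates, which add little
    as training pairs.
    """
    pairs = []
    for first, second in combinations(cluster, 2):
        score = jaccard(tokenize(first), tokenize(second))
        if low <= score <= high:
            pairs.append((first, second, score))
    # Most similar pairs first.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


if __name__ == "__main__":
    # Made-up headlines standing in for one news cluster.
    cluster = [
        "Police arrest suspect in bank robbery",
        "Suspect arrested in bank robbery",
        "Stock markets fall sharply",
    ]
    for first, second, score in paraphrase_candidates(cluster):
        print(f"{score:.2f}\t{first}\t{second}")
```

A real pipeline would probably need a better surface measure than plain word overlap (for example one that accounts for inflection or word order), but the structure stays the same: score every pair of headlines in a cluster and keep the pairs that are similar without being identical.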

Gezond Verstand

Assertions such as "fire is hot" or "you are likely to find a cow near a farm" are obvious to us, but not to computers. We consider them common sense (in Dutch: gezond verstand), and it can be argued that if we want computers to understand anything, they need common sense as well. This project aims to collect such common sense assertions from children, who are very much engaged with these assumptions, so that we can eventually feed the data to computers. You can find more information on the project's website (in Dutch), or you can have a look at the Open Mind Commons website.

Open Boek

At the National Service for Archaeology, Cultural Landscape and Built Heritage (RACM) I was involved in building a smart archaeological retrieval system called Open Boek, which enables archaeologists to search excavation reports not only by keyword but also by chronology and place. The development of this system is carried out in the RICH project.