Sander Canisius Research page

Machine learning

Underlying all my work is a strong foundation of machine learning. The beauty of the machine learning approach to natural language processing is that a single general-purpose learning algorithm can perform tasks as diverse as part-of-speech tagging, named-entity recognition, and grapheme-to-phoneme conversion, as well as many tasks outside natural language processing. Improvements to the learning algorithm are immediately reflected in the performance on all of these tasks.

In my research, I primarily focus on machine learning approaches to structured classification problems. Many general-purpose machine learning algorithms expect an input space consisting of vectors of feature-value pairs, and an output space consisting of simple class labels. In structured classification, both the input and the output space may have a more complex structure. Part-of-speech tagging, for example, is a task where the input consists of sentences, i.e. variable-length sequences of tokens, and the output is a sequence of part-of-speech labels. While part-of-speech tagging can be mapped to a traditional feature-value representation, for instance by describing each token with a fixed-size window of its neighbours, valuable information may be lost in the process.
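As an illustration of such a mapping, the sketch below turns a sentence into one fixed-length instance per token using a sliding window. This is a minimal example of a common windowing approach, not necessarily the exact representation used in my experiments; the function name `window_features`, the padding symbol `_`, and the window size are assumptions chosen for the example.

```python
def window_features(tokens, size=1, pad="_"):
    """Turn one sentence into one fixed-length feature vector per token."""
    # Pad both ends so that the first and last tokens also get a full window.
    padded = [pad] * size + list(tokens) + [pad] * size
    # Each instance holds the focus token plus `size` neighbours on each side.
    return [tuple(padded[i:i + 2 * size + 1]) for i in range(len(tokens))]


sentence = ["The", "old", "man", "the", "boats"]
instances = window_features(sentence, size=1)

for instance in instances:
    print(instance)
```

Every instance has the same length regardless of sentence length, which is what a general-purpose learner requires. Anything outside the window, such as a decisive word several positions away, is invisible to the classifier; this is the kind of information loss referred to above.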

Special-purpose sequence labelling techniques aim to preserve as much global sequential context as possible. By avoiding the translation of sequences into a fixed-length feature-value format, these techniques promise to outperform more general-purpose machine learners. My work covers both the development of new sequence labelling methods and the evaluation of existing ones, with the aim of gaining a more complete insight into their strengths and weaknesses.
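To make the difference between local and global decisions concrete, the sketch below uses Viterbi decoding, one well-known way of choosing a label sequence as a whole. It is shown purely as an illustration and is not the method developed in the publications listed on this page. The labels, the toy scores, and the function name `viterbi` are all assumptions made up for the example.

```python
def viterbi(emission_scores, transition_scores, labels):
    """Return the label sequence with the highest total score.

    emission_scores: list of dicts, one per token, mapping label -> score
    transition_scores: dict mapping (previous_label, label) -> score
    """
    # best[i][y] is the best score of any labelling of tokens 0..i ending in y.
    best = [{y: emission_scores[0][y] for y in labels}]
    back = []
    for scores in emission_scores[1:]:
        column, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transition_scores[(p, y)])
            column[y] = best[-1][prev] + transition_scores[(prev, y)] + scores[y]
            pointers[y] = prev
        best.append(column)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    label = max(labels, key=lambda y: best[-1][y])
    path = [label]
    for pointers in reversed(back):
        label = pointers[label]
        path.append(label)
    return path[::-1]


# Toy scores (hypothetical): two labels, three tokens.
labels = ["N", "V"]
emissions = [
    {"N": 2.0, "V": 0.0},
    {"N": 1.0, "V": 0.9},
    {"N": 0.0, "V": 0.5},
]
transitions = {
    ("N", "N"): -1.0, ("N", "V"): 0.5,
    ("V", "N"): 0.5, ("V", "V"): -1.0,
}

# Labelling each token in isolation ignores the transition scores.
greedy = [max(scores, key=scores.get) for scores in emissions]
best_sequence = viterbi(emissions, transitions, labels)

print(greedy)         # ['N', 'N', 'V']
print(best_sequence)  # ['N', 'V', 'N']
```

With these toy scores the token-by-token choice and the globally best sequence disagree on two of the three positions, because the sequence-level decision also accounts for how well adjacent labels fit together.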

Related software

Python language binding for the Tilburg Memory-Based Learner (TiMBL)

Related publications

Sander Canisius, Antal van den Bosch, and Walter Daelemans (2006)
Constraint Satisfaction Inference: Non-probabilistic Global Inference for Sequence Labelling
In Proceedings of the EACL 2006 Workshop on Learning Structured Information in Natural Language Applications, Trento, April 2006.

Sander Canisius, Antal van den Bosch, and Walter Daelemans (2005)
Rule meta-learning for trigram-based sequence processing
In J. Cussens and C. Nedellec (Eds.), Proceedings of the Fourth Learning Language in Logic Workshop, pp. 3-10, Bonn, August 2005.