Sander Canisius Research page

Machine learning

Underlying all my work is a strong foundation of machine learning. The beauty of the machine learning approach to natural language processing is that a single general-purpose learning algorithm can perform tasks as diverse as part-of-speech tagging, named-entity recognition, and grapheme-to-phoneme conversion (as well as many tasks outside natural language processing). Improvements to the learning algorithm therefore carry over directly to performance on all of these tasks.

In my research, I primarily focus on machine learning approaches to structured classification problems. Many general-purpose machine learning algorithms expect an input space consisting of vectors of feature-value pairs, and an output space consisting of simple class labels. In structured classification, both the input and the output space may have a more complex structure. For example, in part-of-speech tagging the input consists of sentences, i.e. variable-length sequences of tokens, and the output is a sequence of part-of-speech labels. While part-of-speech tagging can be mapped to a traditional feature-value representation, valuable information may be lost in the process.
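One common way to perform such a mapping is a sliding window over the tokens. The Python sketch below is purely illustrative: the function name `windowed_instances`, the padding symbol, the window size, and the example tags are all assumptions made for this example, not the encoding of any particular tool.

```python
from typing import List, Sequence, Tuple

# Hypothetical padding symbol for window positions outside the sentence.
PAD = "_"

Instance = Tuple[Tuple[str, ...], str]


def windowed_instances(
    tokens: Sequence[str],
    tags: Sequence[str],
    left: int = 1,
    right: int = 1,
) -> List[Instance]:
    """Turn one tagged sentence into fixed-length (features, label) instances.

    Each instance holds the focus token plus `left` and `right` neighbouring
    tokens as features, and the tag of the focus token as its class label.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    # Pad both ends so every token has a full window around it.
    padded = [PAD] * left + list(tokens) + [PAD] * right

    instances = []
    for i, tag in enumerate(tags):
        # Position i in `tokens` sits at position i + left in `padded`,
        # so its window starts at i and spans left + 1 + right items.
        features = tuple(padded[i : i + left + 1 + right])
        instances.append((features, tag))
    return instances


sentence = ["The", "old", "man", "the", "boats"]
tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
instances = windowed_instances(sentence, tags)

for features, label in instances:
    print(features, "->", label)
```

Every sentence, whatever its length, becomes a set of instances with exactly three features each. The price is that each instance is classified independently of its neighbours: the window shows nearby tokens, but not the labels assigned to them, so dependencies between adjacent output labels are no longer visible to the learner.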

Special-purpose sequence labelling techniques aim to preserve as much global sequential context as possible. By avoiding the translation of sequences into a fixed-length feature-value format, these techniques promise to be superior to more general-purpose machine learners. My work covers both the development of new sequence labelling methods and the evaluation of existing ones, with the aim of gaining a more complete insight into their strengths and weaknesses.


Related software

python-timbl
Python language binding for the Tilburg Memory-Based Learner (TiMBL)

Related publications

Sander Canisius, Antal van den Bosch, and Walter Daelemans (2006)
Constraint Satisfaction Inference: Non-probabilistic Global Inference for Sequence Labelling
In Proceedings of the EACL 2006 Workshop on Learning Structured Information in Natural Language Applications, Trento, April 2006.
Sander Canisius, Antal van den Bosch, and Walter Daelemans (2005)
Rule meta-learning for trigram-based sequence processing
In J. Cussens and C. Nedellec (Eds.), Proceedings of the Fourth Learning Language in Logic Workshop, pp. 3-10, Bonn, August 2005.