What do objects 11551 and 15576 have in common? Finding hidden variables in a textual database

Author(s): Marieke van Erp

Reference: Presented at CLIN 17 - Computational Linguistics in the Netherlands, Leuven, Belgium, January 12, 2007.

Abstract: This experiment aims to find hidden variables in a textual database. The database, from the Dutch National Museum for Natural History, describes where, when, by whom and under what circumstances each animal specimen was found. Most specimens were collected during expeditions, but the database does not record these expeditions explicitly; they can, however, be inferred from the available data. Clustering algorithms are applied to the database to make the expeditions explicit. The experiment investigates the influence of feature modelling, feature selection and the choice of algorithm, with a particular focus on how different representations of textual information affect the clustering process.
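To make the idea concrete, the following is a minimal sketch of how clustering might recover implicit groupings such as expeditions from specimen records. It is not the method used in the experiment: the abstract does not name the algorithms or features. The records, field names, the two text representations (word tokens and character trigrams), the Jaccard similarity and the single-link threshold clustering are all assumptions chosen for illustration.

```python
from itertools import combinations

# Hypothetical specimen records; field names and values are invented for illustration.
records = [
    {"id": 1, "collector": "Smith", "year": 1910, "place": "Sumatra west coast"},
    {"id": 2, "collector": "Smith", "year": 1910, "place": "west coast of Sumatra"},
    {"id": 3, "collector": "Jones", "year": 1955, "place": "Suriname river"},
    {"id": 4, "collector": "Jones", "year": 1955, "place": "upper Suriname river"},
]


def record_text(rec):
    """Flatten a record into one lowercase string."""
    return f"{rec['collector']} {rec['year']} {rec['place']}".lower()


def word_features(rec):
    """Representation 1: the set of whitespace-separated tokens."""
    return set(record_text(rec).split())


def trigram_features(rec):
    """Representation 2: the set of character trigrams."""
    text = record_text(rec)
    return {text[i:i + 3] for i in range(len(text) - 2)}


def jaccard(a, b):
    """Jaccard similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(recs, featurize, threshold):
    """Single-link clustering: link records whose similarity is at least
    `threshold`, then return the connected groups of record ids."""
    feats = {r["id"]: featurize(r) for r in recs}
    parent = {rid: rid for rid in feats}

    def find(x):
        # Follow parent pointers to the group representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(feats, 2):
        if jaccard(feats[a], feats[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for rid in feats:
        groups.setdefault(find(rid), []).append(rid)
    return sorted(sorted(g) for g in groups.values())
```

With these toy records, `cluster(records, word_features, 0.5)` returns `[[1, 2], [3, 4]]`, grouping the records that share a collector, a year and a similarly worded place. Passing `trigram_features` instead changes only the text representation, which is the kind of comparison the abstract describes.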

