Bootstrapping Multilingual Geographical Gazetteers from Corpora.

Author(s): Marieke van Erp

Reference: Proceedings of the 11th ESSLLI Student Session, Malaga, Spain, 31 July - 11 August 2006.

Abstract: This paper presents an approach to automatically generating multilingual gazetteers of geographical names via two bootstrapping loops over different corpora. First, a small seed list of geographical names is matched against an unannotated dataset in one language to generate training data for a memory-based classifier, and memory-based learning is applied to extend the gazetteer. The extended gazetteer is then carried over to a second language by matching it against a corpus in that language. Training data for a classifier is again generated, and the bootstrapping process is repeated to extend the gazetteer further. The process resembles co-training, in which information from other sources is introduced to enhance classification. To evaluate the algorithm, the initial seed list and the final gazetteer were each matched against three datasets with manually annotated geographical entities, and the difference between the two was estimated.
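
The two bootstrapping loops described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy corpora, the choice of English and Dutch, the two-word context features, and the rule that only capitalised, non-sentence-initial tokens are candidates are all assumptions made for illustration. A small nearest-neighbour classifier using feature overlap stands in for the memory-based learner.

```python
def context(tokens, i):
    """Features for token i: lower-cased previous and next word (assumed feature set)."""
    prev_tok = tokens[i - 1].lower() if i > 0 else "<s>"
    next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
    return (prev_tok, next_tok)


def build_memory(corpus, gazetteer):
    """Label every token by gazetteer lookup to create (noisy) training instances."""
    memory = []
    for sentence in corpus:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            memory.append((tok, context(tokens, i), tok in gazetteer))
    return memory


def is_geographical(token, features, memory):
    """Nearest-neighbour vote: instances with the highest feature overlap decide."""
    best, pos, neg = -1, 0, 0
    for other, other_features, label in memory:
        if other == token:
            continue  # leave out the candidate's own instances
        overlap = sum(a == b for a, b in zip(features, other_features))
        if overlap > best:
            best, pos, neg = overlap, 0, 0  # closer neighbours found, restart the vote
        if overlap == best:
            if label:
                pos += 1
            else:
                neg += 1
    return pos > neg  # ties count as "not geographical"


def bootstrap(corpus, seeds, max_rounds=10):
    """Repeat match -> train -> classify until no new names are found."""
    gazetteer = set(seeds)
    for _ in range(max_rounds):
        memory = build_memory(corpus, gazetteer)
        new_names = {
            tok
            for tok, features, label in memory
            if not label
            and tok[0].isupper()
            and features[0] != "<s>"  # skip sentence-initial capitals
            and is_geographical(tok, features, memory)
        }
        if not new_names:
            break
        gazetteer |= new_names
    return gazetteer


# Toy corpora (invented for this sketch).
english_corpus = [
    "She lives in Amsterdam today",
    "He lives in Utrecht today",
    "They work in Paris now",
    "He met Peter yesterday",
]
dutch_corpus = [
    "Zij woont in Amsterdam sinds mei",
    "Hij woont in Rotterdam sinds mei",
    "Hij sprak Pieter gisteren",
]

# Loop 1: extend the seed list on the first-language corpus.
english = bootstrap(english_corpus, {"Amsterdam"})
# Loop 2: cross over by matching the extended gazetteer to the second corpus.
multilingual = bootstrap(dutch_corpus, english)
print(sorted(multilingual))
```

In this toy setting the cross-over works because some names ("Amsterdam") are spelled identically in both corpora, so the extended gazetteer yields labelled contexts in the second language from which further names can be classified.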
