ILK - Publications with abstracts 

2000 | 1999 | 1998 | 1997 | 1996 | 1995 | 1994 | 1993 | 1992

2000 

ILK-0002

Integrating seed names and n-grams for a named entity list and classifier

Author(s): Sabine Buchholz and Antal van den Bosch
Reference: In: Proceedings of LREC-2000, Athens, Greece, June 2000, pp. 1215-1221.

[Postscript, with corrections]

We present a method for building a named-entity list and machine-learned named-entity classifier from a corpus of Dutch newspaper text, a rule-based named-entity recognizer, and labeled seed name lists taken from the internet. The seed names, labeled either as PERSON, LOCATION, ORGANIZATION, or ADJECTIVAL name, are looked up in an 83-million-word corpus, and their immediate contexts are stored as instances of their label. These context 8-grams are used by a decision-tree learning algorithm that, after training, (i) can produce high-precision labeling of instances to be added to the seed lists, and (ii) more generally labels new, unseen names. Unlabeled named-entity types are labeled with a precision of 61% and a recall of 56%; when optimizing for precision, an overall precision of 83% can be obtained (a top precision of 88% on PERSON). On free text, named-entity token labeling accuracy is 71%.
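
The instance-extraction step described above admits a minimal sketch, under simplifying assumptions (single-token seed names, whitespace tokenization; the function and data below are illustrative, not the authors' code):

    def collect_contexts(tokens, seeds, width=4):
        # For each occurrence of a seed name, store its context window
        # (width tokens left + width right, i.e. an 8-gram for width=4)
        # as a training instance labeled with the seed's class.
        instances = []
        for i, token in enumerate(tokens):
            label = seeds.get(token)
            if label is None:
                continue
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            left = ["_"] * (width - len(left)) + left       # pad at edges
            right = right + ["_"] * (width - len(right))
            instances.append((left + right, label))
        return instances

    seeds = {"Amsterdam": "LOCATION", "Philips": "ORGANIZATION"}   # toy seed list
    tokens = "de burgemeester van Amsterdam opende het Philips museum".split()
    for features, label in collect_contexts(tokens, seeds):
        print(label, features)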

ILK-0004

Unpacking multi-valued symbolic features and classes in memory-based language learning

Author(s): Antal van den Bosch and Jakub Zavrel
Reference: In P. Langley (Ed.), Proceedings of the Seventeenth International Conference on Machine Learning, pp. 1055-1062. San Francisco, CA: Morgan Kaufmann, 2000.

[Postscript]

In supervised machine-learning applications to natural language processing, tasks are typically formulated as classification tasks mapping multi-valued features to multi-valued classes. Memory-based or instance-based learning algorithms are suited for such representations, but they are not restricted to them; both features and classes may be unpacked into binary values. We demonstrate in a matrix of empirical tests on a range of natural language learning tasks that when using k=1 in the k-NN classifier kernel, binary unpacking of features and classes tends to be harmful to generalization accuracy. Unpacking features and classes causes the kernel classifier to rely on smaller sets of nearest neighbors, which generally leads to more misclassifications; only when the data is not sparse in the multi-valued case (when the average number of equidistant nearest neighbors is well above a handful) can unpacking lead to improved generalization accuracy.
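
To make the terminology concrete, here is a minimal sketch of "unpacking" (my illustration, with made-up data): each multi-valued symbolic feature becomes a bank of binary indicator features, one per attested value.

    def unpack(instances):
        # Enumerate every (position, value) pair seen in the data ...
        values = sorted({(j, v) for feats, _ in instances
                                for j, v in enumerate(feats)})
        index = {fv: k for k, fv in enumerate(values)}
        # ... and turn each instance into a 0/1 indicator vector.
        unpacked = []
        for feats, label in instances:
            vec = [0] * len(index)
            for j, v in enumerate(feats):
                vec[index[(j, v)]] = 1
            unpacked.append((vec, label))
        return unpacked

    data = [(("the", "NN"), "B-NP"), (("a", "NN"), "B-NP"), (("ran", "VB"), "O")]
    for vec, label in unpack(data):
        print(label, vec)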

ILK-0006

A distributed, yet symbolic model of text-to-speech processing

Author(s): Antal van den Bosch and Walter Daelemans
Reference: In P. Broeder and J.M.J. Murre (Eds.), Models of Language Acquisition: inductive and deductive approaches. Oxford University Press, 76-99, 2000.

[Postscript of preprint]

In this paper, a data-oriented model of text-to-speech processing is described. On the basis of a large text-to-speech corpus, the model automatically gathers a distributed, yet symbolic representation of subword-phoneme association knowledge, representing this knowledge in the form of paths in a decision tree. Paths represent context-sensitive rewrite rules which unambiguously map strings of letters onto single phonemes. The more ambiguous the mapping is, the larger the stored context. The knowledge needed for converting a spelling word to its phonemic transcription is thus represented in a distributed fashion: many different paths contribute to the phonemisation of a word, and a single path may contribute to phonemisations of many words. Some intrinsic properties of the data-oriented model are shown to have relations with psycholinguistic concepts such as a language's orthographic depth, and word pronunciation consistency.  
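
The "larger context for more ambiguous mappings" idea can be sketched roughly as follows (a hypothetical reconstruction: the pattern table and function are mine, and the real system stores such patterns as decision-tree paths rather than a flat dictionary):

    def minimal_context(patterns, word, i, max_width=5):
        # Widen the window around letter i until the training data maps
        # the resulting pattern to exactly one phoneme.
        for w in range(max_width + 1):
            pat = (word[max(0, i - w):i], word[i], word[i + 1:i + 1 + w])
            phonemes = patterns.get(pat, set())
            if len(phonemes) == 1:
                return pat, next(iter(phonemes))
        return None

    patterns = {
        ("", "c", ""): {"k", "s"},   # 'c' alone is ambiguous ...
        ("", "c", "e"): {"s"},       # ... but 'c' before 'e' maps to /s/
    }
    print(minimal_context(patterns, "cell", 0))   # (('', 'c', 'e'), 's')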

1999 

ILK-9902

Forgetting exceptions is harmful in language learning

Author(s): Walter Daelemans, Antal van den Bosch, and Jakub Zavrel.
Reference: Machine Learning, special issue on natural language learning, 34, pp. 11-43, 1999.

Preprint postscript

We show that in language learning, contrary to received wisdom, keeping exceptional training instances in memory can be beneficial for generalization accuracy. We investigate this phenomenon empirically on a selection of benchmark natural language processing tasks: grapheme-to-phoneme conversion, part-of-speech tagging, prepositional-phrase attachment, and base noun phrase chunking. In a first series of experiments we combine memory-based learning with training set editing techniques, in which instances are edited based on their typicality and class prediction strength. Results show that editing exceptional instances (with low typicality or low class prediction strength) tends to harm generalization accuracy. In a second series of experiments we compare memory-based learning and decision-tree learning methods on the same selection of tasks, and find that decision-tree learning often performs worse than memory-based learning. Moreover, the decrease in performance can be linked to the degree of abstraction from exceptions (i.e., pruning or eagerness). We provide explanations for both results in terms of the properties of the natural language processing tasks and the learning algorithms.
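
One of the editing criteria, class prediction strength, can be sketched as follows (a crude approximation rather than the exact definition used in the paper: here, the fraction of an instance's own nearest neighbours that share its class):

    def overlap(a, b):
        # Number of feature positions on which two instances agree.
        return sum(x == y for x, y in zip(a, b))

    def prediction_strength(data, i, k=3):
        # Among instance i's k nearest neighbours (overlap similarity),
        # the fraction sharing its class. Editing removes instances
        # scoring below some threshold.
        feats_i, label_i = data[i]
        neighbours = sorted((j for j in range(len(data)) if j != i),
                            key=lambda j: -overlap(feats_i, data[j][0]))[:k]
        return sum(data[j][1] == label_i for j in neighbours) / k

    data = [(("a", "b"), "X"), (("a", "b"), "X"),
            (("a", "c"), "X"), (("a", "b"), "Y")]
    print(prediction_strength(data, 3))   # 0.0: a lone exception

The paper's point is that discarding such low-strength instances, intuitive as it may seem, tends to hurt accuracy on language data.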

ILK-9903

Recent Advances in Memory-Based Part-of-Speech Tagging

Author(s): Jakub Zavrel and Walter Daelemans.
Reference: VI Simposio Internacional de Comunicacion Social, Santiago de Cuba, pp. 590-597, 1999.

Postscript, MS-Word

Memory-based learning algorithms are lazy learners. Examples of a task are stored in memory and processing is largely postponed to the time when new instances of the task need to be solved. This is then done by extrapolating directly from those remembered instances which are most similar to the present ones. Using memory-based learning for Part-of-Speech tagging has a number of advantages over traditional statistical POS taggers: (i) there is no need for an additional smoothing component for sparse data, (ii) even low-frequency or exceptional patterns can contribute to generalization, (iii) the use of a weighted similarity metric allows for an easy integration of different information sources, and (iv) both development time and processing speed are very fast (on the order of hours and thousands of words/sec, respectively). In recent work, we have applied the Memory-Based tagger (MBT) to a number of different languages and corpora (English, Dutch, Czech, Swedish, and Spanish). Furthermore, we have performed a controlled experimental comparison of MBT with several other POS tagging algorithms.
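
The classification step behind (iii) can be sketched in a few lines (an illustrative 1-NN classifier with weighted overlap similarity; the data and weights below are invented):

    def classify(memory, weights, query):
        # Return the tag of the stored instance most similar to the
        # query, where similarity is the sum of per-position weights
        # (e.g. information gain) over matching feature values.
        def sim(feats):
            return sum(w for w, a, b in zip(weights, feats, query) if a == b)
        return max(memory, key=lambda inst: sim(inst[0]))[1]

    memory = [(("the", "dog", "barks"), "VBZ"),
              (("a", "dog", "bit"), "VBD")]
    weights = (0.2, 0.5, 1.0)
    print(classify(memory, weights, ("the", "cat", "barks")))   # VBZ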

ILK-9904

Interpreting knowledge representations in BP-SOM

Author(s): Ton Weijters and Antal van den Bosch.
Reference: Behaviormetrika, 26:1, pp. 107-128, 1999.

Artificial Neural Networks (ANNs) are able, in general and in principle, to learn complex tasks. Interpretation of models induced by ANNs, however, is often extremely difficult due to the non-linear and non-symbolic nature of the models. To enable better interpretation of the way knowledge is represented in ANNs, we present BP-SOM, a neural network architecture and learning algorithm. BP-SOM is a combination of a multi-layered feed-forward network (MFN) trained with the back-propagation learning rule (BP), and Kohonen's self-organising maps (SOMs). The involvement of the SOM in learning leads to highly structured knowledge representations both at the hidden layer and on the SOMs. We focus on a particular phenomenon within trained BP-SOM networks, viz. that the SOM part acts as an organiser of the learning material into instance subsets that tend to be homogeneous with respect to both class labelling and subsets of attribute values. We show that the structured knowledge representation can either be exploited directly for rule extraction, or be used to explain a generic type of checksum solution found by the network for learning M-of-N tasks.

ILK-9906

Toward an exemplar-based computational model for cognitive grammar

Author(s): Walter Daelemans.
Reference: In Johan van der Auwera, Frank Durieux, and Ludo Lejeune (Eds.) English as a Human Language. To honour Louis Goossens. Munchen: LINCOM Europa, 73-82, 1998.

An exemplar-based computational framework is presented which is compatible with Cognitive Grammar. In an exemplar-based approach, language acquisition is modeled as the incremental, data-oriented storage of experiential patterns, and language performance as the extrapolation of information from those stored patterns on the basis of a language-independent information-theoretic similarity metric. We show that this simple architecture works for many aspects of phonological, morphological, and morphosyntactic acquisition and processing. Furthermore, we sketch how the approach may also work for syntactic processing. A central insight of the approach, based on the results of computational modeling experiments, is that abstraction of representations is not only unnecessary to achieve generalization (i.e. to make the system productive, and to make it go `beyond' the learned patterns), but even harmful, and that useful language-independent metrics can be found for defining similarity in the context of language processing.

ILK-9907

Memory-Based Shallow Parsing

Author(s): Walter Daelemans, Sabine Buchholz, Jorn Veenstra.
Reference: To appear in: Proceedings of CoNLL-99, Bergen, Norway, June 12, 1999.

Postscript

We present a memory-based learning (MBL) approach to shallow parsing in which POS tagging, chunking, and identification of syntactic relations are formulated as memory-based modules. The experiments reported in this paper show competitive results; the F-beta values for the Wall Street Journal (WSJ) treebank are 93.8% for NP chunking, 94.7% for VP chunking, 77.1% for subject detection, and 79.0% for object detection.
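
For reference, the F-beta measure combines precision P and recall R, with beta = 1 weighing them equally; a one-line sketch (the P/R figures in the example are the NP-chunk scores reported under ILK-9807 below):

    def f_beta(precision, recall, beta=1.0):
        # Harmonic-style combination of precision and recall.
        return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)

    print(round(f_beta(0.890, 0.943), 3))   # 0.916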

ILK-9908

Cascaded Grammatical Relation Assignment

Author(s): Sabine Buchholz, Jorn Veenstra, Walter Daelemans.
Reference: To appear in: Proceedings of EMNLP/VLC-99, University of Maryland, USA, June 21-22, 1999.

Postscript

In this paper we discuss cascaded memory-based grammatical relation assignment. In the first stages of the cascade, we find chunks of several types (NP, VP, ADJP, ADVP, PP) and label them with their adverbial function (e.g. local, temporal). In the last stage, we assign grammatical relations to pairs of chunks. We studied the effect of adding several levels to this cascaded classifier, and found that even the lower-performing chunkers enhanced the performance of the relation finder.

ILK-9909

Memory-based morphological analysis

Author(s): Antal van den Bosch and Walter Daelemans.
Reference: In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, ACL'99, University of Maryland, USA, June 20-26, 1999, pp. 285-292.

Postscript

We present a general architecture for efficient and deterministic morphological analysis based on memory-based learning, and apply it to morphological analysis of Dutch. The system makes direct mappings from letters in context to rich categories that encode morphological boundaries, syntactic class labels, and spelling changes. Both precision and recall of labeled morphemes are over 84% on held-out dictionary test words and estimated to be over 93% in free text.
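
The "letters in context" formulation can be illustrated with a small windowing sketch (window width and padding symbol are my assumptions; the class labels each window would receive are omitted):

    def letter_windows(word, left=3, right=3, pad="_"):
        # One classification case per letter: the focus letter plus a
        # fixed number of neighbours on either side, padded at the edges.
        padded = pad * left + word + pad * right
        for i in range(len(word)):
            j = i + left
            yield padded[j - left:j], padded[j], padded[j + 1:j + 1 + right]

    for l, focus, r in letter_windows("boeken"):
        print(l, focus, r)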

ILK-9910

Instance-family abstraction in memory-based language learning

Author(s): Antal van den Bosch.
Reference: In: I. Bratko and S. Dzeroski (Eds.), Machine Learning: Proceedings of the Sixteenth International Conference, ICML'99, Bled, Slovenia, June 27-30, 1999, pp. 39-48.

Postscript

Memory-based learning appears relatively successful when the learning data is highly disjunct, i.e., when classes are scattered over many small families of instances in instance space, as in many language learning tasks. Abstraction over borders of disjuncts tends to harm generalization performance. However, careful abstraction in memory-based learning may be harmless when it preserves the disjunctivity of the learning data. We investigate the effect of careful abstraction in a series of language-learning task studies, and a small benchmark-task study. We find that when combined with feature weighting or value-distance metrics, careful abstraction, as implemented in the new FAMBL algorithm, can equal the generalization accuracies of pure memory-based learning, while attaining fair levels of memory compression.
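
At sketch level, a "family" can be thought of as a generalized exemplar obtained by merging same-class neighbours (my reconstruction of the idea only; FAMBL's actual merging procedure is more involved):

    def merge_family(instances):
        # Merge same-class instances into one family exemplar whose
        # features hold the set of attested values per position.
        label = instances[0][1]
        merged = [set(vals) for vals in zip(*(f for f, _ in instances))]
        return merged, label

    family, label = merge_family([(("a", "b", "c"), "X"),
                                  (("a", "d", "c"), "X")])
    print(label, family)   # X [{'a'}, {'b', 'd'}, {'c'}] (set order may vary)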

ILK-9911

Machine learning of word pronunciation: the case against abstraction

Author(s): Bertjan Busser, Walter Daelemans, Antal van den Bosch.
Reference: In Proceedings of the Sixth European Conference on Speech Communication and Technology, Eurospeech99, Budapest, Hungary, Sept. 5-10, 1999, pp. 2123-2126.

Postscript

Word pronunciation can be learned by inductive machine learning algorithms when it is represented as a classification task: classify a letter within its local word context as mapping to its pronunciation. On the basis of generalization accuracy results from empirical studies, we argue that word pronunciation, particularly in complex spelling systems such as that of English, should not be modelled in a way that abstracts from exceptions. Learning methods such as decision tree and backpropagation learning, while trying to abstract from noise, also throw away a large number of useful exceptional cases. Our empirical results suggest that a memory-based approach which stores all available word-pronunciation knowledge as cases in memory, and generalises from this lexicon via analogical reasoning, is at all times the optimal modelling method.

ILK-9912

Memory-based language processing

Editor: Walter Daelemans.
Reference: Journal for Experimental and Theoretical Artificial Intelligence, special issue. 11:3, pp. 287-467.

Website

Memory-Based Language Processing (MBLP) views language processing as being based on the direct reuse of previous experience rather than on the use of abstractions extracted from that experience. In such a framework, language acquisition is modeled as the storage of exemplars, and language processing as similarity-based reasoning.

MBLP derives from work in Artificial Intelligence (case-based reasoning, memory-based reasoning, instance-based learning, lazy learning), Linguistics (analogical modeling), Computational Linguistics (example-based machine translation, case-based language processing, data-oriented parsing), and Statistical Pattern Recognition (k-nn models). In recent research, it has been shown that the application of algorithms based on this framework leads to accurate and efficient language models in diverse language processing areas (phonology, morphology, syntax, semantics, discourse).

The idea for this special issue originated at the Corsendonk workshop on memory-based language processing organized by Walter Daelemans and Steven Gillis, December 1997.

ILK-9913

Careful abstraction from instance families in memory-based language learning

Author(s): Antal van den Bosch.
Reference: Journal for Experimental and Theoretical Artificial Intelligence, 11:3, pp. 339-368.

Website, postscript of preprint

Empirical studies in inductive language learning point at pure memory-based learning as a successful approach to many language learning tasks, often performing better than learning methods that abstract from the learning material. The possibility is left open, however, that limited, careful abstraction in memory-based learning may be harmless to generalization, as long as the disjunctivity of language data is preserved. We test this hypothesis, by comparing empirically a range of careful abstraction methods, focusing particularly on methods that (i) generalize instances and (ii) perform oblivious (partial) decision-tree abstraction. These methods are applied to a selection of language learning tasks, and their generalization performance as well as memory item compression rates are collected. On the basis of the results we conclude that when combined with feature weighting or value distance metrics, careful abstraction equals or outperforms pure memory-based learning, yet mainly on small data sets. In the concluding case study involving large data sets, we find that the FAMBL algorithm, a new careful abstractor which merges families of instances, performs close to pure memory-based learning, though it equals it only on three of the six tasks. On the basis of the gathered empirical results, we discuss the incorporation of the notion of instance families, i.e. carefully generalized instances, in memory-based language learning.

ILK-9914

Memory-Based Word Sense Disambiguation

Author(s): Jorn Veenstra, Antal van den Bosch, Sabine Buchholz, Walter Daelemans, Jakub Zavrel.
Reference: Computers and the Humanities, special issue on Senseval, Word Sense Disambiguation, edited by: Adam Kilgarriff and Martha Palmer, 34:1-2, 2000.

postscript of preprint

We describe a memory-based classification architecture for word sense disambiguation and its application to the Senseval evaluation task. For each ambiguous word, a semantic word expert is automatically trained using a memory-based approach. In each expert, selecting the correct sense of a word in a new context is achieved by finding the closest match to stored examples of this task. Advantages of the approach include (i) fast development time for word experts, (ii) easy and elegant automatic integration of information sources, (iii) use of all available data for training the experts, and (iv) relatively high accuracy with minimal linguistic engineering.

 
1998 

ILK-9801

Modularity in inductively-learned word pronunciation systems

Author(s): Van den Bosch, A., Weijters, A., and Daelemans, W.
Reference: In D.M.W. Powers (Ed.), Proceedings of NeMLaP3/CoNLL98, Sydney, Australia, pp. 185-194.

Postscript

In leading morpho-phonological theories and state-of-the-art text-to-speech systems it is assumed that word pronunciation cannot be learned or performed without in-between analyses at several abstraction levels (e.g., morphological, graphemic, phonemic, syllabic, and stress levels). We challenge this assumption for the case of English word pronunciation. Using IGTree, an inductive decision-tree learning algorithm, we train and test three word-pronunciation systems in which the number of abstraction levels (implemented as sequenced modules) is reduced from five, via three, to one. The latter system, classifying letter strings directly as mapping to phonemes with stress markers, yields significantly better generalisation accuracies than the two multi-module systems. Analyses of empirical results indicate that positive utility effects of sequencing modules are outweighed by cascading errors passed on between modules.


ILK-9802

Do not forget: Full memory in memory-based learning of word pronunciation

Author(s): Van den Bosch, A., and Daelemans, W.
Reference: In D.M.W. Powers (Ed.), Proceedings of NeMLaP3/CoNLL98, Sydney, Australia, pp. 195-204.

Postscript

Memory-based learning, keeping full memory of learning material, appears a viable approach to learning NLP tasks, and is often superior in generalisation accuracy to eager learning approaches that abstract from learning material. Here we investigate three partial memory-based learning approaches which remove from memory specific task instance types estimated to be exceptional. The three approaches each implement one heuristic function for estimating exceptionality of instance types: (i) typicality, (ii) class prediction strength, and (iii) friendly-neighbourhood size. Experiments are performed with the memory-based learning algorithm IB1-IG trained on English word pronunciation. We find that removing instance types with low prediction strength (ii) is the only tested method which does not seriously harm generalisation accuracy. We conclude that keeping full memory of types rather than tokens, and excluding minority ambiguities appear to be the only performance-preserving optimisations of memory-based learning.


ILK-9804

Rapid development of NLP modules with memory-based learning

Author(s): Walter Daelemans, Antal van den Bosch, Jakub Zavrel, Jorn Veenstra, Sabine Buchholz, and Bertjan Busser.
Reference: In Proceedings of ELSNET in Wonderland, pp. 105-113. Utrecht: ELSNET, 1998. Also in R. Basili and M.T. Pazienza (Eds.), ECML-98 TANLPS Workshop Notes, Technische Universitaet Chemnitz, 1998, pp. 1-17.

Postscript

The need for software modules performing natural language processing (NLP) tasks is growing. These modules should perform efficiently and accurately, while at the same time rapid development is often mandatory. Recent work has indicated that machine learning techniques in general, and memory-based learning (MBL) in particular, offer the tools to meet both demands. We present examples of modules trained with MBL on three NLP tasks: (i) text-to-speech conversion, (ii) part-of-speech tagging, and (iii) phrase chunking. We demonstrate that the three modules display high generalization accuracy, and argue why MBL is applicable similarly well to a large class of other NLP tasks.


ILK-9805

Interpretable neural networks with BP-SOM

Author(s): Ton Weijters, Antal van den Bosch, and Jaap van den Herik.
Reference: In C. Nedellec and C. Rouveirol (Eds.), Machine Learning: ECML-98, Lecture Notes in Artificial Intelligence 1398, Berlin: Springer, 406-411, 1998.

Postscript

Interpretation of models induced by artificial neural networks is often a difficult task. In this paper we focus on a relatively novel neural network architecture and learning algorithm, BP-SOM, that offers possibilities to overcome this difficulty. It is shown that networks trained with BP-SOM show interesting regularities, in that hidden-unit activations become restricted to discrete values, and that the BP-SOM part can be exploited for automatic rule extraction.


ILK-9806

Toward inductive lexicons: a case study

Author(s): Walter Daelemans, Gert Durieux, and Antal van den Bosch.
Reference: In: P. Velardi (ed.), Proceedings LREC Workshop on Adapting Lexical and Corpus Resources to Sublanguages and Applications, Granada, Spain, pp. 29-35.

Postscript

Machine learning techniques can be used to make lexicons adaptive. The main problems in adaptation are the addition of lexical material to an existing lexical database, and the recomputation of sublanguage-dependent lexical information when porting the lexicon to a new domain or application. Inductive lexicons combine available lexical information and corpus data to alleviate these tasks. In this paper, we introduce the general methodology for the construction of inductive lexicons, and discuss empirical results on a case study using the approach: prediction of the gender of nouns in Dutch.
 


ILK-9807

Fast NP Chunking Using Memory-Based Learning Techniques

Author(s): Jorn Veenstra.
Reference: In F. Verdenius and W. van den Broek (Eds), Proceedings of Benelearn 1998, Wageningen, the Netherlands, pp. 71-79, 1998.

Postscript

In this paper we discuss the application of Memory-Based Learning (MBL) to fast NP chunking. We first discuss the application of a fast decision-tree variant of MBL (IGTree) on the dataset described in (Ramshaw & Marcus 1995), which consists of roughly 50,000 test and 200,000 training items. In a second series of experiments we used an architecture of two cascaded IGTrees. In the second level of this cascaded classifier we added context predictions as extra features so that incorrect predictions from the first level can be corrected, yielding a 97.2% generalisation accuracy with training and testing times in the order of seconds to minutes. The recall and precision for predicting NP chunks are 94.3% and 89.0%, respectively.
 


ILK-9808

Improving data driven wordclass tagging by system combination

Author(s): Hans van Halteren, Jakub Zavrel, and Walter Daelemans.
Reference: In Proceedings of COLING-ACL '98, August 1998, Montreal, Canada, pp. 491-497.

Postscript

In this paper we examine how the differences in modelling between different data driven systems performing the same NLP task can be exploited to yield a higher accuracy than the best individual system. We do this by means of an experiment involving the task of morpho-syntactic wordclass tagging. Four well-known tagger generators (Hidden Markov Model, Memory-Based, Transformation Rules and Maximum Entropy) are trained on the same corpus data. After comparison, their outputs are combined using several voting strategies and second stage classifiers. All combination taggers outperform their best component, with the best combination showing a 19.1% lower error rate than the best individual tagger.
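
The simplest of the voting strategies admits a compact sketch (illustrative only; the paper also evaluates weighted votes and second-stage classifiers):

    from collections import Counter

    def majority_vote(tag_sequences):
        # tag_sequences: one tag list per component tagger, all aligned
        # to the same tokens; ties fall to the earliest-listed tagger.
        combined = []
        for tags in zip(*tag_sequences):
            combined.append(Counter(tags).most_common(1)[0][0])
        return combined

    print(majority_vote([["DT", "NN"], ["DT", "VB"], ["DT", "NN"]]))   # ['DT', 'NN']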


ILK-9809

Distinguishing complements from adjuncts using memory-based learning

Author(s): Sabine Buchholz
Reference: In B. Keller (Ed.), Proceedings of the ESSLLI-98 Workshop on Automated Acquisition of Syntax and Parsing, pp. 41-48.

Postscript

The automatic distinction between complements and adjuncts, i.e. between subcategorized and non-subcategorized constituents, is crucial for the automatic acquisition of subcategorization lexicons from corpora. In this paper we present memory-based learning experiments for the task of distinguishing complements from adjuncts. Data is extracted from the part-of-speech tagged and parsed version of the Wall Street Journal Corpus. Memory-based learning algorithms classify test instances by using the class of the most similar training instance. By providing the algorithm with different subsets of features in the data, we can explore the importance of different features. By using only syntactic information about the category itself and its neighboring constituents, we achieve an accuracy of 91.6% for the complement-adjunct distinction, which corresponds to 89.7% correctly classified subcategorization frames. The error analysis shows that whereas at the level of constituents, PPs are most difficult to classify (23% errors), at the level of frames it is the ditransitive frame that has the highest error rate (97%).


ILK-9810

TreeTalk-D: a machine learning approach to Dutch word pronunciation

Author(s): Bertjan Busser.
Reference: In P. Sojka, V. Matousek, K. Pala, and I. Kopecek (Eds.) (1998) Proceedings TSD Conference, pp. 3-8, Masaryk University, Czech Republic.

Postscript

We present experimental results concerning the application of the IGTree decision-tree learning algorithm to Dutch word pronunciation. We evaluate four different Dutch word pronunciation systems configured to test the utility of modularization of grapheme-to-phoneme transcription (G) and stress prediction (S). Both training and testing data are extracted from the CELEX II lexical database. Experiments yield full word transcription accuracies (stressed and syllabified phonetic transcription) of roughly 75%, and 97% accuracy on G at the letter level. The best system performs G and S in sequence, using a context of four letters left and right per grapheme-phoneme mapping.


ILK-9811

Unsupervised learning of subcategorisation information and its application in a parsing subtask

Author(s): Sabine Buchholz.
Reference: In H. La Poutre and H.J. van den Herik (Eds.) (1998), Proceedings of the Tenth Netherlands/Belgium Conference on Artificial Intelligence (NAIC'98), CWI, Amsterdam, pp. 7-16.

Postscript

This paper is about two aspects of subcategorisation in NLP. First, it is about the automatic extraction of subcategorisation information from corpora. More specifically, we are concerned with unsupervised learning of subcategorisation information from tagged text by means of hierarchical clustering. The second aspect of the paper is the use of this subcategorisation information for parsing, especially for the distinction between complements and adjuncts. We show that the information learned by unsupervised clustering can be exploited by a memory-based learner to improve upon the complement-adjunct distinction. We compare the improvement gained by the use of this unsupervised information (1%) to that of different representations of subcategorisation information extracted from the treebank annotation (maximum 1.5%). The unsupervised information thus achieves two thirds of the improvement that can be obtained from the hand-crafted treebank information.

1997 

ILK-9701

IGTree: Using Trees for Compression and Classification in Lazy Learning Algorithms.

Author(s): Walter Daelemans, Antal van den Bosch, Ton Weijters
Reference: D. Aha (ed.) Artificial Intelligence Review, special issue on Lazy Learning, 1996.

Postscript

We describe the IGTree learning algorithm, which compresses an instance base into a tree structure. The concept of information gain is used as a heuristic function for performing this compression. IGTree produces trees that, compared to other lazy learning approaches, reduce storage requirements and the time required to compute classifications. Furthermore, we obtained similar or better generalization accuracy with IGTree when trained on two complex linguistic tasks, viz. letter-phoneme transliteration and part-of-speech tagging, when compared to alternative lazy learning and decision tree approaches (viz., IB1, information-gain-weighted IB1, and C4.5). A third experiment, with the task of word hyphenation, demonstrates that when the mutual differences in information gain of features are too small, IGTree as well as information-gain-weighted IB1 perform worse than IB1. These results indicate that IGTree is a useful algorithm for problems characterized by the availability of a large number of training instances described by symbolic features with sufficiently differing information gain values.
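
The gain heuristic that IGTree orders its tests on can be sketched directly (standard information gain over symbolic features; a minimal illustration rather than the ILK implementation):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(instances, feature):
        # Gain = class entropy minus the value-weighted entropy that
        # remains after splitting on the given feature position.
        labels = [label for _, label in instances]
        by_value = {}
        for feats, label in instances:
            by_value.setdefault(feats[feature], []).append(label)
        remainder = sum(len(ls) / len(labels) * entropy(ls)
                        for ls in by_value.values())
        return entropy(labels) - remainder

    data = [(("a", "x"), "1"), (("a", "y"), "1"),
            (("b", "x"), "0"), (("b", "y"), "0")]
    print(information_gain(data, 0), information_gain(data, 1))   # 1.0 0.0

Features are then examined in order of decreasing gain, so the most informative feature forms the top level of the tree.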


ILK-9702

Memory-Based Learning: Using Similarity for Smoothing

Author(s): Jakub Zavrel & Walter Daelemans
Reference: to appear in: Proc. of 35th Annual Meeting of the ACL, Madrid, July 1997
Postscript

This paper analyses the relation between the use of similarity in Memory-Based Learning and the notion of backed-off smoothing in statistical language modeling. We show that the two approaches are closely related, and we argue that feature weighting methods in the Memory-Based paradigm can offer the advantage of automatically specifying a suitable domain-specific hierarchy between most specific and most general conditioning information without the need for a large number of parameters. We report two applications of this approach: PP-attachment and POS-tagging. Our method achieves state-of-the-art performance in both domains, and allows the easy integration of diverse information sources, such as rich lexical representations. 


ILK-9703

Resolving PP Attachment Ambiguities with Memory-Based Learning

Author(s): Jakub Zavrel, Walter Daelemans & Jorn Veenstra
Reference: submitted to: Proc. of Computational Linguistics in the Netherlands 1996 (CLIN96)

Postscript

In this paper we describe the application of Memory-Based Learning to the problem of Prepositional Phrase attachment disambiguation. We compare the Memory-Based Learning method, which keeps examples in memory and generalizes by using intelligent similarity metrics, with a number of recently proposed statistical methods that are well suited for large numbers of features. We evaluate our methods on a common benchmark dataset that was first used in Ratnaparkhi et al. (1994). Our method compares favorably to previous methods, and is well-suited for incorporating various unconventional representations for word patterns such as value difference metrics and Lexical Space.


ILK-9704

TreeTalk-D: a Machine Learning Approach to Dutch Grapheme-to-Phoneme Conversion

Author: G.J.Busser.

This paper focuses on applying the IGTree learning algorithm to Grapheme-to-Phoneme Conversion (GPC) for Dutch. The architecture of the implemented system will be discussed with respect to linguistic theory. The system exhibits state-of-the-art performance.
Contact the author for the current draft version. 


ILK-9706

Empirical Learning of Natural Language Processing Tasks.

Author(s): Daelemans, W., A. van den Bosch, and T. Weijters.
Reference: M. van Someren and G. Widmer (eds.) Machine Learning: ECML-97, Lecture Notes in Artificial Intelligence 1224, Berlin: Springer, 337-344, 1997.

Postscript

Language learning has thus far not been a hot application for machine-learning (ML) research. This limited attention to work on empirical learning of language knowledge and behaviour from text and speech data seems unjustified. After all, it is becoming apparent that empirical learning of Natural Language Processing (NLP) can alleviate NLP's all-time main problem, viz. the knowledge acquisition bottleneck: empirical ML methods such as rule induction, top-down induction of decision trees, lazy learning, inductive logic programming, and some types of neural network learning seem to be excellently suited to automatically induce exactly that knowledge that is hard to gather by hand. In this paper we address the question why NLP is an interesting application for empirical ML, and provide a brief overview of current work in this area.


ILK-9707

Data Mining as a Method for Linguistic Analysis: Dutch Diminutives.

Author(s): Daelemans, W., P. Berck, & S. Gillis.
Reference: Folia Linguistica , XXXI/1-2, 57-75, 1997.

[see early workshop proceedings version]

 We propose to use data mining techniques (inductive techniques for the automatic acquisition of comprehensible knowledge from data) as a method in linguistic analysis. In the past, such techniques have mainly been used in linguistic engineering applications to solve knowledge acquisition bottlenecks. In this paper we show that they can also assist in linguistic theory formation by providing a new tool for the evaluation of linguistic hypotheses, for the extraction of rules from corpora, and for the discovery of useful linguistic categories. By applying a rule induction method to a particular linguistic task (diminutive formation in Dutch) we show that data mining techniques can be used to test linguistic hypotheses about this morphological process, and to discover interesting morphological and phonological rules and categories. 


ILK-9708

A Feature-Relevance Heuristic for Indexing and Compressing Large Case Bases.

Author(s): Daelemans, W., A. van den Bosch, and J. Zavrel.
Reference: M. van Someren and G. Widmer (eds.) 9th European Conference on Machine Learning -- Poster Papers. Prague: Laboratory of Intelligent Systems, 29-38, 1997.

 Postscript

 This paper reports results with IGTree, a formalism for indexing and compressing large case bases in Instance-Based Learning (IBL) and other lazy-learning techniques. The concept of information gain (entropy minimisation) is used as a heuristic feature-relevance function for performing the compression of the case base into a tree. IGTree reduces storage requirements and the time required to compute classifications considerably for problems where current IBL approaches fail for complexity reasons. Moreover, generalisation accuracy is often similar, for the tasks studied, to that obtained with information-gain-weighted variants of lazy learning, and alternative approaches such as C4.5. Although IGTree was designed for a specific class of problems --linguistic disambiguation problems with symbolic (nominal) features, huge case bases, and a complex interaction between (sub)regularities and exceptions-- we show in this paper that the approach has a wider applicability when generalising it to TRIBL, a hybrid combination of IGTree and IBL. 


ILK-9709

Skousen's Analogical Modeling Algorithm: A comparison with Lazy Learning.

Author(s): Daelemans, W., S. Gillis, and G. Durieux.
Reference: D. Jones and H. Somers (eds.) New Methods in Language Processing., London: University College Press, 3-15, 1997.

 [see workshop proceedings version]

We provide a qualitative and empirical comparison of Skousen's Analogical Modeling algorithm (AM) with Lazy Learning (LL) on a typical Natural Language Processing task. AM incorporates an original approach to feature selection and to the handling of symbolic, unordered feature values. More specifically, it provides a method to dynamically compute an optimally-sized set of nearest neighbours (the analogical set) for each test item, on the basis of which the most plausible category can be selected. We investigate the algorithm's generalisation accuracy and its tolerance to noise and compare it to Lazy Learning techniques on a primary stress assignment task in Dutch. The latter problem is typical for a large amount of classification problems in Natural Language Processing. It is shown that AM is highly successful in performing the task: it outperforms Lazy Learning in its basic scheme. However, LL can be augmented so that it performs at least as well as AM and becomes equally noise-tolerant.

1996 

Language-Independent Data-Oriented Grapheme-to-Phoneme Conversion

Author(s): Walter Daelemans, Antal van den Bosch
Reference: Van Santen, J., R. Sproat, J. Olive, and J. Hirschberg (eds.) Progress in Speech Synthesis. New York: Springer Verlag, 77-90, 1996.

Postscript

 We describe an approach to grapheme-to-phoneme conversion which is both language-independent and data-oriented. Given a set of examples (spelling words with their associated phonetic representation) in a language, a grapheme-to-phoneme conversion system is automatically produced for that language which takes as its input the spelling of words, and produces as its output the phonetic transcription according to the rules implicit in the training data. We describe the design of the system, and compare its performance to knowledge-based and alternative data-oriented approaches.


Abstraction Considered Harmful: Lazy Learning of Language Processing.

Author(s): Walter Daelemans
Reference: van den Herik, J. and T. Weijters (eds.) Benelearn-96. Proceedings of the 6th Belgian-Dutch Conference on Machine Learning. MATRIKS: Maastricht, The Netherlands, 3-12, 1996.

Postscript

No Abstract 


Morphological Analysis as Classification: an Inductive-Learning Approach.

Author(s): Antal van den Bosch, Walter Daelemans and Ton Weijters
Reference: Oflazer, K. and H. Somers (eds.) NeMLaP-2. Proceedings of the Second International Conference on New Methods in Language Processing, Ankara, Turkey, 79-89, 1996.

Postscript

Morphological analysis is an important subtask in text-to-speech conversion, hyphenation, and other language engineering tasks. The traditional approach to performing morphological analysis is to combine a morpheme lexicon, sets of (linguistic) rules, and heuristics to find a most probable analysis. In contrast we present an inductive learning approach in which morphological analysis is reformulated as a segmentation task. We report on a number of experiments in which five inductive learning algorithms are applied to three variations of the task of morphological analysis. Results show (i) that the generalisation performance of the algorithms is good, and (ii) that the lazy learning algorithm ib1-ig performs best on all three tasks. We conclude that lazy learning of morphological analysis as a classification task is indeed a viable approach; moreover, it has the strong advantages over the traditional approach of avoiding the knowledge-acquisition bottleneck, being fast and deterministic in learning and processing, and being language-independent.


Unsupervised Discovery of Phonological Categories through Supervised Learning of Morphological Rules

Author(s): Walter Daelemans, Peter Berck and Steven Gillis.
Reference: Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), Copenhagen, Denmark, 95-100, 1996.

Postscript

 We describe a case study in the application of symbolic machine learning techniques for the discovery of linguistic rules and categories. A supervised rule induction algorithm is used to learn to predict the correct diminutive suffix given the phonological representation of Dutch nouns. The system produces rules which are comparable to rules proposed by linguists. Furthermore, in the process of learning this morphological task, the phonemes used are grouped into phonologically relevant categories. We discuss the relevance of our method for linguistics and language technology.


Artificial Intelligence Models of Language Processing

Author(s): Walter Daelemans and Koen De Smedt.
Reference: In T. Dijkstra and K. De Smedt (eds.), Computational Psycholinguistics: AI and Connectionist models of human language processing. London: Taylor & Francis, 24-48, 1996.

No abstract 


MBT: A Memory-Based Part of Speech Tagger-Generator

Author(s): Walter Daelemans, Jakub Zavrel, Peter Berck and Steven Gillis.
Reference: E. Ejerhed and I. Dagan (eds.) Proceedings of the Fourth Workshop on Very Large Corpora, Copenhagen, Denmark, 14-27, 1996.

Postscript

We introduce a memory-based approach to part of speech tagging. Memory-based learning is a form of supervised learning based on similarity-based reasoning. The part of speech tag of a word in a particular context is extrapolated from the most similar cases held in memory. Supervised learning approaches are useful when a tagged corpus is available as an example of the desired output of the tagger. Based on such a corpus, the tagger-generator automatically builds a tagger which is able to tag new text the same way, diminishing development time for the construction of a tagger considerably. Memory-based tagging shares this advantage with other statistical or machine learning approaches. Additional advantages specific to a memory-based approach include (i) the relatively small tagged corpus that suffices for training, (ii) incremental learning, (iii) explanation capabilities, (iv) flexible integration of information in case representations, (v) its non-parametric nature, (vi) reasonably good results on unknown words without morphological analysis, and (vii) fast learning and tagging. In this paper we show that a large-scale application of the memory-based approach is feasible: we obtain a tagging accuracy that is on a par with that of known statistical approaches, and with attractive space and time complexity properties when using IGTree, a tree-based formalism for indexing and searching huge case bases. The use of IGTree has as additional advantage that optimal context size for disambiguation is dynamically computed.
 
1995 

A Computational Model of P&P: Dresher and Kaye (1990) revisited.

Author(s): Steven Gillis, Gert Durieux, Walter Daelemans.
Reference: M. Verrips & F. Wijnen (eds.) Approaches to Parameter Setting. Amsterdam Studies in Child Language Development, vol 4, 135-173, 1995.

Postscript

 Language acquisition research in the Universal Grammar tradition has witnessed a wealth of studies focusing on various aspects of phonology and syntax. The concept of parameter setting as the core of acquisition is at the heart of these studies. As a methodology, computational modeling has hardly given rise to experimental studies that actually implement the theoretical constructs invoked by and utilized in acquisition studies. Nevertheless, computer modeling is a powerful tool for studying highly complex phenomena such as the intricate interactions between the language acquisition data and the process of parameter setting. A notable exception to this situation is Dresher & Kaye's (1990) computational model YOUPIE that incorporates a UG approach to the acquisition of a phonological subsystem, i.e. stress assignment as it is treated in metrical phonology. We analyze this model focusing mainly on the learning theory incorporated in the model, i.e. the way in which UG mediates between the data and the grammar constructed by the learner. This investigation will focus on two aspects of the learning theory. First of all, the requirements formulated with respect to the learning theory will be evaluated against their implementation in the actual model. We will note several mismatches between the two. Secondly, we present an empirical test of the model. The model's production component is used to generate a 'language' for each possible parameter setting. Then, the model's learning component is used to acquire the grammar of each individual language. The outcome of the experiment reveals several problems in empirical coverage of the model, and relates some of them to inherent design choices.


The Profit of Learning Exceptions.

Author(s): Antal van den Bosch, Ton Weijters, Jaap van den Herik, Walter Daelemans
Reference: Proceedings of the 5th Belgian-Dutch Conference on Machine Learning, BENELEARN'95, p. 118-126, 1995.

Postscript

For many classification tasks, the set of available task instances can be roughly divided into regular instances and exceptions. We investigate three learning algorithms that apply a different method of learning with respect to regularities and exceptions, viz. (i) back-propagation, (ii) cascade back-propagation (a constructive version of back-propagation), and (iii) information-gain tree (an inductive decision-tree algorithm). We compare the bias of the algorithms towards learning regularities and exceptions, using a task-independent metric for the typicality of instances. We have found that information-gain tree is best capable of learning exceptions. However, it outperforms back-propagation and cascade back-propagation only when trained on very large training sets.


Linguistics as Data Mining: Dutch Diminutives

Author(s): Walter Daelemans, Peter Berck and Steven Gillis
Reference: Andernach, T., M. Moll, and A. Nijholt (eds). CLIN V, Papers from the Fifth CLIN Meeting, 59-72, 1995.

Postscript

There are several different ways data mining (the induction of knowledge from data) can be applied to the problem of natural language processing. In the past, data mining techniques have mainly been used in linguistic engineering applications to solve knowledge acquisition bottlenecks. In this paper, we show that they can also assist in linguistic theory formation by providing a new tool for the evaluation of linguistic hypotheses, for the extraction of rules from corpora, and for the discovery of useful linguistic categories. Applying Quinlan's C4.5 inductive machine learning method to a particular linguistic task (diminutive formation in Dutch) we show that data mining techniques can be used (i) to test linguistic hypotheses about this process, and (ii) to discover interesting linguistic rules and categories.


Memory-Based Lexical Acquisition and Processing.

Author(s): Walter Daelemans
Reference: P. Steffens (ed.) Machine Translation and the Lexicon, Springer Lecture Notes in Artificial Intelligence 898, 85-98, 1995.

Postscript

Current approaches to computational lexicology are knowledge-based (competence-oriented) and try to abstract away from specific formalisms, domains, and applications. This results in severe complexity, acquisition and reusability bottlenecks. As an alternative, we propose a particular performance-oriented approach to Natural Language Processing based on automatic memory-based learning of linguistic (lexical) tasks. The consequences of the approach for computational lexicology are discussed, and the application of the approach to a number of lexical acquisition and disambiguation tasks in phonology, morphology and syntax is described.

1994 

Measuring the Complexity of Writing Systems

Author(s): Antal van den Bosch, Alain Content, Walter Daelemans, and Beatrice De Gelder
Reference: Preprint of paper in Journal of Quantitative Linguistics, 1, 3, 178-188, 1994.

Postscript

We propose a quantitative operationalisation of the complexity of a writing system. This complexity, also referred to as orthographic depth, plays a crucial role in psycholinguistic modelling of reading aloud (and learning to read aloud) in several languages. The complexity of a writing system is expressed by two measures: that of the complexity of letter-phoneme alignment and that of the complexity of grapheme-phoneme correspondences. We present the alignment problem and the correspondence problem as tasks to three different data-oriented learning algorithms, and submit them to English, French and Dutch learning and testing material. Generalisation performance metrics are used to propose for each corpus a two-dimensional writing system complexity value.


Default Inheritance in an Object-Oriented Representation of Linguistic Categories.

Author(s): Walter Daelemans and Koen De Smedt
Reference: International Journal Human-Computer Studies 41, 149-177, 1994.

We describe an object-oriented approach to the representation of linguistic knowledge. Rather than devising a dedicated grammar formalism, we explore the use of powerful but domain-independent object-oriented languages. We use default inheritance to organize regular and exceptional behavior of linguistic categories. Examples from our work in the areas of morphology, syntax and the lexicon are provided. Special attention is given to multiple inheritance, which is used for the composition of new categories out of existing ones, and to structured inheritance, which is used to predict, among other things, to which rule domain a word form belongs.
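
As a hedged illustration of the core idea (rendered here in Python rather than the object-oriented language of the paper, with toy morphology): regular behaviour lives on a general category, and exceptional subcategories override, i.e. defeat, the inherited default.

    class Noun:
        def plural(self, stem):
            return stem + "s"          # default rule for the category

    class EnNoun(Noun):                # exceptional subcategory
        def plural(self, stem):
            return stem + "en"         # overrides the inherited default

    print(Noun().plural("cat"))        # cats (default applies)
    print(EnNoun().plural("ox"))       # oxen (exception wins)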


A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion.

Author(s): Walter Daelemans and Antal van den Bosch
Reference: Proceedings of the ESCA-IEEE conference on Speech Synthesis, New York, 199-203, 1994.

Postscript

We report on an implemented grapheme-to-phoneme conversion architecture. Given a set of examples (spelling words with their associated phonetic representation) in a language, a grapheme-to-phoneme conversion system is automatically produced for that language which takes as its input the spelling of words, and produces as its output the phonetic transcription according to the rules implicit in the training data. This paper describes the architecture and focuses on our solution to the alignment problem: given the spelling and the phonetic transcription of a word (often differing in length), these two representations have to be aligned in such a way that grapheme symbols or strings of grapheme symbols are consistently associated with the same phonetic symbol. If this alignment has to be done by hand, it is extremely labour-intensive.
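
The alignment problem admits a small worked sketch (a toy dynamic program of my own, not the solution described in the paper; it assumes the spelling is at least as long as the transcription): insert null phonemes so both strings get equal length, preferring alignments where letter and phoneme happen to be the same symbol.

    def align(letters, phonemes):
        # score[i][j]: max number of identical letter/phoneme pairs when
        # aligning the first i letters with the first j phonemes.
        n, m = len(letters), len(phonemes)
        score = [[-1] * (m + 1) for _ in range(n + 1)]
        score[0][0] = 0
        for i in range(n):
            for j in range(min(i, m) + 1):
                if score[i][j] < 0:
                    continue
                if j < m:   # letter i realised as phoneme j
                    s = score[i][j] + (letters[i] == phonemes[j])
                    score[i + 1][j + 1] = max(score[i + 1][j + 1], s)
                # letter i realised as a null phoneme
                score[i + 1][j] = max(score[i + 1][j], score[i][j])
        aligned, i, j = [], n, m   # backtrace
        while i > 0:
            if j > 0 and score[i][j] == score[i - 1][j - 1] + (letters[i - 1] == phonemes[j - 1]):
                aligned.append(phonemes[j - 1])
                i, j = i - 1, j - 1
            else:
                aligned.append("-")
                i -= 1
        return list(reversed(aligned))

    print(align("booking", "bukIN"))   # ['b', '-', 'u', 'k', '-', 'I', 'N']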

 


Are Children 'Lazy Learners'? A Comparison of Natural and Machine Learning of Stress.

Author(s): Steven Gillis, Walter Daelemans and Gert Durieux
Reference: Ram, A. and Eiselt, K. (eds.) Proceedings of the Sixteenth Annual Conference of the Cognitive Science Society, Georgia Institute of Technology, Atlanta, USA, Hillsdale: Lawrence Erlbaum Associates, 369-374, 1994.

Postscript

Do children acquire rules for main stress assignment or do they learn stress in an exemplar-based way? In the language acquisition literature, the former approach has been advocated without exception: although they hear most words produced with their appropriate stress pattern, children are taken to extract rules and do not store stress patterns lexically. The evidence for a rule-based approach is investigated and it will be argued that in the literature this approach is preferred due to an extremely simplified interpretation of exemplar-based models. We will report experiments showing that Instance-Based Learning, an exemplar-based model, makes the same kinds of stress-related errors in production that children make:

  1. the number of production errors is related to metrical markedness, and
  2. stress shifts and errors with respect to the segmental and syllabic structure of words typically take the form of a regularization of stress patterns.
Instance-Based Learning belongs to a class of Lazy Learning algorithms. In these algorithms, no explicit abstractions in the form of decision trees or rules are derived; abstraction is driven by similarity during performance. Our results indicate that at least for this domain, this kind of lazy learning is a valid alternative to rule-based learning. Moreover, the results plead for a reanalysis of language acquisition data in terms of exemplar-based models.

 


Skousen's Analogical Modeling Algorithm: a Comparison with Lazy Learning

Author(s): Walter Daelemans, Steven Gillis and Gert Durieux.
Reference: Jones, D. (ed.) Proceedings of the International Conference on New Methods in Language Processing (NeMLaP), UMIST: Manchester, 1-7, 1994.

Postscript

We provide a qualitative and empirical comparison of Skousen's Analogical Modeling algorithm (AM) with Lazy Learning (LL) on a typical Natural Language Processing task. AM incorporates an original approach to feature selection and to the handling of symbolic, unordered feature values. More specifically, it provides a method to dynamically compute an optimally-sized set of nearest neighbours (the analogical set) for each test item, on the basis of which the most plausible category can be selected. We investigate the algorithm's generalisation accuracy and its tolerance to noise and compare it to Lazy Learning techniques on a primary stress assignment task in Dutch. The latter problem is typical for a large amount of classification problems in Natural Language Processing. It is shown that AM is highly successful in performing the task: it outperforms Lazy Learning in its basic scheme. However, LL can be augmented so that it performs at least as well as AM and becomes equally noise-tolerant.

 


The Acquisition of Stress: a Data-Oriented Approach.

Author(s): Walter Daelemans, Steven Gillis and Gert Durieux.
Reference: Computational Linguistics 20 (3), special issue on Computational Phonology (Steven Bird guest ed.), 421-451, 1994.

A data-oriented (empiricist) alternative to the currently pervasive (nativist) Principles and Parameters approach to the acquisition of stress assignment is investigated. A similarity based algorithm, viz. an augmented version of Instance Based Learning is used to learn the system of main stress assignment in Dutch. In this non-trivial task a comprehensive lexicon of Dutch monomorphemes is used instead of the idealized and highly simplified description of the empirical data used in previous approaches. It is demonstrated that a similarity-based learning method is effective in learning the complex stress system of Dutch. The task is accomplished without the a priori knowledge assumed to pre-exist in the learner in a Principles and Parameters framework. A comparison of the system's behavior with a consensus linguistic analysis (in the framework of Metrical Phonology) shows that ease of learning correlates with decreasing degrees of markedness of metrical phenomena. It is also shown that the learning algorithm captures subregularities within the stress system of Dutch which cannot be described without going beyond some of the theoretical assumptions of metrical phonology.

 


Learnability and Markedness: Dutch Stress Assignment.

Author(s): Steven Gillis, Walter Daelemans, Gert Durieux and Antal van den Bosch.
Reference: Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society, Boulder Colorado, USA, Hillsdale: Lawrence Erlbaum Associates, 452-457, 1993.

Postscript

 This paper investigates the computational grounding of learning theories developed within a metrical phonology approach to stress assignment. In current research the Principles and Parameters approach to learning stress is pervasive. We point out some inherent problems associated with this approach in learning the stress system of Dutch. The paper focuses on two specific aspects of the learning task: we empirically investigate the effect of input encodings on learnability, and we examine the possibility of a data-oriented approach as an alternative to the Principles and Parameters approach. We show that a data-oriented similarity-based machine learning technique (Instance-Based Learning), working on phonemic input encodings is able to learn metrical phonology abstractions based on concepts like syllable weight, and that in addition, it is able to extract generalizations which cannot be expressed within a metrical framework.

 


Tabtalk: Reusability in Data-Oriented Grapheme-to-Phoneme Conversion.

Author(s): Walter Daelemans and Antal van den Bosch.
Reference: Proceedings of Eurospeech, Berlin, 1459-1466, 1993.

Postscript

 In the traditional (knowledge-based) approach to the design of grapheme-to-phoneme modules in text-to-speech systems, it is claimed that various explicitly coded, language-specific, linguistic knowledge sources are necessary for a good performance. Due to knowledge acquisition bottlenecks, this implies long development cycles. As an alternative, we propose to use inductive methods from machine learning in a simple combined Trie Search and Similarity-Based Reasoning approach and show that, for Dutch, its performance is better than that of the knowledge-based approach and backpropagation learning. Furthermore, we show that our approach is reusable for any language for which a training corpus exists.

1993 

Data-Oriented Methods for Grapheme-to-Phoneme Conversion

Author(s): Antal van den Bosch and Walter Daelemans.
Reference: Proceedings of the Sixth conference of the European chapter of the ACL, ACL, 45-53, 1993.

Postscript

It is traditionally assumed that various sources of linguistic knowledge and their interaction should be formalised in order to be able to convert words into their phonemic representations with reasonable accuracy. We show that using supervised learning techniques, based on a corpus of transcribed words, the same and even better performance can be achieved, without explicit modeling of linguistic knowledge. In this paper we present two instances of this approach. A first model implements a variant of instance-based learning, in which a weighted similarity metric and a database of prototypical exemplars are used to predict new mappings. In the second model, grapheme-to-phoneme mappings are looked up in a compressed text-to-speech lexicon (table lookup) enriched with default mappings. We compare performance and accuracy of these approaches to a connectionist (backpropagation) approach and to the linguistic knowledge-based approach.

 


Learnability and Markedness in Data-Driven Acquisition of Stress

Author(s): Walter Daelemans, Steven Gillis, Gert Durieux and Antal van den Bosch.
Reference: T. Mark Ellison and James M. Scobbie (eds) Computational Phonology. Edinburgh Working Papers in Cognitive Science 8, 1993, 157-178.

Postscript

 This paper investigates the computational grounding of learning theories developed within a metrical phonology approach to stress assignment. In current research, the Principles and Parameters approach to learning stress is pervasive. We point out some inherent problems associated with this approach in learning the stress system of a particular language by setting parameters (the case of Dutch), which is shown to be an inherently noisy problem. The paper focuses on two aspects of this problem: we empirically examine the effect of input encodings on learnability, and we investigate the possibility of a data-oriented approach as an alternative to the principles and parameters approach. We show that data-oriented similarity-based machine learning techniques like Backpropagation Learning, Instance-Based Learning and Analogical Modeling working on phonemic input encodings

  1. are able to learn metrical phonology abstractions based on concepts like syllable weight,
  2. that their performance can be related to various degrees of markedness of metrical phenomena, and
  3. that in addition, they are able to extract generalizations which cannot be expressed within the metrical framework without recourse to lexical marking.
We also provide a quantitative comparison of the performance of the three algorithms investigated.

1992 

Generalization Performance of Backpropagation Learning on a Syllabification Task.

Author(s): Walter Daelemans and Antal van den Bosch.
Reference: M.F.J. Drossaers and A. Nijholt (eds.) Connectionism and Natural Language Processing. Proceedings Third Twente Workshop on Language Technology, 27-38, 1992.

Postscript

We investigated the generalization capabilities of backpropagation learning in feed-forward and recurrent feed-forward connectionist networks on the assignment of syllable boundaries to orthographic representations in Dutch (hyphenation). This is a difficult task because phonological and morphological constraints interact, leading to ambiguity in the input patterns. We compared the results to different symbolic pattern matching approaches, and to an exemplar-based generalization scheme, related to a k-nearest neighbour approach, but using a similarity metric weighted by the relative information entropy of positions in the training patterns. Our results indicate that the generalization performance of backpropagation learning for this task is not better than that of the best symbolic pattern matching approaches, and of exemplar-based generalization.

Last update: 18 Aug 99