Antal van den Bosch - abstracts

Daelemans, W. and van den Bosch, A. (1992).
Generalization performance of backpropagation learning on a syllabification task.

We investigated the generalization capabilities of backpropagation learning in feed-forward and recurrent feed-forward connectionist networks on the assignment of syllable boundaries to orthographic representations in Dutch (hyphenation). This is a difficult task because phonological and morphological constraints interact, leading to ambiguity in the input patterns. We compared the results to different symbolic pattern matching approaches, and to an exemplar-based generalization scheme, related to a k-nearest neighbour approach, but using a similarity metric weighted by the relative information entropy of positions in the training patterns. Our results indicate that the generalization performance of backpropagation learning for this task is not better than that of the best symbolic pattern matching approaches, or that of exemplar-based generalization.
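The entropy-weighted similarity metric mentioned in this abstract can be illustrated with a minimal sketch. It assumes that each pattern position is weighted by its information gain and that similarity is the summed weight of matching positions; the function names and the toy patterns are invented for illustration and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(instances, labels, position):
    """Reduction in class entropy obtained by knowing the symbol at `position`."""
    by_value = defaultdict(list)
    for inst, lab in zip(instances, labels):
        by_value[inst[position]].append(lab)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder


def weighted_overlap(a, b, weights):
    """Similarity: sum of the weights of the positions where a and b match."""
    return sum(w for x, y, w in zip(a, b, weights) if x == y)


def classify(query, instances, labels, weights):
    """Return the class of the stored exemplar most similar to `query`
    (1-nearest neighbour; ties go to the earliest stored exemplar)."""
    best = max(range(len(instances)),
               key=lambda i: weighted_overlap(query, instances[i], weights))
    return labels[best]


# Invented toy data: three-symbol patterns with a binary class.
toy_instances = [("a", "b", "a"), ("a", "b", "b"),
                 ("b", "a", "a"), ("b", "a", "b")]
toy_labels = ["1", "1", "0", "0"]
toy_weights = [information_gain(toy_instances, toy_labels, p) for p in range(3)]
```

In the toy data the first two positions fully determine the class and the third carries no information, so the third position receives a weight of zero and is effectively ignored during matching.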


Van den Bosch, A., and Daelemans, W. (1993).
Data-oriented methods for grapheme-to-phoneme conversion.

It is traditionally assumed that various sources of linguistic knowledge and their interaction should be formalised in order to be able to convert words into their phonemic representations with reasonable accuracy. We show that using supervised learning techniques, based on a corpus of transcribed words, the same and even better performance can be achieved, without explicit modeling of linguistic knowledge.
In this paper we present two instances of this approach. A first model implements a variant of instance-based learning, in which a weighted similarity metric and a database of prototypical exemplars are used to predict new mappings. In the second model, grapheme-to-phoneme mappings are looked up in a compressed text-to-speech lexicon (table lookup) enriched with default mappings. We compare performance and accuracy of these approaches to a connectionist (backpropagation) approach and to the linguistic knowledge-based approach.


Daelemans, W. and van den Bosch, A. (1993).
Tabtalk: Reusability in data-oriented grapheme-to-phoneme conversion.

In the traditional (knowledge-based) approach to the design of grapheme-to-phoneme modules in text-to-speech systems, it is claimed that various explicitly coded, language-specific, linguistic knowledge sources are necessary for a good performance. Due to knowledge acquisition bottlenecks, this implies long development cycles. As an alternative, we propose to use inductive methods from machine learning in a simple combined Trie Search and Similarity-Based Reasoning approach and show that, for Dutch, its performance is better than that of the knowledge-based approach and backpropagation learning. Furthermore, we show that our approach is reusable for any language for which a training corpus exists.


Daelemans, W. and van den Bosch, A. (1994).
A language-independent, data-oriented architecture for grapheme-to-phoneme conversion.

We report on an implemented grapheme-to-phoneme conversion architecture. Given a set of examples (spelling words with their associated phonetic representations) in a language, a grapheme-to-phoneme conversion system is automatically produced for that language which takes as its input the spelling of words, and produces as its output the phonetic transcriptions according to the rules implicit in the training data. This paper describes the architecture and focuses on our solution to the alignment problem: given the spelling and the phonetic transcription of a word (often differing in length), these two representations have to be aligned in such a way that grapheme symbols or strings of grapheme symbols are consistently associated with the same phonetic symbol. If this alignment has to be done by hand, it is extremely labour-intensive.


Van den Bosch, A., Content, A., Daelemans, W., and De Gelder, B. (1994).
Analysing orthographic depth of different languages using data-oriented algorithms.

We propose a quantitative operationalisation of the concept of orthographic depth, which plays a crucial role in psycholinguistic modelling of reading aloud (and learning to read aloud) in different languages. The orthographic depth of a language is expressed by measuring the complexity of letter-phoneme alignment and the complexity of grapheme-phoneme correspondences within that language. We present the alignment problem and the correspondence problem as tasks to three different data-oriented learning algorithms, and submit them to English, French and Dutch learning and testing material. Generalisation performance metrics are used to propose for each corpus a two-dimensional orthographic depth value.


Van den Bosch, A., Content, A., Daelemans, W., and De Gelder, B. (1994).
Measuring the complexity of writing systems.

We propose a quantitative operationalisation of the complexity of a writing system. This complexity, also referred to as orthographic depth, plays a crucial role in psycholinguistic modelling of reading aloud (and learning to read aloud) in several languages. The complexity of a writing system is expressed by two measures, viz. that of the complexity of letter-phoneme alignment and that of the complexity of grapheme-phoneme correspondences. We present the alignment problem and the correspondence problem as tasks to three different data-oriented learning algorithms, and submit them to English, French and Dutch learning and testing material. Generalisation performance metrics are used to propose for each corpus a two-dimensional writing system complexity value.


Van den Bosch, A., Weijters, A., and van den Herik, H.J. (1995).
Scaling effects with greedy and lazy machine-learning algorithms.

We report on a series of experiments in which three machine-learning algorithms are trained to hyphenate English words, viz. the backpropagation algorithm, the cascade-backpropagation algorithm, and the information-gain-tree algorithm (IG-Tree). English hyphenation is an interesting testing ground for machine learning: it has a few underlying principles, and a large number of exceptions. A successful learning algorithm must be able to deal with both. The three learning algorithms vary in the way they compress the training material: by extracting a limited number of general rules (greedy learning), or by storing large numbers of instances (lazy learning). Our experiments show that the lazy learning algorithm (i.e., the IG-Tree algorithm) scales up better than the two greedy learning algorithms (the backpropagation algorithm and the cascade-backpropagation algorithm). Moreover, our results call for including very large problem sets in collections of machine-learning benchmark problems.


Van den Bosch, A., Weijters, A., Van den Herik, J., and Daelemans, W. (1995).
The profit of learning exceptions.

For many classification tasks, the set of available task instances can be roughly divided into regular instances and exceptions. We investigate three learning algorithms that apply a different method of learning with respect to regularities and exceptions, viz. (i) back-propagation, (ii) cascade back-propagation (a constructive version of back-propagation), and (iii) information-gain tree (an inductive decision-tree algorithm). We compare the bias of the algorithms towards learning regularities and exceptions, using a task-independent metric for the typicality of instances. We have found that information-gain tree is best capable of learning exceptions. However, it outperforms back-propagation and cascade back-propagation only when trained on very large training sets.


Van den Bosch, A., and Weijters, A. (1995).
Stretching the limits of learning without modules.

Decomposing a hard problem into easier sub-problems (`modularisation') is a powerful problem-solving technique. Modularisation is often based on expert knowledge and can lead to efficient high-performance models. Contrasting with this expert-based approach is the approach of machine-learning algorithms such as back-propagation and symbolic inductive-learning algorithms that do not make use of a predetermined modular architecture. We present examples of machine-learned models without modules for problems that are traditionally solved by expert-based modularisation. The machine-learned models perform as well as or better than the expert-based models. This surprising fact gives rise to the question whether the performance of machine-learned models could be further increased when modularisation is somehow incorporated in the learning algorithms. We describe work in progress on the development of machine learning algorithms that automatically construct modular architectures during learning.


Van den Bosch, A., Daelemans, W., and Weijters, A. (1996).
An inductive-learning approach to morphological analysis.

Morphological analysis is an important subtask in text-to-speech conversion, hyphenation, and other language engineering tasks. The traditional approach to performing morphological analysis is to combine a morpheme lexicon, sets of (linguistic) rules, and heuristics to find a most probable analysis. In contrast, we present an inductive learning approach in which morphological analysis is reformulated as a segmentation task. We report on a number of experiments in which five inductive learning algorithms are applied to three variations of the task of morphological analysis. Results show (i) that the generalisation performance of the algorithms is good, and (ii) that the lazy learning algorithm IB1-IG performs best on all three tasks. We conclude that lazy learning of morphological analysis as a classification task is indeed a viable approach; moreover, it has the strong advantages of avoiding the knowledge-acquisition bottleneck, being fast and deterministic in learning and processing, and being language-independent.


Van den Bosch, A., Daelemans, W., and Weijters, A. (1996).
Morphological analysis as classification: An inductive-learning approach.

Morphological analysis is an important subtask in text-to-speech conversion, hyphenation, and other language engineering tasks. The traditional approach to performing morphological analysis is to combine a morpheme lexicon, sets of (linguistic) rules, and heuristics to find a most probable analysis. In contrast we present an inductive learning approach in which morphological analysis is reformulated as a segmentation task. We report on a number of experiments in which five inductive learning algorithms are applied to three variations of the task of morphological analysis. Results show (i) that the generalisation performance of the algorithms is good, and (ii) that the lazy learning algorithm IB1-IG performs best on all three tasks. We conclude that lazy learning of morphological analysis as a classification task is indeed a viable approach; moreover, it has the strong advantages over the traditional approach of avoiding the knowledge-acquisition bottleneck, being fast and deterministic in learning and processing, and being language-independent.


Daelemans, W., and Van den Bosch, A. (1997).
Language-independent data-oriented grapheme-to-phoneme conversion.

We describe an approach to grapheme-to-phoneme conversion which is both language-independent and data-oriented. Given a set of examples (spelling words with their associated phonetic representation) in a language, a grapheme-to-phoneme conversion system is automatically produced for that language which takes as its input the spelling of words, and produces as its output the phonetic transcription according to the rules implicit in the training data. We describe the design of the system, and compare its performance to knowledge-based and alternative data-oriented approaches.


Wolters, M., and Van den Bosch, A. (1997).
Automatic Phonetic Transcription of Words Based On Sparse Data.

The relation between the orthography and the phonology of a language has traditionally been modelled by hand-crafted rule sets. Machine-learning (ML) approaches offer a means to gather this knowledge automatically. Problems arise when the training material is sparse. Generalising from sparse data is a well-known problem for many ML algorithms. We present experiments in which connectionist, instance-based, and decision-tree learning algorithms are applied to a small corpus of Scottish Gaelic. The results show that instance-based learning in the IB1-IG algorithm yields the best generalisation performance, and that most algorithms tested perform tolerably well. Given the availability of a lexicon, even if it is sparse, ML is a valuable and efficient tool for automatic phonetic transcription of written text.


Van den Bosch, A., Weijters, A., Van den Herik, H.J., and Daelemans, W. (1997).
When small disjuncts abound, try lazy learning: A case study.

Machine learning is becoming recognised as a source of generic and powerful tools for tasks studied and implemented in language technology. Lazy learning with information-theoretic similarity matching has emerged as a salient approach, demonstrated to be superior to other machine-learning approaches in various comparative studies. It is asserted both in theoretical machine learning and in reports on applications of machine learning to natural language that the success of lazy learning may be due to the fact that language data contains small disjuncts, i.e., small clusters of identically-classified instances. We propose three measures to discover small disjuncts in our data: (i) we count and analyse indexed clusters of instances in induced decision trees; (ii) we count clusters of friendly (identically-classified) instances immediately surrounding instances by using similarity metrics from lazy learning; (iii) we compare average sizes of friendly-instance clusters using different similarity metrics. The measures are illustrated by a sample language task, viz. word pronunciation. Two conclusions are arrived at: (i) our data indeed contains large numbers of small disjuncts of about three to a hundred instances, and (ii) there are important differences in feature relevance in the data, exploited appropriately when lazy learning is augmented with information-theoretic similarity matching. We claim that the measures introduced in this paper are useful for predicting the suitability of lazy learning in general.
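Measure (ii) in this abstract, counting the friendly instances immediately surrounding an instance, can be read in more than one way. The sketch below shows one possible reading: rank all other instances by similarity to the instance at hand and count how many share its class before the first differently-classified instance appears. The unweighted overlap metric, the tie-breaking by storage order, and the function names are assumptions made for illustration and are not taken from the paper.

```python
def overlap(a, b):
    """Unweighted overlap: number of positions at which a and b match."""
    return sum(x == y for x, y in zip(a, b))


def friendly_neighbourhood_size(index, instances, labels):
    """Count the nearest neighbours of instances[index] that share its class,
    stopping at the first neighbour with a different class.

    Assumption: ties in similarity are broken by storage order
    (Python's sort is stable)."""
    target = instances[index]
    others = [i for i in range(len(instances)) if i != index]
    others.sort(key=lambda i: overlap(target, instances[i]), reverse=True)
    size = 0
    for i in others:
        if labels[i] != labels[index]:
            break
        size += 1
    return size


# Invented toy data for illustration only.
toy_instances = [("a", "b", "c"), ("a", "b", "d"),
                 ("a", "x", "y"), ("z", "z", "z")]
toy_labels = ["A", "A", "B", "A"]
sizes = [friendly_neighbourhood_size(i, toy_instances, toy_labels)
         for i in range(len(toy_instances))]
```

Under this reading, an instance whose nearest neighbour already belongs to another class has a neighbourhood size of zero, which makes it a candidate exception.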


Weijters, A., Van den Bosch, A., and Van den Herik, H.J. (1997).
Intelligible neural networks with BP-SOM.

Back-propagation learning (BP) is known for its serious limitations in generalising knowledge from certain types of learning material. BP-SOM is an extension of BP which overcomes some of these limitations. BP-SOM is a combination of a multi-layered feed-forward network (MFN) trained with BP, and Kohonen's self-organising maps (SOMs). In earlier reports, it has been shown that BP-SOM improved the generalisation performance while simultaneously decreasing the number of necessary hidden units without loss of generalisation performance. These are only two effects of the use of SOM learning during training of MFNs. In this paper we focus on two additional effects. First, we show that after BP-SOM training, activations of hidden units of MFNs tend to oscillate among a limited number of discrete values. Second, we identify SOM elements as adequate organisers of instances of the task at hand. We visualise both effects, and argue that they lead to intelligible neural networks and can be employed as a basis for automatic rule extraction.


Van den Bosch, A., Weijters, A., and Daelemans, W. (1998).
Do not forget: Full memory in memory-based learning of word pronunciation.

Memory-based learning, keeping full memory of learning material, appears a viable approach to learning NLP tasks, and is often superior in generalisation accuracy to eager learning approaches that abstract from learning material. Here we investigate three partial memory-based learning approaches which remove from memory specific task instance types estimated to be exceptional. The three approaches each implement one heuristic function for estimating exceptionality of instance types: (i) typicality, (ii) class prediction strength, and (iii) friendly-neighbourhood size. Experiments are performed with the memory-based learning algorithm IB1-IG trained on English word pronunciation. We find that removing instance types with low prediction strength (ii) is the only tested method which does not seriously harm generalisation accuracy. We conclude that keeping full memory of types rather than tokens, and excluding minority ambiguities appear to be the only performance-preserving optimisations of memory-based learning.


Van den Bosch, A. and Daelemans, W. (1998).
Modularity in inductively-learned word pronunciation systems.

In leading morpho-phonological theories and state-of-the-art text-to-speech systems it is assumed that word pronunciation cannot be learned or performed without in-between analyses at several abstraction levels (e.g., morphological, graphemic, phonemic, syllabic, and stress levels). We challenge this assumption for the case of English word pronunciation. Using IGTree, an inductive-learning decision-tree algorithm, we train and test three word-pronunciation systems in which the number of abstraction levels (implemented as sequenced modules) is reduced from five, via three, to one. The latter system, classifying letter strings directly as mapping to phonemes with stress markers, yields significantly better generalisation accuracies than the two multi-module systems. Analyses of empirical results indicate that positive utility effects of sequencing modules are outweighed by cascading errors passed on between modules.
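The abstract does not say how letter strings are presented to the learner. A common encoding for this kind of task is a fixed-width window that slides over the word, one instance per letter, with the focus letter in the middle. The sketch below assumes that encoding; the function name, the context width, and the padding symbol are hypothetical choices.

```python
def letter_windows(word, context=3, pad="_"):
    """Turn a word into one fixed-width instance per letter.

    Each instance holds the focus letter plus `context` letters on either
    side; positions beyond the word edges are filled with `pad`, which
    must be a single character. In a single-module system, each instance
    would be paired with one class label for the focus letter."""
    padded = pad * context + word + pad * context
    width = 2 * context + 1
    return [tuple(padded[i:i + width]) for i in range(len(word))]


# With one letter of context, "cat" yields three instances:
# ('_', 'c', 'a'), ('c', 'a', 't'), ('a', 't', '_')
windows = letter_windows("cat", context=1)
```

With this encoding a single classifier can map each window directly to a combined output class, so no intermediate module output is needed as input to a later stage.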


Vroomen, J., Van den Bosch, A., and De Gelder, B. (1998).
A connectionist model for bootstrap learning of syllabic structure.

We report on a series of experiments with simple recurrent networks (SRNs) solving phoneme prediction in continuous phonemic data. The purpose of the experiments is to investigate whether the network output could function as a source for syllable boundary detection. We show that this is possible, using a generalisation of the network which resembles the linguistic sonority principle. We argue that the primary generalisation of the network, i.e., the fact that sonority varies in a hat-shaped way across phonemic strings, ending and starting at syllable boundaries, is an indication that sonority might be a major cue in discovering the essential building bricks of language when confronted with unsegmented running speech. The segment which is most directly related to sonority patterns, the syllable, has received considerable attention in psycholinguistics as being an element of natural language that is easily grasped by language learners. The phoneme prediction network presents a simulation of the necessary bootstrap to arrive at the discovery of syllabic segmentation in unsegmented speech, which can be used as a basis for the segmentation of larger structures like words.


Van den Bosch, A. (draft, conditionally accepted).
Careful abstraction over instance families in memory-based language learning.

Empirical studies in inductive language learning point at pure memory-based learning as a successful approach to many language learning tasks, often performing better than learning methods that abstract from the learning material. The possibility is left open, however, that limited, careful abstraction in memory-based learning may be harmless to generalisation, as long as the disjunctivity of language data is preserved. We compare three types of careful abstraction: editing, oblivious (partial) decision-tree abstraction, and generalised instances, in a single-task study. Only when combined with feature weighting can careful abstraction equal pure memory-based learning. In a multi-task case study we find that the FAMBL algorithm, a new careful abstractor which merges families of instances, performs close to pure memory-based learning, though it equals it only on one task. On the basis of the gathered empirical results, we argue for the incorporation of the notion of instance families, i.e., carefully generalised instances, in memory-based language learning.

Last update: Tue Dec 8 1998