| Antal van den Bosch - abstracts |
Daelemans, W. and van den Bosch, A. (1992).
We investigated the generalization capabilities of backpropagation
learning in feed-forward and recurrent feed-forward connectionist
networks on the assignment of syllable boundaries to orthographic
representations in Dutch (hyphenation). This is a difficult task
because phonological and morphological constraints interact, leading
to ambiguity in the input patterns. We compared the results to
different symbolic pattern matching approaches, and to an
exemplar-based generalization scheme, related to a k-nearest
neighbour approach, but using a similarity metric weighted by the
relative information entropy of positions in the training patterns.
Our results indicate that the generalization performance of
backpropagation learning for this task is not better than that of the
best symbolic pattern matching approaches, and of exemplar-based
generalization.
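
To illustrate the kind of entropy-weighted similarity metric mentioned above, here is a minimal sketch, not the authors' implementation. It assumes fixed-width symbolic patterns, weights each position by its information gain, and classifies with a single nearest neighbour; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(instances, labels, position):
    """Entropy reduction obtained by knowing the symbol at one position."""
    by_value = defaultdict(list)
    for features, label in zip(instances, labels):
        by_value[features[position]].append(label)
    remainder = sum(
        len(subset) / len(labels) * entropy(subset) for subset in by_value.values()
    )
    return entropy(labels) - remainder


def classify(query, instances, labels, weights):
    """Return the label of the stored instance with the smallest weighted overlap distance."""

    def distance(stored):
        # A mismatch at a position costs that position's weight.
        return sum(w for w, a, b in zip(weights, stored, query) if a != b)

    best = min(range(len(instances)), key=lambda i: distance(instances[i]))
    return labels[best]


# Hypothetical toy patterns: only the middle position predicts the class.
instances = [("a", "b", "c"), ("c", "b", "a"), ("a", "d", "c"), ("c", "d", "a")]
labels = ["1", "1", "0", "0"]
weights = [information_gain(instances, labels, p) for p in range(3)]
prediction = classify(("x", "b", "y"), instances, labels, weights)
print(weights, prediction)
```

In this toy set only the middle position receives a non-zero weight, so mismatches at the outer positions do not affect which stored pattern is nearest.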
It is traditionally assumed that various sources of linguistic
knowledge and their interaction should be formalised in order to be
able to convert words into their phonemic representations with
reasonable accuracy. We show that using supervised learning
techniques, based on a corpus of transcribed words, the same and even
better performance can be achieved, without explicit modeling of
linguistic knowledge.
In the traditional (knowledge-based) approach to the design of
grapheme-to-phoneme modules in text-to-speech systems, it is claimed
that various explicitly coded, language-specific, linguistic knowledge
sources are necessary for a good performance. Due to knowledge
acquisition bottlenecks, this implies long development cycles. As an
alternative, we propose to use inductive methods from machine learning
in a simple combined Trie Search and Similarity-Based Reasoning
approach and show that, for Dutch, its performance is better than that
of the knowledge-based approach and backpropagation learning.
Furthermore, we show that our approach is reusable for any language
for which a training corpus exists.
We report on an implemented grapheme-to-phoneme conversion
architecture. Given a set of examples (spelling words with their
associated phonetic representations) in a language, a
grapheme-to-phoneme conversion system is automatically produced for
that language which takes as its input the spelling of words, and
produces as its output the phonetic transcriptions according to the
rules implicit in the training data. This paper describes the
architecture and focuses on our solution to the alignment
problem: given the spelling and the phonetic transcription of a
word (often differing in length), these two representations have to be
aligned in such a way that grapheme symbols or strings of grapheme
symbols are consistently associated with the same phonetic symbol. If
this alignment has to be done by hand, it is extremely
labour-intensive.
We propose a quantitative operationalisation of the concept of
orthographic depth, which plays a crucial role in psycholinguistic
modelling of reading aloud (and learning to read aloud) in different
languages. The orthographic depth of a language is expressed
by measuring the complexity of letter-phoneme alignment and
the complexity of grapheme-phoneme correspondences within that
language. We present the alignment problem and the correspondence
problem as tasks to three different data-oriented learning
algorithms, and submit them to English, French and Dutch learning and
testing material. Generalisation performance metrics are used to
propose for each corpus a two-dimensional orthographic depth value.
We propose a quantitative operationalisation of the complexity of a
writing system. This complexity, also referred to as orthographic depth,
plays a crucial role in psycholinguistic modelling of reading aloud (and
learning to read aloud) in several languages. The complexity of a
writing system is expressed by two measures, viz. that of the complexity
of letter-phoneme alignment and that of the complexity of
grapheme-phoneme correspondences. We present the alignment problem and the
correspondence problem as tasks to three different data-oriented
learning algorithms, and submit them to English, French and Dutch learning
and testing material. Generalisation performance metrics are used to
propose for each corpus a two-dimensional writing system complexity
value.
We report on a series of experiments in which three machine-learning
algorithms are trained to hyphenate English words, viz.
the backpropagation algorithm, the cascade-backpropagation algorithm,
and the information-gain-tree algorithm (IG-Tree). English
hyphenation is an interesting testing ground for machine learning:
it has a few underlying principles and a large number of exceptions.
A successful learning algorithm must be able to deal with both.
The three learning algorithms vary in the way they compress the
training material: by extracting a limited number of general
rules (greedy learning), or by storing large amounts of instances
(lazy learning). Our experiments show that the lazy learning
algorithm (i.e., the IG-Tree algorithm)
scales up better than the two greedy learning algorithms
(the backpropagation algorithm and the cascade-backpropagation algorithm).
Moreover, our results call for including very large problem sets in
collections of machine-learning benchmark problems.
For many classification tasks, the set of
available task instances can be roughly divided into regular
instances and exceptions. We investigate three
learning algorithms that apply a different method of learning with
respect to regularities and exceptions, viz. (i) back-propagation,
(ii) cascade back-propagation (a constructive version of back-propagation),
and (iii) information-gain tree (an inductive decision-tree
algorithm). We compare the bias of the algorithms towards
learning regularities and exceptions, using a task-independent metric
for the typicality of instances. We have found that information-gain
tree is best capable of learning exceptions. However, it outperforms
back-propagation and cascade back-propagation only when trained on
very large training sets.
Decomposing a hard problem into easier sub-problems (`modularisation')
is a powerful problem-solving technique. Modularisation is often based
on expert knowledge and can lead to efficient high-performance
models. Contrasting with this expert-based approach is the approach of
machine-learning algorithms such as back-propagation and symbolic
inductive-learning algorithms that do not make use of a predetermined
modular architecture.
We present examples of machine-learned models without modules
of problems that are traditionally solved by expert-based
modularisation. The machine-learned models perform
as well as or better than the expert-based models. This
surprising fact gives rise to the question whether the performance of
machine-learned models could be further increased when modularisation is
somehow incorporated in the learning algorithms. We describe work
in progress on the development of machine learning algorithms that
automatically construct modular architectures during learning.
Morphological analysis is an important subtask in text-to-speech
conversion, hyphenation, and other language engineering tasks. The
traditional approach to performing morphological analysis is to
combine a morpheme lexicon, sets of (linguistic) rules, and heuristics
to find a most probable analysis. In contrast, we present an inductive
learning approach in which morphological analysis is reformulated as a
segmentation task. We report on a number of experiments
in which five inductive learning algorithms are applied to three
variations of the task of morphological analysis. Results show (i)
that the generalisation performance of the algorithms is good, and
(ii) that the lazy learning algorithm IB1-IG performs
best on all three tasks. We conclude that lazy learning of
morphological analysis as a classification task is indeed a viable
approach; moreover, it has the strong advantages of avoiding the
knowledge-acquisition bottleneck, being fast and deterministic in
learning and processing, and being language-independent.
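
One common way to recast segmentation as classification is to slide a fixed-width letter window over each word and label the focus letter. The sketch below assumes that encoding; the window width, the padding symbol, and the convention that a label of 1 marks a letter starting a new morpheme are illustrative choices, not details given in the abstract.

```python
def make_windows(word, boundaries, left=3, right=3, pad="_"):
    """Turn one word into (window, label) classification instances.

    boundaries is a set of letter indices at which a new morpheme starts
    (an assumed encoding). Each window holds the focus letter with `left`
    letters of context before it and `right` letters after it.
    """
    padded = pad * left + word + pad * right
    instances = []
    for i in range(len(word)):
        window = tuple(padded[i : i + left + 1 + right])
        label = 1 if i in boundaries else 0
        instances.append((window, label))
    return instances


# Hypothetical example: "books" analysed as book + s, boundary at index 4.
for window, label in make_windows("books", {4}, left=2, right=2):
    print("".join(window), label)
```

Each resulting instance can be handed to any classifier that accepts symbolic feature vectors, which is what makes the reformulation independent of a particular language or lexicon.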
Morphological analysis is an important subtask in text-to-speech
conversion, hyphenation, and other language engineering tasks. The
traditional approach to performing morphological analysis is to
combine a morpheme lexicon, sets of (linguistic) rules, and heuristics
to find a most probable analysis. In contrast, we present an inductive
learning approach in which morphological analysis is reformulated as a
segmentation task. We report on a number of experiments
in which five inductive learning algorithms are applied to three
variations of the task of morphological analysis. Results show (i)
that the generalisation performance of the algorithms is good, and
(ii) that the lazy learning algorithm IB1-IG performs
best on all three tasks. We conclude that lazy learning of
morphological analysis as a classification task is indeed a viable
approach; moreover, it has the strong advantages over the traditional
approach of avoiding the knowledge-acquisition bottleneck, being fast
and deterministic in learning and processing, and being
language-independent.
We describe an approach to grapheme-to-phoneme conversion which is
both language-independent and data-oriented. Given a set
of examples (spelling words with their associated phonetic
representation) in a language, a grapheme-to-phoneme conversion system
is automatically produced for that language which takes as its input
the spelling of words, and produces as its output the phonetic
transcription according to the rules implicit in the training data. We
describe the design of the system, and compare its performance to
knowledge-based and alternative data-oriented approaches.
Generalization
performance of backpropagation learning on a syllabification task.
Van den Bosch, A., and Daelemans, W. (1993).
Data-oriented methods
for grapheme-to-phoneme conversion.
In this paper we present two instances of this approach. A first
model implements a variant of instance-based learning, in which
a weighted similarity metric and a database of prototypical exemplars
are used to predict new mappings. In the second model,
grapheme-to-phoneme mappings are looked up in a compressed
text-to-speech lexicon (table lookup) enriched with default
mappings. We compare performance and accuracy of these approaches to a
connectionist (backpropagation) approach and to the linguistic
knowledge-based approach.
Daelemans, W. and van den Bosch, A. (1993).
Tabtalk: Reusability in
data-oriented grapheme-to-phoneme conversion.
Daelemans, W. and van den Bosch, A. (1994).
A
language-independent, data-oriented architecture for
grapheme-to-phoneme conversion.
Van den Bosch, A., Content, A., Daelemans, W., and De Gelder, B.
(1994).
Analysing orthographic depth of different
languages using data-oriented algorithms.
Van den Bosch, A., Content, A., Daelemans, W., and De Gelder, B.
(1994).
Measuring the complexity of writing systems.
Van den Bosch, A., Weijters, A., and van den Herik, H.J. (1995).
Scaling effects with greedy and lazy machine-learning algorithms.
Van den Bosch, A., Weijters, A., Van den Herik, J., and Daelemans,
W. (1995).
The profit of learning exceptions.
Van den Bosch, A., and Weijters, A. (1995).
Stretching the limits of learning without modules.
Van den Bosch, A., Daelemans, W., and Weijters, A. (1996).
An
inductive-learning approach to morphological analysis.
Van den Bosch, A., Daelemans, W., and Weijters,
A. (1996).
Daelemans, W., and Van den Bosch, A. (1997).
Language-independent
data-oriented grapheme-to-phoneme conversion.
The relation
between the orthography and the phonology of a language has
traditionally been modelled by hand-crafted rule sets.
Machine-learning (ML) approaches offer a means to gather this
knowledge automatically. Problems arise when the training material is
sparse. Generalising from sparse data is a well-known problem for many
ML algorithms. We present experiments in which connectionist,
instance-based, and decision-tree learning algorithms are applied to a
small corpus of Scottish Gaelic. The results show that instance-based
learning in the IB1-IG algorithm yields the best generalisation
performance, and that most algorithms tested perform tolerably
well. Given the availability of a lexicon, even if it is sparse, ML is
a valuable and efficient tool for automatic phonetic transcription of
written text.
Machine learning is becoming recognised as a source of generic and
powerful tools for tasks studied and implemented in language
technology. Lazy learning with information-theoretic similarity
matching has emerged as a salient approach, demonstrated to be superior
to other machine-learning approaches in various comparative
studies. It is asserted both in theoretical machine learning and in
reports on applications of machine learning to natural language that
the success of lazy learning may be due to the fact that language data
contains small disjuncts, i.e., small clusters of
identically-classified instances. We propose three measures to
discover small disjuncts in our data: (i) we count and analyse indexed
clusters of instances in induced decision trees; (ii) we count
clusters of friendly (identically-classified) instances
immediately surrounding instances by using similarity metrics from
lazy learning; (iii) we compare average sizes of friendly-instance
clusters using different similarity metrics. The measures are
illustrated by a sample language task, viz. word pronunciation. Two
conclusions are arrived at: (i) our data indeed contains large numbers
of small disjuncts of about three to a hundred instances, and (ii)
there are important differences in feature relevance in the data,
exploited appropriately when lazy learning is augmented with
information-theoretic similarity matching. We claim that the measures
introduced in this paper are useful for predicting the suitability of
lazy learning in general.
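
Measure (ii) can be sketched roughly as follows, under stated assumptions: for each instance, the other instances are ranked by plain overlap distance, and the identically-classified neighbours are counted until the first differently-classified one is met. The tie-breaking rule (storage order) and the toy data are hypothetical, and the paper's exact procedure may differ.

```python
def overlap_distance(a, b):
    """Number of positions at which two equal-length patterns differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def friendly_cluster_size(index, instances, labels):
    """Count same-class nearest neighbours before the first other-class one."""
    target = instances[index]
    others = [i for i in range(len(instances)) if i != index]
    # list.sort is stable, so equally distant instances keep storage order.
    others.sort(key=lambda i: overlap_distance(target, instances[i]))
    size = 0
    for i in others:
        if labels[i] != labels[index]:
            break
        size += 1
    return size


# Hypothetical toy data with two small disjuncts of two instances each.
instances = [("a", "b"), ("a", "c"), ("x", "y"), ("x", "z")]
labels = ["A", "A", "B", "B"]
sizes = [friendly_cluster_size(i, instances, labels) for i in range(len(instances))]
print(sizes)
```

Swapping `overlap_distance` for a feature-weighted distance gives the comparison described in measure (iii).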
Back-propagation learning (BP) is
known for its serious limitations in generalising knowledge from
certain types of learning material. BP-SOM is an extension of
BP which overcomes some of these limitations. BP-SOM is a
combination of a multi-layered feed-forward network (MFN)
trained with BP, and Kohonen's self-organising maps (SOMs).
In earlier reports, it has been shown that BP-SOM
improved generalisation performance while simultaneously
decreasing the number of necessary hidden units without loss of
generalisation performance. These are only two effects of the use of
SOM learning during training of MFNs. In this paper we
focus on two additional effects. First, we show that after BP-SOM
training, activations of hidden units of MFNs tend to
oscillate among a limited number of discrete values. Second, we
identify SOM elements as adequate organisers of instances of the
task at hand. We visualise both effects, and argue that they lead to
intelligible neural networks and can be employed as a basis for
automatic rule extraction.
Memory-based learning, keeping full memory of learning material,
appears to be a viable approach to learning NLP tasks, and is often
superior in generalisation accuracy to eager learning approaches that
abstract from learning material. Here we investigate three
partial memory-based learning approaches which remove from memory
specific task instance types estimated to be exceptional. The three
approaches each implement one heuristic function for estimating
exceptionality of instance types: (i) typicality, (ii) class
prediction strength, and (iii) friendly-neighbourhood
size. Experiments are performed with the memory-based learning algorithm
IB1-IG trained on English word pronunciation. We find that
removing instance types with low prediction strength (ii) is the only
tested method which does not seriously harm generalisation
accuracy. We conclude that keeping full memory of types rather
than tokens, and excluding minority ambiguities appear to be the only
performance-preserving optimisations of memory-based learning.
In leading morpho-phonological theories and state-of-the-art
text-to-speech systems it is assumed that word pronunciation cannot be
learned or performed without in-between analyses at several
abstraction levels (e.g., morphological, graphemic, phonemic,
syllabic, and stress levels). We challenge this assumption for the
case of English word pronunciation. Using IGTree, an
inductive-learning decision-tree algorithm, we train and test three
word-pronunciation systems in which the number of abstraction levels
(implemented as sequenced modules) is reduced from five, via three, to
one. The latter system, classifying letter strings directly as mapping
to phonemes with stress markers, yields significantly better
generalisation accuracies than the two multi-module systems. Analyses
of empirical results indicate that positive utility effects of
sequencing modules are outweighed by cascading errors passed on
between modules.
We report on a series of experiments with simple recurrent networks
(SRNs) solving phoneme prediction in continuous phonemic data.
The purpose of the experiments is to investigate whether the network
output could function as a source for syllable boundary detection. We
show that this is possible, using a generalisation of the network which
resembles the linguistic sonority principle. We argue that the
primary generalisation of the network, i.e., the fact that sonority varies
in a hat-shaped way across phonemic strings, ending and starting at
syllable boundaries, is an indication that sonority might be a major cue
in discovering the essential building blocks of language when confronted
with unsegmented running speech. The segment which is most directly
related to sonority patterns, the syllable, has received considerable
attention in psycholinguistics as being an element of natural language that
is easily grasped by language learners. The phoneme prediction network
presents a simulation of the necessary bootstrap to arrive at the
discovery of syllabic segmentation in unsegmented speech, which can be
used as a basis for the segmentation of larger structures like words.
Empirical studies in inductive language learning point at
pure memory-based learning as a successful approach to many language
learning tasks, often performing better than learning methods that
abstract from the learning material. The possibility is left open,
however, that limited, careful abstraction in memory-based learning
may be harmless to generalisation, as long as the disjunctivity of
language data is preserved. We compare three types of careful
abstraction: editing, oblivious (partial) decision-tree abstraction,
and generalised instances, in a single-task study. Only when combined
with feature weighting can careful abstraction equal pure
memory-based learning. In a multi-task case study we find that the
FAMBL algorithm, a new careful abstractor which merges families
of instances, performs close to pure memory-based learning, though it
equals it only on one task. On the basis of the gathered empirical
results, we argue for the incorporation of the notion of instance
families, i.e., carefully generalised instances, in memory-based
language learning.
Wolters, M., and Van den Bosch, A. (1997).
Automatic Phonetic
Transcription of Words Based On Sparse Data.
Van den Bosch, A., Weijters, A., Van den Herik, H.J., and Daelemans,
W. (1997).
When small disjuncts abound, try lazy learning: A case study.
Weijters, A., Van den Bosch, A., and Van den Herik, H.J. (1997).
Intelligible neural networks with BP-SOM.
Van den Bosch, A., Weijters, A., and Daelemans, W. (1998).
Do not forget: Full memory in memory-based learning of word
pronunciation.
Van den Bosch, A. and Daelemans, W. (1998).
Modularity in inductively-learned word pronunciation systems.
Vroomen, J., Van den Bosch, A., and De Gelder, B. (1998).
A
connectionist model for bootstrap learning of syllabic
structure.
Van den Bosch, A. (draft, conditionally accepted).
Careful abstraction over instance families in memory-based language learning.
| Last update: Tue Dec 8 1998 |