-
The MBT tagger is a fast and accurate Part-of-Speech tagger that is generated
automatically from a tagged example corpus by Memory-Based Learning techniques.
Currently the demo includes three languages. You can type a sentence, and
the tagger assigns a Part-of-Speech category to each word by extrapolation
from the most similar example in the training data. Each word is encoded
as a pattern: for known words, the lexical entry plus context; for unknown
words, suffix and prefix letters plus context.
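The two kinds of pattern can be illustrated with a minimal sketch. It is not the MBT implementation: the toy lexicon, the function name `encode`, the feature order, and the choice of one prefix letter and three suffix letters are all assumptions made for illustration.

```python
# Hypothetical toy lexicon: word -> ambiguity class (the tags seen for that word).
LEXICON = {"the": "DT", "can": "MD-NN-VB", "rusts": "NNS-VBZ"}


def encode(words, prev_tags, i, lexicon=LEXICON):
    """Build a feature pattern for words[i].

    prev_tags holds the tags already assigned to words[:i].
    Tokens are assumed to be non-empty strings.
    """
    word = words[i].lower()
    # Left context: the tag already chosen for the previous word ("_" at sentence start).
    left = prev_tags[i - 1] if i > 0 else "_"
    # Right context: the ambiguity class of the next word ("_" at sentence end,
    # "?" if the next word is itself unknown).
    if i + 1 < len(words):
        right = lexicon.get(words[i + 1].lower(), "?")
    else:
        right = "_"
    if word in lexicon:
        # Known word: lexical entry plus context.
        return ("known", left, lexicon[word], right)
    # Unknown word: first letter, last three letters (padded), plus context.
    padded = word.rjust(3, "_")
    return ("unknown", word[0], *padded[-3:], left, right)


print(encode(["The", "can", "rusts"], [], 0))
print(encode(["the", "blorf", "rusts"], ["DT"], 1))
```

A tagger built this way would compare each pattern with the stored training patterns of the same kind and copy the tag of the closest match.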
-
TreeTalk
converts words to their phonemic transcription (with word stress
markers) and also produces WAV or AU speech output, synthesised by MBROLA.
The current demo incorporates two modules, each trained with the IGTree
decision tree algorithm: the first converts graphemes to phonemes, and the
second assigns word stress to the phonemic string generated by the
first module. Words are converted letter by letter; each letter serves
as the focus of a fixed-width window, which is matched against the windows
stored in the decision tree. The trees are trained on Dutch and English
word-pronunciation pairs from the CELEX lexical database.
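The letter-by-letter windowing can be sketched as follows. The window width (three letters on either side of the focus), the padding character, and the function name `letter_windows` are assumptions; the text only states that the window has a fixed width.

```python
def letter_windows(word, left=3, right=3, pad="_"):
    """Return one fixed-width window per letter, with that letter in focus.

    Each window has `left` context letters, the focus letter, and `right`
    context letters; positions outside the word are filled with `pad`.
    """
    padded = pad * left + word + pad * right
    width = left + 1 + right
    return [padded[i:i + width] for i in range(len(word))]


for window in letter_windows("book"):
    # The focus letter always sits at index `left` (here 3) of the window.
    print(window, "focus:", window[3])
```

In a grapheme-to-phoneme setting, each window would be paired with the phoneme of its focus letter, so a word of n letters yields n training instances.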
-
MBMA analyses
the morphology of Dutch words. Its engine is a TiMBL server operating on a
data set of Dutch wordform morphology from CELEX. MBMA performs
a one-step mapping of instances (letters in their immediate letter
context) to complex classes encoding segmentation, derivation,
affixation, inflection, spelling changes, and POS tagging. The demo
operates in two modes. In normal mode, MBMA performs basic
segmentation, redoes spelling changes, determines inflectional
features, and assigns POS tags. In the experimental daring
mode, some additional heuristics are applied to identify the (hierarchical)
bindings and relations between morphemes. The daring mode is still under
development.
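The one-step mapping from letter-in-context instances to classes can be sketched as a nearest-neighbour lookup. This is a simplified illustration, not the TiMBL server's behaviour: it uses a plain unweighted overlap distance, and the three-letter windows, the stored instances, and the class labels "0" and "INFL" are invented for the example.

```python
from collections import Counter


def overlap_distance(a, b):
    """Count the positions at which two equal-length instances differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(instance, memory, k=1):
    """Return the majority class among the k stored instances nearest to `instance`."""
    ranked = sorted(memory, key=lambda item: overlap_distance(instance, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical memory built from the letters of "werkt" (one letter of context
# on each side); "INFL" marks the letter that starts an inflectional ending.
MEMORY = [
    ("_we", "0"),
    ("wer", "0"),
    ("erk", "0"),
    ("rkt", "0"),
    ("kt_", "INFL"),
]

# An unseen window, e.g. the final letter of another verb form.
print(classify("lt_", MEMORY))
```

Because every letter is classified independently, a full analysis is obtained by reading off the classes of all windows in a word, which is what makes the mapping one-step.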
-
Shallow parsing is a useful preprocessing step for many Natural
Language Processing applications. After shallow parsing, sentences are no
longer just sequences of words but carry some structure: groups of words
that closely belong together are marked, and specific relations between
(groups of) words are identified. In contrast to full parsing, shallow
parsing does not attempt to find a structure comprising the whole
sentence, so it is in general much faster.
The Memory-Based Shallow Parser (MBSP) applies several modules to an
English sentence supplied by the user. It first assigns a
Part-of-Speech tag to each word in the sentence (see MBT). Next,
MBSP recognises chunks (non-overlapping, non-embedded
constituents). Finally, MBSP assigns subjects and objects to the verbal
chunks in the sentence. MBSP is trained on the Wall Street Journal
(WSJ) treebank; a link to more recent WSJ material is included.
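What non-overlapping, non-embedded chunks look like can be shown with a small sketch. It assumes that chunk boundaries are encoded as per-word IOB tags ("B-NP", "I-NP", "O"), a common convention that the text above does not confirm for MBSP; the function name `chunks_from_iob` and the example sentence are likewise illustrative.

```python
def chunks_from_iob(tokens, iob_tags):
    """Group tokens into flat chunks from per-token IOB tags.

    "B-X" starts a chunk of type X, "I-X" continues it, and "O" marks a token
    outside any chunk. The result is a flat list of (type, text) pairs, so
    chunks can neither overlap nor contain one another.
    """
    chunks, current = [], None
    for token, tag in zip(tokens, iob_tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and (current is None or current[0] != tag[2:])
        )
        if starts_new:
            current = (tag[2:], [token])
            chunks.append(current)
        elif tag.startswith("I-"):
            current[1].append(token)
        else:
            current = None
    return [(label, " ".join(words)) for label, words in chunks]


tokens = ["The", "cat", "chased", "a", "mouse", "."]
tags = ["B-NP", "I-NP", "B-VP", "B-NP", "I-NP", "O"]
print(chunks_from_iob(tokens, tags))
```

A later step could then link noun chunks to the verbal chunk as subject or object, which corresponds to the final stage described above.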