Natural language processing models and systems typically employ
abstract linguistic representations (syntactic, semantic, or
pragmatic) as intermediate working units. The question underlying the
Implicit Linguistics project is whether we can do without them, since
any invented intermediate structure is always implicitly
encoded in the words at the surface and in the way they are
ordered.
Text-to-text processing
The natural language processing tasks in which this question can be
investigated in extremo are those in which form is mapped to form,
i.e., in which neither the input nor the output contains abstract
elements to begin with. The project focuses on spelling
correction (converting a corrupted text to a clean, intended form
of the same text) and machine translation (converting a text in
one language to a text in another language carrying approximately the
same meaning). The approach is in fact a generic one, applicable to
any text-to-text task such as paraphrasing and summarization, and even extending to question answering, dialogue, and document retrieval.
Memory-based machine translation
Many current machine translation tools, such
as translate.google.com,
indeed implement a direct mapping of source to target text, leaving
all of syntax and semantics implicit: these remain hidden in
statistical translation models between collocationally strong phrases,
and in statistical language models of the target language. Our take on
this problem uses context on the source side, and uses
memory-based classification as the translation model. A first attempt
mapped word trigrams of the source language to word trigrams of the
target language; an ideal processing example is shown in the
animation.
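The trigram-to-trigram mapping can be illustrated with a toy sketch. This is not the project's implementation: the tiny Dutch-English memory, the positional overlap metric, the function names, and the shortcut of keeping only the middle word of each predicted target trigram are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical memory of (source trigram, target trigram) pairs.
# A real system would extract these from a word-aligned parallel corpus.
MEMORY = [
    (("_", "de", "kat"), ("_", "the", "cat")),
    (("de", "kat", "slaapt"), ("the", "cat", "sleeps")),
    (("kat", "slaapt", "_"), ("cat", "sleeps", "_")),
    (("de", "hond", "slaapt"), ("the", "dog", "sleeps")),
    (("_", "de", "hond"), ("_", "the", "dog")),
    (("hond", "slaapt", "_"), ("dog", "sleeps", "_")),
]


def overlap(a, b):
    """Count the positions at which two trigrams hold the same word."""
    return sum(x == y for x, y in zip(a, b))


def classify(source_trigram, k=1):
    """Return the majority target trigram among the k nearest stored sources."""
    ranked = sorted(
        MEMORY,
        key=lambda pair: overlap(source_trigram, pair[0]),
        reverse=True,
    )
    votes = Counter(target for _, target in ranked[:k])
    return votes.most_common(1)[0][0]


def translate(sentence):
    """Classify each source trigram and keep the middle word of the result."""
    words = ["_"] + sentence.split() + ["_"]
    output = []
    for i in range(1, len(words) - 1):
        target = classify(tuple(words[i - 1:i + 2]))
        output.append(target[1])
    return " ".join(output)


print(translate("de kat slaapt"))  # the cat sleeps
print(translate("de kat"))         # the cat (second trigram is unseen)
```

In the second call the trigram "de kat _" is not in memory, so the nearest stored neighbour supplies the translation; this is the sense in which source-side context drives the mapping without any explicit syntactic or semantic representation.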
Memory-based spelling correction
Analogously, we develop context-based spelling correction techniques. Here, text-to-text processing amounts to converting dirty text to clean text; the underlying task is to discover what the author intended to write. Our approach handles spelling errors across the board: typos, confusibles (to versus too versus two), and specific morpho-syntactic agreement errors such as the notorious homophonous Dutch verb endings in d, t, or dt.
We built the online Dutch context-sensitive spelling
checker Valkuil.net to
showcase this technology. If you write Dutch texts, you may consider
checking them with this demo.
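A minimal sketch of context-sensitive confusible correction, under stated assumptions: the clean training text, the one-word-left and one-word-right context window, and all function names are hypothetical, and the sketch says nothing about how Valkuil.net is actually built.

```python
from collections import Counter

CONFUSIBLES = {"to", "too", "two"}

# Hypothetical clean text; a real checker would learn from a large corpus.
CLEAN_TEXT = (
    "i want to go home . she has two cats . that is too much . "
    "we need to eat . he bought two books . it is too late ."
)


def context_of(words, i):
    """Return the (left, right) neighbours of position i, padded with '_'."""
    left = words[i - 1] if i > 0 else "_"
    right = words[i + 1] if i + 1 < len(words) else "_"
    return (left, right)


def build_memory(text):
    """Store the context of every confusible found in the clean text."""
    words = text.split()
    return [
        (context_of(words, i), word)
        for i, word in enumerate(words)
        if word in CONFUSIBLES
    ]


def predict(memory, context):
    """Vote among the stored examples whose context overlaps most."""
    scored = [
        (sum(a == b for a, b in zip(context, stored)), label)
        for stored, label in memory
    ]
    best = max(score for score, _ in scored)
    if best == 0:
        return None  # no evidence: leave the word as written
    votes = Counter(label for score, label in scored if score == best)
    return votes.most_common(1)[0][0]


def correct(memory, sentence):
    """Replace each confusible by the form its context predicts."""
    words = sentence.split()
    fixed = list(words)
    for i, word in enumerate(words):
        if word in CONFUSIBLES:
            guess = predict(memory, context_of(words, i))
            if guess is not None:
                fixed[i] = guess
    return " ".join(fixed)


memory = build_memory(CLEAN_TEXT)
print(correct(memory, "it is to late"))  # it is too late
```

The classifier never consults a grammar; the choice between to, too, and two follows only from the neighbouring words stored in memory.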
Implicit linguistics
In the end, the project aims to connect with ideas in linguistics on the bottom-up
emergence of notions like "constructions" from raw data, and on
implicit language learning. An important part of the project is therefore to investigate the
hidden, implicit presence of constructions (which tie lexical units,
ranging from single morphemes to multiple words, to syntactic and semantic
functions) embedded in the way a memory-based language model organizes its
memory.
This demo visualizes a subclass of constructions: n-grams with high frequency. In the demo you can either look up a word and see in which constructions it occurs, or enter a sentence (or select one from a news website) and see which constructions occur in that sentence. In the results you can find translations that have been generated by the statistical machine translation system Moses. You can search for words in English, French, German, Italian, Spanish, and Dutch; translations are available between English and the other languages. More information about the demo can be found on this page.
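The two lookups the demo offers can be sketched as follows. The corpus, the frequency threshold, the n-gram lengths, and the function names are assumptions for illustration; the demo's actual selection criteria and data are not described here.

```python
from collections import Counter


def frequent_ngrams(sentences, n_max=3, min_count=2):
    """Count all n-grams (2 <= n <= n_max) and keep the frequent ones."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for n in range(2, n_max + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {ngram: c for ngram, c in counts.items() if c >= min_count}


def lookup_word(ngrams, word):
    """Return the frequent n-grams in which the word occurs."""
    return sorted(ngram for ngram in ngrams if word in ngram)


def lookup_sentence(ngrams, sentence):
    """Return the frequent n-grams that occur in the sentence."""
    words = sentence.lower().split()
    found = []
    for ngram in ngrams:
        n = len(ngram)
        if any(tuple(words[i:i + n]) == ngram
               for i in range(len(words) - n + 1)):
            found.append(ngram)
    return sorted(found)


# Hypothetical three-sentence corpus.
corpus = [
    "at the end of the day",
    "in the end it worked",
    "at the end of the road",
]
ngrams = frequent_ngrams(corpus)
print(lookup_word(ngrams, "end"))
print(lookup_sentence(ngrams, "in the end of the week"))
```

With this corpus, "the end" is counted three times and "in the" only once, so the former is kept as a candidate construction and the latter is dropped.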
New project BasiLex, a Nijmegen-Tilburg-Leiden-Amsterdam joint project to develop a corpus and lexicon of language encountered by children in primary school, is seeking a developer. December 9, 2010
New demo launched: constructicon
demo. Either look up a word and see in which constructions it
occurs, or enter a sentence (or select one from a news website) and
see which constructions occur in that sentence. Discover the dark
matter between lexicon, syntax, and semantics. September 25, 2010
ILK organizes a symposium on Text
Mining in the Real World on June 30, 2010, with speakers Eduard
Hovy (ISI) and Piek Vossen (Free University Amsterdam). The symposium precedes the Ph.D. defense of Marieke van Erp on June 30, 2010, 14:00. Entrance is free; please register. See webpage for more details. June 10, 2010
Dutch popular science
website Kennislink
mentions
TICCL, Text-Induced Corpus Cleanup, developed by ILK's Martin
Reynaert, in an in-depth article on
optical character recognition and how language technology can help
improve OCR (in Dutch). January 16, 2009
Antal van den
Bosch gives lecture on "computer common sense" at the Tilburg-Eindhoven
Children's
University (Kinderuniversiteit, site is in Dutch), targeted
at children of age 9-12, on
November 10, 2008. September 27, 2008
The Implicit Linguistics project receives funding from NWO NCF for running experiments on the SGI Origin 3000 MIPS/Altix 3700 at SARA, the Dutch national academic computing centre. May 22, 2007
Martin Reynaert is one of the winners
of the BNC
XML Prize, based on a submitted bug report indicating 3,500
spelling errors in the BNC. Martin is awarded a free copy of the new BNC
corpus. April 5, 2007
Dutch popular science magazine Quest quotes Antal van den Bosch and other
Dutch AI researchers in a cover article "Mens versus Computer" (man vs
machine). February 14, 2007
Software
TiMBL, Tilburg
Memory-Based Learner, an efficient implementation of k-nearest
neighbor classification and even faster and highly scalable IGTree
decision tree induction
ABL, Alignment-Based Learning, a grammatical inference system
2011
Stehouwer, H. (2011). Statistical language models for alternative sequence selection. Ph.D. thesis, Tilburg University. [pdf]
Van Gompel, M., Van den Bosch, A., and Berck,
P. (2011). Extending memory-based machine translation to phrases. In
T. Markus, P. Monachesi, and E. Westerhout (Eds.), Computational
Linguistics in the Netherlands 2010: Selected Papers from the
Twentieth CLIN Meeting. Utrecht, LOT:
pp. 45-58. [pdf]
Wubben, S., Van den Bosch, A., and Krahmer,
E. (2011). Paraphrasing headlines by machine translation. In
T. Markus, P. Monachesi, and E. Westerhout (Eds.), Computational
Linguistics in the Netherlands 2010: Selected Papers from the
Twentieth CLIN Meeting. Utrecht, LOT:
pp. 169-183. [pdf]
Van den Bosch, A., and Bouma, G., editors (2011). Interactive multi-modal question answering. Berlin: Springer Verlag. ISBN 978-3-642-17524-4.
[publisher webpage]
Canisius, S., Van den Bosch, A., and Daelemans, W. (2011). Constraint satisfaction inference for entity recognition. In A. van den Bosch and G. Bouma (Eds.), Interactive multi-modal question answering, pp. 199-219. Berlin: Springer Verlag. [publisher webpage]
Haque, R., Kumar Naskar, S., Van den Bosch, A., and Way, A. (2011). Integrating source-language context into phrase-based statistical machine translation. Machine Translation, 23:3, pp. 239-285. [journal webpage]
Bogers, T., and Van den Bosch, A. (2011). Fusing recommendations for social bookmarking. International Journal of Electronic Commerce, 15:3, pp. 33-75. [journal page]
Van den Bosch, A. (2011). Effects of context and recency in
scaled word completion. Computational Linguistics in the
Netherlands Journal, 1,
pp. 79-94. [pdf]
2010
Haque, R., Kumar Naskar, S., Van den Bosch, A., and Way, A. (2010). Supertags as source language context in hierarchical phrase-based SMT. In Proceedings of AMTA 2010: The Ninth Conference of the Association for Machine Translation in the Americas, Denver, CO., pp. 210-219. [pdf]
Van den Bosch, A., Nauts, P., and Eckhardt, N. (2010). A Kid's Open Mind Common Sense. In Commonsense Knowledge: Papers from the AAAI Fall Symposium, Arlington, VA., pp. 114-119. [pdf]
Van Zaanen, M., and Kanters, P. (2010). Automatic mood classification using tf*idf based on lyrics. In Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR 2010), pp. 75-80. [pdf]
Van Zaanen, M., and Gaustad, T. (2010). Grammatical inference as class discrimination. In José Sempere and Pedro García (Eds.), Grammatical Inference: Theoretical Results and Applications, Lecture Notes in Computer Science 6339, pp. 245-257. Berlin/Heidelberg: Springer Verlag. [pdf]
Katrenko, S., and Van Zaanen, M. (2010). Rademacher Complexity and grammar induction algorithms: What it may (not) tell us. In José Sempere and Pedro García (Eds.), Grammatical Inference: Theoretical Results and Applications, Lecture Notes in Computer Science 6339, pp. 293-296. Berlin/Heidelberg: Springer Verlag. [pdf]
Stehouwer, H. and Van Zaanen, M. (2010). Enhanced suffix arrays as language models: Virtual k-testable languages. In Proceedings of ICGI 2010, Valencia, Spain. [pdf]
Stehouwer, H. and Van Zaanen, M. (2010). Using suffix arrays as language models: Scaling the n-gram. In Proceedings of the 22nd Benelux Conference on Artificial Intelligence (BNAIC-2010), Luxembourg. [pdf]
Stehouwer, H. and Van Zaanen, M. (2010). Finding patterns in strings using suffix arrays. In Proceedings of Computational Linguistics - Applications 2010, pp. 151-158. [pdf]
Van Gompel, M. (2010). UvT-WSD1: A cross-lingual word sense disambiguation system. In SemEval'10: Proceedings of the 5th International Workshop on Semantic Evaluation, Uppsala, Sweden, pp. 238-241. [pdf]
Reynaert, M. (2010). Character confusion versus focus word-based correction of spelling and OCR variants in corpora. International Journal on Document Analysis and Recognition [DOI 10.1007/s10032-010-0133-5]
Wubben, S., Van den Bosch, A., and Krahmer, E. (2010). Paraphrase
generation as monolingual translation: Data and evaluation. In
J. Kelleher, B. Mac Namee, and I. van der Sluis
(Eds.), Proceedings of the 10th International Workshop on Natural
Language Generation (INLG 2010), pp. 203-207, Dublin, July 2010. [pdf]
Daelemans, W., and Van den Bosch, A. (2010). Memory-based
learning. In A. Clark, C. Fox, and S. Lappin (Eds.), Handbook of
Computational Linguistics and Natural Language Processing. Oxford,
UK: Wiley-Blackwell Publishers, pp. 154-179. [book webpage]
2009
Stehouwer, H., and Van Zaanen, M. (2009). Token merging in language model-based confusible disambiguation. In T. Calders, K. Tuyls, and M. Pechenizkiy (Eds.), Proceedings of the 21st Benelux Conference on Artificial Intelligence (BNAIC-2009), pp. 241-248. [pdf]
Van Gompel, M., Van den Bosch, A., and Berck, P. (2009). Extending memory-based machine translation to phrases. In M. Forcada and A. Way (Eds.), Proceedings of the Third Workshop on Example-Based Machine Translation, Dublin, Ireland, pp. 79-86. [pdf]
Haque, R., Naskar, S., Van den Bosch, A., and Way, A. (2009). Dependency relations as source context in phrase-based SMT. In Proceedings of PACLIC 23: the 23rd Pacific Asia Conference on Language, Information and Computation, Hong Kong, China, pp. 170-179. [pdf]
Morante, R., Van Asch, V., and Van den Bosch, A. (2009). Dependency parsing and semantic role labeling as a single task. In Proceedings of the 7th International Conference on Recent Advances in Natural Language Processing (RANLP-2009), pp. 275-280. [pdf]
Reynaert, M. (2009). Parallel identification of the spelling variants in corpora. In Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data 2009 (AND-2009), Barcelona, Spain, pp. 77-84. [pdf]
Morante, R., Van Asch, V., and Van den Bosch, A. (2009). Joint memory-based learning of syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL): Shared Task, pp. 25-30. Stroudsburg, PA: Association for Computational Linguistics. [pdf]
Canisius, S., and Van den Bosch, A. (2009). A constraint satisfaction approach to machine translation. In Proceedings of the 13th Annual Meeting of the European Association for Machine Translation (EAMT-2009), pp. 182-189. [pdf]
Stehouwer, H., and Van Zaanen, M. (2009). Language models for contextual error detection and correction. In Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference, Athens, Greece, pp. 41-48. [pdf]
Wubben, S., Van den Bosch, A., Krahmer, E., and Marsi, E. (2009). Clustering and matching headlines for automatic paraphrase acquisition. In Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009), Athens, Greece, pp. 122-125. [pdf]
Van den Bosch, A., and Berck, P. (2009). Memory-based machine translation and language modeling. The Prague Bulletin of Mathematical Linguistics No. 91, pp. 17-26. [pdf]
Bogers, T., and Van den Bosch, A. (2009). Language modeling for spam detection in social reference manager websites. In R. Aly, C. Hauff, I. den Hamer, D. Hiemstra, T. Huibers, and F. de Jong (Eds.), Proceedings of the 9th
Belgian-Dutch Information Retrieval Workshop (DIR 2009), pp. 87-94.
[pdf]
Wubben, S., and Van den Bosch, A. (2009). A semantic relatedness metric based on free link structure. In H.C. Bunt, V. Petukhova, and S. Wubben (Eds.), Proceedings of the Eighth International Conference on Computational Semantics (IWCS-8), pp. 355-359. [pdf]
Stehouwer, H. and Van den Bosch, A. (2009). Putting the t where it belongs: Solving a confusion problem in Dutch. In S. Verberne, H. van Halteren, and P.-A. Coppen (Eds.), Computational Linguistics in the Netherlands 2007: Selected Papers from the 18th CLIN Meeting, January 22, 2009, Groningen, pp. 21-36. [pdf]
2008
Reynaert, M. (2008). Non-interactive OCR post-correction for giga-scale digitization projects. In A. Gelbukh (Ed.), Proceedings of the Computational Linguistics and Intelligent Text Processing 9th International Conference, CICLing 2008. Lecture Notes in Computer Science Vol. 4919/2008, Berlin / Heidelberg: Springer, pp. 617-630.
[pdf, first page - corrected, post-publication version]
Hoste, V., and Van den Bosch, A. (2008). A modular approach to learning Dutch co-reference. In C. Johansson (Ed.), Proceedings from the First Bergen Workshop on Anaphora Resolution (WAR I), Bergen, Norway, pp. 51-75.
[pdf]
Oostdijk, N., Reynaert, M., Monachesi, P., Van Noord, G., Ordelman, R., Schuurman, I., and Vandeghinste, V. (2008). From D-Coi to SoNaR: A reference corpus for Dutch. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08). Marrakech, Morocco, 2008.
[pdf]
Reynaert, M. (2008). All, and only, the errors: More complete and consistent spelling and OCR-error correction evaluation. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08). Marrakech, Morocco, 2008.
[pdf]
Van den Bosch, A., and Bogers, T. (2008). Efficient Context-Sensitive Word Completion for Mobile Devices. In MobileHCI 2008: Proceedings of the 10th International Conference on Human-Computer Interaction with Mobile Devices and Services, IOP-MMI special track, pp. 465-470. Amsterdam, The Netherlands, September 2008.
[pdf]
Bogers, T., and Van den Bosch, A. (2008). Using language models for spam detection in social bookmarking. In Proceedings of 2008 ECML/PKDD Discovery Challenge Workshop, pp. 1-12. Antwerp, Belgium, September 2008.
[pdf]
2007
Van den Bosch, A., and Van der Sloot, K. (2007). Superlinear parallelization of k-nearest neighbor retrieval. In M. Dastani and E. de Jong (Eds.), Proceedings of the 19th Belgian-Dutch Artificial Intelligence Conference (BNAIC-2007), Utrecht, The Netherlands, pp. 65-72. [pdf]
Stroppa, N., Van den Bosch, A., and Way, A. (2007). Exploiting source similarity for SMT using context-informed features. In A. Way and B. Gawronska (Eds.), Proceedings of the 11th International Conference on Theoretical Issues in Machine Translation (TMI 2007), Skövde, Sweden, pp. 231-240. [pdf]
Canisius, S., and Van den Bosch, A. (2007). Recompiling a
knowledge-based dependency parser into memory. In Proceedings of
the International Conference on Recent Advances in Natural Language
Processing (RANLP-2007), Borovets, Bulgaria, pp. 104-108. [pdf]
Van den Bosch, A., Stroppa, N., and Way, A. (2007). A memory-based
classification approach to marker-based EBMT. In F. Van Eynde,
V. Vandeghinste, and I. Schuurman (Eds.), Proceedings of the METIS-II
Workshop on New Approaches to Machine Translation, pp. 63-72. January 11,
2007, Leuven, Belgium. [pdf]
Soudi, A., Van den Bosch, A., and Neumann, G., editors
(2007). Arabic computational morphology: Knowledge-based and
empirical methods. Berlin: Springer. 308 p. ISBN
978-1-4020-6045-8. [publisher
webpage]
Van den Bosch, A., Busser, G.J., Canisius, S., and Daelemans,
W. (2007). An efficient memory-based morpho-syntactic tagger and
parser for Dutch. In P. Dirix, I. Schuurman, V. Vandeghinste, and
F. Van Eynde (Eds.), Computational Linguistics in the Netherlands:
Selected Papers from the Seventeenth CLIN Meeting, Leuven,
Belgium, pp. 99-114. [preprint
pdf]
Van den Bosch, A., Marsi, E., and Soudi, A. (2007). Memory-based
morphological analysis and part-of-speech tagging of Arabic. In Soudi,
A., Van den Bosch, A., and Neumann, G. (Eds.), Arabic computational
morphology: Knowledge-based and empirical methods, Chapter 11,
pp. 203-219. Berlin: Springer. [pdf of
preprint]
Van den Bosch, A., and Van der Sloot, K. (2007). Superlinear parallelisation of the k-nearest neighbor classifier. In P. Adriaans, M. van Someren, and S. Katrenko (Eds.), Proceedings of the 18th BENELEARN Conference. May 14, 2007, Amsterdam, The Netherlands. [pdf]
2006
Van den Bosch, A. (2006). All-words prediction as the ultimate
confusible disambiguation. In Proceedings of the HLT-NAACL Workshop
on Computationally hard problems and joint inference in speech and
language processing, June 2006, New York City, NY. [pdf]
2005
Van den Bosch, A. (2005). Scalable classification-based word
prediction and confusible correction. Traitement Automatique des
Langues, 46:2, pp. 39-63. [pdf
of preprint]