Implicit Linguistics: Machine Learning of Text-to-Text Processing
  

Embedding

Implicit Linguistics (2006-2011) was funded by NWO, the Netherlands Organisation for Scientific Research, as part of the Vici program (see also NWO's project info). The project took place within the ILK research group of the Tilburg centre for Cognition and Communication and the Department of Communication and Information Sciences of the Tilburg School of Humanities.

        

Implicit learning

Natural language processing models and systems typically employ abstract linguistic representations (syntactic, semantic, or pragmatic) as intermediate working units. The question underlying the Implicit Linguistics project is whether we can do without them, since any invented intermediate structure is always implicitly encoded somehow in the words at the surface, and the way they are ordered.

Text-to-text processing

Classes of natural language processing tasks in which this question can be investigated in extremo are those in which form is mapped to form, i.e., in which neither the input nor the output contains abstract elements to begin with. The project focuses on spelling correction (converting a corrupted text to a clean, intended form of the same text) and machine translation (converting a text in one language to a text in another language carrying approximately the same meaning). The approach is in fact a generic one, applicable to any text-to-text task such as paraphrasing and summarization, and even extending to question answering, dialogue, and document retrieval.

Memory-based machine translation

Many current machine translation tools, such as translate.google.com, indeed implement a direct mapping of source to target text and leave all of syntax and semantics implicit: both remain hidden in statistical translation models between collocationally strong phrases, and in statistical language models of the target language. Our take on this problem involves using context on the source side, and using memory-based classification as a translation model. A first attempt mapped word trigrams of the source language to word trigrams of the target language; an ideal processing example is shown in the animation.
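The trigram-to-trigram mapping can be illustrated with a minimal sketch. It assumes a tiny hand-made memory of aligned Dutch-English trigrams and a plain positional overlap metric; the names MEMORY, overlap, and translate_trigram are hypothetical and are not taken from the project's MBMT software or from TiMBL.

```python
from collections import Counter

# Toy memory of aligned examples (hypothetical data):
# Dutch source trigram -> English target trigram. "_" marks sentence padding.
MEMORY = [
    (("_", "de", "kat"), ("_", "the", "cat")),
    (("de", "kat", "slaapt"), ("the", "cat", "sleeps")),
    (("kat", "slaapt", "_"), ("cat", "sleeps", "_")),
    (("de", "hond", "slaapt"), ("the", "dog", "sleeps")),
]


def overlap(a, b):
    """Number of positions at which two trigrams carry the same word."""
    return sum(x == y for x, y in zip(a, b))


def translate_trigram(source, memory=MEMORY, k=1):
    """Return the target trigram voted for by the k nearest stored examples."""
    # sorted() is stable, so ties keep the order in which examples were stored.
    ranked = sorted(memory, key=lambda ex: overlap(source, ex[0]), reverse=True)
    votes = Counter(target for _, target in ranked[:k])
    return votes.most_common(1)[0][0]


# An unseen source trigram is mapped via its most similar stored neighbour.
print(translate_trigram(("een", "hond", "slaapt")))
```

The sketch covers only the classification step; combining overlapping target trigrams into a full output sentence is left out.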

 

Memory-based spelling correction

Analogously, we develop context-based spelling correction techniques. Here, text-to-text processing translates to converting dirty text to clean text; the underlying task is to discover what the author intended to write. Our approach handles spelling errors across the board: typos, confusibles (to versus too versus two), and specific morpho-syntactic agreement errors such as the infamous homophonous ending of Dutch verbs in d, t, or dt.
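As an illustration of context-based confusible correction, the following sketch counts the left and right neighbours of each confusible in a toy corpus and proposes the alternative that fits the context better. The corpus, the scoring weights, and all function names are assumptions made for this example; they do not describe the method behind Valkuil.net.

```python
from collections import Counter

CONFUSIBLES = {"to", "too", "two"}

# Toy training text (hypothetical); a real system would use a large corpus.
CORPUS = ("i want to go . she has two cats . it is too late . "
          "we have two dogs . he likes to read").split()


def neighbours(tokens, i):
    """Left and right neighbour of position i, with "_" as padding."""
    left = tokens[i - 1] if i > 0 else "_"
    right = tokens[i + 1] if i + 1 < len(tokens) else "_"
    return left, right


def train(tokens, confusibles=CONFUSIBLES):
    """Count each confusible with its full, left-only, and right-only context."""
    counts = Counter()
    for i, word in enumerate(tokens):
        if word in confusibles:
            left, right = neighbours(tokens, i)
            counts[(left, word, right)] += 1
            counts[(left, word, None)] += 1
            counts[(None, word, right)] += 1
    return counts


def score(counts, left, candidate, right):
    """Full-context matches weigh double; one-sided matches act as back-off."""
    return (2 * counts[(left, candidate, right)]
            + counts[(left, candidate, None)]
            + counts[(None, candidate, right)])


def suggest(tokens, counts, confusibles=CONFUSIBLES):
    """List (position, written, suggested) where context prefers another form."""
    corrections = []
    for i, word in enumerate(tokens):
        if word not in confusibles:
            continue
        left, right = neighbours(tokens, i)
        best = max(sorted(confusibles),
                   key=lambda c: score(counts, left, c, right))
        if score(counts, left, best, right) > score(counts, left, word, right):
            corrections.append((i, word, best))
    return corrections


COUNTS = train(CORPUS)
print(suggest("she has to cats".split(), COUNTS))
```

A suggestion is made only when the alternative scores strictly higher than the written word, so a confusible in an unseen context is left untouched.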

We built the online Dutch context-sensitive spelling checker Valkuil.net to showcase this technology. If you write Dutch texts, you may consider checking them with this demo.

 

Implicit linguistics

In the end, the project aims at linking with ideas in linguistics on the bottom-up emergence of notions like "constructions" from raw data, and on implicit language learning. An important part of the project is therefore to investigate the hidden, implicit presence of constructions (tying lexical units, ranging from single morphemes to multiple words, to syntactic and semantic functions) embedded in the way a memory-based language model organizes its memory.

This demo visualizes a sub-class of constructions, namely high-frequency n-grams. In the demo you can either look up a word and see in which constructions it occurs, or enter a sentence (or select one from a news website) and see which constructions occur in that sentence. In the results you can find translations that have been generated by the statistical machine translation system Moses. You can search for words in English, French, German, Italian, Spanish, and Dutch; translations are available between English and the other languages. More information about the demo can be found on this page.
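The two lookups offered by the demo can be sketched as follows, under the assumption that a construction is simply a bigram or trigram occurring at least twice in a toy corpus. The corpus, the threshold, and the function names are hypothetical; the demo's actual data and selection criteria are not specified here.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_ngrams(sentences, max_n=3, min_count=2):
    """Collect bigrams up to max_n-grams that occur at least min_count times."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(2, max_n + 1):
            counts.update(ngrams(tokens, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}


def constructions_with_word(word, table):
    """Look up a word: every frequent n-gram that contains it."""
    return sorted(gram for gram in table if word in gram)


def constructions_in_sentence(sentence, table, max_n=3):
    """Look up a sentence: every frequent n-gram that occurs in it."""
    tokens = sentence.lower().split()
    found = []
    for n in range(2, max_n + 1):
        found.extend(gram for gram in ngrams(tokens, n) if gram in table)
    return found


# Toy corpus (hypothetical data).
TABLE = frequent_ngrams([
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat sat on a chair",
])

print(constructions_with_word("cat", TABLE))
print(constructions_in_sentence("my cat sat on the sofa", TABLE))
```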

Media

News and media coverage

Software

  • TiMBL, Tilburg Memory-Based Learner, an efficient implementation of k-nearest neighbor classification and even faster and highly scalable IGTree decision tree induction
  • ABL, Alignment-Based Learning, a grammatical inference system
  • Dimbl, parallel TiMBL for multi-CPU machines
  • Timpute, a TiMBL wrapper for textual database auto-cleaning by imputation
  • Paramsearch, automatic wrapper-based ML parameter selection

  • WOPR, Memory-based language modeling
  • MBMT, Memory-based machine translation
  • PBMBMT, Phrase-based memory-based machine translation
  • DEMOCRAT, Consensus-driven synthesis of machine translation outputs

  • Frog, integrated Dutch morphological analyzer, tagger, and dependency parser
  • Clam, computational linguistics application mediator
  • MBT, Memory-based tagger-generator and tagger
  • Ucto, Unicode tokenizer

  • suffixtree, C++ suffix tree datatype
  • sarrays, C++ suffix array datatype
Project publications

For more publications, see ILK publications and technical reports

2011
  • Stehouwer, H. (2011). Statistical language models for alternative sequence selection. Ph.D. thesis, Tilburg University. [pdf]
  • Van Gompel, M., Van den Bosch, A., and Berck, P. (2011). Extending memory-based machine translation to phrases. In T. Markus, P. Monachesi, and E. Westerhout (Eds.), Computational Linguistics in the Netherlands 2010: Selected Papers from the Twentieth CLIN Meeting. Utrecht, LOT: pp. 45-58. [pdf]
  • Wubben, S., Van den Bosch, A., and Krahmer, E. (2011). Paraphrasing headlines by machine translation. In T. Markus, P. Monachesi, and E. Westerhout (Eds.), Computational Linguistics in the Netherlands 2010: Selected Papers from the Twentieth CLIN Meeting. Utrecht, LOT: pp. 169-183. [pdf]
  • Van den Bosch, A., and Bouma, G., editors (2011). Interactive multi-modal question answering. Berlin: Springer Verlag. ISBN 978-3-642-17524-4. [publisher webpage]
  • Canisius, S., Van den Bosch, A., and Daelemans, W. (2011). Constraint satisfaction inference for entity recognition. In A. van den Bosch and G. Bouma (Eds.), Interactive multi-modal question answering, pp. 199-219. Berlin: Springer Verlag. [publisher webpage]
  • Haque, R., Kumar Naskar, S., Van den Bosch, A., and Way, A. (2011). Integrating source-language context into phrase-based statistical machine translation. Machine Translation, 23:3, pp. 239-285. [journal webpage]
  • Bogers, T., and Van den Bosch, A. (2011). Fusing recommendations for social bookmarking. International Journal of Electronic Commerce, 15:3, pp. 33-75. [journal page]
  • Van den Bosch, A. (2011). Effects of context and recency in scaled word completion. Computational Linguistics in the Netherlands Journal, 1, pp. 79-94. [pdf]
2010
  • Haque, R., Kumar Naskar, S., Van den Bosch, A., and Way, A. (2010). Supertags as source language context in hierarchical phrase-based SMT. In Proceedings of AMTA 2010: The Ninth Conference of the Association for Machine Translation in the Americas, Denver, CO., pp. 210-219. [pdf]
  • Van den Bosch, A., Nauts, P., and Eckhardt, N. (2010). A Kid's Open Mind Common Sense. In Commonsense Knowledge: Papers from the AAAI Fall Symposium, Arlington, VA., pp. 114-119. [pdf]
  • Van Zaanen, M., and Kanters, P. (2010). Automatic mood classification using tf*idf based on lyrics. In Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR 2010), pp. 75-80. [pdf]
  • Van Zaanen, M., and Gaustad, T. (2010). Grammatical inference as class discrimination. In José Sempere and Pedro García (Eds.), Grammatical Inference: Theoretical Results and Applications, Lecture Notes in Computer Science 6339, pp. 245-257. Berlin/Heidelberg: Springer Verlag. [pdf]
  • Katrenko, S., and Van Zaanen, M. (2010). Rademacher Complexity and grammar induction algorithms: What it may (not) tell us. In José Sempere and Pedro García (Eds.), Grammatical Inference: Theoretical Results and Applications, Lecture Notes in Computer Science 6339, pp. 293-296. Berlin/Heidelberg: Springer Verlag. [pdf]
  • Stehouwer, H. and Van Zaanen, M. (2010). Enhanced suffix arrays as language models: Virtual k-testable languages. In Proceedings of ICGI 2010, Valencia, Spain. [pdf]
  • Stehouwer, H. and Van Zaanen, M. (2010). Using suffix arrays as language models: Scaling the n-gram. In Proceedings of the 22nd Benelux Conference on Artificial Intelligence (BNAIC-2010), Luxembourg. [pdf]
  • Stehouwer, H. and Van Zaanen, M. (2010). Finding patterns in strings using suffix arrays. In Proceedings of Computational Linguistics - Applications 2010, pp. 151-158. [pdf]
  • Van Gompel, M. (2010). UvT-WSD1: A cross-lingual word sense disambiguation system. In SemEval'10: Proceedings of the 5th International Workshop on Semantic Evaluation, Uppsala, Sweden, pp. 238-241. [pdf]
  • Reynaert, M. (2010). Character confusion versus focus word-based correction of spelling and OCR variants in corpora. International Journal on Document Analysis and Recognition [DOI 10.1007/s10032-010-0133-5]
  • Wubben, S., Van den Bosch, A., and Krahmer, E. (2010). Paraphrase generation as monolingual translation: Data and evaluation. In J. Kelleher, B. Mac Namee, and I. van der Sluis (Eds.), Proceedings of the 10th International Workshop on Natural Language Generation (INLG 2010), pp. 203-207, Dublin, July 2010. [pdf]
  • Daelemans, W., and Van den Bosch, A. (2010). Memory-based learning. In A. Clark, C. Fox, and S. Lappin (Eds.), Handbook of Computational Linguistics and Natural Language Processing. Oxford, UK: Wiley-Blackwell Publishers, pp. 154-179. [book webpage]
2009
  • Stehouwer, H., and Van Zaanen, M. (2009). Token merging in language model-based confusible disambiguation. In T. Calders, K. Tuyls, and M. Pechenizkiy (Eds.), Proceedings of the 21st Benelux Conference on Artificial Intelligence (BNAIC-2009), pp. 241-248. [pdf]
  • Van Gompel, M., Van den Bosch, A., and Berck, P. (2009). Extending memory-based machine translation to phrases. In M. Forcada and A. Way (Eds.), Proceedings of the Third Workshop on Example-Based Machine Translation, Dublin, Ireland, pp. 79-86. [pdf]
  • Haque, R., Naskar, S., Van den Bosch, A., and Way, A. (2009). Dependency relations as source context in phrase-based SMT. In Proceedings of PACLIC 23: the 23rd Pacific Asia Conference on Language, Information and Computation, Hong Kong, China, pp. 170-179. [pdf]
  • Morante, R., Van Asch, V., and Van den Bosch, A. (2009). Dependency parsing and semantic role labeling as a single task. In Proceedings of the 7th International Conference on Recent Advances in Natural Language Processing (RANLP-2009), pp. 275-280. [pdf]
  • Reynaert, M. (2009). Parallel identification of the spelling variants in corpora. In Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data 2009 (AND-2009), Barcelona, Spain, pp. 77-84. [pdf]
  • Morante, R., Van Asch, V., and Van den Bosch, A. (2009). Joint memory-based learning of syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL): Shared Task, pp. 25-30. Stroudsburg, PA: Association for Computational Linguistics. [pdf]
  • Canisius, S., and Van den Bosch, A. (2009). A constraint satisfaction approach to machine translation. In Proceedings of the 13th Annual Meeting of the European Association for Machine Translation (EAMT-2009), pp. 182-189. [pdf]
  • Stehouwer, H., and Van Zaanen, M. (2009). Language models for contextual error detection and correction. In Proceedings of the EACL 2009 Workshop on Computational Linguistic Aspects of Grammatical Inference, Athens, Greece, pp. 41-48. [pdf]
  • Wubben, S., Van den Bosch, A., Krahmer, E., and Marsi, E. (2009). Clustering and matching headlines for automatic paraphrase acquisition. In Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009), Athens, Greece, pp. 122-125. [pdf]
  • Van den Bosch, A., and Berck, P. (2009). Memory-based machine translation and language modeling. The Prague Bulletin of Mathematical Linguistics No. 91, pp. 17-26. [pdf]
  • Bogers, T., and Van den Bosch, A. (2009). Language modeling for spam detection in social reference manager websites. In R. Aly, C. Hauff, I. den Hamer, D. Hiemstra, T. Huibers, and F. de Jong (Eds.), Proceedings of the 9th Belgian-Dutch Information Retrieval Workshop (DIR 2009), pp. 87-94. [pdf]
  • Wubben, S., and Van den Bosch, A. (2009). A semantic relatedness metric based on free link structure. In H.C. Bunt, V. Petukhova, and S. Wubben (Eds.), Proceedings of the Eighth International Conference on Computational Semantics (IWCS-8), pp. 355-359. [pdf]
  • Stehouwer, H. and Van den Bosch, A. (2009). Putting the t where it belongs: Solving a confusion problem in Dutch. In S. Verberne, H. van Halteren, and P.-A. Coppen (Eds.), Computational Linguistics in the Netherlands 2007: Selected Papers from the 18th CLIN Meeting, January 22, 2009, Groningen, pp. 21-36. [pdf]
2008
  • Reynaert, M. (2008). Non-interactive OCR post-correction for giga-scale digitization projects. In A. Gelbukh (Ed.), Proceedings of the Computational Linguistics and Intelligent Text Processing 9th International Conference, CICLing 2008. Lecture Notes in Computer Science Vol. 4919/2008, Berlin / Heidelberg: Springer, pp. 617-630. [pdf, first page - corrected, post-publication version]
  • Hoste, V., and Van den Bosch, A. (2008). A modular approach to learning Dutch co-reference. In C. Johansson (Ed.), Proceedings from the First Bergen Workshop on Anaphora Resolution (WAR I), Bergen, Norway, pp. 51-75. [pdf]
  • Oostdijk, N., Reynaert, M., Monachesi, P., Van Noord, G., Ordelman, R., Schuurman, I., and Vandeghinste, V. (2008). From D-Coi to SoNaR: A reference corpus for Dutch. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08). Marrakech, Morocco, 2008. [pdf]
  • Reynaert, M. (2008). All, and only, the errors: More complete and consistent spelling and OCR-error correction evaluation. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08). Marrakech, Morocco, 2008. [pdf]
  • Van den Bosch, A., and Bogers, T. (2008). Efficient Context-Sensitive Word Completion for Mobile Devices. In MobileHCI 2008: Proceedings of the 10th International Conference on Human-Computer Interaction with Mobile Devices and Services, IOP-MMI special track, pp. 465-470. Amsterdam, The Netherlands, September 2008. [pdf]
  • Bogers, T., and Van den Bosch, A. (2008). Using language models for spam detection in social bookmarking. In Proceedings of 2008 ECML/PKDD Discovery Challenge Workshop, pp. 1-12. Antwerp, Belgium, September 2008. [pdf]
2007
  • Van den Bosch, A., and Van der Sloot, K. (2007). Superlinear parallelization of k-nearest neighbor retrieval. In M. Dastani and E. de Jong (Eds.), Proceedings of the 19th Belgian-Dutch Artificial Intelligence Conference (BNAIC-2007), Utrecht, The Netherlands, pp. 65-72. [pdf]
  • Stroppa, N., Van den Bosch, A., and Way, A. (2007). Exploiting source similarity for SMT using context-informed features. In A. Way and B. Gawronska (Eds.), Proceedings of the 11th International Conference on Theoretical Issues in Machine Translation (TMI 2007), Skövde, Sweden, pp. 231-240. [pdf]
  • Canisius, S., and Van den Bosch, A. (2007). Recompiling a knowledge-based dependency parser into memory. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-2007), Borovets, Bulgaria, pp. 104-108. [pdf]
  • Van den Bosch, A., Stroppa, N., and Way, A. (2007). A memory-based classification approach to marker-based EBMT. In F. Van Eynde, V. Vandeghinste, and I. Schuurman (Eds.), Proceedings of the METIS-II Workshop on New Approaches to Machine Translation, pp. 63-72. January 11, 2007, Leuven, Belgium. [pdf]
  • Soudi, A., Van den Bosch, A., and Neumann, G., editors (2007). Arabic computational morphology: Knowledge-based and empirical methods. Berlin: Springer. 308 p. ISBN 978-1-4020-6045-8. [publisher webpage]
  • Van den Bosch, A., Busser, G.J., Canisius, S., and Daelemans, W. (2007). An efficient memory-based morpho-syntactic tagger and parser for Dutch. In P. Dirix, I. Schuurman, V. Vandeghinste, and F. Van Eynde (Eds.), Computational Linguistics in the Netherlands: Selected Papers from the Seventeenth CLIN Meeting, Leuven, Belgium, pp. 99-114. [preprint pdf]
  • Van den Bosch, A., Marsi, E., and Soudi, A. (2007). Memory-based morphological analysis and part-of-speech tagging of Arabic. In Soudi, A., Van den Bosch, A., and Neumann, G. (Eds), Arabic computational morphology: Knowledge-based and empirical methods, Chapter 11, pp. 203-219. Berlin: Springer. [pdf of preprint]
  • Van den Bosch, A., and Van der Sloot, K. (2007). Superlinear parallelisation of the k-nearest neighbor classifier. In P. Adriaans, M. van Someren, and S. Katrenko (Eds.), Proceedings of the 18th BENELEARN Conference. May 14, 2007, Amsterdam, The Netherlands. [pdf]
2006
  • Van den Bosch, A. (2006). All-words prediction as the ultimate confusible disambiguation. In Proceedings of the HLT-NAACL Workshop on Computationally hard problems and joint inference in speech and language processing, June 2006, New York City, NY. [pdf]
  • Van den Bosch, A. (2005). Scalable classification-based word prediction and confusible correction. Traitement Automatique des Langues, 46:2, pp. 39-63. [pdf of preprint]
Antal.vdnBosch@uvt.nl