Universiteit van Tilburg

Glossary

This page lists (in alphabetical order) some of the terminology around the main topics -- language technology, text mining and machine learning -- involved in the MITCH project.

Artificial intelligence

Artificial intelligence (AI) refers to intelligent behaviour exhibited by a computer. Besides text-oriented applications (such as MITCH), the term AI covers a broad spectrum of disciplines, from expert systems to genetic algorithms and from face recognition to robotics. It combines research areas from biology, computational linguistics, computer science, mathematics, operations research, philosophy and psychology.

In the context of natural language processing, "intelligence" can be defined as a machine's capability to carry on a human-like conversation. This idea was described by Alan Turing in his 1950 paper "Computing Machinery and Intelligence". The test he proposed, now known as the Turing test, proceeds as follows: a human judge engages in a natural language conversation with two other parties, one a human and the other a machine; if the judge cannot reliably tell which is which, the machine is said to pass the test. It is assumed that both the human and the machine try to appear human. To test the machine's linguistic capability specifically, the conversation is usually limited to a text-only channel.

Cognitive science

Cognitive science is a research area that studies the human mind and intelligence. This broad characterization often narrows down to subsystems of human cognition described in formal models in terms of symbols, propositions and logic. Cognitive science differs from the more pragmatic AI research, which seeks to build useful machines, not to model humans.

Computational linguistics

Computational linguistics deals with the formal modeling of natural language from a computational perspective. It covers processing at all levels, from phonology through syntax and semantics to style and pragmatics, and does so in an interdisciplinary manner, combining (amongst others) linguistics, computer science, artificial intelligence, cognitive psychology and logic.

Data mining

Data mining uses sophisticated data search capabilities and statistical algorithms to find correlations and patterns in large pre-existing databases. The discovered clusters induce hypotheses for (causal) relations. Data mining incorporates techniques from statistics, information retrieval, machine learning and other AI techniques. From a machine learning perspective, the activity of identifying clusters is characterized as an unsupervised method.
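
As an illustration of cluster identification as an unsupervised method, the following is a minimal sketch of k-means clustering, one common technique among many. The function name k_means, the sample records and the choice of two clusters are assumptions made for this example; they do not come from the MITCH project.

```python
import random


def k_means(points, k, iterations=20, seed=0):
    """Cluster 2-D points into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignments


# Hypothetical records with two numeric attributes each (no labels given).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)
print(assignments)
```

No labels are supplied: the grouping emerges from the data alone, and it is then up to the analyst to interpret what the two groups might mean.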

Historically, the data mining area emerged as a response to the challenges and opportunities faced by the database community in the 1990s, which saw explosive growth in the amount of data. Applications are traditionally found in banking, insurance and (direct) marketing.

Language technology

In most contexts, language technology is synonymous with computational linguistics. Where a difference between the two exists, it is primarily one of emphasis: while computational linguistics is about modeling aspects of language and testing these models using computers (in a top-down fashion), language technology focuses on extracting knowledge from large amounts of data (in a bottom-up fashion). Analogous to the AI / cognitive science distinction, language technology tends to be more pragmatic and application-oriented.

Information retrieval

Information retrieval (IR) is the art and science of searching for information in large bodies of textual material. This involves searching for information in documents (or searching for documents themselves), searching for meta-data that describes documents (bibliographic descriptions, keywords), or searching (networked, hypertext-connected) IR databases. IR concentrates on the issues of how to find meaningful index keywords and how to organise them, using quantitative methods (statistics, "relevance ranking" of results).
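
To make "relevance ranking" concrete, here is a minimal sketch that scores documents against a query using TF-IDF weighting, a standard quantitative method in IR. The function names, the whitespace tokenizer and the three sample documents are assumptions chosen for illustration; real systems use more careful tokenization and weighting.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer: lower-case and split on whitespace.
    return text.lower().split()


def rank(query, documents):
    """Return document indices sorted by descending TF-IDF score for the query."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        # Sum, over query terms present in the document, of
        # (relative term frequency) * (inverse document frequency).
        score = sum(
            (tf[term] / len(doc)) * math.log(n_docs / df[term])
            for term in tokenize(query)
            if term in tf
        )
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


# Hypothetical mini-collection.
documents = [
    "fossil collection of the natural history museum",
    "museum opening hours and ticket prices",
    "fossil fossil records from the field expedition",
]
print(rank("fossil records", documents))  # [2, 0, 1]
```

The third document ranks first because it contains both query terms, and the rare term "records" carries more weight than "fossil", which occurs in two of the three documents.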

Compared to text mining, IR is a relatively old discipline. Research dates back to the 1950s and 1960s, when the need for automated retrieval systems became evident to cope with the information explosion in scientific literature.

Machine learning

Machine learning is an area of artificial intelligence concerned with the development of techniques that allow computers to "learn" through the analysis of data sets. Machine learning combines aspects of information theory, statistics and computer science to study both the analysis of data and the algorithmic complexity of computational implementations. A major part of research efforts is in the development of tractable approximations or simulations of the actual statistics and inference algorithms, which in their most general form are often infeasible (NP-hard).

Machine learning has been applied to search engines, medical diagnosis, detecting credit card fraud, stock market analysis, classifying DNA sequences, speech and handwriting recognition, game playing and robot locomotion.

Machine learning can be subdivided into supervised learning (or classification), unsupervised learning (clustering) and reinforcement learning ("robotics"). The MITCH project will focus on unsupervised learning for finding relations between different resource sets and on (partially) supervised techniques for correction and normalization at (text) field level.
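
As a small illustration of supervised learning, the sketch below classifies a new item by copying the label of its nearest labelled example (a 1-nearest-neighbour classifier). The function name, the feature values and the labels "small" and "large" are hypothetical; this is not the technique used in MITCH, only an example of learning from labelled data.

```python
def nearest_neighbour(labelled, query):
    """Return the label of the training example closest to the query."""

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(labelled, key=lambda example: squared_distance(example[0], query))
    return label


# Hypothetical training data: (features, label) pairs provided by a human.
labelled = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 8.0), "large"),
    ((9.0, 8.5), "large"),
]

print(nearest_neighbour(labelled, (2.0, 1.0)))  # small
print(nearest_neighbour(labelled, (7.5, 9.0)))  # large
```

The contrast with unsupervised learning is that the labels are given in advance; the algorithm only has to generalize them to unseen items.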

Operations research

Operations research (OR) originally referred to the planning of (military) operations. The name arose during World War II, as Allied military planners looked for ways to improve their logistics and training schedules.

It uses mathematical models, statistics, combinatorial and search algorithms to aid in decision-making. It is most often used to analyse complex real-world systems, typically with the goal of improving or optimising performance. Artificial intelligence, like many computer science-related disciplines, has strong roots in OR.

Text mining

Text data mining, or simply text mining, refers to the process of extracting interesting and non-trivial information and knowledge from plain, unstructured text. As a relatively young discipline, it combines much of the recent progress from information retrieval, data mining, machine learning, statistics and computational linguistics.

The MITCH project will use information extraction (the extraction of (semi-)structured information from unstructured documents), named entity recognition (search and classification of atomic elements in unstructured text) and factoid detection.
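
The following sketch gives a rough idea of what finding and classifying atomic elements in unstructured text involves. It uses two hand-written regular expressions as a crude stand-in for a real named entity recogniser, which would normally be a trained model. The patterns, the labels YEAR and NAME, and the sample record are all assumptions for illustration.

```python
import re

# A four-digit year between 1500 and 2099.
YEAR = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")
# Two or more consecutive capitalised words: a crude stand-in for a name.
NAME = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def extract_entities(text):
    """Return (value, label) pairs in order of appearance in the text."""
    entities = [(m.start(), m.group(), "YEAR") for m in YEAR.finditer(text)]
    entities += [(m.start(), m.group(), "NAME") for m in NAME.finditer(text)]
    return [(value, label) for _, value, label in sorted(entities)]


# Hypothetical free-text field from a collection record.
record = "Collected by Jan Jansen near Leiden in 1897."
print(extract_entities(record))
# [('Jan Jansen', 'NAME'), ('1897', 'YEAR')]
```

Note that the single-word place name "Leiden" is missed by these patterns, which shows why hand-written rules are usually replaced or complemented by learned models.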


© 2005 Tilburg University, Tijn Porcelijn | Last update: 2 August 2005