This page lists (in alphabetical order) some of the terminology around the main topics -- language technology, text mining and machine learning -- involved in the MITCH project.
Artificial intelligence (AI) refers to intelligent behaviour exhibited by a
computer. Besides text-oriented applications (such as MITCH), the term AI
covers a broad range of disciplines ranging from expert systems to genetic
algorithms and from face recognition to robotics. It combines research areas
from biology, computational linguistics, computer science,
mathematics, operations research, philosophy and psychology.
In the context of natural language processing it should be noted that
"intelligence" can be defined as the machine's capability to perform
human-like conversation. This idea, now known as the Turing test, was
described by Alan Turing in his 1950 paper "Computing Machinery and
Intelligence". The test proceeds as follows: a human judge
engages in a natural language conversation with two other parties, one a
human and the other a machine; if the judge cannot reliably tell which is
which, then the machine is said to pass the test. It is assumed that both
the human and the machine try to appear human. To explicitly test the
machine's linguistic capability, the conversation is usually limited
to a text-only channel.
Cognitive science is a research area that studies the human mind and intelligence. This broad characterization often narrows down to subsystems of human cognition described in formal models in terms of symbols, propositions and logic. Cognitive science differs from the more pragmatic AI research, which seeks to build useful machines, not to model humans.
Computational linguistics deals with the formal modeling of natural language from a computational perspective. It covers processing at all levels, from phonology through syntax and semantics to style and pragmatics, in an interdisciplinary manner that combines (amongst others) linguistics, computer science, artificial intelligence, cognitive psychology and logic.
Data mining uses sophisticated data search capabilities and statistical
algorithms to find correlations and patterns in large pre-existing databases.
The discovered clusters suggest hypotheses about (causal) relations.
Data mining incorporates techniques from statistics, information retrieval,
machine learning and other areas of AI. From a machine
learning perspective, the activity of identifying clusters is
characterized as an unsupervised method.
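The clustering step described above can be illustrated with a minimal sketch. It assumes one-dimensional numeric data and uses a plain k-means loop. The function name `kmeans_1d` and the sample values are hypothetical and chosen only for illustration; this is not a description of any particular data mining product.

```python
import random


def kmeans_1d(values, k, iterations=20, seed=0):
    """Group numbers into k clusters with a plain k-means loop (illustrative only)."""
    rng = random.Random(seed)
    # Start from k distinct data points as the initial centroids.
    centroids = rng.sample(values, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assignment step: each value joins the cluster of its nearest centroid.
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(sorted(c) for c in clusters)


# Hypothetical sample: no labels are supplied, the groups emerge from the data.
amounts = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
groups = kmeans_1d(amounts, k=2)
```

No labels are passed in, which is what makes the method unsupervised: the two groups are found from the values alone.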
Historically, the data mining area emerged as a response to the challenges
and opportunities faced by the database community in the 1990s, which saw an
explosion in the volume of data. Applications are traditionally found in
banking, insurance and (direct) marketing.
In most contexts, language technology is synonymous with computational linguistics. If a difference between the two exists, it is primarily one of emphasis: while computational linguistics is about modeling aspects of language and testing these models using computers (in a top-down fashion), language technology focuses on extracting knowledge from large amounts of data (in a bottom-up fashion). Analogous to the AI / cognitive science distinction, language technology tends to be more pragmatic and application-oriented.
Information retrieval (IR) is the art and science of searching for
information in large bodies of textual material. This involves searching for
information in documents (or searching for documents themselves), searching
for meta-data that describes documents (bibliographic descriptions,
keywords), or searching (networked, hypertext-connected) IR databases. IR
concentrates on the issues of how to find meaningful index keywords and how
to organise them, using quantitative methods (statistics, "relevance
ranking" of results).
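As an illustration of statistical relevance ranking, the following sketch scores documents against a query with a simple TF-IDF weighting. TF-IDF is one common weighting scheme, used here as an assumption; the text above does not prescribe a particular method. The names `tokenize` and `rank`, the whitespace tokenisation and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokeniser: lowercase and split on whitespace."""
    return text.lower().split()


def rank(query, documents):
    """Return document indices ordered from most to least relevant."""
    tokenized = [tokenize(d) for d in documents]
    n = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for idx, doc in enumerate(tokenized):
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # Frequent in this document, rare in the collection: high weight.
                score += (tf[term] / len(doc)) * math.log(n / df[term])
        scores.append((score, idx))
    # Highest score first; ties are broken by document order.
    return [idx for _, idx in sorted(scores, key=lambda s: (-s[0], s[1]))]


# Hypothetical mini-collection.
docs = [
    "text mining extracts knowledge from text",
    "robot locomotion and game playing",
    "information retrieval searches text collections",
]
order = rank("text mining", docs)
```

Terms that occur in every document receive a weight of zero, which is how this weighting favours "meaningful" index keywords over common words.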
Compared to text mining, IR is a relatively old discipline. Research dates
back to the 1950s and 1960s, when the information explosion in scientific
literature made the need for automated retrieval systems evident.
Machine learning is an area of artificial
intelligence concerned with the development of techniques that allow
computers to "learn" through the analysis of data sets. Machine learning
combines aspects of information theory, statistics and computer science
to study both the analysis of data and the algorithmic complexity of
computational implementations. A major part of research efforts is in the
development of tractable approximations or simulations of the actual
statistics and inference algorithms, which in their most general form are
often infeasible (NP-hard).
Machine learning has been applied to
search engines,
medical diagnosis,
detecting credit card fraud,
stock market analysis,
classifying DNA sequences,
speech and handwriting recognition,
game playing and robot locomotion.
Machine learning can be subdivided into supervised learning (or
classification), unsupervised learning (clustering) and reinforcement
learning ("robotics"). The MITCH project will focus on unsupervised
learning for finding relations between different resource sets and
(partially) supervised techniques for correction and normalization at
(text) field level.
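To make the supervised case concrete, here is a minimal sketch of a nearest-neighbour classifier. Unlike clustering, it relies on training examples that already carry a label. The function name, the two-dimensional feature vectors and the labels are hypothetical, and nearest-neighbour classification is just one simple supervised technique, not necessarily the one used in MITCH.

```python
def nearest_neighbour(train, point):
    """Return the label of the labelled example closest to `point` (1-NN)."""

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Each training example is a (features, label) pair.
    _, label = min(train, key=lambda example: squared_distance(example[0], point))
    return label


# Hypothetical labelled training data: (feature vector, class label).
train = [
    ((1.0, 1.0), "short"),
    ((1.2, 0.8), "short"),
    ((8.0, 9.0), "long"),
    ((9.0, 8.5), "long"),
]

prediction = nearest_neighbour(train, (1.1, 0.9))
```

The labels in `train` are what distinguishes this from the unsupervised setting, where the algorithm would have to discover the two groups by itself.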
Operations research (OR) originally referred to the planning of (military)
operations. (The name arose during World War II, as Allied military planners
looked for ways to improve their logistics and training schedules.)
It uses mathematical models, statistics, combinatorial and search algorithms
to aid in decision-making. It is most often used to analyse complex
real-world systems, typically with the goal of improving or optimising
performance.
Artificial intelligence, like many computer science-related
disciplines, has strong roots in OR.
Text data mining, or simply text mining, refers to the process of extracting
interesting and non-trivial information and knowledge from plain and
unstructured text. As a relatively young discipline, it combines much of the
recent progress from
information retrieval,
data mining,
machine learning,
statistics and
computational linguistics.
The MITCH project will use
information extraction (the extraction of (semi-)structured information from
unstructured documents),
named entity recognition (search and classification of atomic elements in
unstructured text) and
factoid detection.
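A very small sketch of named entity recognition is given below. It assumes a hand-made lookup list (gazetteer) plus a regular expression for years; real systems typically use trained statistical models. The gazetteer entries, the label names and the function `find_entities` are hypothetical.

```python
import re

# Hypothetical gazetteer mapping known surface forms to entity classes.
GAZETTEER = {"Alan Turing": "PERSON", "MITCH": "PROJECT"}

# Four-digit years from 1000 to 2099.
YEAR = re.compile(r"\b(?:1[0-9]{3}|20[0-9]{2})\b")


def find_entities(text):
    """Return (start offset, surface form, label) tuples in text order."""
    found = []
    # Search step: locate every gazetteer entry in the text.
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            # Classification step: the label comes from the gazetteer.
            found.append((match.start(), match.group(), label))
    for match in YEAR.finditer(text):
        found.append((match.start(), match.group(), "YEAR"))
    return sorted(found)


entities = find_entities("Alan Turing described the test in 1950.")
```

The result turns a stretch of unstructured text into (semi-)structured records, which is the step that information extraction builds on.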