Caroline Sporleder & Kalliopi Zervanou (eds.) Proceedings of the ECAI 2010 Workshop on Language
Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2010), August 16, 2010, Lisbon, Portugal.
Accepted papers list
- Nils Reiter, Oliver Hellwig, Anand Mishra, Irina Gossmann, Borayin Maitreya Larios, Julio Cezar Rodrigues, Britta Zeller and Anette Frank:
Adapting Standard NLP Tools and Resources to the Processing of Ritual Descriptions.
In this paper we investigate the use of standard natural language processing (NLP) tools and annotation methods
for processing linguistic data from ritual science. The work is embedded in an interdisciplinary project
that addresses the study of the structure and variance of rituals, as investigated in ritual science,
under a new perspective: by applying empirical and quantitative computational linguistic analysis
techniques to ritual descriptions. We present motivation and prospects of such a computational approach to
ritual structure research and sketch the overall project research plan. In particular, we motivate the choice
of frame semantics as a theoretical framework for the structural analysis of rituals. We discuss the special
characteristics of the textual data and especially focus on the question of how standard NLP methods, resources
and tools can be adapted to the new domain.
- Yevgeni Berzak, Michal Richter and Carsten Ehrler:
Similarity-Based Navigation in Visualized Collections of Historical Documents.
Working with large and unstructured collections of historical documents is a challenging task for historians.
We present an approach for visualizing such collections in the form of graphs in which similar documents are connected by edges.
The strength of the similarities is measured according to the overlap of historically significant information such as Named Entities,
or the overlap of general vocabulary. The visualization approach provides structure that helps unveil otherwise hidden information
and relations between documents. We implement the idea of similarity graphs within an Information Retrieval system supported
by an interactive Graphical User Interface. The system allows querying the database, visualizing the results and browsing
the collection graphically in an effective and informative way.
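The similarity measure described above can be sketched as follows: a minimal, illustrative construction of a document graph in which edges connect documents whose named-entity sets overlap, with weights given by Jaccard similarity. All document names, entity sets, and the threshold are invented for illustration and are not from the paper.

```python
# Sketch: build a similarity graph over documents, assuming similarity is
# measured as Jaccard overlap of named-entity sets (threshold is illustrative).

def jaccard(a, b):
    """Jaccard overlap of two sets of named entities."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def build_graph(docs, threshold=0.2):
    """docs: {doc_id: set of named entities}; returns a weighted edge list."""
    ids = sorted(docs)
    edges = []
    for i, d1 in enumerate(ids):
        for d2 in ids[i + 1:]:
            w = jaccard(docs[d1], docs[d2])
            if w >= threshold:
                edges.append((d1, d2, w))
    return edges

docs = {
    "letter1": {"Vienna", "Metternich", "1848"},
    "letter2": {"Vienna", "Metternich", "Radetzky"},
    "letter3": {"Paris", "Lamartine"},
}
print(build_graph(docs))  # one edge: letter1 -- letter2, weight 0.5
```

The resulting edge list can be fed directly into a graph-drawing library for the kind of interactive visualization the paper describes.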
- Martin Volk, Torsten Marek and Rico Sennrich:
Reducing OCR Errors by Combining Two OCR Systems.
This paper describes our efforts in building a heritage corpus of Alpine texts.
We have already digitized the yearbooks of the Swiss Alpine Club from 1864 until 1982.
This corpus poses special challenges since the yearbooks are multilingual and vary in orthography and layout.
We discuss methods to improve OCR performance and experiment with combining two different OCR programs
with the goal of reducing the number of OCR errors. We describe a merging procedure
that uses a unigram language model trained on the uncorrected corpus itself to select the best alternative,
and report on evaluation results which show that the merging procedure helps to improve OCR quality.
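The core of the merging procedure can be sketched in a few lines: given token-aligned output from two OCR systems, keep whichever alternative is more frequent in a unigram model trained on the uncorrected corpus itself. This is a minimal sketch of the idea only; the alignment step, the corpus, and the example tokens are simplified illustrations.

```python
from collections import Counter

# Sketch of unigram-based OCR merging: for each aligned token pair from two
# OCR systems, keep the alternative that is more frequent in a unigram model
# trained on the (uncorrected) corpus. Data are illustrative.

def train_unigrams(corpus_tokens):
    """Unigram model as raw token counts."""
    return Counter(corpus_tokens)

def merge(ocr_a, ocr_b, unigrams):
    """ocr_a, ocr_b: token-aligned OCR outputs; returns the merged token list."""
    merged = []
    for ta, tb in zip(ocr_a, ocr_b):
        # Counter returns 0 for unseen tokens, so OCR garbage scores lowest.
        merged.append(ta if unigrams[ta] >= unigrams[tb] else tb)
    return merged

corpus = "der Gipfel der Gipfel und das Tal".split()
model = train_unigrams(corpus)
print(merge(["der", "G1pfel"], ["dcr", "Gipfel"], model))  # ['der', 'Gipfel']
```

In practice the two OCR outputs must first be aligned (e.g. by edit-distance alignment), since the systems may tokenize differently.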
- Csaba Oravecz, Bálint Sass and Eszter Simon:
Semi-automatic Normalization of Old Hungarian Codices.
An annotated corpus of Old Hungarian is being developed,
which requires a number of standard computational language processing tasks:
sentence segmentation and tokenization, normalization of tokens
and morphological analysis, and automatic morphosyntactic disambiguation.
The paper shows how the normalization of historical texts
can be aided by a probabilistic model,
which renders the output of automatic normalization
as a (well-constrained) set of legitimate transliterations
for each Old Hungarian token, from which a human annotator
can select the context-fitting element.
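The idea of producing a constrained candidate set per token can be sketched as below: optional character rewrite rules branch the token into a small set of candidate normalizations, from which an annotator chooses. The rewrite rules here are invented examples for illustration, not the paper's actual Old Hungarian rules or its probabilistic ranking.

```python
from itertools import product

# Sketch: generate a constrained candidate set of normalizations for an old
# token via optional character rewrite rules (rules are illustrative only).

RULES = {"ch": ["ch", "cs"], "ÿ": ["i", "y"], "w": ["v", "w"]}

def candidates(token):
    """Segment the token left to right, branching where a rewrite rule applies."""
    segments = []
    i = 0
    while i < len(token):
        for old in sorted(RULES, key=len, reverse=True):  # longest match first
            if token.startswith(old, i):
                segments.append(RULES[old])
                i += len(old)
                break
        else:
            segments.append([token[i]])  # no rule: keep the character as-is
            i += 1
    # Cartesian product over the per-segment alternatives.
    return sorted({"".join(p) for p in product(*segments)})

print(candidates("chÿw"))
```

A probabilistic model as in the paper would then rank (rather than merely enumerate) these candidates before presenting them to the annotator.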
- Hernani Costa, Hugo Gonçalo Oliveira and Paulo Gomes:
The Impact of Distributional Metrics in the Quality of Relational Triples.
This work analyses the benefits of applying metrics based on the occurrence
of words and their neighbourhoods in documents to a set of relational triples
automatically extracted from corpora. In our experiments, we start by
using a simple system to extract semantic triples from a corpus.
Then, the same corpus is used to weight each triple according to
well-known distributional metrics. Finally, we draw conclusions
about the correlation between the values given by the metrics and
the evaluation made by humans.
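One of the well-known distributional metrics the abstract alludes to is pointwise mutual information (PMI) over document co-occurrence. The sketch below weights a triple by the PMI of its two arguments; the document sets are invented for illustration, and the paper may use different metrics or estimation details.

```python
import math

# Sketch: weight a triple (arg1, relation, arg2) by the PMI of its arguments'
# document co-occurrence. Documents are represented as sets of words.

def pmi(w1, w2, docs):
    """log2( P(w1, w2) / (P(w1) * P(w2)) ) over document occurrence."""
    n = len(docs)
    c1 = sum(1 for d in docs if w1 in d)
    c2 = sum(1 for d in docs if w2 in d)
    c12 = sum(1 for d in docs if w1 in d and w2 in d)
    if not (c1 and c2 and c12):
        return float("-inf")  # never (co-)observed: lowest possible weight
    return math.log2((c12 / n) / ((c1 / n) * (c2 / n)))

docs = [
    {"dog", "animal", "bark"},
    {"dog", "animal"},
    {"dog", "cat"},
    {"stone", "wall"},
]
print(pmi("dog", "animal", docs))  # positive: the arguments co-occur often
```

A triple such as ("dog", "is-a", "animal") would thus receive a higher weight than one whose arguments rarely share a document, which is the correlation with human judgments the paper examines.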
- Eric Auer, Peter Wittenburg, Han Sloetjes, Oliver Schreer, Stefano Masneri, Daniel Schneider and Sebastian Tschöpel:
Automatic Annotation of Media Field Recordings.
In the paper we describe a new attempt to develop automatic detectors for processing
real-scene audio-video streams that can be used by researchers worldwide to speed up
their annotation and analysis work. Typically these recordings are made in field
and experimental situations, often with poor quality and only small corpora,
which prevents the use of standard stochastic pattern recognition techniques.
Audio/video processing components are taken out of the expert lab and are integrated
in easy-to-use interactive frameworks so that the researcher can easily start them
with modified parameters and can check the usefulness of the created annotations.
Finally, a variety of detectors can be applied, yielding a lattice of annotations.
A flexible search engine allows finding combinations of patterns, opening completely
new analysis and theorization possibilities for researchers who until now were
required to do all annotations manually and who had no help
in pre-segmenting lengthy media recordings.
- Iris Hendrickx, Michel Généreux and Rita Marquilhas:
Automatic Pragmatic Text Segmentation of Historical Letters.
In this investigation we aim to reduce the manual workload of pragmatic research
by automatically processing a corpus of historical letters.
We focus on two consecutive subtasks: the first is the automatic segmentation
of the letters into formal and informal parts using a statistical n-gram-based technique.
As a second task we perform a further semantic labeling of the formal parts
of the letters using supervised machine learning. The main stumbling block
in our investigation is data sparsity, caused by the small size of the data set
and aggravated by the spelling variation present in the historical letters.
We try to address the latter problem with a dictionary lookup and edit-distance
text normalization step. We achieve results of 83.7% micro-averaged F-score
for the text segmentation task and 63.4% for the semantic labeling task.
Even though these scores are not high enough to completely replace the manual
annotation with automatic annotation, our results are promising and demonstrate
that an automatic approach based on such a small data set is feasible.
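The dictionary-lookup-plus-edit-distance normalization step can be sketched as below: a token absent from a modern wordlist is replaced by its nearest dictionary entry, provided the edit distance stays within a small bound. The wordlist and the distance bound are illustrative assumptions, not the paper's actual resources.

```python
# Sketch of dictionary lookup plus edit-distance normalization: map a
# historical spelling to the nearest modern wordlist entry within a bound.

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution/match
        prev = cur
    return prev[-1]

def normalize(token, lexicon, max_dist=2):
    """Return the token itself if known, else its nearest in-bound lexicon entry."""
    if token in lexicon:
        return token
    best = min(lexicon, key=lambda w: edit_distance(token, w))
    return best if edit_distance(token, best) <= max_dist else token

lexicon = {"senhor", "merce", "carta"}
print(normalize("senhnor", lexicon))  # 'senhor' (one deletion away)
```

Bounding the distance keeps genuinely unknown words (names, extinct forms) unchanged instead of forcing them onto an unrelated dictionary entry.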
- Michael Piotrowski:
From Law Sources to Language Resources.
The Collection of Swiss Law Sources is an edition of historical Swiss texts
with legal relevance from the early Middle Ages up to 1798.
The sources are manuscripts in historical variants of German, French,
Italian, Rhaeto-Romanic, and Latin, which are transcribed, annotated,
and published as editions of historical sources. The Collection is currently
being digitized and will be made available on the Web as facsimiles.
However, for a subset of the collection digital printing data in the form of
FrameMaker documents is available. As this represents a sizable body of
medieval and early modern text in various languages without OCR errors,
it could serve as a valuable language resource for the processing of
cultural heritage texts and for the development and evaluation of
specialized NLP methods and techniques. This paper briefly describes
the retrodigitization of the Collection of Swiss Law Sources and
then discusses the conversion of the FrameMaker files in order
to make the texts suitable for automatic processing and
for the extraction and derivation of language resources.
- Andreas Schwarte, Christopher Haccius, Sebastian Steenbuck and Sven Steudter:
Usability Enhancement by Mining, Processing and Visualizing Data from the Federal German Archive.
The purpose of this paper is to present the results of a project which deals
with mining data from historical corpora. We present a general approach of how
existing language processing tools can be used and integrated to automate
information retrieval and data processing. Moreover, we discuss ideas
of flexible access to the data as well as presentation. Our findings
and results are illustrated in a prototype system which allows for
multi-dimensional queries on the corpus. The corpus for this project
consists of the German cabinet meeting protocols from 1949 to 1964.
- Thierry Declerck, Antonia Scheidel and Piroska Lendvai:
Proppian Content Descriptors in an Augmented Annotation Schema for Fairy Tales.
This paper describes a proposal for combining linguistic and domain specific annotation
for supporting Cultural Heritage and Digital Humanities research, exemplified in the
fairy tale domain. Our goal is to semi-automatically annotate fairy tales, in particular to
locate and mark up fairy tale characters and the actions they are involved in, which can be
subsequently queried in a corpus by both linguists and specialists in the field.
The characters and actions are defined in Propp’s structural analysis of folk tales,
which we aim to implement in a fully fledged way, in contrast to existing resources. We argue
that the approach devises a means for linguistic processing of folk tale texts in order
to support their automated semantic annotation in terms of narrative units and functions.