
LaTeCH 2010

Proceedings


Caroline Sporleder & Kalliopi Zervanou (eds.), Proceedings of the ECAI 2010 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2010), August 16, 2010, Lisbon, Portugal.

Accepted papers list


  1. Nils Reiter, Oliver Hellwig, Anand Mishra, Irina Gossmann, Borayin Maitreya Larios, Julio Cezar Rodrigues, Britta Zeller and Anette Frank:
    Adapting Standard NLP Tools and Resources to the Processing of Ritual Descriptions. [slides]

    Abstract: In this paper we investigate the use of standard natural language processing (NLP) tools and annotation methods for processing linguistic data from ritual science. The work is embedded in an interdisciplinary project that addresses the study of the structure and variance of rituals, as investigated in ritual science, from a new perspective: by applying empirical and quantitative computational linguistic analysis techniques to ritual descriptions. We present the motivation and prospects of such a computational approach to ritual structure research and sketch the overall project research plan. In particular, we motivate the choice of frame semantics as a theoretical framework for the structural analysis of rituals. We discuss the special characteristics of the textual data and focus especially on the question of how standard NLP methods, resources and tools can be adapted to the new domain.

  2. Yevgeni Berzak, Michal Richter and Carsten Ehrler:
    Similarity-Based Navigation in Visualized Collections of Historical Documents. [slides]

    Abstract: Working with large and unstructured collections of historical documents is a challenging task for historians. We present an approach for visualizing such collections in the form of graphs in which similar documents are connected by edges. The strength of the similarities is measured according to the overlap of historically significant information such as Named Entities, or the overlap of general vocabulary. The visualization approach provides structure that helps unveil otherwise hidden information and relations between documents. We implement the idea of similarity graphs within an Information Retrieval system supported by an interactive Graphical User Interface. The system allows querying the database, visualizing the results and browsing the collection graphically in an effective and informative way.
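
    A minimal sketch (in Python, not the authors' implementation) of the underlying idea: documents become nodes, and an edge is added whenever the overlap of their named-entity sets exceeds a threshold. The documents and entity sets below are hypothetical.

      from itertools import combinations

      def jaccard(a, b):
          # Jaccard overlap of two sets; 0.0 when both are empty
          return len(a & b) / len(a | b) if (a or b) else 0.0

      def build_similarity_graph(doc_entities, threshold=0.2):
          # doc_entities maps doc_id -> set of named entities found in it;
          # returns weighted edges (doc1, doc2, similarity) above the threshold
          edges = []
          for d1, d2 in combinations(sorted(doc_entities), 2):
              sim = jaccard(doc_entities[d1], doc_entities[d2])
              if sim >= threshold:
                  edges.append((d1, d2, sim))
          return edges

      if __name__ == "__main__":
          docs = {  # hypothetical documents and their extracted entities
              "letter_001": {"Vienna", "Metternich", "Congress"},
              "letter_002": {"Vienna", "Talleyrand", "Congress"},
              "report_017": {"Lisbon", "Tagus"},
          }
          for edge in build_similarity_graph(docs, threshold=0.3):
              print(edge)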

  3. Martin Volk, Torsten Marek and Rico Sennrich:
    Reducing OCR Errors by Combining Two OCR Systems. [slides]

    Abstract: This paper describes our efforts in building a heritage corpus of Alpine texts. We have already digitized the yearbooks of the Swiss Alpine Club from 1864 until 1982. This corpus poses special challenges since the yearbooks are multilingual and vary in orthography and layout. We discuss methods to improve OCR performance and experiment with combining two different OCR programs with the goal of reducing the number of OCR errors. We describe a merging procedure that uses a unigram language model trained on the uncorrected corpus itself to select the best alternative, and report on evaluation results which show that the merging procedure helps to improve OCR quality.
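
    A minimal sketch of the merging idea, assuming the two OCR outputs are already token-aligned (the paper's alignment and language model are more involved than this). The corpus and OCR tokens below are hypothetical.

      from collections import Counter

      def train_unigram(tokens):
          # unigram counts taken from the (uncorrected) corpus itself
          return Counter(tokens)

      def merge_ocr(tokens_a, tokens_b, unigram):
          # for each aligned token pair, keep the alternative that is
          # more frequent under the corpus-trained unigram model
          merged = []
          for a, b in zip(tokens_a, tokens_b):
              merged.append(a if unigram[a] >= unigram[b] else b)
          return merged

      if __name__ == "__main__":
          corpus = "der Gipfel des Berges war in Nebel gehuellt der Gipfel".split()
          lm = train_unigram(corpus)
          ocr_a = ["der", "Gipfe1", "des", "Berges"]   # '1' misread for 'l'
          ocr_b = ["dcr", "Gipfel", "des", "Berges"]   # 'c' misread for 'e'
          print(merge_ocr(ocr_a, ocr_b, lm))           # ['der', 'Gipfel', 'des', 'Berges']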

  4. Csaba Oravecz, Bálint Sass and Eszter Simon:
    Semi-automatic Normalization of Old Hungarian Codices. [slides]

    Abstract: An annotated corpus of Old Hungarian is being developed, which requires a number of standard computational language processing tasks: sentence segmentation and tokenization, normalization of tokens and morphological analysis, and automatic morphosyntactic disambiguation. The paper presents how the normalization process of historical texts can be aided by the application of a neat probabilistic model, which renders the output of automatic normalization as a (well-constrained) set of legitimate transliterations for each Old Hungarian token, from which a human annotator can select the context-fitting element.
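
    A toy sketch of such a constrained candidate-generation step (not the authors' model): character rewrite rules propose a small set of modern transliterations for a historical token, which are then ranked against a modern-form frequency lexicon. The rules, the token and the lexicon below are invented for illustration.

      from itertools import product

      REWRITES = {  # hypothetical mapping: historical spelling -> modern counterparts
          "ch": ["cs", "ch"],
          "ew": ["ö", "ő", "ü", "ew"],
          "y":  ["i", "y"],
      }

      def candidates(token):
          # enumerate all combinations of applicable rewrites, left to right
          segments, i = [], 0
          while i < len(token):
              for hist, repls in REWRITES.items():
                  if token.startswith(hist, i):
                      segments.append(repls)
                      i += len(hist)
                      break
              else:
                  segments.append([token[i]])
                  i += 1
          return {"".join(parts) for parts in product(*segments)}

      def rank(cands, lexicon_freq):
          # order candidates by frequency in a modern-form lexicon (unknown = 0)
          return sorted(cands, key=lambda c: lexicon_freq.get(c, 0), reverse=True)

      if __name__ == "__main__":
          lexicon = {"csütörtök": 120, "csötörtök": 3}  # hypothetical counts
          print(rank(candidates("chewtewrtewk"), lexicon)[:3])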

  5. Hernani Costa, Hugo Gonçalo Oliveira and Paulo Gomes:
    The Impact of Distributional Metrics in the Quality of Relational Triples. [slides]

    Abstract: This work analyses the benefits of applying metrics based on the occurrence of words and their neighbourhoods in documents to a set of relational triples automatically extracted from corpora. In our experiments, we start by using a simple system to extract semantic triples from a corpus. Then, the same corpus is used for weighting each triple according to well-known distributional metrics. Finally, we draw some conclusions about the correlation between the values given by the metrics and the evaluation made by humans.
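
    A minimal sketch of one such well-known distributional metric: pointwise mutual information (PMI) between the two arguments of a triple, computed from document-level co-occurrence. The paper considers several metrics, and the corpus and triples below are hypothetical.

      import math

      def pmi(term_a, term_b, docs):
          # PMI from document-level occurrence: log P(a, b) / (P(a) * P(b))
          n = len(docs)
          n_a = sum(1 for d in docs if term_a in d)
          n_b = sum(1 for d in docs if term_b in d)
          n_ab = sum(1 for d in docs if term_a in d and term_b in d)
          if n_a == 0 or n_b == 0 or n_ab == 0:
              return float("-inf")   # the arguments never co-occur
          return math.log((n_ab / n) / ((n_a / n) * (n_b / n)))

      if __name__ == "__main__":
          corpus = [  # each "document" reduced to a set of tokens (hypothetical)
              {"dog", "animal", "barks"},
              {"dog", "animal", "runs"},
              {"car", "engine"},
          ]
          for arg1, rel, arg2 in [("dog", "is_a", "animal"), ("dog", "is_a", "engine")]:
              print(arg1, rel, arg2, "->", pmi(arg1, arg2, corpus))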

  6. Eric Auer, Peter Wittenburg, Han Sloetjes, Oliver Schreer, Stefano Masneri, Daniel Schneider and Sebastian Tschöpel:
    Automatic Annotation of Media Field Recordings. [slides]

    Abstract: In this paper we describe a new attempt to develop automatic detectors for processing real-scene audio-video streams that can be used by researchers world-wide to speed up their annotation and analysis work. Typically, these recordings are made in field and experimental situations, mostly with poor quality and only small corpora, which prevents the use of standard stochastic pattern recognition techniques. Audio/video processing components are taken out of the expert lab and integrated into easy-to-use interactive frameworks, so that researchers can easily run them with modified parameters and check the usefulness of the created annotations. Finally, a variety of detectors may have been applied, yielding a lattice of annotations. A flexible search engine allows finding combinations of patterns, opening up completely new analysis and theorization possibilities for researchers who until now were required to do all annotations manually and had no help in pre-segmenting lengthy media recordings.
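
    A minimal sketch, under assumptions, of this kind of cross-detector search: annotations are (start, end, label) intervals on named tiers, and a query asks where a label on one tier overlaps a label on another. The tiers, labels and times below are hypothetical.

      def overlaps(a, b):
          # two (start, end, label) annotations overlap in time
          return a[0] < b[1] and b[0] < a[1]

      def find_cooccurrences(tier_a, label_a, tier_b, label_b):
          # all pairs of overlapping annotations carrying the requested labels
          hits = []
          for ann_a in tier_a:
              if ann_a[2] != label_a:
                  continue
              for ann_b in tier_b:
                  if ann_b[2] == label_b and overlaps(ann_a, ann_b):
                      hits.append((ann_a, ann_b))
          return hits

      if __name__ == "__main__":
          speech = [(0.0, 4.2, "speech"), (5.0, 9.3, "speech")]      # audio detector output
          gesture = [(3.8, 6.1, "pointing"), (12.0, 13.0, "nod")]    # video detector output
          print(find_cooccurrences(speech, "speech", gesture, "pointing"))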

  7. Iris Hendrickx, Michel Généreux and Rita Marquilhas:
    Automatic Pragmatic Text Segmentation of Historical Letters. [slides]

    Abstract: In this investigation we aim to reduce the manual workload of pragmatic research on a corpus of historical letters by processing the corpus automatically. We focus on two consecutive subtasks: the first task is automatic text segmentation of the letters into formal and informal parts using a statistical n-gram-based technique. As a second task we perform a further semantic labeling of the formal parts of the letters using supervised machine learning. The main stumbling block in our investigation is data sparsity, due to the small size of the data set and compounded by the spelling variation present in the historical letters. We try to address the latter problem with a dictionary look-up and edit-distance text normalization step. We achieve a micro-averaged F-score of 83.7% for the text segmentation task and 63.4% for the semantic labeling task. Even though these scores are not high enough to completely replace manual annotation with automatic annotation, our results are promising and demonstrate that an automatic approach based on such a small data set is feasible.
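
    An illustrative sketch of the dictionary look-up and edit-distance normalization step (not the authors' exact procedure); the dictionary and the historical spellings below are invented.

      def edit_distance(a, b):
          # plain Levenshtein distance (insertions, deletions, substitutions)
          prev = list(range(len(b) + 1))
          for i, ca in enumerate(a, 1):
              curr = [i]
              for j, cb in enumerate(b, 1):
                  curr.append(min(prev[j] + 1,                 # deletion
                                  curr[j - 1] + 1,             # insertion
                                  prev[j - 1] + (ca != cb)))   # substitution
              prev = curr
          return prev[-1]

      def normalize(token, dictionary, max_dist=2):
          # keep dictionary tokens as-is; otherwise take the closest dictionary
          # entry within max_dist edits, or leave the token unchanged
          if token in dictionary:
              return token
          best = min(dictionary, key=lambda w: edit_distance(token, w))
          return best if edit_distance(token, best) <= max_dist else token

      if __name__ == "__main__":
          dictionary = {"senhora", "muito", "obrigado", "carta"}
          for hist in ["snra", "muyto", "cartta"]:
              print(hist, "->", normalize(hist, dictionary))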

  8. Michael Piotrowski:
    From Law Sources to Language Resources. [slides]

    Abstract: The Collection of Swiss Law Sources is an edition of historical Swiss texts with legal relevance from the early Middle Ages up to 1798. The sources are manuscripts in historical variants of German, French, Italian, Rhaeto-Romanic, and Latin, which are transcribed, annotated, and published as editions of historical sources. The Collection is currently being digitized and will be made available on the Web as facsimiles. However, for a subset of the collection, digital printing data in the form of FrameMaker documents is available. As this represents a sizable body of medieval and early modern text in various languages without OCR errors, it could serve as a valuable language resource for the processing of cultural heritage texts and for the development and evaluation of specialized NLP methods and techniques. This paper briefly describes the retrodigitization of the Collection of Swiss Law Sources and then discusses the conversion of the FrameMaker files in order to make the texts suitable for automatic processing and for the extraction and derivation of language resources.

  9. Andreas Schwarte, Christopher Haccius, Sebastian Steenbuck and Sven Steudter:
    Usability Enhancement by Mining, Processing and Visualizing Data from the Federal German Archive. [slides]

    Abstract: The purpose of this paper is to present the results of a project which deals with mining data from historical corpora. We present a general approach to how existing language processing tools can be used and integrated to automate information retrieval and data processing. Moreover, we discuss ideas for flexible access to the data as well as its presentation. Our findings and results are illustrated in a prototype system which allows for multi-dimensional queries on the corpus. The corpus for this project consists of the German cabinet meeting protocols from 1949 to 1964.

  10. Thierry Declerck, Antonia Scheidel and Piroska Lendvai:
    Proppian Content Descriptors in an Augmented Annotation Schema for Fairy Tales. [slides]

    Abstract: This paper describes a proposal for combining linguistic and domain-specific annotation to support Cultural Heritage and Digital Humanities research, exemplified in the fairy tale domain. Our goal is to semi-automatically annotate fairy tales, in particular to locate and mark up fairy tale characters and the actions they are involved in, which can subsequently be queried in a corpus by both linguists and specialists in the field. The characters and actions are defined in Propp's structural analysis of folk tales, which we aim to implement in a fully fledged way, in contrast to existing resources. We argue that the approach devises a means for linguistic processing of folk tale texts in order to support their automated semantic annotation in terms of narrative units and functions.
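
    Purely as an illustration of what such a combined record might look like (this is not the schema proposed in the paper), a text span can carry both a linguistic layer and Proppian character/function labels; all field names, labels and the example sentence below are hypothetical.

      from dataclasses import dataclass, field
      from typing import List, Optional

      @dataclass
      class SpanAnnotation:
          start: int                          # character offset in the tale text
          end: int
          text: str
          character: Optional[str] = None     # Proppian dramatis persona, e.g. "villain"
          function: Optional[str] = None      # Proppian function label
          pos_tags: List[str] = field(default_factory=list)  # linguistic layer

      tale = "The dragon carried off the princess."
      ann = SpanAnnotation(
          start=0, end=37, text=tale,
          character="villain",
          function="A: villainy (abduction of a person)",
          pos_tags=["DT", "NN", "VBD", "RP", "DT", "NN", "."],
      )
      print(ann.function)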


NWO CATCH
Last update: Thu 2 Sep 2010; K.Zervanou (at) uvt.nl