Sentential meaning is at the core of the formal semantics program, so much so that formal semanticists have developed sophisticated theories of the meaning of the sentence "every farmer who owns a donkey beats it", but do not have much to say about what the meaning of "donkey" is to begin with. Computational linguists, on the other hand, have developed powerful distributional methods to automatically induce rich representations of the meaning of "donkey" and thousands of other words from corpora, but cannot, using the same methods, capture the meaning of "a donkey", let alone that of whole sentences. For decades, the two fields have consequently not communicated much, respecting what looks more like a non-aggression pact than a mutually beneficial division of labour. Very recently, however, several proposals for a "compositional" corpus-based distributional semantics have emerged. In this talk, I will review some of these approaches, present preliminary evidence of their effectiveness in scaling up to the phrasal domain, and discuss to what extent the representations of phrases and (eventually) sentences we will get out of compositional distributional semantics are related to what formal semanticists are trying to attain.
It is of great importance to the Dutch Parliament to have recent documents at their disposal quickly, including recent media articles. The archival service of the Dutch Parliament tries to accommodate this by manually assigning tens of thousands of keywords from the Parlementsthesaurus to documents each year. Members of Parliament can then retrieve these documents by using a search engine. Although this delivers good results, it is a rather slow process. It can take several weeks before a newly published article is made available through the parliamentary archive. To improve on this process, the archival service is currently developing a system that automatically assigns keywords to documents. While waiting for an editor to handle them, the articles can already be made available through the search engine. This also helps the editors, because all they have to do is check if the assigned keywords are correct.
Many universities and scientific institutes are in possession of interesting language resources. These resources are often quite specialized and relatively unknown. Finding out which resources can answer a given research question is a challenge in itself.
Current infrastructural initiatives try to tackle this issue by collecting metadata about the resources and establishing centers with stable repositories to ensure the availability of the resources. However, due to the heterogeneity of the data collections and the available systems, it is currently impossible to directly search over these resources in an integrated manner. It would be beneficial if the researcher could, by means of a simple query, determine which resources and which centers contain information beneficial to his or her research, or even work on a set of distributed resources as a virtual corpus.
We discuss an architecture for a distributed search environment. Furthermore, we will answer the question of how the availability of such a search infrastructure helps the computational linguistics researcher.
The construction of a large and richly annotated corpus of written Dutch was identified as one of the priorities of the STEVIN programme (2004-2011). Such a corpus, sampling texts from conventional and new media, was deemed invaluable for scientific research and application development. Therefore in 2008 the SoNaR project was initiated. The project aimed at the construction of a large, 500 MW reference corpus of contemporary written Dutch. Since its inception huge effort was put into acquiring a data set that represents a vast range of text types and into arranging IPR.
With final delivery to the Dutch-Flemish HLT Centre (TST-Centrale) now scheduled for 1 March 2012, we present an overview of the contents of the corpus. With 40 text types to be balanced over the two main Dutch regions, i.e. the Netherlands and the Flemish part of Belgium, many challenges have had to be met creatively in order to be, or not to be, overcome.
We highlight some of the more salient aspects of the SoNaR corpus contents.
In this research we describe the design and implementation of a part-of-speech tagger specifically for Dutch data from the popular microblogging service Twitter. Starting from the D-COI part-of-speech tagset, which is also used in the SoNaR project, we added several Twitter-specific tags to allow the tagging of hashtags, "@mentions", emoticons and urls. We designed and implemented a conversion module, which modifies the part-of-speech output of the Frog tagger to incorporate the Twitter-specific tags. Running the Frog tagger and the conversion module sequentially yields a part-of-speech tagger for Dutch tweets. Approximately 1 million tweets collected in the context of the SoNaR project were tagged by Frog and the converter combined. A subset of the annotated tweets has been manually checked. Lastly, we evaluated the modified part-of-speech tagger.
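As a rough illustration of the kind of rule-based conversion such a module performs, the sketch below maps tokens matching Twitter-specific surface patterns to dedicated tags and leaves all other tokens with their original Frog tag; the tag names, patterns and example D-COI tags are illustrative, not the ones actually used in the project.

    import re

    # Illustrative patterns only; the actual converter and tag names may differ.
    TWITTER_PATTERNS = [
        (re.compile(r'^#\w+$'), 'HASHTAG'),
        (re.compile(r'^@\w+$'), 'MENTION'),
        (re.compile(r'^https?://\S+$'), 'URL'),
        (re.compile(r'^[:;=8][\-o\*]?[\)\](\[dDpP/]$'), 'EMOTICON'),
    ]

    def convert(token, frog_tag):
        """Return a Twitter-specific tag if the token matches a pattern, else keep Frog's tag."""
        for pattern, tag in TWITTER_PATTERNS:
            if pattern.match(token):
                return tag
        return frog_tag

    # A tokenised tweet with (hypothetical) Frog output tags:
    tweet = [('#clin22', 'SPEC(symb)'), ('is', 'WW(pv,tgw,ev)'), ('leuk', 'ADJ(vrij,basis,zonder)')]
    print([(tok, convert(tok, tag)) for tok, tag in tweet])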
This project was accomplished by eight Master's students from Tilburg University, who had just completed a course in natural language processing. In addition to the theoretical knowledge they acquired during the course, this project, which took approximately a week, offered them insight into the practical decisions that need to be made while working on natural language processing projects.
The recent construction of large linguistic treebanks for spoken and written Dutch (e.g. CGN, LASSY, Alpino) has created new and exciting opportunities for the empirical investigation of Dutch syntax and semantics.
However, the exploitation of such treebanks usually requires knowledge of specific data structures and/or query languages, such as XPath. Linguists who are unfamiliar with formal languages are often reluctant to learn such a language.
In order to make treebank querying more attractive for non-technical users, we developed a method in which linguists can use examples as a starting point for searching the Lassy treebank, without knowledge of tree structures or formal query languages. By allowing linguists to search for constructions similar to the example they provide, we hope to bridge the gap between traditional and computational linguistics.
The user provides the query engine with an example sentence, marking which parts of the sentence are the focus of the query. Through automated parsing of the example sentence and extraction of the subtree covering the part under focus, the treebank is queried for the extracted subtree.
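A minimal sketch of this example-based querying idea is given below. It assumes Lassy/Alpino-style XML in which syntactic nodes are <node> elements carrying attributes such as cat, rel and pt, and it takes the parsed subtree of the focused phrase as a ready-made nested structure (in the real system this would come from parsing the example sentence with Alpino).

    def node_predicate(tree):
        """Build an XPath predicate from the attributes and children of a subtree node."""
        attrs = [f'@{k}="{v}"' for k, v in tree.get('attrs', {}).items()]
        kids = [f'node[{node_predicate(child)}]' for child in tree.get('children', [])]
        return ' and '.join(attrs + kids)

    def subtree_to_xpath(tree):
        """Turn an extracted subtree into an XPath query matching similar constructions."""
        return f'//node[{node_predicate(tree)}]'

    # Focused part of an example sentence: an NP with a determiner and a nominal head.
    example_subtree = {'attrs': {'cat': 'np'},
                       'children': [{'attrs': {'rel': 'det'}},
                                    {'attrs': {'rel': 'hd', 'pt': 'n'}}]}
    print(subtree_to_xpath(example_subtree))
    # //node[@cat="np" and node[@rel="det"] and node[@rel="hd" and @pt="n"]]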
The architecture of the tool is optimized for searching the LASSY treebank, but the approach can be adapted to other treebank lay-outs.
We propose a computational model for open-domain Surface Realisation that takes semantic representations as input and generates texts. The semantic formalism we use is that of Discourse Representation Structures (DRSs) from Discourse Representation Theory. A large corpus of text annotated with DRSs, the Groningen SemBank, forms the basis for developing and evaluating (stochastic) text generation models.
DRSs are recursive structures describing the meaning of text. This makes it hard to use them directly in traditional machine learning approaches. To deal with this problem, we present a graph-based, meaning-preserving but "flat" format for DRSs. This format is essentially a set of n-tuples representing edges and nodes of a graph. There is a straightforward translation from the standard DRS syntax to the equivalent in graph notation.
There are two layers in DRS graphs: argument structure, describing how concepts (including events) are connected to each other via thematic roles; and discourse structure, describing scope relations (e.g., negation, disjunction) and rhetorical relations between discourse units. Both layers are represented as tuples in the same graph, and are connected by common nodes.
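By way of illustration, the fragment below shows what such a flat, tuple-based encoding could look like for the DRS of "A man does not sleep"; the labels and tuple layout are our own ad-hoc choices for this sketch and not necessarily the exact SemBank format.

    # Node tuples: (identifier, type); edge tuples: (source, target, label).
    nodes = [
        ('b1', 'drs'), ('b2', 'drs'),        # discourse units (boxes)
        ('x1', 'referent'),                  # discourse referent for "a man"
        ('c1', 'concept:man'),
        ('e1', 'event:sleep'),
    ]
    edges = [
        ('b1', 'x1', 'referent'),            # x1 is introduced in box b1
        ('c1', 'x1', 'arg'),                 # man(x1)
        ('b1', 'b2', 'NOT'),                 # discourse structure: b1 negates b2
        ('b2', 'e1', 'condition'),           # the sleep event lives inside the negated box
        ('e1', 'x1', 'Agent'),               # argument structure: thematic role
    ]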
VALKUIL.net is a freely available automatic spelling corrector for Dutch, capable of detecting ordinary typos but also agreement errors and confusions between existing words. The system is not based on explicit linguistic knowledge, but rather on a large Dutch text corpus of approx. 1.5 billion words, from which context-sensitive spelling error detection and correction modules are automatically derived. These modules use memory-based classification as their processing engine, and detect spelling or agreement errors based on a local context of neighboring words. When their expectation deviates sufficiently from the text they process, they are allowed to raise a spelling error flag and generate a correction suggestion.
Most modules specialize in a particular confusible, such as determining whether an occurrence of 'zei' should actually be 'zij' or vice versa. Other modules treat a category of confusions, such as between words ending in 'd' or 'dt', while one further module tries to detect any error based on a generic memory-based language model. Additionally, particular context-insensitive modules correct split and run-on errors, and check all words against a corpus-derived frequency word list, looking for a frequent counterpart in the list at a small Levenshtein distance.
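The sketch below gives a rough idea of what a single confusible module does, with scikit-learn's k-nearest-neighbour classifier standing in for the memory-based (TiMBL-style) engine actually used; the toy training contexts and the confidence threshold are purely illustrative.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.neighbors import KNeighborsClassifier

    def context_features(tokens, i, width=2):
        """Neighbouring words around position i, used as classification features."""
        return {f'w{j}': tokens[i + j]
                for j in range(-width, width + 1)
                if j != 0 and 0 <= i + j < len(tokens)}

    # Toy training data: attested contexts of 'zei' and 'zij' from a corpus.
    train = [(['ze', 'zei', 'dat', 'het', 'goed', 'ging'], 1, 'zei'),
             (['omdat', 'zij', 'het', 'niet', 'wist'], 1, 'zij')]
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform([context_features(t, i) for t, i, _ in train])
    y = [label for _, _, label in train]
    module = KNeighborsClassifier(n_neighbors=1).fit(X, y)

    def check(tokens, i, threshold=0.8):
        """Flag tokens[i] and suggest a correction if the module deviates confidently."""
        proba = module.predict_proba(vectorizer.transform([context_features(tokens, i)]))[0]
        best = module.classes_[proba.argmax()]
        if best != tokens[i] and proba.max() >= threshold:
            return best          # correction suggestion
        return None              # no flag raised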
Users of VALKUIL.net, which was publicly released in June 2011, are given the option to donate their errors and interactions with the system, producing an evaluation corpus that can be used for retraining and improving the system. We are evaluating VALKUIL.net using this growing resource, as well as three other corpora with annotated errors. (1) In the context of the Implicit Linguistics project at Tilburg University a sub-corpus of the SoNaR corpus was annotated for errors; (2) the OpenTaal.nl organization is providing access to a growing volunteer-annotated collection of errors in web-crawled texts; and (3) the WebCorp search engine is used to acquire specific sub-corpora based on targeted queries, such as sentences containing the word bigram error 'ik wordt'. We present the first evaluation results of the various VALKUIL.net modules on these four test resources, and discuss the challenges involved in estimating accuracy, precision and recall of performance on the two tasks of spelling error detection and correction.
There are several annotated corpora nowadays available that include some semantic annotation, in particular PropBank, FrameNet, the Penn Discourse TreeBank, and OntoNotes. Yet, none of these resources contain annotations that are motivated by formal semantic theory. The aim of the Groningen SemBank project is to produce a corpus of texts annotated with Discourse Representation Structures (DRSs), following DRT (Discourse Representation Theory).
One of our objectives is to integrate linguistic phenomena (such as word senses, scope, thematic roles, anaphora, presupposition, rhetorical structure) into a single annotation formalism, instead of covering single phenomena in an isolated way, as is mostly the case with existing resources. This will provide a better handle on explaining dependencies between various ambiguous linguistic phenomena. Another objective is to annotate texts, not sentences (as in ordinary treebanks). This gives us the means to deal with sentence-level ambiguities that require the discourse context for their resolution.
Manually annotating a reasonably large corpus with gold-standard DRSs is obviously a hard and time-consuming task. Therefore, we use a bootstrapping approach that employs state-of-the-art NLP tools to get a reasonable approximation of the target annotations to start with. Human annotations come from two main sources: experts (linguists) and non-experts (players of a game with a purpose). The annotation of a text comprises several layers: boundaries (for tokens and sentences), tags (part of speech, named entities, word senses), syntactic structure (based on categorial grammar), and semantic structure (DRSs including thematic roles and discourse relations).
When natural language processing tasks overlap in their linguistic input space, they can be technically merged. Applying machine learning algorithms to the new joint task and comparing the results of joint learning with disjoint learning of the original tasks may bring to light the linguistic relatedness of the two tasks. We present a joint learning experiment with dependency parsing and semantic role labeling of Catalan and Spanish. The developed systems are based on local memory-based classifiers that predict constraints on the syntactic and semantic dependency relations in the resulting graph from the same input features. In a second, global phase, a constraint satisfaction inference procedure produces a dependency graph and semantic role label assignments for all predicates in a sentence. The comparison between joint and disjoint learning shows that dependency parsing is better learned in a disjoint setting, while semantic role labeling benefits from joint learning. We explain the results by providing an analysis of the output of the systems.
Named Entities (NEs) are an important component of natural language and of applications dealing with language/linguistic data: for instance, NEs constitute 10% of the contents of journalistic texts (Freiburg, 2002), and 30% of Internet queries (Artiles, 2007). Thus automatically processing them is useful. However, ambiguity may degrade the performance of NE processing. There is ambiguity when an NE refers to more than a single entity (person, place, organization, etc.). Our work addresses this question in the framework of an NE processing (notably identification and extraction) platform. In this context, the problem presented by the ambiguity of NEs is twofold. On the one hand, it is mandatory to have a database of ambiguous NEs that is as comprehensive as possible and to assign to each one a unique identifier (ID); on the other hand, during the automatic processing of NEs in their occurrence context, it is necessary to detect the cases of ambiguity, before identifying the person, place, organization, etc. referred to by the ambiguous NE and, if it exists in the database, assigning the corresponding unique ID. We present the system we set up to address this problem. We begin by describing the salient points of the state of the art that are relevant for our approach. Then, we describe our method and its innovative aspects with regard to the state of the art. Finally, we present the evaluation that we carried out and comment on its results.
Finding trends in online information is a trendy business. Detecting and analyzing trends on social media (most notably Twitter) is a hot topic.
The use of trending topics has not yet been extensively applied and documented for corporate website search applications. Detecting trends in their own search traffic is an important matter for companies, as it gives more accurate information about activities related to their business (such as marketing campaigns or incidents). Manually identifying these emerging trends would be very labor-intensive, and they cannot be easily extracted from the results returned for user queries in the search application. Typically, the top n most frequently asked queries is very stable and new trends will not make it to this list. Trends can be new topics still unknown to the search solution (such as new brand or product names, e.g. 'iPhone 4S'), but also well-known terms that suddenly become popular due to news events (e.g. 'Tokyo', 'Libya', 'credit crisis').
In order to find trending topics we used a dataset containing a year of search logs from a corporate website (ca. 1 million queries). Our approach is based on frequency patterns for individual search queries. By comparing query frequency patterns from different periods we are able to extract trend candidates. In this talk we will present our method and its evaluation.
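A minimal version of this comparison of query frequency patterns could look as follows; the smoothing, thresholds and scoring are illustrative choices, not the exact ones used in our experiments.

    from collections import Counter

    def trend_candidates(current_queries, reference_queries, min_count=5, min_ratio=3.0):
        """Return queries whose relative frequency in the current period has jumped
        compared to a reference period."""
        cur, ref = Counter(current_queries), Counter(reference_queries)
        cur_total, ref_total = sum(cur.values()), sum(ref.values())
        candidates = []
        for query, count in cur.items():
            cur_rate = count / cur_total
            ref_rate = (ref.get(query, 0) + 1) / (ref_total + 1)   # add-one smoothing for unseen queries
            score = cur_rate / ref_rate
            if count >= min_count and score >= min_ratio:
                candidates.append((query, score))
        return sorted(candidates, key=lambda pair: -pair[1])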
We extract a non-polarized, time-stamped co-occurrence network of people, organizations and locations from biographical texts covering the domain of Dutch social history, using named entity recognition and temporal tagging. Social network analysis metrics such as centrality, modularity and clustering are calculated for each day that is included in the time span of the network. We use these metrics to model changes in the network over time and convert them into numeric vectors. Classes are attached to the vectors depending on whether a strike, the formation of a political party or the formation of a workers' union occurred on the day corresponding to the vector. By training support vector machines on incremental sets of these vectors, we aim to develop a system capable of predicting these types of events, and perhaps to shed some light on the social processes that cause these events to occur.
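The sketch below illustrates the general pipeline under simplifying assumptions: a handful of network metrics per day are turned into a feature vector and fed to an SVM; load_daily_networks is a hypothetical loader, and the chosen metrics are only a subset of those mentioned above.

    import networkx as nx
    from sklearn.svm import SVC

    def day_vector(graph):
        """Turn one day's co-occurrence network into a fixed-length numeric vector."""
        centrality = nx.degree_centrality(graph)
        return [graph.number_of_nodes(),
                graph.number_of_edges(),
                nx.density(graph),
                sum(centrality.values()) / max(len(centrality), 1),   # mean degree centrality
                nx.average_clustering(graph)]

    # One graph per day, plus a label: 1 if e.g. a strike occurred that day, else 0.
    graphs, labels = load_daily_networks()   # hypothetical loader
    X = [day_vector(g) for g in graphs]
    classifier = SVC(kernel='rbf').fit(X, labels)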
Syntax-based semantic space models have been shown to perform better than other types of semantic models (e.g., bag-of-words). These models usually employ syntactic features which are extracted from large parsed corpora. However, if the grammar and the parser used to parse the corpora encounter many unknown words, this could damage the quality of the delivered syntactic analyses. That, in turn, affects the quality of the semantic models.
We employ the lexical acquisition (LA) method we presented in Cholakov and Van Noord (COLING 2010) to learn lexical entries for words unknown to the Dutch Alpino grammar and dependency parser. After adding the acquired lexical entries to the Alpino lexicon, we parse newspaper corpora (~2 billion words). The obtained syntactic analyses are used to extract dependency relations which are then employed in a syntax-based semantic model. The performance of the model is evaluated against CORNETTO, a Dutch lexico-semantic hierarchy. For a set of candidate words randomly extracted from CORNETTO, the task of the semantic model is to map each word in the test set to the candidate word which is most similar to it.
We compare the obtained results to a semantic model which uses dependency relations from the same corpora, but parsed with the standard Alpino configuration. In this case, a built-in unknown word guesser is used to predict lexical entries for the unknown words. As a further comparison, we also build a bag-of-words model. The syntax-based model combined with LA achieves the best performance.
Lexical categories, also known as word classes, are commonly used to model syntactic and semantic word properties in a simple and computationally efficient way. Learning them from distributional clues has been researched from the perspective of both cognitive modeling and NLP applications. Whereas incremental online learning and good performance on simulations of human behavior are important criteria in cognitive modeling (Parisien et al. 2008), efficiency and effectiveness in semi-supervised learning scenarios are essential for NLP applications (Turian et al. 2010). In this research we evaluate versions of recently developed word class learning algorithms on both sets of criteria. First we consider the online information-theoretic model for category acquisition of Chrupala and Alishahi (2010), and explore two modifications: (i) varying the greediness of the algorithm by introducing a version with beam search, and (ii) varying the tradeoff between informativeness and parsimony. Second, we compare the performance of this family of models with the recently proposed efficient word class induction method using Latent Dirichlet Allocation (Chrupala 2011), and with incremental versions of this approach. We evaluate both on human behavior simulations, such as word prediction and grammaticality judgment, and on standard semi-supervised NLP tasks.
References:
C. Parisien, A. Fazly, and S. Stevenson. An incremental Bayesian model for learning syntactic categories. CoNLL 2008.
J. Turian, L. Ratinov, Y. Bengio. Word representations: a simple and general method for semi-supervised learning. ACL 2010.
G. Chrupala and A. Alishahi. Online Entropy-based Model of Lexical Category Acquisition. CoNLL 2010.
G. Chrupala. Efficient induction of probabilistic word classes with LDA. IJCNLP 2011.
This talk will focus on automating Dutch semantic role annotation. In the framework of the STEVIN-funded SoNaR project, a one million word core corpus has been enriched with semantic information, including semantic roles. 500K words have been manually verified following the PropBank approach (Palmer et al., 2005), i.e. Dutch verbs are linked to English frame files. For the annotation we were able to rely on manually verified Dutch dependency structures from LassyKlein. This manually verified subset was used as training material to retrain an existing labeler (Stevens et al., 2007) in order to tag the remaining 500K automatically. A main advantage of the SoNaR corpus is its rich text type diversity, which allowed us to experiment with six distinct text genres. Instead of focusing on the labeler and its features, we closely investigate the effect of training on a more diverse data set, to see whether merely varying the genre or the amount of training data optimizes performance. We see that training on large data sets is necessary, but that including genre-specific training material is also crucial to optimize classification.
M. Palmer, D. Gildea, and P. Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31:71–105.
G. Stevens, P. Monachesi, A. van den Bosch. 2007. A pilot study for automatic semantic role labeling in a Dutch Corpus. Selected papers from the seventeenth CLIN meeting. LOT Occasional Series 7, Utrecht.
This presentation introduces a novel method for fully unsupervised verb subcategorization frame (SCF) induction. Treating SCF induction as a multi-way co-occurrence problem, we use multi-way tensor factorization to cluster frequent verbs from a large corpus according to their syntactic behaviour. The SCF lexicon that emerges from the clusters is shown to have an F-score of 72%, not far below that of a method that relies on hand-crafted rules. Moreover, the tensor factorization method is shown to have the advantages of revealing latent structure in the data and of being able to use relations not explicitly represented in the SCFs, such as modifiers and subtypes of clausal complements. We investigate a variety of features for the task.
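As a rough sketch of the idea, the code below builds a verb x relation x filler count tensor, applies an off-the-shelf CP decomposition (tensorly's parafac, used here merely as a convenient stand-in for our factorization setup), and clusters the verbs by their latent profiles; the toy random tensor, the rank and the cluster count are illustrative.

    import numpy as np
    import tensorly as tl
    from tensorly.decomposition import parafac
    from sklearn.cluster import KMeans

    # Toy co-occurrence tensor: verbs x dependency relations x argument heads.
    # In practice these would be counts harvested from a large parsed corpus.
    counts = np.random.poisson(1.0, size=(50, 6, 200)).astype(float)
    tensor = tl.tensor(counts)

    cp = parafac(tensor, rank=10)            # CP decomposition (recent tensorly API assumed)
    verb_profiles = cp.factors[0]            # one 10-dimensional latent profile per verb

    # Verbs with similar latent syntactic behaviour end up in the same cluster,
    # and each cluster approximates a subcategorization frame class.
    scf_clusters = KMeans(n_clusters=8, n_init=10).fit_predict(verb_profiles)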
The data set for this year's i2b2 Natural Language Processing Challenge was an unusual one: 900 suicide notes, annotated with 15 topics (13 emotions and 2 speech acts), namely abuse, anger, blame, fear, forgiveness, guilt, happiness or peacefulness, hopefulness, hopelessness, information, instructions, love, pride, sorrow and thankfulness. The objective of the challenge was to accurately predict their presence at the sentence level.
We developed a system that uses 15 SVM models (1 per topic), using the combination of features that was found to perform best on a given topic. The features represented lexical and semantic information: lemma and trigram bag of words, and features using WordNet, SentiWordNet and a list of subjectivity clues. SVM results were improved by changing the classification threshold, using bootstrap resampling to prevent overfitting.
We discuss the usability of this approach for emotion classification, and present the results.
High-dimensional vector spaces give rise to different problems, such as data sparseness and the existence of irrelevant data. Hence, dimensionality reduction is usually part of the data mining process. Feature extraction and feature selection are used to overcome these two problems, respectively. In this paper, we introduce a feature selection technique that is based on the Latent Semantic Indexing (LSI) feature extraction approach. The proposed LSI-based feature selection approach selects features according to the length of their corresponding vectors when projected onto a lower-dimensional space. We found that an SVM-based Word Sense Disambiguation (WSD) system with the proposed LSI-based feature selection outperforms the same system with the conventional LSI reduction technique by 14%, at a dimensionality reduction of 6.8%.
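One plausible reading of the proposed selection criterion is sketched below: project the term-document matrix with a truncated SVD and keep the terms whose projected vectors are longest; the corpus, the number of components and the proportion kept are placeholders.

    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import CountVectorizer

    def lsi_feature_selection(documents, n_components=100, keep=0.9):
        """Rank features by the length of their vectors in the LSI space and keep the strongest."""
        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(documents)                  # documents x terms
        svd = TruncatedSVD(n_components=n_components).fit(X)
        # Each column of components_ holds one term's coordinates in the latent space.
        lengths = np.linalg.norm(svd.components_, axis=0)
        order = np.argsort(-lengths)
        selected = order[:int(keep * len(order))]
        names = vectorizer.get_feature_names_out()
        return [names[i] for i in selected]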
The Groningen SemBank project aims to provide a large collection of semantically annotated texts. We employ a fairly traditional NLP toolchain to get a reasonable approximation of the target annotations. After each step, Bits of Wisdom (BOWs) may be applied: pieces of information coming from experts (linguists), external NLP tools and a game with a purpose played by non-experts. BOWs give information about word and sentence boundaries, parts of speech, named entities, word senses, bracketings, co-reference, prepositional attachment, quantifier scope and so on. They help to resolve ambiguities and correct errors. This allows us to iteratively improve the annotation accuracy in each release of SemBank.
The BOWs approach is different from traditional annotation correction methods in that it is not based on changes to one canonical copy of the annotation, but rather on assertions which can be applied selectively and in non-chronological order. This is important in a scenario where corrections come from many different sources that cannot always be in sync or even take each other’s outputs into account. A BOW is applied by making the minimal set of changes to the existing annotation required to make it consistent with the BOW. BOWs are not necessarily correct, they may agree or disagree, so a judge component will be used to assess the reliability of BOWs based on their source and frequency, and to resolve conflicts accordingly.
The toolchain, interleaved with BOW applications, forms the backend of a wiki-like Web application used by experts to view and edit the annotation (via BOWs) in real time. We employ a sophisticated modular setup enabling safe and efficient execution of the toolchain for possibly multiple documents in parallel.
For journalists, events are defined by the four Ws: What, Where, When, Who. The When is the subject of much work on date and time processing, and there is also much work on named-entity recognition addressing the Where (place) and the Who (participant). We therefore propose to focus here on the What.
It is difficult to define this What clearly. We propose a linguistic approach and postulate that elements which are predicates (verbs, adjectives, nouns) can help to define and find the content of this What. For the moment we focus only on nominal predicates and propose to automatically detect and classify the nouns which are predicates of events in unstructured French journalistic texts.
For that, we rely on the linguistic work of Seong-Heon Lee (2001), who proposed a semantic typology of nouns which are predicates of events. This typology is interesting because it was not constructed intuitively: its construction is based on the object classes of Gaston Gross (1996, 2003). The aim is to gather in the same group elements which have both similar semantic features and similar syntactic properties (for example, they are activated by the same support verbs). Here are two examples of different object classes of predicate nouns: C1: {foire (fair), biennal (biennial), exposition (exhibition), ...}; C2: {incendie (fire), avalanche (avalanche), ...}.
We use local grammars to automatically detect the predicate nouns, and we evaluate the recall and precision of our system.
References
Seong-Heon Lee. 2001. Les classes d'objets d'événements. Pour une typologie sémantique des noms prédicatifs d'événements. PhD thesis, Université Paris 13.
Gaston Gross. 1996. Prédicats nominaux et compatibilité aspectuelle. Langages, 30(121):54-72.
Gaston Gross. 2003. On the Description of Classes of Predicates. Language Research, special issue : 39-53.
We present FoLiA: Format for Linguistic Annotation. FoLiA is an extensible XML-based annotation format for the representation of language resources. It introduces a flexible paradigm independent of language or label set, designed to be highly expressive, generic and extensible. The aspiration of the proposed format is to be a universal "one format fits all" framework, preventing users from having to cope with a wide variety of different formats. In the light of this, FoLiA takes an approach in which a single XML file represents an entire document with all its linguistic annotations. Uniformity, expressiveness and expandability are three important principles upon which the format is founded. The envisioned use of the format is as a universal storage and exchange format for language resources, including corpora with rich annotations. FoLiA emerged from a practical need in the context of computational linguistics in the Netherlands and Flanders, arising from multiple projects. It builds on the foundation laid by the D-Coi XML format, which has hitherto been a de facto standard. FoLiA is currently being adopted by various projects.
Short messages to cell phones (SMSs) have become the most popular means of communication on digital fronts, especially in Africa and in South Africa in particular. This invites the abuse of such systems by advertisers through the distribution of spam. It has therefore become necessary to incorporate a filtering system similar to e-mail classification on these low-resource devices. Content filtering has not been utilised in the mobile domain, as such models are often not small and accurate enough to operate sufficiently well in a very restrictive domain. In this article a number of popular machine learning algorithms by which to classify an SMS based on the content of the message will be explored and compared for their usefulness in this particular problem. The advantages and drawbacks of decision trees, Naive Bayes classifiers (regular and multinomial), support vector machines and a simple bag-of-words approach will be discussed and their performance evaluated, after which a suggested implementation of the bag-of-words approach in a prototype SMS spam filter will be shown. This prototype on the Android™ platform makes use of a number of natural language processing techniques and uses a corpus consisting of both Afrikaans and English SMS messages.
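A minimal version of the bag-of-words approach, using a multinomial Naive Bayes classifier from scikit-learn on toy data, is shown below; the messages and labels are invented for illustration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy training data; the real corpus mixes Afrikaans and English SMSs.
    messages = ["Win a FREE prize, reply now!",
                "Sien jou more by die huis",
                "Call this number to claim R1000"]
    labels = ["spam", "ham", "spam"]

    model = make_pipeline(CountVectorizer(lowercase=True), MultinomialNB())
    model.fit(messages, labels)
    print(model.predict(["FREE airtime, reply WIN"]))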
With the arrival of the new media, we see the return of a centuries-old phenomenon: a high degree of spelling variation. Unlike in traditional written texts, the variation goes beyond the occasional spelling error or the choice between established and recently revised spelling. The nature of the media gives rise to some of this higher degree of variation: SMS, Twitter and chat impose length and/or time constraints and invite strategies like abbreviations and optimized phrasings, whether creative or lexicalized for the whole medium or for specific peer groups. Another cause lies with the authors, many of whom have had only little formal training in writing, leading to many more spelling “errors”. And finally, at least in some peer groups, “normal” language is mixed extensively with street language, for which no standardized spelling exists.
For linguistics, this variation presents an inviting new area of research. For information technology, on the other hand, it is rather a nuisance, as most NLP systems expect a single standardized spelling. For both fields, the first step is to identify which word forms should be taken together because they appear to represent the same “normal” form. In this paper, we embark on an initial investigation into this task for Dutch as present on Twitter. Using a combination of speech-based variation patterns and clustering techniques, we create groups of word forms that ideally should map one-on-one to “normal” forms. We evaluate our clustering by manually measuring false reject and false accept rates on a sample set of word form pairs.
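The fragment below sketches only the clustering side of this idea, greedily grouping word forms by surface similarity (the real system additionally uses speech-based variation patterns); the similarity measure and threshold are illustrative.

    from difflib import SequenceMatcher

    def similar(a, b, threshold=0.8):
        return SequenceMatcher(None, a, b).ratio() >= threshold

    def cluster_word_forms(forms, threshold=0.8):
        """Greedy single-link clustering of word forms; each cluster should ideally
        correspond to a single 'normal' form."""
        clusters = []
        for form in forms:
            for cluster in clusters:
                if any(similar(form, member, threshold) for member in cluster):
                    cluster.append(form)
                    break
            else:
                clusters.append([form])
        return clusters

    print(cluster_word_forms(['egt', 'egt!', 'echt', 'morgen', 'morguh', 'mrgn']))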
We aim to reduce the problem of spelling variations in a Portuguese historical corpus of personal letters. The letters cover a period from the 16th to the 20th century and have been manually transcribed to create a digital edition that can be used for historical, linguistic, and sociological research. As the letters were written by a diverse group of authors, some of whom were semi-literate, and most of the manuscripts predate the first standardisation of Portuguese spelling, which took place only in 1911, they contain many spelling variations. We want to make the corpus available for further linguistic research and also make it accessible to a larger community.
We investigate to what extent we can perform the task of spelling modernization for Portuguese automatically. We adapted a well-studied statistical tool for spelling normalization, VARD2 (Baron et al., 2008), to the Portuguese language and we study its performance over four different time periods. We evaluate the performance of the automatic normalization tool with an intrinsic and an extrinsic method. Firstly, we compare the automatically normalized text with a manually normalized text. Secondly, as an evaluation in use, we measure the effect and usefulness of automatic spelling normalization for the task of automatic POS tagging.
Resolving referring expressions in a text is a key point for full text understanding. Besides clear co-reference relations between referring expressions where both referents point to the same entity in the discourse, texts can also contain more indirect referring relations such as part-whole relations or hyperonym relations.
We present an analysis and automatic resolution of bridge relations in Dutch across six different text genres sampled from the COREA and SoNaR corpora. We report on the annotation guidelines and inter-annotator agreement results for the bridge relations. To get more insight into what exactly is annotated in the data, we performed an in-depth manual analysis of the different types of bridge relations found across the different data sets. This analysis reveals that for all genres bridging references mostly stand in a class relationship, which is exactly the kind of information represented in a WordNet hierarchy. This inspired us to investigate to what extent a standard coreference resolution system for Dutch is capable of resolving bridge relations across different text genres, and to study the effect of adding semantic features encoding WordNet information. Our results reveal modest improvements when using the Dutch Cornetto information for all but one genre.
With the increasing amount of information available on the Internet one of the most challenging tasks is to provide search interfaces that are easy to use without having to learn a specific syntax. Hence, we present a query interface exploiting the intuitiveness of natural language for the largest web-based tourism platform.
Various analyses show how users formulate queries when their imagination is not limited by conventional search interfaces with structured forms consisting of check boxes, radio buttons and special-purpose text fields. The results of this field test are thus valuable indicators of the direction in which the web-based tourism information system should be extended to better serve its customers.
The increasing rate of patent application filings each year creates a vast amount of text that becomes rapidly impossible to be managed by human effort alone. Patent management therefore requires a reliable and fully automated patent classification system.
The most popular approach to patent classification is the well-known Bag-of-Words (BOW) model, but, as Krier and Zaccà (2002) claim, this may not be an adequate document representation for the complex domain of patent texts. The main criticism of the BOW model is its lack of precision: it does not take any context into account, and thus it cannot effectively disambiguate the terms in the vague and ambiguous terminology of the patent domain. In the field of text categorisation, there have been many attempts in recent decades to extend the BOW representation with different kinds of syntactic phrases, which incorporate the immediate context of a word and thus automatically capture some of its semantics. Such experiments have had mixed results (Brants, 2003) due to data sparseness problems.
Rather than dismiss the use of syntactic phrases altogether, it seems worthwhile to examine which information contained in syntactic phrases is most informative for patent classification and what is superfluous. To this end, we conduct leave-one-out classification experiments for different syntactic relations on a subset of the CLEF-IP 2010 patent corpus. Using the state-of-the-art Stanford parser, we generate linguistic phrases for the text in the patents, which are combined with the words to form a mixed text representation; we then perform leave-one-out classification experiments on the resulting text files for the 6 most frequent syntactic relations in the Stanford output.
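The sketch below shows the kind of mixed representation this produces, with spaCy standing in for the Stanford parser and an illustrative (not the paper's) set of relations; dropping one relation at a time gives the leave-one-out setting.

    import spacy

    nlp = spacy.load("en_core_web_sm")   # stand-in for the Stanford parser; model must be installed

    RELATIONS = ("amod", "nsubj", "dobj", "compound", "prep", "advmod")   # illustrative set

    def mixed_representation(text, drop_relation=None):
        """Bag-of-words tokens plus head_relation_dependent phrase features,
        optionally leaving out one syntactic relation."""
        doc = nlp(text)
        features = [token.lower_ for token in doc if token.is_alpha]
        for token in doc:
            if token.dep_ in RELATIONS and token.dep_ != drop_relation:
                features.append(f"{token.head.lower_}_{token.dep_}_{token.lower_}")
        return features

    print(mixed_representation("The rotating shaft drives the compressor unit."))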
Preliminary results suggest that a combination of phrasal relations found within the noun phrase are the most informative for patent classification.
The Korean bound noun cwul plays a special role in yielding non-factive interpretations without any negation marker. For example, the preferred interpretation of the sentence Hana-nun caki-ka yeyppu-n cwul a-n-ta (Hana-TP herself-NM pretty-COMP BN know-PRS-DC) is that ‘Hana (wrongly) thinks herself pretty,’ not that ‘Hana knows that she is pretty.’ Although this interesting interpretation is triggered by cwul, many previous sentiment analysis approaches cannot capture such an effect of cwul on factive meanings. The aim of this paper is to improve the results of sentiment analysis by utilizing the factive information carried by cwul. Various linguistic cues play a role in discriminating the meanings of cwul. Typical cues are i) whether the case marker following cwul is the objective case marker -ul or the adverbial case marker -lo, ii) whether the complementizer of the complement sentence is -nun/(u)n or -(u)l, and iii) the tense and aspect of the predicates co-occurring with cwul in the matrix sentence. All of these cues are used in sentiment analysis experiments. Experimental results obtained with SVMlight show that considering factive value in sentiment analysis is very effective. In particular, by combining the factive value of cwul with modal suffixes related to evidentiality, I obtain the best result without complex chunking or any parsing. Evidentiality is strongly related to factivity. This paper sheds light on the question of why the pragmatic meaning of expressions should be considered in sentiment analysis.
The goal of the IWT Tetra funded TExSIS project is to create a system for the automatic extraction of mono- and multilingual term lists from company documentation. Such lists are crucial for the consistent use and translation of company specific terminology. The system’s architecture is based on an existing approach for multilingual term extraction (Macken et al. 2008, Lefever et al. 2009) and is being further developed for 4 languages: Dutch, French, English and German.
In this talk, we present the gold standard corpus built for the evaluation of the TExSIS terminology extractor. In order to assess the domain-independence of the system, text data is collected from companies of different domains (e.g. automotive, dredging, human resources…) for all language pairs. In this corpus, a reference point for terminology extraction is created by manually identifying all occurrences of terms and their translations. Additionally, since the quality of the extracted terminology lists highly depends on the performance of the system’s underlying components, we create a gold standard for the entire extraction pipeline. In order to do this, the corpus is preprocessed (sentence splitting, tokenization, POS tagging, lemmatization, chunking and Named Entity Recognition) and sentence aligned using the respective TExSIS components. The output is then manually corrected.
Finally, the TExSIS extractor is also evaluated by testing its practical applicability in a machine translation context. For this purpose, the terminology extraction output is integrated in different MT systems.
Lefever, E., Macken, L., & Hoste, V. (2009). Language-independent bilingual terminology extraction from a multilingual parallel corpus. Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics. The Association for Computational Linguistics, Athens, Greece.
Macken, L., Lefever, E., & Hoste, V. (2008). Linguistically-based sub-sentential alignment for terminology extraction from a bilingual automotive corpus. Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008). Manchester, UK.
A number of e-infrastructure projects have been underway in the past couple of years aiming to provide an integrated and interoperable research infrastructure. These projects must tackle a large number of subjects, including legal, organizational and technical challenges, to ensure persistent, secure and reliable access to data and services that is easy to use by researchers. Rather than being a one-time effort, building these infrastructures is a continuous process of convergence towards standards of operation acceptable to all parties involved. These projects do not operate in isolation but are part of a much wider landscape of related infrastructure projects and initiatives, involving user communities, data services and cross-domain activities such as security and curation. For a project such as CLARIN the technological challenges fall into three main categories: data, tools/services and infrastructure. While there is much experience with data management (metadata, publication, persistency, security), which relies on many of the infrastructural services, such as persistent identifiers, the situation for tools/services is less mature. Projects such as CLARIN-NL TTNWW have seen a large increase in the number of services being made available, such as speech recognizers, NLP tools and annotation tools; the question of how to make these available in a cost-effective manner over a period of many decades remains largely unsolved. The project is a joint collaboration between the TTNWW partners and BigGrid.
In this paper we will stress-test a recently proposed technique for computational authorship verification, “unmasking”, which has been well received in the literature. The technique envisages an experimental set-up commonly referred to as ‘authorship verification’, a task generally deemed more difficult than so-called ‘authorship attribution’. We will apply the technique to authorship verification across genres, an extremely complex text categorization problem that so far has remained unexplored. The unmasking technique is attractive for authorship verification across genres, because of the interference between genre markers and authorial style markers. It might help remedy genre-related artifacts in that superficial genre-related differences between same-author texts in different genres will be filtered out easily and removed from the model early in the degradation process. After the removal of these non-essential stylistic features, one could hypothesize that only features more relevant for authorial identity will be preserved.
We focus on five representative contemporary English-language authors. For each of them, the corpus under scrutiny contains several texts in two genres (literary prose and theatre plays). The experiments reported on in this paper confirm that unmasking is an interesting technique for computational authorship verification, especially yielding reliable results within the genre of (larger) prose works in our corpus. Authorship verification, however, proves much more difficult in the theatrical part of our corpus. The original settings for the various parameters often appear to be genre-specific or even author-specific, so that further research on optimization is desirable. Finally, we will show that interpretability is an important asset of the unmasking technique.
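For readers unfamiliar with the technique, the following is a minimal, generic rendering of unmasking (chunk two texts, repeatedly train a linear classifier to separate them, record its cross-validated accuracy, and drop the strongest features each round); the feature set, chunking and parameter values here are illustrative and not the configuration used in our experiments.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    def unmask(chunks_a, chunks_b, iterations=10, drop_per_iteration=6):
        """Return the accuracy degradation curve produced by repeatedly removing
        the most discriminative features."""
        texts = chunks_a + chunks_b
        y = np.array([0] * len(chunks_a) + [1] * len(chunks_b))
        X = CountVectorizer(max_features=250).fit_transform(texts).toarray()
        active = np.arange(X.shape[1])               # indices of features still in play
        curve = []
        for _ in range(iterations):
            clf = LinearSVC(dual=False)
            curve.append(cross_val_score(clf, X[:, active], y, cv=5).mean())
            clf.fit(X[:, active], y)
            strongest = np.argsort(-np.abs(clf.coef_[0]))[:drop_per_iteration]
            active = np.delete(active, strongest)    # remove the most telling features
        return curve   # a fast drop suggests the two texts share an author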
Transformation-based learning (TBL) is a technique for learning transformation rules that refine the initial solution to a problem. For instance, in part-of-speech tagging (Brill, 1995), one could annotate a word using its most frequent tag, and use a list of rules derived from TBL to correct errors in this initial tagging.
TBL creates a set of candidate rules by instantiating rule templates using a training corpus. It then iteratively selects and applies the best rule according to a scoring function that measures the impact of applying a rule to the training corpus.
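In its generic form, the TBL loop can be written down as the following skeleton (the templates, rule application and scoring function are supplied by the task at hand, so the same loop serves both tagging and, with tree-valued instances and a tree-similarity score, the tree variant introduced below):

    def tbl(initial_annotations, gold, templates, apply_rule, score,
            max_rules=50, min_gain=1):
        """Generic transformation-based learning: repeatedly pick and apply the rule
        that most improves the score of the current annotations against the gold data."""
        learned = []
        current = list(initial_annotations)
        for _ in range(max_rules):
            # Instantiate candidate rules from the templates (rules must be hashable).
            candidates = {rule for template in templates
                               for rule in template(current, gold)}
            best_rule, best_gain = None, 0
            for rule in candidates:
                gain = score(apply_rule(rule, current), gold) - score(current, gold)
                if gain > best_gain:
                    best_rule, best_gain = rule, gain
            if best_rule is None or best_gain < min_gain:
                break
            current = apply_rule(best_rule, current)
            learned.append(best_rule)
        return learned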
We propose a variant of TBL that learns rules for transforming trees. From a set of tree pairs, it learns rules that attempt to transform an initial tree into a correct tree. Various scoring functions can be used to emphasize certain tree similarities, such as leaf node labels or the internal tree structure.
In our experiments, we applied tree TBL to learn rules for headline generation. Headline generation attempts to construct an appropriate headline from the initial sentence(s) of a newspaper article. In our system, a headline is generated by parsing the first sentence of an article using Alpino, and then applying transformation rules to remove subtrees to shorten the sentence. The transformation rules are learned using a corpus that is derived from the Twente Nieuwscorpus. We show that the resulting headline generator gives competitive results.
Finally, we describe work-in-progress that uses tree TBL for learning rules that correct invalid dependency structures.
In this study we take a machine learning approach to predicting the timing of backchannels. In traditional approaches to this task, the machine learning algorithms are presented with positive samples of appropriate moments to give a backchannel, which are found in a corpus of recorded human-human interactions, and negative samples, which are randomly selected from the moments where no backchannel is recorded. Since backchanneling is optional, another person might have given a backchannel at one of the moments selected as a negative sample. This makes these negative samples noisy, and the performance of these models is highly dependent on the quality of the random selection of these negative samples.
To avoid this, we take a novel iterative approach to learning the model, where we select the negative samples more reliably by using the results of a subjective evaluation after each iteration. In this subjective evaluation participants indicate whether generated backchannels based on the current model are appropriate or not. The backchannels that are judged as inappropriate are used in the next iteration as negative samples for the machine learning algorithm. This ensures that the negative samples we select are in fact inappropriate moments to perform a backchannel. The positive samples are selected in the traditional manner. By doing this iteratively we constantly refine our model to perform better.
We will present a study that shows the benefits of our novel iterative approach over the traditional approach for learning the timing of backchannels.
Attempts to build a corpus of syntactically parsed texts in one of the lesser studied languages, Chechen, are seriously hindered by the number of ambiguities within words written in the Cyrillic script. This script underdifferentiates vowels, mapping most of the short, long and diphthong variants onto one grapheme (11 graphemes are unique, 3 are ambiguous, and 2 can represent any of three phonemes). The ambiguity on the word level, which percolates to ambiguity on the tagging and parsing level, can be reduced by phonemisation to an existing practical Latin-based orthography developed for academic work.
This paper describes how a maximum entropy model can be used to obtain near-perfect accuracy in terms of the Ambiguous Vowel Grapheme Agreement Ratio (AVGAR). The baseline for fully automatic phonemisation is a 65% AVGAR, which is obtained when all vowels are classified as short ones. A simple maximum entropy model can reach an AVGAR of about 93%. The exact AVGAR depends on the size of the training set, and is negatively influenced if a feature does not contribute relevant linguistic knowledge.
The baseline of fully manual phonemisation is estimated to be above 99%. We propose a semi-automatic phonemisation strategy, which treats classifications with a probability above a certain threshold automatically, while it asks for user-input when the probability is below the threshold. An AVGAR of 99% can, depending on the feature set, be reached with a threshold of about 50%, and the user’s manual classification task is then reduced to about 1.75% of the data.
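A minimal sketch of both modes is given below: logistic regression (the maximum entropy model) over character-context features, with classifications below a probability threshold deferred to the user; the training items, feature window and ask_user fallback are placeholders.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def grapheme_features(word, i, width=2):
        """Character context around the ambiguous vowel grapheme at position i."""
        return {f'c{j}': word[i + j]
                for j in range(-width, width + 1)
                if 0 <= i + j < len(word)}

    # Placeholder training data: (word, position of ambiguous grapheme, Latin-orthography vowel).
    train = [('дика', 1, 'i'), ('дика', 3, 'a')]
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform([grapheme_features(w, i) for w, i, _ in train])
    y = [label for _, _, label in train]
    maxent = LogisticRegression(max_iter=1000).fit(X, y)

    def phonemise(word, i, threshold=0.5):
        """Classify automatically above the confidence threshold, otherwise defer to the user."""
        proba = maxent.predict_proba(vectorizer.transform([grapheme_features(word, i)]))[0]
        if proba.max() >= threshold:
            return maxent.classes_[proba.argmax()]
        return ask_user(word, i)   # hypothetical manual fallback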
Wikipedia is a very popular online multilingual encyclopedia that contains millions of articles covering most written languages. Wikipedia pages contain monolingual hypertext links to other pages, as well as inter-language links to the corresponding pages in other languages. These inter-language links, however, are not always complete.
We present a prototype for a cross-lingual link discovery tool that discovers missing Wikipedia inter-language links to corresponding pages in other languages for ambiguous nouns. Although the framework of our approach is language-independent, we built a prototype for our application using Dutch as an input language and Spanish, Italian, English, French and German as target languages. The input for our system is a set of Dutch pages for a given ambiguous noun, and the output of the system is a set of links to the corresponding pages in our five target languages.
Our link discovery application contains two submodules. In a first step all pages are retrieved that contain a translation (in our five target languages) of the ambiguous word in the page title (Greedy crawler module), whereas in a second step all corresponding pages are linked between the focus language (being Dutch in our case) and the five target languages (Cross-lingual web page linker module). We consider this second step as a disambiguation task and apply a cross-lingual Word Sense Disambiguation framework to determine whether two pages refer to the same content or not.
Annotations of multimodal resources are the grounds for linguistic research. However, creation of those annotations is a very laborious task, which can take 100 times the length of the annotated media. For this reason innovative audio and video processing algorithms are needed, in order to improve the efficiency and quality of the annotation process. This is the aim of the AVATecH project, which is a collaboration of the Max-Planck Institute for Psycholinguistics (MPI) and the Fraunhofer institutes HHI and IAIS.
People with physical impairments who have trouble operating devices manually, could greatly benefit from vocal interfaces to control devices at home (such as the TV, radio, lights, etc.). Nevertheless, the use of assistive domestic vocal interfaces by this group of people is still far from common, due to technical and practical constraints. One of the main problems is the lack of robustness of the speech recognition system to environmental noise and to idiosyncratic pronunciations related to speech pathology and/or regional variation. Another important issue is the amount of learning and adaptation required from the user, since a restrictive vocabulary and grammar are usually preprogrammed in the system.
The ALADIN project aims to address these problems by developing a robust, self-learning domestic vocal interface that adapts to the user instead of the other way around. The vocabulary and the grammar of the system are to be learnt on the basis of a limited set of user commands and associated controls (actions). The module for unsupervised grammar induction is designed by CLiPS. One of the targeted applications is a voice controlled computer game: patience. We have compiled a small corpus of patience commands and associated moves in a number of experiments, in which participants were asked to play patience using voice commands, executed by the experimenter. This audio corpus, manually transcribed and linguistically annotated, is used to select an appropriate grammar formalism and semantic representation for the expected range of possible commands, and as input data for some initial grammar induction experiments.
We present a system to automatically identify emotion-carrying sentences in suicide notes and to detect the specific fine-grained emotion conveyed. With this system, we competed in Track 2 of the 2011 Medical NLP Challenge (Pestian et al., 2011), where the task was to distinguish between fifteen emotion labels, from guilt, sorrow, and hopelessness to hopefulness and happiness. An additional complication was the fact that half of the sentences were left unannotated, often not for lack of emotion, but for lack of inter-annotator agreement.
Since a sentence can be annotated with multiple emotions, we designed a thresholding approach that enables assigning multiple emotion labels to a single instance. We rely on the probability estimates returned by our LibSVM classifier and experimentally set thresholds on these probabilities. Emotion labels are assigned only if their probability exceeds a certain threshold and if the probability of the sentence being emotion-free is low enough. We show the advantages of this thresholding approach by comparing it to a naïve system that assigns only the most probable label to each test sentence, and to a system trained on emotion-carrying sentences only.
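The decision rule itself is simple; a sketch under assumed label names and threshold values (the actual thresholds were set experimentally) looks like this:

    def assign_labels(probabilities, labels, label_threshold=0.25,
                      no_emotion_index=0, no_emotion_max=0.5):
        """Assign every emotion whose estimated probability exceeds the threshold,
        provided the probability of the sentence being emotion-free is low enough."""
        if probabilities[no_emotion_index] > no_emotion_max:
            return []
        return [labels[i] for i, p in enumerate(probabilities)
                if i != no_emotion_index and p >= label_threshold]

    # Probability estimates as returned by an SVM with probability output (e.g. LibSVM's -b 1).
    labels = ['no_emotion', 'guilt', 'hopelessness', 'love', 'instructions']
    print(assign_labels([0.10, 0.40, 0.05, 0.35, 0.30], labels))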
When solving document understanding problems, most NLP systems rely on lexical, syntactic, semantic and/or discourse information. However, in practice, the more complex a domain is, the harder it is to account for the huge variety present in the data. Much of this variety comes from rarely occurring phenomena which makes collecting all the information impractical.
To solve this problem we introduce the concept of semantic tunneling, an approach for leveraging the formatting information of documents to uncover additional semantic information unknown to the model. The process of identifying additional semantic information relies on the assumption that similarity in formatting indicates similarity in semantic content.
Semantic tunneling is tested on the domain of resume parsing using a data set of 800 CVs containing formatting information. We explore the section segmentation task which is cast as a sequential learning problem using Conditional Random Fields.
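A sketch of this sequential set-up with sklearn-crfsuite is shown below; the line representation, the formatting features and load_cv_corpus are assumptions made for the illustration, not the actual system.

    import sklearn_crfsuite

    def line_features(lines, i):
        """Formatting-centred features for one CV line, the kind of cues semantic tunneling exploits."""
        line = lines[i]
        text = line['text']
        return {
            'is_bold': line.get('bold', False),
            'is_upper': text.isupper(),
            'font_size': str(line.get('font_size', 'unknown')),
            'starts_with_bullet': text.lstrip().startswith(('-', '*', '•')),
            'first_token': text.split()[0].lower() if text.split() else '',
            'prev_blank': i > 0 and not lines[i - 1]['text'].strip(),
        }

    def featurise(document):
        return [line_features(document, i) for i in range(len(document))]

    # documents: one list of line dicts per CV; sections: one BIO label sequence per CV.
    documents, sections = load_cv_corpus()   # hypothetical loader
    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=100)
    crf.fit([featurise(d) for d in documents], sections)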
The effectiveness of semantic tunneling is illustrated through a series of experiments that vary the amount and type of semantic deficiency. These experiments demonstrate that semantic tunneling enables a recovery of 40% to 80% of the errors made by the baseline system. The result is a more robust model which is able to capture new semantic knowledge and avoid errors caused by misspelling or unconventional word use. We believe that the effectiveness of semantic tunneling is not limited to resume parsing but generalizes to many other domains where formatting information carries implicit semantic information.
At the last edition of CLIN, we described the construction of a database of lexical orthographic errors in French, built from the requests made to an on-line dictionary. We filtered the requests and kept those with a frequency above 200. This gave us a set of 58,500 distinct forms (corresponding to 169 million requests), of which 15,000 are erroneous forms. The main problem was to pair the erroneous forms with their correct forms, and to achieve this goal we used different techniques, including the exploration of graphic neighbours and phonetization. Last year, we managed to reach 70% unambiguous correction, but as this seemed a rather low rate to us, we improved the programs and the correction ratio now reaches 86%. Put differently, the system has a 91.5% coverage and a 94.0% precision. In this communication, we will first describe the techniques used to improve our results: patterns, transitivity, and feedback. Then we will discuss the reasons for missing or incorrect corrections, as well as the limits of our system. Finally, we will explain the use of analogy in building a lighter correction system, based on the results of the preceding programs. This analogy-based system has been designed with 1400 items with a frequency above 2000, for which it gives a 98.3% coverage (with a 100% precision). When used with the complete list of items, its coverage is 78.5%, with a 93.7% precision.
In this presentation I discuss several approaches to annotating modality, analyse their advantages and disadvantages, and propose an integrated scheme to annotate modality across genres. Until recently, research in Natural Language Processing (NLP) has focused on propositional aspects of meaning. For example, semantic role labeling, question answering and text mining tasks aim at extracting information of the type "who does what, when and where". However, understanding language also involves processing extra-propositional aspects of meaning, such as modality. Modality is a grammatical category that allows the speaker to express aspects of her attitude towards her statements in terms of degree of certainty, reliability, subjectivity, sources of information, and perspective. I understand modality in a broad sense, which involves related concepts like subjectivity (Wiebe et al., 2004), hedging (Hyland, 1998), evidentiality (Aikhenvald, 2004), uncertainty (Rubin et al., 2005), committed belief (Diab et al., 2009) and factuality (Saurí and Pustejovsky, 2009). I will focus on three categorization schemes: the scheme developed for the task Processing modality and negation for machine reading (Morante and Daelemans, 2011), the scheme used to annotate attribution in the Penn Discourse Treebank (Prasad et al., 2006), and the scheme used in the SIMT SCALE project (Baker et al., 2010).
Current NLP research on the veridicality of events takes the epistemic reading as the default for modality markers, thereby ignoring important characteristics of other modality meanings. To address this challenge, we have developed a modality annotation scheme that takes theoretical linguistic considerations into account. The scheme adapts the two-dimensional classification of Palmer (2001), in which modality is divided into propositional (epistemic and evidential) and event (deontic and dynamic) modality.
Propositional modality reflects the author’s own or reported opinion about event factuality. To assess the status of the event, it is important to extract the source that made the proposition and the time of the proposition, as well as to postulate a reliability value for each source. By contrast, event modality expresses the relation between the proposition and its real-world referent without speaker evaluation. For event modality we need to determine whether the event is factual, counterfactual or uncertain and, if uncertain, at what time the event was potential. It can also be important to define the conditions required for event actualization or the obstacles preventing it.
To evaluate our scheme, we have manually annotated events under the scope of modal verbs with the above defined values in a corpus of biographical Wikipedia articles. We have also defined the main features which play a significant role in determining the meaning of modality: clause types, presence of different parenthetical constructions, adverbs, etc. On the basis of feature interaction we established the rules for annotating event modality within our application.
At the previous CLIN meeting, we introduced Dact: a stand-alone tool, freely available for multiple platforms, to browse CGN and Lassy syntactic annotations in an attractive graphical form, and to search in such annotations according to a number of criteria, specified elegantly by means of XPath, the WWW standard query language for XML.
In this presentation, we present a number of improvements which have been integrated in Dact over the last few months. The improvements include:
Support for Oracle Berkeley DB XML. Earlier versions of Dact used libxml2 (http://xmlsoft.org) for XML processing. The new version of Dact also supports the Oracle Berkeley DB XML libraries. The advantage is that database technology is available for preprocessing the syntactic annotations, and as a result query evaluation is (much) faster.
A further advantage is that XPath 2.0 is supported. The availability of XPath 2.0 is useful for specifying quantified queries. Such queries have been argued for by, for instance, Gerlof Bouma (2008). In the presentation, we provide a number of linguistic examples; a sketch of such a query is given below.
For the development of more complicated search queries, Dact now supports macros. The macro mechanism can be used to maintain a number of search queries which a user wants to have easily available. In addition, the macro mechanism is useful in order to build up larger, more complicated search queries from simpler parts.
Dact can be used both for the Lassy Small corpus, the 1 million word corpus of written Dutch with manually verified syntactic annotations, and for the 1 million word subset of the Spoken Dutch Corpus (CGN) with syntactic annotations.
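To give a concrete flavour of the kind of queries involved, here is a small editor's sketch in Python. It assumes the usual Alpino/Lassy XML layout with node elements carrying @rel, @cat, @begin and @end attributes; the quantified expression requires an XPath 2.0 engine such as Dact's DB XML backend and will not run under lxml's XPath 1.0.

from lxml import etree

tree = etree.parse("example.xml")   # a hypothetical Lassy Small annotation file

# XPath 1.0: NP nodes that contain a determiner daughter
nps = tree.xpath('//node[@cat="np" and node[@rel="det"]]')

# XPath 2.0 quantified query (cf. Bouma 2008): NPs in which every non-head
# daughter precedes the head -- to be evaluated inside Dact, not with lxml
quantified = (
    '//node[@cat="np"]'
    '[every $d in node[@rel ne "hd"] satisfies '
    'number($d/@end) <= number(node[@rel="hd"]/@begin)]'
)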
Translation Memory / Machine Translation (TM/MT) integration [He et al., 2010; Koehn and Senellart, 2010] is a recent trend in Statistical Machine Translation in which several components are combined into a final consortium system (another such approach is the system combination strategy [Barnagee et al. 02]). In these consortium systems, the criterion used for the combination critically determines the quality of the final translation output. He et al. (2010) and Koehn and Senellart (2010) use the criterion of smaller post-editing effort (edit distance / TER [Snover et al., 2006]) to select within the consortium. Instead, we propose a method to adapt the `salient' features of the small corpus to the larger corpus: one natural reason to integrate a human-made Translation Memory with a huge corpus is to retain the `salient' features of the human translations in the small corpus, such as preferred wordings and styles and dispreferred or inconsistent expressions, while still relying primarily on the huge corpus. We will discuss the pros and cons of these systems.
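For comparison only, the following editor's sketch shows a much-simplified stand-in for the selection criteria used in the cited TM/MT integration work, reduced here to a fuzzy-match threshold on the source side; it is not the salience-based adaptation method proposed in this talk.

import difflib

def fuzzy_match(a, b):
    # similarity of two source-language sentences in [0, 1]
    return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()

def translate(source, tm, mt_system, threshold=0.8):
    # tm: list of (source, target) pairs; mt_system: callable sentence -> translation
    best_tgt, best_score = None, 0.0
    for tm_src, tm_tgt in tm:
        score = fuzzy_match(source, tm_src)
        if score > best_score:
            best_tgt, best_score = tm_tgt, score
    if best_score >= threshold:
        return best_tgt          # reuse the human translation from the memory
    return mt_system(source)     # otherwise fall back to the MT engine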
For traditional written text, we know that, although authors can generally be recognized by their specific language use (consistency), this use changes when moving between genres and/or topic domains (adaptation). However, both consistency and adaptation have been measured mostly for established authors, who can be expected to have some control over their writing style be it on the basis of formal training or merely out of experience.
In the new media, we see that practically anybody can become a prolific author, even without training or extensive experience. We would expect consistency to be present to at least the same and probably a higher degree for such authors, since we assume their language use to be closer to their “natural” idiolect. But it is harder to predict whether their language use too shifts when they write in specific contexts.
We investigate consistency and adaptation by comparing the language use of 95 prolific authors when contributing to 48 highly active hashtags on Twitter. The comparison focuses on Dutch and uses three measurements which are traditionally also used to distinguish genres / topic domains, namely average sentence length (here replaced by tweet length), type/token ratio and the actual vocabulary used.
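A minimal sketch (editor's illustration) of the three measurements, assuming each author/hashtag sample is given as a list of token lists; the Jaccard overlap used for comparing vocabularies is an assumed operationalisation, not necessarily the one used in the study.

def avg_tweet_length(tweets):
    # tweets: list of token lists for one author/hashtag combination
    return sum(len(t) for t in tweets) / len(tweets)

def type_token_ratio(tweets):
    tokens = [tok.lower() for t in tweets for tok in t]
    return len(set(tokens)) / len(tokens)

def vocabulary_overlap(tweets_a, tweets_b):
    # Jaccard overlap of the vocabularies of two samples
    voc_a = {tok.lower() for t in tweets_a for tok in t}
    voc_b = {tok.lower() for t in tweets_b for tok in t}
    return len(voc_a & voc_b) / len(voc_a | voc_b)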
The construction of a large and richly annotated corpus of written Dutch was identified as one of the priorities of the STEVIN programme (2004-2011). Such a corpus, sampling texts from conventional and new media, was deemed invaluable for scientific research and application development. Therefore in 2008 the SoNaR project was initiated. The project aimed at the construction of a large, 500 MW reference corpus of contemporary written Dutch. Since its inception huge effort was put into acquiring a data set that represents a vast range of text types and into arranging IPR. All data have been converted to a standard XML format and with the exception of the data originating from the social media (tweets, chats and sms) all data have been lemmatized and tagged for parts of speech, while also named entity labeling has been provided (automatically). A 1 MW subset of the corpus has been enriched with different types of semantic annotation, viz. named entity labeling, annotation of co-reference relations, semantic role labeling and annotation of spatial and temporal relations. These annotations have been manually verified, yielding world-class, new training sets for Dutch. The SoNaR project was concluded 1 December 2011. Its results will be made available through the Dutch-Flemish HLT Centre (TST-Centrale) who will be responsible for distribution and future maintenance of the corpus. It is expected that the corpus will become available by early spring 2012.
Semantic relations such as synonyms and hyponyms are useful for various NLP applications, such as word sense disambiguation (Patwardhan et al., 2003), query expansion (Voorhees, 1994), document categorization (Tikk et al., 2003), question answering (Sun et al., 2005), etc. Semantic relation extraction techniques aim to discover meaningful relations between a given set of words.
One approach to semantic relation extraction is based on lexico-syntactic patterns, which are constructed either manually (Hearst, 1992) or semi-automatically (Snow et al., 2004). We study the alternative approach, which relies on a similarity measure between lexical units (see Lin (1998) or Sahlgren (2006)). In spite of significant improvements in recent years, similarity-based relation extraction remains far from perfect: Curran and Moens (2002) compared 9 measures and their variations and report Precision@1 = 76% and Precision@5 = 52% for the best measure. Panchenko (2011) compared 21 measures and reports an F-measure of 78% for the best one.
Previous studies suggest that different measures provide complementary types of semantic information. In our ongoing research we try to exploit the heterogeneity of existing similarity measures in order to improve relation extraction. We investigate how measures based on semantic networks, corpora, the web, and dictionaries may be efficiently combined.
First, we present an evaluation protocol adapted to similarity-based relation extraction. Second, we use this protocol to compare baseline corpus-, knowledge-, web-, and definition-based similarity measures. Finally, we present our preliminary results on the combination of different measures. We test three types of combination techniques, based on relation, similarity, and feature fusion.
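As an editor's sketch of the feature-fusion variant: each candidate word pair is represented by the scores of several similarity measures, and a classifier decides whether the pair is semantically related. The measure functions and the logistic-regression combiner are illustrative assumptions, not the authors' actual setup.

import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(pair, measures):
    # measures: corpus-, knowledge-, web- and definition-based similarity functions
    return [measure(*pair) for measure in measures]

def train_fusion(pairs, labels, measures):
    # labels: 1 = semantically related, 0 = unrelated
    X = np.array([pair_features(p, measures) for p in pairs])
    return LogisticRegression(max_iter=1000).fit(X, labels)

def extract_relations(candidate_pairs, clf, measures, threshold=0.5):
    X = np.array([pair_features(p, measures) for p in candidate_pairs])
    scores = clf.predict_proba(X)[:, 1]
    return [p for p, s in zip(candidate_pairs, scores) if s >= threshold]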
Arabic is an important natural language, owing both to its rich cultural and historical background and to the fact that it is spoken by more than two hundred million people. From the viewpoint of computational linguistics, moreover, its structural and morphological richness makes it attractive to researchers and a suitable test bed for new parsing and analysis methods. In this paper a new parsing/generation mechanism is proposed, underpinned by shuffle morphology. Using these concepts, a new formal grammar for Arabic verbs is developed. This grammar is embedded in a morpho-syntactic tagger and achieves an accuracy of 98.5%.
RightNow Intent Guide is a knowledge-based solution that uses natural language processing to provide higher relevancy than keyword-based website searches. The solution relies on a database of frequently asked questions, usually provided by RightNow customers, against which user queries are matched. One way to automate the process of populating the database of questions would be to use summarisation software to extract the most pertinent pieces of information from web content. If we treat the resulting summaries as candidates for frequently asked questions, we can pre-fill the database with those questions and ensure full website coverage. The success of this method obviously depends on the quality of the summarisation software.
The DAISY project (see De Belder et al. Forthcoming), which aims to develop summarisation tools for Dutch, provided a unique opportunity to apply this approach. We first evaluated the intrinsic quality of the summaries produced by adapting and applying various methods. For the intrinsic evaluation we used current customers’ knowledge-bases of questions, as well as the Intent Guide’s extensive tiered semantic hierarchies, to build modified Pyramid models (Nenkova and Passonneau 2004) against which to assess the summaries. We also applied Rouge (Lin 2004) summarisation evaluation techniques to the same data. We then measured the extrinsic quality and coverage of the DAISY output by embedding the summaries within the Intent Guide solution and evaluating them against real-life user queries. We present our results and conclusions in this talk.
De Belder, J., de Kok, D., van Noord, G., Nauze, F., van der Beek, L., and Moens, M. (Forthcoming) “Question Answering of Informative Web Pages: How Summarization Technology Helps”, In STEVIN Project Book.
Lin, C. Y. (2004) “ROUGE: A Package for Automatic Evaluation of Summaries”, In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop (2004), pp. 74-81.
Nenkova, A. and Passonneau, R. (2004) “Evaluating Content Selection in Summarization: The Pyramid Method”, In HLT-NAACL (2004), pp. 145-152.
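As a point of reference for the Rouge-based part of the evaluation described above, here is a minimal editor's sketch of ROUGE-1 recall with clipped unigram counts; the actual evaluation used the full Rouge toolkit and the modified Pyramid models.

from collections import Counter

def rouge_1_recall(candidate_tokens, reference_tokens):
    cand = Counter(t.lower() for t in candidate_tokens)
    ref = Counter(t.lower() for t in reference_tokens)
    overlap = sum(min(cand[t], ref[t]) for t in ref)
    return overlap / sum(ref.values())

# rouge_1_recall("how do I reset my password".split(),
#                "how can I reset my password".split())  -> 0.833...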
The Language Archive manages one of the largest and most varied sets of natural language data. This data consists of video and audio enriched with annotations. It is available for more than 250 languages, many of which are endangered. Researchers need to access this data conveniently and efficiently. We provide several browse and search methods to cover this need, which have been developed and expanded over the years. Metadata and content-oriented search methods can be connected for a more focused search. We aim to provide a complete overview of the available search mechanisms, with a focus on annotation content search.
In terms of preservation of and access to cultural heritage data, the MPI occupies a unique position. This uniqueness stems from four aspects of the situation: (1) the cultural heritage data mainly concerns currently spoken languages, (2) the linguistic data is often accompanied by video to capture non-verbal communication and cultural background, (3) the archive now hosts data from many linguistic preservation projects, linguistic studies and psycholinguistic experiments, and (4) a variety of methods is offered to browse, search, and leverage the large archive of data. We provide an overview of the archive, in the form of a brief history and an overview of the available data. We describe the current access methods, including our powerful combined metadata and content search.
Natural language processing tools developed in research projects typically run on one platform and require a significant amount of work to install on other platforms. One way to improve the accessibility of a tool is to build a web service layer around it which allows access from client machines through a standard protocol. We built a web service layer around the Dutch parser Alpino by using the German Weblicht framework. In this talk we describe some implementation issues and describe the protocol which allows access to the parser from any machine with a network connection. The work presented in this talk is part of the TTNWW project funded by CLARIN-NL.
In 2011, we collected 800 million Dutch Twitter messages. In this talk we give three examples of tasks which can be performed with this data. First, we use the time stamps associated with the messages to track usage frequencies of words during the day. This reveals expected patterns for eating, sleeping and watching TV. It also brings to light popular current events and unexpectedly fashionable common words. Next, we divide a part of the data based on the gender of the author and examine the differences between male and female speech. We find clear differences in the vocabularies, for example in the usage of the first-person singular pronoun. Finally, we present the results of an experiment in predicting the results of the Dutch regional elections of 2011 by counting Twitter messages mentioning political parties. Our prognosis was only slightly worse than the predictions of the national polls (24% error vs. 19%).
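A minimal editor's sketch of the first and third analyses, assuming the tweets are available as (timestamp, text) pairs; the whitespace tokenisation and the party list are illustrative simplifications.

from collections import Counter

def hourly_frequency(tweets, word):
    # how often `word` occurs per hour of the day; timestamp is a datetime object
    counts = Counter()
    for timestamp, text in tweets:
        if word in text.lower().split():
            counts[timestamp.hour] += 1
    return [counts[h] for h in range(24)]

def party_mentions(tweets, parties=("vvd", "pvda", "cda", "d66", "sp")):
    # relative number of tweets mentioning each party (illustrative party subset)
    counts = Counter()
    for _, text in tweets:
        tokens = set(text.lower().split())
        for party in parties:
            if party in tokens:
                counts[party] += 1
    total = sum(counts.values()) or 1
    return {party: counts[party] / total for party in parties}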
Overlapping speech is a common phenomenon in spontaneous, conversational human-human speech. The literature tells us that overlaps are made not at random locations in the speech, but are rather placed at coordinated locations, triggered by (multimodal) cues both from the speaker and listener, and coordinated by a finely tuned turn-taking system. While overlapping speech is often associated with power displays, some overlaps are actually signals of rapport and interest. Hence, identifying the type and nature of overlaps is important in analyzing the social behavior of the speakers and in characterizing the conversation. Moreover, automatically identifying the type of overlapping speech is important for the development of socially intelligent spoken dialogue systems (e.g., embodied conversational agents). For example, when a user starts to speak while the agent is still speaking, the agent should be able to assess whether the user's overlap was an attempt to take over the floor or whether it was a supportive cue to show that he/she was listening or agreeing. In order to investigate overlapping speech as a social signal, we first need a way to describe and characterize overlaps in conversation. In this study, we explore how overlaps in speech can be described and introduce an annotation scheme. The annotations obtained will allow for automatic analyses of overlapping speech such as the automatic interpretation of overlaps (e.g., competitive vs. cooperative) using prosodic features. We applied our annotation scheme to multiparty conversations (AMI corpus) and we will show analyses of the annotations obtained.
We describe an approach for training a semantic role labeler through cross-lingual projection between different types of parse trees. After applying an existing semantic role labeler to parse trees in a resource-rich language (English), we partially project the semantic information to the parse trees of the corresponding target sentences, based on word alignment. From this precision-oriented projection, which focuses on similarities between source and target sentences and parse trees, a large set of features describing target predicates, roles and predicate-role connections is generated, independently from the type of tree annotation (phrase structure or dependencies). These features describe tree paths starting at or connecting nodes. We train classifiers in order to add candidate predicates and roles to the ones resulting from projection. We investigate whether these candidates facilitate the alignment of translation divergences and annotational divergences between parse trees.
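An editor's sketch of the projection step: roles assigned to source tokens are carried over to the aligned target tokens. The data structures (role spans as token-index sets, word alignment as a source-to-target index map) are simplifying assumptions made for the illustration.

def project_roles(source_roles, alignment):
    # source_roles: {role label: set of source token indices}
    # alignment:    {source token index: set of aligned target token indices}
    target_roles = {}
    for label, src_indices in source_roles.items():
        tgt = set()
        for i in src_indices:
            tgt |= alignment.get(i, set())
        if tgt:                       # precision-oriented: keep only aligned material
            target_roles[label] = tgt
    return target_roles

# project_roles({"A0": {0}, "A1": {2, 3}}, {0: {1}, 2: {4}, 3: {4, 5}})
# -> {"A0": {1}, "A1": {4, 5}}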
The growing amount of digital information along with the rise of social media changes how organizations operate and how people communicate. This brings new challenges in the domain of public safety and security. Specifically, the Netherlands Forensic Institute develops automatic methods for analyzing data collections in order to find possible indications of criminal behavior. Since our sources are largely unstructured and semi-structured text documents, computational linguistics plays a key role in analyzing them.
In this talk, we identify three potential applications of computational linguistics in forensic document analysis. The first application is identifying previously unknown players in criminal networks. A combination of documents (police files, forums, etc.) may reveal how criminal networks are organized. This may lead us to persons or organizations which were previously not known to be relevant. A second issue is detecting fraud by analyzing the process flow in an organization. Irregularities in business processes may indicate fraudulent transactions. Often, such irregularities can be traced by analyzing structured log files, if available. In the absence of adequate logging, the process flow may be reconstructed by extracting business transactions from alternative sources, such as emails, invoices, and other internal documents. Third, in some document types (including blogs, forums, chat), identifying the author may be non-trivial or even impossible. Even so, we may use textual features to classify documents by characteristics of the author, such as gender, age and background.
These three applications share the challenges of dealing with heterogeneous sources, rare phenomena, and lack of ground truth data.
The Groningen SemBank project aims to provide fully resolved deep semantic annotations using the framework of Discourse Representation Theory (DRT). This includes accounting for presupposition projection, as well as other projection phenomena, such as conventional implicatures. Following van der Sandt's theory, presupposition projection is treated as anaphora resolution, which means that presuppositions either bind to a previously introduced discourse entity or `accommodate', i.e., create an antecedent at a suitable (global or local) level of discourse.
In SemBank, DRT is extended with discourse relations, following the SDRT approach. We propose several criteria for the proper representation of presuppositions. These criteria include the definition of free and bound variables on all stages of analysis, the visibility of accommodated presuppositions versus asserted content and proper interaction with rhetorical relations. Moreover, the application of SemBank requires the annotations to be user friendly, such that they can be easily checked.
Various alternatives for representing presuppositions are discussed, based on existing literature. We propose a Presupposition Normal Form to make annotation and processing of presuppositions easier. This results in a separation between presupposed and asserted content. We argue that the presuppositional content does not just comprise a simple DRS, as follows from van der Sandt's theory, but can also contain complex rhetorical relations, as in the case of conventional implicatures. The SemBank corpus will be used to motivate and illustrate the phenomena to be accounted for.
Hedge cue detection is a Natural Language Processing (NLP) task that consists of determining whether sentences contain hedges. These linguistic devices indicate that authors do not or cannot back up their opinions or statements with facts. This binary classification problem, i.e. distinguishing factual versus uncertain sentences, only recently received attention in the NLP community. We use kLog, a new logical and relational language for kernel-based learning, to tackle this problem. We present results on the CoNLL 2010 benchmark dataset that consists of a set of paragraphs from Wikipedia, one of the domains in which uncertainty detection has become important. Our approach shows competitive results compared to state-of-the-art systems.
Text prediction is the task of suggesting text while the user is typing. Its main aim is to reduce the number of keystrokes that are needed to type a text. In our presentation, we address the influence of text type and domain differences on text prediction quality.
By training and testing our text prediction algorithm on four different text types (Wikipedia, Twitter, conversational speech and FAQ) with equal corpus sizes, we found that there is a clear effect of text type on text prediction quality: training and testing on the same text type gave percentages of saved keystrokes between 52 and 57%; training on a different text type caused the scores to drop to percentages between 34 and 44%.
In a case study, we compared a number of training corpora for a specific data set for which training data is sparse: questions and answers about neurological issues. A comparison between data of the same text type and data of the same topic domain suggested that domain is more important than text type. However, by far the best results were obtained with an 80-20 split of the target data, even though this training corpus was much smaller than the other corpora.
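One possible operationalisation of "saved keystrokes" is sketched below (editor's illustration); it assumes a predict(prefix) function that suggests a completion of the current word, and it counts accepting a correct suggestion as a single keystroke. The real evaluation may define the measure differently.

def saved_keystrokes(text, predict):
    # predict(prefix) returns a suggested completion of the current word, or None
    saved = 0
    words = text.split()
    for word in words:
        prefix = ""
        for ch in word:
            if predict(prefix) == word[len(prefix):]:
                saved += len(word) - len(prefix) - 1   # accepting costs one keystroke
                break
            prefix += ch
    total = sum(len(w) for w in words)
    return 100.0 * saved / total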
We have compiled a corpus of 80 Dutch texts from expository and persuasive genres, which we annotated for rhetorical and genre-specific discourse structure and lexical cohesion. These three annotation layers are based on the same text segmentation into discourse units.
We analyze coherence structures with Rhetorical Structure Theory (RST; Mann and Thompson 1988). For genre analysis, we identified the genre-specific main functional text components (moves; Upton 2002) and overlaid the RST trees with a segmentation into a sequence of moves.
Our analysis of lexical cohesion (Halliday and Hasan 1976, Tanskanen 2006) classifies the semantic relations among lexical items in the text as repetition, systematic semantic relation, or collocation. Items participating in lexical cohesion include content words (nouns, verbs, adjectives, and adverbs of place, time, and frequency) and proper names.
We are using this resource for corpus-based studies of discourse relations, discourse markers, cohesion, and genre differences. We are also exploring automatic text segmentation and semi-automatic discourse annotation.
References
Michael Halliday and Ruqaiya Hasan. Cohesion in English. London, 1976.
William Mann and Sandra Thompson. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8:243–281, 1988.
Sanna-Kaisa Tanskanen. Collaborating towards coherence: Lexical cohesion in English discourse. Amsterdam, 2006.
Thomas Upton. Understanding direct mail letters as a genre. International Journal of Corpus Linguistics, 7:65–85, 2002.
State of the art Word Sense Disambiguation (WSD) systems require large sense-tagged corpora to reach satisfactory results. The number of English language resources increased in the past years, while most other languages are still under-resourced. The situation is no different for Dutch. In order to overcome this data bottleneck, the DutchSemCor project will deliver a Dutch corpus that is sense-tagged with senses from the Cornetto lexical database. Part of this corpus (circa 300K examples) is manually tagged. Current statistics for the manual annotations show an average of 91% Inter-Annotator Agreement. The remainder is automatically tagged using different WSD systems (knowledge-based, supervised and a combination of these two) and validated by human annotators (active learning). The first tests show promising results: an F-score of 74.17% for supervised WSD and an F-score of 63.66% for the knowledge-based system. The project uses existing corpora compiled in other projects (SoNaR, CGN, OpenTaal); these are extended with Internet examples for word senses that are less frequent and do not (sufficiently) appear in the corpora. We developed different tools for the purpose of manual tagging and active learning (SAT) and for importing web-snippets into the corpus (Snippet-tool). We report on the status of the project, we describe the tools and the working method used for sense tagging and active learning and, finally, we show the evaluations of the WSD systems with the current training data.
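The active learning component can be pictured with the following generic, uncertainty-based selection loop (editor's sketch); the classifier and the feature representation stand in for the project's actual WSD systems and tools.

import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_round(X_labelled, y_labelled, X_pool, n_select=50):
    # train on the examples tagged so far, then pick the pool examples the
    # classifier is least certain about and send them to the human annotators
    clf = LogisticRegression(max_iter=1000).fit(X_labelled, y_labelled)
    confidence = clf.predict_proba(X_pool).max(axis=1)
    return np.argsort(confidence)[:n_select]      # indices of examples to annotate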
Language resources are a very valuable asset, not only now, when they form the basis for new scientific publications, but also in the future, when new research might need to reassess previous findings. Thanks to the curation efforts of archive managers, primary data, like audio and video recordings, can still be accessible in that future. For a lexicon or a grammatical description, however, curation is not so easy. The semantics of the terminology used by the creators of these resources may have drifted, i.e., the terms might by then have a (slightly) different meaning. It is therefore quite possible that future users will have a hard time interpreting the resource in the right way, or will even come to wrong conclusions based on wrong assumptions. A possible solution is to make the semantics associated with these resources explicit. The Data Category Registry, nicknamed ISOcat, is taking that route.
ISOcat provides a way for resource creators to describe and share the semantics of the elementary descriptors, called data categories, in their resources. Future researchers should thus be able to take a resource from an archive and get the semantic descriptions of the data categories used in the resource. These descriptions should then help this researcher to interpret the resource.
A Data Category Registry can also be used to relate several contemporary resources in which the terms used may or may not share the same meaning. However, ISOcat cannot do the whole job on its own; it is assisted by two other cats, named RELcat and SCHEMAcat.
In this talk we will show the audience how they can make use of our three pets in order to profit from all the tools and resources around. The Dutch/Flemish CLARIN pilot project TTNWW will be used as a showcase.
In this presentation we address the task of Sentence Simplification and its evaluation. Sentence Simplification is the task of paraphrasing a sentence to make it easier to understand. It is a popular topic, and recent years have seen the development of various Sentence Simplification systems. A common approach is to align sentences from Wikipedia and Simple Wikipedia and to train some sort of simplification algorithm on the sentence pairs. To determine the success of such an approach, a measure of evaluation is needed. Different measures are generally used, such as BLEU and various readability scores. In this talk we give an overview of some of these systems and discuss why the evaluation performed on them is sometimes not very convincing. In addition, we present the results of a Sentence Simplification experiment we ran ourselves. We compare our system against the state of the art and ask human judges to rate the amount of simplification, the extent to which the meaning of the source sentence is preserved, and the extent to which the sentence has remained grammatical.
Digital metadata descriptions play an important role in making digital content available and searchable on-line. In our work, we are concerned with cultural heritage metadata for textual objects, such as archive documents. This metadata is mostly manually created and often lacks detailed annotation, consistency and, most importantly, explicit semantic content descriptors which would facilitate online browsing and exploration of available information. We attempt to automatically enrich this metadata with content information and structure the metadata collection based on respective content. We view the problem from an unsupervised, text mining perspective, whereby multi-word terms recognised in free text are assumed to indicate content. In turn, the inter-relationships among the recognised terms are assumed to reveal the knowledge structure of the document collection. For term extraction, we implement the C/NC value algorithm (Frantzi et al., 2000). For term clustering, we experiment with combinations of three variants of hierarchical clustering algorithms (complete, single and average linkage) with two term similarity measures: (i) term co-occurrence in a document and (ii) lexical similarity (Nenadic and Ananiadou, 2006).
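A minimal editor's sketch of the clustering step, assuming the term extraction has already produced, for each term, the set of documents it occurs in; the Jaccard form of the co-occurrence measure and the cluster cut-off are illustrative choices, not necessarily those used in the experiments.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cooccurrence_similarity(docs_a, docs_b):
    # overlap of the document sets in which two terms occur
    return len(docs_a & docs_b) / len(docs_a | docs_b)

def cluster_terms(term_docs, n_clusters=10):
    # term_docs: {term: set of document ids}
    terms = list(term_docs)
    n = len(terms)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            sim = cooccurrence_similarity(term_docs[terms[i]], term_docs[terms[j]])
            dist[i, j] = dist[j, i] = 1.0 - sim
    Z = linkage(squareform(dist), method="average")   # average-linkage variant
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return dict(zip(terms, labels))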