VaREG corpus


The VaREG corpus is a collection of referring expressions for studying individual variation in the choice of referential form. It contains 9,588 referring expressions, produced by 78 writers for 563 references (around 20 referring expressions per reference) in 36 English texts.


To collect the referring expressions, writers were presented with texts in which all references to the main topic had been replaced with gaps. Their task was to fill each gap with a reference to the topic. The experiment can be viewed here. To visualize the writers' referring expressions and the variation among their forms, click here.


Analysis of the corpus revealed significant variation among writers in their choice of referential form in the same situations. The variation was most pronounced in three situations: when the reference was in the object position of a sentence; when it referred to a topic already mentioned in the text but mentioned for the first time in the sentence; and when it was distant from its antecedent. For more information, see the slide presentation or the article about the corpus.
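One way to quantify how much writers disagree on a single gap is the normalized entropy of the forms they chose. The sketch below is only an illustration under stated assumptions: the form labels in `FORMS` are hypothetical, the input is toy data rather than the corpus's actual file format, and normalized entropy is used here as a convenient measure, not necessarily the one reported in the article.

```python
import math
from collections import Counter

# Hypothetical label set; the corpus's actual annotation scheme may differ.
FORMS = ("name", "pronoun", "description", "demonstrative", "empty")


def normalized_entropy(choices, n_forms=len(FORMS)):
    """Entropy of the chosen forms for one reference, scaled to [0, 1].

    0 means every writer chose the same form; 1 means the choices are
    spread evenly over all n_forms possible forms.
    """
    if not choices:
        raise ValueError("at least one referring expression is required")
    counts = Counter(choices)
    total = sum(counts.values())
    # Shannon entropy: sum over forms of p * log(1 / p)
    entropy = sum((c / total) * math.log(total / c) for c in counts.values())
    return entropy / math.log(n_forms)


# Toy data (not taken from the corpus): forms chosen by 20 writers for two gaps.
unanimous = ["pronoun"] * 20
split = ["pronoun"] * 10 + ["name"] * 10

print(normalized_entropy(unanimous))
print(normalized_entropy(split))
```

A gap on which all writers agree scores 0, while a gap where the writers split between forms scores higher, so averaging the score over the references in a given condition (for example, object position) gives a rough picture of where variation concentrates.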


Click here to download the corpus.


@InProceedings{castroferreira2016individual,
  author    = {Castro Ferreira, Thiago  and  Krahmer, Emiel  and  Wubben, Sander},
  title     = {Individual Variation in the Choice of Referential Form},
  booktitle = {Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2016},
  address   = {San Diego, California},
  publisher = {Association for Computational Linguistics},
  pages     = {423--427},
  url       = {}
}