REGnames corpus

We introduce the REGnames corpus for the study of proper names. It contains 53,102 proper names referring to people in 15,241 webpages extracted from the Wikilinks corpus, which was initially collected for the study of cross-document coreference and consists of more than 40 million references to almost 3 million entities in around 11 million webpages. All the references annotated in Wikilinks were grouped according to the Wikipedia page of the entity.
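
As a rough illustration of what grouping references by the entity's Wikipedia page means, the sketch below collects mention strings under a shared entity identifier. The record layout, the example URLs, the names, and the function `group_by_entity` are all hypothetical and chosen only for illustration; they do not describe the actual file format of REGnames or Wikilinks.

```python
from collections import defaultdict

# Hypothetical records: (webpage URL, mention text, Wikipedia page of the entity).
# Illustrative only; this is NOT the corpus file format.
mentions = [
    ("http://example.org/a", "Jane Q. Doe", "wiki/Jane_Doe"),
    ("http://example.org/b", "Doe", "wiki/Jane_Doe"),
    ("http://example.org/b", "John Smith", "wiki/John_Smith"),
]


def group_by_entity(records):
    """Group mention strings under the Wikipedia page they refer to."""
    groups = defaultdict(list)
    for _webpage, mention, entity_page in records:
        groups[entity_page].append(mention)
    return dict(groups)


grouped = group_by_entity(mentions)
for entity_page, names in grouped.items():
    print(entity_page, names)
```

Grouping by the entity page rather than by the surface string is what lets different forms of a name, such as a full name and a bare surname, be treated as references to the same person.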

To see the statistics of the corpus, click here. For more information, read the article about it here.


Click here to download the first version of the corpus (described in the INLG paper) or here to download the second version (used in the experiments described in the EACL paper).


          author    = {Castro Ferreira, Thiago  and  Wubben, Sander  and  Krahmer, Emiel},
          title     = {Towards proper name generation: a corpus analysis},
          booktitle = {Proceedings of the 9th International Natural Language Generation conference},
          month     = {September 5-8},
          year      = {2016},
          address   = {Edinburgh, UK},
          publisher = {Association for Computational Linguistics},
          pages     = {222--226},
          url       = {}


          author    = {Castro Ferreira, Thiago  and  Krahmer, Emiel  and  Wubben, Sander},
          title     = {Generating flexible proper name references in text: Data, models and evaluation},
          booktitle = {Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL)},
          month     = {April 3-7},
          year      = {2017},
          address   = {Valencia, Spain},
          publisher = {Association for Computational Linguistics}