Individual variation in the choice of referential form

Thiago Castro Ferreira

PhD student at:
TiCC - University of Tilburg

Supervised by:
Emiel Krahmer / Sander Wubben

Automatic Text Generation

Input:

Output:
John has a Math test on April 2nd, 2015.

Automatic Text Generation

Extracted from Narrative Science

Benner had a good game at the plate for Hamilton A’s-Forcini. Benner went 2-3, drove in one and scored one run. Benner singled in the third inning and doubled in the fifth inning.

Referring expressions!

Referring Expression Generation (REG)

Crucial for the coherence of the produced text

Given a context with two players: John and Benner

He had a good game at the plate for Hamilton A’s-Forcini. He went 2-3, drove in one and scored one run.
vs.
Benner had a good game at the plate for Hamilton A’s-Forcini. He went 2-3, drove in one and scored one run.

Which text is more coherent?

Choice of Referential Form

First decision of REG models

... proper name?
- Benner went 2-3, drove in one and scored one run.
... description?
- The player went 2-3, drove in one and scored one run.
... pronoun?
- He went 2-3, drove in one and scored one run.
... demonstrative?
- This player went 2-3, drove in one and scored one run.
... empty?
- Benner went 2-3, _ drove in one and _ scored one run.

Choice of referential form based on corpus

Goal: Make choices similar to human ones

Limitation: Available corpora contain only one referring expression per situation

Corpus vs. Model

Corpus:
- The player went 2-3, drove in one and scored one run.

Model for Choice of referential form:
- Benner went 2-3, drove in one and scored one run.

Is this choice wrong? Depends...

The use of a description does not necessarily mean that the use of a proper name is wrong.

Corpus vs. Model

Corpus:
- Writer 1: The player went 2-3, drove in one and scored one run.
- Writer 2: ???
- Writer 3: ???
- ...

Model for Choice of referential form:
- ??? went 2-3, drove in one and scored one run.

New corpus

Collection of more than one referring expression per situation

Link to the experiment

Materials

36 texts
12 news texts, 12 product reviews and 12 encyclopedic texts

78 participants
~ 20 per text

9588 referring expressions collected
Annotated according to 5 referential forms

Variation in the writers' choices

Quantified in each gap by the normalized entropy measure

$H(X) = - \sum\limits_{i = 1}^{n} \frac{p(x_{i}) \log (p(x_{i}))}{\log (n)}$
where $n = 5$ is the number of referential forms and $p(x_{i})$ is the proportion of writers who chose form $x_{i}$ in the gap
ranging from 0 (full agreement) to 1 (full variation)
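
To make the measure concrete, here is a minimal Python sketch of the normalized entropy for a single gap. The function name, the labels for the five forms and the example choices are illustrative assumptions, not taken from the corpus.

```python
import math
from collections import Counter

# Hypothetical labels for the five referential forms.
FORMS = ("name", "pronoun", "description", "demonstrative", "empty")


def normalized_entropy(choices, n_forms=len(FORMS)):
    """Normalized entropy of the forms chosen by the writers for one gap.

    `choices` is a list with one referential form per writer.
    Forms that nobody chose contribute 0 to the sum, so they are skipped.
    """
    counts = Counter(choices)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log(p)
    return entropy / math.log(n_forms)


# Invented examples: 20 writers filling the same gap.
full_agreement = ["pronoun"] * 20
split_choice = ["name"] * 10 + ["pronoun"] * 10
full_variation = list(FORMS) * 4

for gap in (full_agreement, split_choice, full_variation):
    print(round(normalized_entropy(gap), 3))
```

Dividing by $\log (n)$ is what keeps the value between 0 and 1 regardless of the logarithm base.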

Annotated Data

Link to the corpus

By genre

Referential Status

New referents vs. Old referents

Syntactic Position

Subject referents vs. Object referents

Recency

Close to their previous mention vs. Distant to their previous mention

Higher variation when the referent is...

Old in the text, and new in the sentence

Object of the sentence

Distant from its previous reference

Modeling

Naive Bayes
$P(r_{k} \mid f) \propto P(r_{k}) \prod\limits_{i = 1}^{|f|} P(f_{i} \mid r_{k})$
for each referential form $r_{k}$
PoS: $f$ composed of the part-of-speech tags of the preceding and following words
NB: $f$ composed of referential statuses, syntactic position and categorical recency
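
The Naive Bayes formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feature values, the toy training samples and the add-one (Laplace) smoothing are hypothetical and are not described in the slides.

```python
import math
from collections import Counter, defaultdict


def train(samples, alpha=1.0):
    """Count forms and feature values from (features, form) pairs.

    `alpha` is an assumed Laplace smoothing constant.
    """
    form_counts = Counter(form for _, form in samples)
    feat_counts = defaultdict(Counter)  # (form, feature index) -> value counts
    values = defaultdict(set)           # feature index -> observed values
    for feats, form in samples:
        for i, value in enumerate(feats):
            feat_counts[(form, i)][value] += 1
            values[i].add(value)
    return {"forms": form_counts, "feats": feat_counts,
            "values": values, "alpha": alpha, "n": len(samples)}


def predict_distribution(model, feats):
    """Return P(r_k | f) for every referential form r_k seen in training."""
    log_scores = {}
    for form, count in model["forms"].items():
        score = math.log(count / model["n"])  # log P(r_k)
        for i, value in enumerate(feats):
            seen = model["feats"][(form, i)][value]
            denom = count + model["alpha"] * len(model["values"][i])
            score += math.log((seen + model["alpha"]) / denom)  # log P(f_i | r_k)
        log_scores[form] = score
    # Normalize in a numerically stable way so the scores sum to 1.
    top = max(log_scores.values())
    exp_scores = {form: math.exp(s - top) for form, s in log_scores.items()}
    z = sum(exp_scores.values())
    return {form: e / z for form, e in exp_scores.items()}


# Invented toy data: (referential status, syntactic position, recency) -> form.
samples = [
    (("new", "subject", "distant"), "name"),
    (("new", "object", "distant"), "description"),
    (("old", "subject", "close"), "pronoun"),
    (("old", "subject", "close"), "pronoun"),
    (("old", "object", "distant"), "name"),
]
model = train(samples)
print(predict_distribution(model, ("old", "subject", "close")))
```

Returning the full distribution rather than a single label is what allows the prediction to be compared with the distribution of the writers' choices.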

Model	JSD	Accuracy	F-Score
Random	0.64	0.19	0.26
PoS	0.36	0.67	0.66
NB	0.31	0.76	0.74
NB+PoS	0.33	0.72	0.73

JSD: Jensen–Shannon divergence
Accuracy and F-Score: measured against the majority (most frequent) referential form of each situation
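
As an illustration of the JSD metric, the sketch below compares a distribution over referential forms produced by writers with one produced by a model. The base-2 logarithm (which bounds the value between 0 and 1) and the example distributions are assumptions; the slides do not state which base was used.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence between two distributions over the same forms."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Invented distributions over (name, pronoun, description, demonstrative, empty).
writers = [0.50, 0.30, 0.15, 0.05, 0.00]
model = [0.60, 0.25, 0.10, 0.05, 0.00]

print(round(jensen_shannon_divergence(writers, model), 4))
```

Lower values mean the predicted distribution is closer to the writers' distribution; identical distributions give 0.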

Conclusions

Considerable amount of individual variation in the choice of referential form

Linguistic factors can distinguish situations with similar distributions of referential forms.
Future work: Are there other factors besides referential statuses, syntactic position and recency?