This tutorial introduces the FoLiA Python library, part of PyNLPl. The FoLiA library provides an Application Programming Interface for reading, creating and manipulating FoLiA XML documents.
Before reading this document, it is highly recommended that you first read the FoLiA documentation itself and familiarise yourself with the format and its underlying paradigm.
Any script that uses FoLiA starts with the import:
from pynlpl.formats import folia
Subsequently, a document can be read from file and into memory as follows:
doc = folia.Document(file="/path/to/document.xml")
This returns an instance that holds the entire document.
Once you have loaded a document, all data is available for you to read and manipulate as you see fit. We will first illustrate some simple use cases:
You may want to simply print all (plain) text contained in the document, which is as easy as:
print doc
Alternatively, you can obtain a string representation of all text:
text_u = unicode(doc) #unicode instance
text = str(doc) #UTF-8 encoded
For any subelement of the document, you can obtain its text in the same fashion.
A document instance has an index which you can use to grab any of its sub-elements by ID. Querying the index works much like using a Python dictionary:
word = doc['example.p.3.s.5.w.1']
print word
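The dictionary-like lookup can be pictured with a small stand-alone sketch. The classes ToyDocument and ToyElement below are hypothetical stand-ins written for this illustration, not part of the FoLiA library; they only show the idea of elements registering themselves in an index keyed by ID.

```python
# Toy stand-ins (NOT pynlpl classes): every element registers itself in the
# document's index under its ID, so a lookup is plain dictionary access.
class ToyDocument(object):
    def __init__(self):
        self.index = {}  # maps element ID -> element

    def __getitem__(self, element_id):
        # In this toy, an unknown ID raises KeyError, as a dictionary would.
        return self.index[element_id]


class ToyElement(object):
    def __init__(self, doc, element_id, text):
        self.id = element_id
        self.text = text
        doc.index[element_id] = self  # register on creation


toy_doc = ToyDocument()
ToyElement(toy_doc, 'example.p.3.s.5.w.1', 'Hello')

toy_word = toy_doc['example.p.3.s.5.w.1']
print(toy_word.text)
```

How the real library reports an unknown ID is not covered here; the KeyError in the sketch is simply what a plain dictionary does.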
Usually you do not know in advance the ID of the element you want, or you want multiple elements. There are some methods of iterating over certain elements using the FoLiA library.
For example, you can iterate over all words:
for word in doc.words():
    print word
That, however, gives you one long sequence of words without boundaries. More likely you want the words within each sentence, so we first iterate over all sentences and then over the words therein:
for sentence in doc.sentences():
    for word in sentence.words():
        print word
Or including paragraphs, assuming the document has them:
for paragraph in doc.paragraphs():
    for sentence in paragraph.sentences():
        for word in sentence.words():
            print word
You can also use this method to obtain a specific word, by passing an index parameter:
word = sentence.words(3) #retrieves the fourth word
If you want to iterate over all of the child elements of a certain element, regardless of what class they are, you can simply do so as follows:
for element in doc:
    if isinstance(element, folia.Sentence):
        print "this is a sentence"
    else:
        print "this is something else"
If applied recursively this allows you in principle to traverse the entire element tree.
There is a generic method available on all elements to select child elements of any desired class. This method is by default applied recursively. Internally, the paragraphs(), words() and sentences() methods seen above are simply shortcuts that make use of the select method:
sentence = doc['example.p.3.s.5']
words = sentence.select(folia.Word)
for word in words:
    print word
Note that the select method is recursive by default; set the third argument to False to make it non-recursive. The second argument can be used to restrict matches to a specific set.
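To make the three arguments concrete, here is a minimal stand-alone sketch of a recursive select. The Node, Sentence and Word classes are toy stand-ins defined only for this illustration, and the argument order (class, set, recursive) is taken from the description above; the library's actual implementation may differ.

```python
class Node(object):
    """Toy element (NOT a pynlpl class) with children and an optional set."""

    def __init__(self, children=(), set=None):
        self.children = list(children)
        self.set = set

    def select(self, cls, set=None, recursive=True):
        """Yield matching child elements in document order."""
        for child in self.children:
            if isinstance(child, cls) and (set is None or child.set == set):
                yield child
            if recursive:
                # Descend into the child and pass the same restrictions along.
                for match in child.select(cls, set, recursive):
                    yield match


class Sentence(Node):
    pass


class Word(Node):
    pass


w1, w2, w3 = Word(), Word(), Word()
tree = Node([Sentence([w1, w2]), w3])

all_words = list(tree.select(Word))                 # recursive search
direct_words = list(tree.select(Word, None, False)) # direct children only
```

In this toy tree the recursive call finds all three words, while the non-recursive call finds only the one word that is a direct child of the root.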
As you know, the FoLiA paradigm introduces sets, classes, annotators with annotator types, and confidence values. These attributes are easily accessible on any element that has them:
- element.id (string)
- element.set (string)
- element.cls (string) Since class is already a reserved keyword in python, the library consistently uses cls
- element.annotator (string)
- element.annotatortype (set to folia.AnnotatorType.MANUAL or folia.AnnotatorType.AUTO)
- element.confidence (float)
Attributes that are not available for certain elements, or not set, default to None.
FoLiA is, of course, a format for linguistic annotation, so let’s look at how to obtain annotations. This is done using annotations() or annotation(), which behave much like the select method, except that they raise an exception when no such annotation is found. The difference between the two is that annotation() returns a single annotation and raises an exception if there are several between which it can’t disambiguate:
for word in doc.words():
    try:
        pos = word.annotation(folia.PosAnnotation, 'CGN')
        lemma = word.annotation(folia.LemmaAnnotation)
        print "Word: ", word
        print "ID: ", word.id
        print "PoS-tag: " , pos.cls
        print "PoS Annotator: ", pos.annotator
        print "Lemma-tag: " , lemma.cls
    except folia.NoSuchAnnotation:
        print "No PoS or Lemma annotation"
Note that the second argument of annotation(), annotations() or select() can be used to restrict your selection to a certain set. In the above example we restrict ourselves to Part-of-Speech tags in the CGN set.
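The behaviour described above (raise when nothing is found, raise when a single annotation is requested but several match, optionally restrict by set) can be sketched with toy classes. Everything below is a hypothetical stand-in, not the library's code; in particular, the exception raised for the ambiguous case is an arbitrary choice made for this sketch.

```python
class NoSuchAnnotation(Exception):
    """Toy counterpart of the exception named in the text."""


class ToyAnnotation(object):
    def __init__(self, cls, set=None):
        self.cls = cls
        self.set = set


class PosAnnotation(ToyAnnotation):
    pass


class LemmaAnnotation(ToyAnnotation):
    pass


class ToyWord(object):
    def __init__(self, *annotations):
        self._annotations = list(annotations)

    def annotations(self, annotation_type, set=None):
        """Return all matching annotations, or raise if there are none."""
        found = [a for a in self._annotations
                 if isinstance(a, annotation_type)
                 and (set is None or a.set == set)]
        if not found:
            raise NoSuchAnnotation(annotation_type.__name__)
        return found

    def annotation(self, annotation_type, set=None):
        """Return exactly one matching annotation."""
        found = self.annotations(annotation_type, set)
        if len(found) > 1:
            # Assumption: ValueError is just this toy's choice of exception.
            raise ValueError("more than one matching annotation")
        return found[0]


def cls_or_default(word, annotation_type, set=None, default=None):
    """Hypothetical helper: the annotation's class, or a default if absent."""
    try:
        return word.annotation(annotation_type, set).cls
    except NoSuchAnnotation:
        return default


toy_word = ToyWord(PosAnnotation('N', set='CGN'),
                   PosAnnotation('n', set='brown-tagset'))
cgn_tag = cls_or_default(toy_word, PosAnnotation, 'CGN')
lemma_tag = cls_or_default(toy_word, LemmaAnnotation, default='?')
```

Restricting by set is what makes the first lookup unambiguous here: without it, the toy word carries two Part-of-Speech annotations and annotation() would refuse to pick one.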
Creating a new FoLiA document, rather than loading an existing one from file, can be done by explicitly providing an ID for the new document in the constructor:
doc = folia.Document(id='example')
Assuming we begin with an empty document, we should first add a Text element. Then we can append paragraphs, sentences, or other structural elements. The append() method is always used to append new children to an element:
text = doc.append(folia.Text)
paragraph = text.append(folia.Paragraph)
sentence = paragraph.append(folia.Sentence)
sentence.append(folia.Word, 'This')
sentence.append(folia.Word, 'is')
sentence.append(folia.Word, 'a')
sentence.append(folia.Word, 'test')
sentence.append(folia.Word, '.')
Adding annotations, or any elements for that matter, is done using the append method. Let’s build on the previous example:
#First we grab the fourth word, 'test', from the sentence
word = sentence.words(3)
#Add Part-of-Speech tag
word.append(folia.PosAnnotation, set='brown-tagset',cls='n')
#Add lemma
word.append(folia.LemmaAnnotation, cls='test')
Note that in the above examples, the append() method takes a class as its first argument, followed by keyword arguments that are passed to the class’s constructor.
A second way of using append() is by simply passing a child element and constructing it prior to appending. The following is equivalent to the above example:
#First we grab the fourth word, 'test', from the sentence
word = sentence.words(3)
#Add Part-of-Speech tag
word.append( folia.PosAnnotation(doc, set='brown-tagset',cls='n') )
#Add lemma
word.append( folia.LemmaAnnotation(doc , cls='test') )
The append method always returns that which was appended.
In the above example we first instantiate a PosAnnotation and a LemmaAnnotation. Instantiation of any element follows this pattern:
Class(document, *children, **kwargs)
The common attributes are set using equally named keyword arguments:
- id=
- cls=
- set=
- annotator=
- annotatortype=
- confidence=
Not all attributes are allowed for all elements, and certain attributes are required for certain elements. ValueError exceptions will be raised when these constraints are not met.
Instead of setting id, you can also set the keyword argument generate_id_in and pass it another element; an ID will then be generated automatically, based on the ID of the element passed. When you use the first method of appending, instantiation with generate_id_in takes place automatically behind the scenes when applicable and when id is not explicitly set.
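As a rough picture of what generating an ID "based on the ID of the element passed" could look like, here is a stand-alone sketch. The function generate_id and the scheme parent ID + tag + counter are assumptions inferred from IDs such as example.p.3.s.5.w.1 used in this tutorial; they are not taken from the library's source.

```python
def generate_id(parent_id, tag, existing_ids):
    """Hypothetical helper: return the first unused '<parent>.<tag>.<n>' ID.

    existing_ids is a set of IDs already in use; the new ID is added to it.
    """
    n = 1
    while '%s.%s.%d' % (parent_id, tag, n) in existing_ids:
        n += 1
    new_id = '%s.%s.%d' % (parent_id, tag, n)
    existing_ids.add(new_id)
    return new_id


used_ids = set()
first_word_id = generate_id('example.p.1.s.1', 'w', used_ids)
second_word_id = generate_id('example.p.1.s.1', 'w', used_ids)
print(first_word_id)
print(second_word_id)
```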
Any extra non-keyword arguments should be FoLiA elements and will be appended as the contents of the element, i.e. the children or subelements. Instead of using non-keyword arguments, you can also use the keyword argument contents and pass a list. This is a shortcut made merely for convenience, as Python obliges all non-keyword arguments to come before the keyword arguments, which is often aesthetically unpleasing for our purposes. An example of this use case is shown in the next section.
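The constructor pattern Class(document, *children, **kwargs) and the list-valued keyword shortcut can be mimicked in a few lines. ToyElement below is a hypothetical stand-in rather than a FoLiA class, and the keyword is spelled contents as in the span annotation example; the sketch only shows why both call styles yield the same children.

```python
class ToyElement(object):
    """Toy element (NOT a pynlpl class) following Class(document, *children, **kwargs)."""

    def __init__(self, doc, *children, **kwargs):
        self.doc = doc
        self.cls = kwargs.pop('cls', None)
        # Children may arrive positionally, through the list keyword, or both.
        self.children = list(children) + list(kwargs.pop('contents', []))
        if kwargs:
            # Mirror the idea that unsupported attributes are rejected.
            raise ValueError("unexpected arguments: %s" % sorted(kwargs))


positional = ToyElement(None, 'det', 'n', cls='np')
with_list = ToyElement(None, cls='np', contents=['det', 'n'])
same_children = positional.children == with_list.children
```

The second form lets the class appear before a long, nested list of children, which tends to read better when building deep structures.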
Adding span annotation is easy with the FoLiA library, notwithstanding the fact that there’s more to it than adding token annotation.
As you know, span annotation uses stand-off annotation embedded in annotation layers. These layers are in turn embedded at the sentence level. In the following example we first create a sentence and then add a syntax parse:
sentence = text.append(folia.Sentence)
sentence.append(folia.Word, 'The', id='example.s.1.w.1')
sentence.append(folia.Word, 'boy', id='example.s.1.w.2')
sentence.append(folia.Word, 'pets', id='example.s.1.w.3')
sentence.append(folia.Word, 'the', id='example.s.1.w.4')
sentence.append(folia.Word, 'cat', id='example.s.1.w.5')
sentence.append(folia.Word, '.', id='example.s.1.w.6')
#Adding Syntax Layer
layer = sentence.append(folia.SyntaxLayer)
#Adding Syntactic Units
layer.append(
    folia.SyntacticUnit(doc, cls='s', contents=[
        folia.SyntacticUnit(doc, cls='np', contents=[
            folia.SyntacticUnit(doc, doc['example.s.1.w.1'], cls='det'),
            folia.SyntacticUnit(doc, doc['example.s.1.w.2'], cls='n'),
        ]),
        folia.SyntacticUnit(doc, cls='vp', contents=[
            folia.SyntacticUnit(doc, doc['example.s.1.w.3'], cls='v'),
            folia.SyntacticUnit(doc, cls='np', contents=[
                folia.SyntacticUnit(doc, doc['example.s.1.w.4'], cls='det'),
                folia.SyntacticUnit(doc, doc['example.s.1.w.5'], cls='n'),
            ]),
        ]),
        folia.SyntacticUnit(doc, doc['example.s.1.w.6'], cls='fin')
    ])
)
To make references to the words, we simply pass the word instances and use the document’s index to obtain them. Note also that passing a list using the keyword argument contents is wholly equivalent to passing the non-keyword arguments separately.
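The stand-off idea, where span elements refer to existing word instances rather than containing copies of them, can be illustrated without the library. ToyWord and ToyUnit are hypothetical classes invented for this sketch; the method words() on a unit is likewise an assumption of the toy, not a documented FoLiA call.

```python
class ToyWord(object):
    def __init__(self, id, text):
        self.id = id
        self.text = text


class ToyUnit(object):
    """Toy span element: holds references to words and/or nested units."""

    def __init__(self, cls, *parts):
        self.cls = cls
        self.parts = list(parts)

    def words(self):
        """Return the referenced words, in order, including nested units."""
        result = []
        for part in self.parts:
            if isinstance(part, ToyUnit):
                result.extend(part.words())
            else:
                result.append(part)
        return result


tokens = ['The', 'boy', 'pets', 'the', 'cat', '.']
toy_words = [ToyWord('example.s.1.w.%d' % (i + 1), t)
             for i, t in enumerate(tokens)]

subject = ToyUnit('np', ToyUnit('det', toy_words[0]), ToyUnit('n', toy_words[1]))
predicate = ToyUnit('vp',
                    ToyUnit('v', toy_words[2]),
                    ToyUnit('np', ToyUnit('det', toy_words[3]),
                            ToyUnit('n', toy_words[4])))
parse = ToyUnit('s', subject, predicate, ToyUnit('fin', toy_words[5]))

predicate_text = [w.text for w in predicate.words()]
```

Because the units hold references, the word objects reached through the parse are the very same objects that sit in the sentence; nothing is duplicated.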
Tests whether a new element of this class can be added to the parent. Returns a boolean, or raises a ValueError exception unless exceptions are set to be ignored.
This will use OCCURRENCES, but may be overridden for more customised behaviour.
This method is mostly for internal use.
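An occurrence-based check of this kind might look roughly as follows. This is a sketch under stated assumptions: the class ToyLemma, the parameter name raise_exceptions, and the convention that 0 means "unlimited" are all invented for the illustration and are not confirmed by the text above.

```python
class ToyLemma(object):
    """Toy annotation class (NOT a pynlpl class)."""

    # Assumption: maximum number of instances per parent; 0 would mean unlimited.
    OCCURRENCES = 1

    @classmethod
    def addable(cls, parent_children, raise_exceptions=True):
        """Check whether one more instance of cls fits among parent_children."""
        count = sum(1 for child in parent_children if isinstance(child, cls))
        if cls.OCCURRENCES > 0 and count >= cls.OCCURRENCES:
            if raise_exceptions:
                raise ValueError("%s may occur at most %d time(s)"
                                 % (cls.__name__, cls.OCCURRENCES))
            return False
        return True


can_add_first = ToyLemma.addable([])
can_add_second = ToyLemma.addable([ToyLemma()], raise_exceptions=False)
```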
Append a child element. Returns the added element.
If an instance is passed as first argument, it will be appended. If a class derived from AbstractElement is passed as first argument, an instance will first be created and then appended.
Generic example, passing a pre-generated instance:
word.append( folia.LemmaAnnotation(doc, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL ) )
Generic example, passing a class to be generated:
word.append( folia.LemmaAnnotation, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL )
Generic example, setting text with a class:
word.append( "house", cls='original' )
Obtain the feature value of the specific subset.
Example:
sense = word.annotation(folia.Sense)
synset = sense.feat('synset')
Insert a child element at the specified index. Returns the added element.
If an instance is passed as first argument, it will be inserted. If a class derived from AbstractElement is passed as first argument, an instance will first be created and then inserted.
Generic example, passing a pre-generated instance:
word.insert( 3, folia.LemmaAnnotation(doc, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL ) )
Generic example, passing a class to be generated:
word.insert( 3, folia.LemmaAnnotation, cls="house", annotator="proycon", annotatortype=folia.AnnotatorType.MANUAL )
Generic example, setting text of a specific correctionlevel:
word.insert( 3, "house", corrected=folia.TextCorrectionLevel.CORRECTED )
Internal class method used for turning an XML element into an instance of the Class.
This method will be called after an element is added to another. It can do extra checks and if necessary raise exceptions to prevent addition. By default makes sure the right document is associated.
This method is mostly for internal use.
Appends a child element like append(), but replaces any existing child element of the same type and set. If no such child element exists, this will act the same as append().
See append() for more information.
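The replace-or-append behaviour can be sketched with toy classes. ToyHolder, ToyLemma and ToyPos are hypothetical stand-ins, and matching on exact type plus set is this sketch's reading of "same type and set"; the library's own matching rules may be more refined.

```python
class ToyAnnotation(object):
    def __init__(self, cls, set=None):
        self.cls = cls
        self.set = set


class ToyLemma(ToyAnnotation):
    pass


class ToyPos(ToyAnnotation):
    pass


class ToyHolder(object):
    """Toy parent element (NOT a pynlpl class)."""

    def __init__(self):
        self.children = []

    def append(self, child):
        self.children.append(child)
        return child

    def replace(self, child):
        """Swap out an existing child of the same type and set, else append."""
        for i, existing in enumerate(self.children):
            if type(existing) is type(child) and existing.set == child.set:
                self.children[i] = child
                return child
        return self.append(child)


holder = ToyHolder()
holder.append(ToyLemma('hous'))
holder.append(ToyPos('n', set='brown-tagset'))

new_lemma = holder.replace(ToyLemma('house'))     # replaces the old lemma
other_pos = holder.replace(ToyPos('N', set='CGN'))  # different set: appended
```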
Select child elements of the specified class.
A further restriction can be made based on set. Whether or not to apply recursively (by default enabled) can also be configured, optionally with a list of elements never to recurse into.
Example:
text.select(folia.Sense, 'cornetto', True, [folia.Original, folia.Suggestion, folia.Alternative] )
Get the text content explicitly associated with this element (of the specified class). Raises a NoSuchText exception if not found.
Unlike text(), this method does not recurse into child elements (with the sole exception of the Correction/New element), and it returns the TextContent instance rather than the actual text.
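The contrast between the two methods can be shown with a stand-alone toy. The classes below are hypothetical stand-ins; joining child texts with a single space is a simplification chosen for the sketch and says nothing about how the library actually assembles text, and the Correction/New exception is left out.

```python
class NoSuchText(Exception):
    """Toy counterpart of the exception named in the text."""


class ToyTextContent(object):
    def __init__(self, value):
        self.value = value


class ToyElement(object):
    def __init__(self, children=(), text=None):
        self.children = list(children)
        self._textcontent = None if text is None else ToyTextContent(text)

    def textcontent(self):
        """Non-recursive: only text attached directly to this element."""
        if self._textcontent is None:
            raise NoSuchText()
        return self._textcontent

    def text(self):
        """Recursive: fall back on the children when there is no own text."""
        if self._textcontent is not None:
            return self._textcontent.value
        parts = [child.text() for child in self.children]
        if not parts:
            raise NoSuchText()
        return ' '.join(parts)


toy_sentence = ToyElement([ToyElement(text='Hello'), ToyElement(text='world')])
sentence_text = toy_sentence.text()

try:
    toy_sentence.textcontent()
    has_own_text = True
except NoSuchText:
    has_own_text = False
```

The toy sentence has no text of its own, so text() succeeds by descending into the words while textcontent() raises.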
Abstract element, all span annotation elements are derived from this class
Returns a list of Paragraph elements found (recursively) under this element.
Returns a list of Sentence elements found (recursively) under this element.
Returns a list of Word elements found (recursively) under this element.
Abstract element, all token annotation elements are derived from this class
Elements that allow token annotation (including extended annotation) must inherit from this class
Obtain a list of alternatives, either all or only of a specific annotation type, and possibly restrained also by set.
Obtain annotations. Very similar to select() but raises an error if the annotation was not found.
Processes a corpus of FoLiA documents using parallel processing. Calls a user-defined function with the three-tuple (filename, args, kwargs) for each file in the corpus. The user-defined function is itself responsible for instantiating a FoLiA document. The args and kwargs received by the custom function are set through the run() method, which yields the result of the custom function on each iteration.
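The calling convention described here, a callback receiving (filename, args, kwargs) and a run() that yields each result, can be imitated with the standard library alone. The function run_over_files below is a hypothetical stand-in built on a thread pool; it is not the library's processor class, and the file names are made up.

```python
from multiprocessing.pool import ThreadPool


def run_over_files(function, filenames, *args, **kwargs):
    """Yield function((filename, args, kwargs)) for each file, in input order."""
    pool = ThreadPool(2)
    try:
        jobs = [(filename, args, kwargs) for filename in filenames]
        for result in pool.imap(function, jobs):
            yield result
    finally:
        pool.close()
        pool.join()


def describe(job):
    """Toy callback: unpack the three-tuple and report on the file name."""
    filename, args, kwargs = job
    # A real callback would load the document itself at this point, for
    # instance with folia.Document(file=filename); this toy only looks at
    # the name so that it runs without any files on disk.
    return (filename, kwargs.get('label'))


results = list(run_over_files(describe, ['a.xml', 'b.xml'], label='demo'))
```

Leaving document loading to the callback keeps the driver generic: it only hands out file names and collects whatever the callback returns.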
Description is an element that can be used to associate a description with almost any other FoLiA element
Structure element representing some kind of division. Divisions may be nested at will, and may include almost all kinds of other structure elements.
This is the FoLiA Document; all elements have to be associated with a FoLiA document. Besides holding elements, the document holds metadata, including declarations, and an index of all IDs.
Add a text to the document:
Example 1:
doc.append(folia.Text)
Return a list of all paragraphs found in the document.
If an index is specified, return the n’th paragraph only (starting at 0)
Save the document to FoLiA XML.
Return a list of all sentences found in the document, except for sentences in quotes.
If an index is specified, return the n’th sentence only (starting at 0)
Return a list of all active words found in the document. Does not descend into annotation layers, alternatives, originals, suggestions.
If an index is specified, return the n’th word only (starting at 0)
Feature elements can be used to associate subsets and subclasses with almost any annotation element
Element for the representation of a graphical figure. Structure element.
Gap element. Represents skipped portions of the text. Contains Content and Desc elements
Quote: a structure element. For quotes/citations. May hold words or sentences.
Sentence element. A structure element. Represents a sentence and holds all its words (and possibly other structure such as LineBreaks, Whitespace and Quotes)
Text content element (t), holds text to be associated with whatever element the text content element is a child of.
Text content elements have an associated correction level, indicating whether the text they hold is in a pre-corrected or post-corrected state. There can be only one of each level. Text content elements on structure elements like Paragraph and Sentence are by definition untokenised. They are by definition tokenised only at Word level and deeper.
Find the default reference for text offsets: The parent of the current textcontent’s parent (counting only Structure Elements and Subtoken Annotation Elements)
Note: This returns not a TextContent element, but its parent. Whether the textcontent actually exists is checked later/elsewhere
Word (aka token) element. Holds a word/token and all its related token annotations.
Word reference. Used to refer to words from span annotation elements. The Python class is only used when a word reference cannot be resolved; if it can be resolved, Word objects are used instead.