FoLiA: Format for Linguistic Annotation

FoLiA is an XML-based format for linguistic annotation, suitable for representing written language resources such as corpora. Its goal is to unify a variety of linguistic annotations in a single rich format without committing to any particular standard annotation set. Instead, it seeks to accommodate any desired system or tagset and so offer maximum flexibility, which makes FoLiA language-independent. Because of its generalised setup, the format is easy to extend to suit custom needs for linguistic annotation.

XML is an inherently hierarchical format, and FoLiA does justice to this by using a hierarchical, inline setup. FoLiA inherits from the D-Coi format, which is itself described as loosely based on a minimal subset of TEI. Because it introduces a broader paradigm, FoLiA is not backwards-compatible with D-Coi: validators for D-Coi will not accept FoLiA XML. It is, however, easy to convert FoLiA to less complex or less verbose formats such as D-Coi or plain text, and converters will be provided. Such a conversion may entail some loss of information if the simpler format has no provisions for particular types of information specified in FoLiA.
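To make the inline, hierarchical setup and the lossy conversion to plain text more concrete, here is a minimal Python sketch using only the standard library. The XML fragment is a simplified, hand-written illustration and has not been validated against the FoLiA schema; the namespace URI, the element names (`text`, `p`, `s`, `w`, `t`, `pos`) and the function name `to_plain_text` are assumptions made for this example, and this is not one of the converters mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified FoLiA-like fragment (NOT schema-validated).
# Annotations sit inline, nested inside the tokens they describe.
SAMPLE = """\
<FoLiA xmlns="http://ilk.uvt.nl/folia">
  <text>
    <p>
      <s>
        <w><t>Hello</t><pos class="UH"/></w>
        <w><t>world</t><pos class="NN"/></w>
      </s>
      <s>
        <w><t>Bye</t><pos class="UH"/></w>
      </s>
    </p>
  </text>
</FoLiA>
"""

# Assumed namespace for this illustration.
NS = {"f": "http://ilk.uvt.nl/folia"}


def to_plain_text(xml_string):
    """Return one line per sentence, tokens joined by single spaces.

    Everything except the token text (here: the pos annotations) is
    dropped, which illustrates the information loss of a simpler format.
    """
    root = ET.fromstring(xml_string)
    lines = []
    for sentence in root.iterfind(".//f:s", NS):
        words = [
            w.findtext("f:t", default="", namespaces=NS)
            for w in sentence.iterfind("f:w", NS)
        ]
        lines.append(" ".join(words))
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_plain_text(SAMPLE))
```

The part-of-speech annotations in the fragment do not survive the conversion, because plain text has no place to store them.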

Notable features are:

  • XML-based, UTF-8 encoded
  • Language and tagset independent
  • Can encode both tokenised and untokenised text, with partial reconstructability of the untokenised form even after tokenisation.
  • Generalised paradigm, extensible and flexible
  • Provenance support for all linguistic annotations: annotator, type (automatic or manual), time.
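As a rough illustration of the provenance feature listed above, the following Python sketch collects annotator, type and time from the annotations attached to a single token. The fragment is hand-written and not validated against the FoLiA schema; the attribute names `annotator`, `annotatortype` and `datetime`, the sample values, and the function name `provenance` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical token with two annotations, each carrying its own provenance.
# Attribute names and values are assumptions for illustration only.
FRAGMENT = """\
<w>
  <t>house</t>
  <pos class="N" annotator="tagger-x" annotatortype="auto"
       datetime="2011-02-11T10:30:00"/>
  <lemma class="house" annotator="jane" annotatortype="manual"
         datetime="2011-02-12T09:00:00"/>
</w>
"""


def provenance(word_xml):
    """Return one record per annotation that declares an annotator."""
    word = ET.fromstring(word_xml)
    records = []
    for annotation in word:
        # Skip children without provenance, such as the text content <t>.
        if "annotator" not in annotation.attrib:
            continue
        records.append({
            "annotation": annotation.tag,
            "class": annotation.get("class"),
            "annotator": annotation.get("annotator"),
            "type": annotation.get("annotatortype"),  # automatic or manual
            "time": datetime.fromisoformat(annotation.get("datetime")),
        })
    return records


if __name__ == "__main__":
    for record in provenance(FRAGMENT):
        print(record)
```

Keeping provenance on each annotation rather than on the document as a whole allows manually corrected and automatically produced annotations to coexist on the same token.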

FoLiA was written by Maarten van Gompel.

The FoLiA poster was presented at CLIN 21.

Resources

The documentation, validation schema and other resources for the latest FoLiA version can be found below. Consult the FoLiA GitHub repository for all available resources.

Two software libraries are available for working with the FoLiA format from within your own scripts and applications. Make sure to check back regularly for updates, as both are still being actively developed.

  • pynlpl.formats.folia (Python library) - by Maarten van Gompel, distributed as part of PyNLPl
  • libfolia (C++ library) - by Ko van der Sloot
    • No documentation available yet
    • Note: libfolia is a dependency for both Frog and ucto. You will need it to compile them!

FoLiA is currently being integrated into NLP software developed at the ILK Research Group: Ucto, a generic tokenizer, and Frog, a Dutch morpho-syntactic processor.

For additional support and announcements please join the