FoLiA: Format for Linguistic Annotation
FoLiA is an XML-based format for Linguistic
Annotation suitable for representing written language resources such
as corpora. Its goal is to unify a variety of linguistic annotations
in one single rich format, without committing to any particular
standard annotation set. Instead, it seeks to accommodate any desired
system or tagset, and so offer maximum flexibility. This makes FoLiA
language independent. Thanks to its generalised setup, the format is
easy to extend to suit your custom needs for linguistic
annotation.
XML is an inherently hierarchical format. FoLiA takes advantage of this
by using a hierarchical, inline setup. It inherits from
the D-Coi format, which describes itself as loosely based on a minimal
subset of TEI. Because FoLiA introduces a broader
paradigm, it is not
backwards-compatible with D-Coi: validators for D-Coi will not
accept FoLiA XML. It is, however, easy to convert FoLiA to less complex
or less verbose formats such as D-Coi or plain text. Converters
will be provided. Conversion may entail some loss of information if the
simpler format has no provisions for particular types of information
specified in the FoLiA format.
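To illustrate what converting an inline, hierarchical annotation format to plain text involves, here is a minimal Python sketch. The element names (`doc`, `p`, `s`, `w`, `t`, `pos`) and the sample document are simplified assumptions made for this example; they are not the actual FoLiA schema, so consult the FoLiA documentation for the real element names and namespace. The sketch keeps only the word text and drops the part-of-speech annotations, which is the kind of information loss a simpler target format implies.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified inline-annotated document (NOT real FoLiA XML).
# Each word element carries its text and its annotations as children.
SAMPLE = """\
<doc>
  <p id="p1">
    <s id="p1.s1">
      <w id="p1.s1.w1"><t>FoLiA</t><pos class="N"/></w>
      <w id="p1.s1.w2"><t>is</t><pos class="V"/></w>
      <w id="p1.s1.w3"><t>inline</t><pos class="ADJ"/></w>
    </s>
    <s id="p1.s2">
      <w id="p1.s2.w1"><t>Annotations</t><pos class="N"/></w>
      <w id="p1.s2.w2"><t>nest</t><pos class="V"/></w>
    </s>
  </p>
</doc>
"""


def to_plain_text(xml_string):
    """Return one line per sentence, keeping word text only.

    All other annotations (here: <pos>) are discarded, so the
    conversion is lossy.
    """
    root = ET.fromstring(xml_string)
    lines = []
    for sentence in root.iter("s"):
        words = [w.findtext("t", default="") for w in sentence.iter("w")]
        lines.append(" ".join(words))
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_plain_text(SAMPLE))
```

Because the annotations live inside the elements they describe, a converter of this kind only has to walk the tree and pick out the layers the target format can represent.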
Notable features are:
- XML-based, UTF-8 encoded
- Language and tagset independent
- Can encode both tokenised and untokenised text, with partial reconstructability of the untokenised form even after tokenisation
- Generalised paradigm, extensible and flexible
- Provenance support for all linguistic annotations: annotator, type (automatic or manual), time.
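Two of the features above, partial reconstruction of untokenised text and provenance on annotations, can be sketched with a small data model. This is an illustrative Python sketch only: the class and field names (`Provenance`, `Token`, `space_after`, `annotator_type`) are assumptions made for this example and do not describe FoLiA's actual XML attributes or any library API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Provenance:
    """Who produced an annotation, how, and when (hypothetical model)."""
    annotator: str
    annotator_type: str          # "auto" or "manual"
    time: Optional[str] = None   # e.g. an ISO 8601 timestamp


@dataclass
class Token:
    """A token plus a flag recording whether a space followed it."""
    text: str
    space_after: bool = True
    pos: Optional[str] = None
    pos_provenance: Optional[Provenance] = None


def detokenise(tokens: List[Token]) -> str:
    """Rebuild the text from tokens using the recorded spacing flags."""
    parts = []
    for i, token in enumerate(tokens):
        parts.append(token.text)
        # Insert a space only between tokens, never after the last one.
        if token.space_after and i < len(tokens) - 1:
            parts.append(" ")
    return "".join(parts)


tagger = Provenance(annotator="example-tagger", annotator_type="auto")
tokens = [
    Token("Hello", space_after=False, pos="INTJ", pos_provenance=tagger),
    Token(",", pos="PUNCT", pos_provenance=tagger),
    Token("world", space_after=False, pos="N", pos_provenance=tagger),
    Token("!", pos="PUNCT", pos_provenance=tagger),
]

if __name__ == "__main__":
    print(detokenise(tokens))
```

The reconstruction is only partial in this sketch because a single boolean cannot distinguish one space from several, or a space from a line break.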
FoLiA was written by Maarten van
Gompel.
The FoLiA poster presented at CLIN 21:
Resources
The documentation, validation schema and other resources for the latest FoLiA version can be found below. Consult the FoLiA GitHub repository for all available resources.
Two software libraries are available for working with the FoLiA format from within your own scripts and applications. Make sure to check back regularly for updates, as both are still being actively developed.
- pynlpl.formats.folia (Python library) - by Maarten van Gompel, distributed as part of PyNLPl
- libfolia (C++ library) - by Ko van der Sloot
  - No documentation available yet
  - Note: libfolia is a dependency of both Frog and Ucto; you will need it to compile them.
FoLiA is currently being integrated into NLP software developed
at the ILK Research
Group: Ucto, a generic
tokenizer, and Frog, a
Dutch morpho-syntactic processor.
For additional support and announcements please join the