Ucto: Unicode Tokenizer
Ucto tokenizes text files: it
separates words from punctuation and splits sentences. It also offers
several other basic preprocessing steps, such as changing case, all of
which you can use to make your text suitable for further processing such as
indexing, part-of-speech tagging, or machine translation.
Ucto comes with tokenization rules for several languages and
can be easily extended to suit other languages. It is
used to tokenize Dutch text
in Frog, our Dutch
morpho-syntactic processor.
Features
- Comes with tokenization rules for English, Dutch, French, Italian, and Swedish; easily extendible to other languages.
- Recognizes dates, times, units, currencies, abbreviations.
- Recognizes paired quote spans, sentences, and paragraphs.
- Produces UTF-8 output with NFC normalization; optionally accepts other encodings as input.
- Optional conversion to all lowercase or uppercase.
- Optionally produces FoLiA XML.
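To give a rough idea of what NFC normalization, case conversion, and separating words from punctuation mean in practice, here is a minimal Python sketch. It is not Ucto's implementation and does not use Ucto's rule files or API: the helper names `normalize` and `toy_tokenize` and the regular expression are assumptions made for illustration only, and the regex is far cruder than Ucto's language-specific rules (it knows nothing about abbreviations, dates, or quotes).

```python
import re
import unicodedata

# Hypothetical, deliberately naive pattern: a run of word characters,
# or a single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def normalize(text, lowercase=False):
    """Return text in Unicode NFC form, optionally converted to lowercase."""
    text = unicodedata.normalize("NFC", text)
    return text.lower() if lowercase else text


def toy_tokenize(text):
    """Split words from punctuation using the naive pattern above."""
    return TOKEN_RE.findall(text)


# 'e' followed by a combining acute accent (decomposed form).
decomposed = "Cafe\u0301 open."
clean = normalize(decomposed, lowercase=True)
tokens = toy_tokenize(clean)
print(tokens)
```

After normalization the decomposed `e` plus combining accent becomes the single precomposed character `é`, so the word is handled as one unit by later processing steps.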
Ucto is free software; you can
redistribute it and/or modify it under the terms of the GNU General Public
License as published by the Free
Software Foundation.
Ucto was written by Maarten van
Gompel with contributions from Ko van der Sloot. Work
on Ucto is funded
by NWO, the Netherlands
Organisation for Scientific Research, under
the Implicit Linguistics
project and the CLARIN-NL
program.
Documentation
Documentation for ucto is available for download here (PDF).
Download and installation
If you are using a Debian-, Ubuntu-, or Fedora-based system, consult these installation
instructions for details on how to install this software.
If you want to build the code from source yourself, download
To install, please follow these basic instructions:
Ucto depends on libfolia (a FoLiA library), ticcutils, and ICU 3.6 or higher. You may also need to install recent versions of pkg-config and the autoconf toolkit. Mac users are advised to install the latest version of Xcode and to use either Fink or MacPorts to install these packages.