Tilburg, 21 June 2010
Herman Stehouwer

Suffixarray package

Licensed under the GPLv3, see the LICENCE file.

INTRODUCTION:

This package implements an efficient suffix array in template-based C++.

Space utilisation for a corpus of length N is
N*sizeof(index) + 4N*sizeof(char) + exceptions,
where index is the type used to index the corpus (usually an unsigned int
for std::vector and std::string). Exceptions are fairly rare: most indexes
are stored as relative indexes, and the longest-common-prefix (LCP) values
are stored as characters.
In high-LCP corpora this implementation will not be very efficient.
Natural language data is stored very efficiently, which was our goal.

Building the suffix array is fairly efficient time-wise, using a
deep-shallow sorting strategy with a blind trie. This is much faster than
C++'s regular sort function, due to the nature of the data structure.

Once the suffix array is built, the library provides the following core
functionality:
- Test whether a query is an infix of the read-in corpus.
- Report how often the query occurs in the corpus.
- Report all positions at which the query occurs in the corpus.
- Do the same for skipgrams.
- Find highly compressing patterns in the data.

The query operations are answered very efficiently by using the implicit
suffix tree structure on the suffix array.

INSTALL

  ./configure
  make
  make install

HOWTO USE THE PACKAGE

Please see doc/guide.pdf.

BUGS ETC.

For bugs, remarks, suggestions etc. please email the author at
j.h.stehouwer(_AT-)uvt.nl.
All comments are welcome.
If you use the library for a software project or for research, I would love
to know your results.
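
ILLUSTRATIVE SKETCH (NOT THE PACKAGE API)

To make the three basic queries from the introduction concrete (infix test,
occurrence count, occurrence positions), here is a minimal sketch of how a
plain suffix array can answer them. It is an assumption-laden illustration
only: every function name below is hypothetical, it sorts with std::sort
rather than the deep-shallow strategy, it stores full indexes with no LCP
compression, and it does not cover skipgrams or compressing patterns. For
the real interface, see doc/guide.pdf.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: build a suffix array by plain comparison sorting.
// (The package itself uses deep-shallow sorting with a blind trie.)
std::vector<std::size_t> build_suffix_array(const std::string& corpus) {
    std::vector<std::size_t> sa(corpus.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&corpus](std::size_t a, std::size_t b) {
        // Compare the suffix starting at a with the suffix starting at b.
        return corpus.compare(a, std::string::npos,
                              corpus, b, std::string::npos) < 0;
    });
    return sa;
}

// Hypothetical helper: half-open range [first, second) of suffix array slots
// whose suffixes start with the query, found by two binary searches.
std::pair<std::size_t, std::size_t> match_range(
        const std::string& corpus,
        const std::vector<std::size_t>& sa,
        const std::string& query) {
    auto lower = std::lower_bound(sa.begin(), sa.end(), query,
        [&corpus](std::size_t pos, const std::string& q) {
            // True while the suffix, cut to |q| characters, sorts before q.
            return corpus.compare(pos, q.size(), q) < 0;
        });
    auto upper = std::upper_bound(lower, sa.end(), query,
        [&corpus](const std::string& q, std::size_t pos) {
            // True once the suffix, cut to |q| characters, sorts after q.
            return corpus.compare(pos, q.size(), q) > 0;
        });
    return std::make_pair(static_cast<std::size_t>(lower - sa.begin()),
                          static_cast<std::size_t>(upper - sa.begin()));
}

// How often does the query occur in the corpus?
std::size_t count_occurrences(const std::string& corpus,
                              const std::vector<std::size_t>& sa,
                              const std::string& query) {
    const auto range = match_range(corpus, sa, query);
    return range.second - range.first;
}

// Is the query an infix of the corpus?
bool contains(const std::string& corpus,
              const std::vector<std::size_t>& sa,
              const std::string& query) {
    return count_occurrences(corpus, sa, query) > 0;
}

// At which corpus positions does the query occur (in ascending order)?
std::vector<std::size_t> locate(const std::string& corpus,
                                const std::vector<std::size_t>& sa,
                                const std::string& query) {
    const auto range = match_range(corpus, sa, query);
    std::vector<std::size_t> positions(sa.begin() + range.first,
                                       sa.begin() + range.second);
    std::sort(positions.begin(), positions.end());
    return positions;
}
```

For the corpus "banana", build_suffix_array yields the order 5, 3, 1, 0, 4, 2,
and the query "ana" matches two suffixes, at corpus positions 1 and 3. All
matches of a query sit in one contiguous block of the sorted suffixes, which
is why counting needs only the two binary searches and no scan of the corpus.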