Sander Canisius Research page

python-timbl

Introduction

Memory-based learning covers a class of machine learning algorithms that simply store all training examples in memory and classify unseen test instances by extrapolating from similar training instances. Many such algorithms are based on the k-nearest neighbour rule, which classifies an unseen test instance by selecting the most frequent class among the k most similar training instances. Because they do not abstract from the training data (unlike many other well-known learning algorithms), memory-based learning algorithms preserve not only the global but also the local characteristics of the instance space, including those parts that are populated by small numbers of exceptional cases. This property suits domains in which exceptions are an important part of the task rather than something to abstract away from; natural language processing is one such domain.
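
To make the k-nearest neighbour rule concrete, here is a minimal toy sketch in plain Python. It is not TiMBL's implementation: the function names, the simple overlap distance, and the tie handling are all assumptions chosen for illustration, and TiMBL itself offers far more refined metrics and weighting schemes.

```python
from collections import Counter


def overlap_distance(a, b):
    """Count the feature positions at which two instances differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_classify(training, instance, k=1):
    """Return the most frequent class among the k most similar examples.

    `training` is a list of (features, label) pairs; all examples are
    simply kept in memory, with no abstraction step.
    """
    nearest = sorted(
        training, key=lambda example: overlap_distance(example[0], instance)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features per instance, two classes.
examples = [
    (("a", "b"), "X"),
    (("a", "c"), "X"),
    (("d", "e"), "Y"),
]
predicted = knn_classify(examples, ("a", "b"), k=3)
```

Note that a single exceptional example such as the "Y" instance above is still retrievable with k=1, which is the local behaviour the paragraph describes.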

The Tilburg Memory-Based Learner (TiMBL) implements several memory-based learning algorithms and can be used off the shelf to learn any task for which training data are available. The command-line application is the easiest way to get started using TiMBL. With it, simple text files containing feature-value descriptions of test instances can be processed in batch; the output is another text file containing the test instances, each with its predicted class label. For more flexible access to TiMBL, two other options exist.

  1. Running TiMBL in server mode

    In server mode, TiMBL listens for classification requests on a network socket. Section 6.3 of the TiMBL reference guide contains more detailed information on the server interface. The advantage of this method is that TiMBL can be accessed from any programming language. Another potentially useful consequence of running TiMBL in server mode is that several different applications can share one TiMBL instance, which may reduce memory consumption considerably when many TiMBL-based applications run in parallel.

  2. Using the C++ programming interface

    Provided with the TiMBL distribution is a library interface that can be linked with user code in order to incorporate TiMBL functionality in any application. The advantage of using this C++ interface is that all functionality is bundled in one executable, rather than requiring an extra TiMBL executable to run the server. The Memory-based tagger is an example of an application using the TiMBL C++ API.

python-timbl is a Python extension module wrapping the full TiMBL C++ programming interface. With this module, all functionality exposed through the C++ interface is also available to Python scripts. Being able to access the API from Python greatly facilitates prototyping TiMBL-based applications.

License

python-timbl is free software, distributed under the terms of the GNU General Public License with a special exception that allows linking with TiMBL (which has a GPL-incompatible license). Do note, however, that using python-timbl in your applications implies using TiMBL as well. As a result, when using python-timbl in your applications, you will also have to comply with the terms of the TiMBL license. Among other things, this license requires proper citation in publications of research that uses TiMBL.

Download

python-timbl is distributed in source-code form only. The latest release version is 2006.06.21. See the next section for guidelines on compiling it for your system.

In addition to the above file, you can also download a file with epydoc-generated API documentation for the Python module. This file is not required to use python-timbl, but the HTML documentation is not included with the source code, so you may want to download it anyway.

Installation

python-timbl depends on two external packages, which must be built or installed on your system before python-timbl itself can be built. The first is TiMBL itself; download its tarball from TiMBL's homepage and follow the installation instructions. In the remainder of this section, it is assumed that $TIMBL_ROOT points to the directory in which TiMBL was built. This directory contains (among others) libTimbl.a and TimblAPI.h.

The second prerequisite is Boost.Python, a library that facilitates writing Python extension modules in C++. Many Linux distributions come with prebuilt packages of Boost.Python. If so, install this package; if not, refer to the Boost installation instructions to build and install Boost.Python manually. In the remainder of this section, let $BOOST_HEADERS refer to the directory that contains the Boost header files, and $BOOST_LIBS to the directory that contains the Boost library files. If you installed Boost.Python with your distribution's package manager, these directories are probably /usr/include and /usr/lib respectively.

If both prerequisites have been installed on your system, python-timbl can be built and installed with the following command.

python setup.py \
       build_ext --boost-include-dir=$BOOST_HEADERS \
                 --boost-library-dir=$BOOST_LIBS \
                 --timbl-include-dir=$TIMBL_ROOT \
                 --timbl-library-dir=$TIMBL_ROOT \
       install --prefix=/dir/to/install/in

The --prefix option to the install command denotes the directory in which the module is to be installed. If you have the appropriate system permissions, you can leave out this option. The module will then be installed in the Python system tree. Otherwise, make sure that the installation directory is in the module search path of your Python system.
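
As a quick check that the installation directory is on the module search path, something like the following sketch may help. It assumes a Python 3 interpreter and that the extension is importable under the name timbl, as in the examples further down this page; the helper module_is_importable and the directory in the trailing comment are hypothetical illustrations, not part of python-timbl.

```python
import importlib
import importlib.util
import sys


def module_is_importable(name, extra_dir=None):
    """Return True if `name` can be found on the module search path.

    If `extra_dir` is given, it is first added to sys.path, which mimics
    pointing Python at a non-default installation directory.
    """
    if extra_dir is not None and extra_dir not in sys.path:
        sys.path.insert(0, extra_dir)
    importlib.invalidate_caches()
    return importlib.util.find_spec(name) is not None


# Hypothetical usage; the exact subdirectory below the prefix depends on
# your Python version and platform:
# module_is_importable("timbl", "/dir/to/install/in/lib/pythonX.Y/site-packages")
```

Setting the PYTHONPATH environment variable to the same directory has a similar effect without any code changes.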

Usage

There are several different places to look for documentation on python-timbl usage. Probably the most complete is the TiMBL API guide. Although this document actually describes the C++ interface to TiMBL, the latter is similar enough to its Python binding for this document to be a useful reference for python-timbl as well. For an overview of the differences between the C++ API and its Python binding, read the subsection on that topic below.

A smaller, but more Python-oriented source of documentation is provided by the module's docstrings. A simple help(timbl) after importing the module shows all its classes and methods with a brief description for each. Alternatively, an epydoc-generated HTML version of this same documentation is available for download.

Differences with respect to the C++ API

For the most part, the Python TiMBL interface follows the C++ version closely. The differences are listed below.

  • Naming style

    In the C++ interface, method names are in UpperCamelCase; for example, Classify, SetOptions, etc. In contrast, the Python interface uses lowerCamelCase: classify, setOptions, etc.

  • Method overloading

    TiMBL's Classify methods use the C++ method overloading feature to provide three different kinds of output. Python has no method overloading, however; python-timbl therefore has three differently named methods to mirror the functionality of the overloaded Classify method. The mapping is as follows.

    #
    # bool TimblAPI::Classify(const std::string& Line,
    #                         std::string& result);
    #
    def TimblAPI.classify(line) -> bool, result
    
    #
    # bool TimblAPI::Classify(const std::string& Line,
    #                         std::string& result,
    #                         double& distance);
    #
    def TimblAPI.classify2(line) -> bool, result, distance
    
    #
    # bool TimblAPI::Classify(const std::string& Line,
    #                         std::string& result,
    #                         std::string& Distrib,
    #                         double& distance);
    #
    def TimblAPI.classify3(line) -> bool, result, Distrib, distance
    
  • Python-only methods

    Three TiMBL API methods print information to a standard C++ output stream object (ShowBestNeighbors, ShowOptions, ShowSettings). In the Python interface, these methods will only work with Python (stream) objects that have a fileno method returning a valid file descriptor. Alternatively, three new methods are provided (bestNeighbo(u)rs, options, settings); these methods return the same information as a Python string object.
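
The fileno requirement mentioned in the last item can be checked before passing a stream to one of the Show* methods. The sketch below uses only the Python 3 standard library; the helper name has_file_descriptor is an assumption made for illustration and is not part of python-timbl.

```python
import io


def has_file_descriptor(stream):
    """Return True if `stream` has a fileno method yielding a valid descriptor."""
    try:
        return stream.fileno() >= 0
    except (AttributeError, io.UnsupportedOperation, OSError, ValueError):
        # No fileno method at all, or an in-memory or closed stream.
        return False


# In-memory text buffers have no underlying file descriptor.
in_memory = io.StringIO()
usable = has_file_descriptor(in_memory)
```

For streams where this check fails, such as in-memory buffers, the string-returning methods listed above are the more natural choice.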

Getting started

The following examples make use of a data set for Dutch diminutive suffix prediction. The files for this data set (dimin.*) are included in the TiMBL distribution and can be found in the demos/ subdirectory. Let's start with creating a TiMBL instance and training it on the examples in the dimin.data file.

>>> import timbl
>>> timblApi = timbl.TimblAPI("-mM -k5 -w3", "")
>>> timblApi.learn("dimin.data")
Examine datafile 'dimin.data' gave the following results:
Number of Features: 12
InputFormat       : C4.5

Phase 1: Reading Datafile: dimin.data
Start:          0 @ Mon Jun 19 14:55:03 2006
Finished:    3949 @ Mon Jun 19 14:55:03 2006
Calculating Entropy         Mon Jun 19 14:55:03 2006
Feature Permutation based on GainRatio/Values :
< 9, 5, 11, 1, 12, 7, 4, 3, 10, 8, 2, 6 >
Phase 2: Learning from Datafile: dimin.data
Start:          0 @ Mon Jun 19 14:55:03 2006
Finished:    3949 @ Mon Jun 19 14:55:03 2006

Size of InstanceBase = 24232 Nodes, (484640 bytes), 51.61 % compression

True

Given a trained TiMBL instance, a test file can be processed, writing the output to another file, just as the TiMBL command-line application would do.

>>> timblApi.test("dimin.test", "test.out", "")
Examine datafile 'dimin.test' gave the following results:
Number of Features: 12
InputFormat       : C4.5


Starting to test, Testfile: dimin.test
Writing output in:          test.out
Algorithm     : IB1
Global metric : Value Difference, Prestored matrix
Deviant Feature Metrics:(none)
Size of value-matrix[1] = 60 Bytes
Size of value-matrix[2] = 672 Bytes
Size of value-matrix[3] = 780 Bytes
Size of value-matrix[4] = 96 Bytes
Size of value-matrix[5] = 60 Bytes
Size of value-matrix[6] = 1932 Bytes
Size of value-matrix[7] = 1292 Bytes
Size of value-matrix[8] = 396 Bytes
Size of value-matrix[9] = 32 Bytes
Size of value-matrix[10] = 2496 Bytes
Size of value-matrix[11] = 1152 Bytes
Size of value-matrix[12] = 1020 Bytes
Total Size of value-matrices 9988 Bytes

Weighting     : Chi-square

Tested:      1 @ Mon Jun 19 14:57:47 2006
Tested:      2 @ Mon Jun 19 14:57:47 2006
Tested:      3 @ Mon Jun 19 14:57:47 2006
Tested:      4 @ Mon Jun 19 14:57:47 2006
Tested:      5 @ Mon Jun 19 14:57:47 2006
Tested:      6 @ Mon Jun 19 14:57:47 2006
Tested:      7 @ Mon Jun 19 14:57:47 2006
Tested:      8 @ Mon Jun 19 14:57:47 2006
Tested:      9 @ Mon Jun 19 14:57:47 2006
Tested:     10 @ Mon Jun 19 14:57:47 2006
Tested:    100 @ Mon Jun 19 14:57:47 2006
Ready:     950 @ Mon Jun 19 14:57:47 2006
Seconds taken: 1 (950.00 p/s)
overall accuracy:        0.977895  (929/950), of which 950 exact matches

There were 5 ties of which 2 (40.00%) were correctly resolved
True
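
The output file can be post-processed from Python as well. The following sketch recomputes accuracy under the assumption that each line of test.out is a comma-separated instance whose last two fields are the gold class and the predicted class, in that order. That layout is an assumption about the default output for C4.5-style input (verbosity options may append further columns), and accuracy_from_lines is a hypothetical helper, not part of python-timbl.

```python
def accuracy_from_lines(lines):
    """Return the fraction of lines whose last two fields are equal.

    Assumes comma-separated fields with the gold class second to last
    and the predicted class last.
    """
    total = 0
    correct = 0
    for line in lines:
        fields = [field.strip() for field in line.strip().split(",")]
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        total += 1
        if fields[-2] == fields[-1]:
            correct += 1
    return correct / total if total else 0.0


# Hypothetical usage with the file written by timblApi.test:
# with open("test.out") as handle:
#     print(accuracy_from_lines(handle))
```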

So far, nothing has been done that could not have been done using the TiMBL command-line application. The most important advantage of directly accessing the TiMBL API is the ability to classify individual instances. This can be done using one of the three classify* methods.

>>> timblApi.classify("=,=,=,=,=,=,=,=,+,p,e,=,?")
(True, 'T')
>>> timblApi.classify2("=,=,=,=,=,=,=,=,+,p,e,=,?")
(True, 'T', 0.0)
>>> timblApi.classify3("=,=,=,=,=,=,=,=,+,p,e,=,?")
(True, 'T', '{ T 6.00000 }', 0.0)
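
The class distribution returned by classify3 is a plain string. A small helper along the following lines could turn it into a dictionary. This is a sketch: parse_distribution is a hypothetical name, and the assumption that multiple classes are separated by commas inside the braces is not confirmed by the single-class example above.

```python
def parse_distribution(distrib):
    """Turn a string such as '{ T 6.00000 }' into {'T': 6.0}.

    Assumes entries of the form 'label weight', separated by commas.
    """
    body = distrib.strip().strip("{}").strip()
    weights = {}
    if not body:
        return weights
    for entry in body.split(","):
        # Split on the last run of whitespace so the weight is always last.
        label, weight = entry.strip().rsplit(None, 1)
        weights[label] = float(weight)
    return weights


# The distribution string from the classify3 example above.
weights = parse_distribution("{ T 6.00000 }")
```

With the result in dictionary form, the winning class can be recovered with max(weights, key=weights.get).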