python-timbl
Introduction

Memory-based learning covers a class of machine learning algorithms that simply store all training examples in memory and classify unseen test instances by extrapolating from similar training instances. Many such algorithms are based on the k-nearest neighbour rule, which classifies an unseen test instance by selecting the most frequent class among the k most similar training instances.

Unlike many other well-known learning algorithms, memory-based learning algorithms do not abstract from the data in the training phase. They therefore preserve not only the global but also the local characteristics of the instance space, including those parts of the instance space that are populated by small numbers of exceptional cases. This property suits domains in which exceptions are an important part of the task rather than something to abstract away from; natural language processing is one example.

The Tilburg Memory-Based Learner (TiMBL) implements several memory-based learning algorithms and can be used off the shelf to learn any task for which training data are available. The command-line application is the easiest way to get started using TiMBL. With it, simple text files containing feature-value descriptions of test instances can be processed in batch; the output is another text file containing the test instances and the predicted class label for each of them. For more flexible access to TiMBL, two other options exist.
python-timbl is a Python extension module wrapping the full TiMBL C++ programming interface. With this module, all functionality exposed through the C++ interface is also available to Python scripts. Being able to access the API from Python greatly facilitates prototyping TiMBL-based applications.

License

python-timbl is free software, distributed under the terms of the GNU General Public License with a special exception that allows linking with TiMBL (which has a GPL-incompatible license). Note, however, that using python-timbl in your applications implies using TiMBL as well, so you will also have to comply with the terms of the TiMBL license. Among other things, this license requires proper citation in publications of research that uses TiMBL.

Download

python-timbl is distributed in source-code form only. The latest release version is 2006.06.21. See the next section for guidelines on compiling it for your system.

In addition to the source distribution, you can also download a file with epydoc-generated API documentation for the Python module. This file is not required to use python-timbl, but the HTML documentation is not included with the source code, so you may want to download it anyway.

Installation

python-timbl depends on two external packages, which must be built and installed on your system before python-timbl itself can be built.

The first is TiMBL itself; download its tarball from TiMBL's homepage and follow the installation instructions. In the remainder of this section, it is assumed that $TIMBL_ROOT points to the directory in which TiMBL was built. This directory contains (among others) libTimbl.a and TimblAPI.h.

The second prerequisite is Boost.Python, a library that facilitates writing Python extension modules in C++. Many Linux distributions come with prebuilt packages of Boost.Python.
If your distribution provides such a package, install it; if not, refer to the Boost installation instructions to build and install Boost.Python manually. In the remainder of this section, let $BOOST_HEADERS refer to the directory that contains the Boost header files, and $BOOST_LIBS to the directory that contains the Boost library files. If you installed Boost.Python with your distribution's package manager, these directories are probably /usr/include and /usr/lib respectively.

Once both prerequisites are installed on your system, python-timbl can be built and installed with the following command.

    python setup.py \
        build_ext --boost-include-dir=$BOOST_HEADERS \
                  --boost-library-dir=$BOOST_LIBS \
                  --timbl-include-dir=$TIMBL_ROOT \
                  --timbl-library-dir=$TIMBL_ROOT \
        install --prefix=/dir/to/install/in

The --prefix option to the install command denotes the directory in which the module is to be installed. If you have the appropriate system permissions, you can leave out this option. The module will then be installed in the Python system tree. Otherwise, make sure that the installation directory is in the module search path of your Python system.

Usage

There are several places to look for documentation on python-timbl usage. Probably the most complete is the TiMBL API guide. Although this document describes the C++ interface to TiMBL, that interface is similar enough to its Python binding for the guide to be a useful reference for python-timbl as well. For an overview of the differences between the C++ API and its Python binding, read the subsection on that topic below.

A smaller, but more Python-oriented source of documentation is provided by the module's docstrings. A simple help(timbl) after importing the module shows all its classes and methods with a brief description for each. Alternatively, an epydoc-generated HTML version of this same documentation is available for download.
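If the module was installed with --prefix, one way to check that Python can find it is sketched below. This is only an illustration: the helper names add_install_dir and module_available are hypothetical, and the directory in the example is a placeholder that depends on your Python version and on the prefix you chose.

```python
import importlib.util
import sys


def add_install_dir(install_dir):
    """Put install_dir at the front of the module search path (only once)."""
    if install_dir not in sys.path:
        sys.path.insert(0, install_dir)


def module_available(name="timbl"):
    """Return True if Python can locate a top-level module called `name`."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Hypothetical location: adjust to wherever --prefix actually put the module.
    add_install_dir("/dir/to/install/in/lib/python/site-packages")
    if module_available():
        print("timbl found; try help(timbl) after importing it")
    else:
        print("timbl not found on the module search path")
```

Setting the PYTHONPATH environment variable to the same directory has the same effect without changing any script.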
Differences with respect to the C++ API

For the most part, the Python TiMBL interface follows the C++ version closely. The differences are listed below.
Getting started

The following examples use a data set for Dutch diminutive suffix prediction. The files for this data set (dimin.*) are included in the TiMBL distribution and can be found in the demos/ subdirectory.

Let's start by creating a TiMBL instance and training it on the examples in the dimin.data file.

    >>> import timbl
    >>> timblApi = timbl.TimblAPI("-mM -k5 -w3", "")
    >>> timblApi.learn("dimin.data")
    Examine datafile 'dimin.data' gave the following results:
    Number of Features: 12
    InputFormat : C4.5
    Phase 1: Reading Datafile: dimin.data
    Start: 0 @ Mon Jun 19 14:55:03 2006
    Finished: 3949 @ Mon Jun 19 14:55:03 2006
    Calculating Entropy Mon Jun 19 14:55:03 2006
    Feature Permutation based on GainRatio/Values :
    < 9, 5, 11, 1, 12, 7, 4, 3, 10, 8, 2, 6 >
    Phase 2: Learning from Datafile: dimin.data
    Start: 0 @ Mon Jun 19 14:55:03 2006
    Finished: 3949 @ Mon Jun 19 14:55:03 2006
    Size of InstanceBase = 24232 Nodes, (484640 bytes), 51.61 % compression
    True

Given a trained TiMBL instance, a test file can be processed, writing the output to another file, just as the TiMBL command-line application would do.

    >>> timblApi.test("dimin.test", "test.out", "")
    Examine datafile 'dimin.test' gave the following results:
    Number of Features: 12
    InputFormat : C4.5
    Starting to test, Testfile: dimin.test
    Writing output in: test.out
    Algorithm : IB1
    Global metric : Value Difference, Prestored matrix
    Deviant Feature Metrics:(none)
    Size of value-matrix[1] = 60 Bytes
    Size of value-matrix[2] = 672 Bytes
    Size of value-matrix[3] = 780 Bytes
    Size of value-matrix[4] = 96 Bytes
    Size of value-matrix[5] = 60 Bytes
    Size of value-matrix[6] = 1932 Bytes
    Size of value-matrix[7] = 1292 Bytes
    Size of value-matrix[8] = 396 Bytes
    Size of value-matrix[9] = 32 Bytes
    Size of value-matrix[10] = 2496 Bytes
    Size of value-matrix[11] = 1152 Bytes
    Size of value-matrix[12] = 1020 Bytes
    Total Size of value-matrices 9988 Bytes
    Weighting : Chi-square
    Tested: 1 @ Mon Jun 19 14:57:47 2006
    Tested: 2 @ Mon Jun 19 14:57:47 2006
    Tested: 3 @ Mon Jun 19 14:57:47 2006
    Tested: 4 @ Mon Jun 19 14:57:47 2006
    Tested: 5 @ Mon Jun 19 14:57:47 2006
    Tested: 6 @ Mon Jun 19 14:57:47 2006
    Tested: 7 @ Mon Jun 19 14:57:47 2006
    Tested: 8 @ Mon Jun 19 14:57:47 2006
    Tested: 9 @ Mon Jun 19 14:57:47 2006
    Tested: 10 @ Mon Jun 19 14:57:47 2006
    Tested: 100 @ Mon Jun 19 14:57:47 2006
    Ready: 950 @ Mon Jun 19 14:57:47 2006
    Seconds taken: 1 (950.00 p/s)
    overall accuracy: 0.977895 (929/950), of which 950 exact matches
    There were 5 ties of which 2 (40.00%) were correctly resolved
    True

So far, nothing has been done that could not have been done using the TiMBL command-line application. The most important advantage of directly accessing the TiMBL API is the ability to classify individual instances. This can be done using one of the three classify* methods.

    >>> timblApi.classify("=,=,=,=,=,=,=,=,+,p,e,=,?")
    (True, 'T')
    >>> timblApi.classify2("=,=,=,=,=,=,=,=,+,p,e,=,?")
    (True, 'T', 0.0)
    >>> timblApi.classify3("=,=,=,=,=,=,=,=,+,p,e,=,?")
    (True, 'T', '{ T 6.00000 }', 0.0)