written by Maarten van Gompel
PBMBMT Download & source repository: http://github.com/proycon/pbmbmt
We present PBMBMT, a system for phrase-based memory-based machine translation. This is a type of example-based machine translation in which the translation model takes the form of approximate k-nearest neighbour classifiers trained to map words or phrases in context to a target word or phrase. PBMBMT embraces the concept of phrases, as opposed to the single words or fixed n-grams that earlier work in memory-based machine translation focused on. PBMBMT usually employs a phrase translation table, such as generated by Moses, as the basis for the generation of training and test instances for the classifiers.
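To make the idea of memory-based classification concrete, the following is a minimal toy sketch of a nearest-neighbour classifier that maps a source phrase in context to a target phrase. It is not PBMBMT's or TiMBL's implementation: the function names, the feature layout (previous word, focus phrase, next word), the simple overlap distance and the training instances are all assumptions chosen for illustration, and the search here is exact rather than approximate.

```python
from collections import Counter


def overlap_distance(a, b):
    """Count the feature positions at which two instances differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def knn_classify(memory, features, k=1):
    """Return the majority target among the k stored instances nearest to `features`.

    Simplification: this takes the k nearest instances; ties in distance are
    broken by the order in which the instances were stored.
    """
    ranked = sorted(memory, key=lambda inst: overlap_distance(inst[0], features))
    votes = Counter(target for _, target in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical Dutch-to-English training instances:
# (previous word, focus phrase, next word) -> target phrase
memory = [
    (("zit", "op de bank", "en"), "on the couch"),
    (("geld", "op de bank", "zetten"), "in the bank"),
    (("ligt", "op de bank", "te"), "on the couch"),
]

# The context word "geld" (money) pulls the query towards the financial sense.
print(knn_classify(memory, ("geld", "op de bank", "gezet"), k=1))  # prints: in the bank
```

With a larger k the vote is dominated by the more frequent class, which is why the choice of k and of the distance metric matters in practice.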
The system is available under the GNU General Public License v3 and is intended primarily for research purposes.
The theory is described extensively in the following publications. We strongly recommend reading the first-mentioned paper and becoming acquainted with the underlying theory before attempting to use the software provided on this page.
The following scheme illustrates the various components in Phrase-based Memory-based Machine Translation:
Note that external software is needed for various tasks. Generating a word alignment between corpora and generating phrase-translation tables require GIZA++ and Moses respectively. Furthermore, the software itself relies on TiMBL for the actual classification task.
Installation of PBMBMT is fairly complex due to its many dependencies, so an installation script, install.sh, has been written to facilitate it. It will automatically download and compile the following necessary dependencies:
by ILK, Tilburg University:
Note that PBMBMT runs on Unix systems only. Windows is not supported, and PBMBMT has not been tested on Mac OS X, although installation there may be possible with expert knowledge; please inform us of a successful installation. PBMBMT is written for Python 2.5, so make sure to have version 2.5 or higher installed (but lower than Python 3.0!). PBMBMT and its installer script furthermore depend on the following mandatory software, which you will need to install through your distribution if it is not already available. Package names for Debian/Ubuntu-based systems are provided explicitly:
If you want to work with your own training data from scratch, you also need to install GIZA++ and Moses, which are required for generating the word-alignment data and the phrase-translation table respectively. In that case you will also need to create your own language model using SRILM.
Download PBMBMT from its repository on GitHub as per the following instructions:
$ git clone https://github.com/proycon/pbmbmt.git
$ cd pbmbmt
The installation script will also download sample data which includes tokenised and word-aligned data from the OpenSubtitles parallel corpus (English and Dutch), split into training and test-sets. It also includes a phrase-translation table. This data will be used in this how-to.
The software consists of several subsystems, corresponding to the main stages in the process:
The system is invoked through pbmbmt.py, which allows specific subsystems to be selected. Note that generation of word-alignments and phrase-translation tables is not part of this process but has to be done in advance using GIZA++ and Moses.
Pass the -h flag to pbmbmt.py to receive an overview of the full range of options that are accepted, a full discussion of which is beyond the scope of this how-to.
We will be using the sample data to build a translation system from Dutch to English. Note that PBMBMT is rather picky about how input and output files are named and where they are placed. You will see a pbmbmt/data/ directory that contains all data files. Test results, i.e. translation results and their evaluation, will be placed in pbmbmt/data/test.
The sample data comes from the OpenSubtitles corpus, an open source parallel corpus. The sample data has already been tokenised, sentence-aligned, and word-aligned, so we start with the phrase-extraction and instance generation phases as shown earlier in the scheme.
The following files are included in the sample data:
Note that it is mandatory for the word-alignment file and the test-corpus to share the very same identifier (the name up until the first period), in this case "OpenSub-dutch". The system will always detect this identifier and use it also in all output files, which will be stored in a directory within test/ of the very same name. An additional experiment identifier can be set if one wants to run multiple experiments on the same data, to make sure each experiment gets stored in a different directory without mixing up files. This somewhat forced naming convention aims to prevent any confusion and enables the system to quickly find the many files it needs.
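The naming convention can be illustrated with a short sketch. This is not code from PBMBMT itself: the helper names are hypothetical, and so are the example file names for the word-alignment file and the test corpus. The sketch only mirrors the rule stated above, namely that the identifier is the part of the file name before the first period, that both files must share it, and that output ends up in a directory under test/ named after the identifier and the optional experiment identifier.

```python
import os


def corpus_identifier(filename):
    """Return the part of the file name before the first period."""
    return os.path.basename(filename).split(".", 1)[0]


def output_directory(data_dir, alignment_file, test_file, experiment_id=None):
    """Derive the output directory, checking that both files share one identifier."""
    ident = corpus_identifier(alignment_file)
    if corpus_identifier(test_file) != ident:
        raise ValueError(
            "identifier mismatch: %r vs %r" % (ident, corpus_identifier(test_file))
        )
    name = ident if not experiment_id else ident + "." + experiment_id
    return os.path.join(data_dir, "test", name)


# Hypothetical file names, used only to demonstrate the convention.
print(corpus_identifier("OpenSub-dutch.train.A3.final"))
print(
    output_directory(
        os.path.join("pbmbmt", "data"),
        "OpenSub-dutch.train.A3.final",
        "OpenSub-dutch.test.txt",
        experiment_id="exp1",
    )
)
```

A mismatch such as pairing "OpenSub-dutch.train.A3.final" with "OpenSub-english.test.txt" raises a ValueError in this sketch, which is the kind of mix-up the convention is meant to rule out.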
The instructions below demonstrate the phrase-translation table method of phrase-extraction, using the default split-file format. The phrase-table is stored in OpenSub-dutch.train.phrasetable. Since this may be the first of several experiments we choose "exp1" as experiment identifier. Recall that each experiment needs a unique output infix, otherwise the files get mixed up:
For all of the commands in this section, we assume that you are in the pbmbmt directory, where you should already be after running the installation script. If not, issue something like:
$ cd pbmbmt
Now we invoke the system with all of its subsystems on our OpenSubtitles data, as follows:
$ ./pbmbmt.py --Dsrilm=OpenSub-english.lm -- OpenSub-dutch OpenSub-english test exp1
The syntax here is:
$ ./pbmbmt.py [options] -- [source] [reference] [set] [experiment-ID]
Since we specified no further options, all aforementioned subsystems will be invoked. If this is undesired, the -G flag enables only instance generation, -T enables the TiMBL classifiers, -D enables decoding, and -S enables scoring and classification. See the help output (-h) for a full list of parameters.
Note that experiments can take considerable time and resources; a state-of-the-art machine with plenty of memory and multiple CPUs is strongly recommended. After all is done, you will find a directory pbmbmt/data/test/OpenSub-dutch.exp1 holding all output and intermediate files for this experiment.
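When running several experiments, it can help to assemble the command line from a script. The sketch below follows the syntax shown above; the helper name build_pbmbmt_command is hypothetical, and the only option used is the one that appears in this how-to. It builds the argument list without executing anything.

```python
def build_pbmbmt_command(source, reference, dataset, experiment_id, options=()):
    """Assemble: ./pbmbmt.py [options] -- [source] [reference] [set] [experiment-ID]"""
    return (
        ["./pbmbmt.py"]
        + list(options)
        + ["--", source, reference, dataset, experiment_id]
    )


cmd = build_pbmbmt_command(
    "OpenSub-dutch",
    "OpenSub-english",
    "test",
    "exp1",
    options=["--Dsrilm=OpenSub-english.lm"],
)

# Prints the same command line as used earlier in this section.
print(" ".join(cmd))
# The list could then be handed to subprocess.call(cmd) from within the pbmbmt directory.
```

Changing only the experiment identifier per run (for example "exp2") keeps the output of each experiment in its own directory.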
These loose ends will be integrated into the documentation at a later stage:
$ ngram-count -order 3 -interpolate -kndiscount -text OpenSub-english.train.txt -lm OpenSub-english.lm
Last updated: 2011-01-19