This page accompanies my master's thesis on "Phrase-based Memory-based Machine Translation" and provides a how-to for the software described in the thesis.
Whereas the thesis is very theoretical in nature, this page is very practical. It is strongly recommended to first read either the derived paper or the full thesis and become acquainted with the underlying theory before attempting to use the software provided on this page.
The software is of limited interest to end-users seeking a deployable, fully functional translation system, but may be valuable for researchers in the field of Machine Translation.
In my master's thesis I investigate a phrase-based approach to Memory-based Machine Translation. This is a form of automatic translation in which lazy-learning classifiers translate fragments of the input sentence. A parallel corpus serves as the basis for training such a classifier. In the phrase-based approach the principal component of these fragments is a phrase of arbitrary length, in contrast to prior research in the field, in which this component was a single word. A key element of the research is a comparison of three methods of phrase extraction. A new decoder has been developed to deal with the characteristics unique to this approach and to re-assemble the translated fragments into one final translation.
The same research is also presented in the paper "Extending Memory-Based Machine Translation to Phrases" by Maarten van Gompel, Antal van den Bosch and Peter Berck, presented at the 3rd Workshop on Example-Based Machine Translation in Dublin, Ireland. It is highly recommended to read this paper, and optionally the full thesis if further details are desired.
The following scheme illustrates the various components in Phrase-based Memory-based Machine Translation (see chapter 3 of the thesis for a thorough explanation):

Note that external software is needed for various tasks. Both the generation of a word alignment between corpora and the generation of phrase-translation tables have to be done using Moses and GIZA++. Furthermore, the software itself relies on TiMBL, a memory-based learner.
In order to save time, some sample data can be downloaded that includes tokenised and word-aligned data from the OpenSubtitles parallel corpus (English and Dutch), split into training and test-sets. It also includes a phrase-translation table. This data will be used in the how-to.
The software consists of the following main parts:
Note that all of the Python programs accept a help flag -h that explains the full range of options they take; a full discussion of these is beyond the scope of this how-to.
Note that the software runs only on Linux or similar operating systems.
We will be using the sample data to translate from Dutch to English. The data has already been tokenised, sentence-aligned, and word-aligned, so we start with the phrase-extraction and instance generation phases as shown earlier in the scheme.
The following files are included in the sample data:
Note that it is mandatory for the word-alignment file and the test-corpus to share the very same identifier (the name up until the first period), in this case "OpenSub-dutch". The system will always detect this identifier and use it also in all output files, which will be stored in a directory of the very same name. An additional "output name" can be set if one wants to run multiple experiments on the same data, to make sure each experiment gets stored in a different directory without mixing up files. This somewhat forced naming convention aims to prevent any confusion and enables the system to quickly find the files it needs.
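The naming convention can be sketched in a few lines of shell. The helper names corpus_id and experiment_dir are made up for this illustration and are not part of the software; the sketch only mirrors the rule stated above (the identifier is the file name up to the first period) and the directory name OpenSub-dutch.exp1 used in this how-to.

```shell
# Illustration only: corpus_id and experiment_dir are hypothetical helpers,
# not commands provided by the software.

# Identifier: the file name up to (not including) the first period.
corpus_id() {
    name=$(basename "$1")
    printf '%s\n' "${name%%.*}"
}

# Output directory: the identifier, a period, and the output name.
experiment_dir() {
    printf '%s.%s\n' "$(corpus_id "$1")" "$2"
}

corpus_id OpenSub-dutch.A3.final             # prints OpenSub-dutch
corpus_id OpenSub-dutch.test.txt             # prints OpenSub-dutch
experiment_dir OpenSub-dutch.test.txt exp1   # prints OpenSub-dutch.exp1
```

Both the word-alignment file and the test corpus yield the same identifier here, which is exactly the condition the system requires.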
The instructions below demonstrate the phrase-translation table method of phrase extraction, using the default split-file format. The phrase-table is stored in OpenSub-dutch.train.phrase-table. Since this may be the first of several experiments, we choose "exp1" as output name. Recall that each experiment needs a unique output name; otherwise the files get mixed up:
For all of the commands in this section, we assume that you are in the data directory and, for simplicity, that you extracted both the software and the sample data in your home directory.
$ cd ~/data/
Now generate training instances:
$ ~/pbmbmt/pbmbmt-make-instances --train=OpenSub-dutch.A3.final -o exp1 -p OpenSub-dutch.train.phrase-table
This will result in a directory OpenSub-dutch.exp1 that will be used to store all output and intermediate files for this experiment. We can now call a script to start training. This script invokes TiMBL (possibly multiple times) for training. Beware that this takes a lot of time and takes considerable memory and disk-space.
$ ~/pbmbmt/pbmbmt-train OpenSub-dutch.exp1
We now follow a similar procedure to generate test instances and to perform the actual memory-based classification. Note again that the actual testing may take considerable time and memory.
$ ~/pbmbmt/pbmbmt-make-instances --test=OpenSub-dutch.test.txt -o exp1 -p OpenSub-dutch.train.phrase-table
$ ~/pbmbmt/pbmbmt-test OpenSub-dutch.exp1
Everything is now ready for the last step, the actual decoding. For this we need a language model for the target language (English), generated by SRILM. This language model must be generated from at least all of the training data that was used; otherwise the SRILM Python module will fail with a rather obscure "Key" error. A language model, OpenSub-english.lm, has been provided in the sample data. If you want to create it anew, you would call SRILM along the following lines:
$ ngram-count -order 3 -interpolate -kndiscount -text OpenSub-english.train.txt -lm OpenSub-english.lm
We then call the decoder, which will output the translations to STDOUT and provide some status output to STDERR. It makes most sense to simply redirect the output to a file:
$ ~/pbmbmt/pbmbmt-decode -t OpenSub-dutch.test.txt -o exp1 -l OpenSub-english.lm > OpenSub-dutch.exp1/decode.out
The above might not work directly; you most likely need to explicitly prepend a library path so that the SRILM-Python binding can be found:
$ LD_LIBRARY_PATH=~/pbmbmt/ ~/pbmbmt/pbmbmt-decode -t OpenSub-dutch.test.txt -o exp1 -l OpenSub-english.lm > OpenSub-dutch.exp1/decode.out
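If the library path turns out to be needed for every call, it may be more convenient to set it once per shell session. The following is a minimal sketch using standard shell parameter expansion; the function name prepend_library_path is hypothetical, and ~/pbmbmt is simply the install location assumed throughout this how-to.

```shell
# Illustration only: prepend_library_path is a hypothetical helper.
# It puts a directory in front of LD_LIBRARY_PATH, keeping any existing
# entries and avoiding a stray colon when the variable is empty or unset.
prepend_library_path() {
    LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH
}

# Assumed install location of the software, as elsewhere in this how-to.
prepend_library_path "$HOME/pbmbmt"
```

After this, the decoder can be called in the same session without the LD_LIBRARY_PATH= prefix on the command line.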
We are now done, and you will find the automatic translations in OpenSub-dutch.exp1/decode.out. You could now use this file to compute BLEU scores, for example, using OpenSub-dutch.test.txt as source and OpenSub-english.test.txt as reference.
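BLEU tools compare output and reference line by line, so before scoring it can be worth checking that both files contain the same number of lines. The sketch below assumes that the decoder writes exactly one translation per line, which is an assumption and not something verified here; the helper names count_lines and same_line_count are made up for this illustration.

```shell
# Illustration only: count_lines and same_line_count are hypothetical helpers.

# Print the number of lines in a file.
count_lines() {
    awk 'END { print NR }' "$1"
}

# Succeed if both files have the same number of lines; otherwise report
# the mismatch on STDERR and return a non-zero status.
same_line_count() {
    a=$(count_lines "$1")
    b=$(count_lines "$2")
    if [ "$a" -ne "$b" ]; then
        echo "Line count mismatch: $1 has $a lines, $2 has $b lines" >&2
        return 1
    fi
}

# Example call, to be run from the data directory:
# same_line_count OpenSub-dutch.exp1/decode.out OpenSub-english.test.txt
```

If the check passes, any BLEU implementation can be used for scoring. One option, assuming the multi-bleu.perl script that ships with Moses is available on your system, is to pass the reference file as argument and feed decode.out on standard input.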
The above description used a phrase-translation table for phrase extraction. You could alternatively use any of the other methods described in my thesis, namely the phrase-list method or marker-based chunking, or no method at all (word-based). The only difference is in the calls to the instance generator:
Phrase-list method
$ ~/pbmbmt/pbmbmt-make-phraselist OpenSub-dutch.train.txt 25 > OpenSub-dutch.phraselist.25
$ ~/pbmbmt/pbmbmt-make-instances --train=OpenSub-dutch.A3.final -o exp1 -l OpenSub-dutch.phraselist.25
$ ~/pbmbmt/pbmbmt-make-instances --test=OpenSub-dutch.test.txt -o exp1 -l OpenSub-dutch.phraselist.25
Marker-based Chunking
$ ~/pbmbmt/pbmbmt-make-instances --train=OpenSub-dutch.A3.final -o exp1 -m markers.nl -M markers.en
$ ~/pbmbmt/pbmbmt-make-instances --test=OpenSub-dutch.test.txt -o exp1 -m markers.nl -M markers.en
Word-based (No phrase extraction)
$ ~/pbmbmt/pbmbmt-make-instances --train=OpenSub-dutch.A3.final -o exp1 -w
$ ~/pbmbmt/pbmbmt-make-instances --test=OpenSub-dutch.test.txt -o exp1 -w
Make sure to change exp1 to something else if you want to run new experiments alongside the primary one illustrated in this how-to, or otherwise clear the previously generated directory first. If you do neither, the system will mix up files.
If you want to try one of the other instance file formats, pass for example -f 6 for the fixed-feature method using 6 features, or pass -F for the phrase-as-single-feature method (see my thesis for the theory behind these).
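A small guard can help avoid reusing an experiment directory by accident. This is only a sketch: the function name require_fresh_experiment is hypothetical, and the directory name follows the identifier-plus-output-name pattern seen earlier (OpenSub-dutch.exp1).

```shell
# Illustration only: require_fresh_experiment is a hypothetical helper.
# It fails if the given experiment directory already exists, so that a new
# experiment is not started on top of files from an earlier one.
require_fresh_experiment() {
    if [ -e "$1" ]; then
        echo "Refusing to reuse existing directory: $1" >&2
        return 1
    fi
}

# Example call before generating instances for a second experiment:
# require_fresh_experiment OpenSub-dutch.exp2 && echo "safe to start exp2"
```

The guard deliberately never deletes anything; whether to remove an old directory or pick a new output name is left to you.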
The calls for training, testing, and decoding remain exactly the same in any case. Note though that the decoder is capable of accepting a wide range of parameters.
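To see how the pieces fit together, the sketch below prints the full command sequence of this how-to for a given output name and phrase-extraction option. The function name print_pipeline is hypothetical; the commands, flags and file names are the ones used above, and ~/pbmbmt is the assumed install location. Options containing spaces are not handled, and the LD_LIBRARY_PATH prefix for the decoder is left out.

```shell
# Illustration only: print_pipeline is a hypothetical helper that prints,
# but does not run, the commands for one experiment.
#   $1      output name, for example exp2
#   $2...   phrase-extraction options for the instance generator, for example -w
print_pipeline() {
    out=$1
    shift
    opts=$*
    dir="OpenSub-dutch.$out"
    cat <<EOF
~/pbmbmt/pbmbmt-make-instances --train=OpenSub-dutch.A3.final -o $out $opts
~/pbmbmt/pbmbmt-train $dir
~/pbmbmt/pbmbmt-make-instances --test=OpenSub-dutch.test.txt -o $out $opts
~/pbmbmt/pbmbmt-test $dir
~/pbmbmt/pbmbmt-decode -t OpenSub-dutch.test.txt -o $out -l OpenSub-english.lm > $dir/decode.out
EOF
}

# Print the five commands for a word-based experiment named exp2.
print_pipeline exp2 -w
```

Only the two calls to the instance generator receive the phrase-extraction options, which reflects that training, testing and decoding are called the same way for every method. After inspecting the printed commands, they could be executed from the data directory by piping them into sh -e, which stops at the first failing step.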