Phrase-based Memory-based Machine Translation

by Maarten van Gompel

Introduction

This page accompanies my master's thesis on "Phrase-based Memory-based Machine Translation" and provides a how-to for the software described in the thesis.

Whereas the thesis is very theoretical in nature, this page is very practical. It is strongly recommended to first read either the derived paper or the full thesis and become acquainted with the underlying theory before attempting to use the software provided on this page.

The software is of little interest to end-users seeking a deployable and fully functional translation system, but may be valuable to researchers in the field of Machine Translation.

Abstract

In my master's thesis I investigate a phrase-based approach to Memory-based Machine Translation. This is a form of automatic translation in which lazy-learning classifiers translate fragments of the input sentence. A parallel corpus serves as the basis for training such a classifier. In the phrase-based approach the principal component of these fragments is a phrase of arbitrary length. This can be contrasted with prior research in the field, in which this component was a single word. A key element in the research is a comparison of three methods of phrase extraction. A new decoder has been developed to deal with the characteristics unique to this approach and to re-assemble the translated fragments into one final translation.

This same research is also presented in the paper Extending Memory-Based Machine Translation to Phrases, by Maarten van Gompel, Antal van den Bosch and Peter Berck, presented at the 3rd Workshop on Example-Based Machine Translation in Dublin, Ireland. It is highly recommended to read this paper, and optionally the full thesis if further details are desired.

Download paper
Download full thesis

PBMBMT Setup

The following scheme illustrates the various components in Phrase-based Memory-based Machine Translation (see chapter 3 for a thorough explanation):

[Figure: phrase-based memory-based machine translation architecture]

Note that external software is needed for various tasks. The generation of a word alignment between corpora and the generation of phrase-translation tables have to be done using Moses and GIZA++. Furthermore, the software itself relies on TiMBL, a memory-based learner.

In order to save time, some sample data can be downloaded that includes tokenised and word-aligned data from the OpenSubtitles parallel corpus (English and Dutch), split into training and test-sets. It also includes a phrase-translation table. This data will be used in the how-to.

The software consists of the following main parts:

Note that all of the Python programs accept a help flag -h that explains the full range of options they take, a full discussion of which is beyond the scope of this how-to.

PBMBMT How-To

Download and Installation Instructions

Note that the software runs only on Linux or similar operating systems.

  1. PBMBMT is written in Python 2.5, so make sure you have at least version 2.5 installed, such as 2.5 or 2.6 (but not Python 3.0!). PBMBMT furthermore depends on the following mandatory software, which you will need to install:
    • TiMBL - Needed for Memory-based Learning
    • SRILM - Needed to recompile the SRILM-Python module used by the pbmbmt decoder. A Linux x86_64 binary version of the SRILM-Python module is already included in the software, so if it matches your architecture and distribution, you need not download SRILM for this purpose. However, SRILM is always needed when you want to generate your own language model rather than use the one provided with the sample data.
  2. The following software is optional, needed only if you work with your own data rather than the sample data provided for this how-to:
    • Moses - Needed for the creation of phrase-translation tables
    • GIZA++ - Needed for the generation of word-alignments
  3. Download and extract the PBMBMT software:
  4. Download and extract the sample data. This is a sizeable download (120MB) and extracts to nearly 800MB.
  5. For dealing with the language model, the PBMBMT Decoder relies on a small Python binding to SRILM (made by Sander Canisius), which is included with the software.
    1. First check whether the provided library works. From the pbmbmt directory, issue the following command: $ echo "import srilm" | python. If no error appears, you are done. Otherwise, recompile as shown in the next step.
    2. Make sure you have downloaded and compiled SRILM. Then edit the script pbmbmt/makesrilm and make sure the first line points to wherever your SRILM sources reside. Execute the script to recompile the srilm.so SRILM-Python binding, and repeat the previous step to verify that everything works.
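
The import check from step 5 can also be scripted, which is convenient when setting up several machines. The following is a minimal sketch, not part of the PBMBMT distribution; the function name binding_available is made up for this illustration. It assumes it is run from the pbmbmt directory, so that the srilm module is on the import path.

```python
def binding_available(module_name="srilm"):
    """Return True if the named module can be imported, False otherwise.

    A compiled module that does not match the local architecture also
    raises ImportError, so that case is reported as unavailable too.
    """
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    if binding_available():
        print("srilm binding can be imported")
    else:
        print("srilm binding not importable; recompile it with pbmbmt/makesrilm")
```

If the check fails, proceed with the recompilation step and run the check again.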

Usage Instructions

We will be using the sample data to translate from Dutch to English. The data has already been tokenised, sentence-aligned, and word-aligned, so we start with the phrase-extraction and instance generation phases as shown earlier in the scheme.

The following files are included in the sample data:

Note that it is mandatory for the word-alignment file and the test-corpus to share the very same identifier (the name up until the first period), in this case "OpenSub-dutch". The system will always detect this identifier and use it also in all output files, which will be stored in a directory of the very same name. An additional "output name" can be set if one wants to run multiple experiments on the same data, to make sure each experiment gets stored in a different directory without mixing up files. This somewhat forced naming convention aims to prevent any confusion and enables the system to quickly find the files it needs.
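
To make the naming convention concrete, here is a small sketch of how the identifier and the output directory name follow from a file name. It is an illustration only and not the code PBMBMT itself uses; the function names are hypothetical.

```python
import os


def identifier(path):
    """Return the file name up to, but not including, the first period."""
    return os.path.basename(path).split(".", 1)[0]


def share_identifier(alignment_file, test_corpus):
    """True if both files carry the same identifier, as the system requires."""
    return identifier(alignment_file) == identifier(test_corpus)


def output_directory(path, output_name=None):
    """Directory name for output: the identifier, plus the output name if set."""
    ident = identifier(path)
    if output_name:
        return ident + "." + output_name
    return ident
```

For the sample data, identifier("OpenSub-dutch.A3.final") and identifier("OpenSub-dutch.test.txt") both give "OpenSub-dutch", and with the output name "exp1" the directory becomes "OpenSub-dutch.exp1".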

The instructions below demonstrate the phrase-translation table method of phrase-extraction, using the default split-file format. The phrase-table is stored in OpenSub-dutch.train.phrase-table. Since this may be the first of several experiments, we choose "exp1" as output name. Recall that each experiment needs a unique output name, otherwise the files get mixed up.

For all of the commands in this section, we assume that you are in the data directory and, for simplicity, that you extracted both the software and the sample data in your home directory.

$ cd ~/data/

Now generate training instances:

$ ~/pbmbmt/pbmbmt-make-instances --train=OpenSub-dutch.A3.final -o exp1 -p OpenSub-dutch.train.phrase-table

This will result in a directory OpenSub-dutch.exp1 that will be used to store all output and intermediate files for this experiment. We can now call a script to start training. This script invokes TiMBL (possibly multiple times) for training. Beware that this takes a lot of time and requires considerable memory and disk space.

$ ~/pbmbmt/pbmbmt-train OpenSub-dutch.exp1

Now we go into a similar procedure to generate test instances and to perform the actual memory-based classification. Note again that the actual testing may take considerable time and memory.

$ ~/pbmbmt/pbmbmt-make-instances --test=OpenSub-dutch.test.txt -o exp1 -p OpenSub-dutch.train.phrase-table
$ ~/pbmbmt/pbmbmt-test OpenSub-dutch.exp1

Everything is now ready for the last step, the actual decoding. For this we need a Language Model for the target language (English), generated by SRILM. This language model must be generated from at least all of the training data that was used, otherwise the SRILM-Python module will fail with a rather obscure "Key" error. A language model has been provided in the sample data, OpenSub-english.lm. If you want to create it anew, call SRILM along the following lines:

$ ngram-count -order 3 -interpolate -kndiscount -text OpenSub-english.train.txt -lm OpenSub-english.lm
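
If you build your own language model, it can be useful to check beforehand that it covers the training data. The sketch below is a standalone helper, not part of PBMBMT, and its function names are hypothetical. It rests on two assumptions: that the language model is written in the textual ARPA format (the format ngram-count writes with -lm by default), and that the "Key" error is caused by target-language words missing from the model. The second assumption is a guess and is not confirmed by the software's documentation.

```python
def arpa_unigrams(lines):
    """Collect the words listed in the 1-grams section of an ARPA file.

    Each entry line has the form: log-probability, word, optional back-off
    weight, separated by whitespace.
    """
    words = set()
    in_unigrams = False
    for line in lines:
        line = line.strip()
        if line == "\\1-grams:":
            in_unigrams = True
        elif line.startswith("\\"):
            # Any other section header, or the closing \end\ marker.
            in_unigrams = False
        elif in_unigrams and line:
            fields = line.split()
            if len(fields) >= 2:
                words.add(fields[1])
    return words


def missing_words(corpus_lines, vocabulary):
    """Return the tokens of a tokenised corpus that are not in the vocabulary."""
    missing = set()
    for line in corpus_lines:
        for token in line.split():
            if token not in vocabulary:
                missing.add(token)
    return missing
```

Both functions accept any iterable of lines, so open file objects for OpenSub-english.lm and OpenSub-english.train.txt can be passed directly. An empty result from missing_words means every training token has a unigram entry in the model.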

We then call the decoder, which will output the translations to STDOUT and provide some status output to STDERR. It makes most sense to simply redirect the output to a file:

$ ~/pbmbmt/pbmbmt-decode -t OpenSub-dutch.test.txt -o exp1 -l OpenSub-english.lm > OpenSub-dutch.exp1/decode.out

The above might not work directly; you most likely need to explicitly prepend a library path so that the SRILM-Python binding can be found:

$ LD_LIBRARY_PATH=~/pbmbmt/ ~/pbmbmt/pbmbmt-decode -t OpenSub-dutch.test.txt -o exp1 -l OpenSub-english.lm > OpenSub-dutch.exp1/decode.out

We are now done, and you will find the automatic translations in OpenSub-dutch.exp1/decode.out. You could now use this file to compute BLEU scores, for example, using OpenSub-dutch.test.txt as source and OpenSub-english.test.txt as reference.
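
For a quick sanity check of the output, BLEU can be approximated in a few lines. The sketch below is a simplified, unsmoothed corpus-level BLEU with a single reference per sentence and whitespace tokenisation. It is not the evaluation script used for the thesis, and scores from it should not be compared with published figures; use an established evaluation script for that. The function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU, one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = 0
    ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens = hyp.split()
        ref_tokens = ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clipping: an n-gram is credited at most as often as it
            # occurs in the reference sentence.
            matches[n - 1] += sum(min(count, ref_counts[gram])
                                  for gram, count in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(float(m) / t)
                        for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for output that is shorter than the reference.
    if hyp_len > ref_len:
        brevity = 1.0
    else:
        brevity = math.exp(1.0 - float(ref_len) / hyp_len)
    return brevity * math.exp(log_precision)
```

The two arguments are lists of sentences in the same order, for instance the lines of OpenSub-dutch.exp1/decode.out and of OpenSub-english.test.txt. Because there is no smoothing, the score is 0 whenever not a single 4-gram matches anywhere in the corpus.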

Alternative Usage

The above description used a phrase-translation table for phrase extraction. You could alternatively use any of the other methods described in my thesis, namely phrase-list or marker-based chunking, or no method at all (word-based). The only difference is in the calls to the instance generator:

Phrase-list method

$ ~/pbmbmt/pbmbmt-make-phraselist OpenSub-dutch.train.txt 25 > OpenSub-dutch.phraselist.25
$ ~/pbmbmt/pbmbmt-make-instances --train=OpenSub-dutch.A3.final -o exp1 -l OpenSub-dutch.phraselist.25
$ ~/pbmbmt/pbmbmt-make-instances --test=OpenSub-dutch.test.txt -o exp1 -l OpenSub-dutch.phraselist.25

Marker-based Chunking

$ ~/pbmbmt/pbmbmt-make-instances --train=OpenSub-dutch.A3.final -o exp1 -m markers.nl -M markers.en
$ ~/pbmbmt/pbmbmt-make-instances --test=OpenSub-dutch.test.txt -o exp1 -m markers.nl -M markers.en

Word-based (No phrase extraction)

$ ~/pbmbmt/pbmbmt-make-instances --train=OpenSub-dutch.A3.final -o exp1 -w
$ ~/pbmbmt/pbmbmt-make-instances --test=OpenSub-dutch.test.txt -o exp1 -w
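
Since the training and test calls always take the same extraction flags, the pairs of commands above can be generated from one table. This is a convenience sketch only: the function and the dictionary are hypothetical, and the flags and file names are copied from the commands in this how-to rather than from the programs' option lists.

```python
# Flags per phrase-extraction method, as used in the commands above.
EXTRACTION_FLAGS = {
    "phrase-table": ["-p", "OpenSub-dutch.train.phrase-table"],
    "phrase-list": ["-l", "OpenSub-dutch.phraselist.25"],
    "marker-based": ["-m", "markers.nl", "-M", "markers.en"],
    "word-based": ["-w"],
}


def make_instances_calls(identifier, output_name, extraction_flags,
                         program="pbmbmt-make-instances"):
    """Return the training and the test call as two argument lists.

    Both calls receive the same output name and extraction flags.
    """
    common = ["-o", output_name] + list(extraction_flags)
    train = [program, "--train=%s.A3.final" % identifier] + common
    test = [program, "--test=%s.test.txt" % identifier] + common
    return train, test
```

For example, make_instances_calls("OpenSub-dutch", "exp2", EXTRACTION_FLAGS["word-based"]) yields the two word-based calls with "exp2" as output name. The resulting lists can be handed to subprocess.check_call once the program argument is set to the full path of pbmbmt-make-instances.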

Make sure to change exp1 to something else if you want to run new experiments alongside the primary one illustrated in this how-to. Otherwise, clear the previously generated directory first; if you do not, the system will mix up files.

If you want to try one of the other instance formats, pass for example -f 6 for the fixed-feature method using 6 features, or pass -F for the phrase-as-single-feature method (see my thesis for the theory behind these).
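
One way to avoid reusing an output name by accident is to derive the next free one from the directories that already exist. The helper below is a hypothetical sketch, not part of PBMBMT; it assumes experiment directories are named after the identifier followed by a period and the output name, as with OpenSub-dutch.exp1.

```python
def next_output_name(identifier, existing_names, prefix="exp"):
    """Return the first prefix+number not yet taken by an experiment directory.

    existing_names is any collection of directory names, for instance the
    result of os.listdir() on the data directory.
    """
    number = 1
    while "%s.%s%d" % (identifier, prefix, number) in existing_names:
        number += 1
    return "%s%d" % (prefix, number)
```

With OpenSub-dutch.exp1 already present in the data directory, next_output_name("OpenSub-dutch", os.listdir(".")) gives "exp2".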

The calls for training, testing, and decoding remain exactly the same in any case. Note though that the decoder is capable of accepting a wide range of parameters.