
Introduction

Machine translation has been studied for a long time, and the technology has passed through three generations.

The first generation was rule-based machine translation, which was developed over the course of many years. This method relies on translation rules written by hand. When an input sentence completely matches a rule, the output sentence is of very high quality. However, natural language uses a great many expressions, so this technology had very small coverage. In addition, the main problems were that writing the rules cost too much and that maintaining them was hard.

The second generation was example-based machine translation. This method finds a similar sentence in a corpus and generates a similar output sentence. The problem with this method is how to calculate the similarity. Many methods, such as dynamic programming (DP), are available. However, they are largely heuristic and intuitive rather than grounded in a mathematical model.
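As an illustration of the DP-based similarity mentioned above, the following is a minimal sketch that retrieves the closest example by word-level edit distance. The function names, the toy corpus, and the choice of Levenshtein distance are assumptions made for illustration; the text does not say which similarity measure any particular example-based system used.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance computed with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # delete wa
                            curr[j - 1] + 1,      # insert wb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def most_similar(query, examples):
    """Return the (source, target) pair whose source is closest to the query."""
    q = query.split()
    return min(examples, key=lambda pair: edit_distance(q, pair[0].split()))


# Toy example corpus (hypothetical data, for illustration only).
EXAMPLES = [
    ("i like green tea", "TARGET-1"),
    ("he reads a book", "TARGET-2"),
]
```

Calling `most_similar("i like black tea", EXAMPLES)` selects the first pair, since only one word differs. The unit costs of 1 for every operation are exactly the kind of heuristic choice the paragraph above criticizes.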

The third generation is statistical machine translation, which is very popular now. This method is based on statistics, and it seems very reasonable. Many versions of statistical machine translation models are available. Early statistical machine translation was based on IBM Models 1 through 5 [2]. These models are based on individual words, and thus a "null word" model is needed. However, this "null word" model sometimes causes very serious problems, especially in decoding. Thus, recent statistical machine translation systems usually use phrase-based models. A phrase-based statistical machine translation model consists of a translation model and a language model. The phrase table is the translation model for phrase-based SMT; it consists of Japanese phrases, the corresponding English phrases, and their probabilities. A word N-gram model is used as the language model.
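To make the two components concrete, here is a toy sketch of a phrase table and a word bigram language model, combined into a single log-probability score. Everything in it is an assumption for illustration: the placeholder source phrases, the probabilities, the add-one smoothing, and the unweighted sum of the two scores. Real phrase-based systems use far larger tables and weighted feature combinations.

```python
import math
from collections import Counter

# Hypothetical phrase table: source phrase -> {target phrase: probability}.
PHRASE_TABLE = {
    "src_a": {"the cat": 0.7, "a cat": 0.3},
    "src_b": {"sleeps": 0.9, "is sleeping": 0.1},
}


def train_bigram(sentences):
    """Train a word bigram model with add-one smoothing; return a scorer."""
    contexts, bigrams = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        contexts.update(words[:-1])            # every word that has a successor
        bigrams.update(zip(words, words[1:]))
    vocab_size = len({w for s in sentences for w in s.split()} | {"</s>"})

    def logprob(sentence):
        words = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((bigrams[(u, v)] + 1) / (contexts[u] + vocab_size))
            for u, v in zip(words, words[1:])
        )

    return logprob


def score(source_phrases, target_phrases, lm_logprob):
    """Sum the phrase-table log-probabilities and the language model score."""
    tm = sum(math.log(PHRASE_TABLE[s][t])
             for s, t in zip(source_phrases, target_phrases))
    return tm + lm_logprob(" ".join(target_phrases))
```

With a model trained on a couple of English sentences, `score(["src_a", "src_b"], ["the cat", "sleeps"], lm)` can be compared against the score of an alternative segmentation to pick the better candidate.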

There are two criteria for evaluating the English sentences produced by Japanese-to-English machine translation: adequacy and fluency. We believe adequacy is related to the translation model and fluency is related to the language model. Similar languages, like English and Italian, may require only short phrases for accurate translation. However, languages that differ greatly, like Japanese and English, require a long phrase table for accurate translation. We implemented our statistical machine translation model using long phrase tables.
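The relationship between the two models can be written in the standard noisy-channel form for Japanese-to-English translation. The symbols below are assumptions chosen for illustration and are not necessarily the notation used by the authors: J is the Japanese input, E is a candidate English sentence, P(J | E) is the translation model, and P(E) is the language model.

```latex
\hat{E} = \operatorname*{arg\,max}_{E} P(E \mid J)
        = \operatorname*{arg\,max}_{E} P(J \mid E)\, P(E)
```

Under this reading, the translation model term rewards candidates that preserve the content of the input (adequacy), while the language model term rewards candidates that are natural English (fluency).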

We also found that long parallel sentences in the training data easily produce an incorrect phrase table, and that an incorrect phrase table leads to poor translation results, especially in adequacy. Therefore, we removed long parallel sentences from the training data.
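The filtering step can be sketched as follows. This is a minimal illustration, not the procedure actually used: the function name, the default threshold of 40 words, and the assumption that both sides are already tokenized into space-separated words (which Japanese text requires a segmenter for) are all hypothetical.

```python
def filter_long_pairs(pairs, max_words=40):
    """Keep only sentence pairs where both sides have at most max_words tokens.

    pairs: iterable of (source, target) strings, tokens separated by spaces.
    max_words: hypothetical threshold; the text does not state the value used.
    """
    kept = []
    for src, tgt in pairs:
        if len(src.split()) <= max_words and len(tgt.split()) <= max_words:
            kept.append((src, tgt))
    return kept
```

For example, `filter_long_pairs(corpus, max_words=5)` drops any pair in which either side exceeds five tokens and leaves the remaining pairs in their original order.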

We used general-purpose statistical machine translation tools for these experiments. As a result, the proposed method was effective for the Intrinsic-JE task. However, it was not effective for the Intrinsic-EJ task. Overall, our system had average performance in the NTCIR-7 Patent Translation task. For example, it placed 20th among 34 systems in the Intrinsic-JE task and 12th among 20 systems in the Intrinsic-EJ task [1].


Jin'ichi Murakami 2008-12-22