
Introduction

Recently, phrase-based statistical machine translation, which we refer to as "phrase-based SMT," has become very popular. However, it has several serious problems. One is translation quality: for Japanese-English translation, rule-based machine translation systems outperform phrase-based SMT [4].

About 3,000,000 Japanese-English parallel patent sentences are available [4]. Nevertheless, the performance of phrase-based SMT is lower than that of rule-based machine translation. Commercial machine translation systems are classified as rule-based systems. We consider this poor performance to be a fundamental problem of phrase-based SMT, caused in particular by the reordering model.

Phrase-based SMT uses three models: a translation model, a language model, and a reordering model. Each has problems.

The translation model gives the probability that a source phrase corresponds to a target phrase. It is estimated using Och's heuristic and IBM Models 1-5. However, this model produces phrase pairs that are grammatically odd.

The language model is normally an N-gram model, which is a reasonable choice for a stochastic language model. However, an N-gram model captures only local information, not global information. Also, high-order N-grams (for example, 5-grams) are unreliable because they have so many parameters, so a very large number of monolingual sentences is needed. To overcome these problems, smoothing techniques such as deleted interpolation or Kneser-Ney are used. However, these techniques sometimes decrease translation performance.

Finally, we consider the reordering model to have the most serious problems. Because the N-gram model has only local information, the reordering model is used to compensate. However, this model is not very effective for Japanese-English translation. In our opinion, the word reordering it captures is also local, not global, information. A more serious problem is that word reordering may be deterministic rather than statistical.
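The sparsity problem of the N-gram language model discussed above can be illustrated with a small sketch. It is not the language model used in this work: it is a toy bigram model with fixed-weight linear interpolation, a simpler relative of deleted interpolation (which estimates the weights on held-out data). The function names and the weight `lam` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def train_counts(sentences):
    """Collect unigram and bigram counts from whitespace-tokenized sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def interpolated_bigram_prob(prev, word, unigrams, bigrams, lam=0.7):
    """P(word | prev) as a fixed-weight mix of bigram and unigram estimates.

    lam is an assumed, hand-picked weight; deleted interpolation would
    estimate it from held-out data instead.
    """
    total = sum(unigrams.values())
    p_uni = unigrams[word] / total
    # Maximum-likelihood bigram estimate; zero if the context was never seen.
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_bi + (1.0 - lam) * p_uni


if __name__ == "__main__":
    uni, bi = train_counts(["the cat sat", "the dog sat"])
    print(interpolated_bigram_prob("the", "cat", uni, bi))  # seen bigram
    print(interpolated_bigram_prob("cat", "dog", uni, bi))  # unseen bigram
```

The unseen bigram still receives a small non-zero probability from the unigram term, which is the purpose of smoothing. With a vocabulary of size V, an N-gram model has on the order of V^N possible parameters, which is why 5-gram estimates need far more data than bigram estimates.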

To overcome the problems of both pattern-based machine translation and statistical machine translation, we propose a pattern-based statistical machine translation system. Conventional pattern-based machine translation is a kind of rule-based machine translation that uses translation patterns and translation word dictionaries. Translation patterns provide the word order, so reordering is no longer a problem for pattern-based machine translation. As a result, the output is grammatical and tends to be a good translation. However, such a system is costly because the translation patterns and word dictionaries are made manually. Statistical machine translation, on the other hand, is low in cost because it needs only source and target sentence pairs and nothing has to be made manually. Using statistical machine translation tools, we can build a pattern-based statistical machine translation system automatically. GIZA++ [3] can obtain source and target word pairs automatically from the source and target sentence pairs, and we can then make Japanese-English translation patterns from the automatically obtained word pairs.
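As a rough illustration of how translation patterns might be made from automatically obtained word pairs, the sketch below replaces each aligned word pair in a sentence pair with a shared variable. This is an assumption about the general idea, not the procedure used in this work: the function `make_pattern`, the variable names `X1`, `X2`, and the romanized example sentence are all hypothetical, and the word pairs are written by hand rather than read from GIZA++ output.

```python
def make_pattern(src_tokens, tgt_tokens, word_pairs):
    """Replace aligned word pairs with shared variables X1, X2, ...

    src_tokens / tgt_tokens: tokenized source and target sentences.
    word_pairs: (source_word, target_word) tuples, assumed to come from
                an automatic word aligner.
    Returns the source pattern and the target pattern as token lists.
    """
    src_pattern = list(src_tokens)
    tgt_pattern = list(tgt_tokens)
    var_id = 0
    for src_word, tgt_word in word_pairs:
        # Only substitute when both sides of the pair occur in the sentences.
        if src_word in src_pattern and tgt_word in tgt_pattern:
            var_id += 1
            var = f"X{var_id}"
            src_pattern[src_pattern.index(src_word)] = var
            tgt_pattern[tgt_pattern.index(tgt_word)] = var
    return src_pattern, tgt_pattern


if __name__ == "__main__":
    # Romanized Japanese is used here only to keep the example ASCII.
    src = ["watashi", "wa", "hon", "wo", "yomu"]
    tgt = ["I", "read", "a", "book"]
    pairs = [("watashi", "I"), ("hon", "book")]
    src_pat, tgt_pat = make_pattern(src, tgt, pairs)
    print(" ".join(src_pat))  # X1 wa X2 wo yomu
    print(" ".join(tgt_pat))  # X1 read a X2
```

The target pattern fixes the English word order (`X1 read a X2`), so a new input that matches the source pattern can be translated by filling the variables from the word dictionary, without a separate reordering model.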

Finally, we investigated the output sentences of the proposed method and compared them with the output of rule-based machine translation.


Jin'ichi Murakami 2013-06-26