Recently, phrase-based statistical machine translation (hereafter ''phrase-based SMT'') has become very popular. However, it has several serious problems. One is translation quality: for Japanese-English translation, rule-based machine translation systems outperform phrase-based SMT [4].
About 3,000,000 Japanese-English parallel patent sentences are available [4]. Nevertheless, the performance of phrase-based SMT is lower than that of rule-based machine translation. Commercial machine translation systems are classed as rule-based machine translation systems. We consider this poor performance to be a fundamental problem of phrase-based SMT, caused especially by the reordering model.
Phrase-based SMT uses three models: a translation model, a language
model, and a reordering model. Each of these models has problems. The
translation model gives the probability that a source phrase
corresponds to a target phrase. It is estimated using Och's heuristic
and IBM Models 1-5. However, this model produces grammatically odd
phrases.
The language model normally uses an N-gram model, which is a very
reasonable choice for a stochastic language model. However, the N-gram
model captures only local information, not global information. Also,
the reliability of a high-order N-gram model (for example, a 5-gram
model) is low because it has many parameters. Therefore, a very large
number of monolingual sentences is needed. To overcome these problems,
smoothing techniques such as deleted interpolation or Kneser-Ney are
used. However, these techniques sometimes decrease the translation
performance. Finally, we
consider that the reordering model has the most serious
problems. As noted, the N-gram model captures only local, not global,
information. The reordering model is used to surmount this
problem. However, this model is not very effective for
Japanese-English translation. In our opinion, the reordering model
also captures only local, not global, information. More seriously,
word reordering may be deterministic rather than statistical.
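The point about high-order N-gram reliability can be illustrated with a small sketch. The two-sentence corpus below is hypothetical toy data, not the patent corpus, and the helper names `ngrams` and `coverage` are introduced only for this illustration. It counts how many distinct N-grams are observed against how many are possible for the vocabulary, which is a rough proxy for the number of parameters that must be estimated.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy monolingual corpus (hypothetical data, for illustration only).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]

vocab = {w for sent in corpus for w in sent}

def coverage(n):
    """Return (distinct observed n-grams, possible n-grams = |V|**n)."""
    observed = Counter(g for sent in corpus for g in ngrams(sent, n))
    return len(observed), len(vocab) ** n

for n in (1, 2, 5):
    seen, possible = coverage(n)
    print(f"{n}-gram: {seen} observed of {possible} possible")
```

As N grows, the number of possible N-grams grows exponentially while the number observed barely changes, so most high-order parameters have no supporting evidence unless far more monolingual text is available or smoothing is applied.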
To overcome the problems of both pattern-based machine translation and statistical machine translation, we propose a pattern-based statistical machine translation system. Conventional pattern-based machine translation is a kind of rule-based machine translation and uses translation patterns and translation word dictionaries. Translation patterns provide the word order, so reordering is no longer a problem for pattern-based machine translation. Therefore, the output is grammatical and tends to be a good translation. However, such a system is costly because the translation patterns and word dictionaries are made manually. Statistical machine translation, on the other hand, is low in cost because it uses only source and target sentence pairs and requires no manual construction. Using statistical machine translation tools, we can build a pattern-based statistical machine translation system automatically. GIZA++ [3] obtains source and target word pairs automatically from the source and target sentence pairs, and we make Japanese-English translation patterns from these automatically obtained word pairs.
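The idea of making translation patterns from word pairs can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in this paper: the word pairs are written by hand as a stand-in for GIZA++ output (no GIZA++ interface is called), the sentences are romanized toy examples, and the functions `make_pattern` and `translate` are hypothetical names.

```python
def make_pattern(src_tokens, tgt_tokens, word_pairs):
    """Replace each aligned word pair by a shared variable X1, X2, ..."""
    src_pat, tgt_pat = list(src_tokens), list(tgt_tokens)
    for k, (s, t) in enumerate(word_pairs, start=1):
        var = f"X{k}"
        src_pat[src_pat.index(s)] = var   # first occurrence only (toy assumption)
        tgt_pat[tgt_pat.index(t)] = var
    return src_pat, tgt_pat

def is_var(token):
    """Variables are 'X' followed by digits (a convention of this sketch)."""
    return token.startswith("X") and token[1:].isdigit()

def translate(src_tokens, src_pat, tgt_pat, dictionary):
    """Match a sentence against the source pattern and fill the target side.

    Returns None when the sentence does not match the pattern or a word
    is missing from the dictionary.
    """
    if len(src_tokens) != len(src_pat):
        return None
    binding = {}
    for tok, pat in zip(src_tokens, src_pat):
        if is_var(pat):
            if tok not in dictionary:
                return None
            binding[pat] = dictionary[tok]
        elif tok != pat:
            return None
    # The target pattern fixes the word order; no reordering model is needed.
    return [binding.get(p, p) for p in tgt_pat]

# Hand-written word pairs standing in for automatically obtained ones.
src = "watashi wa hon wo yomu".split()
tgt = "I read books".split()
pairs = [("watashi", "I"), ("hon", "books"), ("yomu", "read")]
src_pat, tgt_pat = make_pattern(src, tgt, pairs)

dictionary = {"karera": "they", "tegami": "letters", "kaku": "write"}
output = translate("karera wa tegami wo kaku".split(), src_pat, tgt_pat, dictionary)
print(" ".join(src_pat), "->", " ".join(tgt_pat))
print(" ".join(output))
```

In this sketch the target side of the pattern carries the word order, which is why a matching sentence comes out in English order without any separate reordering step.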
Finally, we investigated the output sentences of the proposed method and compared them with the output of rule-based machine translation.