One problem with phrase-based statistical machine translation is the
language model. Generally, an N-gram model is used as the language
model. However, this model includes only local language information
and does not include grammatical information. We studied hierarchical
phrase-based statistical machine translation (HSMT) [Li 2009] as a
way to include grammatical information. However, HSMT analysis is
similar to that of context-free grammars (CFG). We believe that such
analysis complicates statistical machine translation by adding too
many parameters. Therefore, it is unreliable and does not perform
well, especially with a small amount of training data. In contrast,
PBMT is well known and has been extensively studied. In general, PBMT
is simple and has few parameters compared to
CFG-based MT, and the output of PBMT contains grammatical
information. However, the PBMT results show a trade-off between the
coverage of input sentences and the translation quality. If we obtain
good translation quality, the coverage of input sentences is low. If
we obtain high coverage of input sentences, the translation quality
is low.
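
The remark above that an N-gram model includes only local information can be illustrated with a small sketch. The code below is not the language model used in this work; it is a toy, unsmoothed bigram model (N = 2), and the corpus, function names, and example sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return adjacent word pairs, with sentence boundary markers added."""
    padded = ["<s>"] + tokens + ["</s>"]
    return list(zip(padded, padded[1:]))


def train_bigram_counts(corpus):
    """Count word pairs and their left contexts over a list of sentences."""
    pair_counts = Counter()
    context_counts = Counter()
    for sentence in corpus:
        for prev, word in bigrams(sentence.split()):
            pair_counts[(prev, word)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts


def sentence_probability(sentence, pair_counts, context_counts):
    """Unsmoothed maximum-likelihood bigram probability of a sentence."""
    prob = 1.0
    for prev, word in bigrams(sentence.split()):
        if context_counts[prev] == 0:
            return 0.0
        prob *= pair_counts[(prev, word)] / context_counts[prev]
    return prob


# Toy corpus (an assumption for illustration, not data from this work).
toy_corpus = [
    "the dog that chased the cats barks",
    "the cats bark",
]
pair_counts, context_counts = train_bigram_counts(toy_corpus)

# The subject "dog" is singular, so only the first sentence is grammatical.
p_grammatical = sentence_probability(
    "the dog that chased the cats barks", pair_counts, context_counts
)
p_ungrammatical = sentence_probability(
    "the dog that chased the cats bark", pair_counts, context_counts
)
print(p_grammatical, p_ungrammatical)
```

In this toy setting, both sentences receive the same probability, because each adjacent word pair has been seen in the corpus and the model never relates the verb to its distant subject.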
We propose a two-stage MT system to overcome these problems. We developed a PBMT system for the first stage. This PBMT system had low coverage and high quality. When Japanese sentences were translated using this system, the quality of the output was good, and the outputs contained grammatical information. For Japanese sentences that were not translated by the PBMT system, we used a standard SMT system. Therefore, we can obtain good quality from the entire system. In addition, PBMT systems are usually created manually, which incurs a huge labor cost. We therefore developed a PBMT system that is created automatically. However, the output of this automatic PBMT was sometimes less fluent, so we added SMT after PBMT to improve the fluency. In this system, PBMT serves as the pre-processing stage of SMT.
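
The control flow of the two-stage system described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: the function names, the convention that the PBMT stage returns `None` for an uncovered sentence, and the toy stand-in translators are all hypothetical. Whether the SMT stage applied after PBMT and the standard SMT system share a model is not specified here, so the sketch keeps them as separate arguments.

```python
from typing import Callable, Optional


def two_stage_translate(
    source: str,
    pbmt: Callable[[str], Optional[str]],
    smt_after_pbmt: Callable[[str], str],
    standard_smt: Callable[[str], str],
) -> str:
    """Translate with PBMT followed by SMT, or with standard SMT alone.

    `pbmt` is assumed to return None when it does not cover the input.
    """
    draft = pbmt(source)
    if draft is None:
        # The low-coverage first stage produced nothing: use standard SMT.
        return standard_smt(source)
    # The first stage produced a draft: SMT is applied to improve fluency.
    return smt_after_pbmt(draft)


# Toy stand-ins for the three components (hypothetical, for illustration).
TOY_PATTERNS = {"kore wa pen desu": "this be pen"}


def toy_pbmt(source: str) -> Optional[str]:
    """Return a draft translation only for inputs the toy table covers."""
    return TOY_PATTERNS.get(source)


def toy_smt_after_pbmt(draft: str) -> str:
    """Pretend to smooth the draft produced by the first stage."""
    return draft.replace("this be pen", "this is a pen")


def toy_standard_smt(source: str) -> str:
    """Pretend to translate the source sentence directly."""
    return "<smt> " + source


covered = two_stage_translate(
    "kore wa pen desu", toy_pbmt, toy_smt_after_pbmt, toy_standard_smt
)
uncovered = two_stage_translate(
    "unknown sentence", toy_pbmt, toy_smt_after_pbmt, toy_standard_smt
)
print(covered)
print(uncovered)
```

Passing the three components as functions keeps the routing logic independent of any particular PBMT or SMT toolkit.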