BLEU

$\displaystyle \mathrm{BLEU}$	$\textstyle =$	$\displaystyle \mathrm{BP}\cdot\exp\left(\sum\limits_{n = 1}^{N} W_{n}\log_e P_{n}\right)$	(3.1)
$\displaystyle W_{n}$	$\textstyle =$	$\displaystyle \frac{1}{N}$	(3.2)
$\displaystyle P_{n}$	$\textstyle =$	$\displaystyle \frac{\sum\limits_{i} \text{(number of $n$-grams in output sentence $i$ that match reference sentence $i$)}}{\sum\limits_{i} \text{(total number of $n$-grams in output sentence $i$)}}$	(3.3)

Here, BP is a correction factor that keeps short translations from receiving an unduly high score, and $W_{n}$ is the weight of the $n$-gram. A worked example is shown below.
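The text describes BP only qualitatively. The following is a minimal sketch, assuming the standard brevity penalty from the original BLEU definition: 1 when the output is longer than the reference, and exp(1 - r/c) otherwise, where c is the output length and r the reference length. The function name `brevity_penalty` and its parameters are illustrative and do not come from this document.

```python
import math


def brevity_penalty(output_len: int, reference_len: int) -> float:
    """Brevity penalty BP (assumed standard form; not spelled out in the text)."""
    if output_len == 0:
        # An empty output cannot match anything, so it gets the lowest factor.
        return 0.0
    if output_len > reference_len:
        # Outputs longer than the reference are not penalized.
        return 1.0
    # Shorter (or equal-length) outputs are scaled by exp(1 - r/c) <= 1.
    return math.exp(1.0 - reference_len / output_len)


print(brevity_penalty(10, 9))  # output longer than the reference
print(brevity_penalty(5, 10))  # output half as long as the reference
```

Under this assumption, an output of the same length as the reference also gets BP = 1, and the penalty grows as the output gets shorter.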

Example
Japanese sentence: お先に失礼します。
Reference: Excuse me , I must be going now .
Output: Excuse me , but I must be going now .

$\displaystyle P_1=\frac{9}{10},\quad P_2=\frac{7}{9},\quad P_3=\frac{5}{8},\quad P_4=\frac{3}{7},\quad W_1=W_2=W_3=W_4=\frac{1}{4}\ (N=4)$	(3.4)

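$\displaystyle \mathrm{BLEU\ score}$	$\textstyle =$	$\displaystyle e^{W_4(\log_e P_1+\log_e P_2+\log_e P_3+\log_e P_4)}$	(3.5)
	$\textstyle =$	$\displaystyle e^{\frac{1}{4}\left(\log_e\frac{9}{10}+\log_e\frac{7}{9}+\log_e\frac{5}{8}+\log_e\frac{3}{7}\right)}$	(3.6)
	$\textstyle =$	$\displaystyle 0.6580$	(3.7)

BP does not appear in (3.5) because the output sentence (10 tokens) is longer than the reference (9 tokens), so no length correction applies and BP = 1.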
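The worked example can be checked with a short script. This is a minimal sketch, assuming whitespace tokenization, case-sensitive matching, clipped n-gram counts, and BP = 1; the helper names `ngrams`, `ngram_precision`, and `bleu` are illustrative and not taken from any particular toolkit.

```python
import math
from collections import Counter
from fractions import Fraction


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(output, reference, n):
    """P_n for one sentence pair: matched n-grams / n-grams in the output."""
    out_counts = Counter(ngrams(output, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clipping (assumption): an n-gram counts at most as often as it
    # occurs in the reference.
    matched = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
    total = sum(out_counts.values())
    return Fraction(matched, total) if total else Fraction(0)


def bleu(output, reference, max_n=4, bp=1.0):
    """BP * exp(sum_n W_n * log P_n) with uniform weights W_n = 1/max_n."""
    precisions = [ngram_precision(output, reference, n)
                  for n in range(1, max_n + 1)]
    if any(p == 0 for p in precisions):
        return 0.0  # log(0) is undefined; the geometric mean is zero
    log_mean = sum(math.log(float(p)) for p in precisions) / max_n
    return bp * math.exp(log_mean)


reference = "Excuse me , I must be going now .".split()
output = "Excuse me , but I must be going now .".split()

precisions = [ngram_precision(output, reference, n) for n in range(1, 5)]
print([str(p) for p in precisions])         # ['9/10', '7/9', '5/8', '3/7']
print(f"{bleu(output, reference):.4f}")     # 0.6580
```

The product of the four precisions is 3/16, and its fourth root is about 0.6580, which agrees with (3.7).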

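BLEU often agrees with human evaluation for language pairs with similar grammatical structure, such as English and French. For language pairs with different grammatical structure, such as English and Japanese, however, it can disagree with human evaluation. One cause is that BLEU derives its score from the number of matching partial word sequences. As a result, an output sentence that locally contains the same word sequences as the reference receives a high score, even when it contains grammatical errors. Table 3.2 gives a concrete example, and Table 3.3 gives the corresponding BLEU scores.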


Table 3.2: Translation examples
Input sentence	その機械の構造には欠陥がある。
Output 1	The structure of the machine has a defect .
Output 2	The structure of the is a fault in the machine .
Reference	There is a fault in the machine 's construction .


Table 3.3: Sentence-level BLEU scores
Output 1	BLEU = 0.000
Output 2	BLEU = 0.367

Table 3.3 shows that, at the sentence level, output 2 receives the better BLEU score of the two. However, output 2 contains the sequence "the is" and is therefore grammatically incorrect.
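The effect can be made visible by counting n-gram matches for the two outputs. This is a minimal sketch, assuming whitespace tokenization, case-sensitive matching, and clipped counts; the helper names are illustrative. It does not attempt to reproduce the exact score of output 2, since that value depends on tokenization, casing, and smoothing choices that the text does not specify.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def matched_and_total(output, reference, n):
    """(matched n-grams, total n-grams in the output), with clipped counts."""
    out_counts = Counter(ngrams(output, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
    return matched, sum(out_counts.values())


reference = "There is a fault in the machine 's construction .".split()
outputs = {
    "output 1": "The structure of the machine has a defect .".split(),
    "output 2": "The structure of the is a fault in the machine .".split(),
}

results = {}
for name, tokens in outputs.items():
    # One (matched, total) pair for each n-gram order n = 1..4.
    results[name] = [matched_and_total(tokens, reference, n)
                     for n in range(1, 5)]
    print(name, results[name])

# The ungrammatical bigram is simply one more unmatched n-gram for BLEU.
print(("the", "is") in set(ngrams(reference, 2)))  # False
```

Under these assumptions, output 1 has no matching 3-grams or 4-grams, so the geometric mean in (3.1) collapses to zero. Output 2 copies the run "is a fault in the machine" from the reference and therefore has matches at every order, while the ungrammatical "the is" costs it no more than any other unmatched bigram.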