
Recognition Algorithm Using Word Trigram Models


In this section, we describe a frame-synchronous speech recognition algorithm that uses word trigram models and is based on the Viterbi search. Among the many speech recognition algorithms, the Viterbi search (also known as one-pass DP) is particularly well suited to Markov-model language models such as bigrams and trigrams.

To compute the Viterbi path up to the $n$-th recognized word $w_n$ at time $t$, the bigram algorithm needs only the Viterbi paths emerging from each candidate for the previous word $w_{n-1}$ at time $t-1$. For the trigram algorithm, however, each previous-word candidate $w_{n-1}$ requires not only the Viterbi paths emerging from words at time $t-1$, but also the most likely paths passing through every possible word-pair combination $(w_{n-2}, w_{n-1})$. The trigram algorithm therefore inherently demands large amounts of memory and computation.
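For example, with a hypothetical vocabulary of $Q = 1{,}000$ words, the bigram search keeps one best-path score per HMM state for each of the $Q = 10^3$ previous-word candidates, whereas the trigram search must keep one for each of the $Q^2 = 10^6$ word-pair combinations $(w_{n-2}, w_{n-1})$.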

We define $w(t)$ as the word uttered at time $t$, $w_{\mathrm{Prev}}(t)$ as the word uttered immediately before $w(t)$, and
$G_t(w_1,w_0,i) = \mathrm{Prob}\bigl(O_0,O_1,\ldots,O_{t-1} \wedge s(t)=i,\ w(t)=w_0,\ w_{\mathrm{Prev}}(t)=w_1\bigr)$.
We can calculate $G_t(w_1,w_0,i)$ recursively using the following algorithm (Table 1).


Table 1: Recognition algorithm using word trigram models (Viterbi search)

[ Definition ]
$l_w$: number of states of word $w$
$a^w_{ij}$: transition probability in word $w$ from state $s_i$ to state $s_j$
$b^w_j(O_t)$: symbol output probability in word $w$ at state $s_j$ for the observation vector at frame $t$
$P(w_c \vert w_a, w_b)$: trigram probability of word $w_c$ after $w_a, w_b$ have appeared
$Q$: vocabulary size
$T$: number of input frames
$\alpha$: weight between the trigram probability and the HMM likelihood

[ Initialization ]
Execute step 1 for $w_0 = 0, \ldots, Q-1$:
1) $G_0(start, w_0, 0) = P(w_0 \vert start, start)$   ($start$ denotes the sentence head)

[ Viterbi search ]
Execute steps 2 and 6 for $t = 0, 1, \ldots, T-1$:
2) execute step 3 for $w_1 = 0, \ldots, Q-1$
3) execute step 4 for $w_0 = 0, \ldots, Q-1$
4) execute step 5 for $i = 0, 1, \ldots, l_{w_0}-2$
5) if $i = 0$:
$G_t(w_1,w_0,i) = G_{t-1}(w_1,w_0,i) \times a^{w_0}_{i,i} \times b^{w_0}_i(O_t)$
else:
$G_t(w_1,w_0,i) = \max\bigl( G_{t-1}(w_1,w_0,i) \times a^{w_0}_{i,i} \times b^{w_0}_i(O_t),\ G_{t-1}(w_1,w_0,i-1) \times a^{w_0}_{i-1,i} \times b^{w_0}_{i-1}(O_t) \bigr)$

[ Viterbi search ( word boundaries ) ]
6) execute step 7 for $w_1 = 0, 1, \ldots, Q-1$
7) execute step 8 for $w_0 = 0, 1, \ldots, Q-1$
8) $\Delta = \max_{0 \leq w_2 \leq Q-1} \bigl( G_{t-1}(w_2,w_1,l_{w_1}-2) \times a^{w_1}_{l_{w_1}-2,\,l_{w_1}-1} \times b^{w_1}_{l_{w_1}-2}(O_t) \times P(w_0 \vert w_2, w_1)^{\alpha} \bigr)$
if $\Delta \geq G_t(w_1,w_0,0)$ then $G_t(w_1,w_0,0) = \Delta$
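To make the control flow of Table 1 concrete, the following is a minimal Python sketch of the recursion. It is an illustration under simplifying assumptions, not the paper's implementation: all words share the same left-to-right HMM topology, the function emis(w, j, o) stands in for $b^w_j(O_t)$, the array trigram[w2][w1][w0] stands in for $P(w_0 \vert w_2, w_1)$ (with index $Q$ reserved for $start$), and the initialization is treated as the prior before the first frame is consumed.

    import numpy as np

    def trigram_viterbi(obs, n_words, n_states, trans, emis, trigram, alpha):
        # obs:      list of T observation frames O_0 .. O_{T-1}
        # n_words:  vocabulary size Q
        # n_states: l_w, number of states per word (shared here for simplicity)
        # trans:    trans[w][i][j] = a^w_{ij}
        # emis:     emis(w, j, o)  = b^w_j(o)
        # trigram:  trigram[w2][w1][w0] = P(w0 | w2, w1), shape (Q+1, Q+1, Q),
        #           where index Q stands for the sentence head "start"
        # alpha:    weight between trigram probability and HMM likelihood
        START = n_words
        n_emit = n_states - 1        # states 0 .. l_w-2 emit; l_w-1 is the exit

        # G[w1, w0, i]: best score with current word w0 in state i, previous word w1
        G = np.zeros((n_words + 1, n_words, n_emit))
        for w0 in range(n_words):    # step 1: sentence-head initialization
            G[START, w0, 0] = trigram[START][START][w0]

        for o in obs:                # steps 2 and 6, for t = 0 .. T-1
            G_prev = G.copy()        # G_{t-1}

            # Steps 2-5: within-word transitions (self-loop, or advance one state).
            for w1 in range(n_words + 1):
                for w0 in range(n_words):
                    for i in range(n_emit):
                        stay = G_prev[w1, w0, i] * trans[w0][i][i] * emis(w0, i, o)
                        if i == 0:
                            G[w1, w0, i] = stay
                        else:
                            move = (G_prev[w1, w0, i - 1]
                                    * trans[w0][i - 1][i] * emis(w0, i - 1, o))
                            G[w1, w0, i] = max(stay, move)

            # Steps 6-8: word boundaries. Exit word w1, enter state 0 of word w0,
            # maximizing over the word-before-last w2 with the trigram weight.
            # (The w2 range here includes the start index so the first word can be
            # exited; the table writes 0 .. Q-1.)
            for w1 in range(n_words):
                last = n_states - 2  # final emitting state of w1
                exit_score = [G_prev[w2, w1, last]
                              * trans[w1][last][n_states - 1]
                              * emis(w1, last, o)
                              for w2 in range(n_words + 1)]
                for w0 in range(n_words):
                    delta = max(exit_score[w2] * trigram[w2][w1][w0] ** alpha
                                for w2 in range(n_words + 1))
                    if delta >= G[w1, w0, 0]:    # step 8
                        G[w1, w0, 0] = delta
        return G

    # Toy usage with random (unnormalized) model parameters:
    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        Q, L, T = 3, 3, 5
        trans = rng.random((Q, L, L))
        trigram = rng.random((Q + 1, Q + 1, Q))
        emis = lambda w, j, o: 0.5   # dummy b^w_j(O_t)
        G = trigram_viterbi(list(range(T)), Q, L, trans, emis, trigram, alpha=1.0)
        print(G.shape)               # (Q+1, Q, L-1)

A practical decoder would work with log probabilities to avoid underflow, prune low-scoring hypotheses, and keep backpointers to recover the recognized word sequence; those details are omitted here for readability.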


