
Introduction


This paper describes a spontaneous speech recognition algorithm using word trigram models and a procedure for dealing with filled-pauses.

Spontaneous speech would be ideal for human-machine interfacing, were it not for the many serious technical problems associated with its automatic recognition: filled-pauses, hesitations, self-corrections, out-of-vocabulary words, etc. These phenomena make acoustic modeling difficult. Therefore, language models with lower perplexity have been preferred for spontaneous speech recognition.

Among the many language models for speech recognition, word bigram models[1] seem to be the most widely used because of their effectiveness and simplicity. However, for spontaneous speech recognition, word trigram models[2] should be used because they have been shown to yield lower perplexity than other language models (for example, network grammars and context-free grammars). Yet word trigram models pose problems of their own when applied to speech recognition systems. One of the biggest is the large memory requirement and computational cost of Viterbi decoding[3].
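For reference, the standard textbook definitions behind these terms can be written as follows. This is a sketch of the usual formulation, not a formula taken from this paper; the symbols $w_i$ (the $i$-th word), $n$ (the number of words in the test text), and $PP$ (perplexity) are notation introduced here for illustration.

```latex
% Trigram approximation: each word is conditioned on its two predecessors.
P(w_1, \ldots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})

% Word perplexity of a test text of n words under the model.
PP = P(w_1, \ldots, w_n)^{-1/n}
   = \exp\!\left( -\frac{1}{n} \sum_{i=1}^{n} \ln P(w_i \mid w_{i-2}, w_{i-1}) \right)
```

The decoding cost arises because a trigram decoder must distinguish search hypotheses by their two-word history. Assuming a naive expansion over a 1567-word vocabulary, that gives up to $1567^2 = 2{,}455{,}489$ distinct histories, compared with 1567 for a bigram model.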

One possible way of avoiding these problems is to use an N-best paradigm. First, N-best candidate lists are generated using a word bigram model. Next, the candidates on these lists are rescored using a word trigram model, and the best candidate is selected as the recognition result. This method has been used in a BBN recognition system[4]. Its disadvantage is that the correct sentence may be missing from the N-best list produced with the word bigram model, in which case trigram rescoring cannot recover it; consequently, this approach may not realize the full performance of the word trigram model. The need therefore remains for efficient implementations that apply word trigram models directly.
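The two-pass N-best procedure described above can be sketched roughly as follows. This is a minimal illustration, not the BBN implementation: the function names, the `floor` log-probability used in place of proper smoothing, the `lm_weight` parameter, and the boundary symbols `<s>` and `</s>` are all assumptions made for this example.

```python
def trigram_logprob(words, trigram_lp, floor=-10.0):
    """Sum log P(w_i | w_{i-2}, w_{i-1}) over a sentence.

    trigram_lp maps (w_{i-2}, w_{i-1}, w_i) tuples to log-probabilities.
    Unseen trigrams receive `floor`, a crude stand-in for real smoothing.
    """
    padded = ["<s>", "<s>"] + list(words) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        total += trigram_lp.get(tuple(padded[i - 2:i + 1]), floor)
    return total


def rescore_nbest(nbest, trigram_lp, lm_weight=1.0):
    """Pick the best hypothesis from a first-pass N-best list.

    nbest is a list of (words, acoustic_logscore) pairs, assumed to come
    from a first pass that used a bigram model. Each hypothesis is
    rescored as acoustic score plus weighted trigram log-probability.
    """
    def combined(hyp):
        words, acoustic = hyp
        return acoustic + lm_weight * trigram_logprob(words, trigram_lp)

    return max(nbest, key=combined)[0]
```

Note that `rescore_nbest` can only return a hypothesis that is already in `nbest`, which is exactly the limitation noted above: if the first pass drops the correct sentence, the trigram model never sees it.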

To address these problems, we have devised a new algorithm that combines a word trigram model with a procedure for dealing with filled-pauses, as a first step toward spontaneous, large-vocabulary, speaker-independent speech recognition. First, we modified a frame-synchronous Viterbi recognition algorithm that uses the word trigram model so as to greatly reduce its memory requirement and computational cost. Second, we addressed the problem of handling pauses and filled-pauses in language modeling by ignoring them in the calculation of the word trigram probabilities.
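One plausible reading of "ignoring them in the calculation of the word trigram probabilities" is that a filled-pause is neither scored by the language model nor allowed into the two-word history, so the words on either side of it are treated as adjacent. The sketch below illustrates that reading only; the filler inventory `FILLED_PAUSES`, the function name, and the boundary symbols are hypothetical and not taken from the paper.

```python
# Hypothetical filler inventory, for illustration only.
FILLED_PAUSES = {"uh", "um", "<pause>"}


def trigram_contexts(words, fillers=FILLED_PAUSES):
    """Return the (w_{i-2}, w_{i-1}, w_i) triples to be scored.

    Fillers are skipped entirely: they receive no trigram probability
    and do not displace real words from the history.
    """
    history = ["<s>", "<s>"]
    triples = []
    for w in words:
        if w in fillers:
            continue  # filler: not scored, history left unchanged
        triples.append((history[0], history[1], w))
        history = [history[1], w]
    triples.append((history[0], history[1], "</s>"))
    return triples
```

Under this scheme, "i um would uh like" and "i would like" produce the same list of trigrams, so the trigram model does not need to be trained on text containing fillers.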

Speaker-independent speech recognition experiments were carried out on 261 spontaneously spoken sentences. When recording the speech data, each speaker kept the intention of the utterance in mind and spoke freely, producing many kinds of filled-pauses. The recognition task was a conference registration task; the vocabulary size was 1567 words and the word perplexity was 4. The sentence recognition rate increased from 25.3% to 42.0% for the top-1 candidate and from 39.5% to 63.4% for the top-8 candidates. These results show the effectiveness of combining a word trigram model with the filled-pause procedure.



Jin'ichi Murakami, January 19, 2001