
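Study of Spontaneous Speech Recognition based on Stochastic Language Modeling

Doctor of Engineering

Jin'ichi Murakami

Toyohashi University of Technology

Thesis Summary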

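For Japanese speech input, it has been pointed out that natural language processing based on Japanese grammar and semantics must be used alongside speech recognizers, in order to overcome the limits of recognizers that rely only on the physical characteristics of speech. Many language models are available for this processing; broadly, they divide into rule-based language models and probability-based language models.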

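Probability-based research on language basically requires a large amount of text data. For English, the importance of databases has long been recognized, and corpora such as the Brown corpus and the AP corpus have existed for many years. For Japanese, however, no large machine-readable database was available until recently, so few studies of stochastic language models had been reported. This situation has begun to change: newspaper articles are now distributed on CD-ROM, and Advanced Telecommunications Research Institute International (ATR) sells various dialogue data sets.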

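This thesis therefore demonstrates quantitatively, through simulations and actual speech recognition experiments, that N-gram models are effective for Japanese. With spontaneous speech recognition in view, it also studies the linguistic and acoustic characteristics of spontaneous speech. Finally, it proposes a sentence recognition algorithm aimed at spontaneous speech and reports the results of recognition experiments.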

The chapters are organized as follows.

Chapter 1 states the purpose and motivation of this thesis.

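Chapter 2 describes the component technologies needed to build a speech recognition system. The HMM is the basis of speech recognition, and its parameters are estimated with the Baum-Welch algorithm, which maximizes the likelihood of the training data. The basic algorithms for continuous speech recognition include tree-trellis search and Viterbi search (one-pass DP). Practical recognition algorithms, however, must reduce the amount of computation and memory required. This chapter reports on these component technologies.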
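As a rough illustration of the Viterbi search mentioned above, the following is a minimal sketch for a discrete-output HMM. The two states, the three observation symbols, and all probabilities are hypothetical toy values chosen for illustration; they are not taken from the thesis, and a practical one-pass DP recognizer would add word-level connections and pruning.

```python
import math

def viterbi(obs, states, log_init, log_trans, log_emit):
    """Return the most likely state path and its log-probability."""
    # best[t][s]: best log-score of any state path that ends in s at time t
    best = [{s: log_init[s] + log_emit[s][obs[0]] for s in states}]
    back = []  # back[t][s]: best predecessor of s at time t + 1
    for o in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][o]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

def logs(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in table.items()}

# Hypothetical toy model (illustrative values only)
states = ["s1", "s2"]
init = {"s1": 0.6, "s2": 0.4}
trans = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit = {"s1": {"x": 0.5, "y": 0.4, "z": 0.1},
        "s2": {"x": 0.1, "y": 0.3, "z": 0.6}}

path, logp = viterbi(
    ["x", "y", "z"], states, logs(init),
    {s: logs(trans[s]) for s in states},
    {s: logs(emit[s]) for s in states},
)
# path == ["s1", "s1", "s2"]; exp(logp) is about 0.01512
```

Working in the log domain avoids numerical underflow on long observation sequences, which matters once the sequence is a whole utterance rather than three symbols.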

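Chapter 3 studies how much data is needed, and how well the estimates converge, when language is represented by a Markov model. The main quantities examined are entropy and coverage. The results show that 98% of all text data can be approximated by a Markov model, while the remaining 2% does not converge.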
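The two quantities named above can be illustrated with a minimal sketch for a first-order (bigram) Markov model. The definitions used here are assumptions made for illustration: entropy is the maximum-likelihood conditional entropy of the next symbol given the previous one, and coverage is the fraction of test bigrams already seen in the training text. The thesis may define or estimate them differently.

```python
import math
from collections import Counter

def bigrams(seq):
    """All adjacent symbol pairs in a sequence (string or list of tokens)."""
    return list(zip(seq, seq[1:]))

def conditional_entropy(train):
    """Entropy in bits of the next symbol given the previous one (ML estimate)."""
    pairs = bigrams(train)
    pair_counts = Counter(pairs)
    context_counts = Counter(a for a, _ in pairs)
    total = len(pairs)
    return -sum(
        (c / total) * math.log2(c / context_counts[a])
        for (a, _), c in pair_counts.items()
    )

def coverage(train, test):
    """Fraction of bigram tokens in `test` that also occur in `train`."""
    seen = set(bigrams(train))
    test_pairs = bigrams(test)
    return sum(p in seen for p in test_pairs) / len(test_pairs)
```

Computing both values on training sets of increasing size is one way to observe convergence: the entropy estimate stabilizes and the coverage approaches a ceiling as more text is added.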

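Chapter 4 examines the effectiveness of bigrams and trigrams of kana, kanji, and parts of speech in Japanese. Sentence recognition using continuous-density HMMs and word trigrams is studied on three tasks (newspaper articles, report writing for medical X-ray CT findings, and ATR international conference reservations), and its effectiveness is demonstrated.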

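Chapter 5 describes an algorithm for spontaneous speech recognition and its experimental results. Filled pauses (interjections), hesitations, slips of the tongue, and restarts occur frequently in spontaneous speech, and these filled pauses and restarts can appear anywhere in a sentence. Skipping such words therefore makes it possible to recognize spontaneous speech. The method was implemented, and experiments demonstrate its effectiveness.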
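The idea of skipping such words can be sketched on the language-model side as follows. This is only an illustration under stated assumptions: the filler set, the `<s>` start symbol, and the function name are hypothetical, and the actual algorithm in the thesis performs the skip during decoding rather than on a finished word sequence.

```python
# Hypothetical filler list; the two entries are the filled pauses
# quoted elsewhere in this summary.
FILLERS = {"あのー", "えーと"}

def trigram_events(words, fillers=FILLERS):
    """Return the (w1, w2, w3) trigrams to score, with fillers skipped.

    A skipped word is neither predicted by the trigram model nor added
    to the history, so the words around it stay adjacent in the context.
    """
    history = ["<s>", "<s>"]  # assumed sentence-start padding
    events = []
    for w in words:
        if w in fillers:
            continue
        events.append((history[0], history[1], w))
        history = [history[1], w]
    return events
```

For example, `trigram_events(["えーと", "会議", "の", "あのー", "予約"])` yields the same trigrams as the fluent sequence 会議 の 予約, so a trigram model trained on fluent text can still be applied.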

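Chapter 6 studies the characteristics of spontaneous speech from both linguistic and acoustic viewpoints. The results show that 50% of dialogue sentences contain filled pauses such as 「あのー」 ("anō") and 「えーと」 ("ēto"), and that restarts appear in about 10%. The chapter also describes the acoustic differences between read speech and spontaneous speech for four speakers. Although spontaneous speech is articulated less clearly than read speech, the phoneme recognition rates do not differ greatly when the acoustic models are trained separately for each speaking condition.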

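Chapter 7 describes the amount of information carried by the prosodic information in speech. Prosody consists of many elements, such as duration; this chapter focuses on the positions of accent phrase boundaries and the positions of accent nuclei, and measures the information they carry. The experiments show that the accent information, including the positions of accent phrase boundaries, carries 5.16 bits.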

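Chapter 8 addresses the problem of segmenting a signal sequence generated by a number of different signal sources and identifying which source generated each part. As an application, it takes the identification of utterances from multiple speakers and presents a solution based on the Ergodic HMM. The experiments show that, for multi-speaker utterances, LPC cepstra analyzed with a long window of about 341 ms give better identification performance, and that selecting the model with the higher likelihood improves the average identification rate.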

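Chapter 9 describes the automatic learning of a stochastic network grammar using the Ergodic HMM. An Ergodic HMM and a stochastic network grammar have similar structures and are expressed with the same kind of parameters, and the HMM parameters can be trained with the Baum-Welch algorithm on a large amount of text data. Word sequences taken from actual conversations were used to train an Ergodic HMM in an attempt to extract a stochastic network grammar automatically. The structure of the resulting Ergodic HMM shows grammatical features that capture the characteristics of the training data, and the model outputs words grouped by their function in the sentence. The resulting Ergodic HMM was then used as a language model for continuous speech recognition. In this recognition experiment it outperformed a word bigram, demonstrating the effectiveness of the proposed algorithm.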

Chapter 10 summarizes the results of this thesis and describes topics for future research.

ABSTRACT

Study of Spontaneous Speech Recognition based on Stochastic Language Modeling

There are two types of natural language modeling. One covers the class of deterministic models, such as network grammar or context-free grammar, which exploit some known specific properties of the language. The other includes the class of statistical models, such as bigram or trigram, in which one tries to characterize the statistical properties of the corpus. These statistical models include stochastic context-free grammar and the Markov process, a sort of non-deterministic finite state automaton.

A lot of text data is needed to study statistical language models. In English, the Brown corpus and AP corpus are well known. However, such sources for Japanese have only just been created. As one example, newspapers are now available on CD-ROM.

In this paper, we describe stochastic language modeling with emphasis on the bigram model and trigram model. We also describe the efficiency of these models for continuous speech recognition. Moreover, a spontaneous speech recognition system based on stochastic language models is described.

Chapter 1 describes the background, motivation, and special features of this study.

In Chapter 2, we outline some common speech recognition models and algorithms like one-pass Viterbi decoding, HMM, and the Baum-Welch algorithm.

In Chapter 3, we study N-gram modeling for language processing, especially in terms of entropy and cover rates.

In Chapter 4, we study the effectiveness of bigrams and trigrams of Kana, Kanji, and part-of-speech. We carried out experiments on three tasks: newspapers, X-ray CT scanning reports, and ATR international conference tasks.

In Chapter 5, we describe a spontaneous speech recognition algorithm based on word trigram models. Focusing on spontaneous speech recognition, we propose a skip phone procedure to handle the many filled pauses and false starts observed in spontaneous speech. Even though the proposed method employs a simple procedure, we obtain a 47.7% sentence recognition rate for spontaneous speech. Including semantically correct sentences, the sentence recognition rate is about 75%.

In Chapter 6, we present a preliminary study of spontaneous speech recognition, describing both the acoustic and linguistic characteristics of spontaneous speech and comparing spontaneous speech with read speech. In hand-labeled spontaneous speech, the labeling uncertainty increased by about 50%. A phoneme recognition experiment yielded a twofold increase in the error rate. Filled pauses appeared in 40% of 11,000 sentences of spontaneous speech utterances, and false starts were found in 10% of the sentences.

In Chapter 7, we investigate the amount of information contained by accents in speech signals. It is very difficult to measure the amount of information in accents because these concepts are not clearly defined. Therefore, Kana-Kanji translations are used. First, the number of Kanji candidates that are translated using syllable information is counted. Second, the number of Kanji candidates that are translated using syllable and accent information is counted. Their ratio indicates that the amount of information in accents is 5.16 bits. Although this quantity is small compared to Japanese syllables, it is important for speech recognition.
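One plausible reading of this candidate-count ratio, given here only as an assumption and not as the formula used in the thesis, is the base-2 logarithm of the ratio of the two counts. The symbols $N_{\mathrm{syl}}$ (number of Kanji candidates given syllable information only) and $N_{\mathrm{syl+acc}}$ (number of candidates given syllable and accent information) are introduced for illustration.

```latex
I_{\mathrm{accent}} = \log_2 \frac{N_{\mathrm{syl}}}{N_{\mathrm{syl+acc}}}
\qquad\text{so that}\qquad
I_{\mathrm{accent}} = 5.16\ \text{bits}
\;\Longleftrightarrow\;
\frac{N_{\mathrm{syl}}}{N_{\mathrm{syl+acc}}} = 2^{5.16} \approx 35.8
```

Under this reading, 5.16 bits would mean that adding accent information reduces the number of Kanji candidates by a factor of roughly 36.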

In Chapter 8, we consider signals that originate from a number of sources. As an application of a multiple signal source identification problem, an experiment is performed on unknown speaker identification. The results indicate that the model is sensitive to the initial values of the Ergodic HMM and that employing the long-distance LPC cepstrum is effective for signal preprocessing.

In Chapter 9, we investigate statistical network grammar using the Ergodic Hidden Markov Model (HMM). The HMM is very rich in mathematical structure, so language models can be determined more precisely than with stochastic network grammar or Markov processes. In this chapter, we develop a statistical network grammar automatically from about 4000 words using the Ergodic HMM. The resultant model indicates that some grammatical features exist even though the process was automatic.

Finally, in Chapter 10, we summarize our work and describe open problems and future work.


Jin'ichi Murakami, January 5, 2001