[
This paper describes the frequency of occurrence of n-grams over syllables, kanji-kana characters, parts of speech, and words in Japanese newspaper text and X-ray CT scanning reports. An algorithm for continuous speech sentence recognition using word HMMs and a word bigram model is also discussed.
It is well known that word bigram and word trigram models are effective tools for a speech recognition system. However, the convergence properties of n-gram probabilities have not been reported for Japanese.
In this paper, we first report the convergence of unigram, bigram, trigram and 4-gram language models for syllable, kanji-kana, word and part-of-speech units extracted from newspaper text and X-ray CT scanning reports.
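As an illustration of the quantities involved, the following minimal sketch counts n-grams over a sequence of units and forms an unsmoothed maximum-likelihood bigram estimate. The function names and the toy token sequence are assumptions made for this example; the estimator and the convergence measure actually used in the paper may differ.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram (as a tuple) in a sequence of units."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bigram_probability(tokens, prev, word):
    """Unsmoothed maximum-likelihood estimate of P(word | prev).

    Assumption: the history count is the number of times `prev` appears
    with a successor, i.e. in any position except the last.
    """
    bigrams = ngram_counts(tokens, 2)
    history = Counter(tokens[:-1])
    if history[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / history[prev]


# Toy sequence; the units could equally be syllables, characters, or POS tags.
tokens = ["a", "b", "a", "c", "a", "b"]
p = bigram_probability(tokens, "a", "b")
```

The same counting function applies unchanged to any of the four unit types, since it only requires a sequence of hashable items.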
Second, we report a phrase recognition algorithm using the word bigram model and word HMMs for X-ray CT scanning reports. The large amount of training data usually required for word HMMs could be reduced to a single utterance using a technique known as fuzzy vector quantization.
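For readers unfamiliar with fuzzy vector quantization, the sketch below shows one common formulation, in which a feature vector receives a graded membership in every codebook entry rather than being assigned to a single nearest entry. The membership formula is the one used in fuzzy c-means; the function name, the fuzziness parameter `m`, and the toy codebook are assumptions for this example, and the formulation used in the paper may differ.

```python
import math


def fuzzy_memberships(vector, codebook, m=2.0):
    """Return the membership of `vector` in each codebook entry.

    Memberships are proportional to distance ** (-2 / (m - 1)) and sum to 1.
    `m` > 1 is a hypothetical fuzziness parameter; larger values give
    flatter (more evenly spread) memberships.
    """
    dists = [math.dist(vector, code) for code in codebook]
    # A vector that coincides with one or more codewords belongs only to them.
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(codebook))]
    exponent = 2.0 / (m - 1.0)
    weights = [d ** -exponent for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


# Toy two-entry codebook in a two-dimensional feature space.
codebook = [(0.0, 0.0), (2.0, 0.0)]
u = fuzzy_memberships((0.5, 0.0), codebook)
```

Because every frame contributes to several codebook entries at once, output distributions can be estimated from far fewer frames than with hard quantization, which is the general motivation for using this kind of technique with sparse training data.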
A sentence recognition experiment using this algorithm was carried out to test the effectiveness of the word bigram model with a vocabulary of about 3000 words. The following phrase recognition rates were obtained in this experiment: 96.8% for text-closed data and normal findings, 78.1% for text-closed data and abnormal findings, 86.5% for text-open data and normal findings, and 72.1% for text-open data and abnormal findings. (The term "abnormal findings" refers to the content of the X-ray CT scanning reports.)
]