In this section, we describe the experimental results obtained with this algorithm. The experiment is speaker-dependent continuous speech recognition, and the test sentences were spoken by a broadcast announcer. The flooring probability was set to . The experimental conditions are shown in Table 2.
algorithm | continuous mixture HMM + beam search + word trigram models
mixture number | max 14 (set for each syllable)
state number | 3-state 4-loop left-to-right model
acoustic parameters | 16th-order LPC cepstrum + power + delta power + 16th-order delta cepstrum
frame window | 20 ms
frame period | 5 ms
training speech | word speech (5,240 words)
syllable categories | 52 syllables
vocabulary | 1,567 words
beam width | 4,096
duration control | none
language information | word trigram models
unit of recognition | sentence
test sentences | 38 sentences
speaking style | read speech
speech content | international conference task (model conversation)
The probabilities of the word trigram models used as the language information are calculated from training text.
The training data consists of about 15,000 sentences (190,000 words) from the ATR Dialog Database. Table 3 shows the task entropy. The experimental results are shown in Figure 1; results obtained with word bigram models are shown together for comparison. We obtained a 78% sentence recognition rate for text-closed data and 40% for text-open data. Figure 4 shows erroneous outputs for the text-closed data. Many of these outputs are semantically correct; only four sentences are completely wrong. If semantically correct sentences are counted as correct, the sentence recognition rate is 89%. The completely wrong sentences all contain pauses in the speech data.
We believe the pauses cause these errors because the acoustic parameters and the word trigram models do not correspond at the pause positions. However, the recognition rate for text-open data is the same for word bigram models and word trigram models. This is due to the small flooring probability value and the small amount of training text. In statistical language models such as word trigram models, the recognition rate for text-open data depends on how well the training data covers the test data. We therefore believe that the recognition rate for the text-open data has very low reliability.
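The flooring probability mentioned above prevents unseen trigrams from receiving zero probability. The exact smoothing formula is not reproduced here, so the following Python sketch only illustrates one plausible scheme: maximum-likelihood trigram estimates floored at a minimum value (the function names and the floor value `P_FLOOR` are assumptions, not the paper's):

```python
from collections import defaultdict

# Assumed flooring value for illustration; the paper's actual value is not
# reproduced in this section.
P_FLOOR = 1e-6

def train_trigram(sentences):
    """Count trigrams and their bigram histories over a training corpus."""
    tri = defaultdict(int)
    bi = defaultdict(int)
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(len(padded) - 2):
            bi[(padded[i], padded[i + 1])] += 1
            tri[(padded[i], padded[i + 1], padded[i + 2])] += 1
    return tri, bi

def trigram_prob(tri, bi, w1, w2, w3):
    """Maximum-likelihood trigram probability, floored at P_FLOOR so that
    trigrams unseen in training never receive zero probability."""
    history = bi.get((w1, w2), 0)
    if history == 0:
        return P_FLOOR
    return max(tri.get((w1, w2, w3), 0) / history, P_FLOOR)
```

Under such a scheme, a floor that is too small effectively rules out any word sequence not covered by the training text, which is consistent with the text-open behavior discussed above.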
correct → output

kaiginoshukuhakushisetsunitsuiteotazuneshitainodesuga
→ kaiginoshukuhakushisetsunitsuiteotazuneshitaiNdesuga
会議の宿泊施設についてお尋ねしたいのですが
→ 会議の宿泊施設についてお尋ねしたいんですが
(I would like to ask about the accommodation facilities for the conference.)

kyoutopuriNsuhoterugakaigizyounihachikainodesuga
→ kyoutopuriNsuhoterugakaigizyounihachikaiNdesuga
京都プリンスホテルが会議場には近いのですが
→ 京都プリンスホテルが会議場には近いんですが
(The Kyoto Prince Hotel is close to the conference hall.)

soredehakyoutopuriNsuhoteruoyoyakushitainodesuga
→ soredehakyoutopuriNsuhoteruoyoyakushitaiNdesuga
それでは京都プリンスホテルを予約したいのですが
→ それでは京都プリンスホテルを予約したいんですが
(Then I would like to reserve the Kyoto Prince Hotel.)

hoterunotehaimoshiteitadakerunodesuka
→ hoterunotehaimoshiteitadakeruNdesuka
ホテルの手配もしていただけるのですか
→ ホテルの手配もしていただけるんですか
(Could you also arrange the hotel for me?)

dehaonamaetogozyuushooonegaishimasu
→ gohaQpyouninarukatanogozyuushooonegaishimasu
ではお名前とご住所をお願いします
→ ご発表になる方のご住所をお願いします
(correct: Then your name and address, please. / output: The address of the person giving the presentation, please.)

zyuushohatoukyoutominatokushiNbashiiQchoumeichibaNchisaNgoudesu
→ zyuushohanechoQtokyoutonokaiginihatourokunikaNshimashitekyounoseQshoNnoichibaNsaNgoudesu
住所は東京都港区新橋1丁目1番3号です
→ 住所はねちょっと京都の会議には登録に関しまして今日のセッションの1番3号です
(correct: The address is 1-1-3 Shinbashi, Minato-ku, Tokyo. / output: The address is, well, regarding registration for the conference in Kyoto, No. 1-3 of today's session.)

deNwabaNgoumoonegaishimasu
→ deNwabaNgouonegaishimasu
電話番号もお願いします
→ 電話番号お願いします
(Your telephone number too, please; the output drops "too".)

deNwabaNgouhasaNsaNichinoniigoniiichidesu
→ roNbuNnohaQpyouhagozeNchuunokuzinikaizyounichikaidesu
電話番号は331の2521です
→ 論文の発表は午前中の9時に会場に近いです
(correct: The telephone number is 331-2521. / output: The paper presentation is close to the hall at nine in the morning.)

kyoutopuriNsuhoterunihachigatsuyoQkakarayoukamadehitoribeyaootorishimashita
→ kyoutopuriNsuhoterunihachigatsuyoQkakarayoukamadenihahaQpyoushaootorishimashita
京都プリンスホテルに8月4日から8日まで一人部屋をお取りしました
→ 京都プリンスホテルに8月4日から8日までには発表者をお取りしました
(correct: We reserved a single room at the Kyoto Prince Hotel from August 4 to 8. / output: We reserved the presenters by August 8.)