Jin'ichi Murakami Haseo Hotta
Dept. of Information and Knowledge Eng., Tottori University,
4-101 Koyama-Minami, Tottori 680-8550, Japan
Japanese has homonyms such as ``箸'' (hashi [chopsticks]) and ``橋'' (hashi [bridge]). These words have the same syllables but different accents. However, conventional speech recognition uses formant information and not prosody. As a result, homonym speech recognition in Japanese has not been studied.
In Chinese, a difference in accent (tone) changes the meaning of a word. This is called ``four-tone'' or ``tone sandhi''. Thus, many studies have been conducted on the use of prosody in Chinese speech recognition. These studies used both MFCC and pitch frequency. MFCC represents the formant structure, and pitch frequency represents part of the prosody. However, reliably estimating pitch frequency is very difficult. Double-pitch and half-pitch values are often estimated by mistake. In addition, vowels have pitch, but voiceless consonants do not.
In this study, we exploited the effect of pitch on formants and did not extract pitch frequency directly. More specifically, we used an accent model in which each phoneme label carries the word's mora length, the mora position within the word, the accent type, and whether the accent is high or low.
Using this model, we studied speaker-independent homonym speech recognition, and we used a set of homonym pairs for the evaluation. Because the accent model makes the number of HMM units very large, we used semi-continuous HMMs in this study. We used MFCC and FBANK as the acoustic parameters.
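For reference, the parameter sharing that motivates this choice can be written out. The formula below is the textbook form of a semi-continuous (tied-mixture) HMM output density, not a detail taken from this paper's implementation. The symbols $b_j$, $c_{jk}$, $K$, $\boldsymbol{\mu}_k$, and $\boldsymbol{\Sigma}_k$ are notation introduced here for illustration.

```latex
b_j(\mathbf{o}_t) = \sum_{k=1}^{K} c_{jk}\,
  \mathcal{N}\!\left(\mathbf{o}_t;\, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right),
\qquad \sum_{k=1}^{K} c_{jk} = 1 .
```

The $K$ Gaussians are shared by every state of every model, and only the weights $c_{jk}$ belong to state $j$. Adding accent-specific units therefore adds mixture weights but no new Gaussians, which is why this structure tolerates a large inventory of sparsely observed units.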
Accent Model and Accent Triphone Model
In this section, we describe the accent model. In this model, the phoneme label is extended with the word's mora length, the mora position within the word, the accent type, and whether the accent is high or low. These labels are attached to vowels, nasals, and double consonants. Ordinary consonants do not carry them.
More specifically, we labeled vowels, nasals, and double consonants with seven-digit numbers. The first pair of digits indicates the mora length of the word. The second pair indicates the mora position within the word. The third pair indicates the word's accent type. The final digit indicates the accent at that mora position: 0 for low and 1 for high. Fig. 1 shows an example of the models.
This accent model is context-independent. In this paper, we also use an accent triphone model, which is a context-dependent accent model. Table 1 shows an example of the accent, accent triphone, and triphone models. The example word is ``aki''. Its Japanese kanji expression is ``秋'', and its English meaning is ``autumn''. In the accent labels, the accent of ``a'' is high and the accent of ``ki'' is low. In the triphone labels, one symbol marks the following-context phoneme and another marks the preceding-context phoneme.
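The seven-digit scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exact label syntax is not given in the text, so writing the digits directly after the phoneme name is hypothetical, and `pitch_pattern` assumes the usual Tokyo-dialect numbering in which accent type 0 is flat (low, then high) and type n places the last high mora at position n. The function names are invented for this sketch.

```python
def pitch_pattern(n_mora: int, accent_type: int) -> list[int]:
    """Return 1 (high) or 0 (low) for each mora, assuming Tokyo-style accent types."""
    pattern = []
    for pos in range(1, n_mora + 1):
        if accent_type == 0:      # flat type: first mora low, the rest high
            high = pos > 1
        elif accent_type == 1:    # head-high type: first mora high, the rest low
            high = pos == 1
        else:                     # type n >= 2: moras 2..n high, others low
            high = 1 < pos <= accent_type
        pattern.append(1 if high else 0)
    return pattern


def accent_label(phoneme: str, n_mora: int, position: int,
                 accent_type: int, high: int) -> str:
    """Build a label: phoneme + mora length (2) + position (2) + type (2) + high/low (1)."""
    # Hypothetical syntax: the digits are appended without a separator.
    return f"{phoneme}{n_mora:02d}{position:02d}{accent_type:02d}{int(high)}"


# Two-mora word for "autumn" (a, ki) with accent type 1: high then low.
pattern = pitch_pattern(2, 1)
labels = [accent_label(v, 2, pos, 1, pattern[pos - 1])
          for pos, v in enumerate(["a", "i"], start=1)]
print(labels)  # ['a0201011', 'i0202010']
```

Encoding the word length and position in the label means the same vowel gets a different model in every accent context, which is what lets the recognizer separate homonyms without a pitch extractor.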
|triphone||a k||a k i||k i|
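The triphone row above can be reproduced with a short sketch. The context separators did not survive in this copy of the table, so the code assumes the HTK naming convention, in which `l-p+r` means phoneme `p` with left context `l` and right context `r`. The function name `to_triphones` is hypothetical.

```python
def to_triphones(phonemes: list[str]) -> list[str]:
    """Expand a phoneme sequence into HTK-style context-dependent names."""
    names = []
    last = len(phonemes) - 1
    for i, p in enumerate(phonemes):
        left = f"{phonemes[i - 1]}-" if i > 0 else ""      # preceding context
        right = f"+{phonemes[i + 1]}" if i < last else ""  # following context
        names.append(f"{left}{p}{right}")
    return names


print(to_triphones(["a", "k", "i"]))  # ['a+k', 'a-k+i', 'k-i']
```

Word-boundary phonemes keep only the context that exists, which matches the two-symbol, three-symbol, two-symbol pattern of the row.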
Homonym Recognition Experiments
Training Data and Test Data
We used the ATR A-set database. This database contains 5240 words spoken by each of ten male and ten female speakers. The speakers were professionals and spoke very clearly. For the training data, we used nine speakers and the odd-numbered words. That is, we used 2620 x 9 words for training, and the one remaining speaker was kept for testing.
For the test data, we used the homonyms in the speech database. To check the word accents, we used the ``NHK Japanese Accent Dictionary''. The ATR A-set database contains 31 homonym pairs totaling 62 words. However, some of the speech data were spoken with accents that differ from the dictionary. Thus, we used only the correctly accented words in this database. As a result, we used 11 pairs of homonyms (i.e., 22 words). Table 2 shows the test homonym data. In this table, accent marks indicate whether each syllable is high or low, the quoted text is the Japanese kanji expression, and the bracketed text is the English meaning.
|``居る'' [stay]|| ``$B
|``代える'' [change]||``返る'' [reverse]|
|``欠ける'' [missing]||``駆ける'' [run]|
|``機嫌'' [mood]||``起源'' [origin]|
|``公開'' [public]||``航海'' [voyage]|
|``置く'' [carry]||``億'' [A hundred million]|
|``指名'' [nominate]||``氏名'' [full name]|
|``度'' [at a time]||``足袋'' [Japanese socks]|
|``徳'' [virtue]||``解く'' [solve]|
|``付ける'' [attach]||``漬ける'' [steep]|
|``因る'' [cause]||``夜'' [night]|
We conducted experiments with three male speakers and three female speakers. We used the HTK toolkit with FBANK and MFCC features in these experiments. We also used both full covariance HMMs and diagonal covariance HMMs. The MFCC and FBANK systems have the same number of Gaussian densities.
Table 3 shows the acoustic analysis parameters and the HMM parameters. The experimental conditions are also shown in Table 4.
|sampling frequency||16 kHz|
|window length||25 ms|
|frame period||10 ms|
|Number of analyses (MFCC)||12 order MFCC + 12 order MFCC + log power + log power|
|Number of analyses (FBANK)||24 order FBANK + 24 order FBANK + log power + log power|
|HMM model||3 loop 4 state, semi continuous densities|
|# Gaussian densities of state (Diagonal)||MFCC 1024 + MFCC 1024 + log power 64 + log power 64|
|# Gaussian densities of state (Full)||MFCC 128 + MFCC 128 + log power 16 + log power 16|
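The framing parameters in the table translate into sample counts as follows. This is a small sketch for orientation only. The frame-count rule assumes no padding and drops any trailing partial frame, which is an assumption of this example and not a setting stated in the paper. All names are hypothetical.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz, from the table
WINDOW_MS = 25            # window length, from the table
SHIFT_MS = 10             # frame period, from the table

window_samples = SAMPLE_RATE_HZ * WINDOW_MS // 1000   # samples per analysis window
shift_samples = SAMPLE_RATE_HZ * SHIFT_MS // 1000     # samples between frame starts


def num_frames(n_samples: int) -> int:
    """Count full analysis windows that fit in n_samples (no padding assumed)."""
    if n_samples < window_samples:
        return 0
    return 1 + (n_samples - window_samples) // shift_samples


print(window_samples, shift_samples)   # 400 160
print(num_frames(SAMPLE_RATE_HZ))      # 98 frames in one second of speech
```

Each window overlaps the next by 15 ms, so a short two-mora word still yields several dozen frames for the HMM states to align against.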
Flowchart of Making Accent Model and Accent Triphone Model
The initial HMM is very important for training, and data sparseness is a serious problem for the accent model and the accent triphone model. Thus, we made the initial accent model HMM from a phoneme model HMM. The initial triphone model HMM was also made from the phoneme model HMM, and the initial accent triphone model HMM was made from the triphone model HMM. To further reduce the data sparseness problem for the accent model and accent triphone model, we used semi-continuous HMMs.
Figure 2 shows the flowchart for building the accent model HMM and the accent triphone model HMM.
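The initialization step can be pictured as cloning: every accent-labeled unit starts as a copy of the plain phoneme model it is derived from and is then re-estimated. The sketch below shows only that idea. It is not the authors' HTK procedure, the model contents are toy placeholders, and the label syntax (phoneme name followed by seven digits) is the same hypothetical one assumed for illustration.

```python
import copy
import re

# Hypothetical label syntax: phoneme name followed by a seven-digit accent code.
ACCENT_SUFFIX = re.compile(r"\d{7}$")


def base_phoneme(label: str) -> str:
    """Strip the accent code, e.g. 'a0201011' -> 'a'; plain consonants are unchanged."""
    return ACCENT_SUFFIX.sub("", label)


def init_accent_models(phoneme_models: dict, accent_labels: list[str]) -> dict:
    """Seed each accent unit with an independent copy of its phoneme model."""
    return {label: copy.deepcopy(phoneme_models[base_phoneme(label)])
            for label in accent_labels}


# Toy stand-ins for trained phoneme HMMs.
phoneme_models = {"a": {"weights": [0.5, 0.5]}, "k": {"weights": [0.9, 0.1]}}
accent_models = init_accent_models(phoneme_models, ["a0201011", "k"])
print(sorted(accent_models))  # ['a0201011', 'k']
```

Starting from a trained phoneme model gives each rare accent unit sensible parameters before it has seen any of its own data, which is the point of the flowchart.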
Results of Homonym Speech Recognition
Table 5 shows the results of speaker-independent homonym speech recognition. In this table, ``MAU'', ``MMY'', and ``MNM'' indicate male speakers, and ``FAF'', ``FMS'', and ``FTK'' indicate female speakers. ``Ave.(Male)'' indicates the average of the male speakers (MAU, MMY, MNM). ``Ave.(Female)'' indicates the average of the female speakers (FAF, FMS, FTK). ``Ave.(Total)'' indicates the average of all speakers. Table 5 gives the error rates for ``MFCC and Diagonal Covariance HMM'', ``MFCC and Full covariance HMM'', ``FBANK and Diagonal Covariance HMM'', and ``FBANK and Full covariance HMM''.
The following results were obtained in these experiments.
The maximum average homonym recognition rate (89%) was obtained with the accent triphone model, MFCC, and full covariance HMM (Table 5). However, the results differed between male speakers and female speakers.
Female speakers had a higher recognition rate than male speakers. For male speakers, MFCC gave a higher recognition rate than FBANK. Female speakers showed the opposite trend. The maximum recognition rate for male speakers was 92%, obtained with the accent triphone model, MFCC, and full covariance HMM (Table 5). The maximum recognition rate for female speakers was 94%, obtained with the accent triphone model, FBANK, and full covariance HMM (Table 5).
The average MFCC recognition rate was slightly higher than the average FBANK recognition rate. MFCC was effective for male speakers, and FBANK was effective for female speakers.
|Ave. (Female)||14% (9/66)||14% (9/66)|
|Ave. (Female)||6% (4/66)||6% (4/66)|
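The percentages in these rows follow directly from the error counts. The sketch below reproduces them, reading the denominator 66 as 22 test words times 3 speakers per group. That reading is an inference from the test-set description, not something the table states. The names are hypothetical.

```python
N_TEST_WORDS = 22          # 11 homonym pairs
SPEAKERS_PER_GROUP = 3     # three male or three female speakers
tokens_per_group = N_TEST_WORDS * SPEAKERS_PER_GROUP   # 66 utterances


def error_rate_percent(n_errors: int, n_tokens: int) -> int:
    """Error rate rounded to a whole percent, as printed in the table."""
    return round(100 * n_errors / n_tokens)


print(tokens_per_group)                          # 66
print(error_rate_percent(9, tokens_per_group))   # 14
print(error_rate_percent(4, tokens_per_group))   # 6
```

A 6% error rate corresponds to a 94% recognition rate, which is consistent with the best female result quoted in the text. With only 66 tokens per group, one extra error moves the rate by about 1.5 points, so small differences between conditions should be read cautiously.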
In most cases, the accent triphone model was better than the accent model. However, the difference between the two was small. It was large only with MFCC and full covariance HMM, where the error rate improved from 23% to 8%. The improvement was not very large in the other experiments.
Analysis of homonym recognition error
Across experiments, the homonym recognition errors were concentrated in 2-mora high-low words and 3-mora low-high-high words. Table 6 shows examples of the errors for 2-mora homonyms. As the table shows, these homonyms are also easy for people to confuse.
|[A hundred million]|
|``因る'' [cause]||``夜'' [night]|
Comparison of FBANK and MFCC
MFCC was more effective than FBANK for speaker-independent homonym recognition in many experiments. However, among female speakers, FBANK was more effective than MFCC in many experiments. FBANK carries information on both prosody and formants, while MFCC carries only formant information. Because prosody affects the formants, homonym speech recognition is possible even with MFCC. Nevertheless, FBANK would be expected to suit homonym speech recognition better than MFCC.
For speaker-independent speech recognition, this hypothesis holds for female speech but not for male speech.
Comparison of Males and Females
There are no differences between the male and female groups in the accent components of relative f0 and intensity. However, female speakers generally have a higher pitch frequency, which makes it difficult to separate formants from pitch. Thus, ordinary speech recognition usually performs worse for female speakers than for male speakers.
However, the opposite result was obtained for homonym speech recognition. The error rate of homonym speech recognition was lower for female speakers than for male speakers. We think that the change in pitch frequency is larger for female speakers than for male speakers, which would explain this result.
Comparison of proposed method and other models
The proposed method must still be compared with other models. Because pitch extraction is the most important point in this paper, we will collect results for the proposed system combined with a separate pitch extractor.
In this study, we surveyed the recognition rates of Japanese speaker-independent homonym speech. To recognize the homonyms, we created an accent model and an accent triphone model. The accent model has a phoneme label extended with the word mora length, the word mora position, the accent type, and whether the accent is high or low. The accent triphone model has a triphone label extended with the same information. We did not use pitch extraction. For acoustic parameters, we used MFCC and FBANK.
Using these models and parameters, we studied the homonym speech recognition rates and obtained the following results.
Using accent triphone models, MFCC, and full covariance HMM, we obtained 89% homonym word accuracy.
MFCC produced a higher average recognition rate than FBANK, meaning that it was generally more effective. More specifically, MFCC was better than FBANK for male speakers, and FBANK was better than MFCC for female speakers.
The recognition rates differed considerably between speakers.
In the future, we will use FBANK because this parameter is effective for speaker-dependent recognition and for female speakers. Alternatively, we will use other parameters such as LDC. We will also use discriminative training for the HMMs.