In this paper, we consider signals originating from a sequence of sources; more specifically, we address the problems of segmenting such signals and relating each segment to its source. This problem has wide application in many fields: for example, automatic determination of acoustic units, speaker discrimination, and discrimination of the utterance mode are candidate applications.
The unknown-multiple signal source clustering problem divides into the following four sub-problems.
This paper assumes that the number of categories is known. We therefore consider the problem of automatically segmenting the observed signal sequence and assigning a category to each segmented interval.
On the other hand, in the areas of speaker identification [3] and language modeling [5], the Ergodic HMM, in which all states are connected to each other, is often used. When such an Ergodic HMM is applied to the unknown-multiple signal source clustering problem, each category is expected to correspond to a state, and the signal sequence to the symbol sequence output from the states. The signal source sequence can therefore be determined by applying the Viterbi algorithm to the observed sequence, and the HMM parameters can be estimated from the observed sequences by Baum-Welch training.
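The decoding step above can be sketched as follows. This is a minimal log-domain Viterbi implementation for a fully connected (ergodic) HMM, not the paper's actual code; the function name and array layout are assumptions for illustration. Each decoded state index is then read as a category (e.g. speaker) label for that frame.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state sequence for an ergodic (fully connected) HMM.

    log_pi: (S,)   log initial state probabilities
    log_A:  (S, S) log transition probabilities (all finite: ergodic)
    log_B:  (T, S) log likelihood of each observation under each state
    Returns the decoded state index for each of the T frames.
    """
    T, S = log_B.shape
    delta = log_pi + log_B[0]             # best log score ending in each state
    back = np.zeros((T, S), dtype=int)    # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A   # scores[i, j]: best path ending i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    # Backtrack from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```

With sticky self-transitions, the decoded path tends to stay in one state over a stretch of frames, which is what yields contiguous segments per source.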
As an example of the unknown-multiple signal source clustering problem, an experiment on speaker classification is performed. The speakers speak in random order, and the speech is recorded on a single channel; the task is to segment the recording by speaker. The following results are obtained: using LPC cepstrum coefficients with a long-term window for 4 male speakers, the average classification rate is 67.5%; selecting the Ergodic HMM with the highest likelihood raises the average classification rate to 78.8%.
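The text does not specify how a classification rate is scored against reference speaker labels. Since the HMM's state labels are arbitrary, one common approach, shown here as an assumed sketch (the function name and interface are not from the paper), is to take the state-to-speaker mapping that maximizes frame-level agreement:

```python
import numpy as np
from itertools import permutations

def classification_rate(decoded, reference, n_states):
    """Frame-level agreement between decoded HMM states and reference
    speaker labels, maximized over all state-to-speaker assignments
    (cluster labels carry no inherent speaker identity)."""
    decoded = np.asarray(decoded)
    reference = np.asarray(reference)
    best = 0.0
    for perm in permutations(range(n_states)):
        mapped = np.array(perm)[decoded]          # relabel states via perm
        best = max(best, float(np.mean(mapped == reference)))
    return best
```

Exhaustive permutation search is fine for the 4-speaker setting here; for many states, a bipartite assignment solver would replace the loop.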
Relevant studies are as follows. A method using Kullback information based on a codebook is presented in [1]; in that study, the segmentation boundaries and the number of categories are known in advance. Paper [4] discusses the same classification problem for utterances by multiple speakers; it assumes that the acoustic parameters of each speaker follow a Gaussian distribution and attempts to solve the problem using VQ clustering. Note that in the standard speaker identification problem [3], a model of each speaker is constructed beforehand, and these models are applied to the input speech to identify the speaker. In contrast, this paper considers speech of several speakers captured sequentially, and the problem is to segment and classify the speech of each speaker.