田本真詞
村上仁一
嵯峨山茂樹
TAMOTO Masafumi
MURAKAMI Jin-ichi
SAGAYAMA Shigeki
東京工業大学工学部
Tokyo Institute of Technology
ATR 自動翻訳電話研究所
ATR Interpreting Telephony Research Laboratories
Natural language models fall into two broad types. One is the class of deterministic models, which exploit known, specific properties of the language; the other is the class of statistical models, which try to characterize the statistical properties of a corpus.
Statistical models include stochastic context-free grammars and Markov processes, the latter being a kind of non-deterministic finite-state automaton. A hidden Markov model (HMM) effectively models language as a random process. Once specific HMM parameters are chosen (such as the number of states in the model), grammatical rules can be estimated in a well-defined manner as a transition network. Because the HMM is very rich in mathematical structure, it can determine a language model more precisely than a stochastic context-free grammar or a Markov process can.
This paper reports results on automatically characterizing inter-clause grammar over 25 syntactic categories, using an ergodic HMM trained on a 30,000-clause corpus of syntactic category sequences. The resulting model indicates that some common subnetworks exist even though the process is carried out automatically.
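To make the idea of estimating a transition network from category sequences concrete, the following is a minimal sketch of fitting a discrete ergodic (fully connected) HMM with Baum-Welch re-estimation. The abstract does not name the training algorithm, so Baum-Welch is an assumption here, chosen because it is the standard procedure for HMMs. The toy corpus, the four category indices, the three-state model size, and all function names are hypothetical and stand in for the 25 categories and 30,000 clauses of the actual experiment.

```python
import math
import random


def init_ergodic_hmm(n_states, n_symbols, seed=0):
    """Random ergodic HMM: every transition and emission starts non-zero."""
    rng = random.Random(seed)

    def random_dist(k):
        weights = [rng.random() + 0.1 for _ in range(k)]
        total = sum(weights)
        return [w / total for w in weights]

    pi = random_dist(n_states)                           # initial state probs
    A = [random_dist(n_states) for _ in range(n_states)]   # state transitions
    B = [random_dist(n_symbols) for _ in range(n_states)]  # category emissions
    return pi, A, B


def forward_backward(seq, pi, A, B):
    """Scaled forward-backward pass; returns alpha, beta, scales, log-likelihood."""
    n, T = len(pi), len(seq)
    alpha = [[0.0] * n for _ in range(T)]
    beta = [[0.0] * n for _ in range(T)]
    scale = [0.0] * T
    for t in range(T):
        for j in range(n):
            if t == 0:
                prior = pi[j]
            else:
                prior = sum(alpha[t - 1][i] * A[i][j] for i in range(n))
            alpha[t][j] = prior * B[j][seq[t]]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]
    beta[T - 1] = [1.0] * n
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(
                A[i][j] * B[j][seq[t + 1]] * beta[t + 1][j] for j in range(n)
            ) / scale[t + 1]
    loglik = sum(math.log(c) for c in scale)
    return alpha, beta, scale, loglik


def baum_welch_step(seqs, pi, A, B):
    """One re-estimation step; also returns the log-likelihood before the update."""
    n, m = len(pi), len(B[0])
    pi_num = [0.0] * n
    A_num = [[0.0] * n for _ in range(n)]
    B_num = [[0.0] * m for _ in range(n)]
    total_loglik = 0.0
    for seq in seqs:
        alpha, beta, scale, loglik = forward_backward(seq, pi, A, B)
        total_loglik += loglik
        T = len(seq)
        for t in range(T):
            for i in range(n):
                gamma = alpha[t][i] * beta[t][i]  # P(state i at time t | seq)
                if t == 0:
                    pi_num[i] += gamma
                B_num[i][seq[t]] += gamma
                if t < T - 1:
                    for j in range(n):
                        # Expected count of the transition i -> j at time t.
                        A_num[i][j] += (
                            alpha[t][i] * A[i][j] * B[j][seq[t + 1]]
                            * beta[t + 1][j] / scale[t + 1]
                        )

    def normalize(row):
        total = sum(row)
        return [x / total for x in row]

    return (normalize(pi_num), [normalize(r) for r in A_num],
            [normalize(r) for r in B_num], total_loglik)


def train(seqs, n_states, n_symbols, n_iter=10, seed=0):
    """Run a fixed number of Baum-Welch iterations from a random start."""
    pi, A, B = init_ergodic_hmm(n_states, n_symbols, seed)
    history = []
    for _ in range(n_iter):
        pi, A, B, loglik = baum_welch_step(seqs, pi, A, B)
        history.append(loglik)
    return pi, A, B, history


# Hypothetical toy corpus: each integer is a syntactic category index (0..3).
toy_corpus = [[0, 1, 2, 3], [0, 1, 2, 3, 0, 1], [0, 2, 3], [1, 2, 3, 0, 1, 2]]
pi, A, B, history = train(toy_corpus, n_states=3, n_symbols=4, n_iter=10)
```

After training, the rows of `A` can be read as the arcs of a transition network and the rows of `B` as the categories each state tends to emit. The sketch has no smoothing and no convergence check, and it would need both before being run on a corpus of realistic size.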