
Introduction

With the increasing availability of information in electronic form, automatic methods for retrieving such information have become both more important and more feasible. In addition to the conventional keyword method, many newer methods have been investigated, such as full-text search, passage retrieval, content retrieval, and the VSM (Vector Space Model).

Among these methods, VSM is a promising method for improving the performance of information retrieval as well as clustering. However, conventional VSM uses so many words as vector elements that similarity calculation requires much time. Moreover, when the query sentence includes only a few of the words used as vector elements, the query vector becomes too sparse to find the relevant documents.
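
As a minimal sketch of the conventional term-based VSM described above, the following Python computes cosine similarity between sparse term-count vectors. The toy documents, the whitespace tokenizer, and the function names are assumptions made for illustration and are not taken from this paper.

```python
import math
from collections import Counter

def term_vector(text):
    """Build a sparse term-count vector (term -> count) from whitespace tokens."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy collection and query.
docs = {
    "d1": "the car engine needs repair",
    "d2": "stock prices fell sharply today",
}
query = term_vector("automobile motor")
scores = {name: cosine(query, term_vector(text)) for name, text in docs.items()}
```

Here the query shares no literal term with d1, although both concern the same topic, so its score is zero just like that of the unrelated d2. This is the sparse-query problem in its simplest form.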

Much research has been conducted to resolve these problems. The simplest way to reduce the vector dimension is to select elements based on the value of $tf \cdot idf$ (Salton and McGill 1983). Hierarchical classification analyses are frequently used for term and document clustering (Jardin et al. 1971).
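
A minimal sketch of element selection by $tf \cdot idf$ follows. It assumes one common weighting (raw term frequency times log(N/df)) and scores each term by its maximum weight over the collection; the weighting and selection rule used in the cited work may differ, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter

def select_terms(docs, k):
    """Keep the k terms with the highest tf*idf weight in any document.

    Assumes tf = raw count and idf = log(N / df); other variants exist.
    """
    n_docs = len(docs)
    counts = [Counter(d.lower().split()) for d in docs]
    df = Counter(t for c in counts for t in c)  # document frequency per term
    best = {}
    for c in counts:
        for term, tf in c.items():
            w = tf * math.log(n_docs / df[term])
            best[term] = max(best.get(term, 0.0), w)
    # Sort by weight (descending), then alphabetically for a stable result.
    ranked = sorted(best, key=lambda t: (-best[t], t))
    return ranked[:k]

# Hypothetical toy collection.
docs = [
    "the engine of the car",
    "the price of the stock",
    "the engine engine noise",
]
selected = select_terms(docs, 3)
```

In this toy collection the word "the" occurs in every document, so its idf is zero and it is never selected, while words confined to a single document receive the highest weights.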

In VSM, it has been assumed that the meanings of the words that serve as the bases of the vectors are independent of each other; however, this assumption does not hold in actual documents. To reduce the number of dimensions, the KL method (Borco and Bernick 1963) and the LSI method (Deerwester et al. 1990, Faloutsos and Lin 1995, Golub and Loan 1996) were proposed, in which new bases are generated by linear combination of the original vector bases.

In the KL method, semantic similarities between bases are considered, and the vectors that represent each cluster are selected as the new vector bases. On the other hand, LSI (Latent Semantic Indexing) tries to find the new meanings behind the multiple words used as bases. It finds the new bases from the matrix composed of specific vectors by using the SVD (Singular Value Decomposition, Golub and Kahan 1965) method. This method has also been applied to a numerical database (Jiang et al. 1999).
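
For reference, the standard rank-$k$ truncation that underlies LSI can be written as follows. The symbols are notation assumed here and are not defined in this paper: $A$ is the $m \times n$ term-document matrix, $k$ is the reduced dimension, and $q$ is a query vector in the original term space.

```latex
A = U \Sigma V^{T}, \qquad
A_k = U_k \Sigma_k V_k^{T}, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{T} q
```

Here $U_k$ and $V_k$ hold the first $k$ left and right singular vectors, and $\Sigma_k$ is the diagonal matrix of the $k$ largest singular values. The columns of $U_k$ act as the new bases, each a linear combination of the original term bases, and $\hat{q}$ is the query folded into the $k$-dimensional space.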

LSI is an attractive method that can reduce the dimension without decreasing the performance of information retrieval. However, computing the SVD for a large number of documents requires much time. In some cases, the vector bases are determined from a limited number of the documents (Deerwester et al. 1990).

In addition to the above, a pseudo-feedback method (called Two Stage ad-hoc retrieval) has also been proposed (Burkley et al. 1996, Kwock and Chan 1998).

Separately, Mining Term Association is known as a learning method for acquiring the semantic relations between words, and it has been applied to documents from the Internet (Lin et al. 1998). However, it is difficult to determine the semantic relations between words automatically with high accuracy.

In order to resolve these problems, this paper proposes a new method using semantic attributes as vector elements instead of literal terms, which can easily reduce the dimension without decreasing the performance.

In this method, the Semantic Attribute System defined by "A-Japanese-Lexicon" (Ikehara et al. 1997) is used. In this system, the semantic usage of Japanese words is hierarchically classified, from the viewpoint of "is-a" and "has-a" relationships, into 2,710 categories called "Semantic Attributes". The semantic usage of 400,000 Japanese words is defined using this system. Therefore, the meanings of most of the Japanese words used in the documents can be represented by Semantic Attributes, and the similarity of meaning between a query sentence and the documents in the database is assessed through these Semantic Attributes to improve the "recall" performance. It is also expected that the vector dimension can easily be reduced using the upper-lower relations between Semantic Attributes.
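
The idea can be sketched as follows. The tiny hierarchy, the word-to-attribute table, and all names below are hypothetical stand-ins invented for illustration; they are not the actual Semantic Attribute System of "A-Japanese-Lexicon".

```python
from collections import Counter

# Hypothetical is-a hierarchy: child attribute -> parent attribute.
PARENT = {
    "car": "vehicle",
    "truck": "vehicle",
    "vehicle": "artifact",
    "engine": "machine",
    "machine": "artifact",
}

# Hypothetical lexicon: surface word -> semantic attribute.
LEXICON = {
    "automobile": "car",
    "sedan": "car",
    "lorry": "truck",
    "motor": "engine",
}

def depth(attr):
    """Number of is-a links from attr up to the root of the hierarchy."""
    d = 0
    while attr in PARENT:
        attr = PARENT[attr]
        d += 1
    return d

def generalize(attr, max_depth):
    """Replace attr by its ancestor at max_depth (or keep it if shallower)."""
    while depth(attr) > max_depth:
        attr = PARENT[attr]
    return attr

def attribute_vector(words, max_depth):
    """Count semantic attributes instead of literal words."""
    attrs = (LEXICON[w] for w in words if w in LEXICON)
    return Counter(generalize(a, max_depth) for a in attrs)

query = attribute_vector(["automobile", "motor"], max_depth=1)
doc = attribute_vector(["sedan", "lorry", "motor"], max_depth=1)
```

With max_depth=1, "automobile", "sedan", and "lorry" all fall under "vehicle", so the query and the document overlap on that element although they share only one literal word. Lowering max_depth merges more attributes into their ancestors and shrinks the vector further.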

In this paper, information retrieval experiments are conducted by applying our method to the TREC test collection "BMIR-J2" (Kitani 1998) in order to evaluate its efficiency in comparison with conventional VSM.



Jin'ichi Murakami, October 5, 2001