(2) Generalization by Weight


In this case, the targets of generalization are the semantic attributes that are used infrequently in the documents. However, such an attribute does not always become a target. Assuming that every document in the database is equally likely to be relevant, let the sum of all document vectors be $V_t$. When a semantic attribute $\#i$ has a small value in $V_t$, it has little influence on the overall performance of information retrieval. The most appropriate system is obtained when all of the attributes in $V_t$ have balanced weights.

Consequently, a semantic attribute whose generalization would increase the weight imbalance among the elements of $V_t$ should not be generalized, even if its weight is small. Taking these conditions into consideration, we show how to select the attributes to be generalized.

Now, let us define the specific vector for all of the documents in the database as follows:

$\displaystyle V_t=(n_1,n_2,\cdots,n_i,\cdots,n_m)$

(4)

Here, $n_i$ represents the total frequency in the database of the words whose meaning is $\#i$, and $m$ is the number of attributes used by the specific vector.

Let us introduce an evaluation function $H$ that assesses the weight balance of the bases by their "variation".

$\displaystyle H=(n_1-\bar{n})^2+(n_2-\bar{n})^2+\cdots+(n_i-\bar{n})^2+\cdots+(n_m-\bar{n})^2$

(5)

Here, $\bar{n}$ represents the mean value of the $n_i$.

$\displaystyle \bar{n}=\frac{1}{m}\sum_{i=1}^m n_i$

(6)
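
As a small illustration of Eqs. (5) and (6), the sketch below computes $\bar{n}$ and $H$ for a list of attribute frequencies. The function name `balance_score` and the sample frequencies are assumptions made for this example and do not come from the original text.

```python
def balance_score(counts):
    """Return (H, mean) for attribute frequencies n_1..n_m.

    H is the sum of squared deviations from the mean (Eq. 5),
    and the mean is the arithmetic mean of the frequencies (Eq. 6).
    """
    m = len(counts)
    if m == 0:
        raise ValueError("counts must contain at least one attribute")
    mean = sum(counts) / m
    h = sum((n - mean) ** 2 for n in counts)
    return h, mean


# Illustrative frequencies only (hypothetical, not from the source).
unbalanced = [10, 10, 1, 1]
balanced = [5, 5, 5, 5]

h_unbalanced, mean_unbalanced = balance_score(unbalanced)
h_balanced, mean_balanced = balance_score(balanced)
```

A perfectly balanced vector gives $H=0$; the larger the spread of the frequencies, the larger $H$ becomes.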

According to the above discussion, generalization should be performed by selecting the semantic attributes $\#i$ that decrease the value of $H$.

Now, let us consider the case in which a semantic attribute $\#i$ is generalized into its upper node $\#j$. Then $n_i$ is added to $n_j$, and $m$ decreases by 1. Let the evaluation function after the generalization be $H'$. The change of the evaluation function, $\Delta H (=H-H')$, is given as follows:

$\displaystyle H-H'=(n_i-\bar{n})^2+(n_j-\bar{n})^2-(n_i+n_j-\bar{n})^2$

(7)

Requiring $\Delta H > 0$ as the condition, we obtain the following relation:

$\displaystyle n_i n_j < \bar{n}^2/2$

(8)
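
As a brief check of how relation (8) follows from Eq. (7): Eq. (7) as written uses the same $\bar{n}$ before and after the merge, so the terms of all other attributes cancel. Expanding the squares under that reading gives

$\displaystyle (n_i-\bar{n})^2+(n_j-\bar{n})^2=n_i^2+n_j^2-2\bar{n}(n_i+n_j)+2\bar{n}^2$

$\displaystyle (n_i+n_j-\bar{n})^2=n_i^2+n_j^2+2n_in_j-2\bar{n}(n_i+n_j)+\bar{n}^2$

$\displaystyle H-H'=\bar{n}^2-2n_in_j$

so $H-H'>0$ holds exactly when $n_i n_j < \bar{n}^2/2$.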

From this relation, we find that an attribute $\#i$ satisfying condition (8) should be generalized.
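
The test in condition (8) can be checked numerically against Eq. (7). The following is a minimal sketch; the function names `delta_h` and `satisfies_condition_8` and the sample numbers are hypothetical, and the mean is held at its pre-generalization value, as Eq. (7) is written.

```python
def delta_h(n_i, n_j, mean):
    """Change H - H' from Eq. (7), with the mean held fixed."""
    return (n_i - mean) ** 2 + (n_j - mean) ** 2 - (n_i + n_j - mean) ** 2


def satisfies_condition_8(n_i, n_j, mean):
    """True when n_i * n_j < mean^2 / 2, i.e. when merging #i into #j lowers H."""
    return n_i * n_j < mean ** 2 / 2


# Illustrative values only: a rare attribute merged into a frequent upper node,
# and two frequent attributes merged together.
mean = 5.5
rare_merge = delta_h(1, 10, mean)
frequent_merge = delta_h(10, 10, mean)
```

With these sample values the rare attribute passes the test and lowers $H$, while merging two frequent attributes fails it and raises $H$.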

Thus the generalization procedure is as follows:

1. Generalize the semantic attribute $\#i$ for which $n_i \cdot n_j$ is smallest.
2. Experimentally evaluate the performance of information retrieval. If the degradation exceeds the threshold, stop; otherwise, return to step 1.
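
The two-step loop above might be sketched as follows. This is only an illustration under several assumptions that the text does not specify: attribute frequencies are held in a dict `counts`, the thesaurus is given as a child-to-upper-node dict `parent`, the upper node must already appear in `counts`, retrieval performance is supplied by a caller-provided `evaluate` function (higher is better), and the merge that crosses the threshold is discarded rather than kept. All of these names are hypothetical.

```python
def generalize_by_weight(counts, parent, evaluate, max_degradation):
    """Greedy sketch: repeatedly merge the attribute i with the smallest n_i * n_j."""
    counts = dict(counts)   # work on copies so the inputs are not modified
    parent = dict(parent)
    baseline = evaluate(counts)
    while True:
        # Step 1: candidates are attributes whose upper node is still present.
        candidates = [
            (counts[i] * counts[parent[i]], i)
            for i in counts
            if parent.get(i) in counts
        ]
        if not candidates:
            return counts
        _, i = min(candidates)
        j = parent[i]
        trial = dict(counts)
        moved = trial.pop(i)
        trial[j] += moved   # n_i is added to n_j, and m decreases by 1
        # Step 2: stop when the degradation exceeds the threshold (assumed: undo last merge).
        if baseline - evaluate(trial) > max_degradation:
            return counts
        counts = trial
        # Re-point former children of i to j so they remain candidates.
        for k, p in parent.items():
            if p == i:
                parent[k] = j


# Hypothetical toy data and a stand-in evaluation function.
toy_counts = {"thing": 60, "animal": 50, "cat": 40, "dog": 3}
toy_parent = {"animal": "thing", "cat": "animal", "dog": "animal"}
result = generalize_by_weight(
    toy_counts, toy_parent, evaluate=lambda c: len(c) / 4, max_degradation=0.3
)
```

In practice `evaluate` would run an actual retrieval experiment; the lambda here only stands in for it so that the loop can be exercised.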


Jin'ichi Murakami, October 5, 2001