Next: Generalization Cost Up: Irreducible Minimum Vector Previous: (1) Generalization by Granularity

(2) Generalization by Weight

In this case, the targets of generalization are the semantic attributes that are used infrequently in the documents. However, not every such attribute becomes a target. Assume that all documents in the database are equally likely to be relevant, and let $V_t$ be the sum of all document vectors. Semantic attributes $\#i$ with a small value in $V_t$ have little influence on the overall performance of information retrieval. The most appropriate system is obtained when all of the attributes in $V_t$ have balanced weights.

Consequently, semantic attributes whose generalization would increase the weight imbalance among the elements of $V_t$ should not be generalized, even if their weight is small. Taking these conditions into consideration, we show how to select the attributes to be generalized.

Now, let us define the specific vector $V_t$ for all of the documents in the database as follows:


\begin{displaymath}
V_t=(n_1,n_2,\cdots,n_i,\cdots,n_m)
\end{displaymath} (4)

Here, $n_i$ is the total frequency in the database of the words whose meaning is $\#i$, and $m$ is the number of attributes used by the specific vector.
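As a minimal sketch of how $V_t$ could be formed, the snippet below sums document vectors element by element. The function name `total_vector` and the three sample document vectors are hypothetical and serve only as illustration; they are not taken from the original system.

```python
def total_vector(document_vectors):
    """Sum the document vectors element-wise to obtain V_t as in Eq. (4)."""
    # zip(*...) groups the i-th element of every document vector together.
    return [sum(column) for column in zip(*document_vectors)]


# Hypothetical database of three documents over m = 3 semantic attributes.
docs = [
    [2, 0, 1],
    [1, 3, 0],
    [0, 1, 1],
]
print(total_vector(docs))  # [3, 4, 2]
```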

Let us introduce the evaluation function $H$, which assesses the weight balance of the bases by their "variation", that is, the sum of squared deviations from the mean.



$\displaystyle H=(n_1-\bar{n})^2+(n_2-\bar{n})^2+\cdots+(n_i-\bar{n})^2+\cdots+(n_m-\bar{n})^2$     (5)

Here, $\bar{n}$ represents the mean value of the $n_i$.


\begin{displaymath}
\bar{n}=\sum_{i=1}^m n_i/m
\end{displaymath} (6)
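The evaluation function of Eqs. (5) and (6) can be sketched in a few lines. The function name `weight_imbalance` and the sample frequency totals are assumptions made for illustration only.

```python
from statistics import fmean


def weight_imbalance(v_t):
    """Evaluation function H of Eq. (5): sum of squared deviations from the mean."""
    n_bar = fmean(v_t)  # mean value of the n_i, Eq. (6)
    return sum((n - n_bar) ** 2 for n in v_t)


# Hypothetical frequency totals for m = 4 attributes.
balanced = [10, 10, 10, 10]
skewed = [1, 2, 17, 20]
print(weight_imbalance(balanced))  # 0.0
print(weight_imbalance(skewed))    # 294.0
```

A perfectly balanced vector gives $H=0$, and $H$ grows as the weights spread away from their mean.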

       

According to the above discussion, generalization should be performed by selecting the semantic attributes $\#i$ whose generalization decreases the value of $H$.

Now, let us consider the case in which a semantic attribute $\#i$ is generalized into its upper node $\#j$. Then $n_i$ is added to $n_j$ and $m$ decreases by 1. Let the evaluation function after the generalization be $H'$. If $\bar{n}$ is treated as unchanged by the merge, the change of the evaluation function $\Delta H (=H-H')$ is given as follows:



$\displaystyle H-H'=(n_i-\bar{n})^2+(n_j-\bar{n})^2-(n_i+n_j-\bar{n})^2$     (7)

Expanding the right-hand side of (7) gives $H-H'=\bar{n}^2-2n_i n_j$. Requiring $H-H'>0$, we obtain the following relation:



$\displaystyle n_i n_j < \bar{n}^2/2$     (8)

From this relation, we find that an attribute $\#i$ that satisfies condition (8) should be generalized.
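The equivalence between Eq. (7) and condition (8) can be checked numerically. The sketch below uses the hypothetical function names `delta_h` and `should_generalize` and made-up frequencies, and it keeps the mean fixed during the merge, as Eq. (7) does.

```python
def delta_h(n_i, n_j, n_bar):
    """Eq. (7): H - H' when attribute i is merged into j, with n_bar held fixed."""
    return (n_i - n_bar) ** 2 + (n_j - n_bar) ** 2 - (n_i + n_j - n_bar) ** 2


def should_generalize(n_i, n_j, n_bar):
    """Condition (8): n_i * n_j < n_bar^2 / 2."""
    return n_i * n_j < n_bar ** 2 / 2


n_bar = 10  # hypothetical mean frequency
for n_i, n_j in [(1, 3), (2, 40), (5, 10)]:
    # delta_h is positive exactly when condition (8) holds.
    print(n_i, n_j, delta_h(n_i, n_j, n_bar), should_generalize(n_i, n_j, n_bar))
```

With these values the pair (1, 3) lowers $H$ and satisfies (8), the pair (2, 40) raises $H$ and fails (8), and the pair (5, 10) sits exactly on the boundary where $H-H'=0$.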

Thus the generalization procedure is as follows:

Step 1. Generalize the semantic attribute $\#i$ for which $n_i \cdot n_j$ is smallest.
Step 2. Experimentally evaluate the performance of information retrieval. If the degradation exceeds the threshold, stop; otherwise return to Step 1.
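The two steps above can be sketched as a loop. Everything named here is hypothetical: the function `generalize_by_weight`, the `parent` dictionary encoding the attribute hierarchy, and the `degradation` callback that stands in for the experimental evaluation. The sketch also assumes that the merge which pushes the degradation over the threshold is discarded, which the text does not specify.

```python
def generalize_by_weight(counts, parent, degradation, threshold):
    """Repeat the two-step procedure and return the generalized counts.

    counts:      dict mapping attribute -> n_i
    parent:      dict mapping attribute -> upper node (assumed tree-shaped)
    degradation: callable(counts) -> number; stand-in for the experiment
    threshold:   largest tolerated degradation
    """
    counts = dict(counts)  # work on a copy

    def upper_node(i):
        # Nearest ancestor of i that is still present in counts, or None.
        j = parent.get(i)
        while j is not None and j not in counts:
            j = parent.get(j)
        return j

    while True:
        # First step: find the attribute i whose product n_i * n_j is smallest.
        candidates = []
        for i in counts:
            j = upper_node(i)
            if j is not None:
                candidates.append((counts[i] * counts[j], i, j))
        if not candidates:
            return counts
        _, i, j = min(candidates, key=lambda c: c[0])

        trial = dict(counts)
        n_i = trial.pop(i)  # m decreases by 1
        trial[j] += n_i     # n_i is added to n_j

        # Second step: evaluate; stop if the degradation exceeds the threshold.
        if degradation(trial) > threshold:
            return counts   # assumption: the offending merge is discarded
        counts = trial


# Hypothetical hierarchy and frequencies.
counts = {"animal": 30, "dog": 2, "cat": 25, "plant": 20, "moss": 1}
parent = {"dog": "animal", "cat": "animal", "moss": "plant"}
# Toy degradation measure: the number of attributes merged away so far.
result = generalize_by_weight(counts, parent, lambda c: 5 - len(c), threshold=2)
print(result)  # {'animal': 32, 'cat': 25, 'plant': 21}
```

In this toy run "moss" (product 20) and then "dog" (product 60) are merged; merging "cat" would raise the toy degradation to 3, so the loop stops before it.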


Jin'ichi Murakami, October 5, 2001