In this section, we first apply Dirac notation in detail to the three
classical models: the vector space model, the Boolean model and the
probabilistic model. Then we introduce Fock space as a unified way to
represent these models.

Suppose we have **t independent
terms** in our collection; we can think of them as orthonormal vectors in a
t-dimensional real vector space:

(1.1.1a)

In **Dirac notation**,
we can define a set of orthonormal base vectors ([1], pages 192-195):

<t_{i}|t_{j}> = δ_{ij},  i, j = 1, 2, …, t (1.1.1b)

Here, |t_{i}> is called a **ket**; <t_{i}| is its Hermitian conjugate (or transpose, for a
real vector) and is called a **bra**; <t_{i}|t_{j}> is a bracket, the *inner product of the two vectors*. More about Dirac notation can be
found in Refs. [19] and [20].

Because any vector in the term-document space (i.e., terms,
documents and queries) can be uniquely expanded with respect to the terms, we
say that the terms form a basis of the space. Moreover, if the expansion
coefficients are uniquely determined by the inner products of documents (and
queries) with terms, the completeness relation can be written as:

Σ_{i=1}^{t} |t_{i}><t_{i}| = I (1.1.1c)

Now we can represent a query vector *q* and documents d_{μ} (μ = 1, 2, …, N) as:

|q> = Σ_{i=1}^{t} w_{iq} |t_{i}>,  |d_{μ}> = Σ_{i=1}^{t} w_{iμ} |t_{i}> (1.1.2a)

Here, w_{iq} and w_{iμ} are called the *term weights* of the query or a document with respect to the i^{th}
term. They **actually define** the inner
product of a document (or a query) with a term ([2], page 29; [3], page 15):

w_{iq} = <t_{i}|q>,  w_{iμ} = <t_{i}|d_{μ}> (1.1.2b)

Note that we have used the fact that, for two real vectors, <a|b> = <b|a>.
This is the *classical vector space model* (VSM), proposed by Salton et al. ([4,
5]). We call the vector space spanned by the t base vectors of Eq. (1.1.1b) a
*Weighted Term Vector Space*, or WT-space for short. The weights can be calculated based on different schemes, usually
involving ([2], page 29; [3], page 15) the following parameters:

tf_{iμ}: the frequency of the *i*-th term in document d_{μ} (1.1.3a)

n_{i}: the number of documents containing the *i*-th term (1.1.3b)

dl_{μ}: the length (word count) of the *μ*-th document (1.1.3c)

N: the number of all documents in the collection (1.1.3d)

(1.1.3e)

tf_{iq}: the frequency of the *i*-th term in the query ([2], page 29) (1.1.3f)

In reference [3], the following formulas are used to calculate
*idf* and the weights:

(1.1.4a)

(1.1.4b)

(1.1.4c)

Many variations on how to compute the weights have been proposed.
One good performer is identified as ([3], page 17):

(1.1.4d)
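As a minimal illustration, assuming a plain log-based *idf* (the exact formulas of Eq. (1.1.4) in [3] may differ in base and smoothing), the weights can be computed as:

```python
import math

def idf(n_i, N):
    # Inverse document frequency: rarer terms get a higher weight.
    # One common choice (assumed here; schemes vary): log2(N / n_i).
    return math.log2(N / n_i)

def weight(tf, n_i, N):
    # Classic tf-idf product; a zero term frequency gives zero weight.
    return tf * idf(n_i, N) if tf > 0 else 0.0

# Toy collection: N = 3 documents, a term appearing in n_i = 1 of them.
w = weight(tf=2, n_i=1, N=3)
```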

After we have computed the weights, we can calculate the
relevance, or the Similarity Coefficient (SC), of a document d_{μ} with respect to a query *q*. In the classical vector space model,
the SC is the cosine of the angle between the two vectors ([2], page 27), which
can be written as:

SC(d_{μ}, q) = <d_{μ}|q> / (|d_{μ}| |q|) (1.1.5)

Here, using (1.1.2a, b) and (1.1.1c), we have:

(1.1.6)


(1.1.7a)


(1.1.7b)
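For concreteness, the cosine Similarity Coefficient of Eqs. (1.1.5-7) can be sketched directly on term-weight vectors (a minimal sketch; the weight vectors are hypothetical):

```python
import math

def cosine_sc(d, q):
    # Similarity Coefficient of Eq. (1.1.5): cosine of the angle between
    # a document and a query in the t-dimensional WT-space.
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q)

# Hypothetical 3-term weight vectors.
sc = cosine_sc([1.0, 0.0, 1.0], [1.0, 1.0, 0.0])  # = 0.5
```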

In many cases, we want to normalize |d_{μ}> and |q>,
so that:

<d_{μ}|d_{μ}> = 1,  <q|q> = 1 (1.1.8)

We see that the *normalized
WT-space* is **a bounded t-dimensional
continuous space over the interval [0, 1]**, restricted to the unit cube:

0 ≤ w_{iμ} ≤ 1,  0 ≤ w_{iq} ≤ 1,  i = 1, 2, …, t (1.1.9)

Note that the vectors restricted by Eq. (1.1.9) *do
not form a vector space*, because their additive inverses are not
included and the set is not closed under vector addition. But they do form a
subset that is closed under retrieval-related operations. We call such a space **a retrieval-closed space**. Actually,
this space is better classified as a Fock space (see §1.3).

Up to now, the only properties (or assumptions) we know
about the base vectors are those in Eq. (1.1.1). These assumptions may be
relaxed or violated in other models. We also have to define the rules to
calculate the inner products of documents and queries with the term vectors (the
weights), for example, by Eq. (1.1.4).

Following the discussion in Ref. [19], we can rewrite
Eq. (1.1.6) using the trace operation:

(1.1.10)

We adopt it as the Similarity Coefficient (*the cosine law*), while in
Ref. [19] (page 87), the related probability is given by the *cosine-squared
law*:

(1.1.11)


In reference [3], a simple example is used throughout that
book. For the readers' convenience, we will reuse and restate this example
([3], page 15). The example, referred to as the *GF-Example* from now on,
has a collection of three documents and one query, as described in *Appendix
A*.

To avoid unnecessary confusion, we use the following
conventions for labeling the indexes of vectors and tensors:

**Index Convention**:

- For terms, we use Latin chars *i, j, k, l, m*…
- For documents, we use Greek chars *μ, ν*…
- For the query, we use the Latin char *q*.
- For other cases, like in LSI, we use Latin chars *a, b, c*…

**1.2 The Boolean Model and BT-space**

The classical Boolean model is discussed in [2],
§2.5.2. Here we want to describe this model with a vector space.

Assume that the term-document and term-query weights are all
binary, i.e.:

w_{iμ} ∈ {0, 1},  w_{iq} ∈ {0, 1} (1.2.1)

Now we introduce a new vector space to represent documents and
the query. Each vector is a direct product of t 2-dimensional
vectors, so this space has M = 2^{t} points.
We call this vector space a **Boolean
Term Vector Space**, or *BT-space*
for short. Suppose we have 5 independent
terms, and a sample vector contains three of them: the first, the second and the fifth. Then
in the BT-space, this vector can be represented as:

(1.2.2a)

Here we have used an explicit representation of each 2D
factor space. The base vectors are chosen as eigenvectors of an *occupation operator*:
(1.2.2b)

The 2-dimensional vectors look similar to the spin-1/2
representation in quantum mechanics (QM). The only difference is that, in QM, the
base vectors usually are the common eigenvectors of the operators S^{2} and S_{z},
corresponding to the spin-up and spin-down states (in natural units, ħ = 1):

(1.2.3a)

Using these two states as a basis, we have the matrix
expressions:

(1.2.3b)

Comparing (1.2.2b) and (1.2.3b), we see the following
relations:

(1.2.3c)

In BT-space, the basis of Eq. (1.1.1b) now can be
represented as a set of t elements of the form:

(1.2.4a)

And we have t occupation operators:

(1.2.4b)

This means that the *l*-th occupation operator acts on the *l*-th 2-dimensional factor of the product; if that factor is |1>, it returns
the eigenvalue 1, otherwise it returns 0. Therefore we can think of the t base
vectors as common eigenvectors of the t occupation operators. We call them
occupation operators, because their eigenvalues are similar to the occupation
numbers of a system of fermions (see §1.3).
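A minimal sketch of this construction, with our own helper names and bit-order convention (the most significant bit corresponds to the first term):

```python
def kron(u, v):
    # Kronecker (direct) product of two vectors, stored as flat lists.
    return [x * y for x in u for y in v]

KET0, KET1 = [1, 0], [0, 1]   # term absent / term present

def bt_vector(bits):
    # Direct product of t two-dimensional factors, one per term.
    vec = [1]
    for b in bits:
        vec = kron(vec, KET1 if b else KET0)
    return vec

def occupation(vec, l, t):
    # Eigenvalue of the l-th occupation operator on a basis vector:
    # read off whether the l-th factor is |1> from the index of the
    # single non-zero component.
    idx = vec.index(1)
    return (idx >> (t - 1 - l)) & 1

v = bt_vector([1, 0, 1])      # 3-term example: first and third present
```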

A document now can be represented as:

(1.2.5)

The documents in *GF-Example*
now can be represented as:

(1.2.6a)

Now let us assume that our binary query is “ ”, i.e.:

(1.2.6b)

Then we can see that only one document is relevant, because it contains the sixth and
ninth terms but not the fifth term, as the query requires. In other words,
the relevant documents of this query are in the following set (or *a concept*):

Using the vector space, we can also say that any relevant
document is a common eigenvector of the corresponding occupation
operators, such that:

(1.2.6d)
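The eigenvalue conditions of Eq. (1.2.6d) amount to a subset test on occupation patterns; a sketch with hypothetical documents (term indices and document names are made up):

```python
def relevant(doc_terms, required, forbidden):
    # A document is relevant iff it is a common eigenvector of the
    # occupation operators with eigenvalue 1 on every required term
    # and eigenvalue 0 on every forbidden term (cf. Eq. (1.2.6d)).
    return required <= doc_terms and not (forbidden & doc_terms)

# Hypothetical occupation patterns of three documents (sets of term indices).
docs = {"d1": {5, 6}, "d2": {6, 9}, "d3": {5, 6, 9}}
hits = [name for name, terms in docs.items()
        if relevant(terms, required={6, 9}, forbidden={5})]
# hits == ["d2"]
```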

Our BT-space is very close to the one used by the *Generalized Vector Space Model*, introduced by
Wong, Ziarko and Wong [6]. The main difference is that they consider the whole
space at once and introduce base vectors to represent term vectors and
term-term correlations. In our case, we rather think of it as a product of *t* factor spaces, each spanned by the
eigenvectors of an occupation operator. This factor space is 2-dimensional
for the Boolean model, but can be extended easily, as described in the next sections
(see §1.3 and §2.1).

We see that the BT-space is a bounded, **finite, t-dimensional discrete space over {0, 1}**,
restricted to the vertices of the unit cube in our WT-space:

w_{iμ} ∈ {0, 1},  w_{iq} ∈ {0, 1},  i = 1, 2, …, t (1.2.7)

Again note that the vectors defined by Eq. (1.2.7) do not
form a vector space, but a retrieval-closed space of vectors. It is also a Fock
space, as we will see in the next section.

Following the discussion in [19] (pages 56-60), we see that
our occupation operator is a projector, and can be written as:

(1.2.8)

The vectors like (1.2.6a) remind us of the **Occupation Number Representation** (ONR)
([1], page 566) in a multi-particle system:

(1.3.1)

Here n_{i} is the number of particles occupying the i-th state (orbital). This vector space is called a *Fock
space*. For fermions (e.g., electrons, protons…), n_{i} can only be 0 or 1.

Our BT-space is also a Fock space, and its representation
can be called the **Boolean Term Occupation
Number Representation** (BT-ONR), in which the occupation number can take only the value 0
or 1 for a given term. The occupation
operator now can be interpreted as a Term Occupation
Number Operator; its eigenvalues are 0 and 1 in BT-space. Because any term, word
group, sentence, paragraph, document or query is represented by one of the M = 2^{t} vectors of the form (1.3.1), we can also
think of the BT-space as a finite, discrete set of points (or elements). Each point is a *concept*.
An interesting example is the point representing the *vocabulary*, or all
distinct terms, of the collection ([2], page 174):

The term base vectors (1.2.4a) now can also be written as:

(1.3.3)

They form a subset of t points, each representing one term.
Here we introduce the t-plet notation for the t base vectors (points) in the
Fock space:

(1.3.4a)

Note that, in this notation, the addition of two t-plets is
actually their union. For example, in our 5-term space, a document vector can be written as:

(1.3.4b)

If we allow the occupation number to be any non-negative integer
(or natural number), like the term frequency in a document, then we have a new
representation of the classic vector space: a **Term Frequency Fock Space**, or TF-space for short, which is similar
to the ONR for bosons in quantum mechanics (e.g., photons, mesons…).

The number vector (1.3.4a) can now be generalized to:

(1.3.5)

The occupation operator now has such a property:

(1.3.6)

In this representation, the three documents and the query in
our *GF-Example* now can be written as:

(1.3.7a)


Using term vectors in Eq. (1.3.3) as our base, we can
represent the above document vectors and the query vector in columns as
follows:

(1.3.7b)

Note that, in this notation, the addition of two t-plets is
actually the addition of number vectors. For example, in a 5-dimensional term
space, a document vector can be written as:

(1.3.8)
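This addition of number vectors can be sketched as follows (a minimal sketch; terms are stored as a mapping from term to occupation number):

```python
from collections import Counter

def add_vectors(u, v):
    # A TF-space vector is a tuple of occupation numbers, stored here as
    # term -> frequency. Adding two number vectors adds the frequencies.
    return Counter(u) + Counter(v)

d = add_vectors({"gold": 2, "silver": 1}, {"gold": 1, "truck": 1})
# d == {"gold": 3, "silver": 1, "truck": 1}
```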


Now we see that the TF-space is a retrieval-closed space of
bounded **t-dimensional discrete vectors**,
restricted to the integer lattice points (vertices of unit cubes) in our WT-space:

(1.3.9)

In TF-space, we do not use vector addition of documents, but
we do need to define an inner product, which can be derived from the
correspondence with the WT-space:

(1.3.10)

We will see that the TF-space and its representations
greatly simplify the description of the so-called **Latent Semantic Indexing Model** (see §3). More about operators in
Fock space is given in Appendix B.

**1.4 The Classical Probabilistic Model and TQP-space**

Assume that the term-document and term-query weights are all
binary, as in Eq. (1.2.1), and assume that *R* is the set of relevant documents for a query and R̄ is the set of irrelevant documents.
Then we may modify the identity operator in Eq. (1.1.1c) to include a
probability factor with respect to *q*, *R*
and R̄:
(1.4.1)

Here P(t_{i}|R) and P(t_{i}|R̄) may be computed using formulas like ([2],
page 32):

(1.4.2)

where P(t_{i}|R) is the probability that term t_{i} is present in a document randomly selected
from the set R, and P(t_{i}|R̄) is the probability that t_{i} is present in a document randomly selected
from the set R̄.
Usually, in the beginning, the two probabilities are assigned as constants for
all terms, for example ([2], page 33):

P(t_{i}|R) = 0.5,  P(t_{i}|R̄) = n_{i}/N (1.4.3)

Then, based on the retrieved documents, one can improve the
initial ranking.
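A sketch of this initial ranking, assuming the conventional estimates cited above (P(term|R) = 0.5 and P(term|R̄) = n_{i}/N) and a standard log-odds term weight; the exact formula of Eq. (1.4.2) in [2] may differ in detail:

```python
import math

def term_weight(p_rel, p_irrel):
    # Probabilistic term weight: log-odds of the term appearing in a
    # relevant vs. an irrelevant document (one standard form; assumed).
    return math.log(p_rel / (1.0 - p_rel)) + math.log((1.0 - p_irrel) / p_irrel)

def initial_sc(query_terms, doc_terms, n, N):
    # Initial ranking with the conventional estimates: P(term|R) = 0.5,
    # P(term|R_bar) = n_i / N (document frequency over collection size).
    sc = 0.0
    for i in query_terms & doc_terms:
        sc += term_weight(0.5, n[i] / N)
    return sc

# Toy collection: N = 4 documents; document frequencies n (hypothetical).
n = {"gold": 1, "silver": 2}
sc = initial_sc({"gold", "silver"}, {"gold"}, n, N=4)  # log(3)
```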

The Similarity Coefficient now can be calculated as:

(1.4.4)

The reason that we can use the restricted identity here is that we are limited
to the subset of terms contained in the query: all other terms, not contained in
the query, are orthogonal to it. Because this space is restricted to the subspace
spanned by the query terms, we may call it the **Term-Query
Probability Fock Space**, or TQP-space for short.

In many cases, we want to normalize the document and query vectors,
as in Eq. (1.1.8). Now suppose that the query is related to *m* terms. Then the TQP-space is an *m*-dimensional subspace of the WT-space
as described by Eq. (1.1.9):

(1.4.5)


**2. The Alternative IR Models**

In the classical Boolean model, a document either has a term or
does not have it. This fact is represented by the values 0 or 1 in the
vectors of (1.2.6a). A document is either relevant or irrelevant to a query;
this fact is reflected in the set expression (1.2.6c). The basic assumption can
be seen in Eq. (1.2.2b): all term states are eigenstates of the occupation
operator.

An interesting case in QM arises when a spin state is a
linear combination of the up and down states. Such a state is no longer an
eigenvector of S_{z}; we can only predict its “average”
value (assuming the state vector is normalized):

(2.1.1)

Because S_{z} can have only the eigenvalues -½ or ½, its
average value is in the range [-½, ½].
Based on Eq. (1.2.3c), we can think that, if there is dependence between
terms, we may assign a fuzzy number to the occupation operator for the i-th term in
a given document as (assuming the document vector
is normalized):
(2.1.2)

Because the occupation operator can have only the eigenvalues 0 or 1, its average
value is in the range [0, 1].

This is very much similar to *fuzzy set theory* (see [2], page 35 and its Ref. [846]), where we
can have *a degree of membership* for each term in a document. This can be
done by using a map in our 2D space (remember, each 2D space is for one term):

For a document vector, we have the following mapped *fuzzy
document vector*:

(2.1.4)

The *membership of the i-th term in the j-th document*
can be calculated, e.g., using the following formula (see [3], page 86, and our
notation in Eq. (1.1.3)):

(2.1.5)

In our *GF-Example*
in §1.1.1, we have:

.

Hence, the term gold has a membership in the first document of:

Other memberships are listed in Table 2.1 (see [3], page
86).

Table 2.1: Term Membership in Documents ( )

(2.1.5)

Now let us calculate the membership of documents relative to
the following query ([3], page 87):

A more explicit way to consider term-term correlations is to
introduce a correlation matrix (see [2], page 36 and its Ref. [616]). It can be
represented by the matrix of a correlation operator:

where n_{i} is defined in Eq. (1.1.3b), and n_{il} is the number of documents containing both terms.
Then the membership of the *i*-th term in the *j*-th document can be computed as:

In this approach, the fuzzy OR operation defined in (2.1.5)
is replaced by an algebraic sum, implemented as the complement of the algebraic
product of complements, and the fuzzy AND operation in (2.1.5) is computed using
the algebraic product. For example:

To use the above set operation rules, one should first decompose
the query concepts into a *disjunctive
normal form*, as a union of disjunctive concept vectors (see [2], pages
36-37).
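The algebraic sum and product can be sketched as follows (the function names are ours):

```python
def fuzzy_and(a, b):
    # Algebraic product for fuzzy AND.
    return a * b

def fuzzy_or(a, b):
    # Algebraic sum for fuzzy OR: the complement of the product of
    # complements, 1 - (1 - a)(1 - b) = a + b - a*b.
    return 1.0 - (1.0 - a) * (1.0 - b)

# Degrees of membership of two query terms in a document (hypothetical).
mu_or = fuzzy_or(0.5, 0.5)    # 0.75
mu_and = fuzzy_and(0.5, 0.5)  # 0.25
```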

In any case, it seems that using the FBT-space (the fuzzy BT-space) can greatly
simplify the expressions and calculations in fuzzy set examples. Based on our
discussion, the FBT-space is **a
retrieval-closed space of bounded t-dimensional continuous vectors**, as
described by Eq. (1.1.9).

Another way to improve Boolean search is to rank the
relevant documents based on their closeness to a query, using the Extended Boolean
Model, introduced by Salton, Fox and Wu [7, 8] (see [2], page 38; [3], page 67;
and [5]).

First, we assume that all terms in a document or a query are
assigned weights as in the classical vector model, described in §1.1. All
weights are normalized to the range [0, 1].

Next, the least or most desirable point is decided by
the operation of the Boolean query (union or conjunction). If the query is a union of *m* terms, we will have a least
desirable point; if the query is a conjunction of *m* terms, we will have a most
desirable point. To describe such points, we restrict ourselves to the subspace
spanned by the terms contained in the query. We call this space the *q*-space (as in the classical
probabilistic model). To simplify our formulas, we assume that **the first *m* terms** are contained in the query. Now we can express the
identity operator in the *m*-dimensional
*q*-space by:
(2.2.1)

In this subspace, the query and a document are expressed in
the WT-space (Weighted Term Vector Space)
as:

(2.2.2)

If the query is a union of *m* terms,

(2.2.3a)

then the *least desirable
point* is the origin of the weighted space ([2], page 39):

(2.2.3b)

The same vector, if represented in our TF-space (or
BT-space), will be:

(2.2.3c)

If the query is a conjunction of *m* terms:

(2.2.4a)

We can see that the *most
desirable point* is the vector of the highest weights ([2], page 39):

(2.2.4b)

The same vector, if represented in our TF-space (or
BT-space), will be:

(2.2.4c)

Now we need to define a measure, or a distance, in the space in
order to calculate the distance of a document from such a point. According to
[7, 8], we can define the normalized distance between two vectors in the
*m*-dimensional *q*-space with a
parameter *p* by:

(2.2.4)

This is a normalized form of the *Minkowski distance of order p*
(or *p*-norm
distance) in a Euclidean space, an example of a metric space [9].

With the above descriptions, we can simplify the Similarity
Coefficients (*SC*) formulas as
follows.

(2.2.5a)

(2.2.5b)

If *p* = 1, the
distance is the difference of the two vectors in an *m*-dimensional vector space; if *p*
= 2, the distance is that of an *m*-dimensional
Euclidean space; if *p* = ∞, Eq. (2.2.5) returns the result of the original Boolean
query.
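A sketch of the *p*-norm similarity coefficients of Eq. (2.2.5), in the standard Salton-Fox-Wu form (assumed here from [7, 8]; weights are normalized to [0, 1]):

```python
def sc_or(weights, p):
    # Normalized distance from the least desirable point (the origin).
    m = len(weights)
    return (sum(w ** p for w in weights) / m) ** (1.0 / p)

def sc_and(weights, p):
    # One minus the normalized distance from the most desirable point
    # (the vector of all ones).
    m = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / m) ** (1.0 / p)

w = [1.0, 0.0]          # a document matching one of two query terms
sc1 = sc_or(w, p=2)     # sqrt(1/2) ~ 0.707
sc2 = sc_and(w, p=2)    # 1 - sqrt(1/2) ~ 0.293
```

As *p* grows, `sc_or` approaches the maximum weight and `sc_and` the minimum weight, recovering the strict Boolean behavior.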

We can also include the weights of the query in the calculation
of the distance ([3], page 68). Then Eq. (2.2.4) becomes:

(2.2.6)

We call this space the Extended Boolean Term Fock Space
(EBT-space). Based on our discussion, the EBT-space is **a bounded t-dimensional continuous *metric*
space**: it is a retrieval-closed space of vectors as described by Eq.
(1.1.9), and it is also a metric space with a *p*-norm, (2.2.4) or (2.2.6), as its distance function. No inner
product is needed here.

**2.3 Term-Term Correlation and DV-space**

Because of the existence of term-term correlations, we cannot
say the terms are independent. This means that the term vectors in Eq. (1.1.1b) are
not orthogonal to each other anymore:

(2.3.1a)

Note that the completeness of terms, as described in Eq.
(1.1.1c), still holds as long as the expansions of documents and the query are unique
with respect to the defined inner products. In this case, we still have:

(2.3.1b)

There are many ways to compute term-term relationships. The *Generalized
Vector Space Model* was proposed by Wong, Ziarko and Wong in 1985 [6]. They
considered the space spanned by vectors similar to those described in Eq.
(1.2.6), introduced a set of orthonormal vectors as the basis of the
vector space, expressed terms as linear combinations of these base vectors, and
calculated term-term correlations based on such term expressions. More details
can be found in [2], pages 41-44.

Here we discuss a query expansion model based on a global
similarity thesaurus which is constructed automatically ([10]; [2], pages
131-133). Because of the conditions in Eq. (2.3.1), we cannot use the base
vectors in (1.1.1b) to represent our vector space. Instead, we expand the term
vectors with respect to the documents:

(2.3.2)

Assume that any term appears in at least one document and that no two
documents are identical in term frequency, so that a term vector can be uniquely
expanded with respect to the document vectors. This is a very strong requirement:
the term-document matrix defined in Eq. (3.1.1) needs to have rank *t*.
In this way, we can think of the document vectors as forming a basis of an
N-dimensional Fock space, with the term-document inner product defined in
(2.3.2):

(2.3.3)

The weights in Eq. (2.3.2), now the document occupation
numbers of the terms, are calculated using the following formula ([2], page 132),

(2.3.4)

where the term frequency tf_{iμ} is defined in Eq. (1.1.3a), max(tf_{μ}) is its maximum over all terms of a given document d_{μ},
and itf_{μ} is the inverse term frequency for document d_{μ},
defined by:

(2.3.5)

Here is the number of distinct terms in document .


Now we can express the term-term correlation factor as
(identical to Eq. (5.12) of [2], page 132):

(2.3.6)

Although the terms are not an orthonormal system, we can still
define the query vector as a linear combination of the terms contained in the query
([2], page 133):
(2.3.7)

Here, the weights are calculated similarly to Eq. (2.3.4),
but they are not equal to the inner product of the query with a term. Instead, that inner product
is computed using (2.3.7) and (2.3.6) as:

(2.3.8)

Now we can derive our expression for the Similarity
Coefficient as:

(2.3.9)

This equation is identical to the one on page 134, Ref [2].

We call the vector space spanned by the document vectors defined
in (2.3.3) the Document Vector Space (DV-space): an N-dimensional
vector space with the base vectors of (2.3.3), and with the inner products of
documents and queries with terms given by (2.3.2) and (2.3.8). Based on these rules, we can then
calculate the term-term correlations as in (2.3.6).

We see that in this model documents and queries are treated
differently, because they have different inner product rules with terms. In the
next section, we will see a more natural way of introducing term-term interactions,
in which documents and queries are treated in the same way, unified by a
metric tensor derived using SVD.


**3. Latent Semantic Indexing Model and SVD**


In the WT-space of the classical vector model, we have the basic
assumption that the term vectors form a complete orthonormal basis, as in Eq.
(1.1.1b). In reality, terms are not
independent vectors, because of the existence of term-document relations.

This situation is very much like that of the
quantum system of the hydrogen atom when the orbital-spin (*L-S*)
interaction is neglected ([1], pages 549-554). If we neglect the *L-S* interaction, the
Hamiltonian (*H*) of the system has
eigenvectors which are also the common eigenvectors of the orbital angular
momentum operators (*L*^{2}, L_{z})
and the spin angular momentum operators (*S*^{2},
S_{z}). Any two of these eigenvectors are
orthogonal if any one of the four eigenvalues (*L*^{2}, L_{z}, *S*^{2},
S_{z}) is different. But because of the *L-S* interaction, the orbital and spin degrees of freedom
become dependent, and we have no such common eigenvectors anymore. Instead, we have to introduce the total
angular momentum (*J = L + S*) to find the common eigenvectors of (*J*^{2}, *L*^{2}, *S*^{2},
J_{z}), which give us the fine structure of the hydrogen atom ([1],
pages 554-555).

Now we can apply the Singular Value Decomposition (SVD) to the
matrix *A*, to use the eigenvectors of both *L*
and *R*:

A = U S V^{T} (3.2.1)

Here, *U* and *V* are orthogonal matrices: the columns
of the *t x N* matrix *U*
are the *N* new term eigenvectors *u*,
the columns of the *N x N*
matrix *V* are the new document
eigenvectors *v*, and the *N x N* matrix *S* is a diagonal matrix whose
diagonal elements *S*_{i} (*the
singular values*) are the square roots of the common eigenvalues of *L* and
*R*. We can arrange the *S*_{i}
such that *S*_{1} > *S*_{2} > *S*_{3} > … > *S*_{N}.

One of the advantages of SVD is that, in most cases, we do
not need to use the eigenvectors of zero eigenvalues (usually, there are many of
them), and we can also keep only *r*
non-zero singular values to reduce the ranks of all matrices to *r*, without losing much accuracy:

(3.2.2)

This is referred to as the *Reduced SVD* (RSVD). We can visually
explain the reduction by the following figure (a copy of Figure 8 in Tutorial 3
of Ref. [12], where *k* is used for our *r*).

**[12], Figure 8: The Reduced SVD or Rank *k* Approximation**

Using the Dirac notation, we can express (3.2.2) as:

(3.2.3)

Here we have used following notations:

(3.2.4a)

This means that:

(3.2.4b)

Also note that here the new document vectors are reduced to *r*-dimensional.

A very important equation derived from Eq. (3.2.1) is ([12],
tutorial 4):

(3.2.5a)

Using Eq. (3.1.1) and (3.2.4a), we can express (3.2.5a) in
reduced form as:

(3.2.5b)

This equation tells us how to transform an original
document vector into a new vector:

(3.2.6a)

Since the query vector is treated like a document vector, we
can compute the new query vector as:

(3.2.6b)

Eq. (3.2.6) gives us a geometric interpretation of the
transformation: the new document vector is represented in a new basis of the new
term vectors (the eigenvectors *k’* of *L*); its *i*^{th} component is its inner
product with the corresponding new term vector,
scaled by a factor of 1/*S*_{i}. So we can say the new document vector is the
result of a rotation (from *k* to *k’*) plus a scaling. Then the inner product of two new
vectors can be derived from Eq. (3.2.6) as:

(3.2.7)

Finally, the ranking of documents now can be calculated
using the cosine law (1.1.5), or:

(3.2.10)

Now let us introduce a metric tensor to simplify our
calculations. We can rewrite Eq. (3.2.7) in a more concise form:

(3.2.11)

Here, we have defined our Riemannian metric tensor as:

(3.2.12)

In this way, we actually interpret the inner product of two
vectors as living on the tangent space of an *r*-dimensional
Riemannian manifold [16][19]:

(3.2.13)

The norm of a vector can be calculated as the square root
of its inner product with itself. Eq. (3.2.10) now can be rewritten as:

(3.2.14)

We see that, from Eqs. (3.2.11-13), to compute the ranking
of a document with respect to a query, all we need to do is:

- Calculate the term-term interaction matrix *L* from the term-document frequency matrix *A*, using Eq. (3.1.2).
- Calculate the eigenvalues of *L*, and then keep the eigenvectors for the top *r* non-zero eigenvalues (*S*_{1} > *S*_{2} > … > *S*_{r} > … > 0).
- Calculate the *t x t* metric tensor *g* using Eq. (3.2.12).
- Calculate the ranking using (3.2.14).
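The four steps can be sketched as follows (a sketch assuming numpy; `lsi_metric` and `sc` are our own names, and the term-document matrix below is a toy, not the GF-Example data):

```python
import numpy as np

def lsi_metric(A, r):
    # Steps 1-3: form L = A A^T, keep the eigenvectors k'_a of the top r
    # eigenvalues, and build the metric tensor
    #   g = sum_a k'_a k'_a^T / S_a^2,
    # where S_a^2 is the a-th eigenvalue of L (cf. Eq. (3.2.12)).
    L = A @ A.T
    eigvals, eigvecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:r]       # indices of the top r eigenvalues
    K = eigvecs[:, idx]                       # columns are the kept k'_a
    return K @ np.diag(1.0 / eigvals[idx]) @ K.T

def sc(d, q, g):
    # Step 4: cosine ranking with the metric inner product <d|g|q>,
    # as in Eq. (3.2.14).
    return (d @ g @ q) / np.sqrt((d @ g @ d) * (q @ g @ q))

# Toy 3-term, 2-document frequency matrix (hypothetical).
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
g = lsi_metric(A, r=2)
```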

Now let us go over it with our GF-example.

**Step 1**: we use the online
tool [14] to calculate *L* = *AA*^{T};
the matrix is as follows:

3 2 1 1 1 2 3 3 2 2 2
2 2 0 1 0 1 2 2 1 2 2
1 0 1 0 1 1 1 1 1 0 0
1 1 0 1 0 0 1 1 0 2 1
1 0 1 0 1 1 1 1 1 0 0
2 1 1 0 1 2 2 2 2 0 1
3 2 1 1 1 2 3 3 2 2 2
3 2 1 1 1 2 3 3 2 2 2
2 1 1 0 1 2 2 2 2 0 1
2 2 0 2 0 0 2 2 0 4 2
2 2 0 1 0 1 2 2 1 2 2

**Step 2**: we use the
online tool [15] to calculate its eigenvalues and eigenvectors. It has 3
non-zero eigenvalues, whose square roots *S*_{i}
are:

{*S*_{1}, *S*_{2}, *S*_{3}} = {4.0989, 2.3616, 1.2737} (3.3.1)

Their corresponding eigenvectors (*k’*_{1}, *k’*_{2}, *k’*_{3}) are as follows:

[-0.4201, -0.2995, -0.1206, -0.1576, -0.1206, -0.2626, -0.4201, -0.4201, -0.2626, -0.3151, -0.2995]^{T}

[-0.0748, 0.2001, -0.2749, 0.3046, -0.2749, -0.3794, -0.0748, -0.0748, -0.3794, 0.6093, 0.2001]^{T}

[-0.0460, 0.4078, -0.4538, -0.2006, -0.4538, 0.1547, -0.0460, -0.0460, 0.1547, -0.4013, 0.4078]^{T}

**Step 3**: we keep
all three eigenvectors, i.e., set *r* =
3 in Eq. (3.2.2); using a simple computer program [18] and Eq. (3.2.12), we get
the metric matrix *g* as follows:

0.0127 -0.0067 0.0195 0.0055 0.0195 0.0072 0.0127 0.0127 0.0072 0.011 -0.0067

-0.0067 0.1149 -0.1217 -0.0366 -0.1217 0.0299 -0.0067 -0.0067 0.0299 -0.0733 0.1149

0.0195 -0.1217 0.1412 0.0421 0.1412 -0.0226 0.0195 0.0195 -0.0226 0.0844 -0.1217

0.0055 -0.0366 0.0421 0.0428 0.0421 -0.0373 0.0055 0.0055 -0.0373 0.0857 -0.0366

0.0195 -0.1217 0.1412 0.0421 0.1412 -0.0226 0.0195 0.0195 -0.0226 0.0844 -0.1217

0.0072 0.0299 -0.0226 -0.0373 -0.0226 0.0446 0.0072 0.0072 0.0446 -0.0747 0.0299

0.0127 -0.0067 0.0195 0.0055 0.0195 0.0072 0.0127 0.0127 0.0072 0.011 -0.0067

0.0127 -0.0067 0.0195 0.0055 0.0195 0.0072 0.0127 0.0127 0.0072 0.011 -0.0067

0.0072 0.0299 -0.0226 -0.0373 -0.0226 0.0446 0.0072 0.0072 0.0446 -0.0747 0.0299

0.011 -0.0733 0.0844 0.0857 0.0844 -0.0747 0.011 0.011 -0.0747 0.1716 -0.0733

-0.0067 0.1149 -0.1217 -0.0366 -0.1217 0.0299 -0.0067 -0.0067 0.0299 -0.0733 0.1149

**Step 4**: using the
simple computer programs [18] and Eqs. (3.2.10) and (3.2.12), we get the following
similarity coefficients:

*SC*(d_{1}, q) = (d_{1}, q)/|d_{1}|/|q| = -0.2787

*SC*(d_{2}, q) = (d_{2}, q)/|d_{2}|/|q| = 0.7690 (3.3.2)

*SC*(d_{3}, q) = (d_{3}, q)/|d_{3}|/|q| = 0.5756

We see that the ranking of the documents is:

*d*_{2} > *d*_{3} > *d*_{1} (3.3.3)


To compare with the results in [3] and [12], we keep only the
top 2 eigenvectors of *L*, i.e., set *r*
= 2 in Eq. (3.2.12), and the metric matrix *g*
now becomes:

0.0114 0.0047 0.0066 -1.0E-4 0.0066 0.0116 0.0114 0.0114 0.0116 -2.0E-4 0.0047

0.0047 0.0125 -0.0077 0.0137 -0.0077 -0.0089 0.0047 0.0047 -0.0089 0.0274 0.0125

0.0066 -0.0077 0.0144 -0.0138 0.0144 0.0205 0.0066 0.0066 0.0205 -0.0277 -0.0077

-1.0E-4 0.0137 -0.0138 0.018 -0.0138 -0.0182 -1.0E-4 -1.0E-4 -0.0182 0.0361 0.0137

0.0066 -0.0077 0.0144 -0.0138 0.0144 0.0205 0.0066 0.0066 0.0205 -0.0277 -0.0077

0.0116 -0.0089 0.0205 -0.0182 0.0205 0.0298 0.0116 0.0116 0.0298 -0.0365 -0.0089

0.0114 0.0047 0.0066 -1.0E-4 0.0066 0.0116 0.0114 0.0114 0.0116 -2.0E-4 0.0047

0.0114 0.0047 0.0066 -1.0E-4 0.0066 0.0116 0.0114 0.0114 0.0116 -2.0E-4 0.0047

0.0116 -0.0089 0.0205 -0.0182 0.0205 0.0298 0.0116 0.0116 0.0298 -0.0365 -0.0089

-2.0E-4 0.0274 -0.0277 0.0362 -0.0277 -0.0365 -2.0E-4 -2.0E-4 -0.0365 0.0724 0.0274

0.0047 0.0125 -0.0077 0.0137 -0.0077 -0.0089 0.0047 0.0047 -0.0089 0.0274 0.0125

Using this metric tensor in Eqs. (3.2.13-14), we recalculate
the similarity coefficients:

*SC*(d_{1}, q) = (d_{1}, q)/|d_{1}|/|q| = -0.0552

*SC*(d_{2}, q) = (d_{2}, q)/|d_{2}|/|q| = 0.9912 (3.3.4)

*SC*(d_{3}, q) = (d_{3}, q)/|d_{3}|/|q| = 0.4480

The results are almost identical to those in [3] and
[12] ([3], page 73; [12], Tutorial 4, Figure 6). Their results are:

*SC*(d_{1}, q) = (d_{1}, q)/|d_{1}|/|q| = -0.0541

*SC*(d_{2}, q) = (d_{2}, q)/|d_{2}|/|q| = 0.9910 (3.3.5)

*SC*(d_{3}, q) = (d_{3}, q)/|d_{3}|/|q| = 0.4478

We see that the order of the ranking of documents is the
same as in Eq. (3.3.3), where we used *r* = 3:

*d*_{2} > *d*_{3} > *d*_{1} (3.3.6)

From the numerical results, we see an important
characteristic of LSI/SVD: the relevance coefficient can be negative. This
means the angle between a document and the query can be greater than π/2. If
we use the cosine-squared law of Ref. [19], as in Eq. (1.1.11), then the ranking
may become problematic.

In this section we would like to give a brief geometric
interpretation of the metric tensor, SVD and the TF-space of LSI.

We know that in the Cartesian coordinate system of a
Euclidean space, the metric has the simplest form:

g_{ij} = δ_{ij} (3.4.1)

But in other curvilinear coordinate systems (like polar or
spherical), metric tensors have different forms, and they can be transformed into
each other. The space is still flat (the curvature is zero). In general relativity (see [17], chapter 3),
the metric tensor is the solution of the Einstein field equations, representing a
space curved by the distribution of matter, modulo general coordinate
transformations.

A metric tensor for IR has been mentioned in Ref. [19] (page 86).
Following the notation there, we can rewrite Eq. (3.2.12) as:

(3.4.2)

Here the *r*-dimensional
base vectors have the following components:

(3.4.3)

Now Eq. (3.2.13) can be rewritten as:

(3.4.4)

The following are 2D graphic representations of the related
vectors.

In short, we have provided a concrete example of a metric
tensor from LSI. We see that, because of the term-term interaction, the old
term coordinate system (in which we have the representation of the term-document
interaction array) is no longer a Cartesian coordinate system. Actually, SVD
finds the new orthogonal coordinate system (using the new term vectors *k’*) and the
transformation between the two systems. The existence of a global Cartesian
coordinate system shows that *the Term Frequency space of LSI is a flat
space*. This is easy to see, because the metric tensor defined in Eq.
(3.2.12) or (3.4.2) is a globally constant tensor.

Note that in the Extended Boolean Model we have also used a
metric, the distance function of Eq. (2.2.4). But that is not a Riemannian
metric, because it is not a rank-2 tensor and cannot be used to define an inner
product of vectors in that space. On the other hand, a Riemannian metric can
also be used as a distance function, although we use it here only to define the
inner product of vectors on the tangent space. Because the Term Frequency space
is flat, its tangent space at any point is the space itself.
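In standard notation (ours, not reproduced from the text), the two roles of a Riemannian metric mentioned here are:

```latex
\langle u, v\rangle = g_{ij}\, u^i v^j ,
\qquad
s = \int_{\lambda_0}^{\lambda_1}
\sqrt{\, g_{ij}\, \dot{x}^i \dot{x}^j \,}\; d\lambda ,
```

the first defining the inner product of tangent vectors, the second the length of a curve x(λ).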

We did not introduce a metric tensor for the weighted term
vector space in §1.1, because there we have already assumed no term-term
interaction; hence the old term vectors form an orthonormal basis, as in Eq.
(1.1.1b). In addition, since the weight is not always proportional to term
frequency, as in Eq. (1.1.4d), it is not natural to cast such an inner product
in metric-tensor form.

**Appendix B. Fock Space**

Fock space describes the states of “many-particle” systems
(our documents); it is the space used in *second quantization*, or quantum field
theory. In Fock space, the basic representation is the ONR, and the basic
operators are the creation and annihilation operators ([1], page 576). They add
or remove a term in a given document.

For TF-space (documents of terms as “Bosons”), they have the
following properties:

(B.1)

Here we have used the definition of the *commutation relation*:

(B.2)
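For reference, in the standard form found in quantum mechanics texts (e.g. [1]), with the commutator defined as [A, B] ≡ AB − BA, the bosonic relations read:

```latex
[a_i, a_j^\dagger] = \delta_{ij},
\qquad
[a_i, a_j] = [a_i^\dagger, a_j^\dagger] = 0 .
```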

For BT-space (documents of terms as “Fermions”), their
properties are:

(B.3)

Here we have used the definition of the *anti-commutation relation*:

(B.4)
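Again for reference, the standard fermionic relations, with the anticommutator {A, B} ≡ AB + BA, are:

```latex
\{a_i, a_j^\dagger\} = \delta_{ij},
\qquad
\{a_i, a_j\} = \{a_i^\dagger, a_j^\dagger\} = 0 .
```

The special case (a_i†)² = 0 is the exclusion principle: in BT-space a term either occurs in a document or it does not.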

The occupation number can be expressed as follows ([1], page
575):

(B.5)

It has different eigenvalues for Bosons (n = 0, 1, 2, …) and for
Fermions (n = 0, 1), because they obey the different commutation rules (B.2)
and (B.4). We are not going to use operators other than the occupation-number
operator, but it is worthwhile to know that, in Fock space, all operators can
be expressed in terms of creation and annihilation operators. For example, if *ĥ*
is a single-term operator, then the corresponding operator of the system (a
document) can be written as:

(B.6)

Note that the term states can have a continuous spectrum; in
this case, the operator can be written as:

(B.7)

These features might be useful to deal with document
updating or down-dating, and also with relations to non-text collections.

The Hilbert space and operators introduced in Ref. [19] are
mainly based on “single-particle” states, or *first quantization*. We see that
there is a homomorphism between the document vectors in the classical Vector
Space Model and in TF-space, so we can use either to investigate them. But for
the classical Boolean Model, the BT Fock space may be the most natural one to
use.


**References**

[1].
Herbert Kroemer. *Quantum Mechanics: For Engineering, Materials Science, and
Applied Physics*. Prentice Hall, 1994.

[2].
Ricardo Baeza-Yates and Berthier Ribeiro-Neto. *Modern
Information Retrieval*. Addison Wesley, New York, 1999.

[3].
David A. Grossman and
Ophir Frieder. *Information Retrieval: Algorithms and Heuristics*, 2^{nd}
Edition. Springer, 2004.

[4].
G. Salton and M.E.
Lesk, *Computer evaluation of indexing and
text processing*. Journal of the ACM, 15 (1): 8-16, Jan., 1968.

[5].
G. Salton. *The SMART Retrieval System – Experiments in
Automatic Document Processing*. Prentice Hall Inc., Englewood Cliffs, NJ,
1971.

[6].
S. K. M. Wong, W.
Ziarko, and P. C. Wong. *Generalized
vector space model in information retrieval*. In *Proc. 8^{th} ACM SIGIR Conference on Research and Development
in Information Retrieval*, pages 18-25, New York, USA, 1985.

[7].
Gerard Salton, Edward A. Fox, and Harry Wu, *Extended Boolean information retrieval*.
Communications of the ACM, 26(11):1022-1036, November 1983.

[8].
Gerard Salton. *Automatic Text Processing*.
Addison-Wesley, 1989.

[10].
Yonggang Qiu and H. P. Frei. *Concept based query expansion*. In *Proc. 16^{th} ACM SIGIR Conference on Research and Development
in Information Retrieval*, pages 160-169, Pittsburgh, USA, 1993.

[11].
G. W. Furnas, S. Deerwester, S. T. Dumais, T. K. Landauer, R.
A. Harshman, L. A. Streeter, and K. E. Lochbaum. *Information retrieval using a singular value decomposition model of
latent semantic structure*. In *Proc.
of the 11^{th} Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval*, pages 465-480, 1988.

[13].
Steven J. Leon. *Linear Algebra with Applications*. 4^{th}
edition, Macmillan College Publishing Company, Inc, 1994.

[17].
S. Weinberg. *Gravitation and Cosmology*. John
Wiley, 1972.

[19].
Keith van Rijsbergen. *The Geometry of Information Retrieval*.
Cambridge University Press, 2004.

[20].
Lorenzo Sadun. *Applied Linear Algebra: The Decoupling
Principle.* Prentice Hall, 2000.