Dirac Notation, Fock Space and Riemann Metric Tensor in Information Retrieval Models

 

Sherman Wang, PhD in Theoretical Physics

Email: xmwang@shermanlab.com

First Draft: 10/22, 2006

Second Draft: 12/03/2006

 

Introduction

 

Using Dirac Notation, various modern Information Retrieval (IR) models (vector space, Boolean, probabilistic and some their extensions) are investigated in Occupation Number Representations (ONR) of Fock spaces. A Riemannian metric tensor is derived based on Singular Value Decomposition (SVD) for Latent Semantic Indexing (LSI) Model.

 

This article is not about how to deduce models from first principals, but is about how to use Dirac notation and Fock space to investigate exiting models in a concrete way of representations. After finishing first draft, we notice that Van Rijsbergen has discussed ([19]) IR in an abstract way, using observables (Hermitian operators) and Hilbert space with Dirac notation. Whenever is possible, we will make use of and make comparison to Ref. [19].

 

Table of Contents

 

  1. The Classical IR models and Fock Spaces

1.1.  The Vector Space Model and Weighted Term Vector Space             

1.2.  The Boolean Model and Boolean Term Vector Space             

1.3.  Fock Spaces and Occupation Number Representations                       

1.4.  Probabilistic Model and Term-Query Probability                    

 

  1. The Alternative IR Models

2.1.  The Fuzzy Set Retrieval and Fuzzy Boolean Term Fock Space            

2.2.  Extended Boolean Model                                         

2.3.  Term-Term Correlation and the Document Fock Space              

 

  1. LSI, SVD and Riemann Metric Tensor                                          

3.1.  Term-Document  Term-Term and Document-Document Arrays    

3.2.  SVD, Vector Inner Product and Metric Tensor                      

3.3.  Term-Frequency Fock Space and Metric Tensor for GF-example  

3.4.  Geometric Meaning of Metric Tensor in Term Frequency Space  

      

Summary                                          

Appendix A: The Grossman-Frieder example                                                  

Appendix B: More on Operators in Fock Space                                              

References                               

 

 

1. The Classical IR Models and Fock Spaces

 

In this section, we first give detailed applications of Dirac notation for the three classical models: vector space model, Boolean model and probabilistic model in their classical ways. Then we introduce Fock space as a unified way to represent the above models.

 

 

1.1. The Vector Space Model and Weighted Term Vector Space

 

Suppose we have t-independent terms in our collection, we can think of them as othonormal vectors in a t-dimensional real vector space:

 

                                               (1.1.1a)

 

In Dirac notation, we can define a set of the orthonormal vectors  ([1], pages 192-195):

 

                                                                 (1.1.1b)

 

Here,  is called a ket;  is its Hermit conjugate (or, transpose for real vector), and is called a bra;  is a bracket, the inner product of two vectors. More about Dirac notation can be found in Ref [19] and [20].

 

Because any vector in the term-document space (i.e., terms, documents and queries) can be uniquely expanded with respect to the terms, we say that the terms construct a basis of the space. Moreover, if the expansion coefficients are uniquely determined by the inner product of document (and query) with term, the completeness can be written as:

 

                                                                                                      (1.1.1c)

 

Now we can represent a query vector q and document  (μ=1, 2…N) as:

    (1.1.2a)

 

Here,  and   are called term weights of a query or a document with respect to the ith term. They actually define the inner product of a vector (or a query) with a term ([2], page 29; [3], page 15):

 

                    (1.1.2b)

Note that we have used the fact that, for two real vectors,

 

 

This is the classical vector space model (VSM), proposed by Salton et al ([4, 5]). We call the vector space spanned by these t base-vectors of Eq. (1.1.1b) a Weighted Term Vector Space, or WT-space for short.  The weights can be calculated based on different schemes, usually involving ([2], page 29; [3], page 15) following parameters:

 

: The frequency of i-th term in document                                                        (1.1.3a)

: The number of documents containing i-th term                                                    (1.1.3b)

 

: The length (or word count) of the  -th document                                            (1.1.3c)

 

: Number of all documents in the collection                                                (1.1.3d)

 

                      (1.1.3e)

           

 

: The frequency of i-th term in the query ([2], page 29)                          (1.1.3f)

 

In reference [3], the following formula is used to calculate idf,  and :

 

                                                                                             (1.1.4a)

 

                                                                                           (1.1.4b)

 

                                                                                (1.1.4c)

 

Many variations on how to compute weights have been done. One good performer is identified as ([3], page 17):

                                         

                                                                      (1.1.4d)

 

After we have computed the weights, we now can calculate the relevance or the Similarity Coefficient (SC) of document   with respect to query q. In classical vector model, SC is the cosine of the angle between the two vectors, ([2], page 27), which can be written as:

 

                                         (1.1.5)

 

Here, using (1.1.2a, b) and (1.1.1c), we have:

 

                                  (1.1.6)

 

 

        (1.1.7a)

 

        (1.1.7b)

In many cases, we want to normalize  and , so that:

          (1.1.8)    

 

We see that the normalized WT-space is a bounded t-dimensional continuous space over the field of [0, 1], restricted in the unit cubes:

 

                                                                         (1.1.9)

 

Note that, the vectors restricted by Eq. (1.1.9) do not form a vector space, because their inverse elements are not included, and they do not close to vector addition. But they do form a subspace, closed to retrieval-related operations. We call such space a retrieval-closed space. Actually, this space is better to be classified as a Fock space (see Section 1.3).

 

Up to now, the only properties (or assumptions) we know about the base vectors  are in Eq. (1.1.1). These assumptions may be relaxed or violated in other models. We also have to define the rules to calculate the inner product of document and query with term vector (the weights), for example, by Eq. (1.1.4).

 

If we following the discussion in Ref [19], we can rewrite Eq. (1.1.6) using the trace operation:

                                             (1.1.10)

We accept is as Relevance Coefficient (the cosine law), while in Ref [19], the related probability is given by (page 87) the cosine-squared law:

 

                                                             (1.1.11)

 

In reference [3], a simple example is used throughout that book. To make it convenient for readers, we will reuse this example and restate the example ([3], page 15). The example, referred to GF-Example from now on, has a collection of three Documents and one Query as described in Appendix A.

 

To avoid unnecessary confusion, we use the following conventions for labeling indexes of vectors or tensors:

 

Index Convention:

  1. For terms, we use Latin chars i, j, k, l, m
  2. For documents, we use Greek chars
  3. For query, we use Latin char q.
  4. For other cases, like in LSI, we use Latin chars a, b, c

 

 

1.2. The Classical Boolean Model and Boolean Term Vector Space

 

The classical Boolean model is discussed in [2], §2.5.2.  Here we want to use a vector space to describe this model.

 

Assume that the term-document and term-query weights are all binary, i.e.:

 

                                                   (1.2.1)

 

Now we introduce new vector space to represent documents and query in this space. Each vector is a director product of t 2-dimentional vector. This space has M =  points.  We call this vector space a Boolean Term Vector Space, or BT-space for short.  Suppose we have 5 independent terms, and a sample vector has three terms: first, second and the fifth. Then in the BT-space, this vector can be represented as:

 

                                             (1.2.2a)

 

Here we have used an explicit representation of each 2D factor space. They are chosen as eigenvectors of an occupation operator:

 

                      (1.2.2b)

 

 

The 2-dimentional vector looks similar to the spin-1/2 vector representation in quantum mechanics (QM). Only difference is, in QM, the base vectors usually are common eigenvectors of operators  and , corresponding to spin-up and spin-down states (in natural units,  ):

 

                                                               (1.2.3a)

 

Using these two states as basis, we have their matrix expressions:

 

                               (1.2.3b)

 

Comparing (1.2.2b) and (1.2.3b), we see the following relations:

 

                                                             (1.2.3c)

 

In BT-space, the basis in equation (1.1) now can be represented as a set of t elements of the form:

 

   (1.2.4a)

 

 

And we have t occupation operators:

 

                                                                            (1.2.4b)

 

This means that  acts on the l-th 2-dimentional vector of the product; if it is |1>, returns an eigenvalue 1, otherwise returns 0. Therefore we can think of the t base vectors as the common eigenvectors of the t occupation operators. We call it occupation operator, because its eigenvalues are similar to the occupation number of a system of fermions (see §1.3).

 

A document now can be represented as:

 

                                       (1.2.5)

 

 

The documents in GF-Example now can be represented as:

 

                                                      (1.2.6a)

 

Now let us assume that our binary query is “  ”, i.e.:

 

                                                                                         (1.2.6b)

 

Then we can see only  is relevant, because it has the sixth and ninth term but not has the fifth term as the query requires. In other words, the relevant documents of this query are in the following set (or a concept):

 

                                         (1.2.6c)

 

Using the vector space, we can also think that any relevant document  is a common eigenvector of occupation operators  and  such that:

 

                                               (1.2.6d)

 

Our BT-space is very close to what is used by the Generalized Vector Space, introduced by Wang, Ziarko, and Wong [6]. The main difference is that they consider the space is of  dimensional, and introduce  base vector to represent a term vector and term-term correlations. In our case, we rather think it is a product of t factor spaces, which is spanned by the eigenvectors of the occupation operator g. This factor space is 2-dimesional for Boolean model, but can be extended easily, as described in next sections (see §1.3 and §2.1).

 

We see that the BT-space is a bounded, finite and t-dimensional discrete space over the field of {0, 1}, restricted to the vertices of the unit cubes in our WT-space:

 

                                                                        (1.2.7)

 

Again note that, the vectors defined by Eq. (1.2.7) do not form a vector space, but a retrieval-closed space of vectors. It is also a Fock space, as we will see in next section.

 

Following the discussion in [19] (pages 56-60), we see that our occupation operator is a projector, and can be written as:

 

                                              (1.2.8)

 

 

1.3. Fock Spaces and Occupation Number Representations

 

The vectors like (1.2.6a) remind us of the Occupation Number Representation (ONR) ([1], page 566) in a multi-particle system:

 

                                                                                         (1.3.1)

 

Here  is the number of particles occupying the  state (orbital). This vector space is called Fock space. For fermions (e.g., electrons, protons…),  can only be 0 or 1.

Our BT-space is also a Fock space, and its representation can be called as Boolean Term Occupation Number Representation (BT-ONR), in which the number can have only value 0 or 1 for a given term.  The occupation operator  now can be interpreted as Term Occupation Number Operator. Its eigenvalue   is 0 or 1 in BT-space. Because any term, word group, sentence, paragraph, document or query is represented by one of the M =  vectors of the form (1.3.1), we can also think of TB space as a finite, discrete set of  points (or elements). Each point is a concept. The interesting example is the point representing the vocabulary, or all distinct terms, of the collection ([2], page 174):

 

                                                                                                  (1.3.2)

 

The base vector (1.2.3) now can also be written as:

 

                                     (1.3.3)

 

They form a subset of the t terms, each represent one term. Here we introduce the t-plet notation of the t base vectors (points) in the Fock space:

 

                                                                                                 (1.3.4a)

 

Note that, in this notation, the addition of two t-plets is actually their union. For example, in our 5-term space, a document vector   can be written as:

                                                                           (1.3.4b)

 

If we allow the term number to be any non-negative integer (or natural number), like the term-frequency in a document, then we have a new representation of the classic vector space: a Term Frequency Fock Space, or TF-space for short, which is similar to ONR for Bosons in Quantum mechanics (e.g., photons, mesons…). 

 

The number vector (1.3.4a) can now be generalized to:

 

                                    (1.3.5)

 

The occupation operator now has such a property:

 

                                        (1.3.6)

 

In this representation, the three documents and the query in our GF-Example now can be written as:

                              (1.3.7a)

 

Using term vectors in Eq. (1.3.3) as our base, we can represent the above document vectors and the query vector in columns as follows:

 

                             (1.3.7b)

 

Note that, in this notation, the addition of two t-plets is actually the addition of number vectors. For example, in a 5-dimetional term space, a document vector   can be written as:

 

                                                                   (1.3.8)

 

Now we see that the TF-space is a retrieval-closed space of bounded t-dimensional discrete vectors, restricted on the vertices of the unit cubes in our WT-space:

 

                                                                                    (1.3.9)

 

In TF-space, we do not use vector addition of documents, but we do need to define inner product, which can be derived from their correspondence in WT space:

 

                                   (1.3.10)

 

 

We will see that the TF-space and their representations will greatly simplify the description of the so-called Latent Semantic Indexing Model (see §3). More about Operators in Fock space is given in Appendix B.

 

1.4. The Classical Probabilistic Model and Term-Query Probability Fock Space

 

Assume that the term-document and term-query weights are all binary, as in eq. (1.2.1), and assume that  is the set of relevant documents for a query  and  is the irrelevant document for the query. Then we may modify the identity matrix in eq. (1.1.1) to including a probability factor with respect to q, R and :

  (1.4.1)

 

Here  and  may be computed using formulas like ([2], page 32):

 

                                                                      (1.4.2)

 

where  is the probability when term  is present in a document randomly selected from the set  and  is the probability when  is resent in a document randomly selected from the set . Usually, in the beginning, the two probabilities are assigned as constants for all terms, for example ([2], page 33):

 

                                                                  (1.4.3)

 

Then, based on the retrieved documents, one can improve the initial ranking.

 

The Similarity Coefficient now can be calculated as:

 

                 (1.4.4)

 

The reason that here we can use , instead of , is that we are limited in the subset of terms in query , and all other terms , not contained in , are orthogonal to , i.e.,   Because this space is restricted to subspace spanned by , we may call this space as Term-Query Probability Fock Space, or TQP-space for short.

 

In many cases, we want to normalize  and , as in Eq. (1.1.8). Now suppose that the query is related to m terms. Then the TQP –space is an m-dimensional subspace of the WT-space as described by Eq, (1.1.9):

 

                                    (1.4.5)

 

 

2. The Alternative IR Models

 

2.1. The Fuzzy Set Retrieval and Fuzzy Boolean Term Fock Space

 

In classical Boolean Model, a document either has a term or does not have the term. This fact is represented in the values 0 or 1 in the vectors of (1.2.6a). A document is either relevant or irrelevant to a query. This fact is reflected in the set expression (1.2.6c). The basic assumption can be seen in Eq. (1.2.2b), that is, all term states are eigenstates of occupation operator.

 

An interesting case in QM is that if a spin state is a linear combination of the up and down status. Then this state is not an eigenvector of  anymore. We can only predict its “average” value (assume the state vector is normalized):

 

                                                                                         (2.1.1)

 

 

Because  can have eigenvalue only -½ or ½, its average value in the range of [-½, ½].  Based on Eq. (1.2.3c), we can think that if there is dependence between terms, we may assign a fuzzy number to occupation operator for i-th term in the  -th document as (assume the document vector is normalized):

 

                                                                                                   (2.1.2)

 

Because  can have eigenvalues only 0 or 1, its average value in the range of [0, 1]. 

 

This is very mush similar to fuzzy set theory, (see [2], page 35 and its Ref. [846]), where we can have a degree of membership for each term in a document. This can be done by using a map in our 2D space (remember, each 2D space is for one term):

 

                                                                           (2.1.3)

 

For a document vector, we have the following mapped fuzzy document vector:                                   

                               (2.1.4)

 

The membership of i-th term in j-th document, , can be calculated, e.g., using following formula (see [3], page 86 and our notation in Eq. (1.1.3):

 

                                                                                    (2.1.5)

 

In our GF-Example in §1.1.1, we have:

 

.

 

Hence, the term gold has a membership in first document as:

                              

 

Other memberships are listed in table 2.1 9(see [3], page 86).

 

Table 2.1: Term Membership in Documents (  )

 

Term

1

2

3

4

5

6

7

8

9

10

11

Word

a

arrived

damaged

delivery

fire

gold

in

of

shipment

silver

truck

0.143

0

0.143

0

0.143

0.143

0.143

0.143

0.143

0

0

0.125

0.125

0

0.125

0

0

0.125

0.125

0

0.25

0.125

0.143

0.143

0

0

0

0.143

0.143

0.143

0.143

0

0.143

 

 

Using our notation, the fuzzy document vectors can be written as:

 

We call this as a Fuzzy Boolean Term Fock Space, or FBT-space, which can be thought as an extension of TF-ONR by mapping Frequency to a real number between 0 and 1. Again, it is a retrieval-closed space of vectors.  

 

To get the membership of i-th term in j-th document, we can define a new fuzzy occupation operator  and a Fuzzy Membership function :

 

                                                    (2.1.4)

 

Note that the M-function should satisfy the fuzzy set operation rules ([2], page 35 and [3], page 85):

 

                                                                                    (2.1.5)

 

Now let us calculate the membership of documents relative to the following query ([3], page 87):

 

                                        

 

A more explicit way to consider term-term correlation is to introduce a correlation matrix (see [2], page 36 and its Ref. [616]). It can be represented by a matrix representation of a correlation operator:

 

                                                                        (2.1.6a)

 

 

where  is defined in Eq. (1.1.3b), and  is the documents which contains both terms. Then the membership of i-th term in  -th document can be computed as:

 

                         (2.1.6b)

 

In this approach, the fuzzy OR operation defined in (2.1.5) is replaced by an algebraic sum, implemented as a complements of a negative algebraic products, and the fuzzy AND operation in (2.1.5) is computed using algebraic product. For example:

 

                                                   (2.1.7)

 

To use above set operation rules, one should first decompose the query concepts in a distinctive normal form, as a union of disjunctive concept vectors (see [2], pages 36-37).

 

In any case, it seems that, using FBT-space can greatly simplify the expressions and calculations for fuzzy set examples. Based on our discussion, FBT-space is a retrieval-closed space of bounded t-dimensional continuous vectors, as described by Eq. (1.1.9).

 

 

2.2. Extended Boolean Model

 

Another way to improve Boolean search is to rank the relevance document based on its closeness to a query, using Extended Boolean Model, introduced by Salton, Fox and Wu [7, 8] (see [2], page 38, [3], page 67 and [5]).

 

First, we assume that all terms in a document or a query are assigned with weights as in classical vector model, described in §1.1. All weights are normalized to be in the range of [0, 1].

 

Next, the least or most desirable point is decided based on the operation of the Boolean query (union or join). If the query is a union of m terms, we will have the least desirable point; if the query is a joint of m terms, then we will have the most desirable points. To describe such points, we restrict ourselves to a subspace, spanned by the terms contained in the query. We call this space the q-space (like in the classical probabilistic model). To simplify our formula, we assume that the first m terms are contained in the query. Now we can express the Identity operator in the m-dimensional q-space by:

                                                                                                     (2.2.1)

 

In this subspace, the query and a document is expressed in the WT-space (Weighted Term Vector Space) as:

 

   (2.2.2)

 

If the query is a union of m terms,

                                                                                          (2.2.3a)

 

then the least desirable point is the origin of the weighted space ([2], page 39):

 

                                                                       (2.2.3b)

The same vector, if represented in our TF-space (or BT-space), will be:

 

                                                                                   (2.2.3c)

 

If the query is a joint of m terms:

                                                                                         (2.2.4a)

 

We can see that the most desirable point is the vector of the highest weights ([2], page 39):

 

                                                                             (2.2.4a)

 

The same vector, if represented in our TF-space (or BT-space), will be:

 

                                                                                                  (2.2.4c)

 

 

Now we need define a measure or a distance in the space, in order to calculate the distance of the document from such point. According to [7.8], we can define the normalized distance between two vectors in the m-dimensional q-space with a parameter p by:

 

                                                    (2.2.4)

This is a normalized form of the Minkowski distance of order p (or p-norm distance) in a Euclidean space, as an example of metric spaces [9].

 

With the above descriptions, we can simplify the Similarity Coefficients (SC) formulas as follows.

 

                                                          (2.2.5a)

 

                                                    (2.2.5b)

 

If p = 1, the distance is the difference of the two vectors in m-dimensional vector space; if p = 2, the distance is for an m-dimensional Euclidian space; if p = ∞, Eq. (2.25) return the result of the original Boolean query.

 

We can also include the weights of query in the calculation of distance ([3], page 68). Then Eq. (2.2.4) becomes:

                                          (2.2.6)

 

We call the space the Extended Boolean Term Fock Space (EBT-space). Based on our discussion, the EBT-space is a bounded t-dimensional continuous metric space: it is a retrieval-closed space of vectors as described by Eq. (1.1.9) and it is also a metric space with a p-norm (2.2.4) or (2.2.6) as its distance function. No inner product is need here.

 

 

 

2.3. Term-Term Correlation and the Document Fock Space

 

Because the existence of term-term correlations, we cannot say terms are independent. This means that the term vectors in Eq. (1.1.b) are not orthogonal to each other anymore:

 

                                                                                (2.3.1a)

 

Note that the completeness of terms, as described in Eq. (1.1.1c) still holds as long as the expansion of documents and query are unique with respect to defined inner products. In this case, we still have:

 

                                                                                                    (2.3.1b)

 

There are many ways to compute term-term relationships. The Generalized Vector Model was proposed by Wong, Ziarko and Wong in 1985 [6]. They considered the space spanned by the  vectors similar those described in Eq. (1.2.6), introduced  orthonormal vectors as the bases of the vector space, expressed terms as linear combinations of these base vectors, and calculated term-term correlation based on such term expressions. More details can also be seen in [2: pages 41-44].

 

Here we discuss a query expansion model based on a global similarity thesaurus which is constructed automatically ([10], and [2], page 131-133]). Because the conditions in Eq. (2.3.1), we cannot use the base vectors in (1.1.1b) to represent our vector space. Instead, we expand term vectors with respect to documents:

 

                                                           (2.3.2)

 

Assume any term is at least in one document and no two documents are identical in term frequency, so term vector can be uniquely expanded with respect to document vectors. This is a very strong requirement: the Term-Document matrix defined in Eq. (3.1.1) need have a rank of t.  In this way, we can think that the document vectors form a basis of N-dimensional Fock space, with the inner product of term-document defined in (2.3.2):

 

                                                         (2.3.3)

 

The weights in Eq. (2.3.2), now the document occupation number of terms, are calculated using the following formula ([2], page 132),

 

 

                                          (2.3.4)

 

where  is defined in Eq. (1.1.3a),  is the maximum value of all  for a given , and  is the inverse term frequency for document , defined by:

 

                                                                                                    (2.3.5)

Here  is the number of distinct terms in document .

 

Now we can express the term-term correlation factor as (identical to Eq. (5.12) in [2], page 132):

 

                                        (2.3.6)

 

Although terms are not an orthonormal system, we still can define the query vector as a linear combination of terms contained in the query ([2], page 133):

 

                                              (2.3.7)

 

Here, the weights are calculated similarly to Eq. (2.3.5), but they are not equal to the inner product of term-query.  Instead, the inner product of a query with a term is computed using (2.3.7) and (2.3.6) as:

 

                                          (2.3.8)

 

 

Now we can derive our expression for the Similarity Coefficient as:

 

                                            (2.3.9)

 

This equation is identical to the one on page 134, Ref [2].

 

We call the vector space spanned by document vectors defined in (2.3.3) as Document Vector Space (DV-space), which is an N-dimensional vector space with base vector (2.3.3), and its inners product of document and query with term as (2.3.2) and (2.3.8). Based on above rules, we can then calculate the term-term correlations as in (2.3.6).

 

We see in this model, document and query are treated differently, because they have different inner product rules with terms. In next section, we will see a more natural way of introduce term-term interaction in which document and query would be treated in the same way, unified by a metric tensor derived using SVD.

 

 

3. Latent Semantic Indexing Model and SVD

 

 

In WT-space of classical vector models, we have the basic assumption that the term vectors form a complete orthonormal bases as in Eq. (1.1.1b).  In reality, terms are not independent vectors, because of the existence of the term-document relations.

 

This situation is very much like the situation in the quantum system of the Hydrogen atom when neglecting the orbital-spin (L-S) interaction ([1], pages 549-554). If we neglect the L-S interaction, the Hamiltonian (H) of the system has eigenvector which are also the common eigenvectors of the orbital angular momentum operators (L2, Lz) and the spin angular momentum (S2, Sz).  Any two of them are orthogonal if anyone of the four eigenvalues (L2, Lz, S2, Sz) is different. But because of the L-S interaction, they become dependent and we have no such common eigenvectors anymore.  Instead, we have to introduce the total angular momentum (J = L + S), to find the common eigenvectors of (J2, L2, S2, Jz), which give us the fine structure of the Hydrogen atom ([1], page 554-555).

 

3.1. Term-Document, Term-Term and Document-Document Arrays

 

In Latent Semantic Indexing (LSI) Model singular value decomposition (SVD), the key assumption is that the document-term dependence causes both term-term and documents-document dependences ([2], pages 44-45; [3], pages 70-73]; [11-12]). The document-term dependence is nothing else but the frequency of i-th term in the  -th document, , as defined in Eq. (1.1.3a). We also have their representations in Eq. (1.3.7) in our TF-space for the GF-examples. This relation can be viewed as the definition of inner product between a document and a term, and represented by a t x N matrix A (the term-document matrix) as follows:

 

                                                                    (3.1.1)

 

Here we have used the document vectors as the columns in the matrix. From now on, we assume that t > N. Apply to our GF-example, we have the following three document vectors, the term-doc matrix and the query vector.

 

Term-Document Matrix Example

[12]-4: Figure 2. Term-document matrix and query matrix example

 

The term-term interaction is defined by a t x t matrix L (called Left matrix):

 

                           (3.1.2)

 

Here we have used the transpose of document vectors as the rows in the matrix.  Note that the elements of matrix L are similar to (2.3.6); the difference is the definition of the inner product of term and document.

 

The document-document interaction is defined by N x N matrix R (called Right matrix):

 

                  (3.1.3)

 

Because both L and R are real symmetric matrices, they both have real, complete and orthogonal eigenvectors with the same set of real, non-negative eigenvalues ([1], pages 199-204; [13], pages 321-325). Here are their normalized forms:

 

                                                                    (3.1.4)

 

 

3.2. SVD, Vector Inner Product and Metric Tensor

 

Now we can apply the Singular Value Decomposition (SVD) of matrix A to use the eigenvectors of both L and R:

 

                                                                                                    (3.2.1)

Here, U and V are orthogonal matrices, the columns of the t x N matrix U are the N new term eigenvectors u, the columns of N x N matrix V are the new document eigenvectors v, and the N x N matrix S is a diagonal matrix with its diagonal elements Si =  (the singular values) as the square root of the eigenvalues of M or MT. We can arrange Si such that S1 > S2 > S3 > … SN

 

One of the advantages of SVD is that, in most cases, we do not need to use eigenvectors of zero eigenvalues (usually, there are many of them), and we can also keep only r non-zero singular values to reduce the ranks of all matrices to r, without losing much of the accuracy:

 

                                                                                        (3.2.2)

 

This is referred to Reduced SVD (RSVD). We can visually explain the reduction by the following figure (a copy of Figure 8, tutorial 3, Ref. [12], where k is used as our r).

 

 

Reduced SVD

 

[12]-4: Figure 8. The Reduced SVD or Rank k Approximation

 

Using the Dirac notation, we can express (3.2.2) as:

 

  (3.2.3)

Here we have used following notations:

 

                                        (3.2.4a)

 

This means that:

 

                                                                                                             (3.2.4b)

 

Also note that here the new document vectors are reduced to r-dimensional.

 

A very important equation derived from Eq. (3.2.1) is ([12], tutorial 4):

 

                                                                                                  (3.2.5a)

Using Eq. (3.1.1) and (3.2.4a), we can express (3.2.5a) in reduced form as:

 

                                        (3.2.5b)

 

This equation tells us how we can transform original document vector  to new vector: :

 

                                             (3.2.6a)

 

 

Since the query vector is treated like a document vector, we can compute the new query vector as:

                          (3.2.6b)

Eq. (3.2.6) give us a geometrical interpretation of the transformation: the new document vector is represented in a new base of the new term vectors ( , eigenvectors of L); its ith component is its inner product with corresponding new term vector , scaled by a factor of 1/Si. So we can say the new document is the result of a rotation (from k to k’) plus a scaling.  Then, the inner product of the two new vectors can be derived from Eq. (3.2.6) as:

 

                                                                    (3.2.7)

 

Finally, the ranking of documents now can be calculated using the Cosine low (1.1.5), or:

 

                                    (3.2.10)

 

Now let us introduce a metric tensor to simplify our calculations. We can rewrite Eq. (3.2.7) in a more concise form:

                                  (3.2.11)