Many tasks, such as information retrieval, clustering and categorization, represent documents as vectors, where each dimension index of a document can represent a given word and where each value can encode the importance of that word in the document. The subject embodiments relate to the representation of documents by the relative importance of a word in a document, and to a particular weighting model relative to term frequencies in an associated collection. More particularly, the embodiments relate to weighting the measure of relative importance using a concavity control parameter.
Many weighting models try to quantify the importance of a word in a document with probabilistic measures. A probability distribution is associated with each term in the collection. For any word in any document, a probability according to the collection model can then be computed. The computed probability is related to the informative content of the word in the document or document collection (corpus).
The main hypothesis of many weighting models is the following: the more the distribution of a word in a document deviates from its average distribution in the collection, the more likely this word is to be significant for the document considered. This can be easily captured in terms of Shannon information. Let $X_w$ be the random variable of frequencies of word $w$ and $P$ a probability distribution with parameter $\lambda_w$; then the Shannon information measure is:

$$\mathrm{Info}_P(X_w = x) = -\log P(X_w = x \mid \lambda_w) = \text{Informative Content} \qquad (1)$$
If a word behaves in the document as expected in the collection, then its probability of occurrence in the document, $P(X_w = x \mid \lambda_w)$, is high according to the collection distribution, and the information it brings to the document, $-\log P(X_w = x \mid \lambda_w)$, is small. Conversely, if it has a low probability of occurrence in the document according to the collection distribution, then the amount of information it conveys is greater.
This is the idea underpinning the classical Divergence From Randomness Model and information-based models in Information Retrieval.
Hence, the cornerstone of many weighting models is the use of Shannon information, $-\log P(X_w \mid \lambda_w)$, to measure the importance of a word in a document and to weight words in documents.
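The relationship between probability and information described above can be illustrated with a minimal sketch. The Poisson distribution, the parameter value `lam = 0.5`, and the function names are assumptions made only for illustration; the text does not fix a particular collection distribution at this point.

```python
import math

def poisson_pmf(x: int, lam: float) -> float:
    """P(X = x | lam) under a Poisson model (an illustrative assumption)."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def shannon_information(x: int, lam: float) -> float:
    """Shannon information -log P(X = x | lam) of observing x occurrences."""
    return -math.log(poisson_pmf(x, lam))

# Hypothetical collection parameter: the word occurs 0.5 times per document on average.
lam = 0.5
for x in (0, 1, 5):
    # Frequencies far from what the collection predicts carry more information.
    print(x, round(shannon_information(x, lam), 3))
```

Under this assumed model, an observed count close to what the collection predicts yields a small information value, while an unexpectedly high count yields a large one.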
Table 1 identifies many of the notations used in the remainder of this disclosure.
TABLE 1
Notations

Notation       Description
q, d           Original query, document
RSV(q, d)      Retrieval status value of d for q (i.e., ranking function)
$x_{wd}$       Number of occurrences of w in document d
$X_w$          Discrete random variable for the $x_{wd}$
$T_w$          Continuous random variable for normalized occurrences
$l_d$          Length of document d
avgl           Average document length in the collection
N              Number of documents in the collection
$N_w$          Number of documents containing w
idf(w)         $-\log(N_w / N)$
P(w|C)         Probability of the word in the collection
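As a rough illustration of the count-based quantities in Table 1, the sketch below computes them on a toy corpus. The corpus contents and the helper names (`occurrences`, `doc_frequency`, `idf`) are hypothetical and serve only to make the notation concrete.

```python
import math

# Toy collection: each document is a list of tokens (hypothetical data).
docs = [
    ["information", "retrieval", "model"],
    ["information", "theory"],
    ["retrieval", "retrieval", "ranking", "model"],
]

N = len(docs)                           # number of documents in the collection
avgl = sum(len(d) for d in docs) / N    # average document length

def occurrences(w: str, d: list) -> int:
    """x_wd: number of occurrences of w in document d."""
    return d.count(w)

def doc_frequency(w: str) -> int:
    """N_w: number of documents containing w."""
    return sum(1 for d in docs if w in d)

def idf(w: str) -> float:
    """idf(w) = -log(N_w / N), as defined in Table 1."""
    return -math.log(doc_frequency(w) / N)

print(N, avgl, occurrences("retrieval", docs[2]), doc_frequency("retrieval"), idf("retrieval"))
```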
A known formulation defining the family of IR models is the following equation:

$$RSV(q, d) = \sum_{w \in q} -q_w \log P(T_w \geq t_{wd} \mid \lambda_w) \qquad (2)$$

where $T_w$ is a continuous random variable modeling normalized term frequencies, $t_{wd}$ is the normalized frequency of $w$ in $d$, and $\lambda_w$ is a set of parameters of the probability distribution considered. This ranking function corresponds to the mean information a document brings to a query or, equivalently, to the average of the document information brought by each query term.
A few words are needed to explain the choice of the probability $P(T_w \geq t_{wd})$ in the information measure. Shannon information was originally defined on discrete probabilities, and the information quantity from the observation of $x$ was measured with $-\log P(X = x \mid \Theta)$. As the normalized frequencies $t_{wd}$ are continuous variables, and the probability of any single value of a continuous variable is zero, Shannon information cannot be directly applied.
A known solution is to measure information on a probability of the form $P(t_{wd} - a \leq T_w \leq t_{wd} + b \mid \lambda_w)$. However, one has to choose values for $a$ and $b$, and $a = 0$ and $b = +\infty$ have been chosen for theoretical reasons. The mean frequency of most words in a document is close to 0. For any word, large frequencies are typically less likely than smaller frequencies on average. The larger the term frequency is, the smaller $P(T_w \geq t_{wd})$ is and the bigger $-\log P(T_w \geq t_{wd})$ is. Hence, the use of the survival function $P(T > t)$ seems compatible with the notion of information content discussed above.
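The reduction of the interval probability to a survival function can be sketched as follows. The exponential distribution and its `rate` parameter are assumptions chosen purely because the cumulative distribution has a simple closed form; the text above does not name a distribution.

```python
import math

def exp_cdf(t: float, rate: float) -> float:
    """P(T <= t) for an exponential distribution (illustrative assumption)."""
    return 1.0 - math.exp(-rate * t) if t > 0 else 0.0

def interval_probability(t: float, a: float, b: float, rate: float) -> float:
    """P(t - a <= T <= t + b) under the assumed exponential model."""
    return exp_cdf(t + b, rate) - exp_cdf(t - a, rate)

def information(t: float, rate: float) -> float:
    """-log P(T >= t): the interval form with a = 0 and b = +infinity."""
    return -math.log(interval_probability(t, 0.0, math.inf, rate))

# Larger normalized frequencies have a smaller survival probability
# and therefore a larger information value.
for t in (0.5, 2.0, 5.0):
    print(t, information(t, rate=1.5))
```

With $a = 0$ and $b = +\infty$ the interval covers everything at or above the observed value, so the information grows monotonically with the normalized frequency.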
Overall, the general idea of the information-based family is the following:
1. Because document lengths differ, discrete term frequencies ($x$) are renormalized into continuous values ($t(x)$).
2. For each term $w$, one can assume that those renormalized values follow a probability distribution $P$ on the corpus. Formally, $T_w \sim P(\cdot \mid \lambda_w)$.
3. Queries and documents are compared through a measure of surprise, or a mean of information, of the form:

   $$RSV(q, d) = \sum_{w \in q} -q_w \log P(T_w \geq t(x) \mid \lambda_w)$$
So information models are specified by two main components: a function which normalizes term frequencies across documents, and a probability distribution modeling the normalized term frequencies. Information is the key ingredient of such models since information measures the significance of a word in a document.
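The two components can be put together in a short sketch of the ranking function. Several choices here are assumptions not stated in the text: the normalization $t(x) = x \log(1 + c \cdot avgl / l_d)$, a log-logistic survival function $\lambda_w / (t + \lambda_w)$ with $\lambda_w = N_w / N$, and the reading of $q_w$ as the count of $w$ in the query. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def normalize_tf(x: int, doc_len: int, avg_len: float, c: float = 1.0) -> float:
    """t(x): length-normalized frequency (assumed form x * log(1 + c * avgl / l_d))."""
    return x * math.log(1.0 + c * avg_len / doc_len)

def survival(t: float, lam: float) -> float:
    """P(T_w >= t | lambda_w) under an assumed log-logistic model."""
    return lam / (t + lam)

def rsv(query: list, doc: list, docs: list) -> float:
    """Sum over query terms of -q_w * log P(T_w >= t(x_wd) | lambda_w)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_counts = Counter(doc)
    score = 0.0
    for w, q_w in Counter(query).items():
        n_w = sum(1 for d in docs if w in d)
        if n_w == 0:
            continue  # term absent from the collection: no parameter to estimate
        lam = n_w / n_docs  # assumed estimate of lambda_w
        t = normalize_tf(doc_counts[w], len(doc), avg_len)
        score += -q_w * math.log(survival(t, lam))
    return score

docs = [
    ["information", "retrieval", "model", "retrieval"],
    ["information", "theory", "entropy"],
    ["clustering", "documents", "vectors", "model", "weights"],
]
query = ["retrieval", "model"]
scores = [rsv(query, d, docs) for d in docs]
print(scores)
```

In this sketch a document containing no query term has $t = 0$ for every term, so its survival probabilities are 1 and it contributes no information; documents in which query terms are unexpectedly frequent score higher.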
Such known information models measure the relative importance of a word in a document compared to its importance in other documents in the collection by either a fixed weighting function (natural log) or the proposition of ad-hoc classes (which do not focus on concavity control).
There is always a need for new and improved representation of documents which can yield better results than known representations in document retrieval, categorization and clustering tasks. The subject embodiments address this need.