One of the challenges in the field of Information Retrieval, or IR, is the representation of a natural-language text in the form of a search string that can be used for purposes of text matching and other text manipulations. See, for example, the discussion of text representation in Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto, Addison Wesley, 1999.
In one general approach, a natural-language text is represented as a vector in word space, where each word (or non-generic word) represents a vector dimension, and the vector coefficients are related to some relevance factor that is assigned to the word. The relevance between a target document and a document in a searched library can then be readily determined, for example, from the overlap in vector terms between the target document and the searched documents. Ideally, the vector coefficients are term weights that are related to the content of the text: the higher the coefficient, the more closely the term is related to the text content. Thus, for example, in using vectors for purposes of text matching, the term coefficients are useful in determining the degree of similarity between each document stored in a system and the search query represented by the vector.
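The word-space representation described above can be illustrated with a minimal sketch. The stop-word list, the function names, and the sample texts are hypothetical, and raw word counts are assumed as the vector coefficients purely as a stand-in for a relevance factor. Cosine similarity is used here as one common way of scoring the overlap in vector terms; the approach described above does not prescribe it.

```python
import math
from collections import Counter

# Hypothetical list standing in for "generic" words that get no dimension.
GENERIC_WORDS = {"a", "an", "the", "of", "in", "for", "and", "to", "is"}

def text_to_vector(text):
    """Map a text to a sparse word-space vector of the form {word: count}."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in GENERIC_WORDS)

def cosine_similarity(u, v):
    """Score the overlap of two sparse vectors, normalized by their lengths."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Target document and a two-document searched library (made-up texts).
target = text_to_vector("Treatment of cardiac arrest in the emergency room")
library = {
    "doc_heart": text_to_vector("Emergency treatment for cardiac arrest"),
    "doc_engine": text_to_vector("Treatment of engine wear in diesel trucks"),
}

scores = {name: cosine_similarity(target, vec) for name, vec in library.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

With these sample texts, the library document that shares four terms with the target ranks above the document that shares only the word "treatment."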
One well-known coefficient for vector word strings is the inverse document frequency, or IDF. This coefficient is related to the inverse of the frequency of a word in a document or set of documents. The rationale behind this coefficient is that a word that appears with greater frequency (such as common generic words) will be less pertinent to the content of the document. Although IDF has proven useful as an indicator of word pertinence, it is limited in two fundamental ways.
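By way of illustration, one common formulation of IDF is log(N/df), where N is the number of documents in the collection and df is the number of those documents containing the word. The exact formula, the function name, and the four-document collection below are assumptions made for this sketch; other IDF variants exist.

```python
import math

def inverse_document_frequency(word, documents):
    """Return log(N / df) for a word over a list of per-document word sets."""
    df = sum(1 for doc in documents if word in doc)
    if df == 0:
        return None  # the word does not occur; IDF is left undefined here
    return math.log(len(documents) / df)

# Made-up collection; each document is reduced to its set of words.
documents = [
    {"the", "patient", "had", "a", "cardiac", "arrest"},
    {"the", "engine", "had", "a", "fault"},
    {"the", "market", "fell", "sharply"},
    {"the", "team", "won", "a", "match"},
]

idf_the = inverse_document_frequency("the", documents)          # log(4/4)
idf_cardiac = inverse_document_frequency("cardiac", documents)  # log(4/1)
print(idf_the, idf_cardiac)
```

In this collection the generic word "the" occurs in every document and receives an IDF of zero, while "cardiac," which occurs in only one document, receives the highest value available.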
First, the value of the coefficient is highly dependent on the particular text or group of texts from which the IDF value is calculated. Take, for example, the IDF for the word “cardiac.” In general, this word would be expected to be highly pertinent to the content of a natural-language text dealing with some aspect of the heart, e.g., cardiac treatment. The word should thus be given a high “pertinence” coefficient in a vector representation of the text. However, if the group of texts from which the IDF is being calculated consists only of medical texts, or in particular, texts dealing with the heart, the word “cardiac” is likely to have a high document frequency and thus a low IDF. The pertinence of the word to the content of the text is thus diluted or lost.
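The dependence on the reference collection can be seen in a small sketch that computes the IDF of "cardiac" against two made-up collections, one general and one limited to heart-related texts. The log(N/df) formula, the helper name idf, and both collections are assumptions chosen only for illustration.

```python
import math

def idf(word, documents):
    """Return log(N / df), or None if the word occurs in no document."""
    df = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / df) if df else None

# Hypothetical general-interest collection: "cardiac" is rare.
general = [
    {"cardiac", "treatment", "options"},
    {"diesel", "engine", "repair"},
    {"stock", "market", "report"},
    {"football", "match", "result"},
]

# Hypothetical heart-related collection: "cardiac" is common.
cardiology = [
    {"cardiac", "treatment", "options"},
    {"cardiac", "arrest", "response"},
    {"cardiac", "signal", "analysis"},
    {"heart", "valve", "surgery"},
]

idf_general = idf("cardiac", general)        # log(4/1)
idf_cardiology = idf("cardiac", cardiology)  # log(4/3)
print(idf_general, idf_cardiology)
```

The same word in the same text thus receives a much lower coefficient when the IDF is computed from the specialized collection, even though its pertinence to the text has not changed.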
A second limitation of IDF coefficients relates to word groups, e.g., word pairs. Often a word group, such as “cardiac signal” or “cardiac arrest,” is more descriptive of content than the individual words making up the word group. The difficulty with determining relevance coefficients for word groups, however, is that most word groups that a computer would identify by deconstructing a natural-language text would be spurious or nonsense word groups, and as such would be expected to have a very low frequency and a correspondingly high IDF. Either these spurious terms would be highly weighted, which would badly misrepresent the text content, or upper limits would have to be imposed on all word-pair values, which would under-represent the pertinence of true word pairs.
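The word-pair difficulty can be illustrated by a sketch that forms every adjacent word pair in a few made-up texts and computes an IDF for each pair. Pairing adjacent words, the log(N/df) formula, and all names and texts below are assumptions of this sketch; a real system might generate word groups differently.

```python
import math
from collections import Counter

def adjacent_pairs(text):
    """Return the set of adjacent word pairs found in a text."""
    words = text.lower().split()
    return {f"{a} {b}" for a, b in zip(words, words[1:])}

# Made-up texts; "cardiac arrest" is the only meaningful pair that recurs.
texts = [
    "cardiac arrest requires immediate response",
    "the monitor detected cardiac arrest early",
    "a cardiac arrest protocol was followed",
    "early response improves survival",
]

pair_sets = [adjacent_pairs(t) for t in texts]

# Document frequency: the number of texts in which each pair occurs.
doc_freq = Counter(pair for pairs in pair_sets for pair in pairs)
pair_idf = {pair: math.log(len(texts) / df) for pair, df in doc_freq.items()}

print(pair_idf["cardiac arrest"], pair_idf["arrest requires"])
```

Here the accidental pair "arrest requires" occurs in a single text and so receives a higher IDF than the genuine pair "cardiac arrest," which occurs in three of the four texts.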
It would therefore be desirable to provide a method, code, and apparatus for representing a natural-language text as a word-string vector whose word and word-group coefficients provide a meaningful reflection of the pertinence of the vector terms in a particular field.