One of the challenges in the field of Information Retrieval, or IR, is the representation of a natural-language text in the form of a search string that can be used for purposes of text matching and other text manipulations. See, for example, the discussion of text representation in Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto, Addison Wesley, 1999.
Typically, in automated text-searching methods, a natural-language target text is represented as a vector in word space, where each word (or non-generic word) represents a vector dimension, and the vector coefficients are related to some relevance factor that is assigned to the word. The relevance between the target document and a document in a searched library can then be readily determined, for example, from the “overlap” between the target-document and searched-document vectors. Heretofore, this approach has been hampered by the difficulty, in an automated system, of identifying meaningful search terms for the vector and of assigning term coefficients that are robust and reasonably related to the content of the text.
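By way of illustration only, the conventional word-space approach described above may be sketched as follows. The sketch rests on assumptions that are not taken from any particular prior system: the list of generic words is a hypothetical placeholder, raw word counts stand in for the relevance factor, and cosine similarity is used as one common measure of the “overlap” between two vectors.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately short list of generic words to exclude.
GENERIC_WORDS = {"a", "an", "and", "as", "for", "in", "is", "of", "the", "to", "with"}


def text_to_vector(text):
    """Represent a text as a word-space vector.

    Each non-generic word is one dimension; its coefficient here is the raw
    count of the word, an assumed stand-in for a relevance factor.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in GENERIC_WORDS)


def overlap(vec_a, vec_b):
    """Cosine similarity between two word-space vectors (0.0 if either is empty)."""
    shared = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[w] * vec_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative texts: one target and two documents from a searched library.
target = text_to_vector("A method for representing text as a vector")
related = text_to_vector("Representing a document as a word vector")
unrelated = text_to_vector("Hydraulic pump with rotary valve")

print(overlap(target, related))
print(overlap(target, unrelated))
```

In this sketch every non-generic word carries equal weight per occurrence, which illustrates the limitation noted above: the coefficients do not reflect how meaningful a term is to the content of the text.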
It would therefore be desirable to provide a method, code, and apparatus for representing a natural-language text as a word-string vector whose word and word-group coefficients provide a meaningful reflection of the pertinence of the vector terms in a particular field.