1. Field of the Invention
This invention relates to vector-based meaning-sensitive information storage and retrieval systems, and more particularly to an improved system and method for generating and retrieving context vectors that represent high-dimensional abstractions of information content.
2. Description of the Related Art
Conventional methods of record storage and retrieval generally involve storage of all records word for word and then searching for key words in the records using inverted indexes. A key word search is performed by searching the entire contents of the database for records that contain the listed query words. Such systems have no knowledge, for example, that "car" and "automobile" represent nearly the same meaning, so the user must supply this information by using a complex and difficult-to-formulate query. Some systems try to solve this problem with a built-in thesaurus, but such systems lack "meaning sensitivity" and miss many subtleties of meaning association, such as the fact that "car" is closer to "road" than to "hippopotamus".
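The limitation of key word searching described above can be illustrated with a minimal sketch. The function names, the whitespace tokenization, and the two sample records are assumptions made for illustration only and are not taken from any particular prior art system.

```python
from collections import defaultdict


def build_inverted_index(records):
    """Map each lowercased word to the set of record ids that contain it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for word in text.lower().split():
            index[word].add(record_id)
    return index


def keyword_search(index, query_words):
    """Return ids of records containing every query word (exact match only)."""
    postings = [index.get(word.lower(), set()) for word in query_words]
    return set.intersection(*postings) if postings else set()


# Hypothetical records: both concern the same subject in different words.
records = {
    1: "the car stalled on the road",
    2: "an automobile was parked outside",
}
index = build_inverted_index(records)

print(keyword_search(index, ["car"]))         # record 2 is not found
print(keyword_search(index, ["automobile"]))  # record 1 is not found
```

Because matching is purely literal, a query for "car" cannot reach the record that mentions only "automobile" unless the user adds the synonym to the query by hand.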
There is currently much research and development in the field of neural networks. A neural network consists of a collection of cells and connections among cells, where every connection has an associated positive or negative number, called a weight or component value. Each cell employs a common rule to compute a unique output, which is then passed along connections to other cells. The particular connections and component values determine the behavior of the network when some specified "input" cells are initialized to a set of values. The component values play roughly the same role in determining neural network behavior as a program does in determining the behavior of a computer.
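By way of illustration only, the behavior of such cells can be sketched as follows. The particular rule shown (a weighted sum compared against a threshold) is an assumption chosen for simplicity; the description above requires only that every cell apply the same rule. The names `cell_output` and `run_network` are hypothetical.

```python
def cell_output(inputs, weights, threshold=0.0):
    """Apply the common rule: weighted sum of inputs, then a threshold.

    The threshold rule is an illustrative assumption, not a requirement.
    """
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1.0 if total > threshold else 0.0


def run_network(input_values, layer_weights):
    """Feed the initialized input cells to each downstream cell.

    layer_weights holds one list of connection weights per downstream cell.
    """
    return [cell_output(input_values, weights) for weights in layer_weights]


# Two input cells connected to two downstream cells by signed weights.
outputs = run_network([1.0, 0.0], [[0.5, -0.5], [-1.0, 2.0]])
print(outputs)
```

Changing only the weights changes the outputs for the same inputs, which is the sense in which the component values play the role of a program.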
Prior art for document retrieval includes systems using variable length lists of terms as a representation, but without meaning sensitivity between terms. In such systems, pairs of terms are either synonyms or not synonyms.
So-called "vector space methods" can capture meaning sensitivity, but they require that the closeness of every pair of terms be known. For a typical full-scale system with over 100,000 terms, this would require about 5 billion relationships--an impractical amount of information to obtain and store.
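The figure of about 5 billion follows from counting unordered pairs of terms, n(n-1)/2. A short check, in which the function name is hypothetical:

```python
from math import comb


def pairwise_relationships(num_terms):
    """Number of unordered term pairs whose closeness would have to be known."""
    return comb(num_terms, 2)


print(pairwise_relationships(100_000))  # 4999950000
```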
Methods have also been proposed for searching with fixed-length vectors. However, such methods require work on the order of at least the square of the sum of the number of documents and the number of terms. This is impractical for a large corpus of documents or terms.
A document retrieval model based on neural networks and capturing some meaning sensitivity has been proposed. However, a search in such models requires a number of multiplications equal to twice the product of the number of documents and the number of keywords for each of a plurality of cycles.
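The cost stated above can be made concrete with a rough estimate. The corpus size, keyword count, and cycle count below are hypothetical values chosen only to show the scale; they are not drawn from the proposed model.

```python
def search_multiplications(num_documents, num_keywords, num_cycles):
    """Multiplications per search: 2 * documents * keywords for each cycle."""
    return 2 * num_documents * num_keywords * num_cycles


# Hypothetical sizes: one million documents, 100,000 keywords, 10 cycles.
print(search_multiplications(1_000_000, 100_000, 10))  # 2000000000000
```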
Koll in "WEIRD: An Approach to Concept-Based Information Retrieval," SIGIR Forum, vol. 13, no. 4, Spring 1979, pp. 32-50, discloses a retrieval method using vector representations in Euclidean space. The kernel, or core, used by Koll is a set of non-overlapping documents. This results in vectors of rather small dimension, on the order of seven values. Vectors are generated from the core documents based on whether or not a term appears in a document. As an alternative, Koll suggests starting with a kernel of terms which never co-occur.
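A minimal sketch of generating term vectors from a core of non-overlapping documents follows. It reflects only the presence-or-absence rule summarized above and is not Koll's actual procedure; the function name, the tokenization, and the three-document core are assumptions, and the sketch does not address how terms outside the core would receive vectors.

```python
def term_vectors(core_documents):
    """Give each term one component per core document: 1 if present, else 0."""
    tokenized = [set(doc.lower().split()) for doc in core_documents]
    vocabulary = sorted(set().union(*tokenized))
    return {
        term: [1 if term in words else 0 for words in tokenized]
        for term in vocabulary
    }


# Hypothetical core of three documents that share no terms.
core = [
    "car road engine",
    "hippopotamus river mud",
    "violin concert orchestra",
]
vectors = term_vectors(core)
print(vectors["car"])   # [1, 0, 0]
print(vectors["road"])  # [1, 0, 0]
```

The dimension of each vector equals the number of core documents, which is why a core of about seven documents yields vectors of about seven values.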