The present invention is directed to a method for storing documents that permits meaning-sensitive, high-speed subject area searching and retrieval. The same method may be used for word sense disambiguation (e.g., "star" in the sky vs. movie "star").
The most common method of document storage and retrieval involves storing all documents word for word and then searching for key words in the documents using inverted indexes (1). A key word search is performed by searching the entire contents of the data base for documents that contain the listed query words. Such systems have no knowledge that "car" and "automobile" should be counted as the same term, so the user must supply this information through a complex and difficult-to-formulate query. Some systems try to solve this problem with a built-in thesaurus, but such systems lack "meaning sensitivity" and miss many obvious facts, for example, that "car" is closer to "road" than to "hippopotamus." It is an object of the present invention to provide a more meaning-sensitive method of storage and retrieval that allows simplified queries and reduces the computing capacity required for any particular data base.
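The limitation described above can be illustrated with a minimal sketch of an inverted index. The function names, the two sample documents, and the whitespace tokenization are assumptions made for illustration only; they are not taken from reference (1).

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def keyword_search(index, query_words):
    """Return ids of documents containing every query word (exact match only)."""
    postings = [index.get(word.lower(), set()) for word in query_words]
    return set.intersection(*postings) if postings else set()

# Hypothetical two-document data base.
documents = {
    1: "the car stopped on the road",
    2: "an automobile needs a driver",
}
index = build_inverted_index(documents)

car_hits = keyword_search(index, ["car"])                # finds document 1 only
automobile_hits = keyword_search(index, ["automobile"])  # finds document 2 only
```

Because matching is purely lexical, a query for "car" does not retrieve the document that mentions "automobile"; the user would have to list both words explicitly.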
There is currently much research and development in the field of neural networks (2, 3, 4). A neural network consists of a collection of cells and connections between cells, where every connection has an associated positive or negative number called a weight or component value. Each cell employs a common rule to compute a single output, which is then passed along connections to other cells. The particular connections and component values determine the behavior of the network when some specified "input" cells are initialized to a set of values. The component values play roughly the same role in determining neural network behavior as a program does in determining the behavior of a computer.
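A minimal sketch of this description follows. The particular common rule used here (a weighted sum passed through tanh), the weights, and the function names are assumptions chosen for illustration; the cited references use a variety of rules.

```python
import math

def cell_output(inputs, weights):
    """Assumed common rule: squash the weighted sum of the inputs with tanh."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return math.tanh(total)

def run_layer(input_values, weight_rows):
    """Each row holds the weights on the connections into one receiving cell."""
    return [cell_output(input_values, row) for row in weight_rows]

# Three "input" cells initialized to a set of values (hypothetical numbers).
input_values = [1.0, 0.0, -1.0]

# Weights (component values) for two receiving cells; signs may be positive or negative.
weight_rows = [
    [0.5, -0.2, 0.1],
    [-0.3, 0.8, 0.3],
]

outputs = run_layer(input_values, weight_rows)
```

Changing only the numbers in `weight_rows` changes the outputs, which is the sense in which the component values play the role of a program.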
Waltz and Pollack (5) presented a neural network based model for word sense disambiguation using high level features which are associated with "micro-features". The system was implemented by running several iterations of spreading activation, which would be computationally inefficient for medium- or large-scale systems.
Cottrell (6) constructed a system similar to that of Waltz and Pollack, with the same practical limitations. Belew (7) has also constructed a document retrieval system based upon a "spreading activation" model, but again this system was impractical for medium- or large-scale corpora. McClelland and Kawamoto (2) disclosed a sentence parsing method, including word sense disambiguation, using a model with a small number of orthogonal microfeatures.
An important related problem is the following. Given a collection of high-dimensional vectors (e.g., all vectors might have 200 components), find the closest vector to a newly presented vector. Of course, all vectors can simply be searched one-by-one, but this takes a great deal of time for a large collection. An object of the current invention is to provide a process which makes such searches using much less work.
Although this problem is easily solved for very low dimensional (e.g., 2-4 dimensions) vectors by K-D trees (8), K-D trees are useless for high dimensional nearest neighbor problems because they take more time than searching vectors one-by-one.
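The one-by-one search mentioned above can be sketched as follows. Euclidean distance is an assumption (the text does not fix a distance measure), and the collection size, random data, and names are illustrative only.

```python
import math
import random

def nearest_vector(collection, query):
    """Return (index, distance) of the stored vector closest to query.

    Every stored vector is compared with the query, so the work grows
    linearly with the size of the collection.
    """
    best_index, best_dist = -1, math.inf
    for i, vec in enumerate(collection):
        dist = math.dist(vec, query)  # Euclidean distance (assumed measure)
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index, best_dist

random.seed(0)
DIM = 200  # components per vector, as in the example above
collection = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(1000)]

# A newly presented vector: a slightly perturbed copy of stored vector 42.
query = [x + 0.01 for x in collection[42]]
best_index, best_dist = nearest_vector(collection, query)
```

Each query costs one distance computation per stored vector, each over all 200 components, which is the expense a faster search process aims to avoid.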
Prior art for document retrieval is well summarized by reference (1). Salton's SMART system uses variable length lists of terms as a representation, but there is no meaning sensitivity between terms. Any pair of terms either are synonyms or are not synonyms; the closeness of "car" and "driver" is the same as that of "car" and "hippopotamus".
So-called "vector space methods" (1) can capture meaning sensitivity, but they require that the closeness of every pair of terms be known. For a typical full-scale system with over 100,000 terms, this would require about 5,000,000,000 relationships (roughly one for each of the 100,000 × 100,000 / 2 distinct pairs of terms), an impractical amount of information to obtain and store. By contrast, the present invention requires only one vector per word, or 100,000 vectors for such a typical full-scale system. These vectors are easily stored, and their computation can be partly automated.
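The storage comparison above can be checked with a few lines of arithmetic. The figure of 200 components per vector is an assumption borrowed from the example dimension mentioned earlier in this section; the variable names are illustrative.

```python
num_terms = 100_000

# One closeness value for every unordered pair of distinct terms.
pairwise_relationships = num_terms * (num_terms - 1) // 2

# One fixed-length vector per term instead.
vectors_needed = num_terms
components_per_vector = 200  # assumed, from the earlier 200-component example
total_components = vectors_needed * components_per_vector

print(f"pairwise relationships: {pairwise_relationships:,}")
print(f"vector components:      {total_components:,}")
```

Under these assumptions the pairwise table needs about 5 billion entries, while the per-word vectors need 20 million numbers, a reduction by a factor of roughly 250.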
More recently, Deerwester et al. (9) have also proposed a method for searching which uses fixed length vectors. However, their method requires work on the order of at least the square of the sum of the number of documents and the number of terms. This is impractical for large corpora of documents or terms.
Bein and Smolensky (10) have previously proposed a document retrieval model based upon neural networks that captures some meaning sensitivity. However, a search in their model requires a number of multiplications equal to twice the product of the number of documents and the number of keywords for each of a plurality of cycles (they report 60). For large corpora, the present invention is expected to make searches up to 10,000 times faster.
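The multiplication count stated for the model of reference (10) can be written out directly. The corpus sizes used below are hypothetical and serve only to show the scale of the count; the function name is likewise an illustration.

```python
def multiplications_per_search(num_documents, num_keywords, cycles=60):
    """Twice the product of documents and keywords, repeated for each cycle."""
    return 2 * num_documents * num_keywords * cycles

# Hypothetical corpus: 10,000 documents and 1,000 keywords, 60 cycles as reported.
example_count = multiplications_per_search(10_000, 1_000)
print(f"{example_count:,} multiplications per search")
```

For this hypothetical corpus a single search would take 1.2 billion multiplications, and the count grows in proportion to both the number of documents and the number of keywords.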