The present invention is directed to a method for storing records that permits meaning-sensitive, high-speed subject-area searching and retrieval. The same method may be used for word sense disambiguation (e.g., "star" in the sky vs. movie "star"). The invention is further directed to methods for generating context vectors to be associated with word stems for use in the record storage and retrieval method.
The most common method of record storage and retrieval involves storing all records word for word and then searching for key words in the records using inverted indexes (Salton, G., Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, 1989). A key word search is performed by searching the entire contents of the data base for records that contain the listed query words. Such systems have no knowledge that "car" and "automobile" should be counted as the same term, so the user must supply this information in a complex and difficult-to-formulate query. Some systems try to solve this problem with a built-in thesaurus, but such systems lack "meaning sensitivity" and miss many obvious facts, for example, that "car" is closer to "road" than to "hippopotamus." It is an object of the present invention to provide a more meaning-sensitive method of storage and retrieval that allows simplified queries and reduces the computing capacity required for any particular data base.
There is currently much research and development in the field of neural networks (Rumelhart, D. E. & McClelland, J. L., (eds.) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1 and Vol. 2, MIT Press, 1986; Anderson, J. A. and Rosenfeld, E. (eds.), Neurocomputing, A Reader, MIT Press, 1988; Hecht-Nielsen, Neurocomputing, Addison-Wesley, 1990). A neural network consists of a collection of cells and connections between cells, where every connection has an associated positive or negative number called a weight or component value. Cells employ a common rule to compute a unique output, which is then passed along connections to other cells. The particular connections and component values determine the behavior of the network when some specified "input" cells are initialized to a set of values. The component values play roughly the same role in determining neural network behavior as a program does in determining the behavior of a computer.
Waltz and Pollack, in their article entitled "Massively Parallel Parsing: A Strongly Interactive Model of Natural Language Interpretation" in Cognitive Science, Vol. 9, pages 51-74 (1985), presented a neural network based model for word sense disambiguation using high level features which are associated with "micro-features". The system was implemented by running several iterations of spreading activation, which would be computationally inefficient for medium- or large-scale systems.
Cottrell, in the article entitled "Connectionist Parsing" from the Seventh Annual Conference of the Cognitive Science Society, Irvine, Calif., constructed a system similar to that of Waltz and Pollack, with the same practical limitations. Belew, in the article entitled "Adaptive Information Retrieval" from the Twelfth International Conference on Research and Development in Information Retrieval, Boston, June, 1989, has also constructed a document retrieval system based upon a "spreading activation" model, but again this system was impractical for medium- or large-scale corpora. McClelland and Kawamoto, in the Rumelhart et al. books cited above, disclosed a sentence parsing method, including word sense disambiguation, using a model with a small number of orthogonal microfeatures.
An important related problem is the following. Given a collection of high-dimensional vectors (e.g. all vectors might have 200 components), find the closest vector to a newly presented vector. Of course all vectors can simply be searched one-by-one, but this takes much time for a large collection. An object of the current invention is to provide a process which makes such searches using much less work.
Although this problem is easily solved for very low dimensional (e.g., 2-4 dimensions) vectors by K-D trees, as described in Samet, H., The Design and Analysis of Spatial Data Structures, Addison-Wesley Publishing Company, 1990, K-D trees are useless for high dimensional nearest neighbor problems because they take more time than searching vectors one-by-one.
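By way of illustration only, the one-by-one search that serves as the baseline in the preceding discussion may be sketched as follows. The function name `nearest_vector` and the use of Euclidean distance are assumptions made for this sketch; the text above does not fix a particular measure of closeness.

```python
import math
from typing import List, Sequence, Tuple


def nearest_vector(query: Sequence[float],
                   collection: List[Sequence[float]]) -> Tuple[int, float]:
    """Return (index, distance) of the stored vector closest to `query`.

    Every stored vector is examined in turn, so the work grows in
    proportion to the number of vectors times the number of components.
    Euclidean distance is an assumption of this sketch.
    """
    if not collection:
        raise ValueError("collection must not be empty")
    best_index = -1
    best_distance = math.inf
    for index, vector in enumerate(collection):
        distance = math.dist(query, vector)  # requires equal dimensions
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance


# Hypothetical three-component vectors; a real collection might have
# about 200 components per vector.
stored = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [5.0, 5.0, 5.0]]
index, distance = nearest_vector([0.9, 1.2, 1.0], stored)
```

Each query in such a sketch touches every stored vector, which is the cost that a faster search process seeks to avoid.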
Prior art for document retrieval is well-summarized by the Salton reference cited above. Salton's SMART system uses variable length lists of terms as a representation, but there is no meaning sensitivity between terms. Any pair of terms either are synonyms or are not synonyms; the closeness of "car" and "driver" is the same as that of "car" and "hippopotamus".
So-called "vector space methods" can capture meaning sensitivity, but they require that the closeness of every pair of terms be known. For a typical full-scale system with over 100,000 terms, this would require about 5,000,000,000 relationships (one for each distinct pair of terms), an impractical amount of information to obtain and store. By contrast, the present invention requires only one vector per word, or 100,000 vectors for such a typical full-scale system. This is easily stored, and computation of these vectors can be partly automated.
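The figure of about 5,000,000,000 relationships can be checked with a short calculation, given here as a sketch. It assumes that one relationship is needed for each unordered pair of distinct terms; the function name `pairwise_relationships` is introduced only for this example.

```python
def pairwise_relationships(num_terms: int) -> int:
    """Number of unordered pairs of distinct terms: n * (n - 1) / 2."""
    return num_terms * (num_terms - 1) // 2


NUM_TERMS = 100_000  # vocabulary size taken from the text above

pairs = pairwise_relationships(NUM_TERMS)  # 4,999,950,000 pairs
vectors = NUM_TERMS                        # one vector per term

print(f"pairwise relationships: {pairs:,}")
print(f"vectors (one per term): {vectors:,}")
```

The pairwise count grows with the square of the vocabulary size, whereas the number of vectors grows only linearly with it.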
More recently, Deerwester et al., in the article entitled "Indexing by Latent Semantic Analysis" in the Journal of the American Society for Information Science, Vol. 41(6), pages 391-407, 1990, have also proposed a method for searching which uses fixed length vectors. However, their method also requires work on the order of at least the square of the sum of the number of documents and the number of terms.
Bein and Smolensky, in the article "Application of the Interactive Activation Model to Document Retrieval" in the Proceedings for Neuro-Nimes, 1988: Neural networks and their applications, November, 1988, have previously proposed a document retrieval model based upon neural networks that captures some meaning sensitivity. However, a search in their model requires a number of multiplications equal to twice the product of the number of documents and the number of keywords for each of a plurality of cycles (they report 60). For large corpora, the present invention is expected to make searches up to 10,000 times faster.
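The search cost stated above for the Bein and Smolensky model can be expressed as a small calculation, sketched below. The function name and the corpus sizes in the example are hypothetical; only the formula (twice the product of documents and keywords, per cycle) and the reported 60 cycles come from the text.

```python
def spreading_activation_multiplications(num_documents: int,
                                         num_keywords: int,
                                         num_cycles: int = 60) -> int:
    """Multiplications per search: 2 * documents * keywords for each cycle."""
    return 2 * num_documents * num_keywords * num_cycles


# Hypothetical corpus: 1,000 documents and 500 keywords, 60 cycles.
cost = spreading_activation_multiplications(1_000, 500)
print(f"multiplications per search: {cost:,}")
```

Because the cost is proportional to the product of the number of documents and the number of keywords, it rises quickly as either grows.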
Koll, in "WEIRD: An Approach to Concept-Based Information Retrieval," SIGIR Forum, vol. 13, no. 4, Spring 1979, pp. 32-50, discloses a retrieval method using vector representations in Euclidean space. The kernel, or core, used by Koll consists of non-overlapping documents. This results in vectors of rather small dimension, on the order of seven values. Vectors are generated from the core documents based on whether or not a term appears in a document. As an alternative, Koll suggests starting with a kernel of terms which never co-occur.
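One possible reading of the vector generation described above is sketched below; it is not taken from Koll's article. The sketch assumes that "non-overlapping" means the core documents share no terms, that each term receives one component per core document, and that a component is 1 when the term appears in that document and 0 otherwise. The function name `term_vectors` and the three tiny core documents are invented for illustration.

```python
from typing import Dict, List


def term_vectors(core_documents: List[str]) -> Dict[str, List[int]]:
    """Map each term to a 0/1 vector with one component per core document."""
    # Whitespace tokenization is a simplification made for this sketch.
    tokenized = [set(document.lower().split()) for document in core_documents]
    vocabulary = sorted(set().union(*tokenized))
    return {
        term: [1 if term in document else 0 for document in tokenized]
        for term in vocabulary
    }


# Hypothetical core of three documents that share no terms.
core = ["car road driver", "star sky telescope", "movie actor film"]
vectors = term_vectors(core)
```

With a core of about seven such documents, each term vector would have about seven components, consistent with the small dimension noted above.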