One of the major challenges in managing information is to locate natural-language texts describing a particular idea, invention or event. For example, one might wish to locate texts that concern a set of events relating to a legal proposition, or a set of facts relating to a business situation, or a description of a particular invention, idea or concept.
There are a number of systems available commercially for accessing digitally processed texts. Typically, in finding a desired text, one first classifies the text into some field or class in which the text is likely to be found. For example, in the legal field, one might confine the text search to appellate cases relating to a specific area of the law or in a specific jurisdiction. In a technical or patent search, one might confine the search to a particular area of technology or patent class or subclass. This initial classification serves the purpose of narrowing the search to the areas of interest or of most likely text matches.
Once a class or area of search has been identified, a search for a matching text is typically carried out by Boolean word search methods. In this approach, the user provides key words and/or groups of words, typically joined by Boolean connectors, and a search algorithm identifies digitally processed texts that contain those words or groups of words. This approach, although widely available, is nonetheless limited in two fundamental respects. First, the search can be fairly time consuming, since with each new Boolean search command, the search output must be evaluated in order to refine and improve the search results. Often this means reading through portions of the texts retrieved, then deciding how the search command can be improved to sharpen the search results. Second, the approach is subject to the general problem of false maxima. That is, even though a retrieved text contains many of the key words included in the search commands, it is impossible to know whether it is the text with the maximum word overlap with the search words, unless only a small number of search words are used.
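By way of illustration only, the Boolean approach and the false-maxima problem described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method of any particular commercial system: the function names (`tokenize`, `boolean_and_search`, `overlap_ranking`), the three-text library, and the search words are all hypothetical and chosen only for the example.

```python
import re


def tokenize(text):
    """Lowercase a text and return the set of distinct words it contains."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def boolean_and_search(library, terms):
    """Return the ids of texts that contain every search term (Boolean AND)."""
    wanted = {t.lower() for t in terms}
    return [doc_id for doc_id, text in library.items()
            if wanted <= tokenize(text)]


def overlap_ranking(library, terms):
    """Rank all texts by how many of the search terms each one contains."""
    wanted = {t.lower() for t in terms}
    scores = {doc_id: len(wanted & tokenize(text))
              for doc_id, text in library.items()}
    # Highest overlap first; ties broken by id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical library of digitally processed texts (illustration only).
library = {
    "A": "A valve assembly controls fluid flow in a pump housing.",
    "B": "The pump housing encloses a valve and a pressure sensor.",
    "C": "A method for brewing coffee with a pressure pump.",
}

# A Boolean command built from three key words retrieves only text A.
narrow_hits = boolean_and_search(library, ["valve", "pump", "fluid"])

# Measured against the fuller set of search words, text B has the greater
# overlap, but the Boolean command above never surfaces it.
ranking = overlap_ranking(library, ["valve", "pump", "pressure", "sensor"])

print(narrow_hits)
print(ranking)
```

In this sketch the retrieved text A matches every word of the Boolean command, yet the user cannot tell from that result alone that text B overlaps more of the words of interest. Finding this out requires issuing and evaluating further commands, which is the iterative burden noted above.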
At the other extreme, efforts in the field of natural-language processing are aimed at “reading” an input text for content, and trying to match the target text with a library of digitally processed texts on the basis of content, rather than on the basis of words alone. At present, this field is still at an embryonic stage, and the approach is impractically slow, since every text that is searched must be individually processed for content.
It would therefore be desirable to provide a text processing and matching system that is substantially automated, that is, one that does not require user input to classify the field of search and/or identify key words and word phrases useful for text searching.
It would be further desirable to provide such a system that overcomes the problem of false maxima associated with Boolean word searching, and that is capable of conducting complex text searches in real time, e.g., in a matter of seconds or a few minutes.