Information retrieval from text documents has been a major area of research in computer science for the past three decades. Although a number of techniques have been developed which enable computers to understand text, none of these techniques has been wholly satisfactory.
The simplest of these techniques is `keyword matching`. Using this technique, the computer scans through an ASCII text file looking for every occurrence of the desired `keyword`. By using logical operations such as AND and OR, the computer can be instructed to retrieve only the combinations of keywords which the operator believes will be most relevant. Unfortunately, this technique is highly literal, and even a well-crafted search can miss relevant documents or retrieve irrelevant ones. For example, if a paragraph contains the sentence:
California is a leading producer of oranges,
and the user wishes to know which fruits are grown in California, searching the paragraph for "California" and "oranges" will detect the sentence and provide the correct answer. However, if the sentence read:
California is a leading consumer of oranges,
the same search will still detect this sentence, in this case erroneously.
Obviously, this technique is less than wholly satisfactory. Another technique uses `Natural Language Understanding`. This technique combines syntactic knowledge of English with semantic information about the topic of the text being analyzed. See, for example, James Allen, Natural Language Understanding, The Benjamin/Cummings Publishing Company, Inc., Menlo Park, Calif. 1987. These techniques are limited by their dependence on the grammatical correctness of the text and by their failure to make use of the text's physical layout.
Many documents and forms are composed of numerous unrelated text phrases, which can be ungrammatical but which are nonetheless easily understood by humans. Examples of such documents include resumes, purchase order forms, bank statements, insurance forms, and so on. Human understanding of these documents entails realizing how the spatial relationship between blocks of text contributes to the meaning of the document.
Resumes provide a classic example of documents which are very easily understood by humans through the use of spatial and textual analysis. To find the name of the person, for example, the human uses the fact that it normally appears near the top of the resume, often centered or set apart from the rest of the text. Current natural language understanding systems cannot use such spatial information in detecting names.
Another example of how such information is easily used by humans but not so easily used by computers is contained in the following excerpt from a resume:
______________________________________
1977-1980 Worked during this period of time for a large chip manufacturer designing chips and writing software simulations for new designs. I participated in the design of several different memory chips.
______________________________________
In a computer this text fragment might actually be stored as:
Worked during this period of time for a large chip 1977--manufacturer de5igning chips and wrlting 1980 simulations for new designs. I participated in the design of several different memory chips.
Typographical errors during data entry result in text which natural language systems could not understand. Even if the typographical errors were corrected, it is evident that correct understanding of this stored text would be difficult, as the range of years of employment has become separated and embedded in the job description. Another problem which would result in incorrect analysis by natural language processing systems is that the first sentence is not a grammatically complete sentence.
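The difficulty described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the method of the invention: the function names (`literal_match`, `fuzzy_match`, `find_years`) are hypothetical, and the standard `difflib` similarity ratio is used as a stand-in for whatever typo-insensitive matching a real system would employ. The sketch applies literal keyword matching to the stored fragment, then a tolerant comparison, and finally pulls out the year values that became embedded in the job description.

```python
import difflib
import re

# The stored fragment quoted in this section, typos included.
STORED = (
    "Worked during this period of time for a large chip "
    "1977--manufacturer de5igning chips and wrlting 1980 simulations "
    "for new designs. I participated in the design of several "
    "different memory chips."
)

def tokens(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def literal_match(keyword, text):
    """Exact keyword matching: True only if the keyword appears as a token."""
    return keyword.lower() in tokens(text)

def fuzzy_match(keyword, text, cutoff=0.8):
    """Typo-tolerant matching (assumed approach): tokens similar to the
    keyword, best match first, using difflib's similarity ratio."""
    return difflib.get_close_matches(
        keyword.lower(), tokens(text), n=3, cutoff=cutoff
    )

def find_years(text):
    """Find four-digit years wherever they ended up in the text."""
    return re.findall(r"\b(?:19|20)\d{2}\b", text)

if __name__ == "__main__":
    print(literal_match("designing", STORED))  # False: "de5igning" is not an exact match
    print(fuzzy_match("designing", STORED))    # best match is listed first
    print(find_years(STORED))                  # ['1977', '1980']
```

Recovering the two years does not by itself show that they form a single range of employment; that association comes from the original layout, which the stored text no longer preserves.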
It is an object of this invention to provide a method and apparatus for use with a computer which analyzes text documents using both sophisticated text pattern matching techniques which are insensitive to typographical errors and to the ungrammatical nature of text fragments, and spatial analysis techniques for analyzing the spatial structure of the text document, the method providing greatly improved computer textual analysis.
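As a rough illustration of what a spatial cue might look like in code, the following Python sketch tests the observation made earlier in this section that a name normally appears near the top of a resume, often centered. It is a minimal sketch under stated assumptions and is not the method of the invention: the function names, the fixed `page_width`, the `tolerance` value, and the sample resume lines are all hypothetical and were invented for this example.

```python
def is_centered(line, page_width=60, tolerance=2):
    """True if the line is indented and its left and right margins are
    roughly equal on a page of the assumed width."""
    text = line.rstrip()
    left = len(text) - len(text.lstrip())
    right = page_width - len(text)
    return left > 0 and abs(left - right) <= tolerance

def guess_name_line(lines, page_width=60, top_n=5):
    """Return the first centered, non-blank line among the top lines,
    or None if no such line exists."""
    for line in lines[:top_n]:
        if line.strip() and is_centered(line, page_width):
            return line.strip()
    return None

# Hypothetical resume text, invented for illustration only.
SAMPLE = [
    " " * 23 + "Jane Q. Public",
    "",
    "1977-1980   Worked for a large chip manufacturer.",
    "1980-1984   Designed memory chips.",
]

if __name__ == "__main__":
    print(guess_name_line(SAMPLE))  # Jane Q. Public
```

A cue of this kind uses only position on the page, so it is unaffected by typographical errors or incomplete sentences in the text itself.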