This invention relates generally to a system and method for processing a document and, in particular, to a system and method for identifying a plurality of phrases within the document which indicate the context of the document.
Various factors have contributed to the extensive storage and retrieval of textual data using computer databases. These factors include a dramatic increase in the storage capacity of computer hard drives coupled with a decrease in their cost, increases in the transmission speed of computer communications, the increased processing speed of computers, and the expansion of computer communications networks, such as bulletin boards or the Internet. People therefore have access to the large amounts of textual data stored in these databases. However, although the technology facilitates the storage of and the access to large amounts of textual data, the sheer amount of textual data that is now available has created new problems.
In particular, a person trying to access textual data in a computer database having a large amount of data needs a system for analyzing the data in order to retrieve the desired information quickly and efficiently without retrieving extraneous information. In addition, the user needs an efficient system for condensing each large document into a plurality of phrases (each phrase being one or more words) which characterize the document, so that the user can understand the document without actually viewing the entire document. A system for condensing each document into a plurality of key phrases is known as a parsing system or a parser.
One typical parser attempts to identify phrases which are repeated often within the document and identifies those phrases as being key phrases which characterize the document. The problem with such a system is that it is very slow, since it must count the repetitions of each phrase in the document. It also requires a large amount of memory. As the amount of data to be parsed increases, the slow speed of this parser becomes unacceptable. Another typical parser performs a three-step process to identify the key phrases. First, each word in the document is assigned a tag based on the part of speech of the word (e.g., noun, adjective, adverb, verb, etc.), and certain parts of speech, such as an article or an adjective, may be removed from the list of phrases which characterizes the document. Next, one or more sequences of words (templates) may be used to identify and remove phrases which do not add any understanding to the document. Finally, any phrase which is an appropriate part of speech and does not fall within one of the templates is accepted as a key phrase which characterizes the document. This conventional parser, however, is also slow, which becomes unacceptable as the amount of data to be parsed increases.
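By way of illustration only, the repetition-counting approach of the first conventional parser described above may be sketched as follows. The function names, the limit of two words per phrase, and the repetition threshold are assumptions chosen for this example and are not taken from any particular conventional parser.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (illustrative rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def frequent_phrases(text, max_words=2, min_count=2):
    """Return each phrase of up to max_words words repeated at least min_count times.

    max_words and min_count are hypothetical parameters for this sketch.
    """
    words = tokenize(text)
    counts = Counter()
    # Every contiguous run of 1..max_words words is counted, so the table of
    # counts grows with the document. This is the source of the speed and
    # memory cost attributed to this kind of parser.
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}
```

For the text "hard drive cost and hard drive speed", this sketch reports "hard", "drive", and "hard drive", each with a count of two. Every candidate phrase must be stored and counted before any phrase can be accepted, which suggests why such a parser slows down as the amount of data to be parsed increases.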
In all of these conventional parser systems, the parser attempts to break the document down into smaller pieces based on the characteristics (frequency of repetition or part of speech) of the particular words in the document. The problem is that language generally is not so easily classified, and therefore the conventional parser either does not accurately parse the document or requires a large amount of time to parse the document. In addition, the conventional parser systems are very slow because they all rely on complex characteristics of the language to parse the key phrases out of the document. These problems with conventional parsers become more severe as the number of documents which must be parsed increases. Today, the number of documents which must be parsed is increasing at a tremendous rate due to, among other things, the Internet and the World Wide Web. Therefore, these conventional parsers are not acceptable. Thus, it is desirable to provide a parsing system and method which solve the above problems and limitations of conventional parsing systems, and it is to this end that the present invention is directed.