This invention relates generally to a system and method for processing a document and in particular to a system and method for identifying a plurality of phrases within the document which indicate the context of the document.
Several factors have contributed to the extensive storage and retrieval of textual data in computer databases: a dramatic increase in the storage capacity of hard drives coupled with a decrease in their cost, increases in the transmission speed of computer communications, the increased processing speed of computers, and the expansion of computer communications networks, such as bulletin boards and the Internet. People therefore have access to the large amounts of textual data stored in these databases. However, although the technology facilitates the storage of and access to this data, the sheer volume of textual data now available has created new problems.
In particular, a person trying to access textual data in a computer database having a large amount of data needs a system for analyzing the data in order to retrieve the desired information quickly and efficiently without retrieving extraneous information. In addition, the user of the system needs an efficient system for condensing each large document into a plurality of phrases (one or more words) which characterize the document so that the user of the system can understand the document without actually viewing the entire document. A system for condensing each document into a plurality of key phrases is known as a parsing system or a parser.
One typical parser attempts to identify phrases which are repeated often within the document and treats those phrases as key phrases which characterize the document. The problem with such a system is that it is very slow, since it must count the repetitions of each phrase in the document. It also requires a large amount of memory. As the amount of data to be parsed increases, the slow speed of this parser becomes unacceptable. Another typical parser performs a three-step process to identify the key phrases. First, each word in the document is assigned a tag based on the part of speech of the word (e.g., noun, adjective, adverb, verb) and certain parts of speech, such as an article or an adjective, may be removed from the list of phrases which characterizes the document. Next, one or more sequences of words (templates) may be used to identify and remove phrases which do not add any understanding to the document. Finally, any phrase which is an appropriate part of speech and does not fall within one of the templates is accepted as a key phrase which characterizes the document. This conventional parser, however, is also slow, which is unacceptable as the amount of data to be parsed increases.
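By way of illustration only, the repetition-counting approach described above might be sketched as follows. The function name frequent_phrases and the parameters max_len and min_count are hypothetical and are not taken from any particular conventional parser; the sketch merely shows that a count must be held for every distinct word sequence in the document, which is the source of the time and memory cost noted above.

```python
import re
from collections import Counter


def frequent_phrases(text, max_len=3, min_count=2):
    """Count every word sequence of up to max_len words (hypothetical sketch).

    Returns the sequences that occur at least min_count times.
    """
    words = re.findall(r"\w+", text.lower())
    counts = Counter()
    # Every candidate phrase length requires a full pass over the document,
    # and every distinct sequence occupies an entry in the counter.
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}
```

In this sketch the counter holds an entry for each distinct sequence of one to max_len words, so its size grows with the length and vocabulary of the document being parsed.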
In all of these conventional parser systems, the parser attempts to break the document down into smaller pieces based on the characteristics (frequency of repetition or part of speech) of the particular words in the document. The problem is that language generally is not that easily classified, and therefore the conventional parser either does not accurately parse the document or requires a large amount of time to parse the document. In addition, the conventional parser systems are very slow because they all attempt to use complex characteristics of the language as a method for parsing the key phrases out of the document. These problems with conventional parsers become more severe as the number of documents which must be parsed increases. Today, the number of documents which must be parsed is increasing at a tremendous rate due to, among other things, the Internet and the World Wide Web. Therefore, these conventional parsers are not acceptable. Thus, it is desirable to provide a parsing system and method which solves the above problems and limitations of conventional parsing systems, and it is to this end that the present invention is directed.
A parser system and method in accordance with the invention is provided in which the break characters within a sentence or a paragraph are used to parse the document into a plurality of key phrases. The parser system in accordance with the invention is very fast and does not sacrifice much accuracy for the speed. The break characters within the document may include punctuation marks, certain stop words and certain types of words such as verbs and articles. The parser system may include a buffer which receives one or more words before it receives a break character. When the buffer receives a break character, the parser may determine whether the phrase before the break character is saved based on the type of break character. In particular, if the break character is a punctuation mark, the parser may keep the one or more words before the break character as a key phrase. If the break character is another type of character, the phrase before the break character may or may not be saved. Once the fate of the phrase has been determined, the buffer is flushed and the next sequence of one or more words is read into the buffer so that it may also be parsed. In this manner, a plurality of phrases in the document may be rapidly extracted from the document based on the break characters within the sentences and paragraphs of the document.
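By way of example only, the buffer-and-break-character process summarized above might be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the name extract_key_phrases, the contents of STOP_WORDS, and the rule that a phrase ending at a non-punctuation break is saved only when it contains at least two words are all hypothetical choices made for illustration. The summary above states only that such a phrase may or may not be saved.

```python
import re

# Assumed, illustrative set of break words; the actual set is not given here.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "and", "or", "to", "in"}


def extract_key_phrases(text):
    """Split text into phrases at break characters (hypothetical sketch)."""
    buffer = []    # words read since the last break character
    phrases = []   # key phrases saved so far, in document order
    for match in re.finditer(r"(\w+)|([^\w\s])", text):
        word = match.group(1)
        if word is None:
            # Punctuation break: keep the buffered words as a key phrase.
            if buffer:
                phrases.append(" ".join(buffer))
            buffer = []   # flush the buffer
        elif word.lower() in STOP_WORDS:
            # Other break type: ASSUMED rule, keep only multi-word phrases.
            if len(buffer) >= 2:
                phrases.append(" ".join(buffer))
            buffer = []   # flush the buffer
        else:
            buffer.append(word)
    # Assumption: the end of the text acts like a punctuation break.
    if buffer:
        phrases.append(" ".join(buffer))
    return phrases
```

For instance, under these assumptions, the sentence "Large documents are stored in computer databases, and the user retrieves key phrases." yields the phrases "Large documents", "computer databases" and "user retrieves key phrases", while the single word "stored" is discarded at the break word "in". Each word is examined once and no per-phrase counts are kept, which is consistent with the speed advantage described above.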
Thus, in accordance with the invention, a system for parsing a piece of text into one or more phrases which characterize the document is provided. The system comprises a buffer for reading one or more words from the piece of text into the buffer and a parser for identifying a phrase contained in the buffer, the phrase being a sequence of two or more words in between break characters. The parser further comprises means for determining the type of break character that follows the identified phrase and means for saving a key phrase from the buffer based on the determined type of break character. The key phrases are stored in a database.