This invention relates generally to a system and method for processing a document and in particular to a system and method for generating a list of classifications (a taxonomy) from a plurality of phrases which are extracted from a set of documents.
Various factors have contributed to the storage of vast amounts of textual data in computer systems and computer databases. The dramatic increase in the storage capacity of computer storage devices, such as hard disks, tape drives and the like, and a decrease in the cost of these higher capacity storage devices are factors. Other factors include an increase in the transmission speed of computer communications, an increase in the processing speed of personal computers and an expansion of various computer communication networks, such as bulletin boards or the Internet. People therefore currently have access to the large amounts of textual data stored in these databases. However, although current technology facilitates the storage of and the access to this textual data, the vast amount of textual data that is now available has created new problems.
In particular, a person trying to access the textual data in these computer databases needs a system for analyzing and processing the data in order to retrieve the desired information quickly and efficiently without retrieving extraneous information. In addition, the person trying to access the information needs an efficient system for condensing each large document into a plurality of phrases (each consisting of one or more words) which characterize the document so that the person can browse the phrases and understand the document without actually viewing the entire document. It is also desirable to be able to automatically generate a classification system based on the extracted phrases stored in the computer database so that a person may use the classification system to browse through the database and focus on the most relevant documents or pieces of textual data in the computer database. The classification system may be known as a taxonomy, which consists of one or more subject matter headings with sub-headings reflecting the phrases extracted from the documents in the database.
To generate a typical subject matter classification system, a person must manually generate a classification system into which the one or more pieces of textual data in the database may then be manually or automatically classified. Thus, the typical classification system is manually generated and does not use the phrases extracted from the documents in the database to create the classification system. Therefore, the typical classifications are generally very broad, reflecting the inability to classify the documents more accurately when their exact contents are not known. As a result, the typical classification may permit the user to select only a broad category, which probably contains too many documents to review easily. In addition, for each different database, a new classification system must be manually generated, which is slow and time-consuming.
Thus, it is desirable to provide a taxonomy generation system and method which solve the above problems and limitations of conventional classification systems, and it is to this end that the present invention is directed.