This invention relates generally to a computer system for document processing, and more specifically to generating a summary of a document.
The Internet is a collection of interconnected computer systems through which users can access a vast store of information. The information accessible through the Internet is stored in electronic files (i.e., documents) under control of the interconnected computer systems. It has been estimated that over 50 million documents are currently accessible through the Internet and that the number of documents is growing at the rate of 75% per year. Although a wealth of information is stored in these documents, it has been very difficult for users to locate documents relating to a subject of interest. The difficulty arises because documents are stored in many different computer systems, and the Internet provides no central mechanism for registering documents. Thus, a user may not even know of the existence of certain documents, let alone the subject matter of the documents. Each document that is accessible through the Internet is assigned a unique identifier, which is referred to as a uniform resource locator ("URL"). Once a user knows the identifier of a document, the user can access the document. However, even if a user knows the identifiers of all documents accessible through the Internet, the user may not know the subject matter of those documents. Thus, the user may have no practical way to locate a document relating to a subject of interest.
Several search engines have been developed to help users locate documents relating to a subject of interest. Search engines attempt to locate and index as many of the documents provided by as many computer systems of the Internet as possible. The search engines index the documents by mapping terms that represent the subject matter of each document to the identifier of the document. For example, if a search engine determines that the terms "united" and "states" represent the subject matter of a document, then the search engine would map each of those terms to the URL for the document. When using a search engine to locate documents relating to a subject of interest, the user enters search terms that describe the subject of interest. The search engine then searches the index to identify those documents that are most relevant to the search terms. For example, if a user enters the search terms "united" and "states," then the search engine searches the index to identify the documents that are most relevant to those search terms. In addition, the search engine may present the search results, that is, the list of relevant documents, to the user in order of relevance to the search terms. The user can then select and display the most relevant documents.
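The term-to-identifier mapping described above can be illustrated with a minimal Python sketch. It is not taken from any particular search engine: the function names `build_index` and `search`, the example URLs, and the use of a simple count of matching terms as a stand-in for relevance are all assumptions made for illustration.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercase term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in text.lower().split():
            index[word.strip('.,;:!?"')].add(url)
    return index

def search(index, query_terms):
    """Return URLs ordered by the number of query terms they contain.

    Counting matching terms is an assumed, simplified notion of relevance.
    """
    hits = defaultdict(int)
    for term in query_terms:
        for url in index.get(term.lower(), ()):
            hits[url] += 1
    # Most matching terms first; ties broken alphabetically by URL.
    return sorted(hits, key=lambda url: (-hits[url], url))

# Hypothetical documents keyed by hypothetical URLs.
documents = {
    "http://example.com/a": "The United States is a federal republic.",
    "http://example.com/b": "The united team won.",
    "http://example.com/c": "Used cars for sale.",
}
index = build_index(documents)
results = search(index, ["united", "states"])
```

Here `results` lists the document containing both terms ahead of the document containing only one of them, and omits the document containing neither.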
The accuracy of the search results depends upon the accuracy of the indexing used by a search engine. Unfortunately, there is no easy way for a search engine to determine accurately the subject matter of documents. The difficulty in determining the subject matter of a document is compounded by the wide variety of formats (e.g., a word processing document or a hyper-text document) and the complexity of the formats of the documents accessible through the Internet. To make it easier for a search engine to determine the subject matter of a document, some document formats have a "keyword" section that provides words that are representative of the subject matter of the document. Unfortunately, creators of documents often fill the "keyword" section with words that do not accurately represent the subject matter of the document, a practice referred to as "false promoting" or "spamming." For example, a creator of a classified advertising web page for automobiles may fill the "keyword" section with repetitions of the word "car." The creator does this so that a search engine will identify that web page as very relevant whenever a user searches for the term "car." However, a "keyword" section that more accurately represents the subject matter of the web page may include the words "automobile," "car," "classified," "for," and "sale."
Because the document formats have no reliable way to identify the subject matter of a document, search engines use various algorithms to determine the actual subject matter of documents. Such algorithms may generate a numerical value for each term in a document that rates the importance of the term within the document. For example, if the term "car" occurs in a document more times than any other term, then the algorithm may give a high numerical value to the term "car" for that document. Typical algorithms used to rate the importance of a term within a document often factor in not only the frequency of the occurrence of the term within the document, but also the number of documents that contain that term. For example, if a term occurs two times in a certain document and also occurs in many other documents, then the importance of that term to the document may be relatively low. However, if the term occurs two times in that document, but occurs in no other documents, then the importance of that term within the document may be relatively high even though the term occurs only two times in the document. In general, these algorithms attempt to provide a high "information score" to the terms that best represent the subject matter of a document with respect to both the document itself and to the collection of documents.
To calculate the importance or "information score," typical algorithms take into consideration what is referred to as the term frequency within a document and the document frequency. The term frequency of a term is the number of times that the term occurs in the document. The term frequency for term i within document j is represented as TFij. The document frequency of a term is the number of documents in which the term occurs. The document frequency for term i is represented as ni. One such algorithm uses the Salton Buckley formula for calculating the importance of terms. The formula is given by the following equation:

$$W_{ij} = \log_2 TF_{ij} \cdot \log_2 \frac{N}{n_i} \qquad (1)$$

where Wij is the numerical value (i.e., weight) of the importance of the term i to the document j, where TFij is the term frequency, where ni is the document frequency, and where N is the total number of documents in a collection of documents. The quotient N/ni is referred to as the inverse document frequency, which is the inverse of the ratio of the number of documents that contain the term to the total number of documents. As the term frequency increases, the weight calculated by this formula increases logarithmically. That is, as the term occurs more frequently in a document, the weight of that term within the document increases. Also, as the document frequency increases, the weight decreases logarithmically. That is, as a term occurs in more documents, the weight of the term decreases. It is, of course, desirable to use a formula that results in weights that most accurately reflect the importance or information score of terms.
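As a worked illustration of equation (1), the following Python sketch computes the weight of each term in each document of a small collection. The function names `term_weight` and `weigh_collection` and the four-document collection are assumptions made for illustration; only the formula itself comes from the text above. Note that, as written, equation (1) gives a weight of zero to any term that occurs only once in a document, because the base-2 logarithm of 1 is 0.

```python
import math
from collections import Counter

def term_weight(tf, doc_freq, total_docs):
    """Equation (1): W_ij = log2(TF_ij) * log2(N / n_i)."""
    return math.log2(tf) * math.log2(total_docs / doc_freq)

def weigh_collection(collection):
    """Return {doc_id: {term: weight}} for a {doc_id: [terms]} collection."""
    total_docs = len(collection)
    doc_freq = Counter()
    for terms in collection.values():
        doc_freq.update(set(terms))  # count each document once per term
    return {
        doc_id: {
            term: term_weight(tf, doc_freq[term], total_docs)
            for term, tf in Counter(terms).items()
        }
        for doc_id, terms in collection.items()
    }

# Hypothetical collection: "car" occurs in every document,
# while "engine" occurs only in document d1.
collection = {
    "d1": ["car", "car", "engine", "engine"],
    "d2": ["car"],
    "d3": ["car"],
    "d4": ["car"],
}
weights = weigh_collection(collection)
```

Both "car" and "engine" occur twice in `d1`, yet "engine" receives a weight of 2.0 and "car" a weight of 0.0, because "car" occurs in all four documents and so N/ni is 1.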
Search engines typically identify and index only single terms. Search engines, however, do not typically index phrases, which are sequences of two or more terms. For example, a search engine may index the terms that comprise the phrase "United States" separately. Thus, when a user wants to locate documents related to the "United States," the search engine may locate many documents that contain the terms "united" and "states," but that do not contain the phrase "United States." As a result, the search engine may locate many documents that are of no interest to the user. For example, the search engine may locate documents that contain the sentence: "He then states that united we stand." Moreover, even if a search engine could index phrases, the search engine would calculate the importance of a phrase in a manner that is analogous to the calculation of the importance of a term. That is, the search engine would treat the phrase as a single term and would use the formula shown in equation (1) to calculate the weight of the phrase. However, such treatment of phrases is impractical for large collections of documents with a large number of unique terms. In particular, since the number of possible phrases increases exponentially with the length of the phrase, the number of frequencies that would need to be determined and stored for each document would also increase exponentially. Thus, it would be desirable to have a technique to calculate the importance of a phrase in a way that avoids this increase in storage.
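The difference between matching individual terms and matching a phrase can be seen in a short Python sketch. The helper names below (`tokens`, `contains_all_terms`, `contains_phrase`, `possible_phrases`) and the vocabulary size of 1,000 are hypothetical and chosen only for illustration.

```python
def tokens(text):
    """Split text into lowercase words with surrounding punctuation removed."""
    return [word.strip('.,;:!?"').lower() for word in text.split()]

def contains_all_terms(text, terms):
    """True if every term occurs somewhere in the text, in any order."""
    words = set(tokens(text))
    return all(term in words for term in terms)

def contains_phrase(text, phrase_terms):
    """True only if the terms occur consecutively and in the given order."""
    words = tokens(text)
    phrase = list(phrase_terms)
    k = len(phrase)
    return any(words[i:i + k] == phrase for i in range(len(words) - k + 1))

def possible_phrases(vocabulary_size, phrase_length):
    """Upper bound on distinct phrases of a given length over a vocabulary."""
    return vocabulary_size ** phrase_length

sentence = "He then states that united we stand."
term_match = contains_all_terms(sentence, ["united", "states"])
phrase_match = contains_phrase(sentence, ["united", "states"])
```

The sentence matches on terms (`term_match` is true) but not on the phrase (`phrase_match` is false). With an assumed vocabulary of 1,000 unique terms, `possible_phrases(1000, 2)` is one million and `possible_phrases(1000, 3)` is one billion, which is the exponential growth in the number of frequencies described above.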
As described above, the search engine presents the search results, that is, the list of relevant documents, to the user. The relevance of a document is determined by the particular algorithm used by the search engine. Once a user is presented with documents, the user typically still needs to ensure that each document actually relates to the subject of interest. To determine whether a document is actually related to the subject of interest, a user may open and read through the document. However, opening and reading the various documents identified by the search engine can be very time consuming, even though it is often necessary. As described above, the "keyword" section of a document may not accurately represent the subject matter of the document. Similarly, the creator of a document may fill the abstract or first couple of sentences of a document with words that do not accurately represent the subject matter of the document. Thus, it would be desirable to have a technique by which a user could determine the relevance of a document without opening and reading through the document.
An embodiment of the present invention provides a method and system for generating a summary of a document. The summary generating system generates the summary from the sentences that form the document. The summary generating system calculates a weight for each of the sentences in the document. The weight indicates the importance of the sentence to the document. The summary generating system then selects sentences based on their calculated weights. The summary generating system creates a summary of the selected sentences such that the selected sentences are ordered in the created summary in the same relative order as in the document. In one embodiment, the summary generating system identifies sets of sentences such that the total length of the sentences in each set is less than a maximum length. The summary generating system then selects, as the generated summary, the identified set whose total of the calculated weights of its sentences is greatest. The length of a sentence may be measured in characters or words. In an alternate embodiment, the summary generating system selects as the summary the sentences with the highest calculated weights such that the total length of the selected sentences is less than a maximum length.
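The two selection embodiments described above can be sketched in Python as follows. This is one possible reading, not the definitive implementation: the sentence weights are taken as given inputs, since this section does not say how they are calculated; length is measured in words; the first function searches all subsets exhaustively, which is practical only for short documents; and the second function assumes that a sentence that does not fit is skipped and lower-weighted sentences are still considered. All function names and sample data are hypothetical.

```python
from itertools import combinations

def length_in_words(sentence):
    """Measure sentence length in words (characters would also be allowed)."""
    return len(sentence.split())

def best_set_summary(sentences, weights, max_length):
    """Pick the set with the greatest total weight whose length < max_length."""
    best, best_weight = (), 0.0
    indices = range(len(sentences))
    for size in range(1, len(sentences) + 1):
        # combinations() yields indices in ascending (document) order.
        for combo in combinations(indices, size):
            total_length = sum(length_in_words(sentences[i]) for i in combo)
            if total_length >= max_length:
                continue
            total_weight = sum(weights[i] for i in combo)
            if total_weight > best_weight:
                best, best_weight = combo, total_weight
    return [sentences[i] for i in best]

def greedy_summary(sentences, weights, max_length):
    """Take sentences from highest weight down while the total stays < max_length."""
    chosen, used = [], 0
    for i in sorted(range(len(sentences)), key=lambda i: -weights[i]):
        n = length_in_words(sentences[i])
        if used + n < max_length:  # assumed: skip what does not fit, keep going
            chosen.append(i)
            used += n
    # Restore the relative order of the sentences in the document.
    return [sentences[i] for i in sorted(chosen)]

# Hypothetical sentences and hypothetical precomputed weights.
sentences = [
    "Search engines index single terms.",  # 5 words
    "Phrases are ignored.",                # 3 words
    "Summaries save time.",                # 3 words
    "Indeed.",                             # 1 word
]
weights = [5.0, 3.5, 3.5, 0.5]

set_based = best_set_summary(sentences, weights, max_length=8)
greedy = greedy_summary(sentences, weights, max_length=8)
```

With these inputs the two embodiments produce different summaries. The set-based selection returns the last three sentences (total weight 7.5, 7 words), while the greedy selection returns the first and last sentences (total weight 5.5, 6 words), because taking the single highest-weighted sentence first leaves no room for either of the next two.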