1. Field of the Invention
The present invention relates to a method of summarizing markup-type documents automatically.
2. Discussion of the Related Art
Generally, the spread of computers and the development of network technology such as the Internet enable users to use or access a vast amount of information (documents) on-line.
On-line documents initially took simple forms composed of text, but they have become more complicated with the emergence of means for expressing document structure in various ways, such as markup languages.
In this case, the term “markup” means the work of describing the logical structure of a text document or a word processing document. A markup language is used for such work. The markup language is a series of characters and symbols inserted at specific locations in the document to describe the logical structure of the document. In addition, a document having the markup language inserted therein is called a markup document.
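As an illustration only, the relation between inserted markup and logical structure can be sketched in Python. The tag names (`doc`, `heading`, `p`), the sample string, and the helper names `StructureParser` and `logical_structure` are hypothetical and do not appear in the related art; the sketch merely assumes an HTML-like syntax that the standard `html.parser` module can read.

```python
from html.parser import HTMLParser


class StructureParser(HTMLParser):
    """Collects (tag, text) pairs so that the logical structure is visible."""

    def __init__(self):
        super().__init__()
        self._stack = []   # currently open tags, innermost last
        self.parts = []    # (enclosing tag, text) in document order

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open tag.
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self.parts.append((self._stack[-1], text))


def logical_structure(markup):
    """Return the text of a markup document labeled by its enclosing tag."""
    parser = StructureParser()
    parser.feed(markup)
    parser.close()
    return parser.parts


# Hypothetical markup document: the tags are the inserted markup,
# the remaining characters are the text they describe.
SAMPLE = "<doc><heading>Search systems</heading><p>Users enter a keyword.</p></doc>"

for tag, text in logical_structure(SAMPLE):
    print(tag, "->", text)
```

Here the tags carry no content of their own; they only indicate which part of the text is a heading and which part is a paragraph.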
The amount of electronic documents of variously complicated forms, such as markup documents, is increasing explosively, which raises the problem of an excess of accessible documents. Besides, searching for a desired document becomes a relatively time-consuming job. Hence, the advent of a document search system has become inevitable.
Document search means that documents including a keyword (subject word) inputted by a user are retrieved and provided to the user in a sequence according to a specific condition.
FIG. 1 illustrates a schematic diagram of a structure of a document search system for carrying out the above role.
Referring to FIG. 1, a document search system includes a plurality of user devices (e.g., PC, digital TV, etc.) enabling bi-directional communications, a server 1 having a search engine, and various servers (server 2 and server 3) providing documents requested by the search engine. Specifically, the user devices and servers are linked to networks providing bi-directional communications, such as the Internet.
A user gains access to the server 1 including the search engine using his user device and then inputs a keyword to search.
The server 1 including the search engine retrieves documents corresponding to the keyword inputted by the user and provides them to the user device. In this case, the server 1 obtains the documents corresponding to the keyword from its own database or from other servers (server 2, server 3) existing on-line and provides them to the user device.
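The retrieval flow described above can be sketched as follows. This is a minimal illustration, assuming that each server's documents are held in a dictionary and that a document "corresponds" to a keyword when it contains it as a case-insensitive substring; the function name `search` and the sample data are hypothetical, and the related art does not specify the matching rule.

```python
def search(keyword, local_db, remote_servers):
    """Return (source, doc_id) pairs for documents that contain the keyword.

    local_db:       {doc_id: text} held by server 1 itself (assumed layout)
    remote_servers: {server_name: {doc_id: text}} for the other servers
    """
    needle = keyword.lower()
    hits = []
    # Server 1 consults its own database first, then the other servers.
    sources = [("server 1", local_db)] + list(remote_servers.items())
    for name, docs in sources:
        for doc_id, text in docs.items():
            if needle in text.lower():
                hits.append((name, doc_id))
    return hits


# Hypothetical data standing in for the databases of FIG. 1.
local_db = {"a": "Markup languages describe structure."}
remote_servers = {
    "server 2": {"b": "Cooking recipes."},
    "server 3": {"c": "A markup document example."},
}

print(search("markup", local_db, remote_servers))
```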
The user then checks the documents according to the search result through his user device.
However, the amount of search results corresponding to a keyword has recently become enormous, and the user is unable to tell whether the search results are correct. In practice, the user has to find the requested document by checking all the documents in the search result one by one.
In order to overcome such a disadvantage of the document search system, a document summarizing system has been developed.
Document summarization means that the contents of enormous documents are reduced to a predetermined size. Specifically, unimportant or trivial parts of the plurality of documents in the document search result are skipped and the core contents are extracted consistently. Namely, document summarization is a form of compression of document contents.
Generally, a document summarizing system is divided into a process of summarizing documents and a process of constructing keyword information of documents.
The document summarizing process starts with a parsing step of reading the contents of the searched documents and classifying them into interpretation units for document summarization. In this case, the searched documents are regarded as a set of sections, each sentence is regarded as a set of words, and each of the words serves as a keyword as well as the smallest element of document summarization.
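A minimal sketch of such a parsing step is given below. It assumes that sentences end with ".", "!" or "?" followed by whitespace and that a word is a run of letters or digits; both rules and the function name `parse` are illustrative assumptions rather than details of the related art.

```python
import re


def parse(document):
    """Split a document into sentences, and each sentence into lowercase words."""
    # Assumed sentence boundary: terminal punctuation followed by whitespace.
    pieces = re.split(r"(?<=[.!?])\s+", document)
    sentences = [s.strip() for s in pieces if s.strip()]
    # Assumed word rule: a run of ASCII letters or digits.
    return [(s, re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]


for sentence, words in parse("Markup describes structure. Search is slow!"):
    print(sentence, words)
```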
The process of constructing the keyword information of the documents is carried out in a manner that frequency information is collected using the word, which is the smallest element of the searched documents, as the reference unit, and the keyword information is constructed from the collected frequency information. After the keyword information has been constructed, a weight of each of the sentences is calculated to select the subject sentence.
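The collection of frequency information can be sketched as a simple word count over the searched documents. The sketch assumes that every word counts as a keyword candidate, with no stop-word filtering or stemming, since the related art describes none; the function name `keyword_info` is hypothetical.

```python
import re
from collections import Counter


def keyword_info(documents):
    """Count how often each word occurs across all searched documents."""
    counts = Counter()
    for doc in documents:
        # Same assumed word rule as elsewhere: runs of letters or digits.
        counts.update(re.findall(r"[a-z0-9]+", doc.lower()))
    return counts


info = keyword_info(["Markup markup text.", "Text search."])
print(info.most_common())
```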
The calculation of the weights of the sentences is carried out in two steps. Firstly, a point is given to each of the sentences based on the frequency with which the keyword occurs. Secondly, the weight of each of the sentences is calculated according to the given point.
Once the weight of each of the sentences is calculated, a summary document of a designated length is generated by extracting the sentences sequentially in descending order of weight.
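The two-step weighting and the extraction can be sketched as follows. The related art does not state the formulas, so this sketch assumes that the point of a sentence is the sum of the document-wide frequencies of its words and that the weight is the point divided by the sentence length; the function name `summarize` and the parameter `max_sentences` are likewise hypothetical.

```python
import re
from collections import Counter


def summarize(document, max_sentences=2):
    """Return the highest-weighted sentences, in descending order of weight."""
    pieces = re.split(r"(?<=[.!?])\s+", document)
    sentences = [s.strip() for s in pieces if s.strip()]
    words_per_sentence = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    freq = Counter(w for words in words_per_sentence for w in words)

    # Step 1 (assumed rule): point = sum of keyword frequencies in the sentence.
    points = [sum(freq[w] for w in words) for words in words_per_sentence]
    # Step 2 (assumed rule): weight = point normalized by sentence length.
    weights = [p / len(words) if words else 0.0
               for p, words in zip(points, words_per_sentence)]

    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)
    return [sentences[i] for i in ranked[:max_sentences]]


text = "Markup documents use markup. Cats sleep. Markup search finds documents."
print(summarize(text))
```

In this example the sentence weights are 2.25, 1.0 and 1.75, so the first and third sentences are extracted and the second is skipped.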
When the above-explained document summarizing system according to the related art is used, the contents of the summary document may lose their consistency. This is because the document summarizing system according to the related art provides the summary document by merely combining some of the sentences that contain the keywords. Namely, it occasionally occurs that there is no correlation in content between one sentence and another in the summary document.
Hence, it happens frequently that such a partial combination of sentences fails to convey to the user the entire contents of the documents prior to the summarization. Moreover, even if the sentences constituting the summary document include the keyword, the overall contents of the summary document may not include the contents requested by the user.
Thus, the summary document generated by the document summarizing system according to the related art is merely a summary of sentences included in the various searched documents, and is therefore poor in the information it contains. Moreover, the entire contents of the various documents must be compared against the search keyword, whereby it takes considerable time to generate the summary document.