The present invention relates to computer-based apparatus for, and methods of, semantically analyzing, selecting, and summarizing candidate documents containing specific content or subject matter.
Computer-based document search processors are known to perform key word searches for publications on the Internet and World Wide Web. Today, information owners and service providers are adapting their databases to individual tastes and requirements. For example, Boston-based Agents, Inc. offers personalized newsletters for music fans over the Web, such that classical music lovers are blocked from receiving rap music ads and vice versa. KD, Inc. of Hong Kong has developed a system that takes into consideration words similar in sense while searching the Web. Today, a user who types the word "Screen" can download 10,000 papers from the Web. The search system designed by KD, Inc. asks the user whether he/she is seeking papers related to Computer Screen, TV Screen or Window Screen. In this case, the number of unrelated papers will be drastically reduced.
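The sense-selection step described above can be illustrated with a minimal sketch. It is not the KD, Inc. system, whose internals are not described here; the sense labels, the cue words in `SENSE_CUES`, the function names `senses_for` and `filter_by_sense`, and the sample documents are all hypothetical and chosen only for illustration.

```python
# Hypothetical sense labels and cue words; none of this is taken from KD, Inc.'s system.
SENSE_CUES = {
    "computer screen": {"monitor", "pixel", "display", "laptop"},
    "tv screen": {"television", "broadcast", "channel"},
    "window screen": {"mesh", "insect", "frame"},
}


def senses_for(keyword):
    """Return the sense labels to offer the user for an ambiguous key word."""
    key = keyword.lower()
    return sorted(sense for sense in SENSE_CUES if key in sense.split())


def filter_by_sense(documents, keyword, sense):
    """Keep documents that contain the key word and at least one cue word of the chosen sense."""
    key = keyword.lower()
    cues = SENSE_CUES[sense]
    kept = []
    for doc in documents:
        words = set(doc.lower().split())
        if key in words and words & cues:
            kept.append(doc)
    return kept


# Illustrative documents only.
docs = [
    "A flat screen monitor with high pixel density",
    "Repairing a window screen with new mesh",
    "The television screen flickered during the broadcast",
]

options = senses_for("Screen")  # senses the user would be asked to choose from
selected = filter_by_sense(docs, "screen", "computer screen")
```

In this sketch all three documents match the bare key word, but only the first survives once the user picks the "computer screen" sense, which is the reduction in unrelated papers that the paragraph above describes.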
Software-based search processors are able to remember the requests of a single user and to conduct personalized non-stop searches on the Web. Thus, when a user wakes up in the morning, he/she finds references to and abstracts of several new Web papers related to his/her area of interest. In 1997, practically all fundamental technical publications, journals, and magazines, as well as patents of all industrial countries, became available on the Web, i.e., available in electronic format.
Although key word searching of the Web affords the user great value, it has also created, and will continue to create, substantial problems that adversely affect this value. Specifically, because of the enormous amount of information available on the Web, key word search processors produce too much downloaded information, the vast majority of which is irrelevant or immaterial to the information the user wants. Many users simply give up in frustration when presented with several hundred articles in response to what the user considered a request for only those few articles related to a specific subject.
This problem is also experienced in the technical fields of science and engineering, particularly since a growing number of libraries, government patent offices, universities, government research centers, and others are adding vast amounts of technical and scientific information for Web access. Engineers, scientists, and doctors are overwhelmed with too many articles, papers, patents and general information on the topic of interest to them. In addition, the user presently has only two choices when examining a downloaded article to determine its relevance to the user's project. He/she can read the author's abstract, scan various sections of the full article, or both, to determine whether or not to save or print out that specific document. Since the author's abstract is not comprehensive, it often omits any reference to the specific subject matter of interest to the user or treats this subject matter incompletely. Thus, reading the abstract and scanning the full article may have little value and require an inordinate amount of user time.
Various attempts purport to increase the recall and precision of the selection, such as U.S. Pat. Nos. 5,774,833 and 5,794,050, incorporated herein by reference. However, these methods simply rely on key word or phrase searching with various techniques of selection based on variations of the key words, or on a purported understanding of textual phrases. These prior methods may improve recall but may still require too much physical and mental effort and time to determine why a document was selected and which part of it is pertinent. This results from the entire document or abstract being presented without summary or concept generation.