The present invention relates to computer software programs and more particularly to a computer software search program to search distributed text databases.
At the present time there is a need for a more accurate computer software search program to search distributed text databases in response to a user's query and to respond by retrieving the documents, or sections of documents, most pertinent to the query. A database is a body of information made up of records. A user of the Internet and World Wide Web (WWW) may be interested in obtaining documents relating to a relatively narrow field, for example, the present physical location of guitars that had been owned by Bob Dylan, or documents describing cures for microscopic colitis.
Although enormous popularity and sophisticated technology have turned the Internet into not only the major source of information but also a medium for a wide range of day-to-day activities, the WWW, like other distributed sets of databases, still remains a not very friendly place for a typical Internet user. This is due partly to the overwhelming amount of information accessible through the Internet and partly to the fact that the intrinsic "natural laws" of the cyber World differ significantly from those with which Internet users have gained familiarity in the real World.
Effective information acquisition has two crucial aspects: retrieval and presentation. A successful retrieval should utilize all available sources of information and select those which are the most suitable to the type of information required or the most appropriate to the query that initiated the process. The presentation, on the other hand, should reorganize the acquired information by eliminating the irrelevant or technically inaccessible information and by sorting, ordering and grouping the relevant information in a manner that enables the real World user to make decisions and to react efficiently.
The success of both of these aspects depends critically on the ability to understand the needs and intentions behind the process of information acquisition, as well as on the ability to evaluate the acquired information. The existing instruments for information acquisition from the Internet (search engines, portals, etc.) have failed to develop any of these abilities and depend heavily on the experience and skillfulness of an Internet user.
Using the presently available search engines, such as LYCOS™, EXCITE™, INFO SEEK™, WEB CRAWLER™, ALTA VISTA™, NORTHERN LIGHT™, YAHOO™ and HOT BOT™, or a meta-search engine such as META CRAWLER™, DOGPILE™, INFERENCE FIND™, MAMMA™ and SAVVY SEARCH™, it is often difficult, time-consuming and frustrating for the user to obtain exactly the information sought by the enquiry entered into the search engine. It is not uncommon for the user to be told by the search engine that an enquiry resulted in over 20,000 documents, or that there are no documents when in fact there are many. An important function is the "ranking" of the documents found in a search, with generally the 10 highest ranking documents being presented first, followed by the next 10, and so on. The user's enquiry, sometimes called a "search strategy statement", generally uses specific terms, i.e., keywords.
That process, however, often gives inaccurate results in that it misses relevant documents, provides irrelevant documents, and often provides too many documents. Consequently, there is an urgent need for a kind of "virtual representative" of a real World user in the cyber World that is able to accurately acquire information on behalf of the user.
Such a virtual representative can serve as a personal assistant, born and bred in the cyber World. This assistant can independently perform numerous activities on behalf of a real World user and not only relieve him or her of the Web routine, but also significantly increase the productivity of his or her activities using sets of distributed databases, such as the WWW.
The present invention provides a robot (an independently operating agent that combines machine understanding and automation of routines). The robot is capable of (1) collecting information from a variety of Web or other distributed database sources in parallel; (2) semantically analyzing the retrieved information in order to evaluate its suitability to the user's intentions and expectations; (3) reorganizing the retrieved information in a useful manner; and (4) extracting information concerning an Internet user in order to formulate his or her foci of interest. This robot operates "on top" of various Web-based and other sources of information and instruments for information acquisition, and does not require any particular database of its own or any reprocessing of the Web content.
The robot is activated by an explicit user action (e.g., posting a query), or automatically when searching for Web content that matches the foci of interest of a user. In both cases the robot can collect information either from a set of information sources predefined by the user (e.g., a particular Web site, or search engines supplying a particular type of content, such as news or press releases), or by automatically selecting the most appropriate sources of information. This activity of the robot is terminated by an explicit action on behalf of a user, when all the relevant sources have been exhausted, or when a satisfying amount of relevant information has been retrieved.
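The collection and termination behavior described above can be sketched as follows. This is a minimal illustration only, not the implementation of the invention: the names `collect`, `Source`, `is_relevant`, `enough` and `cancelled` are hypothetical, a "source" is assumed to be any callable that returns documents for a query, and a thread pool is assumed as one possible way of querying sources in parallel.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Optional, Sequence

# Assumption: a source is any callable mapping a query to a list of documents.
Source = Callable[[str], List[str]]


def collect(
    query: str,
    sources: Sequence[Source],
    is_relevant: Callable[[str, str], bool],
    enough: int = 10,
    cancelled: Optional[Callable[[], bool]] = None,
) -> List[str]:
    """Query all sources in parallel and stop on one of three conditions:
    the user cancels, all sources are exhausted, or enough relevant
    documents have been retrieved."""
    results: List[str] = []
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = [pool.submit(source, query) for source in sources]
        for future in as_completed(futures):
            if cancelled is not None and cancelled():
                break  # explicit action on behalf of the user
            try:
                documents = future.result()
            except Exception:
                continue  # an inaccessible source is simply ignored
            results.extend(d for d in documents if is_relevant(query, d))
            if len(results) >= enough:
                break  # a satisfying amount of relevant information
        # Sources that have not started yet are no longer needed.
        for future in futures:
            future.cancel()
    return results[:enough]
```

In this sketch the loop ends naturally when every source has answered, which corresponds to exhausting all the relevant sources. The relevance test is passed in as a parameter because the text does not specify how relevance is decided at this stage.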
The retrieved information is semantically analyzed. Obsolete or inaccessible information is ignored completely. From each retrieved document the semantic core information is extracted in order to create its "shorthand" signature. These signatures are compared in order to detect the semantic common denominators and to group the retrieved documents by common topics. The resulting subgroups are sorted by their relevancy to the initial query and ranked by their suitability to the user's foci of interest. If necessary, the documents are further grouped by the domains in their Uniform Resource Identifiers (URI). Accordingly, the results are not presented in their raw form (as in regular search engines), but rather as topics extracted from the retrieved documents, which are sorted "on-the-fly" by their semantic relevancy to the query and ranked by their suitability to the user's interests.
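The signature, grouping and sorting steps described above might be sketched as follows. The text does not state how the semantic core is extracted or how signatures are compared, so this sketch substitutes simple stand-ins and labels them as assumptions: a signature is taken to be the most frequent non-stopword terms of a document, signatures are compared by Jaccard overlap, and groups are sorted by how many query terms appear in their common terms. All function names, the stopword list and the threshold are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, FrozenSet, List

# Assumption: a tiny illustrative stopword list.
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "and", "or", "to", "in", "for", "is", "by", "on", "with"}
)


def signature(text: str, size: int = 5) -> FrozenSet[str]:
    """Assumed 'shorthand' signature: the `size` most frequent content words."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    return frozenset(sorted(counts, key=lambda w: (-counts[w], w))[:size])


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_by_topic(docs: List[str], threshold: float = 0.3) -> List[Dict]:
    """Greedily group documents whose signature overlaps a group's first signature."""
    groups: List[Dict] = []
    for index, doc in enumerate(docs):
        sig = signature(doc)
        for group in groups:
            if jaccard(sig, group["seed"]) >= threshold:
                group["docs"].append(index)
                group["common"] = group["common"] & sig  # common denominator
                break
        else:
            groups.append({"seed": sig, "common": sig, "docs": [index]})
    return groups


def sort_by_query(groups: List[Dict], query: str) -> List[Dict]:
    """Sort groups by the number of query terms found in their common terms."""
    terms = signature(query)
    return sorted(groups, key=lambda g: -len(terms & g["common"]))


docs = [
    "Bob Dylan guitar auction guitar museum",
    "Dylan guitar museum exhibit guitar",
    "colitis treatment microscopic colitis cure",
]
ranked = sort_by_query(group_by_topic(docs), "microscopic colitis")
print([g["docs"] for g in ranked])  # [[2], [0, 1]]
```

A further ranking pass by the user's foci of interest, and the optional grouping by URI domain, would follow the same pattern and are omitted here.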
The robot can learn about the user's interests in a variety of ways. It can extract the most dominant topics from any textual information electronically supplied by the user (such as the so-called "bookmarks" or "favorites" from a Web browser, or any other set of documents that is representative of his or her interests). In addition, the robot follows the user's reactions to supplied information: selected topics, preferred information sources, typical domains, preferred types of documents, etc. The collected information is incorporated into the user's foci of interest in order to keep them updated. They are further enhanced with information from frequently repeated queries. The user is also allowed to manually modify and enrich his or her foci of interest. The robot is equipped with a mechanism simulating natural amnesia, which disregards and eventually removes obsolete constituents.
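One way to picture foci of interest that are kept updated and subject to "natural amnesia" is a set of weighted topics whose weights are reinforced by user activity and decay over time. The following is a minimal sketch under that assumption; the text does not specify the forgetting mechanism, so the class name `FociOfInterest`, the multiplicative decay factor and the removal floor are all hypothetical.

```python
from typing import Dict, List


class FociOfInterest:
    """Weighted topics with reinforcement and a simple forgetting step."""

    def __init__(self, decay: float = 0.5, floor: float = 0.2) -> None:
        self.decay = decay  # assumed: fraction of weight kept per forgetting step
        self.floor = floor  # assumed: topics below this weight are removed
        self.weights: Dict[str, float] = {}

    def reinforce(self, topic: str, amount: float = 1.0) -> None:
        """Record user activity (a selected topic, a repeated query, a manual edit)."""
        self.weights[topic] = self.weights.get(topic, 0.0) + amount

    def forget(self) -> None:
        """One 'amnesia' step: weaken every topic and drop obsolete ones."""
        decayed = {t: w * self.decay for t, w in self.weights.items()}
        self.weights = {t: w for t, w in decayed.items() if w >= self.floor}

    def top(self, n: int = 3) -> List[str]:
        """Return the strongest topics, ties broken alphabetically."""
        return sorted(self.weights, key=lambda t: (-self.weights[t], t))[:n]


foci = FociOfInterest()
foci.reinforce("guitars")
foci.reinforce("guitars")
foci.reinforce("colitis")
for _ in range(3):
    foci.forget()
print(foci.top())  # ['guitars']
```

In this sketch a topic that is reinforced often survives several forgetting steps, while a topic seen once fades and is eventually removed, which mirrors the disregard and removal of obsolete constituents described above.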
If the robot is based on a Web server, rather than on the client computer, the robot can operate independently and without any direct supervision on behalf of a user. The user can be informed when the required information is available and can access it the next time he or she comes on line (connects to the server).