The present invention relates to searching for information and, in particular, searching for information on a computer network.
Computer networks, such as the World Wide Web (the "Web", a.k.a. the "Internet"), have resulted in large amounts of information being distributed across an enormous number of processing devices or computers. For example, an electronic representation of a document may be stored at a "website" of a computer connected to the Web. The document may include multiple pages, and pages may be added or altered frequently.
Often, a search engine is used in retrieving a document on the Web. A search engine is typically a remotely accessible software program which indexes Internet addresses (universal resource locators ("URLs"), usenet, file transfer protocols ("FTPs"), image locations, etc.). A search engine typically returns a list of "hyperlinks" or Internet addresses of information from an index in response to a query. A user query may include a keyword, a list of keywords or a structured query expression, such as a Boolean query.
A typical search engine contains a special program often called a "crawler" or sometimes called a "spider" or "bot". A search engine "crawls" the Web by performing a search of the connected computers that store the information and makes a copy of the information. Some time later, the search engine will process the copy of the information and modify the search engine's existing index to reflect the new information available on the Web. The search engine may categorize the information in order to quickly provide a user with relevant information in response to a query.
However, because of the vast amount of distributed information added daily to the Web, maintaining an up-to-date index of information in a search engine is extremely difficult. A user may not obtain the most recent information from a search engine even though the information is at a recently published website or at a previously published website which has an altered page. The most recent information will likely be the most valuable, but is often not indexed in the search engine. Also, search engines do not typically use a user's personal search information in updating the search engine index.
Therefore, it is desirable to provide an information system, computer readable medium and method for searching for current relevant information on a processing device network, such as the Web. Relevant information which has been recently published or altered on the Web should be provided by the search engine. A user's personal search information should also be used in order to provide relevant current information.
Generally, an embodiment of the present invention is directed toward selectively searching the Web for relevant current information based on user personal search information (or filtering profiles). By selectively searching the Web, relevant information that has been added recently will more likely be discovered. A user provides personal search information, such as a query and how often a search is performed, to a filtering program. The filtering program invokes a Web crawler to search selected or ranked servers on the Web based on a user-selected search strategy or ranking selection. The filtering program directs the Web crawler to search a predetermined number of ranked servers based on: (1) the likelihood that the server has relevant content in comparison to the user query ("content ranking selection"); (2) the likelihood that the server has content which is altered often ("frequency ranking selection"); or (3) a combination of (1) and (2) ("both content and frequency ranking"). The recently altered relevant information, or hyperlinks to such information, is then provided to the user.
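The three ranking selections described above can be illustrated with a minimal Python sketch. The names `ServerStats` and `rank_servers`, the numeric scores, and the use of a simple unweighted sum for the combined selection are assumptions made for illustration only; the text does not specify how content and frequency are combined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerStats:
    """Hypothetical per-server scores; higher means more likely."""
    address: str
    content_score: float    # likelihood of content relevant to the query
    frequency_score: float  # likelihood that content is altered often


def rank_servers(servers: List[ServerStats], selection: str, limit: int) -> List[str]:
    """Return the addresses of the top `limit` servers for a ranking selection."""
    if selection == "content":
        key = lambda s: s.content_score
    elif selection == "frequency":
        key = lambda s: s.frequency_score
    elif selection == "both":
        # Assumption: an unweighted sum stands in for the combination.
        key = lambda s: s.content_score + s.frequency_score
    else:
        raise ValueError(f"unknown ranking selection: {selection!r}")
    ranked = sorted(servers, key=key, reverse=True)
    return [s.address for s in ranked[:limit]]


# Hypothetical servers and scores.
servers = [
    ServerStats("a.example", 0.9, 0.1),
    ServerStats("b.example", 0.2, 0.8),
    ServerStats("c.example", 0.6, 0.6),
]
top_by_content = rank_servers(servers, "content", 2)
top_by_frequency = rank_servers(servers, "frequency", 2)
top_by_both = rank_servers(servers, "both", 1)
```

The `limit` argument plays the role of the predetermined number of ranked servers that the crawler is directed to search.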
An information system for providing recently altered information on a computer network, such as the World Wide Web, is provided. The information system comprises a user processing device, a first content processing device, and a search engine processing device coupled to the Web. The user processing device includes a processor readable memory storing a user interface program for obtaining user information. The first content processing device has a first type of content information which is altered at a first frequency. The search engine processing device has a search engine software program that includes a Web crawler software program for obtaining content information responsive to (1) a comparison of the first type of information with the user information, and (2) the first frequency.
According to an embodiment of the present invention, the user information is a query including a keyword, a search interval including a time value, and a percentage searched including a percentage value.
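As a rough illustration, the user information could be held in a small record such as the following Python sketch. The class name `FilteringProfile`, its field names, and the interpretation of the percentage value as the share of ranked servers to search are assumptions and are not taken from the text.

```python
import math
from dataclasses import dataclass
from datetime import timedelta
from typing import Tuple


@dataclass(frozen=True)
class FilteringProfile:
    """Hypothetical container for the user information."""
    keywords: Tuple[str, ...]     # the query
    search_interval: timedelta    # time value: how often a search is performed
    percentage_searched: float    # percentage value, greater than 0 and at most 100

    def __post_init__(self) -> None:
        if not self.keywords:
            raise ValueError("query needs at least one keyword")
        if self.search_interval <= timedelta(0):
            raise ValueError("search interval must be positive")
        if not 0 < self.percentage_searched <= 100:
            raise ValueError("percentage must be in (0, 100]")

    def servers_to_search(self, ranked_count: int) -> int:
        """Assumed meaning: number of ranked servers covered by the percentage."""
        return math.ceil(ranked_count * self.percentage_searched / 100)


profile = FilteringProfile(("patent",), timedelta(hours=6), 25.0)
```

With this profile, a list of 40 ranked servers would yield 10 servers to search on each six-hour interval.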
According to another embodiment of the present invention, the first frequency is the number of page alterations per day, the number of page alterations per week, the number of page alterations per month, or the number of page alterations per year.
According to another embodiment of the present invention, the first frequency is an average of (1) the number of page alterations in the preceding day, (2) the number of page alterations in the preceding week, (3) the number of page alterations in the preceding month, and (4) the number of page alterations in the preceding year.
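The averaged frequency can be sketched in Python as follows. The function names, the use of a list of alteration timestamps as input, and the 30-day and 365-day approximations of a month and a year are assumptions. The text does not say whether the four counts are normalized before averaging, so this sketch takes a plain unweighted mean of the raw counts.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable

# Assumption: a month is approximated as 30 days and a year as 365 days.
WINDOWS = {
    "day": timedelta(days=1),
    "week": timedelta(days=7),
    "month": timedelta(days=30),
    "year": timedelta(days=365),
}


def alteration_counts(alteration_times: Iterable[datetime], now: datetime) -> Dict[str, int]:
    """Count page alterations in the preceding day, week, month and year."""
    times = list(alteration_times)
    return {
        name: sum(1 for t in times if now - span < t <= now)
        for name, span in WINDOWS.items()
    }


def average_frequency(alteration_times: Iterable[datetime], now: datetime) -> float:
    """Unweighted mean of the four window counts."""
    counts = alteration_counts(alteration_times, now)
    return sum(counts.values()) / len(counts)


# Hypothetical alteration history for one page.
now = datetime(2000, 1, 31, 12, 0)
history = [
    now - timedelta(hours=2),
    now - timedelta(days=3),
    now - timedelta(days=20),
    now - timedelta(days=200),
    now - timedelta(days=400),  # older than a year, so never counted
]
counts = alteration_counts(history, now)
average = average_frequency(history, now)
```

Because each longer window includes the shorter ones, a recent alteration contributes to all four counts and therefore weighs more heavily in the average than an older one.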
According to another embodiment of the present invention, the search engine software program obtains a content vector of the content information and a comparison is made between the content vector and the user information to obtain a content score.
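One common way to compare a content vector with a query is cosine similarity over term counts. The following Python sketch uses that technique as a stand-in; the text does not state which comparison is used, and the function names and tokenization rule are assumptions.

```python
import math
import re
from collections import Counter


def content_vector(text: str) -> Counter:
    """Term-count vector: lowercase alphanumeric tokens and their counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def content_score(vector: Counter, query: str) -> float:
    """Cosine similarity between a content vector and the query's vector."""
    query_vector = content_vector(query)
    # Counter returns 0 for terms that are absent from the document.
    dot = sum(vector[term] * weight for term, weight in query_vector.items())
    norm = math.sqrt(sum(v * v for v in vector.values())) * math.sqrt(
        sum(v * v for v in query_vector.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical document text and queries.
doc_vector = content_vector("Web crawler web index")
relevant = content_score(doc_vector, "web")
unrelated = content_score(doc_vector, "weather")
```

A score near 1 indicates that the document's terms closely match the query, and a score of 0 indicates that no query term occurs in the document.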
According to another aspect of the present invention, the information system further comprises a second content processing device coupled to the Web. The second content processing device has a second type of content information which is altered at a second frequency. The search engine ranks the first and second content processing devices based on: (1) a comparison of the user information with the first type of content; (2) a comparison of the user information with the second type of content; (3) the first frequency; and (4) the second frequency.
According to another aspect of the present invention, an article of manufacture, including a computer readable memory for searching for recently altered documents, is provided. The computer readable memory comprises a first software program for obtaining user information. A second software program provides a first content value of a first document at a first processing device address, responsive to a comparison of the user information with the content of the first document. A third software program obtains a first frequency of alterations to the content of the first document.
According to another aspect of the present invention, the article of manufacture further comprises a fourth software program for ranking the first processing device address on a list based on a comparison of the first content value and a second content value of a second document having a second processing device address.
According to still another aspect of the present invention, the first document is stored on a first computer connected to a network and the second document is stored on a second computer connected to the network.
According to still another aspect of the present invention, a method for obtaining information from the World Wide Web is provided. The method comprises the steps of selecting a user and obtaining a query from the user. A content score is then calculated for a document having an address on the World Wide Web. A frequency score for the document is then calculated. The associated address is then stored in a list of addresses based on the content score and frequency score. A subset of the list is selected and the document having the first address on the list is crawled.
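The method steps above can be sketched in Python as follows. The function names, the callables passed in for scoring and crawling, the example addresses, and the use of a summed score to order the list are assumptions made for illustration; the text only states that the list is based on the content score and the frequency score.

```python
from typing import Callable, Iterable, List


def build_address_list(
    addresses: Iterable[str],
    content_score: Callable[[str], float],
    frequency_score: Callable[[str], float],
) -> List[str]:
    """Store addresses in a list ordered by content score plus frequency score."""
    scored = [(content_score(a) + frequency_score(a), a) for a in addresses]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [address for _, address in scored]


def crawl_subset(
    ranked: List[str], subset_size: int, crawl: Callable[[str], None]
) -> List[str]:
    """Select the top of the list and crawl it, starting with the first address."""
    subset = ranked[:subset_size]
    for address in subset:
        crawl(address)
    return subset


# Hypothetical scores for three document addresses.
content = {"a.example/doc": 0.2, "b.example/doc": 0.9, "c.example/doc": 0.5}
frequency = {"a.example/doc": 0.1, "b.example/doc": 0.3, "c.example/doc": 0.9}

crawled: List[str] = []  # stands in for a real crawler fetching each document
ranked = build_address_list(content, content.__getitem__, frequency.__getitem__)
selected = crawl_subset(ranked, 2, crawled.append)
```

Passing the scoring and crawling steps in as callables keeps the sketch independent of any particular scoring formula or crawler.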
According to another aspect of the present invention, the method further comprises the step of notifying the user that the content of the document has changed.
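A minimal Python sketch of change notification follows. It assumes that a change is detected by comparing a SHA-256 digest of the crawled content with the digest stored from the previous crawl, which is one possible technique and is not stated in the text. The function names, the `notify` callable, and the example address are hypothetical.

```python
import hashlib
from typing import Callable, Dict


def digest(content: bytes) -> str:
    """SHA-256 hex digest of the document content."""
    return hashlib.sha256(content).hexdigest()


def check_for_change(
    address: str,
    content: bytes,
    known_digests: Dict[str, str],
    notify: Callable[[str], None],
) -> bool:
    """Record the new digest and notify if the content differs from the last crawl."""
    new_digest = digest(content)
    old_digest = known_digests.get(address)
    known_digests[address] = new_digest
    # Assumption: the first crawl of an address is not reported as a change.
    changed = old_digest is not None and old_digest != new_digest
    if changed:
        notify(address)
    return changed


# Hypothetical crawl results for one address over three visits.
known: Dict[str, str] = {}
notified = []
first = check_for_change("a.example/doc", b"version 1", known, notified.append)
second = check_for_change("a.example/doc", b"version 1", known, notified.append)
third = check_for_change("a.example/doc", b"version 2", known, notified.append)
```

Storing only a digest avoids keeping a full copy of each document when the goal is just to decide whether to notify the user.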