The invention relates to communications in general. More particularly, the invention relates to a method and apparatus to retrieve information from a network such as the Internet.
The existing amount of information available over the Internet and World Wide Web (WWW) is staggering. There are literally millions of "web pages" full of information on almost any topic of interest. Moreover, this amount of information is increasing at a geometric rate. This sheer volume of information has made the search for specific types of information a significant challenge. The complexity of this challenge may be better understood with some background information regarding the Internet and WWW in general.
The Internet comprises a network of computers interconnected by some form of communication medium. The type of computer could range from handheld computers and pocket PCs to high-end mainframes and supercomputers. The communication media may include twisted pair, coaxial cable, optical fiber and radio frequencies. Each computer is equipped with software and hardware that enable it to communicate using the same procedures or language. These procedures and language are often referred to as protocols, which are often layered over one another to form something called a "protocol stack." One such protocol is referred to as the Hypertext Transfer Protocol (HTTP) and it permits the transfer of Hypertext Markup Language (HTML) documents between computers. The HTML documents are often referred to as "web pages" and are files containing information in the form of text, video, images, links to other web pages, and so forth. Each web page is stored in a computer (sometimes referred to as an "Internet Server") and has a unique address referred to as a Uniform Resource Locator (URL). The URL is used by a program referred to as a "web browser" located on one computer to find a web page stored somewhere on another computer connected to the network. This creates a "web" of computers each storing a number of web pages that can be accessed and transferred using a standard protocol, and hence this web of computers is referred to as the WWW.
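As an illustration of how a URL identifies both the computer storing a web page and the page itself, the following minimal sketch splits a URL into its parts using Python's standard library. The address shown and the helper name describe_url are hypothetical and are not taken from the specification.

```python
from urllib.parse import urlsplit

def describe_url(url):
    """Split a URL into the protocol, the server name, and the page path."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,     # protocol used for the transfer, e.g. HTTP
        "host": parts.hostname,     # the computer ("Internet Server") storing the page
        "path": parts.path or "/",  # location of the web page on that server
    }

# Hypothetical address used only for illustration.
info = describe_url("http://www.example.com/cars/index.html")
print(info)
```

A web browser performs a comparable split before contacting the named server and requesting the page at the given path.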
A complete field of technology has arisen that focuses upon making it easier for a user to find information available over the Internet. There are a large number of "search engines" that permit the user to enter key words or phrases. The search engine then searches the Internet to find web pages that contain the key terms. The results are then presented to the user in some sort of ranked fashion. Given the sheer volume of information available over the Internet and WWW, however, search time can be extremely long. This is particularly problematic in an age when users are demanding faster performance in information retrieval tools. Moreover, the search results may often have little relevance to the user's initial request.
In order to accelerate the search process, some search engines build internal databases using a search program referred to as a "web crawler." The idea is that by building an internal database, much of the search work can be done prior to a user's request for information, thereby decreasing search times. A web crawler performs as its name suggests. The program periodically "crawls" or searches the Internet and attempts to catalog or index the information available in certain web pages. The index is stored in a database that is accessible to the search engine. In this manner, when a user enters a search term, the internal database is searched first in a relatively fast and efficient manner.
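The benefit of indexing ahead of time can be sketched with a simple inverted index, which maps each term to the set of pages containing it. This is only one possible way to organize such an internal database, and the specification does not say that conventional search engines use this exact structure. The page addresses, the page text, and the function names build_index and search are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Done ahead of time by the crawler: map each term to the pages containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index

def search(index, query):
    """Done at request time: return the pages that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Hypothetical crawled pages, used only for illustration.
PAGES = {
    "http://example.com/a": "Sedan reviews and prices",
    "http://example.com/b": "Truck reviews",
}
index = build_index(PAGES)
print(search(index, "sedan reviews"))
```

Because the expensive pass over the page text happens in build_index, a later query only has to intersect a few precomputed sets.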
A problem with conventional web crawlers, however, is that they are designed to collect a limited set of information about the web page. Each web page typically has a list of terms provided by the web page designer that attempts to identify the content found within the web page. The web crawler retrieves this list of terms and stores the terms in a database. This list of terms, however, is typically limited to what the web designer deems significant. Consequently, it may not be accurate or comprehensive. Moreover, in many instances, this list may contain terms that are misleading. For example, a web page having information about a particular brand of car may include in its list of terms the names of several competitors. When the user inputs a competitor's name in a search engine, the unintended web page would be retrieved as part of the search results.
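The misleading-term problem can be illustrated with a short sketch. It assumes the designer-supplied list of terms is carried in an HTML meta tag named "keywords", which is a common convention but is not stated in the specification. The sample page, the brand names, and the class name KeywordParser are hypothetical.

```python
from html.parser import HTMLParser

class KeywordParser(HTMLParser):
    """Collect designer-supplied keywords and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            # Keywords are assumed to be comma-separated.
            self.keywords.extend(
                k.strip().lower() for k in content.split(",") if k.strip()
            )

    def handle_data(self, data):
        self.text.append(data.lower())

def unsupported_keywords(html):
    """Return keywords that never appear in the text of the page itself."""
    parser = KeywordParser()
    parser.feed(html)
    body = " ".join(parser.text)
    return [k for k in parser.keywords if k not in body]

# Hypothetical page about one brand that lists two competitors as keywords.
PAGE = (
    '<html><head><meta name="keywords" content="BrandA, BrandB, BrandC"></head>'
    "<body>All about BrandA sedans.</body></html>"
)
misleading = unsupported_keywords(PAGE)
print(misleading)
```

A crawler that stores only the keyword list would index this page under all three brand names, even though the page text mentions only one of them.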
Another problem with conventional web crawlers is that they are designed to locate general information. They simply search for web pages in a random manner and index those web pages that fall within the initial search parameters. These conventional web crawlers, however, are not optimized to locate a specific set or domain of information. Accordingly, the conventional web crawler is not efficient or effective when attempting to catalog or index specialized information.
In view of the foregoing, it can be appreciated that a substantial need exists for a web crawler that solves the above-discussed problems.
One embodiment of the invention comprises a method and apparatus to index network information. A network is searched for files of information relevant to people and resources in a particular field using a search list of weighted links to the files. The information is parsed into content and additional links to additional files. The content is weighted and copied to memory (such as a database). A determination is made as to whether the additional links are relevant to the people and resources in the particular field. Those additional links that are relevant are weighted using a predetermined weighting algorithm. The relevant additional weighted links are copied to the search list. This process continues until an ending condition occurs.
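The loop described in this embodiment might be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the weighting function (the fraction of field terms present), the relevance threshold, the field vocabulary, and the ending condition (an empty search list or a file limit) are all hypothetical choices, and fetching and parsing are replaced by a lookup in a small in-memory table so that the example runs without a network.

```python
import heapq

# Hypothetical vocabulary standing in for "people and resources in a particular field".
FIELD_TERMS = {"cardiology", "physician", "clinic"}

def weight(text):
    """Assumed weighting algorithm: fraction of field terms found in the text."""
    words = set(text.lower().split())
    return len(words & FIELD_TERMS) / len(FIELD_TERMS)

def crawl(fetch, seeds, max_files=10, min_weight=0.3):
    """Follow the highest-weighted links first until an ending condition occurs."""
    # Search list of weighted links; weights are negated because heapq is a min-heap.
    search_list = [(-w, url) for url, w in seeds]
    heapq.heapify(search_list)
    seen = {url for url, _ in seeds}
    database = {}
    while search_list and len(database) < max_files:  # assumed ending condition
        _, url = heapq.heappop(search_list)
        page = fetch(url)
        if page is None:
            continue
        content, links = page                           # already parsed in this sketch
        database[url] = (weight(content), content)      # weight and store the content
        for link_url, anchor_text in links:
            if link_url in seen:
                continue
            seen.add(link_url)
            w = weight(anchor_text)
            if w >= min_weight:                         # keep only relevant links
                heapq.heappush(search_list, (-w, link_url))
    return database

# Hypothetical in-memory "network": url -> (content, [(link url, link text), ...]).
WEB = {
    "http://example.org/start": (
        "cardiology clinic directory",
        [("http://example.org/dr-a", "physician cardiology profile"),
         ("http://example.org/recipes", "soup recipes")],
    ),
    "http://example.org/dr-a": ("physician at a cardiology clinic", []),
    "http://example.org/recipes": ("soup recipes", []),
}
db = crawl(WEB.get, [("http://example.org/start", 1.0)])
print(sorted(db))
```

In this run the link to the recipes page receives a weight of zero and is never added to the search list, so only the two field-relevant files are stored.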
With these and other advantages and features of the invention that will become hereinafter apparent, the nature of the invention may be more clearly understood by reference to the following detailed description of the invention, the appended claims and to the several drawings attached herein.