1. Field of the Invention
The present invention relates generally to a system and method of classifying and retrieving the attributes of Web pages. More particularly, the present invention relates to pre-fetching a HyperText Transfer Protocol ("HTTP") header of a Web page and scanning for attributes embedded in the HTTP header.
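By way of illustration only, scanning an HTTP header for embedded attributes can be sketched as follows. This is a minimal sketch, not the claimed implementation. The `X-Page-` header prefix, the function names, and the sample header are hypothetical and chosen for the example, since this section does not specify how attributes are encoded in the header.

```python
from typing import Dict

# Hypothetical convention: attributes are carried in header fields whose
# names start with this prefix. The source text does not define a format.
ATTRIBUTE_PREFIX = "x-page-"


def parse_http_header(raw: str) -> Dict[str, str]:
    """Parse a raw HTTP response header block into a name -> value mapping."""
    fields: Dict[str, str] = {}
    lines = raw.split("\r\n")
    for line in lines[1:]:  # skip the status line, e.g. "HTTP/1.1 200 OK"
        if not line:        # a blank line ends the header block
            break
        name, sep, value = line.partition(":")
        if sep:             # ignore malformed lines that have no colon
            fields[name.strip().lower()] = value.strip()
    return fields


def scan_attributes(raw: str) -> Dict[str, str]:
    """Return only the header fields that carry page attributes."""
    fields = parse_http_header(raw)
    return {
        name[len(ATTRIBUTE_PREFIX):]: value
        for name, value in fields.items()
        if name.startswith(ATTRIBUTE_PREFIX)
    }


# Hypothetical header, as it might be received without the page body.
SAMPLE_HEADER = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "X-Page-Category: education\r\n"
    "X-Page-Topic: scholarships\r\n"
    "\r\n"
)

if __name__ == "__main__":
    print(scan_attributes(SAMPLE_HEADER))
```

In practice, the header alone could be obtained with an HTTP HEAD request, which returns the header fields without the body of the Web page.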
2. Description of Related Art
Today, a user searching for Web pages on a particular topic inputs keywords into a search engine (such as ALTA VISTA™ or YAHOO™), which searches for Web pages that contain the keywords. The search engine "crawls" through every link and Web page that it can find and retrieves data that matches the keywords. The search engine then classifies and organizes the Web pages according to the raw number of times that the keywords appear in each Web page. That is, the search engine creates a massive database that keeps track of the number of times that each keyword occurs in each Web page.
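The raw-count ordering described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of any particular search engine. The function names, the tokenization rule, and the sample pages and URLs are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def keyword_counts(text: str, keywords: List[str]) -> Dict[str, int]:
    """Count how many times each keyword occurs as a whole word in the text."""
    words = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return {keyword: words[keyword.lower()] for keyword in keywords}


def rank_pages(pages: Dict[str, str], keywords: List[str]) -> List[Tuple[str, int]]:
    """Order pages by the raw total number of keyword occurrences, highest first."""
    scores = {
        url: sum(keyword_counts(text, keywords).values())
        for url, text in pages.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical page contents keyed by hypothetical URLs.
PAGES = {
    "http://example.com/aid": "Scholarship listings: apply for a scholarship or a grant.",
    "http://example.com/resume": "Resume. Honors: merit scholarship, 1996.",
}

if __name__ == "__main__":
    for url, score in rank_pages(PAGES, ["scholarship"]):
        print(url, score)
```

Because the score depends only on how often a keyword appears, any page that contains the word is returned, whether or not the page is actually about the topic.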
Over the last few years, it has become apparent that the current method of searching for Web pages has certain disadvantages. One disadvantage is the difficulty of narrowing the retrieved data down to what is meaningful to the user. For example, a user searching for "scholarships" (e.g., educational scholarships) may instead find a Web page containing a personal resume that merely includes the word "scholarship". In addition, an increasing number of Web pages are rigged with hidden text. Thus, a Web page may appear on the surface to be meaningful to the user when in reality it is not pertinent. This undermines the organization of the Internet, especially as an increasing number of people sell the content of Web pages for money.
Another disadvantage associated with the current method of searching for keywords in Web pages is the amount of time that it takes to search every Web page. A Web page may also take a long time to download because it typically contains, among other things, textual content, embedded graphics, and tables. This is especially annoying when the downloaded Web page turns out not to pertain to the user's topic of interest.
Therefore, there is a need for a better way to classify and index the contents of a Web page, such that the classification more accurately reflects those contents and can be retrieved much faster than with the methods used today.