1. Field of the Invention
Embodiments of the invention generally relate to the field of data processing. More specifically, the invention relates to data processing to determine searchable content of network resources.
2. Description of the Related Art
Computer networks were developed to allow multiple computers to communicate with each other. In general, a network can include a combination of hardware and software that cooperate to facilitate the desired communications. One example of a computer network is the Internet, a sophisticated worldwide network of computer system resources.
The growing size of networks, particularly the Internet, makes it difficult to locate relevant information in an expedient fashion. As a result, search engines were developed to locate information on the network based on a query input by a user. Search engines comprise a search tool referred to as a spider, a crawler, or a robot, which builds indexes containing traversed network resources (e.g., addresses, Uniform Resource Locators (URLs), websites, etc.) according to well-known protocols and algorithms.
A user-input query in the form of search words, phrases, keywords, network addresses, etc., prompts the search engine to sift through the plurality of network resources (typically on the order of millions) in the index to find matches to the user query. Search engines typically reside on a server accessible via the Internet to multiple users. Search queries are sent from the users to the search engine via a network connection. The search engine then parses the query and executes a search algorithm to identify any network resources containing information matching the query. Having identified results matching the user's query, the search engine returns and displays the results to the user for review and selection.
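The query-matching step described above can be sketched as a lookup against an index that maps keywords to associated network resources. The following is a minimal illustration only; the index contents, URLs, and function names are assumptions for demonstration and do not reflect any particular search engine's implementation.

```python
# Hypothetical inverted index mapping keywords to associated URLs.
# Real search engine indexes are far larger and more sophisticated.
index = {
    "network": ["http://example.com/networks", "http://example.org/lan"],
    "search": ["http://example.com/search-tips"],
}

def execute_search(query):
    """Parse the query into terms and return URLs whose keywords match any term."""
    results = []
    for term in query.lower().split():
        for url in index.get(term, []):
            if url not in results:  # avoid returning duplicate URLs
                results.append(url)
    return results

print(execute_search("network search"))
```

A query such as "network search" would return all three indexed URLs here, since each contains at least one matching keyword.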
One problem with conventional search engines is the number of URLs returned to the user that are not relevant to the user query. To understand how this happens, one must first understand how search engines match a user query with a URL. One method of matching queries to URLs is to associate a keyword or keywords with the indexed URLs. If a term in the user query matches a keyword associated with a URL, then the URL is returned to the user.
One method to determine a keyword to index with a particular URL or website may be to analyze the frequency of occurrence of a word on the website. If a word appears a number of times on a website, such that the frequency of the word's appearance surpasses a predefined threshold, then the word may be deemed a keyword for the URL. Another method of determining a keyword for a particular website is to examine links from other sites to the particular website. If a particular word is used within a link to the URL of the particular website, then that word may be deemed a keyword for the URL of the particular website.
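The first method above, frequency-based keyword extraction, can be sketched as a word count against a predefined threshold. The threshold value and helper names below are illustrative assumptions, not parameters of any specific search engine.

```python
import re

FREQUENCY_THRESHOLD = 3  # assumed cutoff; real engines tune this value

def extract_keywords(page_text, threshold=FREQUENCY_THRESHOLD):
    """Count word occurrences and keep words at or above the threshold."""
    counts = {}
    for word in re.findall(r"[a-z]+", page_text.lower()):
        counts[word] = counts.get(word, 0) + 1
    return {word for word, n in counts.items() if n >= threshold}

text = "the network links each network node to another network"
print(extract_keywords(text))
```

Because "network" appears three times in the sample text, it surpasses the assumed threshold and is deemed a keyword; every other word appears only once and is ignored.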
The problem with these methodologies is that website designers have found ways to mislead search engines, and consequently place their websites high in the result list of a user query, even though the true content of their websites may have no relation to the user query. This practice is commonly referred to as “spamming.” One example of spamming is keyword spamming. For instance, a web designer may place a high number of words commonly chosen for user queries, but not representative of the content of the website, within the text of the website (or within HTML structures known as meta tags). This is done with the intention that the search engine crawler will associate those common query words with the website as keywords. Because the search engine crawler has now associated the query words with the website, the website is more likely to be returned in response to a query using those keywords. Due to this and other spamming techniques, search engines return less accurate results in response to user queries.
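Keyword spamming defeats a frequency-based heuristic precisely because repetition is all the heuristic measures. The sketch below, a hypothetical illustration with an assumed threshold and invented page text, shows how padding a page with repeated popular query terms causes those terms, and only those terms, to be deemed keywords:

```python
import re

def extract_keywords(page_text, threshold=3):
    """Assumed frequency heuristic: any word repeated enough becomes a keyword."""
    counts = {}
    for word in re.findall(r"[a-z]+", page_text.lower()):
        counts[word] = counts.get(word, 0) + 1
    return {w for w, n in counts.items() if n >= threshold}

# A page about gardening, padded with popular query terms the page
# does not actually cover (e.g., via hidden text or meta tags).
honest_page = "tips for watering your garden and pruning roses"
spam_page = honest_page + " cheap cheap cheap flights flights flights"

print(extract_keywords(honest_page))  # no word repeats enough to qualify
print(extract_keywords(spam_page))
```

The spammed page is now associated with "cheap" and "flights", so a query for cheap flights would return the gardening page, while none of its genuine content surpasses the threshold at all.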
Therefore, a need exists for a method and apparatus to determine searchable criteria of network resources based on a commonality of content.