1. Field of the Invention
The embodiments of the invention generally relate to web page searching and indexing and, more particularly, to a streamlined system and process to facilitate efficient web page searching.
2. Description of the Related Art
Generally, a large number of searches on web search engines refer to entities such as people, organizations, and places. However, many different people, organizations, and places referred to on web pages may share the same name while being, in fact, distinct entities. For example, given the search query “Charles Smith”, a typical search engine may retrieve references to 12-15 different people named “Charles Smith” in the first 20 results alone. This problem (hereinafter the “identified problem”) is not limited to the names of people.
For example, the search query “Asha” retrieves pages relating to an educational charity called Asha, the American Speech and Hearing Association, the singer Asha Bhosle, the American Saddle Horse Association, the American Social Health Association, etc. In any one search, a user is likely searching for information about only one of these entities. To filter for pages referring to only one particular use of Asha (hereinafter, the term Asha will be used in the generic sense; i.e., to refer to the entity denoted by the query, and the term ‘Asha’ will refer to the word or phrase), a user may be forced to augment the query with additional terms that are likely to occur on pages referring to the particular use of Asha for which the user is searching. For example, if a user wants to search for information regarding the singer “Asha Bhosle” but does not know the singer's last name, the user may augment the query with additional terms such as “singer” or “music” or “musician”, etc.
Sometimes, one particular entity, which is not the one the user is looking for, dominates the search results. For example, the search “Michael Jordan” mostly retrieves pages about the famous basketball player. This is a problem if the user happens to be searching for information about someone else named Michael Jordan, for example, an individual named “Michael Jordan” who may be a high school teacher in Akron, Ohio. Again, the user is generally forced to contort the query in an attempt to eliminate the unwanted pages. This process not only places an additional burden on the user, but also often results in valid pages being left out of the results.
Taking the example of the search query for Asha, the problem of disambiguating different denotations of ‘Asha’ can be seen as a special case of the conventional word sense disambiguation (WSD) problem, which has been previously studied. However, there are some major differences between WSD and the problem identified above, which make the traditional approaches to WSD inappropriate for the identified problem. WSD has generally dealt with the problem of identifying the word sense of a particular use of a word such as “bank”, which might refer to either a financial bank or a river bank. Typically, the problem is that of distinguishing between the two to four possible alternative meanings of a particular word, all of which are a priori known. This is done by using linguistic properties of the word, domain knowledge, or by looking for commonly co-occurring words. Further, from a linguistic and common sense domain knowledge perspective, all of the denotations are equally plausible.
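By way of illustration only, the co-occurring-word approach to WSD described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular conventional system: the function name `disambiguate`, the table `SENSE_CUES`, and the cue words themselves are hypothetical and chosen solely for the “bank” example.

```python
# Hypothetical cue words for two a priori known senses of "bank".
# These lists are illustrative assumptions, not taken from any real lexicon.
SENSE_CUES = {
    "financial": {"money", "loan", "deposit", "account", "interest"},
    "river": {"water", "shore", "fishing", "stream", "mud"},
}


def disambiguate(context, sense_cues):
    """Return the sense whose cue words overlap most with the context words."""
    words = set(context.lower().split())
    # Score each known sense by how many of its cue words occur in the context.
    scores = {sense: len(words & cues) for sense, cues in sense_cues.items()}
    # Ties resolve to the sense listed first in sense_cues.
    return max(scores, key=scores.get)


print(disambiguate("she opened an account at the bank to deposit money", SENSE_CUES))
print(disambiguate("they went fishing from the bank of the stream", SENSE_CUES))
```

The sketch depends on the set of senses, and a cue list for each, being fixed in advance, which is the assumption noted above for conventional WSD.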
Some conventional approaches look at the problem of the semantically same record (i.e., set of n-tuples), with erroneous syntactic variations (such as an address being written differently) in some of the fields, appearing as different records in the same database (such as the census database). The goal is to correctly link these duplicate records. In this approach, it is determined which field values are actually the same. However, the identified problem is different from this record linkage problem in two important ways. First, in the present context, even if everything the two pages say about Asha is the same, it might not follow that the Ashas denoted by the two pages are, in fact, the same. For example, two pages might simply say that the person is called Asha and is a resident of the United States, from which one cannot conclude that they are the same. Second, different pages are likely to have very different kinds of information. One might identify the person based on his/her organizational affiliation and another based on the books he/she has written and, as such, it would be advantageous to still be able to co-identify them, if indeed they are the same.
Conventionally, a number of popular search engines provide a feature for retrieving similar or related pages. These features are aimed at retrieving pages that are overall similar to the page under consideration. Consequently, most of the pages they retrieve might not even refer to the original search query. For example, according to one of the most popular search engines, one of the top search results for the query “Barbara Johnson” is the web page for the Barbara Johnson who previously ran for governor of Massachusetts. Over half of the retrieved pages that are similar to this do not even contain the term “Barbara Johnson”. This is to be expected since the similarity is defined just as a function of the page, and not of the user's original query.
This problem is closely related to the much-studied information retrieval problem of relevance feedback, which typically involves finding documents similar to a given document. As it relates to the identified problem, a precise definition is given of the sense in which two documents are to be considered similar; i.e., they refer to the same Asha. With this definition, one can measure the performance of different methodologies.
However, the conventional approaches have not generally worked well for web page searching and retrieval. Therefore, while the conventional approaches were sufficient for their intended purposes, there remains a need for a novel entity disambiguation technique capable of being used in web page searching and retrieval.