The present invention relates to the field of data processing, and particularly to a software system and associated method adapted for use within a search engine system, to rank search results based on document quality. This invention pertains, in particular, to a computer software product and algorithm for retrieving and ranking XML documents and their associated document schemas based on the link relationships among them.
The World Wide Web (WWW) comprises an expansive network of interconnected computers upon which businesses, governments, groups, and individuals throughout the world maintain inter-linked computer files known as web pages. Users navigate these pages by means of computer software programs commonly known as Internet browsers. Due to the vast number of WWW sites, many web pages contain redundant information or closely resemble one another in function or title. The vastness of the WWW causes users to rely primarily on Internet search engines to retrieve information or to locate businesses. These search engines use various means to determine the relevance of the retrieved information to a user-defined search.
A typical search engine has an interface with a search window where the user enters an alphanumeric search expression or keywords. The search engine sifts through its index of web pages to locate the pages that match the user's search terms. The search engine then returns the search results in the form of HTML pages. Each set of search results includes a list of individual entries that have been identified by the search engine as satisfying the user's search expression. Each entry or "hit" includes a hyperlink that points to a Uniform Resource Locator (URL) location or web page.
A search of web pages using keywords, in most cases, returns an overabundance of search results. For example, a search for "Harvard" might result in an excessive number of web pages. Search engines face the challenge of ranking these results according to the most definitive pages for the search query. Text-based ranking alone will often miss some pages that are relevant to the search. Of the pages that contain "Harvard," for example, the web site www.harvard.edu may not be the one that uses the term "Harvard" most often, most prominently, or in any other way that would favor it under a purely text-based ranking function even when it is the most definitive result for a topic-based search query.
One approach to addressing this ranking problem is to exploit the information embedded in the hyperlink structure of WWW pages. Hyperlinks encode a considerable amount of human judgment used by various techniques to determine the authority or quality of a page in a specific context. Exemplary techniques that use algorithms to exploit the hyperlink structure within HTML pages for this purpose are the HITS and CLEVER methods. These algorithms have been implemented in search environments in order to determine the relevance of HTML pages to user-defined search criteria.
The HITS method introduces the notions of "authoritative" and "hub" resources. An authoritative resource (or authority page) is one that contains definitive information about a topic. In other words, in the context of search results, it is a high-quality page. A hub resource (or hub page) is one that contains a large number of hyperlinks that point to authoritative pages. The HITS algorithm is applied to a set of pages returned by a text-based search (a seed set). The goal is to determine the most authoritative pages and best hub pages in the set. To accomplish this goal, the HITS algorithm makes use of the structure of the in-links (the links into a web page) and the out-links (the links out of a web page) of each of the pages within the set. To begin, it counts each page's in-links and out-links. In the first iteration, the initial 'hub' score of a page is the number of pages to which that page links, and the initial 'authority' score of the page is the number of pages pointing to it. The 'hub' score for the next iteration is the sum of the 'authority' scores of the out-linked pages, and the 'authority' score is the sum of the 'hub' scores of the in-linked pages. The iterations continue until the 'authority' and 'hub' scores converge satisfactorily. The pages with the highest 'hub' and 'authority' scores are identified as the results of the search.
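The first two rounds of scoring described above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function names, the `links` mapping, and the four-page example graph are hypothetical, and the sketch assumes that both scores in an iteration are computed from the scores of the previous iteration.

```python
def initial_scores(links):
    """links maps each page to the set of pages it links out to (assumed input shape)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    # Initial hub score: number of out-links of the page.
    hub = {p: len(links.get(p, ())) for p in pages}
    # Initial authority score: number of pages pointing to the page.
    auth = {p: 0 for p in pages}
    for targets in links.values():
        for t in targets:
            auth[t] += 1
    return hub, auth


def one_iteration(links, hub, auth):
    """Next hub = sum of authority scores of out-linked pages;
    next authority = sum of hub scores of in-linked pages.
    Assumption: both use the scores from the previous iteration."""
    new_hub = {p: sum(auth[t] for t in links.get(p, ())) for p in hub}
    new_auth = {p: 0 for p in auth}
    for src, targets in links.items():
        for t in targets:
            new_auth[t] += hub[src]
    return new_hub, new_auth


# Hypothetical seed set: a, b, and d all point to c; a also points to d.
links = {"a": {"c", "d"}, "b": {"c"}, "c": set(), "d": {"c"}}
hub, auth = initial_scores(links)
next_hub, next_auth = one_iteration(links, hub, auth)
```

In this example, page c collects the most in-links and therefore the highest authority score, while page a, which points to both c and d, obtains the highest hub score.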
HITS is the definitive algorithm used to find authoritative resources in a hyperlinked environment. The CLEVER method extends the HITS method by taking advantage of the text surrounding hyperlinks. It uses the annotations provided by this text to weight each link and further classify the search results.
A significant portion of WWW documents today is authored in HTML, which is a mark-up language that describes how to display page information through a web browser and how to link documents to each other. HTML is an application of SGML (Standard Generalized Markup Language) and is defined by a single document schema or Document Type Definition (DTD). The document schema puts forth a set of grammatical rules that define the allowed syntactical structure of an HTML document. The schema, or structure of HTML pages, is consistent from page to page. Both the HITS and CLEVER algorithms apply to HTML pages and do not necessarily address documents containing a number of different schemas.
Currently, however, Extensible Markup Language (XML) is gaining popularity. XML, which is a subset of SGML, provides a framework for WWW authors to define schemas for customized mark-up languages to suit their specific needs. For example, a shoe manufacturer might create a "shoe" schema to define an XML language to be used to describe shoes. The schema might define mark-up tags that include "color", "size", "price", "material", etc. Hence, XML documents written in this shoe language will embed semantic, as well as structural, information in the document. For example, a shoe XML document uses the mark-up tag "color" to indicate that the shoe is "blue".
One advantage of XML is that it allows the efficient interchange of data from one business to another (or within the business itself). A business may send XML data that conforms to a predefined schema to another business. If the second business is aware of the first business's schema, it may use a computer program to efficiently process the data. To enable this efficient data interchange and processing, XML requires that standard, high-quality schemas be developed and that XML documents conform to them.
As noted, the XML framework allows for the definition of document schemas, which give the grammars of particular sets of XML documents (e.g. shoe schema for shoe-type XML documents, resume schema for resume-type XML documents, etc.). The XML framework also puts forth a set of structural rules that all XML documents must follow (e.g. open and close tags, etc.). Moreover, it is possible for an XML document to have no associated schema. If a document has an associated schema, the schema must be specified within the document itself or linked to by the document.
Information about the quality of an XML document may be inferred from its conformance with the rules put forth by this XML framework. An XML document is said to be "valid" if it has an associated schema and conforms to the rules of the schema. An XML document is said to be "well-formed" if it follows the general structural rules for all XML documents. Ultimately, a high-quality document has a higher probability of being both "valid" and "well-formed" than a low-quality document.
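As a minimal illustration of the well-formedness test, the following Python sketch relies on the standard-library parser `xml.etree.ElementTree`, which rejects input that violates the general structural rules of XML. The helper name `is_well_formed` and the two sample documents are hypothetical. Checking validity against a schema requires a validating parser, which this standard-library module does not provide, so validity is not checked here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, i.e. follows the general
    structural rules (matching open and close tags, a single root, etc.)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical shoe documents: the second closes <color> with </size>.
good = "<shoe><color>blue</color><size>9</size></shoe>"
bad = "<shoe><color>blue</size></shoe>"

good_ok = is_well_formed(good)
bad_ok = is_well_formed(bad)
```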
In addition, like HTML documents, XML documents form a hyperlinked environment in which each XML document that has an associated schema provides a link to the schema (if the schema is not defined within the document itself). Moreover, each XML document, using various mark-up structures, such as XLink or XPointer, may link to other XML structures and XML documents. Unlike the HTML environment, however, the schemas of the hyperlinked documents may vary from document to document. A document that satisfies one particular schema can point to a document that satisfies a different schema. Further, two documents with different schemas can point to a document with a third schema. The quality of each schema may vary significantly.
To take full advantage of XML for efficient data interchange requires the use of standard, well-defined document schemas and XML documents that properly conform to them. The number of XML schemas and documents on the WWW today is rapidly increasing. The increasing diversity of document schemas adds a new dimension to the analysis of hyperlinked documents on the WWW.
The HITS and CLEVER algorithms make use of hyperlinked structures to rank documents that share the same schema. Exemplary documents with hyperlinked structures are HTML documents. XML has given rise to a new hyperlink environment that includes documents with different schemas. In this environment, it will become increasingly important to identify high-quality schemas and documents that correctly use them. Hence, this new environment presents several previously unaddressed issues: ranking documents based on the quality of their associated schema, determining the quality of the schemas themselves, and ranking documents based on their structural properties (e.g. validity, well-formedness, etc.). The WWW today calls for a system that identifies authoritative XML documents while taking these factors into account. This need, which arises from the new dimension added by XML, has heretofore remained unsatisfied.
The present system and method for identifying authoritative XML schemas and documents includes a ranking manager that satisfies this need. The present invention describes a retrieval system using a ranking manager and a method that extend the HITS algorithm by introducing the notions of document schemas and structural conformance to the algorithm. It rates the authority of XML documents and the authority of their associated document schemas based on an enhancement of the iterative algorithm originated by HITS.
Similar to the HITS and CLEVER algorithms, the present invention provides an algorithm which is applied to an initial set of documents. For example, a search for a topic (e.g. shoe) on the web might produce a large number of responses, or "hits", of XML documents. In one embodiment, the initial seed set includes these search results, all the XML documents linking into and out of these hits, and all schemas used by the XML documents in this expanded set. One goal of such an embodiment is to order these documents and their schemas by their authority (i.e., quality or reliability of search results) and hub scores.
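The construction of the seed set in this embodiment might be sketched as follows. The function `build_seed_set`, the `links` and `schema_of` mappings, and the document and schema names are all hypothetical and serve only to illustrate the expansion from the hits to their linked documents and schemas.

```python
def build_seed_set(hits, links, schema_of):
    """hits: documents returned by the search.
    links: document -> set of documents it links out to.
    schema_of: document -> schema identifier (absent or None if no schema)."""
    hits = set(hits)
    docs = set(hits)
    for src, targets in links.items():
        targets = set(targets)
        if src in hits:
            docs |= targets      # documents that a hit links out to
        if hits & targets:
            docs.add(src)        # documents that link into a hit
    # Schemas used by any document in the expanded set.
    schemas = {schema_of[d] for d in docs if schema_of.get(d) is not None}
    return docs, schemas


# Hypothetical example: one hit, one out-link, one in-link, one unrelated link.
links = {
    "shoe1.xml": {"shoe2.xml"},
    "catalog.xml": {"shoe1.xml"},
    "other.xml": {"unrelated.xml"},
}
schema_of = {
    "shoe1.xml": "shoe.dtd",
    "shoe2.xml": "shoe.dtd",
    "catalog.xml": "catalog.dtd",
}
docs, schemas = build_seed_set({"shoe1.xml"}, links, schema_of)
```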
As used herein, a document that is a good authority is a definitive representative of the search topic. A document that is a good hub links to a large number of documents that are good authorities. In general, good authorities are linked to from a large number of good hubs. Good authorities have a high authority score. Conversely, good hubs contain a large number of links to good authority documents. A good hub document has a high hub score. In addition, the present invention introduces the notion of an authoritative schema. A schema is said to be authoritative if it is used by a large number of authoritative pages and by a large number of good hub pages. An authoritative schema has a high authority score.
The ranking manager of the present invention uses a ranking software program, algorithm, or module that is based on a base iterative algorithm, such as the HITS and/or CLEVER algorithms. The base algorithm maintains a hub score, h(d), and an authority score, a(d), for each document, d. It initializes these scores to some constant and recomputes them through a sequence of iterations. In a first step, the ranking algorithm recomputes the hub score, h(d), of each document by replacing it with the sum of the authority scores of the documents to which it points. Next, the ranking algorithm replaces the authority score, a(d), of each document with the sum of the hub scores of the documents that point to it. The base algorithm repeats these steps until the change in the hub scores, and/or the change in the authority scores, from one iteration to the next converges to a predetermined value such as 0.
The above described base algorithm has been used to find authoritative resources in hyperlinked environments, specifically HTML documents. XML introduces new dimensions to this analysis of authoritative resources. The present invention enhances the notion of authoritative resources based on the following observations. First, the quality of schemas provides a new source of information about the quality of the hyperlinked documents that use them. For example, the conformance of a document to a high-quality industry-standard schema confers on it a degree of authority. Second, the correctness of document structure provides information about its reliability. A hyperlinked document's structure, such as its validity and/or well-formedness, adds weight to its authority. Third, the characteristics of a set of XML documents that share a common schema provide information about the quality of the shared schema. For example, if all documents that use a schema X are malformed and not valid, schema X will lose credibility as a good schema. Moreover, if only a handful of documents about shoes use schema X, while the rest of the documents about shoes use schema Y, it may be inferred that schema Y is a better schema than schema X. In this case, schema Y has more authority than schema X.
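A minimal Python sketch of the base iterative algorithm follows. The paragraph above does not mention normalization; the sketch assumes that each score vector is scaled to unit length after every update, as is common for HITS, because otherwise the raw sums grow without bound. The names `base_hits`, `normalize`, `tol`, and `max_iter` are hypothetical.

```python
import math


def normalize(scores):
    """Scale a score dictionary to unit Euclidean length (assumed step)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: (v / norm if norm else 0.0) for k, v in scores.items()}


def base_hits(links, tol=1e-9, max_iter=1000):
    """links maps each document to the set of documents it points to."""
    docs = set(links) | {t for ts in links.values() for t in ts}
    in_links = {d: set() for d in docs}
    for src, targets in links.items():
        for t in targets:
            in_links[t].add(src)
    # Initialize h(d) and a(d) to a constant.
    hub = {d: 1.0 for d in docs}
    auth = {d: 1.0 for d in docs}
    for _ in range(max_iter):
        # First step: h(d) = sum of a(t) over documents t that d points to.
        new_hub = normalize(
            {d: sum(auth[t] for t in links.get(d, ())) for d in docs})
        # Next step: a(d) = sum of h(s) over documents s that point to d.
        new_auth = normalize(
            {d: sum(new_hub[s] for s in in_links[d]) for d in docs})
        # Largest change in any score since the previous iteration.
        change = max(
            max(abs(new_hub[d] - hub[d]) for d in docs),
            max(abs(new_auth[d] - auth[d]) for d in docs),
        )
        hub, auth = new_hub, new_auth
        if change < tol:
            break
    return hub, auth


# Hypothetical graph: a, b, and d point to c; a also points to d.
hub, auth = base_hits({"a": {"c", "d"}, "b": {"c"}, "d": {"c"}})
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In this example the document with the most in-links from good hubs ends with the highest authority score, and the document that points to the most authoritative documents ends with the highest hub score.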
Hence, the iterative process of the present invention enhances the base algorithm to account for the attributes of document structure, which provide additional information about the document quality, and to account for the authority scores of the schemas that the documents use. To this end, the ranking manager adds weights to the hub and authority scores of each XML document based on document validity and well-formedness, and further introduces the notion of authoritative document schemas. Consequently, the ranking manager introduces a third step to the iterative process that computes and incorporates the authority score of these document schemas.
In order to implement this third step, the ranking manager, in addition to maintaining the hub and authority scores for each XML document, also maintains, for each schema, s, an authority score, a(s). The ranking manager begins by initializing the hub score, h(d), and authority score, a(d), of each XML document to a weighted value based on the document's validity and well-formedness. It initializes the authority score, a(s), of each schema, s, to 0 or some constant.
Each iteration includes the following three steps: (1) The ranking manager recomputes the authority score, a(s), of each schema used by the XML documents in the pool by adjusting it with a normalized sum of the hub scores of the documents that use the schema and the authority scores of the documents that use the schema; (2) it recomputes the hub score, h(d), for each XML document, d, in the pool by adjusting it with a normalized sum of the authority scores of the documents that it points to and the authority score of the schema that it uses; and (3) it recomputes the authority score, a(d), of each XML document in the pool by adjusting it with a normalized sum of the hub scores of the XML documents that point to it and the authority score of the schema that it uses. The ranking manager repeats these three steps until the difference between the scores computed from one iteration to the next falls below a predetermined threshold value or converges to 0.
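The initialization described above might look like the following sketch. The text gives no numeric weights, so the values `base`, `well_formed_bonus`, and `valid_bonus`, as well as the function names and sample documents, are purely hypothetical placeholders.

```python
def structural_weight(is_valid, is_well_formed,
                      base=1.0, well_formed_bonus=0.5, valid_bonus=1.0):
    """Hypothetical weighting: a base score plus a bonus for each
    structural property the document satisfies."""
    weight = base
    if is_well_formed:
        weight += well_formed_bonus
    if is_valid:
        weight += valid_bonus
    return weight


def initialize(doc_flags, schemas, schema_init=0.0):
    """doc_flags: document -> (is_valid, is_well_formed).
    Returns initial h(d), a(d), and a(s)."""
    h = {d: structural_weight(valid, well_formed)
         for d, (valid, well_formed) in doc_flags.items()}
    a = dict(h)                               # a(d) starts at the same weighted value
    a_s = {s: schema_init for s in schemas}   # a(s) starts at 0 or some constant
    return h, a, a_s


h, a, a_s = initialize(
    {"good.xml": (True, True), "loose.xml": (False, True), "broken.xml": (False, False)},
    {"shoe.dtd"},
)
```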
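The three steps can be sketched in Python as shown below. The exact form of the "adjustment" is not specified above, so this sketch makes two labeled assumptions: each score is replaced by the corresponding sum, and each score vector is then scaled to unit Euclidean length. The names `rank`, `normalize`, `init`, `tol`, and `max_iter` are hypothetical, and every document appearing in `links` is assumed to have an entry in `init`.

```python
import math


def normalize(scores):
    """Scale a score dictionary to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: (v / norm if norm else 0.0) for k, v in scores.items()}


def rank(links, schema_of, init, tol=1e-9, max_iter=1000):
    """links: document -> set of documents it points to.
    schema_of: document -> schema id (absent or None if no schema).
    init: document -> initial structure-weighted value for h(d) and a(d)."""
    docs = list(init)
    in_links = {d: set() for d in docs}
    for src, targets in links.items():
        for t in targets:
            in_links[t].add(src)
    schemas = {s for s in schema_of.values() if s is not None}
    h, a = dict(init), dict(init)
    a_s = {s: 0.0 for s in schemas}
    for _ in range(max_iter):
        # Step 1: a(s) from the hub and authority scores of documents using s.
        new_s = normalize({
            s: sum(h[d] + a[d] for d in docs if schema_of.get(d) == s)
            for s in schemas})
        # Step 2: h(d) from authorities of out-linked documents plus a(schema of d).
        new_h = normalize({
            d: sum(a[t] for t in links.get(d, ()))
               + new_s.get(schema_of.get(d), 0.0)
            for d in docs})
        # Step 3: a(d) from hubs of in-linking documents plus a(schema of d).
        new_a = normalize({
            d: sum(new_h[p] for p in in_links[d])
               + new_s.get(schema_of.get(d), 0.0)
            for d in docs})
        change = max(
            max((abs(new_s[s] - a_s[s]) for s in schemas), default=0.0),
            max(abs(new_h[d] - h[d]) for d in docs),
            max(abs(new_a[d] - a[d]) for d in docs),
        )
        a_s, h, a = new_s, new_h, new_a
        if change < tol:
            break
    return h, a, a_s


# Hypothetical pool: three documents use schema Y, one uses schema X,
# and d1, d3, and d4 all point to d2.
links = {"d1": {"d2"}, "d3": {"d2"}, "d4": {"d2"}}
schema_of = {"d1": "Y", "d2": "Y", "d3": "Y", "d4": "X"}
init = {d: 1.0 for d in ("d1", "d2", "d3", "d4")}
h, a, a_s = rank(links, schema_of, init)
```

In this example schema Y, which is used by more documents, ends with a higher authority score than schema X, and d2 ends as the most authoritative document. Under pure replacement, the initial structure-based weights mainly influence the early iterations; how an implementation would keep them in effect across iterations is not specified above.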
Ultimately, the ranking manager ranks the documents and their schemas according to these computed values. Thus, the authoritative XML document and schema identifying system (or ranking manager) of the present invention provides several features and advantages, among which are the following:
The ability to identify documents with high authority and documents that are good hubs (and therefore of high quality in the search results) using the in-links and out-links of these documents, the structural attributes of these documents, and the authority of the schemas of these documents.
The ability to identify schemas of high authority using the authority and hub scores of the XML documents that use them.
The incorporation of document validity and well-formedness into the ranking of documents.
The incorporation of schema quality into the ranking of documents.
The promotion of the use of highly rated standard schemas to write higher quality XML documents and promote schema standardization.
The promotion of the use of documents that use highly rated schemas and discouragement of the use of documents with poorly written schemas.
The promotion of the use of documents that are well structured and valid and discouragement of the use of documents that are poorly structured, invalid, or without a definite schema.