1. Field of the Invention
This invention relates generally to a method of partitioning and reorganizing physical domains into logical domains, and more particularly to a method of utilizing logical domains for the construction of multi-granular and topic-focused site maps of part of a physical domain or a search space.
2. Description of the Related Art
Use of the internet, and in particular the World Wide Web (WWW or Web), has recently been increasing at a rapid rate. The explosive popularity of the WWW has been accompanied by a tremendous growth in the size of the Web and the scope of its content. Due to the ever-increasing size and complexity of the search space, many queries of the vast WWW, for example, yield such a large number of matched Web pages that the results returned by the search engine are not useful. Even when specific information related to a definite topic is sought, Web users often encounter difficulties in foraging for relevant pages. Many of these difficulties are rooted in the structure of the search space and can be attributed to the deficiencies inherent in the way conventional search techniques and result organization schemes operate.
Generally, the Web includes many Hyper-Text Markup Language (or HTML) documents, or pages, and each page is assigned a unique Uniform Resource Locator (URL) for identification and location purposes. The URLs are organized into physical domains; each physical domain is defined as a set of pages associated with a single host, and each page located within a particular physical domain contains the host name in its URL. For example, the URLs www.ccrl.com and www.ccrl.com/dl99ws/ identify individual pages which are both hosted by a Web server (or series of Web servers) having a unique host name (i.e., www.ccrl.com). The pages identified by these two URLs are, therefore, each in the same physical domain. Since a particular URL represents a unique identifier for every HTML page, the URL is the preferred means utilized by conventional search engines and query processing methods for organizing Web query results in a physical domain.
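As an illustration only, grouping pages by physical domain in the sense described above can be sketched in a few lines of Python. The function names `physical_domain` and `group_by_physical_domain` are hypothetical and are not part of any conventional search engine; the sketch assumes that the host name portion of the URL is the sole grouping key.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def physical_domain(url: str) -> str:
    """Return the host name that identifies the page's physical domain."""
    # URLs written without a scheme (as in the text) need a leading "//"
    # so that the host is parsed as the network location, not as a path.
    if "://" not in url:
        url = "//" + url
    return urlsplit(url).hostname or ""


def group_by_physical_domain(urls):
    """Cluster URLs by host name, as grouping by physical domain does."""
    groups = defaultdict(list)
    for url in urls:
        groups[physical_domain(url)].append(url)
    return dict(groups)


clusters = group_by_physical_domain(["www.ccrl.com", "www.ccrl.com/dl99ws/"])
```

Both example URLs fall into the single cluster keyed by www.ccrl.com, since the grouping looks only at the host name and ignores page content.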
With many conventional Web search engines, for example, query results or information reported in response to a request are grouped exclusively by physical domain and are presented in the form of a set of clusters of URLs within a particular physical domain. This organizational strategy is advantageous to the extent that the clusters can potentially provide a user with a visualization of the topology of the search space, i.e., how the pages are linked together. A user may thus first locate the most relevant site and browse through matched pages within that site. Organizing query results exclusively by physical domain has two significant limitations, however, especially when a physical domain contains large Web sites.
First, large Web sites tend to contain many matching pages arranged in only a few large, flat-structured, and unorganized clusters. This phenomenon is attributable to the fact that many pages, by virtue of their presence in the same physical domain, have related or similar URLs. For example, many large Web sites, such as Geocities, AOL, and NEC BIGLOBE, are either Internet Service Provider (ISP) sites or Web site hosting providers; consequently, these sites represent enormous physical domains. Typically, if one page contains pertinent information, many pages with similar URLs will also be returned as a match in response to a request for information. Many of the matching pages are only relevant to the extent that their URLs contain a given string of characters, namely, the DNS name; the actual contents of the pages may be totally irrelevant. The inclusion of irrelevant material in the search results shifts the burden of distinguishing between relevant pages and irrelevant pages from the search engine to the user.
Second, even assuming that all of the information returned by the search engine were relevant, grouping results by physical domain does not provide a well-organized and convenient way for users to locate the most relevant pages in Web sites. For example, given a query containing the keyword “XML,” many portal sites specializing in XML material, such as www.xml.org and www.w3c.org, tend to offer a large number of matches which, when displayed in the form of a query result, are not categorized or otherwise summarized by a typical search engine. A method of presenting a user with a hierarchical display representing how the hundreds of pages are related in addition to their URL similarities will usually be of greater utility to a user than merely displaying, in list form, hundreds of pages within a given physical domain without any indication as to the way the pages are related.
In addition, while Web site maps can play an essential role in assisting users in navigating a Web site, many site maps can also prove to be inefficient or wholly ineffective with respect to the goal of assisting navigation. Ideally, such maps should provide users with a view of both the contents (i.e., pages) as well as the link structure (i.e., topology) of the Web sites they represent. Since, as noted above, the state of the art involves organizing query results and requests for information mainly according to physical domain, the typical site map necessarily reflects only the content and structure of a physical domain. To the extent that the organization of a physical domain is deficient or renders navigation tedious, the site map representing that physical domain can offer little or no assistance to a user interested in finding information regarding a particular topic of interest. A method of organizing results responsive to keyword queries or other requests for information into convenient and usable form should be, therefore, adapted for the construction of site maps having greater utility to users.
Many Web-masters presume that users of their sites have different hardware capabilities, network bandwidths, and preferences for interacting with the site. To support a more user-friendly and pleasant Web surfing experience, many Web sites support several variations with respect to the way information is presented to a user, such as, for example, text modes, graphics modes which may or may not support frames, Java™ scripts, and so forth. Although users with different hardware or bandwidth capabilities are supported, the fact that different users may be browsing for different topics of interest is usually overlooked. For instance, most site maps are static, i.e., predetermined and unalterable. The static nature of the typical site map is most evident at big portal sites that present vast amounts of information covering many diverse subjects. Such a static approach to site mapping is deficient to the extent that it assumes that a single map is suitable for all users who may visit the site.
For example, different users generally have different preferences with respect to visiting a Web site, and different users may visit the same site for different purposes. On any given Web site, for instance, one user may be hunting for particular information, while another user may simply be surfing the Web for enjoyment without any well-defined target in mind. Obviously, the relative expectations of these two users with respect to site map complexity are necessarily different. The former may want to see a detailed map which aids in speedy navigation to a specific directory containing the specific information sought, while the latter may prefer a more abstract map which merely offers a general overview of the contents of the Web site. It is desirable, therefore, to construct site maps which support multiple levels of granularity. A multi-granular site map enables a user selectively to examine different portions of the site in varying degrees of detail, from the very general to the very specific.
As another example, different users generally have different topics of interest in mind when conducting a search. In the case of an on-line “super-store” having many different types of items for sale, one user searching for “hardware tools” and another user interested in “beauty products,” for example, may issue keyword queries related to their respective topics on the same site. Each respective user would prefer to be presented with a site map which is focused primarily on his or her respective topic of interest, with little emphasis on the rest of the site. The site map, therefore, should be flexible so as to adjust for various users desiring information on different topics of interest. For each user, the area of the site map related to preferred interests should be emphasized in detail, while the rest of the map may display only cursory or general information; that is, just enough information to illustrate the Web site's topology.
Based upon the foregoing observations, some requirements for a convenient and user-friendly site map can be summarized as follows. A site map should be: capable of summarizing, in general form, the contents of the site searched; capable of preserving and displaying the topology of the site, thereby supporting navigation from page to page throughout the site; flexible and adjustable, or multi-granular, such that both the overview of the contents as well as the detailed particulars of sections of the total contents may be selectively presented; and content-sensitive (or “topic-focused”) so as to support multiple users having different interests.
There has been a continuing and growing need for a method of partitioning and reorganizing search spaces, such as large XML databases or physical domains on the Web, according to a system of logical domains, wherein a logical domain is defined as a group of related pages which collectively represent a particular theme, function, concept, or topic of interest. Such a method preferably enables content-sensitive site mapping of the search space wherein multiple levels of granularity are supported.
Directed to partitioning and reorganizing physical domains, the method of the present invention addresses the above-mentioned considerations and overcomes these and other shortcomings of conventional searching and reporting techniques through the identification of logical domains responsive to a request for information. Additionally, the method of the present invention satisfies the requirements for a convenient reporting and displaying technique by implementing logical domains in the construction of multi-granular and topic-focused site maps representing certain areas of a search space or a physical domain. In particular, the site maps enabled by the method of the present invention are dynamically related to any or all of the following: the focus of a keyword query; the key topic of interest in another form of request for information; or organization specification.
The operative environment for the method of the present invention is a search space comprising a set of documents (hypertext or hypermedia documents, for example, such as HTML or XML documents), or pages. Each page in the search space is located in a particular physical domain as a means of organization. The present invention introduces the concept of a logical domain into this environment. Whereas physical domains on the Web, for instance, are defined based upon DNS names as represented by character strings in URLs, logical domains are defined based upon the whole spectrum of Web page metadata (including URLs, titles, and anchors) as well as actual page contents, link structure, and citation relationships. A logical domain is a set of pages which are related by semantic and syntactic structure, and which, collectively, represent a logical unit of information which pertains to a particular theme, function, concept, or topic of interest. For example, such Web sites as an individual user's “home page,” a research group's “project” page, and “a tutorial on XML” can all be viewed as logical domains, since each represents a group of pages which collectively relate to a particular theme or function.
Responsive to a request for information, such as an issued query, for example, the preferred embodiments of the present invention begin with identification of logical domain entry page candidates, which are scored according to various attributes such as page metadata, subject matter relevance of page content, and citation information associated with each of the candidates. An entry page, which defines the top of a directory tree for a logical domain, is selected from the highest scoring candidates. Thereafter, the pages within the boundaries of the logical domain are determined by assigning pages according to page metadata, accessibility from the entry page as measured by path information and link structure, or some combination of these factors. This procedure is repeated until a desired number of logical domains are defined. Preferably, a recursive procedure assures appropriately sized and sufficiently relevant logical domains.
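By way of a non-limiting sketch, the scoring of entry page candidates and the assignment of pages to a logical domain might look as follows in Python. Everything specific here is an assumption made for illustration: the `Page` fields, the particular attributes scored, the weights, and the rule that the boundary is the entry page's directory are not taken from the method itself, which may combine metadata, link structure, and citation information differently.

```python
from dataclasses import dataclass
import posixpath


@dataclass
class Page:
    path: str         # URL path within one physical domain (hypothetical field)
    title: str        # page title metadata
    relevance: float  # assumed 0..1 relevance of page content to the request
    citations: int    # assumed count of external pages citing this page


def entry_score(page: Page) -> float:
    """Score a candidate entry page; all weights are illustrative assumptions."""
    score = 0.0
    if posixpath.basename(page.path) in ("", "index.html"):
        score += 1.0                  # URL metadata: looks like a directory index
    if "home" in page.title.lower():
        score += 0.5                  # title metadata
    score += page.relevance           # subject matter relevance
    score += 0.1 * page.citations     # citation information
    return score


def directory_of(path: str) -> str:
    """Return the directory part of a URL path, always ending in '/'."""
    if path.endswith("/"):
        return path
    return posixpath.dirname(path).rstrip("/") + "/"


def build_logical_domain(pages):
    """Pick the best-scoring entry page and assign pages under its directory."""
    entry = max(pages, key=entry_score)
    root = directory_of(entry.path)
    members = [p for p in pages if p.path.startswith(root)]
    return entry, members


pages = [
    Page("/xml/index.html", "XML Tutorial Home", 0.9, 5),
    Page("/xml/ch1.html", "Chapter 1", 0.8, 1),
    Page("/news/today.html", "News", 0.1, 0),
]
entry, members = build_logical_domain(pages)
```

In this example the index page under /xml/ is selected as the entry page, and the two pages beneath that directory form the logical domain. Repeating the procedure on the remaining pages would yield further logical domains.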
According to one preferred embodiment of the present invention described herein, for instance, the size of the logical domains may be selectively adjustable so as to provide adequate information without providing an overwhelming number of pages. The entry pages and boundaries of logical domains are dynamically adjusted in a recursive procedure, beginning with the pages located at the bottom of the physical domain and working up to the top. Each successive iteration is designed to eliminate logical domains which are so small as to be unlikely to provide adequate coverage of a particular topic of interest. The size of logical domains can be influenced through adjustment of various parameters in a preferred embodiment of the algorithm of the present invention.
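The bottom-up adjustment described above can be illustrated with a simplified sketch. It assumes, purely for illustration, that each candidate logical domain is identified by a directory path, that its size is a page count, and that a single hypothetical parameter `min_size` decides when a domain is too small and must be absorbed by its parent; the actual embodiment may use other parameters and criteria.

```python
import posixpath


def parent_dir(d: str) -> str:
    """Return the parent directory: '/a/b/' -> '/a/', '/a/' -> '/'."""
    head = posixpath.dirname(d.rstrip("/"))
    return head if head.endswith("/") else head + "/"


def merge_small_domains(page_counts, min_size):
    """Fold undersized domains into their parents, deepest directories first."""
    sizes = dict(page_counts)
    # Make sure every ancestor directory exists so merges have a target.
    for d in list(page_counts):
        while d != "/":
            d = parent_dir(d)
            sizes.setdefault(d, 0)
    # Work from the bottom of the directory tree up to the top.
    for d in sorted(sizes, key=lambda name: name.count("/"), reverse=True):
        if d != "/" and sizes[d] < min_size:
            sizes[parent_dir(d)] += sizes.pop(d)
    # The root is never merged away; drop only directories left empty.
    return {d: n for d, n in sizes.items() if n > 0}


counts = {
    "/": 1,
    "/xml/": 2,
    "/xml/tutorial/": 12,
    "/xml/tutorial/img/": 3,
    "/misc/old/": 1,
}
domains = merge_small_domains(counts, min_size=5)
```

With `min_size=5`, the image directory is absorbed into the tutorial domain, which survives with 15 pages, while the remaining small directories collapse into the root. Raising or lowering `min_size` is one way the size of the resulting logical domains could be tuned.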
In addition, according to another preferred embodiment described herein, a method is presented for constructing multi-granular and topic-focused site maps. The site maps are preferably constructed utilizing such information as directory structures derived from URLs, page contents, and link structure. In these site maps, Web site topology is preserved and displayed, and document importance, as measured by semantic relevance and external citation, is used for selecting pages which are to be displayed as well as prioritizing the presentation of pages and directories.
Briefly, the technique of site map construction includes the following steps: identifying logical domains within a physical domain or Web site; determining page importance based on citation analysis and adjusting page importance based upon page contents; adjusting the contents and entry pages of all the logical domains based on links, directory paths, and page importance; and selecting the entry pages of those logical domains having higher importance scores for presentation in the site map.
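The last two steps, scoring importance and selecting entry pages for display, can be sketched as follows. The formula in `page_importance`, the `boost` parameter, and the cutoff `k` are assumptions introduced only for this example; the sketch also assumes that logical domains have already been identified and that each is represented by its entry page path together with (citation count, relevance) pairs for its pages.

```python
def page_importance(citations: int, relevance: float, boost: float = 1.0) -> float:
    """Citation-based importance, adjusted upward by content relevance."""
    return citations * (1.0 + boost * relevance)


def select_entry_pages(domains, k):
    """Return the entry pages of the k most important logical domains.

    domains maps an entry page path to a list of (citations, relevance)
    pairs, one pair per page in that logical domain.
    """
    scored = {
        entry: sum(page_importance(c, r) for c, r in pages)
        for entry, pages in domains.items()
    }
    # Highest importance first; ties are broken by path for a stable order.
    ranked = sorted(scored, key=lambda e: (-scored[e], e))
    return ranked[:k]


site = {
    "/xml/": [(10, 0.9), (2, 0.5)],
    "/news/": [(3, 0.0)],
    "/shop/": [(8, 0.1)],
}
overview = select_entry_pages(site, k=2)
```

Here a small `k` gives a coarse overview of the site and a larger `k` gives a more detailed map, which is one simple way to obtain multiple levels of granularity. Because relevance feeds into importance, the same site yields different maps for different topics of interest.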