The present invention relates generally to methods and systems for use in providing services in a distributed information system, and more particularly to methods for generating hierarchical forward knowledge in a distributed directory service.
A directory service component of a distributed computing environment is intended to make it easier to find information. It provides the information necessary to access a person or resource. For example, a directory service permits administrators, users and applications to find information about people, printers, files and other shared resources in a single place even if the resources reside on many different physical servers. The directory may be viewed as a database of information about objects, including people, organizations and processes.
In massively distributed information systems, it can be extremely expensive and in some cases impossible to centrally index all the information from the large number of information sources available. One of two strategies is typically employed to conduct a search that covers the contents of large numbers of information sources. These strategies are xe2x80x98query floodxe2x80x99 and xe2x80x98query routing.xe2x80x99
Query flood requires the client to talk to each source in some subset of the system, whether or not that source has any relevant data. The subset of the system searched is selected by picking a location in the global hierarchical namespace and searching the sources whose names are xe2x80x98childrenxe2x80x99 of the location picked. The query flood strategy can be prohibitively expensive.
Query routing, on the other hand, uses hierarchies of indexes to allow queries to be directed to only those sources having information that may fulfill the query. For most types of information, this means that the client has to contact far fewer sources than before.
The present invention provides improvements to query routing methods and systems, such as the Common Indexing Protocol (CIP) used in the WHOIS++ directory service. For background on this work, see Deutsch, et al., xe2x80x9cArchitecture of the WHOIS++ Service,xe2x80x9d August 1995; Faltstrom, et al., xe2x80x9cHow to Interact With a WHOIS++ Meshxe2x80x9d, February 1996; Weider, et al., xe2x80x9cArchitecture of the WHOIS++ Service,xe2x80x9d February 1996; and Weider, et al., xe2x80x9cHierarchical Extensions to the Common Indexing Protocol,xe2x80x9d November 1996. Although the background of the present invention is explained below in connection with the WHOIS++ directory service, it should be noted that the invention is by no means limited thereto.
The WHOIS++ Directory Service
With the vast amount of directory information potentially available, e.g., on the Internet, it is not feasible to provide a centralized directory to serve all this information. Instead s of being centralized, the WHOIS++ directory service is xe2x80x98distributedxe2x80x99 in that it is based on a hierarchy of directory information collection agents. In this architecture, a directory query is delivered to an agent in the tree, and then handed up or down, as appropriate, so that the query is delivered to the agent that holds the information that fills the query.
One of the primary assumptions made by recent implementations of distributed directory services is that every entry in the directory resides in some location in a hierarchical name space. While this arrangement is ideal for reading the entry once its location is known, it is not as good when searching for the location of those entries that meet some set of criteria. If the only criteria known about a desired entry are items that do not appear in the namespace, a global query is required. Whenever a global query (at the root of the namespace) or a query at the top of a subtree in the namespace is issued, that query is replicated to all subtrees of the starting point. The replication of the query to all subtrees is not necessarily a problem, since the queries may be inexpensive. However, every server to whom the query has been replicated must process that query, even if it has no entries matching the specified criteria. This part of global query processing is quite expensive.
The WHOIS++ directory service addresses these problems by a combination of two techniques: directory meshes and forward knowledge. Although every entry in WHOIS++ does have a unique identifier (i.e., it resides in a specific location in the namespace), the navigational algorithms to reach a specific entry do not necessarily depend on the identifier assigned to the entry. The WHOIS++ service gets around the namespace and hierarchy problems by creating a directory mesh on top of the entries. Each layer of the mesh has a set of xe2x80x98forward knowledgexe2x80x99 that indicates the contents of the various servers at the next lower layer of the mesh. Thus, when a server in a particular layer of the mesh receives a query, it can prune the search tree and hand the query off to only those lower level servers that have indicated they might be able to answer it. Searching thus becomes feasible at all levels of the mesh.
Most directory services use a basic xe2x80x98template-basedxe2x80x99 information model, in which each entry consists of a set of attribute-value pairs. To participate in the index service, the underlying database should be able to generate a xe2x80x98centroidxe2x80x99 or some other type of forward knowledge for the data it serves. The so-called centroid of a server may include a list of the templates and attributes used by that server, and a word list for each attribute. The word list for a given attribute could contain, e.g., one occurrence of every word appearing at least once in that attribute in some record in that server""s data.
An index server collects and collates the centroids of either a number of base data servers or a number of other index servers. It is able to generate a centroid for the information it contains. In addition, when it receives a query, the index server searches its collections of centroids, determines which servers hold records that may fill that query, and then notifies the user""s client of the identities of the servers to which the query should be submitted next. An index server can also contain primary data and thus act as both an index server and a base level server. In this case, the index server""s response to a query may be a mix of records and referral pointers.
Shortcomings of the Prior Art
Both the query flood and query routing search strategies, in their present form, have shortcomings in both time and resource efficiencies. For example, the protocols used to create and query the hierarchies of indexes implicitly assume that the query issued to each index (and to the data sources) is the same in each case. The present disclosure addresses this problem by disclosing ways to permit index servers to xe2x80x98rewritexe2x80x99 the query to be sent to the next source down in the hierarchy. In contrast to the present invention, previous referral-based systems simply told the client which server to contact next when resolving a query, but not how to rewrite the query.
Moreover, the present invention is directed to methods and systems for generating the forward knowledge, or indexes, employed in a distributed directory service. As indicated above, the concept of using forward knowledge to efficiently route queries in a distributed directory service was disclosed in the work documenting the WHOIS++ directory service. However, particular methods and systems for generating such forward knowledge were not developed in the prior work.
Accordingly, the present invention is directed to an automated method and system for generating forward knowledge in the form of an index for use in a directory service. The inventive system operates on a set of records conforming to a particular schema. Specifically the input data is arranged hierarchically, such as from left-to-right. In addition, the data may be separated by some type of delimiter such as a period (xe2x80x9c.xe2x80x9d), a space (xe2x80x9c xe2x80x9d) or at sign (xe2x80x9c@xe2x80x9d). The inventive system parses the records, compares the parsed records and removes duplicates. As a result, only an index of the records remains. The index is then provided to index servers higher up in the hierarchy for use in query routing.
Other features of the invention are described below.