1. Field of the Invention
The invention pertains generally to the field of automated searching for electronic information content in computer networks, and more particularly, to efficient indexing and searching of information in a peer-to-peer computer network topology.
2. Background of the Invention
As computer networks have become nearly ubiquitous in all computer environments, the amount of information available for access and use by computer users has correspondingly exploded. Yet with billions of pages of information available to users of networks such as the Internet, without the ability to efficiently locate information, the available information is all but useless. Thus, automated search resources such as the Google and Yahoo internet search engines were developed to assist users in locating relevant information from a vast number of possible storage locations in a reasonable amount of time. Conventional search engines usually reduce search time by pre-indexing documents that are accessible to the network and applying user search criteria against the index to obtain search hits.
Typical document indexing systems have term occurrence data arranged in an inverted content index partitioned by document. The data is distributed over multiple computer systems that are dedicated to index storage with each computer system handling a subset of the total set of documents that are indexed. This allows for a word search query to be presented to a number of computer systems at once with each computer system processing the query with respect to the documents that are handled by the computer system.
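The document-partitioned arrangement described above can be illustrated with a minimal sketch. The function names, the whitespace tokenizer, and the four-document corpus are hypothetical and chosen only for illustration; each partition stands in for one dedicated index computer system.

```python
from collections import defaultdict


def build_partition(docs):
    """Build an inverted index (term -> set of document ids) for one partition."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: naive lowercase whitespace tokenization.
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(partitions, term):
    """Present the query to every partition and merge the per-partition hits."""
    hits = set()
    for index in partitions:
        # .get() avoids inserting the term into the defaultdict on a miss.
        hits |= index.get(term.lower(), set())
    return sorted(hits)


# Hypothetical corpus split by document across two index systems.
partition_a = build_partition({1: "acid rain harms flower beds", 2: "flower shop"})
partition_b = build_partition({3: "citric acid", 4: "garden tools"})

print(search([partition_a, partition_b], "acid"))  # [1, 3]
```

Each partition answers only for the documents it handles, so the query can be processed by all partitions at once and the partial results merged.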
An inverted word location index partitioned by document is generally more efficient than an index partitioned by word. This is because partitioning by word becomes expensive when it is necessary to rank hits over multiple words. Large amounts of information are exchanged between computer systems for words with many occurrences. Therefore, typical document index systems are partitioned by document and queries on the indexed documents are processed against the contents of the indexes until a sufficient results set is obtained. While the number of documents indexed in search engines is growing, in many cases the results for most queries come from a small portion of the entire set of documents. Therefore it may be inefficient to search indexes that contain documents that are less likely to return results in response to a query.
Peer-to-peer network topology (P2P) is well known in the computer world, and may be implemented in a hard-wired configuration, or, as more popularly implemented, in a virtual manner by overlaying a peer network configuration over a physical or native network topology. In a peer-to-peer network, each computer (also called a “peer” or “node”) in the network has the same or similar responsibilities as each of the others, i.e., it is a “peer” rather than merely a client or server, and is physically or virtually connected to all other nodes in the network (see FIG. 1A). In P2P networks, all clients provide resources, which may include bandwidth, storage space, and computing power. Such networks are dynamically scalable; as nodes arrive and demand on the system increases, the total capacity of the system also increases. Many variations of P2P networks have been created, and popular examples include Napster, Kazaa, and Gnutella. Such P2P networks were often first used to disseminate large amounts of multimedia data such as movies or music over the Internet.
The distributed computing power and storage aspects of P2P networks provide great advantages in marshalling the resources of multiple computers for storage and processing. However, in such networks, the computing elements (nodes) are not always in close geographic proximity and they are not always connected by high-bandwidth connections. Further, the storage capacity of the nodes in the network varies dramatically, and in some instances may be severely limited.
P2P web search engines, through the P2P network interface, utilize the resources of each of the network nodes, and may make efficient use of nodes at times such as when computers in the network are idle. In one search configuration, each computer/node in the P2P network contains a part of a search index, rather than the index being held centrally as is more often the case in centralized search engine implementations. As the computers in a P2P network implementation are often a conglomeration of different users' computers, the computers may vary greatly in performance, bandwidth, and available memory to conduct searches and/or host an index.
Peer-to-peer search engines are typically implemented with a structured or unstructured network approach. In unstructured peer-to-peer networks, any peer can store any content. No specific responsibility for content is assigned to any peer; therefore, at search time, all peers need to be queried for content. If the search is limited to a certain number of peers in an unstructured peer search approach, a high probability exists that the results will be incomplete.
In structured peer-to-peer networks, each computer is responsible only for a specific fraction of the content. Therefore at search time it is possible to limit search activities only to those peers that store content related to the query. One example of a structured peer-to-peer network based on distributed hash tables is shown in FIG. 1B.
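As a rough illustration of how a structured network can limit a search to the responsible peers, the sketch below maps each keyword to one peer by hashing. It is a simplification under stated assumptions: the function names and peer labels are hypothetical, and plain modulo hashing is used in place of the consistent-hashing schemes that practical distributed hash tables typically employ.

```python
import hashlib


def responsible_peer(keyword, peers):
    """Map a keyword to exactly one peer by hashing it into the peer list."""
    digest = hashlib.sha1(keyword.lower().encode("utf-8")).digest()
    # Assumption: simple modulo placement, not a real DHT routing scheme.
    return peers[int.from_bytes(digest, "big") % len(peers)]


def peers_for_query(keywords, peers):
    """Return only those peers that store content related to the query."""
    return {responsible_peer(keyword, peers) for keyword in keywords}


# Hypothetical peer identifiers.
peers = ["peer-1", "peer-2", "peer-3", "peer-4"]
contacted = peers_for_query(["acid", "flower"], peers)
print(contacted)  # at most two of the four peers are contacted
```

Because the mapping is deterministic, any peer can compute which peers hold the posting lists for a query without contacting the rest of the network.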
Search engines should strive to efficiently handle multiword search queries, as few searches conducted by users include only single keywords. Most peer-to-peer search engines which are capable of multi-keyword searches (e.g., Boolean queries) operate by intersecting posting lists of the single keywords as shown in FIG. 2. One may appreciate that posting lists include all addresses or pages which contain a specific keyword, and as such, may become extremely large in size. Therefore, the intersection analysis may require extreme memory, processing, and bandwidth resources to accomplish in a timely fashion. In FIG. 3, a slight improvement is shown where one large posting list is transferred for the keyword “Acid” from Peer 2 to a network node with the longest posting list (the list for “Flower” stored at Peer 1) where the intersection analysis occurs. The results are then transferred from Peer 1 to Peer 3 where a user may review them.
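The intersection step, and the idea of moving the shorter list to the node that holds the longest one, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, posting lists are assumed to be ascending lists of integer document identifiers, and the number of postings shipped is used as a stand-in for network cost.

```python
def intersect_sorted(a, b):
    """Intersect two ascending posting lists with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def boolean_and(posting_lists):
    """Ship every list except the longest to the peer holding the longest,
    intersect there, and report how many postings crossed the network."""
    if not posting_lists:
        return [], 0
    ordered = sorted(posting_lists.values(), key=len)
    shipped = sum(len(postings) for postings in ordered[:-1])
    result = ordered[0]
    for postings in ordered[1:]:
        result = intersect_sorted(result, postings)
    return result, shipped


# Hypothetical posting lists: "acid" at one peer, "flower" at another.
lists = {"acid": [2, 5, 9], "flower": [1, 2, 3, 5, 8, 13]}
print(boolean_and(lists))  # ([2, 5], 3)
```

Even in this arrangement, every posting of every list except the longest must be transferred, which is the cost the passage above identifies.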
Both methods shown in FIG. 2 and FIG. 3 are inefficient for distributed search, as transferring huge posting lists requires excessive time and bandwidth. To guarantee complete results for two search terms, each with 1 billion results, stored at two separate peers, the transfer of several gigabytes would be required. Even when compressed by compression utilities by a factor of 10, the data transfer for a single search is still infeasible. Therefore, the existing approaches are limited to either slow search engines or incomplete results, even when inverted indexes have been utilized to obtain some level of efficiency. Therefore, a need exists for a space-efficient distributed index searching system that supports timely and complete search results in a P2P implementation.
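The transfer estimate given above can be checked with a short back-of-the-envelope calculation. The figures of 4 bytes per posting and a 10 Mbit/s link between peers are assumptions made here for illustration only; the 1 billion postings and the compression factor of 10 are the values stated above.

```python
def transfer_bytes(postings, bytes_per_posting=4, compression_factor=1):
    """Bytes needed to ship one posting list (assumed 4 bytes per posting)."""
    return postings * bytes_per_posting // compression_factor


def transfer_seconds(num_bytes, bits_per_second):
    """Time to move num_bytes over a link of the given speed."""
    return num_bytes * 8 / bits_per_second


POSTINGS = 1_000_000_000      # one list of 1 billion results
LINK_BPS = 10_000_000         # assumption: 10 Mbit/s between peers

raw = transfer_bytes(POSTINGS)                                # 4_000_000_000 bytes (4 GB)
compressed = transfer_bytes(POSTINGS, compression_factor=10)  # 400_000_000 bytes (400 MB)

print(raw, compressed, transfer_seconds(compressed, LINK_BPS))
# 4000000000 400000000 320.0
```

Under these assumptions, shipping even one compressed list takes on the order of minutes, which is far too slow for an interactive search.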