1. Technical Field
The present invention relates generally to computer networking, and particularly to peer-to-peer networks.
2. Description of Related Art
The increasing need to share computer resources and information, the decreasing cost of powerful workstations, the widespread use of networks, and the maturity of software technologies have increased the demand for more efficient information retrieval mechanisms.
“Peer-to-Peer” (P2P) network systems are real-time communications networks in which any currently connected computing device, sometimes referred to as an “edge node” or “fringe node,” can take the role of both a client and a server. “Client-Server” is a model of interaction in a distributed computer network system in which a program at one site sends a request to another site and then waits for a response. The requesting program is called the “client,” and the program which responds to the request is called the “server.” In the context of the World Wide Web (“www” or just “web”), which operates over the Internet, the client is typically a “browser,” a program which runs on the computer of an end-user. A program, together with the network computer running it, which responds to a browser request by serving web pages and the like is referred to as a “server.”
Generally, peer-to-peer systems are connected personal computing devices, e.g., personal computers (“PCs”), personal digital assistants (“PDAs”), and the like, where the operating platforms may be heterogeneous. Each node connects to the network of peers by establishing a relationship with at least one peer currently on the network, in a known manner referred to as the exchange of “ping” and “pong” messages. Peers arrive and disappear dynamically, shaping the peer-to-peer network's real-time structure; this contrasts with the Internet, where web sites are statically allocated. Peer-to-peer is a way of decentralizing not just features, but costs and administration as well, eliminating the need for a single, centralized component, such as an index server of a known type. Peer-to-peer permits ad-hoc collaboration and information sharing in large-scale, dynamic, distributed environments. Peer-to-peer systems are becoming increasingly popular because they offer the significant advantages of simplicity, ease of use, scalability, and robustness.
Peer-to-peer computer applications are a class of applications that takes advantage of resources available on this fringe of the standard Internet; for example, decentralized resources of storage, central processing unit (CPU) cycles, content, human presence, and the like. However, accessing such decentralized resources means operating in an environment of unstable connectivity and unpredictable locations, since the nodes operate outside the Domain Name System (DNS) and have significant or total autonomy from dedicated central servers of a known type. At the same time, an advantage of such systems is that communications can be established while tolerating and working with the variable connectivity of hundreds of millions of such nodes. Peer-to-peer system designers must try to solve such connectivity problems. A true peer-to-peer system must (1) treat variable connectivity and temporary network addresses as the norm, and (2) give the fringe nodes involved in the network at least significant autonomy.
One specific problem is that existing search mechanisms in peer-to-peer networks are inefficient due to the decentralized nature just described. That is, the topology of the peer-to-peer network evolves dynamically in real time and is arbitrary at any point in time, with various degrees of connectivity between the linked peers, making search and retrieval of the desired information a difficult problem. Moreover, the only thing that can be assumed about a peer's knowledge base is what the peer wants to, or has time to, make available. This is all somewhat contrary to the objective of helping a querying peer efficiently find the most relevant answer.
One known peer-to-peer network communication protocol, known as “Gnutella™,” is a file sharing technology that offers an alternative to the web search engines used on the Internet, with a fully distributed mini-search engine and a file serving system for media and archive files, and that operates on an open-source policy of file sharing. FIG. 1 (Prior Art) illustrates a simple peer-to-peer structure and searching in a Gnutella peer-to-peer network model. In essence, each node (each circle symbol) represents a computing device; an accurate model may have tens of thousands of such nodes at any given point in time, with nodes and their various links appearing and disappearing substantially randomly, where dotted lines represent currently active network links between nodes. Individual host nodes 101, 102, 103, and the like, store resources, e.g., a database of documents or other content. Moreover, each peer uses its own local directory structure to store its copy of each of the resources. Any peer can propagate a search request, or “query,” illustrated in FIG. 1 by arrows parallel to current links, as broadcast by a first “Querying Peer” 101 to all of its “Neighbor Peer(s)” 102. Note that a neighbor peer itself becomes a querying peer when it passes a search request on to those of its neighbors that are not in direct communication with the first Querying Peer 101, e.g., a neighbor forwarding the query to node 103. In other words, each peer not only searches its own directory for the resource-of-interest of the query, but also broadcasts the query to each of its neighbor peers. While individual hosts are generally unreliable with respect to availability at any given moment, the resources themselves, i.e., the content being sought, tend to be highly available because resources are generally replicated and widely distributed in proportion to demand in peer-to-peer networks.
Generally, however, resources are identified only by file name and file names are subject to the individual preferences of each host node for its local directory structure. Thus, one specific problem is how to search intelligently and efficiently for relevant resources in a peer-to-peer network.
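By way of illustration only, the broadcast search described above can be sketched as follows. This is a minimal sketch and not the Gnutella protocol itself: the function name `flood_query`, the three-node topology, and the file names are hypothetical, and matching is assumed to be a simple case-insensitive substring test against each peer's locally chosen file names.

```python
from collections import deque

def flood_query(links, files, origin, term):
    """Flood a query: each peer checks its own file names, then forwards to its neighbors."""
    hits = []
    seen = {origin}              # peers that have already received the query
    queue = deque([origin])
    while queue:
        peer = queue.popleft()
        # Each peer searches its own local directory by file name only.
        for name in files.get(peer, []):
            if term.lower() in name.lower():
                hits.append((peer, name))
        # Each peer then passes the query on to neighbors that have not yet seen it.
        for neighbor in links.get(peer, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return hits

# Hypothetical topology loosely following nodes 101, 102, 103 of FIG. 1.
links = {101: [102], 102: [101, 103], 103: [102]}
files = {
    101: [],
    102: ["blue_moon.mp3"],
    103: ["Blue Moon (live).mp3", "track07.mp3"],
}
hits = flood_query(links, files, 101, "blue")
```

In this sketch, a file stored at node 103 under the arbitrary name "track07.mp3" is not returned for the query "blue" even if its content is relevant, which is the file-name limitation noted above.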
Again, it is common to store content data files at each peer's local directory structure simply by the given file name. For example, web sites such as Napster™/SM simply store data by a file name associated with the artist or specific song title to facilitate searching. Simple descriptor queries thus get a very large number of unsorted returns. In fact, even a web site search engine in a non-peer-to-peer system, such as the commercial engines Google, Alta Vista, and the like, provides a list of all return links potentially relevant to a query, namely, each and every file found which has a match, or “hit,” to the query. The user must then study that list for relevance to the actual interest intended, and then visit serially those links which actually may be authoritative. That is, all of these web search engines rely upon human intelligence to build and keep the information they contain, in the form of links to web pages, relevant and current.
Another method of data storage at a given node is by random names in order to hide actual file identity. This raises the need for some form of mapping between the random names and the actual files.
Another method for data retrieval is collaborative filtering, where patterns of searches by like-minded searchers are analyzed and leveraged to produce allegedly more relevant results to a specific query. Such analysis inherently requires the documents to be public and known to the searchers in advance in order to provide an answer message to the query.
As another method for limiting query distribution, the query message itself (see, e.g., FIG. 3 (Prior Art), message header 300) can include a decrementing time-to-live (“TTL”) field whereby the number of node propagations is limited. For example, if the TTL is set to seven, each neighbor node passing on the message thereby identifies itself as the first, second, third, et seq., node receiving the message, decrementing the TTL. If the current neighbor node is the seventh node in a peer-to-peer network link chain, it will not forward the message because the TTL has reached zero.
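The TTL example above can be sketched as follows, assuming for simplicity a single linear chain of neighbor nodes. The function name `propagate` and the node labels are hypothetical and do not reflect the actual layout of message header 300.

```python
def propagate(chain, ttl):
    """Return the nodes along a linear chain that receive a query sent with the given TTL."""
    reached = []
    for node in chain:
        if ttl <= 0:
            # The previous node saw the TTL reach zero and did not forward.
            break
        reached.append(node)
        ttl -= 1  # each receiving node decrements the field before deciding to forward
    return reached

# Ten hypothetical neighbors in a row beyond the querying peer.
chain = [f"n{i}" for i in range(1, 11)]
reached = propagate(chain, ttl=7)
```

With a TTL of seven, the query in this sketch stops at the seventh node, n7, and nodes n8 through n10 never receive it.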
In general, existing solutions focus on locating every specific instance of each of the resources that is a potential match to the query. Thus, a replicated resource is likely to appear multiple times in multiple responses to one specific query.
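As a simple illustration of this effect, the sketch below tallies a set of hypothetical responses to one query. The peer numbers and file names are invented for the example, and copies are assumed to be recognizable only by identical file names.

```python
from collections import Counter

# Hypothetical responses to one query: (responding peer, file name as stored locally).
responses = [
    (102, "blue_moon.mp3"),
    (103, "blue_moon.mp3"),
    (104, "blue_moon.mp3"),
    (105, "blue_moon_remix.mp3"),
]

# Count how many separate responses report each file name.
copies = Counter(name for _, name in responses)
distinct_names = len(copies)
```

Here four responses describe only two distinct file names, with the replicated file reported three times.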