Peer-to-peer (P2P) file sharing is a major P2P application, with millions of users sharing millions of files and consuming a large proportion of Internet bandwidth. In such a large-scale system, it is important to provide search capabilities that are accurate yet efficient, lest the user be overwhelmed with search results. However, the search capabilities of these systems are generally weak, particularly in ranking query results.
In a pure peer-to-peer system, true clients and servers do not really exist because each node, i.e., computer, functions simultaneously as both a server and a client. However, as an aid to understanding the present invention, and not by way of limitation, the following terminology, as may be used herein, is explained. A client is a machine running a software routine seeking and receiving information. A server is a machine in the P2P file sharing system acting as a data repository and provider. A content file is a data object that is a unique set of data, e.g., a song, a picture, or any other thing in digital format. A replica is a copy of a content file. A node is one or more machines acting as one location in the network. A node will simply be referred to as a computer or “peer” herein, and is meant to encompass all automated data handling apparatuses.
Standard file sharing models include the common P2P file sharing systems Gnutella and Kazaa. These systems make very few assumptions about the behavior of users and about the data they share. Peers of a P2P file sharing system collectively share a set of content files by maintaining local replicas of them. Each replica of a content file (e.g., a music file) is identified by a descriptor. A descriptor is a metadata set, which includes user-readable terms (i.e., a “bag of words”) and is typically implemented as a filename. Depending on the implementation, a term may be a single word or a phrase. P2P searching consists of identifying content files through a search of the descriptors of the individual content files.
A peer acts as a client by initiating a particular query for a content file. A query is also a metadata set, composed of terms that a user thinks best describe the desired content file. A query is generally routed to all reachable peers, which act as servers.
P2P file sharing systems generally have simple keyword-based data retrieval functions. In general, queries are conjunctive, so servers return references to file replicas whose descriptors contain all of the unique query terms. This containment condition is often referred to as the matching criterion. Each reference, which is generally referred to herein as a “result” or a “search result,” contains the replica's descriptor and the identity of the server that returned it. The descriptor within the result helps the user and client distinguish the relevance of the content file to the query, and the server identity is required to initiate the content file's download.
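The conjunctive matching criterion described above can be illustrated with a minimal Python sketch. It is offered only as an aid to understanding and assumes that terms are single lowercase alphanumeric words; the names `Result`, `tokenize`, `matches`, and `answer_query` are hypothetical and do not correspond to any particular P2P client.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """A search result: the replica's descriptor plus the responding server."""
    descriptor: str
    server_id: str


def tokenize(text):
    """Split text into a set of unique lowercase terms (assumed: single words)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(query, descriptor):
    """Matching criterion: the descriptor contains every unique query term."""
    return tokenize(query) <= tokenize(descriptor)


def answer_query(server_id, local_descriptors, query):
    """Server side: return a result for each local replica that matches."""
    return [Result(d, server_id) for d in local_descriptors if matches(query, d)]
```

For instance, under these assumptions a server holding replicas described as "Madonna - Vogue.mp3" and "Madonna - Holiday.mp3" would return only the first in response to the query "vogue madonna", because the second descriptor lacks the term "vogue".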
Once the user selects a search result, a local replica of the corresponding content file is made by downloading it from the corresponding server. In addition, the user has the option of manipulating the local replica's descriptor on his own computer. He may manipulate it for personal identification or to better share it in the P2P file sharing system.
Traditional Information Retrieval (IR) techniques used to improve searching and result ranking are generally inapplicable in the P2P environment. Such techniques generally assume fixed architectures where dedicated servers manage statistics on the shared data and use them to generate a ranked list of results to return to the client. Such servers, however, do not exist in pure P2P environments; and even if they did, the set of shared data is constantly in flux due to the high churn rates (i.e., the rates at which participating peers join and leave the network). Reliable statistics are therefore hard to maintain.
Furthermore, servers in a P2P system independently maintain data and respond to queries. Each replica is annotated independently with metadata, and that metadata may be particular to the annotating user's tastes. For example, one user might annotate a particular Madonna song as “pop music,” whereas another may annotate it as “80's music.” Searches for this content file are complicated by such variations in the way it is identified.
Servers are also free to return whatever results they please in response to an incoming query, even being able to override the matching criterion. For example, a malicious server may send irrelevant marketing material or viruses in its responses. The client must aggregate the results from the disparate sources and try to rank them correctly to identify such spurious results.
In effect, P2P query processing is distinct from that of traditional search engines in that P2P query processing is a two-step process. The independent servers first generate responses to a query and then the clients must make sense of the responsive results. In contrast, in traditional IR systems, all data are centralized at a single site allowing a comprehensive search. This allows the creation of an integrated result set based on the global data set. Centralized servers can also perform optimizations, such as ranking results based on previous user selections.
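The client-side second step can be sketched as follows. This is an illustration only: the passage above does not prescribe a ranking method, so the heuristic used here (grouping results whose descriptors contain the same set of terms and ordering the groups by the number of distinct servers that returned them) is an assumption, and the function name `aggregate` is hypothetical.

```python
from collections import defaultdict


def aggregate(results):
    """Group (descriptor, server_id) pairs by term set and rank the groups.

    Assumed heuristic, not taken from the text: a descriptor returned by
    more distinct servers is ranked higher. Ties are broken alphabetically.
    """
    groups = defaultdict(set)
    for descriptor, server_id in results:
        # Treat the descriptor as a bag of words, ignoring order and case.
        key = frozenset(descriptor.lower().split())
        groups[key].add(server_id)
    ranked = sorted(groups.items(), key=lambda kv: (-len(kv[1]), sorted(kv[0])))
    return [(sorted(terms), sorted(servers)) for terms, servers in ranked]
```

Because every peer answers independently, the client sees only what the reachable servers chose to return; any such grouping is therefore computed over a partial and possibly spurious result set rather than over global statistics.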
Much of the known work on improving P2P file sharing systems focuses on network architecture, seeking to improve searching by identifying highly reliable peers and giving them specialized roles in statistics maintenance, indexing, and routing. The performance of such systems can be impressive; however, the application domain is different from the one presently considered. The present invention makes no assumptions about the relative capabilities of the peers, and so is likewise applicable to ad hoc environments, where functionality is fully distributed among all participants.