The present invention is directed towards storage and retrieval of information in structured peer-to-peer (P2P) overlay networks. More particularly, the present invention provides a method for placing semantically similar information on peers which are proximally located in the P2P network, and an efficient searching scheme for retrieving desired information from the network.
A peer-to-peer (P2P) network is a distributed network of computers in which there are no dedicated server or client computers. Every computer or node in a P2P network can act both as a server and as a client. Various architectural configurations are available for creating P2P networks, such as centralized, decentralized unstructured, and decentralized structured configurations. Examples of earlier generation centralized and decentralized unstructured P2P networks include Napster and Gnutella, respectively. Centralized P2P networks suffer from the drawback that a central repository maintains the indexes of all documents stored on the network, which results in a single point of failure. Decentralized unstructured P2P networks usually broadcast queries in the network, which limits the scalability of the network.
Decentralized structured P2P networks have been developed to address the limitations associated with the earlier generation networks. Examples of structured P2P networks include Chord, developed at the Massachusetts Institute of Technology, Content Addressable Networks (CAN) [Ratnasamy et al, 2001], Tapestry [Zhao et al, 2001] and Pastry [Rowstron and Druschel, 2001]. Structured P2P networks are considerably more scalable than unstructured P2P networks.
Structured P2P networks (such as Chord, CAN etc.) employ an overlay architecture scheme that provides a level of indirection over traditional networking addresses such as Internet Protocol (IP) addresses, and are usually used for building distributed hash tables (DHT). In a typical Chord based P2P network, an identifier of each peer computer (e.g., the IP address of the computer) is hashed to generate a unique peer identifier. The hashed peers are arranged in the form of a uni-dimensional ring, often referred to as a Chord ring. Resources such as files stored on the peers are also hashed to generate resource identifiers. Each resource is then placed in the Chord ring at the peer whose unique identifier is closest to its hash identifier (in Chord, the first peer whose identifier equals or follows the resource identifier on the ring). Each peer in the Chord ring maintains partial routing information and relies on successive forwarding by other peers to efficiently route user queries. Although such structured P2P network architectural schemes have proven to be highly scalable, there is a need for P2P networks that support efficient multi-keyword or semantic based searches.
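By way of illustration only, the hashing and placement steps described above may be sketched as follows. The sketch assumes SHA-1 as the hash function, a deliberately small 16-bit identifier space, and the successor rule for placement. The names `chord_id`, `build_ring`, `successor` and `place` are hypothetical and are not taken from any particular Chord implementation. Routing by successive forwarding is not modeled.

```python
import hashlib
from bisect import bisect_left

M_BITS = 16  # assumed size of the identifier space (2**16 positions on the ring)


def chord_id(name: str, m_bits: int = M_BITS) -> int:
    """Hash a peer address or resource name onto the identifier ring."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m_bits)


def build_ring(addresses):
    """Return the sorted list of peer identifiers forming the ring."""
    return sorted({chord_id(address) for address in addresses})


def successor(peer_ids, key: int) -> int:
    """Return the first peer identifier equal to or following key, wrapping around."""
    index = bisect_left(peer_ids, key)
    return peer_ids[index % len(peer_ids)]


def place(peer_ids, resource_name: str) -> int:
    """Return the identifier of the peer that would store the named resource."""
    return successor(peer_ids, chord_id(resource_name))


if __name__ == "__main__":
    # Illustrative peer addresses and file name only.
    ring = build_ring(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])
    print("ring:", ring)
    print("report.pdf is stored at peer", place(ring, "report.pdf"))
```

Because a resource identifier depends only on the hash of the resource name, two files with related content generally land on unrelated peers, which is why such a scheme alone does not support multi-keyword or semantic searches.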
Recently developed P2P information retrieval systems such as pSearch [Tang et al, 2002] and GridVine [Aberer et al, 2004] address some of the limitations of the structured P2P networks (such as Chord, CAN etc.).
In the pSearch information retrieval system, documents stored in the network, as well as queries, are represented as vectors using a vector space model (VSM) or latent semantic indexing (LSI) scheme. In accordance with one approach, the similarity between a document and a query is assessed using the cosine of the angle between the respective vectors. In pSearch systems employing the VSM scheme, the m most heavily weighted terms of each document are identified, hashed and routed using a CAN overlay architecture. To process a semantic search query having t keywords, the query is routed t times, once for each individual keyword, and semantically similar documents are retrieved from selected zones in the CAN. In pSearch systems employing the LSI scheme, the document vectors are dimensionally reduced using singular value decomposition, and the resulting vector is used as a DHT key for routing in a CAN. Such an approach enables semantically similar documents to be placed at proximally located zones in the CAN, and a query is resolved by routing its vector as a DHT key in the CAN. Upon reaching the relevant zone, the query is flooded to proximally located zones (within a pre-computed maximum radius) in order to retrieve semantically similar documents. Usage of VSM based pSearch systems requires a thorough knowledge of the vector dimensions of peer resources and queries, whereas LSI based pSearch systems inhibit dynamic schema evolution in P2P networks, thereby limiting the applicability of pSearch systems.
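The cosine measure and the selection of the m most heavily weighted terms may be illustrated by the following minimal sketch. It assumes raw term frequency as the term weight and whitespace tokenization. pSearch itself may use a different weighting, and the function names `term_vector`, `cosine_similarity` and `top_terms` are hypothetical. The CAN routing and the LSI dimensionality reduction are not modeled.

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Represent a document or query as a sparse vector of term frequencies."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def top_terms(vector: Counter, m: int) -> list:
    """Return the m most heavily weighted terms, ties broken alphabetically."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:m]]


if __name__ == "__main__":
    # Illustrative document and query text only.
    document = term_vector("peer networks route queries between peer nodes")
    query = term_vector("peer queries")
    print("similarity:", cosine_similarity(document, query))
    print("terms to hash and route:", top_terms(document, 2))
```

In a VSM based system, each term returned by `top_terms` would be hashed and published separately, which is why a query with t keywords is routed t times.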
GridVine builds upon the P-Grid [Aberer, 2001] structured overlay network to support semantic based searches in a P2P network. In a GridVine network, statements describing information stored in the P2P network are represented using the resource description framework (RDF). The resulting RDF triples, each consisting of a subject, a predicate and an object, are published in the P-Grid overlay network. Each query is expressed using the resource description framework query language (RDQL) and is resolved by routing query variables to the GridVine overlay network. GridVine also supports dynamic schema evolution in the P2P network. However, as the frequency of RDF triples is generally non-uniform, load imbalances are created in the GridVine architecture, leading to scalability issues.
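The load imbalance noted above may be illustrated by the following sketch. It assumes that each triple is published once under each of its subject, predicate and object, and it uses the term itself as the index key in place of an overlay hash. The function names `index_triples` and `load_per_key` and the sample triples are hypothetical and do not reflect GridVine's actual interfaces.

```python
from collections import defaultdict


def index_triples(triples):
    """Publish each (subject, predicate, object) triple under each of its terms."""
    index = defaultdict(list)
    for triple in triples:
        # dict.fromkeys removes repeated terms so a triple is stored once per key
        for term in dict.fromkeys(triple):
            index[term].append(triple)
    return index


def load_per_key(index):
    """Number of triples stored under each key, heaviest first."""
    return sorted(((len(stored), key) for key, stored in index.items()), reverse=True)


if __name__ == "__main__":
    # Illustrative statements only.
    triples = [
        ("doc1", "rdf:type", "Report"),
        ("doc2", "rdf:type", "Report"),
        ("doc3", "rdf:type", "Image"),
        ("doc1", "dc:creator", "alice"),
    ]
    for count, key in load_per_key(index_triples(triples)):
        print(key, count)
```

Under these assumptions, the peer responsible for a frequently used predicate such as `rdf:type` stores a triple for every typed resource, whereas the peer responsible for a rarely used term stores very few.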
Hence, there is a need for a method of information storage and retrieval in structured P2P overlay networks which is scalable and efficient in terms of the time taken to retrieve information corresponding to a search query comprising a set of keywords.