The technology described herein relates to the area of data management and, more specifically, to distributed data storage including distributed databases.
A distributed data storage solution includes a number of physically distinct computers with associated physical storage (e.g., one or more hard drives, optical discs, etc.), with each computer managing a data set that is a subset of a larger data set. A distributed data storage solution is used when the storage capacity of a single computer is not enough to hold the larger data set. Each subset varies in size up to the storage capacity provided by the computer on which the subset is deployed.
After the larger data set is split into subsets and deployed to a number of computers, retrieving an individual data item requires finding out which subset holds that data item. The retrieval request is then addressed to the computer on which the corresponding subset has been deployed (stored). Two approaches can be used to determine the subset involved: 1) a linear scan of every computer, performed by broadcasting the retrieval request to all computers storing the larger data set, which is inefficient because only one computer holds the subset that contains the individual data item (unless the data is replicated); and 2) a data location algorithm that determines the specific computer (or computers, since data might be replicated for redundancy) where the subset holding the data item sought is held, and addresses the request to that computer only.
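The difference between the two approaches can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the computers are modelled as in-memory dictionaries, and subset_index is a hypothetical hash-based location function, not the method of any particular system.

```python
import hashlib


def subset_index(key: str, num_computers: int) -> int:
    """Hypothetical location function: map a key to one computer index."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_computers


def build_cluster(items: dict, num_computers: int) -> list:
    """Split the larger data set into one dict (subset) per computer."""
    computers = [{} for _ in range(num_computers)]
    for key, value in items.items():
        computers[subset_index(key, num_computers)][key] = value
    return computers


def broadcast_lookup(computers: list, key: str):
    """Approach 1: send the request to every computer."""
    requests_sent = 0
    found = None
    for subset in computers:
        requests_sent += 1
        if key in subset:
            found = subset[key]
    return found, requests_sent


def located_lookup(computers: list, key: str):
    """Approach 2: compute the owning computer first, then send one request."""
    owner = subset_index(key, len(computers))
    return computers[owner].get(key), 1


# Illustrative data: 100 items spread over 8 simulated computers.
items = {f"item-{i}": i for i in range(100)}
cluster = build_cluster(items, num_computers=8)
```

With eight simulated computers, broadcast_lookup issues eight requests for a single item, whereas located_lookup issues one.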
Of the approaches above, the second is usually preferred because of the processing resources required. In the second approach, despite the added latency due to the execution of the data location algorithm, the processing resources used by a retrieval request remain approximately constant (or rather, grow as O(log n), where n is defined as the number of individual data items in the larger data set). Here O is defined as follows: for a function f(n) which is non-negative for all integers n≧0, f(n)=O(g(n)) if there exists an integer n0 and a constant c>0 such that for all integers n≧n0, f(n)≦cg(n). Additional description can be found in “Data Structures and Algorithms with Object-Oriented Design Patterns in Python”, Bruno R. Preiss et al., 2003. In the first approach, the resources grow as O(N), where N is defined as the number of subsets. When all the computers in the solution share the same data storage characteristics, N=n/S, where S is defined as the storage capacity provided by a single computer in the solution. If n is very large, as is usual in distributed data storage solutions, the first approach makes considerably less efficient use of processing resources.
Solutions applying the second approach are further characterized by the type of data location algorithm used. Broadly speaking, a possible taxonomy of these algorithms is as follows: 1) state-less algorithms, which do not use information stored previously during insertion or re-location of an individual data item; these algorithms use a mathematical function (typically some kind of unidirectional hash function) during data item insertion, re-location, and/or retrieval to find out the subset that contains the data item; 2) state-full algorithms, which store information about the subset that contains every individual data item during data item insertion or re-location; during data item retrieval, the algorithm reads the stored information to locate the subset; and 3) mixed algorithms, which start as state-less but allow the state-full logic to be applied to some data items. When used properly, a mixed algorithm combines the advantages of the state-less and state-full algorithms in a single solution.
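The taxonomy above can be made concrete with a small sketch of a mixed algorithm. The class name MixedLocator, its methods, and the choice of SHA-256 are assumptions made for illustration only: the hash supplies the state-less default, and a small override table supplies the state-full logic for relocated items.

```python
import hashlib


class MixedLocator:
    """Illustrative mixed algorithm: state-less by default, state-full on demand."""

    def __init__(self, num_subsets: int):
        self.num_subsets = num_subsets
        # State-full part: explicit placements for relocated items only.
        self.overrides = {}

    def hash_subset(self, key: str) -> int:
        """State-less part: the subset is a pure function of the key."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_subsets

    def locate(self, key: str) -> int:
        """Use the stored placement if one exists, else fall back to the hash."""
        if key in self.overrides:
            return self.overrides[key]
        return self.hash_subset(key)

    def relocate(self, key: str, subset: int) -> None:
        """Record an explicit placement for one data item."""
        if not 0 <= subset < self.num_subsets:
            raise ValueError("subset out of range")
        self.overrides[key] = subset
```

In this sketch the override table grows only with the number of relocated items, which is why the approach stays cheap as long as that number is small.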
State-full algorithms provide the best flexibility in terms of data item re-location, allowing features like, for example, moving data items physically closer (in network hops) to points where those items are used more often, or to the computers experiencing less workload. However, these algorithms pose a processing bottleneck (each request implies querying the information about which subset contains the sought data item(s)) and a scalability issue (some information has to be stored for each and every individual data item in the larger data set, which takes up storage space from the distributed data storage).
For these reasons, highly distributed (e.g., hundreds of computers) data storage solutions typically use state-less algorithms. A state-less algorithm is fast and efficient (execution includes evaluating a hash function followed by reading an in-memory array) and consumes little space (the memory required to hold the array). However, re-location of individual data items is difficult, since the same hash function always delivers the same hash value for a given input value. Mixed algorithms provide some of the benefits of state-full algorithms, as long as the number of data items to which the state-full logic is applied is small.
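The hash-then-array lookup described above might look as follows. This is a sketch under stated assumptions: the bucket count, the computer names C1 to C4, and the round-robin bucket assignment are all hypothetical.

```python
import hashlib

NUM_BUCKETS = 16                       # assumed number of hash buckets
COMPUTERS = ["C1", "C2", "C3", "C4"]   # assumed computer names

# In-memory array: position = hash bucket, value = computer holding that bucket.
BUCKET_TO_COMPUTER = [COMPUTERS[b % len(COMPUTERS)] for b in range(NUM_BUCKETS)]


def locate(key: str) -> str:
    """Evaluate the hash function, then read the in-memory array."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % NUM_BUCKETS
    return BUCKET_TO_COMPUTER[bucket]
```

Because locate depends only on the key and the array, a given key always resolves to the same computer. Moving a single data item elsewhere is therefore not possible without reassigning its whole bucket or changing the function.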
State-less algorithms are, however, not suitable for data sets characterized by multiple defining fields (keys). A defining field is a part of an individual data item that uniquely determines the rest of the item. For example, a phone number is a defining field (key) of a larger data set including phone lines throughout the world. Given the phone number, it is possible to determine additional data relative to a single phone line.
Applying a hash function to different defining fields (keys) of the same data item will, in general, deliver different hash values, because the input values differ. Consider the phone line example above with a second defining field added for fixed phone lines: the network terminal point identifier (NTP ID), which identifies the physical point of connection of a fixed phone line, e.g., the phone line socket at a subscriber's home. Each fixed phone line is tied to one and only one NTP. The subset obtained from the state-less algorithm when using the phone number as the key value is, in general, different from the subset obtained when using the NTP ID as the key, which makes it impossible to determine a single subset to which every data item belongs.
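The mismatch can be observed with a short experiment. The phone numbers, NTP IDs, and hash function below are made up for illustration; the sketch only counts how often the two keys of the same phone line map to different subsets.

```python
import hashlib


def subset_of(key: str, num_subsets: int) -> int:
    """Hypothetical state-less location function."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_subsets


def count_disagreements(lines: list, num_subsets: int) -> int:
    """Count phone lines whose two keys hash to different subsets."""
    return sum(
        1
        for phone_number, ntp_id in lines
        if subset_of(phone_number, num_subsets) != subset_of(ntp_id, num_subsets)
    )


# Synthetic phone lines, each with two defining fields (illustrative values).
lines = [(f"+1555{i:07d}", f"NTP-{i:07d}") for i in range(1000)]
```

With a well-mixing hash and several subsets, most lines should disagree: a query by NTP ID would be sent to a computer other than the one chosen when the item was stored by phone number.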
To overcome this multiple-key problem, distributed data storage solutions using state-less/mixed algorithms typically use two-step indexing algorithms. FIG. 1 illustrates a non-limiting example solution to the multiple-key problem. There is a main index comprising the values of one defining field (Primary Key 101), plus a number of secondary indexes containing the values of each additional defining field (Second Key 102, etc.), each associated with a reference to the corresponding entry in the main index. To find the subset a data item belongs to (i.e., to locate the computer (C1, C2, . . . CN) storing a specific data item) using a key value stored in a secondary index, the secondary index is queried first to find the corresponding entry in the primary index, and then the hash function 103 is applied to the value stored in the primary index to determine a single subset 104.
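The two-step lookup can be sketched as below. The class TwoStepIndex and its method names are hypothetical, the indexes are plain in-memory Python structures, and SHA-256 stands in for hash function 103.

```python
import hashlib


class TwoStepIndex:
    """Illustrative two-step index: secondary key -> primary key -> subset."""

    def __init__(self, num_subsets: int):
        self.num_subsets = num_subsets
        self.primary = set()   # main index: primary key values
        self.secondary = {}    # secondary index: secondary key -> primary key

    def _hash(self, primary_key: str) -> int:
        digest = hashlib.sha256(primary_key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_subsets

    def insert(self, primary_key: str, secondary_keys: list) -> int:
        """Register all keys of one data item and return its subset."""
        self.primary.add(primary_key)
        for key in secondary_keys:
            self.secondary[key] = primary_key
        return self._hash(primary_key)

    def locate_by_primary(self, primary_key: str) -> int:
        if primary_key not in self.primary:
            raise KeyError(primary_key)
        return self._hash(primary_key)

    def locate_by_secondary(self, secondary_key: str) -> int:
        # Step 1: secondary index yields the primary key.
        primary_key = self.secondary[secondary_key]
        # Step 2: the hash of the primary key yields the subset.
        return self._hash(primary_key)
```

Both lookups resolve to the same subset because only the primary key is ever hashed; the cost is one index entry per key per data item.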
However, the two-step indexing algorithm poses a problem in that the storage capacity used for holding the indexes grows linearly with the number of data items. In fact, when the number and size of defining fields (keys) is comparable to the size of a complete data item, a large amount of storage space is required just to hold the indices alone. How large can be estimated using the following formula. If $s$ is the size of a complete data item and $s_i$ is the size of the defining fields of a data item, then $n = \frac{N \cdot S}{s + s_i} = \frac{N \cdot S}{s\,(1 + p)}$, where $p = s_i/s$. That is, $n$ (the distributed system's capacity) is inversely proportional to $1 + p$. Thus, when $s_i$ approaches $s$, the storage space required for storing indices is as large as that devoted to storing the data elements themselves.
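The effect of the capacity formula can be checked with a few lines of arithmetic. The function name and all figures below (100 computers, 10^12 bytes each, 1000-byte items) are illustrative assumptions, not values taken from any real deployment.

```python
def item_capacity(num_computers: int, capacity_per_computer: int,
                  item_size: int, key_size: int) -> int:
    """n = N*S / (s + s_i): items that fit once index space is accounted for."""
    total_storage = num_computers * capacity_per_computer
    return total_storage // (item_size + key_size)


# Assumed figures: N = 100 computers, S = 10**12 bytes each, s = 1000 bytes.
no_index = item_capacity(100, 10**12, 1000, 0)       # p = 0
small_keys = item_capacity(100, 10**12, 1000, 250)   # p = 0.25
large_keys = item_capacity(100, 10**12, 1000, 1000)  # p = 1
```

Under these assumed figures, capacity falls from 10^11 items with no index, to 8 x 10^10 items at p = 0.25, to 5 x 10^10 items at p = 1, where half of the storage holds indices.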
Additionally, the index structure can become a processing bottleneck since it has to be checked for each and every query and requires extra storage space and more associated processing power (i.e., additional computers).
Moreover, with the two-step indexing algorithm there is no way to allocate a data item (or a subset of data items) to one specific storage element in the distributed system. Such targeted storage is beneficial when, for example, the distributed system spans a large geographic area (such as a country) and certain often-accessed data items are stored on a computer that is far, in network terms, from the entity accessing them. Placing the often-accessed data items on a computer close, in network terms, to the entity accessing them improves response time and decreases network utilization.
Existing systems allow the arbitrary allocation of data items to specific computers by means of a traditional associative array (one for every defining field, or key, in the data item) that is searched prior to the two-step search process described above. This increases both the time and resources used in every search and the storage space required for index structures (which now include the aforementioned associative arrays).
What is needed is a data distribution and location method and system that overcomes the problems associated with expanding processing resources, as well as the limitations of multiple-key addressing, indexing, and reallocation of resources.