Parallel processing techniques are known in which a plurality of data processing units are provided and each processing unit is assigned, for example, its own mutually exclusive set of local data items to process. This can greatly reduce the overall processing time as compared with serial processing. The "nodes" of a parallel processing system are the separate processing units, each of which has its own processor and its own storage (or at least access to shared storage). Where a dataset is to be processed in parallel, it is loaded into the storage of the plurality of parallel processing units of the system. Two models exist for processing data which is shared across a plurality of nodes in this way. In the first of these models, known as the "master-slave" model, processing is under the control of a master node, which may have its own share of the data. There is generally no more than one master node. The other nodes are referred to as slaves. In the second model, there is generally no one node which is in control; all nodes communicate with each other in an "any-to-any" model. With both of these models, if information is to be extracted from a dataset by selecting data items in a specific sequence and performing operations on the selected data, while ensuring adequate coverage of the data on each of the nodes, then a fast and efficient method is required for locating the required data items.
One possible method of locating specific data items within a dataset which is shared across multiple nodes involves polling all of the individual nodes. A first node (generally a controller node) sends a query to all nodes to determine which of them holds, say, item number 15 of the set of data items. One of the nodes should reply with a confirmation that it holds the required item. These inter-node communication steps are repeated for each required data item. However, such communication between the nodes entails both undesirable overheads and delays. Furthermore, inter-node communication of this kind necessitates status and error checking, plus corrective operations, to ensure that communication failures cannot result in out-of-step processing. This entails a significant additional processing overhead. It is thus desirable to avoid any unnecessary communication between the nodes, and so a method and a system are required which do not rely on polling of individual nodes to determine the location of a required data item.
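The cost of the polling approach described above can be illustrated with a minimal sketch. The `Node` class, the `poll_for_item` function and the striped 20-item layout are hypothetical names and values chosen for illustration only; the sketch assumes one query message per node and a single confirmation from the node holding the item, and it omits the status and error checking mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node holding a set of global item numbers."""
    node_id: int
    items: set = field(default_factory=set)

    def has_item(self, item_number: int) -> bool:
        # Stands in for the node receiving and evaluating one query.
        return item_number in self.items


def poll_for_item(nodes, item_number):
    """Query every node for item_number.

    Returns (owner_id or None, messages_exchanged), counting one query
    per node plus one confirmation from the node that holds the item.
    """
    owner = None
    messages = 0
    for node in nodes:
        messages += 1  # query sent to this node
        if node.has_item(item_number):
            owner = node.node_id
            messages += 1  # confirmation sent back by the holder
    return owner, messages


# Assumed layout: 20 items striped across 4 nodes.
nodes = [Node(n, {i for i in range(20) if i % 4 == n}) for n in range(4)]

owner, messages = poll_for_item(nodes, 15)
print(owner, messages)  # 3 5
```

Under these assumptions, locating each item costs one message per node plus a reply, so the message count grows with both the number of nodes and the number of items to be located.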
Although polling has these disadvantages, there is also a significant problem with locating and sampling data items in a parallel system if polling is not used. Difficulties arise because the locations of data items within a dataset which is shared across a number of nodes depend on the number of nodes available (or the number selected from the available nodes) for performance of a particular operation and on the chosen type of data partitioning, both of which may be subject to change. The number of nodes across which the dataset is shared may vary, for example, because nodes which were available when an operation was performed for a first time may be unavailable when the operation is subsequently re-run. The data may also be partitioned in different ways across the nodes according to a data analyst's selection. For example, data items may be striped across a number of nodes, or each node may hold a contiguous block of data. The analyst may wish to change the partitioning of the dataset across the nodes when an operation is repeated (for example, because of temporal trends identified when the operation was first performed). Thus, each time a particular operation is repeated by the parallel processing system, data items may be located on different nodes than when the operation was previously performed. This makes it difficult to locate a particular data item, and to sample the dataset reproducibly, without polling all nodes.
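The dependence of an item's location on the node count and the partitioning type can be shown with a short sketch. The two layout rules below are assumptions made for illustration (round-robin striping, and equal contiguous blocks with the last block possibly shorter); the function names are hypothetical and actual systems may partition differently.

```python
def striped_location(item, num_nodes):
    """Assumed round-robin striping: item i is held by node i mod N."""
    return item % num_nodes, item // num_nodes  # (node, local index)


def block_location(item, num_items, num_nodes):
    """Assumed contiguous blocks of ceil(num_items / N) items per node."""
    block_size = -(-num_items // num_nodes)  # ceiling division
    return item // block_size, item % block_size  # (node, local index)


NUM_ITEMS = 20  # illustrative dataset size

# The same global item, number 15, under different configurations.
print(striped_location(15, 4))           # (3, 3)
print(striped_location(15, 3))           # (0, 5)
print(block_location(15, NUM_ITEMS, 4))  # (3, 0)
print(block_location(15, NUM_ITEMS, 3))  # (2, 1)
```

In this example item 15 is held by node 3 when four nodes are used, but by node 0 or node 2 when three nodes are used, depending on the partitioning, and its local position differs in every case.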
A second alternative which may be considered is to provide a look-up mapping table on each node which identifies the items held there (for example, listing their global item numbers within the dataset as a whole and the corresponding local item numbers). Alternatively, a master node, or every node of the system, can be provided with a full list of which nodes hold which items. Such tables are unacceptable since, for any large database of the kind where data mining is likely to be used, the data item location tables will also be very large and will consume far too much of the available storage space. Generating the look-up tables also entails significant overhead.
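The storage cost of such a table can be estimated with a rough sketch. All figures here are assumptions for illustration: an entry is taken to be two 8-byte integers (node number and local item number), the dataset sizes are arbitrary, and `build_location_table` and `table_bytes` are hypothetical helper names.

```python
def build_location_table(num_items, num_nodes):
    """Full table mapping global item number -> (node, local item number).

    Assumes items are striped round-robin across the nodes.
    """
    return {g: (g % num_nodes, g // num_nodes) for g in range(num_items)}


def table_bytes(num_items, bytes_per_entry=16, copies=1):
    """Rough table size, ignoring container overhead (assumed entry size)."""
    return num_items * bytes_per_entry * copies


# Small table: item 15 of 20, striped across 4 nodes.
table = build_location_table(20, 4)
print(table[15])  # (3, 3)

# One full table for a billion-item dataset, then one copy on each of 16 nodes.
print(table_bytes(10**9))             # 16000000000
print(table_bytes(10**9, copies=16))  # 256000000000
```

Under these assumptions a single full table for a billion items needs about 16 GB before any container overhead, and the total grows in proportion to the number of nodes holding a copy. The table must also be rebuilt whenever the node count or partitioning changes.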
If efficient reproducible sampling is to be achieved, then there is a need for methods and systems which enable particular selected data items to be located despite any changes to the partitioning of the dataset across a variable number of nodes. No method or system has previously been made available which provides efficient automatic determination, by a single node of a parallel processing system, of the location of items of a dataset which is shared across the system nodes, which does not involve polling of other nodes, and which takes account of changes to the data partitioning.