Distributed storage systems, whether implemented as a distributed database or as a distributed file system, fail to scale well for data mining and business intelligence applications that may require fast and efficient retrieval and processing of large volumes of data. Distributed databases for large volumes of data, perhaps on the order of terabytes, may traditionally be implemented across several servers, each designed to host a portion of the database and typically storing the data of a particular table. In some implementations, such a system may also use the technique known as horizontal partitioning, in which each of one or more storage servers stores a subset of the rows of a table. Queries for retrieving data from the distributed storage system may then be processed by retrieving entire rows of data, each having many associated columns, even though only one or a few of those columns may be needed to process the query. As a result, the storage and retrieval of data in these types of systems is inefficient, and consequently such systems do not scale well for handling terabytes of data.
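The row-retrieval overhead described above can be illustrated with a minimal sketch. The table contents, the column names, the round-robin placement of rows, and the helper names `partition_rows` and `sum_column` are all hypothetical and chosen only for illustration; they do not describe any particular database product.

```python
# Hypothetical table: each row has several columns, but the query needs only one.
COLUMNS = ["id", "name", "region", "price", "quantity"]

ROWS = [
    (1, "widget", "east", 2.50, 10),
    (2, "gadget", "west", 4.00, 3),
    (3, "gizmo", "east", 1.25, 7),
    (4, "doohickey", "west", 9.99, 1),
]


def partition_rows(rows, num_servers):
    """Horizontal partitioning: place whole rows on servers round-robin.

    Round-robin placement is an assumption made for this sketch.
    """
    servers = [[] for _ in range(num_servers)]
    for i, row in enumerate(rows):
        servers[i % num_servers].append(row)
    return servers


def sum_column(servers, column):
    """Sum one column while counting every value fetched from row storage."""
    idx = COLUMNS.index(column)
    total = 0
    values_read = 0
    for server_rows in servers:
        for row in server_rows:
            values_read += len(row)  # the entire row is retrieved
            total += row[idx]        # only one value is actually used
    return total, values_read


servers = partition_rows(ROWS, 2)
total, values_read = sum_column(servers, "quantity")
values_needed = len(ROWS)  # one value per row would have sufficed
print(f"sum={total}, values read={values_read}, values needed={values_needed}")
```

In this toy case the query reads five values per row to use one, so the amount of data retrieved grows with the number of columns in the table rather than with the number of columns the query needs.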
Typical transaction processing systems using a distributed database likewise fail to scale well for data mining and business intelligence applications. Such systems may characteristically process data more slowly when a transaction fails. During transaction processing, a failed transaction may be abandoned and the database may be rolled back to its state prior to the failed transaction. Such database implementations prove inefficient for updating large data sets on the order of gigabytes or terabytes.
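The rollback behavior described above can be sketched as follows. This is a simplified illustration under stated assumptions: the table is an in-memory dictionary, the prior state is preserved with a full copy (real databases would more likely use logging than a full snapshot), and the function name `apply_batch` and the rule that a `None` value causes failure are hypothetical.

```python
import copy


def apply_batch(table, updates):
    """Apply all updates as one transaction; on failure, restore the prior state.

    Returns (succeeded, number of updates applied before the outcome).
    """
    snapshot = copy.deepcopy(table)  # state prior to the transaction
    applied = 0
    try:
        for key, value in updates:
            if value is None:  # hypothetical failure condition
                raise ValueError(f"invalid value for key {key!r}")
            table[key] = value
            applied += 1
    except ValueError:
        # Abandon the transaction and roll back to the prior state.
        table.clear()
        table.update(snapshot)
        return False, applied
    return True, applied


table = {"a": 1, "b": 2}
updates = [("a", 10), ("b", 20), ("c", None)]  # the last update fails
ok, discarded = apply_batch(table, updates)
print(ok, discarded, table)
```

Here two updates are applied and then discarded when the third fails. For a data set of gigabytes or terabytes, the discarded work and the cost of preserving the prior state would grow accordingly.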
Distributed file systems are also inadequate for storage and retrieval of data for data mining and business intelligence applications. Distributed file systems may provide only low-level storage primitives for reading data from and writing data to a file. In general, such systems fail to establish any semantic relationships between the data and the files stored in the file system. Consequently, semantic operations for data storage and retrieval, such as redistributing data, replacing storage, and dynamically adding storage, are not available for such distributed file systems.
What is needed is a way to provide data storage, query processing, and retrieval for large volumes of data, perhaps on the order of hundreds of terabytes, for data warehousing, data mining, and business intelligence applications. Any such system and method should allow the use of common storage components without requiring expensive fault-tolerant equipment.