While the rise of the Internet has solved some data management problems, it has also created new ones. For example, many Internet applications, such as e-commerce, e-mail, and social media applications, have created a so-called ‘big data’ problem. The ‘big data’ problem results from vast volumes of data, much of which is generated at very high velocities and with widely varying formats and lengths. In general, the term ‘big data’ refers to datasets that have grown so large that they are beyond the ability of commonly used database management tools to capture, manage, and process within a tolerable period of time. Such datasets can range from a few dozen terabytes to many petabytes of data, all within a single dataset. Thus, ‘big data’ comprises billions of potentially non-uniform data objects that are generated daily, must be accessible at an instant, and yet must be stored reliably and cheaply for potentially long periods of time.
A new class of distributed storage systems, called NoSQL or ‘big data’ databases, has recently emerged. Examples of such database management systems include HBase, Cassandra, MongoDB, and Hibari®. While such databases do not provide the richness of traditional SQL databases, they are very efficient at storing and retrieving large volumes of data in a relatively cheap and reliable manner. Such NoSQL-based systems are also readily scalable in that heterogeneous servers can be added at any time to networked server clusters, after which the data is automatically rebalanced and distributed without disruption to service.
However, to achieve such high performance and scalability, these NoSQL-based systems must be optimized for specific data types. For example, Cassandra is optimized to handle very fast writes of many small data items, but performs relatively poorly when many large data items are written to the database. No prior art solution is optimal for vastly different data types.
One potential solution would be to deploy different systems for different data types; for example, large data objects could be stored in a file system while small data objects are kept in a NoSQL database. However, this approach is unsatisfactory because it multiplies the number of systems and software packages that must be maintained. Moreover, synchronizing usage across different databases is likely to be difficult: enforcing a usage policy (for example, a bytes-per-second limit) for a user who happens to have both large and small data would require synchronizing two different systems in real time. It is also questionable whether this approach would even function in a large-scale ‘big data’ environment. Nor does this approach readily scale to N systems, since the management and synchronization overhead increases as N increases.
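By way of illustration only, the split-storage approach described above might be sketched as follows. The sketch is not taken from any existing system: the class names `Store` and `SplitStorage`, the 1 MiB `SIZE_THRESHOLD_BYTES` cutoff, and the in-memory dictionaries standing in for a NoSQL database and a file system are all hypothetical. For simplicity, a cumulative byte quota stands in for the bytes-per-second limit mentioned above.

```python
from dataclasses import dataclass, field

# Hypothetical cutoff between "small" and "large" objects (assumption).
SIZE_THRESHOLD_BYTES = 1024 * 1024


@dataclass
class Store:
    """In-memory stand-in for one backend (a NoSQL database or a file system)."""
    name: str
    objects: dict = field(default_factory=dict)        # (user, key) -> bytes
    bytes_written: dict = field(default_factory=dict)  # user -> running total

    def put(self, user: str, key: str, value: bytes) -> None:
        self.objects[(user, key)] = value
        self.bytes_written[user] = self.bytes_written.get(user, 0) + len(value)


class SplitStorage:
    """Routes each object to one of two separate backends by size."""

    def __init__(self, quota_bytes: int) -> None:
        self.small = Store("nosql")
        self.large = Store("filesystem")
        self.quota_bytes = quota_bytes

    def usage(self, user: str) -> int:
        # The policy check must consult every backend; with N backends
        # this becomes N lookups that all have to be kept consistent.
        backends = (self.small, self.large)
        return sum(s.bytes_written.get(user, 0) for s in backends)

    def put(self, user: str, key: str, value: bytes) -> str:
        if self.usage(user) + len(value) > self.quota_bytes:
            raise PermissionError(f"quota exceeded for {user}")
        target = self.large if len(value) >= SIZE_THRESHOLD_BYTES else self.small
        target.put(user, key, value)
        return target.name  # name of the backend that received the object
```

Within a single process the combined usage check is trivial. When the two backends are independent networked systems, the same check requires their per-user counters to be reconciled in real time, which is the synchronization burden noted above.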
Accordingly, there is a need for an integrated hybrid data management system which is capable of efficiently handling varying types of ‘big data.’