Data interchange is a method for exchanging computer-readable data between two or more autonomous computer systems or servers. These computer systems may use different operating systems. JavaScript™ Object Notation (JSON) is a lightweight data interchange format that uses human-readable text to store and transmit data objects comprising attribute-value pairs. One common use of JSON is to read data from a web server and to display the data in a web page. JSON may be used as an alternative to XML (Extensible Markup Language) for organizing data. Likewise, JSON may be used in conjunction with distributed document storage databases. JSON documents are relatively lightweight and can be parsed rapidly on web servers.
JSON includes “name: value” pairs and punctuation in the form of braces, brackets, colons, and commas. Each pair is defined with a name such as “text” or “image”, followed by a colon and the value for that name. A value may be a string, a number, a Boolean, null, an array, or another object. The simple structure and the absence of mathematical notation and algorithms make JSON intuitive, easy to understand, and quickly mastered, even by those with limited formal programming experience. Moreover, JSON facilitates the development of web and mobile applications while not being affected by database schema changes. A schema is an organizational structure that represents a logical view of a database. The schema defines how data is organized, specifies relationships among the data, and formulates all constraints that are to be applied to the data.
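By way of non-limiting illustration, the name-value structure described above may be sketched using Python's standard json module. The document contents are hypothetical; the names "text" and "image" merely echo the examples given above, and the values are assumptions chosen for illustration.

```python
import json

# Hypothetical JSON document: braces delimit objects, brackets delimit
# arrays, colons separate names from values, commas separate pairs.
raw = '{"text": "Hello", "image": {"src": "logo.png", "width": 64}, "tags": ["a", "b"]}'

# Parse the JSON text into native Python objects (dict, list, str, int).
doc = json.loads(raw)

text = doc["text"]                # a string value
width = doc["image"]["width"]     # a number inside a nested object
tag_count = len(doc["tags"])      # an array value

# Serialize back to JSON text; sort_keys gives a stable name ordering.
round_trip = json.dumps(doc, sort_keys=True)
```

Because the document carries its own names, a reader of the data does not need a separately declared schema in order to access a value such as the image width.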
JSON distributed document storage databases do not always provide adequate data analysis capabilities. As a result, external data analytics services, such as Spark™, have been developed to integrate data analysis capabilities with JSON distributed document storage databases. In order to leverage data analytics services, documents in a JSON document storage database must be read and transformed into a Resilient Distributed Dataset (RDD), and then an analytics job may be executed on the RDD. The RDD is an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel. The RDD can contain any type of object and is created by loading an external dataset or distributing a collection from a driver program. RDD data is resilient, in the sense that the data can be recomputed in case all or a portion of the data is lost. RDD data is distributed, such that the data can be read and processed from any of multiple nodes without having to move the data to any particular node. RDDs are computed in memory. By default, an RDD is recomputed each time an action is executed on the RDD. Alternatively, an RDD may be persisted in memory, in which case elements of the RDD are retained on a cluster for much faster access the next time that the elements are queried. RDDs are advantageous in terms of rearranging computations to optimize data processing.
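By way of non-limiting illustration, the recompute-or-persist behavior described above may be sketched with a toy, single-process class. This sketch is not the Spark™ API and is not distributed; the class ToyDataset, its methods, and the sample documents are all hypothetical and serve only to show how a dataset that remembers its lineage can be rebuilt on each action, retained after persisting, and recomputed after a loss.

```python
from typing import Any, Callable, List, Optional


class ToyDataset:
    """Hypothetical single-process stand-in for an RDD (not Spark's API)."""

    def __init__(self, compute: Callable[[], List[Any]]) -> None:
        self._compute = compute                  # lineage: how to rebuild the data
        self._persist = False
        self._cache: Optional[List[Any]] = None

    def map(self, fn: Callable[[Any], Any]) -> "ToyDataset":
        # Lazy transformation: returns a new dataset, parent is unchanged.
        return ToyDataset(lambda: [fn(x) for x in self._materialize()])

    def filter(self, pred: Callable[[Any], bool]) -> "ToyDataset":
        return ToyDataset(lambda: [x for x in self._materialize() if pred(x)])

    def persist(self) -> "ToyDataset":
        self._persist = True                     # keep results after first action
        return self

    def evict(self) -> None:
        self._cache = None                       # simulate loss of retained data

    def _materialize(self) -> List[Any]:
        if self._cache is not None:
            return self._cache
        data = self._compute()                   # recompute from lineage
        if self._persist:
            self._cache = data
        return data

    def collect(self) -> List[Any]:
        # Action: triggers the actual computation.
        return list(self._materialize())


calls = {"load": 0}                              # counts reads of the source


def load_documents() -> List[dict]:
    # Hypothetical documents standing in for a JSON document store.
    calls["load"] += 1
    return [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}, {"id": 3, "amount": 40}]


docs = ToyDataset(load_documents)
large_ids = docs.filter(lambda d: d["amount"] > 20).map(lambda d: d["id"])

first = large_ids.collect()
second = large_ids.collect()
loads_without_persist = calls["load"]            # source is read on every action

cached = docs.filter(lambda d: d["amount"] > 20).map(lambda d: d["id"]).persist()
cached.collect()
cached.collect()
loads_with_persist = calls["load"] - loads_without_persist

cached.evict()                                   # retained elements are lost
recovered = cached.collect()                     # rebuilt from lineage
loads_after_loss = calls["load"]
```

In this sketch the source is read once per action when nothing is persisted, once in total after persist() is called, and once more after the retained elements are evicted, which mirrors the resilience and persistence properties described above.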
As a practical matter, many data analytics jobs are required to be executed at regular time intervals, or on a continual basis. When a first round of a data analytics job is executed, a first set of documents from the JSON distributed document storage database is analyzed. Then, when a second round of the data analytics job is to be executed, a second set of documents from the JSON distributed document storage database needs to be analyzed. In general, the second set of documents is not identical to the first set of documents. Because the documents to be analyzed change dynamically, it is challenging to support data analytics on JSON distributed document storage databases effectively and efficiently. Thus, there exists a need to overcome at least one of the preceding deficiencies and limitations of the related art.