NoSQL databases, semi-structured data processing and storage systems, and other so-called schema-less software accept nested key-value data structures, or “documents,” as input. These documents are a convenient way to represent hierarchical data on which few or no restrictions are imposed regarding contents or structure (schema). In particular, the structure, contents and keys of different documents, even documents that are logically grouped together, can differ completely from one another.
A document structure can include a set of key-value pairs, in which the key is a name (e.g., a string) and the value is either a scalar value (e.g., a number, text, a Boolean or an empty value) or a composite value. Composite values include nested values, such as embedded documents or collections of other scalar or composite values. For particular software processes, the key-value pairs may be formatted as a single, indivisible, serialized unit of nested key-value structures. The data (e.g., the value) contained in the document is typically accessed via one or more keys (though a key is not necessarily required for data access).
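The document structure described above can be illustrated with a minimal Python sketch. The document contents, the field names and the helper names `is_composite` and `classify` are hypothetical and chosen only for illustration; JSON is assumed here as one possible serialized form, which the text above does not prescribe.

```python
import json

# Hypothetical document: keys are strings, values are scalar or composite.
document = {
    "name": "Ada",                                # scalar: text
    "age": 36,                                    # scalar: number
    "active": True,                               # scalar: Boolean
    "nickname": None,                             # scalar: empty value
    "address": {"city": "London", "zip": "N1"},   # composite: embedded document
    "tags": ["math", {"kind": "honor"}],          # composite: collection of scalar and composite values
}


def is_composite(value):
    """Return True for embedded documents and collections, False for scalars."""
    return isinstance(value, (dict, list))


def classify(doc):
    """Label each top-level value of a document as scalar or composite."""
    return {
        key: ("composite" if is_composite(value) else "scalar")
        for key, value in doc.items()
    }


kinds = classify(document)

# The whole document can also be handled as one serialized unit (JSON assumed).
serialized = json.dumps(document)
restored = json.loads(serialized)
```

In this sketch, `serialized` is a single string that must be parsed in full before any nested value, such as `restored["address"]["city"]`, can be read by key.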
Conventional document data access is performed by accessing (e.g., querying) the data through an (external) index that indexes the primary key or one or more other fields of the document. However, conventional document data access is not always desired, optimal, or possible in certain data access scenarios, such as non-indexed queries, aggregate queries, or queries over data in nested fields or nested documents. In these situations, the document data access methods implemented use computer resources inefficiently (e.g., long CPU time and high memory use). More specifically, any non-indexed query to a collection of documents requires a full collection scan operation. A full collection scan operation requires parsing all of the data, document by document and key by key, until a match with the query predicate is found or ruled out for each document. This operation has the technical drawbacks of long processing times, frequent bottlenecks on either I/O or CPU, and poor cache usage patterns.
In addition to the foregoing, present systems encounter difficulty when processing documents that have disparate schemas or are schema-less, because the structure used must be defined on a per-document basis. When documents within a given set share an identical or sufficiently similar underlying structure, significant processing overhead is incurred by unnecessarily redefining the schema for each document, which leads to space, memory and processing bloat.
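The redundancy described above can be made concrete with a short sketch that reduces each document to a description of its shape and counts how many documents share it. This is only an illustration of the problem under simple assumptions (documents held as Python dictionaries, shape defined by keys and value types); the function name `structure_of` and the sample documents are hypothetical.

```python
from collections import Counter


def structure_of(value):
    """Return a hashable description of a value's shape, ignoring scalar contents."""
    if isinstance(value, dict):
        # Sort by key so that key order does not affect the resulting shape.
        return ("dict", tuple(sorted((key, structure_of(v)) for key, v in value.items())))
    if isinstance(value, list):
        return ("list", tuple(structure_of(item) for item in value))
    return type(value).__name__


# Hypothetical documents: the first two share a structure, the third does not.
docs = [
    {"id": 1, "pos": {"x": 1.0, "y": 2.0}},
    {"id": 2, "pos": {"y": 0.5, "x": 3.5}},
    {"id": 3, "label": "other"},
]

shapes = Counter(structure_of(doc) for doc in docs)
```

Here three documents yield only two distinct shapes, so a system that stores or re-derives the structure separately for every document repeats the same description for the first two.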
As such, there exists a need for systems and methods that extract, transform and store document data from a data source in which the documents have varying schemas or are schema-less, in a manner that improves computer resource management. Further, there exists a need for systems and methods that can perform document data processing on hierarchical, nested key-value data.
It is in regard to these issues and others that the present invention is provided.