The present disclosure relates to computing and data processing, and in particular, to computer implemented systems and methods for processing semi-structured documents.
Traditionally, data has been stored in a highly structured manner. For example, common databases such as SQL databases store data as records, where each record is a row of a table, and the columns of the table are fields representing particular data elements such as name, address, price, and so on. Structured data storage typically includes a schema, which defines the structure of how the data is stored so that queries may be generated and executed to retrieve specific data from the database.
However, recently there has been an increase in the storage of unstructured or semi-structured data. Semi-structured data may be stored in a database as a document, where different documents may have different fields and different structures. For example, documents may have fields such as “first name” and “age,” and other information may be stored as subdocuments or associated arrays of information. Different documents in the same database may have different fields, and fields that are shared or that overlap between documents may be structured in different ways and not normalized.
A common situation is to have a set of semi-structured data instances that are grouped into a collection or set where the data items do not have identical structure. This often occurs when working with NoSQL databases such as MongoDB or Cassandra, for example. In this case, there is no database schema defining the structure of the items, because there is no set structure. An example would be items that represent a customer and their purchase history. The information can vary from one customer to the next. When working with the data, a user is hampered by not knowing what information is present due to the lack of a single schema. For example, if only some of the data items have information about the customer's children, a user may not know that this information is available to work with unless they look at a data item which happens to contain information about children. However, unstructured or semi-structured data stores may hold hundreds, thousands, hundreds of thousands, or more of such documents, so inspecting documents individually to discover the available information is impractical.
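The difficulty described above may be illustrated with a minimal sketch. The customer documents, field names, and the helper `top_level_fields` below are hypothetical and chosen only for illustration; they are not taken from any particular database. The sketch shows that examining one document does not reveal fields, such as information about children, that appear only in other documents of the same collection.

```python
# Hypothetical customer documents; all field names and values are illustrative.
customers = [
    {"first name": "Ann", "age": 34, "purchases": [{"item": "book", "price": 12.5}]},
    {"first name": "Bob", "address": {"city": "Austin"}, "purchases": []},
    {"first name": "Cy", "age": 51, "children": [{"first name": "Dee", "age": 9}]},
]


def top_level_fields(document):
    """Return the set of top-level field names present in one document."""
    return set(document.keys())


# Inspecting a single document reveals only that document's fields.
fields_in_first = top_level_fields(customers[0])

# Only a scan over every document reveals all fields in the collection.
fields_in_any = set()
for doc in customers:
    fields_in_any |= top_level_fields(doc)

# Fields a user would not learn about from the first document alone.
missed = fields_in_any - fields_in_first
print(sorted(missed))  # ['address', 'children']
```

With three documents a full scan is trivial, but the same manual inspection does not scale to a collection holding hundreds of thousands of documents.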
Accordingly, storing, accessing, and processing semi-structured data is a technical computing problem in need of a solution.