Computer data stored on a system may be structured, such as when it is stored in a relational database, or unstructured. Structured data may be analyzed and processed using database metadata or schemas, such as table definitions. Conventional wisdom holds that a schema comes first and the data follows. For example, a database table cannot be populated before it is defined. Once the table is defined, data may be loaded into it. This is not necessarily true, however, with regard to big data. Unstructured data may arise from web logs, user transactions, camera feeds, sensor outputs, blogs, or many other sources. This data may need to be processed and analyzed without being stored in a traditional relational database management system, and it is often received before a schema has even been defined.
To analyze structured data, a metastore may be used to store the schemas. These schemas may be retrieved as queries are processed. Typically, the schemas for structured data are created before the data is stored. For example, a CREATE TABLE statement may be executed to create a table definition. This table definition may then be stored in the structured data's metastore.
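The schema-first flow described above can be sketched as follows. This is only an illustrative sketch: it assumes an in-memory SQLite database whose built-in catalog table, sqlite_master, stands in for a structured-data metastore, and the table name, column names, and sample row are hypothetical.

```python
import sqlite3

# Assumption: SQLite's built-in catalog (sqlite_master) stands in for a
# structured-data metastore. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")

# Schema first: the table definition must exist before any row is stored.
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, user_name TEXT, amount REAL)"
)

# Data follows the schema.
conn.execute(
    "INSERT INTO transactions (user_name, amount) VALUES (?, ?)",
    ("alice", 19.99),
)

# Retrieve the stored table definition, as a query processor might do
# while handling a query against this table.
stored_definition = conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
    ("transactions",),
).fetchone()[0]

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(transactions)")]

print(stored_definition)
print(columns)
```

An INSERT issued before the CREATE TABLE statement would fail in this sketch, which is the schema-first constraint the paragraph describes.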
Similarly, unstructured data may be processed using schemas from its own metastore. The schemas contained in the unstructured data's metastore may be applied to the unstructured data as needed, and may be defined after the unstructured data has been created and stored. For example, a schema may be created that defines a structure for a large pre-existing log file. The schema may, for example, identify the locations of a date, a log level, and a log statement within a single line of text.
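Applying a schema to log data that already exists might look like the sketch below. It is hypothetical throughout: the log line layout, the regular expression that serves as the schema, the apply_schema helper, and the sample lines are all assumptions made for illustration and do not come from any particular metastore.

```python
import re

# Hypothetical schema, defined after the log data already exists. It
# locates a date, a log level, and a log statement in one line of text.
LOG_LINE_SCHEMA = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<statement>.*)$"
)


def apply_schema(line):
    """Return the line's fields as a dict, or None if it does not fit."""
    match = LOG_LINE_SCHEMA.match(line)
    return match.groupdict() if match else None


# Made-up sample lines standing in for a large pre-existing log file.
raw_lines = [
    "2013-03-01 12:00:05 ERROR disk quota exceeded",
    "2013-03-01 12:00:09 INFO retry scheduled",
]

# The schema is applied at read time; the stored text is never altered.
records = [apply_schema(line) for line in raw_lines]

for record in records:
    print(record)
```

Because the raw lines are left untouched, a different schema could later be applied to the same file without reloading it.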
There is a need, therefore, for an improved method, article of manufacture, and apparatus for accessing schemas stored in structured and unstructured data metastores.