Large data sets may exist in various sizes and levels of organization. As online and electronic transactions grow in popularity, the volume of data collected continues to grow, and big data sets are larger than ever. For example, a single table may contain billions of records (also referred to as rows) and hundreds of thousands of columns. In some instances, the large volume of data may be collected in a raw, unstructured, and undescriptive format.
Ingesting big data sets may be a cost-intensive process. In fact, processing inputs may account for 50% or more of the time costs associated with using big data sets. The intake process may include numerous steps conducted with parallel processing and non-trivial user oversight. For example, a big data system may intake 100,000 records. The records may be distributed equally across 4 machines, with each machine processing 25,000 records.
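The equal distribution described above can be illustrated with a minimal sketch. The function name `partition_evenly` and the use of an in-memory list of integers as stand-in records are assumptions made for illustration only; an actual big data system may distribute records by other means.

```python
def partition_evenly(records, num_machines):
    """Split records into num_machines contiguous chunks of near-equal size.

    Hypothetical helper: when the record count does not divide evenly,
    the first few chunks each receive one extra record.
    """
    base, extra = divmod(len(records), num_machines)
    chunks = []
    start = 0
    for i in range(num_machines):
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks


# Stand-in data: 100,000 placeholder records split across 4 machines.
records = list(range(100_000))
chunks = partition_evenly(records, 4)
chunk_sizes = [len(chunk) for chunk in chunks]
```

With 100,000 records and 4 machines, each chunk holds 25,000 records, matching the example in the text.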
In addition to parallel processing, big data systems may use a variety of intake approaches and/or algorithms that require user management to input big data sets. Users may provide code to identify data and place it in a usable form. Users may also oversee intake data migration and processing to identify and handle errors. The manual nature of big data processing tends to increase the time spent on data intake.
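As a hedged illustration of the kind of user-provided code and error oversight described above, the following sketch parses raw lines into structured records and sets aside the lines that fail so a user can review them. The `id,amount` line format and the names `parse_record` and `ingest` are hypothetical and are not drawn from any particular system.

```python
def parse_record(raw_line):
    """Hypothetical user-provided parser for lines shaped like 'id,amount'."""
    record_id, amount = raw_line.split(",")
    return {"id": int(record_id), "amount": float(amount)}


def ingest(raw_lines, parser):
    """Apply the parser to each line, collecting failures for manual review."""
    parsed, errors = [], []
    for line_number, raw_line in enumerate(raw_lines, start=1):
        try:
            parsed.append(parser(raw_line))
        except ValueError as exc:
            # Malformed lines are recorded rather than halting the intake.
            errors.append((line_number, raw_line, str(exc)))
    return parsed, errors


# Illustrative input: two well-formed lines and two malformed lines.
sample = ["1,9.99", "2,not-a-number", "3,4.50", "malformed"]
parsed, errors = ingest(sample, parse_record)
```

In this sketch, every rejected line requires a user to inspect the error list and decide how to correct or discard the data, which is one way the manual effort described above may accumulate.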