Large data sets may vary in size and organizational structure. As online and electronic transactions grow in popularity, the volume of data collected continues to grow, and big data sets are larger than ever. For example, a single table may hold billions of records (also referred to as rows) and hundreds of thousands of columns. In some instances, the large volume of data may be collected in a raw, unstructured, and undescriptive format.
Data may be ingested into big data storage formats to convert raw binary files into a format usable by analysts. Distributed processing systems may be limited by incoming file formats and by ingestion systems that restrict processing parallelism. Typically, distributed processing systems split input files using a record delimiter. Converting binary files into delimited files and then splitting and processing them may cause the same files to be processed multiple times. Additional processing typically means additional processing time. As a result, ingestion systems may not scale efficiently in big data environments that ingest binary data files.
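The two-pass cost described above can be illustrated with a minimal Python sketch. It is not drawn from any particular ingestion system. The fixed-width record layout (`RECORD`), the function names, and the sample rows are all hypothetical and chosen only for illustration. The sketch assumes a binary input with no record delimiter, which is first converted in full to newline-delimited text and then scanned a second time to locate split boundaries.

```python
import struct

# Hypothetical layout (assumption): 4-byte big-endian id + 8-byte ASCII name.
RECORD = struct.Struct(">I8s")


def make_binary(rows):
    """Pack rows into one binary blob that contains no record delimiter."""
    return b"".join(RECORD.pack(i, name.encode("ascii")) for i, name in rows)


def convert_to_delimited(blob):
    """Pass 1: read the whole binary input and emit newline-delimited text."""
    lines = []
    for offset in range(0, len(blob), RECORD.size):
        rec_id, raw_name = RECORD.unpack_from(blob, offset)
        name = raw_name.rstrip(b"\x00").decode("ascii")  # drop struct padding
        lines.append(f"{rec_id},{name}")
    return "\n".join(lines) + "\n"


def split_on_delimiter(text, n_splits):
    """Pass 2: scan the delimited text again and cut it at record boundaries."""
    records = text.splitlines()
    size = max(1, -(-len(records) // n_splits))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


rows = [(1, "alpha"), (2, "beta"), (3, "gamma"), (4, "delta")]
blob = make_binary(rows)
text = convert_to_delimited(blob)      # first full pass over the data
splits = split_on_delimiter(text, 2)   # second full pass over the data
```

In this sketch every record is touched once during conversion and again during splitting, before any worker begins processing its split. That repeated handling is the extra processing time referred to above.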