Large-scale data processing may include extracting data of interest from raw data in one or more databases and processing it into a data product. These databases may store a vast number of datasets. As an example, many large-scale machine learning algorithms repeatedly process the same training data which may include 100s of billions of examples and 100s of billions of features. This training data is typically represented as text strings or identifications for interoperability and flexibility. However, when processing these vast amounts of training data, text strings and/or identifications may not be well suited for efficient input-output and in-memory usage.