Data science, in general terms, is the extraction of information from large volumes of unstructured data, called data sets. Because of the complexities and large amounts of data involved in data science operations, experts, such as trained data scientists, are typically needed to perform operations on these massive data sets. Further, trained data scientists commonly require sophisticated computing processes, hardware, and software to perform data science operations. Even with the proper tools and instruments, data scientists still face numerous challenges when working with large data sets and preforming data science operations.
To illustrate, to work with a large data set, a data scientist must first provision a dedicated storage space for the data set. Even with recent advancements in computer storage, finding dedicated storage space for large data sets can be difficult. Next, the data scientist must manually clean the data, which can involve editing the formatting and structure of thousands of lines of data to ensure proper readability of the data set. Then, upon cleaning the data set, the data scientist can run algorithms on the data. Before running a data science algorithm, however, the data scientist often needs to manually program (i.e., code) the algorithm, which requires the data scientist to be knowledgeable in computer programming.
The advent of general-purpose frameworks for large-scale data science computations has improved data science by standardizing and simplifying the above described process of handling large data sets. Nevertheless, data scientists using complicated processes are still needed in order to operate on data and implement algorithms. Further, while many data science techniques include elements that are becoming more standardized (e.g. data cleaning and/or normalization), these general-purpose frameworks remain too complex to enable many users, including data scientists to successfully use these frameworks.
As such, in the field of data science, there remains a need for an improved framework to perform data science operations. In particular, current data science techniques require large computing power and timeframes and are otherwise inefficient and inflexible. These and other problems exist with regard to current and traditional data science techniques.