Traditionally, the methods, processes, and algorithms used in data science to extract insight from data have largely been driven by human input and intelligence. Administrators, such as data scientists, would look at a dataset and identify which columns of data might be meaningful and useful for analysis. Often, however, the dataset would include flaws (e.g., NULL values or ill-formatted values) that needed to be cleansed or fixed before the dataset could be processed. Additionally, the columns of the dataset might need to be normalized prior to processing, such as by re-scaling a numeric column to range from 0.0 to 1.0 instead of from an arbitrary lowest value to an arbitrary highest value.
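The cleansing and re-scaling steps described above can be illustrated with a minimal sketch. The function names (`parse_numeric`, `min_max_rescale`), the sample column, and the choice to leave flawed entries as missing values rather than impute them are assumptions made for illustration only; they are not drawn from any particular system.

```python
import math
from typing import List, Optional


def parse_numeric(raw: Optional[str]) -> Optional[float]:
    """Return the value as a float, or None for NULL or ill-formatted input."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        # TypeError covers None (NULL); ValueError covers text such as "n/a".
        return None
    # Treat NaN and infinities as flawed values as well (an assumption).
    return value if math.isfinite(value) else None


def min_max_rescale(values: List[Optional[float]]) -> List[Optional[float]]:
    """Re-scale the present values to the range 0.0 to 1.0; keep None as None."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    lo, hi = min(present), max(present)
    if hi == lo:
        # A constant column has no spread; map every present value to 0.0.
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - lo) / (hi - lo) for v in values]


# Hypothetical raw column containing a NULL and an ill-formatted value.
raw_column = ["12", None, "20", "n/a", "16"]
cleaned = [parse_numeric(r) for r in raw_column]
rescaled = min_max_rescale(cleaned)
print(rescaled)  # [0.0, None, 1.0, None, 0.5]
```

Here the lowest present value maps to 0.0 and the highest to 1.0. A data scientist might instead choose to drop or impute the flawed entries, which is exactly the kind of manual decision this paragraph describes.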
In addition to cleaning and normalizing columns in preparation for processing, data scientists may in some instances want to perform operations on a dataset to derive additional columns of data. A derived column is computed from one or more input columns of the dataset and serves as a new feature. As an example, a data scientist may elect to use two timestamps (such as the date a project proposal was submitted and the date the project proposal was accepted) to derive a new “proposal review period” column that is the difference between the two timestamp columns. As another example, a data scientist may elect to use a date column to derive the month of the year as a new column for the dataset. Such derived columns can be important in discovering patterns in the data.
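The two derivations described above can be sketched as follows. The row layout, the column names (`submitted`, `accepted`, `proposal_review_period_days`, `submitted_month`), and the sample dates are hypothetical, and expressing the review period in whole days is an assumption.

```python
from datetime import date
from typing import Any, Dict, List

# Hypothetical dataset: one dict per row, with two date columns.
rows: List[Dict[str, Any]] = [
    {"submitted": date(2023, 1, 10), "accepted": date(2023, 2, 1)},
    {"submitted": date(2023, 11, 28), "accepted": date(2023, 12, 5)},
]


def derive_features(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the row with two derived columns added."""
    derived = dict(row)
    # Derived from two input columns: the difference between the dates, in days.
    derived["proposal_review_period_days"] = (row["accepted"] - row["submitted"]).days
    # Derived from a single input column: the month of the year (1 to 12).
    derived["submitted_month"] = row["submitted"].month
    return derived


augmented = [derive_features(r) for r in rows]
for r in augmented:
    print(r["proposal_review_period_days"], r["submitted_month"])
# 22 1
# 7 11
```

Each derivation is a hand-written rule that someone had to think of and code, which illustrates why deriving features in this way depends on manual effort.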
Most machine learning algorithms are not designed to scale well to input datasets with a large number of features or columns, nor are they generally capable of deriving additional features from existing features without manual intervention. For example, techniques like linear regression can begin to fail with just a few dozen features. Furthermore, deriving additional features as described above has typically been driven by data scientists (i.e., humans) and can be time-consuming and very labor-intensive. The aggregate effect of these two factors is that far fewer features are used in data science than is optimal.