This application relates generally to analyzing data using machine learning algorithms to develop prediction models for generalization, and more particularly to applying iterative machine learning and other analytic algorithms directly to grouped data instances in databases.
Companies and other enterprises store large amounts of data, generally in large distributed data stores (databases), and the successful ones use the data to their advantage. The data are not simply facts such as sales and transactional data. Rather, the data may comprise all relevant information within the purview of a company which the company may acquire, explore, analyze and manipulate while searching for facts and insights that can lead to new business opportunities and leverage for its strategies. For instance, an airline company may have a great deal of data about ticket purchases and sometimes even about traveling customers, but this information in and of itself does not permit an understanding of customer behavior or answer questions such as what motivates customers to purchase tickets, and it does not afford the company the insight to make predictions that take advantage of those motivations. To accomplish this, the company may need to run various analytics and machine learning algorithms (processes) on its data to derive models which can provide insight into the data and afford generalization.
Database systems typically store data in data structures such as tables, and use query languages such as Structured Query Language (SQL) and the like for storing, manipulating, and accessing the data. Unfortunately, except for rather simplistic analytics such as max, min, average, sum, etc., SQL and other query languages cannot perform more complex analytics on data or run machine learning algorithms such as regression, classification, etc., which attempt to make predictions based upon generalizations from representations of data instances. Moreover, most machine learning algorithms require iteration on data, which SQL cannot do. This means that such analytics must be run by other programs and processes that may not operate within the database or interface well with SQL.
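The contrast drawn above can be sketched in a few lines. The following is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for a database, and the table name `observations`, its columns, and the learning-rate and iteration settings are hypothetical choices made for this example. The simple aggregates are computed by a single SQL query, while the regression fit is obtained by a gradient-descent loop that runs in client code outside the database.

```python
import sqlite3

# Hypothetical table and column names, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?)",
    [(float(i), 2.0 * i + 1.0) for i in range(10)],
)

# Simple aggregates are expressible directly in a single SQL query.
min_y, max_y, avg_y = conn.execute(
    "SELECT MIN(y), MAX(y), AVG(y) FROM observations"
).fetchone()

# Fitting a regression line by gradient descent needs repeated passes
# over the rows; in this sketch the loop runs in the client program.
rows = conn.execute("SELECT x, y FROM observations").fetchall()
conn.close()

n = len(rows)
slope, intercept = 0.0, 0.0
learning_rate = 0.01  # assumed value, small enough for this data to converge
for _ in range(5000):
    # Gradients of the mean squared error with respect to each parameter.
    grad_slope = sum(2 * (slope * x + intercept - y) * x for x, y in rows) / n
    grad_intercept = sum(2 * (slope * x + intercept - y) for x, y in rows) / n
    slope -= learning_rate * grad_slope
    intercept -= learning_rate * grad_intercept
```

In this sketch every row is first copied out of the database before the iterative fit can begin, which is the kind of separation between the query language and the analytic process that the paragraph above describes.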
Furthermore, since data is typically stored in a database by mixing together and storing a variety of data elements having different parameters and values, it may be necessary to redistribute the data to group common elements together for analysis. While data may be redistributed using a SQL GROUP BY operation, data redistribution is expensive and undesirable. It is time-consuming, and it requires physically moving data, which incurs high overhead and carries the risk of data loss or corruption.
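A toy simulation can make the cost of redistribution concrete. The sketch below is not a model of any particular database: the two storage segments, the group keys "A" and "B", and the placement rule mapping each group to one segment are all assumptions made for illustration. It gathers rows of the same group onto one segment and counts how many rows have to leave the segment on which they were originally stored.

```python
from collections import defaultdict

# Hypothetical rows of (group_key, value), spread over two storage
# segments with no regard to the group key.
segments = [
    [("A", 1.0), ("B", 2.0), ("A", 3.0)],
    [("B", 4.0), ("A", 5.0), ("B", 6.0)],
]

# Assumed placement rule for this sketch: each group is gathered
# onto a single segment.
target_segment = {"A": 0, "B": 1}

redistributed = defaultdict(list)
rows_moved = 0
for source, rows in enumerate(segments):
    for key, value in rows:
        dest = target_segment[key]
        redistributed[dest].append((key, value))
        if dest != source:
            # This row must be physically copied to another segment.
            rows_moved += 1
```

Even in this six-row example a third of the rows change segments; with mixed storage of many groups across many segments, the proportion of rows that must move would generally be larger.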
As a result, no convenient, easy-to-use approaches are available for safely and efficiently running data analytics and machine learning algorithms on stored data within a database to derive models that characterize the data and afford insight into the factors underlying the data to permit generalization and predictions.
It is desirable to provide systems and methods that enable various analytic and machine learning processes to be applied directly to groups of data within a distributed database, without the necessity of redistributing the data, in order to analyze the data and derive models that characterize the data and that can be used for generalizations and predictions. It is to these ends that the present invention is directed.