The evolution of computers with respect to memory storage expansion and processing capabilities has enabled massive amounts of data to be accumulated and analyzed by complex and intelligent algorithms. For instance, given an accumulation of data, algorithms can analyze such data and locate patterns therein. These patterns can then be extrapolated from the data, persisted as content of a data mining model or models, and applied within a desired context. With the evolution of computers from simple number-crunching machines to sophisticated devices, services can be provided that range from video/music presentment and customization to data trending and analysis.
Data mining involves searching through large amounts of data to uncover patterns and relationships contained therein. In the data mining world, at least two operations are performed on data specified by the client. These operations are training (finding patterns in client data) and prediction (applying such patterns to infer new or missing knowledge about client data). For example, data mining can be used to explore large volumes of detailed business transactions, such as credit card transactions, to determine the most influential factors common to non-profitable customers.
One way of accomplishing this is to employ a single monolithic application that loads the data and retains the data in memory for the prediction engine. That is, the prediction engine is trained using the in-memory data. A score can also be associated with the in-memory data. Thus, the application is essentially a black box that receives the data as input and contains the logic to generate numeric results from that data. The output can be a set of rules that describes the resulting data, and/or a score that is associated with each entry of the in-memory data. This configuration is most suitable for client machines, since holding and processing the data in memory on a server would degrade server performance.
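The monolithic, in-memory arrangement described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names `train` and `score`, the sample rows, and the choice of a simple majority-outcome rule per attribute value as the "pattern". It is not the algorithm of any particular data mining product; it only shows data held in memory, training over that data, and the two kinds of output (a set of rules and a score per entry).

```python
from collections import Counter, defaultdict


def train(rows, attribute, target):
    """Training: for each attribute value, keep the most common target outcome.

    Returns a dict mapping attribute value -> (predicted label, confidence).
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attribute]][row[target]] += 1
    rules = {}
    for value, outcomes in counts.items():
        label, hits = outcomes.most_common(1)[0]
        rules[value] = (label, hits / sum(outcomes.values()))
    return rules


def score(rows, rules, attribute):
    """Prediction: attach a predicted label and a score to each in-memory entry."""
    return [
        dict(row, predicted=rules[row[attribute]][0], score=rules[row[attribute]][1])
        for row in rows
    ]


# Hypothetical in-memory data set; the column names are illustrative only.
data = [
    {"card": "gold", "profitable": "yes"},
    {"card": "gold", "profitable": "yes"},
    {"card": "gold", "profitable": "no"},
    {"card": "basic", "profitable": "no"},
]

rules = train(data, "card", "profitable")   # output 1: a set of rules
scored = score(data, rules, "card")         # output 2: a score per entry
```

Because the data, the training logic, and the scoring all live in one process, nothing has to be staged elsewhere, but the whole data set must fit in that process's memory.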
Traditionally, developers of embedded and/or pipeline data mining applications were required to transfer their data to a relational data source, execute the training and/or prediction statements against the relational data source, and then delete the data from the relational data source. In addition to the extra complexity and impact on system performance that accompany such an operation, the data source approach also raised security issues in certain scenarios. If the connection to the data mining server is made over an HTTP (Hypertext Transfer Protocol) connection (or some other connection from outside the server's domain), then finding a relational data source that is accessible to both the server and the client application can be a problem.
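The three-step traditional workflow (transfer, execute, delete) can be sketched as follows. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the relational data source, the table name `staging` and the function `train_via_staging` are hypothetical, and "training" is reduced to counting how often each row occurs. A real deployment would involve a separate client, server, and database, each with its own credentials.

```python
import sqlite3
from collections import Counter


def train_via_staging(conn, client_rows):
    """Stage client data in a relational table, train from it, then clean up."""
    # Step 1: the client needs write access to copy its data into the data source.
    conn.execute("CREATE TABLE staging (a TEXT, b TEXT)")
    conn.executemany("INSERT INTO staging (a, b) VALUES (?, ?)", client_rows)
    conn.commit()

    # Step 2: the mining server needs read access to fetch the staged rows.
    # Counting row frequencies is a stand-in for real training.
    pattern = Counter(conn.execute("SELECT a, b FROM staging").fetchall())

    # Step 3: the staged copy is deleted once training has finished.
    conn.execute("DROP TABLE staging")
    conn.commit()
    return pattern


# SQLite in-memory database used as a stand-in relational data source.
conn = sqlite3.connect(":memory:")
pattern = train_via_staging(conn, [("x", "1"), ("x", "1"), ("y", "0")])

# After the workflow, no staging table should remain in the data source.
remaining = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
```

Even in this reduced form, the data is written once, read once, and deleted once purely to make it reachable by the training step, which is the overhead the paragraph above describes.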
In one conventional data mining engine that processes relational data (e.g., SQL Server data mining), the data can only be fetched from relational data sources. Hence, a data mining statement involving external data is composed using an OPENROWSET function, which allows description of a relational statement and a data source against which it is executed. Following is an example of a conventional training statement:
INSERT INTO [Model] ('A', 'B')
OPENROWSET('SQLOLEDB.1',
    'Provider=SQLOLEDB.1; Data Source=MyRBMSServer; Initial Catalog=MyCatalog;',
    'SELECT a, b FROM MyTable')
Users are required to store their data in a relational data source, and then point the data mining server to that relational data. This means that different kinds of applications are employed to arrive at an enhanced set of data. Moreover, it is extremely problematic to train a mining model to output a set of rules and/or scoring unless the data is first cached or staged in the relational database. As indicated supra, this is time consuming and raises security issues. Additionally, this involves a third entity, the relational data source, to which both of the other parties (the client that has the data and the analysis server) need access. The client must be able to write to the relational data source, and the server must be able to read from it. Thus, there is a substantial unmet need in the art for an improved data mining mechanism.