Currently, database administrators and data analysts must work closely together to transform and analyze data in relational databases. Together, they go through a process involving several iterations of meetings, ad hoc structured query language (SQL) and extract-transform-load (ETL) scripting, exporting data to proprietary data mining tools, and using those tools to build predictive models. Once the models have been built, they must be imported back into a database environment for deployment. Often the analysts who create the predictive models do not have the knowledge and skills necessary to write the data transformation code themselves, which requires familiarity with ETL and SQL tools as well as with the relational schema of the database that contains the raw data.
FIG. 1 illustrates a conventional related art method of transforming data residing in relational databases into forms suitable as input to predictive modeling tools. Referring to FIG. 1, at Step 1 a business expert and a data analyst work together to define a business problem to be solved. At Step 2 a team or other group of persons defines the data requirements. Step 2 is typically performed by a data analyst, a business expert, and an information technology (IT) person. At Steps 3 through 5 the team or group of persons carries out an iterative process in which, for example, an analyst and an IT person work together to prepare relational data for input to a modeling tool. An analyst iteratively builds a model in Steps 6 and 7. The model is then deployed in Step 8 with the help of an IT person.
These method steps demand a very high degree of interaction among IT people, data analysts, and business experts. In most cases, no single person possesses all of the expertise necessary to carry out the process alone. A significant amount of interpersonal communication and coordination must therefore take place, which can introduce significant delays in the process. In addition, many of the method steps must be carried out manually and are quite time-consuming, which further increases delays.
This process has several drawbacks. First, the data preparation step involves an analyst communicating his or her requirements to a database administrator, and then having the database administrator code the data transformations and make the resulting transformed data suitable for mining. Information required for coding the data transformations must be drawn from several sources, including metadata in relational and online analytical processing (OLAP) repositories, schema knowledge in the minds of the database administrators, and modeling knowledge in the minds of analysts. The time spent “preparing data” can be as high as 80-90% of the total time spent in conducting predictive modeling projects. Second, manual coding of data transformations is extremely complex and error-prone. Third, no existing tools provide process automation for data transformations to support predictive modeling and model deployment.
Rosella data mining software, available from Scion Analytics, makes use of star schemas in predictive modeling. However, Rosella appears to allow only dimension tables to be joined to fact tables. Rosella does not perform aggregations.
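By way of illustration only, the distinction between joining a dimension table to a fact table and aggregating fact rows can be sketched as follows. This is a minimal sketch using Python's standard sqlite3 module. The table names (customer_dim, sales_fact), the column names, and the sample rows are hypothetical and are not drawn from Rosella or from any other tool discussed above. A join alone leaves one row per fact record, whereas an aggregation collapses the fact records into one row per customer, a form commonly used as input to a modeling tool.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory star schema (hypothetical tables and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer_dim (customer_id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE sales_fact (
            sale_id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            amount REAL
        );
        INSERT INTO customer_dim VALUES (1, 'East'), (2, 'West');
        INSERT INTO sales_fact VALUES (10, 1, 20.0), (11, 1, 30.0), (12, 2, 5.0);
        """
    )
    return conn


def join_only(conn):
    """Join the dimension table to the fact table: one row per sale."""
    return conn.execute(
        """
        SELECT f.sale_id, d.region, f.amount
        FROM sales_fact AS f
        JOIN customer_dim AS d ON d.customer_id = f.customer_id
        ORDER BY f.sale_id
        """
    ).fetchall()


def aggregate_per_customer(conn):
    """Aggregate fact rows up to the dimension: one row per customer."""
    return conn.execute(
        """
        SELECT d.customer_id, d.region, COUNT(f.sale_id), SUM(f.amount)
        FROM customer_dim AS d
        LEFT JOIN sales_fact AS f ON f.customer_id = d.customer_id
        GROUP BY d.customer_id, d.region
        ORDER BY d.customer_id
        """
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(join_only(conn))               # three rows, one per sale
    print(aggregate_per_customer(conn))  # two rows, one per customer
```

In this sketch the join returns three rows for two customers, so the customer in the East region appears twice. The aggregated query returns exactly one row per customer, with a sale count and a total amount.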