1. Field of the Invention
The present invention relates to simple and complex data analytics and simple and complex machine learning or data mining, including methods, systems, and processes for enabling non-expert users to use and apply complex analytics to a particular task or query and to use and benefit from the results generated through the use of analytics. This invention and its various embodiments provide data mining and analytics solutions, including but not limited to predictive analytics, clustering, text mining, collaborative filtering, operations research methods, heuristics, and other similar analytic solutions. This invention and its various embodiments also provide data management solutions, such as data fusion, integration, transformation, de-normalization, historical preservation, preparation, derivation of statistical metrics, and other data management solutions. While the description of the invention often refers to use of predictive models to illustrate functionality of the invention, it will be appreciated that similar analytic techniques, such as the ones mentioned above, can be supported by this invention as well.
2. Description of Related Art
Analytics has recently gained much attention as a means of solving highly complex problems. Analytics essentially encompasses the determination of correlations in large-scale data. For example, in a business setting, a typical data analytics system processes many millions of records very quickly for thousands of different products and for thousands of different customers. While a human expert may be able to learn about a limited number of products and a limited number of different customers, no human expert would be able to understand patterns across the whole breadth of a multi-billion-dollar, multinational enterprise. Further, human experts can have biases due to personal experiences, while analytics can reduce the influence of such biases in producing solutions and results.
The application of analytics to a business task or query involves collecting data about the problem; storing, managing, integrating, and preparing such data; and applying mathematical, quantitative and statistical analysis, and predictive modeling. As a result of applying analytics, organizations can better understand business needs and issues, discover causes and opportunities, predict risk levels and events, take steps to prevent risks and events, and perform other similar activities that are beneficial to the organization. It should be appreciated that analytics are used not only in the commercial for-profit sector but also in government, the non-profit sector, and by individuals to evaluate and solve many different types of problems or queries.
Examples of applications of analytics technology include the selection by a search engine of web pages and ad results in response to user queries, identification of tax returns with a high risk of fraud or inaccuracies, and selection of communications for further screening for terrorist threats. Further examples include categorization of retail customers to tailor offers and customer care to maximize customer loyalty and profit, prediction of equipment outages by computer manufacturers or data center operators, prediction of response and purchase propensities for marketing, and many other diverse applications.
Currently in the art, the use of analytics requires two general steps: data management and data mining. The data management phase requires the user to identify data that may be located in various disparate source systems within their organization. A typical complex case involves a large enterprise that may have a large number of source systems that are not integrated, but each of these source systems may separately provide different pieces of information that the customer would like to analyze. Typically, experts in data management spend a large amount of time and effort to build customized software to implement connections to these disparate systems, extract the information or data, and then integrate the data. Furthermore, such customized software development typically requires a large amount of interaction between the data management experts and the technical experts familiar with the various source systems. Once the data is integrated, the source or raw data typically requires further augmentation or manipulation to prepare the data for use by the data mining operations. Such work is typically very resource intensive in terms of time and financial cost.
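By way of illustration only, the extraction, integration, and preparation work described above can be sketched in a few lines of Python. The two source systems, their field names (`customer_id`, `cust`, `amount`, `region`), and the derived metrics are hypothetical assumptions chosen for the example; they do not describe any particular enterprise system, and a real integration would involve many more sources and formats.

```python
import csv
import io

# Hypothetical extract from a CRM system, already parsed into dictionaries.
CRM_ROWS = [
    {"customer_id": "C1", "region": "EU"},
    {"customer_id": "C2", "region": "US"},
    {"customer_id": "C3", "region": "US"},
]

# Hypothetical extract from a billing system, delivered as CSV text
# with a different column name for the customer key.
BILLING_CSV = """cust,amount
C1,120.50
C1,79.50
C2,40.00
"""


def extract_billing(csv_text):
    """Parse the billing extract and normalize its column names and types."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{"customer_id": r["cust"], "amount": float(r["amount"])} for r in reader]


def integrate(crm_rows, billing_rows):
    """Join both sources on customer_id and derive per-customer metrics."""
    totals = {}
    counts = {}
    for row in billing_rows:
        cid = row["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + row["amount"]
        counts[cid] = counts.get(cid, 0) + 1
    table = []
    for row in crm_rows:
        cid = row["customer_id"]
        table.append({
            "customer_id": cid,
            "region": row["region"],
            "total_spend": totals.get(cid, 0.0),   # derived metric
            "num_invoices": counts.get(cid, 0),    # derived metric
        })
    return table


# One prepared row per customer, ready to hand to a data mining step.
prepared = integrate(CRM_ROWS, extract_billing(BILLING_CSV))
```

Even in this toy case, the mismatch in key names and the customer with no billing records must be handled by hand, which suggests why such work becomes costly when many source systems are involved.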
At the data mining phase, a new set of experts skilled in mathematical analysis algorithms, such as neural networks, decision trees, and nonlinear regressions, typically performs data mining on the prepared data. Data mining experts typically create or use custom applications written in programming languages such as Java, C++, Python, or R, or use data mining workbenches or graphical user interface-based (GUI-based) tools, such as WEKA, SAS, and SPSS CLEMENTINE, to read, analyze, transform, and derive data from one or more data tables and to develop models. Data mining experts then set up experiments to evaluate the effectiveness of such models; however, this requires a data mining expert to configure the data mining tool manually. Further, the data mining expert must evaluate and compare models to select a satisfactory model to deploy. In some cases, a data mining expert may select various models that work together in concert, which the expert must also configure manually. Accordingly, the data mining phase is also very labor-intensive and costly.
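As a minimal sketch of the manual evaluate-and-compare step described above, the following Python example trains two simple candidate models on hypothetical synthetic data, scores each on a held-out set, and selects the better one. The data generator, the two model types (a majority-class baseline and a one-split decision stump), and all names are assumptions made for illustration; they stand in for the far richer models and experiments an expert would configure in practice.

```python
import random


def make_data(n, seed):
    """Generate hypothetical records: label is 1 when x > 0.6, with 10% label noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x > 0.6 else 0
        if rng.random() < 0.1:
            y = 1 - y  # flip the label to simulate noisy data
        rows.append((x, y))
    return rows


def fit_majority(train):
    """Baseline model: always predict the most common training label."""
    ones = sum(y for _, y in train)
    label = 1 if ones * 2 >= len(train) else 0
    return lambda x: label


def fit_stump(train):
    """One-split decision stump: choose the threshold with the best training accuracy."""
    best_t, best_acc = 0.0, -1.0
    for t in [i / 20 for i in range(21)]:
        acc = sum((1 if x > t else 0) == y for x, y in train) / len(train)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return lambda x: 1 if x > best_t else 0


def accuracy(model, rows):
    """Fraction of rows for which the model predicts the recorded label."""
    return sum(model(x) == y for x, y in rows) / len(rows)


data = make_data(400, seed=0)
train, holdout = data[:300], data[300:]

# Train each candidate, score it on held-out data, and keep the best scorer.
candidates = {"majority": fit_majority(train), "stump": fit_stump(train)}
scores = {name: accuracy(model, holdout) for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
```

Each choice in this sketch (the split between training and held-out data, the candidate models, the scoring metric) is the kind of decision that currently requires a data mining expert to make and configure by hand.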
In some cases, even after results become available in the data mining phase, a return to the data management phase may be required. For example, if a data mining expert decides to look at additional data that was not previously analyzed in the data mining phase, the data mining experts must interface with data management experts who, in turn, must update the data integration and preparation process to prepare a new set of data for use by the data mining experts.
Further, data management experts may use different applications from the applications and workbenches used by data mining experts. For example, a typical workbench for data management experts may include ETL (i.e., extract, transform, and load) tools and data warehousing tools, such as INFORMATICA and AB INITIO, to build the data management capabilities, whereas data mining experts may use tools such as those noted above, for example WEKA, CLEMENTINE, SAS, and STATISTICA.
It would be an advantage in the art to eliminate or bypass these expensive expert interfaces and workbenches and to provide a tool that seamlessly integrates data management and data mining. In other words, it would be advantageous to be able to take a user's data and generate inputs for data mining that can be used regardless of the format of the original data and that can be passed to any data mining algorithm or model without the expense of data management and data mining experts customizing the connection between the data and the data mining algorithms and models. That is, it would be beneficial to integrate the two complex steps of data management and data mining into a single product so that users would not need expensive data management or data mining experts, thereby saving time and costs.
It would also be an advantage in the art to provide a product that operates with standard platforms that users may use for data management and data mining and that can be easily installed and operated by users who are not experts in data management and/or data mining. For such standard platform implementations, it would be an advantage to provide solutions in a fully automated fashion and to be able to develop a data mining model, execute that model, and interpret the results from start to finish without requiring the intervention of an expert. It would be a further advantage to provide a product that easily adapts to custom installation and custom solutions with minimal user intervention.
For users with existing data management and data mining interfaces and workbenches, it would be beneficial to have a tool that automatically feeds improved information or data to such interfaces and workbenches. In other words, it would be beneficial for users, at their option, to reduce, continue, or expand their existing data management and data mining activities using a tool that interfaces with such existing activities and that flexibly supports users with various needs. For example, a user could perform data management and create new data inputs that could be used to build better data mining models. Similarly, a user could perform data mining and create new models that provide predictions that could be used as new inputs to build even better data mining models.
Since the current state of the art requires extensive analysis, design, implementation, testing, and deployment work that is performed by human experts, there is a high risk of introducing defects or “bugs” into the process of developing an analytic capability. It would be an advantage to automate much or all of the process by providing an analytic capability such that the risk of defects is greatly reduced, which is a significant value to potential users.
In addition to providing automated data analytic solutions for standard and/or custom platforms, it would be a further advantage to be able to improve analytic models over time. That is, data mining could be applied to the data mining models themselves to improve and optimize the analytic models, without requiring the user to be an expert in applying complex data management or data mining techniques (such as metadata management, data quality management, data warehousing, genetic algorithms, neural networks, etc.).