1. Technical Field
The present invention relates to a method and apparatus for data analysis, and is particularly, but not exclusively, suited to selecting methods for analysing data.
2. Related Art
In many situations, information is derived from previously observed data. Decisions and recommendations that depend on that information therefore depend on the way in which the data is analysed. For example, in forecasting the weather, predicting the behaviour of stock markets, identifying prospective customers, or recognising objects and patterns in images, previously observed data is analysed and used as the basis for predictions and classifications.
Data analysis always has some objective, which is typically framed in the form of one or more questions to be answered. Examples of such questions include: Are there relevant structures in the data? Are there anomalous records? How can the data conveniently be summarised? Are these two groups different? Can the value of this attribute be predicted from measured values?
Recent advances in computer technology not only allow us to gather and store a continually increasing volume of data, but also enable us to apply an increasingly diverse range of analysis techniques in an attempt to understand the data. Such a diverse range of analysis techniques is a mixed blessing: in general, for a given set of data, several methods could be applied, each with subtle differences, preconditions or assumptions. Moreover, these methods often have rather complex interrelationships, which must be understood in order to exploit the methods in an intelligent manner.
Essentially, therefore, data analysis cannot be viewed as a collection of independent tools, and some a priori knowledge of the methods is required.
A further problem with data analysis is that questions relating to the data are usually not formulated precisely enough to enable identification of a single data analysis method, or a particular combination of data analysis methods. Very often new questions arise during the analysis process, as a result of the analysis process itself, and these typically require iterative application of other methods.
Typically, whoever (or, if the data analysis is performed automatically, whatever) analyses the data is not an expert in analysis methods per se: the analyst understands the application area, or domain, in which the data has been collected, but is not familiar with the workings of the analysis methods themselves. Geologists or physicians, for example, are not interested in the mathematical foundations of the analysis methods they apply to their data, but in the answers to questions such as where to drill for oil, or which treatment is best for a certain disease. This is quite a common situation: a driver is not expected to be capable of repairing a car, nor a computer user to understand the function of a central processing unit (CPU). The point is that data analysis is a practical area, and data analysis methods are nowadays used, with the help of the computer, as tools.
Known data analysis tools include statistical tools (e.g. SPSS: “SPSS 10.0 Guide to Data Analysis”, Marija J. Norusis, Prentice Hall, 2000, ISBN: 0130292044; Statistica: “Statistica Software”, Statsoft, International Thomson Publishers, 1997, ISBN: 0213097732). These statistical tools provide state-of-the-art statistics, but usually include only a few artificial intelligence or soft computing techniques. Specialised data mining tools (e.g. IBM Intelligent Miner, Data Engine, Clementine) provide some machine learning (ML) techniques, such as top-down induction of decision trees (TDIDT) or neural networks (NN), but are often weak in statistical methods.
Both the statistical and the data mining kinds of tools are method-oriented: they require the user to select an analysis method, which then fits a model to the data. The tools do not support an exploratory approach and do not suggest appropriate analysis methods to the user. In addition, these tools are unable to select analysis strategies automatically.