1. Field of the Invention
The present invention relates to data processing, and more specifically, to a method and apparatus for calculating correlations in mass data to generate a hypothesis, as well as to a corresponding computer program product.
2. Description of the Related Art
A hypothesis can describe the impact of various factors on a transaction processing procedure. For example, a smelting scheme for metallic zinc is typically evaluated by a number of factors, such as yield, smelting recovery ratio, water consumption, electricity consumption, and sulfuric acid consumption. Other factors are also involved during smelting, such as smelting method, temperature, pressure, response time, impurity content in the raw material, and equipment service time. In order to determine which factors are relatively important for improving the overall efficiency of zinc smelting, and to establish a hypothesis for studying the relation between these factors and the efficiency of zinc smelting, the multitude of factors must be comprehensively collected and the relationships between respective factors analyzed, which is laborious and time-consuming.
The premise of establishing a hypothesis is that the initial research direction on which the hypothesis is based is correct. For example, zinc smelting might be affected by hundreds or thousands of factors, so determining a correlation between each factor and zinc smelting yield would be a complex procedure. In existing solutions, sample data (for example, the values of respective factors during a zinc smelting procedure) are analyzed manually by seasoned experts, who establish a hypothesis based on their past experience and the collected sample data, e.g., research on the impact of temperature on yield.
The prior art has the following drawbacks. Relations between respective factors cannot be analyzed accurately; in particular, when the number of factors to be analyzed is extremely large (e.g., thousands), it is impossible to analyze these factors one by one by manual processing. In addition, because manual processing capability is limited, the amount of sample data that can be selected is rather limited. Since the accuracy of the analysis cannot be ensured, some important factors might be missed in a hypothesis, or some factors that are irrelevant or only weakly correlated could be mistaken for important factors and introduced into the hypothesis. For example, “equipment service time” might have a significant impact on the efficiency of smelting. However, if a hypothesis dedicated to the relationship between “equipment service time” and “yield” is established while “equipment service time” actually has little relation to “yield,” the hypothesis comes to nothing. Such an error might be caused by the neglect of a certain important factor or by the intervention of other factors. Once an unrealistic hypothesis is established, huge losses of manpower, material resources, and time will result.
In another example, the factors involved in the research and analysis of clinical data are more complex. Taking clinical data related to diabetes as an example, the factors can include: average daily dosage of insulin, last dosage of insulin, type of insulin, patient age, gender, nationality, education, and occupation. Each patient's clinical data is one item of sample data. In order to ensure accuracy, it is usually necessary to collect hundreds of factors and analyze the clinical data of thousands of patients. Suppose the data is stored in an ordinary two-dimensional table of rows and columns, where each column represents a factor and each row represents the sample data of one patient. It would be impossible to correctly analyze such a data table, comprising hundreds of columns and thousands of rows, using existing manual methods.
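By way of illustration only, the row-and-column layout described above can be sketched as follows, together with one simple way of scoring how strongly each factor column relates to a chosen target column. This is a minimal sketch under stated assumptions and is not the method of the present invention: the Pearson coefficient is merely one possible correlation measure, the function names `pearson` and `rank_factors` are hypothetical, and the numeric values are made up for demonstration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / sqrt(var_x * var_y)


def rank_factors(table, target):
    """Rank every column except `target` by absolute correlation with `target`.

    `table` is a list of rows; each row is a dict mapping factor name to value,
    i.e. one row per sample and one key per column.
    """
    columns = {name: [row[name] for row in table] for name in table[0]}
    scores = {
        name: pearson(values, columns[target])
        for name, values in columns.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up toy values for illustration only (not real smelting or clinical data).
table = [
    {"temperature": 1.0, "service_time": 5.0, "yield": 2.0},
    {"temperature": 2.0, "service_time": 1.0, "yield": 4.1},
    {"temperature": 3.0, "service_time": 4.0, "yield": 5.9},
    {"temperature": 4.0, "service_time": 2.0, "yield": 8.0},
]

ranking = rank_factors(table, "yield")
for name, score in ranking:
    print(f"{name}: {score:+.3f}")
```

With only two factor columns and four rows the ranking is trivial to inspect by hand; the difficulty described above arises because the same computation must be repeated across hundreds of columns and thousands of rows, which is impractical to do manually.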