The present invention relates to a data analysis apparatus, a data analysis method, and computer products.
A yield analysis of semiconductor data will be explained as an example. Particularly in process data analysis, where the analysis result serves as reference data for deciding measures to improve quality and productivity, the accuracy and the reliability of the analysis are important. Applications related to this have already been filed by the present inventor (Application No., Japanese Patent Application No. 2000-41896 and Japanese Patent Application No. 2000-284578). In order to find a cause of yield decrease and take measures as soon as possible, a data analysis is performed to find a factor affecting the yield, and further factors affecting that factor, from the apparatus history, test results, design information and various measurement data.
In data analysis, the quantity to be analyzed, such as yield, is referred to as a target variable, and the apparatus history, test results, design information and various measurement data which become factors of the target variable are referred to as explanatory variables. Various statistical methods are applied to such an analysis. As one of these methods, data mining can extract a value, information or regularity which is difficult to discriminate from a large volume of various data.
In order to analyze defective factors of semiconductor devices, it is important to analyze the collected data multilaterally based on scientific grounds, and to extract more significant differences. For this purpose, the values of the original data stored in a computer system and their mean values have heretofore often been used. However, there may be a case where it is difficult to extract defective factors from the complicated original data group. In this case, if there is a characteristic data distribution related to various measurement results and the yield of chips in a wafer face and of wafers in a lot, the defective data may be analyzed based on this distribution.
In the conventional computer system, however, although original data related to, for example, the yield and electrical characteristic values is stored, the characteristic data distribution across a plurality of chips in the wafer face and a plurality of wafers in the lot is hardly ever stored. Therefore, engineers need to obtain the data distribution by editing the original data and using various statistical analysis tools and table creation tools. They also need to summarize the data and recognize its tendency by checking the obtained data distribution against their own experience and know-how. Therefore, it is difficult to grasp objectively the characteristic amount related to the distribution of the mass original data. There is also a problem in that accurate results cannot be obtained when the analysis is based on a characteristic amount of the data distribution that includes the subjective view of the engineer as described above.
Conventionally, engineers study the data distribution obtained by using various statistical analysis tools and table creation tools, and express the characteristic amount of the data distribution by a discrete value, for example, whether or not there is a certain feature in the distribution, whether the tendency of a certain feature is "increasing" or "decreasing", whether or not a certain feature has a periodicity of 2, or whether or not a certain feature has a periodicity of 3. Therefore, information representing the degree is lacking, for example, how strongly a certain feature is present (or absent), or how strong an increasing tendency (or decreasing tendency) a certain feature has. There is also a problem in that, in the case where a certain feature has both a periodicity of 2 and a periodicity of 3 to some extent, only the periodicity having the larger degree can be recognized.
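The contrast between a discrete judgment and a degree can be illustrated with a minimal sketch. The measure used here (the fraction of the variance of the wafer yields explained by grouping the wafers on their position modulo the period) is an assumption chosen for illustration and is not taken from this specification; the function name periodicity_degree and the yield values are likewise hypothetical.

```python
from statistics import mean


def periodicity_degree(values, period):
    """Return a degree in [0, 1]: the fraction of the total variance
    explained by grouping the values on (index mod period).
    Illustrative assumption, not a measure defined in the specification."""
    overall = mean(values)
    total_ss = sum((v - overall) ** 2 for v in values)
    if total_ss == 0:
        return 0.0  # constant data shows no periodicity at all
    between_ss = 0.0
    for phase in range(period):
        group = values[phase::period]  # every period-th wafer from this phase
        if group:
            between_ss += len(group) * (mean(group) - overall) ** 2
    return between_ss / total_ss


# Hypothetical yields (%) of 12 wafers in one lot, alternating high and low.
wafer_yields = [90, 70, 90, 70, 90, 70, 90, 70, 90, 70, 90, 70]
degree_2 = periodicity_degree(wafer_yields, 2)
degree_3 = periodicity_degree(wafer_yields, 3)
```

Because both degrees are continuous values computed from the same data, a feature having a periodicity of 2 and a periodicity of 3 to some extent would be represented by two non-zero numbers rather than by the stronger periodicity alone.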
Considering the various test results and measurement results, and the combinations thereof, the number of combinations of assumed data distribution characteristics becomes huge, and hence it is quite difficult to investigate all of these combinations. Further, the defective factors corresponding to the extracted data distribution characteristics are not always known, and much experience and know-how are required in order to discriminate unknown defective factors.
For example, even if data mining is actually applied to yield analysis of the semiconductor data, there are some cases in which it does not work well. In applications in the fields of finance and distribution, the number of data is huge, i.e., several millions, and the number of explanatory variables is several tens at most, so that analysis results with high accuracy can be obtained. In the case of the semiconductor process data analysis, however, the number of data is small (for a single product type there are only about 200 lots at most), while the number of explanatory variables reaches several hundreds (apparatus history, inter-step inspection, and the like). Hence, the explanatory variables are no longer independent of one another, and reliable results may not be obtained simply by performing data mining. The yield analysis of the semiconductor data will be explained briefly as an example.
In the process data analysis in which the number of explanatory variables (for example, LSI production step data) is large compared to the number of data (for example, the number of lots), there may be a case where a plurality of explanatory variables are confounded with each other (are no longer independent), making it difficult to sufficiently narrow down the problems by means of statistically significant differences. Even in the case where data mining (regression tree analysis) is applied, if this problem exists, it is necessary to spend time and effort confirming the accuracy of the analysis results and the range over which they are reliable.
FIG. 1 shows the relation between the lot flow and abnormal manufacturing apparatus. An outlined circle represents normal apparatus 101 and a black circle represents abnormal apparatus 102. An arrow shows the lot flow. Analysis of the inter-apparatus difference in the LSI production data extracts, from the data on the apparatus used in each step for each lot, which production apparatus and which production step most affect the yield.
FIG. 2 shows a yield distribution by apparatus (box and whisker chart) in a certain step using the conventional art. The yield value of the lot is displayed by the box and whisker chart for each apparatus used for each production step, so that confirmation is performed for each step, to thereby identify a step and apparatus having the most conspicuous difference.
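The per-apparatus summary behind such a box and whisker chart can be sketched as follows. This is only an illustration under stated assumptions: the function name box_stats_by_apparatus, the apparatus labels and the yield values are hypothetical, and the quartiles are those of Python's statistics.quantiles with its default method, which may differ from the convention of a particular charting tool.

```python
from collections import defaultdict
from statistics import median, quantiles


def box_stats_by_apparatus(records):
    """Group lot yields by the apparatus used in one step and return
    the five values a box and whisker chart is drawn from."""
    grouped = defaultdict(list)
    for apparatus, lot_yield in records:
        grouped[apparatus].append(lot_yield)
    stats = {}
    for apparatus, values in grouped.items():
        # quantiles() needs at least two lots per apparatus.
        q1, _, q3 = quantiles(values, n=4)
        stats[apparatus] = {
            "min": min(values),
            "q1": q1,
            "median": median(values),
            "q3": q3,
            "max": max(values),
        }
    return stats


# Hypothetical (apparatus, lot yield %) records for a single step.
records = [
    ("A", 90), ("A", 92), ("A", 91), ("A", 93),
    ("B", 60), ("B", 62), ("B", 61), ("B", 63),
]
step_stats = box_stats_by_apparatus(records)
```

One such summary has to be produced and inspected for every production step, which is what makes the step-by-step confirmation laborious.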
With this method, however, a large number of records are required, since the number of steps at present reaches several hundreds, and judgment is difficult in the case where the difference does not clearly appear or the case where the conditions are complicated. In order to deal with this, the data mining method by means of regression tree analysis is effective, in which the used apparatus are divided into a group where the value of the target variable becomes high and a group where the value of the target variable becomes low. As shown in FIG. 3, however, in the case where the apparatus used for each lot is fixed as the lot flows through the steps, there may be a case where the abnormal apparatus 102 represented by a black circle cannot be identified definitively. That is to say, in the case where the independence between explanatory variables is low, the variable showing a large significant difference upon bisection of the set does not always have an "actual large significant difference".
The above describes confounding among the apparatus used in each step of semiconductor manufacturing. The same applies to the confounding of the bisected sets obtained as a result of regression tree analysis, that is to say, to the case where the set in each step is divided into a group of apparatus having high yield and a group of apparatus having low yield. This confounding of bisected sets arises in the same way in the case where the explanatory variable is continuous.
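The difficulty can be reproduced with a small sketch of the first bisection in a regression tree. The split criterion used here (the reduction in the sum of squared deviations of the yield) is the usual one for regression trees and is assumed for illustration; the function names, apparatus labels and yield values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def sum_sq(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_bisection(apparatus_per_lot, yields):
    """Try every division of the apparatus of one step into two groups and
    return (reduction in sum of squares, one of the two groups)."""
    machines = sorted(set(apparatus_per_lot))
    total = sum_sq(yields)
    best = (0.0, frozenset())
    for size in range(1, len(machines)):
        for subset in combinations(machines, size):
            left = [y for a, y in zip(apparatus_per_lot, yields) if a in subset]
            right = [y for a, y in zip(apparatus_per_lot, yields) if a not in subset]
            reduction = total - sum_sq(left) - sum_sq(right)
            if reduction > best[0]:
                best = (reduction, frozenset(subset))
    return best


# Hypothetical data: lots that use apparatus A in step 1 always use X in step 2.
yields = [90, 92, 91, 60, 62, 61]
step1 = ["A", "A", "A", "B", "B", "B"]
step2 = ["X", "X", "X", "Y", "Y", "Y"]
reduction1, group1 = best_bisection(step1, yields)
reduction2, group2 = best_bisection(step2, yields)
```

In this example both steps give exactly the same reduction, so the bisection alone cannot tell whether the apparatus of step 1 or that of step 2 is abnormal.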
It is an object of this invention to provide a data analysis method which extracts a data distribution characteristic amount, such as various statistics, by editing the original data, and which objectively recognizes and utilizes this characteristic amount to thereby automatically extract defective factors or the like. It is another object of the present invention to provide a data analysis method and a data analysis apparatus which can clarify the confounding degree between a plurality of explanatory variables.
One aspect of the present invention is to automatically and quantitatively evaluate and extract various data distribution characteristics existing in the original data groups stored in a computer system, and to select and analyze the amount of each extracted characteristic sequentially, to thereby automatically and quantitatively evaluate and extract the factor of each amount of characteristic. According to this aspect, a large amount of information, such as the tendency in the data distribution, characteristic patterns and the relations between data, is extracted. Therefore, the relations and significant differences which have been difficult to discriminate because they are buried among various data can be extracted quantitatively and efficiently based on scientific grounds.
Further, in order to clarify the confounding degree between a plurality of explanatory variables, there is provided a data analysis method comprising the steps of: preparing result data of explanatory variables and a target variable; calculating the confounding degree and/or independence degree between a plurality of explanatory variables based on the result data; and performing data mining using the confounding degree and/or independence degree. By calculating the confounding degree and/or independence degree between a plurality of explanatory variables, the confounding degree between the explanatory variables can be clearly grasped. If regression tree analysis is performed on this basis, the confounding degree between the explanatory variables can be quantitatively evaluated based on the result of set bisection in the regression tree analysis. As a result, it becomes possible to identify a noteworthy explanatory variable that is confounded with the explanatory variable showing the significant difference at the first branch of the regression tree, where such confounding becomes a serious problem.
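As a rough illustration of what a confounding degree between two bisected sets might look like, the following sketch compares how two explanatory variables divide the same lots. The formula is an assumption made for this example only and is not the definition given in this specification: the agreement rate of the two bisections is rescaled so that 1 means the bisections coincide (or are exact mirror images) and 0 means they agree on exactly half of the lots. The function names are hypothetical.

```python
def confounding_degree(groups_a, groups_b):
    """Degree in [0, 1] to which two bisections of the same lots coincide.
    Each argument lists, per lot, True if the lot falls in the high-yield
    group of that explanatory variable. Illustrative assumption only."""
    if len(groups_a) != len(groups_b) or not groups_a:
        raise ValueError("both bisections must cover the same non-empty lots")
    agree = sum(1 for a, b in zip(groups_a, groups_b) if a == b)
    agreement = agree / len(groups_a)
    return abs(2 * agreement - 1)


def independence_degree(groups_a, groups_b):
    """Complement of the confounding degree (same assumption as above)."""
    return 1 - confounding_degree(groups_a, groups_b)


# Hypothetical bisections of six lots by three explanatory variables.
first_branch = [True, True, True, False, False, False]
same_flow = [True, True, True, False, False, False]
mixed_flow = [True, False, True, False, True, False]

fully_confounded = confounding_degree(first_branch, same_flow)
weakly_confounded = confounding_degree(first_branch, mixed_flow)
```

A simple agreement rate of this kind is sensitive to unbalanced group sizes, so it should be read as a sketch of the idea rather than as a recommended statistic.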
Other objects and features of this invention will become understood from the following description with reference to the accompanying drawings.