In many industries, very large data sets are collected in both manufacturing and research and development.
In the semiconductor device manufacturing industry, device manufacturers have managed to transition to more closely toleranced process and materials specifications by relying on process tool manufacturers to design better and/or faster process and hardware configurations. However, as device geometries shrink to the nanometer scale, complexity in manufacturing processes increases, and process and material specifications become more difficult to meet.
A typical process tool used in current semiconductor manufacturing can be described by a set of several thousand process variables. The variables are generally related to physical parameters of the manufacturing process and/or tools used in the manufacturing process. In some cases, of these several thousand variables, several hundred variables will be dynamic (e.g., changing in time during the manufacturing process or between manufacturing processes). The dynamic variables (for example, gas flow, gas pressure, delivered power, current, voltage, and temperature) change based on, for example, a specific processing recipe, the particular step or series of steps in the overall sequence of processing steps, errors and faults that occur during the manufacturing process, or changes in parameter values based on use of a particular tool or chamber (sometimes referred to as “drift”).
The process variables are frequently related to yield or response variables. The process variables can be thought of as predictors of, or as indicative of, the yield variables based on an underlying relationship between the variables. Data indicative of the process and yield variables are measured and stored during a manufacturing process, either for real-time or later analysis.
Similarly, in pharmaceutical and biotech production, regulatory agencies such as the U.S. Food and Drug Administration require compliance with strict specifications on the manufacturing processes to maintain high quality products with very small variation around a specified quality profile. These specifications necessitate the on-line measuring of process variables and additional multidimensional sensor techniques such as, for example, process gas chromatography, near-infrared spectroscopy, and mass spectroscopy. Ideally, data measured during manufacturing processes are available for real-time analysis to provide indications or information concerning how close the process conditions are to the process specifications.
In pharmaceutical and biotechnical research and development, many different molecules (often tens of thousands or more) are investigated during the process of finding and optimizing a new drug. Many different physical and biological properties are measured on and/or calculated for each molecule (e.g., each potential drug candidate), and many theoretical structure-related properties are also calculated. The total number of variable values determined for each molecule often exceeds several thousand (e.g., more than 2,000 variable values). Part of the development process comprises finding relationships between, on the one hand, biological properties and, on the other hand, physical, chemical, and theoretically-calculated structure-related properties. An understanding of these relationships helps researchers to modify the chemical structures of promising molecules to move towards new molecules with an improved profile of biological properties.
In large data sets, data are often grouped together, resulting in clustered data. To perform meaningful analysis on the data, comparisons between homogeneous or non-grouped data are preferred. Hence, algorithms have been developed to cluster the grouped data into homogeneous sub-groups.
One way to analyze the grouped data is to use a variant of linear regression analysis on the data (sometimes called “regression trees” or “classification and regression trees” or “CART”). Regression tree analysis involves a sequence of data splits based on individual X-variables or combinations of X-variables. The number of possible ways in which the data can be split grows rapidly with the number of variables observed. For this reason, regression trees are generally suitable for data sets having only a few variables, and regression tree analysis generally breaks down for data sets having more than 10 to 20 variables due, in part, to computational cost. Based on the results of the regression tree analysis, data are grouped into a tree or branched organization, sometimes called a dendrogram.
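A single split step of such an analysis can be sketched as follows. This is only an illustrative sketch, not the procedure of any particular CART implementation: it assumes a single numeric response, splits on one X-variable at a time, and uses the within-group sum of squared errors as the split criterion. The names `sse` and `best_split` are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Search every X-variable and every candidate threshold for the split
    that minimizes the total within-group SSE of y.

    X is a list of rows (one list of numbers per observation) and y is a
    list of responses. Returns (variable_index, threshold, total_sse), or
    None if no variable takes more than one distinct value.
    """
    best = None
    n_vars = len(X[0])
    for j in range(n_vars):
        levels = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(levels, levels[1:]):
            threshold = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[j] <= threshold]
            right = [yi for row, yi in zip(X, y) if row[j] > threshold]
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (j, threshold, total)
    return best
```

The outer loop runs over every X-variable and the inner loop over every candidate threshold, so even this single-variable search grows with the number of variables; allowing splits on combinations of X-variables enlarges the search space much further.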
One type of hierarchical data clustering is based on a principal component analysis (PCA). Such techniques involve, for each hierarchical level, projecting a data set onto the first principal component axis of the PCA. The projected data are thus aligned one-dimensionally along the first principal component axis, and the data are partitioned near the median position on the first principal component axis. This type of partitioning or clustering is iterated recursively until the maximum distance between cluster members no longer exceeds a predetermined (e.g., user-defined) threshold. Like a CART analysis, a PCA-based analysis is relatively slow for large data sets. A further drawback is that PCA-based analysis generally considers only the X-variables and ignores the influence of Y-variables on the resulting data relationships.
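The recursive median partitioning described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the first principal component axis is approximated by power iteration, the cut is made at the median position of the sorted scores, and splitting is assumed to stop once the largest pairwise distance within a cluster is at or below the threshold. The names `first_principal_axis`, `max_pairwise_distance`, `pca_bisect`, and `max_diameter` are hypothetical.

```python
import math


def first_principal_axis(rows, iterations=200):
    """Approximate the first principal axis of mean-centered rows by power
    iteration. The start vector is the longest row; in the rare case that
    this row is orthogonal to the true axis, the result is not the axis."""
    n_vars = len(rows[0])
    start = max(rows, key=lambda r: sum(v * v for v in r))
    norm = math.sqrt(sum(v * v for v in start))
    if norm == 0.0:
        return [1.0] + [0.0] * (n_vars - 1)  # all rows identical
    axis = [v / norm for v in start]
    for _ in range(iterations):
        # Apply X^T X to the axis without forming the covariance matrix.
        scores = [sum(r[j] * axis[j] for j in range(n_vars)) for r in rows]
        new_axis = [sum(s * r[j] for s, r in zip(scores, rows))
                    for j in range(n_vars)]
        norm = math.sqrt(sum(a * a for a in new_axis))
        if norm == 0.0:
            break
        axis = [a / norm for a in new_axis]
    return axis


def max_pairwise_distance(points):
    """Largest Euclidean distance between any two points (0.0 if fewer than two)."""
    return max((math.dist(p, q)
                for i, p in enumerate(points) for q in points[i + 1:]),
               default=0.0)


def pca_bisect(points, max_diameter):
    """Recursively split points at the median score on the first principal
    axis until every cluster has a diameter of at most max_diameter."""
    if len(points) < 2 or max_pairwise_distance(points) <= max_diameter:
        return [points]
    n_vars = len(points[0])
    means = [sum(p[j] for p in points) / len(points) for j in range(n_vars)]
    centered = [[p[j] - means[j] for j in range(n_vars)] for p in points]
    axis = first_principal_axis(centered)
    scores = [sum(c[j] * axis[j] for j in range(n_vars)) for c in centered]
    # Sort observations by score and cut at the median position.
    order = sorted(range(len(points)), key=lambda i: scores[i])
    half = len(points) // 2
    low = [points[i] for i in order[:half]]
    high = [points[i] for i in order[half:]]
    return pca_bisect(low, max_diameter) + pca_bisect(high, max_diameter)
```

Only the X-values enter this computation; no Y-variable appears anywhere in the sketch. The pairwise distance check is quadratic in the number of observations, which is one reason an approach of this kind can be slow for large data sets.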
Another technique involves random, binary (0 or 1) Y-vector values, which divide the Y-variables into two random groups. A partial least squares (PLS) algorithm is used to predict new Y-variables using a one-component model, and the predicted Y-variables replace the random Y-variable values. After the analysis converges, the predicted Y-variables are rounded off to the nearest integer (e.g., either 0 or 1), and the rounded Y-variables are used to partition the data into groups. Like PCA-based analysis and CART analysis, this approach tends to operate solely on the X-variables despite using PLS for internal calculations. An extension of this technique allows more than two clusters by establishing a framework for multiple (e.g., 3, 4, or more) partitions instead of binary partitions (0 or 1).
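The binary variant of this technique can be sketched as follows. This is a rough sketch under several assumptions that the description above does not specify: the X-variables and the Y-vector are mean-centered, the one-component PLS model is computed with a NIPALS-style weight vector, convergence is judged by the largest change in the predicted Y-values, and the final rounding is implemented as a threshold at 0.5. The function name `pls_binary_partition` and its parameters are hypothetical.

```python
import math
import random


def pls_binary_partition(X, y0=None, seed=0, tol=1e-9, max_iter=100):
    """Partition observations into two groups by iterating a one-component
    PLS model, starting from binary (0 or 1) Y-values.

    X is a list of rows. y0 is an optional starting 0/1 vector; if omitted,
    a random one is drawn. Returns a list of 0/1 group labels.
    """
    n, p = len(X), len(X[0])
    if y0 is None:
        rng = random.Random(seed)
        y = [float(rng.randint(0, 1)) for _ in range(n)]
    else:
        y = [float(v) for v in y0]
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    for _ in range(max_iter):
        y_mean = sum(y) / n
        yc = [v - y_mean for v in y]
        # PLS weight vector, proportional to X^T y.
        w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
        w_norm = math.sqrt(sum(v * v for v in w))
        if w_norm == 0.0:
            break  # y has no covariance with X, so there is nothing to update
        w = [v / w_norm for v in w]
        # Scores t = X w and the regression coefficient of y on t.
        t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
        c = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
        # The predicted Y-values replace the current Y-values.
        y_new = [y_mean + c * ti for ti in t]
        change = max(abs(a - b) for a, b in zip(y_new, y))
        y = y_new
        if change < tol:
            break
    # Threshold at 0.5 so that every label is 0 or 1.
    return [1 if v >= 0.5 else 0 for v in y]
```

In this sketch, each pass replaces the Y-values with a projection onto scores computed from X, so repeated passes pull the Y-vector toward a dominant direction of the X-data. That is consistent with the observation above that the approach tends to operate on the X-variables alone. Which group receives the label 1 depends on the starting vector.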
Neural network-type analysis is another approach to analyzing data. However, neural network-type analysis has not been computationally fast enough to be suitable for many applications, and it also has difficulties when the number of variables exceeds 10 to 20.