The present invention relates to data clustering. In particular, the present invention relates to a method of graph-based clustering of large datasets.
Clustering of large datasets is a long-standing problem in statistical analysis, and there is a vast literature on the subject in various mathematical fields including statistics, optimization and operations research, and computer science. In particular, popular methods known as K-means clustering, Classification and Regression Trees (CART), Bayesian methods, and many of their variants are available in most popular data processing software, such as Matlab's statistics toolbox, S-Plus, and SAS. Nevertheless, some of these methods, such as K-means clustering, are often non-robust, in the sense that repeated runs of the algorithm on the same data from different starting points give different results. Further, most of the methods require the desired number of clusters to be specified in advance, and for complex datasets the user is unlikely to know this number. Finally, these methods often involve substantial computational complexity for large datasets, and many repeated runs are often necessary before the user is satisfied that the results are reliable.
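The sensitivity of K-means to its starting point can be illustrated with a minimal sketch. This is not part of the method of the present invention: the function names `sq_dist`, `assign` and `kmeans`, and the four-point dataset, are assumptions introduced purely for illustration. The sketch runs the standard K-means iteration (Lloyd's algorithm) twice on the same data from two different sets of starting centers.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Assignment step: index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def kmeans(points, centers, max_iter=100):
    """Lloyd's iteration started from caller-supplied centers (illustrative)."""
    centers = [tuple(float(x) for x in c) for c in centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # Update step: move the center to the mean of its members.
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Keep an empty cluster's center where it is.
                new_centers.append(centers[j])
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    labels = assign(points, centers)
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, sse


# Hypothetical dataset: the corners of a 4-by-1 rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Same data, two different starting points.
labels_a, centers_a, sse_a = kmeans(points, [(0, 0.5), (4, 0.5)])
labels_b, centers_b, sse_b = kmeans(points, [(2, 0), (2, 1)])

print(labels_a, sse_a)  # [0, 0, 1, 1] 1.0
print(labels_b, sse_b)  # [0, 1, 0, 1] 16.0
```

Both runs converge immediately, but to different partitions. The first groups the points into left and right pairs with a within-cluster sum of squares of 1.0, while the second settles on top and bottom pairs with a sum of squares of 16.0, a stable but much poorer solution. This is the kind of behavior that forces a user to repeat runs before trusting the result.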
The present invention is particularly useful as a statistical tool applied to refinery process data. Refinery processes are usually monitored with the help of a large number of instruments that periodically (typically every second) send information back to a central monitoring station. This streaming data is monitored both manually and automatically, the latter by computer software that may use deterministic rules (expert systems) and/or statistical criteria. The process can evolve into an abnormal state (unsafe and/or inefficient) in a large variety of ways, and in a well-designed system, the rules and statistical criteria will indicate the occurrence of the abnormality as early as possible so that corrective action can prevent further damage. In addition, the present invention is useful for analyzing data produced by Model Predictive Control (MPC) and Real Time Optimization (RTO).