Principal component analysis (PCA) is a classical statistical method. This linear transform is widely used in data analysis and compression. PCA involves a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. The objective of PCA is to discover or reduce the dimensionality of a data set and to identify new, meaningful underlying variables.
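The transformation described above can be illustrated with a minimal sketch, assuming NumPy is available and computing the principal components through a singular value decomposition of the mean-centered data. The function name `pca`, the variable names, and the toy data set are hypothetical and chosen only for illustration; this is one common way to compute PCA, not a definitive implementation.

```python
import numpy as np


def pca(data, n_components):
    """Project `data` (rows = samples, columns = variables) onto its
    leading principal components. Hypothetical helper for illustration."""
    # Center each variable so the components describe variance about the mean.
    centered = data - data.mean(axis=0)
    # Rows of `vt` are the principal directions, ordered by decreasing
    # singular value, i.e. by decreasing variance explained.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variances = singular_values ** 2 / (data.shape[0] - 1)
    explained_ratio = variances / variances.sum()
    components = vt[:n_components]
    # Scores are the coordinates of each sample in the reduced space.
    scores = centered @ components.T
    return scores, components, explained_ratio[:n_components]


# Assumed toy data: three variables, two of which are strongly correlated.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = np.column_stack([
    x,
    2.0 * x + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

scores, components, ratio = pca(toy, n_components=2)
print("fraction of variance per component:", ratio)
print("correlation between component scores:")
print(np.corrcoef(scores, rowvar=False))
```

In this sketch the three original variables are reduced to two component scores per sample. The printed ratios are in non-increasing order, and the off-diagonal entries of the correlation matrix are zero up to floating-point error, which reflects the uncorrelated components described above.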
PCA has been used to analyze complex data, including phylogenetic data used to classify organisms. The principal method of establishing phylogenetic relationships among prokaryotic organisms is analysis of small subunit ribosomal RNA (SSU rRNA). Currently, over 125,000 organism-specific SSU rRNA sequences are publicly available. Exploratory data analysis methods such as PCA indicate higher-order relationships among SSU rRNA sequences that are similar to those of the July 2002 Bergey's taxonomy. However, PCA techniques fail to provide an undistorted visual presentation of the orderings, and fail to provide automated identification and re-placement of classification errors.
The simple act of naming and classifying an entity (e.g., a biological entity) that is part of a large, complex classification system has potentially far-reaching and long-lived consequences. Names, especially those ascribed to organisms, serve as a primary entry point into the scientific, medical, and technical literature and figure prominently in countless laws and regulations governing various aspects of commerce, public safety, and public health. Biological names also serve as a primary entry point into many of the central databases that the scientific community and the general public rely upon. However, ascribed names do not govern the process of biological classification or identification, but only the formation and assignment of names to proposed taxa. Hence, legitimate and valid names may be ascribed to poorly formed taxa, and illegitimate and invalid names may be assigned to well-formed taxa.
A disjunction between nomenclature and taxonomy leads to an accumulation of dubious names in the literature and databases. In a practical, legal, or regulatory sense, incorrect classification can have significant and unintended consequences. For example, such errors may lead to biological species being added to, or removed from, lists of tightly regulated organisms, such as the current list of biothreat agents in the United States or lists of organisms restricted by packaging and shipping regulations.
What is needed is an improved visual presentation of data classifications generated from orderings based on principal component analysis, and a method of providing automated identification and re-placement of classification errors. Additionally, there is a need for a system of nomenclature and classification of biological taxa and other similar data sets that takes advantage of the large numbers of SSU rRNA sequences or corresponding identifiers available, and that is reconcilable with other knowledge concerning genotypic and phenotypic information.