Modern society creates a sea of data. It can be difficult to understand large data sets using standard data analysis tools. The problem is particularly acute for data sets containing many objects, with many measured properties for each object. The typical approaches of plotting one parameter against another, computing histograms, measuring correlations, and so on are simply insufficient for exploring the data when there are more than a handful of parameters for each object in the sample. Object classification is a very useful tool for data exploration in large, complex problems as it can provide an accurate, understandable characterization of a complex data set.
A classifier takes a set of parameters (or features) that characterize objects (or instances) and uses them to determine the type (or class) of each object. The classic example in astronomy is distinguishing stars from galaxies. For each object, one measures a number of properties (brightness, size, ellipticity, etc.); the classifier then uses these properties to determine whether each object is a star or a galaxy. Classification of objects on the basis of their possession of a diversity of features, however, is a problem of widespread application. Classifiers need not give simple yes/no answers—they can also give an estimate of the probability that an object belongs to each of the candidate classes.
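As a concrete sketch, the star/galaxy example above can be illustrated with a k-nearest-neighbor classifier that returns class probabilities. The feature values and the `classify` helper below are invented for illustration; they are not part of any particular classification system.

```python
from collections import Counter
import math

# Toy labeled sample: (brightness, size, ellipticity) -> class.
# All measurement values here are invented for illustration.
TRAINING = [
    ((20.1, 1.2, 0.05), "star"),
    ((19.8, 1.1, 0.02), "star"),
    ((21.5, 1.0, 0.08), "star"),
    ((18.3, 4.5, 0.40), "galaxy"),
    ((19.0, 3.8, 0.55), "galaxy"),
    ((20.2, 5.1, 0.35), "galaxy"),
]

def classify(features, k=3):
    """Return a dict mapping class -> probability, estimated from the
    k nearest training objects in Euclidean feature space."""
    dists = sorted(
        (math.dist(features, f), label) for f, label in TRAINING
    )
    votes = Counter(label for _, label in dists[:k])
    return {c: n / k for c, n in votes.items()}
```

Rather than a yes/no answer, `classify` reports the fraction of nearby training objects in each class, which serves as a crude probability estimate.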
The classification process generally comprises three broad steps. The first is preparation of the input data, the second is classification, and the third is validation. The first step mainly involves noise reduction and/or normalization and is highly domain dependent; it is essential for a proper interpretation of the results. The second step is domain independent. In the third step, the accuracy of the classifier is measured. Knowledge of the accuracy is necessary both in the application of the classifier and in the comparison of different classifiers. Accuracy is typically determined by applying the classifier to an independent test set of objects with known classifications. The advantage of cross-validation is that all objects in the training set are used both as test objects and as training objects. Often steps two and three are carried out repeatedly until a satisfactory classifier has been obtained. Various standard cross-validation techniques are known, one of which is five-fold cross-validation.
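The k-fold procedure mentioned above can be sketched as follows. This is a generic illustration, not a specific system's method: `train_fn` and `accuracy_fn` are hypothetical callables supplied by the user.

```python
import random

def k_fold_cross_validate(data, train_fn, accuracy_fn, k=5, seed=0):
    """Estimate classifier accuracy by k-fold cross-validation.

    data: list of (features, label) pairs with known classes.
    train_fn: builds a classifier from a training subset.
    accuracy_fn: fraction of a test subset the classifier labels correctly.
    Every object serves once as a test object and k-1 times as a
    training object, which is the advantage noted above.
    """
    items = list(data)
    random.Random(seed).shuffle(items)
    # Partition the shuffled data into k disjoint folds.
    folds = [items[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [obj for j, fold in enumerate(folds) if j != i for obj in fold]
        clf = train_fn(train)
        scores.append(accuracy_fn(clf, test))
    # Report the mean accuracy over the k held-out folds.
    return sum(scores) / k
```

With k=5 this is the five-fold cross-validation mentioned above: each fold holds out one fifth of the data for testing while the remaining four fifths are used for training.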
Generally the computationally hard part of classification is inducing a classifier, i.e., determining the optimal (or at least good) values of whatever parameters the classifier will use. The classification problem becomes very hard when there are many parameters: there are so many different combinations of parameter values that techniques based on exhaustive searches of the parameter space are computationally infeasible. Practical methods for classification therefore involve a heuristic approach intended to find a “good-enough” solution to the optimization problem. There are numerous recognized approaches in the art, including neural networks, nearest-neighbor classifiers, axis-parallel decision trees, and oblique decision trees. Such approaches are discussed, for example, in Aho, Hopcroft, and Ullman, “The Design and Analysis of Computer Algorithms”, Addison-Wesley Publishing Co., 1974. There is, however, a need for improvement so as to have more accurate classifications.
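The heuristic flavor of classifier induction can be illustrated with the split-selection step of an axis-parallel decision tree. Rather than searching all possible trees, tree inducers greedily choose the single best split at each node. The sketch below, with invented helper names and binary labels assumed, finds the best one-split classifier (a “decision stump”).

```python
from collections import Counter

def best_stump(data):
    """Greedy search for the best single axis-parallel split: the
    local heuristic that decision-tree inducers apply at each node
    instead of an exhaustive global search over all tree shapes.

    data: list of (features, label) pairs.
    Returns (feature_index, threshold, label_below, label_above).
    """
    n_features = len(data[0][0])
    best, best_correct = None, -1
    for f in range(n_features):
        values = sorted({x[f] for x, _ in data})
        # Candidate thresholds midway between consecutive feature values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            below = Counter(lbl for x, lbl in data if x[f] <= t)
            above = Counter(lbl for x, lbl in data if x[f] > t)
            lb = below.most_common(1)[0][0]
            la = above.most_common(1)[0][0]
            # Score the split by how many objects the majority
            # label on each side classifies correctly.
            correct = below[lb] + above[la]
            if correct > best_correct:
                best_correct, best = correct, (f, t, lb, la)
    return best
```

Even this single-split search is only locally optimal; stacking such greedy choices into a full tree is exactly the kind of “good-enough” heuristic the preceding paragraph describes.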