The present invention relates to methods and systems for feature selection. More particularly, the present invention relates to methods and systems for feature selection for data classification, segmentation, and retrieval.
With the explosion of data in areas such as machine learning, pattern recognition, statistics, information theory, philosophy of science, combinatorial chemistry, genetics, computer science, multimedia production, the internet, and the like, the need for fast and efficient data management has become a major issue.
One of the fundamental tasks in data management involves classifying the data in a meaningful manner for subsequent retrieval, manipulation, delivery, segmentation, and/or the like.
Human recognition of an object as belonging to a certain classification (category) occurs because we learn to associate certain characteristic features of an object with a particular category. Therefore, once the important features of an object have been recognized and associated with a category, we routinely classify other objects having these characteristic features as belonging to that category.
For example, humans can distinguish a document from a blank sheet of white paper by the presence of certain features, such as text and images, which may be present on the paper.
In the example above, selecting a small number of distinguishing features is important for accurate and rapid classification. For example, selecting the white areas of a paper as the sole meaningful feature of a document will likely lead to incorrectly classifying every blank sheet of white paper as a document. However, specifying all possible distinguishing characteristics to classify an object as belonging to a particular category may require an inordinate amount of time, because doing so would require comparing each and every proposed feature.
Feature selection has, thus, been developed to reduce the number of features under consideration to a manageable level in a wide range of applications, such as text categorization, gene microarray analysis, web mining, handwriting recognition, and the like.
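By way of illustration only, the paper-versus-document example above can be sketched in Python. The sketch is a toy filter-style selection under stated assumptions, not a method taken from this disclosure: the sample data, the feature names (`white_area`, `text`, `image`), and the helper functions `feature_scores` and `select_features` are all hypothetical. Each binary feature is scored by how differently it occurs in the two categories, and only the highest-scoring features are retained.

```python
# Toy data (hypothetical): each sample is (features, label).
# label 1 = document, label 0 = blank sheet of white paper.
SAMPLES = [
    ({"white_area": 1, "text": 1, "image": 1}, 1),
    ({"white_area": 1, "text": 1, "image": 0}, 1),
    ({"white_area": 1, "text": 1, "image": 1}, 1),
    ({"white_area": 1, "text": 0, "image": 0}, 0),
    ({"white_area": 1, "text": 0, "image": 0}, 0),
    ({"white_area": 1, "text": 0, "image": 0}, 0),
]


def feature_scores(samples):
    """Score each binary feature by |P(f=1 | label 1) - P(f=1 | label 0)|.

    A score of 0 means the feature occurs equally often in both
    categories and therefore cannot distinguish them.
    """
    names = list(samples[0][0].keys())
    pos = [feats for feats, label in samples if label == 1]
    neg = [feats for feats, label in samples if label == 0]
    scores = {}
    for name in names:
        p_pos = sum(feats[name] for feats in pos) / len(pos)
        p_neg = sum(feats[name] for feats in neg) / len(neg)
        scores[name] = abs(p_pos - p_neg)
    return scores


def select_features(samples, k):
    """Return the names of the k highest-scoring features."""
    scores = feature_scores(samples)
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    print(feature_scores(SAMPLES))
    print(select_features(SAMPLES, 2))
```

In this toy data the `white_area` feature is present in every sample of both categories, so it receives a score of zero and is discarded, whereas `text` separates the two categories completely. Scoring each feature independently in this manner is only one simple approach, chosen here for brevity.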
However, to date, feature selection in areas having massive data, high dimensionality, and complex hypotheses continues to pose a considerable challenge. In addition, accuracy becomes a critical issue when the training data set is sparse and/or noisy.
As such, methods and systems capable of carrying out feature selection on data containing a large amount of information with high dimensionality are desired. Furthermore, accurate feature selection methods utilizing sparse and/or noisy training data are also desired.