1. Field of the Invention
Aspects of the present invention generally relate to methods and systems for creating ensembles of hypersurfaces in high-dimensional feature spaces, and to machines and systems relating thereto. More specifically, exemplary aspects of the invention relate to methods and systems for generating supervised hypersurfaces based on user domain expertise, machine learning techniques, or other supervised learning techniques. These supervised hypersurfaces may optionally be combined with unsupervised hypersurfaces derived from unsupervised learning techniques. Additional exemplary aspects of the invention relate to methods and systems for generating supervised hypersurfaces based on user domain expertise, machine learning techniques, or other learning techniques. Lower-dimensional subspaces may be determined by the methods and systems for creating ensembles of hypersurfaces in high-dimensional feature spaces. Data may then be projected onto the lower-dimensional subspaces for further data discovery, visualization for display, or database access. Also provided are tools, systems, devices, and software implementing the methods, and computers embodying the methods and/or running the software, where the methods, software, and computers utilize various aspects of the present invention relating to analyzing data.
2. Description of Related Art
Large numbers of samples of large-scale data are being amassed in huge data repositories. Accessing such data in a meaningful way poses an increasing challenge, both in presenting information to an end user and in rapidly summarizing database content. Historically, the first computer access methods were limited to sequential views and storage of the data. A file might be contained in an ordered series of punch cards that could be read only in the physical sequence in which the cards were ordered. Files contained on magnetic tape were similarly limited to a single sequence of records. The advent of magnetic disk storage enabled the development of indexed sequential access methods. In this approach, an index could be constructed from a key field contained within each data record, and the physical storage sequence of the data records could differ from the sequence reflected in the index file. Further developments included the relational database, in which any field in the data record could be used to create an index file, and the data records could be viewed in many separate sequences by using multiple index files, regardless of the physical sequence of the records.
Access methods nevertheless remain essentially sequential, since each index file is presented as an ordered series or one-dimensional list that reveals the relationship among the data records only as a simple sequence according to a key field. The complexity of large-scale data calls for more sophisticated access methods that reflect the relationship among records using more than one ‘key’ field and that express this greater complexity as more than a simple sequentially ordered list.
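The limitation described above can be illustrated with a minimal sketch. The record layout, field names, and the helper build_index are hypothetical and chosen only for illustration; they are not taken from any particular database system. Each index is simply an ordering of record positions by one key field, so however many indexes are built, each one presents the records as a one-dimensional list.

```python
# Records in arbitrary physical storage order (hypothetical example data).
records = [
    {"id": 3, "name": "Chen", "year": 1999},
    {"id": 1, "name": "Adams", "year": 2005},
    {"id": 2, "name": "Baker", "year": 1987},
]


def build_index(rows, field):
    """Return the storage positions of rows, ordered by one key field."""
    return sorted(range(len(rows)), key=lambda pos: rows[pos][field])


# Two separate index "files" over the same, unmoved records.
by_name = build_index(records, "name")
by_year = build_index(records, "year")

# Reading the records through an index yields a single ordered sequence.
names_in_year_order = [records[pos]["name"] for pos in by_year]
print(by_name, by_year, names_in_year_order)
```

Neither index conveys how a record relates to the others with respect to both fields at once, which is the gap the preceding paragraph identifies.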
Methods that can both access and express a more complex relationship among data records than a simple ordered list are highly desirable, particularly if such methods are intuitive and do not require advanced mathematical knowledge.
Computational tools and mathematical models have made progress in providing methods for data mining, but the details of using these tools remain largely the province of separate groups of specialists, and are not always effectively utilized in the broader community. There is a need for tools that can be intuitively used by non-mathematicians.
Complex datasets with high dimensionality pose particular challenges for analysis and accurate representation in two-dimensional graphics. The current and ongoing explosion of large-scale data in the life science and health sectors is a case in point. Computational resources required for analysis can be prohibitive, and grasping complex mathematical solutions can be difficult for experts in the data field who are not mathematicians. One approach to presentation of large-scale data has been the use of pseudo-three-dimensional representations, but distortions are easily introduced when reducing the representation of high-dimensional data to such a small number of dimensions. Inaccurate representations are a barrier to understanding and to data discovery. Better methods of representing high-dimensional data are desirable not only to improve display methods, but also to form a basis for further investigation of the data.
Among the tools that can be applied to high-dimensional data, the support vector machine (SVM) is a powerful learning machine. It finds a linear separation between data classes, sometimes by mapping the data into higher dimensions until a linear separation is possible. The computational burden of these high-dimensional calculations may be sidestepped via the “kernel trick”, which implicitly maps the data into a space of higher (perhaps infinite) dimensionality while requiring only the evaluation of a kernel function that yields the dot product in that space, so that the mapping itself is never computed explicitly. Methods in common use to display large-scale data, such as data reduction by independent component analysis (ICA) or principal component analysis (PCA), are not able to clearly illustrate the separation achieved by the learning machine. There exists a need for better display methods that make the solution of a learning machine, such as the SVM, more readily interpretable. In addition to the SVM, other methods for analysis of high-dimensional data are possible. These methods may suffer from similar limitations when their solutions are displayed graphically in only two or three dimensions, so a need exists more broadly in the field of large-scale data analysis for improved display methods.
Actual data patterns have been used in the current state of the art to search for matches using data mining methods, but current methods suffer from limitations in the display and demonstration of inter-relationships among identified matches. Improved methods for finding patterns that are similar but not identical, and for conveying information about the similarities, are highly desirable.
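As a minimal sketch of the kernel trick mentioned above, the following example compares an explicit mapping into a higher-dimensional space with the equivalent kernel evaluation. The choice of a homogeneous degree-2 polynomial kernel, the feature map phi, and the sample points are assumptions made only for illustration; they are not drawn from any specific learning machine discussed here.

```python
import math


def dot(a, b):
    """Ordinary dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x):
    """Explicit degree-2 feature map taking a 2-D point into 3-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2


x = (1.0, 2.0)
y = (3.0, -1.0)

explicit = dot(phi(x), phi(y))  # maps both points, then takes the dot product
implicit = poly_kernel(x, y)    # same quantity without ever computing phi

print(math.isclose(explicit, implicit))
```

The two quantities agree up to floating-point rounding, while the kernel evaluation works only with the original two-dimensional points. For kernels whose implied feature space is very large or infinite, only the implicit route is practical.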
The direct incorporation of a hypothesis into a model would speed investigation and save research costs. A barrier to such inclusion is that frequently the expert in the domain being investigated is not a mathematician, and this places increased importance on ease of use as well as the visualization of a model. Another barrier can occur when the direct collection of statistical data to support the hypothesis is expensive, and a simple method of examining a model in advance of additional data collection can speed the elimination of unlikely hypotheses and focus further efforts more directly on likely hypotheses. A need exists for simpler methods of incorporating hypotheses into large-scale data models in advance of extensive research effort.
The recent increase in technologies for collecting and amassing large-scale data also poses challenges for monitoring and detecting abnormalities or changes in such data. In the area of human health, for example, it is possible to monitor tens of thousands of genes. A wide range of gene expression levels and patterns is consistent with a normal, healthy individual, but illness can be manifested as a change in these normal patterns. Simple rule-based definitions of a normal pattern fall short of capturing the complexity and scope of what is normal, particularly when dealing with samples containing tens of thousands of features, as found with human gene expression. In many other areas, for example geospatial, finance, or surveillance, normal data encompasses complex patterns of variation, and abnormalities may be reflected by changes that fall outside the bounds of a complicated set of inter-related normal levels. In many of these cases, supervised methods of detecting an abnormal condition are not possible because there may not be enough, or even any, examples of actual abnormal data.
For all of these reasons, improved methods that can detect deviations from a normal state without needing abnormal examples for training are highly desirable.
In the field of biotechnology, for example, improvements in large-scale data display, analysis, and hypothesis exploration would be useful to increase discovery from high-dimensional biological and biomedical data sets, such as gene expression, protein expression, and clinical studies. The visualization of such data is limited by current methods, and additional methods for data discovery are of particular interest for the improvement and understanding of human health. Other fields where large-scale data is collected include, but are not limited to, geospatial, climate, marketing, economics, and surveillance data. These and other fields would benefit from such improvements in display and discovery of large-scale data.