Without limiting the scope of the invention, its background is described in connection with gene sequence data. Several systems, such as the BLAST system, include features for the learning and classification of gene sequence data. However, current solutions do not provide the functionality to automatically distribute learning and classification processes across multiple processors and disks in a distributed parallel computing environment using a map reduction aggregation method. Furthermore, current learning and classification systems implement a rigid framework which requires the use of a single predefined aggregation method and classification metric function. In addition, the current storage requirements of learned gene sequence data in existing systems make it infeasible to store a very large amount of learned gene sequence data on devices with limited hard disk storage space, such as a laptop computer.
The limitations previously described often result in extensive and sometimes very overhead-intensive pre-processing of input data in order to transform the targeted gene sequence data for use within a rigid framework. Furthermore, the rigid framework does not easily support multiple application-specific map reduction aggregation methods and classification metric functions created by the application programmer. As the volume of learned gene sequence data increases in these current systems, highly accurate classification of unknown gene sequences requires a very large amount of storage capacity and processing power. In addition, the processing time required for both the learning and classification of gene sequences within current systems is longer than desirable.
Accordingly, there is a need for a system and method for machine learning and classifying data.