It is known that attempting to discover or mine useful information from an amorphous collection of records, wherein each record comprises record items or entries, is a daunting task. In particular, the task is made more difficult when: (i) data in the collection need not be rectangular (e.g., spreadsheet-like); (ii) metadata associated with the collection may be incomplete or absent; (iii) data in the collection need not always be numeric; and/or (iv) items can occur in a record more than once. The difficulty stems from shortcomings associated with existing data mining techniques.
By way of one example, it is known that classical statistics, the most widely taught and used form of statistics, does not prepare us for the automated high-throughput analysis of the vast complexity of digitized medical and pharmacogenomic data. Such data has come to the fore as a result of the human and other genome projects, and of a recent rapid increase of interest in digitizing the patient record for both healthcare and research. For example, we now know that in most cases polymorphisms in not one gene but many determine a disease of primarily genetic origin. Yet even fairly advanced textbooks usually describe methods for correlating only two sets (columns) of data at a time, whereas recent biological data contains tens, hundreds or thousands of items which come together in complex interplay. Moreover, most statistical textbooks have little to say about how and where to direct such analyses in practice.
Thus, a need exists for improved data mining techniques that are effective and efficient for discovering useful information from an amorphous collection or data set of records.