Many companies and research institutions already possess an unmanageably rich, deep, and extensive menagerie of valuable raw data. However, they are often ill-equipped to deal with the data in a comprehensive and meaningful way. Integrating, processing, and analyzing such a large amount of data is becoming more expensive than generating the data itself. This problem is particularly evident in the biotechnology industry, and it also appears in other industries, including finance, pharmaceuticals, insurance, operations research, advertising, military intelligence/security, social media analytics, and medicine.
For example, in the field of biotechnology, a company (or researcher) may have generated data relating to quantitative RNA sequencing, gene expression and gene regulation, protein crystal structures, protein interaction data, high throughput phenotyping (leaf surface area, root morphology, shoot mass, etc.), and gene expression data from eukaryotic or bacterial cell systems, leading to the creation of high resolution genetic maps, genotypic marker data and trait association data, and whole reference genome sequencing with a myriad of annotations. The data sets may span stresses (nitrogen deficit, water deficit, high salt, etc.), species (corn, soy, sugarcane, etc.), populations (historical, geographic, etc.), tissues (root, shoot, meristem, etc.), and time (developmental or seasonal/historical). With next-generation sequencing and high throughput automated processing (via imaging or robotics) in growth chambers or the like, biotechnology and/or pharmaceutical companies and researchers will generate more, and more insightful, data than ever before. Such data may assist in the generation of new vegetable varieties, protein-optimized antibiotics, individualized medical diagnostics and therapeutics, as well as complete insect, viral, plant, or bacterial genomes. When RNA-seq based coding and non-coding gene annotations and expression profiles are included, along with whole genome nucleosome positioning, DNA methylation, histone modification, and other epigenetic data, and single and combinatorial gene knockouts, the deluge of data and the current inability to comprehensively analyze it and make it useful become abundantly clear.
DNA sequencing is the highest possible resolution measurement in the life sciences and, until recently, was the most costly. Since the publication of the draft human genome sequence in 2001, the cost of DNA sequencing has dropped more than 10,000 fold. This has been achieved by a radical increase in data output that continues to double every 6 months, much faster than the 18-month doubling period commonly associated with Moore's law for microprocessors. As a result, biotechnology and medical applications are quickly becoming DNA sequencing-based assays. A genetic sequence is the ultimate biomarker: it is the indivisible “quantum” of the life sciences. These technological changes affect everything from the discovery and screening efforts of academics, agro-biotechnology firms, and pharmaceutical giants to the diagnostic and screening efforts of the USDA, diagnostics labs, and hospitals. Most recognizable universities and life science companies have a genomics program rooted in sequencing. In a few years, the costs will be sufficiently low to spawn entirely new direct-to-consumer markets and help realize true “personalized medicine.”
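The gap between a 6-month and an 18-month doubling period compounds quickly. The following is a minimal sketch of that arithmetic, assuming idealized exponential growth; the function name `growth_factor` and the three-year horizon are illustrative choices, not figures from this disclosure.

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years`, given one doubling every
    `doubling_months` months (idealized exponential growth)."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings


if __name__ == "__main__":
    horizon_years = 3  # hypothetical horizon chosen for illustration
    sequencing = growth_factor(horizon_years, 6)    # doubling every 6 months
    processors = growth_factor(horizon_years, 18)   # doubling every 18 months
    print(f"sequencing output: x{sequencing:.0f}")  # 6 doublings
    print(f"microprocessors:   x{processors:.0f}")  # 2 doublings
    print(f"relative gap:      x{sequencing / processors:.0f}")
```

Under these assumptions, three years yields six doublings of sequencing output against two doublings on the 18-month schedule, so the data to be analyzed outgrows the hardware by a factor of 16 over that period.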
DNA sequencing, which outputs raw data, has in some ways brought more problems than solutions. Although next-generation sequencing provides higher throughput, its output now comes in smaller, less informative pieces (~100-letter DNA strings called “reads”) that are more difficult to analyze. A single HiSeq DNA sequencer (available from Illumina, Inc.) can produce an overwhelming one terabyte of data per week. Even with a history of genomics expertise and an army of bioinformaticians, it could take a company more than a month to perform the most cursory analysis on a single such HiSeq run. Traditional organizational and software paradigms for dealing with this large amount of data simply do not scale to the level of complexity and richness that modern integrated analyses necessitate. Moreover, it is necessary to integrate the data, which means comparing new data to all historical data, and that is precisely where the problem lies: comparing everything with everything else gets into the realm of N² problems, in which the number of comparisons grows with the square of the number of data sets N, and which take enormous computing resources to begin to analyze.
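The quadratic growth of all-against-all comparison can be made concrete with a short sketch. It assumes each data set is a simple collection of items and uses a toy overlap measure; the names `pair_count`, `compare_all`, and `shared_items` are hypothetical and do not describe any particular analysis pipeline.

```python
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered pairs among n data sets: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def compare_all(datasets, compare):
    """Run `compare` on every unordered pair of data sets.

    The number of calls equals pair_count(len(datasets)), so the work
    grows quadratically with the number of data sets.
    """
    return {
        (i, j): compare(datasets[i], datasets[j])
        for i, j in combinations(range(len(datasets)), 2)
    }


def shared_items(a, b) -> int:
    """Toy comparison (an assumption for illustration): overlap size."""
    return len(set(a) & set(b))


if __name__ == "__main__":
    # Ten times as many data sets means roughly 100 times as many comparisons.
    for n in (10, 100, 1_000, 10_000):
        print(f"{n:>6} data sets -> {pair_count(n):>12,} pairwise comparisons")

    toy = [["a", "b"], ["b", "c"], ["c", "d"]]
    print(compare_all(toy, shared_items))
```

Each comparison here is trivial; in practice a single comparison between two sequencing data sets may itself be expensive, which compounds the quadratic number of pairs.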