Examining a person's genes can reveal whether that person has a genetic disease or is a latent carrier of a disease, at risk of passing the disease on to his or her children. The information in a person's genes can be revealed by DNA sequencing. The DNA sequencing technologies known as next-generation sequencing (NGS) are capable of sequencing an entire human genome in under a day and for under $1,000. See Clark, Illumina announces landmark $1,000 human genome sequencing, Wired, 15 Jan. 2014. The output of NGS instruments typically includes many short sequence reads that must be assembled and compared to known genetic information before they yield a meaningful picture of a person's genetic information.
This assembly and analysis is not a trivial task, and different computer program tools exist that perform various pieces of the assembly and analysis job. There are computer platforms that provide a graphical user interface (GUI) that can be used by a researcher or medical professional to assemble genomic analysis tools into pipelines that perform complex analytical tasks on sequence data. See, e.g., Torri, Next generation sequence analysis and computational genomics using graphical pipeline workflows, Genes (Basel) 3(3):545-75 (2012). However, these pipeline editors require the user to have mastered the intricacies of the underlying tools. If the user wants sequence reads to be aligned to a reference genome, for example, the user must be familiar with the myriad alignment tools such as MAQ, Burrows-Wheeler Aligner, SHRiMP, ZOOM, BFAST, MOSAIK, PERM, MUMmer, PROmer, BLAT, SOAP2, ELAND, RTG Investigator, Novoalign, Exonerate, Clustal Omega, ClustalW, ClustalX, and FASTA, to name a few. Additionally, the user must have a meaningful understanding of the sequence file formats (e.g., VCF, FASTA, FASTQ, SAM, GenBank, Nexus, EMBL, GCG, SwissProt, PIR, phylip, msf, hennig86, jackknifer): the user must be able to tell them apart, know at what points a file in one format needs to be converted to another, and know which formats are the default inputs and outputs of each tool within a pipeline. Due to the complexities involved, working within a graphical pipeline editor does not solve all the challenges in assembling and analyzing sequence data. Data files may be passed along in the wrong format, causing a program to throw an error and abort the pipeline. In some cases, the tool selected to do a job will be a poor choice and will not work efficiently with the kind of data passed to it or, worse yet, will provide a substantively incorrect output.
For example, an inconsistency among the choice of tool, the sequence data, the instructions provided by the user, and the user's expectation may cause the pipeline to produce an incorrect result and potentially miss an important mutation.
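The format-mismatch problem described above can be illustrated with a minimal sketch. It is not taken from any of the cited tools or platforms: the function names, the step names in EXPECTED_INPUT, and the first-line heuristics are all assumptions made for illustration, and the heuristics cover only four of the formats listed above.

```python
# Illustrative sketch only: guess a sequence file's format from its first line
# and compare it with the input format a pipeline step expects.
# All names here are hypothetical; real tools use far more robust checks.

SAM_HEADER_TAGS = ("@HD\t", "@SQ\t", "@RG\t", "@PG\t", "@CO\t")

# Hypothetical pipeline steps and the input format each is assumed to expect.
EXPECTED_INPUT = {
    "aligner": "FASTQ",
    "variant_annotator": "VCF",
}


def sniff_format(first_line: str) -> str:
    """Guess a format from the first non-empty line of a file (heuristic)."""
    if first_line.startswith("##fileformat=VCF"):
        return "VCF"
    if first_line.startswith(SAM_HEADER_TAGS):
        return "SAM"
    if first_line.startswith(">"):
        return "FASTA"
    if first_line.startswith("@"):
        return "FASTQ"
    return "unknown"


def check_step(step: str, first_line: str) -> None:
    """Raise ValueError if the data does not look like what the step expects."""
    expected = EXPECTED_INPUT[step]
    found = sniff_format(first_line)
    if found != expected:
        raise ValueError(
            f"step '{step}' expects {expected} input but received {found}"
        )
```

Under these assumptions, check_step("aligner", ">chr1") raises an error before the hypothetical aligner runs, so the mismatch is caught up front rather than aborting the pipeline midway. A check of this kind catches only format errors; it does not detect a tool that accepts the input but is a poor choice for the data.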