Gene sequencing is a field of active research today. An understanding of genome variation would make it possible to fully understand the issues of genetic susceptibility and the pharmacogenomics of drug response for all individuals, and to develop personalized molecular diagnostic tests. The main issue with genomic sequencing today is the cost involved.
Gene-sequencing images, which represent DNA sequences and are generated by fluorescence or microarray techniques, are very large. Moreover, the images are sparse in terms of real information and often resemble celestial images in appearance.
Many methods of carrying out the sequencing process have been proposed, but few are promising enough to bring the cost within affordable levels. One particular approach has been developed by Solexa Technologies. It employs the sequencing-by-synthesis method, wherein fragmented gene samples are first localized in a high-density array of colonies of identical copies. Nucleotides labeled with fluorophores are then added cyclically, and the emitted fluorescence is used to determine the sequence of bases in each fragment, one base at a time. During every cycle, the fluorescence emitted by the colonies on the chip is captured in the form of bio-medical images: four images per cycle (one each for A, C, G and T), for an average of thirty base pairs. At the end of the operation, the cumulative size of the images is on the order of terabytes.
Accordingly, sequencing images, e.g. those resulting from the sequencing-by-synthesis method, often have massive sizes. Sets of uncompressed images quickly reach the order of terabytes in a matter of a few weeks and are hence unsuitable for archiving purposes.
FIG. 1a shows a segment of an original gene-sequencing image, and FIG. 1b shows an enhanced gene-sequencing image, which shows the overall density of the image. Based on FIGS. 1a and 1b, it is difficult to say categorically which part pertains to information and which part pertains to noise. There are, however, indications that only a few bright clusters or spots represent real information, i.e. nucleotide base pairs, while the rest is noise caused by light diffraction during image acquisition.
FIG. 2 is a diagram illustrating a typical histogram of gene-sequencing images for four bases in the same cycle. The bright spots that appear to represent real information fall in the higher pixel values. Therefore, the tail of the histogram corresponds to values that are critical and must be preserved.
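The relationship between sparse bright spots and the histogram tail can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the image is synthetic, pixel values are taken to be 8-bit, and the function names, noise range, spot count, and cutoff are hypothetical and do not come from any actual sequencing instrument.

```python
import random
from collections import Counter


def synthetic_image(width=64, height=64, n_spots=20, seed=0):
    """Build a flat list of hypothetical 8-bit pixels: dim noise plus a few bright spots."""
    rng = random.Random(seed)
    n_pixels = width * height
    # Assumed background noise: low intensities in the range 0..40.
    pixels = [rng.randint(0, 40) for _ in range(n_pixels)]
    # Assumed information-bearing spots: a few distinct pixels with high intensity.
    for idx in rng.sample(range(n_pixels), n_spots):
        pixels[idx] = rng.randint(200, 255)
    return pixels


def tail_fraction(pixels, cutoff):
    """Fraction of pixels whose value is at or above the cutoff."""
    return sum(1 for p in pixels if p >= cutoff) / len(pixels)


pixels = synthetic_image()
histogram = Counter(pixels)  # maps pixel value -> number of occurrences
print("distinct pixel values:", len(histogram))
print("fraction of pixels in the tail (>= 200):", tail_fraction(pixels, 200))
```

In this synthetic case the tail holds only a very small fraction of the pixels, yet it holds every spot that was planted as information, which is why a scheme that distorts the tail would be unacceptable.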
Accordingly, no commonly known compression techniques are able to store the information in gene-sequencing images in a satisfactory manner, in either the lossy or the lossless domain. Lossy means that there is some information loss between the original image and the decompressed image (obtained from the compressed image). If there is no information loss between the original and decompressed images, the compression may be referred to as lossless. Each gene-sequencing image comprises too much data, and not all of it is important for regeneration of the genome sequence. Some commonly known methods utilize direct thresholding, which removes most of the noise data but also removes spots of lower intensity that might be of clinical relevance.
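The distinction between lossless and lossy handling, and the drawback of direct thresholding, can be sketched as follows. This is an illustration under stated assumptions, not a method from any particular product: the pixel row, the threshold of 100, and the helper names are hypothetical, pixel values are assumed to be 8-bit, and zlib stands in for a generic lossless coder.

```python
import zlib


def direct_threshold(pixels, threshold):
    """Lossy step: set every pixel below the threshold to zero."""
    return [p if p >= threshold else 0 for p in pixels]


def lossless_round_trip(pixels):
    """Compress and decompress 8-bit pixels with zlib; the result equals the input."""
    raw = bytes(pixels)  # requires values in 0..255
    return list(zlib.decompress(zlib.compress(raw)))


# Hypothetical row: noise, two bright spots (230, 240), and one faint spot (90).
row = [12, 30, 230, 18, 90, 25, 7, 240]

restored = lossless_round_trip(row)
thresholded = direct_threshold(row, 100)

print("lossless round trip identical:", restored == row)
print("after direct thresholding:", thresholded)
print("faint spot preserved:", 90 in thresholded)
```

The lossless round trip returns every pixel unchanged, noise included. Direct thresholding removes the noise but also zeroes the faint spot at intensity 90, which is the kind of lower-intensity spot that might be of clinical relevance.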
There is a need to be able to transmit and archive gene-sequencing images across companies and academic institutions. However, the massive size of these images is a limiting factor. Hence, an improved method, apparatus, and computer program product for use in compression and decompression of an image dataset would be advantageous.