The human genome represents a significant amount of information, and storing such large quantities of information usually involves representing the four base nucleotides, thymine, cytosine, adenine and guanine (T, C, A, G), as bit pairs. There are about 3 billion base pairs in the human genome, and at two bits per base (four choices), the human genome has about 6 billion bits, or about 750 MB, of information (storing one copy of each chromosome). It may be more common practice to represent each base nucleotide of the base pair with two bits, requiring about 1.4 GB of information. One format for storing sequences is known as “packedDna.” The DNA, or deoxyribonucleic acid, packed as two bits per base, is represented as binary 2-bit values: T=00, C=01, A=10, G=11. The first base is in the most significant 2 bits of a byte; the last base is in the least significant 2 bits. For example, the sequence TCAG is represented as 00011011 in binary (hexadecimal 0x1B). Similar compression schemes are also employed in some other databases, data mining applications, and search applications.
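The 2-bit encoding described above can be illustrated with a minimal Python sketch. The function name `pack_dna` and the choice to pad a trailing partial byte with zero bits are assumptions made for illustration only; they are not taken from any particular packedDna implementation.

```python
# Illustrative sketch only: packs a base string using T=00, C=01, A=10, G=11,
# with the first base in the most significant 2 bits of each byte.
CODE = {"T": 0b00, "C": 0b01, "A": 0b10, "G": 0b11}

def pack_dna(seq: str) -> bytes:
    """Pack a string of T/C/A/G into bytes, four bases per byte (hypothetical helper)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Assumption: a final partial chunk is left-aligned and padded with zero bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

packed = pack_dna("TCAG")
print(hex(packed[0]))  # 0x1b, matching the 00011011 example in the text
```

At four bases per byte, 3 billion bases occupy 750 million bytes, which agrees with the 750 MB figure given above.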
A common operation in genome alignment is to count the occurrences of nucleotides within a string in order to match or partially match base-pair strings. With a packed data format (such as packedDna), the techniques may involve the use of look-up tables together with shift and mask operations, and/or bitwise population counts together with logical operations, in order to count the different nucleotide occurrences within a string.
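One way the population-count approach might look is sketched below in Python for a single 64-bit word holding 32 packed bases. The name `count_bases` and the 64-bit word size are assumptions for illustration. The sketch also assumes the word is completely filled, since any zero padding would be counted as T.

```python
# Illustrative sketch only: counts T, C, A, G in one 64-bit word of 2-bit codes
# (T=00, C=01, A=10, G=11) using shifts, masks, logical operations and popcount.
LOW_BITS = 0x5555555555555555  # selects the low bit of every 2-bit element

def popcount(x: int) -> int:
    return bin(x).count("1")

def count_bases(word: int) -> dict:
    """Return per-nucleotide counts for 32 packed bases (hypothetical helper)."""
    lo = word & LOW_BITS            # low bit of each element
    hi = (word >> 1) & LOW_BITS     # high bit of each element, moved to the low position
    not_lo = ~lo & LOW_BITS
    not_hi = ~hi & LOW_BITS
    return {
        "T": popcount(not_hi & not_lo),  # 00
        "C": popcount(not_hi & lo),      # 01
        "A": popcount(hi & not_lo),      # 10
        "G": popcount(hi & lo),          # 11
    }

# Eight copies of TCAG (0x1B) packed into one 64-bit word.
print(count_bases(0x1B1B1B1B1B1B1B1B))
```

Each nucleotide needs two logical operations and one population count per word, regardless of where in the word its occurrences fall.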
Modern processors often include instructions to provide operations that are computationally intensive, but offer a high level of data parallelism that can be exploited through an efficient implementation using various data storage devices, such as, for example, single-instruction multiple-data (SIMD) vector registers. In SIMD execution, a single instruction operates on multiple data elements concurrently or simultaneously. This is typically implemented by extending the width of various resources such as registers and arithmetic logic units (ALUs), allowing them to hold or operate on multiple data elements, respectively.
The central processing unit (CPU) may provide such parallel hardware to support the SIMD processing of vectors. A vector is a data structure that holds a number of consecutive data elements. A vector register of size L may contain N vector elements of size M, where N=L/M. For instance, a 64-byte vector register may be partitioned into (a) 64 vector elements, with each element holding a data item that occupies 1 byte, (b) 32 vector elements to hold data items that occupy 2 bytes (or one “word”) each, (c) 16 vector elements to hold data items that occupy 4 bytes (or one “doubleword”) each, or (d) 8 vector elements to hold data items that occupy 8 bytes (or one “quadword”) each. On the other hand, some applications may store and operate on packed sub-byte data elements where a register or portion of a register of size k bits may contain n vector elements of size m, where n=k/m. For instance, a 64-bit register or portion of a register may be partitioned into (e) 64 packed elements, with each element holding a data item that occupies 1 bit, (f) 32 packed elements to hold data items that occupy 2 bits each, or (g) 16 packed elements to hold data items that occupy 4 bits (or one “nibble”) each. A 32-bit register or portion of a register may be partitioned into (h) 32 packed elements, with each element holding a data item that occupies 1 bit, (i) 16 packed elements to hold data items that occupy 2 bits each, or (j) 8 packed elements to hold data items that occupy 4 bits each.
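The relationships N=L/M and n=k/m can be checked with a short Python sketch. The helper names `element_count` and `unpack_elements` are hypothetical, and the ordering of elements from most significant to least significant is an assumption chosen to match the packedDna convention described earlier.

```python
# Illustrative sketch only: partitions a register of a given width into
# equal-sized elements and extracts them with shift and mask operations.

def element_count(register_bits: int, element_bits: int) -> int:
    """Number of elements that fit in the register (n = k / m)."""
    if register_bits % element_bits != 0:
        raise ValueError("element size must divide the register size")
    return register_bits // element_bits

def unpack_elements(value: int, register_bits: int, element_bits: int) -> list:
    """Split value into elements, most significant element first (assumed order)."""
    n = element_count(register_bits, element_bits)
    mask = (1 << element_bits) - 1
    return [
        (value >> (register_bits - element_bits * (i + 1))) & mask
        for i in range(n)
    ]

# A 64-byte (512-bit) vector register holding 4-byte doublewords: case (c).
print(element_count(64 * 8, 4 * 8))   # 16
# A 64-bit register holding 2-bit elements: case (f).
print(element_count(64, 2))           # 32
# One byte holding the packed sequence TCAG.
print(unpack_elements(0x1B, 8, 2))    # [0, 1, 2, 3]
```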
A number of applications have large amounts of data-level parallelism and may be able to benefit from SIMD support. However, some applications spend a significant amount of time in operations such as reformatting the data to take advantage of the SIMD parallelism. Some applications (such as genome sequencing and alignment, databases, data mining, and search applications) may have data elements that are smaller than 8 bits. To maintain SIMD efficiency, these sub-byte elements may need to be decompressed so that each occupies one byte before being processed in parallel. As a result, such applications may see somewhat limited performance benefits from SIMD operations.
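The decompression step mentioned above can be sketched in scalar Python as follows. The name `expand_2bit_to_bytes` is hypothetical, and the sketch models only the data layout change, not an actual SIMD implementation.

```python
# Illustrative sketch only: expands packed 2-bit elements so that each element
# occupies one byte, the form byte-granular SIMD operations typically expect.

def expand_2bit_to_bytes(packed: bytes) -> bytes:
    """Expand each packed byte into four bytes, most significant element first."""
    out = bytearray()
    for byte in packed:
        for shift in (6, 4, 2, 0):
            out.append((byte >> shift) & 0b11)
    return bytes(out)

expanded = expand_2bit_to_bytes(b"\x1b")
print(list(expanded))                         # [0, 1, 2, 3]
print(len(expand_2bit_to_bytes(bytes(16))))   # 64
```

Under these assumptions, 16 bytes of packed data expand to fill an entire 64-byte vector register, so the reformatting quadruples the data footprint before any parallel work begins.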
To date, potential solutions to such performance concerns and related processing difficulties have not been adequately explored.