Generally speaking, variable length integer encoding uses an arbitrary number of eight-bit bytes to represent an integer. The Musical Instrument Digital Interface (MIDI) file format makes use of variable length integer encoding, as does the Wireless Application Protocol (WAP). 8-bit UCS/Unicode Transformation Format (UTF-8) also uses a variable length encoding scheme.
Many computing operations involve data expressed as 64-bit signed or unsigned integers. Such data may span a large range, including both small and large values, and often contains duplicate values. However, this kind of data may be, in some ways, ill-suited to general-purpose compression schemes (e.g., Lempel-Ziv implementations such as lzo and gz, or other types of lossless encoding, such as Huffman, arithmetic, Golomb, run-length encoding, and the like). In uncompressed form on disk, each such value generally requires a full 64 bits of storage space, even though many of the leading bits may be zero bits.
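The storage overhead described above can be illustrated with a minimal Python sketch. The helper names `fixed_width` and `significant_bytes` are hypothetical and are used only for illustration; the sketch assumes unsigned values stored most significant byte first.

```python
import struct


def fixed_width(value):
    """Pack an unsigned integer into exactly 8 bytes, most significant byte first."""
    return struct.pack(">Q", value)


def significant_bytes(value):
    """Return how many bytes are needed once leading zero bytes are dropped."""
    return max(1, (value.bit_length() + 7) // 8)


if __name__ == "__main__":
    for v in (0, 300, 70000, 2**64 - 1):
        print(v, fixed_width(v).hex(), significant_bytes(v))
```

For a small value such as 300, six of the eight stored bytes are zero, which is the redundancy that variable length encodings attempt to remove.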
In a common implementation of variable length integer encoding, one bit within each byte is reserved as a continuation bit that indicates whether the following byte is also part of the current integer representation. If the continuation bit of a byte is 0, then that byte is the last byte of the integer. If the continuation bit of a byte is 1, then the following byte is also part of the integer. The scalar value of the variable length integer is the concatenation of the non-control bits (i.e., the data bits). However, continuation-bit schemes may not compress optimally because one bit in every eight is a control bit, not a data bit. Additionally, decoding continuation-bit-encoded integers may be relatively slow and complex, in part because a branch or other flow control mechanism may be required when processing each byte to determine whether the current integer's data bits continue into the next byte.
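A minimal Python sketch of such a continuation-bit scheme follows. It assumes that the high bit of each byte is the continuation bit and that the seven-bit groups are stored most significant group first, as in MIDI variable-length quantities; other formats order the groups differently. The function names are hypothetical.

```python
def encode_vlq(value):
    """Encode a non-negative integer using 7 data bits per byte."""
    if value < 0:
        raise ValueError("value must be non-negative")
    groups = [value & 0x7F]  # last byte: continuation bit is 0
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)  # earlier bytes: continuation bit is 1
        value >>= 7
    return bytes(reversed(groups))


def decode_vlq(data, pos=0):
    """Decode one integer starting at pos; return (value, next_pos)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)  # append the 7 data bits
        if not (byte & 0x80):  # this test is repeated for every byte
            return value, pos


if __name__ == "__main__":
    print(encode_vlq(300).hex())
    print(decode_vlq(encode_vlq(300)))
```

The test inside the `decode_vlq` loop is the per-byte branch referred to above: the decoder cannot know the length of an integer until it has inspected each of its bytes in turn.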
In alternate approaches, a block of continuation bits may be stored apart from the binary bytes that represent the integer. For example, an alternate encoding scheme may insert a byte of continuation bits for every eight data bytes. However, in such an alternate scheme, compressibility may also suffer in part because, as a result of the periodically-inserted control bytes, repeated sequences of numbers may not be represented by identical byte sequences within a byte-stream.
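The effect on repeated sequences can be sketched in Python as follows. The layout is an assumption made for illustration: each integer is split into whole bytes, most significant byte first, and one control byte precedes every group of up to eight data bytes, with bit i of the control byte serving as the continuation bit of the i-th data byte in the group. The function name `encode_grouped` is hypothetical.

```python
def encode_grouped(values):
    """Encode integers with one control byte per eight data bytes."""
    data = bytearray()
    flags = []
    for v in values:
        raw = v.to_bytes(max(1, (v.bit_length() + 7) // 8), "big")
        data += raw
        # 1 means "the next data byte belongs to the same integer".
        flags += [1] * (len(raw) - 1) + [0]
    out = bytearray()
    for start in range(0, len(data), 8):
        control = 0
        for i, flag in enumerate(flags[start:start + 8]):
            control |= flag << i
        out.append(control)  # periodically inserted control byte
        out += data[start:start + 8]
    return bytes(out)


if __name__ == "__main__":
    encoded = encode_grouped([300, 7] * 4)
    print(encoded.hex(" "))
    print(encoded.count(b"\x01\x2c\x07"))
```

With the input `[300, 7]` repeated four times, the three-byte data pattern `01 2c 07` appears intact only three times in the output, because the second control byte falls in the middle of the third repetition.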
A related approach also uses periodically-inserted control bytes, but the control bytes represent a sequence of integer byte-lengths rather than continuation bits. Such an approach has shortcomings similar to those discussed just above.
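One possible form of such a length-based scheme is sketched below in Python. The details are assumptions chosen for brevity rather than a description of any particular format: values are limited to 32 bits, they are processed in groups of exactly four, and each control byte holds four two-bit fields, each storing a byte-length minus one. The function names are hypothetical.

```python
def encode_group4(values):
    """Encode 32-bit unsigned integers in groups of four, one control byte per group."""
    if len(values) % 4:
        raise ValueError("this sketch requires a multiple of four values")
    out = bytearray()
    for g in range(0, len(values), 4):
        control = 0
        body = bytearray()
        for i, v in enumerate(values[g:g + 4]):
            if not 0 <= v < 1 << 32:
                raise ValueError("value out of 32-bit range")
            n = max(1, (v.bit_length() + 7) // 8)  # 1 to 4 bytes
            control |= (n - 1) << (2 * i)  # two-bit length field
            body += v.to_bytes(n, "little")
        out.append(control)
        out += body
    return bytes(out)


def decode_group4(data):
    """Decode a byte string produced by encode_group4."""
    values = []
    pos = 0
    while pos < len(data):
        control = data[pos]
        pos += 1
        for i in range(4):
            n = ((control >> (2 * i)) & 0b11) + 1
            values.append(int.from_bytes(data[pos:pos + n], "little"))
            pos += n
    return values


if __name__ == "__main__":
    encoded = encode_group4([1, 300, 70000, 2**32 - 1])
    print(encoded.hex(" "))
    print(decode_group4(encoded))
```

As with the continuation-bit blocks, the control bytes are interleaved with the data, so the bytes representing a given run of integers depend on where the run falls relative to the group boundaries.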
In the approach used by UTF-8 encoding, each byte has 0-4 leading 1-bits followed by a 0-bit. Zero 1-bits indicate a 1-byte sequence; one 1-bit indicates a continuation byte in a multi-byte sequence; and two or more 1-bits indicate the first byte of an N-byte sequence, where N is the number of leading 1-bits. The scalar value of the variable length integer is the concatenation of the (non-contiguous) non-control bits. However, the UTF-8 approach is relatively inefficient and may not compress optimally, in part because it uses at least two bits of every byte in a multi-byte sequence as control bits.
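The UTF-8 bit layout can be sketched in Python as shown below. The hand-written encoder `utf8_encode` and the helper `leading_ones` are hypothetical names used for illustration; the sketch does not reject surrogate code points, which a conforming encoder would.

```python
def utf8_encode(cp):
    """Encode a code point into 1 to 4 bytes using the UTF-8 bit layout."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp < 0x80:
        return bytes([cp])  # 0xxxxxxx
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),  # 110xxxxx
                      0x80 | (cp & 0x3F)])  # 10xxxxxx
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),  # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:
        return bytes([0xF0 | (cp >> 18),  # 11110xxx
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


def leading_ones(byte):
    """Count the leading 1-bits of a byte value (0 to 255)."""
    count = 0
    while count < 8 and byte & (0x80 >> count):
        count += 1
    return count


if __name__ == "__main__":
    encoded = utf8_encode(0x20AC)  # EURO SIGN
    print(encoded.hex(" "), [leading_ones(b) for b in encoded])
```

In the three-byte example, only 16 of the 24 encoded bits carry data; the remaining eight are control bits spread across all three bytes.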
Other variable length integer encoding schemes use a variable number of leading 1-bits, generally followed by a 0-bit, as control bits that map to different byte lengths.
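A leading-1-bits scheme of this general kind might look like the following Python sketch. The specific mapping is an assumption made for illustration: k leading 1-bits in the first byte indicate that k additional bytes follow, and a first byte of all ones (with no terminating 0-bit) indicates that eight bytes follow, so that any 64-bit unsigned value fits in at most nine bytes. The function names are hypothetical.

```python
def encode_prefix(value):
    """Encode an unsigned 64-bit integer with a unary length prefix."""
    if not 0 <= value < 1 << 64:
        raise ValueError("value out of 64-bit range")
    n = max(1, -(-value.bit_length() // 7))  # bytes needed at 7 data bits each
    if n > 8:
        return b"\xff" + value.to_bytes(8, "big")  # eight 1-bits, then 8 raw bytes
    prefix = ((1 << (n - 1)) - 1) << (7 * n + 1)  # n-1 ones, then a 0-bit
    return (prefix | value).to_bytes(n, "big")


def decode_prefix(data, pos=0):
    """Decode one integer starting at pos; return (value, next_pos)."""
    first = data[pos]
    n = 1
    while n < 9 and first & (0x80 >> (n - 1)):  # count leading 1-bits
        n += 1
    if n == 9:
        return int.from_bytes(data[pos + 1:pos + 9], "big"), pos + 9
    word = int.from_bytes(data[pos:pos + n], "big")
    return word & ((1 << (7 * n)) - 1), pos + n  # strip the control bits


if __name__ == "__main__":
    for v in (0, 127, 128, 300, 2**56, 2**64 - 1):
        enc = encode_prefix(v)
        print(v, enc.hex(" "), decode_prefix(enc))
```

In this sketch the length is known after inspecting only the first byte, but the decoder must still select among several byte lengths for each integer.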
The schemes discussed above tend to decrease the compressibility of the data because patterns within the data may be broken up by control data, such that the same sequence of numbers may not always be represented by the same sequence of bytes within a byte-stream. In addition, approaches such as those described above can be relatively computationally expensive to decode, requiring branching or some other method of flow control within a decoding routine.