A Common LISP system, unlike strongly-typed languages such as Pascal, uses latent data typing. Latent typing allows a function to defer determining the "type" of an argument until run-time. Type identification is important because although some functions can operate on multiple data types, others operate only on specific data types. The program must be able to identify types and invoke appropriate functions.
Functions that operate on pointers to LISP objects, however, do not need to know the objects' data type. Only when the value of the LISP object is operated on is its data type required. This characteristic allows the LISP program to manipulate "unknown-type" objects, and serves to support the use of future objects in LISP as reported by Robert H. Halstead, Jr., of the Massachusetts Institute of Technology, in his article entitled "Multilisp: A Language for Concurrent Computation", which appeared in the October, 1985 issue of ACM Transactions on Programming Languages and Systems, Vol. 7, No. 4.
Most functions, nevertheless, do operate on their arguments and therefore need to know to which data type the arguments belong. It is well known that this data type checking requirement can consume significant computational resources. This is one reason why machines that execute LISP code, which perform these type checks as part of their normal instruction cycle, have a performance edge over "stock" hardware. This is also why compiler declarations for eliminating run-time data type checking can result in significant increases in performance.
Data type checking of the prior art includes the Object Direct, Pointer Indirect, Tagged and Pointer Direct schemes. The Object Direct scheme encodes the type with the object. Its disadvantages stem from the extra space required in each object to hold the data type information, in addition to the necessary memory reference when only a pointer to the object is available.
While the Pointer Indirect method reduces the storage overhead of the Object Direct method by dividing the available memory into regions for each data type, uneven storage can occur, resulting in large "holes" of memory that are unused.
The Tagged scheme is inherently undesirable because of the specialized hardware requirements that follow from encoding the type in the pointer, but separate from the address.
The Pointer Direct scheme encodes the type directly in the pointer. The need to access memory is eliminated, driving performance up, and special hardware is not required.
Within the Pointer Direct scheme, there are several methods in the prior art of encoding the type in the pointer. An important consideration in type encoding is the handling of immediate objects. These objects are known to include small integers or "fixnums," small floats and characters. By representing these objects in the pointer itself, memory does not have to be allocated for them. The non-type bits of the pointer contain the immediate object. If an operation involving immediate objects could result in an overflow, the data must be shifted into the high bits so that the overflow can be detected. If no overflow is possible, the shifts are unnecessary.
When immediate objects are not involved, Shifted, High-bits Encoding can be used. In this method, the high address bits of the 32-bit word are used to divide the memory into regions, each associated with only one data type. Thus, each data type has its own contiguous block of memory. To retrieve the type from the pointer, the high type bits must be compared to a given type code. This is accomplished either by masking out the lower bits and comparing, or by shifting the pointer and comparing. Care must be taken when shifting or masking to ensure that valuable information (i.e. the operand) is not destroyed during the operation.
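The two ways of retrieving the type described above, shifting and masking, can be illustrated with a minimal Python sketch. It assumes 5 type bits in a 32-bit word; the names `hb_make_pointer`, `hb_type_by_shift` and `hb_has_type_by_mask` are hypothetical and do not come from any particular LISP implementation.

```python
# Sketch of Shifted, High-bits Encoding (assumed layout: 5 type bits, 27 offset bits).
WORD_BITS = 32
TYPE_BITS = 5                               # assumption for illustration
OFFSET_BITS = WORD_BITS - TYPE_BITS         # bits left for the offset within a region
OFFSET_MASK = (1 << OFFSET_BITS) - 1
TYPE_MASK = ((1 << TYPE_BITS) - 1) << OFFSET_BITS


def hb_make_pointer(type_code, offset):
    """Build a pointer whose high bits select the region (and so the type)."""
    assert 0 <= type_code < (1 << TYPE_BITS)
    assert 0 <= offset <= OFFSET_MASK
    return (type_code << OFFSET_BITS) | offset


def hb_type_by_shift(pointer):
    """Shift the offset bits out; what remains is the type code."""
    return pointer >> OFFSET_BITS


def hb_has_type_by_mask(pointer, type_code):
    """Mask out the lower bits and compare against the type code in place."""
    return (pointer & TYPE_MASK) == (type_code << OFFSET_BITS)


p = hb_make_pointer(3, 0x1234)
print(hb_type_by_shift(p), hb_has_type_by_mask(p, 3))
```

Both functions work on a copy of the pointer value, which is the precaution the text asks for: the original operand is left intact.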
The main disadvantage of this method reveals itself when the programmer decides how many bits to allocate for type representation. The number selected determines how many types can be distinguished, and it also determines the size of each type's associated region. This relationship can be seen in FIG. 1. An allocation of 5 bits out of a 32-bit address yields typing support for 32 different data types, which is usually adequate. The resulting region size, 2^27 bytes or 128 megabytes, should also be sufficient for most Common LISP applications. If, however, a full 32-bit address is not supported, a severe shrinkage in the region size results. A 24-bit address with 5 type bits yields only a one-half megabyte region, leading to frequent garbage collections and an inability to support the larger applications. Further, while reducing the number of type bits is a consideration, it is not acceptable. A reduction to 4 type bits would only double the region size to one megabyte, and it would also reduce the number of types available to 16. This would result in some objects sharing the same pointer type, forcing a supplemental method of type representation.
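The trade-off between the number of types and the region size is simple power-of-two arithmetic. The following Python sketch works it out for the address widths mentioned above; the helper name `high_bits_layout` is hypothetical.

```python
def high_bits_layout(address_bits, type_bits):
    """Return (number of distinguishable types, bytes per type region)."""
    types = 1 << type_bits
    region_bytes = 1 << (address_bits - type_bits)
    return types, region_bytes


for address_bits, type_bits in [(32, 5), (24, 5), (24, 4)]:
    types, region_bytes = high_bits_layout(address_bits, type_bits)
    print(f"{address_bits}-bit address, {type_bits} type bits: "
          f"{types} types, {region_bytes} bytes per region")
```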
An additional disadvantage of the Shifted, High-bits encoding method is the sparsely populated virtual address space that frequently results. This is because storage allocated for one type may be unused while other types run out of storage space.
An alternative to using the high bits to encode is to use the Low-bits Encoding method. While this method has a number of known advantages, there are disadvantages. In addition to shifting or masking problems of high-bits encoding, there are not enough low bits available for encoding. This problem is shown in FIG. 2. In a byte-addressed machine with a 32 bit word size, a one-word-aligned address leaves bit 0 and bit 1, the two lower bits, as zeros. Because these are not used for addressing, they can be used for type encoding. A two-word alignment can also be used, but this only frees a third bit for type encoding, allowing 8 types.
An additional complication occurs when the pointer address is not aligned properly after shifting or masking. Resultant memory accesses are slowed with unaligned addresses, since additional bus cycles are required to align the addresses by adding a displacement to the pointer.
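As an illustration of how alignment frees low bits for type encoding, the following Python sketch assumes a byte-addressed machine with 4-byte words, as in the text. The names `free_low_bits`, `lb_tag` and `lb_untag` are hypothetical.

```python
WORD_BYTES = 4  # byte-addressed machine with a 32-bit word


def free_low_bits(alignment_words):
    """Number of low address bits that are always zero for this alignment."""
    alignment_bytes = alignment_words * WORD_BYTES
    # alignment_bytes is assumed to be a power of two, so this is log2
    return alignment_bytes.bit_length() - 1


def lb_tag(address, type_code, alignment_words):
    """Store a type code in the unused low bits of an aligned address."""
    bits = free_low_bits(alignment_words)
    assert address % (alignment_words * WORD_BYTES) == 0, "address not aligned"
    assert 0 <= type_code < (1 << bits), "not enough low bits for this type"
    return address | type_code


def lb_untag(pointer, alignment_words):
    """Split a tagged pointer back into (aligned address, type code)."""
    mask = (1 << free_low_bits(alignment_words)) - 1
    return pointer & ~mask, pointer & mask


print(free_low_bits(1), free_low_bits(2))   # one-word vs. two-word alignment
print(lb_untag(lb_tag(0x1000, 5, 2), 2))
```

The masking step in `lb_untag` is what restores an aligned address; skipping it is what leads to the unaligned accesses described above.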
Another variation of low-bit encoding is called Shifted-Address, Low-bits Encoding. In this method, the full lower byte is used to encode, allowing 256 types to be encoded. The low byte must be shifted out before the pointer is used to reference memory to get the object. A common compare-byte instruction can be used to test the low byte. The major problem with this scheme is that the pointer must always be shifted before a memory reference, even when type checking is not being performed.
Another problem with this scheme is that the address space is limited to 24 bits when using a 32-bit address. While one-word and two-word alignment can add two and three bits, respectively (see FIG. 2), the same alignment complications associated with Low-bits Encoding exist.
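A minimal Python sketch of Shifted-Address, Low-bits Encoding follows. It assumes the 24-bit address is stored unscaled in the upper three bytes; the names `sa_make_pointer`, `sa_type_of` and `sa_address_of` are hypothetical.

```python
TYPE_BYTE_BITS = 8
ADDRESS_BITS = 32 - TYPE_BYTE_BITS   # only 24 bits remain for the address


def sa_make_pointer(type_code, address):
    """Pack a 24-bit address above a full type byte."""
    assert 0 <= type_code < (1 << TYPE_BYTE_BITS)
    assert 0 <= address < (1 << ADDRESS_BITS)
    return (address << TYPE_BYTE_BITS) | type_code


def sa_type_of(pointer):
    """Type check: a plain compare against the low byte."""
    return pointer & 0xFF


def sa_address_of(pointer):
    """Every memory reference must first shift the type byte out."""
    return pointer >> TYPE_BYTE_BITS


p = sa_make_pointer(200, 0xABCDEF)
print(sa_type_of(p), hex(sa_address_of(p)))
```

The shift in `sa_address_of` is paid on every dereference, whether or not the type was inspected, which is the cost the text identifies.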
The last method of the prior art used for type encoding is called Pointer-direct, Bit-assignment Encoding, and can be used for high-bit or low-bit encoding. Instead of using a bit pattern to encode a data type, individual bits are assigned in the pointer to represent a certain data type. In this method, only a bit test needs to be performed. The disadvantage of this method is that there is only a limited number of bits available for assignment. For example, if bit-assignment encoding uses 5 type bits, only 6 types can be represented, as shown in FIG. 3. Using these same 5 bits in the shifted, high-bit scheme yields 32 types.
Similarly, in the low-bits case, if bit-assignment encoding uses 3 type bits, only 4 types can be encoded. Using these same 3 bits in non-bit-assignment low-bits encoding yields 8 types.
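The difference in capacity between the two approaches can be sketched in Python. The sketch assumes that the extra type in bit-assignment encoding is the case where none of the assigned bits is set, which is one reading consistent with the counts given above; the function names are hypothetical.

```python
def pattern_type_count(type_bits):
    """Bit-pattern encoding: every combination of bits names a type."""
    return 1 << type_bits


def bit_assignment_type_count(type_bits):
    """Bit-assignment encoding: one type per bit, plus (assumed) 'no bit set'."""
    return type_bits + 1


def has_assigned_type(pointer, bit):
    """The whole type check is a single bit test."""
    return bool(pointer & (1 << bit))


for bits in (5, 3):
    print(bits, pattern_type_count(bits), bit_assignment_type_count(bits))
```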
A hybrid of the aforementioned strategies was proposed by the authors of COINS Technical Report No. 88-35, dated Sept. 15, 1988 and entitled "Common Lisp Object Representation Strategies: The Umass Parallel Common Lisp Implementation". The inventor of the present application was a co-author of the report. A combination of the Object Direct and Pointer Direct strategies discussed above is used. By using two-word alignment, the lowest 3 bits are available for encoding. Specific types are assigned to these lower 3 bits, and the assignments are shown in FIG. 4. If, however, these three bits are all zeros, then 5 additional bits, bit 3 through bit 7, are encoded to contain the type of the object. These bit encodings are displayed in FIG. 5. Each of the types represented is briefly discussed below, followed by the disadvantages of the scheme.
If all three lower bits are zeros, then the full lower byte contains the type of object. For these, a simple compare-byte instruction is used. Characters have bit 6 set, and the second byte contains the code. "Fixnums" do not have bit 6, the immediate bit, set as do all other immediate objects. Because of the desire to provide for quick checking and operation on fixnums, fixnums are represented by having the entire lower byte all zeros. This allows for fixnums to be operated on directly without any shifting, masking or correction.
All of the non-immediate types are encoded in the low byte with bit 3 set. Thus, the comparing done for type checking using the EQL test (the default test function for many Common LISP functions) is done with no memory reference. The pointers to these non-immediate numbers do, nonetheless, have to be shifted before accessing their values. In addition, the address space available to them is reduced to only 27 bits (shift 5, leaving the 3 high type bits as the lowest address bits). These drawbacks are minimized by the fact that these numbers were not intended to be "high-performance" numbers, nor are they heavily used. The time spent manipulating their values far exceeds the cost of the extra instruction needed to shift the pointer.
The hybrid scheme also supports futures, and supports them as first class objects. This affects the EQ test, another frequently used LISP test. In LISP, two objects are EQ if they are identical pointers, or if either or both of them is a future object which has a determined value that is "EQ" to the other object. As a result, all EQ tests must check for futures before failing. And because the future type is encoded directly in the pointer (bit 7), no memory reference is required. This, however, has the same drawbacks as the non-immediate types: the pointer must be shifted before accessing the future object. Because the large majority of the time the objects will not be futures, the tradeoff is acceptable.
The object direct portion of the scheme utilizes bit 2, as shown in FIG. 4. When set, the object begins with a header word, with the low byte in the header containing the type. Testing for these types requires a memory reference. Possible types include arrays, structures, compiled functions and any number of user-defined types.
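The decoding order implied by the description of the hybrid scheme can be sketched in Python. The bit positions used here (bit 2 for header objects, bit 3 for non-immediate types, bit 6 for immediates, bit 7 for futures, an all-zero low byte for fixnums) come from the text, but the exact encodings of FIG. 4 and FIG. 5 are not reproduced, so the order of the tests and the names `classify` and `character_code` are assumptions made for illustration.

```python
HEADER_BIT = 1 << 2          # object begins with a header word holding its type
NON_IMMEDIATE_BIT = 1 << 3
IMMEDIATE_BIT = 1 << 6
FUTURE_BIT = 1 << 7


def classify(pointer):
    """Coarse classification of a pointer using only its low byte."""
    low3 = pointer & 0b111
    if low3 & HEADER_BIT:
        return "header object (type is in memory)"
    if low3:
        return "type assigned in the low 3 bits"
    low_byte = pointer & 0xFF
    if low_byte == 0:
        return "fixnum"
    if low_byte & FUTURE_BIT:
        return "future"
    if low_byte & IMMEDIATE_BIT:
        return "immediate"
    if low_byte & NON_IMMEDIATE_BIT:
        return "non-immediate"
    return "unassigned"


def character_code(pointer):
    """For a character, the second byte of the pointer holds the code."""
    return (pointer >> 8) & 0xFF


char_pointer = (ord("A") << 8) | IMMEDIATE_BIT
print(classify(char_pointer), chr(character_code(char_pointer)))
print(classify(0x00001200))
```

Only the header-object case would need a memory reference to finish the type check; every other branch is decided from the pointer alone.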
Testing for either a Symbol or List (Cons) is a simple bit test. The key is setting the appropriate bit. For a Symbol, seven is subtracted from the two-word-aligned pointer. Referring back to FIG. 2, one can see this has the effect of setting bit 0. For a List, six is subtracted from the two-word-aligned pointer. Again, referring back to FIG. 2, it can be seen this subtraction has the effect of setting bit 1. To access the Symbol or List, the corresponding positive displacement is added to restore the pointer to a word boundary.
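The displacement arithmetic for Symbols and Lists can be checked with a short Python sketch. The offsets of seven and six are taken from the text; the function names are hypothetical, and addresses are assumed to be two-word (8-byte) aligned.

```python
SYMBOL_OFFSET = 7   # subtracting 7 from an 8-byte-aligned address sets bit 0
LIST_OFFSET = 6     # subtracting 6 from an 8-byte-aligned address sets bit 1


def symbol_pointer(aligned_address):
    assert aligned_address % 8 == 0 and aligned_address >= 8
    return aligned_address - SYMBOL_OFFSET


def list_pointer(aligned_address):
    assert aligned_address % 8 == 0 and aligned_address >= 8
    return aligned_address - LIST_OFFSET


def is_symbol(pointer):
    return bool(pointer & 0b001)     # bit 0 test


def is_list(pointer):
    return bool(pointer & 0b010)     # bit 1 test


def symbol_address(pointer):
    return pointer + SYMBOL_OFFSET   # positive displacement restores alignment


def list_address(pointer):
    return pointer + LIST_OFFSET


s = symbol_pointer(0x1008)
c = list_pointer(0x1008)
print(bin(s & 0b111), bin(c & 0b111), hex(symbol_address(s)), hex(list_address(c)))
```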
There are, however, two main disadvantages with this scheme. The first deals with the representation of NIL. In a LISP system, NIL can be operated on as a Symbol or a List. As a result, bit 0 and bit 1 in FIG. 4 should both be set for NIL. The problem comes when accessing the memory location. For a Symbol operation, seven should be added to the pointer. This addition leaves the lower 3 bits equal to 010. For the far more common List operation, six should be added to the pointer. This addition leaves the lower 3 bits equal to 001. Because the system is two-word-aligned, and the lower two bits are not 00, neither of these results is word-aligned. Either subsequent access will be exceedingly lengthy because the access is across a word boundary.
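The NIL problem can be reproduced numerically. In this Python sketch the NIL pointer is assumed to be some 8-byte-aligned base with bits 0 and 1 both set; the base value `0x1000` and the helper names are hypothetical.

```python
SYMBOL_OFFSET = 7
LIST_OFFSET = 6


def low_three_bits(address):
    return address & 0b111


def is_word_aligned(address):
    return address % 4 == 0


nil_pointer = 0x1000 | 0b011          # both the Symbol bit and the List bit set

as_symbol = nil_pointer + SYMBOL_OFFSET
as_list = nil_pointer + LIST_OFFSET

print(format(low_three_bits(as_symbol), "03b"), is_word_aligned(as_symbol))
print(format(low_three_bits(as_list), "03b"), is_word_aligned(as_list))
```

Neither displaced address has its two low bits clear, so both accesses straddle a word boundary.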
In other words, in a binary system, bit 0 and bit 1 represent "1" and "2" respectively, and bit 2 represents "4". In a word-aligned system, addresses occur every four bytes. Consequently, word boundaries occur every four bytes, i.e. 0, 4, 8, etc. When the lower two bits are not both zeros, and a memory access occurs, a word boundary must be crossed. And when word boundaries are crossed, memory accesses are costly from a performance standpoint.
The other disadvantage deals with the representation and subsequent use of a certain type of immediate object: a short-float. Because the entire lower byte is used for type encoding, short-floats have only 24 bits of value. This results in rather low precision. In addition, a mechanism has to be utilized which converts the 24-bit short-float into a representation which is recognizable by the hardware. This conversion is quite costly from a performance standpoint.
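The precision loss can be illustrated with a Python sketch. The text does not say which 24 bits are kept, so the sketch assumes the short-float is the upper 24 bits of an IEEE 754 single-precision value, with the low 8 mantissa bits dropped to make room for the type byte. The function names are hypothetical.

```python
import struct


def to_short_float_bits(x):
    """Keep the top 24 bits of the single-precision encoding of x (assumed layout)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 8


def from_short_float_bits(value_bits):
    """Convert the 24 value bits back to a form the hardware understands."""
    (x,) = struct.unpack(">f", struct.pack(">I", (value_bits << 8) & 0xFFFFFFFF))
    return x


def short_float_round_trip(x):
    return from_short_float_bits(to_short_float_bits(x))


for value in (1.0, 0.1, 3.14159265):
    print(value, short_float_round_trip(value))
```

Under this assumption only 15 mantissa bits survive, and the conversion in `from_short_float_bits` stands in for the extra step the text describes as costly.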