The present invention relates to a subtype test and, more particularly, to a method of encoding a dataset based on an inheritance hierarchy of the dataset.
Most modern programming languages are based on the notion of type conformance, which allows polymorphism and code reuse. Type conformance is often facilitated by a dedicated procedure to decide whether two types are related by a given subtyping relationship. This procedure is known as a subtype test or a type inclusion test. Broadly speaking, a specific computer language may distinguish between a type, a class, an interface, a signature, etc. However, subtype tests may be applied to any set of objects which may be ordered on some graph.
An efficient implementation of the type inclusion test plays an important role in the performance of object-oriented (OO) programming languages with multiple subtyping, such as C++ (e.g., the “dynamic_cast” operation), Eiffel (the “?=” operation), Java (“instanceof”), Smalltalk (“isKindOf”), and the like.
A subtype test is one of the basic operations in a run time environment of object-oriented programs. A formal definition of a subtype test is as follows: given an object, o, and a type b, a subtype test is a query whether the type, a, of the object, o, is a subtype of b, i.e., whether a is a descendant of b in an inheritance hierarchy.
A subtype relation which is reflexive, transitive and anti-symmetric is typically denoted by the subtype symbol, "⪯". The subtype symbol is used to denote a relation between two types, say, type a and type b. Hence, if it is found that a⪯b then it is said that a is a subtype of b and b is a supertype of a. More generally, given a hierarchy in the form of a set T of types and subtype relations, it is desired to construct a data structure supporting subtype relation queries. Once such a data structure is generated, it is said that the hierarchy has been encoded. This encoding involves computer operations which may be both time and space consuming, and hence affects the performance of a specific computer application.
Each encoding procedure may be characterized by four complexity measures.
A first measure is a space measure, also called the encoding length. Encoding methods associate certain data with each type. The space measure is the average number of bits per type.
A second measure is an instruction count measure, which is the number of machine instructions in the test code, on certain hardware architecture. There are indications that the space consumed by the test code, which can appear many times in a program, can dominate the encoding length. An encoding is said to be uniform if there exists an implementation of the test code in which the instruction count does not depend on the size of the hierarchy.
A third measure is a test time measure, which reflects the complexity of the test code. Time complexity is of major interest in the art. Since the test code might contain loops, the time complexity may not be constant even in uniform encodings; constant time encodings, however, are always uniform. To improve timing performance, loops of non-constant time encodings may be unrolled, giving rise to a non-constant instruction count, without violating the uniformity condition.
Typically, at compilation time, the supertype b, is known. The test code can then be specialized, by precomputing values depending on b only, and emitting them as part of the test code. Specialization thus benefits both instruction count and test time, and may even reduce the encoding length.
A fourth measure is an encoding creation time which is the elapsed time for generating the actual encoding. This task is typically computationally difficult, so different creation algorithms have been proposed for the same encoding scheme. These algorithms differ in their running time and encoding length.
Many subtyping methods are known in the art. The most obvious method is called binary matrix (BM) representation, in which, although the time measure is constant, the encoding length is extremely large (of the order of the size of the set T). Hence, the BM method is useful for small hierarchies and is used, e.g., for encoding a JAVA interface hierarchy in the CACAO 64-bit JIT compiler. However, for a large hierarchy containing 5,500 types the total size of the binary matrix is rather large and may typically reach 3.8 MB.
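The BM representation can be illustrated with a short sketch. The hierarchy below is a made-up four-type example (not taken from any figure), and the code simply fills an n-by-n matrix whose entry (a, b) records whether a is a subtype of b, so that the test itself is a single table lookup:

```python
# A minimal sketch of binary-matrix (BM) encoding for a small,
# hypothetical hierarchy; the parent lists below are illustrative only.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
types = sorted(parents)
index = {t: i for i, t in enumerate(types)}

# matrix[i][j] is 1 iff type i is a subtype of type j (reflexive).
n = len(types)
matrix = [[0] * n for _ in range(n)]

def mark(a, b):
    """Record a <= b, then propagate upward to all ancestors of b."""
    if matrix[index[a]][index[b]]:
        return
    matrix[index[a]][index[b]] = 1
    for p in parents[b]:
        mark(a, p)

for t in types:
    mark(t, t)

def subtype(a, b):
    """Constant-time test: one table lookup."""
    return matrix[index[a]][index[b]] == 1
```

The constant test time comes at the cost of n bits per type, which is why the method only suits small hierarchies.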
The observation that stands behind the work on subtyping tests is that the BM representation is in practice very sparse, and therefore susceptible to massive optimization. Nevertheless, the number of partially ordered sets having n elements is 2^Ω(n²), so the representation of some partially ordered sets requires Ω(n²) bits. Thus, for arbitrary hierarchies the performance of the binary matrix is asymptotically optimal.
Another method is called a directed acyclic graph (DAG) encoding, according to which a directed acyclic graph is constructed. On the graph, nodes represent types and edges represent direct subtype relations, denoted ⪯d. Two types belong to a direct subtype relation if and only if (iff) there is no third type which is simultaneously a subtype of one and a supertype of the other. Formally, a⪯d b iff a⪯b and there is no c∈T such that a⪯c⪯b, where a≠b≠c.
The difficulty of subtyping problems crucially depends on the kind of inheritance which is permitted by the rules of the computer programming language. A special, relatively simple, case of subtyping problems is the so called "single-inheritance" (SI) hierarchy, in which the hierarchy DAG takes a tree or forest topology, as mandated by the rules of languages such as Smalltalk or Objective-C. SI cases are discussed hereinafter. A more difficult case is the so called "multiple-inheritance" (MI) hierarchy, which is described first.
Referring now to the drawings, FIG. 1 depicts a DAG topology representation of an MI hierarchy, of types A, B, . . . , I. In FIG. 1, the edges are directed from a subtype to a supertype, and types drawn higher in the diagram are considered larger in the subtype relationship, e.g., G⪯d C and H⪯A.
In DAG-encoding, a list of parents is stored with each type, resulting in total space of (n+|⪯d|)⌈log n⌉ bits, where a logarithm is to be understood as a base 2 logarithm. Therefore, the encoding length is (1+|⪯d|/n)⌈log n⌉. In the standard benchmark hierarchies the average number of parents, |⪯d|/n, is less than 2, hence the DAG-encoding enjoys a small encoding length. However, the time measure of DAG-encoding is extremely large, of the order of the size of the set T.
An additional encoding method is Closure-encoding, in which each type stores a sorted array of all of its ancestors. This method improves both the time measure and the space measure, to O(log n) and (|⪯|/n)⌈log n⌉, respectively. Yet, these measures, although improved, are far from being optimal.
The relative numbering method, also known as Schubert's numbering method, guarantees both an optimal encoding length of ⌈log n⌉ bits and constant time subtyping tests. Reference is now made to FIG. 2, which depicts a tree hierarchy of types A, B, . . . , I, and the encoding of each type according to the relative numbering method. In this method, each type a is encoded by an interval of integers which represents its minimal and maximal ordinals in a postorder traversal of the set T. Although relative numbering is characterized by a low encoding length and constant time, these achievements are only possible in a single-inheritance (SI) hierarchy.
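The interval construction can be sketched as follows, on a made-up five-node tree rather than the tree of FIG. 2. Each type receives its postorder ordinal, and its interval spans the ordinals of its subtree; the test is then two integer comparisons:

```python
# A sketch of relative (Schubert's) numbering for a single-inheritance
# tree: each type is encoded by the interval of postorder ordinals
# spanned by its subtree. The tree below is illustrative only.
children = {"A": ["B", "C"], "B": ["D", "E"], "C": [], "D": [], "E": []}

interval = {}
counter = 0

def number(t):
    """Postorder traversal: interval[t] = (smallest ordinal in t's
    subtree, t's own ordinal, which is the largest in the subtree)."""
    global counter
    low = counter + 1
    for c in children[t]:
        number(c)
    counter += 1
    interval[t] = (low, counter)

number("A")

def subtype(a, b):
    """a <= b iff a's own ordinal falls inside b's interval."""
    lo, hi = interval[b]
    return lo <= interval[a][1] <= hi
```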
Another algorithm designed for SI hierarchies is known as Cohen's algorithm [N. H. Cohen, "Type-extension tests can be performed in constant time", ACM Transactions on Programming Languages and Systems, 13: 626–629 (1991), the contents of which are hereby incorporated by reference]. The algorithm relies on hierarchies being relatively shallow, and more so, on types having a small number of ancestors. According to Cohen's algorithm, a type a is allocated an array ra, with an entry for each of the supertypes, b, of a. Thus, checking whether or not a⪯b can be carried out by checking whether b is indeed present in a predetermined location of the array ra. The encoding is optimized by not storing b itself in this location, but rather an id, which is unique among all types in its level. A level of a type, c, is the length of the longest directed path starting from c. Cohen's encoding stores, with each type a, its level, its unique id within this level, and the array ra.
Reference is now made to FIG. 3, showing a tree hierarchy similar to the hierarchy of FIG. 2, together with an encoding according to Cohen's method. In FIG. 3, each id is shown as a number in a circle, each array is shown as a column of boxes and each level is shown as a number beside the corresponding column.
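The following sketch illustrates Cohen's encoding on a small hypothetical SI tree (not the tree of FIG. 3). Note that in an SI tree, the level of a type is simply its depth, and the array ra extends the parent's array by one entry:

```python
# A sketch of Cohen's encoding for a single-inheritance hierarchy:
# each type stores its level, an id unique within that level, and an
# array r_a of the ids of its ancestors, indexed by level.
# The tree below is illustrative only.
parent = {"A": None, "B": "A", "C": "A", "D": "B"}

level, ident, r = {}, {}, {}
next_id_at_level = {}

def encode(t):
    if t in level:
        return
    p = parent[t]
    if p is None:
        level[t] = 0
        r[t] = []
    else:
        encode(p)
        level[t] = level[p] + 1
        r[t] = list(r[p])  # r_a extends the parent's ancestor array
    ident[t] = next_id_at_level.get(level[t], 0)
    next_id_at_level[level[t]] = ident[t] + 1
    r[t].append(ident[t])  # position level[t] holds t's own id

for t in parent:
    encode(t)

def subtype(a, b):
    """a <= b iff b's id sits at position level(b) in a's array:
    a bounds check, one indexed load, and one comparison."""
    lb = level[b]
    return lb < len(r[a]) and r[a][lb] == ident[b]
```

The test executes a fixed number of instructions regardless of hierarchy size, which is the constant-time property the article cited above establishes.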
Also of prior art interest are Packed Encoding (PE) and Bit-Packed Encoding (BPE) [A. Krall, J. Vitek and R. N. Horspool, "Efficient Type Inclusion Tests", Proceedings of the 12th Annual Conference on Object-Oriented Programming Systems, Languages and Applications, 142–157 (1997), the contents of which are hereby incorporated by reference]. The PE and BPE algorithms are a generalization of Cohen's algorithm for MI hierarchies, both of which enjoy constant time measures. A common theme to PE and BPE is the so called slicing, in which the set T is partitioned into disjoint slices (also called buckets) S1, . . . , Sk. For each slice Si the algorithm stores the entire information required to answer queries of whether type a is a subtype of b, where a∈T and b∈Si. The essence of the two algorithms is that a set of descendants of each element in Si is stored, in a very compressed format, which is possible since there is a great deal of sharing in the descendant sets of different members of Si.
The slices of PE and BPE play a role similar to that of levels in Cohen's algorithm. PE associates with each type a a unique integer ida within its slice sa, so that a is identified by a pair (sa,ida). Also associated with type a is a byte array ra, indexed by slice, whose position sb holds idb whenever a is a subtype of b.
Reference is now made to FIG. 4, which shows a hierarchy of types A, B, . . . , I, which is similar to the hierarchy of FIG. 1, but also includes encodings of each type according to the PE representation. The types of the hierarchy are partitioned into five different slices: S1={A}, S2={B}, S3={D}, S4={C,E} and S5={F, G, H, I}. This is the smallest possible number of slices, since for example type F has five ancestors. PE constrains each slice to a maximum of 255 types, so that ida can always be represented by a single byte. The encoding length is then 8k, where k is the number of slices. The difference between BPE and PE is that BPE permits two slices or more to be represented within a single byte. Referring again to FIG. 4, slices S1, S2 and S3, are represented using a single bit, slice S4 is represented using two bits and slice S5 is represented using three bits, for a total of seven bits, which can fit into a single byte. While both the BPE and the PE techniques are known to be quite efficient in terms of the time measure, the encoding length of these techniques is relatively high.
Reference is now made to FIG. 5, which illustrates one of the most explored directions in the prior art, known as Bit-vector encoding. In this scheme, each type a is encoded as a vector veca of k bits. If the ith element of the vector equals unity then it is said that type a has gene i. Let φ(a) be a set having all the genes of a, as elements. Then, the relation a⪯b holds if and only if φ(a)⊇φ(b), which can be easily checked by masking veca against vecb. FIG. 5 shows an example of bit-vector encoding of the hierarchy of FIG. 1.
In Bit-vector encoding, it is always possible to embed the hierarchy in a lattice of subsets of {1, . . . , k}, by setting k=n and letting veca be the row of the BM which corresponds to a. A simple counting argument shows that k must depend on the size of the hierarchy. Hence, bit-vector encoding is non-constant time, but it is uniform.
Reference is now made to FIG. 6, which illustrates yet another encoding technique, known as Range-Compression Encoding [R. Agrawal, A. Borgida and H. V. Jagadish, "Efficient Management of Transitive Relationships in Large Data and Knowledge Bases", Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data, 253–262 (1989), the contents of which are hereby incorporated by reference]. This method, which generalizes the Relative Numbering method, has a constant encoding length and an "almost constant" time. Range Compression encodes each type b as an integer idb, its ordinal in a postorder scan of a certain spanning forest of the hierarchy. The id's of all the descendants of b form a set φ(b), which can be represented by an array of consecutive disjoint intervals, enumerated by integers from 1 to k(b). For example, in FIG. 6, φ(B)={1, 2, 3, 5, 6, 7, 8, 9} can be represented as two intervals [1, 3] and [5, 9], thus k(B)=2.
Implementation of range compression requires a time measure of O(k(b)). If k(b) is large then a binary search on the intervals of φ(b) reduces the time measure to O(log k(b)). However, the instruction count of the method is Ω(k(b)), which is rather large.
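The binary-search variant of the test can be sketched using the φ(B) example given above, where the descendants of B carry ids {1, 2, 3, 5, 6, 7, 8, 9}, stored as the two intervals [1, 3] and [5, 9]:

```python
# A sketch of the Range-Compression subtype test: phi(b) is stored as
# sorted disjoint intervals and searched with bisection. The interval
# table reproduces the phi(B) example from the FIG. 6 discussion.
import bisect

intervals_B = [(1, 3), (5, 9)]  # phi(B) = {1,2,3,5,6,7,8,9}, k(B) = 2

def in_intervals(id_a, intervals):
    """True iff id_a lies in one of the sorted disjoint intervals;
    the bisection makes this O(log k(b))."""
    i = bisect.bisect_right(intervals, (id_a, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= id_a <= intervals[i][1]
```

A test a⪯B then amounts to `in_intervals(id_a, intervals_B)`; note that the unrolled search still costs Ω(k(b)) instructions in the emitted test code.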
The present invention provides solutions to the problems associated with prior art hierarchy encoding techniques.