The present invention relates generally to the field of distributed systems and, in particular, to the application of logical time for detecting causality in a system containing a large number of distributed sites that do not share global memory and that exchange information by passing messages over a communications network.
Modern data networks possess a multilevel hierarchical structure (see, e.g., “ATM-Forum Private Network-Network Interface Specification Version 1.0”, ATM Forum, March 1996), where certain sets of physical or logical nodes are grouped together according to physical, geographical, and/or administrative considerations. A distributed application running on such a network may involve thousands of sites belonging to different logical nodes, with transport channels between them extending through multiple domain boundaries and crossing multiple hierarchical levels. Knowledge of causality between the events in such a system is essential for analyzing the system's behavior and for ensuring correct operation by solving various problems related to mutual exclusion, consistency maintenance, fault tolerance, and recovery. Unfortunately, synchronous methods of causality tracking may be unavailable because there is no global clock and communication delays are unpredictable, whereas straightforward logging of local and remote events and exchanging logs along with messages is impractical because of the system's size and lifespan. In these circumstances, logical time is a technique that allows efficient encoding of the information contained in an event log and efficient causality tracking while keeping the communication overhead low.
The notion of logical time, based on the concept of one event happening before another in a distributed system, was described by L. Lamport in an early paper entitled “Time, clocks, and the ordering of events in a distributed system”, which appeared in Communications of the ACM, Vol. 21, pp. 558-564, 1978, and which subsequently attracted a great deal of attention. (See, for example, Haban and Weigel, “Global events and global breakpoints in distributed systems”, Proceedings of the Twenty-First Annual Hawaii International Conference on System Sciences, pp. 166-175, January 1988; F. Mattern, “Virtual time and global states of distributed systems”, M. Cosnard et al., editors, Proceedings of the International Workshop on Parallel and Distributed Algorithms, pp. 215-226, Amsterdam, 1989, Elsevier Science Publishers; C. J. Fidge, “Timestamps in message-passing systems that preserve the partial ordering”, Proc. 11th Australian Comp. Sci. Conf., pp. 56-66, 1988; M. Raynal, “About logical clocks for distributed systems”, ACM Operating Systems Review, vol. 26, Association for Computing Machinery, 1992; J. Torres-Rojas and Mustaque Ahamad, “Plausible clocks: Constant size logical clocks for distributed systems”, Babaoglu and Marzullo, editors, Distributed Algorithms, 10th International Workshop, vol. 1151 of Lecture Notes in Computer Science, pp. 71-88, Bologna, Italy, 9-11 October 1996, Springer Verlag.)
In a distributed system, where no global shared memory is available and messages are exchanged over a communication network, event e is said to causally precede event f (e → f) if e can potentially affect f; in other words, by the time event f occurs, the process hosting event f has already received information about the occurrence of event e.
Two distinct events e and f in H for which neither e → f nor f → e holds are called concurrent (e ∥ f). The relation → is an order relation on the event set H of the distributed system, but it does not introduce a total order. The relation ∥ is not transitive.
Logical time provides a means to encode the causality information contained in the partial order (H, →) by assigning timestamps to all events, so that comparing two event timestamps allows one to draw a conclusion about the causal ordering of these events. Formally, a logical time system contains:
the logical time domain T with a comparison function f: T × T → R, where R is a set of four outcomes: “less than” (<), “greater than” (>), “equal to” (=), and “incomparable” (⋄);
the message tag domain, S;
a set of rules that make it possible, for each message send event f, to form a message tag s(f) ∈ S to be transmitted along with the message;
a set of rules that allow the assignment of a timestamp C ∈ T to each event in the system's event set H, such that for any two events e, f ∈ H, the following condition holds:
e → f ⇒ C(e) < C(f)    [1]
This monotonicity condition [1] is referred to as the clock condition or weak clock consistency. A logical time system is strongly consistent if the converse of condition [1] also holds, namely:
e → f ⇄ C(e) < C(f)    [2]
and is isomorphic (or, equivalently, is said to characterize causality) if, for all e, f ∈ H:
e → f ⇄ C(e) < C(f);
e ≡ f ⇄ C(e) = C(f);
e ∥ f ⇄ C(e) ⋄ C(f).
The causal precedence or “happened-before” relation was defined by Lamport, for a system with totally ordered process event sets, as the smallest relation → on the set H of events in the system such that:
i. if events e and f belong to the same process and e precedes f in the event sequence of that process, then e → f;
ii. if e denotes a message send event and f is the reception event of the same message by another process, then e → f; and
iii. if e → f and f → g, then e → g.
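Rules i through iii can be turned directly into a small computation. The following is a minimal sketch in Python and is not part of the original description; the function names `happened_before` and `concurrent`, the event labels, and the input layout (one ordered list of event identifiers per process, plus a list of send/receive pairs) are all assumptions made for illustration.

```python
from itertools import product


def happened_before(process_events, messages):
    """Return the set of pairs (e, f) such that e -> f.

    process_events: list of lists; each inner list holds the event ids of
                    one process in their local order (hypothetical layout).
    messages:       iterable of (send_event, receive_event) pairs.
    """
    events = [e for seq in process_events for e in seq]
    rel = set()
    # Rule i: local order. Consecutive pairs are enough here because the
    # closure step below supplies the remaining pairs.
    for seq in process_events:
        rel.update(zip(seq, seq[1:]))
    # Rule ii: a send event precedes the matching receive event.
    rel.update(messages)
    # Rule iii: transitive closure (Warshall's algorithm). product() varies
    # its first element slowest, so the intermediate event k is outermost.
    for k, i, j in product(events, repeat=3):
        if (i, k) in rel and (k, j) in rel:
            rel.add((i, j))
    return rel


def concurrent(rel, e, f):
    """Two distinct events are concurrent if neither precedes the other."""
    return e != f and (e, f) not in rel and (f, e) not in rel


# Two processes; the event a2 sends a message that is received as b2.
rel = happened_before([["a1", "a2"], ["b1", "b2", "b3"]], [("a2", "b2")])
print(("a1", "b3") in rel)           # a1 -> a2 -> b2 -> b3
print(concurrent(rel, "b1", "a2"))   # no causal path in either direction
```

In this example a1 is concurrent with b1 and b1 is concurrent with a2, yet a1 precedes a2, which illustrates why concurrency is not a transitive relation.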
Lamport defined a logical time system, commonly known as the Lamport clock, which uses the set of non-negative integers as the logical time domain, T = N. Each process maintains its logical time, which is initialized to zero and is incremented by 1 (or any positive value) when an internal event or message send event occurs. The event is stamped with the value of the incremented logical time.
More specifically, the timestamp of a message send event is transmitted with the message (S = T). The timestamp of each message receive event is obtained by incrementing the maximum of the process's logical time value and the sender's timestamp extracted from the message. Thus the value of the process's logical time is always larger than the maximum message tag received so far, and it can therefore be shown that the Lamport clock is weakly consistent. However, since the non-negative integers form a totally ordered set, the Lamport clock cannot capture concurrency.
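The update rules just described fit in a few lines. The sketch below is an illustration only, assuming an increment of 1; the class name `LamportClock` and its method names are hypothetical.

```python
class LamportClock:
    """Scalar logical clock of one process (illustrative sketch)."""

    def __init__(self):
        self.time = 0  # logical time is initialized to zero

    def internal(self):
        """Internal event: increment and stamp the event."""
        self.time += 1
        return self.time

    def send(self):
        """Send event: the event timestamp doubles as the message tag."""
        return self.internal()

    def receive(self, tag):
        """Receive event: increment the maximum of local time and the tag."""
        self.time = max(self.time, tag) + 1
        return self.time


p, q, r = LamportClock(), LamportClock(), LamportClock()
p.internal()
tag = p.send()              # timestamp carried by the message
received = q.receive(tag)   # strictly larger than the tag
unrelated = r.internal()    # event at a process that exchanged no messages
print(tag, received, unrelated)
```

The event at `r` is concurrent with every event at `p` and `q`, yet its integer timestamp still compares as smaller than `received`; this is the loss of concurrency information noted above.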
To overcome this problem, the concept of a vector clock was later proposed by a number of researchers, most notably by C. Fidge in an article entitled “Timestamps in message-passing systems that preserve the partial ordering”, which appeared in Proc. 11th Australian Comp. Sci. Conf., pp. 56-66, 1988, and by F. Mattern in an article entitled “Virtual time and global states of distributed systems”, in Proceedings of the International Workshop on Parallel and Distributed Algorithms, M. Cosnard et al., editors, pp. 215-226, Amsterdam, 1989, Elsevier Science Publishers. The logical time domain is an integer vector space T = N^n, where n is the number of processes in the system. Each process P_i maintains a vector V_i = (v_1^i, v_2^i, . . . , v_n^i), which contains one component for each of the n processes.
Two vectors V_i and V_j are equal if all their components are equal; V_i < V_j if and only if v_k^i ≤ v_k^j for all k = 1, . . . , n, and there exists at least one k such that v_k^i < v_k^j. V_i > V_j is defined symmetrically. In all other cases, V_i and V_j are incomparable. For each event occurring at process P_i, the i-th component of the logical time vector V_i is incremented. For each message send event, the entire vector with the incremented local component is transmitted along with the message. When a message is received by P_i, the non-local components of V_i are updated by taking the component-wise maximum of V_i and the sender's vector clock extracted from the message. Without loss of generality, the local component of a vector timestamp can be viewed as the sequence number of the event in the local event order. Each non-local component v_j^i then corresponds to the index of the most recent event at process P_j that is known to process P_i.
The vector clock possesses a number of attractive properties. First, a logical time system based on a vector clock of size n is isomorphic. In addition, to determine the causal relation between two distinct events e and f occurring, respectively, at process P_i and process P_j (which are not necessarily distinct), it is sufficient to compare only one component of their vector timestamps V_i and V_j:
e → f ⇄ V_i[i] ≤ V_j[i]    [3]
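The vector clock rules and the single-component test [3] can be sketched as follows. This is an illustrative Python sketch under stated assumptions: processes are numbered from 0, and the names `VectorClock`, `compare`, and `precedes_fast` are hypothetical rather than taken from any cited work.

```python
class VectorClock:
    """Vector clock of process i in a system of n processes (sketch)."""

    def __init__(self, n, i):
        self.i = i
        self.v = [0] * n

    def local_event(self):
        """Any local event increments the process's own component."""
        self.v[self.i] += 1
        return tuple(self.v)

    def send(self):
        """The whole vector, with the incremented local component, is the tag."""
        return self.local_event()

    def receive(self, tag):
        """Merge the sender's vector by component-wise maximum, then count
        the receive event itself in the local component."""
        self.v = [max(a, b) for a, b in zip(self.v, tag)]
        return self.local_event()


def compare(u, w):
    """Full comparison: one of '<', '>', '=', 'incomparable'."""
    if u == w:
        return "="
    if all(a <= b for a, b in zip(u, w)):
        return "<"
    if all(a >= b for a, b in zip(u, w)):
        return ">"
    return "incomparable"


def precedes_fast(ts_e, i, ts_f):
    """Test [3] for distinct events: e occurred at process i."""
    return ts_e[i] <= ts_f[i]


p0, p1, p2 = (VectorClock(3, k) for k in range(3))
e = p0.send()          # event at process 0
f = p1.receive(e)      # causally after e
g = p2.local_event()   # unrelated to both
print(compare(e, f), compare(f, g), precedes_fast(e, 0, f))
```

The full comparison inspects all n components, whereas `precedes_fast` inspects one; both still require each timestamp to carry n components.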
However, B. Charron-Bost showed, in an article entitled “Concerning the size of logical clocks in distributed systems”, which appeared in Information Processing Letters, 39:11-16, 1991, that in order to maintain clock isomorphism, the size of the vector timestamp has to be at least equal to the number of processes in the system. Therefore, the storage and communication overhead, as well as the processing time, grow linearly with the system size n, making the cost of the vector clock in a large distributed system prohibitively high.
A number of vector clock optimizations applicable in certain special cases have been proposed. The differential technique of M. Singhal and A. Kshemkalyani (see, e.g., M. Singhal and A. Kshemkalyani, “An efficient implementation of vector clocks”, Information Processing Letters 43(1):47-52, August 1992) reduces the communication overhead by transmitting only those vector components that have changed since the most recent transfer. The dependency method described by J. Fowler and W. Zwaenepoel in an article entitled “Causal distributed breakpoints”, which appeared in Proc. 10th Intl. Conf. Distributed Computing Systems (ICDCS-10), Paris, France, May 28-Jun. 1, 1990, IEEE, allows a process to store only the information related to the direct dependencies between processes and reduces the size of a message tag to a scalar value, at the expense of the substantial latency and overhead required to recursively recompute the causal precedence information off-line. Another proposed optimization can be found in an article entitled “Incremental transitive dependency tracking in distributed computations”, authored by C. Jard and G. V. Jourdan, which appeared in Parallel Processing Letters, 6(3):427-435, September 1996. All of these techniques are restricted to certain special cases and do not circumvent the lower bound on the size of an isomorphic clock noted above.
Another related approach, which has been used to discard obsolete information in replicated databases, is the matrix clock, as described by Wuu and Bernstein in an article entitled “Efficient solutions to the replicated log and dictionary problems”, which appeared in Proceedings of the Third Annual ACM Symposium on Principles of Distributed Computing, pages 233-242, Vancouver, B.C., Canada, Aug. 27-29, 1984, and by Sarin and Lynch in an article entitled “Discarding obsolete information in a replicated database system”, IEEE Transactions on Software Engineering, SE-13(1):39-47, January 1987.
Both the event timestamps and the message tags are defined on the set of integer matrices N^n × N^n, where the k-th component of the j-th row of the logical time matrix maintained by process P_i corresponds to P_i's view of the most recent event belonging to process P_k that is known to process P_j. The (i,i)-th component of P_i's logical time matrix is incremented for each local event occurring at P_i; the other components are updated by taking the component-wise maximum of the local matrix and the message tag. The memory and communication overhead of the matrix clock prevents its use for determining causality in large distributed systems.
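A matrix clock can be sketched as shown below. This is a hedged illustration, not the formulation of the cited articles: processes are numbered from 0, the names `MatrixClock` and `known_to_all` are hypothetical, and the step that merges the sender's own row into the receiver's own row follows the usual textbook presentation of matrix clocks rather than anything spelled out in the paragraph above.

```python
class MatrixClock:
    """Matrix clock of process i among n processes (illustrative sketch).

    m[j][k] is this process's view of the latest event of process k
    that is known to process j; row i is the process's own vector clock.
    """

    def __init__(self, n, i):
        self.i = i
        self.m = [[0] * n for _ in range(n)]

    def local_event(self):
        self.m[self.i][self.i] += 1

    def send(self):
        """Return (sender id, copy of the matrix) as the message tag."""
        self.local_event()
        return self.i, [row[:] for row in self.m]

    def receive(self, sender, tag):
        n = len(self.m)
        # Assumed step: the receiver now knows whatever the sender knew,
        # so the sender's own row is merged into the receiver's own row.
        for k in range(n):
            self.m[self.i][k] = max(self.m[self.i][k], tag[sender][k])
        # Component-wise maximum of the local matrix and the message tag.
        for j in range(n):
            for k in range(n):
                self.m[j][k] = max(self.m[j][k], tag[j][k])
        self.local_event()

    def known_to_all(self, k):
        """Latest event of process k that every process is known to have
        seen; information about earlier events may be discarded."""
        return min(row[k] for row in self.m)


a, b = MatrixClock(2, 0), MatrixClock(2, 1)
sender, tag = a.send()
b.receive(sender, tag)
sender, tag = b.send()
a.receive(sender, tag)
print(a.m, a.known_to_all(0))
```

The `known_to_all` query is what makes the structure useful for discarding obsolete log entries, at the price of n × n integers per timestamp and per message.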
A class of logical time systems referred to as plausible clocks sacrifices isomorphism to maintain simplicity of the timestamps and messages. Plausible clocks are defined by the following properties:
e ≡ f ⇄ C(e) = C(f);
e → f ⇒ C(e) < C(f);    [4]
e ← f ⇒ C(e) > C(f).
They can be viewed as approximations of the isomorphic vector clock that are able to detect concurrency between events in H with a certain degree of accuracy. The plausible logical time systems form a partial hierarchy, and any two of them can be combined to produce another plausible logical time system of higher accuracy than the original ones. A simple example of a plausible clock is the Lamport clock with an added process identity to disambiguate between different events having the same scalar timestamp. Other examples of constant-size plausible clocks include the R-entry vector clock and the K-Lamport clock.
The R-entry vector clock is a variant of the vector clock with added process identity, where the event timestamps and the message tags belong to a vector space N^r, and the vector size r is fixed and independent of n, the number of processes in the system. Each component of the vector timestamp corresponds to the maximum sequential index of an event occurring at some subset of processes in the system, such that this event causally precedes the given local event. A simple mapping of a process identity into the vector component index is given by a modulo-r function. The vector timestamp comparison function can be modified accordingly, so that for all e, f ∈ H:
C(e) < C(f) ⇒ (e → f) ∨ (e ∥ f);
C(e) ⋄ C(f) ⇒ (e ∥ f).
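A fixed-size vector clock with a modulo-r slot mapping can be sketched as follows. This is an illustration under assumptions: the names `REntryClock` and `compare` are hypothetical, processes are numbered from 0, and the comparison function shown is a simplified one written for this sketch, not necessarily the one defined by Torres-Rojas and Ahamad.

```python
class REntryClock:
    """Vector clock with r entries shared by all processes (sketch)."""

    def __init__(self, pid, r):
        self.pid = pid
        self.slot = pid % r        # modulo-r mapping of identity to entry
        self.v = [0] * r

    def local_event(self):
        self.v[self.slot] += 1
        return self.pid, tuple(self.v)   # timestamp = (identity, vector)

    def send(self):
        return self.local_event()

    def receive(self, stamp):
        _, tag = stamp
        self.v = [max(a, b) for a, b in zip(self.v, tag)]
        return self.local_event()


def compare(e, f):
    """Simplified comparison of two (pid, vector) timestamps."""
    (pe, ve), (pf, vf) = e, f
    if ve == vf:
        # Equal vectors at different processes cannot be causally related.
        return "=" if pe == pf else "incomparable"
    if all(a <= b for a, b in zip(ve, vf)):
        return "<"
    if all(a >= b for a, b in zip(ve, vf)):
        return ">"
    return "incomparable"


p0, p1, p2 = REntryClock(0, 2), REntryClock(1, 2), REntryClock(2, 2)
e = p0.local_event()
g = p2.local_event()     # shares entry 0 with process 0
g2 = p2.local_event()
s = p0.send()
f = p1.receive(s)
print(compare(s, f), compare(e, g), compare(e, g2))
```

Processes 0 and 2 share an entry, so `e` and `g2` are reported as ordered even though no message links them. A result of "incomparable" is always a true concurrency, and a true causal order is never reported reversed, but a reported order may hide concurrency.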
The K-Lamport clock is an extension of the Lamport clock with the logical time domain T = N^K and the message tag domain S = N^(K−1). The timestamp of each event comprises the process identity, the scalar Lamport timestamp, and a (K−1)-dimensional vector of maximum tags received by the process itself as well as by its direct and indirect message suppliers. The process's Lamport clock and K−2 components of the maximum received tag vector are included with each message. The Lamport clock is maintained in the usual way, whereas on a message receipt event, the k-th component of the local logical time vector, k > 1, is set to the maximum of its previous value and the value of the (k−1)-th component of the message tag. Using all the components when comparing timestamps makes it possible, in some cases, to detect concurrency between events whose scalar timestamps (or vector timestamps of smaller size) are consistent with the existence of a causal ordering.
The model of distributed computation originally considered by Lamport was based on the assumption that all events occurring within the same process are sequentially ordered. This model is not applicable in a multithreaded process environment where tasks are created and terminated dynamically and executed in parallel, yet data and control dependencies may exist that establish a causal relation between events of different tasks. The straightforward solution, which treats each thread as a process in the original model, is quite limited, because in this case the size of the data structures needed to maintain the isomorphic logical time, and the associated processing and communication overhead, become unbounded.
For a process with partially ordered event sets, two types of event causality relations are defined: internal causality and message causality. Internal causality is induced between the events of the same process by the existing data and control dependencies which are assumed to be known to the process. Message causality exists between each pair of corresponding message send and message receive events. The causal precedence relation on the entire event set of the distributed system is the transitive closure of the union of the two.
C. Fidge, in an article entitled “Logical time in distributed computing systems”, which appeared in Computer, Vol. 24(8), pp. 28-33, 1991, suggested a method for applying the vector time approach to a parallel environment. Fidge extended Lamport's definition of “happened-before” to include the causal ordering created by forking a task and terminating a task, and replaced the notion of a vector by a set-of-pairs concept, where each pair pertains to a parallel task and contains the task identifier and the counter value. This method requires that the number of tasks be bounded by a value known in advance and does not allow recursive task definition.
More recently, Audenaert, in an article entitled “Clock trees: Logical clocks for programs with nested parallelism”, which appeared in IEEE Transactions on Software Engineering, Vol. 23, pp. 646-658, October 1997, proposed a new timestamping method of clock trees which applies in a distributed system where processes exhibit nested parallelism. A nested-parallel execution can be recursively decomposed into unnested partially ordered sets of events. Each partially ordered segment of parallel tasks may:
i. entirely belong to the causal past of the given event, or
ii. entirely lie outside its causal past, or
iii. belong to the causal past only partially.
In the first case (i), the segment of parallel tasks is treated as a single event in the execution history of the respective process; a single component in the vector timestamp of the given event suffices to capture the causal ordering. This timestamp is referred to as a quotient vector. In case (iii), however, additional information is required. This information is provided by a remainder vector that contains one clock tree for each parallel task in the nested segment. This recursive definition gives rise to a tree-structured clock where each node, or local component, contains the event's quotient vector and the event's remainder vector for the current level of decomposition.
Clock trees are effective for this special case of distributed computation. In the general case, however, their size approaches that of a flat vector clock (and even exceeds it because of the overhead of the pointer structures).
The bit-matrix clock is an isomorphic logical time system for processes with partially ordered event sets. (See, e.g., Ahuja, Carlson, and Gahlot, “Passive-space and time view: Vector clocks for achieving higher performance, program correction, and distributed computing”, IEEE Transactions on Software Engineering, Vol. 19(9), pp. 845-855, September 1993.) For a distributed system containing n processes, the logical time domain T of the bit-matrix clock is the set of all non-negative integer vectors of size n: T = N^n. However, unlike the regular vector clock, each component of the vector is interpreted as a bitmap rather than an event counter: the k-th bit position of the j-th component of the clock structure maintained by process P_i is 1 if the k-th event of process P_j causally precedes the current event of process P_i, and 0 otherwise. The length of the j-th component grows linearly with the number of events occurring at process P_j.
All bitmap components are initialized to zero. For each internal or message send event, the timestamp is computed by taking the bit-wise inclusive-or over the bit-matrix timestamps of all local events that causally precede the given event; this can be done using the process's knowledge of the internal data and control dependencies and the timestamps of the local events. In addition, the bit corresponding to the event itself in the local bitmap vector component is set to 1. Each transmitted message carries the entire timestamp of the corresponding send event. For a message receive event, the timestamp is computed by taking the bit-wise inclusive-or over the bit-matrix timestamps of all local events that causally precede the given event and the corresponding send event timestamp extracted from the message. The bit corresponding to the event itself is also set.
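The bit-matrix update can be sketched with arbitrary-precision Python integers standing in for the bitmaps. This is an illustration under assumptions: processes and local events are numbered from 0, the caller supplies the local predecessors of each event, and the names `BitMatrixProcess`, `event`, and `in_causal_past` are hypothetical.

```python
class BitMatrixProcess:
    """Bit-matrix clock for one process whose events are only partially
    ordered (illustrative sketch)."""

    def __init__(self, n, i):
        self.n, self.i = n, i
        self.stamps = []   # timestamp of each local event, by local index

    def event(self, local_preds=(), message_tag=None):
        """Create an event; local_preds are indices of local events that
        causally precede it, message_tag is the tag of a received message."""
        ts = [0] * self.n
        for p in local_preds:                      # internal causality
            ts = [a | b for a, b in zip(ts, self.stamps[p])]
        if message_tag is not None:                # message causality
            ts = [a | b for a, b in zip(ts, message_tag)]
        index = len(self.stamps)
        ts[self.i] |= 1 << index                   # bit of the event itself
        self.stamps.append(tuple(ts))
        return index, tuple(ts)


def in_causal_past(i, k, ts):
    """True if bit k of component i is set in timestamp ts, i.e. the k-th
    event of process i is the stamped event or causally precedes it."""
    return bool((ts[i] >> k) & 1)


p0, p1 = BitMatrixProcess(2, 0), BitMatrixProcess(2, 1)
a, ta = p0.event()                      # two parallel tasks of process 0:
b, tb = p0.event()                      # a and b are concurrent
c, tc = p0.event(local_preds=(a, b))    # join of both tasks
r, tr = p1.event(message_tag=tb)        # receives a message sent at b
print(in_causal_past(0, a, tb), in_causal_past(0, a, tc),
      in_causal_past(0, a, tr))
```

Because every event owns a bit, two events of the same process can be recognized as concurrent, which a counter-based vector component cannot express; the cost is that each bitmap grows with the number of events.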
Two alternative representations of the dependency information were proposed in an article by Prakash and Singhal entitled “Dependency sequences and hierarchical clocks: Efficient alternatives to vector clock for mobile computing systems”, which appeared in Wireless Networks, 41(2):349-360, 15 March 1997. The method of dependency sequences presents an alternative encoding of the bitmap vector components which makes it possible to store and transmit the sets of dependency gaps expressed in terms of local event identity pairs, rather than the bitmap itself. This method may or may not be advantageous, depending on the bitmap pattern. Generally, any lossless data compression technique may be useful in this situation.
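As a generic illustration of encoding a bitmap by pairs of event indices, the sketch below stores each maximal run of set bits as a (first, last) pair. It is not the dependency-sequence encoding of Prakash and Singhal, whose details are not given here; the function names `to_intervals` and `from_intervals` are hypothetical, and bitmaps are assumed to be non-negative Python integers.

```python
def to_intervals(bitmap):
    """Encode a bitmap as a list of (first, last) index pairs, one pair
    per maximal run of set bits; the gaps between pairs are implicit."""
    runs, k = [], 0
    while bitmap >> k:                 # bits at or above position k remain
        if (bitmap >> k) & 1:
            start = k
            while (bitmap >> k) & 1:   # advance to the end of the run
                k += 1
            runs.append((start, k - 1))
        else:
            k += 1
    return runs


def from_intervals(runs):
    """Rebuild the bitmap from its (first, last) pairs."""
    bitmap = 0
    for first, last in runs:
        bitmap |= ((1 << (last - first + 1)) - 1) << first
    return bitmap


deps = 0b1110111                       # events 0-2 and 4-6, gap at event 3
pairs = to_intervals(deps)
print(pairs, from_intervals(pairs) == deps)
```

A bitmap with a few long runs shrinks to a handful of pairs, while a bitmap that alternates between 0 and 1 needs one pair per set bit and becomes larger than the bitmap itself, which is the pattern dependence noted above.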
Another method, described by Gahlot and Singhal in an article entitled “Hierarchical Clocks”, which appeared as Technical Report OSU-CISRC-93-TR19, Dept. of Comput. and Inf. Sci., Ohio State Univ., Columbus, Ohio, USA, May 1993, can be described as a two-level hierarchical clock. In order to reduce the on-line communication and storage overhead, the method uses a timestamp which contains a global component analogous to a conventional vector clock, and a local component which is represented by a variable-length bit-vector specifying the causal precedence within the process event set. Unfortunately, the definition of the causal precedence relation used in deriving the two-level hierarchical clock is not without problems; therefore, continuing methods and approaches are required.
I have developed a new logical time system, the Hierarchical Vector Clock (HVC), which can be used to capture the causal precedence relation in large distributed systems with a dynamically varying number of processes. The Hierarchical Vector Clock is a variable-size clock; while approximating the isomorphic vector clock, it requires much smaller communication and storage overhead: for a distributed system containing N^K processes, its communication and storage requirements are of the order N × K.
Hierarchical Vector Clock is both hierarchical and scalable. It is not restricted to any fixed number of hierarchy levels and can be naturally extended by employing the nested data structures and recursive invocation of the algorithms. In the simplest single-layer case, HVC becomes a variation of an isomorphic vector clock. HVC is ideally suited for the modern message-passing distributed systems which use the underlying communication networks of highly hierarchical structure.
Further features and advantages of the present invention, as well as the structure and operation of various embodiments of the present invention are described in detail below with reference to the accompanying drawing.