Large software systems often include multiple virtual machine instances (e.g., virtual machines that adhere to the Java® Virtual Machine Specification published by Sun Microsystems, Inc. or, later, Oracle America, Inc., which are sometimes referred to herein as Java® Virtual Machines or JVMs) running on separate host machines in a cluster and communicating with one another as part of a distributed system. The performance of modern garbage collectors is typically good on individual machines, but may contribute to poor performance in distributed systems.
In some existing systems, both minor garbage collections (e.g., garbage collections that target young generation portions of heap memory) and major garbage collections (e.g., garbage collections that target old generation portions of heap memory) are “stop the world” events. In other words, regardless of the type of collection being performed, all threads of any executing applications are stopped until the garbage collection operation is completed. Major garbage collection events can be much slower than minor garbage collection events because they involve all live objects in the heap.
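By way of illustration only, the time a single virtual machine instance has spent in garbage collection can be observed through the standard `java.lang.management` interfaces. The following is a minimal sketch and is not part of any system described herein; the class name `GcPauseReport` and the helper `totalCollectionTimeMillis` are hypothetical. The reported figures are accumulated elapsed collection times, which approximate application pause time only under the assumption that every collection is a "stop the world" event, as in the systems described above.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, for illustration only.
final class GcPauseReport {

    /**
     * Sums the approximate accumulated collection time, in milliseconds,
     * across all collectors registered in this virtual machine instance.
     */
    static long totalCollectionTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long elapsed = gc.getCollectionTime(); // -1 if the collector does not report it
            if (elapsed > 0) {
                total += elapsed;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Typically one bean covers the young generation and another the old generation,
        // but the names and number of beans depend on the collector in use.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d, time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.println("Total collection time (ms): " + totalCollectionTimeMillis());
    }
}
```

Comparing the per-collector counts and times printed by such a sketch is one way to see how much of the total is attributable to young generation collections versus old generation collections on a given machine.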
Some workloads involve “barrier” operations that require synchronization across all of the machines in the cluster. As a result, if any one machine is delayed (e.g., because it is performing garbage collection), every other machine may have to wait for it. The impact of this problem may grow as the size of the cluster grows, harming scalability. Other workloads, such as key-value stores, may involve low-latency request-response operations, perhaps with an average-case delay of 1 millisecond (exploiting the fact that a modern interconnect, such as one that adheres to the InfiniBand™ interconnect architecture developed by the InfiniBand® Trade Association, may provide network communication latencies on the order of 1-2 μs). A single user-facing operation (e.g., producing information for a web page) may involve issuing queries to dozens of key-value stores, and so may be held up by the latency of the slowest “straggler” query, which may take 10 or 100 times longer than the average case. Young-generation garbage collection may also be a source of pauses that cause stragglers, even when an optimized parallel collector is used.
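The scaling effect described above can be illustrated with a simple model. The following sketch is offered under stated assumptions only: each machine is assumed to be paused for garbage collection during a fixed fraction of time, independently of the others, and a fan-out operation is assumed to complete only when its slowest query returns. The class name `StragglerModel`, its methods, and the numeric values are hypothetical and are not measurements from any system.

```java
// Hypothetical model, for illustration only.
final class StragglerModel {

    /**
     * Probability that at least one of {@code machines} machines is paused at a
     * given instant, assuming each is paused independently for a fraction
     * {@code pausedFraction} of the time: 1 - (1 - p)^N.
     */
    static double probabilityAnyDelayed(double pausedFraction, int machines) {
        return 1.0 - Math.pow(1.0 - pausedFraction, machines);
    }

    /**
     * Latency of a fan-out operation that must wait for every query:
     * the maximum of the individual query latencies.
     */
    static double fanOutLatencyMillis(double[] queryLatenciesMillis) {
        double slowest = 0.0;
        for (double latency : queryLatenciesMillis) {
            slowest = Math.max(slowest, latency);
        }
        return slowest;
    }

    public static void main(String[] args) {
        // Assumed value: each machine spends 1% of its time paused.
        for (int machines : new int[] {1, 10, 100}) {
            System.out.printf("machines=%d, P(any delayed)=%.3f%n",
                    machines, probabilityAnyDelayed(0.01, machines));
        }
        // Two typical 1 ms queries and one straggler at 100 ms.
        double[] latencies = {1.0, 1.0, 100.0};
        System.out.printf("fan-out latency=%.1f ms%n", fanOutLatencyMillis(latencies));
    }
}
```

Under these assumptions, a pause that affects any single machine only 1% of the time affects a barrier across 100 machines roughly 63% of the time, and a single 100 ms straggler sets the latency of the whole fan-out operation regardless of how fast the remaining queries are.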