The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
In a distributed system, services may be dependent on one another. For example, a requesting service may send requests to a responding service. During proper functioning of the distributed system, the responding service is expected to send a response to the requesting service in a timely manner. The response may be used by the requesting service to perform further processing. Thus, in this example, the requesting service is dependent on the responding service.
The requesting service may be adversely impacted when a responding service experiences performance issues, such as latency that causes the responding service to delay sending responses or outright failure of the responding service. If the requesting service does not receive a timely response from the responding service, the requesting service may attempt to retry sending the request multiple times. However, if no response is received, the requesting service may experience an error that may cascade to other downstream services in the distributed system.
One approach to providing resiliency and fault tolerance in such a distributed system is to implement a virtual circuit breaker at the requesting service to manage communication with the responding service. For example, HYSTRIX is a library that is used to provide resiliency and fault tolerance via a virtual circuit breaker. The circuit breaker may be implemented at the requesting service and has three states: CLOSED, OPEN, and HALF-OPEN. In the CLOSED state, the circuit breaker will allow all requests to pass from the requesting service to the responding service. In the OPEN state, the circuit breaker will not allow any requests to pass from the requesting service to the responding service. In the HALF-OPEN state, a single request is passed from the requesting service to the responding service to determine whether the circuit is ready to be CLOSED or should remain OPEN.
In this example circuit breaker architecture, the circuit breaker starts in a CLOSED state, thus allowing requests to be sent to the responding service. If the volume of requests from the requesting service to the responding service meets a certain volume threshold and the error rate of those requests exceeds an error rate threshold, then the circuit breaker is programmed to transition to an OPEN state, thus stopping further requests from being sent to the responding service. After a sleep period, the circuit breaker will transition from an OPEN state to a HALF-OPEN state and allow a single request to be sent to the responding service. If the responding service does not return a response in a timely manner, the request fails. If the request fails, then the circuit breaker returns to the OPEN state and awaits another sleep period before trying another request in a HALF-OPEN state. If the responding service returns a response in a timely manner, the request succeeds. If the request succeeds, the circuit breaker transitions to a CLOSED state, all other pending requests to the responding service are sent, and normal processing is resumed.
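The state transitions described above can be sketched as a small state machine. The following is a minimal illustration only and is not the HYSTRIX implementation or its API: the class name, method names, parameter names, and default values are hypothetical, and the request and error counters are simple running totals rather than a rolling time window.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF-OPEN"


class CircuitBreaker:
    """Hypothetical three-state breaker; all names and defaults are assumptions."""

    def __init__(self, volume_threshold=20, error_rate_threshold=0.5,
                 sleep_seconds=5.0, clock=time.monotonic):
        self.volume_threshold = volume_threshold
        self.error_rate_threshold = error_rate_threshold
        self.sleep_seconds = sleep_seconds
        self.clock = clock              # injectable so the sketch can be tested
        self.state = State.CLOSED       # the breaker starts CLOSED
        self.requests = 0
        self.errors = 0
        self.opened_at = None

    def allow_request(self):
        """Return True if a request may be sent to the responding service."""
        if self.state is State.CLOSED:
            return True
        if self.state is State.OPEN:
            # After the sleep period, admit exactly one probe request.
            if self.clock() - self.opened_at >= self.sleep_seconds:
                self.state = State.HALF_OPEN
                return True
            return False
        # HALF-OPEN: the single probe is already in flight, so block the rest.
        return False

    def record(self, success):
        """Record the outcome of a request that was allowed through."""
        if self.state is State.HALF_OPEN:
            # The probe outcome alone decides the next state.
            if success:
                self._close()
            else:
                self._open()
            return
        if self.state is State.OPEN:
            return
        self.requests += 1
        if not success:
            self.errors += 1
        # Trip only when both the volume and error rate thresholds are met.
        if (self.requests >= self.volume_threshold
                and self.errors / self.requests > self.error_rate_threshold):
            self._open()

    def _open(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.requests = self.errors = 0

    def _close(self):
        self.state = State.CLOSED
        self.requests = self.errors = 0
```

In this sketch, a caller would check `allow_request()` before sending each request and report the outcome through `record()`. A timed-out request would be reported as a failure.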
The above circuit breaker implementation, however, has many shortcomings. First, when a circuit breaker transitions from a HALF-OPEN state to a CLOSED state, the number of outgoing requests to the responding service may be very large. This large volume of outgoing requests may, in turn, cause the responding service to experience another failure due to a heavy load caused by the flood of requests, thereby causing the circuit breaker to transition to OPEN again. The process may then repeat itself, thereby never allowing the responding service to fully recover in a gradual way. This shortcoming is referred to as a “circuit flapping” problem.
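The circuit flapping problem can be illustrated with a toy, tick-based simulation. This is a sketch under simplifying assumptions that do not come from any particular library: the function name and all parameter values are hypothetical, the responding service is assumed to fail whenever it receives more requests in one tick than its capacity, and a single overloaded tick is assumed to be enough to trip the circuit breaker.

```python
def simulate_flapping(ticks, arrivals_per_tick=10, capacity=50, sleep_ticks=10):
    """Return the breaker state at the end of each tick (hypothetical model)."""
    state = "OPEN"            # start just after a failure has tripped the breaker
    backlog = 0               # requests pending at the requesting service
    sleep_left = sleep_ticks
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick   # new requests keep arriving while blocked
        if state == "OPEN":
            sleep_left -= 1
            if sleep_left == 0:
                state = "HALF-OPEN"
        elif state == "HALF-OPEN":
            backlog -= 1               # a single probe is well within capacity
            state = "CLOSED"           # probe succeeds, so the breaker closes
        else:  # CLOSED: the whole backlog is released at once
            if backlog > capacity:
                # The flood overloads the responding service; requests fail
                # and remain pending, and the breaker trips again.
                state = "OPEN"
                sleep_left = sleep_ticks
            else:
                backlog = 0
        history.append(state)
    return history


history = simulate_flapping(40)
```

Under these assumptions the backlog accumulated during each sleep period always exceeds the capacity of the responding service, so every transition to CLOSED is immediately followed by a transition back to OPEN.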
Second, the sleep window in the circuit breaker implementation is a fixed amount of time. However, it is possible that the responding service has recovered before the end of the sleep window. Nonetheless, in the circuit breaker implementation, the requesting service will not send a request until the sleep window has completed, and the circuit breaker has transitioned to the HALF-OPEN state to send a single request. This means that it is possible that the responding service was fully recovered and ready to receive requests for some time while the requesting service was idling during a sleep window in an OPEN state. This is an inefficient use of the responding service, which could have started receiving requests earlier.
Third, the states of the circuit breaker implementation are a modified binary state, in that zero requests are sent in the OPEN state, all requests are sent in the CLOSED state, and a single request is sent in the HALF-OPEN state. This modified binary state of the circuit breaker implementation, however, does not allow for gradual recovery by the responding service. For example, if the responding service is at 50% capacity, the circuit breaker implementation will send zero requests, all requests, or a single request. However, the responding service could tolerate 50% of requests being sent. The modified binary state of the circuit breaker implementation cannot account for sending 50% of requests.
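The 50% example above can be made concrete with a short calculation. The function name and the figure of 100 pending requests are hypothetical and are chosen only for illustration.

```python
def requests_sent(state: str, pending: int) -> int:
    """Number of pending requests the breaker lets through in a given state."""
    if state == "OPEN":
        return 0
    if state == "HALF-OPEN":
        return min(1, pending)   # at most the single probe request
    if state == "CLOSED":
        return pending
    raise ValueError(f"unknown state: {state}")


pending = 100                        # assumed number of pending requests
tolerable = int(0.5 * pending)       # responding service at 50% capacity
options = {s: requests_sent(s, pending)
           for s in ("OPEN", "HALF-OPEN", "CLOSED")}
```

With these assumed numbers, the three states admit 0, 1, or 100 requests, and none of them matches the 50 requests that the responding service could tolerate.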
Fourth, the existing circuit breaker implementation only sends a single request to the responding service in the HALF-OPEN state to determine whether to OPEN or CLOSE the circuit breaker, but this may be an insufficient sample size in some scenarios and may result in the circuit breaker incorrectly reacting to false negatives regarding the status of the responding service. For example, if the responding service stores a cache and provides cache results to the requesting service in response to data requests, the responding service may experience a cache miss for a particular request for data received from the requesting service. The circuit breaker implementation may interpret the cache miss of the responding service as an error, even if the cache could successfully provide cache hits for other data requests. If the cache miss occurs during the single request that occurs in the HALF-OPEN circuit breaker state, the circuit breaker will transition from a HALF-OPEN state to an OPEN state, thereby halting further data requests from the requesting service to the responding service. However, the cache miss may have been appropriate for that particular request, and the responding service would have been able to provide cache hits for additional requests. Since the circuit breaker only sends a single request to the responding service when it is in a HALF-OPEN state, an error during that request may be a false negative regarding the status of the responding service as a whole. A larger sample size may provide a better indication as to the state of the responding service.
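The effect of sample size can be estimated with a simple binomial calculation. This sketch rests on assumptions that are not part of the implementation described above: each probe sent to a healthy responding service is assumed to fail independently with a fixed probability (for example, because of a cache miss), and the hypothetical multi-probe rule reopens the circuit only if more than half of the probes fail.

```python
from math import comb


def false_reopen_probability(p_error: float, probes: int) -> float:
    """Probability that a healthy service is wrongly judged to be failing.

    Assumes independent probes that each fail with probability p_error,
    and a rule that reopens the circuit when more than half of them fail.
    """
    min_failures = probes // 2 + 1
    return sum(
        comb(probes, k) * p_error**k * (1 - p_error) ** (probes - k)
        for k in range(min_failures, probes + 1)
    )


single_probe = false_reopen_probability(0.2, 1)   # equals p_error
five_probes = false_reopen_probability(0.2, 5)    # majority of five probes
```

With a single probe, the probability of a false negative equals the per-request error probability. With an assumed 20% per-request error probability, a majority rule over five probes lowers it to roughly 5.8%.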
Fifth, the existing circuit breaker implementation only provides limited opportunities for customization and configuration of what rules to consider when determining whether a request should be sent to the responding service or not.
Thus, what is needed are improved techniques for load shedding in a distributed system that address these issues and allow for graceful recovery from service performance issues or failures.
While each of the figures illustrates a particular embodiment for purposes of illustrating a clear example, other embodiments may omit, add to, reorder, and/or modify any of the elements shown in the figures.