Fault Tolerance in Helidon SE

Overview

Helidon SE Fault Tolerance support is inspired by MicroProfile Fault Tolerance. The API defines the notion of a fault handler that can be combined with other handlers to improve application robustness. Handlers are created to manage error conditions (faults) that may occur in real-world application environments. Examples include service restarts, network delays, and temporary infrastructure instabilities.

The interaction of multiple microservices brings new challenges from distributed systems that require careful planning. Faults in distributed systems should be compartmentalized to avoid unnecessary service interruptions. For example, if comparable information can be obtained from multiple sources, a user request should not be denied when a subset of these sources is unreachable or offline. Similarly, if a non-essential source has been flagged as unreachable, an application should avoid continuous access to that source, as that would result in much higher response times.

In order to tackle the most common types of application faults, the Helidon SE Fault Tolerance API provides support for circuit breakers, retries, timeouts, bulkheads and fallbacks. In addition, the API makes it very easy to create and monitor asynchronous tasks that do not require explicit creation and management of threads or executors.

For more information the reader is referred to the Fault Tolerance SE API Javadocs.

Maven Coordinates

To enable Fault Tolerance add the following dependency to your project’s pom.xml (see Managing Dependencies).

<dependency>
    <groupId>io.helidon.fault-tolerance</groupId>
    <artifactId>helidon-fault-tolerance</artifactId>
</dependency>

API

The SE Fault Tolerance API is reactive in order to fit the overall processing model in Helidon SE. A task returns either a Single<T> or a Multi<T>. A Single<T> is a promise to produce zero or one value of type T or signal an error, while a Multi<T> is a promise to produce zero or more values of type T or signal an error.

A Single<T>, like CompletableFuture<T>, extends CompletionStage<T> so conversion among these types is straightforward.

In the sections that follow, we shall briefly explore each of the constructs provided by this API.

Asynchronous

Asynchronous tasks can be created or forked by using an Async instance. A supplier of type T is provided as the argument when invoking this handler. For example:

Single<Thread> s = Async.create().invoke(() -> Thread.currentThread());
s.thenAccept(t -> System.out.println("Async task executed in thread " + t));

The supplier () -> Thread.currentThread() is executed in a separate thread, and the value it produces is printed by the consumer passed to thenAccept.

The method reference Thread::currentThread is a simplified way of providing a supplier in the example above.

Asynchronous tasks are executed in a thread pool managed by the Helidon SE Fault Tolerance module. Thread pools are created during the initialization phase of class io.helidon.faulttolerance.FaultTolerance and can be configured for your application.
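For readers who know the JDK better than Helidon, the behavior described in this section is roughly comparable to handing a supplier to an executor through CompletableFuture.supplyAsync. The following is a JDK-only sketch of that analogy and does not use the Helidon API; the class name AsyncSketch, the single-thread pool, and the thread name are illustrative assumptions.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// JDK-only analogy; this is not how Helidon implements Async.
class AsyncSketch {

    // Runs a supplier on a pool thread and returns the name of the thread that ran it.
    static String workerThreadName() {
        // Hypothetical pool with a single, recognizably named daemon thread.
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "sketch-worker");
            t.setDaemon(true);
            return t;
        });
        try {
            CompletableFuture<Thread> future =
                    CompletableFuture.supplyAsync(Thread::currentThread, pool);
            return future.thenApply(Thread::getName).join();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("Async task executed in thread " + workerThreadName());
    }
}
```

The supplier never runs on the calling thread, which is the property the Async handler provides without the caller managing an executor.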

Retries

Temporary networking problems can sometimes be mitigated by simply retrying a certain task. A Retry handler is created using a RetryPolicy that indicates the number of retries, delay between retries, etc.

Retry retry = Retry.builder()
                   .retryPolicy(Retry.JitterRetryPolicy.builder()
                                     .calls(3)
                                     .delay(Duration.ofMillis(100))
                                     .build())
                   .build();
retry.invoke(this::retryOnFailure);

The sample code above makes up to 3 calls to the supplier this::retryOnFailure (the initial call plus up to 2 retries), with a 100 millisecond delay between them.

The return type of method retryOnFailure in the example above must be CompletionStage<T> and the parameter to the retry handler’s invoke method Supplier<? extends CompletionStage<T>>.

If the CompletionStage<T> returned by the method completes exceptionally, the call will be treated as a failure and retried until the maximum number of attempts is reached. Finer control is possible by creating a retry policy and using methods such as applyOn(Class<? extends Throwable>... classes) and skipOn(Class<? extends Throwable>... classes) to select the exceptions that trigger a retry and those that can be ignored.
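To make the retry semantics concrete, here is a minimal JDK-only sketch (Java 9 or later) of a loop that re-invokes a supplier of CompletionStage<T> when the returned stage completes exceptionally. It is not the Helidon implementation: the class RetrySketch, its method names, and the fixed delay without jitter are assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// JDK-only sketch of retry semantics; not the Helidon implementation.
class RetrySketch {

    // Invokes the supplier up to maxCalls times in total, waiting delayMillis between calls.
    static <T> CompletableFuture<T> invoke(Supplier<? extends CompletionStage<T>> supplier,
                                           int maxCalls, long delayMillis) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(supplier, maxCalls, delayMillis, 1, result);
        return result;
    }

    private static <T> void attempt(Supplier<? extends CompletionStage<T>> supplier,
                                    int maxCalls, long delayMillis, int call,
                                    CompletableFuture<T> result) {
        CompletionStage<T> stage;
        try {
            stage = supplier.get();
        } catch (Throwable t) {
            // A supplier that throws synchronously is treated like a failed stage.
            stage = CompletableFuture.failedFuture(t);
        }
        stage.whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else if (call >= maxCalls) {
                result.completeExceptionally(error);   // out of attempts
            } else {
                Executor delayed =
                        CompletableFuture.delayedExecutor(delayMillis, TimeUnit.MILLISECONDS);
                delayed.execute(() -> attempt(supplier, maxCalls, delayMillis, call + 1, result));
            }
        });
    }

    // Demo: a task that fails twice and succeeds on its third call.
    static int callsUntilSuccess() {
        AtomicInteger calls = new AtomicInteger();
        Supplier<CompletionStage<String>> flaky = () -> {
            if (calls.incrementAndGet() < 3) {
                return CompletableFuture.failedFuture(new IllegalStateException("transient"));
            }
            return CompletableFuture.completedFuture("ok");
        };
        invoke(flaky, 3, 10).join();
        return calls.get();
    }
}
```

With a limit of 3 calls, the flaky task in the demo succeeds on its last permitted attempt; with a limit of 2, the returned future would complete exceptionally with the last error.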

Timeouts

A request to a service that is inaccessible or simply unavailable should be bounded to ensure a certain quality of service and response time. Timeouts can be configured to avoid excessive waiting times. In addition, a fallback action can be defined if a timeout expires as we shall cover in the next section.

The following is an example of using Timeout:

Single<T> s = Timeout.create(Duration.ofMillis(10)).invoke(this::mayTakeVeryLong);
s.handle((t, e) -> {
    if (e instanceof TimeoutException) {
        // Invocation has timed out!
    }
    //...
});

The example above monitors the call to method mayTakeVeryLong and reports a TimeoutException if the execution takes more than 10 milliseconds to complete.
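The same idea can be tried with the JDK alone. The sketch below (Java 9 or later) bounds a CompletableFuture with orTimeout and checks whether it failed with java.util.concurrent.TimeoutException. It is only an analogy to the Timeout handler; the class and method names are hypothetical, and the exception type reported by Helidon may differ from the JDK one used here.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// JDK-only analogy to a timeout handler; not the Helidon implementation.
class TimeoutSketch {

    // Returns true if the task did not complete within the given bound.
    static boolean timedOut(CompletableFuture<String> task, long millis) {
        try {
            // orTimeout completes the future exceptionally once the bound elapses.
            task.orTimeout(millis, TimeUnit.MILLISECONDS).join();
            return false;
        } catch (CompletionException e) {
            return e.getCause() instanceof TimeoutException;
        }
    }

    public static void main(String[] args) {
        // A future that nobody ever completes stands in for a slow service.
        System.out.println(timedOut(new CompletableFuture<>(), 10));
    }
}
```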

Fallbacks

A fallback to a known result can sometimes be an alternative to reporting an error. For example, if we are unable to access a service we may fall back to the last result obtained from that service.

A Fallback instance is created by providing a function that takes a Throwable and produces a CompletionStage<T> as shown next:

Single<T> single = Fallback.create(
    throwable -> Single.just(lastKnownValue)).invoke(this::mayFail);
single.thenAccept(t -> {
    //...
});

In this example, we register a function that can produce a Single<T> (which implements CompletionStage<T>) if the call to this::mayFail completes exceptionally.
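The shape of a fallback function (a Throwable in, a CompletionStage<T> out) can be illustrated with plain JDK types. The following sketch is not the Helidon implementation; FallbackSketch and its invoke method are hypothetical names used to show how the fallback stage replaces a failed primary stage.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.function.Function;
import java.util.function.Supplier;

// JDK-only sketch of fallback semantics; not the Helidon implementation.
class FallbackSketch {

    // Runs the supplier; if its stage fails, completes with the fallback stage instead.
    static <T> CompletableFuture<T> invoke(Supplier<? extends CompletionStage<T>> supplier,
                                           Function<Throwable, ? extends CompletionStage<T>> fallback) {
        CompletableFuture<T> result = new CompletableFuture<>();
        supplier.get().whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);                 // primary call succeeded
            } else {
                fallback.apply(error).whenComplete((fbValue, fbError) -> {
                    if (fbError == null) {
                        result.complete(fbValue);       // fallback value is used
                    } else {
                        result.completeExceptionally(fbError);
                    }
                });
            }
        });
        return result;
    }
}
```

A successful primary call passes through unchanged; the fallback function is consulted only when the primary stage completes exceptionally.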

Circuit Breakers

Failing to execute a certain task or call another service repeatedly can have a direct impact on application performance. It is often preferable to avoid calls to non-essential services by preventing that logic from executing altogether. A circuit breaker can be configured to monitor such calls and block attempts that are likely to fail, thus improving overall performance.

Circuit breakers start in a closed state, letting calls proceed normally; after detecting a certain number of errors during a pre-defined processing window, they can open to prevent additional failures. After a circuit has been opened, it can transition first to a half-open state before finally transitioning back to a closed state. The use of an intermediate state (half-open) makes transitions from open to closed more progressive, and prevents a circuit breaker from eagerly transitioning to states without considering "sufficient" observations.

Any failure while a circuit breaker is in half-open state will immediately cause it to transition back to an open state.

Consider the following example in which this::mayFail is monitored by a circuit breaker:

CircuitBreaker breaker = CircuitBreaker.builder()
                                       .volume(10)
                                       .errorRatio(30)
                                       .delay(Duration.ofMillis(200))
                                       .successThreshold(2)
                                       .build();
Single<T> result = breaker.invoke(this::mayFail);

The circuit breaker in this example defines a processing window of size 10, an error ratio of 30%, a duration to transition to half-open state of 200 milliseconds, and a success threshold to transition from half-open to closed state of 2 observations. It follows that:

After completing the processing window, if at least 3 errors were detected, the circuit breaker will transition to the open state, thus blocking the execution of any subsequent calls.
After 200 milliseconds, the circuit breaker will transition to half-open and enable calls to proceed again.
If the next two calls after transitioning to half-open are successful, the circuit breaker will transition to closed state; otherwise, it will transition back to open state, waiting for another 200 milliseconds before attempting to transition to half-open again.

A circuit breaker will throw an io.helidon.faulttolerance.CircuitBreakerOpenException if an attempt to make an invocation takes place while it is in open state.
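The state transitions described in this section can be sketched as a small state machine. The code below is a simplified, JDK-only illustration and not the Helidon implementation: it evaluates the error ratio once per full window rather than over a rolling window, it takes the current time as a parameter so the transitions are easy to follow, and all names (BreakerSketch, allow, record) are hypothetical.

```java
// Simplified sketch of circuit breaker state transitions; not the Helidon implementation.
class BreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int volume;            // size of the processing window
    private final int errorRatio;        // failure percentage that opens the circuit
    private final long delayMillis;      // time to wait in OPEN before trying HALF_OPEN
    private final int successThreshold;  // successes needed in HALF_OPEN to close

    private State state = State.CLOSED;
    private int calls;                   // calls observed in the current window
    private int errors;                  // errors observed in the current window
    private int successes;               // successes observed while HALF_OPEN
    private long openedAt;

    BreakerSketch(int volume, int errorRatio, long delayMillis, int successThreshold) {
        this.volume = volume;
        this.errorRatio = errorRatio;
        this.delayMillis = delayMillis;
        this.successThreshold = successThreshold;
    }

    State state() { return state; }

    // Returns true if a call may proceed at the given time.
    boolean allow(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= delayMillis) {
            state = State.HALF_OPEN;
            successes = 0;
        }
        return state != State.OPEN;
    }

    // Records the outcome of a call that was allowed to proceed.
    void record(boolean success, long nowMillis) {
        if (state == State.HALF_OPEN) {
            if (!success) {
                open(nowMillis);                       // any failure reopens the circuit
            } else if (++successes >= successThreshold) {
                state = State.CLOSED;
                calls = 0;
                errors = 0;
            }
        } else if (state == State.CLOSED) {
            calls++;
            if (!success) errors++;
            if (calls == volume) {                     // window complete: evaluate ratio
                if (errors * 100 >= errorRatio * volume) {
                    open(nowMillis);
                } else {
                    calls = 0;
                    errors = 0;
                }
            }
        }
    }

    private void open(long nowMillis) {
        state = State.OPEN;
        openedAt = nowMillis;
        calls = 0;
        errors = 0;
    }
}
```

Constructed with the values from the example above (10, 30, 200, 2), this sketch opens after a window containing 3 failures, rejects calls until 200 milliseconds have elapsed, and closes again after two successful calls in the half-open state.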

Bulkheads

Concurrent access to certain components may need to be limited to avoid excessive use of resources. For example, if an invocation that opens a network connection is allowed to execute concurrently without any restriction, and if the service on the other end is slow to respond, it is possible for the rate at which network connections are opened to exceed the maximum number of connections allowed. Faults of this type can be prevented by guarding these invocations using a bulkhead.

The name bulkhead comes from the partitions that comprise a ship’s hull. If a partition is compromised (e.g., filled with water), it can be isolated so that it does not affect the rest of the hull.

A waiting queue can be associated with a bulkhead to handle tasks that are submitted when the bulkhead is already at full capacity.

Bulkhead bulkhead = Bulkhead.builder()
                            .limit(3)
                            .queueLength(5)
                            .build();
Single<T> single = bulkhead.invoke(this::usesResources);

This example creates a bulkhead that limits concurrent execution of this::usesResources to at most 3 calls, with a queue of size 5. The bulkhead will report an io.helidon.faulttolerance.BulkheadException if it is unable to proceed with the call because the limit has been reached and the queue is at maximum capacity.
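The admission logic of a bulkhead with a waiting queue can be sketched with JDK types only. The code below is an illustration under assumptions and not the Helidon implementation: BulkheadSketch is a hypothetical class, rejected calls fail with IllegalStateException rather than BulkheadException, and queued tasks start on whichever thread completes the previous task.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.function.Supplier;

// JDK-only sketch of bulkhead admission; not the Helidon implementation.
class BulkheadSketch {
    private final int limit;         // max calls in progress
    private final int queueLength;   // max calls waiting to enter
    private int inProgress;
    private final Queue<Runnable> queue = new ArrayDeque<>();

    BulkheadSketch(int limit, int queueLength) {
        this.limit = limit;
        this.queueLength = queueLength;
    }

    synchronized <T> CompletableFuture<T> invoke(Supplier<? extends CompletionStage<T>> supplier) {
        CompletableFuture<T> result = new CompletableFuture<>();
        Runnable task = () -> supplier.get().whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else {
                result.completeExceptionally(error);
            }
            release();
        });
        if (inProgress < limit) {
            inProgress++;
            task.run();                       // a slot is free: start immediately
        } else if (queue.size() < queueLength) {
            queue.add(task);                  // wait for a slot
        } else {
            result.completeExceptionally(new IllegalStateException("Bulkhead full"));
        }
        return result;
    }

    // Frees a slot, or hands it directly to the next queued task.
    private void release() {
        Runnable next;
        synchronized (this) {
            next = queue.poll();
            if (next == null) {
                inProgress--;
            }
        }
        if (next != null) {
            next.run();
        }
    }
}
```

With a limit of 1 and a queue length of 1, a first call runs, a second call waits, and a third call is rejected until the first one completes.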

Handler Composition

Method invocations can be guarded by any combination of the handlers presented above. For example, an invocation that times out can be retried a few times before resorting to a fallback value, assuming it never succeeds.

The easiest way to achieve handler composition is by using a builder in the FaultTolerance class as shown in the following example:

FaultTolerance.TypedBuilder<T> builder = FaultTolerance.typedBuilder();

// Create and add timeout
Timeout timeout = Timeout.create(Duration.ofMillis(10));
builder.addTimeout(timeout);

// Create and add retry
Retry retry = Retry.builder()
                   .retryPolicy(Retry.JitterRetryPolicy.builder()
                                     .calls(3)
                                     .delay(Duration.ofMillis(100))
                                     .build())
                   .build();
builder.addRetry(retry);

// Create and add fallback
Fallback fallback = Fallback.create(throwable -> Single.just(lastKnownValue));
builder.addFallback(fallback);

// Finally call the method
Single<T> single = builder.build().invoke(this::mayTakeVeryLong);

The exact order in which handlers are added to a builder depends on the use case, but generally the order starting from innermost to outermost should be: bulkhead, timeout, circuit breaker, retry and fallback. That is, fallback is the first handler in the chain (the last to be executed once a value is returned) and bulkhead is the last one (the first to be executed once a value is returned).

This is the ordering used by the MicroProfile Fault Tolerance implementation in Helidon when a method is decorated with multiple annotations.
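The innermost-to-outermost ordering can be visualized with a small decorator sketch. This JDK-only example does not use the Helidon builder and makes no claim about its internals; the names CompositionSketch, handler, and trace are hypothetical. Each pretend handler logs when it is entered and exited, and handlers are wrapped in the order they are added, so the first one added ends up closest to the task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

// Illustration of handler nesting order; not the Helidon builder.
class CompositionSketch {

    // A pretend handler that logs entry and exit around the wrapped call.
    static UnaryOperator<Supplier<String>> handler(String name, List<String> log) {
        return inner -> () -> {
            log.add("enter " + name);
            try {
                return inner.get();
            } finally {
                log.add("exit " + name);
            }
        };
    }

    // Wraps a task in the named handlers; the first name listed ends up innermost.
    static List<String> trace(String... names) {
        List<String> log = new ArrayList<>();
        Supplier<String> chain = () -> {
            log.add("task");
            return "value";
        };
        for (String name : names) {
            chain = handler(name, log).apply(chain);
        }
        chain.get();
        return log;
    }

    public static void main(String[] args) {
        trace("timeout", "retry", "fallback").forEach(System.out::println);
    }
}
```

Adding timeout, then retry, then fallback yields a chain that is entered through fallback first and exited through fallback last, matching the description of fallback as the outermost handler.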

Revisiting the Multi Method

All the examples presented so far have focused on invocations returning a single value of type Single<T>. If the invocation in question can return more than one value (i.e., a Multi<T>) then all that is needed is to use the method invokeMulti instead of invoke. The supplier passed to this method must return a Flow.Publisher<T> instead of a CompletionStage<T>.

A Flow.Publisher<T> is a generalization of a Single<T> that can produce zero or more values. Note that a Flow.Publisher<T>, unlike a Single<T>, can report an error after producing one or more values, introducing additional challenges if all values must be processed transactionally, that is, in an all or nothing manner.

The following example creates an instance of Retry and invokes the invokeMulti method; it then registers a subscriber to process the results:

Retry retry = Retry.builder()
                   .retryPolicy(Retry.JitterRetryPolicy.builder()
                                     .calls(2)
                                     .build())
                   .build();
Multi<Integer> multi = retry.invokeMulti(() -> Multi.just(0, 1, 2));

IntSubscriber ts = new IntSubscriber();
multi.subscribe(ts);
ts.request(Integer.MAX_VALUE);

The call to Multi.just(0, 1, 2) simply returns a multi that produces the integers 0, 1 and 2. If an error is generated during this process, the policy will retry the call one more time, for a total of 2 calls.
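The IntSubscriber class used in the example is not defined in this document. A minimal sketch of what such a subscriber might look like, written against java.util.concurrent.Flow (Java 9 or later), is shown below. Both classes are hypothetical: IntSubscriberSketch collects items and exposes a request method, and FixedPublisher is a simplified synchronous stand-in for Multi.just that exists only so the sketch can run without Helidon. It does not implement the full Reactive Streams contract.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Flow;

// Hypothetical subscriber that collects integers; a stand-in for the IntSubscriber above.
class IntSubscriberSketch implements Flow.Subscriber<Integer> {
    private Flow.Subscription subscription;
    final List<Integer> items = new ArrayList<>();
    boolean completed;
    Throwable error;

    @Override public void onSubscribe(Flow.Subscription s) { this.subscription = s; }
    @Override public void onNext(Integer item) { items.add(item); }
    @Override public void onError(Throwable t) { this.error = t; }
    @Override public void onComplete() { this.completed = true; }

    // Signals demand for up to n items.
    void request(long n) { subscription.request(n); }
}

// Simplified synchronous publisher of fixed values; only for running the sketch.
class FixedPublisher implements Flow.Publisher<Integer> {
    private final int[] values;

    FixedPublisher(int... values) { this.values = values; }

    @Override
    public void subscribe(Flow.Subscriber<? super Integer> subscriber) {
        subscriber.onSubscribe(new Flow.Subscription() {
            private int next;
            private boolean done;

            @Override
            public void request(long n) {
                // Emit at most n items, then complete once everything was delivered.
                while (n-- > 0 && next < values.length && !done) {
                    subscriber.onNext(values[next++]);
                }
                if (next == values.length && !done) {
                    done = true;
                    subscriber.onComplete();
                }
            }

            @Override
            public void cancel() { done = true; }
        });
    }
}
```

Note that no items are delivered until the subscriber calls request, which is why the example above ends with a call to ts.request(Integer.MAX_VALUE).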

Configuration

Each Fault Tolerance handler can be individually configured at build time. This is supported by calling the config method on the corresponding builder and specifying a config element. For example, a Timeout handler can be externally configured as follows:

Timeout timeout = Timeout.builder()
        .config(config.get("timeout"))
        .build();

and using the following config entry:

timeout:
  timeout: "PT20S"
  current-thread: true
  name: "MyTimeout"
  cancel-source: false

Note that the actual timeout value is of type Duration, hence the use of PT20S that represents a timeout of 20 seconds. See the Javadoc for the Duration class for more information.
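As a quick check of the ISO-8601 duration strings used in this section, the JDK's Duration.parse can be called directly. The helper class below is only an illustration (DurationSketch is a hypothetical name); how Helidon converts config values internally is not shown here.

```java
import java.time.Duration;

// Parses ISO-8601 duration strings such as those used in the config examples.
class DurationSketch {

    static long seconds(String iso) {
        return Duration.parse(iso).getSeconds();
    }

    static long millis(String iso) {
        return Duration.parse(iso).toMillis();
    }

    public static void main(String[] args) {
        System.out.println(seconds("PT20S"));  // prints 20
        System.out.println(millis("PT0.2S"));  // prints 200
    }
}
```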

The following tables list all the config elements for each type of handler supported by this API.

Timeout

Property	Type	Description
name	String	A name given to the task for debugging purposes. Default is `Timeout-N`.
timeout	Duration	The timeout length as a Duration string. Default is `PT10S` or 10 seconds.
current-thread	boolean	A flag indicating whether the task should execute in the current thread or not. Default is `false`.
cancel-source	boolean	A flag indicating if this task’s source should be cancelled if the task is cancelled. Default is `true`.

Circuit Breaker

Property	Type	Description
name	String	A name given to the task for debugging purposes. Default is `CircuitBreaker-N`.
delay	Duration	Delay to transition from open to half-open state. Default is `PT5S` or 5 seconds.
error-ratio	int	Failure percentage to transition to open state. Default is 60.
volume	int	Size of rolling window to calculate ratios. Default is 10.
success-threshold	int	Number of successful calls to transition to closed state. Default is 1.
cancel-source	boolean	A flag indicating if this task’s source should be cancelled if the task is cancelled. Default is `true`.

Bulkhead

Property	Type	Description
limit	int	Max number of parallel calls. Default is 10.
name	String	A name given to the task for debugging purposes. Default is `Bulkhead-N`.
queue-length	int	Length of queue for tasks waiting to enter. Default is 10.
cancel-source	boolean	A flag indicating if this task’s source should be cancelled if the task is cancelled. Default is `true`.

Retry

Property	Type	Description
name	String	A name given to the task for debugging purposes. Default is `Retry-N`.
overall-timeout	Duration	Timeout for overall retry execution. Default is `PT1S` or 1 second.
delaying-retry-policy	Config	Config section describing delaying retry policy (see below).
jitter-retry-policy	Config	Config section describing jitter retry policy (see below).
cancel-source	boolean	A flag indicating if this task’s source should be cancelled if the task is cancelled. Default is `true`.

Delaying Retry Policy

Property	Type	Description
calls	int	Number of retry attempts. Default is 3.
delay	Duration	Delay between retries. Default is `PT0.2S` or 200 milliseconds.
delay-factor	double	A delay multiplication factor applied after each retry.

Jitter Retry Policy

Property	Type	Description
calls	int	Number of retry attempts. Default is 3.
delay	Duration	Delay between retries. Default is `PT0.2S` or 200 milliseconds.
jitter	Duration	A random delay additive factor in the range `[-jitter, +jitter]` applied after each retry.
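To give a feel for how the delay-factor and jitter properties shape the wait between retries, the following sketch computes delays under two assumptions that are not confirmed by this document: that the delaying policy multiplies the base delay by the factor once per retry already performed, and that the jitter policy adds a uniformly distributed offset to the base delay. The class and method names are hypothetical.

```java
import java.util.Random;

// Illustrative delay arithmetic; the exact formulas used by Helidon are an assumption.
class DelaySketch {

    // Delaying policy: the delay before retry number `retry` (0-based) grows by `factor` each time.
    static long delayingMillis(long delayMillis, double factor, int retry) {
        return Math.round(delayMillis * Math.pow(factor, retry));
    }

    // Jitter policy: base delay plus a random offset in [-jitter, +jitter], never negative.
    static long jitterDelayMillis(long delayMillis, long jitter, Random random) {
        long offset = Math.round((random.nextDouble() * 2 - 1) * jitter);
        return Math.max(0, delayMillis + offset);
    }
}
```

For a base delay of 200 milliseconds and a factor of 2, the first three delays under this assumption are 200, 400 and 800 milliseconds; with a jitter of 50 milliseconds, each jittered delay falls between 150 and 250 milliseconds.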

Examples

See section for examples.

Additional Information

For additional information, see the Fault Tolerance SE API Javadocs.