“The ability of a substance or object to spring back into shape.”
“The capacity to recover quickly from difficulties.”
Merriam-Webster
Resilient
The system should be responsive in the face of failure.
Failure != Error
Examples of failures:
program defects causing corrupted internal state
hardware malfunction
network failures
troubles with external services
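One common reading of "Failure != Error" (an assumption here, since the slides do not spell it out): an error is an expected outcome that is reported back to the caller as a value, while a failure is an unexpected event that the caller cannot handle and that has to be recovered from elsewhere. A minimal Scala sketch of that reading, with hypothetical names (`parseAge`, `loadConfig`):

```scala
import scala.util.{Try, Success, Failure}

object ErrorVsFailure {
  // Error: an expected, domain-level outcome, returned to the caller as a value.
  def parseAge(input: String): Either[String, Int] =
    Try(input.toInt).toOption match {
      case Some(n) if n >= 0 => Right(n)
      case Some(_)           => Left("age must not be negative")
      case None              => Left(s"'$input' is not a number")
    }

  // Failure: something unexpected broke (here, a simulated corrupted state).
  // The caller cannot fix it; it has to be contained and recovered elsewhere.
  def loadConfig(): Map[String, String] =
    throw new IllegalStateException("corrupted internal state")

  def main(args: Array[String]): Unit = {
    println(parseAge("42"))
    println(parseAge("abc"))
    Try(loadConfig()) match {
      case Success(cfg) => println(s"config loaded: $cfg")
      case Failure(e)   => println(s"failure, needs recovery: ${e.getMessage}")
    }
  }
}
```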
Resilient - how?
Failure Recovery in OOP
single thread of control
if the thread blows up => you are screwed
no global organization of error handling
defensive programming tangled with business logic, scattered around the code
Resilience is by design
ok... but how?
Resilience is by design - How?
containment
delegation
isolation
let's review these concepts...
containment:
=> bulkheads => circuit breaker
delegation
=> supervision
isolation (decoupling, both in time and space)
- in time: sender and receiver don't need to be present at the same time to communicate
=> message-driven architecture
- in space (defined as Location Transparency): the sender and receiver don't have to run in the same process, and this might change during the application's lifetime
=> message-driven architecture
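A minimal sketch of decoupling in time, assuming a plain in-memory mailbox (the `Mailbox` class and its methods are hypothetical, not an Akka API): the sender deposits messages while no receiver is running, and the receiver processes the backlog later.

```scala
import scala.collection.mutable

// Hypothetical in-memory mailbox; a stand-in for the queue that a
// message-driven runtime places between sender and receiver.
// Not thread-safe: this sketch only illustrates the decoupling.
final class Mailbox[A] {
  private val queue = mutable.Queue.empty[A]

  // The sender only enqueues; it never waits for the receiver.
  def send(msg: A): Unit = queue.enqueue(msg)

  // The receiver drains whatever has accumulated, whenever it comes up.
  def receiveAll(): List[A] = {
    val msgs = queue.toList
    queue.clear()
    msgs
  }
}

object MailboxDemo {
  def main(args: Array[String]): Unit = {
    val mailbox = new Mailbox[String]
    mailbox.send("order-1") // the receiver is not running yet
    mailbox.send("order-2")
    // Later, the receiver starts and processes the backlog in arrival order.
    mailbox.receiveAll().foreach(m => println(s"processing $m"))
  }
}
```

Because the sender only ever talks to the mailbox, the receiver could also live in another process, which is the same indirection that makes decoupling in space possible.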
Bulkheads
+
Supervision
Bulkheads help us to:
isolate the failure
compartmentalize
manage failure locally
avoid cascading failures
this concept should be used together with supervision!
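One way to picture a bulkhead in code is a per-dependency limit on concurrent calls, so that one struggling dependency cannot use up everything. The sketch below is only an illustration under that assumption; the `Bulkhead` class and the compartment names are hypothetical and not part of Akka.

```scala
import java.util.concurrent.Semaphore

// Hypothetical bulkhead: caps concurrent calls to one dependency so that
// a slow or failing dependency cannot consume every resource of the app.
final class Bulkhead(name: String, maxConcurrent: Int) {
  private val permits = new Semaphore(maxConcurrent)

  // Runs `body` if the compartment has room, otherwise rejects immediately.
  def call[A](body: => A): Either[String, A] =
    if (permits.tryAcquire()) {
      try Right(body)
      finally permits.release() // always free the slot, even if body throws
    } else Left(s"bulkhead '$name' is full")
}

object BulkheadDemo {
  def main(args: Array[String]): Unit = {
    val payments = new Bulkhead("payments", maxConcurrent = 1)
    val search   = new Bulkhead("search", maxConcurrent = 1)

    // While one payments call is in flight, a second one is rejected,
    // but the search compartment is unaffected.
    val result = payments.call {
      (payments.call("second payment"), search.call("search ok"))
    }
    println(result)
  }
}
```

The rejection is handled locally as a plain value, which matches the "manage failure locally" and "avoid cascading failures" points above.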
Supervision
Core Concept
Supervisor hierarchies with Actors
Configuration in Akka
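import akka.actor.OneForOneStrategy
import akka.actor.SupervisorStrategy._
import scala.concurrent.duration._

// Defined inside the supervising actor; applied to each failing child individually.
override val supervisorStrategy =
  OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
    case _: ArithmeticException      => Resume   // keep the child and its state
    case _: NullPointerException     => Restart  // replace the child with a fresh instance
    case _: IllegalArgumentException => Stop     // terminate the child permanently
    case _: Exception                => Escalate // let the parent supervisor decide
  }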
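The strategy body is an ordinary pattern match, so cases are tried top to bottom and the first match wins. The toy model below illustrates only that ordering; its `Directive` types are stand-ins defined here and are not the Akka classes.

```scala
// Toy model of supervision directives; these are NOT the Akka types.
sealed trait Directive
case object Resume   extends Directive
case object Restart  extends Directive
case object Stop     extends Directive
case object Escalate extends Directive

object DeciderDemo {
  // Specific exceptions must be listed before the catch-all case,
  // otherwise the catch-all would shadow them.
  val decide: Throwable => Directive = {
    case _: ArithmeticException      => Resume
    case _: NullPointerException     => Restart
    case _: IllegalArgumentException => Stop
    case _                           => Escalate
  }

  def main(args: Array[String]): Unit = {
    println(decide(new ArithmeticException("division by zero")))
    // NumberFormatException is a subclass of IllegalArgumentException,
    // so it is caught by that case rather than by the catch-all.
    println(decide(new NumberFormatException("not a number")))
    println(decide(new java.io.IOException("disk error")))
  }
}
```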
Circuit Breaker
PROBLEM - Example Situation
a web app interacts with a remote web service (WS)
the remote WS is overloaded, and its DB takes a long time to respond with a failure
=> WS calls fail after a long period of time
=> web users notice that form submissions take a long time to complete
=> web users start to click the refresh button, adding more requests!
=> the web app fails due to resource exhaustion, affecting all users across the site
failures in external dependencies shouldn't bring down an entire app
Solution Proposal
monitor response times
if the response time consistently rises above a threshold, either:
take a fail-fast approach, or...
route subsequent requests to an alternative service
monitor the original service
if the service is back in shape, restore to the original state
if not, continue with the same approach
Akka Implementation
implements the fail-fast approach
provides stability
prevents cascading failures in distributed systems
configuration
max. number of failures
call timeout: response time threshold
reset timeout: how long the breaker stays open before a trial call is allowed through
Closed state (normal operation)
Exceptions or calls exceeding the configured callTimeout increment a failure counter
Successes reset the failure count to 0 (zero)
When the failure counter reaches a maxFailures count, the breaker is tripped into Open State
Open State
All calls fail-fast with a CircuitBreakerOpenException
After the configured resetTimeout, the circuit breaker enters a Half-Open State
Half-Open State
The first call attempted is allowed through without failing fast
All other calls fail-fast with an exception just as in Open state
If the first call succeeds, the breaker is reset back to Closed state
If the first call fails, the breaker is tripped again into the Open state for another full resetTimeout
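The three states can be sketched as a small state machine. This is a simplified, single-threaded model written for illustration, not Akka's implementation: `SimpleBreaker`, `BreakerState` and the injectable `now` clock are hypothetical, `callTimeout` is left out, and concurrent calls during the Half-Open trial are not modelled.

```scala
import scala.concurrent.duration._
import scala.util.{Try, Success, Failure}

sealed trait BreakerState
case object Closed   extends BreakerState
case object Open     extends BreakerState
case object HalfOpen extends BreakerState

// Hypothetical breaker; `now` returns nanoseconds and can be faked in tests.
final class SimpleBreaker(
    maxFailures: Int,
    resetTimeout: FiniteDuration,
    now: () => Long = () => System.nanoTime()
) {
  private var state: BreakerState = Closed
  private var failures = 0
  private var openedAt = 0L

  // Open turns into Half-Open once resetTimeout has elapsed.
  def currentState: BreakerState = {
    if (state == Open && now() - openedAt >= resetTimeout.toNanos) state = HalfOpen
    state
  }

  private def trip(): Unit = { state = Open; openedAt = now() }

  def call[A](body: => A): Try[A] = currentState match {
    case Open =>
      // Fail fast without touching the protected service.
      Failure(new IllegalStateException("circuit breaker is open"))
    case HalfOpen =>
      // Trial call: success closes the breaker, failure re-opens it
      // for another full resetTimeout.
      val result = Try(body)
      if (result.isSuccess) { failures = 0; state = Closed } else trip()
      result
    case Closed =>
      // Successes reset the counter; enough failures trip the breaker.
      val result = Try(body)
      if (result.isSuccess) failures = 0
      else { failures += 1; if (failures >= maxFailures) trip() }
      result
  }
}

object SimpleBreakerDemo {
  def main(args: Array[String]): Unit = {
    var fakeNanos = 0L
    val breaker = new SimpleBreaker(maxFailures = 2, resetTimeout = 1.second, now = () => fakeNanos)

    breaker.call[Int](throw new RuntimeException("boom"))
    breaker.call[Int](throw new RuntimeException("boom"))
    println(breaker.currentState) // the breaker has tripped

    fakeNanos += 1.second.toNanos // pretend resetTimeout has elapsed
    println(breaker.currentState)
    println(breaker.call(42))     // trial call goes through
    println(breaker.currentState)
  }
}
```

Passing the clock in as a function keeps the sketch deterministic; a real breaker would rely on a scheduler or timer instead.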
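import akka.actor.ActorSystem
import akka.stream.FlowMaterializer
import akka.stream.scaladsl.Flow
import scala.util.{Failure, Success}

// Early experimental akka-stream API; `text` is a String defined elsewhere.
implicit val system = ActorSystem("Sys")
val mat = FlowMaterializer(...)
Flow(text.split(" ").toVector).
  map(word => word.toUpperCase).
  foreach(transformed => println(transformed)).
  onComplete(mat) {
    case Success(_) => ...; system.shutdown()
    case Failure(e) => ...; system.shutdown()
  }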