Resilience is the ability to recover from transient failures and continue functioning. A recovery strategy aims to return the application to normal operation with minimal impact on users. Failures are common in cloud environments, so your application must respond to them in a way that minimizes downtime and prevents data loss. Ideally, the application handles failures so gracefully that users never notice them.
Because microservices environments are volatile, you should design your applications to anticipate and handle partial failures. Examples of partial failures include code exceptions, network outages, unresponsive server processes, and hardware faults. Even planned operations, such as moving containers to a different node within a Kubernetes cluster, can cause temporary failures.