I recently read an interesting blog post by Marc Brooker, an engineer at Amazon working on Lambda Services. The full post (found here) is worth a read, and worth some real think time about your own environments. It lays out his four rules for identifying when redundancy actually helps availability. His rules are simply:
- The complexity added by introducing redundancy mustn’t cost more availability than it adds.
- The system must be able to run in degraded mode.
- The system must reliably detect which of the redundant components are healthy and which are unhealthy.
- The system must be able to return to fully redundant mode.
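The first rule can be made concrete with a little arithmetic. What follows is only a rough sketch and not something taken from Brooker's post: it assumes two replicas that fail independently, each with availability `a`, sitting behind a failover layer (load balancer, health checks, and so on) with its own availability `c`. The function names and the sample numbers are hypothetical, chosen purely for illustration.

```python
def redundant_pair(a: float, c: float) -> float:
    """Availability of two independent replicas behind a failover layer.

    a: availability of each replica (assumed independent failures)
    c: availability of the added failover machinery (assumed in series)
    """
    either_replica_up = 1 - (1 - a) ** 2
    return c * either_replica_up


def break_even_failover(a: float) -> float:
    """Failover-layer availability at which the pair merely equals one replica."""
    return a / (1 - (1 - a) ** 2)


if __name__ == "__main__":
    a = 0.99  # hypothetical single-replica availability
    for c in (0.999, 0.98):  # hypothetical failover-layer availabilities
        pair = redundant_pair(a, c)
        verdict = "helps" if pair > a else "hurts"
        print(f"failover layer {c}: pair availability {pair:.6f} ({verdict})")
    print(f"break-even failover availability: {break_even_failover(a):.6f}")
```

Under these assumptions, the redundant pair only beats a single replica when the failover layer itself is more available than the break-even value. Real systems rarely have fully independent failures, so treat this as a lower bar for the added complexity, not a guarantee.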
Some might read those and say, “Of course. Common sense.” But based on my experience digging into many production environments, across many firms and many years, this kind of thinking is *NOT* very common. In many cases, efforts to add redundancy, or even multiple layers of it, introduce so much complexity that simply understanding the basic flow of an application turns into a series of rabbit holes. Monitoring suffers as well: the infrastructure, components, systems, and possible states an application could be in become very difficult to identify, so monitoring is at best implemented in a rudimentary fashion and at worst abandoned or never attempted in the first place.
This is a clear area where a reductionist approach, one that combines operational efficiency, designed simplicity, and real-time availability, may drive more value. Thinking this way may lead you to change some of the upfront variables: how the code works, or how complex or simple the required infrastructure or design should be.
Money always matters. But where the money is applied, and how you think about it, matters more. I have seen numerous financial institutions invest in platforms with multiple layers of resiliency across and between numerous sub-systems and infrastructure, data center replication across multiple geographies, and a very complex set of application behaviors on top of that backdrop. In almost every case, when things begin to go south, that complexity makes it significantly harder to maintain operational status, to restore the original operational status, or even to understand what state the application is actually in. Balancing some of those variables upfront in the design could save you time, money, infrastructure, and effort later.
This conversation should always involve technology, end-to-end design, the commercial targets of the platform, and ultimately the various time elements of fail-over and recovery, all while never losing sight of the current state. All of which is easier said than done.