Reliability Engineering

Coinbase just spent 16 million dollars on a 30-second Super Bowl ad. The ad evidently worked, because their website was promptly overwhelmed with traffic and crashed. Maybe they should have spent a bit on resilient network infrastructure as well.

The problem with many of the IT infrastructures I see is that they are brittle. Each component can be made resilient, with load balancing and database failover, without the overall system becoming robust. Reliability engineering is a cross-domain discipline, and it is not enough for each team to build robustness into its own little piece of the total landscape. Who is responsible for the overall reliability of your systems?
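
To put a rough number on the gap between component and system reliability, here is a minimal sketch. It assumes a request has to pass through every component in turn and that components fail independently; both are simplifications, and the 99.9% figures are hypothetical, not measurements from any real system.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities)


# Hypothetical landscape: five components, each available 99.9% of the time.
components = [0.999] * 5
system = serial_availability(components)

print(f"Best single component: {max(components):.3%}")
print(f"Whole system:          {system:.3%}")
```

Under these assumptions the whole system comes out at roughly 99.5%, which is about five times the downtime of any single component, even though each team can truthfully report that its own piece meets its target.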
