Optimization to Powerlessness

Here in Denmark, we were surprised to find that the Russians have rendered our military combat-ineffective. When NATO asks what we can provide, we can offer a hundred special forces soldiers, some antitank weapons past their expiry date, and an armored brigade without armor. The reason is not a lack of money. We spend many millions. We just don’t spend them on things that matter.

The Russians did not have to attack us kinetically or subject us to a devastating cyber-attack to achieve this. They simply needed to infiltrate the Ministry of Defence with spreadsheet-wielding MBAs supported by a fifth column from McKinsey. We have now optimized our way to warfighting impotence.

Many organizations have similarly found that they have optimized themselves to powerlessness. A ship stuck in the Suez Canal or a war in Ukraine will bring their entire production to a halt.

The only way to resilience, as any capable army knows, is to have extra. You keep more supplies on hand than the absolute minimum, and you have more suppliers than you strictly need. You have spare warehouses and spare production capacity. If you let the MBAs with their spreadsheets run the business, you might suddenly find you have no business.

Reliability Engineering

Coinbase just spent 16 million dollars on a 30-second Super Bowl ad. It seems like the ad worked, because their website was promptly overwhelmed with traffic and crashed. Maybe they should have spent a bit on resilient network infrastructure as well.

The problem with many of the IT infrastructures I see is that they are brittle. Each component can be made resilient with load balancing and database failover, and the overall system can still fail to be robust. Reliability engineering is a cross-domain discipline, and it is not enough that each team builds robustness into its own little piece of the total landscape. Who is responsible for the overall reliability of your systems?
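A bit of arithmetic shows why resilient parts do not add up to a resilient whole. The sketch below is only an illustration: the component names and the 99.9% figures are made up, and it assumes that failures are independent and that a request needs every component in the chain to be up.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, which is optimistic in practice.
    """
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of identical, independent copies where one working copy is enough."""
    return 1 - (1 - availability) ** copies


# Hypothetical figures: each team delivers "three nines" for its own piece.
components = {
    "dns": 0.999,
    "load_balancer": 0.999,
    "web_tier": 0.999,
    "database": 0.999,
    "payment_gateway": 0.999,
}

overall = serial_availability(components.values())
hours_per_year = 24 * 365

print(f"Best single component: {max(components.values()):.4%}")
print(f"Whole chain:           {overall:.4%}")
print(f"Expected downtime:     {(1 - overall) * hours_per_year:.1f} hours per year")
```

Under these assumptions the chain is always less available than its weakest link, and every added dependency makes it worse. Nobody sees that number unless somebody owns the whole chain.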

Consider the Failure Scenarios

The cable snapped, and a 25-tonne undersea mining vehicle is now stuck at the bottom of the Pacific Ocean.

Having one cable is the proverbial “single point of failure.” Just like in IT, it might not make business sense to pay the extra cost for full redundancy. But in a professional IT organization, somebody has examined the failure scenarios. If the database server crashes, we might lose this much data, and we will restore operations in this way.

Sending a robot to the bottom of the ocean without implementing a feature that allows it to autonomously return to the surface seems like an over-optimistic strategy. Do you allow similar unwarranted optimism in your IT organization?
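Examining the failure scenarios can be as simple as writing them down and checking for gaps. The sketch below is one hypothetical way to do that: the FailureScenario fields, the function name and the sample entries are my own illustration, not a standard or a real inventory.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureScenario:
    """One component and what we have decided about its failure (illustrative fields)."""
    component: str
    redundant: bool
    max_data_loss_minutes: Optional[int] = None  # "we might lose this much data"
    recovery_plan: Optional[str] = None          # "we will restore operations in this way"

    def is_examined(self) -> bool:
        # Examined means somebody has stated both the expected loss and the way back.
        return self.max_data_loss_minutes is not None and bool(self.recovery_plan)


def unexamined_single_points(scenarios):
    """Return components that have no redundancy and no documented answer to failure."""
    return [s.component for s in scenarios if not s.redundant and not s.is_examined()]


# Hypothetical register
scenarios = [
    FailureScenario(
        "database server",
        redundant=False,
        max_data_loss_minutes=15,
        recovery_plan="restore the latest backup on standby hardware",
    ),
    FailureScenario("load balancer", redundant=True),
    FailureScenario("tether cable", redundant=False),
]

print(unexamined_single_points(scenarios))  # prints ['tether cable']
```

A single point of failure with a documented answer is a business decision. One without an answer is just optimism.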

Article from BBC: https://www.bbc.com/news/science-environment-56921773