How Could That Happen?

How could that happen? We always ask that question after a scandal or disaster, because everything that went wrong seems so obvious in hindsight.

Here in Denmark, one of the news stories today is about a sperm donor who turned out to carry a potentially cancer-causing mutation. Firstly, the mutation should have been detected before his sperm was accepted. Secondly, no single person should ever have been allowed to father 197 children across Europe. But the system to limit harm was implemented piecemeal, and apparently nobody verified that sperm banks adhered to national laws or to their own rules.

When you implement an IT system, things can go wrong. But the people building the system cannot see where. All experience shows that builders are unable to see beyond the “happy path” in which the system delivers the benefits it was designed for. We try to compensate for that with separate testers who did not write a line of code. But that only covers programming errors. Most significant failures involve the processes and people around the IT system.
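
To make “beyond the happy path” concrete, here is a toy sketch; it is not from any real system, and the function and account names are invented for illustration. The first test is the one the builder writes. The other two are the kind a separate tester, or a Red Team, has to think up:

```python
# Hypothetical in-memory "bank" and transfer function, for illustration only.
class TransferError(Exception):
    pass

ACCOUNTS = {"alice": 100, "bob": 50}

def transfer_funds(src: str, dst: str, amount: int) -> None:
    if src not in ACCOUNTS or dst not in ACCOUNTS:
        raise TransferError("unknown account")
    if amount <= 0:
        raise TransferError("amount must be positive")
    if ACCOUNTS[src] < amount:
        raise TransferError("insufficient funds")
    ACCOUNTS[src] -= amount
    ACCOUNTS[dst] += amount

# The happy path: the test the builder writes.
def test_happy_path():
    transfer_funds("alice", "bob", 10)
    assert ACCOUNTS["alice"] == 90 and ACCOUNTS["bob"] == 60

# Beyond the happy path: the cases that cause real incidents.
def test_negative_amount_rejected():
    try:
        transfer_funds("alice", "bob", -10)  # would mint money if allowed
        assert False, "negative transfer must be rejected"
    except TransferError:
        pass

def test_overdraft_rejected():
    try:
        transfer_funds("alice", "bob", 10_000)
        assert False, "overdraft must be rejected"
    except TransferError:
        pass

if __name__ == "__main__":
    test_happy_path()
    test_negative_amount_rejected()
    test_overdraft_rejected()
    print("all tests passed")
```

And even with all three tests green, you have only covered the code; the processes and people around it remain untested.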

Do you have an imaginative Red Team that will challenge both the system and the processes around it?

Good Intentions Are Not Enough

“We have the ambition to test disaster recovery twice a year.” That’s not something anybody in a professional IT organization would say, is it? Ambition? I have the ambition to create a spam- and hate-speech-free Twitter alternative powered by unicorns and rainbows, but unless I act on my ambition, nothing will happen.

Nevertheless, critical Danish infrastructure was operated on that principle. The common login system used by everything from banks to tax authorities to municipalities is run by a company called Nets. They apparently got to write their contract with the state themselves, because it contains this ridiculous “ambition” instead of an actual requirement.

They did run a test on May 28, 2020. They did not run a test in November 2020, as was their ambition. Nor in May or November 2021. Not even in May 2022 did they test it. So when they crashed the system in June 2022 due to undocumented changes and other unprofessional shenanigans, the disaster recovery unsurprisingly failed.

Please tell everyone this story. When you are done laughing at the incompetence of central Danish authorities and their vendors, make sure you are testing your own disaster recovery…
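
One way to turn an “ambition” into something enforced is a monitoring check that raises an alarm when the last successful recovery test is too old. A minimal sketch, assuming you record each successful test as a dated log entry; the file path and the six-month threshold below are invented for illustration:

```python
import sys
from datetime import date, timedelta
from pathlib import Path

LOG_FILE = Path("/var/log/dr-tests.log")   # one ISO date per successful test (hypothetical path)
MAX_AGE = timedelta(days=182)              # "twice a year" as a hard rule, not an ambition

def last_successful_test() -> date | None:
    if not LOG_FILE.exists():
        return None
    lines = [ln.strip() for ln in LOG_FILE.read_text().splitlines() if ln.strip()]
    return date.fromisoformat(lines[-1]) if lines else None

def main() -> int:
    last = last_successful_test()
    if last is None:
        print("ALERT: no disaster-recovery test on record")
        return 1
    age = date.today() - last
    if age > MAX_AGE:
        print(f"ALERT: last DR test was {age.days} days ago ({last})")
        return 1
    print(f"OK: last DR test {last}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Run it daily from whatever monitoring you already have; a check that pages someone is much harder to ignore than a sentence in a contract.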

How to Avoid Techno-Blindness

Techno-blindness is a dangerous affliction. It is a disease of over-optimism mainly affecting people in the technology industry. The symptoms are overconfidence that a system works as intended and a lack of awareness of what might go wrong.

Somebody in Moscow thought it was a cool idea to have a computer play chess with children, using a robotic arm to move the pieces. Until a child made an unexpected movement. The robot grabbed his hand and broke his finger. TuSimple is building autonomous trucks, and one of them accidentally executed an old instruction, causing it to turn left in the middle of the highway. Fortunately, nobody was injured as the truck veered across the I-10 and slammed into a barrier.

Important systems need independent safeguards. That means a completely separate piece of code that can intervene if the output of an algorithm lies outside some boundary. A truck shouldn’t be able to turn left at high speed. A robotic arm shouldn’t move on the chessboard until the player’s hands are off the pieces.
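
As a toy illustration of the structure (the numbers and names are invented, and a real guard would run as a separate process, ideally on separate hardware): a small, independent piece of code sits between the algorithm and the actuators and enforces a safe envelope.

```python
# A minimal sketch of an independent safeguard; not TuSimple's or anyone's
# actual code. The planner is untrusted; the guard is a separate, simpler
# check that clamps or vetoes commands outside a safe envelope.

from dataclasses import dataclass

@dataclass
class SteeringCommand:
    angle_deg: float  # requested steering angle, positive = left

# Hypothetical safe envelope: the faster you go, the less you may steer.
def max_safe_angle(speed_kmh: float) -> float:
    if speed_kmh < 10:
        return 35.0   # parking-lot maneuvers
    if speed_kmh < 60:
        return 15.0
    return 3.0        # highway speed: lane corrections only

def guard(cmd: SteeringCommand, speed_kmh: float) -> SteeringCommand:
    """Independent check between the planner and the actuators."""
    limit = max_safe_angle(speed_kmh)
    if abs(cmd.angle_deg) > limit:
        # Veto: clamp to the envelope and raise an alarm out-of-band.
        print(f"SAFEGUARD: {cmd.angle_deg:.1f} deg rejected at {speed_kmh} km/h")
        return SteeringCommand(angle_deg=max(-limit, min(limit, cmd.angle_deg)))
    return cmd

# A stale "turn left" instruction arriving at highway speed gets clamped:
print(guard(SteeringCommand(angle_deg=30.0), speed_kmh=100))
```

The guard knows nothing about the planner’s reasoning; it only enforces the boundary. That is exactly why it catches errors the planner, and the planner’s developers, cannot see.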

As a CTO, it is your job to ensure there are safeguards around important systems. You cannot depend on techno-blind developers to do this by themselves.