What is Your Time to Recover?

It wouldn’t take you three to four weeks to rebuild a critical system, would it? But that’s how they do things in the National Health Service in the UK. Doctors and hospitals have been advised that the central patient record system is offline due to a ransomware attack and will not be back until sometime in September. In the meantime, doctors will have no access to their patients’ medical histories and will have to keep notes on paper or in Microsoft Word on their laptops.

As a national health monopoly, the NHS will not be going out of business. But a private company that lost its manufacturing, logistics, or service management system for a month would be finished.

You do everything you can to prevent bad things from happening. But have you also planned contingent action in case something terrible does happen? The NHS hadn’t.

When the Shortcut Becomes the Standard Way

There must be a shortcut for every Standard Operating Procedure (SOP). You cannot allow the production database to be down for hours while you chase down everyone on the Change Advisory Board. The police cannot wait for a judge and the associated paperwork if somebody’s life is in danger.

The problems start if the shortcut is abused. Amazon is sharing video from your Ring doorbell with anyone who works for one of their 2,161 “partners” as long as they promise on their scout’s honor that it is really important. Google will similarly share the video feed from your Nest camera. They say they haven’t done so yet, but that’s only because law enforcement hasn’t noticed them yet.

I’ve been in organizations where half the work of IT operations was Emergency Changes. Do you track how much work is handled following SOP and how much your people use the shortcuts?
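
One way to start answering that question is to compute the share of emergency changes from an export of your change log. The sketch below is only an illustration: the `Change` record, its `kind` field, and the 20% threshold are assumptions made for this example, not fields or limits from any particular change management tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One change record; the fields are hypothetical, not from a real export."""
    change_id: str
    kind: str  # "standard" (followed the SOP) or "emergency" (used the shortcut)


def emergency_share(changes: list[Change]) -> float:
    """Return the fraction of changes that bypassed the standard procedure."""
    if not changes:
        return 0.0
    emergencies = sum(1 for c in changes if c.kind == "emergency")
    return emergencies / len(changes)


def shortcut_is_the_standard(changes: list[Change], threshold: float = 0.2) -> bool:
    """Flag when the emergency share exceeds an assumed acceptable threshold."""
    return emergency_share(changes) > threshold


if __name__ == "__main__":
    log = [
        Change("CHG-1", "standard"),
        Change("CHG-2", "emergency"),
        Change("CHG-3", "emergency"),
        Change("CHG-4", "standard"),
    ]
    print(f"Emergency share: {emergency_share(log):.0%}")
    print("Needs attention:", shortcut_is_the_standard(log))
```

The right threshold depends on your organization. The point is to have a number you watch over time, so you notice when the shortcut is turning into the standard way.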

The Original Sin of Technology Projects

Today is the 394th anniversary of the original sin of technology projects. The Swedish King sent out an RFP for a huge warship, and a private contractor was selected to build it. But on second thought, the King wanted the most powerful warship in the world. So he required another deck of cannons to be added. A conscientious engineer would have refused, telling the King that this would make the vessel too top-heavy. But just like a modern contractor, the private shipbuilders accepted the questionable change request and charged extra.

On its maiden voyage, the ship heeled over and sank just off the pier in front of thousands of spectators. Just like in modern IT disasters, a commission investigated, and in the end, nobody was punished.

In every failed technology project, dozens or hundreds of people know it will fail many months or even years before the failure becomes obvious. What are your processes to ensure that you are not building a modern version of the good ship Vasa?

(image by Jorge Lascar/Flickr used under CC BY 2.0)

Are You Depending on Luck or Skill?

If you can buy anything in 7-Eleven today, count yourself lucky. Here in Denmark, the entire chain is shut down because hackers got into their Point-of-Sale terminals.

Through luck or skill, 7-Eleven seems to have managed to limit the damage to Denmark. If they were lucky, each country simply runs its own network, and the networks are not interconnected. If they were skillful, they had segmented their network so the malware couldn’t spread. Remember that Maersk had not sufficiently segmented its network and was laid low when malware targeting Ukrainian companies spread from its Ukrainian subsidiary to its entire global operation.

Is your network appropriately segmented so one hacker cannot kill your entire operation?

Holding Your Ears Is Not an Effective Strategy

Closing your eyes and holding your ears is considered an effective IT strategy. At least here in Denmark, where the Danish public schools have been ignoring European data privacy regulations. With much hand-wringing, they are now scrambling to replace their Google Chromebooks as the new school year starts.

The 2020 Schrems II judgment from the European Court of Justice said that because all data passed to American providers end up in the databases of the NSA, you are not allowed to store personal information with American cloud providers. Nevertheless, Danish schools kept using Google services. The Danish Data Protection Agency (DPA) has finally told them to stop.

The people at the coalface in your organization know where corners are being cut. But there are several layers of management between the people who know and the CIO and CTO who will be fired once the problem explodes. So if you are in an IT leadership position, how are you ensuring that you hear about questionable practices in your organization?

How to Avoid Techno-Blindness

Techno-blindness is a dangerous affliction. It is a disease of over-optimism mainly affecting people in the technology industry. The symptom is overconfidence that a system works as intended and a lack of awareness of what might go wrong.

Somebody in Moscow thought it was a cool idea to have a computer play chess with children, using a robotic arm to move the pieces. Until a child made an unexpected movement. The robot grabbed his hand and broke his finger. TuSimple is building autonomous trucks, and one of them accidentally executed an old instruction, causing it to turn left in the middle of the highway. Fortunately, nobody was injured as the truck veered across the I-10 and slammed into a barrier.

Important systems need independent safeguards. That means a completely separate piece of code that can intervene if the output of an algorithm lies outside some boundary. A truck shouldn’t be able to turn left at high speed. A robotic arm shouldn’t move on the chessboard until the player’s hands are off the pieces.
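
As a rough sketch of what such an independent safeguard could look like, the example below clamps a steering command to a speed-dependent limit. Everything in it is an assumption made for illustration: the `SteeringCommand` type, the speed bands, and the angle limits are invented and do not come from TuSimple or any real vehicle system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteeringCommand:
    """Output of the (hypothetical) driving algorithm."""
    angle_deg: float  # requested steering angle; negative means left


def max_safe_angle(speed_kmh: float) -> float:
    """Assumed envelope: the faster the vehicle, the smaller the allowed angle."""
    if speed_kmh < 20:
        return 35.0
    if speed_kmh < 60:
        return 10.0
    return 2.0


def safeguard(command: SteeringCommand, speed_kmh: float) -> SteeringCommand:
    """Independent check that clamps any command outside the safe envelope.

    This function knows nothing about how the command was produced; it only
    looks at the requested output and the current speed.
    """
    limit = max_safe_angle(speed_kmh)
    clamped = max(-limit, min(limit, command.angle_deg))
    return SteeringCommand(angle_deg=clamped)
```

With these made-up limits, `safeguard(SteeringCommand(-30.0), speed_kmh=100.0)` returns a command of -2.0 degrees, no matter why the algorithm asked for a hard left. The design point is that the check is separate from the algorithm, so a bug or stale instruction in one does not disable the other.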

As a CTO, it is your job to ensure there are safeguards around important systems. You cannot depend on techno-blind developers to do this by themselves.

How Do You Handle Security Issues?

Over breakfast, the CEO asks you about the latest Atlassian vulnerability that he’s just read about in the Wall Street Journal. Good answers are: “That doesn’t apply to us” or “It has been addressed.” OK answers are: “We’re looking into it” or “It is being mitigated.” The horrible answer is: “What vulnerability?”

Last month, 1,973 new vulnerabilities were published. July 2022 was a quiet month – most months have over 2,000. Many of these don’t apply to you, but you need to evaluate all of them. Do you just have one guy following @CVEnew on Twitter, or do you have a real process able to handle the ever-increasing load?
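
The first step of such a process can be automated: compare each new advisory against an inventory of what you actually run, so people only look at the ones that might apply. The sketch below is a deliberately simplified assumption of how that could work. The `Advisory` type and the placeholder CVE identifiers are invented for illustration, and matching on product name alone ignores versions and configurations that a real process would have to check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advisory:
    """A published vulnerability; the fields are simplified assumptions."""
    cve_id: str
    product: str  # name of the affected product


def triage(advisories: list[Advisory], inventory: set[str]) -> dict[str, list[str]]:
    """Split advisories into those that touch our inventory and those that do not.

    The inventory is expected to hold lower-case product names.
    """
    result: dict[str, list[str]] = {"applies": [], "does_not_apply": []}
    for adv in advisories:
        bucket = "applies" if adv.product.lower() in inventory else "does_not_apply"
        result[bucket].append(adv.cve_id)
    return result


if __name__ == "__main__":
    # Placeholder identifiers, not real CVE numbers.
    new_advisories = [
        Advisory("CVE-0000-0001", "Confluence"),
        Advisory("CVE-0000-0002", "SomeOtherApp"),
    ]
    print(triage(new_advisories, inventory={"confluence", "postgresql"}))
```

Even a crude filter like this only works if the inventory is complete and current, which is usually the harder problem.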

Somebody Else’s Problem

Things that are Somebody Else’s Problem (SEP) are invisible. Douglas Adams famously joked about this in “The Hitchhiker’s Guide to the Galaxy,” but the effect is serious and real.

For example, local British politicians were falling over each other trying to attract data centers. They were focusing on the cachet of having Google or Facebook in their town, and the half-dozen jobs for the electricians and plumbers maintaining them. Supplying these energy-hungry behemoths with power was Somebody Else’s Problem.

Now they have so many data centers in West London that their electrical grid is overloaded, and they won’t be able to build more housing until they have upgraded their main cables. That’ll be sometime in the 2030s.

As an IT leader, it is your job to ensure that each team knows the problems they might cause for other parts of the organization.

Cloud Services Leak Your Data

Big Brother is watching what you write. Chinese users working on the local equivalent of Google Docs discovered that there are some things you can’t write. An author was locked out of the novel she was writing, with the system telling her that she was trying to access “sensitive content.” It didn’t matter that she had written it herself.

Of course, Google would never lock you out of your Docs or Sheets. And they claim they don’t look at your documents to sell you ads, though plenty of users report spooky coincidences. The default setting in Microsoft products is to enable “Connected Experiences.” That means your content is being sent to Microsoft servers for analysis. Microsoft claims no human looks at it.

Do you have guidelines and technical measures in place to prevent sensitive data leaking out of your organization through cloud services?

The Tolstoy Principle in Action

This is what failure looks like: 50% one-star reviews. The other half is five-star reviews. Assuming these are not all from the app developers themselves, the app apparently can work. It just didn’t work for me, nor for many others.

I call this the Tolstoy principle: All successful apps are alike; each unsuccessful app is unsuccessful in its own way. The end-user does not care that 98% of your back-end infrastructure is running. They care that they can complete their task. And if one critical component fails, your app is a failure. Like this one from my local supermarket chain.

When you build systems, is all the attention lavished on a cool front-end app? Unsexy back-end services are equally important.