Mainframe Mindset

Several dozen Danish banks were down for five hours yesterday. Due to incompetence, not Russian hackers.

They were running on robust mainframe systems, because these have proven over decades that they never go down. But it turns out that running critical systems takes both hardware and skill. And the skill was lacking.

The reason mainframes have historically had very high availability is that they are really well-engineered and they’ve been run by really competent people. But those people have reached retirement age, and their jobs are gradually being taken over by people with a different mindset. That’s how the mainframe hosting provider managed to run a poorly tested capacity management system that accidentally deallocated resources from all their customers.

There is mainframe hardware, and there is the mainframe mindset. The “this can never, ever, be allowed to fail” mindset. Which is retiring.

Are you sure you are transferring not only skills but also attitude when training new people to take over your critical systems?

The First Thing That Comes to Mind

We’re also going to ban social media for young people here in Denmark. It won’t work here either.

There are two possible approaches to a hard problem.

One is to spend time gathering data, defining the real problem, identifying several possible solutions, implementing the most promising one, and checking the result.

The other is to bombastically announce the first solution that comes to mind. That is what politicians and some business leaders do. That’s how we get social media bans, EU proposals for backdoors on every encrypted service, and the recently proposed ban on VPNs in Denmark. These are poorly thought-out solutions that will cause harm without addressing the underlying problem.

Our brains have a strong availability bias, leading us to jump on the first solution that comes to mind. In order to make good decisions, we need to use a framework. Design Thinking is an example of a method that forces us to use the first approach. Don’t just run with a random first idea.

Learning From People, Not From Documents

Implementing AI has a critical and often-overlooked problem that Raman Shah just reminded me of in another discussion: It can only learn what is documented.

When you teach an AI to perform a process in your organization, you train it on all the internal documents you have. But these describe the RAW – Rules As Written. They do not describe the RAU – Rules As Used.

It takes a lot of care to do process discovery properly. You need to have a human being interview the people doing the work to find out how business is actually done.

Work-to-rule is a classic form of industrial action where workers do exactly what they’re told in order to stop production without going on strike. If you train an AI on nothing but your documents, you are effectively ordering a work-to-rule stoppage.

How Could That Happen?

How could that happen? We always ask that question after a scandal or disaster, because all that went wrong seems so obvious in hindsight.

Here in Denmark, one of the news stories today is about a sperm donor who turned out to have a potentially cancer-causing mutation. Firstly, it should have been detected before his sperm was accepted. Secondly, one person should never have been allowed to father 197 children across Europe. But the system to limit harm was implemented piecemeal, and apparently nobody verified that sperm banks adhered to national laws or their own rules.

When you implement an IT system, things can go wrong. But the people building the system cannot see where. All experience shows that builders are unable to see beyond the “happy path” in which the system delivers the benefits it was designed for. We try to compensate for that with separate testers who did not write a line of code. But that only covers programming errors. Most significant failures involve the processes and people around the IT system.

Do you have an imaginative Red Team that will challenge both the system and the processes around it?

Business Knowledge Beats Technical Skill

Business knowledge is more valuable than technical skill. I see again and again that organizations get rid of experienced IT people because they don’t have the latest buzzwords on their CVs. They are replaced with offshore resources or eager young things who tick all the boxes and cost less.

That is a misguided strategy. Business knowledge takes years to accumulate because it is not, and cannot be, taught. Someone who has been in the organization for years knows how the business actually works, and that context lets them interpret requirements and build software that matches reality. A new hire without that knowledge can only build what is written in the spec, which rarely matches what the business needs.

Your technology changes much faster than your business. If you keep hiring new people every time you decide to switch to the latest and greatest technology (AI, anyone?), your people will never have more than 2 or 3 years of business knowledge.

If you need to change technology, it is a much better approach to hire one expert on the new tech and have that person teach your experienced employees. Don’t throw away decades of experience. You’ll miss it when it’s gone.

Buy More, Goddammit

The reason you fail is that you are not spending enough. Said the vendor.

Lack of self-awareness is a common human foible, and it seems to be one of the characteristics that AI leaders are hired for. Kellen O’Connor, leader of AWS’s Northern European business, is an example. Interviewed at AWS re:Invent in Las Vegas, he dismissed the well-documented failure of almost every AI initiative by saying that the customers are not thinking big enough. They need to apply AI to business-critical functions and let AI agents loose.

Translated from AI hype to plain talk: Yes, our software hasn’t proven any business benefit yet, and the way to achieve business benefit is to buy more of it. Good luck with that chain of reasoning in the CIO’s office.

Break the Law More?

Do you want your AI to follow the rules? That’s not as clear-cut as it seems.

You do want your AI to follow your instruction not to delete files. The Google Antigravity “vibe coding” tool achieved notoriety this week after wiping a developer’s entire D: drive instead of just clearing the cache. It said, “I am deeply, deeply sorry.” Google has promised to look into it.
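You cannot fine-tune that instruction into the model; you have to enforce it outside the model. A minimal sketch in Python of one way to do that, with a deterministic guard around the destructive operation (the cache location and tool name are made up for illustration):

```python
from pathlib import Path
import shutil

# Hypothetical guard for an agent's "clear cache" tool. The allowed root
# is fixed in code, not chosen by the model at run time.
CACHE_ROOT = (Path.home() / ".myapp" / "cache").resolve()  # assumed location

def safe_clear(target: str) -> None:
    """Delete target only if it resolves to a path inside the cache."""
    resolved = Path(target).resolve()  # collapses ".." tricks
    if resolved != CACHE_ROOT and CACHE_ROOT not in resolved.parents:
        raise PermissionError(f"refusing to delete outside the cache: {resolved}")
    shutil.rmtree(resolved, ignore_errors=True)

# The guard, not the model, decides what is deletable:
safe_clear(str(CACHE_ROOT / "build"))  # allowed
try:
    safe_clear("D:/")                  # blocked, whatever the model intended
except PermissionError as err:
    print(err)
```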

On the other hand, Waymo self-driving cars in San Francisco have been notorious for meticulously following traffic rules and blocking traffic. Waymo just updated its software, giving its cars more of a New York taxi driver attitude. They no longer block traffic, but on the other hand, they now do illegal U-turns and illegal rolling “California stops” at stop signs just like humans.

Before you hand a process over to a computer – whether deterministic or AI – make sure you understand both how the process is documented and how it is actually done today.

Rollback Plans

What differentiates a professional software organization from a bunch of amateurs? One thing is the ability to roll back.

It’s not a good day at the office when the Federal Aviation Administration issues an Emergency Airworthiness Directive, grounding 6,000 of the aircraft your customers are flying. A JetBlue flight on autopilot suddenly turned nose-down in mid-flight, and it turns out that the L104 version of the ELAC software was vulnerable to memory corruption due to solar radiation.

But aircraft manufacturers and airlines have procedures in place, and engineers rapidly fanned out to airports around the world, rolling back to the L103 version.
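That kind of recovery is only fast because the mechanics were in place before they were needed. A minimal sketch of one common pattern, in Python for illustration (the directory layout and names are made up): keep every release on disk and make “current” a pointer you can flip back.

```python
import tempfile
from pathlib import Path

# Sketch of a symlink-swap deployment: every release stays on disk, and
# "current" is just a pointer that can be flipped back in seconds.
DEPLOY_ROOT = Path(tempfile.mkdtemp())   # stand-in for e.g. /srv/myapp
RELEASES = DEPLOY_ROOT / "releases"      # releases/<version>/ kept on disk
CURRENT = DEPLOY_ROOT / "current"        # symlink to the live release

def activate(version: str) -> None:
    """Point 'current' at a release; deploying and rolling back are the same move."""
    target = RELEASES / version
    if not target.is_dir():
        raise FileNotFoundError(f"release {version} is not on disk")
    tmp = DEPLOY_ROOT / "current.tmp"
    tmp.symlink_to(target)
    tmp.replace(CURRENT)                 # atomic rename: never a moment without a live version

for v in ("L103", "L104"):
    (RELEASES / v).mkdir(parents=True, exist_ok=True)

activate("L104")                         # roll out the new version
activate("L103")                         # the same operation rolls back
print(CURRENT.resolve().name)            # -> L103
```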

It is impossible to test every situation, and every once in a while, something unforeseen happens. Professional software organizations can rapidly recover. Do you have rollback plans in place every time you roll out new versions of critical software?

AI Agents Are a Stupid Idea

AI agents are a stupid idea. Consider that we’ve had the option to program deterministic agents to handle most of the suggested AI workflows for a long time. Yet we didn’t do it. Why not? Because it turned out to be hard. We did not have good data or good APIs for the actions we wanted to take, and we always encountered lots of difficult edge cases.

Yet the AI vendors are proposing that we now try again. This time with stochastic algorithms whose behavior we can never be quite sure of.

Agentic AI means that we take a problem that we could not solve with deterministic programming, and add another problem on top. And that is supposed to be the future? I don’t think so.

Find Cause or Just Reboot?

Are you going to solve the problem, or just reboot the server? This is one of the many places where IT professionals come into conflict with the business.

The technical expert wants to figure out what’s wrong rather than just wipe all the evidence and hope for the best. But as Tim Gorman pointed out in a comment on another post, that takes time and expertise.

Most systems are not so critical that it makes business sense to investigate the root cause, especially if a nightly reboot solves the problem. As a technologist, I find it grating to accept that a system doesn’t work well and that I cannot investigate further. However, as a consultant tasked with helping organizations maximize scarce IT manpower, I often find that recommending a simple reboot is the most practical advice.
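A middle ground is to let the scheduled restart snapshot some evidence first, so a root-cause hunt remains possible later if it ever becomes worth the effort. A minimal sketch, assuming a Linux box with systemd (the service name and archive path are made up):

```python
import subprocess
from datetime import datetime
from pathlib import Path

SERVICE = "myapp.service"                    # hypothetical service name
EVIDENCE = Path("/var/log/reboot-evidence")  # assumed archive location

def restart_with_evidence() -> None:
    """Snapshot recent logs and a process list, then restart the service."""
    EVIDENCE.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    with (EVIDENCE / f"{SERVICE}-{stamp}.log").open("w") as f:
        # The last day of service logs plus a memory-sorted process list:
        # cheap to keep, and enough to start a root-cause hunt later.
        f.write(subprocess.run(["journalctl", "-u", SERVICE, "--since", "-24h"],
                               capture_output=True, text=True).stdout)
        f.write(subprocess.run(["ps", "aux", "--sort=-%mem"],
                               capture_output=True, text=True).stdout)
    subprocess.run(["systemctl", "restart", SERVICE], check=True)

if __name__ == "__main__":
    restart_with_evidence()  # run nightly from cron or a systemd timer
```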

Make sure you use your resources where it makes business sense.