If you are working in the field of industrial safety, you know two things to be true:

Safety is built on solid engineering principles and has an incredibly successful track record.
For complex environments, it is getting more and more difficult to fulfill the promise of reliable risk reduction.

Why would the second statement be true? When analyzing the critical components in a safety-relevant environment, analysts often use the single failure criterion. This criterion requires that no failure of a single component can lead to an unacceptable consequence. Sounds like a good idea!

A less good idea is the flipside of the coin, namely the practice of stopping the analysis at the single failure criterion. When analyzing safety architectures, a good question to ask is “What if this other critical component fails simultaneously?” Unfortunately, you may often get the answer “That’s beyond the single failure assumption.” In other words: we don’t know, but we assume it just won’t happen. (Knock on wood.)

What about combinatorial failure modes, where multiple components, either related or unrelated, fail simultaneously or in sequence? Such failure modes used to be excluded from safety analysis because of their low likelihoods, and those likelihood estimates rest simply on the observation that the failures haven’t been seen before. Unfortunately, random processes have no memory, so the fact that a certain event didn’t happen before is no guarantee that it won’t happen in the future. Taleb told you so with what he calls the Thanksgiving turkey fallacy.
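To make the gap concrete, here is a minimal sketch of an analysis that passes the single failure criterion yet has combinatorial failure modes one step further out. The architecture is a hypothetical toy: three redundant pairs whose names (`sensor_a`, `logic_b`, and so on) are invented for illustration, with the assumption that the safety function stays available as long as one member of each pair is healthy. It is not taken from any real plant design or from any particular tool.

```python
from itertools import combinations

# Hypothetical toy architecture (illustrative names, not a real design):
# the safety function needs at least one healthy member of each pair.
REDUNDANT_GROUPS = {
    "sensor": {"sensor_a", "sensor_b"},
    "logic": {"logic_a", "logic_b"},
    "valve": {"valve_a", "valve_b"},
}
COMPONENTS = sorted(set().union(*REDUNDANT_GROUPS.values()))


def safety_function_available(failed):
    """Return True if no redundant group has lost all of its members."""
    failed = set(failed)
    return all(not group <= failed for group in REDUNDANT_GROUPS.values())


def defeating_combinations(order):
    """List every set of `order` simultaneous failures that defeats the safety function."""
    return [
        combo
        for combo in combinations(COMPONENTS, order)
        if not safety_function_available(combo)
    ]


single = defeating_combinations(1)
double = defeating_combinations(2)

print("Single failures that defeat the design:", single)
print("Double failures that defeat the design:", double)
```

Under these assumptions the list of single failures is empty, so the design satisfies the single failure criterion, while three of the fifteen possible double failures defeat it. Stopping the analysis at order one would never surface them.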

The whole picture changes once more if we go from random failure to malicious intent. As I have pointed out elsewhere, for a cyber attacker it is only logical to manipulate multiple components as part of a single plot in order to violate the assumptions, and thus the effectiveness, of a given safety design. The result is coordinated malfunction.

In a digital world, the safety of our nuclear and chemical facilities is lower than what the public assumes, and lower than what safety experts demonstrate within their self-imposed analytic scope. What is most disturbing is the fact that today, we do have technology available to thoroughly analyze something as simple as a nuclear safety design, including potential malicious manipulation, yet few are actually doing it. Upgrading the safety discipline to go beyond random component failure and simplistic spurious actuation is necessary for a digital world, and it isn’t even particularly difficult.

If you want to learn how we approach the problem, check out our OT-BASE platform, which includes cyber-physical impact analysis.

Combinatorial failure modes and coordinated malfunction: Why we need to upgrade safety wisdom in a digital world