Tuesday, August 9, 2011

No-Failure Design and Disaster Recovery: Lessons from Fukushima

One of the striking aspects of the early stages of the nuclear accident at Fukushima-Daiichi last March was the nearly total absence of disaster recovery capability. For instance, although Japan is a superpower of robotic technology, the nuclear authorities had to import robots from France to probe the damaged plants. Fukushima can teach us an important lesson about technology.

The failure of critical technologies can be disastrous. The crash of a civilian airliner can cause hundreds of deaths. The meltdown of a nuclear reactor can release highly toxic isotopes. The failure of flood-protection systems can cause widespread death and damage. Society therefore insists that critical technologies be designed, operated and maintained to extremely high levels of reliability. We benefit from technology, but we also insist that the designers and operators "do their best" to protect us from its dangers.

Industries and government agencies that provide critical technologies almost invariably act in good faith, for a range of reasons. Morality dictates responsible behavior, liability legislation establishes sanctions for irresponsible behavior, and economic or political self-interest makes continuous safe operation desirable.

The language of performance optimization (not only doing our best, but actually achieving the best) can undermine the successful management of technological danger. A probability of severe failure of one in a million per device per year is exceedingly, and very reassuringly, small. When we honestly believe that we have designed and implemented a technology to have a vanishingly small probability of catastrophe, we can honestly ignore the need for disaster recovery.

Or can we?

Let's contrast this with an ethos that is consistent with a thorough awareness of the potential for adverse surprise. We now acknowledge that our predictions are uncertain, perhaps highly uncertain on some specific points. We attempt to achieve very demanding outcomes (for instance, vanishingly small probabilities of catastrophe), but we recognize that our ability to reliably calculate such small probabilities is compromised by the deficiency of our knowledge and understanding. We robustify ourselves against those deficiencies by choosing a design that would be acceptable over a wide range of deviations from our current best understanding. (This is called "robust-satisficing".) Not only does a "vanishingly small probability of failure" still entail the possibility of failure, but our predictions of that probability may err.
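
To make the contrast concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration, not drawn from any reactor analysis: two invented designs, made-up numbers, and a toy failure model. A performance optimizer picks the design with the lowest predicted failure probability; a robust satisficer picks the design that keeps meeting the safety requirement over the widest range of errors in the model behind that prediction.

# A minimal, hypothetical sketch of robust-satisficing versus performance
# optimization. The designs, numbers, and failure model are illustrative
# inventions, not data from Fukushima or any real reactor analysis.

def failure_probability(nominal_p, sensitivity, error):
    """Estimated failure probability when the model's key assumption is off
    by `error` (a fractional deviation from the nominal model).
    `sensitivity` captures how quickly the estimate degrades."""
    return nominal_p * (1.0 + sensitivity * error)

def robustness(design, critical_p, step=0.01, max_error=100.0):
    """Largest model error the design can tolerate while still keeping the
    estimated failure probability at or below the critical threshold."""
    error = 0.0
    while error <= max_error:
        if failure_probability(design["nominal_p"], design["sensitivity"], error) > critical_p:
            return error
        error += step
    return max_error

# Two hypothetical designs: A looks better on paper; B degrades more slowly
# when the underlying model is wrong.
designs = {
    "A": {"nominal_p": 1e-6, "sensitivity": 500.0},
    "B": {"nominal_p": 2e-6, "sensitivity": 20.0},
}

critical_p = 1e-5  # the "vanishingly small" failure probability we must meet

# Performance optimization: pick the smallest predicted failure probability.
best_nominal = min(designs, key=lambda d: designs[d]["nominal_p"])

# Robust-satisficing: pick the design that tolerates the largest deviation
# from the nominal model while still meeting the requirement (both designs
# meet it when the model is taken at face value).
best_robust = max(designs, key=lambda d: robustness(designs[d], critical_p))

print("Performance optimizer chooses:", best_nominal)  # A
print("Robust satisficer chooses:   ", best_robust)    # B
for name, d in designs.items():
    print(name, "tolerates model error up to about", round(robustness(d, critical_p), 2))

In this toy example the optimizer prefers design A (the smaller nominal probability), while the robust satisficer prefers design B, because B continues to satisfy the critical threshold for model errors roughly ten times larger than A can tolerate.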

Acknowledging the need for disaster recovery capability (DRC) is awkward and uncomfortable for designers and advocates of a technology. We would much rather believe that DRC is not needed, that we have in fact made catastrophe negligible. But let's not conflate good-faith attempts to deal with complex uncertainties with guaranteed outcomes based on full knowledge. Our best models are in part wrong, so we robustify against the designer's bounded rationality. But robustness cannot guarantee success. The design and implementation of DRC is a necessary part of the design of any critical technology, and is consistent with the strategy of robust satisficing.

One final point: moral hazard and its dilemma. The design of any critical technology entails two distinct and essential elements: failure prevention and disaster recovery. What economists call a "moral hazard" arises because the failure-prevention team might rely on the disaster-recovery team, and vice versa: each team might, at least implicitly, depend on the capabilities of the other, and thereby relinquish some of its own responsibility. Institutional provisions are needed to manage this conflict.

The alleviation of this moral hazard entails a dilemma. Considerations of failure prevention and disaster recovery must be combined in the design process. The design teams must be aware of each other, and even collaborate, because a single coherent system must emerge. But we don't want either team to relinquish any responsibility. On the one hand, we want the failure-prevention team to work as though there were no disaster recovery, and the disaster-recovery team to presume that failures will occur. On the other hand, we want the two teams to collaborate on the design.

This moral hazard and its dilemma do not obviate the need for both elements of the design. Fukushima has taught us an important lesson by highlighting the special challenge of high-risk critical technologies: design so failure cannot occur, and prepare to respond to the unanticipated.
