Thursday, September 11, 2014

Fail-Safe Mechanisms Must Be Tested

Some systems base their safety arguments on the presence of “fail-safe” behaviors. In other words, if a failure occurs, the argument is that the system will respond in a safe way, such as by shutting down in a safe manner. If you have fail-safe mechanisms, you need to test them with a full range of faults within the intended fault model to make sure they work properly.

Consequences:
Failing to specifically test for mitigation of single points of failure means that there is no way to be sure that the mitigation really works, putting safety of the system into doubt.

As an example, if a hardware watchdog timer is not turned on, it won’t reset the system, but there might be no way to tell whether the watchdog timer is on or not (or set to the wrong value, or otherwise used improperly) without specifically testing whether the watchdog works or not. Thus, you can’t take credit for having a watchdog timer unless you have actually tested that it works for each fault that matters (or, if there are many such faults, argue that you have attained sufficient coverage with the tests that are run).

Accepted Practices:
  • Each and every fail-safe mechanism and fault management mechanism must be tested, preferably on a fully integrated system. Such tests may be difficult to perform in normal functional testing and may require intentional fault injection from the outside of the system (e.g., breaking a sensor) or fault injection at test points inside the system (e.g., intentionally killing a task using special test support infrastructure).
Discussion:
Fault injection is the process of intentionally inducing a hardware or software fault and determining its effect upon the system.

Fault management mechanisms, and especially fail-safe mechanisms, are often the key points upon which an argument as to the safety of a system rests. As an example, a safety case based on a watchdog timer detecting task failures requires that the watchdog timer actually work. While it is of course important to make sure that the system has been designed properly, there is no substitute for testing whether the watchdog timer is actually turned on during system test. (To revisit a point on system testing made elsewhere in my postings – system testing is not sufficient to ensure safety, but thorough system testing is certainly an important thing to do.) It is similarly important to specifically test every fault mode that must be handled by the system to ensure fault handling is done correctly.

Some examples of fault tests that should be performed include: killing each task independently to ensure that the death of any task is caught by the watchdog (and, by extension, cannot cause an unsafe system state); overloading the system to ensure that it behaves safely in an unanticipated CPU overload situation; checking that diagnostic fail-safes detect the faults they are supposed to and react by putting the system into a safe state; disabling sensors; disabling actuators; and others.

Another perspective on this topic is that ensuring safety usually involves arguing that all single points of failure have been mitigated to make the system safe. To demonstrate that the reasoning is accurate, a system must have corresponding failures injected to make sure that the mitigation approaches actually work, since the system’s safety case rests upon that assumption. This might include intentionally corrupting bits in memory, corrupting computations that take place, corrupting stack contents, and so on.

It is important to note that ordinary system functional testing tends to do a poor job at exercising fault mitigation mechanisms. As an example, if a particular task is never supposed to die, and testing has been thorough, then that task won’t die during normal functional testing (if it did, the system would be defective!). The point of detecting task death is to handle situations you missed in testing. But that means the mechanism to detect task death and perform a restart hasn’t been tested by normal system-level functional tests. Therefore, testing fail-safe mechanisms requires special techniques that intentionally introduce faults into the system to activate those fail-safes.

Selected Sources:
Safety critical systems are deemed safe only if they can withstand the occurrence of any single point fault. But, there is no way to know if they will really do that unless testing includes actually injecting representative single point faults to see if the system will respond in a safe manner. You can’t know if a system is safe if you don’t actually test its safety capabilities, and doing so requires fault injection. For example, if you expect a watchdog to detect failed tasks, you need to kill each and every task in turn to see if the watchdog really works. Arlat correctly states that “physical fault injection will always be needed to test the actual implementation of a fault tolerant system” (Arlat 1990, pg. 180)

The need to actually test fail-safe mechanisms to see if they really work should be readily apparent to any engineer. Pullum discusses this topic by suggesting the use of fault injection (intentionally causing faults as a testing technique) in the context of “verification of integration of fault and error processing mechanisms” for creating dependable systems (Pullum 2001, pg. 93).

“Fault injection is important to evaluating the dependability of computer systems. … It is particularly hard to recreate a failure scenario for a large complex system.” (Hsueh et al., 1997 pg. 75, speaking about the need for fault injection as part of testing a system). Mariani refers to the IEC 61508 safety standard and concludes that “fault-injection will be mandatory for soft error sensitivity verification” for safety critical systems (Mariani03, pg. 60). “A fault-tolerant computer system’s dependability must be validated to ensure that its redundancy has been correctly implemented and the system will provide the desired level of reliable service. Fault injection – the deliberate insertion of faults into an operational system to determine its response – offers an effective solution to this problem.” (Clark 1995, pg. 47).

Fault injection must include all possible single-point faults, not just faults that can be conveniently injected via the pins or connectors of a component. Rimen et al. compared internal vs. external fault injection, and found that that only 9%-12% of bit flip faults that occur inside a microcontroller could be tested via external pin fault injection (Rimen et al. 1994, p. 76). In 1994, Karlsson reported on the effectiveness of using a radioactive isotope to inject faults into a microcontroller (Karlsson 1994). Later fault injection work by Karlsson’s research group was performed on automotive brake-by-wire applications, sponsored by Volvo (Aidemark 2002), clearly demonstrating the applicability of fault injection as a relevant technique for safety critical automotive systems. And other similar work found defects in a safety critical automotive network protocol. (Ademaj 2003)

A test specifically on an engine control program using fault injection caused “permanently locking the engine’s throttle at full speed.” (Vinter 2001).

There are numerous other scholarly works in this area.  An early example is Bossen (1981). Some others include: Arlat et al. (1989), Barton et al. (1990), Benso et al. (1999), Han (1995), and Kanawati (1995). As a more recent example, Baumeister et al. performed fault injection on an automotive braking controller via irradiating it and measuring the errors, finding that unprotected SRAM and unprotected microcontroller paths were both sensitive to upsets (Baumeister 2012, pg. 5)

MISRA Software Guidelines take it for granted that fault management capabilities will be tested (e.g., MISRA Software Guidelines 3.4.8.3 pg. 44, MISRA Report 4 p. v) “Fault injection test” is recommended by ISO 26262-6 (pg. 23) for software integration, noting that “This includes injection of arbitrary faults in order to test safety mechanisms (e.g., by corrupting software or hardware components).”

By the late 1990s fault injection tools had become quite sophisticated, and were capable of injecting faults while a system was running at full speed even if source code was not available (e.g., Carreira 1998).
An example of a testing approach along these lines is E-GAS (E-GAS), which includes numerous tests based on auto manufacturer experience to ensure that various faults will be handled safely.

It is important to note that while mitigation techniques such as watchdog timers are a good practice if implemented properly, they are not sufficient to guarantee safety in the face of random errors. For example, Gunneflo presents experimental evidence indicating that watchdog effectiveness is less than perfect, and depends heavily on the particular software being run. Gunneflo recommends: “To accurately estimate coverage and latency for watch-dog mechanisms in a specific system, fault injection experiments must be carried out with the final implementation of the system using the real software.” (Gunneflo 1989, pg. 347). In other words, even if you have a watchdog timer, you need to perform fault injection to understand whether there are holes in your fault tolerance approach.

References:
  • Ademaj et al., Evaluation of fault handling of the time-triggered architecture with bus and star topology, DSN 2003.
  • Aidemark, Experimental evaluation of time-redundant execution for a brake-by-wire application, DSN 2002.
  • Arlat et al., Fault Injection for Dependability Validation of Fault-Tolerant Computing Systems, FTCS, 1989.
  • Arlat et al., Fault Injection for Dependability Validation: a methodology and some applications, IEEE Trans. SW Eng., 16(2), pp. 166-182 Feb. 1990.
  • Barton et al., Fault injection experiments using FIAT, IEEE Trans. Computers, pp. 575-592, April 1990.
  • Baumeister et al., Evaluation of chip-level irradiation effects in a 32-bit safety microcontroller for automotive braking applications, IEEE Workshop on Silicon Errors in Logic – System Effects, 2012.
  • Benso et al., Fault injection for embedded microprocessor-bases systems, J. Universal Computer Science, 5(10), pp. 693-711, 1999.
  • Bossen & Hsiao, ED/FI: a technique for improving computer system RAS, FTCS, 1981.
  • Carreira et al., Xception: a technique for the experimental evaluation of dependability in modern computers, IEEE Trans. Software Engineering, Feb. 1998, pp. 125-136.
  • Clark, J. et al. Fault Injection: a method for validating computer-system dependability, IEEE Computer, June 1995, pp., 47-56
  • EGAS, Standardized E-Gas monitoring concept for engine management systems of gasoline and diesel engines version 4.0, Work Group EGAS, Jan. 30, 2007.
  • Gunneflo et al., Evaluation of error detection schemes using fault injection by heavy-ion radiation, Fault Tolerant Computing Symposium, 1989, pp. 340-347.
  • Han et al., DOCTOR: An integrated software fault injection environment for distributed real-time systems, International Computer and Dependability Symposium, 1995, pp. 204-213.
  • Hsueh, M. et al., Fault injection techniques and tools, IEEE Computer, April 1997, pp. 75-82.
  • ISO 26262, Road vehicles – Functional Safety, International Standard, First Edition, Nov 15, 2011, ISO, part 6.
  • Kanawati et al., FERRARI: a flexible software-based fault and error injection system, IEEE Trans. Computers, 44(2), Feb. 1995, pp. 248-260.
  • Mariani, Soft errors on digital computers, Fault injection techniques and tools for embedded systems reliability evaluation, 2003, pp. 49-60.
  • MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).
  • MISRA, Report 4: Software in Control Systems, February 1995.
  • Pullum, L., Software Fault Tolerance Techniques and Implementation, Artech House, 2001.
  • Rimen et al., On microprocessor error behavior modeling, Fault Tolerant Computing Symposium, 1994, pp. 76-85.
  • Vinter, J., Aidemark, J., Folkesson, P. & Karlsson, J., Reducing critical failures for control algorithms using executable assertions and best effort recovery, International Conference on Dependable Systems and Networks, 2001, pp. 347-356.

No comments:

Post a Comment

Please send me your comments. I read all of them, and I appreciate them. To control spam I manually approve comments before they show up. It might take a while to respond. I appreciate generic "I like this post" comments, but I don't publish non-substantive comments like that.

If you prefer, or want a personal response, you can send e-mail to comments@koopman.us.
If you want a personal response please make sure to include your e-mail reply address. Thanks!

Job and Career Advice

I sometimes get requests from LinkedIn contacts about help deciding between job offers. I can't provide personalize advice, but here are...