Paul L. Jones

Joseph Jorgens III

Alford R. Taylor, Jr.

Markus Weber

 

 

 

 

 

 

 

Abstract

The safety of any medical device system depends on the application of a disciplined, well-defined risk management process throughout the product life cycle. Hardware, software, human, and environmental interactions must be assessed in terms of intended use, risk, and cost/benefit criteria. This paper addresses these issues in the context of medical devices that incorporate software. The paper explains the principles of risk management, using terminology and examples from the domain of software engineering. It may serve as guidance for those new to the concepts of risk management and as an aide-memoire for medical device system/software engineers more familiar with the topic.

 


Table of Contents

 

1  Introduction

2  Purpose

3  Scope

4  Definitions

4.1  Discussion of definitions

5  Relevant Medical Device Standards

6  Risk Assessment

6.1  Risk Analysis

6.1.1  Identification of known or foreseeable hazards

6.1.2  Risk Estimation

6.2  Risk Evaluation

7  Risk Control Techniques

7.1  Inherent safe design

7.2  Fault tolerant measures

7.3  Protective measures

7.4  User information

7.5  Combined Measures

8  Integration with the Quality Management System

8.1  Design and development planning

8.2  Design input

8.3  Design verification, validation, review, and transfer to production

8.4  Design change

8.5  Production and process controls

8.6  Corrective and preventive actions

8.7  Document Control

9  Risk Management Report

10  Conclusions

11  Bibliography

A  ANNEX

A.1  Risk Management Process Documentation

A.2  Risk Management Summary Table Example

 

1         Introduction

The development of a medical device product relies on disciplined implementation of numerous project management and product design processes. One of these processes is risk management. Risk management goals vary, depending on the development context. Project risk management is concerned with business risks. In this article, the concern is technical or design risks posed by medical device technology in relation to product safety.

2         Purpose

The goal of this article is to provide developers of medical device software with an understanding of risk management principles. This paper is not a complete exposition of all of the software engineering tools and techniques used to manage risk. Rather, our aim is to provide those working “down in the trenches” with some insight into the “big picture” of risk management. A second objective is to help risk managers understand the implications for risk management posed by the presence of software in the system. All too often, in our experience, those charged with responsibility for developing software and those charged with responsibility for managing risk operate in two separate spheres. We hope that this article will help to bridge the divide by fostering communication and a shared understanding of the relationship between software engineering and risk management.

 

 

 

Understanding the terminology and its proper context is key to understanding the associated processes. Many of the terms used in risk management describe concepts that have close counterparts in other environments or industry sectors, and the subtle differences between them are often overlooked in “normal” usage. This article attempts to clarify some of these subtleties by looking at the components of risk management, and by using precise language to identify how these components relate to each other. The article may serve as guidance for those new to the concepts of risk management in the medical device industry and as an aide-memoire for medical device systems/software engineers more familiar with the topic.

3         Scope

This article discusses the Risk Management Process as it relates to software. This article does not go into a detailed discussion of software development processes, activities, and tasks.  For example, software safety architectures, coding techniques, and hardware failure correction methods are not enumerated in this article. An enormous amount of such information is available in the software engineering literature, addressing implementations ranging from small embedded systems to large distributed systems [20, 12, 13, 1, 21]. The expectation is that professional software engineers, once armed with the principles of risk management, will avail themselves of this information.

4         Definitions

 

harm

Physical injury and/or damage to the health of people or damage to property or the environment [3].

hazard

Potential source of HARM. [3] {See section 4.1, Discussion of definitions, below}

risk

Combination of the probability of occurrence of HARM and the SEVERITY of that HARM [3]

risk analysis

Use of available information to identify hazards and to estimate the risk [3]

risk evaluation

Judgement, on the basis of RISK ANALYSIS, of whether a RISK which is acceptable has been achieved in a given context based on the current values of society. [3]

risk assessment

Overall process of RISK ANALYSIS and RISK EVALUATION [3]

safety

Freedom from unacceptable RISK of HARM [3]

severity

Measure of the possible consequences of a HAZARD [3]

4.1        Discussion of definitions

Hazard

ISO/IEC 14971 defines harm as physical injury or damage to the health of people, or damage to property or the environment, and defines hazard as a potential source of harm. ISO/IEC 14971 takes its definitions of harm and hazard from ISO/IEC Guide 51:1999, definitions 3.1 and 3.5, respectively.

 

The ISO/IEC definition of hazard contributes to a great deal of confusion when compiling a list of known or foreseeable hazards[1]. This is, in part, because the definition is ambiguous: almost every state or event within the device system, given certain conditions, could be construed as a potential source of harm. The distinction, then, between a hazard and the event(s) resulting in a hazard can become very subjective. For example, if the device has a component insulation failure, resulting in an electrical short, resulting in an electrical shock, resulting in cardiac arrhythmia, resulting in fibrillation or death, which of these events is the potential source of harm (i.e., the hazard), and which is the cause of the hazard? Similarly, if a device software component performs an incorrect calculation, resulting in an incorrect electrical stimulation, resulting in cardiac arrhythmia, which event is the potential source of harm and which is the cause of the hazard? The ambiguity of the definition results in hazard lists that often vary widely between manufacturers of similar devices.

 

What constitutes a hazard depends upon how the designers abstract the boundaries of the device or system, as shown in Figure 4-1 below, the intended use, and how the designers define a hazard. It would be useful to regulators, auditors, and manufacturers if these boundaries were unambiguously “standardized”. ISO/IEC 14971 begins to address this issue by providing lists of clinical hazards and classes of hazardous conditions in Annexes A-D.

 

 

 

Figure 4-1 provides a model of the medical device hazard / cause continuum. The Device may affect the Patient, User, and Service domains by means of direct or indirect (Environmental) interfaces in such a manner as to cause harm. The Clinical and Device level hazards identified in Figure 4-1 generally map to the ISO 14971 Annexes, and may serve as a starting point for establishing hazards.

 

 

 

Figure 4-1   Hazard / Cause Continuum

 

 

 

For this paper, we propose to resolve some of the ambiguity by using the following definitions.  We believe that these refinements are consistent with the spirit of ISO/IEC definitions, and will help to resolve some of the confusion experienced by medical device developers.

 

We therefore define a hazard[2] as any broadly characterized means, mode, or manner by which a medical device (or more generally, a system being analyzed) might cause harm.  The classification of hazards is highly subjective and purely pragmatic.  The fundamental test for classifying hazards is whether the proposed classification scheme provides insights that are useful for analyzing risk. Many hazards involve exposure to harmful amounts of energy, or harmful agents; e.g., exposure to electric shock, or a toxic material.  Conversely, the withholding of energy or beneficial agents may be equally hazardous, e.g., exposure to cold, oxygen deprivation, interruption in medical treatment. 

 

A hazardous event is any occurrence of a hazard.  Whether or not harm results from a given hazardous event depends on the degree to which hazardous preconditions are present, and often the precise timing of contributory events.  In the real world, it is often difficult to quantify all of the factors involved, and it may be equally difficult to predict and measure actual consequences and outcomes. 

 

A cause of a hazard is any set of events and/or circumstances, the combination of which might reasonably be expected to result in a hazardous event.  A given hazard might have one, several, or many possible causes.

 

Applying these definitions to the example cited earlier, electrical shock is most properly identified as the hazard. The insulation failure, together with the circumstance (e.g., shock, vibration, or handling) that caused the exposed conductors to become shorted together, constitutes a possible cause of the hazard. One might also identify other, entirely different causes of electrical shock: misassembled components, software design error, operator error, etc. The medical conditions of cardiac arrhythmia and/or fibrillation are possible consequences of a hazardous event involving electrical shock, and death is one possible outcome.
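The separation of hazard, cause, consequence, and outcome in this example can be written down as a simple record. The following Python sketch is illustrative only; the class name `Hazard` and its field names are hypothetical choices made for this example, not terms prescribed by ISO/IEC 14971.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hazard:
    """A broadly characterized way in which the device might cause harm.

    Field names are hypothetical and chosen only for illustration.
    """
    name: str
    # Sets of events and/or circumstances that could lead to a hazardous event.
    causes: List[str] = field(default_factory=list)
    # Possible medical consequences of a hazardous event.
    consequences: List[str] = field(default_factory=list)
    # Possible final outcomes.
    outcomes: List[str] = field(default_factory=list)


# The electrical shock example from the text, expressed as one record.
shock = Hazard(
    name="electrical shock",
    causes=[
        "insulation failure combined with shock, vibration, or handling",
        "misassembled components",
        "software design error",
        "operator error",
    ],
    consequences=["cardiac arrhythmia", "fibrillation"],
    outcomes=["death"],
)
```

Keeping causes separate from the hazard itself makes it easier to see that one hazard may have one, several, or many possible causes.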

 

 

Hazard Analysis

A hazard analysis is defined to be the identification of hazards and their causes.

 

The term hazard analysis is useful because it encapsulates, in a less ambiguous manner, the two distinct but complementary activities described in the ISO/IEC 14971 hazard identification process[1].

 

NOTE: For purposes of this article, the terms hazard, cause, and hazard analysis are to be interpreted in the above context.

 


5         Relevant Medical Device Standards

Currently there are three international medical device standards, generally referenced by manufacturers, which define risk management activities that are relevant to the guidance provided in this article: ISO/IEC 14971 [3], IEC 60601-1-4 [4], and EN 1441 [2]. ISO/IEC 14971 is a risk management standard. IEC 60601-1-4 addresses risk management processes in the context of Programmable Electronic Medical Systems (PEMS). EN 1441 is a risk analysis standard. ISO/IEC 14971 is a superset of EN 1441 and IEC 60601-1-4 (i.e., it integrates the contents of EN 1441 and IEC 60601-1-4, and adds post-market monitoring activities). In addition to risk management activities, these standards identify classes of hazards, their causes, and risk control methods as aides-memoire. Risk management process activities as defined in ISO/IEC 14971 are shown in Figure 5-1 below.

 

 

Figure 5-1 Risk Management Process (Activities)  [ISO/IEC 14971]

 

 

All of the standards identified above (and FDA Guidance documents) specify the same basic activities of the risk management process shown in Figure 5-1. These include:

 

·         identification of hazards and their cause(s)

·         risk estimation (i.e., determination of the likelihood of the hazard’s occurrence and the severity of the consequence of each hazard)

·         risk evaluation (i.e., determination of the tolerability of risks)

·         risk control (i.e., identification and implementation of measures to limit or reduce unacceptable risk)
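As a rough illustration of how risk estimation and risk evaluation fit together, the following Python sketch combines an ordinal likelihood with an ordinal severity and compares the result against an acceptability threshold. The scale names, the use of a simple product as the combination rule, and the threshold value are all assumptions made for this example; none of them is taken from ISO/IEC 14971 or the other standards cited above.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical four-level severity scale."""
    NEGLIGIBLE = 1
    MARGINAL = 2
    CRITICAL = 3
    CATASTROPHIC = 4


class Likelihood(IntEnum):
    """Hypothetical four-level likelihood scale."""
    IMPROBABLE = 1
    REMOTE = 2
    OCCASIONAL = 3
    FREQUENT = 4


# Assumed threshold: scores above this value call for risk control.
ACCEPTABLE_MAX = 4


def estimate_risk(likelihood: Likelihood, severity: Severity) -> int:
    """Risk estimation: combine likelihood and severity into one score."""
    return int(likelihood) * int(severity)


def evaluate_risk(score: int) -> str:
    """Risk evaluation: judge the estimated risk against the threshold."""
    return "acceptable" if score <= ACCEPTABLE_MAX else "needs risk control"


score = estimate_risk(Likelihood.REMOTE, Severity.CATASTROPHIC)
decision = evaluate_risk(score)
```

In practice the combination rule and the acceptability criteria are defined by the manufacturer's risk management plan; a product of two ordinal numbers is only one of several possible conventions.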

 

ISO/IEC 14971 is expected to obviate the need for IEC 60601-1-4 and EN 1441, so understanding the requirements in this standard is important. The main body of the standard is silent with regard to software. However, annexes in the standard provide examples of how software failure can result in a hazardous or unsafe condition.

 

When developing software, the risk management process specified in ISO/IEC 14971 should be integral to software life cycle development processes specified in standards such as ISO/IEEE/EIA 12207 and AAMI SW68.  In fact, AAMI SW68 explicitly requires the use of ISO/IEC 14971.

 

6         Risk Assessment

When designing a medical device, one must be concerned about potential injury to the patient, user, or a third party. A process for addressing this injury potential is risk assessment. Risk assessment as characterized in ISO/IEC 14971 is defined to mean the “overall process of risk analysis and risk evaluation”. Risk assessment as it is diagrammed in ISO/IEC 14971, Figure 6-1 below, includes intended use, hazard identification (and causes), risk estimation, and risk acceptability decision subprocesses.

 

 

Figure 6-1    ISO/IEC 14971 Risk Assessment

 

During the risk assessment process, and assuming device requirements have reached an appropriate level (defining intended use), one develops a list of hazards and sequence of foreseeable events resulting in those hazards (hazard analysis), estimates the severity of the hazards and their likelihood of occurrence (risk estimation), and then evaluates the acceptability of the risk (risk evaluation). The list of hazards and their causes evolves throughout the development of the device in a continuous and iterative manner and is used to ensure, to the extent possible, that identified hazards have been eliminated, mitigated, or are determined to be acceptable (such as those inherent in the design). The process of risk assessment continues even after the product is released, because previously unidentified hazards may become apparent. Subsequent sections provide visibility into activities associated with these sub-processes.

 

When developing a medical device, there are other areas of risk assessment that must be considered in addition to safety risks. These include business risks, environmental risks, and other forms of risk which fall outside the scope of this paper. While the tools for performing risk assessment may vary, depending on the type of risk under consideration, the basic risk management process is the same.

6.1        Risk Analysis

ISO/IEC 14971 defines risk analysis as the “use of available information to identify hazards and to estimate the risk”. Sometimes the terms hazard analysis and risk analysis are used interchangeably, but an important distinction exists. Hazard analysis, as defined in this article, only identifies hazards and the foreseeable sequence of events and/or circumstances resulting in the hazard (causes). Risk analysis adds a level of harm (severity), and a likelihood of occurrence estimation process (which must also take into account the identification and assessment of environmental conditions along with duration or exposure). Thus hazard analysis is a subset of risk analysis.

6.1.1       Identification of known or foreseeable hazards

Hazard identification is a prerequisite to risk assessment, evaluation, and control. The analysis of hazards forms a basis for system safety requirements and is integral to the product development life cycle process. When hazards are identified early in the product development phase, the hazards and their possible mitigation methods should be explicitly documented. This results in improved control over the design process, and ultimately helps to assure the safety of the final product.

6.1.1.1     Hazard identification process

The process of identifying hazards can become arduous because while many of the potential hazards are immediately obvious, there are usually many more non-obvious hazards that are sometimes very difficult to identify. Often the process can begin as a kind of free-association process; just letting one’s thoughts dwell on all aspects of the device, and what problems could occur. A nagging problem with this hazard generation process is that one is never sure that the list is complete. Therefore a process should be developed and followed which will make use of all available knowledge such as corporate product history, device industry literature, relevant standards, and even relevant hazards identified outside the device industry.

 

Initially, a list of device level hazards is generated which does not take into account specific causes of those hazards, but is just a list of all known and foreseeable hazards. Many of these potential hazards will be immediately obvious. However, there will be many hazards that are not obvious. To discover these additional hazards and to identify specific causes, various methods have been suggested and employed. Some of these methods are discussed in Section 6.1.1.3, “Hazard analysis models and techniques.”

 

Currently there is no method available to demonstrate the completeness of a generated list of hazards and their causes. It is possible that no matter how much rigor is applied to the hazard identification process, some hazards and their causes will not be uncovered. These latent hazards and their causes could result in harm. Therefore, development of the hazard list should not omit hazards because they seem to be unlikely or because historical precedent is missing. Hazard analysis is a ‘cornerstone’ for continuous and effective evaluation of device safety.

 

It is important to be aware that hazard identification is a multi-faceted activity requiring a variety of skills and perspectives. Design engineers, manufacturing and product service personnel, and clinicians all have unique perspectives. Input from multiple sources of knowledge, such as physiology, pharmacology, life sciences, physical sciences, human factors, and engineering disciplines, is necessary to ensure a comprehensive identification of hazards.

 

6.1.1.1.1           Intended and Unintended Uses

The requirements for intended use should be either part of the hazard analysis or specified in separate documents traceable to the hazard analysis. Identified hazards must be evaluated if they could occur during the intended use of the device under the following conditions:

·         under normal operating conditions

·         under foreseeable misuse of the device

·         under or beyond marginal operating conditions

·         under foreseeable fault conditions

 

Additionally, the type of operators and patients assumed as intended users of the device should be identified. This will give a clearer indication of what use-environment the device is targeted for. As an example, the operator(s) could be characterized as:

·         licensed clinician, specially trained in the use of the device

·         board-certified surgeon having no device-specific training

·         nurse or medical technologist (may be contract employee)

·         patient or family member

 

and the patient treated or diagnosed by the device could be categorized as:

·         healthy

·         healthy, impaired (medications, disabilities, competence)

·         fragile, elderly or slightly ill

·         heavily or critically ill

·         life-sustained, terminally ill

 

Usage requirements are necessary information to determine hazard potential and to evaluate risk/benefit trade-offs that are necessary in determining the level of risk reduction and acceptable risk level. For example, a specially trained professional is less likely to use the device incorrectly, and a healthy patient is less likely to be sensitive to certain hazards.

6.1.1.1.2           Scope and Exclusions

The scope of the hazard analysis should be carefully considered because it will, in part, determine the labeling of the device. Hazard analysis should be integrated with the requirements definition phase of product development because it establishes basic design parameters.

The scope of the hazard analysis should include not only the intended use and use environment of the device, but also foreseeable consequences of unintended use or environmental factors. For example, professional-grade otoscopes are now being marketed on the Internet to potentially untrained parents of small children for home diagnosis of ear infections. Otoscope manufacturers have an obligation to consider the implications of this, and risk management provides a structured framework for doing so.

If exclusions in scope are made, these exclusions should be clearly identified. Criminal use, sabotage, or use by untrained personnel are some exclusions that might be considered under some circumstances. However, for certain products intended for use in an institutional setting, criminal use or sabotage may not be excludable.

 

6.1.1.1.3           Classes of Hazards

 

The first step of a hazard analysis should be a comprehensive collection of potential hazards, notwithstanding their likelihood or perceived relevance; it is primarily a cause-effect analysis. This collection should begin with a review of company hazard/safety device history records. The list may be refined as implementation of the device becomes more defined, but should be free from implementation details. Figure 6-2 identifies device properties and interactions that can assist in the hazard identification process, and illustrates how the device is subject to multiple interactions with its environment and its internal states. Not all hazards are usually identified immediately. For example, hazards may arise from unforeseen use scenarios or subtle interactions between systems, patients, and the environment. A hazard analysis should therefore be periodically reviewed and updated. At this stage, specific details associated with the hazard (software, hardware, control systems, accessories, etc.) should not be exhaustively evaluated because they may depend on future design decisions. However, as many interactions as possible between different device (fault-free, fault) states and the environment should be identified.

 

 

 

                        Figure 6-2    Device Properties and Interactions

 

 

6.1.1.1.3.1        Patient/Person Contact

Devices that contact a patient directly or are invasive usually have a higher hazard potential than devices that are isolated from the human body. The kind and duration of all physical contact with humans should be evaluated, such as:

 

·         Surface contact points

·         Invasive contact / Contact through bodily orifices

·         Implantable contact

 

Identification of the type and duration of device contact facilitates analysis of hazards associated with and propagated through these contact points.

 

Software systems may initiate, terminate, or regulate the delivery of substances or energy through these contact points, as in a heart pacer device, and errors in that control can result in a hazardous condition.

 

6.1.1.1.3.2        Substances

Materials

Materials used in the design have a direct impact on potential hazards. This applies not only to human contact materials and the resulting sterility and bio-hazards, but also to materials used in the construction of the device. Materials that could fail or generate hazardous by-products under normal operating conditions (ozone), or under environmental extremes like fire / heat (poisonous fumes) may not be suitable for use in a device.

 

Materials used in moving components such as pistons, bearings, or gears may be subject to wear. This wear may result in a condition such that the component is not operating within design tolerances. Parameters specified in software which may be controlling (actuating) or interpreting sensor information may become invalid because they are no longer appropriate for an out of tolerance condition, thus creating the potential for a hazardous condition.

 

Substance Delivery

Absolute dose and rate (dose / time) are potential hazards if the delivery is unintended or an intended delivery does not occur. It will be difficult to specify an absolute dose or rate at which a hazard will develop because this will depend on multiple factors, such as the potency of the drug, the patient weight, interacting substances, pharmacokinetics, etc. However, it should be a design requirement to meet a self-imposed rate / dose accuracy limit as well as a maximum limit under fault conditions. The effects of non-delivery should be investigated as well.

 

Software systems such as those found in infusion pump devices may initiate, terminate, or regulate the delivery of drugs or other substances to the patient.
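The three delivery concerns named above (an accuracy limit, a maximum limit under fault conditions, and non-delivery) can be expressed as a single check. The Python sketch below is a minimal illustration; the function name, the units (ml/h), and the default limit values are hypothetical and would in practice come from the device requirements.

```python
from typing import List


def check_delivery(commanded_ml_h: float,
                   measured_ml_h: float,
                   accuracy_fraction: float = 0.05,
                   max_rate_ml_h: float = 500.0) -> List[str]:
    """Return the names of any delivery limits that are violated.

    accuracy_fraction and max_rate_ml_h are assumed example values,
    not limits taken from any standard.
    """
    violations = []
    # Maximum limit that must hold even under fault conditions.
    if measured_ml_h > max_rate_ml_h:
        violations.append("max_rate_exceeded")
    # An intended delivery that does not occur is also a potential hazard.
    if commanded_ml_h > 0 and measured_ml_h == 0:
        violations.append("non_delivery")
    # Self-imposed rate accuracy limit, relative to the commanded rate.
    elif abs(measured_ml_h - commanded_ml_h) > accuracy_fraction * commanded_ml_h:
        violations.append("accuracy_limit_exceeded")
    return violations
```

For example, `check_delivery(100.0, 0.0)` reports non-delivery, while a measured rate of 103 ml/h against a commanded 100 ml/h stays within the assumed 5% accuracy limit and reports nothing.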

 

Time/Delayed Effects/Dose

Time and dose are important properties of potential hazards. Usually there are limits on the amount of exposure to a hazard that may be considered negligible. Determining these limits is important for the design of a device, since they decide how the implementation will control exposure to the hazard. The duration of exposure to a hazard source before the hazard manifests itself is also an important variable. Cumulative or time-delayed hazards, such as radiation or iodine accumulation in the thyroid gland, must be evaluated. Figure 6-3 illustrates the relationship among safety critical times.

 

                        Figure 6-3    Safety Critical Times

 

 

 

Improperly designed software, such as an Interrupt Service Routine, can result in incorrect fault reaction times. In this illustration, fault detection and reaction time is greater than the fault tolerance time, permitting a second fault to occur before the first one has been resolved. This may result in an over-exposure scenario because the system is unaware of missed interrupts or because correct instruction sequencing is corrupted.
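The timing relationship described here reduces to a simple inequality: the time to detect a fault plus the time to react to it must not exceed the fault tolerance time. The Python sketch below states that check; the function names and the millisecond units are assumptions made for illustration.

```python
def timing_margin_ms(detection_ms: float,
                     reaction_ms: float,
                     tolerance_ms: float) -> float:
    """Time remaining after fault detection and reaction.

    A negative result means the system responds more slowly than the
    fault tolerance time allows.
    """
    return tolerance_ms - (detection_ms + reaction_ms)


def is_timing_safe(detection_ms: float,
                   reaction_ms: float,
                   tolerance_ms: float) -> bool:
    """True when detection plus reaction fits within the tolerance time."""
    return timing_margin_ms(detection_ms, reaction_ms, tolerance_ms) >= 0


# Hypothetical budgets: 20 ms to detect, 30 ms to react, 100 ms tolerated.
margin = timing_margin_ms(20, 30, 100)
```

A check of this kind is useful at design review time, when worst-case detection and reaction times for each safety-related interrupt or task can be compared against the tolerance time derived from the hazard analysis.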

 

6.1.1.1.3.3        Energy

Energy delivery or extraction is a predominant hazard. Incorrect energy levels, energy delivery or extraction can lead to life-threatening situations (laser, hypothermia equipment, RF surgery devices). In general, the potential of a device to emit or absorb energy should be evaluated not only with respect to the patient (as ISO/IEC 14971 suggests), but also the operator and other medical personnel. Physical and physiological properties of the primary energy transfer on humans should be identified. Additionally, secondary energy generated by the primary energy source through energy transformation should be evaluated, such as the transformation of light energy into thermal energy in laser surgery.

 

Software systems such as those found in linear accelerators, excimer lasers, and others may initiate, terminate, or regulate energy delivery. Hazardous conditions may occur if the software incorrectly implements dosage algorithms, incorrectly responds to sensor information, or incorrectly actuates energy delivery systems.

 

6.1.1.1.3.4        Information

Even though information is not a direct hazard, incorrect information has hazard potential if used to control or direct a potential hazard source. This is immediately obvious in closed loop systems (control of drug delivery by a physiological parameter like blood pressure) or if an incorrect course of action for treating a patient is selected based on incorrect information presented to the physician.

 

The timeliness of information may also be critical. Not only are the correctness and proper timing of information potentially critical, but the absence of information may also lead to a hazardous condition. A specific example of this consideration is a critical alarm like the apnea alarm of a ventilator.

 

Software systems may produce incorrect information due to incorrect instruction sequences (e.g., algorithms) or incorrect data (e.g., data corruption).
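As one illustration of guarding against data corruption, safety-related parameters can be stored together with a checksum that is verified before the data are used. The Python sketch below uses the standard library's `zlib.crc32`; the parameter names, the block layout, and the choice of CRC-32 are assumptions made for this example and are not a technique prescribed by this article or by ISO/IEC 14971.

```python
import struct
import zlib
from typing import Tuple


def pack_params(rate_ml_h: float, volume_ml: float) -> bytes:
    """Serialize two hypothetical parameters and append a CRC-32."""
    payload = struct.pack("<ff", rate_ml_h, volume_ml)
    return payload + struct.pack("<I", zlib.crc32(payload))


def unpack_params(block: bytes) -> Tuple[float, float]:
    """Verify the CRC-32 and return the parameters.

    Raises ValueError if the stored and recomputed checksums differ.
    """
    payload, stored_crc = block[:-4], struct.unpack("<I", block[-4:])[0]
    if zlib.crc32(payload) != stored_crc:
        raise ValueError("parameter block failed CRC check")
    return struct.unpack("<ff", payload)


block = pack_params(100.0, 50.0)
rate, volume = unpack_params(block)
```

A checksum detects corruption of stored or transmitted data; it does not address the other source of incorrect information named above, an incorrect algorithm, which has to be caught by design review, verification, and validation.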

 

 

 

 

 

6.1.1.1.3.5        Bio-matter

If the device processes biological materials, bio-hazards may result from its operation. Extracorporeal processing of bodily fluids, such as blood processing or dialysis, may generate hazards for the patient or the operator. This concern also applies to in vitro diagnosis or processing devices. Hemolysis, viral infections, and bacterial or chemical contamination are some of these bio-hazards.

 

Software systems, such as blood bank systems, may introduce bio-hazards through incorrectly implemented quarantine or testing algorithms.

 

6.1.1.1.3.6        Environmental Conditions

A device can affect or be affected by the environment in which it is used. It may alter the patient environment (pressure, temperature, humidity) but it will also be influenced by the environment (electromagnetic susceptibility, temperature). Both factors should be investigated in terms of their hazard potential.

 

The environment can affect software systems indirectly through the hardware on which the software executes. Device exposure to EMI, for example, can result in an incorrect instruction sequence, which may in turn result in a hazardous condition.

 

5.1.1.1.3.7        User Interface

The user interface is an important part of the device's safety system because it may contribute to intentional or foreseeable unintentional use of the device, resulting in a hazardous condition. Incorrect parameter input, misinterpretation of information presented by the device, or ambiguous user interfaces can be hazardous. Hazardous conditions associated with the user/device interface from a human factors and system perspective should be evaluated.

 

Software driven user interfaces such as those found in radiation treatment planning systems may constitute the majority of the device. Analysis of human factors issues associated with software system user interfaces may help identify hazardous conditions.

 

5.1.1.1.4           Miscellaneous Factors

Miscellaneous factors may have to be addressed during the hazard identification process, such as:

·         Calibration

·         Sterility

·         Storage environment

·         Ergonomic considerations

·         Influence of accessory devices

 

5.1.1.1.5           Cultural Differences

Many hazards arise when patients, caregivers, and products cross international borders. These hazards may result from unfamiliar languages, symbols, codes, or conventions that contradict user expectations.

 

 

6.1.1.2     Causes of known or foreseeable hazards

Software is nothing more than a written language acted upon by a ‘language’ interpreter (hardware) to some purpose. The written language is translated (compiled) into instructions that the hardware instruction processor is to execute. For the software to be said to perform correctly, these instructions must be executed in the correct sequence with the correct data. It follows that if the written language is incorrect, it will be translated into incorrect instruction sequences and/or data, and the software will perform incorrectly. When the hardware instruction processor (CPU) executes incorrect instruction sequences and/or operates on incorrect data, a potentially hazardous condition may occur. Hazards may be manifested when the hardware instruction processor is instructed to incorrectly enable an actuator, get information from a sensor incorrectly, and/or manipulate or display data incorrectly.

 

Because of software’s dependence on hardware to affect the physical universe (at a device or system level) one can conclude that incorrect instruction sequencing and/or data is primarily a causative element in the identification of a hazardous condition.

 

Hazards caused by software may appear in many different forms, for example:

 

·         Delaying, impeding, or incorrect dosage delivery (e.g. under/over dose)

·         Incorrect sensor readings (e.g. monitoring)

·         Incorrectly controlled actuators delivering energy (e.g. burns, electrical shock, radiation)

·         Incorrect information (e.g. diagnostics, therapy)

 

Systematic failure

A potentially hazardous condition may occur as a result of a hardware / software component or system failure. At the system level, failures may be classified as either systematic or random. If there is an error in a hardware or software design, the resulting failure is said to be systematic. Systematic faults are errors in design or implementation that are present in all hardware and software systems. These faults and their manifestation are generally unknown until they occur. Once detected, they can be eliminated entirely. An infamous hardware example of this type of fault was the faulty look up table in the Pentium floating point processor chip. Correctly designed software running on this processor would return an incorrect value because of this error in processor implementation. The Therac 25 [12], a medical linear accelerator tragedy, provides an example of several systematic type software faults in which undetected software design errors resulted in the serious injury or death of several patients.  One software design error involved a modification to the system to facilitate faster re-entry of patient data by using the carriage control key to copy treatment site data. The design created a memory race condition resulting in a state where the operator console information did not match the actual treatment parameters. Another software design error involved the reuse of software from a predecessor system. Under certain entry key combinations the predecessor system would trip fuses and breakers on the device. This did not result in serious injuries because hardware interlocks would shut the machine down before an injury could occur. However, some of the predecessor hardware interlocks were removed from the Therac 25 to be controlled by software. The combination of predecessor software design faults migrated into the Therac 25 and the removal of some hardware interlocks contributed to one of the reported accidents.

 

Random failure

Hardware components may be subject to fatigue and stress related failures which are termed random failures. The failure of a transistor in a microprocessor, for example, can cause software instructions to execute out of sequence. Software misoperation may also occur randomly due to external stimuli even if all of the components are working correctly. For example, ionizing radiation or an external magnetic field may change a bit in a memory cell, causing the software instructions to execute out of sequence or produce incorrect data.

 

As the above examples show, random hardware failure may cause incorrect software instruction sequencing resulting in a system failure. A substantial body of software design techniques exists to mitigate the consequences of these random hardware failure mechanisms [1], [13], [20]. It is important to realize that a hardware failure which affects software operation is fundamentally different from a software design error.
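
One widely used technique of this kind is to store each safety-critical value together with its bitwise complement and to verify the pair on every read. The sketch below is a minimal illustration only and is not drawn from the cited references; the class name `GuardedValue`, the exception name, and the 16-bit word width are assumptions made for the example.

```python
class CorruptionError(Exception):
    """Raised when a stored value no longer matches its complement."""


class GuardedValue:
    """Holds an unsigned 16-bit value plus its bitwise complement (hypothetical helper)."""

    MASK = 0xFFFF  # assumed word width for this illustration

    def __init__(self, value: int) -> None:
        self.write(value)

    def write(self, value: int) -> None:
        if not 0 <= value <= self.MASK:
            raise ValueError("value out of range")
        self._value = value
        self._check = ~value & self.MASK  # redundant inverted copy

    def read(self) -> int:
        # A single flipped bit in either copy breaks the complement relation.
        if (self._value ^ self._check) != self.MASK:
            raise CorruptionError("stored value failed its integrity check")
        return self._value


dose_rate = GuardedValue(250)
assert dose_rate.read() == 250

dose_rate._value ^= 0x0004  # simulate a random bit flip in memory
try:
    dose_rate.read()
except CorruptionError:
    print("corruption detected; enter safe state")
```

A check of this kind detects the corruption but does not correct it; what the device should do on detection (for example, entering a safe state) is a system-level design decision.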

 

Software failure

Software, unlike hardware, is not subject to fatigue or other random failure mechanisms. Software only fails systematically (assuming the CPU and associated hardware are functioning correctly). If software errors are known they can be corrected, and once corrected, they will not cause the same failure again.

 

Software / hardware system coupling considerations

In analyzing the hazard potential of a software system one has to be aware of the interaction boundaries between the various software and hardware components as depicted in Figure 6-4. Software on a single processor (CPU) system (device) may consist of safety and non-safety related modules, if only the functionality of these modules is considered. However, both types of modules share many resources like memory, buses, the CPU, etc. In this environment, a non-safety related module can easily change the data of a safety related module (except if 100% data hiding is implemented and supported by hardware), potentially resulting in a hazardous condition. The interaction boundary can more easily be isolated in multi-processor systems, where safety and non-safety related software modules typically share fewer resources.

 

Modern programming techniques like object oriented programming help substantially to minimize data interference between modules (objects), if a disciplined development process is implemented. However, potential hazards may still be introduced through interfaces with lower level modules (written in assembly language) that access hardware directly.

 

Software compilation considerations

The translation (compilation) process of a software system should also be assessed if a requirement exists to separate safety and non-safety related modules. Assumptions that usage of a high level language has achieved decoupling often prove incorrect. The actual binary code may still contain coupling points, e.g. the stack, library routines, memory allocation, and the operating system.

 

Real-time software system considerations

The dynamic nature of real-time applications makes it very difficult to comprehensively ascertain software-related hazards associated with concurrent events and non-deterministic environments. This is, in part, because the number of possible event combinations may be practically infinite. Software timing issues can result in incorrect instruction sequencing and/or data corruption, resulting in potential hazards. Improper device alarming is a common result of such timing issues.

 

COTS considerations

Special attention should be given to ‘Commercial Off The Shelf’ (COTS) software components. Some or all of the interface coupling issues discussed earlier may apply to a COTS software component. Instruction sequences for these components are generally unknown, making it not only difficult to assess the component’s hazard potential(s), but to test it as well. It is reasonable, therefore, to assume the COTS software component will fail and then assess the severity of the consequences.

 

 

 

 

                        Figure 6-4    Software Boundaries and Coupling Points

 

 

Key software issues to consider

The following are some key questions that should be answered during the software hazard analysis.

 

Will a hazard occur if:

 

·         software is executed using incorrect data

·         software is executed in an incorrect order (program flow)

·         software execution is halted (stalled processor)

·         software execution is delayed (due to CPU resource issues)

·         software is responding incorrectly or not responding to external or internal events

 

If any of these questions can be answered YES, a control measure which does not rely on software may be necessary.

 

Will a hazard occur under:

 

·         fault-free conditions

·         single fault conditions

·         multiple fault conditions

 

The answers to these questions will be useful when considering risk control measures.

 

6.1.1.3     Hazard analysis models and techniques

Preceding sections should make it evident that the process of identifying hazards and their causes must take into account many aspects of the device and its environment. As is often the case, the more complex the device and its environment, the more likely it is that potential sources of harm will be overlooked. Despite all of the advances made in system, software, and safety engineering, there is no single methodology available which guarantees the identification of all potential sources of harm associated with complex hardware/software systems. Common practice is to employ several of the following complementary methodologies in a disciplined manner to detect hazards and assess risk.

 

It is very important to try to identify all possible hazards and their causes.  Potential hazards should not be omitted because their occurrence probability at the time of the analysis seems too remote. The results of a hazard analysis should give a complete unbiased picture of all potential hazards and their causes.

 

Safety-critical industry sectors, such as nuclear, aerospace, or chemical, have used various approaches to assess hazards and minimize risk potential. The earliest approaches all too often identified hazardous conditions after the basic design decisions were made, successively mitigating or changing the design to address the causes of the hazards as they became known. Techniques such as Fault Tree Analysis (FTA) and Failure Modes and Effects (Criticality) Analysis (FME(C)A) developed in these industry sectors are now used extensively to aid in the identification of hazardous failure modes early in the design process. It should be cautioned, however, that none of these techniques, by itself, fulfills all of the functions of a comprehensive risk analysis methodology. Rather, such techniques may be considered as components of a comprehensive risk analysis process.

 

Preliminary Hazard Analysis (PHA) and Fault Tree Analysis (FTA)

These analytical techniques are ‘top-down’ methodologies in which the analyst postulates undesired outcomes or system-level faults, and determines which system components (software/hardware) may contribute to each. They are called top-down methodologies because the starting point of each branch is a postulated system level event or outcome, which is then decomposed into more detailed branches and sub-branches of causes contributing to that system level event. 

 

Preliminary hazard analysis serves principally to develop an inventory of the possible hazards associated with the system.  There are several design domains which should be considered to facilitate this; for example:

 

·         Energy (outcomes)––electrical shock, burn, mechanical impact, radiation

·         Sub-system (failure mode)––hardware, software, packaging, interfaces, environment

·         Life-cycle stages––development, manufacturing, testing, handling, use, maintenance, disposal

 

Fault tree analysis [16] may be used as a tool to extend the preliminary hazard analysis in a methodical manner, based on the principle that most failures are caused by a combination of circumstances.  A fault tree analysis takes a single undesired outcome or system-level fault and synthesizes the combination and/or sequence of factors and events that might lead to that fault (such as a software failure).  The result is a cause-and-effect diagram which uses standardized logic symbols to depict the relationship between causative factors and the resulting outcome or fault.
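
The gate logic of a fault tree can be sketched in a few lines of code. The example below is an illustrative assumption and is not taken from [16]: the tree, the event names, and the helper functions are all hypothetical. It evaluates a top event from basic events through AND/OR gates and enumerates the smallest combinations of basic events (often called minimal cut sets) that produce the top event.

```python
from itertools import combinations

# A node is either a basic-event name (str) or a (gate, children) tuple.
# Hypothetical tree: an overdose occurs if the dose calculation is wrong AND
# the independent interlock fails to stop delivery.
TREE = ("AND", [
    ("OR", ["algorithm_error", "corrupted_dose_data"]),
    ("OR", ["interlock_failed", "interlock_bypassed"]),
])


def basic_events(node):
    """Collect the names of all basic events under a node."""
    if isinstance(node, str):
        return {node}
    _, children = node
    return set().union(*(basic_events(c) for c in children))


def occurs(node, active):
    """Return True if the event at `node` occurs given the active basic events."""
    if isinstance(node, str):
        return node in active
    gate, children = node
    results = [occurs(c, active) for c in children]
    # Any gate other than "AND" is treated as "OR" in this sketch.
    return all(results) if gate == "AND" else any(results)


def minimal_cut_sets(tree):
    """Brute-force search, smallest sets first; adequate only for small trees."""
    events = sorted(basic_events(tree))
    found = []
    for size in range(1, len(events) + 1):
        for combo in combinations(events, size):
            candidate = set(combo)
            if any(cs <= candidate for cs in found):
                continue  # a smaller cut set already covers this one
            if occurs(tree, candidate):
                found.append(candidate)
    return found


for cut_set in minimal_cut_sets(TREE):
    print(sorted(cut_set))
```

For this hypothetical tree, no single basic event is sufficient to cause the top event; each of the four cut sets pairs one calculation fault with one interlock fault.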

 

One should keep in mind that these techniques have some limitations, such as the inability to account for common-cause[3] failures, time and rate dependent events, failure chains where the order of failure is important, and the interactions between elements of a complex system.

 

These techniques depend heavily on the intuition, experience, and insights of the analyst(s).  They ‘work’ by guiding the analyst through a structured thought process.  A major advantage of these techniques is that they can be performed at the system-level or  preliminary design stage of the development cycle, before details of the implementation have been worked out.  Insights gained may then lead to corrective or mitigative actions before the design is cast in stone.

 

Ideally, as development proceeds, the risk management plan should require that these ‘front-end’ analyses be revisited at appropriate development stages.  Knowledge and insights gained during successive design stages may lead to risk control refinements, which in turn may yield further design improvements.

 

Failure Modes and Effects Analysis (FMEA)

Failure modes and effects analysis [17] starts with causes, and works toward the effects. The failure modes and effects analysis begins with the “lowest” level component, typically a function in software, and objectively evaluates how the component might fail. This is sometimes characterized as a ‘bottom-up’ methodology. Software failure modes are typically characterized in terms of erroneous output conditions, which are enumerated; e.g., Boolean output FALSE when it should have been TRUE, temperature calculation erroneously HIGH. For each identified failure mode of each component, the consequences to the system and intended use are assessed. The FMEA methodology may be expanded to include consideration of the criticality of failures (FMECA).
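
As an illustration of the bottom-up bookkeeping this involves, the sketch below records failure modes for two software functions and lists them with the most severe system-level effects first. The function names, failure modes, effects, and the 1-to-4 severity scale are hypothetical, chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    component: str      # lowest-level item analysed, e.g. a software function
    failure_mode: str   # erroneous output condition
    system_effect: str  # consequence for the system and intended use
    severity: int       # assumed scale: 1 = negligible ... 4 = catastrophic


# Hypothetical worksheet entries for illustration only.
worksheet = [
    FailureMode("heater_enabled()", "returns FALSE when it should be TRUE",
                "heater stays off; therapy delayed", 2),
    FailureMode("heater_enabled()", "returns TRUE when it should be FALSE",
                "heater stays on; patient burn possible", 4),
    FailureMode("read_temperature()", "value erroneously HIGH",
                "heater switched off early; therapy ineffective", 2),
    FailureMode("read_temperature()", "value erroneously LOW",
                "overheating not detected", 4),
]

# Review the most severe effects first (an FMECA-style criticality ordering).
for row in sorted(worksheet, key=lambda r: r.severity, reverse=True):
    print(f"{row.severity}  {row.component:20} {row.failure_mode}")
```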

 

Prerequisite to a software FMEA is a modularized design and sufficient knowledge of all failure modes of a component.  While the FMEA can be applied at the preliminary design stages, it is most useful  when the design has been fully implemented.  When compared to top-down approaches, FMEA requires relatively less intuitive skill and more attention to rigorous technical detail.  For a large system, the results from a FMEA are voluminous.

 

One advantage of FMEA is the fact that every component of the system is systematically examined.  Thus the technique can potentially uncover a multitude of subtle failure modes that might be overlooked in a top-down approach.  At first glance, then, it might appear that FMEA is a ‘better’ technique.  In reality, however, top-down and bottom-up techniques are complementary.

 

For example, there is a whole category of hazards arising from misuse of the system, even if the system and all its components are functioning perfectly.  FMEA can overlook such hazards.  FMEA also has limitations in dealing with multiple-fault scenarios; as previously noted, fault tree analysis is more helpful for that purpose.

 

The descriptions of FTA and FMEA given here reflect traditional usage of these methodologies. The need to address their identified limitations has been recognized by industry and academia, and the methods are constantly being refined. For example, FMEA methodologies have evolved to encapsulate FTA methodologies through a "blocking" process. Dynamic analysis of fault trees is becoming possible by decomposing complex trees into static and/or dynamic subtrees; the dynamic subtrees are then analyzed using Markov methods. Much information on these topics is available on the Internet.

 

Hazard and Operability Analysis (Hazop)

Hazop [18] is a well established qualitative methodology. It was originally developed in the chemical process control industry for identifying process deviations leading to hazards or operational deficiencies.  The technique involves a detailed review of the process design and operation, focusing on possible deviations in process parameters. The methodology uses guide words, such as none, more of, and less of, as aides-memoire for exploring possible causes and consequences of process failure. These guide words may be applied to any process variable, e.g., increased or decreased flow, reverse flow, or no flow.  For each postulated deviation, the consequences are assessed, and if the consequences are deemed hazardous, risk analysis and reduction techniques come into play.  The Hazop process of encouraging thought on how a system will operate and the consequences of deviations from the designed operation is equally applicable to medical devices.
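
The way guide words are applied to process variables can be sketched as a simple cross product. In the minimal example below the guide words come from the text, while the list of process parameters and the function name `deviation_prompts` are hypothetical; the output is only a list of prompts for the review team, not an analysis in itself.

```python
from itertools import product

# Guide words named in the text; the process parameters are hypothetical.
GUIDE_WORDS = ["none", "more of", "less of"]
PARAMETERS = ["infusion flow", "line pressure", "fluid temperature"]


def deviation_prompts(guide_words, parameters):
    """Pair every guide word with every parameter to form review prompts."""
    return [f"{word.upper()} {parameter}: causes? consequences?"
            for parameter, word in product(parameters, guide_words)]


for prompt in deviation_prompts(GUIDE_WORDS, PARAMETERS):
    print(prompt)
```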

 

The methodology can be used early in the system and software design phase to minimize costly "downstream" design changes. A strength of this methodology is that analysis is performed with a team of diverse domain experts who step through the design in a systematic manner. However, this may also be viewed as a drawback of the methodology because of the labor intensiveness and team judgement involved.

 

Another strength of the technique is that it is equally useful for discovering hazards, whatever the cause.  For example, if low flow in a particular process constitutes a hazardous condition, hazop analysis will reveal the existence of the hazard regardless of whether the process deviation is caused by a design defect, component failure, operator error, or environmental condition.

 

Attempts have been made to make the Hazop more applicable to "software" safety systems through a computer Hazop or Chazop process [11]. Sample guide words under this process include: no, more, less, and wrong. For example, the guide word no might equate to no signal or no action. A consensus view regarding the Chazop process is that it must be performed in the context of the total system.

 

The goal of the Hazop or Chazop process differs somewhat from that of the other analysis techniques discussed here: its main purpose is to identify hazards, which are later analyzed for causes, failure modes, and control measures by other techniques discussed in this section, such as FTA and FMEA.

 

Human Factors Analysis

The science of human factors focuses on those variables that affect the performance of individuals using equipment [8], [14], [19]. A user's behavior is directly influenced by operational characteristics of the equipment; user interfaces that are misleading or illogical can induce errors by even the most skilled users. Such things as the operating environment (light, noise, humidity), physical and sensory characteristics (reach, hearing), and expectancies (color red = danger) may all contribute to an event leading to potential injury, i.e. a hazard.

 

Human factors analysis encompasses a variety of techniques for applying this body of knowledge to a design.  Some of these techniques include:

·         Heuristic analysis (essentially, reviewing lists of human factors principles to see whether they are applicable to a particular design);

·         Use testing (in the actual or simulated operating environment with representative users);

·         Function and task analysis (decomposing equipment operation into individual functions performed and tasks required and examining each element for potential hazards);

·         Walkthroughs

 

Human factors analysis should be an integral part of any hazard analysis performed.

 

 

State Machine Hazard Analysis

State machine hazard analysis relies on the establishment of a model of system states and the transitions between them. These models are often used in software system design.

 

State machine models may facilitate the exposure of hazardous states as the model is traversed from initial to terminal states [12]. Software and other component behaviors can be modeled at a high level of abstraction such that analysis of faults and failures can begin early in the system and software development process. One should remain aware that this process is based on a model abstraction versus the actual design.
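
A minimal sketch of such a traversal is shown below. The state model (a simplified infusion pump), its events, and the choice of which state is hazardous are all hypothetical assumptions made for illustration. A breadth-first search reports the shortest sequence of events that leads from the initial state to each hazardous state.

```python
from collections import deque

# Hypothetical state model of an infusion pump; names are illustrative only.
TRANSITIONS = {
    "idle":                 {"program": "programmed"},
    "programmed":           {"start": "infusing", "cancel": "idle"},
    "infusing":             {"pause": "paused", "occlusion": "alarm",
                             "sensor_fault": "infusing_unmonitored"},
    "paused":               {"resume": "infusing", "cancel": "idle"},
    "alarm":                {"acknowledge": "paused"},
    "infusing_unmonitored": {},
}
HAZARDOUS = {"infusing_unmonitored"}


def paths_to_hazards(transitions, start, hazardous):
    """Breadth-first search: shortest event sequence to each reachable hazardous state."""
    paths = {}
    visited = {start}
    queue = deque([(start, [])])
    while queue:
        state, events = queue.popleft()
        if state in hazardous:
            paths[state] = events
            continue
        for event, target in transitions.get(state, {}).items():
            if target not in visited:
                visited.add(target)
                queue.append((target, events + [event]))
    return paths


print(paths_to_hazards(TRANSITIONS, "idle", HAZARDOUS))
```

The result is only as good as the model: a hazardous state that exists in the real design but is missing from the abstraction will not be found.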

 

Hazard analysis methodologies discussed here are not exhaustive. However, judicious usage of these techniques in a complementary manner should provide comprehensive system coverage and a good return on time invested.

 

5.1.2       Risk Estimation

Referring to Figure 5-1, the next step after the collection of potential hazards and their causes is to estimate the risk associated with each identified hazard. Risk in this context is the consequence and likelihood of the hazard becoming reality. These two parameters require a judgment to be made on the likelihood of occurrence of the uncontrolled hazard and the severity of the unmitigated occurrence. Establishing the value of these parameters can be a challenge. Factors such as frequency of occurrence and duration of exposure, and the possibility of avoiding the hazardous event, are integral to the estimation of likelihood. The risk graph in Figure 6-5 shows some possible interrelationships of these risk parameters.

 

 

            Figure 6-5    Risk Estimation parameters (risk graph)

 

 

For very complex systems, other likelihood factors associated with the degree of system component coupling may be relevant; e.g. the coincidence of conditions resulting in a hazard.

 

The primary purpose of risk estimation is to facilitate the allocation of appropriate resources to the design of components whose failure can result in intolerable consequences. By establishing meaningful risk estimation parameters, such as those based on relevant historical data, analytical or simulation techniques, and judgment, evaluation of risk reduction methods, such as those discussed in section 6.2, can be more objective in nature.

 

When estimating the risk of a hazard occurring, it should be determined if the hazard can occur under fault-free, single fault, or multiple fault conditions. Severe hazards should not be acceptable under any of these conditions.

 

When estimating risk, it is important that the occurrence probability portion of the analysis be applied appropriately. Components of a device may fail randomly or systematically. For those components subject to systematic failure, such as software or digital circuits, the estimation of occurrence probability is problematic. Section 6.1.2.4 discusses this problem for software systems.

 

5.1.2.1     Qualitative vs. Quantitative Risk Estimation

Two different methodologies can be used to estimate the risk of each hazard occurring: qualitative and quantitative. Both have pros and cons.  Whichever method is chosen, one should be aware of its strengths and weaknesses. Even though it is difficult or impossible to estimate software risk, there is some value in this effort from the perspective that it causes one to think about the effectiveness of the risk control measures implemented.

 

The qualitative method assigns certain discrete (judgmental), but well defined, levels to the severity and likelihood of occurrence of a hazard. The granularity of the levels is discretionary, but should facilitate risk reduction evaluation. The combination of these levels represents risk.  A weakness of this method is that the assignment of levels may be arbitrary without underlying detailed knowledge.

 

The quantitative method uses occurrence probabilities combined with a severity rating. These probabilities can be derived using statistical methods. Severe hazards should be assessed independently as in the qualitative method. The weakness of this method is that sound statistical data may not be available and that using questionable data may falsely represent a high level of accuracy.

 

5.1.2.2     Calculating the Risk Rating

 

A variety of methods may be used to combine the likelihood estimate with the severity rating.  Some practitioners simply list both parameters.  In the following section, a detailed example of this approach is provided.

 

Many methods yield numerical indices for likelihood and severity, and these may be multiplied together to obtain a numeric estimate of risk.  For example, if four likelihood categories are defined (L = [1 .. 4] ) and four severity ratings are defined (S = [1 .. 4]), then the product of L and S yields a risk estimate in the range R = [1 .. 16].  In this risk estimation system, a low severity combined with a high likelihood of occurrence will generate the same risk rating as a high severity and a low likelihood.

 

A slight enhancement to this scheme assigns greater weights to increasing likelihood and severity. For example, the four likelihood categories might be assigned weights of 1, 3, 6, and 10 respectively. Similarly, the severity ratings might be assigned weights of 1, 3, 6, and 10.  In this scheme, the combined risk rating may vary from 1 (least risk) to 100 (greatest risk), and the risk rating may take on the following discrete values:

 

            R   =   L × S   =    [1, 3, 6, 10] × [1, 3, 6, 10]    =     [1, 3, 6, 9, 10, 18, 30, 36, 60, 100]
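
The discrete values produced by the two product schemes above can be checked with a short calculation. This is a sketch for verification only; the function name `risk_ratings` is a hypothetical helper, not part of any standard.

```python
def risk_ratings(likelihood_weights, severity_weights):
    """Return the sorted set of distinct products L x S."""
    return sorted({likelihood * severity
                   for likelihood in likelihood_weights
                   for severity in severity_weights})


linear = risk_ratings(range(1, 5), range(1, 5))
weighted = risk_ratings([1, 3, 6, 10], [1, 3, 6, 10])

print(linear)    # [1, 2, 3, 4, 6, 8, 9, 12, 16]
print(weighted)  # [1, 3, 6, 9, 10, 18, 30, 36, 60, 100]
```

Because multiplication is symmetric, both schemes assign the same rating to a high-severity, low-likelihood hazard and to a low-severity, high-likelihood hazard.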

 

Yet another scheme defines risk R as equal to S raised to the power L.  For example, if L = [1 .. 6], (i.e., incredible to frequent), and S = [2 .. 5], (i.e., negligible to catastrophic), then

 

            R   =   S^L   =   [2^1 .. 5^6]   =   [2 .. 15,625]
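
The range of this exponential scheme can be verified directly. The short sketch below uses the level assignments given in the text; the variable names are illustrative assumptions.

```python
likelihood_levels = range(1, 7)  # L: 1 = incredible ... 6 = frequent
severity_levels = range(2, 6)    # S: 2 = negligible ... 5 = catastrophic

# Rating for every (severity, likelihood) pair: S raised to the power L.
ratings = {(s, l): s ** l for s in severity_levels for l in likelihood_levels}

print(min(ratings.values()), max(ratings.values()))  # 2 15625
# Unlike the product schemes, the rating is not symmetric in L and S:
print(ratings[(5, 2)], ratings[(2, 5)])  # 25 32
```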

 

Proponents of these more complex risk estimation models argue that the resulting risk indices more accurately reflect society’s tolerance (or intolerance) of risk.  However, there is a danger that the distinction between risk estimation and risk evaluation may be lost in the process.  Section 6.2 focuses on the topic of risk evaluation; at that point, the importance of this distinction will become apparent.

 

Example of a Risk Estimation Scheme

 

The standards IEC 61508 and IEC 60601-1-4 use four levels of severity in their examples:

 

Catastrophic            – potential of multiple deaths or serious injuries

Critical                    – potential of death or serious injury

Marginal                  – potential of injury

Negligible                – little or no potential injury

 

Six qualitative levels of likelihood are presented as well, which include: incredible, improbable, remote, occasional, probable, and frequent.

 

The most straightforward risk estimation scheme simply combines these two dimensions of risk, as shown in Table 6-1.

 

 

 

 

                                      Severity

Likelihood          I               II           III          IV
                    Catastrophic    Critical     Marginal     Negligible

A – frequent        I-A             II-A         III-A        IV-A
B – probable        I-B             II-B         III-B        IV-B
C – occasional      I-C             II-C         III-C        IV-C
D – remote          I-D             II-D         III-D        IV-D
E – improbable      I-E             II-E         III-E        IV-E
F – incredible      I-F             II-F         III-F        IV-F

 

            Table 6-1    Risk Estimation Scheme Based on IEC 61508 and IEC 60601-1-4
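
The scheme of Table 6-1 is simple enough to express as a lookup. The sketch below is one possible encoding, assuming the level codes and names used in the table; the function name `risk_class` is hypothetical.

```python
SEVERITY = {"I": "Catastrophic", "II": "Critical",
            "III": "Marginal", "IV": "Negligible"}
LIKELIHOOD = {"A": "frequent", "B": "probable", "C": "occasional",
              "D": "remote", "E": "improbable", "F": "incredible"}


def risk_class(severity: str, likelihood: str) -> str:
    """Return the Table 6-1 cell label, e.g. 'II-C' for Critical / occasional."""
    if severity not in SEVERITY or likelihood not in LIKELIHOOD:
        raise KeyError(f"unknown level: {severity!r}, {likelihood!r}")
    return f"{severity}-{likelihood}"


# Print the table row by row (likelihood) and column by column (severity).
for code, name in LIKELIHOOD.items():
    cells = "  ".join(f"{risk_class(sev, code):6}" for sev in SEVERITY)
    print(f"{code} - {name:11} {cells}")
```

Note that the label only records the two parameters side by side; deciding which cells are tolerable is a separate risk evaluation step.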

 

 

The risk estimation method shown in the table has seen widespread use, based, no doubt, on its simplicity.  Of course, its use as an example in the referenced standards confers a mantle of legitimacy.  However, in applying this method to an actual medical device, the risk management team is responsible for defining these likelihood and severity terms in the context of device usage.   There is no provision precluding a risk management team from redefining these terms or changing the number of levels represented by these terms as deemed appropriate.  Any other risk estimation model described in the literature may be equally acceptable, and in fact may be better in a given situation.

 

5.1.2.3     Special Considerations When Estimating Risk in Software Systems

 

As stated earlier, software only fails systematically; i.e., as a consequence of an undetected design error or oversight. If software errors are known they can be corrected, and once corrected, they will not cause the same failure again. The characteristics of a systematic software fault make it very difficult to assign an occurrence probability to these faults. Software, in particular, is not subject to the physical attributes of wear, fatigue, or other types of stochastic processes that are the foundation of failure prediction. At best, judgmental risk estimates for the purpose of focusing resources may be made in a qualitative sense, preferably based on some prior history. A common practice is to assume that the software failure occurrence probability is high. When this assumption is made, risk estimation is solely a function of the consequences of software failure.

 

There are practitioners who attempt to measure the degree to which a system is fault free. For example, methods exist to derive a metric out of various factors like complexity, testing time, etc. This may be a useful exercise for some classes of business software, where release of the product before it is error free may be a valid business decision.  However, no methodology has been developed to measure the degree to which a system is fault-free. Risk estimation of software systems based on quantitative parameters should therefore be considered with caution in medical device applications.  Whichever method is used, the reason for the risk parameter selection should be documented.

 

One must also be careful to distinguish between the probability of a single occurrence of a hazardous event and the frequency of occurrence of the event in a population of devices and/or users. In many medical device software applications, once a software defect becomes apparent, corrective action may be taken to limit the exposure of the remaining population. In this case, the likelihood of occurrence of a single hazardous event is of primary interest to the risk analyst. However, particularly in the case of implantable devices and those in volume distribution, one must be concerned with the frequency of occurrence and the total population.

 

Software failures in medical devices rarely have catastrophic consequences.  By comparison, a software failure in a flight control system might injure hundreds or thousands.  Even the worst medical device failures will generally be detected and remedied before injuries approach catastrophic levels.  Thus, severity ratings for medical devices often focus on the degree of injury caused by the failure. For example, levels of severity could be characterized as:

 

Major injury             -- death or permanent loss of function

Moderate injury        -- hospitalization or prolonged loss of function

Minor injury             -- requires treatment before release

Negligible injury       -- does not require medical intervention

 

It must be recognized that this represents an expansion of the general medical device risk management model, as shown by the shaded area in Figure 6-6.  However, society tends to be fairly conservative in its tolerance of medical device risk.  This conservative bias may be accommodated by following the dashed lines straight across the chart, e.g., treating “major injury” as equivalent to “catastrophic consequences.”

 

 

 

            Figure 6-6    Severity Ratings for Medical Devices

5.2        Risk Evaluation

Following risk estimation, each risk must be systematically evaluated to assess its tolerability and consider the need for risk reduction measures.  Risk evaluation is inherently a complex and judgmental process, having technical, legal, and ethical aspects.  The objective is to balance the conflicting dimensions of risk, benefit, and cost in a manner deemed satisfactory to the product developer, customers, users, and other interested parties.  In the case of medical device software, societal values regarding patient safety inevitably play a significant role.

 

There is no single structured evaluation methodology which