Single Event Effect Criticality Analysis

Sponsored by NASA Headquarters/ Code QW
February 15, 1996
Kenneth A. LaBel, Michele M. Gates, Janet L. Barth, NASA Goddard Space Flight Center
Allan Johnston, Jet Propulsion Laboratory
Paul Marshall, Consultant

Table of Contents:
Introduction
1. The SEE Problem
2. Functional Analysis and Criticality
3. Ionizing Radiation Environment Concerns
4. Effects in Electronic Devices and SEE Rates
5. SEU Propagation Analysis: System Level Effects
6. SEE Mitigation: Methods of Reducing SEE Impacts
7. Managing SEEs: System Level Planning
8. SEE Criticality Assessment Case Studies

Introduction

Our goal in generating this document is to aid the individuals in project management, systems engineering, radiation effects, and reliability engineering who carry the responsibilities for successful deployment of NASA systems in orbital particle environments. Traditionally, in a manner which may differ from NASA center to NASA center, this effort has involved many iterative passes through system and subsystem designs with involvement of engineers representing the above disciplines. These efforts began in the 1970s when one or two low level integration device types were identified to be susceptible to single event upset (SEU). Since then, with advances in technology, the arena has expanded to include many types of single event effects (SEEs) in many technologies. The necessary advent of SEE hardened device technologies has alleviated some of the worries, but simultaneously added another dimension to the already complex trade space involved in SEE system design and analysis. Indeed, it is the combination of the universal nature of the concern across NASA centers, coupled with the complexities of the issues, which has prompted this study. Our aim is not to prescribe approaches to SEE immune system design, but rather to examine the analysis process and suggest streamlined approaches to the related design problems. In short, we seek to codify the successful elements which, in many cases, already exist for assessing SEE risk and suggest a timeline and procedure for implementing SEE risk analysis with respect to the system design effort.

A combination of factors has converged to raise the importance of the traditionally informal single event effects criticality analysis (SEECA). Among these are:

1) the increased functionality of satellite systems which impacts the number and complexity of various types of microcircuits,
2) the increased device SEE sensitivity commensurate with the smaller feature sizes and advanced technologies (e.g. GaAs signal processors) required to field these systems,
3) the difficulty in acquiring space-qualified and SEE tolerant parts and the cost forces driving the use of commercial-off-the-shelf (COTS) parts, and
4) the overall complexity of a typical orbital platform which relies on the successful execution of an ever-growing number of instructions.

In short, it is often neither possible nor cost effective to construct systems using SEE immune hardware, and the systems engineer must necessarily make decisions within a trade space including availability, performance, schedule, and cost risk associated with single event effects.

Throughout these discussions we recognize that SEECA covers a highly specialized set of concerns which in many ways parallels conventional reliability analysis. While reliability analysis is by no means simple, the concepts and tools employed by the systems engineering teams and project managers are familiar, and methods exist for both the estimation and quantification of risk. Unfortunately, there seems to be no plausible approach to direct application of these tools to single event analyses. This situation is further complicated by the nature of the complex interplay between the environments, mechanisms, effects, and mitigation approaches. This has led to ad hoc treatments of single event analyses. On one side, systems engineers have a sometimes incomplete understanding of the exact nature of the risk. On the other side, experts are familiar with the details of single event effects, particle environments, and radiation hardness issues at the component level but have an incomplete picture of the risk-cost-performance trade space comprising mission reality.

The ad hoc approach has evolved as an informal system which works to meet the perceived mission needs, but it can be argued that it is not optimized without the full appreciation by the SEE expert regarding mission requirements and the commensurate understanding of the systems engineers and project managers concerning the SEE risk. The possibility exists to launch with unforeseen and unacceptable risk, or conversely to be overly conservative and lose the battle in terms of the component costs, power requirements, or system complexity through poorly planned actions aimed at controlling these risks. Finally, as with any source of risk, there is potential to overanalyze the problem and thereby expend limited resources through study while overlooking other important risks, SEE related or otherwise. As mentioned in the NASA Systems Engineering Handbook, this comprises the equivalent of the Heisenberg Uncertainty Principle in risk management.

It is one key aim of this document to pull together the primary elements of single event effects in microelectronics along with the applicable concepts established and proven through years of risk analysis and planning. In the following sections, an overview will be provided for the key elements in the single event risk management "equation". Functional analysis and criticality, which provide the foundation for defining a system and an SEE problem in criticality studies, will be discussed first. A brief discussion on the radiation environment will then be presented. The orbit and time-dependent environment governing the particle types and energies responsible for single event effects will be covered. An overview of the single event interaction mechanisms and the complex matrix of technologies and effects is also provided. Systems-level impacts are determined by analyzing the propagation of SEEs and assessing criticality for which we will also draw on materials to establish approaches from traditional Failure Modes and Effects Analysis (FMEA). Another section will present SEE mitigation techniques, including software mitigation, error tolerance approaches, component-level hardening, and a discussion of the power-speed-cost trades involved. An additional section presents the application of SEECA useful in the generation and flowdown of SEE requirements. A final section will illustrate, by example, application of SEECA in the assessment of SEE-induced failure modes.

Single Event Effect Criticality Analysis offers a methodology to identify the severity of an SEE in mission, system, and subsystem reliability and also provides guidelines for the assessment of SEE-induced failure modes. SEECA may be used in determining the severity of faults caused by SEEs, accounting for criticality of functions performed, and identifying necessity to provide for SEE tolerance. SEECA is intended as a tool for radiation tolerant design, requirements generation for SEEs, design verification, and requirements validation. Ultimately, SEECA will hopefully aid in launching fully functional satellites with acceptable and understood SEE risks and with minimum cost, complexity, and power consumption in the final product.

Section 1
The SEE Problem

Radiation damage to on-board electronics may be separated into two categories: total ionizing dose and single event effects. Total ionizing dose (TID) is a cumulative long-term degradation of the device when exposed to ionizing radiation. Single event effects (SEEs) are individual events which occur when a single incident ionizing particle deposits enough energy to cause an effect in a device.

There are many device conditions and failure modes due to SEE, depending on the incident particle and the specific device. It may be convenient to think of two types of SEEs: soft errors and hard errors. Soft errors are nondestructive to the device and may appear as a bit flip in a memory cell or latch, or as transients occurring on the output of an I/O, logic, or other support circuit. Also included are conditions that cause a device to interrupt normal operations and either perform incorrectly or halt. Hard errors may be (but are not necessarily) physically destructive to the device, but are permanent functional effects. Different device effects, hard or soft, may or may not be acceptable for a given design application.

Unlike TID degradation, SEE rates are not evaluated in terms of a time or dose until failure, where the stopwatch begins at launch, but as a probability that an SEE will occur within a known span of time. Devices are tested in ground test facilities to characterize the device in a radiation environment. Calculations are also performed to predict the radiation environment for a particular mission orbit. Environment predictions are used with the experimental device data to calculate the probability of occurrence of SEEs in the device for the mission.
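
As a minimal sketch of the probability statement above, the following assumes that SEE arrivals can be treated as a Poisson process with a constant rate. That assumption, the function name, and the numerical values are illustrative only and are not taken from this document.

```python
import math


def prob_at_least_one_see(rate_per_day: float, days: float) -> float:
    """Probability of one or more SEEs in `days`, assuming Poisson arrivals.

    P = 1 - exp(-rate * time); expm1 keeps precision for very small rates.
    """
    if rate_per_day < 0 or days < 0:
        raise ValueError("rate and duration must be non-negative")
    return -math.expm1(-rate_per_day * days)


# Hypothetical device rate of 1e-3 events/day over a 365-day mission.
p_mission = prob_at_least_one_see(1e-3, 365.0)
print(f"P(at least one SEE) = {p_mission:.3f}")
```

Unlike a time-to-failure figure, this probability depends only on the length of the interval considered, not on when the interval starts.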

Device failure is, of course, of great concern. The effects of propagation of SEEs through a circuit, subsystem, and system are also often of particular importance. The level of impact on the affected circuit, box, subsystem, etc. depends on the type and location of the SEE, as well as on the design. For example, a device error or failure may have effects propagating to critical mission elements, such as a command error affecting thruster firing. There are also cases where SEEs may have little or no observable effect on a system level. In fact, in most designs, there are specific areas which have less system impact from certain radiation effects. The data storage memory in a solid state recorder, for example, may have error detection and correction coding (EDAC) which makes bit errors in the devices transparent to the system. Evaluating the severity of the single event effect hazard involves knowledge from several technical fields including radiation physics, parts engineering, solid state physics, electrical engineering, reliability analysis, and systems engineering.
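
To illustrate how EDAC can make a single bit error transparent, the sketch below uses a Hamming(7,4) code. This particular code is chosen here only as a small, well-known example; the document does not state which code any given recorder uses, and the function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data bits, corrected position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome gives the 1-based error position
    if position:
        c[position - 1] ^= 1  # flip the upset bit back
    return [c[2], c[4], c[5], c[6]], position


stored = hamming74_encode([1, 0, 1, 1])
stored[4] ^= 1  # simulate a single event upset in one stored bit
recovered, fixed_at = hamming74_decode(stored)
print(recovered, fixed_at)
```

A code of this kind corrects only one upset bit per codeword; multiple upsets in the same word would need a stronger code or additional mitigation.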

Both the functional impact of an SEE to the system or spacecraft and the probability of its occurrence provide the foundation for setting a design requirement. System-level SEE requirements may be fulfilled through a variety of mitigation techniques, including hardware, software, and device tolerance requirements. The most cost efficient approach may be an appropriate combination of SEE-hard devices and other mitigation. However, the availability, power, volume, performance, and cost of radiation-hardened devices often prohibit their use. Hardware or software design may also serve as effective mitigation, but design complexity may present a problem. A combination of the two may be the selected option.

Terms and Definitions:

Single Event Upset (SEU) is a change of state or transient induced by an ionizing particle such as a cosmic ray or proton in a device. This may occur in digital, analog, and optical components or may have effects in surrounding circuitry. These are "soft" bit errors in that a reset or rewriting of the device causes normal behavior thereafter.

Single Hard Error (SHE) is an SEU which causes a permanent change to the operation of a device. An example is a permanent stuck bit in a memory device.

Single Event Functional Interrupt (SEFI) is a condition where the device stops normal functions, and usually requires a power reset to resume normal operations. It is a special case of SEU changing an internal control signal.

Single Event Latchup (SEL) is a potentially destructive condition involving parasitic circuit elements. In traditional SEL, the device current may exceed device maximum specification and destroy the device if not current limited. A "microlatch" is a subset of SEL where the device current remains below the maximum specified for the device. A removal of power to the device is required in all non-catastrophic SEL conditions in order to recover device operations.

Single Event Burnout (SEB) is a highly localized burnout of the drain-source in power MOSFETs. SEB is a destructive condition.

Single Event Gate Rupture (SEGR) is the burnout of a gate insulator in a power MOSFET. SEGR is a destructive condition.

Linear Energy Transfer (LET) is a measure of the energy transferred to the device per unit length as an ionizing particle travels through a material. The common unit is MeV·cm²/mg of material (Si for MOS devices).

LET threshold (LETth) is the minimum LET to cause an effect. The JEDEC recommended definition is the first effect when the particle fluence = 1x10⁷ ions/cm².

Cross section (sigma) is the device SEE response to ionizing radiation. For an experimental test for a specific LET, sigma = (number of errors)/(ion fluence). The units for cross section are cm² per device or per bit.
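
The cross section definition above can be sketched as a short calculation. The function name and the test values (error count, fluence, and device size) are hypothetical and serve only to show the arithmetic.

```python
from typing import Optional


def cross_section(errors: int, fluence_ions_per_cm2: float,
                  bits: Optional[int] = None) -> float:
    """Return sigma in cm^2 per device, or cm^2 per bit when `bits` is given."""
    if fluence_ions_per_cm2 <= 0:
        raise ValueError("fluence must be positive")
    sigma_device = errors / fluence_ions_per_cm2
    return sigma_device / bits if bits else sigma_device


# Hypothetical test run at one LET: 250 errors for 1e7 ions/cm^2
# on a memory with 1,048,576 bits.
sigma_dev = cross_section(250, 1e7)
sigma_bit = cross_section(250, 1e7, bits=1_048_576)
print(f"{sigma_dev:.2e} cm^2/device, {sigma_bit:.2e} cm^2/bit")
```

Repeating the calculation at several LET values gives the cross section curve from which the LET threshold and the saturation cross section are read.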

Asymptotic or saturation cross section (sigmasat) is the value that the cross section approaches as LET gets very large.

Sensitive volume refers to the device volume affected by SEE-inducing radiation. The geometry of the sensitive volume is not easily known, but some information is gained from test cross section data.

Section 2
Functional Analysis and Criticality
Michele M. Gates, NASA Goddard Space Flight Center

2.1 Introduction

Since SEE-inducing particles are, in general, not effectively attenuated with shielding, design tolerance requirements are not based upon location on the vehicle. Instead, SEE requirements depend on the functions devices perform. Many SEEs are different for different device types, e.g. memories will exhibit different conditions than power converters, so the function the device performs is critical to the analysis. In addition, SEEs may present functional impacts by propagating through the design and impacting other areas. These two conditions make each single event problem different in terms of failure mode and effect. SEE analysis is most effectively supported by viewing a design or system from the perspective of the function(s) it performs.

In this section, we present some systems engineering tools useful in constructing and assessing an SEE problem. Functional analysis is an effective method for the consideration of a design for single event effects. The concept of criticality lends itself well to the assessment of the impact of a specific effect. Error propagation analysis, discussed in the following chapter, provides the final link. With the use of these tools, SEECA becomes a specialized Failure Modes and Effects Criticality Analysis (FMECA)-type study.

2.2 Functional Analysis

The systems engineering process, presented as one of the Systems Engineering Practices in the MIL-STD-499 Engineering Management Practices, is given in Figure 2.1 [1]. The first box represents the input requirements for the system being considered. With the known performance requirements, one then identifies the required functions to achieve performance, termed "functional analysis". Potential mechanisms to fulfill the functions, or design options, are explored and evaluated. A decision is made, leading to the system description. The process may be applied to many levels in a design, from the large-scale system, or upper-level, to the lower levels of subsystems and circuits.

[Diagram of the System Engineering Process]

Figure 2.1: The System Engineering Process

Considering a design in terms of function facilitates engineering groups in developing plans and requirements and in performing analyses. Specific to SEECA, it provides the foundation for studying the impact of single event effects (SEEs) on system performance. SEE presents a functional impact on both the device and system levels. By analyzing a design or system in terms of the functions it performs, regardless of its given subsystem name or physical location on the vehicle, we may form an SEE problem statement and explore solutions. Considering both the device and system in terms of function sets the framework for defining the problem, analyzing it, and exploring solutions.

Different subsystems on a spacecraft are generally associated with different engineering disciplines. The subsystems are typically found on different physical locations on a space vehicle, such as in separate boxes. The attitude control subsystem, for example, is responsible for attaining and maintaining spacecraft orientation. This subsystem usually has several associated boxes which may include earth sensors, sun sensors, reaction wheels, gyros, and support electronics. The command and data handling subsystem may be responsible for issuing, delivering, and storing all computer commands and data. The propulsion subsystem usually contains the on-board thrusters, fuel, and its own electronics. The separation of subsystem boxes is extremely advantageous during design, integration, and test. However, it is easy to overlook the overlapping of functionality. One specific function will often involve hardware and/or software from more than one subsystem. For example, a reorientation maneuver, when broken down, involves lower level functions in many subsystems: the attitude control system senses orientation data; the command and data handling subsystem generates the required thruster command; and the propulsion subsystem fires a thruster. A schematic of some designated levels of design is presented in Figure 2.2.

Just as the entire systems engineering process in Figure 2.1 applies at many hierarchical levels in design, the functional analysis portion applies similarly. In functional analysis, a design is viewed from the perspective of the functions it performs. The objective of a conventional functional analysis is to define a comprehensive set of baseline functions and functional performance requirements which must be met in order to accomplish the overall mission objectives. This is achieved through the breakdown of top-level requirements into successively lower-level performance requirements, in a methodical and traceable manner. Functional analysis applied at lower levels involves the breakdown of requirements and functions at the subsystem, card, circuit, and device levels. Top level functional analysis is useful in requirements generation, such as for SEE tolerance. Lower level functional analysis is useful in SEE impact assessment, or failure modes and effects analysis.

Functional analysis may be performed in a clear, methodical way through the use of functional flow block diagrams. This flowchart-like method enables the identification of functions while providing traceability. Figure 2.3 presents a functional flow block diagram created in a mission-level functional analysis effort for the Far Ultraviolet Spectroscopic Explorer mission [2]. Mission operations, specified as function #4, is broken down into the next level, functions 4.1 - 4.7, which include contingency operations, safehold, deployment and initialization, maneuvers, target acquisition & tracking, science data acquisition, and science data processing. Figure 2.4 presents function 4.5, target acquisition & tracking, broken down into its next level, functions 4.5.1 - 4.5.6, which include sun acquisition, inertial attitude determination, inertial attitude processing, sensor configuration, target selection, relative attitude processing, slew specification, and instrument alignment [2]. Science data acquisition, function 4.6, is broken down in Figure 2.5 [2].

Figure 2.3: Functional Flow Block Diagram for Mission Operations.

Figure 2.4: Functional Flow Block Diagram for Target Acquisition and Tracking.

For quick studies of design issues, less formal analyses are often useful. Here, many-tiered functional flow block diagrams may not be needed. Quickly drafted notes or even a simple thought experiment may suffice as a short functional analysis on the subsystem or device level.
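
The numbered breakdown used in a functional flow block diagram can be sketched as a small lookup table with helpers for traceability. The function names are those listed above for mission operations; the assignment of 4.1 through 4.4 and 4.7 is assumed from the order in which they are listed (only 4.5 and 4.6 are stated explicitly), and the helper names are hypothetical.

```python
# Function numbers mapped to names. Only "4", "4.5", and "4.6" are stated
# explicitly in the text; the other numbers are assumed from list order.
FUNCTIONS = {
    "4": "mission operations",
    "4.1": "contingency operations",
    "4.2": "safehold",
    "4.3": "deployment and initialization",
    "4.4": "maneuvers",
    "4.5": "target acquisition & tracking",
    "4.6": "science data acquisition",
    "4.7": "science data processing",
}


def children(function_id):
    """Return the IDs exactly one level below `function_id`."""
    depth = function_id.count(".") + 1
    prefix = function_id + "."
    return [fid for fid in FUNCTIONS
            if fid.startswith(prefix) and fid.count(".") == depth]


def trace(function_id):
    """Return the chain of function names from the top level down to this ID."""
    parts = function_id.split(".")
    return [FUNCTIONS[".".join(parts[:i])] for i in range(1, len(parts) + 1)]


print(children("4"))
print(" -> ".join(trace("4.5")))
```

Keeping the hierarchy in the identifier itself is one simple way to preserve traceability from a low-level function back to the mission-level function it supports.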

2.3 Single Event Effect Perspective

The systems engineering process is used in many engineering disciplines, including single event effect (SEE) analysis. Some SEE mitigation techniques are system level and are designed directly into the system. For these, system level functional analysis identifies functions that are performed to meet the system requirements. Different system design options mitigating SEE to meet performance requirements may then be considered. Device cost, design complexity, design schedule, system weight and power may be potentially impacted by SEE mitigation, just as with many design selections.

The systems engineering process also applies to device-level SEE analysis. This may be done much later in the design process, after the system baseline has been described. A device has specific requirements associated with it in a design, such as operating current, bit error rate, etc. The device also performs functions to fulfill system level requirements, which may or may not overlap the device requirements. Mitigation schemes at the device level may be considered which ensure that performance requirements are met.

2.4 Functional Criticality

One objective of viewing a design or system in terms of function is to determine the criticality of the function(s) performed on an operational level. Many SEEs present a functional impact, but do not cause permanent damage to the device. Depending on the criticality of a function, these nondestructive conditions may or may not be acceptable in a design. In assessing criticality, we determine the impact of an SEE in a device on the functions it performs. Device hardness requirements are not considered here, since SEEs may be mitigated through many routes. What is of interest is the operational impact of a specific device SEE propagating through the design or system.

Functions may be categorized into "criticality classes", or categories of differing severity of SEE occurrence. Many times, most or all of the functions performed by a design or system are considered critical to a mission. The operational impact of SEEs in critical functions may be unacceptable. For these designs, usually either no single event effects, or a very small probability of SEE occurrences, are permitted. When considering a subsystem, some components may not be SEE-critical, while others may indeed be crucial. For example, the flight data system program memory is certainly critical, while data storage memories may tolerate SEEs if utilizing error correction schemes. Both of these functions are located in the Data System.

In general, one might consider three criticality groups for Single Event Upset: error-functional, error-vulnerable, and error-critical. Functions in the error-functional group may be unaffected by SEUs, whether it be due to an implemented error-correction scheme or redundancy, and a large probability of SEU may be acceptable. Functions in the error-vulnerable group might be those for which a low probability of SEU is an acceptable risk. Functions in the error-critical group are functions where SEU is unacceptable. Figure 2.6 presents a decision tree for criticality analysis, describing a representative criticality grouping and corresponding risk levels, or SEE tolerance requirements [3]. In this discussion, we are applying the decision tree to SEU analysis. One might use Figure 2.6 or a similar process for other nondestructive SEEs.

Figure 2.6: Single Event Effects Decision Tree.

This functional criticality concept applies directly at the device level. One may specify the criticality of a device function and determine whether current device tolerance needs and mitigation schemes are adequate to protect the system from impacts. Functional criticality is also a direct lead into SEE requirements generation on any level, including spacecraft, system, and subsystem.
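
The three criticality groups described in this section can be sketched as a simple classification. The two yes/no questions used here are an assumed simplification for illustration and do not reproduce the decision tree of Figure 2.6; the function and parameter names are hypothetical.

```python
from enum import Enum


class Criticality(Enum):
    ERROR_FUNCTIONAL = "error-functional"
    ERROR_VULNERABLE = "error-vulnerable"
    ERROR_CRITICAL = "error-critical"


def classify(seu_is_transparent: bool, low_probability_acceptable: bool) -> Criticality:
    """Assign a function to a criticality group (assumed two-question scheme)."""
    if seu_is_transparent:
        # For example, error correction or redundancy hides the upset.
        return Criticality.ERROR_FUNCTIONAL
    if low_probability_acceptable:
        # The risk of an occasional SEU is assumable.
        return Criticality.ERROR_VULNERABLE
    # SEU in this function is unacceptable.
    return Criticality.ERROR_CRITICAL


# Examples drawn from the data system discussion in this section.
data_storage = classify(seu_is_transparent=True, low_probability_acceptable=True)
program_memory = classify(seu_is_transparent=False, low_probability_acceptable=False)
print(data_storage.value, program_memory.value)
```

The same device type may therefore fall into different groups depending on the function it serves.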

2.5 Functional and Device SEE Requirements

Once the criticality of functions is determined, requirements for design, including hardware and software, may be directly obtained. In the criticality analysis presented in Figure 2.6, the requirements for SEU probability for all three criticality groups are directly tied to acceptable risks. The more critical an SEE is to operational performance, the more strict the SEE requirement should be.

In general, the tradeoff in the development of SEE requirements is risk vs. cost and design complexity. The more risk assumed, the higher the allowable probability of an SEE, and potentially the less the cost of the design. There may be cases in which a greater percentage of SEEs may be acceptable for a reduction in cost. Other design concerns also play a role, such as performance, power, weight, and volume.

Requirements are specified for each functional group by specifying the maximum probability of SEE occurrence permitted in each category. The SEE rate requirements may be different for SEU, latchup, gate rupture, and any other SEE of concern. These requirements are specified at the functional level, and are achievable through many avenues, including hardware mitigation, software schemes, redundancy, and device hardness. In contrast to specifying a spacecraft-level requirement, functional SEE requirements may yield areas in the design, or specific functions, with lower necessary tolerance levels. This reduction in requirements usually translates to a reduction in the cost of design. However, common devices across functions might be cost-advantageous under the worst-case radiation specification. The decision tree in Figure 2.6 is again helpful here. For each criticality group, there is a functional requirement. The functional requirement may be fulfilled using a combination of methods. The selection of mitigation tools leads to the device requirement. A functional requirement does not necessarily translate directly to a device requirement. Figure 2.7 presents this requirements flow [3].
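
As a rough sketch of per-group, per-effect requirements, the table below assigns a maximum allowed rate to each criticality group and SEE type and checks predicted rates against it. All numerical limits, the predicted rates, and the names used are hypothetical placeholders, not values from this document.

```python
# Hypothetical maximum allowed rates (events per day) for each criticality
# group and SEE type. None of these values come from this document.
MAX_RATE = {
    "error-functional": {"SEU": 1e-1, "SEL": 1e-4},
    "error-vulnerable": {"SEU": 1e-3, "SEL": 1e-5},
    "error-critical": {"SEU": 1e-6, "SEL": 1e-7},
}


def unmet_requirements(group, predicted_rates):
    """Return the SEE types whose predicted rate exceeds the group's limit."""
    limits = MAX_RATE[group]
    # An SEE type with no listed limit is treated as not permitted (limit 0).
    return sorted(effect for effect, rate in predicted_rates.items()
                  if rate > limits.get(effect, 0.0))


# Hypothetical predicted rates for one function after mitigation is applied.
predicted = {"SEU": 5e-4, "SEL": 1e-8}
print(unmet_requirements("error-vulnerable", predicted))  # []
print(unmet_requirements("error-critical", predicted))    # ['SEU']
```

A nonempty result indicates that further mitigation or a harder device would be needed before the function meets its requirement.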

Figure 2.7: Single Event Effect Requirements Generation.

This idea of functional and device SEE requirements is useful when working at many levels in design. Some projects perform a complete spacecraft functional analysis as part of the systems engineering responsibility. In this case, functional SEE requirements for the entire design, or any portion of it, may be directly derived by categorizing the functional breakdown by criticality. For specific portions of a design, functional SEE requirements may be developed by detailing the functions performed in that portion. Device SEE requirements flow directly from both of these, as described earlier. If addressing a problem in a more detailed design phase, device SEE requirements may be determined by assessing the functional criticality of specific components and assessing mitigation options to meet the specified operational requirements.

2.6 Conclusion

This section has presented some discussion on functional analysis and criticality, two systems engineering tools for design assessment. It is hoped that these methods will facilitate SEE analysis, both in forming a problem statement and arriving at solutions. Depending on the application, this methodology may be used in its entirety or in various degrees.

2.7 References


Section 3
Ionizing Radiation Environment Concerns
Janet L. Barth, NASA Goddard Space Flight Center

The definition of the radiation environment for SEE predictions must provide sufficient information to meet two criteria:

1) What is the "normal" radiation environment under which the system must operate? In other words, will the mitigation measures and mission operation plans be adequate to handle the SEU rates during normal operation times?

2) What is the "worst-case" radiation environment that the mission will encounter? In other words, will the levels of radiation during a pass through the peak fluxes of the proton belts or at the peak of a solar flare result in catastrophic data loss or cause parts to experience permanent or semi-permanent damage?

This section is intended to inform SEECA users of the risks, unknowns, and uncertainties inherent in radiation environment predictions. Thus, they will be better able to define SEE mitigation requirements that reduce risk with reasonable cost.

3.1 Ionizing Radiation Environment Sources

The main sources of energetic particles that are of concern to spacecraft designers are:

1) protons and electrons trapped in the Van Allen belts,
2) heavy ions trapped in the magnetosphere,
3) cosmic ray protons and heavy ions, and
4) protons and heavy ions from solar flares.

The levels of all of these sources are affected by the activity of the sun. The solar cycle is divided into two activity phases: the solar minimum and the solar maximum. An average cycle lasts about eleven years with the length varying from nine to thirteen years [1, 2, 3]. Generally, the models of the radiation environment reflect the particle level changes with respect to the changes in solar activity.

3.1.1 Trapped Heavy Ions and Electrons

From the information provided by the mapping of the trapped heavy ions by the SAMPEX satellite [4], we know that these ions do not have sufficient energy to penetrate the satellite and to generate the ionization in electronic parts necessary to cause SEEs. Also, electrons are not known to induce SEEs. Therefore, trapped heavy ions and trapped electrons are not included in a radiation environment definition for SEEs and will not be discussed in the sections below.

3.1.2 Trapped Protons

In the past, analyses of SEEs focused on energetic heavy ion induced phenomena. However, SEE data from recent spacecraft [5, 6, 7] have shown that newer, high density electronic parts can have higher upset rates from protons than from heavy ions because of their low threshold LET value. In addition, it is difficult to shield against the high energy protons that cause SEE problems within the weight budget of a spacecraft. As a result, any successful and cost effective SEE mitigation plan must include a careful definition of the trapped proton environment and its variations.

Protons are the most important component of the "inner" Van Allen belt. In the equatorial plane, the high energy protons (E>30 MeV) extend only to about 2.4 earth radii. The energies range from keV to hundreds of MeV. The intensities range from 1 proton/cm²/sec to 1 x 10⁵ protons/cm²/sec. The location of the peak flux intensities varies with particle energy. This is a fairly stable population but three known variations are important when defining requirements for SEE analyses. The most well known variation in the population is due to the cyclic activity of the sun. During solar maximum, the trapped proton populations near the atmospheric cut-off at the inner edge of the belt are at the lowest levels and, during solar minimum, they are at their highest. Second, the trapped protons are subject to perturbations at the outer edge of the inner belt and in the region between two and three earth radii due to geomagnetic storms and/or solar flare events. Last, the particle population is affected by the gradual change (secular variation) of the earth's magnetic field.

Trapped proton levels are calculated using the NASA AP8 model [8]. In the model, flux intensities are ordered according to field magnitude (B) and dipole shell parameter (L). The AP8 model comes in solar minimum and solar maximum versions; it is therefore possible to take into account the solar cycle variations by simply selecting the appropriate model version. Otherwise, the models are static and do not reflect the variations due to storms and the geomagnetic field changes. Consequently, the trapped proton fluxes from the AP8 model represent omnidirectional, integral intensities that one would expect to accumulate on average over a six month period of time. For limited durations, short term excursions from the model averages can reach orders of magnitude above or below.

Analyses of data gathered in flight before, during, and after geomagnetic storms and solar flare events have shown that the trapped proton population is affected by these phenomena at the outer edges of their trapping domain. It was observed on the CRRES satellite that flew during solar maximum that the so called "slot" region of the magnetosphere (2 < L < 3) can become filled with very energetic trapped protons as a result of solar flare events [9]. The decay time of the second belt is estimated to be on the order of 6-8 months. Phillips Laboratory has modeled this second proton belt as detected by the CRRES satellite [10]. The Air Force DMSP satellite flew during solar minimum. Particle flux monitors on board the DMSP showed that, after a major magnetic storm, the inner proton belt was reconfigured and eroded such that a second belt was formed [11]. A model of this redistribution of particles is not available.

To address the problem of the variation in the particle population due to the changes in the geomagnetic field, it has become a common practice to obtain fluxes from the AP8 model by using geomagnetic coordinates (B,L) calculated for the epoch of the AP8 model (1964 for solar minimum and 1970 for solar maximum). This practice came about with the observation that, by using the actual epoch of the mission (e.g., 1995) for the geomagnetic coordinates for orbits at low altitudes (<1000 km), unrealistically high levels of fluxes are obtained from the models due to a lack of an atmospheric cutoff condition in the AP8 [12]. However, B,L coordinates calculated with 1964 and 1970 epochs must be used with caution because it has been shown by in-flight proton flux measurements at an altitude of 541 kilometers [13] that the predictions obtained with geomagnetic coefficients for 1970 can result in significant errors in the spatial placement of the particle populations. This error is usually averaged out when the proton fluence is orbit integrated over a period of 24 hours or greater, but it can result in errors when specific positions in space are analyzed.

3.1.3 Galactic Cosmic Ray Protons and Heavy Ions

Galactic cosmic ray particles originate outside the solar system. They include ions of all elements from atomic number 1 through 92. The flux levels of these particles are low but, because they include highly energetic particles (energies from 10s of MeV/n to 100s of GeV/n) of heavy elements such as iron, they produce intense ionization as they pass through matter. As with the high energy trapped protons, they are difficult to shield against. Therefore, in spite of their low levels, they constitute a significant hazard to electronics in terms of SEEs.

As with the trapped proton population, the galactic cosmic ray particle population varies with the solar cycle. It is at its peak level during solar minimum and at its lowest level during solar maximum. The earth’s magnetic field provides spacecraft with varying degrees of protection from the cosmic rays depending primarily on the inclination and secondarily on the altitude of the trajectory. However, cosmic rays have free access over the polar regions where field lines are open to interplanetary space. The exposure of a given orbit is determined by rigidity functions calculated with geomagnetic field models [14]. The coefficients in the models include a time variation so that the rigidity functions can be calculated for the epoch of a mission.
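As a rough illustration of how geomagnetic shielding depends on position, the sketch below uses the simple dipole (Störmer) approximation for the vertical cutoff rigidity. This is not the rigidity calculation from the geomagnetic field models cited in the text [14]; the 14.9 GV constant is a commonly quoted dipole value, and the function names are hypothetical.

```python
import math

# Assumption: commonly quoted dipole constant for the vertical cutoff (GV).
DIPOLE_CUTOFF_GV = 14.9
PROTON_REST_ENERGY_GEV = 0.938272


def vertical_cutoff_gv(r_earth_radii, magnetic_latitude_deg):
    """Approximate vertical cutoff rigidity (GV) in a pure dipole field."""
    if r_earth_radii <= 0:
        raise ValueError("radial distance must be positive")
    cos_lat = math.cos(math.radians(magnetic_latitude_deg))
    return DIPOLE_CUTOFF_GV * cos_lat ** 4 / r_earth_radii ** 2


def proton_kinetic_energy_mev(rigidity_gv):
    """Kinetic energy (MeV) of a proton with the given rigidity (GV)."""
    # For a singly charged particle, pc in GeV equals the rigidity in GV.
    pc_gev = rigidity_gv
    total_gev = math.sqrt(pc_gev ** 2 + PROTON_REST_ENERGY_GEV ** 2)
    return (total_gev - PROTON_REST_ENERGY_GEV) * 1000.0


if __name__ == "__main__":
    # Cutoff falls toward zero at high magnetic latitude (polar access).
    for lat in (0.0, 30.0, 60.0, 80.0):
        rc = vertical_cutoff_gv(1.1, lat)
        print(lat, round(rc, 3), round(proton_kinetic_energy_mev(rc), 1))
```

The steep cos⁴ dependence on magnetic latitude is the reason inclination matters more than altitude for cosmic ray exposure in this approximation.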

The levels of galactic cosmic ray particles also vary with the ionization state of the particle. Particles that have not passed through large amounts of interstellar matter are not fully stripped of their electrons. Therefore, when they reach the earth's magnetosphere, they are more penetrating than the ions that are fully ionized. The capacity of a particle to ionize material is measured in terms of LET and is primarily dependent on the density of the target material and to a lesser degree the density and thickness of the shielding material.

Several models of the cosmic ray environment are available including CREME [15], CHIME [16], and a model by Badhwar and O’Neill [17]. The model most commonly used at this time is CREME; however, CHIME is based on more recent data from the CRRES satellite. The authors of CREME recommend that most of the environment options available in CREME not be used because they are outdated or inaccurate [18]. They suggest that the standard solar minimum calculations be used for most applications (M=1) and that a worst-case estimate should be obtained using the singly ionized model (M=4). Reference 16 compares the CHIME and CREME models and includes a brief discussion of the Badhwar and O’Neill model.

The CREME and CHIME models include solar cycle variations and magnetospheric attenuation calculations. The CREME model calculates LET for a simple shield geometry for aluminum shields and targets. CHIME improves the LET calculations by permitting the user to choose a shield material density and a target material density. Also, the CHIME model assumes that the anomalous component of the environment is singly ionized.

3.1.4 Solar Flare Protons and Heavy Ions

As mentioned in Section 3.1, work by Feynman et al. [2, 3] and Stassinopoulos et al. [1] shows that an average eleven year solar cycle can be divided into four inactive years with a small number of flare events (solar minimum) and seven active years with a large number of events (solar maximum). During the solar minimum phase, few significant solar flare events occur; therefore, only the seven active years of the solar cycle are usually considered for spacecraft mission evaluations. Large solar flare events may occur several times during each solar maximum phase. For example, in cycle 21 there were no events as large as the August 1972 event of cycle 20, whereas there were at least eight such events in cycle 22 for proton energies greater than 30 MeV. The events last from several hours to a few days. The proton energies may reach a few hundred MeV and the heavy ion component ranges in energy from 10s of MeV/n to 100s of GeV/n. As with the galactic cosmic ray particles, the solar flare particles are attenuated by the earth’s magnetosphere. The rigidity functions that are used to attenuate those particles can also be used to attenuate the solar flare protons and heavy ions. When setting part requirements, it is important to keep in mind that solar flare conditions exist for only about two percent of the total mission time during solar maximum.

An empirical model of the solar flare proton environment based on solar cycle 20 has existed since 1973 [19]. In 1974 King introduced a probabilistic model of the solar cycle 20 events [20]. This model divides events into "ordinary" and "anomalously large" (AL) and predicts the number of AL events for a given confidence level and mission duration. Stassinopoulos published the SOLPRO model [21] based on King's analysis. Since data for more solar cycles have become available, Feynman et al. [2, 3] have concluded that the proton fluence distributions actually form a continuum of events between "ordinary" and "anomalously large" flares. A team at JPL has combined the results of several works into the JPL Solar Energetic Particle Event Environment Model (JPL92) [22]. This model consists of three parts: a statistically based model of the proton flux and fluence, a statistically based model of the helium flux and fluence, and a heavy ion composition model.

The solar flare proton portion of the JPL92 model predicts essentially the same fluences as the SOLPRO code for the solar flare proton energies that are important for SEE analysis (E>30 MeV). However, for worst-case analyses, the peak solar flare proton flux is required and neither model contains this information. The peak flux of the protons for the August 1972 event can be obtained from the CREME model by specifying M=9 and element number = 1.

For the 26 events observed on the CRRES satellite [16], the peak fluxes for the helium ions with energies E > 40 MeV/n were three times higher than the galactic cosmic ray heavy ion levels. Above the energy of a few hundred MeV/n, the solar flare levels merge with those of the galactic cosmic ray background. The CREME model of the solar flares assumes that the solar particle events with the highest proton fluxes are always heavy ion rich. However, Reames et al. [23] contradict this assumption in their study of the ISEE 3 data. They found an inverse correlation between proton intensity and the iron/carbon heavy ion abundance ratio and that the composition of the flare was a result of the location of the flare on the sun.

The JPL92 model includes a definition of the solar flare heavy ion component based on the data from the IMP series of satellites. A paper by McKerracher et al. [24] gives an excellent overview of that model and presents sample calculations for interplanetary space at 1 AU. One of the findings of this work is that the JPL92 model calculates more realistic and lower solar heavy ion induced SEE rates. The CHIME model also contains a definition of the solar flare heavy ion fluence. As with the JPL92 model, it is expected that the CHIME model will predict lower SEE rates due to solar heavy ions.

3.2 Orbit Environments

There are extremely large variations in the SEE inducing flux levels that a given spacecraft encounters depending on its trajectory through the radiation sources. Some of the typical orbit configurations are discussed below with emphasis given to considerations that are important when calculating SEE rate predictions.

3.2.1 Low Earth Orbits (LEOs)

The most important characteristic of the environment encountered by satellites in LEOs is that several times each day they pass through the protons and electrons trapped in the Van Allen belts. The level of fluxes seen during these passes varies greatly with orbit inclination and altitude. The greatest inclination dependencies occur in the range of 0° < i < 30°. For inclinations over 30°, the fluxes rise more gradually until about 60°. Over 60° the inclination has little effect on the flux levels. The largest altitude variations occur between 200 and 600 km, where large increases in flux levels are seen as the altitude rises. For altitudes over 600 km, the flux increase with increasing altitude is more gradual. The location of the peak fluxes depends on the energy of the particle. For trapped protons with E > 10 MeV, the peak is at about 4000 km. For normal geomagnetic and solar activity conditions, these proton flux levels drop gradually at altitudes above 4000 km. However, as discussed above, inflated proton levels for energies E > 10 MeV have been detected at these higher altitudes after large geomagnetic storms and solar flare events.

The amount of protection that the geomagnetic field provides a satellite from the cosmic ray and solar flare particles is also dependent on the inclination and to a smaller degree the altitude of the orbit. As altitude increases, the exposure to cosmic ray and solar flare particles gradually increases. However, the effect that the inclination has on the exposure to these particles is much more important. As the inclination increases, the satellite spends more and more of its time in regions accessible to these particles. As the inclination reaches polar regions, it is outside the closed geomagnetic field lines and is fully exposed to cosmic ray and solar flare particles for a significant portion of the orbit.

Under normal magnetic conditions, satellites with inclinations below 45° will be completely shielded from solar flare protons. During large solar events, the pressure on the magnetosphere will cause the magnetic field lines to be compressed resulting in solar flare and cosmic ray particles reaching previously unattainable altitudes and inclinations. The same can be true for cosmic ray particles during large magnetic storms.

3.2.2 Highly Elliptical Orbits (HEOs)

Highly elliptical orbits are similar to LEO orbits in that they pass through the Van Allen belts each day. However, because of their high apogee altitude (greater than about 30,000 km), they also have long exposures to the cosmic ray and solar flare environments regardless of their inclination. The levels of trapped proton fluxes that HEOs encounter depend on the perigee position of the orbit including altitude, latitude, and longitude. If this position drifts during the course of the mission, the degree of drift must be taken into account when predicting proton flux levels.

3.2.3 Geostationary Orbits (GEOs)

At geostationary altitudes, the only trapped protons that are present are below the energy levels necessary to initiate the nuclear events in materials surrounding the sensitive region of the device that cause SEEs. However, GEOs are almost fully exposed to the galactic cosmic ray and solar flare particles. Protons below about 40-50 MeV are normally geomagnetically attenuated; however, this attenuation breaks down during solar flare events and geomagnetic storms. Field lines that cross the equator at about 7 earth radii during normal conditions can be compressed down to about 4 earth radii during these events. As a result, particles that were previously deflected have access to much lower latitudes and altitudes.

3.2.4 Planetary and Interplanetary

The evaluation of the radiation environment for these missions can be extremely complex depending on the number of times the trajectory passes through the earth’s radiation belts, how close the spacecraft passes to the sun, and how well known the radiation environment of the planet is. Each of these factors must be taken very carefully into account for the exact mission trajectory.

Careful analysis is especially important for missions that fly during solar maximum and that have trajectories that place the spacecraft close to the sun. Guidelines for scaling the intensities of particles of solar origin for spacecraft outside of 1 AU have been determined by a panel of experts [25]. They recommend that a factor of 1 AU x 1/r² be used for distances less than 1 AU and that values of 1 AU x 1/r³ be used for distances greater than 1 AU.
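The scaling rule quoted above can be written as a small helper. This is a minimal sketch that takes the exponents exactly as stated in the text (1/r² inside 1 AU, 1/r³ outside); the function name is hypothetical, and whether the factor is applied to flux or fluence is left to the analyst.

```python
def solar_particle_scale_factor(r_au):
    """Multiplier for a 1 AU solar particle intensity at distance r_au (AU)."""
    if r_au <= 0:
        raise ValueError("heliocentric distance must be positive")
    # Exponents as quoted in the text: 1/r^2 for r < 1 AU, 1/r^3 for r > 1 AU.
    exponent = 2 if r_au < 1.0 else 3
    return 1.0 / r_au ** exponent


if __name__ == "__main__":
    for r in (0.5, 1.0, 1.5, 2.0):
        print(r, solar_particle_scale_factor(r))
```

Both branches give a factor of 1 at exactly 1 AU, so the scaled intensity is continuous across the boundary.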

<table>
<thead>
<tr>
<th>Radiation Source</th>
<th>Models</th>
<th>Effects of Solar Cycle</th>
<th>Variations</th>
<th>Types of Orbits Affected</th>
</tr>
</thead>
<tbody>
<tr>
<td>Trapped Protons</td>
<td>AP8-MIN; AP8-MAX</td>
<td>Solar Min - Higher; Solar Max - Lower</td>
<td>Geomagnetic Field; Solar Flares; Geomagnetic Storms</td>
<td>LEO; HEO; Transfer Orbits</td>
</tr>
<tr>
<td>Galactic Cosmic Ray Ions</td>
<td>CREME; CHIME; Badhwar &amp; O'Neill</td>
<td>Solar Min - Higher; Solar Max - Lower</td>
<td>Ionization Level</td>
<td>LEO; GEO; HEO; Interplanetary</td>
</tr>
<tr>
<td>Solar Flare Protons</td>
<td>SOLPRO; JPL92</td>
<td>Large Numbers During Solar Max; Few During Solar Min</td>
<td>Distance from Sun Outside 1 AU; Orbit Attenuation; Location of Flare on Sun</td>
<td>LEO (&gt;45°); GEO; HEO; Interplanetary</td>
</tr>
<tr>
<td>Solar Flare Heavy Ions</td>
<td>CREME; JPL92; CHIME</td>
<td>Large Numbers During Solar Max; Few During Solar Min</td>
<td>Distance from Sun Outside 1 AU; Orbit Attenuation; Location of Flare on Sun</td>
<td>LEO; GEO; HEO; Interplanetary</td>
</tr>
</tbody>
</table>

Table 3.1 Summary of Radiation Sources

3.4 Environment Prediction Uncertainties

Actual flight data have shown that SEE rates are clearly influenced by the dynamics of the radiation environment. The section above mentions some of the uncertainties inherent in predicting the radiation environment for a mission. The purpose of this section is to specifically address the factors that must be considered to reduce uncertainties.

Experience has shown that the most effective means of reducing uncertainty factors and design margins in particle predictions is to define for the mission:

1. when the mission will fly,
2. where the mission will fly,
3. mission duration,
4. when the systems will be deployed,
5. what systems must operate during worst-case environment conditions,
6. what systems are critical to mission success, and
7. the amount of shielding surrounding the SEE sensitive part(s).

We caution against using old predictions from previous missions. As discussed in the sections above, the predictions have large variations that are a function of altitude, inclination, model used, and time of the mission.

3.4.1 Solar Cycle and Mission Scenarios

Particle level predictions based on the actual mission scenarios are lower than those based on taking the worst-case prediction for every environment. Estimates that include only worst-case conditions lead to over-design and should be used only in the concept design phase of a mission when the actual launch date and length have not been defined. After the launch date and duration are defined, it is possible to estimate how long the spacecraft will be in each phase of the solar cycle. These estimates should consider the impact of a launch delay of one year. Mission scenario definition is especially important for solar flare particles where the number of events is highly dependent on the amount of time that the satellite spends in solar maximum conditions.
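The phase estimate described above, including the one-year launch delay check, can be sketched as follows. The sketch assumes the idealized cycle from Section 3.1.4 (eleven years, seven of them active) and reads the "about two percent" flare figure as a share of solar maximum time. The start year of the active phase is a hypothetical input that the analyst must supply, and the function names are illustrative.

```python
import math

CYCLE_YEARS = 11.0     # average solar cycle length (Section 3.1.4)
ACTIVE_YEARS = 7.0     # active (solar maximum) years per cycle
FLARE_FRACTION = 0.02  # approximate share of solar maximum time in flare conditions


def active_years_in_mission(launch_year, duration_years, active_start_year):
    """Years of the mission that overlap the active phase of the cycle."""
    end_year = launch_year + duration_years
    k = math.floor((launch_year - active_start_year) / CYCLE_YEARS)
    total = 0.0
    while active_start_year + k * CYCLE_YEARS < end_year:
        window_start = active_start_year + k * CYCLE_YEARS
        window_end = window_start + ACTIVE_YEARS
        overlap = min(end_year, window_end) - max(launch_year, window_start)
        total += max(0.0, overlap)
        k += 1
    return total


def phase_summary(launch_year, duration_years, active_start_year):
    """Time in each solar cycle phase plus a rough flare-condition estimate."""
    active = active_years_in_mission(launch_year, duration_years, active_start_year)
    return {
        "active_years": active,
        "quiet_years": duration_years - active,
        "flare_days": FLARE_FRACTION * active * 365.25,
    }


if __name__ == "__main__":
    # Hypothetical mission: 5 years, active phase assumed to start in 1998.
    for delay in (0.0, 1.0):
        print(delay, phase_summary(2000.0 + delay, 5.0, 1998.0))
```

Running the summary for both the nominal and the delayed launch shows directly how a schedule slip shifts time between the active and quiet phases.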

3.4.2 Trapped Protons

The uncertainty factor defined for the trapped proton AP8 model is two. This is based on the statistical error inherent in merging the several spacecraft data sets that make up the model and does not include the substantial variations that occur over time. The largest variability in the trapped proton predictions is a function of the trajectory of the spacecraft. Therefore, applying proton predictions for a satellite in one orbit trajectory to another trajectory can result in errors up to several orders of magnitude.

To reduce the uncertainty of trapped proton calculations for SEE application, the definition of the trapped proton environment must specifically take into account:

1. how long the mission will be in each phase of the solar cycle and the effect of changes in the launch date,
2. the orbit trajectory,
3. analysis of the effects of the secular variation of the geomagnetic field (especially for orbits under 1000 km),
4. analysis of the variation in the outer edges of the proton trapping regions due to solar flare events/magnetic storms, and
5. the amount of spacecraft shielding surrounding the SEE sensitive part(s).

3.4.3 Galactic Cosmic Ray Heavy Ions

The basic uncertainty factor defined for the CREME model is two. The CHIME model will provide more updated abundances when it is available.

To reduce the uncertainty in the predictions of the galactic cosmic ray heavy ion levels, the definition must consider:

1. how long the mission will be in each phase of the solar cycle and the effect of changes in the launch date,
2. the effect of the ionization state of the anomalous component,
3. the amount of geomagnetic shielding for the orbit, and
4. estimate of the amount of shielding surrounding the SEE sensitive part(s).

3.4.4 Solar Flare Protons

The component of the environment that presents the largest uncertainty in predictions is the solar flare protons. Some solar cycles (Cycle 21) contain no extremely large flares at all. Other cycles contain as many as eight extremely large events (Cycle 22). The problem of providing solar flare predictions to those concerned with SEE criticality analysis is compounded by the limitations of the models. They are designed for determining mission integrated total dose or solar cell degradation levels and do not adequately address the SEE problem. That is, they provide event integrated fluences (SOLPRO) or mission integrated and daily fluences (JPL92). These values are not adequate for determining worst-case SEE vulnerability during the peak flux levels of the flares. The new CHIME model promises to provide these options for users when it becomes available. In the meantime, the best option is to use the peak flux spectrum from the August 1972 event [14]. The fluence levels provided by the SOLPRO and JPL92 models are a function of confidence level and mission duration.

To reduce the uncertainty in solar flare proton predictions, the definition must take into account:

1. how long the mission will be in each phase of the solar cycle and the effect of changes in the launch date,
2. the level of confidence selected by the project,
3. fluence levels for an extremely large event,
4. flux levels for the peak of an event,
5. the amount of geomagnetic shielding for the orbit,
6. estimate of how many times such an event will occur, and
7. the amount of shielding surrounding the SEE sensitive part(s).

3.4.5 Solar Flare Heavy Ions

The JPL92 model provides a significant improvement over the CREME model for the solar flare heavy ion compositions. As with the solar flare proton portion of the model, the heavy ion model gives fluences as a function of confidence level and mission duration. Again, for SEE analysis, a peak spectrum must be analyzed for worst-case conditions.

The solar flare heavy ion predictions must take into account:

1. how long the mission will be in each phase of the solar cycle and the effect of changes in the launch date,
2. the level of confidence selected by the project,
3. fluence levels for an extremely large event,
4. flux levels for the peak of an event,
5. the amount of geomagnetic shielding for the orbit,
6. estimate of how many times such an event will occur, and
7. the amount of shielding surrounding the SEE sensitive part(s).

3.5 Mission Specific Application

It is not as easy to define the radiation environment for SEE requirements as for TID requirements. In specifying a TID environment, all components of the environment (electrons, protons, bremsstrahlung) are converted to dose units (rads) and summed. The SEE-inducing environment may consist of both protons and heavy ions. Since the underlying physics of the interactions of protons and heavy ions are different, the SEE prediction models and the environment input required are not the same. In general, heavy ions cause upsets via direct ionization of the sensitive regions in the device. The LET spectrum for the particular orbit is used to define this portion of the SEE-inducing radiation environment. Proton-induced upsets are usually caused by secondaries produced by nuclear collisions in the material surrounding the sensitive node of the device. The energy of the incident proton is the best predictor of the damage potential as it determines the levels of secondary heavy ions produced by the collisions. Therefore, the proton energy spectrum is used to define this component of the SEE-inducing radiation environment. In rare cases, where the LET threshold of the device is very low (< 1 MeV·cm²/mg), the protons can directly ionize the sensitive regions. One example is the 1773 fiber optic data bus. In these situations, the LET spectrum of the protons is used, rather than the proton energy spectrum.
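To show why the proton energy spectrum is the required environment input, the sketch below folds a differential proton spectrum with an energy-dependent device cross section to obtain an event rate. This is one common way such a rate is estimated and is not taken from this document; the function names and all numerical values are hypothetical placeholders, not measured data.

```python
def trapezoid(xs, ys):
    """Trapezoidal integral of ys over xs (equal-length sequences)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    total = 0.0
    for i in range(1, len(xs)):
        total += 0.5 * (ys[i] + ys[i - 1]) * (xs[i] - xs[i - 1])
    return total


def proton_see_rate(energies_mev, diff_flux, cross_section_cm2):
    """Events per device per day.

    energies_mev      : proton energies (MeV), ascending
    diff_flux         : differential flux (protons/cm^2/day/MeV) behind shielding
    cross_section_cm2 : device cross section (cm^2/device) at each energy
    """
    integrand = [f * s for f, s in zip(diff_flux, cross_section_cm2)]
    return trapezoid(energies_mev, integrand)


if __name__ == "__main__":
    # Hypothetical spectrum and cross-section curve for illustration only.
    energies = [30.0, 50.0, 100.0, 200.0]
    flux = [1.0e5, 6.0e4, 2.0e4, 5.0e3]
    sigma = [1.0e-12, 5.0e-12, 1.0e-11, 1.2e-11]
    print(proton_see_rate(energies, flux, sigma))
```

A heavy ion estimate would use the LET spectrum in place of the energy spectrum, consistent with the distinction drawn in the paragraph above.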

SEE predictions are further complicated by the large variation in the criticality levels of system components. For example, it is not necessary to apply worst-case solar flare proton conditions that occur for only a few days of the mission to a data recording system that is not required to maintain data integrity during the flares. Such a system need only meet "normal operation" conditions, so the environments to consider are daily averages, worst-case passes through the South Atlantic Anomaly (SAA), and background cosmic ray heavy ions. On the other hand, devices that control critical functions must be able to operate at all times and may have the requirement that no SEEs are permitted. If this is the case, worst-case environments, such as peak fluxes in the SAA and peak solar flare conditions, are defined and applied.
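The distinction between normal operation and critical functions can be captured as a simple lookup. This is a minimal sketch; the environment labels paraphrase the paragraph above, and returning the normal-operation environments together with the worst-case ones for critical functions is an assumption, not a rule stated in the text.

```python
NORMAL_OPERATION = [
    "orbit-averaged daily environment",
    "worst-case pass through the SAA",
    "background galactic cosmic ray heavy ions",
]

WORST_CASE = [
    "peak flux in the SAA",
    "peak solar flare conditions",
]


def environments_to_analyze(is_critical_function):
    """Environment cases to apply to a function, by criticality."""
    if is_critical_function:
        # Assumption: critical functions are checked against every case.
        return NORMAL_OPERATION + WORST_CASE
    return list(NORMAL_OPERATION)


if __name__ == "__main__":
    print(environments_to_analyze(False))
    print(environments_to_analyze(True))
```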

It is most advantageous to a mission if the radiation environment specialist is involved as soon as mission requirements are set. In fact, there are cases where it benefits the mission to have advice on radiation environment levels during the orbit selection process. Experience has shown that it is possible to reduce radiation exposure by choosing more benign regions of space while still meeting mission goals. In the SEE requirements generation flow (Figure 2.7), the radiation environment prediction and subsequent SEE predictions for the parts on the preliminary parts list occur in parallel with setting system functional requirements. At this phase in the mission, a nominal shielding value must be set (e.g., 60 mils). The environment predictions and SEE predictions will be for this shielding value in the requirements generation at the point where decision tree analysis begins.

After setting functional requirements and predicting SEE rates, device sensitivity and criticality are taken into account if further mitigation is necessary. (See Figure 2.6, Single Event Decision Tree). In the case of devices with a low threshold LET (implying probable sensitivity to protons), it may be beneficial to evaluate the actual shielding geometry to determine if a lower, more accurate SEE prediction can be defined. The obvious advantage of this "mitigation" approach is the potential cost savings in eliminating or reducing design impacts. In the case of heavy ions, shielding is not likely to have an effect on the rate of SEE occurrence.

After mission planners have determined the mission specifications, they should provide the radiation environment specialists with:

1. the orbit configuration,
2. the date of launch,
3. the mission duration, and
4. the nominal shielding thickness(es).

The definition of the radiation environment for SEE analyses based on the above parameters should include (if applicable for the orbit):

1. trapped proton spectra attenuated by the nominal shielding thickness(es) for:
   a. orbit averaged daily fluences,
   b. fluences for worst-case pass through the SAA, and
   c. peak fluxes in the SAA;

2. LET spectrum for the nominal shield thickness(es) for orbit attenuated, galactic cosmic ray heavy ions;

3. orbit attenuated solar flare proton spectra attenuated by the nominal shielding thickness(es) for:
   a. an entire solar flare event and
   b. the peak of an event;

4. LET spectrum for the nominal shield thickness(es) for orbit attenuated, solar heavy ions for:
   a. an entire solar flare event and
   b. the peak of an event.

If any devices are susceptible to direct ionization by protons, it is necessary to include the LET spectrum for trapped and solar flare protons for the nominal shielding thickness(es) in the definition of the radiation environment.
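The "if applicable for the orbit" qualifier can be resolved with a lookup built from the last column of Table 3.1. This is a minimal sketch; the orbit keys and function name are hypothetical, and the 45° inclination threshold for solar flare protons in LEO is taken from the table.

```python
# Radiation sources by orbit type, transcribed from Table 3.1.
SOURCES_BY_ORBIT = {
    "LEO": ["Trapped protons", "Galactic cosmic ray ions",
            "Solar flare protons", "Solar flare heavy ions"],
    "HEO": ["Trapped protons", "Galactic cosmic ray ions",
            "Solar flare protons", "Solar flare heavy ions"],
    "TRANSFER": ["Trapped protons"],
    "GEO": ["Galactic cosmic ray ions",
            "Solar flare protons", "Solar flare heavy ions"],
    "INTERPLANETARY": ["Galactic cosmic ray ions",
                       "Solar flare protons", "Solar flare heavy ions"],
}


def applicable_sources(orbit, inclination_deg=0.0):
    """Sources to include in the environment definition for an orbit."""
    key = orbit.upper()
    if key not in SOURCES_BY_ORBIT:
        raise ValueError("unknown orbit type: " + orbit)
    sources = list(SOURCES_BY_ORBIT[key])
    # Table 3.1 lists solar flare protons for LEO only above 45 degrees.
    if key == "LEO" and inclination_deg <= 45.0:
        sources.remove("Solar flare protons")
    return sources


if __name__ == "__main__":
    print(applicable_sources("LEO", 28.5))
    print(applicable_sources("LEO", 98.0))
    print(applicable_sources("GEO"))
```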

3.6 References


Section 4
Effects in Electronic Devices and SEE Rates
Allan Johnston, Jet Propulsion Laboratory

4.1 Single-Event Upset and Related Effects

4.1.1 Collection of Charge in p-n Junctions

High-energy protons and heavy ions lose energy in materials mainly through ionization processes. When this occurs, they deposit a dense track of electron-hole pairs as they pass through a p-n junction. Some of the deposited charge will recombine, and some will be collected at the junction contacts. Charge can be collected from regions outside the junction from charge funneling and diffusion, as well as from the junction depletion region.[1] The net effect is a very short duration pulse of current at the internal circuit node which is struck by the particle. The magnitude of the charge depends on several factors:

1. Ion properties, including energy, ion type, and charge state;
2. Physical properties of the device, including the path length over which charge is deposited and collected; and
3. The sensitivity of the circuit to small current impulses, which depends on the voltage required to switch states, capacitance, and circuit response time.

A large fraction of the total charge collected by the circuit node occurs in time periods of about 200 ps, and this is referred to as prompt charge. There is also a delayed component that is collected by diffusion. The delayed component can extend to 1 µs or longer, and is important for slower SEE phenomena such as upset in dynamic memories, and latchup.

4.1.2 Circuit Effects

4.1.2.1 Overview

Basic Circuit Effects

The effects of ion-induced charge transients on a circuit depend on several factors, including the minimum charge required to switch states (or to instigate other effects). If the charge collected from the ion strike exceeds the minimum charge, defined as critical charge, then the passage of the ion will upset or otherwise affect the circuit. Note that the critical charge depends on the specific device design.

High-energy ions can induce a number of effects in integrated circuits. Not all of these effects are possible in all devices either because the critical charge for the effect is too high, or because the specific design (or processing) of the circuit precludes occurrence of the effect (e.g., latchup in silicon-on-insulator technologies). These effects can be subdivided into three basic categories:

1. Transient effects, such as single-event upset (SEU) and multiple-bit upset (MBU), that change the state of internal storage elements, but can be reset to normal operation by a simple series of electrical operations or reinitialization;

2. Potentially catastrophic events, such as single-event latchup (SEL) and snapback, that may cause destruction unless they are corrected for within a short time after they occur; and

3. Single-event hard errors (SHE), which cause catastrophic failure of a single internal transistor within a complex circuit. Two mechanisms can cause hard errors: microdose deposition within the gate region and gate rupture.

Circuit Fabrication Technology Overview

The relative importance of SEE phenomena also depends on circuit fabrication technology. It is not possible to cover all aspects of circuit fabrication, but several generalizations can be made. The two approaches that are used to fabricate most CMOS circuits are shown in Figure 4.1. Note that both processes rely on a reverse-biased junction to electrically isolate the well and substrate regions.

A large number of integrated circuits are fabricated with bulk substrates because of the low starting material cost. With a bulk substrate, junctions that are diffused directly into the substrate have a very long charge collection path for charge generated by heavy ions within the substrate. This affects both the prompt and diffusion components of the charge-collection process. In general, devices fabricated on bulk substrates are highly susceptible to SEU. Bulk CMOS circuits are often very susceptible to latchup as well.

The second is junction-isolation using an epitaxial substrate. This process begins with a highly doped low-resistivity substrate. A thin (5-15 µm) epitaxial layer is grown on the wafer prior to subsequent processing, and the active circuit elements are fabricated above (and within) the epitaxial layer. The low-resistivity substrate has the effect of limiting the prompt charge collection region to that of the thin epitaxial layer, with the result that much less charge is collected for epitaxial than for bulk processes, raising the minimum LET required for upset-related effects. Epitaxial substrates also improve latchup hardness compared with bulk processes. However, latchup is still possible with epitaxial processes,[3] particularly for scaled technologies.
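The link between collection depth and threshold LET can be illustrated with a first-order estimate. The sketch assumes that all charge deposited along a fixed collection depth is collected, that LET is constant over that depth, and that funneling and diffusion are ignored. The conversion constant follows from the silicon density (2.33 g/cm³) and 3.6 eV per electron-hole pair; the critical charge and depths in the example are hypothetical.

```python
# Charge deposited in silicon per unit path length and unit LET:
# about 0.0104 pC/um per MeV*cm^2/mg (roughly 1 pC/um at LET ~ 97).
PC_PER_UM_PER_LET = 0.01036


def collected_charge_pc(let_mev_cm2_mg, depth_um):
    """Charge (pC) deposited over depth_um by an ion of the given LET."""
    return PC_PER_UM_PER_LET * let_mev_cm2_mg * depth_um


def threshold_let(critical_charge_pc, depth_um):
    """Lowest LET (MeV*cm^2/mg) that deposits the critical charge."""
    if depth_um <= 0:
        raise ValueError("collection depth must be positive")
    return critical_charge_pc / (PC_PER_UM_PER_LET * depth_um)


if __name__ == "__main__":
    q_crit = 0.1  # hypothetical critical charge in pC
    # Hypothetical depths: thin epitaxial layers versus a long bulk path.
    for depth in (5.0, 15.0, 50.0):
        print(depth, round(threshold_let(q_crit, depth), 2))
```

Under these assumptions the threshold LET scales inversely with collection depth, which matches the qualitative statement that a thin epitaxial layer raises the minimum LET required for upset.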

Newer processes are available that isolate different regions with special oxide layers instead of junction isolation. These processes are costly, but can be very effective in hardening devices to SEE effects. With oxide isolation, the charge collection depth is limited to that of the top semiconductor epitaxial layer because of the underlying oxide. In most cases oxide-isolated circuits are also immune to latchup because they eliminate the possibility of four-layer paths. Oxide-isolated technologies include a technique that forms the isolation region by oxygen implantation (SIMOX),[4] and an approach that bonds two wafers with an oxide separation region, etching one wafer to form a thin epitaxial region (BESOI).[5]

Circuit design and feature size are also important in determining SEE sensitivity. In general, SEE sensitivity increases as devices are scaled to smaller feature sizes. Most scaled devices have faster response times, with lower critical charge. Newer technologies with reduced supply voltage are expected to be even more sensitive because switching levels are reduced, lowering critical charge compared to circuits with higher voltages.

**4.1.2.2 Single-Event Upset and Multiple-Bit Upset**

Single-event upset occurs in storage elements when the charge collected from a heavy-ion interaction exceeds the critical charge required to upset the circuit. The circuit then changes state, and stored information is lost. However, the circuit still functions normally, and it can be restored to its original operating state by rewriting or reinitializing the circuit.

Originally, only heavy ions caused single-event upset. However, as individual transistors were scaled to smaller dimensions to increase the size and complexity of VLSI circuits, their sensitivity to SEU increased sufficiently so that it was possible for protons to induce upset as well.[6] This may increase the upset rate by several orders of magnitude because of the large number of protons in solar flares and trapped radiation belts.

In some bulk device technologies multiple-bit upset can occur because of diffused charge in the substrate, which can be collected by several different circuit elements. Devices that are most likely to undergo MBU include DRAMs and four-transistor SRAMs, where diffused charge can be an important part of the charge collection and switching process. MBU can cause the SEU cross section to increase significantly at higher LET values rather than saturating.[7]

**4.1.2.3 Latchup and Snapback**

**Latchup**

Most junction-isolated circuits contain parasitic bipolar transistors that can form a four-layer region, similar to that of a silicon-controlled rectifier. These bipolar structures are not involved in normal operation of CMOS devices, but can be triggered by transient currents. Figure 4.2 shows the bipolar structures that create a four-layer latchable structure in a p-well CMOS circuit. Latchup disrupts normal operation in the region of the circuit where it occurs, causing partial loss of functionality as well as higher current in the local region where latchup occurs.

![Latchup paths in a bulk CMOS circuit showing the two parasitic transistors in a four-layer part.](image)

All CMOS designs use special guardbands and clamp circuits at input/output terminals to prevent latchup from occurring in standard circuit applications. However, in a radiation environment transient signals are no longer confined to I/O terminals, and it is possible for the current pulses from heavy ions or protons to trigger latchup in internal regions of a CMOS device, as well as in I/O circuitry.

Once latchup occurs, the four-region structure will be switched into a conducting mode, and it will remain in that mode until power is removed, or until the voltage across the latched region is reduced to very low values. During latchup currents can be very high. In most circuits, currents of several hundred milliamps or more will flow in the localized region where latchup is triggered, rapidly heating that region to extremely high temperatures. These high temperatures not only introduce the possibility of localized damage to the silicon and metallizations, but the excessive heating may also cause the latchup to spread to other regions.

Because of the potential for catastrophic damage, latchup is a very serious problem for space systems. The most conservative approach rules out use of any latchup-susceptible circuit. A number of methods have also been proposed to overcome latchup at the system or subsystem level by sensing excess current, which is a signature of latchup, and temporarily removing power. However, power must be removed within a few milliseconds after latchup occurs to avoid possible catastrophic damage. It is also difficult to make sure that latchup detection circuits will be completely effective because many different latchup paths exist in complex circuits, with different current signatures.[8] Device scaling generally increases susceptibility to latchup, and latchup is expected to be even more important for devices with reduced power supply voltage and operating power.[9]

Unlike most radiation phenomena, latchup is highly sensitive to temperature. The threshold LET for latchup is reduced by approximately a factor of three at 125 °C compared to room temperature.[10,11] Because the distribution of galactic cosmic rays contains far more ions at lower LET values [see Sections 3 and 8], a factor of three reduction in threshold LET increases the total latchup rate by a much larger factor at high temperature. For this reason, testing for latchup should always be done at the highest temperature expected in the application.

**Snapback**

As device dimensions are reduced, the parasitic bipolar transistor within an MOS device has sufficient gain so that its parameters can also affect device operation. Snapback is a reduction in the breakdown voltage of this parasitic transistor that is caused by injection of minority carriers from the source diffusion to the well.[12] Just like latchup, snapback also causes local loss of functional operation, along with an increase in current. However, much smaller currents generally occur as a result of snapback.

Another difference between latchup and snapback is that it is usually possible to recover from snapback by sequencing electrical signals without reducing the supply voltage. Snapback involves only three semiconductor regions, and can occur in oxide-isolated structures as well as in those with junction isolation. Unlike latchup, snapback is not very sensitive to temperature. [13]

**4.1.2.4 Single-Event Hard Errors**

As devices are scaled to smaller dimensions, it becomes possible to cause catastrophic damage from the interaction of one (or a small number, i.e. 2-3) ions. These effects have been recently observed in 4 Mbit DRAMs with feature sizes of 0.6-0.8 µm, and are expected to become more important as devices are scaled further.

One mechanism, microdose deposition, differs from other SEE effects in that it involves charge deposition within the gate oxide.[14] This is the same mechanism that causes total dose damage. It becomes significant when the gate area is comparable to that of the microdose deposition region from a single ion. The microdose effect changes the threshold voltage of an individual transistor within a VLSI circuit, increasing the leakage current. This can cause failure in some types of circuits, particularly DRAMs and SRAMs that use a four-transistor memory cell.

The other mechanism appears to be similar to gate rupture in power MOSFETs, and causes a short in the gate region of an individual transistor.[15] This mechanism is important for random logic as well as for memory cells. A similar effect has been seen in field-programmable gate arrays; in this case the heavy ion permanently alters the gate array.[16]

Both mechanisms have been observed for 4- and 16-Mbit DRAMs, but the probability of either phenomenon occurring in space is sufficiently low that neither is very significant for today’s circuit technology. However, both mechanisms are expected to be increasingly important as devices are scaled to smaller dimensions because the threshold LET is expected to decrease.[17]

4.2 SEE Testing

4.2.1 Facilities

4.2.1.1 Heavy-Ion Testing

Heavy-ion tests are done using high-energy accelerators such as the Brookhaven Van de Graaff accelerator or UC Berkeley cyclotron. The range of the particles is very limited -- far less than that of galactic cosmic rays -- and testing must be done within a vacuum chamber. The limited range of the particles can be important when test results are related to space environments, particularly for device technologies where charge diffusion is important. The finite range also means that the LET value changes as the beam passes through the device.

Tests are done using several different ion species, covering a range of LET values. A scattering foil is generally employed within the vacuum system to increase the beam area. The facility is calibrated by measuring the flux rate or current, and by using surface-barrier detectors to determine the LET. During testing, it is important to restrict the total fluence of heavy ions in order to keep from damaging the device that is being tested.[18] Total dose damage and displacement damage can alter device characteristics, affecting SEE test results.

In order to allow a wider range of LET values with fewer ions, it is common practice to adjust the incident angle of the particle beam by rotating the device under test. For a thin p-n junction with constant LET through its depth, the path length increases as $1/\cos \theta$, where $\theta$ is the incident angle. Rotation does not change the LET of the ion, but it increases the path length through the junction, and hence the deposited charge, by the secant of the angle. Thus, as long as these assumptions hold, the effective LET increases as $1/\cos \theta$. Unfortunately this "cosine law" is not always applicable. It fails in several cases: (1) where charge collection occurs over a path length that is a sizable fraction of the total range of the particle (the LET varies along the path); (2) for devices that collect much of their charge by diffusion, where the collection volume is spherical, and the effective LET does not vary with angle; and (3) where the aspect ratio of the collection volume is small, causing a more complex angular dependence. The validity of the cosine law must be carefully checked for each device technology. Test results should always include angle and range data for each ion species.
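The cosine law described above can be sketched in a few lines. This is a minimal illustration that assumes the thin-junction, constant-LET conditions stated in the text; the function name `effective_let` and the example values are hypothetical, not taken from any test report.

```python
import math

def effective_let(let_normal, angle_deg):
    """Effective LET under the cosine law: LET_eff = LET / cos(theta).

    let_normal: LET of the ion at normal incidence (MeV*cm^2/mg).
    angle_deg:  beam angle from the surface normal, in degrees.
    Only meaningful for a thin junction with constant LET through its depth.
    """
    if not 0.0 <= angle_deg < 90.0:
        raise ValueError("angle must be in the range [0, 90) degrees")
    return let_normal / math.cos(math.radians(angle_deg))

# Hypothetical example: an ion with LET = 20 tilted to 60 degrees
# behaves like an ion with roughly twice the LET.
print(effective_let(20.0, 60.0))
```

The sketch deliberately does nothing about the three failure cases listed above; range and collection geometry still have to be checked separately for each technology.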

Proton tests are also done at accelerators. Unlike heavy ions, protons have large ranges, making testing more straightforward. Different reactions are possible for different proton energies. Proton tests are generally done at several proton energies in order to determine the threshold energy level for proton upset instead of threshold LET, which is used for heavy ions.

4.2.1.2 Californium Sources

It is also possible to use californium fission sources for SEE testing.[19] Fission sources are very low in cost, and can be used in a normal laboratory environment. Fission sources produce a spectrum of fission fragment energies, which complicates interpretation of the results. The range of fission fragments is smaller than that of most heavy-ion sources, and this places a major restriction on testing with californium sources. However, californium can be an effective low-cost alternative for technologies with shallow charge collection structures.

Because of the finite range, the effective LET of californium fission fragments decreases rapidly as they go through the device structure. The maximum LET available from californium is approximately 40 MeV·cm$^2$/mg near the device surface. However, this decreases to 10-15 MeV·cm$^2$/mg at a depth of 10 $\mu$m. The effective LET of ions from californium depends on the average LET within the charge collection depth, and is usually much less than maximum LET. This needs to be taken into account when californium test results are compared with conventional heavy ion testing; in most cases the saturation cross section with californium is lower than the saturation cross section obtained with heavy ions because of the limited range.

4.2.2 Single-Event Upset Testing

Single-event testing is relatively straightforward for memory circuits, which are often used as an example of single-event testing. It is easy to define the internal conditions and to test the entire storage array of a memory circuit, although large commercial memories may use more complex "hidden" architectures that complicate the interpretation of memory test results.

As discussed in the previous section, testing is done using several different ions (and often several incident angles), measuring the number of errors and the total particle fluence to determine the cross section at various LET values.[20] The error rate must be low enough to avoid complications from multiple errors during short time periods (note that this differs from multiple-bit errors) and to correct for the latency period during the time that the memory is being rewritten.

In all SEE testing, it is important to recognize the importance of counting statistics. Counting uncertainties depend upon the square root of the number of occurrences. In general, at least 100 events should be observed at each effective LET value, and the uncertainty in counting statistics should be included when reporting data. Note that the observance of one or two events is virtually impossible to interpret.
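As a rough illustration of why about 100 events are recommended, the sketch below computes a cross section and its counting uncertainty from an event count and a fluence. It assumes simple Poisson counting (uncertainty equal to the square root of the count); the function names and example numbers are hypothetical.

```python
import math

def cross_section(events, fluence_per_cm2):
    """Return (sigma, one_sigma_uncertainty) in cm^2.

    sigma = events / fluence. The counting uncertainty is taken as
    sqrt(events) / fluence, i.e. Poisson statistics are assumed.
    """
    if events < 0 or fluence_per_cm2 <= 0:
        raise ValueError("events must be >= 0 and fluence must be > 0")
    sigma = events / fluence_per_cm2
    uncertainty = math.sqrt(events) / fluence_per_cm2
    return sigma, uncertainty

def relative_uncertainty(events):
    """Fractional counting uncertainty, 1 / sqrt(events)."""
    if events <= 0:
        raise ValueError("at least one event is required")
    return 1.0 / math.sqrt(events)

# 100 events give a 10% counting uncertainty; 2 events give about 71%.
for n in (2, 100):
    print(n, round(relative_uncertainty(n), 3))
```

With only one or two events the fractional uncertainty is of the same order as the measurement itself, which is consistent with the remark above that such results are virtually impossible to interpret.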

Other VLSI devices, such as microprocessors and random logic, are much more difficult to test. Bias conditions play a major role in single-event upset testing, particularly for complex circuits. In order to interpret results, one must know which regions of a device involve internal storage cells, and how many of them are being exercised during the test. For example, test results for some types of microprocessors have shown an order of magnitude increase in cross section when the device is exercised by operations that use cache memory compared to non-cache results.[21]

4.2.3 Latchup Testing

Many circuit variables affect latchup testing, including the bias conditions applied during testing. Latchup tests should be made under conditions of maximum power supply voltage. As discussed earlier, testing should also be done at the maximum temperature expected in the application. Note that a null result at room temperature means nothing about latchup susceptibility at higher temperature.

Because latchup is a relatively slow process, diffused charge is extremely important during latchup testing. It is important that particles have sufficient range. For devices with p-substrates, charge collection can occur at depths of 60 µm or more. Varying the incident angle may actually lower the adjusted latchup cross section if ions are used with insufficient range.

In most cases a power monitoring and control circuit is used during latchup testing that allows power to be shut down quickly after latchup is detected. If power cycling occurs, care must be taken to account for the "dead time" between shutdown and power up when the latchup cross section is evaluated.
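One simple way to account for dead time is to count only the fluence delivered while the device was powered and able to latch. The sketch below assumes a constant beam flux and a fixed dead time per latchup event; both are simplifying assumptions, and the function name and numbers are hypothetical.

```python
def latchup_cross_section(n_latchups, flux_per_cm2_s, beam_time_s, dead_time_s):
    """Latchup cross section (cm^2) corrected for power-cycle dead time.

    Assumes a constant flux and the same dead time after every latchup.
    Fluence delivered while the device was powered down is excluded.
    """
    live_time_s = beam_time_s - n_latchups * dead_time_s
    if live_time_s <= 0:
        raise ValueError("dead time exceeds total beam time")
    effective_fluence = flux_per_cm2_s * live_time_s
    return n_latchups / effective_fluence

# Hypothetical run: 50 latchups in 100 s of beam at 1e4 ions/cm^2/s,
# with 0.5 s of dead time per power cycle.
corrected = latchup_cross_section(50, 1.0e4, 100.0, 0.5)
uncorrected = 50 / (1.0e4 * 100.0)
print(corrected, uncorrected)
```

In this example a quarter of the beam time is dead time, so ignoring it would understate the cross section by the same proportion.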

Although latchup usually produces large increases in power supply current, some circuits exhibit very small changes in current ("microlatches"). These microlatches may be caused by localized latchup paths that have relatively high resistance (this can depend on the location of the latchup region relative to bond wires and power supply or ground metallization) or by other effects, such as snapback. In many cases insufficient resources are available to distinguish the exact mechanism. Nevertheless, it is important to realize that microlatches can occur and to set up a testing approach that can detect them.

4.3 SEE Rate Calculations

Calculation of SEE rates involves three different quantities:[22,23]

1. The cross section of the device, often determined empirically;

2. The distribution of particles expected in the space environment, which depends on assumptions about solar flare activity, radiation belt activity, and shielding; and

3. The critical charge, sensitive area and sensitive volume associated with the SEE phenomenon of interest.

These three quantities are folded together in order to calculate the estimate of the upset rate. All three are complex, and usually a number of simplifying assumptions are made in the process of doing the calculations. Assuming that the cross section is accurately known, the approach used is outlined below.

Particle Distribution. The particle distribution of galactic cosmic rays is often assumed to follow the so-called Heinrich curve, which provides several distributions of flux vs. LET corresponding to solar max, solar min, a 10% worst-case flare (i.e., only 10% of the expected solar flares will exceed this distribution), and a worst-case flare distribution [this is discussed in more detail in Section 8]. The worst-case distribution is rarely used, because it is not only statistically unlikely but also increases the particle distribution envelope by nearly five orders of magnitude.

Other particle distributions must be added to the distribution of the Heinrich flux. Particles trapped within radiation belts also have a distribution of energies. However, they generally have lower energy than galactic cosmic rays, and are more affected by shielding.

Sensitive Geometry and Critical Charge

The sensitive geometry and critical charge are the most difficult parameters to determine. Charge funneling, which extends the collection depth below the depletion region, cannot be determined directly, and involves an assumption about device geometry. For processes where diffused charge is important, it can be even more difficult to determine the appropriate volume for charge collection.

Charge collection assumptions are more straightforward for epitaxial processes, where it is usually reasonable to assume that the charge collection depth is limited by the epi thickness. The effective area for charge collection may be difficult to determine accurately, particularly for cases where the cross section does not saturate, but continues to increase at higher LET values.

Chord-Length Distribution. A cosmic ray can strike an IC from any angle. In order to calculate the number of upsets that will occur, it is necessary to take into account the way that particles intercept the total charge collection volume as they pass through it with different locations and angles. These details have been worked out for simple parallelepiped geometries, and can be used to provide a distribution of chord lengths within the volume.[24] The chord-length integral is then used along with the ion distribution and critical charge to determine the probability that particles in the environment will produce an upset.
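To make the link between chord length and critical charge concrete, the sketch below computes two characteristic chords of a rectangular parallelepiped and the minimum LET needed to deposit a given critical charge along each. It is not the full chord-length integral. The silicon constants (3.6 eV per electron-hole pair, density 2.33 g/cm$^3$) are textbook values assumed here rather than taken from this document, the mean chord uses the standard $4V/S$ result for a convex body in an isotropic flux, and the dimensions and critical charge are hypothetical.

```python
import math

# Assumed silicon constants (textbook values, not from this document).
EV_PER_PAIR = 3.6               # energy per electron-hole pair, eV
DENSITY_MG_PER_CM3 = 2330.0     # silicon density, mg/cm^3
PC_PER_ELECTRON = 1.602e-7      # electron charge, pC

def charge_per_um(let):
    """Charge (pC) deposited per micron of path for LET in MeV*cm^2/mg."""
    ev_per_um = let * 1.0e6 * DENSITY_MG_PER_CM3 * 1.0e-4  # eV per micron
    return ev_per_um / EV_PER_PAIR * PC_PER_ELECTRON

def mean_chord_um(a, b, c):
    """Mean chord of an a x b x c box in an isotropic flux: 4V / S."""
    volume = a * b * c
    surface = 2.0 * (a * b + b * c + a * c)
    return 4.0 * volume / surface

def max_chord_um(a, b, c):
    """Longest possible chord: the body diagonal of the box."""
    return math.sqrt(a * a + b * b + c * c)

def min_let_for_upset(q_crit_pc, chord_um):
    """Smallest LET that deposits q_crit_pc along a chord of chord_um."""
    return q_crit_pc / (charge_per_um(1.0) * chord_um)

# Hypothetical 10 x 10 x 2 micron sensitive volume, Q_crit = 0.1 pC.
a, b, c, q_crit = 10.0, 10.0, 2.0, 0.1
print("LET needed at normal incidence:", min_let_for_upset(q_crit, c))
print("LET needed along mean chord:   ", min_let_for_upset(q_crit, mean_chord_um(a, b, c)))
print("LET needed along longest chord:", min_let_for_upset(q_crit, max_chord_um(a, b, c)))
```

The spread between the three values shows why ions well below the normal-incidence threshold can still contribute to the rate when they traverse long chords.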

Error-Rate Prediction Techniques

For cases where a single device geometry is involved, computer programs such as CRUP or CREME can be used to calculate the final error rate, in errors/bit-day. However, these computer codes require a fixed, single value for critical charge and the device collection volume. In real devices the collection volume may depend on LET, complicating the analysis.

It is important to realize that these are very complex calculations, involving many assumptions and uncertainties.[23,24] The result is only an estimate of the upset rate expected in the application. The established approaches work reasonably well for devices with a nearly ideal cross section and sensitive volume, but are less successful for highly scaled devices or for processes involving diffused charge, such as latchup.

Figure 4.3 illustrates the way that the cross section, particle LET distribution, and sensitive volume are combined to calculate the error rate. Some important details associated with these parameters are discussed below.

Cross Section. Measurements of the cross section are available only at fixed values of LET, and it is usually necessary to adjust some of the results to account for experimental uncertainties, including differences associated with the assumption of the cosine dependence for ions that have angles other than normal incidence. Most cross sections rise gradually with increasing LET, and it is generally not correct to assume a step-function dependence for the cross section.

Particle Distribution. As discussed above, most space systems assume a particle distribution that includes the possibility of solar flares during the mission. The "10% flare" case is often used. One must keep in mind that an unusual solar flare event can cause much larger increases in the particle distribution, which will increase the upset or latchup rate above this value. For earth orbiting missions, the effect of protons from the radiation belts must be taken into account as well as distributions of galactic and solar flare particles in order to estimate the upset rate.

Sensitive Volume. Unless a specific device has been studied in exhaustive detail, many assumptions are necessary in arriving at a sensitive volume for SEE effects. Physical information about the structure (i.e., junction depths, doping levels and substrate characteristics) can be used, if available, but often the sensitive volume can only be obtained by making assumptions about the device structure. Furthermore, the sensitive volume is generally different for SEE from heavy ions and protons, as well as for SEL.

Calculation of Upset Rates. Fortunately, the steep falloff of the LET distribution and the gradual increase of the LET cross section make the final upset rate less dependent on small deviations and inaccuracies in these parameters, and it is usually possible to obtain upset rates that are within a factor of 2-3 of the actual upset rate for most devices.

The results of these calculations can be expressed in several different ways. One common approach is to calculate the number of errors that occur in a 24-hour period; this can either be reported for the entire chip, or normalized to the number of bits, i.e., errors/bit-day. Often two calculations are done, one for the typical galactic cosmic ray environment, and one assuming a large solar flare. This provides an approximate estimate of the likely change in upset rate during enhanced solar activity.
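The folding of cross section with the LET distribution can be illustrated with a deliberately simplified numerical integral. This sketch ignores the chord-length (angular) treatment and is not a substitute for codes such as CRUP or CREME. It assumes a differential flux tabulated on an LET grid, a measured cross-section curve interpolated linearly, zero cross section below the lowest measured LET, and saturation above the highest. All function names are hypothetical.

```python
def interpolate_sigma(let, sigma_lets, sigma_vals):
    """Piecewise-linear cross section (cm^2) at a given LET.

    Returns 0 below the lowest measured LET (assumed below threshold)
    and the last measured value above the highest LET (assumed saturated).
    """
    if let < sigma_lets[0]:
        return 0.0
    if let >= sigma_lets[-1]:
        return sigma_vals[-1]
    for i in range(1, len(sigma_lets)):
        if let <= sigma_lets[i]:
            t = (let - sigma_lets[i - 1]) / (sigma_lets[i] - sigma_lets[i - 1])
            return sigma_vals[i - 1] + t * (sigma_vals[i] - sigma_vals[i - 1])
    return sigma_vals[-1]

def upset_rate_per_day(let_grid, diff_flux, sigma_lets, sigma_vals):
    """Trapezoidal integral of sigma(LET) * flux(LET) over the LET grid.

    diff_flux is in particles/(cm^2 * day) per unit LET, so the result
    is in upsets per device-day.
    """
    rate = 0.0
    for i in range(1, len(let_grid)):
        f0 = diff_flux[i - 1] * interpolate_sigma(let_grid[i - 1], sigma_lets, sigma_vals)
        f1 = diff_flux[i] * interpolate_sigma(let_grid[i], sigma_lets, sigma_vals)
        rate += 0.5 * (f0 + f1) * (let_grid[i] - let_grid[i - 1])
    return rate

def errors_per_bit_day(device_rate_per_day, n_bits):
    """Normalize a device-level rate to errors/bit-day."""
    return device_rate_per_day / n_bits
```

Running the same integral twice, once with a typical galactic cosmic ray flux table and once with a flare flux table, gives the pair of estimates described above.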

Another important factor in error rate calculations is shielding, particularly for lower energy solar particles and protons in the radiation belts. Although shielding is not discussed here, it often reduces the error rate to much lower values for devices that are shielded by other equipment or structures within the spacecraft.

4.4 Summary

This section of the document provides a brief summary of basic SEE effects. The primary emphasis was on SEE effects in integrated circuits. Two important areas were omitted because of space limitations: catastrophic failure in power devices, including burnout in bipolar transistors and gate rupture in power MOSFETs; and single-event transients in integrated circuits.

It is important to distinguish between transient SEE effects, which are generally easy to recover from, and catastrophic effects, which are generally of far more concern for most space systems. As devices are scaled to smaller dimensions, it is likely that catastrophic effects will become increasingly important, and the success of future space missions may ultimately depend on how well these phenomena are understood and characterized for various technologies.

Finally, it is important to realize that error rate calculations are complex, and involve many assumptions, particularly about device geometry. Although error rates for individual components are important, they are only part of the equation. Other sections of the document deal with ways to accommodate SEE effects, and one should keep in mind that the net effect of errors on the spacecraft is highly system dependent.

Notes

§ Even though the main energy loss mechanism for protons is ionization, the charge produced by direct ionization from protons is too small to cause SEE effects in most semiconductors. SEE effects from protons are caused by the reaction products of proton-induced nuclear reactions.

4.5 References

5. W. M. Huang, et al., "ULSI-Quality Gate Oxide on Thin-Film-Silicon-on-Insulator," 1993 IEDM Technical Digest, p. 735.

Section 5
SEU Propagation Analysis: System Level Effects
Kenneth A. LaBel, NASA Goddard Space Flight Center

5.1 Definition

SEU propagation is the art and science of determining the effect and potential impact that the occurrence of an SEU has on the device where the SEU occurs, its associated circuitry, subsystem, system, and spacecraft; that is, how an SEU propagates up the ladder of design integration. For example, an SEU occurring in an A-to-D converter may cause a single incorrect data sample to be gathered. This "invalid" data sample may provide an incorrect data point such as a star location or a misleading temperature value.

5.2 Ground Test and Simulation of System Level or Propagated SEEs

The concept of propagated SEUs is straightforward to the typical electrical engineer. It is similar to what one might perform in a standard mathematical circuit simulation, that is, how a signal pulse, transient, or state will affect a circuit's performance either instantly or in future clock cycles.

Several groups have published information pertaining to the simulation of SEU effects and their propagation to the circuit and system level, as well as to SEE ground testing of devices in the actual circuit design used in a spacecraft system [1-12].

Newberry, et al. [1-3] have been leaders in the area of SEU propagation. In particular, they have discussed the effects of radiation-induced input/output (I/O) transients or noise spikes on system performance as well as that of VLSIC transients. In essence, traditional bit flips in memory cells are not the only cause of SEUs at the system level; SEU-induced voltage spikes occurring in logic or I/O devices also contribute to the system SEU rate and effects. This work was also among the first to discuss transients and circuit-specific levels for defining SEUs (i.e., duration and amplitude constraints). For example, a 0.25 V spike of 5 nanoseconds in duration may or may not be observed by the following circuit elements.

Leavy, et al. [4] have described the propagation of events inside of a bulk CMOS microprocessor with SEU-hardened clocked flip-flops. In this instance, SEU-induced transients on the clock lines were shown to be capable of causing upsets to microprocessor operation. As a side note, Leavy, et al. were able to solve this problem through a circuit redesign for their next foundry run of the microprocessor.

LaBel, et al. [5,6] have described the effects of transients in a fiber optic receiver photodiode as well as how this affects a system bit error rate (BER), both from the physical link perspective and through higher layers of network protocol. This will be described below.

SEU-induced transients in analog devices have been reported by several organizations [7,8,9]. All of these references point out two facts. First, transients in devices such as a comparator or op amp may propagate to the digital electronics in the surrounding circuitry. Depending on the specific circuit designs, these transients may only corrupt a single telemetry sample or, in a worst-case scenario, cause system dysfunction or failure. The second item was pointed out by Newberry [3] as well: the definition of an analog SEU phenomenon is specific to the interface circuitry surrounding the radiation-sensitive device.

Taking this one step further, Turflinger, et al. [10,11] have extensively delved into separating SEUs for conventional analog-to-digital converters (ADCs) into several categories. The two major categories are noise and offset errors that are analogous to Gaussian and non-Gaussian errors. Neither of these errors is fatal to the device itself, but both are capable of causing erroneous telemetry and misinterpretation by or impairment of the surrounding spacecraft systems.

McCarty, et al. [12] also have explored an ADC. However, this ADC was not a conventional successive-approximation register (SAR) or flash ADC, but a complex hybrid delta-sigma averaging ADC susceptible to both noise and offset errors, as well as control errors. These control errors are capable of affecting device operation and calibration. Furthermore, they hinder system performance in a space environment.

At this point, we have emphasized the effects of transient SEUs on system performance. It is not intended to slight digital SEU effects such as bit flips. These types of SEUs may propagate, for example, from a control or data register inside of a microprocessor into operational performance of the circuit or system. A worst-case example may be the false commanding of critical hardware such as a thruster or pyro.

In some instances, it is not required to know what particular area of a device has seen an SEU, but how well the system mitigation design will work. NASA has been among the first to fly a commercial 32-bit microprocessor in a critical space application [13]. The Small Explorer Data System (SEDS) is a spacecraft Command and Data Handling subsystem for the Solar Anomalous Magnetospheric Particle Explorer (SAMPEX) mission at Goddard Space Flight Center (GSFC). Included as a critical portion of the SEDS is the Recorder Processor Packetizer (RPP): an INTEL 80386 microprocessor-based flight computer with 26.5 MBytes of solid state data storage.

One of the design features of the SEDS is its built-in fault tolerance and its ability to recover from observed errors. This is accomplished via SEDS hardware watchdog circuitry (at multiple levels: circuit, board, box, etc.) as well as software health and safety tasks. To this end, an SEU test was performed on the RPP. SEUs were induced on the 80386 microprocessor family in order to verify the fault tolerant capabilities of the SEDS [13]. The Brookhaven National Laboratory Tandem Van de Graaff accelerator was utilized for this purpose.

To summarize the SEDS ground SEE test results, several different errors were observed including a halting of the RPP's operation and "processor exceptions". All the SEE events were recoverable using planned mitigation techniques by the SEDS.

It should be noted that the SEDS has been performing flawlessly from the SEE mitigation perspective since its launch in July of 1992.

5.3 Propagation Analysis Methodology

In many ways, SEU propagation is similar to both traditional circuit simulation and FMEA. In both instances, the goal is to determine the end effects that an error or failure has on the performance of a device, circuit, or system. To this end, we shall trace the steps an engineer may utilize in determining SEU propagation effects.

5.3.1 Device Analysis

This is the lowest level of propagation analysis included herein. Figure 5.1 illustrates this methodology.

Step 1: Is the device sensitive to SEUs?

This is relatively straightforward:
- If the answer is no, then no further analysis is required.
- If the answer is yes, then go to Step 2.

Step 2: Does the device meet mission requirements?

A device that has a known SEU sensitivity might still meet mission requirements. An example would be a device having an $\text{LET}_{\text{th}} = 45 \text{ MeV} \cdot \text{cm}^2/\text{mg}$ when the mission requires devices with an $\text{LET}_{\text{th}} > 35 \text{ MeV} \cdot \text{cm}^2/\text{mg}$. The device is not insensitive to SEUs, but is acceptable for this particular mission.

Step 3: Determine SEU sensitive device areas.

In analyzing a device, one must determine where and what types of SEUs may occur. Simple devices such as a memory device may have two device areas for discussion, memory cells and control logic, while complex devices such as microprocessors may have dozens of individual areas. As one would expect, the more highly integrated a device is, the more sensitive areas may be associated with it. For simplicity, we shall limit the types of SEUs discussed to two: bit flips (state changes), which typically occur in memory cells or flip-flops, and transients, which occur in combinatorial logic or manifest themselves as a "noise" spike in both analog and digital IC areas. Table 5.1 illustrates several potential ICs and their associated areas. This list should not be construed as exhaustive, but simply as a sampling of device types.

<table>
<thead>
<tr>
<th>Device Type</th>
<th>Sensitive Area</th>
<th>SEU Types</th>
</tr>
</thead>
<tbody>
<tr>
<td>Memories</td>
<td>Memory cells</td>
<td>Bit flips</td>
</tr>
<tr>
<td></td>
<td>Control Logic</td>
<td>Bit flips if sequential, transients if combinatorial</td>
</tr>
<tr>
<td>Combinatorial logic</td>
<td>Combinatorial logic</td>
<td>Transients</td>
</tr>
<tr>
<td>Sequential logic</td>
<td>Sequential logic</td>
<td>Bit flips</td>
</tr>
<tr>
<td>FPGAs</td>
<td>Combinatorial logic</td>
<td>Transients</td>
</tr>
<tr>
<td></td>
<td>Sequential logic</td>
<td>Bit flips</td>
</tr>
<tr>
<td>Microprocessors</td>
<td>Registers, cache, sequential control logic</td>
<td>Bit flips</td>
</tr>
<tr>
<td></td>
<td>Combinatorial control logic</td>
<td>Transients</td>
</tr>
<tr>
<td>ADCs, DACs</td>
<td>Analog portion</td>
<td>Transients</td>
</tr>
<tr>
<td></td>
<td>Digital portion</td>
<td>Bit flips or transients depending on design</td>
</tr>
<tr>
<td>Linear ICs</td>
<td>Analog area</td>
<td>Transients</td>
</tr>
<tr>
<td>Photodiodes</td>
<td>Photodiode</td>
<td>Transients</td>
</tr>
</tbody>
</table>

Step 4: Determine operational parameters

How a device is being utilized in its specific application may affect its SEU performance as well. Parameters such as access rates, operational modes, clock frequency, power supply voltage, etc... have definitive impacts not on the occurrence, but on the observed effect of an SEU. Several examples may aid the reader to understand this.

An SRAM device used in a data storage area provides a simple example. SRAMs, for convenience, have three operating conditions: Read, Write, and Static (Data Storage) modes. SEU ground testing may show each mode to have a different SEU sensitivity, i.e., LETth and cell cross-section. In a typical SSR application, an SRAM is written to once between downlink operations to the ground, read once during downlink playback, and remains in static mode for the remainder of the time (typically >99%). Because all memory cells in a device are not written to at the same time (i.e., they are written one byte at a time), SEUs that have an observed effect are those that occur during a write or read operation and those that occur after the device is written to and prior to downlink. If an SEU occurs during the time period between downlink and the writing of a memory cell, the SEU would be overwritten during the write operation. Hence, that particular SEU has no observed effect. This is sometimes known as a benign SEU. Additionally, actual write and read accesses take on the order of 10-200 ns to occur. Thus, the sensitive time window, i.e., the time period when an SEU has an observed effect, is very small for these operations.
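The sensitive time window for a single memory cell can be sketched numerically. The following is only an illustration, assuming one write and one read of the cell per downlink cycle and a constant upset rate; the function names and the numbers are hypothetical and are not taken from any flight system.

```python
def observable_fraction(cycle_s, write_time_s, read_time_s):
    """Fraction of a downlink cycle during which an upset in one cell is observed.

    Assumes (hypothetically) one write and one read of the cell per cycle,
    with the cycle starting just after the previous downlink read.
    An upset before the write is overwritten (benign); an upset between
    the write and the downlink read corrupts the played-back data.
    """
    if not 0 <= write_time_s <= read_time_s <= cycle_s:
        raise ValueError("expected 0 <= write <= read <= cycle")
    return (read_time_s - write_time_s) / cycle_s


def observed_upsets(cell_upset_rate_per_s, cycle_s, write_time_s, read_time_s):
    """Expected observed upsets per cell per cycle for a constant upset rate."""
    window_s = observable_fraction(cycle_s, write_time_s, read_time_s) * cycle_s
    return cell_upset_rate_per_s * window_s


# Illustrative numbers only: a 6000 s cycle, cell written at 4500 s, read at 6000 s.
print(observable_fraction(6000.0, 4500.0, 6000.0))  # 0.25
```

Cells written early in the cycle hold their data longer before downlink and therefore have a larger sensitive window than cells written late.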

A second sample scenario might involve a microprocessor. As discussed previously, these types of devices are very complex and have many different areas where an SEU may occur. Some areas have obvious effects on the device performance: for example, the program counter (PC) register. If a bit flip occurs in the PC, the microprocessor program flow is disrupted. However, there may be other device areas such as a status register or an area of the device not being utilized where the occurrence of an SEU is benign. If, for example, the microprocessor has a programmable interval timer (PIT) built-in, one must know if and how it is utilized in this specific design. If the PIT is not used, the SEU would be benign. If the PIT is utilized, one must analyze what performance effect (i.e., different time period than expected) this has based on when the SEU occurs. Additionally, one should know the expected operating modes and area utilization to determine sensitive time windows and non-benign SEU conditions.

Other parameters may affect the device's SEU performance. These include clock frequency and power supply voltage. One should always ask the "what if" question: what if an SEU occurred at location A during time period B? Note that the probability of SEU observance is linked to the sensitive time window for the event as well as to area SEU sensitivity and the environment.

Step 5: Determine/simulate device performance

Now that we have determined the sensitive device areas and operational effects on observed SEUs, the determination of what apparent effect the SEU has on device performance must be explored. Several outcomes may transpire. These include, but are far from limited to:

- improper device operation,
- incorrect device output,
- errors in memory structures to be accessed externally,
- noise spikes on transmission lines,
- device mode changes such as going from an active to standby mode, and,
- incorrect device timing.

If one looks at this as a traditional circuit simulation, digital test vectors with errors (SEUs) could be used to determine the observed effect. At a lower level, SPICE (analog) simulations with injected transients could be utilized as well. Sample scenarios would include FPGA simulations of combinatorial and/or sequential logic or a microprocessor PIT sending out a pulse at an incorrect time. The output of this analysis is a list of potential SEUs for each device.
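As a rough illustration of the test-vector approach, the sketch below simulates a simple down-counting interval timer and compares a fault-free ("golden") run against a run with one injected bit flip. The timer model, the function name `run_timer`, and all parameter values are assumptions made for illustration; they do not describe any particular device.

```python
def run_timer(reload, cycles, flip_at=None, flip_bit=0):
    """Simulate a down-counting interval timer; return cycles at which it pulses.

    flip_at / flip_bit inject a single bit flip into the count register
    just before the given cycle is evaluated (hypothetical SEU model).
    """
    count = reload
    pulses = []
    for cycle in range(cycles):
        if flip_at == cycle:
            count ^= 1 << flip_bit  # injected SEU in the count register
        if count == 0:
            pulses.append(cycle)    # timer output pulse
            count = reload
        else:
            count -= 1
    return pulses


golden = run_timer(reload=9, cycles=40)
faulty = run_timer(reload=9, cycles=40, flip_at=12, flip_bit=2)
print(golden)  # [9, 19, 29, 39]
print(faulty)  # [9, 15, 25, 35]
```

Comparing the two pulse lists shows the device-level effect of this particular upset: one shortened interval, after which every later pulse is shifted in time.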

### 5.3.2 Circuit Level Analysis

Circuit level analysis follows the same steps (3-5) as the device level but with the key now being the circuit operation and performance. As with device level analysis, once we know which devices have SEUs and what those SEUs may look like, we then look at the operational parameters and their impacts on SEU performance. For example, we know that a bit flip may occur in an SRAM, but the circuit level effects are dependent on what the SRAM is being used for in this application. Sample propagated effects might include:

- an SEU in an SRAM being used for data storage
  - a bad data point,
- an SEU in an SRAM holding software program instructions
  - improper processor operation or flow, or
- an SEU in an SRAM used as a shared memory buffer between two other ICs such as a processor and a direct memory access (DMA) controller
  - any of a large number of potential error conditions (program flow, bad data point, etc).

One must again be aware of the potential for benign and non-benign SEU effects. A sample case is as follows. Assume that a bus driver IC that is being used to drive a microprocessor address bus has an SEU-induced noise spike. Both the time at which this spike occurs and the transient's amplitudes (time and voltage levels) determine whether this condition is observed by the surrounding circuitry as an error or not. Again, the concept of a sensitive time window applies. If the transient occurs on a quiescent bus (i.e., no transactions taking place), the SEU is most likely benign. If the transient occurs on an active bus, the SEU may or may not be benign depending on the exact timing of the transaction and the noise spike, as well as the spike's amplitudes.
Once the operational analysis is performed, the engineer is again able to perform a circuit simulation using digital or analog tools. The output of this analysis is a list of the potential SEUs in a circuit and their effects on circuit operation. We may view this as a "black box" wherein the internal circuitry doesn't matter, but what is observed by the outside world (subsystem, system, etc...) is noted.
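The bus driver case above can be reduced to a simple check, sketched here under simplifying assumptions: the receiving circuitry samples the bus only during known windows, and a transient matters only if it overlaps a window and crosses the receiver threshold. The function name, the window list, and the voltage values are hypothetical.

```python
def transient_is_observed(t_start_ns, width_ns, amplitude_v,
                          sample_windows_ns, threshold_v):
    """Return True if a transient overlaps any sampling window with
    enough amplitude to cross the receiver threshold.

    sample_windows_ns: list of (open, close) times when the bus is latched.
    """
    if amplitude_v < threshold_v:
        return False  # too small to be read as a logic level change
    t_end_ns = t_start_ns + width_ns
    return any(t_start_ns < close and t_end_ns > open_
               for open_, close in sample_windows_ns)


windows = [(100, 105), (200, 205)]  # assumed bus sampling windows
print(transient_is_observed(150, 10, 3.0, windows, 1.5))  # False: quiescent bus
print(transient_is_observed(198, 5, 3.0, windows, 1.5))   # True: overlaps a sample
print(transient_is_observed(198, 5, 0.5, windows, 1.5))   # False: below threshold
```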

### 5.3.3 Higher Level Analysis

We may treat subsystem, system, and spacecraft levels of analysis in a single manner. Each of these levels handles the previous level as a black box, concerned not with its intimate details but only with its higher level effects. We will discuss the subsystem level herein as a representative analysis layer.

Once the circuit level analysis is complete, we begin the subsystem level analysis. In essence, we may treat the subsystem exactly like the circuit level, but look for performance aspects of the SEU-induced anomaly. An example follows.

A Command and Data Handling (CADH) subsystem may be composed of separate circuits such as those for data storage, spacecraft command processing, attitude control processing, instrument interfacing, spacecraft engineering telemetry gathering, etc... Let's say, for instance, that an SEU occurs in the spacecraft command processing circuitry. To be more specific, we know by circuit analysis that this SEU causes the spacecraft command processing circuitry to have a false output. Again looking at operational parameters and sensitive time windows and amplitudes, we determine if and how this may affect the surrounding circuits and whether there is an effect on the subsystem performance and its output on the whole. For example, we determine if the false output propagates through the instrument interfacing circuit causing an incorrect output on the instrument command interface.

The system level analysis takes this one step further. By continuing with the CADH example, we observe that this false output again may or may not propagate to another subsystem. Depending again on sensitive time windows and amplitudes, an incorrect command may or may not be issued to the instrument.

The spacecraft level of analysis then would take the output of the system level analysis and determine, in this case, whether the incorrect command would affect the overall spacecraft operation. For example, we might observe incorrect instrument data being gathered or a system safing occur.

### 5.4 Example

To provide a little more detailed understanding, we shall discuss a typical ADC. This (hypothetical) ADC has both digital and analog sections. Let's assume an SEU occurs in a calibration RAM area of the device. We shall look at how this SEU could propagate to affect spacecraft performance.

At the device level, we observe a shift of the output levels by +1V. That is, each sample gathered is incorrect with a constant offset of +1V.

At the circuit level, we observe that the engineering telemetry circuit output for a temperature/thermistor circuit for the CADH subsystem has the same +1V offset.

At the subsystem level, the CADH subsystem observes that the CADH temperature is 10 degrees higher than before.

At the system level, no direct effect is propagated to another subsystem, but we still observe the abnormally high temperature for the CADH subsystem.

At the spacecraft level, we observe that the CADH subsystem is operating at a temperature above its specified limit and take an action such as entering a safing mode, turning off a heater, or sending an anomaly report to the ground via downlink and then awaiting ground intervention to correct the anomaly.
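The chain of effects in this example can be written as a sequence of "black box" stages, each consuming only the output of the level below. The sketch is illustrative: the scale of 10 degrees per volt follows from the example, while the temperature limit, the input voltage, and all function names are assumptions.

```python
VOLTS_TO_DEGREES = 10.0  # thermistor scale implied by the example (+1V -> +10 degrees)
TEMP_LIMIT = 50.0        # assumed CADH upper temperature limit


def device_level(true_volts, offset_volts=0.0):
    """ADC output; an SEU in the calibration RAM adds a constant offset."""
    return true_volts + offset_volts


def circuit_level(adc_volts):
    """Telemetry circuit passes the ADC reading through unchanged."""
    return adc_volts


def subsystem_level(telemetry_volts):
    """CADH converts the thermistor voltage to a temperature."""
    return telemetry_volts * VOLTS_TO_DEGREES


def spacecraft_level(temperature):
    """Spacecraft response to the reported CADH temperature."""
    return "safing" if temperature > TEMP_LIMIT else "nominal"


def propagate(true_volts, offset_volts=0.0):
    volts = circuit_level(device_level(true_volts, offset_volts))
    temperature = subsystem_level(volts)
    return temperature, spacecraft_level(temperature)


print(propagate(4.5))       # (45.0, 'nominal')
print(propagate(4.5, 1.0))  # (55.0, 'safing')
```

Whether the upset matters at the spacecraft level depends on the margin between the true temperature and the limit; with a lower starting voltage the same +1V offset would stay within limits.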
5.5 Summary

We have presented a methodology for viewing the propagation of SEUs from the device level to the spacecraft level of integration. Understanding the effect a single bit flip or transient has on the spacecraft is a key to reducing risk in spacecraft programs.

5.6 References

Section 6
SEE Mitigation: Methods of Reducing SEE Impacts
Kenneth A. LaBel, NASA Goddard Space Flight Center

6.1 Introduction

For simplicity's sake, it is convenient to classify system level SEE effects into two general categories: those that affect data responses of a device, and those that affect control of a device or system. Whereas there is some overlap between the two (an obvious example being a bit flip in a memory device that contains executable code for a processor), we may consider data errors to be those that occur in memory structures or data streams and control errors to be in other hardware such as microprocessors, power devices, or FPGAs.

All of the potential SEE mitigation methods may require that either additional hardware or software be added to the system design. The complexity and, in many cases, the increase in system overhead caused by the addition(s) are fairly linear with the power of the mitigation scheme.

The most cost efficient approach to meeting an SEE requirement may be an appropriate combination of SEE-hard devices and other mitigation. The cost, power, volume, performance, and availability of radiation-hardened devices often prohibit their use. Hardware or software design may serve as effective mitigation, but design complexity may present a problem. A combination of the two may be the most effective and efficient option.

6.2 Sample System Level Mitigation Techniques and Examples

6.2.1 Classification of System Level SEEs by Device Type

Much as we partition SEEs into two arenas, we may divide devices into two basic categories: those that are memory or data-related devices such as RAMs or ICs that are used in communication links or data streams, and those that are control-related devices such as a microprocessor, logic IC, or power controller. That is not to say that there is no overlap between the two categories. For example, an error could occur in the cache region of a microprocessor and cause a data error, or a data SEU (bit flip) might occur in a memory device that contains an executable program potentially causing a control SEU.

6.2.2 Mitigation of Memories and Data-Related Devices

The simplest method of mitigating errors in memory/data stream is to utilize parity checks. This method counts the number of logic one states (or "ones") occurring in a data path (i.e., an 8-bit byte or 16-bit word, etc...) [1]. Parity, usually a single bit added to the end of a data structure, states whether an odd or even number of ones were in that structure. This method detects an error if an odd number of bits are in error, but if an even number of errors occurs, the parity is still correct (i.e. the parity is the same whether 0 or 2 errors occur). Additionally, this is a "detect only" method of mitigation and does not attempt to correct the error that occurs.
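A minimal sketch of an even-parity check on an 8-bit value illustrates both the detection of a single error and the blind spot for double errors. The function names are chosen here for illustration only.

```python
def parity_bit(word):
    """Even-parity bit: 1 if the word holds an odd number of ones, else 0."""
    return bin(word).count("1") % 2


def parity_ok(word, stored_parity):
    """True if the word's parity still matches the stored parity bit."""
    return parity_bit(word) == stored_parity


data = 0b10110010
stored = parity_bit(data)                     # four ones, so the parity bit is 0

print(parity_ok(data, stored))                # True: no error
print(parity_ok(data ^ 0b00000100, stored))   # False: single bit flip detected
print(parity_ok(data ^ 0b00000110, stored))   # True: double bit flip goes unnoticed
```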

Another common error detection only method is called cyclic redundancy check (CRC) coding [2]. This scheme is based on performing modulo-2 arithmetic operations on a given data stream, then interpreting the result as a polynomial. The N data bits are treated as an (N-1)th-order polynomial. When encoding occurs, the data message is modulo-2 divided by the generating polynomial. The remainder of this operation then becomes the CRC character that is appended to the data structure. For decoding, the new bit structure which includes the data and CRC bits is again divided by the generating polynomial. If the new remainder is zero, no detectable errors were observed. A commonly used CRC code, especially for mass storage such as tape recorders, is the CRC-16 code which leaves a 16-bit remainder.
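The encode and decode steps can be sketched with a bit-serial modulo-2 division. This is a simplified variant that assumes the generator polynomial x^16 + x^15 + x^2 + 1 with a zero initial value, no bit reflection, and no final XOR; fielded CRC-16 implementations often differ in these conventions, so it should not be read as the format used by any particular recorder.

```python
CRC16_POLY = 0x8005  # x^16 + x^15 + x^2 + 1 (the x^16 term is implicit)


def crc16(data: bytes) -> int:
    """Remainder of data * x^16 divided (modulo 2) by the generator polynomial.

    Simplified variant: zero initial value, no bit reflection, no final XOR.
    """
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ CRC16_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def encode(data: bytes) -> bytes:
    """Append the 16-bit remainder to the data."""
    return data + crc16(data).to_bytes(2, "big")


def check(frame: bytes) -> bool:
    """Divide data plus CRC again; a zero remainder means no detectable error."""
    return crc16(frame) == 0


frame = encode(b"telemetry")
corrupted = bytearray(frame)
corrupted[3] ^= 0x10           # flip one bit in the stored frame
print(check(frame))            # True
print(check(bytes(corrupted))) # False
```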

Hamming code is a simple block error encoding (i.e., an entire block of data is encoded with a check code) that will detect the position of a single error and the existence of more than one error in a data structure [1]. Hamming strategy essentially states that if there are Q check bits generated using a parity-check matrix, then there is a syndrome represented by the Q-digit word that can describe the position of a single error. This is seen simply, for example, by having a syndrome (s) with s=000 being the no error condition in a single byte, s=001 being an error in bit 1 of the byte, and so on. By determining the position of the error, it is possible to correct this error. Most designers describe this method as "single bit correct, double bit detect". This EDAC scheme is common among current solid-state recorders flying in space [for example, 3-5]. When a system performs this EDAC procedure, it is called scrubbing (i.e., scrubbing of errors from clean or good data). An example would be an 80-bit wide memory bus having a 72-bit data path and 8-bits of Hamming code. This coding method is recommended for systems with low probabilities of multiple errors in a single data structure (e.g., use only with a single bit error condition in a byte-wide data field).
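The syndrome idea can be illustrated on a much smaller word than the 72-bit example. The sketch below assumes a 4-bit data value protected by three Hamming check bits plus one overall parity bit (an 8-bit word); the bit layout and function names are choices made for this illustration, not the layout of any flight recorder.

```python
DATA_POSITIONS = [3, 5, 6, 7]  # positions 1, 2, 4 hold check bits; 0 holds overall parity


def _syndrome(bits):
    """XOR of the positions (1..7) of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if bits[pos]:
            s ^= pos
    return s


def hamming_encode(nibble):
    """Encode 4 data bits into an 8-bit single-correct, double-detect word."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    s = _syndrome(bits)
    for p in (1, 2, 4):
        bits[p] = 1 if s & p else 0  # choose check bits so the syndrome becomes 0
    bits[0] = sum(bits[1:]) % 2      # overall parity makes the total count of ones even
    return bits


def hamming_decode(bits):
    """Return (nibble, status); status is 'ok', 'corrected', or 'double'."""
    bits = list(bits)
    s = _syndrome(bits)
    overall = sum(bits) % 2
    if s == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        bits[s] ^= 1                 # the syndrome is the position of the single error
        status = "corrected"
    else:
        return None, "double"        # parity even but syndrome nonzero: two errors
    nibble = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, status


word = hamming_encode(0b1011)
word[6] ^= 1                         # single upset
print(hamming_decode(word))          # (11, 'corrected')
word[2] ^= 1                         # second upset in the same word
print(hamming_decode(word))          # (None, 'double')
```

Scrubbing, in this picture, amounts to periodically decoding each stored word and writing back the corrected value before a second upset can accumulate in the same word.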

Other block error codes, while beyond the scope of this paper in terms of operational description, provide more powerful error correcting codes (ECCs). Among these, Reed-Solomon (R-S) coding is rapidly becoming widespread in its usage [6]. The R-S code is able to detect and correct multiple and consecutive errors in a data structure. An example [7] is known as (255,223). This translates to a 255 byte block having 223 bytes of data with 32 bytes of overhead at the end of the message. This particular R-S scheme is capable of correcting up to 16 consecutive bytes in error. This R-S encoding scheme is available in a single IC as designed by the NASA VLSI Design Center [7]. A modified R-S scrubbing for an SSR has been performed in-flight by software tasks as well [5].
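The correction capability quoted for the (255,223) code follows from a standard property of R-S codes: an $(n, k)$ code with $n - k$ check symbols can correct up to $t$ symbols in error, where

$$t = \left\lfloor \frac{n - k}{2} \right\rfloor = \frac{255 - 223}{2} = 16 \text{ bytes}.$$

The price is the overhead: only $k/n = 223/255 \approx 0.87$ of each block carries data.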

Convolutional encoding [8], again outside the scope of operational description, is able to detect and correct multiple bit errors, but differs from block coding by interleaving the overhead or check bits into the actual data stream rather than being grouped into separate words at the end of the data structure. This style of encoding is typically considered for usage in communication systems and provides good immunity for mitigating isolated burst noise.

System level protocol methods are best understood by illustration. The SEDS MIL-STD-1773 fiber optic data bus has been successfully flying since July, 1992 [9]. This system utilizes among its error control features two methods of detection: parity checks and detection of a non-valid Manchester encoding of data. This military standard has a system level protocol option of retransmitting or retrying a bus transaction up to three times if the error detection controls are triggered. Thus, the error detection schemes are via normal methods (parity or non-valid signaling), while the error correction is via retransmission.
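The detect-then-retry idea can be sketched generically. The code below is not an implementation of MIL-STD-1773; it assumes a hypothetical `send_once` callable that reports whether the error detection logic was triggered, and it interprets "up to three times" as three retries after the initial attempt.

```python
def transact(send_once, max_retries=3):
    """Attempt a bus transaction, retrying when error detection trips.

    send_once: callable returning (data, error_detected).
    Returns (data, attempts) on success, or (None, attempts) if every
    attempt was flagged by the error detection logic.
    """
    attempts = 0
    for _ in range(1 + max_retries):
        attempts += 1
        data, error_detected = send_once()
        if not error_detected:
            return data, attempts
    return None, attempts


# Simulated link: the first two attempts are hit by a detected error.
responses = iter([(None, True), (None, True), (b"ok", False)])
print(transact(lambda: next(responses)))  # (b'ok', 3)
```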

Retransmission of data on a communication link may be autonomously performed as in the example above or may be accomplished via ground intervention. For example, if data collected in a SSR shows an unacceptable BER during a "pass" or downlink transmission to a ground station, the station may then issue a command to the spacecraft requesting retransmission of all or a selected portion of that data.

All of the above methods provide ways of reducing the effective BER of data storage areas such as SSRs, communication paths, or data interconnects. Table 6.1 summarizes sample EDAC methods for memory or data devices and systems.

<table>
<thead>
<tr>
<th>EDAC Method</th>
<th>EDAC Capability</th>
</tr>
</thead>
<tbody>
<tr>
<td>Parity</td>
<td>Single bit error detect</td>
</tr>
<tr>
<td>CRC Code</td>
<td>Detects if any errors occurred in a given data structure</td>
</tr>
<tr>
<td>Hamming Code</td>
<td>Single bit correct, double bit detect</td>
</tr>
<tr>
<td>RS Code</td>
<td>Correct consecutive and multiple bytes in error</td>
</tr>
<tr>
<td>Convolutional encoding</td>
<td>Corrects isolated burst noise in a communication stream.</td>
</tr>
<tr>
<td>Overlying protocol</td>
<td>Specific to each system implementation</td>
</tr>
</tbody>
</table>

### 6.2.3. Mitigation of Control-related Devices

While the above techniques are useful for data SEUs, they may also be applicable to some types of control SEUs (microprocessor program memory again being an example). Other devices such as VLSI circuitry or microprocessors have more complex difficulties to be aware of. Potential hazard conditions include items such as the issuance of an incorrect spacecraft command to a subsystem or a functional interruption of the system operation. Microprocessors are among the many new devices that have "hidden" registers. These are registers that are not readily accessible external to the device (i.e., on I/O pins), but provide internal device control and whose SEUs could affect the device or system operation.

Microprocessor software typically has tasks or subroutines dubbed Health and Safety (H&S) which may provide some mitigation means directly applicable to SEE [10]. These H&S tasks may perform memory scrubbing utilizing parity or other methods on either external memory devices or registers internal to the microprocessor. The software-based mitigation methods might also use internal microprocessor timers to operate a watchdog timer (see below) or to pass H&S messages between spacecraft systems. A relevant example would be if the software provided a parity check on the stored program memory when accessing an external or internal device such as an electrically erasable programmable read only memory (EEPROM). If a parity error was detected on a program memory fetch, the software might then access (read) the memory location a second time, place the system into a spacecraft safing or safe operations mode, or read the program from a redundant EEPROM.

Watchdog timers may be implemented in hardware or software or through a combination of both. Typically, watchdogs are thought of as an "I'm okay" method of error detection. That is, a message indicating the health of a device or system is sent from one location to another. If the message is not received by the second location within a set time period, a "time out" has occurred. In this instance, the system then may provide an action to the device, box, subsystem, etc... Watchdog timers may be implemented at many levels: subsystem-to-subsystem, box-to-box, board-to-board, device-to-device, etc... Watchdogs may be active or passive. The different types are best understood by example.

Example 1 is an active watchdog. Device A has to send an "I'm okay" pulse on a once per second basis to an independent device B. B, for example, is an interrupt controller for a microprocessor system. If A fails to send this pulse within the allocated time period, device B "times out" and initiates a recovery action such as issuing a reset pulse, removing power, sending a telemetry message to the ground, placing the spacecraft into safing mode, etc... B's actions are very specific to each mission scenario and spacecraft mode of operation.

Example 2 is a passive watchdog timer. In spacecraft X's normal operating scenario, it receives uplink messages (commands, code patches, table loads, etc...) from the ground station every twelve hours. There is a timer on-board the spacecraft that times out if no uplink is received within this 12 hour (or perhaps, a 24 hour) time frame. The spacecraft then initiates an action such as a switch to a redundant antenna or uplink interface, a power cycling of the uplink interface, etc... What makes this a passive watchdog is that no specific "I'm okay" needs to be sent between peers; monitoring of normal operating conditions is sufficient.
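Both examples share the same timing logic and differ only in what refreshes the timer. The sketch below uses simulated time and a hypothetical `Watchdog` class: in the active case the refresh is an explicit "I'm okay" pulse, while in the passive case it is a side effect of normal uplink traffic. The timeout values mirror the examples; the recovery actions are placeholders.

```python
class Watchdog:
    """Times out if not refreshed within timeout_s (simulated time)."""

    def __init__(self, timeout_s, on_timeout):
        self.timeout_s = timeout_s
        self.on_timeout = on_timeout
        self.last_refresh_s = 0.0

    def refresh(self, now_s):
        self.last_refresh_s = now_s

    def poll(self, now_s):
        """Call periodically; fires the recovery action on a time out."""
        if now_s - self.last_refresh_s > self.timeout_s:
            self.on_timeout()
            self.last_refresh_s = now_s  # restart the timing interval
            return True
        return False


actions = []
active = Watchdog(1.0, lambda: actions.append("reset device A"))
passive = Watchdog(12 * 3600.0, lambda: actions.append("switch uplink interface"))

active.refresh(0.9)      # explicit "I'm okay" pulse from device A
active.poll(1.5)         # 0.6 s since the last pulse: no action
active.poll(2.5)         # 1.6 s since the last pulse: time out

passive.refresh(3600.0)  # side effect of receiving a normal uplink
passive.poll(50000.0)    # 46400 s without an uplink exceeds 43200 s: time out
print(actions)           # ['reset device A', 'switch uplink interface']
```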

Redundancy between circuits, boxes, subsystems, etc... provides a potential means of recovery from an SEE on a system level. Autonomous or ground-controlled switching from a prime system to a redundant spare provides system designers an option that may or may not fit within mission-specific spacecraft power and weight restrictions. Redundancy between boxes is relatively straightforward, therefore we present a lower system level redundancy example. The MIL-STD-1773 fiber optic data bus is a fully redundant bus with an A side and a B side. Redundancy, in this implementation, allows the system designer to automatically switch from the prime (A) side to the redundant (B) side for all transactions in case of a failed transmission on the A bus, or to retry on the B side in case of an A failure, or wait for a command to switch to B if the bus BER on the A side exceeds a specified limit, etc...

Operating two identical circuits with synchronized clocking is termed a lockstep system. One normally speaks of lockstep systems when discussing microprocessors [11]. Error detection occurs if the processor outputs do not agree, implying that a potential SEU has occurred. The system then has the option of reinitializing, safing, etc... It must be pointed out that for longer spacecraft mission time frames, lockstep conditions for commercial devices must be well thought out. In particular, the TID degradation of the commercial devices must be examined for clock skew with increasing dosage. This may potentially cause "false" triggers between two such devices if each responds to dosage even slightly differently.
Voting is a method that takes lockstep systems one step further: having three identical circuits and choosing the output that at least two agree upon. Katz, et al. [12] provide an excellent example of this methodology. They have proposed and SEU-tested a triple modular redundancy (TMR) voting scheme for FPGAs, i.e., three voting flip-flops per logical flip-flop. FPGAs, one should note, replace older LSI circuits in many systems by providing higher gate counts and device logic densities. Thus, the IC count as well as the physical space required for spacecraft electrical designs may be reduced. The TMR scheme proposed does not come without an overhead penalty; one essentially loses over two-thirds of the available FPGA gate count by implementing this method.
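The difference between lockstep detection and voting can be shown with a few lines of bitwise logic. This is a generic software sketch of majority voting, not the FPGA circuit of [12]; the function names are illustrative.

```python
def lockstep_mismatch(out_a, out_b):
    """Two-copy lockstep: detects a disagreement but cannot tell which copy is wrong."""
    return out_a != out_b


def tmr_vote(a, b, c):
    """Bitwise majority of three copies: each output bit takes the value
    that at least two of the inputs agree on."""
    return (a & b) | (a & c) | (b & c)


good = 0b1010
upset = good ^ 0b0100                  # one copy suffers a bit flip

print(lockstep_mismatch(good, upset))  # True: error detected, not corrected
print(tmr_vote(good, good, upset) == good)          # True: the upset is outvoted
print(tmr_vote(good ^ 1, good ^ 4, good) == good)   # True: flips in different bits
```

The vote fails only when two copies are upset in the same bit position, which is why voted registers are usually refreshed with the voted value.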

The discussion of FPGAs brings out an interesting point: systems are becoming increasingly complex and integrated. Gate arrays, FPGAs, and application specific ICs (ASICs) are becoming increasingly commonplace in electrical spacecraft designs. Liu and Whitaker [13] provide one such SEU hardening scheme to provide SEU immunity in the custom IC design phase that is applicable to spacecraft designs. This method provides a logic configuration which separates the p-type and the n-type diffusion nodes within a memory circuit.

The use of "good" engineering practices for spacecraft contributes another means of SEU mitigation [14]. Items such as the utilization of redundant command structures (i.e., two commands being required to trigger an event usually with each command having a different data value or address), increased signal power margins, and other failsafe engineering techniques may aid an SEU hardening scheme.
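A redundant command structure can be sketched as a two-step "arm then fire" sequence, so that a single corrupted command cannot trigger the event on its own. The class name and the command codes below are hypothetical and chosen only so that the two codes differ in every bit.

```python
class TwoStepCommand:
    """Triggers an event only when FIRE directly follows ARM."""

    ARM = 0xA5   # hypothetical command codes with complementary bit patterns
    FIRE = 0x5A

    def __init__(self):
        self.armed = False

    def receive(self, code):
        """Process one command code; return True only if the event is triggered."""
        if code == self.ARM:
            self.armed = True
            return False
        fired = self.armed and code == self.FIRE
        self.armed = False  # any code other than ARM clears the armed state
        return fired


cmd = TwoStepCommand()
print(cmd.receive(TwoStepCommand.FIRE))  # False: FIRE without ARM is ignored
print(cmd.receive(TwoStepCommand.ARM))   # False: armed, waiting for FIRE
print(cmd.receive(TwoStepCommand.FIRE))  # True: both commands received in order
```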

These and other good engineering practices usually allow designers to be innovative and discover sufficient methods for SEU mitigation as needed. The authors would like to point out that the greatest risk to a spacecraft system and conversely, the greatest challenge to an electrical designer is having unknown device or system SEE characteristics.

6.2.4 Treatment of Destructive Conditions and Mitigation

Destructive SEE conditions may or may not be recoverable depending on the individual device's response. Hardening from the system level is difficult at best, and in most cases, not particularly effective.

This stems from several concerns. First, non-recoverable destructive events such as single event gate rupture (SEGR) or burnout (SEB) require redundant devices or systems be in place since the prime device fails when the event occurs. SEL may or may not have this same failure with each malfunction response being very device specific. Microlatch, in particular, is difficult to detect since the device's current consumption may remain within specification for normal device operation. LaBel, et al. [15] have demonstrated the use of a multiple watchdog timeout scheme as a potential mitigation. In this instance, the first level watchdog acts as an "I'm okay" within a local circuit board. If this watchdog is triggered, a reset pulse is issued to the local circuitry. If this trigger-reset scenario occurs N times continuously or fails to recover the board within X seconds, a secondary watchdog is triggered that removes power from the board. Power is restored via a ground command. This SEDS system was successfully SEL tested at BNL.
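The escalation logic of such a multiple watchdog scheme can be sketched as a small state machine. This is an illustration only, not the SEDS design: it models the "N consecutive resets" condition and omits the "X seconds" recovery timer, and the class and method names are assumptions.

```python
class TwoLevelWatchdog:
    """First level resets the board; repeated failures remove power."""

    def __init__(self, max_consecutive_resets):
        self.max_consecutive_resets = max_consecutive_resets
        self.consecutive_resets = 0
        self.powered = True

    def heartbeat_received(self):
        """A good "I'm okay" ends any run of consecutive resets."""
        self.consecutive_resets = 0

    def heartbeat_missed(self):
        """Return the action taken for a missed heartbeat."""
        if not self.powered:
            return "none"
        self.consecutive_resets += 1
        if self.consecutive_resets >= self.max_consecutive_resets:
            self.powered = False  # secondary watchdog: remove board power
            return "power_off"
        return "reset"            # first-level watchdog: reset local circuitry

    def ground_command_power_on(self):
        """Power is restored only by ground command."""
        self.powered = True
        self.consecutive_resets = 0


wd = TwoLevelWatchdog(max_consecutive_resets=3)
print([wd.heartbeat_missed() for _ in range(4)])
# ['reset', 'reset', 'power_off', 'none']
```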

For individual devices, a current limiting circuit that may also cycle power is often considered. However, the failure modes of this protection circuit are sometimes worse than finding a less SEL-sensitive device (e.g., infinite loop of power cycling may occur). Hence, SEL should be treated by the designer on a case-by-case basis considering the device's SEL response, circuit design, and protection methods. Please note that multiple latchup paths are present in most circuits, each with a different current signature. This makes the designer's job difficult in specifying the required current limit.

A concern similar to microlatch exists if, for example, current limiting is performed on a card or higher integration level and not on an individual device. A single device might enter a SEL state with a current sufficient to destroy the device, but not at a high enough current level to trigger the overcurrent protection on a card or higher level. The key here is again to know the device's SEL current signatures for each of its latchup paths.
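A simple current budget shows how a card-level limit can mask a latchup in a single low-power device. All device names and current values below are invented for illustration.

```python
def card_limit_trips(nominal_currents_ma, latched_device, latchup_current_ma,
                     card_limit_ma):
    """True if a latchup in one device pushes total card current over the limit."""
    others_ma = sum(current for name, current in nominal_currents_ma.items()
                    if name != latched_device)
    return others_ma + latchup_current_ma > card_limit_ma


# Hypothetical card: nominal device currents in mA, total 1470 mA.
card = {"adc": 20, "fpga": 400, "sram": 150, "cpu": 900}

# The ADC latches at 300 mA (15x nominal); the card total rises to only 1750 mA.
print(card_limit_trips(card, "adc", 300, card_limit_ma=2500))  # False: not detected
# A limit placed on the ADC supply alone (say 60 mA) would have tripped.
print(300 > 60)                                                # True
```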

One other method of SEL protection, riskier because of the potential time lag before detection and recovery, is best demonstrated by example. An ADC has a known SEL sensitivity. The device's current consumption is gathered periodically via a control processor. If the read current exceeds a specified limit, power cycling is performed. This method may also use either telemetry data points for ground intervention or a device’s specific or internal calibration parameters to be successful [16].

6.2.5 Sample Methods of Improving Designs for SEE Performance

By changing the design of a circuit or certain circuit parameters, improved SEU performance may be gained. Marshall, et al. [17] and LaBel, et al. [18] have demonstrated several ways of improving a fiber optic link’s SEU-induced BER. First is the selection of diode material (typically, III-V versus Si). The use of a III-V material results in a significantly smaller device sensitive volume. A second way to reduce BER is by the selection of the method for received signal detection (edge-triggered versus level sensitive) with a level-sensitive system being less SEU sensitive. A third scheme for BER reduction is to define a dynamic sensitive time window. This method essentially states that there are only certain time periods when the occurrence of a radiation-induced transient will have an observed effect. Lastly, by increasing the optical power margin, the BER is also reduced. These and similar techniques may apply to other designs as well.

6.2.6 Sample Methods of Realistic SEE Risks and Usage

Deciding whether an SEE in a device has a risk factor that makes a device usable in spaceflight or not is complex at best. Many factors weigh into the concern: mission environment, device test data, modes of operation, etc... Several sample system issues may clarify the types of issues that are involved.

The SEDS RPP uses separate EEPROMs for its boot and application software storage on-board the SAMPEX spacecraft [19]. These particular EEPROMs have shown a sensitivity to SEUs while being programmed, albeit not during read operations. In addition, stuck bits may occur during programming operations at LETs above Ni-58 (i.e., there is a low probability of occurrence in-flight). Since its launch in July of 1992, the application software EEPROMs have successfully been reprogrammed in-flight twice, but with certain constraints. These mission-specific constraints include: the time period for programming uses a relatively proton and heavy ion flux-free portion of the orbit, and the boot EEPROM is not programmed during flight. Why was the risk taken? The SEDS verifies the newly programmed data by the use of a CRC code as well as by ground station activities prior to loading the new executable software for SEDS operations. If an incorrect byte was programmed into the device, this mitigation scheme would catch it. If a stuck bit is discovered in the EEPROM, a recovery option is built into the system that provides a memory mapping around the failed location. Lastly, since the actual time window during programming when the device is susceptible to error is very small, few, if any, particles capable of causing an anomaly are seen at the device. However, it should be noted that the risk might be deemed unacceptable if continuous programming of the EEPROM was being performed throughout the mission’s orbit.

The SEDS system has previously been pointed out for its use of system level error control in its fiber optic data bus as well as for the use of Hamming code EDAC on its SSR[3,9]. The SEDS system also has a multi-layer system of watchdog timers that monitor system operation [19]. The layers are as follows:

- a software task executing in the main spacecraft microprocessor that times out if a value is not passed by a second software task and that restarts the processor from a known state,
- a programmable interrupt signal from the main spacecraft microprocessor that provides a reset pulse to an external timer circuit that times out if not written to within an N second window causing a hardware reset pulse to occur to the processor,
- if multiple reset pulses occur consistently, this same external timer circuit provides a H&S message to a secondary processor box whereupon the secondary takes action,
- an "I'm okay" pulse between the prime and secondary processors that must occur once every X seconds, failing which the secondary processor may remove/cycle power to the main processor or place the spacecraft in safehold until ground station intervention, and
- a multi-day timer that places the spacecraft into safehold if proper system operations have not occurred within a 24 hour period.
As one may observe, mitigation methods for the SEDS are performed on several levels: software, device, circuit/card, box, and subsystem/spacecraft. Also note the use of both active and passive watchdogs.
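
The timeout and escalation behavior of the second and third layers can be illustrated with a small simulation. This is a sketch under stated assumptions and not the SEDS design: the class `Watchdog`, the function `escalate`, the 2 second window, and the reset limit of 3 are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Watchdog:
    """Times out unless kick() is called at least once per window."""
    window_s: float           # hypothetical timeout window ("N seconds")
    last_kick_s: float = 0.0  # time of the most recent kick

    def kick(self, now_s: float) -> None:
        self.last_kick_s = now_s

    def expired(self, now_s: float) -> bool:
        return now_s - self.last_kick_s > self.window_s


def escalate(consecutive_resets: int, reset_limit: int) -> str:
    """Map repeated hardware resets to an action (assumed escalation policy)."""
    if consecutive_resets == 0:
        return "none"
    if consecutive_resets < reset_limit:
        return "hardware_reset"
    return "notify_secondary"


wd = Watchdog(window_s=2.0)
wd.kick(now_s=10.0)
print(wd.expired(now_s=11.5))  # False, still inside the window
print(wd.expired(now_s=12.5))  # True, the window has elapsed
print(escalate(consecutive_resets=3, reset_limit=3))  # notify_secondary
```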

6.3 Summary

We have presented a sampling of information regarding SEE mitigation from the systems design level. This has included defining functional impacts of SEEs, examples of spacecraft designs, potential methods of SEE mitigation, as well as an example of realistic risks in space utilization of a sensitive EEPROM.

6.4 Acknowledgements

We would like to acknowledge the insight provided by Dr. Paul Marshall in numerous discussions prior to the drafting of this document.

6.5 References

Section 7
Managing SEEs: System Level Planning
Paul Marshall, Consultant

7.1 Introduction

In this section we describe how SEECA applies to system level requirements generation and flowdown. Figure 7.1 depicts the flow of the criticality analysis as it occurs with other stages of radiation evaluation activities in the generalized system design process. From this figure, we see the bold outlined boxes which address specific stages of the process pertaining to radiation effects and analysis. Recognizing the time and design stage progression from right to left in the figure, these tasks can be broken out of the overall effort as follows:

[1] Ionizing radiation environment prediction
[2] Ideal geometry total ionizing dose analysis
[3] Top-level total ionizing dose requirements definition
[4] Top level SEE functional requirements
[5] 3-D geometry ray trace dose analysis
[6] Part level total dose requirements
[7] Part level SEE requirements
[8] SEE testing and design verification
[9] Total ionizing dose testing and design verification
As the figure depicts, the SEECA methodology involves steps 4, 7, and 8. Also, it ties in with system and subsystem failure assessment and system/board level functional analyses as is indicated.

In the remainder of this section, we track these processes and indicate the roles of the various specialists (system engineers, project managers, design engineers, and radiation effects engineers) through the timeline as it progresses from mission definition to system design verification.

### 7.2 System Level Requirements for SEE

Mission planners ultimately decide the radiation environment in which the satellite will need to function, and this is usually done in a manner which maximizes performance while minimizing radiation exposure. Even so, most missions will at some point encounter considerable particle exposures whether in the South Atlantic Anomaly or from energetic solar particles ejected from solar flares. Also, as discussed in Section 3, the heavily ionizing cosmic ray environment extends to all orbits to some degree. The anticipated particle environment then follows from the orbit and the mission time with respect to solar activity, and the models for predicting these environments have been described in section 3.

As part of the requirements, there should be an unambiguous statement of the environment in which the system will need to operate as it pertains to SEE. For this purpose, the total ionizing dose environment or depth dose curves alone are not adequate; the requirements specification document should also include detailed information on the various SEE environments. For example, the cosmic ray environment should be specified as an LET spectrum with identification of the models and the conditions used in calculating the spectrum. Likewise, proton environments should be specified according to the worst-case fluxes and energy. If the requirements cover solar flare conditions separately from "normal" environmental conditions, there should be spectral and flux information provided here as well. Typically, these environmental descriptions will assume some nominal shield thickness (e.g. 80 mils Al), and provisions should be made for modifications to the specified environment to allow calculation of SEE rates for more heavily shielded parts.

The system level requirements for SEE performance should be viewed as largely independent of the orbital particle environment. Even so, when establishing system requirements, it is essential that they be expressed in terms of the need for satisfactory performance in the presence of ionizing particles. These requirements should account for all the possible ways in which single event effects could compromise mission performance. The two main categories are system availability and information integrity.

**System Availability**

System availability requirements address extreme events leading to possible loss of mission as well as less severe events which might require ground station or possibly autonomous reset with a brief disruption in system performance. It is the decision of mission planners to determine what level of temporary outage is acceptable (and affordable), along with establishing appropriate ways to restore operability.

Typically, availability requirements for single event effects have been expressed in general terms along the lines of "no single event effect (e.g. latchup or any other potentially catastrophic SEE related failure mode) shall be allowed to result in the loss of the mission." In terms of requirements, this represents the SEE equivalent to the conventional reliability requirement of not allowing single point failures to result in mission loss, though the analysis for assessing risk and the details of meeting the requirement will certainly differ between the SEE and hard failure cases. System availability is also an appropriate way of expressing requirements for specific mission functions, which might flow directly into a subsystem level availability requirement.

Additional availability requirements can be specified in terms of the severity of disruption, the acceptable frequency of disruption, or the maximum duration of disruption, or some combination of the three. In addition to SEE induced hard failures, soft errors may occur which disrupt system performance but allow complete system recovery. For example, it might be required that normal mission functions not be disrupted by any single event effect causing an outage that requires ground station intervention more than once per year, and that autonomously reset disruptions occur no more than once daily, with system recovery fast enough to yield an overall system availability of 0.9993 (corresponding to the system being unavailable for 1 minute per day on average).
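
The arithmetic behind the availability figure is simple enough to check directly. The sketch below converts between average downtime per day and availability; the function names are ours and are only illustrative.

```python
MINUTES_PER_DAY = 24 * 60  # 1440


def availability(downtime_min_per_day: float) -> float:
    """Fraction of time the system is available, given average daily downtime."""
    return 1.0 - downtime_min_per_day / MINUTES_PER_DAY


def allowed_downtime_min_per_day(avail: float) -> float:
    """Average daily downtime (minutes) permitted by an availability requirement."""
    return (1.0 - avail) * MINUTES_PER_DAY


print(f"{availability(1.0):.4f}")                     # 0.9993
print(f"{allowed_downtime_min_per_day(0.9993):.2f}")  # 1.01
```

With at most one autonomous reset per day, this figure implies that each recovery must complete in roughly one minute.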

SEE requirements might also reflect the compromise between mission objectives and cost constraints by allowing for less stringent performance under extreme circumstances. For example, if the mission science objectives of a LEO platform do not require highest availability levels while in the SAA, then the SEE/SEU requirements might be relaxed in that region with substantial cost savings. Similarly, if more frequent disruption of operations could be tolerated for short durations over the course of a multi-year mission, then requirements could be relaxed for anticipated solar flare related particle bombardment which might be several orders-of-magnitude more harsh than daily peak particle fluxes under normal conditions.

**Information Integrity**

As a separate requirement from system or subsystem availability, the mission might consider the payload functional requirements in terms of information integrity. In many cases, soft errors can occur in a relatively benign manner which affect data without altering the system functions. For example, a soft error in a sensor A/D converter or in a data path might result in a glitch in an image. Such errors do not interrupt the flow of information, but rather degrade its quality.

These less severe types of single event errors lend themselves to EDAC techniques as described in Section 6. The implementation of EDAC and the type of approach selected should be based on the following: the environment, the hardware, and the requirements for data integrity. The establishment of reasonable requirements at the mission planning stage should lead to acceptable, but not necessarily error free, performance within constraints of cost and design complexity.

The form of the data integrity requirement for the payload will likely reflect the type of information being collected and how it is handled between collection and downlink. As an example, a charge-coupled device (CCD) for earth imaging might be the source of a data stream which flows from a camera through a data bus to a solid state recorder and then to a downlink. In such a case, the top level requirement might be, for example, no more than 3 bad pixels per frame of imagery. Another form for a top level requirement in this example might be a bit error rate (BER) requirement. The establishment of such top level requirements provides the basis for subsequent SEE criticality subsystem assignments as the error budget is allocated to the various potential sources (e.g. proton events in the CCD’s pixels, SEU in the camera ADC, bit errors in the data bus, soft errors in the solid state recorder, etc.). Sections 7.3 and 8 will illustrate the details of this process using specific examples.
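
As a rough illustration of how such an error budget might be divided, the sketch below splits a top level allowance of 3 bad pixels per frame among the four sources named above. The weights are hypothetical placeholders; in practice they would come from engineering judgment about how difficult each source is to harden, not from a formula.

```python
def allocate_budget(total: float, weights: dict[str, float]) -> dict[str, float]:
    """Split a top level error allowance in proportion to the given weights."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: total * w / weight_sum for name, w in weights.items()}


# Hypothetical relative weights for each potential error source
weights = {
    "ccd_pixels": 4,
    "camera_adc": 2,
    "data_bus": 1,
    "solid_state_recorder": 1,
}

budget = allocate_budget(3.0, weights)
for name, share in budget.items():
    print(f"{name}: {share:.3f} bad pixels per frame")
```

Because the shares always sum to the top level allowance, reallocating later in the design only requires changing the weights.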

Just as with the availability requirements already discussed, the top level data integrity requirements could be tailored to the mission needs in terms of different performance levels for different aspects of the environment (e.g. SAA and solar flare protons). Whenever the most demanding performance requirement can be divorced from the most severe environmental conditions, mission complexity and cost can be reduced, and it is the proper expression of the top level system requirements which allows this.

In summary, SEECA serves as the foundation for developing top level SEE requirements through both preliminary and detailed design phases. These requirements are essential to provide reasonably reliable and acceptable mission performance within the constraints of satellite complexity and cost. Top level requirements should assure both the availability of the satellite to perform its designed function, and the integrity of the information provided. Where it makes sense, cost and complexity savings can result from requirements which are multi-tiered, with relaxed performance required during extreme environmental conditions.

The generation of top level SEE requirements should follow from a coordinated effort between mission planners, systems engineers, radiation environment specialists, and radiation effects engineers. Ultimately, compliance with these top level requirements should be demonstrated with test data and analysis. Too often, improper or incomplete SEE requirements are generated, resulting in ambiguous design objectives. If the top level requirements do not provide sufficient guidance with respect to SEE, then the procured system either should not be expected to function adequately or the mission costs will not be minimized.

### 7.3 Criticality Assignments

As a part of mission planning, functional requirement definitions for each primary function are established, and this may occur without consideration of radiation effects. As part of the single event effects assessment, these same functions must be ranked according to the degree of severity their temporary disruption or permanent loss would impose. As discussed in Section 2.3, the SEE criticality of a given mission operation is assigned along these functional rather than component or subsystem boundaries. Section 2.4 also has suggested a hierarchical scheme for ranking SEE criticality as error-functional, error-vulnerable, or error-critical corresponding to little or no concern, low rates acceptable, and no events acceptable respectively. The decision tree in figure 2.6 provides a means for determining the severity of a single event based on the criticality ranking of the function which it affects.

Functions can be broadly sorted into payload versus bus groupings. Bus functions would typically include Telemetry and Control, Power and Power Distribution, Data Bus and Mass Memory Storage, Downlinks, etc. whereas payload functions would tend to be more mission specific and include things such as UV / Visible Imaging, Infrared Imaging, Environment Monitors, etc. Obviously, all are important functions, but some (especially those associated with the bus) are clearly mission critical. Even though Telemetry and Control or other essential functions are always protected against any single point failure by dual redundant hardware architectures, it is usually assumed that loss of a redundant portion of a critical subsystem should not be allowed to occur due to a SEE. Thus all subsystems supporting mission critical functions would typically be designed assuming error-critical levels.
Other functions, for example a secondary experiment payload to evaluate a new technology, might be considered of less importance, and the only mission imposed requirement might be that a failure within the experiment, SEE induced or otherwise, must not affect the host. Even so, the experiment designers would likely have considerable investments in the experiment and would consequently impose their own higher level criticality rankings to assure the success of the experiment.

In between these two extremes we have the error-vulnerable category in which a certain number of errors could be tolerated or mitigated with acceptable performance. Many satellite functions are inherently error-critical, but wherever error-critical ratings can be avoided, they should be. The error-vulnerable category allows considerable flexibility in providing acceptable performance with reliance on less expensive parts and less complex systems.

Since SEE is actually a catch-all category comprised of several types of effects (see Section 4), realistically, the analysis tree of figure 2.6 should be evaluated for the consequences of each type of effect. For example, a payload function might be considered error-vulnerable for soft errors, but error-critical for hard errors. This might translate to use of a memory with sensitivity to proton-induced upsets and the use of EDAC to meet performance requirements, but require that it not latch up or exhibit SEE induced stuck bits. Indeed, functions which might be susceptible to hard errors from stuck bits, destructive latchup, or gate rupture would usually lead to more restrictive criticality ratings for those effects than for soft errors.
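
One way to record per-effect ratings for a function is sketched below. The three criticality names follow the scheme of Section 2.4; the effect names, the example ratings, and the screening rule in `part_acceptable` are hypothetical and are not a reproduction of the decision tree in figure 2.6.

```python
from enum import Enum


class Criticality(Enum):
    ERROR_FUNCTIONAL = "little or no concern"
    ERROR_VULNERABLE = "low rates acceptable"
    ERROR_CRITICAL = "no events acceptable"


# Hypothetical ratings for one payload memory function, by effect type
payload_memory = {
    "soft_error": Criticality.ERROR_VULNERABLE,
    "stuck_bit": Criticality.ERROR_CRITICAL,
    "latchup": Criticality.ERROR_CRITICAL,
}


def part_acceptable(ratings: dict, part_effects: set) -> bool:
    """Reject a candidate part that exhibits any effect rated error-critical.

    part_effects is the set of effect types observed for the part in test.
    """
    return not any(ratings[e] is Criticality.ERROR_CRITICAL for e in part_effects)


print(part_acceptable(payload_memory, {"soft_error"}))             # True
print(part_acceptable(payload_memory, {"soft_error", "latchup"}))  # False
```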

### 7.4 Allocation of SEE Requirements to Subsystems

As the mission development progresses from planning to satellite conceptual design, the satellite functions are divided across various hardware subsystems, each of which will have to perform within certain measures to meet system top level functional requirements. Along with the division of satellite functions across these subsystems, as described in Section 2, the preliminary design phase will also include a set of derived SEE requirements which will flow out of the top level SEE requirements.

As with the case of top level requirements, the subsystem derived requirements should be expressed in terms of availability and, where appropriate, information integrity. It is the role of the team comprised of the radiation environment and effects specialists, the subsystem lead engineers, and the system engineers to establish the subsystem level derived requirements based on the subsystem function, as described in Section 2.3. The budgeting of availability and information integrity requirements may occur across multiple subsystems where those subsystems are functionally related. In no case should the availability or performance of the subsystem (or collection of functionally related subsystems) be designed with SEE vulnerability in excess of that allowed based on the functional criticality.

In terms of the example set forth in Section 7.2, the mission requirement might be for the collection of image data with a CCD camera. Functionally, this requires several subsystems including Telemetry and Control, Pointing and Tracking, Power Distribution, the CCD Camera Payload, the High Speed Data Bus, the Solid State Recorder, and the Downlink. Obviously, a number of these are mission critical, and will carry error-critical criticality ratings for that reason.

However, the transmission of CCD imagery, which might be a primary mission objective, would not necessarily be deemed error-critical, and the costs associated with guaranteeing uninterrupted, error-free data might be prohibitive. In this case, availability and information integrity allowances could be applied to the function of CCD image collection and transmission in the top level requirements. It is then the task of the engineering team to allocate this error allowance between the CCD Camera, Data Bus, and Solid State Recorder subsystems. This is typically done along with functional requirements definition in the preliminary design phase (see Section 2.3), and it necessarily relies on past experience and educated guesswork with anticipation of the trades associated with the degree of difficulty in hardening against or tolerating SEE in one subsystem versus another.

In this manner, functional requirements from the top level and associated SEE criticality levels for those functions are translated into SEE requirements at the system and subsystem hardware levels. This allocation of error allowances necessarily must occur early in the preliminary design, but it may be a dynamic process which continues into the detailed design and through test and evaluation phases. With system cost and complexity always guiding the trades, the reallocation of SEE error allowances may be required due to a number of factors, such as the availability (or cost) of SEE hardened parts or test results on candidate components indicating different sensitivities in ground radiation tests than anticipated based on initial information. In this sense, there exists an advantage to satellite procurement approaches which allow for allocation and modification of error budgets among various subsystem suppliers.

7.5 Detailed subsystem SEE design and analysis

At this point we have established functional SEE requirements with assigned criticality levels, which in turn have been
applied to error allocation budgets at the hardware subsystem level. It is now the task of the subsystem engineering
team to allocate their error budgets among the various segments of the subsystem in a manner which minimizes the
system cost and complexity. Again, this occurs early in the subsystem design and may be modified iteratively as the
detailed design progresses for the reasons previously stated. At this level the trade space involves component choice
selection and error mitigation approaches, and now the environment details are incorporated to predict SEE rates (See
Sections 4 and 5) and evaluate the efficacy of the candidate design approaches.

As part of the evaluation, it is necessary to review candidate approaches to assess the possible SEE related failure modes
which may occur. This represents the equivalent to the familiar Failure Modes Analysis from conventional reliability
analysis, and for complex logic microcircuitry, it represents a formidable challenge. This analysis must necessarily be
coordinated between the radiation effects experts and the design engineers, and its success will rely on knowledge of
the susceptibilities of the candidate components to the various SEE mechanisms.

This knowledge may be based on a number of factors including laboratory radiation test data, component
manufacturer's analysis, heritage of the circuit design cell library and process methods, and previous flight data. Where
insufficient data exists, it is the role of the radiation test engineers providing support to the flight project to conduct
accelerator radiation tests for assessing proton and heavy ion induced SEE vulnerabilities. Ultimately, the vulnerabilities
of each candidate part must be identified and the associated rates for each possible single event effect must be
calculated for the radiation environment established in the requirements. The contributions to the allocated subsystem
error budget must then be assessed, and as indicated in figure 2.6 and in Section 6, hardening or mitigation approaches
identified where necessary.

The SEECA approach would now be applied to the subsystem level with the possible failure modes gauged according to
what the effect might be and whether or not it reaches the boundary of the subsystem to impact the allocated error
budget. In this sense, the use of SEU soft parts might be allowed even within a subsystem designated error-critical,
provided error mitigation techniques within the subsystem prevented the errors from reaching the subsystem
boundaries. Through this process, the error budget is managed through the completion of detailed design, and with
control of cost and complexity as the driving forces.
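
The idea that only errors reaching the subsystem boundary count against the budget can be sketched as a simple roll-up. This is an illustrative simplification: the `escape_fraction` parameter (the fraction of component events that mitigation fails to contain), the component names, and all numerical values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Contributor:
    name: str
    raw_rate_per_day: float  # predicted SEE rate at the component (hypothetical)
    escape_fraction: float   # fraction of events not contained by mitigation


def boundary_rate(contributors: list) -> float:
    """Sum the event rates that propagate to the subsystem boundary."""
    return sum(c.raw_rate_per_day * c.escape_fraction for c in contributors)


parts = [
    Contributor("sram_with_edac", raw_rate_per_day=2.0, escape_fraction=0.001),
    Contributor("unprotected_register", raw_rate_per_day=0.01, escape_fraction=1.0),
]

budget_per_day = 0.05  # hypothetical allocated subsystem error budget
rate = boundary_rate(parts)
print(f"{rate:.3f} events/day at boundary, within budget: {rate <= budget_per_day}")
```

In this example the SEU soft memory has by far the higher raw rate, yet contributes less at the boundary than the unprotected register, which is the point made in the text.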

7.6 Test and Verification

The ultimate endpoint test of the design will be actual flight performance, since it is not possible to fully simulate the
space environment at the system level. Even so, the radiation effects engineers can play a crucial role in design
verification. This takes place on two counts: the verification of SEE sensitivity in actual flight lot parts to confirm
assumptions made during earlier design stages, and subsystem level flight prototype tests at particle accelerators to
verify subsystem performance with errors induced in specific locations within the subsystem. This latter type of test, if
properly planned and executed, can validate error mitigation techniques as well as hardware performance.

This type of in situ testing can be important for two reasons. First it can serve to validate error mitigation approaches in
the subsystem design by demonstrating that errors at the component level are not sensed at the subsystem output, or
that when errors disrupt the performance, the recovery is accomplished within requirements. Also, even though
component tests are usually done with test fidelity to the application as an objective, in situ testing can help in
discovering the circuit performance under actual operating conditions which may differ from component level test conditions. As an example, component tests might be conducted at clock frequencies which differ from the application. SEE sensitivity can be highly dependent on clock rates.

7.7 Summary

The process of SEECA must be part of an integrated effort beginning in mission planning phases. Identification of functional requirements, along with the criticality of those functions, provides the basis for the analysis. Unambiguous statement of these requirements, along with comprehensive statement of the SEE relevant aspects of the environment, must then be included in the procurement specifications. From this foundation, the hardware level requirements for various subsystems follow, and finally these requirements flow down to the component level.

In this latter stage, the concepts behind the system functional criticality evaluation can be reapplied at the subsystem level with the understanding that possible SEE failure mechanisms must be identified at the component level, and the effects of those SEEs tracked to the board or subsystem boundaries to assess their effects on the system function. The team comprised of the radiation effects experts and the subsystem engineers then evaluates the need for SEE hardening and mitigation techniques based on the expected frequency of occurrence in the given orbit, the severity of the occurrence, the error budget allocated to that particular subsystem, and the cost and complexity of reducing the occurrence or impact of SEE in one part of the subsystem versus another.

Elements of SEECA are found throughout the process, from mission planning, to requirements definition and environment specification, to system and subsystem criticality assignments, to detailed subsystem design, and finally test and verification. In each of these stages, radiation environmental and test scientists should provide input and work as integrated members of the design or procurement effort. These individuals will likely belong to both the procurement and the contracting activities, and their roles should be identified clearly in the beginning stages of the procurement.

The benefits of a disciplined approach to single event effect management include the deployment of a reliable system with known risk levels, the avoidance of costly retrofits of SEE hardened parts or mitigation schemes, and the minimization of overall system complexity and costs.

Section 8

SEE Criticality Assessment Case Studies

Paul Marshall, Consultant

8.1 Design of the AS-1773 Fiber Optic Data Bus for T&C or Payload Applications

This case study illustrates the top-down application of SEECA in system design beginning with the definition of functional requirements and the use of those requirements to establish derived requirements for subsystem performance and hardware specifications. The example shows the expression of the SEE particle environment and its impact on the design from an SEE perspective. It also discusses the trade of various candidate technologies in the design by using SEE criticality, along with other performance gauges, as a guideline. Finally, the verification of the SEE-related performance is discussed along with the test and validation, analysis, and flight demonstration approaches used to prove the design.

Background

Beginning in the late 1980s, NASA Goddard Space Flight Center managers and engineers sought to take advantage of the weight and power savings afforded by fiber optic communications technology for satellite applications. After carefully assessing various feasibility issues for inserting this technology into NASA programs, the decision was made to pursue the development of a fiber optic based physical layer for the familiar MIL-STD-1553 bus used extensively for mission-critical telemetry and control (T&C) needs. This would provide immediate savings for planned applications of the MIL-STD-1553 and leverage heavily off of existing MIL-STD-1553 electronic hardware, while at the same time providing a basis for further developments of higher data rate busses using fiber optics.

The effort began around 1989 under the Small Explorer Program, and the bus itself came to be known as the Small Explorer Data System (SEDS). As part of the effort, the standards community was involved, and the SEDS bus gained extended recognition as the MIL-STD-1773 fiber optic data bus. In 1995, the MIL-STD-1773 standard was revised to accommodate data transmission at either 1 Mbps or 20 Mbps, and the new standard, now under the Society of Automotive Engineers, became the AS-1773.

General Requirements Definition Specific to SEEs

The functional requirements of the SAE AS-1773 data bus were largely inherited from the MIL-STD-1553 system. Its intended applications as a high reliability bus for avionics and satellite T&C roles have driven the tradeoff of data throughput capacity in favor of enhanced reliability. For our purposes in this illustration, it is not necessary to describe these features, except to say that the bus protocol described in the SAE AS-1773 data bus standard has extensive error detection capabilities, and when errors are detected in messages, the messages can be automatically retransmitted. The bus availability for data transmission is a requirement also addressed in the standard which requires a dual redundant physical layer with automatic switching to the redundant side on hard error detection.

In short, the bus must be available for the transmission of mission-critical commands and telemetry. It must transmit that information without error and provide a response acknowledging successful transmission. Retries of messages with errors are allowed, but retries must be successful and their number must be minimized so as not to affect bus traffic.

The SAE AS-1773 data bus standard is not specific to satellite applications, and there are no explicit requirements for radiation-related issues of any sort. For envisioned NASA applications of the AS-1773, the bus is obviously required to function in compliance with the standard, and it must do so in the presence of ionizing particle environments with characteristics and effects as already described. Obviously, the bus must also be tolerant to accumulated total ionizing dose, but that is outside the scope of this discussion.

Since the AS-1773 data bus is mission critical, its SEE functional criticality level is error-critical. This means no single particle event can prevent the bus from compliance with the AS-1773 standard. Further, this means that message transmissions must be successful (with occasional retransmissions allowed), and the bus must be available when needed. Even though the hardware level is dual redundant, since the bus is mission-critical, its availability must not be compromised by permanently destructive particle events. Therefore, the SEE criticality of components in the bus would be error-critical for destructive SEEs.

A list of functional requirements pertaining to SEEs might include:

1. The system must be compliant with the SAE AS-1773 data bus standard in the presence of the satellite particle environment.
2. No single particle effect shall lead to permanent failure of any system component.

This second requirement might be modified in some cases to allow exceedingly improbable occurrences of certain events (e.g. parts which may latch up may be allowed if the test and analysis support that decision based on the expected likelihood of the event). For example, if part destruction from latchup were expected with a frequency of once in 50 missions, then the part could possibly be used with the assumption that failure of one side of a dual redundant system would not result in system failure.
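
To give a feel for the numbers, the sketch below estimates the chance of losing both sides of a dual redundant pair. It rests on two assumptions that are ours and not part of the requirement: that "once in 50 missions" is read as a per-side, per-mission destruction probability of 1/50, and that failures of the two sides are independent.

```python
def p_both_sides_lost(p_side: float) -> float:
    """Probability that both redundant sides are destroyed in one mission,
    assuming the two failures are statistically independent."""
    return p_side ** 2


p_side = 1 / 50  # assumed per-side, per-mission probability of destructive latchup
p_loss = p_both_sides_lost(p_side)
print(f"{p_loss:.4f}")  # 0.0004
```

Under these assumptions the system-level loss probability is on the order of once in 2500 missions; whether that is acceptable remains a project decision supported by test and analysis.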

Beyond these general performance requirements, there might be specific requirements on system data handling performance. For example, given that single event transients might require data retransmission when certain types of errors occur (system level error tolerance), there may be a requirement on how often this might be allowed to happen. AS-1773 messages are transmitted in strings of twenty 32 bit words corresponding to 640 bits. When a transient disrupts a single bit, the entire message is retransmitted. Retransmission frequency would then be specified, for example, as only one retransmission in $10^6$ messages.
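
A message-level retransmission limit can be translated into an equivalent per-bit requirement. The sketch below assumes independent bit errors, 640 bits per message, and an illustrative limit of one retransmission per $10^6$ messages; the function names are ours.

```python
import math

BITS_PER_MESSAGE = 640


def message_retry_probability(p_bit: float, bits: int = BITS_PER_MESSAGE) -> float:
    """P(at least one bit error in a message) = 1 - (1 - p_bit)^bits."""
    return -math.expm1(bits * math.log1p(-p_bit))


def max_bit_error_probability(p_retry: float, bits: int = BITS_PER_MESSAGE) -> float:
    """Invert the relation above: p_bit = 1 - (1 - p_retry)^(1/bits)."""
    return -math.expm1(math.log1p(-p_retry) / bits)


p_bit = max_bit_error_probability(1e-6)
print(f"per-bit error probability limit: {p_bit:.2e}")
print(f"check, retry probability: {message_retry_probability(p_bit):.2e}")
```

For probabilities this small the result is very nearly the retransmission limit divided by the message length, on the order of $10^{-9}$ per bit.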

As was mentioned in the section on criticality, there may be multi-tiered requirements with stipulations, for example, for flare versus nonflare conditions. Also, there may be separate requirements for different functional aspects of the bus. AS-1773 operates to transfer message traffic at two data rates, 1 Mbps and 20 Mbps. Mission critical T&C traffic would likely occur at the 1 Mbps rate, but payload traffic might be handled better at 20 Mbps. If payload information were rated error-vulnerable in criticality, then it would be expected that 20 Mbps traffic could have less demanding functional requirements.

Environment Description

As is often the case, NASA hardware development efforts target a broad user base both within and beyond NASA. As was the case with the SEDS MIL-STD-1773 bus, the AS-1773 bus is intended for flight on a number of missions and in a variety of orbits with varying severity of the single event environment. A single bus design, with SEE and total ionizing dose immunity levels for the more demanding missions envisioned, is usually the most appropriate design (as opposed to multiple designs for varying levels of severity or design for an initial application and retrofit to meet more difficult requirements for later missions).

For this reason, even though the initial application of a subsystem such as the AS-1773 might be for a low earth orbit (LEO), the design might eventually be required to perform in a more severe cosmic ray environment. AS-1773 development has therefore assumed the cosmic ray environment shown in figure 8.1, which was calculated using the CREME models assuming solar minimum conditions. The orbit is assumed to be geostationary, where geomagnetic shielding effects are minimal. These conditions represent a worst-case average cosmic ray exposure, but not an unrealistically extreme one. Conditions at LEO might be reasonably similar, particularly for orbits at high inclination angles.

The requirement specification with respect to proton-induced SEE is a bit more complex, as the proton flux may vary by several orders of magnitude depending on the orbital position and the occurrence of solar particle events. Since the SAE AS-1773 data bus would be expected to perform mission critical functions without interruption, its design has been implemented with the worst-case expected proton flux in mind. According to NASA’s AP-8 environment models, the proton belt peak integral flux for energies capable of penetrating ~ 60 mils Al shielding would not be expected to exceed about $5 \times 10^4$ p/cm$^2$/s. However, since the bus must also function during short duration solar flare particle events, the design requirement for AS-1773 is somewhat higher at $2 \times 10^5$ p/cm$^2$/s. To further specify the proton requirement, the spectral energy composition (along with the assumptions made in establishing it) is also provided.
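
To put the two flux figures side by side, the sketch below computes the ratio of the design flux to the belt peak and converts a flux into an event rate. The cross section used is a placeholder chosen for illustration, not a measured AS-1773 value, and treating it as independent of proton energy is a simplification.

```python
BELT_PEAK_FLUX = 5e4  # p/cm^2/s, AP-8 belt peak behind ~60 mils Al (from the text)
DESIGN_FLUX = 2e5     # p/cm^2/s, AS-1773 design requirement (from the text)


def event_rate_per_second(flux: float, cross_section_cm2: float) -> float:
    """Events per second for an energy-independent cross section (simplification)."""
    if flux < 0 or cross_section_cm2 < 0:
        raise ValueError("flux and cross section must be non-negative")
    return flux * cross_section_cm2


margin = DESIGN_FLUX / BELT_PEAK_FLUX
SIGMA_HYPOTHETICAL = 1e-10  # cm^2, placeholder only, not a measured value
rate = event_rate_per_second(DESIGN_FLUX, SIGMA_HYPOTHETICAL)
print(f"design flux is {margin:.0f}x the belt peak flux")
print(f"events per hour at the design flux: {rate * 3600:.3f}")
```

A real rate estimate would fold the measured cross section as a function of proton energy with the differential spectrum, which is why the spectral composition is specified along with the integral flux.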

Figure 8.2 shows, for design purposes, the worst-case proton flare environment which has been arrived at using the CREME model August 1972 flare as described by King [1]. This particular environment was selected for the AS-1773 development since it is sufficiently large that it would probably not be exceeded during a 10 year mission. Even so, there is some likelihood of a larger flare occurring. Feynman [2] has treated this problem of solar particle event peak fluxes in a probabilistic sense, and this reference would be of interest to anyone tasked with defining a worst-case design requirement for solar flare events.

Figure 8.2: The August 1972 solar flare peak flux environment behind 60 mils Al, calculated assuming no geomagnetic shielding. The integral flux is calculated for energies penetrating 60 mils Al.

The solar flare conditions depicted in figure 8.2 establish the most demanding proton flux for the AS-1773. This flux exceeds the proton belt peak levels even for orbits passing through the heart of the belts, and it may exceed low LEO requirements in the SAA by 2-3 orders of magnitude. For a payload data bus or some other noncritical subsystem, this flare condition might be covered with a somewhat relaxed functional requirement, but this is not the case for mission critical subsystems, which must function adequately even in rare stressful events. There is some degree of margin in the choice of the August 1972 flare as a worst-case design criterion, since this was an unusually large flare event. A more quantitative treatment of the largest expected fluxes and design margins is given by Feynman [2].

SEE Component Requirements and Design Issues for the AS-1773

The functional requirements for the AS-1773 provide the basis for component SEE requirements and hardware design trades. Following figure 7.1, these issues are guided by which function the hardware serves and by the criticality rating of that function. Further, there must be consideration of how a SEE affects the system in terms of system availability versus system performance.

Where system availability has an error-critical rating, as with the AS-1773, this dictates a hardware requirement for virtual latchup immunity. In the case of AS-1773, this requirement is met by selecting custom ASICs from foundries with proven capabilities to provide latchup immune microcircuits, and by procuring the ASICs with latchup parameters specified. Ultimately, latchup immunity will be confirmed by ion beam testing on flight lot parts, unless the process can be certified to provide immunity to latchup.

Availability might also be compromised by use of ASICs with soft error susceptibilities in microcircuits controlling the subsystem configuration. In particular, the protocol chip and bus transceiver circuits could be adversely affected by upsets in certain locations. Consequently, soft error immune cell libraries and processes are used to control the frequency of soft error occurrence in these circuits. Soft error upset thresholds of > 20 MeV*cm²/mg assure low upset rates for cosmic rays and immunity to upsets from protons. These levels are met in the case of the AS-1773 by protocol and transceiver chips from United Technologies Microelectronics Center (UTMC) and Honeywell's RICMOS IV line respectively.

In the case of the transceiver chip, Honeywell's standard cell libraries could not meet some of the functional requirements for circuit performance at clock rates of 200 MHz, and custom design was required. This custom design was accompanied by evaluation of soft error vulnerabilities at the microcircuit level, and SEU hardened registers were applied where necessary. Though less formal in detail, this microcircuit vulnerability analysis and hardening repeats the same theme applied at the system level with SEE failure mode analysis, criticality evaluation, and hardening. Ultimately, the objective is the same in both cases, to attain appropriate levels of SEE immunity without taking unnecessary measures.

The transceiver chip also has other noteworthy features related to SEE. In the qualification of the MIL-STD-1773 hardware for the small explorer data system (SEDS) for the SAMPEX satellite, it was discovered that fiber-based data links can be extremely sensitive to proton strikes in the receiver's photodiode [3]. Further studies showed the severity of the problem could be reduced by changing the optical wavelength of the system from 830 nm light on the SAMPEX generation hardware to 1300 nm on the newer design for AS-1773 [4]. Though this provides substantial improvement by allowing the use of InGaAs photodiodes, analysis indicated further reductions in the expected SEU rates would be needed. System level trades were performed as to where the error mitigation for this specific type of error could best be accomplished, and subsequently circuit level hardening against such errors was included as part of the design requirement for the AS-1773 transceiver chip.

The method for hardening against photodiode proton events involves a certain circuit level technique which is a variation of majority vote logic. The design has been described in reference [5]. Analysis of the temporal characteristics of the proton-induced single event transient revealed that the proton-induced signal was short in duration relative to symbols in the Manchester encoded data in the serial data stream. The transceiver circuit differentiates between the true data and "false" signals from protons by taking advantage of this difference. For each Manchester symbol period, the signal is oversampled at five times the symbol rate, and these results are clocked into a 5-stage serial shift register. Then the 5 outputs (one from each register stage) are majority voted to determine whether at least 3 of the 5 stages held low or high levels. Proton transients would affect only one (or possibly 2) of the 5 results, and would subsequently be rejected in the voted output.
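
The voting step can be modeled in software as follows. This is a behavioral sketch of the idea described above, not the transceiver design from reference [5]: it assumes the five samples are already aligned to the symbol period, and the function names are ours.

```python
from typing import List, Sequence

OVERSAMPLE = 5  # samples per Manchester symbol period (from the text)


def vote(samples: Sequence[int]) -> int:
    """Majority vote over one symbol period: 1 if at least 3 of 5 samples are high."""
    if len(samples) != OVERSAMPLE:
        raise ValueError(f"expected {OVERSAMPLE} samples, got {len(samples)}")
    return 1 if sum(samples) >= OVERSAMPLE // 2 + 1 else 0


def recover_symbols(stream: Sequence[int]) -> List[int]:
    """Split an oversampled stream into symbol periods and vote on each one."""
    if len(stream) % OVERSAMPLE:
        raise ValueError("stream length must be a multiple of the oversampling factor")
    return [vote(stream[i:i + OVERSAMPLE])
            for i in range(0, len(stream), OVERSAMPLE)]


# Four symbols (high, low, low, high), each sampled five times.
clean = [1] * 5 + [0] * 5 + [0] * 5 + [1] * 5

hit = list(clean)
hit[6] = 1   # transient corrupts one sample of the second symbol
hit[11] = 1  # transient corrupts two samples of the third symbol
hit[12] = 1

print(recover_symbols(clean))
print(recover_symbols(hit))
```

Both calls return the same four symbols, since no transient reaches three of the five samples in a period. A disturbance spanning three or more samples would flip the voted output, which is why the remaining errors are left to system level detection.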

In this example of proton-induced transients in the optical receiver's photodiode, we see that the ultimate solution of the problem is a combination of several solutions. First, there is the hardware level hardening, which is implemented through the choice of the wavelength and consequently the photodiode material. Next is the circuit level hardening, which in this case follows a novel approach to discriminating the particle effect based on its temporal characteristics. Finally, as a last level of protection, any error not already suppressed before reaching the system level would be recognized as an invalid Manchester symbol, and system level retransmission of the message would assure error-free message traffic.
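
The last level of protection can be illustrated with a small decoder that flags any bit cell lacking a mid-bit transition. The encoding convention used here (1 as high then low, 0 as low then high) and the function names are assumptions made for illustration. The sketch does not reproduce the AS-1773 protocol logic.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed convention for illustration: bit 1 -> (high, low), bit 0 -> (low, high).
ENCODE = {1: (1, 0), 0: (0, 1)}
DECODE = {pair: bit for bit, pair in ENCODE.items()}


def manchester_encode(bits: Sequence[int]) -> List[int]:
    """Expand each data bit into its two half-bit levels."""
    halves: List[int] = []
    for bit in bits:
        halves.extend(ENCODE[bit])
    return halves


def manchester_decode(halves: Sequence[int]) -> Tuple[Optional[List[int]], bool]:
    """Return (bits, retransmit). retransmit is True when any bit cell is invalid,
    i.e. both halves are at the same level, or the stream length is odd."""
    if len(halves) % 2:
        return None, True
    bits: List[int] = []
    for i in range(0, len(halves), 2):
        pair = (halves[i], halves[i + 1])
        if pair not in DECODE:
            return None, True  # invalid symbol: request retransmission
        bits.append(DECODE[pair])
    return bits, False


message = [1, 0, 1, 1]
line = manchester_encode(message)

corrupted = list(line)
corrupted[2] = 1  # a transient raises the first half of the second bit cell

print(manchester_decode(line))
print(manchester_decode(corrupted))
```

The clean stream decodes to the original bits with no retransmission request, while the corrupted stream is rejected. An error that inverts both halves of a cell would still look valid to this check, so it complements rather than replaces the earlier hardening levels.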

Not surprisingly, the most efficient means of dealing with errors usually involves dealing with them at or near their point of origin, which in this case is the receiver photodiode and circuit. The combination of the two approaches offers sufficiently robust tolerance at both 1 and 20 Mbps, though the effectiveness of the approaches differs at the two data rates. Without these two measures, the burden at the third level, system level message retransmission, would be too great during peak proton flux periods (e.g. solar flares).

In this section we have described the process of establishing system level functional requirements for SEE, flowing these into hardware requirements based on SEE criticality, trading component performance versus circuit and system level error mitigation, and arriving at a final design.

SEE Design Verification

In order to gain confidence that the design will meet SEE requirements, it is necessary to engage in a test verification process. This effort takes place on both the component and subsystem levels. Component tests are necessary for two reasons. First, it is often the case that the SEE characteristics of parts desired by the design engineers will not be known, either because of the use of a new vendor or part type or possibly because the vendor has altered something in the design or process. This is particularly true for commercial off-the-shelf (COTS) parts. SEU testing prior to final design is then necessary to accurately assess the expected performance, and the results of testing may indicate whether or not the part may be used in the design.

Once the design is finalized, it is again necessary to test components which have been procured as flight lot parts. This verifies that the actual flight parts perform as expected and demonstrates their SEE performance at the levels at which they were procured. Of course, not every component in the design will undergo such scrutiny. The SEE failure modes analysis will identify those parts in the design which require this level of test and analysis. In the case of AS-1773, component SEE testing is required for the transceiver and for the protocol ASICs, as well as for the dual port RAMs. The type of testing (e.g. heavy ion SEU, heavy ion latchup, or proton SEU) will also be determined by the failure modes analysis.

The final test phase is carried out in situ on the actual subsystem design. This may be done on an engineering development unit or on a brassboard version built specifically for SEE testing. This level of testing may provide additional information about component response, but its primary function is to evaluate SEE impact at the system level. It validates models for error propagation within the subsystem, and it validates error mitigation schemes. Also, for certain types of errors such as proton strikes on the fiber optic receiver's photodiode, in situ testing is required since component testing outside the system application can be extremely difficult to interpret. In other cases, subsystem level in situ testing may not be necessary, provided component testing and analysis can provide orbital performance estimates with desired accuracy.

AS-1773 system level tests will be carried out to evaluate the heavy ion and proton response in each SEE sensitive component. The testing is usually done as a function of ion energy (or LET), and for a variety of system operating conditions. One main objective of the system tests is to formulate and execute a test plan covering the range of system test vectors and environment variables to refine models for expected flight performance in a specific orbit. Consequently, for each ion energy or LET used in testing, the AS-1773 system will be exercised at both of the two data rates, and at a series of incident optical powers. Variation of the optical power will establish performance in terms of beginning of life conditions (with stronger signals) and at end of life with typical or worst-case power levels.
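
The structure of such a test plan can be sketched as a full factorial matrix over the variables named above. Only the two data rates come from the text. The LET values and optical power levels below are hypothetical placeholders, and the function name is ours.

```python
from itertools import product
from typing import Dict, List, Sequence

DATA_RATES_MBPS = (1, 20)                   # from the text
LET_VALUES = (3.0, 12.0, 28.0, 60.0)        # MeV*cm^2/mg, hypothetical
OPTICAL_POWERS_DBM = (-20.0, -26.0, -32.0)  # hypothetical: strong to weak signal


def build_test_matrix(lets: Sequence[float],
                      rates_mbps: Sequence[int],
                      powers_dbm: Sequence[float]) -> List[Dict[str, float]]:
    """Enumerate every combination of LET, data rate, and incident optical power."""
    return [
        {"let": let, "rate_mbps": rate, "optical_power_dbm": power}
        for let, rate, power in product(lets, rates_mbps, powers_dbm)
    ]


matrix = build_test_matrix(LET_VALUES, DATA_RATES_MBPS, OPTICAL_POWERS_DBM)
print(f"{len(matrix)} test conditions")
for condition in matrix[:3]:
    print(condition)
```

With four LET values, two data rates, and three power levels this gives 24 conditions. In practice beam time usually limits the matrix, so a real plan would prioritize the combinations that the failure modes analysis identifies as most informative.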

The performance of the AS-1773 subsystem is monitored in terms of system retransmissions, and also system availability. Permanent failures (which should not occur) are monitored, as well as switch-offs to the other half of the dual redundant architecture.

The purpose and goals of the subsystem tests are to verify the absence of permanent failures, parametrically identify system performance to verify design and refine flight performance models, evaluate error tolerance and mitigation schemes, and finally to guard against any surprises which may have been overlooked in the failure modes analysis.

The SEECA process for the AS-1773 described in this section illustrates its use for deriving hardware requirements from functional requirements, and for carrying out the details of an appropriately SEE immune design. The process involves close coordination with radiation environment and radiation effects specialists from the beginning and throughout the design and test phases. We summarize here by reviewing these various roles in the case of AS-1773 which in turn illustrate the process described in the section on Criticality.

SEE functional requirements definition. SEE engineers work with system designers to identify various ways in which SEEs might damage or disrupt system operation and help to identify meaningful ways to specify SEE functional requirements.

SEE environment specification. Radiation environment and effects specialists analyze orbital parameters to generate SEE-relevant charged particle environment descriptions to include in the system specifications.

SEE requirements flowdown in preliminary design. SEE specialists coordinate with system engineers and design engineers to derive SEE hardware requirements from the SEE functional requirements based on criticality of the function. Further consultations follow in the allocation of SEE budget to various segments of the design.

Detailed design. SEE specialists work with system designers to identify appropriate component choices and to perform trades of various candidate error hardening, tolerance, and mitigation schemes.

Test and verification. SEE test engineers perform parts evaluation and screening for candidate components, and after final design, testing is done in situ on the operating subsystem to verify design and derive needed parameters for flight performance prediction.

Flight performance prediction. Based on the test results, SEE specialists predict the performance for specific orbital conditions. Having carried out the above process, the predictions should establish that the performance will meet SEE requirements, and with minimal impact on cost and complexity. In the event that functional requirements are not met, SEECA provides the framework for rectifying the problem.

8.2 Case Study: Retrofit of a DC to DC Power Converter

The discussions in the previous sections and in the AS-1773 case study identify the roles of SEECA in the design and qualification of a system or subsystem, but this is not the only application for SEECA. In many cases satellite missions rely on "heritage" designs which already exist and may already have flight histories. In such cases, the prior experience may not have involved qualification to the radiation and SEE environments necessary for the mission being planned, or in the worst case there may have been no such requirements at all. Also, parts lists corresponding to an existing design may include items which are no longer available, and where parts are available, their radiation and SEE characteristics may differ from those qualified for the initial application.

Heritage designs represent a special case for SEECA since it is assumed that the nonrecurring engineering costs have been paid and redesign for SEE or any other reason is a costly proposition. Nevertheless, a SEECA must be executed for the intended application and with the SEE characteristics of available parts in mind. Where requirements cannot be met with existing designs, subsystem engineers must be inventive to find hardening or mitigation approaches which do not involve alteration of the heritage design.

This case study involves the use of a power supply which in turn uses a DC-to-DC converter manufactured by Modular Devices Incorporated (MDI). Not only does the study illustrate the nature of dealing with an existing design, but it also highlights the fact that SEEs are not limited to memories and other digital logic devices.

The subsystem in question in this case is generic in the sense that the portion of the subsystem of interest is a power supply typical of those found throughout many satellite subsystems. The power supply function here involves the conversion from the spacecraft 28 volt supply to regulated +15 volt supplies. The part in question is the MDI2690R-D15F DC-to-DC converter which is actually a hybrid comprised of many components.

In the course of testing SEE characteristics of power converters, NASA Goddard radiation effects engineers discovered that the MDI hybrid was susceptible to single particle induced resets which dropped the supply output from 15 volts to 0 volts followed by a spontaneous recovery after about 10 ms. The details of the testing and results have been reported in [6]. By testing with heavy ions incident on various isolated portions of the hybrid part, the problem was isolated and identified to be related to an LM139 comparator. The LET threshold of this linear device was sufficiently low to indicate a sensitivity even to protons. The existence of SEEs in linear devices had been reported previously, but the effects are highly application dependent. This example is one of a very few cases where subsystems have been shown to be sensitive to "upsets" in linear devices, probably because the spacecraft community has not been fully aware of the potential for these problems and SEE testing of analog parts is not usually done.

These results were sufficiently alarming to engineers who had included the MDI converter in their designs to warrant activity on multiple fronts. In a coordinated effort between NASA Goddard, the Jet Propulsion Laboratory, the Naval Research Laboratory, and MDI engineers, the MDI design was analyzed and a potential solution was suggested which involved the addition of a capacitive filter to suppress the transient resulting from the particle interaction. Subsequent tests with heavy ion beams were conducted to first verify the initial results and also to validate the efficacy of the proposed solution to "harden" the converter against transients. The details and results of these tests are available on the NASA Goddard Radiation Effects Group home page on the world wide web [7]. In summary, the previous results were reproduced on the unhardened design, and the modified design was shown to have a sufficiently high tolerance so that only the most energetic interactions could produce the reset effect. Analyses for the orbit in question indicated that the expected rate for the effect on orbit was reduced by several orders of magnitude by the proposed solution, and the problem would occur so infrequently that the MDI converter could be used with acceptable risk. MDI subsequently adopted the minor alteration to the hybrid without impact to the converter's other electrical characteristics, and consequently a costly redesign of all the power supplies using the MDI converter was averted. In the absence of such an elegant solution, it might be necessary for system engineers to abandon the heritage design, or to add external mitigation hardware, or (unless the function is error-critical) to absorb the resulting SEE rates by reallocating more restrictive error budgets to other subsystems.

Though the discovery of the problem and identification of a solution were harrowing, especially for projects that had already purchased flight lot converters, the launch of hardware with this potentially catastrophic design flaw was averted. This case study illustrates how a heritage design must be adopted with proper planning using SEECA, how SEE problems may be discovered in unexpected places (e.g. linear parts), and how testing and innovative solutions involving teamwork between suppliers, test engineers, design engineers, and system engineers can turn a serious problem into a successful design with understood and acceptable SEE related risks.

8.3 References

1. We use the CREME model as contained in the software package SPACE RADIATION, Severn Communications Corp., Millersville MD.