## New Developments in Field Programmable Gate Array (FPGA) Single Event Upsets (SEUs) and Fail-Safe Strategies



#### Melanie Berg, MEI Technologies in support of NASA/GSFC



## Acknowledgements

- Some of this work has been sponsored by the NASA Electronic Parts and Packaging (NEPP) Program and DTRA
- Thanks are given to the NASA Goddard Radiation Effects and Analysis Group (REAG) for their technical assistance and support. REAG is led by Kenneth LaBel and Jonathan Pellish
- Thanks to the MAPLD committee for welcoming me back as a tutorial instructor



Before we get started... here are some things to keep in mind during this very long presentation


## Single Event Upsets (SEUs) and Field Programmable Gate Arrays (FPGAs)

- Ionizing particles cause upsets (SEUs)
- Each FPGA type has different error signatures
- Regarding SEUs, the question is how to avoid system failure
- The answer depends on
  - The system's requirements and the definition of failure
  - The target FPGA and surrounding circuit susceptibility
  - Implemented fail-safe strategies
  - Radiation environment
  - Trade space and decided risk



# Fail-Safe Strategies... Don't Get the Actions Confused... There's a Difference

- Detection:
  - Watchdog (state or logic monitoring)
  - Simplistic checking ... complex decoding
  - Action (correction or recovery)
- Masking
  - Not letting an error propagate to other logic
  - Redundancy+mitigation or detection
  - Turn off faulty path
- Correction
  - Error state (memory) is changed
  - Need feedback
- Recovery
  - Bring system to a deterministic state
  - Might include correction



## SEU Induced Fail-Safe Concerns for FPGA Based Systems

- Are you reducing error rate?
  - Be careful: not all FPGAs have the same Single Event Upset (SEU) error signatures... don't be fooled
  - A poorly selected or implemented mitigation scheme may increase the upset rate instead of decreasing it
- Accumulation versus Multiple Bit Upsets (MBUs) may need to be handled differently (rate of correction versus correction technique)
- Tradeoffs: Is your scheme buying you anything?
  - May reduce system error rate at a high cost (area, power, complexity, cost)
  - **STOP**.... Requirements may not need Mitigation
  - If you can't validate that it meets requirements then risk is high

## Things To Look Out for In a Design Review: How Safe is Your Design?



- Are SEU error modes addressed properly?
- Did you mitigate where you expected to mitigate?
- Are there lock-up conditions in the design?
- Does your strategy protect the entire critical path?
- Is the synthesized design fail-safe?
- Can your watch-dog catch failure?
- Will your recovery scheme work?
- What are the limitations of your verification strategy?

#### The list goes on... Based on error signatures of the target FPGA, the designer must keep all points in mind at all stages of the design

## Agenda



- Section I: Single Event Effects (SEEs) in Digital Logic
- Section II: Application of the NASA Goddard Radiation Effects and Analysis Group (REAG) FPGA SEU Model
- Section III: Reducing System Error: Common Mitigation Techniques

## Break

- Section IV: When Your Mitigation Fails
- Section V: Xilinx V4 and Mitigation
- Section VI: Fail-Safe Strategies

## Van Allen Radiation Belts Have Varying Particle Spectra





Van Allen Radiation Belts: Illustrated by Aerospace Corp.

## Source of Faults: SEEs and Ionizing Particles



- Terrestrial devices are susceptible to faults mostly due to:
  - Alpha particles: from packaging and doping
  - Neutrons: caused by interactions of Galactic Cosmic Rays (GCRs) entering the Earth's atmosphere



- Devices expected to operate at higher altitude (Aerospace and Military) are more prone to upsets caused by:
  - Heavy ions: direct ionization
  - Protons: secondary effects



## Device Penetration of Heavy Ions and Linear Energy Transfer (LET)

- LET characterizes the energy deposition of charged particles
- Based on Average energy loss per unit path length (stopping power)
- Mass is used to normalize LET to the target material

$$LET = \frac{1}{\rho}\frac{dE}{dx}$$

- $\frac{dE}{dx}$: average energy deposited per unit path length
- $\rho$: density of target material
- Units: MeV\*cm<sup>2</sup>/mg

To be presented by Melanie Berg at the Revolutionary Electronics in Space (ReSpace) / Military and Aerospace Programmable Logic Devices (MAPLD) 2011 Conference, Albuquerque, NM, August 22-25, 2011, and to be published on nepp.nasa.gov web site

## Terminology Used in Device Datasheets: LET vs. SEU



## **Error Cross Section ($\sigma_{SEU}$)**

Terminology:

- Flux: Particles/(sec-cm<sup>2</sup>)
- Fluence: Particles/cm<sup>2</sup>
- The σ<sub>SEU</sub> is calculated at several LET values (particle spectrum)
  - LET Threshold (LET<sub>th</sub>) is the point where errors are first observed (on-set)
  - LET Saturation (LET<sub>SAT</sub>) is the point where errors stop statistically increasing with LET
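
As a rough illustration of how these terms relate, the sketch below computes fluence from flux and exposure time, then $\sigma_{SEU}$ as observed errors divided by fluence (the usual definition of an error cross section). The function names and the beam-run numbers are hypothetical and are not taken from any test data in this presentation.

```python
def fluence(flux_per_cm2_s: float, exposure_s: float) -> float:
    """Fluence (particles/cm^2) = flux (particles/(s*cm^2)) * exposure time (s)."""
    return flux_per_cm2_s * exposure_s


def seu_cross_section(num_errors: int, fluence_per_cm2: float) -> float:
    """Error cross section (cm^2) = observed errors / fluence."""
    if fluence_per_cm2 <= 0:
        raise ValueError("fluence must be positive")
    return num_errors / fluence_per_cm2


# Hypothetical beam run at one LET value:
# 1e4 particles/(s*cm^2) for 100 s, with 50 errors observed.
run_fluence = fluence(1.0e4, 100.0)
sigma = seu_cross_section(50, run_fluence)
print(f"fluence = {run_fluence:.1e} particles/cm^2, sigma_SEU = {sigma:.1e} cm^2")
```

Repeating this calculation at several LET values gives the points of a $\sigma_{SEU}$ versus LET curve, from which LET<sub>th</sub> and LET<sub>SAT</sub> are read.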



## SEU Information: Manufacturer Datasheet Example





## Radiation Data is always changing ... best to keep yourself updated: http://radhome.gsfc.nasa.gov/



## Single Event Effects (SEEs) and Common Terminology

- Single Event Latch Up (SEL): Device latches in high current state
- Single Event Burnout (SEB): Device draws high current and burns out
- Single Event Gate Rupture (SEGR): Gate destroyed, typically in power MOSFETs
- Single Event Transient (SET): Current spike due to ionization. Dissipates through bulk
- Single Event Upset (SEU): Transient is caught by a memory element
- Single Event Functional Interrupt (SEFI): Upset disrupts function



## **FPGA SEU Susceptibility**

#### FPGA SEUs or SETs can occur in:

- Configuration
- Combinatorial Logic (including global routes or control)
- Sequential Logic
- Memory Cells
- Hidden logic (SEFI)

#### Every Device has different Error Responses – We must understand the differences and design (or plan) appropriately



## Agenda

- Section I: SEEs in Digital Logic
- Section II: Application of the NASA REAG FPGA SEU Model
  - Configuration σ<sub>SEU</sub> (P<sub>configuration</sub>)
  - Functional Data Path σ<sub>SEU</sub> (P<sub>functionalLogic</sub>)
  - Microsemi (Actel) ProASIC3 Example
- Section III: Reducing System Error: Common Mitigation Techniques

## Break

- Section IV: When Your Mitigation Fails
- Section V: Xilinx V4 and Mitigation
- Section VI: Fail-Safe Strategies



## The NASA Goddard REAG FPGA SEU Model : Top Down Approach

#### Top Level Model has 3 major categories of $\sigma_{SEU}$:

$$P(fs)_{error} \propto P_{Configuration} + P(fs)_{functionalLogic} + P_{SEFI}$$

| Term | Meaning |
|------|---------|
| $P(fs)_{error}$ | Design $\sigma_{SEU}$ |
| $P_{Configuration}$ | Configuration $\sigma_{SEU}$ |
| $P(fs)_{functionalLogic}$ | Functional logic $\sigma_{SEU}$ |
| $P_{SEFI}$ | SEFI $\sigma_{SEU}$ |

### **Configuration SEU Cross Sections**

## Place, Route, and Gate Utilization are Stored in the FPGA Configuration



- Configuration Defines: Arrangement of pre-existing logic via programmable switches
  - Functionality (logic cluster)
  - Connectivity (routes)
  - Placement



- Programming Switch Types:
  - Antifuse: One time Programmable (OTP)
  - SRAM: Reprogrammable (RP)
  - Flash: Reprogrammable (RP)



## **Programmable Switch Implementation and SEU Susceptibility**





### Configuration SEU Test Results and the REAG FPGA SEU Model

| Configuration            | REAG Model                                                                     |
|--------------------------|--------------------------------------------------------------------------------|
| Antifuse                 | $P(fs)_{error} \propto P_{functionalLogic}(fs) + P_{SEFI}$                     |
| SRAM (non-mitigated)     | $P(fs)_{error} \propto P_{Configuration}$                                      |
| Flash                    | $P(fs)_{error} \propto P_{functionalLogic}(fs) + P_{SEFI}$                     |
| Hardened SRAM            | $P(fs)_{error} \propto P_{Configuration} + P_{functionalLogic}(fs) + P_{SEFI}$ |



## $P(fs)_{error} \propto P_{Configuration} + P(fs)_{functionalLogic} + P_{SEFI}$

## Functional Data Path SEU Cross Sections



## Configuration versus Data Path (Functional Logic) SEUs

- Configuration and Functional data path circuitry are separate logic
- Can be implemented with different technologies within one device
- Configuration is static and data paths are not. Requires a different test and analysis approach

This explains why there are separate categories of error:

**P<sub>configuration</sub> vs. P<sub>functionalLogic</sub>**



## **SEUs and SETs in a Data Path**

| Combinatorial                                                          | Sequential                                                                                                                   |
|------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
| Logic function generation (computation)                                | Captures and holds state of data input at rising edge of clock                                                               |
| SET: Glitch in the combinatorial logic. Capture is frequency dependent | SEU: State changes until next cycle of enabled input. Next state capture can be frequency dependent. Single sided            |



## DFFs in a Synchronous Design









 $P(fs)_{error} \propto P_{Configuration} + P(fs)_{functional Logic} + P_{SEFI}$ 

## Functional Data Path SEU Cross Sections and Combinatorial Logic Effects (Capturing SETs)

 $P(fs)_{functional \,Logic} \propto P(fs)_{DFFSEU \to SEU} + P(fs)_{SET \to SEU}$ 



## SETs and a Synchronous System

- Generation (P<sub>gen</sub>)
- Propagation (P<sub>prop</sub>)
- Logic Masking (P<sub>logic</sub>)
- Capture

## All Components comprise: $P(fs)_{SET \rightarrow SEU}$



## **SET Generation**: *P*<sub>gen</sub>

- SET generation occurs due to an "off" gate turning "on".
- CMOS SET: there is a push-pull between the "on" gate and the charge collected at the "off" gate
- SET has an amplitude and width (τ<sub>width</sub>) based on:
  - Amount of collected charge (i.e. small LET → small SET)
  - The strength of the gate's load
  - The strength of its complementary "ON" gate
  - The dissipation strength of the process.



An SET is generated when the collected charge exceeds the critical charge: $Q_{coll} > Q_{crit}$

$$Q_{crit} = C_{node} V_{node}$$

where $C_{node}$ is the node capacitance and $V_{node}$ is the node voltage.
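
A small numeric sketch of the critical-charge condition may help. The node capacitance, node voltage, and collected-charge values below are hypothetical round numbers chosen only for illustration; they do not describe any particular process.

```python
def critical_charge(c_node_farads: float, v_node_volts: float) -> float:
    """Q_crit = C_node * V_node, in coulombs."""
    return c_node_farads * v_node_volts


def set_generated(q_coll_coulombs: float, c_node_farads: float, v_node_volts: float) -> bool:
    """An SET is generated when the collected charge exceeds the critical charge."""
    return q_coll_coulombs > critical_charge(c_node_farads, v_node_volts)


# Hypothetical node: 2 fF at 1.5 V, so Q_crit is about 3 fC.
C_NODE = 2e-15
V_NODE = 1.5
q_crit = critical_charge(C_NODE, V_NODE)

small_strike = set_generated(1e-15, C_NODE, V_NODE)  # 1 fC collected: below Q_crit
large_strike = set_generated(5e-15, C_NODE, V_NODE)  # 5 fC collected: above Q_crit
print(f"Q_crit = {q_crit:.2e} C, small strike -> {small_strike}, large strike -> {large_strike}")
```

The same relation shows why lowering node capacitance or supply voltage lowers $Q_{crit}$ and makes SET generation more likely for a given strike.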

## **SET Propagation to an EndPoint DFF:** *P*<sub>prop</sub>



- *P*<sub>prop</sub> only pertains to electrical medium (capacitance of path... combinatorial logic and routing)
  - Capacitive SET amplitude reshaping
  - Capacitive SET width reshaping
- Small SETs or paths with high capacitance have low P<sub>prop</sub>
- $P_{prop}$  contributes to the non-linearity of  $P(fs)_{SET \rightarrow SEU}$  because of the variation in path capacitance





## **SET Logic Masking:** *P*<sub>logic</sub>

- *P*<sub>logic</sub>: Probability that a SET can logically propagate through a cone of logic. Based on the state of the combinatorial logic gates and their potential masking.





## **SET Capture at Destination DFF**



The transient width ( $\tau_{width}$ ) will be a fraction of the clock period ( $\tau_{clk}$ ) for a synchronous design in a CMOS process.

$$P(\tau_{clk})_{SET \to SEU} \propto \frac{\tau_{width}}{\tau_{clk}}$$

$$P(fs)_{SET \to SEU} \propto \tau_{width} fs$$

Probability of capture is proportional to the width of the transient as seen from the destination DFF
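
To make the proportionality concrete, here is a minimal sketch of the capture term $\tau_{width} fs$ evaluated at two clock frequencies. The 1 ns transient width and the frequencies are hypothetical, and capping the result at 1 is an assumption added here for transients wider than a clock period.

```python
def set_capture_fraction(tau_width_s: float, fs_hz: float) -> float:
    """Fraction of the clock period in which a SET of width tau_width can be
    captured by the destination DFF: tau_width / tau_clk = tau_width * fs.
    Capped at 1.0 (assumption) for transients wider than one clock period."""
    return min(1.0, tau_width_s * fs_hz)


TAU_WIDTH = 1e-9  # hypothetical 1 ns transient as seen at the destination DFF

slow = set_capture_fraction(TAU_WIDTH, 50e6)    # 50 MHz clock
fast = set_capture_fraction(TAU_WIDTH, 100e6)   # 100 MHz clock
print(f"50 MHz: {slow:.3f}, 100 MHz: {fast:.3f}")
```

Doubling the frequency doubles the capture fraction for the same transient, which is the sense in which SET capture is frequency dependent.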

## Data Path Model and Combinatorial Logic SETs

For combinatorial logic SETs, operational frequency and number of combinatorial logic gates are directly proportional to $\sigma_{\rm SEU}$




# Have you always believed that if you decrease operational frequency, the $\sigma_{\text{SEU}}$ will also decrease?

#### Or

# If you increase the amount of combinatorial logic, you will increase the $\sigma_{\text{SEU}}$





## Functional Data Path SEU Cross Sections and DFF Effects (Capturing StartPoint SEUs)





### Does not fully characterize DFF upsets as they pertain to a synchronous system



### StartPoint SEUs and a Synchronous System: New Stuff

- Generation (P<sub>DFFSEU</sub>)
- *P*<sub>prop</sub>=1 for hard state switch
- Logic Masking (P<sub>logic</sub>)
- Capture

### All Components comprise: P(fs)<sub>DFFSEU→SEU</sub>



### **Generation of DFF Upsets:** *P*<sub>DFFSEU</sub>

- Probability that a DFF will flip its state
- Can be a hard flip:
  - Will not change until the next clock cycle
  - Amplitude and width are not affected as with a SET
- Can be a metastable flip
  - No real defined state
  - Otherwise known as a "weak" state
  - Can cause oscillations in the data path
  - Eventually settles to a state ... not deterministic!



### Generation $P_{DFFSEU}$ versus Capture $P(fs)_{DFFSEU \rightarrow SEU}$

| $P_{DFFSEU}$                                       | $P(fs)_{DFFSEU \rightarrow SEU}$                                               |
|----------------------------------------------------|--------------------------------------------------------------------------------|
| Probability a StartPoint<br>DFF becomes upset      | Probability that the<br>StartPoint upset is<br>captured by the endpoint<br>DFF |
| Occurs at some point in time within a clock period | Occurs at a clock edge (capture)                                               |
| Not frequency dependent                            | Frequency dependent                                                            |



### Logic Masking DFFs... P<sub>logic</sub>

- Logic masking for DFF StartPoints is similar to logic masking of combinatorial logic.
- DFF logic masking is generally the point where Triple Modular Redundancy (TMR) is inserted







Window of the clock cycle for SEU capture: a StartPoint upset occurring at time $\tau$ is caught if it falls within this timeframe:

$$\tau < \tau_{clk} - \tau_{dly}$$

Fraction of the clock period for upset capture: dividing by $\tau_{clk}$, with $fs = 1/\tau_{clk}$, gives upset capture with respect to frequency:

$$\tau fs < 1 - \tau_{dly} fs$$
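
The sketch below evaluates the capture fraction $(1 - \tau_{dly} fs)$ for a StartPoint DFF upset. The delay and frequency values are hypothetical, and the timing check that raises an error when $\tau_{dly} \geq \tau_{clk}$ is an assumption added so the example rejects paths that would not meet timing.

```python
def dff_capture_fraction(tau_dly_s: float, fs_hz: float) -> float:
    """Fraction of the clock period in which a StartPoint DFF upset is
    captured by the EndPoint DFF: (tau_clk - tau_dly) / tau_clk = 1 - tau_dly * fs."""
    if tau_dly_s * fs_hz >= 1.0:
        raise ValueError("path delay must be shorter than the clock period")
    return 1.0 - tau_dly_s * fs_hz


FS = 100e6  # hypothetical 100 MHz clock (10 ns period)

short_path = dff_capture_fraction(2e-9, FS)  # 2 ns of combinatorial delay
long_path = dff_capture_fraction(6e-9, FS)   # 6 ns of combinatorial delay
print(f"short path: {short_path:.2f}, long path: {long_path:.2f}")
```

A longer combinatorial path between DFFs leaves a smaller window for the upset to reach the EndPoint before the next clock edge, so the capture fraction drops.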



### **Data Path Upsets and StartPoint DFFs**





### *P(fs)*<sub>FunctionalLogic</sub> Putting it all together:




#### NASA REAG FPGA Upper Bound Susceptibility Model





Upper-bound assumes P<sub>logic</sub>=1 (no mitigation) and NO DFF frequency (fs) dependency




# How DFF or Combinatorial Logic Susceptibility Dominance Affects $\sigma_{\text{SEU}}$



|                                    | $P(fs)_{DFFSEU \rightarrow SEU}$                                                          | $P(fs)_{SET \rightarrow SEU}$                                                             |
|------------------------------------|-------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------|
| Logic                              | DFF Capture                                                                               | Combinatorial SET<br>Capture                                                              |
| Capture percentage of clock period | $(1 - \frac{\tau_{dly}}{\tau_{clk}}) = (1 - \tau_{dly} fs)$                               | $\frac{\tau_{width}}{\tau_{clk}} = \tau_{width} fs$                                       |
| Frequency<br>Dependency            | Increase Frequency<br>decrease σ <sub>SEU</sub>                                           | Increase Frequency Increase $\sigma_{SEU}$                                                |
| Combinatorial Logic<br>Effects     | Increase<br>Combinatorial logic<br>increases $\tau_{dly}$ and<br>decreases $\sigma_{SEU}$ | Increase in<br>combinatorial logic<br>increases $P_{gen}$ and<br>increases $\sigma_{SEU}$ |



You can't answer the question until you understand the relative  $\sigma_{SEU}$  contribution of DFFs to Combinatorial Logic... Is there Logic Mitigation?
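
The opposite frequency trends in the table can be checked by differentiating each capture term with respect to $fs$. This is a sketch that assumes $P_{DFFSEU}$, $P_{gen}$, $P_{prop}$, $P_{logic}$, $\tau_{dly}$ and $\tau_{width}$ do not themselves depend on $fs$:

$$\frac{\partial}{\partial fs}\left[P_{DFFSEU}(1-\tau_{dly}fs)\right] = -P_{DFFSEU}\,\tau_{dly} < 0$$

$$\frac{\partial}{\partial fs}\left[P_{gen}P_{prop}P_{logic}\,\tau_{width}fs\right] = P_{gen}P_{prop}P_{logic}\,\tau_{width} > 0$$

The DFF term falls as frequency rises while the SET term grows, so the sign of the overall trend depends on which term dominates the design.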



### NASA REAG Models + Heavy Ion Data: Microsemi (Actel) ProASIC3



### Background: Microsemi (Actel) ProASIC3 Flash Based FPGA

- Originally a commercial device
- Configuration is flash based and has proven to be almost immune to SEUs
- No embedded mitigation in device
- User must insert mitigation if σ<sub>SEU</sub> reduction is required.



## ProASIC3 Analysis: Combinatorial Logic Contributions to $\sigma_{\text{SEU}}$ using Shift Registers





#### Windowed Shift Register (WSR)

- WSR<sub>0</sub>: N=0 Chain ... Only DFFs
- WSR<sub>8</sub>: N=8 Chain ... 8 Inverters per 1 DFF
- WSR<sub>16</sub>: N=16 Chain ... 16 Inverters per 1 DFF



### Microsemi (Actel) ProASIC3 Shift Register Study: $\tau_{dly}$ and Adding Combinatorial Logic

$$P(fs)_{error} \propto P_{Configuration} + P(fs)_{functionalLogic} + P_{SEFI}$$

$$P(fs)_{functionalLogic} \propto \forall_{DFF}\left(\sum_{j} P(fs)_{DFFSEU \rightarrow SEU(j)} + \sum_{i} P(fs)_{SET \rightarrow SEU(i)}\right)$$

If the DFFs are not mitigated, they will have the dominant $\sigma_{SEU}$:

$$P(fs)_{functionalLogic} \propto \forall_{DFF}\left(\sum_{j=1}^{1} P_{DFFSEU(j)}(1-\tau_{dly(j)}fs)\right)$$

**Only One StartPoint per EndPoint DFF and P<sub>logic</sub>=1** 

### No-TMR ProASIC3: Which String Would You Expect to Have a Higher SEU Cross Section? WSR<sub>0</sub> or WSR<sub>8</sub>





### σ<sub>SEU</sub> Test Results: Windowed Shift Registers (WSRs) No-TMR

- No-TMR: $\sigma_{SEU} WSR_0 > \sigma_{SEU} WSR_8$ for every LET
- No-TMR: Increasing combinatorial logic does not increase $\sigma_{SEU}$ because of the increase in $\tau_{dly}$



### Agenda



- Section I: Single Event Effects (SEEs) in Digital Logic
- Section II: Application of the NASA Goddard Radiation Effects and Analysis Group (REAG) FPGA SEU Model
- Section III: Reducing System Error: Common Mitigation Techniques
  - Triple Modular Redundancy (TMR)
  - Embedded Radiation Hardened by Design (RHBD)

### Break

- Section IV: When Your Mitigation Fails
- Section V: Xilinx V4 and Mitigation
- Section VI: Fail-Safe Strategies



# Example: TMR Mitigation Schemes will use Majority Voting

 $MajorityVoter = I1 \land I2 + I0 \land I2 + I0 \land I1$ 

| I0 | I1 | I2 | Majority Voter |
|----|----|----|----------------|
| 0  | 0  | 0  | 0              |
| 0  | 0  | 1  | 0              |
| 0  | 1  | 0  | 0              |
| 0  | 1  | 1  | 1              |
| 1  | 0  | 0  | 0              |
| 1  | 0  | 1  | 1              |
| 1  | 1  | 0  | 1              |
| 1  | 1  | 1  | 1              |

Best 2 out of 3: triplicate and vote
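
The majority voter equation can be checked against its truth table with a few lines of Python. This is only a behavioral sketch of the logic function; the function name is hypothetical, and an FPGA implementation would express the same equation in HDL.

```python
from itertools import product


def majority_voter(i0: int, i1: int, i2: int) -> int:
    """Majority vote of three bits: (I1 AND I2) OR (I0 AND I2) OR (I0 AND I1)."""
    return (i1 & i2) | (i0 & i2) | (i0 & i1)


# Enumerate all eight input combinations, in the same order as the table.
truth_table = {bits: majority_voter(*bits) for bits in product((0, 1), repeat=3)}

for (i0, i1, i2), out in truth_table.items():
    print(i0, i1, i2, "->", out)
```

Any single wrong input is outvoted by the other two, which is the masking property that TMR relies on.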



### **TMR: Correction vs Masking**

- TMR with feedback will mask and correct an error
- TMR with no feedback will only mask an error
  - May not buy you anything if a large amount of circuitry has no correction capability
  - Triple the circuitry without correction:
    - triples the upset rate
    - may end up with the same upset rate using this scheme
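
The difference between masking and correction can be seen in a small behavioral simulation. The sketch below is a simplification under stated assumptions: three register copies hold a constant value of 1, upsets are injected at chosen cycles, and "feedback" means every copy is reloaded from the voter output each cycle. The names and the upset schedule are hypothetical.

```python
def vote(a: int, b: int, c: int) -> int:
    """Majority vote of three bits."""
    return (a & b) | (a & c) | (b & c)


def run_tmr(upsets: dict, cycles: int, feedback: bool) -> list:
    """Simulate three redundant copies of a stored bit whose correct value is 1.

    upsets maps a cycle number to the index of the copy flipped in that cycle.
    Returns the voted output observed in each cycle.
    """
    regs = [1, 1, 1]
    outputs = []
    for cycle in range(cycles):
        if cycle in upsets:
            regs[upsets[cycle]] ^= 1  # an SEU flips one copy
        voted = vote(*regs)
        outputs.append(voted)
        if feedback:
            regs = [voted, voted, voted]  # voter output reloads every copy
    return outputs


# Two upsets in different copies, two cycles apart.
schedule = {1: 0, 3: 1}
masked_only = run_tmr(schedule, cycles=5, feedback=False)
corrected = run_tmr(schedule, cycles=5, feedback=True)
print("no feedback:", masked_only)
print("feedback:   ", corrected)
```

Without feedback the first upset is masked but stays in place, so the second upset breaks the vote. With feedback the first upset is corrected before the second arrives.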





Generally cannot apply internal correction from voted outputs

#### Local Triple Modular Redundancy (LTMR): Only DFFs Voter+Feedback=Correction



## ProASIC3 LTMR Shift Register Data Path Model



### LTMR ProASIC3: Which String Would You Expect to Have a Higher SEU Cross Section? WSR<sub>0</sub> or WSR<sub>8</sub>





#### σ<sub>SEU</sub> Test Results: Windowed Shift Registers (WSRs) Trend Reverses with TMR

- LTMR is effective and has mitigated  $P(fs)_{DFFSEU \rightarrow SEU}$
- LTMR: $\sigma_{SEU} WSR_0 < \sigma_{SEU} WSR_8$ for every LET
- Increasing combinatorial logic increases  $\sigma_{\text{SEU}}$




### Summary of No-TMR vs. LTMR: Combinatorial Logic Effects



|                                                                            | No-TMR ProASIC3                                                                                                                                                                                   | LTMR ProASIC3                                                                                                                                                    |
|----------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Significant circuit type                                                   | StartPoint DFF<br>(sequential): SEU capture                                                                                                                                                       | Combinatorial: SET capture                                                                                                                                       |
| Significant<br>Model<br>component | $P_{DFFSEU}(1-\tau_{dly}fs)$ | $P_{gen}P_{prop}\tau_{width}fs$ |
| Error Type | One sided function | Two-sided function |
| σ<sub>SEU</sub> WSR<sub>8</sub> vs.<br>σ<sub>SEU</sub> WSR<sub>0</sub> | $\sigma_{SEU} WSR_8 < \sigma_{SEU} WSR_0$ | $\sigma_{SEU} WSR_8 > \sigma_{SEU} WSR_0$ |
| Relative σ<sub>SEU</sub><br>reasoning | WSR<sub>8</sub> has more combinatorial logic and more $\tau_{dly}$ between DFFs, hence $\sigma_{SEU}$ is reduced | WSR<sub>8</sub> has more combinatorial logic and has more opportunity for SET generation, hence $\sigma_{SEU}$ is increased |

### No-TMR vs. LTMR: Frequency Effects



- The same reasoning for  $\tau_{\text{dly}}$  can be used for Frequency
- No-TMR:
  - Decreases with increasing frequency:  $P_{DFFSEU}(1-\tau_{dly}fs)$
  - Increase Frequency Decrease  $\sigma_{\text{SEU}}$
- LTMR
  - Directly proportional to frequency:  $P_{gen}P_{prop}\tau_{width}fs$
  - Increase Frequency increase  $\sigma_{\text{SEU}}$

The assumption "if you operate your circuit slower, then you will decrease your $\sigma_{SEU}$" is **NOT** always valid!

### Distributed Triple Modular Redundancy (DTMR): DFFs + Data Paths All DFFs with Feedback have Voters







# Global TMR (GTMR) Proves To Be a Great Mitigation Strategy... BUT...

- Triplicating a design and its global routes takes up a lot of power and area
- Generally performed after synthesis by a tool: not part of RTL
- Difficult to verify
- Does the FPGA contain enough low skew clock trees? (each clock + its synchronized reset)x3

### Agenda



- Section I: Single Event Effects (SEEs) in Digital Logic
- Section II: Application of the NASA Goddard Radiation Effects and Analysis Group (REAG) FPGA SEU Model
- Section III: Reducing System Error: Common Mitigation Techniques
  - Triple Modular Redundancy (TMR)
  - Embedded Radiation Hardened by Design (RHBD)

### Break

- Section IV: When Your Mitigation Fails
- Section V: Xilinx V4 and Mitigation
- Section VI: Fail-Safe Strategies

### DFF with Embedded LTMR: Microsemi (Actel) RTAXs Family of FPGA

- Localized (only at DFF)
- Microsemi uses a wired-"OR" approach to voting: no SETs on voters



### DFF with Embedded Dual Interlock Cell (DICE): Aeroflex Eclipse FPGA

- Localized mitigation for DFFs
- Uses a Dual Redundancy Scheme instead of LTMR
- Single nodes can become upset but their partner node will pull the output in the correct direction



# Embedded Temporal Redundancy (TR):

- Temporal Filter placed directly before DFF
- Localized scheme that reduces SET capture
- Delays must be well controlled. FPGA designers should not implement this themselves: it is best if embedded
- Maximum Clock frequency is reduced by the amount of new delay





### **Combining Embedded Schemes**

- Some Radiation Hardened by Design (RHBD) schemes combine embedded temporal redundancy with localized redundant latches:
  - TR+LTMR
  - TR+DICE
- New Xilinx RHBD FPGA (Virtex 5QV) has embedded TR+DICE





#### **RHBD for Global Routes**

- Some RHBD FPGAs contain hardened clock trees and other global routes
- Global structures are generally hardened by using larger buffers
- TR will not work on a global network (signal integrity, skew balancing, speed and area would be significantly affected)





#### Break! 10 minutes



### Agenda

- Section I: Single Event Effects in Digital Logic
- Section II: Application of the NASA Goddard Radiation Effects and Analysis Group (REAG) FPGA SEU Model
- Section III: Reducing System Error: Common Mitigation Techniques

# Break

- Section IV: When Your Mitigation Fails
- Section V: Xilinx V4 and Mitigation
- Section VI: Fail-Safe Strategies



#### **LTMR Failure**

- Shared Data Path into DFFs
- Voters can upset
- Global routes





#### **DTMR Failures**



- Global routes
- Domain placement
  - possible for domains to share common routing matrix
  - Hit to shared routing matrix can take out two domains



#### **GTMR Failures**



- Domain placement
  - possible for domains to share common routing matrix
  - Hit to shared routing matrix can take out two domains
- Clock Skew
- Asynchronous clock domain crossings need additional voter insertion – tools don't handle this automatically

#### **TR Failures**







#### **DICE Susceptibility**

One particle strike can take out 2 nodes and break DICE

Source: "Radiation Hard by Design at 90nm"; Warren Snapp et al., MRQW December 2008



DICE Susceptibility: Not So Bad for a SRAM Cell – However, Can Cause Metastability Problems in a High Speed Master-Slave DFF

Takes time for the dual node to pull the output to a correct state





#### Agenda

- Section I: Single Event Effects in Digital Logic
- Section II: Application of the NASA Goddard Radiation Effects and Analysis Group (REAG) FPGA SEU Model
- Section III: Reducing System Error: Common Mitigation Techniques

# Break

- Section IV: When Your Mitigation Fails
- Section V: Xilinx V4 and Mitigation
- Section VI: Fail-Safe Strategies

## Commercial Devices in Critical Applications



- Why are commercial devices being considered?
  - Fast
  - Cheap
  - Easier to design with (especially with reprogrammability option)
- Commercial devices were not designed for critical applications.... Considerations:
  - Requires extensive knowledge of SEU error signatures
  - Requires knowledge of proper mitigation techniques
  - Requires additional tool costs
  - Watchdogs become more complex
  - Recovery becomes more complex
  - Verification becomes more complex

# The following slides illustrate some of the considerations regarding using commercial devices for critical applications

#### General Xilinx Virtex 4 FPGA Architecture



#### **Functional Logic**



# Xilinx SX55: Radiation Test Data



Xilinx Consortium: VIRTEX-4VQ STATIC SEU CHARACTERIZATION SUMMARY: April/2008

|                                 | Probability                | Error Rate                      | LEO (upsets/device-day) | GEO (upsets/device-day) |
|---------------------------------|----------------------------|---------------------------------|-------------------------|-------------------------|
| Configuration Memory: XQR4VSX55 | P<sub>configuration</sub>  | $\frac{dE_{configuration}}{dt}$ | 7.43                    | 4.2                     |
| Combined SEFIs per device       | P<sub>SEFI</sub>           | $\frac{dE_{SEFI}}{dt}$          | 7.5x10<sup>-5</sup>     | 2.7x10<sup>-5</sup>     |

- For non-mitigated designs the most significant upset factor is:  $P_{Configuration}$
- Localized redundancy is NOT effective. Designer must use DTMR or GTMR





## Is GTMR (a.k.a. XTMR) or DTMR All We Need?

- GTMR only:
  - Masks and corrects the Functional logic
  - Masks most configuration upsets (no correction to configuration bits)
- Two upsets in a mitigation window can cause a system upset
- Accumulation in Configuration can occur and eventually break the GTMR
- Scrubbing corrects the configuration memory
  - Does not reduce Configuration Upset Rate
  - Reduces the accumulation bit error rate
  - Does not correct functional upsets
  - Will not disrupt device operation
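
A rough feel for how scrub rate affects accumulation can be had from a Poisson estimate. The sketch below is a simplification under stated assumptions: configuration upsets are treated as independent Poisson arrivals, and the function returns the probability that two or more upsets land anywhere in the device between consecutive scrubs. That is a pessimistic proxy, since only upsets that pair up inside the same mitigation window can actually break GTMR. The 7.43 upsets/device-day rate is the LEO configuration-memory figure from the XQR4VSX55 table; the scrub intervals are hypothetical.

```python
import math

SECONDS_PER_DAY = 86400.0


def prob_two_or_more_upsets(rate_per_day: float, scrub_interval_s: float) -> float:
    """Poisson probability of at least two upsets within one scrub interval."""
    lam = rate_per_day * scrub_interval_s / SECONDS_PER_DAY  # expected upsets per interval
    return 1.0 - math.exp(-lam) * (1.0 + lam)


LEO_RATE = 7.43  # configuration upsets per device-day (from the table above)

for interval_s in (1.0, 3600.0, SECONDS_PER_DAY):
    p = prob_two_or_more_upsets(LEO_RATE, interval_s)
    print(f"scrub every {interval_s:>8.0f} s -> P(>=2 upsets per interval) = {p:.3e}")
```

Scrubbing does not change the upset rate itself; it only shortens the interval over which upsets can accumulate, which is why the probability falls as the scrub interval shrinks.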

#### Variations of Scrubbing Implementation





#### Scrubber Fault Detection and Correction



- ECC: Error Correction Code
- CRC: Cyclic Redundancy Code

| Scrubber Function | Options                                                           |
|-------------------|-------------------------------------------------------------------|
| Fault Detection   | ReadBack Frame ECC, ReadBack CRC, ReadBack Compare, No Detection  |
| Fault Correction  | Syndrome Decoding, Golden Write Back                              |
| When to Correct   | Every bit error, Watchdog controller, Calculated Frequency        |



## **System Operation and Scrubbing**

- System retains its state of operation during scrubbing cycles
- Scrubbing can NOT correct DFFs (state machines, counters, control registers, etc...)
- If operation has been affected, correcting the configuration bit may not recover operation
  - Reset
  - Re-synchronize
  - Correction circuitry
  - Re-power or Reconfigure
    - FPGA becomes inactive
    - Peripheral devices that require control will require an alternate source of control



- For a critical design Scrubbing is secondary but should be implemented
- As mitigation windows increase (partial mitigation), scrubbing becomes more of a significant factor
  - Larger windows = more bits with pairs that can break mitigation
  - Several bit upsets per day makes accumulation significant



#### **Processor Based Internal Scrubber**

#### Processor for Fault Correction

- Failed Miserably during radiation testing
  - Processor was internal to FPGA with no mitigation
  - Processor used memory with no mitigation
  - Detection and correction scheme was Single Error Correct Double Error Detect (SECDED)
- Presented: RADECS 2007, "Effectiveness of Internal versus External SEU Scrubbing Mitigation Strategies in a Xilinx FPGA: Design, Test, and Analysis"; M. Berg et al.



#### Frame ECC is only SECDED

Be Aware and Take Caution When Using SECDED Based Scrubbers



- SECDED Scrubbers can only correct one upset per configuration frame
- MBU or an accumulation of upsets within a frame can cause SECDED correction circuitry to write incorrect values into the frame
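The miscorrection behavior can be reproduced with a toy SECDED code. The sketch below uses an 8-bit extended Hamming word protecting 4 data bits; it is not the Frame ECC implementation, and the word size, bit layout, and function names are assumptions made for illustration only.

```python
def encode(data4):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) word."""
    word = [0] * 8  # index 0 = overall parity, 1..7 = Hamming(7,4)
    word[3], word[5], word[6], word[7] = data4
    for p in (1, 2, 4):
        # parity bit p covers every position whose index has bit p set
        word[p] = sum(word[i] for i in range(1, 8) if i & p and i != p) % 2
    word[0] = sum(word[1:]) % 2
    return word

def decode(word):
    """Return (status, data4); status is 'ok', 'corrected' or 'double'."""
    w = list(word)
    syndrome = 0
    for i in range(1, 8):
        if w[i]:
            syndrome ^= i
    overall = sum(w) % 2  # 0 when the total parity is consistent
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # odd number of flips: the decoder assumes exactly one upset
        w[syndrome] ^= 1  # syndrome 0 points at the overall parity bit
        status = "corrected"
    else:
        status = "double"  # even number of flips: detected, not corrected
    return status, [w[3], w[5], w[6], w[7]]

def flip(word, positions):
    """Inject upsets at the given bit positions."""
    out = list(word)
    for p in positions:
        out[p] ^= 1
    return out

data = [1, 0, 1, 1]
good = encode(data)
print(decode(flip(good, [5])))        # one upset
print(decode(flip(good, [3, 5])))     # two upsets
print(decode(flip(good, [0, 3, 5])))  # three upsets
```

With three upsets the decoder still reports a correction, but it flips a bit that was never upset and returns wrong data. A scrubber that writes this "corrected" word back is writing incorrect values.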



#### **Virtex5 Scrubbing**

- New embedded logic that performs readback in the background
- Read back is now free
- Lower power: does not require I/O switching




#### REAG State Machine Driven Error Correction



- Does not use FRAME\_ECC Syndrome (see block diagram)
- Scrubs entire Configuration upon CRC error using the Internal Configuration Access Port (ICAP)
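The detect-then-rewrite flow can be sketched in software. This is a behavioral model only, assuming a golden copy of the configuration is available; the frame count, frame size, and function names are hypothetical, and `zlib.crc32` stands in for the device's readback CRC.

```python
import zlib

def crc_of(frames):
    """CRC32 over all configuration frames (stand-in for the readback CRC)."""
    crc = 0
    for frame in frames:
        crc = zlib.crc32(frame, crc)
    return crc

def scrub_on_crc_error(config, golden, golden_crc):
    """Rewrite the whole configuration from the golden copy on a CRC mismatch.

    Returns True when a scrub was performed.
    """
    if crc_of(config) == golden_crc:
        return False
    for i, frame in enumerate(golden):
        config[i] = bytearray(frame)  # full write-back, no syndrome decoding
    return True

# Hypothetical configuration: 4 frames of 16 bytes each.
golden = [bytes([i] * 16) for i in range(4)]
golden_crc = crc_of(golden)
config = [bytearray(f) for f in golden]

config[2][5] ^= 0x10  # inject an upset
config[2][9] ^= 0x01  # second upset in the same frame
print(scrub_on_crc_error(config, golden, golden_crc))
print(all(c == g for c, g in zip(config, golden)))
```

Because correction is a write-back of golden data rather than a syndrome decode, multiple upsets in one frame are handled the same way as a single upset.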



## V5 REAG State Machine Driven Performance – Proton Testing



#### External

- CRC Error invoked external scrubber
- Was always able to correct and operation was not disrupted

#### Internal

- CRC Error invoked internal scrubber
- Was always able to correct and operation was not disrupted
- Worked just as well as external scrubber
- Total circuitry only occupied 134 slices
  - GTMR: Clocks, DFFs, and LUTs
  - Non GTMR Logic: ICAP + internal read back blocks



#### V5 REAG State Machine Driven Performance – Heavy Ion Testing – Not As Good as Proton Results

#### External

- CRC Error invoked external scrubber
- Was always able to correct and operation was not disrupted

#### Internal

- CRC Error invoked internal scrubber
- Readback SEFIs occurred and required a full reconfiguration. Upsets occurred at the lowest LET tested, 2.5 MeV\*cm<sup>2</sup>/mg

# An Alternative is to build your own CRC checker (either externally or GTMR'd internally)



## Xilinx V4 and V5 Takeaway Points

- Can be used in non-critical missions without any mitigation
  - Upset rates on the order of days
  - Will need to be reconfigured periodically
  - Watchdog required
  - Great for non-critical data processing
- Can be used in a critical path (beware of SEFIs) with mitigation
  - Utilize mitigation tools from a proven vendor, otherwise:
    - Design may break after GTMR (XTMR) insertion
    - Mitigation may not be placed where expected
  - Upset rates are low
  - It is a complex process to make a commercial device perform at the required level for a critical mission
- Xilinx has come up with a solution: RHBD V5QV



### Agenda

- Section I: Single Event Effects in Digital Logic
- Section II: Application of the NASA Goddard Radiation Effects and Analysis Group (REAG) FPGA SEU Model
- Section III: Reducing System Error: Common Mitigation Techniques

# Break

- Section IV: When Your Mitigation Fails
- Section V: Xilinx V4 and Mitigation
- Section VI: Fail-Safe Strategies



## How Safe is Your Design?

- Are SEU error modes addressed properly?
  - Did you mitigate where you expected to mitigate?
  - Have you taken into account device SEFIs?
- Are there lock-up conditions in your design?
- Does your strategy protect the entire critical path?
- Is the synthesized design fail-safe?
- Can your watch-dog catch failure?
- Will your recovery scheme work?
- What are the limitations of your verification strategy?

# The list goes on... Focus will be on fail-safe concerns regarding SEUs



## **System Fail-Safe Concepts**

#### **Common FPGA Options for Space-Flight Missions:**

- Xilinx Virtex 4
- Microsemi RTProASIC3
- Microsemi RTAX2000S



To be presented by Melanie Berg at the Revolutionary Electronics in Space (ReSpace) / Military and Aerospace Programmable Logic Devices (MAPLD) 2011 Conference, Albuquerque, NM, August 22-25, 2011, and to be published on nepp.nasa.gov web site

## Know The SEU Susceptibility: FPGA Types and the REAG FPGA SEU Model

| FPGA       | REAG Model                                                                      |
|------------|---------------------------------------------------------------------------------|
| RTAX2000S  | $P(fs)_{error} \propto P(fs)_{SET \to SEU} + P_{SEFI}$                          |
| Virtex 4   | $P(fs)_{error} \propto P_{Configuration}$                                       |
| RTProASIC3 | $P(fs)_{error} \propto P(fs)_{DFFSEU \to SEU} + P(fs)_{SET \to SEU} + P_{SEFI}$ |

# Virtex 4 susceptibility is proportional to configuration upsets (in the order of days).

#### Example Has A Processor Designed into the FPGA



Software is required to run in the processor. External memory may also be required.




## Hardened Software... How Fail-Safe Is It?

- Depends on how susceptible the hardware is
- Upsets do not occur in software
  - Hardware gets upset and can disrupt software
- Software can be used to reduce the necessity of resets
  - Best for register or memory type SEUs
  - Can also be used in pipelined architectures where upsets can be flushed out
- Software will not help with hardware upsets such as:
  - Stuck states
  - Broken routes
  - Broken functionality



#### CMOS Microprocessors in a Heavy Ion Rich Radiation Environment

Critical microprocessor designs in heavy ion environments require some form of hardened hardware before hardened software can be effective

| FPGA Type  | Mitigation?                                      | Hardened Software Effectiveness |
|------------|--------------------------------------------------|---------------------------------|
| Virtex 4   | None                                             | Low                             |
| Virtex 4   | User inserted DTMR or GTMR                       | Good                            |
| RTAXs      | Embedded LTMR + Configuration + hardened globals | Good                            |
| RTProASIC3 | Configuration                                    | Medium                          |
| RTProASIC3 | Configuration + User inserted LTMR               | Good                            |

Note: the effectiveness ratings are a vague description; the result really depends on the design and the actual radiation environment.

## CMOS Microprocessors in a Proton Rich Radiation Environment



- Xilinx embedded processor has not proven to be highly susceptible to protons
  - International Space Station (ISS) experiments such as SpaceCube
  - Embedded processor uses minimal amount of configuration bits
- Xilinx user designed processor (or soft core) will be highly susceptible in a proton environment
  - User designed processor will use a significant number of configuration bits
  - Virtex configuration bits are highly susceptible to protons
- Microprocessors designed into Microsemi FPGAs will have low susceptibility to protons

SRAM Configuration: Susceptible to Protons

#### CMOS functional Data Path: Low susceptibility to Protons



#### State Machines Drive Most Synchronous Designs: Basic State Machine





### Hardware Fail-Safe Concepts: Locked-up State Machine versus a Locked-up System

- A great deal of attention is given to fail-safe state machines
- Very little attention is given to consequence of implementation and system recovery
- First we'll discuss potential state machine lockup conditions
- We will follow with system response and fail-safe considerations




# Commonly Used Definition of A Safe State Machine



- If a bit flips into an unmapped state, circuit automatically jumps to idle state
- The safe state machine concept originates from designers who suggest not using resets on state machines... which then requires some other scheme to get back to IDLE
- Warning... No resets on state machines is a violation of synchronous design rules






# Safe State Machine Concerns

- This is a detection/recovery scheme:
  - There is no redundancy/mitigation and no correction
  - SEU Rate is increased
- Does not account for the state machine jumping into a mapped state
- Gives a false sense of "safety"
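The gap in coverage can be shown with a small behavioral model. This is a sketch under stated assumptions: the five state names, their 3-bit binary encoding, and the transition order are hypothetical and are not taken from any design discussed here.

```python
# Hypothetical five-state machine using a 3-bit binary encoding.
IDLE, LOAD, RUN, STORE, DONE = range(5)
MAPPED = {IDLE, LOAD, RUN, STORE, DONE}
NAMES = ["IDLE", "LOAD", "RUN", "STORE", "DONE"]

def next_state(state, start=False):
    """Next-state logic with a 'safe' default branch for unmapped codes."""
    if state not in MAPPED:
        return IDLE                      # unmapped code: jump to IDLE
    if state == IDLE:
        return LOAD if start else IDLE
    if state == DONE:
        return IDLE
    return state + 1                     # LOAD -> RUN -> STORE -> DONE

def upset(state, bit):
    """Flip one state-register bit (models an SEU in a DFF)."""
    return state ^ (1 << bit)

# Upset into an unmapped code: STORE (011) -> 111 is caught and forced to IDLE.
print(next_state(upset(STORE, 2)) == IDLE)

# Upset into a mapped code: LOAD (001) -> STORE (011). RUN is silently skipped.
hit = upset(LOAD, 1)
print(NAMES[hit], "->", NAMES[next_state(hit)])
```

In this model the default branch only helps for codes 101, 110, and 111. The upset that lands in a mapped state is indistinguishable from normal operation, so nothing is detected and nothing recovers.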



# Binary Coded States: Jumping into Mapped States



- In a binary state machine, an upset changes the complete state, so it can be dangerous to jump into a mapped state
- During a design review... a designer is responsible for being aware of all possible upsets and the functional response
  - When the designer is asked, "What do you do if you jump into a mapped state?"
  - The answer is, use a reset






### Binary Coded States: Jumping into Unmapped States

- It is a much lower probability that the machine will jump into an unmapped state
- However, a major amount of logic is added... why?
  - User is worried about getting locked into an unmapped state
  - However, same can be possible for jumping into a mapped state (except for very simple cyclic state machines)
- When the designer is asked, "What do you do if you jump into an unmapped state?"
  - The answer is, a safe state machine
  - Why not just use the same reset logic as if jumping into a mapped state? Keep it simple!!!!!!
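The relative likelihood of the two cases can be checked by enumeration. The sketch below assumes a binary encoded machine with five mapped states in a 3-bit register, every state equally occupied, and every flip-flop equally likely to be hit; those assumptions and the function name are illustrative only.

```python
def classify_single_bit_flips(mapped_states, width):
    """Count single-bit upsets landing in mapped versus unmapped states."""
    mapped = set(mapped_states)
    to_mapped = to_unmapped = 0
    for state in mapped:
        for bit in range(width):
            upset = state ^ (1 << bit)
            if upset in mapped:
                to_mapped += 1
            else:
                to_unmapped += 1
    return to_mapped, to_unmapped

# Five mapped states (000..100) in a 3-bit register; 101, 110, 111 are unmapped.
to_mapped, to_unmapped = classify_single_bit_flips(range(5), 3)
print(f"into mapped: {to_mapped}, into unmapped: {to_unmapped}")
```

For this encoding, 10 of the 15 possible single-bit upsets land in a mapped state, so logic that only handles unmapped states covers the minority case.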




# When and How are "Safe State" Machines a Viable Option?

- For very simple state machines (cyclic without needing input stimulus to push to next state)
- Will need to verify:
  - No lockup conditions exist due to communication with other logic
  - No crucial events if state machine jumps into mapped state
  - No crucial events occur if jump to IDLE state (i.e., is it OK for the output to abruptly turn off)
- Not the safest solution because the full state of the system may not be deterministic (why we use resets)
- However, may reduce the need for soft resets
- When M. Berg is not your design reviewer!

## Have You Looked at Your Synthesis Output!



- Many designers are opting for cycling to IDLE instead of a direct assign to IDLE
- This option only works when:
  - The state machine has jumped into an unmapped state
  - The state machine has been synthesized as a binary encoded machine
    - Binary state machines synthesize into circuits that have counter-like control (cycle through all counter states)
    - One-hot state machines synthesize into shift register type logic... cycling concept doesn't apply
    - Don't assume encoding... you must check!
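The difference between the two encodings can be sketched with a toy model. It assumes the binary machine's default behavior in an unmapped state is a counter-like increment, and that the one-hot machine behaves like a rotating shift register. Real synthesis output may differ, which is the reason to check it; all names and widths here are hypothetical.

```python
def binary_next_unmapped(state, width=3):
    """Counter-like default: an unmapped state increments and wraps to 0 (IDLE)."""
    return (state + 1) % (1 << width)

def one_hot_next(state, width=5):
    """Shift-register-like default: rotate the hot bit(s) left by one."""
    mask = (1 << width) - 1
    return ((state << 1) | (state >> (width - 1))) & mask

def steps_to_idle(state, step, idle, limit=32):
    """Steps until IDLE is reached, or None if it is not reached within limit."""
    for n in range(limit + 1):
        if state == idle:
            return n
        state = step(state)
    return None

# Binary: unmapped 101 cycles 110 -> 111 -> 000 (IDLE).
print(steps_to_idle(0b101, binary_next_unmapped, idle=0b000))

# One-hot: an upset leaves two hot bits; rotation never restores a single hot bit.
print(steps_to_idle(0b00101, one_hot_next, idle=0b00001))
```

In this model the binary machine reaches IDLE after a few cycles, while the one-hot machine keeps two hot bits forever because rotation preserves the number of set bits.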

### **Critical Paths**



- We discussed fail-safe state machine concepts
- Don't forget that if a path is truly critical, then the designer will need to consider more than a pseudo safe state machine:
  - Should additional mitigation be inserted?
  - Are the inputs to the machine and other logic protected?
  - Can multiple outputs turn on at once?
  - Can outputs turn off or on too soon?
  - Are there lock-up conditions?
  - Is other logic left in a particular state and not cleared?

#### There is an entire system to consider – Resets can be your safest option



# **Fail Safe Memory Control**

- Memory elements can be the most susceptible portion of your design
- There are various methods of protection:
  - None: data is not stored long enough to worry about it
  - Error Detection and correction (EDAC)
    - Pay attention to MBUs and ability of EDAC
    - How susceptible is the EDAC circuitry?
  - Scrubbing
    - Scrubbers should be hardened; otherwise they can write a lot of bad information
    - Use of goldens instead of EDAC helps
  - TMR with read/write/modify cycles
- Be careful with FIFOs: if their address pointers become upset, you can lose all of your stored data
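The TMR option can be sketched as a voted read followed by a write-back. This is a behavioral model only; the class name, the word value, and the choice to write back on every read are assumptions for illustration.

```python
def vote(a, b, c):
    """Bitwise majority of three copies."""
    return (a & b) | (a & c) | (b & c)

class TmrWord:
    """One memory word stored three times (illustrative model only)."""

    def __init__(self, value):
        self.copies = [value, value, value]

    def read(self, write_back=True):
        voted = vote(*self.copies)
        if write_back:
            # Correction: overwrite all copies so upsets do not accumulate.
            self.copies = [voted, voted, voted]
        return voted

# Masking only: two upsets on the same bit in different copies break the vote.
masked = TmrWord(0xA5)
masked.copies[0] ^= 0x08
print(hex(masked.read(write_back=False)))
masked.copies[1] ^= 0x08
print(hex(masked.read(write_back=False)))

# Masking plus correction: the first upset is repaired before the second arrives.
corrected = TmrWord(0xA5)
corrected.copies[0] ^= 0x08
corrected.read()
corrected.copies[1] ^= 0x08
print(hex(corrected.read()))
```

The first word shows masking without correction failing once upsets accumulate; the second survives the same two upsets because the voted value was written back in between.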

### Watchdogs



- Monitors portions of circuit (error detection)
- Which portion do you monitor?
  - Common to monitor a heart beat
  - Generally does not give enough information regarding the state of the entire design
- How dependable is the watchdog?
- Where is it performed?
- Too often an afterthought and not carefully evaluated during design reviews

# Not Easy!!!!!!!!
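A heartbeat monitor, the common case named above, can be sketched as a timeout counter. The class name, tick granularity, and timeout value are hypothetical.

```python
class HeartbeatWatchdog:
    """Trips when the monitored logic stops kicking within the timeout."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.count = 0
        self.tripped = False

    def kick(self):
        """Called by the monitored logic to signal that it is alive."""
        self.count = 0

    def tick(self):
        """Advance one clock tick; trip once the heartbeat is overdue."""
        self.count += 1
        if self.count >= self.timeout_ticks:
            self.tripped = True  # latched until the system recovers
        return self.tripped

wd = HeartbeatWatchdog(timeout_ticks=3)
for t in range(8):
    if t < 4:
        wd.kick()  # heartbeat present for the first four ticks only
    print(t, wd.tick())
```

Note what this does not cover: logic that keeps producing a heartbeat while its data path or state is wrong never trips the watchdog, and an upset in the watchdog's own counter is not modeled at all.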

### Recovery



- What happens if an error is detected?
- Usually an afterthought
- Sometimes the recovery scheme is over-designed and too complex... KEEP IT SIMPLE
- Sometimes a reset is your best bet
- A list of all known error events should be made with a corresponding list of recovery modes
  - Sometimes not all error modes are known
  - The benefit of having a verification team (they think of how to break your design instead of how to create your design)
- Recovery should be verified (not easy)
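The event list and its recovery modes can be kept as a simple lookup with a reset fallback for events nobody anticipated. The event names and the particular mapping below are hypothetical examples, not a recommended set.

```python
# Hypothetical error events mapped to recovery modes.
RECOVERY_MODES = {
    "config_crc_error": "scrub",
    "state_machine_lockup": "reset",
    "readback_sefi": "reconfigure",
}

def select_recovery(event):
    """Return the recovery mode for an event; unknown events fall back to reset."""
    return RECOVERY_MODES.get(event, "reset")

for event in ("config_crc_error", "readback_sefi", "unlisted_event"):
    print(event, "->", select_recovery(event))
```

Keeping the table explicit gives the verification team a concrete list to attack, and the fallback keeps the scheme simple when an error mode was not known in advance.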

## Conclusion



- Understand the device's error signatures and upset rates before mitigation is implemented
- Slowing down the frequency does not necessarily mean you are reducing your SEU susceptibility
- Not all designs are critical and may not need mitigation
- Be aware when correction is necessary:
  - Make sure you are correcting your state
  - Masking without correction can incur error accumulation and eventually break

# Conclusion



- Detection circuits don't generally have redundancy and can be susceptible – make sure they are not making your design more susceptible (e.g. "safe" state machines)
  - Perform a trade... what is the detection circuitry buying the system?
- Perform proper trade studies to determine the type of mitigation necessary to meet requirements:
  - Upset rates
  - Area+Power
  - Complexity... completion and verification within time specified
- Keep it simple... verification is the next and final step to a fail-safe system