Update of SSR Guidelines: What Twenty Years of Solid-State Recorders (SSR) Tells Us about the Next Twenty

Ray Ladbury

Radiation Effects and Analysis Group
NASA Goddard Space Flight Center
Greenbelt, MD 20771 USA
Acknowledgements

• NASA Electronic Parts and Packaging (NEPP) Program
• Lunar Reconnaissance Orbiter (LRO) Mission
Why Study Solid-State Recorders?

- Solid-state recorders (SSRs) perform a critical mission function
  - Allow efficient data return and mission continuity when transfer impossible
- SSRs use very large amounts of commercial memory
  - Commercial memories have >30x density of radiation hardened memories
    - Now dominated by Synchronous Dynamic Random Access Memory (SDRAM)
  - Each SSR is an on-orbit test of up to thousands of state-of-the-art memory die
    - Increases likelihood of seeing new or rare events and failure modes
    - May allow exploration of part-to-part variation
- Commercial memories have high single-event effects (SEE) rates
  - High sensitivity and density validate radiation-environment and SEE rate models
  - Ubiquity of SSRs allows comparison of operation in many environments
- Commercial memory use means their technology changes rapidly
  - Do hardening techniques of the past still work?
  - Have new units operated as did their predecessors?
- Memory is the gateway drug for commercial electronics
  - Support circuitry + controllers must have similar speed to memories
- Presenting results of updated SSR study

To be presented by Ray Ladbury at the NASA Electronic Parts and Packaging Program (NEPP) Electronics Technology Workshop (ETW), NASA Goddard Space Flight Center in Greenbelt, MD, June 11-12, 2013 and published on nepp.nasa.gov.
**SDRAM Subtask**

**Description:**
This is a continuation task for evaluating the effects of scaling (<100nm), new materials, etc. on state-of-the-art (SOTA) mass volatile memory (VM) technologies, mainly SDRAM. The intent is to determine inherent radiation tolerance and sensitivities, identify challenges for future radiation hardening efforts, investigate new failure modes and effects, and provide data to DTRA/NASA technology modeling programs. Testing includes total dose, single event (proton, laser, heavy ion), proton damage (where appropriate) and reliability. Test vehicles will include a variety of volatile memory devices as available, including DDR2 SDRAMs and commercial SRAMs... and DDR3 devices. Emphasis for 2013 will be SEE testing of DDR2 and DDR3 DIMMs using commercial evaluation boards.

**FY13 Plans:**
Probable test structures
- DDR2 and DDR3 SDRAM from Samsung and Micron
- Develop test strategies using commercial FPGA-based evaluation boards for DDR2 and DDR3 DIMMs
- Test focuses for year will be SEE in DDR2 and DDR3 devices
- Evaluate potential synergistic effects of TID and SDRAM aging
  - Use thermal and voltage acceleration methods
  - Evaluate degradation due to aging/stress
  - Compare TID response of stressed to unstressed parts
- SDRAM SEE response for current generation DDR2 and DDR3 SDRAMs if funding, tester capability and time allow.
*DDR=Double Data Rate

**Schedule:**

<table>
<thead>
<tr>
<th>SDRAM radiation response</th>
<th>2012</th>
<th>2013</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>OND</td>
<td>JFMAS</td>
</tr>
<tr>
<td>TID/stress test of DDR3</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Develop Guidelines for TID + Stress/Aging testing</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Final SSR Guidelines</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Delivery of final reports and Guidelines</td>
<td></td>
<td></td>
</tr>
<tr>
<td>SEE testing of DDR2</td>
<td></td>
<td></td>
</tr>
<tr>
<td>SEE testing of DDR3</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

**Deliverables:**
- Updated guidelines for TID testing of SDRAMs and SSRs
- Guidelines/Tradeoffs for use of commercial eval boards as testers
- Test reports
- Publications

**Beam procurements:** GSFC/REF, TAMU
**Partners:** BAE, 3D Plus, JPL, Micron

**Subtask lead:** Ray Ladbury

---

**SDRAM**: synchronous dynamic random access memory; **SRAM**: static random access memory; **DIMM**: dual in-line memory module; **FPGA**: field programmable gate array; **TID**: total ionizing dose; **DDR**: double data rate (generation 1, 2, 3, etc.); **SEE**: single-event effects

**Legend:**
- O = October
- N = November
- D = December
- J = January
- F = February
- M = March
- A = April
- J = June
- S = September

---

**REF:** Radiation Effects Facility; **TAMU:** Texas A&M University
Goals

- Improved understanding of SSR performance
  - Are current SSR designs adequate for high-reliability data storage?
  - Are current mitigations inadequate, about right or overdesigned?
- Understand radiation risks in (near) current memories
  - Use large memory arrays to elucidate and bound rates for rare modes
  - Assess accuracy of rate estimation tools
  - Assess effect of space weather on SEE rates
- Assess current SSR mitigation and design strategies, especially their implications for DDR2/3/X in future designs
  - Future SSR designs likely to use DDR2 or DDR3 devices
  - Are current mitigation strategies sufficiently robust?
  - Will new error modes require new recovery strategies?
- Consider implications of SSR results for other uses of commercial memory in space

SSR: solid state recorder; DDR: double data rate (generation 1, 2, 3, etc.)
Expected Impact to Community

• Increase confidence for use of DDR devices in space environments
  – SEE represent main limiting factor for use of DDR2/3 in space

• Improved understanding of rare SEE Modes
  – Understand error consequences
  – Suggest recovery mechanisms
  – Understand what error modes are important for future testing and design of DDRX memories

DDR: double data rate (generation 1, 2, 3, etc.); SEE: single-event effects
Status/Schedule

- Solid state recorder (SSR) update is completed and released
  - Minor modifications made to original study by Christian Poivey to improve consistency with new results
  - New results indicated as pertaining to 2002-2012 period covered by update
Missions for 2002 Study

Original study done by Christian Poivey in 2002

- **Original study**
  - >2 solar cycles
  - >60 missions/experiments
  - SRAM (up to 1 Mbit) (static random access memory)
  - DRAM (up to 16 Mbit)

- **Update adds data for**
  - Two published studies
  - 5 missions
  - 64-512 Mbit SDRAM
  - Environments from low-Earth orbit (LEO) to interplanetary
Guideline 1: Know Your Mission Environment

C. Poivey, “Flight Data Analysis Report,”
Guideline 2: Know Your Requirements

- Is application a 1 TB data recorder or a 10 Gbit scratchpad?
  - Error Detection and Correction (EDAC) options and available architectures (e.g., for interleaving bits in data words) very different

- What is important?
  - Is bit-error rate really important for video data?
  - Do you care if <<1% of science frames are lost due to recoverable Single-Event Functional Interrupt (SEFI)?
  - If SEU are single-bit, and block errors are >100x rarer, is single-bit correction acceptable in lieu of double-bit correction?
  - How long does data stay in memory?

- What are the physical requirements?
  - Memory size, speed, physical dimensions, power consumption, etc.

- What are the performance requirements?
Guideline 3: Know Your Device SEE Response

- Is response simple or complex?
  - Does device exhibit nonrecoverable SEFIs, stuck bits, multi-bit errors?
    - Double Data Rate (DDR2) SDRAM much more complex than 16 Mbit DRAM
- How well are rates known?
  - How large are statistical errors (e.g., Poisson fluctuations)?
  - How well is part-to-part and/or lot-to-lot variation bounded?
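
The question about Poisson fluctuations can be made concrete with a small sketch. It computes a one-sided upper confidence limit on the mean of a Poisson count, then converts it to a cross-section bound by dividing by the test fluence. The function names, the 95% confidence level and the example fluence are illustrative assumptions and are not taken from the study.

```python
import math


def poisson_cdf(n: int, mu: float) -> float:
    """P(N <= n) for a Poisson distribution with mean mu."""
    term = math.exp(-mu)
    total = term
    for k in range(1, n + 1):
        term *= mu / k
        total += term
    return total


def upper_limit_mean(n_observed: int, confidence: float = 0.95) -> float:
    """Smallest mean for which seeing <= n_observed events has probability <= 1 - confidence."""
    tail = 1.0 - confidence
    lo, hi = 0.0, n_observed + 10.0
    # Grow the bracket until the upper end is safely above the limit.
    while poisson_cdf(n_observed, hi) > tail:
        hi *= 2.0
    # Bisection: poisson_cdf decreases monotonically with the mean.
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(n_observed, mid) > tail:
            lo = mid
        else:
            hi = mid
    return hi


def cross_section_upper_bound(n_observed: int, fluence_per_cm2: float,
                              confidence: float = 0.95) -> float:
    """Upper bound on cross section (cm^2) given event count and fluence (ions/cm^2)."""
    return upper_limit_mean(n_observed, confidence) / fluence_per_cm2


# Hypothetical test: zero SEFIs seen after 1e7 ions/cm^2.
print(f"{cross_section_upper_bound(0, 1e7):.2e} cm^2")
```

Even with zero observed events the bound is about three events divided by the fluence. This is why rare modes that cut test runs short tend to end up with conservative, poorly constrained rates.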

### Guideline 4: Consequences & Remediation

<table>
<thead>
<tr>
<th>Radiation Risk</th>
<th>Consequence</th>
<th>Remediation</th>
<th>Impact to design</th>
</tr>
</thead>
<tbody>
<tr>
<td>Destructive Single-Event Latchup (SEL)</td>
<td>Permanent loss of 1 die in memory array</td>
<td>Redundant die in array such that probability of meeting End-of-Life (EOL) requirements is high</td>
<td>Severe</td>
</tr>
<tr>
<td>Nondestructive SEL</td>
<td>Loss of all data on affected die/stack</td>
<td>Requires power cycle of affected die/stack for recovery</td>
<td>Moderate to severe</td>
</tr>
<tr>
<td>Single-Event Functional Interrupt (SEFI) requiring power cycle</td>
<td>Loss of functionality on affected die; Loss of most or all data on affected die/stacks</td>
<td>Requires power cycle of affected die/stack for recovery; Error Detection and Correction (EDAC) may correct data loss.</td>
<td>Moderate to severe</td>
</tr>
<tr>
<td>Recoverable SEFI</td>
<td>Temporary loss of functionality; Loss of large amounts up to all data on affected die.</td>
<td>EDAC + Organization of data words across independent die; FPGA programmed w/ ability to refresh mode registers/reset device</td>
<td>Moderate</td>
</tr>
<tr>
<td>Stuck Bits</td>
<td>Uncorrectable loss of data integrity in affected bits/symbols</td>
<td>EDAC can correct incorrect bit, but capability permanently degraded</td>
<td>Minor</td>
</tr>
<tr>
<td>Multi-Bit Upset (MBU)</td>
<td>Correctable loss of data for multiple bits in same word</td>
<td>EDAC must have sufficient power to correct worst-case MBU (usually no harder to correct than worst-case SEFI)</td>
<td>Moderate</td>
</tr>
<tr>
<td>Multi-Cell SEU</td>
<td>Multiple bits upset, but in different words</td>
<td>EDAC</td>
<td>Minor</td>
</tr>
<tr>
<td>SEU</td>
<td>Single-bit upset</td>
<td>EDAC</td>
<td>Minor</td>
</tr>
</tbody>
</table>
Guideline 5: Know Options for SSR Hardening

- SSR hardening relies on a multi-tiered approach
  - First line of defense is conservative SEE testing
    - Testing goal #1: Find a part with SEL rate as close to 0 as possible
    - Testing goal #2: Ensure rate for SEFI requiring power cycle as close to 0 as possible
    - Testing goal #3: Get sufficient data to BOUND rates for all SEE modes
    - Allow for part-to-part variation in SEE rates—estimate rates conservatively
  - Second tier is Error Detection and Correction (EDAC)
    - EDAC code calculates error correction bits from values of data bits and can detect and correct errors up to some maximum size if discrepancies are found.
    - Other mitigations needed to ensure worst-case error does not exceed EDAC capability
  - Other mitigations
    - Goal is to keep errors from overwhelming EDAC—either all at once or cumulative
    - Interleave words across multiple die (die width is a good “symbol” length), so that even if all bits on a single die corrupted, EDAC can still correct data words
    - Scrub entire SSR periodically so probability of errors accumulating in any period small
- For small memory applications interleaving may not be possible
  - Triplicate voting may be more efficient.
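
Two of the mitigations above can be illustrated in a few lines: interleaving the symbols of each word across die, and triplicate voting for small memories. The sketch counts how many symbols of any one word are lost when a whole die is corrupted, with and without interleaving, and shows a bitwise majority vote. The mapping functions and the sizes used are illustrative assumptions, not a description of any flown recorder.

```python
def die_for_symbol(word_index: int, symbol_index: int, n_die: int,
                   interleaved: bool) -> int:
    """Die that stores a given symbol of a given word."""
    if interleaved:
        # Each symbol of a word goes to a different die (needs n_die >= symbols per word).
        return symbol_index % n_die
    # Without interleaving, all symbols of a word sit on the same die.
    return word_index % n_die


def worst_case_bad_symbols(n_words: int, symbols_per_word: int, n_die: int,
                           failed_die: int, interleaved: bool) -> int:
    """Largest number of corrupted symbols in any single word if failed_die loses all data."""
    worst = 0
    for w in range(n_words):
        bad = sum(
            1 for s in range(symbols_per_word)
            if die_for_symbol(w, s, n_die, interleaved) == failed_die
        )
        worst = max(worst, bad)
    return worst


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 vote across three stored copies of a word."""
    return (a & b) | (a & c) | (b & c)


print(worst_case_bad_symbols(16, 4, 4, failed_die=2, interleaved=True))   # 1
print(worst_case_bad_symbols(16, 4, 4, failed_die=2, interleaved=False))  # 4
print(hex(majority_vote(0x3C, 0xFF, 0x3C)))                               # 0x3c
```

With interleaving, an EDAC code that corrects a single symbol per word survives the loss of a full die. Without it, some words lose every symbol and no code can recover them.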

SEE: single-event effects; SEL: single-event latchup; SEFI: single-event functional interrupt
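
The scrubbing guideline can be turned into a rough estimate of how many words accumulate more upsets than a single-bit-correcting EDAC can handle between scrubs. The sketch assumes independent single-bit upsets at a constant per-bit rate and ignores SEFI and block errors. The 39-bit code word (32 data bits plus 7 check bits) and the other numbers are hypothetical.

```python
import math


def prob_multiple_upsets(mean_upsets: float) -> float:
    """P(2 or more upsets) for a Poisson mean, written to stay accurate for tiny means."""
    if mean_upsets < 1e-4:
        # Leading series terms mu^2/2 - mu^3/3 avoid floating-point cancellation.
        return mean_upsets**2 / 2.0 - mean_upsets**3 / 3.0
    return 1.0 - math.exp(-mean_upsets) * (1.0 + mean_upsets)


def expected_uncorrectable_words(rate_per_bit_day: float, bits_per_word: int,
                                 n_words: float, scrub_interval_days: float) -> float:
    """Expected words per scrub interval with >= 2 upsets (single-bit-correcting EDAC)."""
    mean_upsets = rate_per_bit_day * bits_per_word * scrub_interval_days
    return n_words * prob_multiple_upsets(mean_upsets)


# Hypothetical recorder: 2.5e11 words of 39 bits, 1e-11 upsets per bit-day.
for interval_days in (0.1, 1.0, 10.0):
    n_bad = expected_uncorrectable_words(1e-11, 39, 2.5e11, interval_days)
    print(f"scrub every {interval_days:>4} days: {n_bad:.2e} words per interval")
```

Since the per-interval probability grows with the square of the interval, halving the scrub period roughly halves the number of uncorrectable words accumulated per day.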
Guideline 6: These are Commercial Parts

• Product life-cycles are short (~18 months)
  – Procuring test lot, radiation testing, qualification, etc. can leave little time to procure part before it becomes obsolete or is revised.
  – A revised part means starting all over again.

• Lot-traceability may not be possible
  – Need to ensure test sample is representative of flight parts some other way
  – Part-to-part variation in SEE response may be larger than for MIL or space parts
  – Threat of counterfeit parts is real unless source is trusted

• Commercial memories are too complicated to test completely
  – Cannot test all possible operating-mode combinations
  – Parts exhibit a variety of disruptive error modes (e.g. SEFI, SEL)
  – Lack of traceability means should test larger sample; may not be possible

• These factors argue for significant margin when estimating SEE rates
  – Needs to be remembered by designer as well as SEE analyst—if rate drives design and increases cost, pushing back may be fruitful

SEE: single-event effects; SEL: single-event latchup; SEFI: single-event functional interrupt
Technical Highlights

Changes Since 2002 I—Memories

- SEL performance has been variable, but improved since 2008
- SEU—per bit SEU rates remain very low (~$10^{-12}$-$10^{-11}$ per bit day)
- Multi-bit upsets usually due to control logic or read errors due to interleaving
- Stuck bits have been manageable (<$10^{-5}$ per device day)
- SEFIs and block errors have accounted for increasing proportion of errors
  - Large blocks of multibit errors increase importance of error detection and correction (EDAC)
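
To put the per-bit SEU rates above in context, this sketch multiplies them by the two memory sizes mentioned under Guideline 2. Treating 1 TB as 10^12 bytes and 10 Gbit as 10^10 bits is an assumption made here, and the function name is illustrative.

```python
def expected_upsets_per_day(capacity_bits: float, rate_per_bit_day: float) -> float:
    """Mean number of single-bit upsets per day across the whole memory."""
    return capacity_bits * rate_per_bit_day


TERABYTE_BITS = 8 * 10**12   # 1 TB taken as 10^12 bytes (assumption)
TEN_GBIT_BITS = 10**10       # 10 Gbit scratchpad (assumption: decimal giga)

for label, bits in (("1 TB recorder", TERABYTE_BITS), ("10 Gbit scratchpad", TEN_GBIT_BITS)):
    low = expected_upsets_per_day(bits, 1e-12)
    high = expected_upsets_per_day(bits, 1e-11)
    print(f"{label}: {low:g} to {high:g} upsets per day")
```

Under these assumptions a 1 TB recorder sees a handful to tens of upsets per day, while the scratchpad sees one every ten to one hundred days. This is part of why the EDAC options differ so much between the two.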

Figures taken from: "Lessons Learned from Radiation Induced Effects on Solid State Recorders (SSR) and Memories"
Technical Highlights
Changes Since 2002 II: Recorders

• SDRAMs have dominated bulk data storage applications since 2002
  – Most SSRs have used Elpida 256 Mbit or 512 Mbit SDRAM
  – First DDR devices flown in 2013 (Delay-Locked Loop (DLL) disabled)

• Data words are larger—increased from 16 bits to 32 bits
  – Encourages use of wider SDRAMs, requiring more powerful EDAC or degrading margins

• DRAM die often stacked 4-8 in a package to save board real estate
  – Since die share a common power supply, cannot cycle power to individual die
    • If SEFI or nondestructive SEL requires power cycle, all data on stack lost.
    • Note that use of FLASH or other nonvolatile eliminates this problem

• Mitigation has become more sophisticated
  – Greater symbol/word width favors multi-symbol error correction
  – SEFIs requiring a power cycle for recovery favor sparser interleaving

• Recorder performance
  – Most recorders cannot utilize full speed of DDR2/3 devices.
  – Mitigations available continue to be effective—no data lost in studies reviewed

SDRAM: synchronous dynamic random access memory; SSR: solid state recorder; SEFI: single-event functional interrupt; DDR: double data rate (generation 1, 2, 3, etc.)
Technical Highlights
Changes Since 2002 III: Flight Data

Only 2 published studies comparing on-orbit and predicted SEE rates

- Schaefer et al.* looked at SEU in 64, 128 & 256 Mbit SDRAMs
  - Space weather effects difficult to discern
  - Effect of solar cycle evident
  - Ratio of predicted to observed rates varies from 1.26 to 10.2
  - No trend with technology

- GSFC looked at data storage boards (DSB) for LRO
  - Elpida 512 Mbit SDRAM
  - >90% of errors occurred in block errors
  - Predicted SEU rate ~6x observed—estimated with good statistics
  - Poor statistics for SEFI and block errors, so prediction overestimates rate by >100x
  - CREME96 estimates rates adequately up to 512 Mbit generation

SEU: single-event upset; SDRAM: synchronous dynamic random access memory; LRO: Lunar Reconnaissance Orbiter; SEFI: single-event functional interrupt

<table>
<thead>
<tr>
<th>Error Mode</th>
<th>Predicted #/dev-day</th>
<th>Observed #/dev-day</th>
</tr>
</thead>
<tbody>
<tr>
<td>SEU</td>
<td>1.54E-02</td>
<td>3.10E-03</td>
</tr>
<tr>
<td>Logic Errors</td>
<td>1.00E-02</td>
<td>1.70E-04</td>
</tr>
<tr>
<td>Block Errors</td>
<td>3.50E-03</td>
<td>3.30E-05</td>
</tr>
<tr>
<td>SEFI</td>
<td>&lt;8.0E-06</td>
<td>&lt;6.4E-06</td>
</tr>
<tr>
<td>Stuck bits</td>
<td>&lt;1E-5</td>
<td></td>
</tr>
</tbody>
</table>
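
As a quick check on the table, the ratio of predicted to observed rate can be computed for the modes where both values are actual measurements rather than upper bounds. The sketch uses only the tabulated numbers. The dictionary layout and variable names are illustrative.

```python
# (predicted, observed) events per device-day, copied from the table above.
flight_data = {
    "SEU": (1.54e-2, 3.10e-3),
    "Logic Errors": (1.00e-2, 1.70e-4),
    "Block Errors": (3.50e-3, 3.30e-5),
}

# SEFI and stuck bits are omitted: the table gives only upper bounds for them.
ratios = {mode: predicted / observed for mode, (predicted, observed) in flight_data.items()}

for mode, ratio in ratios.items():
    print(f"{mode}: predicted/observed = {ratio:.1f}")
```

From the tabulated values the ratio is about 5 for SEU, about 59 for logic errors and about 106 for block errors. The over-prediction is largest for the rare modes, where test statistics were poorest.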

Technical Highlights

SSRs of The Future

• Near future will use DDR2/3/4
  – 2-4 Gbit die, >2 GHz operation possible
  – Challenges
    • Too fast for radiation hardened controllers—can run <100 MHz w/ DLL disabled
    • Increased width of part and word length may require sophisticated EDAC
    • Common VDD of stacked die make recovery from SEFI more challenging
• Improved endurance, retention and radiation performance make FLASH viable
  – Nonvolatile storage a significant advantage, and higher density than SDRAMs
  – Challenges
    • Too slow for many applications
    • Page Erase and other operational characteristics inconvenient in some applications
    • Susceptible to hard SEE failure during Write and Erase operations
• Whatever comes next???
  – Nonvolatile—top candidates are resistive RAM & Spin-Torque Transfer RAM
    • Hope: >DDR4 speed, >flash density, memory cell likely hard to SEE
    • Challenges will come from support/control circuitry

DDR: double data rate (generation 1, 2, 3, etc.); DLL: delay locked loop; EDAC: error detection and correction; SEFI: single-event functional interrupt; TID: total ionizing dose; SDRAM: synchronous dynamic random access memory; RAM: random access memory; SEE: single-event effects
Technical Highlights
Additional Guidelines And Advice

• Guidelines and hardening techniques from 2002 still hold well
• Additional trends worth noting
  – Continued consolidation of commercial memory manufacturers
    • Limits choices and adds another risk for procurement
  – Increasing prevalence of SEFI/block errors vs. SEU
    • Some applications may get by with limited EDAC
    • Some may really need multi-symbol correction
    • Need to understand application and conservatism of rate estimates
      – SEFI very disruptive to testing, so rates estimated with poor statistics
  – Increasing complexity of memory makes every test application specific
    • Need to understand application if test is to be valid
  – For DDR2/3, SEU cross section does not scale with effective LET
    • Introduces more uncertainty into rate estimation
    • SEFI/ block errors still scale, but estimation poor due to poor statistics

SEFI: single-event functional interrupt; SEU: single-event upset;
EDAC: error detection and correction;
DDR: double data rate (generation 1, 2, 3, etc.);
LET: linear energy transfer
Conclusions

• Conventional Guidelines and hardening techniques for SSRs still hold
  – Although parts have changed—they are still commercial and subject to same market forces and risks
  – Hardening techniques work because they address consequences of SEE modes
    • Permanent failure of entire part (e.g. single-event latchup, burnout)
    • Permanent partial failure of part (e.g. stuck bit)
    • Unrecoverable loss of functionality requiring power cycle (SEFI)
    • Large blocks of data lost (block error)
    • Small amount of data lost (single-event upset)
    • Effects of any single error confined to a single die
    • Usually ensure at least 1 worst-case error correctable (data loss + functionality)

• Trends such as stacking, wider parts and longer data words pose challenges
  – Hardening techniques can be adapted to ensure data remain secure

• Conventional hardening approaches likely to work for new technologies as they come along
  – Nonvolatile technologies actually simplify mitigation somewhat.

SSR: solid state recorder
SEFI: single-event functional interrupt; SEE: single-event effects;