

#### **GPU Radiation Test Status**

Edward J. Wyrwas edward.j.wyrwas@nasa.gov 301-286-5213 SSAI, Inc / NASA GSFC NEPP

This work was sponsored by:

NASA Electronic Parts and Packaging (NEPP) Program



### Acronyms

- Body of Knowledge (BOK)
- Complementary metal oxide semiconductor (CMOS)
- Commercial off-the-shelf (COTS)
- Device under test (DUT)
- Electrical, Electronic and Electromechanical (EEE)
- Energy (E)
- Error rate (λ)
- Field programmable gate array (FPGA)
- Fin Field-effect transistor (FinFET)
- Graphics Processing Unit (GPU)
- Joint Electron Device Engineering Council (JEDEC)
- Society of Automotive Engineers (SAE)
- Linear energy transfer (LET)
- Key Process Indicators (KPI)
- Mean time to failure (MTTF)
- Multi-Bit Upset (MBU)

- National Aeronautics and Space Administration (NASA)
- NASA Electronic Parts and Packaging (NEPP)
- Package-in-package (PIP)
- Package-on-package (POP)
- Single-Bit Upset (SBU)
- Single Event Effect (SEE)
- Single Event Functional Interrupt (SEFI)
- Single Event Upset (SEU)
- Single Event Upset Cross-Section ( $\sigma_{SEU}$ )
- Single Instruction Multiple Data (SIMD)
- System on Chip (SOC)
- System on Module (SOM)
- Technical Operation Report (TOR)



#### **NEPP – Processors**





### **Modern Components**

- We may still use some legacy parts with well-known reliability and radiation tolerances, but we also test leading-edge computational components
  - Microprocessors (e.g., x86, x64, ARM, Power Arch.)
  - GPUs (e.g., NVIDIA, AMD, Qualcomm)
  - Memories (e.g., 3D XPoint, PCM, DDR3, DDR4, Flash)





- Computational device families are converging
- Using high-level languages, applications are accelerated by running the sequential part of their workload on the CPU, which is optimized for single-threaded performance, and offloading the parallel part to embedded engines or coprocessor devices
- Key computational pieces of mission applications can run on coprocessors and edge devices
  - Sensor and science instrument input
  - Object tracking and obstacle identification
  - Algorithm convergence (e.g., neural network, simulations)
  - Image processing
  - Data compression algorithms and encryption



# FPGA vs GPU vs CPU

| FPGA               | GPU                                                    | CPU                |
|--------------------|--------------------------------------------------------|--------------------|
| Hardware           | Software<br>(bare-metal)                               | Software<br>(+ OS) |
| Complete<br>system | Accelerator is useless alone,<br>but ON when necessary | Complete<br>system |
| low-power          | it depends<br>(low-power/operation)                    | +/- low-power      |

- Floating-point operations (neural-net, image, radar)
- High amount of data to analyze
- High efficiency/high bandwidth applications



#### **DDR Interface**

- Often found in PIP, POP and Stacked Die processors
- Multi-bit error correction features can be employed
- Cell disturbance via Rowhammer has manifested in DDR3 & DDR4 due to feature scaling
- Typical software model:
  1. Flight computers boot from ROM, but tend to run from RAM
  2. RAM permits larger data sets to be processed concurrently







# **Evaluation Timeline**





# **Application Focused Payloads**

- Simulations
  - SDK Sample code
- Bit streams
  - Sensors or camera
  - Offline video feed
- Computational loading
  - LinPack
- Neural networks
  - Landsat image classification
- Display Buffer Output
  - RGBYWB Colors
  - Texture and Ray Tracing (Furmark)

- Encryption
  - SHA 256
- Benchmarks
- Easy Math
- Performance Corner tests
  - High/Low voltages
  - High/Low temperatures
  - Current limited
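A self-checking encryption payload of the kind listed above could be structured roughly as follows. This is only a sketch: the names `make_golden`, `run_payload` and `test_vector` are hypothetical, and the hash is computed on the host CPU with Python's `hashlib` purely for illustration. An actual payload would run the hash on the DUT and compare the result against a digest computed offline.

```python
import hashlib


def make_golden(data: bytes) -> str:
    """Compute the reference SHA-256 digest offline, before irradiation."""
    return hashlib.sha256(data).hexdigest()


def run_payload(data: bytes, golden: str, iterations: int) -> list[int]:
    """Hash the same input repeatedly and return the iteration indices
    whose digest did not match the golden value."""
    mismatches = []
    for i in range(iterations):
        digest = hashlib.sha256(data).hexdigest()
        if digest != golden:
            mismatches.append(i)
    return mismatches


# Hypothetical fixed test vector: 4 KiB of a repeating byte ramp.
test_vector = bytes(range(256)) * 16
golden = make_golden(test_vector)

# With no upsets, every iteration reproduces the golden digest.
mismatches = run_payload(test_vector, golden, iterations=100)
print(f"{len(mismatches)} mismatching iterations")
```

Because any single flipped bit in the input or intermediate state changes the digest, a mismatch flags that an error occurred but not where; a payload that needs to locate the upset would have to compare raw data instead.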



# **Rapid Test Preparation**



AMD e9173 GPU (Clockwise from top right)

1. As received
2. Without fansink
3. Without heatsink
4. Underside
5. Render of adapter plate
6. Toolpath settings






#### Test Preparation:

- Software payloads are created offline
- Conduction cooling system is modular and portable
- Adapter plates are designed and fabricated in 3-5 business days



#### **DDRx Test Readiness**





# **Thoughts**

- NEPP and its partners have conducted proton, neutron and heavy ion testing on many devices
  - Have captured SEUs (SBU & MBU)
  - Have seen repeatable current spikes and latch-up behavior
  - Have predominantly encountered system-based SEFIs
- Microprocessor and memory tests require a complex platform to arbitrate the test vectors, monitor the DUT (in multiple ways) and record data
  - None of these should require the DUT itself to reliably perform any other task outside of being exercised
- Every test is another learning experience; improvements are always possible, but preparation time is not always abundant
- Prioritization during development is important
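As an illustration of how captured upsets might be reduced to the quantities named in the acronym list, the sketch below compares read-back words against an expected pattern, counts SBUs and MBUs, and computes $\sigma_{SEU}$, λ and MTTF. It rests on several assumptions that are not from this presentation: an MBU is taken to mean more than one flipped bit within the same word, the cross-section is the usual events-per-fluence ratio, the error rate is cross-section times flux, and MTTF = 1/λ holds only for a constant error rate. All function names and numeric values are hypothetical.

```python
def classify_upsets(expected: list[int], observed: list[int]) -> tuple[int, int]:
    """Return (sbu_count, mbu_count) by XOR-ing each expected/observed word.
    Assumption: an MBU is more than one flipped bit in the same word."""
    sbu = mbu = 0
    for e, o in zip(expected, observed):
        flipped_bits = bin(e ^ o).count("1")
        if flipped_bits == 1:
            sbu += 1
        elif flipped_bits > 1:
            mbu += 1
    return sbu, mbu


def cross_section(n_events: int, fluence: float) -> float:
    """sigma_SEU = events / fluence (cm^2 when fluence is particles/cm^2)."""
    if fluence <= 0:
        raise ValueError("fluence must be positive")
    return n_events / fluence


def error_rate(sigma: float, flux: float) -> float:
    """lambda = sigma * flux (errors per second for flux in particles/cm^2/s)."""
    return sigma * flux


def mttf(rate: float) -> float:
    """MTTF = 1 / lambda, valid only under a constant error rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return 1.0 / rate


# Hypothetical read-back of four 32-bit words written with 0xAAAAAAAA.
expected = [0xAAAAAAAA] * 4
observed = [0xAAAAAAAA, 0xAAAAAAAB, 0xAAAAAAA9, 0xAAAAAAAA]

sbu, mbu = classify_upsets(expected, observed)
sigma = cross_section(sbu + mbu, fluence=1e10)  # hypothetical fluence
rate = error_rate(sigma, flux=1e3)              # hypothetical flux
print(sbu, mbu, sigma, rate, mttf(rate))
```

In this made-up read-back, the second word differs by one bit and the third by two, so the classifier reports one SBU and one MBU.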



# **Thoughts**

- The NEPP microprocessor and GPU testing has been standardized:
  - rapid development of cooling system for each DUT form factor and packaging type
  - system implementation using modular COTS system and network components
  - public domain software that has been extensively tested by the community
  - payloads that can be easily updated to accommodate new DUTs while maintaining the ability to test older DUTs



#### **Acknowledgements**

- This work has been sponsored by the NASA Electronic Parts and Packaging (NEPP) Program.
- Thanks are given to the NASA Goddard Space Flight Center's Radiation Effects and Analysis Group (REAG) for their technical assistance and support.