#### COTS 3D NAND Flash: SEE Test Results and Challenges

Edward Wilcox, Michael Campola, Kenneth LaBel

NASA Goddard Space Flight Center

# Outline



- Present State of Flash Memory
- NASA GSFC Testing Status
  - Devices Under Test
  - 3D NAND Flash Results To Date
- COTS Flash Memory Testing Challenges
  - Packaging, Availability, and Electrical Access
- Future Plans

#### Acronyms



- COTS: Commercial Off The Shelf
- ECC: Error-Correcting Code
- EDAC: Error Detection and Correction
- GEO: Geostationary Earth Orbit
- LET: Linear Energy Transfer
- MBU: Multiple Bit Upset
- MLC: Multi-level Cell
- NAND: Not AND (Flash Technology)
- NEPP: NASA Electronic Parts and Packaging Program
- QLC: Quad-level Cell
- RBER: Raw Bit Error Rate

- SBU: Single Bit Upset
- SEE: Single Event Effects
- SEFI: Single Event Functional Interruption
- SEU: Single Event Upset
- SLC: Single-level Cell
- SSD: Solid State Drive
- TID: Total Ionizing Dose
- TLC: Triple-level Cell
- UBER: Uncorrected Bit Error Rate

# **State of Flash Memory**



- Limitations of 2D Highly-Scaled Flash
- 3D Structures Maturing / Available
  - Samsung 64-layer VNAND<sup>™</sup>
  - Toshiba / Western Digital / SanDisk 64-layer BiCS3<sup>™</sup>
  - Micron / Intel 64-layer
  - Hynix 72-layer
- 1TB SSD <\$500; 6Tb+ in a single package!
- Not just discrete components to worry about
  - Integration into SoC- and SoB-type applications

# **3D NAND Structure**



- Vertical flash strings, with 64 layers now common
- Not to be confused with 3D-stacking of multiple die in package

[Chipworks]



[http://www.micron.com]

Close-up image of V-NAND flash array

[https://www.3dincites.com/2014/08/samsungs-3d-vnand-flash-product-spires-el-dorado/]

# **NEPP / NASA GSFC Testing Status**



- Previous NEPP SEE testing on Hynix 3D
  - 36-layer vs. new 72-layer
  - D. Chen, NSREC 2017; TNS Jan. 2018
- 2017/2018 SEE testing on Micron MLC 3D NAND
  - 32-layer, floating-gate technology
  - 1Tb packages with four 256Gb die
  - Limited availability / required teaming for procurement
  - Re-used simple microcontroller test setup
- Ongoing SEE testing on a variety of SSD modules
  - Major manufacturers have their latest flash on SSDs
  - Easy procurement BUT limited documentation
  - No direct electrical access to memory devices

### **Devices Under Test**



- Micron MT29F1T08CMHBB
  - 256Gb die; MLC; 32 layers; piece-part testing
- Micron MT29F768G08EEHBB
  - 384Gb die; TLC; 32 layers; Crucial MX300 SSD module
- Intel
  - 256Gb die; TLC; 64 layers; Intel 545 SSD module
- Samsung
  - TLC; 64 layers; Samsung T5 Portable SSD
- SanDisk/Toshiba
  - TLC; 64 layers; WD Blue 3D SSD module
  - 15nm planar TLC; WD Blue SSD module
- Hynix H27QDG822C8R-BCG
  - Piece-part testing; MLC; 36 layers

# Micron MT29F1T08CMHBBJ4



- Leveraged previous NASA test setups with Cortex-M4 microcontroller
  - Simple asynchronous interface
  - Low-level electrical access; no mapping or abstraction
  - No ECC  $\rightarrow$  We can actually see bit upsets...





# Micron MT29F1T08CMHBBJ4







- Dakai Chen on Hynix 36-layer MLC (TNS, 2018):



#### 3D NAND Angular Effects, Constant LET



 How does "Cosine Law" apply with 3D NAND flash?
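For reference, the conventional cosine law for a thin, planar sensitive volume scales LET by 1/cos θ and the fluence through the die plane by cos θ. The sketch below only restates that textbook relation; the function names are illustrative, and whether the approximation holds for vertical 3D NAND strings is the open question posed on this slide.

```python
import math


def effective_let(let_normal, angle_deg):
    """Cosine-law effective LET: LET_eff = LET_0 / cos(theta)."""
    return let_normal / math.cos(math.radians(angle_deg))


def effective_fluence(beam_fluence, angle_deg):
    """Cosine-law fluence through the die plane: Phi_eff = Phi * cos(theta)."""
    return beam_fluence * math.cos(math.radians(angle_deg))


# Illustrative inputs (not from the test data): at 60 degrees cos(theta) = 0.5,
# so the effective LET doubles and the effective fluence halves.
let_60 = effective_let(10.0, 60.0)
fluence_60 = effective_fluence(1e7, 60.0)
print(f"LET_eff = {let_60:.1f} MeV*cm^2/mg, fluence_eff = {fluence_60:.2e} cm^-2")
```

If the LET of 21 MeV·cm<sup>2</sup>/mg quoted later for Ar at 65° is an effective LET under this law (an assumption), it would correspond to a normal-incidence LET of roughly 21 × cos 65° ≈ 8.9 MeV·cm<sup>2</sup>/mg.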



#### **Data Pattern Dependence**



For Micron 3D NAND, no discernible pattern dependence (0's and 1's are being mapped evenly)



#### **Fluence Dependence**



- Programmed-cell Vth is a distribution, not an ideal ON or OFF
- Consider some cells "easier" to upset than others
- Reduced effect compared to previously observed Hynix 3D MLC flash.
- Relevant to understanding accelerated SEE test results!
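One way to picture why a Vth distribution produces fluence dependence is a toy two-population model: a small fraction of "easy" cells with a large upset cross-section and a majority with a small one. The sketch below is purely illustrative. The model, the function names, and every parameter value are assumptions chosen for demonstration, not measured device data.

```python
import math


def upset_fraction(fluence, weak_fraction, sigma_weak, sigma_strong):
    """Expected fraction of cells upset at least once after a given fluence."""
    weak = weak_fraction * (1.0 - math.exp(-sigma_weak * fluence))
    strong = (1.0 - weak_fraction) * (1.0 - math.exp(-sigma_strong * fluence))
    return weak + strong


def apparent_cross_section(fluence, **model):
    """Cross-section a test would report: upsets per bit divided by fluence."""
    return upset_fraction(fluence, **model) / fluence


# Hypothetical parameters: 0.1% of cells are far easier to upset than the rest.
MODEL = dict(weak_fraction=1e-3, sigma_weak=1e-7, sigma_strong=1e-11)

for fluence in (1e6, 1e7, 1e8):
    sigma = apparent_cross_section(fluence, **MODEL)
    print(f"fluence {fluence:.0e} cm^-2 -> apparent sigma {sigma:.2e} cm^2/bit")
```

In this toy model the apparent per-bit cross-section falls as fluence rises, because the easy cells are used up early. That is the reason a cross-section measured at accelerated-test fluence may not extrapolate linearly to low on-orbit fluence.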



#### **TID Effects**



- Let's look at adding TID into the mix
- Shifts Vth distribution of flash cells... just like
  - Heavy-ion particle strikes
  - Program-Erase cycles



# **Micron Combined Effects**



#### How does TID before SEE affect error rate?



## **SSD Test Setup**



- Solid-state drives are easy to buy, easy to use, and hard to test at the bit level!
  - Abstraction, logical address mapping, EDAC, etc.
- Number of upsets expected from SEU is *low* compared with memory size and built-in error rate
- Can we observe general trends from manufacturer-to-manufacturer in state-of-the-art 3D NAND flash?
- Can TID or program/erase cycling magnify effect for easier comparison?
- Can we learn anything about effects of SEU on SSDs?

# SSD Test Results – WD Blue 3D SSD



- Irradiated to 1x10<sup>6</sup>cm<sup>-2</sup> N (LET 1.4 MeV·cm<sup>2</sup>/mg)
  - Nothing observed on tester...
- Up to 1x10<sup>8</sup>cm<sup>-2</sup>
  - Still nothing
  - Based on Micron 3D NAND testing we'd guess on the order of 0.0016 upsets/bit
  - No reported uncorrectable errors









# **WD Blue 3D Continued**



- Pre-SEE testing: 10krad (Si) exposure
  - No SSD errors noted following TID
- Irradiated to 1x10<sup>7</sup> cm<sup>-2</sup> Copper (LET 21.1 MeV·cm<sup>2</sup>/mg)
  - Waited for full readback of drive... and nothing.
- Up to 1x10<sup>8</sup>cm<sup>-2</sup>
  - Based on Micron 3D NAND MLC testing we'd guess on the order of 0.010 upsets/bit.
  - Errors abound (next slide)!
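The upsets-per-bit guesses quoted for the WD Blue 3D runs can be restated as per-bit cross-sections by dividing by the fluence of 1x10<sup>8</sup> cm<sup>-2</sup>. The sketch below does that arithmetic and then scales the result to a lower fluence. The linear scaling is an assumption that the fluence-dependence results caution against, and the function names are illustrative.

```python
def per_bit_cross_section(upsets_per_bit, fluence):
    """sigma [cm^2/bit] = upsets per bit / fluence [cm^-2]."""
    return upsets_per_bit / fluence


def expected_upsets_per_bit(sigma, fluence):
    """Linear estimate of upsets per bit; ignores any fluence dependence."""
    return sigma * fluence


FLUENCE = 1e8  # cm^-2, final fluence of both runs

# Guesses from the slides, derived from Micron 3D NAND piece-part data.
sigma_n = per_bit_cross_section(0.0016, FLUENCE)   # N, LET 1.4
sigma_cu = per_bit_cross_section(0.010, FLUENCE)   # Cu, LET 21.1

print(f"N:  sigma ~ {sigma_n:.1e} cm^2/bit")
print(f"Cu: sigma ~ {sigma_cu:.1e} cm^2/bit")

# Scaled to the 1e7 cm^-2 copper step at which the drive still read back clean.
cu_at_1e7 = expected_upsets_per_bit(sigma_cu, 1e7)
print(f"Cu at 1e7 cm^-2: ~{cu_at_1e7:.1e} upsets/bit (linear assumption)")
```

These are raw, pre-ECC estimates borrowed from a different part, so they indicate only the scale of what the drive's error correction may be absorbing.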



### WD Blue 3D SSD Data



- Nothing abnormal noted immediately after run:

| Attribute                        | Value | Worst | Threshold | Raw       | Status |
|----------------------------------|-------|-------|-----------|-----------|--------|
| (AC) Erase Fail Block Count      | 100   | 100   | 0         | 0         | ok     |
| (AD) Wear Leveling Count         | 100   | 100   | 0         | 0         | ok     |
| (AE) Unexpected Power Loss Count | 100   | 100   | 0         | 4         | ok     |
| (B8) End To End Error Detection  | 100   | 100   | 0         | 0         | ok     |
| (BB) Uncorrectable Error Count   | 100   | 100   | 0         | 0         |        |
| (BC) Command Timeout             | 100   | 100   | 0         | 0         | ok     |
| (C2) Temperature                 | 100   | 50    | 0         | 214748364 | ok     |
| (C7) Interface CRC Error Count   | 100   | 100   | 0         | 0         | ok     |

- But, after reading back drive:

| Attribute                        | Value | Worst | Threshold | Raw       | Status |
|----------------------------------|-------|-------|-----------|-----------|--------|
| (AC) Erase Fail Block Count      | 100   | 100   | 0         | 0         | ok     |
| (AD) Wear Leveling Count         | 100   | 100   | 0         | 0         | ok     |
| (AE) Unexpected Power Loss Count | 100   | 100   | 0         | 5         | ok     |
| (B8) End To End Error Detection  | 100   | 100   | 0         | 0         | ok     |
| (BB) Uncorrectable Error Count   | 100   | 100   | 0         | 78        |        |
| (BC) Command Timeout             | 100   | 100   | 0         | 0         | ok     |
| (C2) Temperature                 | 57    | 50    | 0         | 214750527 | ok     |
| (C7) Interface CRC Error Count   | 100   | 100   | 0         | 0         | ok     |

S.M.A.R.T. attributes showed interesting data only after allowing the drive controller to learn its own condition.
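Since the drive reports damage only after a full readback, a practical step is to snapshot the S.M.A.R.T. raw values after the run and again after the readback, then compare them. A minimal sketch using the raw values from the two tables above; the dictionary layout and function name are illustrative, and no particular S.M.A.R.T. tool or API is assumed.

```python
# Raw values keyed by attribute ID, transcribed from the two tables above.
after_run = {
    "AC": 0, "AD": 0, "AE": 4, "B8": 0,
    "BB": 0, "BC": 0, "C2": 214748364, "C7": 0,
}
after_readback = {
    "AC": 0, "AD": 0, "AE": 5, "B8": 0,
    "BB": 78, "BC": 0, "C2": 214750527, "C7": 0,
}


def changed_attributes(before, after):
    """Return {attribute_id: (old_raw, new_raw)} for every raw value that changed."""
    return {
        attr: (before[attr], after[attr])
        for attr in before
        if attr in after and after[attr] != before[attr]
    }


for attr, (old, new) in changed_attributes(after_run, after_readback).items():
    print(f"({attr}) raw value changed: {old} -> {new}")
```

The comparison isolates the Uncorrectable Error Count (BB) going from 0 to 78, alongside one additional unexpected power loss (AE) and a change in the packed temperature value (C2).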

## WD Blue 3D SSD Data



- Same LET, but at 65° angle
- Pre-SEE testing: 10krad (Si) exposure
- Irradiated to 1x10<sup>8</sup> cm<sup>-2</sup> Ar @ 65° (LET 21 MeV·cm<sup>2</sup>/mg)
  - Several step irradiations with readbacks, no errors through 5x10<sup>7</sup>cm<sup>-2</sup>.
  - Big changes after final step:





Presented by Edward Wilcox at the Single Event Effects (SEE) Symposium and Military and Aerospace Programmable Logic Devi.

### **Other SSDs Tested**



- Intel 64-layer TLC
  - 10 krad(Si) + 1x10<sup>8</sup> cm<sup>-2</sup> @ LET 1.4:
  - All clean
  - Separate device, 0 krad, 1x10<sup>8</sup> cm<sup>-2</sup> Copper (LET 21.1):



## **Continued SSD Data**



#### Samsung 64-layer VNAND

- Clean at 1x10<sup>7</sup>cm<sup>-2</sup> Copper (LET 21.1 MeV·cm<sup>2</sup>/mg)
- Few errors at 5x10<sup>7</sup>cm<sup>-2</sup>.
- Stopped mounting for ~1 hour
- Fully erasable and now normal

#### Micron 32-layer TLC

- 1x10<sup>8</sup> cm<sup>-2</sup> N (LET 1.4 MeV·cm<sup>2</sup>/mg)

| Attribute                       | Value | Worst | Threshold | Raw       | Status    |
|---------------------------------|-------|-------|-----------|-----------|-----------|
| (01) Raw Read Error Rate        | 100   | 100   | 0         | 3677      | ok        |
| (05) Reallocated Sector Count   | 4     | 4     | 10        | 1120      | failed    |
| (09) Power On Hours Count       | 100   | 100   | 0         | 648       | ok        |
| (0C) Power Cycle Count          | 100   | 100   | 0         | 55        | ok        |
| (AB) Unknown Attribute          | 100   | 100   | 0         | 2         | ok        |
| (AC) Unknown Attribute          | 100   | 100   | 0         | 1087      | ok        |
| (AD) Unknown Attribute          | 100   | 100   | 0         | 4         | ok        |
| (AE) Unknown Attribute          | 100   | 100   | 0         | 38        | ok        |
| (B7) SATA Downshift Count       | 100   | 100   | 0         | 0         | ok        |
| (B8) End To End Error Detection | 100   | 100   | 0         | 0         | ok        |
| (BB) Uncorrectable Error Count  | 100   | 100   | 0         | 27        | ok        |
| (C2) Temperature                | 77    | 41    | 0         | 253404184 | ok        |
| (C4) Reallocated Event Count    | 100   | 100   | 0         | 1120      | warning   |
| (C5) Current Pending Sector     | 100   | 100   | 0         | 24        | warning   |
| (C6) Offline Uncorrectable      | 100   | 100   | 0         | 0         | ok        |
| (C7) Interface CRC Error Count  | 100   | 100   | 0         | 1         | attention |
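The "failed" status on Reallocated Sector Count is consistent with the usual S.M.A.R.T. convention that an attribute has failed when its normalized value is at or below a nonzero threshold. The sketch below applies only that rule to a few rows of the table. It assumes the numeric columns are normalized value, worst, threshold, and raw (as in common S.M.A.R.T. tools), and it does not model the tool-specific "warning" and "attention" statuses.

```python
from typing import NamedTuple


class SmartRow(NamedTuple):
    attr_id: str
    name: str
    value: int      # normalized current value (assumed column meaning)
    worst: int
    threshold: int
    raw: int


def has_failed(row: SmartRow) -> bool:
    """Usual convention: failed when value <= threshold; threshold 0 never trips."""
    return row.threshold > 0 and row.value <= row.threshold


# Rows transcribed from the Micron 32-layer TLC table above.
rows = [
    SmartRow("01", "Raw Read Error Rate", 100, 100, 0, 3677),
    SmartRow("05", "Reallocated Sector Count", 4, 4, 10, 1120),
    SmartRow("BB", "Uncorrectable Error Count", 100, 100, 0, 27),
    SmartRow("C5", "Current Pending Sector", 100, 100, 0, 24),
]

failed = [row.name for row in rows if has_failed(row)]
print("Failed attributes:", failed)
```

Only attributes with a nonzero threshold can trip this rule, so counters such as Uncorrectable Error Count (raw value 27) have to be read from the raw column rather than the status flag.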



# Challenges



- SSD testing adds layers of abstraction and mapping on top of error-correcting code.
- Effectively impossible to see individual errors, even MBUs and minor SEFIs; major SEFI events are likely to dominate the error response
- Must test as a black-box system
  - OK if you're trying to *characterize* a black-box system, but limited insight into marginal degradation at part level

#### **Future Plans**



- Generational scaling in 3D layer count and feature size will continue
  - Test piece parts when able, but SSD-type testing possible as well
- Evaluate combined effects, particularly as TLC/QLC cells continue to erode margins and increase RBER
  - TID/SEE/Endurance/Retention all tightly coupled