# Commercial CMOS Failure Rates: Facts and Fictions

John Scarpulla, PhD Technical Fellow Engineering & Technology Group The Aerospace Corporation

June 2022

# **Commercial CMOS for Space?**

many programs would love to fly state-of-the-art commercial devices

- Can we fly commercial high performance CMOS chips in space?
  - Today we can select:
    - SOCs system on a chip
    - FPGAs field programmable gate array
    - CPLDs complex programmable logic device
    - PALs programmable array logic
    - ASICs application specific integrated circuits

- CMOS technologies
  - with gate lengths of < 25 nm
  - clock frequencies > 1 GHz
  - power supply voltages ~ 1 V
  - FinFETs
  - gate counts >  $10^9$
- Commercial CMOS could be wonderful for our space systems, BUT what about:
  - Total Dose? OK for the most part
  - SEE (single event effects)? probably OK with mitigations and TMR (sans latch-up)
  - long term reliability? Not always OK
  - risk posture of various missions? some are more risk-tolerant than others
- Comm'l CMOS is inexpensive and readily available (compared to MIL-CMOS)

#### Incomplete understanding of commercial CMOS reliability is a risk to be assessed

# commercial versus space

applications and usage times

- Typical commercial products
  - powered-on lifetimes ~2-5 years
  - desktop or laptop computers
  - cell phones
  - games
  - servers
  - automobiles

- satellites
  - 10-15 years
  - on board signal processing
  - data serialization / de-serialization
  - encryption



# inferences from a constant failure rate

in terms of a cumulative failure probability

- CMOS vendors have done a terrific job of minimizing the infant mortality regime
  - it may be neglected for space applications
- CMOS vendors quote reliability as a failure rate in the "useful life" regime
  - typically quoted as  $\lambda = 10-100$  FITs (1 FIT = 1 failure per  $10^9$  device-hours)
- Analysis: cumulative probability of failure with constant failure rate
  - failure probabilities
    - at 2.5 years (commercial): 0.025% - 0.25%
    - at 15 years (mil): 0.13% - 1.3%
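
The arithmetic behind these percentages can be sketched in a few lines of Python. This is an illustrative calculation only: the function name and the 8766 hours-per-year conversion are assumptions, and the formula is the constant-failure-rate expression $P_f = 1 - \exp(-\lambda t)$.

```python
import math

HOURS_PER_YEAR = 8766.0  # assumed average year length (365.25 days)


def cumulative_failure_probability(fits: float, years: float) -> float:
    """P_f = 1 - exp(-lambda * t) for a constant failure rate given in FITs."""
    lam_per_hour = fits * 1e-9  # 1 FIT = 1 failure per 1e9 device-hours
    return 1.0 - math.exp(-lam_per_hour * years * HOURS_PER_YEAR)


for years in (2.5, 15.0):
    low = cumulative_failure_probability(10.0, years)
    high = cumulative_failure_probability(100.0, years)
    print(f"{years:4.1f} years: {100 * low:.3f}% to {100 * high:.3f}%")
```

For the 15-year case this gives roughly 0.13% to 1.3%, in line with the range quoted above.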

#### failure probability appears **<u>quite reasonable</u>** for a 15 year space mission


| Temp. (°C) | Duration (hrs) | Product Code | Date Code | Sample Size | Qty Fail |   |
|-------|----------|---------|----------|--------|----------|---|
| 125   | 1000     |         |          | 48     | 0        |   |
| 125   | 1000     |         |          | 48     | 0        |   |
| 125   | 1000     |         |          | 17     | 0        |   |
| 125   | 1000     |         |          | 35     | 0        |   |
| 125   | 1000     |         |          | 35     | 0        |   |
| 125   | 1000     |         |          | 35     | 0        |   |
| 135   | 1000     |         |          | 48     | 0        |   |
| 135   | 1000     |         |          | 49     | 0        |   |
| 135   | 1000     |         |          | 47     | 0        |   |
| 125   | 1000     |         |          | 35     | 0        |   |
| 125   | 1000     |         |          | 35     | 0        |   |
| 125   | 1000     |         |          | 35     | 0        |   |
| 125   | 168      |         |          | 49     |          |   |
| 125   | 1000     |         |          | 82     |          |   |
| 135   | 1000     |         |          | 35     |          |   |
| 135   | 1000     |         |          | 34     |          |   |
| 135   | 1000     |         |          | 35     |          |   |
| 135   | 1000     |         |          | 35     |          |   |
|       |          |         |          |        |          |   |
| 125   | 1000     |         |          | 35     |          |   |
| 125   | 1000     |         |          | 34     |          |   |
| 135   | 750      |         |          | 35     |          |   |
| 135   | 750      |         |          | 35     |          |   |
| 135   | 750      |         |          | 34     | 0        |   |
| 135   | 750      |         |          | 35     |          |   |
| 135   | 500      |         |          | 49     |          |   |
| 138   | 1000     |         |          | 45     |          |   |
| 138   | 1000     |         |          | 45     |          |   |
| 138   | 500      |         |          | 45     |          |   |
| 138   | 500      |         |          | 45     |          |   |
| 125   | 1000     |         |          | 49     | 0        |   |
| 125   | 1000     |         |          | 49     | 0        |   |
| 125   | 1000     |         |          | 49     | 0        |   |
| 95    | 1000     |         |          | 12     | 0        |   |
| 125   | 1000     |         |          | 82     | 0        |   |
| 125   | 1000     |         |          | 82     | 0        |   |
| 96    | 1000     |         |          | 84     | 0        |   |
| 96    | 1000     |         |          | 79     | 0        |   |
| 125   | 168      |         |          | 11     | 0        |   |
| 125   | 1000     |         |          | 81     | 0        |   |
| 125   | 1000     |         |          | 79     | 0        |   |
| 125   | 1000     |         |          | 81     | 0        |   |
| 125   | 1000     |         |          | 82     |          |   |
| 125   | 1000     |         |          | 50     |          |   |
| 125   | 1000     |         |          | 80     |          |   |
| 125   | 1000     |         |          | 80     |          |   |
| 125   | 1000     |         |          | 80     |          |   |
| 125   | 1000     |         |          | 82     | 0        |   |
| 125   | 1000     |         |          | 82     |          |   |
| 125   | 1000     |         |          | 82     | 0        |   |
| 125   | 1000     |         |          | 81     | 0        |   |
| 125   | 1000     |         |          | 82     |          |   |
| 125   | 1000     |         |          | 82     |          |   |
| 125   | 1000     |         |          | 81     | 0        |   |
| 125   | 1000     |         |          | 82     |          |   |
| 125   | 1000     |         |          | 81     | 0        |   |
| 125   | 2000     |         |          | 81     | 0        |   |
| 125   | 2000     |         |          | 80     | 0        |   |

#### Example: HTOL data from a well-known fab

(an advanced CMOS node)

| Temp.<br>(°C) | Duration<br>(hrs) | Product<br>Code | Date<br>Code | Sample<br>Size | Qty Fail |
|---------------|-------------------|-----------------|--------------|----------------|----------|
| 125           | 1000              |                 |              | 48             | 0        |
| 125           | 1000              |                 |              | 48             | 0        |
| 125           | 1000              |                 |              | 17             | 0        |
| 125           | 1000              |                 |              | 35             | 0        |
| 125           | 1000              |                 |              | 35             | 0        |
| 125           | 1000              |                 |              | 35             | 0        |
| 135           | 1000              |                 |              | 48             | 0        |
| 135           | 1000              |                 |              | 49             | 0        |
| 135           | 1000              |                 |              | 47             | 0        |
| 125           | 1000              |                 |              | 35             | 0        |
| 125           | 1000              |                 |              | 35             | 0        |
| 125           | 1000              |                 |              | 35             | 0        |

*N* = 4794 devices

$N \times t = 5.23 \times 10^6$  device-hours

Results:

- $\lambda = 6$  FITs for  $T_{use} = 55^{\circ}$C
- $\lambda = 100$  FITs for  $T_{use} = 100^{\circ}$C

Typical HTOL data aggregated amongst many products

# Mechanics of a commercial reliability estimate

- <u>Assume</u> an exponential failure distribution
  - Failures occur randomly in time at a constant rate  $\lambda$  to be found (no wearout)
  - Probability of failure for a mission of duration  $t_m$  is  $P_f = 1 - \exp(-\lambda t_m)$
- Test many devices for 1000 hours each at 125°C
  - N × t = number of device-hours tested
    - Sometimes more or fewer hours
    - Sometimes higher or lower temperature than 125°C
      - adjust t's to "effective t's" at the desired  $T_{use}$  using the Arrhenius factor with  $E_A = 0.7 \text{ eV}$  (1000 hrs at  $T_{ref} = 125^{\circ}C \rightarrow t_{eff} = 3925$  hrs at  $T_{use} = 100^{\circ}C$ )
  - C = confidence factor (usually 0.9)
  - r = number of failures (often zero)
- Estimate<sup>†</sup> the upper confidence limit for failure rate  $\lambda$  and probability of failure  $P_f$
  - $\hat{\lambda} = \frac{\chi^2_{2r+2,1-C}}{2N \times t}$  where  $\chi^2_{2r+2,1-C}$  is the upper  $(1-C)$  percentage point of the chi-squared distribution with  $2r+2$  degrees of freedom
  - if no failures ( $r = 0$ ), then

$$\hat{\lambda} = \frac{-\ln(1-C)}{N \times t}$$

<sup>†</sup>A detailed description of the use of the  $\chi^2$  distribution for estimation of confidence limits for failure rates, and of more complex censoring, is available from the author.

# Alternate estimate

It is equally valid to assume that there exists a wearout mechanism

- Assume a wearout mechanism with a lognormal distribution of failure times
  - Assume a shape factor  $\sigma = 0.8$
  - Assume the same thermal activation energy  $E_A = 0.7 \text{ eV}$

- Probability of failure for a mission of duration  $t_m$  is  $P_f = \Phi\left(\frac{\ln(t_m) - \ln(\mu)}{\sigma}\right)$
  - where  $\mu$  = median time to fail (to be found)
  - $\Phi$  is the standard normal cumulative distribution function
- Test many devices N for 1000 hours each at 125°C
  - 1000 hrs at  $T_{ref} = 125^{\circ}C \rightarrow t_{eff} = 3925$  hrs at  $T_{use} = 100^{\circ}C$
  - C = confidence factor (usually 0.9)
  - r = number of failures (often zero)

- Estimate<sup>+</sup>: obtain the binomial probability of failure  $\hat{p}$  for zero failures (upper confidence limit)
  - $\hat{p} = 1 - (1 - C)^{\frac{1}{N}}$
- Determine the lower confidence limit on the median time to failure

$$\underline{\mu} = \exp\left[\ln(t_{eff}) - \sigma \Phi^{-1}(\hat{p})\right]$$

- where  $\Phi^{-1}$  is the inverse normal cumulative distribution function

<sup>+</sup>A description of the more complex estimation procedure for the case of nonzero failures and multiple censoring is available from the author.
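
The zero-failure lognormal estimate can be sketched with the Python standard library. The function names are hypothetical. The sample size of 4794 is borrowed from the HTOL example, and every device is treated as having run 1000 hours at 125°C (3925 effective hours at 100°C); both are simplifying assumptions for illustration only.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sigma 1
HOURS_PER_YEAR = 8766.0    # assumed average year length


def zero_failure_p_upper(n_devices, confidence=0.9):
    """Binomial upper confidence limit on failure probability for zero failures."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_devices)


def median_life_lower_limit(t_eff_hours, p_hat, sigma=0.8):
    """Lower limit on the lognormal median: exp[ln(t_eff) - sigma * Phi^-1(p_hat)]."""
    return math.exp(math.log(t_eff_hours) - sigma * STD_NORMAL.inv_cdf(p_hat))


def lognormal_failure_probability(t_mission_hours, median_hours, sigma=0.8):
    """P_f = Phi((ln(t_m) - ln(mu)) / sigma)."""
    z = (math.log(t_mission_hours) - math.log(median_hours)) / sigma
    return STD_NORMAL.cdf(z)


n_devices = 4794    # assumed: sample size from the HTOL example
t_eff_hours = 3925  # 1000 hrs at 125 C referred to 100 C with E_A = 0.7 eV

p_hat = zero_failure_p_upper(n_devices)
mu_lower = median_life_lower_limit(t_eff_hours, p_hat)
print(f"p_hat = {p_hat:.3e}, median life >= {mu_lower / HOURS_PER_YEAR:.1f} years")
for years in (2.5, 15.0):
    pf = lognormal_failure_probability(years * HOURS_PER_YEAR, mu_lower)
    print(f"{years:4.1f} years: P_f <= {100 * pf:.1f}%")
```

With the same zero-failure data, the bound on 15-year failure probability under this wearout assumption comes out far larger than the constant-failure-rate figure, which is the contrast the two interpretations are meant to expose.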

# Two HTOL data interpretations

constant failure rate vs. lognormal (90% conf.)



The usual interpretations and conclusions drawn from HTOL results (exponential) are flawed

#### HTOL alone is not sufficient for longer term reliability assurance for space missions

# History of HTOL

- HTOL has been "standardized" for CMOS:
  - oven temperature of 125°C or 150°C
  - nominal usage voltage plus 10%
  - clocking at 1 MHz or 10 MHz
- In the (distant) past, HTOL actually did provide an assessment of operating life
  - temperature acceleration was feasible since operating temperatures were low (55 °C)
    - $E_A = 0.3 \text{ eV}$  was recommended in the 1970s for oxide defects
    - $E_A = 0.5 \text{ eV}$  was recommended in ~1975 for Al-Si interactions, purple plague
    - $E_A = 0.7 \text{ eV}$  was adopted in the 1980s when Cu or Ni barriers were added to Al metallization
  - voltage acceleration was feasible but not usually employed
  - clock speeds emulated usage conditions, which were a few MHz
- Today HTOL test is not sufficient for modern CMOS life determination
  - essentially no temperature acceleration with usage temperatures 105 ℃ or higher
    - some mechanisms (HCI) are <u>de</u>-accelerated by temperature
  - voltage acceleration is not feasible with tight  $V_{DD}$  requirements
  - HTOL clocking rate is far below today's CMOS clock frequencies
- Today's HTOL does not actually predict operating life
  - it must be augmented by reliability test structures designed to address known wearout mechanisms
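
The loss of temperature acceleration can be illustrated with a short Arrhenius calculation. This is a sketch under stated assumptions: the function name is hypothetical, the oven is taken at 125°C, and only simple Arrhenius behaviour with the activation energies listed above is modelled (mechanisms such as HCI that are de-accelerated by temperature are not captured).

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def arrhenius_factor(t_stress_c, t_use_c, ea_ev):
    """Arrhenius acceleration factor between stress and use temperatures (deg C)."""
    t_stress_k = t_stress_c + 273.15
    t_use_k = t_use_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_use_k - 1.0 / t_stress_k))


T_OVEN_C = 125.0  # assumed HTOL oven temperature
for ea_ev in (0.3, 0.5, 0.7):
    for t_use_c in (55.0, 105.0):
        af = arrhenius_factor(T_OVEN_C, t_use_c, ea_ev)
        print(f"E_A = {ea_ev:.1f} eV, T_use = {t_use_c:5.1f} C: AF = {af:6.2f}")
```

For $E_A = 0.7$ eV the factor is several tens at a 55°C use temperature but only about 3 at 105°C, so a 1000-hour test then represents only a few thousand hours of use.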

#### HTOL tests alone are insufficient to assure long term life in space




| CMOS failure mechanism | How normally accelerated | HTOL acceleration capability: temperature | HTOL acceleration capability: voltage | HTOL acceleration capability: frequency |
|---|---|---|---|---|
| EM (electromigration) | current density, temperature & frequency | little or no acceleration is possible when $T_{HTOL} \approx T_{use}$ | little or no acceleration when $V_{HTOL} \approx V_{use}$ | negative acceleration since $f_{HTOL} \ll f_{use}$ |
| HCI (hot carrier injection) | voltage and frequency | negative or no temperature acceleration | little or no acceleration when $V_{HTOL} \approx V_{use}$ | negative acceleration since $f_{HTOL} \ll f_{use}$ |
| TDDB (time dependent dielectric breakdown) | temperature and voltage | little or no acceleration is possible when $T_{HTOL} \approx T_{use}$ | little or no acceleration when $V_{HTOL} \approx V_{use}$ | frequency does not accelerate TDDB |
| BTI (bias/temperature instability) | temperature and voltage | little or no acceleration is possible when $T_{HTOL} \approx T_{use}$ | little or no acceleration when $V_{HTOL} \approx V_{use}$ | frequency does not accelerate BTI |

# Commercial CMOS reliability approach

#### three steps

## 1. Test structures

- highly accelerated failure times
- PoF models

*Figure: probability plot of % failed vs. time (a.u.) for a metal-via-metal EM test structure, 7 nm FinFET process, stressed at 400°C and 325°C.*

## 2. Translate model to lifetime goal

- management / competitive strategy
- lifetime goal: 2 years? 5 years?
- design rules for reliability, e.g. RDRs
  - trade-off with performance
  - $V_{DD}$, layout, $J_{max}$, $f_{op}$

## 3. HTOL data

- advertised as showing "intrinsic reliability"
- product level test only
- quarterly/yearly reliability report
- often non-accelerated

| Product family | HTOL | Sample size eq. | Fails | Pf (90% conf., binomial) | λ in FITs (90% conf., χ²) |
|----------------|----------|-------|---|--------|------|
| A, 1 lot       | 1000 hrs | 277   | 0 | 1%     | 8300 |
| A, 2Q21        | 1000 hrs | 5,000 | 0 | 0.047% | 460  |
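
The two rightmost columns of this table can be checked with the zero-failure formulas from the earlier slides. The sketch below uses hypothetical function names and assumes no temperature acceleration is applied (the 1000 test hours are used directly), consistent with the "often non-accelerated" note above.

```python
import math


def binomial_pf_upper(n_devices, confidence=0.9):
    """Upper confidence limit on failure probability for zero failures in n devices."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_devices)


def exponential_lambda_upper_fits(n_devices, hours, confidence=0.9):
    """Zero-failure upper limit on lambda in FITs: -ln(1 - C) / (N * t) * 1e9."""
    return -math.log(1.0 - confidence) / (n_devices * hours) * 1e9


for label, n_devices in (("A, 1 lot", 277), ("A, 2Q21", 5000)):
    pf = binomial_pf_upper(n_devices)
    fits = exponential_lambda_upper_fits(n_devices, 1000.0)
    print(f"{label}: Pf <= {100 * pf:.3f}%, lambda <= {fits:.0f} FITs")
```

The results agree with the tabulated values to within rounding, which suggests the table entries are unaccelerated 90% confidence bounds.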

#### The long term reliability can be gauged—BUT: Are ① and ② "known unknowns"?

# **Beyond planar CMOS**



- CMOS technology, material, and devices are pushed to higher limits
  - traditional overdesign/conservatism for reliability is no longer economically viable
  - minimal safety margins
  - competitive forces drive the tradeoff towards performance
- New approaches for "reliability enhancement"
  - canary cells, aging sensors
  - adaptive VDD voltage, adjustable clock speeds
  - reconfigurable blocks, dynamic logic rewiring, on-chip redundancy

#### Not old school CMOS

# *Is HTOL useful for assuring reliability for space missions?*

- HTOL as implemented today was intended for Al barrier diffusion/purple plague
  - has become standardized at  $T_j$  = 125 °C
  - assuming an activation energy of  $E_A = 0.7 \text{ eV}$  for thermal acceleration factors became customary
  - traditional to run at 1 MHz or 10 MHz only
  - may be useful as a design check to give some assurance that reliability design rules have not been violated
- HOWEVER
  - HTOL provides essentially no acceleration with today's commercial CMOS
    - 125°C is not an accelerant in a large CMOS chip: EM and BTI are non-accelerated
    - 10 MHz is not accelerating HCI, EM in a chip designed to run at 1 GHz
    - VDD in modern commercial CMOS is not an available accelerant for TDDB
    - high temperature decelerates HCI
- This is a problem
- The term "High Temperature Operating Life" as applied to modern CMOS has become an oxymoron

#### HTOL no longer provides long term "intrinsic reliability" evidence

# HTOL inadequate to demonstrate longer lifetimes

"While HTOL testing addresses the intrinsic failure rate for the device, it **does not adequately address** the useful lifetime of the device. ... The acceleration factor ... is too small to demonstrate a required useful device lifetime [a general requirement for useful lifetime has tended to be 100K powered-on-hours (11.2 years) at a specified junction temperature]. ... To adequately determine useful product lifetime, HTOL qual data had to be supplemented with reliability test structures that could be stressed to demonstrate the desired product lifetime."

J.W. McPherson, "Brief History of JEDEC Qualification Standards for Silicon Technology and Their Applicability to WBG Semiconductors", 2018 IEEE International Reliability Physics Symposium (IRPS) proceedings, March 11-15, 2018, Burlingame, CA

# Facts and Fictions

## FACTS

- Qual of "RadHard/Rel-by-Design" devices takes many years
- modern CMOS devices offer outstanding performance advantages
- fabs have low defect densities, extremely low early life failure rates
- commercial CMOS devices are readily available
- costs of high-performance comm'l CMOS are relatively low
- most of the failure modes are "soft"
- For typical consumer products, operating life of 1 – 3 years is "reliable"

## FICTIONS

- Long term lifetime (10-15 years) is guaranteed in commercial CMOS
- failure mechanism models are made available by vendors
- design decisions affecting the performance vs. reliability tradeoff are perfectly transparent to the space community
- HTOL data alone provides long-term reliability evidence
- Heritage HTOL reliability prediction is applicable to today's high performance CMOS

# Recommendations

based upon realities at typical commercial fabs

- Use of commercial high performance devices in short-term space missions
  - missions of less than 1-3 years may be OK
  - this is the commercial reliability benchmark for most commercial products
- For long missions (10-15 years):
  - Use multiple redundant devices in reserve on payloads
  - example: in a 15 year mission assume each device will survive for 3 years on average
  - design system with 5-way redundancy
- Derate performance-related parameters
  - clock devices at lower-than-rated max clock frequency
    - causes the device to be far more tolerant of timing degradations
    - reduces device current
    - reduces temperature
    - lessens the risk of soft failure by HCI, EM, BTI
  - operate at a lower range of VDD, lessening risk of TDDB
  - operate at as low a temperature as possible
- Power down CMOS devices in non-operational periods
- Push for reliability transparency at suppliers: ① models, ② design, ③ HTOL
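
The redundancy example above is simple arithmetic, sketched here for clarity. The function name is hypothetical, and the calculation assumes units are used one after another and that each lasts exactly the assumed average life; it ignores the spread in device lifetimes, so it is a sizing rule of thumb rather than a reliability prediction.

```python
import math


def standby_units_needed(mission_years, avg_device_life_years):
    """Units required if each is used in turn and lasts the assumed average life."""
    return math.ceil(mission_years / avg_device_life_years)


# 15-year mission, 3-year assumed average device life
print(standby_units_needed(15, 3))
```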

## Addressing the lack of transparency in publications



To: IEEE Editors, Conference Chairpersons, Technical Society Governance

As a practitioner in the space satellite enterprise, I consider the reliability of electronic devices to be of utmost importance. Of equal importance is the performance of the VLSI devices that are used for many satellite functions. The newest state-of-the-art CMOS technologies and devices are extremely interesting as they offer tremendous performance and seem to be relatively radiation hard. It is very important to the space community to keep abreast of these developments, especially the latest CMOS FinFET technologies. These may well be viable candidates for use in new high performance satellite systems. My job as a microelectronics subject matter expert at my company is to evaluate these candidates and to "trust but verify". IEEE journals and conference proceedings are a valuable source of data I rely upon to make my assessments.

Unfortunately, I have noticed a disturbing trend in the IEEE publications that report on state-of-the-art CMOS. There are many IEEE papers and conference publications in recent years that describe reliability and performance of nanometer scale CMOS processes and devices, particularly employing FinFET technologies. While these might be excellent candidates for space systems, the IEEE literature seems to lack critical information inherent to credible, disciplined reporting of scientific and engineering findings. Often the data is "redacted" where plots or tables show "arbitrary units" (A.U.) for important quantities such as times-to-fail, or stressing conditions such as voltage, temperature, etc. Most unfortunately, this renders the paper or conference record essentially useless for the space audience. By way of example, I have attached a list of a dozen IEEE papers that have appeared in the last few years on the topic of FinFET technology reliability and performance that suffer from this malady. Every one of these papers has what might be useful and interesting data but the numerical values are in A.U.

One of the concerns in the space industry is that its purchasing power does not match that of the consumer industry. Yet its performance and reliability goals are extremely stringent. This puts the space enterprise at a disadvantage. The larger CMOS semiconductor concerns do not see much economic benefit in providing information to space customers. Often, the information that is needed indeed exists in valuable IEEE publications, were it not for the A.U. scales.

I would like to propose that editors and reviewers of IEEE publications be sensitive to this problem. With use of redacted data, a scientific or engineering document becomes almost useless. One of the foundations of the scientific and engineering literature is to provide results that can be independently duplicated by independent laboratories. Unfortunately, this is not possible as can be seen in the dozen papers I have cited. The IEEE policy allowing this approach has caused its publications to read more like marketing trade magazines rather than peer reviewed engineering journals. This does the space community, academia, (and in fact the entire electronics engineering community) a disservice. It undermines the discipline, credibility and transparency in the scientific method to which the IEEE and all its members subscribe. I would like to suggest that this practice of allowing the publishing of redacted data be reconsidered.

I believe the IEEE editors and reviewers should look askance at A.U. scales and that some guidance should be put in place to discourage this practice. Alternatively, the IEEE should disclose that the purpose of its publications has transitioned away from hosting peer reviewed scientific data of the highest technical caliber.

Thanks for your consideration.

John Scarpulla, Ph.D.

Technical Fellow, The Aerospace Corporation IEEE Life Member

Attachment: One Dozen Recent IEEE Papers Containing Redacted Data

K. Choi, "Reliability Characterization on Advanced FinFET Technology", 2021 IEEE International Interconnect Technology Conference (IITC) doi: 10.1109/IITC51362.2021.9537487

H. Jiang, "Time Dependent Variability in Advanced FinFET Technology for End-of-Lifetime Reliability Prediction", 2021 IEEE International Reliability Physics Symposium (IRPS), doi: 10.1109/IRPS46558.2021.9405129

H. Sagong, "Reliability of Advanced FinFET Technology Nodes Beyond Planar", 2020 IEEE Electron Devices Technology and Manufacturing Conference Proceedings of Technical Papers, paper 4M-4, doi: 10.1109/IRPS.2018.8353649

J. Liu, "A Reliability Enhanced 5nm CMOS Technology Featuring 5th Generation FinFET with Fully-Developed EUV and High Mobility Channel for Mobile SoC and High Performance Computing Application", 2020 IEEE International Electron Devices Meeting (IEDM), p. IEDM20-179, doi: 10.1109/IEDM13553.2020.9372009

M. Jin, "A Comprehensive Reliability Characterization of 5G SoC Mobile Platform Featuring 7nm EUV Process Technology", 2020 IEEE Symposium on VLSI Technology, paper JFS2.3, doi: 10.1109/VLSITechnology18217.2020.9265018

R. Grover, "A Reliability Overview of Intel's 10+ Logic Technology", 2020 IEEE International Reliability Physics Symposium, (IRPS), doi: 10.1109/IRPS45951.2020.9128345

G. Yeap, "5nm CMOS Production Technology Platform featuring full-fledged EUV, and High Mobility Channel FinFETs with densest 0.021µm² SRAM cells for Mobile SoC and High Performance Computing Applications", 2019 IEEE International Electron Devices Meeting (IEDM), paper IEDM19-879, doi: 10.1109/IEDM19573.2019.8993577

D. Huang, "Comprehensive Device and Product Level Reliability Studies on Advanced CMOS Technologies Featuring 7nm High-K Metal Gate FinFET Transistors" 2018 International Reliability Physics Symposium (IRPS), paper 6F.7-1, doi: 10.1109/IRPS.2018.8353651

# Acronyms

- SOC system on a chip
- FPGA field programmable gate array
- CPLD complex programmable logic device
- PAL programmable array logic
- ASIC application specific integrated circuits
- CMOS complementary metal oxide semiconductor
- FinFET fin field effect transistor
- RBD reliable by design
- RHBD rad hard by design
- HTOL high temperature operating life
- HCI hot carrier injection
- BTI bias/temperature instability
- EM electromigration
- TDDB time dependent dielectric breakdown
- STEM scanning transmission electron microscope
- SEE single event effects
- RDR restrictive design rules
- TMR triple modular redundancy

# References

- Franklin Nash <u>Estimating Device Reliability</u>: <u>Assessment of Credibility</u> Kluwer (1993)
- Franklin Nash <u>Reliability Assessments: Concepts, Models & Case Studies</u>, CRC Press (2016)
- Paul Tobias & David Trindade <u>Applied Reliability</u> 3<sup>rd</sup> ed. CRC Press (2012)



