



#### ReSpace/MAPLD Conference Albuquerque, NM, August 2011.

## Cyclic NMR-based Fault Tolerance with Bitstream Scrubbing via Reed-Solomon Codes

Naveed Imran and Ronald F. DeMara Department of Electrical Engineering and Computer Science, University of Central Florida E-mail: naveed@knights.ucf.edu, demara@mail.ucf.edu







- Motivation
- Previous Fault-Tolerance approaches for FPGA-based Designs
- Proposed Technique
  - Cyclic-NMR Technique
  - Bitstream Scrubbing via Reed-Solomon
- Simulation Results
- Conclusions





#### Progress in Semiconductor Technology

- Reduced feature size -> increasing vulnerability to errors, aging-induced permanent faults \*
- Runtime faults in addition to manufacture-time defects of future nanoscale devices \*\*

#### **Reliability**

• Importance of fault tolerance in mission critical applications

#### **FPGAs**

- FPGA as a reconfigurable platform
- Faults in logic resources and configuration memory

<sup>\*</sup> Suresh Srinivasan, Krishnan Ramakrishnan, Prasanth Mangalagiri, Yuan Xie, Vijay krishnan Narayanan, Mary Jane Irwin, Karthik Sarpatwari, "Towards Increasing FPGA Lifetime", IEEE Trans. Dependable and Secure Computing. vol. 5, Issue 2, pp. 115-127, 2008. \*\* W. Rao, C. Yang, R. Karri, and A. Orailoglu, "Toward future systems with nanoscale devices: Overcoming the reliability challenge," *Computer,* vol. 44, no. 2, pp. 46–53, Feb. 2011.



## Fault-Tolerance of FPGA-based Designs



• Redundancy-based techniques

Example: Duplex arrangement, Triple Modular Redundancy (TMR):conventional approach with tool support(e.g., XTMR) yet vulnerable to multiple faults.

• Online Testing methods [1],[2],[3]

<u>Example:</u> Online Built-In Self Test (BIST), Roving STAR: resource-based tests may not easily correlate with functionality.

#### • Evolvable Hardware

Example: Design-time evolution [4], Competitive Runtime Reconfiguration (CRR) [5] : model free, self-adapting.





- An architectural framework for N-modular redundant , FPGAbased design
- Partial Reconfiguration used so only <u>one</u> instance of a *Functional Element (FE)* is retained on the throughput datapath and the other instances undergo test.
- A fixed hardware voter is avoided on the critical path.
- The *faulty* FEs are isolated and a *healthy* FE is introduced into the datapath by reconfiguration process



## **Cyclic NMR Platform**





Circuit Under Test (CUT)



- All instances of the NMR are subjected to the same input.
- An *Active FE* provides the system output.
- Fault Isolation: Fitness status of the

FEs based upon their discrepant behavior in an *Evaluation Window*.

• Fault Recovery: A *healthy* FE is inserted in the datapath and becomes the new *Active FE* 





# Part-II: Previous Memory Protection



Error Correcting Codes (ECC)

Store parity or code bits (e.g., SECDED hamming code [12])

### Physical bit Interleaving

 Data words are stored in interleaved way; it is beneficial for clustered errors.

#### Hardware Redundancy

• Manufacture time redundant rows/columns, BIST

#### <u>Scrubbing</u>

• Periodically sweep through the memory array to check for errors

#### Process Level Techniques

• Silicon-on insulator (SOI), Rad-Hard

K. Jangwoo, H. Nikos, M. Ken, F. Babak, and H. James, "Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding," in *Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture*: IEEE Computer Society, 2007.





- FPGA configuration memory in a space radiation environment -> malfunctioning of the implemented circuit
- Bitstream Scrubbing- configuration mitigation without taking the device offline.
- Xilinx provided scheme: a readback of the configuration memory is performed and if any error is found, the corresponding frames of the memory are overwritten
- NASA's REAG group technique: external blind scrubbing method in which configuration memory is periodically overwritten by a *golden* bitstream.
- CyclicNMR technique : Protecting the configuration storing media by Reed-Solomon Codes

M. Berg, C. Poivey, D. Petrick, D. Espinosa, A. Lesea, K. LaBel, M. Friendlich, H. Kim, and A. Phan, "Effectiveness of internal versus external seu scrubbing mitigation strategies in a Xilinx FPGA: Design, test, and analysis," *Nuclear Science, IEEE Transactions on, vol. 55,* no. 4, pp. 2259 – 2266, Aug. 2008.





- Hamming code can correct only one or two errors, RS can correct multiple errors.
- An RS code [6],[13] is specified as RS(n,k) with m bit symbols where:  $\bullet$ *k*=number of message symbols *n*=number of encoded symbols

```
fault capacity = t
```

```
number of parity symbols 2t=n-k
```







- In the RS original approach [13], a message of length k was represented by a polynomial p(x) whose coefficients are the source message symbols.
- The p(x) is oversampled generating **n** code words which are sent over the channel.
- The encoded message at the receiver end is decoded by solving a linear system of equations to recover original k symbols.
- For a noisy channel, the code words are corrupted. The error correction capability is realized by solving all possible distinct sets of **k** equations and a correct solution is found by majority voting [13].





- The bitstreams are protected by the Reed Solomon codes\* RS(15,9). Each codeword contains 15 symbols out of which 9 are data symbols and 6 are parity symbols.
- Error correcting decoder algorithm in on-chip PowerPC processor running compiled C-language code.
- In the prototype, the processor fetches a partial bitstream from the Compact Flash, decodes it, and writes the decoded bitstream to the configuration memory through the Xilinx Internal Configuration Access Port (ICAP) port
- To simulate the errors, bits are randomly inverted into the encoded bitstream stored on a compact flash.
- PowerPC introduces a critical element on *Recover path* opposed to voter on the *throughout datapath*.

<sup>\*</sup> The Error Correcting Codes (ECC) Page, http://www.eccpage.com/



## **Experiment Design**

- Xilinx development board ML410 Virtex-4
- Xilinx ISE 9.2i, Xilinx Platform Studio (XPS), Xilinx PlanAhead
- A 32-input with 32-output MCNC benchmark circuit C6288



PowerPC

An NMR floor plan of an FE



- Operating modes selectable based on mission trade offs of area, power, and availability.
- Simplex

A single FE computes for the input, the periodic blind scrubbing provides first level fault tolerance

• Duplex

Two replicas of an FE compute for a given input. A discrepancy in their output flags at least one of them as faulty -> Scrubbing

#### • NMR

N instances of an FE -> keep redundant resources in standby



## **Simulation Results**



MCNC benchmark circuit C6288

TABLE IOVERHEAD OF THE CYCLIC NMR SCHEME

|                        | Simplex  | with ECC   | Cyclic NMR |
|------------------------|----------|------------|------------|
| Partial bitstream size | 38KB     | 63KB       | 190KB      |
| Reconfiguration time   | 203 msec | 751.7 msec | 1.015 sec  |
| Resources              | 752 LUTS | N.A.       | 3760 LUTS  |

#### TABLE II FAULT HANDLING RESULTS

| Phase                           | Time              |
|---------------------------------|-------------------|
| Discrepancy Monitor             | Evaluation Window |
| Majority Voting                 | Negligible        |
| Active FE Reconfiguration       | 751.7 msec        |
| Faulty FE Blank Reconfiguration | 59 msec           |





- While fault capacity of a TMR system is limited to fault(s) in a single module, here fault capacity of NMR in our case is *N-2*
- Failure of *N-2* FEs still allows two elements *healthy which* can be used for discrepancy check.
- By keeping all the FEs in operation, we increase the fault capacity of the system by *N-2* tolerable multiple failures across multiple modules.
- A faulty module in the datapath can be replaced by one of the healthy modules from the test pool.
- Only one FE is active, other resources undergo test. However, the resources under test are evaluated using actual inputs at all times, in contrast to BIST techniques.



## References



- 1. J. Emmert, C. Stroud, and M. Abramovici, "Online fault tolerance for FPGA logic blocks," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 15, no. 2, pp. 216–226, Feb. 2007.
- 2. S. Dutt, V. Verma, and V. Suthar, "Built-in-self-test of FPGAs with provable diagnosabilities and high diagnostic coverage with application to online testing," *Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 27, no. 2, pp. 309 326, Feb. 2008.*
- 3. M. Gericota, G. Alves, M. Silva, and J. Ferreira, "Reliability and availability in reconfigurable computing: A basis for a common solution," *Very Large Scale Integration (VLSI) Systems, IEEE Transactions on,* vol. 16, no. 11, pp. 1545 1558, Nov. 2008.
- 4. D. Keymeulen, R. S. Zebulum, Y. Jin, and A. Stoica, "Fault-tolerant evolvable hardware using field-programmable transistor arrays," *Reliability, IEEE Transactions on, vol. 49, no. 3, pp. 305–316, 2000.*
- 5. R. F. DeMara, K. Zhang, and C. A. Sharma, "Autonomic fault-handling and refurbishment using throughput-driven assessment," *Appl. Soft Comput., vol. 11, pp. 1588–1599, March 2011.*
- 6. M. Riley and I.Richardson, "Reed-Solomon Codes An introduction to Reed-Solomon codes: principles, architecture and implementation"

[Online]: http://www.cs.cmu.edu/afs/cs/project/pscico-guyb/realworld/www/reedsolomon/reed\_solomon\_codes.html

- 8. C. Carmichael, E. Fuller, P. Blain, M. Caffrey"" SEU Mitigation Techniques for Virtex FPGAs in Space Applications", [Online] http://china.xilinx.com/esp/mil\_aero/collateral/presentations/SEU\_mitigation\_technique.pdf
- 9. Gallagher et al., "Multiple bit upset insensitive error detection and correction circuit for field programmable gate array based on static random access memory blocks", US Patent 2011.
- 10. Blanset et al., "Sandia-Xilinx Virtex FPGA SEU Experiment on the International Space Station" [Online] http://nepp.nasa.gov/mafa/talks/MAFA07\_33\_Blansett.pdf
- 11. Heiner, J.; Collins, N.; Wirthlin, M.; , "Fault Tolerant ICAP Controller for High-Reliable Internal Scrubbing," *Aerospace Conference, 2008 IEEE*, vol., no., pp.1-10, 1-8 March 2008 doi: 10.1109/AERO.2008.4526471
- 12. N. Rollins, M. Fuller, and M. J. Wirthlin, "A comparison of fault tolerant memories in SRAM-based FPGAs," in *Aerospace Conference, 2010 IEEE, pp. 1–12.*
- 13. I. S. Reed and G. Solomon, "Polynomial Codes over Certain Finite Fields," *SIAM Journal of Applied Mathematics, Volume 8, pp. 300-304,* 1960.