SOC Processors
Radiation and Developments

Steve Guertin – NASA/JPL
Intro

NEPP’s System on a Chip (SOC) Processors, Radiation, and Developments topic takes a broad look at radiation effects in SOC devices used in space. It addresses complex digital SOCs that are essentially single-chip computers. It seeks to identify their radiation vulnerabilities, and it aims to stay involved in the development of SOC devices for space use.

This is a relatively new topic, because SOC devices have only recently become increasingly commercially viable. Efforts to develop radiation-hardened-by-design (RHBD) SOC devices are underway at Atmel, Aeroflex, Boeing, and other manufacturers.

SOCs simplify system design and reduce system cost, although their lower manufacturing yield must be addressed.

Direct interaction with manufacturers has made it clear that NASA will need to identify the types of data it requires to interpret risk. There is a great deal to test, and some types of errors carry a significant risk of being misinterpreted. NASA will also need to be directly involved in collecting and interpreting this data.
NEPP SOC Investigation Task (part 1) – MAESTRO

Description:
– Boeing is developing the “On-board Processing Expandable Reconfigurable Architecture” (OPERA) to provide fault-tolerant spaceborne computing. It is intended to give space applications current-generation computing capability.
– The first processor implementation is the MAESTRO chip, which is an RHBD 49-core processor based on the Tilera architecture.
– Computers with high processing capability are desired for remote computing tasks. MAESTRO provides 70 GOPS at 20 W, which is on par with high-end FPGAs. The OPERA architecture can also handle tasks ranging from bit-processing/DSP applications all the way up to full processor applications.
– This is an exciting option for spacecraft designers. OPERA is currently looking for applications, so partnering with the program should be easy.

FY10 Plans:
– Provide first-cut analysis of the MAESTRO chip for radiation testing.
– Review all available software and information regarding radiation testing of the Maestro device.
– Establish operation of test-kit software in an engineering setting to evaluate the degree to which each application sensitizes key elements of the Maestro design.
– Feed information back to Boeing test engineers to establish a best-practices approach to the test efforts of both groups.

Schedule:

<table>
<thead>
<tr>
<th>Date</th>
<th>Activity</th>
</tr>
</thead>
<tbody>
<tr>
<td>Oct</td>
<td>Dev. Software for HI test 1</td>
</tr>
<tr>
<td>Nov</td>
<td>Perform HI test 1</td>
</tr>
<tr>
<td>Nov</td>
<td>Dev. Software for HI test 2</td>
</tr>
<tr>
<td>Dec</td>
<td>Perform HI test 2</td>
</tr>
<tr>
<td>Jan</td>
<td>Heavy Ion Reports</td>
</tr>
<tr>
<td>Feb</td>
<td>Participate in MAESTRO review</td>
</tr>
<tr>
<td>Mar</td>
<td>Interim report on MAESTRO</td>
</tr>
<tr>
<td>Apr</td>
<td>Final report on SoC &amp; MAESTRO</td>
</tr>
</tbody>
</table>

Deliverables:
– Review of test software used by Boeing for initial radiation characterization
– Review of “board test kit” (BTK) codes for sensitization of MAESTRO subsystems

Partners:
– GSFC & JPL: Melanie Berg, Hak Kim, Lonnie “Scott” Walling, Ken LaBel, Raphael Some, Carlos Villalpando, David Rennels, Steve Guertin
– Boeing – Bryan Buchanan, Manuel Cabanas-Holmen, Charles Neathery
– USC/ISI – Steve Crago
– Lew Cohn/Dagim Seyoum
NEPP SOC Investigation Task (part 2) – Aeroflex UT699

Description:

Several system-on-a-chip (SOC) devices are in use by NASA or under consideration. The devices of interest are all built using RHBD methods, and they have shown unexpected problems during testing.

Manufacturers have produced radiation results showing that these devices are inherently very robust against single-event effects (SEE). However, evidence suggests that anomalies seen during testing are usually not reported clearly. Often only a couple of such anomalies are observed during manufacturer testing, and they may be due to overloaded EDAC systems.

This task seeks to understand the anomalies that occurred during testing and to isolate the nature of the upsets. In SOC devices, several peripheral circuits may be the culprits. So may the on-chip processor itself, which is usually the most heavily tested structure.

We will provide hardware and software for operating Aeroflex’s UT699 SOC device. The device will then be investigated for rare uncorrected upsets.
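A common way for test software to look for uncorrected upsets is to write a known pattern into memory before exposure and compare a readback afterward. The sketch below is a minimal, hypothetical illustration in Python, not the UT699 test code. `scan_pattern` and `checkerboard` are invented names, and memory is modeled as a list of 32-bit integers. Because the device corrects single-bit errors in storage, any mismatch that reaches software was not corrected by the EDAC. The number of flipped bits per word gives a first hint at the nature of the upset.

```python
from typing import Callable, List, Sequence, Tuple

def checkerboard(index: int) -> int:
    """Hypothetical 32-bit test pattern alternating 0xAAAAAAAA / 0x55555555."""
    return 0xAAAAAAAA if index % 2 == 0 else 0x55555555

def scan_pattern(memory: Sequence[int],
                 pattern: Callable[[int], int]) -> List[Tuple[int, int]]:
    """Compare a readback buffer against the pattern written before exposure.

    Returns (word_index, flipped_bit_count) for every word that differs.
    """
    errors = []
    for index, word in enumerate(memory):
        diff = word ^ pattern(index)  # 1 bits mark flipped positions
        if diff:
            errors.append((index, bin(diff).count("1")))
    return errors

# Usage: simulate a readback with a 2-bit upset in word 3.
readback = [checkerboard(i) for i in range(8)]
readback[3] ^= 0b11
print(scan_pattern(readback, checkerboard))  # [(3, 2)]
```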

FY10 Plans:

- Obtain test board (we have previous experience with UT699 test equipment).
- Modify software to provide full functional monitoring of the onboard LEON3FT processor core. This includes implementing software fault-tolerance features to enable fault identification.
- Perform heavy ion testing to determine sensitivity to anomalous events (the device has single-bit correction on storage elements). The threshold is around LET = 9 MeV·cm²/mg, so protons are not a very useful test.
- As a secondary option, develop test software that sensitizes the on-chip components to identify the radiation sensitivity of elements such as the UART controller, MMU, IIC bus, and other subsystems.
- Perform heavy ion testing of the components.
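Heavy ion results are conventionally summarized as cross section versus LET, often fitted with a Weibull curve. The Python sketch below assumes that form. Only the threshold of about 9 MeV·cm²/mg comes from the plan above; the saturation cross section, width, and shape values are placeholders, not UT699 data. The curve is zero below the threshold, which is the behavior behind the remark that protons are not a very useful test for this device.

```python
import math

def weibull_cross_section(let: float, sigma_sat: float, let_th: float,
                          width: float, shape: float) -> float:
    """Weibull SEE cross section (cm^2) at a given LET (MeV*cm^2/mg).

    Zero at or below the threshold LET; approaches sigma_sat at high LET.
    """
    if let <= let_th:
        return 0.0
    x = (let - let_th) / width
    return sigma_sat * (1.0 - math.exp(-x ** shape))

# Hypothetical fit parameters; only let_th (~9) is taken from the text.
PARAMS = dict(sigma_sat=1e-7, let_th=9.0, width=20.0, shape=1.5)

for let in (5.0, 9.0, 15.0, 40.0, 80.0):
    print(f"LET {let:5.1f}: sigma = {weibull_cross_section(let, **PARAMS):.3e} cm^2")
```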

Deliverables:

- Heavy ion test report for anomalies.
- Heavy ion test report for on-chip peripheral components.
- Final report discussing SOC testing. This report will include the information learned from the UT699 and a comparison with Atmel/TSC695 testing.

Partners:

- Aeroflex – Craig Hafer, Steve Griffith, Fred Sievert
- NASA/JPL – Steve Guertin
Goals

- Participate in Maestro development and future developments of SOC devices intended to provide very high computation capability in space.

- Establish appropriate methods for digital SOC radiation sensitivity evaluation – specifically targeting microprocessor-based SOCs.

- Use these methods to evaluate SOC sensitivity in RHBD, commercial, and FPGA devices.

- Provide leading-order radiation sensitivity measurements for SOCs of interest to NASA.

- Engage the NASA community to improve the effectiveness of these methods and to increase testing and reporting of desired RHBD or fault-tolerant structures.

- Feed experiences from testing back to improve methods (both industrial methods and user application methods).
Expected Impact to Community

- Current testing methods for digital SOCs are problematic and need improvement.
  - Recent experiences with SOC manufacturers have confirmed this need.
  - Manufacturer test methods are not well developed.
- Even the microprocessors within SOCs are difficult to test on their own, and only a few groups in the community test them.
- SOCs are expected to become more common in spacecraft and may even run flight computers in the future.
- NEPP seeks to establish viable methods for evaluating radiation sensitivity of SOCs with built-in error mitigation in order to establish leading-order unmitigated radiation event rates.
Addressing Risk

• This task is designed to reduce risk to potential users of SOC devices by doing the following:
  – Addressing the lack of understanding of event types in complex systems such as microprocessors.
    • Develop methods and capabilities to enable testing these devices.
    • Provide support when possible for test efforts.
  – Augmenting manufacturers’ radiation testing efforts to identify:
    • Glass jaws – weak points that may not be getting tested sufficiently.
    • True single-hit events that either overwhelm fault tolerance systems, or circumvent them.
    • Leading order event rates in space.

• Without this work, NASA will be “running blind” into radiation problems with these devices. This task addresses that risk.
Status/Schedule

**FY2010**
- Examine UT699 (Aeroflex’s LEON3FT device) for leading order non-mitigated events.
- Establish best software methods for MAESTRO testing.
- Advise NASA community about appropriate radiation requirements for MAESTRO-Lite
- Develop capabilities to test MAESTRO ITC device.
  - Laser testing 09/2010
  - Heavy Ion testing 09/2010
  - Proton testing 10/2010-11/2010

**FY2011**
- Test 49-core MAESTRO ITC device
  - High speed IO and multiple interfaces
  - 12/2010-03/2011
- Continue to build expertise in SOC test methods
- Examine other candidate SOC devices (such as Atmel AT7913E)
Devices of Interest

• ASICs:
  – MAESTRO: Boeing’s current multi-core processor effort falls under the OPERA (On-board Processing Expandable Reconfigurable Architecture) program. The device, called the Maestro, will be a 49-core processor derived from the Tilera TILE64, with several standard peripheral interfaces. It is a project under DTRA’s RHBD microelectronics program.
  – UT699: Aeroflex built an SOC device around the LEON3FT using Gaisler’s fault-tolerant library. Aeroflex has tested and reported on the fault-tolerant (FT) elements of this device. However, the device still has an underlying (very low) error rate from non-FT elements in the design.

• FPGAs:
  – Several options exist for inserting hard- or soft-core microprocessors into FPGAs. Xilinx offers PowerPC and MicroBlaze cores and is looking at Sparc cores. Actel can also host Sparc cores and is looking into a hard-core ARM.
Boeing’s Maestro

- Maestro provides all the functionality of a normal PC’s bridge or chipset device(s).
• The UT699 provides interfaces more common to embedded systems. These are designed to help it target embedded aerospace applications.
Testing Challenges

• Many of the SOCs that will be of interest to NASA will be RHBD/FT devices.
  – RHBD and FT protection in SOCs is often accomplished with rate-sensitive schemes such as EDAC.
  – If the RHBD/FT mechanisms are overloaded at laboratory event rates, the test becomes difficult to interpret because errors go unmitigated.
  – If a rare event is seen in an RHBD system, it is difficult to prove that the event was not caused by the high laboratory event rates.
• Microprocessor testing is required but not well-defined.
• Peripheral devices each require special hardware effort for testing.
• Devices are so complex that legitimate test campaigns require assistance from their manufacturers.

Testing SOCs is difficult due to the lack of a knowledge base, RHBD designs being overwhelmed at laboratory rates, and the need for manufacturer participation.
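A simple model shows why rate-sensitive protection gets overwhelmed in the beam. The sketch below makes three assumptions: upsets arrive as a Poisson process, each word has single-error correction, and words are scrubbed at a fixed interval. The word size, scrub interval, and upset rates are hypothetical values chosen only to contrast a beam-like rate with a much lower space-like rate. Let $\mu$ be the small mean number of upsets per word per scrub interval. The fraction of upsets that land in an uncorrectable word is then about $\mu/2$, so it grows in proportion to flux.

```python
import math

def p_two_or_more(mu: float) -> float:
    """P(N >= 2) for N ~ Poisson(mu).

    With single-error correction, a word is lost if it collects two or more
    upsets before the next scrub.
    """
    if mu < 1e-3:
        # Series form avoids cancellation error at very small mu (space-like rates).
        return math.exp(-mu) * (mu**2 / 2 + mu**3 / 6 + mu**4 / 24)
    return 1.0 - math.exp(-mu) * (1.0 + mu)

def uncorrectable_per_upset(bits_per_word: int,
                            upset_rate_per_bit: float,
                            scrub_interval_s: float) -> float:
    """Fraction of upsets that end up in an uncorrectable word (about mu/2 when small)."""
    mu = bits_per_word * upset_rate_per_bit * scrub_interval_s
    return p_two_or_more(mu) / mu

# Hypothetical numbers: 39-bit words scrubbed every second.
beam = uncorrectable_per_upset(39, 1e-3, 1.0)    # beam-like upset rate per bit per second
space = uncorrectable_per_upset(39, 1e-10, 1.0)  # much lower space-like rate
print(f"beam: {beam:.2e}, space: {space:.2e}")
```

With these placeholder values, roughly 2% of beam-induced upsets defeat the correction, compared with about two parts per billion at the space-like rate. For that reason, anomalies from an overloaded EDAC in test may say little about flight behavior.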
One more Test Challenge

- The SOC device and its support structures (boards) incorporate additional resources that can upset.
- These are a problem if:
  - They are active parts of the test system or
  - They have upset modes even when unused
- There are usually multiple explanations for any observed upset mode.

It is often impossible to isolate upset modes to specific subsystems.
Results 1 of 3 – UT699

- NASA/JPL has verified published Aeroflex event rates for UT699 in protected cells.
- We have also measured event rates for unprotected events.
- Unprotected events dominate the overall event rate in space.

- The UART and memory controller were also tested for gross operation.
- The Spacewire ports were tested and found to have an upset cross section below $1 \times 10^{-16}\ \text{cm}^2$ per transferred bit.

Upsets in protected cells are not the leading-order contributor to the event rate in space. However, the device is quite robust, including Spacewire: the unprotected event rate is below 1 event in 1000 years (ISS orbit).
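Cross sections such as the Spacewire figure above are event counts normalized by exposure. When few or no events are seen, the reported value is an upper limit. The Python sketch below shows one way to compute a cross section per transferred bit with a one-sided Poisson upper limit. The function names and the example fluence and bit count are hypothetical, and this is not the calculation used for the reported UT699 numbers. Converting a cross section into an orbital event rate, such as the ISS figure above, also requires folding the cross-section-versus-LET curve with the orbit's particle environment, which is not shown.

```python
import math

def poisson_cdf(n: int, mu: float) -> float:
    """P(X <= n) for X ~ Poisson(mu)."""
    term = math.exp(-mu)
    total = term
    for k in range(1, n + 1):
        term *= mu / k
        total += term
    return total

def poisson_upper_limit(n_events: int, confidence: float = 0.95) -> float:
    """One-sided upper limit on the Poisson mean after observing n_events.

    Bisection works because poisson_cdf decreases as mu grows. Adequate for the
    small counts typical of rare-event testing (exp(-mu) underflows past ~700).
    """
    lo, hi = 0.0, 10.0 * (n_events + 1)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(n_events, mid) > 1.0 - confidence:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def cross_section_per_bit(n_events: int, fluence_per_cm2: float,
                          bits_transferred: float, confidence: float = 0.95):
    """Return (point estimate, upper limit) in cm^2 per transferred bit."""
    exposure = fluence_per_cm2 * bits_transferred  # (ions/cm^2) * bits
    return (n_events / exposure,
            poisson_upper_limit(n_events, confidence) / exposure)

# Hypothetical run: no Spacewire errors at 1e7 ions/cm^2 while moving 1e10 bits.
print(cross_section_per_bit(0, 1e7, 1e10))
```

With zero events, the 95% upper limit on the mean is $-\ln 0.05 \approx 3.0$ events, so the limit is simply about 3 divided by the exposure.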
Results 2 of 3 – Maestro Development

• Maestro radiation test development is underway at Boeing, and NASA (GSFC and JPL) is involved:
  – Helping set the goals for the initial Maestro test effort.
  – Helping develop test methods appropriate for the effort.

• NASA is also developing its own test capabilities
  – To verify the Boeing results
  – To expand on Boeing results by testing additional parts of the Maestro architecture and peripherals
  – Current NASA test effort development:
    • Building custom hardware to enable testing the Maestro and its peripheral ports.
    • Investigating manufacturer-based test software for functional evaluation – to be applied to radiation testing.

**NASA is building custom hardware and software to verify Maestro radiation sensitivity and enable study of key vulnerabilities such as error propagation from core to core.**
Results 3 of 3 – Maestro Software

• GSFC and JPL are currently engaged in analyzing software for potential Maestro radiation testing of functional components.
  – Software is based on the Tilera Board Test Kit (BTK)
  – Basic functional capabilities of the microprocessor components (ALU, branch, FPU) must be characterized or modeled.
  – Basic hardware operation (memory management, I/O subsystems, and other possible standard interfaces such as XAUI and Ethernet) must be characterized or modeled.
  – The on-chip networks must be tested for error isolation capabilities.
  – To a limited extent, the previous items will be tested by Boeing and verified by NASA.

• Further software efforts are underway to understand the impact of the elements above on a more flight-like system.
  – Specifically, error isolation in a multi-processing environment is being analyzed to fold in some characterization capabilities.

Software efforts span from basic duplicate operation to expanded operation of desired peripherals and systems
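At the simple end of that span, duplicate operation means running the same computation twice. The results are checked against each other and against a reference value obtained before exposure. The Python sketch below is a hypothetical illustration of that idea, not the BTK or Boeing test code. `alu_workload`, `run_duplicate_check`, and the fault-injection helper `make_faulty` are invented names. The classification rule is also an assumption: copies that disagree suggest a transient, while both copies wrong in the same way suggest corrupted persistent state.

```python
from typing import Callable, Dict, List, Set, Tuple

def alu_workload() -> int:
    """Hypothetical integer workload with a deterministic 32-bit result."""
    acc = 0
    for i in range(1, 1000):
        acc = (acc * 31 + i) & 0xFFFFFFFF
    return acc

def run_duplicate_check(workloads: Dict[str, Callable[[], int]],
                        iterations: int) -> List[Tuple[int, str, str]]:
    """Run every workload twice per iteration and log (iteration, name, kind)."""
    golden = {name: fn() for name, fn in workloads.items()}  # pre-exposure reference
    events = []
    for it in range(iterations):
        for name, fn in workloads.items():
            first, second = fn(), fn()
            if first == second == golden[name]:
                continue
            # Copies disagree: likely a transient upset in one run.
            # Copies agree but are wrong: likely corrupted persistent state.
            kind = "transient" if first != second else "persistent"
            events.append((it, name, kind))
    return events

def make_faulty(fn: Callable[[], int], bad_calls: Set[int]) -> Callable[[], int]:
    """Test helper: flip bit 0 of the result on the listed (1-based) call numbers."""
    calls = 0
    def wrapped() -> int:
        nonlocal calls
        calls += 1
        value = fn()
        return value ^ 1 if calls in bad_calls else value
    return wrapped

# Call 1 is the golden run; calls 4 and 5 form iteration 1, calls 6 and 7 iteration 2.
print(run_duplicate_check({"alu": make_faulty(alu_workload, {4})}, 3))
print(run_duplicate_check({"alu": make_faulty(alu_workload, {6, 7})}, 3))
```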
Future Plans: Maestro

• The work on Maestro follows a 3-stage plan.
• **Stage 1: Verification of Existing Results**
  – The NASA team will run the existing Boeing test codes on the NASA hardware setup to verify Boeing's results during beam exposures.
  – Little additional effort will be put into modifying the existing software.
• **Stage 2: Extension and Coverage**
  – Existing test codes will be modified, using information from the BTK and help from USC/ISI, to sensitize key elements.
  – Upset sensitivity will be measured by beam testing each modified test code.
• **Stage 3: Flight-Like Operation**
  – The Maestro is a very complex device with a very high operation rate (20 GOPS). A fully stressed device running a flight-like application is desirable as a final checkout.
  – A test collaboration is currently being established to share costs on this topic.
MAESTRO Test Setup

The test system for MAESTRO is currently under development at GSFC, using an updated Low Cost Digital Tester (LCDT).
Future Plans: UT699

- The UT699 is the first of what is expected to be a series of basic SOC devices that will be examined under this NEPP activity.

- The effort on the UT699 will be completed this fiscal year, with one final radiation test trip to accomplish the following goals:
  - Evaluate whether flux dependence could cause the unprotected events. (This will require careful statistical analysis.)
  - Establish the full sensitivity curve for Spacewire operation. About half of this curve has already been measured.
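One way to approach the flux-dependence question statistically is to compare event counts from runs at different fluxes. Suppose the cross section does not depend on flux. Then, conditional on the total number of events, the count in one run follows a binomial distribution whose probability is that run's share of the total fluence. The Python sketch below implements that exact conditional test. The function name and example counts are hypothetical. A small p-value only indicates that the two runs differ; flux dependence would explain this, but so could other differences between the runs.

```python
import math

def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def flux_dependence_pvalue(n_low: int, fluence_low: float,
                           n_high: int, fluence_high: float) -> float:
    """Two-sided exact test that a low-flux and a high-flux run share one cross section.

    Under the null hypothesis, n_low ~ Binomial(n_low + n_high, p0) with
    p0 = fluence_low / (fluence_low + fluence_high).
    """
    n = n_low + n_high
    if n == 0:
        return 1.0
    p0 = fluence_low / (fluence_low + fluence_high)
    observed = binom_pmf(n_low, n, p0)
    # Sum the probability of every outcome at least as unlikely as the observed one.
    p_value = sum(binom_pmf(k, n, p0) for k in range(n + 1)
                  if binom_pmf(k, n, p0) <= observed * (1 + 1e-9))
    return min(1.0, p_value)

# Hypothetical: equal fluence at two fluxes, 0 events at low flux vs 10 at high flux.
print(flux_dependence_pvalue(0, 1e7, 10, 1e7))
```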

**UT699 testing will be completed this year and serve as a baseline for future similar SOC testing.**
Future Plans: Other SOCs

• Initial efforts should be leveraged for testing of other SOC devices.
  – From the RHBD world, candidates include devices from Atmel and others.
  – Commercial devices, such as those from Freescale and others, may also be of interest.
  – A candidate device with all required peripherals embedded (including memory) is currently being considered for future work.

• Some FPGAs are available with SOC configurations that are currently under study for radiation effects.
  – Xilinx SIRF with Sparc, PowerPC, or other embedded processors.
  – Actel embedded ARM core.

Future SOC testing may target devices from several manufacturers including Freescale, Atmel, Xilinx or Actel.
Continued Community Involvement

- This NEPP task will continue to track the RHBD SOC community
  - To provide community input and vet plans
  - To interact with development groups designing RHBD elements for future devices.
  - Such collaborations are not always possible, especially with commercial vendors.
- Continue to develop SOC radiation testing knowledge base, including expansion to commercial devices.
- Examine key types of FPGA designs that include SOC architectural elements.
- This is also a crossover topic that covers many of the questions relevant to testing bridge chips or chipsets.