Thursday, October 23, 2014

System reliability in Electronics: new point of view

New model to address Electronic System Reliability

Everyone agrees that the reliability of their "things" is very important, at least qualitatively. But how do we define reliability quantitatively? What does it mean to be 10% more reliable, or 20% less reliable? As in many technical and non-technical fields, until a metric is defined to measure a quantity or a process, it is not possible to act upon it.

Our lifestyles, the way we work, communicate, move around and have fun, depend heavily on electronic systems. In this blog I'd like to present some thoughts on how to consider the reliability of these electronic systems (that's what I've been doing for a living for the past several years) that could help system architects take a more systematic approach to the issue.

I am considering here two main risk scenarios:
- I call the first one "Margin call", and I believe it is the one mostly handled by reliability professionals;
- The second one is a non-linear effect which, if we think about it, represents well what happens when failure strikes: it is the "Edge of the Cliff" scenario.

In “Margin call”, a drift in parameter values due to ageing or other semi-predictable effects, or an unlucky combination of corner effects, results in a system's performance no longer meeting functional specifications. This can be managed by reducing the system performance (processor speed lowered from 2 GHz to 1.8 GHz, for example, or increased power consumption). It might not lead to complete failure; the overall system might still work, even at degraded performance. The risk is linked to how wide the Gaussian distributions of all operating performances are.

The “Edge of the Cliff” effect can be a consequence of letting Margin call situations drift without corrective action. It can also be a case of Soft Error, where the location, timing and effects of the event are unpredictable. Whatever the existing performance margins of individual subsystems and components, catastrophic failure can happen.

System reliability engineers seem to handle the "Margin Call" scenario through ageing models, statistical analyses and predictions. There are several tools to do this, including statistical analyses (Monte Carlo, covariance analysis, …), but most system reliability engineers actually use Excel as the platform to compute these data.
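As a rough illustration, the kind of Monte Carlo margin check that usually lives in an Excel sheet can be sketched in a few lines of Python. The 2 GHz nominal, 1.8 GHz spec limit and 50 MHz sigma are hypothetical numbers for the processor-speed example above, not data from any real part:

```python
import random
import statistics

def monte_carlo_margin(nominal, sigma, spec_limit, n_trials=100_000, seed=0):
    """Estimate the probability that a performance parameter drifts below
    its spec limit, assuming a Gaussian spread around the nominal value."""
    rng = random.Random(seed)
    samples = [rng.gauss(nominal, sigma) for _ in range(n_trials)]
    failures = sum(1 for s in samples if s < spec_limit)
    return failures / n_trials, statistics.mean(samples)

# Example: a 2 GHz processor whose achievable clock spreads with ageing.
fail_rate, mean_perf = monte_carlo_margin(nominal=2.0, sigma=0.05, spec_limit=1.8)
print(f"P(performance < spec) ~ {fail_rate:.5f}, mean = {mean_perf:.3f} GHz")
```

With a spec limit four sigmas below nominal, the estimated failure probability is tiny; shrink the margin or widen the sigma and it grows quickly, which is exactly the "Margin call" risk.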

Dealing with the “Edge of the Cliff” scenario consists in chasing the outliers of the distribution of every possible combination of corners, including analyzing the propagation of failures within the system. Once we can describe what these cases are, we can set up design margins to stay as far away from the Cliff's edge as we need.
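The corner-chasing described above can be sketched as an exhaustive sweep over every combination of corner values. The parameters, corner values and toy performance model below are purely illustrative, not from any real design:

```python
from itertools import product

# Hypothetical per-parameter corner values (min/typ/max) for a subsystem.
corners = {
    "supply_voltage": [0.9, 1.0, 1.1],   # volts
    "temperature":    [-40, 25, 125],    # degrees C
    "process":        [0.9, 1.0, 1.1],   # speed factor: slow/typ/fast
}

def performance(v, t, p):
    """Toy performance model: speed scales with voltage and process,
    and degrades at temperature extremes (illustrative only)."""
    return 2.0 * v * p * (1 - 0.001 * abs(t - 25))

# Exhaustively walk every combination of corners and find the worst case.
worst = min(product(*corners.values()), key=lambda c: performance(*c))
print("worst-case corner:", dict(zip(corners, worst)),
      "-> perf =", round(performance(*worst), 3))
```

For a handful of parameters this brute-force sweep is trivial; for a real system the combinations explode, which is why outlier hunting and failure-propagation analysis need dedicated tooling.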

So the point is trying to find where the edge of the cliff is... and by the way, the depth of the precipice is also an important parameter. The perception is that the deeper the precipice (the more serious the consequence of failure), the more distance (margin) from the cliff's edge is needed.

But in both cases, the key element for system engineers is being able to trust the input data from suppliers. These data can be biased, incorrect or incomplete for many reasons. The main one is the competitive environment and the contractual framework governing the relationship between parties, which biases communication:
When a customer asks vendors to provide the reliability performance of their components, guess which data they provide?

You got it: the best ones! 

Think about it: is it in the best interest of the system reliability architect? Obviously not; it is actually a dangerous practice. This system architect might as well find himself or herself like the person in red in the picture above: sitting at the edge of a huge cliff, without even knowing it!

How to solve the issue? After all, it is in engineers' nature to always show their best achievements and results (and hide the ones they're not so proud of).

Let me know your thoughts, but my guess is that the solution involves building solid trust among the partners in the value chain. This can work only with effective communication, opening the kimono, and maybe a specific platform for data exchange which openly shows system specifications and the exhaustive set of reliability data, good AND bad, from the component providers.

Oct 22nd Update...
I found this picture (credit: Jeff Pang, Yosemite Highlining). This might be your situation, and you don't even realize it!

Tuesday, March 25, 2014

The business of System Reliability :  Defining the characteristics of our best target market with a simple cost model.

Supply chain managers have developed tools to monitor the supply of components for large-scale systems, including parameters like lead time, single source of supply, potential replacement parts, price and INCOTERM in their analyses. They usually work with component engineers to gather the data needed for their management tools.

Reliability has historically been factored into the design (life duration) and the maintenance program, mostly for deterministic ageing effects creating a slow drift in performance. Another type of reliability issue is unexpected, random failure like SEEs, which can occur at any time, anywhere in the system, causing potentially catastrophic failures. Because of the difficulty of modelling and predicting SEEs, they are not necessarily well analyzed, as they are sometimes like the needle in the haystack for large systems. When components show this type of reliability issue in the field, the information is supposed to be fed back to the supply chain's monitoring system, which organizes repairs and recalls, and design change requests if needed. These operations often come at a large cost to the system vendor, both internally and as contract penalties.
A simplified cost-estimation model of this risk and the associated costs is shown below.

Total cost is:
                                    C = C1 + P1*C3 + C4                         (1)

Where: C1 represents the costs associated with development, implementation, fabrication, sales, etc. P1 is the probability of a failure in the field. C3 represents the cost of repair or recall of the product. C4 represents the cost of maintenance.

Adding the capability of assessing and correcting SEEs before shipment drastically lowers the risk of failure in the field. It modifies the cost structure as shown below:

The total cost would then be:
                              CR = C1 + C2 + P2*C3 + C4                         (2)
Where: C2 is the overall cost of improving the reliability (analysis, test, mitigation, ...), and P2 is the probability of a failure given the results (and recommendations) of the reliability audit. Statistically speaking, P2 is a conditional probability, best handled with Bayesian statistics.
The difference of cost between the two approaches would then be:
                          ΔC = CR – C = C2 + C3*(P2 - P1)                     (3)
ΔC needs to be negative for the reliability audit to make business sense and generate a positive Return On Investment. Note that if the reliability analysis is performed well, the repair cost C3 should be much lower than in the previous case; for the sake of simplicity, we'll assume C3 is the same (worst case).

                                                    ΔC < 0
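A quick numerical sketch of equations (1) to (3) makes the criterion concrete. The cost figures and probabilities below are entirely hypothetical, chosen only to show how the terms interact:

```python
def total_cost(c1, c3, c4, p_fail, c2=0.0):
    """Total cost model: base cost + expected repair/recall cost + maintenance.
    c2 is the optional cost of a reliability audit/mitigation program."""
    return c1 + c2 + p_fail * c3 + c4

# Hypothetical figures (currency units are arbitrary):
C1, C3, C4 = 10_000, 500_000, 2_000   # development, recall, maintenance
P1, P2, C2 = 0.05, 0.005, 5_000       # failure prob. before/after audit, audit cost

C     = total_cost(C1, C3, C4, P1)          # eq. (1): no audit
CR    = total_cost(C1, C3, C4, P2, c2=C2)   # eq. (2): with audit
delta = CR - C                              # eq. (3): equals C2 + C3*(P2 - P1)
print(f"C = {C}, CR = {CR}, dC = {delta}")  # negative dC -> the audit pays off
```

With these numbers the large recall cost C3 dominates: spending C2 to push the failure probability from P1 down to P2 yields a clearly negative ΔC.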

The conditions for this to happen are clear from (3), and define our target markets and our offering:

·         C3 is large. This corresponds to certain applications and industries. In our experience, the cases where this cost can be prohibitive include aerospace, medical devices and cloud infrastructure

·         P2 << P1: this condition is achieved mainly with accurate analysis tools, deep knowledge and expertise in the field and effective mitigation strategies

·         C2 is as small as possible: C2 has two components, the cost of analysis and the cost of mitigation. C2 depends on the stage at which the problem is audited; obviously, the earlier the cheaper.

Therefore, for the target markets where the cost of failure is prohibitive, we have to bring enough expertise to significantly lower the probability of failure through accurate analysis and effective mitigation, and our intervention should come as early as possible in the design phase. These statements are very helpful when defining our product portfolio and the type of data and engagement we'll be seeking.

If you find these thoughts interesting, or you'd like to react to this blog, let us know! Your comments are welcome as usual!

Friday, June 8, 2012

DAC 2012: Interview with EE Times

DAC was quite busy this year at the Moscone center in San Francisco.
It was a good way to test market demand for Soft Error solutions, or at least the interest of different industries in this reliability problem. While many are aware of the problem, more than ever see it as a concern and are trying to be proactive about it. A summary of the concerned markets is shown in this snapshot taken from the graphics on our booth:
We were also interviewed by EE Times online TV. You can see it here.

Tuesday, May 1, 2012

The Australian Transport Safety Bureau (ATSB) released on December 19th, 2011 the final report of its investigation into two repeated nose-dive incidents during a Qantas Airbus A330 flight from Singapore to Perth in October 2008 (Qantas Flight 72), resulting in an accident. The plane landed at a nearby airport after the incident, which caused injuries to at least 110 passengers and crewmembers and some damage to the inside of the aircraft.

See section 3.6.6, page 143, for the discussion of Single Event Effects, and Appendix H.
The report is inconclusive about the root cause of the incident. It originated in a piece of avionics equipment called the LTN-101 ADIRU (Air Data Inertial Reference Unit).
Incorrect data for all flight parameters were sent by this unit to the other avionics equipment, eventually creating false angle-of-attack information that misled the central computer, which reacted with a quick nose-down maneuver.
The report mentions that the wrong signal probably came from the CPU (Intel 80960MC) inside the ADIRU. Other chips interacting with the CPU, and therefore potentially sending wrong signals, are an ASIC from AMIS, wait-state RAM from Austin Semiconductor (64K x 4) and RAM from Mosaic Semiconductor (128K x 8).
When skimming through the report, especially the section about SEE, I had a few thoughts:
The estimated SEU failure rate of the equipment is 1.96e-5 per hour, or 19,600 FIT (note: none of the memories were protected by ECC or EDAC). At an altitude of 37,000 ft, the neutron acceleration factor compared to the ground reference (NYC) is 83x (report data), so the equivalent FIT at ground level should be 236 FIT. The order of magnitude seems about right, even though I'd like more data about the process node and the total size of embedded memory. This FIT rate is just an estimate (from theory, not from test) and seems to take only memory SBUs (Single Bit Upsets) into account.
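The arithmetic behind these figures is a one-liner; this small sketch just re-derives the numbers quoted above from the report's hourly rate and acceleration factor:

```python
# Back-of-the-envelope check of the report's SEU numbers (values from the post).
failure_rate_per_hour = 1.96e-5        # estimated SEU rate of the ADIRU
fit = failure_rate_per_hour * 1e9      # 1 FIT = 1 failure per 1e9 device-hours
altitude_acceleration = 83             # neutron flux at 37,000 ft vs. NYC ground

ground_fit = fit / altitude_acceleration
print(f"In-flight rate: {fit:.0f} FIT -> ground-level equivalent: {ground_fit:.0f} FIT")
```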
The investigators couldn’t reproduce the symptoms through test. They focused mainly on neutron testing at 14 MeV; I imagine this is because it was a source they could access easily. Maybe a wider neutron energy range up to hundreds of MeV (like the white neutron spectrum at Los Alamos, TSL or TRIUMF) would have been more appropriate, especially to create MCUs (Multi Cell Upsets). The report states that the MCU/SBU ratio is about 1%, so they didn’t investigate further. But this ratio depends on the process node! On the latest technologies (40nm, 28nm) it can be up to 40% on SRAM.
The components seem to have been manufactured on older process nodes. But in that case, did they check the effect of thermal neutrons (Boron-10 was used in older technologies)? Or alpha-particle contamination of the package?
I believe this report needs a little more detail on the issue, and a little more investigation to try to be more conclusive… Any thoughts and comments?

Thursday, April 12, 2012

Reliability of Cloud Computing

Every second, the equivalent of 63 billion CDs of data transits through the world’s internet (source: Cisco). That’s 1.5 ZB per year (1 ZB = 2^70 bytes!). As of December 31st, 2011, 2.3 billion people were using the internet, a 5.3X increase from the previous year! As lifestyles in almost every country on every continent move towards a ubiquitous mobile lifestyle, demand for remote mass storage and cloud computing capacity is rapidly increasing. These numbers are mind-boggling and leave us to estimate the impact of a failure of this infrastructure leading to service disruption: we are not talking about thousands or millions of users affected. We are talking about hundreds of millions!
Obviously, cloud service architectures involve heavy redundancy, mirroring of servers in different geographical locations, disaster recovery procedures… Still, isn’t there some single point of failure? While it is a well-accepted fact that software can have bugs, viruses, worms… what about hardware? When firewalls, watchdogs and other software procedures are commonly put in place, aren’t we less keen to accept hardware failure? Even trickier: what about hardware-generated data corruption? The hardware shows no sign of failure, the software is not infected by viruses… still, something’s not right.
Soft errors, even though they are a small contributor to the overall reliability of systems, can still be the source of undetected failures that propagate to whole systems.
What are we doing to mitigate this problem, especially in cloud computing and data storage infrastructure?