Reliability – Part II: The Tools

The alphabet soup of expressions, and the role subjectivity plays in risk assessment.

History shows Earth and its inhabitants are afflicted with many difficulties: famine, war, pestilence, income taxes, stripes mixed with plaids, and easy listening music. Depravity knows no limits.

Then there are the real horrors, like unreliability. And cavernous, asymmetrical trade shows.

Relax. I will explain.

Last time I stated the problem: Our industry whistles past the graveyard when it comes to confronting problems related to reliability. Now it’s time to review some of the popular tools used to measure product attributes as they relate to long-term reliability. It’s also time to assess their worth and ask some pointed questions about whether their value still withstands scrutiny and delivers meaningful data, especially considering lots of stuff still breaks and otherwise fails to work as reliably as it should.

But first, my editor wants me to comment on the recent IPC Apex Expo, held in March at the Las Vegas Convention Center. OK, it was big. It was also full of stuff on display. Big stuff. And crowded with people trying to sell that stuff. Especially on the last day of the show, when potential buyers were an endangered species.

One of the statistical tools test engineers use is C_pk, which stands for Process Capability Index. It is an indicator of process precision, or repeatability, expressed in terms of the ability of that process to meet a customer’s tolerances. It is based on a measurable, relatively stable process. (Stat geeks call this a normally distributed process.) Stability means the process can be observed to operate, and is therefore measured, within certain narrow boundaries, usually referred to as upper and lower limits, built around a measure of central tendency, or average. Those measurements are compared with the nominal or specified value of the attribute being measured and its tolerances. The degree of deviation from the norm indicates the consistency, or reliability, of the specified process.
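For readers who want to see the arithmetic, here is a minimal sketch of the usual textbook C_pk calculation: the distance from the process mean to the nearer tolerance limit, divided by three standard deviations. The function name, the resistor readings and the tolerance limits are all made up for illustration; they are not taken from any real test program.

```python
from statistics import mean, stdev

def cpk(measurements, lower_limit, upper_limit):
    """Process capability index: distance from the mean to the nearer
    tolerance limit, in units of three sample standard deviations."""
    mu = mean(measurements)
    sigma = stdev(measurements)  # sample standard deviation
    return min(upper_limit - mu, mu - lower_limit) / (3 * sigma)

# Hypothetical readings, in ohms, for a 100-ohm resistor with a +/-1% tolerance
readings = [99.9, 100.1, 100.0, 99.8, 100.2, 100.0, 99.9, 100.1]
print(round(cpk(readings, 99.0, 101.0), 2))
```

A tight cluster of readings well inside the limits produces a large index; a wider spread, or a mean that drifts toward one limit, pulls it down.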

The process being measured typically mirrors the nominal value of a key attribute (for example, in board testing, component values such as resistance, capacitance, and inductance), within a manufacturer’s specified tolerances. The degree to which it mirrors those tolerances reflects its low variability, or precision. These measurements provide an accurate mapping of the process, and are a predictor of process yield, usually expressed in terms of defects per million opportunities (DPMO) or parts per million (PPM). Process variability also is expressed as standard deviations from the mean, and given the Greek symbol sigma. A six-sigma process, which via convoluted mathematics calculates to a process capability (C_p) of 2, or a C_pk of 1.5 once the customary 1.5-sigma drift in the mean is allowed for, corresponds to a process yield of 99.99966% or a DPMO of 3.4. Some test statements of work (SoWs) specify a C_pk of 10 for individual component tests.
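The 3.4 DPMO figure can be reproduced in a few lines. This is a rough sketch, assuming a normally distributed process, counting only the tail beyond the nearer limit, and applying the customary 1.5-sigma allowance for long-term drift in the mean. The function name and its default shift are illustrative choices, not part of any standard library.

```python
from statistics import NormalDist

def dpmo(sigma_level, shift=1.5):
    """Defects per million opportunities for a normally distributed process
    whose nearer tolerance limit sits (sigma_level - shift) standard
    deviations from the mean. Only that one tail is counted."""
    tail = NormalDist().cdf(-(sigma_level - shift))
    return tail * 1_000_000

for level in (3, 4, 5, 6):
    print(level, round(dpmo(level), 1))
```

Setting `shift=0` shows how much of the famous number comes from the drift allowance rather than from the process itself.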

That sounds very precise. So why do electronic products still stop working on occasion, seven months after passing the requisite tests, often to a high C_pk standard? Is that standard adequate to ensure proper product functioning?

The Las Vegas Convention Center is very large, much larger than the previous Apex Vegas venue, Mandalay Bay. One wag noted the hall begins in Nevada and ends in Utah. It was not much of an exaggeration. One had the feeling B-24 Liberator bombers could have been built here during WWII. Booth numbering patterns were convoluted, making navigation challenging and rendezvous with colleagues an adventure. It was a long hike from anywhere (hotel rooms, restaurants) to anywhere. Disorganization also ruled. Some food venues lacked proper seating the first day. And who had the bright idea to situate badge pickup inside the hall rather than outside of it?

Design of experiments (DoE) and failure modes and effects analysis (FMEA, sometimes with a “C” added to signify criticality) are two additional statistical tools engineers use to predict performance and (un)reliability. DoE uses mathematical techniques, often expressed in multivariate matrix or linear algebra, to model cause-and-effect relationships. Inputs (the so-called predictor variables) are changed under different performance scenarios, resulting in different outcomes, or vice versa. DoE models can be extremely simple or mind-numbingly complex, requiring computer assistance.
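At the extremely simple end, a DoE can be a two-level full factorial: every input is run at a low and a high setting, in every combination, and the main effect of each input is the average outcome at its high setting minus the average at its low setting. The sketch below assumes three hypothetical reflow factors and made-up defect counts, purely for illustration.

```python
from itertools import product

def full_factorial(factor_names):
    """Every combination of coded low (-1) and high (+1) levels, one dict per run."""
    return [dict(zip(factor_names, levels))
            for levels in product((-1, 1), repeat=len(factor_names))]

def main_effects(runs, responses):
    """Main effect per factor: mean response at +1 minus mean response at -1."""
    effects = {}
    for name in runs[0]:
        high = [y for run, y in zip(runs, responses) if run[name] == 1]
        low = [y for run, y in zip(runs, responses) if run[name] == -1]
        effects[name] = sum(high) / len(high) - sum(low) / len(low)
    return effects

# Hypothetical factors and made-up responses (defects per board), in run order
factors = ["peak_temp", "time_above_liquidus", "conveyor_speed"]
runs = full_factorial(factors)
responses = [12, 10, 9, 7, 6, 5, 4, 2]
print(main_effects(runs, responses))
```

Three factors need eight runs; ten factors need 1,024, which is where the mind-numbing part, and the computer, come in.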

FMEA is a risk-assessment tool. It imposes a discipline upon engineers to visualize product failure. It also distinguishes failure modes (the symptoms) from failure mechanisms (the root causes of failure). For example, intermittent opens on a BGA are the failure mode; head-in-pillow (HiP) defects at specific pins may be the failure mechanism. HiP defects are the result of solder process profile anomalies. FMEA carefully considers a range of failure modes and ranks them according to risk. Those with a higher risk of inducing failure get more attention aimed at preventing them; those with lower risk get correspondingly less scrutiny.
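One common way to do the ranking, though by no means the only one, is a risk priority number (RPN): severity, occurrence and detection are each scored from 1 to 10 and multiplied together. The sketch below assumes that convention; the failure modes and every score in it are hypothetical, and the scores are exactly the kind of human judgment call the method depends on.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    description: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (almost certain)
    detection: int   # 1 (always caught) to 10 (almost never caught)

    @property
    def rpn(self) -> int:
        """Risk priority number: the product of the three scores."""
        return self.severity * self.occurrence * self.detection

def rank_by_risk(modes):
    """Highest risk priority number first."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)

# Hypothetical failure modes with made-up scores
modes = [
    FailureMode("Tombstoned chip resistor", 5, 3, 2),
    FailureMode("Intermittent open on BGA", 8, 4, 7),
    FailureMode("Solder bridge on fine-pitch QFP", 6, 5, 3),
]
for mode in rank_by_risk(modes):
    print(mode.rpn, mode.description)
```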

In our own business, we assign risks at the quoting stage, on a 1 to 5 scale, with 1 being a slam dunk, risk-wise, while 5 is a bet-the-company roll of the dice. Approximately 98% of our quotes receive risk ratings of either 1 or 2. New customers always receive a risk rating of at least 2, simply because new customers by their very nature bring uncertainty, which only can be reduced with time, experience and ongoing business activity. Idiosyncrasies must be learned. Occasionally we quote a 3 (usually a functional test project where the customer hasn’t a clue what they want or what it will cost). In three years of using this system, we have quoted one job with a risk rating of 4. It did not become an order, nor is it likely we would have accepted it, except under stringent terms and conditions. We have yet to quote a 5. The system has held up well under regular (monthly) review, and we now have three years of data supporting our rankings.

Our modest success at risk management aside, do you see the problem baked into these methods? Subjectivity. Humans assign the risk rankings, change the inputs, tweak the outputs, try to give human behavior a mathematical expression. Human behavior resists a mathematical straitjacket. There is also noise (uncontrollable factors), some of which may exercise decisive influence over performance. Much still is left to chance, bias, intuition, or plain old gut feel, all of which can be, and sometimes are, catastrophically wrong. Think of the decision to launch the Space Shuttle Challenger on a cold January day in 1986, when the O-rings on the booster rockets lost their malleability. We all know how that ended.

At least IPC got one call right. Kudos for switching Apex to San Diego for the next six years. To paraphrase a recent commercial, “That’s reliability you can count on!”

Highly accelerated life testing (HALT) and highly accelerated stress screening (HASS) are two popular techniques used to predict and detect failures in products before they are shipped to the customer. Both terms often are used interchangeably with hot and cold temperature cycling, and sometimes burn-in, but this is misleading. HASS and HALT are comprehensive methodologies, whereas temperature cycling and burn-in are specific types of stress testing.

HALT uses many techniques, depending on the product and its application. It is used most effectively in the design phase of a product and early prototyping. Where appropriate, a HALT program may include elements of vibration testing, shock or drop testing, electrostatic discharge (ESD) testing, hot cycling, cold cycling, rapid temperature transition testing, and other laboratory methods. The point is to apply rapid and severe stress to a product that is well beyond its intended operating parameters. This helps uncover design weaknesses and ascertain performance margins. Failures in HALT can be addressed through root cause analysis and possible redesign before a product reaches the volume manufacturing stage.

One of our customers periodically sends us solar array control boards to x-ray. Following our x-ray imaging, the customer takes them back for temperature, shock and vibration cycling, then returns them to us for a second round of x-ray imaging. The intent is to simulate the rigors of space flight on solder joints. Service calls in space are rare. Our service finds stress-related defects routinely, and our customer uses the data we provide to make its controller cards’ design more robust.

HASS employs many of the same techniques (vibration, power cycling, temperature cycling, burn-in, and power-on operation or functional testing) used by HALT, but typically in a production environment, and in a highly optimized manner. Temperature excursions, for one, may not be as extreme as they are with HALT. Functional testing may be limited to go/no-go testing. Failure diagnostics may be restricted by a scarcity of detailed data. The objective nonetheless is to screen early failures (so-called “infant mortality”) during manufacturing rather than after shipment.

The fact that our own company and many others nationwide have a thriving PCBA failure analysis business is testament to the uneven effectiveness of these methodologies.

Add to the alphabet soup the acronyms FEA, POF, MTBF, EOL, ALT and FTA, and analytical techniques such as Weibull analysis, bathtub curves and fishbone diagrams, and there is an arsenal out there constituting a statistician’s fantasyland.

Yes, but. Stuff keeps breaking. Is it enough? Is the alphabet soup relevant?

Next time I’ll address that question.

At least we can predict with a high level of reliability and statistical significance where IPC’s supreme shindig will be for the next few years. Oh, and Apex is still chockablock with old guys with no intention of buying anything, scooping up free stuff by the bagful. Sure hope it works. Hell hath no fury like an old guy with malfunctioning free stuff.