Reliability, Part III: The Future

Given recurring defects, are failure models more academic exercise than problem-solving tool?

It is a truism that people rank simplicity above complexity. Occam’s Razor perennially favors the simpler, more elegant solution to a problem over the more elaborate alternative. Harry Truman famously expressed a preference for dealing with one-armed economists, to preclude them from saying, “On the one hand….” Nice and neat beats multiple moving variables every time. (Previously I would have said one trumps the other, but I’m giving that verb a rest until, at least, mid-November.) Short words rule. Simple means “easy to explain” and easy to understand.

Life, for the most part, is not composed of storybook endings. Occam was not in the EMS business.

In failure analysis of printed circuit board assemblies, the simple, ideal option is referred to as The Smoking Gun. It is the one indisputable root cause of the failure in a board or an electronic subassembly. Be it a dead short, a lifted trace, mammoth voids in solder, an obviously open or bent pin, or a backwards-installed component, observation of one or the other of these phenomena solves all, explains all, concludes all. Q.E.D. and retire to the drawing room for cocktails.

Then we get to explain it to the customer, which is so much easier if the problem and its solution have a one-to-one relationship. Customers don’t like being told their board is a piece of crap, but at least they know why. This describes about 10% of the cases we see.

Inconveniently, the other 90% of assembly defects do not fall into this tidy one-to-one category.

Damn.

Nevertheless, we still cherish the simple explanation, as if tying everything together in a nice, neat rhetorical bow conveys more authority by its very neatness than logic and persistent investigation rudely allow.

We keep trying. In my EMS days, in more than one fatigue-driven instant of frustration, I fantasized that the ideal corrective action to an inconclusive process problem would be a variation on the following: “The perpetrator has been identified, marched forthwith to the parking lot and, in full view of the assembled workforce, summarily executed at the flagpole. This problem will never happen again.”

That’s definitive.

This is a column about finding simplicity, after all, wherever it can be found, however elusive it is. Above all, however, this is about reliability.

Like the UPS driver. Several years ago an article said that for many women, the brown-clad driver was the most reliable man in their lives, faithfully appearing each morning with the day’s packages, always cheerful, often with a kind word. Always working, but in an agreeable, platonic way. A way to warm to.

Marriage material.

Think also of the Schwinn bicycles of our suburban youth. They weren’t going to win any awards for coolness or general design excellence, but, by God, those suckers were built like tanks. You could ride them through walls (or so it seemed) and keep on riding. Ugly as sin, sufficient to deter trend-conscious thieves, but they never stopped working. They fulfilled their primary design function and became hand-me-downs.

Or Buster Brown shoes, part of the uniform of the parochial school childhood some of us suffered through. Available in any color of the rainbow, as long as that color was either black or brown. Totally unfashionable, but those shoes would emerge triumphant, and still functional, from the grade school equivalent of the Bataan Death March.

Or Timex watches, taking a licking and keeping on ticking. And ask any Cuban about American cars built in the ’40s and ’50s.

That’s what I mean by reliability. Stuff that just works, if treated properly. Forever. No excuses.

So we have this piece of resin-impregnated fiberglass, with electronic components soldered to the top and bottom, with varying degrees of adhesion, to various well-known industry standards. When we apply current in service, it drives computers; tallies and ranks data; stores same; monitors vital signs; regulates fluid flow; directs security cameras in casinos; charges batteries; prolongs the life of expectant transplant patients; controls heads-up displays in aircraft; automates blood tests; and positions spacecraft in geosynchronous orbit. Hefty demands, all of these. But for how long, how repeatedly, and with what degree of precision will each board continue to function as intended? Further, when well-designed units, suitable for these purposes, suddenly fail, why do they fail? After all, they worked great coming off the production line.

In our failure analysis business, customers are always asking about the smoking gun. “Have you found the root cause?” “Do you know how this happened?” “Can you tell what initiated the failure?” “Have you located the solution to our problem?” Nope, nope, nope and nope.

However, we frequently observe many marginal conditions. Components clinging to a board by a sliver of solder, just enough to pass the requisite electrical tests, with no telling for how long. BGA balls that are not concentric to their true position. Through-holes that have not filled to their fullest extent during reflow. Digital device leads that are not properly aligned to their underlying lands on the surface of the board.

In a certain sense we serve up disappointment. But at the same time, our work narrows the search and helps to focus the scope of inquiry, revealing where those marginal conditions begin to impede functionality.

A little bit off here; a wee bit off there. Not enough in and of themselves to constitute a failure, but when you add up all of these marginal conditions, you get a recipe for …?

Hello, tech support.

Sometimes failure analysis is just the laborious process of ticking off those variables, attempting to assign a probability to each, and projecting when they will fail and whether that failure will lead to bad outcomes. It is unsexy detective work, narrowing the scope of the search from the nonessential to the essential. It often involves pattern recognition, and the realization that what we’re looking at is not normal, compared to what came before and what we’re observing in the surrounding area. This is subtle stuff. It usually doesn’t emit smoke.
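
For those who like to see the arithmetic, here is a minimal sketch in Python of how a handful of marginal conditions can add up. It rests on two loud assumptions: that each condition fails independently of the others, and that the per-condition failure probabilities are known. The numbers and the helper name `board_survival` are invented for illustration; they are not measurements from any board we have examined.

```python
from math import prod


def board_survival(p_fail_each):
    """Probability that every marginal condition survives the service
    interval, assuming (hypothetically) that failures are independent."""
    return prod(1.0 - p for p in p_fail_each)


# Hypothetical failure probabilities over some service interval.
# These values are made up purely to illustrate the arithmetic.
marginal = {
    "sliver-of-solder joint": 0.02,
    "off-center BGA ball": 0.01,
    "underfilled through-hole": 0.03,
    "misaligned lead": 0.015,
}

survival = board_survival(marginal.values())
at_least_one_failure = 1.0 - survival

print(f"Worst single condition: {max(marginal.values()):.1%}")
print(f"Chance of at least one failure: {at_least_one_failure:.1%}")
```

With these made-up inputs, no single condition exceeds 3%, yet the chance that at least one of them gives out is roughly 7%. Real defects are rarely independent, so treat this as a way of thinking about the problem, not a prediction.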

Reporting these marginal conditions to an audience starving for a smoking gun can be a challenge. This is especially true when the audience has already conducted its own in-house trial, and the culprit has been convicted and sentenced to hang before the proceedings even begin. Customers like this perceive our job to be one of confirming their biases.

You can imagine their reaction when we don’t get with the program.

A recent customer was completely convinced the package-on-package (PoP) device on their board was failing due to head-in-pillow (HiP) defects. They refused to entertain alternative explanations, even when x-ray images clearly demonstrated the PoP device to be properly and cleanly assembled, but the adjacent FPGA to be resting on substandard, voided BGA balls, showing abundant evidence of shoddy rework. The schematic clearly indicated the FPGA controlled the PoP device, so the chain of evidence was strong. Causation was obvious (to us). It didn’t matter. The customer didn’t want to hear it, our 24-page report with supporting 3D images notwithstanding. You reach a point in the argument where you stop fighting. You’re the customer; here’s your bill. Have a nice life making up facts to suit your conclusions.

Then there was the customer with the Big Emergency: an intermittent failure because of unpredicted heat excursions on one FPGA. Sample size of one board, one row, one pin. The only discordant note out of thousands of boards made. That didn’t stop the customer from drawing global conclusions and predicting imminent doom.

And ignoring our results.

A misshapen pin was not the glamorous answer they were looking for. Or at least they couldn’t bring themselves to persuade management that our evidence was valid. Nevertheless, in spite of the misgivings about our results, the board still didn’t work consistently, and we were pretty sure we knew the reason. Somebody screwed up. It happens. Here’s your bill. Thanks for coming. File the results under “ignore.”

In the last column, I cited many of the popular statistical techniques used to predict “infant mortality.” I also highlighted methods like HASS and HALT employed to deliberately accelerate failures and get a handle on probable product lifecycles.

This time I want to emphasize the accumulation, or insidious creep, of little things: ever-so-slight deviations from the norm, or the nominal tolerance, that subtract from performance and product life, eventually leading to total failure. These little things may not seem like much at first, but they compound. Over time they can overwhelm.

How do we know this? We see evidence of it every day, and we have thousands of images of substandard or marginal assembly conditions to back it up. They suggest an emphasis in manufacturing on “building just enough quality in to pass,” as opposed to “building it to work.” Analogous to asking the teacher at the beginning of the semester what is needed to get an “A” versus resolving to actually learn something.

How quaint.

My point, you ask? We have lots of sophisticated models to give us some indication when electronic products will break. These models were supposed to help us improve our processes so they would not break, or at least break far less often than they do.

Most disturbingly, the same problems recur, as in the examples described above. Makes you wonder if these models are more of a self-perpetuating academic exercise than a problem-solving tool. Is anything really being learned?

Somebody didn’t get the memo. Stuff still breaks, a lot. Perhaps planned obsolescence really is desirable, and sustaining engineering isn’t. On the one hand, thank you, because failure is good for business. On the other hand, is anybody listening?

Sorry, Harry.