>*"Fail fast" applies to more than just software.* Yes, but let's not take the a...

bick_nyers · on April 20, 2023

I previously worked in medical image processing/transcoding and you are correct, but most people probably don't know precisely how or why.

The knee jerk mindset that most people have in safety-critical design, is being ultra conservative.

In the fields of medical and space (and likely others), you have an asymmetrical risk-reward profile. Think about it this way, if an engineer takes a risk on refactoring some software logic, and it speeds the system up 3%, what is their reward? A raise? A promotion? Fat chance. If that refactor instead breaks 3% of the time; engines blow up, people die, customers yell at them, they get fired, perhaps in some situations they even get sued.

The engineers then converge to a local maxima by the means of: "If it ain't broke, don't fix it", and various other ultra conservative leanings. This mindset also will often get selected for in hiring, and rewarded.

Now take the limit of this as time goes to infinity, you have bloated, legacy software that is full of spaghetti code and can't take new features easily (if at all). In the case of NASA's space shuttle program, it was extremely expensive, and the cost wasn't falling significantly over time either.

One might view ultra conservatism as the problem, but the real issue is the asymmetrical risk-reward profile. Solving that takes a head-on approach with great leadership, deploying capital, state of the art testing/QA, great deployment pipelines, and more. Shield people from the risk, and intentionally reward people when they push the envelope.

Imagine if you had a software testing process and product specification that was 99.9999% effective (and no, yours is not even close to that), you could then move at a silicon valley "fail fast" pace and advance the technology and architecture rapidly.

mgfist · on April 20, 2023

> You can actually see this in SpaceX. In development when stakes are relatively low (e.g., no payload or passengers), the risk threshold is high. But they start taking a more measured approach when that risk starts to ratchet up. The danger being, advocates of one approach don't always know when/how to transition to the other.

This corroborates "fail fast". They achieve reliability and safety by launching far more than traditional space companies and seeing failure in those launches. They prove it out before adding the risk of human lives or expensive cargo. Meanwhile traditional companies will develop their rocket for 15 years till it's "perfect". SpaceX figured out that achieving perfection is best done through actual attempts, not laboratory experiments.

bumby · on April 20, 2023

>This corroborates "fail fast".

But it also shows "fail slow" later. My point is that "fail fast" is not some hard rule to abide by, but contextually dependent on the risk at stake.

mgfist · on May 3, 2023

Nothing contradicts that. SpaceX does do this contextually. Once they nailed down the Falcon 9 rocket, the boosters and the Merlin engine, they mostly stopped messing with it and focused on operational excellence. But to get to that point they failed fast, because it's the most effective way to get reliability.

lutorm · on April 20, 2023

I mean, the whole point if failing fast is to stop failing faster. If you just keep failing fast without end you're not going to be around long...

bumby · on April 20, 2023

See the last comment of my OP. The risk I'm poking at is a cultural one, where "failing fast" becomes an acceptable mode of operating, regardless of the context.

phkahler · on April 20, 2023

Let's look back at Star Hopper. SpaceX literally hired a company that builds water towers to build a prototype tank for Starship - and flew it! They were primarily trying to figure out how to build it, move it, etc. Obviously the risk tolerance was high. That's really the difference between them and say Boeing. SpaceX starts with higher risk tolerance just to figure out the lay of the land, but they start reducing that tolerance as development progresses. Boeing aims for perfection out of the gate (apparently).

bumby · on April 20, 2023

Alternative take: SpaceX is so new they don't know what they don't know.

Take their example of a failed F9 strut, where the material supplied by a vendor didn't come close to meeting the necessary specs. A mature aersopace company would have processes in place to check the material for these specs before use. SpaceX has since levied these new process checks, but prior to that failure lots people may have pointed to them as being more efficient because of their 'streamlined' process.

lutorm · on April 20, 2023

A mature aersopace company would have processes in place to check the material for these specs before use.

Would a "mature aerospace company" also know to not use O-rings outside the temperature range specified by its engineers? Or know to test whether foam traveling at high velocity would penetrate the TPS?

Look, this is hard stuff. It's very easy to tell when you screw up, but very difficult to tell how close you are to screwing up. You're deluding yourself if you think some entities are immune to screwing up just because they've been around.

bumby · on April 20, 2023

Whataboutism aside, I’m not delusional; I’m quite upfront that these types of biases exist at every organization that is staffed by human beings.

The difference is I don’t allow “gosh, space is hard!” as a rationale for thinking one organization is immune to those shortcomings. So instead of taking a look and asking something like, “Hmmm. Every other organization seems to have a supplier vetting process for safety critical stuff, I wonder if we should too?” We can instead just pretend we’re smart and different and be forced to learn already solved problems the hard way. The supplier thing is very standard quality control process stuff that transcends industry. Knowing if foam can penetrate tile or o-rings operate out of spec are not, precisely because they were non-standard conditions. That’s not to say that the decisions weren’t flawed, but I don’t think it’s as good of an analogy as you may think. Besides, the investigations largely pointed to broken cultures so I don’t know if that’s the type of company you want SpaceX associated with.

What’s the saying? “A fool learns from his own mistakes. A wise man learns from somebody else’s”

lutorm · on April 21, 2023

That's fair, my point was just that you made it sound like "mature aerospace companies" were some special beings that didn't make mistakes.

It's good to learn from other people's mistakes, but you also can't let everything people have done before go unchallenged or no progress would ever be made from rockets that cost $2B per launch.

bumby · on April 21, 2023

Yeah, I realize now that probably wasn't worded as well as it could have been. To say it differently, I would expect well-run companies (whether 'mature' or not) to have the processes in place to better control the well-known problems. When it comes to those 'unknown unknowns' sometimes you can't learn except by trial-and-error.

s1artibartfast · on April 20, 2023

If you don't know something, your options are to pack up your bags and go home or try to learn.

bumby · on April 20, 2023

Or, you can take a risk-informed approach, and understand what risks are prevalent (e.g., the risk of a bad vendor) and put the appropriate checks in place to mitigate that risk. "Learning" doesn't always mean taking the highest risk option and just rolling the dice.

s1artibartfast · on April 20, 2023

Oh I totally agree, but that is a form of learning.

If you don't know what you don't know, you must find out.

giantrobot · on April 20, 2023

If you write a spec you've got to have some process to verify your stuff meets that spec. Otherwise you just wrote a fucking dream journal.

s1artibartfast · on April 20, 2023

Indeed, and one way to verify you meet the spec is to test it, which they did and found it to be deficient. Having done so, they decided to improve their process. This is the definition of learning from your mistakes

phkahler · on April 20, 2023

Sadly, or interestingly, that was not an engineering lesson but a human one - don't trust the supplier.

Framed this way, I'm not surprised younger Elon's company missed it.

bumby · on April 20, 2023

This is an really optimistic outlook. One way to test the rocket is also to see if it fails when humans are aboard. But it may not be the best way to balance risk and what you learn.

It would have been much more economical to test a coupon of the material upon receipt, like what is considered standard practice throughout aerospace companies. Or, like you said, you can blow up a rocket and launch pad instead. Same result, different risk profiles.

s1artibartfast · on April 20, 2023

I don't think anyone would argue it would have been better had they known better and done things right the first.

The question is how you transition from the state of not knowing to knowing.

If you have poor processes and a lack of knoledge, how do you get better?

In this specific instance, the root cause analysis and remediation are vastly more complicated that presented in this thread. It is not like SpaceX wasnt doing testing on incoming materials at all or ignorant of the concept.

bumby · on April 21, 2023

You’re right. Like most failures of this type, it’s rarely simple and these forums are super conducive to long-form discussion. They did checks, but they were inadequate.

Regarding knowing if you have a poor process or not, it depends on the uniqueness of your problem. For proteins that are relatively common, like material checks, you can shorten your learning process by looking at other organizations that have been through it for decades. For more exotic non-standard problems, you might have to learn the hard way.

bumby · on April 20, 2023

I think the distinction I'm making is that the "form" of learning you take should be proportional to the risk and that all forms are not equal in value.

version_five · on April 20, 2023

I agree with everything in your comment. One thing I've wondered about re space exploration though is how we reconcile what I think is pretty much universal acknowledgement that we have to do everything possible to avoid loss of life with the inherent danger of space travel, and the "drag" that a zero-tolerance safety focus can have on culture.

Put another way, what would global exploration have looked like if sailors refused to accept the risks of early ocean crossings?

bumby · on April 20, 2023

That is a real issue. I think part of it involves creating a culture where it's acceptable to make failures as long as those failures were the result of a sound decision process.

What you don't want is a culture that is either a) afraid to take any risk because they are afraid of career consequences or b) willing to roll the dice with bad decision processes due to biases and bad incentives.

Example for a): bureaucrats who are unwilling to push the envelope because a bad outcome would effectively end their career

Example for b): making high-risk decisions due cost/schedule pressure, like competition for a contract

DoughnutHole · on April 20, 2023

Early modern sailors were uneducated manual labourers with few economic prospects in a world where simply living as a lower class individual was more dangerous than nearly any job that exists in the developed world today. Sailing often paid better than jobs on land which made up for the risk, and it offered the potential of massive reward to the high class leadership of the vessel.

There is no consummate economic incentive for being an astronaut. The incentive is the experience and making some impact on science, and while that motivates many people the probability of attracting the best and the brightest goes down as the probability of exploding goes up.

I'd say there's unlikely to be enough of an economic incentive to justify riskier manned space travel until the earth becomes a whole lot less habitable.

lutorm · on April 20, 2023

If what you're saying is true, how come astronaut programs have many more applicants than they can take? People have other goals in life beyond economic incentives.

People volunteer for the military, and I don't think anyone's under any illusion this is risk free. The important thing is not lying to people about the risk involved.

bumby · on April 20, 2023

>People have other goals in life beyond economic incentives.

They acknowledged this point, though:

>The incentive is the experience and making some impact on science

I think you might be interpreting their last point differently than me. I took it to mean that the "Mars or bust" is an inspiring, but impractical, narrative.

philistine · on April 21, 2023

I don't know what you're talking about. The Space Shuttle blew up twice, killed too many people out of sheer incompetence. And it flew again both times. Both times because NASA needed the vehicle to deliver on its promises. The Space Shuttle should never have flown after Columbia was lost. But it did, because NASA decided to live with the risks.

politician · on April 20, 2023

My expectation is that as soon as someone gets something going beyond Earth orbit and beyond the need to obtain launch licenses from the FAA, the zero-tolerance safety culture will be reduced. The Martians will build nuclear power plants.

shinryuu · on April 20, 2023

I think it's a matter of failing in a safe environment. Clearly that's what space x is doing. They fail spectularly, but they do so within the confines of a safe environment.

For the real launch they've already mapped out the failure modes and are able to prevent them when it really matters.

bumby · on April 20, 2023

I've worked in aerospace. The organizations (both public and private) would all claim to fail "within the confines of a safe environment".

One small BS detector is when there's some unplanned/unmitigated test outcome that gets characterized as a "test anomaly" rather than being transparent about the details.

>they've already mapped out the failure modes and are able to prevent them when it really matters

This remains to be seen. The shuttle also had all their failure modes mapped out. As did CST-100. Yet massive failures still occurred.

cdash · on April 20, 2023

It doesn't remain to be seen. SpaceX has done this already with the Falcon rocket.

bumby · on April 20, 2023

That...doesn't follow because F9 and Starship are different systems.

Mercury and Gemini both had good track records. Apollo and Shuttle, not so much.

Put differently, how much would you be willing to bet that their FMEA has caught literally every failure mode possible?

jjk166 · on April 20, 2023

Well in the case of the Shuttle both accidents that killed people were due to previously identified failure modes. At the end of the day, risk will always be a number greater than 0%. At some point someone is going to have to make a judgement call that the risk is low enough to proceed, and sometimes that call is going to be wrong.

bumby · on April 21, 2023

The failure modes may have been known but the effects were not. A FMEA needs both to work.

Regarding the foam, they had a difficult time even recreating it after the fact. It was apparently only on a lark that they decided to turn the gun up to 11 and, viola, now the foam had the physical properties capable of damaging the tile catastrophically. So, yes, they knew the mechanism of foam shedding but did not realize the effect properly.

With the o-rings, they similarly just didn’t have the test data for this conditions. They incorrectly extrapolated on the test data they did have.