CobrastanJorji's comments

Multi-cloud. It's fairly unlikely that AWS and Google Cloud are going to fail at the same time.


Yeah, just double++ the cost to have a clone of all your systems. Worth it if you need to guarantee uptime. Although it also doubles your exposure to potential data breaches.


> double++

I'd suggest to ++double the cost. Compare:

++double: spoken as "triple" -> team says that double++ was a joke, we can obviously only double the cost -> embarrassingly you quickly agree -> team laughs -> team approves doubling -> you double the cost -> team goes out for beers -> everyone is happy

double++: spoken as "double" -> team quickly agrees and signs off -> you consequently triple the cost per C precedence rules -> manager goes ballistic -> you blithely recount the history of C precedence in a long monotone style -> job returns EINVAL -> beers = 0


Lol :)


And likely far more than double the cost since you have to use the criminally-priced outbound bandwidth to keep everything in sync.


Shouldn't be double in the long term. Think of the second cloud as a cold standby. It depends on the system, but roughly: periodic replication of the data layer (object storage/database), and CI/CD configured to be able to build services and VMs on multiple clouds. Have automatic tests weekly/monthly that represent end-to-end functionality, and scaled tests semi-annually.

This is all very, very hand-wavey. And if one says "golly gee, all our config is too cloud specific to do multi-cloud" then you've figured out why cloud blows and that there is no inherent reason not to have API standards for certain mature cloud services like serverless functions, VMs and networks.

Edit to add: I know how grossly simplified this is, and that most places have massively complex systems.


And data egress fees just to get the clone set up, right? This doesn’t seem feasible as a macrostrategy. Maybe for a small number of critical services.


How do you handle replication lag for databases?


If you use something like cockroachdb you can have a multi-master cluster and use regional-by-row tables to locate data close to users. It'll fail over fine to other regions if needed.


I don't think there are any physics reasons why it'd be impossible, but certainly we can't do it with existing technology. You'd need an air breathing jet that could get a vehicle to go about five or six times faster than any current such engine has ever achieved (i.e. around mach 20-30), which is perhaps ridiculous, but I don't think it's necessarily impossible, just something we don't know how to do. There have been some (failed) efforts to get there, like the X-30.


Basically when you cut thrust you must pass through that altitude again or escape orbit.

So either fire a rocket in space to circularize the orbit, or reach more than Earth’s escape velocity of 25,020 mph (11.186 km/s, 40,270 km/h), roughly Mach 32.6, because there is still some drag in air too thin for any kind of air-breathing engine to work.
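
For what it's worth, the three figures quoted above agree with each other. A quick sanity check, assuming the Mach number is taken against a sea-level speed of sound of about 343 m/s (that reference value is an assumption; the local speed of sound at altitude is lower, which would push the Mach number up):

```python
# Sanity-check the escape-velocity figures quoted above.
ESCAPE_VELOCITY_KM_S = 11.186   # figure quoted in the comment
KM_PER_MILE = 1.609344          # exact definition of the international mile
SPEED_OF_SOUND_M_S = 343.0      # assumption: sea-level air at roughly 20 °C


def escape_velocity_figures(v_km_s: float = ESCAPE_VELOCITY_KM_S) -> dict:
    """Convert a speed in km/s to km/h, mph and a sea-level Mach number."""
    km_h = v_km_s * 3600
    mph = km_h / KM_PER_MILE
    mach = v_km_s * 1000 / SPEED_OF_SOUND_M_S
    return {"km_h": km_h, "mph": mph, "mach": mach}


if __name__ == "__main__":
    for name, value in escape_velocity_figures().items():
        print(f"{name}: {value:,.1f}")
```

The results land within rounding of the quoted 40,270 km/h, 25,020 mph and Mach 32.6.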

X-30 was aiming far lower, ~Mach 20. Nuclear could make it more realistic than any form of chemical combustion. It might be physically possible using hydrogen, but you’re talking about generating extreme thrust at vastly more extreme conditions than the space shuttle’s reentry.


Or go high enough to let the moon alter your orbit into one that doesn't hit the atmosphere.


Yeah, thus ‘Basically’. You can also escape Earth’s orbit slightly more easily using the sun. However, none of this really helps much: you’re still looking at more than escape velocity in atmosphere with a purely air-breathing engine, due to drag.


Well you can't reach a high orbit using air breathing engines because your impulse must be given within the atmosphere, and then your trajectory inevitably re-intercepts the atmosphere (unless you achieve an escape trajectory) and would decay quickly. You can get around this by packing a small rocket engine and circularizing on apogee!


Can an air breathing jet actually attain those velocities? I thought most supersonic aircraft use rockets after a certain point


"Most" supersonic aircraft are fighter jets and other military aircraft that use jet engines, not rockets. They may have afterburners that are much like a rocket that just injects jet fuel in the exhaust stream, but that's still using atmospheric oxygen.

The issue, I think, is more about balancing drag and air intake at appropriate atmospheric densities for different speeds. An SR-71 Blackbird could fly at 85,000 feet continuously, and a MiG-25 set what I believe is still the air-breathing record max altitude by pulling a "zoom climb" (accelerating in higher-density air that the engines could use effectively, then pulling the stick back and coasting up through rarefied air too thin for the engines) to 38km or 123,000 feet.

Most experimental hypersonic aircraft use rockets because that's what works.


> Can an air breathing jet actually attain those velocities?

There's no theoretical limitation on how fast an air breathing jet can move. You just have to redesign everything every few mach numbers, and deal with the atmospheric drag.


It's always fun to compare Trump quotes against other presidential quotes.

Jefferson: “The basis of our governments being the opinion of the people, the very first object should be to keep that right; and were it left to me to decide whether we should have a government without newspapers or newspapers without a government, I should not hesitate a moment to prefer the latter."

Reagan: "There is no more essential ingredient than a free, strong, and independent press to our continued success in what the Founding Fathers called our 'noble experiment' in self-government"

FDR: "If in other lands the press and books and literature of all kinds are censored, we must redouble our efforts here to keep them free."

Trump: "The press is the enemy of the people."


Even more fun when we add the dimension for press ownership.

Who owned the presses when Jefferson or FDR or even Reagan discussed the role of the press; who owns it now?

Diversity and the (political/social) range of press is an important aspect of this matter.


And then there’s Nixon.


The issue comes in theory vs practice. Obviously in theory a free press is absolutely key to a free society, but in practice the press often ends up with different motivations. Another, rather more famous comment from Jefferson on the press [1]:

---

"To your request of my opinion of the manner in which a newspaper should be conducted, so as to be most useful, I should answer, "by restraining it to true facts & sound principles only." Yet I fear such a paper would find few subscribers. It is a melancholy truth, that a suppression of the press could not more compleatly deprive the nation of it's benefits, than is done by it's abandoned prostitution to falsehood.

Nothing can now be believed which is seen in a newspaper. Truth itself becomes suspicious by being put into that polluted vehicle. The real extent of this state of misinformation is known only to those who are in situations to confront facts within their knolege with the lies of the day. I really look with commiseration over the great body of my fellow citizens, who, reading newspapers, live & die in the belief, that they have known something of what has been passing in the world in their time; whereas the accounts they have read in newspapers are just as true a history of any other period of the world as of the present, except that the real names of the day are affixed to their fables.

General facts may indeed be collected from them, such as that Europe is now at war, that Bonaparte has been a successful warrior, that he has subjected a great portion of Europe to his will, &c., &c.; but no details can be relied on. I will add, that the man who never looks into a newspaper is better informed than he who reads them; inasmuch as he who knows nothing is nearer to truth than he whose mind is filled with falsehoods & errors. He who reads nothing will still learn the great facts, and the details are all false.

Perhaps an editor might begin a reformation in some such way as this. Divide his paper into 4 chapters, heading the 1st, Truths. 2d, Probabilities. 3d, Possibilities. 4th, Lies. The first chapter would be very short, as it would contain little more than authentic papers, and information from such sources as the editor would be willing to risk his own reputation for their truth. The 2d would contain what, from a mature consideration of all circumstances, his judgment should conclude to be probably true. This, however, should rather contain too little than too much. The 3d & 4th should be professedly for those readers who would rather have lies for their money than the blank paper they would occupy."

Thomas Jefferson, 1807 [1]

---

[1] - https://press-pubs.uchicago.edu/founders/documents/amendI_sp...


During the height of blockchain, there were plenty of good, legitimate jobs. The things they were building were some combination of inane, criminal, or stupid, but the jobs themselves were often quite real. I knew more than one person being paid $300k+/yr building something completely stupid like a collectible pet dragon breeding simulator because a VC thought it had a decent chance of being the next monkey coin or something. Sure, you had to get a new job every six months as each VC ran out of money, and sure you were making the world a worse place, but hey, it's a living.


Voting isn't necessarily a better system. The majority of people will very frequently give up rights in any given specific case that, in general, they hold dear. We're not rational actors.

And there are a lot of really weird discussions to be had about "consent," too. If we allow unlimited speech, that means that we're all subject to marketing and propaganda, and that's another thing that people are quite vulnerable to. Being convinced to vote via propaganda isn't really a great example of consent. But banning any speech that resembles propaganda is rife with problems.

Anyway, my point is that democracy/voting and free speech isn't necessarily the most free/consented-to form of government. I'm not sure what would take its place, though. I certainly wish I knew.


Dunno where parent said anything about democracy. Democracy and voting aren’t the same thing. Also, they rejected the idea of voting on every law (democracy).

It seems inherent in your worldview that you lack faith in people to self govern (that is, for a person to govern themselves), which would explain why you are at odds with the parent. I suggest you read a bit of Jefferson’s ideas of self governance, education, etc. There are tradeoffs as with everything else. I do think, however, based solely on your short commentary here, that there may be an opportunity for your perspective to be enriched.


The top ranked schools figured out long ago that removed students do not count against the scores that make them top ranked schools.


They have certainly gotten better, but it seems to me like the growth will be kind of logarithmic. I'd expect them to keep getting better quickly for a few more years and then kinda slow and eventually flatline as we reach the maximum for this sort of pattern matching kind of ML. And I expect that flat line will be well below the threshold needed for, say, a small software company to not require a programmer.



Ironically, yes. :)


Ahaha, of course nothing will ever be able to do my job!


Since the "AGI has been achieved internally" tweet I've only seen incremental improvements that are guaranteed to never be able to do my job. Or most people's jobs.


It will take at least a few more decades from the looks of it. I would be 6 feet under by then, so yes, "Nothing will ever be able to do my job".


Many years ago, I worked at Amazon, and it was at the time quite fond of the "five whys" approach to root cause analysis: say what happened, ask why that happened, ask why that in turn happened, and keep going until you get to some very fundamental problem.

I was asked to write up such a document for an incident where our team had written a new feature which, upon launch, did absolutely nothing. Our team had accidentally mistyped a flag name on the last day before we handed it to a test team, the test team examined the (nonfunctional) tool for a few weeks and blessed it, and then upon turning it on, it failed to do anything. My five whys document was mostly about "what part of our process led to a multiweek test effort that would greenlight a tool that does nothing that it is required to do."

I recall my manager handing the doc back to me and saying that I needed to completely redo it because it was unacceptable for us to blame another team for our team's bug, which is how I learned that you can make a five whys process blame any team you find convenient by choosing the question. I quit not too long after that.


My litmus test for these types of processes: If root causes like "Inflexible with timelines", or "Incentives are misaligned (e.g. prioritizing career over quality)" are not permitted, the whole process is a waste of time.

Edit: You can see others commenting on precisely this. Examples:

https://news.ycombinator.com/item?id=45573027

https://news.ycombinator.com/item?id=45573101

https://news.ycombinator.com/item?id=45572561



Usually things like that have to end up in retrospectives, and the first thing I hated about Scrum (or maybe just the first Scrum team, though they tried really hard to follow the letter of the process) was that you basically had to know about a problem for 5-7 weeks before you could get anyone to act on it. Because the uncomfortable items had to repeat at least 3 times before people wanted to look at them.

This was and is torture to me. I'm not going to fuck something up on purpose just to make the paperwork look good if I can tell ten minutes in that this is a stupid way to do it and I should be doing something else first.


Many developers have strong opinions that certain parts of a process are valuable and others aren't, and will try quite hard to align your process with their opinions as quickly as possible. For an organisation that doesn't know which developers' strongly held views are right and which are not, requiring everyone to try something for 5-7 weeks is probably more productive than any other approach they could take.


Only an idiot keeps touching a hot stove until someone tells him to stop.


It's not reasonable for an employer to expect you to injure yourself. It is reasonable for an employer to expect you to follow the working patterns they've told you to (up to a point) even if you personally think they're unproductive.


Having done quite a bit of politicking at a centuries-old, mid-sized company, I can tell you that what management wants from you is the assurance that this particular problem won’t happen again. Ideally there will be an actionable outcome, so someone can check that off a todo list at a later meeting. Though what I’ve found is that if you have enough clout you can add an addendum to the root cause analysis, and you can start getting into things like misaligned incentives. But always keep in mind, at best you can only point out that this will mean this class of problem will keep happening.

If you do this, know that there be dragons. You have to be very careful here, because for any sufficiently large company, misaligned incentives are largely defined by the org chart and its boundaries. You will be adding fuel to politics that is likely above your pay grade, and the fallout can be career changing. I was lucky to have a neutral reputation, as someone who cared more about the product than personal gain. So I got a lot of leeway when I said tone-deaf things. Even still, I ended up in the crosshairs once or twice in the 10 years I was at the company for having opinions about systemic problems.


> Having done quite a bit of politicking at a centuries-old, mid-sized company, I can tell you that what management wants from you is the assurance that this particular problem won’t happen again.

I'm not disagreeing. I'm saying they should phrase it this way (and some do), instead of masking it with an insincere request for root causing.

> Ideally there will be an actionable outcome, so someone can check that off a todo list at a later meeting

Occasionally this is the right thing to do. And often this results in a very long checklist that slows the whole development down because they don't want to do a cost-benefit analysis of whether not having an occasional escape is worth the decrease in productivity. And this is because the incentives for the manager are such that the occasional escape is not OK.

In reality, though, he will insist on an ever growing checklist without a compromise in velocity. And that's a great recipe for more escapes.

That's the problem with root cause analyses. Sometimes the occasional escape is totally OK if you actually analyze the costs. But they want us to "analyze" while turning a blind eye to certain things.

I've worked at places that understood this and didn't have this attitude. And places that didn't. Don't work for the latter.

BTW, I should add that I'm not trying to give a cynical take. When I first learned the five whys, I applied it to my own problems (work and otherwise). And I found it to be wholly unsatisfying. For one thing, there usually isn't a root cause. There are multiple causes at play, and you need a branching algorithm to explore the space.

More importantly, 5 (or 3) is an arbitrary number. If you keep at it, you'll almost always end up with "human nature" or "capitalism". Deciding when to stop the search is relatively arbitrary, and most people will pick a convenient endpoint.

Much simpler is:

1. What can we do to prevent this from happening again?

2. Should we solve this problem?

Expanding on the latter, I once worked at a place where my manager and his manager were militant about not polluting the codebase to solve problems caused by other tools. We'd sternly tell the customers that they need to go to the problematic tool's owner and fix them, and were ready to have that battle at senior levels.

This was in a factory environment, so there were real $$ associated with bugs. And our team had a reputation for quality, and this was one of the reasons we had that reputation. All too often people use software as a workaround, and over the years there accumulate too many workarounds. We were, in a sense, a weapon upper management wielded to ensure other tools maintained their quality.


Engineering is the art of making do with what you've got, and sometimes you have to treat such unreasonable impositions just like any other constraint.


But are those really the root causes?

An inflexible timeline is a constraint that is often valid -- e.g. if you have to launch a product meant for classrooms before the school year begins.

So then the real question becomes, why wasn't work scoped accordingly? Why weren't features simplified or removed once it became clear there wouldn't be time to deliver them all working correctly? There's a failure of process there that is the root cause.

Similarly, with "incentives are misaligned", that's fuzzy. A real root cause might be around managing time spent on bugfixes vs new features, and the root cause is not dedicating enough time to bugfixes, and if that's because people aren't being promoted for those, it's about fixing the promotion process in a concrete way.

You can't usually just stop at fuzzy cultural/management things because you want to blame others.


> An inflexible timeline is a constraint that is often valid -- e.g. if you have to launch a product meant for classrooms before the school year begins.

That's not an inflexible timeline. That's just a timeline.

> So then the real question becomes, why wasn't work scoped accordingly? Why weren't features simplified or removed once it became clear there wouldn't be time to deliver them all working correctly? There's a failure of process there that is the root cause.

Because management didn't want to drop features. Hence "inflexible".

I'm not saying this is always the reason, or even most of the time. But if we can't invoke it when it is the problem, then the exercise is pointless.

> Similarly, with "incentives are misaligned", that's fuzzy.

Any generic statement is inherently fuzzy.

> You can't usually just stop at fuzzy cultural/management things because you want to blame others.

Did you really think I was advocating responding to a whole process with one liners?

The examples you gave are often ones they will not accept as root causes.


Usually another team's failure is covered by their own independent report. That simplifies creating the report since you don't need to collaborate closely, but also prevents shifting the blame on to anyone else (because really, both teams had failures they should have caught independently). E.g. as the last why:

Why did the testing team not catch that the feature was not functional?

This is covered by LINK


If a root cause analysis is not cross team, how deep can the analysis possibly be? "Whoops, that question leads to this other process that our team doesn't directly control, guess we stop thinking about that!"


If your team doesn't control it, should you be thinking about it? Or should the team that owns it also own fixing it?

I should also have stated that, based on the context, I assumed this was talking about an incident report meant to be consumed internally, which I believe should be one per team. Incident reports published externally should be one single document, combining all the RCA from each individual report.


Shouldn't you be thinking about it? Whether you can control it or not, if you need to rely on it, you should be thinking about it.

Otherwise when an airplane crashes because of a defect in the aluminum, the design team's RCA will have to conclude that the root cause is "a lack of a redundant set of wings in the plane design", because they don't want to pin the blame on the materials quality inspection team mixing up some batch numbers.

If you were relying on something not to fail and it failed, your RCA should state as much. At best in the GP's case they could say "it's our fault for trusting the testing team to test the feature".


This sort of thinking is why the Japanese obliterated Detroit in the late 70's and early 80's.

The idea that if it was obvious another team messed up, you'd just ignore the problem until it got audited by a cross-functional team. All of the time and effort and materials spent between the two being wasted because nobody spoke up.


I think that's a valid question, and has plenty of ways you could go.

Is the process set up so that it's literally "throw it over the wall and you're done unless the test team contacts you"? Then arguably not. You did your job e2e and there was nothing you could've done. Doing more would've disrupted the process that's in place and taken time from other things you were assigned to. The test team should've contacted you.

BUT, well now the director has egg on his face and makes it your problem, so "should" is irrelevant; you will be thinking about it. And you ask yourself, was there something I could've done? And you know the answer is "probably".

Then, the more you think about it, you wonder, why on earth is the process set up to be so "throw it over the wall"? Isn't that stupid? All my hard work, and I don't even get to follow its progress through to production? Is this maybe also why my morale is so low? And the morale of everyone else on the team? And why testing always takes so long and misses so many bugs?

And then as you start putting things together, you realize that your director didn't assign this to you out of spite. He assigned it to you to make things better. That this isn't a form of punishment, but an opportunity to make a difference. It's something that is ultimately a director-level question. Why is the process set up like it is? The director could put it together and solve it with adequate time, but at that level time is in short supply, and he's putting his trust in you to analyze and flesh out what really is the root cause of this incredibly asinine (and frightening) failure, and how can we improve as a result?

That said, in an org so broken that something like this could happen, I'm guessing the director is wanting you to do the RCA and the ten other firedrills that you're currently fighting as well, in which case, eh, fuck it. Blame the other team and move on.


So what do heartbleed and log4shell RCAs look like for your internal teams? “A necessary source library screwed up, not our problem”?


If your root cause is cross team, then you wind up having to make some implicit assumptions on what the other team could have done. It's akin to ending with "because the gods got angry." Not really actionable.

This is a classic "limit the scope of the feature." You want the document to be written and constrained to someone that is in a position to impact everything they talk about. If you think there was something more holistic, push for that, as well.

Note you can discuss what other teams are doing. But do that in a way that is strictly factual. And then ask why that led your team to the failure that your team owns.


If the dynamic is cross functional you need to reschedule the post mortem and invite the other team to the meeting.

This is literally a "the right people aren't in the room" issue.


> on what the other team could have done

If you're wondering what anyone "could have done", you've already missed the point of the article completely.


I was just commenting further on why you constrain it to your team/area of control. Should have been more clear that I meant my comment as a plus one to some of the other replies.


Ah, thank you and sorry for assuming.


I oversaw an RCA once where this is exactly where they stopped. “We didn’t write this code originally so it isn’t our fault that our new feature broke it”. Repeat for 30 minutes whenever anyone says anything. We gave up.


Pretty deep. It forces you to account for failures in other domains


> root

You keep using that word. I do not think it means what you think it means.


I love 5+ why's. I find it to be a fantastic tool in many situations. Unfortunately, when leadership does not reward a culture of learning, Five Why's can become conflated with root cause analysis and just become a directed inquiry for reaching a politically expedient cause. The bigger the fuck up, the more it needs an impartial NTSB-like focus on learning and sharing to avoid them in the future.

Fwiw, if I were your manager performing a root cause analysis, I'd mostly expect my team to be identifying contributing factors within their domain, and then we'd collate and analyze the factors with other respective teams to drill down to the root causes. I'd also have kicked back a doc that was mostly about blaming the other team.


The excellent thing I learned about 5 whys is that not only is it not really just 5, as you allude to with “5+”, but it’s also a *tree* instead of a linear list. Often a why will lead to more than one answer, and you really have to follow all branches to completion. The leaf nodes are where the changes are necessary. Rather than identifying one single thing that could have prevented the incident, you often identify many things that make the system more robust, any one of which would have prevented the incident in question.
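
To make the tree idea concrete, here is a minimal sketch. The `Why` class, the `leaf_causes` helper, and the incident contents are all hypothetical, loosely modeled on the flag typo story upthread rather than taken from any real postmortem tooling:

```python
from dataclasses import dataclass, field


@dataclass
class Why:
    """One answer to a "why?"; `causes` holds the answers to the next "why?"."""
    answer: str
    causes: list["Why"] = field(default_factory=list)


def leaf_causes(node: Why) -> list[str]:
    """Collect the leaf answers, i.e. the places where a change is needed."""
    if not node.causes:
        return [node.answer]
    leaves: list[str] = []
    for cause in node.causes:
        leaves.extend(leaf_causes(cause))
    return leaves


# Hypothetical incident tree, for illustration only.
incident = Why("Feature did nothing at launch", [
    Why("Flag name was mistyped", [
        Why("Flag was changed the day before handoff"),
        Why("Unknown flag names are silently ignored"),
    ]),
    Why("Testing signed off on a nonfunctional tool", [
        Why("Test plan did not check the feature's required behavior"),
    ]),
])

for leaf in leaf_causes(incident):
    print("-", leaf)
```

Each printed leaf is a separate candidate fix, which is the difference from a linear list that ends in one single root cause.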


> I'd also have kicked back a doc that was mostly about blaming the other team.

Agreed. If the test team messed up, then you need to answer the "why" your team didn't verify that the testing had actually been done. (And also why the team hadn't verified that the tool they'd sent to testing was even minimally functional to begin with.)

Five whys are necessarily scoped to a set of people responsible. For things that happen outside that scope, the whys become about selection and verification.


Quis turmas probationum examinat?

Validating that the build you produced works at all should be done by you, but there's also a whole team whose job it was to validate it; would you advocate for another team to test the testing team's tests?

And more to the point, how do you write a 5 why's that explains how you'd typo'd a flag to turn a feature on, and another team validated that the feature worked?


> how do you write a 5 why's that explains how you'd typo'd a flag

Seriously? Even without knowing any context, there’s a handful of universal best practices that had to Swiss cheese fail for this to even get handed off to devtest…

- Why are you adding/changing feature flags the day before handoff? Is there a process for development freeze before handoff, e.g. only showstopper changes are made after freeze? Yes, but sales asked for it so they could demo at a conference. Why don’t we have a special build/deployment pipeline for experimental features that our sales / marketing engineers are asking for?

- Was it tested by the developer before pushing? Yes - why did it succeed at that point and fail in prod? Environment was different. Why do we not have a dev environment that matches prod? Money? Time? Politics?

- Was it code reviewed? Did it get an actual review, or rubber stamped? Reviewed, but skimmed important parts only — Why was it not reviewed more carefully? Not enough time — why is there not enough time to do code reviews? Oh, the feature flag name used underscore instead of hyphen — why did this not get flagged by style checker? Oh, so there’s no clear style conventions for feature flags and each team does their own thing…? Interesting…

Etc etc.
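
The style-checker question in the last bullet could be answered with something quite small. A minimal sketch, assuming a hypothetical convention of lowercase words joined by hyphens and a hypothetical registry of known flag names; neither comes from the incident being discussed:

```python
import re

# Assumed convention: lowercase alphanumeric words separated by single hyphens.
FLAG_NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def check_flags(requested: list[str], known: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means nothing was found."""
    problems = []
    for name in requested:
        if not FLAG_NAME_PATTERN.fullmatch(name):
            problems.append(f"{name}: does not match naming convention")
        elif name not in known:
            problems.append(f"{name}: unknown flag (typo?)")
    return problems


# Hypothetical flag names, for illustration only.
KNOWN_FLAGS = {"enable-new-checkout", "dark-mode"}

for problem in check_flags(["enable_new_checkout", "dark-mod", "dark-mode"], KNOWN_FLAGS):
    print(problem)
```

Run as a pre-handoff check, this would reject both the underscore variant and the misspelled name instead of silently ignoring them.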


Curious, do your 5 why's actually look like this, kind of stream-of-consciousness? Because I love this! Our org's 5 why's are always a linear 5 steps back that end at THE ROOT CAUSE. And those are the good ones. Others are just a list of five things that happened before or during the incident.

I've always pushed to get rid of this section of the postmortem template, or rename it, or something, because framing everything into five linear steps is never accurate, and having it be part of the template robs us of any deeper analysis or discussion because that would be redundant. But, it's hard to win against tradition and "best practices".


Just saying, once you find out the testing team is unreliable, you make sure there's a form of evidence it actually got tested, that someone on your team reviews before signing off.

It doesn't take a whole team. There are lots of mechanisms to produce that evidence. This is just how it works. If two checks aren't sufficient, it becomes three. Or four. Until problems stop making it through.


> Just saying, once you find out the testing team is unreliable, you make sure there's a form of evidence it actually got tested

Once you find out the heart surgeon shows up drunk to the operating room, you make sure there is an additional nurse there to hold his arm steady.


:P I mean, obviously assuming you don't have the choice of changing your testing team. But even if you do, what if they're worse?


I... with the evocative scenario... would choose another remedy, rather than have a nurse steady the drunken surgeon's arm.


I think that's the point. If you have an incompetent team or team member the number of checks around them can grow astronomically and still you will have problems. At a certain point the systemic problem can become "the system is unwilling to replace this person/team with a competent one".

(That said, this is only in the case of persistent problems. Everyone can be inattentive some of the time, and a sensible quality system can be very helpful here. It's when the system tries to be a replacement for actually knowing what you're doing that things go off the rails)


In my experience, you can weaponize processes like the Five Whys or the Amazon Leadership Principles. I don’t think that means they don’t have any value.

That being said, in this case, I agree with your manager. Both the QA team and your team had fundamental problems.

Your team (I assume) verified the functionality which included X set of changes, and then your team made Y more changes (the flag) which were not verified. Ideally, once the functionality is verified, no more changes would be permitted in order to prevent this type of problem.

The fundamental problems on the QA team were…extensive.


It sounds like there’s another failure here, which you could have documented. If the test team didn’t understand what they were meant to test, that’s a failure of communication. Simply saying “they were wrong” is not sufficient exploration of the failure so, if that’s the point your manager was making, I agree with them. Blaming a third party for misunderstanding is less useful than seeking to improve the clarity of your own communication.


A very relatable experience. There's a lot of pressure to stop the Whys at the dev team and not question larger leadership or organizational moves.


5 why's can be very political. You can make it take whatever direction you want to tell whatever story you want. I don't get why it's cargo culted the way it is.


No, people can be very political. It doesn't matter what the process is.

Hell, people even legislated the value of PI that one time.


While that might be true, the five whys is notorious for slipping into a destructive "you/I suck and firing you/I solves the problem for good and I believe it makes everyone absolutely happy" style of false conclusions.

Reportedly Toyota has organizational mitigations for that problem or reportedly the working culture there isn't so great after all. The bottom line is, it's a double edged sword to say the very least.


While it's clear that your testing team should have their own 5 Ys as well, I think it's reasonable for the manager of your team to ask you the question: how do we prevent this in the future? The unfortunate reality of large companies is that sometimes the quality and behaviour of other teams is (to some extent/on some time horizon) out of your control, and so the question for any given team lead is often "what can I do differently". It does seem that in this case there probably was some mechanism by which your team could have had its own internal testing process, one that needn't replicate the full responsibilities of the testing team but could at least have caught this issue.


Interesting one.

My first thought is: why is rolling out a new system to prod that is not used yet an incident? I don't think "being in prod" is sufficient. There should be tiers of service, and a brand new service should not be on a tier where it having teething issues is an incident.

> what part of our process led to a multiweek test effort that would greenlight a tool that does nothing that it is required to do

Would be interested to see the doc, but I imagine you'd branch off the causes. One branch of the tree is: UAT didn't pick up the bug. Why didn't UAT pick up the bug? .... (you'd need that team's help).

I think that team would have something that is a contributing cause. You shouldn't rely on UAT to pick up a bug in a released product. However, just because it is not a root cause doesn't mean it shouldn't be addressed. Today's contributing cause can be tomorrow's root cause!

So yeah, you don't blame another team, but you also don't shield another team from one of their systems needing attention! The wording matters a lot though.

The way you worded the question seems a little loaded. But you may be paraphrasing? 5 whys are usually more like "Why did the papaya team not detect the bug before deployment?"

Whereas

> what part of our process led to a multiweek test effort that would greenlight a tool that does nothing that it is required to do

Is more emotive. Sounds like a cross-examiner's question, which isn't the vibe you'd want to go for. 5 whys should be 5 drys. Nothing spicy!


It was an incident because it was important to leadership. It was a marketing targeting feature that was advertised to the local executive with some excitement by the management, so they were excited to share the results of it, and when there weren't results on the anticipated launch date, they wanted answers, which meant the manager treated it as an incident.


Oh geez! That is a very bad (almost pointy-haired) use of an incident process.


That's how we do it - there are "branches" to most of our RCAs, and in fact, we have separate sections for root cause analysis (things which directly or indirectly contributed to the incident, which are a branched / fractal 5 whys) and lessons learned (things which did not necessarily contribute to the incident but which upon reflection we can do better - frequently incident management or communication or reporting or escalation etc).

It took a while for all the teams to embrace the RCA process without fear and finger pointing, but now that it's trusted and accepted, the problem management stream / RCA process is probably the healthiest / best viewed of our streams and processes :-)


The way I handle this with my teams: any bugs caught by the QA team go against the developers. Any bugs caught after QA green lights the go live go against the QA team. (Of course, discounting any bugs that are deemed acceptable for go live by the PM).


A general trick in any project management is to try to arrange for the work to be done by the group that has the most influence over it.

Just as you should never take responsibility over something you are given no power over, you should move responsibility to where the power is (and if they won't take responsibility, you start carving out the edges of their power and hand it over to adjacent groups who will).

I learned pretty early how to hack the 5 Why's in order to make sure something actionable but neither trivial nor overwhelming gets chosen as the response. And I often do it early enough in the meeting that I'm difficult to catch doing it.

If I don't get invited I will sometimes crash the party, especially if the last analysis resulted only in performative work and no real progress. You get one, maybe two, and then I'm up in your business because mistakes don't mean you're stupid, but learning nothing from them does.


> very fundamental problem.

...5 billion years ago, the Earth coalesced from the dust cloud around the Sun...


The next org you went to, did they also use the Five Whys or did they get by with Four True Colors instead?


"Apple TV is available on Apple TV" suggests that Google's naming guy must've left for Apple.


You are technically correct! Alphabets (ABC) are for phonemes, syllabaries are for syllables (hiragana, Cherokee), logograms are for words (Chinese characters). Of course, some writing systems are very much hybrids (like Chinese, or hieroglyphics, or even the humble ampersand).


To be sure, 'logogram' is an even worse word than 'ideograph' for what the Chinese writing system is.

