genderphasing

in search of perfection

you can't build perfect technology. what did you do?

mood: hopeful software rambling

here's the short answer: prepare to make mistakes. don't be embarrassed or scared of them, don't shame people who make them, just figure out how to prevent them – or at least detect them sooner – next time. use every tool you have: policy, processes, and technology.

how do you actually do that, though? what steps can you take?

luckily for me, it's much easier said than done. so here's it being said:

  1. define success, failure, and incidents. success is when your system is working as it should, failure is when it doesn't provide what it should, and an incident is any failure of a control, even when the system itself doesn't fail.
  2. prevent failures, by establishing preventative controls – technology and processes to proactively handle situations before a failure occurs.
  3. recover from failures, by establishing reactive controls – technology and processes to assess the breadth of a failure, remediate its specific effects, e.g. database corruption or data loss, and streamline a return to full functionality.
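to make the preventative/reactive split concrete, here's a toy sketch in python – the `Store`, its checksum scheme, and the in-memory "backup" are all invented for illustration. validation before the write is a preventative control; checksum verification plus restore on read is a reactive one.

```python
import hashlib
import json

def checksum(record: dict) -> str:
    """Hash a record so corruption can be detected later."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

class Store:
    def __init__(self):
        self.data = {}     # live records
        self.sums = {}     # checksums recorded at write time
        self.backup = {}   # known-good copies for recovery

    def write(self, key: str, record: dict) -> None:
        # preventative control: reject bad input before a failure can occur
        if "id" not in record:
            raise ValueError("record missing required 'id' field")
        self.backup[key] = dict(record)
        self.data[key] = record
        self.sums[key] = checksum(record)

    def read(self, key: str) -> dict:
        record = self.data[key]
        # reactive control: detect the failure, then recover from it
        if checksum(record) != self.sums[key]:
            record = dict(self.backup[key])
            self.data[key] = record
        return record
```

neither control makes the system perfect – the point is that the write path fails early and loudly, and the read path notices and recovers instead of silently serving corrupt data.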

i could go over this in exhaustive detail, but i would just be parroting other people's work worse. instead, here are a few recommended resources to learn more:

that said, i do want to talk about three crucial aspects of reliability engineering.

how to investigate errors

the first and most critical step of an investigation is not blaming anyone. you're not looking for a scapegoat, you're analyzing a failure for root causes to address. if you're looking for someone to blame, you'll find them, and then you'll stop looking before you reach the actual cause.

with that said, "negligence" can be a root cause, as can "malice". but so can inadequate training, vague instructions, insufficient documentation, etc. – never stop at "human error" when you can investigate why that human erred.

and that's ultimately the trick to root cause analysis: keep asking why. when you find an answer, ask why it happened. go back until you find a root cause – a conflict between implementation and intent.

the rule of thumb you'll often hear is "five whys", i.e. ask why the accident occurred, then why the cause happened, and so on until you've gone five "whys" deep in the chain. but don't just ask "why" five times mechanically! sometimes more than one thing causes an issue, and you should follow every branch down, not just the first one you find.

that said, "five whys" is only a rule of thumb. it's great for people who aren't familiar with root cause analysis, because it gives a relatively objective guideline. but as you (or your organization) gain more experience, and in particular as you implement fixes and re-investigate recurring errors, you should learn what a root cause really looks like for you.
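one way to see the "more than one cause" point: a causal chain is really a tree, and a thorough investigation walks every branch until each one bottoms out. here's a toy sketch in python, with an entirely invented incident:

```python
# each event maps to the list of deeper causes behind it;
# an empty list means we've hit a candidate root cause
causes = {
    "site outage": ["database ran out of disk"],
    "database ran out of disk": ["log rotation disabled", "no disk-usage alert"],
    "log rotation disabled": ["deploy script overwrote cron config"],
    "no disk-usage alert": ["alerting runbook never written"],
    "deploy script overwrote cron config": [],
    "alerting runbook never written": [],
}

def root_causes(event: str) -> list[str]:
    """Walk every 'why' branch, not just the first one."""
    deeper = causes.get(event, [])
    if not deeper:
        return [event]
    roots = []
    for cause in deeper:
        roots.extend(root_causes(cause))
    return roots
```

asking "why" only once per level would surface the disabled log rotation but miss the missing alert entirely – and the next disk-space incident would still go unnoticed until it took the site down.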

treat your processes like systems too

inevitably, you will have regressions: failures that happen even after you thought you fixed them. obviously, you need to investigate the second failure. but it's equally important that you look at why you didn't get it right the first time!

what you'll find taught about systems thinking or root-cause analysis or reliability engineering is general guidance, broadly applicable but unspecific as a result. what you need to do is apply that to build concrete, organization- and system-specific processes. a key part of doing that is recognizing where the general guidance doesn't work, whether because it's actually wrong or just not specific enough.

both the ntsb and csb are intimately familiar with this: notice how they don't put out recommendations to "consider possible unexpected sources of oxygen", they recommend that the american petroleum institute "develop a publicly available technical publication for the safe operation of fluid catalytic cracking units".

ultimately, your processes to prevent failures in systems are, themselves, systems, and need to be treated accordingly.

don't be scared to change

first, even if all the circumstances are totally static, your system is almost certainly imperfect. you'll need to make changes to address its issues, especially if you're just starting to apply reliability principles.

but second, circumstances are almost never totally static! it's not enough to achieve a perfect system once; as other systems change around yours, your system will need to be updated to match, even if its purpose is the same.

this is the point of "change management", and yes, i know that's a dirty word in tech. who wants to spend time in meetings justifying every little tweak? that just slows things down, and change management boards don't even understand the systems they control, so they don't actually prevent errors.

but… have you noticed something? change control is a system too. if it's not working, then investigate why and fix it! i've talked about this before, but i'm going to keep talking about it until everyone in tech knows, which is to say i'm going to talk about it until my inevitable death.
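to drive home that change control is a system with tunable parameters, here's a toy gate in python – the risk levels, fields, and approval counts are all invented. if a gate like this blocks harmless changes or waves through risky ones, that's a failure of the change-control system itself, and you investigate and fix it like any other.

```python
from dataclasses import dataclass, field

@dataclass
class Change:
    description: str
    risk: str                 # "low", "medium", or "high" - an invented scale
    approvals: list = field(default_factory=list)
    has_rollback_plan: bool = False

# how many approvals each risk level needs; tune this like any other
# system parameter when it causes more friction than it prevents failures
REQUIRED_APPROVALS = {"low": 0, "medium": 1, "high": 2}

def may_deploy(change: Change) -> bool:
    """A change ships only if it meets the controls for its risk level."""
    if change.risk != "low" and not change.has_rollback_plan:
        return False
    return len(change.approvals) >= REQUIRED_APPROVALS[change.risk]
```

the numbers here are policy, not physics: if your incident reviews keep finding that low-risk typo fixes get stuck while risky migrations sail through, that's data telling you to adjust the thresholds.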

conclusion

ultimately, the point here is to inspire you to take control of your systems. don't manually work around broken technology, don't yell at people for inevitabilities – take action to fix things and make the tech work for you.

a lot of tech has a justified reputation for being bad. a lot of computerization just ends up making jobs worse. the point of advanced technology is to enable people to do more complex work – and you can't do that if the tech is unreliable. so fix it, and keep it fixed.

good luck!