Apr 3, 2024
A series of observations on choices that make small systems more successful. Largely a restatement of “How Complex Systems Fail”, through the lens of success rather than the lens of failure.
It has to work.
I don’t mean this lightly. Someone - you, or another person - is trying to solve a problem, and the service they’re using and relying on needs to actually solve that problem in some meaningful way.
It doesn’t have to be perfect.
An adequate solution to the end user’s needs that is also tractable to operate and use beats a perfect solution that’s hard to work on, every time. “Good enough” makes the world go ’round.
It does have to be predictable.
Services will fail to deliver on their promises. It’s a fact of operational life, whether you’re a hobbyist or a huge organization with billions on the line - no amount of money or humanpower will stop this from happening at least some of the time.
A service whose behaviour on failure is predictable will be dramatically easier to operate, and, if necessary, dramatically easier to replace.
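One way to read "predictable" in code, as a minimal sketch rather than a prescription: bound how long the service keeps trying, and collapse every underlying error into a single documented failure that operators and callers can plan around. The names call_predictably and ServiceUnavailable, and the default retry settings, are hypothetical and chosen only for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ServiceUnavailable(Exception):
    """Hypothetical: the one failure callers are told to expect."""


def call_predictably(
    operation: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 0.1,
) -> T:
    """Run operation at most `attempts` times, then fail in one known way."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:  # deliberately broad: every error maps to one outcome
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay_seconds)  # fixed pause, so worst-case time is bounded
    # The original error stays attached as __cause__ for whoever debugs this later.
    raise ServiceUnavailable(f"gave up after {attempts} attempts") from last_error
```

With these settings the worst case is three calls and two short pauses, after which the caller sees ServiceUnavailable and nothing else. The specific numbers matter less than the fact that they are fixed and written down.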
It can’t sabotage itself.
This is a surprisingly common technical and design failure mode! Tools can sabotage their own success in lots of ways. The one I encounter most often is writing data - logs, operational data, and so on - to finite data stores such as hard drives, without any consideration for how to either limit or predict the growth of that data. This extends to any other resource that may be exhausted, including operators’ time and money, network bandwidth, and even attention.
Think through how your system will behave if it’s installed on a computer with a fixed amount of storage, and then left unattended for a few years by its operator while others use it. Will it fail on its own? If so, fix that.
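As a rough illustration of putting a ceiling on log growth, here is a sketch using RotatingFileHandler from Python's standard logging.handlers module. The cap sizes, the make_bounded_logger helper, and the file naming are assumptions made for the example, not recommendations.

```python
import logging
import logging.handlers
import os
import tempfile

MAX_BYTES = 64 * 1024  # hypothetical cap for each log file
BACKUP_COUNT = 3       # hypothetical number of rotated files to keep


def make_bounded_logger(log_dir: str, name: str = "service") -> logging.Logger:
    """Return a logger whose files stay under about MAX_BYTES * (BACKUP_COUNT + 1)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep output in our bounded files only
    handler = logging.handlers.RotatingFileHandler(
        os.path.join(log_dir, f"{name}.log"),
        maxBytes=MAX_BYTES,        # roll over before a file would reach this size
        backupCount=BACKUP_COUNT,  # the oldest rotated file is deleted on rollover
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def disk_usage(log_dir: str) -> int:
    """Total bytes used by every file in log_dir."""
    return sum(
        os.path.getsize(os.path.join(log_dir, entry)) for entry in os.listdir(log_dir)
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as log_dir:
        log = make_bounded_logger(log_dir)
        for i in range(20_000):
            log.info("request %d handled", i)
        for handler in log.handlers:
            handler.close()  # release the file before the directory is removed
        print(disk_usage(log_dir) <= MAX_BYTES * (BACKUP_COUNT + 1))
```

The trade-off is explicit: old log lines are discarded instead of the disk filling up. Whether that is the right trade depends on the service, but it is a decision made up front rather than one discovered a few years in.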
Secondary service dependencies have to be looked at as risks.
This means things like requiring an SQL database, or requiring a third-party email delivery system. It also means things like container orchestration services (a very high-risk dependency) and the computer(s) the service runs on. Every dependency increases the operational risk in some way. Those dependencies may be unavailable, they may be misconfigured, they may be too complex for the operators you have in mind, and so on. Third-party dependencies are particularly risky, because they also entail exposure to that organization’s or that person’s own business and organizational needs, which may be at odds with your operators’.
It’s almost always worth taking the time to consider what it would take to eliminate that dependency, especially if you expect others to operate the service in your absence. You might still opt to keep the dependency, but it shouldn’t be your default choice.
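To make that concrete with one possible case: a small service that needs SQL storage could, assuming its data and concurrency needs are modest, keep its data in a single file through Python's built-in sqlite3 module instead of requiring a separately operated database server. The notes table and the open_store, add_note, and list_notes helpers below are hypothetical, and SQLite is still a dependency, just one that ships with the interpreter.

```python
import sqlite3
from typing import List


def open_store(path: str = "service.db") -> sqlite3.Connection:
    """Open the single-file store, creating the schema on first use."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes ("
        "id INTEGER PRIMARY KEY, body TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def add_note(conn: sqlite3.Connection, body: str) -> int:
    """Insert one note and return its row id."""
    cursor = conn.execute("INSERT INTO notes (body) VALUES (?)", (body,))
    conn.commit()
    return cursor.lastrowid


def list_notes(conn: sqlite3.Connection) -> List[str]:
    """Return note bodies in insertion order."""
    return [row[0] for row in conn.execute("SELECT body FROM notes ORDER BY id")]
```

An operator then has one file to back up and no second service to install, configure, or keep running. If the service later outgrows this, the dependency can be reintroduced deliberately.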