After more than two years at Imgur, I’ve had to learn a lot about the principles behind writing highly available (but not AP, in the CAP-theorem sense) fault-resilient systems. Systems do occasionally go down, but the times I’m most thankful for good design principles are the mornings when I come in to work and realize that overnight a safeguard we put in place triggered automatically, or the system caught an error and recovered on its own. Here are a few of the principles I’ve found most useful:
- Put limits on everything
- Retry, but with exponential back-off
- Use supervisors and watchdog processes
- Add health checks, and use them to re-route requests or automate rollbacks
- Redundancy is more than just a nice-to-have; it’s a requirement
- Prefer battle-tested tools over the “new hotness”
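
To make the first two items concrete, here is a minimal sketch of a retry loop with exponential back-off in Python. It is an illustration, not code we run at Imgur: the function name `retry_with_backoff`, its parameters, and the default values are all hypothetical, and the random jitter is my own addition to the basic idea. Note that the limits (`max_attempts`, `max_delay`) matter as much as the back-off itself.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                # Limit reached: give up and surface the last error.
                raise
            # The delay ceiling doubles after each failure, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter (an assumption, not part of plain back-off): sleep a
            # random amount up to the ceiling so clients do not retry in sync.
            sleep(random.uniform(0, delay))
```

Passing `sleep` in as a parameter is only there so the loop can be exercised without actually waiting; in real use you would likely also catch a narrower exception type than `Exception`, so that only errors known to be transient are retried.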