Lessons learned writing highly available code

After more than two years at Imgur, I’ve had to learn a lot about the principles behind writing highly-available (but not AP) fault-resilient systems. While occasionally some systems go down, it’s the times that I wake up in the morning and come in to work only to realize that overnight, a safeguard we put in place automatically triggered, or the system caught an error and successfully recovered, that I am thankful for some good design principles. Here are a few of those things I’ve noticed in particular:

Put limits on everything
Retry, but with exponential back-off
Use supervisors and watchdog processes
Add health checks, and use them to re-route requests or automate rollbacks
Redundancy is more than just nice-to-have, it’s a requirement
Prefer battle-tested tools over the “new hotness”

Pages

Distributed Applications Technology

Lessons learned writing highly available code

Leave a Reply

Notice

Sponsors