Published: June 18, 2021

In honor of someone’s bad bug today, I will retell the story of my worst bug: Once upon a time I was the CEO and entire engineering team of a company that sent appointment reminders. Each reminder was sent by a cron job draining a queue. That queue was filled by another cron job.

Reminders could fail, but the queue-draining job had always been bulletproof; it had never failed to execute or taken more than a few seconds to complete. It ran every 5 minutes. So I had never noticed the queue *filling* job wasn’t idempotent.

Idempotent is a $10 word for a simple concept: an operation where you get the same result no matter how many times you run it. Adding 2 + 2 is idempotent. Creating a new record in your database may not be; the number of rows in the DB goes up each time.
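A minimal sketch of the idea, using an in-memory SQLite table as a stand-in queue (the schema and names here are illustrative, not the original system's code). The primary key plus `INSERT OR IGNORE` is what makes re-enqueueing a no-op:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# One row per pending reminder; the primary key makes duplicates impossible.
db.execute("CREATE TABLE queue (reminder_id INTEGER PRIMARY KEY)")

def enqueue(reminder_id):
    # Idempotent: running this any number of times leaves exactly one row.
    # A plain INSERT here would add a duplicate row on every run instead.
    db.execute("INSERT OR IGNORE INTO queue VALUES (?)", (reminder_id,))

# Simulate the filler cron firing every 5 minutes for 13 hours:
for _ in range(13 * (60 // 5)):
    enqueue(437)

print(db.execute("SELECT COUNT(*) FROM queue").fetchone()[0])  # 1, not 156
```

The same effect can be had in most databases with a unique constraint on a natural "dedup key" (here, the reminder's identity) plus an insert that ignores conflicts.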

One day, for the first time ever, the queue-draining job broke and could not be restarted. This was the result of a trivial code push I had made to an unrelated part of the code base late in the day, prior to an apartment move. Of course, being responsible, I was paged immediately.

However, that was back during those pre-iPhone years when my cell phone was “a useful tool” rather than “an extension of my brain”, and like many useful tools it ended up in a box on the moving truck, trilling merrily for 13 hours.

Later that night, while unboxing things, I got the page, realized there were thousands of undrained events in the queue, and panicked. So I reverted the bugged deploy and restarted the queue workers. Queue quickly drained. Crisis averted, right?

At 2 AM I woke from a nightmare caused by systems-engineer spidey sense. “Wait wait wait, there were THOUSANDS of events on the queue? Shouldn’t it have been a couple dozen at that hour?” And then I realized with dawning horror what I had done.

In the 13 hours the queue had been broken, cronjob #2 had been dutifully asking “Have we called Client of Customer #437 about their Friday appointment?”, gotten a no from the DB, and then dutifully queued up a call. Every five minutes. Resulting in 13 * (60 / 5) = 156 queued calls.

We did not spam the heck out of one client’s inbox. Oh no. We spammed the heck out of every client, of every customer, who had an appointment that day. Worse, a key feature of our service was that it didn’t just email; it would escalate to SMS and then phone.

And since the blissfully ignorant queued calls thought “OMG I am so late better urgently tell the person about their appointment” most of them chose to escalate to a call. Now there is a word for what happens when many independent systems simultaneously try to restart.

We call it a “thundering herd.” It routinely brings down systems built for massive scalability like web tiers, APIs, databases, etc. You know what is not designed for massive scalability? A residential plain old telephone.
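The standard software-side defense against a thundering herd (not something a residential phone line can deploy, and not part of the story's system) is randomized "jittered" backoff, so independent retries spread out in time instead of arriving as one herd. A minimal sketch:

```python
import random

def backoff_with_jitter(attempt, base=1.0, cap=300.0):
    """'Full jitter' backoff: sleep a uniformly random delay in
    [0, min(cap, base * 2**attempt)] before retrying, so clients
    that all failed at the same instant don't all retry at once."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))

# Ten clients that failed simultaneously now retry at scattered times:
delays = [backoff_with_jitter(attempt=3) for _ in range(10)]
```

Web tiers and databases survive herds partly by capacity and partly by exactly this kind of desynchronization; a phone has neither.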

Plausibly you might not even know what happens if, say, 50 people all call you at once. I’ll tell you. They keep ringing indefinitely while you have a conversation and hang up, at which point your phone will immediately start ringing again. You can repro w/ 50 patient friends.

Computers are very, very patient, and so they dutifully lined up and let the phone ring until it was answered or went to answering machine, 50 times consecutively. Or, as more commonly happened, the immensely frustrated person physically disconnected their phone line.

Now this would have been bad enough. But what did those 50 calls start with? “This is Dr. Smith’s office. He’d like to remind you that...” So you can imagine who got the call an hour later. From every patient they were supposed to see that day. Then they sent me an email.

And so it was that at 2 AM local time I had a long stack of very irate emails from literally every client of my company and no working internet to answer them from (new apartment).

I still had the keys to old apartment, which had working Internet. Problem: across town, no way to call taxi because small town Japan and no functioning phone. Solution: pack up laptop, landline, and heater into backpack, walk across town to old apartment. In freezing rain.

And so at 3 AM, wet and shivering in a room with no light, I started making apology calls. After the first two I broke down and called my dad, convinced I had just bankrupted my company. He talked me down. I worked through the night on apologies.

We ended up losing exactly two accounts. One reactivated the next day, impressed that they rated “a personal apology from the CEO.” “Everyone makes mistakes. Go easy on your engineer.” That’s good advice.

I fixed the idempotency issue and added MANY safeguards. We never unintentionally duplicated a call again for the life of the company. Nobody remembers this now except me, and engineers who I tell it to, when they think that they’ve just made the worst mistake of their career.
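The thread doesn't say what those safeguards were; one common belt-and-suspenders pattern is a hard per-recipient cap enforced at send time, so even a queue that has been (incorrectly) filled with duplicates can't thunder. A hypothetical sketch, with an assumed cap:

```python
from collections import Counter

MAX_CONTACTS_PER_DAY = 3  # hypothetical cap, not the product's real number
attempts_today = Counter()

def guarded_send(recipient, place_call):
    """Refuse to contact the same recipient past the daily cap,
    regardless of how many duplicate entries are on the queue."""
    if attempts_today[recipient] >= MAX_CONTACTS_PER_DAY:
        return False  # drop the duplicate instead of dialing
    attempts_today[recipient] += 1
    place_call(recipient)
    return True

# 156 duplicate queue entries for one recipient place only 3 calls:
calls = []
for _ in range(13 * (60 // 5)):
    guarded_send("client-of-customer-437", calls.append)
print(len(calls))  # 3
```

The idempotent enqueue prevents the duplicates; the send-time cap limits the blast radius if anything else ever goes wrong upstream.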

@patio11 Don't deploy on Friday or moving day!

@patio11 How long were you sitting on this thread waiting for a publicly observable bug to happen worthy of posting this? :)

@patio11 I've heard this story before, but I don't recall the part where the value of a reminder was set to $0 and millions of dollars were accidentally locked.

@patio11 Something on a much smaller scale happened to me too. I think every engineer goes through this once in their lifetime.

@patio11 An early Inktomi bug: our new fast web crawler DoS'd http://Crutchfield.com accidentally for several hours. We bought a bunch of stereo stuff from them out of guilt (and I still buy from them)

@patio11 @ben11kehoe Thank you so much for sharing this. It always makes me relax when I read how the world of software is just a lot of duct tape and glue. And that it works *most* of the time.

@patio11 Was it the time a change accidentally moved a few million dollars? Was it the time I found out gmail's maximum rate limit via a faulty alert? It wasn't the time a misconfigured app left half a day's worth of messages unread. Idk what my worst bug was. Need more discovery.

@patio11 This reminds me of the time we were building the video compression tech for Command & Conquer @ Westwood Studios, we created this distributed compressor that would farm out each frame across all the team desktops that were idle.

@patio11 I'll share. As a fresh out of university programmer working for a company that was huge on open source, I was tasked with replacing our fax platform with something open source.

@patio11 I dodged a massive bullet just yesterday. Latent bug in the largest single change we've been working on. Wiped 270k emails, but thankfully only on a server which just had staff users. If we hadn't accidentally triggered it then it could have happened a month to customers.

@patio11 @readwiseio save thread

@patio11 Omg I was laughing so hard by the end of this, thank you for sharing

@patio11 Was sweating just reading this. Thanks for sharing.

@patio11 This whole thread is incredible

@patio11 Something similar happened to me. Once I sent ~5000 emails to the same inbox, a client’s client, a major bank in Italy. The issue was in the retry logic in a queue.

@patio11 @gautamrege We have faced a similar multiple emails to clients scenario due to improper use of SQS => Lambda => Timeout configuration. Severity was not as high though, and we were able to fix it before the fourth email reaching everyone’s inbox!

@patio11 It reminded me of one too. Also related to emptying a queue - although in this case some team members had moved a staging queue into use for - you guessed it!
