Take down production? Get promoted.
Writing software = writing bugs. How you manage the crisis can earn your next promo.
Welcome, Avatar!
First off, a special thank you to the readers who pay to make this newsletter possible.
Your kind words, success stories, and support go a long way.
– Fullstack
Me: “You get any weird messages this morning?”
Team: “Wait, you got a weird message too?”
Me: “Yeah, I think Bumbo service is just glitching.”
Team: “I just got another.”
Me: “Oh shit.”
I’d shipped the odd small bug in production before, but this one was shaping up to be bad.
Millions of customers impacted. Reputational damage to the company.
And it had been live for 12 hours. I had shipped to prod at the end of the day and gone to sleep.
My code had been spewing bugs all night and we didn’t notice until we encountered the bug ourselves the next day.
I had a SEV1 on my hands. And not just one I had discovered, but one I had caused.
I managed to reframe this into promo evidence that helped me get a 6-figure promotion a few months later.
Maybe I’m getting ahead of myself. Let’s go back to the beginning.
Today we cover the basics of SEV (high-severity incident) management.
And how to turn the inevitable bugs you will ship into evidence for your next promo packet.
Writing Software = Writing Bugs
If you’ve written software for any length of time, you know this to be true.
Except for the occasional 10x or 100x engineer savant who can maybe spit out perfect code, writing software is an iterative process.
Write. Test or Deploy. Fix. Repeat.
Yes, TDD cultists, this is a simplification that doesn’t separate testing from deploying, but the point remains.
Whether through tests or deploying your code to customers, the rubber will hit the road and you will see if your software is up for the task.
Maybe the software is incorrect. An off-by-one error. An infinite loop. A control flow bug. You misunderstood the product requirements and coded incorrect business logic.
Maybe the software can’t scale. Missing database indexes. Inefficient queries. Poor data modeling and table design.
Maybe the infrastructure can’t scale. Your service is under-provisioned. Your database is under-provisioned. Your proxy is under-provisioned. You ran out of AWS credits.
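To make one of these concrete: a missing index is invisible on a small test table and painful on a big one. Below is a minimal sketch using Python's built-in sqlite3 module. The users table, the idx_users_email index, and the query_plan helper are made up for illustration; your database and schema will differ.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable description is the last column of each row.
    return " | ".join(row[-1] for row in rows)


# Hypothetical schema: an in-memory table of 1,000 users.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

lookup = "SELECT name FROM users WHERE email = ?"
args = ("user500@example.com",)

# Without an index on email, SQLite has to scan the whole table.
plan_before = query_plan(conn, lookup, args)

# With the index, the same query becomes an index lookup.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(conn, lookup, args)

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan varies by SQLite version, but the first should describe a scan of users and the second should mention idx_users_email. At 1,000 rows you will never feel the difference. At 100 million, it is the kind of thing that pages you at 3am.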
Once you add growth (users, data…) to your software, what worked perfectly in your test suite may soon feel like the wheels are falling off.
While good practices (testing, CI/CD, staging environment QA, chaos monkey…) can reduce how many bugs land in production, the count only approaches zero. It never gets there.
Zero bugs cannot be achieved at any normal level of developer velocity.
Writing Software = Writing Bugs.
The only question then is what to do when the bugs come for you.
Mission Control Mindset
Imagine being in Mission Control during the storied Apollo missions as portrayed by Hollywood. Or the SpaceX livestreams of recent years.
The buzz. The anticipation. The energy. The countdown. Then takeoff.
Engineers glued to their monitors, tracking progress. Managers cheering and watching the big screen.
If success, then euphoria. If signs of failure, then intense pursuit of a fix.
If you bring this mindset to resolving SEVs, it will not go unnoticed.
Customers are impacted, systems are unstable, it’s all hands on deck to get healthy.
Some engineers lack the sense of urgency, or don’t seem to act or express it. It comes off as apathy, unseriousness, even lack of care for customers or the company. Not a look that will help you get promoted.
Bring the zeal to get things fixed ASAP. If it is a system you own, there’s no excuse not to. Even if it’s not your system, if you choose to jump in and try and help, over enough SEVs you will become more capable and helpful and learn a ton.
In practice, this means:
investigate pages and reported issues quickly
don’t get defensive if the bug is in your code
page any relevant people and add them to the SEV Slack channel
focus the conversation and the participants in the SEV call on organized investigation, then mitigation
don’t be afraid to read source code, check dashboards and Sentry, or search Slack, Google, and StackOverflow to find the bug and the fix
even after a SEV is stable, keep searching until a root cause is found and fixed
follow up on all action items from the SEV Postmortem that could help prevent future SEVs
Build Your Reputation in the Trenches
Time and again, I’ve seen engineers’ performance during SEVs factor into their career growth and promotions.
In the best case, I’ve seen new engineers who eagerly jump into ongoing SEVs – initially observing but soon helping – get promoted years before those who prefer to watch from the sidelines.
A wise man once said: the future is for those who show up for it.
Want to have a big customer impact and build your internal reputation? Show up during SEVs.
Coordinate. Investigate. Fix. Be helpful. Be the hero. Don’t be in the way.
In the trenches, senior staff and managers will note your attitude and actions.
For better or for worse.
Weak Engineers Cause Downtime
I’ve also seen engineers who bring the opposite mindset. They:
ignore pages and alerts
get defensive
try to ignore bug reports
belittle customer impact
remain convinced it can’t be their code
deprioritize the SEV and are slow to join the SEV channel or call
resist any rollback or resolution which would add future work for them
never take responsibility or ownership of the bug they shipped
avoid any accountability during SEV Postmortem
drag their feet on action items that would prevent the SEV from recurring
A history of this soon becomes cemented as their reputation, and is hard to ever shake.
They will be known for their weakness.
Taking extreme ownership during SEVs is a sign of strength.
So, what is extreme ownership?