Derpibooru outage postmortem: "Sticks and stones may break my pones"

Arcaire

Systems / Ops Alumni
This is part of the brand new dunkeroni three cheese seismic slam developer openness series.
 
For more info about this series, click this link.
 
An issue that had plagued us in a couple of weird ways since the server migration cropped up again, and an attempt to fix it that would normally work just fine instead wrecked more things, right as the Friday peak time for booru activity came upon us.
 
 
The Long Section
 
This is probably pretty >>44238 for most people but it includes the technical explanation. You might want to skip to the tl;dr section below.
 
Between 10pm and 6am my time, all notifications on my phone are muted bar a very select few. Of those few, there’s only actually one I enjoy seeing light up my phone at that time.
 
At 1am, as I was turning on my alarm for the next morning, the notification that came through was not, in any way, one I enjoy seeing:
 
<byte[]> Arcaire: help, again
 
For context, the responsibilities of the booru development staff / tech team / tex mix / whatever you want to call us are somewhat split up depending on where each of us has the most expertise. While we can all do most things as well (e.g., I can also code and add features to the booru), it’s best to push for subject matter experts so that the best talent is hitting the right places. I’m the system administrator and security guy, making sure everything runs as smoothly as possible so that the pony may flow. Most of the time, my job is comparatively pretty easy.
 
It wasn’t running smoothly last night. Specifically, IPv6 routing was broken. Due to the configuration of our server, that meant the server couldn’t resolve domain names (which broke our remote image uploader). A generally innocuous command that would - 99 configurations out of 100 - work fine and fix the original problem instead added to it and broke a secondary network connection we have to our storage server, because we were the 1 configuration in which hard-resetting the network is a bad idea.
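
As an illustration of the symptom described above (names not resolving), here is a minimal Python sketch of a check that reports which hostnames fail to resolve. It is not the tooling we actually run. The function name `unresolvable_hosts` and the hostnames in the usage note are made up for the example, and the `resolver` parameter only exists so the check can be exercised without touching the network.

```python
import socket


def unresolvable_hosts(hosts, resolver=socket.getaddrinfo):
    """Return the subset of `hosts` whose name lookup fails.

    `resolver` defaults to socket.getaddrinfo; it is injectable purely
    so the function can be tested offline (an assumption of this sketch).
    """
    failed = []
    for host in hosts:
        try:
            # getaddrinfo raises socket.gaierror when the name cannot be resolved
            resolver(host, 443)
        except socket.gaierror:
            failed.append(host)
    return failed
```

Something like `unresolvable_hosts(["example.com", "example.org"])` returns an empty list when lookups are healthy, and the failing names when they are not.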
 
Most of you probably remember our new server, which featured a managed SAN accessible over a private network connection. That SAN is our storage server. iSCSI, the SAN protocol we’re using, isn’t a particularly fault-tolerant protocol on its own outside of restoring a network connection. Our setup is built for the resiliency of the data itself, and flexibility for extension, over connection resiliency, so we have another layer on top of this network storage that requires a bit of cuddling (or, often simpler, a server reboot) if things fail like that.
 
What followed was a rather interesting race to shut down the booru app before it totally desynchronised the database from the file storage server, leaving us with hundreds of missing images. You lot upload real quick, so any time something to do with site interactivity goes weird it’s always a sprint.
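
To make "desynchronised" concrete, here is a rough Python sketch of the kind of comparison that shows it: image IDs the database knows about versus image IDs that actually have a file on storage. This is an illustration only, not our real schema or tooling. The function name `find_desync` and both input lists are hypothetical.

```python
def find_desync(db_image_ids, stored_image_ids):
    """Compare IDs recorded in the database with IDs present on storage.

    Both arguments are plain iterables of integer IDs; how they are
    gathered (SQL query, directory listing) is left out of this sketch.
    """
    db_ids = set(db_image_ids)
    stored = set(stored_image_ids)
    return {
        # rows that exist in the database but have no file behind them
        "missing_files": sorted(db_ids - stored),
        # files on storage that no database row points at
        "orphan_files": sorted(stored - db_ids),
    }
```

If uploads keep landing in the database while the storage connection is down, the `missing_files` list is the one that grows.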
 
 
The tl;dr Section
 
Total booru degradation (i.e., some services not working) was about 1:45.
 
Total complete downtime was a further 0:45.
 
What we lost:  
Actually, nothing. We got lucky. We lost a couple of contiguous indexes in the database, so a few image IDs will show a Schrodinger’s Pony, but no actual data was lost. It’s not enough to rely on luck, however, so we’re changing a few things behind the scenes.
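
For the curious, the lost indexes are gaps in an otherwise contiguous run of image IDs. A small Python sketch of how such gaps can be listed follows. It assumes IDs are plain integers, and `missing_ids` is a name invented for this example rather than anything in the booru codebase.

```python
def missing_ids(ids):
    """Return the IDs absent between the smallest and largest ID seen."""
    present = set(ids)
    if not present:
        return []
    # walk the full range once and keep whatever is not present
    return [i for i in range(min(present), max(present) + 1) if i not in present]
```

For example, `missing_ids([1, 2, 5, 6])` returns `[3, 4]`.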
 
While I was waiting for things to come up and stabilise, I started writing some extensive documentation on the correct way to snuggle the systems we use so that they don’t break. We also modified a few configurations to add a bit more leniency for faults that may occur. The rest of the team will be adding their own expertise to fill in the documentation further so that the entire team has a reference for any significant portion of the booru systems.
 
So, hopefully that doesn’t happen again. :^)