Post-mortem: Database Cluster Crashes
by the OBS Team
posted on 3rd Aug 2023
Downtime on the afternoon of 3rd of August
On 3 August, a few hours after a large migration performed during that morning's maintenance window, we experienced multiple downtimes while recovering from database inconsistencies.
Date: 03.08.2023
Impact: Multiple downtimes throughout the day.
Root Causes: Our database cluster ran out of available space during a large schema/data migration (#14597 - Migrate the remaining database tables and columns to utf8mb4).
Trigger: Morning deployment and migration from utf8mb3 to utf8mb4.
Resolution: The tables were dumped and restored from scratch.
Detection: Our database admins got notified via their monitoring.
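A pre-flight check along the following lines could flag the kind of space exhaustion described under Root Causes before a migration starts. This is only a sketch in Python. It assumes the conversion rebuilds each table, so roughly one extra copy of the table may exist on disk while it runs. The function names, the safety factor of 2.0 and the data directory argument are all hypothetical and do not describe existing tooling.

```python
import shutil


def has_headroom(table_bytes: int, free_bytes: int, safety_factor: float = 2.0) -> bool:
    """Return True if free space covers the table size times a safety margin.

    Assumption: the migration rebuilds the table, so up to one extra copy
    (data plus indexes) may exist on disk while it runs. utf8mb4 can also
    need more bytes per character than utf8mb3, hence the extra margin.
    """
    return free_bytes >= table_bytes * safety_factor


def check_data_dir(data_dir: str, table_bytes: int, safety_factor: float = 2.0) -> bool:
    """Apply has_headroom to the filesystem holding data_dir (hypothetical path)."""
    free_bytes = shutil.disk_usage(data_dir).free
    return has_headroom(table_bytes, free_bytes, safety_factor)
```

The table size would have to come from the database itself, and the right safety factor depends on the storage engine and cluster setup, so both should be treated as inputs to tune rather than fixed values.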
Lessons Learned
What went well?
- We learned about the database crash soon after it happened.
What went wrong?
- Our current deployment process does not log the progress of migrations or inform us in real time about what is happening (improvement card).
- We did not communicate this migration to our database admins in advance to make them aware of potential fallout.
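The missing progress logging mentioned in the first point could look roughly like the Python sketch below. It is an illustration, not the deployment tooling used here: the run_steps helper, the logger name and the idea of passing each migration as a named callable are all assumptions made for the example.

```python
import logging
import time
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("migration")  # hypothetical logger name


def run_steps(steps: Iterable[Tuple[str, Callable[[], None]]]) -> List[str]:
    """Run named migration steps in order and log the progress of each one.

    Returns the names of the steps that finished. A failing step is logged
    with its traceback and the exception is re-raised, so later steps never
    run against a half-migrated database.
    """
    steps = list(steps)
    finished = []
    for index, (name, action) in enumerate(steps, start=1):
        logger.info("step %d/%d started: %s", index, len(steps), name)
        started = time.monotonic()
        try:
            action()
        except Exception:
            logger.exception("step %d/%d failed: %s", index, len(steps), name)
            raise
        elapsed = time.monotonic() - started
        logger.info("step %d/%d finished: %s (%.1fs)", index, len(steps), name, elapsed)
        finished.append(name)
    return finished
```

With logging.basicConfig(level=logging.INFO) set up beforehand, each step produces a start line and a finish line with its duration, which is the kind of real-time signal the deployment lacked.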
Where did we get lucky?
- Only four tables ended up being affected.
- Our database admins were around to help us get the database back to a usable state.
Timeline (CEST)
- 09:03 Started the deployment with the migration
- 09:26 Ended the deployment
- 09:31 First recorded error in the index of the project_log_entries table
- 13:37 Database cluster crashes
- 14:04 Build Service goes into downtime to export the project_log_entries table
- 14:14 Started project_log_entries table import
- 14:59 Build Service comes back from downtime
- 17:40 Database cluster crashes again
- 17:56 We learn about the binary_releases table index being broken
- 18:14 Started binary_releases table export without downtime
- 18:18 Started binary_releases table import
- 18:46 Finished import
- 18:50 We start performing CHECK TABLE on the rest of the tables in the database
- 19:08 We find out about the bs_request_actions table being broken and take Build Service down for maintenance
- 19:16 Build Service comes back up after all the tables went through CHECK TABLE
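
The last timeline steps, running CHECK TABLE over every remaining table, can be sketched in Python as follows. The sketch only builds the statements and interprets result rows. How the statements are executed is left out, and the helper names are hypothetical. It relies on CHECK TABLE returning rows with the columns Table, Op, Msg_type and Msg_text, and treats a final status other than "OK" or "Table is already up to date" as a problem, which is a simplification.

```python
from typing import Iterable, List, Tuple

# One row of CHECK TABLE output: (Table, Op, Msg_type, Msg_text).
Row = Tuple[str, str, str, str]

# Status messages treated as healthy in this sketch (a simplification).
HEALTHY_STATUS = {"OK", "Table is already up to date"}


def check_statements(tables: Iterable[str]) -> List[str]:
    """Build one CHECK TABLE statement per table, quoting the identifier."""
    return ["CHECK TABLE `{}`".format(name.replace("`", "``")) for name in tables]


def broken_tables(rows: Iterable[Row]) -> List[str]:
    """Return tables that reported an error row or an unhealthy status row."""
    broken: List[str] = []
    for table, _op, msg_type, msg_text in rows:
        kind = msg_type.lower()
        is_bad = kind == "error" or (kind == "status" and msg_text not in HEALTHY_STATUS)
        if is_bad and table not in broken:
            broken.append(table)
    return broken
```

Running the check one table at a time keeps each result easy to attribute, and collecting the broken tables in a list makes it clear which ones need an export and import like the ones described in the timeline.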