Post-Mortem: Conflict with ruby standard gems on May 10, 2022
There was a severe service degradation of our reference server. On 2022-05-10 a deployment of OBS failed and led to a downtime. We want to give you some insight into what happened.
Impact
build.opensuse.org was offline for 20 minutes. No one was able to work with the API or user interface. Services depending on OBS (like software.opensuse.org) where taken down by this too.
Root Causes
After the update of the strscan
ruby gem in our bundle, we had two strscan
versions on our production installation. One provided by Ruby 3.1 stdlib (3.0.1) and one by our gem bundle (3.0.2). bundler
itself requires strscan
to function and loads from stdlib by default. During booting
bundler
noticed the mismatch and wasn’t able to spawn new instances of the application anymore.
Could not spawn process for application /srv/www/obs/api: The application encountered the
following error: You have already activated strscan 3.0.1, but your Gemfile requires strscan 3.0.2.
Since strscan is a default gem, you can either remove your dependency on it or try updating to a
newer version of bundler that supports strscan as a default gem. (Gem::LoadError)
Trigger
Merging and deploying the strscan
gem update.
Detection
The deployment showed failures and we received alerts from our monitoring.
Resolution
Setting back the version of strscan
by monkey patching the version number defined
in the Gemfile.lock
and restarting the server resolved the incident.
Action Items
- Reverting the version update and deploying again. ✔️
- Pin the version of
strscan
until it’s solved/released upstream in the bundler dependencypsych
. ✔️ - Try if we can make our production and CI bundle setup more similar.
Lessons Learned
What Went Well
Collaboration among the team to resolve this.
What Went Wrong
Our continuous integration cycle did not catch this as we are bundling differently on production. On CI we bundle to vendor/bundle
on production we bundle to /usr/lib64/obs-api/
. Apparently those paths lead to a different load order.
Where We Got Lucky
No permanent damage or data loss.
Timeline (times in UTC)
- 14:24 Started the deployment which wasn’t successful.
- 14:26 We started to receive alerts.
- 14:28 We declared an incident and started to follow our Incident Management Protocol.
- 14:31 Tried to turn it (apache) on and off again.
- 14:32 Detected the conflict of the gem versions in the log files.
- 14:41 Monkey patched the version of
strscan
from 3.0.2 back to 3.0.1 in theGemfile.lock
. - 14:41 Restarted the server.
- 14:42 Alerts resolved itself.