Post-Mortem: Search Unavailable on September 3, 2020
There was a severe service degradation of our reference server. We want to give you some insight into what happened.
End of August we fixed an issue with our search indexes that required us to rebuild them. We did so and everything was running fine. Then on September 3rd we noticed after a deployment that the search service did not start successfully. During our attempt to mitigate this, we destroyed the search indexes, which lead to this service degradation.
Impact
The search was unavailable for 13 minutes. User visible errors also happened whenever projects and packages where created/updated due to a tight integration of those object with the search engine. Once the search service was back, we only provided partial search results for some hours.
Root Causes
To fix issue #9436 we rebuilt our search indices at the end of August. Unfortunately we ran this command not as the usual user we run our searchd
service with so it created
a process and files belonging to another user.
Trigger
After deploying on September 3, 2020 at 13:57 (CEST), we noticed that the searchd
systemd
unit couldn’t restart.
Restarting the unit always failed to bind to localhost. While trying to mitigate this, we removed the search indices because they had wrong ownership and we wanted to start fresh. This caused the searchd
service to fail, which took Project/Package writes with it.
Resolution
After a while we noticed that there was a searchd
process running that was not controlled by systemd
. It was running as the user we re-indexed with in August. We stopped the searchd
gracefully to have all remnants from the previous process removed. And then used the systemd
unit to start the searchd
properly.
Afterward we had to rebuilt the search indices, but this time, as the right user in the context of our application.
Detection
We got errors about the search being down in our errors tracker.
Action Items
- Running
bundle
,rails
andrake
should do nothing besides emitting a warning on how to run those commands as the right user in the context of our application. We already implemented that now. - Prevent errors from ActiveRecord callbacks when Sphinx’s daemon is down
Lessons Learned
What Went Well
We were quickly aware of the issue with the help of our errors tracker.
What Went Wrong
We don’t guard ourselves against running “standard” Ruby on Rails commands in production as the wrong user. Our monitoring for the searchd
service is not sophisticated enough. And the integration of the ActiveRecord callbacks and searchd
is too fragile.
Where We Got Lucky
The issue was quite severe, but not the whole application went down with it.
Timeline (times in CEST)
- 2020-08-27 08:09: We rebuild the search indices as the wrong user
- 2020-08-27 to 2020-09-03 Everything was seemingly working fine. However, the
searchd
systemd
unit was failing. Which we don’t monitor, we only monitor for the existence of a process namedsearchd
- 2020-09-03 13:57: We ran a deployment
- 13:59: We notice that the
systemd
service of the search failed to start - 14:00: We notice a dangling
searchd
process - 14:25: We create a PID file in an attempt to have the
systemd
service manage the dangling process - 14:34: We notice wrong file permissions on the search indices and a message about this in the
searchd
logs - 14:40: In an desperate attempt, we remove the indices. The search is down
- 14:47: We start the
searchd
process as user root, the search is no longer down - 14:57: We stop the
searchd
process gracefully, which briefly degrades the service again - 14:57: We start the
searchd
usingsystemd
again. It can now start without any issue - 15:06: We start re-indexing the search
- 20:30: The search indices are up-to-date