Post-mortem: Failing email deliveries on 6th February, 2023
Our reference server stopped sending out email notifications on February 3th, 2023. In the lines below you will find a detailed explanation of what happened.
Impact
Our reference server wasn’t sending out email notifications (or only partially) to the users starting Feburary 3th, 2023 around 11:37PM until February 6th, 2023 13:06PM.
Root Causes
The problem was caused by an exception thrown in the SendEventEmailsJob
due to not properly querying “hidden projects”,
which made the job fail.
Trigger
A recent change of user roles on a “hidden project” on our reference server server, started
to trigger queuing SendEventEmailsJob
’s to send emails to the subscribed users.
Resolution
Temporarily ignoring the events coming from the hidden project in the SendEventEmailsJob
in order to allow mailer jobs
to be processed again.
Detection
- An alert about receiving no new data from
ActionMailer
in Grafana in our chat. - Report from a user in the
#opensuse-buildservice
IRC channel. - Receiving exceptions in Errbit regarding failures in the
SendEventEmailsJob
.
Action Items
- Correct handling of hidden projects in the
SendEventEmailsJob
. Right now we use a default scope when querying the projects. This leads to the failure in the job, since it cannot find the hidden project. See issue 13636.
Lessons Learned
What went well?
Collaboration among the team to resolve the issue.
What went wrong?
Not considering the default scope when querying the projects in the SendEventEmailsJob
.
Where we got lucky?
Once the problem was clear, it was easy to temporarily exclude the project that caused the issue from the SendEventEmailsJob
.
Timeline (CET)
- 2023-02-03 23:37 Errbit starts to track errors in
delayed_job#SendEventEmailsJob
. - 2023-02-04 03:08 Start to receive an alert in our chat that no data is received in Grafana for the
ActionMailer
. - 2023-02-05 15:26 Users informed in IRC that they didn’t receive emails since
2023-02-03T22:11
. - 2023-02-06 11:50 Start analyzing the issue. Our exception tracking tool Errbit showed almost 100.000 exceptions
where the
SendEventEmailsJob
couldn’t find a hidden project. - 2023-02-06 12:13 Detect that Postfix wasn’t sending out any mails since
2023-02-05T07:38
. - 2023-02-06 12:16 The delay job queue for ‘mailers’ is unusually high (7243).
- 2023-02-06 12:26 We declare the incident
- 2023-02-06 12:30 Remove the queued jobs related to the hidden project. After that we realized that this is not enough, since new jobs related to the same hidden project were queued again, which led to make the mailer job fail again.
- 2023-02-06 12:58 We monkey patch
app/jobs/send_event_emails_job.rb
to avoid sending emails or web notifications related to the projects causing the issues. - 2023-02-06 13:06 Restart the server and delayed jobs for the
mailers
queue. - 2023-02-06 13:08 The alert for the
ActionMailer
is resolved and emails are send again.