On Tuesday, February 7 at 9:47 AM Eastern time (14:47 UTC), our database monitoring reported a performance issue under normal operational load. We had upgraded the database server over the weekend, but it had been operating normally since Monday. The alerts indicated a high level of lock contention on the newly upgraded database, which was causing problems for our call servers (SFUs).
The SFUs are designed to shut themselves down if they cannot connect to our database. When an SFU shuts down, our autoscaling starts a new SFU to replace it. With several SFUs shutting down at the same time (and several new ones starting), we experienced a larger-than-normal volume of “meeting moves”, which put additional load on a database that was already struggling.
A “meeting move” occurs when an old SFU is shutting down: our webapp automatically moves any ongoing call sessions on that SFU to a different SFU. During a meeting move, users will usually notice everyone else’s video drop out for a second or two before reappearing.
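To make the shutdown behavior concrete, here is a much-simplified sketch of a watchdog that exits when it cannot reach the database, so that autoscaling replaces the instance. The names, thresholds, and language are illustrative only; this is not our actual SFU code.

```python
# Illustrative sketch only: an SFU-style watchdog that exits if the database
# is unreachable for several consecutive checks, letting autoscaling replace it.
import sys
import time

import pymysql  # assumed client library for this sketch

CHECK_INTERVAL_SECONDS = 5       # invented values, not our production settings
MAX_CONSECUTIVE_FAILURES = 3

def database_reachable() -> bool:
    """Return True if we can open a connection to the database."""
    try:
        conn = pymysql.connect(host="db.example.internal", user="sfu",
                               password="placeholder", connect_timeout=2)
        conn.close()
        return True
    except pymysql.MySQLError:
        return False

def watchdog() -> None:
    failures = 0
    while True:
        failures = 0 if database_reachable() else failures + 1
        if failures >= MAX_CONSECUTIVE_FAILURES:
            sys.exit(1)  # shut down; autoscaling starts a replacement SFU
        time.sleep(CHECK_INTERVAL_SECONDS)

if __name__ == "__main__":
    watchdog()
```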
The next few paragraphs show the sequence of events between 9:51 and 10:47 that helped us identify the cause.
By 9:51 (T+4 minutes), engineers had found a potential culprit: a large volume of queries stuck in a deadlock. These were “meeting events” from the SFUs, recording when participants joined or left meetings. The stuck queries were causing webapp API requests to time out and return 5xx errors, and ultimately causing the SFUs to drop their connections and restart.
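For those curious, on MySQL 5.7 (which Aurora MySQL 2 is based on) blocked and blocking transactions can be inspected through the InnoDB tables in information_schema. Here is a minimal sketch of that kind of check; the connection details are placeholders, not anything from our environment.

```python
# Minimal sketch: list which transactions are waiting on locks and which
# transactions are blocking them, using MySQL 5.7's information_schema tables.
import pymysql

conn = pymysql.connect(host="db.example.internal", user="ops", password="placeholder",
                       cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    cur.execute("""
        SELECT r.trx_mysql_thread_id AS waiting_thread,
               r.trx_query           AS waiting_query,
               b.trx_mysql_thread_id AS blocking_thread,
               b.trx_query           AS blocking_query
        FROM information_schema.innodb_lock_waits w
        JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
        JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
    """)
    for row in cur.fetchall():
        print(row)
```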
By 10:13 (T+26 minutes), we had found one potential cause of the deadlocks. After our database migration the previous weekend, we were still using MySQL binary log replication to keep our old database up to date. We disabled binlog replication and restarted the database to try to reduce the overall load. This helped, but many of the SFUs retried the queries that had been causing the deadlocks, so the problem persisted. We continued investigating, and also contacted AWS support to see if they had any insight into the issue.
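On Aurora MySQL, binary logging is controlled by the binlog_format cluster parameter, which is static and only takes effect after a restart. The sketch below shows that general kind of change; the identifiers are placeholders, and this is not necessarily the exact mechanism we used during the incident.

```python
# Rough sketch: turn off binary logging on an Aurora MySQL cluster by setting
# binlog_format to OFF in the cluster parameter group, then reboot the writer
# so the (static) parameter change takes effect.
import boto3

rds = boto3.client("rds")

rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="example-aurora-cluster-params",  # placeholder name
    Parameters=[{
        "ParameterName": "binlog_format",
        "ParameterValue": "OFF",
        "ApplyMethod": "pending-reboot",  # static parameter; applied on restart
    }],
)

# Reboot the writer instance so the new parameter value is picked up.
rds.reboot_db_instance(DBInstanceIdentifier="example-aurora-writer")  # placeholder name
```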
At 10:47 (T+1 hour), engineers were working on a script that would terminate stuck queries when the database suddenly restarted itself. This restart took slightly longer than the one at 10:13, and it allowed the SFUs to discard the now-stale meeting updates without being disconnected long enough to cause them to restart. At that point, the SFUs and the platform returned to normal operation.
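A script like that only needs to do two things: find queries that have been running far longer than expected, and kill them. Below is a simplified sketch of that approach, with invented thresholds and placeholder connection details; it is not the script we were writing.

```python
# Simplified sketch: kill queries that have been running longer than a threshold.
import pymysql

STUCK_THRESHOLD_SECONDS = 60  # invented threshold for illustration

conn = pymysql.connect(host="db.example.internal", user="ops", password="placeholder",
                       cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    # Find long-running queries in the processlist.
    cur.execute("""
        SELECT ID AS thread_id, TIME AS runtime_seconds, INFO AS query_text
        FROM information_schema.processlist
        WHERE COMMAND = 'Query' AND TIME > %s
    """, (STUCK_THRESHOLD_SECONDS,))
    stuck = cur.fetchall()

    # Terminate each one.
    for row in stuck:
        print(f"killing thread {row['thread_id']} "
              f"({row['runtime_seconds']}s): {row['query_text']}")
        cur.execute(f"KILL {int(row['thread_id'])}")
```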
We were ultimately able to prove that the deadlocks were caused by a low-level behavior change introduced in a point release of MySQL. Our database maintenance from the previous weekend had upgraded us to that version and introduced the change. Working around that behavior change involved updating an index on one affected table. We spent the rest of the week developing and testing a plan to update the production database, and we completed that work with no user impact on Saturday evening.
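We are not publishing the actual schema change, but the general shape of the fix was an online index rebuild, which MySQL 5.7 can perform without blocking reads or writes. The example below is hypothetical; the table, column, and index names are invented.

```python
# Hypothetical example: rebuild an index in place using MySQL 5.7 online DDL,
# so the change can be applied without blocking reads or writes.
import pymysql

conn = pymysql.connect(host="db.example.internal", user="ops", password="placeholder")

with conn.cursor() as cur:
    cur.execute("""
        ALTER TABLE meeting_events
            DROP INDEX idx_meeting_events_room,
            ADD INDEX idx_meeting_events_room_ts (room_id, created_at),
            ALGORITHM=INPLACE, LOCK=NONE
    """)
```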
At 11:01, we decided we could move into a monitoring state while continuing to investigate the root cause. We left the status page incident in a “monitoring” state until Friday, because we wanted to make sure we fully understood the initial cause of the deadlocks and took any necessary action to avoid it in the future. One such action was the addition of rate limiting to the room creation API endpoint.
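For illustration, a rate limiter of that kind can be as simple as a per-API-key token bucket in front of the endpoint. Here is a minimal sketch with made-up limits, not our production values or our actual API code.

```python
# Minimal token-bucket sketch: allow a steady rate of room creations per API key,
# with a bounded burst, and reject excess requests with HTTP 429.
import time

class TokenBucket:
    def __init__(self, rate_per_second: float, burst: int):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def create_room(api_key: str) -> int:
    # e.g. at most 10 room creations per second per API key, bursts of 20 (invented numbers)
    bucket = buckets.setdefault(api_key, TokenBucket(rate_per_second=10, burst=20))
    if not bucket.allow():
        return 429  # Too Many Requests
    # ... create the room ...
    return 200
```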
The overall impact of this incident was limited to almost exactly one hour, between 14:47 and 15:47 UTC. During that time, some users in Daily calls experienced the “meeting moves” described earlier. There may have been a small number of users who weren’t able to join a room if they happened to try in the middle of a “meeting move”, which lasts a few tens of seconds; they would have been able to join on retrying a few seconds later. Some REST API requests may also have returned 5xx error codes.
We are continuing to work with AWS to make sure that the deadlock issue we saw in production with Aurora MySQL 2.11.0 is fully documented, understood, and fixed in a future release. A more conservative approach to deadlock handling was a known change in MySQL 5.7 (which Aurora MySQL 2 is based on). However, the severity of the deadlocks we experienced during this incident was a surprise both to us and to the AWS Aurora team.
We try hard to test all infrastructure changes under production-like workloads. In this case, we failed to test with a synthetic workload that had the right “shape” to trigger these deadlocks. As a result of this incident, we have added additional API request patterns to our testing workload. We’ve also added some new production monitoring alarms that are targeted at more fine-grained database metrics.
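As one example of the kind of fine-grained metric we mean: Aurora MySQL publishes a per-cluster Deadlocks metric to CloudWatch, and an alarm on it surfaces this class of problem directly. The sketch below uses placeholder identifiers, thresholds, and notification targets rather than our real configuration.

```python
# Sketch: alarm when the Aurora "Deadlocks" metric stays elevated for several minutes.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="aurora-deadlocks-elevated",
    Namespace="AWS/RDS",
    MetricName="Deadlocks",
    Dimensions=[{"Name": "DBClusterIdentifier", "Value": "example-aurora-cluster"}],
    Statistic="Average",
    Period=60,                    # one-minute resolution
    EvaluationPeriods=3,          # sustained for three consecutive minutes
    Threshold=1.0,                # deadlocks per second (illustrative threshold)
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],  # placeholder ARN
)
```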