On Tuesday, February 7 at 9:47 AM Eastern time (14:47 UTC), our database monitoring reported a performance issue under normal operational load. We had upgraded the database server over the weekend, but it had been operating normally since Monday. The alerts indicated a high level of lock contention on the newly upgraded database, which was causing problems for our call servers (SFUs).
The SFUs are designed to shut themselves down if they cannot connect to our database. When an SFU shuts down, our autoscaling starts a new SFU to replace it. With several SFUs shutting down at the same time (and several new ones starting), we experienced a larger-than-normal volume of “meeting moves”, which put additional load on a database that was already struggling.
A “meeting move” occurs when an old SFU is shutting down: our webapp automatically moves any ongoing call sessions on that SFU to a different SFU. During a meeting move, users will usually notice everyone else’s video drop out for a second or two before reappearing.
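To make the shutdown behavior concrete, here is a much-simplified sketch of a watchdog that exits when it cannot reach the database, so that autoscaling replaces the instance. The names, thresholds, and language are illustrative only; this is not our actual SFU code.

```python
# Illustrative sketch only: an SFU-style watchdog that exits if the database
# is unreachable for several consecutive checks, letting autoscaling replace it.
import sys
import time

import pymysql  # assumed client library for this sketch

CHECK_INTERVAL_SECONDS = 5       # invented values, not our production settings
MAX_CONSECUTIVE_FAILURES = 3

def database_reachable() -> bool:
    """Return True if we can open a connection to the database."""
    try:
        conn = pymysql.connect(host="db.example.internal", user="sfu",
                               password="placeholder", connect_timeout=2)
        conn.close()
        return True
    except pymysql.MySQLError:
        return False

def watchdog() -> None:
    failures = 0
    while True:
        failures = 0 if database_reachable() else failures + 1
        if failures >= MAX_CONSECUTIVE_FAILURES:
            sys.exit(1)  # shut down; autoscaling starts a replacement SFU
        time.sleep(CHECK_INTERVAL_SECONDS)

if __name__ == "__main__":
    watchdog()
```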
The next few paragraphs show the sequence of events between 9:51 and 10:47 that helped us identify the cause.
By 9:51 (T+4 minutes), engineers had found a potential culprit: a large volume of queries stuck in a deadlock. These were “meeting events” from the SFUs, recording when participants joined or left meetings. The stuck queries were causing webapp API requests to time out and return 5xx errors, and ultimately causing the SFUs to drop their connections and restart.
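For those curious, on MySQL 5.7 (which Aurora MySQL 2 is based on) blocked and blocking transactions can be inspected through the InnoDB tables in information_schema. Here is a minimal sketch of that kind of check; the connection details are placeholders, not anything from our environment.

```python
# Minimal sketch: list which transactions are waiting on locks and which
# transactions are blocking them, using MySQL 5.7's information_schema tables.
import pymysql

conn = pymysql.connect(host="db.example.internal", user="ops", password="placeholder",
                       cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    cur.execute("""
        SELECT r.trx_mysql_thread_id AS waiting_thread,
               r.trx_query           AS waiting_query,
               b.trx_mysql_thread_id AS blocking_thread,
               b.trx_query           AS blocking_query
        FROM information_schema.innodb_lock_waits w
        JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
        JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
    """)
    for row in cur.fetchall():
        print(row)
```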
By 10:13 (T+26 minutes), we had found one potential cause of the deadlocks. After our database migration the previous weekend, we were still using MySQL binary log replication to keep our old database up to date. We disabled binlog replication and restarted the database to try to reduce the overall load. This helped, but many of the SFUs retried the queries that had been causing the deadlocks, so the problem persisted. We continued investigating, and also contacted AWS support to see if they had any insight into the issue.
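On Aurora MySQL, binary logging is controlled by the binlog_format cluster parameter, which is static and only takes effect after a restart. The sketch below shows that general kind of change; the identifiers are placeholders, and this is not necessarily the exact mechanism we used during the incident.

```python
# Rough sketch: turn off binary logging on an Aurora MySQL cluster by setting
# binlog_format to OFF in the cluster parameter group, then reboot the writer
# so the (static) parameter change takes effect.
import boto3

rds = boto3.client("rds")

rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="example-aurora-cluster-params",  # placeholder name
    Parameters=[{
        "ParameterName": "binlog_format",
        "ParameterValue": "OFF",
        "ApplyMethod": "pending-reboot",  # static parameter; applied on restart
    }],
)

# Reboot the writer instance so the new parameter value is picked up.
rds.reboot_db_instance(DBInstanceIdentifier="example-aurora-writer")  # placeholder name
```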
At 10:47 (T+1 hour), engineers were working on a script that would terminate stuck queries when the database suddenly restarted itself. This restart took slightly longer than the one at 10:13, and it allowed the SFUs to discard the now-stale meeting updates without being disconnected long enough to cause them to restart. At that point, the SFUs and the platform returned to normal operation.
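A script like that only needs to do two things: find queries that have been running far longer than expected, and kill them. Below is a simplified sketch of that approach, with invented thresholds and placeholder connection details; it is not the script we were writing.

```python
# Simplified sketch: kill queries that have been running longer than a threshold.
import pymysql

STUCK_THRESHOLD_SECONDS = 60  # invented threshold for illustration

conn = pymysql.connect(host="db.example.internal", user="ops", password="placeholder",
                       cursorclass=pymysql.cursors.DictCursor)

with conn.cursor() as cur:
    # Find long-running queries in the processlist.
    cur.execute("""
        SELECT ID AS thread_id, TIME AS runtime_seconds, INFO AS query_text
        FROM information_schema.processlist
        WHERE COMMAND = 'Query' AND TIME > %s
    """, (STUCK_THRESHOLD_SECONDS,))
    stuck = cur.fetchall()

    # Terminate each one.
    for row in stuck:
        print(f"killing thread {row['thread_id']} "
              f"({row['runtime_seconds']}s): {row['query_text']}")
        cur.execute(f"KILL {int(row['thread_id'])}")
```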
We were ultimately able to prove that the deadlocks were caused by a low-level behavior change introduced in a point release of MySQL. Our database maintenance from the previous weekend had upgraded us to that version and introduced the change. Working around that behavior change involved updating an index on one affected table. We spent the rest of the week developing and testing a plan to update the production database, and we completed that work with no user impact on Saturday evening.
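We are not publishing the actual schema change, but the general shape of the fix was an online index rebuild, which MySQL 5.7 can perform without blocking reads or writes. The example below is hypothetical; the table, column, and index names are invented.

```python
# Hypothetical example: rebuild an index in place using MySQL 5.7 online DDL,
# so the change can be applied without blocking reads or writes.
import pymysql

conn = pymysql.connect(host="db.example.internal", user="ops", password="placeholder")

with conn.cursor() as cur:
    cur.execute("""
        ALTER TABLE meeting_events
            DROP INDEX idx_meeting_events_room,
            ADD INDEX idx_meeting_events_room_ts (room_id, created_at),
            ALGORITHM=INPLACE, LOCK=NONE
    """)
```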
At 11:01, we decided we could move into a monitoring state while continuing to investigate the root cause. We left the status page incident in a “monitoring” state until Friday, because we wanted to make sure we fully understood the initial cause of the deadlocks and took any necessary action to avoid it in the future. One such action was the addition of rate limiting to the room creation API endpoint.
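For illustration, a rate limiter of that kind can be as simple as a per-API-key token bucket in front of the endpoint. Here is a minimal sketch with made-up limits, not our production values or our actual API code.

```python
# Minimal token-bucket sketch: allow a steady rate of room creations per API key,
# with a bounded burst, and reject excess requests with HTTP 429.
import time

class TokenBucket:
    def __init__(self, rate_per_second: float, burst: int):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def create_room(api_key: str) -> int:
    # e.g. at most 10 room creations per second per API key, bursts of 20 (invented numbers)
    bucket = buckets.setdefault(api_key, TokenBucket(rate_per_second=10, burst=20))
    if not bucket.allow():
        return 429  # Too Many Requests
    # ... create the room ...
    return 200
```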
The overall impact of this incident was limited to almost exactly one hour, between 14:47 and 15:47 UTC. During that time, some users in Daily calls experienced the “meeting moves” described earlier. There may have been a small number of users who weren’t able to join a room if they happened to try in the middle of a “meeting move”, which lasts a few tens of seconds; they would have been able to join on retrying a few seconds later. Some REST API requests may also have returned 5xx error codes.
We are continuing to work with AWS to make sure that the deadlock issue we saw in production with Aurora MySQL 2.11.0 is fully documented, understood, and fixed in a future release. A more conservative approach to deadlock handling was a known change in MySQL 5.7 (which Aurora MySQL 2 is based on). However, the severity of the deadlocks we experienced during this incident was a surprise both to us and to the AWS Aurora team.
We try hard to test all infrastructure changes under production-like workloads. In this case, we failed to test with a synthetic workload that had the right “shape” to trigger these deadlocks. As a result of this incident, we have added additional API request patterns to our testing workload. We’ve also added some new production monitoring alarms that are targeted at more fine-grained database metrics.
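As one example of the kind of fine-grained metric we mean: Aurora MySQL publishes a per-cluster Deadlocks metric to CloudWatch, and an alarm on it surfaces this class of problem directly. The sketch below uses placeholder identifiers, thresholds, and notification targets rather than our real configuration.

```python
# Sketch: alarm when the Aurora "Deadlocks" metric stays elevated for several minutes.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="aurora-deadlocks-elevated",
    Namespace="AWS/RDS",
    MetricName="Deadlocks",
    Dimensions=[{"Name": "DBClusterIdentifier", "Value": "example-aurora-cluster"}],
    Statistic="Average",
    Period=60,                    # one-minute resolution
    EvaluationPeriods=3,          # sustained for three consecutive minutes
    Threshold=1.0,                # deadlocks per second (illustrative threshold)
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],  # placeholder ARN
)
```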