Delayed API calls

Incident Report for Daily

Postmortem

On Tuesday, October 22, around 17:15 UTC (10:15 AM PDT), a Daily customer started running a series of load tests. The tests involved rapidly creating and deleting a large number of rooms that used PSTN dial-out, cloud recording, and webhooks. This eventually caused several capacity threshold alerts to fire around 18:15 UTC (11:15 AM PDT) as our system scaled out to handle the load.

We noticed that their test was running a script that created a room and started dial-out, but almost every instance of the script was exiting the room uncleanly before the outgoing call had connected to anything. This exposed an edge case that caused a ‘zombie’ PSTN participant to stay in that session and keep trying to send presence updates indefinitely. We’re already working on fixing that bug.
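To illustrate the edge case, here is a minimal sketch of how such ‘zombie’ participants could be detected. It is not our actual implementation: the `DialOutParticipant` record, its field names, and the 60-second timeout are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DialOutParticipant:
    # Hypothetical record; these field names are illustrative assumptions.
    participant_id: str
    started_at: float                  # when dial-out began, in seconds
    connected_at: Optional[float]      # None until the outgoing call connects
    room_has_other_participants: bool  # False once everyone else has left


def find_zombies(participants: List[DialOutParticipant],
                 now: float,
                 connect_timeout: float = 60.0) -> List[str]:
    """Return IDs of dial-out participants that never connected within
    the timeout and are alone in their room."""
    zombies = []
    for p in participants:
        never_connected = p.connected_at is None
        timed_out = now - p.started_at > connect_timeout
        if never_connected and timed_out and not p.room_has_other_participants:
            zombies.append(p.participant_id)
    return zombies
```

A periodic cleanup job could call `find_zombies` and stop presence updates for the returned IDs, so that abandoned dial-out attempts do not keep writing to the database.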

This has probably happened before, but in much smaller quantities, since it involves a very unusual combination of events. Because this was an automated load test, however, a large number of these ‘zombies’ built up, all trying to write frequent presence updates to the database. Soon, database response times began to slow under the increased load.

Around that same time (18:15 UTC, 11:15 AM PDT), we noticed an increase in API error rates, specifically for actions that required writing to the database. Our team started to work on both problems at once: safely getting rid of the ‘zombie’ sessions without affecting other customers, and alleviating the load on the database to improve API response times.

API error rates for POST requests spiked as high as 8%, and error rates for all requests peaked at 2-3%. By refreshing several database instances, we returned API error rates and latency to normal by around 19:50 UTC (12:50 PM PDT). We contacted the customer and stopped the load tests, and then we were able to remove the ‘zombie’ sessions through our normal deploy process.

We’re sorry for the disruption this caused. We’re already working on several remediations, including fixing the bug that caused the ‘zombie’ sessions, as well as adjusting platform rate limits to prevent this from happening again.
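As a rough illustration of what a platform rate limit can look like, here is a minimal token-bucket sketch. It does not describe our actual rate limiter: the `TokenBucket` class, its parameters, and the example values are assumptions made purely for illustration.

```python
class TokenBucket:
    """Minimal token-bucket limiter (illustrative sketch only)."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.last = now           # time of the last refill

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With `rate=1.0` and `capacity=2.0`, for example, a caller could make two requests immediately and then roughly one per second after that; requests beyond that budget would be rejected rather than reaching the database.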

Posted Oct 23, 2024 - 21:53 UTC

Resolved

This issue has been resolved. We will post more information about this incident in the near future.

Posted Oct 22, 2024 - 22:02 UTC

Monitoring

API latency and errors have stayed at normal levels for a while now, but we're continuing to monitor for any further impact.

Posted Oct 22, 2024 - 20:59 UTC

Update

API error levels have decreased considerably, but we're still working on full remediation. More updates to come.

Posted Oct 22, 2024 - 19:59 UTC

Identified

We've identified an issue causing some slowdowns in one of our databases, leading to some delayed or failed API responses. We've solved the root cause of the issue, but we're being cautious about restoring the database to full functionality, so we expect the delays to continue for a short time.

Posted Oct 22, 2024 - 19:14 UTC

Investigating

We're investigating an issue that's causing delays with some API operations, such as creating rooms and starting recordings. We'll post more info as soon as we have it.

Posted Oct 22, 2024 - 18:46 UTC

This incident affected: API.