On Tuesday, October 22, around 17:15 UTC (9:15 AM PDT), a Daily customer started running a series of load tests. Their test involved rapidly creating and deleting a large number of rooms that used PSTN dial-out, cloud recording, and webhooks. This eventually caused several capacity threshold alerts to fire around 18:15 UTC (10:15 AM PDT) as our system scaled out to handle the load.
We noticed that their test was running a script that created a room and started dial-out, but almost every instance of the script was exiting the room uncleanly before the outgoing call had connected to anything. This exposed an edge case that caused a ‘zombie’ PSTN participant to stay in that session and keep trying to send presence updates indefinitely. We’re already working on fixing that bug.
This has probably happened before, but in much smaller quantities, since it involves a very unusual combination of events. Because this was an automated load test, however, a large number of these ‘zombies’ built up, all trying to write frequent presence updates to the database. Soon, database response times began to slow under the increased load.
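To make the failure mode concrete, here is a minimal sketch of one way a ‘zombie’ participant could be detected and removed. It is not our actual implementation: `Participant`, `PresenceRegistry`, the `connect_timeout` value, and the in-memory storage are all hypothetical and exist only for illustration. The idea is that a participant counts as a zombie if the session that created it is gone, or if its outgoing call never connected within a time limit.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    participant_id: str
    session_id: str
    created_at: float        # seconds, from whatever clock the caller uses
    connected: bool = False  # True once the outgoing call has connected


class PresenceRegistry:
    """Hypothetical in-memory registry; names and fields are illustrative."""

    def __init__(self, connect_timeout: float = 60.0):
        self.connect_timeout = connect_timeout
        self.participants: dict[str, Participant] = {}
        self.live_sessions: set[str] = set()

    def add(self, participant: Participant) -> None:
        self.live_sessions.add(participant.session_id)
        self.participants[participant.participant_id] = participant

    def end_session(self, session_id: str) -> None:
        # Called when the session owner exits, cleanly or not.
        self.live_sessions.discard(session_id)

    def is_zombie(self, p: Participant, now: float) -> bool:
        # Orphaned: the session that created this participant is gone.
        if p.session_id not in self.live_sessions:
            return True
        # Stuck: the call never connected within the allowed window.
        return not p.connected and now - p.created_at > self.connect_timeout

    def reap(self, now: float) -> list[str]:
        """Remove zombies and return their ids so their updates can be stopped."""
        removed = [pid for pid, p in self.participants.items()
                   if self.is_zombie(p, now)]
        for pid in removed:
            del self.participants[pid]
        return removed
```

In a sketch like this, `reap` would run periodically, so an orphaned participant stops generating presence writes after at most one sweep interval instead of continuing indefinitely.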
Around that same time (18:15 UTC, 10:15 AM PDT), we noticed an increase in API error rates, specifically for actions that required writing to the database. Our team started working on both problems at once: safely getting rid of the ‘zombie’ sessions without affecting other customers, and reducing the load on the database to improve API response times.
API error rates for POST requests spiked as high as 8%, and error rates for all requests peaked at 2-3%. We returned API error rates and latency to normal by around 19:50 UTC (12:50 PM PDT) by refreshing several database instances. We contacted the customer and stopped the load tests, and we were then able to remove the ‘zombie’ sessions through our normal deploy process.
We’re sorry for the disruption this caused. We’re already working on several remediations, including fixing the bug that caused the ‘zombie’ sessions, as well as adjusting platform rate limits to prevent this from happening again.
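As an illustration of what a rate limit on rapid room creation can look like, here is a minimal token bucket sketch. It is a common general technique, not a description of the limits we are rolling out: the `TokenBucket` class, its parameters, and the example numbers are all assumptions made for this example.

```python
class TokenBucket:
    """Illustrative token bucket; not tied to any real API or limit values."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start full so short bursts are allowed
        self.updated_at = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill based on elapsed time, never exceeding capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = max(self.updated_at, now)
        # Spend tokens if enough are available; otherwise reject the request.
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With a hypothetical `TokenBucket(rate=1.0, capacity=2.0)`, a caller could create two rooms back to back, after which further requests would be rejected until roughly one second had passed per additional room.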