ccaas-status - Inbound Call Processing Failure – Incident details


Inbound Call Processing Failure

Resolved
Partial outage · 50%
Started 28 days ago · Lasted about 4 hours

Affected

Graia Contact Center as a Service

Partial outage from 11:15 AM to 1:43 PM, Operational from 1:43 PM to 1:57 PM, Partial outage from 1:57 PM to 2:20 PM, Operational from 2:20 PM to 2:45 PM

CCaaS Europe - Voice Services

Partial outage from 11:15 AM to 1:43 PM, Operational from 1:43 PM to 1:57 PM, Partial outage from 1:57 PM to 2:20 PM, Operational from 2:20 PM to 2:45 PM

Updates
  • Postmortem

    The incident was caused by the number of concurrent active calls across tenants exceeding the capacity of a micro-service responsible for temporary call state storage.

    This was triggered by an unexpectedly high-volume campaign workload from a single tenant, which significantly increased overall system load.

    While the underlying voice infrastructure had sufficient capacity to handle the traffic, the supporting micro-service reached its storage limit. This resulted in a processing blockage that prevented calls from being fully established.

    Corrective action plan:
    Migrate voice instance state storage from in-memory grain storage to blob-based storage
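
The failure mode described above, a fixed-capacity in-memory state store that rejects new call setups once full, can be sketched roughly as follows. This is a minimal illustration only; the `CallStateStore` and `CapacityExceeded` names are hypothetical and not part of the Graia platform.

```python
class CapacityExceeded(Exception):
    """Raised when the state store cannot accept another active call."""

class CallStateStore:
    """Toy model of an in-memory store for temporary call state."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._states: dict[str, dict] = {}  # call_id -> temporary call state

    def begin_call(self, call_id: str, state: dict) -> None:
        # Once the store is full, every new call setup fails: the
        # "processing blockage" observed during the incident.
        if len(self._states) >= self.capacity:
            raise CapacityExceeded(f"store full ({self.capacity} active calls)")
        self._states[call_id] = state

    def end_call(self, call_id: str) -> None:
        self._states.pop(call_id, None)

store = CallStateStore(capacity=2)
store.begin_call("a", {"tenant": "t1"})
store.begin_call("b", {"tenant": "t2"})  # high-volume campaign fills the store
try:
    store.begin_call("c", {"tenant": "t3"})  # inbound call cannot be established
    blocked = False
except CapacityExceeded:
    blocked = True
```

Moving this state to blob-based storage, as the corrective action plan proposes, removes the hard in-memory ceiling so concurrent call volume is no longer bounded by a single process's memory.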

  • Resolved
    This incident has been resolved.
  • Update

    The underlying process responsible for the blockage was identified and stopped.

  • Update

    The blockage reoccurred, again impacting call handling.

  • Monitoring

    The blockage was cleared, and calls began to be successfully processed. Monitoring continued.

  • Identified

    A blockage in the call processing flow was identified, and remediation steps were initiated.

  • Update

    Senior members of the development team were engaged for further investigation.

  • Update

    A broader system investigation revealed an unusually high volume of outbound calls initiated by a campaign-based tenant. This was noted as a potential contributing factor and investigated further.

  • Update

    After initial checks, SRE rebooted the first affected voice instance and performed test calls. The system did not recover. During the reboot, all Graia tenants were automatically failed over to the secondary instance.

  • Investigating

    The System Reliability Engineering (SRE) team was alerted by the Geomant Helpdesk following a report from a partner that inbound calls were not reaching agents.