ccaas-status - Inbound Call Processing Failure – Incident details


Inbound Call Processing Failure

Resolved
Partial outage · 50%
Started 28 days ago · Lasted about 4 hours

Affected

Graia Contact Center as a Service

Partial outage from 11:15 AM to 1:43 PM, Operational from 1:43 PM to 1:57 PM, Partial outage from 1:57 PM to 2:20 PM, Operational from 2:20 PM to 2:45 PM

CCaaS Europe - Voice Services

Partial outage from 11:15 AM to 1:43 PM, Operational from 1:43 PM to 1:57 PM, Partial outage from 1:57 PM to 2:20 PM, Operational from 2:20 PM to 2:45 PM

Updates
  • Postmortem

    The incident was caused by the number of concurrent active calls across tenants exceeding the capacity of a micro-service responsible for temporary call state storage.

    This was triggered by an unexpectedly high-volume campaign workload from a single tenant, which significantly increased overall system load.

    While the underlying voice infrastructure had sufficient capacity to handle the traffic, the supporting micro-service reached its storage limit. This resulted in a processing blockage that prevented calls from being fully established.

    Corrective action plan:
    Migrate voice instance state storage from in-memory grain storage to blob-based storage
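
The failure mode described above, a fixed-capacity in-memory state store that rejects new call setups once full, can be sketched roughly as follows. This is a minimal illustration only; the `CallStateStore` and `CapacityExceeded` names are hypothetical and not part of the Graia platform.

```python
class CapacityExceeded(Exception):
    """Raised when the state store cannot accept another active call."""

class CallStateStore:
    """Toy model of an in-memory store for temporary call state."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._states: dict[str, dict] = {}  # call_id -> temporary call state

    def begin_call(self, call_id: str, state: dict) -> None:
        # Once the store is full, every new call setup fails: the
        # "processing blockage" observed during the incident.
        if len(self._states) >= self.capacity:
            raise CapacityExceeded(f"store full ({self.capacity} active calls)")
        self._states[call_id] = state

    def end_call(self, call_id: str) -> None:
        self._states.pop(call_id, None)

store = CallStateStore(capacity=2)
store.begin_call("a", {"tenant": "t1"})
store.begin_call("b", {"tenant": "t2"})  # high-volume campaign fills the store
try:
    store.begin_call("c", {"tenant": "t3"})  # inbound call cannot be established
    blocked = False
except CapacityExceeded:
    blocked = True
```

Moving this state to blob-based storage, as the corrective action plan proposes, removes the hard in-memory ceiling so concurrent call volume is no longer bounded by a single process's memory.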

  • Resolved
    This incident has been resolved.
  • Update

    The underlying process responsible for the blockage was identified and stopped.

  • Update

    The blockage reoccurred, again impacting call handling.

  • Monitoring

    The blockage was cleared, and calls began to be successfully processed. Monitoring continued.

  • Identified

    A blockage in the call processing flow was identified, and remediation steps were initiated.

  • Update

    Senior members of the development team were engaged for further investigation.

  • Update

    A broader system investigation revealed an unusually high volume of outbound calls initiated by a campaign-based tenant. This was noted as a potential contributing factor and investigated further.

  • Update

    After initial checks, SRE rebooted the first affected voice instance and performed test calls. The system did not recover. During the reboot, all Graia tenants were automatically failed over to the secondary instance.

  • Investigating

    The System Reliability Engineering (SRE) team was alerted by the Geomant Helpdesk following a report from a partner that inbound calls were not reaching agents.