Cloudli Connect Inbound Calls

Incident Report for Cloudli Communications

Postmortem

Cloudli Incident Report

Services Affected: Clarity Inbound Calling, Clarity Outbound Calling

Start Date/Time: 9:00 AM (EDT), October 21, 2025

End Date/Time: 11:22 AM (EDT), October 21, 2025

Summary of Events

At approximately 9:00 AM (EDT), Cloudli Support received multiple customer reports of failed inbound and outbound calls. The Network Operations and Engineering teams were immediately engaged to investigate.

Initial analysis confirmed that the issue was not related to the prior night’s maintenance, which had already been rolled back. Further investigation revealed that the `shortlocation` entries in Redis for some Registrar Servers were incomplete, resulting in failed SIP registrations and call setup errors for a subset of customers.

At 9:13 AM, the GlobalSBC Registrar microservice was restarted to re-establish proper Redis synchronization. Service behavior normalized immediately following the restart, and call completion success rates returned to expected levels. The incident moved to a resolved but monitoring state at 9:45 AM.

By 10:15 AM, Engineering confirmed that all SIP registrations were stable and that previously impacted customers were again able to complete calls. Further validation through test calls, log reviews, and metric analysis confirmed full service restoration by 11:22 AM, closing the incident.

Incident Analysis and Mitigation Measures

The incident on October 21, 2025, was caused by the AWS US-East regional outage of October 20, 2025, which disrupted connectivity to Cloudli’s Kafka infrastructure. This interruption caused transient desynchronization between Kafka, Redis, and the Java microservices that manage some of the Registrar microservices, resulting in incomplete `shortlocation` entries and subsequent SIP registration failures.

Once AWS service availability returned for Cloudli’s Kafka infrastructure, restarting the affected Registrar component restored proper Redis state and normal call routing.
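The failure mode described above, where registrations exist but their Redis entries are missing or incomplete, can be illustrated with a minimal sketch. The schema of `shortlocation` is not given in this report, so the field names (`contact`, `expires`, `registrar`), the function name, and the plain mapping standing in for Redis are all assumptions made for illustration only.

```python
from typing import Dict, Iterable, List, Mapping

# Hypothetical required fields; the real `shortlocation` schema is not
# documented in this report.
REQUIRED_FIELDS = ("contact", "expires", "registrar")


def find_incomplete_entries(
    expected_aors: Iterable[str],
    store: Mapping[str, Mapping[str, str]],
) -> Dict[str, List[str]]:
    """Return {address-of-record: [missing fields]} for every expected
    registration whose entry is absent or lacks a required field."""
    problems: Dict[str, List[str]] = {}
    for aor in expected_aors:
        entry = store.get(aor)
        if entry is None:
            # No entry at all: every required field counts as missing.
            problems[aor] = list(REQUIRED_FIELDS)
            continue
        # An empty string is treated the same as a missing field.
        missing = [field for field in REQUIRED_FIELDS if not entry.get(field)]
        if missing:
            problems[aor] = missing
    return problems
```

In this sketch, `store` is any mapping from an address-of-record to its stored fields; in practice it would be populated from whatever the Registrar actually writes to Redis.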

To prevent recurrence, Cloudli Engineering will:

  • Implement enhanced caching and message queue redundancy extending beyond 12 hours, to reduce reliance on real-time cloud synchronization.
  • Expand health monitoring around Registrar nodes to immediately flag Redis desynchronization events.

These measures will ensure platform resilience and reduce sensitivity to external cloud infrastructure interruptions.
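As a rough illustration of the first measure, the sketch below keeps recent events in a local buffer for a configurable retention window so they can be replayed after an upstream outage. It is not Cloudli's implementation: the class name `RetentionBuffer`, its methods, and the 13-hour window in the usage note are hypothetical, and the sketch assumes events are appended in non-decreasing timestamp order.

```python
from collections import deque
from typing import Deque, List, Tuple


class RetentionBuffer:
    """Hypothetical in-memory buffer that keeps events for a fixed window."""

    def __init__(self, retention_seconds: float) -> None:
        self.retention_seconds = retention_seconds
        self._events: Deque[Tuple[float, str]] = deque()

    def append(self, timestamp: float, event: str) -> None:
        # Assumes timestamps arrive in non-decreasing order.
        self._events.append((timestamp, event))
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop events strictly older than the retention window.
        cutoff = now - self.retention_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def replay_since(self, since: float) -> List[str]:
        # Return retained events with timestamp >= since, oldest first.
        return [event for ts, event in self._events if ts >= since]
```

A buffer created with `RetentionBuffer(13 * 3600)` would retain slightly more than 12 hours of events; after connectivity returns, `replay_since(outage_start)` yields the events a consumer would need to reapply.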

Final Remarks

At Cloudli, we take any interruption of service very seriously and are continuously evaluating new processes and mitigation measures that can be proactively implemented to ensure service continuity.

When service interruptions do occur, our incident management procedure prioritizes prompt, clear notification and timely status and resolution updates for our customers and partners.

We thank you for your continued support. Please feel free to reach out if you would like to discuss the particulars of this incident report further.

Posted Oct 24, 2025 - 01:29 EDT

Resolved

This incident has been resolved.
Posted Oct 21, 2025 - 11:22 EDT

Monitoring

Inbound and outbound calls have been confirmed working since 11:17 EDT. We will continue to monitor the situation.
Posted Oct 21, 2025 - 11:17 EDT

Investigating

We are currently investigating some additional delayed calls being reported.
Posted Oct 21, 2025 - 11:15 EDT

Monitoring

A resolution was implemented at 9:45 AM EDT and we are monitoring the results.
Posted Oct 21, 2025 - 09:45 EDT

Investigating

We are currently experiencing an issue affecting incoming calls to some customers in the USA, and we are investigating.

Impacts: This could potentially cause abnormal delays or an inability to initiate or receive calls.

Posted Oct 21, 2025 - 09:28 EDT
This incident affected: Cloudli Connect (Inbound Calling) and Clarity (Inbound Calling).