According to our findings, the intermittent outage of the Buzzeasy webchat service on 2024-10-09 was due to increased traffic flowing through our webchat-connector microservice.
The service had to process so much data that it could not reply in time to the health-check requests sent by the built-in monitoring system's liveness probe. As a result, the service was considered non-responsive, which triggered automatic restarts and made the webchat widget temporarily unavailable on web pages that were loaded while a restart was in progress.
We addressed the problem in two ways: we increased the service's maximum memory limit, and we allowed more time for the liveness probe's requests to get a response before a restart is triggered. This will prevent the service from being restarted prematurely when there is no real fault and it simply needs additional time to process the requests in its buffer.
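To illustrate why a longer probe timeout avoids premature restarts, here is a minimal sketch of how a liveness check of this kind typically decides to restart a service. It is not the actual monitoring configuration: the function name, the latency values, the timeouts, and the rule of restarting after a number of consecutive missed replies are all assumptions chosen for illustration.

```python
from typing import Iterable


def restart_triggered(
    latencies_s: Iterable[float], timeout_s: float, failure_threshold: int
) -> bool:
    """Return True if `failure_threshold` consecutive probes exceed `timeout_s`."""
    consecutive_failures = 0
    for latency in latencies_s:
        if latency > timeout_s:
            # The probe got no reply in time and counts as a failure.
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return True
        else:
            # A timely reply resets the failure streak.
            consecutive_failures = 0
    return False


# Hypothetical probe response times (seconds) while the service is busy but healthy.
busy_latencies = [0.2, 1.4, 1.8, 1.6, 0.3]

# Strict timeout: three slow replies in a row trigger a restart.
print(restart_triggered(busy_latencies, timeout_s=1.0, failure_threshold=3))
# Relaxed timeout: the same replies arrive in time, so no restart is triggered.
print(restart_triggered(busy_latencies, timeout_s=3.0, failure_threshold=3))
```

With the same response times, only the stricter timeout leads to a restart, which is the behaviour the change described above is meant to avoid.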
We will continue to monitor this system closely in the coming period. Furthermore, we plan to implement additional redundancy and resources across which the service can load-balance its traffic in the future.