On 11 December, between 00:19 IST and 01:14 IST, an incident caused our platform services (application and APIs) to run in a degraded state and eventually become completely unavailable. Full functionality was restored by 01:14 IST.
All of our services are deployed to production using container orchestration. Containers are instantiated from Docker images that reside in a repository.
The underlying cause of this outage was incorrect logic in a script that runs periodically to clean up older container images in the repository. As a result, the latest image was unavailable for auto-recovery when some of the containers failed.
This incident highlighted a lack of sufficient checks in our infrastructure provisioning process. As an improvement, we have changed the lifecycle of our image repositories so that the latest image is always retained.
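To illustrate the kind of safeguard described above, here is a minimal sketch of a cleanup selection that can never pick the newest image for deletion. It is illustrative only and is not our production script or repository configuration; the names `Image`, `select_images_to_delete`, `max_age` and `keep_latest` are hypothetical, and the age-based rule is an assumption made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Image:
    """Hypothetical record for one image in the repository."""
    tag: str
    pushed_at: datetime


def select_images_to_delete(images, now, max_age, keep_latest=1):
    """Return images older than max_age, excluding the newest keep_latest.

    The newest images are set aside *before* the age rule is applied, so
    they are retained even if every image in the repository is old.
    """
    if keep_latest < 1:
        raise ValueError("keep_latest must be at least 1")

    # Newest first, so the protected images are at the front of the list.
    newest_first = sorted(images, key=lambda img: img.pushed_at, reverse=True)
    candidates = newest_first[keep_latest:]

    cutoff = now - max_age
    return [img for img in candidates if img.pushed_at < cutoff]
```

Separating the "always keep" set from the age check means a mistake in the age rule can, at worst, delete too many old images, but not the one needed for auto-recovery.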
We apologise for the disruption to service caused by this incident, and thank you for your trust in us.