Platform outage issue
Incident Report for Instamojo
Postmortem

Summary:

On 11 December, between 00:19 IST and 01:14 IST, an incident caused our platform services (application and APIs) to run in a degraded state and eventually become completely unavailable. Full functionality was restored by 01:14 IST.

Background:

All of our services are deployed to production using container orchestration. Containers are instantiated from Docker images that reside in a repository.

Cause:

The underlying cause of this outage was incorrect logic in a script that runs periodically to clean up older container images in the repository. As a result, the latest image was unavailable for auto-recovery when some of the containers failed.

What we are changing going forward:

This incident highlighted the lack of sufficient checks in our infrastructure provisioning process. As an improvement, we have changed the lifecycle of our image repositories so that the latest image is always retained.
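
To illustrate the kind of retention rule described above, here is a minimal sketch in Python. It is not the actual cleanup script or repository configuration: the `Image` type, the `images_to_delete` function, the 30-day age threshold, and the `keep_latest` parameter are all hypothetical names and values chosen for illustration. The idea is simply that the newest image is set aside before any age-based filtering is applied, so it can never be selected for deletion.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Image:
    # Hypothetical record for an image in the repository.
    tag: str
    pushed_at: datetime


def images_to_delete(images, now, max_age=timedelta(days=30), keep_latest=1):
    """Return images older than max_age, excluding the newest keep_latest images.

    The threshold and parameter names are assumptions for this sketch.
    """
    # Sort newest first so the protected images sit at the front of the list.
    ordered = sorted(images, key=lambda img: img.pushed_at, reverse=True)
    # Only images beyond the protected window are eligible for cleanup.
    candidates = ordered[keep_latest:]
    return [img for img in candidates if now - img.pushed_at > max_age]
```

With this ordering, even if every image in the repository is older than the threshold, the most recently pushed one is still retained and stays available for container recovery.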

We apologise for the disruption in service caused by this incident, and thank you for your trust in our incident communication.

Posted Dec 11, 2018 - 18:33 IST

Resolved
We have identified elevated 5xx errors on our platform services (www.instamojo.com and api.instamojo.com/v2/) between Dec 11th, 2018 00:19 IST and Dec 11th, 2018 01:14 IST.

During this period, all requests to the website and APIs failed.

All services are now operational and we will soon be publishing a postmortem report of the incident.
Posted Dec 11, 2018 - 00:19 IST
This incident affected: www.instamojo.com and api.instamojo.com.