Rapid growth in the adoption of our software products and services is a key goal. As usage increases and the software evolves, disruptions inevitably become more frequent. Responding effectively to these incidents is critical to ensuring continued acceptance of our products and upholding the team's reputation and credibility.
An incident is an event that requires an immediate emergency response, such as
- Disruption or a reduction in the quality of a service
- Data loss of any kind
- Cybersecurity incidents
A good rule of thumb: treat any event that is security-related, or that requires a hotfix or rollback, as an incident.
Examples of security incidents include
- triggering of security-related alerts (HTTP 401 Unauthorised / 403 Forbidden)
- suspected cyberattacks such as denial-of-service, and
- security bugs reported by cybersecurity professionals.
For the avoidance of doubt, scanning or fuzzing activity that merely triggers HTTP 404 Not Found alerts, without affecting system uptime or resulting in successful malicious traffic, does not count as a cybersecurity incident.
The number of available team members dictates the approach to on-call scheduling. It determines how you provide daytime and nighttime coverage, and how you strike a balance between on-call obligations and each team member's personal time.
In smaller teams, product managers and core engineers often take all of the on-call shifts themselves; they know the entire system well and can triage and fix problems on their own.
When two people are available, you can schedule weekly rotations, with each person alternating full weeks on call. Being on call and dealing with alerts for an entire week, however, can be exhausting.
Scheduling becomes easier with three or more team members participating in rotations, allowing for periods of rest and recovery. More team members also allow for alternative shift patterns and even greater flexibility.
Give teams a say in the scheduling model so that the one chosen works best for the individual members of the team. Technical leads should have enough freedom to set up schedules that let team members attend to family needs, take care of their health, or pursue training in their off-hours, and to make adjustments as needed to maintain a positive attitude and high morale.
Note also that on-call engineer and incident manager are roles; they can be filled by the same person, especially in smaller teams. This guide is not prescriptive about scheduling, so discuss with your team and set up an on-call schedule that works well for everyone. For example, because release days are typically riskier, the team may decide that a dedicated incident manager is not needed on all days.
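As a minimal sketch of a weekly round-robin rotation for a three-person team (the names, start date, and rotation length below are hypothetical, not prescribed):

```python
from datetime import date, timedelta

# Hypothetical three-person rotation; adjust names, start date and shift
# length to whatever the team agrees on.
engineers = ["Alice", "Bob", "Chandra"]
start = date(2024, 1, 1)  # first day of the first shift

def on_call_for(week_index: int) -> str:
    """Return the engineer on call for a given week (simple round-robin)."""
    return engineers[week_index % len(engineers)]

# Preview the next six weeks of the schedule.
for week in range(6):
    week_start = start + timedelta(weeks=week)
    week_end = week_start + timedelta(days=6)
    print(f"{week_start} to {week_end}: {on_call_for(week)}")
```

Teams that prefer shorter shifts can swap the weekly step for a daily one; the point is only that the rotation is explicit and visible to everyone.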
- Have we described the actual impact on customers?
- Did we say how many internal and external customers are affected?
- If the root cause is known, what is it?
- If there is an ETA for restoration, what is it?
- When & where will the next update be?
- Ensure preferred notification channels are open and working
- Ensure access to the devices needed to respond to alerts, such as a laptop and a mobile phone with Slack (carry these with you if you go out)
- Ensure reliable internet connectivity
- No alcohol
- Acknowledge the alert
- Determine the urgency of the alert and verify that there is an incident
- Once an incident is confirmed, immediately escalate it to team members, D/OGP, GITSIR ([email protected]) and GIROC ([email protected])
- Determine and execute a recovery plan
- Rollback (preferred) or hotfix
- Scale resource bottleneck
- Log important actions undertaken
- Verify the issue is resolved in production through end-to-end checks (see the sketch at the end of this list)
- Conduct and document postmortems
- Create & assign tickets where necessary
- Keep documentation and runbooks up to date
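As an illustration of a basic end-to-end check, the sketch below polls a hypothetical health endpoint; the URL and the 200-response criterion are assumptions, and real checks should exercise the actual user-facing flows affected by the incident.

```python
import sys
import urllib.request

# Hypothetical health endpoint; replace with the real user-facing flows
# (login, key API calls, etc.) that the incident affected.
HEALTH_URL = "https://app.example.gov.sg/health"

def check_health(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint responds with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts and HTTP errors (URLError is an OSError).
        return False

if __name__ == "__main__":
    healthy = check_health(HEALTH_URL)
    print("service healthy" if healthy else "service still degraded")
    sys.exit(0 if healthy else 1)
```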
- Ensure preferred notification channels are open and working
- Internal Slack channels for team members
- Workplace, email and in-app banners for users
- Ensure access to the devices needed to respond to alerts, such as a laptop and a mobile phone with Slack (carry these with you if you go out)
- Ensure reliable internet connectivity
- No alcohol
- Acknowledge the alert
- Collaborate with the team to determine the urgency of the alert
- Once an incident is confirmed, immediately escalate it to team members, D/OGP, GITSIR ([email protected]) and GIROC ([email protected])
- Inform affected users and stakeholders, via appropriate communication channels, that the team is investigating the issue and will update them soon
- Log important actions undertaken
- Communicate periodically (every 15 to 30 minutes) to keep internal and external stakeholders informed until the incident has been resolved
- Conduct and document postmortems
- Share postmortems with GITSIR, GIROC and affected users
R = Responsible, A = Accountable, C = Consulted, I = Informed
| Incident Stage | Incident Manager | On-call Engineer | D/OGP | GITSIR/GIROC |
| --- | --- | --- | --- | --- |
| Acknowledge alert & confirm incident | R | R | A | - |
| Escalate incident | R | R | A | I |
| Assess impact | A | R | I | I |
| Inform users | R | C | A | I |
| Execute recovery | A | R | I | I |
| Communicate periodically | R | C | A | I |
| Verify recovery | A | R | I | I |
| Conduct & share postmortem | A | R | I | I |
A runbook is the documented form of a team's procedures for conducting a task or series of tasks. When an incident is detected, containing the event and returning to a known good state are important elements of the response plan; the remediation might be as simple as redeploying the affected resources with the proper configuration. To make this possible, each team should plan ahead and define its own response procedures as runbooks, and maintain and improve them over time as application functionality grows and new incidents are uncovered.
A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and planned follow-up actions to prevent the incident from recurring.
Writing a postmortem is not punishment—it is a learning opportunity for the entire team.
For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment.
Incident tracking aims to improve stability and reduce the number of incidents in the long term. Key Performance Indicators to be measured month-on-month are:
- Number of incidents with root cause attributable to the project team
- Average time to incident detection
- Average time to mitigation from detection
- Average time to recovery from detection
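As a minimal sketch of how these averages might be computed from incident records (the field names and timestamps below are assumptions, not a prescribed schema):

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; field names are illustrative only.
incidents = [
    {
        "started": datetime(2024, 3, 1, 9, 0),     # when the fault began
        "detected": datetime(2024, 3, 1, 9, 12),   # when the alert fired
        "mitigated": datetime(2024, 3, 1, 9, 40),  # impact contained
        "recovered": datetime(2024, 3, 1, 10, 5),  # service fully restored
    },
    # ... one entry per incident in the month
]

def avg_minutes(pairs):
    """Average the gaps (in minutes) between each (earlier, later) timestamp pair."""
    return mean((later - earlier).total_seconds() / 60 for earlier, later in pairs)

mttd = avg_minutes((i["started"], i["detected"]) for i in incidents)
mttm = avg_minutes((i["detected"], i["mitigated"]) for i in incidents)
mttr = avg_minutes((i["detected"], i["recovered"]) for i in incidents)

print(f"Avg time to detection:  {mttd:.0f} min")
print(f"Avg time to mitigation: {mttm:.0f} min")
print(f"Avg time to recovery:   {mttr:.0f} min")
```

The first KPI, the number of incidents attributable to the project team, would simply be a count over the same records filtered by root cause.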
The highest priority in an incident is to restore service. Maintaining availability and quality of service is more important than understanding the root cause of the issue.
When possible, a rollback should be the preferred option, as it offers the fastest route to service recovery.
A rollback can be performed when these conditions have been met:
- The last state of the system is verified and known to be in working order (this is not always the case)
- The failure is caused by a change that can be rolled back without backwards-compatibility problems
A full rollback is the fastest and safest option, since no new code needs to be written and tested. It can be performed by deploying the last known-good build artifact (e.g. a Docker image) to the deployment infrastructure. However, this requires backwards compatibility between the current and previous release versions.
A partial rollback is similar, but still safer than a hotfix: it is performed by reverting the specific Git commit that is problematic and then redeploying the application through the usual CI/CD pipeline.
- There is a security bug with login, and the code was written over a year ago
- A change to the API broke backwards compatibility, and there are both old and new clients released in the wild
- A new feature that users rely on required adding a new field to the database schema, and is not backwards compatible
If the application cannot be rolled back, a hotfix will have to be written.
Fixing forward has significant downsides:
- Hotfixes are usually rushed out during high-stress situations, and code written in haste usually escapes the scrutiny and testing expected of regular changes
- The point fix usually incurs significant system entropy and technical debt
- There is a possibility of breaking something else
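To tie the options in this section together, here is a minimal sketch of the recovery-strategy decision; the boolean predicates are assumptions standing in for the team's own judgment during an incident.

```python
def choose_recovery_strategy(last_release_known_good: bool,
                             change_is_backwards_compatible: bool,
                             bad_commit_identified: bool) -> str:
    """Pick a recovery strategy in the order of preference described above:
    full rollback, then partial rollback, then hotfix (fix forward)."""
    if last_release_known_good and change_is_backwards_compatible:
        # Redeploy the last known-good build artifact (e.g. a Docker image).
        return "full rollback"
    if bad_commit_identified and change_is_backwards_compatible:
        # Revert the problematic commit and redeploy via the usual CI/CD pipeline.
        return "partial rollback"
    # Otherwise write, test and ship a hotfix, accepting the downsides listed above.
    return "hotfix (fix forward)"

# Example: an API change broke backwards compatibility with clients in the wild,
# so neither form of rollback is safe.
print(choose_recovery_strategy(last_release_known_good=True,
                               change_is_backwards_compatible=False,
                               bad_commit_identified=True))
# -> hotfix (fix forward)
```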