Backups and disaster recovery
A reliable system survives hardware failures, accidental deletions and attacks. That requires tested backups (a backup you don't know how to restore is not a backup) and a disaster recovery plan. Recovery is measured against two objectives:
- RPO (Recovery Point Objective): how much data you can afford to lose, expressed in time. An RPO of 1 hour means losing at most the last hour of data, so you need to take a copy at least once an hour.
- RTO (Recovery Time Objective): how long the service can be down while you restore it.
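The two objectives above can be checked mechanically. The following is a minimal sketch, not a monitoring tool: the function names `rpo_met` and `rto_met` and the sample timestamps are hypothetical, and it assumes you already record when the last successful backup finished and how long a rehearsed restore takes.

```python
from datetime import datetime, timedelta


def rpo_met(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if a failure right now would lose no more data than the RPO allows."""
    return now - last_backup <= rpo


def rto_met(restore_duration: timedelta, rto: timedelta) -> bool:
    """True if a measured (rehearsed) restore fits within the RTO."""
    return restore_duration <= rto


# Hypothetical values for illustration only.
now = datetime(2024, 1, 1, 12, 0)
one_hour = timedelta(hours=1)

print(rpo_met(datetime(2024, 1, 1, 11, 30), now, one_hour))   # True: backup is 30 min old
print(rpo_met(datetime(2024, 1, 1, 10, 0), now, one_hour))    # False: backup is 2 h old
print(rto_met(timedelta(minutes=45), timedelta(minutes=30)))  # False: restore is too slow
```

The restore duration only means something if it comes from an actual restore drill, which is the point of insisting on tested backups.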
SLA, SLO and error budget
- SLA (Service Level Agreement): the contract with the customer, with consequences (e.g. refunds) if it isn't met. It is an external commitment.
- SLO (Service Level Objective): the internal objective you set yourself (e.g. 99.9% availability). It is usually stricter than the SLA to leave margin.
- Error budget: the amount of unreliability the SLO allows, that is, what you have "left to spend". If your objective is 99.9%, the remaining 0.1% of allowed failures is your budget. While there is budget left, you can deploy and take risks; once it runs out, you freeze changes and focus on stability.
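To make the error budget concrete, here is a small sketch that converts an availability SLO into minutes of allowed downtime. The 30-day window, the function names and the downtime figure are assumptions chosen for illustration; the text above only fixes the 99.9% example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO given as a fraction (0.999 = 99.9%)."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Minutes of budget left after the downtime already spent in the window."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


def can_deploy(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """Policy from the text: keep deploying while budget remains, freeze when it is spent."""
    return budget_remaining(slo, downtime_minutes, window_days) > 0


print(round(error_budget_minutes(0.999), 1))  # 43.2 minutes per 30 days
print(can_deploy(0.999, downtime_minutes=30))  # True: budget left
print(can_deploy(0.999, downtime_minutes=50))  # False: budget exhausted, freeze changes
```

A 99.9% objective over 30 days therefore leaves about 43 minutes of downtime to spend.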
Incident response and blameless postmortems
When something breaks, you follow an incident response process: detect, declare the incident, mitigate, communicate and resolve. Afterwards you write a postmortem: what happened, the impact, the root cause and the actions that will prevent a recurrence.
The postmortem is blameless: it analyzes systems and processes rather than looking for someone to blame. Only then will people say what really happened, and only then does the organization learn.
Safe deployments: blue-green, canary and rollback
Deploying is one of the riskiest moments. Strategies to reduce that risk:
- Blue-green: you keep two identical environments. One (blue) serves the traffic while you deploy to the other (green); when green is ready, you switch all the traffic over at once. If the new version fails, you switch back to blue, which is still running, almost instantly.
- Canary: you release the new version to a small percentage of users (the "canaries"). If the metrics stay healthy, you expand little by little up to 100%; if they worsen, you stop it before affecting everyone.
- Rollback: the ability to quickly go back to the previous stable version when a deployment goes wrong. Having a fast, rehearsed rollback is what makes it safe to deploy often.
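The canary and rollback ideas above can be sketched together as a small control loop. This is an illustrative sketch, not a deployment tool: `canary_rollout`, `set_traffic_percent`, `is_healthy` and the step percentages are all hypothetical names and values, and a real rollout would wait and collect metrics between steps.

```python
from typing import Callable, Sequence


def canary_rollout(
    set_traffic_percent: Callable[[int], None],
    is_healthy: Callable[[], bool],
    steps: Sequence[int] = (1, 10, 50, 100),
) -> bool:
    """Shift traffic to the new version step by step; roll back on the first bad check.

    Returns True if the new version reached the final step, False if it was rolled back.
    """
    for percent in steps:
        set_traffic_percent(percent)
        # In practice you would wait here long enough for metrics to be meaningful.
        if not is_healthy():
            # Rollback: send all traffic back to the previous stable version.
            set_traffic_percent(0)
            return False
    return True


# Simulated run: the new version looks healthy until it receives 50% of the traffic.
history: list[int] = []
ok = canary_rollout(history.append, lambda: history[-1] < 50)
print(ok, history)  # False [1, 10, 50, 0]
```

Passing `steps=(100,)` gives the all-at-once switch of a blue-green deployment, with the same rollback path if the health check fails.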