Navigating the Chaos: A Holistic Approach to Incident Management

Incident management can be challenging: unexpected issues throw you curveballs, resulting in data loss, downtime, and money and hours of sleep going to waste. But there are practical things you can do to make the process smoother and handle it better.
Remember when we were at school, and people said: "Actively listening in class guarantees 50% prep for the upcoming test"?
The same goes for being proactive at work in ways that will prepare you to manage incidents better (at night or in general).
In this talk, I'll cover the proactive steps you can take and incorporate into your day-to-day routine to prepare for a smoother and more efficient incident management process.
I will also show the best practices I've refined over the years that helped me get a clear vision of how to manage production incidents in the quickest and most efficient way possible.
Embracing the tips I'll give you will guarantee you'll not only talk the talk but also walk the walk when it comes to incident management.

FAQ

Incident management is a set of procedures and actions taken to resolve a critical incident. It involves detecting and communicating incidents, assigning responsibility for handling them, utilizing tools for investigation and response, and executing steps towards resolution.

A proactive approach helps in managing production incidents by preparing for incidents before they occur, improving the mean time to resolution, reducing costs by minimizing downtime, and preserving the business's customers and reputation.

Having a business mindset in incident management means understanding the broader impact of incidents on the business, which includes the potential loss in revenues, customers, data, and reputation. This mindset helps in prioritizing actions and decision-making processes that align with business goals.

The structured process in incident management includes five pillars: identify and categorize the incident, notify and escalate to the relevant parties, investigate and diagnose the issue, determine and apply remediation steps, and review and update the incident runbooks and alerts post-incident.
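
As a rough illustration, the five pillars can be modeled as an ordered sequence of phases, where an incident moves to the next phase only after the current one is done. This is a minimal sketch: the `Pillar` enum and the `next_pillar` helper are hypothetical names chosen for this example, not something presented in the talk.

```python
from enum import Enum
from typing import Optional


class Pillar(Enum):
    """The five pillars, in the order they are listed above (names are illustrative)."""
    IDENTIFY_AND_CATEGORIZE = 1
    NOTIFY_AND_ESCALATE = 2
    INVESTIGATE_AND_DIAGNOSE = 3
    REMEDIATE = 4
    REVIEW_AND_UPDATE = 5


def next_pillar(current: Pillar) -> Optional[Pillar]:
    """Return the pillar that follows `current`, or None after the last one."""
    members = list(Pillar)  # Enum members iterate in definition order
    index = members.index(current)
    return members[index + 1] if index + 1 < len(members) else None
```

Keeping the phases in one ordered type makes it harder to skip a step, such as jumping to remediation before anyone has been notified.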

A postmortem after a critical incident should include a detailed discussion about what went wrong, what was done to resolve the issue, lessons learned, and actions that can be taken to prevent future incidents. It's important to conduct this in a blame-free manner to focus on improvement.

The philosophy 'Everything fails all the time' reminds us that failures are inevitable in any system. Adopting this mindset in incident management prepares teams to handle failures proactively rather than reactively, minimizing downtime and reducing the impact on business operations.

Best practices for conducting a war room during a critical incident include maintaining focus on relevant information, assigning clear roles and responsibilities, reducing noise by limiting unnecessary participation, and ensuring documentation like runbooks are available and up to date for efficient handling of the incident.

On-call shift handoffs improve incident management by providing subsequent teams with a summary of incidents handled, actions taken, and any unresolved issues. This ensures continuity and preparedness, helping teams respond more effectively to incidents.

Hila Fish
26 min
15 Feb, 2024

Video Summary and Transcription

This talk covers the importance of a structured process for incident management and the need for a business mindset. It outlines a five-pillar structured process and emphasizes the importance of staying calm and asking the right questions during incidents. The talk also highlights the importance of effectively identifying, categorizing, and investigating incidents, as well as prioritizing root causes and communicating incident resolutions. Additionally, it discusses the role of incident managers, proactive measures for continuous improvement, and the importance of preparation and a proactive mindset.

1. Introduction to Incident Management

Short description:

In this part of the talk, I will cover the importance of a structured process for incident management and the need for a business mindset. I will also emphasize the rule of knowing that everything fails all the time.

Hi everyone. Thank you for joining my talk about navigating the chaos, aka production incidents. When I was in high school, the common belief was that if you actively listened in class, you would have 50% of the exam prep already in your pocket. I want to show you how I adapted this belief into an actual proactive approach that will help you manage incidents more efficiently, in a more structured way, and eventually preserve much-needed hours of sleep.

But first of all, hi, my name is Hila Fish. I'm a senior DevOps engineer. I have a lot of things to say about myself, but the most important thing that you need to know about me in terms of this presentation is that I handled a lot of production incidents, both when I was on call and when I wasn't, at big corporates and at startups. I've seen a lot of things. So this is why I'm able to bring the things that I learned along the way to this presentation.

So let's cover the agenda for today. We will first of all cover the mindset that you should have in order to manage incidents efficiently. Then we will cover the incident flow, aka a structured process that you can follow in order to manage incidents efficiently. Finally, we will cover being proactive: things that you can do in your day-to-day work and after an incident takes place that will help you come prepared for the next incident.

So first of all, let's set a baseline here. Incident management is a set of procedures and actions taken to resolve a critical incident. It is an end-to-end process that defines how incidents are detected and communicated, who is responsible for handling them, what tools are used to investigate and respond to them, and what steps are taken towards a solution. And a thing that we really need to think about when we come to deal with incidents in production is that not all pages that you get through Opsgenie or PagerDuty or any other tool that you're using become an incident. When is it an incident? When you have a loss or potential loss in revenues, customers, data, or reputation.
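
The rule of thumb above, that a page becomes an incident only when there is a loss or potential loss in revenues, customers, data, or reputation, could be sketched roughly like this. The `PageImpact` class and `is_incident` function are hypothetical names for illustration; real paging tools expose their own fields and severity models.

```python
from dataclasses import dataclass


@dataclass
class PageImpact:
    """Actual or potential business impact of a page (hypothetical model)."""
    revenue: bool = False
    customers: bool = False
    data: bool = False
    reputation: bool = False


def is_incident(impact: PageImpact) -> bool:
    """A page counts as an incident if any business dimension is at risk."""
    return any((impact.revenue, impact.customers, impact.data, impact.reputation))
```

For example, `is_incident(PageImpact(data=True))` is true, while a page with no business impact at all stays a regular alert.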

And if we don't have an incident management process, if we are just in an ad hoc, putting-out-fires kind of approach and mindset, it means that we will potentially lose valuable data, downtime could lead to reduced productivity and revenues, and the business could be held in breach of service level agreements. Every company has its own SLAs and we want to avoid breaching those SLAs. So it means that we need to avoid being in an ad hoc manner of, okay, something happened, now we need to put out this fire. We need to have a structured process towards resolving incidents.

And how do we do that? First of all, by reframing our perspective: we need to have a business mindset. Meaning that whenever you deal with something, whenever you implement something at work, whenever you do anything at work, you need to think not only about the systems that you are incorporating and implementing, but also about the why. Why are we doing things in a certain way? Why do we have this system? How does it help us do things and how does it help the business succeed? So a business mindset is needed in order to grasp the overall impact of incidents and mitigate damages accordingly. And this is why it has to be a structured process. This is where you will incorporate the business mindset and make sure that things are handled as quickly as possible for the sake of the business.

And in order to really, let's say, manage incidents efficiently, the number one rule of managing incidents and to be a better engineer in general is to know that everything fails all the time. Who said this, by the way? You would ask. First of all, me. I said that a lot throughout my career.

2. Structured Process and Types of People

Short description:

Incidents are mayhem by nature. Having a structured process will help prevent incidents, improve mean time to resolution, reduce costs, and preserve the business and reputation. We will follow a structured process with five pillars. I will share questions to ask and answer in each pillar to progress towards incident resolution. Two types of people in incident management: those who stay calm and those who can't. Asking the right questions will help you stay calm and progress through each phase.

But first of all, I think it is very odd to quote myself. And second of all, I found someone with a little bit more credit in the industry than me: AWS CTO Werner Vogels. So take his word for it. Everything fails all the time. Production systems, development environments, pipelines, things that we build, things that we buy, systems that we rely on to know that production is down, aka our monitoring systems. Even us as human beings, we crash and we need to sleep and then restart ourselves. Everything fails all the time.

And that's exactly it. Incidents are mayhem by nature. But if we have this fact, if we know that failures are a given because everything fails all the time, then we can't be in an ad hoc manner of putting out fires. We can say, OK, this happened, but I'm prepared to deal with it. So this is the whole idea of having a structured process. A structured process will help lead to incident prevention, improved mean time to resolution, cost reduction because downtime was reduced or eliminated entirely, and preservation of our business's customers and reputation.

And how do we do that? We have a structured process that we can follow here, five pillars, and we'll go over each pillar and I will show you questions that you can ask and answer in each pillar in order to progress to the next one.

But beforehand, I want to show you two types of people that I met in my entire journey of managing production incidents. First of all, the type that says, "Keep calm, I'm an engineer." And the other type that says, "I can't keep calm, I'm an engineer." These are the two types that I met. And I say that you can keep calm if you ask yourselves the questions I'm going to share with you in each pillar. It will help you progress towards the next phase, up until the incident resolution.
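
To make the "improved mean time to resolution" benefit concrete, here is a minimal sketch that assumes MTTR is measured as the average time between an incident being detected and being resolved. The function name and the `(detected, resolved)` tuple format are hypothetical choices for this example; teams often define the start and end points differently.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_resolution(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) over all incidents.

    Each incident is a (detected, resolved) pair; this pairing is an assumption
    made for the sketch, not a format prescribed by the talk.
    """
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)
```

Tracking this number over time is one way to check whether a structured process is actually shortening incidents.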
