Accelerating Code Quality with DORA Metrics


What are the benefits of becoming an Elite Performer? The answer can be complicated. Join me today for a short journey to accelerate code quality with Google's DORA metrics and continuous code improvement, and drive higher software delivery and team performance.


Hi everybody, my name is Niko Kruger and I'm going to talk to you today about accelerating code quality with the DORA metrics. I've spent the last 13 years working with software companies all over the world, helping them improve their software quality, with a specific focus on quality-critical applications. Today I'm serving as the Senior Director of Solutions Engineering here at Rollbar.

For today's agenda, we're going to look at three things. First, we'll look at what DORA is and how it relates to continuous code improvement. Second, we'll look at the benefits you and your organization will get once you become an elite performer, as you move from the low, medium, and high categories into the elite performing category. Third, we'll look at the journey to accelerate your code quality with continuous code improvement, and at some of the benefits you can expect once your organization reaches high and elite performing status.

So let's first take a look at what the DORA metrics are. DevOps Research and Assessment (DORA) is a Google team that ran a multi-year research program into what drives some of the highest performing organizations around the world. They looked at the technical, process, measurement, and cultural capabilities that make up these high performing teams. And they looked not only at high performing software delivery teams, but also at organizations that are delivering from a revenue perspective back to their customers, making sure that the products these high performing teams put out actually result in real, tangible business results. From that research, they identified a number of key metrics that you can use in your organization to understand where you are today and what you need to do to move up the chain and become an elite performer.
The first is lead time: the time it takes to get features into the hands of your customers, from taking an idea all the way through your pipeline into your customers' hands. That means giving them real value, where they can use these features very quickly, day in and day out, and start to give you feedback on how the features are working and, for a revenue-facing application, how they affect your revenue.

The next is deployment frequency: how often these organizations are deploying. And once they do deploy, what is their change failure rate? How often do these releases fail? For example, if you start to release more often, are you in fact failing more often or not? And for those failures, they also looked at the time to restore service: how long does it take you to roll back a failed deploy and get your customers back to using your solution? There's also availability, but that's not part of the metrics for today's discussion.

So what do we actually measure when it comes to the DORA metrics? There are a couple of performance categories for those four key metrics. If we take a look at deployment frequency, low performers are deploying once a month to once every six months. As you move up to the high and elite performers, the elite performers are deploying on demand, multiple times a day. In fact, at Rollbar, we've seen some of our customers release more than 4,000 times in a given week. That gives you a sense of how quickly and how often they can get changes, features, and capabilities into the hands of their customers. And that helps you understand, as you take smaller changes and get them to customers: what value are they gaining from it? Is it something of value to them?
And will it help your application or your business move forward? For the lowest performers, that is of course a very long cycle, and they have to bunch up lots of large deploys, often without big changes, into a couple of deploys a year.

Now, if you start to look at how long it takes to get these changes into the hands of customers, we look at the lead time for change. Elite performers can in some instances do this in less than a day. Of course, that depends on how big a change or feature request is, but these elite teams are breaking work down into smaller chunks that they deliver into the hands of their customers very, very quickly. On the lower performing end of the scale, it's over a month to six months to get these changes all the way through the pipeline, tested, validated, and into customers' hands. That can negatively affect revenue, as your customers have to wait longer and your competitors could be out-innovating you on a 10 to 1 basis.

As we push more of these changes live, one might expect the change failure rate to be very high. But that was actually not true. High and elite performers were anywhere from zero to 15% change failure rate, versus the lower performers at roughly 50 to 60%. So the more often we deploy, the better we actually get at it. And the time to restore reflected this: the high and elite performers had an almost identical process for restoring their services if something does go wrong, and for them this is somewhat automated, if not completely automated, whereas low performers took more than a week to get these services back, through a very manual process.
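To make the four metrics concrete, here is a minimal sketch of how you might compute them from your own deployment and incident records. The record shapes (`deploys`, `incidents`) and field names here are assumptions for illustration, not part of DORA or any specific tool.

```python
from datetime import datetime, timedelta

# Hypothetical deployment records: commit time, deploy time, and whether the deploy failed.
deploys = [
    {"committed": datetime(2021, 6, 1, 9), "deployed": datetime(2021, 6, 1, 11), "failed": False},
    {"committed": datetime(2021, 6, 1, 10), "deployed": datetime(2021, 6, 1, 14), "failed": True},
    {"committed": datetime(2021, 6, 2, 9), "deployed": datetime(2021, 6, 2, 10), "failed": False},
]
# Hypothetical incident records: when service broke and when it was restored.
incidents = [
    {"down": datetime(2021, 6, 1, 14), "restored": datetime(2021, 6, 1, 14, 30)},
]

days_observed = 2

# 1. Deployment frequency: deploys per day over the observation window.
deployment_frequency = len(deploys) / days_observed

# 2. Lead time for change: average commit-to-deploy time.
lead_time = sum((d["deployed"] - d["committed"] for d in deploys), timedelta()) / len(deploys)

# 3. Change failure rate: share of deploys that failed in production.
change_failure_rate = sum(d["failed"] for d in deploys) / len(deploys)

# 4. Time to restore service: average outage duration.
time_to_restore = sum((i["restored"] - i["down"] for i in incidents), timedelta()) / len(incidents)

print(deployment_frequency, lead_time, change_failure_rate, time_to_restore)
```

With data like this collected over weeks or months rather than a one-off snapshot, you can place your team in the low, medium, high, or elite band for each metric and watch the trend.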
So if you start to measure your team against these metrics, what are some of the benefits you can expect once you become a high or elite performer? The DORA group found some quite phenomenal benefits for these elite and high performing teams.

The first is that they were deploying 208 times more frequently than those on the lower end of the scale, which tells us they have the ability to push features and capabilities into the hands of their customers faster. They were also roughly 106 times faster than the low performers at getting changes through, which means they can get features, and valuable revenue-generating parts of the application, into the hands of their customers sooner. That's something to really look at if you want to outperform your competitors in today's market.

What also came out of this research was these teams' ability to recover very quickly: they were 2.6 times faster to recover if something does go wrong, and they had a 7 times lower change failure rate. They were very good at releasing working software in these smaller changes on a regular basis. So even though they're releasing multiple times a day, if not thousands of times a week, they were very good at making sure the quality of these releases was a lot better than on the low performing side.

One other quite interesting part of this research was that they were spending 50% less time fixing security issues. These elite performers were building better security into their applications from the get-go and were able to adjust and update them more frequently, so they spent less time here than the lower performers in this category.

So let's take a look at what continuous code improvement is and how it relates to the DORA metrics.
Having bugs in your application is a part of life; we all know they happen day in and day out. So what we really want to look at is how we improve throughout our pipeline. In our testing or continuous integration pipelines, how are we identifying these issues earlier? What do we know about these errors? How do we handle them? And how do we get better at understanding how to fix them very quickly? As we move into staging, we also want to reduce risk as we get closer to the actual release and into the hands of our customers. How do we reduce that risk? Again, by identifying errors at the code level, not only at, for example, the integration level or the other levels we might have today. And finally, as we get into production, do we have the ability to understand, as soon as an error happens, what that error is? Are we able to monitor these things, and how do we act upon them?

What we need to do is tie all of this together so that we can understand, all the way through our pipeline, what the true error rates are, what the real errors are, and how we fix them day in and day out. We need a process and a way to make sure our teams can move from low performing teams into high and elite performing teams, by giving them the confidence to release to production more often.

So how do we solve this problem? The first thing we've got to do is understand, when something goes wrong, what actually went wrong. That means we need to create what we call a fingerprint for an error. Almost like a human has a unique fingerprint, every single error should have a unique fingerprint. That's how we understand: what are you? Have we seen you before? Have I seen you in testing? Have I seen you in staging? Or have I seen you in production multiple times?
How many customers were impacted by this error? We need to know more about you as soon as you occur. Of course, if you're a developer and you're busy coding something, you have your IDE, you have your stack traces, and it's very easy to debug and solve a problem during the dev cycle. Once you get into staging and production, this becomes exponentially harder.

So what we need, as soon as something happens, is context. What was the line of code, and the surrounding code, that actually caused this problem? Who committed that code last? Do we have tracing information, if we're running microservices? What were the things that led up to this error, all the variables and everything that fed into it? Ideally, we want the full stack trace that we're used to as developers writing code day in and day out. And finally, as I've mentioned, we want to know who was impacted by an error. Was it one customer? Was it thousands of customers? That way we can keep our services up and running and in the hands of our customers, actually generating revenue for our businesses.

The way to solve this problem is, first of all, by getting down to the code level. A lot of applications today have monitoring outside the actual code, maybe on containers. What we need to do is look inside the code itself. We need to be in the application during runtime, as it's being executed and used by our customers, so we can identify issues in real time. That means within one to three seconds of something happening, we can collect that error, give it a fingerprint, and understand what it is and how important it is, meaning the severity and the impact of it, and then action that item. So what do we do about it?
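As an illustration of the fingerprinting idea (not Rollbar's actual algorithm, which as discussed later uses machine learning over billions of processed errors), a naive sketch might normalize a stack trace by stripping out values that vary between occurrences, then hash the result, so the "same" error always maps to the same identifier:

```python
import hashlib
import re

def fingerprint(stack_trace: str) -> str:
    """Naive error fingerprint: strip values that vary between occurrences
    (memory addresses, line numbers, ids), then hash the normalized trace."""
    normalized = re.sub(r"0x[0-9a-fA-F]+", "<addr>", stack_trace)  # memory addresses
    normalized = re.sub(r"\d+", "<n>", normalized)                 # line numbers, ids
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

# Two occurrences of the "same" error with different line numbers and
# addresses receive the same fingerprint...
a = fingerprint('File "app.py", line 42: KeyError at 0x7f3a12')
b = fingerprint('File "app.py", line 97: KeyError at 0x7f9bee')
# ...while a different error type gets a different one.
c = fingerprint('File "app.py", line 42: ValueError at 0x7f3a12')

print(a == b, a == c)  # → True False
```

Once every error carries a stable identifier like this, "have I seen you in testing, in staging, in production?" becomes a simple lookup, and impact can be counted per fingerprint rather than per raw occurrence.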
This is a great way to not only respond very quickly, because we can detect errors fast, but also to get a system back up and running again, because we actually know what went wrong and what the problems were. And once we know about these errors: can I see the line of code and the last committed change, and who made it? Can I create a ticket in my ticketing system? Can I see the surrounding logs and performance logs, if it was more of an environmental issue, for example? Do I have all the tracing information for my microservices? We need more and more of this information to help us automate the process.

As teams start to do this, we've seen some real benefits. We surveyed about a thousand developers around the world to ask what keeps them up at night, what worries them, and what benefits they've seen from continuous code improvement. One of the first things that stood out: about 34% of developers said the biggest risk of an error or bug, for them, is losing customers. We know customers are the lifeblood of our businesses, so we want to keep them engaged and able to use our software. Oftentimes, if a customer finds a problem, they won't even notify us; they might try to work around it, they may not log a ticket to get it resolved, and you can lose those customers. We need to give them a better user experience out of the gate, which means managing these issues as they come up, in real time and proactively, meaning we find them before a customer does, and can sometimes even fix them before then. The next finding: about 37% said they spend more than 25% of their time just fixing bugs.
That is a tremendous amount of wasted effort that could have been spent on feature development or improving other parts of the application, and is instead spent on rework, redoing work that's already been done. It becomes a tremendously big number when you translate it into developer cost, day in and day out, over a 12-month period or longer. So we really want to focus on making our engineers as productive as possible.

And finally, about 84% of developers said they're held back from moving into the high performing or elite category, by deploying more often, for example, because they just don't have what we call release confidence: the confidence that when they push these items live and something does go wrong, they'll know about it quickly, in real time via a notification rather than from a customer, and, more importantly, that they'll be able to do something about it. So how do we automate? Do we roll back? Maybe turn off a feature flag? How do we instill that confidence in our engineering groups?

When you undergo this journey, what are some of the things you should expect to get out of it? Most of our customers that have started this journey have seen, for example, successful deploy rates go up from around 50% to about 95%. That gives you much better confidence that you can release more often, that your deploys are going to be of good quality, and that you're able to handle any issues that do come up. One of the biggest things we hear is, "I just didn't know; I didn't have visibility when something went wrong." Oftentimes it'll be an hour or more, sometimes days, before we know there's a problem, because we're waiting for customers or sifting through logs.
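I asked above: do we roll back, or maybe turn off a feature flag? As a sketch of that kind of automated response, a pipeline could watch the post-release error rate and pick a mitigation when it crosses a threshold. Everything here, the threshold value, the `choose_mitigation` helper, and the action names, is hypothetical, for illustration only:

```python
from typing import Optional

def choose_mitigation(error_rate: float, flagged_feature: Optional[str],
                      threshold: float = 0.05) -> str:
    """Decide how to respond to a post-deploy error spike (hypothetical policy).

    If the errors are confined to a feature behind a flag, turning the flag
    off is cheaper than rolling back; otherwise roll back the deploy.
    """
    if error_rate <= threshold:
        return "no-op"                            # within normal bounds, keep the release
    if flagged_feature is not None:
        return f"disable-flag:{flagged_feature}"  # kill-switch just the feature
    return "rollback"                             # last resort: restore the previous version

print(choose_mitigation(0.01, None))            # healthy release
print(choose_mitigation(0.20, "new-checkout"))  # spike behind a flag
print(choose_mitigation(0.20, None))            # spike with no flag to flip
```

The point of release confidence is exactly this: when the detection side is fast and trustworthy, the response side can be codified rather than improvised.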
What we've seen from our customers is that the time to detect and respond to an error is often down to about five minutes in a production scenario. That's a massive time saving: if we know about something, we can respond to it and make sure we address the most impactful bugs very, very quickly. The other part we've seen is that we can take that 25 to 40% of time that developers spend fixing bugs and bring it down to about 10% of their time, because it's faster to know about these bugs, and faster to find the exact line of code where the problem is and fix it, versus going through logs or trying to figure out what actually went wrong by piecing it all together. A tremendous time saving is gained by looking at the code itself and understanding what's going wrong, and those benefits really add up over a 12-month period.

People always ask: how do we go about starting this process? You can start this journey pretty quickly; it's a single step to get going, and I'm going to run you through how you can do that with Rollbar. So what is Rollbar? Rollbar is a continuous code improvement platform. Fundamentally, we're going to give you visibility into your application as soon as an error happens, be it a handled or unhandled exception, and we'll help you do something about it. We'll make sure errors are actionable, we'll automatically group them together, and we'll remove the noise you're used to with traditional logging tools, where you're waiting for something to happen. With Rollbar, you get real-time visibility into errors that are actually actionable, with details that help you solve them a lot faster than you do today.

So how do you get started with Rollbar? You can sign up for a free account using either your GitHub account or your Google account. Once you've signed up into the Rollbar system, you can start to pick one of our many SDKs available.
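To give a feel for what "a couple of lines of code" is doing under the hood, here is a hypothetical, standard-library-only sketch of the core pattern an error-monitoring SDK like one of these implements: hook uncaught exceptions, package them into a payload (type, message, stack trace), and hand that payload off for reporting. The `payload_for` shape and `install_hook` helper are invented for illustration; real SDK configuration is per the vendor's docs.

```python
import sys
import traceback

def payload_for(exc_type, exc_value, tb):
    """Package an exception into a report payload (hypothetical shape)."""
    return {
        "type": exc_type.__name__,
        "message": str(exc_value),
        "stack": traceback.format_exception(exc_type, exc_value, tb),
    }

def install_hook(report):
    """Route uncaught exceptions through `report`, then the default handler."""
    previous = sys.excepthook
    def hook(exc_type, exc_value, tb):
        report(payload_for(exc_type, exc_value, tb))
        previous(exc_type, exc_value, tb)  # keep normal traceback printing
    sys.excepthook = hook

# Usage: collect payloads locally instead of sending them over the network.
captured = []
install_hook(captured.append)

# A handled exception can also be reported explicitly, the way SDKs expose
# a report-current-exception style call:
try:
    {}["missing"]
except KeyError:
    captured.append(payload_for(*sys.exc_info()))

print(captured[0]["type"])  # → KeyError
```

A real SDK does this same capture step plus batching, deduplication (fingerprinting), and transport to the backend, which is why enabling it is only a few lines in your application.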
We have SDKs for front end, back end, and mobile applications: think Python, JavaScript, .NET, Java, React, React Native, Android, iOS, and the list goes on. We have an SDK for all of these types of applications, and once you're ready, you can add them in. So, for example, here I can pick my SDK, and with a couple of lines of code I can enable Rollbar in my application today. Once you do this, you'll start to get visibility into those errors in real time. And once you've included that code, you can start to get telemetry: all the information you need to see what's going wrong, when it goes wrong. For those watching today, if you go to /gitnation, we've got a great offer for you to trial Rollbar and get some free time using the platform as well, so feel free to go ahead and use the /gitnation link. Thank you very much for your time today, and thanks for watching this section on accelerating code quality using the DORA metrics.

So for one, two, and three, nobody thinks it's very easy. Most people think it's around eight, which is hard, but not extremely hard: 33% responded with around eight, and 7% responded that it's very hard. But yeah, most people think it's between seven and eight, which is medium hard to a little bit harder. So what do you think about that, Nico? Well, it's certainly reflective of what we're seeing. I think overall we've gotten better as engineers and as software teams; we're definitely a lot better at running things through test environments and finding issues earlier, so we're improving. But when you get into production, for a lot of people, it's still hard.
And we actually did a developer survey of about 1,000 developers, mostly around North America with some international, and a lot of that data speaks to, and complements, what we're seeing with this poll. We haven't solved this problem; it's still pretty hard to resolve issues in production, specifically around the time it takes to understand what actually went wrong, especially as we hear about microservices and lots of systems coming together. We haven't really cracked this, and I believe there's definitely a better way for teams to start doing it. So it's certainly not surprising, and certainly something we've seen in our developer surveys. Yeah, it looks like the general answer is that it's a little bit hard.

So I have a question for you: how do you fingerprint errors? Do you mean using the Git commit hash? That's a great question. In Rollbar, when we say fingerprint, think of it like a human being: every person is unique. The way Rollbar does it, we go and inspect every single error, and we have some AI and machine learning running on and inspecting these errors. Through that machine learning we understand whether this is in fact a unique error or not, and that's how we give it a fingerprint. We don't have to look at the Git SHA or anything like that; it's Rollbar inspecting the errors themselves. We've actually processed in excess of 100 billion errors, so we know a lot about what makes an error unique. Is it the same? Have we seen it again? That's the process of how we create a fingerprint, that unique identification for any type of error, warning, or exception within Rollbar.

Okay, thank you. Another question is: how can I get a DORA score for my team? A DORA score is certainly something you can get today, but you've got to have the data. That's the first thing, right?
The DORA score and all those metrics really depend on having the data available. So the first thing you've got to do, looking at each of those four categories, is start very simply putting things in place so you can collect it. And I always say you've got to do this over time: you can't just take a once-a-year snapshot and say you're good. This is a weekly or monthly activity where you want to look at these scores. So first, implement things that will help you get the data. Once you have it, you can generate the metrics through Excel or through dashboarding tools; in fact, the DORA team has various dashboards available, and there are other tools you can use as well. You can also get data from something like Rollbar; we can give you some of those metrics if you report them into Rollbar. That's the place to start. And make sure you do this over time, because to get better we need a baseline, right? We need to understand: where were we, where are we going, is something working or isn't it? So we can visually start to see things, fix them, and get better over time.

Okay, thank you. So John Gautier asks: does it mean that if I'm using Rollbar in my team, I could have the same fingerprint as an error from another team in another company? No, an error is unique to your project. We know the same error might have existed with somebody else, so we know about the error, but it'll be unique to your project and your account. The good thing about Rollbar is that we're working on some very smart things for the future, where we'll know about an error because a lot of these exceptions are quite common; a lot of people run into the same things. So we're going to get better at showing you how to fix them, what the right way to fix them is, and the solutions most people arrive at to solve that problem.
But the fingerprints are unique to your account, so your projects will have their own unique fingerprints. In the backend, though, we learn a lot about errors. We understand them, and we remove all the variables, all the things that you might say, "okay, well, that's different for us," so we can get to the core exception and really understand what the exception is, what happened, and how to fix it.

Okay, perfect. Another question, from Eva: wouldn't these errors be captured by automated tests, in a unit test? What am I missing regarding the use case for Rollbar? Absolutely; Rollbar is actually used throughout the development lifecycle. You're right that in the ideal world we'd catch every single error in pre-production and testing, in fact in a unit test, but that's not reality. Most of our clients, in fact the large organizations releasing 4,000 times in a given week, for example, run Rollbar in their QA environments. They start to learn about these errors very early on so they can fingerprint them, and as they roll through into production, they know: have I seen this before, and what do I do about it? But you're right, some errors should be picked up in testing. Often, though, there's some variable that makes production slightly different: users use the system in a different way, or you might be interacting with external systems or microservices, and that causes things to slip through the cracks. Think of Rollbar as your safety net. It's the thing people look at first, as soon as they deploy, to know if something has gone horribly wrong. Hopefully it won't, but Rollbar is your safety net. And a nice thing about Rollbar is that if we knew about an error before, we can often tell you how you fixed it before, so your time to fix it is reduced greatly, which is quite nice.
But you definitely want something in production that'll help you catch the things you potentially missed in QA. Wow, that sounds great. And the next question is: can you help my team with release quality visibility? Absolutely. The goal is to release quality code quickly, code that's of value. Rollbar is the thing that'll help you understand what the true quality is. Why am I saying that? Because Rollbar is inside your application at runtime. As you release it and people start to use it, Rollbar will know about an error before anybody else does, and we'll tell you about that error before your customers even do. We'll give you that instant visibility into an error, and then you can do something about it and be proactive, meaning you can fix it before somebody logs a help desk ticket. Some of the biggest and best companies have that very proactive mindset of finding things before their customers do and then fixing them, which is really nice.

Thank you. And the last question, from Matthew, is: are you using Rollbar on Rollbar? Oh, absolutely. Rollbar is used heavily on Rollbar. In fact, we'd love for you to come and have a chat with us; we'll show you how we use it. We catch everything from our website to our internal Rollbar applications, and any piece of code we release via the open source projects we work on, we always put Rollbar on. That's helping us get better every single day at finding these errors. So absolutely, and we'd love to show you how we're using it. Nice, thank you. If anybody has more questions, feel free to contact Nico on Discord or post your questions in the Q&A channel. Thank you, Nico. Bye. Thank you so much, Liz. Bye-bye.
27 min
09 Jun, 2021
