What are the benefits of becoming an Elite Performer? The answer can be complicated. Join me today for a short journey to accelerate code quality with Google's DORA metrics & continuous code improvement to drive higher software delivery and team performance.
Accelerating Code Quality with DORA Metrics
AI Generated Video Summary
This talk discusses the DORA metrics and their relation to continuous code improvement. High and elite performers deploy more frequently and have a lower change failure rate. Continuous code improvement involves identifying and fixing bugs in real time. Rollbar is a continuous code improvement platform that provides visibility into actionable errors. It helps organizations reduce the risk of losing customers and optimize developer productivity. Rollbar's unique error fingerprints and advanced features enable a deeper understanding of issues and faster resolution.
1. Introduction to DORA Metrics
Hi, everybody. My name is Niko Kruger. Today, I'm going to talk about accelerating code quality with the DORA metrics. We'll explore what DORA is and its relation to continuous code improvement. We'll also discuss the benefits of becoming an elite performer and the journey to accelerate your code quality.
Hi, everybody. My name is Niko Kruger. And I'm going to talk to you today about accelerating code quality with the DORA metrics. I've spent the last 13 years working with software companies all over the world helping them improve their software quality, specifically focused on quality-critical applications. And today, I'm serving as the Senior Director of Solutions Engineering here at Rollbar.
So for today's agenda, we're going to look at three things. The first is to take a look at what DORA is and how that relates to continuous code improvement. We're also going to look at the benefits that you and your organization will get once you become an elite performer, as you move from low, medium, and high into the elite performing category. We're also going to look at the journey to accelerate your code quality with continuous code improvement, and some of the benefits you can expect to receive once your organization reaches that high and elite performing status.
So let's first take a look at what the DORA metrics are. DevOps Research and Assessment (DORA) is a Google team that ran a multi-year program researching the key things that drive some of the highest performing organizations around the world. They looked at the technical, process, measurement, and cultural capabilities that make up these high performing teams. They also looked not only at high performing software delivery teams, but also at organizations that are delivering from a revenue perspective back to their customers, making sure that the products these high performing teams put out actually result in real, tangible business results. They identified a number of key metrics that you can use in your organization to help understand where you are today and also what you need to do to move up this chain to become an elite performer.
2. Metrics and Benefits of High and Elite Performers
The lead time to produce features into the hands of customers is a crucial metric. Deploying more frequently and measuring change failure rate and time to restore services are also important. High and elite performers deploy on demand and multiple times a day, while low performers do so once a month to once every six months. Elite performers have a lead time for change of less than a day, while low performers take from over a month to six months. High and elite performers have a change failure rate of zero to 15%, while low performers have a rate of 50 to 60%. High and elite performers also have an automated or semi-automated process for restoring services, while low performers take over a week. Becoming a high or elite performer allows for more frequent deployments and faster delivery of features to customers, giving a significant competitive advantage.
The first is the lead time to produce features into the hands of your customers. And what that means is the time it takes you from taking an idea all the way through your pipeline into your customers' hands. So giving them real value where they can use these features very quickly day in and day out and start to give you feedback on how these are working and obviously how that can affect, for example, your revenue if it's a revenue facing application.
The next part is, of course, deploying more frequently. So they looked at how these organizations are deploying, how frequently they're deploying, and then also once they do deploy, what is their change failure rate? So how often are these releases failing? So, for example, if you start to release more, are you in fact failing more often or not? And for those failures, they also looked at the time to restore these services. So how long does it take you to take a failed deploy and roll it back, so that you can get that up and running and get your customers back to using your solution? There's also, of course, availability, but that's not part of the metrics for today's discussion.
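To make the four metrics concrete, here is a minimal Python sketch of how they could be computed from a list of deploy records. The `Deploy` record shape, the field names, and the use of medians are assumptions made for illustration; they are not taken from the talk or from any DORA tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    """One production deploy (hypothetical record shape, for illustration only)."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did this deploy cause a failure?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def four_key_metrics(deploys: list, period_days: int) -> dict:
    """Summarise the four metrics over a reporting period of `period_days` days."""
    if not deploys:
        raise ValueError("need at least one deploy")
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    failures = [d for d in deploys if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Made-up sample data: four deploys in one week, one of which failed.
t0 = datetime(2024, 1, 1, 9, 0)
deploys = [
    Deploy(t0, t0 + timedelta(hours=4)),
    Deploy(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=6),
           failed=True, restored_at=t0 + timedelta(days=1, hours=7)),
    Deploy(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=2)),
    Deploy(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=8)),
]
metrics = four_key_metrics(deploys, period_days=7)
```

With this sample data the change failure rate is 0.25 (one failed deploy out of four) and the median lead time is five hours.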
So what do we actually measure when it comes to the DORA metrics? So there's really a couple of categories over here for those four key metrics. So if we take a look at deployment frequency, our low performers are really looking to do this once a month to once every six months. And if you move your way to high and elite performers, our elite performers are really doing this on demand and multiple times a day. In fact, at Rollbar, we've seen some of our customers release more than 4,000 times in a given week. And that just gives you a sense of how quickly and how often they can get changes, features, and capabilities into the hands of their customers. And really, that helps you understand, if you take smaller changes and get them to the customers, what value are they gaining from it? Is it something of value to them, and will it help your application or your business move forward? For those lowest performers, that, of course, is a very long cycle, and indeed you have to bunch up lots of large deploys, and often not even big changes, into these couple of deploys a year.
Now of course, if you start to look at how long it takes us to get these changes into the hands of our customers, we look at the lead time for change, and again, our elite performers can actually do this in some instances in less than a day. Now of course, that depends on how big a change or feature request is, but these elite teams are breaking these down into smaller chunks of work that they are delivering into the hands of customers very, very quickly. For those on the low performing scale, again, it's over a month to six months to get these changes all the way through the pipeline, tested, validated, and into the hands of their customers. And of course, that can actually negatively affect revenue as your customers have to wait longer. And indeed, your competitors could be out-innovating you on a 10 to 1 basis. And of course, as we're pushing more of these changes live, one might expect that our change failure rate would be very high, but that was actually not true. So our high and elite performers were anywhere from zero to 15% change failure rate versus the lower performers, which are roughly at a 50 to 60% failure rate. So the more often we deploy, the better we get at it. And the time to restore was actually a reflection of this. So the high and elite performers had an almost identical process for restoring their services if something does go wrong. And for the high and elite performers, this is somewhat automated, if not completely automated, whereas our low performers really took more than a week to get these services back. And it was a very manual process to get these items back up again.
So if you start to look at these metrics and start to measure your team, what are some of the benefits that you can expect once you become a high or elite performer? So the DORA group actually found a couple of quite phenomenal benefits that these elite and high-performing teams were getting. So the first one is they were actually deploying 208 times more frequently than those on the lower end of the scale. And that just tells us that they actually have the ability to push features and capabilities into the hands of the customers faster. And their lead time for changes is roughly 106 times faster than that of the low performers. That means they can get features and valuable revenue-generating parts of the application into the hands of their customers. So again, that's something to really look at if you want to move and outperform your competitors in today's market.
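As a rough illustration of how the figures quoted above could be used, the Python sketch below places a team into a band. The talk only gives numbers for the two ends of the scale, so the cut-offs for "roughly daily" and for a sub-day restore time, the function name, and the "in between" label are all assumptions rather than official DORA definitions.

```python
def performance_band(deploys_per_month: float, lead_time_days: float,
                     change_failure_rate: float, restore_days: float) -> str:
    """Rough banding using only the figures quoted in the talk."""
    looks_elite = (
        deploys_per_month >= 30          # assumption: "on demand" means daily or more
        and lead_time_days < 1           # "less than a day"
        and change_failure_rate <= 0.15  # "zero to 15%"
        and restore_days < 1             # assumption: the talk only says "automated"
    )
    looks_low = (
        deploys_per_month <= 1           # "once a month to once every six months"
        or lead_time_days > 30           # "over a month to six months"
        or change_failure_rate >= 0.50   # "roughly 50 to 60%"
        or restore_days > 7              # "more than a week"
    )
    if looks_elite:
        return "high/elite"
    if looks_low:
        return "low"
    return "in between"


# Made-up example teams.
fast_team = performance_band(120, 0.5, 0.05, 0.1)
slow_team = performance_band(0.5, 90, 0.55, 10)
middle_team = performance_band(8, 5, 0.20, 2)
```

A team has to meet every elite threshold to land in the top band, while a single low-end figure is enough to flag it as low; that asymmetry is a design choice of this sketch, not something stated in the talk.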
3. Accelerating Code Quality and Error Resolution
The research revealed that high-performing teams recover 2.6 times faster and have a 7x lower change failure rate. They also spend 50% less time fixing security issues. Continuous code improvement involves identifying and fixing bugs throughout the pipeline, reducing risk in staging and production, and understanding error rates and real errors. To solve this problem, a unique fingerprint is created for each error, providing context such as the line of code, the developer responsible, and the impact on customers. By looking inside the code itself, issues can be identified in real time, allowing for immediate action.
What was also very important that came out of this research was the ability of these teams to recover very quickly. So they were 2.6 times faster to recover if something does go wrong. And they had a 7x lower change failure rate. So they were actually very good at releasing working software and these smaller changes on a regular basis. So even though they're releasing multiple times a day, if not thousands of times a week, they were very, very good at making sure that the quality was a lot better for these releases versus those on the low performing side.
And one of the other quite interesting parts of this research was the fact that they were spending 50% less time fixing security issues. So these elite performers were building better security into the applications from the get-go and were able to adjust and update these more frequently. So they were really spending less time on this than the lower performers.
So let's take a look at what continuous code improvement is and how that relates to these DORA metrics. So having bugs in your application is a part of life. We all know that they happen day in and day out. And so really what we want to look at is, how do we improve throughout our pipeline? So if we're going into our testing or continuous integration pipelines, how are we identifying these issues earlier? What do we know about these errors? How do we handle them? And how do we start to get better at understanding how to fix these very quickly? And then, of course, as we move into staging, we also want to reduce our risk as we get closer to the actual release process and into the hands of our customers. So how do we reduce that risk? Again, by identifying errors at a code level, and not only, for example, on the integration level or other levels that we might have today. And then finally, as we get into production, do we have the ability to understand, as soon as an error happens, what that error is? Are we able to monitor those things? And how do we act upon those? So what we need to do is tie all of these things together so that we can start to understand, all the way through our pipeline, what the true error rates are, what the real errors are, and how we fix them day in and day out. So really, we need a process and a way to make sure that our teams can move from these low performing types of teams into high and elite performing teams by helping them with the confidence to release to production more often.
So how do we solve this problem? So the first thing we've got to do is we really have to look at understanding when something goes wrong, what actually went wrong, and that means we need to go and create what we call a fingerprint on an error. So almost like a human has a unique fingerprint, every single error should have a unique fingerprint, and that's where we understand what you are. Have we seen you before? Have I seen you in testing? Have I seen you in staging? Or have I seen you in production multiple times? How many customers were impacted from this error? So we need to know more about you as soon as you occur. And of course if you're a developer and you're busy coding something, you also have your IDE, you have your stack traces, and it's very easy to debug and solve a problem during the dev cycle. And once you get into staging and production this becomes exponentially harder. So what we need is as soon as something happens we need context. So what was the line of code, the surrounding code that actually caused this problem? Who committed that code last? Do we have some tracing information if we're having microservices? What are the things that led up to this error? So all the variables and all those things that's fed in there. So ideally we really want that full stack trace that we're used to as developers writing code day in and day out. And then finally as I've mentioned, we want to know who was impacted by an error. Was it one customer? Was it thousands of customers? So that we can keep our services up and running and make sure that they are in the hands of our customers, actually generating revenue for our businesses.
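The questions above ("Have we seen you before? In testing, staging, or production? How many customers were impacted?") can be pictured as a small lookup keyed by fingerprint. The Python sketch below is only an illustration of that idea; the `ErrorLedger` and `ErrorHistory` names and fields are hypothetical, and the fingerprint is treated as an opaque string produced elsewhere.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ErrorHistory:
    """What we know so far about one fingerprint."""
    occurrences: int = 0
    environments: set = field(default_factory=set)
    affected_customers: set = field(default_factory=set)


class ErrorLedger:
    """In-memory record of error occurrences, keyed by fingerprint."""

    def __init__(self) -> None:
        self._by_fingerprint = defaultdict(ErrorHistory)

    def record(self, fingerprint: str, environment: str,
               customer_id: Optional[str] = None) -> bool:
        """Record one occurrence; return True if this fingerprint is new."""
        is_new = fingerprint not in self._by_fingerprint
        history = self._by_fingerprint[fingerprint]
        history.occurrences += 1
        history.environments.add(environment)
        if customer_id is not None:
            history.affected_customers.add(customer_id)
        return is_new

    def lookup(self, fingerprint: str) -> Optional[ErrorHistory]:
        """Return the history for a fingerprint, or None if never seen."""
        return self._by_fingerprint.get(fingerprint)


# Made-up occurrences: first seen in staging, then three times in production.
ledger = ErrorLedger()
first_time = ledger.record("a1b2", "staging")
seen_again = ledger.record("a1b2", "production", customer_id="cust-1")
ledger.record("a1b2", "production", customer_id="cust-2")
ledger.record("a1b2", "production", customer_id="cust-1")
summary = ledger.lookup("a1b2")
```

Here `summary` shows four occurrences across staging and production, affecting two distinct customers, which is the kind of context the talk argues you want as soon as an error occurs.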
The way to solve this problem is first of all by getting down into the code level. So a lot of applications today have monitoring outside the actual code levels, maybe on containers. What we need to do is we need to look inside the actual code itself. So we need to be in the application during runtime, as it's being executed and used by our customers, so we can identify issues in real time. That means one to three seconds as soon as something happens, we can collect that error.
4. Benefits of Continuous Code Improvement
We can give errors a fingerprint and understand their severity and impact. This enables us to respond quickly and get systems back up and running. By having visibility into errors, we can take immediate action, such as creating tickets, viewing logs, and gathering tracing information. Continuous code improvement has many benefits, including reducing the risk of losing customers, optimizing developer productivity, and instilling release confidence. Through this journey, organizations can expect increased deploy rates, improved visibility into errors, faster detection and response times, and a significant reduction in time spent fixing bugs.
We can give it a fingerprint. We can understand what it is, how important it is. Meaning what's the severity and the impact of it. And then action on that item. So what do we do about it? And this is a great way to start to not only be able to respond very quickly so we can detect errors quickly, but also respond, get a system back up and running again, because we actually know what went wrong and what those problems were.
And of course, we want to make sure that once we know about these errors, can I see the line of code, the last committed change, who did that. Can I create, for example, a ticket in any ticketing system? Can I see the surrounding logs, performance logs? If it was more of an environmental issue as an example, do I want to have all the tracing information for the microservices as an example? So we need more and more information to help us automate this process.
So as teams start to do this, what are some of the benefits? We did a survey of about a thousand developers around the world to ask them what some of the things are that keep them up at night and worry them, and also some of the benefits they've seen from continuous code improvement. And one of the first things that really stood out was that about 34% of developers said the biggest risk for them of an error or bug is losing customers. We know that customers are the lifeblood of our businesses. So we want to make sure that we keep our customers engaged and that they can use our software. And oftentimes, if a customer finds a problem, they won't even notify us. They might try to work around it. They may not log a ticket to get it resolved. And you can often lose those customers. So we need to give them that better user experience out of the gate. So we need to be able to manage these issues as they come up in real time on a proactive basis, meaning we'll find them before a customer finds them. And we can sometimes even fix them before then.
The next one is about 37% said they spend more than 25% of their time just fixing bugs. So that is a tremendous amount of wasted effort that could have been spent on feature development, improving other things in the application, and that's now spent on really rework or redoing some of the work that's already been done. And that's a tremendously big number when you start to take that into developer costs day in and day out over a 12-month period or even longer. So we really want to focus on making our engineers as productive as possible.
And finally, about 84% of developers said they're held back from moving into an elite or high-performing category, by deploying more often as an example, because they just don't have the confidence, what we call release confidence. That is the confidence that when they push these items live, if something does go wrong, they'll know about it quickly, not from a customer, but in real time through a notification, and then, more importantly, that they'll be able to do something about it. So how do we automate? Do we roll back, or maybe turn off a feature flag? How do we go about instilling that confidence for our engineering groups?
So when you undergo this journey, what are some of the things you should be expecting to get from it? Now, most of our customers that have started this journey have seen, for example, successful deploy rates go up from around 50 percent to about 95 percent. So that's giving you much better confidence that you can release more often, but also that your deploys are going to be of good quality and you're able to handle any issues that do come up. One of the biggest things we also hear is, I just didn't know, I didn't have visibility when something went wrong. So oftentimes it'll be an hour or more, sometimes days, before we know that there's a problem, because we're waiting for customers or sifting through logs. What we've seen from our customers is that the time to detect an error and respond to it is often down to about five minutes in a production scenario, and that's a massive time saving. If we know about something, we can respond to it and make sure that we address the most impactful bugs very, very quickly. The other part we've seen is that we can take that 20, 25 to 40 percent of time that developers are spending on fixing bugs and move that down to about 10 percent of their time, because it's faster to know about these bugs and faster to find the exact line of code where the problem is and fix it, versus trying to go through logs or trying to figure out what actually went wrong by piecing it all together.
5. Rollbar: Continuous Code Improvement
Rollbar is a continuous code improvement platform that provides real-time visibility into actionable errors. You can sign up for a free account and choose from various SDKs to enable Rollbar in your application. By including the code, you'll gain visibility into errors and access telemetry to understand what's going wrong. Visit Rollbar.com/gitnation for a special offer to trial Rollbar. Thank you for watching this section on accelerating code quality.
So, a tremendous time saving that's also gained by looking at the code itself, understanding what's going wrong, and so those benefits really add up over a 12 month period. We always say, you know, how do we go about starting this process? So, we can start this journey pretty quickly. It's a single step to get going, and I'm going to run you through how you can get that going with Rollbar.
So, what is Rollbar? Rollbar is a continuous code improvement platform. Fundamentally, we're going to help you get visibility into your application as soon as the error happens, be it handled or unhandled exceptions, and we'll be able to help you do something about it. So, we'll make sure that they're actionable. We will automatically group them together. We are going to remove the noise that you're used to with, you know, traditional logging tools where you're waiting for something to happen. With Rollbar, you'll get real-time visibility into these errors that are actually actionable with details to help you solve them a lot faster than you're used to today.
Challenges and Solutions in Production
Most people think it's medium hard to solve issues in production. Rollbar inspects every error and uses AI and machine learning to give it a unique fingerprint. To get a DORA score, you need to have the data and implement tools to collect it.
So, from one to three, nobody thinks it's very easy. Most people think it is around eight, which is hard, but not extremely hard: 33 percent responded with that, and seven percent responded that it is very hard. But yeah, most people think it is between seven and eight, which is medium hard and a little bit harder. So what do you think about that, Niko?
Well, I mean, it's certainly reflective of what we're seeing. I think overall we've gotten better as engineers, as software teams. We're definitely a lot better at running things through test environments and finding things earlier, so we're getting better. But when you get into production, for a lot of people it's still hard. And we actually did a developer survey of about a thousand developers around North America and some international, and a lot of this data does complement what we're seeing right here with this poll, which is that we haven't solved this problem. It is still pretty hard to solve issues in production, specifically around the time it takes to understand what actually went wrong, especially as we're hearing about microservices and lots of systems coming together. We haven't really cracked this. And I believe there's definitely a better way for teams to start to do this. So certainly not surprising, and certainly something we've seen in our developer surveys.
Yeah, it looks like the general answer is that it's a little bit hard. So I have a question for you. How do you fingerprint errors? Do you mean using the Git commit hash?
That's a great question. So in Rollbar, when we say fingerprint, think of it like a human being: every person is unique. And so how Rollbar does it is that we actually go and inspect every single error, and we've got some AI and machine learning running on these, inspecting these errors. And then we understand through our machine learning whether this is in fact a unique error or not. That's how we give it a fingerprint. So we don't have to look at the Git SHA, we don't have to look at anything like that. It's actually Rollbar inspecting these errors, and we've processed in excess of 100 billion errors. So we know a lot about what a unique error is. Is it the same? Have we seen it again? But that's the process of how we create a fingerprint, that unique identification for any type of error or warning or exception within Rollbar.
Okay, thank you. And another question is, how can I get a DORA score for my team? So a DORA score is certainly something you can get today. You've got to have the data. So that's the first thing, right. So the DORA score, all those metrics, really depend on having the data available. So the first thing you've got to do is, if you look at each of those four categories, start very simply to put the things in place so you can collect them. And I always say, you've got to do this over time. You can't just take a once-a-year snapshot and say you're good. This is a weekly, a monthly activity, where you want to look at these scores. So the first thing is, implement things that will help you get the data. Once you start to get this data, you can generate these either through Excel, or you can use some dashboarding tools.
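As a small illustration of looking at these scores weekly rather than as a one-off snapshot, the Python sketch below groups deploy records by ISO week and reports the deploy count and change failure rate per week. The input shape (a date plus a failed flag) and the function name are assumptions for the example; in practice the data would come from whatever your pipeline already records.

```python
from collections import defaultdict
from datetime import date


def weekly_summary(deploys: list) -> dict:
    """Map (ISO year, ISO week) to deploy count and change failure rate.

    `deploys` is a list of (day, failed) tuples, a hypothetical input shape.
    """
    buckets = defaultdict(lambda: {"deploys": 0, "failures": 0})
    for day, failed in deploys:
        iso = day.isocalendar()
        key = (iso[0], iso[1])  # (ISO year, ISO week number)
        buckets[key]["deploys"] += 1
        if failed:
            buckets[key]["failures"] += 1
    return {
        key: {
            "deploys": b["deploys"],
            "change_failure_rate": b["failures"] / b["deploys"],
        }
        for key, b in sorted(buckets.items())
    }


# Made-up history: two deploys in the first week of 2024, three in the second.
history = [
    (date(2024, 1, 1), False),
    (date(2024, 1, 3), True),
    (date(2024, 1, 8), False),
    (date(2024, 1, 9), False),
    (date(2024, 1, 12), False),
]
trend = weekly_summary(history)
```

The same per-week rows could just as easily be exported to a spreadsheet or a dashboard; the point is only that a baseline and a trend need repeated measurements.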
Using DORA Metrics and Rollbar
The DORA team provides various dashboards and tools, including Rollbar, to gather metrics and improve code quality over time.
In fact, the DORA team has various dashboards available. And there's other tools that you can use as well. You can get data from something like Rollbar as an example. We can give you some of those metrics as well, if you report them into Rollbar. But that's the place to start, and make sure that you do this over time. Because to get better, we need some baseline, right? We need to understand where were we? Where are we going? Is something working? Isn't it working? So we can then, you know, visually start to see and fix things and get better over time.
Unique Error Fingerprints and Advanced Features
Rollbar provides unique fingerprints for errors, which are specific to your project or account. While the same error might have occurred elsewhere, Rollbar's advanced features help identify common exceptions and provide solutions. The backend learns from errors, removes variables, and focuses on the core exception, enabling a deeper understanding of the issue and how to resolve it.
Okay, thank you. So John Gautier asked, does it mean that if I'm using Rollbar in my team, I could have the same fingerprint as another team in another company? No, so we know that error is unique for your project, right? We know that same error might have existed with somebody else. So we actually know about the error, but it'll be unique for your project or your account. So the good thing about Rollbar is that we're working on some very smart things for the future, where we'll know about an error, because a lot of these exceptions are quite common. A lot of people run into the same things. So we're going to get better at showing you how to fix them, in fact, and what's the right way to fix them, and some of the solutions that most people use to solve that problem. But the fingerprints are unique to your account, right? So your project will have its own unique fingerprints. But in the back end, we learn a lot about errors, we understand them, we remove all the variables, right? All the things that you might say, okay, well, that's different for us. We remove all of those, so we can get to the core exception to really understand what the exception is, what happened, and how you fix it.
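The idea of removing the variables to get to the core exception can be illustrated with a deliberately simple Python sketch: replace the parts of a message that change between occurrences with placeholders, then hash the result together with the exception type and stack frames. This is not Rollbar's algorithm, which the talk describes as machine-learning based; the regular expressions, placeholder names, and function names here are assumptions chosen for the example.

```python
import hashlib
import re

# Patterns for the "variable" parts of a message (illustrative, not exhaustive).
_HEX = re.compile(r"0x[0-9a-fA-F]+")
_QUOTED = re.compile(r"'[^']*'|\"[^\"]*\"")
_NUMBER = re.compile(r"\d+")


def normalize(message: str) -> str:
    """Replace values that differ between occurrences with placeholders."""
    message = _HEX.sub("<hex>", message)
    message = _QUOTED.sub("<str>", message)
    message = _NUMBER.sub("<num>", message)
    return message


def fingerprint(exception_type: str, message: str, frames: list) -> str:
    """Hash the exception type, normalized message, and stack frame names."""
    key = "|".join([exception_type, normalize(message), *frames])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


# Two occurrences that differ only in their variable parts.
frames = ["orders.py:load_user", "orders.py:checkout"]
fp_a = fingerprint("KeyError", "user 1042 not found in 'eu-west'", frames)
fp_b = fingerprint("KeyError", "user 77 not found in 'us-east'", frames)
fp_c = fingerprint("ValueError", "user 77 not found in 'us-east'", frames)
```

The first two occurrences collapse to the same fingerprint, while a different exception type yields a different one. Frames are identified by function name rather than line number in this sketch so that unrelated edits to a file do not change the fingerprint.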
Rollbar and the Development Life Cycle
Rollbar is used throughout the development life cycle. In an ideal world, every error would be caught in pre-production and testing, but that's not the reality. Large organizations running Rollbar in the QA environment can understand and fingerprint errors early on. Some errors may slip through the cracks due to variables in production, user behavior, or external systems. Rollbar acts as a safety net, allowing teams to quickly identify if something has gone wrong. If the error is known, Rollbar can provide insights on how to fix it, reducing the time to resolution. Having a system in production to catch missed errors from QA is essential.
Okay, perfect. And another question from Eva is, would these errors not be captured by automated tests in a unit test? What am I missing regarding the use case for Rollbar? Absolutely. So Rollbar is actually used throughout the development life cycle. So you're absolutely right. In the ideal world, we'll catch every single error in pre-production and testing, in fact in our unit tests, but that's not reality, right?
So for most of our clients, in fact the large organizations that are releasing 4,000 times in a given week, for example, they run Rollbar in the QA environment. So they start to understand these errors very early on so that they can fingerprint them. And as you roll through into production, they can then know, have I seen this before? What do you do about it? But you're right. Some errors should be picked up in testing, but oftentimes there's some variable that makes production slightly different, or users use the system in a different way. You might be interacting with external systems or microservices that cause these things to slip through the cracks. And think of Rollbar as your safety net, right? It's the thing that people look at first, as soon as they deploy, to know if something has gone horribly wrong. And hopefully it won't, but Rollbar is your safety net. And a nice thing about Rollbar is, if we knew about it before, we can often tell you how you fixed it before. So actually your time to fix it is reduced greatly, which is quite nice. But you definitely want something in production that'll help you catch the things you potentially missed in QA.
Rollbar: Release Quality Visibility
Rollbar helps you understand the true quality of your code by providing instant visibility into errors. By being proactive, you can fix issues before customers report them. Rollbar is used extensively by the Rollbar team, catching errors in various applications and projects. Contact Niko on Discord or the Q&A channel for any further questions.
Well, that sounds great. And the next question will be, can you help my team with release quality visibility? Absolutely. So the goal is, let's release quality code quickly, right? And it's going to be something that's of value. So absolutely, Rollbar is the thing that'll help you understand what the true quality is. Why is that? Why am I saying that? It's because Rollbar is inside your application at run time, right? So as you're releasing it and people start to use it, you will know about an error before anybody else does. And we'll tell you before your customers even tell you about that error. So we'll give you that instant visibility into an error and then you can do something about it. Be proactive, meaning you can fix it before somebody logs a helpdesk ticket or so on. But that's how we do it. And some of the biggest and the best companies have that very proactive mindset of finding things before their customers do and then fixing them, which is really nice. Thank you. And the last question, from Metin, is: are you using Rollbar on Rollbar? Oh, absolutely. So Rollbar is used heavily on Rollbar. In fact, we'd love for you to come in and have a chat with us. We'll show you how we use Rollbar. In fact, we catch everything from our website to our internal Rollbar applications. Any piece of code that we release, be it an open source project that we work on, we always put Rollbar on. And that's helping us get better every single day at finding these errors. So absolutely. I'd love to show you how we're using it. Nice. Thank you. If anybody has any more questions, feel free to contact Niko on Discord or pose your questions in the Q&A channel. Thank you, Niko. Bye. Thank you so much, Liz.