Keep Calm and Deploy On: Creating Safer Releases with Feature Flags


Creating and deploying new software is risky. We've all seen how easily bugs arise, causing software to be delivered poorly or to the wrong people. What's more, depending on how tightly we couple our systems and services, they can interact in unexpected and unfortunate ways with existing software or hardware. Beyond unintended consequences, people can also use our services for nefarious purposes. It's essential to have safety nets in place for when things don't go as planned or people attempt to break the rules. In this session, we'll discuss how feature flags can work in both temporary and permanent scenarios to help you break the quality triangle and deliver quality promptly.



Hey everyone at React Advanced, hope you're having a good time. I'm Jessica, and I'm going to talk to you about how you can use flags to mitigate risk in your software development lifecycle. So let's get into it.

You've likely heard about feature flags solving release-shaped problems, right? They're often used in entitlement scenarios, changing what's available to certain users, and typically in a Boolean state. We take a feature, we wrap it in a flag, and that effectively becomes our control point within our code, allowing us to alter its visibility to our end users. The feature's either visible or it isn't. It's on or it's off. And once we've validated the changes in production and are confident that our feature can be on for 100% of our audience, we get rid of the flag. That's the typical lifecycle we see with flags. As you know, this is super useful when it comes to, say, previewing features for testing in production without them going out to our end users, or rolling out customer-facing features to just a subset of our user base.

But what if the problem we're trying to solve requires more than a binary state change or A/B testing? At LaunchDarkly, when we talk about flags, we're not simply talking about two states. We can actually deal with a whole spectrum of stages in your release process. We can flag based on a number or a string. We even have JSON flags. And that allows you to mitigate risk in more complex scenarios.

Take the safety valve: a long-term flag that can be used to degrade non-critical functionality of your applications and services. It ultimately helps you maintain the availability of your applications. It's super common, as we all know, to rely on downstream services and providers, but things start to get scary when you have a single point of failure in your delivery. So why not protect yourself? De-risk that element.
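As a rough sketch of what flagging around a single point of failure could look like, here is a string-valued flag acting as a safety valve in TypeScript. Everything named here is made up for illustration: the `search-provider` flag key, the in-memory `flagStore`, and `runSearch` are assumptions, not part of any real SDK, which would evaluate the flag and keep its value in sync for you.

```typescript
// Hypothetical names throughout; this is not any vendor's SDK.
type ProviderMode = "primary" | "fallback" | "off";

// Stand-in for the flag values a real SDK would keep up to date.
const flagStore: Record<string, string> = { "search-provider": "primary" };

// Read a string flag, using a safe default for missing or unknown values.
function providerMode(flagKey: string, safeDefault: ProviderMode): ProviderMode {
  const raw = flagStore[flagKey];
  if (raw === "primary" || raw === "fallback" || raw === "off") {
    return raw;
  }
  return safeDefault;
}

// The risky dependency is wrapped so the flag decides which path runs.
function runSearch(query: string): string {
  switch (providerMode("search-provider", "fallback")) {
    case "primary":
      return "primary results for " + query;
    case "fallback":
      return "cached results for " + query;
    default:
      // "off": degrade the non-critical feature instead of failing the page.
      return "search is temporarily unavailable";
  }
}
```

Flipping the stored value from `"primary"` to `"fallback"` changes which branch runs on the next call, with no redeploy. The safe default matters too: an unknown or missing value lands on the fallback path rather than on the dependency you are trying to protect.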
By flagging around that point, you can effectively create a system that allows you to pivot to a fallback service if the worst-case scenario does in fact occur, which we know it often does, unfortunately. This gives you the ability to roll back gracefully, without having to go offline altogether, all within about 200 milliseconds. You're protecting your uptime, you're supporting your team as a whole, and everyone's much happier.

This can also be done in the case of bad actors. Say someone's using your service for something they really shouldn't be. You can isolate that one endpoint. You can return a 404 to that one bad-acting device while everyone else still gets their 200s. In essence, you get to define how you degrade. You get to ring-fence your blast radius and decide how you roll back. So it's perfect for scenarios like load shedding or manual control of certain problems. This process is all about putting you back in control of a situation that you likely didn't anticipate or ask for.

And of course, when we're talking about resolving these scenarios, let's take the situation where a safety valve maintains uptime by rolling back to a previous state after a breaking change. Switching a flag not only helps you stay online, it also gives you the agility you need to fix the issue at hand. Using the audit log and your observability platform, you're able to pinpoint the issue: when and where it occurred, and which change contributed to the outage. And when your fix is ready to deploy, you of course need to be sure that it can actually go out to all of your users, that it's a solution that can be applied across your user base and isn't going to cause further problems when implemented. This is where you can stage your rollout.
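One common way to implement a staged rollout is to hash each user key into a stable bucket and compare it against the current percentage. The TypeScript sketch below assumes that approach. The hash and the names `bucketFor`, `receivesFix`, and `countReached` are illustrative only and do not describe how any particular flag service does its bucketing.

```typescript
// Map a user key to a stable bucket in the range 0..99 with a simple string hash.
// Illustrative only; real flag services use their own bucketing schemes.
function bucketFor(userKey: string): number {
  let hash = 0;
  for (let i = 0; i < userKey.length; i++) {
    hash = (hash * 31 + userKey.charCodeAt(i)) % 1000003;
  }
  return hash % 100;
}

// A user receives the fix when their bucket falls below the rollout percentage.
function receivesFix(userKey: string, rolloutPercent: number): boolean {
  return bucketFor(userKey) < rolloutPercent;
}

// Count how many of the given users a rollout percentage would reach.
function countReached(userKeys: string[], rolloutPercent: number): number {
  return userKeys.filter(function (key) {
    return receivesFix(key, rolloutPercent);
  }).length;
}
```

Because the bucket depends only on the user key, raising the percentage from, say, 5 to 25 to 100 only ever adds users. Nobody who already has the fix loses it as the rollout widens.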
You can stage your fix by going out to a subset of your users at first and gradually rolling out to more and more people as your confidence grows. Flags give you the gift of certainty here, and the ability for everyone to operate from one single version of the truth.

And now that you're back online and your fix is live for your entire user base, you of course want to stay online, right? Sometimes it's hard to know if your configuration is truly good to go. You may make some guesses based on how your platform behaves in certain scenarios. But the thing is, assumptions can easily be proven incorrect and preconceptions proven wrong. When you're handling a myriad of microservices or dealing with processes that require numerous network calls, some complex tuning is often required. A lot of the time you have to take a great deal of care when deploying. Monitoring can quickly become your best friend, and you have to get a real sense of the CPU and performance impact you're incurring. How about accepting the guesswork as part of your workflow? Store your configuration in a feature flag. This approach gives you the ability to dynamically change which value you're going out with. You can start with your assumptions and roll towards a more suitable value as those assumptions get proven correct or incorrect along the way. You can even push different configurations based on different IP addresses, node sets, or methods to get to where you want to be.

When it comes to software, operating within the unknown sometimes just can't be avoided. We've got to manage complexity, and we've always got to be learning. We're going to need to venture into uncharted territory to discover better ways of doing things. That's just how this works. But let's make sure that we're exploring these new possibilities with a sense of added protection and confidence.
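To make the idea of storing configuration in a flag a little more concrete, here is a minimal TypeScript sketch of a JSON-style flag value with a starting guess and per-node-set overrides. The field names (`timeoutMs`, `maxConcurrentCalls`), the `canary-nodes` node set, and the numbers are all invented for illustration. In practice the object would come from your flag service rather than a constant in code.

```typescript
// Hypothetical tuning values; the field names are invented for illustration.
interface TuningConfig {
  timeoutMs: number;
  maxConcurrentCalls: number;
}

// Assumed shape of a JSON flag value: a base guess plus optional overrides per node set.
interface TuningFlag {
  base: TuningConfig;
  overrides: Record<string, Partial<TuningConfig>>;
}

// Stand-in for the value a flag service would deliver at runtime.
const tuningFlag: TuningFlag = {
  base: { timeoutMs: 2000, maxConcurrentCalls: 10 },
  overrides: { "canary-nodes": { timeoutMs: 800 } },
};

// Resolve the effective config for a node set: override fields win, base fills the rest.
function configFor(flag: TuningFlag, nodeSet: string): TuningConfig {
  const override: Partial<TuningConfig> = flag.overrides[nodeSet] || {};
  return {
    timeoutMs: override.timeoutMs !== undefined ? override.timeoutMs : flag.base.timeoutMs,
    maxConcurrentCalls:
      override.maxConcurrentCalls !== undefined
        ? override.maxConcurrentCalls
        : flag.base.maxConcurrentCalls,
  };
}
```

Here `configFor(tuningFlag, "canary-nodes")` picks up the tighter timeout while keeping the base concurrency limit, and any other node set gets the base values. Trying a new guess then means editing the flag value, not shipping a new build.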
So while features will inevitably fail sometimes and things will go wrong, let's work together to make those failures less painful, shall we?
7 min
24 Oct, 2022
