How Baselime created a culture where it's possible to move fast, break as little as possible, and recover from failures gracefully. The culture is technically underpinned by Node.js, Event-Driven Architectures (EDAs), and Observability (o11y).
Creating an innovation engine with observability
Node Congress 2023
My name is Boris. I'm the founder and CEO of Baselime. What we do is observability for serverless architectures. So I'm sure most of you have heard the word serverless multiple times today, from the talk earlier this morning to all the demos that have happened since then. A lot of emphasis is put on how to deploy code to the cloud, but very little effort is put into how we actually run and maintain that code over time. That's the solution we provide for people who are adopting serverless architectures. But what I want to talk about today is something completely different: how, internally within the Baselime team, we have been able to create what I like to call an innovation engine, thanks to the observability that we have. Compared to other startups at very similar stages of life, we ship really, really fast. And I want to share with you the, I wouldn't say tricks, but the methods that we apply so that we can ship so fast. So the first thing is, is anybody here in the room dealing with tech debt at their job right now? I see one hand, two, oh wow, almost the whole room. Is anybody dealing with flaky tests? Is anybody dealing with CI/CD pipelines that never work when you need them to work? And again, almost everybody. And that's what I don't like. When we signed up to be software engineers, and for a lot of us cloud engineers, what we wanted was to create things and put them in the hands of people, make that innovation happen, and see how people interact with the things that we create and put on the web. But day to day we are left dealing with tech debt, flaky tests and all of that, which is just slowing us down and preventing us from innovating every single day. And there's a question that comes up a lot at conferences and in tech conversations: how well is your team performing?
And what this question actually means is: how much innovation is your team shipping every day? Up until very recently, there was no real way of answering this question. Honestly, people would say, oh, we're doing well, but there was no way of quantifying it until the DORA metrics in the Accelerate book. I hope everybody here has read it. If you haven't, please get a copy. It gave us a scientific framework that we can use to say, OK, we're in the top 10 percent of performing teams, we're in the top 20 percent, or we're in the bottom 10 percent and we need to do a lot of work to get out of there. To be able to answer that question, there are a few metrics that we need to measure. The first one is how often do you deploy? That's your deployment frequency. The second one is how long does it take for code to go live? So from a developer writing code in their editor locally to that code being live in production, used by real users, how long does that take? The third is how many of your deployments fail? When you deploy, you will sometimes introduce defects into production. How often does that happen? And the fourth is how long does it take to recover from an outage? When someone introduces a defect into production, how long does it take for your team to detect that defect and ultimately fix it, either rolling forward or rolling back? And at Baselime, we have another one. It's a bonus one. We call it mess around lead time. It's how long it takes from customer insight to production. So I'm here at a conference. I've spoken with a lot of people, a lot of experts in serverless, really. I've learned a lot. I've gotten a lot of insights. How long does it take from those insights that I have now to being in our product in production at some point in the future? What's that time gap? And this is what innovative teams look like. Every single day, they have the foundations laid. They don't have flaky tests.
They don't have bad CI/CD pipelines, and they keep innovating, adding blocks on top of what they already have, such that innovation happens every single day. And for low-performing teams, this is what it looks like. Complete chaos. Nobody knows what's happening. And every single day, instead of shipping things to production that are actually helpful to your users and customers, you are fixing stuff. You are fighting CI/CD pipelines. That's not what we signed up for, and we want to move away from this. And when you start moving away from it, it's a self-fulfilling prophecy. That's the expression. Smaller deployments lead to faster deployments. Faster deployments lead to faster detection time. Faster detection time leads to faster recovery time. And if you know that you can recover from outages quickly, you will deploy more often. You will innovate more often. So what does all of this look like at Baselime? Our deployment frequency is whenever you want. You make a typo change in the front end, git push, it gets deployed. You spend the whole day working on this huge migration, git push, it gets deployed. There is no red tape around deployment. And that's something that we need to introduce more into our development cycles, because the red tape that we introduce seems like it's helping us be more productive, but actually it's just slowing everybody down. What's our mean lead time for changes? So how long does it take from somebody writing code in their editor to that code being in production? That is however long infrastructure as code takes. We use infrastructure as code to manage all our infrastructure. Every time you git push, our CI/CD pipeline picks it up, builds the artifacts and deploys them to the cloud. The bottleneck in our process is actually that deploy-to-the-cloud piece. It can take maybe two minutes or so. And the reason we are able to achieve this is controversial.
We have very streamlined testing. We don't test for the sake of testing. We don't have test suites that take 15 minutes to test buttons and so on. We test the critical path in our software, and for the rest, we are going to discover if there is a problem thanks to the observability that we have. And the second thing, probably even more controversial: we don't do code reviews. I know a lot of people are not, I hear a smile there. It's not something that you hear a lot. But I think the idea of code reviews comes from two things. The first is legacy practices: we have always done code reviews, so we are going to continue doing code reviews. And the second thing, thank you, the second thing is really the open source community. When you are running an open source library or an open source framework, you need to have code reviews because anybody can commit to your repos. So you need processes in place to make sure that the code that gets into your repo has the quality that you expect and the security standards that you expect. But within a team, well, you were hired into this team. It means that we trust you to be able to ship code to the standards of this team. Why do we have to block one or two senior or staff engineers to read and review your code every single time? Let's be honest with each other: for the vast majority of the things that get caught at the code review stage, linters, testing, and all of that automated stuff can replace the whole code review process. And more importantly, sometimes there are things that are actually hard. Like, I write this code, but I'm not super confident that this is what should go into production. The way we handle that within the Baselime team is pairing. You write in Slack, hey, guys, I'm working on this thing, it's kind of hard, does someone have a second? Two people, three people join you, you pair for three, four, six hours, the whole day, and you ship a product that doesn't need to go through code review.
So we do pair programming, but on demand rather than all the time. Our change failure rate is about 10%. What that means is one in 10 deployments introduces a defect in production. This is still a pretty good number, but we have to remember that I personally deploy 20 to 30 times a day. We're a team of three, so that's, I'm bad at math, it's a lot of deployments every single day. So 10% of those introduce a defect in production. Now, sometimes those defects are typos. Sometimes they are worse than typos. But that leads us to the next thing, which is our recovery time. When a defect gets into production, how long does it take for us to detect that problem and fix it? On average, less than one hour. Sometimes it's five minutes because it was a typo, sometimes a bit longer because it was a bigger problem. But we bet on our observability to know about these defects so soon that we are able to fix them before they impact anybody who is actually using the application out there in the wild. And our mess around lead time. This is my favorite one. From customer insight to production. Customer insight can be anything: a conversation, looking at our analytics dashboard, a customer interview, all of that. From insight to production, it's typically about half a day. And the great thing about this is it has nothing to do with coding ability. It has nothing to do with credentials. It's all about the culture of your team. It's all about the trust that you have in your teammates. It's all about the tooling that you put in place so that you're able to know about these defects before they impact real users. And a question I get a lot is, is this for everyone? Oh, Boris, you know, you are in a small startup. It's fine for you to, in quotation marks, YOLO to production. It will not work anywhere else. And this question usually comes with all of that. That approach doesn't scale. How about migrations?
You need controls and manual checks around deployment. You need QA. What if you break production? Well, what I'm actually hearing is that you cannot have high velocity and quality software at the same time. That's what all those questions are saying: that it's impossible to move fast and not break things. And my answer is, yes, we can. And we can with the right culture, the right processes and the right tooling. So how do we achieve this at Baselime? It starts with our company culture. This is one of the most fundamental of our company values. And this is the toned-down version, because we couldn't really put fuck around and find out on our website. We had to settle for mess around. But the idea is very simple. You are experimenting. You are innovating. You don't know the answers to these questions. Like, there was a talk today about async hooks and how it changed from worker to Node.js, etc. But they were experimenting. They were innovating. They found a first solution that wasn't the right one. And now they are fixing it. That's how software is built. You need to mess around and find out if it actually works for people out there. The second one is ship skateboards. Nobody on our team skates, maybe one person. But the skateboard emoji is the one that is used the most in our Slack. And the reason is we ship skateboards. And this is what it means. We build the smallest possible version of anything. We put it in production. We see how people react to it. And we keep iterating on it until we get this really fancy car. Or midway through we realize that actually nobody cares about this thing that we just built and shipped. Scrap it. Remove it. There is no point building this really fancy car, spending six months, deploying once a month, and discovering one year later that it was completely useless, when you could have shipped the smallest version and iterated on it. So one of the other key pillars is observability-driven development.
And is anybody here familiar with the concept of observability-driven development? Just a few hands. Cool. So I'm going to start by telling you what observability actually is. And I have a short video of me here. But before I play the video: I am not from a software engineering background. I studied aerospace engineering. And I was quite shocked when I moved into software a few years ago and realized that people actually had no idea what their systems were doing in production. People didn't know how many nodes they had. People didn't know how their users were experiencing their apps. And the reason I was shocked was this. So this is a much younger and skinnier me. A few years ago, I was working on drones. We would fly those drones, pack them with as much instrumentation as possible, and collect as much data as possible about the flight path and the air velocity and all of the things that you can imagine about that drone flying. And when it landed, I would take the SD card out and plug it into my computer. From that data, I could tell you absolutely everything that happened to this aircraft whilst it was flying. I didn't need to go back and fly the aircraft again for whatever reason, because the data was telling me everything. And more importantly, with that data, I could build models that would predict how the aircraft would behave in new scenarios that were not part of this flight. Imagine, then, when I started being a software engineer and people were like, oh, we don't know, we need to add logs. Like, what do you mean you need to add logs? Why didn't you have logs in the first place? Why didn't you have tracing in the first place? It's not something that you do at the end. Your application, your feature, is not complete until you have proper monitoring, proper observability, dashboards, queries and alerts on that specific feature that you just built. That's not the right button.
And that is what observability is about. Observability enables you to ask arbitrary questions of your systems and get answers without code changes. That's the critical part, because an issue might happen under such rare conditions that you're not able to reproduce it. You should be able to know exactly what happened without changing the system that was in production in the first place. So what is observability-driven development? I think I'm going to repeat myself a little bit. First, observability is part of your development lifecycle. When you write code, you need to instrument it. You don't ship code that is not instrumented. Secondly, actively instrument your application. Pretty much the same thing. And the last one, a bit more controversial: testing in production. A lot of people want to be able to replicate everything locally. That's great. But when you're building distributed systems, especially with serverless architectures, a lot of the issues that will happen are emergent behaviors that will be extremely hard to reproduce locally. Imagine having that plane that I was flying and trying to reproduce, right at the ground station, the gust conditions that happened in the air. That would be pretty much impossible. It's the same thing with distributed systems in the cloud. So this is one of my favorite quotes: observability without action is just storage. A lot of people have an Elasticsearch that is getting all the logs and they're happy with that. You're paying for storage. If you're not using that data, if you're not querying that data every single day to understand your systems, you're just paying for storage. You're not paying for observability. And the most critical thing to be able to innovate quickly is to construct tight feedback loops. And yeah, I'm going to take an example to explain how that works. So this was about a month ago, a month and a half. I go on a customer call and I speak with this guy and he tells me, yo, this is great.
But on my Lambda functions, typically I can run them. But sometimes DynamoDB, hopefully people are familiar with DynamoDB, it's a database on Amazon Web Services, sometimes it throttles, and it's quite difficult for me to know that it throttled. I had that call one Wednesday afternoon at around, I don't know, 4 or 5 p.m. I dropped a message on Slack and I said he was even more excited, blah, blah, blah. And Thomas, who is on our team, said, I know what I'm shipping tomorrow. That was at 6 p.m. one day. The next day at 5:42 p.m., he had the very first version of that feature shipped. It worked only for DynamoDB, none of the other AWS services. But we pushed it to production and were able to get feedback on it. And now we're iterating on it, adding other services and improving on that. Oh, I see that I'm out of time. Well, the thing is, everything fails all the time. It's not, you know, all beautiful, you ship to production, it works. It fails all the time. But what we cannot afford is this: people being afraid to deploy to production. That's what we cannot have. As soon as an engineer is afraid to deploy to prod, we have failed as a team. Oh, one thing that I hear a lot is, oh, Boris, that's great, but we don't have time for observability because we have all these other things to do. And this is what I hear: we are too busy because we are using that to work when we could be using that instead. And it doesn't matter how hard you work with this. You're always going to be slower than if you use the actual wheels. So, to recap: build a culture that fosters more ownership by your engineers, democratize production for your entire team, instrument absolutely everything, and get all your answers by asking questions directly of production, rather than trying to guess. Build your innovation engine and spend less time managing tech debt. That's it from me. Thanks for having me. Do you measure lead time at the task level or at the project or epic level?
Oh, we measure it on every single thing that gets to production, at the smallest level possible. We don't work in large epics that take one month to build. We ship the very smallest possible version. We have this thing that we call scope hammering. When someone has an idea, we take that idea and we're like, okay, what is the smallest version of this that can go to production within the next few days? And we keep reducing the scope of that idea until we have something that we are confident will be in production within a few days. And that's what we work on, rather than trying to build something a bit bigger. Awesome. The next one should be very quick to answer. What's the name of the metrics book that you mentioned? Oh, Accelerate. I don't remember the names of the authors. But if you Google Accelerate, it has a very long subtitle, how to improve your development lifecycle or something like that, you're going to find the book. All right. Thank you. Do you run end-to-end tests in some areas? If yes, when do you run them? In a pull request or on a schedule? Oh, so that's very interesting. Given that we deploy so often, we don't need schedulers for things like that. The test suite that we have gets run every single time, before something gets to production. And in terms of end-to-end tests, what we do is actually very interesting, because we deploy very small serverless functions. Each serverless function does exactly one thing, not two things, one very simple thing. It means that testing that function in isolation is relatively easy, because you just say, these are the inputs, what do I expect as outputs? Okay, the database calls and the cache calls, everything like that, we mock. But the actual tests themselves are very simple. All right. Very cool. Accidentally archived one. I'm re-archiving that now.
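The testing approach described in the answer above (one small function, inputs in, outputs out, database calls mocked) can be sketched as follows. This is a minimal illustration only, not Baselime's code: the `UserStore` interface, the `canCreateDashboard` function, the plan names and the limits are all hypothetical, invented for the example.

```typescript
// Hypothetical dependency: the only thing the function needs from the database.
interface UserStore {
  getPlan(userId: string): Promise<string | undefined>;
}

// A function that does exactly one thing: decide whether a user
// may create another dashboard. The limits are illustrative assumptions.
async function canCreateDashboard(
  store: UserStore,
  userId: string,
  currentCount: number,
): Promise<boolean> {
  const plan = await store.getPlan(userId);
  if (plan === undefined) {
    return false; // unknown user
  }
  const limit = plan === "pro" ? 50 : 3;
  return currentCount < limit;
}

// Hand-written mock standing in for the real database call.
function mockStore(plans: Record<string, string>): UserStore {
  return {
    getPlan: async (userId: string) => plans[userId],
  };
}
```

Because the database is passed in as a parameter, a test only has to build a `mockStore` with a few known users and compare the returned boolean with the expected one. No test framework or cloud resource is required for the function itself.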
And yeah, with trunk-based development, I noticed there is a higher risk of creating a lot of technical debt. It usually shows after seven years of fast development. What's your experience with it? Well, the company is not seven years old yet, so I can't quite answer that. But the way we work is push to main. This is very personal and biased, I will say, but my experience with branches and all of that is that it's always a nightmare, so skip them if you can afford to. We have multiple repos. We have a very small team, but we have easily 100 repos. So it's very rare that your push to main will actually clash with someone else's push to main. So, just push to main. Okay. Cool. Yeah, the one that I accidentally archived, now it's your turn. An interesting idea about not having code reviews. Did you measure if this works well for the business eventually? Or do you know it's not making things worse? Oh, interesting. So, it was a conscious decision for us to not have code reviews. And every single time someone joins the team, they're like, oh my god, I cannot work in this environment. And three weeks later, they're like, oh my god, code reviews, who invented that? Simply because it enables you to move so much faster. We don't have quantitative numbers that we can share, because we started the company with that philosophy. We didn't switch to it, so we aren't able to compare the cadence. But from personal experience at previous positions, it's definitely much, much faster. Okay. There is one related to that. One second. Regarding tests in production, what if you can't afford losing customers who will turn their back when our system is down for 10 minutes, for example? Oh, that's an interesting one. So, I mentioned that one out of 10 deployments introduces a defect in production. But very rarely are those defects actually experience-breaking.
Our infrastructure is very decoupled, and that's what you get almost out of the box with serverless architectures. If you work with serverless, typically you'll have a much more decoupled architecture. And as a result of that, defects are very localized. Very rarely will a defect impact a large swath of users. So, more often than not, we are actually the ones telling customers, hey, there was this issue for 10 minutes. You were not able to, I don't know, see your dashboard, but nobody opened their dashboard during those 10 minutes. So that's fine. I'm sure that there might be occasions where we will get some churn, but that's a decision that we made. We are willing to get churn from a few users as long as we are able to innovate every single day and gain more users than we are churning. Right. Yeah, there are more questions related to what has been discussed already. Maybe two more. Yeah. Would you change something in your practices if you worked in the healthcare or banking industry, do you think? Oh, yeah. So, huge caveat in all of this: we are building observability, right? We are not handling people's money. We are not handling people's health records. I'm sure that in regulated industries, this will be very, very different. I spent a very short stint in a regulated startup, and I absolutely hated it because of how slow it was. Well, if you're in that space, I don't know. But if you're in a space where you don't have those regulations over your head, don't apply the same principles that they apply simply because those principles are called best practices. Do what works for your industry rather than copying what the rest of the industry does. Yeah. Great one. Yeah, last one. Have you tried to implement the SPACE framework instead of DORA metrics? Any feedback? Nope. Only DORA. Okay. Thank you. Thanks a lot, Boris. Cheers.
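For readers who want to put numbers on the four metrics described in the talk (deployment frequency, lead time for changes, change failure rate and recovery time), the arithmetic can be sketched in a few lines of TypeScript. This is an illustration under stated assumptions, not how Baselime measures anything: the `Deployment` record shape and the `summarizeDora` function are hypothetical, and the sketch assumes each deployment already carries its commit time, its go-live time and, for failed deployments, the time the defect was fixed.

```typescript
// Hypothetical record shape, invented for this sketch.
interface Deployment {
  committedAt: Date;        // when the change was written and committed
  deployedAt: Date;         // when it went live in production
  failed: boolean;          // did this deployment introduce a defect?
  recoveredAt: Date | null; // when the defect was fixed (roll forward or back)
}

interface DoraSummary {
  deploymentsPerDay: number;
  meanLeadTimeMinutes: number | null;
  changeFailureRate: number | null;
  meanTimeToRecoverMinutes: number | null;
}

const MS_PER_MINUTE = 60000;

function minutesBetween(start: Date, end: Date): number {
  return (end.getTime() - start.getTime()) / MS_PER_MINUTE;
}

// Arithmetic mean, or null when there is nothing to average.
function mean(values: number[]): number | null {
  if (values.length === 0) {
    return null;
  }
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function summarizeDora(deployments: Deployment[], periodDays: number): DoraSummary {
  const leadTimes = deployments.map((d) => minutesBetween(d.committedAt, d.deployedAt));
  const failures = deployments.filter((d) => d.failed);

  // Only failures that have already been fixed contribute to recovery time.
  const recoveryTimes: number[] = [];
  for (const d of failures) {
    if (d.recoveredAt !== null) {
      recoveryTimes.push(minutesBetween(d.deployedAt, d.recoveredAt));
    }
  }

  return {
    deploymentsPerDay: deployments.length / periodDays,
    meanLeadTimeMinutes: mean(leadTimes),
    changeFailureRate: deployments.length === 0 ? null : failures.length / deployments.length,
    meanTimeToRecoverMinutes: mean(recoveryTimes),
  };
}
```

Fed one day of records, `summarizeDora(records, 1)` returns the four numbers side by side. The harder part in practice is collecting honest `failed` and `recoveredAt` values, which is where the observability discussed in the talk comes in.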