
34 min
Fine-tuning DevOps for People over Perfection
Demand for DevOps has increased in recent years as more organizations adopt cloud native technologies. Complexity has also increased and a "zero to hero" mentality leaves many people chasing perfection and FOMO. This session focusses instead on why maybe we shouldn't adopt a technology practice and how sometimes teams can achieve the same results prioritizing people over ops automation & controls. Let's look at how much we need of, and how to fine-tune, everything as code, pull requests, DevSecOps, monitoring and more to prioritize developer well-being over optimization perfection. It can be a valid decision to deploy less and sleep better. And finally we'll examine how manual practice and discipline can be the key to superb products and experiences.
Transcript
Intro
Hi, my name is Julie. I'm here to talk to you today at devops.js about DevOps. And I want to do something a little bit different, I want to focus more on people as opposed to, let's just say DevOps in theory and on paper. So before we get started, I have to just briefly show a disclaimer that I am appearing here as myself and much of what I'm going to share with you today is my opinion based on my experience. So it's also not a complete guide to DevOps. For this talk and the time duration, I've decided to pick a couple of examples to illustrate the points I want to make.
[00:57] So I've been building for the web for a very long time. I'm a bit older than I look, and I've worked at every place from startups to actually full corporate. I was self-employed for a long time, so I actually was freelancing at various companies and that was my initial exposure to DevOps.
Then I joined Allianz Germany, which is a multi-billion dollar insurance company, and we moved some projects to the cloud in less than a year. It was a crazy ride, but I learned so much, not just about DevOps, the skills, but really at scale. I already knew a lot of those practices and especially around Git and automated deployments, but transferring those across the team is a lot harder than it sounds. And so today I am an engineer at Microsoft, I'm part of the FastTrack for Azure program where I help customers onboard to Azure. I specialize in cloud architecture and DevOps automation. So we don't just help them with best practice guidance. If they run into a big problem or a challenge, we also help unblock them. So some of the content in here is going to be from those customer scenarios as well as internal Microsoft, kind of like my story, my experience, which is why this is my opinion, not a “here's how you should do it”.
[02:23] And the reason is because DevOps is a journey. Every company is going to be a little bit different. They're starting in different places and you always have different people with different preferences and just how they work. So it doesn't matter if you bring in somebody like me who's been doing it for a decade. It depends on the team, the company processes, and you have to make all of this work together. This is a little bit different from some of the talks I give. Part of it is that this is year two in COVID. Doing a lot of these things that have to deal with transformation and cultural transformation is really hard when everything is remote and sometimes you've never met your team. I've never met my team. So some of the things I'm going to talk about in terms of best practices actually become much more challenging when you don't have that face time to have that nuance. And it's like, "Okay, is she serious or is she just being her snarky self?"
And then some of those rules, especially with security, can I bend some of them and why? So we're going to look at that. What I want you to get out of today is that a lot of these things are best practices, and even if I'm telling you they are, you don't have to do them. You don't have to do them today, you don't have to do them next week. You have to eventually get there. And also, you can get there without following some of those best practices. So it's not about tools. It's going to be about people. And people are going to make the difference for success, both in delivering a product and in actually growing a team, investing in a team that will still be with you in a year or 10. So without further ado, let's get started.
Pull requests
[03:58] So the first thing, pull requests. Everybody loves pull requests. It's best practice, but you can also do it wrong. So let's say we have a pull request workflow. Everybody's working in a branch, you might be used to, "I'm just going to open the main branch again, merge our workflow and push," and it's going to be like, "Nope." The server might reject it. So you have to then open a pull request. Then you make your commits, do all these things. And then you might have a convention or workflow set up that says, "Oh, I'm going to put a hashtag sign off. A robot should now merge this. It didn't work. Let me do that again. Let me do that again. And again." This is what I have to deal with at work. Then sometimes you just give up and you might not even close the pull request.
There are so many repos with dozens or hundreds of open pull requests. It's really frustrating because you want to help, you want to contribute, but it's not going anywhere. It's stuck. Now, even if pull requests are going in and the builds are running, and you can eventually get code merged in, they can be really slow. And that's just as frustrating and as disappointing. So this is what a pull request might look like. In documentation, it sounds really easy: open, approve, close and merge. But you might have it actually tied to a pipeline and you've got to wait for a build agent. If you have lots of people and you didn't buy as many build agents as you actually need, you'll wait forever. So then if you're like me, you grab your phone and you're like, "Let me look at Twitter, see what's going on." You go back and then when it finally ran you're like, "Oh, it failed. But it's half an hour, an hour later, let me go have lunch, let me come back." That's not necessarily super productive either.
[05:40] So what you end up having is this traffic jam of pull requests. And that's not necessarily helpful either. So I love this tweet by Keith and he's talking about pull requests, why you have them. And it's the best way to describe when you should not have them. And that is, a pull request is like airport security. When it was first introduced, it was really designed more for the open source community and you want to welcome outside contributions, large and small. And you want to introduce a way to collaborate on that. It's not just about security, it's about discussing a code change. Oh, what do you mean by this? Can you make that change? But if you're on the same team, do you need all those gates? Maybe you don't. So let's look at what actually this workflow looks like and where people get stuck.
So this is actually a diagram I use at work and it's in the Azure documentation. And if you look at that first part, what you'll see is that the pull request is a gate to that protected branch, the main branch or the production branch. And often I'll also see people who lock that down entirely, so even admins can't push directly to it. So you really are stuck in this pull request workflow. The problem here is that that is all done in code and nobody's perfect, I'm not either. And sometimes those things break for whatever reason. And then what you end up having is actually something like this. So I blanked out everything that you can't see. And I took this screenshot this morning and it made me really sad. 26 days ago, that was the last time somebody contributed to this repository. It's something internal that theoretically as engineers we should be using every day. There's stuff in there because Azure changes every day. There's bugs in there, people should go fix typos and broken links, et cetera. But what's happening is you'll see we have about 50 pull requests that are stuck. This repository has, in my opinion, run amok. It's just too many. And you also just see the number of forks we have for a team that's actually only a few hundred engineers worldwide. It sounds like a lot, but in the scheme of Microsoft, or at the scale of Microsoft, it is not a lot of people. What you really want to avoid, because this is the worst feeling ever, is that frustration.
[07:58] Because what happens is, you stop. And when you stop, you're not delivering business value. And that's really harmful. That's when the tooling isn't helping you. In fact, it's hurting you. So this is a great tweet from Matteo Collina who's on the Node.js core team, and I really loved it. And he was talking about, okay, he helped this company and they were deploying every two weeks. It was an overnight deployment, very long, and they're doing it every day now. So how are they doing that every day? And the key point is actually, you remove the gates and you trust your team. They get to deploy when they're ready. So that sounds really simple, but is it super simple? What does that look like?
So I pulled out this really old photo from the Allianz. So what does it mean when you're ready to deploy? So you think, "Oh, it's when we have tests, everything is great." Well guess what? In real life, not everything is great. Not everything is always great. And one of the things that was very interesting was that sometimes, and I never saw this actually in startups so much, I saw it more actually at this giant corporation. People said it was too risky to deploy during business hours for an internal line of business application. So let's still work eight hours but shift everything, and let's do it when nobody else is using it. And that's okay for the beginning. Trust me, it's not a goal. I think we only did that once with pizza and I have to blank out the faces because yeah, GDPR. So it's really up to the teams. In the beginning, you don't need to deploy 10 times a day. It's okay to start with every two weeks. It's okay to start with every month if you're dealing with a legacy application that is making you a lot of money. It just takes that time.
DevSecOps
[09:43] So another example that's more complicated, especially when you're like, "Ooh, gatekeeping. Why do we have gatekeeping?" We have it for security. So who doesn't want security? But it sounds too easy. Let's go through this example about how maybe it's too much or maybe we don't react to it.
So I was giving actually a webinar last week and I'm demoing how to use security and whatnot. And nobody actually said, "Julie, but look at that. There's a vulnerability, why aren't you looking at it?" I wish they asked me but they didn't. So let's go through it and I'll tell you why it's still there and it's going to stay there for the time being.
[10:24] So I go into the Dependabot alerts. So this is all automated. "Hey Julie, we found these issues. You should go fix them." One of them is really high. So the bottom one, for glob-parent. Now if we open it up, Dependabot, which was bought by GitHub, is trying to be helpful. "Here you go. It's just that the version's less than 5.1.2. So upgrade it and you should be good," and it actually gives you a suggestion. So let's try that. I'm going to stick that in my package.json. I'm going to do an npm update. This is what I see. npm WARN. All these messages and whatnot. And at the very bottom it says, "You have 15 vulnerabilities." I thought I had two before, now I have 15. And then it says, "Run npm audit for details." And you're like, "Hmm, okay, well guess what? I've done this before. I'm going to run npm audit fix immediately." So again, lots and lots and lots of output. And if I scroll down to the bottom, oh, I have to do that on that screen, I still have 15 vulnerabilities.
So why? I was hoping it would be gone. So let's do an npm audit fix --force. I don't even know if that's still a flag, but when you're a developer, that's what you do, you just try again. And we still have 15 issues, you can't get past it. So the hard part is actually now figuring out what should I do? There's no easy answer. You have to just stop, zoom out and think. So we have all these tools and they know about a certain fact at a certain time range, like oh 5.1.2, but it doesn't mean anything. It doesn't help you fix it.
[12:04] So after some Googling and Stack Overflowing, it was like, oh, there's this relatively new npm feature. You can do an override and you can say, "I don't care what version is required in my dependency tree, I want this one." So let me put in the latest ones for these. So I'm going to update my Node version manager on my local machine, my Dockerfile, all these things. I actually have some tests. So I'm going to do a pre-flight check which runs all of them and do a git push. Everything's green, all good. But then the CI build says, "Nope, it's not going to work." And if you dig into the error, it's actually giving me an OpenSSL error. So I'm like, "Eh, no, I don't want that. That's even worse encryption. Let me just ignore it. And I'm actually going to revert that change," and notice that in the commit I also put a link to the Stack Overflow thread, a little bit like, "Why is it there?"
So you'll see actually in all my pipelines, I tend to have continue on error. It's like, "Thank you for the alert, but I'm going to keep going." There's also an Azure Defender for Cloud container scanner, and it'll go through the code and it found the exact same thing. And so it's like, "Okay, thank you for the alert. I need to disable you as well." Now all my builds are running, but that shouldn't be the end of the story. I need to deploy, so continue on error. I know about the error. Let's figure out what we're going to do. The key to that is understanding what is glob-parent, what am I using it for, and what is the OpenSSL issue and how is that going to be used? So I was like, "Hmm, I'm not sure what the OpenSSL issue is, but I actually do know what glob-parent is going to do."
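The override described here lives in the `overrides` field of package.json, which newer npm versions read to force a version anywhere in the dependency tree. As a rough sketch only: the manifest below is a hypothetical stand-in (the package names `assessment-poc` and `some-static-site-tool` and the helper `withOverrides` are invented for illustration), and the pinned range is an assumption based on the alert saying versions below 5.1.2 are affected.

```typescript
// Minimal, hypothetical slice of a package.json manifest (not a full schema).
interface PackageManifest {
  name: string;
  dependencies?: Record<string, string>;
  // npm reads this field to force versions deep in the dependency tree.
  overrides?: Record<string, string>;
}

// Returns a copy of the manifest with extra overrides merged in.
// The original object is left untouched.
function withOverrides(
  manifest: PackageManifest,
  pins: Record<string, string>
): PackageManifest {
  return {
    ...manifest,
    overrides: { ...(manifest.overrides ?? {}), ...pins },
  };
}

// Hypothetical project; only the glob-parent entry mirrors the talk.
const manifest: PackageManifest = {
  name: "assessment-poc",
  dependencies: { "some-static-site-tool": "^1.0.0" },
};

// Assumption: any release from 5.1.2 onward within the 5.x line is acceptable.
const patched = withOverrides(manifest, { "glob-parent": "^5.1.2" });

// This is the JSON that would be written back to package.json.
console.log(JSON.stringify(patched, null, 2));
```

As the talk goes on to show, forcing a version this way can still break something elsewhere in the tree, so it is a tool for experimenting, not a guaranteed fix.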
[13:49] And so let's first understand what is this kind of vulnerability? And so it is a denial of service kind of vulnerability, which basically means your application isn't serving requests anymore because it's going to be bombarded. And so there's a networking version, but there's also a, let's say bad code version, where if you give it the right, let's just say inputs, then it could slow to a crawl or even crash. Is that going to happen like in my example? So I didn't show you what this project is, but the answer is no. So I'm building this application, it's a proof of concept. And what I want to do is give an assessment that's not a checklist. So if you select certain options, then maybe your security would go up, but at the cost of increased complexity. And so all of the things here, the questions and the factors, all of that is done in Markdown.
So it's pulling this all together. It's a headless CMS, I don't need a database. And it's figuring out what goes where and is attached to what, based on the file tree. So that's what it's globbing. It's matching all these things. How do I throw all this together? That is in my file system. There is no user input. This application also has a build process. So it's doing all that, figuring out live, but you just run a build at some point when you're done and when you deploy it and you're running in production mode or you're just serving it in static mode or whatever it's called, and you don't have that anymore, that gap isn't there. So in this situation, what I tell people is that tools are stupid. I'm just trying to get you to remember that the tool is not the holy grail.
[15:32] Just because it says there's a vulnerability, that doesn't mean, "Okay, stop everything, drop everything you're doing and figure that out." Because even if you figure that out, it might be like in this case, it's choosing the lesser of two evils. So the OpenSSL issue, whatever it is, has to do with the encryption that is used for the cookie, where the results, or rather the inputs that you gave, the answers that you picked are saved and it's all encrypted. But for me, I've decided that risk, even though there's actually no real use of data in there, is worse I think, than, okay, something that can only happen to me in development. So you won't know any of that until you go through that process. And this was just one particular example that I encountered this month. But for every kind of notification, you have to go through this process and it's really, really hard.
And then when you suddenly get something like this, we get to that other point of frustration, things that are clogged down. Some organizations will measure like, "Ooh, how many alerts are still open? How many vulnerabilities are still there?" But it doesn't actually tell you if it is a security, how bad, or it says, "High or moderate," but is it really that bad? "Lesser of two evils," that's what I always say.
Craftsmanship & the Art of DevOps
[16:52] Okay, so we talked about pull requests, we talked about how you need to trust your team. We talked about as well that even with security, you also have to trust your team to learn and grow. And the last thing I want to touch upon today is actually craftsmanship, because we make DevOps really hard, but I want people to actually want to do it. And key to that, in my opinion, is loving what you're doing and being proud of what you do.
So this is what I see a lot of, everywhere on GitHub, in open source as well, not just at work. You're probably using the GitHub UI if you're clicking something and it says, "Update README. Update README. Update README." That's okay. It's a place to start. Now what's interesting is when customers come to me, one of the biggest challenges people face is actually versioning, especially if they're using microservices. And for whatever reason you have to correlate things, in which case you don't have a microservice, you have a distributed monolith. Anyway, when you see something like this, it looks super nice and like, "Oh my god, simple. It's super easy." And how do you do that? Well, the answer is practice and discipline. So this is a changelog that I made using a tool called standard-version, and it relies on something called Conventional Commits. And that's what it is. It's just a naming convention. So when you make a commit message, this is what it looks like. How it figures out if there's a bug fix or a feature, it's just a prefix. Throw some stuff between the parentheses and it'll try to categorize things for you. And then as well, if you add the hashtag and a number, of course GitHub links it to the issue. Now, if you're really, really paying close attention, you can tell that some of those, I've also adjusted by hand because even I am not perfect, even I will sometimes be in a rush and I'm super pedantic, which is why I say, "Even I." I'll be in a rush and I'll just also submit something. But when I actually make a release, I will look at the changelog and edit it as necessary before committing it. Because at that point, I am publishing it to you, the consumer. And I want you to understand it. Any tiny mistakes I make in between, that's okay, it's just for me. But at the point where I say, "Okay, it's ready for the public," I want to make it look a little bit nicer.
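To make the naming convention concrete, here is a minimal sketch of how a commit header with a prefix, an optional scope in parentheses and an issue number could be split apart and sorted into changelog sections. It is not how standard-version is implemented; the names `ParsedCommit`, `parseCommitHeader` and `groupForChangelog` and the sample messages are invented for illustration.

```typescript
// Hypothetical shape for a parsed commit header; not part of any library.
interface ParsedCommit {
  type: string;          // the prefix, e.g. "fix" or "feat"
  scope: string | null;  // text between the parentheses, if any
  description: string;   // everything after the colon
  issues: number[];      // numbers referenced as "#123"
}

// Matches "type(scope): description", where "(scope)" is optional.
const HEADER = /^(\w+)(?:\(([^)]+)\))?: (.+)$/;

function parseCommitHeader(header: string): ParsedCommit | null {
  const match = HEADER.exec(header.trim());
  if (match === null) {
    return null; // does not follow the convention, e.g. "Update README"
  }
  const description = match[3] ?? "";
  const issues: number[] = [];
  const issuePattern = /#(\d+)/g;
  let found: RegExpExecArray | null;
  while ((found = issuePattern.exec(description)) !== null) {
    issues.push(Number(found[1]));
  }
  return {
    type: match[1] ?? "",
    scope: match[2] ?? null,
    description,
    issues,
  };
}

// Sorts headers into sections the way a changelog might present them.
function groupForChangelog(headers: string[]): {
  features: string[];
  fixes: string[];
  other: string[];
} {
  const groups = {
    features: [] as string[],
    fixes: [] as string[],
    other: [] as string[],
  };
  for (const header of headers) {
    const commit = parseCommitHeader(header);
    if (commit !== null && commit.type === "feat") {
      groups.features.push(commit.description);
    } else if (commit !== null && commit.type === "fix") {
      groups.fixes.push(commit.description);
    } else {
      groups.other.push(header); // left for a human to tidy up by hand
    }
  }
  return groups;
}

const sections = groupForChangelog([
  "feat(ui): add complexity score (#12)",
  "fix(pipeline): retry flaky deploy step (#42)",
  "Update README",
]);
console.log(sections);
```

Anything that does not match the convention falls into `other`, which mirrors the point about still editing the changelog by hand before a release.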
[19:14] The other thing too is that sometimes you're not doing everything perfectly, and in this case, we had to restructure a bunch of things. Take that extra step to actually make sure everybody gets credit. This I've seen happen a lot in corporate where unfortunately people actually count those commits. And you know that GitHub profile with all those little green squares? So you need to make sure that not just for you, but also for your colleagues, that their GitHub users and usernames are also getting credit for that work, otherwise people don't see it.
Now, it's also not just about that corporate loophole, it's also about, let's say appreciation, of teams and being grateful for the help that they're giving you. So this takes 30 seconds, but it goes a really long way for your team culture.
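The transcript does not say which mechanism was shown on the slide. One common way to share credit on GitHub is to end the commit message with `Co-authored-by:` trailers, and the sketch below assumes that approach. The helper `withCoAuthors`, the `Author` type and the sample name and address are hypothetical.

```typescript
// Hypothetical helper type: a person to credit on a commit.
interface Author {
  name: string;
  email: string; // should match an address linked to their GitHub account
}

// Appends one "Co-authored-by:" trailer per person, separated from the
// commit message body by a blank line.
function withCoAuthors(message: string, coAuthors: Author[]): string {
  if (coAuthors.length === 0) {
    return message;
  }
  const trailers = coAuthors.map(
    (author) => `Co-authored-by: ${author.name} <${author.email}>`
  );
  return `${message.trimEnd()}\n\n${trailers.join("\n")}`;
}

// Sample data is made up for illustration.
const creditedMessage = withCoAuthors("refactor: restructure content folders", [
  { name: "Ada Example", email: "ada@example.com" },
]);
console.log(creditedMessage);
```

The resulting text can be used as the commit message, for example when squashing a colleague's work into a restructuring commit.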
[20:05] Another thing that I want to tell people to spend time on is actually doing documentation. So I write this both for myself as well as people who are going to use the software that I publish or samples. It's not meant for your builds, it's meant entirely for humans. This has no purpose in code whatsoever except to help you understand something. And that's really, really important because we almost always have to go back and look at things in the future. And I don't remember what I did last week, let alone six months ago. And if we're working on multiple projects and microservices, et cetera, then yeah, sometimes you take a long break from a project. So it's really important to actually add that documentation.
Now when I show demos at work, you look at something like this and this is a GitHub Actions workflow, and I recently took some time to actually completely migrate an Azure Pipelines workflow. It looks really nice. But how long does it take? In reality, years. There's so much experience that comes from just actually creating pipelines. What works well for me, not necessarily for everybody, and then figuring out, "Okay, there's a gap here. There's a gap here. This feature works that way." It takes a long time. The simpler and more elegant it looks, the more effort it takes to get there. And it's okay. You don't start there. What you want to do is actually invest in yourself, which means good habits. Good habits, little tiny steps that over time, actually result in that big goal you're trying to reach.
[21:38] If you try to get to it from day one, what you might end up doing is actually hurting your own motivation and your team's motivation. And that's almost the worst thing you can do. I think it's better to have more contributions and stuff you need to clean up, than to have no contributions. And you might think, "Oh no, no, they just have to update the pull requests or whatever. Just have to make the build go green." But there's a human element to that that might be missing in those types of rules that are just all or nothing binary. Don't do that. Really it's about people. It's about investing in people long term. So things like documentation, et cetera, it's not a cost, it's not a chore, it's an investment.
And with that, I am done. So thank you for coming to this talk. I hope that it'll get you thinking critically the next time you read some article about Best Practices, 10 Things you Must Do. Do you really need to do it? And if you're interested in more types of topics like this, counterintuitive, what is it actually in real life? You can follow me on Twitter or on YouTube. Then thank you very much.
Questions
[22:50] Mettin Parzinski: Good to have you here. So let's look at the results. So the question was, what does it feel like to deploy to production? And with 57% we have a winner at passport control and airport security followed with 27% waving at your coach as you step into the field or 17% showing an ID before entering a bar or club. Was this something you were expecting, Julie?
[23:17] Julie Ng: Actually it surprised me, the airport control, that it's that many people. I thought this audience, this conference, folks who use JavaScript, I don't know, they would be more laid back. I think what surprises some people, because my job today, I helped Azure customers, even the Allianz didn't have passport control and airport security. So we had all those really super strict requirements. So my job there, I was a full stack engineer and then I later had a mentoring role across many teams and we established all the rules. But it got to a point, for example, some of the compliance stuff, the product owner, so a non-technical person would create the pull request that could deploy to production, for example. Or the product owner would be the one to click a button in Jenkins, and that deploys. There weren't any hard controls on everything at the time. It's just when you have a business that's worth that much money, it's scary anyway. So you're only going to click that button if you're a hundred percent sure. And because people actually own their repositories, their own repositories for the most part, it's like, "Okay, if I screw up, I'm just going to shoot myself in the face." It might have been that it was really early in the cloud journey and they didn't put all those rules on yet. But I don't think so. I think it was just that there was a good amount of trust, which is hilarious because I have airport security today in internal repositories and I don't even work on Azure. But it's not necessary. I hope it's clear from the talk that it's not necessary to have that much control. Trust is much more important, I think, for long-term growth of the team, team spirit, and then also just adding more features and business value.
[25:00] Mettin Parzinski: But I guess also with our audience, most of the people are not actually deploying user facing code, but they are building the infrastructure, so like the chef doesn't need to have his own kitchen, thinking of that. They're scared of their own code, maybe. But hey, we'll never know unless everyone tells us now on chat. So let's, if you're ready for it, hop into the Q and A. And as a reminder, if you have any questions you can still ask them in the channel, DevOps Talk Q and A. So be sure to join that. First question is from CC Miller. How would you handle the scenario where the vulnerability is a valid issue for the app but can't be updated because other things are not compatible with the fix?
[25:52] Julie Ng: So if I understood correctly, there is a valid security vulnerability, so it's a confirmed vulnerability, but there's no fix. And how should we handle that? Is that the question?
[26:06] Mettin Parzinski: Okay, yeah. And there's also compatibility issues.
[26:13] Julie Ng: Okay, there's compatibility issues. That's something you have to figure out with your security team. So I can say that the customers do this. I can say that a couple years ago, I'm pretty sure that Allianz probably still does this. You have to find that balance. And often even in large organizations, you'll have basically a contract. It's going to be locked, you're deploying something that has a known vulnerability and you have however many days or hours to actually fix it. And you have to make that decision. There's a vulnerability. If the team or the business owner, the product owner says, "We go anyway," okay, then the clock starts ticking after you go into production and you fix it by then. And that might mean rolling back. So there's always going to be a contract I think between all the different stakeholders in your organization and that includes the security team. And I think the hardest thing is to find the right amount of confidence. Like the example I said, it's a valid security issue, but what is the chance of it happening? There's no such thing as 100% security. I think you just have to evaluate it, take a little bit of a risk and figure out what is right for you. Yeah.
[27:28] Mettin Parzinski: Yeah. Agree. Next question is from Jessica. How often should we revisit what good looks like? We have to start somewhere and how do we know how often we should optimize for success?
[27:46] Julie Ng: I think you'll know in your gut. So you never reach perfect. And it's funny because I show demos at work all the time and in webinars and I'm like, "Oh, I'm doing it like this now," and then six months later, I would totally do it differently now. And so as you work with what you've built or as other people join your team, as you get new experience, you just might change it. And whether or not you have the time to change it, that's going to be dependent on what's your workload, how much time do you have to prioritize features, et cetera. So you're probably going to be doing it all the time and you'll never reach perfect. I think if you reach perfect, I'm like, "Hmm, that's an interesting bar." Or you have something that, there are some services you deploy and they're done. You don't really ... If it's not customer facing, let's just say it's an API to consume, some things are done and you don't update them regularly. Maybe for some security maintenance, but same thing then when it's automation, it's done.
[28:48] Mettin Parzinski: So yeah, I'm thinking of, I don't know where I got this wisdom from, but I once heard if you're looking at something you built a year ago or two years ago and you're not ashamed of it, that's a bad thing, because that means you have not progressed in your knowledge for a year or two years. And you mentioned half a year. Half a year is acceptable. And of course you don't need to progress. If you're fine where you are then you're fine. But most people here I'm assuming are here to learn new things. I thought it was a nice rule of thumb. If you're looking at something from two years ago, you're unashamed, it's a bad thing.
[29:35] Julie Ng: Yeah. It's not just though ashamed or whatever learning. So I'm reading this book right now, it's called Think Again by Adam Grant. So sometimes it's just that it's not you learning something new, but you realize something and you just change your mind. And changing your mind is actually not a bad thing. It means you're thinking like a scientist, maybe you have new learning information now and you change your mind. It didn't get better, it's just different. And what I changed, it works for me, it won't work for you, but that's okay. It's for me, not for you.
[30:06] Mettin Parzinski: Nice, nice. We have another question from Sisi Miller. Is it possible, can you over document, and what is too much, or is that a team thing?
[30:20] Julie Ng: You can definitely over document. I wish I could share so many internal things with you. Documentation, it's really hard to find that right balance. And sometimes there's not much text and actually that takes a lot more time to figure out what is actually necessary. Let me write a paragraph and delete 10% of it or 20% of it or rewrite it because it doesn't make sense. As much as you need and not much more. I think one thing I try to tell my colleagues, and I'm not really successful at it, is you don't need to write a book. You don't need to write whole paragraphs. If you can get away with bullet points, do it because nobody reads it anyway. Who has time to read pages? And so you will decide, or the people you're talking to will decide if it's too much. There are a couple times they're like, "Oh, didn't you read the Wiki?" And I'll say, "No, it was too long." And I'm direct, and so people are like, "You didn't read it?" I'm like, "No, I didn't read it. I didn't have time to read this much text. No."
[31:22] Mettin Parzinski: Yeah, it reminds me of a meme that was going on a few weeks back. Something like, why waste two minutes reading some good documentation when you can just spend a whole day trying to figure things out for yourself.
Julie Ng: I like that too. That also works.
Mettin Parzinski: And I felt personally attacked when I read that.
Julie Ng: Nah. It's not attack. It's all good. It's all good. Always good humor.
Mettin Parzinski: But I guess a lot of people are guilty of that. We have one minute. So we're going to do one more quick question from Sergio. Let's say you have some process in the workflow that was added because of reasons. How do you know when it's time to remove that process because it doesn't add value anymore?
[32:14] Julie Ng: I don't think I can do that question justice in under a minute. I would say let's go into the spatial chat, that person, if you can, and then I'll answer it there. I'm going to give you, you'll know when you know. But for details, let's go into the spatial chat room thing.
[32:31] Mettin Parzinski: Yeah, that's a nice bridge to the next step, because this is all the time we have for our Q and A. So Julie is going to go into her spatial chat, where you can continue the conversation with her. So if you want to have some one-on-one time with Julie, you can do so now in the spatial chat. Julie, thanks a lot for joining us. It's been a pleasure having you. And enjoy our spatial chat speaker room.
Julie Ng: See you later.
Mettin Parzinski: Bye-bye.
Julie Ng: Bye. See you later.