From 1 to 101 Lambda Functions in Production: Evolving a Serverless Architecture



Hello, I'll tell you a story about a serverless startup. The serverless part is definitely not the most important part of our startup, but on the other hand, it's a really cool story for someone who is a programmer working with Node.js and other technologies. So I'll tell you a story about Vacation Tracker. As I said, at the moment we are a 100% serverless startup using Node.js, but everything started with a simple Lambda function. Then we added another and another and another, and yeah, that escalated quickly. So now we have a lot of Lambda functions, and I'll try to walk you through our story from the first Lambda function to the current state in production.
We started more than three years ago. So first, in 2016, we decided to build a leave tracking system. Actually, I'm lying. We decided to solve our own problem, because our other company, Cloud Horizon, had more than 10 people at that moment, and it was really hard to track who's off, how many PTO days they have remaining for this year, and things like this. We tried to do an internal hackathon, and as with every hackathon, we didn't build anything. So in 2017, we tried again to solve our own problem. We tried to find some other leave tracking tools, but most of them were complex HR systems and things like this. So we decided to build something in-house, and we built some kind of proof of concept with Slack. And as always, we decided not to continue at that moment, but we published the landing page. In 2018, we got a lot of requests through our landing page. More than 100 people were on the waiting list for a private beta. So we finally decided to build something.
The idea was really simple. We wanted a system that would track leave requests and the number of remaining days. We wanted to use some kind of single sign-on, so we don't need to remember more passwords. I hate passwords. We wanted something connected to our Slack so we can see the info when we need the info.
For example, when someone is not working, we want to see that the person is on vacation, and things like this. And finally, we wanted to connect our calendar so we can subscribe to events and see who will not work next month, and things like this.
As I said, we were solving our own problem, and we didn't know if anyone else would use our system. But a few months after we released the beta version, we saw that there are many startups that want to use our system. And then we saw some small companies signing up. And then some schools and universities. And then nonprofits. And then teams from many enterprises. And then we saw some government organizations. And we saw some other organizations, such as churches and many others that I never thought would use a system like this. So there was a real problem, and we decided to continue with that idea. And today, we have many customers from many large and famous companies, and also many cool startups and smaller companies and organizations.
The number of unique users in our system on December 1st was this. Not all of these users are using Vacation Tracker, but all of them went through Vacation Tracker at some point. It's a nice number, but it's a real number that I took from our database. I decided to leave this number from December 1st because I'm pretty sure I will not be able to get such a nice number again.
So here's the product. We have a web dashboard where you can do many things, like see the quotas and set applications and many other things. For Slack users, we also have a nice integration where you can just click on a button or use the /vacation slash command to request or approve leave. For Microsoft Teams, we have the same thing, plus some nice cool things, such as an embedded dashboard inside Microsoft Teams, so you don't need to log into a separate system, and things like this. But let's talk about interesting things, and that's the architecture, not the system itself.
So our first architecture was a simple serverless bot. This was version 0.1, because this was a beta and we didn't know if anyone would use our system. So the bot looked something like this. It was really funny because there was no calendar or anything like that inside Slack, so we built everything with buttons. But it seems that was better than Excel spreadsheets and many other things that people are using to track their leave. And whenever you click on a button or anything like that, it triggers some serverless bot in the background.
So why serverless? That's the obvious first question. Well, this talk is about scaling a serverless startup from 1 to 1,001, no, not 1,000 yet, 101 Lambda functions in production, so that's why. I'm just kidding, of course, but now that I stopped the flow, let me introduce myself. I'm Slobodan Stojanović, CTO of Cloud Horizon and also CTO of this product, Vacation Tracker. I'm also a co-author of the book Serverless Applications with Node.js, which I wrote with my friend Aleksandar Simović, and I'm also an AWS Serverless Hero. I'm writing a lot about serverless, and you can see more articles on my website. There are links to many other websites where I write about serverless.
But let's go back to the most important question: why serverless? As you probably know, serverless is an acronym for something like slow, expensive, vendor lock-in. I'm obviously kidding, that's not true. I actually really love serverless, and we decided to build everything with serverless because at the moment we started this, we were a small team (we are still a small team), and our team was not really experienced with DevOps. So we decided to use something that has auto scaling and auto failover. We tried to go as cheap as we can because we bootstrapped our startup, and serverless fits that nicely because it's cheap. And it was really fast to build a prototype using serverless.
It took us a few days to build that chatbot with that fake calendar and everything, and it was basically production ready. And of course, serverless also gives us a good starting point for security. Of course, you still need to think about your data and everything, but Lambda functions are secure by themselves, and it's really easy to secure something that is active for, let's say, 100 milliseconds or something like that.
Of course, our first version wasn't 100% serverless. We started as a serverless chatbot, so Slack sends some requests to some kind of API gateway. Actually, it's called Amazon API Gateway, which triggers some Lambda function with some business logic. And that was our prototype. But then we decided to build a backend for that, because we wanted people to be able to store the data and the application to do something meaningful. So we built an Angular application, and the developer that we assigned to this project didn't know serverless, so he decided to use Node.js with an Express.js server and MongoDB. So half of our product was serverless, and the rest of it was basically a standard traditional application.
There were some clear benefits of this. Components were quick and independent, and it was easy to understand and maintain this kind of application because, as you saw, there were just a few components in the system. It was easy to onboard new people because it's easy to explain how the application works, and everything was really cheap. The cost for the first year was actually $0 per month for AWS. We had some credits for MongoDB and our server, but the serverless part was $0 per month for a long time, because they charge you by the number of requests, and in the first year we didn't have enough requests to start paying anything to AWS. Then we decided to add some new features because people asked for them, and our developer learned serverless and tried to use serverless for the new features.
So all new features went through that API Gateway. For example, for billing, we added a new Lambda function connected to Stripe, and we added some new endpoints and things like this. But with new endpoints, we faced the first problem: our Angular application needs to know where each endpoint is. Is it in the Express.js application or inside some Lambda function? So we tried to do something like this: instead of going directly to the Express.js server, we used API Gateway as a gateway for everything. And yeah, that was a bit messy. The downsides: every part of the system required independent deployment. It was hard to manage everything because we had more and more Lambda functions. It was hard to scale this kind of system, and we had a clear bottleneck: our MongoDB database wasn't serverless. So everything else was scaling except that part.
At that moment, we had around 100 paying teams. So we decided to improve our architecture, because it seemed that people wanted this kind of product. So we tried to build a more complicated serverless application. The first step was introducing infrastructure as code. We use CloudFormation, and we tried to have different services inside that CloudFormation. So for example, when you do something in one service, that service can send something to the, let's say, API of the other service, but that API doesn't need to be API Gateway or a RESTful API. Sometimes that's some kind of interface for notifications that can be sent in the background. For example, this part of the system is still the same. Whenever we receive a Slack message or something like that, it goes to our API Gateway. It triggers some Lambda functions depending on the action that the user is taking: a slash command, a click on a button, and things like these. Then we use something called Amazon EventBridge to send notifications in the background. And we respond back to Slack and tell Slack that the message is received.
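A minimal sketch of that flow, assuming a URL-encoded slash command body: parse the command, publish an event for background processing, and acknowledge Slack right away. The event shape, the `LeaveRequestStarted` name, and the `EventPublisher` interface are hypothetical and not taken from the Vacation Tracker codebase. The in-memory publisher stands in for an adapter that would call Amazon EventBridge in production, and Slack request signature verification is left out for brevity.

```typescript
// Hypothetical event shape; the talk does not show the real one.
interface DomainEvent {
  source: string;
  detailType: string;
  detail: Record<string, string>;
}

// Anything that can publish an event for background processing.
interface EventPublisher {
  publish(event: DomainEvent): Promise<void>;
}

// In-memory stand-in for local runs and tests. A production adapter
// would implement the same interface on top of Amazon EventBridge.
class InMemoryPublisher implements EventPublisher {
  readonly published: DomainEvent[] = [];
  async publish(event: DomainEvent): Promise<void> {
    this.published.push(event);
  }
}

interface HttpResponse {
  statusCode: number;
  body: string;
}

// Slack sends slash commands as a URL-encoded form body.
function parseForm(body: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const pair of body.split("&")) {
    if (pair === "") continue;
    const [key, value = ""] = pair.split("=");
    // "+" encodes a space in form bodies, so convert it before decoding.
    fields[decodeURIComponent(key)] = decodeURIComponent(value.replace(/\+/g, " "));
  }
  return fields;
}

// Business logic only: no API Gateway or Lambda types leak in here.
// Signature verification of the Slack request is omitted in this sketch.
async function handleSlashCommand(
  body: string,
  publisher: EventPublisher
): Promise<HttpResponse> {
  const form = parseForm(body);
  if (form["command"] !== "/vacation") {
    return { statusCode: 400, body: "Unknown command" };
  }
  // Hand the work over to the background, then acknowledge immediately.
  await publisher.publish({
    source: "slack",
    detailType: "LeaveRequestStarted",
    detail: { userId: form["user_id"] || "", text: form["text"] || "" },
  });
  return { statusCode: 200, body: "Got it, working on your request." };
}
```

Because the publisher is passed in, the same function can run locally against the in-memory stand-in and in production against a real event bus adapter.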
And then we have business logic somewhere in the background doing something with that data. At that moment, we had 150 Lambda functions, which was a lot. And we started doing our first migrations. We had old services on Express, and then we built new serverless services and somehow migrated the users from one to the other. And one of the key parts of this was finding a good architecture. We picked hexagonal architecture, but we'll talk about that a bit later.
Things we changed: everything was inside CloudFormation. We replaced the Node.js server with serverless services, still with Node.js. We started adding TypeScript, and we replaced MongoDB with DynamoDB, not for everything, but for most things. Benefits: it was easier to deploy our application. It was still auto-scalable. So far, we have almost 100% uptime out of the box. We didn't do anything to get that. We were down for, I think, 30 minutes in total since 2018. And it was still really cheap. Some downsides: we were storing state, not events. We were wasting a lot of time not focusing on our business logic, but on some other parts of the system. It was hard to onboard new developers because of the many new services. And I realized that developers don't like YAML and configuration.
At that moment we had around 600 paying teams, so we decided to evolve our architecture one more time. And we decided to use an event-driven architecture. So we went back to the drawing board and tried to find another good architecture that would work with hexagonal architecture and help us solve our problem. And we decided to use Command Query Responsibility Segregation, or CQRS. So why CQRS? As I said, we were storing state, but we wanted to store events. Why events? Because in Vacation Tracker, a lot of things are happening all the time. For example, someone can add you to a location, assign some leave policy, add some leave days for you. You can request a leave. Someone can change your working week.
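To illustrate why storing events helps, here is a minimal sketch in which the current balance is never stored directly but derived by replaying a user's events in order. The event names and fields are hypothetical, loosely based on the actions listed above. A real system would also need to handle dates, working weeks, and policy periods.

```typescript
// Hypothetical event types, loosely based on the actions mentioned in the talk.
type LeaveEvent =
  | { type: "PolicyAssigned"; daysPerYear: number }
  | { type: "DaysAdded"; days: number }
  | { type: "LeaveApproved"; days: number }
  | { type: "LeaveCancelled"; days: number };

interface Balance {
  quota: number; // days the user is entitled to this year
  used: number;  // days already taken or approved
}

// Apply a single event to the balance without mutating the input.
function applyEvent(balance: Balance, event: LeaveEvent): Balance {
  switch (event.type) {
    case "PolicyAssigned":
      return { ...balance, quota: event.daysPerYear };
    case "DaysAdded":
      return { ...balance, quota: balance.quota + event.days };
    case "LeaveApproved":
      return { ...balance, used: balance.used + event.days };
    case "LeaveCancelled":
      return { ...balance, used: Math.max(0, balance.used - event.days) };
  }
}

// Replay the whole event history to get the remaining days.
function remainingDays(events: LeaveEvent[]): number {
  const balance = events.reduce(applyEvent, { quota: 0, used: 0 });
  return balance.quota - balance.used;
}

const history: LeaveEvent[] = [
  { type: "PolicyAssigned", daysPerYear: 20 },
  { type: "DaysAdded", days: 2 },
  { type: "LeaveApproved", days: 5 },
  { type: "LeaveApproved", days: 3 },
  { type: "LeaveCancelled", days: 3 },
];

// Quota is 20 + 2 = 22 and used is 5 + 3 - 3 = 5, so 17 days remain.
console.log(remainingDays(history));
```

With stored state, a late correction (for example a cancelled leave) means patching numbers in place. With stored events, it is one more event and the same replay.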
Many things are happening every moment. And the key question is always how to calculate the remaining PTO days for the current year, or some other days for the current year, with the data that we have. And of course, storing events helps us a lot.
But with events, we also decided to remove part of our code. So we decided to use AWS AppSync, which is basically a managed GraphQL service. So now whenever a dashboard or some other application is writing something or changing something in our application, it sends a mutation. We store that mutation in an event storage table, and that triggers some background logic. We do some business logic. We use real-time subscriptions to let the front end know that the business logic is done. And we also store the current state in some read-only tables, because we want users to be able to run queries really fast using GraphQL from the front end.
At the moment, we have 112 Lambda functions. As you can see, we removed some of the Lambda functions. And there are some clear benefits. We have a fully managed GraphQL server that works really well. The system is actually faster. We have less code. We have better control. And we, of course, have all the benefits of the old architecture. And now we have a monorepo, which seems like a big benefit because we share some code between the front end and the back end. There are still many services to learn. And there are Velocity templates now instead of some functions. Velocity templates I will not even try to explain. They are like an alien language where you transform the request into something that AppSync understands. That way we don't need a Lambda function between the request itself and the event table, and that helps us improve the speed of our system.
But of course, there are many challenges in this process. The first obvious challenge is how to onboard new developers. Our current team has just four developers, all full stack. And one is actually new.
One developer just started. We have one marketing person, one customer support person, one product manager. And we have some freelance support, mostly for marketing and design. The good thing with serverless is that we can assign a new environment, a new AWS account, to each developer. So the first day you join Vacation Tracker, you get a copy of basically everything inside an AWS account that belongs only to you. And as the system is really complex (this is the same diagram that we saw previously, but with slightly more detail), we don't start with everything immediately. We start with just one small thing. For example, new developers start working with our online dashboard, which is basically a React app. And then slowly, with the new features that they're adding or changing inside the dashboard, they start learning the backend and how everything works there. And then they continue using and learning different parts of our application. And after about three months, they know basically most parts of the system. Not the details, of course, but they know how everything works.
Testing was another big challenge, and it's really important. And now we can talk about hexagonal architecture, or ports and adapters. Basically, our business logic is isolated from all the different adapters. So we can test everything locally with a local trigger and then run everything with Lambda event adapters in production. Let's go back to this example and focus on this one function. The code looks something like this now with TypeScript, but this was a JavaScript example. So for example, there's a lambda.js file for each service. It doesn't do anything except wiring the dependencies. There's main.js, which is basically the business logic, and it has its own unit and integration tests. But many parts of our system are using some common repositories, for example, for sending events to EventBridge. So we don't want to test each function against EventBridge.
Instead, we have some local notification repository with the same interface. It's easy to do that with TypeScript. Then we have a real EventBridge repository and some other repositories that have their own unit and integration tests, and we are testing them against real AWS resources. And of course, we have some helper functions and services. For example, the event parser has its own unit tests, but not integration tests, because it's not integrated with anything.
And for migrations, we do something like this. We use that MongoDB repository, or part of it, with its unit and integration tests. We try to build another repository with the same interface, for example, for DynamoDB. And then when we have the same interface, we can just switch the dependency in production. Of course, besides that, you need to transfer the data, but that's another topic.
So integration tests look something like this. For example, if I want to test the DynamoDB repository and I want to do integration tests, I can create a database before all tests and destroy it after all tests by just running some simple commands and waiting for about 10 to 20 seconds, more like 20 seconds, to create a new table. And then at the end, we just want to destroy that table so we don't leave any trash in our AWS account. This account is just for testing, but anyway, we just want to remove everything in the end, and that's it.
Another big challenge we had was debugging and monitoring. Now we can run different types of queries and see all the errors in our system, and things like this. We have different dashboards that help us monitor our application and get alerts, and things like this.
Of course, cost is another big challenge with serverless, but the total cost since 2018 was $7,000. And of course, we had some AWS credits, so we have paid maybe $2,000 so far. I actually calculated the most expensive bug. As you can see, it's half of the bill that we have had from the beginning.
And the bug was with DynamoDB. As you can see, when we fixed the bug, the number of requests dropped from thousands of requests every minute to basically zero. And that decreased the cost a lot, by hundreds of dollars per month. And of course, we had many other smaller challenges, but we are very happy with serverless so far. Those challenges are not related to serverless that much. They're the overall challenges that you have when building a startup anyway. And now with serverless, we have a team of real superhero developers that can build many things really fast.
So that's it. Let's go through a quick summary, because this was a bit longer than I expected. You should evolve your architecture with your product. Everything that worked in the beginning can't work now, when our product is much bigger and different. You need to pick a good architecture because it helps you keep your migration and onboarding costs low, or at least reasonable. You need to think about onboarding new team members because that's a really important part of every system. Hexagonal architecture is a really nice fit for serverless apps because it helps you test everything. CQRS is also a nice fit for serverless, and it's an excellent fit for our product. Find something that works for you. You should test your integrations and your application in general, of course, but testing is not always enough. You need to monitor your application and track the errors so you can react fast if something breaks. And yeah, that's it. Thank you very much. I'm writing a lot about serverless. I'm doing some free workshops and things like these. If you want to follow more about these architectures and what we are doing with serverless, you can subscribe on my website. Thank you very much.
Hey, good to see you.
Good to see you.
So what do you think? 61% are not using serverless yet. What's going wrong?
That's fine. That's fine. That's something that I expected. And it's okay.
I think that 40% of people using serverless at some point, at least for a small portion of their app, is really good. It's still quite new, and there are a lot of new things that you need to learn to be able to use serverless in your application. So yeah, I think that's a good percentage. And over the past few years, I saw that percentage rising from 0.something to 40%. So yeah, that's great.
So if we do this question again next year, what's the percentage you'll be expecting then?
I don't know. A bit more than 40%. I'll be happy if I see, let's say, 50 to 60% of people who have at least tried serverless.
So yeah, at least dipping their toes in the water, right?
Yeah. All right. So it's not the solution for everything, but so far I think you can use serverless for really many different use cases, and every day they're covering more and more things. So yeah.
Yeah. Super nice. So let's go into our questions from the audience. The first question I have is from House of Alejandro. He's starting his question with "awesome talk, Slobodan", and then a confetti emoji. So that's good. Some nice feedback from House of Alejandro. And his question is, what are your general impressions about vendor lock-in when we use specific serverless providers? So this is a topic that came up a few times yesterday already, vendor lock-in. How do you feel about that?
So yeah, this is one of the most important questions related to serverless. And I'm really not afraid of vendor lock-in, because for me, that's not a vendor lock-in. That's basically a switching cost. Whenever you decide to use something, let's say Node.js or PHP or Ruby on Rails (I saw that someone mentioned that we are a small startup not using Ruby on Rails), you have some switching costs.
If you need to migrate to something else in the future, you'll need to dedicate some time and pay some amount of money to migrate to the other thing. And it's the same with serverless. So far serverless is giving us really big benefits. And especially for startups, the amount of time that you need to build something is more important than some other things. So I'm really okay if I need to migrate some services off serverless to something else sometime in the future, but I don't see why I would need to do that. I'll need to pay some money and spend some time doing that, but it's okay. It's a business risk that I'm okay to take. We already migrated a lot of different things. For example, we used MongoDB and migrated to DynamoDB. So that was a big change. I don't think migrating from other AWS resources to something that is not on AWS is a lot more complex than that. So yeah.
Funny insight that you mentioned, even a programming language is kind of a lock-in. Never really thought of that.
Also frameworks and many other things. It really depends. For example, whenever you commit to anything, basically whenever you decide to do something with your product, you have some switching costs related to that. Even onboarding people is some kind of lock-in. When you use different services, you need to change your onboarding procedures and many other things. It's difficult.
Yeah. Yeah. Yeah. Never really thought of it like that, but every decision you make technology-wise is kind of a lock-in. Cool. Next question is from William RJ Ribeiro. I can imagine it is hard to find people that already have experience with serverless. Well, lock-in. How do you onboard new people that have zero experience?
Yeah. So we're doing that right now. We started a few weeks ago, and our new team member, Ivan, shipped his first feature to production in his second week at Vacation Tracker. So I guess that's not that bad. So for example, he has experience with React.
So we started with front-end tasks and React, and then he slowly started taking some tasks that have some small backend features and things like these. And we're trying to onboard people to a few simpler services that they can understand easily. For example, you have an API, you have some Lambda function, which is basically some kind of handler. And then on the other side, you have some database call or something like that. And then when they learn to do that, we slowly add more services and try to show them the portion of the big picture of our architecture that they're working on. And over time, we are able to onboard people with zero experience with serverless to Vacation Tracker without big... I don't know. It was harder for me to onboard people to some non-serverless projects than now to serverless projects. So it takes some time and there's a learning curve, but it's doable. It's not that much different from other things.
So basically you're saying that you could even teach me?
Sure. Yeah. We can do that after this talk. We'll meet at an after party. We'll learn some services. Oh, actually, I have a better idea. If anyone wants to learn serverless, we are doing a workshop next week as part of this conference, so you can learn how to write serverless applications with TypeScript. So yeah, that's one of the options.
Super nice. Also, William, that's included in your ticket price, like we mentioned in the opening. So be sure to check out Slobodan's workshop. We have time for one question, and that question is from Perry. Great panel discussion yesterday that you were in, and today's talk also. Is it possible to test Lambda functions in a production environment?
Oh, yeah, it is. Also, one good thing with Lambda functions is that you can use exactly the same Lambda function in, for example, a staging environment, test that function, and just label that function with a production label or something like that.
That same code will be promoted to production without redeployment. I'm not using that at the moment, but you can do even that. But yeah, sure. The good thing with serverless is that new environments are not that expensive, and most of the time they're actually $0. So we have one environment per developer, and they're exactly the same as production. They don't have the same data, but we can fill that database with the same data if we want. But yeah, of course you can use it in production. It's like anything else. You just don't worry about the server itself. Someone else will do that part of the work, but everything else is completely doable the same way that you did it for non-serverless applications.
So you can basically duplicate your production environment to make it a staging environment per developer with low cost.
With almost zero cost, yeah.
That's nice. That's nice. Because usually setting up a staging environment, well, we had a great talk yesterday also about setting up a staging environment and duplicating data, and usually anonymizing data also, of course.
That's still difficult, but that's not related to serverless. That's related to your data. But having everything that works and scales the same way is quite, I will not say easy, but it's easier than it was and definitely cheaper.
Cool. So that's all the time we have for this Q&A session. So if you want to discuss serverless more with Slobodan, Slobodan is going to be on spatial chat in his speaker room. So be sure to go there. We had some more questions in the Q&A channel, but we don't have the time to go into that now.
I'll do my best to answer that in the spatial chat and of course in the Discord after that session. Thanks.
All right. It's been a blast having you again, Slobodan. Hope to see you again soon. Bye-bye.
Bye.
32 min
24 Jun, 2021
