Serverless in Production, Lessons from the Trenches

Serverless technologies help us build better and more scalable applications in the cloud. It's a powerful paradigm that lets us focus on creating business value while the cloud provider handles the undifferentiated heavy lifting, such as managing the underlying infrastructure. In this session, Yan Cui takes us through many of the lessons he has learnt from running serverless workloads in production over the last five years, including development tips, testing and observability strategies, and much more.


You can check the slides for Yan's talk here.

34 min
18 Feb, 2022

AI Generated Video Summary

This Talk provides valuable insights for those considering serverless in 2022, with a focus on troubleshooting and observability using Lumigo. It emphasizes the use of multiple AWS accounts and Org Formation for better control and scalability. Security considerations include securely loading secrets at runtime and implementing zero-trust networking. Optimizing Lambda performance is discussed, along with updates on serverless frameworks and the role of Terraform. The Talk also compares Honeycomb and Lumigo for observability in serverless applications.

1. Introduction to Serverless Lessons

Short description:

In this talk, I will share the lessons I've learned from running serverless workloads in production over the past five years. I'll provide valuable insights for those considering serverless in 2022.

Hi everyone, thank you for joining this talk, where I'm going to tell you about some of the lessons I've learned running production workloads with serverless over the last five years. My name is Yan Cui. I'm one of the first AWS Serverless Heroes, and nowadays I work as a developer advocate with Lumigo, which I think is hands down the best observability tool for serverless applications. The other half of my time I work as an independent consultant, where I work with companies around the world to help them succeed with serverless. I've been running workloads in production using serverless technologies since 2016, and quite a few things have come up along the way. I've categorized them into a number of lessons, which I think will be really useful for everyone who's thinking about using serverless in 2022.

2. Importance of Troubleshooting and Observability

Short description:

You need to think about how you're going to troubleshoot your system from the start. Observability is crucial because bad things will happen. Logs are overrated for troubleshooting under time pressure. I use Lumigo, structured logs, CloudWatch, and Lumigo's alerts for troubleshooting.

The first and arguably the most important lesson is that you really need to think about how you're going to troubleshoot your system right from the start, because it's going to be a much harder problem to fix after the fact. Observability is a measure of how well the internal state of a system can be inferred from its external outputs, and it's absolutely crucial because bad things are going to happen to your system. They might not have happened yet, but they will eventually, because everything fails all the time, as Werner Vogels famously said.

And when things go wrong and users are impacted, you need to be able to fix the problems as quickly as possible. That requires us to be able to both identify the issues and resolve them in a timely fashion. I've spent many years just swimming around in log messages, and I've come to the conclusion that logs are overrated. They are useful, but they're not the most effective means of troubleshooting problems when you're under time pressure.

Even at the best of times, browsing through huge amounts of log messages can feel like looking for a needle in a haystack. So nowadays I have adopted a different approach. I use Lumigo and don't write many logs, but when I do, I make sure that my logs are structured and that they cover the blind spots that I don't see in Lumigo. I also use CloudWatch for system metrics and alerts, to complement the alerts that I get from Lumigo. Most of my troubleshooting is done inside Lumigo, where I'm either notified via a Slack alert, or I go to the issues page, where I can see all the recent errors, captured and categorized by function and error type.
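As a rough illustration of what "structured" means here, the sketch below emits each log entry as a single JSON object so that individual fields can be searched and filtered later. The function name `log_event` and the field names are assumptions made for this example; they are not taken from the talk or from any particular logging library.

```python
import json
import time


def log_event(level, message, **fields):
    """Build and print one JSON log line; extra keyword arguments become fields."""
    record = {
        "level": level,
        "message": message,
        "timestamp": time.time(),
        **fields,
    }
    # default=str keeps the logger from raising on values json cannot encode.
    line = json.dumps(record, default=str)
    print(line)
    return line


# Hypothetical usage: record a business-logic step that would not appear
# in a trace because it does not call another service.
line = log_event("INFO", "order enriched", order_id="o-123", items=3)
```

Because every entry is one JSON object, a log query can filter on `order_id` or `level` directly instead of relying on free-text search.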

3. Troubleshooting and Observability with Lumigo

Short description:

I can easily identify the root cause of errors and quickly work on a fix with Lumigo. It captures a lot of information about Lambda invocations, reducing the need for extensive logging. Lumigo shows logs alongside traces of function calls to other AWS services, making debugging complex transactions easier. I also set up alerts for system metrics and use CloudWatch for additional performance monitoring. Lumigo alerts me via Slack, and I can quickly access failed transactions for troubleshooting.

I can see how often these errors happened and how the trends have changed over time: are they happening consistently, or was it just a blip? From there, I can go into a particular error and find all the failed Lambda invocations that resulted in that specific error, and then go into each Lambda invocation, or rather transaction, because it may not be just one Lambda function. I can see the error message, which in this case says that some attribute value was missing when we tried to communicate with DynamoDB, and I can see the Lambda invocation event, which is something that I see a lot of people log at the start of their function.

With Lumigo, I don't need to do that myself. Here I can see the problem has been highlighted, and that it was caused by a call from our Lambda function to the DynamoDB table. I can click on the DynamoDB icon to find out what request payload we sent. From that, I can see that it was an update operation. And sure enough, we had the status message in the update expression, but it wasn't included in the expression attribute values. That missing attribute value in the expression attribute values object is why we got the error in the first place.

So within about 30 seconds of getting an alert in Slack, I was able to identify the root cause and start working on a fix, which hopefully would just be a simple null check in my code. And since Lumigo captures a lot of information about the Lambda invocations, including every request I make to other services or third-party APIs, I've got a lot of checkpoints along the way that let me infer the internal state of my application, which is why I don't have to write a lot of logs myself nowadays.

But if I have some complex business logic that doesn't involve calling another API or another AWS service, maybe some complex data transformation or enriching some of the data we're getting, then I will still need to write custom log messages. They leave me some breadcrumbs, so that later on I can look at the logs and figure out what was going on in my function when something happened.
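Going back to the DynamoDB error described above: that class of bug (a placeholder used in the update expression but absent from the expression attribute values) can also be caught before the request is sent. The following is a minimal sketch under stated assumptions; the helper name `missing_attribute_values` and the sample request are hypothetical and only mirror the shape of a DynamoDB update request.

```python
import re

# Value placeholders in a DynamoDB expression start with a colon, e.g. ":status".
PLACEHOLDER = re.compile(r":[A-Za-z0-9_]+")


def missing_attribute_values(update_expression, expression_attribute_values):
    """Return the placeholders used in the expression that have no value supplied."""
    placeholders = set(PLACEHOLDER.findall(update_expression))
    return sorted(placeholders - set(expression_attribute_values))


# Hypothetical request resembling the failure in the talk: the status
# message is referenced in the expression but never given a value.
request = {
    "UpdateExpression": "SET #status = :status, statusMessage = :statusMessage",
    "ExpressionAttributeValues": {":status": "FAILED"},
}
missing = missing_attribute_values(
    request["UpdateExpression"], request["ExpressionAttributeValues"]
)
```

A check like this could run in a unit test or just before the call, so the function fails with a clear message instead of a service-side validation error.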

Luckily, Lumigo shows your logs side by side with the traces of what your function has called in terms of other APIs or AWS services, so I can easily find the information I need in a single place. This is really handy when it comes to debugging some of the more complex and nuanced transactions that span multiple Lambda functions, event buses, DynamoDB or Kinesis streams, queues and whatnot. When there are multiple Lambda functions in a transaction, you get all the relevant logs from those Lambda invocations in one place, side by side with your transactions.

On top of that, I have alerts on many of the system metrics. For example, if I'm using DynamoDB, I'll have alerts around DynamoDB response time, user errors and so on. And if I'm processing real-time events from, say, a Kinesis stream, then I will have metrics and alerts around the age of the messages, so that I know when my function is falling behind and not processing events fast enough. This complements the built-in alerts I get from Lumigo, which let me know when there's something wrong with my Lambda functions. But of course, those don't cover the performance of other AWS services, which is why I also have CloudWatch alerts for them.

When something happens, I get a notification like this in Slack, because I've configured Lumigo to alert me via Slack. Clicking on this link takes me to the failed transaction inside Lumigo, so again, I can very quickly get to the root cause of the problem that we're seeing.
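To make the "age of the messages" alert more concrete, here is a hedged sketch that builds the parameters for a CloudWatch alarm on Lambda's `IteratorAge` metric (reported in milliseconds in the `AWS/Lambda` namespace). The dictionary keys follow the parameter names of CloudWatch's `PutMetricAlarm` API; the helper name, the function name `process-events`, and the threshold and evaluation settings are assumptions chosen for illustration, not values from the talk.

```python
def iterator_age_alarm(function_name, threshold_ms=60_000):
    """Parameters for an alarm that fires when a stream consumer falls behind."""
    return {
        "AlarmName": f"{function_name}-iterator-age",
        "Namespace": "AWS/Lambda",
        "MetricName": "IteratorAge",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Maximum",
        "Period": 60,               # evaluate one-minute data points
        "EvaluationPeriods": 5,     # require five consecutive breaches
        "Threshold": threshold_ms,  # assumed tolerance: one minute behind
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


# Hypothetical function name; the result could be passed as keyword
# arguments to a CloudWatch client's put_metric_alarm call.
alarm = iterator_age_alarm("process-events")
```

Keeping the alarm definition as plain data makes it easy to review and to reuse across functions, whichever deployment tool ends up creating it.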

4. Using Multiple AWS Accounts and Org Formation

Short description:

Using multiple AWS accounts is crucial to avoid service limits and improve scalability. It helps compartmentalize security breaches and insulate environments and teams. AWS Control Tower and Org Formation simplify account management and provisioning. Org Formation allows for infrastructure as code provisioning and template creation. Service control policies can be used to restrict actions to specific regions, enhancing security. Password policies can be configured for all AWS accounts. Overall, multiple accounts and proper configuration with Org Formation ensure better control and scalability.

Moving on, the second most important lesson is that you really should be using multiple AWS accounts. Ideally, you will have at least one AWS account per team, per environment, because there are a lot of service limits that you can bump into if you lump all of the environments into one account. You get limits on the number of resources you can have, like the number of DynamoDB tables or the number of Kinesis shards in the region.

One time I even ran into the limit on the number of IAM roles you can have in a region, which was 3,000. But when you have a company with 150 engineers all busy building things and creating new services and Lambda function IAM roles, you are going to hit those limits at some point, probably a lot quicker than you think. And then there are the more serious throughput limits that affect your application's scalability, like the number of API Gateway requests per second, or the regional limit on Lambda concurrent executions, which applies to all of your functions.

So the more things you have running in the same account in the same region, the more likely you are to be affected by one of these throughput limits. Having multiple accounts is a great way to put some bulkheads around your processes, and it also helps you compartmentalize any security breach. If someone were able to gain access to one of your AWS accounts, hopefully that would just be your dev account, so that at least they won't have access to your production user data, which would be a more serious breach.

Having one account per team per environment also lets you insulate each environment from the others, and each team from other teams, so that you don't have situations where one team uses up too many resources for their service and causes other teams' services to be throttled. For example, a load test in your staging environment might suddenly use up too many resources and start to impact your performance and scalability in production. And if one team owns multiple services, you may also want to put the most business-critical or high-throughput services into their own accounts. That way, you insulate the other services owned by that team from a service that may need much higher throughput, so that they don't impact each other.

To help you manage and provision all of these AWS accounts, there's AWS Control Tower. But I find that with Control Tower there's a bit too much clicking around in the console for my liking. I'm very much in favor of infrastructure as code everywhere. Luckily, there's a tool called OrgFormation, which lets me provision and manage my entire AWS organization with infrastructure as code, using a syntax that's very much like CloudFormation.

It also makes it easy for me to template my landing zone, which is basically a configuration of the basic infrastructure pieces that should be provisioned into a new account and a new region: things like KMS encryption keys, VPCs, subnets and so on. Using OrgFormation, I can declare a master AWS account as the root of my organization, again using code and a syntax that's very similar to CloudFormation. I'm able to provision a new account and set up budget alarms with just a couple of lines of code. And I can attach these accounts to my organization units, which I can also create in code.

I can also configure service control policies. For example, here, I know that for this project the only region we're going to use is EUS1. So I can set up a service control policy that denies all actions against regions other than EUS1 and apply this policy to the root of my organization, so that it's applied to all the accounts. Then I don't have to worry about, say, someone compromising my account and bringing up EC2 instances to do Bitcoin mining in a region that I don't normally use and so don't pay attention to until it's too late.

I can also configure a password policy for any AWS users I have and use it in all my AWS accounts. Whenever I make any changes, I can just run one command to update my entire AWS organization. And to configure my landing zones, I can configure what is called a task, where I specify a CloudFormation template. I can provide bindings from each resource to my organization, which basically say: create this resource from the template in specific AWS accounts, and maybe even specific regions, by binding the resource to, say, an organization unit or a specific AWS account.
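For readers who have not seen one, the region restriction described above boils down to a small policy document. The sketch below builds it as plain data using the standard `aws:RequestedRegion` condition key. The helper name, the `exempt_actions` parameter, and the region value `eu-west-1` are assumptions for illustration only (the transcript does not make the actual region clear), and this is ordinary policy JSON rather than OrgFormation syntax.

```python
import json


def region_restriction_scp(allowed_regions, exempt_actions=()):
    """Build a service control policy that denies requests outside allowed regions."""
    statement = {
        "Sid": "DenyOutsideAllowedRegions",
        "Effect": "Deny",
        "Resource": "*",
        "Condition": {
            "StringNotEquals": {"aws:RequestedRegion": list(allowed_regions)}
        },
    }
    if exempt_actions:
        # Global services are not tied to a region, so a real policy
        # typically exempts them via NotAction.
        statement["NotAction"] = list(exempt_actions)
    else:
        statement["Action"] = "*"
    return {"Version": "2012-10-17", "Statement": [statement]}


# Hypothetical region and exemptions, chosen only for the example.
policy = region_restriction_scp(["eu-west-1"], exempt_actions=["iam:*", "sts:*"])
print(json.dumps(policy, indent=2))
```

Attaching a policy like this at the root of the organization is what makes it apply to every account underneath it.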

5. Using OrgFormation for Resource Binding

Short description:

OrgFormation allows referencing resources across different accounts in CloudFormation templates. It provides organization binding to associate resources with specific AWS accounts. The special syntax in OrgFormation allows iterating through accounts and creating an array of ARNs for assumeRole policies. Updating the landing zone is as simple as running a command. OrgFormation is a powerful tool maintained by Moneyou.

The powerful thing about OrgFormation is that I can reference this resource from other resources in my template as if they were part of the same CloudFormation stack. Even though the resources can be created in different accounts, I can still reference them using the syntax that I'm familiar with from CloudFormation, like Ref and GetAtt.

So here I'm binding this resource to a subset of the AWS accounts in my organization using this concept of organization binding. Then, down below, I can iterate through all the AWS accounts identified by the binding, using a special syntax that again looks like CloudFormation, and accumulate an array of ARNs to associate with this assumeRole policy.
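The effect of that iteration can be pictured in plain Python: given the account IDs selected by a binding, produce one principal ARN per account and place the array into a trust policy. This is only an illustration of the resulting document, not OrgFormation's template syntax; the helper name and the account IDs are made up for the example.

```python
def assume_role_policy(account_ids):
    """Build a trust policy that lets every listed account assume the role."""
    principals = [f"arn:aws:iam::{account_id}:root" for account_id in account_ids]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principals},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Hypothetical account IDs standing in for the accounts a binding selects.
policy = assume_role_policy(["111111111111", "222222222222"])
```

When an account is added to or removed from the binding, the array of ARNs changes with it, which is the bookkeeping the template syntax takes care of.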

Don't worry about the mechanics and how it looks; it can look a bit alien at first. But once you spend some time with OrgFormation, it starts to make a lot more sense. And to update my landing zone, I just have to run another command, and that's going to update all of my accounts. So it's a really powerful tool, and I recommend you go and check it out. It's available on GitHub and it's maintained by Moneyou, which is a bank over here in the Netherlands.

6. Security Considerations and Quick Wins

Short description:

When it comes to security, it's important to securely load secrets at runtime. Avoid storing secrets in plain text environment variables. Fetch parameters from SSM or Secrets Manager, decrypt them at runtime, and cache them in the Lambda function. Follow the principle of least privilege and give each function its own IAM role. Implement zero-trust networking to authenticate and authorize requests to APIs. Use AWS IAM authorization to protect internal APIs. Consider VPCs as an additional layer of security. These are some quick wins when working with Lambda.

When it comes to security, there are two really important topics to talk about. The first is how you securely load secrets at runtime. At this point, I think most people know to put secrets in SSM Parameter Store or Secrets Manager and to encrypt them at rest with KMS. But then, during deployment, they will oftentimes download those secrets, decrypt them, and attach the decrypted secrets as environment variables for the Lambda functions, or for containers for that matter, which attackers will then be able to scan once they gain access to your system. From an attacker's point of view, looking at what you have in your environment variables and extracting it is low-hanging fruit.

We've seen many attacks against the NPM ecosystem over the years, and in some cases compromised or malicious NPM packages can be used by attackers to steal information from your environment variables. That is why, as a rule of thumb, you should never put secrets in plain text in environment variables. Instead, you should fetch those parameters from SSM or Secrets Manager and decrypt them at runtime during cold start. Then you want to cache them in the Lambda function and invalidate the cache every so often, so that you can rotate those secrets in the background without having to redeploy your application.

The way I tend to do this is using Middy, which is a middleware engine for Node.js functions. It's got built-in middlewares for both SSM and Secrets Manager that let you fetch secrets from those stores and cache them. It can also invalidate the cache every, say, five minutes, so that every five minutes your Lambda function goes to SSM or Secrets Manager and gets a fresh copy of the current value of those secrets.

An important consideration regarding the use of SSM is that there is a throughput limit of 40 ops per second by default. So if you want to go to production with this, you do want to consider switching to the higher throughput setting in your region, which raises that limit to a thousand ops per second. That also makes SSM Parameter Store a paid service where you pay, I think, 10 cents per 10,000 requests, whereas the default setting is free, which is why it's got such a low throughput limit.

Then there's Secrets Manager, which has built-in support for rotating secrets for RDS as well as Amazon DocumentDB, but it can be quite expensive by comparison because you are paying 40 cents per secret per month. That adds up when you have lots of secrets multiplied by the number of environments and accounts you have. So typically I only use SSM Parameter Store for my secrets, except when the secret is specifically for RDS or a DocumentDB database, in which case the built-in rotation from Secrets Manager becomes quite handy.
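The fetch-at-cold-start, cache, and expire-every-few-minutes pattern described above is independent of Middy and can be sketched in a few lines. The example below is an assumption-laden illustration in Python rather than Middy's actual implementation: `CachedSecret`, the injected `fetch` callable (which would wrap a call to SSM or Secrets Manager), and the fake clock are all hypothetical names introduced for the example.

```python
import time


class CachedSecret:
    """Cache a secret in memory and re-fetch it once the TTL has elapsed."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch          # callable that loads and decrypts the secret
        self._ttl = ttl_seconds      # five minutes by default, as in the talk
        self._clock = clock          # injectable so the expiry can be tested
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value


# Demonstration with a fake fetcher and a fake clock instead of real AWS calls.
calls = []
fake_now = [0.0]


def fake_fetch():
    calls.append(1)
    return f"secret-v{len(calls)}"


secret = CachedSecret(fake_fetch, ttl_seconds=300, clock=lambda: fake_now[0])
first = secret.get()    # first use: fetches
second = secret.get()   # within the TTL: served from the cache
fake_now[0] = 301.0
third = secret.get()    # TTL elapsed: fetches again and picks up a rotated value
```

Creating the `CachedSecret` outside the handler keeps the cached value alive across invocations of the same execution environment, while the TTL bounds how long a rotated secret can go unnoticed and keeps request volume well under the parameter store's throughput limit.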

The other important security consideration is that you should follow the principle of least privilege and give your Lambda functions the minimum permissions necessary. For example, if your function needs to save data into a DynamoDB table, then only give it permission to perform that action, and only against the table that it needs to save the data into. You also want to give each function its own IAM role, so that you can tailor the permissions to just what that function needs. Following the least privilege principle is going to help you minimize the blast radius of what attackers can access in the unfortunate event that they somehow manage to compromise your function and gain access to your environment, perhaps through a compromised NPM module that allows them to steal credentials from your Lambda functions or your containers.

On that note, we should also talk about zero-trust networking, which is an important and related topic. The traditional approach to network security is to harden the network boundary, but then everything inside that boundary has full trust and can access any other resource in that boundary. Zero-trust networking says no to that, because a single compromised node would give attackers access to our entire system. With zero-trust networking, we're not going to trust anyone just because they happen to be inside the right network boundary. Instead, every request to our APIs has to be authenticated and authorized. So for internal APIs that are used by our other services, you want to use AWS IAM authorization to protect them and make sure that the caller has the right IAM permissions to access those APIs. You can still put these APIs behind VPCs, but that extra layer of network security should be considered a bonus, not the only line of defense you have.
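As a small illustration of the least privilege example above (one action, one table), the following sketch builds the corresponding IAM policy document as plain data. The helper name, region, account ID and table name are placeholders invented for the example.

```python
def put_item_only_policy(region, account_id, table_name):
    """IAM policy allowing only dynamodb:PutItem against a single table."""
    table_arn = f"arn:aws:dynamodb:{region}:{account_id}:table/{table_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:PutItem"],
                "Resource": table_arn,
            }
        ],
    }


# Hypothetical values: one policy like this per function, attached to
# that function's own IAM role.
policy = put_item_only_policy("eu-west-1", "111111111111", "orders")
```

Compared with a wildcard such as `dynamodb:*` on `*`, a compromised function holding this role can write items to one table and nothing else.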
And then, to wrap things up, here are some quick wins for when you're working with Lambda, which should be broadly applicable for most people.

7. Optimizing Lambda Performance

Short description:

When using Node.js and the AWS SDK version 2, setting the environment variable on Lambda functions enables HTTP keep-alive, saving time on requests. Database proxies should be used with Lambda and RDS to manage connection pooling. Smaller deployment artifacts lead to faster cold start times. Trimming dependencies and bundling them into a Lambda layer can optimize deployment time. However, Lambda layers are not a substitute for general-purpose package managers. To reduce cold start time, avoid referencing the full AWS SDK. Use Lambda destinations instead of dead-letter queues for capturing failed invocation events. Thank you for your attention and feel free to ask any questions.

When you're using Node.js and still using the AWS SDK version 2, you need to set this environment variable (AWS_NODEJS_CONNECTION_REUSE_ENABLED, set to 1) on all of your Lambda functions to enable HTTP keep-alive, which is going to save you about 10 to 20 milliseconds on every single request you make from your Lambda function to other AWS services.

And if you're using Lambda with RDS, then you want to use database proxies to manage the connection pooling. Otherwise, you're likely to run into problems with having too many connections to the RDS cluster.

As far as cold starts are concerned, having smaller deployment artifacts also means a faster cold start time. But unfortunately, adding more memory doesn't actually help reduce the cold start time, because during a cold start your Lambda function already runs the initialization code at full CPU. This is true for, as far as I know, all language runtimes except Java, because for Java part of the initialization happens during the first invocation after the initialization.

I think that's why, when it comes to Java, the extra CPU resources you get from a higher memory setting can also help reduce cold start time. But that's not the case for any of the other language runtimes I've tested.

The thing that's going to help you more in terms of reducing cold start time is actually just trimming your dependencies. Having fewer dependencies means a smaller deployment artifact, but also less time for the runtime to initialize your functions.

One thing that I've found really useful is to bundle my dependencies into a Lambda layer, so that I don't have to upload them every single time when they haven't changed between deployments. It's a great way to reduce both the cold start time and the deployment time.

However, Lambda layers are not a good substitute for general-purpose package managers like npm or Maven, and you shouldn't use them as the primary way to share code between projects, because for starters, there are a lot more steps to make it work. With something like npm, publishing a new package and storing it is really straightforward, and there's support for scanning your dependencies against known vulnerabilities. Publishing a new version of a Lambda layer and then bringing that new version into a project takes a lot more work, and you don't have semantic versioning or the other tooling you get with npm.

My preferred way to use Lambda layers is to use this plugin with the Serverless Framework, which packages your dependencies into a Lambda layer during deployment, uploads it to S3, and then updates all of your functions to reference this layer. The great thing about this plugin is that it detects when your dependencies have changed. If they haven't changed on a subsequent deployment, it doesn't have to publish a new layer version, and your deployment is much faster as a result.

Another trick for reducing cold start time is to not reference the full AWS SDK. If you just need to use one client, then reference only that client. This reduces the time it takes for the Node runtime to initialize your function module, and therefore makes your cold start faster.

And if you're using Lambda to process events asynchronously, like with SNS, S3, or EventBridge, then you also need to configure some sort of DLQ, or dead-letter queue, to capture any failed invocation events so that they're not lost. Nowadays, you should use Lambda destinations instead of Lambda's dead-letter queues, because a dead-letter queue only captures the invocation payload and not the error. So you have to go back to your logs or whatever to figure out why the invocation failed in the first place, so that you can decide whether or not the event can be reprocessed now. Lambda destinations, on the other hand, capture the invocation payload as well as the context around the invocation and the response, which in the case of a failure includes the error from the last attempt. So you don't have to go fishing for it yourself.
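To show what that extra context looks like in practice, the sketch below summarizes an on-failure destination record. The field names (`requestContext`, `requestPayload`, `responsePayload` and their children) follow the record format AWS documents for asynchronous invocation destinations; the helper name and every value in the sample record are made up for illustration.

```python
def summarize_failure(record):
    """Pull the original event and the last error out of a destination record."""
    context = record["requestContext"]
    error = record.get("responsePayload") or {}
    return {
        "function": context["functionArn"],
        "attempts": context["approximateInvokeCount"],
        "condition": context["condition"],
        "error_type": error.get("errorType"),
        "error_message": error.get("errorMessage"),
        "original_event": record["requestPayload"],
    }


# Illustrative record with invented values, shaped like an on-failure
# destination payload.
sample = {
    "version": "1.0",
    "timestamp": "2022-02-18T10:00:00.000Z",
    "requestContext": {
        "requestId": "00000000-0000-0000-0000-000000000000",
        "functionArn": "arn:aws:lambda:eu-west-1:111111111111:function:demo:$LATEST",
        "condition": "RetriesExhausted",
        "approximateInvokeCount": 3,
    },
    "requestPayload": {"orderId": "o-123"},
    "responseContext": {
        "statusCode": 200,
        "executedVersion": "$LATEST",
        "functionError": "Unhandled",
    },
    "responsePayload": {
        "errorType": "ValidationError",
        "errorMessage": "status message is missing",
        "stackTrace": [],
    },
}
summary = summarize_failure(sample)
```

With both the event and the error in one record, a reprocessing tool can decide whether to replay `original_event` without consulting the logs, which a dead-letter queue message alone would not allow.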

And that brings me to the end of my presentation. As I mentioned before, I spend most of my time as an independent consultant, so if you want to have a chat about how serverless can help you, go to theburningmonk.com to see how we might work together to help you succeed with serverless. And if you want to learn some of the tips and tricks I've picked up over time, I'm also running a workshop in March, and you can get 15% off with the code yenprs15. With that, thank you very much for your attention. If you've got any questions, feel free to let me know, and we can discuss them now.

Oh, wow, 27% of our audience has been using Terraform as their deployment framework, and in second place we have 23% for Serverless, 18% for CloudFormation, 14% for CDK, another 40 for something else, and 5% for SAM. How do you feel about this, Yan? Is this what you were expecting?

Not at all, actually.

8. Serverless Framework Updates and Terraform's Role

Short description:

I find it quite strange to see this kind of result. It doesn't quite match up with what I see in the wild. Personally, I think seeing the Serverless Framework up there matches what I see in the wild, and CDK is becoming a lot more popular, so that also makes sense. Terraform is an interesting one. There are a lot of teams that focus on infrastructure still using Terraform. Certainly, if most of your job involves writing infrastructure, provisioning VPCs and things like that, then Terraform is still a very good tool. But I think when it comes to actually building serverless applications, Terraform is just too low level. It doesn't give you a lot of help in terms of actually building an API and things like that. You end up writing a couple hundred lines of code just to provision API Gateway endpoints when it should be two lines of code in the Serverless Framework, or SAM, or something else. So I think it's an interesting result that Terraform still came out on top. Interestingly as well, very few people are using SAM nowadays. I guess AWS has really shifted gears in terms of what they promote. A couple of years ago it was all SAM, and then the CDK came along, and now that's going to be the main thing that AWS is pushing. So it makes sense for CDK to be up there, whereas SAM has been left behind. Yeah. I also see people still answering. But luckily, positions didn't change. Else, we would have to answer the question all over again.

All right. Thanks, Yan. We're going to jump into the audience questions. Don't forget, if you have any more questions, you can still pop them in No Talk Q&A, and if we have time, I will ask Yan. The first question is from Argentile1990. How do you feel about the integration of SSM, KMS, et cetera, with infrastructure as code tools like CloudFormation and Terraform? How could the barriers to entry be lowered for smaller teams with legacy software?

If you're talking specifically about SSM and KMS, I think they're really easy to provision using infrastructure as code. The one thing that's kind of annoying with SSM, when it comes to using CloudFormation or any tool that's built on top of CloudFormation, is that it doesn't support secure strings out of the box. So what I've always done in the past is to use a custom resource, so that I can provision secure strings in SSM using my KMS key and all that, and still use infrastructure as code.

9. Terraform Modules and Staffing Cost Optimization

Short description:

With Terraform, you can use modules for custom resources and support secure strings in SSM. Lowering the barrier to entry for smaller teams with legacy software can be achieved by adopting serverless and re-architecting applications. This approach can lead to significant cost savings and reduced staffing needs. AWS has found that staffing costs can outweigh infrastructure costs for large enterprises. By using managed services, developers can distribute on-call responsibilities and rely on AWS for infrastructure management and security.

I think with Terraform, you can use Terraform modules. I don't know if there are any publicly available modules for that, but certainly with things like SAM, the Serverless Framework, CloudFormation and CDK, you can also have some custom resources or modules you can use to support secure strings in SSM. It's one of the things that's weird that CloudFormation still doesn't support today. It seems to be a bit of a no-brainer for them to do that, at least by now.
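As a rough illustration of the custom resource approach mentioned above, the sketch below maps a CloudFormation custom resource event onto the SSM request a Lambda-backed handler would make. The property names (`Name`, `Value`, `KeyId`) and the `toSsmRequest` helper are hypothetical choices for this example; the request shapes follow the SSM `PutParameter` and `DeleteParameter` APIs. Actually calling SSM, and reporting the outcome back to CloudFormation through the event's `ResponseURL`, are left out.

```typescript
// Minimal shape of the event CloudFormation sends to a custom resource.
// ResourceProperties holds whatever the template passes in; the property
// names used here (Name, Value, KeyId) are assumptions for this sketch.
interface CustomResourceEvent {
  RequestType: "Create" | "Update" | "Delete";
  ResourceProperties: {
    Name: string;
    Value?: string;
    KeyId?: string;
  };
}

// The SSM call the handler should make for a given event.
type SsmRequest =
  | {
      action: "put";
      input: {
        Name: string;
        Value: string;
        Type: "SecureString";
        KeyId?: string;
        Overwrite: boolean;
      };
    }
  | { action: "delete"; input: { Name: string } };

// Hypothetical helper: translate the CloudFormation lifecycle event into
// a PutParameter or DeleteParameter request.
function toSsmRequest(event: CustomResourceEvent): SsmRequest {
  const { Name, Value, KeyId } = event.ResourceProperties;

  if (event.RequestType === "Delete") {
    return { action: "delete", input: { Name } };
  }

  if (Value === undefined) {
    throw new Error(`Value is required to ${event.RequestType} ${Name}`);
  }

  return {
    action: "put",
    input: {
      Name,
      Value,
      Type: "SecureString",
      // Omitting KeyId lets SSM fall back to the account's default key.
      ...(KeyId ? { KeyId } : {}),
      // Update must overwrite the existing parameter; Create must not.
      Overwrite: event.RequestType === "Update",
    },
  };
}
```

A real handler would pass `input` to the SSM client and then send a success or failure response to CloudFormation. How the secret value reaches the handler is deliberately left open here.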

In terms of lowering the barrier to entry for smaller teams with legacy software, I guess I'm not quite sure exactly what you're talking about. Are you talking about just SSM or KMS? Or are you talking about adopting serverless in general?

I guess the first company where I really did a lot of work with serverless was a social network. We actually migrated from running a bunch of EC2 instances to serverless and rebuilt our entire architecture. We followed the strangler pattern, where we took one part of the application and rewrote it, and then moved on to the next part of the application. We rewrote it because the app itself had lots of problems, both in terms of resiliency and in terms of just basic scalability and cost efficiency. So as part of re-architecting to improve the user experience, we also just rewrote those parts. We had a very small team of six people, including myself. Over the course of six months, we basically rewrote the whole thing. And I think we cut about 95% of our AWS bill, but even more than that, we basically let go of our team of DevOps engineers, who weren't doing a lot of things. They were there because they had to look after some of the VPCs and things like that, but there just wasn't enough work for them. And once we moved everything to Lambda, API Gateway, and things like that, there just wasn't enough DevOps work. So in terms of staffing costs, that's probably one of the biggest savings that we made, because the company was small and couldn't hire full-time people, so they hired a bunch of contractors on really high day rates. And in the end, we were able to just let the application team be self-sufficient because they didn't really need that much DevOps support. You're still following the DevOps principles in terms of infrastructure as code and things like that, but you don't need specialists to look after your cluster of EC2 instances or containers. You don't need people to look after all your network security and all of that because, again, using managed services, all those things come out of the box from AWS.

Yeah, funny. I never thought of that. I hear a lot about optimizing AWS costs, but the staffing cost that you mentioned is also a big optimization.

Yeah. So AWS did a white paper a while back, I think in 2018. They found that, at least for large enterprises, there are almost more savings on the staffing costs than there are on the infrastructure costs. You're talking about hundreds of millions for some of these large enterprises in terms of staffing costs. Because you've got to think about it: you need a whole team to look after your cluster of machines, or you can just have your developers share some of the on-call responsibilities, and most of the infrastructure will be managed and secured by AWS.

We have time for one more question. The question is from Hail to the wood.

10. Comparison of Honeycomb and Lumigo

Short description:

Honeycomb is a great product, but Lumigo is more suitable for fully serverless applications. Lumigo provides out-of-the-box visibility and reduces the need for manual instrumentation. With Lumigo, you can easily trace requests and get the necessary context without writing extensive custom logs. Honeycomb requires more manual instrumentation to capture the required information. Lumigo is a specialized solution for complex serverless applications, while Honeycomb is a versatile observability platform. Thank you, Yan, for joining us!

We currently use Honeycomb for observability. What would you say might sway us to use Lumigo?

Yeah, I think Honeycomb is a great product. I do really like their product. I've used them in similar projects in the past as well.

I think the thing about Lumigo is that it fits if you've got a fully serverless, or maybe say 90% serverless, application and you're doing something that's more interesting, more advanced. Things like: I've got some API here, but then I'm writing stuff to databases and event buses, and lots of other event processing happens afterwards. For getting that traceability all the way through, with Honeycomb, at least, you still have to do quite a bit of manual instrumentation. You want to capture every request you're sending to AWS services or third-party APIs, so that you can add the context you need in order to find the information you need in the logs. And sometimes you forget, so you have to go back and update your code and all of that.
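To make "manual instrumentation" concrete, here is a minimal sketch of the kind of wrapper you end up writing around each outgoing call. It is a generic illustration, not Honeycomb's or Lumigo's API; the `traced` helper, the `TraceRecord` fields and the `correlationId` are all hypothetical names.

```typescript
// One record per outgoing call: enough context to find the request in the logs later.
interface TraceRecord {
  correlationId: string;
  operation: string;
  durationMs: number;
  ok: boolean;
  error?: string;
}

// Where records go. The default writes structured JSON to stdout.
type Sink = (record: TraceRecord) => void;

const logToStdout: Sink = (record) => console.log(JSON.stringify(record));

// Hypothetical helper: run `call`, record how it went, and pass the
// result (or the error) through unchanged.
async function traced<T>(
  correlationId: string,
  operation: string,
  call: () => Promise<T>,
  sink: Sink = logToStdout
): Promise<T> {
  const startedAt = Date.now();
  try {
    const result = await call();
    sink({ correlationId, operation, durationMs: Date.now() - startedAt, ok: true });
    return result;
  } catch (err) {
    sink({
      correlationId,
      operation,
      durationMs: Date.now() - startedAt,
      ok: false,
      error: err instanceof Error ? err.message : String(err),
    });
    throw err;
  }
}
```

Every call you want to see in a trace has to be wrapped, for example `await traced(requestId, "orders-table.put", () => saveOrder(order))`, where `saveOrder` stands in for your own code. Any call you forget to wrap leaves no record, which is the gap described above.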

With Lumigo, if all of your application is running on Lambda and whatnot, then Lumigo gives you all of that visibility out of the box, so that by and large you don't really have to worry about doing manual instrumentation. Nowadays, I basically don't have to write many logs at all, because most of what I need I can get from Lumigo already. The only time I have to write some custom logs is when I've got business logic that doesn't involve calling some APIs, in which case I don't know what's going on, because I'm not seeing stuff inside Lumigo that shows me, oh, you're making a call to this API, here's the request body and response body. So I have to put in some custom logs.

So my experience with Honeycomb and other solutions is that you still need to do quite a bit of manual instrumentation in order to give yourself the breadcrumbs you need to find your way once your application is running in the wild, whereas with Lumigo that comes out of the box. I would say Honeycomb is a great observability platform for a lot of things, whereas Lumigo is a much more specialized, focused solution for people who are building more complex and interesting applications using serverless technologies.

All right. Thanks a lot, Yan. If you have any more questions for Yan, our time here is over, but Yan will be in his speaker room on Spatial Chat. You can click the link in the timeline below, and Yan will pop over there in a minute. So Yan, thanks a lot for joining us, and it's been a pleasure having you. Hope to see you again soon.

Thanks for having me. Cheers guys.
