Guardians of the Applications: Conquering Node.JS App Monitoring


Ever struggled with monitoring in your Node.JS apps? Not anymore! By sharing the good, the bad, and the hair-pulling from our own experiences, I want to help you steer clear of monitoring chaos. We’ll see how truly knowing how your apps work helps you monitor them in a more focused way. This lets you dodge the black holes that checkbox monitoring can create, so that important metrics and alerts are not swallowed. Additionally, we’ll see how strategic and focused logging, monitoring, and alerting with tools like Graylog, Grafana and Prometheus can supercharge your app’s resilience. Join to uncover how reliability and monitoring patterns and anti-patterns can help improve app quality. You will return armed with invaluable insights that can skyrocket your monitoring game!

21 min
15 Feb, 2024

Video Summary and Transcription

Monitoring and observability are important for catching bugs before they become noticeable. Examples of monitoring issues include the confusion and frustration that follow when monitoring is set up without a strategy. Teamwork is essential for effective monitoring, and automation can streamline processes and improve efficiency. Custom monitoring is necessary to prevent hazards, while unnecessary alerts can hurt productivity. Challenges include relying too much on monitoring without addressing root issues and struggling with manual configuration.


1. Introduction to Monitoring and Observability

Short description:

What happens when your apps crash constantly? Customers flee, revenue tanks, and we feel powerless. Today I'm going to share with you the monitoring methods that can help you prevent disasters. It's important for teams to monitor systems and catch bugs before you notice them. Let's explore the important concepts of monitoring and observability. Monitoring and observability work together to detect issues early and prevent future problems.

What happens when your apps crash constantly? Customers flee, revenue tanks, and we feel powerless.

Hi, I'm Martin Tomician, a senior thought manager here at Infobip, and today I'm going to share with you the monitoring methods that can help you prevent disasters. So let me first introduce myself. I'm 30 years old and 55% deaf in both ears. In my seven years at Infobip, I've mastered skills like Web Infrastructure, React, Webpack, Microcontent, Automation, Monitoring, Improving Experience, and much more. But hey, I'm not just a coder. When I'm not conferencing or mentoring, you'll find me exploring the world, taking photographs, and collecting rubber ducks and magnets. So hang on, because it's going to be a ride.

You know how it goes: when Netflix stops working, customers get mad fast, and you have a bad day. Outages upset both customers and the business. That's why it's important for teams to monitor systems. They play detective so they can catch bugs before you notice them.

And you know, catching issues early keeps apps humming and customers happy. So let's explore the important concepts of monitoring and observability, and let's take an example. Little Pedro fell hard and scraped his knee. His little assistant noticed how he limped and cleaned his wound. The robot investigated why he fell and realized he had untied shoes. Together they cared for the injury and prevented the next one.

Monitoring and observability work together here. Monitoring is basically like the little assistant. She keeps going and checks whether things are working right. She checks metrics, sees that Pedro limps, and knows immediately that something is wrong. Logging helps monitoring by recording information that can be useful later.

But monitoring only alerts you to issues. Monitoring alone can feel like wandering in the dark. You are not sure what is going on, and you are unsure about the root cause. Observability is like flipping on a light switch. Everything is lit, and you can see the logs, metrics, and traces clearly. And you immediately know why Pedro fell: because of his untied laces.

2. Examples of Monitoring Issues

Short description:

Mechanics rely on diagnostics so they can precisely fix your car, and observability is basically a very similar thing. Monitoring tells us that something is wrong but not how to retain our users. At Infobip, we use Graylog for logging, Grafana for dashboards, Prometheus for metrics, Opsgenie for alerts, and Sentry for user-facing issues. Poor choices can affect the company, apps, and reliability. The first example is ShopPass. The monitoring left them completely confused and frustrated, so they had to change and improve. The second example is Tom, who checks Tickethype's website, ensuring everything is right.

One more example is a blinking check-engine light inside the car. It warns you about the issue but not about the cause. Mechanics rely on diagnostics so they can precisely fix your car, and observability is basically a very similar thing. It looks under the hood of the software and pinpoints the problem so that it can be fixed.

Monitoring here, for example, tells us that something is wrong but not how to retain our users. And you know, monitoring tools give our apps superpowers. Like heroes, apps can seem invincible, but they rely on the talent behind the scenes to help spot bugs early. At Infobip, for example, we use Graylog for logging, Grafana for dashboards, Prometheus for metrics, Opsgenie for alerts, and Sentry for user-facing issues. I'm going to show you some real monitoring war stories and use practical examples. Because of limited time, I'm not going to focus so much on boring tool demos. So let's start.
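To make the "Prometheus for metrics" part of that stack a little more concrete, here is a minimal sketch of a request counter rendered in the Prometheus text exposition format. It is not the speaker's code: the metric name `http_requests_total`, the `route` and `status` labels, and the `Counter` class are assumptions for illustration, and a real Node.js service would normally use an existing Prometheus client library rather than hand-rolling this.

```typescript
// Minimal in-memory counter rendered in the Prometheus text exposition format.
// All names here are hypothetical; label values are not escaped in this sketch.
type Labels = Record<string, string>;

class Counter {
  private readonly name: string;
  private readonly help: string;
  private readonly values = new Map<string, number>();

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  // Increment the series identified by this label set.
  inc(labels: Labels, by = 1): void {
    const key = Counter.serialize(labels);
    this.values.set(key, (this.values.get(key) ?? 0) + by);
  }

  // Render every series as text that a Prometheus server could scrape.
  render(): string {
    const lines = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
    ];
    this.values.forEach((value, key) => {
      lines.push(`${this.name}{${key}} ${value}`);
    });
    return lines.join("\n") + "\n";
  }

  // Sort label names so the same label set always maps to the same series.
  private static serialize(labels: Labels): string {
    return Object.keys(labels)
      .sort()
      .map((k) => `${k}="${labels[k]}"`)
      .join(",");
  }
}

const httpRequests = new Counter("http_requests_total", "Total HTTP requests handled.");
httpRequests.inc({ route: "/checkout", status: "200" });
httpRequests.inc({ route: "/checkout", status: "200" });
httpRequests.inc({ route: "/checkout", status: "500" });

console.log(httpRequests.render());
```

Splitting the counter by route and status is the design choice that matters: a dashboard or alert can then separate failing checkout requests from healthy ones instead of seeing one undifferentiated total.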

The Greek philosopher Plato once said that a good decision is based on knowledge and not on numbers. And our decisions do affect company, apps, and reliability. So it's important to see how poor choices can affect this and cause instability. And it is important to understand how our application is working, what is the correct behavior and what is its current performance so that we can do proper logging, monitoring, and troubleshooting, and so that we can make sound decisions.

The first example is ShopPass. Their system started to crash as soon as their shopping traffic jumped, and managers were in a panic. They just started buying this and this and this monitoring tool without any strategy. Everything was disjointed, everyone was confused, and they didn't have any systematic approach. I mean, you know, we've all tried putting out fires without actually seeing the full picture. This is the first anti-pattern, which is called tool obsession: we become so obsessed with certain tools that we lose perspective. It's so easy to think that the latest tool will be the silver bullet, only to end up distracted from delivering actual value. You know, we shouldn't put all our faith in the tools, because that can make teams think the tools are a magic wand that leads to success. Remember, Cinderella's fairy godmother even warned her that spells wear off at midnight. The same thing applies here, because nothing can replace the hard work.

So what was the problem for ShopPass? The monitoring left them completely confused and frustrated, because the network showed green but the users still complained. They wasted time trying to decode those contradictions and errors, and they didn't really have any insight into their critical backend processes. So they had to change and improve, and that's exactly what they did. They looked closely at the vital signs and metrics, saw what is important and what keeps them healthy and on track, and made sure to cover that. They simplified their tools so that they could focus only on what matters the most. And they made a focused game plan, which allowed them to spot issues early, celebrate progress, and make sure the products are more stable.

For the second example, let's imagine Tom. He checks Tickethype's website, kind of like its doctor, ensuring everything is right. He checks servers, speeds, databases, and especially errors.

3. Importance of Teamwork in Monitoring

Short description:

And if something is wrong, he gives a call to the team: hey, can you fix it? By updating monitoring tools, Tom helps keep the website in good shape. Monitoring works better as a team effort. Tom monitored things alone. The company realized that monitoring is too critical for only one person, so they needed teamwork. They automated as much as possible so that they could streamline things and everything could work smoothly. And we have Shopvac, for example, who set up monitoring, but did that too quickly, just using reports.

And if something is wrong, he gives a call to the team: hey, can you fix it? By updating monitoring tools, Tom helps keep the website in good shape. So what is the problem here, actually? It is treating monitoring like an annoying smoke detector that you keep snoozing. One or two firefighters are not enough to handle everything alone. You need the whole team listening for the alarm, ready to grab the hose together. Because ignoring even one alarm can burn our customers.

And we need to work together, because monitoring works better as a team effort. All of us, devs, ops, network, SRE, think differently, and our different perspectives help catch problems faster. The DevOps mindset is also about combining forces to monitor systems, because then we will be able to keep things running more smoothly. So what was the problem? Tom monitored things alone. But you know, he could only cover so much, communication was a problem, and he had to wait for alerts to get resolved. All that increased his stress, and he was prone to mistakes.

But the company realized that monitoring is too critical for only one person, so they needed teamwork. And they did that, and took everyone on board to fix their problems. They trained the IT team on the best practices, and made sure that there were no blind spots and that the whole team worked together. And they automated as much as possible so that they could streamline things and everything could work smoothly. And we have Shopvac, for example, who set up monitoring, but did that too quickly, just using reports. They cut corners, and all of those rushed alerts backfired.

4. Improving Monitoring and Alerting

Short description:

We learned the hard way that custom monitoring is essential to prevent hazards. Checkbox monitoring can give you untrustworthy data. Unnecessary alerts can hurt productivity and disturb sleep. IT teams became overwhelmed with constant, unimportant alerts, missing critical issues. They struggled with monitoring but took three steps to improve: tailoring metrics, improving communication, and using specific tools. Using an open-source Grafana dashboard helped us spot and create alerts based on key metrics. Connecting alerts to Slack simplifies checking. More examples in the blog post.

Because we are all in a hurry in some way to meet a deadline, only to have things blow up later from those cut corners. Shopvac learned the hard way that they should have custom monitoring, which can help them prevent hazards when they have problems. This is my favorite anti-pattern, and it is called checkbox monitoring, where you set up monitoring just to say that you have it. That gives you more problems than solutions, because you end up with mostly untrustworthy data.

For example, my team had issues with unnecessary alerts, because they do hurt productivity. Imagine getting an alert at 1 a.m. like, hey, the certificate expired one month ago, please replace it, when you have already done it. And frustratingly, we couldn't disable those useless interruptions, and you know, they were just disturbing our sleep, basically.

IT teams loved their monitoring systems at first, but then became overwhelmed, because they were getting constant, unimportant alerts that were basically drowning out the real threats. They missed some really critical application issues. They believed they had good, reliable protection, but a big problem went unnoticed. And they also had a big problem during incidents, because they had no clue what had happened based on the metrics.

So what did they do? They struggled with monitoring, so they took three steps. First, they tailored their metrics to fit their needs. Second, they set up a constant review process to improve their metrics. And third, they improved communication so that everyone could be trained on how to do things properly.

And I will share one example with you. We have an open-source Grafana dashboard for HAProxy that we use, and you can find it via this QR code or link if you want to. But you know, we didn't just copy-paste this. We carefully chose exactly what we needed, and we made sure it would work for exactly what we need. And it is working. It looks like this, for example. We monitor it regularly and adapt it as we need something. And you know, it really helped us. We managed to spot issues, and we even managed to create some alerts based on the key metrics. I also want to show you one more tip. When alerts happen, it is useful to have a list of the steps to investigate or troubleshoot, like which Grafana dashboard and which Graylog logs to check, because this really helps a lot. You can get alerts on the phone. Great. But if you can also connect your alerts to Slack, it can make it easier for you to check the full alerts. You can see the snippet here, but I'm going to share more examples in my blog post.
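The snippet from the slide is not part of this transcript, so the following is only a rough sketch of the idea: an alert that carries its own troubleshooting steps and is turned into a Slack message. The alert name, the URLs, and the `buildSlackMessage` helper are hypothetical. The result is a plain `{ text }` object of the kind a Slack incoming webhook accepts; actually posting it to a webhook URL is left out.

```typescript
// Sketch: attach "where to look first" steps to an alert and format it for Slack.
// Dashboard and log URLs are placeholders, not real endpoints.
interface RunbookStep {
  label: string;
  url: string;
}

interface Alert {
  name: string;
  summary: string;
  runbook: RunbookStep[];
}

function buildSlackMessage(alert: Alert): { text: string } {
  // Slack renders <url|label> as a clickable link.
  const steps = alert.runbook.map((step, i) => `${i + 1}. <${step.url}|${step.label}>`);
  const lines = [`*${alert.name}*: ${alert.summary}`, "Where to look first:", ...steps];
  return { text: lines.join("\n") };
}

const message = buildSlackMessage({
  name: "HAProxy backend errors",
  summary: "5xx responses are above the alert threshold",
  runbook: [
    { label: "HAProxy dashboard in Grafana", url: "https://grafana.example.com/d/haproxy" },
    { label: "Backend error logs in Graylog", url: "https://graylog.example.com/search?q=level%3Aerror" },
  ],
});

console.log(message.text);
```

Keeping the runbook next to the alert definition means whoever receives the message at night does not have to remember which dashboard or log stream belongs to which alert.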

5. Monitoring Challenges and Solutions

Short description:

Shopbit relied too much on monitoring without addressing root issues, leading to recurring problems. They learned to investigate and reinforce weak spots proactively. Subdev struggled with manual configuration and set up monitoring for every server by hand. They implemented standardized configuration, source control, and templates to improve efficiency and quality.

The next example I want to show you is Shopbit. They sell products online and they closely monitor their website, but they rely on monitoring too much, and they do not really use the chance to improve. This is like having leaky pipes and creaky floors and, instead of fixing the root issues, all you do is slap on band-aids and tape and call it monitoring. That happens with systems too. It means we do not really fix the core problems. Monitoring can provide temporary relief, but technical debt will just keep piling up. We need sustainable solutions instead of band-aid fixes.

The problem was that they trusted monitoring too much and got too comfortable relying on those alerts to spot an issue, so they kept getting the same problems again and again. For example, their websites were overloaded with traffic. But instead of investigating those root issues, they just kept applying quick fixes, so the same problems kept returning. So what did they do? They actually asked why this happens, got to the bottom of it, and then got the team to reinforce those weak spots proactively. And their monitoring tools now focus more on improving reliability instead of just reacting to things.

And then we have Subdev, for example. They are growing fast as a dev company and they now have hundreds of servers. The IT team is struggling with setting up monitoring for every server by hand, and this gets harder and harder. Configuring monitoring tools manually wastes a lot of time, and it is prone to mistakes. So let's talk about a smarter way to do this. For Subdev, the problem was that things got more complex, it was hard to keep up, and they were just missing things. Communication was a problem, and they couldn't maintain it. And they realized that trying to fix issues as they happen just wasn't working at all. So they needed to do something, and they took several steps. First, they standardized the monitoring configuration so that it is aligned everywhere. Second, they use source control to have a single source of truth for everything and to improve teamwork. And third, they created templates that they can reuse to save time and ensure quality.
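As a rough illustration of those three steps, the sketch below generates monitoring configuration from one shared template instead of writing it per server by hand. The talk does not say which format Subdev used; here the output is shaped like a Prometheus scrape job (`job_name`, `scrape_interval`, `static_configs`), and the hostnames, team names, the `team` label, and the defaults are all assumptions. A file like this would live in source control next to the server inventory.

```typescript
// Sketch: one shared template produces the scrape configuration for every server.
// Hostnames, team names, and defaults are hypothetical.
interface Server {
  host: string;
  team: string;
}

interface StaticConfig {
  targets: string[];
  labels: { team: string };
}

interface ScrapeJob {
  job_name: string;
  scrape_interval: string;
  static_configs: StaticConfig[];
}

// Defaults live in one place, so a change here reaches every server.
const DEFAULTS = { scrapeInterval: "30s", exporterPort: 9100 };

function buildScrapeJob(jobName: string, servers: Server[]): ScrapeJob {
  // Group targets by owning team so that each group carries a `team` label.
  const byTeam = new Map<string, string[]>();
  for (const server of servers) {
    const targets = byTeam.get(server.team) ?? [];
    targets.push(`${server.host}:${DEFAULTS.exporterPort}`);
    byTeam.set(server.team, targets);
  }
  return {
    job_name: jobName,
    scrape_interval: DEFAULTS.scrapeInterval,
    static_configs: Array.from(byTeam, ([team, targets]) => ({ targets, labels: { team } })),
  };
}

const scrapeJob = buildScrapeJob("node", [
  { host: "app-01.internal", team: "payments" },
  { host: "app-02.internal", team: "payments" },
  { host: "search-01.internal", team: "search" },
]);

console.log(JSON.stringify(scrapeJob, null, 2));
```

Adding the hundredth server then means adding one line to the inventory and reviewing a small diff, not repeating a manual setup.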

6. Automated Dashboard and Monitoring Challenges

Short description:

We automated the Grafana dashboard and caught several issues. Monitoring for Nest Hub meant constant alert fatigue instead of a focus on critical issues. We need helpful information that doesn't overwhelm our engineers. We should make our tech work for us, not against us. Nest Shop had too many alerts and missed serious issues. They prioritized critical alerts for immediate attention.

And I will just show you one interesting example of what we did. We have automated the Grafana dashboard. We have only one hardcoded input, which is the team owner name. Everything else is fully calculated by Prometheus queries. And then we have it in the dashboard like this. We track the status of our virtual machines and of our servers, and it shows us the statuses very quickly. We also managed to catch several issues that appeared.
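The actual dashboard is only shown on the slide, so this is a hedged sketch of the approach rather than the real thing: a function that takes the team name as its only input and derives every query from it. The object is a heavily simplified subset of what a Grafana dashboard definition contains (a real one needs more fields, such as data source and panel layout), and the `team` label on the `up` metric is an assumption about how targets are labeled.

```typescript
// Sketch: the team name is the only input; every query is derived from it.
// Simplified shape, not a complete Grafana dashboard definition.
interface Panel {
  type: string;
  title: string;
  targets: { expr: string }[];
}

interface Dashboard {
  title: string;
  tags: string[];
  panels: Panel[];
}

function buildTeamDashboard(team: string): Dashboard {
  // Assumes every scraped target carries a `team` label.
  const selector = `{team="${team}"}`;
  return {
    title: `${team}: server status`,
    tags: ["generated", team],
    panels: [
      { type: "stat", title: "Servers up", targets: [{ expr: `count(up${selector} == 1)` }] },
      { type: "stat", title: "Servers down", targets: [{ expr: `count(up${selector} == 0)` }] },
    ],
  };
}

const dashboard = buildTeamDashboard("payments");
console.log(JSON.stringify(dashboard, null, 2));
```

Because nothing except the team name is hardcoded, new servers show up on the dashboard as soon as they are scraped with the right label.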

The dashboard has been a really good idea, and it is extremely useful for us to monitor performance. But we have two more examples. Monitoring for Nest Hub was like helicopter parenting: constant alert fatigue, because they were getting notified about paper cuts instead of only about real emergencies like broken bones. And, you know, good monitoring tells us when our site is truly sick, not when it just has the sniffles. We need to give our engineers helpful information that does not overwhelm them, so that they can focus on the critical issues.

I think this anti-pattern is pretty much self-explanatory. We shouldn't be calling someone at 3 a.m. like, hey, the certificate expired three months ago, please replace it. Because notifications like that will just teach us to ignore alerts, you know. And we've all been there. But the point is, we need our tech to help us, not drive us crazy. It can also take a toll on our health, and that's not good. And you know, we are human, and we always make mistakes. We just need to make sure we get better. We should make our tech work for us, not against us.

The problem for Nest Shop was that they simply had too many alerts, and they weren't able to tell real emergencies from less important ones. They missed a lot of serious issues that were buried among less urgent alerts. And with so many alerts coming in, they couldn't really use their resources properly. That was a problem for them. They also took several steps. They prioritized critical alerts so that they could be sorted out ASAP, especially when they matter the most.

7. Improving Alerting and Dashboard

Short description:

They've changed the threshold and connected alerts to revenue. Optimized the process and boosted efficiency. Set up selective notification for non-urgent alerts. News Hub realized a single performance measure was too basic. They improved metrics and diagnostics to fix important issues and improve application speed. Simplify and focus dashboards to answer key questions quickly.

They changed the thresholds so that they have fewer false positives. They connected alerts to revenue and to what can really impact them. And they optimized the process so that the right people can tackle the right problems, which also helped them boost efficiency.
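A minimal sketch of that kind of triage could look like the following. The thresholds, the error-rate signal, and the `affectsRevenue` flag are illustrative assumptions, not Nest Shop's real rules: low-level noise is ignored, revenue-critical problems page someone, and everything in between becomes a ticket.

```typescript
// Sketch: decide what happens to an alert based on severity and revenue impact.
// Threshold values and field names are hypothetical.
type Action = "page" | "ticket" | "ignore";

interface AlertSignal {
  name: string;
  errorRatePercent: number; // current error rate of the affected service
  affectsRevenue: boolean; // for example checkout or payment paths
}

// Raising `warn` is one way to cut false positives from normal background noise.
const THRESHOLDS = { warn: 2, critical: 10 };

function triage(signal: AlertSignal): Action {
  if (signal.errorRatePercent < THRESHOLDS.warn) return "ignore";
  if (signal.affectsRevenue && signal.errorRatePercent >= THRESHOLDS.critical) return "page";
  return "ticket";
}

const decisions = [
  triage({ name: "checkout errors", errorRatePercent: 12, affectsRevenue: true }),
  triage({ name: "avatar upload errors", errorRatePercent: 12, affectsRevenue: false }),
  triage({ name: "checkout errors", errorRatePercent: 1, affectsRevenue: true }),
];

console.log(decisions);
```

In this sketch only the combination of high severity and revenue impact wakes someone up; whether a non-revenue outage should also page is a policy decision each team has to make for itself.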

Another creative idea for non-urgent alerts: say you have an alert that a certificate will expire in a month and a half, and you want to be notified, but not at 2 a.m. You can set up your alert to notify you only during working hours or on some specific days. This lets you use Prometheus queries in a really creative way. And it's good to do things like that, you know.
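The talk does this inside Prometheus queries, which offer time functions such as `hour()` and `day_of_week()`. The sketch below shows the same gating idea in application code instead, so it is a swapped-in illustration rather than the speaker's setup. The working hours (09:00 to 17:00, Monday to Friday, in UTC) and the `urgent` flag are assumptions.

```typescript
// Sketch: hold back non-urgent alerts outside working hours.
// Hours are evaluated in UTC to keep the example deterministic.
interface PendingAlert {
  name: string;
  urgent: boolean;
}

const WORK_START_HOUR = 9;
const WORK_END_HOUR = 17;

function isWorkingHours(now: Date): boolean {
  const day = now.getUTCDay(); // 0 = Sunday, 6 = Saturday
  const hour = now.getUTCHours();
  const isWeekday = day >= 1 && day <= 5;
  return isWeekday && hour >= WORK_START_HOUR && hour < WORK_END_HOUR;
}

// Urgent alerts always go out; everything else waits for working hours.
function shouldNotifyNow(alert: PendingAlert, now: Date): boolean {
  return alert.urgent || isWorkingHours(now);
}

const certAlert: PendingAlert = { name: "certificate expires in 45 days", urgent: false };
const outageAlert: PendingAlert = { name: "checkout is down", urgent: true };

const thursdayNight = new Date("2024-02-15T02:00:00Z");
const thursdayMorning = new Date("2024-02-15T10:00:00Z");

console.log(shouldNotifyNow(certAlert, thursdayNight)); // held back until working hours
console.log(shouldNotifyNow(certAlert, thursdayMorning)); // sent
console.log(shouldNotifyNow(outageAlert, thursdayNight)); // sent regardless of the hour
```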

But also, for example, we can have News Hub. They publish millions of news items, and they care only about loading the content quicker, let's say, as they grow and become more popular. But, you know, only tracking how fast the files load is not enough, because we cannot reduce complex software to one simple dumb metric. A single headline number may ignore real pain points. It is like in medicine: if you track only a patient's weight and height, but not the blood work, you cannot diagnose issues. This interesting anti-pattern is called the big dumb metric. Imagine, for example, that you have 95% of the responses in 180 milliseconds. How does that help us? What does it tell us? Metrics should be useful. They should give us information that we can actually benefit from. This doesn't help us at all, you know, it's just useless.

And for example, if we have one response time for absolutely everything, like one single metric, or we show only a top-level view without any other details, it doesn't give us any information. We cannot do anything with it, you know. News Hub realized that a single performance measure was actually too basic and couldn't really differentiate between important things. So they missed the chance to fix really important issues and misunderstood the real problems, like application speed, and they overlooked all of that. It didn't go really well, so they needed to improve. They took three steps. First, they track the performance metrics they need for precise diagnostics. Second, they set up the monitoring tools that can help them, and they use only those. And third, they review metrics to constantly improve them.
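To see why one response-time number for everything can hide a real problem, here is a small sketch with made-up data: 97 fast requests to a hypothetical `/home` route and 3 very slow ones to `/search`. It uses a nearest-rank percentile, which is an illustrative choice rather than anything prescribed in the talk.

```typescript
// Sketch: a single overall p95 versus a p95 per route, on invented samples.
interface Sample {
  route: string;
  ms: number;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

const samples: Sample[] = [
  ...Array.from({ length: 97 }, () => ({ route: "/home", ms: 120 })),
  ...Array.from({ length: 3 }, () => ({ route: "/search", ms: 2500 })),
];

// The "big dumb metric": one number for the whole application.
const overallP95 = percentile(samples.map((s) => s.ms), 95);

// The same data, broken down by route.
const msByRoute = new Map<string, number[]>();
for (const sample of samples) {
  const list = msByRoute.get(sample.route) ?? [];
  list.push(sample.ms);
  msByRoute.set(sample.route, list);
}
const p95ByRoute = new Map<string, number>();
msByRoute.forEach((values, route) => p95ByRoute.set(route, percentile(values, 95)));

console.log("overall p95 (ms):", overallP95);
p95ByRoute.forEach((value, route) => console.log(`${route} p95 (ms):`, value));
```

With this data the overall p95 is 120 ms and looks healthy, while the per-route breakdown shows `/search` at 2500 ms. The single number is not wrong, it just cannot tell you where users are hurting.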

And just to briefly mention, they improved speed and reliability, but another problem can be too many dashboards. Because you end up roaming around: you don't know where to look, and it takes you too long to troubleshoot. So we should simplify and focus dashboards so that they answer key questions precisely and quickly, without clutter.

8. Key Takeaways and Next Steps

Short description:

Monitoring is crucial, but complex. Understanding correct behavior and acting quickly is important. Monitoring is a constant process, always improving. Learn more on Hashnode and stay connected for blog posts and insights. Connect on social media for links and articles. Thank you for listening, looking forward to your questions!

Anyway, just to summarize, the key takeaway is monitoring is crucial, but it's complex, and you have to be able to do it right. You have to understand what is the correct behavior for each service, because issues will always happen. That's why we monitor. Knowing something is wrong is not enough. You have to be able to also act quickly, and that's why it's important to get to know your services.

You also have to be aware that monitoring is a constant process. You should work on improving your services constantly, and stay vigilant and alert so that you can react quickly if necessary. And you know, it's hard to cover everything in 20 minutes, so if you want to learn more about this topic, I have prepared a small blog post on Hashnode that provides a few more hints. I will also write a series of blog posts in collaboration with Infobip DevRel and Infobip developers, which will be published over the next few months.

So, if you want to be in touch and see those blog posts, feel free to follow me on Hashnode. I can also recommend two amazing books that provide really great insights into reliability and monitoring: Practical Monitoring by Mike Julian and Release It! by Michael T. Nygard. You can also connect with me on social media, because I will be sharing all of the links and articles there for anyone who is curious. You can connect with me over this link, which has all of my social links. You can also find me on Twitter, which is now called X, or on LinkedIn, where you will for sure have those links.

And you know, you can also just type the Bitly link if you want to get to it quicker. I would like to thank you for listening to me. I hope that my session provided you with some useful insights, and I'm looking forward to hearing all of your questions that you may have. So thank you and see you soon. Thank you! Bye!
