Logging, Metrics, and Distributed Tracing are three vital tools for observing Node.js services. In this talk we’ll consider the different scenarios in which each tool thrives, look at dashboards and visualizations, and even examine the code required to instrument these tools in a Node.js service.
Logging, Metrics, and Tracing with Nodejs
AI Generated Video Summary
This talk covers logging, metrics, and tracing with Node.js. It explores logging configuration with Winston and logging conventions and solutions. The talk also discusses logging dashboards and metrics, as well as metrics and distributed tracing. It touches on tracing tools and visualizations, async-await and logging in Node.js, and request-specific logging and distributed tracing. Additionally, it covers logging middleware and serverless functions, and the difference between automatic and manual instrumentation.
1. Introduction to Logging
Hi. I'm Thomas Hunter, and welcome to my talk Logging, Metrics, and Tracing with Node. First, we're going to look at logging. Logging is a way to extract granular state from a program. The logs often have an associated severity level, which is used for filtering. You can configure an application to write logs to standard out, the file system, or send them directly over the network. The central logging service captures these logs globally throughout all your applications.
Hi. I'm Thomas Hunter, and welcome to my talk Logging, Metrics, and Tracing with Node. All the content from this talk is adapted from a chapter of my recently published book, Distributed Systems with Node.
First, we're going to look at logging. You can think of logging as the console.log of the cloud. What exactly is logging? It's a way to extract granular state from a program. Usually, the state ends up resembling well-structured JSON data instead of the random words or objects that you or I might log locally. These logs often have an associated severity level, which is used for filtering. For example, the severity levels that were made popular by NPM are error, warn, info, HTTP, verbose, debug, and silly.
You can configure an application so that, perhaps, in production, you're only getting messages that are greater severity than debug, whereas locally, you're getting all messages. So, these logs can be written to standard out, they can be written to the file system, or they can even be sent directly over the network by the application. If you do happen to run to standard out, you probably need some sort of sidecar process that's running, listening, reading those files, and piping them over the network. And if you were to send them from the application directly over the network, that would increase some complexity of the application, however, it would also, you know, perhaps streamline the delivery of those logs. The reason it needs to work this way is that we have a central logging service that captures these logs globally throughout all your applications.
2. Logging Configuration with Winston
One popular package for logging in Node.js is Winston. It allows you to create a Winston instance and export it as a Singleton representation. You can configure the instance to capture specific log levels and output them in JSON format. Default meta properties can be set for all logs, such as the node environment and application name. Transports can be configured to write logs to a file and print them to the console.
So, one of the popular packages for doing this is Winston, and that's what we're going to look at in these examples. So, here we have a file that creates a Winston instance, and then exports a Singleton representation of it. So, here we're importing the Winston package, creating an instance, and so the level field here, this is saying that we only want to capture info in greater messages. The format field says that we want the output to be in JSON format. Default meta, this represents default properties that are then available on each of the logs. Since we're outputting JSON messages, these are the default fields that'll appear for all logs within this application. So, specifically, we're looking at the node environment variable, which we're applying to a property called env, and we're also naming the application as profile service and setting it to an attribute called app. So, this is useful so that we can differentiate logs from different environments and different applications within our infrastructure. The transports configuration is defining two different outputs for these logs, and so the first one says that the logs are written to disk at var log node app.log, and the second one says that they're printed to the console.
3. Logging Conventions and Solutions
We can create a global logger or a request-specific logger. A common pattern is to create a request ID associated with each log. In the example, we require the global logger, generate a request ID for each request, and create a child logger. The child logger includes the request ID as a default property. We can use the child logger to generate logs with contextual information, such as the URL and method. Another example shows how to handle errors and log stack traces. Popular logging solutions include the Elk Stack, Splunk, Datadog, and Sumo Logic.
So, we have some pretty common logging conventions using node applications and really non-node applications as well. And so, perhaps we create a global logger, but we can also have a request specific logger. So, the logger we saw on the previous screen, that would represent the global logger. So, a common pattern also is to create a request ID, which is usually like a UUID value, which can then be associated with each one of the logs for each given request.
So, this example down here, we're requiring the global logger. We're then adding a on request handler to Fastify. And so, within each request, we generate a request ID, which is just a random UUID, and then we create a child logger. And so, that's a Winston convention for creating a new logger with default properties. In this case, the only default property is the request ID. We then attach that to the request object so we can use it in other requests.
So, the final bit of code at the end, that that extracts the URL and method from the request object and then actually generates a log. So, request.logger.info is a way to create an info message. And so, the name of the log is on request. And the two additional properties that we're adding contextually are the URL and method. So, here's an example of another log. This time, we're attempting to save data to a database. However, a problem has happened. So, we wrap that in a try catch. And then we catch the error and then log the stack trace. So, here we're calling request.logger.error. So, of course, creating an error message. The name of the message is DB persists error. And then we're assigning the error message, the error stack, and even the record ID to that log. That way, a developer can go in later and attempt to find out why that message didn't persist.
So, here's some logging solutions that we can use. A pretty popular one is called the Elk Stack and this can be self-hosted or you can even pay companies to host it for you. So, that's a combination of the Elasticsearch database, the Logstash, Daemon, and the Kibana graph board. Graph. And so, other alternatives that are paid for include Splunk, Datadog as a logs project and Sumo Logic as well. So, what do these logs actually look like? Well, here's a screenshot of a dashboard from my employer, lob.com, that we're hiring, of course.
4. Logging Dashboard and Metrics
This dashboard displays logs that match the query, along with raw logs. The query is written in KQL, which stands for Kibana Query Language. Metrics, on the other hand, provide aggregate numeric data and help understand the application's health. Metrics include names and associated tags for querying. They provide real-world measurements that benchmarks can't. Metrics can track request throughput, timing, memory usage, status codes, and even business-related data. Metrics are generally cheaper than logs. Example code demonstrates the use of the StatsD client package to create and track metrics for request timing, status codes, and methods.
And so, this dashboard here displays a graph of logs that match the query and then actually the raw logs below it. And so, the query at the top of the screen, that's written in KQL, which stands for Kibana Query Language, and that's how I'm querying these logs here. So, in this case, I'm looking for logs with a message of handled request and where the status code is at least 500. So, that's a way to display all server-side errors. And so, here, we see that within the last 24 hours we had six of these 500 errors. And all this data that we're looking at today is time series data and is filtered based on time.
So, now we're going to look at metrics. And so, whereas logs looked at more individual data, metrics will look at aggregate numeric data. So, what are metrics? Well, again, it's time series data, almost entirely numeric based. It's a way to understand the application's health. And so, usually these metrics still have like a name and perhaps some associated tags with them. Tags being key value pairs that can be used for querying. And that depends on the implementation that you choose. So, this is going to tell you information that real world benchmarks can't. This is going to tell you real world information that benchmarks can't tell you. So, a benchmark is like an estimate of how the application will perform while metrics are actual measurements of how the application performs. So, it's common to store metrics for things like request throughput, request timing, application memory usage. You can even track the status codes, you know, 500s versus 200s or perhaps endpoint popularity. You can even track things that aren't necessarily as engineering-based. You know, more business things such as how much money is spent, what's the user turn like, are the ads getting clicked, et cetera. Generally, metrics end up being cheaper than logs, both in the computational cost as well as the transmission cost.
So, here's some example code. Here we're using the StatsD client package, and we're creating a new StatsD instance. So, as we create that, we pass in some configuration. I'm sort of glossing over the connection details, but in this case, you can see that the prefix for these metrics is myapp. And so, it's common using this pattern to sort of prefix these metrics with like a name of an application. You then separate the name of these metrics using periods, and then you can sort of define like a hierarchy of requests. And so, again, we're adding a Fastify middleware. This time, it happens at the unresponse level, and so, we're going to track the request timing as well as the status code that was received and even the method. And so, using this information, we can see the timing that a request takes over time.
5. Metrics and Distributed Tracing
So, there are different metric solutions like statsd, graphite, and grafana or prometheus and grafana. The Datadog company also has a metrics product. Metrics can be represented in graphs, showing outbound requests, request timing, and incoming status codes. Distributed tracing allows tracking of communications across services, associating related requests, and visualizing the hierarchy of requests.
So, there's a few different metric solutions, not as popular as the logging scene, but there's definitely more than one choice. So, one common stack is to use statsd, which is daemon for collecting these metrics, graphite, which is a database for storage, and then grafana, which is a dashboard. Another popular combination is prometheus and grafana. And then finally, the Datadog company also has a metrics product as well.
So, what do these metrics actually look like? Well, here's a graph. This is just an example using some fake entries. But here we can see in the upper left, this is counting the outbound requests. And so, we can see like per second, we have 50 requests from about 1527 to 1533. On the upper right, we have the outbound requests timing. So, that's showing us how long the request is taking to actually succeed. And at the bottom, we have a list of the incoming status codes. Or as you can see, we have a long, sad line of 500 error codes and then a burst of 200 error codes. So, the very bottom, that's a query that I've written in Grafana to query this data. So, I'm looking at the myapp.requests.status.anything metrics, and then I'm just wrapping that in a function that renames the output. So, that actually gives us the bottom graph.
And finally, we're going to look at distributed tracing. And so, this allows us to keep track of communications as they cross services. So, what is distributed tracing? Well, it's a way to associate these related requests as they flow through your system. And so, perhaps you have a shallow service and then more deeper services. A request goes through the shallow one and then it gets sent to deeper ones, which then might get sent to even deeper ones. And so, you end up with a tree structure of requests. Classically, visualizing that is a bit difficult, and so that's why we have distributed tracing. And so, at a high level, these different implementations ultimately come up with some sort of request ID, which is like a random value to identify all these requests as they're related. And then depending on the implementation, you even get something called a span ID, which is a way to identify and associate individual request response pairs. And so, like as a shallow server communicates with a deeper server, it'll get its own span ID. These IDs are often passed around as HTTP headers. If you're using a tool such as gRPC, you would need to come up with a different way to pass these headers around. All this information gets sent to a central management service, just like with logging and with metrics. And as I mentioned, it'll allow you to visually see the hierarchy of the request. So it's useful to see which service was slow or if a request fails, you know, which service was at fault for that.
6. Distributed Tracing and Instrumentation
You can provide the generated request UUID to your logger to associate logs across all your services. An example is shown using the zipkin lite package with Fastify and node fetch. The application is instrumented by adding a route for getting a widget. The request ID is logged and can be accessed throughout the request. The code also sets the name of the request and prepares the Zipkin request for an outbound request. Tracing solutions like Zipkin and Jaeger are popular and can be self-hosted.
And as a bonus, you can actually take that request UUID that's generated by the distributed tracing, you can actually provide it to your logger. And so that's a way to then associate logs across all your services quickly and easily.
So here's an example in tracing implementation. So this is using the zipkin lite package. And again, we're using Fastify. It's also going to use the node fetch package as well. So we instantiate a zipkin connection. And so we specify, you know, the host that we're going to send this zipkin information to zipkin being a, a tracing implementation. We also have to name the service, keep track of the services port, as well as the services IP. Finally, we instantiate the Fastify server and then add two hooks to it. So the first one happens when a request is received and the second happens when the response is being sent. And so those two middleware are doing things like keeping track of the start time of the request, generating a request ID, and then finally transmitting all that information to the central zipkin server.
Now, here's an example of us instrumenting the application itself. And so what we're doing is we're adding a route in the application. So when the user gets a widget, this route ends up being called. So the first thing that's happening here is that the request ID is logged. And this is just a way to show how you can access that request ID. Ideally, you would have middleware that then takes that request ID and then attaches it to the request logger so that you can use that throughout the request. And this code also generates or sets the name of the request, in this case, to get widget and that's used later for reporting. And finally, we have perhaps a bunch of other code happens within the application throughout the request, but we get to a point where we're ready to make an outbound request. The first thing we do is we prepare the Zipkin request. We also have access to this URL. And then we make a fetch call. So the fetch call is mostly the same, except that we specifically have to provide the headers that are generated from this Zipkin request preparation. Once the request is complete, we call this complete method and we pass in the HTTP method and a URL. And then finally, we finish the request.
So there's a few different tracing solutions out there. The one we just looked at is Zipkin. Another popular open-source alternative is Jaeger. So these are two that you can run self-hosted.
7. Tracing Tools and Visualizations
Both Zipkin and data.dog.apm provide nice ways to trace and monitor your applications. Zipkin offers a hierarchical view of services and their timing information, while data.dog.apm provides a performance timeline similar to browser tools. These tools automatically capture metadata and can modify your application to pass along headers for deeper service tracing. If you want more information, you can follow me on Twitter or check out the presentation from distributed systems with Node.
And both of those tools are able to follow something called the OpenTelemetry specification, which is a way to sort of define, you know, what are the headers that are passed around? What kind of timing information are you tracking? How is it sent to the central server, et cetera? The Datadog company also has an APM product as well. And New Relic is another popular tracing solution.
So what does tracing actually look like? Well, with Zipkin, you end up getting a pretty nice hierarchy. So this is a screenshot taken from the Zipkin website. So we can see on the left that we have a hierarchy of the services. So we have, like, the routing service. Beneath that is Memcached, Yelp, Main, et cetera. Next to that, we have a nice timeline that shows the actual time that it takes. And so using that timeline, we're able to see that the overall request happens at the top, took about 131 milliseconds. Then different operations below that took more time. And so the depth of this graph represents the depth of the request chain within your application. On the right, we see some metadata annotations that are tracked as well.
Alright, and so that's the talk. Feel free to follow me on Twitter. The presentation is available at this URL here. Of course, this content was taken from distributed systems with Node. And feel free to hit that URL if you'd like more information.
Async-Await and Logging in Node.js
The syntax of async-await is nice. It's also the newest. You still get that nesting problem and that's solved with async-await. The role of logging in Node.js functions is as useful as it is in any other application. You want to know the health of your application, if things are still running, if things are getting backed up. The best way to organize time series logs from IoT devices is to lay out metrics using a hierarchy. Namespace them based on the IoT component and regions where the equipment is running. The logging approach shown in the presentation uses two loggers, one being a universal logger.
I think those are to be expected, yeah. The syntax of async-await is nice. It's also the newest. Yeah, it's pretty common right now to be using async-await. And promises are still pretty popular, I think. But yeah, you still get that nesting problem and that's solved with async-await. So that's nice.
So the QA session is pretty exciting now, because you're going to be giving away your great book. So let's get to the questions. And I hope we can get to all of them.
The first one is from Brian Howe. Hugh, what role does logging play for Node.js functions? I'm assuming that by functions they mean lambda functions, stuff like that. So I mean it's as useful as it is in any other application. So you want to know the health of your application, you want to know if things are still running, if things are getting backed up. So it's, you know, definitely useful by sending, perhaps, like metrics while you're invoking functions, you can then, you know, build out a dashboard that shows you, perhaps, how many like simultaneous functions you have running, you know, things like that. I think a lot of this information you can probably get from whoever's hosting the functions like AWS, but you can certainly extract your own data from that as well.
Okay. Next question is from chshirendra2010. What is the best way to organize the time series logs coming from IoT devices? Well, I guess I'll first admit I've never worked with IoT. You know, a lot of all the data that I talked about in this talk is about, you know, time series data. Data that is pinned to a certain point in time. And so, you know, I think with applications you want to sort of lay out your metrics using a hierarchy where sort of the top level is your application. So, with, like for example, maybe you have a profile view application, like a gallery application, authentication application, but with IoT, you know, perhaps you want to sort of namespace these based on like whatever IoT component you have. And then perhaps, you know, if you have regions where the IoT equipment is running, I think maybe you'd have like, oh, bed temperature in North America, you know, I think you just want to come up with some sort of hierarchy that fits your use case. Yeah, exactly, yeah.
Alright. Next question is from Alexey. Have you created the initial example a request in. Sorry. Have you created in the initial example a request instance logger? How can you recommend providing two levels that are below web server like business logic or database repositories? Yeah, so I guess the sort of the tricky thing with that is that with a, with the logging approach that I showed in the presentation, you have like the two loggers, one is like a universal logger.
Request-specific Logging and Distributed Tracing
The context within a PHP script represents the request being made. In distributed tracing, you can attach the request ID to the request-specific logger. For serverless applications, it is difficult to pass request headers, but you can insert the request ID into the payload. There are ways to send this data to a distributed tracing tool. Lower tools or middleware that prevent logging sensitive information exist.
The other is request specific. And if we were building applications using a language like PHP, well, globally, the context within a PHP script represents the request that's being made, both node, you know, we have to sort of keep track of that context and pass it around. So you can use, you know, I think patterns like the continuation, local storage to make it convenient to pass around these loggers. Otherwise, you might find yourself just sort of modifying, you know, deeply nested codes that you can pass around the request logger deeper into the application. That's a bit messy, but, you know, yeah, you can use something like the CLS to make that easier.
Cool. Another question from Java task. What is the best way to correlate logs with traces in DT? In DT? I don't know what... Oh, in distributed tracing. For example, in the presentation, I showed how a... you sort of like create a middleware that creates the request-specific logger. So, at the point in time, if you're using a distributed tracing tool and you receive an incoming request with a request ID, you could then attach that to the logger. However, if you're the most shallowest application within the stack, you know, the one that receives the first request at that point in time, you would then generate the request ID to be passed around to the deeper services. So, by attaching that request ID to the request-specific logger, you can then search for all those logs using your logging tools such as Kibana.
Cool. Next question is from Brian H. Yoo again. When you talk about distributed tracing, how could logging work or what would it look like in the cloud for serverless applications? Yeah, how does it work with serverless? So, yeah, I guess, again, serverless things like Lambda. Yeah. I think when you invoke a Lambda, you don't necessarily have an HTTP request like visible with it. So, it's difficult to pass along these request headers. That said, when you invoke a Lambda, you do provide some sort of requests like payload. And so you could then do things like take that request ID and just sort of insert it into that payload. And then there are ways to then, you know, take that data and then send it to a distributed tracing tool. The longer version of this presentation did show a breakdown of all the actual packets and messages that are sent. And so, it is actually possible to, you know, reconstruct these HTTP, like, span representations and then send them to the distributed tracing service.
All right. Another question from Alexey. Do you know any lower tool or middleware which prevents application from logging sensitive information like passwords and tokens? Not off the top of my head. Certainly they do exist.
Logging Middleware and Serverless Functions
At my employer, lob.com, we use a middleware to create a deny list of fields that should not be logged. Tools are available to automatically remove or allow manual deny lists. For serverless functions, you can instantiate a logger that ships logs to a different service, such as a Logstash instance. You can measure Node.js internals, such as event loop delays or GC frequency, by accessing metrics exposed within Node itself. Metrics usually have lower overhead, while tracing can be heavier, especially with automatic tracing solutions.
At my employer, lob.com, we do use a middleware where we're able to then sort of, you know, create, like, a deny list of fields that we want to prevent from getting logged. So, for example, we don't want to log names and addresses. And so, there certainly are tools out there that can both either automatically automatically remove them or allow you to provide your own deny list.
You can just give some keys, like if this is the key, don't log the value, please. Which can be done by hand easily, too.
Next question is from the house of Alejandro. Awesome talk. Well, that's always nice to hear. Any recommendation or experience using logging for serverless functions, an example of Firebase functions or Google Cloud or AWS Lambda?
Sadly, I don't have much hands-on experience with serverless when it comes to logging. Certainly, tools like CloudFormation extract some of those logs for you. But at the end of the day, though, you can instantiate a logger that ships logs out to a different service. So, for example, you can have a Logstash instance listening on a port, which then receives UDP messages that contain JSON payloads, and you can transmit those messages to that listener from your serverless environment. So, that's a way to sort of push logs out, which might make it easier in a serverless situation.
All right. Let's see how much time we have left. Okay, next question is from David Lime. Can you recommend ways to instrument Node.js internals like event loop delays or GC frequency to send to the surfaces? That's a good question. I like that one. Yeah, absolutely you can. There are a bunch of different metrics that are exposed within Node itself, and so you can measure the size of the heap or some of the ones I recently introduced in a live application is the number of open file handlers and the number of concurrent connections on a web server. And so, off the top of my head, I don't know where those properties are located, but if you buy my book, certainly there's a section in there with a bunch of examples. Nice plug.
Next question is from 1LV4N what is the expected overhead of gathering logs, traces, etc in your system? Great, great question. So, things like metrics usually end up having a lower overhead because you know, like with the metrics, with the data you're sending to StatsD, you're actually just sending tiny strings of plain text that you're just dispatching using UDP, which has very, very minimal overhead. Even logs usually end up having fairly low overhead. You know, there's the cost of serializing the log, but usually that information is sent asynchronously and shouldn't really slow down the main components of the application. Tracing though, tracing can get a little bit heavier, especially if you're using like an automatic tracing solution. I believe the performance impact of that is going to be a little bit higher. So yeah, if you're using an automatic, like APM style tool, it's going to be instrumenting basically everything.
Automatic vs Manual Instrumentation
If you're using an automatic, like APM style tool, it can be heavier. But if you're manually instrumenting and passing around headers, it'll have minimal impact. Console.log and process.stdout/process.write have similar performance. Console.log is easier to use and widely supported. Thomas chose David Lane's question as the best and will contact him for the book.
So yeah, if you're using an automatic, like APM style tool, it's going to be instrumenting basically everything. And so it does inject code that'll sit between a call between the application and the library. And this stuff can get a little bit heavier. But if you're perhaps manually instrumenting and manually passing around just some headers in your application, sort of like in the example I used, the distributed tracing section, that's probably going to have a more minimal performance impact.
So Thomas, we have just a few moments left. But instead of doing a question, I would like you to read through the questions that have been asked in the Notalk Q&A channel. So you can choose the best question for the winner, for your book. And while Thomas is doing that, I will inform you how this process is going to work. So Thomas is going to be reading the questions now and making an informed decision what is the best, the deepest question that's been asked. And then Thomas will announce the winner and contact that person on Discord to share their email with him and he will take care of getting the book to your place as soon as possible. So, Thomas, are you ready? Have you read the rest of the questions? I am. I believe... Yeah, I enjoyed David's question the most. David Lane. So the question was, can you recommend ways to instrument Node.js internals like EventLoop delays or GC frequency to send to these services? So, David, Thomas will be contacting you, so watch your inbox on Discord and enjoy your book. Be sure, when you have read it, to send a tweet along our way at Node congress, and also, of course, let Thomas know because feedback is always nice, right? Absolutely. All right, Thomas. Well, thanks a lot for joining us. And for the people that want to ask their questions to Thomas that he did not get a chance to answer, Thomas will be in his spatial chat speaker room if you want to join him. Thomas, thanks a lot. Hope to see you again soon. Thanks.