Out of the Box Node.js Diagnostics


In the early years of Node.js, diagnostics and debugging were considerable pain points, but modern versions of Node have improved significantly in these areas. Features like async stack traces, heap snapshots, and CPU profiling no longer require third-party modules or modifications to application source code. This talk explores the various diagnostic features that have recently been built into Node.

You can check the slides for Colin's talk here. 


Hi, everyone. Thanks for coming to my talk, Out of the Box Node.js Diagnostics. This talk is about getting diagnostic information out of Node.js without relying on a lot of third-party tools.

For a little background: getting diagnostics out of Node.js used to be pretty hard. Node.js used to follow a small-core philosophy, which meant that a lot of functionality was left up to npm modules. So we had things like node-inspector for debugging, heapdump for capturing heap snapshots, longjohn for asynchronous stack traces, and things like that. Over the years, a lot of work has been put into diagnostics specifically and how we can improve them in Node, but a lot of people might not be aware that these things exist yet. For most use cases, you can do a lot of debugging and information gathering using just the Node executable and Chrome. A lot of the content in this talk comes straight from the official CLI documentation, if you want to look it up on your own. This talk also assumes the latest version of Node 16.

One of the oldest ways to get diagnostic information out of Node is via debug environment variables. If anyone has ever used the debug module on npm, it's very similar to that. You can use it to log a lot of extra information from Node's core during execution, and there are actually two flavors: NODE_DEBUG for getting information from JavaScript land, and NODE_DEBUG_NATIVE for getting information out of the C++ layer. Whenever you start your application, you specify the environment variable with a comma-separated list of the subsystems you want to listen to, as shown in the example at the bottom of this slide. When you run your application, it will dump a lot of information to standard error. You can see an example here: the lines are prefixed with the subsystem name, NET and HTTP2 in this example.
And then the process ID, and then the actual debugging information — when a connection is established, when a connection is closed, and things of that nature.

The next thing I wanted to talk about is warnings in Node.js. Warnings are used to let the user know that they're doing something potentially risky that we advise against. There are a handful of flags you can use here. The --no-warnings flag disables all warnings; you might not want to do that. The --redirect-warnings flag can be used to take all warnings and dump them into a file somewhere so you can look at them later. And --trace-warnings will actually print a stack trace whenever a warning is encountered. This is really useful when a warning is emitted but you're not really sure why — it could be coming from deep inside your node_modules folder, and the stack trace gives you more insight into what's going on.

An example of that is shown here. sys is an ancient module from Node's core that has been long deprecated but is still there; we don't want people to use it. Whenever you require it, you'll see this deprecation warning printed, and you can use the --trace-warnings flag to see where in your code it's coming from. If you look at the arrow in the stack trace here, you can see it's coming from example.js on line one, so it's very easy to tell what's going on. One thing to note: you'll see DEP0025 in the message. All deprecations in Node are given an ID, and Node core's documentation has a page just for deprecations, so you can look up that deprecation ID and get more information than what's available in the stack trace here.
The next thing I want to talk about is deprecations, which are a specific class of warning, and again we have similar flags for dealing with them. There's --no-deprecation if you don't want to see any deprecation warnings in your code and you know what you're doing. There's --trace-deprecation, which prints a stack trace similar to the --trace-warnings flag I showed a little bit ago. And then there's --throw-deprecation, which will actually throw an exception any time you use a deprecated feature. That isn't something you probably want to run in production, but in CI or while testing, it's very helpful for finding things you need to address and move out of your code.

We have a similar example here — again require('sys'), because a deprecation warning is just another warning. Here I've used --throw-deprecation with the same code as before, and in this case you'll actually see a thrown exception, which is nice because it does some highlighting of the stack trace. You'll notice that a lot of the stack frames here are grayed out — those come from Node's core, for example — and the line that is still in black, with the arrow pointing to it, is the line in your code where the deprecation is coming from.

The next thing I wanted to talk about is synchronous I/O and how you can find it in your code. Synchronous I/O blocks the event loop. This used to be a real performance killer; it still is to an extent, though now we have worker threads, which can mitigate things a bit. Either way, you probably don't want to block your event loop if you don't need to.
Especially if you're running an application like a server, where you have multiple requests being handled at the same time: if you block the event loop on one of the requests, all the other requests suffer as well. When you're setting up your server, synchronous I/O during the startup phase is fine, but once you start serving traffic, you probably don't want to allow it. You can use Node's --trace-sync-io flag to report any synchronous I/O that happens after the first turn of the event loop.

In the example I have here, we use a setImmediate so that we know we've moved past the first turn of the event loop, and then we synchronously read a file from the file system. readFileSync does a couple of operations under the hood: it opens the file, stats it, reads the data from it, and closes it. Because all of these happen synchronously, they each result in an individual warning. You can see in the stack trace provided here that openSync, tryStatSync, readSync, and closeSync are all called out as synchronous I/O after the first turn of the event loop. Of course, if I were to remove the setImmediate, no warnings would be reported, because we would still be on the first event loop turn.

Next I wanted to talk about tracing uncaught exceptions. You might have an exception thrown in your application code; it could be coming from your node_modules directory, or if you just have a large application, it can be hard to find. And the stack on the error itself might not actually correspond to where the exception was thrown from in your application. So you can use the --trace-uncaught flag whenever you run your application.
In addition to printing out the error and its stack trace, as you see above the red arrow here, it will also print this "Thrown at" information, giving you the path into your application where the exception was thrown from. We have a similar flag for tracing process exit. If a module in node_modules were to call process.exit(), you'd have no idea from your application what's going on. You can use --trace-exit, and it will show you where in the code the exit call was coming from and what the exit code was — you'll notice here, "exited the environment with code 0".

Now I want to move on to talk about unhandled promise rejections. By default in JavaScript, unhandled rejections are ignored. This is largely okay in the browser, but it can cause a lot of serious problems if you're running a server. This was reported to Node on an ongoing basis, and starting with Node 15, we changed the behavior of unhandled promise rejections so that they now throw an exception. This is useful because, say you're serving multiple requests at the same time in your server, you're opening file descriptors, and a promise rejection happens and is just swallowed — it's possible to get into a state where you're leaking file descriptors and other resources, which can cause memory issues and leave your server in a bad state. The new behavior makes it a lot more explicit what's going on and allows you to better deal with your promise rejections.

You can configure this behavior from the CLI with the --unhandled-rejections flag. Right now we support five different modes. throw and strict will both turn a rejection into an uncaught exception; throw is the default as of Node 15. throw tries to emit an unhandledRejection event before throwing, whereas strict moves straight to throwing.
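That interplay with the unhandledRejection event can be seen directly — installing a listener of your own also suppresses the default throw, which is one way to centralize rejection handling. A sketch, with the 'boom' error as a placeholder:

```javascript
// A sketch: observing unhandled rejections via the process event.
// With no listener installed, Node 15+ would turn this rejection into
// an uncaught exception and crash the process.
process.on('unhandledRejection', (reason, promise) => {
  console.log('unhandled rejection:', reason.message);
});

Promise.reject(new Error('boom'));
```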
So one gives you a better opportunity to handle the rejection first. Then there's warn mode, which displays a warning on the console; warn-with-error-code, which is the same as warn but also sets the process exit code to 1; and none, which is the JavaScript default of swallowing unhandled rejections. Here's a quick example with --unhandled-rejections=strict: if I call Promise.reject and pass in an Error object, it turns into an uncaught exception and you can deal with it that way. You'll see the error message and the stack trace just the same as if you had written throw new Error() in this case.

Next I wanted to move on to the tick processor, which is available in V8 and which I don't think a lot of people are aware of. It's a command-line-based sampling profiler. It's nice because its output shows you where you're spending your time in shared libraries as well as in JavaScript code and C++ code, so it's pretty comprehensive. The way it works is you run your application with the --prof command line flag, which dumps the V8 profiler output into a file. You can read the file if you want, but it's not really meant for human consumption. So after you collect that information, you run Node again with --prof-process and pass in the name of the log file that was generated. You can see an example of that at the bottom of the slide here.

And this is an example of what the processed output will look like. At the top, it shows you where you're spending your time in shared libraries; the next section shows you where you're spending your time in JavaScript code. You'll notice here that we have two functions, processImmediate and normalizeString, and both of those have an asterisk in front of them. The asterisk means that V8 was actually able to optimize the code in that function.
And then the bottom section is a breakdown of where your application is spending its time in C++ code.

Starting a few years ago, Node and V8 gained first-class support for the V8 inspector, which is essentially Chrome's DevTools used with Node applications. If you've done any type of browser development, or even Node development, in the past few years, there's a good chance you've seen Chrome's DevTools. It includes things like a step debugger and profilers with GUI interfaces, as opposed to the CLI-based one I showed before. The way you set this up is to start Node with the --inspect flag; you can specify the host and port that you want it to listen on on your machine, and by default it listens on port 9229. You can also set it up so that as soon as you start the application, it sets a breakpoint for you, so you don't have to worry about doing that by hand — that's the --inspect-brk flag, which otherwise works in the same fashion.

One security tip: be careful if you're going to bind to or anything else that's really a public address on your machine. This used to be the default for Node, and it was reported as a security vulnerability: if you have a server that can be reached publicly and you're bound to a publicly exposed address like that, it's technically possible for an attacker to connect to the debugger and start messing with your code and runtime. So that's just something to be aware of.

Continuing with the V8 inspector example: the text at the top shows you how you would run it, node --inspect-brk example.js. It prints out some information telling you where the debugger is listening via this WebSocket URL, and points you back to the documentation if you have more questions or need to do additional reading.
The picture at the bottom here just shows you what it's like when you're dropped into DevTools. Again, a lot of people have probably seen this before, but if you haven't, this is what you can expect to see.

One of the nice tools available in Chrome DevTools is a CPU profiler. It shows you what functions are executing over time. I want to point out that this is not the same as a flame graph: a flame graph takes stack traces over time and consolidates them into one larger stack trace, so you can see where your application is generally spending its time. If you're looking for flame graphs, there's an excellent tool on npm called 0x that you can play around with. As far as CPU profiling goes, Node actually has a couple of command line flags that let you drive the CPU profiler without needing to go in and do this by hand. If you pass the --cpu-prof flag, it will collect a CPU profile for you: it starts the profiler when the application starts up and writes out the profile when your application shuts down. The --cpu-prof-dir flag lets you specify where your profiles are written if you need to change the default location; similarly, --cpu-prof-name lets you give your profile a different file name. And the --cpu-prof-interval flag lets you define the sampling interval of the profiler, so if you need to sample the stack more or less frequently, you can control it there. Of course, you can also go into the DevTools UI and collect profiles by hand if need be.

And this is what the DevTools view of a CPU profile looks like. There's a region at the top that shows a high-level view of the CPU activity over time, and the highlighted window is shown at the bottom with the colored stack traces, where you can actually see what functions were being called.
In this example you'll see a lot of Module.runMain as well as require, so just from looking at this you can understand that this is the startup phase of your application, where a lot of modules are being required and configured. After that, the behavior changes to be more dependent on the traffic that you're serving.

Another nice feature of Chrome DevTools is heap snapshots. These are very useful for seeing what's going on in the JavaScript heap, and really helpful if you're trying to debug a memory leak. You can take a snapshot at one point, let your application code run for a while, then take another snapshot and compare them. If you notice that, compared to the first snapshot, the second has a ton of new objects that aren't being cleaned up, that can help you trace down the memory leak. You can drive all of this from the CLI as well. The --heapsnapshot-signal flag lets you capture a heap snapshot by sending a specific signal to the process. There's also --heapsnapshot-near-heap-limit, which will automatically try to take a heap snapshot for you when you're almost out of memory. This is not guaranteed to always work out, because it takes additional memory to collect the heap snapshot, but it's definitely worth investigating. And similarly to the CPU profiles, you can collect these manually via the DevTools UI if you have a breakpoint set in your code.

This is just a quick look at the heap snapshot view. It shows a list of every type of object, broken down by its constructor, along with things like the object count — in this specific example, it looks like there are four EventEmitters in the code. The shallow size tells you how much memory that actual object is holding on to.
And then the retained size actually follows the pointers from those objects to tell you the cascading effect of holding the object in memory. So you can do a lot of useful things here while you're trying to track down memory leaks.

The next thing I wanted to talk about is TLS connection tracing. It used to be that if you wanted to diagnose TLS issues, you had to have the OpenSSL client set up and pass some command line flags and things like that. But now you can get that information directly from Node.js. The --trace-tls CLI flag will dump the same information for all TLS connections. Just so you know, this will be very noisy, so you definitely don't want to enable it in production. You can also set it at the individual socket level with tlsSocket.enableTrace(), and you can set it when you're setting up a socket or a server with the enableTrace option passed to tls.createServer() and tls.connect(). Again, this dumps a ton of information, so just be prepared to sift through it.

The next thing I wanted to touch on is postmortem debugging. The way this works is you use the --abort-on-uncaught-exception flag to create a core file of your application. It's very useful because you get the entire state of your application, but there are a lot of downsides. First, you have to set up your system to collect core dumps, which is not usually the case by default. You also need an external tool like llnode to inspect the core file and see what's going on in there. And while llnode is extremely powerful, it's constantly playing catch-up with changes to things like the heap representation inside of V8, so from one version of Node to the next, it can be very hit or miss how well it works.
With it, you can actually inspect JavaScript objects and see a mixed C++ and JavaScript stack trace, but I would definitely say this is for more advanced use cases, and like I said, your luck may vary from version to version.

In Node we wanted a lower barrier to entry, something simpler than postmortem debugging, so we introduced diagnostic reports. A diagnostic report is a human-readable report containing all types of information about the process, presented as a giant blob of JSON. You can generate these under different conditions, such as a crash, a signal sent to the process, programmatically through an API, and so on. It includes lots of things: information about the operating system, the process, what's going on in the libuv thread pool as far as handles and whatnot, the C++ stack, the JavaScript stack, and a lot more. One thing to note is that a report can contain sensitive information, such as environment variables, so handle them with care.

To generate a diagnostic report, we have a collection of CLI flags. --report-on-fatalerror creates a diagnostic report if there's a C++ crash. --report-uncaught-exception is basically what it says: if there's an uncaught exception in JavaScript, a report will be generated for you. --report-on-signal generates one when you signal the process, and you can configure which signal to listen for via --report-signal. --report-directory and --report-filename are used to configure where the diagnostic report is stored and what you want it to be named. And then there's --report-compact, which makes it a single line of JSON so it's a little easier for machines to consume.

Here's a real quick look at what a portion of one of these looks like. I created this one via the process.report.getReport() API.
You can see a subset of what's available here. It has things like the report version, which allows Node to version the reports for backwards compatibility reasons; the event that triggered it — in this case a JavaScript API call to getReport; the file name, which is null here because the report was just dumped to memory; when it was collected; the process ID; the thread ID; and a whole bunch more. I definitely encourage you to play around with these. And as I said before, they can contain sensitive information, so handle them with care. If you need to share them with other people, you may need to redact some information by hand — but it's just JSON, it shouldn't be too tough.

So that's everything I had for today. Again, thanks for coming to my talk, and feel free to reach out to me on Twitter, GitHub, anywhere. Thanks, everyone.

I hope everyone has been writing lots of stuff down on pen and paper. So we asked a question: have you ever had a bug in your Node.js application but couldn't obtain the proper metrics to fix the problem? And 82% answered yes. So... Yes. I'm hoping that some of the information from the talk today might help with that in the future. I think if the same poll had been asked five or six years ago, the answer would probably have been a lot closer to 100%. It used to be a really tough thing to do, and the project has just made a ton of progress since the old days of Node. Yes — even 82% is a lot. But yeah, most people don't know about the new things coming in. And I see lots of good comments in the channel: that was lots of knowledge, amazing talk, "I did not know anything that he said." So people learned a lot today. That was a cool session — lots of stuff I didn't know about. Awesome. So one follow-up question on your talk would be: where can I learn more about the things you mentioned in the talk? Yeah, two places, really.
First, if you go to the official Node.js documentation, on the left-hand side there are all the different subsystems inside of Node; if you scroll down to the command line interface (CLI) documentation, this talk came almost exclusively from that page and all the information there. And then if you also go onto the Node.js website, there's a section called guides, with guides on different things like getting diagnostic information, creating flame graphs, running your application in Docker, and just a whole variety of things like that.

Okay. So, yeah, do check those out, anyone who is listening here — go check out all that documentation and learn more about it. So after this talk, lots of people have learned new things and will be more interested in diagnostics. Say someone is interested in working on Node.js diagnostics — is there a way they can get involved?

Absolutely. If you're interested in actually working on the project, you can go to github.com/nodejs/node; that's where the project really lives. But we also have a diagnostics working group, which is a team of people who specifically dedicate time to improving diagnostics in Node, and I believe the URL for that is github.com/nodejs/diagnostics. You can go there and, if you're really interested, join the working group and contribute back, or just follow along and see what people are talking about.

Awesome. There's a question by Azentile1990: what are some of the areas where you still see improvements could be made in diagnostics with Node.js? That's a good one. So I touched on postmortem debugging a little bit — I would love to see postmortem debugging get better. Unfortunately, that requires a lot of cooperation with V8. And not that they don't cooperate, but I don't know that they see as much value in it.
And then I would also like to see flame graphs have more first-class citizenship inside of Node. Right now that's one of the few things I briefly touched on in the talk where you still have to go outside of Node and DevTools to be able to look at it. So I think that would be a great addition.

Yeah, that definitely sounds like a great addition. Jota Developer asks: any Node.js application or library that helps to visualize or collect all those diagnostics locally that you could recommend, like flame graphs, aside from Chrome DevTools? Yeah, so DevTools will be the main one. I'm trying to think — I know NodeSource has a distribution of Node that has a lot of diagnostics things built into it. That's going to be more proprietary; I think you may have to pay for it, or they might have a free trial. And I think there's a tool out of NearForm called Clinic that has a lot of this stuff integrated with a nice little UI. So I would look at those two things first.

Okay, thank you. And ACC Chris asks: which diagnostic techniques do you recommend for Node.js in a serverless environment? In a serverless environment it's kind of tricky, because you probably don't have access to all of the same things — you can't just create a CPU profile and easily get it out. For serverless, you're kind of more at the whim of the provider, I think. So for example, I work at AWS, and Lambda has pretty good diagnostics information: you can run Node.js there and have logs go to a logging system — not CloudFormation, I can't remember the name off the top of my head — plus other integrations with AWS services that you can take advantage of. But yeah, in a serverless environment, some of the things from this talk may not apply.

Okay. So I have a question for you, because you are a part of the Node.js TSC.
What are some of the features that are coming up, or that you're looking forward to or excited about? Anything like that to share?

So this is a really little thing, but you and I were actually talking about it backstage. Sometimes when you log things from your Node application while connected to DevTools, it shows up in the console instead of in DevTools. There was actually a pull request opened recently to try to improve that. So I think that's one of those small things, but a nice UX improvement.

Yeah, I'm really excited about that. That would really be a great addition — small, but still impactful, I guess. So, okay. You touched on lots of points on diagnostics, and actually before your talk we had a talk on errors in Node.js, right? So it was a very nice transition from errors to showing all the diagnostics in Node.js and how you can do stuff like that. I have one last question for you, for sure: what are some of the responsibilities for you as a member of the TSC? I'm really interested in that.

So the Technical Steering Committee is kind of the last resort as far as resolving any types of conflicts in technical decisions. Ideally, nothing should come to the TSC — everything should be decided on GitHub amongst the collaborators. Sometimes we get to a point where people have strong opinions and an issue just can't progress, and the TSC will often then be pulled in to make a decision, occasionally even having to vote on it. But yeah, I would say that's the biggest thing.

Yeah, that's interesting. So I don't see any more questions, but I'm sure people got a lot to explore after this talk — they're going to do lots of Googling and GitHub and stuff. If you have any other questions for Colin, you can still talk to him in the special chat; Colin will be available in his speaker room.
And thanks a lot, Colin, for such an in-depth and wonderful talk. It was great, and people have learned a lot. Thank you so much for being with us here today. Thanks for having me. Bye.
34 min
17 Feb, 2022
