Observability with diagnostics_channel and AsyncLocalStorage


Modern tracing products work by combining diagnostics_channel with AsyncLocalStorage. Let's build a tracer together to see how it works and what you can do to make your apps more observable.


Let's talk about observability with Diagnostics Channel and async local storage. Hi there, I'm Steven. I've been involved in node.js Core and the Diagnostics Working Group for many years. I work at Datadog building diagnostics tools and my pronouns are he, him. So first of all, what is Diagnostics Channel? Diagnostics Channel is high-performance global event channel to passively broadcast data about the current execution. It's a lot like an event emitter, but specifically designed to be as low cost as possible when nothing is actively listening. The idea being that users should feel comfortable broadcasting lots of things without worrying about the cost if it's not going to be observed most of the time. So channels are created at the top level of your javascript file by calling the channel function and providing a name for your channel. Channels share a global namespace to allow acquiring the same channel anywhere in the application without needing to explicitly share the channel object. And your module name should generally be included in the name to avoid collisions with channels from other things. So once you have the channel object, you can start publishing to it. This is like the emit function on an event emitter, but by creating channel objects ahead of time, it allows several optimizations such as avoiding looking up the handle by name every time an event occurs and making the publish call as close to zero cost as possible when there are no listeners. So each Diagnostics Channel should follow a convention of having a single object structure per channel. And if you have differently shaped datasets to communicate, it's likely those should probably be separate channels. And please document the names and shapes of all your channels. And then anywhere in the application, you can call channel again with the same name to acquire the same channel and then subscribe to it. And the order of channel calls doesn't matter. Whichever place calls it first will create the channel and every subsequent call will retrieve it. So you can even subscribe to channels for modules that are never loaded and therefore never have events published. This enables complete decoupling of code, allowing things like tracing products to passively observe your module behavior without needing any explicit connection between modules. So let's look at an example. We start by defining our named channel at the top of the file. Then we write our function, which we want to broadcast some data about. Eventually when it gets called, like it completes its internal set immediate and calls its callback, it will broadcast the data to the channel with the publish function. This could be handy for all sorts of things. For example, you might want to capture metrics of how many times do a thing did whatever it was supposed to do. It could even be captured with the time of completion to chart activity over time. But none of this needs to be specifically supported by do a thing as the subscriber can decide on its own what to do with the message or if it wants to capture any timing information. And what well, publish is no op unless there are subscribers. It has like no cost unless there are subscribers. Sometimes constructing the message itself can be costly if the thing would run very frequently. So there's a hat has subscribers can be used to skip message construction entirely in performance critical situations. So and what is async local storage? So it's like lexical scope, but following call path rather than the scope of definition. So lexical scope lets us retrieve data from the scope outside of the function itself. We're doing that here when A calls B. However, we define thing in an inner scopes that there's no way to reach it in B. The obvious thing to do in this particular case is just pass it as a parameter B. Nothing is reachable inside B. But what about more complicated situations like when we don't control the intermediate steps between where we've defined the variable and where we want to access the variable? Here we get B from some other module and it takes an event emitter to trigger an event on. There's no way to pass thing through its interface. So can you figure out how to get thing into that something event handler? Maybe you could stuff it in global or you could attach it to the emitter object. But what about concurrency? In either case, it could get swapped out to the wrong value if another call happens before the event gets emitted. So with async local storage, the value gets propagated through calls automatically without needing to add it to the arguments and it flows into callbacks and promise continuations to thus the async in its name. This means that no matter how much complexity there is between the two points, as long as a callback or continuation path can be drawn between the two, the value should be available. So to start passing the value through, you call the run method on the store and it will call the given function with that value as the current state of the store until the function ends. And any async activity triggered within that scope will also contain that value. And then to retrieve the value from the store, you can call the get store method later. So here we have a module for serving static files. It takes the folder to serve from analog function. So notice that it returns a request handler function to be called later. So it's designed to be called at the top level outside of the request to produce the handler. So there's nothing to pass through any sort of request identity or in any way change it to differentiate concurrent requests. So it would be very difficult to identify in the log which loading message pairs with which downloading message if the path is the same. So with asynclical storage, we can store an ID which we can retrieve in the log function. We provide even though the log function is defined outside the request. Now we can stamp the request ID on every message so we can correlate the exact request to every log message. So now let's put the two together. This next part is a bit dense, but this is a lower level manual approach which is getting simplified. And I'll get to that in a bit. Tracing a single action can be expressed with five events. When the call starts and ends, an equivalent async start and end around callbacks and an event for errors. The tracing prefix and the event name suffixes you can see in the names there are important. The middle part can be whatever you want. As stated earlier, it's a good idea to include your module name in the channel name. So like module.function is a reasonable naming pattern, but you can follow whichever naming pattern you like the middle part as long as you're consistent about it and documented clearly. So each of these events is given a separate channel, allowing selectively listening to only which events are relevant to the particular use case. And all events within a single action share a single message object to allow sharing data between event handlers. So the end event will receive the same message object as the start event if they both came from the same traced call. And if an error occurs, an error property will be added to the message and it will publish the message to the error channel. And the result property also gets added to the sync or async or like the synchronous end or async start. It'll have a result property. Then, so tracing async or tracing sync code, the start and end events provide a scope in which any sync or async activity will be attributed to the given message. In the case of trace sync, it doesn't need to associate async activity, but it will provide a shared message object, making it easy to correlate the events from a single traceable action. If the traced function throws an error, it will be added to the message object and an error event will be published. And the return value is also stored on the message object and can be seen in the end event. So tracing callbacks, it's the same as with sync tracing. Callback tracing has a start and end events around the sync call and captures thrown errors. But additionally, callback errors are also captured with error events and callback execution is scoped by the async start and async end events. This is helpful to restore span scope with async local storage if necessary. And so that as with callback tracing, all sync tracing code applies the same. Promise rejections are captured and published to the error channel and async start marks the completion of the async task. For consistency with callbacks, async end is also published, though it's semantically identical to async start in this case. It triggers immediately. Due to the infinitely chainable nature of promises, there's no distinct end point for a promise, so the async end is triggered immediately. This makes sense anyway, as promises are meant to model async graph resolving and merging back as expressed with async await. There wouldn't be anything which descends internally like callbacks do. So the graph would have merged back to the awaiting code. So storing traces, so now that we're publishing traces to tracing channels, we need to transform that data into something which meaningfully represents that collection of lifecycle events. This is typically called a span, which contains the metadata marking the logical start and end from the tracer perspective and captures whatever metadata it needs to identify what the application was doing that produced the span. So for example, a span for an HTTP request would likely contain the method URL. Here we're just passing in the message as is, but realistically a tracer would do some additional processing. This span object is then stored in async local storage using enter with to be made available in descending async activity as the parent binding shown in the start event and restored in the end event. Using this enter with method is not really recommended, but this is how we have to do things at the moment. Better things are coming, which I'll get into in a moment. Between the async start and end events, marking the callback scope, the span stored on message can be restored as the current span. This should not actually be necessary as async local storage is meant to handle this, but it can be handy just for like... There are cases where like context loss can happen. So this can enable manually restoring the context if necessary. It's not shown in here, but the tracers will generally have an end marker for when the span ends, but it varies when that would happen. Some tracers will consider once the processing of the async task itself is done, then it'll mark it as done, but some will consider themselves as owning their callback. And therefore, anything that happens within the callback and descending from that is owned as well. So it might trigger the end events in the async end or the async start, or it might even pass it through a whole graph and not complete the span until the whole tree of spans completes. A span is a container to store the data about the current execution along with IDs to form a graph. So typically spans have their own unique ID and then a separate unique ID to their graph to correlate them later. Holding a parent span object directly can cause memory leaks and it complicates event streaming. So it's generally preferred to use IDs for correlation linkage. And then when the end is triggered, the span is reported to the tracer. But like I said before, depending on the implementation of the tracer, this may aggregate to a trace object locally and send when the request completes, or it may like stream the individual spans immediately. It's up to the implementer. But it's getting easier though. So managing all those channels and ensuring correct behavior is complicated. So there's a higher level api coming that handles that internally. It constructs and exposes the correctly named channels for you, but it also provides several functions to wrap up the functionality of tracing in a much friendlier interface. So let's have a look. So first we have trace sync. It handles the start and an error events automatically, and it's fairly easy to use. You give it a function and it'll immediately invoke it with the start and end events before and after. You can also pass in a value for this and any arguments you'd like to pass through, which is helpful to avoid unnecessary closures. And it will also capture and pass through the return value, which allows you to fairly directly replace an existing function call with one wrapped in tracing events. Next we have trace callback, which in addition to handling the start and an error events also automatically triggers the async start and async end events around the callback and captures callback errors and results. So the position of the callback must be given and can either be indexed positively from zero or negatively from the end of the arguments list. The default is minus one to match the typical callback last design that node.js tends to follow. As with trace sync, this argument list is passed through as our return values. The generally callback based APIs won't actually have a return value. And tracing channel for promises. So last we have trace promise. As with trace callback, all events are handled automatically. Inputs and outputs are passed through the same as other trace functions. And as with the manual example shown earlier, the async end is published immediately after the async start as there's no clear end to promise continuation. And then to bind diagnostics channel data or these new tracing channels to async local storage, there's also bind store and run stores coming soon. The run stores method is used to suggest that stores might want to set their state from the given data for the duration of the provided function. It's used on a name channel the same as publish would be. And we'll also publish the input data on that channel after setting the state on any bound stores. And the bind store method can then be used to accept that suggested data, optionally transforming it in some way before it is set to the state of the given store for the duration of the function given to run stores. Like in this example, the URL store will contain the URL that is passed in in that object passed run stores. And so the combination of the two allows the decision to store context data in a particular store or any number of stores to be coordinated through diagnostics channel rather than needing to pass stores around through event publishers. So like another way to approach this api would be to like make your module have like a function you have to call to pass in a store and that just requires a bunch of extra effort. And it's like complicated if you want to have multi-tenancy and all these sorts of things. This just gives that to you for free. And all the trace functions on a tracing channel also wrap the execution of their given functions with run stores, allowing a span object to be stored free trace automatically. This is helpful because while tracing channels on context object is available in the direct code, it may not be available in code the traced function calls out to. So it can be helpful to use async local storage to propagate the context objects or a span produced from it through to any nested execution. Yeah. And that's what the future of tracing node.js apps looks like. Thanks for listening and please talk to me if you have questions about adding tracing channels here on modules.
21 min
17 Apr, 2023

Check out more articles and videos

We constantly think of articles and videos that might spark Git people interest / skill us up or help building a stellar career

Workshops on related topic