Learn how to instrument Node and client applications using OpenTelemetry and popular open-source observability tools like Prometheus and Grafana.
Getting Started with Open Source Observability
AI Generated Video Summary
Observability is a crucial tool for engineers to ship reliable websites and have a good developer experience. Metrics, logs, and traces are fundamental data types for achieving observability. OpenTelemetry is an open source standard for observability data, and instrumentation can be done manually or automatically. Data collection and transportation can be handled by packages built by the OpenTelemetry community, and running the collector can add security benefits. Grafana is used for data visualization, and it allows for analyzing web performance, exceptions, and traces. Traces capture user interactions and provide insights into user behavior and error occurrences. Leveraging these tools allows for gaining insights into various parts of the system and meeting engineering goals.
1. The Story of Strong Bad Inc.
Let's talk about the fictional but not entirely unrealistic story of Strong Bad Inc., a young software company. Strong Bad Inc. is an email advice platform where people can go online and ask questions which Strong Bad will answer. On his first day, after getting his dev environment set up, Homestar Runner was super excited to pick up his first ticket and add a tags input to the email submission form. After the change was merged, the website went down. They found a fix and were able to ship it to production and hooray, the website was back up. As engineers, we want to ship reliable websites, debug issues quickly and confidently, and have a good developer experience. Observability is one tool in our toolbox to help achieve those goals.
Let's talk about the fictional but not entirely unrealistic story of Strong Bad Inc., a young software company. My name is Connor Lindsey and I'm a software engineer at Grafana Labs.
Strong Bad Inc. is an email advice platform where people can go online and they can ask questions which Strong Bad will answer. Like, can you draw a dragon or do you like techno music? Strong Bad is a good developer but he wanted a little bit of extra help around the office so he brought on his friend Homestar Runner to help out with some development tasks.
On his first day, after getting his dev environment set up, Homestar Runner was super excited to pick up his first ticket and add a tags input to the email submission form. Everything was great. He opened his first PR, it got approved, and everyone was ecstatic when it got merged into production. He was so excited to be contributing code on his first day, only to find out after they came back from lunch that the website had been taken down. They had no idea what was going on, so they walked over to customer support to see what was happening, and indeed they confirmed that the system was down. After a couple of hours of stressful debugging, they found a fix and were able to ship it to production and hooray, the website was back up. But they were left wondering what caused the bug. Even after debugging, they weren't 100% sure what the root cause was. Even with the fix, they weren't left with a lot of confidence as to what was happening and how they could prevent that in the future.
So Homestar was wondering to himself, there's got to be a better way. As engineers, we have the same goals. We want to ship reliable websites. When we have issues, we want to debug them quickly and confidently, understanding the root cause of what is going on. We want performant websites and we want to understand performance issues that are happening so we can improve them. Ultimately, we want reliable and performant sites so we can have a positive user experience, so that people can go to our websites and successfully accomplish what they came there to do. We also want to have a good developer experience. It should be as painless as possible to write, ship, and maintain code. I think one of the most frustrating things as a developer is trying to reproduce a really, really difficult to reproduce bug where you just have no leads, no idea where to get started, and are left, you know, kind of wandering aimlessly through your logs, through whatever kind of breadcrumbs you have to get started. So, observability is one tool in our toolbox that we can reach for to help achieve some of those goals. This concept comes from control theory. And a formal definition would be having the ability to infer the internal state of a system based on its outputs. So, when we think about our front end applications running on end user devices, they're kind of like a black box.
2. Types of Observability Data
We can achieve insight into how our applications are running in production by using observability. Metrics, logs, and traces are fundamental data types that help achieve observability. Metrics provide aggregated data types, but traces and logs are important for more detailed insights. Traces give a detailed overview of request flow across service boundaries, while logs provide a linear stream of events. Having all these data types and correlating them is crucial for achieving observability.
We don't have full control or insight into what is happening when they're running. You know, it works on my machine, but I have no idea why it's broken on a customer's. And so, observability as a concept can help us gain insight into how our applications are running in production in the same way that we do when we're running them locally. You know, where we have full insight into exactly what's happening. We have all of our logs, we have our dev tools, we can see the network tab, et cetera.
We want to achieve some of those same tools and some of those same insights in production that we have when we run locally. So, when talking about observability, let's talk about some of the tools, some of the things that we can reach for to achieve that goal. And one of those is the different types of data that we can collect. Metrics, logs, and traces are kind of the fundamental data types that we have to work with. These in and of themselves do not mean that our systems are observable. But they're starting points. They're tools that we can work with to achieve observability.
So, metrics are numeric or aggregated data types. And because of aggregations that occur to get these metrics, you lose some level of detail. Which is why traces and logs are really important accompanying tools that we can reach for. So, some metrics that we can see on this dashboard are things like memory versus CPU usage. The number of requests that our servers are getting. Or for the frontend, distribution of page load times.
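To make the point about aggregation losing detail concrete, here is a minimal sketch in plain JavaScript. The sample values, the bucket boundaries, and the `aggregate` helper are all made up for illustration and are not part of any library.

```javascript
// Hypothetical page load times in milliseconds, one per page view.
const pageLoadsMs = [120, 180, 95, 2400, 210, 160];

// Aggregate raw samples into a count, a sum, and fixed histogram buckets.
// The bucket boundaries are illustrative, not a recommended default.
function aggregate(samples, boundaries) {
  const buckets = new Array(boundaries.length + 1).fill(0);
  let sum = 0;
  for (const value of samples) {
    sum += value;
    // Use the first boundary the value fits under, else the overflow bucket.
    let index = boundaries.findIndex((upper) => value <= upper);
    if (index === -1) index = boundaries.length;
    buckets[index] += 1;
  }
  return { count: samples.length, sum, buckets };
}

const pageLoadMetric = aggregate(pageLoadsMs, [100, 250, 1000]);
console.log(pageLoadMetric);
```

The aggregate shows that one page load took longer than a second, but not which user, page, or request it was. That missing detail is what traces and logs can fill in.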
The next data type is traces. Traces, similar to a stack trace, will give you an overview of a request as it passes through your system. Unlike a stack trace when you are running a single application, a distributed trace, when talking about observability, can go across service boundaries. So, for example, you could see a trace of what happens when a user logs into your website. Well, some operations occur on the frontend, which then makes an HTTP request to your backend, which then goes and makes a request to a caching layer or to a database, for example. Traces give you a lot of detail of how the request flows through all of those systems and can be really, really useful.
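As a rough illustration of the login example, a trace can be pictured as a list of spans that share a trace ID and point at their parent span. The field names below are simplified assumptions for this sketch and do not follow any exact wire format.

```javascript
// Each span records one operation; spans in the same request share a traceId.
// Field names are simplified for illustration.
const loginTrace = [
  { traceId: "t1", spanId: "a", parentId: null, service: "frontend", name: "submit login form" },
  { traceId: "t1", spanId: "b", parentId: "a", service: "backend", name: "POST /login" },
  { traceId: "t1", spanId: "c", parentId: "b", service: "cache", name: "GET session" },
  { traceId: "t1", spanId: "d", parentId: "b", service: "database", name: "SELECT user" },
];

// Walk the parent links to render the request as an indented tree.
function renderTrace(spans, parentId = null, depth = 0) {
  return spans
    .filter((span) => span.parentId === parentId)
    .flatMap((span) => [
      `${"  ".repeat(depth)}${span.service}: ${span.name}`,
      ...renderTrace(spans, span.spanId, depth + 1),
    ]);
}

const loginTraceLines = renderTrace(loginTrace);
console.log(loginTraceLines.join("\n"));
```

The parent links are what let a tracing tool show one request crossing the frontend, the backend, and the data stores as a single tree.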
Next, we'll look at logs. These are a linear stream of events, which are really handy to have but can often feel like a firehose. So, having good formatting and a good log aggregation tool makes logs a lot more useful and a lot easier to work with. Individually, each of these data types is really powerful, but they have different pros and cons as to the type of information that they're showing, their cost, their performance implications, you know, what it takes to operate them. And so, having all of them and being able to correlate them is super, super useful. For example, when we think about a metric, it's this aggregated high-level overview of a single data point, whereas a trace is a single request that's passing through a system.
3. Exemplars, OpenTelemetry, and Instrumentation
One useful flow is going from a high-level metric to a trace with exemplars. OpenTelemetry is an open source standard for observability data. It has a specification and language-specific implementations. Instrumentation is done manually or automatically.
And so, one very useful flow is to go from a metric that's a very high-level aggregated view to a trace that went into making that metric. That's a feature called exemplars, which can be really handy, you know. You're looking at your page load dashboard, and you see that there's a spike at 2 p.m., and you can click on an exemplar at that point, and you can see this is an example of a page load that went into making that metric. And then, maybe you get some more insights when you're looking at the very specific example versus just aggregated overview.
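One way to picture an exemplar is a histogram bucket that remembers the trace ID of one sample that landed in it. The following is a simplified sketch under that assumption; `createHistogram` and the trace IDs are hypothetical, and real metric backends store exemplars in their own formats.

```javascript
// A histogram that keeps one example trace ID per bucket alongside the count.
function createHistogram(boundaries) {
  const buckets = [...boundaries, Infinity].map((upperBound) => ({
    upperBound,
    count: 0,
    exemplar: null,
  }));
  return {
    record(value, traceId) {
      const bucket = buckets.find((b) => value <= b.upperBound);
      bucket.count += 1;
      // Keep the most recent trace as the exemplar for this bucket.
      bucket.exemplar = { value, traceId };
    },
    buckets,
  };
}

const pageLoadHistogram = createHistogram([250, 1000]);
pageLoadHistogram.record(180, "trace-001");
pageLoadHistogram.record(2400, "trace-002");
pageLoadHistogram.record(210, "trace-003");

// "Clicking" the slow bucket yields a trace to open.
const slowBucket = pageLoadHistogram.buckets[2];
console.log(slowBucket.exemplar.traceId);
```

The metric stays cheap and aggregated, while the stored trace ID gives a jumping-off point into one concrete page load.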
So, let's look at how Homestar Runner could apply these concepts to their application. So, looking at the demo application, they have a React front-end, a Node back-end, and that is working with a Postgres database. Once you have your application, you need some way to collect the telemetry data, and that's where something like OpenTelemetry comes into play. It plays the role of instrumentation where you collect the different data that you're interested in. From there, you need to send it off somewhere to be stored long term. This is where open source tools like Tempo, Loki, and Mimir come into play. They store traces, logs, and metrics, respectively, and have other tools with them, like queriers, which allow you to search for the data that you're interested in. From there, a visualization layer is helpful so you can bring in and correlate all of that data in one place. So, let's dive into OpenTelemetry.
This little code snippet is how you would set up metrics in a Node application. On line two, you're importing the meter provider from OpenTelemetry's SDK metrics base package. On line five, you instantiate that to get a meter, this kind of high level object that you can use to create metrics. On lines 10 and 15, we see an example of that. On line 10, we create an instrument, which will be used to collect a single data point. In this case, that's the number of transactions. On line 15, you have really simple methods with the instruments that you're creating like add, where you can increment a metric that you have by a specific number, or you could record a specific example. This is an example of manual instrumentation, where you're going in and, throughout your codebase, you are collecting these data points. When an event occurs, you increment a value, or you add a trace, or you write out a log using your logger, et cetera. You're doing this manually throughout your application, which is a bit of work, but gives you a high level of control over the data that you are collecting. In contrast to manual instrumentation, there are auto instrumentations.
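The slide's snippet is not reproduced in this transcript. As a stand-in, here is a dependency-free sketch of the same idea: get a meter, create a counter instrument, and call `add` where the event happens. `SketchMeter`, `read`, and `submitEmail` are hypothetical names and this is not the OpenTelemetry SDK, so check the OpenTelemetry documentation for the real package and method signatures.

```javascript
// Stand-in for a meter: NOT the OpenTelemetry SDK, just the same shape of idea.
class SketchMeter {
  constructor() {
    this.values = new Map();
  }
  // Create a named counter instrument that can only be incremented.
  createCounter(name) {
    this.values.set(name, 0);
    return {
      add: (amount) => {
        if (amount < 0) throw new RangeError("counters only go up");
        this.values.set(name, this.values.get(name) + amount);
      },
    };
  }
  // Hypothetical helper so the sketch can print the current value.
  read(name) {
    return this.values.get(name);
  }
}

const sketchMeter = new SketchMeter();
const transactionCounter = sketchMeter.createCounter("transactions");

// Manual instrumentation: the application code increments the counter itself.
function submitEmail(question) {
  transactionCounter.add(1);
  return `received: ${question}`;
}

submitEmail("can you draw a dragon?");
submitEmail("do you like techno music?");
console.log(sketchMeter.read("transactions"));
```

The trade-off is visible even in this toy version: every place you care about needs an explicit call, but you decide exactly what gets counted.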
4. Data Collection and Transportation
These are packages built by the OpenTelemetry community that automatically collect data. Registering an instrumentation is straightforward. Examples for the front end include document load and user interaction. Key events to track are page load, navigation events, significant user actions, errors, and page unload. Collecting attributes like app data, capabilities, and app metadata can provide valuable insights. Processors and exporters handle data collection and transportation, with options to send data to Tempo, Loki, or Mimir. The collector can act as a middle component for security purposes.
These are also packages built by the OpenTelemetry community, which use the same fundamentals to automatically collect data for you. So we'll see in the red box in the code snippet, registering an instrumentation to automatically collect data is really straightforward. You just list the ones that you want. You import them, and then it's off to the races.
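The slide with the registration call is not part of this transcript. The sketch below shows the general technique that auto instrumentation tends to rely on, which is wrapping existing functions so that call sites stay unchanged. `createCallInstrumentation`, `registerSketchInstrumentations`, and `emailApi` are hypothetical names for illustration and are not OpenTelemetry APIs.

```javascript
// A hypothetical "instrumentation": it wraps a function on a target object
// and records each call and its duration, with no changes at call sites.
function createCallInstrumentation(target, methodName, records) {
  return {
    enable() {
      const original = target[methodName];
      target[methodName] = function (...args) {
        const start = Date.now();
        try {
          return original.apply(this, args);
        } finally {
          records.push({ name: methodName, durationMs: Date.now() - start });
        }
      };
    },
  };
}

// Registering is just listing the instrumentations you want and enabling them.
function registerSketchInstrumentations(instrumentations) {
  for (const instrumentation of instrumentations) instrumentation.enable();
}

const emailApi = {
  send(question) {
    return question.length;
  },
};
const callRecords = [];

registerSketchInstrumentations([createCallInstrumentation(emailApi, "send", callRecords)]);

emailApi.send("hello"); // application code is unchanged
console.log(callRecords.length, callRecords[0].name);
```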
Some examples that are particularly interesting for the front end are document load and user interaction. These are going to tie into browser APIs to automatically collect data about how users are interacting with your application and how your apps are performing, and collect things like web vitals, for example. It's definitely a balance between manual and auto instrumentation, depending on the level of granularity of data that you need to collect. Ultimately you, as a developer, know your applications and use cases best, and that can inform the type of telemetry data that you want to collect, as well as what attributes you want to attach to the events you're collecting. But here are some guidelines about what to collect.
For events, a couple of key events that you'll want to be looking at are page load and navigation events. That's especially important if you have a SPA versus a multi-page app, since the browser events will behave a little bit differently when it comes to navigation. Then there are significant user actions, you know, important button clicks or form submissions. Errors are especially important to track. And with page unload, you can connect that to page load to get an overview of a user's interaction with a specific page. With each of those events, you can collect additional metadata which can be used later on to query your data.
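Here is a small sketch of why metadata on events pays off at query time. The event shapes, attribute names, and the `countBy` helper are assumptions made up for this example.

```javascript
// Hypothetical telemetry events; attribute names are illustrative.
const telemetryEvents = [
  { type: "page_load", attributes: { url: "/", browser: "firefox", featureFlag: "none" } },
  { type: "error", attributes: { url: "/features", browser: "chrome", featureFlag: "new-editor" } },
  { type: "error", attributes: { url: "/features", browser: "safari", featureFlag: "new-editor" } },
  { type: "error", attributes: { url: "/inbox", browser: "chrome", featureFlag: "none" } },
];

// Count events of one type, grouped by a single attribute.
function countBy(events, type, attribute) {
  const counts = {};
  for (const event of events) {
    if (event.type !== type) continue;
    const key = String(event.attributes[attribute]);
    counts[key] = (counts[key] || 0) + 1;
  }
  return counts;
}

const errorsByUrl = countBy(telemetryEvents, "error", "url");
const errorsByFlag = countBy(telemetryEvents, "error", "featureFlag");
console.log(errorsByUrl, errorsByFlag);
```

The same three errors can be viewed by URL or by feature flag only because those attributes were attached when the events were recorded.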
So some attributes that you might want to collect are things around app data. You know, what user is currently logged in, what team are they on, are they a paid user, even things like what feature flags do they have enabled, because later that will mean that when you see an interesting data point, you can slice and dice it in different ways, and, you know, maybe a bug is occurring, but only for users who have a certain feature flag enabled. And so by collecting that at this point in your instrumentation, later on when you're querying it, you'll be able to view your data in different ways which can lead to interesting insights. The next category of attributes you might want to collect is capabilities. These would be things like what browser the user is on, their screen size and resolution, what connection are they on, are they on really slow 3G, are they offline, are they on wi-fi. And as developers, it's really common to want to know, hey, can I use this new browser API, can I use this CSS feature? And we have generic data like Can I Use. But depending on your user base, your specific application, that might skew very differently from the generic statistics. So if you're collecting this data and you say, I want to use X feature really, really badly, that would be super important for us, you can go look in your telemetry data and you can see, oh, actually our user base skews in one way or another as far as what browsers they're using. For example, maybe you can say, oh, wait, actually, we can adopt this technology a lot earlier than we thought we could have. Next are things like app metadata, what app version or build number is the user running? So, with all of these attributes being collected and data being collected, we then need to send it off to somewhere to store and query from later. And this is the role of processors and exporters in OpenTelemetry.
A processor, and we see here that we're adding a batch span processor, will collect all the data as you are adding it in your application and then send it off in batches. There are some other customizations that you can do with OpenTelemetry, but sending it off in batches is kind of a good standard. And then from there, you have an exporter. This will do the work of doing any data conversion if the back end that you're sending it to doesn't format metrics, logs, and traces in the exact same way that OpenTelemetry does, and then it handles the transportation, right? And so in this case, we could send it off to Tempo directly or to Loki or to Mimir. One optional component is that we could run the collector. So up to this point, we've kind of been talking about our front end sending data to our back end data stores. But there are definitely a lot of good use cases for having a middle component there that can act as a man in the middle from your front end to your back end. One of those primary reasons would be security.
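The division of labor between a processor and an exporter described above can be sketched without any dependencies. `createBatchProcessor`, `sketchExporter`, and the payload format are hypothetical and only mimic the idea. Real batch processors usually also flush on a timer, which is left out here to keep the example deterministic.

```javascript
// Hypothetical batching processor: buffers items and hands them to an
// exporter once the buffer reaches maxBatchSize, or when flush() is called.
function createBatchProcessor(exporter, maxBatchSize) {
  let buffer = [];
  function flush() {
    if (buffer.length === 0) return;
    const batch = buffer;
    buffer = [];
    exporter.export(batch);
  }
  return {
    onEnd(item) {
      buffer.push(item);
      if (buffer.length >= maxBatchSize) flush();
    },
    flush,
  };
}

// Hypothetical exporter: converts to the backend's format and "sends" it.
// Here sending just appends a JSON payload to an in-memory list.
const sentPayloads = [];
const sketchExporter = {
  export(batch) {
    sentPayloads.push(JSON.stringify({ spans: batch }));
  },
};

const batchProcessor = createBatchProcessor(sketchExporter, 2);
batchProcessor.onEnd({ name: "page load" });
batchProcessor.onEnd({ name: "click submit" }); // second item triggers a send
batchProcessor.onEnd({ name: "navigation" });
batchProcessor.flush(); // e.g. on page unload, send whatever is left
console.log(sentPayloads.length);
```

Three recorded items result in two network-style sends instead of three, which is the point of batching.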
5. Data Collection and Security
Having the collector running can add security benefits by looking at allowed origins, doing rate limiting, and storing secrets. It is particularly relevant for front end applications.
Having data coming from your front ends means that it's public and that there are some security concerns there. And so having the collector running can add some security benefits, for example, by looking at allowed origins so that you only receive telemetry events from certain origins, or you do rate limiting, for example, in the collector. The collector can also store secrets since it's running on your own infrastructure versus your front end applications. And so if you have secrets to your different telemetry back ends, you wouldn't want to expose those publicly on your front ends. And so having them in the collector can be really useful. So the collector, kind of an optional component, but is particularly relevant for front end applications.
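The two checks mentioned above, allowed origins and rate limiting, can be sketched as a small gatekeeper function. This is not collector configuration. `createGatekeeper`, the origin URLs, and the limits are assumptions for illustration, and an origin check on its own will not stop a sender that forges the header.

```javascript
// Hypothetical gatekeeper in front of the telemetry backends.
function createGatekeeper({ allowedOrigins, maxPerWindow, windowMs }) {
  const windows = new Map(); // origin -> { start, count }
  return function accept(origin, nowMs) {
    // Reject anything that does not come from a known origin.
    if (!allowedOrigins.includes(origin)) return false;
    const current = windows.get(origin);
    // Start a fresh fixed window if none exists or the old one expired.
    if (!current || nowMs - current.start >= windowMs) {
      windows.set(origin, { start: nowMs, count: 1 });
      return true;
    }
    if (current.count >= maxPerWindow) return false;
    current.count += 1;
    return true;
  };
}

const acceptTelemetry = createGatekeeper({
  allowedOrigins: ["https://strongbad.example"],
  maxPerWindow: 2,
  windowMs: 1000,
});

const decisions = [
  acceptTelemetry("https://evil.example", 0), // unknown origin
  acceptTelemetry("https://strongbad.example", 0),
  acceptTelemetry("https://strongbad.example", 10),
  acceptTelemetry("https://strongbad.example", 20), // over the limit
  acceptTelemetry("https://strongbad.example", 1500), // new window
];
console.log(decisions);
```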
6. Data Source, Visualization, and Examples
We have a data source and a visualization layer using Grafana. In Grafana, you can set up data sources, write queries, and customize panels to visualize your data. Examples include web performance, exceptions, and traces. You can mark deploys on graphs to analyze their impact on web performance. Exceptions can be sliced and diced based on attributes like URLs or browsers. Grafana allows linking data sources to see the exact request flow and correlations. Traces show the request journey and collected attributes, including query details and error effects. Spans show how different parts of the request interact.
And then from there, we have our data sources, which are going to store and let us query our data. From there, we have a visualization layer. This is what we'll be using Grafana for. In Grafana, you're able to set up your data sources and then query them using Explore, where you can write ad hoc queries, or in prebuilt dashboards, where, again, you're writing queries, you're customizing these panels to bring all of your data together to quickly visualize.
Let's look at a couple examples of front end specific data. In this first dashboard, we're looking at web performance using the instrumentations from OpenTelemetry. We can collect web vitals, for example, and we could get an aggregated view of how our applications are performing and how they're trending over time. One interesting thing that you could do is that you could mark on these different graphs when different deploys happen. You could see, okay, we merged this new feature, we added a new page, whatever it might be. What effect did that have on our web vitals? And you can view that over time.
Another dashboard that would be particularly relevant is one looking at exceptions. You know, seeing the raw messages that you're receiving, you know, we have 12 exceptions over the last time period regarding helmet, expecting a string, as a child of title, you know, we have some error on some page. And then we can slice and dice that data in different ways based on the attributes that we were collecting earlier. For example, if we were collecting the URL that exceptions are occurring on, then we can have this additional panel that shows top URLs by exception count. So, we see there, okay, the feature page, something is going on, we should take a look at that. Or looking at exceptions by browser, for example. So, these are just kind of examples of different ways that you can divide and query your data based on the attributes that you're collecting. You know, maybe you want to look at exceptions by feature flag so that you can see if a certain feature flag is causing a lot of errors. And this dashboard gives a really high level overview, but something really interesting you can do in Grafana is link these different data sources together. So you see these errors, and maybe you want to click and see an example of that. And that would link to a trace where that error occurred. And then you can see the exact request from the frontend to the backend to the database that caused that error. And that's where that correlation piece becomes really, really important. So this is an example of using dashboards with kinda metrics and logs to give these overviews of web performance and exceptions. Let's look at a couple examples of traces. This first example is a pretty traditional trace that shows a request coming from a frontend to an API all the way to the database, and we can see the types of data that we're collecting about it as attributes. We have, for example, the statement, the exact query that was being run and what parameters were passed in.
If we have an error occur, that would be really handy information to know what exact query ran that caused that error and what effects did it have upstream. We also get an interesting view of how the different subsections of requests, which are called spans, interact with each other. For example, the total duration of this post request as far as the front end sees it is 27 milliseconds, 16 milliseconds of which happened on the API.
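The span timings above lend themselves to a little arithmetic. In this sketch the 27 ms and 16 ms figures come from the example, while the 9 ms database span, the span IDs, and the `selfTimeMs` helper are made up for illustration.

```javascript
// Durations from the example: the POST takes 27 ms as seen from the
// front end, of which 16 ms is spent in the API. The database figure
// is a made-up value for illustration.
const postSpans = [
  { id: "frontend", parentId: null, durationMs: 27 },
  { id: "api", parentId: "frontend", durationMs: 16 },
  { id: "db-query", parentId: "api", durationMs: 9 },
];

// Self time = a span's duration minus the time covered by its direct children.
// This assumes children run one after another and do not overlap.
function selfTimeMs(spans, id) {
  const span = spans.find((s) => s.id === id);
  const childTotal = spans
    .filter((s) => s.parentId === id)
    .reduce((total, child) => total + child.durationMs, 0);
  return span.durationMs - childTotal;
}

console.log(selfTimeMs(postSpans, "frontend")); // time spent outside the API
console.log(selfTimeMs(postSpans, "api")); // API time not spent in the query
```

With these numbers, 11 of the 27 milliseconds are spent outside the API, for instance on the network or in the browser.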
7. Observability Data Insights
We can gain insight into the duration of individual queries and collect data to work with. Traces capture user interactions, including navigation events and the data associated with them. Front-end specific data, like session IDs, provides valuable insights into user behavior and error occurrences. Examining component renders allows us to analyze mount and render times, identify unnecessary rerenders, and understand application performance. Tracing CI-CD pipelines is another interesting use case. By leveraging these tools, we can gain insights into various parts of our system, improve engineering goals, and enhance confidence in bug fixing and performance optimization.
And then we can see the duration of these individual queries. Maybe we see a query is a lot longer than we would expect it to be, and that gives us some data that we can work off of.
Next is an example of a user interaction caught as a trace. This is a navigation event to a new page, which took one millisecond. And we can see some data that we collect with that, like the user agent and the XPath that they used to navigate from.
One really interesting piece of data that is front end specific is the session ID. Unlike a back end, where typically you're going to have a very clear cut request and response life cycle, on the front end, a user's session, their interaction with a website isn't so clearly defined. It could be very long lived. It could be minutes, it could be hours, it could be broken across days, depending on your application, how you want to define that. Creating session IDs and attaching them to your telemetry data can give you a lot of insight into how users are behaving, what pages are they going to, what features are they interacting with, over the course of a session, are there any errors that are occurring. And so, particularly for front end applications, having that session ID is really, really important.
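Since how a session is defined is up to the application, here is one possible sketch: a session ends after a period of inactivity, and every event gets tagged with the current session ID. `createSessionTracker`, the 30 minute timeout, and the counter-based IDs are assumptions for illustration. A real application would more likely use random IDs.

```javascript
// Hypothetical session tracker: a session ends after a period of inactivity.
// The timeout value is an arbitrary choice for illustration.
function createSessionTracker(timeoutMs, newId) {
  let sessionId = null;
  let lastSeenMs = 0;
  return function tag(event, nowMs) {
    // Start a new session on the first event or after a long gap.
    if (sessionId === null || nowMs - lastSeenMs > timeoutMs) {
      sessionId = newId();
    }
    lastSeenMs = nowMs;
    return { ...event, attributes: { ...event.attributes, sessionId } };
  };
}

let nextSession = 0;
const tagWithSession = createSessionTracker(30 * 60 * 1000, () => `session-${++nextSession}`);

const minute = 60 * 1000;
const taggedEvents = [
  tagWithSession({ name: "page_load", attributes: { url: "/" } }, 0),
  tagWithSession({ name: "click", attributes: { target: "submit" } }, 5 * minute),
  tagWithSession({ name: "page_load", attributes: { url: "/" } }, 120 * minute),
];
console.log(taggedEvents.map((e) => e.attributes.sessionId));
```

The first two events share a session, and the third, arriving after a long gap, starts a new one. Grouping telemetry by that attribute then reconstructs what a user did within one visit.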
And then finally, this is a really cool example of traces to look at component renders. So, here, we're hooking into lifecycle events of a React class component and we look at things like component mount and render. So, we see how long it takes for a component to mount, how long does it take to render, is it having unnecessary rerenders? In this case, we see that the second render takes 500 milliseconds, which is way, way longer than the other ones. And so, that's some interesting data that we can look into to understand how our applications are performing.
Side note, one really interesting use of traces would be tracing a CI-CD pipeline. The applications of this can be really, really wide and it's kind of up to us as developers to be creative with that and use these fundamental tools to gain insights into different parts of our system, whether those are our front-end applications, our CI-CD pipelines, or our back-end services, etc. And using all of this will help hit those engineering goals that we talked about earlier. With all this, I hope Homestar Runner and Strong Bad are understanding their application a lot, lot better and are having a lot more confidence when it comes to fixing bugs and performance issues. Thank you so much for listening. Hope you learned something and are able to apply some of the things to your applications to get a better insight into how they are operating in production.