Getting Started with Open Source Observability


Learn how to instrument Node and client applications using OpenTelemetry and popular open-source observability tools like Prometheus and Grafana.

21 min
24 Oct, 2022

AI Generated Video Summary

Observability is a crucial tool for engineers to ship reliable websites and have a good developer experience. Metrics, logs, and traces are fundamental data types for achieving observability. OpenTelemetry is an open source standard for observability data, and instrumentation can be done manually or automatically. Data collection and transportation can be handled by packages built by the OpenTelemetry community, and running the collector can add security benefits. Grafana is used for data visualization, and it allows for analyzing web performance, exceptions, and traces. Traces capture user interactions and provide insights into user behavior and error occurrences. Leveraging these tools allows for gaining insights into various parts of the system and improving engineering goals.

1. The Story of StrongBad Inc.

Short description:

Let's talk about the fictional but not entirely unrealistic story of StrongBad Inc., a young software company. StrongBad Inc. is an email advice platform where people can go online and ask questions which Strong Bad will answer. On his first day, after getting his dev environment set up, Homestar Runner was super excited to pick up his first ticket and add a tags input to the email submission form. They found a fix and were able to ship it to production and hooray, the website was back up. As engineers, we want to ship reliable websites, debug issues quickly and confidently, and have a good developer experience. Observability is one tool in our toolbox to help achieve those goals.

Let's talk about the fictional but not entirely unrealistic story of StrongBad Inc., a young software company. My name is Connor Lindsey and I'm a software engineer at Grafana Labs.

Strong Bad Inc. is an email advice platform where people can go online and they can ask questions which Strong Bad will answer. Like, can you draw a dragon or do you like techno music? Strong Bad is a good developer but he wanted a little bit of extra help around the office so he brought on his friend Homestar Runner to help out with some development tasks.

On his first day, after getting his dev environment set up, Homestar Runner was super excited to pick up his first ticket and add a tags input to the email submission form. Everything went great: he opened his first PR, it got approved, and everyone was ecstatic when it got merged and deployed to production. He was so excited to be contributing code on his first day. Only to find out, after they came back from lunch, that the website had been taken down. They had no idea what was going on, so they walked over to customer support to see what was happening, and indeed they confirmed that the system was down. After a couple of hours of stressful debugging, they found a fix and were able to ship it to production and, hooray, the website was back up. But they were left wondering what caused the bug. Even after debugging, they weren't 100% sure what the root cause was. Even with the fix, they weren't left with a lot of confidence as to what had happened and how they could prevent it in the future.

So Homestar was wondering to himself, there's got to be a better way. As engineers, we have the same goals. We want to ship reliable websites. When we have issues, we want to debug them quickly and confidently, understanding the root cause of what is going on. We want performant websites, and we want to understand performance issues that are happening so we can improve them. Ultimately, we want reliable and performant sites so we can have a positive user experience, so that people can go to our websites and successfully accomplish what they came there to do. We also want to have a good developer experience. It should be as painless as possible to write, ship, and maintain code. I think one of the most frustrating things as a developer is trying to reproduce a really, really difficult-to-reproduce bug where you just have no leads, no idea where to get started, and are left wandering aimlessly through your logs, through whatever kind of breadcrumbs you have to get started. So, observability is one tool in our toolbox that we can reach for to help achieve some of those goals. This concept comes from control theory, and a formal definition would be having the ability to infer the state of an internal system based on its outputs. So, when we think about our front end applications running on end user devices, they're kind of like a black box.

2. Types of Observability Data

Short description:

We can achieve insight into how our applications are running in production by using observability. Metrics, logs, and traces are fundamental data types that help achieve observability. Metrics provide aggregated data types, but traces and logs are important for more detailed insights. Traces give a detailed overview of request flow across service boundaries, while logs provide a linear stream of events. Having all these data types and correlating them is crucial for achieving observability.

We don't have full control or insight into what is happening when they're running. You know, it works on my machine, but I have no idea why it's broken on a customer's. And so, observability as a concept can help us gain insight into how our applications are running in production in the same way that we do when we're running them locally. You know, where we have full insight into exactly what's happening. We have all of our logs, we have our dev tools, we can see the network tab, et cetera.

We want to achieve some of those same tools and some of those same insights in production that we have when running locally. So, when talking about observability, let's talk about some of the tools, some of the things that we can reach for to achieve that goal. One of those things is the different types of data that we can collect. Metrics, logs, and traces are kind of the fundamental data types that we have to work with. These in and of themselves do not mean that our systems are observable. But they're starting points. They're tools that we can work with to achieve observability.

So, metrics are numeric or aggregated data types. And because of aggregations that occur to get these metrics, you lose some level of detail. Which is why traces and logs are really important accompanying tools that we can reach for. So, some metrics that we can see on this dashboard are things like memory versus CPU usage. The number of requests that our servers are getting. Or for the frontend, distribution of page load times.

The next data type is traces. A trace, similar to a stack trace, will give you an overview of a request as it passes through your system. Unlike a stack trace from a single running application, a distributed trace, when talking about observability, can go across service boundaries. So, for example, you could see a trace of what happens when a user logs into your website. Some operations occur on the frontend, which then makes an HTTP request to your backend, which then makes requests to a caching layer or a database, for example. Traces give you a lot of detail about how the request flows through all of those systems and can be really, really useful.

Next, we'll look at logs. These are a linear stream of events, which are really handy to have but can often feel like a firehose. So, having good formatting and a good log aggregation tool make logs a lot more useful and a lot easier to work with. Individually, each of these data types are really powerful, but they have different pros and cons as to the type of information that they're showing, their cost, their performance implications, you know, what it takes to operate them. And so, having all of them and being able to correlate them is super, super useful. For example, when we think about a metric, it's this aggregated high-level overview of a single data point, whereas a trace is a single request that's passing through a system.

3. Exemplars, OpenTelemetry, and Instrumentation

Short description:

One useful flow is going from a high-level metric to a trace with exemplars. OpenTelemetry is an open source standard for observability data. It has a specification and language-specific implementations. Instrumentation is done manually or automatically.

And so, one very useful flow is to go from a metric that's a very high-level aggregated view to a trace that went into making that metric. That's a feature called exemplars, which can be really handy, you know. You're looking at your page load dashboard, and you see that there's a spike at 2 p.m., and you can click on an exemplar at that point, and you can see this is an example of a page load that went into making that metric. And then, maybe you get some more insights when you're looking at the very specific example versus just aggregated overview.

So, let's look at how Homestar Runner could apply these concepts to their application. Looking at the demo application, they have a React front-end and a Node back-end, which works with a Postgres database. Once you have your application, you need some way to collect the telemetry data, and that's where something like OpenTelemetry comes into play. It plays the role of instrumentation, where you collect the different data that you're interested in. From there, you need to send it off somewhere to be stored long term. This is where open source tools like Tempo, Loki, and Mimir come into play. They store traces, logs, and metrics, respectively, and come with other tools, like queriers, which allow you to search for the data that you're interested in. From there, a visualization layer is helpful so you can bring in and correlate all of that data in one place. So, let's dive into OpenTelemetry.

OpenTelemetry is an open source standard for observability data. It has a specification which defines things like what a trace, a log, and a metric are, what formats they come in, and how they are used, as well as things like semantic conventions: what pieces of data will always be collected, and what names they will appear under. From there, there are language-specific implementations of the specification as APIs and SDKs. For example, for JavaScript, that means there are NPM packages that we as end users, as developers, can download and use in our applications to collect metrics, logs, and traces. From there, there are other components like exporters, which play the role of sending the data we've collected from our applications to whatever backend we're using to store it, and the collector. Let's look at an example of instrumentation.

This little code snippet is how you would set up metrics in a Node application. On line two, you're importing the meter provider from OpenTelemetry's SDK metrics base package. On line five, you instantiate that to get a meter, this kind of high-level object that you can use to create metrics. On lines 10 and 15, we see an example of that. On line 10, we create an instrument, which will be used to collect a single data point. In this case, that's the number of transactions. On line 15, you have really simple methods on the instruments that you're creating, like add, where you can increment a metric by a specific number, or you could record a specific value. This is an example of manual instrumentation, where you go throughout your codebase and collect these data points: when an event occurs, you increment a value, or you add a trace, or you write out a log using your logger, et cetera. You're doing this manually throughout your application, which is a bit of work, but it gives you a high level of control over the data that you are collecting. In contrast to manual instrumentation, there are auto instrumentations.

4. Data Collection and Transportation

Short description:

These are packages built by the OpenTelemetry community that automatically collect data. Registering an instrumentation is straightforward. Examples for the front end include document load and user interaction. Key events to track are page load, navigation events, significant user actions, errors, and page unload. Collecting attributes like app data, capabilities, and app metadata can provide valuable insights. Processors and exporters handle data collection and transportation, with options to send data to Tempo, Loki, or Mimir. The collector can act as a middle component for security purposes.

These are also packages built by the OpenTelemetry community, which use the same fundamentals to automatically collect data for you. So we'll see in the red box in the code snippet, registering an instrumentation to automatically collect data is really straightforward. You just list the ones that you want. You import them, and then it's off to the races.

Some examples that are particularly interesting for the front end are document load and user interaction. These tie into browser APIs to automatically collect data about how users are interacting with your application and how your apps are performing, and they collect things like web vitals, for example. It's definitely a balance between manual and auto instrumentation, depending on the level of granularity of the data that you need to collect. Ultimately, you as a developer know your applications and use cases best, and that can inform the type of telemetry data that you want to collect, as well as what attributes you want to attach to the events you're collecting. But here are some guidelines about what to collect.

For events, a couple of key events that you'll want to be looking at are page load and navigation events; the latter are especially important if you have a SPA versus a multi-page app, since the browser events behave a little differently when it comes to navigation. Then significant user actions, you know, important button clicks or form submissions; errors, which are especially important to track; and page unload, which you can connect to page load to get an overview of a user's interaction with a specific page. With each of those events, you can collect additional metadata which can be used later on to query your data.

So some attributes that you might want to collect are things around app data. You know, what user is currently logged in, what team are they on, are they a paid user, even things like what feature flags they have enabled. Later, that will mean that when you see an interesting data point, you can slice and dice it in different ways, and, you know, maybe a bug is occurring, but only for users who have a certain feature flag enabled. By collecting that at this point in your instrumentation, later on when you're querying you'll be able to view your data in different ways, which can lead to interesting insights. The next category of attributes you might want to collect are capabilities. These would be things like what browser the user is on, their screen size and resolution, and what connection they are on: are they on really slow 3G, are they offline, are they on wi-fi? As developers, it's really common to want to know, hey, can I use this new browser API, can I use this CSS feature? And we have generic data from sites like Can I Use. But depending on your user base and your specific application, that might skew very differently from the generic statistics. So if you're collecting this data and you say, I want to use X feature really, really badly, that would be super important for us, you can go look at your telemetry data and see, oh, actually our user base skews one way or another as far as what browsers they're using. For example, maybe you can say, oh, wait, actually, we can adopt this technology a lot earlier than we thought we could. Next are things like app metadata: what app version or build number is the user running? So, with all of these attributes and data being collected, we then need to send it off somewhere to store it and query it from later. And this is the role of processors and exporters in OpenTelemetry.

A processor, as we see here where we're adding a batch span processor, will collect all the data as you are adding it in your application and then send it off in batches. There are some other customizations that you can do with OpenTelemetry, but sending it off in batches is a good default. And then from there, you have an exporter. This will do the work of any data conversion, if the back end that you're sending to doesn't format metrics, logs, and traces in the exact same way that OpenTelemetry does, and then it handles the transportation. And so in this case, we could send it off to Tempo directly, or to Loki, or to Mimir. One optional component is that we could run the collector. Up to this point, we've kind of been talking about our front end sending data to our back end data stores. But there are definitely a lot of good use cases for having a middle component there that can act as a man in the middle between your front end and your back end. One of the primary reasons would be security.

5. Data Collection and Security

Short description:

Having the collector running can add security benefits by looking at allowed origins, doing rate limiting, and storing secrets. It is particularly relevant for front end applications.

Having data coming from your front ends means that it's public and that there are some security concerns there. And so having the collector running can add some security benefits, for example, by looking at allowed origins so that you only receive telemetry events from certain origins, or you do rate limiting, for example, in the collector. The collector can also store secrets since it's running on your own infrastructure versus your front end applications. And so if you have secrets to your different telemetry back ends, you wouldn't want to expose those publicly on your front ends. And so having them in the collector can be really useful. So the collector, kind of an optional component, but is particularly relevant for front end applications.

6. Data Source, Visualization, and Examples

Short description:

We have a data source and a visualization layer using Grafana. In Grafana, you can set up data sources, write queries, and customize panels to visualize your data. Examples include web performance, exceptions, and traces. You can mark deploys on graphs to analyze their impact on web performance. Exceptions can be sliced and diced based on attributes like URLs or browsers. Grafana allows linking data sources to see the exact request flow and correlations. Traces show the request journey and collected attributes, including query details and error effects. Spans show how different parts of the request interact.

And then from there, we have our data source, which are going to store and let us query our data. From there, we have a visualization layer. This is what we'll be using Grafana for. In Grafana, you're able to set up your data sources and then query them using Explore, where you can write ad hoc queries or in prebuilt dashboards, where, again, you're writing queries, you're customizing these panels to bring all of your data together to quickly visualize.

Let's look at a couple examples of front end specific data. In this first dashboard, we're looking at web performance using the instrumentations from OpenTelemetry. We can collect web vitals, for example, and we could get an aggregated view of how our applications are performing and how they're trending over time. One interesting thing that you could do is that you could mark on these different graphs when different deploys happen. You could see, okay, we merged this new feature, we added a new page, whatever it might be. What effect did that have on our web vitals? And you can view that over time.

Another dashboard that would be particularly relevant is looking at exceptions. You know, seeing the raw messages that you're receiving: we have 12 exceptions over the last time period regarding helmet expecting a string as a child of title, we have some error on some page. And then we can slice and dice that data in different ways based on the attributes that we were collecting earlier. For example, if we were collecting the URL that exceptions occur on, then we can have an additional panel that shows top URLs by exception count. So, we see there, okay, the feature page, something is going on, we should take a look at that. Or looking at exceptions by browser, for example. These are just examples of different ways that you can divide and query your data based on the attributes that you're collecting. You know, maybe you want to look at exceptions by feature flag so that you can see if a certain feature flag is causing a lot of errors. This dashboard gives a really high level overview, but something really interesting you can do in Grafana is link these different data sources together. So you see these errors, and maybe you want to click and see an example of one, and that would link to a trace where that error occurred. And then you can see the exact request from the frontend to the backend to the database that caused that error. And that's where that correlation piece becomes really, really important. So this is an example of using dashboards with metrics and logs to give these overviews of web performance and exceptions. Let's look at a couple examples of traces. This first example is a pretty traditional trace that shows a request coming from a frontend to an API all the way to the database, and we can see the types of data that we're collecting about it as attributes. We have, for example, the statement: the exact query that was being run and what parameters were passed in.
If we have an error occur, that would be really handy information to know what exact query ran that caused that error and what effects did it have upstream. We also get an interesting view of how the different subsections of requests, which are called spans, interact with each other. For example, the total duration of this post request as far as the front end sees it is 27 milliseconds, 16 milliseconds of which happened on the API.

7. Observability Data Insights

Short description:

We can gain insight into the duration of individual queries and collect data to work with. Traces capture user interactions, including navigation events and the data associated with them. Front-end specific data, like session IDs, provides valuable insights into user behavior and error occurrences. Examining component renders allows us to analyze mount and render times, identify unnecessary rerenders, and understand application performance. Tracing CI/CD pipelines is another interesting use case. By leveraging these tools, we can gain insights into various parts of our system, improve engineering goals, and enhance confidence in bug fixing and performance optimization.

And then we can see the duration of these individual queries. Maybe we see a query is a lot longer than we would expect it to be, and that gives us some data that we can work off of.

Next is an example of a user interaction caught as a trace. This is a navigation event to a new page, which took one millisecond. And we can see some data that we collect with that, like the user agent and the XPath of the element they used to navigate.

One really interesting piece of data that is front end specific is the session ID. Unlike a back end, where typically you're going to have a very clear cut request and response life cycle, on the front end, a user's session, their interaction with a website isn't so clearly defined. It could be very long lived. It could be minutes, it could be hours, it could be broken across days, depending on your application, how you want to define that. Creating session IDs and attaching them to your telemetry data can give you a lot of insight into how users are behaving, what pages are they going to, what features are they interacting with, over the course of a session, are there any errors that are occurring. And so, particularly for front end applications, having that session ID is really, really important.

And then finally, this is a really cool example of using traces to look at component renders. Here, we're hooking into lifecycle events of a React class component and looking at things like component mount and render. So, we see how long it takes for a component to mount, how long it takes to render, and whether it's having unnecessary rerenders. In this case, we see that the second render takes 500 milliseconds, which is way, way longer than the other ones. And so, that's some interesting data that we can look into to understand how our applications are performing.

Side note: one really interesting use of traces would be tracing a CI/CD pipeline. The applications of this can be really, really wide, and it's kind of up to us as developers to be creative with that and use these fundamental tools to gain insights into different parts of our system, whether those are our front-end applications, our CI/CD pipelines, or our back-end services, et cetera. And using all of this will help hit those engineering goals that we talked about earlier. With all this, I hope Homestar Runner and Strong Bad are understanding their application a lot, lot better and are having a lot more confidence when it comes to fixing bugs and performance issues. Thank you so much for listening. I hope you learned something and are able to apply some of these things to your applications to get better insight into how they are operating in production.
