Serverless Observability: Where SLOs Meet Transforms

Rate this content
Bookmark

This talk explores the use case of SLOs and transforms whilst migration to the serverless ecosystem. The talk starts by presenting the reasons why SLOs are important in the SRE/DevOps framework. It then analyses specific use cases of SLOs, the tools used for measuring the efficiency of SLOs and presents the main bottlenecks encountered when defining and adhering to SLOs in the process of migrating to a serverless ecosystem, especially when dealing with burn rates and transforms.

By the end of the talk, the audience will be able to subscribe to the following takeaways:

SLOs are important in a SRE/DevOps framework and there are numerous advantages for implementing them as long as they adhere to a process of continuous improvement.

Adopting the right tools and metrics are paramount in the implementation of SLOs.

Migrating to serverless adds pressure to the observability system and it is possible that burn rates and transforms will backfire. Our use case will show how it's possible to mitigate these challenges and learn from similar situations.

8 min
15 Feb, 2024

Video Summary and Transcription

This Talk provides an introduction to Serverless Observability and SLOs, explaining the concept of SLOs and their dependency on transforms. It highlights the codependency between SLOs, SLAs, and SLIs and discusses the importance of well-defined SLOs. The Talk also demonstrates how to create and monitor SLOs and alert rules, emphasizing the benefits of burn rate alerting in reducing alert fatigue and improving user experience.

Available in Español

1. Introduction to Serverless Observability and SLOs

Short description:

Hi, I'm Diana Toda. I'm here to present Serverless Observability where SLOs meet transforms. We'll discuss the concept, the SLO's dependency on transforms, SLO transform architecture, burn rate alerting, and have a short demo. Server level indicators are a measure of the service level, defined as a ratio of goods over total events. The service level objectives are the target values for a service level, and the error budget is the tolerated quantity of errors.

Hi, DevOps.js. I'm Diana Toda. I'm an SRE at Elastic, and I'm here to present Serverless Observability where SLOs meet transforms. So we're going to talk about the concept, the SLO's dependency on transforms, SLO transform architecture, burn rate alerting, and we're going to have a short demo.

So a bit of context. With Elastic's migration to serverless, we had the need to come up with a new idea around the rollup aggregations. So Elastic has a multi-cluster infrastructure, and we needed to move away from rollup aggregations and search due to some of their limitations. Then we started creating the transforms.

So let's start with some definitions. Server level indicators, as probably you well know, are a measure of the service level provided. They are usually defined as a ratio of goods over total events, and they range between 0 and 100%. Some examples, availability, throughput, request latency, error rates. The service level objectives are a target value for a service level measured by an SLI. Above the threshold, the service is compliant. For example, 95% of the successful requests are served under 100 milliseconds. The error budget is defined as 100% minus the SLO. So it's the quantity of errors that is tolerated, and the burn rate is the rate at which we are burning the error budget over a defined period of time. It's very useful at alerting before exhausting the error budget.

2. Codependency Between SLOs, SLAs, and SLIs

Short description:

So we have a codependency between SLOs, SLAs, and SLIs. How do we recognize the good SLO versus a bad SLO? A well-defined SLO focuses on a crucial aspect of service quality, provides clarity, measurability, and alignment with user expectations. The SLO architecture relies on transforms to roll up the source data and summarize it into entity-centric indices. Transforms enable you to convert existing indices, providing new insights and analytics. Burn rate alerting calculates the rate at which SLOs are failing over time, helping prioritize issues. It has reduced alert fatigue, improved user experience, and good precision. Let's move on to the demo where you can create and monitor SLOs.

So we have a codependency between SLOs, SLAs, and SLIs. So how do we recognize the good SLO versus a bad SLO? A bad SLO is vague, subjective, it lacks quantifiable metrics, it has an undefined threshold, and no observation window. A good SLO is specific and measurable, user-centric, quantifiable, and achievable, and it's time frame defined. So a well-defined SLO focuses on a crucial aspect of service quality, provides clarity, measurability, and alignment with user expectations, which are essential elements for effective monitoring and evaluation of service reliability.

The SLO architecture, basically the SLOs rely on the transform surface to roll up the source data into roll-up indices. To support a group by or the partition by feature, Elastic has added a second layer which summarizes the roll-up data into an entity-centric index for each SLO. This index also powers the search experience to allow users to search and sort by in any SLO dimension. So what are transforms? Transforms are persistent tasks that enable you to convert existing Elastic search indices into summarized indices, which provide opportunities for new insights and analytics. For example, you can use transforms to pivot your data into entity-centric indices that summarize the behavior of users or sessions or other entities in your data. Or you can use transforms to find the latest document among all the documents that have a certain unique key.

The burn rate alerting calculates the rate at which SLOs are failing over multiple windows of time, is less sensitive to short-term fluctuations by focusing on sustained deviations, and it can give you an indication of how severely the service is degrading and helps prioritize multiple issues at the same time. Here we have a graph of burn rate alerting with multiple windows. So we have two windows for each severity, a short and a long one. The short window is 112 of the long window so when the burn rate for both windows exceeds the threshold, the alert is triggered. The pros with burn rate alerting is that it has a reduced alert fatigue, improved user experience, a flexible alerting framework, and a good precision. The con at the moment is that you have lots of options to configure, but this will be improved with future versions of Elasticsearch.

So it's demo time. So here there is some demo that I made up for you around the transforms. You can see you can create the transforms there. You can check the data behind it. You have stat, JSON, messages, and some preview. And you can check the health of each transform. It could be degraded, healthy, or even failed. If you have some issues, you can troubleshoot it right from this screen. So let's try to create some SLOs. You go to observability, SLOs, and create a new SLO. You choose the type of the slide that you want, the index. In my case, I will use an serverless index, and a time seven field. You add your query filter that you're interested in, the good query that you like for your SLO, and the total query. Afterwards, you have an interesting selection here to partition by.

3. Creating SLOs and Alert Rules

Short description:

You set your objectives and target SLO for a specific time window. Add a title, description, and tags to identify your SLO. Choose a burn rate rule and configure it with various options. Save the rule and search for your SLO. The panel provides an overview, alerts, and options to edit availability or create new alert rules.

For example, serverless project ID SLO, or cluster type, etc. You set your objectives for the time window, depending on what you want to do, and target SLO, let's say, for example, 99%. You add the title to your SLO, a short description, what it does. And basically, you can add some tags to better identify your SLO. And if you want to choose a burn rate rule, you click on the tick there, and there you have it. You have your SLO, which prompts you immediately to create a burn rule. You have lots of options to define the hour, the action groups, etc. And you can select an action depending on where you want to be alerted. You save the rule, and then let's start looking for it. You can start typing the name of your SLO. And as you can see, I have a list of my SLO grouped by serverless project ID. And in this panel, in the screen, you can have the overview, the alerts. You could go in actions, edit availability, or create a new alert rule.

Check out more articles and videos

We constantly think of articles and videos that might spark Git people interest / skill us up or help building a stellar career

JSNation Live 2021JSNation Live 2021
19 min
Multithreaded Logging with Pino
Top Content
Almost every developer thinks that adding one more log line would not decrease the performance of their server... until logging becomes the biggest bottleneck for their systems! We created one of the fastest JSON loggers for Node.js: pino. One of our key decisions was to remove all "transport" to another process (or infrastructure): it reduced both CPU and memory consumption, removing any bottleneck from logging. However, this created friction and lowered the developer experience of using Pino and in-process transports is the most asked feature our user.In the upcoming version 7, we will solve this problem and increase throughput at the same time: we are introducing pino.transport() to start a worker thread that you can use to transfer your logs safely to other destinations, without sacrificing neither performance nor the developer experience.
React Summit 2023React Summit 2023
28 min
Advanced GraphQL Architectures: Serverless Event Sourcing and CQRS
GraphQL is a powerful and useful tool, especially popular among frontend developers. It can significantly speed up app development and improve application speed, API discoverability, and documentation. GraphQL is not an excellent fit for simple APIs only - it can power more advanced architectures. The separation between queries and mutations makes GraphQL perfect for event sourcing and Command Query Responsibility Segregation (CQRS). By making your advanced GraphQL app serverless, you get a fully managed, cheap, and extremely powerful architecture.

Workshops on related topic

DevOps.js Conf 2024DevOps.js Conf 2024
163 min
AI on Demand: Serverless AI
Featured WorkshopFree
In this workshop, we discuss the merits of serverless architecture and how it can be applied to the AI space. We'll explore options around building serverless RAG applications for a more lambda-esque approach to AI. Next, we'll get hands on and build a sample CRUD app that allows you to store information and query it using an LLM with Workers AI, Vectorize, D1, and Cloudflare Workers.
Node Congress 2021Node Congress 2021
245 min
Building Serverless Applications on AWS with TypeScript
Workshop
This workshop teaches you the basics of serverless application development with TypeScript. We'll start with a simple Lambda function, set up the project and the infrastructure-as-a-code (AWS CDK), and learn how to organize, test, and debug a more complex serverless application.
Table of contents:        - How to set up a serverless project with TypeScript and CDK        - How to write a testable Lambda function with hexagonal architecture        - How to connect a function to a DynamoDB table        - How to create a serverless API        - How to debug and test a serverless function        - How to organize and grow a serverless application


Materials referred to in the workshop:
https://excalidraw.com/#room=57b84e0df9bdb7ea5675,HYgVepLIpfxrK4EQNclQ9w
DynamoDB blog Alex DeBrie: https://www.dynamodbguide.com/
Excellent book for the DynamoDB: https://www.dynamodbbook.com/
https://slobodan.me/workshops/nodecongress/prerequisites.html
React Summit 2022React Summit 2022
107 min
Serverless for React Developers
WorkshopFree
Intro to serverlessPrior Art: Docker, Containers, and KubernetesActivity: Build a Dockerized application and deploy it to a cloud providerAnalysis: What is good/bad about this approach?Why Serverless is Needed/BetterActivity: Build the same application with serverlessAnalysis: What is good/bad about this approach?
GraphQL Galaxy 2021GraphQL Galaxy 2021
143 min
Building a GraphQL-native serverless backend with Fauna
WorkshopFree
Welcome to Fauna! This workshop helps GraphQL developers build performant applications with Fauna that scale to any size userbase. You start with the basics, using only the GraphQL playground in the Fauna dashboard, then build a complete full-stack application with Next.js, adding functionality as you go along.

In the first section, Getting started with Fauna, you learn how Fauna automatically creates queries, mutations, and other resources based on your GraphQL schema. You learn how to accomplish common tasks with GraphQL, how to use the Fauna Query Language (FQL) to perform more advanced tasks.

In the second section, Building with Fauna, you learn how Fauna automatically creates queries, mutations, and other resources based on your GraphQL schema. You learn how to accomplish common tasks with GraphQL, how to use the Fauna Query Language (FQL) to perform more advanced tasks.
Node Congress 2022Node Congress 2022
83 min
Scaling Databases For Global Serverless Applications
WorkshopFree
This workshop discusses the challenges Enterprises are facing when scaling the data tier to support multi-region deployments and serverless environments. Serverless edge functions and lightweight container orchestration enables applications and business logic to be easily deployed globally, often leaving the database as the latency and scaling bottleneck.
Join us to understand how PolyScale.ai solves these scaling challenges intelligently caching database data at the edge, without sacrificing transactionality or consistency. Get hands on with PolyScale for implementation, query observability and global latency testing with edge functions.
Table of contents        - Introduction to PolyScale.ai        - Enterprise Data Gravity        - Why data scaling is hard        - Options for Scaling the data tier        - Database Observability        - Cache Management AI        - Hands on with PolyScale.ai
TestJS Summit 2021TestJS Summit 2021
146 min
Live e2e test debugging for a distributed serverless application
WorkshopFree
In this workshop, we will be building a testing environment for a pre-built application, then we will write and automate end-to-end tests for our serverless application. And in the final step, we will demonstrate how easy it is to understand the root cause of an erroneous test using distributed testing and how to debug it in our CI/CD pipeline with Thundra Foresight.

Table of contents:
- How to set up and test your cloud infrastructure
- How to write and automate end-to-end tests for your serverless workloads
- How to debug, trace, and troubleshot test failures with Thundra Foresight in your CI/CD pipelines