
Serverless Observability: Where SLOs Meet Transforms

This talk explores the use of SLOs and transforms during a migration to the serverless ecosystem. It starts by presenting the reasons why SLOs are important in the SRE/DevOps framework. It then analyses specific use cases of SLOs and the tools used for measuring their effectiveness, and presents the main bottlenecks encountered when defining and adhering to SLOs while migrating to a serverless ecosystem, especially when dealing with burn rates and transforms.

By the end of the talk, the audience will be able to subscribe to the following takeaways:

SLOs are important in an SRE/DevOps framework and there are numerous advantages to implementing them, as long as they adhere to a process of continuous improvement.

Adopting the right tools and metrics is paramount in the implementation of SLOs.

Migrating to serverless adds pressure to the observability system and it is possible that burn rates and transforms will backfire. Our use case will show how it's possible to mitigate these challenges and learn from similar situations.

This talk was presented at DevOps.js Conf 2024.

FAQ

Service level indicators, or SLIs, are measures of the service level provided, typically defined as a ratio of good events over total events, ranging from 0 to 100%. Common examples include availability, throughput, request latency, and error rates.

A Service Level Objective (SLO) is a target value for a service level measured by an SLI. It defines the performance threshold for service compliance, such as ensuring 95% of requests are served within 100 milliseconds.

Burn rate alerting calculates the rate at which SLOs are failing over multiple time windows. It focuses on sustained deviations, helping to reduce alert fatigue and indicate the severity of service degradation.

Transforms are persistent tasks in Elasticsearch that allow conversion of existing indices into summarized indices. These transforms enable aggregation and pivoting of data, facilitating insights and analytics on entity-centric indices.

The SLO transform architecture relies on transforms to roll up source data into roll-up indices, which are then summarized into an entity-centric index for each SLO. This structure supports features like group by or partition by, enhancing search and sort capabilities in SLO dimensions.

To create an SLO in Elastic's serverless environment, navigate to observability, select SLOs, and choose to create a new SLO. You define the type, index, time field, and queries for good and total events. Objectives, time windows, and burn rate rules are set to monitor and alert based on specific performance thresholds.

A good SLO is specific, measurable, user-centric, quantifiable, and achievable with a defined time frame. In contrast, a bad SLO is vague, lacks quantifiable metrics, has an undefined threshold, and no observation window, making it ineffective for monitoring service reliability.

serverless observability

Virginia Diana Todea

8 min

15 Feb, 2024


Video Summary and Transcription

This Talk provides an introduction to Serverless Observability and SLOs, explaining the concept of SLOs and their dependency on transforms. It highlights the codependency between SLOs, SLAs, and SLIs and discusses the importance of well-defined SLOs. The Talk also demonstrates how to create and monitor SLOs and alert rules, emphasizing the benefits of burn rate alerting in reducing alert fatigue and improving user experience.


1. Introduction to Serverless Observability and SLOs

Short description:

Hi, I'm Diana Todea. I'm here to present Serverless Observability where SLOs meet transforms. We'll discuss the concept, the SLO's dependency on transforms, SLO transform architecture, burn rate alerting, and have a short demo. Service level indicators are a measure of the service level, defined as a ratio of good events over total events. The service level objectives are the target values for a service level, and the error budget is the tolerated quantity of errors.

Hi, DevOps.js. I'm Diana Todea. I'm an SRE at Elastic, and I'm here to present Serverless Observability where SLOs meet transforms. So we're going to talk about the concept, the SLO's dependency on transforms, SLO transform architecture, burn rate alerting, and we're going to have a short demo.

So a bit of context. With Elastic's migration to serverless, we had the need to come up with a new idea around the rollup aggregations. So Elastic has a multi-cluster infrastructure, and we needed to move away from rollup aggregations and search due to some of their limitations. Then we started creating the transforms.

So let's start with some definitions. Service level indicators, as you probably well know, are a measure of the service level provided. They are usually defined as a ratio of good events over total events, and they range between 0 and 100%. Some examples: availability, throughput, request latency, error rates. The service level objectives are a target value for a service level measured by an SLI. Above the threshold, the service is compliant. For example, 95% of the successful requests are served under 100 milliseconds. The error budget is defined as 100% minus the SLO. So it's the quantity of errors that is tolerated, and the burn rate is the rate at which we are burning the error budget over a defined period of time. It's very useful for alerting before the error budget is exhausted.
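The definitions above reduce to a few lines of arithmetic. The following is a minimal sketch, not code shown in the talk: the function names are made up for illustration, treating a period with no traffic as fully compliant is an assumption, and the burn rate is computed as the observed error rate divided by the error budget, which is one common way to express "how fast the budget is burning".

```python
def sli(good_events: int, total_events: int) -> float:
    """Service level indicator: good events over total events (0.0 to 1.0)."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as fully compliant
    return good_events / total_events


def error_budget(slo_target: float) -> float:
    """Error budget: 100% minus the SLO, expressed as a fraction."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return 1.0 - slo_target


def burn_rate(good_events: int, total_events: int, slo_target: float) -> float:
    """How fast the error budget is consumed over the observed period.

    1.0 means errors arrive exactly at the tolerated pace; above 1.0 the
    budget would run out before the SLO time window ends.
    """
    observed_error_rate = 1.0 - sli(good_events, total_events)
    return observed_error_rate / error_budget(slo_target)


if __name__ == "__main__":
    # 10,000 requests, 9,800 of them good, measured against a 99% SLO
    print(f"SLI: {sli(9_800, 10_000):.2%}")
    print(f"Error budget: {error_budget(0.99):.2%}")
    print(f"Burn rate: {burn_rate(9_800, 10_000, 0.99):.1f}x")
```

With a 99% target, 2% of requests failing means the budget is being spent at roughly twice the tolerated pace.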

2. Codependency Between SLOs, SLAs, and SLIs

Short description:

So we have a codependency between SLOs, SLAs, and SLIs. How do we recognize a good SLO versus a bad SLO? A well-defined SLO focuses on a crucial aspect of service quality, provides clarity, measurability, and alignment with user expectations. The SLO architecture relies on transforms to roll up the source data and summarize it into entity-centric indices. Transforms enable you to convert existing indices, providing new insights and analytics. Burn rate alerting calculates the rate at which SLOs are failing over time, helping prioritize issues. It offers reduced alert fatigue, improved user experience, and good precision. Let's move on to the demo where you can create and monitor SLOs.

So we have a codependency between SLOs, SLAs, and SLIs. So how do we recognize a good SLO versus a bad SLO? A bad SLO is vague and subjective, it lacks quantifiable metrics, it has an undefined threshold, and no observation window. A good SLO is specific and measurable, user-centric, quantifiable, and achievable, and it has a defined time frame. So a well-defined SLO focuses on a crucial aspect of service quality, provides clarity, measurability, and alignment with user expectations, which are essential elements for effective monitoring and evaluation of service reliability.

The SLO architecture: basically, the SLOs rely on transforms to roll up the source data into roll-up indices. To support the group by or partition by feature, Elastic has added a second layer which summarizes the roll-up data into an entity-centric index for each SLO. This index also powers the search experience, allowing users to search and sort by any SLO dimension. So what are transforms? Transforms are persistent tasks that enable you to convert existing Elasticsearch indices into summarized indices, which provide opportunities for new insights and analytics. For example, you can use transforms to pivot your data into entity-centric indices that summarize the behavior of users or sessions or other entities in your data. Or you can use transforms to find the latest document among all the documents that have a certain unique key.
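To make the two transform styles concrete, here is a plain-Python sketch of what a pivot and a "latest document" summary compute. It does not call Elasticsearch and is not how transforms are implemented. The field names (`project_id`, `status`, `ts`) and the rule that a non-5xx status counts as a good event are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Iterable


def pivot_by_entity(docs: Iterable[dict], group_field: str) -> dict:
    """Pivot: summarize raw event documents into one row per entity."""
    summary = defaultdict(lambda: {"good": 0, "total": 0})
    for doc in docs:
        row = summary[doc[group_field]]
        row["total"] += 1
        if doc["status"] < 500:  # assumption: non-5xx responses are good events
            row["good"] += 1
    return dict(summary)


def latest_by_key(docs: Iterable[dict], key_field: str, time_field: str) -> dict:
    """Latest: keep only the most recent document for each unique key."""
    latest = {}
    for doc in docs:
        key = doc[key_field]
        if key not in latest or doc[time_field] > latest[key][time_field]:
            latest[key] = doc
    return latest


if __name__ == "__main__":
    source_docs = [
        {"project_id": "a", "status": 200, "ts": 1},
        {"project_id": "a", "status": 503, "ts": 2},
        {"project_id": "b", "status": 200, "ts": 3},
    ]
    for entity, row in pivot_by_entity(source_docs, "project_id").items():
        print(entity, row)
    for entity, doc in latest_by_key(source_docs, "project_id", "ts").items():
        print(entity, doc)
```

The per-entity good and total counts are the kind of summarized rows an SLO can then be evaluated against for each group.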

The burn rate alerting calculates the rate at which SLOs are failing over multiple windows of time, is less sensitive to short-term fluctuations by focusing on sustained deviations, and it can give you an indication of how severely the service is degrading and helps prioritize multiple issues at the same time. Here we have a graph of burn rate alerting with multiple windows. So we have two windows for each severity, a short and a long one. The short window is 1/12 of the long window, and the alert is triggered when the burn rate for both windows exceeds the threshold. The pros of burn rate alerting are reduced alert fatigue, improved user experience, a flexible alerting framework, and good precision. The con at the moment is that you have lots of options to configure, but this will be improved with future versions of Elasticsearch.
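The two-window rule described above can be sketched as follows. This is an illustration only: the class and function names are hypothetical, and the severity names and threshold values in the example are assumed for demonstration rather than taken from the talk. Only the 1/12 ratio and the "both windows must exceed the threshold" condition come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateWindow:
    severity: str
    long_window_hours: float
    threshold: float  # burn rate that must be exceeded in both windows

    @property
    def short_window_hours(self) -> float:
        # The short window is 1/12 of the long window
        return self.long_window_hours / 12


def should_alert(window: BurnRateWindow, long_burn_rate: float,
                 short_burn_rate: float) -> bool:
    """Fire only when both the long and the short window exceed the threshold."""
    return (long_burn_rate > window.threshold
            and short_burn_rate > window.threshold)


if __name__ == "__main__":
    # Illustrative values, not defaults quoted in the talk
    critical = BurnRateWindow("critical", long_window_hours=1, threshold=14.4)
    minutes = critical.short_window_hours * 60
    print(f"{critical.severity}: long 1h, short {minutes:.0f}m")
    # Sustained problem: both windows are burning fast
    print(should_alert(critical, long_burn_rate=15.0, short_burn_rate=16.0))
    # Already recovering: the short window has dropped below the threshold
    print(should_alert(critical, long_burn_rate=15.0, short_burn_rate=2.0))
```

Requiring the short window as well means the alert stops firing soon after the service recovers, while the long window filters out brief spikes.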

So it's demo time. So here is a demo that I put together for you around the transforms. You can see you can create the transforms there. You can check the data behind it. You have stats, JSON, messages, and a preview. And you can check the health of each transform. It could be degraded, healthy, or even failed. If you have some issues, you can troubleshoot it right from this screen. So let's try to create some SLOs. You go to Observability, SLOs, and create a new SLO. You choose the type of SLI that you want and the index. In my case, I will use a serverless index and a timestamp field. You add the query filter that you're interested in, the good query that you want for your SLO, and the total query. Afterwards, you have an interesting selection here to partition by.

3. Creating SLOs and Alert Rules

Short description:

You set your objectives and target SLO for a specific time window. Add a title, description, and tags to identify your SLO. Choose a burn rate rule and configure it with various options. Save the rule and search for your SLO. The panel provides an overview, alerts, and options to edit availability or create new alert rules.

For example, serverless project ID, or cluster type, etc. You set your objectives for the time window, depending on what you want to do, and the target SLO, let's say, for example, 99%. You add the title to your SLO and a short description of what it does. And basically, you can add some tags to better identify your SLO. And if you want to choose a burn rate rule, you click on the tick there, and there you have it. You have your SLO, which prompts you immediately to create a burn rate rule. You have lots of options to define the hour, the action groups, etc. And you can select an action depending on where you want to be alerted. You save the rule, and then let's start looking for it. You can start typing the name of your SLO. And as you can see, I have a list of my SLOs grouped by serverless project ID. And on this screen, you have the overview and the alerts. You could go to actions, edit availability, or create a new alert rule.
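The fields filled in during the demo can be collected into a small data structure. This is a hypothetical sketch and not the Kibana SLO API: the class, the field names, the index pattern, and the query strings are all invented for illustration, and the validation only checks the properties the talk associates with a well-defined SLO (a quantified target, a defined time window, and both queries present).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SloDefinition:
    # Hypothetical field names mirroring the form shown in the demo
    name: str
    description: str
    index: str
    timestamp_field: str
    good_query: str
    total_query: str
    target: float              # e.g. 0.99 for a 99% objective
    time_window_days: int
    group_by: Optional[str] = None
    tags: List[str] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the SLO is well defined."""
        problems = []
        if not 0 < self.target < 1:
            problems.append("target must be a fraction between 0 and 1")
        if self.time_window_days <= 0:
            problems.append("time window must be a positive number of days")
        if not self.good_query.strip() or not self.total_query.strip():
            problems.append("both the good and the total query are required")
        return problems


if __name__ == "__main__":
    slo = SloDefinition(
        name="API availability",
        description="Share of requests that do not fail with a server error",
        index="serverless-logs-*",            # assumed index pattern
        timestamp_field="@timestamp",
        good_query="status < 500",            # assumed query syntax
        total_query="status: *",              # assumed query syntax
        target=0.99,
        time_window_days=30,
        group_by="serverless.project.id",     # assumed field name
        tags=["serverless", "availability"],
    )
    print(slo.validate() or "SLO definition looks complete")
```

A definition that fails validation here corresponds to the "bad SLO" described earlier: an undefined threshold or no observation window.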
