Durable Objects, part of the Cloudflare Workers platform, are the solution for strongly consistent storage and coordination on the edge. In this talk, you will learn about Durable Objects, why they are suited for collaborative use cases, and how to build multiplayer applications with them!
Building Multiplayer Applications with Cloudflare Workers & Durable Objects
AI Generated Video Summary
Durable Objects are a part of Cloudflare's long-term goal to expand application possibilities on Workers, allowing for the building of scalable collaborative applications. Durable Objects provide a way to store global state and coordinate multi-client applications. They can be created as close to the user as possible and have unique IDs for routing requests. Durable Objects have a persistent storage API with strongly consistent semantics and I/O gates to prevent correctness errors. They are well suited for collaborative applications and can be used with WebSockets. Performance impact and read replicas are considerations for accessing Durable Objects globally.
1. Introduction to Durable Objects
I'm Matt Alonzo, a software engineer at Cloudflare. I work on the Workers distributed data team and have extensive experience with Durable Objects. Durable Objects are a part of Cloudflare's long-term goal to expand application possibilities on Workers. They are not just a storage product but also ideal for building scalable collaborative applications like document editors, game servers, and chat rooms.
I'm Matt Alonzo and I'm a software engineer at Cloudflare. I work on the Workers distributed data team, which maintains Durable Objects. I've worked at Cloudflare for almost three years, and I've spent almost the entire time working on Durable Objects, so I'm very familiar with them. And Durable Objects are kind of part of a long-term goal for Cloudflare where we're trying to expand the types of applications that customers can build on top of Workers. Durable Objects have been thought of as kind of a storage product that adds onto this: you can store state on the edge with Durable Objects, but there's a lot more to them than just that. Durable Objects are also really well suited to building infinitely scalable collaborative applications like document editors, game servers, and chat rooms, and this talk is all about analyzing why.
2. Overview of Workers and Durable Objects
If folks have any questions after the talk, feel free to email me or contact me on my social media. And so, I'm going to go through a few subjects today. I'm going to talk about Workers, kind of give a quick overview of them. I'm going to talk about Durable Objects, what they are, what the API looks like, and then I'm going to talk about coordination. Why are Durable Objects useful for coordination? Why would you want to use them? And then I have a bit of a case study, an application I built on top of Durable Objects, and then if all goes well, I'll do a demo of it.
3. Introduction to Durable Objects Continued
Before Durable Objects, Workers were stateless and lacked a way to store global state. Durable Objects were introduced to address the need for coordination in multi-client applications. Durable Objects apply the serverless philosophy to state, splitting state into fine-grained pieces the same way serverless compute splits up compute. Each instance of a Durable Object class has a unique ID, ensuring that requests from different locations call the same fetch method on the same object instance.
And so I want to go back to this example of what a normal worker looks like, so we can kind of see the hole that Durable Objects fill. Before Durable Objects were a thing, Workers was entirely stateless. If you look at this code snippet, there's nowhere to store global state. You can throw things in global scope, but 99% of the time that's not what you want, because Workers run in so many locations around the world, and each of these locations has hundreds of servers.
When your worker is executing, it's receiving traffic from many places, and you have tons and tons of instances of your worker. They're all gonna have completely separate global scopes, so storing data in global scope is not a solution for performing coordination. But all of these different instances are really good for scaling, and in the aggregate they can handle a huge amount of throughput. And so Durable Objects were a solution for running workloads that need to perform coordination.
And what do I mean when I say coordination? Coordination is what makes multiplayer applications useful. And multiplayer applications are apps that have multiple clients connecting to the app to perform a task together. Think people collaborating on a document, playing a game together, or chatting in a chat room. What all of these apps have in common is that they're all about state. All of the clients are sending state updates to the app, and the app is responsible for making sure those state updates are reflected to other clients, and this state is long-lived and needs to be processed across multiple requests.
And so, what are Durable Objects? The Nokia 3310 on the slide is from before my time, but it definitely seems pretty durable. But really, what they are is the serverless philosophy applied to state. Serverless compute is all about splitting compute into fine-grained pieces that are easy to spin up as necessary, like on each incoming request. Durable Objects take this concept and apply it to state and coordination. With normal Workers, you write your fetch handler, the runtime invokes it per request, and it implements some stateless logic.
Here's an example of what this API looks like for Durable Objects.
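The slide code isn't captured in the transcript, but the shape of the API is roughly this. This is a hedged sketch with a made-up `GameRoom` class name, not the actual slide code; in a real Worker the class is exported from the module and declared in the project's wrangler.toml.

```javascript
// Minimal sketch of a Durable Object class (hypothetical `GameRoom` name).
// The runtime constructs one instance per ID and invokes fetch() for every
// request routed to that instance.
class GameRoom {
  constructor(state, env) {
    this.state = state;    // per-instance state, including the storage API
    this.env = env;        // bindings configured for the worker
    this.requestCount = 0; // in-memory state, kept while the object is awake
  }

  async fetch(request) {
    // Unlike a stateless worker, every request for the same ID reaches this
    // same instance, so in-memory state is shared across requests.
    this.requestCount += 1;
    return new Response(`request #${this.requestCount}`);
  }
}
```

The key difference from a plain worker's fetch handler is that the handler here is a method on a long-lived instance, so `this.requestCount` survives between requests as long as the object stays awake.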
4. Worker Binding and Durable Object Placement
For a worker to send a request to a Durable Object, it needs a binding to it. The runtime routes the request to the Durable Object based on its location. Durable Objects can be created as close to the user as possible. There are two types of IDs for Durable Objects: unique IDs and named IDs. Unique IDs are simple to handle, while named IDs require a more complex process. The process involves hashing the named ID to determine the location of the Durable Object and routing the request accordingly. The coordination point stores the location, allowing subsequent requests to proceed directly to the object.
For a worker to send a request to a given Durable Object, it has to have a binding to it. This binding shows up on the env object that's passed to every incoming request to a worker. This env object is used to create an ID. You can create a stub object using that ID, and that's how you send requests to a Durable Object.
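The binding flow just described might look roughly like this. It's a sketch, not the talk's actual slide code, and the `GAME_ROOM` binding name is made up; in a real project the binding is configured in wrangler.toml.

```javascript
// Sketch of a worker routing a request to a Durable Object via a binding.
// In a real Worker this object is the module's default export.
const worker = {
  async fetch(request, env) {
    // Named ID: everyone who uses the same name reaches the same object.
    const id = env.GAME_ROOM.idFromName("barbecue-in-texas");
    // Alternatively, a randomly generated unique ID:
    //   const id = env.GAME_ROOM.newUniqueId();
    const stub = env.GAME_ROOM.get(id); // stub that routes to the instance
    return stub.fetch(request);         // runtime finds the object's location
  },
};
```

The stub's fetch call looks like an ordinary HTTP request, but the runtime handles locating the object and routing to it, wherever in the world it lives.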
When you do that, the runtime will route the request to the Durable Object, depending on where it's located in the world, and handle finding that location and all the routing. So one important part of Durable Objects is that when you create one for the first time, they're actually created as close to you as possible. There's a subset of Cloudflare data centers that support running Durable Objects, and when we have a request for a new object, we need to figure out which of these data centers we're going to run that Durable Object in. And the answer to that is: it depends on how we do it.
And the reason for this is that there are two types of IDs for Durable Objects. You can either create a unique ID, which is some randomly generated string, or you can create an ID based off of a string, like in this example here. With unique IDs, figuring out object placement is really simple: you can't get two identical randomly generated numbers if they're of sufficient length, so you can just add some extra metadata to the ID and have this metadata encode where the object was originally located. But for named IDs, you can't do this, because you want multiple people to be able to use the same ID and get the same object. And so we do something a little more complex.
This is a picture of what the process looks like if multiple workers tried to create the same Durable Object with the same ID at the same time. I use airport codes to name all these locations because I'm an infrastructure nerd, but they're running in Virginia, New Jersey, Amsterdam, and France. And all these workers try to create a Durable Object with the ID Bob. And Bob, this named ID, actually hashes to a location, and this particular location is used as a coordination point to determine where we're going to put that Durable Object. In this case, Bob hashes to DFW, Dallas-Fort Worth. And so all of the runtimes involved here know that Bob hashes to Dallas-Fort Worth, and they need to take this Durable Object request and send it to Dallas-Fort Worth. The first request to land there is going to be used to make the placement decision. In this case, it's probably the one from Virginia, because Virginia's closest to Dallas. And so the coordination point stores that Bob is going to be in Virginia and puts it in the IAD data center in Virginia. This coordination point actually has to be checked for further requests to Bob in some cases, but the vast majority of the time we will cache the location of an object. So if you're using an object more than once, the cache is going to be hot, you're not going to have to check the extra coordination point, and your request can proceed directly to where the Durable Object is located. And so we now have this concept of an object instance that's globally addressable and unique, and we can send requests to it.
5. Using Durable Objects and the Storage API
And this is a really, really useful abstraction. When you build an app on Durable Objects, you take your app state and find the smallest logical unit, then implement a Durable Object class that defines the common behavior. Each object instance has access to a persistent storage API specific to it. The storage API has strongly consistent semantics and is collocated with each object instance. The storage backend has an in-process caching layer, making reads very fast. The storage API in Durable Objects has a feature called I/O gates that prevents correctness errors. There are input gates, which prevent new events from being delivered to the object while it's waiting on storage, and output gates, which hold outgoing messages until pending storage writes complete.
So this object instance that we create per ID is long-lived, to an extent. If your object instance is receiving requests on a regular cadence or has an active WebSocket connection, it will stay alive. Once it stops receiving requests and no WebSocket connections are present, it will go to sleep. And so we also need a solution for storing data once the object goes to sleep. There is a persistent storage API as part of Durable Objects, and each object instance has access to this API. The storage within it is specific to each object instance. The storage API looks like a key-value API. It has strongly consistent semantics, and storage is collocated with each object instance. And there's an in-process caching layer in front of our storage backend, so reads to storage are usually cached, and they're very, very fast.
And so this snippet is an example of using that API to implement an atomic counter. And folks that have been reading the snippet probably have alarm bells going off in their heads. This is racy. I'm not awaiting my puts. This looks really, really wrong. And in Durable Objects, this code is actually completely correct. The storage API in Durable Objects has a feature called I/O gates. What I/O gates do is prevent you from making correctness errors that are really easy to make in concurrent applications. It's I/O, so there are two types of gates: input gates and output gates. The input gate shuts and prevents delivery of new incoming events to the object whenever the object is waiting on storage.
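The slide snippet isn't in the transcript, but a counter of the kind being described might look like the sketch below, with hypothetical names. The un-awaited put is the point: under Durable Objects' I/O gates, no new event is delivered while the get is outstanding, and no response escapes before the write commits.

```javascript
// Sketch of an atomic counter on the Durable Object storage API
// (hypothetical class and key names, not the actual slide code).
class Counter {
  constructor(state) {
    this.storage = state.storage; // per-instance persistent storage API
  }

  async fetch(request) {
    // Input gate: no other event can run between this get and the put.
    let value = (await this.storage.get("count")) ?? 0;
    value += 1;
    // Intentionally not awaited. The output gate holds the Response below
    // until this write has committed, so callers never see uncommitted state.
    this.storage.put("count", value);
    return new Response(String(value));
  }
}
```

Outside Durable Objects this pattern would be a textbook read-modify-write race; the gates are what make it safe here.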
6. Input and Output Gates in Durable Objects
In this example, input and output gates prevent concurrency issues and ensure consistent data storage. The Storage API is not the focus of this talk, but there is a blog post available for further information.
So in this example here, when we do this get, it's impossible for a new event to be delivered to the object and for another event to go through and do that get at the same time. That prevents two gets from running concurrently, reading the same value, and then writing the same incremented value to storage, which would cause this atomic counter to only increment once for two requests instead of twice. So input gates prevent that problem. Output gates do something different: they prevent any outgoing network messages from the Durable Object while the Durable Object is waiting for its storage writes to complete. So they prevent any clients of your Durable Object from making some sort of decision based on uncommitted information that hasn't been persisted to storage. I'm not gonna go too much deeper into the storage API since we have a really great blog post on it and it's not the focus of the talk. But folks can look that up and ask me for the name of the blog post after the talk if you're curious.
7. Durable Objects for Collaborative Applications
Durable Objects are well suited for collaborative applications, providing a single point of coordination between multiple clients. This eliminates the need for complex multi-region deployments and routing. Let's take the example of a collaborative document editor where multiple clients are sending edits. Durable Objects handle the coordination for such applications, allowing for a simple and easy-to-reason-about architecture. Another example is a multiplayer version of Conway's Game of Life implemented using Durable Objects.
So I've been talking a bit now about Durable Objects, what the API surface is and the features, but I haven't gotten into yet why they're so well suited for collaborative applications. This abstraction of an object instance that you can create on demand is really useful for creating a single point of coordination between multiple clients, and for creating that point of coordination close to those clients. And so what does this look like in practice? Imagine you're building a collaborative document editor and you have multiple clients, and they're all sending edits, keystrokes, to the app, and the app needs to decide which of these keystrokes are gonna go where. Especially if you're editing the same sentence, you need to arbitrate between the two clients which letters are gonna go where.
And the simple approach here is to have a single point of coordination to determine this. You can do it in other ways, there's CRDTs and stuff like that, but it doesn't work for every data model, and so I'm gonna focus on a single point of coordination in this talk. And so for most developers, an architecture like this, where you have a single point of coordination to implement your app logic, is really hard to do. You need to do a multi-region deployment, you need to figure out routing, and if you're storing data, you have to handle replication and failover logic, and this is something that's very difficult to do across many locations around the globe. And Durable Objects solve this problem for you entirely. You get to have your simple, easy-to-reason-about single-point-of-coordination cake and you get to eat it too, without worrying about operations and routing.
So let's go through another example. Let's say I'm writing a document about my favorite barbecue spots in Texas. This is a pretty contentious subject, so I imagine someone's gonna wanna argue with me about it. So I'm writing my list. Let's say I'm home in Austin and I'm connecting to a worker in the DFW data center and I'm sending all my edits, and that worker is gonna get a Durable Object with an ID based off of my document name, that's Barbecue in Texas, and it's gonna send all of my edits to that object. And I tell my friend back where I'm originally from in Miami; he visited recently and he has some disagreements with me about which barbecue restaurants he likes. And so he starts making his own edits, and he connects to the same document. And since we're both using the same ID, we're both gonna connect to the same Durable Object, and this Durable Object will handle all the coordination for that document.

I have a more concrete example here. For this talk, I built a kind of demo app based on Durable Objects that implements Conway's Game of Life, a multiplayer version where multiple people can connect and add things to the simulation. If you're not familiar with Conway's Game of Life, it's a cellular automaton, a type of grid simulation. It has defined rules for when cells in the grid should be considered live (colored black) or dead (colored white). For Conway's Game of Life, the rules are as follows: any live cell with two or three live neighbors lives on to the next tick, any dead cell with exactly three live neighbors becomes live, all other live cells die, and all other dead cells stay dead. These rules are applied to every cell on the grid on every single tick, and each tick happens on a regular cadence, let's say 500 milliseconds.

And so the way I implemented this was with two Durable Object classes. There's a Lobby Durable Object class that stores a list of Game Room Durable Object IDs.
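The Game of Life rules just described can be sketched as a pure tick function. This is illustrative only, not the demo's actual code; the cell-key encoding is my own choice.

```javascript
// One Game of Life tick over a set of live cells, encoded as "x,y" strings.
// Rules: a live cell with 2 or 3 live neighbors survives; a dead cell with
// exactly 3 live neighbors becomes live; every other cell is dead next tick.
function tick(live, width, height) {
  const key = (x, y) => `${x},${y}`;
  const next = new Set();
  for (let x = 0; x < width; x++) {
    for (let y = 0; y < height; y++) {
      let neighbors = 0;
      for (let dx = -1; dx <= 1; dx++) {
        for (let dy = -1; dy <= 1; dy++) {
          if ((dx || dy) && live.has(key(x + dx, y + dy))) neighbors++;
        }
      }
      const alive = live.has(key(x, y));
      if (neighbors === 3 || (alive && neighbors === 2)) next.add(key(x, y));
    }
  }
  return next;
}
```

In the architecture described here, a Game Room object would hold the live set in memory and call something like this every 500 milliseconds.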
And so when a client connects to the app, they see the Lobby page, and there's a list of game rooms. So they pick a game room, and that game room has a Durable Object ID for the specific Durable Object for that game room. The client uses that to connect to the game, and then it starts sending updates of which cells it's placed to the game.
8. WebSockets, Durable Objects, and Pricing Model
And this all happens over WebSockets. Workers are able to forward a WebSocket connection that's been sent to the worker to the Durable Object it connects to, and all these grid tiles get passed over the WebSocket. They reach the game room, and the game room is actually running the Conway's Game of Life simulation. It processes all the rules on a regular cadence, and then every time a new generation has taken place, it sends the live tiles back to all the connected clients.
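The broadcast step described above can be sketched as follows. The `sessions` array and message shape are hypothetical names I've made up, not the demo's actual code; a real game room object would keep such an array of the WebSockets it has accepted.

```javascript
// Sketch of a game room's broadcast step: after each tick, send the new
// set of live cells to every connected WebSocket session.
function broadcast(sessions, liveCells) {
  const message = JSON.stringify({ type: "generation", cells: [...liveCells] });
  for (const ws of sessions) {
    try {
      ws.send(message);
    } catch (e) {
      // A send to a connection that has gone away throws; real code would
      // also remove that session from the list here.
    }
  }
  return message;
}
```

Because all sessions live in one object instance, this loop is the entire fan-out: there is no cross-server pub/sub to configure.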
And one can imagine this happening (wow, this map is hard to see), one can imagine this happening all over the world. Durable Objects can run in a pretty large subset of the data centers that Cloudflare operates. Right now, this is mostly on the east and west coasts of the United States, in eastern and western Europe, and in Asia and Southeast Asia, and there's no limit to the number of Durable Objects you can create. So with this simple app that has a few lines of code, you can have thousands and thousands and thousands of people playing Conway's Game of Life with each other, all in separate rooms.
And so, I actually have a demo of this. I'm gonna open this up. I think the Wi-Fi here has been good enough to allow people to connect and give it a try. So, here, let me move this window over. All right, so if you scan the QR codes on the screen, you should end up at the same page that I'm on. I'm gonna make a Node Congress room, and I'm gonna create a game, and we'll go ahead and join. And so, if you place a single tile on the grid here, it's not really gonna do anything; you need to place a few. The shape here is called a glider, and it should move across the grid. And so, if I click send cells here at the bottom, it's gonna send it across, and if other folks connect and place some tiles and move them around, they should show up on the screen, and you should see the same simulation on whatever device you're using as what you're seeing up here. And all this simulation is taking place in a single Durable Object that's been created near us. Off the top of my head, I think the nearest Durable Object data center here is in Amsterdam, and so this is running somewhere in Amsterdam on a machine in a concrete building. And, yeah, that's pretty much my talk. I'll have this run for a little bit so folks can play with it. I hope folks like this. This is my first talk ever, so I'm pretty excited about it. Yep, that's it for me.
9. Performance Impact and Read Replicas
Durable Objects run in a single data center, so accessing them from the other side of the world can result in high latency. Creating new Durable Objects for each point of coordination instead of reusing them helps keep them close to their clients. However, if you're using Durable Objects for long-lived data, you'll have to deal with the latency of requests going across the world. Read replicas are being considered to provide a closer view of the stored data.
Cool. And I think you've kind of covered this a few times, but I'll ask the question: since Durable Objects run in a single data center, what are the impacts on performance when invoked from the other side of the world? So, unfortunately, we still have to deal with the speed of light, so if you're accessing a Durable Object from all the way on the other side of the world, you're gonna have high latency. The way around this is trying to create Durable Objects whenever you need to have a point of coordination instead of reusing them; that way, if all of your clients for a particular thing you're trying to coordinate are in the same place, your objects should be located close to them. But if you can't get around this because you're using them as a single source of truth for long-lived data, unfortunately, there's nothing you can do: your request is gonna have to go across the world. We're looking into adding read replicas, so you'll at least be able to get a stale view of whatever data is stored in your Durable Objects, and that could be located closer to you, but it's not something the platform supports today.
Durable Objects Output Gate and Deno KV
The output gate of Durable Objects has a timeout and can fail if the storage write takes too long. User code cannot manually close the output gate. Durable Objects and Deno KV differ in how they couple serverless functions with storage and coordination. There is no limit to the storage size of an individual Durable Object. Performance characteristics cannot be guaranteed due to regular updates and security priorities. WebSocket connections in Durable Objects are treated the same as HTTP requests, using an internal RPC format.
What if a Durable Object's output gate is forever busy? Will it then delay or fail responding to the inbound request? I think this question is referring to the output gate feature. The output gate has a timeout, so if the storage write is taking over 10 seconds, that's one of the many ways the output gate can fail. If an underlying storage request fails, or that timeout hits, the output gate will break. This just restarts the object, and any connected clients should then reconnect to it, so it can start fresh. There's not really a way for user code to manually close the output gate and prevent any outgoing messages from going out. You can control the input gate manually; I don't have an example of that here, but it's not something that most applications use.
Interesting, cool. This question came up a bunch, as I'm sure you were expecting, in many forms, but could you briefly explain the difference between Durable Objects and Deno KV? So Deno KV, from what I understand, is a data store you access from a stateless serverless function, and Deno KV is replicated all around the world. So it's very different from Durable Objects, where you have a specific serverless function that's tied to some specific storage, and that serverless function runs right next to the storage. I think Deno KV is a fairly similar product to the durable storage side of Durable Objects, but for doing coordination on the edge, as far as I can tell, I don't think there's a way to direct traffic to a specific instance in order to do in-memory coordination. Yeah, that one came up in a few different ways.
What is the max size for a Durable Object? There's not a limit to the amount of storage that an individual Durable Object can store. We have a limit for each class of how much data a whole class can store; I don't remember it off the top of my head, it's been a while since I've looked at usage limits. There are limits on individual value sizes: I think values are a few megs, and keys are in the two-to-four-kilobyte range. Cool. We are running short on time. Let's see which other questions we can get through. Do you have guaranteed performance characteristics, since billing depends on clock time? We do not, unfortunately. This is a really hard problem to solve when you're doing wall-clock billing, because we update the Workers runtime on a really regular cadence. We're always keeping up to date with V8 updates; we're following the Chromium beta release channel. And so, let's say a security issue pops up in V8, so we have to update to it, and then there's a performance degradation in that new release. There's nothing we can really do there. We need to prioritize security, and so if we had any sort of guarantees around billing, we couldn't honor those in the presence of having to do a forced update like that, because we're not 100% in control of the execution time of your object.
Okay. Would WebSocket connections nullify or greatly reduce the advantages of edge, in terms of speed? I'm not sure I understand this question entirely. Internal to Durable Objects, WebSocket messages and normal HTTP requests are treated the same way. We actually don't use HTTP as the transport under the hood; everything gets turned into an internal RPC format and gets proxied over that.
WebSockets, HTTP, Demo Code, and Conclusion
There's no difference between WebSockets and HTTP in terms of performance for durable objects. We're working on addressing the billing difference. The code for the demo will be posted on Twitter after the conference. Thank you for a great talk.
So there's not really a difference between WebSockets and HTTP in terms of performance characteristics for Durable Objects. There is a billing difference that we're currently working on addressing, but that should be eliminated pretty soon.
Thank you ever so much for a great talk.