Durable Objects, part of the Cloudflare Workers platform, are the solution for strongly consistent storage and coordination on the edge. In this talk, you will learn about Durable Objects, why they are suited for collaborative use cases, and how to build multiplayer applications with them!
Building Multiplayer Applications with Cloudflare Workers & Durable Objects
From:

Node Congress 2023
Transcription
So I'm Matt Alonzo, and I'm a software engineer at Cloudflare. I work on the Workers distributed data team, which maintains Durable Objects. I've worked at Cloudflare for almost three years, and I've spent almost that entire time working on Durable Objects, so I'm pretty familiar with them. Durable Objects are part of a long-term goal at Cloudflare to expand the types of applications customers can build on top of Workers. They've been thought of as kind of a storage product that adds onto Workers, and you can store state on the edge with Durable Objects, but there's a lot more to them than just that. Durable Objects are also really well suited to building infinitely scalable collaborative and multiplayer applications like document editors, game servers, and chat rooms, and this talk is all about analyzing why. If folks have any questions after the talk, feel free to email me or contact me on my social media.

I'm going to go through a few subjects today. I'll give a quick overview of Workers. I'll talk about Durable Objects, what they are, and what the API looks like. Then I'll talk about coordination: why are Durable Objects useful for coordination, and why would you want to use them? Then I have a bit of a case study, an application I built on top of Durable Objects, and if all goes well, I'll do a demo of it.

Workers is Cloudflare's serverless JavaScript platform. The idea is you write a fetch handler like the one above: it takes in a request object, does some work, and returns a response. You take your worker code, you upload it to Cloudflare, and we handle the rest. This is pretty similar to a lot of other serverless platforms, but there are a few big differences with Workers. Workers uses its own runtime, and Workers don't run in regions. There's really only one region with Workers: Earth. You deploy once, and your code runs in any of Cloudflare's network locations worldwide.

When you use Workers, your domain is also using Cloudflare's CDN product, so DNS queries to your domain return a Cloudflare anycast IP. When you send traffic to one of Cloudflare's anycast IPs, it goes to the nearest Cloudflare data center, no matter where you are in the world. This means that when you make a request to a Cloudflare Worker, it runs near you, in the nearest Cloudflare data center, regardless of where you're making the request from. The map here is the latest one I could find of all of Cloudflare's locations around the world. As you can see, it's pretty hard to be far from a location where Workers can run, unless you're in the middle of the jungle or something.

A core part of being able to run code like this on Cloudflare's edge is that Workers is implemented very differently from other serverless platforms. We have our own JavaScript runtime built on top of V8, the same JavaScript engine used by Node.js and Chromium. Instead of implementing multi-tenancy by starting up multiple runtime instances, like other serverless platforms do, Workers implements multi-tenancy with V8 isolates. This is a feature of V8 that allows you to run separate pieces of code in the same process. So we'll run multiple customers' code in the same OS process; they all run in different isolates, and they're completely isolated from each other. They can't interfere with each other, change global state, or anything like that. Isolates are much cheaper to start up than an entire OS process, so when a request for a cold-started worker hits the Workers runtime, we spin up a new isolate, and this takes only a few milliseconds, compared to hundreds of milliseconds to cold-start a whole runtime. This is really important to Workers: it wouldn't be possible for us to run a whole runtime per customer on the edge. This idea is how Workers is actually able to exist today.
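The slide with the worker code isn't captured in the transcript, but a minimal fetch handler in the Workers modules syntax looks roughly like this; the route and the response text are placeholders of mine, not from the talk:

```js
// A minimal Cloudflare Worker (modules syntax): the runtime calls fetch()
// for every incoming request, and the handler returns a Response.
export default {
  async fetch(request, env, ctx) {
    const url = new URL(request.url);
    if (url.pathname === "/hello") {
      return new Response("Hello from the nearest Cloudflare data center!");
    }
    return new Response("Not found", { status: 404 });
  },
};
```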
So I want to go back to this example of what a normal worker looks like, so we can see the hole that Durable Objects fill. Before Durable Objects were a thing, Workers was entirely stateless. If you look at this code snippet, there's nowhere to store global state. You can throw things in global scope, but 99% of the time that's not what you want. Because Workers run in so many locations around the world, and each of these locations has hundreds of servers, if your worker is receiving traffic from many places, you have tons and tons of instances of your worker, and they all have completely separate global scopes. So storing data in global scope is not a solution for performing coordination. But all of these different instances are a really good thing for scaling: in the aggregate, they can handle a huge amount of throughput.

Durable Objects were the solution for being able to run workloads that need to perform coordination. And what do I mean when I say coordination? Coordination is what makes multiplayer applications useful. Multiplayer applications are apps where multiple clients connect to the app to perform a task together: think people collaborating on a document, playing a game together, or chatting in a chat room. What all of these apps have in common is that they're all about state. All the clients are sending state updates to the app, and the app is responsible for making sure those state updates are reflected by the other clients. The state is long-lived and needs to be processed across multiple requests.

So what are Durable Objects? I would say the Nokia 3310 on the slide is before my time, but it definitely seems pretty durable. Really, what they are is the serverless philosophy applied to state. Serverless compute is all about splitting compute into fine-grained pieces that are easy to spin up as necessary, like on each incoming request. Durable Objects take this concept and apply it to state and coordination. With normal Workers, you write your fetch handler, the runtime invokes it per request, and it implements some stateless logic. With Durable Objects, you write a JavaScript class with a fetch handler method, and this fetch handler is invoked on instances of that class, instead of on a handler object like with a normal worker. Durable Objects still speak HTTP like normal Workers do, and the fetch handler is still invoked on each incoming request, but they're not routable from the public internet. Instead, each instance of a Durable Object class has an ID associated with it, which allows workers that know that instance's ID to send requests to it. The platform guarantees that this ID identifies a single instance in the entire world, and that this instance is unique: you're not going to get different instances with the same ID. So if multiple workers send a request to the same Durable Object ID from different locations in the world, all of those requests end up calling the same fetch method on the same object instance.
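As a rough sketch of that shape, not the talk's actual slide code, a Durable Object class might look like this; the class name and the in-memory field are illustrative:

```js
// A Durable Object class: the runtime constructs one instance per ID and
// invokes fetch() on that instance for every request routed to it.
export class GameRoom {
  constructor(state, env) {
    this.state = state; // per-instance state, including the storage API
    this.env = env;
    this.requestCount = 0; // transient state lives on the instance itself
  }

  async fetch(request) {
    // Every request for this instance's ID, from anywhere in the world,
    // lands here on this same object.
    this.requestCount++;
    return new Response(`This instance has handled ${this.requestCount} requests`);
  }
}
```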
Here's an example of what this API looks like for Durable Objects. For a worker to send a request to a given Durable Object, it has to have a binding to it, and this binding shows up on the env object that's passed with every incoming request to a worker. The env object is used to create an ID, you create a stub object using that ID, and that's how you send requests to a Durable Object. When you do that, the runtime routes the request to the Durable Object, wherever it's located in the world, and handles finding that location and all the routing.

One important part of Durable Objects is that when you create one for the first time, it's actually created as close to you as possible. There's a subset of Cloudflare data centers that support running Durable Objects, and when we get a request for a new object, we need to figure out which of these data centers we're going to run that Durable Object in. The answer is: it depends. The reason is that there are two types of IDs for Durable Objects. You can either create a unique ID, which is a randomly generated string, or you can create an ID based off of a name, like in this example here. With unique IDs, figuring out object placement is really simple: two randomly generated numbers of sufficient length will never be identical, so you can just add some extra metadata to the ID and have that metadata encode where the object was originally placed. But for named IDs, you can't do this, because you want multiple people to be able to use the same name and get the same object. So we do something a little more complex.

This is a picture of what the process looks like if multiple workers try to create a Durable Object with the same ID at the same time. I used airport codes to name all these locations, because I'm an infrastructure nerd, but they're running in Virginia, New Jersey, Amsterdam, and France. All these workers try to create a Durable Object with the ID "Bob", and this named ID actually hashes to a location. That particular location is used as a coordination point to determine where we're going to put the Durable Object. In this case, "Bob" hashes to DFW, Dallas-Fort Worth. So all of the runtimes involved here know that "Bob" hashes to Dallas-Fort Worth, and they need to send this Durable Object creation request there. The first request to land there is used to make the placement decision; in this case, it's probably the one from Virginia, because Virginia is closest to Dallas. So the coordination point records that "Bob" is going to live in Virginia, and the object is placed in the IAD data center in Virginia. This coordination point has to be checked on further requests to "Bob" in some cases, but the vast majority of the time, we cache the location of an object. So if you're using an object more than once, the cache is going to be hot, you won't have to check the coordination point, and your request can proceed directly to where the Durable Object is located.

So we now have this concept of an object instance that's globally addressable and unique, and we can send requests to it. This is a really, really useful abstraction.
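Putting the binding, ID, and stub pieces together, the worker-side call described above might look roughly like this; the binding name COUNTER and the name "Bob" are assumptions for illustration:

```js
// Inside a regular Worker: use the Durable Object binding on env to derive
// an ID, get a stub for it, and forward the request to the object.
export default {
  async fetch(request, env) {
    const id = env.COUNTER.idFromName("Bob"); // same name => same instance, worldwide
    // const id = env.COUNTER.newUniqueId(); // or: a random ID that encodes placement
    const stub = env.COUNTER.get(id);
    return stub.fetch(request); // the runtime routes this to wherever the object lives
  },
};
```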
The way you use it is: when you write an app on Durable Objects, you take your app state and find the smallest logical unit, like chat rooms in a messaging app or documents in a collaborative editor. Once you have an idea of what your objects should look like, you implement a Durable Object class that defines the common behavior. Since we create a real JavaScript instance of your Durable Object class, you can store transient state as members of that class. And with the ID assigned to each object instance, you can effectively send requests to a particular subset of your data and then perform operations on it locally.

The object instance that we create per ID is long-lived, to an extent. If your object instance is receiving requests on a regular cadence, or if it has an active WebSocket connection, it will stay alive. Once it stops receiving requests and no WebSocket connections are present, it goes to sleep. So we also need a solution for storing data once the object goes to sleep, and Durable Objects do support that. There is a persistent storage API as part of Durable Objects, each object instance has access to it, and the storage within it is specific to each object instance. The storage API looks like a key-value API, it has strongly consistent semantics, and storage is co-located with each object instance. There's also an in-process caching layer in front of our storage back end, so reads from storage are usually cached and very, very fast.

This snippet is an example of using that API to implement an atomic counter. Folks who have been reading the snippet probably have alarm bells going off in their heads: this looks racy, I'm not awaiting my puts, this looks really, really wrong. But in Durable Objects, this code is actually completely correct. The storage API in Durable Objects has a feature called IO gates, and what IO gates do is prevent you from making correctness errors that are really easy to make in concurrent applications. It's IO, so there are two types of gates: input gates and output gates. The input gate shuts whenever the object is waiting on storage, and prevents delivery of new incoming events to the object. So in this example, when we do this get, it's impossible for a new event to be delivered to the object and for another event to perform that get at the same time. That prevents two gets from running concurrently, reading the same value, and both writing the same incremented value to storage, which would cause this atomic counter to increment only once for two requests instead of twice. Input gates prevent that problem. Output gates do something different: they prevent any outgoing network messages from the Durable Object while it is waiting for its storage writes to complete. So they prevent clients of your Durable Object from making some sort of decision based on uncommitted information that hasn't been persisted to storage. I'm not going to go much deeper into the storage API, since we have a really great blog post on it and it's not the focus of the talk, but folks can look that up; ask me the name of the blog post after the talk if you're curious.

So I've been talking for a bit now about Durable Objects, the API surface, and the features, but I haven't yet gotten into why they're so well suited for collaborative applications.
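The counter snippet itself isn't in the transcript, but the pattern described above reads roughly like this; the key name "count" is my placeholder:

```js
// An atomic counter inside a Durable Object, using the per-instance storage
// API. Thanks to IO gates, not awaiting the put() is still correct: the input
// gate keeps other events from interleaving between the get and the put, and
// the output gate holds the Response until the write is durably committed.
export class Counter {
  constructor(state, env) {
    this.state = state;
  }

  async fetch(request) {
    let value = (await this.state.storage.get("count")) ?? 0;
    value += 1;
    this.state.storage.put("count", value); // intentionally not awaited
    return new Response(String(value));
  }
}
```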
This abstraction of an object instance that you can create on demand is really useful for creating single points of coordination between multiple clients, and for creating that point of coordination close to those clients. So what does this look like in practice? Imagine you're building a collaborative document editor. You have multiple clients, and they're all sending edits, keystrokes, to the app, and the app needs to decide which of these keystrokes go where. Especially if two people are editing the same sentence, you need to arbitrate between the two clients which letters go where. The simple approach is to have a single point of coordination that determines this. You can do it in other ways, there are CRDTs and the like, but that doesn't work for every data model, so I'm going to focus on a single point of coordination in this talk.

For most developers, an architecture like this, where you have a single point of coordination for your app logic, is really hard to build. You need to do a multi-region deployment, you need to figure out routing, and if you're storing data, you have to handle replication and failover logic. This is very difficult to do across many locations around the globe. Durable Objects solve this problem for you entirely: you get to have your simple, easy-to-reason-about single-point-of-coordination cake, and you get to eat it too, without worrying about operations and routing.

Let's go through another example. Say I'm writing a document about my favorite barbecue spots in Texas. This is a pretty contentious subject, so I imagine someone's going to want to argue with me about it. So I'm writing my list. Say I'm home in Austin, connecting to a worker in the DFW data center, and I'm sending all my edits. That worker is going to get a Durable Object with an ID based off of my document name, "BBQ in Texas", and it's going to send all of my edits to that object. Now, I tell my friend back in Miami, where I'm originally from; he visited recently, and he has some disagreements with me about which barbecue restaurants he likes. So he starts making his own edits and connects to the same document. Since we're both using the same ID, we're both going to connect to the same Durable Object, and that Durable Object handles all the coordination for the document.

I have a more concrete example here. For this talk, I built a demo app on top of Durable Objects that implements Conway's Game of Life, in a multiplayer version where multiple people can connect and add things to the simulation. If you're not familiar with Conway's Game of Life, it's a cellular automaton, a type of grid simulation, with defined rules for when cells in the grid should be considered live (colored black) or dead (colored white). For Conway's Game of Life, the rules are as follows: any live cell with two or three live neighbors lives on to the next tick; any dead cell with exactly three live neighbors becomes live; all other live cells die; and all other dead cells stay dead. These rules are applied to every cell in the grid on every single tick, and each tick happens on a regular cadence, say 500 milliseconds.

The way I implemented this was with two Durable Object classes. There's a Lobby Durable Object class that stores a list of game room Durable Object IDs, so when a client connects to the app, they see the lobby page with a list of game rooms. They pick a game room, that game room has a Durable Object ID for its own specific Durable Object, and the client uses that ID to connect to the game and starts sending updates about which cells it has placed.
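The rules above translate almost directly into code. Here's a minimal sketch of one tick; the sparse-set representation of live cells is my choice for illustration, not necessarily what the demo uses:

```js
// One tick of Conway's Game of Life, as the game room object might run it
// on a 500 ms cadence. Live cells are stored as "x,y" keys in a Set.
function tick(liveCells) {
  // Count live neighbors for every cell adjacent to at least one live cell.
  const neighborCounts = new Map();
  for (const key of liveCells) {
    const [x, y] = key.split(",").map(Number);
    for (let dx = -1; dx <= 1; dx++) {
      for (let dy = -1; dy <= 1; dy++) {
        if (dx === 0 && dy === 0) continue;
        const neighbor = `${x + dx},${y + dy}`;
        neighborCounts.set(neighbor, (neighborCounts.get(neighbor) ?? 0) + 1);
      }
    }
  }
  // Live cell with 2 or 3 neighbors survives; dead cell with exactly 3
  // becomes live; every other cell is dead in the next generation.
  const next = new Set();
  for (const [key, count] of neighborCounts) {
    if (count === 3 || (count === 2 && liveCells.has(key))) next.add(key);
  }
  return next;
}
```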
All of this communication happens over WebSockets. Workers are able to forward a WebSocket connection that arrives at the worker on to the Durable Object it connects to, and all these grid tiles get passed over the WebSocket. They reach the game room, and the game room runs the actual Conway's Game of Life simulation: it processes all the rules on a regular cadence, and every time a new generation has taken place, it sends the live tiles back to all the connected clients.

One can imagine this happening all over the world. Wow, this map is hard to see. Durable Objects can run in a pretty large subset of the data centers Cloudflare operates; right now, that's mostly the eastern and western coasts of the United States, eastern and western Europe, and Asia and Southeast Asia. And there's no limit to the number of Durable Objects you can create. So with this simple app, just a few lines of code, you can have thousands and thousands of people playing Conway's Game of Life with each other, all in separate rooms.

I actually have a demo of this, so I'm going to open it up. I think the Wi-Fi here has been good enough to let people connect and give it a try. Let me move this window over. All right. If you scan the QR code on the screen, you should end up at the same page I'm on. I'm going to make a Node Congress room, create a game, and join. If you place a single tile on the grid here, it's not really going to do anything; you need to place a few. This shape here is called a glider, and it should move across the grid. If I click Send Cells at the bottom, it sends the cells across. And if other folks connect and place some tiles, they should show up on the screen, and you should see the same simulation on whatever device you're using as what you're seeing up here. All of this simulation is taking place in a single Durable Object that's been created near us. Off the top of my head, I think the nearest Durable Object data center is in Amsterdam, so this is running somewhere in Amsterdam, on a machine in a concrete building. And yeah, that's pretty much my talk. I'll let this run for a little bit so folks can play with it. I hope folks liked this. This is my first talk ever, so I'm pretty excited about it. That's it for me.

I'm going to ask the question verbatim, though I might change it a little bit. The question is: can you go into the pricing model of using Durable Objects in production? What would be a good use case where Durable Objects are a cheaper solution? We don't need to go into the pricing model in great depth, but it is worth noting: where is this a really good, viable solution? With Durable Objects, the real key use case is when you want to create all these objects around the world, and it would be really expensive for you to run that infrastructure yourself. The pricing model is based on the wall-clock duration for which the Durable Object is handling an active request. So if your Durable Object is handling a request, you're being billed for that time; once it returns the response, you're no longer billed.
Yeah, that's how the billing model works. Are there any cases where you think this probably isn't the right solution for a given project? Each individual Durable Object is a single thread of JavaScript execution, so if your individual Durable Objects need to handle a significant amount of load, it's probably not the correct use case for you. It's really best when you can split your application into many Durable Objects, so that in the aggregate you have a lot of load, but the individual Durable Objects are not as busy.

Cool. And I think you've kind of covered this a few times, but I'll ask the question anyway: since Durable Objects run in a single data center, what are the impacts on performance when invoked from the other side of the world? Unfortunately, we still have to deal with the speed of light, so if you're accessing a Durable Object from all the way on the other side of the world, you're going to have high latency. The way around this is to create new Durable Objects whenever you need a point of coordination, instead of reusing them; that way, if all of your clients for a particular thing you're trying to coordinate are in the same place, your objects should be located close to them. But if you can't get around it, because you're using them as a single point of truth for long-lived data, then unfortunately there's nothing you can do: your request is going to have to go across the world. We're looking into adding read replicas, so you'll at least be able to get a stale view of whatever data is stored in your Durable Objects from a replica located closer to you, but that's not something the platform supports today.

What if a Durable Object's output gate is busy forever? Will it delay, or fail to respond to, inbound requests? I think this question is referring to the output gate feature. The output gate has a timeout, so if a storage write takes over ten seconds, that's one of the several ways the output gate can fail. If an underlying storage request fails, or that timeout hits, the output gate breaks. That just restarts the object, and any connected clients should then reconnect to it and start fresh. There's not really a way for user code to manually close the output gate and prevent outgoing messages from going out. You can control the input gate manually; I don't have an example of that here, but it's not something most applications use.

Interesting. This question came up a bunch, as I'm sure you were expecting, in many forms. Could you briefly explain the difference between Durable Objects and Deno KV? From what I understand, Deno KV is a data store you access from a stateless serverless function, and it's replicated all around the world. That's very different from Durable Objects, where you have a specific serverless function that's tied to some storage, and that serverless function runs right next to the storage. I think Deno KV is a fairly similar product to the durable storage side of Durable Objects, but for doing coordination on the edge, as far as I can tell, I don't think they have a way to direct requests to a specific instance in order to do in-memory coordination.

Cool. Yeah, that one came up in a few different ways. What is the max size for a Durable Object? There's no limit to the amount of storage an individual Durable Object can store. We have a per-class limit on how much data a whole class can store; I don't remember it off the top of my head.
It's been a while since I've looked at the usage limits. There are limits on individual value sizes: I think values are a few megabytes, and keys are in the 2 to 4 kilobyte range. We are running short on time, so let's see what other questions we can get through.

Do you have guaranteed performance characteristics, since billing depends on clock time? We do not, unfortunately. This is a really hard problem to solve when you're doing wall-clock billing, because we update the Workers runtime on a really regular cadence. We're always keeping up to date with V8 updates; we follow the Chromium beta release channel. So let's say a security issue pops up in V8 and we have to update, and there's a performance degradation in that new release. There's nothing we can really do there; we need to prioritize security. So if we had any sort of guarantees around billing, we couldn't honor them in the presence of a forced update like that, because we're not 100% in control of the execution time of your object.

Okay. Would WebSocket connections nullify or greatly reduce the advantages of the edge in terms of speed? I'm not sure I understand this question entirely, but internal to Durable Objects, WebSocket messages and normal HTTP requests are treated the same way. We actually don't use HTTP as transport under the hood; everything gets turned into an internal RPC format and gets proxied over that. So there's not really a difference between WebSockets and HTTP in terms of performance characteristics for Durable Objects. There is a billing difference that we're currently working on addressing, but that should be eliminated pretty soon.

Awesome. And the last question is: is the code for that demo available? I'll post it on my Twitter after the conference; I need to do some cleanup of it first. I'm more of a C++ developer than a JavaScript developer, and it's probably the most JavaScript I've written in about ten years, so everyone can tell me how bad it is on Twitter. No one's going to tell you how bad it is on Twitter. But do post it; we'd be really interested in seeing it. Thank you ever so much for a great talk. Really enjoyed chatting here in the Q&A.