Learn how running your TS in Temporal's runtime gets you extreme reliability
How to build distributed systems in TypeScript
Transcription
Hi folks, this is how to build distributed systems in TypeScript. My name is Loren, and I wrote a book on GraphQL called The GraphQL Guide. Currently I'm working as a language runtime engineer at Temporal, on our TypeScript SDK.
So when we're working in a monolith, we might not have very many distributed systems concerns to worry about. If we just have a single app server layer and a database that does transactions, then we might just be getting a request from the user and doing a single operation, and if that fails, we say, sorry user, try again. If we're talking to external APIs and they're an essential part of the business transaction, like I talk to Stripe and say, charge this, and then I update my database, then I have some more problems. I might not be able to reach Stripe, so I need to retry that. When I retry, I need to do so idempotently, so that I'm not double charging the user. And there are cases in which my database might be out of sync with Stripe's understanding of the world: if Stripe returns success and then I can't reach the database, or that operation fails, or my process dies before I get to that step, then my data is inconsistent.
In this talk, we'll be working on an e-commerce store and implementing a create order endpoint. We'll go through the failure modes and try to make it more reliable. We won't get to full reliability, but we'll see how much simpler we can make it if we write it in Temporal.
To get started, we have a serverless function hosted on Vercel. We get the item ID, the quantity, and the shipping address out of the body, and we get the user ID out of the JWT. Then we take three steps whenever we have a new order: we reserve the inventory with the inventory service, then we talk to the payment service to charge, and then we talk to the fulfillment service to send the package. And then we respond success.
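As a rough illustration of those three steps, here is a minimal sketch of the happy path. The `Order` and `Services` shapes and the `createOrder` name are assumptions made for this example; the talk does not show the real service clients, and the Vercel request handling is left out so the sketch runs on its own.

```typescript
// Hypothetical shapes: the talk does not show the real request or service types.
interface Order {
  userId: string;   // taken from the JWT in the talk
  itemId: string;
  quantity: number;
  address: string;
}

interface Services {
  reserve(order: Order): Promise<void>;     // inventory service
  charge(order: Order): Promise<void>;      // payment service
  sendPackage(order: Order): Promise<void>; // fulfillment service
}

// Happy path only: no compensation, retries, or persistence yet.
async function createOrder(
  services: Services,
  order: Order,
): Promise<{ status: number }> {
  await services.reserve(order);
  await services.charge(order);
  await services.sendPackage(order);
  return { status: 200 };
}
```

If any of the three calls throws, the later ones simply never run and nothing is undone, which is the gap the rest of the talk works through.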
So in a successful case, this logic works fine, but there are a number of things that can go wrong. If we fail to reserve, then we don't want to continue on to charging and sending packages. If we successfully reserve and then fail to charge, then we not only don't want to send the package, but we also want to unreserve from the inventory so that someone else can buy those items. And lastly, if reserving and charging are successful and then sending the package fails, we want to refund the charge and unreserve.
So let's see what that logic looks like. Get down to here, reserve. If the reservation failed, then we say status 400. Now when we charge and it fails, we want to talk to the inventory service again and say, unreserve this amount of this item. And then finally, send package. If that fails, then we want to first refund, then unreserve.
Right now when we have a failure, we send status 400 to the client. But ideally, we would retry each of these steps in case the failure was transient, like a network error or the service being temporarily down. If we try a second time, maybe it'll go through. So let's add retrying to each of these steps. We have a retry function that takes parameters: max attempts, a timeout, and intervals. For each attempt, it tries calling the service and races that call against the timeout. So if 30 seconds pass and we haven't heard back from the service, then we retry. And each time we retry, we're doing exponential backoff. And here we're wrapping each of these service calls in the retry function.
One issue we have now that we're retrying is that we might have multiple successful calls. We need some kind of idempotency token so that the service knows not to perform the operation a second time. The best way to do this is to get it from the client. So we'll add that: get a request ID out of the body and pass it to each of the services. We also want to retry the calls we make on failure, refund and unreserve.
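The retry helper described above might look roughly like the following. This is a sketch, not the code from the talk: the option names (`maxAttempts`, `timeoutMs`, `initialIntervalMs`, `backoffCoefficient`) are assumptions, and the talk only says the function takes max attempts, a timeout, and intervals.

```typescript
// Hypothetical option names; the talk only mentions max attempts, a timeout, and intervals.
interface RetryOptions {
  maxAttempts: number;
  timeoutMs: number;          // per-attempt timeout
  initialIntervalMs: number;  // wait before the second attempt
  backoffCoefficient: number; // the wait is multiplied by this after each failure
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  let delay = opts.initialIntervalMs;
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(
        () => reject(new Error(`attempt ${attempt} timed out`)),
        opts.timeoutMs,
      );
    });
    try {
      // Whichever settles first wins: the service call or the timeout.
      // A timed-out call is not cancelled; it may still complete remotely.
      return await Promise.race([fn(), timeout]);
    } catch (err) {
      lastError = err;
    } finally {
      clearTimeout(timer);
    }
    if (attempt < opts.maxAttempts) {
      await sleep(delay);               // exponential backoff between attempts
      delay *= opts.backoffCoefficient;
    }
  }
  throw lastError;
}
```

A call site would wrap each service call, for example `retry(() => payments.charge(order, requestId), options)`, where `payments` and `requestId` are again hypothetical names. Because a timed-out attempt may still have succeeded on the service side, passing the same request ID on every attempt is what lets the service ignore the duplicates.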
If the refund throws, then we'll neither refund nor unreserve, so the customer winds up still charged and the inventory is still taken. So let's add retries there: we're wrapping unreserve with the retry, and also refund.
Also, speaking of throwing, we're not actually catching anything that's thrown. We're just checking whether the response object has a failure property. So let's also catch these errors. We get a reservation object if we successfully talked to the service. But if we weren't able to reach the service, then that'll throw, and we catch that, save it here, and then send it to the user.
But what happens if this process crashes? At this point, with all these retries, we might be retrying for hours, and we no longer fit in the time window of a Vercel serverless function. And as we have these long-running processes, at some point some of them will crash, because the server loses power or whatever. We want to be able to pick back up in the same state we were in so that we can continue processing the order. Right now, the only awareness of what step we're on is in the memory of this code. So at each step, we want to save to some durable store, on a disk somewhere, what state we're in: what we've done and what we have left to do. So let's do that. We've got a Mongo client and the different states that an order can be in. We connect to the DB, we insert one order in the created state, we get the ID, and then we use that to update the same order record after every state change: failed to reserve or reserved, failed to charge, et cetera.
There are a number of other problems that we don't have time to fix. We want to make sure that the database client is retrying, and we want to handle database errors. We want to handle errors in the compensation logic. And we need a worker pool. Right now, we're inside an HTTP handler.
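The state tracking described above can be sketched as follows. Everything here is illustrative: the state names, the `OrderStore` interface, and `runStep` are assumptions, and an in-memory map stands in for the Mongo collection so the example runs on its own (which also means it is not actually durable).

```typescript
// Hypothetical state names, loosely following the ones mentioned in the talk.
type OrderState =
  | "CREATED"
  | "RESERVED" | "FAILED_TO_RESERVE"
  | "CHARGED" | "FAILED_TO_CHARGE"
  | "SENT" | "FAILED_TO_SEND";

// Hypothetical storage interface; a MongoDB collection could sit behind it.
interface OrderStore {
  insert(state: OrderState): Promise<string>; // returns the new order ID
  update(id: string, state: OrderState): Promise<void>;
}

// In-memory stand-in so the sketch is runnable. It is NOT durable.
class InMemoryOrderStore implements OrderStore {
  readonly records = new Map<string, { state: OrderState; updatedAt: number }>();
  private nextId = 1;

  async insert(state: OrderState): Promise<string> {
    const id = String(this.nextId++);
    this.records.set(id, { state, updatedAt: Date.now() });
    return id;
  }

  async update(id: string, state: OrderState): Promise<void> {
    this.records.set(id, { state, updatedAt: Date.now() });
  }
}

// Runs one step and records its outcome before the caller moves on.
async function runStep(
  store: OrderStore,
  id: string,
  step: () => Promise<void>,
  onSuccess: OrderState,
  onFailure: OrderState,
): Promise<boolean> {
  try {
    await step();
    await store.update(id, onSuccess);
    return true;
  } catch {
    await store.update(id, onFailure);
    return false;
  }
}
```

The `updatedAt` timestamp is what a separate process could use to spot records stuck in a non-terminal state. Note that the store calls themselves can fail, which is one of the unhandled problems the talk lists.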
But if the handler crashes, we need some other process somewhere to notice that a record in the database is in a non-terminal state and that the last time it was updated was long enough ago that we know no one's still working on it. So we need a worker pool. We also probably want to return to the user earlier than the end of this function, because all of this retrying might take minutes or hours.
I don't have time to code this. You probably don't have time to code this. And we shouldn't have to. We should be working at a level of abstraction where everything we do is automatically persisted and retried, and we don't have to worry about it. Fortunately, that system exists, and it's called Temporal. So let's see what our app looks like in the Temporal system.
So here's the new serverless function. We still get the same data out. We do only one step, though, and that's client.start. This is using the Temporal SDK up here. We're starting a process order function, and we're providing it the order arguments and the request ID, which serves as both the ID of the function execution and an idempotency token. And then we respond that we've submitted the order. If this has happened, we know that it's durably started and saved. And if we get this already started error, then we know the request was a duplicate, so we say, hey user, the order has already been submitted.
Now let's look at this process order function. We have each step: reserve, charge. If charge fails, then we unreserve. And then we send package. And if that fails, we refund and then unreserve. So that's very simple. That's all of our logic, and it works. And you might be thinking, okay, but these reserve and charge functions, et cetera, must be really complicated. Let's look at them. These are called activities, and we use activities for anything that might fail. So all this logic doesn't have a risk of failure, but everything it calls does.
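The control flow of the process order function, as described, can be sketched in plain TypeScript. This is not the code from the talk: the `Order` and `Activities` shapes are assumptions, and the activities are passed in as a parameter so the example runs without the Temporal SDK, whereas a real workflow would obtain them through the SDK. Rethrowing after compensation is also an assumption, since the talk does not say how the function ends on failure.

```typescript
// Hypothetical shapes for illustration only.
interface Order {
  userId: string;
  itemId: string;
  quantity: number;
  address: string;
}

// In Temporal these would be activities, retried and timed out by the system.
interface Activities {
  reserve(order: Order): Promise<void>;
  unreserve(order: Order): Promise<void>;
  charge(order: Order): Promise<void>;
  refund(order: Order): Promise<void>;
  sendPackage(order: Order): Promise<void>;
}

async function processOrder(act: Activities, order: Order): Promise<void> {
  await act.reserve(order); // if this fails, there is nothing to undo

  try {
    await act.charge(order);
  } catch (err) {
    await act.unreserve(order); // release the inventory for other buyers
    throw err;
  }

  try {
    await act.sendPackage(order);
  } catch (err) {
    await act.refund(order);    // first refund the charge...
    await act.unreserve(order); // ...then release the inventory
    throw err;
  }
}
```

The function contains no retry loops, timeouts, or state writes. In the Temporal version described in the talk, those concerns are handled by the system around each activity call.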
So this is the service's reserve and unreserve. You can see it's just the basic service calls. Each of these activity functions is retried and timed out according to my retry policy and whatever options I put here; there are lots of different options. That's all just taken care of by the system.
Further, each step of this process order function is persisted. So if the process dies after this, then a new worker will continue execution in the exact same spot, in the exact same state. And any local variables (we only have order up here, but we could have lots of other ones that we change during the function) will all have the same values.
You can also do other cool things, like sleep for 30 minutes. Say we're doing the Amazon one-click buy. We reserve, then we set a 30-minute timer, and then we charge. And if we get canceled during those 30 minutes, then we don't charge or send the package. We can also sleep for 30 days in a loop if we're charging someone monthly on a subscription.
The way this works is we have a Temporal server, which is open source, and you can host it anywhere. That does persistence. And we have language runtimes, or SDKs. The SDK is split up into a client and a worker. So you have a worker pool, and then you have a client that can be used anywhere; right now we're using it in a serverless function. The client talks to the Temporal server, and the workers talk to the server. The client says, hey, start this workflow, which is this process order function. And the workers are constantly polling and picking up tasks from the server: okay, start this function; okay, start this activity; et cetera. The worker reports back to the server after it completes each step so the server can persist it.
So when people ask me, what's Temporal? I say it's a durable code execution framework.
So we run your code in our workers, and we do so durably, such that if anything goes wrong, we pick it back up in the exact same state. And so you can write your code in a way that's oblivious to faults.
So that's Temporal. I think it's a huge step forward in software development: working at a higher level of abstraction in which you don't have to worry about reliability.
If you'd like to learn more, check out temporal.io/ts. That's the TypeScript SDK docs. Feel free to email me at loren@temporal.io, or hit me up on Twitter, @lorendsr. Also, we're hiring. So if you would like to work with me, let me know.