GraphQL Authentication and Authorization at Scale


At Unity, we use GraphQL federation to expose a wide range of business functionality across the organization in a single GraphQL schema. With an ever-growing number of services, this presents challenges for authentication and authorization across the board. I explore how we implemented GraphQL auth at the gateway level, the key design decisions behind it, and the wide-reaching benefits this can have.


Hi, everyone, and welcome to my talk on GraphQL authentication and authorization at scale. My name is Johnny Green, and I'm a senior software engineer at Unity Technologies and also an open source developer.
So before we get into it, I'd like to just quickly discuss the agenda, just to really set the scene for you all and just really provide a lot of context for the solution and design that we'll talk about coming up. So first of all, we'll talk about GraphQL at Unity. I'll introduce our team and basically some of the things that we do, as well as talking about our tech stack, just to really provide you all with an idea of how we work and how we implement GraphQL. Next up, I'll discuss the problem we wanted to solve with Auth, especially at scale. So I'll discuss, yeah, basically the problems we encountered and also what benefits we're looking to have as well. Next up, I'll talk about the design. So I'll discuss the actual details of the design, as well as also how this solves our original problem and how it gives us the benefits that we're looking to have as well. And then finally, I'll show you all the solution. So this will include implementation, but also a short, brief example to give you an idea of exactly how we implemented this at Unity.
So GraphQL at Unity. So I work in the live platform team, where our primary aim is to expose business functionality to clients. And this is all through a centralized GraphQL schema. So we use GraphQL Federation under the hood, where we have a gateway. And then behind this, we expose several services that expose different parts of the business and its functionality. And clients talk to this centralized schema, which they treat as the hard contracts that we expose. And basically, yeah, just get the bits of business functionality that they need to access. We also are actively working on improving our self-service options.
So as we get more and more requests from clients, we want them to be able to do the work themselves as well. So if they want to expose a new bit of business functionality, we want to say to them, here's some instructions, and you can go and implement it yourself in a GraphQL-compliant way with all the benefits that we serve as well. So for instance, we do a lot of caching under the hood. So we can tell them how to take advantage of all these benefits and tooling that we've developed over the past year or so. We're also looking to automate a lot of this. So a lot of this is fairly generic stuff and standardized by convention. So this enables us to look into can we generate all this code and can we make lives for new developers a lot easier by just saying, if you want to spin up a new service, just run this command, and you're good to go. And you've got all the service set up you need. It's all hooked in, and it can be deployed as well. So I thought I'd also talk about our tech stack. So we use Node.js and TypeScript under the hood, and it's done very well for us. And with that, we also use the Mercurius GraphQL server. And this is for both all our services and also our gateway as well. So the reason why we chose Mercurius is just it's got a fantastic open source community, and we really bought into that, and we really like it because of that. Also, Mercurius provides all the functionality that we need and more. So we just thought it was a bit of a no-brainer for us to choose Mercurius, and it's been fantastic so far. We also use GraphQL Federation. So as I mentioned before, we're exposing business functionality at scale, so through a central gateway with lots of services. Now, lots of these services provide functionality that can be related to other bits of functionality. 
So we're not only federating the graph and all the individual graphs into one big graph, we're also taking advantage of federation features such as if we've got a user with a certain set of details that comes from another service, we can provide that through GraphQL Federation. So I'd like to discuss a problem next, and that is handling auth at scale. So when we started out, all our services were running auth. So that means all the services had to implement the auth mechanism, they had to define their auth policy definition, and also then just define all the fields they wished to protect. And that's quite a lot of stuff for a service to be doing when we really want to scale this out long-term. So we want to take a lot of this responsibility away from the service and just really make sure that the implementation is really simple and they just need to – all the services need to worry about is the business logic it's exposing and just take a lot of the responsibility away from that. So with that, we've also got multiple services and also multiple teams. So lots of new contributors outside of our team that are contributing new and more and more services to our federated system. And this comes with its own problems. So multiple teams, they may use different tech stacks. They may use different auth models. And for the clients, they don't necessarily know how the fields are protected or what fields are protected. So we also want to solve this as well. So we want to ensure that we've got a consistent auth model across all services that they all adhere to. And we also want to present all this information to clients. So if a service protects a certain field in a certain way, we want to tell clients about this in order to tell them how to handle all the errors and basically how to integrate with our system. In terms of the benefits we're looking for, so we also want to be closer to the auth mechanisms. 
So because the services are currently running auth, they're quite far away from the Unity auth endpoint that they interact with. So we want to be closer to that, which will give us better performance but also allow us to optimize, for example, how many requests we make to this endpoint and how we request this endpoint. So we want to be closer to the auth mechanisms that Unity provides. We also want a single policy. So we essentially want to control this centralized policy and basically tell services how to access it. So all the services all use the same policy definition. And as long as they adhere to that, we're good to go. They can protect their fields in exactly the same way, in exactly the way they want. And it just makes integration really simple because all they then need to worry about is just configuration, and that's it. They can then focus on the business functionalities and not worry about how do we interact with the auth endpoint. It's already happened. One other benefit that is really apparent that we want to have is basically any request from the gateway to the services, we want that request to be already authenticated, already sufficiently authorized. So as it stands, with the services running auth at the service level, we don't have this, so we want to provide this as well. So next up is the design. So the next logical step is to take the auth to the gateway level, and this has several benefits. So you can see here that all the services now need to do is just define the policies that they care about. So, for instance, if they want to protect a certain field and make it authenticated only, they just need to add a GraphQL directive to that field, and they're good to go. Similarly, if they want authorization, they just add a GraphQL directive with a slightly different config to say, I want this X, Y, and Z permission, and also they're good to go. So for the service point of view, it's just configuration. 
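As a rough illustration of what "just configuration" can look like on the service side, here is a minimal sketch in TypeScript. The directive name `auth`, its `permissions` argument, and the `billing` field are assumptions made for this example and are not taken from the talk; federation-specific syntax is left out for brevity.

```typescript
// Hypothetical shared directive definition. In the approach described above,
// this definition would be owned centrally and reused by every service.
const authDirectiveDefinition = `
  directive @auth(permissions: [String!]) on OBJECT | FIELD_DEFINITION
`;

// A service only annotates the fields it wants to protect.
const accountServiceSchema = `
  ${authDirectiveDefinition}

  type Query {
    # Authenticated users only: the bare directive, no extra config.
    me: User @auth

    # Authorization: the same directive with a permission list (assumed name).
    billing: Billing @auth(permissions: ["billing.read"])
  }

  type User {
    id: ID!
    name: String
  }

  type Billing {
    id: ID!
    balance: Int
  }
`;
```

Because the policy lives entirely in the schema text, a service written in any language could expose the same annotations.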
From the gateway point of view, we need to tell the gateway how to find the service policies, how to register the policies, and how to apply policies upon a GraphQL request. Now, with GraphQL directives, we can do this. So not only is this really simple for services, it's also really simple for the gateway as well, because all it needs to do is just look through the federated schema that it's generated, and depending on the auth directive that it finds, it can create a policy accordingly.
So how do we solve the problem? So as you can see, we've moved a lot of the work to the gateway level, and you can see that from a service point of view, all they need to worry about is configuration. And not only that, the configuration is also all in the GraphQL schema. So it means that services can be implemented with whatever tech stack they choose to use. As long as they're providing a GraphQL schema, they're good to go. And for us, that's really powerful because it allows us to truly scale with all the services that we want, and they could write it in Rust, and it doesn't matter. As long as they apply the same auth policy definition to their GraphQL service, then, yeah, that's all they need to worry about.
So what do we need to do? So before we could implement this, we needed to add some features to Mercurius in order to keep the approach as simple as possible and to really make this approach scale. So one of the first things we looked into was adding Mercurius hooks. These are essentially lifecycle events within Mercurius that mark the important points within a GraphQL request and also within the system. For instance, when a gateway refreshes its schema, we want to know about that, and we want to be able to provide functionality at that event. So we needed to implement that, but also we needed to implement an auth plugin.
So this would be an opt-in plugin that would use the Fastify plugin system to essentially enhance the Mercurius server with the ability to add auth policy handlers, look for certain auth directives, and also add user information to the context so that auth policy handlers get as much information as they need.
And then from a Unity point of view, we needed to define our own policy handler that interacted with the Unity auth endpoint. And also we needed to define all the policies at the service level. So we needed to register these configs and basically protect all the fields we needed to protect in the services. And these would all use a centralized policy definition, so they would all use the same GraphQL directive. And as long as they adhere to that directive, as I said before, they're good to go.
So the solution. So with Mercurius auth, from a high level, essentially what we do is traverse the GraphQL schema, and this is all at the gateway. So we traverse the GraphQL schema and look for auth directives. And depending on the directive arguments that were passed to it, we would then define a policy according to that position within the GraphQL schema. And from this, we would just keep traversing all the GraphQL schema, look for all the protected fields and types, and from there, build up an auth schema. We would then traverse this schema and wrap all the relevant field resolvers with the auth policy handler such that this auth policy handler would be configured with the config that we registered through the GraphQL directives. Now, we don't want to traverse the GraphQL schema at every request, so a lot of this work is done at registration time. So by the time a GraphQL request comes in, the field resolvers are already wrapped, and all that's run is just the auth policy handler with the correct configuration passed to it. Now, you can find out more about this on GitHub, at mercurius-js/auth.
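The registration-time traversal described above can be sketched roughly as follows. This is a simplified, dependency-free model, not the plugin's actual code: the `Schema`, `AuthSchema`, and `PolicyHandler` types and both function names are hypothetical stand-ins for real GraphQL schema objects.

```typescript
// Simplified stand-ins for a GraphQL schema (hypothetical types).
interface Context { roles: string[] }
type Resolver = (parent: unknown, args: unknown, context: Context) => unknown;
interface Policy { requires: string }
interface FieldDef { resolve: Resolver; authDirective?: Policy }
type Schema = Record<string, Record<string, FieldDef>>;
type AuthSchema = Record<string, Record<string, Policy>>;
type PolicyHandler = (policy: Policy, context: Context) => boolean;

// Step 1: traverse the schema once and record every field carrying a directive.
function buildAuthSchema(schema: Schema): AuthSchema {
  const authSchema: AuthSchema = {};
  for (const typeName of Object.keys(schema)) {
    for (const fieldName of Object.keys(schema[typeName])) {
      const policy = schema[typeName][fieldName].authDirective;
      if (policy) {
        if (!authSchema[typeName]) authSchema[typeName] = {};
        authSchema[typeName][fieldName] = policy;
      }
    }
  }
  return authSchema;
}

// Step 2: wrap only the protected resolvers with the policy handler.
function wrapResolvers(schema: Schema, authSchema: AuthSchema, applyPolicy: PolicyHandler): void {
  for (const typeName of Object.keys(authSchema)) {
    for (const fieldName of Object.keys(authSchema[typeName])) {
      const policy = authSchema[typeName][fieldName];
      const field = schema[typeName][fieldName];
      const original = field.resolve;
      field.resolve = (parent, args, context) => {
        if (!applyPolicy(policy, context)) {
          throw new Error("Failed auth policy check on " + fieldName);
        }
        return original(parent, args, context);
      };
    }
  }
}

// Registration time: both steps run once, not on every request.
const schema: Schema = {
  Query: {
    me: { resolve: () => ({ id: "u1" }), authDirective: { requires: "USER" } },
    health: { resolve: () => "ok" },
  },
};
const authSchema = buildAuthSchema(schema);
wrapResolvers(schema, authSchema, (policy, ctx) => ctx.roles.indexOf(policy.requires) !== -1);
```

At request time only the wrapped resolver runs, so the per-request cost is the policy check itself; unprotected fields such as `health` are left untouched.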
So if you want to have a look at the code or discuss it further, you can check it out on GitHub.
So next up is usage at Unity. So as I mentioned before, we needed to also implement some Unity-specific things. So we already had, out of the box, a Unity auth endpoint. So the Unity policy handler that we wrote basically just interacted with that endpoint, and this would all be at the gateway level. And we also defined a central Unity auth directive definition, and that would be controlled by us. So if we needed to evolve the schema, the services would then automatically pick up this central auth directive in a schema evolution compliant way. And as long as they adhere to this central directive definition, they're good to go. And then, yeah, as I mentioned before, we just needed to then define the Unity auth directive usages such that we're protecting the fields that needed to be protected in the correct way. And that's really it. It's actually a fairly simple implementation in the end, just because we did a lot of the work within the Mercurius Auth plugin and also the Mercurius hooks.
So as a simple example, I thought I'd provide you with this approach. It's very similar to what we did at Unity, just to give you an idea of how we did this. You can find this on GitHub, and, yeah, if you want to check out the code, you can check that out. So what have we got? So we've got a federated system, so we've got a gateway, and then behind that sits a user and post service. So as part of this demo, we will define an auth policy handler and also an auth directive, which the downstream services both use, and then the gateway automatically picks up these auth directive definitions. So a GraphQL talk wouldn't be complete without a GraphQL schema. So here we have the schema for the user service. You can see here we've got the auth directive definition here, and this would be the same across all services, and this is controlled by us.
So this is completely centralized, and so whenever we evolve it, we evolve it in the correct way such that when services pick up the evolution, they've got it immediately and they just need to comply with that directive definition. So, yeah, so we've got that directive definition, and we've got the usage here. So we are protecting the me query with the auth policy that requires a user role. And that's it. That's all the service needs to do. It's all in GraphQL land, all in configuration, and they don't need to do any more to it. Similarly for the post service, so we've also got the auth definition, and we've also got the post service usages as well. So we can see we're protecting the author field with the admin role, and we've also got the post field that we're also protecting with the admin role. These are kind of maybe more sensitive fields that we want to protect with a different role, such as admin users. So you can see here, these are the ones we're looking to protect. And then finally, we've got the gateway registration. So you can see this is as simple as it gets. So I've redacted all the other setup, which is this fairly standard Mercurius and Fastify setup. This is the stuff specific to Mercurius auth. So as mentioned before, we've got several things we need to do. So we've got the auth context, which looks at the request headers and just applies an identity to the context that just says this user is this user from the headers. And then we've got the policy handler. So here you'll see it's very similar to a GraphQL resolver. So the final four parameters are just the parent information, the arg information, context, and info. So you've got everything you need from that point of view. And then we also provide the auth directive AST to say this is the configuration applied to this field. So here what we're doing, we're getting the roles using the identity, and we're making sure that the auth directive matches up with the roles on the user. 
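To make the shape of the auth context and policy handler concrete, here is a minimal, dependency-free sketch. It only mirrors the parameter order described above (directive configuration first, then the resolver-style parameters); the simplified `AuthDirectiveAST` type, the `x-user` header, the `requires` argument, and the `rolesByIdentity` table are all assumptions for illustration, and the plugin wiring itself is omitted.

```typescript
// Simplified shapes; real directive ASTs and contexts carry more detail.
interface AuthDirectiveAST {
  arguments: { name: { value: string }; value: { value: string } }[];
}
interface AuthContext { auth: { identity?: string } }

// Hypothetical lookup table standing in for a call to an auth endpoint.
const rolesByIdentity: Record<string, string[]> = {
  alice: ["USER", "ADMIN"],
  bob: ["USER"],
};

// Auth context: derive an identity from the request headers (header name assumed).
function authContext(headers: Record<string, string | undefined>): AuthContext["auth"] {
  return { identity: headers["x-user"] };
}

// Policy handler: the directive config, then parent, args, context and info.
function applyPolicy(
  authDirectiveAST: AuthDirectiveAST,
  _parent: unknown,
  _args: unknown,
  context: AuthContext,
  _info: unknown
): boolean {
  const requiresArg = authDirectiveAST.arguments.filter(a => a.name.value === "requires")[0];
  if (!requiresArg) return false;
  const identity = context.auth.identity;
  const roles = identity ? rolesByIdentity[identity] || [] : [];
  // Allow the field only if the caller holds the role the directive asks for.
  return roles.indexOf(requiresArg.value.value) !== -1;
}

// Helper that builds the simplified directive node for a given role.
const requires = (role: string): AuthDirectiveAST => ({
  arguments: [{ name: { value: "requires" }, value: { value: role } }],
});

const asAlice: AuthContext = { auth: authContext({ "x-user": "alice" }) };
const asBob: AuthContext = { auth: authContext({ "x-user": "bob" }) };
const anonymous: AuthContext = { auth: authContext({}) };

applyPolicy(requires("ADMIN"), null, {}, asAlice, null);   // true: alice has ADMIN
applyPolicy(requires("ADMIN"), null, {}, asBob, null);     // false: bob only has USER
applyPolicy(requires("USER"), null, {}, anonymous, null);  // false: no identity
```

The handler returns a plain boolean here; a real handler could equally call out to an external auth service using the same inputs.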
And, yeah, so we can do whatever we want in this policy, and it's just completely up to you. We've left it completely flexible. So you can just call whatever auth mechanism you want with all the information you need from the parameters. And here we're just saying let's look for the auth directive within the gateway.
Now here I thought I'd just give some example requests. So we've got full access to fields, so a user with both the user and admin roles, and we can see we've got the data as normal. And then here we've got partial access. So we can see we've got a request with just a user role, and we can see we no longer have access to posts. And we can see here it's set to null with the associated error. And then finally we've got a request with no access to fields. So this is an unauthenticated request. We can see we've got no data and we've got both errors telling us what to do.
So in summary, we talked about the Mercurius Auth plugin. We talked about how we defined a central auth definition. And basically as long as downstream services adhere to the recognized schema, this is a scalable approach to auth. We really believe this, and we've really been seeing the benefits within our system. In terms of improvements, we're adding lots and lots of features all the time to the Mercurius Auth plugin. So one of the things that is going on at the moment is just adding support for filtering the schema, so you only see the parts of the schema you have access to.
And that's it. Thank you so much for listening to my talk. It's been really great talking to you. And personally, I'm really excited about this work. And I just hope you took something out of it. So thank you for listening. And I hope to see you around at the rest of the conference. Thank you very much.
22 min
10 Dec, 2021
