After spending over a decade at Google, and now as the CTO of Vercel, Malte Ubl is no stranger to being responsible for a team's software infrastructure. However, being in charge of defining how people write software, and in turn building the infrastructure they use to write that software, presents significant challenges. This presentation by Malte Ubl uncovers the guiding principles for leading a large software infrastructure.
Principles for Scaling Frontend Application Development
AI Generated Video Summary
This Talk discusses scaling front-end applications through principles such as tearing down barriers, sharing code in a monorepo, and making it easy to delete code. It also emphasizes incremental migration, embracing lack of knowledge, and eliminating systematic complexity. The Talk highlights the use of automation in code migration and the importance of removing barriers to enable smoother code migration.
1. Introduction and Background
How's everyone doing? Yeah, thanks for talking about JSConf and stuff. We made the wise decision not to have a 2020 event because it wouldn't have happened anyway. This is actually my first conference, at least if you're counting community conferences, since that bigger break in everyone's life. I'm coming from San Francisco. I was actually in Germany before, so I'm still a little bit jet-lagged, but I'm trying to bring up the vibes here a little bit.
So when I joined Vercel, I talked to all the teams and found that we have an internal DX team. And so I talked to them: what are you guys doing? They explained to me what they're doing, and I thought, that seems like a really good idea. And then I went to talk to customers, and our customers were telling me that their problems are exactly the problems our internal DX team is solving. So I thought, why don't we turn this into a product team and actually make this available to everyone, so that not everyone has to solve the same problems over and over again. Now, this is relevant because the talk I gave back in the day was such theory-heavy stuff, because I didn't have anything open source. I could only tell folks: this is what I learned, but you have to figure out how to turn this learning into something real by yourself. I'm going to be doing similar stuff today, and more power to you if you want to build it yourself. But we did announce a product called Vercel Spaces a few weeks ago that tries to implement some of the stuff I'm going to talk about in a reusable fashion.
2. Scaling Front-end Applications
So let's go ship software like Google or Vercel, because it's 2023. But I think this one is actually true: iteration velocity solves all known problems. When we make software, we are going to make mistakes over time, and we have to deal with that fact professionally. The rest of this talk is about how you can scale beyond yourself through the software you build, and through mechanisms you establish in your teams that make the whole team better. I want to talk about principles for scaling front-end applications. I've got six. Let's start with this one: tearing down the barriers.
All right. So let's go ship software like Google or Vercel, because it's 2023. Going back to Google: there was this guy called Eric Schmidt, who was the CEO for a long time, and he, at least internally, maybe I'm leaking something, sorry, always used to say that revenue solves all known problems. Basically saying: if you just make more money, it really doesn't matter what else you're doing. And I think that's wrong.
Certainly, we don't have infinite money, you don't have infinite money, and I think even Google no longer has infinite money. So it's not such a good mantra. But I think this one is actually true: iteration velocity solves all known problems. What I mean is that when we make software, and the first talk today actually had a similar point, we are going to make mistakes over time. The only thing we know is that the future is uncertain, and we have to deal with that fact professionally. The way you do that professionally is by being able, at the moment you make that mistake, to react to it, iterate, and do it right the second time around. So with that, let's think a little bit about our role in this for the rest of this talk.
So the senior engineer, or really what I mean is the lowercase senior engineer: we've been around for a little bit. What do we do? There's a session later today about career advice, and they might tell you something like: if you want to have more impact, you need to scale beyond yourself. And what people usually associate with this is something along the lines of having management responsibility, or at least telling other folks what to do. It's a people thing, right? The rest of this talk is about something very different from that. It's about how you can scale beyond yourself through the software that you build, and through the mechanisms that you establish in your teams that make the whole team better. So for the rest of these slides, consider yourself the hypothetical lead of the hypothetical platform team for your hypothetical, I don't know, 12, 50, or 100 person engineering team. You're that person, and your job is to make that team scale better. Cool, that was a long intro. Anyway, I want to talk about principles for scaling front-end applications. I've got six. I have a bonus one, but I think I'm too slow, so I'll do six. All right, let's start with this one: tearing down the barriers. This is actually something where it was not clear to me how big of an effect it would have. As I mentioned, I was at Google for the longest time, so I was not using the tools everyone else was using, but I was doing this conference, JSConf. I was going to these meetups myself, and I would hear folks on stage being incredibly excited about small modules. And there was a company called npm, shout-out to npm, who created something called private modules.
3. Sharing Code and Monorepo
If you want to get that adrenaline shot in the arm, go migrate to a monorepo. It's really worth it. It encourages collaboration and puts a low barrier on collaboration. GitHub code owners addresses scalability problems. As part of our spaces product, we're shipping something called owners, which is basically like code owners but like good.
And so people would go and share code in their company by publishing modules to npm. So I join Vercel and I see that happening, where some team says: well, you have some cool code, can you publish that to npm so I can use it? And then I have to update my package.json to your new version. Maybe someone's helping me, maybe I'm doing this automatically. It's such an incredibly bad process, and at scale, if you have a larger team, it really hurts. So my one weird trick of advice: if you haven't done it yet and you want to get that adrenaline shot in the arm, go migrate to a monorepo. It's really worth it. Now obviously it comes with trade-offs; everything in software engineering comes with trade-offs. A few months in you'll notice: oh, everyone can change everything, maybe that isn't the best idea on the planet. But it's good to have a basis that encourages collaboration and puts a low barrier on it, and you can deal with the problems that come along the way. And there are certain scalability problems that, for example, GitHub code owners addresses. Now, that's a weirdly not very great product among the other stuff at GitHub that's really good. So as part of our Spaces product we're actually shipping something called owners, which is basically like code owners, but good. So check it out.
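For context, GitHub's code owners feature is driven by a `CODEOWNERS` file that maps paths to required reviewers. A minimal sketch (the team names and paths are hypothetical) looks like this:

```
# .github/CODEOWNERS (later patterns take precedence over earlier ones)
*                         @acme/platform-team
/packages/design-system/  @acme/design-system-team
/apps/dashboard/          @acme/dashboard-team
```

With a file like this in the monorepo, a pull request touching `/apps/dashboard/` automatically requires a review from that team, which is what keeps "everyone can change everything" from becoming a problem.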
4. Making it Easy to Delete Code and Data Fetching
Making it easy to delete code is essential in managing large code bases. By encouraging engineers to have as little code as possible, including deleting unnecessary code, we can maintain a more efficient and scalable application. Using tools like Tailwind or CSS-in-JS libraries allows for easy deletion of code, as the CSS is collocated with the code. Similarly, data fetching inside components, such as in Next.js 13, reduces unnecessary code and improves performance. This approach, despite some skepticism, has been successfully implemented in larger Google apps, demonstrating its scalability and effectiveness.
All right, for the second principle, I'm calling this making it easy to delete code. Because there's one true thing about any large code base: it will get larger. And that's in a way good, right? It's our job to write code, we're going to write code, there's nothing you can do about it. But we can, again as the platform lead, professionally manage and encourage our engineers to have as little code as possible, and that includes deleting code.
Now, if we all think about our code bases, there's probably something like this in there, right? There's some code that pretty obviously should not be there, you know, "Happy New Year 2017" or whatever. And maybe that was originally a React component, and we got rid of that React component, but we had a CSS file that was also contributing to our 2017 Happy New Year message. And who knows whether that selector can be deleted? That's almost a halting-problem type of problem, and so it lingers around. On the other hand, if you're using something like Tailwind or CSS-in-JS libraries, because you have colocation of the CSS with the code, you can just delete the whole thing with high confidence and get rid of it. And again, this is the type of thinking you should have as the lead: that's why I'm making the choice of using something like Tailwind, because it actually makes it easier to delete code. Another good example of this type of thinking is data fetching inside of components in Next.js 13. In this example, we have a Tweet component, and we only pass the ID; the Tweet component fetches its own data. Now if we delete that Tweet component, everything's gone. Remember how this worked in legacy Next.js, or with something like Remix loaders: data fetching is hoisted to the top of the route. So if I delete something that's very far down my React render tree, I have to remember to also delete the data fetching code. And if you have a large team, some people will forget some of the time, and it will not only make your application larger and full of unnecessary stuff, it will also make it substantially slower. For data fetching you can introduce cost, because you're calling some backend system that you really don't have to call. So it's the same idea as CSS-in-JS, just for data fetching in JS. And I know some folks say, wow, that doesn't scale, blah, blah, blah.
One of the accomplishments at Google that I'm most proud of was introducing exactly this. Any larger Google app that you've been using over the last eight or so years is using this technique, where you have data fetching in the render tree, as opposed to data fetching in some hoisted, pseudo-scalable fashion.
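To make the colocation idea concrete, here is a minimal sketch in plain TypeScript. The `Tweet` type and `fetchTweet` backend are invented for illustration, and this is not the actual Next.js API; the point is only that the component receives an id and owns its data dependency:

```typescript
// Minimal model of component-local data fetching: the component gets
// only an id and fetches its own data.
type Tweet = { id: string; text: string };

// Hypothetical data source standing in for a real backend call.
async function fetchTweet(id: string): Promise<Tweet> {
  return { id, text: `tweet ${id}` };
}

// Deleting TweetView deletes its data dependency with it; there is no
// hoisted loader elsewhere that must be remembered and cleaned up.
async function TweetView({ id }: { id: string }): Promise<string> {
  const tweet = await fetchTweet(id);
  return `<blockquote>${tweet.text}</blockquote>`;
}
```

Because the fetch lives inside the component, removing the component from the render tree removes the backend call in the same diff.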
5. Incremental Migration and Application Standards
Always migrate incrementally. Plan for incremental migration from the start. Use the Next.js 13 app router as an example. Migrate a single route at a time. Shout out to the Next.js team. Principle number four: always get better and never get worse. Introduce lint rules to ensure application standards. Comments can serve as a ledger of technical debt.
All right. Principle number three: always migrate incrementally. I was joking on Twitter the other day that there are basically just two types of migrations: the incremental ones and the failed ones. Now, sometimes you start out with a big plan and you say, we're going to do this all at once. But more often than not, maybe you're three months in and your boss comes and says: this is all fine and good, but we can't wait six more months to prove that this is going to work, I need to ship something next week or next month. And so something that started out as a big-bang type of migration typically eventually becomes something that you do incrementally, so it's better to just plan for that from the start. This is just an example of the type of thinking that goes into API design to make incremental migration possible. I think another good example is, once again, the Next.js 13 app router, where you can migrate a single route at a time, lifting and shifting it into the new world. No one's forcing you to choose old API or new API globally: you can have this page on the old API and that page on the new API, and work your way through the codebase. And further into that project, here we have some getServerSideProps. Legacy? Should I say legacy? I don't know. Old API, which is perfectly fine to still use. That's how it looked, and then you can do the first migration step. Note that this is a trivial migration: nothing really changed. We just took the code that was in getServerSideProps and put it into our top-level component. Now, is that better than before? No, it's not. But it's an incremental step, where you didn't have to rewrite your entire render tree to take advantage of some of the new capabilities.
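That trivial first step might look something like this sketch. It is simplified plain TypeScript, not the real Next.js signatures, and the data is made up; the only point is that the loader's body moves into the top-level component unchanged:

```typescript
// "Before" (hypothetical): data fetching hoisted into a loader,
// far away from the components that use the data.
async function getServerSidePropsLike() {
  const user = { name: "Ada" };
  return { props: { user } };
}

// First, trivial migration step: the exact same fetch now lives in
// the top-level component. Not better yet, but an incremental move
// toward fully component-local data fetching.
async function Page(): Promise<string> {
  const { props } = await getServerSidePropsLike();
  return `Hello, ${props.user.name}`;
}
```

Each later step can then push the fetch further down the tree, one small, shippable change at a time.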
And again, this is just an example of how to make APIs that can be migrated in this step-by-step fashion, rather than doing everything at once. And obviously, shout-out to the Next.js team, who actually did a good job at this, I think. Cool. Moving on to principle number four, which is always getting better and never getting worse. This is, in a way, related to the topic of incremental migration, but a little bit more general. One good example of how you ensure that your application has a certain standard is introducing lint rules, which say: in my app you have to do it this way, not the other way. Now, what sometimes happens, which isn't so great, is something like this, which by the way I just randomly copy-pasted from Vercel's internal super secret code base, so I'm sorry if anything interesting is in there. The actually interesting parts are the comments, which disable lint rules all over the place. And actually, I think that's good. I'll talk about why this is the right way to do it, but obviously it sucks to litter your code base with these comments, which add no value. But in a way, these comments are now a ledger of technical debt. Technical debt is usually this super abstract thing.
6. Allow Lists and Embracing Lack of Knowledge
For our Spaces product, we use external allow lists to avoid littering the code base. The value of rules is higher on new code. Embrace lack of knowledge and encode code base information in a machine-readable fashion. Example: Next.js middleware and the use of allow lists to ensure approval for context-specific changes.
But here you have a list, a to-do list of things you want to fix. Now again, this is kind of nasty to your code base, which is why for our Spaces product we're using external allowlists that don't litter your code base. You basically have a JSON file that says: for these rules, in these files, it's okay to violate them.
Now, coming back to the theme of getting better and not getting worse: what you really want to see for your team's code base is a graph like this. These are violations over time. You want to let this come down over time, and then you see these steps up. That's actually not bad: that's when you introduce a new rule. You allowlist all existing violations, and then, as this code actually gets touched, it gets better over time. I want to encourage everyone to allowlist this stuff. It's not actually important to migrate everything into the new state. The reason, and this is something that wasn't obvious to me early in my career at all, is that if you have some rule over your app, the marginal value of that rule is drastically higher on new code than on old code, because your old code is battle-tested. It went through QA, it went through production outages, it went through users having creative ideas about what a name is. You fixed all these things already, so you likely won't find anything interesting. The reason these rules exist is that when you write something new, at that moment it tells you right away: hey, you're making a mistake, go fix it now, where it's also really cheap to do. And so that's why it's so important to quickly get these opinions into your code base for future code, and it's not important to take a month out of your life to migrate everything you already have to the new state.
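A minimal sketch of such an external allowlist in TypeScript (the file names and the rule name are invented, and this is not the actual Spaces format): existing violations in listed files are tolerated as known debt, so only new code has to be clean and the violation count can only ratchet down.

```typescript
// An allowlist that lives outside the code it covers: per file, the
// rules that are known (and tolerated) to be violated there.
type Violation = { file: string; rule: string };

const allowlist: Record<string, string[]> = {
  "src/legacy/header.tsx": ["no-inline-styles"],
};

// Only violations NOT covered by the allowlist fail the check, so new
// code must be clean while old code improves as it gets touched.
function newViolations(found: Violation[]): Violation[] {
  return found.filter((v) => !(allowlist[v.file] ?? []).includes(v.rule));
}
```

Introducing a new rule then just means seeding the allowlist with every current violation, which is the "step up" in the graph, and letting CI reject anything beyond that.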
So I talked in an abstract fashion about rules like this, and I gave this lint rule example, but I actually want to talk about something slightly higher-level than a lint rule, which is associated with code style and things like that. Coming to the principle: embrace lack of knowledge. Again, you're the lead of this larger team. There are going to be people who are new to your team, and people who are more junior. They're going to be lacking some context about that code base, that they can't know or that they missed, or where they didn't read the memo. This all happens. And we as leaders of that team have to embrace that they might not have that context. That's why it's important to encode as many of the things that you know about your code base in some machine-readable fashion. I have an example here of what that means, and this goes way beyond linting, into application design opinions. This rule is, again, taken from Vercel's internal code base. We use Next.js middleware, which is a very powerful tool: it runs on almost all requests we're serving. And that means if our middleware is slow, our site is going to be very slow. One way you can make your middleware really slow is by fetching something from some other service. On the other hand, fetching something from some other service might be exactly what you want to do, because that's how you do interesting things: you call services. Which means that, yes, you sometimes have to call a service, and no, sometimes it's not the right idea. And so with an allowlist mechanism like this, what you can do is say: by default, making this change requires approval by someone who actually has the context to decide whether, in this context, this is the right idea.
And this combines the idea of an allowlist external to the code with the owners mechanism that we have, where the owner of the allowlist is not the same person who is writing the code. So you can easily enforce that your architect, or your leader-of-the-platform-team type of person, takes a second look and validates that in this case, yes, it's an okay usage. And again, this is not like the lint rule, where you eventually probably want a uniform code base. Here, you just ensure that, yes, we did take a look, and this is actually what we want to do.
7. Eliminating Systematic Complexity
Eliminating systematic complexity is crucial in building scalable applications. One common issue is version skew in distributed systems, where the client and server are not at the same version. This can cause problems when introducing new fields in the API. To address this, tools like the Zod library and the tRPC layer provide solutions, but can be complex. At Vercel, we introduced a serverless product that serves different versions of the site, ensuring the client and server are always in sync. This eliminates the problem altogether and makes building applications at scale easier.
Cool. Moving on to the final principle here: eliminating systematic complexity. Once again, you're the person who's leading the platform team for your organization, and you're seeing stuff that people struggle with. Every so often, I think it's important to take a step back, make a catalog of the things that people struggle with all the time, and try to get rid of those problems once and for all by introducing some kind of abstraction that makes it possible to just not worry about them anymore.
One good example of this type of issue is what, at least in big tech, is called version skew. Version skew is something that happens in distributed systems when you have a client and a server and they're not at the same version. Now, you might say: I'm a front-end engineer, I don't write distributed systems. And you're wrong. You're actually doing it on hard mode. In a data center, you control how things are deployed; you know how that works; you can put rules around it. On the other hand, when you have a web client and a server, you're not in control of when that web client redeploys. Someone goes to your site, and their client might stick around, certainly for a while. Meanwhile, you redeploy your server. So now you have an old client and a new server, and you're introducing a new field in your API, and the client has no idea. And so this is a problem that you have to manage now.
Now, there's almost an entire ecosystem around this. For example, the Zod library for TypeScript, or the tRPC layer on top of it, which allows you to express expectations around your APIs and handle the problem if those expectations aren't met. But that's a lot of complexity. There's a lot of stuff you have to do. And what do you do if you don't get the field? What do you actually do? It's a hard problem. And so we were seeing this issue at Vercel, and we were wondering: can we, again, eliminate the whole systematic problem? What we did was say: hey, we have this serverless product, and we don't have just one server at the latest version. We can actually serve versions of your site other than the latest deployment. So what we have now, in a very experimental version that we're rolling out over the next weeks, is that you can opt in to a mode in Next.js where the server that responds to your client will always be exactly the version that the client was at, which means you can forget about this whole problem altogether. You never have to worry about it again. In particular, server actions, which are already like function calls, really are function calls now, because you actually know which functions you're calling. It's not some abstract version of something with this name where you don't know which one you'll get: it's going to be exactly that one. And so again, this is just an example of the type of stuff you can do to make it easier to build applications at scale.
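The mechanism can be sketched in a few lines of TypeScript. The header name and deployment ids here are invented for illustration, and this is not the actual Next.js implementation; the idea is simply that the client identifies the deployment it was served from, and the request is routed to that same version rather than to the latest one:

```typescript
// Sketch of version-skew elimination: route each request to the
// deployment the client was served from, so client and server
// always agree on the API shape.
type Handler = (body: string) => string;

const deployments = new Map<string, Handler>([
  ["v1", (body) => `v1:${body}`],
  ["v2", (body) => `v2:${body}`],
]);

const latest = "v2";

function route(headers: Record<string, string>, body: string): string {
  // Clients that predate pinning fall back to the latest deployment.
  const version = headers["x-deployment-id"] ?? latest;
  return (deployments.get(version) ?? deployments.get(latest)!)(body);
}
```

With this routing in place, an old client never sees a handler newer than itself, so the "old client, new server" mismatch simply cannot occur.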
Amazing. That's all I have today. This is the whole list. I don't have to go through it again. I think we have Q&A. Maybe it'll be helpful with questions and stuff. Thank you very much. We have time for maybe one or two questions, so let's get right to it.
8. Automating Code Migration and Removing Barriers
Automating code migration using ASTs can be beneficial, even when making large-scale changes. Google's robot, Rosie, automates the process by converting a 15,000-line change into many one-line changes, applied incrementally to avoid compounding potential mistakes. This approach removes barriers and allows for smoother code migration.
There's a question here that I find really interesting, but I want to expand on it a little. The question is: you mentioned to always migrate incrementally, but what about using ASTs to automatically migrate huge amounts of code? And to that, talk a little about when it is worth automating things. Is it a benefit? Is it a distraction? No, it's definitely worth automating things. Even if you automate a change, I would still apply the principle. At Google, there's a robot called Rosie, and it will take your 15,000-line change, turn it into 15,000 one-line changes, and manage applying them to your code base in an incremental fashion, so that you don't eventually have to say, oh shit, I made one mistake, and roll back the whole 15,000-line change. It kind of recursively applies the principle. That's really cool. Remove the barriers, Google.
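A Rosie-style split can be sketched like this (heavily simplified TypeScript; the real system is far more sophisticated): one large set of generated edits becomes independent per-file changes that can be applied, reviewed, and rolled back one at a time.

```typescript
// Rosie-style change splitting, simplified: group one large set of
// edits by file so each file's change can be applied and, if needed,
// rolled back independently of the others.
type Edit = { file: string; patch: string };

function splitChange(edits: Edit[]): Edit[][] {
  const byFile = new Map<string, Edit[]>();
  for (const e of edits) {
    const group = byFile.get(e.file) ?? [];
    group.push(e);
    byFile.set(e.file, group);
  }
  return [...byFile.values()];
}
```

A single bad edit then only requires reverting one small change, not the whole migration, which is exactly the incremental-migration principle applied to automation.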