Parse, Don’t Validate

Most JavaScript applications first use JSON.parse to create an untyped object, and then validate it and narrow the data type to the expected one. This approach has performance and security problems: even if the data is invalid, the whole JSON string has to be parsed before the data is validated, instead of failing at the parsing stage (e.g., when a number is passed instead of a string in some property).

Many languages support parsing JSON strings directly into the expected types, but this is not natively supported in JavaScript or TypeScript.

In this talk we will show how the power of TypeScript, combined with the new JSON Type Definition specification (RFC 8927) and the Ajv library, can be used to parse your data directly into the expected application-defined type faster than JSON.parse, and also how to serialize data of a known type approximately 10 times faster than JSON.stringify.



Hello, I'm Evgeny and this is Jason. We're going to talk today about JavaScript and how to ensure data correctness, but before that, we'll give a brief introduction. We've done lots of great things together: we worked together at MailOnline and at Threads, and when I wrote the Ajv library, Jason also helped a lot. Currently, I've founded SimpleX Chat, a messaging platform that is the only one of its kind without user identifiers. But that's not what the talk is about. So I'll hand over to Jason to introduce himself.

Thanks Evgeny. Obviously, you know by now I'm Jason Green. I'm Director of Technology at Thread Styling. Threads is a fashion tech company pioneering the world of personalized luxury shopping through chat and social media. I also previously worked with Evgeny as a principal engineer at MailOnline. I've been a long-time user of data validation with JSON Schema, and in particular with Ajv, which I've witnessed grow and mature so much over the years. I'm an early investor in SimpleX Chat as well.

Yeah, Ajv's growth has been indeed crazy. It went from nothing in 2015, when it started, to 350 million downloads every month now; probably every JavaScript application uses it. So why do we want to talk about this? In all these projects we've done some really cool things. We built a content creation tool at MailOnline: a very complex in-browser application with hundreds of thousands of lines of JavaScript code that allowed editing, and the whole MailOnline website is managed through it. And then at Threads, which I designed and engineered and which Jason also joined, we built Storymaker. Mostly Jason built it; I'm just basking in the glory. It was a content management system for Instagram, and we definitely learned a lot from the previous project.
And in all of those projects we have been invariably hitting the same problems: security, reliability, data validation. Whatever project we do, the problems are the same. I've done a lot of Haskell, and the maxim "parse, don't validate" belongs to Alexis King, one of the best and most generous Haskell engineers out there, who proposes parsing as an alternative to validation. So rather than just checking that your data is correct, you produce a proof of validity and carry it around: not just check that your data is correct, but obtain that proof in the form of a different type and use it across your application. And what we learned about this in JavaScript is that you should really not be using native JSON handling naively; you should be doing something else. Jason, over to you.

So, as we all know, JSON is a widely used format that's generally considered flexible and easy to work with. However, it's important to be aware of some of its potential problems and challenges. JSON parsing is particularly wasteful. It's not something you'll notice in your day-to-day debugging, but parsing JSON can be a very wasteful process, as you need to parse the entire piece of data before you can even begin to check whether it's valid. Because of the potentially complex and nested nature of JSON, it can then be particularly time-consuming to validate. Many of us who started in JavaScript have come to love working with TypeScript. But then you throw a big blob of unstructured JSON into the mix, and suddenly you're back to square one: none of your types matter, and everything is unknown again. It also has some security issues, ones that I wasn't very aware of, despite working with JSON for a long time, until looking into it.
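The "parse, don't validate" idea can be sketched in a few lines of TypeScript (a minimal illustration in the spirit of Alexis King's original non-empty-list example; the names here are made up for the sketch):

```typescript
// Validation answers yes/no and throws the proof away:
function isNonEmpty(xs: string[]): boolean {
  return xs.length > 0;
}

// Parsing instead returns a NEW type that carries the proof,
// so downstream code can never receive an unchecked value:
type NonEmpty<T> = [T, ...T[]];

function parseNonEmpty<T>(xs: T[]): NonEmpty<T> | undefined {
  return xs.length > 0 ? (xs as NonEmpty<T>) : undefined;
}

function head<T>(xs: NonEmpty<T>): T {
  return xs[0]; // safe: the type guarantees at least one element
}

// head(["x"] as string[]) would be a compile-time error;
// head is only callable after the undefined check:
const parsed = parseNonEmpty(["a", "b"]);
const first = parsed ? head(parsed) : undefined;
```

The same shift of perspective applies to JSON: instead of a validator that returns a boolean for an `unknown` blob, you want a parser that returns your application type or a failure.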
If you have very, very deep structures, they can actually exhaust your call stack. This can happen just because of the data itself, or it can be a deliberate attack, with very deeply nested structures being sent to your APIs. You can also suffer from very large blobs of data being sent to your APIs as a denial-of-service attack, and once again, before you can even tell whether the data is valid, your API has to dutifully parse those blobs, which is very wasteful. It's even possible to do prototype pollution attacks via JSON as well.

Before we concern ourselves with performance and reliability, it's important to think about when performance and reliability actually matter. It seems like an obvious statement; most people wouldn't go out of their way to argue that they're not important. But they're not going to matter in every situation; it really depends on various factors. Obviously a slow app is better than no app at all, so if you have an application that's delivering value, you may have much bigger issues to face before worrying about performance and reliability. Particularly in the early stages of app development, you'll be much more concerned with time to market: if your app isn't even available yet, that's obviously the bigger issue. You'll be concerned about budget, the overall user experience of your application, and of course what your users need and what's most pressing to them. However, it does become an issue when performance is affecting user experience and satisfaction. That risks losing users: people who leave your application or site because it didn't load fast enough may not come back, which is what we refer to as high bounce rates. Even worse, if reliability is your issue and your customers are losing their work or their data is becoming corrupted, that's a big problem that in the best case results in some apologies.
In the worst case, you may actually end up paying for it in some way, through compensation or discounts to keep people happy. There is actually a solution to part of this problem in the Fastify framework, a replacement for Express. It tackles the serialization part of the problem: by defining the shape of the inputs and outputs in JSON Schema, Fastify is able to serialize responses much more quickly. It can get quite a good speed increase because it knows the structure of the data it's supposed to be returning; it can take a lot of things that would normally be loops and turn them into straight property access.

Speaking of schemas: for a long time, JSON Schema was the only way to define the format, or the type, of the data. It started in 2009, and since 2020 there is an alternative specification, JSON Type Definition (JTD), created to address its shortcomings. You could spend quite some time comparing the pros and cons of the two specifications (you can pause and study the comparison later). Fundamentally, JTD is defined by its simplicity, and one of the important qualities it supports is discriminated unions. It has limitations in what it can express, but it's extremely well aligned with type systems, unlike JSON Schema. JSON Schema, on the other hand, has extremely wide adoption today and is part of API specifications such as OpenAPI, but it doesn't support discriminated unions. It's a long comparison and a non-trivial trade-off, and I've experienced it first-hand with the Ajv library, which supports both specifications. To skip to the recommendation I can give: JTD is really much, much better for the absolute majority of API use cases. If you're building an API, you should be using JTD, full stop.
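The "loops become straight property access" point can be sketched with a toy hand-written serializer (a minimal illustration of the idea, not Fastify's actual fast-json-stringify implementation; the `User` shape is made up for the example):

```typescript
interface User {
  id: number;
  name: string;
}

// What a schema compiler could generate for the User shape:
// because the keys and their types are known up front, there is
// no key enumeration and no runtime type dispatch, unlike the
// generic traversal JSON.stringify must perform.
function serializeUser(u: User): string {
  return `{"id":${u.id},"name":${JSON.stringify(u.name)}}`;
}

const u: User = { id: 7, name: "Ada" };
serializeUser(u); // same output as JSON.stringify(u), produced more cheaply
```

Schema-driven serializers generate functions of this shape automatically from the schema, which is where the speed-up comes from.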
And for the cases where JTD is actually worse than JSON Schema, it's a big question whether you should be using a schema at all: you should probably be writing code instead. Let me demonstrate with some examples, in a somewhat funny way, in the style of those internet memes: JSON Type Definition versus JSON Schema, "what is this?". Look at this schema. The majority of software engineers who discover it would say: this is obviously an object, what else can it be? And it has a property, and the property obviously must be a string. That's exactly how JTD treats this schema. JSON Schema has a more interesting view: if the data happens to be an object, and it happens to have this property, then the property must be a string. Any other data will be valid, meaning any number, any string, any array, or any object that doesn't have the property foo are all valid. This has cost engineers millions of hours of debugging, and in the Ajv repository probably 50% of the questions I've answered are about this kind of gotcha. Or take this one: clearly "properties" is misspelled here, and JTD responds correctly that this schema is simply invalid. What else can it be? Well, JSON Schema has a different view: it believes this is a valid schema that just has some property whose meaning we don't know, and any data is valid according to it. Again, millions of hours of debugging have been spent fixing errors like this in JSON Schemas. Or take arrays. JTD has this structure for an array: the data must be an array, its elements must obviously be objects, each object must have the property foo, and that property must be a string. There is no ambiguity about the data shape. Now take a similar schema in JSON Schema, where the only difference is the keyword. What does it really mean?
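The first meme example can be written down concretely. The validity comments below follow the two specifications (RFC 8927 for JTD; JSON Schema's applicator semantics, where `properties` only constrains keys that happen to be present on objects):

```typescript
// The exact same schema text, read by the two specifications:
const schema = { properties: { foo: { type: "string" } } };

// JTD (RFC 8927): data MUST be an object, MUST have "foo",
// "foo" MUST be a string, extra properties are rejected by default.
//   { foo: "hi" }   -> valid
//   { foo: 1 }      -> invalid (foo is not a string)
//   {}              -> invalid (missing "foo")
//   42, "x", []     -> invalid (not an object)

// JSON Schema: "properties" only applies IF the data is an object
// AND happens to have that key; everything else passes.
//   { foo: "hi" }   -> valid
//   { foo: 1 }      -> invalid (the one case it catches)
//   {}              -> valid
//   42, "x", []     -> valid
```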
Well, one can guess: the data doesn't have to be an array, the elements don't have to be objects, and if they are objects, they don't have to have the property foo. A lot of different data shapes can be valid according to this schema, which again causes a lot of debugging. And the clincher is this: what if we put square brackets here? We thought it's an array, why not put square brackets? JTD just says it's an invalid schema. In JSON Schema, this is unfortunately valid: it validates only the first item and ignores all the others. It's an exceptionally common support question in Ajv, and millions of hours have been spent debugging bugs like that. Fundamentally, JSON Schema is an exceptionally error-prone specification that requires lots of additional annotation to express what you actually want. Ajv found a solution that turned out to be extremely popular, called strict mode: it effectively turns all those cases into mistakes, which is an extension of (or deviation from) the JSON Schema specification, and people use it. But fundamentally it just means that JTD is simpler and in many cases better for your code. Over to Jason, who will show you how to do some magic with JTD and TypeScript and do some actual data validation and parsing.

All right, time to look at some code. Just a remark on a few of the points Evgeny made: having worked with a lot of JSON Schema, we ended up building a lot of things around these decisions, and it takes a long time to work all those things out. With JTD, I've found it a lot more intuitive, simpler, and more straightforward. So let's have a look. There are obviously many different kinds of data you can validate; I'm going to jump straight into a nested object with, I think, enough complexity to be a good example of how we build a schema. I've been playing a lot of Mario Kart with my family lately.
Surprisingly, my wife is quite good at it as well, and we're both just jumping in and playing a lot of Mario Kart, so I couldn't come up with any other example than a Mario Kart character. So here is our data: this is our type, our interface. It has a name and an optional surname. We all know Mario Kart characters have weight; the heavier guys are the best, obviously. There's a createdAt date, just to give us an example here, and of course an array of weapons. Each weapon has an ID, a name, which is an enum, and a damage counter for how much damage the weapon does. I'm going to clear all this out, because I want to show the process of writing the schema more than anything else. We have this very interesting utility type which Ajv comes with, called JTDSchemaType. When we use this type, passing in our Mario Kart character, it will only allow us to create a schema that is valid for this particular data type. So I don't really have to know how to write a schema at all yet: I can just go straight to my editor and trigger the autocomplete suggestions for the possible properties I can have here. And we have a properties value; we're going to need that to start with, because when we're defining an object, this is how you do it, with the properties property. That sounds awful, but you get the idea. If I trigger the suggestions again, now we can start putting in the actual properties that our Mario Kart character needs. You'll notice that surname is missing from here; we'll get to that in a second. We're just going to start by defining a couple of these. So we need a type, obviously, and the type of this one can only be one of two things actually, but in our case it's string.
Now you'll see that when I go to put in the createdAt type, it's only giving me the option of timestamp; it's no longer allowing me to put in a string, because it knows from the Mario Kart character type that createdAt is a Date. Finally we have weapons; I'll get to defining that in a second, because I first want to jump to how we define the surname. Surname is an optional property, and if I look at my suggestions again, we have optionalProperties, and in there, as soon as I start typing "s", we get surname as well. It just makes everything very intuitive, and from this point on it's exactly the same as any other property you're defining. I also forgot the weight of the character. This is a number, and as you can see there's quite a bit of variety in the numeric types that JTD supports. Given this is weight, I think uint8 makes sense: a small positive integer. And if I got that right... oh yes, this is the best part: it's not valid, it's not correct. This weight is not fulfilling the needs of the type here in the schema, and that's because I've made weight nullable in the interface; so here we have to pass nullable: true, and then it's valid again. Elements: when we want to define an array, as you might remember from the example Evgeny showed before, the keyword is elements, and from that point on you're once again just defining more of the same schema, sort of a nested value now. In this case we want properties, because this is an array of objects, so to define an object in JTD we need properties. We can once again go about putting in the rest of these values: the ID is a number, and I'm going to go with uint32, and then the name. The name is another new type we've come across: we've made this an enum.
This is actually more of an ad hoc enum rather than using the enum keyword in TypeScript. In this case we just need to pass in all the possible values for the enum, in the form of an array. As you can see, it starts predicting the possible values; it knows them from the type we've defined, which is really useful. And "red shell". Finally we have damage, which is another number; we're going to make this a float32, so we can have decimal points as well. And that is the full schema defined. So what can we do with our Mario Kart schema? We can serialize data: if I have some JavaScript object that fulfills this type, I can create a serializer from my Mario Kart schema. (Apologies, this is from a previous example.) So here's our Mario Kart serializer. We can now pass in as much data as we like, and using this type-aligned serializer it will serialize about 10 times faster than JSON.stringify. We can also, of course, parse. Here is a JSON string of Mario Kart character data. We can compile a parser using our schema as well, and the best part is that it knows the type we're expecting. When we use this parser to parse the string, the resulting type is going to be either a Mario Kart character or undefined; undefined is the case when the data is in any way invalid. If the data is valid and we have a result, then we know this is Mario Kart data: we have all of our typings set up, we get our autocomplete, we can be sure of the shape of the data, and it makes all of our code from that point on so much easier. And if the result is undefined, we can refer to the error message via a property on the parse function. I think that's it for the code example. This is really cool and exciting.
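Put together, the walkthrough above produces roughly the following interface and schema (a reconstruction from the description: with Ajv you would annotate the schema as `const characterSchema: JTDSchemaType<Character>` using the utility type from "ajv/dist/jtd"; the enum value "banana" is assumed for illustration, only "redShell" is mentioned in the talk):

```typescript
interface Character {
  name: string;
  surname?: string;               // optional
  weight: number | null;          // nullable
  createdAt: Date;
  weapons: { id: number; name: "redShell" | "banana"; damage: number }[];
}

const characterSchema = {
  properties: {
    name: { type: "string" },
    weight: { type: "uint8", nullable: true },
    createdAt: { type: "timestamp" },
    weapons: {
      elements: {                 // JTD's keyword for arrays
        properties: {
          id: { type: "uint32" },
          name: { enum: ["redShell", "banana"] },
          damage: { type: "float32" },
        },
      },
    },
  },
  optionalProperties: {
    surname: { type: "string" },
  },
} as const;

// A value that satisfies the interface (and hence the schema):
const sample: Character = {
  name: "Mario",
  weight: 90,
  createdAt: new Date(),
  weapons: [{ id: 1, name: "redShell", damage: 9.5 }],
};
```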
I actually created this type utility for JSON Schema, and the equivalent utility for JTD was contributed by an exceptionally bright JavaScript engineer, who did it all on his own. It gives you huge powers in TypeScript, because this approach of parsing JSON directly into the application type, rather than into some generic data structure, and the opposite, serializing a specific type rather than a generic data structure, is used in almost all typed languages, like Haskell, Go, and Rust. A generic JSON data structure is usually used in scripting languages like JavaScript and Python, but it fundamentally undermines performance and reliability; using type-aligned parsers gives you high security and high performance at the same time. As Jason said, compileSerializer actually generates JavaScript code under the hood, so if it's used repeatedly, it gives you roughly a 10x performance boost compared to JSON.stringify. And the parser, if the data is valid, will parse in pretty much the same time as JSON.parse, but it validates as it goes. If somebody sends an array instead of an object, it will fail on the first character. If somebody sends a property that's not allowed, or of the wrong type, the parser will not parse the whole data structure, which could be huge; it will simply fail at the first invalid character in the JSON string. That gives you much better defense against all kinds of attacks, and you can never end up with properties you don't expect in your data, which might result in prototype pollution or other problems. So this really improves both security and performance. And by now I know it's been used in large-scale applications; Wastestream, my previous company, where I was on the engineering team, kindly allowed us to mention that they use this approach as well.
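The calling convention of the compiled parser can be mimicked in plain TypeScript (a self-contained sketch of the interface only: Ajv's real compiled parser is generated code that fails at the first invalid character, whereas this toy version validates after JSON.parse; the `Weapon` shape is made up for the example):

```typescript
interface Weapon {
  id: number;
  damage: number;
}

// Same shape as Ajv's compiled JTD parser: a function returning
// T | undefined, with a `message` property set on failure.
type Parser<T> = ((json: string) => T | undefined) & { message?: string };

const parseWeapon: Parser<Weapon> = (json) => {
  try {
    const data = JSON.parse(json);
    if (typeof data?.id === "number" && typeof data?.damage === "number") {
      return data as Weapon; // proof of validity carried in the type
    }
    parseWeapon.message = "data does not match the Weapon shape";
  } catch (e) {
    parseWeapon.message = (e as Error).message;
  }
  return undefined;
};

const ok = parseWeapon('{"id":1,"damage":2.5}');  // typed as Weapon | undefined
const bad = parseWeapon('{"id":"one"}');          // undefined, message set
```

Downstream code only ever sees a `Weapon` after the undefined check, which is exactly the "carry the proof around" discipline from the start of the talk.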
One more note: for the Storymaker project, all of the state of the individual stories you create is managed on the server, and there's a lot of back and forth of big pieces of JSON. We did all the validation with JSON Schema; however, because we wanted it to be type-aligned, what we ended up doing was similar in a roundabout way, in that we generated JSON Schemas from our types. So we didn't have any JSON Schemas doing anything you couldn't express in the TypeScript type anyway, because we didn't want to second-guess things and have an inconsistency there. So yeah, I think it's the right way to go. It may seem more restrictive in what you can do, but I agree with your earlier statement: those things you probably shouldn't be doing in a schema anyway. So yes: JTD is fine, and Ajv makes it exceptionally powerful. Use them both. That will make your job easier and your APIs more secure, and you'll save yourself lots of hours lost on debugging compared to the alternatives. That's our recommendation from this talk. Thank you. Thanks.
26 min
17 Apr, 2023
