AI Generated Video Summary
Source maps make it possible to understand transpiled, bundled, or minified code. Post hoc debugging with debug identifiers helps identify files. Issues with source maps include hash collisions and missing function names. Various techniques can be used to determine the function that caused an error. Source maps can store additional information, and improvements can be made to path resolution and column positions. Code point and token positions can differ across browsers. Detecting source maps can be challenging without a standardized JSON schema.
1. Introduction to Source Maps
Source maps let you understand how transpiled, bundled, or minified source code was produced. Once we have the source maps, we need to marry them with the minified file. This can be done through a comment at the end of the file or a response header. Ad hoc debugging allows you to see the error as it happens and use the exact code that triggered it.
Hey, everyone. My name is Kamil Ogórek, and I'm a senior software engineer at Sentry, where I'm currently working on the processing team, where one of the things we work on is actually processing source maps. I'm also a core team member of tRPC, and as you can probably tell, I really, really do love source maps, which brings me to the alternative title of this talk: why source maps are so hard to get right.
We won't go into the fine details of how source maps work; there are plenty of other resources you can use for that. However, we need to understand a few very basic ideas. Source maps let you understand how transpiled, bundled, or minified source code was produced. They map tokens from minified code back to the original code, which then lets you see more usable errors, rather than something like "undefined is not a function" pointing at mangled code. If you want to understand in more depth how the format is actually encoded, there is a great blog post from a friend of mine, Arpad, on his blog. I really encourage you to read through it.
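As a minimal sketch of that linkage: a generated file points at its map, and the map itself is plain JSON. The field values here are illustrative, and `resolveMapUrl` is a hypothetical helper, not a standard API:

```javascript
// A minified bundle usually ends with a comment pointing at its map
// (the map can also be advertised via the SourceMap response header):
//
//   //# sourceMappingURL=bundle.min.js.map
//
// The map itself is a plain JSON file; a minimal v3 example:
const sourceMap = {
  version: 3,                   // always 3 for the current format
  file: 'bundle.min.js',        // generated file this map describes
  sources: ['../src/index.js'], // original files
  names: ['callme'],            // original identifiers (optional!)
  mappings: 'AAAA,SAASA',       // Base64 VLQ-encoded token mappings
};

// Hypothetical helper: the sourceMappingURL is resolved relative to
// the URL the generated file was served from.
function resolveMapUrl(bundleUrl, sourceMappingURL) {
  return new URL(sourceMappingURL, bundleUrl).href;
}

console.log(resolveMapUrl('https://example.com/assets/bundle.min.js', 'bundle.min.js.map'));
// → https://example.com/assets/bundle.min.js.map
```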
2. Debugging with Post Hoc and Debug Identifiers
There is another way to debug called post hoc debugging. The main problem is the lack of identity, as we cannot determine which file was used. Using a release as a unique identifier helps, but it's not enough. We can use debug identifiers, which are unique hashes based on the source maps, to ensure the files are the same.
However, there is another way to debug, which is post hoc debugging, and it's what Sentry is actually doing: an error happens, it's sent to us, and now we need to figure out what code was actually used when the error was produced, after the fact.
This brings me to the problems with this right now. The first and biggest one is the lack of identity, which means we are not able to tell which file was actually used, even if you upload it to us. Let me very briefly describe how that works. You produce some files: you have minified files and their corresponding maps. You push them to us through one of the tools that we provide, either the CLI binary or the plugins for your bundlers, or, if you want, a plain API call, and you store them in Sentry. You need to use a release, which is a unique identifier for your build. We need this because it is the only way we can attach some sort of identity to a file: other than the filename, there is nothing else that really makes it special.
Over time, you can have ten bundle.min.js files uploaded for the same release, and we are not able to tell which one is which. That's why we also have something called a dist, which you can think of as a directory or a bucket for files, so we can have the same filenames for multiple environments, like production, development, or staging. However, it's still not enough, because you can have the same filename, something like bundle.min.js, which is very, very common, produced at completely different times: today, a month later, six months later, and so on, and those names can still be the same. You can use hashed names, but this is not always possible, because some people prefer to handle caching using HTTP headers, and hashed names are sometimes annoying to deal with.
So let's say we have all the files. Now an error happens. The first frame points to https://trpc.io/assets/bundle.min.js. We don't really care about the host name, we can skip it for the processing part, which leaves us with assets/bundle.min.js. The problematic part is that the file was served from somewhere: bundle.min.js lives inside assets. But what happens if the structure of your project, when you uploaded it, was different? If you just included the dist directory in the upload, your file lives under dist/frontend/bundle.min.js. This is not the same path, and the two will not match, which means we cannot tell that the file you served, the one that actually caused the error, is the exact same file that was uploaded. It would be just a guess.
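The guessing described above can be sketched like this: a toy matcher that compares the served path against uploaded artifact paths by their longest common suffix. This is illustrative only, not Sentry's actual algorithm, and the paths are made up:

```javascript
// Match a served URL path against uploaded artifact paths by counting
// how many trailing path segments agree; pick the longest match.
function guessArtifact(servedPath, uploadedPaths) {
  const served = servedPath.split('/').filter(Boolean);
  let best = null;
  let bestLen = 0;
  for (const path of uploadedPaths) {
    const parts = path.split('/').filter(Boolean);
    let len = 0;
    // Count how many trailing segments agree.
    while (
      len < served.length &&
      len < parts.length &&
      served[served.length - 1 - len] === parts[parts.length - 1 - len]
    ) {
      len++;
    }
    if (len > bestLen) {
      bestLen = len;
      best = path;
    }
  }
  return best; // still just a guess -- two uploads can tie
}

const match = guessArtifact('/assets/bundle.min.js', [
  'dist/frontend/bundle.min.js',
  'dist/backend/server.js',
]);
// match === 'dist/frontend/bundle.min.js'
```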
So what's the way around this? How can we make sure those files are the same files? We can use something called debug identifiers, which are very, very common in the native world. There you have something called debug files, and in our case we called it a debug ID, because it's very easy to remember. It works in a very similar way to sourceMappingURL; however, instead of using paths, you hash the whole produced source map and use a unique identifier based on that hash, a UUID in this case, and you stick it inside the minified file and inside the source map itself. You have to hash the source map instead of just the source, because different original source code can end up producing the exact same minified output, and therefore the same hash.
3. Issues with Source Maps and Debugging
If your bundler removes comments, inlines functions, or collapses function calls, the resulting minified file may have the same hash for different original source code, rendering source maps useless. To address this, we hash the source map content and associate it with a debug ID. We keep a list of files in the bundle and their corresponding debug IDs. A better solution would be to have a list of frames with their debug IDs, allowing for more precise debugging. Symbol servers can be used to store debug files with corresponding IDs, eliminating the need to guess file names or paths. However, a problem arises with naming and scopes in source maps, as function names may not be included. This can make error messages and stack traces unhelpful.
For example, your bundler may decide to remove comments, inline some functions, or collapse function calls. If you add a single new comment line to your source code and compile it, and the bundler strips all comments, you can end up with the exact same minified file as before, which means you have the same hash for different original source code. The source maps' mappings can then no longer be trusted, because the lines and columns will be completely off.
The API that you see here is completely made up; there is nothing like it yet, but it is one of the things we would need. Right now, what we do is keep a list of all the files produced inside the bundle in the global namespace, something like a window-level debug ID map from URL to debug ID. It's definitely not great, not perfect, but it works with what we have right now and lets us validate our ideas. The perfect solution would be something where the error is not just a stack string, but a list of frames, which would let us go frame by frame, ask for each frame's corresponding debug ID, and send it along with the event itself.
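The current stop-gap could be sketched like this; the global name and the frame shape are assumptions for illustration, not a real API:

```javascript
// The bundler writes a global map from served file URL to debug ID...
globalThis.__DEBUG_ID_MAP__ = {
  'https://example.com/assets/bundle.min.js':
    'f1a2b3c4-d5e6-4789-8abc-def012345678',
};

// ...and the SDK consults it frame by frame before sending the event.
function attachDebugIds(frames) {
  return frames.map((frame) => ({
    ...frame,
    debugId: globalThis.__DEBUG_ID_MAP__[frame.fileUrl] ?? null,
  }));
}

const frames = attachDebugIds([
  { fileUrl: 'https://example.com/assets/bundle.min.js', line: 1, column: 21000 },
]);
// frames[0].debugId now carries the bundle's identity
```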
This gives us another very nice perk: so-called symbol servers, which are used in the native world as well. A symbol server is a completely separate, first-party server where you can store all your debug files with their corresponding debug IDs. In our case it would be just source maps, for example on an S3 bucket. You let your tool know: here's the URL; whenever you need some source maps, or any other debug files, just go there and ask. If it has the corresponding debug ID, it has the file, and it will give you all the information you need. You set one URL in the config and you're done, nothing else. You do not need to guess file names, you do not need to guess paths or anything like that.
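With debug IDs, a symbol-server lookup reduces to building one URL per ID; the URL layout below is an assumption for illustration, not a standard:

```javascript
// Build the lookup URL for a debug file on a hypothetical first-party
// symbol server: one base URL in the config, no path guessing.
function symbolUrl(serverBase, debugId) {
  return `${serverBase.replace(/\/$/, '')}/${debugId}.js.map`;
}

symbolUrl('https://symbols.example.com/', 'f1a2b3c4-d5e6-4789-8abc-def012345678');
// → 'https://symbols.example.com/f1a2b3c4-d5e6-4789-8abc-def012345678.js.map'
```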
Something way more visible to the end user is the problem with naming and scopes. Names, the original function names inside the source maps, are not required. You can have a source map completely without them; you can skip the names array entirely. What can end up happening is that a function throws and you see an error like this: an undefined function, and the only frames you see are at x, at y, at z. It's completely useless. Everything appears to happen on line 1, column 21000 or so. It's not great. Take this code, for example. There is a function maybe which calls a function callme, which then throws an error, and then we call the function. In minified form it all collapses into a single line and a few function names with just one or two letters.
4. Detecting the Function that Caused the Error
To determine which function caused an error, we can use backward scanning by tokenizing the minified source and searching for the minified name of the function. Another trick is caller naming, where we go one step up the call stack and retrieve the column and line of the frame. However, the best approach is AST reconstruction, which provides all the necessary information but can be memory inefficient and computationally expensive for large files.
So in this case, it would be callme. But how can we know that? We have a few tricks up our sleeves. The first one is so-called backward scanning, where you take the minified source, tokenize it, and go one token at a time backwards. You start at Error, go back to new, go back to throw, and keep going until you see the function keyword followed by the minified name of the function. In our case that was xy. Once we have the function and xy tokens, we can ask for the position of the xy token and then ask the source map: now I know a new location, not the location of new Error but the location of xy; do you have a corresponding name for it? If yes, we're lucky, and we can tell that it was actually callme, not an anonymous frame, that caused the error. However, it's not always possible. With an anonymous function or an ES6 arrow function, it just won't work.
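A simplified version of backward scanning might look like this. The tokenizer is deliberately crude and the approach is illustrative, not Sentry's implementation:

```javascript
// Tokenize the minified source and walk backwards from the throw site
// until we hit a `function` keyword followed by a name.
function backwardScanForName(minifiedSource, errorIndex) {
  // Crude tokenizer: identifiers/keywords, or single non-space chars.
  const tokens = [];
  const re = /[A-Za-z_$][\w$]*|\S/g;
  let m;
  while ((m = re.exec(minifiedSource)) !== null) {
    tokens.push({ value: m[0], index: m.index });
  }
  // Find the first token at or after the error position...
  let i = tokens.findIndex((t) => t.index >= errorIndex);
  if (i === -1) i = tokens.length;
  // ...then scan backwards for `function <name>`.
  for (let j = i - 1; j > 0; j--) {
    if (tokens[j - 1].value === 'function' && /^[A-Za-z_$]/.test(tokens[j].value)) {
      // Return the minified name and its position, ready to be looked
      // up in the source map's `names` array.
      return { name: tokens[j].value, index: tokens[j].index };
    }
  }
  return null; // anonymous / arrow functions defeat this trick
}

const minified = 'function vm(){xy()}function xy(){throw new Error("x")}';
const throwAt = minified.indexOf('throw');
backwardScanForName(minified, throwAt); // → { name: 'xy', index: 28 }
```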
The second trick we can use is so-called caller naming, which assumes that the function that called us didn't modify the original name: it didn't reassign it anywhere, didn't modify it in some strange way, and didn't call it dynamically. Here is the original call stack: function maybe calls function callme, which produces the error; and the minified one: vm calls xy, which calls new Error. What we can do is, instead of asking for the location of new Error, go one step up the stack, to xy in this case, and ask for the corresponding column and line of that frame. This requires knowing the whole stack in advance. Going back to the earlier example, if you only get the first frame, at x in assets/bundle.min.js, you're out of luck; you need to consult the frame below it, the one at column 2430. This usually gives us some idea, and we can fall back to it, but it's also not great. It doesn't always work; it's still just a guess.
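Caller naming can be sketched as follows, with `lookupName` standing in for a real source-map query and the frame data made up for illustration:

```javascript
// For each frame, resolve the name at the *call site* one frame up,
// on the assumption that the caller used the original, unmodified name.
function callerName(frames, lookupName) {
  const names = [];
  for (let i = 0; i < frames.length; i++) {
    const caller = frames[i + 1]; // one step up the stack
    names.push(caller ? lookupName(caller.line, caller.column) : null);
  }
  return names;
}

// Hypothetical name mappings extracted from a source map.
const mappings = { '1:14': 'callme', '1:40': 'maybe' };
const lookup = (line, column) => mappings[`${line}:${column}`] ?? null;

callerName(
  [
    { fn: 'new Error', line: 1, column: 33 },
    { fn: 'xy', line: 1, column: 14 },
    { fn: 'vm', line: 1, column: 40 },
  ],
  lookup
);
// → ['callme', 'maybe', null]  (the outermost frame has no caller)
```

Note that the whole stack is required: with only the first frame there is no caller to consult, exactly as described above.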
The best idea we came up with, the one we use with success right now, and the one plenty of DevTools are using as well, is AST reconstruction. It takes the minified source and redoes the same work that bundlers and transpilers already did. Again, it's very memory inefficient, it's heavy, it requires a lot of work. The perk is that we get all the information we could ever need. For example, we can tell that a new Error was thrown inside a static method bar on class Foo. Or we can tell exactly that it was an object literal assigned to the variable A, and it was a custom getter called foo. And here we can tell that it was the constructor of class B. We know exactly where we are at the moment the error happened, because we reconstructed the whole thing. The problem, again, is that you need to process the whole AST, which can be very expensive for very large files, and the bundlers and transpilers already had this data.
5. Storing Information in Source Maps
Source maps can store information inside them, like the pasta format used by Bloomberg engineers. Scopes can be added to the source map JSON file to encode offsets and names. The nested nature of scopes allows for easy reconstruction and binary search. There are also smaller issues that could be addressed in the next specification revision, such as conformance tests.
They could somehow store this information inside the source map itself, which is an idea already started by engineers from Bloomberg with their so-called pasta format, which you can read more about. There is also an RFC, which you can read about in the repository for the new source map RFCs.
Instead of doing this work over and over again, we would add a new attribute called scopes, or scope names, something like that, inside the source map JSON file, and it would encode all the offsets in bytes together with the name of each scope. Scopes are nested: an outer scope can contain an inner scope, which can contain another inner scope, and so on. That makes the structure very easy to reconstruct, because the ranges overlap: 15 is within the bounds of 10 and 30, 16 to 20 is within the bounds of 15 and 25, and so on. Once you have this pyramid, you can do a binary search and find the best match for your case.
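That lookup could be sketched like this; the scope encoding shown is an assumption based on the proposal's idea of nested ranges, not the final format:

```javascript
// Nested scope ranges, sorted by start offset: a "pyramid" where any
// later range that still contains the position is always deeper.
const scopes = [
  { start: 10, end: 30, name: 'outer' },
  { start: 15, end: 25, name: 'inner' },
  { start: 16, end: 20, name: 'innermost' },
];

// Find the innermost scope containing `position`.
function findScope(scopes, position) {
  let lo = 0;
  let hi = scopes.length - 1;
  // Binary search for the last scope starting at or before `position`...
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (scopes[mid].start <= position) {
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  // ...then walk back to the first range that actually contains it.
  for (let i = hi; i >= 0; i--) {
    if (scopes[i].start <= position && position < scopes[i].end) {
      return scopes[i];
    }
  }
  return null;
}

findScope(scopes, 17); // → { start: 16, end: 20, name: 'innermost' }
findScope(scopes, 26); // → { start: 10, end: 30, name: 'outer' }
```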
Those are the two large issues. However, there are a few smaller ones that are not critical but would be nice to fix in the next revision of the specification. The first one is conformance tests. If you've never heard of them, a conformance suite is a set of tests that confirms that your tool, your engine, your implementation follows the specification to a T. If it passes all the conformance tests, it has been implemented correctly. This is what Test262 is: it's what browsers and other JS engines run to make sure the ECMAScript spec is actually implemented correctly.
6. Path Resolution and Column Positions
Path resolution can be frustrating, as it's difficult to determine the correct path. Original sources are usually encoded inside the source map itself, but this is optional. Some tools only provide the sources array and a sourceRoot, which can come in various formats. Guessing the actual file location can be challenging. Additionally, column positions can differ across browsers due to encoding differences and a lack of specification.
The second one, which is way more annoying, is path resolution, because you can never be sure what the path is. One of the features we have at Sentry is called context lines: for the location where the error happened, we want to show you a small snippet, like in your editor, a few lines above and a few lines below. But to do this, we need the original sources somehow.
Original sources are usually encoded in sourcesContent inside the source map itself; it's just an array of original source code, basically. However, it's also optional, so sometimes you're out of luck and you don't have access to it. Whenever you use the Sentry CLI, we will most likely inline it and fix this for you. However, there are still tools that do not produce it. They just produce the sources array and the optional sourceRoot I mentioned. Those are the paths the tool used while producing the bundle, in case you want to go back and see where the files were located, and sourceRoot is a common prefix for all of them. The problem is that they can be of any format. They can be absolute, they can be relative, they can have a custom host, like webpack is doing right now, they can have custom namespaces, you name it. It's the wild, wild west. So you can end up with something like this: you have an error whose frame points to example.com/dist/bundle.js. Its corresponding source map points two directories up, ../../assets/bundle.js.map, because yes, you can also have path traversal in here. And your sources point to a webpack:// host followed by the namespace webpack produced, in this case something like webpack://some-namespace/lib, and in there src/api/issues. Somehow you need to guess, or rather figure out, where the file was actually located in case someone uploaded it to you.
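A simplified version of that resolution logic, ignoring most of the wild-west cases, might look like this (illustrative only; the inputs are made up):

```javascript
// Resolve a source map's `sources` entries: join each with `sourceRoot`
// (if any) and resolve against the URL the map itself was fetched from.
// Real-world inputs (webpack:// hosts, path traversal, custom
// namespaces) make this far messier than shown here.
function resolveSources(sourceMapUrl, map) {
  const root = map.sourceRoot ?? '';
  return map.sources.map((source) => {
    const joined =
      root && !root.endsWith('/') ? `${root}/${source}` : root + source;
    try {
      return new URL(joined, sourceMapUrl).href;
    } catch {
      return joined; // not URL-resolvable, return as-is
    }
  });
}

resolveSources('https://example.com/assets/bundle.js.map', {
  sourceRoot: '../src',
  sources: ['api/issues.js'],
});
// → ['https://example.com/src/api/issues.js']
```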
Fun fact: right now, our simple URL join function is over 50 lines of code, in Rust. So much fun. Column positions are something that, if you think about it, shouldn't be a problem anymore, we're already in 2023, but for some reason it still happens, because encodings are hard and there is no specification. Consider this very trivial code: it's just a function that throws, but there is a comment with a few emojis inside it. If you run this code in all the modern browsers right now, Firefox reports the error thrown at column 13, which is the actual start of the function; in Chrome it's column 16; in Safari it's column 19. So what happens? Firefox is counting code points.
7. Code Points and Token Positions
Browsers count column positions differently: Firefox counts code points, Chromium counts UTF-16 code units, and Safari treats the opening parenthesis of the call as the function's position. Some tools produce two tokens for a function with the whitespace attached to the keyword, while others attach it to the name. To map a minified location reliably, compare positions with equal-or-greater checks to absorb possible off-by-one errors; debuggers need exact token positions.
Yes, code points. In this case the offset is three, because each fire emoji is a single code point. Chromium counts code units: in UTF-16, every fire emoji is made of two code units. And Safari, for some reason, decided it's best to treat the opening parenthesis of the token as the position of the function call, not the function name itself. Fun. It's still better than this one, though, which is an off-by-one error that, of course, we had to have in source maps as well. If you have this function, function what, some tools will produce two tokens, where the first token is function plus a whitespace and the second is what, the name of the function. Other tools will produce function without the whitespace, and the second token is a space plus what. So if you want to make sure that your minified location points to a very specific place in the original source, you cannot require exact equality: you need to use equal-or-greater, just in case this off-by-one error is present. The same applies to the debugger, because if you click the breakpoint toggle inside the DevTools on a token or a function name, it needs to know the exact token position.
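The browser differences come down to what a "column" actually counts, which is easy to see in plain JavaScript:

```javascript
// The same emoji counts differently depending on what you count.
const fire = '🔥';

console.log(fire.length);         // 2 -- JS strings expose UTF-16 code units
console.log([...fire].length);    // 1 -- string iteration yields code points
console.log(fire.codePointAt(0)); // 128293, i.e. U+1F525

// Three emojis in front of a token shift its column by 3 when counting
// code points (Firefox) but by 6 when counting code units (Chromium):
const emojiComment = '🔥🔥🔥';
console.log(emojiComment.length);      // 6
console.log([...emojiComment].length); // 3
```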
8. Detecting Source Maps
To detect whether a file is a source map, we currently rely on the .map extension and specific attributes like 'version' and 'mappings'. However, since source maps are just JSON files, there are no magic bytes or other indicators. It would be beneficial to create a standardized JSON schema for source maps so they can be identified reliably.
And the last one: are you sure that it's a source map? Because, well, a source map is just a JSON file, right? So how do we detect that it's actually a source map? Right now we do this in two ways. The first is the .map extension; however, you can give any file a .map extension. So what most tools do is look for a version attribute with the literal value 3, and maybe a mappings attribute as well. If you have all of those things, then it's most likely a source map. But only most likely, because it's just JSON, and because it's just JSON we cannot have any magic bytes or anything like that appended or prepended to the file. So we are out of luck. What would be great is to create a JSON schema which would then be standardized, and we could all use it everywhere. That would make sure that the file we're actually looking at is a source map.
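The heuristic described here, written out as code. It remains a best-effort guess, since there is no schema or magic byte sequence to validate against:

```javascript
// Best-effort detection: valid JSON, version literally 3, and a
// string `mappings` field. Anything can still fake this shape.
function looksLikeSourceMap(text) {
  let json;
  try {
    json = JSON.parse(text);
  } catch {
    return false; // not even JSON
  }
  return (
    json !== null &&
    typeof json === 'object' &&
    json.version === 3 &&                // the literal value 3
    typeof json.mappings === 'string' && // VLQ mappings blob
    (json.sources === undefined || Array.isArray(json.sources))
  );
}

looksLikeSourceMap('{"version":3,"mappings":"AAAA","sources":[]}'); // true
looksLikeSourceMap('{"version":"3.0.0"}'); // false -- some other JSON file
```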