1. Preserving Retro Game Footage with WebGL
Travis "TiKevin83" McGeehan, a full-stack developer at Gordon Food Service, an administrator for the TASBot team, and an ambassador for TASVideos, discusses preserving retro game footage from the TAS scene for modern viewers with WebGL. He explains the role of TASBot, a player piano for retro game consoles, and how the TASBot team translates tool-assisted speedruns to create the fastest possible sequence of inputs. He also highlights the core problems in video game footage archival, including chroma subsampling and stretched aspect ratios at small resolutions.
Thanks, this is Travis "TiKevin83" McGeehan. I'm a full-stack developer at Gordon Food Service, an administrator for the TASBot team, and an ambassador for TASVideos. I'm also a TAS author, so I just said the word TAS a whole bunch of times. You're probably wondering right now, what is a TAS? Well, that's the first thing we'll get into as we talk about preserving our retro game footage from the TAS scene for modern viewers with WebGL.
So who is TASBot, and what is the TASBot team? TASBot is like a player piano for retro game consoles, particularly the NES, SNES, and N64 era consoles, with help from the Game Boy Interface for Game Boy games. The TASBot team helps translate tool-assisted speedruns: playing video games as fast as possible with the assistance of tools that let you go back frame by frame through those video games and create the fastest possible sequence of inputs. We then take those runs and, like a player piano, feed the inputs into a real console. We've done all sorts of showcases for this at MAGFest and Games Done Quick, which require precise video capture. And that gets into the talk more directly: the tools we're using for precise video playback and capture on the web.
Just a word of caution: the discussion here is mainly relevant to this sort of retro video archival and tuned towards animated footage, especially from these very, very ancient video games. It'll be less relevant if you have live filmed video, except when we're trying to transmit an archived copy of such video over the web. So there are a bunch of core problems we see in video game footage archival when we're trying to document these records of speedruns on TASVideos. The first one is chroma subsampling; at a super high level, that is color reduction. All the modern video codecs split the color aspect of the data from brightness, a split that goes all the way back to the difference between black-and-white and color televisions. The game consoles were not originally doing that split; they were just outputting full color information alongside brightness information, and when that gets shoved into a modern video codec that does split them, some of the data is lost and you get worse video, especially at these very low game resolutions like 240p. If you only keep half the color data from 240 lines, now you're only talking about 120 lines of color, and that's not very much color. Subsampling didn't really get baked into video games until the GameCube era. So for these really old video games where we're doing these speedrun records, often in the NES to N64 era, the color reduction isn't baked into the games, and storing them with subsampling corrupts the footage.
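To make that line-count arithmetic concrete, here's a small sketch (not from the talk) of how the common subsampling schemes cut chroma resolution; the frame size is the 256x240 NES output mentioned above:

```javascript
// Sketch: chroma (color) resolution under common subsampling schemes.
// 4:4:4 keeps full color, 4:2:2 halves it horizontally, and 4:2:0 halves it
// both horizontally and vertically.
function chromaResolution(width, height, scheme) {
  switch (scheme) {
    case "4:4:4": return { width, height };                         // full color, what old consoles output
    case "4:2:2": return { width: width / 2, height };              // half horizontal color
    case "4:2:0": return { width: width / 2, height: height / 2 };  // half both ways
    default: throw new Error(`unknown scheme: ${scheme}`);
  }
}

// NES-style 240-line frame under 4:2:0: only 120 lines of color survive.
console.log(chromaResolution(256, 240, "4:2:0")); // { width: 128, height: 120 }
```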
The second problem is that these super small resolutions have stretched aspect ratios. These are very small resolutions to start with, and we have to upscale them to modern displays. We're talking about displays like the OLED LG G3, or the Samsung S95B, and now they're coming out with actual gaming monitors among these extremely color-accurate displays. We want to be able to show on those displays, but we can't do that at the original resolutions, because these displays are 1440p or 4K, so we have to stretch up to the full display. And then we also have the issue that the original games' resolutions did not use square pixels. CRTs displayed every single input as a 4:3 source. So if the game's output was 256x240, like on the NES, which is nearly a square image, it would be stretched out into a rectangle on the resulting CRT display. We need to replicate that stretching on the monitor. But if you do that stretch when you store the footage, say you made an NES video and stored it at 320x240, you would get weird artifacts where one pixel looks right, and then the pixel to the right of it is averaged halfway between two source pixels, and now you have an artifact there.
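The 4:3 stretch described above can be sketched as a little arithmetic helper (an illustration, not code from RGBScaler); it shows where the 320-wide figure and the non-square pixels come from:

```javascript
// Sketch: a CRT draws every source as a 4:3 picture, so the display width is
// derived from the source height rather than the source width.
function stretchedWidth(srcHeight, displayAspect = 4 / 3) {
  return srcHeight * displayAspect; // 240 * 4/3 = 320
}

// The implied pixel aspect ratio: how much wider than tall each source pixel
// is drawn. For the NES's 256x240 that works out to 320/256.
function pixelAspectRatio(srcWidth, srcHeight, displayAspect = 4 / 3) {
  return stretchedWidth(srcHeight, displayAspect) / srcWidth;
}

console.log(stretchedWidth(240));        // ≈ 320
console.log(pixelAspectRatio(256, 240)); // ≈ 1.25: each NES pixel is 25% wider than tall
```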
2. Challenges in Retro Game Footage Preservation
So you want to store records at the original, unstretched resolutions and avoid bilinear interpolation. Lossy compression and intentional CRT corruption in retro games pose additional challenges. Different algorithms, such as point and area, produce distinct effects when scaling up images.
So you want to store these records at the original, unstretched resolution of 256x240 and then do that stretch out to 320 wide only when you get to the final display resolution like 1440p or 4K. We also have a problem where, when we do that upscaling in a normal video player like the HTML5 video player, it does everything with an algorithm called bilinear, which is really great for film footage: if you want to interpolate between one pixel and the next, it makes sense to just average the two. But with pixel art video games, you don't want all those extra in-between colors. In the original video game, it went straight from one color to the next. So that sort of bilinear interpolation doesn't work; pixel art calls for other algorithms instead.
Fourth, we have the issue of lossy compression. Almost all web videos are lossy, which means there are a bunch of artifacts baked in by the compression needed to make them transmittable over the web. So we have our own set of techniques to do lossless encoding for these very ancient and smaller videos. And fifth, and lastly, we have the intentional art design around CRT and composite video signals. A lot of these original NES and SNES games weren't developing their signals just to show you a clean image. They were actually designing for the image to be corrupted the way a CRT corrupts an image; even these really, really advanced 4K and 1440p color-accurate monitors still corrupt images slightly. Back then, CRTs had a lot of corruption, and that corruption was used to create extra colors: using dithering, you could draw a checkerboard pattern that blends into a gradient between two colors, giving you more colors than the system itself could produce, by taking advantage of the artifacts caused by the CRT.
So with all of those problems in mind, let's see what some of that looks like in practice. This is what I talked about with the bilinear smoothing; that's the middle one of this comparison. That is how almost all videos are played: if you just watch a video online on YouTube or Twitch, all of them are going to have that filtering applied to scale them up to the size of your browser. On the left, you can see a different algorithm called point. Between the top and bottom, you see differences between two successive frames of a video. On the top frame, Link's shield in this image is on one pixel, and on the next frame it's on a different pixel. Because of how point works, the shield is going to be wider on one of those frames than on the next if you happen to be scaling up by a non-integer factor. Say your source is 240 lines, like on the NES, and you're going up not to 960, which is 4 times, but to 1080p, a normal desktop resolution, which is 4.5 times. If you scale one source pixel up 4.5 times, nearest neighbor, or point, says each pixel can only go up by 4 or 5 times. That's why in one frame of video on the top, Link's shield has gone up 5 times, and on the bottom one, the next frame, it's moved over and now that line only goes up 4 times. Across the whole width of the video, each vertical line is either 4 or 5 wide, back and forth, and this creates a sort of shimmering and waving effect. On the right of the three, you see area, which is the algorithm we like to use for this sort of extreme-accuracy reproduction. It's kind of like map projections: in the Mercator projection, you have all this distortion of the shapes of continents because of trying to keep latitude and longitude straight, and that's what we saw in point. Area is trying to compromise between having straight longitude and latitude and having proper shapes.
So area keeps the proper shape of the shield; it's always the same number of pixels wide. That's because if we're going up four and a half times, it draws four regular pixels and then, for that remaining half, falls back to bilinear averaging just for the one line where it needs to.
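The behavior just described can be sketched as a one-dimensional area resampler (an illustration of the idea, not the actual OBS-derived shader code): each output pixel averages the source pixels it overlaps, weighted by coverage, so at a 4.5x scale most output pixels land entirely inside one source pixel and stay sharp, and only the pixel straddling a source boundary blends two values.

```javascript
// 1D area scaling: output pixel i at scale s covers the source interval
// [i/s, (i+1)/s); each overlapped source pixel contributes proportionally.
function areaScale1D(src, scale) {
  const out = [];
  for (let i = 0; i < src.length * scale; i++) {
    const start = i / scale;
    const end = (i + 1) / scale;
    let sum = 0;
    for (let j = Math.floor(start); j < Math.ceil(end); j++) {
      const overlap = Math.min(end, j + 1) - Math.max(start, j); // coverage of source pixel j
      sum += src[j] * overlap;
    }
    out.push(sum * scale); // normalize: each output pixel covers 1/scale of source width
  }
  return out;
}

// Two source pixels (black = 0, white = 1) scaled up 4.5x -> 9 output pixels:
// four pure black, four pure white, and one blended pixel at the boundary.
console.log(areaScale1D([0, 1], 4.5)); // ≈ [0, 0, 0, 0, 0.5, 1, 1, 1, 1]
```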
3. Preserving Retro Game Footage Techniques
Bilinear and area scaling algorithms have different effects on video game footage. CRT simulation on RGBScaler.com reproduces the scanlines and slot mask effect. The talk continues with an overview of techniques used on RGBScaler.com to preserve crisp footage of low-resolution games, including the use of the area algorithm, AV1 and H.265 encoded videos, and the ability to play videos with CRT effects.
Whereas with bilinear, on one frame it's going to be the full shield, then on the next it's 0.8 of the shield blended with some of the ground behind it, and then the next it's 0.6. So that's why area is really useful: you get a compromise between the problems of point and bilinear, instead of just the fuzzing that you would prefer for film footage.
We can see this again in the classic Atari Dragster 5.57 controversy video, the human element. Here we have all of these different artifacts we talked about stacked together, and really it doesn't look too bad until you compare it to the unadulterated, pure area-scaled source. You can see there are just these tiny, tiny bits of artifacting around the edges of the numbers in the picture. And then lastly, this is that same thing, but with CRT simulation. I said how a CRT can actually affect the image in ways the artist intended. So we have the ability on RGBScaler.com to reproduce that CRT, where each individual red, green, and blue subpixel is blown up, along with something called scanlines. When NES footage was shown on an old CRT, all footage back then was interlaced, meaning it would show the horizontal lines kind of like a comb, like you can see with my fingers: there's an image, then a space, then an image, then a space. What would happen for NES games, because the console itself isn't creating interlaced footage, is that instead of displaying the missing information on the second field and then filling it in again on successive frames or fields of data, the NES would paint over top of the same lines. So you'd have this constant empty space between each horizontal line. That's what this CRT simulation helps add: those scanlines, as they're called. And the slot mask is the effect of reproducing the blown-up red, green, and blue subpixels.
4. Video Codecs and WebGL Setup
The AV1 and H.265 video codecs are used to support lossless footage and proper upscaling of pixel art footage. AV1 is compatible with Firefox and Chrome on Android, while H.265 allows Safari and iOS devices to play videos. However, due to issues with iOS and Safari, a third fallback to an H.264 source is not possible. To blow up the video in WebGL, the area scaling algorithm is used instead of bilinear. A custom canvas with custom controls is created using React, as the original video player's controls cannot be reused. Interacting with the canvas in React involves setting a ref attribute to a callback and using callbacks for WebGL setup.
And this is easily clonable, too: there is a library, React RGB Scaler, on NPM that you can use to take this technology and play it on your own sites. So let's go through some of the techniques the site uses. The first two are the AV1 video codec and the H.265 video codec. AV1 brings support for 4:4:4 and lossless footage, which takes care of the problems of color reduction and of the footage being lossy and having artifacts from that. It's compatible with Firefox and Chrome on Android, in addition to Chromium-based desktop browsers, and it enables efficient storage and proper upscaling of pixel art footage by being lossless in that small, original-resolution format. H.264 can be more efficient in its lossless modes, but because of its lack of compatibility, we end up settling on AV1 for this Android side. And the second half of it is H.265. This is the first part where we get into code: when you are doing AV1 and H.265, you have to specify the types of the sources very precisely in a source set. One source is the AV1-encoded side, and that's typed video/webm because it has to be in the WebM container. The second one advertises not just that it's an MP4 container, but specifically that it is the HVC1, or H.265, video codec, and that lets Safari and iOS devices say, oh, I have a video source that I can use. Complementing that, Android can use the AV1 source. And a really fun fact: we cannot provide a third fallback to an H.264 source because of malfeasance from iOS and Safari. They break spec and do not respect the order of these sources; if they see an H.264 source, they will always play it, even if you have the H.265 source first in the source set. That's some newer information I'm providing just for this talk, which we've learned since the last talk I did on how we've been making these improvements for RGBScaler.
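A minimal sketch of the source set just described (the file names are hypothetical and the exact codec strings may differ from what the site ships; the points being illustrated are the WebM container for AV1, the explicit hvc1 MP4 type for H.265, and the absence of any H.264 fallback):

```javascript
// Order matters in a source set: spec-compliant browsers try sources top to
// bottom. There is deliberately NO H.264 entry, because Safari/iOS would
// always pick an H.264 source if one were present.
const sources = [
  { src: "run-av1.webm",  type: "video/webm" },              // AV1: Firefox, Chrome on Android, desktop Chromium
  { src: "run-h265.mp4",  type: 'video/mp4; codecs="hvc1"' } // H.265: Safari and iOS
];

// Render the <source> tags for an HTML5 <video> element.
function renderSources(list) {
  return list.map(s => `<source src="${s.src}" type='${s.type}'>`).join("\n");
}

console.log(renderSources(sources));
```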
This is something brand new, just in the last month or two, that we've discovered and implemented: we have to have just AV1 and H.265 with these exact source types or it just doesn't work on iOS and Safari. So that takes care of how we have to pre-encode videos to take advantage of RGBScaler.com. Now, for the rest of it: once you have that video, what we're trying to do is blow it up in WebGL instead of in the native video element. The native video element always uses bilinear, like we said, but we don't want bilinear. We want area, to have that compromise so that we have sharp pixels everywhere except on those borderlines between integer factors of four and five, as in the earlier shot. If we just roll back again, this is the comparison of the scaling algorithms: point, bilinear, area. And if we go back again, this is why we're going to send the video to a WebGL texture. How do we do that? There's actually a really helpful tutorial in MDN's docs, Animating Textures in WebGL, and I heavily adapted that tutorial for React, because with this strategy we can't reuse the original video player's controls. If you use an HTML5 video player, the controls are baked into that element, so if you hide the element, you also hide the controls. So to upscale the video on a custom canvas, we also have to make all these custom controls, which gets into wanting to use React, where we get really easy state management of all those controls. And so now, to interact with that canvas in React, we set a ref attribute to a callback defined in a custom hook, and we use callbacks to do a bunch of WebGL boilerplate setup on initialization. You can see that here: things like setting uniforms for the shader program for whether we're doing those CRT effects, like the CRT mask intensity and scanline intensity.
5. WebGL Texture Update and Shader Logic
If you disable the CRT art-design effects, you get pure area scaling. React manages the state of the sliders and updates the WebGL uniforms. useEffect hooks and a custom hook handle video dimensions and effect intensity. The WebGL texture updates via a render loop, grabbing the video frame with texImage2D and sending it to a texture. A useRef hook tracks the requestAnimationFrame ID, and a cancelRender step calls cancelAnimationFrame to end the render loop when the video is paused. The shader logic recreates the mask and scanlines based on position within a pixel.
If you disable those, then you just get pure area scaling, for games that didn't do that sort of art design for CRTs. But if there is that art design for CRTs, there are settings to raise the mask intensity and scanline intensity. This is how the code uses React to manage the state of those sliders, upping the intensity and then sending that to the WebGL uniforms. The useEffect hooks and the custom hook also help with things like the dimensions of the video and the effect intensity.
Another thing with the dimensions of the video: the browser the user is using might be resized while the video is playing, right? So we need a useEffect set up to watch for changes in the browser's dimensions and properly update the canvas to target the correct dimensions when the browser size changes. Also, on mobile, the user could rotate their device, and if they do, we now need to target different dimensions for that reason too.
So how does the WebGL texture update for new frames? We have this concept of a render loop. The first thing we do is call texImage2D, which grabs the current frame off the video element and sends it to a texture. And this is where we have a caveat: on RGBScaler.com, there's an input for arbitrary video URLs. If you use that, the website the video is coming from has to allow anonymous CORS. That's because this bit here with texImage2D respects CORS: the canvas is not allowed to read data from a texture that hasn't fully passed that CORS check. So that's where that comes into effect. Then drawElements takes that texture and draws it to the canvas. And then we have setAnimationFrame; that setAnimationFrame comes from a useRef hook.
The useRef makes it so that the entire website doesn't re-render every time an animation frame ticks. requestAnimationFrame gives you back an ID for the frame it schedules, and we use that useRef with setAnimationFrame to keep track of the animation frame ID. That's necessary for this next bit, cancelRender. Because what if, for example, somebody pauses the video? We don't want their GPU to just be spamming out renders of the same frame in the background. So we do a cancelRender with cancelAnimationFrame to end the render loop at that point, and then we can re-initialize it if they unpause the video. We need to keep track of the animation frame ID to be able to do that pause and unpause.
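The pause/unpause bookkeeping described above can be sketched as a small loop manager with requestAnimationFrame and cancelAnimationFrame injected, so the logic runs outside a browser (a simplified illustration, not the library's actual hook):

```javascript
// A pause-aware render loop. raf/caf stand in for requestAnimationFrame and
// cancelAnimationFrame, and drawFrame stands in for the texImage2D +
// drawElements work that uploads and draws the current video frame.
function createRenderLoop({ raf, caf, drawFrame }) {
  const frameId = { current: null }; // stand-in for the useRef holding the frame ID
  function render() {
    drawFrame();
    frameId.current = raf(render); // schedule the next frame and remember its ID
  }
  return {
    start() { frameId.current = raf(render); }, // called on play / unpause
    cancel() {                                  // called on pause so the GPU stops redrawing
      if (frameId.current !== null) caf(frameId.current);
      frameId.current = null;
    },
  };
}
```

In the component, start() runs when the video plays and cancel() when it pauses; in the browser you'd pass the real requestAnimationFrame and cancelAnimationFrame.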
Here are just some very basic essentials of the shader logic. Now, I'm not an expert at all in WebGL shader logic, so this is probably doing something monstrously wrong, but it gets the idea across. It's able to recreate our mask and our scanlines by varying them based on the position within a pixel. And the texture has to be sampled using gl.LINEAR, so I leave that enabled. If you were to sample with gl.NEAREST, which you'd think would be more appropriate for retro video, then the shader could no longer do that sampling where one sample really sits at the center of a pixel to get nearest-style behavior.
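As a rough illustration of the shader idea (a JavaScript port of the concept, not the actual GLSL from the library), the scanline and mask factors can be derived from an output pixel's position like this:

```javascript
// Darken the bottom fraction of each source scanline, and attenuate the
// "wrong" color channels based on horizontal position, blended by the two
// intensity settings. All thresholds here are illustrative choices.
function crtPixel([r, g, b], outX, outY, scale, scanlineIntensity, maskIntensity) {
  const rowFrac = (outY / scale) % 1;                      // position within the source scanline
  const scan = rowFrac < 0.5 ? 1 : 1 - scanlineIntensity;  // dark gap between lines
  const sub = outX % 3;                                    // which R/G/B subpixel column we're in
  const mask = [0, 1, 2].map(c => (c === sub ? 1 : 1 - maskIntensity));
  return [r * scan * mask[0], g * scan * mask[1], b * scan * mask[2]];
}

// White pixel at 4x scale with 50% scanline intensity:
// rows 0-1 stay bright, rows 2-3 become dimmed scanline gaps.
console.log(crtPixel([1, 1, 1], 0, 0, 4, 0.5, 0)); // [ 1, 1, 1 ]
console.log(crtPixel([1, 1, 1], 0, 2, 4, 0.5, 0)); // [ 0.5, 0.5, 0.5 ]
```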
6. WebGL and Video Player Controls
The area effect is ported from the OBS Studio source code. Combining that with the slot mask and scanlines in WebGL gives you a web-player experience with precise scaling. The React RGB Scaler library enables syntax highlighting for the vertex shader and the fragment shader, making development easier. Simple React logic is used to re-implement the video player controls as custom controls for playing and pausing the canvas. DPI scaling is used to take full advantage of the device's resolution. Pre-encoding high-resolution videos and scaling them down wastes bandwidth, while the proposed technique reduces video sizes significantly.
One sample position can sit at the center of a pixel to act like nearest, and another can sit in the linear space between pixels. So, like I said, the area effect is ported from the OBS Studio source code, so there's not much to show there; it's the same area effect you'd see anywhere that uses that kind of upscaling. The real key here is combining that with the slot mask and scanlines, and doing it in WebGL, to get you this web-player experience with precise scaling.
Now, another cool thing I found while doing all of this: you'd want your shader language, your actual WebGL shader code, to be syntax highlighted while you're developing the plugin. This React RGB Scaler library actually has that. If you download it and look at it in VS Code, you can see syntax highlighting for the vertex shader and the fragment shader. That's because we're using rollup-plugin-string to take the raw contents of those shader files and inject them where they're needed in the code at the library's Rollup compile time. This lets us write the shaders as standalone files with full syntax highlighting for WebGL. A lot of the time, if you're writing WebGL shaders otherwise, you just have to write them directly into a template string with backticks, and that looks really un-ideal. So this is some nice extra magic to make development of the plugin easier.
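A minimal Rollup config sketch showing the rollup-plugin-string approach (the entry path, output path, and shader file extensions here are assumptions for illustration, not the library's actual config):

```javascript
// rollup.config.js
import { string } from "rollup-plugin-string";

export default {
  input: "src/index.js",
  plugins: [
    // Any matching file can then be imported as a plain string, e.g.:
    //   import fragmentShader from "./shaders/crt.frag";
    // The shader lives in its own file, so editors highlight it as GLSL.
    string({ include: ["**/*.vert", "**/*.frag"] }),
  ],
  output: { file: "dist/index.js", format: "es" },
};
```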
And like I mentioned earlier, because the HTML5 video element has to be hidden, the native video controls are also hidden. So we have a couple of things like this where we're doing simple React logic to re-implement the video player's controls in React, with some custom controls the user can use to play and pause the canvas where the video is ultimately being shown. Another really cool thing here is working with DPI scaling. If you were to naively implement this without an understanding of DPI scaling, the upscaling of the video wouldn't take full advantage of the user's device resolution; it would only take as much advantage as the CSS pixels allow, because CSS pixels are a kind of standard size now used across all devices. For example, if your phone is a thousand physical pixels wide, that might only be 500 CSS pixels. So to fully take advantage of the actual screen and give you the sharpest possible image, we have to separately set the canvas width and the canvas style width. The canvas style width is in CSS pixels, not taking into account the device pixel ratio. You can get the device pixel ratio from a property available on the window object in browsers, and we use it to account for the full device resolution and actually use all of those device pixels properly. Just to reiterate: if you were to do this naively, without the device pixel ratio, you would get artifact lines at the size of CSS pixels rather than the native device's pixels. For some comparisons of the efficiency of this technique and why it's really useful: another way to get the same result would be to pre-encode a 4K or 8K video and then scale it down on every single device.
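The CSS-pixel versus device-pixel split described above can be sketched as a small sizing helper (an illustration; in a browser you'd pass window.devicePixelRatio):

```javascript
// Size a canvas so its drawing buffer uses real device pixels while its
// layout size stays in CSS pixels.
function sizeCanvas(canvas, cssWidth, cssHeight, devicePixelRatio) {
  canvas.style.width = `${cssWidth}px`;   // layout size, in CSS pixels
  canvas.style.height = `${cssHeight}px`;
  canvas.width = Math.round(cssWidth * devicePixelRatio);   // drawing buffer, in device pixels
  canvas.height = Math.round(cssHeight * devicePixelRatio);
  return canvas;
}

// e.g. a 500-CSS-pixel-wide canvas on a 2x phone renders at 1000 device pixels.
const fakeCanvas = { style: {} };
sizeCanvas(fakeCanvas, 500, 400, 2);
console.log(fakeCanvas.width, fakeCanvas.height); // 1000 800
```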
So if a user had a 1080p screen, they would be playing the pre-upscaled 4K retro video on their 1080p screen. But then you're wasting a ton of bandwidth on the internet, which we don't want to do. You can see here that the actual sizes of these videos are much, much smaller when taking advantage of all these properties. A lossless encode of the Atari Dragster footage is only about 119 kilobits per second, which is remarkably small compared to what you would get if you were to upscale all the way to 4K: done losslessly, that might be 340 to 400 kilobits per second, but YouTube is just going to naively assume a constant bitrate for a 4K video, in the thousands or tens of thousands of kilobits per second, and then you're spending literally orders of magnitude more data to send this footage over the web.
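As a back-of-envelope check on those bitrate claims (the 60-second clip length and the 40 Mbps 4K figure are assumptions for illustration; the 119 kbps figure is the talk's Dragster example):

```javascript
// Total transfer for a clip at a given average bitrate, in megabytes.
function megabytesFor(kbps, seconds) {
  return (kbps * seconds) / 8 / 1000; // kilobits -> kilobytes -> megabytes
}

const clipSeconds = 60;
console.log(megabytesFor(119, clipSeconds));   // lossless 240p encode: under 1 MB per minute
console.log(megabytesFor(40000, clipSeconds)); // assumed 4K stream at 40 Mbps: hundreds of MB
```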
7. Value of RGB Scaler Site
The RGBScaler site demonstrates the value of improving video quality while using significantly less bandwidth than YouTube. Lossless data for 3D footage is around five to ten megabits per second using H.264, with comparable numbers from AV1 and H.265, versus 100 megabits per second for a lossless 4K upscale. This technique provides higher quality and less data usage for retro game footage, especially for 2D and Atari games, and is around five times better for the N64 era.
So that's part of the value this RGBScaler site demonstrates: not only can the video look higher quality, we can do it while using hundreds of times less bandwidth than YouTube. And now this is showing the same concept, but instead of the best-case scenario, this is a sort of middle-case scenario, with some 3D Super Mario 64 footage. Because this is 3D, you don't have the big splotches of exactly identical color that combine to create perfectly lossless footage cheaply. But we still see only five to ten megabits per second of lossless data when using H.264, and you'd get comparable five-to-ten-megabit-per-second numbers from the AV1 and H.265 that I mentioned we have to use to get this working perfectly on iOS and Android. Whereas if you were to go all the way to a lossless 4K upscale, you'd be talking about 100 megabits per second, which isn't something most average connections can handle; most cable users around the US only have around 30 to 50 megabits, so it would make your video inaccessible to most people. It would have to be lossy then, and then you're going to have artifacts, which get blown up to the user's screen size, and we really don't want to be worrying about all that. We end up getting higher quality and less data usage all the way up through the N64 era with all these techniques. I just reiterated everything here on the slide. Like I said, this works really well for 2D and Atari, and it continues to work well beyond that, just not as insanely; it's not orders of magnitude or hundreds of times better, just around five times better for the N64 era. Beyond that, you get into things like the GameCube, which already had 4:2:0 color reduction baked in, so these techniques wouldn't be as relevant anyway.