Video editing is a booming market, with influencers all the rage on Reels, TikTok, and YouTube. Did you know that browsers now have all the APIs to do video editing in the browser? In this talk I'm going to give you a primer on how video encoding works and how to make it work within the browser. Spoiler: it's not trivial!
Video Editing in the Browser
AI Generated Video Summary
This Talk discusses the challenges of video editing in the browser and the limitations of existing tools. It explores image compression techniques, including Fourier transform and Huffman encoding, to reduce file sizes. The video codec and frame decoding process are explained, highlighting the importance of keyframes and delta frames. The performance bottleneck is identified as the codec, and the need for specialized hardware for efficient video editing is emphasized. The Talk concludes with a call to create a simplified API for video editing in the browser and the potential for AI-powered video editing.
1. Introduction to Video Editing in the Browser
Hey, everyone. My name is Christopher Chedeau, also known as Vjeux on the Internet. And I've done a few things for the React community: I co-created React Native, Prettier, Excalidraw, CSS-in-JS. But today I want to talk about something different. I want to talk about video editing in the browser.
So during the pandemic, I spent a lot of time doing video editing, and I was even thinking maybe I should become a YouTuber full-time. But then I realized that with this number of views, I should probably keep my job as a software engineer for a bit longer.
2. Video Editing API and Image Compression
Unfortunately, the desired API for video editing in the browser is not possible due to the large file sizes involved. A single 1,000 by 1,000 pixel image is already around four megabytes uncompressed. At 60 frames per second, one second of video would be around 240 megabytes. This is too big for current browsers and computers. However, image compression techniques have been developed to address this issue, which will be discussed in the following sections.
And unfortunately I cannot be here in person today, so what I decided to do was to bring some of sunny California to Amsterdam, and for this I put a palm tree in all of the pictures. So in this case, we have React Summit in the background, then moving to the foreground, and the palm tree fading away. So what would be the API that I would expect in order to do that? I initially wanted a load video kind of API that takes a file path and returns a list of images. Then I'm going to modify the images: remove the background, cut and paste, a bunch of stuff. And then a save video API that takes the file path and a list of images and actually saves the video.
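The naive API described above might look like the sketch below. The `loadVideo` and `saveVideo` names are purely hypothetical (neither exists in any browser, which is the whole point of the talk), and for illustration the "video" is just an in-memory list of frames and the edit just brightens pixels:

```javascript
// Hypothetical, naive API sketch: neither loadVideo nor saveVideo is a
// real browser API (in reality they would also have to be async).
// Frames are modeled as flat arrays of pixel values for illustration.
const fakeStorage = new Map();

function loadVideo(path) {
  return fakeStorage.get(path); // pretend: decode a file into frames
}

function saveVideo(path, frames) {
  fakeStorage.set(path, frames); // pretend: encode frames into a file
}

function editVideo(input, output) {
  const frames = loadVideo(input);
  // Stand-in for real edits such as removing the background or
  // compositing a palm tree: here we just brighten every pixel.
  const edited = frames.map((frame) =>
    frame.map((px) => Math.min(px + 10, 255))
  );
  saveVideo(output, edited);
}
```

The rest of the talk explains why this straightforward shape cannot work.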
So unfortunately, this API cannot exist. Let's see why. Let's go into one image of this whole video, not too big, not too small: a 1,000 by 1,000 image. How large is it, actually? It's a million pixels, and with one byte each for red, green, blue, and alpha, we're at about four megabytes in size. And this is just for one image. Now, if you want 60 fps, you're going to be at around 240 megabytes for every single second. This talk is around 20 minutes, so this is going to be big, and actually too big for the browser or any computer right now.

So what do we do? Fortunately, a lot of very smart people have worked on this for years, and what they built is a shrinking machine. Well, not exactly: what people have been doing is image compression. So I'm going to talk for the next few minutes about different types of image compression, not just because I find them interesting, which I do, but because they have a big influence on the actual API used for video encoding. So let's see the main ideas behind image compression.
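The back-of-the-envelope numbers above can be checked directly (assuming 4 bytes per pixel for red, green, blue, and alpha; the exact per-second figure is 240 MB):

```javascript
// Raw size of one uncompressed 1000x1000 frame at 4 bytes per pixel.
const bytesPerFrame = 1000 * 1000 * 4;         // 4,000,000 bytes, ~4 MB

// At 60 frames per second, one second of raw video:
const bytesPerSecond = bytesPerFrame * 60;     // 240,000,000 bytes, ~240 MB

// A 20-minute talk at this rate:
const bytesPerTalk = bytesPerSecond * 20 * 60; // ~288 GB of raw pixels
```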
3. Image Compression Techniques
In video frames with only two colors, we can use run-length encoding to reduce the size. However, this technique is not effective for images with varying pixel colors. We will explore other techniques to address this issue.
So if you look at this one frame, one thing we can see is that only two colors are being used, and there are a lot of white and dark pixels. So instead of storing seven dark pixels in a row as red, green, blue, red, green, blue, and so on, what we can do is write one byte for the number of pixels in the run, then the red, green, blue values once, and repeat like this. This is a technique called run-length encoding. Now, this technique by itself is only really useful for images with very few colors; if you take a picture with your camera, you're almost never going to see two pixels next to each other with the exact same color. We're going to see next how to help with this, but keep in mind that this is a building block that's going to be used later in the compression pipeline.
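A minimal sketch of run-length encoding (the helper names are made up, and a real codec packs runs into bytes rather than arrays, but the idea is the same):

```javascript
// Run-length encoding over a flat array of pixel values: store each
// run as [count, value] instead of repeating the value.
function rleEncode(pixels) {
  const runs = [];
  let i = 0;
  while (i < pixels.length) {
    let count = 1;
    // Count how many times the current value repeats; cap at 255 so
    // the run length would fit in a single byte.
    while (
      i + count < pixels.length &&
      pixels[i + count] === pixels[i] &&
      count < 255
    ) {
      count++;
    }
    runs.push([count, pixels[i]]);
    i += count;
  }
  return runs;
}

function rleDecode(runs) {
  const pixels = [];
  for (const [count, value] of runs) {
    for (let k = 0; k < count; k++) pixels.push(value);
  }
  return pixels;
}
```

A row of seven dark pixels followed by three white ones collapses from ten entries into just two runs.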
4. Image Compression Techniques Continued
We can use a Fourier transform to decompose the image into sinusoidal functions and reduce the information needed to encode it. Another technique is Huffman encoding, which remaps patterns of 0s and 1s to compress the image. These building blocks, along with others, enable a 10x reduction in image size. However, for videos, we can further improve compression by encoding only the differences between consecutive frames and predicting the next frame.
So now, for the other strategy I'm going to talk about, you need to have some imagination. In this case, we're going to think of the image not as a series of pixels but as a continuous function. And one of the things we can do with a continuous function is to run a Fourier transform on it, which decomposes the continuous function into an infinite sum of sinusoidal functions, sines and cosines with some variations. So why would you want that? How is it useful? What you can see in the illustration is that the first few sinusoidal functions already end up being very close to the final function, and the further down you go into those sinusoidal functions, the less information they carry.
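Codecs like JPEG use a discrete relative of this idea, the discrete cosine transform (DCT). Here is a minimal 1-D sketch of decomposing a signal and rebuilding it from only the first few coefficients (real codecs apply a 2-D version on 8x8 pixel blocks; function names are illustrative):

```javascript
// Unnormalized DCT-II: decompose a signal into cosine coefficients.
function dct(signal) {
  const N = signal.length;
  return Array.from({ length: N }, (_, k) => {
    let sum = 0;
    for (let n = 0; n < N; n++) {
      sum += signal[n] * Math.cos((Math.PI / N) * (n + 0.5) * k);
    }
    return sum;
  });
}

// Inverse (DCT-III with 2/N scaling): rebuild the signal, optionally
// keeping only the first `keep` coefficients and dropping the rest.
function idct(coeffs, keep = coeffs.length) {
  const N = coeffs.length;
  return Array.from({ length: N }, (_, n) => {
    let sum = 0.5 * coeffs[0];
    for (let k = 1; k < keep; k++) {
      sum += coeffs[k] * Math.cos((Math.PI / N) * (n + 0.5) * k);
    }
    return (2 / N) * sum;
  });
}
```

A smooth gradient rebuilt from only half of its coefficients stays very close to the original while discarding half the information.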
In practice, if you keep just the first few and re-encode the image from them, you're going to get a very close image, losing only details that you may not be able to perceive. Now, every single one of those functions takes roughly the same number of bits to encode, so by doing this you're able to reduce the amount of information you have to encode in order to compress the image.

And for the third technique I'm going to talk about, you also need to think about the image in a different way, in this case as a series of 0s and 1s. And one of the things we can start observing is that some patterns emerge.
So for example, the sequence 0 1 0 1 is there 15 times, then the sequence 1 1 0 is only there 7 times, and as you keep going, at some point some sequences are only there once. The idea behind the compression is that you can do a remapping: you can remap 0 1 0 1 to the single bit 0, then remap 1 1 0 to the two bits 1 0, and so on. At some point, because you mapped a bunch of things to smaller things, some things will need to be mapped to bigger things. But if you look at the entire sequence of 0s and 1s, it's going to compress using this technique, even once you add the mapping table. This is called Huffman encoding.

So these are three building blocks for compressing an image. And what is the result? We are able to get a 10x reduction in the size of the image, going from about 4 MB to about 400 KB. This step is called image compression, and the most popular image formats out there, like JPEG, WebP, and PNG, use all of these building blocks and a few more to compress the image.

So we've made massive progress in getting the image to be smaller, but it's still over 20 MB per second for our video. This is still too much, so what else can we do? For this, we can think of our video as a series of images. And what you can see is that consecutive images are actually very, very close to each other, so there's probably something we can do about it. The first idea is to take a diff of the image before and the image after, and encode only the diff using the strategies from before. This works and gives better results, but we can do even better: we can start predicting what the next image is going to be.
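The 0-and-1 remapping described earlier in this section is exactly what Huffman coding does. A compact sketch with illustrative names (real implementations use a priority queue instead of re-sorting, and also emit the mapping table alongside the bits):

```javascript
// Huffman coding over an array of symbols (e.g. bit patterns such as
// "0101"): frequent symbols get short codes, rare ones longer codes.
function buildCodes(symbols) {
  // Count how often each symbol occurs.
  const freq = new Map();
  for (const s of symbols) freq.set(s, (freq.get(s) ?? 0) + 1);

  // Build the Huffman tree: repeatedly merge the two rarest nodes.
  let nodes = [...freq].map(([symbol, count]) => ({ symbol, count }));
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.count - b.count);
    const [a, b] = nodes.splice(0, 2);
    nodes.push({ count: a.count + b.count, left: a, right: b });
  }

  // Walk the tree to assign codes: left edge = "0", right edge = "1".
  const codes = new Map();
  const walk = (node, prefix) => {
    if (node.symbol !== undefined) codes.set(node.symbol, prefix || "0");
    else {
      walk(node.left, prefix + "0");
      walk(node.right, prefix + "1");
    }
  };
  walk(nodes[0], "");
  return codes;
}

function encode(symbols, codes) {
  return symbols.map((s) => codes.get(s)).join("");
}
```

With the talk's example frequencies (15x "0101", 7x "110"), the frequent pattern ends up with a one-bit code and the whole stream shrinks well below its raw size.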
5. Video Codec and Frame Decoding
The video codec reduces the size of the frames drastically. There are two types of frames: keyframes and delta frames. To decode the video, a stateful API is required, and frames need to be sent in a specific order. Bidirectional frames (B-frames) optimize video encoding by predicting from frames in both directions. This introduces two notions of time: the presentation timestamp and the decoding timestamp. The API should include a load video API and a decoder API.
So this video codec is able, again, to reduce the size drastically. In this case, this is what our setup looks like: we now have two types of frames. We have keyframes, in this case the first frame, which is compressed using something like JPEG, and then we have delta frames, in this case all the remaining frames in this picture.
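The keyframe/delta-frame split can be sketched as follows (illustrative helpers; real codecs diff against a motion-compensated prediction rather than the raw previous frame):

```javascript
// A delta frame stores only the per-pixel difference from the
// previous frame; a keyframe is stored whole.
function deltaEncode(prevFrame, frame) {
  return frame.map((value, i) => value - prevFrame[i]);
}

// Rebuild a frame from the previously decoded frame plus its delta.
function deltaDecode(prevFrame, delta) {
  return delta.map((d, i) => prevFrame[i] + d);
}
```

Because consecutive frames are nearly identical, the delta is mostly zeros, which run-length and Huffman coding then compress extremely well.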
And now, in order to decode the video, it's no longer "just give me one image and I can do it." Now you need to start with the keyframe. Then, in order to decode the second frame, you need to have decoded the keyframe, do the prediction of where things are going to be next, and then apply the delta. So now we're seeing that we need a stateful API, with frames in a specific order.

But this is only one part of the picture, because the people doing video encoding and compression wanted to do even better. One thing they realized is that you can do this optimization going forward, but you can also do it backwards: you can start from the end, do the prediction and the encoding in both directions, then look at which direction gives the most savings, and take the one that ends up smallest overall. This is where the notion of bidirectional frames, or B-frames, comes in. In this case, frame number 5 is a B-frame, which means that in order to decode it, you need to decode number 4 and number 6. And to decode number 6, you need number 7; for 7 you need 8; for 8 you need 9; and the same the other way.

So now what you're seeing is that in order to decode the video sequence in order, you need to send all of the frames in a different order. And this is where we get two notions of time: the presentation timestamp, which is the one you expect to see over the duration of the movie, and the decoding timestamp, which is the timestamp at which you need to send the frames so they can be decoded in the right order. And so this is where our first "truth" turns out to be a lie: time does not only go forward.

So now that we've seen the first thing breaking, let's go back to the API, the real API this time. We need some kind of load video API that gives us all the frames, and then we want a decoder API.
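The dependency chains above are what force the decode order. A sketch that derives a valid send order from hypothetical per-frame dependency lists, mirroring the talk's nine-frame example (a real encoder emits decoding timestamps; here a topological sort stands in for that):

```javascript
// Each frame lists the frames it depends on (by presentation index).
// A depth-first topological sort yields a valid decode order: every
// frame is emitted only after everything it needs.
function decodeOrder(frames) {
  const order = [];
  const done = new Set();
  const visit = (i) => {
    if (done.has(i)) return;
    done.add(i);
    for (const dep of frames[i].deps) visit(dep);
    order.push(i);
  };
  frames.forEach((_, i) => visit(i));
  return order;
}

// Presentation order 1..9 (indices 0..8): 1 is a keyframe, 2-4 are
// deltas on their predecessor, 5 is a B-frame needing 4 and 6, and
// 6-8 are predicted backwards from 9.
const frames = [
  { deps: [] },     // 1: keyframe
  { deps: [0] },    // 2
  { deps: [1] },    // 3
  { deps: [2] },    // 4
  { deps: [3, 5] }, // 5: B-frame, needs 4 and 6
  { deps: [6] },    // 6: predicted from 7
  { deps: [7] },    // 7: predicted from 8
  { deps: [8] },    // 8: predicted from 9
  { deps: [] },     // 9: keyframe
];
```

Running `decodeOrder(frames)` reproduces the send order from the talk: 1, 2, 3, 4, then 9, 8, 7, 6, and only then 5.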
6. Video Decoding and Performance
The WebCodecs API provides a video decoder with options, including a callback for decoded frames. Decoding frames may not return frames in the same order they were sent. The load video API requires storing metadata, such as frame durations, types, timestamps, and dependencies. Video containers like MP4, MOV, AVI, and MKV hold the frames for the codec. The performance bottleneck is the codec, which handles image compression, decompression, predictions, and encodings.
And so in this case, WebCodecs, the API the browser gives us, provides a video decoder with a bunch of options, including a callback on each decoded frame. So now you call decoder.decode and send it the first frame; it's going to process it, and at some point it's going to invoke the callback with the first image. Then we do it with number two, number three, number four, and we're getting them back in order.

But now, what happens with our B-frames? What we need to do is send frame number nine, then frame number eight, then frames seven and six. But our callback is not going to be called. It's as if nothing happened, while in practice a bunch of things are happening behind the scenes. Only when we send in frame number five does it do the whole chain again and call our callback with all the frames, in the right order, for us to use. And so this is where truth number two is a lie: if you decode one frame, you're not necessarily getting one frame back. In practice, you may get zero or you may get five, based on how the encoding was done. This is very mind-bending, because with all of the APIs I can think of, even asynchronous ones, when you call something it gives you one result back after some time; it's never one or ten or zero results in an unpredictable way.

So now that we've seen another truth break, let's go even deeper and think about this load video API I talked about. How would you encode all of this information? There's a lot of metadata that we need: how long does a frame last, what is the list of frames, what are their types, what are their timestamps, what are the dependencies between them. So a lot of metadata needs to be stored. If we were to do it today, we would probably express it in JSON, but all of these file formats were designed about 20 years ago.
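The real WebCodecs `VideoDecoder` runs inside the browser, but the bursty callback behavior described above can be modeled with a small reorder buffer (a simplified sketch of the idea, not the actual decoder internals):

```javascript
// Simplified model of a decoder's reorder buffer: frames arrive in
// decode order but are only emitted once they can be released in
// presentation order. Decoding one frame can therefore emit zero,
// one, or many frames.
class ReorderingDecoder {
  constructor(onFrame) {
    this.onFrame = onFrame; // callback, like VideoDecoder's `output`
    this.buffer = new Map(); // pts -> frame, held until releasable
    this.nextPts = 1; // next presentation timestamp to emit
  }
  decode(frame) {
    this.buffer.set(frame.pts, frame);
    // Flush every frame that is now ready in presentation order.
    while (this.buffer.has(this.nextPts)) {
      this.onFrame(this.buffer.get(this.nextPts));
      this.buffer.delete(this.nextPts);
      this.nextPts++;
    }
  }
}
```

Feeding frames 1 through 4 emits each one immediately; feeding 9, 8, 7, 6 emits nothing; feeding 5 then flushes 5 through 9 all at once.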
These formats are all binary, but the idea is the same. So what is this thing called? This is called a video container. In practice, the four best-known ones are MP4, MOV, AVI, and MKV. They all use different encodings and different ways to represent this, but they all hold very similar information: they are containers around the frame data that then gets handed to the codec. And this step of reading these file formats and then sending all of the frames in the right order to the codec is called demuxing.
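If a container were designed today in JSON, as suggested above, its per-frame metadata might look like this (field names and sizes are purely illustrative and match no real format; only the codec string follows a real convention):

```javascript
// Hypothetical JSON-style container metadata: one entry per frame with
// its type, presentation timestamp (pts), decode timestamp (dts), and
// a byte range into the blob of compressed frame data.
const container = {
  codec: "avc1.42E01E", // which codec the frames were encoded with
  framerate: 60,
  frames: [
    { type: "key",   pts: 1, dts: 1, offset: 0,     size: 41230 },
    { type: "delta", pts: 2, dts: 2, offset: 41230, size: 2048 },
    { type: "b",     pts: 3, dts: 4, offset: 43278, size: 1024 },
    { type: "delta", pts: 4, dts: 3, offset: 44302, size: 2130 },
  ],
};

// Demuxing = reading this metadata and feeding frames to the codec in
// decode order, i.e. sorted by dts rather than pts.
function demuxOrder(container) {
  return [...container.frames]
    .sort((a, b) => a.dts - b.dts)
    .map((f) => f.pts);
}
```

Note how the B-frame at presentation time 3 is sent to the codec last, after the frame it depends on.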
So let's talk a bit about performance. What takes time in this whole thing? In practice, the codec is the part that takes the most time. To refresh your mind, the codec is doing all of the image compression and decompression, all of the predictions, the deltas, all of these encodings. One way to convince yourself is to just look at the size of things: the compressed binary data is in the tens to hundreds of kilobytes per frame, while the metadata for each frame is only tens of characters. So you can see a very big difference.
7. Video Editing Challenges and Call to Action