High-quality video encoding in browsers has traditionally been slow and low-quality, and has not allowed much customisation. This is because browsers never had a native way to encode videos leveraging hardware acceleration. In this talk, I'll be going over the secrets of creating high-quality videos in browsers efficiently with the power of WebCodecs and WebAssembly. From video containers to muxing, audio and beyond, this talk will give you everything you need to render your videos in browsers today!
Pushing the Limits of Video Encoding in Browsers With WebCodecs
AI Generated Video Summary
This Talk explores the challenges and solutions in video encoding with web codecs. It discusses drawing and recording video on the web, capturing and encoding video frames, and introduces the WebCodecs API. The Talk also covers configuring the video encoder, understanding codecs and containers, and the video encoding process with muxing using ffmpeg. The speaker shares their experience in building a video editing tool on the browser and showcases Slantit, a tool for making product videos.
1. Introduction to Video Encoding with Web Codecs
Welcome to my talk on pushing the limits of video encoding with web codecs. My name is Akash Hameedwasia. I write code at Razorpay. I also love building products on the web and contributing to open-source. You might know me from some of my projects mentioned here, Blaze, Untabbed, Dyod, or Slanted.
I hope you're all having a great time at JS Nation 2023. Welcome to my talk on pushing the limits of video encoding with web codecs. My name is Akash Hameedwasia. I write code at Razorpay. I also love building products on the web and contributing to open-source. You might know me from some of my projects mentioned here, Blaze, Untabbed, Dyod, or Slanted. If you don't know about these projects, I strongly suggest you check them out. You can also connect with me on my Twitter, GitHub, or also check out my past talks, past blogs, and also various projects on my personal website, AkashHameedwasia.com.
2. Building a Video Editing Tool on the Browser
In this talk, I will share the learnings I had while building a video editing tool on the browser. I encountered challenges with video encoding, audio, muxing, and codecs, but eventually figured everything out.
So before we get started on our journey of understanding how to use web codecs, here's a little back story. A while back, I was building a video editing tool on the browser that would allow me to make really catchy product videos really quickly. There are various video editing tools out there, but they are desktop based, so you have to download them and they're quite large. Then you have to learn how to use them, and then they take a lot of time to export a video. I thought, why not make a simple tool? Because I have a lot of video editing needs for all my projects, let's build a simple video editor on the browser so that I can use it to export videos really quickly. Now, everything was going fine until I started writing code for the export process. I got stuck at a bunch of places. Video encoding, audio, browsers, muxing, codecs: a lot of jargon, a lot of places to get stuck. But I was eventually able to figure everything out, and this is what I want to share with you in this talk, all the learnings I had of how to go about building something like this.
3. Drawing and Recording Video on the Web
Now, the HTML canvas API is a bit verbose, but it also gives you a lot of flexibility to do various things. And I strongly suggest you use the HTML5 canvas for doing something like this. As you'll see in the talk, the canvas API enables various things that the other methods of rendering cannot.
So here's some code that shows you how you can do something with captureStream. It defines a function called record, which takes in a duration, and we create a stream out of the canvas element using captureStream. The first parameter is the frame rate; here I'm just setting it to 30 for 30 frames per second. Then I pass that stream and create a new MediaRecorder object. This recorder is what we'll use to record the canvas video. Before we can start recording, we need to attach two event listeners on the recorder. The first is dataavailable: whenever some data is available in the recorder, instead of transmitting it over the network or doing something else, we simply store that data inside a chunks array. Then we attach another event listener, which is the stop event listener. So when the recorder is stopped, we take all the accumulated chunks of data, process them into a blob of type video/webm, and download it. And once the event listeners are attached, it's simply a matter of starting the recorder and stopping it after the duration has elapsed.
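The steps above can be sketched as follows — a minimal version, assuming a canvas element with an animation already drawing to it. captureStream and MediaRecorder are browser-only APIs, so this only runs on a page:

```javascript
// Sketch of the captureStream + MediaRecorder approach described above.
// Assumes `canvas` is a <canvas> element an animation is drawing to.
function record(canvas, durationMs) {
  // 30 here is the *maximum* frames per second the stream may capture.
  const stream = canvas.captureStream(30);
  const recorder = new MediaRecorder(stream, { mimeType: 'video/webm' });
  const chunks = [];

  // Store every piece of encoded data the recorder hands us.
  recorder.ondataavailable = (event) => chunks.push(event.data);

  // When recording stops, bundle the chunks into a single WebM blob
  // and trigger a download on the user's device.
  recorder.onstop = () => {
    const blob = new Blob(chunks, { type: 'video/webm' });
    const url = URL.createObjectURL(blob);
    const a = document.createElement('a');
    a.href = url;
    a.download = 'recording.webm';
    a.click();
    URL.revokeObjectURL(url);
  };

  recorder.start();
  setTimeout(() => recorder.stop(), durationMs);
}
```

You would call something like `record(canvasElement, 5000)` to capture five seconds of the animation.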
Now this code looks very simple, right? And I really hoped this entire process was this simple. But sadly, it's not. This entire method of using captureStream is not really well suited for this use case, and there are four reasons for it. The first one is that there is no constant frame rate with this approach.
4. Capturing and Encoding Video Frames
The captureStream function we saw takes in one parameter, which is the frame rate. But it is actually the maximum frame rate at which the video should be captured. Now, if the device on which the canvas is rendering the frames is underpowered, the browser might decide to drop frames in order to improve the rendering performance on the canvas. And this leads to lower reliability: the same code running on two different devices can give you two different outputs.
So there's no constant frame rate, since the browser might at any time decide to drop frames to improve performance. There is no option to change the video file format: it always produces output in the WebM format, which is not really common in other video-editing software if you're trying to do something with them. There's low reliability, as I mentioned: the code might give you two different outputs on two different devices because of their performance. And the output is only of decent quality, nothing too great, and it might even be bad because of dropped frames. But yeah, it honestly depends on the performance of the device.
Now, it might seem like I'm trying to bash the media recording APIs here, but honestly they are not designed for such precise and high-quality video exports. The media recording APIs are much more suited to real-time applications, where a trade-off in quality is acceptable as long as you're getting higher speeds. So if captureStream cannot work, can we capture and encode individual frames of the canvas ourselves? Here's some pseudocode with this idea. In a loop, I seek the canvas to every frame, capture it as an image, and then process all the frames of the animation into a single video file. The loop starts by seeking the canvas to the frame; here I've seeked it to frame one. Then I capture the frame as an image using the toDataURL function, passing it 'image/webp' to capture it as a WebP image file. Once the image is ready, I push it into the images array. Then again, I seek to the next frame, capture it as an image, and store it in the images array. And once that's done for all the frames, I come out of the loop, and I can pass the images array to an encode function to encode it as a video file and download it on the user's device.
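The loop described above can be sketched like this. `seekTo` is a hypothetical helper standing in for however your animation redraws a given frame; the encode step that consumes the returned array is left out:

```javascript
// Capture every frame of a canvas animation as a WebP data URL.
// `seekTo(canvas, frame)` is assumed to re-draw the animation at
// exactly that frame before we snapshot it.
function captureFrames(canvas, totalFrames, seekTo) {
  const images = [];
  for (let frame = 1; frame <= totalFrames; frame++) {
    seekTo(canvas, frame);                        // draw this exact frame
    images.push(canvas.toDataURL('image/webp'));  // snapshot as a WebP image
  }
  return images; // hand this array to an encode() step afterwards
}
```

Because each frame is drawn and captured explicitly, nothing can be dropped, which is where the constant frame rate comes from.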
5. Discovering an Interesting New Standard
And then you define the frame rate of your video, which is 30 in my case, and it will give you a video file that you can download on the user's device. You can check out WAMI at its GitHub repository over here.
Coming to the pros. The first pro is that you get high-quality output, because we are manually seeking, capturing each frame as an image, and then encoding. We get a constant frame rate, because there's no way a frame can be dropped: we are manually seeking and ensuring that every frame gets captured. The entire process was also relatively easy. As long as you have a nice library like WAMI to do the encoding, the rest of the code is really simple to understand.
So now what? It seems like we have reached a dead end. And honestly, at this point, I had given up. I tried out a lot of JS libraries, a lot of approaches to solve this problem, but most of them had one con or another which was leading to poor-quality output. But eventually I stumbled across this interesting new standard that appeared in browsers, but not enough people have been talking about it. There's very little documentation online about it as well, and I hope this talk actually fills in that gap and gives you the knowledge you need to work with this interesting standard. This new standard that has landed is, drumroll...
6. Introduction to WebCodecs
WebCodecs are a set of APIs that give you low-level access to the individual frames of a video stream. It allows you to do encoding and decoding process on individual frames in various formats. WebCodecs is asynchronous and hardware accelerated, providing high performance. It supports video and audio encoding and decoding, with a focus on video encoding in this talk.
It's WebCodecs. So what is WebCodecs? WebCodecs are a set of APIs that give you low-level access to the individual frames of a video stream. So what WebCodecs allows you to do is once you have the access to these individual frames, it allows you to do encoding and decoding process on individual frames in various formats. The nice thing about WebCodecs is that it's asynchronous and it's also hardware accelerated. So you get really high performance because of these two features. And WebCodecs is more than just video. It also supports audio. You can do video encoding, decoding, audio encoding, decoding. The APIs are vast. But in this talk, I'll only be focused on video encoding aspects of WebCodecs.
7. Video Encoding Process and Using WebCodecs
The WebCodecs API provides low-level access to individual frames, making it useful for building video or audio editors. The video encoding process involves defining the input source, converting it to a video frame object, and passing it to a video encoder to encode multiple frames into a single encoded video chunk. We'll focus on the storage aspect of the encoded video chunk. To encode videos using WebCodecs, we create a video encoder and use callbacks to handle the encoded video chunks and errors.
Now, the other day I was looking through the WebCodecs article on MDN. This is the first paragraph from the article, and there's a part of it which got me really excited when I first looked at it. I'm not sure if you noticed it, but this is the bit that I'm talking about. What it essentially says is that the WebCodecs API gives low-level access to individual frames and is useful for building applications such as video or audio editors. And this is exactly what I'm building. So it felt like I was on the right track, and using WebCodecs would eventually lead me to the correct solution for the export process.
So let's look at the video encoding process at a high level. The entire process is actually divided into three parts. The first part is defining your input source. This is where you define how you input data to the video encoder. Now there are three main ways to render data that WebCodecs can use: the first is using a canvas to draw your frames, the second is using ImageBitmap, the third is MediaStreamTracks, and you can also use ArrayBuffers to define individual pixel data manually. In this talk I'll be using canvas. That is also why I suggested you use canvas in the first place.
So once you have your input source ready, the next step is converting that input source to a video frame object. This video frame object is then passed to a video encoder that we create, and that video encoder takes in multiple video frames and encodes them into a single encoded video chunk. Once the encoded video chunk is ready, we can do various things with it: we can transmit it over the network, or we can store it somewhere. But in our talk we'll only be focused on the storage aspect of this encoded video chunk. So let's finally look at some code to encode videos using WebCodecs.
The first step in this entire process is creating a video encoder. The constructor of this VideoEncoder class takes in an object with two callbacks. The first one is output: this gets called whenever the video encoder has an encoded video chunk ready, and we can do something with this chunk, like transmit it over the network or store it. There's also an error callback, which will be called whenever an error occurs. Now, as I mentioned, since we are interested in storing these chunks to process them later, we'll create a chunks array and store all the chunks inside this array.
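A minimal sketch of that first step, assuming the chunks are kept in a module-level array for later processing. VideoEncoder is a browser-only WebCodecs API:

```javascript
// Collect every encoded chunk the encoder produces, for later muxing.
const chunks = [];

function createEncoder() {
  return new VideoEncoder({
    // Called each time an encoded video chunk (plus codec metadata)
    // is ready. We just store it; we could also send it over the network.
    output: (chunk, metadata) => chunks.push(chunk),
    // Called when encoding fails; a real app should surface this
    // instead of just logging it.
    error: (e) => console.error('Encode error:', e),
  });
}
```

The encoder created here is not usable yet; it still has to be configured, which is the next step.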
8. Configuring the Video Encoder
For error handling, a console log is used. The video encoder needs to be configured with various options such as codec, hardware acceleration, frame rate, and latency mode. The codec is the most important option. Finding the right codec is not as simple as picking WebM, MP4, or AVI. The video codec is not the same as the video file format.
For error handling I've simply put a console log. You can handle it in a better way.
The next step is actually configuring the video encoder. We have created the encoder, but we haven't configured it yet with our settings, so this is where we do it. We call configure with various options; some of the important ones are mentioned over here. The first option is codec. We'll discuss this in great detail; for now I've just set it to VP8. The second option is hardware acceleration. This is where you can tell the browser to prefer hardware acceleration if the device supports it, or you can choose to skip it or let the browser decide on its own. The third option is frame rate. This is where you tell it what the frame rate of your video is; in this case it's 30. The next option is latency mode. This is where you say whether you prefer quality or speed in the encoding process. If you prefer speed, you can set this to realtime, and you'll trade off some quality for higher speed, which might be useful for real-time applications. But since I'm more interested in high-quality outputs, I'll set the latency mode to quality. Then there are various other options like width, height, or bitrate. But the most important option of all is the first one, which is codec.
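A sketch of such a configuration object, using the option names from the WebCodecs spec. The codec string, dimensions, and bitrate here are arbitrary examples, not values from the talk:

```javascript
// Example VideoEncoder configuration. 'vp8' is a codec string browsers
// commonly accept; width/height/bitrate are illustrative values.
const config = {
  codec: 'vp8',
  width: 1280,
  height: 720,
  framerate: 30,                           // frames per second
  hardwareAcceleration: 'prefer-hardware', // or 'prefer-software' / 'no-preference'
  latencyMode: 'quality',                  // 'realtime' trades quality for speed
  bitrate: 5_000_000,                      // bits per second
};

// In the browser you would check support first, then configure:
//   const { supported } = await VideoEncoder.isConfigSupported(config);
//   if (supported) encoder.configure(config);
```

Checking `isConfigSupported` before configuring matters because, as discussed below, codec support varies between browsers.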
So how do you go about finding the right codec? It's not as simple as WebM, MP4, or AVI. Obviously I hoped it was that easy, but it's not. The codec strings themselves seem random, for example vp8, or avc1.420 and so on. They just seem like magical strings that happen to work. The main takeaway that I want you to have from this slide is that the video codec is not the same as the video file format.
9. Understanding Codecs and Containers
Codec is an algorithm to convert and compress individual frames of the video. The video stream is further processed and passed to a muxer to create a video container. Videos contain audio, subtitles, and visual data, which are combined by the muxer into a single container. Codecs and containers must be compatible. Browsers support different codecs, so compatibility can vary. For more information on codec compatibility, check the provided links.
And this is the thing that got me confused a lot in this entire journey. I always thought that a codec is the same thing as WebM, MP4, or AVI, but they're honestly very different, and to understand how different they are, let me actually give you a 101 on the entire video encoding process.
So the entire process starts with the individual frames of your video; there are three frames in this demo. All these frames get passed to what's called an encoding algorithm. Now, the job of the encoding algorithm is to take the frames, convert them, compress them, and store them in a single video stream. This encoding algorithm, in other words, is called a codec. It takes in video frames, converts them, compresses them, and stores the frames in a video stream. So the output of the codec is a single video stream. And this is not yet the thing that's stored on your hard drive.
The video stream is then further processed and is passed to what's called a muxer to create what's called a video container. So video container is what contains your video stream and this video container is finally stored on your hard disk. So the mp4 file that you see, it's actually a container which internally has video stream data that has the individual frame data.
Now you might wonder why we need a muxer. Can't the encoder just do this step for us as well? Well, that's because videos are not just visual data: they have audio, they have subtitles, and all of these are part of a single file. The magic behind having everything inside a single file is made possible by the muxer. The muxer combines, or multiplexes, various streams of data into a single container. As you can see here, this single container, video.mp4, has video stream data, audio stream data, and subtitles, and it's the job of the muxer to combine them all together.
So what's a codec again? Well, a codec is simply an algorithm to convert and compress the individual frames of a video. A codec and a container must be compatible with each other to use them correctly. If you're using, say, the WebM container, you have to use the VP8 or AV1 codec, because the WebM container can only store video streams encoded with those codecs. Similarly, if you want to use MP4, you have to use the AVC codec or the AV1 codec. And the example strings that I have mentioned here are the strings that you have to pass in the codec parameter of the configure function.
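The container-to-codec pairings from the talk can be written down as a small lookup. The exact codec strings here are illustrative examples (avc1.42001E is an H.264 Baseline string, av01.0.04M.08 an AV1 string), and real support varies by browser:

```javascript
// Which codec strings can go inside which container, per the talk.
// Strings are examples of what you'd pass as the `codec` field of
// VideoEncoder.configure(); exact support differs between browsers.
const containerCodecs = {
  webm: ['vp8', 'av01.0.04M.08'],        // VP8 / AV1
  mp4: ['avc1.42001E', 'av01.0.04M.08'], // H.264 (AVC) / AV1
};

function isCompatible(container, codec) {
  return (containerCodecs[container] || []).includes(codec);
}
```

So a pipeline targeting an .mp4 file needs an AVC or AV1 encoder configuration, never a VP8 one.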
Now, to make the entire process even more complicated, every browser supports a different set of codecs. So the browser that you're using might support some codec which is not supported in another browser. So yeah, that also fragments this entire API a bit. I won't be diving deep into codecs now; I've left two links over here which you can check out to understand the compatibility of various codecs and what the individual numbers in those codec strings mean. I strongly suggest you check them out. Coming back to the code, this is the configure function.
10. Video Encoding Process and Muxing with ffmpeg
I've set the codec as AVC1. We pass the canvas to video frame to create a video frame. We close the video frame. Once all the video frames are rendered, we flush the frame data from the encoder, close the encoder, and concatenate the chunk data to a single array buffer. WebCodecs only handles the first three steps of the video encoding process. Muxing a video on the Web is not well-documented, but we can use ffmpeg as a muxer to mux the final video stream to the video container.
I've set the codec as AVC1. Since I'm using the AVC codec, I have the option to pass a few more options for the AVC format itself. So here I'm telling it to use the Annex B format, which is a bitstream format for H.264 video. Again, I won't dive deep into what it is; basically, you get options to configure the codec itself as well.
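A sketch of the configuration as described, using an example AVC codec string (treat the exact string and dimensions as illustrative, not the speaker's values):

```javascript
// AVC (H.264) encoder configuration with the Annex B bitstream format,
// which is what we'll later hand to ffmpeg as a raw .h264 stream.
const avcConfig = {
  codec: 'avc1.42001E',      // example H.264 Baseline codec string
  width: 1280,
  height: 720,
  framerate: 30,
  latencyMode: 'quality',
  avc: { format: 'annexb' }, // codec-specific option: raw Annex B output
};

// In the browser: encoder.configure(avcConfig);
```

The `avc` field is the codec-specific options object the talk refers to; it only exists when the codec string is an AVC one.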
Now, after this entire step, we have the video encoder ready, and it can take video frames and encode the video. The next step is going through every frame of our video, rendering them on the canvas, and then converting that canvas to a video frame. So we pass the canvas to VideoFrame to create a video frame. It also requires a timestamp option, which is the time at which the frame appears in the video. Here I'm calculating the timestamp from the frame number and the frame rate, and since the timestamp is in microseconds, I also multiply it by one million. So we get a video frame, and then we pass it to the video encoder using the encode function.
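The timestamp math and per-frame encode step can be sketched like this. VideoFrame is a browser-only API; the helper names are my own, and the frame numbering assumes frames count up from zero:

```javascript
// Timestamps are in microseconds: (frame / fps) seconds * 1,000,000.
function frameTimestampMicros(frame, fps) {
  return Math.round((frame / fps) * 1_000_000);
}

// One iteration of the encode loop (browser only): wrap the canvas in
// a VideoFrame, hand it to the encoder, then release the frame.
function encodeFrame(encoder, canvas, frameIndex, fps) {
  const videoFrame = new VideoFrame(canvas, {
    timestamp: frameTimestampMicros(frameIndex, fps),
  });
  encoder.encode(videoFrame);
  videoFrame.close(); // free the frame's memory once it's handed off
}
```

At 30 fps, frame 30 lands at exactly one second, i.e. a timestamp of 1,000,000 microseconds.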
Since our job with the video frame is done, in the next step I close the video frame, and this keeps happening for every iteration, for all the frames in our video. Now, once all the video frames are rendered and passed to the video encoder, we can flush all the frame data from the encoder so that it processes them and creates the chunks, and then we close the encoder, because our job with it is done. Now remember, we had attached a callback on the video encoder when we created it, which was pushing the data into the chunks array. Since the chunks array is effectively an array of arrays, we can concatenate, or flatten, it into a single array buffer, and this concatBuffer function does exactly that. I won't be going deeper into what concatBuffer does, but it's essentially flattening the chunks array, combining multiple array buffers into a single array buffer. There are various algorithms online that you can find to do this operation.
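A minimal sketch of the flattening step. Note that EncodedVideoChunk doesn't expose its bytes directly; you copy them out with copyTo first. The helper names here are my own, not the speaker's:

```javascript
// Copy one EncodedVideoChunk's payload into a plain byte array.
function chunkToBytes(chunk) {
  const bytes = new Uint8Array(chunk.byteLength);
  chunk.copyTo(bytes); // EncodedVideoChunk exposes copyTo()
  return bytes;
}

// Flatten an array of byte arrays into one contiguous Uint8Array.
function concatBuffers(buffers) {
  const total = buffers.reduce((sum, b) => sum + b.byteLength, 0);
  const result = new Uint8Array(total);
  let offset = 0;
  for (const b of buffers) {
    result.set(b, offset);
    offset += b.byteLength;
  }
  return result;
}

// Usage: const videoBuffer = concatBuffers(chunks.map(chunkToBytes));
```

The resulting single buffer is the raw video stream that the muxing step below takes as input.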
Now, after this operation, you might think that we can simply download the video buffer we have, right? That's the end? But no: recall the video encoding process again. There were various steps in it, and WebCodecs only handles the first three, which is taking the individual frames, passing them through a codec, and giving you a video stream. The next step, passing it through a muxer, we have to handle on our own. Now, this bit is not documented anywhere; at least, it's very hard to find online how to go about muxing a video on the Web. After a lot of searching, a lot of trial and error, I was finally able to figure out how to mux a video reliably on the Web. And this is where our discussion shifts to the decades-old library called ffmpeg. No discussion of video encoding is complete without this library. ffmpeg is a library written in C to encode and decode video and audio, and do anything related to multimedia. And thanks to WebAssembly, we can now use ffmpeg in the browser. So here I'll be using ffmpeg.wasm to run ffmpeg in the browser as a muxer, to mux the final video stream into the video container.
11. Encoding a Video in the Browser
There are various compilations available, but I found this one to be the best. The next step is actually reading the output after the muxing process is done and deleting the files because we no longer need them. And that is how you encode a video in the browser. The list of pros that you get with this approach is high quality output, constant frame rate, fast process, hardware acceleration, support for any video format, and the ability to add audio. The only con is that it's relatively hard, but I hope this talk made it easy for you. WebCodecs is supported in all browsers except Firefox, and I hope Firefox gets support soon. WebAssembly is supported in all browsers.
There are various compilations available, but I found this one to be the best, so do try out ffmpeg.wasm. Once ffmpeg.wasm is installed in the project, we call the createFFmpeg function from the library. Then we call load on that object to load the WebAssembly file. Then we use the ffmpeg.FS function to access the file system and write the video buffer, the video stream, into the file system. The name here is set as raw.h264; the .h264 extension tells ffmpeg what codec we used to encode this particular stream. Then we run the ffmpeg command. The command that I have here muxes one video stream into the final video container. Muxing itself is a very cheap process; it happens really quickly, so there is no performance bottleneck here.
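The muxing steps above can be sketched like this, assuming the pre-0.12 ffmpeg.wasm API (createFFmpeg / FS / run) and the raw.h264 file name from the talk. The exact command-line flags are my own example of a copy-only mux; an already-loaded ffmpeg instance is passed in:

```javascript
// Mux a raw Annex B H.264 stream into an MP4 container with ffmpeg.wasm.
// `ffmpeg` is an already-loaded instance, e.g.:
//   const ffmpeg = createFFmpeg(); await ffmpeg.load();
// `videoBytes` is the Uint8Array produced by the encoding step.
async function muxToMp4(ffmpeg, videoBytes, fps) {
  // Write the raw stream into ffmpeg's virtual file system; the
  // .h264 extension tells ffmpeg the stream's codec.
  ffmpeg.FS('writeFile', 'raw.h264', videoBytes);

  // -c:v copy: no re-encoding, just repackaging, so muxing is fast.
  await ffmpeg.run('-framerate', String(fps), '-i', 'raw.h264', '-c:v', 'copy', 'video.mp4');

  const output = ffmpeg.FS('readFile', 'video.mp4');

  // Clean up the virtual file system; the files are no longer needed.
  ffmpeg.FS('unlink', 'raw.h264');
  ffmpeg.FS('unlink', 'video.mp4');

  return output; // bytes of the final MP4, ready to wrap in a Blob
}
```

Adding audio fits the same shape: write an audio file into the FS as well and add a second `-i` input to the run command.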
The next step is reading the output after the muxing process is done, and deleting the files because we no longer need them. We convert that output file into a blob and download it on the user's device. And that is how you encode a video in the browser. It might seem like a lot of steps, but honestly, the list of pros you get with this approach will literally fill the screen; I'll show it to you now. The first pro is that you get high-quality output, because we are again rendering each frame, capturing them, and then encoding them. We get a constant frame rate, because there's no place where frames can be dropped. The entire process is really fast, because most of the APIs are asynchronous, and it also supports hardware acceleration. You can use any video format, as long as the codec you want is supported in the browser. And it also supports adding audio. The audio bit happens at the muxing layer: the ffmpeg muxer we're using already supports including audio in that single command, so you can easily inject audio into the final video as well.
Now, the only con with this entire process is that it's relatively hard. But it was hard for me, and I hope this talk actually made it easy for you. Speaking of browser support, WebCodecs is supported in all browsers except Firefox, which is kind of sad, so I really hope Firefox gets support for WebCodecs soon. WebAssembly is supported in all browsers.
12. Building Slantit: A Tool for Making Product Videos
We can freely use FFmpeg and Wasm libraries to do crazy things on the Web. I built a tool called Slantit for making catchy product videos quickly in the browser. The video in the background is an example of a Slantit-made video. You can try Slantit at slantit.app. It uses the same video encoding process we discussed. Thank you for your attention and the opportunity to talk about web codecs. Connect with me on Twitter, GitHub, and my personal website.
So we can freely use ffmpeg and other WebAssembly libraries to do crazy things on the Web. Now remember, I was building a tool to make product videos. I finally ended up building it, and I want to share it with you so that you can try it out. It's called Slantit, and it allows you to make catchy product videos really quickly in your browser.
So the video that you see in the background, that is a video made with Slantit. I recorded my screen while using Slantit, then used Slantit to edit it, add these 3D transition effects, and make a really nice product video for Slantit itself. You can check out Slantit at slantit.app, and yeah, it uses the same video encoding process that we discussed in this talk.
So yeah, I really hope you try it out and give your feedback. And that brings us to the end of the talk. I really hope you learned something from it, and I also want to thank the people for selecting this talk and giving me an opportunity to talk about web codecs on such a wide platform. So yeah, thank you. You can connect with me and follow me on Twitter; I have a GitHub and a personal website as well. Looking forward to talking to you.