This talk gives an introduction to MediaPipe, an open source machine learning solution that allows running machine learning models on low-powered devices and helps integrate those models with mobile applications. It gives creative professionals dynamic tools and makes machine learning easy to apply, so they can build powerful, intuitive applications with little or no prior machine learning knowledge. We also see how MediaPipe can be integrated with React, making it easy to add machine learning use cases to web applications built with React.
Using MediaPipe to Create Cross Platform Machine Learning Applications with React
Transcription
Welcome, everyone, to my talk at React Day Berlin 2022. I'm Shivay, presenting virtually on the topic of using MediaPipe to create cross-platform machine learning applications with the help of React.js. I'm a Google Summer of Code mentor at MediaPipe and also a TensorFlow.js working group lead, and you can connect with me on my Twitter, howdevelop. Without wasting any further time, let's get started.

Today we see applications of machine learning everywhere, and this is especially true for web applications with the advent of libraries like TensorFlow.js and MediaPipe. There are a lot of full-stack applications that use machine learning capabilities in their web apps, and we're seeing them in production at many startups and even at companies like LinkedIn, which use machine learning to power multiple features. That's because machine learning is so versatile that you can use it for a number of different applications.

Here are some of the common areas where you see machine learning in use, and there is one thing that is common to all of them. On the left-hand side you can see face detection on the iPhone XR, then points that can track your hands, some really cool effects on the web, facial expression detection, the Nest Cam that uses its camera to detect objects, "OK Google" and the Google Assistant, and even devices like the Raspberry Pi and Coral Edge TPUs. All of them have one thing in common: they are powered by machine learning, and in many cases with the help of MediaPipe.

So what is MediaPipe? MediaPipe is an open source, cross-platform framework for building perception pipelines, dedicated especially to video- and audio-based perception. Think of it this way: if you want to build an end-to-end machine learning application, MediaPipe lets you prepare not only your datasets but also the entire machine learning inference flow. That means not just getting the objects that will be used for detection, but also producing the visualizations and outputs for whatever model you are running. In a typical machine learning scenario, you start with some input data, run a machine learning model on top of it, and get some inference. MediaPipe provides an end-to-end pipeline for doing that inference, and it's especially useful for analyzing video or audio data. Today we'll see some examples where you can use it with live audio, live video, or a camera-based scenario.

There are, of course, a lot of features that come out of the box. MediaPipe provides end-to-end acceleration, which means it can use your system's CPU or GPU. The underlying technology, especially if you're using it with JavaScript, is WebAssembly on the backend, and with the help of WebAssembly you can leverage your system hardware to accelerate and improve the performance of machine learning inference.
Another great thing is that you just need one MediaPipe pipeline and one MediaPipe model, and it can be used in multiple places, because MediaPipe is supported across multiple frameworks, including JavaScript, Android, iOS, and other platforms. You can also deploy it on platforms like the Raspberry Pi for IoT or edge applications. And there are ready-to-use solutions; we'll be exploring some of these MediaPipe solutions in a bit. These are completely ready to use: you just import them inside your programs. For example, if you're using JavaScript, you just import the relevant function and you can use it very quickly. All of these solutions are completely open source, so if you want to build on them, you can check out their code base and adapt them for your own use case.

These are some of the solutions that are currently available. As a quick reminder, when we talk about solutions, these are essentially end-to-end machine learning pipelines: everything from detection to classification to getting your inference up and running is handled by these pipelines provided by MediaPipe. You have some standard models that you will also see in Python, things like object detection and object tracking, but also face mesh and human pose tracking with BlazePose. Many of these are used today by startups building virtual fitness products: keeping track of your exercises with just your webcam, doing things like rep counts, or acting as an e-physiotherapist. Then there is face detection, which is used by companies like L'Oreal for augmented reality lipstick try-ons. So a lot of these solutions are already being used in production. There are even more solutions, and you can check them all out on the MediaPipe website, mediapipe.dev. For example, the selfie segmentation solution is used in Google Meet to put up virtual backgrounds. These are just some of the solutions that you can directly use and embed inside your program.

And here are some examples we like to share. On the left-hand side you can see one very similar to the L'Oreal augmented reality lipstick try-on, then augmented reality movie trailers on YouTube, and Google Lens, which can place augmented reality objects in front of you using computer vision. These are some examples where MediaPipe is being used not just in JavaScript applications but on other platforms as well.

Of course, I'd also like to break down how this inference actually takes place. For that, let's take a look at live perception, using the hand tracking algorithm as an example. The idea is that if you hold your hand in front of a webcam, you should be able to detect something that we call landmarks (sketched below).
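For orientation, here is roughly the shape those landmarks take when they come back from the JavaScript Hands solution: 21 points per detected hand, each with coordinates normalized to the image. The typings below are an illustrative sketch modeled on the `@mediapipe/hands` results object, not a verbatim copy of its definitions.

```ts
// Sketch of the result shape produced by the @mediapipe/hands JS solution.
// Field names follow the public API; the wrapper types here are illustrative.

interface NormalizedLandmark {
  x: number; // normalized to [0, 1] across the image width
  y: number; // normalized to [0, 1] across the image height
  z: number; // relative depth, roughly anchored at the wrist
}

interface HandsResultsSketch {
  image: CanvasImageSource;                    // the frame the inference ran on
  multiHandLandmarks?: NormalizedLandmark[][]; // 21 landmarks per detected hand
}

// Example: log the index fingertip (landmark 8) of the first detected hand.
function logIndexFingertip(results: HandsResultsSketch): void {
  const hand = results.multiHandLandmarks?.[0];
  if (!hand) return;
  const tip = hand[8];
  console.log(`index fingertip at x=${tip.x.toFixed(2)}, y=${tip.y.toFixed(2)}`);
}
```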
So the idea is that you take an image or a video of your hand, and the ML model, or rather the MediaPipe pipeline, produces these specific landmarks. The landmarks denote the different joints inside your hand, and you can overlay them on top of your hand so that the pipeline detects the hand, finds the exact location of each landmark, and superimposes them on the image. That is what we mean by localizing your hand landmarks.

First, you take your video; it could be webcam footage or Nest Cam footage, for that matter. Then we transform the image to the size expected by the machine learning model. If your original size is 1920 by 1080, the image undergoes a transformation where it is resized and rotated to fit the size the model expects. Then we convert the image into tensors. If you haven't heard of tensors, they are the building blocks of deep learning and of TensorFlow, which you have probably heard of; they are essentially large n-dimensional numerical arrays. So we convert our image into these large mathematical arrays, which undergo some further transformations so they can be consumed by the machine learning model. Once the image has been converted to tensors, we run the machine learning inference on top of them, which produces new output tensors. We then convert those output tensors into the landmarks you see on the right-hand side and render them on top of the image that captured your hand. Finally, you end up with a video output like the one shown, where the landmarks have been superimposed on top of the hand. The perception pipeline looks similar for the different MediaPipe solutions you might use: capture your video input, convert it into tensors, run inference on top of them, transform the resulting tensors, and finally convert everything back into a human-interpretable form. That is what MediaPipe handles for you.

Now let's talk a bit more about graphs and calculators. Whenever we talk about any MediaPipe solution, there are primarily two concepts to consider. The first is a MediaPipe graph, which denotes the entire end-to-end perception pipeline we just walked through for the hand detection example. Inside this graph, if you are aware of how graphs work in data structures, there are edges and nodes. In a MediaPipe graph, each node represents a unique MediaPipe calculator, and, very much like nodes being connected by edges, in MediaPipe two nodes are connected by something called a stream.
These nodes accept inputs as packets that carry the data flowing through the pipeline. All of the calculators are written in C++, so whenever you run an inference or the actual perception pipeline, everything is handled through the flow of packets between these nodes, and the computations are performed by the MediaPipe calculators sitting at each node. Nodes have input and output ports: input ports take care of the incoming packets, and output ports take care of the outgoing result packets that you end up seeing as the output. If you want to see different visualizations, you can take a look at viz.mediapipe.dev to see the perception pipelines for the different MediaPipe solutions. I definitely recommend checking that out if you're interested in what's going on behind the scenes. It's not required if you're coming from a JavaScript background — you can use the solutions directly — but if you are interested, you can definitely check it out as well.

Now, coming to the most important part: how to use these MediaPipe packages within React. There are a number of MediaPipe solutions with their own npm packages that you can directly embed inside your React code. Some of the most common ones are on the screen, including face mesh, face detection, hands, holistic, objectron, and pose. They can be used for a multitude of different applications, and if you are interested you can check out the specific examples provided by the MediaPipe team at mediapipe.dev/demo for whichever demo you want.

Integrating it with your React.js code is very simple. Once we have the npm packages, the first thing you do is import the specific npm package for the solution you want to use. Apart from that, there are a few other npm packages you'll have to install. One of them is camera_utils — you'll find the documentation for this on the MediaPipe in JavaScript page — which is used to capture your camera footage and feed the frames into the inference. In the lower half of the slide, we are using the selfie segmentation model. You can see that we first set up the webcam reference and the canvas. Then we run the model: it locates the file containing the actual machine learning model, takes your webcam footage, captures the input, runs inference on top of it, and renders the result onto the canvas. That's simple code, and within roughly 20 to 30 lines you are able to create an end-to-end machine learning solution, along the lines of the sketch below.
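To make that concrete, here is a minimal sketch of this kind of React integration, assuming the published `@mediapipe/selfie_segmentation` and `@mediapipe/camera_utils` packages. It is illustrative rather than the exact code shown on the slide.

```tsx
import React, { useEffect, useRef } from "react";
import { SelfieSegmentation } from "@mediapipe/selfie_segmentation";
import { Camera } from "@mediapipe/camera_utils";

// Minimal sketch: stream the webcam through selfie segmentation and
// paint only the segmented person onto a canvas.
export function SelfieSegmentationDemo() {
  const videoRef = useRef<HTMLVideoElement>(null);
  const canvasRef = useRef<HTMLCanvasElement>(null);

  useEffect(() => {
    const video = videoRef.current;
    const canvas = canvasRef.current;
    if (!video || !canvas) return;
    const ctx = canvas.getContext("2d")!;

    // locateFile tells the solution where to fetch its model and wasm assets.
    const selfieSegmentation = new SelfieSegmentation({
      locateFile: (file) =>
        `https://cdn.jsdelivr.net/npm/@mediapipe/selfie_segmentation/${file}`,
    });
    selfieSegmentation.setOptions({ modelSelection: 1 });

    // For each result, draw the segmentation mask, then composite the frame
    // so only the person pixels remain visible.
    selfieSegmentation.onResults((results) => {
      ctx.save();
      ctx.clearRect(0, 0, canvas.width, canvas.height);
      ctx.drawImage(results.segmentationMask, 0, 0, canvas.width, canvas.height);
      ctx.globalCompositeOperation = "source-in";
      ctx.drawImage(results.image, 0, 0, canvas.width, canvas.height);
      ctx.restore();
    });

    // camera_utils pushes each webcam frame into the solution.
    const camera = new Camera(video, {
      onFrame: async () => {
        await selfieSegmentation.send({ image: video });
      },
      width: 640,
      height: 480,
    });
    camera.start();

    // Release the solution when the component unmounts.
    return () => {
      selfieSegmentation.close();
    };
  }, []);

  return (
    <div>
      <video ref={videoRef} style={{ display: "none" }} />
      <canvas ref={canvasRef} width={640} height={480} />
    </div>
  );
}
```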
Now let's take a look at a demo that I created. The demo is pretty much the hand perception pipeline we just described. If I bring up my hand, you can see the label says "open," and if I close it, the label changes to "closed." Let's see how this actually works.

The model we're looking at is the hands model, which, as we showed in the perception pipeline, detects the landmarks of your hand and then superimposes them. Here I have my code running in a GitHub Codespace. The most important file we're going to look at is index.ts. This is where you'll see that we have imported the hands solution from MediaPipe. I'm using two objects here: the hands object itself and the hand connections, which together give us the list of the different points, or landmarks, that we have. Much like the example in the slides, we first set up some important constants, primarily for the landmarks, and fetch the camera's video element. Since it's going to be detecting hands, we run everything on top of a canvas; inside the demo, we render all of this onto a canvas element. We render the landmarks so we can see how many landmarks we actually get on the canvas, and then we draw the actual image on top of the canvas. This is where we process the footage.

If you look at code lines 36 to 48, this is where, when you bring your hand near your webcam, it fetches the landmarks and finds the coordinates of each of them. Since it's a 2D image, it fetches the x and y coordinates of every landmark and stores them in an array. Then we draw a rectangle, so, as you can see when I bring my hand up in the demonstration, it renders these landmarks by fetching the x and y coordinate of every landmark it finds and draws them on top of the image rendered on the canvas. This is also where we load the actual model, the hands model, and where we initialize the camera; when we run the demo, this is where the camera starts. Then we have a couple of functions, including one that loads the hands model, and finally we render the results using the async onResults function, which captures your footage, renders the landmarks, and connects them, ensuring they are superimposed on top of your footage.

This is one example of running the hands demo; the separate logic on top of it detects whether the hand is open or closed. The logic I used is that when the landmarks are not overlapping with each other, it prints the label "open," and when the landmarks are overlapping with each other, it prints the label "closed." Depending on your needs or on how you want to use this particular model, you can write your own custom logic in JavaScript, because each of these landmarks has its own unique coordinates.
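As a companion to this walkthrough, here is a hedged sketch of what an onResults handler like the one described can look like. The drawing helpers and HAND_CONNECTIONS come from the public `@mediapipe/drawing_utils` and `@mediapipe/hands` packages; the open/closed heuristic shown is just one illustrative way to do it and is not necessarily the exact logic used in the demo.

```ts
import { Hands, HAND_CONNECTIONS, Results } from "@mediapipe/hands";
import { drawConnectors, drawLandmarks } from "@mediapipe/drawing_utils";

const canvas = document.querySelector<HTMLCanvasElement>("canvas")!;
const ctx = canvas.getContext("2d")!;

// One illustrative heuristic (not necessarily the demo's exact logic):
// a finger counts as extended when its tip is farther from the wrist than
// its middle (PIP) joint; if most fingers are extended, the hand is "open".
function isHandOpen(landmarks: { x: number; y: number }[]): boolean {
  const wrist = landmarks[0];
  const dist = (a: { x: number; y: number }, b: { x: number; y: number }) =>
    Math.hypot(a.x - b.x, a.y - b.y);
  const tips = [8, 12, 16, 20]; // index, middle, ring, pinky fingertips
  const pips = [6, 10, 14, 18]; // the corresponding middle joints
  const extended = tips.filter(
    (tip, i) => dist(landmarks[tip], wrist) > dist(landmarks[pips[i]], wrist)
  ).length;
  return extended >= 3;
}

function onResults(results: Results): void {
  ctx.save();
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  // Draw the camera frame, then superimpose the detected landmarks on it.
  ctx.drawImage(results.image, 0, 0, canvas.width, canvas.height);
  for (const landmarks of results.multiHandLandmarks ?? []) {
    drawConnectors(ctx, landmarks, HAND_CONNECTIONS, { color: "#00FF00", lineWidth: 3 });
    drawLandmarks(ctx, landmarks, { color: "#FF0000", radius: 4 });
    // Label the hand "open" or "closed" in the top-left corner of the canvas.
    ctx.font = "24px sans-serif";
    ctx.fillStyle = "#FFFFFF";
    ctx.fillText(isHandOpen(landmarks) ? "open" : "closed", 10, 30);
  }
  ctx.restore();
}

// Wire the handler up to the hands solution.
const hands = new Hands({
  locateFile: (file) => `https://cdn.jsdelivr.net/npm/@mediapipe/hands/${file}`,
});
hands.setOptions({ maxNumHands: 1, minDetectionConfidence: 0.7 });
hands.onResults(onResults);
```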
You could do a lot more with this. For example, if you wanted to build something like an American Sign Language application, you could train your model so that, depending on the positions of the landmarks and the way they are oriented, you create an entire end-to-end American Sign Language demonstration with the help of the hands model. That is, of course, totally up to you in terms of how you want to do it.

Going back to our screen, that was the quick demonstration I wanted to showcase. With that, I'd like to conclude my talk. If you have any questions about how to get started with MediaPipe in JavaScript, you can reach out to me, and I recommend checking out the MediaPipe in JavaScript documentation, where you'll find a list of all the different solutions and their respective npm modules, along with some working examples that are already out there. With that, I'd like to conclude, and thank you so much. I hope to see you in person next year at React Evolution. Thank you so much.