This talk gives an introduction to MediaPipe, Google's open-source machine learning framework that allows running machine learning models on low-powered devices and helps integrate those models with mobile applications. It gives creative professionals a lot of dynamic tools and makes machine learning easy to use, so you can create powerful, intuitive applications with little or no prior knowledge of machine learning. We will see how MediaPipe can be integrated with React, giving easy access to machine learning use cases when building web applications with React.
Hello everyone. I'm Shivay Lamba. I'm currently a Google Summer of Code mentor at MediaPipe, and I'm so excited to be speaking at React Advanced on the topic of using MediaPipe to create cross-platform machine learning applications with React. A lot of this talk is going to be centered on machine learning, MediaPipe, and how you can integrate MediaPipe with React to create really amazing applications. So without wasting any further time, let's get started.
Of course, today machine learning is literally everywhere. Look at any kind of application, whether it's in education, healthcare, fitness, or even mining, and you'll find machine learning being applied; it shows up in each and every industry known to humankind. That makes machine learning so much more important for web applications as well. And as more and more web applications come to market, we are seeing a lot more machine learning use cases on the web too.
[01:28] Let's actually look at a few examples. Over here, we can see face detection happening on an Android device. Then you can see hands getting detected in this iPhone XR image. Then there's the Nest Cam, the well-known security camera. Then you can see some web effects, where a lady has facial effects applied to her face in the browser, and you can also see the Raspberry Pi and other such microcontroller-based devices that run on the edge. So what do all of these have in common? That's the question.
[02:03] The thing that is common to all of these is MediaPipe. So what exactly is MediaPipe? MediaPipe is essentially Google's open-source, cross-platform framework that helps you build different kinds of perception pipelines. That means we can take multiple machine learning models and combine them in a single end-to-end pipeline to, let's say, build something, and we'll look at some of the common use cases very soon. It was previously used widely in a lot of research-based products at Google, but it has since been open-sourced, so now everyone can use it, and it can process any kind of audio, video, or image data, as well as sensor data.
[03:46] Now let's look at some of the most commonly used solutions. The most well-known ones include:

- Selfie Segmentation, which is also used in Google Meet for the different backgrounds you can apply and the blurring effect. It uses a segmentation mask to detect only the humans in the scene and extract just the information needed for them.
- Face Mesh, which gives you more than 400 facial landmarks and lets you make a lot of different, interesting applications, for example AR filters or makeup.
- Hair Segmentation, which allows you to segment out the hair.
- Standard computer vision algorithms, like object detection and tracking, which you can use to detect specific objects, plus face detection.
- Hand tracking, which can track your hands; you could use it for things like hand-based gestures to control, let's say, your web applications.
- Human pose detection and tracking, which you could use to create some kind of fitness or dance application that actually tracks you.
- Holistic tracking, which tracks your entire body: your face, your hands, and your whole pose. It's a combination of human pose, hand tracking, and Face Mesh.
- More advanced 3D object detection, which can help you detect bigger objects like chairs, shoes, and tables.

And there are a lot more solutions you can go ahead and look at. These are all end-to-end solutions that you can directly implement, which is why MediaPipe solutions are so popular.
And just to look at some real-world examples where it's actually being used: the Face Mesh solution is what you see over here powering the AR lipstick try-on in YouTube. Then we have the AR-based movie filters that can be used directly in YouTube. Then there are Google Lens services where you can see augmented reality taking place. It's also used not only for augmented reality but for other kinds of inference, like Google Lens translation, which uses MediaPipe pipelines in its backend. And you can see Augmented Faces, which again is based on Face Mesh.
Live Perception Example
[06:05] So let's look at a very quick live perception example of how this actually works. For this, we're going to look at hand tracking. Essentially, we take an image or a video of your hand, and we are able to place these landmarks on it. What are landmarks? Landmarks are the dots that you see superimposed on your hand; they denote the key points of your hand, the joints and fingertips, and we superimpose them on the frame. So this is what the example is going to look like.
So how would that simple perception pipeline look? Essentially, first you take your video input, then you get the individual frames from that video and resize each frame to the size expected by the model's input tensors, because internally MediaPipe uses TFLite, that is, TensorFlow Lite. Tensors are high-dimensional numerical arrays that hold the data the machine learning model operates on. So you do a geometric transformation of your frame into the size used by the tensors, the images get converted into that mathematical tensor format, and then you run the machine learning inference on them. Then you do some high-level decoding of the output tensors, which results in the creation of the landmarks. Finally, you render those landmarks on top of the image and get the output. So essentially, if you take your hand and superimpose those landmarks on top of it, you get the final result you see: hand tracking.
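The steps above can be sketched, very roughly, in plain JavaScript. This is only an illustration of the data flow, not MediaPipe's actual internals: the function names and the 256×256 input size are my own assumptions, and the real inference happens inside MediaPipe's TFLite-backed calculators.

```javascript
// Rough sketch of the hand-tracking perception pipeline described above.
// All names and the 256x256 input size are illustrative assumptions.

// 1. Geometric transform: compute how a frame must be scaled to the
//    fixed size the model's input tensor expects.
function toModelInputSize(frame, targetSize = 256) {
  return {
    width: targetSize,
    height: targetSize,
    scaleX: targetSize / frame.width,
    scaleY: targetSize / frame.height,
  };
}

// 2. (Inference happens here in the real pipeline: the resized frame is
//    converted to a tensor and run through the TFLite model.)

// 3. Decode: the model emits landmarks normalized to [0, 1]; map them
//    back to pixel coordinates so they can be rendered over the frame.
function decodeLandmarks(normalizedLandmarks, frame) {
  return normalizedLandmarks.map(({ x, y }) => ({
    x: Math.round(x * frame.width),
    y: Math.round(y * frame.height),
  }));
}

const frame = { width: 640, height: 480 };
const pixels = decodeLandmarks([{ x: 0.5, y: 0.5 }, { x: 0.25, y: 0.1 }], frame);
// pixels[0] is the centre of the frame: { x: 320, y: 240 }
```

The last step, rendering, is what the canvas overlay does in the React demos later in the talk.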
[07:45] So this is how we can build such pipelines, and what's happening behind the scenes, or under the hood, is the concept of graphs and calculators. If you're aware of the graph data structure, of how a graph has edges and vertices, a MediaPipe graph works in a similar manner. Whenever you create any kind of perception pipeline, or MediaPipe pipeline, it consists of a graph with nodes and edges, and the nodes specifically denote the calculators. The calculators are C++ components that implement the actual transformations; you can think of a calculator as the main brain behind the solution you're implementing.
These calculators are the nodes, and the data that comes into a node, gets processed, and comes out of the node flows along the edges; together, all of those connections represent the entire MediaPipe graph. So the input is what comes into a calculator, and once the calculations and transformations have been done, the output is what comes out. Essentially, that is how you can think of the entire perception pipeline: different kinds of calculators wired together to form one particular solution, all of it represented through the MediaPipe graph. That's what's going on behind the scenes in any MediaPipe solution.
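To make the graph idea concrete: a MediaPipe graph is typically declared in a pbtxt configuration file. The sketch below is a simplified, hypothetical hand-tracking graph; the stream and calculator names are illustrative, not copied from a real MediaPipe solution.

```
# Simplified, illustrative MediaPipe graph (pbtxt).
# Stream and calculator names here are hypothetical.
input_stream: "input_video"
output_stream: "output_video"

# Node: runs hand landmark inference on each incoming frame.
node {
  calculator: "HandLandmarkTrackingCpu"
  input_stream: "IMAGE:input_video"
  output_stream: "LANDMARKS:hand_landmarks"
}

# Node: draws the detected landmarks back onto the original frame.
node {
  calculator: "HandRendererSubgraph"
  input_stream: "IMAGE:input_video"
  input_stream: "LANDMARKS:hand_landmarks"
  output_stream: "IMAGE:output_video"
}
```

Each `node` is a calculator, and the named streams connecting them are the graph's edges.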
[09:23] You can look at the docs to learn more about calculators and graphs by going to docs.mediapipe.dev, or you can actually visualize different types of perception pipelines at viz.mediapipe.dev. The one we used was a very simple one that just detects the landmarks on your hand, but if you have much more complex pipelines, you can go ahead and use the visualizer to look at some of the pipelines on offer.
How can you integrate MediaPipe with React?
[09:56] And now coming to the essential part, what this talk is really all about: how can you integrate MediaPipe with React? There are a lot of NPM modules shared by the MediaPipe team at Google. Some of these include Face Mesh, face detection, Hands (the hand tracking), Holistic (which combines Face Mesh, hands, and pose), Objectron (the 3D object detection), Pose, and Selfie Segmentation, which we covered earlier and which is basically how the Zoom or Google Meet backgrounds work.
For all of these, you'll find the relevant NPM packages, and you can refer to this particular slide. You can also look at the real-world examples provided by the MediaPipe team, which are available on CodePen, to see how each one has been implemented. But what we are going to do is implement this directly in React.
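Assuming you already have a React project set up, the packages used in this talk can be installed from npm. The package names below are the ones published by the MediaPipe team at the time of this talk; check npm for the current names.

```shell
# Install the solution packages used in the demos...
npm install @mediapipe/face_mesh @mediapipe/selfie_segmentation

# ...plus the helper utilities: camera_utils fetches frames from the
# webcam, drawing_utils renders landmarks onto a canvas.
npm install @mediapipe/camera_utils @mediapipe/drawing_utils

# react-webcam provides the webcam component used as the input stream.
npm install react-webcam
```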
[10:00] So here is a brief example of how it's supposed to work. In the first piece of code at the top, we have imported React, and we have also imported the webcam component, because the input stream we're going to work with comes from the webcam. Then we have imported one of the solutions as an example, the Selfie Segmentation solution, which comes from the MediaPipe Selfie Segmentation NPM module. And we have also imported the MediaPipe camera utils, which fetch the frames from the camera. There are also other utils that help you draw the landmarks, which we'll discover in a bit. After that, you can see the code where we actually use the MediaPipe Selfie Segmentation.
And again, the best part is that you are not supposed to write 100, 200, 300 lines of machine learning code. That's the benefit of using MediaPipe solutions: everything is packed into the module. Essential machine learning tasks like object detection and tracking, which usually run to 200 or 300 lines of code, can be done in less than 20 to 30 lines. Over here, we have simply created our function for the Selfie Segmentation, where we use the webcam as a reference and, on top of that, a canvas as a reference, because the webcam is the base: you get your frames from the webcam, and then you use the canvas element on top of it to render the results. And you can see that we are just using the CDN to load the MediaPipe Selfie Segmentation model files, and then we render the results on top of whatever is detected.
[12:42] But yeah, so far it's all been discussion. Let's move quickly into the code with the demonstration. If everyone is excited, I'm more than happy to now share the demonstration.
So let me go back to my VS Code. Over here, I have implemented a very simple React application, a standard Create React App project, which you can set up very simply by following its documentation. In my main app.js code, I have integrated four different types of examples: Hands, Face Mesh, Holistic, and the Selfie Segmentation solution. I'll quickly show you the demos of these, but mainly I want to demonstrate how easy it is to integrate such a MediaPipe solution, or machine learning solution.
[13:35] Within my app function, I have, for now, commented out all the other solutions. The first one I'll demonstrate is Face Mesh. I have imported the components for each one of these, and currently I'm just rendering, or returning, the Face Mesh component. So if I go very quickly to my Face Mesh component, let's walk through the code. You can see that I've imported React and some of the landmark sets. Whenever we're talking about the face, we want the right eye, left eye, eyebrows, lips, nose, and so on, so these are the connection sets we've imported specifically from the Face Mesh package.
Then we have created our function, the MP Face Mesh, that will be used to render the Face Mesh on top of our webcam. Over here, we again point at the CDN, and we use faceMesh.onResults to render the result. We start by getting the camera: we use the new Camera object to get a reference to the webcam, and with that, we wait. We have made the callback an async function, since the machine learning model itself can take some time to load, so we use the async function to wait before the landmarks actually load. Then we send the Face Mesh the webcam reference, that is, the current frame. So once your camera loads and the frames start coming in, we send each frame to our Face Mesh, where it can render the landmarks.
[15:17] In the onResults function, what we have done is take our video input, and on top of it we render the canvas element using the canvasCtx. Then we render the facial landmarks on top of the actual frame we're seeing. That is what you see over here, using the drawConnectors util provided by MediaPipe. So, very quickly: we render all the different landmarks, we return our webcam along with the canvas that sits on top of it, and finally we export this React component for use in our app.js.
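Putting the pieces of this walkthrough together, a minimal version of such a Face Mesh component might look like the sketch below. This is a browser-only sketch, not the exact code from the demo: it needs a webcam and the CDN-hosted model files, and details such as the option values, dimensions, and styling are assumptions.

```javascript
import React, { useRef, useEffect } from "react";
import Webcam from "react-webcam";
import { FaceMesh, FACEMESH_TESSELATION } from "@mediapipe/face_mesh";
import { Camera } from "@mediapipe/camera_utils";
import { drawConnectors } from "@mediapipe/drawing_utils";

export default function FaceMeshDemo() {
  const webcamRef = useRef(null);
  const canvasRef = useRef(null);

  useEffect(() => {
    const faceMesh = new FaceMesh({
      // Load the model files from the CDN, as in the talk.
      locateFile: (file) =>
        `https://cdn.jsdelivr.net/npm/@mediapipe/face_mesh/${file}`,
    });
    faceMesh.setOptions({ maxNumFaces: 1, minDetectionConfidence: 0.5 });

    faceMesh.onResults((results) => {
      const canvas = canvasRef.current;
      const ctx = canvas.getContext("2d");
      ctx.save();
      ctx.clearRect(0, 0, canvas.width, canvas.height);
      // Draw the current frame, then the landmark mesh on top of it.
      ctx.drawImage(results.image, 0, 0, canvas.width, canvas.height);
      for (const landmarks of results.multiFaceLandmarks || []) {
        drawConnectors(ctx, landmarks, FACEMESH_TESSELATION, {
          color: "#C0C0C0",
          lineWidth: 1,
        });
      }
      ctx.restore();
    });

    // The async onFrame callback waits for the (possibly still loading)
    // model before sending the next frame.
    const camera = new Camera(webcamRef.current.video, {
      onFrame: async () => {
        await faceMesh.send({ image: webcamRef.current.video });
      },
      width: 640,
      height: 480,
    });
    camera.start();
  }, []);

  // The canvas is absolutely positioned on top of the webcam feed.
  return (
    <>
      <Webcam ref={webcamRef} style={{ position: "absolute" }} />
      <canvas ref={canvasRef} width={640} height={480}
              style={{ position: "absolute" }} />
    </>
  );
}
```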
So, very quickly jumping into the demonstration: I'll open up an incognito window and go to localhost:3000, where it'll open up the webcam and we should be able to see... As you can see, hi, everyone. That's the second me. And very soon, I should be able to see the Face Mesh land on top of the demo. As you can see, that's the Face Mesh. Boom. Great. And as I move around, open my mouth, close my eye, you can see how all the facial landmarks follow. So I can close this and quickly switch to another demonstration. Let's use Selfie Segmentation this time: I'll just comment out my Face Mesh.
[16:55] And I'll uncomment the Selfie Segmentation and save it. While it's loading, you can look at the Selfie Segmentation code. Over here, again, we have used it in the same way, but this time we are going to provide a custom background. That background is defined over here, where we use canvasCtx.fillStyle: it won't colorize your body, but it will colorize everything else. So we use the fillStyle to, let's say, add a virtual background.
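The virtual-background trick described here boils down to a few canvas compositing calls. The sketch below factors that logic into a standalone function so it works with any 2D-context-like object; the composite-operation sequence follows the pattern used in MediaPipe's selfie segmentation examples, but the function name and the blue color are my own assumptions.

```javascript
// Paint a solid virtual background behind the segmented person.
// `ctx` is any CanvasRenderingContext2D-like object; `results` carries
// the segmentation mask and the original camera frame, as delivered to
// the onResults callback.
function drawVirtualBackground(ctx, results, width, height, color = "#0000FF") {
  ctx.save();
  ctx.clearRect(0, 0, width, height);
  // 1. Draw the person/background mask.
  ctx.drawImage(results.segmentationMask, 0, 0, width, height);
  // 2. Fill everything *outside* the person with the background color.
  ctx.globalCompositeOperation = "source-out";
  ctx.fillStyle = color;
  ctx.fillRect(0, 0, width, height);
  // 3. Draw the original frame behind it, so the person shows through.
  ctx.globalCompositeOperation = "destination-atop";
  ctx.drawImage(results.image, 0, 0, width, height);
  ctx.restore();
}

// A tiny mock context lets us trace the call sequence outside a browser.
function makeMockCtx() {
  const calls = [];
  return {
    calls,
    fillStyle: null,
    globalCompositeOperation: "source-over",
    save: () => calls.push("save"),
    restore: () => calls.push("restore"),
    clearRect: () => calls.push("clearRect"),
    drawImage: () => calls.push("drawImage"),
    fillRect: () => calls.push("fillRect"),
  };
}

const ctx = makeMockCtx();
drawVirtualBackground(ctx, { segmentationMask: {}, image: {} }, 640, 480);
// Recorded sequence: save, clearRect, drawImage, fillRect, drawImage, restore
```

In the browser you would call this from onResults with the real canvas context instead of the mock.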
So if I quickly go back to my incognito window, I can go to localhost:3000 and very soon I should be able to... I hope everything works fine. So as you can see, this is my camera, my frames are coming in, and very soon it should load the Selfie Segmentation model.
[17:49] And let's just wait for it. As you can see, boom, blue background. Essentially, how it works is that it takes in your frame, segments the human out of it, and colors the rest of the entire background with this blue color, because it is able to separate your body from everything else.
So similarly, you can try out the various solutions that are there. For my demonstration, I showed you the Face Mesh and the Selfie Segmentation, but you can also try out all the other ones shared as NPM modules by the MediaPipe team. That is essentially what I wanted to show with respect to the demonstration. Again, what I've just shared is super quick: even for the Selfie Segmentation, the actual logic we write spans only a few dozen lines of code. If you can create such wonderful applications within a few dozen lines, just think about what kind of amazing applications you could create with the help of MediaPipe. And these are just two of the examples that I have shown you.
[19:13] So the sky is the limit, and the best part is that you don't really need to know what's happening behind the scenes. You don't need to know what kind of computer vision or machine learning is involved; you just have to integrate it. That is why today we see MediaPipe being used in so many live examples and production environments, by companies and startups. So it's really the future of the web, and it integrates easily with React, which is the future of the web as well.
So with that, that brings an end to my presentation. I hope you liked it. You can connect with me on Twitter @howdevelop or on GitHub at github.com/shivaylamba. If you have any queries about MediaPipe, or about integrating MediaPipe with React, I'll be more than happy to help you out. I hope that everyone has a great React Advanced. I really loved being a part of it, and hopefully, next year, whenever it takes place, I'll meet everyone in the real world. So thank you so much. With that, I sign off. Thank you so much.