This talk gives an introduction to MediaPipe, an open source machine learning solution that allows running machine learning models on low-powered devices and helps integrate those models with mobile applications. It gives creative professionals a lot of dynamic tools and makes machine learning easy to use, so they can create powerful, intuitive applications without much prior knowledge of machine learning. We will see how MediaPipe can be integrated with React, giving easy access to machine learning use cases when building web applications with React.
Using MediaPipe to Create Cross Platform Machine Learning Applications with React
AI Generated Video Summary
Welcome to a talk on using MediaPipe for cross-platform machine learning applications with ReactJS. MediaPipe provides ready-to-use solutions for object detection, tracking, face mesh, and more. It allows for video transformation and tensor conversion, enabling the interpretation of video footage in a human-readable form. MediaPipe utilizes graphs and calculators to handle the perception pipeline. Learn how to use MediaPipe packages in React and explore a demo showcasing the hands model for detecting landmarks. Custom logic can be written on top to detect whether a hand is open or closed, making it useful for applications like American Sign Language.
1. Introduction to MediaPipe
Welcome to my talk at React Day Berlin 2022. I'm Shivaay, presenting on using MediaPipe for cross-platform machine learning applications with ReactJS. MediaPipe is an open source framework that allows for end-to-end machine learning inference and is especially useful for video and audio analysis. It provides acceleration using system hardware and can be used across multiple platforms.
Welcome, everyone, to my talk at React Day Berlin 2022. I'm Shivaay, presenting virtually on the topic of using MediaPipe to create cross-platform machine learning applications with the help of ReactJS. I'm a Google Code mentor at MediaPipe and also a TensorFlow.js working group lead, and you can connect with me on Twitter at how-to-develop. Without wasting any further time, let's get started.
Today we see applications of machine learning everywhere, and this is especially true for web applications with the advent of libraries like TensorFlow.js and MediaPipe. There are a lot of full-stack applications that utilize machine learning capabilities in their web apps, and we are seeing them in production at many startups and even at companies like LinkedIn, which use machine learning to power multiple features. That's because machine learning is so versatile that it can be used for a number of different applications. Here are some of the common areas where you see machine learning in use, and one thing is common to all of these applications. On the left-hand side we have people utilizing face detection on the iPhone XR. You can see points that are able to detect your hands, some really cool effects on the web, and facial expression tracking. Then you have the Nest Cam, which uses its camera to detect objects, Ok Google or the Google Assistant, and even devices like the Raspberry Pi and Coral Edge TPUs. All of them have one thing in common: they are powered by machine learning, and that with the help of MediaPipe.
2. Exploring MediaPipe Solutions
We'll be exploring some of these MediaPipe solutions in a bit. These are completely ready to use; you just have to import them inside of your programs.
Here are some of the solutions that are currently available. Again, just as a quick reminder, when we talk about solutions, these are essentially end-to-end machine learning pipelines. That means everything, right from being able to detect and then classify, all the way to getting your inference up and running, is handled by these machine learning pipelines provided by MediaPipe. You have some standard models that you will also see in Python, things like object detection and object tracking, but also things like face mesh and human pose tracking. All of these are being used by a lot of different startups, for example to build virtual gym or e-physiotherapy experiences that keep track of your exercises with just the help of your webcam, doing things like rep counts.
3. Video Transformation and Tensor Conversion
The idea is to transform the video image into tensors, which are n-dimensional mathematical arrays used for machine learning. These tensors undergo transformations and are converted into landmarks that are superimposed on the captured image. This process is the basis of MediaPipe solutions, allowing for the interpretation of video footage in a human-readable form.
So the idea is that first you take your video. It could be webcam footage, or it could be Nest Cam footage, for that matter. Then we transform this image to the size that will be used by the machine learning model. Let's say your original size is 1920 by 1080; the frame goes through a transformation where we resize and crop the image so that it fits the size expected by the machine learning model. Then we convert the image into tensors. If you have not heard about tensors, they are the building blocks of deep learning and of TensorFlow, a word you have probably heard before; they are essentially large n-dimensional mathematical arrays, and we convert our image into these arrays, which undergo some transformations so they can be used by our machine learning pipeline. Once the image has been converted to tensors, we run the machine learning inference on top of them, which of course makes changes inside of the tensors. Then we convert these tensors into the landmarks that you see on the right-hand side and render those landmarks on top of the image that captured your hand, so you finally end up with a video output where the landmarks have been superimposed on top of the hand. The perception pipeline looks similar across the different MediaPipe solutions you use, but the typical idea is the same: take your video footage, capture the input, convert it into mathematical tensors, run inference on top of them, and finally convert the result back into a human-interpretable form. That is what MediaPipe handles for you.
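To make the "frame to tensor" step concrete, here is a minimal sketch using TensorFlow.js. MediaPipe solutions perform this conversion for you internally; the input size and normalization below are illustrative assumptions, not MediaPipe's actual values.

```ts
// A conceptual sketch of the "video frame -> tensor" step, using TensorFlow.js.
// The 256x256 input size and [0, 1] normalization are assumptions for illustration.
import * as tf from "@tensorflow/tfjs";

function frameToTensor(video: HTMLVideoElement): tf.Tensor4D {
  return tf.tidy(() => {
    // Capture the current video frame as a [height, width, 3] tensor.
    const frame = tf.browser.fromPixels(video);
    // Resize to the input size the model expects.
    const resized = tf.image.resizeBilinear(frame, [256, 256]);
    // Normalize pixel values and add a batch dimension: [1, 256, 256, 3].
    return resized.div(255).expandDims(0) as tf.Tensor4D;
  });
}
```

The inference step then runs on this tensor, and the model's output is mapped back to landmark coordinates that can be drawn over the original frame.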
4. MediaPipe Graphs and Calculators
In MediaPipe solutions, there are nodes and edges that form a graph representing the perception pipeline. Each node represents a MediaPipe calculator, and the nodes are connected by streams. These calculators handle the flow of packets and perform the necessary calculations. Input and output nodes handle incoming and outgoing packets. For more visualizations, visit viz.mediapipe.dev.
So now let's talk a bit more about the usage of graphs and calculators. Basically, whenever we talk about any MediaPipe solution, there are primarily two points to consider. The first one is the MediaPipe graph, which denotes the entire end-to-end perception pipeline that we just spoke about and showcased for the hand detection example. Inside of this graph, if you are aware of how graphs work in data structures, there are edges and nodes.
In the case of a MediaPipe graph, each and every node depicts a unique MediaPipe calculator, and, very similar to how nodes are connected by edges, in the case of MediaPipe these nodes are connected by something called a stream. The nodes accept inputs as packets that carry the data flowing through the pipeline we are running. And again, all of these calculators are written in C++.
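To give a feel for what such a graph looks like, here is a small, simplified graph definition in MediaPipe's pbtxt format. The calculator names and stream tags below are illustrative assumptions, not the exact hand-tracking graph that ships with MediaPipe.

```
# Illustrative MediaPipe graph (pbtxt). Streams connect calculator nodes,
# and packets flow along them from the input stream to the output stream.
input_stream: "input_video"
output_stream: "hand_landmarks"

# Node 1: resize/crop the incoming frame to the model's expected input size.
node {
  calculator: "ImageTransformationCalculator"
  input_stream: "IMAGE:input_video"
  output_stream: "IMAGE:transformed_video"
}

# Node 2: run the hand-tracking model and emit landmark packets.
# (In the real graph this is a subgraph made up of several calculators.)
node {
  calculator: "HandLandmarkTrackingCpu"
  input_stream: "IMAGE:transformed_video"
  output_stream: "LANDMARKS:hand_landmarks"
}
```

You can open and inspect the real graphs for each solution in the visualizer at viz.mediapipe.dev.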
5. Using MediaPipe Packages in React
Learn how to use MediaPipe packages within React by importing the specific NPM package for the desired module or solution. Additionally, install other necessary NPM packages, such as camera utils, for capturing camera footage. Utilize the webcam, a reference, and a canvas to run the model and perform inference on the webcam footage. Explore a demo showcasing the hands model, which detects the landmarks of the hand and superimposes them. Import the hands solution from MediaPipe and set up the landmarks and fetch the camera in the index.ts file.
Now, of course, coming to the most important part: how to use these MediaPipe packages within React. There are a number of different MediaPipe solutions, along with their NPM packages, that you can directly embed inside of your React code. Some of the most common ones are here on the screen, including face mesh, face detection, hands, holistic, Objectron and pose.
Again, they can be used for a multitude of different applications, and in case you are interested, you can check out the specific examples provided by the MediaPipe team on mediapipe.dev, under the demo for whichever solution you want.
You can see that we first use the webcam, a reference, and the canvas. First we go ahead and load our model, pointing it at the location of the actual machine learning model files. Then it takes your webcam footage, captures the input, runs inference on top of it, and renders the result on top of the canvas. It's simple code, and within a few dozen lines you are able to create an end-to-end machine learning solution.
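To make that concrete, here is a minimal sketch of such a component. It is not the speaker's exact code; it wires the @mediapipe/hands and @mediapipe/camera_utils packages to a hidden video element and an overlay canvas, with illustrative option values.

```tsx
// Minimal sketch (assumptions, not the talk's exact code) of MediaPipe Hands in React.
import { useEffect, useRef } from "react";
import { Hands, Results } from "@mediapipe/hands";
import { Camera } from "@mediapipe/camera_utils";

export function HandTracker() {
  const videoRef = useRef<HTMLVideoElement>(null);
  const canvasRef = useRef<HTMLCanvasElement>(null);

  useEffect(() => {
    if (!videoRef.current || !canvasRef.current) return;

    // locateFile tells MediaPipe where to fetch the model and wasm assets from.
    const hands = new Hands({
      locateFile: (file) => `https://cdn.jsdelivr.net/npm/@mediapipe/hands/${file}`,
    });
    hands.setOptions({ maxNumHands: 1, minDetectionConfidence: 0.5 });

    // Draw each processed frame onto the canvas; landmarks are available in results.
    hands.onResults((results: Results) => {
      const canvas = canvasRef.current!;
      const ctx = canvas.getContext("2d")!;
      ctx.clearRect(0, 0, canvas.width, canvas.height);
      ctx.drawImage(results.image, 0, 0, canvas.width, canvas.height);
      // results.multiHandLandmarks holds normalized x/y/z points per detected hand.
    });

    // Camera utils feed webcam frames to the model for inference.
    const camera = new Camera(videoRef.current, {
      onFrame: async () => {
        await hands.send({ image: videoRef.current! });
      },
      width: 640,
      height: 480,
    });
    camera.start();
  }, []);

  return (
    <>
      <video ref={videoRef} style={{ display: "none" }} />
      <canvas ref={canvasRef} width={640} height={480} />
    </>
  );
}
```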
Of course, now we'll go ahead and take a look at a demo that I created. The demo is pretty much the hand perception pipeline that we described. Here, let's say I bring up my hand: you can see that it says the hand is open, and if I close it, the label changes to closed. So let's see how this actually works. The model we're going to look at is the hands model that, as we showcased in the perception pipeline, is able to detect the landmarks of your hand and then superimpose them. Here I have my code, which I'm running in a GitHub Codespace, and the most important file we're going to look at is index.ts. This is where you'll see that we have imported the hands solution from MediaPipe. Over here I have two different objects that I'm using: one is the hands object itself and the other is the hand connections, which are the list of the different points, or landmarks, that we have, pretty much like the example we showed in the slides. We first set up some important constants, primarily setting up our landmarks and fetching our camera, the video element. And of course, since it's going to be detecting hands, we run all of this on top of our canvas.
6. Rendering Landmarks and Conclusion
So inside the demo we are rendering all of this on top of a canvas element. We basically go ahead and render our landmarks to see how many landmarks we actually get inside of our canvas. Then we go ahead and draw the actual image on top of the canvas, and this is where we are primarily processing the actual footage.
If you take a look at code lines 36 to 48, this is where, when you bring your hand into the webcam footage, it fetches the landmarks and then finds the coordinates for each of them. Since it's a 2D image, it fetches the X and Y coordinates of each and every landmark and keeps them stored inside of an array. Then we draw a rectangle. As you can see, when I bring my hand up in the demonstration, it renders these landmarks by fetching the X and Y coordinates of each landmark it finds, and it renders them on top of the actual image being drawn on the canvas.
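Conceptually, that block of the demo (lines 36 to 48) does something like the following sketch: collect each landmark's normalized x/y, scale it to canvas pixels, and draw a rectangle that encloses the hand. The variable and function names here are assumptions, not the demo's actual code.

```ts
// Sketch of the landmark-to-rectangle step; not the speaker's exact implementation.
import { Results } from "@mediapipe/hands";

function drawHandBox(results: Results, canvas: HTMLCanvasElement) {
  const ctx = canvas.getContext("2d")!;
  const landmarks = results.multiHandLandmarks?.[0];
  if (!landmarks) return;

  // Landmarks are normalized to [0, 1], so scale them to canvas pixel coordinates.
  const xs = landmarks.map((p) => p.x * canvas.width);
  const ys = landmarks.map((p) => p.y * canvas.height);

  // Draw a rectangle that encloses all 21 hand landmarks.
  const minX = Math.min(...xs);
  const minY = Math.min(...ys);
  ctx.strokeStyle = "lime";
  ctx.strokeRect(minX, minY, Math.max(...xs) - minX, Math.max(...ys) - minY);
}
```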
This is where we are loading the actual model, the hands model, and this is where we are initializing our camera. When we run the demo, we initialize the camera and then run a couple of functions, including loading the hands model. Finally, we render the results using the async onResults function, which captures your footage, renders the landmarks, and connects them, ensuring that they are superimposed on top of your footage. This is one example of just being able to run the hands demo; the separate logic that has been put on top of this is to detect whether the hand landmarks indicate an open or closed hand.
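The open/closed label comes from that custom logic written on top of the landmarks. The talk does not show this code, but a simple heuristic might compare each fingertip's distance from the wrist with that of its middle joint, roughly as sketched below; this is a hypothetical example, not the demo's actual logic.

```ts
// Hypothetical open/closed heuristic: a finger counts as extended if its tip
// is farther from the wrist than its middle (PIP) joint.
import { NormalizedLandmarkList } from "@mediapipe/hands";

function isHandOpen(landmarks: NormalizedLandmarkList): boolean {
  const wrist = landmarks[0];
  const dist = (a: { x: number; y: number }, b: { x: number; y: number }) =>
    Math.hypot(a.x - b.x, a.y - b.y);

  // [fingertip, PIP-joint] landmark index pairs for index, middle, ring, pinky.
  const fingers: Array<[number, number]> = [[8, 6], [12, 10], [16, 14], [20, 18]];
  const extended = fingers.filter(
    ([tip, pip]) => dist(landmarks[tip], wrist) > dist(landmarks[pip], wrist)
  ).length;

  // Call the hand "open" if most fingers are extended.
  return extended >= 3;
}
```

Building on the same idea, you can classify richer hand poses, which is how applications like American Sign Language recognition can be approached on top of the hands solution.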