Do you scroll through TikTok, amazed at the goofy, yet complicated dance moves featuring today's youths? Those kids are popping off while you're sitting around, coding SQL queries. Fortunately, we are technologists, and there's no problem we can't solve, including getting better at TikTok dancing. In this talk, I'll show you how I perfected my moves building PoseDance, your friendly TikTok trainer. We'll discuss how I leveraged PoseNet, which allows you to pinpoint body motion and draw a 'skeleton' on a video. Combined with a webcam mapping your own dance skeleton, a bit of math to compare the matching points, Azure functions to authenticate a user, and PlayFab as a game-friendly backend to keep scores and create a leaderboard, you've got the perfect quarantine pastime, making a perfect fool of yourself in front of a webcam. Come dance with me!
PoseDance: Build a TikTok Trainer
AI Generated Video Summary
In this Talk, Jen Looper introduces PoseDance, a TikTok trainer app that uses PoseNet for pose detection. She discusses the challenges of scoring and the potential for medical applications. Jen also mentions the use of Azure Functions and PlayFab for the app's backend and deployment. The Talk concludes with a code tour and an invitation for contributions to improve the scoring system.
1. Introduction to PoseDance and TikTok
Hi everyone, in this part, I'll be introducing PoseDance, a TikTok trainer app. We'll discuss the idea of building a skeleton using the Canvas API and video to improve your dancing skills. I'm Jen Looper, a Cloud Advocate Lead at Microsoft, and I'll also cover what TikTok is, delve into PoseNet, explain its inner workings, and share insights on how I built PoseDance. Stay tuned for an exciting demo!
Hi everyone, I'm so excited to be here to speak to you today about a very interesting app that I've created. It's called PoseDance and what it is is it's a TikTok trainer. So we're going to talk a little bit about PoseDance, the idea of building a skeleton, using the Canvas API and video to allow yourself to dance just as well as those awesome kids that we are all wasting time watching on TikTok.
My name's Jen Looper, you can always find me on Twitter at Jen Looper and I'm a Cloud Advocate Lead at Microsoft. So the agenda today is to talk a little bit about what TikTok is, we're going to talk about PoseNet, we're going to dig a little bit deeper into what is running behind the scenes in PoseNet and how it was built. We're going to talk a little bit about how I built PoseDance and then I'm going to do a demo for you, you're going to love it.
2. Introduction to TikTok
TikTok is a popular mobile app that evolved from Musical.ly. It's known for short videos, particularly dances. Despite being a waste of time, it's incredibly fun. Stay tuned for more information and resources about TikTok.
Okay, so what is TikTok? If you're not up to date on what TikTok is, it is the greatest mobile app currently in the market, just to let you know. It's based on Musical.ly, which was a little app that was released a couple years ago to have the kids do music videos and it evolved into TikTok, which is for some, you know, because of this evolution from Musical.ly I feel like it's very amenable to people dancing on this app. It is a complete waste of time, but it is also very, very fun. It's a lot like Vine, which was little short videos, the videos on TikTok are about a minute each, but you can patch a bunch of little videos together to get to that 60 second limit and it's famous for dances and I'm going to share this slide deck later, you can find me on Twitter and I'll share the link out and there's a lot of little articles referenced in these slides that you can take a look at of how I spent a week on TikTok and only lost a week on TikTok. All right.
3. Introduction to PoseNet
PoseNet is a product from Google that allows pose detection in the browser. It uses TensorFlow.js and can be used with webcams, video, or images. It can display one or more people and has various use cases, such as sports and yoga. It provides estimations of 17 key body points, and the user can create the skeleton on canvas. PoseNet is built on top of PersonLab models, which enable a parts-first approach to determining a person's actions in an image.
What is PoseNet? Let's talk a little bit about what PoseNet is. PoseNet allows you to do pose detection in the browser. It's a product from Google and it's very, very interesting. It uses TensorFlow.js behind the scenes so it can be used within your websites. It can be used with your webcam, with video within a canvas, or it can be used on still pictures, so any kind of images that you want to show it.
It can display one or more people, so you can have a skeleton that you create for one person's poses to be displayed or you can focus on many people in a video or an image. One of the use cases is that it's being used for sports. For example, to speak with an American idiom, if I slid into first base when playing baseball and I twisted my ankle, if there's a video of that, you can see exactly and you could pinpoint exactly where it went wrong so that you can figure out where you went wrong and twisted your ankle, and maybe you could avoid doing that another time. Personally, I'd like to try this to show yoga poses and what the best positioning and placement is for your hands, feet, and your whole body. And all it does is it gives estimations of the position of 17 key body points and you yourself have to create the skeleton on canvas, so it's not completely out of the box. All it does is it gives you the estimation of those body points, and then you can go and do whatever you like with the data that's sent back to you.
An example output of PoseNet looks like this. It's basically saying that, you know, we think this is the left eye. It sends back a score that it's 99% sure that this is a left eye at the X and Y Cartesian coordinates within the image or within the video. Now, if you're running this against a video, it's going to give you this data set for every key frame. So that is a pretty powerful piece of software.
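As a rough illustration of the output described above, here is a minimal TypeScript sketch of a pose object with a part name, a confidence score, and x/y coordinates per keypoint. The numbers in `samplePose` are made up for illustration, and `confidentKeypoints` is a hypothetical helper, not part of PoseNet.

```typescript
// Shape of one keypoint as described in the talk: a body part name,
// a confidence score between 0 and 1, and x/y pixel coordinates.
interface Keypoint {
  part: string;
  score: number;
  position: { x: number; y: number };
}

interface Pose {
  score: number; // overall confidence for the whole pose
  keypoints: Keypoint[];
}

// Illustrative values only, not real model output.
const samplePose: Pose = {
  score: 0.92,
  keypoints: [
    { part: "leftEye", score: 0.99, position: { x: 253.4, y: 76.3 } },
    { part: "rightEye", score: 0.98, position: { x: 212.5, y: 78.1 } },
    { part: "leftAnkle", score: 0.21, position: { x: 301.0, y: 540.9 } },
  ],
};

// Hypothetical helper: keep only keypoints the model is reasonably sure about.
function confidentKeypoints(pose: Pose, minScore: number): Keypoint[] {
  return pose.keypoints.filter((kp) => kp.score >= minScore);
}
```

Calling `confidentKeypoints(samplePose, 0.5)` would keep the two eyes and drop the low-confidence ankle, which is the kind of filtering you would typically do before drawing anything.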
Whenever I'm talking about machine learning in the browser, or any kind of machine learning built into web and mobile apps, I like to talk also about what's going on behind the scenes. And you can take a look at some of the research papers that are referenced in these slides. PoseNet is actually built on top of PersonLab models. And PersonLab was a new way of determining people's stances and actions in a video or photo. So previously, we would be looking for people in a video or a photo by giving bounding boxes. I'm pretty sure that this is a person. I'm pretty sure that this is another person. But PoseNet and PersonLab allow you to do a bottom-up, parts-first way of determining what a person is and what exactly they're doing within an image. So instead of the top-down boxes, it's a bottom-up, parts-first, box-free, fully convolutional determination. And it predicts the relative position of those 17 key points. So there's eyes, ears, nose (no, sorry, there's no mouth), shoulders, elbows, wrists, hips, knees, ankles. And it starts from the most confident detection, and it works up based on trained body poses. So let's dig even a little bit deeper than that. So what's going on behind the scenes here? Well, let's talk about PersonLab and how this thing is working in the background.
4. PoseNet Training and Functionality
PersonLab training consists of two steps: detecting a person and estimating the poses. It uses heat maps and short range offset predictions to identify key points. The system employs Hough voting and Hough arrays for precise key point placement. PoseNet decodes the poses and groups people using long range offsets. Convolutional neural networks are used throughout the process. PoseNet abstracts away complexities and provides easy-to-use methods for real-time estimation in the browser. It can be installed via NPM as part of TensorFlow models.
So PersonLab training consists of two steps, right? It detects a person and then it estimates the poses. And first it attempts to detect the key points. You can see on these heat maps. It groups them into person instances. So it's looking at an image, determining, you know, which parts belong to which person and figuring out, you know, which, which elements are belonging to that single person. So it tries to do these, these groupings, and it's, the network itself is trained in a supervised fashion, and it looks for 17 face and body parts using the COCO data set. So COCO is Common Objects In Context, COCO data set. So it's always useful to know how these models were built and what data set is being used to build this kind of thing.
So it's using the heat maps and then short range offset predictions to identify possible key points, figuring out, you know, maybe this is an ankle. Maybe this is a ... And then it uses a system of Hough voting to determine the most precise placement of a key point. So it's using the Hough voting to create Hough arrays. And you might take a look at what Hough arrays are by doing a quick Google. So Hough arrays are based on Hough transforms, which are a feature extraction technique used in image analysis and computer vision. And it's designed to find imperfect instances of objects within a certain class of shapes by means of a voting procedure. Interestingly, it was invented in 1959 and patented in '62, I believe. So all of these machine learning techniques that we're using, a lot of them have been around for quite a long time. And it's quite interesting to just remember that, you know, we are building on the shoulders of giants.
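To make the voting idea a bit more concrete, here is a heavily simplified TypeScript sketch for a single keypoint type on a tiny grid. It assumes a heat map of probabilities and one short range offset per pixel; every pixel above a threshold casts a vote, weighted by its probability, at the location its offset points to. The function names `houghVote` and `findPeak` are hypothetical, and the real PersonLab procedure is more involved than this.

```typescript
// Simplified Hough voting for ONE keypoint type on a small grid.
// heatmap[y][x] : probability that the keypoint is near pixel (x, y)
// offsets[y][x] : short range offset (dx, dy) from that pixel to the keypoint
type Offset = { dx: number; dy: number };

function houghVote(
  heatmap: number[][],
  offsets: Offset[][],
  threshold: number
): number[][] {
  const height = heatmap.length;
  const width = heatmap[0].length;
  // The "Hough array": one vote accumulator per pixel, starting at zero.
  const votes = heatmap.map((row) => row.map(() => 0));
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const p = heatmap[y][x];
      if (p < threshold) continue; // unlikely pixels do not vote
      const tx = Math.round(x + offsets[y][x].dx);
      const ty = Math.round(y + offsets[y][x].dy);
      if (tx >= 0 && tx < width && ty >= 0 && ty < height) {
        votes[ty][tx] += p; // vote weighted by the pixel's probability
      }
    }
  }
  return votes;
}

// The most voted cell is taken as the keypoint location.
function findPeak(votes: number[][]): { x: number; y: number; value: number } {
  let best = { x: 0, y: 0, value: -Infinity };
  votes.forEach((row, y) =>
    row.forEach((value, x) => {
      if (value > best.value) best = { x, y, value };
    })
  );
  return best;
}
```

The point of the voting step is that several nearby, moderately confident pixels can agree on one precise location, which is more robust than picking the single brightest pixel in the heat map.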
So anyway, so we have our Hough votes that come through. We have Hough arrays constructed. And then we're trying to figure out the persons, the people involved in creating the posing. So it's decoding the poses of these people, using these detections to create likely instances of individual people. Right? And it's using the long range offsets that it finds to group the people into segmented instances of each person in an image. And it's good to know that all of these steps are using convolutional neural networks. So there's really quite a lot going on behind the scenes to create this model that is then packaged up and given to us for use in PoseNet. So always useful to know what's going on behind the scenes and especially what dataset was used to create the model that we just, you know, grab out of the box and use. Take a look at the research paper that's referenced in the bottom here.
So, like I said, PoseNet was a project of Google Creative Lab and it's running real time estimation in the browser. And what PoseNet does is it basically abstracts away complexities of the model given to us by this complex machinery behind the scenes. And it encapsulates the functionality delivered by these models into easy to use methods that we can use on the web. It's installable by NPM as part of TensorFlow models, so you can take a look at the TensorFlow models repo, and there's a whole bunch of demos that you can take a look at to get some inspiration.
5. Building PoseDance and Addressing Ableism
TensorFlow.js allows for machine learning in the browser, enabling inference and training of models. Building on TikTok, I created an inclusive app using PoseNet to compare a dancer's movement to your own. I addressed concerns of ableism by showcasing differently abled dancers. Let's dive into building PoseDance, considering design, architecture, PoseNet usage, webcam integration, and handling heavy models.
And it accompanies TensorFlow.js, which is really helpful, because with TensorFlow.js, we can use machine learning in our browser. You can do inference easily enough using TensorFlow.js in your browser. You can also actually train models in your browser. It's not my favorite way of training. I prefer to do that, you know, using a different stack to make it a little bit more efficient. But you could, if you wanted to, train models in your browser. But I like to use it for inference.
And just a fun fact, it does use the fast greedy decoding algorithm from the PersonLab paper we've been talking about. So we have the model that we want to use. We've determined that we want to build some kind of app. What would be the most fun thing that you can think of to build with this technology? For me, I went straight to TikTok, because, you know, why not waste a little more time on TikTok and actually push it a little bit technologically, see what we can build on top of TikTok to use PoseNet. I thought, well, why don't we build a side by side comparison so that you can use PoseNet to compare a dancer's movement to your own by comparing the TikTok video to your webcam output, because this thing is designed to be using your webcam. So it will be me with my webcam over here with my skeleton determined and then making a comparison to the TikTok kind of point of truth. So why not? Let's do it.
So as I was thinking about this and starting to build it, I started to get a little bit uncomfortable. Because I thought, well, maybe what I'm building and what I'm using is inherently ableist. So just a quick note on what ableism is: it's discrimination in favor of standardly abled people, people who can walk, people who are not in wheelchairs. And I started to ask myself, are PersonLab, PoseNet, and PoseDance, my app, inherently ableist by nature? Because it's looking for skeletons, intact skeletons to be drawn on top of an image. And like most machine learning models, this thing doesn't actually make judgments. It just measures. And if it doesn't find a piece of a body, like, for example, if someone's arm stops at the elbow, then it's just not going to show that wrist key point. So it just doesn't make judgments. It's just looking for parts and it's measuring and displaying. However, I thought if I create an app on top of TikTok comparing my moves with a scoring mechanism to semi-professional, abled dancers, that actually might be ableist. So what I decided to do is to make an inclusive app and make sure that I show differently abled dancers, folks in wheelchairs, folks with different abilities. So that's what I tried to do, just to let you know. So when you visit the app, you'll see some interesting different types of dancers.
So let's talk about building PoseDance. I had a lot of considerations that I had to gather together when I was starting the work. I had to worry about the design, get the architecture in place, figure out how to use PoseNet, figure out how to use a webcam, and realize that these models are heavy, and you have to handle that when you're building a web app.
6. App Responsiveness and Back-end Implementation
To make your app responsive, consider building the scoring and leaderboard. Azure Functions and PlayFab are great tools for the back-end implementation. Azure Functions can communicate with PlayFab to send scores and create a leaderboard. Deployment is another important aspect to consider.
You have to make your app responsive. How to build the scoring and leaderboard, that was interesting, and then the back-end implementation. So, fortunately I had Azure Functions in my back pocket. I know how to use those. I know that that's a really great way to build an API. And then I also had a system called PlayFab in my back pocket. This is another Microsoft acquisition called PlayFab. And it's a really nice back-end as a service that you can use for games and gamified interfaces. It works really nicely. And your Azure Functions will be able to talk to PlayFab and send your scores back so that you can build a leaderboard. Then I had to worry about deployment where this thing is going to live. So, lots of things to think about.
7. App Dependencies and PoseNet Usage
When building the app, I installed NPM and dependencies, including PoseNet from TensorFlow models and TFJS. Axios, Bulma, Sass Loader, Vue, and other components were also used. The interface allows for Webcam and TikTok video integration. Initially, I tried using TikTok embedded videos but had to grab the videos from the mobile app. Concerned about terms of service, I ensured proper attributions and credits. Using PoseNet, which has large models, the app takes about 10 seconds to load. Two models are running, one for the webcam and one for the TikTok videos.
So, I needed to build an interface that allows you to have Webcam on one side and TikTok videos on the other. I wanted initially to use the TikTok embedded videos. If you go to TikTok on the web, you'll find that you can download embeds. I thought that I'd be able to draw my skeleton right on top of that embed. Turns out I couldn't. I had to grab the video from TikTok, and you can actually do that on their mobile app, which is interesting. I'm grabbing these videos as mp4s, and I'm using those within the web app. I started getting concerned about terms of service, so I made sure the TikTok embeds are on the home page so you can see all the attributions and the credits that are given to each dancer and the music they use.
Okay. So, using PoseNet. This is really fun when you start thinking of cool things you can use with these machine learning models. Then you realize oh my gosh, these are really big. These are big models. It takes about 10 seconds for this app to load up. I'm using PoseNet twice. I have two models running. I have one for the webcam and one for the TikTok videos.
8. Handling Large Models, Webcam, Posing, and Scoring
To handle large models and the webcam, create a responsive and asynchronous-friendly website. Use asynchronous calls and life cycle hooks like an async mounted hook to load the video, set up the webcam, and draw key points and skeletons. Pose estimation requires loading models and drawing the skeleton using the canvas API. Be mindful of performance limitations. Drawing the skeleton allows for creativity in thickness and color. Scoring involves comparing key points from different videos.
So, you have to just really create a responsive and asynchronous friendly website to handle these large, large models and also to handle the webcam because that thing has to download, has to be initialized and load up as well. So, I had a nice loader widget just running. Bulma is very helpful with that.
Okay. So, handling the webcam. Asynchronous calls. Asynchronous life cycle hooks like async mounted are your friend because in the mounted hook, you have to load up your video, reference the video, set up the webcam, load up the webcam, get that thing ready to roll. Set up your two canvases to draw the key points and skeletons on top of. Load up your models and handle your video events. I couldn't cram it all into one piece on the slide. So, there's really a lot going on and it all has to be asynchronous. I just want to emphasize that.
So, posing. For each video, you need to load a model and start estimating the poses. So, each frame is being called here. It's grabbing each frame and grabbing the data that's sent back from this net.estimatePoses call. And then you're drawing the key points and the skeleton yourself using the canvas API on the estimated location. So, it's something that has to be pretty responsive. I found that if you ask too much of it, it can get a bit laggy. Let's just put it this way. This app is a fun thing to try. I would not at this moment send it to production and try to monetize it. Put it that way.
So, drawing the skeleton, like I mentioned, you're using the canvas API. Here's where you can get a bit creative and make the skeleton really thick, you can make different colors, you can draw your key points in different ways. I just use the dots for the skeleton points and then lines in between. Very DIY type of thing. Scoring. Scoring was a challenge. How can you figure out how well you're moving compared to another video? Well, for each of these videos, for the webcam feed and the video feed, I gathered the key points and I figured out I appended them all into an array and then I did basically a subtraction.
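The dots-and-lines drawing described above might look roughly like the following TypeScript sketch. It is an illustration under assumptions, not the app's actual code: `DrawingContext` is a hypothetical subset of the browser's `CanvasRenderingContext2D` (so a real 2D context can be passed in), `BONES` is an illustrative, partial list of connected parts, and the dot radius is arbitrary.

```typescript
interface Point { x: number; y: number }
interface Keypoint { part: string; score: number; position: Point }

// Hypothetical subset of CanvasRenderingContext2D used by this sketch.
interface DrawingContext {
  beginPath(): void;
  arc(x: number, y: number, radius: number, startAngle: number, endAngle: number): void;
  moveTo(x: number, y: number): void;
  lineTo(x: number, y: number): void;
  fill(): void;
  stroke(): void;
}

// Illustrative, partial list of which parts to connect with a line.
const BONES: [string, string][] = [
  ["leftShoulder", "rightShoulder"],
  ["leftShoulder", "leftElbow"],
  ["leftElbow", "leftWrist"],
  ["rightShoulder", "rightElbow"],
  ["rightElbow", "rightWrist"],
  ["leftHip", "rightHip"],
  ["leftHip", "leftKnee"],
  ["leftKnee", "leftAnkle"],
  ["rightHip", "rightKnee"],
  ["rightKnee", "rightAnkle"],
];

// Draws a dot per confident keypoint and a line per bone whose two ends
// are both confident. Returns the number of bones drawn.
function drawSkeleton(ctx: DrawingContext, keypoints: Keypoint[], minScore: number): number {
  const byPart: Record<string, Keypoint> = {};
  for (let i = 0; i < keypoints.length; i++) {
    const kp = keypoints[i];
    if (kp.score < minScore) continue;
    byPart[kp.part] = kp;
    ctx.beginPath();
    ctx.arc(kp.position.x, kp.position.y, 3, 0, 2 * Math.PI);
    ctx.fill();
  }
  let bonesDrawn = 0;
  for (let i = 0; i < BONES.length; i++) {
    const from = byPart[BONES[i][0]];
    const to = byPart[BONES[i][1]];
    if (!from || !to) continue; // a missing part simply means no line
    ctx.beginPath();
    ctx.moveTo(from.position.x, from.position.y);
    ctx.lineTo(to.position.x, to.position.y);
    ctx.stroke();
    bonesDrawn++;
  }
  return bonesDrawn;
}
```

In a browser you would pass the context from `canvas.getContext("2d")` and set properties such as `lineWidth` or `strokeStyle` beforehand to change thickness and color. Note how a keypoint that is missing or low-confidence just results in fewer dots and lines, with no special handling.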
9. Challenges with Scoring
The score tends to degrade to negative as you diverge from the video. Positioning yourself exactly where the video person is and aligning to their movements is a challenge. My high score is only 6, which is not great. My daughter finds the app annoying because her score was bad.
So, I figured the point of truth is the video. And your webcam is what you're comparing it to. So, you're just finding the difference between the webcam points and the video points. And turns out, your score tends to degrade to negative as you diverge from the video. Maybe this is something that could be fixed. But you need to position yourself exactly where the video person is positioned and make sure that you're really aligning to where that person is moving. So, that was a bit of a challenge to... Let's put it this way. I think my high score is like 6. It's not that great. And actually, my daughter was annoyed by it. She finds the whole app annoying because her score was that bad. So, the negative.
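The talk does not spell out the exact formula, so the following TypeScript sketch is only one plausible reading of "basically a subtraction": it assumes both keypoint arrays are in the same order, takes the mean distance between matching points, and subtracts that from a base score. The name `frameScore` and the `baseScore` parameter are hypothetical.

```typescript
interface Point { x: number; y: number }

// Hypothetical per-frame score: the video is the point of truth, and the
// score drops as the webcam keypoints drift away from the video keypoints.
function frameScore(video: Point[], webcam: Point[], baseScore: number): number {
  const n = Math.min(video.length, webcam.length);
  if (n === 0) return 0;
  let total = 0;
  for (let i = 0; i < n; i++) {
    const dx = video[i].x - webcam[i].x;
    const dy = video[i].y - webcam[i].y;
    total += Math.sqrt(dx * dx + dy * dy); // distance between matching points
  }
  return baseScore - total / n;
}

const videoPoints: Point[] = [{ x: 0, y: 0 }, { x: 10, y: 0 }];
const closeDancer: Point[] = [{ x: 3, y: 4 }, { x: 10, y: 0 }];
const farDancer: Point[] = [{ x: 30, y: 40 }, { x: 10, y: 0 }];

console.log(frameScore(videoPoints, closeDancer, 10)); // 7.5
console.log(frameScore(videoPoints, farDancer, 10)); // -15
```

With a formula like this, a dancer who stands in a different part of the frame is penalized on every keypoint even if the moves are right, which matches the drift into negative scores described above.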
10. Backend, Leader Board, and Deployment
Let's talk about the backend. I decided to use PlayFab as a backend for the app. It's great for games. I built the leader board using the PlayFab API and displayed it with Axios. For deployment, I used Azure Static Web Apps, which was recently released and works well for various types of apps.
Okay. Let's talk about the backend. Now that we've got our scores. I decided, like I said, to use PlayFab as a mobile platform as a service backend. It is great for games. I put all the calls to it in an API folder and I use Azure functions to call PlayFab.
So, here is my Azure function that's grabbing the PlayFab SDK package and calling its API, UpdatePlayerStatistics, and sending a score. So, that was pretty easy to build, actually. Works pretty well.
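The talk uses the PlayFab SDK from inside an Azure Function; the slide code is not reproduced here. As a rough stand-in, this TypeScript sketch only builds the kind of request an UpdatePlayerStatistics call would carry, following my understanding of PlayFab's Server REST API (endpoint path, `X-SecretKey` header, `Statistics` array). Treat those details as assumptions to verify against the PlayFab documentation. The function name `buildScoreUpdate`, the statistic name, and all IDs are hypothetical.

```typescript
interface ScoreUpdateRequest {
  url: string;
  headers: Record<string, string>;
  body: {
    PlayFabId: string;
    Statistics: { StatisticName: string; Value: number }[];
  };
}

// Builds (but does not send) a score update request. Assumed, not confirmed
// by the talk: the Server API endpoint and the X-SecretKey header.
function buildScoreUpdate(
  titleId: string,
  secretKey: string,
  playFabId: string,
  statisticName: string,
  score: number
): ScoreUpdateRequest {
  return {
    url: `https://${titleId}.playfabapi.com/Server/UpdatePlayerStatistics`,
    headers: {
      "Content-Type": "application/json",
      "X-SecretKey": secretKey, // keep this server-side, e.g. in app settings
    },
    body: {
      PlayFabId: playFabId,
      // Statistic values are assumed to be integers, so the score is rounded.
      Statistics: [{ StatisticName: statisticName, Value: Math.round(score) }],
    },
  };
}
```

Keeping the secret key inside the function, rather than in the browser bundle, is the main reason to route score updates through a small API like this.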
And then, the leader board. So, PlayFab does create a public leader board of all players. You don't have to be logged in to see the leader board. You do need to be logged in to send items to the leader board. But anyone can take a look at the leader board. And you just need the title secret key. So, that is saved in variables in the service that I deployed it on. I'll talk about that in a second. So, I built the leader board via the PlayFab API and used Axios to display it.
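Once the leaderboard data has been fetched, displaying it is mostly a formatting job. The TypeScript sketch below assumes entries shaped like PlayFab leaderboard results (`DisplayName`, `StatValue`, zero-based `Position`); that shape is my understanding of the API rather than something shown in the talk, and `toLeaderboardRows` and the sample entries are hypothetical.

```typescript
// Assumed shape of one leaderboard entry returned by the backend.
interface LeaderboardEntry {
  PlayFabId: string;
  DisplayName?: string;
  StatValue: number;
  Position: number; // assumed to be zero-based
}

// Turns entries into display rows, ordered by position.
function toLeaderboardRows(entries: LeaderboardEntry[]): string[] {
  return entries
    .slice() // do not mutate the caller's array
    .sort((a, b) => a.Position - b.Position)
    .map((e) => `${e.Position + 1}. ${e.DisplayName || "anonymous"}: ${e.StatValue}`);
}

// Made-up sample data for illustration.
const sampleEntries: LeaderboardEntry[] = [
  { PlayFabId: "B2", StatValue: 6, Position: 1 },
  { PlayFabId: "A1", DisplayName: "dancer1", StatValue: 62, Position: 0 },
];

console.log(toLeaderboardRows(sampleEntries));
// [ "1. dancer1: 62", "2. anonymous: 6" ]
```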
So, deployment. Let's talk about deployment. This is where the release of a product really saved my life because I was able to use the brand new Azure Static Web Apps. And this was just released at Microsoft Build two, three weeks ago. And it is new, hot. It is in preview. But it works beautifully for these kinds of apps. It works great for VuePress, for Gatsby, for a standard SPA type of website. Or React or whatever. What have you, Angular, Svelte. And if you'd like to try it, you can go to aka.ms/TryStaticWebApps. Great way to host your apps.
11. Pose Dance Demo
I'm going to show you my app, Pose Dance. It has an invitation to dance as much as you like. It includes differently abled people. The model is loading, and I can click preview to watch the video with the skeleton. My score is not captured because I'm not logged in.
It gives you a nice URL and you can just go to the races. It works beautifully. I strongly recommend.
Okay. I think at this point we're ready for the demo. I think we can give this a try. I'm going to just escape this. And I'm going to show you my app.
So here it is. And okay. This is Pose Dance. This is the app that I built. So this is the first page and it has an invitation to dance as much as you like. Like I said, I put differently abled people in here. And I am going to go ahead and dance with this fella. And you can see that it's loading, loading, loading the model. It's loading, loading, loading. Okay. It's ready. I'm not logged in. So my score is not going to be captured. I can also click preview and just watch the video with the skeleton. But I haven't played this. It's already lagging a little bit. Here we go. Oops. I screwed it up already. Okay. See, I'm not far back enough. What's my score? It's probably terrible.
Conclusion and Q&A
I encourage you to visit the app at aka.ms/posedance. All the content, including a video, slides, and code, is available on GitHub. I hope you enjoyed my talk. Thank you.
Oh, I'm in the negative and the negative. Oh, well, I should try again. Last time I got six. So fortunately I'm not logged in. So we won't visit the leaderboard. Well, we will visit the leaderboard and see if anybody's doing better than me.
Ah, still my high score as anonymous is six. Oh, well, I encourage you to go to aka.ms/posedance to visit the app. All of the content is on GitHub. And I have written a nice PoseDance README, and you can get all the information you need on PoseDance there. There's a video. You can take a look at the slides. Blog post coming soon. And you can try the app. And all the code is ready for you.
So I hope you enjoyed my talk. I'll talk to you on Twitter. And that's it for me. Thank you. Amazing, amazing presentation, Jen. Please join me on stage for some Q&A. Hey. Hi, everyone. Hey, Jen, how's it going? Doing great. How are you? Good, good, good. I was saying before that I wanted to personally thank you for creating PoseDance. My pleasure. Exactly.
App Lagging and Body Types
I have extensive experience dancing poorly. And you taking your time out to create this is a godsend for myself. Well, I'm very happy to create the useful apps of the world. Exactly, exactly. The ones that matter. The ones that change lives. That's right.
I was saying that when I logged on to the app just to kind of prepare for this talk, I did score a 61.98, which I thought was like up here. It is, it is. Exactly, exactly. That's a great score. Mine is like negative 133, so to be perfectly honest. This is it, this is it. I feel like I'm already getting better as we speak, even just one try. I did try to go back and aim even higher, but there are like loads of lagging problems with the app. I was just wondering if you could kind of touch on that a little bit, about, you know, the challenges that you faced in building it and the lag problems and how those could have been avoided or repaired.
For sure, for sure, for sure. And we do have a question from one, he was asking that since frames must match the position exactly, do body types, that is wider shoulders, kids versus adults, affect that? Great question. And yeah, they would, because it's really looking at those x and y coordinates of where all of your key points are within the frame. So if you've got a person who's 6'7, comparing themselves to, you know, my very short daughter, if I've one, it's not going to match.
App Potential and Medical Applications
You could position yourself better in the frame. The app has potential for yoga training and medical applications. It can help ensure correct body positions and prevent injuries. Running a video of a gymnast would be interesting. The technology has various applications beyond dancing.
You could, however, position yourself in the frame a little bit better, like come a little closer to the camera and try to get that match. So that's why we put that preview button in there so that you can kind of see where the person is situated and get yourself organized before you start, you know.
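The app relies on you positioning yourself to match the dancer. As a possible alternative, and purely as an assumption on my part rather than something the app does, the keypoints could be normalized before comparing, so that a tall person far from the camera and a short person close to it end up on the same scale. This TypeScript sketch shifts each pose to its bounding box origin and divides by the larger box side; `normalizePose` is a hypothetical helper.

```typescript
interface Point { x: number; y: number }

// Hypothetical normalization: move the pose's bounding box to the origin and
// scale it so the larger side has length 1, preserving the aspect ratio.
function normalizePose(points: Point[]): Point[] {
  if (points.length === 0) return [];
  let minX = Infinity, minY = Infinity, maxX = -Infinity, maxY = -Infinity;
  for (let i = 0; i < points.length; i++) {
    minX = Math.min(minX, points[i].x);
    minY = Math.min(minY, points[i].y);
    maxX = Math.max(maxX, points[i].x);
    maxY = Math.max(maxY, points[i].y);
  }
  const size = Math.max(maxX - minX, maxY - minY) || 1; // avoid dividing by 0
  return points.map((p) => ({ x: (p.x - minX) / size, y: (p.y - minY) / size }));
}

// Same pose at two sizes and positions in the frame.
const tallPerson: Point[] = [{ x: 100, y: 100 }, { x: 200, y: 300 }];
const shortPerson: Point[] = [{ x: 10, y: 20 }, { x: 60, y: 120 }];

console.log(normalizePose(tallPerson)); // [ { x: 0, y: 0 }, { x: 0.5, y: 1 } ]
console.log(normalizePose(shortPerson)); // [ { x: 0, y: 0 }, { x: 0.5, y: 1 } ]
```

This would not fix differences in body proportions, such as wider shoulders, but it would remove the dependence on where and how large you appear in the frame.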
Yeah, and then, Moldita, just like, you know, kind of thinking wider about the application itself and the technology behind it, I guess, where else could or what other applications can you see for this app and for this technology?
Yeah, well, like I said, I had my daughter play testing this and she said, you know, you really missed an opportunity here because a lot of people are trying yoga right now and they have no idea what they're doing. They're watching, you know, one minute on TikTok and becoming a yogi immediately. It's not a very difficult situation. So I think just in terms of using PoseNet to properly, you know, make sure that all of your bits and pieces and body parts are in the right position for so you won't hurt yourself doing yoga. I think that would be a really great way to work with a personal trainer, you know, match my pose exactly and get a score, that would work.
I was always wondering about medical applications for this thing and if people would be considering, you know, figuring out if you injure yourself, what happens if you injure yourself? You know, with your hip overtwisted, like for a gymnast, for example, when you're training, when you go up and do certain moves on the parallel bars, are there things that you should not be doing so you won't hurt yourself in the dismount? You know, that sort of thing. Actually now I really want to try running a video of a gymnast against this and see what happens. It'll be interesting.
There you go. There you go. You know, I feel not only are we getting answers about how good I can be as a dancer, but we are getting other uses and other ways that this can be applied for the better.
Scoring System Rework
There are plans to rework the scoring system, possibly making it movement-based rather than position-based. The current scoring system involves a lot of math and is not very performant. Suggestions and PRs are welcome to improve the scoring and make it faster.
We have a question from Jay. He is asking are there any plans to rework the scoring, maybe movement-based rather than position-based? Yeah, I don't really love the way I did scoring. It's an awful lot of math, and I'm not a mathematician. So I would welcome a PR to figure out the scoring and get away from us constantly being in the negative, except for Chris who got that plus 65, who's kind of like a guru of TikTok this way. So yeah, send me your PRs, and let's think about how we can make it better and faster. It's not a very performant scoring system either. Right now it's like very brute force going through the arrays.
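One way to read "movement-based rather than position-based" is to compare how each keypoint moved between two consecutive frames, instead of where it is. The TypeScript sketch below is only an illustration of that idea, not a planned feature of the app; `movementError` is a hypothetical name, and it assumes all four arrays list the same keypoints in the same order.

```typescript
interface Point { x: number; y: number }

// Hypothetical movement-based error: for each keypoint, compare the
// frame-to-frame displacement in the video with the displacement on the
// webcam, and average the size of the difference. 0 means identical motion.
function movementError(
  videoPrev: Point[],
  videoCurr: Point[],
  camPrev: Point[],
  camCurr: Point[]
): number {
  const n = Math.min(videoPrev.length, videoCurr.length, camPrev.length, camCurr.length);
  if (n === 0) return 0;
  let total = 0;
  for (let i = 0; i < n; i++) {
    const videoDx = videoCurr[i].x - videoPrev[i].x;
    const videoDy = videoCurr[i].y - videoPrev[i].y;
    const camDx = camCurr[i].x - camPrev[i].x;
    const camDy = camCurr[i].y - camPrev[i].y;
    const ex = videoDx - camDx;
    const ey = videoDy - camDy;
    total += Math.sqrt(ex * ex + ey * ey);
  }
  return total / n;
}
```

For example, if the dancer's wrist moves 10 pixels to the right and yours does too, the error is 0 even if you are standing in a completely different part of the frame; if you stand still, the error is 10.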
Code Tour and PRs for Scoring Improvement
Yesterday, I added a code tour to my code base in Visual Studio Code. You can run a code walkthrough of the code base and provide feedback. I welcome your PR to help fix any issues. There's also a question about measuring angles of limbs instead of X, Y's for improved scoring. It's a smart idea that would eliminate the need to be close to the camera. Feel free to PR and make it better. For more questions, visit aka.ms/Posedance and check out the GitHub repo for improvements.
One thing I wanted to just tell you is yesterday I took my code base and I added a code tour. So if you open my code base in Visual Studio Code and have the CodeTour extension installed, you can run through a code walkthrough of the code base. So go ahead and give me feedback on that too. So yeah, I welcome your PR. Please help me fix this thing. This is it. This is it.
And we also have another question from Ruben. He wants to know, do you think measuring angles of the limbs instead of X, Y's could improve scoring? Yeah. I think it could. I think it could definitely because then you wouldn't have that problem of having to situate yourself closer to the camera. I think that's a really smart idea. And please PR this thing so we can make it better. Cool.
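The angle idea from this question can be sketched with a little vector math: the angle at a joint (say, the elbow) depends only on the directions of the two limbs meeting there, not on where the person stands or how big they appear. This TypeScript sketch is an illustration of the suggestion, not code from the app; `jointAngle` is a hypothetical helper.

```typescript
interface Point { x: number; y: number }

// Angle in degrees at `joint`, formed by the segments joint->a and joint->c
// (for example shoulder, elbow, wrist). Uses the dot product formula.
function jointAngle(a: Point, joint: Point, c: Point): number {
  const v1 = { x: a.x - joint.x, y: a.y - joint.y };
  const v2 = { x: c.x - joint.x, y: c.y - joint.y };
  const dot = v1.x * v2.x + v1.y * v2.y;
  const mag = Math.sqrt(v1.x * v1.x + v1.y * v1.y) * Math.sqrt(v2.x * v2.x + v2.y * v2.y);
  if (mag === 0) return 0; // degenerate case: two points coincide
  const cos = Math.min(1, Math.max(-1, dot / mag)); // guard against rounding
  return (Math.acos(cos) * 180) / Math.PI;
}

// The same bent arm at two different sizes and positions in the frame.
const bigArm = jointAngle({ x: 0, y: 0 }, { x: 10, y: 0 }, { x: 10, y: 10 });
const smallArm = jointAngle({ x: 100, y: 100 }, { x: 105, y: 100 }, { x: 105, y: 105 });
console.log(bigArm, smallArm); // both approximately 90
```

A score could then compare a handful of joint angles per frame between the video and the webcam, which sidesteps both the distance-to-camera issue and most of the body size issue raised earlier in the Q&A.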