Do you scroll through TikTok, amazed at the goofy yet complicated dance moves featuring today's youths? Those kids are popping off while you're sitting around, coding SQL queries. Fortunately, we are technologists, and there's no problem we can't solve, including getting better at TikTok dancing. In this talk, I'll show you how I perfected my moves building PoseDance, your friendly TikTok trainer. We'll discuss how I leveraged PoseNet, which allows you to pinpoint body motion and draw a 'skeleton' on a video. Combined with a webcam mapping your own dance skeleton, a bit of math to compare the matching points, Azure Functions to authenticate a user, and PlayFab as a game-friendly backend to keep scores and create a leaderboard, you've got the perfect quarantine pastime: making a perfect fool of yourself in front of a webcam. Come dance with me!
PoseDance: Build a TikTok Trainer
From: JSNation Live 2020
Transcription
Hi, everyone. I'm so excited to be here to speak to you today about a very interesting app that I've created. It's called PoseDance, and it's a TikTok trainer. We're going to talk a little bit about PoseDance and the idea of building a skeleton, using the Canvas API and video, to let yourself dance just as well as those awesome kids we all waste time watching on TikTok. My name is Jen Looper. You can always find me on Twitter at Jen Looper, and I'm a Cloud Advocate Lead at Microsoft.

So the agenda today is to talk a little bit about what TikTok is. We're going to talk about PoseNet, and dig a little deeper into what's running behind the scenes in PoseNet and how it was built. We're going to talk about how I built PoseDance, and then I'm going to do a demo for you. You're going to love it.

Okay. So what is TikTok? If you're not up to date on what TikTok is, it is the greatest mobile app currently on the market. It's based on Musical.ly, a little app that was released a couple of years ago so the kids could make music videos, and it evolved into TikTok. Because of that evolution from Musical.ly, I feel it's very amenable to people dancing on the app. It is a complete waste of time, but it is also very, very fun. It's a lot like Vine, with its little short videos. The videos on TikTok are about a minute each, but you can patch a bunch of little videos together to get to that 60-second limit. And it's famous for dances. I'm going to share this slide deck later; you can find me on Twitter, and I'll share the link out. There are a lot of little articles referenced in these slides that you can take a look at, along the lines of "I spent a week on TikTok and only lost a week on TikTok."

All right. What is PoseNet? PoseNet allows you to do pose detection in the browser. It's a product from Google, and it's very, very interesting. It uses TensorFlow.js behind the scenes, so it can be used within your websites. It can be used with your webcam, with video within a canvas, or on still pics, any kind of image you want to show it. It can handle one or more people, so you can have a skeleton created for one person's poses, or you can focus on many people in a video or an image. One of the use cases is sports: to use an American idiom, if I slid into first base playing baseball and twisted my ankle, and there's a video of that, you can pinpoint exactly where things went wrong, so that maybe you can avoid doing that another time. Personally, I'd like to try this to show yoga poses and what the best positioning and placement is for your hands, feet, and whole body. All PoseNet does is give estimations of the position of 17 key body points; you yourself have to create the skeleton on canvas, so it's not completely out of the box. It gives you the estimation of those body points, and then you can do whatever you like with the data that's sent back to you. An example output from PoseNet looks like this:
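(Here's roughly the shape of that output. The structure follows the @tensorflow-models/posenet docs; the numbers are illustrative.)

```js
// One estimated pose (illustrative values; shape per the PoseNet docs)
const examplePose = {
  score: 0.92, // overall confidence for the whole pose
  keypoints: [
    { part: 'leftEye',  score: 0.99, position: { x: 253.4, y: 76.1 } },
    { part: 'rightEye', score: 0.98, position: { x: 280.7, y: 74.9 } },
    // ...15 more: nose, ears, shoulders, elbows, wrists, hips, knees, ankles
  ],
};
```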
It sends back a score (here, it's 99% sure this is a left eye) at the given X and Y Cartesian coordinates within the image or the video. Now, if you're running this against a video, it's going to give you this data set for every frame you analyze. So that is a pretty powerful piece of software.

Whenever I'm talking about machine learning in the browser, or any kind of machine learning built into web and mobile apps, I like to also talk about what's going on behind the scenes, and you can take a look at some of the research papers referenced in these slides. PoseNet is actually built on top of PersonLab models, and PersonLab was a new way of determining people's stances and actions in a video or photo. Previously, we would look for people in a video or photo by drawing bounding boxes: I'm pretty sure this is a person; I'm pretty sure this is another person. PersonLab, and PoseNet on top of it, instead takes a bottom-up, parts-first approach to determining what a person is and what exactly they're doing within an image. So instead of top-down boxes, it's a bottom-up, parts-first, box-free, fully convolutional determination. It predicts the relative positions of those 17 key points (eyes, ears, nose, shoulders, elbows, wrists, hips, knees, and ankles; no mouth), and it starts from the most confident detection and works up based on trained body poses.

So let's dig even a little deeper. What's going on behind the scenes here? Let's talk about PersonLab and how it works in the background. PersonLab training consists of two steps: it detects a person, and then it estimates the poses. First it attempts to detect the key points you can see on these heat maps, then it groups them into person instances. So it's looking at an image, determining which parts belong to which person, and figuring out which elements belong to a single person. The network itself is trained in a supervised fashion, and it looks for 17 face and body parts using the COCO dataset; COCO is Common Objects in Context. It's always useful to know how these models were built and what dataset was used to build them.

So it's using the heat maps, plus short-range offset predictions, to identify possible key points: figuring out, maybe this is an ankle, maybe this is a wrist. Then it uses a system of Hough voting to determine the most precise placement of a key point, using the Hough votes to fill Hough arrays. You might look up what Hough arrays are with a quick Google. They're based on the Hough transform, a feature-extraction technique used in image analysis and computer vision, designed to find imperfect instances of objects within a certain class of shapes by means of a voting procedure. Interestingly, it was invented in 1959 and patented in 1962, I believe. A lot of the machine learning techniques we're using have been around for quite a long time, and it's worth remembering that we are building on the shoulders of giants. So anyway: we have our Hough votes coming through, we have Hough arrays constructed, and then we're trying to figure out the people involved in the poses.
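(To make the voting intuition concrete, here's a toy sketch; this is emphatically not PersonLab's real implementation. Each heatmap cell casts a vote for where it thinks the keypoint sits, displaced by its short-range offset and weighted by its confidence, and the fullest bin in the accumulator, the "Hough array," wins.)

```js
// Toy Hough-style voting for one keypoint. heatmap[y][x] is a confidence;
// offsets[y][x] is a short-range offset {dx, dy} pointing at the keypoint.
function houghVote(heatmap, offsets, cellSize) {
  const accumulator = new Map(); // the "Hough array": candidate -> votes

  heatmap.forEach((row, y) =>
    row.forEach((confidence, x) => {
      if (confidence < 0.05) return; // ignore low-confidence cells
      const { dx, dy } = offsets[y][x];
      // Vote for (cell position + offset), weighted by confidence.
      const key = `${Math.round(x * cellSize + dx)},${Math.round(y * cellSize + dy)}`;
      accumulator.set(key, (accumulator.get(key) || 0) + confidence);
    })
  );

  // The bin with the most weighted votes is the keypoint estimate.
  let best = null;
  for (const [key, votes] of accumulator) {
    if (!best || votes > best.votes) {
      const [x, y] = key.split(',').map(Number);
      best = { x, y, votes };
    }
  }
  return best; // null if no cell cleared the threshold
}
```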
So it's decoding the poses of these people, using those detections to create likely instances of individual people, and it's using the long-range offsets it finds to group the key points into segmented instances of each person in an image. It's good to know that all of these steps use convolutional neural networks. There's really quite a lot going on behind the scenes to create this model, which is then packaged up and given to us for use in PoseNet. So it's always useful to know what's happening behind the scenes, and especially what dataset was used to create the model that we just grab out of the box and use. Take a look at the research paper referenced at the bottom here.

So, like I said, PoseNet was a project of Google Creative Lab, and it runs real-time pose estimation in the browser. What PoseNet does is abstract away the complexities of the model given to us by this complex machinery behind the scenes, and it encapsulates the functionality delivered by these models into easy-to-use methods for the web. It's installable via npm as part of tensorflow-models; take a look at the tensorflow-models repo, where there are a whole bunch of demos you can use for inspiration. It accompanies TensorFlow.js, which is really helpful, because with TensorFlow.js we can use machine learning in the browser. You can do inference easily enough using TensorFlow.js in your browser. You can actually train models in your browser too; it's not my favorite way of training, as I prefer a different stack to make it a bit more efficient, but you could if you wanted to. I like to use it for inference. And fun fact: it uses the fast greedy decoding algorithm from the PersonLab paper we've been talking about.
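(As a quick orientation, here's a minimal sketch of what using it looks like, based on the @tensorflow-models/posenet README; the element ID is hypothetical.)

```js
import '@tensorflow/tfjs'; // PoseNet needs a TensorFlow.js backend
import * as posenet from '@tensorflow-models/posenet';

async function estimate() {
  // Downloading the model weights is the slow part.
  const net = await posenet.load();

  // Any image, video, or canvas element works as input.
  const image = document.getElementById('dancer'); // hypothetical element
  const pose = await net.estimateSinglePose(image, { flipHorizontal: false });

  console.log(pose.keypoints); // 17 keypoints with scores and x/y positions
}
```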
Okay. So we have the model we want to use, and we've determined that we want to build some kind of app. What would be the most fun thing you could think of to build with this technology? For me, I went straight to TikTok, because why not waste a little more time on TikTok and actually push it a bit technologically, and see what we can build on top of TikTok using PoseNet. So I thought, why don't we build a side-by-side comparison, so that you can use PoseNet to compare a dancer's movement to your own by comparing the TikTok video to your webcam output, because this thing is designed to be used with your webcam. It'll be me with my webcam over here, with my skeleton determined, making a comparison against the TikTok video as the point of truth. So why not? Let's do it.

As I was thinking about this and starting to build it, I started to get a little uncomfortable, because I thought, well, maybe what I'm building and what I'm using is inherently ableist. A quick note on what ableism is: it's discrimination in favor of typically abled people, people who can walk, people who are not in wheelchairs. And I started to ask myself: are PersonLab, PoseNet, and PoseDance, my app, inherently ableist by nature, because they're looking for intact skeletons to be drawn on top of an image? Like most machine learning models, this thing doesn't actually make judgments; it just measures. If it doesn't find a piece of a body, for example if someone's arm stops at the elbow, then it's simply not going to show that wrist key point. It's just looking for parts, measuring, and displaying. However, I thought, if I create an app on top of TikTok comparing my moves, with a scoring mechanism, to semi-professional abled dancers, that actually might be ableist. So what I decided to do was make an inclusive app and make sure I show differently abled dancers: folks in wheelchairs, folks with different abilities. That's what I tried to do, just to let you know. So when you visit the app, you'll see some interesting, different types of dancers.

Okay, so let's talk about building PoseDance. I had a lot of considerations to gather together when I was starting the work. I had to think about the design, get the architecture in place, figure out how to use PoseNet, and figure out how to use a webcam. These models are heavy, and you have to handle that when you're building a web app; you have to make your app responsive. I had to figure out how to build the scoring and leaderboard (that was interesting), and then the backend implementation. Fortunately, I had Azure Functions in my back pocket. I know how to use those, and I know that's a really great way to build an API. I also had a system called PlayFab in my back pocket, another Microsoft acquisition. It's a really nice backend-as-a-service that you can use for games and gamified interfaces. It works really nicely, and your Azure Functions can talk to PlayFab and send your scores back so you can build a leaderboard. Then I had to think about deployment, and where this thing is going to live. So, lots of things to think about.

When I'm building an app, I usually start like every JavaScript developer: installing npm dependencies. Here's the list of my dependencies for the app. The most important two are the top two: PoseNet, installed from tensorflow-models, and tfjs, which is TensorFlow.js. Those two have to be paired together; they work together. Then Axios for calling my API, and Bulma for a nice responsive website; Bulma uses Sass, hence sass and sass-loader. It's a Vue site (I'm a Vue developer), so I need my bits and pieces from Vue: the router, Vuelidate for the form I created so you can log in and save your score, and vuex-persistedstate to make Vuex behave itself. Those are the bits and pieces I used.

So I needed to build an interface that gives you the webcam on one side and TikTok videos on the other. Initially I wanted to use the TikTok embedded videos; if you go to TikTok on the web, you'll find that you can get embeds, and I thought I'd be able to draw my skeleton right on top of that embed. Turns out I couldn't. I had to grab the video from TikTok, and you can actually do that in their mobile app, which is interesting. So I'm grabbing these videos as MP4s and using them within the web app. I started getting a little concerned about terms of service, so I made sure the TikTok embeds are on the homepage, where you can see all the attributions and credits given to each dancer and the music they use.

Okay. So, using PoseNet. This is really fun: you start thinking of cool things you can build with these machine learning models, and then you realize, oh my gosh, these are really big models. It takes about 10 seconds for this app to load up, because I'm not only using PoseNet, I'm using PoseNet twice. I have two models running: one for the webcam and one for the TikTok videos.
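(A minimal sketch of that warm-up, assuming the @tensorflow-models/posenet API; the loader shape and flag here are hypothetical, not the app's exact code.)

```js
import '@tensorflow/tfjs';
import * as posenet from '@tensorflow-models/posenet';

// A flag the UI can watch to show a loader widget (Bulma's, in my case)
// while both model instances download.
const state = { loading: true, webcamNet: null, videoNet: null };

async function warmUp() {
  // Fetch both model instances in parallel.
  const [webcamNet, videoNet] = await Promise.all([
    posenet.load(), // model for the webcam feed
    posenet.load(), // model for the TikTok video
  ]);
  state.webcamNet = webcamNet;
  state.videoNet = videoNet;
  state.loading = false; // hide the spinner, enable the Dance button
}
```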
So you have to create a really responsive, asynchronous-friendly website to handle these large, large models, and also to handle the webcam, because that has to be initialized and loaded up as well. I had a nice loader widget running; Bulma is very helpful with that.

Okay. So, handling the webcam: asynchronous lifecycle hooks like an async mounted hook are your friend, because in the mounted hook you have to load and reference the video, set up the webcam and get it ready to roll, set up your two canvases to draw the key points and skeletons on top of, load up your models, and handle your video events. The code here is trimmed down, because I couldn't cram it all onto one slide, but there's really a lot going on, and it all has to be asynchronous. I just want to emphasize that.

So, posing: for each video, you need to load a model and start estimating the poses. Each frame is grabbed, along with the data sent back from PoseNet via net.estimatePoses; that's the call you're using. And then you draw the key points and the skeleton yourself, using the Canvas API, at the estimated locations. It has to be pretty responsive; I found that if you ask too much of it, it can get a bit laggy, let's put it that way. This app is a fun thing to try; I would not, at this moment, send it to production and try to monetize it.

Drawing the skeleton, like I mentioned, uses the Canvas API. Here's where you can get a bit creative: you can make the skeleton really thick, use different colors, and draw your key points in different ways. I just used dots for the skeleton points and lines in between, so it's a very DIY type of thing.

Scoring. Scoring was a challenge: how can you figure out how well you're moving compared to another video? For each feed, the webcam and the video, I gathered the key points, appended them into an array, and then did basically a subtraction. The video is the point of truth, and your webcam is what you're comparing to it, so you're just finding the difference between the webcam points and the video points. It turns out your score tends to degrade into the negative as you diverge from the video. Maybe that's something that could be fixed, but for now you need to position yourself exactly where the person in the video is and make sure you're really aligning with how that person moves. It was a bit of a challenge, let's put it this way: I think my high score is something like six. It's not that great. And my daughter was annoyed by it; she finds the whole app annoying, because her scores were that bad. Hers were only ever negative.
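(The real scoring code lives in the PoseDance repo; this is just a minimal sketch of that subtraction idea, with a hypothetical helper and single-pose estimation assumed.)

```js
// Hypothetical per-frame scorer: compare the webcam skeleton against the
// TikTok video's skeleton, key point by key point.
async function scoreFrame(webcamNet, videoNet, webcamEl, videoEl) {
  const [mine, theirs] = await Promise.all([
    webcamNet.estimateSinglePose(webcamEl),
    videoNet.estimateSinglePose(videoEl),
  ]);

  // Sum the pixel distance between each pair of matching key points.
  let diff = 0;
  mine.keypoints.forEach((kp, i) => {
    const ref = theirs.keypoints[i].position;
    diff += Math.hypot(kp.position.x - ref.x, kp.position.y - ref.y);
  });

  // The further you drift from the video, the more negative the score,
  // which matches how scores behave in the app.
  return -diff;
}
```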
Okay, let's talk about the back end, now that we've got our scores. Like I said, I decided to use PlayFab, a backend-as-a-service that is great for games. I put all the calls to it in an api folder, and I use Azure Functions to call PlayFab. So here is my Azure Function: it grabs the PlayFab SDK npm package and just calls its UpdatePlayerStatistics API to send a score.
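(A minimal sketch of a function with that shape, assuming the playfab-sdk npm package and hypothetical environment-variable names; the real function is in the repo.)

```js
// Azure Function (JavaScript): report a player's score to PlayFab.
const PlayFab = require('playfab-sdk/Scripts/PlayFab/PlayFab');
const PlayFabServer = require('playfab-sdk/Scripts/PlayFab/PlayFabServer');

module.exports = async function (context, req) {
  // Hypothetical variable names; the title's secret key stays server-side.
  PlayFab.settings.titleId = process.env.PLAYFAB_TITLE_ID;
  PlayFab.settings.developerSecretKey = process.env.PLAYFAB_SECRET_KEY;

  const { playFabId, score } = req.body;

  // Wrap the callback-style SDK call in a promise.
  const result = await new Promise((resolve, reject) => {
    PlayFabServer.UpdatePlayerStatistics(
      {
        PlayFabId: playFabId,
        Statistics: [{ StatisticName: 'score', Value: Math.round(score) }],
      },
      (error, res) => (error ? reject(error) : resolve(res))
    );
  });

  context.res = { status: 200, body: result };
};
```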
So that was pretty easy to build, actually, and it works pretty well. And then the leaderboard: PlayFab does create a public leaderboard of all players. You don't have to be logged in to see the leaderboard; you do need to be logged in to send items to it, but anyone can take a look. And you just need the title secret key, which is saved in variables on the service I deployed to; I'll talk about that in a second. So I built the leaderboard via the PlayFab API and used Axios to display it.

So, deployment. Let's talk about deployment. This is where the release of a product really saved my life, because I was able to use the brand-new Azure Static Web Apps. This was just released at Microsoft Build two or three weeks ago, so it's new and hot. It is in preview, but it works beautifully for this kind of app. It works great for VuePress, for Gatsby, for a standard SPA-type website: Vue.js or React or what have you, Angular, Svelte. And if you'd like to try it, you can go to aka.ms slash try static web apps. It's a great way to host your apps: it gives you a nice URL, and you're off to the races. It works beautifully; I strongly recommend it.

Okay. I think at this point we're ready for the demo. Let's give this a try. I'm going to escape this and show you my app. So here it is: this is PoseDance, the app that I built. This is the first page, and it has an invitation to dance as much as you like. Like I said, I put differently abled people in here. I'm going to go ahead and dance with this fella. And you can see that it's loading, loading, loading the model. Okay, it's ready. I'm not logged in, so my score is not going to be captured. I can also click preview and just watch the video with the skeleton. I haven't even started, and it's already lagging a little bit. Here we go. Oops, I screwed it up already. Okay. See, I'm not far back enough. What's my score? It's probably terrible. Oh, I'm in the negative. Oh, well. I should try again; last time I got six. Fortunately I'm not logged in, so we won't visit the leaderboard. Well, we will visit the leaderboard and see if anybody's doing better than me. Still, my high score as anonymous is six. Oh, well.

I encourage you to go to aka.ms/posedance to visit the app. All of the content is on GitHub, and I have written a nice README, so you can get all the information you need on PoseDance. There's a video, you can take a look at the slides, there's a blog post coming soon, and you can try the app; all the code is ready for you. So I hope you enjoyed my talk. I'll talk to you on Twitter. And that's it for me. Thank you.

Amazing, amazing, amazing presentation, Jen. Please join me on stage for some Q&A. Hey, y'all. Hi, everyone. Hey, Jen. How's it going? Doing great. How are you? Good, good, good. I was saying before that I wanted to personally thank you for creating PoseDance. My pleasure. Exactly. I have extensive experience at dancing poorly, and you taking your time out to create this is a good thing for me. Well, I'm very happy to create the useful apps of the world. Exactly, exactly. The ones that change lives. That's right. I was saying that when I logged on to the app, just to prepare for this talk, I did score a 61.98, which I thought was up there. It is. It is. Exactly. That's a great score. Mine is like a negative 133, to be perfectly honest.

This is it. I feel that I'm already getting better as we speak, even after just one try. I did try to go back and aim even higher, but there are loads of lagging problems with the app. I was wondering if you could touch on that a little: the challenges you faced in building it, the lag problems, and how those could be avoided or repaired.

Yeah, I think there are a couple of things that might be causing that lag. One thing I've noticed is that when you click back to the screen with all those embeds (I think there are eight embeds coming from TikTok), there's quite a lot of JavaScript just loading on that page. There's not a whole lot I can do about that, because I wanted to keep all the branding around those TikTok embeds; according to their terms of service, if you're going to use TikTok, you need to have at least that much on your website, and I didn't want any problems with TikTok. And then, when you click the dance button and get to the second page, where you're creating the skeletons and all that: like I said in the talk, there's an awful lot of JavaScript. It usually takes about 10 seconds to load up your models, load up your webcam, load up the video, and get everything organized, because there are two models running on that page. So there's a lot of JavaScript, and heavy models being deployed there. It's the kind of app we're not going to just launch into production right now. It's highly likely there's a memory leak I haven't found, but maybe you just did. So, things I need to take a look at.

For sure, for sure. And we do have a question from one attendee, who was asking: since frames must match the position exactly, do body types, that is, wider shoulders, kids versus adults, affect that?

Great question. And yeah, they would, because it's really looking at the X and Y coordinates of where all of your key points are within the frame. So if you've got a person who's 6'7" comparing themselves to my very short daughter at 5'1", it's not going to match. You could, however, position yourself in the frame a little better, come a little closer to the camera, and try to get that match. That's why we put the preview button in there: so you can see where the person is situated and get yourself organized before you start.

And then, thinking more widely about the application itself and the technology behind it: where else could this go? What other applications can you see for this app and for this technology?

Yeah, well, like I said, I had my daughter play-testing this, and she said, you really missed an opportunity here, because a lot of people are trying yoga right now and they have no idea what they're doing. They're watching one minute on TikTok and becoming a yogi immediately, which is not a great situation. So I think using PoseNet to make sure all of your bits and pieces and body parts are in the right position, so you won't hurt yourself doing yoga, would be a really great way to work with a personal trainer: match my pose exactly and get a score. That would work. I've also always wondered about medical applications for this thing, and whether people could use it to figure out, if you injure yourself, what happened. Where did your hip over-twist?
Like for a gymnast, for example: when you're training, when you go up and do certain moves on the parallel bars, are there things you should not be doing so you won't hurt yourself on the dismount? That sort of thing. Actually, now I really want to try running a video of a gymnast against this and see what happens. It would be interesting.

There you go. There you go. I feel like not only are we getting answers about how good I can be as a dancer, but we're also getting other uses and other ways this can be applied for the better. We have a question from Jay. He is asking: are there any plans to rework the scoring, maybe movement-based rather than position-based?

Yeah, I don't really love the way I did the scoring. It's an awful lot of math, and I'm not a mathematician. I would welcome a PR to fix the scoring and get us away from constantly being in the negative, except for Chris, who got that plus-65 and is kind of a TikTok guru at this point. So yeah, send me your PRs and let's think about how we can make it better and faster. It's not a very performant scoring system either; right now it's very brute-force, just going through the arrays. One thing I wanted to tell you: yesterday I took my code base and added a code tour. So if you open my code base in Visual Studio Code and have the CodeTour extension installed, you can run through a walkthrough of the code base. Go ahead and give me feedback on that too. So yeah, I welcome your PRs. Please help me fix this thing.

This is it. This is it. And we also have another question, from Ruben. He wants to know: do you think measuring angles of the limbs, instead of X and Y, could improve scoring?

Yeah, I think it could, definitely, because then you wouldn't have that problem of having to situate yourself closer to the camera. I think that's a really smart idea. And please PR this thing so we can make it better.

Cool. And then just a final question. And when I say final, folks, Jen's not going anywhere; she's going to be in the virtual room to answer all your questions, so keep them coming. And if you want to check out my score again, my 61.83, I'm more than happy to share that with you as well, and give a few pointers. But in terms of that question: Rin is asking, where can we send PRs to improve it?

Right. So if you go to aka.ms/posedance (let me just make sure that's the proper link), that'll send you to the app. And just yesterday I added the GitHub repo link in the top right corner, so you can click the little GitHub icon, and that'll take you straight to the repo. It's open and ready to roll.

Nice. Nice. Thank you. Thank you very much, Jen, for basically helping me improve my social life, once we do get out of this, and hopefully build a few more followers on top of the million-plus I already have on TikTok right now, just blowing up. Exactly, exactly. So Jen is going to be heading to her virtual Zoom room, so please do feel free to send any more questions you have about PoseDance, about machine learning in JavaScript, and so many other things, and she'll definitely be happy to answer them. Jen, thank you. Thank you. And speak again. Thanks. Bye. Bye.