Finding Stealthy Bots in JavaScript Hide and Seek


JavaScript has a lot of use cases, and one of them is automated browser detection. This technical talk gives an overview of the state-of-the-art automated browser used for ad fraud, how it cheats many bot detection solutions, and the unique methods that have been used to identify it anyway.





Transcription


Hey everyone, I'm Adam. I'm super happy to be here, and I'm here to ask: what's going on with bots on the web? I'm not talking about the nice ones, the testing ones. I'm talking about the bad ones. So we'll talk about simple detections and how the bots got better. We'll talk about what's possibly the best bot out there, cheating most detection solutions. And lastly we'll get to my favorite part, which is how you can find it anyway.

But before all that, one reason I'm here is that I've always liked hacking stuff, and now I'm a reverse engineer at DoubleVerify. They measure ads, and part of my job is playing hide and seek with these bots so advertisers can avoid them. It's not just advertisers in this game: social media, concert ticket sellers, a lot of people are facing this issue, because the internet was not designed with bot detection in mind. Seriously, the only real standard is robots.txt, telling bots what they're allowed and disallowed to do. Basically the honor system, asking people to play nice. And when you do that, well, real story: when I was 16, a high school project of mine may or may not have dropped service to some sites. But some people actually do this on purpose and at scale: denying service to real users, scalping sneakers, sneaking around social media with fake users. So to make the internet better, we want to detect them.

Let's talk detections, starting with the basics. Not because bot makers can't play around these, but because they're usually the first thing you rely on before you come up with something more complicated, and because simple detections are pretty straightforward. User agent: the HTTP request header that identifies the browser. You know this one. You see it's a Python bot, you block it; there's probably not a real user behind that. The bot makers figured this out, and now they hide the user agent.

Let's say your bot doesn't run JavaScript. Maybe, as the site, you make a token the detection: in JavaScript you generate it, and you make sure it was created. So if a bot navigates to your site without generating this token, without running JavaScript, you know something's going wrong. But let's say they do run JavaScript. All of a sudden, you can check how the browser behaves. You probably hate browser quirks; bot makers hate them too, because quirks can be used to verify what's under the hood, not just what the browser reports at face value. And sometimes you can dig deep in JavaScript to see if somebody's trying to hide something.

User agent, we talked about that. That property on window.navigator is read-only, so bot makers hide it with Object.defineProperty. You look at the property descriptor, you see somebody did some funny stuff there trying to hide a user agent, and that's suspicious. The image here being: you had a bad attribute identifying you, you fixed it into something perfectly fine, but as the bot maker you accidentally left behind an artifact that can incriminate you, and that artifact gets used for detection (first sketch below). This is going to be a common theme, the cat and mouse of bot detections. Another example: the bot maker can override the toString of something they're trying to hide. So you look at the toString; then they hide the toString of the toString too. There's a fun game established there (second sketch below). There are several ways around this, the key takeaway being this cat-and-mouse theme that's going to repeat.
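To make that descriptor check concrete, here's a minimal sketch of the idea (my illustration, not DoubleVerify's actual detection): in an untouched browser, navigator.userAgent is a native getter inherited from Navigator.prototype, so an own property on the navigator instance is exactly the artifact Object.defineProperty leaves behind.

```js
// Sketch: detect a user agent spoofed with Object.defineProperty.
// A clean browser keeps userAgent on Navigator.prototype as a native
// getter; redefining it on the instance leaves an own property behind.
function userAgentLooksPatched() {
  // An own descriptor on the instance means somebody redefined the property.
  if (Object.getOwnPropertyDescriptor(navigator, 'userAgent') !== undefined) {
    return true;
  }
  // The prototype should still expose the original accessor.
  const proto = Object.getOwnPropertyDescriptor(Navigator.prototype, 'userAgent');
  return !proto || typeof proto.get !== 'function';
}

console.log(userAgentLooksPatched()); // false in an untouched browser
```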
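And here's the toString round of the game, as a sketch, under the assumption the bot swapped the getter for a plain JavaScript function: native functions stringify to something containing "[native code]", and when bots fake that too, you can run the same test on toString itself.

```js
// Sketch: is the userAgent getter a real native function?
function userAgentGetterLooksNative() {
  const desc = Object.getOwnPropertyDescriptor(Navigator.prototype, 'userAgent');
  if (!desc || typeof desc.get !== 'function') return false;

  // Round one: a native getter contains "[native code]" when stringified.
  const source = Function.prototype.toString.call(desc.get);
  if (!source.includes('[native code]')) return false;

  // Round two: the bot may have overridden Function.prototype.toString to
  // lie about round one, so apply the same check to toString itself.
  const toStringSource = Function.prototype.toString.call(Function.prototype.toString);
  return toStringSource.includes('[native code]');
}
```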
Back to some more detections. Let's say you're really good at this DOM stuff. You make a JavaScript library like CreepJS to fingerprint the browser under the hood. That only goes so far, because the bot makers can see what you're doing. Every time you find them some way, they evolve, they patch a little bit, and now you have to use something else.

Let's say you run a site and you want to use data. Data is tricky, because you have to mind privacy issues around the real users you're testing. But say, for the sake of argument, you have a site that users navigate through. On each page you put a generated token that you validate as the user goes through, so you can see where that user went. All of a sudden you have a whole new arena for catching these bots: duplicate tokens, nonsense navigations, anything that gives you even the tiniest hint that somebody pre-programmed your logic in some way and it's not an actual user navigating (first sketch below). Some of you might be thinking, hey, real users do this too, with caches and stuff, and that's absolutely right. This isn't a smoking gun on its own. But caches, for example, also introduce a whole other avenue for bot detection: some bots clean the cache too much.

We'll get to advanced bots and detection in just a moment, but still in the domain of old-school simple detections: behavior tests. How many times should a user click on your site, let's say in an hour? There's a spectrum of how much each user does it, and you look at the edges. On one side you have zero; that's my father, clicking absolutely zero times after a long day at work. But on the other edge you have people clicking 172 times per second. Okay, we're getting somewhere there (second sketch below).
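First, a hypothetical sketch of the per-page token bookkeeping, in Node. The names issueToken and validateNavigation are mine, not from the talk, and a real system would also expire tokens and tie them to sessions.

```js
const crypto = require('crypto');

const issued = new Map(); // token -> { page, usedAt }

// Embed a fresh one-time token in every page you render.
function issueToken(page) {
  const token = crypto.randomBytes(16).toString('hex');
  issued.set(token, { page, usedAt: null });
  return token;
}

// On the next navigation, the client sends back the token it was given.
function validateNavigation(token, fromPage) {
  const record = issued.get(token);
  if (!record) return 'unknown token';                  // pre-programmed client?
  if (record.usedAt !== null) return 'duplicate token'; // replayed token
  if (record.page !== fromPage) return 'nonsense navigation';
  record.usedAt = Date.now();
  return 'ok';
}
```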
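And a toy version of the click-rate test. The one-second window and the threshold are illustrative only, and a real system would report to a backend rather than the console.

```js
// Toy behavior test: flag the impossible edge of the click-rate spectrum.
let clicksThisSecond = 0;
document.addEventListener('click', () => { clicksThisSecond += 1; });

setInterval(() => {
  // Nobody legitimately clicks 172 times per second.
  if (clicksThisSecond > 20) {
    console.warn('suspicious click rate:', clicksThisSecond, 'per second');
  }
  clicksThisSecond = 0;
}, 1000);
```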
Also CAPTCHAs, that's what we came up with originally to distinguish humans from bots. You might ask why I didn't start with those. The reason is that trained bots absolutely demolish humans on the simple ones, and that holds for complex CAPTCHAs too. So CAPTCHAs aren't really there to prevent bots; the reason you still see them is that they slow the bots down.

Moving forward, let's talk about how the bots got better, getting closer to the advanced detection part. The bot makers haven't been sleeping all this time. They got better, and they keep getting better. With every little patch they evolve, and eventually the game became rigged in their favor. Ten years ago they were struggling with Python scripts; in recent news, bots are getting publicly retweeted, and here's some guy complaining about them to Elon Musk. Ping me in the Q&A if you want to talk about the seekers and the hiders some more. Point being, this evolution thing is really good at gradual improvement and problem solving, and bots might eventually become indistinguishable from humans entirely. But moving from philosophy rambles to practice, the biggest technical advance bots made in the last ten years is automated browsers.

Automated browsers changed the game. No more requests in Python: you take the whole browser, you run the DOM and JavaScript, you can even fake the user. Automated browsers got really good at their thing. Bot makers can fake attributes, like the user agent, without leaving behind the artifacts you'd normally leave in JavaScript. That makes bots more flexible, harder to detect. Browser quirks and all that are not as useful when the browser automation solution is importing the real browser, with whatever quirks it has. You can fake the user with these too, using vanilla JavaScript or browser hooks, say scrolling through some articles with window.scroll. Here are some automated browsers that are good for testing. But don't be like some of the bot operators I found using these for fraud schemes, because they're not meant to hide anything; they're easy to detect. No, you want something better.

We're getting to the kingpin here. You want something that's based on this guy: Puppeteer, Google's automated browser. It's not malicious, it's not trying to hide, but it's really good at its thing and it makes itself super easy to extend. Makes everything super smooth. And with that, all the pieces came into place: bots getting better, automated browsers available, and operators improving their game. Thus the king bot was born: this guy, puppeteer-extra's stealth plugin. It's by far the best at hiding. It can run headless, meaning no rendering to a screen, and still not get detected. It's amazing, criminally easy to use, and it makes really good botting really accessible. The community behind it is an army looking for even the slightest discrepancy. For example, they patch hardwareConcurrency, the number of available processors, so they can scale the operation to run many of these on the same computer without even raising the suspicion that they're a bot. Whole different playing field there. They left some traces when they patched this attribute on the prototype; people detected that, the community found it and patched it, and they did it fast.

Let's talk about something harder to fake: canvas fingerprinting. That's when you dynamically render a canvas to fingerprint the device alongside the browser. Sorry, that should be harder to fake. Bam: these folks developed an extension that reports fake values using JavaScript hooks. They solve pretty much anything. I want to take the time to explain just one of these before I fly through the others. You have Chromium here: when it runs headless, it doesn't have chrome.csi, chrome.app, all that performance stuff. So you think, okay, I can detect Puppeteer when it's headless with this. Not so fast, buddy: they fill every single object with fake values. window.chrome, loadTimes, runtime, app, so on and so forth. Anything that can be used for detection. And it's stupid easy to use, annoyingly elegant. Remember CAPTCHAs? Two lines here and a little bit of money, and they solve those too. And all of this is just these two lines here. Everything we talked about, super easy to use, and the bot tests fail to find it; they just hang the defeated bot detections on their repo. That bot, by the way, I made within ten minutes. Sketches of the canvas trick, the headless check, and the famous two lines follow below.
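First, a minimal sketch of what canvas fingerprinting does: render something, then hash the pixel output, which varies subtly with GPU, drivers, and browser. This is the output a stealth extension can fake by hooking the canvas APIs.

```js
// Sketch: hash a canvas rendering into a device/browser fingerprint.
async function canvasFingerprint() {
  const canvas = document.createElement('canvas');
  canvas.width = 240;
  canvas.height = 60;
  const ctx = canvas.getContext('2d');
  ctx.textBaseline = 'top';
  ctx.font = '14px Arial';
  ctx.fillStyle = '#f60';
  ctx.fillRect(120, 10, 80, 30);
  ctx.fillStyle = '#069';
  ctx.fillText('hide and seek', 2, 15);

  // toDataURL is one of the hooks a stealth extension can override.
  const bytes = new TextEncoder().encode(canvas.toDataURL());
  const digest = await crypto.subtle.digest('SHA-256', bytes);
  return [...new Uint8Array(digest)]
    .map((b) => b.toString(16).padStart(2, '0'))
    .join('');
}

canvasFingerprint().then((hash) => console.log(hash));
```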
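Next, the headless check just described, roughly; keep the talk's caveat in mind that stealth plugins fill all of these in, so a passing check proves nothing on its own.

```js
// Sketch: headful Chrome exposes window.chrome with app, csi and loadTimes,
// while (older) headless builds did not.
function looksLikeHeadlessChrome() {
  if (!/Chrome/.test(navigator.userAgent)) return false; // only meaningful for Chrome UAs
  const c = window.chrome;
  return (
    !c ||
    !c.app ||
    typeof c.csi !== 'function' ||
    typeof c.loadTimes !== 'function'
  );
}
```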
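And the two lines themselves, as documented in the puppeteer-extra project; everything after them is a regular Puppeteer script.

```js
// The two lines: plain Puppeteer, plus the stealth plugin.
const puppeteer = require('puppeteer-extra');
puppeteer.use(require('puppeteer-extra-plugin-stealth')());

// From here on it's normal Puppeteer, now passing most
// off-the-shelf bot checks.
(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();
```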
All right, but I promised to get to what actually works. Obviously, I can't specify too much here, but let's start with the easy part: easy bots are easy to detect. For the hard bots, I'm going to name three ways to go after them. This is going to be quick, and I won't say much more than what's already out there. Starting with stuff like hardwareConcurrency: there are still more JavaScript engine artifacts to be found if you know where to look. These are becoming increasingly rare, though, so I wouldn't count on them long term, but the upside is that they're very clear-cut. What will hold over time is behavior tests and session-level data analysis. Behavior tests that still work usually look at window context discrepancies when interacting with the DOM. And data analysis can take many shapes.

For example, let's say you fake the user agent perfectly. The question is what value you put there. Look at this graph: it shows traffic navigating to an app, where each point is how many navigations came with that specific user agent. You'd expect some variance, and that's what it's supposed to look like. But the blue line is almost flat. That's probably because this bot's maker got right which user agents to vary between, but got the weights entirely wrong; they're probably just picking them at random. Normal sites don't look like this. And that's one way you can detect even the best bots (sketch below).
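Here's a hypothetical sketch of that distribution check: bucket sessions by exact user-agent string and look for the tell-tale flat shape. Organic popularity is heavily skewed toward a few dominant user agents, so a near-uniform histogram suggests random sampling; the threshold below is made up for illustration.

```js
// Sketch: count sessions per exact user-agent string.
function userAgentHistogram(sessions) {
  const counts = new Map();
  for (const { userAgent } of sessions) {
    counts.set(userAgent, (counts.get(userAgent) || 0) + 1);
  }
  return counts;
}

// Organic traffic is dominated by a few user agents, so the busiest one
// should tower over the average; a flat histogram is the bot's signature.
function looksUniform(counts) {
  const values = [...counts.values()];
  if (values.length < 10) return false; // too little data to judge
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  return Math.max(...values) < 2 * mean; // illustrative threshold
}
```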
So at the start I asked you: what's up with bots on the web? I can't tell you for sure, but what I do know is that they're getting better. We need more smart people like you to be aware, so that the internet becomes a better place. Thank you, and feel free to reach out.

11 min
20 Jun, 2022
