Finding Stealthy Bots in Javascript Hide and Seek


JavaScript has a lot of use cases, and one of them is automated browser detection. This technical talk gives an overview of the state-of-the-art automated browser used for ad fraud, how it cheats many bot detection solutions, and the unique methods that have been used to identify it anyway.


11 min
20 Jun, 2022

Video Summary and Transcription

The Talk discusses the challenges of detecting and combating bots on the web. It explores various techniques such as user agent detection, tokens, JavaScript behavior, and cache analysis. The evolution of bots and the advancements in automated browsers have made them more flexible and harder to detect. The Talk also highlights the use of canvas fingerprinting and the need for smart people to combat the evolving bot problem.


1. Introduction to Web Bots

Short description:

I'm here to ask what's going on with bots on the web. We'll talk about simple detections and how the bots got better. We'll talk about what's possibly the best bot out there, which cheats most detection solutions. And lastly we'll get to my favorite part, which is how you can find it anyway. My job is playing hide and seek with these bots so advertisers can avoid them. It's also social media and concert ticket sellers: a lot of people face this issue because the internet was not designed with bot detection in mind. Real story: when I was 16, a high school project may or may not have dropped service to some site. So to make the internet better, we want to detect them. Let's talk detections, starting with the basics. User agent: that's the HTTP request header identifying the browser. You guys know this. You see it's a Python bot, you block that. Probably not a real user behind that. The bot makers figured this out, so they hide the user agent. Let's say you don't run JavaScript on your bot.

Hey, everyone. I'm Adam. I'm super happy to be here, and I'm here to ask what's going on with bots on the web. I'm not talking about the nice ones, the ones used for testing. I'm talking about the bad ones. We'll talk about simple detections and how the bots got better. We'll talk about what's possibly the best bot out there, which cheats most detection solutions. And lastly we'll get to my favorite part, which is how you can find it anyway.

But before all that, one reason I'm here is that I always liked hacking stuff, and now I'm a reverse engineer for DoubleVerify. They measure ads, and my job is playing hide and seek with these bots so advertisers can avoid them. But it's not just advertisers in this game. It's going to be social media, concert ticket sellers, a lot of people facing this issue, because the internet was not designed with bot detection in mind. Seriously. The only real standard is robots.txt, telling bots what they're allowed and disallowed to do. Basically the honor system, asking good people to play nice. When you do that, yeah, real story: when I was 16, a high school project may or may not have dropped service to some site. But some people actually do this on purpose and at scale: denying service to real users, using what they have to steal sneakers, sneaking around social media with fake users. I practiced that part. So to make the internet better, we want to detect them.

Let's talk detections, starting with the basics. Not because bot makers can't get around these, but because simple detections are pretty straightforward, and they're usually the first thing you rely on before you come up with something more complicated. User agent: that's the HTTP request header identifying the browser. You guys know this. You see it's a Python bot, you block that. Probably not a real user behind that. The bot makers figured this out, so they hide the user agent. Let's say you don't run JavaScript on your bot.
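As a minimal sketch of that first check, assuming a server that can read the raw User-Agent header: the function name and the list of substrings below are illustrative, not taken from the talk, and the list only covers the default values a few common HTTP clients are known to send.

```javascript
// Hypothetical list: substrings that common HTTP libraries put in their default User-Agent.
const SCRIPTED_CLIENT_HINTS = ["python-requests", "python-urllib", "curl/", "wget/", "go-http-client"];

// Returns true when the User-Agent header is missing or names a scripting client.
function looksLikeScriptedClient(userAgentHeader) {
  if (!userAgentHeader) return true; // browsers normally send a User-Agent
  const ua = userAgentHeader.toLowerCase();
  return SCRIPTED_CLIENT_HINTS.some((hint) => ua.includes(hint));
}

console.log(looksLikeScriptedClient("python-requests/2.31.0")); // true
```

As the talk notes, this only catches bots that don't bother to change the header.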

2. Detecting Bots with Tokens and JavaScript

Short description:

You can use tokens and JavaScript behavior to detect bots on your site. Browser quirks can be used to verify the true nature of a browser. Digging deep into JavaScript can reveal attempts to hide something.

Maybe you generate a token on the site in JavaScript, and then make sure it's actually created. So if you have a bot that's navigating to your site, not generating this token, not running JavaScript, you know something's going wrong. But let's say they do run JavaScript. All of a sudden, you can check how the browser behaves. You people probably hate browser quirks. Bot makers hate them too, because they can be used to verify what's under the hood and not what the browser is reporting at face value. And sometimes you can dig deep in JavaScript to see if somebody's trying to hide something.
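One way the token idea could look on the server, sketched for Node.js. The talk does not describe its token scheme, so the HMAC format, the `SECRET` value and both function names are assumptions. The page's script would have to send the issued token back; a client that never runs JavaScript never returns one.

```javascript
const crypto = require("node:crypto");

// Hypothetical server-side secret; in practice it would come from configuration.
const SECRET = "replace-me";

// Issue a signed token for the page's script to send back.
// Assumes sessionId contains no "." characters.
function issueToken(sessionId, now = Date.now()) {
  const payload = `${sessionId}.${now}`;
  const sig = crypto.createHmac("sha256", SECRET).update(payload).digest("hex");
  return `${payload}.${sig}`;
}

// Check the signature and the age of a token that came back from the client.
function isValidToken(token, maxAgeMs = 5 * 60 * 1000, now = Date.now()) {
  const parts = String(token).split(".");
  if (parts.length !== 3) return false;
  const [sessionId, issuedAt, sig] = parts;
  const expected = crypto.createHmac("sha256", SECRET).update(`${sessionId}.${issuedAt}`).digest("hex");
  const a = Buffer.from(sig, "hex");
  const b = Buffer.from(expected, "hex");
  // timingSafeEqual throws on different lengths, so compare lengths first.
  if (a.length !== b.length || !crypto.timingSafeEqual(a, b)) return false;
  const age = now - Number(issuedAt);
  return age >= 0 && age <= maxAgeMs;
}
```

A session that requests pages but never presents a valid token is the signal; the token itself proves little once a bot does run JavaScript.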

3. Detecting Bots with User Agent and Behavior Tests

Short description:

User agent, hiding with Object.defineProperty. Funny stuff, bad attribute, accidental artifact for detection. Cat and mouse theme, hiding the toString. Clever ways around, repeating vectors. JavaScript library CreepJS, limited effectiveness. Using data, tokens, duplicate tokens, nonsense navigations for bot catching. Caches as another avenue for detection. Behavior tests, user click frequency.

User agent, we talked about that. That property on window.navigator is going to be read-only, so bot makers are going to hide it with Object.defineProperty. You look at the property descriptor, you see somebody did funny stuff there, trying to hide the user agent. That's going to be suspicious. The image here is that you have a bad attribute identifying you, you fix it into something perfectly fine, and as the bot maker you accidentally leave behind an artifact that can be used to incriminate you. That's going to be used for detection.
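A minimal sketch of the descriptor check. In current browsers `userAgent` is an accessor inherited from `Navigator.prototype`, so an untouched `navigator` has no own descriptor for it. The `FakeNavigator` class is a stand-in so the example runs outside a browser, and `hasOwnOverride` is a hypothetical helper name.

```javascript
// Returns true when `prop` has been redefined directly on the object instead of
// being inherited from its prototype.
function hasOwnOverride(obj, prop) {
  return Object.getOwnPropertyDescriptor(obj, prop) !== undefined;
}

// Stand-in for the browser: the prototype getter plays the role of
// Navigator.prototype.userAgent.
class FakeNavigator {
  get userAgent() { return "Mozilla/5.0 (real)"; }
}
const clean = new FakeNavigator();
const patched = new FakeNavigator();
Object.defineProperty(patched, "userAgent", { get: () => "Mozilla/5.0 (spoofed)" });

console.log(hasOwnOverride(clean, "userAgent"));   // false
console.log(hasOwnOverride(patched, "userAgent")); // true
```

In a page, the equivalent call would be `hasOwnOverride(navigator, "userAgent")`.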

This is going to be a common theme: the cat and mouse of bot detection. Another example is that the bot maker can override the toString on something they're trying to hide. So you look at the toString of the toString, and they hide that too. There's a fun game established there. There are clever ways around this; the key takeaway here is the cat and mouse theme, which is going to repeat for more detection vectors.
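The toString game can be sketched in a few lines. This is an illustration rather than the speaker's actual check: `looksNative` is a hypothetical helper, and a determined bot can go one step further and patch `Function.prototype.toString` itself.

```javascript
// Built-in functions stringify to a body of "[native code]"; a JavaScript
// replacement stringifies to its own source.
function looksNative(fn) {
  return /\{\s*\[native code\]\s*\}/.test(Function.prototype.toString.call(fn));
}

const spoofed = () => "fake";
console.log(looksNative(Math.max)); // true
console.log(looksNative(spoofed));  // false

// Next move: the bot gives its function a custom toString...
spoofed.toString = () => "function max() { [native code] }";
console.log(String(spoofed).includes("[native code]")); // true, the naive check is fooled

// ...but calling Function.prototype.toString directly ignores that own property.
console.log(looksNative(spoofed)); // false
```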

Let's say you're really good at this stuff. You make a JavaScript library like CreepJS to fingerprint a browser under the hood. That only goes so far, because the bot makers can see what you're doing. Every time you find them some way, they're going to evolve, they're going to patch a little bit, and now we've got to use something else.

Let's say on a site you want to use data. Data is going to be tricky because you have to mind privacy issues or unruly users that you're testing, but let's say, just for the sake of argument, you have a site that users go to. On each page you put a generated token that you can validate as the user goes through. You can see where that user went. All of a sudden you have a whole new arena for catching these bots: duplicate tokens, nonsense navigations, anything that gives you even the tiniest hint that somebody may have just pre-programmed the logic in some way and it's not an actual user navigating. Some of you might be thinking, hey, won't users do this too? And that's absolutely right, with caches and stuff. That's why this isn't a smoking gun on its own. But caches, for example, also introduce a whole other avenue for bot detection: some bots clean the cache too much.

We'll get into advanced detection in just a moment, but still in the domain of old-school simple detections: behavior tests. How many times should the user click on the site? Let's say in an hour. OK. So, there's going to be a little spectrum.
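Returning to the per-page tokens described above, here is a rough sketch of the two signals the talk names, duplicate tokens and nonsense navigations. The event shape `{ token, path }` and the `links` site map are assumptions made up for the example, and as the talk says, neither signal is a smoking gun on its own.

```javascript
// Each event is a hypothetical record { token, path }, logged when a page token is redeemed.
function findDuplicateTokens(events) {
  const seen = new Set();
  const duplicates = new Set();
  for (const { token } of events) {
    if (seen.has(token)) duplicates.add(token);
    seen.add(token);
  }
  return [...duplicates];
}

// `links` maps a path to the Set of paths it actually links to (an assumed site map).
// Counts consecutive page views that no link on the previous page could explain.
function countImpossibleHops(events, links) {
  let count = 0;
  for (let i = 1; i < events.length; i++) {
    const from = events[i - 1].path;
    const to = events[i].path;
    if (!(links[from] || new Set()).has(to)) count++;
  }
  return count;
}

const session = [
  { token: "a", path: "/" },
  { token: "b", path: "/news" },
  { token: "a", path: "/checkout" },
];
const siteLinks = { "/": new Set(["/news"]), "/news": new Set(["/"]) };

console.log(findDuplicateTokens(session));            // [ 'a' ]
console.log(countImpossibleHops(session, siteLinks)); // 1
```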

4. Bots, CAPTCHAs, and Automated Browsers

Short description:

How much does each user do it? You look at the edges: on one side you have zero, which is going to be my father clicking absolutely zero times after a long day of work. On the other edge, you have people clicking 172 times per second. CAPTCHAs were originally meant to distinguish humans from bots, but today they mostly just slow the bots down. The evolution of bots and the advances in automated browsers have made them more flexible and harder to detect. Puppeteer, Google's automated browser, is the kingpin in botting, making it accessible and difficult to detect.

How much does each user do it? You're going to look at the edges, and on one side you have zero, which is going to be my father clicking absolutely zero times after a long day of work. But on the other edge, you have people clicking 172 times per second. OK, we're getting somewhere there.
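A small sketch of that click-frequency test, assuming you already collect click timestamps in milliseconds. The function name and the threshold are hypothetical; the talk gives only the two extremes, not a cutoff.

```javascript
// Hypothetical cutoff; the talk does not give one.
const MAX_PLAUSIBLE_CLICKS_PER_SECOND = 20;

// Peak number of clicks inside any sliding window of `windowMs` milliseconds.
function peakClicksPerWindow(timestampsMs, windowMs = 1000) {
  const sorted = [...timestampsMs].sort((a, b) => a - b);
  let peak = 0;
  let start = 0;
  for (let end = 0; end < sorted.length; end++) {
    // Shrink the window from the left until it spans less than windowMs.
    while (sorted[end] - sorted[start] >= windowMs) start++;
    peak = Math.max(peak, end - start + 1);
  }
  return peak;
}

console.log(peakClicksPerWindow([0, 100, 200, 5000])); // 3
console.log(peakClicksPerWindow([0, 100, 200, 5000]) > MAX_PLAUSIBLE_CLICKS_PER_SECOND); // false
```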

And also, CAPTCHAs. That's what we came up with originally to distinguish humans from bots. You might be asking, hey, why didn't you start with that? The reason is that once the bots train, they absolutely demolish humans at the simple ones. And this goes for complex CAPTCHAs too. So CAPTCHAs aren't there to prevent bots. What they do currently, the reason you see them, is slow the bots down.

Moving forward, let's talk about how the bots got better. Getting closer to the advanced detection part: the bot makers haven't been sleeping on their guard all this time. They got better, and they keep getting better; with every little patch they evolve. Eventually, the game became written in their favor. Ten years ago, they were struggling with Python scripts. In recent news, it's now publicly retweeted, so here's some guy complaining about this to Elon Musk. Ping me in the Q&A if you want to talk about the seekers and the hatters about this some more. I'm right about that one. The point being that evolution is really good at gradual improvement and problem solving.

Bots might eventually become indistinguishable from humans entirely. But moving from philosophical rambles to practice, the biggest technical advance bots have made is going to be automated browsers. Automated browsers have changed the game: no more requests in Python. You're taking the whole browser, you run the DOM and JavaScript, you can even fake the user. Automated browsers got really good at their thing, so they let bot makers fake attributes like the user agent without leaving behind the artifact you'd normally leave in JavaScript. That makes bots more flexible and harder to detect. Browser quirks and all that are useless when the browser automation solution is driving the real browser, with whatever quirks it has. And you can fake the user too, using vanilla JavaScript or browser hooks, scrolling through some articles with window.scroll.

Here are some automated browsers that are good for testing, but don't be like some of the bot operators I found using these for fraud schemes: they're not meant to hide anything, so they're easy to detect. No, we're getting to the kingpin here. You want something that's based on this guy, Puppeteer. That's Google's automated browser. It's not malicious or trying to hide, but it's really good at its thing and super easy to extend, which makes everything super smooth. With that, all the pieces came into place: bots getting better, automated browsers available, and operators improving their game. Thus the king bot was born.

This guy, Puppeteer Extra Stealth, is the best at hiding, so it can run headless, meaning no rendering on the screen, and still not get detected. It's amazing and criminally easy to use, making really good botting really accessible. The community behind it is an army looking for even the slightest discrepancies. For example, they patch hardware concurrency, that's the number of available processors you have, so they can scale their operations and run many of these on the same computer without even raising the suspicion that they are bots. A whole different playing field there. They left some traces when they patched this attribute on the prototype; people detected that, they found out, they patched it, and they did this fast.
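The talk doesn't say which traces the hardware concurrency patch left behind. As one hypothetical example of the kind of artifact a prototype patch can leave, the sketch below checks whether an accessor on a prototype still looks like the built-in getter. `getterLooksUntouched` is an illustrative name, and `Map.prototype.size` stands in for a native accessor so the code runs outside a browser.

```javascript
// Returns true when `prop` on `proto` is still an accessor whose getter has the
// conventional name and stringifies as native code.
function getterLooksUntouched(proto, prop) {
  const desc = Object.getOwnPropertyDescriptor(proto, prop);
  if (!desc || typeof desc.get !== "function") return false; // removed or turned into a data property
  const src = Function.prototype.toString.call(desc.get);
  return desc.get.name === `get ${prop}` && src.includes("[native code]");
}

// A genuine built-in accessor passes.
console.log(getterLooksUntouched(Map.prototype, "size")); // true

// A getter patched in from JavaScript does not.
class PatchedNavigator {}
Object.defineProperty(PatchedNavigator.prototype, "hardwareConcurrency", { get: () => 4 });
console.log(getterLooksUntouched(PatchedNavigator.prototype, "hardwareConcurrency")); // false
```

In a page the call would be `getterLooksUntouched(Navigator.prototype, "hardwareConcurrency")`. As the talk notes, artifacts like this tend to get patched quickly once they are known.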

5. Detecting Bots: Canvas Fingerprinting and Beyond

Short description:

Let's talk about canvas fingerprinting and how it can be faked. Headless Chromium lacks objects such as chrome.csi and chrome.app, but the bot fills them with fake values to evade detection. Easy bots are easy to detect, but hard bots require techniques like JavaScript artifacts around hardware concurrency, behavior tests, and data analysis. By analyzing user agent data, you can identify even the best bots. The internet needs more smart people to combat the evolving bot problem.

Let's talk about something harder to fake: canvas fingerprinting. That's when you dynamically render a canvas to fingerprint your device alongside the browser. It should be harder to fake. Bam, these folks developed an extension that reports fake values using JavaScript hooks, so they solve pretty much anything.

I want to take the time to explain just one of these before I fly through the others. Chromium, when it's headless, does not have chrome.csi, chrome.app, all that performance stuff. So you think, okay, can I detect Puppeteer when it's headless with this? All of a sudden, not so fast, buddy: they fill every single object with fake values like this, chrome.loadTimes, chrome.runtime, chrome.app, and so on and so forth, anything that can be used for detection. And it's stupid easy to use, annoyingly elegant. Remember CAPTCHAs? Two lines here, a little bit of money, and they solve that. And all of this is going to be just these two lines here. Everything we talked about, super easy to use. Bot tests fail to find it; they just hang the bot detection results on their repo. But I promised to get to what actually works, and I made it to that part within 10 minutes. All right.
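The naive check that this evasion defeats might look like the sketch below. The key list simply mirrors the names mentioned in the talk, `missingChromeKeys` is a hypothetical helper, and the check only makes sense when the user agent claims to be Chrome. It takes a window-like object so it can be run outside a browser.

```javascript
// Names the talk lists as missing from headless Chromium.
const EXPECTED_CHROME_KEYS = ["app", "csi", "loadTimes", "runtime"];

// Returns the expected keys that are absent from win.chrome.
function missingChromeKeys(win) {
  const chrome = win.chrome;
  if (typeof chrome !== "object" || chrome === null) return [...EXPECTED_CHROME_KEYS];
  return EXPECTED_CHROME_KEYS.filter((key) => !(key in chrome));
}

// Stand-ins for a headless window and for a window whose objects were filled with fakes.
const headlessLike = {};
const stealthLike = { chrome: { app: {}, csi() {}, loadTimes() {}, runtime: {} } };

console.log(missingChromeKeys(headlessLike)); // [ 'app', 'csi', 'loadTimes', 'runtime' ]
console.log(missingChromeKeys(stealthLike));  // []
```

The second result is the point of the passage: once the objects are filled in, a presence check alone sees nothing wrong.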

Obviously, I can't specify too much here, but let's start with the easy part. Easy bots are easy to detect. I'm going to name three ways to go after hard bots. This is going to be quick, but I dare say it's more than what's out there. Starting with stuff like hardware concurrency: there are still more JavaScript artifacts to be found if you know where to look. These are becoming increasingly rare, though, so I wouldn't count on them long term, but the upside here is that they're very clear-cut.

What will hold up over time is behavior tests and session-level data analysis. Behavior tests that still work usually look at window context discrepancies when interacting with the DOM. Data analysis can take many shapes. For example, let's say you fake the user agent perfectly; the question is what value you put there. Look at this graph: these are the user agents navigating to an app, and each point is how many people navigated with that specific user agent. You'd say there's some variance here, but this is what it's supposed to look like. The blue line, though, is almost flat, and that's probably because the bot operator got right that the user agent varies, but got the weights entirely wrong. They're probably just producing these at random. Normal sites don't look like this, and here's one way you can detect even the best bot.
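One way to put a number on "the blue line is almost flat" is normalized Shannon entropy over the per-user-agent counts. This is a swapped-in illustration, not the analysis used in the talk; the function name and the sample counts are made up.

```javascript
// Normalized Shannon entropy of a count distribution: near 0 when one value
// dominates, exactly 1 when every value is equally common (a flat line).
function normalizedEntropy(counts) {
  const values = Object.values(counts).filter((n) => n > 0);
  if (values.length < 2) return 0;
  const total = values.reduce((sum, n) => sum + n, 0);
  let entropy = 0;
  for (const n of values) {
    const p = n / total;
    entropy -= p * Math.log2(p);
  }
  return entropy / Math.log2(values.length);
}

// Made-up counts: organic traffic is usually skewed toward a few popular browsers,
// while user agents drawn uniformly at random come out flat.
const organic = { uaA: 970, uaB: 10, uaC: 10, uaD: 10 };
const flat = { uaA: 250, uaB: 250, uaC: 250, uaD: 250 };

console.log(normalizedEntropy(organic) < 0.5); // true
console.log(normalizedEntropy(flat));          // 1
```

A score close to 1 for traffic to a single app would match the flat line described above; where to draw the cutoff depends on what the site's normal traffic looks like.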

So at the start I asked you what's up with bots on the web? I can't tell you for sure, but what I do know is that they're getting better, we need more smart people like you to be aware so that the internet becomes a better place.
