Mastering Web Scraping with Scrapoxy: Unleash Your Data Extraction Wizardry!

Unlock the potential of web scraping with this session! 

1/ Building Web Scrapers - The Art Unveiled

2/ Proxy and Browser Farms Adventure

3/ Scrapoxy Orchestration - Elevate Your Scalability

4/ Protection Measures Disclosed

This concise session will immerse you in the world of web scraping.

#WebScraping #Proxy #ReverseEngineering 🕵️‍♂️

Fabien Vauchelles
21 min
04 Apr, 2024

Video Summary and Transcription

Fabien Vauchelles is a software developer passionate about web scraping and the creator of Scrapoxy, a proxy aggregator. Isabella, a student, used web scraping to gather data for her trip tool. The talk discusses various techniques for web scraping, including using language models and proxies to bypass website protections. The speaker also introduces Scrapoxy, a super proxy aggregator, and explains how to integrate it into web scraping projects. The use of Playwright, a headless browser automation tool, is highlighted for handling fingerprint errors. Ultimately, the talk emphasizes the importance of adjusting time zones to successfully complete web scraping requests.


1. Introduction to Web Scraping and Proxy Systems

Short description:

Hi, I'm Fabien Vauchelles. I've been deeply passionate about web scraping for years. I work at Wiremind, an amazing company specializing in revenue management within the transportation industry. I'm also the creator of Scrapoxy, a free and open-source proxy aggregator. It supports major cloud providers and proxy services. It's fully written in TypeScript with the NestJS and Angular frameworks.

Hi, I'm Fabien Vauchelles. I've been deeply passionate about web scraping for years. My enthusiasm led me to explore the fascinating world of proxy and antibot systems.

I work at Wiremind, an amazing company specializing in revenue management within the transportation industry. Our work at Wiremind involves handling millions of prices on a daily basis, which requires substantial investment in web scraping technologies.

I'm also the creator of Scrapoxy. Scrapoxy is a free and open-source proxy aggregator. It allows you to manage and route traffic through cloud providers and proxy services. It supports major cloud providers such as AWS, Azure, GCP, and DigitalOcean, and proxy services like Zyte, Rayobyte, IPRoyal, and many others. It's fully written in TypeScript with the NestJS and Angular frameworks.

2. Isabella's Journey to Web Scraping

Short description:

Before diving into this amazing product, let me share with you a little story. Isabella, a final-year student in IT school, noticed a gap in the market and realized she needed a vast amount of data to create her ultimate trip tool. She decided to focus on accommodations and made sure to consider all legal aspects. Now, let me introduce you to the website she chose to scrape, TrekkieReviews.com. It is your go-to spot for checking out accommodations in any city. Isabella is interested in analyzing reviews to see what people think about accommodations.

Before diving into this amazing product, let me share with you a little story. Enter Isabella. She's a final-year student at an IT school. Isabella has a bright mind, a lot of energy, and a thirst for traveling. Every year, she embarks on a one-month backpacking journey to a random country. But here's the twist: planning consumed her entire year, all in preparation for just one month of travel. Isabella couldn't help but notice a gap in the market. Why wasn't there a tool for this in a digital era pumped with AI? This could be her ticket to a successful business. She realized she needed a vast amount of data to create such a tool. This vast amount of data would train a large language model to curate her ultimate trip. However, she's a total newcomer to the web scraping industry. How could she collect massive amounts of data? To kick things off, she decided to focus all her efforts on accommodations.

However, Isabella is very careful in her approach to business. Before she starts scraping data, she makes sure to consider all the legal aspects. She knows it's important not to overwhelm the website by making too many requests too quickly. She also respects privacy: she only collects information that is already public, like reviews, and doesn't take any personal details like names. She hasn't signed the website's terms and conditions either, so she's free from any contract. Now that everything is clear, she is ready to collect the data. Let me introduce you to the website she chose to scrape, TrekkieReviews.com.

So what's TrekkieReviews all about? It is your go-to spot for checking out accommodations in any city you're interested in. Here is how it works. You simply enter the name of the city you want to explore in the search bar, and you will see a list of all available accommodations. Let's say that Isabella is dreaming of Paris: she will find 50 accommodations. If she clicks on one hotel, she will get all its information, like its name, description, address, email, and reviews. Isabella is interested in the reviews. It is all about analyzing those reviews to see what people think about the accommodations.

3. Deep Dive into Web Scraping

Short description:

With the latest large language models, Isabella can perform sentiment analysis and bypass website protections. By following requests using the Chrome inspector tool, Isabella focuses on extracting HTML content and uses libraries like Axios and Cheerio for efficient text extraction. Her spider enables her to retrieve hotel information in a structured way.

With the latest large language models, she can dive deep into sentiment analysis and extract the main feeling or issue. But wait, there is more. The website is also super secure: I've put in place different levels of protection to keep the place safe. And during this presentation, Isabella will try to bypass each protection one by one.

To better understand how requests are chained, let's open the Chrome inspector tool. Here is the scoop. When you land on a webpage, it's like getting a big package delivered to your doorstep. Inside this package, there's a whole bunch of stuff: HTML, CSS, JavaScript, and images. But let's keep it simple. We are only interested in HTML. I click on the Doc filter and make sure to preserve logs so we don't lose track of anything. Now let's follow the requests.

The first one is the search form, with the URL level1. The second one is the list of hotels and its pagination, with the URL cities and the parameter city set to Paris. The last one is the detail of a hotel, with the URL hotel/<id>. If we click on the Response tab, you will see the full HTML content from which to extract the hotel ratings. However, it can be complex to execute requests and parse HTML manually, so I want to spotlight two powerful libraries: Axios and Cheerio. Axios is a JavaScript library designed to handle requests. Cheerio is another JavaScript library which parses HTML using CSS selectors, making text extraction a breeze. Here's a minimal sketch of the pattern; after that, let's break down our spider.
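The URL pattern and the CSS selector below are assumptions for illustration, not the demo site's real markup:

```typescript
import axios from 'axios';
import * as cheerio from 'cheerio';

// Fetch one hotel detail page and extract its rating.
// The URL pattern and the selector are hypothetical placeholders.
async function getHotelRating(hotelId: string): Promise<string> {
  const response = await axios.get(`https://trekkiereviews.com/hotel/${hotelId}`);
  const $ = cheerio.load(response.data);     // parse the HTML body
  return $('.rating').first().text().trim(); // extract text via a CSS selector
}
```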

So if I open my spider, I've got a lot of information and several methods. The first one is a method to go to the homepage. After that, I can search hotels in a city. Here I'm going to Paris, and I retrieve the links of the hotels. Once I've got the links, I can get each hotel's details and extract the name, email, and rating. Let's run this spider. If I run it, I get 50 items, and I can check the information in a structured way.
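Put together, the spider's three steps might look roughly like this sketch; the selectors and parameter names are assumptions about the demo site:

```typescript
import axios from 'axios';
import * as cheerio from 'cheerio';

const BASE = 'https://trekkiereviews.com';

async function run() {
  // Step 1: go to the homepage.
  await axios.get(`${BASE}/`);

  // Step 2: search hotels in a city and collect the detail-page links.
  const list = await axios.get(`${BASE}/cities`, { params: { city: 'Paris' } });
  const $list = cheerio.load(list.data);
  const links = $list('a.hotel-link')
    .map((_, el) => $list(el).attr('href'))
    .get();

  // Step 3: fetch each hotel's details and extract name, email, and rating.
  for (const link of links) {
    const detail = await axios.get(`${BASE}${link}`);
    const $ = cheerio.load(detail.data);
    console.log({
      name: $('h1').text().trim(),
      email: $('.email').text().trim(),
      rating: $('.rating').text().trim(),
    });
  }
}

run().catch(console.error);
```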

4. Increasing Security and Introducing Proxies

Short description:

I've got different information here, names, emails, and ratings. By identifying and modifying the request headers, I can bypass the website's spider detection and move up to the next security level. Moving to level three, I encounter the issue of too many requests from the same IP address. To solve this, I introduce a proxy system that relays requests and masks the original source. Data center proxies, hosted on services like AWS, Azure, or GCP, are fast and reliable but can be easily detected by antibot solutions.

I've got different information here: names, emails, and ratings. It's perfect, so I can increase the security level. Let's move to level two. If I run the spider now, I get an unknown browser error. It is because the spider identified itself as Axios, and the website rejected it. If we open Chrome and check the request headers, we'll see that Chrome is sending insightful information such as the user agent: Chrome identifies itself as Mozilla on Linux. There are also other headers associated with the user agent. I will add this information to the spider. I will set the user agent and say that it is Windows, and I will also add other information, such as the sec-ch-ua headers: sec-ch-ua-mobile and sec-ch-ua-platform. And if I run the spider this time, I've got my 50 items. It's perfect. In Axios, the change looks roughly like the sketch below.
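A hedged sketch of those header tweaks: the exact values below are illustrative, so copy the current ones from your own browser's Network tab rather than trusting these verbatim:

```typescript
import axios from 'axios';

// Headers mimicking Chrome on Windows (illustrative values only).
const browserHeaders = {
  'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'sec-ch-ua': '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
  'sec-ch-ua-mobile': '?0',
  'sec-ch-ua-platform': '"Windows"',
};

// Present the spider as a real browser instead of axios/x.y.
const client = axios.create({ headers: browserHeaders });
```

Now let's move to the next level.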

I move to level three and run my spider. The spider slows down, and I get an error: too many requests. It is because I'm making a lot of requests from the same IP address. My laptop is sending all requests to the web server from the same IP, and the server is rejecting me because I'm making too many requests. I need many IP addresses to send the requests. That's where I want to introduce a proxy. What is a proxy? A proxy is a system running on the internet. It relays requests to a server, and the server believes the request is coming from the proxy, not the real source. And of course, there are many types of proxies. The first type is the data center proxy. This kind of proxy runs on AWS, Azure, or GCP. It is the first serious kind of proxy that you can use. They are fast and reliable. However, they can be easily identified by antibot solutions.

5. Understanding ISP Proxies

Short description:

IP ranges are associated with autonomous system numbers (ASNs) like Amazon's or Microsoft's. To bypass antibot solutions, ISP proxies can be used. These proxies rent IP addresses registered under clean autonomous system numbers, like those of mobile providers, and mix your activity with that of other users on those IP addresses.

Let me explain. IP ranges are associated with autonomous system numbers (ASNs), and the name of the ASN can be Amazon or Microsoft, so a data center IP can be easily identified by antibot solutions. But there is a trick to get around it, and it is called the ISP proxy: the Internet Service Provider proxy. Let's talk about ISP proxies and how they work. ISP proxies are set up in data centers, but they don't use IP addresses from those data centers. Instead, they rent IP addresses registered under a clean autonomous system number, such as a mobile provider like Verizon. They get a bunch of IP addresses, and the proxy uses one of them. This means that when you are using the proxy, your activity gets mixed with the traffic of those IP addresses, which keeps you hidden.

6. Rotating Residential Proxies and Scrapoxy

Short description:

The last type of proxy is the rotating residential proxy. It uses IP addresses from devices like laptops or mobile phones, which come from real users. Scrapoxy, the super proxy aggregator, is used to manage the proxy strategy. It offers a user-friendly UI and allows easy addition of connectors for different types of proxies. AWS connectors can be created with just a few clicks, and Scrapoxy handles all the installation and management of instances. The proxies can be monitored for status, metrics, real IP address, and geo information.

And there is the last type of proxy: the rotating residential proxy. Its IP addresses come from devices, which can be laptops or mobile phones. How does it work? When a developer wants to earn money from his application, he has three options. First, he can sell subscriptions, like monthly or annual subscriptions which unlock features. Second, he can add advertising, like an ad at the bottom of the application. And third, he can share the bandwidth of the device, of course only with the user's agreement. That's where these IPs come from. This type of proxy is very powerful because the IP addresses come from real users, and of course, there are millions of endpoints available.

Now we will use Scrapoxy, the super proxy aggregator, to manage our proxy strategy. Starting Scrapoxy is very easy: it's just a Docker line to run. So I run that, and in a second, I've got Scrapoxy up and ready. I can go to the UI here and enter my credentials. I've got one project already created. On this project, I can very easily add a new connector. So I go to the marketplace, where I've got a hundred connectors available: data center connectors, ISP connectors, residential connectors, 4G mobile connectors, hardware connectors, whatever. If I want to add an AWS connector, I just click on create and enter my credentials. Scrapoxy will handle all the installation, and start and stop the instances for you. You don't have to manage that anymore.
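For reference, the Docker line mentioned above looks roughly like this. It follows the Scrapoxy getting-started documentation, but double-check the current docs before copying the image name, ports, or environment variables:

```sh
# Port 8888 is the proxy endpoint, 8890 the web UI (per the Scrapoxy docs).
docker run -d -p 8888:8888 -p 8890:8890 \
  -e AUTH_LOCAL_USERNAME=admin \
  -e AUTH_LOCAL_PASSWORD=password \
  -e BACKEND_JWT_SECRET=secret1 \
  -e FRONTEND_JWT_SECRET=secret2 \
  fabienvauchelles/scrapoxy
```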

So let me demo that. I already created an AWS connector. If I start this connector, it very quickly creates different proxies on AWS. Now I've got 10 instances on AWS. I can see their status and metrics, and I also have the real IP address of each proxy here, along with geo information. Here, the proxies are based in Dublin, and I can confirm with the coverage map that every proxy is in Dublin.

7. Integrating Scrapoxy and Moving to Level 5

Short description:

Let's integrate Scrapoxy into the spider by adding the proxy configuration. By using residential networks like ProxySeller, we can avoid the error of data centers being forbidden. After adding the connector to Scrapoxy, I can see all the hotel information being collected by the spider. Now, let's move on to level 5.

So now let's integrate Scrapoxy into the spider. I copy the username and go back to my spider. I can add the proxy here: localhost, port 8888, protocol HTTP, and my credentials. I need the password, so going back there, I copy it and add it too. That's perfect. In Axios, that proxy configuration looks roughly like the sketch below.
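A minimal sketch, assuming an Axios-based spider; the username and password placeholders stand in for the credentials shown in the Scrapoxy project settings:

```typescript
import axios from 'axios';

// Route all spider traffic through the local Scrapoxy endpoint.
// PROJECT_USERNAME / PROJECT_PASSWORD are placeholders.
const client = axios.create({
  proxy: {
    protocol: 'http',
    host: 'localhost',
    port: 8888,
    auth: { username: 'PROJECT_USERNAME', password: 'PROJECT_PASSWORD' },
  },
});

// Every request now leaves through one of Scrapoxy's proxies. For HTTPS
// targets, Scrapoxy intercepts traffic with its own certificate, so you
// may also need to trust that certificate (see the Scrapoxy docs).
async function fetchHomepage() {
  return client.get('https://trekkiereviews.com/');
}
```

With that in place, let's restart the spider.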

So now it's working, and I've got my 50 items. Let's move to the next level, level 4. If I run the spider again, I get a big error: data centers are forbidden. That's because the antibot system detects the autonomous system number, which is AMAZON-02. I need to use a residential network for that. Today, I will use the ProxySeller residential network. They provide data center proxies, ISP proxies, residential proxies, and mobile proxies. I already created ProxySeller credentials in Scrapoxy, and I will add a connector. First, I stop the AWS connector to avoid paying anything more. Then I create one connector: I need 50 proxies based in the US. And they start in a second. Yeah, that's perfect. So I check the proxy list and can see that Scrapoxy has stopped the AWS instances, and I can see the new proxies from the residential network. If I go to the coverage map, I will see all the new proxies in a minute. Here I've got 50 proxies in the US.

So now I can go back to the spider. As you see, I didn't touch the spider; I just added a connector to Scrapoxy. And if I run the spider again, I can see all the hotel information being collected. I've got my 50 items, it's perfect. Now I will move to the next level, level 5.

8. Using Playwright and Handling Fingerprint Errors

Short description:

I encountered a no fingerprint error when running the spider. To solve this, I used Playwright, a headless browser automation tool that lets me send requests with real browser information. By executing the fingerprint request and sending the information to the antibot, I can then make every request very quickly. However, antibot systems also look for consistency, which caused an inconsistent time zone error when moving to the next level.

When I run the spider this time, I get a big error: no fingerprint. To understand that, I need to go back to my browser. On the browser, as you can see, there are a lot of POST requests: the browser is sending a lot of information to the antibot system. So let's check what kind of information I am sending. I'm sending the platform type, the time zone, and the real user agent. We can no longer spoof this kind of information with headers alone.

I need a real browser to send my requests instead of Axios: a browser which can execute JavaScript and be controlled by a script. I will use Playwright. Playwright drives a browser that can run headless or with a visible window, which is helpful for presentations. It can execute JavaScript, and it works with Chrome, Firefox, Edge, and Safari. It is open source and maintained by Microsoft. So let's see how we can adapt our spider; a rough sketch follows below.
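A hedged sketch of the Playwright version, mirroring the same three steps; the selectors and form fields are assumptions about the demo site's markup:

```typescript
import { chromium } from 'playwright';

async function run() {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  // Step 1: homepage. The antibot's fingerprinting scripts execute here.
  await page.goto('https://trekkiereviews.com/');

  // Step 2: submit the search form and collect the hotel links.
  await page.fill('input[name="city"]', 'Paris');
  await page.click('button[type="submit"]');
  const links = await page.$$eval('a.hotel-link', (anchors) =>
    anchors.map((a) => (a as HTMLAnchorElement).href),
  );

  // Step 3: visit each detail page and extract name, email, and rating.
  for (const link of links) {
    await page.goto(link);
    console.log({
      name: (await page.textContent('h1'))?.trim(),
      email: (await page.textContent('.email'))?.trim(),
      rating: (await page.textContent('.rating'))?.trim(),
    });
  }

  await browser.close();
}

run().catch(console.error);
```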

Walking through the Playwright script: it is based on the previous spider and has the same methods. I go to the homepage, post the form to get the list of hotels, then get the details of each hotel and extract names, emails, and ratings. If I run this spider, you will see a browser opening and going to the homepage. It executes the fingerprint request, sending all the information to the antibot, and from then on I can make every request very quickly. As you see, we cannot see the pages, because it only downloads content without rendering it. So I've got my 50 items. But of course, antibot systems are not only checking fingerprint information; they are also checking consistency.

So if I move to the next level, level 6, and run the Playwright spider again, the spider connects to the homepage and sends the fingerprint. But when I execute the other requests, I get a big error: inconsistent time zone. It happens because we are sending the real time zone of the browser.

9. Adjusting Time Zone and Completing Requests

Short description:

The antibot checks the real time zone, which exposed an inconsistency between the browser's time zone and the proxies' IP addresses. By changing the browser time zone to America/Chicago, I was able to execute all requests successfully using Playwright. Thank you!

So the antibot sees the browser's real time zone, which is Europe/Paris, while we are using IP addresses from the US. There are several hours of difference between the two time zones, and I need to correct that. I cannot correct the IP address's time zone, but I can correct the time zone of the browser. To do that, I go to the settings here and change the time zone ID to America/Chicago.
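In Playwright, the reported time zone is a browser-context option; a minimal sketch:

```typescript
import { chromium } from 'playwright';

async function launchWithTimezone() {
  const browser = await chromium.launch();
  // Report America/Chicago to match the US residential proxies, so the
  // browser time zone is consistent with the IP geolocation.
  const context = await browser.newContext({ timezoneId: 'America/Chicago' });
  return context.newPage();
}
```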

If I run the spider again, it connects to the homepage and sends the fingerprint information. This time, the IP time zone is consistent with the browser time zone, and I can execute all the requests. As you see, every request is executed by Playwright, and I've got my 50 items.

That's all for me. Thank you very much. Download Scrapoxy and join Wiremind.
