## Difference between Web Crawlers and Web Scrapers
Feature | Web Scraper | Web Crawler |
---|---|---|
Purpose | Extracts specific data from websites | Navigates and indexes web content |
Functionality | Focuses on a specific set of data | Explores the web broadly |
Use Case | Data harvesting, analysis | Search engine indexing, SEO analysis |
Data Handling | Extracts and processes targeted data | Collects data from many sources |
Complexity | Can be complex depending on the data | Generally simpler in design |
Speed | Varies based on the data complexity | Usually faster at covering more ground |
Customization | Highly customizable for data needs | Less need for customization |
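
The contrast in the table can be sketched with a toy example. This is only an illustration: the `pages` object, its paths, and the helper names `crawl` and `scrapePrice` are made up, and the regular expressions stand in for a real HTML parser.

```typescript
// A made-up, in-memory "website": path -> HTML. No network access is involved.
const pages: Record<string, string> = {
  "/": '<a href="/a">A</a> <a href="/b">B</a>',
  "/a": '<a href="/b">B</a> <span class="price">$10</span>',
  "/b": '<a href="/">Home</a> <span class="price">$25</span>',
};

// Crawler: start from one page and follow links breadth-first,
// recording every page it reaches.
function crawl(start: string): string[] {
  const visited: string[] = [start];
  const queue: string[] = [start];
  while (queue.length > 0) {
    const html = pages[queue.shift() as string] || "";
    const linkPattern = /href="([^"]+)"/g;
    let match = linkPattern.exec(html);
    while (match !== null) {
      const link = match[1];
      if (visited.indexOf(link) === -1) {
        visited.push(link);
        queue.push(link);
      }
      match = linkPattern.exec(html);
    }
  }
  return visited;
}

// Scraper: extract one targeted field from one known page.
function scrapePrice(path: string): string | null {
  const match = /class="price">([^<]+)</.exec(pages[path] || "");
  return match ? match[1] : null;
}

console.log(crawl("/"));        // every page reachable from "/"
console.log(scrapePrice("/a")); // only the price on one page
```

The crawler cares about coverage (which pages exist), while the scraper cares about one specific value on a page it already knows.
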
- Puppeteer
- Cheerio
- Bright Data
- Websites can block you through IP blocking and rate limiting if you send too many requests
- Dynamic content
- Traditional web scrapers are not able to handle dynamic content
- Complex navigation is not always possible
- IP rotation is not possible
- CAPTCHAs
- Human interpretation while scraping
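
One common way to reduce the risk of IP blocking and rate limiting is to cap how many requests you send and to wait longer after each failure. The sketch below is an assumption-laden illustration, not part of the project: `RequestBudget` and `backoffDelayMs` are hypothetical helpers, and the limits are arbitrary.

```typescript
// Sliding-window budget: at most `maxRequests` in any `windowMs` milliseconds.
class RequestBudget {
  private maxRequests: number;
  private windowMs: number;
  private sentAt: number[] = [];

  constructor(maxRequests: number, windowMs: number) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  // Returns true and records the request if one may be sent at time `now` (ms).
  tryAcquire(now: number): boolean {
    // Forget requests that have fallen out of the window.
    this.sentAt = this.sentAt.filter((t) => now - t < this.windowMs);
    if (this.sentAt.length >= this.maxRequests) return false;
    this.sentAt.push(now);
    return true;
  }
}

// Exponential backoff: the delay doubles with each attempt, up to `capMs`.
function backoffDelayMs(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

const budget = new RequestBudget(2, 1000); // 2 requests per second (arbitrary)
console.log(budget.tryAcquire(Date.now()));
console.log(backoffDelayMs(3, 500, 8000));
```

Passing the current time in as `now` keeps the limiter easy to test without real waiting.
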
- Develop the UI
- Create actions in `/lib/actions`:
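```ts
// Assumes the scraper module sits next to this one, at lib/scraper.
import { scrapeAmazonProduct } from "../scraper";

// Scrapes the product page at `productUrl`; storing the result comes later.
export async function scrapeAndStoreProduct(productUrl: string) {
  if (!productUrl) return;

  try {
    const scrapedProduct = await scrapeAmazonProduct(productUrl);
  } catch (error: any) {
    console.log(error);
  }
}
```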
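
The action only checks that `productUrl` is non-empty. A stricter guard could reject anything that is not an Amazon link before scraping. The following is a minimal sketch, assuming a hypothetical helper named `isValidAmazonProductUrl`; the hostname pattern is a loose approximation, not an exhaustive list of Amazon domains.

```typescript
// Loose check: accepts amazon.com, amazon.<2-3 letter TLD>, amazon.co.xx and
// amazon.com.xx, each with optional subdomains such as "www.".
function isValidAmazonProductUrl(input: string): boolean {
  try {
    const hostname = new URL(input).hostname;
    return /(^|\.)amazon\.(com|[a-z]{2,3}|co\.[a-z]{2}|com\.[a-z]{2})$/.test(hostname);
  } catch {
    // `new URL` throws when the input is not a parseable absolute URL.
    return false;
  }
}

console.log(isValidAmazonProductUrl("https://www.amazon.com/dp/B000000000"));
console.log(isValidAmazonProductUrl("not a url"));
```

Calling this in the UI or at the top of the action avoids spending proxy requests on links that cannot be scraped.
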
- Install the packages axios (`npm i axios`) and cheerio (`npm i cheerio`)
- Make the scraper function
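in `/lib/scraper`:
```ts
"use server";

import axios from "axios";
import * as cheerio from "cheerio";

export async function scrapeAmazonProduct(url: string) {
  if (!url) return;

  // Bright Data proxy credentials, read from environment variables.
  const username = String(process.env.BRIGHT_DATA_USERNAME);
  const password = String(process.env.BRIGHT_DATA_PASSWORD);
  const port = 22225;
  // Random session id appended to the proxy username.
  const session_id = (100000 * Math.random()) | 0;

  const options = {
    // axios reads proxy settings from its `proxy` option. A top-level `auth`
    // would instead be sent to the target site as HTTP Basic credentials.
    proxy: {
      protocol: "http", // assumption: the proxy endpoint speaks plain HTTP
      host: "brd.superproxy.io",
      port,
      auth: {
        username: `${username}-session-${session_id}`,
        password,
      },
    },
  };

  try {
    // Fetch the product page and log the raw HTML.
    const response = await axios.get(url, options);
    console.log(response.data);
  } catch (error: any) {
    console.log(error);
  }
}
```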
Now, after pasting the link of an Amazon product into the search bar, the scraped HTML should be displayed in the console.
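
The scraper appends a random `session_id` to the proxy username, presumably so that each call is treated as a separate proxy session. The sketch below isolates that step; `buildProxyAuth` and the `ProxyAuth` type are hypothetical names, and the credentials are placeholders.

```typescript
interface ProxyAuth {
  username: string;
  password: string;
}

// `random` is injectable so the result can be made deterministic in tests.
function buildProxyAuth(
  username: string,
  password: string,
  random: () => number = Math.random
): ProxyAuth {
  // Same expression as in the scraper: an integer in [0, 99999].
  const sessionId = (100000 * random()) | 0;
  return { username: `${username}-session-${sessionId}`, password };
}

const first = buildProxyAuth("demo-user", "demo-pass");
const second = buildProxyAuth("demo-user", "demo-pass");
// The two usernames usually end in different session suffixes.
console.log(first.username, second.username);
```
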
- Set up `cheerio` to parse the scraped HTML content
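
Once the HTML is loaded, parsing mostly means looking up selectors and cleaning the text. The sketch below keeps that logic independent of cheerio by taking a lookup function. The names `extractProduct`, `parsePrice` and `TextLookup` are hypothetical, and the selectors `#productTitle` and `.a-price .a-offscreen` are assumptions about Amazon's markup that may not match the live page.

```typescript
// Returns the text content for a CSS selector ("" if nothing matches).
type TextLookup = (selector: string) => string;

interface ParsedProduct {
  title: string;
  price: number | null;
}

// Turns text such as "$1,299.99" into 1299.99; returns null if no number is found.
function parsePrice(text: string): number | null {
  const cleaned = text.replace(/[^\d.]/g, "");
  if (cleaned === "") return null;
  const value = Number(cleaned);
  return isFinite(value) ? value : null;
}

function extractProduct(textOf: TextLookup): ParsedProduct {
  return {
    title: textOf("#productTitle").trim(),
    price: parsePrice(textOf(".a-price .a-offscreen")),
  };
}

// With cheerio, the lookup could be built from the scraped HTML like this:
//   const $ = cheerio.load(response.data);
//   const product = extractProduct((selector) => $(selector).first().text());

// Stand-in lookup so the sketch runs without cheerio or a network call.
const sampleText: Record<string, string> = {
  "#productTitle": "  Example Widget  ",
  ".a-price .a-offscreen": "$1,299.99",
};
console.log(extractProduct((selector) => sampleText[selector] || ""));
```

Using `.first()` in the cheerio lookup matters because `.text()` on several matched elements concatenates their text, which would corrupt the price.
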