Steps to develop this web scraper

## Difference between web crawlers and web scrapers

| Feature | Web Scraper | Web Crawler |
| --- | --- | --- |
| Purpose | Extracts specific data from websites | Navigates and indexes web content |
| Functionality | Focuses on a specific set of data | Explores the web broadly |
| Use Case | Data harvesting, analysis | Search engine indexing, SEO analysis |
| Data Handling | Extracts and processes targeted data | Collects data from many sources |
| Complexity | Can be complex depending on the data | Generally simpler in design |
| Speed | Varies based on the data complexity | Usually faster at covering more ground |
| Customization | Highly customizable for data needs | Less need for customization |
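
To make the distinction concrete, here is a minimal sketch that runs against a tiny in-memory "site". The `pages` data and the `crawl` and `scrape` function names are made up for illustration and are not part of this project. The crawler follows links to discover pages; the scraper reads specific fields from one known page.

```typescript
// A tiny in-memory "web": each URL maps to a page with a title and outgoing links.
// These pages are made-up sample data, not real sites.
type Page = { title: string; price?: string; links: string[] };

const pages: { [url: string]: Page } = {
  "/": { title: "Home", links: ["/a", "/b"] },
  "/a": { title: "Product A", price: "$10", links: ["/b"] },
  "/b": { title: "Product B", price: "$25", links: ["/"] },
};

// Crawler: starts from one URL and follows links breadth-first,
// returning every reachable URL exactly once, in discovery order.
export function crawl(start: string): string[] {
  const visited: string[] = [start];
  const queue: string[] = [start];
  while (queue.length > 0) {
    const url = queue.shift() as string;
    const page = pages[url];
    const links = page ? page.links : [];
    for (const link of links) {
      if (visited.indexOf(link) === -1) {
        visited.push(link);
        queue.push(link);
      }
    }
  }
  return visited;
}

// Scraper: looks at one known page and pulls out specific fields.
export function scrape(url: string): { title: string; price: string | null } | null {
  const page = pages[url];
  if (!page) return null;
  return { title: page.title, price: page.price !== undefined ? page.price : null };
}
```

Calling `crawl("/")` discovers every page reachable from the home page, while `scrape("/a")` returns only the title and price of that single page.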

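## Open Source Web Scrapers

- Puppeteer: a Node.js library for controlling headless Chrome.
- Cheerio: an HTML parser for Node.js with a jQuery-like API.
- Bright Data: a commercial proxy and scraping platform rather than an open-source library; it is used here as the proxy provider.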

## Problems

- Websites can block you through IP blocking and rate limiting if you send too many requests.
- Traditional web scrapers cannot handle dynamic content (content rendered by JavaScript after the page loads).
- Complex navigation is not always possible.
- IP rotation is not available out of the box.
- CAPTCHAs.
- Some content needs human interpretation while scraping.
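
One common way to soften the rate-limiting problem is to wait longer after each failed attempt (exponential backoff). The sketch below only computes the waiting times; the function names, the default delays, and the choice to treat HTTP 429 and 503 as "slow down" signals are assumptions for illustration, not something this project implements.

```typescript
// Delay before retry number `attempt` (0-based): doubles each time, capped at maxMs.
export function backoffDelayMs(attempt: number, baseMs: number = 500, maxMs: number = 8000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Assumption: 429 (Too Many Requests) and 503 (Service Unavailable)
// are treated as signals to back off and retry later.
export function isRateLimited(status: number): boolean {
  return status === 429 || status === 503;
}

// Full list of waits for a given number of retries,
// e.g. to log the plan or to drive setTimeout between requests.
export function retrySchedule(retries: number, baseMs: number = 500, maxMs: number = 8000): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    delays.push(backoffDelayMs(attempt, baseMs, maxMs));
  }
  return delays;
}
```

A caller would check `isRateLimited(response.status)` after each request and, if it is true, wait `backoffDelayMs(attempt)` milliseconds before trying again.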

## Steps to develop this web scraper

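Develop the UI.

Create the server action in `/lib/actions`:

    // Import path is an assumption; point it at the scraper module in /lib/scraper.
    import { scrapeAmazonProduct } from "../scraper";

    export async function scrapeAndStoreProduct(productUrl: string) {
      // Nothing to do without a URL.
      if (!productUrl) return;

      try {
        // Storing the result is not implemented yet; for now the scraper only logs the HTML.
        const scrapedProduct = await scrapeAmazonProduct(productUrl);
      } catch (error: any) {
        console.log(error);
      }
    }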
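
The action above only checks that `productUrl` is non-empty. A stricter check could reject anything that is not an Amazon link before a request is made. The following is a minimal sketch: `isAmazonUrl` is a hypothetical helper, and the list of domain suffixes is illustrative, not complete.

```typescript
// Returns true when `input` is an http(s) URL whose hostname is an Amazon
// domain or a subdomain of one. The suffix list is a small illustrative sample.
export function isAmazonUrl(input: string): boolean {
  try {
    const parsed = new URL(input);
    if (parsed.protocol !== "https:" && parsed.protocol !== "http:") return false;
    const hostname = parsed.hostname.toLowerCase();
    // Accepts "amazon.com" and "www.amazon.in"; rejects "notamazon.com" and "amazon.evil.com".
    return /(^|\.)amazon\.(com|in|co\.uk|de|ca)$/.test(hostname);
  } catch {
    // new URL() throws on strings that are not absolute URLs.
    return false;
  }
}
```

The action could then return early with `if (!isAmazonUrl(productUrl)) return;` in place of the plain emptiness check.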

Install axios and cheerio: `npm i axios cheerio`.

Create the scraper function in `/lib/scraper`:

    "use server";

    import axios from "axios";
    import * as cheerio from "cheerio";

    export async function scrapeAmazonProduct(url: string) {
      if (!url) return;

      // Bright Data proxy configuration
      const username = String(process.env.BRIGHT_DATA_USERNAME);
      const password = String(process.env.BRIGHT_DATA_PASSWORD);
      const port = 22225;
      // Random integer in [0, 100000), appended to the username as a session ID
      const session_id = (100000 * Math.random()) | 0;
      const options = {
        auth: {
          username: `${username}-session-${session_id}`,
          password,
        },
        host: "brd.superproxy.io",
        port,
        rejectUnauthorized: false,
      };

      try {
        // Fetch the product page and log the raw HTML
        const response = await axios.get(url, options);
        console.log(response.data);
      } catch (error: any) {
        console.log(error);
      }
    }
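
One detail of the scraper above is that `String(process.env.X)` yields the literal text `"undefined"` when a variable is missing, so a misconfigured `.env` fails only later at the proxy. A possible guard is sketched below; `buildProxyAuth` is a hypothetical helper that takes the environment as a parameter, and it reuses the `<username>-session-<id>` format from the code above.

```typescript
type Env = { [key: string]: string | undefined };

// Builds the proxy credentials and fails fast when a variable is missing.
export function buildProxyAuth(env: Env, sessionId: number): { username: string; password: string } {
  const username = env.BRIGHT_DATA_USERNAME;
  const password = env.BRIGHT_DATA_PASSWORD;
  if (!username || !password) {
    throw new Error("BRIGHT_DATA_USERNAME and BRIGHT_DATA_PASSWORD must be set");
  }
  // Same username format as the scraper: "<username>-session-<id>".
  return { username: username + "-session-" + sessionId, password: password };
}
```

In the scraper this would be called as `buildProxyAuth(process.env, session_id)` and the result passed as the `auth` option.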

After pasting an Amazon product link into the search bar, the scraped HTML should be printed to the console.

Set up cheerio to parse the scraped HTML content.
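
As a rough sketch of that parsing step, the function below loads an HTML string with `cheerio.load` and reads a title and a price. The `parseProduct` name and the `#productTitle` and `.a-price .a-offscreen` selectors are assumptions about the page markup; check them against the HTML the scraper actually logs. The price parsing also assumes a dot as the decimal separator.

```typescript
import * as cheerio from "cheerio";

export type ParsedProduct = { title: string; price: number | null };

// Parses product fields out of an HTML string.
// The selectors are assumptions about the page markup, not confirmed values.
export function parseProduct(html: string): ParsedProduct {
  const $ = cheerio.load(html);
  const title = $("#productTitle").text().trim();
  // Take the first match only; a page may repeat the price several times.
  const priceText = $(".a-price .a-offscreen").first().text().trim();
  // Strip currency symbols and thousands separators: "$1,299.99" -> "1299.99".
  const digits = priceText.replace(/[^0-9.]/g, "");
  const parsed = parseFloat(digits);
  return { title: title, price: isNaN(parsed) ? null : parsed };
}
```

Inside `scrapeAmazonProduct`, this could replace the `console.log(response.data)` call with `parseProduct(response.data)`.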

CodeMaster17/spider-sense

Folders and files

Latest commit

History

Repository files navigation

Open Source Web Scrapers

Problems

Steps to develop this web scrapper

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages