
How to Scrape a Website With Python: A Complete Guide

Learn web scraping with Python the right way — from your first request to handling JavaScript-heavy sites, pagination, headers, rate limits, and saving data to CSV or a database.

June 1, 2026 · 12 min read

Introduction

Web scraping is one of the highest-leverage skills a Python developer can learn. Pricing data, job listings, real estate, sports stats, product reviews — most of the internet is structured data that nobody bothered to expose as an API. With 50 lines of Python, you can collect in seconds what would take a human days.

This guide covers everything you need to scrape professionally:

  1. Setup and the rules you must follow
  2. Your first scraper with requests + BeautifulSoup
  3. Handling JavaScript-heavy sites with Playwright
  4. Pagination, headers, rate limits, and other real-world problems
  5. Saving scraped data (CSV, JSON, SQLite)
  6. A full real-world example you can run today

Every code block is meant to run as written once you swap in your own URLs and selectors.

Setup: Install What You Need

The two libraries you'll use for 90% of scraping work:

pip install requests beautifulsoup4 lxml

For JavaScript-rendered sites, add Playwright (better than Selenium in 2026):

pip install playwright
playwright install chromium

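For saving data, the standard library's csv, json, and sqlite3 modules are all this guide uses. pandas is optional, for when you want to analyze the results afterwards:

pip install pandas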

That's the entire stack. You don't need Scrapy or any heavyweight framework for most projects.

The Rules (Don't Skip This)

Scraping carelessly can get your IP banned, your account suspended, or, in rare cases, get you sued. Three rules keep you safe:

Rule | Why
Check /robots.txt | Tells you what's off-limits (e.g., example.com/robots.txt)
Add delays between requests | Hammering a server is rude and triggers blocks
Set a real User-Agent | Lets the site identify you; some block default Python UAs

Also: never scrape personal user data, paywalled content, or anything behind a login unless you own that login. Data that is publicly displayed to everyone is generally the lowest-risk target, but the site's terms of service and your local law still apply.
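The robots.txt rule can be checked in code with the standard library's urllib.robotparser. The sketch below parses a hypothetical robots.txt string so it runs offline; the file contents and the "LearningScraper" agent name are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs without a network call.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(SAMPLE_ROBOTS_TXT)
agent = "LearningScraper"  # assumed bot name; match it to your own User-Agent

allowed_public = parser.can_fetch(agent, "https://example.com/catalogue/page-1.html")
allowed_private = parser.can_fetch(agent, "https://example.com/private/accounts.html")
delay = parser.crawl_delay(agent)

print(allowed_public, allowed_private, delay)
```

Against a live site you would call parser.set_url("https://example.com/robots.txt") followed by parser.read() instead of passing text in, then skip any URL for which can_fetch() returns False.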

1. Your First Scraper

Let's scrape book titles and prices from books.toscrape.com — a sandbox site built for practicing:

import requests
from bs4 import BeautifulSoup

def scrape_books(url: str) -> list[dict]:
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; LearningScraper/1.0)"
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

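    # Pass raw bytes so BeautifulSoup detects the encoding itself. When the server
    # sends no charset, response.text assumes ISO-8859-1 and "£" becomes "Â£".
    soup = BeautifulSoup(response.content, "lxml")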
    books = []

    for article in soup.select("article.product_pod"):
        title = article.select_one("h3 a")["title"]
        price = article.select_one(".price_color").get_text(strip=True)
        availability = article.select_one(".availability").get_text(strip=True)

        books.append({
            "title": title,
            "price": price,
            "availability": availability,
        })

    return books


if __name__ == "__main__":
    data = scrape_books("https://books.toscrape.com/")
    for book in data[:5]:
        print(book)
    print(f"\nTotal: {len(data)} books scraped")

Three concepts to understand:

Concept | What it does
requests.get() | Downloads the HTML of the page
BeautifulSoup(html, "lxml") | Parses the HTML into a searchable tree
soup.select(...) | Finds elements using CSS selectors

You don't need to learn regex or XPath. CSS selectors (the same ones you use in CSS files) handle 99% of scraping.
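The scraper above stores each price exactly as displayed, for example "£51.77". If you plan to sort or compare prices, a small conversion step helps. This is a minimal sketch, assuming a dot as the decimal separator and commas only as thousands separators; parse_price is a hypothetical helper name.

```python
from decimal import Decimal, InvalidOperation

def parse_price(text: str) -> Decimal:
    """Convert a displayed price such as '£51.77' into a Decimal."""
    # Keep digits and the decimal point; drop currency symbols, commas, spaces.
    cleaned = "".join(ch for ch in text if ch.isdigit() or ch == ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"No usable number in {text!r}") from None

print(parse_price("£51.77"))
print(parse_price("$1,234.50"))
```

Decimal avoids the rounding surprises of float when you later add or compare money values.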

2. Finding the Right Selectors

The biggest beginner question: how do I know which selector to use?

Use your browser's DevTools. Right-click any element on the page → Inspect. The HTML for that element opens in the panel. Look at:

  • The tag (<h3>, <div>, <span>)
  • The class (class="price_color")
  • The id (id="main")

Then translate to CSS:

soup.select_one("h3")               # first <h3>
soup.select_one(".price_color")     # first element with class price_color
soup.select_one("#main")            # element with id="main"
soup.select_one("article.product_pod h3 a")  # nested — <a> inside <h3> inside article.product_pod
soup.select("a.btn")                # ALL <a class="btn"> (returns a list)

Pro tip: in Chrome DevTools → right-click an element → Copy → Copy selector. You get the CSS path instantly. Then simplify it.

3. Handling JavaScript-Heavy Sites

requests only downloads the raw HTML. If a site renders content with JavaScript (React, Vue, Angular), your scraper will see an empty <div> where the data should be.

The fix: use Playwright to run a real browser, then scrape the rendered HTML.

from playwright.sync_api import sync_playwright

def scrape_dynamic_page(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        # Wait for a specific element to be present
        page.wait_for_selector(".product-card", timeout=10000)

        html = page.content()
        browser.close()
        return html


def parse_products(html: str) -> list[dict]:
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, "lxml")
    products = []

    for card in soup.select(".product-card"):
        products.append({
            "name": card.select_one(".name").get_text(strip=True),
            "price": card.select_one(".price").get_text(strip=True),
        })
    return products


if __name__ == "__main__":
    html = scrape_dynamic_page("https://example-shop.com/products")
    products = parse_products(html)
    print(f"Scraped {len(products)} products")

When to use Playwright vs requests:

Situation | Use
Page source already contains the data (view-source: shows it) | requests + BeautifulSoup
Data appears only after JavaScript runs | Playwright
Site has anti-bot protection (Cloudflare, etc.) | Playwright with headless=False
You need to click buttons, fill forms, scroll | Playwright

Playwright is 10–50× slower than requests because it launches a real browser. Use it only when you must.
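One quick way to decide between the two is to check whether text you can see in the browser already appears in the raw HTML. The sketch below does this on two hypothetical HTML strings; in practice you would pass in response.text from requests.

```python
def data_in_raw_html(raw_html: str, visible_text: str) -> bool:
    """Return True if text seen in the browser is already in the raw HTML."""
    return visible_text.casefold() in raw_html.casefold()

# Hypothetical server-rendered page: the title is in the markup.
static_page = '<article class="product_pod"><h3><a title="A Light in the Attic">A Light</a></h3></article>'
# Hypothetical JavaScript app shell: only an empty mount point.
js_page = '<div id="root"></div><script src="/static/app.js"></script>'

print(data_in_raw_html(static_page, "A Light in the Attic"))
print(data_in_raw_html(js_page, "A Light in the Attic"))
```

If the check fails for text you can clearly see on the page, the content is probably rendered by JavaScript and Playwright is the safer choice.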

4. Pagination — Scraping Multiple Pages

Most real-world scrapes need many pages. Three common patterns:

Pattern A: Page numbers in URL

import time

def scrape_all_pages(base_url: str, total_pages: int) -> list[dict]:
    all_items = []
    for page in range(1, total_pages + 1):
        url = f"{base_url}?page={page}"
        items = scrape_books(url)   # from earlier
        all_items.extend(items)
        print(f"Page {page}: {len(items)} items (total: {len(all_items)})")
        time.sleep(1)  # be polite
    return all_items
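Appending "?page=N" directly breaks when the base URL already has a query string. A possible alternative, sketched with urllib.parse, sets the page parameter while keeping other parameters; page_url is a hypothetical helper and the example.com URLs are placeholders.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def page_url(base_url: str, page: int, param: str = "page") -> str:
    """Return base_url with its page parameter set, keeping other parameters."""
    parts = urlsplit(base_url)
    # Note: dict() keeps only the last value if a parameter is repeated.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query[param] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))

print(page_url("https://example.com/items", 1))
print(page_url("https://example.com/search?q=python", 2))
print(page_url("https://example.com/items?page=1&sort=asc", 3))
```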

Pattern B: "Next" button until it disappears

import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def scrape_with_next_button(start_url: str) -> list[dict]:
    all_items = []
    url = start_url

    while url:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "lxml")

        # parse_items() is your own function that turns one page into a list of dicts
        all_items.extend(parse_items(soup))

        next_link = soup.select_one("li.next a")
        # urljoin resolves a relative href against the page it was found on
        url = urljoin(url, next_link["href"]) if next_link else None
        time.sleep(1)

    return all_items
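Relative "next" links are easy to get wrong, because the same href means different things depending on the page it appears on. The standard library's urljoin resolves them the way a browser would. The href values below are assumed examples of what a paginated catalogue might use.

```python
from urllib.parse import urljoin

home = "https://books.toscrape.com/index.html"

# A link on the home page that points into a subdirectory.
page_2 = urljoin(home, "catalogue/page-2.html")

# A link on page 2 that is relative to the /catalogue/ directory.
page_3 = urljoin(page_2, "page-3.html")

# Root-relative and absolute hrefs are handled too.
back_home = urljoin(page_3, "/index.html")
external = urljoin(page_3, "https://example.com/other.html")

print(page_2)
print(page_3)
print(back_home)
print(external)
```

Always pass the URL of the page you just fetched as the first argument, not the site root.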

Pattern C: Infinite scroll (use Playwright)

from playwright.sync_api import sync_playwright

def scrape_infinite_scroll(url: str, max_scrolls: int = 10) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")

        for _ in range(max_scrolls):
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            page.wait_for_timeout(1500)

        html = page.content()
        browser.close()
        return html
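A fixed number of scrolls either wastes time or stops too early. One possible refinement is to stop as soon as the page height stops growing. The sketch below keeps the loop independent of Playwright by taking three callables; scroll_until_stable and FakePage are hypothetical names, and FakePage only simulates a page so the example runs without a browser.

```python
from typing import Callable

def scroll_until_stable(
    get_height: Callable[[], int],
    scroll_to_bottom: Callable[[], None],
    pause: Callable[[], None],
    max_scrolls: int = 10,
) -> int:
    """Scroll until the page height stops growing; return scrolls performed."""
    last_height = get_height()
    for scrolls_done in range(1, max_scrolls + 1):
        scroll_to_bottom()
        pause()  # give the page time to load more items
        new_height = get_height()
        if new_height == last_height:
            return scrolls_done  # nothing new was loaded
        last_height = new_height
    return max_scrolls

class FakePage:
    """Stand-in for a browser page: grows twice, then stops loading."""
    def __init__(self) -> None:
        self.heights = [1000, 2000, 3000, 3000]
        self.index = 0

    def height(self) -> int:
        return self.heights[self.index]

    def scroll(self) -> None:
        self.index = min(self.index + 1, len(self.heights) - 1)

fake = FakePage()
scrolls = scroll_until_stable(fake.height, fake.scroll, lambda: None)
print(scrolls)
```

With Playwright you could pass lambda: page.evaluate("document.body.scrollHeight") as get_height, lambda: page.evaluate("window.scrollTo(0, document.body.scrollHeight)") as scroll_to_bottom, and lambda: page.wait_for_timeout(1500) as pause.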

5. Headers, Sessions, and Avoiding Blocks

Real websites can tell you're a bot if you don't fake being a real browser. Here's how to look human:

import requests

def make_session() -> requests.Session:
    session = requests.Session()
    session.headers.update({
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        # Only advertise "br" if the brotli package is installed;
        # otherwise requests cannot decode a Brotli-compressed response.
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    })
    return session


session = make_session()
response = session.get("https://example.com")

A Session keeps cookies between requests — useful for sites that set a session cookie on the first visit.

If you're still getting blocked, escalate in this order:

1. Add random delays:     time.sleep(random.uniform(2, 5))
2. Rotate User-Agents:    fake_useragent library
3. Use proxies:           BrightData, ScraperAPI, Smartproxy
4. Use Playwright:        looks like a real browser
5. Use a scraping API:    ScrapingBee, ZenRows handle everything for you
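Step 1 of that list can be wrapped in a small helper so every request goes through the same randomized pause. This is a minimal sketch; Throttle is a hypothetical class name, and the tiny delays in the demo are only there so it finishes quickly.

```python
import random
import time

class Throttle:
    """Enforce a randomized minimum gap between consecutive requests."""

    def __init__(self, min_delay: float = 2.0, max_delay: float = 5.0) -> None:
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._last_request = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return seconds slept."""
        slept = 0.0
        if self._last_request is not None:
            gap = random.uniform(self.min_delay, self.max_delay)
            elapsed = time.monotonic() - self._last_request
            slept = max(0.0, gap - elapsed)
            if slept > 0:
                time.sleep(slept)
        self._last_request = time.monotonic()
        return slept

# Demo with very short delays; real scraping would use the 2 to 5 second defaults.
throttle = Throttle(min_delay=0.01, max_delay=0.02)
first_pause = throttle.wait()   # first call never sleeps
second_pause = throttle.wait()  # second call sleeps for the remaining gap
print(first_pause, second_pause)
```

Call throttle.wait() immediately before each session.get(...). Time already spent parsing the previous page counts toward the gap, so the scraper does not sleep longer than it needs to.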

6. Rate Limiting and Retries

A polite scraper handles failures gracefully:

import time
import random
import requests
from requests.exceptions import RequestException

def fetch_with_retry(url: str, max_retries: int = 3) -> str | None:
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            time.sleep(random.uniform(1, 3))   # be polite
            return response.text
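        except RequestException as e:
            if attempt == max_retries - 1:
                break  # no attempts left, so don't sleep again
            wait = 2 ** attempt
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {wait}s…")
            time.sleep(wait)
    print(f"Gave up on {url} after {max_retries} attempts")
    return None

The 2 ** attempt is exponential backoff: the wait doubles after each failure (1s, then 2s, then 4s if you allow a fourth attempt). Doubling gives a struggling server progressively more room to recover instead of hitting it again at a fixed pace.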
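Plain doubling grows quickly, and many scrapers retrying in lockstep can hit a server at the same moments. A common refinement is to cap the delay and add a little random jitter. The sketch below shows one way to compute such delays; backoff_delay and its parameter names are hypothetical.

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  jitter: float = 0.0) -> float:
    """Return the wait in seconds before retry number `attempt` (0-based)."""
    delay = min(cap, base * 2 ** attempt)      # double each time, up to the cap
    return delay + random.uniform(0, jitter)   # spread retries out a little

schedule = [backoff_delay(n) for n in range(6)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]

print(backoff_delay(2, jitter=0.5))  # somewhere between 4.0 and 4.5
```

Inside a retry loop you would pass the result to time.sleep() in place of the bare 2 ** attempt.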

7. Saving Your Data

Three formats cover almost everything.

CSV (best for spreadsheets)

import csv

def save_to_csv(data: list[dict], filename: str):
    if not data:
        return
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)

JSON (best for nested data)

import json

def save_to_json(data: list[dict], filename: str):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

SQLite (best for big or repeated scrapes)

import sqlite3

def save_to_sqlite(data: list[dict], db_path: str = "scraped.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS books (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT,
            price TEXT,
            availability TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.executemany(
        "INSERT INTO books (title, price, availability) VALUES (?, ?, ?)",
        [(b["title"], b["price"], b["availability"]) for b in data],
    )
    conn.commit()
    conn.close()
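For repeated scrapes, plain INSERT adds the same books again on every run. One possible approach is a uniqueness constraint combined with INSERT OR IGNORE. The sketch below assumes the title identifies a book, which may not hold on every site; the books_unique table and save_unique function are hypothetical names, and the sample rows are made up.

```python
import sqlite3

def save_unique(conn: sqlite3.Connection, data: list[dict]) -> int:
    """Insert books, skipping titles already stored; return rows added."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS books_unique (
            title TEXT PRIMARY KEY,
            price TEXT,
            availability TEXT
        )
    """)
    count_sql = "SELECT COUNT(*) FROM books_unique"
    before = conn.execute(count_sql).fetchone()[0]
    # Rows whose title already exists are silently skipped.
    conn.executemany(
        "INSERT OR IGNORE INTO books_unique (title, price, availability) "
        "VALUES (:title, :price, :availability)",
        data,
    )
    conn.commit()
    return conn.execute(count_sql).fetchone()[0] - before

# Made-up sample rows and an in-memory database so the example leaves no file behind.
sample = [
    {"title": "Book A", "price": "£10.00", "availability": "In stock"},
    {"title": "Book B", "price": "£12.50", "availability": "In stock"},
]
conn = sqlite3.connect(":memory:")
first_run = save_unique(conn, sample)
second_run = save_unique(conn, sample)
print(first_run, second_run)
conn.close()
```

If you want to track price changes over time instead, keep the original append-only table and query the latest row per title.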

8. A Complete Real-World Example

Putting it all together: a scraper that gets all books from books.toscrape.com, follows pagination, retries failed requests, and saves the results to CSV:

import csv
import time
import random
import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException

BASE = "https://books.toscrape.com/"

def make_session() -> requests.Session:
    s = requests.Session()
    s.headers.update({"User-Agent": "Mozilla/5.0 LearningScraper/1.0"})
    return s

def fetch(session: requests.Session, url: str, retries: int = 3) -> str | None:
    for attempt in range(retries):
        try:
            r = session.get(url, timeout=10)
            r.raise_for_status()
            # With no charset header, requests assumes ISO-8859-1 and "£" becomes "Â£"
            r.encoding = "utf-8"
            return r.text
        except RequestException:
            time.sleep(2 ** attempt)
    return None

def parse_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    books = []
    for article in soup.select("article.product_pod"):
        books.append({
            "title": article.select_one("h3 a")["title"],
            "price": article.select_one(".price_color").get_text(strip=True),
            "availability": article.select_one(".availability").get_text(strip=True),
        })
    return books

def scrape_all(session: requests.Session) -> list[dict]:
    books = []
    url = BASE
    page = 1
    while url:
        html = fetch(session, url)
        if not html:
            break

        books.extend(parse_page(html))
        print(f"Scraped page {page}: {len(books)} books so far")

        soup = BeautifulSoup(html, "lxml")
        next_link = soup.select_one("li.next a")
        if not next_link:
            break

        # The home page links to "catalogue/page-2.html", while later pages link
        # to "page-3.html" relative to /catalogue/, so normalise before joining.
        href = next_link["href"]
        if not href.startswith("catalogue/"):
            href = f"catalogue/{href}"
        url = f"{BASE}{href}"

        page += 1
        time.sleep(random.uniform(1, 2))
    return books

def save_csv(data: list[dict], path: str):
    if not data:
        print("Nothing to save")  # avoid IndexError on data[0] below
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=data[0].keys())
        w.writeheader()
        w.writerows(data)

if __name__ == "__main__":
    session = make_session()
    books = scrape_all(session)
    save_csv(books, "books.csv")
    print(f"\n✅ Done. Scraped {len(books)} books → books.csv")

Save it as scraper.py and run it: python scraper.py. With 50 pages and a 1 to 2 second pause between them, expect a minute or two before all 1,000 books on the site are saved to books.csv.

Scraping Cheat Sheet

Task | Use
Static HTML page | requests + BeautifulSoup
JavaScript-rendered page | Playwright
Login / form submission | Playwright
Pagination | Loop + sleep
Avoid getting blocked | Sessions, real User-Agent, delays, proxies
Save data | CSV (small), JSON (nested), SQLite (big)
Parallel scraping | concurrent.futures.ThreadPoolExecutor
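The last row of the cheat sheet is the only technique not shown earlier in this guide. A minimal sketch, assuming you already have a function that fetches one URL and returns its HTML (or None on failure): fetch_many is a hypothetical helper, and fake_fetch stands in for a real fetcher so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

def fetch_many(
    urls: list[str],
    fetch: Callable[[str], Optional[str]],
    max_workers: int = 4,
) -> dict[str, Optional[str]]:
    """Fetch several URLs concurrently; return a mapping of url -> result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        results = pool.map(fetch, urls)
        return dict(zip(urls, results))

def fake_fetch(url: str) -> str:
    """Stand-in for a real fetch function, so no network is needed."""
    return f"<html>{url}</html>"

urls = [f"https://example.com/page-{n}.html" for n in range(1, 4)]
pages = fetch_many(urls, fake_fetch)
print(len(pages))
```

Keep max_workers small and keep the delays inside your fetch function: parallel requests multiply the load on the site, so the politeness rules matter even more here. An exception raised inside fetch is re-raised when its result is read.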

What to Build Next

You now have a complete scraping toolkit. Real projects you can build today:

1. Price tracker — scrape Amazon/Daraz daily, email when prices drop
2. Job aggregator — combine listings from 5 job boards into one CSV
3. Real estate analyzer — scrape Zameen/OLX, find undervalued listings
4. Stock news monitor — scrape headlines, flag mentions of your tickers
5. Lead generator — scrape company sites for emails (be careful with TOS)

Final Thought

Web scraping looks like dark magic until you write your first one. After that, you'll see every webpage differently — as a structured data source waiting to be unlocked. The hardest part isn't the code. It's deciding which problem is worth scraping for in the first place.

Pick one of the project ideas above. Build the scraper in a single sitting. By tomorrow, you'll have data that nobody else has — and that's where every interesting analysis starts.


Written by

Raretechsol

International software company specializing in Python and JavaScript. Passionate about automation, AI, and building practical web applications.
