Introduction
Web scraping is one of the highest-leverage skills a Python developer can learn. Pricing data, job listings, real estate, sports stats, product reviews — most of the internet is structured data that nobody bothered to expose as an API. With 50 lines of Python, you can collect in seconds what would take a human days.
This guide covers everything you need to scrape professionally:
- Setup and the rules you must follow
- Your first scraper with requests + BeautifulSoup
- Handling JavaScript-heavy sites with Playwright
- Pagination, headers, rate limits, and other real-world problems
- Saving scraped data (CSV, JSON, SQLite)
- A full real-world example you can run today
Every code block here is production-ready.
Setup: Install What You Need
The two libraries you'll use for 90% of scraping work:
pip install requests beautifulsoup4 lxml
For JavaScript-rendered sites, add Playwright (better than Selenium in 2026):
pip install playwright
playwright install chromium
For saving data:
pip install pandas
That's the entire stack. You don't need Scrapy or any heavyweight framework for most projects.
The Rules (Don't Skip This)
Careless scraping can get your IP banned, your account suspended, or, in rare cases, get you sued. Three rules keep you safe:
| Rule | Why |
|---|---|
| Check /robots.txt | Tells you what's off-limits (e.g., example.com/robots.txt) |
| Add delays between requests | Hammering a server is rude and triggers blocks |
| Set a real User-Agent | Lets the site identify you; some block default Python UAs |
Also: never scrape personal user data, paywalled content, or anything behind a login unless you own that login. Public, displayed-to-everyone data is fair game in most jurisdictions.
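The robots.txt rule can be checked in code with the standard library's urllib.robotparser. The sketch below is a minimal illustration: SAMPLE_ROBOTS is a made-up robots.txt body so the example runs offline, and the helper names build_parser and is_allowed are hypothetical. Against a real site you would download /robots.txt first and pass its text in.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs without a network call.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS)
print(is_allowed(parser, "LearningScraper", "https://example.com/catalogue/"))
print(is_allowed(parser, "LearningScraper", "https://example.com/private/data"))
print(parser.crawl_delay("LearningScraper"))  # requested pause between requests, if any
```

If crawl_delay returns a number, using it as the minimum pause between requests is a reasonable default.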
1. Your First Scraper
Let's scrape book titles and prices from books.toscrape.com — a sandbox site built for practicing:
import requests
from bs4 import BeautifulSoup


def scrape_books(url: str) -> list[dict]:
    # Identify the scraper with an explicit User-Agent
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; LearningScraper/1.0)"
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx responses

    # Pass raw bytes so BeautifulSoup detects the page encoding itself;
    # response.text can garble symbols such as £ when no charset is declared
    soup = BeautifulSoup(response.content, "lxml")

    books = []
    for article in soup.select("article.product_pod"):
        title = article.select_one("h3 a")["title"]
        price = article.select_one(".price_color").get_text(strip=True)
        availability = article.select_one(".availability").get_text(strip=True)
        books.append({
            "title": title,
            "price": price,
            "availability": availability,
        })
    return books


if __name__ == "__main__":
    data = scrape_books("https://books.toscrape.com/")
    for book in data[:5]:
        print(book)
    print(f"\nTotal: {len(data)} books scraped")
Three concepts to understand:
| Concept | What it does |
|---|---|
| requests.get() | Downloads the HTML of the page |
| BeautifulSoup(html, "lxml") | Parses the HTML into a searchable tree |
| soup.select(...) | Finds elements using CSS selectors |
You don't need to learn regex or XPath. CSS selectors (the same ones you use in CSS files) handle 99% of scraping.
2. Finding the Right Selectors
The biggest beginner question: how do I know which selector to use?
Use your browser's DevTools. Right-click any element on the page → Inspect. The HTML for that element opens in the panel. Look at:
- The tag (<h3>, <div>, <span>)
- The class (class="price_color")
- The id (id="main")
Then translate to CSS:
soup.select_one("h3") # first <h3>
soup.select_one(".price_color") # first element with class price_color
soup.select_one("#main") # element with id="main"
soup.select_one("article.product_pod h3 a") # nested — <a> inside <h3> inside article.product_pod
soup.select("a.btn") # ALL <a class="btn"> (returns a list)
Pro tip: in Chrome DevTools → right-click an element → Copy → Copy selector. You get the CSS path instantly. Then simplify it.
3. Handling JavaScript-Heavy Sites
requests only downloads the raw HTML. If a site renders content with JavaScript (React, Vue, Angular), your scraper will see an empty <div> where the data should be.
The fix: use Playwright to run a real browser, then scrape the rendered HTML.
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright


def scrape_dynamic_page(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        # Wait for a specific element to be present
        page.wait_for_selector(".product-card", timeout=10000)
        html = page.content()  # HTML after JavaScript has run
        browser.close()
        return html


def parse_products(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    products = []
    for card in soup.select(".product-card"):
        products.append({
            "name": card.select_one(".name").get_text(strip=True),
            "price": card.select_one(".price").get_text(strip=True),
        })
    return products


if __name__ == "__main__":
    html = scrape_dynamic_page("https://example-shop.com/products")
    products = parse_products(html)
    print(f"Scraped {len(products)} products")
When to use Playwright vs requests:
| Situation | Use |
|---|---|
| Page source already contains the data (view-source: shows it) | requests + BeautifulSoup |
| Data appears only after JavaScript runs | Playwright |
| Site has anti-bot protection (Cloudflare, etc.) | Playwright with headless=False |
| You need to click buttons, fill forms, scroll | Playwright |
Playwright is 10–50× slower than requests because it launches a real browser. Use it only when you must.
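The view-source check from the table can also be done in code: copy a value you can see in the browser and test whether it appears in the raw HTML that requests would receive. This is only a rough sketch. The needs_browser helper and both HTML snippets are hypothetical, and a plain substring test will miss text that is split across tags.

```python
def needs_browser(raw_html: str, expected_text: str) -> bool:
    """Return True when text visible in the browser is missing from the raw HTML."""
    return expected_text.lower() not in raw_html.lower()


# Hypothetical responses: a server-rendered page and a JavaScript app shell.
static_html = (
    '<article class="product_pod">'
    '<h3><a title="A Light in the Attic">A Light in the ...</a></h3>'
    "</article>"
)
spa_html = '<div id="root"></div><script src="/bundle.js"></script>'

print(needs_browser(static_html, "A Light in the Attic"))  # data is in the source
print(needs_browser(spa_html, "A Light in the Attic"))     # data arrives via JavaScript
```

When the check returns False, requests + BeautifulSoup is usually enough and the slower browser can be skipped.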
4. Pagination — Scraping Multiple Pages
Most real-world scrapes need many pages. Three common patterns:
Pattern A: Page numbers in URL
import time


def scrape_all_pages(base_url: str, total_pages: int) -> list[dict]:
    all_items = []
    for page in range(1, total_pages + 1):
        url = f"{base_url}?page={page}"
        items = scrape_books(url)  # from earlier
        all_items.extend(items)
        print(f"Page {page}: {len(items)} items (total: {len(all_items)})")
        time.sleep(1)  # be polite
    return all_items
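Pattern A assumes you know total_pages up front. When you don't, one option is to keep requesting pages until one comes back empty. The sketch below assumes a hypothetical scrape_page callable that takes a page number and returns an empty list once you run past the last page. Many sites return a 404 instead, in which case the callable would need to catch that error and return an empty list.

```python
import time
from typing import Callable


def scrape_until_empty(
    scrape_page: Callable[[int], list[dict]],
    max_pages: int = 1000,
    delay: float = 1.0,
) -> list[dict]:
    """Request page 1, 2, 3, ... and stop at the first page with no items."""
    all_items: list[dict] = []
    for page in range(1, max_pages + 1):  # max_pages is a safety limit
        items = scrape_page(page)
        if not items:
            break
        all_items.extend(items)
        time.sleep(delay)  # be polite between pages
    return all_items


# Stand-in for a real page scraper: three pages of two items each.
def fake_scrape_page(page: int) -> list[dict]:
    if page > 3:
        return []
    return [{"page": page, "item": i} for i in range(2)]


print(len(scrape_until_empty(fake_scrape_page, delay=0)))
```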
Pattern B: "Next" button until it disappears
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def scrape_with_next_button(start_url: str) -> list[dict]:
    all_items = []
    url = start_url
    while url:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        all_items.extend(parse_items(soup))  # parse_items: your own per-page parser
        next_link = soup.select_one("li.next a")
        # Resolve the relative href against the current page; stop when no link is left
        url = urljoin(url, next_link["href"]) if next_link else None
        time.sleep(1)
    return all_items
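Relative "next" links are a common source of pagination bugs, because the same href means different things depending on which page it came from. The standard library's urljoin resolves an href against the page it was found on. The hrefs below are assumed examples in the style of books.toscrape.com, and resolve_next is a hypothetical helper name.

```python
from urllib.parse import urljoin


def resolve_next(current_url: str, href: str) -> str:
    """Turn a relative href into an absolute URL, based on the page it came from."""
    return urljoin(current_url, href)


# From the home page, the link points into the catalogue/ folder.
print(resolve_next("https://books.toscrape.com/", "catalogue/page-2.html"))
# → https://books.toscrape.com/catalogue/page-2.html

# From inside catalogue/, the link is relative to that folder.
print(resolve_next("https://books.toscrape.com/catalogue/page-2.html", "page-3.html"))
# → https://books.toscrape.com/catalogue/page-3.html
```

Gluing the href onto a fixed base URL would get one of these two cases wrong, which is why resolving against the current page is the safer habit.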
Pattern C: Infinite scroll (use Playwright)
from playwright.sync_api import sync_playwright


def scrape_infinite_scroll(url: str, max_scrolls: int = 10) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        for _ in range(max_scrolls):
            # Jump to the bottom, then give new content time to load
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            page.wait_for_timeout(1500)
        html = page.content()
        browser.close()
        return html
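A fixed number of scrolls either wastes time on short pages or stops early on long ones. A common refinement is to stop once the page height no longer grows. The sketch below takes any object with Playwright's page.evaluate and page.wait_for_timeout methods. The scroll_until_stable helper and the FakePage stand-in are hypothetical and exist only so the example runs without a browser.

```python
def scroll_until_stable(page, max_scrolls: int = 10, pause_ms: int = 1500) -> int:
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = page.evaluate("document.body.scrollHeight")
    scrolls = 0
    for _ in range(max_scrolls):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # let new content load
        scrolls += 1
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return scrolls


class FakePage:
    """Offline stand-in that grows through a fixed list of heights."""

    def __init__(self, heights: list[int]):
        self.heights = heights
        self.index = 0

    def evaluate(self, expression: str):
        if expression.startswith("window.scrollTo"):
            self.index = min(self.index + 1, len(self.heights) - 1)
            return None
        return self.heights[self.index]

    def wait_for_timeout(self, ms: int) -> None:
        pass


print(scroll_until_stable(FakePage([1000, 2000, 3000])))
```

In scrape_infinite_scroll, a call like this could replace the for loop, with the real page object passed in.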
5. Headers, Sessions, and Avoiding Blocks
Real websites can tell you're a bot if you don't fake being a real browser. Here's how to look human:
import requests


def make_session() -> requests.Session:
    session = requests.Session()
    session.headers.update({
        "User-Agent": (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        # "br" is left out: requests only decodes Brotli when the brotli package is installed
        "Accept-Encoding": "gzip, deflate",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
    })
    return session


session = make_session()
response = session.get("https://example.com")
A Session keeps cookies between requests — useful for sites that set a session cookie on the first visit.
If you're still getting blocked, escalate in this order:
1. Add random delays: time.sleep(random.uniform(2, 5))
2. Rotate User-Agents: fake_useragent library
3. Use proxies: BrightData, ScraperAPI, Smartproxy
4. Use Playwright: looks like a real browser
5. Use a scraping API: ScrapingBee, ZenRows handle everything for you
6. Rate Limiting and Retries
A polite scraper handles failures gracefully:
import random
import time

import requests
from requests.exceptions import RequestException


def fetch_with_retry(url: str, max_retries: int = 3) -> str | None:
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            time.sleep(random.uniform(1, 3))  # be polite
            return response.text
        except RequestException as e:
            wait = 2 ** attempt
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {wait}s…")
            time.sleep(wait)
    print(f"Gave up on {url} after {max_retries} attempts")
    return None
The 2 ** attempt is exponential backoff — wait 1s, then 2s, then 4s. This is what professional scrapers do.
7. Saving Your Data
Three formats cover almost everything.
CSV (best for spreadsheets)
import csv


def save_to_csv(data: list[dict], filename: str):
    if not data:
        return  # nothing to write
    with open(filename, "w", newline="", encoding="utf-8") as f:
        # Column names come from the keys of the first row
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)
JSON (best for nested data)
import json


def save_to_json(data: list[dict], filename: str):
    with open(filename, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as £ readable in the file
        json.dump(data, f, indent=2, ensure_ascii=False)
SQLite (best for big or repeated scrapes)
import sqlite3


def save_to_sqlite(data: list[dict], db_path: str = "scraped.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS books (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT,
            price TEXT,
            availability TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    # Parameterized insert: one tuple per scraped row
    conn.executemany(
        "INSERT INTO books (title, price, availability) VALUES (?, ?, ?)",
        [(b["title"], b["price"], b["availability"]) for b in data],
    )
    conn.commit()
    conn.close()
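For repeated scrapes, the plain INSERT above stores the same book again on every run. One way to avoid that is a uniqueness constraint combined with INSERT OR IGNORE. The sketch below assumes the title identifies a book, which will not hold on every site. The books_unique table and the save_unique helper are hypothetical names, kept separate from the books table above so the two schemas do not collide.

```python
import sqlite3


def save_unique(conn: sqlite3.Connection, data: list[dict]) -> int:
    """Insert rows, skipping titles already stored; return how many were new."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS books_unique (
            title TEXT PRIMARY KEY,  -- assumption: title identifies a book
            price TEXT,
            availability TEXT,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    before = conn.execute("SELECT COUNT(*) FROM books_unique").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO books_unique (title, price, availability) "
        "VALUES (?, ?, ?)",
        [(b["title"], b["price"], b["availability"]) for b in data],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM books_unique").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
rows = [
    {"title": "Book A", "price": "£10.00", "availability": "In stock"},
    {"title": "Book B", "price": "£12.50", "availability": "In stock"},
]
print(save_unique(conn, rows))  # first run: both rows are new
print(save_unique(conn, rows))  # second run: nothing new
```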
8. A Complete Real-World Example
Putting it all together: a scraper that collects every book from books.toscrape.com, follows pagination, retries failed requests, and saves the results to CSV. To store the same list in SQLite as well, pass it to save_to_sqlite from section 7.
import csv
import random
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from requests.exceptions import RequestException

BASE = "https://books.toscrape.com/"


def make_session() -> requests.Session:
    s = requests.Session()
    s.headers.update({"User-Agent": "Mozilla/5.0 LearningScraper/1.0"})
    return s


def fetch(session: requests.Session, url: str, retries: int = 3) -> str | None:
    for attempt in range(retries):
        try:
            r = session.get(url, timeout=10)
            r.raise_for_status()
            # Without a charset header, requests assumes ISO-8859-1 for text,
            # which can garble symbols such as £; fall back to detection
            if "charset" not in r.headers.get("Content-Type", "").lower():
                r.encoding = r.apparent_encoding
            return r.text
        except RequestException:
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    return None


def parse_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    books = []
    for article in soup.select("article.product_pod"):
        books.append({
            "title": article.select_one("h3 a")["title"],
            "price": article.select_one(".price_color").get_text(strip=True),
            "availability": article.select_one(".availability").get_text(strip=True),
        })
    return books


def scrape_all(session: requests.Session) -> list[dict]:
    books = []
    url = BASE
    page = 1
    while url:
        html = fetch(session, url)
        if not html:
            break
        books.extend(parse_page(html))
        print(f"Scraped page {page}: {len(books)} books so far")
        soup = BeautifulSoup(html, "lxml")
        next_link = soup.select_one("li.next a")
        if not next_link:
            break
        # Resolve the relative href against the current page URL
        url = urljoin(url, next_link["href"])
        page += 1
        time.sleep(random.uniform(1, 2))
    return books


def save_csv(data: list[dict], path: str):
    if not data:
        return  # nothing scraped, nothing to write
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=data[0].keys())
        w.writeheader()
        w.writerows(data)


if __name__ == "__main__":
    session = make_session()
    books = scrape_all(session)
    save_csv(books, "books.csv")
    print(f"\n✅ Done. Scraped {len(books)} books → books.csv")
Run it: python scraper.py. With a 1–2 second pause between pages, expect the run to take a minute or two, after which all 1,000 books on the site are saved to books.csv.
Scraping Cheat Sheet
| Task | Use |
|---|---|
| Static HTML page | requests + BeautifulSoup |
| JavaScript-rendered page | Playwright |
| Login / form submission | Playwright |
| Pagination | Loop + sleep |
| Avoid getting blocked | Sessions, real User-Agent, delays, proxies |
| Save data | CSV (small), JSON (nested), SQLite (big) |
| Parallel scraping | concurrent.futures.ThreadPoolExecutor |
What to Build Next
You now have a complete scraping toolkit. Real projects you can build today:
1. Price tracker — scrape Amazon/Daraz daily, email when prices drop
2. Job aggregator — combine listings from 5 job boards into one CSV
3. Real estate analyzer — scrape Zameen/OLX, find undervalued listings
4. Stock news monitor — scrape headlines, flag mentions of your tickers
5. Lead generator — scrape company sites for emails (be careful with TOS)
Final Thought
Web scraping looks like dark magic until you write your first one. After that, you'll see every webpage differently — as a structured data source waiting to be unlocked. The hardest part isn't the code. It's deciding which problem is worth scraping for in the first place.
Pick one of the project ideas above. Build the scraper in a single sitting. By tomorrow, you'll have data that nobody else has — and that's where every interesting analysis starts.