Advanced Web Scraping Using Python

Web scraping is a powerful technique used to extract data from websites. While basic scraping can be done with libraries like requests and BeautifulSoup, advanced scraping requires handling authentication, dynamic content, and bot detection. In this blog, we will explore advanced web scraping techniques using Python.

1. Handling Authentication and Session Management

Many websites require users to log in before accessing content. To scrape such sites, we need to handle authentication and maintain session cookies.

Using requests.Session() for Authentication

import requests
from bs4 import BeautifulSoup

# Create a session
session = requests.Session()

# Define login URL and credentials
login_url = "https://example.com/login"
payload = {
    "username": "your_username",
    "password": "your_password"
}

# Perform login
response = session.post(login_url, data=payload)

# Check if login was successful
if "Logout" in response.text:
    print("Login successful!")
else:
    print("Login failed!")

2. Handling CSRF Tokens

Some websites include CSRF tokens in their forms to prevent automated logins. You need to extract and include the CSRF token in your request.

# Get login page to extract CSRF token
login_page = session.get(login_url)
soup = BeautifulSoup(login_page.content, "html.parser")
csrf_token = soup.find("input", {"name": "csrf_token"})["value"]

# Include token in login payload
payload["csrf_token"] = csrf_token
session.post(login_url, data=payload)

3. Scraping JavaScript-Rendered Content

Some websites load data dynamically using JavaScript, making it difficult to scrape using requests. In such cases, Selenium can be used to automate browser interaction.

Using Selenium for Dynamic Content

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

# Initialize WebDriver
driver = webdriver.Chrome()

driver.get("https://example.com/login")

driver.find_element(By.NAME, "username").send_keys("your_username")
driver.find_element(By.NAME, "password").send_keys("your_password")
driver.find_element(By.NAME, "password").send_keys(Keys.RETURN)

time.sleep(5)  # Wait for the page to load

driver.get("https://example.com/protected-page")
print(driver.page_source)

driver.quit()
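
The fixed time.sleep(5) above is a blunt instrument; Selenium's explicit waits are more reliable for dynamic pages. This sketch, which assumes a hypothetical element with ID "content" on the protected page, would replace the sleep before driver.quit():

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the target element instead of sleeping blindly
wait = WebDriverWait(driver, 10)
content = wait.until(EC.presence_of_element_located((By.ID, "content")))
print(content.text)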

4. Handling CAPTCHA Challenges

Many sites use CAPTCHA challenges to block automated access. External solving services like 2Captcha can be used to obtain a valid CAPTCHA token programmatically: you submit the CAPTCHA details, then poll for the solved token.

import time
import requests

api_key = "your_2captcha_api_key"
site_key = "site_specific_key"  # the target site's reCAPTCHA site key
page_url = "https://example.com/login"

# Submit the CAPTCHA job to 2Captcha (a successful response looks like "OK|<captcha_id>")
response = requests.get(
    f"http://2captcha.com/in.php?key={api_key}&method=userrecaptcha&googlekey={site_key}&pageurl={page_url}"
)
captcha_id = response.text.split('|')[1]

# Poll until a worker has solved it (the response is "OK|<token>" when ready)
captcha_token = None
while captcha_token is None:
    time.sleep(15)
    result = requests.get(
        f"http://2captcha.com/res.php?key={api_key}&action=get&id={captcha_id}"
    ).text
    if result.startswith("OK|"):
        captcha_token = result.split('|')[1]
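
The solved token is then submitted along with the login form, typically in the g-recaptcha-response field (the field name below is the reCAPTCHA default; the exact form layout depends on the site):

# Attach the solved CAPTCHA token to the login payload and retry the login
payload["g-recaptcha-response"] = captcha_token
session.post(login_url, data=payload)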

5. Avoiding Detection & Rate Limiting

Websites often block scrapers by detecting unusual activity. Here are some techniques to avoid detection:

Using Headers and User-Agents

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = session.get("https://example.com", headers=headers)

Rotating Proxies and User-Agents

Use proxy pools or third-party proxy services to rotate IPs, and randomize the User-Agent string on each request to avoid bans.

from fake_useragent import UserAgent
import random

ua = UserAgent()
headers = {"User-Agent": ua.random}  # random, realistic User-Agent string

# Placeholder proxy pool; set both "http" and "https" so HTTPS requests are proxied too
proxies = ["http://proxy1.com", "http://proxy2.com"]
proxy_url = random.choice(proxies)
proxy = {"http": proxy_url, "https": proxy_url}

response = requests.get("https://example.com", headers=headers, proxies=proxy)

6. Scraping Data at Scale with Scrapy

Scrapy is a powerful web scraping framework that handles large-scale scraping efficiently.

Example Scrapy Spider

import scrapy

class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]

    def parse(self, response):
        for item in response.css(".data-item"):
            yield {
                "title": item.css("h2::text").get(),
                "price": item.css(".price::text").get()
            }
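
The spider can be launched from the scrapy command line, or directly from a Python script with CrawlerProcess; the sketch below assumes the spider class above and writes the scraped items to items.json:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    "FEEDS": {"items.json": {"format": "json"}},
})
process.crawl(MySpider)
process.start()  # the script blocks here until the crawl is finished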

Conclusion

Advanced web scraping requires handling authentication, JavaScript-rendered content, CAPTCHA challenges, and anti-bot measures. By combining requests, BeautifulSoup, Selenium, and Scrapy, you can extract data efficiently while respecting website policies.

Important Considerations:

  • Always check the website’s robots.txt file (a quick programmatic check is sketched below).
  • Use official APIs if available.
  • Avoid aggressive scraping to prevent IP bans.
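
For the robots.txt check mentioned above, Python's standard library provides urllib.robotparser; a quick sketch using the placeholder domain from the earlier examples:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Verify that a path is allowed before requesting it
if rp.can_fetch("*", "https://example.com/protected-page"):
    print("Allowed by robots.txt")
else:
    print("Disallowed by robots.txt")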

By implementing these techniques, you can scrape data more effectively while minimizing risks. Happy scraping!
