
Web scraping is a powerful technique for extracting data from websites. While basic scraping can be done with libraries like requests and BeautifulSoup, advanced scraping requires handling authentication, dynamic content, and bot detection. In this blog, we will explore advanced web scraping techniques in Python.
Many websites require users to log in before accessing content. To scrape such sites, we need to handle authentication and maintain session cookies.
import requests
from bs4 import BeautifulSoup
# Create a session
session = requests.Session()
# Define login URL and credentials
login_url = "https://example.com/login"
payload = {
    "username": "your_username",
    "password": "your_password"
}
# Perform login
response = session.post(login_url, data=payload)
# Check if login was successful (the "Logout" link only appears for logged-in users)
if "Logout" in response.text:
    print("Login successful!")
else:
    print("Login failed!")
Some websites include CSRF tokens in their forms to prevent automated logins. You need to extract and include the CSRF token in your request.
# Get login page to extract CSRF token
login_page = session.get(login_url)
soup = BeautifulSoup(login_page.content, "html.parser")
csrf_token = soup.find("input", {"name": "csrf_token"})["value"]
# Include token in login payload
payload["csrf_token"] = csrf_token
session.post(login_url, data=payload)
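The token lookup can be sanity-checked offline against a saved copy of the form before pointing it at the live site. A minimal, self-contained sketch (the HTML below is a made-up stand-in for the real login page):

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the downloaded login page
login_html = """
<form method="post" action="/login">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""
soup = BeautifulSoup(login_html, "html.parser")
token_input = soup.find("input", {"name": "csrf_token"})
# Guard against the field being absent or renamed on some pages
csrf_token = token_input["value"] if token_input else None
print(csrf_token)  # abc123
```

If the token comes back as None, inspect the page source: some sites embed the token in a meta tag or a cookie instead of a hidden input.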
Some websites load data dynamically using JavaScript, making it difficult to scrape using requests. In such cases, Selenium can be used to automate browser interaction.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
# Initialize WebDriver
driver = webdriver.Chrome()
driver.get("https://example.com/login")
driver.find_element(By.NAME, "username").send_keys("your_username")
driver.find_element(By.NAME, "password").send_keys("your_password")
driver.find_element(By.NAME, "password").send_keys(Keys.RETURN)
time.sleep(5)  # Crude fixed wait for the page to load; an explicit wait is more robust
driver.get("https://example.com/protected-page")
print(driver.page_source)
driver.quit()
Many sites use CAPTCHAs to prevent automated access. External solving services like 2Captcha can return a solution token that you submit with your request.
import requests
import time
api_key = "your_2captcha_api_key"
site_key = "site_specific_key"
page_url = "https://example.com/login"
# Submit the CAPTCHA job; a successful response looks like "OK|<captcha_id>"
response = requests.get(
    f"http://2captcha.com/in.php?key={api_key}&method=userrecaptcha&googlekey={site_key}&pageurl={page_url}"
)
captcha_id = response.text.split('|')[1]
# Poll until a worker has solved it ("CAPCHA_NOT_READY" means keep waiting)
while True:
    time.sleep(5)
    result = requests.get(f"http://2captcha.com/res.php?key={api_key}&action=get&id={captcha_id}")
    if result.text != "CAPCHA_NOT_READY":
        captcha_solution = result.text.split('|')[1]
        break
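The solved token is then submitted along with the login form, usually under the field name g-recaptcha-response (the exact field name depends on the site's CAPTCHA integration). A sketch of assembling that payload, using a hypothetical token value:

```python
# Hypothetical token; in practice this is the value returned by 2Captcha's res.php
captcha_solution = "03AGdBq25EXAMPLETOKEN"
payload = {
    "username": "your_username",
    "password": "your_password",
    # reCAPTCHA forms typically post the token under this key
    "g-recaptcha-response": captcha_solution,
}
print("g-recaptcha-response" in payload)  # True
```

A session.post(login_url, data=payload) then completes the login with the CAPTCHA answer attached.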
Websites often block scrapers by detecting unusual activity. Here are some techniques to avoid detection:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = session.get("https://example.com", headers=headers)
Rotating user agents and proxies makes each request look like it comes from a different client. Use the fake_useragent library together with a pool of proxies from a third-party proxy service to rotate both and avoid bans.
from fake_useragent import UserAgent
import random
ua = UserAgent()
headers = {"User-Agent": ua.random}
proxies = ["http://proxy1.com", "http://proxy2.com"]
chosen = random.choice(proxies)
proxy = {"http": chosen, "https": chosen}  # Route both schemes through the proxy
response = requests.get("https://example.com", headers=headers, proxies=proxy)
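Request timing matters too: perfectly regular, rapid-fire requests are an easy bot signal. A randomized pause between requests mimics human browsing rhythm; a small sketch (the bounds are arbitrary examples):

```python
import random
import time

def polite_pause(min_s=1.0, max_s=3.0):
    # Sleep for a random interval within [min_s, max_s] and report it
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

delay = polite_pause(0.01, 0.02)  # Tiny bounds here so the demo runs quickly
```

Call polite_pause() before each request in a scraping loop; a second or two of jitter is usually enough.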
Scrapy is a powerful web scraping framework that handles large-scale scraping efficiently.
import scrapy
class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = ["https://example.com"]
    def parse(self, response):
        for item in response.css(".data-item"):
            yield {
                "title": item.css("h2::text").get(),
                "price": item.css(".price::text").get()
            }
Advanced web scraping requires handling authentication, JavaScript-rendered content, CAPTCHA challenges, and anti-bot measures. By combining requests, BeautifulSoup, Selenium, and Scrapy, you can extract data efficiently while respecting website policies.
Always check a site's robots.txt file and terms of service before scraping. By implementing these techniques, you can scrape data more effectively while minimizing risks. Happy scraping!