Basic Concepts of Web Scraping
Web scraping is a technique for automatically extracting data from websites. It works in four steps (sketched in code right after this list):
- Download the web page content
- Parse the HTML/XML structure
- Extract the specific information of interest
- Store the data in a structured format
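A minimal sketch of those four steps in Python, assuming a made-up page whose <h1> elements hold the data of interest (the URL and file name are placeholders):

import csv

import requests
from bs4 import BeautifulSoup

# 1. Download the page (placeholder URL)
response = requests.get("https://example.com")

# 2. Parse the HTML structure
soup = BeautifulSoup(response.text, "html.parser")

# 3. Extract specific information (here: all <h1> texts)
titles = [h.get_text(strip=True) for h in soup.find_all("h1")]

# 4. Store the data in a structured format (CSV)
with open("titles.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])
    writer.writerows([t] for t in titles)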
Goals of Web Scraping:
- Collecting data for analysis
- Monitoring competitor product prices
- Aggregating content from multiple sources
- Market and competitive research
Legality and Ethics
- Always check robots.txt (example: https://www.example.com/robots.txt); a sketch of doing this from Python follows this list
- Respect the User-Agent and Crawl-Delay directives
- Do not scrape personal data
- Avoid excessive requests that put load on the server
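As an illustration of the first two points, Python's standard-library urllib.robotparser can read robots.txt and its Crawl-Delay directive before any scraping starts; the bot name and URLs below are placeholders:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# May this user agent fetch the given path?
if rp.can_fetch("MyScraperBot", "https://www.example.com/products"):
    print("Allowed by robots.txt")
else:
    print("Disallowed by robots.txt")

# Crawl-Delay for this agent, or None if the site does not set one
print(rp.crawl_delay("MyScraperBot"))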
Python Tools and Libraries
BeautifulSoup
A library for parsing HTML/XML:
from bs4 import BeautifulSoup
import requests

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
Selenium
For scraping JavaScript-driven websites:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")
element = driver.find_element(By.ID, "element-id")
Scrapy
A framework for large-scale scraping projects:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {
            'title': response.css('h1::text').get()
        }
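A standalone spider file like this can usually be run without a full Scrapy project via the CLI, for example `scrapy runspider my_spider.py -o products.json`; the file and output names here are only placeholders.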
Supporting Libraries
- requests: HTTP requests
- pandas: data processing
- lxml: alternative parser (see the sketch after this list)
- fake-useragent: User-Agent rotation
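As a small illustration of the lxml entry above, BeautifulSoup can use lxml as a drop-in replacement for the built-in parser once lxml is installed; this is just a sketch of the parser switch, not a benchmark:

from bs4 import BeautifulSoup

html = "<div class='product'><h2>Laptop Gaming</h2></div>"

# Built-in parser: no extra dependency
soup_builtin = BeautifulSoup(html, "html.parser")

# lxml parser: typically faster; requires `pip install lxml`
soup_lxml = BeautifulSoup(html, "lxml")

print(soup_builtin.h2.text, soup_lxml.h2.text)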
Basic Web Scraping Techniques
Accessing a Web Page
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'
}

try:
    response = requests.get('https://example.com', headers=headers)
    response.raise_for_status()  # Check for HTTP errors
    print(response.status_code)  # 200 on success
except requests.exceptions.RequestException as e:
    print(f"Error: {e}")
Parsing HTML
from bs4 import BeautifulSoup

html_doc = """
<html>
<head><title>Sample Page</title></head>
<body>
    <div class="product">
        <h2>Laptop Gaming</h2>
        <span class="price">Rp 15.000.000</span>
    </div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

title = soup.find('title').text
product_name = soup.find('h2').text
price = soup.find('span', class_='price').text

print(f"Title: {title}")
print(f"Product: {product_name}")
print(f"Price: {price}")
Navigating the HTML Structure
# find vs find_all
first_product = soup.find('div', class_='product')     # First matching element
all_products = soup.find_all('div', class_='product')  # All matching elements

# CSS selectors
products = soup.select('div.product > h2')

# Parent, children, siblings
parent = soup.find('h2').parent
children = soup.find('div').contents
next_sibling = soup.find('h2').find_next_sibling()
Advanced Techniques
Handling Pagination
import time

import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/products?page="

for page in range(1, 6):  # First 5 pages
    url = base_url + str(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Scrape this page
    products = soup.find_all('div', class_='product')
    for product in products:
        # Extract the product data here
        pass

    time.sleep(2)  # Delay between pages
Handling Forms and Login
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/login")

# Fill in the login form
username = driver.find_element(By.NAME, "username")
password = driver.find_element(By.NAME, "password")
username.send_keys("user123")
password.send_keys("pass123")
password.send_keys(Keys.RETURN)

# Wait until the post-login page has loaded
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "dashboard"))
)
Bypassing Anti-Scraping Measures
import random
import time

import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}

proxies = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080'
]
proxy = random.choice(proxies)

# Route both HTTP and HTTPS traffic through the chosen proxy
response = requests.get(url, headers=headers,
                        proxies={'http': proxy, 'https': proxy})

# Random delay between requests
time.sleep(random.uniform(1, 3))
Storing Data
CSV Format
import pandas as pd

data = [
    {'nama': 'Produk A', 'harga': 100000},
    {'nama': 'Produk B', 'harga': 200000}
]

df = pd.DataFrame(data)
df.to_csv('produk.csv', index=False)
JSON Format
import json

with open('produk.json', 'w') as f:
    json.dump(data, f, indent=4)
SQL Database
import sqlite3

conn = sqlite3.connect('database.db')
cursor = conn.cursor()

# Create the table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS products (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT,
        price INTEGER
    )
''')

# Insert the data
for item in data:
    cursor.execute('INSERT INTO products (name, price) VALUES (?, ?)',
                   (item['nama'], item['harga']))

conn.commit()
conn.close()
Error Handling
Handling Failed Requests
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
except requests.exceptions.HTTPError as errh:
    print(f"HTTP Error: {errh}")
except requests.exceptions.ConnectionError as errc:
    print(f"Error Connecting: {errc}")
except requests.exceptions.Timeout as errt:
    print(f"Timeout Error: {errt}")
except requests.exceptions.RequestException as err:
    print(f"Something went wrong: {err}")
Handling Missing Elements
element = soup.find('div', class_='product')

if element is not None:
    price = element.find('span', class_='price').text
else:
    print("Element not found")
    price = None
Best Practices
A Scraper Design Pattern
import requests
from bs4 import BeautifulSoup

class WebScraper:
    def __init__(self, base_url):
        self.base_url = base_url
        self.session = requests.Session()
        self.session.headers.update({'User-Agent': 'My Scraper'})

    def scrape_page(self, url):
        try:
            response = self.session.get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            return self.parse(soup)
        except Exception as e:
            print(f"Error scraping {url}: {e}")
            return None

    def parse(self, soup):
        # Subclasses implement site-specific parsing
        raise NotImplementedError

class ProductScraper(WebScraper):
    def parse(self, soup):
        products = []
        for item in soup.find_all('div', class_='product'):
            products.append({
                'name': item.find('h2').text,
                'price': item.find('span', class_='price').text
            })
        return products
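A quick usage sketch of the subclass above (the URLs are placeholders):

scraper = ProductScraper("https://example.com")
products = scraper.scrape_page("https://example.com/products")
print(products)  # List of {'name': ..., 'price': ...} dicts, or None on error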
Logging and Monitoring
import logging

logging.basicConfig(
    filename='scraper.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

try:
    logging.info("Starting scrape...")
    # Scraping logic goes here
    logging.info("Scrape finished")
except Exception as e:
    logging.error(f"Error: {e}", exc_info=True)
References and Learning Resources
- Official documentation (BeautifulSoup, Selenium, Scrapy)
- Books:
  - "Web Scraping with Python" by Ryan Mitchell
  - "Automate the Boring Stuff with Python" (Chapter 11)
- Online tutorials:
  - Real Python web scraping tutorials
  - ScrapingBee Blog
Skill Development Tips:
- Start with simple static websites
- Learn CSS/XPath selectors (see the sketch after this list)
- Build modular scrapers
- Save the raw data before processing it
- Implement robust error handling
- Learn anti-scraping bypass techniques
- Practice storing data in various formats
- Keep up with developments in web scraping technology
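To make the CSS/XPath tip concrete, here is a minimal sketch that pulls the same value with a CSS selector (BeautifulSoup) and with XPath (lxml); the HTML fragment is invented for illustration:

from bs4 import BeautifulSoup
from lxml import html as lxml_html

doc = """
<div class="product">
    <h2>Laptop Gaming</h2>
    <span class="price">Rp 15.000.000</span>
</div>
"""

# CSS selector via BeautifulSoup
soup = BeautifulSoup(doc, "html.parser")
css_price = soup.select_one("div.product span.price").text

# XPath via lxml
tree = lxml_html.fromstring(doc)
xpath_price = tree.xpath("//span[@class='price']/text()")[0]

print(css_price, xpath_price)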
By mastering web scraping techniques, you can:
- Collect up-to-date data from a variety of sources
- Build custom datasets for data science projects
- Automate information gathering
- Make research and analysis more efficient