Data Collection – Web Scraping

Basic Concepts of Web Scraping

Web scraping is a technique for automatically extracting data from websites by:

  • Downloading the content of web pages
  • Analyzing the HTML/XML structure
  • Extracting specific pieces of information
  • Storing the data in a structured format
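
Taken together, the four steps fit in a few lines. A minimal end-to-end sketch (the URL and the h2 selector are placeholder assumptions):

python
import csv

import requests
from bs4 import BeautifulSoup

# 1. Download the page content
response = requests.get('https://example.com')

# 2. Analyze the HTML structure
soup = BeautifulSoup(response.text, 'html.parser')

# 3. Extract specific information (placeholder selector)
titles = [h2.get_text(strip=True) for h2 in soup.find_all('h2')]

# 4. Store the data in a structured format
with open('titles.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['title'])
    writer.writerows([t] for t in titles)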

Goals of Web Scraping:

  • Collecting data for analysis
  • Monitoring competitor product prices
  • Aggregating content from multiple sources
  • Market and competitive research

Legality and Ethics

  • Always check robots.txt (example: https://www.example.com/robots.txt); see the sketch after this list
  • Respect User-Agent and Crawl-Delay directives
  • Do not scrape personal data
  • Avoid excessive requests that put load on the server
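
Python's standard library can automate the robots.txt check. A minimal sketch using urllib.robotparser (the bot name and URLs are placeholder assumptions):

python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')  # Placeholder site
rp.read()

# Only fetch paths the site allows for our (hypothetical) user agent
if rp.can_fetch('MyScraperBot', 'https://www.example.com/products'):
    print('Allowed to scrape this path')

# Honor an explicit Crawl-delay directive if one is declared
print('Requested crawl delay:', rp.crawl_delay('MyScraperBot'))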

Python Tools and Libraries

BeautifulSoup

A library for parsing HTML/XML:

python
from bs4 import BeautifulSoup
import requests

url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

Selenium

For scraping JavaScript-rendered websites:

python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com")
element = driver.find_element(By.ID, "element-id")
driver.quit()  # Release the browser when done

Scrapy

A framework for large scraping projects:

python
import scrapy

class MySpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']
    
    def parse(self, response):
        yield {
            'title': response.css('h1::text').get()
        }

Supporting Libraries

  • requests: HTTP requests
  • pandas: Data processing
  • lxml: Alternative parser (see the short example after this list)
  • fake-useragent: User-Agent rotation
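
For instance, lxml can serve as a drop-in parser backend for BeautifulSoup (assuming the lxml package is installed):

python
from bs4 import BeautifulSoup

html = '<html><body><h1>Hello</h1></body></html>'

# 'lxml' is typically faster than the built-in 'html.parser'
soup = BeautifulSoup(html, 'lxml')
print(soup.h1.text)  # Hello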

Basic Web Scraping Techniques

Accessing Web Pages

python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...'
}

try:
    response = requests.get('https://example.com', headers=headers)
    response.raise_for_status()  # Raise an exception on HTTP error status
    print(response.status_code)  # 200 on success
except requests.exceptions.RequestException as e:
    print(f"Error: {e}")

Parsing HTML

python
from bs4 import BeautifulSoup

html_doc = """
<html>
<head><title>Sample Page</title></head>
<body>
<div class="product">
    <h2>Laptop Gaming</h2>
    <span class="price">Rp 15.000.000</span>
</div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
title = soup.find('title').text
product_name = soup.find('h2').text
price = soup.find('span', class_='price').text

print(f"Judul: {title}")
print(f"Produk: {product_name}")
print(f"Harga: {price}")

Navigating the HTML Structure

python
# find vs find_all
first_product = soup.find('div', class_='product')  # First matching element
all_products = soup.find_all('div', class_='product')  # All matching elements

# CSS selectors
products = soup.select('div.product > h2')

# Parent, children, siblings
parent = soup.find('h2').parent
children = soup.find('div').contents
next_sibling = soup.find('h2').find_next_sibling()

Advanced Techniques

Handling Pagination

python
import time

import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/products?page="

for page in range(1, 6):  # First 5 pages
    url = base_url + str(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Process the page
    products = soup.find_all('div', class_='product')
    for product in products:
        # Extract product data
        pass

    time.sleep(2)  # Delay between pages

Handling Forms and Login

python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/login")

# Fill in the login form
username = driver.find_element(By.NAME, "username")
password = driver.find_element(By.NAME, "password")

username.send_keys("user123")
password.send_keys("pass123")
password.send_keys(Keys.RETURN)

# Wait until login succeeds
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "dashboard"))
)

Bypassing Anti-Scraping Measures

python
import random
import time

import requests
from fake_useragent import UserAgent

url = 'https://example.com'  # Target page

# Rotate the User-Agent on each request
ua = UserAgent()
headers = {'User-Agent': ua.random}

# Rotate across a pool of proxies (placeholders)
proxies = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080'
]

proxy = random.choice(proxies)
response = requests.get(url, headers=headers, proxies={'http': proxy})

# Random delay between requests to mimic human browsing
time.sleep(random.uniform(1, 3))

Storing Data

CSV Format

python
import pandas as pd

data = [
    {'nama': 'Produk A', 'harga': 100000},
    {'nama': 'Produk B', 'harga': 200000}
]

df = pd.DataFrame(data)
df.to_csv('produk.csv', index=False)

JSON Format

python
import json

with open('produk.json', 'w') as f:
    json.dump(data, f, indent=4)

SQL Database

python
import sqlite3

conn = sqlite3.connect('database.db')
cursor = conn.cursor()

# Create the table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS products (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        name TEXT,
        price INTEGER
    )
''')

# Insert data
for item in data:
    cursor.execute('INSERT INTO products (name, price) VALUES (?, ?)', 
                  (item['nama'], item['harga']))

conn.commit()
conn.close()

Error Handling

Handling Failed Requests

python
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
except requests.exceptions.HTTPError as errh:
    print(f"HTTP Error: {errh}")
except requests.exceptions.ConnectionError as errc:
    print(f"Error Connecting: {errc}")
except requests.exceptions.Timeout as errt:
    print(f"Timeout Error: {errt}")
except requests.exceptions.RequestException as err:
    print(f"Something went wrong: {err}")

Handling Missing Elements

python
element = soup.find('div', class_='product')
if element is not None:
    price_tag = element.find('span', class_='price')
    price = price_tag.text if price_tag is not None else None
else:
    print("Element not found")
    price = None

Best Practices

Scraper Design Pattern

python
class WebScraper:
    def __init__(self, base_url):
        self.base_url = base_url
        self.session = requests.Session()
        self.session.headers.update({'User-Agent': 'My Scraper'})
    
    def scrape_page(self, url):
        try:
            response = self.session.get(url)
            soup = BeautifulSoup(response.text, 'html.parser')
            return self.parse(soup)
        except Exception as e:
            print(f"Error scraping {url}: {e}")
            return None
    
    def parse(self, soup):
        # Site-specific parsing is implemented in subclasses
        raise NotImplementedError

class ProductScraper(WebScraper):
    def parse(self, soup):
        products = []
        for item in soup.find_all('div', class_='product'):
            products.append({
                'name': item.find('h2').text,
                'price': item.find('span', class_='price').text
            })
        return products
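
A concrete subclass is then used like this (the URLs are placeholders):

python
scraper = ProductScraper('https://example.com')
products = scraper.scrape_page('https://example.com/products')
if products:
    print(f'Scraped {len(products)} products')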

Logging and Monitoring

python
import logging

logging.basicConfig(
    filename='scraper.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

try:
    logging.info("Starting scrape...")
    # Scraping work goes here
    logging.info("Scraping finished")
except Exception as e:
    logging.error(f"Error: {e}", exc_info=True)

References and Learning Resources

  • Official documentation: BeautifulSoup, Selenium, and Scrapy
  • Books:
    • “Web Scraping with Python” by Ryan Mitchell
    • “Automate the Boring Stuff with Python” (Chapter 11)
  • Online tutorials:
    • Real Python web scraping tutorials
    • The ScrapingBee blog

Skill Development Tips:

  1. Start with simple static websites
  2. Learn CSS/XPath selectors (compared in the sketch after this list)
  3. Build modular scrapers
  4. Save raw data before processing it
  5. Implement robust error handling
  6. Study anti-scraping bypass techniques
  7. Practice storing data in multiple formats
  8. Keep up with developments in web scraping technology
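
For tip 2, the same element can usually be targeted with either syntax. A small comparison, assuming the lxml package is installed (the HTML snippet is a placeholder):

python
from bs4 import BeautifulSoup
from lxml import html

doc = '<div class="product"><h2>Laptop Gaming</h2></div>'

# CSS selector via BeautifulSoup
soup = BeautifulSoup(doc, 'html.parser')
print(soup.select_one('div.product > h2').text)

# Equivalent XPath query via lxml
tree = html.fromstring(doc)
print(tree.xpath('//div[@class="product"]/h2/text()')[0])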

By mastering web scraping techniques, you can:

  • Collect up-to-date data from a variety of sources
  • Build custom datasets for data science projects
  • Automate information gathering
  • Make research and analysis more efficient
