Basic Concepts of Outliers
An outlier is an observation that differs markedly from the rest of the data in a dataset. An outlier may:
- Represent natural variation in the data
- Indicate a measurement or data-entry error
- Signal a rare but important event
Types of Outliers:
- Point outliers: a single extreme value
- Contextual outliers: a value that is unusual only within a particular context (for example, a given season or time of day)
- Collective outliers: a group of observations that is unusual as a whole, even if each value looks normal on its own
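The difference between a point outlier and a contextual outlier is easier to see on a small example. The sketch below uses made-up temperature readings and a hypothetical helper `iqr_flags` (neither comes from the original material). It applies the same 1.5*IQR rule once to all readings and once within each season.

```python
import numpy as np

def iqr_flags(values, k=1.5):
    """Return a boolean mask of values outside the k*IQR fences."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical daily temperatures (°C): six summer readings, then six winter readings
temps = np.array([30, 31, 29, 32, 30, 31, 2, 3, 1, 4, 2, 30], dtype=float)
season = np.array(["summer"] * 6 + ["winter"] * 6)

# Global check: the rule is applied to all readings at once
global_flags = iqr_flags(temps)

# Contextual check: the same rule is applied separately within each season
contextual_flags = np.zeros(len(temps), dtype=bool)
for s in np.unique(season):
    in_season = season == s
    contextual_flags[in_season] = iqr_flags(temps[in_season])

print("Flagged globally:     ", np.flatnonzero(global_flags))
print("Flagged within season:", np.flatnonzero(contextual_flags))
```

The last reading (30 °C in winter) is ordinary when all readings are pooled, so the global check flags nothing, but it stands out once it is compared only with other winter readings.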
Impact of Outliers
- Statistical analysis:
  - Distort the mean and standard deviation
  - Pull regression estimates toward the extreme points
- Machine learning:
  - Degrade model performance
  - Bias what the algorithm learns
- Visualization:
  - Stretch the plot scale so that most of the data is compressed
  - Make patterns in the data hard to see
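The effect on summary statistics can be checked directly. The following sketch compares the mean, median, and standard deviation of the tutorial's sample values with and without the single extreme value 300. The variable name `without_extreme` is only illustrative.

```python
import numpy as np

data = np.array([12, 15, 12, 14, 13, 16, 11, 15, 300, 12, 14])
without_extreme = data[data != 300]   # drop the single extreme value for comparison

for label, values in [("with 300", data), ("without 300", without_extreme)]:
    print(f"{label:>12}: mean={values.mean():.2f}  "
          f"median={np.median(values):.2f}  std={values.std():.2f}")
```

One value out of eleven moves the mean from 13.40 to 39.45, while the median only shifts from 13.5 to 14. This is why the median is usually the safer summary when outliers are suspected.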
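Outlier Detection Methods
1 Basic Statistical Methods
a. Z-Score
import numpy as np
from scipy import stats

# Sample data with one obviously extreme value (300)
data = np.array([12, 15, 12, 14, 13, 16, 11, 15, 300, 12, 14])

# Standardize each value and flag those more than 3 standard deviations from the mean
z_scores = stats.zscore(data)
outliers_z = np.where(np.abs(z_scores) > 3)[0]
print("Outlier indices (Z-Score):", outliers_z)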
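One caveat with the threshold of 3 is worth illustrating. With the population standard deviation (the default `ddof=0` in `scipy.stats.zscore`), no |z| in a sample of n points can exceed sqrt(n - 1), so with ten or fewer points the rule can never fire. The sketch below shows this on the first ten sample values and contrasts it with the modified z-score, a robust variant based on the median and the median absolute deviation (MAD). The helper `modified_z_scores` and the cutoff 3.5 are not part of the original text. They follow the commonly cited convention for this variant.

```python
import numpy as np
from scipy import stats

def modified_z_scores(values):
    """Modified z-score based on the median and the median absolute deviation (MAD)."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    # 0.6745 rescales the MAD so it is comparable to a standard deviation for normal data
    return 0.6745 * (values - median) / mad

# The first ten values of the sample data, so n = 10
small = np.array([12, 15, 12, 14, 13, 16, 11, 15, 300, 12])

z = stats.zscore(small)                      # population std (ddof=0) by default
print("Largest |z|:", np.abs(z).max())       # bounded by sqrt(n - 1) = 3
print("Z-score flags:", np.where(np.abs(z) > 3)[0])

mz = modified_z_scores(small)
print("Modified z flags:", np.where(np.abs(mz) > 3.5)[0])
```

The plain z-score misses 300 here because the extreme value inflates the very standard deviation it is measured against. The modified version assumes the MAD is non-zero. If more than half of the values are identical, the MAD is 0 and the division fails.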
b. IQR (Interquartile Range)
# Quartiles and interquartile range
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1

# Values more than 1.5 * IQR beyond the quartiles are treated as outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers_iqr = np.where((data < lower_bound) | (data > upper_bound))[0]
print("Outlier indices (IQR):", outliers_iqr)
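The multiplier 1.5 is a convention rather than a fixed constant. Tukey's boxplot rule also defines outer fences at 3 * IQR, and points beyond them are often called extreme (or "far out") outliers. A minimal sketch of that distinction follows, with a hypothetical helper `tukey_fences` that is not part of the original text.

```python
import numpy as np

def tukey_fences(values, k):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = np.array([12, 15, 12, 14, 13, 16, 11, 15, 300, 12, 14])

inner_low, inner_high = tukey_fences(data, k=1.5)   # the usual outlier rule
outer_low, outer_high = tukey_fences(data, k=3.0)   # stricter rule for extreme values

outside_inner = (data < inner_low) | (data > inner_high)
outside_outer = (data < outer_low) | (data > outer_high)

mild = np.flatnonzero(outside_inner & ~outside_outer)
extreme = np.flatnonzero(outside_outer)
print("Inner fences:", (inner_low, inner_high))
print("Outer fences:", (outer_low, outer_high))
print("Mild outliers:", mild, "Extreme outliers:", extreme)
```

For the sample data the inner fences are 7.5 and 19.5 and the outer fences are 3.0 and 24.0, so the value 300 counts as an extreme outlier under either rule.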
2 Visualizing Outliers
a. Boxplot
import matplotlib.pyplot as plt
import seaborn as sns

# Points drawn beyond the whiskers are the boxplot's outlier candidates
plt.figure(figsize=(8,5))
sns.boxplot(data=data, orient='h')
plt.title('Outlier Detection with a Boxplot')
plt.show()
b. Scatter Plot
x = np.arange(len(data))
plt.figure(figsize=(10,5))
# Plot every point in blue, then redraw the Z-score outliers in red on top
plt.scatter(x, data, color='blue', label='Normal data')
plt.scatter(x[outliers_z], data[outliers_z], color='red', label='Outlier')
# Dashed lines mark the IQR bounds computed earlier
plt.axhline(y=upper_bound, color='r', linestyle='--', label='Upper bound')
plt.axhline(y=lower_bound, color='r', linestyle='--', label='Lower bound')
plt.legend()
plt.title('Scatter Plot with Outliers')
plt.show()
Advanced Outlier Detection Methods
1 DBSCAN (Density-Based Clustering)
from sklearn.cluster import DBSCAN

# Reshape the data into a single-feature column for clustering
X = data.reshape(-1, 1)

# Initialize DBSCAN; points that belong to no cluster receive the label -1
dbscan = DBSCAN(eps=2.5, min_samples=2)
clusters = dbscan.fit_predict(X)
outliers_dbscan = np.where(clusters == -1)[0]
print("Outlier indices (DBSCAN):", outliers_dbscan)
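The result depends heavily on `eps`. A common heuristic, not described in the original text, is to look at each point's distance to its nearest neighbor and pick `eps` in the gap between the typical distances and the isolated ones. The sketch below does this with scikit-learn's `NearestNeighbors` on the sample data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

data = np.array([12, 15, 12, 14, 13, 16, 11, 15, 300, 12, 14])
X = data.reshape(-1, 1)

# Ask for 2 neighbors per point: column 0 is always 0 (the point itself or an
# exact duplicate), column 1 is the distance to the nearest other point.
nn = NearestNeighbors(n_neighbors=2).fit(X)
distances, _ = nn.kneighbors(X)
nearest_dist = distances[:, 1]

print("Sorted nearest-neighbor distances:", np.sort(nearest_dist))
```

Every point except 300 has another point within distance 1, while 300 is 284 away from its nearest neighbor. Since `min_samples=2` counts the point itself, any `eps` between those two values, including the 2.5 used above, leaves only 300 labeled as noise.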
2 Isolation Forest
from sklearn.ensemble import IsolationForest

# Initialize the model; contamination is the expected share of outliers
iso_forest = IsolationForest(contamination=0.1, random_state=42)

# X is already a 2-D column, so no further reshape is needed; -1 marks an outlier
labels_iso = iso_forest.fit_predict(X)
outliers_iso = np.where(labels_iso == -1)[0]
print("Outlier indices (Isolation Forest):", outliers_iso)
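Because `contamination` fixes the share of points that get flagged, the model flags roughly that share even when the data contains no real outliers. Inspecting the raw anomaly scores gives a less binary picture. This is a small sketch using `score_samples`, where lower scores mean more anomalous. The choice to print the three lowest scores is arbitrary.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

data = np.array([12, 15, 12, 14, 13, 16, 11, 15, 300, 12, 14])
X = data.reshape(-1, 1)

iso_forest = IsolationForest(contamination=0.1, random_state=42).fit(X)

# score_samples: the lower the score, the easier the point is to isolate
scores = iso_forest.score_samples(X)
ranking = np.argsort(scores)          # most anomalous first

for idx in ranking[:3]:
    print(f"index={idx}  value={data[idx]}  score={scores[idx]:.3f}")
```

A large gap between the first score and the rest suggests a genuine outlier, whereas closely spaced scores suggest the flagged points are only the least typical of an otherwise ordinary sample.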
3 Local Outlier Factor (LOF)
from sklearn.neighbors import LocalOutlierFactor

# Compare each point's local density with that of its 5 nearest neighbors
lof = LocalOutlierFactor(n_neighbors=5, contamination=0.1)

# X is already a 2-D column; -1 marks an outlier
labels_lof = lof.fit_predict(X)
outliers_lof = np.where(labels_lof == -1)[0]
print("Outlier indices (LOF):", outliers_lof)
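Since the methods rest on different assumptions, one way to gain confidence is to check where they agree. The sketch below reruns four of the detectors with the same settings as above and keeps the points flagged by at least three of them. The voting rule and its threshold are illustrative assumptions, not something the original text prescribes.

```python
import numpy as np
from scipy import stats
from sklearn.cluster import DBSCAN
from sklearn.neighbors import LocalOutlierFactor

data = np.array([12, 15, 12, 14, 13, 16, 11, 15, 300, 12, 14])
X = data.reshape(-1, 1)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# One boolean mask per detector, using the same settings as in the text
flags = {
    "z_score": np.abs(stats.zscore(data)) > 3,
    "iqr": (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr),
    "dbscan": DBSCAN(eps=2.5, min_samples=2).fit_predict(X) == -1,
    "lof": LocalOutlierFactor(n_neighbors=5, contamination=0.1).fit_predict(X) == -1,
}

votes = np.sum(list(flags.values()), axis=0)
consensus = np.flatnonzero(votes >= 3)   # hypothetical rule: at least 3 of 4 agree
print("Votes per point:", votes)
print("Consensus outliers:", consensus)
```

Points flagged by only one method deserve a closer look before any action is taken, since they may reflect that method's settings rather than the data.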
Outlier Handling Techniques
1 Removing Outliers
# Drop the observations at the indices flagged by the IQR method
data_clean = np.delete(data, outliers_iqr)
print("Data after removing outliers:", data_clean)
2 Data Transformation
# Logarithmic transformation (defined only for positive values)
data_log = np.log(data[data > 0])

# Compare the distribution before and after the transformation
plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
sns.boxplot(data=data, orient='h')
plt.title('Before Transformation')
plt.subplot(1,2,2)
sns.boxplot(data=data_log, orient='h')
plt.title('After Log Transformation')
plt.show()
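The `data > 0` filter quietly drops zeros and negative values, which changes the length of the array. When the data can contain zeros, `np.log1p` (which computes log(1 + x)) is a common alternative that keeps every observation. The values below are made up for illustration.

```python
import numpy as np

values = np.array([0, 12, 15, 300])         # hypothetical data that includes a zero

log_filtered = np.log(values[values > 0])   # drops the zero: 3 values remain
log_shifted = np.log1p(values)              # log(1 + x): keeps all 4 values, maps 0 to 0

print(len(log_filtered), len(log_shifted))
print(np.expm1(log_shifted))                # expm1 inverts log1p
```

Neither variant handles negative values below -1. Those would need a shift or a different transformation.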
3 Value Imputation
from sklearn.impute import SimpleImputer

# Replace outliers with the median.
# The copy must be float: an integer array cannot hold NaN.
data_imputed = data.astype(float)
data_imputed[outliers_iqr] = np.nan

# The imputer fills each NaN with the median of the remaining values
imputer = SimpleImputer(strategy='median')
data_imputed = imputer.fit_transform(data_imputed.reshape(-1, 1))
print("Data after imputation:", data_imputed.flatten())
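A related option, not covered in the original list, is capping (often called winsorizing): extreme values are pulled back to the nearest IQR bound instead of being replaced by the median, which keeps their rank as the largest or smallest observation. A minimal sketch with `np.clip`:

```python
import numpy as np

data = np.array([12, 15, 12, 14, 13, 16, 11, 15, 300, 12, 14])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values beyond the bounds are pulled back to the nearest bound
data_capped = np.clip(data, lower_bound, upper_bound)
print("Data after capping:", data_capped)
```

Here 300 becomes 19.5 and every other value is unchanged.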
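4 Data Binning
# Four equally spaced edges define three equal-width bins between min and max
bins = np.linspace(min(data), max(data), 4)
# Digitize against the two interior edges so labels run 0..2 and the maximum stays in the last bin
data_binned = np.digitize(data, bins[1:-1])
print("Data after binning:", data_binned)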
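Equal-width bins are themselves sensitive to outliers, because a single extreme value stretches the range and pushes almost everything into the first bin. The sketch below compares equal-width edges with equal-frequency (quantile) edges on the sample data. The choice of three bins mirrors the example above, and the variable names are illustrative.

```python
import numpy as np

data = np.array([12, 15, 12, 14, 13, 16, 11, 15, 300, 12, 14])

# Equal-width: interior edges split [min, max] into three equal intervals
width_edges = np.linspace(data.min(), data.max(), 4)[1:-1]
# Equal-frequency: interior edges at the 1/3 and 2/3 quantiles
quantile_edges = np.quantile(data, [1 / 3, 2 / 3])

width_counts = np.bincount(np.digitize(data, width_edges), minlength=3)
quantile_counts = np.bincount(np.digitize(data, quantile_edges), minlength=3)

print("Equal-width bin counts:    ", width_counts)
print("Equal-frequency bin counts:", quantile_counts)
```

With equal-width edges the counts are 10, 0, and 1, so the binned feature carries almost no information. Quantile edges give 4, 3, and 4, with 300 simply landing in the top bin.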
Validating the Results
1 Statistical Comparison
import pandas as pd

# Summarize the original, cleaned, and imputed data side by side
print("\nStatistical Comparison:")
print(pd.DataFrame({
    'Original': [np.mean(data), np.median(data), np.std(data)],
    'Clean': [np.mean(data_clean), np.median(data_clean), np.std(data_clean)],
    'Imputed': [np.mean(data_imputed), np.median(data_imputed), np.std(data_imputed)]
}, index=['Mean', 'Median', 'Std Dev']))
2 Visual Comparison
plt.figure(figsize=(12,5))
methods = ['Original', 'Clean', 'Imputed', 'Log Transform']
values = [data, data_clean, data_imputed.flatten(), data_log]

# One boxplot per handling strategy, arranged in a 2x2 grid
for i, (method, val) in enumerate(zip(methods, values), 1):
    plt.subplot(2,2,i)
    sns.boxplot(data=val, orient='h')
    plt.title(method)
plt.tight_layout()
plt.show()
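Best Practices
1 Outlier Detection Pipeline
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def detect_outliers(X):
    """Replace values outside the 1.5*IQR bounds with NaN so the imputer can fill them."""
    X = np.asarray(X, dtype=float)
    Q1 = np.percentile(X, 25)
    Q3 = np.percentile(X, 75)
    IQR = Q3 - Q1
    is_outlier = (X < (Q1 - 1.5 * IQR)) | (X > (Q3 + 1.5 * IQR))
    # Returning the bare boolean mask would leave the imputer with nothing to impute
    return np.where(is_outlier, np.nan, X)

outlier_pipeline = Pipeline([
    ('detect', FunctionTransformer(detect_outliers)),
    ('impute', SimpleImputer(strategy='median'))
])
X_transformed = outlier_pipeline.fit_transform(data.reshape(-1, 1))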
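A `FunctionTransformer` is stateless: it recomputes the quartiles on whatever array it receives, so at prediction time the bounds come from the new data rather than from the training data. One possible way around this is a small custom transformer that learns the bounds in `fit` and reuses them in `transform`. The class name `IQROutlierMasker` and the `new` array below are hypothetical. This is a sketch, not the pipeline the original text describes.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

class IQROutlierMasker(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: learn IQR bounds in fit, mask outliers as NaN in transform."""

    def __init__(self, k=1.5):
        self.k = k

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        q1, q3 = np.percentile(X, [25, 75], axis=0)   # per-column quartiles
        iqr = q3 - q1
        self.lower_ = q1 - self.k * iqr
        self.upper_ = q3 + self.k * iqr
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        outside = (X < self.lower_) | (X > self.upper_)
        return np.where(outside, np.nan, X)

train = np.array([12, 15, 12, 14, 13, 16, 11, 15, 300, 12, 14]).reshape(-1, 1)
new = np.array([13, 500]).reshape(-1, 1)     # unseen data, one value far out of range

pipeline = Pipeline([
    ("mask", IQROutlierMasker(k=1.5)),
    ("impute", SimpleImputer(strategy="median")),
])
pipeline.fit(train)
result = pipeline.transform(new).ravel()
print(result)
```

The value 500 is judged against the bounds learned from `train` and replaced with the training median of the non-outlying values (13.5), while 13 passes through unchanged.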
2 Documenting the Process
# Build a plain-text report from the results computed above
outlier_report = f"""
OUTLIER ANALYSIS REPORT
=======================
Dataset: data_samples.npy
Total Observations: {len(data)}
Outliers Detected: {len(outliers_iqr)} ({len(outliers_iqr)/len(data):.1%})

Methods Applied:
1. IQR Method (Threshold: 1.5*IQR)
2. Z-Score Method (Threshold: ±3σ)
3. DBSCAN Clustering (eps=2.5)

Handling Strategy:
- Removed {len(outliers_iqr)} extreme outliers
- Applied log transform to reduce skewness

Statistics Before/After:
                Before     After
Mean          {np.mean(data):>8.2f}  {np.mean(data_clean):>8.2f}
Median        {np.median(data):>8.2f}  {np.median(data_clean):>8.2f}
Standard Dev  {np.std(data):>8.2f}  {np.std(data_clean):>8.2f}
"""
print(outlier_report)