Latar Belakang Masalah
Sebuah perusahaan telekomunikasi memiliki dataset keluhan pelanggan dengan masalah :
- Kolom tidak konsisten (huruf besar/kecil campur).
- Missing values di kolom penting (tanggal, kategori keluhan).
- Duplikat entri pelanggan.
- Kategori keluhan tidak standar.
Dataset Contoh (complaints_raw.csv
) :
id,date,customer_name,complaint_type,priority,status,response_time 1,2023-01-15,ANDI,"Jaringan Lemot","High","Pending",72 2,2023-02-10,Budi,"tidak bisa akses internet","high","resolved",48 3,2023-03-05,,NULL,"Medium","pending",NA 4,2023-01-15,ANDI,"jaringan lemot","High","Pending",72 5,2023-04-20,eka,"TV tidak nyala","low","resolved",24 6,2023-05-10,fani,"Billing Error","Urgent","Pending",96
Langkah-Langkah Cleanup dengan Siuba (dplyr-style)
Langkah 1 : Load Data dan Library
from siuba import * from siuba.dply.verbs import * import pandas as pd df = pd.read_csv("complaints_raw.csv") print("Data Awal:") print(df)
Output :
id date customer_name complaint_type priority status response_time 0 1 2023-01-15 ANDI Jaringan Lemot High Pending 72.0 1 2 2023-02-10 Budi tidak bisa akses internet high resolved 48.0 2 3 2023-03-05 NaN NULL Medium pending NaN 3 4 2023-01-15 ANDI jaringan lemot High Pending 72.0 4 5 2023-04-20 eka TV tidak nyala low resolved 24.0 5 6 2023-05-10 fani Billing Error Urgent Pending 96.0
Langkah 2 : Standarisasi Format Teks
df_clean = df >> mutate( customer_name = _.customer_name.str.title(), complaint_type = _.complaint_type.str.lower(), priority = _.priority.str.capitalize(), status = _.status.str.capitalize() ) print("Standarisasi Format Teks:") print(df_clean >> select(_.customer_name, _.complaint_type, _.priority))
Output :
customer_name complaint_type priority 0 Andi jaringan lemot High 1 Budi tidak bisa akses internet High 2 None null Medium 3 Andi jaringan lemot High 4 Eka tv tidak nyala Low 5 Fani billing error Urgent
Langkah 3 : Menangani Missing Values
df_clean = df_clean >> mutate( customer_name = _.customer_name.fillna("Anonim"), complaint_type = _.complaint_type.replace("null", "unknown"), response_time = _.response_time.fillna(_.response_time.median()) ) print("Handle Missing Values:") print(df_clean >> filter(_.id.isin([2, 3])))
Output :
id date customer_name complaint_type priority status response_time 1 2 2023-02-10 Budi tidak bisa akses internet High Resolved 48.0 2 3 2023-03-05 Anonim unknown Medium Pending 72.0 # Diisi median
Langkah 4 : Hapus Duplikat
df_clean = df_clean >> distinct(_.date, _.customer_name, _.complaint_type, keep="first") print("Data Tanpa Duplikat:") print(df_clean)
Output :
id date customer_name complaint_type priority status response_time 0 1 2023-01-15 Andi jaringan lemot High Pending 72.0 1 2 2023-02-10 Budi tidak bisa akses internet High Resolved 48.0 2 3 2023-03-05 Anonim unknown Medium Pending 72.0 4 5 2023-04-20 Eka tv tidak nyala Low Resolved 24.0 5 6 2023-05-10 Fani billing error Urgent Pending 96.0
Langkah 5 : Standarisasi Kategori
priority_map = {"High": "H", "Urgent": "H", "Medium": "M", "Low": "L"} df_clean = df_clean >> mutate( priority = _.priority.replace(priority_map), complaint_category = case_when( _.complaint_type.str.contains("jaringan|internet"), "Jaringan", _.complaint_type.str.contains("billing"), "Tagihan", _.complaint_type.str.contains("tv"), "Layanan TV", True, "Lainnya" ) ) print("Kategori Standar:") print(df_clean >> select(_.complaint_type, _.complaint_category, _.priority))
Output :
complaint_type complaint_category priority 0 jaringan lemot Jaringan H 1 tidak bisa akses internet Jaringan H 2 unknown Lainnya M 3 tv tidak nyala Layanan TV L 4 billing error Tagihan H
Langkah 6 : Filter dan Transformasi Lanjutan
# Hitung SLA violation (response_time > 72 jam) df_final = df_clean >> mutate( is_overdue = _.response_time > 72, resolution_days = _.response_time / 24 ) >> filter( _.status == "Pending" ) >> arrange( _.priority, _.resolution_days ) print("Data Final untuk Tim CS:") print(df_final)
Output :
id date customer_name complaint_type priority status response_time complaint_category is_overdue resolution_days 5 6 2023-05-10 Fani billing error H Pending 96.0 Tagihan True 4.0 0 1 2023-01-15 Andi jaringan lemot H Pending 72.0 Jaringan False 3.0 2 3 2023-03-05 Anonim unknown M Pending 72.0 Lainnya False 3.0
Visualisasi Hasil Cleanup
import matplotlib.pyplot as plt # Plot jumlah keluhan per kategori (df_final >> group_by(_.complaint_category) >> summarize(count=_.id.count()) >> mutate(complaint_category=_.complaint_category.str.replace(" ", "\n")) ).plot.bar( x="complaint_category", y="count", title="Keluhan per Kategori" ) plt.show()
Kesimpulan
- dplyr-style (Siuba) membuat kode lebih ringkas vs Pandas murni untuk :
- Standarisasi teks (
mutate + str.methods
). - Handling missing values dengan pipe.
- Transformasi kondisi kompleks (
case_when
).
- Standarisasi teks (
- Pola Umum Cleanup :
(df >> mutate(...) # Transformasi kolom >> filter(...) # Filter data >> group_by(...) >> summarize(...) # Aggregasi )
Tips untuk Dataset Besar :
- GunakanÂ
>>
untuk debugging tiap step :df_transformed = df >> mutate(...) >> pipe(print) >> filter(...)