Studi Kasus Cleanup Dataset Layanan Pelanggan yang Berantakan dengan Dplyr (Siuba)

Studi Kasus Cleanup Dataset Layanan Pelanggan yang Berantakan dengan Dplyr (Siuba)

Latar Belakang Masalah

Sebuah perusahaan telekomunikasi memiliki dataset keluhan pelanggan dengan masalah :

  • Kolom tidak konsisten (huruf besar/kecil campur).
  • Missing values di kolom penting (tanggal, kategori keluhan).
  • Duplikat entri pelanggan.
  • Kategori keluhan tidak standar.

Dataset Contoh (complaints_raw.csv) :

csv
id,date,customer_name,complaint_type,priority,status,response_time
1,2023-01-15,ANDI,"Jaringan Lemot","High","Pending",72
2,2023-02-10,Budi,"tidak bisa akses internet","high","resolved",48
3,2023-03-05,,NULL,"Medium","pending",NA
4,2023-01-15,ANDI,"jaringan lemot","High","Pending",72
5,2023-04-20,eka,"TV tidak nyala","low","resolved",24
6,2023-05-10,fani,"Billing Error","Urgent","Pending",96

Langkah-Langkah Cleanup dengan Siuba (dplyr-style)

Langkah 1 : Load Data dan Library

python
from siuba import *
from siuba.dply.verbs import *
import pandas as pd

df = pd.read_csv("complaints_raw.csv")
print("Data Awal:")
print(df)

Output :

   id        date customer_name             complaint_type priority   status  response_time
0   1  2023-01-15          ANDI             Jaringan Lemot     High  Pending           72.0
1   2  2023-02-10          Budi  tidak bisa akses internet     high  resolved          48.0
2   3  2023-03-05           NaN                       NULL   Medium  pending            NaN
3   4  2023-01-15          ANDI             jaringan lemot     High  Pending           72.0
4   5  2023-04-20           eka             TV tidak nyala      low  resolved          24.0
5   6  2023-05-10          fani              Billing Error   Urgent  Pending           96.0

Langkah 2 : Standarisasi Format Teks

python
df_clean = df >> mutate(
    customer_name = _.customer_name.str.title(),
    complaint_type = _.complaint_type.str.lower(),
    priority = _.priority.str.capitalize(),
    status = _.status.str.capitalize()
)
print("Standarisasi Format Teks:")
print(df_clean >> select(_.customer_name, _.complaint_type, _.priority))

Output :

  customer_name             complaint_type  priority
0          Andi             jaringan lemot     High
1          Budi  tidak bisa akses internet     High
2          None                       null   Medium
3          Andi             jaringan lemot     High
4           Eka             tv tidak nyala      Low
5          Fani              billing error   Urgent

Langkah 3 : Menangani Missing Values

python
df_clean = df_clean >> mutate(
    customer_name = _.customer_name.fillna("Anonim"),
    complaint_type = _.complaint_type.replace("null", "unknown"),
    response_time = _.response_time.fillna(_.response_time.median())
)
print("Handle Missing Values:")
print(df_clean >> filter(_.id.isin([2, 3])))

Output :

   id        date customer_name             complaint_type priority    status  response_time
1   2  2023-02-10          Budi  tidak bisa akses internet     High  Resolved           48.0
2   3  2023-03-05       Anonim                     unknown   Medium   Pending           72.0  # Diisi median

Langkah 4 : Hapus Duplikat

python
df_clean = df_clean >> distinct(_.date, _.customer_name, _.complaint_type, keep="first")
print("Data Tanpa Duplikat:")
print(df_clean)

Output :

   id        date customer_name             complaint_type priority   status  response_time
0   1  2023-01-15          Andi             jaringan lemot     High  Pending           72.0
1   2  2023-02-10          Budi  tidak bisa akses internet     High  Resolved          48.0
2   3  2023-03-05       Anonim                     unknown   Medium  Pending           72.0
4   5  2023-04-20           Eka             tv tidak nyala      Low  Resolved          24.0
5   6  2023-05-10          Fani              billing error   Urgent  Pending           96.0

Langkah 5 : Standarisasi Kategori

python
priority_map = {"High": "H", "Urgent": "H", "Medium": "M", "Low": "L"}

df_clean = df_clean >> mutate(
    priority = _.priority.replace(priority_map),
    complaint_category = case_when(
        _.complaint_type.str.contains("jaringan|internet"), "Jaringan",
        _.complaint_type.str.contains("billing"), "Tagihan",
        _.complaint_type.str.contains("tv"), "Layanan TV",
        True, "Lainnya"
    )
)
print("Kategori Standar:")
print(df_clean >> select(_.complaint_type, _.complaint_category, _.priority))

Output :

           complaint_type complaint_category priority
0            jaringan lemot           Jaringan        H
1  tidak bisa akses internet           Jaringan        H
2                  unknown            Lainnya        M
3           tv tidak nyala        Layanan TV        L
4            billing error           Tagihan        H

Langkah 6 : Filter dan Transformasi Lanjutan

python
# Hitung SLA violation (response_time > 72 jam)
df_final = df_clean >> mutate(
    is_overdue = _.response_time > 72,
    resolution_days = _.response_time / 24
) >> filter(
    _.status == "Pending"
) >> arrange(
    _.priority, _.resolution_days
)
print("Data Final untuk Tim CS:")
print(df_final)

Output :

   id        date customer_name complaint_type priority   status  response_time complaint_category  is_overdue  resolution_days
5   6  2023-05-10          Fani  billing error        H  Pending           96.0           Tagihan        True             4.0
0   1  2023-01-15          Andi jaringan lemot        H  Pending           72.0           Jaringan       False             3.0
2   3  2023-03-05        Anonim        unknown        M  Pending           72.0            Lainnya       False             3.0

Visualisasi Hasil Cleanup

python
import matplotlib.pyplot as plt

# Plot jumlah keluhan per kategori
(df_final 
 >> group_by(_.complaint_category) 
 >> summarize(count=_.id.count()) 
 >> mutate(complaint_category=_.complaint_category.str.replace(" ", "\n"))
).plot.bar(
    x="complaint_category", 
    y="count",
    title="Keluhan per Kategori"
)
plt.show()

Kesimpulan

  1. dplyr-style (Siuba) membuat kode lebih ringkas vs Pandas murni untuk :
    • Standarisasi teks (mutate + str.methods).
    • Handling missing values dengan pipe.
    • Transformasi kondisi kompleks (case_when).
  2. Pola Umum Cleanup :
    python
    (df
     >> mutate(...)  # Transformasi kolom
     >> filter(...)  # Filter data
     >> group_by(...) >> summarize(...)  # Aggregasi
    )

Tips untuk Dataset Besar :

  • Gunakan >> untuk debugging tiap step :
    python
    Copy
    df_transformed = df >> mutate(...) >> pipe(print) >> filter(...)

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply

Your email address will not be published. Required fields are marked *