DAY27 Removing Duplicates in Pandas with drop_duplicates

2024 iThome Ironman Contest

DAY 27

Generous Sharing: Self-Learning Skills for IT People

Moving Steadily Forward on the Road of Pandas Data Manipulation and Analysis, Part 27 of the series

16th Ironman Contest

٩۹(๑•̀ω•́ ๑)۶

2024-09-01 00:17:37

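Two days ago we covered the handy Pandas dropna method, which removes rows containing NaN.
Today we look at another very practical method, drop_duplicates,
which removes duplicate rows.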

Example

First, build some data as a DataFrame,
or convert imported data into a DataFrame.
To make the comparison easier, we print the full data first.

P.S. Two duplicated rows are deliberately included here.

import pandas as pd

# Sample data: the rows for students 001 and 002 each appear twice
studentsData = {
    'studentId': ['001', '002', '001', '002', '003'],
    'Name': ['A', 'B', 'A', 'B', 'C'],
    'Height': [175, 153, 175, 153, 164],
    'Weight': [80, 45, 80, 45, 75],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago']
}
students = pd.DataFrame(studentsData)
print(students)

The printed data is shown below.
You can see that the rows at index 0 and 2 are identical, as are the rows at index 1 and 3.

  studentId Name  Height  Weight         City
0       001    A     175      80     New York
1       002    B     153      45  Los Angeles
2       001    A     175      80     New York
3       002    B     153      45  Los Angeles
4       003    C     164      75      Chicago
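Before deleting anything, it can help to confirm which rows Pandas considers duplicates. The sketch below rebuilds the same sample data and uses the DataFrame's duplicated() method, which is not covered in the original walkthrough; the names dup_mask and dup_count are just illustrative.

```python
import pandas as pd

# Same sample data as in the example above
students = pd.DataFrame({
    'studentId': ['001', '002', '001', '002', '003'],
    'Name': ['A', 'B', 'A', 'B', 'C'],
    'Height': [175, 153, 175, 153, 164],
    'Weight': [80, 45, 80, 45, 75],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago'],
})

# True marks a row that repeats an earlier row;
# the first occurrence of each row stays False.
dup_mask = students.duplicated()
print(dup_mask)

# Count how many rows repeat an earlier one
dup_count = int(dup_mask.sum())
print(dup_count)
```

With this data the mask should be True only for index 2 and 3, matching the rows spotted by eye.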

Removing duplicate rows

The syntax here is very simple:
just call .drop_duplicates() on the DataFrame.
It is used as follows.

print(students.drop_duplicates())

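The printed result is shown below.
You can see that when rows are duplicated, the first occurrence is kept,
and the later duplicate rows at index 2 and 3 have been removed.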

  studentId Name  Height  Weight         City
0       001    A     175      80     New York
1       002    B     153      45  Los Angeles
4       003    C     164      75      Chicago
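The example above uses the default behavior. As a rough sketch of a few other options, the block below tries the keep, subset, and ignore_index parameters of drop_duplicates on the same data. It also shows that the call returns a new DataFrame and leaves students untouched, so the result has to be assigned if you want to keep it. The variable names are only illustrative, and ignore_index assumes pandas 1.0 or later.

```python
import pandas as pd

# Same sample data as in the example above
students = pd.DataFrame({
    'studentId': ['001', '002', '001', '002', '003'],
    'Name': ['A', 'B', 'A', 'B', 'C'],
    'Height': [175, 153, 175, 153, 164],
    'Weight': [80, 45, 80, 45, 75],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago'],
})

# Default: keep the first occurrence; the result is a new DataFrame
deduped = students.drop_duplicates()

# keep='last': keep the last occurrence of each duplicated row instead
keep_last = students.drop_duplicates(keep='last')

# keep=False: drop every row that has a duplicate, keeping none of them
keep_none = students.drop_duplicates(keep=False)

# subset: compare only the listed columns when deciding what is a duplicate
by_id = students.drop_duplicates(subset=['studentId'])

# ignore_index=True: renumber the result 0, 1, 2, ... (pandas >= 1.0)
renumbered = students.drop_duplicates(ignore_index=True)

print(len(students))            # the original still has all 5 rows
print(list(deduped.index))
print(list(keep_last.index))
print(list(keep_none.index))
print(list(renumbered.index))
```

Choosing keep='first' or keep='last' only matters when the duplicated rows differ in columns outside subset; for fully identical rows, as here, the only visible difference is which index labels survive.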