除了 sum()，Python pandas 還有哪些更新資料的方法？【Python 處理 Excel #13】

2024 iThome 鐵人賽

DAY 13

Python

30 天學會用 Python pandas 和 openpyxl 處理 Excel —— 成為用 Python 處理 Excel 檔案的高手系列第 13 篇

16th鐵人賽 python excel pandas

april

2024-09-26 19:43:02

225 瀏覽

分享至

本篇文章同步發布於除了 sum()，Python pandas 還有哪些更新資料的方法？【Python 處理 Excel #13】

前言

前一篇文章案例透過 pandas 的 iterrows 方法逐一取得每張訂單在當月的出貨數量，這個過程中使用了 sum() 計算當月的總出貨量。除了 sum 這個方法，在 iterrows 的迴圈中也可以使用不同的方法更新資料。這篇文章介紹 Python pandas 還有哪些更新資料的方法。

案例說明

現在有兩個 Excel 檔案，order.xlsx 有訂單的相關資訊，shipment.xlsx 記錄訂單的出貨記錄，這篇文章中的 shipment.xlsx 和前一篇文章相比增加了 ship_date、unit_price 跟 shipping_cost 欄位。

order.xlsx 的內容

order_id	product_name
101	Laptop Pro
102	Gaming Desktop
103	Ultrabook

shipment.xlsx 的內容

order_id	ship_date	ship_qty	unit_price	shipping_cost
101	2024/7/5	4	1000	50
102	2024/7/10	2	1500	75
101	2024/8/15	3	1000	50
103	2024/8/20	1	800	40
102	2024/9/1	5	1500	75
101	2024/9/5	4	1000	50
102	2024/9/10	2	1500	75
101	2024/9/15	1	1000	50
103	2024/9/20	3	800	40
102	2024/9/25	2	1500	75

使用不同的方法更新 order_df

import pandas as pd
from datetime import datetime, timedelta
import calendar

# 讀取 EXCEL 檔案
order_df = pd.read_excel('order.xlsx')
shipment_df = pd.read_excel('shipment.xlsx')

# 轉換 ship_date 為 datetime 類型
shipment_df['ship_date'] = pd.to_datetime(shipment_df['ship_date'])

# 新增欄位
order_df['current_month_ship_qty'] = 0
order_df['latest_ship_date'] = pd.NaT
order_df['total_revenue'] = 0.0
order_df['avg_shipping_cost'] = 0.0
order_df['max_single_shipment'] = 0

# 尋找上個月的最後一天
today = datetime.now()
prev_month_end = today.replace(day=1) - timedelta(days=1)

# 篩選當月資料
current_month_mask = prev_month_end < shipment_df['ship_date']
current_month_shipment_df = shipment_df[current_month_mask].copy()

# 更新資料
for index, row in order_df.iterrows():
    order_id = row['order_id']
    mask = current_month_shipment_df['order_id'] == order_id
    matching_rows = current_month_shipment_df[mask]
    
    if not matching_rows.empty:
        # 1. 計算總出貨量 (與前一篇文章相同)
        order_df.at[index, 'current_month_ship_qty'] = matching_rows['ship_qty'].sum()
        
        # 2. 找出最新的出貨日期
        order_df.at[index, 'latest_ship_date'] = matching_rows['ship_date'].max()
        
        # 3. 計算總收入 (數量 * 單價)
        total_revenue = (matching_rows['ship_qty'] * matching_rows['unit_price']).sum()
        order_df.at[index, 'total_revenue'] = total_revenue
        
        # 4. 計算平均運輸成本
        order_df.at[index, 'avg_shipping_cost'] = matching_rows['shipping_cost'].mean()
        
        # 5. 找出最大的單次出貨數量
        order_df.at[index, 'max_single_shipment'] = matching_rows['ship_qty'].max()

# 顯示更新後的訂單 DataFrame
print(order_df)

解釋

這個例子演練了 5 種不同的更新方式：

總和 (sum)：與前一篇文章相同，計算當月的總出貨量。
最大值 (max) ：找出最新的出貨日期，使用 max 方法。
一般數學計算：計算總收入，將出貨量和單價相乘後求和。
平均值 (mean)：計算平均運輸成本，使用 mean 方法。
最大值 (max)：找出了最大的單次出貨數量，再次使用 max 方法。

其他可能的更新方式

除了上述方法，還有許多其他可能的更新方式，例如：

最小值 (min)：例如，找出最早的出貨日期或最小的單次出貨數量。
計數 (count)：計算符合某條件的列數，例如matching_rows['ship_qty'].gt(3).sum() 可以計算出貨數量大於 3 的列數。
第一個或最後一個值：使用 iloc 來獲取第一個 (iloc[0]) 或最後一個 (iloc[-1]) 匹配資料列的某個值。
條件更新：基於某些條件來決定如何更新，例如：

if matching_rows['ship_qty'].sum() > 10:
    order_df.at[index, 'status'] = 'High Volume'
else:
    order_df.at[index, 'status'] = 'Normal'

考慮 iterrows 的執行效率

雖然 iterrows 方法能用靈活更新 DataFrame，但對於大型 DataFrame，它可能不是最有效率的選擇。因此在使用 iterrows 方法前，記得先考慮使用 apply 方法或其他向量化操作。例如，可以使用 apply 方法優化總收入的計算方式：

def calculate_revenue(order_id):
    mask = current_month_shipment_df['order_id'] == order_id
    matching_rows = current_month_shipment_df[mask]
    return (matching_rows['ship_qty'] * matching_rows['unit_price']).sum()

order_df['total_revenue'] = order_df['order_id'].apply(calculate_revenue)

apply 方法通常比使用 iterrows 方法更快，尤其對於大型 DataFrame。

總結

iterrows 方法讓使用者能實現各種複雜的數據更新。
常見的更新方法包括加總求和、求最大/最小值、計算平均值、條件更新等。對於每種更新操作，pandas 通常提供了對應方法，例如 sum()、 max()、mean() 等。
處理大型 DataFame 時，記得先考慮效率較高的方法。
依照具體的業務需求，可以組合使用多種更新方法以獲取所需的結果。