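As with NumPy arrays of different dimensions, pandas also defines arithmetic between a DataFrame and a Series.
As a motivating example, consider the difference between a two-dimensional array and one of its rows: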
In [207]: arr = np.arange(12.).reshape((3, 4))
In [208]: arr
Out[208]:
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])
In [209]: arr[0]
Out[209]: array([0., 1., 2., 3.])
In [210]: arr - arr[0]
Out[210]:
array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])
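When arr[0] is subtracted from arr, the subtraction is performed once for each row. This is called broadcasting, and it is how NumPy arrays behave in general. Operations between a DataFrame and a Series work in much the same way: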
In [211]: frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                               columns=list("bde"),
                               index=["Utah", "Ohio", "Texas", "Oregon"])
In [212]: series = frame.iloc[0]
In [213]: frame
Out[213]:
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
In [214]: series
Out[214]:
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64
By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows:
In [215]: frame - series
Out[215]:
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0
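The default can be cross-checked in a couple of ways. The following is a minimal sketch, assuming pandas and NumPy are installed; the names by_operator, by_method, and by_numpy are illustrative only and do not come from the original session.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
series = frame.iloc[0]

# The operator form matches the Series index against the columns.
by_operator = frame - series

# The method form with axis="columns" spells out the same default.
by_method = frame.sub(series, axis="columns")

# The labels already line up in this case, so plain NumPy broadcasting
# on the underlying arrays gives the same numbers.
by_numpy = frame.to_numpy() - series.to_numpy()

print(by_operator.equals(by_method))
print(np.array_equal(by_operator.to_numpy(), by_numpy))
```

The NumPy comparison only holds because the labels are identical and in the same order. pandas aligns by label, whereas NumPy aligns purely by position.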
If an index value is not found in either the DataFrame's columns or the Series's index, the objects are reindexed to form the union:
In [216]: series2 = pd.Series(np.arange(3), index=["b", "e", "f"])
In [217]: series2
Out[217]:
b    0
e    1
f    2
dtype: int64
In [218]: frame + series2
Out[218]:
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN
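If you want to match on the rows instead and broadcast across the columns, you have to use one of the arithmetic methods and specify that the match should be on the index: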
In [219]: series3 = frame["d"]
In [220]: frame
Out[220]:
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
In [221]: series3
Out[221]:
Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64
In [222]: frame.sub(series3, axis="index")
Out[222]:
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0
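The axis passed to the method is the axis to match on. As a rough check, the same result can be reached by other routes. This is a minimal sketch; the names by_rows, same_with_int, and via_transpose are illustrative and not part of the original session.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])
series3 = frame["d"]

# Match the Series index against the row labels, broadcast across columns.
by_rows = frame.sub(series3, axis="index")

# axis=0 is the integer spelling of axis="index".
same_with_int = frame.sub(series3, axis=0)

# Transposing moves the row labels into the columns, so the default
# operator alignment applies; transposing back restores the layout.
via_transpose = (frame.T - series3).T

print(by_rows.equals(same_with_int))
print(by_rows.equals(via_transpose))
```

The method form is usually easier to read than the double transpose, which is shown here only to illustrate which labels are being matched.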
That's all for today. See you tomorrow, bye!