[Day13] Learning Pandas - Grouping Data - iT 邦幫忙

2019 iT 邦幫忙 Ironman Contest

DAY 13

AI & Data

Python: From Basics to Stock Market Analysis series, part 13


Summer

Team: 浪流連九程式匠自然產生的佛系碼農專區

2018-10-28 10:58:54


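Guide

Environment setup
- Install Anaconda
- Install Jupyter notebook
Numpy
Pandas
Code location
- github
  This is the author's first time learning Python and writing technical articles, so the layout may be a little messy and some concepts may be wrong. If anything is unclear, feel free to raise it for discussion.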

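Preface

Today is day 13 of the Ironman challenge and, yes, we are still on Pandas; a few more days and we can move on. Today's topic is aggregating and grouping data in Pandas.

Aggregation: sum(), mean(), median(), min(), max()

The following demonstrates these functions on a Series.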

import numpy as np
import pandas as pd
ironman_ser = pd.Series(np.random.randint(10, size=6))
print(ironman_ser)
print('Sum of the elements ------>', ironman_ser.sum())
print('Mean of the elements ----->', ironman_ser.mean())
print('Median of the elements --->', ironman_ser.median())
print('Minimum element ---------->', ironman_ser.min())
print('Maximum element ---------->', ironman_ser.max())

# Output
0    7
1    4
2    6
3    7
4    7
5    8
dtype: int32
Sum of the elements ------> 39
Mean of the elements -----> 6.5
Median of the elements ---> 7.0
Minimum element ----------> 4
Maximum element ----------> 8
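
Because the example above draws random numbers, its output changes on every run. The sketch below hard-codes the six values from the sample output (an assumption made here purely for reproducibility) and shows how median() averages the two middle values when the number of elements is even.

```python
import pandas as pd

# Values copied from the sample output so the result is reproducible
ser = pd.Series([7, 4, 6, 7, 7, 8])

total = ser.sum()       # 7 + 4 + 6 + 7 + 7 + 8
average = ser.mean()    # total divided by the 6 elements
middle = ser.median()   # sorted: 4 6 7 7 7 8, so the mean of the two middle values

print(total, average, middle, ser.min(), ser.max())
```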

The same functions work on a DataFrame. By default the result is computed per column; to compute per row instead, pass the axis argument.

ironman_df = pd.DataFrame({'A':np.random.randint(10, size=6), 'B':np.random.randint(10, size=6)})
print(ironman_df)
ironman_df.sum()  # to compute per row, use ironman_df.sum(axis='columns')

# Output
   A  B
0  3  6
1  3  1
2  1  8
3  0  6
4  4  5
5  0  1

A    11
B    27
dtype: int64
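
To make the effect of axis concrete, here is a small sketch that reuses the values from the sample output (hard-coded here as an assumption, since the original data is random) and computes both the per-column and the per-row sums.

```python
import pandas as pd

# Values copied from the sample output above
df = pd.DataFrame({'A': [3, 3, 1, 0, 4, 0], 'B': [6, 1, 8, 6, 5, 1]})

col_sums = df.sum()                # default: one value per column
row_sums = df.sum(axis='columns')  # one value per row (same as axis=1)

print(col_sums)
print(row_sums)
```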

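Aggregation methods available in Pandas

Method	Description
count()	number of elements
first(),last()	first and last element
mean(),median()	mean, median
min(),max()	minimum, maximum
std(),var()	standard deviation, variance
mad()	mean absolute deviation (removed in pandas 2.0)
prod()	product of all elements
sum()	sum of all elements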
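
A few of the methods in the table, applied to a small hard-coded Series (the values are an assumption chosen for illustration). Since mad() is no longer available in recent pandas releases, the sketch computes the mean absolute deviation by hand.

```python
import pandas as pd

ser = pd.Series([7, 4, 6, 7, 7, 8])

n = ser.count()          # number of non-null elements
product = ser.prod()     # 7 * 4 * 6 * 7 * 7 * 8
spread = ser.std()       # sample standard deviation (ddof=1 by default)
variance = ser.var()     # sample variance (ddof=1 by default)

# Manual equivalent of mad(): mean of the absolute deviations from the mean
mean_abs_dev = (ser - ser.mean()).abs().mean()

print(n, product, spread, variance, mean_abs_dev)
```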

GroupBy

The aggregations shown so far operate on all elements at once. To aggregate separately for each value of a key, use groupby.
The example below builds a DataFrame and computes the total for each team ('A', 'B', 'C').

ironman_df = pd.DataFrame({'team':['A','B','C','C','B','B'],'number':np.random.randint(10, size=6)}, columns=['team','number'])
print(ironman_df)
ironman_df.groupby('team').sum()

# Output
  team  number
0    A       9
1    B       6
2    C       9
3    C       4
4    B       2
5    B       9
      number
team
A          9
B         17
C         13
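
groupby splits the rows into one sub-table per key, applies the aggregation to each, and combines the results. The sketch below makes the split step visible by iterating over the groups; the numbers are copied from the sample output as an assumption so the run is reproducible.

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'B', 'C', 'C', 'B', 'B'],
                   'number': [9, 6, 9, 4, 2, 9]})

grouped = df.groupby('team')

# Split: each group is a sub-DataFrame holding the rows of one team
for name, group in grouped:
    print(name, group['number'].tolist())

# Apply + combine: one total per team
totals = grouped['number'].sum()
print(totals)
```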

aggregate()
To get sum(), min() and max() in a single call, use aggregate.

ironman_df.groupby('team').aggregate(['sum','max','min'])

# Output
     number
        sum max min
team
A         9   9   9
B        17   9   2
C        13   9   4
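
As a variation, agg (an alias of aggregate) also accepts keyword arguments that name each output column, which avoids the two-level column header. This is a sketch assuming pandas 0.25 or later; the output names total, largest and smallest are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'B', 'C', 'C', 'B', 'B'],
                   'number': [9, 6, 9, 4, 2, 9]})

# Each keyword is an output column: (source column, aggregation)
summary = df.groupby('team').agg(
    total=('number', 'sum'),
    largest=('number', 'max'),
    smallest=('number', 'min'),
)
print(summary)
```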

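filter()
Drops the groups whose computed result does not satisfy a condition. Example: keep every team whose total is greater than 10 and drop the rest.

def filter_func(x):
    return x['number'].sum() > 10
ironman_df.groupby('team').filter(filter_func)

# Output (team A has been filtered out)
  team  number
1    B       6
2    C       9
3    C       4
4    B       2
5    B       9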
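
Note that filter() keeps or drops whole groups but returns the original rows, not one row per group. One way to see this is to build the same selection with a boolean mask, as in the sketch below (data hard-coded from the sample output as an assumption).

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'B', 'C', 'C', 'B', 'B'],
                   'number': [9, 6, 9, 4, 2, 9]})

# Keep the rows of every team whose total exceeds 10
kept = df.groupby('team').filter(lambda g: g['number'].sum() > 10)

# Same selection: broadcast each team's total to its rows, then mask
mask = df.groupby('team')['number'].transform('sum') > 10
kept_by_mask = df[mask]

print(kept)
```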

transform
When you need to combine the original values with a per-group result, use transform. Example: subtract each team's total from each of its elements.

ironman_df.groupby('team').transform(lambda x: x - x.sum())

# Output
   number
0       0
1     -11
2      -4
3      -9
4     -15
5      -8
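
The key property of transform is that its result has the same length and index as the input, so it can be stored as a new column. A minimal sketch, with data hard-coded from the sample output and a column name share that is made up for this example:

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'B', 'C', 'C', 'B', 'B'],
                   'number': [9, 6, 9, 4, 2, 9]})

# Each row receives the total of its own team
team_total = df.groupby('team')['number'].transform('sum')

# Fraction of the team total contributed by each row
df['share'] = df['number'] / team_total
print(df)
```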

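apply
apply calls an arbitrary function on each group; the function may return a Pandas object or a scalar (this is the main difference from transform). Example: subtract each team's total from each of its elements.

def apply_func(x):
    x['number'] = x['number'] - x['number'].sum()
    return x
# Note: depending on the pandas version, the result may also carry
# the team key as an extra index level
ironman_df.groupby('team').apply(apply_func)

# Output
  team  number
0    A       0
1    B     -11
2    C      -4
3    C      -9
4    B     -15
5    B      -8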
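
To illustrate the scalar case, the sketch below has apply return a single number per group (the range between the largest and smallest value), and contrasts it with transform, which always returns one value per input row. The data is hard-coded from the sample output as an assumption.

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'B', 'C', 'C', 'B', 'B'],
                   'number': [9, 6, 9, 4, 2, 9]})

numbers = df.groupby('team')['number']

# apply with a scalar result: one value per team
value_range = numbers.apply(lambda s: s.max() - s.min())

# transform: one value per original row
centered = numbers.transform(lambda s: s - s.mean())

print(value_range)
print(centered)
```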

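Other ways to specify the groupby key
The following defines custom group labels, one per row, and sums by them.

newList = [0,1,1,2,0,2]
# select 'number' so that the string column 'team' is not aggregated
ironman_df.groupby(newList)[['number']].sum()

# Output
   number
0      11
1      15
2      13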
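
Besides a list of labels, groupby also accepts a dictionary or a function that maps index values to group names. The sketch below assumes the team letters are used as the index; the region names in the mapping are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'number': [9, 6, 9, 4, 2, 9]},
                  index=['A', 'B', 'C', 'C', 'B', 'B'])

# Dictionary: index label -> group name (hypothetical regions)
mapping = {'A': 'east', 'B': 'west', 'C': 'east'}
by_region = df.groupby(mapping).sum()

# Function: called on every index label to produce the group name
by_lower = df.groupby(str.lower).sum()

print(by_region)
print(by_lower)
```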

Pivot Table

This section uses one of seaborn's built-in datasets to demonstrate pivot tables; see here for the list of datasets seaborn ships with.
Parameters of the pivot_table function
pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
The example below shows the mean survival rate on the Titanic, broken down by sex and passenger class.

import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.pivot_table('survived', index='sex', columns='class')

# Output (aggfunc defaults to the mean)
class	   First	  Second	    Third
sex			
female	0.968085	0.921053	0.500000
male	0.368852	0.157407	0.135447
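
A pivot table is essentially a two-key groupby whose second key is spread across the columns. The sketch below shows this on a tiny hand-made table (hypothetical passengers, not the real Titanic data, so it runs without downloading anything) and also passes a different aggfunc.

```python
import pandas as pd

# Hypothetical miniature passenger table
passengers = pd.DataFrame({
    'sex':      ['female', 'female', 'male', 'male', 'female', 'male'],
    'class':    ['First', 'Third', 'First', 'Third', 'Third', 'Third'],
    'survived': [1, 1, 0, 0, 0, 1],
})

# Mean survival per (sex, class), with class spread across the columns
table = passengers.pivot_table('survived', index='sex', columns='class')

# The same numbers via groupby followed by unstack
same = passengers.groupby(['sex', 'class'])['survived'].mean().unstack()

# A different aggregation: number of survivors instead of the rate
survivors = passengers.pivot_table('survived', index='sex',
                                   columns='class', aggfunc='sum')
print(table)
print(survivors)
```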

Vectorized string operations

The available functions are listed below
len() lower() translate() islower()
ljust() upper() startswith() isupper()
rjust() find() endswith() isnumeric()
center() rfind() isalnum() isdecimal()
zfill() index() isalpha() split()
strip() rindex() isdigit() rsplit()
rstrip() capitalize() isspace() partition()
lstrip() swapcase() istitle() rpartition()
match() extract() findall() replace()
contains() count() get() slice_replace()
slice() cat() repeat() normalize()
pad() wrap() join() get_dummies()

Example: convert a Series to lowercase strings. For a DataFrame, apply the operation to a specific column.

ironman = pd.Series(['Elephant','Lion','Lamigo','Guardian'])
ironman.str.lower()

# Output
0    elephant
1        lion
2      lamigo
3    guardian
dtype: object
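
For the DataFrame case, the .str accessor is used on one column at a time. The sketch below uses a made-up column called name and includes a missing value to show that the string methods skip it instead of raising an error.

```python
import pandas as pd

# Hypothetical DataFrame with one missing name
df = pd.DataFrame({'name': ['Elephant', 'Lion', None, 'Guardian']})

df['lower'] = df['name'].str.lower()   # the missing value stays missing
df['length'] = df['name'].str.len()    # number of characters per name

print(df)
```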

Example: using a regular expression

ironman.str.extract('([ELa-z]+)')

# Output
          0
0  Elephant
1      Lion
2    Lamigo
3   uardian
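
The last row comes out as "uardian" because the character class [ELa-z] does not contain "G", and extract returns the first place in the string where the pattern matches. The sketch below contrasts this with match, which only tests whether the pattern matches at the start of the string.

```python
import pandas as pd

names = pd.Series(['Elephant', 'Lion', 'Lamigo', 'Guardian'])

# First run of characters drawn from E, L or a-z, returned as a Series
first_run = names.str.extract('([ELa-z]+)', expand=False)

# True only when the pattern matches at the beginning of the string
starts_ok = names.str.match('[ELa-z]+')

print(first_run)
print(starts_ok)
```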

Example: slicing strings

ironman.str.slice(0,3)

# Output
0    Ele
1    Lio
2    Lam
3    Gua
dtype: object
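
The .str accessor also supports ordinary Python slice notation, which is often shorter to write than slice(). A small sketch showing that both forms give the same result, plus a slice from the end of each string:

```python
import pandas as pd

names = pd.Series(['Elephant', 'Lion', 'Lamigo', 'Guardian'])

by_method = names.str.slice(0, 3)  # first three characters
by_index = names.str[0:3]          # same thing with slice notation
last_three = names.str[-3:]        # last three characters

print(by_index)
print(last_three)
```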

