[Day10] Learning Pandas - Series, DataFrame, Index - iT 邦幫忙

2019 iT 邦幫忙 Ironman Contest (鐵人賽)

DAY 10

AI & Data

Part 10 of the series "python 入門到分析股市" (Python: From Beginner to Stock Market Analysis)

Summer

Team: 浪流連九程式匠自然產生的佛系碼農專區

2018-10-25 00:05:17

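Introduction

We have finally moved on to the next topic. Today's subject is Pandas, an open-source library that provides efficient, easy-to-use data structures and makes data easier to analyze. You can think of Pandas as an enhanced version of NumPy arrays.

Prerequisites

Open Anaconda Prompt and check whether pandas is installed (with the "conda list" command). If it is not, run: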

pip install pandas

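Pandas objects

Pandas has three main objects, Series, DataFrame, and Index, and most operations are done with them.

Series (one-dimensional array)

Unlike a NumPy array, a Series lets you define your own index (the labels can be of any hashable type), so you can also think of it as a specialized dictionary.
The following builds a simple Series: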

import numpy as np
import pandas as pd
ironman = pd.Series([0.11,0.22,0.33,0.44])
ironman

# Output
0    0.11
1    0.22
2    0.33
3    0.44
dtype: float64

The example above shows that [0,1,2,3] is the index of the Series.

To get only the values from a Series, use the values attribute. To get the index, use the index attribute.

print('ironman.values------->',ironman.values)
print('ironman.index------->',ironman.index)

# Output
ironman.values-------> [0.11 0.22 0.33 0.44]
ironman.index-------> RangeIndex(start=0, stop=4, step=1)

The next example shows how to define your own index for a Series. Strings are used here.

ironman = pd.Series([0.11,0.22,0.33,0.44], index=['a','b','c','d'])
ironman

# Output
a    0.11
b    0.22
c    0.33
d    0.44
dtype: float64

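Here the index has changed from the default [0,1,2,3] to the user-defined ['a','b','c','d'].

As mentioned, a Series is a lot like a dict, and a dict can be used to build one. The following example goes from dict to Series: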

dic_ironman = {
    'a': 11,
    'b': 22,
    'c': 33
}
ironman = pd.Series(dic_ironman)
ironman

# Output
a    11
b    22
c    33
dtype: int64

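DataFrame (made up of multiple Series)

Like a Series, a DataFrame lets you specify the index. You can think of a DataFrame as several Series that share the same index, one per column.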

number = pd.Series({'taipei':200, 'taichung': 300, 'changhua': 400, 'kaohsiung' : 150})
mayor = pd.Series({'taipei': 'Kui', 'taichung': 'Ha', 'changhua': 'Chin', 'kaohsiung' : 'Lui'})
ironman_df = pd.DataFrame({'number':number, 'mayor':mayor})
ironman_df

# Output
|  | number | mayor |
| -------- | -------- | -------- |
| taipei     | 200     | Kui     |
| taichung     | 300     | Ha     |
| changhua     | 400     | Chin     |
| kaohsiung     | 150     | Lui     |

Besides the index and values attributes that a Series has, a DataFrame also has a columns attribute.

print('ironman_df.values------->',ironman_df.values)
print('ironman_df.index------->',ironman_df.index)
print('ironman_df.columns------->',ironman_df.columns)

# Output
ironman_df.values-------> [[200 'Kui']
 [300 'Ha']
 [400 'Chin']
 [150 'Lui']]
ironman_df.index-------> Index(['taipei', 'taichung', 'changhua', 'kaohsiung'], dtype='object')
ironman_df.columns-------> Index(['number', 'mayor'], dtype='object')
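One way to see the "made up of Series" idea is to pull a single column back out of the DataFrame. The sketch below rebuilds the same ironman_df and checks what a column selection returns. The variable name number_col is only for illustration.

```python
import pandas as pd

number = pd.Series({'taipei': 200, 'taichung': 300, 'changhua': 400, 'kaohsiung': 150})
mayor = pd.Series({'taipei': 'Kui', 'taichung': 'Ha', 'changhua': 'Chin', 'kaohsiung': 'Lui'})
ironman_df = pd.DataFrame({'number': number, 'mayor': mayor})

number_col = ironman_df['number']                  # selecting one column gives back a Series
print(type(number_col))
print(number_col.index.equals(ironman_df.index))   # the column shares the row index
print(ironman_df.dtypes)                           # each column keeps its own dtype
```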

We just built a Series from a dict. Since a DataFrame is made up of Series, it follows that:

A Series can be used to build a DataFrame
A dict can also be used to build a DataFrame

pd.DataFrame(number, columns=['number']) # from a single Series
pd.DataFrame({'number':{'taipei':200, 'taichung': 300, 'changhua': 400, 'kaohsiung' : 150}}) # from a dict

# Output
|  | number |
| -------- | -------- |
| taipei     | 200     |
| taichung     | 300     |
| changhua     | 400     | 
| kaohsiung     | 150     |

Both DataFrames above produce the same output, so it is printed only once.

On Day 9 we learned about structured arrays. A DataFrame can also be built from a structured array.

team =np.zeros(4, dtype={'names':('name','number','team'),'formats':('U10','i2','U10')})
team['name'] =['彭政閔','林智勝','蘇偉達','陽耀勳']
team['number'] = [23,32,96,23]
team['team'] = ['兄弟象','兄弟象','兄弟象','lamigo']
pd.DataFrame(team)

# Output
	name	number	team
0	彭政閔	  23	兄弟象
1	林智勝	  32	兄弟象
2	蘇偉達	  96	兄弟象
3	陽耀勳	  23	lamigo

Index (immutable array)

You can think of an Index as an immutable array or as an ordered set.

ironman_index = pd.Index([0.11,0.22,0.33,0.44])
ironman_index

# Output (older pandas; pandas 2.0 and later print Index([0.11, 0.22, 0.33, 0.44], dtype='float64'))
Float64Index([0.11, 0.22, 0.33, 0.44], dtype='float64')

Because an Index cannot be modified, trying to modify it raises an error.
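As a rough illustration of that error, the sketch below tries to assign to an element of the Index and catches the resulting TypeError. It also shows the "ordered set" side with intersection(), which returns a new Index rather than changing the original. The second Index, other, is made up for the example.

```python
import pandas as pd

ironman_index = pd.Index([0.11, 0.22, 0.33, 0.44])

try:
    ironman_index[0] = 0.55      # item assignment is not supported on an Index
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__
    print(error_name, exc)

# Set-style operations return new Index objects instead of mutating.
other = pd.Index([0.33, 0.44, 0.55])   # hypothetical second index
common = ironman_index.intersection(other)
print(common)
```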

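Accessing and modifying DataFrame and Series

In the past few days of learning Numpy, we accessed an ndarray with slicing, masking, and fancy indexing. Now we will use the same techniques on DataFrame and Series.

Series

Slicing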

ironman = pd.Series([0.11,0.22,0.33,0.44], index=['a','b','c','d'])
ironman['a':'c']

# Output
a    0.11
b    0.22
c    0.33
dtype: float64

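Although this Series was created with string labels as its index, it still has an implicit integer (positional) index, so you can also slice it with integers.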

ironman[0:3]

# Output
a    0.11
b    0.22
c    0.33
dtype: float64
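The two slices above return the same rows for different reasons: a label slice includes its end label, while a positional slice excludes its end position. A small sketch that spells this out with the explicit loc and iloc accessors (by_label and by_position are illustrative names):

```python
import pandas as pd

ironman = pd.Series([0.11, 0.22, 0.33, 0.44], index=['a', 'b', 'c', 'd'])

by_label = ironman.loc['a':'c']   # label slice: the end label 'c' is included
by_position = ironman.iloc[0:3]   # positional slice: position 3 is excluded

print(by_label.equals(by_position))
```

Writing loc or iloc explicitly makes it clear to the reader which kind of slice is meant.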

Masking

ironman[ironman > 0.22]

# Output
c    0.33
d    0.44
dtype: float64

The example above selects the values in ironman that are greater than 0.22.
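It can help to look at the mask on its own. In the sketch below, the comparison produces a boolean Series with the same index, and that Series is what does the selecting. The combined condition at the end is an extra illustration, not something from the original example.

```python
import pandas as pd

ironman = pd.Series([0.11, 0.22, 0.33, 0.44], index=['a', 'b', 'c', 'd'])

mask = ironman > 0.22            # boolean Series aligned with ironman's index
print(mask)

# Conditions can be combined with & and |; each one needs its own parentheses.
middle = ironman[(ironman > 0.11) & (ironman < 0.44)]
print(middle)
```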

Fancy indexing
As a reminder, fancy indexing means passing an array as the index to fetch several elements at once.

ironman[['a','d']]

# Output
a    0.11
d    0.44
dtype: float64

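loc
When the index labels are themselves integers, a plain slice such as ironman[1:3] goes by position, not by label. Use loc to slice by label. The examples below compare the result with and without loc.

Without loc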

ironman = pd.Series([0.11,0.22,0.33,0.44], index=[1,3,5,7])
ironman[1:3]

# Output
3    0.22
5    0.33
dtype: float64

Notice that the result is taken by implicit integer position (positions 1 through 3-1), not by index label.

With loc

ironman.loc[1:3]

# Output
1    0.11
3    0.22
dtype: float64

In this example the result is selected by index label.
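A compact way to keep the two behaviors apart is to always name the accessor. The following sketch, using the same Series, puts iloc (position) and loc (label) side by side:

```python
import pandas as pd

ironman = pd.Series([0.11, 0.22, 0.33, 0.44], index=[1, 3, 5, 7])

by_position = ironman.iloc[1:3]   # positions 1 and 2, which carry labels 3 and 5
by_label = ironman.loc[1:3]       # labels 1 through 3, end label included

print(by_position)
print(by_label)
```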

DataFrame

The following divides one column by another and writes the result into a new column.

number = pd.Series({'taipei':200, 'taichung': 300, 'changhua': 400, 'kaohsiung' : 150})
area = pd.Series({'taipei': 22, 'taichung': 25, 'changhua': 35, 'kaohsiung' : 10})
ironman_pd = pd.DataFrame({'number':number, 'area':area})
ironman_pd['divided'] = ironman_pd['number'] / ironman_pd['area']
ironman_pd

# Output
	        number	area	divided
taipei	    200	     22	   9.090909
taichung	300	     25	   12.000000
changhua	400	     35	   11.428571
kaohsiung	150	     10	   15.000000

Transposing the DataFrame

ironman_pd.T

# Output
          taipei	   taichung	    changhua	   kaohsiung
number	  200.000000	300.0	   400.000000	    150.0
area	  22.000000	    25.0	   35.000000	    10.0
divided	   9.090909	    12.0	   11.428571	    15.0

Slicing with iloc

ironman_pd.iloc[:2,:2]

# Output
	      number	area
taipei	    200	     22
taichung	300	     25

Masking with loc

ironman_pd.loc[ironman_pd.divided > 12]

# Output
            number	area	divided
kaohsiung	 150	 10	     15.0
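loc also accepts a row mask and a column selection at the same time. The sketch below rebuilds ironman_pd and keeps only two columns for the rows that pass the mask. The threshold (>= 12 instead of > 12) and the name subset are chosen just for this illustration.

```python
import pandas as pd

number = pd.Series({'taipei': 200, 'taichung': 300, 'changhua': 400, 'kaohsiung': 150})
area = pd.Series({'taipei': 22, 'taichung': 25, 'changhua': 35, 'kaohsiung': 10})
ironman_pd = pd.DataFrame({'number': number, 'area': area})
ironman_pd['divided'] = ironman_pd['number'] / ironman_pd['area']

# Rows where divided >= 12, keeping only the 'number' and 'divided' columns.
subset = ironman_pd.loc[ironman_pd['divided'] >= 12, ['number', 'divided']]
print(subset)
```

Using ironman_pd['divided'] instead of ironman_pd.divided is a small safety habit: attribute access stops working when a column name clashes with an existing DataFrame attribute or method.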

Pandas data operations

Numpy ufuncs all work on Series and DataFrame.

Absolute value: abs

ironman_series = pd.Series({'a':-50, 'b': 20, 'c': -30, 'd' : 22, 'e' : -40})
print(np.abs(ironman_series))

# Output
a    50
b    20
c    30
d    22
e    40
dtype: int64
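The same applies to a DataFrame, and the result keeps its index and column labels. Here is a minimal sketch with made-up values (the DataFrame ufunc_df and the name roots are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up values, chosen so the square roots come out as whole numbers.
ufunc_df = pd.DataFrame({'x': [-1.0, 4.0, -9.0], 'y': [16.0, -25.0, 36.0]},
                        index=['a', 'b', 'c'])

roots = np.sqrt(np.abs(ufunc_df))   # ufuncs can be chained; labels are preserved
print(type(roots))
print(roots)
```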

Index alignment
Any index label that cannot be matched on both sides is filled with NaN.

ironman_series = pd.Series({'a':-50, 'b': 20, 'c': -30, 'd' : 22, 'e' : -40})
ironman_series2 = pd.Series({'a':12, 'c': -15, 'd': -10, 'f' : -31, 'g' : 20})
ironman_series.add(ironman_series2)

# Output
a   -38.0
b     NaN
c   -45.0
d    12.0
e     NaN
f     NaN
g     NaN
dtype: float64

You can specify a value to use for index labels that cannot be matched.

ironman_series.add(ironman_series2, fill_value = 0)

# Output
a   -38.0
b    20.0
c   -45.0
d    12.0
e   -40.0
f   -31.0
g    20.0
dtype: float64
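Alignment works the same way on DataFrames, except that both rows and columns are aligned. The sketch below uses two small made-up frames (df1 and df2 are hypothetical). Note that fill_value only stands in for a cell that is missing on one side; a cell missing from both frames stays NaN.

```python
import pandas as pd

# Two made-up frames that overlap only at row 'b', column 'y'.
df1 = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['a', 'b'])
df2 = pd.DataFrame({'y': [10, 20], 'z': [30, 40]}, index=['b', 'c'])

plain = df1.add(df2)                  # NaN wherever either side is missing
filled = df1.add(df2, fill_value=0)   # a one-sided gap is treated as 0

print(plain)
print(filled)
```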

Python operators and their Pandas equivalents

Operator	Pandas method
+	add()
-	sub(),subtract()
*	mul(),multiply()
/	truediv(),div(),divide()
//	floordiv()
%	mod()
**	pow()
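To check the table against real objects, the sketch below compares a couple of operators with their method forms on two small Series (s1 and s2 are made-up data). The method form matters mainly when you need an extra argument such as fill_value, which the operator cannot take.

```python
import pandas as pd

s1 = pd.Series({'a': -50, 'b': 20, 'c': -30})
s2 = pd.Series({'a': 12, 'c': -15, 'd': -10})

# Operator and method give the same result, including the NaN positions.
same_add = (s1 + s2).equals(s1.add(s2))
same_floordiv = (s1 // s2).equals(s1.floordiv(s2))

# Only the method form accepts fill_value.
filled_sub = s1.sub(s2, fill_value=0)

print(same_add, same_floordiv)
print(filled_sub)
```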

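Takeaways

In the financial analysis coming up later, most data is stored in DataFrames, so a quick read-through here is enough. When I am in the middle of an analysis and suddenly forget how something works, I come back and look it up too.

Guide to earlier chapters

Setting up the environment
- Installing Anaconda
- Installing Jupyter notebook
Numpy
Code location
- github
  This is the author's first time learning Python and writing programming articles, so the layout may be a bit messy and some concepts may be wrong. If you have questions, please raise them so we can discuss. Once the 30 days are over and there is time, I will add some further thoughts to the earlier articles.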

1 comment

Andy Chiu

iT邦研究生 Level 2 ‧ 2018-11-08 23:13:00

ironman.loc[ironman.divided > 12]

=> ironman_pd.loc[ironman_pd.divided > 12]

Summer iT邦新手 Level 5 ‧ 2018-11-09 00:05:20

Fixed. You have a really sharp eye, thank you so much for the comment.
