[Day 11] Getting to Know the Pandas Module

2022 iThome Ironman Contest

DAY 11

Self-Challenge Group

Part 11 of the series "A Reptile Keeper Learns Web Crawling"

14th Ironman Contest

teresawang

2022-09-24 22:31:47


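Small talk
Yesterday we took a brief look at Requests-HTML, which comes with built-in data-cleaning features. Today let's get to know the Pandas module.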

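What is Pandas
Pandas is a third-party module written for Python, used mainly for data processing and analysis.
Its core data structures are Series and DataFrame, both introduced below. Older versions also included Panel, which was removed in pandas 0.25.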

Installation

pip install pandas

To check the installed version, you can use

import pandas as pd
print(pd.__version__)

Series
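A Series is a one-dimensional array-like data structure. It can hold integers, floats, strings, Python objects (such as list and dict), scalars, and so on.
Although it is one-dimensional, every value in a Series is paired with an index (also called a label), so the printed output looks like a two-column table.

Creating a Series object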

import pandas as pd
# from a list
s = pd.Series([11,22,33,44,55])
print(s)
#output
0    11
1    22
2    33
3    44
4    55
dtype: int64

# from a dict (the keys become the index)
mydict = {'台灣':'Taiwan', '東京':'Tokyo'}
s1 = pd.Series(mydict)
print(s1)
#output
台灣    Taiwan
東京     Tokyo
dtype: object

# from a NumPy ndarray
import pandas as pd
import numpy as np
s2 = pd.Series(np.arange(1,2,5)) # arange(start, stop, step): starts at 1, stops before 2, steps by 5, so only 1 is produced
print(s2)
#output
0    1
dtype: int64
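
Because np.arange(1,2,5) yields only a single value, a range with several elements may show the ndarray route more clearly. The following is a minimal sketch; the bounds 1, 12 and the step 2 are illustrative values, not taken from the original example.

```python
import numpy as np
import pandas as pd

# arange(start, stop, step): start at 1, stop before 12, step by 2
# (these bounds are illustrative assumptions)
odd = pd.Series(np.arange(1, 12, 2))
print(odd.tolist())  # [1, 3, 5, 7, 9, 11]
```

The stop value is exclusive, which is why 11 is the last element rather than 12.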

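Creating an object with a custom index
By default the index counts up from 0, and when a Series is built from a dict, the dict keys become the index. You can therefore create a Series whose index does not start at 0, or supply your own index values.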

import pandas as pd
myindex = [3,5,7]
price = [100,200,300]
s3 = pd.Series(price, index = myindex)
print(s3)
#output
3    100
5    200
7    300
dtype: int64
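
With a custom index, it helps to keep lookups by label and lookups by position apart. The sketch below reuses the values from the example above; `.loc` and `.iloc` are standard pandas accessors, used here only to make that distinction explicit.

```python
import pandas as pd

s3 = pd.Series([100, 200, 300], index=[3, 5, 7])

print(s3.index.tolist())  # the labels: [3, 5, 7]
print(s3.loc[5])          # lookup by label 5 -> 200
print(s3.iloc[0])         # lookup by position 0 -> 100
```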

Creating from a scalar

import pandas as pd
s4 = pd.Series(9, index = [1,2,3])
print(s4)
#output
1    9
2    9
3    9
dtype: int64

Arithmetic operations

# addition
import pandas as pd
s5 = pd.Series([1,2])
s6 = pd.Series([3,4])
print(s5+s6)
#output
0    4
1    6
dtype: int64

# subtraction
s5 = pd.Series([1,2])
s6 = pd.Series([3,4])
print(s5-s6)
#output
0   -2
1   -2
dtype: int64

# multiplication
s5 = pd.Series([1,2])
s6 = pd.Series([3,4])
print(s5*s6)
#output
0    3
1    8
dtype: int64

# division
s5 = pd.Series([1,2])
s6 = pd.Series([3,4])
print(s5/s6)
#output
0    0.333333
1    0.500000
dtype: float64
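
The examples above line up element by element because both Series share the default index 0 and 1. Arithmetic between two Series is matched by index label rather than by position, so a label that appears in only one operand produces NaN. Here is a minimal sketch with an illustrative shifted index (the names `a`, `b`, and `total` are hypothetical):

```python
import pandas as pd

a = pd.Series([1, 2])                # index 0, 1
b = pd.Series([3, 4], index=[1, 2])  # index 1, 2 (illustrative shifted index)

total = a + b
print(total)
# label 1 exists in both Series: 2 + 3 = 5.0
# labels 0 and 2 exist in only one Series, so the result there is NaN
```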

Logical (comparison) operations

import pandas as pd
s7 = pd.Series([1,5,3])
s8 = pd.Series([2,2,4])
print(s7>s8)
#output
0    False
1     True
2    False
dtype: bool
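
A comparison result like this is commonly used as a boolean mask to select elements. The short sketch below reuses s7 and s8; `mask` and `selected` are names introduced here for illustration.

```python
import pandas as pd

s7 = pd.Series([1, 5, 3])
s8 = pd.Series([2, 2, 4])

mask = s7 > s8            # False, True, False
selected = s7[mask]       # keeps only the rows where the mask is True
print(selected.tolist())  # [5]
```

The selected element keeps its original index label (1 in this case), so it can still be traced back to its row in s7.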

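DataFrame
A DataFrame is a two-dimensional, table-like data structure, similar to an Excel worksheet.
It can hold integers, floats, strings, Python objects, and so on.

# Creating a DataFrame from Series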

import pandas as pd
years = range(2020,2022)
beijing = pd.Series([20,21], index = years)
tokyo = pd.Series([25,26], index = years)
# axis=1 places each Series in its own column; the default axis=0 would stack them into one longer Series
citydf = pd.concat([beijing,tokyo], axis=1)
print(citydf)
print(type(citydf))

#output
       0   1
2020  20  25
2021  21  26
<class 'pandas.core.frame.DataFrame'>
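
Series that share an index can also be combined by passing a dict of Series to pd.DataFrame, in which case each dict key becomes a column name. The sketch below uses the same data; the column names 'Beijing' and 'Tokyo' and the variable `citydf2` are chosen here for illustration.

```python
import pandas as pd

years = range(2020, 2022)
beijing = pd.Series([20, 21], index=years)
tokyo = pd.Series([25, 26], index=years)

# Each dict key becomes a column label; the shared index becomes the row labels
citydf2 = pd.DataFrame({'Beijing': beijing, 'Tokyo': tokyo})
print(citydf2)
print(citydf2.shape)  # (2, 2)
```

Named columns make later lookups such as citydf2['Tokyo'] easier to read than numeric column labels.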

Data analysis and processing with Pandas
1. Index-based access attributes