[Day 10] Powerful Python Packages

11th iThome Ironman Contest

AI & Data

Python & ML Data Analysis series, part 10

小魚兒Fischer

2019-09-25 15:33:32

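ndarray: the basic data structure in NumPy
ndarray is the cornerstone of NumPy. What are its advantages?
• It performs very well on heavy numerical workloads, so convert a list to an array before doing arithmetic on it
• ndarray ships with many useful methods and attributes; common ones include sum, shape (returns the dimensions of the array), and argmax
Example: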


In[1]: import numpy as np

In[2]: a_array = np.array([[1,2,3], [1,2,3]])  # create an ndarray

In[3]: b_array = np.array([(2,2,2), (5,5,5)])

In[4]: a_array.shape   # get the dimensions of the array
Out[4]: (2, 3)

In[5]: a_array * b_array   # element-wise multiplication
Out[5]:
array([[ 2,  4,  6],
       [ 5, 10, 15]])

In[6]: a_array > 3     # element-wise test: is each element greater than 3?
Out[6]:
array([[False, False, False],
       [False, False, False]], dtype=bool)

In[7]: np.sin(a_array)     # sine of each element
Out[7]:
array([[ 0.84147098,  0.90929743,  0.14112001],
       [ 0.84147098,  0.90929743,  0.14112001]])

In[8]: a_array.sum()       # sum of all elements
Out[8]: 12

In[9]: a_array.sum(axis = 0)   # sum of each column (reduces along axis 0)
Out[9]: array([2, 4, 6])
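
The transcript above does not exercise argmax or the list-to-array conversion mentioned in the bullet points. Here is a minimal sketch of both. The variable names and sample values (prices, quantities, scores) are made up for illustration and are not from the original post.

```python
import numpy as np

# Hypothetical sample data, for illustration only.
prices = [1.5, 2.0, 0.5]
quantities = [2, 3, 4]

# Plain lists have no element-wise arithmetic, so convert them first.
price_arr = np.array(prices)
qty_arr = np.array(quantities)
totals = price_arr * qty_arr          # element-wise product: [3., 6., 2.]

scores = np.array([[1, 9, 3],
                   [7, 2, 8]])

flat_pos = scores.argmax()            # position of the max in the flattened array: 1
col_pos = scores.argmax(axis=0)       # row index of the max in each column: [1, 0, 1]
row_pos = scores.argmax(axis=1)       # column index of the max in each row: [1, 2]

print(totals, flat_pos, col_pos, row_pos)
```

Note that argmax returns positions, not values; index the array with the result to get the maximum itself.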

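Series: a variable-length dictionary
Basic characteristics:
• An object similar to a one-dimensional array
• Made up of data and an index
Example: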


In[1]: import pandas as pd

In[2]: a_series = pd.Series([1, 2, 3])     # create a Series with the default integer index

In[3]: a_series
Out[3]:
0    1
1    2
2    3
dtype: int64

In[4]: b_series = pd.Series(['apple', 'banana', 'lemon'], index = [1, 'b', 'l'])       # custom index

In[5]: b_series
Out[5]:
1     apple
b    banana
l     lemon
dtype: object

In[6]: b_series['l']   # look up a value by its index label
Out[6]: 'lemon'

In[7]: print(a_series * 2)    # arithmetic operators apply to every value
0    2
1    4
2    6
dtype: int64

In[8]: import numpy as np

In[9]: np.exp(a_series)      # exponential function applied to every value
Out[9]:
0     2.718282
1     7.389056
2    20.085537
dtype: float64
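
The heading calls a Series a kind of dictionary, and the following sketch illustrates that side of it. It is a small illustration rather than a full tour: the fruit Series reuses the values from b_series but swaps in string-only labels, which is an assumption made here to keep the index homogeneous.

```python
import pandas as pd

# Same values as b_series in the transcript, but with string-only labels
# (an illustrative change, not from the original post).
fruit = pd.Series(['apple', 'banana', 'lemon'], index=['a', 'b', 'l'])

has_b = 'b' in fruit            # membership tests the index labels, like dict keys
has_apple = 'apple' in fruit    # values are not searched, so this is False

as_dict = fruit.to_dict()       # {'a': 'apple', 'b': 'banana', 'l': 'lemon'}

numbers = pd.Series([1, 2, 3])
big = numbers[numbers > 1]      # a boolean mask keeps the entries labelled 1 and 2

print(has_b, has_apple, as_dict)
print(big)
```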

Data alignment:
In[1]: import pandas as pd

In[2]: data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In[3]: index = ['Ohio', 'Texas', 'Utah']

In[4]: a_series = pd.Series(data, index = index)   # create a Series restricted to the given index

In[5]: a_series
Out[5]:
Ohio     35000
Texas    71000
Utah      5000
dtype: int64

In[6]: pd.isnull(a_series)
Out[6]:
Ohio     False
Texas    False
Utah     False
dtype: bool

In[7]: b_data = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'California': 83000, 'Nevada': 15000}

In[8]: b_series = pd.Series(b_data)

In[9]: a_series + b_series # arithmetic aligns on index labels; a label missing from either side yields NaN
Out[9]:
California         NaN
Nevada             NaN
Ohio           70000.0
Oregon             NaN
Texas         142000.0
Utah               NaN
dtype: float64
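
The NaN entries above appear wherever a label exists in only one of the two Series. If that is not the behaviour you want, there are two common ways to handle it, sketched below. Here a_series is built directly from a dict instead of the data/index pair used in the transcript, which is a simplification for this example.

```python
import pandas as pd

a_series = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Utah': 5000})
b_series = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000,
                      'California': 83000, 'Nevada': 15000})

# Series.add with fill_value=0 substitutes 0 for a label missing on one side,
# so every label in the union gets a number instead of NaN.
filled = a_series.add(b_series, fill_value=0)

# Alternatively, keep only the labels that have a value on both sides.
common = (a_series + b_series).dropna()

print(filled)
print(common)
```

Which option fits depends on whether a missing label should count as zero or be excluded from the result.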

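DataFrame: a tabular data structure
Basic characteristics:
• A table-shaped data structure
• Contains an ordered collection of columns (similar to an index)
• A table with multiple columns, each of which has a label (like the header row in Excel)
• Can roughly be viewed as a collection of Series that share one index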
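
The last point, that a DataFrame is roughly a set of Series sharing one index, can be sketched by building one from a dict of Series. The column names and numbers below are hypothetical placeholders, not real statistics.

```python
import pandas as pd

# Hypothetical columns; names and numbers are placeholders for illustration.
first = pd.Series({'Ohio': 1, 'Texas': 2, 'Utah': 3})
second = pd.Series({'Ohio': 10, 'Texas': 20, 'Oregon': 30})

# Each Series becomes one column; the row index is the union of their labels,
# and a label missing from one Series shows up as NaN in that column.
states = pd.DataFrame({'first': first, 'second': second})

first_col = states['first']      # selecting a column gives back a Series
print(states)
```
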
Example:

In[1]: import pandas as pd

In[2]: import numpy as np

In[3]: dates = pd.date_range('20161007', periods=6)

In[4]: dates
Out[4]:
DatetimeIndex(['2016-10-07', '2016-10-08', '2016-10-09', '2016-10-10','2016-10-11', '2016-10-12'],dtype='datetime64[ns]', freq='D')

In[5]: df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))    # create a DataFrame of random values

In[6]: df
Out[6]:
                   A         B         C         D
2016-10-07  2.437131  0.004184 -0.204022 -0.395558
2016-10-08  0.988348 -0.346398  0.190402  3.268118
2016-10-09 -1.574105  1.491294  0.597307 -0.944745
2016-10-10 -2.362435  0.620811  1.807417  0.345957
2016-10-11 -0.091778  1.408165 -0.121032  0.528897
2016-10-12 -1.319251  0.698142 -1.366151 -0.523682

In[7]: df.dtypes   # data type of each column
Out[7]:
A    float64
B    float64
C    float64
D    float64
dtype: object

In[8]: df['A']     # select a column by name; the result is a Series
Out[8]:
2016-10-07    2.437131
2016-10-08    0.988348
2016-10-09   -1.574105
2016-10-10   -2.362435
2016-10-11   -0.091778
2016-10-12   -1.319251
Freq: D, Name: A, dtype: float64

In[9]: df.iloc[2]    # select a row by integer position; the result is a Series (.ix was removed in pandas 1.0)
Out[9]:
A   -1.574105
B    1.491294
C    0.597307
D   -0.944745
Name: 2016-10-09 00:00:00, dtype: float64

In[10]: df.head(2)     # first two rows
Out[10]:
                   A         B         C         D
2016-10-07  2.437131  0.004184 -0.204022 -0.395558
2016-10-08  0.988348 -0.346398  0.190402  3.268118

In[11]: df.T           # transpose the table
Out[11]:
   2016-10-07  2016-10-08  2016-10-09  2016-10-10  2016-10-11  2016-10-12
A    2.437131    0.988348   -1.574105   -2.362435   -0.091778   -1.319251
B    0.004184   -0.346398    1.491294    0.620811    1.408165    0.698142
C   -0.204022    0.190402    0.597307    1.807417   -0.121032   -1.366151
D   -0.395558    3.268118   -0.944745    0.345957    0.528897   -0.523682

In[12]: df.sort_values(by='B')     # sort rows by the values in column B
Out[12]:
                   A         B         C         D
2016-10-08  0.988348 -0.346398  0.190402  3.268118
2016-10-07  2.437131  0.004184 -0.204022 -0.395558
2016-10-10 -2.362435  0.620811  1.807417  0.345957
2016-10-12 -1.319251  0.698142 -1.366151 -0.523682
2016-10-11 -0.091778  1.408165 -0.121032  0.528897
2016-10-09 -1.574105  1.491294  0.597307 -0.944745

In[16]: del df['A']     # delete column A in place

In[17]: df
Out[17]:
                   B         C         D
2016-10-07  0.004184 -0.204022 -0.395558
2016-10-08 -0.346398  0.190402  3.268118
2016-10-09  1.491294  0.597307 -0.944745
2016-10-10  0.620811  1.807417  0.345957
2016-10-11  1.408165 -0.121032  0.528897
2016-10-12  0.698142 -1.366151 -0.523682
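
Two details of the transcript are worth a short sketch: selecting a row by position versus by label, and removing a column without modifying the original table. The seeded random generator is an assumption added here so the example is repeatable; its numbers will not match the output shown above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20161007', periods=6)
rng = np.random.default_rng(0)   # seeded so the sketch is repeatable
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list('ABCD'))

third_by_position = df.iloc[2]     # third row, counting from 0
third_by_label = df.loc[dates[2]]  # the same row, selected by its date label

# drop returns a new DataFrame and leaves df unchanged, unlike del df['A'].
without_a = df.drop(columns='A')

print(third_by_position)
print(without_a.columns.tolist())
```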

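DataFrame has many more powerful features, including:
• Selecting data by label, for example: df.loc['20130102':'20130104',['A','B']]
• Setting a value in the table: df.at[dates[0],'A'] = 0
• Statistics: df.mean() computes the mean of each column
• Combining tables: merge, concat, and join (the same idea as JOIN in SQL)
• Grouping, similar to GROUP BY in SQL
• Pivot tables, which will be familiar to anyone who uses Excel; this is the same PivotTable feature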
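
The features listed above can be sketched in one short script. This is only an illustration under stated assumptions: the date range starts at 2013-01-01 so that the label slice from the first bullet selects something, and the orders and cities tables are hypothetical data invented for this example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24.0).reshape(6, 4), index=dates, columns=list('ABCD'))

# Label-based selection: both endpoints of the slice are included.
subset = df.loc['20130102':'20130104', ['A', 'B']]

# Set a single cell by row label and column label.
df.at[dates[0], 'A'] = 100.0

col_means = df.mean()                 # mean of each column

# Hypothetical tables for combining and aggregating.
orders = pd.DataFrame({'customer': ['ann', 'bob', 'ann', 'cat'],
                       'item': ['pen', 'pen', 'ink', 'ink'],
                       'amount': [10, 20, 30, 40]})
cities = pd.DataFrame({'customer': ['ann', 'bob'],
                       'city': ['Taipei', 'Tainan']})

merged = pd.merge(orders, cities, on='customer', how='left')   # like SQL LEFT JOIN
stacked = pd.concat([orders, orders], ignore_index=True)       # append rows

per_customer = orders.groupby('customer')['amount'].sum()      # like SQL GROUP BY

# One row per customer, one column per item, summed amounts in the cells.
pivot = orders.pivot_table(index='customer', columns='item',
                           values='amount', aggfunc='sum')

print(subset)
print(merged)
print(per_customer)
print(pivot)
```

In the merged table the customer with no matching city gets NaN in the city column, and the pivot table shows NaN for customer and item combinations that never occur.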