2024 iThome Ironman Contest

DAY 11

AI/ ML & Data

Exploring Automated Trading Programs series, part 11

Day 11 - Reviewing Pandas and Datetime

16th Ironman Contest

jjchen1

Team: 北投溫泉公園的蛞蝓觀察小隊

2024-09-25 11:33:49


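Yesterday's report covered only PPO because I was not familiar enough with Pandas when generating it, and I was also unclear on many details of Datetime. As a result I could not get pyfolio to work, and after switching to quantstats I still could not produce a complete comparison report. Today I spent some time and finally got it working, and I have updated yesterday's report.

Before continuing with FinRL, I decided to spend some time getting reacquainted with the basics of Pandas. Since the index used in financial analysis is mostly Datetime, I spent some time reviewing that as well.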


Pandas Notes

Reference: Pandas Tutorial - w3c

pd.Series

What is a Series?
A Pandas Series is like a column in a table.
It is a one-dimensional array that can hold data of any type.

import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)

Series Label

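Note that a Series behaves differently from a DataFrame here: on a DataFrame with no index set, df[0] raises a KeyError (unless a column happens to be labeled 0), because [] on a DataFrame looks up a column label rather than a row.

If no labels are specified, a Series gets an automatically generated running-number index: the first value has index 0, the second has index 1, and so on.
With a label you can access a specific value directly.

If myvar has no index set,
print(myvar.index) shows RangeIndex(start=0, stop=3, step=1)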

import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar[0]) # prints 1
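
As a small sketch of label-based access, the Series can also be built with explicit labels through the index argument. The labels "x", "y", "z" are chosen here only for illustration and are not from the original notes.

```python
import pandas as pd

a = [1, 7, 2]

# Attach explicit labels instead of the default RangeIndex
# (the label names are hypothetical, picked for this example)
myvar = pd.Series(a, index=["x", "y", "z"])

# Read a value by its label
print(myvar["y"])  # 7
```

With string labels in place, the positional meaning of myvar[0] no longer applies, so myvar.iloc[0] is the explicit way to ask for the first element.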

Key/Value Objects as Series

A key/value object such as a dict can be converted directly into a Series.

Note: the keys become the index.

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

When converting a dict to a Series, passing only some of the keys excludes the unwanted data during the conversion.

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories, index = ["day1", "day2"])

print(myvar)

Create Labels: index

To select only some of the items in a dictionary, use the index argument and specify the items you want to include in the Series.

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories, index = ["day1", "day2"])

print(myvar)

DataFrame

What is a DataFrame?
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

#load data into a DataFrame object:
df = pd.DataFrame(data)

print(df)
#    calories  duration
# 0       420        50
# 1       380        40
# 2       390        45

Access Data with `.loc[]`

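As the output above shows, a DataFrame is like a table with rows and columns.
Because no index was set, the default labels are running numbers.
Pandas uses .loc[label] to return the contents of one or more rows identified by their labels.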

#refer to the row index:
print(df.loc[0]) # -> returns a `pandas.core.series.Series`
# calories    420
# duration     50
# Name: 0, dtype: int64

#use a list of indexes:
print(df.loc[[0, 1]]) # -> returns a `pandas.core.frame.DataFrame`
#    calories  duration
# 0       420        50
# 1       380        40

Access Data with `.iloc[]`

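.iloc[row_number] is similar to .loc[], but it takes the integer position of the row (starting at 0) instead of its label, and returns the corresponding data.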
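
The difference between the two accessors is easiest to see on a DataFrame whose labels are not numbers. The following is a minimal sketch that reuses the calories/duration data from these notes with day labels:

```python
import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

by_label = df.loc["day2"]   # select by index label
by_position = df.iloc[1]    # select by integer position (the second row)

# Both lookups return the same row
print(by_label.equals(by_position))  # True

# A list of positions returns a DataFrame, just like .loc with a list of labels
first_and_last = df.iloc[[0, 2]]
print(first_and_last)
```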

Named Indexes

With the index argument, you can name your own indexes.

import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df) 
#       calories  duration
# day1       420        50
# day2       380        40
# day3       390        45

Locate Named Indexes

#refer to the named index:

print(df.loc["day2"])
# calories    380
# duration     40
# Name: day2, dtype: int64

Pandas - Analyzing DataFrames

These are notes on a few functions that are likely to be used often.

The following examples use data.csv.

import pandas as pd

df = pd.read_csv('data.csv')

1. `info()`

Shows information about each column:
- column name
- count of non-null values
- data type
It also shows the number of rows and the memory usage.

print(df.info()) 
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 169 entries, 0 to 168
# Data columns (total 4 columns):
#  #   Column    Non-Null Count  Dtype  
# ---  ------    --------------  -----  
#  0   Duration  169 non-null    int64  
#  1   Pulse     169 non-null    int64  
#  2   Maxpulse  169 non-null    int64  
#  3   Calories  164 non-null    float64
# dtypes: float64(1), int64(3)
# memory usage: 5.4 KB
# None
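
info() reports non-null counts, so the number of missing values has to be worked out by subtraction. A direct count is available with isna().sum(). The sketch below uses a tiny made-up stand-in for data.csv, since the real file is not reproduced in these notes.

```python
import pandas as pd

# Small stand-in for data.csv (values are made up for illustration)
df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [409.1, None, 195.1],
})

# Number of missing values in each column
null_counts = df.isna().sum()

print(null_counts["Calories"])  # 1
print(len(df))                  # 3
```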

2. `dropna()`

dropna() returns a new DataFrame with every row that contains a null value removed.

print(df.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 169 entries, 0 to 168
# Data columns (total 4 columns):
#  #   Column    Non-Null Count  Dtype
# ---  ------    --------------  -----
#  0   Duration  169 non-null    int64
#  1   Pulse     169 non-null    int64
#  2   Maxpulse  169 non-null    int64
#  3   Calories  164 non-null    float64
# dtypes: float64(1), int64(3)
# memory usage: 5.4 KB

res = df.dropna()

print(res.info())
# Index: 164 entries, 0 to 168
# Data columns (total 4 columns):
#  #   Column    Non-Null Count  Dtype
# ---  ------    --------------  -----
#  0   Duration  164 non-null    int64
#  1   Pulse     164 non-null    int64
#  2   Maxpulse  164 non-null    int64
#  3   Calories  164 non-null    float64
# dtypes: float64(1), int64(3)
# memory usage: 6.4 KB

3. `reset_index()`

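dropna() removes the rows that contain NaN, but it does not reset the index automatically, so the original index labels are kept.
That is why info() still shows the original index range (0 to 168) even after some rows are deleted.
If you need a contiguous index with no gaps, call reset_index() after dropna().

info() shows Index: 164 entries, 0 to 168
The rows at index=[17, 27, 91, 118, 141] have been deleted and no longer exist.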

null_indices = df[df.isna().any(axis=1)].index
print(null_indices)
# Index([17, 27, 91, 118, 141], dtype='int64')

res = df.dropna().reset_index(drop=True)
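
The gap in the index, and its removal, can be seen on a small frame. This is a minimal sketch with made-up values rather than the real data.csv:

```python
import pandas as pd

df = pd.DataFrame({"Calories": [409.1, None, 340.0, None, 406.0]})

# dropna() keeps the original labels, leaving holes at 1 and 3
dropped = df.dropna()
print(list(dropped.index))  # [0, 2, 4]

# drop=True discards the old labels instead of adding them as a column
res = dropped.reset_index(drop=True)
print(list(res.index))      # [0, 1, 2]
```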

4. `fillna()`

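Fill with a fixed value: df["Calories"] = df["Calories"].fillna(130)
Fill with the mean: df["Calories"] = df["Calories"].fillna(df["Calories"].mean())
Fill with the median: df["Calories"] = df["Calories"].fillna(df["Calories"].median())
Fill with the mode: df["Calories"] = df["Calories"].fillna(df["Calories"].mode()[0])
Note: mode() returns a Series, so take its first element with [0]. Assigning the result back avoids calling fillna(..., inplace = True) on a selected column, which recent pandas versions warn about.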
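
The following sketch, on made-up values, shows why the mode needs special care: passing the Series returned by mode() directly to fillna() fills by matching index labels, so most gaps stay empty.

```python
import pandas as pd

df = pd.DataFrame({"Calories": [300.0, None, 300.0, 500.0, None]})

# Fill with the mean of the non-missing values
mean_filled = df["Calories"].fillna(df["Calories"].mean())

# mode() returns a Series (there can be ties), so pick the first entry
mode_value = df["Calories"].mode()[0]
mode_filled = df["Calories"].fillna(mode_value)

# Passing the Series itself aligns on the index: only label 0 could be filled,
# and label 0 is not missing, so both gaps remain
naive_filled = df["Calories"].fillna(df["Calories"].mode())

print(mode_filled.isna().sum())   # 0
print(naive_filled.isna().sum())  # 2
```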

Common Issues in Practice

date_data.csv

Duration,Date,Pulse,Maxpulse,Calories
60,2020/12/01,110,130,409.1
60,2020/12/02,117,145,479.0
60,2020/12/03,103,135,340.0
45,2020/12/04,109,175,282.4
45,2020/12/05,117,148,406.0
60,2020/12/06,102,127,300.0
60,2020/12/07,110,136,374.0
450,2020/12/08,104,134,253.3
30,2020/12/09,109,133,195.1
60,2020/12/10,98,124,269.0
60,2020/12/11,103,147,329.3
60,2020/12/12,100,120,250.7
60,2020/12/12,100,120,250.7
60,2020/12/13,106,128,345.3
60,2020/12/14,104,132,379.3
60,2020/12/15,98,123,275.0
60,2020/12/16,98,120,215.2
60,2020/12/17,100,120,300.0
45,2020/12/18,90,112,
60,2020/12/19,103,123,323.0
45,2020/12/20,97,125,243.0
60,2020/12/21,108,131,364.2
45,,100,119,282.0
60,2020/12/23,130,101,300.0
45,2020/12/24,105,132,246.0
60,2020/12/25,102,126,334.5
60,2020/12/26,100,120,250.0
60,2020/12/27,92,118,241.0
60,2020/12/28,103,132,
60,2020/12/29,100,132,280.0
60,2020/12/30,102,129,380.3
60,2020/12/31,92,115,243.0

1. Data in the Wrong Format

df = pd.read_csv('date_data.csv')
print(df.info())
#  #   Column    Non-Null Count  Dtype
# ---  ------    --------------  -----
#  ......
#  1   Date      31 non-null     object
#  ......
# dtypes: float64(1), int64(3), object(1)
# memory usage: 1.4+ KB

df['Date'] = pd.to_datetime(df['Date'])
#  #   Column    Non-Null Count  Dtype
# ---  ------    --------------  -----
#  ......
#  1   Date      31 non-null     datetime64[ns]
#  ......
# memory usage: 1.4 KB
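
When some cells cannot be parsed, pd.to_datetime() raises by default. As a hedged sketch (the sample strings are made up), passing an explicit format together with errors="coerce" turns unparseable or missing cells into NaT, which can then be removed with dropna():

```python
import pandas as pd

raw = pd.Series(["2020/12/01", "2020/12/02", None, "not a date"])

# Unparseable and missing entries become NaT instead of raising
parsed = pd.to_datetime(raw, format="%Y/%m/%d", errors="coerce")

print(parsed.isna().sum())  # 2
print(parsed.iloc[0])       # 2020-12-01 00:00:00
```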

Remove rows that have a null value in 'Date'

print(df.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 32 entries, 0 to 31
# Data columns (total 5 columns):
#  #   Column    Non-Null Count  Dtype
# ---  ------    --------------  -----
#  0   Duration  32 non-null     int64
#  1   Date      31 non-null     datetime64[ns]
#  2   Pulse     32 non-null     int64
#  3   Maxpulse  32 non-null     int64
#  4   Calories  30 non-null     float64
# dtypes: datetime64[ns](1), float64(1), int64(3)
# memory usage: 1.4 KB

df.dropna(subset=['Date'], inplace = True)

print(df.info())
# <class 'pandas.core.frame.DataFrame'>
# Index: 31 entries, 0 to 31
# Data columns (total 5 columns):
#  #   Column    Non-Null Count  Dtype
# ---  ------    --------------  -----
#  0   Duration  31 non-null     int64
#  1   Date      31 non-null     datetime64[ns]
#  2   Pulse     31 non-null     int64
#  3   Maxpulse  31 non-null     int64
#  4   Calories  29 non-null     float64
# dtypes: datetime64[ns](1), float64(1), int64(3)
# memory usage: 1.5 KB

2. DataFrame Has No Index Set

Set the index

df.set_index('Date', inplace=True)
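
One practical benefit of a Datetime index is slicing by date strings. The sketch below uses three rows taken from date_data.csv; the slice bounds are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2020/12/01", "2020/12/02", "2020/12/03"]),
    "Calories": [409.1, 479.0, 340.0],
})
df.set_index("Date", inplace=True)

# Label-based slicing on a DatetimeIndex includes both endpoints
subset = df.loc["2020-12-01":"2020-12-02"]
print(len(subset))  # 2
```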

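3. Cleaning Wrong Data

Delete the rows where 'Duration' is greater than 120:

# Keep only the rows whose Duration is at most 120. A boolean mask also works
# when the Date index contains repeated labels, where df.loc[x, "Duration"]
# would return a Series and break an if-test inside a loop.
df = df[df["Duration"] <= 120]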

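4. Removing Duplicates

Check whether the data contains duplicates (two rows with exactly the same data)

# duplicated() returns one boolean per row; .any() reduces them to a single flag
print(df.duplicated().any())
# True

Delete the duplicate rows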

df.drop_duplicates(inplace = True)
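
A minimal sketch of both calls on a three-row frame (the values echo the repeated 2020/12/12 row in date_data.csv):

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 60, 45],
    "Calories": [250.7, 250.7, 282.4],
})

# The first occurrence is not flagged; only later repeats are
print(df.duplicated().tolist())  # [False, True, False]

deduped = df.drop_duplicates()
print(len(deduped))              # 2
```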

Correlation

df = pd.read_csv('data.csv')
df.corr()
#           Duration     Pulse  Maxpulse  Calories
# Duration  1.000000 -0.155408  0.009403  0.922721
# Pulse    -0.155408  1.000000  0.786535  0.025120
# Maxpulse  0.009403  0.786535  1.000000  0.203814
# Calories  0.922721  0.025120  0.203814  1.000000
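
Each cell of the matrix is the correlation between two columns, ranging from -1 to 1, and the diagonal is always 1. As a small sketch on made-up data where Calories is exactly five times Duration, a single pair can also be computed with Series.corr():

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Calories": [150.0, 225.0, 300.0, 450.0],  # exactly 5 * Duration
})

# Correlation between one pair of columns
r = df["Duration"].corr(df["Calories"])
print(round(r, 6))  # 1.0

# The full matrix has one row and one column per numeric column
matrix = df.corr()
print(matrix.shape)  # (2, 2)
```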

Plotting

import matplotlib
matplotlib.use('QTAgg')
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv')

Line Plot

df.plot()

plt.show()

Scatter Plot

Plot two specified columns, one as X and the other as Y

df = pd.read_csv('data.csv')

df.plot(kind = 'scatter', x = 'Duration', y = 'Calories')

plt.show()

Histogram

df["Duration"].plot(kind = 'hist')
plt.show()

Date vs. DateTime vs. Time Zones in Pandas

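When working with data, especially time-related data, we often need Date and DateTime. They usually serve as the index of a Pandas DataFrame for analysis, and when combining multiple datasets even a slight inconsistency can cause errors. This tutorial looks at Date, DateTime, time zones, and their use and common problems in pandas.

1. The Difference Between Date and DateTime

Date: contains only the year, month, and day (e.g. 2023-09-25). Suitable when the specific time does not matter, for example a stock's daily opening price.
DateTime: contains the year, month, and day plus a specific time (e.g. 2023-09-25 14:30:00). Suitable for recording an exact point in time, such as the precise timestamp of a trade.

In pandas, you can use the pd.to_datetime() function to convert strings to DateTime format.

Conversion example: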

import pandas as pd

# Convert strings to datetime format
df['date_column'] = pd.to_datetime(df['date_column'])

This method handles both the date and the time parts automatically and converts the data to DateTime objects.
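
The snippet above assumes a df that already exists. As a self-contained sketch (the column names day and date_only are hypothetical), the following shows two ways to get from a DateTime to a Date: normalize() keeps the datetime dtype and sets the time to midnight, while .dt.date produces plain Python date objects.

```python
import pandas as pd

df = pd.DataFrame({"date_column": ["2023-09-25 14:30:00", "2023-09-26 09:00:00"]})
df["date_column"] = pd.to_datetime(df["date_column"])

# Keep the datetime dtype but zero out the time-of-day part
df["day"] = df["date_column"].dt.normalize()

# Plain Python date objects (no time component at all)
df["date_only"] = df["date_column"].dt.date

print(df["day"].iloc[0])        # 2023-09-25 00:00:00
print(df["date_only"].iloc[0])  # 2023-09-25
```

Keeping the datetime dtype (the normalize() route) is usually the more convenient choice for an index, because date-string slicing and time zone methods keep working.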

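2. tz_naive vs. tz_aware

tz_naive (no time zone):

This refers to a DateTime without time zone information, i.e. the time is not explicitly marked as belonging to any time zone. Such times are simpler, but they cause confusion in applications that span multiple regions.

tz_aware (with time zone):

This refers to a DateTime that carries time zone information. Time-zone-aware times correctly represent points in time around the world, which matters in applications such as cross-border trading.

Adding a time zone:

You can use pandas' tz_localize to attach a time zone to data that has none, and tz_convert to convert data that already has a time zone to another one.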

# Add a time zone
df['datetime_column'] = df['datetime_column'].dt.tz_localize('UTC')

# Convert to another time zone
df['datetime_column'] = df['datetime_column'].dt.tz_convert('Asia/Taipei')

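Problems when handling time zones:

When combining data, pandas can raise an error if one side carries a time zone and the other does not, because times with and without a time zone cannot be directly compared or used together in calculations.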
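
The error is easy to reproduce with two single timestamps. This is a minimal sketch; the variable names are chosen for illustration.

```python
import pandas as pd

naive = pd.Timestamp("2023-09-25 14:30:00")
aware = pd.Timestamp("2023-09-25 14:30:00", tz="UTC")

# Ordering a time without a zone against one with a zone raises TypeError
try:
    naive < aware
    comparable = True
except TypeError as exc:
    comparable = False
    print(f"TypeError: {exc}")

# Once the first value is given a zone, the comparison is valid
print(naive.tz_localize("UTC") == aware)  # True
```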

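3. UTC (Coordinated Universal Time)

UTC is a single worldwide time standard. Normalizing all times to UTC avoids time zone confusion, especially for cross-border trading and timestamps.

# Mark the datetimes as UTC (tz_localize attaches the zone without shifting the clock time)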
df['datetime_column'] = pd.to_datetime(df['datetime_column']).dt.tz_localize('UTC')
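
tz_localize('UTC') is only correct when the recorded times were already in UTC. If they were recorded in local time, a reasonable sketch is to localize to that zone first and then convert. The example assumes the data was recorded in Taipei time, which is an assumption made for illustration.

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2023-09-25 09:00:00"]))

# Interpret the wall-clock time as Taipei local time
taipei = s.dt.tz_localize("Asia/Taipei")

# Same instant expressed in UTC (Taipei is UTC+8, so 09:00 becomes 01:00)
utc = taipei.dt.tz_convert("UTC")

print(taipei.iloc[0])  # 2023-09-25 09:00:00+08:00
print(utc.iloc[0])     # 2023-09-25 01:00:00+00:00
```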

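4. Using Date and DateTime as the pandas Index

Using DateTime as the index is very common in pandas, especially in time series analysis. However, it can cause problems, particularly when joining multiple datasets, because slight inconsistencies in the dates or times can lead to errors.

Example: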

df1 = pd.DataFrame({
    'value': [1, 2, 3],
    'date': ['2023-09-25', '2023-09-26', '2023-09-27']
})
df1['date'] = pd.to_datetime(df1['date'])
df1.set_index('date', inplace=True)

df2 = pd.DataFrame({
    'value': [4, 5],
    'date': ['2023-09-26', '2023-09-27']
})
df2['date'] = pd.to_datetime(df2['date'])
df2.set_index('date', inplace=True)

# Try to join the two DataFrames
merged_df = df1.join(df2, lsuffix='_left', rsuffix='_right')

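Possible errors:

Precision: times may be accurate to the millisecond or finer, so two times that look the same may in fact differ slightly.
Time zone: when one time has a time zone (tz_aware) and the other does not (tz_naive), they cannot be compared.

Solutions:

Use .floor('D') or .normalize() on the index (or .dt.floor('D') / .dt.normalize() on a column) to unify the time granularity so that small differences do not appear when joining.
Make sure all DateTime values are in the same time zone or normalized to UTC.

# Unify the time granularity
df1.index = df1.index.floor('D')
df2.index = df2.index.floor('D')

# Join again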
merged_df = df1.join(df2, lsuffix='_left', rsuffix='_right')
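
In the example above both indexes are already at midnight, so the join matches either way. The effect of the granularity fix shows up when one side carries a time of day. The following sketch uses hypothetical prices and signals frames with made-up values:

```python
import pandas as pd

# Intraday timestamps on one side (e.g. a 16:00 close), plain dates on the other
prices = pd.DataFrame(
    {"close": [100.0, 101.0]},
    index=pd.to_datetime(["2023-09-26 16:00:00", "2023-09-27 16:00:00"]),
)
signals = pd.DataFrame(
    {"signal": [1, -1]},
    index=pd.to_datetime(["2023-09-26", "2023-09-27"]),
)

# The labels differ by 16 hours, so no row finds a partner
before = prices.join(signals)
print(before["signal"].isna().sum())  # 2

# Drop the time-of-day part on the intraday side, then join again
prices.index = prices.index.normalize()
after = prices.join(signals)
print(after)
```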

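5. Common Problems and Solutions

(1) Inconsistent DateTime values when combining multiple datasets:

One common problem is that when the DateTime indexes do not match exactly, joining may produce NaN values or errors.

Solution: check and unify the time format and time zone, and use .floor() or .normalize() to handle the time granularity.

(2) Errors when mixing `tz_naive` and `tz_aware`:

Times with and without a time zone cannot be compared directly.

Solution: either normalize both times to UTC or put them in the same time zone.

(3) Integrating data across time zones:

Joining data recorded in different time zones is error-prone, and pandas can raise an error when the times cannot be compared directly, for example when some carry a time zone and others do not.

Solution: use .tz_convert('UTC') to convert all times to UTC before joining (data without a time zone must first be given one with .tz_localize()).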
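
A sketch of that approach, with two hypothetical frames whose indexes describe the same instants in different zones (all values are made up):

```python
import pandas as pd

taipei = pd.DataFrame(
    {"tw_value": [1, 2]},
    index=pd.to_datetime(["2023-09-25 09:00", "2023-09-25 10:00"]).tz_localize("Asia/Taipei"),
)
utc = pd.DataFrame(
    {"utc_value": [10, 20]},
    index=pd.to_datetime(["2023-09-25 01:00", "2023-09-25 02:00"]).tz_localize("UTC"),
)

# Express both indexes in UTC before joining
taipei.index = taipei.index.tz_convert("UTC")
merged = taipei.join(utc)

print(merged)
```

After the conversion both indexes use the same zone, so 09:00 in Taipei and 01:00 UTC land on the same row.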

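Summary

When handling Date and DateTime in pandas, especially as an index, pay particular attention to time consistency when joining and comparing. This includes precision, inconsistent time zones, and time zone conversion. Sensible use of normalization tools such as .floor(), .normalize(), and tz_convert() avoids most of these errors.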
