Day 27 [Python ML, Data Cleaning] Handling Dates in Data

2021 iThome Ironman Contest

DAY 26

AI & Data

Learning Machine Learning with Python series, part 27

13th Ironman Contest

guancioul

Team: 人工逗點智慧

2021-10-11 10:15:43


Setting up the environment

First we load the libraries and the dataset. The dataset we will use records landslides that occurred between 2007 and 2016.

# modules we'll use
import pandas as pd
import numpy as np
import seaborn as sns
import datetime

# read in our data
landslides = pd.read_csv("./catalog.csv")

# set seed for reproducibility
np.random.seed(0)

Checking the data type of the date column

landslides.head()

We will be working with the date column, so we first confirm that it actually contains dates.

# print the first few rows of the date column
print(landslides['date'].head())

0     3/2/07
1    3/22/07
2     4/6/07
3    4/14/07
4    4/15/07
Name: date, dtype: object

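We are humans, so we can tell that these are dates, but that does not mean Python can.

Notice that the dtype of this column is object.

pandas uses the object dtype to hold many kinds of data, but most often it means strings.

A date column would normally have the dtype datetime64. We can check what type a column has with dtype: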

# check the data type of our date column
landslides['date'].dtype

dtype('O')

The dtype O stands for object.
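To make this concrete, here is a small sketch on made-up data. The Series `raw` mirrors the first rows of the date column, and `mixed` is a hypothetical column that holds several Python types at once. Both names and values are assumptions for illustration only, not part of the dataset.

```python
import pandas as pd

# Hypothetical samples; dtype=object is set explicitly so the result
# does not depend on the pandas version's default string dtype.
raw = pd.Series(["3/2/07", "3/22/07", "4/6/07"], dtype=object)
mixed = pd.Series(["3/2/07", 42, 3.5], dtype=object)

# Both columns report the same dtype even though their contents differ.
print(raw.dtype, mixed.dtype)

# Looking at the Python type of each element shows what is really stored.
raw_types = {type(v).__name__ for v in raw}
mixed_types = {type(v).__name__ for v in mixed}
print(raw_types)
print(mixed_types)

# pandas does not treat the string column as dates.
is_dates = pd.api.types.is_datetime64_any_dtype(raw)
print(is_dates)
```

So an object dtype tells us little by itself. It only says that pandas is storing generic Python objects, which for a date column usually means unparsed strings.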

Converting the date column to datetime

We can tell pandas what format the dates are written in and have it convert the column to the datetime dtype:

# Create a new column, date_parsed, with the parsed dates
landslides['date_parsed'] = pd.to_datetime(landslides['date'], format="%m/%d/%y")

Now let's look at the data after parsing.

# print the first few rows
landslides['date_parsed'].head()

0   2007-03-02
1   2007-03-22
2   2007-04-06
3   2007-04-14
4   2007-04-15
Name: date_parsed, dtype: datetime64[ns]
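The format string "%m/%d/%y" reads as month, day, two-digit year, separated by slashes. The sketch below uses a made-up Series with one bad entry (an assumption for illustration) to show what may happen when a value does not fit the format, and how `errors="coerce"` turns such values into NaT instead of raising.

```python
import pandas as pd

# Hypothetical sample; the last entry does not match month/day/year.
dates = pd.Series(["3/2/07", "3/22/07", "not a date"], dtype=object)

# %m = month, %d = day of month, %y = two-digit year.
# A strict parse raises as soon as a string does not fit the format.
strict_failed = False
try:
    pd.to_datetime(dates, format="%m/%d/%y")
except ValueError as err:
    strict_failed = True
    print("strict parse failed:", type(err).__name__)

# errors="coerce" replaces unparseable entries with NaT (missing date).
parsed = pd.to_datetime(dates, format="%m/%d/%y", errors="coerce")
print(parsed)

# Count how many entries could not be parsed.
n_bad = int(parsed.isna().sum())
print(n_bad)
```

Coercing is convenient, but it hides bad rows, so it is worth checking the count of NaT values afterwards rather than assuming every date was parsed.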

# get the day of the month from the date_parsed column
day_of_month_landslides = landslides['date_parsed'].dt.day
day_of_month_landslides.head()

0     2.0
1    22.0
2     6.0
3    14.0
4    15.0
Name: date_parsed, dtype: float64

The column must have a datetime dtype before we can use the .dt accessor to extract values from it.
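A minimal sketch of this on made-up data follows (the Series `raw` is hypothetical). It shows that .dt is not available on the unparsed column and works after parsing. It also includes a missing value, which is presumably why the day values in the output above are floats: a missing date becomes NaN, and NaN forces a float column.

```python
import pandas as pd

# Hypothetical sample with one missing date.
raw = pd.Series(["3/2/07", "3/22/07", None], dtype=object)

# On the unparsed (object) column the .dt accessor raises AttributeError.
dt_failed = False
try:
    raw.dt.day
except AttributeError as err:
    dt_failed = True
    print("AttributeError:", err)

# After parsing, .dt exposes the parts of each date.
parsed = pd.to_datetime(raw, format="%m/%d/%y")
days = parsed.dt.day
months = parsed.dt.month
years = parsed.dt.year

print(days)
print(months)
print(years)
```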

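Plotting the day of the month to check the date parsing

We plot the values as a histogram to make sure that the parsed days of the month fall only between 1 and 31.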

# remove na's
day_of_month_landslides = day_of_month_landslides.dropna()

# plot the day of the month (histplot replaces the deprecated distplot)
sns.histplot(day_of_month_landslides, bins=31)

[Figure: histogram of day_of_month_landslides with 31 bins; x-axis: date_parsed]
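The same sanity check can also be done numerically, which is handy when no plot is available. The sketch below uses a made-up stand-in for day_of_month_landslides (the values and the names `day_of_month`, `all_valid`, and `counts` are assumptions for illustration).

```python
import pandas as pd

# Hypothetical stand-in for day_of_month_landslides, with one missing value.
day_of_month = pd.Series([2.0, 22.0, 6.0, 14.0, 15.0, None])

# Drop missing values, as in the plotting step.
days = day_of_month.dropna()

# Every remaining value should lie between 1 and 31 inclusive.
all_valid = bool(days.between(1, 31).all())
print(all_valid)

# Counts per day in calendar order: the numbers a histogram would draw.
counts = days.astype(int).value_counts().sort_index()
print(counts)
```

If `all_valid` came out False, or the counts piled up on days 1 to 12 only, that would suggest the month and day fields were swapped during parsing.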