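Many datasets contain date and time features. They are important columns, and handling them properly can help a machine learning model learn faster and make more accurate predictions.
Each number in a date and time corresponds to one specific part of it (year, month, hour, and so on), which makes these values a good source of information. Which features to extract from a datetime variable depends entirely on the project.
We will use Pandas to extract useful features from a datetime column. First, load Kaggle's New York City Taxi Fare Prediction dataset.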
import numpy as np
import pandas as pd
df_train = pd.read_csv('../input/new-york-city-taxi-fare-prediction/train.csv', nrows = 2_000, parse_dates=["pickup_datetime"])
df_train.dtypes
| column | dtype |
|---|---|
| key | object |
| fare_amount | float64 |
| pickup_datetime | datetime64[ns, UTC] |
| pickup_longitude | float64 |
| pickup_latitude | float64 |
| dropoff_longitude | float64 |
| dropoff_latitude | float64 |
| passenger_count | int64 |

dtype: object
df_train['pickup_datetime'].head()
| | pickup_datetime |
|---|---|
| 0 | 2009-06-15 17:26:21+00:00 |
| 1 | 2010-01-05 16:52:16+00:00 |
| 2 | 2011-08-18 00:35:00+00:00 |
| 3 | 2012-04-21 04:30:42+00:00 |
| 4 | 2010-03-09 07:51:00+00:00 |

Name: pickup_datetime, dtype: datetime64[ns, UTC]
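If a column was not parsed when the file was read, it can be converted afterwards. The following is a minimal sketch, not the notebook's own code: `sample` is a hypothetical stand-in DataFrame holding the five pickup timestamps shown above as plain strings, and `pd.to_datetime` with `utc=True` turns them into a timezone-aware datetime column.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: the five timestamps shown above, as raw strings.
sample = pd.DataFrame({
    "pickup_datetime": [
        "2009-06-15 17:26:21+00:00",
        "2010-01-05 16:52:16+00:00",
        "2011-08-18 00:35:00+00:00",
        "2012-04-21 04:30:42+00:00",
        "2010-03-09 07:51:00+00:00",
    ]
})

# Convert the string column to a timezone-aware (UTC) datetime column.
sample["pickup_datetime"] = pd.to_datetime(sample["pickup_datetime"], utc=True)

# The .dt accessor only works once the column has a datetime dtype.
print(sample["pickup_datetime"].dtype)
print(sample["pickup_datetime"].dt.year.tolist())
```

Parsing at read time with `parse_dates`, as done above, is usually more convenient; converting afterwards is useful when the DataFrame comes from somewhere else.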
Given a **datetime** variable, we can extract the following information:
Extracting the date feature
df_train['pickup_date'] = df_train['pickup_datetime'].dt.date
df_train[['pickup_datetime','pickup_date']].head()
| | pickup_datetime | pickup_date |
|---|---|---|
| 0 | 2009-06-15 17:26:21+00:00 | 2009-06-15 |
| 1 | 2010-01-05 16:52:16+00:00 | 2010-01-05 |
| 2 | 2011-08-18 00:35:00+00:00 | 2011-08-18 |
| 3 | 2012-04-21 04:30:42+00:00 | 2012-04-21 |
| 4 | 2010-03-09 07:51:00+00:00 | 2010-03-09 |
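One detail worth knowing about `.dt.date`: it returns Python `datetime.date` objects, so the resulting column has `object` dtype and no longer supports the `.dt` accessor. The sketch below, which uses a hypothetical two-row Series named `pickups`, contrasts it with `.dt.normalize()`, which keeps the datetime dtype and sets the time to midnight.

```python
import pandas as pd

# Hypothetical sample: two of the pickup timestamps shown above.
pickups = pd.Series(pd.to_datetime([
    "2009-06-15 17:26:21+00:00",
    "2012-04-21 04:30:42+00:00",
], utc=True))

# Python datetime.date objects; the Series dtype becomes object.
as_objects = pickups.dt.date

# Still a timezone-aware datetime Series, with the time part reset to 00:00:00.
as_midnight = pickups.dt.normalize()

print(as_objects.dtype)
print(as_midnight.dtype)
print(as_midnight.dt.hour.tolist())
```

Which form is preferable depends on what comes next: `datetime.date` objects are fine for grouping or display, while the normalized form is easier to use for further datetime arithmetic.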
Extracting the year, month, and day-of-month features
df_train['pickup_year'] = df_train['pickup_datetime'].dt.year
df_train['pickup_month'] = df_train['pickup_datetime'].dt.month
df_train['pickup_day'] = df_train['pickup_datetime'].dt.day
df_train[['pickup_datetime','pickup_year','pickup_month','pickup_day']].head()
| | pickup_datetime | pickup_year | pickup_month | pickup_day |
|---|---|---|---|---|
| 0 | 2009-06-15 17:26:21+00:00 | 2009 | 6 | 15 |
| 1 | 2010-01-05 16:52:16+00:00 | 2010 | 1 | 5 |
| 2 | 2011-08-18 00:35:00+00:00 | 2011 | 8 | 18 |
| 3 | 2012-04-21 04:30:42+00:00 | 2012 | 4 | 21 |
| 4 | 2010-03-09 07:51:00+00:00 | 2010 | 3 | 9 |
Extracting week-related features
# Day of the week (Monday=0, Sunday=6)
df_train['pickup_dayofweek'] = df_train['pickup_datetime'].dt.dayofweek
# Is it a weekend? (Saturday=5, Sunday=6)
df_train['is_weekend'] = np.where(df_train['pickup_dayofweek'].isin([5, 6]), 1, 0)
# Week of the year (ISO week number)
df_train['pickup_week'] = df_train['pickup_datetime'].dt.isocalendar().week
df_train[['pickup_datetime','pickup_dayofweek','is_weekend','pickup_week']].head()
| | pickup_datetime | pickup_dayofweek | is_weekend | pickup_week |
|---|---|---|---|---|
| 0 | 2009-06-15 17:26:21+00:00 | 0 | 0 | 25 |
| 1 | 2010-01-05 16:52:16+00:00 | 1 | 0 | 1 |
| 2 | 2011-08-18 00:35:00+00:00 | 3 | 0 | 33 |
| 3 | 2012-04-21 04:30:42+00:00 | 5 | 1 | 16 |
| 4 | 2010-03-09 07:51:00+00:00 | 1 | 0 | 10 |
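The weekend flag can also be written as a plain comparison, since `dayofweek` numbers Saturday and Sunday as 5 and 6. The sketch below is an alternative formulation on a hypothetical two-row Series named `pickups`; it also casts the ISO week number, which `isocalendar()` returns as the nullable `UInt32` dtype, to a regular integer.

```python
import pandas as pd

# Hypothetical sample: two of the pickup timestamps shown above.
pickups = pd.Series(pd.to_datetime([
    "2009-06-15 17:26:21+00:00",  # a Monday
    "2012-04-21 04:30:42+00:00",  # a Saturday
], utc=True))

# Monday=0 ... Sunday=6
day_of_week = pickups.dt.dayofweek

# Saturday (5) and Sunday (6) are the only values >= 5.
is_weekend = (day_of_week >= 5).astype(int)

# ISO week number, cast from UInt32 to a plain integer dtype.
week = pickups.dt.isocalendar().week.astype(int)

print(is_weekend.tolist())  # [0, 1]
print(week.tolist())        # [25, 16]
```

Because these are ISO week numbers, the first days of January can fall in week 52 or 53 of the previous ISO year, which matters if the week number is combined with the calendar year.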
Extracting year-related features
# Quarter of the year (1 to 4)
df_train['pickup_quarter'] = df_train['pickup_datetime'].dt.quarter
# First or second half of the year (1 or 2)
df_train['pickup_semester'] = np.where(df_train['pickup_quarter'].isin([1, 2]), 1, 2)
df_train[['pickup_datetime','pickup_quarter','pickup_semester']].head()
| | pickup_datetime | pickup_quarter | pickup_semester |
|---|---|---|---|
| 0 | 2009-06-15 17:26:21+00:00 | 2 | 1 |
| 1 | 2010-01-05 16:52:16+00:00 | 1 | 1 |
| 2 | 2011-08-18 00:35:00+00:00 | 3 | 2 |
| 3 | 2012-04-21 04:30:42+00:00 | 2 | 1 |
| 4 | 2010-03-09 07:51:00+00:00 | 1 | 1 |
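The half-year feature can also be derived with integer arithmetic instead of `np.where`. This is only an alternative sketch on a hypothetical Series named `pickups` built from the five timestamps above; both expressions should give the same result as the `isin([1, 2])` version.

```python
import pandas as pd

# Hypothetical sample: the five pickup timestamps shown above.
pickups = pd.Series(pd.to_datetime([
    "2009-06-15 17:26:21+00:00",
    "2010-01-05 16:52:16+00:00",
    "2011-08-18 00:35:00+00:00",
    "2012-04-21 04:30:42+00:00",
    "2010-03-09 07:51:00+00:00",
], utc=True))

quarter = pickups.dt.quarter

# Quarters 1 and 2 map to 1, quarters 3 and 4 map to 2.
semester_from_quarter = (quarter + 1) // 2

# Months 1-6 map to 1, months 7-12 map to 2.
semester_from_month = (pickups.dt.month - 1) // 6 + 1

print(quarter.tolist())                # [2, 1, 3, 2, 1]
print(semester_from_quarter.tolist())  # [1, 1, 2, 1, 1]
```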
Extracting the time feature
df_train['pickup_time'] = df_train['pickup_datetime'].dt.time
df_train[['pickup_datetime','pickup_time']].head()
| | pickup_datetime | pickup_time |
|---|---|---|
| 0 | 2009-06-15 17:26:21+00:00 | 17:26:21 |
| 1 | 2010-01-05 16:52:16+00:00 | 16:52:16 |
| 2 | 2011-08-18 00:35:00+00:00 | 00:35:00 |
| 3 | 2012-04-21 04:30:42+00:00 | 04:30:42 |
| 4 | 2010-03-09 07:51:00+00:00 | 07:51:00 |
Extracting the hour, minute, and second features
df_train['pickup_hour'] = df_train['pickup_datetime'].dt.hour
df_train['pickup_minute'] = df_train['pickup_datetime'].dt.minute
df_train['pickup_second'] = df_train['pickup_datetime'].dt.second
df_train[['pickup_datetime','pickup_hour','pickup_minute','pickup_second']].head()
| | pickup_datetime | pickup_hour | pickup_minute | pickup_second |
|---|---|---|---|---|
| 0 | 2009-06-15 17:26:21+00:00 | 17 | 26 | 21 |
| 1 | 2010-01-05 16:52:16+00:00 | 16 | 52 | 16 |
| 2 | 2011-08-18 00:35:00+00:00 | 0 | 35 | 0 |
| 3 | 2012-04-21 04:30:42+00:00 | 4 | 30 | 42 |
| 4 | 2010-03-09 07:51:00+00:00 | 7 | 51 | 0 |
Extracting business-hours and morning features
# Is it business hours? (here 8 AM up to, but not including, 12 PM) (1 or 0)
df_train['pickup_business'] = np.where(df_train['pickup_hour'].isin([8, 9, 10, 11]), 1, 0)
# Is it morning? (hours 7 to 11) (1 or 0)
df_train['pickup_is_morning'] = np.where((df_train['pickup_hour']<12) & (df_train['pickup_hour']>6), 1, 0)
df_train[['pickup_datetime','pickup_business','pickup_is_morning']].head()
| | pickup_datetime | pickup_business | pickup_is_morning |
|---|---|---|---|
| 0 | 2009-06-15 17:26:21+00:00 | 0 | 0 |
| 1 | 2010-01-05 16:52:16+00:00 | 0 | 0 |
| 2 | 2011-08-18 00:35:00+00:00 | 0 | 0 |
| 3 | 2012-04-21 04:30:42+00:00 | 0 | 0 |
| 4 | 2010-03-09 07:51:00+00:00 | 0 | 1 |
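The steps above can be gathered into one reusable function. The following is a sketch under stated assumptions rather than part of the original notebook: the function name `add_datetime_features`, its `prefix` and `business_hours` parameters, and the `demo` DataFrame are all hypothetical, and the hour ranges simply mirror the ones used above (8 to 11 for business hours, 7 to 11 for morning).

```python
import pandas as pd

def add_datetime_features(df, col, prefix, business_hours=(8, 11)):
    """Return a copy of df with calendar and clock features derived from df[col].

    Hypothetical helper: the name, parameters, and output column names are
    illustrative choices, not an established API.
    """
    out = df.copy()
    dt = out[col].dt

    # Calendar features
    out[f"{prefix}_year"] = dt.year
    out[f"{prefix}_month"] = dt.month
    out[f"{prefix}_day"] = dt.day
    out[f"{prefix}_dayofweek"] = dt.dayofweek                  # Monday=0 ... Sunday=6
    out[f"{prefix}_is_weekend"] = (dt.dayofweek >= 5).astype(int)
    out[f"{prefix}_week"] = dt.isocalendar().week.astype(int)  # ISO week number
    out[f"{prefix}_quarter"] = dt.quarter
    out[f"{prefix}_semester"] = (dt.quarter + 1) // 2          # 1 = first half, 2 = second half

    # Clock features
    out[f"{prefix}_hour"] = dt.hour
    out[f"{prefix}_minute"] = dt.minute
    out[f"{prefix}_second"] = dt.second

    # Hour-range flags; between() includes both endpoints by default.
    start, end = business_hours
    out[f"{prefix}_business"] = out[f"{prefix}_hour"].between(start, end).astype(int)
    out[f"{prefix}_is_morning"] = out[f"{prefix}_hour"].between(7, 11).astype(int)
    return out

# Hypothetical demo data: the five pickup timestamps shown above.
demo = pd.DataFrame({"pickup_datetime": pd.to_datetime([
    "2009-06-15 17:26:21+00:00",
    "2010-01-05 16:52:16+00:00",
    "2011-08-18 00:35:00+00:00",
    "2012-04-21 04:30:42+00:00",
    "2010-03-09 07:51:00+00:00",
], utc=True)})

features = add_datetime_features(demo, "pickup_datetime", "pickup")
print(features[["pickup_hour", "pickup_business", "pickup_is_morning"]])
```

Returning a copy leaves the input DataFrame unchanged, so the same function can be applied to a training set and a test set without side effects. Which of these columns are actually worth keeping still depends on the project.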