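import pandas as pd  # must be imported before pd.read_csv is called

# Mount Google Drive (Colab only), then load the data from it
from google.colab import drive
drive.mount('/content/gdrive')
train = pd.read_csv("/content/gdrive/My Drive/train.csv")
test = pd.read_csv("/content/gdrive/My Drive/test.csv")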
Remember to upload train.csv and test.csv to Google Drive before running the Colab code!
import pandas as pd
import numpy as np
import seaborn as sns
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# Inspect the data
train.shape, test.shape
# ((1460, 81), (1459, 80))
train.head()
The Id column is never used for prediction, so we drop it before doing anything else.
train.drop(['Id'], axis=1, inplace=True)
test.drop(['Id'], axis=1, inplace=True)
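We can visualize the prediction target. For regression problems like this one, it helps to bring the target as close to a normal distribution as possible: the model then sees a more balanced target and is less inclined to lean toward predicting high or low.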
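# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True replaces it
sns.histplot(train["SalePrice"], kde=True, stat="density")
sns.histplot(np.log1p(train["SalePrice"]), kde=True, stat="density")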
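To put a number on what the two plots show, one option is to compare skewness before and after the transform. The sketch below is only an illustration: it uses synthetic log-normal prices as a stand-in for `train["SalePrice"]` (an assumption, not the real data) and the pandas `Series.skew` method.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" (hypothetical stand-in for train["SalePrice"])
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))

skew_raw = prices.skew()            # skewness of the raw values
skew_log = np.log1p(prices).skew()  # skewness after the log1p transform

print(f"raw skew: {skew_raw:.3f}, log1p skew: {skew_log:.3f}")
```

A skewness close to 0 indicates a roughly symmetric distribution, so the value after `log1p` should sit much nearer to 0 than the raw one. The same two `.skew()` calls can be run on the real column.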
Remember to assign the result back to the original data, otherwise the transform has no effect!
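# Assign by column name: writing floats into the integer column through
# .loc[:, ...] triggers a dtype warning in recent pandas versions
train["SalePrice"] = np.log1p(train["SalePrice"])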
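Because the target is now on a log scale, anything a model learns to predict from it is on that scale too and has to be mapped back with `np.expm1`, the inverse of `np.log1p`. A minimal sketch with made-up prices, where `log_pred` is a hypothetical placeholder for a model's output:

```python
import numpy as np

prices = np.array([208500.0, 181500.0, 223500.0])  # made-up example values
log_prices = np.log1p(prices)   # forward transform: log(1 + x)

log_pred = log_prices           # pretend these are model predictions (hypothetical)
restored = np.expm1(log_pred)   # inverse transform: exp(x) - 1

print(np.allclose(restored, prices))
```

Using the `log1p`/`expm1` pair rather than plain `log`/`exp` keeps the round trip well defined even when a value is 0.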