# DAY 4

## Why Do Feature Engineering?

### Data Cleaning

#### Categorical Features

```python
data = [{'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
        {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
        {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
        {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}]
```

```python
from sklearn.feature_extraction import DictVectorizer

# One-hot encode the categorical 'neighborhood' strings;
# numeric columns pass through unchanged
vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)
```

#### Text Features

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Example corpus (the original post's `sample` definition was omitted here)
sample = ['problem of evil', 'evil queen', 'horizon problem']

# Bag-of-words: one column per vocabulary word, one row of counts per document
vec = CountVectorizer()
X = vec.fit_transform(sample)

pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
```

**Term frequency–inverse document frequency (TF-IDF)** weights the raw word counts by how often each word appears across the documents, down-weighting words that occur in many documents and emphasizing those that are distinctive to a few.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
X = vec.fit_transform(sample)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
```

#### Image Features

#### Derived Features

```python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5])
y = np.array([4, 2, 1, 3, 7])
plt.scatter(x, y);
```

```python
from sklearn.linear_model import LinearRegression

# A straight line fits this clearly non-linear data poorly
X = x[:, np.newaxis]
model = LinearRegression().fit(X, y)
yfit = model.predict(X)
plt.scatter(x, y)
plt.plot(x, yfit);
```

```python
from sklearn.preprocessing import PolynomialFeatures

# Expand the single feature x into the basis [x, x^2, x^3]
poly = PolynomialFeatures(degree=3, include_bias=False)
X2 = poly.fit_transform(X)
print(X2)
```
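As a quick sanity check (a sketch using the same toy array), the transformed matrix is simply the column-stacked elementwise powers x, x², x³:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([1, 2, 3, 4, 5])
X = x[:, np.newaxis]
X2 = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)

# Each output column is a power of the single input feature
expected = np.column_stack([x, x**2, x**3])
print(np.array_equal(X2, expected))  # True
```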

```python
# Refit the linear model on the expanded features: the result is a
# degree-3 polynomial fit to the original data
model = LinearRegression().fit(X2, y)
yfit = model.predict(X2)
plt.scatter(x, y)
plt.plot(x, yfit);
```
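The two-step recipe above (expand the features, then fit a linear model) is commonly packaged into a single estimator with scikit-learn's `make_pipeline`; a sketch on the same toy data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.array([1, 2, 3, 4, 5])
y = np.array([4, 2, 1, 3, 7])
X = x[:, np.newaxis]

# Chain basis expansion and regression; fit/predict run both steps in order
model = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                      LinearRegression())
model.fit(X, y)
print(model.score(X, y))
```

Bundling the transform with the model this way also ensures new data passes through the same expansion before prediction.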

## References

1. Feature Engineering: Data Preprocessing (Part 1) (特徵工程之資料預處理（上）)

2. Day11 - Scikit-learn Introduction (3): Feature Engineering —
   https://ithelp.ithome.com.tw/articles/10205475