Day18 Categorical Data 2/2: Counting and Feature Hashing

11th iThome Ironman Contest

AI & Data

Hands on Data Cleaning and Scraping series, part 18

11th Ironman Contest, feature hashing, count encoding, data cleaning

kyt

2019-09-19 07:06:10

Count Encoding

If the target value is correlated with the number of rows that belong to each category, the count itself can be used as a feature. For example, in natural language processing the count encoding of a word is called its term frequency, and it is one of the most important and useful features in that field.
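As a minimal sketch of count encoding, the snippet below replaces each category with the number of times it appears in the column. The function name `count_encode` and the sample `cities` list are hypothetical, chosen only for illustration; only the Python standard library is assumed.

```python
from collections import Counter


def count_encode(values):
    """Replace each category with how many times it occurs in `values`."""
    counts = Counter(values)  # category -> number of rows
    return [counts[v] for v in values]


# Hypothetical sample column, for illustration only.
cities = ["Taipei", "Tainan", "Taipei", "Hsinchu", "Taipei", "Tainan"]
encoded = count_encode(cities)
print(encoded)  # [3, 2, 3, 1, 3, 2]
```

Two categories with the same count receive the same encoded value, so this encoding is most useful when the count really does carry information about the target.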

Feature Hashing

Feature hashing maps a categorical feature onto a set of numbers using a hash function. Adjusting the hash function controls how many distinct output values there are, which is a trade-off between computational cost and discriminative power: fewer output values are cheaper to process, but more categories share a value and can no longer be told apart. Feature hashing increases the information density of the data and reduces useless labels. Consider it when the number of distinct categories is very large, to save time.
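The idea can be sketched with the standard library alone. The helper names `hash_bucket` and `hash_encode` and the parameter `n_buckets` are hypothetical, and MD5 is used here only as an assumption because it gives the same result on every run (Python's built-in `hash` for strings is randomized per process). This is one possible illustration, not the only way to hash features.

```python
import hashlib


def hash_bucket(category, n_buckets):
    """Map a category string to a bucket index in [0, n_buckets)."""
    digest = hashlib.md5(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hash_encode(values, n_buckets):
    """Encode each category as a vector of length n_buckets with a single 1."""
    rows = []
    for v in values:
        row = [0] * n_buckets
        row[hash_bucket(v, n_buckets)] = 1
        rows.append(row)
    return rows


# Hypothetical sample column, for illustration only.
cities = ["Taipei", "Tainan", "Taipei", "Hsinchu", "Kaohsiung"]
vectors = hash_encode(cities, n_buckets=4)
for city, vec in zip(cities, vectors):
    print(city, vec)
```

A smaller `n_buckets` gives shorter vectors but more collisions, where different categories land in the same bucket; a larger one does the opposite. For real projects, scikit-learn's `FeatureHasher` provides a more complete implementation of the same technique.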