iT邦幫忙

11th iThome Ironman Contest

DAY 18
AI & Data

Hands on Data Cleaning and Scraping series, part 18

Day18 Categorical Data 2/2: Counting and Feature Hashing

Count Encoding

If the target value is correlated with how often each category of a categorical feature occurs, the count itself can be used as a feature. For example, in natural language processing the count encoding of words is called term frequency, and it is an important and useful feature.
https://ithelp.ithome.com.tw/upload/images/20190919/2011970908ON1XwPcd.jpg
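
The idea above can be sketched in a few lines. This is only a minimal illustration using the Python standard library, not the code from the author's repository; the function name `count_encode` and the toy `cities` column are hypothetical.

```python
from collections import Counter


def count_encode(values):
    """Replace each category with the number of times it occurs in `values`."""
    counts = Counter(values)  # category -> number of occurrences
    return [counts[v] for v in values]


# Hypothetical toy column, not taken from the article's dataset.
cities = ["Taipei", "Tainan", "Taipei", "Kaohsiung", "Taipei", "Tainan"]
encoded = count_encode(cities)
print(encoded)  # [3, 2, 3, 1, 3, 2]
```

In practice the counts would usually be computed on the training data only and then looked up for new rows, with a default such as 0 for categories that were never seen.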

Feature Hashing

Feature hashing maps the values of a categorical feature to a fixed set of numbers through a hash function. Adjusting the hash function controls how many distinct output values there are, which trades computational cost against discriminative power: fewer output values are cheaper to work with, but more categories collide and become indistinguishable. The result is a denser representation with fewer useless labels. Consider feature hashing when a feature has a very large number of distinct categories, since it saves time.
https://ithelp.ithome.com.tw/upload/images/20190919/20119709rbsX420Y54.png
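
A rough sketch of the mapping described above, again using only the standard library and not the author's actual code. The names `hash_bucket`, `hash_encode`, and `n_buckets` are hypothetical, and MD5 is chosen here only because it gives the same result on every run; any stable hash function would serve.

```python
import hashlib


def hash_bucket(value, n_buckets):
    """Map a category string to a bucket index in the range [0, n_buckets)."""
    # Python's built-in hash() for strings is randomized per process,
    # so a stable digest is used to keep the encoding reproducible.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_encode(values, n_buckets=8):
    """Encode every category in `values` as its bucket index."""
    return [hash_bucket(v, n_buckets) for v in values]


# Hypothetical toy column, not taken from the article's dataset.
cities = ["Taipei", "Tainan", "Taipei", "Kaohsiung", "Taipei", "Tainan"]
encoded = hash_encode(cities, n_buckets=4)
print(encoded)
```

Here `n_buckets` plays the role of the knob mentioned above: a smaller value gives fewer output values and more collisions, a larger value keeps more categories distinct at a higher cost.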

The code for this article is available on GitHub.

Please let me know if there are any mistakes in this article. Thanks for reading.

References:

[1] Content from the 2nd 100-Day Machine Learning Marathon (第二屆機器學習百日馬拉松)

[2] Word to Vectors

[3] Data feature processing: feature hashing (数据特征处理之特征哈希)
