Day17 Categorical Data 1/2: Mean Encoding


We discussed one-hot encoding and label encoding in the Day03 article. Besides those two methods, mean encoding is sometimes used when dealing with categorical features. Normally, we use label encoding by default for categorical data, and use one-hot encoding only when the feature is important and the data is not too large (otherwise it becomes too computationally expensive). We can consider mean encoding when the feature is highly related to the target values (for example, area and housing price range).

Mean Encoding


Mean encoding replaces each category of the original feature with the mean of the target values for that category. One thing to be careful about is that this method overfits very easily, even after smoothing. (Overfitting happens when a model is trained on too little data or for too long: it ends up fitting the training data almost perfectly but fails to perform well on new, unseen data.)
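As a rough illustration of the replacement step, here is a minimal sketch in plain Python. The toy `rows` data (area, price) and the helper name `mean_encode` are made up for this example and are not from the original article.

```python
from collections import defaultdict

# Hypothetical toy data: (area, price) pairs. "area" is the categorical
# feature and "price" is the target.
rows = [("A", 100), ("A", 120), ("B", 200), ("B", 220), ("B", 240), ("C", 500)]

def mean_encode(rows):
    """Return {category: mean of the target values seen for that category}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, target in rows:
        totals[category] += target
        counts[category] += 1
    return {c: totals[c] / counts[c] for c in totals}

area_means = mean_encode(rows)

# Replace each category with the mean target value of that category.
encoded = [area_means[area] for area, _ in rows]
```

Note that area "C" has a single row, so its encoded value is exactly that row's own target. This is the kind of leakage that makes plain mean encoding prone to overfitting.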



Using smoothing to partly mitigate overfitting in mean encoding

If we have only a very small amount of data and happen to pick up an extreme value, the resulting mean will be skewed. So when using mean encoding, we add the number of samples in each category as a measure of reliability.
When the reliability of a category is low (few samples), we lean more on the mean of all the data; when the reliability is high (many samples), we lean more on the mean of that category.
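One common way to express this weighting (an assumption here, since the article does not give an exact formula) is `(n * category_mean + m * global_mean) / (n + m)`, where `n` is the number of samples in the category and `m` is a smoothing strength you choose. The sketch below uses made-up toy data and a hypothetical helper name, `smoothed_mean_encode`.

```python
from collections import defaultdict

# Hypothetical toy data: (area, price) pairs.
rows = [("A", 100), ("A", 120), ("B", 200), ("B", 220), ("B", 240), ("C", 500)]

def smoothed_mean_encode(rows, m=2.0):
    """Blend each category mean with the global mean, weighted by count.

    m is an assumed smoothing strength: the larger it is, the more
    samples a category needs before its own mean dominates.
    """
    global_mean = sum(target for _, target in rows) / len(rows)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, target in rows:
        totals[category] += target
        counts[category] += 1
    # totals[c] equals n * category_mean, so this is
    # (n * category_mean + m * global_mean) / (n + m).
    return {c: (totals[c] + m * global_mean) / (counts[c] + m) for c in totals}

smoothed = smoothed_mean_encode(rows, m=2.0)
encoded = [smoothed[area] for area, _ in rows]
```

With these numbers the global mean is 230. Area "C", which has only one sample, is pulled from 500 down to 320, while area "B", with three samples, moves only from 220 to 224.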

