So, as an example, suppose you have words in an NLP (Natural Language Processing) system. To make the words numeric, you would typically run something like word2vec, or "word to vector".
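Features that are not numeric are hard to build models with. Past research has developed several ways to convert non-numeric values into numeric ones, such as one-hot encoding, or word2vec, which converts a word into a vector. (Not sure whether word2vec can handle Chinese, though 0.0)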
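To make the one-hot encoding mentioned in the note concrete, here is a minimal sketch in plain Python. The function name `one_hot_encode` and the `colors` data are made up for illustration and are not from the lecture; word2vec is not shown because it needs a trained model.

```python
def one_hot_encode(values):
    """Map each distinct category to a 0/1 indicator vector."""
    # Sort the vocabulary so the column order is deterministic.
    vocabulary = sorted(set(values))
    index = {category: i for i, category in enumerate(vocabulary)}

    encoded = []
    for value in values:
        row = [0] * len(vocabulary)   # one column per distinct category
        row[index[value]] = 1         # mark the column for this value
        encoded.append(row)
    return vocabulary, encoded


# Hypothetical categorical feature.
colors = ["red", "green", "blue", "green"]
vocabulary, encoded = one_hot_encode(colors)
print(vocabulary)
print(encoded)
```

Each row has exactly one 1, so the model sees a numeric input without any implied ordering between categories.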
Statistics, on the other hand, is about keeping the data that you have and getting the best results out of the data that you have. The difference in philosophy affects how you treat outliers. In ML, you go out and find enough outliers that they become something you can actually train with. Remember that five sample rule that we had? With statistics you say, "I've got all the data I'll ever be able to collect." So, you throw out outliers.
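Traditional statistics excludes outliers in order to reach what statistics considers an ideal result. Machine learning, however, holds that even outliers can be learned by the machine as long as there are enough of them. (Learning the exceptions?)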
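The ML side of this can be sketched as a simple count check. This assumes the "five sample rule" mentioned in the lecture means that a value needs at least five examples before it is used for training; that reading, the threshold constant, and the `ratings` data are assumptions for illustration.

```python
from collections import Counter

# Assumed threshold, based on the "five sample rule" mentioned in the lecture.
MIN_EXAMPLES = 5


def values_with_enough_examples(values, min_examples=MIN_EXAMPLES):
    """Return the set of values that occur at least `min_examples` times."""
    counts = Counter(values)
    return {value for value, n in counts.items() if n >= min_examples}


# Hypothetical feature column: 9 appears only once, so it is an outlier
# that we do not yet have enough examples of.
ratings = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 9]
trainable = values_with_enough_examples(ratings)
print(trainable)
```

Under this reading, the ML response to the rare value is to collect more examples of it, whereas the statistics response would be to drop it.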
Statistics is often used in a limited data regime, whereas ML operates with lots of data.
So having an extra column to flag whether or not you're missing data is what you would normally do in ML.
When you don't have enough data, you impute the missing value, replacing it with an average.
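Statistics usually uses a portion of the data to evaluate the whole, whereas machine learning uses most of the data. Missing values also need to be handled; one approach is to fill in the mean.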
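The two treatments of missing data can be combined in a few lines. This is a minimal sketch in plain Python, assuming missing entries are represented as `None`; the function name `impute_with_flag` and the `ages` column are hypothetical.

```python
def impute_with_flag(values):
    """Fill missing entries (None) with the mean and flag where they were."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot compute a mean with no observed values")
    mean = sum(observed) / len(observed)

    # Statistics-style fix: replace each missing entry with the average.
    filled = [mean if v is None else v for v in values]
    # ML-style extra column: 1 where the original value was missing.
    flags = [1 if v is None else 0 for v in values]
    return filled, flags


# Hypothetical numeric column with two missing entries.
ages = [20.0, None, 40.0, None, 30.0]
filled, flags = impute_with_flag(ages)
print(filled)
print(flags)
```

Keeping the flag column lets a model learn that "missing" may itself carry information, rather than treating the imputed mean as a real observation.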