Day 17: Anyway, Time for Some Unsupervised ML

2019 Ironman Contest (鐵人賽)

Anyway, a Systems Engineer (總之系統工程師)

2018-10-17 22:38:31



My computer ran into some trouble, so there is no real code today. The code below is pseudo-code typed from memory, so please go easy on me, experts <(_ _)>

Introduction

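The previous post covered part of the preprocessing; the next question is how to put the data to use. Here I start with unsupervised ML for clustering. Within unsupervised ML, my personal preference for correlating different events is Density-based spatial clustering of applications with noise (DBSCAN). Compared with K-means, DBSCAN can cluster this kind of data more accurately: through the concepts of core points and directly density-reachable points, a cluster is no longer just a set of points that happen to be close to a center, but a set of samples chained together through dense neighborhoods, so its members tend to share similar features.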
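
To make the two terms above concrete, here is a minimal sketch using only NumPy. The helper names core_point_mask and directly_density_reachable and the toy points are made up for illustration; they are not part of scikit-learn or of the data from this series. The sketch assumes Euclidean distance and counts a point as its own neighbor, which matches how scikit-learn documents min_samples.

```python
import numpy as np

def core_point_mask(points, eps, min_samples):
    """Return a boolean mask marking which points are core points.

    A point is a core point if at least `min_samples` points (itself
    included) lie within distance `eps` of it.
    """
    # Pairwise Euclidean distances, shape (n, n).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    neighbor_counts = (dists <= eps).sum(axis=1)
    return neighbor_counts >= min_samples

def directly_density_reachable(points, p, q, eps, min_samples):
    """True if point q is directly density-reachable from point p,
    i.e. p is a core point and q lies within eps of p."""
    is_core = core_point_mask(points, eps, min_samples)
    dist = np.linalg.norm(points[p] - points[q])
    return bool(is_core[p] and dist <= eps)

# Toy data (hypothetical): three points close together and one far away.
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0]])
toy_core = core_point_mask(toy_points, eps=1.5, min_samples=3)
```

With these toy values the first three points are core points and the distant one is not, so point 1 is directly density-reachable from point 0, while nothing is directly density-reachable from point 3. DBSCAN grows a cluster by chaining such steps from one core point to the next.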

Tools

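Python's scikit-learn package already includes DBSCAN; from sklearn.cluster import DBSCAN is all it takes to use it. To raise the bar for forming a cluster, so that only samples with very similar features end up together, lower eps or raise min_samples. Below is one of the approaches I tried: using IP, port, and process to look for signs of a backdoor, a reverse backdoor, or a proxy. (This alone is not enough for real analysis. It stops working as soon as it meets a one-line web shell, and it cannot tell an attack apart from a legitimate program that keeps a fixed connection open, such as SQL. The preprocessing should also strengthen the features that describe inbound and outbound connections on the same host. In short, other preprocessing, features, and dimensionality-reduction methods are still needed.)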

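import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import LabelEncoder

# `dataframe` is assumed to be the pandas DataFrame built in the previous post.
# IPs and process IDs are categorical, so encode them as integer labels.
source_ip = LabelEncoder().fit_transform(dataframe['event_data_SourceIp'])
# Ports are cast to int in case they were loaded as strings (assumes no missing values).
source_port = dataframe['event_data_SourcePort'].astype(int).to_numpy()
destination_ip = LabelEncoder().fit_transform(dataframe['event_data_DestinationIp'])
destination_port = dataframe['event_data_DestinationPort'].astype(int).to_numpy()
process_id = LabelEncoder().fit_transform(dataframe['event_data_ProcessId'])
# One row per event, one column per feature.
data = np.stack([source_ip, source_port, destination_ip, destination_port, process_id], axis=1)

# eps is the neighborhood radius; min_samples is the neighbor count
# (including the point itself) required for a core point.
clustering = DBSCAN(eps=3, min_samples=2).fit(data)
# labels_ holds one cluster index per row; -1 marks noise.
print(clustering.labels_)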
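
As a rough illustration of the eps and min_samples advice above, the following sketch runs DBSCAN on a small made-up dataset instead of the event data. The toy points and the summarize helper are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (hypothetical): two tight groups and one isolated point.
toy = np.array([
    [0, 0], [0, 1], [1, 0],        # group A
    [10, 10], [10, 11], [11, 10],  # group B
    [50, 50],                      # isolated point
], dtype=float)

def summarize(labels):
    """Return (number of clusters, number of noise points) for DBSCAN labels."""
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

# Same settings as the pseudo-code above.
loose = DBSCAN(eps=3, min_samples=2).fit(toy).labels_
# Raising min_samples: each group has only 3 members, so none qualifies.
strict = DBSCAN(eps=3, min_samples=4).fit(toy).labels_
# Lowering eps: no point has a neighbor within 0.5.
tight = DBSCAN(eps=0.5, min_samples=2).fit(toy).labels_

print(summarize(loose), summarize(strict), summarize(tight))
```

On this toy data the loose setting finds both groups and marks the isolated point as noise, while either stricter setting leaves every point labeled -1. With label-encoded IPs the distances carry little meaning of their own, so suitable values for the real event data would still have to be found by experiment.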