I mostly use Pandas to work with CSV files,
but what if I need to process log-style data whose fields sit at fixed positions, as follows?
month=>[0:3]
day=>[4:6]
time=>[7:15]
clftp1=>[16:22]
......... (omitted) .........
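I would like to ask whether anyone with more experience has a good way to preprocess this kind of data.
Working on the format directly,
or
using replace to convert it to CSV and then handling it with pandas, are both fine.
I would appreciate any advice you can share.
Thanks.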
Why not read the file one line at a time, split each line on whitespace, and put the pieces into a data structure one by one?
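A minimal sketch of that line-by-line idea, assuming every line looks like `Mon DD HH:MM:SS host process[pid]: message` (the sample line is reconstructed from the output posted in this thread, and `parse_line` is a hypothetical helper name). Limiting the number of splits keeps the message in one piece even though it contains spaces:

```python
def parse_line(line):
    """Split one syslog-style line into six fields."""
    # With sep=None, runs of spaces count as one separator, so the padded
    # day in "Jan  1" is handled. maxsplit=5 leaves the message untouched.
    month, day, time, host, proc, message = line.rstrip("\n").split(None, 5)
    # The process field still carries the trailing ":" from "ftpd[1740]:".
    return [month, day, time, host, proc.rstrip(":"), message]


sample = "Jan  1 00:00:00 clftp1 ftpd[1740]: Data port : 20"
print(parse_line(sample))
```

A line with fewer than six whitespace-separated parts would raise a `ValueError` here, so real input may need a guard around the unpacking.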
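with open(r'data.txt', 'r') as f:
    x = f.readlines()

# Splitting on ": " also cut inside the message,
# so split on "]: " instead (once per line); if the content after it is
# unpredictable, the original approach is probably better
preProcess = [item.split(']: ', 1) for item in x]
# Collapse the double space before single-digit days ("Jan  1")
preProcess2 = [[item[0].replace('  ', ' '), item[1]] for item in preProcess]
# Put back the "]" that the split consumed
preProcess3 = [[item[0] + ']', item[1]] for item in preProcess2]
# Turn the remaining spaces in the header part into commas
preProcess4 = [item[0].replace(' ', ',') + ',' + item[1] for item in preProcess3]

# Each message still ends with its newline, so join without a separator
csv_text = ''.join(preProcess4)
with open(r'data.csv', 'w') as f:
    f.write(csv_text)
print(csv_text)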
# output
'''
Dec,31,23:59:58,clftp1,ftpd[1739],NLST ASECL_snapshot.xml
Dec,31,23:59:58,clftp1,ftpd[1739],RETR ASECL_snapshot.xml
Dec,31,23:59:58,clftp1,ftpd[139],QUIT
Dec,31,23:59:58,clftp1,ftpd[1739],FTP session closed
Dec,31,23:59:58,clftp1,ftpd[1739],PASV
Dec,31,23:59:59,clftp1,ftpd[1081],TYPE ASCII
Dec,31,23:59:59,clftp1,ftpd[181],QUIT
Dec,31,23:59:59,clftp1,ftpd[1081],FTP session closed
Jan,1,00:00:00,clftp1,ftpd[1740],Data port : 20
Jan,1,00:00:00,clftp1,ftpd[1740],FTP server (Revision 1.1 Version wuftpd-2.6.1(PHNE_40380) Fri Dec 4 10:05:22 GMT 2009) ready.
Jan,1,00:00:00,clftp1,ftpd[1740],SYST
Jan,1,00:00:00,clftp1,ftpd[1740],USER nxp_cp
Jan,1,00:00:00,clftp1,ftpd[1740],PASS password
Jan,1,00:00:00,clftp1,ftpd[1740],FTP LOGIN FROM 10.14.7.24 [10.14.7.24], nxp_cp
Jan,1,00:00:00,clftp1,ftpd[40],TYPE Image
Jan,1,00:00:00,clftp1,ftpd[1740],CWD /nxpcp_quota_check
Jan,1,00:00:00,clftp1,ftpd[1740],PASV
Jan,1,00:00:00,clftp1,ftpd[1743],RETR nxpcp_quota.log
Jan,1,00:00:00,clftp1,ftpd[1740],QUIT
Jan,1,00:00:00,clftp1,ftpd[1740],FTP session closed
'''
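One thing to watch in the output above: a message such as `FTP LOGIN FROM 10.14.7.24 [10.14.7.24], nxp_cp` contains a comma itself, so a CSV reader would see an extra column on that row. A possible workaround, sketched here with the standard `csv` module and an in-memory buffer standing in for `data.csv` (the `rows` list is illustrative data copied from the output), is to let the writer quote such fields:

```python
import csv
import io

# Two already-split rows taken from the output above.
rows = [
    ["Jan", "1", "00:00:00", "clftp1", "ftpd[1740]", "USER nxp_cp"],
    ["Jan", "1", "00:00:00", "clftp1", "ftpd[1740]",
     "FTP LOGIN FROM 10.14.7.24 [10.14.7.24], nxp_cp"],
]

# StringIO stands in for open('data.csv', 'w', newline='').
buffer = io.StringIO()
writer = csv.writer(buffer)  # default quoting wraps fields containing commas
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields six columns per row again.
round_trip = list(csv.reader(io.StringIO(csv_text)))
print(round_trip == rows)
```

With the message quoted, a reader that follows the usual CSV quoting rules should keep it in a single column.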
Use a regular expression to handle the ': ' separator:
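# -*- coding: utf-8 -*-
import re

preProcess = []
with open(r'Rawdata.txt', 'r') as f:
    for line in f:
        # The capture group keeps "ftpd[pid]" in the result; the ": " after it is dropped
        prestring = re.split(r'(ftpd\[\d+\]): ', line.rstrip('\n'))
        # prestring[0] holds "month day time host"; split() also copes with the padded day
        result = prestring[0].split() + prestring[1:]
        preProcess.append(result)
        print(result)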
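A variation on the same idea, offered only as a sketch: one pattern with named groups matches the whole line, so each record comes back as a dict keyed by field name. The pattern assumes the `Mon DD HH:MM:SS host process[pid]: message` layout seen in this thread, and `LINE_RE` and `parse` are hypothetical names.

```python
import re

# Assumed layout: "Mon DD HH:MM:SS host process[pid]: message"
LINE_RE = re.compile(
    r"(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+(?P<process>\w+)\[(?P<pid>\d+)\]:\s(?P<message>.*)"
)


def parse(line):
    """Return a dict of fields, or None if the line does not fit the layout."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None


record = parse("Dec 31 23:59:58 clftp1 ftpd[1739]: RETR ASECL_snapshot.xml")
print(record)
```

Lines that do not match return `None` instead of raising, which makes it easy to collect them separately. A list of such dicts can also be passed to `pandas.DataFrame` if the rest of the work is done in pandas.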