iT邦幫忙

2017 iT 邦幫忙 Ironman Contest (鐵人賽)
Day 3
Big Data

Series: From Data Engineer and Data Architecture to Data Science, Part 3

Data Formats That Can Cope with Change

As the project moved forward, we had to decide on a data format.

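At first we used JSON as the data transfer format. One main reason was that the SSPs we integrate with all use JSON as their data exchange format. The other was that our service is built on Node.js with MongoDB as the backend. With that architecture, choosing JSON as the format for stored data was very natural.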

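JSON is also very flexible: whenever we need a new piece of information, we can simply change the data. As the project grew, though, that freedom brought maintenance challenges. I will share my views on that in a later post. Here I first share my experience of using JSON as the data format.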

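Because the engineers the company assigned to support us are very busy, I do not expect to be able to build a Hadoop/Spark cluster for data analysis. Instead, I do the analysis in a way that conserves engineering capacity.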

In what follows, I use Smaato's sample data as the example.

Readers who want to practice can copy the long JSON lines from the appendix and paste them into a file saved at /tmp/test.ndjson. Alternatively, run this on the command line:

wget -O /tmp/test.ndjson https://gist.githubusercontent.com/wush978/861d34eb8a97cc6d8b3fd180a2c1d2f7/raw/db0d6caee834251e68049cb5520ba546114216ab/test.ndjson

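This downloads the sample data file I put on gist. The samples also show what JSON generated from real data looks like. I doubt anyone wants to dig information out of it by eye alone.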

P.S. The file extension ndjson indicates that records are separated by line breaks, with one JSON object per line. See http://ndjson.org/ for details.
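For readers who prefer a scripting language, the one-object-per-line layout can be read with a few lines of Python. This is only a minimal sketch: the helper name read_ndjson and the two tiny records are made up for illustration and are not part of the sample data.

```python
import io
import json


def read_ndjson(stream):
    """Yield one parsed object per non-empty line of an ndjson stream."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


# Two made-up records stand in for a real file opened with open(path).
sample = io.StringIO('{"id": "a", "at": 2}\n{"id": "b", "at": 2}\n')
records = list(read_ndjson(sample))
print(len(records), records[0]["id"])
```

Because each line is parsed on its own, the file never has to be loaded into memory as a whole, which is what makes the format convenient for line-oriented tools.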

On Ubuntu or OS X, we can install jq, a small program for exploring JSON data directly on the command line.

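The sample data contains four JSON objects, each of which is one RTB Bid Request. In an RTB system, when an ordinary user is using a particular app, the SSP (Smaato, for example) asks us how much we are willing to pay for that traffic in order to show an ad. So when analyzing the data, I may want to see which mobile app each Bid Request came from. We can run: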

jq '.app.name' /tmp/test.ndjson

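The screen then shows:

"Demo_US_480x80"
"Demo_Adspace_Placement_320x50"
null
"Facebook Event"

The third request comes from a website rather than an app, so it has no app element and jq prints null.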

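jq handles the ndjson format well: it prints one result for each line, that is, for each JSON object. The filter '.app.name' tells jq to select the element named app in the object, and then the element named name inside it. In this way, jq lets me pick out the data I want from the JSON.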
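The way a path such as '.app.name' walks down the object can be imitated in Python. This is a rough sketch rather than a description of how jq is implemented: the helper get_path is hypothetical, and the two input lines are trimmed-down versions of records from the appendix.

```python
import json


def get_path(obj, *keys):
    """Follow keys into nested dicts; return None when a step is missing.

    jq prints null for a missing key. Unlike this sketch, jq raises an
    error when asked to index a string or a number.
    """
    for key in keys:
        if not isinstance(obj, dict):
            return None
        obj = obj.get(key)
    return obj


# Trimmed versions of the first and third records in the appendix.
lines = [
    '{"app": {"name": "Demo_US_480x80"}, "id": "1DGXhoQYtm"}',
    '{"site": {"name": "Example"}, "id": "FxRm2MJBpS"}',
]
names = [get_path(json.loads(line), "app", "name") for line in lines]
print(names)
```

The second record has a site element instead of an app element, so the lookup yields None for it, which corresponds to the null that jq prints.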

To find specific records, for example those whose publisher id is 1001000001, we can use grep:

grep '"publisher":{"id":"1001000001"' /tmp/test.ndjson

To extract specific fields from those matches, use the pipe operator | to feed the output into jq:

grep '"publisher":{"id":"1001000001"' /tmp/test.ndjson | jq '.app.name'
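The grep pattern matches raw text, so it only works while the producer writes "id" as the first key of publisher and adds no spaces. As a hedged alternative, the same selection can be done on the parsed structure. The function name app_names_for_publisher is made up for this sketch, and the second input line is an invented record whose keys are reordered on purpose.

```python
import json


def app_names_for_publisher(lines, publisher_id):
    """Return app.name for every record whose app.publisher.id matches."""
    names = []
    for line in lines:
        record = json.loads(line)
        app = record.get("app") or {}          # site traffic has no "app"
        publisher = app.get("publisher") or {}
        if publisher.get("id") == publisher_id:
            names.append(app.get("name"))
    return names


lines = [
    # Trimmed from the second record in the appendix.
    '{"app":{"name":"Demo_Adspace_Placement_320x50",'
    '"publisher":{"id":"1001000001","name":"Demo"}}}',
    # Invented record: same publisher, but keys reordered and spaced,
    # which the text pattern used with grep would not match.
    '{"app": {"publisher": {"name": "Demo", "id": "1001000001"}, "name": "Reordered"}}',
    # Trimmed from the third record: a site, not an app.
    '{"site":{"publisher":{"id":"923869874"}}}',
]
print(app_names_for_publisher(lines, "1001000001"))
```

For quick exploration the grep version is shorter, while parsing first is the safer choice once the data comes from more than one source.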

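Personally, I think this technique of processing data through stdin and stdout is well worth learning. Interested readers can search for the following keywords to find material for self-study: bash, pipe, redirection. These are old technologies, but precisely because they have been around so long, they are reliable.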

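The command line has many small tools like jq and grep. When needed, we can also use R (see "Linux command line tool + pipe study notes 1: making R part of the pipe"), Python, C++, or Node.js to write small programs that read from stdin and write to stdout. With a few tricks (see "Linux command line tool + pipe study notes 2: parallel computing"), we can even do simple parallel data processing and handle the data more efficiently.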
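As one possible shape for such a small program, the sketch below counts Bid Requests per device.os and writes tab-separated totals. The function name count_by_os, the choice of field, and the demo input are assumptions made for illustration. The function takes its streams as arguments so it can be tried with in-memory data.

```python
import io
import json
import sys
from collections import Counter


def count_by_os(instream, outstream):
    """Read ndjson from instream, write 'os<TAB>count' lines to outstream."""
    counts = Counter()
    for line in instream:
        if line.strip():
            device = json.loads(line).get("device") or {}
            counts[device.get("os", "unknown")] += 1
    # most_common() sorts by count, highest first.
    for os_name, n in counts.most_common():
        outstream.write(f"{os_name}\t{n}\n")


# Made-up demo input; in a real pipe stage, call
# count_by_os(sys.stdin, sys.stdout) instead.
demo_in = io.StringIO(
    '{"device":{"os":"iOS"}}\n'
    '{"device":{"os":"iOS"}}\n'
    '{"device":{"make":"generic web browser"}}\n'
)
demo_out = io.StringIO()
count_by_os(demo_in, demo_out)
sys.stdout.write(demo_out.getvalue())
```

Called with sys.stdin and sys.stdout, the same function could sit after grep or cat in a pipeline, just like jq does.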

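I once read an article titled "Command-line tools can be 235x faster than your Hadoop cluster". For a team in its early days, without ample engineering capacity to maintain a cluster, the point holds even more strongly: once we spend a little time learning these command-line tools, we no longer have to wait for engineers to set up analysis tools such as Hadoop for us.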

Appendix

The sample data follows:

{"app":{"cat":["IAB14"],"domain":"demo.com","id":"13000001","name":"Demo_US_480x80","publisher":{"id":"100801001","name":"Demo"}},"at":2,"bcat":["IAB25-5","AND1-6","IAB25-4","IAB25-7","IAB23-1","IAB25-6","AND1-3","IAB25-1","IAB25-3","IAB25-2","IAB9-9","IAB14-4","IAB22-1","IAB14-1","IAB22-2","IAB14-2","IAB14-3","IAB23-6","IAB13-1","IAB7-45","IAB26","IAB7-44","IAB23-7","IAB7-3","IAB8-5","IAB25","IAB23-8","IAB24","IAB23-9","IAB23","IAB23-2","IAB23-3","IAB23-4","IAB8-18","IAB23-5","IAB4","IAB7-28","IAB18-2","IAB3-11","IAB19-3","IAB17-18","IAB7-31","IAB7-30","IAB7-39","IAB23-10","IAB26-3","IAB26-4","IAB26-1","IAB26-2","IAB7-41","APL8-7","IAB7-42","APL8-6","APL8-5","APL8-4"],"device":{"connectiontype":0,"devicetype":1,"dnt":0,"ifa":"874273775852857007","didsha1":"03791cf87e352da6434e4d964f54f0f64db933aa","didmd5":"68b329da9893e34099c7d8ad5cb9c940","dpidsha1":"bfc523f83fa54a19e7b2d8bdf8acd687a09fba09","dpidmd5":"223632c428784fecaaa3e2a6aaaf6d8e","geo":{"country":"USA","lat":29.8327,"lon":-95.6627,"type":1,"zip":"77084"},"ip":"172.56.14.6","js":0,"make":"Generic","model":"Windows Phone 8","os":"Windows Phone OS","osv":"8","ua":"Windows Phone Ad Client/6.2.960.0 (Silverlight; MS_ORMMA_1_0; Windows Phone OS 8.10.15148.0; Microsoft; RM-1073_1006)"},"ext":{"carriername":"T-Mobile","coppa":0,"operaminibrowser":0,"udi":{"wpid":"874273775852857007"}},"id":"1DGXhoQYtm","imp":[{"banner":{"battr":[1,3,5,8,9],"mimes":["image/gif","image/jpeg","image/png"],"btype":[1,3],"w":320},"displaymanager":"SOMA","id":"1","instl":0}],"user":{}}
{"app":{"bundle":"302000101","cat":["IAB1"],"domain":"itunes.apple.com","id":"130000001","name":"Demo_Adspace_Placement_320x50","publisher":{"id":"1001000001","name":"Demo"}},"at":2,"bcat":["IAB17-18","IAB7-42","IAB23","IAB7-28","IAB26","IAB25","IAB9-9","IAB24"],"device":{"connectiontype":0,"devicetype":1,"dnt":0,"ifa":"F585B1EF-D543-4F9C-A39F-A204BD9C1E33","didsha1":"03791cf87e352da6434e4d964f54f0f64db933aa","didmd5":"68b329da9893e34099c7d8ad5cb9c940","dpisha1":"bfc523f83fa54a19e7b2d8bdf8acd687a09fba09","dpimd5":"223632c428784fecaaa3e2a6aaaf6d8e","geo":{"city":"Shenzhen","country":"HKG","lat":22.319147,"lon":114.22968,"metro":"0","region":"30","type":3,"zip":"518019"},"ip":"210.6.191.180","js":1,"make":"Apple","model":"iPhone","os":"iOS","osv":"7.1","ua":"Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Mobile/11D257"},"ext":{"carriername":"unknown - probably WLAN","operaminibrowser":0,"udi":{"atuid":"edc22bd4-aa58-f32a-63b7-8c7fcadfb1b7","idfa":"F585B1EF-D543-4F9C-A39F-A204BD9C1E33","idfatracking":1}},"id":"BqzFJc1Ze7","imp":[{"banner":{"api":[3,5],"h":50,"mimes":["application/javascript","text/javascript"],"btype":[1,2],"pos":0,"w":320},"displaymanager":"SOMA","displaymanagerver":"adtag2210s","id":"1","instl":0,"secure":0}],"regs":{"coppa":0},"user":{"id":"0088ba87-b96f-483d-85e3-36723ad6c60d","keywords":["m_gender:m","m_interestedIn:f","m_age:23"]}}
{"at":2,"bcat":["IAB7-28","IAB17-18","IAB26","IAB25","IAB24","IAB9-9","IAB7-42","IAB23"],"device":{"connectiontype":2,"devicetype":1,"dnt":0,"ifa":"F585B1EF-D543-4F9C-A39F-A204BD9C1E33","didsha1":"03791cf87e352da6434e4d964f54f0f64db933aa","didmd5":"68b329da9893e34099c7d8ad5cb9c940","dpisha1":"bfc523f83fa54a19e7b2d8bdf8acd687a09fba09","dpimd5":"223632c428784fecaaa3e2a6aaaf6d8e","geo":{"country":"USA","type":3},"ip":"173.209.211.199","js":0,"make":"generic web browser","ua":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:35.0) Gecko/20100101 Firefox/35.0"},"ext":{"carriername":"WLAN","coppa":0,"operaminibrowser":0,"udi":{"atuid":"edc22bd4-aa58-f32a-63b7-8c7fcadfb1b7","idfa":"F585B1EF-D543-4F9C-A39F-A204BD9C1E33","idfatracking":1}},"id":"FxRm2MJBpS","imp":[{"banner":{"battr":[1,3,5,8,9],"btype":[1,2,3],"h":0,"w":0},"displaymanager":"SOMA","ext":{"native":{"iconsize":[80,80],"imagesize":[1200,627],"pubnsupport":["title","text","iconimage","mainimage","ctatext","rating"],"ver":"1.0"}},"id":"1","instl":0}],"site":{"domain":"example.com","id":"65837062","name":"Example","publisher":{"id":"923869874","name":"Example"}},"user":{}}
{"app":{"cat":["IAB14"],"id":"1001000","name":"Facebook Event","publisher":{"id":"900000001","name":"Facebook Event"}},"at":2,"bcat":["IAB7-28","IAB17-18","IAB26","IAB25","IAB24","IAB9-9","IAB7-42","IAB23"],"device":{"carrier":"310-004","connectiontype":0,"devicetype":1,"dnt":0,"ifa":"F585B1EF-D543-4F9C-A39F-A204BD9C1E33","didsha1":"03791cf87e352da6434e4d964f54f0f64db933aa","didmd5":"68b329da9893e34099c7d8ad5cb9c940","dpisha1":"bfc523f83fa54a19e7b2d8bdf8acd687a09fba09","dpimd5":"223632c428784fecaaa3e2a6aaaf6d8e","geo":{"country":"USA","type":3},"ip":"46.211.129.201","js":0,"make":"Apple","model":"iPhone","os":"iOS","osv":"5.0","ua":"Mozilla/5.0 (iPhone; U; CPU iPhone OS 5_1_1 like Mac OS X; da-dk) AppleWebKit/534.46.0 (KHTML, like Gecko) CriOS/19.0.1084.60Mobile/9B206 Safari/7534.48.3"},"ext":{"carriername":"unknown - probably WLAN","coppa":0,"operaminibrowser":0,"udi":{"atuid":"edc22bd4-aa58-f32a-63b7-8c7fcadfb1b7","idfa":"F585B1EF-D543-4F9C-A39F-A204BD9C1E33","idfatracking":1}},"id":"9xMuXkrL8v","imp":[{"displaymanager":"SOMA","id":"1","instl":0,"video":{"linearity":1,"maxduration":18000,"mimes":["video/mp4","video/3gpp"],"minduration":0,"protocol":2}}],"user":{}}
