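As the title says: I'm a beginner who normally only uses R and am not familiar with Python. I recently started teaching myself web scraping and ran into a basic problem that I couldn't solve even after a lot of searching online. I'd appreciate any suggestions.
This is my list, whose values come from a web scraper. I'd like to convert it into a one-row DataFrame so that I can combine it with other rows later. There are two main questions: how do I convert a list into a one-row DataFrame? And how do I then combine several DataFrames built this way, similar to what rbind does in R?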
row = ['Gr',
'http://www.purriodictableofcats.com/images/b-grumpycat.jpg',
'Female',
'Real Name: Tardar Sauce',
'Hit the internet: 2012',
"Interesting Facts: Being both grumpy and adorableNamed as MSNBC's 2012 Most Influential CatWon Buzzfeed's Meme of the Year Award in 2013Published 2 books and a wall calendarAppeared on: Today Show, Good Morning America, CBS Evening News, American Idol and many more",
'https://www.facebook.com/TheOfficialGrumpyCat',
'https://twitter.com/RealGrumpyCat',
'http://instagram.com/realgrumpycat/']
I've already tried several approaches, such as transpose... Here is my latest attempt and the error message it produced:
pd.DataFrame(row, columns = ["id","img","sex","real_name","year","fact","link1","link2","link3"])
ValueError: Shape of passed values is (9, 1), indices imply (9, 9)
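Since the conversion hasn't worked yet, I haven't written anything like rbind so far. I'd appreciate it if someone could point me to a function that does the same thing. Thanks.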
Use a dict. Below is a rough made-up example; adapt it to your actual situation.
#!/usr/bin/env python3
import os
mylist = [["00001", "John", "Paris", "12345"], ["00002", "May", "Taiwan", "66666"], ["00003", "Toyota", "Japan", "22222"]]
mydict = dict()
for ml in mylist:  # convert the list of records into a dict keyed by ID
    mydict[ml[0]] = {"name": ml[1], "city": ml[2], "tel": ml[3]}
print(mydict["00001"]["name"])  # print the name stored under ID 00001
# It is best to check that "00001" exists first; otherwise a KeyError is raised
# The check is simple: if "00001" in mydict:
# Even in a dict with millions of entries, looking up one item on a reasonable CPU takes a fraction of a millisecond
To merge dicts:
mydict.update(yourdict)
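One thing worth knowing about update() is what happens when both dicts share a key. Here is a minimal sketch; the sample records and the name merged are made up for illustration.

```python
# Hypothetical sample data, keyed by ID as in the example above.
mydict = {"00001": {"name": "John", "city": "Paris"}}
yourdict = {
    "00001": {"name": "Johnny", "city": "Lyon"},  # same key as in mydict
    "00002": {"name": "May", "city": "Taiwan"},   # key not yet in mydict
}

# Build a new dict and leave both originals untouched;
# for duplicate keys the value from the right-hand dict wins.
merged = {**mydict, **yourdict}

# update() applies the same rule, but modifies mydict in place.
mydict.update(yourdict)
```

After this runs, mydict and merged hold the same two entries, and "00001" carries the values from yourdict.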
To store a dict on disk, compress it with gzip; I'd also suggest pairing that with the jsonpickle package.
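import gzip
import jsonpickle  # third-party package

if mydict is not None:
    # self.dictgz holds the output path (this snippet presumably lives inside a class)
    with gzip.open(self.dictgz, "wb") as gz:
        sp = jsonpickle.encode(mydict)  # serialize the dict to a JSON string
        bsp = sp.encode()               # str -> bytes
        gz.write(bsp)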
Reading it back from disk:
if os.path.exists("/home/user001/mydict.gz"):
    with gzip.open("/home/user001/mydict.gz", "rb") as gz:
        gzdata = gz.read()
    bsp = gzdata.decode()
    if len(bsp) > 0:
        mydict = jsonpickle.decode(bsp)
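If the dict only contains plain strings, numbers, lists and nested dicts, the same save-and-load round trip can probably be done with the standard library alone. This is a rough sketch that swaps jsonpickle for the built-in json module; the temporary path and the name restored are assumptions for illustration.

```python
import gzip
import json
import os
import tempfile

mydict = {"00001": {"name": "John", "city": "Paris", "tel": "12345"}}

# Hypothetical output path inside a temporary directory, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "mydict.json.gz")

# Text mode ("wt") lets gzip handle the str <-> bytes encoding.
with gzip.open(path, "wt", encoding="utf-8") as gz:
    json.dump(mydict, gz)

# Read the compressed file back and rebuild the dict.
with gzip.open(path, "rt", encoding="utf-8") as gz:
    restored = json.load(gz)
```

jsonpickle is still the better fit when the values are custom Python objects, since plain json cannot serialize those.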
import pandas as pd
AA = ['Gr',
'http://www.purriodictableofcats.com/images/b-grumpycat.jpg',
'Female',
'Real Name: Tardar Sauce',
'Hit the internet: 2012',
"Interesting Facts: Being both grumpy and adorableNamed",
'https://www.facebook.com/TheOfficialGrumpyCat',
'https://twitter.com/RealGrumpyCat',
'http://instagram.com/realgrumpycat/']
BB = {'id': [AA[0]], 'img': [AA[1]], 'sex': [AA[2]], 'real_name': [AA[3]], 'year': [AA[4]], 'fact': [AA[5]], 'link1': [AA[6]], 'link2': [AA[7]], 'link3': [AA[8]]}
data = pd.DataFrame(BB)
print(data)
Hope this helps as a reference...
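On the ValueError in the question: a flat list passed to pd.DataFrame is treated as one column of 9 values, which is why the reported shape is (9, 1) while the 9 column names imply 9 columns. Wrapping the list in another list makes it a single row instead. For the rbind part, pd.concat stacks DataFrames row-wise. A minimal sketch follows; row2 is a made-up second record, the fact text is shortened, and the names cols, df1, df2, combined and all_rows are only for illustration.

```python
import pandas as pd

cols = ["id", "img", "sex", "real_name", "year", "fact", "link1", "link2", "link3"]

row = ['Gr',
       'http://www.purriodictableofcats.com/images/b-grumpycat.jpg',
       'Female',
       'Real Name: Tardar Sauce',
       'Hit the internet: 2012',
       'Interesting Facts: Being both grumpy and adorable',
       'https://www.facebook.com/TheOfficialGrumpyCat',
       'https://twitter.com/RealGrumpyCat',
       'http://instagram.com/realgrumpycat/']

# Hypothetical second record, made up purely for illustration.
row2 = ['Xx',
        'http://example.com/cat.jpg',
        'Male',
        'Real Name: Example',
        'Hit the internet: 2013',
        'Interesting Facts: none',
        'http://example.com/a',
        'http://example.com/b',
        'http://example.com/c']

# [row] is a list containing one row of 9 values, so it matches the 9 column names.
df1 = pd.DataFrame([row], columns=cols)
df2 = pd.DataFrame([row2], columns=cols)

# Row-wise stacking, similar to rbind in R; ignore_index renumbers the rows 0, 1, ...
combined = pd.concat([df1, df2], ignore_index=True)

# If all rows are collected in a list first, one constructor call gives the same table.
all_rows = pd.DataFrame([row, row2], columns=cols)
```

When scraping in a loop, appending each row to a plain Python list and building the DataFrame once at the end is usually simpler than concatenating on every iteration.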