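Today we learn how to organize the scraped data. The container types covered on Day 7 and Day 8 are a good fit for storing it, so let's get started!
Sorting the data by date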
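Each article is stored as a Dict, and the Dicts are collected in a List. The site's dates carry no year, so the current year is filled in by default. I wanted to sort by date but did not find a way to sort directly on values of type datetime.date, so the date is converted to a number and the sort runs on that number. (datetime.date objects do support comparison, so sorting on them directly also works; the numeric form is kept because it is what the script uses.)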
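A side note on the year default before the full script: filling in the current year can mislabel posts around New Year, for example a 12/30 post crawled on January 2. The following is only a minimal sketch of one possible workaround, assuming that a listed date is never in the future. The helper name infer_year is hypothetical and is not part of the original script, and February 29 is not handled.

```python
import datetime

def infer_year(month_day, today):
    """Return a date for an 'MM/DD' string, assuming it is not in the future.

    infer_year is a hypothetical helper used for illustration only.
    """
    month, day = (int(part) for part in month_day.strip().split('/'))
    candidate = datetime.date(today.year, month, day)
    if candidate > today:
        # A date later than today most likely belongs to the previous year
        candidate = datetime.date(today.year - 1, month, day)
    return candidate

# Crawling on 2019-01-02: a 12/30 post is assigned to 2018, a 1/02 post to 2019
print(infer_year('12/30', datetime.date(2019, 1, 2)))
print(infer_year(' 1/02', datetime.date(2019, 1, 2)))
```

The script that follows keeps the simpler approach of always using the current year.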
from bs4 import BeautifulSoup
import requests
import datetime

uri = 'https://www.ptt.cc/bbs/Soft_Job/index.html'
Ptt_Domain = 'https://www.ptt.cc'
lastPage = ''   # href of the previous-page link, set by Get_Page_Data
category = ''   # category tag to filter on, for example '[徵才]'
articles = []   # one dict per matching article
Now = datetime.date.today()
NowYear = Now.year
# Left-pad a single-digit string with a zero
def Add_Zero(s):
    if len(s) < 2:
        s = '0' + s
    return s
# Check whether the request succeeded
def Connection_Check(html):
    return html.status_code == requests.codes.ok
# Fetch one index page and store the matching articles
def Get_Page_Data(uri):
    global lastPage  # updated so the caller can move on to the previous page
    html = requests.get(uri)
    if Connection_Check(html):
        soup = BeautifulSoup(html.content, 'html.parser')
        # Remember the link behind the "上頁" (previous page) button
        menuDiv = soup.find('div', class_='btn-group btn-group-paging')
        lastLinks = menuDiv.find_all('a')
        for lastLink in lastLinks:
            if '上頁' in lastLink.get_text():
                lastPage = lastLink.get('href')
        content = soup.find('div', class_='r-list-container action-bar-margin bbs-screen')
        # When an r-list-sep divider exists, only the rows before it are taken
        r_list_sep = content.find('div', class_='r-list-sep')
        if r_list_sep is None:
            r_ent_div = content.find_all('div', class_='r-ent')  # find the given class
        else:
            r_ent_div = r_list_sep.find_previous_siblings('div', class_='r-ent')
        for item in r_ent_div:
            title = item.find(class_='title')
            s = title.find('a')
            if s:  # deleted articles have no link, so skip them
                title_text = s.get_text()
                # Convert the date: add the year, then build a yyyymmdd integer
                date = item.find('div', class_='date')
                date = str(NowYear) + '/' + date.get_text().strip()
                dateNum = datetime.datetime.strptime(date, '%Y/%m/%d')
                dateNum = int(str(dateNum.year) + Add_Zero(str(dateNum.month)) + Add_Zero(str(dateNum.day)))
                if category in title_text:
                    # Store in the list
                    articles.append({
                        'title': title_text,
                        'href': Ptt_Domain + s.get('href'),
                        'date': date,
                        'dateNum': dateNum
                    })
    else:
        print('Could not connect to the site')
while True:
    try:
        page = int(input('Number of pages to search: '))
        break
    except ValueError:
        print('Please enter a number!')
SearchRange = range(1, page + 1)
category = '[' + input('Category to search for: ') + ']'
for num in SearchRange:
    print('Page {}'.format(num))
    if num == 1:
        Get_Page_Data(uri)
    else:
        # Follow the previous-page link found on the page just read
        uri = Ptt_Domain + lastPage
        Get_Page_Data(uri)
# Sort by date, newest first
articles = sorted(articles, key=lambda e: e['dateNum'], reverse=True)
print(articles)
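Output: the raw list of dictionaries, printed on a single line and sorted from the newest date to the oldest.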
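As an aside on the sorting step, here is a minimal sketch of sorting on date objects directly instead of on a hand-built number. The sample rows and the dateObj key are made up for illustration and do not come from the scraped data. It also shows strftime producing the zero-padded yyyymmdd form, so a padding helper is not strictly needed.

```python
import datetime

# Made-up sample rows standing in for scraped articles
articles = [
    {'title': 'A', 'date': '2018/10/31'},
    {'title': 'B', 'date': '2018/11/02'},
    {'title': 'C', 'date': '2018/11/01'},
]

for article in articles:
    parsed = datetime.datetime.strptime(article['date'], '%Y/%m/%d').date()
    article['dateObj'] = parsed  # hypothetical extra key holding a datetime.date
    # strftime zero-pads month and day, so no manual padding is needed
    article['dateNum'] = int(parsed.strftime('%Y%m%d'))

# date objects compare chronologically, so they can serve as the sort key
newest_first = sorted(articles, key=lambda e: e['dateObj'], reverse=True)
print([a['title'] for a in newest_first])  # ['B', 'C', 'A']
```

One trade-off: json.dumps cannot serialize date objects by default, so keeping the string or numeric form in each dictionary is still convenient for the JSON step.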
Converting to JSON format
To make the output easier to read, we use the json module to present the data as a tree structure:
import json
print(json.dumps(articles, indent=1, ensure_ascii=False))
Output:
[
 {
  "title": "[徵才] TinkLabs 徵 Android Engineer",
  "href": "https://www.ptt.cc/bbs/Soft_Job/M.1541127584.A.FAC.html",
  "date": "2018/11/02",
  "dateNum": 20181102
 },
 {
  "title": "[徵才] 高雄香港商台灣千里目-網路程式設計師",
  "href": "https://www.ptt.cc/bbs/Soft_Job/M.1541127580.A.331.html",
  "date": "2018/11/02",
  "dateNum": 20181102
 },
 {
  "title": "[徵才] OneDegree 徵資安主管 (60~100K up/5Y)",
  "href": "https://www.ptt.cc/bbs/Soft_Job/M.1541001848.A.6A2.html",
  "date": "2018/11/01",
  "dateNum": 20181101
 },
 {
  "title": "[徵才] 百睿達有限公司 誠徵後端工程師",
  "href": "https://www.ptt.cc/bbs/Soft_Job/M.1541049353.A.4C1.html",
  "date": "2018/11/01",
  "dateNum": 20181101
 },
 {
  "title": "[徵才] 留學顧問公司徵前端工程師(台北)",
  "href": "https://www.ptt.cc/bbs/Soft_Job/M.1541051202.A.EA2.html",
  "date": "2018/11/01",
  "dateNum": 20181101
 },
 {
  "title": "[徵才] H&L 代徵 Software Engineer (70K~120K+)",
  "href": "https://www.ptt.cc/bbs/Soft_Job/M.1541063366.A.BDC.html",
  "date": "2018/11/01",
  "dateNum": 20181101
 },
 {
  "title": "[徵才] H&L 代徵 DevOps Engineer (80K~120K+)",
  "href": "https://www.ptt.cc/bbs/Soft_Job/M.1541063561.A.E88.html",
  "date": "2018/11/01",
  "dateNum": 20181101
 },
 {
  "title": "Fw: [徵才] 思華科技-DBA資料庫技術工程師",
  "href": "https://www.ptt.cc/bbs/Soft_Job/M.1541067017.A.5FD.html",
  "date": "2018/11/01",
  "dateNum": 20181101
 },
 {
  "title": "[徵才] COBINHOOD 徵求前端工程師(72K~120K/mon)",
  "href": "https://www.ptt.cc/bbs/Soft_Job/M.1540950761.A.B25.html",
  "date": "2018/10/31",
  "dateNum": 20181031
 },
 {
  "title": "[徵才] 徵Senior DevOps Engineer(90K~120K/mont",
  "href": "https://www.ptt.cc/bbs/Soft_Job/M.1540891740.A.FCB.html",
  "date": "2018/10/30",
  "dateNum": 20181030
 }
]
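* When calling json.dumps, indent=1 formats the JSON as a tree with one space per nesting level. Leaving indent out (the default, indent=None) gives the compact single-line form, while indent=0 still inserts newlines but adds no indentation. If the JSON contains Chinese text, add ensure_ascii=False so the characters are not escaped as \uXXXX sequences.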
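The effect of these two parameters can be seen in a small sketch. The sample row below is made up for illustration; only json.dumps and json.loads from the standard library are used.

```python
import json

sample = [{'title': '[徵才] 範例', 'dateNum': 20181102}]  # made-up sample row

compact = json.dumps(sample, ensure_ascii=False)                  # indent=None (default)
newlines_only = json.dumps(sample, indent=0, ensure_ascii=False)  # newlines, no indentation
pretty = json.dumps(sample, indent=1, ensure_ascii=False)         # one space per level
escaped = json.dumps(sample)                                      # ensure_ascii=True (default)

print(compact)
print(pretty)
print(escaped)  # the Chinese characters appear as \uXXXX escapes
```

The escaped form is still valid JSON and loads back to the same data, so ensure_ascii=False matters mainly for readability.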
With that, the data we scraped is fully organized!
Date module reference: https://codertw.com/%E7%A8%8B%E5%BC%8F%E8%AA%9E%E8%A8%80/369869/
Sorting reference: https://blog.csdn.net/Tangzongyu123/article/details/75200619
Json module reference: https://www.crifan.com/format_dictionary_list_variable_into_prettified_tree_like_with_indent_json_string_then_output/
Crawler reference: http://bnn00023.pixnet.net/blog/post/1077333-%E5%AD%B8%E7%BF%92python-ptt%E6%AD%A3%E5%A6%B9%E7%89%88%E7%88%AC%E8%9F%B2%E7%BF%92%E9%A1%8C%EF%BC%9A%E5%A4%9A%E9%A0%81%E7%88%AC%E5%8F%96
If you spot any mistakes in this article, please leave a comment so I can correct my misunderstandings. Thanks!