Previous posts
DAY 01 : Goals for the challenge and planning
DAY 02 : Setting up a python3 virtualenv
DAY 03 : python3 requests
DAY 04 : Using beautifulsoup4 and lxml
As usual, let's run the code first, then take a look at what beautifulsoup4 and lxml actually are!
import requests
from bs4 import BeautifulSoup

# Fetch the PTT hot boards page and parse it with the lxml parser
homepage = requests.get('https://www.ptt.cc/bbs/hotboards.html')
soup = BeautifulSoup(homepage.text, "lxml")
print(soup)
After activating the virtualenv, run python filename.py, and what gets printed is roughly what an ordinary site looks like with JavaScript turned off.
To compare against the browser yourself, press F12 (or Ctrl+Shift+I) to open DevTools, where you can disable JavaScript and inspect the HTML tags.
What happens in the script: the response fetched by requests.get() is decoded into Unicode by BeautifulSoup, lxml then parses it into a tree-structured Python object, and printing that object dumps the markup in the terminal.
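If you want to see each step of that pipeline for yourself, here is a minimal sketch (same URL as above; encoding and text are standard requests attributes):

import requests
from bs4 import BeautifulSoup

homepage = requests.get('https://www.ptt.cc/bbs/hotboards.html')
print(homepage.encoding)    # the encoding requests detected for the response
print(type(homepage.text))  # <class 'str'> -- the body is already Unicode here
soup = BeautifulSoup(homepage.text, 'lxml')
print(type(soup))           # <class 'bs4.BeautifulSoup'> -- the parse tree object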
With that tree structure in hand, pulling out the data we want becomes easy!
The first line prints the title tag together with its content; the second strips the tag and shows only the inner text:
print(soup.title)
print(soup.title.text)
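Run against the hot boards page, the output looks roughly like this (the title is whatever PTT serves at the time, so it may differ when you run it):

<title>熱門看板 - 批踢踢實業坊</title>
熱門看板 - 批踢踢實業坊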
Out of curiosity, I wanted to test whether lxml really does parse faster than the other parsers.
The test below measures output file size and execution speed for the raw response plus the three parsers (lxml, html5lib, html.parser).
The two screenshots below compare the file size of the whole raw page against the file produced after parsing with lxml.
import requests
from bs4 import BeautifulSoup
import time

homepage = requests.get('https://www.ptt.cc/bbs/hotboards.html')
# print(homepage.text)

a = 0  # accumulated time, raw response (no parsing)
b = 0  # accumulated time, lxml
c = 0  # accumulated time, html5lib
d = 0  # accumulated time, html.parser

for i in range(1, 100):  # repeat the whole comparison 99 times
    # Baseline: write the raw HTML with no parsing at all
    start = time.time()
    with open('nosoup.text', 'a') as f:
        f.write(homepage.text)
    elapsed = time.time() - start
    a = a + elapsed
    print("Time taken nosoup: ", elapsed, "seconds.")

    # Parse with lxml, then write only the extracted text
    start = time.time()
    soup = BeautifulSoup(homepage.text, 'lxml')
    # print(soup)
    with open('lxml.text', 'a') as f:
        f.write(soup.text)
    elapsed = time.time() - start
    b = b + elapsed
    print("Time taken lxml: ", elapsed, "seconds.")

    # Parse with html5lib
    start = time.time()
    h5soup = BeautifulSoup(homepage.text, 'html5lib')
    # print(h5soup)
    with open('html5lib.text', 'a') as f:
        f.write(h5soup.text)
    elapsed = time.time() - start
    c = c + elapsed
    print("Time taken html5lib: ", elapsed, "seconds.")

    # Parse with Python's built-in html.parser
    start = time.time()
    parsersoup = BeautifulSoup(homepage.text, 'html.parser')
    # print(parsersoup)
    with open('htmlparser.text', 'a') as f:
        f.write(parsersoup.text)
    elapsed = time.time() - start
    d = d + elapsed
    print("Time taken htmlparser: ", elapsed, "seconds.")

print("Total time nosoup: ", a, "seconds.")
print("Total time lxml: ", b, "seconds.")
print("Total time html5lib: ", c, "seconds.")
print("Total time htmlparser: ", d, "seconds.")
From this test code, lxml comes out ahead of the other parsers on speed, and the extracted-text file is also far smaller than the raw HTML page!
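As a cross-check that isolates pure parse time from the file writes above, here is a minimal sketch using the standard-library timeit (same page, fetched once; 100 is an arbitrary repeat count):

import timeit
import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.ptt.cc/bbs/hotboards.html').text

# Time only BeautifulSoup(html, parser) -- no disk I/O involved
for parser in ('lxml', 'html5lib', 'html.parser'):
    t = timeit.timeit(lambda: BeautifulSoup(html, parser), number=100)
    print(parser, round(t, 3), 'seconds for 100 parses')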
Song of the day: Accusefive (告五人) - 法蘭西多士 (French Toast)
Tomorrow we'll use select and find to grab exactly the data we need!