[Day6] Scraping Data from Static HTML Pages 1: Beautiful Soup

2022 iThome Ironman Contest

DAY 6

Self-Challenge Group

Learning Web Scraping with Python in 30 Days, part 6 of the series

14th Ironman Contest

rouanchen

2022-09-20 20:24:24

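Data extraction tasks in web scraping

After sending an HTTP request with Requests and receiving the HTML page in the response, you need to locate where the target data sits in the page before you can extract it. There are three main tasks: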

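Locating within the HTML page:
Find a specific HTML tag or set of tags in the page. You can locate them by HTML tag, regular expression, CSS selector, or XPath expression.
Traversing the HTML page:
Once a specific HTML element is found, it may only be close to the target data, or there may be other data nearby that you also want. In that case, traverse the HTML elements upward, downward, or sideways (to parents, children, or siblings) to reach the exact position of the data.
Modifying the HTML page:
If the retrieved page has incomplete or missing tags, modify the HTML tags and attributes so that the page is easier to scrape.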
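
The three tasks above can be sketched on a small inline page. This is only an illustration under stated assumptions: the HTML string and its id and class names are made up for the example, and it uses Python's built-in "html.parser" instead of lxml so that nothing beyond Beautiful Soup needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration; it is not FJU_website.html.
html = """
<div id="news">
  <h2>Announcements</h2>
  <ul>
    <li class="item">Library hours</li>
    <li class="item">Course registration</li>
  </ul>
</div>
"""

# "html.parser" ships with Python, so no extra parser is required here.
soup = BeautifulSoup(html, "html.parser")

# 1. Locate: find elements by tag name or by CSS selector.
heading = soup.find("h2")
items = soup.select("li.item")

# 2. Traverse: move from a located element to its neighbors.
first_item = items[0]
parent_list = first_item.parent                    # upward to <ul>
second_item = first_item.find_next_sibling("li")   # sideways to the next <li>

# 3. Modify: change an attribute in the parsed tree.
second_item["class"] = ["item", "highlight"]

print(heading.string)
print(parent_list.name)
print(second_item.get_text())
print(second_item["class"])
```

Locating usually gets you near the data, traversal covers the last step to it, and modification changes only the in-memory tree, not the original file or website.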

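HTML parsing tool: Beautiful Soup

Beautiful Soup is a Python package that converts HTML tags into a tree of Python objects, which helps us extract the data we need from an HTML page.

Hands-on practice

Open the FJU_website.html file and parse the HTML page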

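from bs4 import BeautifulSoup

# Open the local HTML file and parse it with the lxml parser
with open("FJU_website.html", "r", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, "lxml")

# Print the parsed document with indentation
print(soup.prettify())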

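Beautiful Soup has four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment. Together they represent a parsed HTML page as a Python object tree.
(1) Tag object: provides many attributes and methods for searching and traversing the Python object tree. The example below shows how to get a tag's name and attribute values.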

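from bs4 import BeautifulSoup

with open("FJU_website.html", "r", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, "lxml")

tag = soup.div       # first <div> in the document
print(type(tag))     # Tag type
print(tag.name)      # tag name
print(tag["id"])     # value of the id attribute (KeyError if the <div> has no id)
print(tag.attrs)     # dictionary of all attributes of the tag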
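
Because the contents of FJU_website.html are not shown here, the following sketch repeats the same Tag operations on a made-up inline snippet. The HTML string and the use of "html.parser" are assumptions for illustration. It also shows tag.get(), which returns None for a missing attribute instead of raising KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = '<div id="main" class="box wide"><p>Hello</p></div>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.div
print(tag.name)           # tag name
print(tag["id"])          # value of an attribute that exists
print(tag.get("title"))   # missing attribute: get() returns None
print(tag["class"])       # class is multi-valued, so it comes back as a list
print(tag.attrs)          # all attributes as a dictionary
```

Using tag.get() is a reasonable default when scraping real pages, where an attribute may be present on some elements and absent on others.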

(2) NavigableString object: the content of a tag, that is, the text located inside the tag

from bs4 import BeautifulSoup

with open("FJU_website.html", "r", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, "lxml")

tag = soup.div
print(tag.string)          # tag content (None if the <div> has more than one child)
print(type(tag.string))    # NavigableString type when the <div> holds only text
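
One detail worth seeing in action is that .string only returns a NavigableString when the tag has a single child. The sketch below uses a made-up inline snippet and "html.parser" (both assumptions for illustration) to contrast .string with get_text(), which joins all the text inside a tag.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup for illustration only.
html = "<div><p>Only text</p><p>Mixed <b>bold</b> text</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p, second_p = soup.find_all("p")

# A tag with a single text child: .string is a NavigableString.
print(first_p.string)
print(isinstance(first_p.string, NavigableString))

# A tag with several children: .string is None, get_text() still works.
print(second_p.string)
print(second_p.get_text())
```

When a tag's structure is not known in advance, get_text() is the safer way to read its text.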

(3) BeautifulSoup object: represents the entire HTML page

from bs4 import BeautifulSoup

with open("FJU_website.html", "r", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, "lxml")

print(soup.name)    # "[document]", the special name of the whole document
print(type(soup))   # BeautifulSoup type
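
As a small sketch of what "the entire page" means in practice, the BeautifulSoup object can be searched and traversed much like a Tag. The inline HTML and the choice of "html.parser" below are assumptions for illustration.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup for illustration only.
html = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.name)                # special name of the document object
print(isinstance(soup, Tag))    # BeautifulSoup is a subclass of Tag
print(soup.title.string)        # search starts from the document root
print(soup.title.parent.name)   # traverse upward from <title>
```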

(4) Comment object: holds the text of a comment in the HTML page

from bs4 import BeautifulSoup

with open("FJU_website.html", "r", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, "lxml")

comment = soup.p.string   # first <p>; assumes it contains only an HTML comment
print(comment)
print(type(comment))      # Comment type when that assumption holds
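
The example above depends on the first <p> containing nothing but a comment. A minimal sketch of a more general approach, assuming a made-up inline snippet and "html.parser", is to collect every Comment in the tree by filtering strings on their type.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup for illustration only.
html = "<div><p><!-- first note --></p><p>Visible text<!-- second note --></p></div>"
soup = BeautifulSoup(html, "html.parser")

# The first <p> holds only a comment, so .string returns a Comment object.
comment = soup.p.string
print(isinstance(comment, Comment))

# Find every comment in the document, wherever it appears.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for c in comments:
    print(c.strip())
```

Comment is a subclass of NavigableString, so checking the type is what separates comments from ordinary text.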