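After using Requests to send an HTTP request and receiving the HTML page in the response, we need to locate the data we are looking for so that it can be extracted from the page. This involves three main tasks: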
Beautiful Soup is a Python package that converts HTML tags into a tree of Python objects, which helps us extract the data we need from an HTML page.
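Open the FJU_website.html file and parse the HTML page
from bs4 import BeautifulSoup
# Parse the file while it is open; "lxml" requires the lxml package
with open("FJU_website.html", "r", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, "lxml")
# Print the parsed document with one tag per line, indented by depth
print(soup.prettify())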
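The same parsing step can be tried without the file. The following is a minimal sketch that assumes a small inline HTML string (the `html_doc` content is a hypothetical stand-in, not the real FJU_website.html) and uses Python's built-in "html.parser" so that lxml does not need to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of FJU_website.html.
html_doc = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div id="main" class="content">Hello</div>
  </body>
</html>
"""

# "html.parser" ships with Python; "lxml" can be used instead if it is installed.
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string   # text inside the <title> tag
div_id = soup.div["id"]          # id attribute of the first <div>

print(title_text)
print(div_id)
```

Passing a string instead of a file object is convenient for experiments; for a real page the file-based version above works the same way.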
Beautiful Soup has four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment. Together they represent a parsed HTML page as a tree of Python objects.
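(1) Tag object: provides attributes and methods for searching and traversing the Python object tree. The example below shows how to get a tag's name and attribute values.
from bs4 import BeautifulSoup
with open("FJU_website.html", "r", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, "lxml")
tag = soup.div       # first <div> tag in the document
print(type(tag))     # Tag type
print(tag.name)      # tag name
print(tag["id"])     # value of the id attribute (KeyError if the tag has no id)
print(tag.attrs)     # dictionary of all attributes of the tag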
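As a self-contained illustration of attribute access, the sketch below assumes a hypothetical one-line HTML string. It also shows `tag.get()`, which returns `None` for a missing attribute instead of raising `KeyError`, and that a multi-valued attribute such as `class` comes back as a list.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html_doc = '<div id="main" class="content box"><p>Hello</p></div>'
soup = BeautifulSoup(html_doc, "html.parser")

tag = soup.div
tag_name = tag.name          # name of the tag
tag_id = tag["id"]           # raises KeyError if the attribute is missing
tag_classes = tag["class"]   # multi-valued attribute, returned as a list
missing = tag.get("title")   # .get() returns None instead of raising

print(tag_name, tag_id, tag_classes, missing)
```

Using `tag.get()` is usually the safer choice when scraping pages whose markup is not under your control.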
(2) NavigableString object: the content of a tag, that is, the text located inside the tag
from bs4 import BeautifulSoup
with open("FJU_website.html", "r", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, "lxml")
tag = soup.div
print(tag.string)         # tag content (None if the tag has more than one child)
print(type(tag.string))   # NavigableString type
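One detail worth seeing in isolation is that `.string` only returns text when the tag has exactly one child. The sketch below assumes two hypothetical `<div>` tags and contrasts `.string` with `get_text()`, which concatenates all nested text.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample markup: one div with plain text, one with mixed content.
html_doc = '<div id="a">Hello</div><div id="b">Hello <b>world</b></div>'
soup = BeautifulSoup(html_doc, "html.parser")

single = soup.find("div", id="a")
mixed = soup.find("div", id="b")

single_text = single.string     # exactly one text child: a NavigableString
mixed_string = mixed.string     # several children: None
mixed_text = mixed.get_text()   # joins the text of all descendants

print(repr(single_text), mixed_string, repr(mixed_text))
```

When the structure of a tag is not known in advance, `get_text()` is the more forgiving way to read its text.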
(3) BeautifulSoup object: represents the entire HTML page
from bs4 import BeautifulSoup
with open("FJU_website.html", "r", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, "lxml")
print(soup.name)    # special name "[document]"
print(type(soup))   # BeautifulSoup type
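Because the BeautifulSoup object stands for the whole document, searches started from it cover every tag. A minimal sketch, assuming a hypothetical inline page with two paragraphs:

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used only for illustration.
html_doc = (
    "<html><head><title>Sample</title></head>"
    "<body><p>One</p><p>Two</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

root_name = soup.name              # the document root has a special name
paragraphs = soup.find_all("p")    # search the whole document for <p> tags
paragraph_texts = [p.get_text() for p in paragraphs]

print(root_name)
print(paragraph_texts)
```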
(4) Comment object: gives access to the comment text in an HTML page
from bs4 import BeautifulSoup
with open("FJU_website.html", "r", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, "lxml")
# Assumes the first <p> tag contains only an HTML comment
comment = soup.p.string
print(comment)
print(type(comment))   # Comment type
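The example above depends on the first `<p>` holding nothing but a comment. The sketch below assumes a hypothetical inline snippet where that is true, and also shows one way to collect every comment in a document by filtering strings with `isinstance(text, Comment)`.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup: the first <p> contains only a comment.
html_doc = "<p><!-- note for editors --></p><p>Visible text</p>"
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.p.string                     # only child of the first <p>
is_comment = isinstance(first, Comment)   # True when that child is a comment

# Collect every comment in the document, wherever it appears.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

print(repr(first), is_comment, len(comments))
```

Comment is a subclass of NavigableString, so the `isinstance` check is what separates comments from ordinary text.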