[Python] Beautiful Soup

2024 iThome Ironman Contest

DAY 9

Python

Part 9 of the series "一些Python可以做的事" (Things Python Can Do)

16th Ironman Contest

yi00007120

團隊推甄就上!

2024-08-18 09:04:43


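Beautiful Soup is a third-party Python library (module) for parsing HTML and XML documents. It converts the parsed result into a tree of tags, where a tag is the code enclosed in < > in HTML. Reading data this way is close to how you would address elements on a web page, which makes the data easier to work with.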

Installing the Beautiful Soup module

$ pip install beautifulsoup4

Importing

from bs4 import BeautifulSoup

Using Beautiful Soup

Give Beautiful Soup the HTML source and it converts it into a readable tree of tags.
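As a minimal sketch of this conversion, the snippet below parses a small HTML string instead of a downloaded page. The HTML content and the variable names are made up for illustration, and it uses the built-in html.parser so that it runs without any extra install.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for a real page's source
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <div id="main"><p class="intro">Hello, <b>soup</b>!</p></div>
  </body>
</html>
"""

# Turn the source into a tag tree
soup = BeautifulSoup(html, "html.parser")

# Tags can be reached by name, much like walking the page structure
title_text = soup.title.string   # text inside <title>
first_p = soup.body.div.p        # the <p> inside <div id="main">

print(title_text)
print(first_p["class"])          # class is multi-valued, so this is a list
print(first_p.get_text())        # text of the <p> and its children
```

Each dotted access such as soup.body.div.p returns the first matching tag below the previous one, so it is handy for quick exploration but not for picking out one of several sibling tags.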

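Installing the html5lib parser

Beautiful Soup needs a parser to preprocess the markup before it can analyze it. Python has a built-in html.parser, but the html5lib parser is more tolerant of malformed markup, at the cost of being slower.

$ pip install html5lib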
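To make the difference in error tolerance concrete, here is a small sketch that feeds the same malformed markup to both parsers. The broken string is an invented example, and the try/except is only there so the snippet still runs where html5lib has not been installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed markup: a stray </p> with no opening <p>
broken = "<a></p>"

# Parse with Python's built-in parser
builtin_result = str(BeautifulSoup(broken, "html.parser"))
print("html.parser:", builtin_result)

# Parse with html5lib, which repairs markup and builds a full document
try:
    html5lib_result = str(BeautifulSoup(broken, "html5lib"))
except FeatureNotFound:
    # html5lib is not installed in this environment
    html5lib_result = None
print("html5lib:", html5lib_result)
```

The two printed trees differ for the same input, so it is worth naming the parser explicitly whenever results need to be reproducible across machines.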

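Next, combine this with the requests library covered earlier: use its get method to fetch the Fu Jen Catholic University (輔大) home page, then parse the page with html5lib and find its title.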
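The original post does not include the code for this step, so the following is only a sketch of how it might look. The helper names make_soup, page_title, and fetch_title are hypothetical, the fallback to html.parser is an added convenience, and the home page URL is not given in the post, so it is left as a parameter.

```python
import requests
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse html with html5lib, falling back to html.parser if it is missing."""
    try:
        return BeautifulSoup(html, "html5lib")
    except FeatureNotFound:
        return BeautifulSoup(html, "html.parser")


def page_title(html):
    """Return the stripped text of the <title> tag, or None if there is none."""
    soup = make_soup(html)
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)


def fetch_title(url):
    """Download url with requests.get and return its page title."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors
    return page_title(response.text)


# Offline demo on a made-up page, so no network access is needed here
print(page_title("<html><head><title> Demo </title></head><body></body></html>"))
```

Calling fetch_title with the home page address would then print the live title. Keeping the download and the parsing in separate functions means the parsing part can be tried out on any HTML string first.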

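Beautiful Soup methods

These are the methods most often used to find content in a page with Beautiful Soup:

find_all() : starting from the current tag, finds every matching tag inside it and returns them as a list
find() : starting from the current tag, returns only the first matching tag as a single tag object, or None if nothing matches
select() : finds the matching tags using a CSS selector and returns them as a list

Elements can be located by tag name, id, or class.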
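The three methods differ mainly in what they return. The sketch below shows this on a small invented HTML string, using the built-in html.parser to avoid extra dependencies. select_one() is not mentioned above; it is included as the CSS-selector counterpart of find().

```python
from bs4 import BeautifulSoup

# Made-up markup with one id and one class to search for
html = """
<div id="logo"><a href="/">Home</a></div>
<div class="menu"><a href="/a">A</a><a href="/b">B</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")        # list-like: every <a> in the document
first_link = soup.find("a")           # a single tag: the first <a>
missing = soup.find("table")          # None, because nothing matches

by_id = soup.select("#logo")          # CSS id selector, returns a list
by_class = soup.select("div.menu a")  # <a> tags inside <div class="menu">
first_menu_link = soup.select_one(".menu a")  # first match only, or None

print(len(all_links), first_link["href"], missing)
print([a.get_text() for a in by_class])
```

Because find() and select_one() can return None, it is safer to check the result before reading attributes from it.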

The code below uses Beautiful Soup to get the content of specific tags from a sample web page.

import requests
from bs4 import BeautifulSoup

URL = 'https://www.iana.org/domains'
web = requests.get(URL)

# Parse with the html5lib parser
soup = BeautifulSoup(web.text, "html5lib")

# Find the tag whose id is logo, using a CSS selector
print(soup.select('#logo'))
print('\n----------\n')

# Find every div whose id is logo
print(soup.find_all('div', id="logo"))
print('\n----------\n')

# Find every div, then print the second one (index 1)
divs = soup.find_all('div')
print(divs[1])
print('\n----------\n')

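Parameters of Beautiful Soup methods

When calling Beautiful Soup methods, you can pass parameters to help filter the results.
Some commonly used parameters:

string : match on the text a tag contains
limit : cap the number of results returned; find_all() returns a list, and limit stops the search once that many matches are found
id : match on the tag's id
class_ : match on the tag's class; class is a reserved word in Python, so the parameter name takes a trailing underscore
href : match on the tag's href
attrs : match on the tag's attributes, given as a dictionary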
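Each of these parameters can be tried in isolation. The sketch below runs them against an invented list of links; the markup, the data-kind attribute, and the variable names are assumptions made for the example.

```python
import re
from bs4 import BeautifulSoup

# Made-up navigation list to filter
html = """
<ul>
  <li><a id="first" class="nav main" href="/domains">Domains</a></li>
  <li><a class="nav" href="/numbers">Numbers</a></li>
  <li><a class="nav" href="/protocols" data-kind="extra">Protocols</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_string = soup.find_all("a", string="Numbers")             # exact text match
limited = soup.find_all("a", limit=2)                        # at most two results
by_id = soup.find_all("a", id="first")                       # match on id
by_class = soup.find_all("a", class_="nav")                  # any tag carrying class nav
by_href = soup.find_all("a", href=re.compile(r"^/p"))        # href starting with /p
by_attrs = soup.find_all("a", attrs={"data-kind": "extra"})  # arbitrary attribute

for name, result in [("string", by_string), ("limit", limited), ("id", by_id),
                     ("class_", by_class), ("href", by_href), ("attrs", by_attrs)]:
    print(name, [a.get_text() for a in result])
```

The attrs dictionary is the general form: it also covers attribute names such as data-kind that cannot be written as Python keyword arguments.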

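Getting and printing content

Once you have located the content, there are two common ways to get its text or an attribute out as a string:

.get_text() : returns the text contained in the tag
[attribute] : returns the value of the given attribute of the tag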
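A short sketch of both access styles on an invented link follows. The .get() call is an addition not listed above; it reads an attribute without raising an error when the attribute is absent.

```python
from bs4 import BeautifulSoup

# Made-up paragraph containing one link
html = '<p>Visit <a href="https://www.iana.org/domains" title="IANA">the <b>domains</b> page</a></p>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

text = link.get_text()        # text of the tag, including nested tags
href = link["href"]           # attribute value; raises KeyError if absent
title = link.get("title")     # attribute value, or None if absent
missing = link.get("target")  # this link has no target attribute

print(text)
print(href)
print(title, missing)
```

Square brackets suit attributes that must exist, while .get() suits optional ones.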

The code below uses Beautiful Soup to find the div tag with class="navigation" in the sample page and print the content of every li tag beneath it.

import requests
from bs4 import BeautifulSoup

URL = 'https://www.iana.org/domains'
web = requests.get(URL)

# Parse with the html5lib parser
soup = BeautifulSoup(web.text, "html5lib")

# First find the div tag with class="navigation"
navigation_div = soup.find("div", class_="navigation")

# Find every li tag under that div (find() returns None when there is no match)
li_elements = navigation_div.find_all("li") if navigation_div else []

# Print the text and href attribute of the a tag in each li
for li in li_elements:
    a_tag = li.find('a')
    if a_tag:
        href = a_tag.get('href')  # None if the a tag has no href
        text = a_tag.text.strip()  # strip leading and trailing whitespace
        print(f'Text: {text}, Href: {href}')

References:
https://steam.oxxostudio.tw/category/python/spider/beautiful-soup.html
https://www.learncodewithmike.com/2020/02/python-beautifulsoup-web-scraper.html