【Day 12】BeautifulSoup, the Library That Shows Up Everywhere

2022 iThome Ironman Challenge

DAY 12

Self-Challenge Group

Part 12 of the series 養爬蟲的人學爬蟲 (A Reptile Keeper Learns Web Crawling)

14th Ironman Challenge

teresawang

2022-09-25 23:50:07

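Chit-chat
Yesterday I tried to get to know Pandas. Today I want to look at BeautifulSoup, which shows up in practically every web scraping tutorial you find online.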

BeautifulSoup
Beautiful Soup is a Python library that can extract data from HTML or XML files. It can also be used to repair malformed documents.

Installation

pip install beautifulsoup4

Import

from bs4 import BeautifulSoup

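Commonly used parsers for HTML documents

html.parser: built into Python, so there is nothing extra to install, but it is less tolerant of malformed markup.
lxml: fast, and tolerant of malformed markup.
html5lib: slower, but the most robust at parsing; this post uses it.

html5lib has to be installed before it can be used: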

pip install html5lib
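
To make the difference between the parsers more concrete, here is a minimal sketch that feeds the same broken markup to each of them. The helper name parse_with_each and the sample string are my own, purely for illustration. A parser that is not installed is reported as None rather than raising an error, so the sketch should run even if only html.parser is available.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed markup: an unclosed <a> followed by a stray </p>.
BROKEN_HTML = "<a></p>"


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Hypothetical helper: map each parser name to its repaired markup.

    A parser that is not installed maps to None.
    """
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # Raised by bs4 when the requested parser is not available.
            results[name] = None
    return results


results = parse_with_each(BROKEN_HTML)
for name, repaired in results.items():
    print(f"{name:12} -> {repaired if repaired is not None else 'not installed'}")
```

Each parser repairs the fragment in its own way, so the printed lines will usually differ. That is a good reason to pick one parser and name it explicitly every time.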

With all of that in place, the first step is to parse an HTML page.
The example here is https://ithelp.ithome.com.tw/users/20145359, and the page is fetched with the Requests package.

import requests
from bs4 import BeautifulSoup
url = 'https://ithelp.ithome.com.tw/users/20145359'
r = requests.get(url)  # send a GET request
soup = BeautifulSoup(r.text,'html5lib')  # parse the text of the response into a BeautifulSoup object
print(type(soup)) #output <class 'bs4.BeautifulSoup'>
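
If you want to check your installation without touching the network, the same step can be tried on an HTML string. This is only a sketch: the markup is made up, and it uses the built-in html.parser instead of html5lib so that nothing beyond beautifulsoup4 is needed.

```python
from bs4 import BeautifulSoup

# Made-up sample document, used only for this offline check.
html = (
    "<html><head><title>Demo page</title></head>"
    '<body><p class="intro">Hello</p></body></html>'
)

soup = BeautifulSoup(html, "html.parser")
print(type(soup))              # the same BeautifulSoup class as in the example above

title_text = soup.title.string  # text inside <title>
intro_text = soup.p.get_text()  # text inside the first <p>
print(title_text, intro_text)
```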

Getting tag attributes
To read an attribute of a tag, just index the tag the same way you would a dict.

import requests
from bs4 import BeautifulSoup
url = 'https://ithelp.ithome.com.tw/users/20145359'
r = requests.get(url) 
soup = BeautifulSoup(r.text,'html5lib')

links = soup.find_all('a')
for link in links:
    if 'href' in link.attrs:
        print(link['href'])
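
Here is a small offline sketch of that dict-like behaviour, using made-up markup rather than the real page. It also shows two things I find handy: .get() returns None instead of raising KeyError when the attribute is missing, and passing href=True to find_all keeps only the tags that have an href at all.

```python
from bs4 import BeautifulSoup

# Made-up markup: one link with an href, one <a> without.
html = (
    '<a href="/articles/1" class="btn primary" id="first">Day 1</a>'
    '<a id="anchor-only">no link</a>'
)
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("a")

href = first["href"]          # dict-style access to a single attribute
classes = first["class"]      # class is multi-valued, so this is a list
attrs = first.attrs           # all attributes of the tag as a dict
missing = second.get("href")  # None, because the second <a> has no href

# Equivalent to the "if 'href' in link.attrs" check in the loop above.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(href, classes, missing, hrefs)
```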

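Locating elements with BeautifulSoup

soup.find(): returns the first element that matches, as a Tag object, or None if nothing matches.
soup.find_all(): returns every element that matches, as a list (a ResultSet); if nothing matches, the list is empty.
soup.select(): locates elements with a CSS selector and returns every match as a list.
You can also locate elements by id and class, for example: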

soup.find(id = 'name', class_ = 'myclass')

The underscore after class is needed to avoid a clash with class, which is a reserved keyword in Python.
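
The three methods are easier to compare side by side. The sketch below runs them on a made-up snippet (the id, class names and text are placeholders of mine) and looks at what comes back when something matches and when nothing does.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = """
<div id="name" class="myclass">example</div>
<ul>
  <li class="item">one</li>
  <li class="item">two</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")                     # first match, a Tag
missing = soup.find("table")                # no match, so None
items = soup.find_all("li", class_="item")  # every match, in a list
none_found = soup.find_all("table")         # no match, so an empty list
selected = soup.select("ul > li.item")      # CSS selector, also a list
named = soup.find(id="name", class_="myclass")

print(first.get_text(), missing, len(items), none_found)
print([li.get_text() for li in selected], named.get_text())
```

Since find() can return None, it is worth checking the result before indexing it. find_all() and select() can be looped over directly, because an empty result simply means the loop body never runs.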

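Locating in practice
This example uses the title 【Day 1】從0開始學習爬蟲！ on https://ithelp.ithome.com.tw/users/20145359/ironman/5361. First, use the browser's element picker to find its markup: <h3 class="qa-list__title"> <a href="https://ithelp.ithome.com.tw/articles/10290463 " class="qa-list__title-link">【Day 1】從0開始學習爬蟲!</a></h3>

From this you can see that the link's class is "qa-list__title-link". Once you know the class, you can locate the element.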

import requests
from bs4 import BeautifulSoup
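url = 'https://ithelp.ithome.com.tw/users/20145359/ironman/5361'  # the series page described above
r = requests.get(url)
soup = BeautifulSoup(r.text,'html5lib')
link = soup.find('a', class_='qa-list__title-link')  # first matching <a>, or None
if link is not None:
    print(link['href'].strip())  # strip() removes the whitespace around the URL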
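
The same lookup can be tried offline against the markup found with the element picker. The sketch below wraps it in a small function, first_article_link, which is a name I made up for illustration. It uses html.parser so that it runs without html5lib, and it returns None when the page has no matching link.

```python
from bs4 import BeautifulSoup

# The markup from the element picker, copied from this post.
SAMPLE_HTML = (
    '<h3 class="qa-list__title"> '
    '<a href="https://ithelp.ithome.com.tw/articles/10290463 " '
    'class="qa-list__title-link">【Day 1】從0開始學習爬蟲!</a></h3>'
)


def first_article_link(html):
    """Hypothetical helper: return (title, url) of the first article link, or None."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("a", class_="qa-list__title-link")
    if link is None or not link.has_attr("href"):
        return None
    # strip() removes the stray whitespace inside the href value.
    return link.get_text(strip=True), link["href"].strip()


result = first_article_link(SAMPLE_HTML)
empty = first_article_link("<p>no articles here</p>")
print(result)
print(empty)
```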

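Conclusion
Today was a first look at the Beautiful Soup package. So far I have learned how to locate elements. It can do a lot more, and I will cover those features when I run into them.
Tomorrow will be a bit lighter: let's talk about the different kinds of web crawlers!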