[Day30] Web Scraping in Practice - iThome Article Titles 2.0

2022 iThome Ironman Contest

DAY 30

AI & Data

Part 30 of the series "30 Days from Zero to Python Web Scraping"

14th Ironman Contest

霓霓

2022-09-30 00:29:23

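Remember what [Day29] Web Scraping in Practice - iThome Article Titles scraped? Did you notice that it only contained the titles from the first page of my articles, even though I have tons of great articles (if I do say so myself)? That is because the URL we gave it was only the first page's URL, so of course the program will not fetch anything beyond it. So how do I scrape all of them? By combining it with the Selenium package, introduced two posts ago, to simulate a person clicking "next page".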

First, import the packages we need and start ChromeDriver, then load the page with the driver's get() method.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

s = Service(executable_path=r'./chromedriver')
driver = webdriver.Chrome(service=s)
driver.get("https://ithelp.ithome.com.tw/users/20140998/ironman/4362?page=1")

Next, find where the information we need sits in the HTML. The code is written the same way as before.

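# Parse the HTML that the browser has rendered
html = BeautifulSoup(driver.page_source, "html.parser")
# Each article entry sits in a <div class="qa-list">
elements = html.find_all("div", {"class": "qa-list"})
for element in elements:
    # The title is the text of the link inside the entry
    title = element.find("a", {"class": "qa-list__title-link"}).getText().strip()
    print(title)
    print("-"*30)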
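
The same extraction can be tried without opening a browser by running it on a small HTML string. This is only a minimal sketch: SAMPLE_HTML is made-up markup that imitates the class names used above, and extract_titles is a hypothetical helper name, not something from the earlier posts. It also skips entries that have no title link instead of raising an error.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure assumed in this post
SAMPLE_HTML = """
<div class="qa-list">
  <a class="qa-list__title-link" href="#"> First post </a>
</div>
<div class="qa-list">
  <a class="qa-list__title-link" href="#">Second post</a>
</div>
<div class="qa-list"><span>no link here</span></div>
"""


def extract_titles(page_source):
    """Return the stripped text of every title link found in page_source."""
    html = BeautifulSoup(page_source, "html.parser")
    titles = []
    for element in html.find_all("div", {"class": "qa-list"}):
        link = element.find("a", {"class": "qa-list__title-link"})
        if link is not None:  # skip entries without a title link
            titles.append(link.get_text().strip())
    return titles


titles = extract_titles(SAMPLE_HTML)
print(titles)
```

With Selenium, the same function could be called as extract_titles(driver.page_source).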

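Next comes the dynamic part: after scraping page 1 I need to move on to page 2, then page 3, and so on, so I wrap the code in a loop.
page_next = driver.find_elements(By.XPATH, "//div[@class='profile-pagination']//li/a")[-1] locates the "next page" button. I use By.XPATH to locate the pagination buttons. Because the bottom of the page offers many page options, //li/a matches every one of those buttons, but the one I want is "next page". It is the last of all the buttons, so [-1] picks it out quickly!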

from selenium.webdriver.common.by import By
import time

for page in range(1, 5):  # Scrape pages 1 to 4
    html = BeautifulSoup(driver.page_source, "html.parser")
    elements = html.find_all("div", {"class": "qa-list"})

    print("-"*10, "Page", page, "-"*10)
    for element in elements:
        title = element.find("a", {"class": "qa-list__title-link"}).getText().strip()
        print(title)
        print("-"*30)
    # The last pagination link is the "next page" button
    page_next = driver.find_elements(By.XPATH, "//div[@class='profile-pagination']//li/a")[-1]
    page_next.click()  # Click the "next page" button

    time.sleep(1)  # Pause for 1 second so the next page can load
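
Since the starting URL already ends in ?page=1, another way to reach later pages is to build each page's URL and load it with driver.get() rather than clicking. The sketch below is not the approach used in this post. It assumes the site serves page N at ?page=N, which is an assumption based only on the first page's URL, and page_url is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

START_URL = "https://ithelp.ithome.com.tw/users/20140998/ironman/4362?page=1"


def page_url(url, page):
    """Return url with its 'page' query parameter set to the given page number."""
    parts = urlparse(url)
    query = parse_qs(parts.query)
    query["page"] = [str(page)]  # overwrite (or add) the page parameter
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


# URLs for pages 1 to 4 (assumes ?page=N works for every page)
urls = [page_url(START_URL, n) for n in range(1, 5)]
for url in urls:
    print(url)
```

Each URL could then be passed to driver.get(url) inside the loop in place of the click. Clicking is still the closer imitation of what a person does, which is the point of using Selenium here.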

Complete code

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

s = Service(executable_path=r'./chromedriver')
driver = webdriver.Chrome(service=s)
driver.get("https://ithelp.ithome.com.tw/users/20140998/ironman/4362?page=1")

for page in range(1, 5):  # Scrape pages 1 to 4
    html = BeautifulSoup(driver.page_source, "html.parser")
    elements = html.find_all("div", {"class": "qa-list"})

    print("-"*10, "Page", page, "-"*10)
    for element in elements:
        title = element.find("a", {"class": "qa-list__title-link"}).getText().strip()
        print(title)
        print("-"*30)
    # The last pagination link is the "next page" button
    page_next = driver.find_elements(By.XPATH, "//div[@class='profile-pagination']//li/a")[-1]
    page_next.click()  # Click the "next page" button

    time.sleep(1)  # Pause for 1 second so the next page can load

driver.quit()  # Close the browser when finished