Day 5. So, What Are We Doing Today? - My Scraper Has Pagination! (Python Static Scraping) - iT 邦幫忙

2023 iThome Ironman Contest

DAY 5

Self-Challenge Group

待業不頹廢 (Unemployed but Not Idle) series, Part 5

15th Ironman Contest

yojijun

2023-09-20 17:17:57

Recap before we start

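The code below looked finished yesterday, but the site actually has roughly 70 to 80 testimonial posts,
split into pages of 10. I only got the first 10, because the current code cannot yet scrape across pages.
So today let's learn how to do that.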

import requests
from bs4 import BeautifulSoup

url = "https://astro.5xruby.tw/testimony/"
response = requests.get(url)
response.encoding = "utf-8"

soup = BeautifulSoup(response.text, "html.parser")
data = {}

author_infos = soup.find_all("h4", class_="is-author")

for author in author_infos:
    print(author.text)

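Handling multiple pages in static scraping

- Loop: paging through results requires a loop that visits each page in turn and scrapes the data we need.
- URL handling: inside the loop, the URL has to be updated for each page, usually by changing a parameter such as the page number.
- Scraping: once a page has been fetched, extract the data exactly as you would for a single page.

To scrape a multi-page static site, these are the things to take care of:

Starting URL & page count
Define the starting URL ( base_url ) and the number of pages to scrape ( num_pages )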
```
base_url = "https://example.com/page="

# Number of pages to scrape
num_pages = 5
```
The base_url above plays the same role as the original url = "https://astro.5xruby.tw/testimony/", just with a more descriptive name. Keeping it as url = "https://astro.5xruby.tw/testimony/" works too.

num_pages = 5 assumes there are five pages, so we want the loop that follows to run five times.

Pagination loop with a status check
```
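# Pages are numbered from 1, so this runs exactly num_pages times
for page in range(1, num_pages + 1):
    # Build the full URL for this page
    page_url = f"{base_url}{page}"

    # Send the HTTP GET request
    response = requests.get(page_url)
    # Check the response status before parsing
    if response.status_code == 200:
        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(response.text, "html.parser")
        # Extract and process the data here
        # ...
        # Print or store the data
        # ...
    else:
        print(f"Failed to retrieve page {page_url}")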
```
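Each page has its own URL, and page_url = f"{base_url}{page}" is what builds it for the current page.
Once the URL is built, requests.get() sends the HTTP GET request for that page.
The new part is the extra check: 200 means the request succeeded. For GET, that means the resource was fetched and is sent back in the message body.
If the check passes, we are back to the step we already know: parsing the HTML with BeautifulSoup.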

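response.status_code is the attribute that holds the HTTP status code of the response.
An HTTP status code is a three-digit number that indicates the outcome of a request. Each code has its own meaning. Some common ones:
- 200: OK. The request succeeded and the requested resource is returned.
- 404: Not Found. The requested resource does not exist on the server.
- 403: Forbidden. The server understood the request but refuses to fulfil it.
- 500: Internal Server Error. The server hit an internal error and could not complete the request.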
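
The codes listed above can also be looked up from Python's standard library, which is handy when printing a more readable failure message. This is only a minimal sketch: describe_status and is_success are hypothetical helper names that are not part of the original code, and treating any 2xx code as success is an assumption (the loop in this article checks for 200 only).

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code with its standard reason phrase, e.g. '404 Not Found'.

    HTTPStatus(code) raises ValueError for codes it does not know.
    """
    return f"{code} {HTTPStatus(code).phrase}"


def is_success(code: int) -> bool:
    """Assumption for this sketch: any 2xx status counts as success."""
    return 200 <= code < 300


# Print a short summary for the four codes discussed above
for code in (200, 403, 404, 500):
    outcome = "ok" if is_success(code) else "failed"
    print(f"{describe_status(code)} -> {outcome}")
```

In the scraping loop, describe_status(response.status_code) could be added to the "Failed to retrieve page" message so the reason for the failure is visible.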

After modifying the code along those lines, things start to look promising: a lot more authors show up.

import requests
from bs4 import BeautifulSoup

# Starting URL
base_url = "https://astro.5xruby.tw/testimony/"

# Number of pages to scrape
num_pages = 7

for page in range(num_pages + 1):
    # Build the full URL for this page
    page_url = f"{base_url}page/{page}"

    response = requests.get(page_url)
    response.encoding = "utf-8"

    if response.status_code == 200:
        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(response.text, "html.parser")
        
        data = {}
        author_infos = soup.find_all("h4", class_="is-author")

        for author in author_infos:
            print(f"Author: {author.text}")

    else:
        print(f"Failed to retrieve page {page_url}")

Wait, what are these two lines doing here...?

Failed to retrieve page https://astro.5xruby.tw/testimony/page/0
Failed to retrieve page https://astro.5xruby.tw/testimony/page/1

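The site has seven pages in total, but the URL of the first page is simply "https://astro.5xruby.tw/testimony/".
The incrementing "page/2" suffix only starts with the second page.
So there are two problems: the first page's data is missing, and the requests for page/0 and page/1 are superfluous.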
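
To see the URL rule on its own, before any request is sent, it can be written as a small function that just lists the addresses. A minimal sketch: build_page_urls is a hypothetical helper name, and the "page/N" pattern is taken from the observations above rather than from any site documentation.

```python
from typing import List


def build_page_urls(base_url: str, num_pages: int) -> List[str]:
    """List the URL of every page, numbering pages from 1.

    Page 1 is the bare base URL; later pages append "page/<n>".
    """
    urls = []
    for page in range(1, num_pages + 1):
        if page == 1:
            urls.append(base_url)
        else:
            urls.append(f"{base_url}page/{page}")
    return urls


urls = build_page_urls("https://astro.5xruby.tw/testimony/", 7)
for url in urls:
    print(url)
```

Printing the list first makes it easy to confirm that neither page/0 nor page/1 is requested and that the first page is included.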

After tidying up and fixing those issues, the code looks like this:

import requests
from bs4 import BeautifulSoup

# Starting URL
base_url = "https://astro.5xruby.tw/testimony/"

# Pagination: number of pages to scrape
num_pages = 7

for page in range(1, num_pages + 1):
    # Build the full URL for this page
    if page == 1:
        page_url = base_url  # URL of the first page
    else:
        page_url = f"{base_url}page/{page}"  # URL of every other page

    # Send the HTTP GET request
    response = requests.get(page_url)
    response.encoding = "utf-8"

    if response.status_code == 200:
        # Parse the HTML with BeautifulSoup
        soup = BeautifulSoup(response.text, "html.parser")
        
        data = {}
        author_infos = soup.find_all("h4", class_="is-author")

        for author in author_infos:
            print(f"Author: {author.text}")
        
    else:
        print(f"Failed to retrieve page {page_url}")

Looks like it works now!
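
One possible refinement, since num_pages = 7 is hard-coded: keep requesting pages until one fails. The sketch below is not the code used in this article. It assumes that a page past the last one returns a non-200 status, which has not been verified for this site. iter_pages, fetch, max_pages and fake_fetch are hypothetical names, and the fetcher is passed in as a function so the loop can be tried without touching the network.

```python
from typing import Callable, Iterator, Tuple

# A fetcher takes a URL and returns (status_code, html_text)
Fetch = Callable[[str], Tuple[int, str]]


def iter_pages(fetch: Fetch, base_url: str, max_pages: int = 50) -> Iterator[Tuple[str, str]]:
    """Yield (url, html) page by page, stopping at the first non-200 response.

    max_pages is a safety cap so the loop always terminates.
    """
    for page in range(1, max_pages + 1):
        url = base_url if page == 1 else f"{base_url}page/{page}"
        status, html = fetch(url)
        if status != 200:
            break
        yield url, html


def fake_fetch(url: str) -> Tuple[int, str]:
    """Stand-in fetcher that pretends the site has exactly three pages."""
    known = {
        "https://example.com/list/": "<h4>page 1</h4>",
        "https://example.com/list/page/2": "<h4>page 2</h4>",
        "https://example.com/list/page/3": "<h4>page 3</h4>",
    }
    if url in known:
        return 200, known[url]
    return 404, ""


pages = list(iter_pages(fake_fetch, "https://example.com/list/"))
for url, _html in pages:
    print(url)
```

With requests, the fetcher would be a small function that calls requests.get(url) and returns response.status_code together with response.text. Each html string can then be parsed with BeautifulSoup as before.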