[Day 12] - This page is done, on to the next one. Crawl PTT to the fullest! (Hands-on PTT crawler 2/3)

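2021 iThome Ironman Contest

DAY 12

AI & Data

Web Crawlers, Everything Can Be Crawled - Understand and Practice Web Crawling and Countering Anti-Crawler Techniques in 30 Days, series part 12

13th Ironman Contest

Vincent55

Team: 肝已經，死了

2021-09-27 22:32:50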

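Recap

The previous article walked through a crawler that scrapes the articles on the current PTT page and sends an over-18 cookie to skip the age check.

Before We Start

This article continues the PTT crawler and finishes the continuous-crawling part. Technically, each time the crawler processes a page it also extracts the URL of the next page. Once the articles are scraped, it sends a request to that URL with requests and repeats the process.

Expected Result

Scrape every article on the current page and skip the over-18 check (implemented on Day 11)

Extract the next page's URL

Send a request to the next page

Repeat this loop n times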
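The loop described by these steps can be sketched without touching the network. This is only an illustration under stated assumptions: `crawl`, `fetch_page`, and the `FAKE_PAGES` table are hypothetical names, and each fake page simply maps to its titles plus the URL of the next page to crawl.

```python
from typing import Callable, List, Optional, Tuple


def crawl(start_url: str,
          fetch_page: Callable[[str], Tuple[List[str], Optional[str]]],
          max_pages: int) -> List[str]:
    """Follow next-page links at most max_pages times and gather all titles."""
    titles: List[str] = []
    url = start_url
    for _ in range(max_pages):
        page_titles, next_url = fetch_page(url)
        titles.extend(page_titles)
        if next_url is None:  # nothing further to follow
            break
        url = next_url
    return titles


# Hypothetical stand-in for a site: URL -> (titles on the page, next URL to crawl)
FAKE_PAGES = {
    'page3': (['c1', 'c2'], 'page2'),
    'page2': (['b1'], 'page1'),
    'page1': (['a1'], None),
}

result = crawl('page3', FAKE_PAGES.__getitem__, max_pages=2)
print(result)  # ['c1', 'c2', 'b1']
```

In the real crawler, the fetch step is a `requests.get` call plus HTML parsing, but the control flow stays the same.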

Implementation

This is yesterday's code.

import requests
from bs4 import BeautifulSoup
url = 'https://www.ptt.cc/bbs/Gossiping/index.html'
cookies = {
    'over18': '1'
}
resp = requests.get(url, cookies=cookies)
soup = BeautifulSoup(resp.text, 'html5lib')
arts = soup.find_all('div', class_='r-ent')
for art in arts:
    title = art.find('div', class_='title').getText().strip()
    link = 'https://www.ptt.cc' + \
        art.find('div', class_='title').a['href'].strip()
    author = art.find('div', class_='author').getText().strip()
    print(f'title: {title}\nlink: {link}\nauthor: {author}')

We can first wrap fetching resp and scraping the current page's articles into functions.

import requests
from bs4 import BeautifulSoup
url = 'https://www.ptt.cc/bbs/Gossiping/index.html'

def get_resp():
    cookies = {
        'over18': '1'
    }
    resp = requests.get(url, cookies=cookies)
    if resp.status_code != 200:
        return 'error'
    else:
        return resp

def get_articles(resp):
    soup = BeautifulSoup(resp.text, 'html5lib')
    arts = soup.find_all('div', class_='r-ent')
    for art in arts:
        title = art.find('div', class_='title').getText().strip()
        link = 'https://www.ptt.cc' + \
            art.find('div', class_='title').a['href'].strip()
        author = art.find('div', class_='author').getText().strip()
        print(f'title: {title}\nlink: {link}\nauthor: {author}')

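resp = get_resp()
# get_resp() returns 'error' on a non-200 status, so only parse a real response
if resp != 'error':
    get_articles(resp)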

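Next, let's use the browser's developer tools to find where the URL of the next page lives. (PTT's index opens on the newest page, so the "next page" to crawl is the older one.)

The URL turns out to be in the href attribute of an a element, the second child of the div with the classes btn-group btn-group-paging.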
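To make that structure concrete, here is a small sketch that pulls the second link out of a paging group. It uses only the standard library's `html.parser` so it runs without extra packages (the article's own code uses BeautifulSoup). The `SAMPLE_HTML` markup is an assumed, simplified imitation of the page, and the link labels and the page number `index39000` are made up.

```python
from html.parser import HTMLParser

# Assumed, simplified markup; labels and page numbers are placeholders
SAMPLE_HTML = '''
<div id="action-bar-container">
  <div class="action-bar">
    <div class="btn-group btn-group-dir">
      <a class="btn selected" href="/bbs/Gossiping/index.html">board</a>
    </div>
    <div class="btn-group btn-group-paging">
      <a class="btn wide" href="/bbs/Gossiping/index1.html">oldest</a>
      <a class="btn wide" href="/bbs/Gossiping/index39000.html">previous</a>
      <a class="btn wide disabled">next</a>
      <a class="btn wide" href="/bbs/Gossiping/index.html">newest</a>
    </div>
  </div>
</div>
'''


class PagingLinkParser(HTMLParser):
    """Collect the href of every <a> inside div.btn-group-paging."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # div nesting depth inside the paging group (0 = outside)
        self.hrefs = []  # one entry per <a>; None when the tag has no href

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div':
            classes = (attrs.get('class') or '').split()
            if self.depth or 'btn-group-paging' in classes:
                self.depth += 1
        elif tag == 'a' and self.depth:
            self.hrefs.append(attrs.get('href'))

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1


parser = PagingLinkParser()
parser.feed(SAMPLE_HTML)
second_link = parser.hrefs[1]
print(second_link)  # /bbs/Gossiping/index39000.html
```

Note that a link without an href shows up as None in this sketch, which is worth keeping in mind when indexing into the result.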

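Now that we know where it is, we can add some code to grab that URL. A CSS selector can target the element directly. If you are not sure how to get an element's CSS selector from the developer tools, see the latter part of [Day 08] - Requests-HTML, with data-cleaning features built in.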

next_url = 'https://www.ptt.cc' + soup.select('#action-bar-container > div > div.btn-group.btn-group-paging > a:nth-child(2)')[0]['href']
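String concatenation works here because the href is a root-relative path. As an alternative sketch, `urllib.parse.urljoin` from the standard library resolves the href against the current page's URL, so the host does not have to be repeated by hand. The href value below is a made-up example.

```python
from urllib.parse import urljoin

page_url = 'https://www.ptt.cc/bbs/Gossiping/index.html'
href = '/bbs/Gossiping/index39000.html'  # hypothetical href taken from the paging link

# Resolve the relative href against the URL of the page it came from
next_url = urljoin(page_url, href)
print(next_url)  # https://www.ptt.cc/bbs/Gossiping/index39000.html
```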

Now we can get the URL of the next page, so let's write a loop that crawls repeatedly.

import requests
from bs4 import BeautifulSoup

def get_resp(url):
    cookies = {
        'over18': '1'
    }
    resp = requests.get(url, cookies=cookies)
    if resp.status_code != 200:
        return 'error'
    else:
        return resp

def get_articles(resp):
    soup = BeautifulSoup(resp.text, 'html5lib')
    arts = soup.find_all('div', class_='r-ent')
    for art in arts:
        title = art.find('div', class_='title').getText().strip()
        link = 'https://www.ptt.cc' + \
            art.find('div', class_='title').a['href'].strip()
        author = art.find('div', class_='author').getText().strip()
        print(f'title: {title}\nlink: {link}\nauthor: {author}')
    # Locate the next page's URL with a CSS selector
    next_url = 'https://www.ptt.cc' + \
        soup.select_one(
            '#action-bar-container > div > div.btn-group.btn-group-paging > a:nth-child(2)')['href']
    return next_url

# True only when this file is run as a script
if __name__ == '__main__':
    # URL of the first page
    url = 'https://www.ptt.cc/bbs/Gossiping/index.html'
    # Crawl 10 pages for now
    for now_page_number in range(10):
        resp = get_resp(url)
        if resp != 'error':
            url = get_articles(resp)
        print(f'======={now_page_number+1}/10=======')
''' Some unnecessary output has been omitted
title: [問卦] 30歲的魔法該學冰系還是火系好？
link: https://www.ptt.cc/bbs/Gossiping/M.1632417857.A.562.html
author: ejo3and503
title: Re: [新聞] 清大設「後醫系」 醫師公會怒：醫師浮濫
link: https://www.ptt.cc/bbs/Gossiping/M.1632417931.A.799.html
author: driftingjong
title: [問卦] 翁達瑞，是一群人嗎？
link: https://www.ptt.cc/bbs/Gossiping/M.1632417965.A.716.html
author: LEDG
title: [問卦] 幾歲開始刷Leetcode才有競爭力？
link: https://www.ptt.cc/bbs/Gossiping/M.1632418019.A.DE6.html
author: dixitdeus
title: Re: [爆卦] 美國教授踢爆高虹安大數據招牌造假
link: https://www.ptt.cc/bbs/Gossiping/M.1632418069.A.106.html
author: zombiechen
title: [新聞] 3+11沒紀錄？ 陳時中：再問100遍就是沒有
link: https://www.ptt.cc/bbs/Gossiping/M.1632418088.A.8E8.html
author: shinmoner
title: [問卦] 為啥女生愛看耽美，男生沒那麼愛看百合?
link: https://www.ptt.cc/bbs/Gossiping/M.1632418139.A.B6D.html
author: s9234032
title: [問卦] 在美國賣壽司是不是暴利？
link: https://www.ptt.cc/bbs/Gossiping/M.1632418248.A.F80.html
author: hwang1460
title: [問卦] 早九上課現在還沒睡怎辦？
link: https://www.ptt.cc/bbs/Gossiping/M.1632418281.A.9F1.html
author: WeiU
title: [問卦] 五倍券可以拿來吃魚喝茶嗎
link: https://www.ptt.cc/bbs/Gossiping/M.1632418285.A.C57.html
author: blessbless
title: Re: [問卦] 台灣為什麼不盛行餐酒文化?
link: https://www.ptt.cc/bbs/Gossiping/M.1632418312.A.A1F.html
author: noway
title: [問卦] 周杰倫是什麼時候開始走下坡的？
link: https://www.ptt.cc/bbs/Gossiping/M.1632418477.A.959.html
author: boboken
title: [公告] 八卦板板規(2021.05.11)
link: https://www.ptt.cc/bbs/Gossiping/M.1620716589.A.F0C.html
author: arsonlolita
title: [協尋] 求行車紀錄畫面(9/16上午內湖遊戲橘子旁)
link: https://www.ptt.cc/bbs/Gossiping/M.1631948458.A.D73.html
author: umbrella0613
title: [公告] 中秋節我家兔兔辣麼口愛活動投票
link: https://www.ptt.cc/bbs/Gossiping/M.1632244429.A.388.html
author: ubcs
title: [協尋] 橘貓咪嚕快點回家！（大安區）
link: https://www.ptt.cc/bbs/Gossiping/M.1632305989.A.5E0.html
author: k020231310
title: [協尋] 新北蘆洲區環提大道行車記錄器
link: https://www.ptt.cc/bbs/Gossiping/M.1632345107.A.8AF.html
author: anpep
=======1/10=======
title: Re: [新聞] 清大設「後醫系」 醫師公會怒：醫師浮濫
link: https://www.ptt.cc/bbs/Gossiping/M.1632417296.A.2F5.html
author: organize222
'''

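At this point the program hits a problem. The traceback shows that an entry has no article link, presumably because the article was deleted.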

Traceback (most recent call last):
  File "c:\Users\50205\OneDrive\桌面\a\test.py", line 41, in <module>
    url = get_articles(resp)
  File "c:\Users\50205\OneDrive\桌面\a\test.py", line 22, in get_articles
    art.find('div', class_='title').a['href'].strip()
TypeError: 'NoneType' object is not subscriptable

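Adding a check solves the problem: build the link only when the title contains an a tag, and set it to None otherwise so that link is always defined.

title_div = art.find('div', class_='title')
title = title_div.getText().strip()
# Deleted articles have no <a> tag inside the title div
if title_div.a is not None:
    link = 'https://www.ptt.cc' + title_div.a['href'].strip()
else:
    link = None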
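Whichever condition is used, it helps to make sure link gets a value for every row, deleted or not. Otherwise the print can fail or reuse the previous article's link. A minimal sketch of that idea, where `build_record` is a hypothetical helper and the sample rows are made up:

```python
from typing import Optional
from urllib.parse import urljoin

BASE_URL = 'https://www.ptt.cc'


def build_record(title: str, href: Optional[str], author: str) -> dict:
    """Return one article record; link is None when the row has no href."""
    link = urljoin(BASE_URL, href) if href else None
    return {'title': title, 'link': link, 'author': author}


# Made-up rows: (title, href or None, author)
rows = [
    ('[問卦] example title', '/bbs/Gossiping/M.1632417857.A.562.html', 'ejo3and503'),
    ('(本文已被刪除) [someone]', None, '-'),
]
records = [build_record(*row) for row in rows]
for record in records:
    print(record)
```

Keeping None as the placeholder also makes deleted entries easy to filter out later.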

Complete Code

import requests
from bs4 import BeautifulSoup

def get_resp(url):
    cookies = {
        'over18': '1'
    }
    resp = requests.get(url, cookies=cookies)
    if resp.status_code != 200:
        return 'error'
    else:
        return resp

def get_articles(resp):
    soup = BeautifulSoup(resp.text, 'html5lib')
    arts = soup.find_all('div', class_='r-ent')
    for art in arts:
        title_div = art.find('div', class_='title')
        title = title_div.getText().strip()
        # Deleted articles have no <a> tag, so there is no link to build
        if title_div.a is not None:
            link = 'https://www.ptt.cc' + title_div.a['href'].strip()
        else:
            link = None
        author = art.find('div', class_='author').getText().strip()
        print(f'title: {title}\nlink: {link}\nauthor: {author}')
    # Locate the next page's URL with a CSS selector
    next_url = 'https://www.ptt.cc' + \
        soup.select_one(
            '#action-bar-container > div > div.btn-group.btn-group-paging > a:nth-child(2)')['href']
    return next_url

# True only when this file is run as a script
if __name__ == '__main__':
    # URL of the first page
    url = 'https://www.ptt.cc/bbs/Gossiping/index.html'
    # Crawl 10 pages for now
    for now_page_number in range(10):
        print(f'crawling {url}')
        resp = get_resp(url)
        if resp != 'error':
            url = get_articles(resp)
        print(f'======={now_page_number+1}/10=======')

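Conclusion

Today we implemented continuous crawling of PTT articles by also extracting the next page's URL and sending a request to it.

Tomorrow

We will continue with the PTT crawler. So far the scraped data is only printed to the terminal. Tomorrow we will save it to a JSON file.

Further Reading

PTT Gossiping board: https://www.ptt.cc/bbs/Gossiping/index.html

[Day 11] - Still clicking "I am over 18" on PTT? Bring your cookies! (Hands-on PTT crawler 1/3)

[Day 13] - Saving scraped PTT articles as JSON. (Hands-on PTT crawler 3/3)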

1 comment

arguskao

iT邦 Novice, level 3 ‧ 2022-05-16 14:35:59

next_url = 'https://www.ptt.cc' + soup.select('#action-bar-container > div > div.btn-group.btn-group-paging > a:nth-child(2)')[0]['href']

There is a main-container one level further up. Why don't we need to start from it?

Vincent55 iT邦 Novice, level 4 ‧ 2022-05-26 23:45:11

action-bar-container is an id that in theory will not change, so we can start from it and work down to the element we want.
