【Day 14】爬進PTT的網頁吧！（實戰PTT 1/3）

2022 iThome 鐵人賽

DAY 14

自我挑戰組

養爬蟲的人學爬蟲系列第 14 篇

14th鐵人賽

teresawang

2022-09-27 00:21:56

1899 瀏覽

分享至

閒聊
在前面幾天學習後，今天要來試著爬取PTT的八卦版。

今天會用到的工具

VS Code
Requests
BeautifulSoup

預期目標
繞過確認已滿18歲

能夠爬取到當前網頁的文章

實作

第一步先導入需要的函式庫，將目標網址存在url中。

import requests
from bs4 import BeautifulSoup
url = 'https://www.ptt.cc/bbs/Gossiping/index.html'

這時候依據剛剛發送請求的url，回應物件（text）。
可以看到，print出來的是「是否滿18歲」，而不是直接看到所有文章的葉面內容（HTML）。

r = requests.get(url)
print(r.text)

#output
<!DOCTYPE html>
<html>
        <head>
                <meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>批踢踢實業坊</title>
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-common.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-base.css" media="screen">  
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-custom.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/pushstream.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.27/bbs-print.css" media="print">  
        </head>
    <body>
<div class="bbs-screen bbs-content">
    <div class="over18-notice">
        <p>本網站已依網站內容分級規定處理</p>
        <p>警告︰您即將進入之看板內容需滿十八歲方可瀏覽。</p>
        <p>若您尚未年滿十八歲，請點選離開。若您已滿十八歲，亦不可將本區之內容派發、傳閱、出售、出租、交給或借予年齡未滿18歲的人士瀏覽，或將本網站內容向該人士出示、播
放或放映。</p>
    </div>
</div>
<div class="bbs-screen bbs-content center clear">
    <form action="/ask/over18" method="post">
        <input type="hidden" name="from" value="/bbs/Gossiping/index.html">
        <div class="over18-button-container">
            <button class="btn-big" type="submit" name="yes" value="yes">我同意，我已年滿十八歲<br><small>進入</small></button>
        </div>
        <div class="over18-button-container">
            <button class="btn-big" type="submit" name="no" value="no">未滿十八歲或不同意本條款<br><small>離開</small></button>
        </div>
    </form>
</div>
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
  ga('create', 'UA-32365737-1', {
    cookieDomain: 'ptt.cc',
    legacyCookieDomain: 'ptt.cc'
  });
  ga('send', 'pageview');
</script>
<script src="//ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="//images.ptt.cc/bbs/v2.27/bbs.js"></script>

    <script defer src="https://static.cloudflareinsights.com/beacon.min.js/v652eace1692a40cfa3763df669d7439c1639079717194" integrity="sha512-Gi7xpJR8tSkrpF7aordPZQlW2DLtzUlZcumS8dMQjwDHEnw9I7ZLyiOj/6tZStRBGtGgN6ceN6cMH8z7etPGlw==" data-cf-beacon='{"rayId":"750a0d541db6e05a","version":"2022.8.1","r":1,"token":"515615eb5fab4c9b91a11e9bf529e6cf","si":100}' crossorigin="anonymous"></script>
</body>
</html>

第二步，我們可以觀察，點擊第一次已滿18歲後，要再次進入時，就不需要了。
所以可以判斷，這裡應該用了之前在Requests篇章介紹過的cookies（有修改功能）。
這裡可以用Cookie-Editor來查看點擊已滿18歲後，網域下的cookies。

這裡的over 18是用來記錄是否點擊過已滿18歲，我們可以把這個cookies當作requests.get()的參數一同傳入。

from http import cookies
import requests
url = 'https://www.ptt.cc/bbs/Gossiping/index.html'
cookies = {'over18':1}
r = requests.get(url, cookies= cookies)
print(r.text)

#output
<!DOCTYPE html>
<html>
    <body>
<a href="/bbs/Gossiping/M.1632305989.A.5E0.html">[協尋] 橘貓咪嚕快點回家！（大安區）</a>
</div>
<div class="meta">
        <div class="author">k020231310</div>      
        <div class="article-menu">
                <div class="trigger">&#x22ef;</div>
                <div class="dropdown">
                        <div class="item"><a href="/bbs/Gossiping/search?q=thread%3A%5B%E5%8D%94%E5%B0%8B%5D&#43;%E6%A9%98%E8%B2%93%E5%92%AA%E5%9A%95%E5%BF%AB%E9%BB%9E%E5%9B%9E%E5%AE%B6%EF%BC%81%EF%BC%88%E5%A4%A7%E5%AE%89%E5%8D%80%EF%BC%89">搜尋同標題文章</a></div>       
                        <div class="item"><a href="/bbs/Gossiping/search?q=author%3Ak020231310">搜尋看板內 k020231310 的文章
</a></div>
                </div>
        </div>
        <div class="date"> 9/22</div>
        <div class="mark"></div>
    </body>
</html>
'''

第三步要使用BeautifulSoup進行資料清洗，將我們要的內容清理出來。
先使用開發工具的選取功能，選出文章範圍。

可以在這個div下看到包含該文章的資訊（紅框）。

我們這邊可以先爬取：文章內容、文章連結和作者，這邊會用到第12天學習到的BeautifulSoup。

import requests
from bs4 import BeautifulSoup
url = 'https://www.ptt.cc/bbs/Gossiping/index.html'
cookies = {
    'over18' : 1
}
r = requests.get(url, cookies = cookies)
soup = BeautifulSoup(resp.text, 'html5lib')
arts = soup.find_all('div', class_='r-ent')
for art in arts :
    title = art.find('div', class_='title').getText().strip()
    link = 'https://www.ptt.cc' + \
        art.find('div', class_='title').a['href'].strip()
    author = art.find('div', class_='author').getText().strip()
    print(f'title: {title}\nlink: {link}\nauthor: {author}')

結語
今天正式的第一次使用Requests爬蟲，也使用了cookies的功能。

明天！
【Day 15】爬完這邊繼續！（實戰PTT 2/3）

參考資料
PTT八卦版https://www.ptt.cc/bbs/Gossiping/index.html