iT邦幫忙

2021 iThome Ironman Contest

DAY 25
Video Tutorial

A Humanities Student's Python Web Scraping Journey series, Part 25

Day 25: Scraping the PTT Gossiping Board

We can finally step out of the newbie village!
After all that solid training, we now have the skills to scrape the sites we actually want.
In today's video we scrape the Gossiping board of the well-known forum PTT, analyzing the page in detail and extending the code step by step.
Without further ado, let's get started!


Below is the code used in the video.

import requests, bs4

url = "https://www.ptt.cc/bbs/Gossiping/index.html"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36'}
cookies = {'over18': '1'}  # the Gossiping board asks for age confirmation; this cookie answers "yes"
htmlfile = requests.get(url, headers=headers, cookies=cookies)
objsoup = bs4.BeautifulSoup(htmlfile.text, 'lxml')  # 'lxml' needs the lxml package installed; 'html.parser' also works

articles = objsoup.find_all('div', class_='r-ent')  # each 'r-ent' div is one article entry

number = 0

for article in articles:

    title = article.find('a')
    author = article.find('div', class_='author')
    date = article.find('div', class_='date')

    if title is None:  # deleted posts (shown as "(本文已被刪除)") have no <a> tag
        continue
    else:
        number += 1
        print("Article number:", number)
        print("Title:", title.text)
        print("Author:", author.text)
        print("Date:", date.text)
        print("=" * 100)
Next, we extend the program: after printing the article list, we also look up the URL of the previous (older) page so we can keep crawling backwards.

import requests, bs4

url_1 = "https://www.ptt.cc"
url_2 = "/bbs/Gossiping/index.html"

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36'}
cookies = {'over18': '1'}
htmlfile = requests.get(url_1 + url_2, headers=headers, cookies=cookies)
objsoup = bs4.BeautifulSoup(htmlfile.text, 'lxml')

articles = objsoup.find_all('div', class_='r-ent')

number = 0

for article in articles:

    title = article.find('a')
    author = article.find('div', class_='author')
    date = article.find('div', class_='date')

    if title is None:  # skip deleted posts, which have no <a> tag
        continue
    else:
        number += 1
        print("Article number:", number)
        print("Title:", title.text)
        print("Author:", author.text)
        print("Date:", date.text)
        print("=" * 100)

before = objsoup.find_all('a', class_='btn wide')  # the four paging buttons: 最舊 / ‹ 上頁 / 下頁 › / 最新
url_2 = before[1].get('href')  # index 1 is the '‹ 上頁' (previous page) button
print("Previous page URL:", url_1 + url_2)
Finally, we wrap everything in a while loop so the user can decide how many pages to crawl; at the end of each pass we follow the previous-page link and fetch the next-older index page.

import requests, bs4

page = int(input("Enter the number of pages to scrape: "))
url_1 = "https://www.ptt.cc"
url_2 = "/bbs/Gossiping/index.html"

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36'}  # hoisted out of the loop; it never changes
cookies = {'over18': '1'}

counter = 0
number = 0

while counter < page:

    htmlfile = requests.get(url_1 + url_2, headers=headers, cookies=cookies)
    objsoup = bs4.BeautifulSoup(htmlfile.text, 'lxml')

    articles = objsoup.find_all('div', class_='r-ent')

    for article in articles:

        title = article.find('a')
        author = article.find('div', class_='author')
        date = article.find('div', class_='date')

        if title is None:  # skip deleted posts
            continue
        else:
            number += 1
            print("Article number:", number)
            print("Title:", title.text)
            print("Author:", author.text)
            print("Date:", date.text)
            print("=" * 100)

    before = objsoup.find_all('a', class_='btn wide')
    url_2 = before[1].get('href')  # follow '‹ 上頁' to the next-older page
    counter += 1
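One more refinement worth considering: the loop above fires its requests back-to-back, which is exactly the kind of load the reminder below warns about. A sketch of a politer fetch, where the polite_get helper and the one-second delay are my own illustration rather than part of the video:

import time
import requests

def polite_get(url, headers, cookies, delay=1.0):
    # GET a page, stop early on a bad status code, then pause briefly
    response = requests.get(url, headers=headers, cookies=cookies)
    response.raise_for_status()  # raises on 4xx/5xx instead of silently continuing
    time.sleep(delay)            # wait between pages to keep the load on PTT light
    return response

Inside the while loop you would then call polite_get(url_1 + url_2, headers, cookies) in place of requests.get.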

This video and its code are provided for research purposes only. Please don't scrape aggressively and put a heavy load on the target site!
If anything in the video is unclear or mistaken, feel free to leave a comment and let me know. Thank you for your feedback!


Previous post
Day 24 Selenium Module (3)
Next post
Day 26 Scraping Popular Dcard Posts
Series
A Humanities Student's Python Web Scraping Journey (30 parts)