Today we'll scrape another well-known forum: Dcard.
Compared with yesterday's PTT, scraping Dcard is a little more involved,
but once you understand the trick, you'll be able to scrape most websites without trouble.
Below is the code used in the video.
import requests, bs4

url = "https://www.dcard.tw/f"
# Send a browser-like User-Agent so the request is not rejected
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36'}
htmlfile = requests.get(url, headers=headers)
objsoup = bs4.BeautifulSoup(htmlfile.text, 'lxml')

# Note: these class names are auto-generated and may change whenever Dcard updates its frontend
articles = objsoup.find_all('article', class_='tgn9uw-0 bReysV')
number = 0
for article in articles:
    title = article.find('a')
    emotion = article.find('div', class_='cgoejl-3 jMiYgp')
    comment = article.find('div', class_='uj732l-2 ghvDya')
    number += 1
    print("Post number:", number)
    print("Post title:", title.text)
    print("Reaction count:", emotion.text)
    print("Comment count:", comment.text)
    print("=" * 100)
# Change C:\\spider\\ to the path of chromedriver.exe on your computer
from selenium import webdriver
import bs4

dirverPath = 'C:\\spider\\chromedriver.exe'
# Selenium 3 style; in Selenium 4+ pass service=Service(dirverPath) instead
browser = webdriver.Chrome(executable_path=dirverPath)
url = 'https://www.dcard.tw/f'
browser.get(url)

# Parse the browser-rendered page, which includes JavaScript-loaded content
objsoup = bs4.BeautifulSoup(browser.page_source, 'lxml')
articles = objsoup.find_all('article', class_='tgn9uw-0 bReysV')
number = 0
for article in articles:
    title = article.find('a')
    emotion = article.find('div', class_='cgoejl-3 jMiYgp')
    comment = article.find('div', class_='uj732l-2 ghvDya')
    number += 1
    print("Post number:", number)
    print("Post title:", title.text)
    print("Reaction count:", emotion.text)
    print("Comment count:", comment.text)
    print("=" * 100)
# Change C:\\spider\\ to the path of chromedriver.exe on your computer
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import bs4, time

dirverPath = 'C:\\spider\\chromedriver.exe'
browser = webdriver.Chrome(executable_path=dirverPath)
url = 'https://www.dcard.tw/f'
browser.get(url)

# Scroll down one page so Dcard loads more posts
# (Selenium 3 style; Selenium 4+ uses browser.find_element(By.TAG_NAME, 'body'))
move = browser.find_element_by_tag_name('body')
time.sleep(3)
move.send_keys(Keys.PAGE_DOWN)
time.sleep(3)

objsoup = bs4.BeautifulSoup(browser.page_source, 'lxml')
articles = objsoup.find_all('article', class_='tgn9uw-0 bReysV')
number = 0
for article in articles:
    title = article.find('a')
    emotion = article.find('div', class_='cgoejl-3 jMiYgp')
    comment = article.find('div', class_='uj732l-2 ghvDya')
    number += 1
    print("Post number:", number)
    print("Post title:", title.text)
    print("Reaction count:", emotion.text)
    print("Comment count:", comment.text)
    print("=" * 100)
# Change C:\\spider\\ to the path of chromedriver.exe on your computer
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import bs4, time

page = int(input("Enter the number of times to scroll down: "))
dirverPath = 'C:\\spider\\chromedriver.exe'
browser = webdriver.Chrome(executable_path=dirverPath)
url = 'https://www.dcard.tw/f'
browser.get(url)

number = 0
counter = 0
post_title = []  # titles seen so far, used to skip duplicates across scrolls
while page > counter:
    move = browser.find_element_by_tag_name('body')
    time.sleep(1)
    move.send_keys(Keys.PAGE_DOWN)
    time.sleep(1)
    objsoup = bs4.BeautifulSoup(browser.page_source, 'lxml')
    articles = objsoup.find_all('article', class_='tgn9uw-0 bReysV')
    for article in articles:
        title = article.find('a')
        emotion = article.find('div', class_='cgoejl-3 jMiYgp')
        comment = article.find('div', class_='uj732l-2 ghvDya')
        if title.text not in post_title:  # only handle posts we haven't printed yet
            number += 1
            post_title.append(title.text)
            print("Post number:", number)
            print("Post title:", title.text)
            print("Reaction count:", emotion.text)
            print("Comment count:", comment.text)
            print("=" * 100)
    counter += 1
print(post_title)
This video and its code are provided for research purposes only. Please don't scrape aggressively and put a heavy load on the target site!
If anything in the video is unclear or incorrect, feel free to leave a comment and let me know. Thanks for your feedback.
What about comment replies that need a mouse click to open?
How should those be handled?
For example, the part circled in red is collapsed;
you have to click it open with the mouse before it becomes the part circled in blue, which is the content we want to scrape.
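One way to approach this (a sketch, not from the video): before re-parsing `browser.page_source`, find every collapsed "view replies" element and `.click()` it so the hidden comments load. The selector `div.reply-toggle` below is hypothetical — you would have to inspect Dcard's actual markup — and in real Selenium 4 code `driver.find_elements` would be called with `By.CSS_SELECTOR` from `selenium.webdriver.common.by`. Stub objects stand in for the driver here so the flow is runnable on its own:

```python
import time

def expand_collapsed_replies(driver, by, selector, pause=0.5):
    # Click every collapsed "view replies" element so the hidden comments
    # render; afterwards the page can be re-parsed with BeautifulSoup.
    elements = driver.find_elements(by, selector)
    for el in elements:
        el.click()         # expands one reply thread
        time.sleep(pause)  # give the new content time to appear
    return len(elements)

# Stubs standing in for a real Selenium driver, just to demonstrate the flow
class StubElement:
    def __init__(self):
        self.expanded = False
    def click(self):
        self.expanded = True

class StubDriver:
    def __init__(self, elements):
        self.elements = elements
    def find_elements(self, by, selector):
        return self.elements

els = [StubElement(), StubElement()]
n = expand_collapsed_replies(StubDriver(els), "css selector", "div.reply-toggle", pause=0)
print(n)  # 2 elements clicked; both are now expanded
```

With a real driver you would call this right after scrolling, then rebuild the soup from `browser.page_source`.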
What if I want to keep scrolling until there are no new posts,
rather than a hard-coded number of scrolls?
How could the Python code be changed to do that?
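One possible shape for that change (a sketch, with hypothetical names): replace the `while page > counter` loop with a loop that stops once a scroll surfaces no unseen titles. Abstracting the browser work behind two callables, `fetch_titles` (parse the current `page_source` and return the visible titles) and `scroll` (send one `PAGE_DOWN` and wait), the stopping logic can be tested without a browser:

```python
def collect_until_no_new(fetch_titles, scroll, max_rounds=100):
    # fetch_titles(): returns the post titles currently visible on the page
    # scroll(): performs one PAGE_DOWN and waits for new content to load
    seen = []
    for _ in range(max_rounds):  # hard cap as a safety net against endless feeds
        new = [t for t in fetch_titles() if t not in seen]
        if not new:              # nothing unseen appeared -> we reached the end
            break
        seen.extend(new)
        scroll()
    return seen

# Simulated page: the second scroll reveals one more post, then the feed stops growing
batches = [["a", "b"], ["a", "b", "c"], ["a", "b", "c"]]
state = {"i": 0}
fetch = lambda: batches[min(state["i"], len(batches) - 1)]
scroll = lambda: state.update(i=state["i"] + 1)
print(collect_until_no_new(fetch, scroll))  # ['a', 'b', 'c']
```

In the real crawler, `fetch_titles` would wrap the `BeautifulSoup` parsing from the last code block and `scroll` would wrap `move.send_keys(Keys.PAGE_DOWN)` plus `time.sleep`.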
If I want to run this on Google Colab,
I don't have a chromedriver.exe there.
How can I change this line:
dirverPath = 'C:\spider\chromedriver.exe'
?
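A common workaround (an assumption on my part — Colab's preinstalled packages change over time, so verify in your own notebook): install a chromium driver with apt inside the notebook and start Chrome headless, so no Windows-style path is needed at all:

```python
# In a Colab cell, first install the browser and driver (shell commands):
#   !pip install selenium
#   !apt-get update
#   !apt-get install -y chromium-chromedriver
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')             # Colab has no display
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# The apt-installed chromedriver ends up on the PATH, so the dirverPath /
# executable_path line can simply be dropped:
browser = webdriver.Chrome(options=options)
```

The rest of the scraping code (the `browser.get(url)` and BeautifulSoup parts) should work unchanged.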