Today we'll scrape another well-known forum: Dcard.
Compared with yesterday's PTT, scraping Dcard is a little more involved,
but once you understand the trick, you'll be able to scrape most websites without trouble.
Below is the code used in the video.
import requests, bs4

url = "https://www.dcard.tw/f"
# Send a browser-like User-Agent so the request is not rejected
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36'}
htmlfile = requests.get(url, headers=headers)
objsoup = bs4.BeautifulSoup(htmlfile.text, 'lxml')

# Note: these class names are auto-generated and may change whenever Dcard updates its frontend
articles = objsoup.find_all('article', class_='tgn9uw-0 bReysV')
number = 0
for article in articles:
    title = article.find('a')
    emotion = article.find('div', class_='cgoejl-3 jMiYgp')
    comment = article.find('div', class_='uj732l-2 ghvDya')
    number += 1
    print("Post number:", number)
    print("Post title:", title.text)
    print("Reaction count:", emotion.text)
    print("Comment count:", comment.text)
    print("=" * 100)
# Change C:\\spider\\ to the path of chromedriver.exe on your computer
from selenium import webdriver
import bs4

dirverPath = 'C:\\spider\\chromedriver.exe'
# Selenium 3 style; in Selenium 4+ pass service=Service(dirverPath) instead
browser = webdriver.Chrome(executable_path=dirverPath)
url = 'https://www.dcard.tw/f'
browser.get(url)

# Parse the browser-rendered page, which includes JavaScript-loaded content
objsoup = bs4.BeautifulSoup(browser.page_source, 'lxml')
articles = objsoup.find_all('article', class_='tgn9uw-0 bReysV')
number = 0
for article in articles:
    title = article.find('a')
    emotion = article.find('div', class_='cgoejl-3 jMiYgp')
    comment = article.find('div', class_='uj732l-2 ghvDya')
    number += 1
    print("Post number:", number)
    print("Post title:", title.text)
    print("Reaction count:", emotion.text)
    print("Comment count:", comment.text)
    print("=" * 100)
# Change C:\\spider\\ to the path of chromedriver.exe on your computer
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import bs4, time

dirverPath = 'C:\\spider\\chromedriver.exe'
browser = webdriver.Chrome(executable_path=dirverPath)
url = 'https://www.dcard.tw/f'
browser.get(url)

# Scroll down one page so Dcard loads more posts
# (Selenium 3 style; Selenium 4+ uses browser.find_element(By.TAG_NAME, 'body'))
move = browser.find_element_by_tag_name('body')
time.sleep(3)
move.send_keys(Keys.PAGE_DOWN)
time.sleep(3)

objsoup = bs4.BeautifulSoup(browser.page_source, 'lxml')
articles = objsoup.find_all('article', class_='tgn9uw-0 bReysV')
number = 0
for article in articles:
    title = article.find('a')
    emotion = article.find('div', class_='cgoejl-3 jMiYgp')
    comment = article.find('div', class_='uj732l-2 ghvDya')
    number += 1
    print("Post number:", number)
    print("Post title:", title.text)
    print("Reaction count:", emotion.text)
    print("Comment count:", comment.text)
    print("=" * 100)
# Change C:\\spider\\ to the path of chromedriver.exe on your computer
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import bs4, time

page = int(input("Enter the number of times to scroll down: "))
dirverPath = 'C:\\spider\\chromedriver.exe'
browser = webdriver.Chrome(executable_path=dirverPath)
url = 'https://www.dcard.tw/f'
browser.get(url)

number = 0
counter = 0
post_title = []  # titles seen so far, used to skip duplicates across scrolls
while page > counter:
    move = browser.find_element_by_tag_name('body')
    time.sleep(1)
    move.send_keys(Keys.PAGE_DOWN)
    time.sleep(1)
    objsoup = bs4.BeautifulSoup(browser.page_source, 'lxml')
    articles = objsoup.find_all('article', class_='tgn9uw-0 bReysV')
    for article in articles:
        title = article.find('a')
        emotion = article.find('div', class_='cgoejl-3 jMiYgp')
        comment = article.find('div', class_='uj732l-2 ghvDya')
        if title.text not in post_title:  # only handle posts we haven't printed yet
            number += 1
            post_title.append(title.text)
            print("Post number:", number)
            print("Post title:", title.text)
            print("Reaction count:", emotion.text)
            print("Comment count:", comment.text)
            print("=" * 100)
    counter += 1
print(post_title)
This video and its code are provided for research purposes only. Please don't scrape aggressively and put a heavy load on the target site!
If anything in the video is unclear or incorrect, feel free to leave a comment and let me know. Thanks for your feedback.
What about comment replies that need a mouse click to open?
How should those be handled?
For example, the part circled in red is collapsed;
you have to click it open with the mouse before it becomes the part circled in blue, which is the content we want to scrape.
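One way to approach this (a sketch, not from the video): before re-parsing `browser.page_source`, find every collapsed "view replies" element and `.click()` it so the hidden comments load. The selector `div.reply-toggle` below is hypothetical — you would have to inspect Dcard's actual markup — and in real Selenium 4 code `driver.find_elements` would be called with `By.CSS_SELECTOR` from `selenium.webdriver.common.by`. Stub objects stand in for the driver here so the flow is runnable on its own:

```python
import time

def expand_collapsed_replies(driver, by, selector, pause=0.5):
    # Click every collapsed "view replies" element so the hidden comments
    # render; afterwards the page can be re-parsed with BeautifulSoup.
    elements = driver.find_elements(by, selector)
    for el in elements:
        el.click()         # expands one reply thread
        time.sleep(pause)  # give the new content time to appear
    return len(elements)

# Stubs standing in for a real Selenium driver, just to demonstrate the flow
class StubElement:
    def __init__(self):
        self.expanded = False
    def click(self):
        self.expanded = True

class StubDriver:
    def __init__(self, elements):
        self.elements = elements
    def find_elements(self, by, selector):
        return self.elements

els = [StubElement(), StubElement()]
n = expand_collapsed_replies(StubDriver(els), "css selector", "div.reply-toggle", pause=0)
print(n)  # 2 elements clicked; both are now expanded
```

With a real driver you would call this right after scrolling, then rebuild the soup from `browser.page_source`.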
What if I want to keep scrolling until there are no new posts,
rather than a hard-coded number of scrolls?
How could the Python code be changed to do that?
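One possible shape for that change (a sketch, with hypothetical names): replace the `while page > counter` loop with a loop that stops once a scroll surfaces no unseen titles. Abstracting the browser work behind two callables, `fetch_titles` (parse the current `page_source` and return the visible titles) and `scroll` (send one `PAGE_DOWN` and wait), the stopping logic can be tested without a browser:

```python
def collect_until_no_new(fetch_titles, scroll, max_rounds=100):
    # fetch_titles(): returns the post titles currently visible on the page
    # scroll(): performs one PAGE_DOWN and waits for new content to load
    seen = []
    for _ in range(max_rounds):  # hard cap as a safety net against endless feeds
        new = [t for t in fetch_titles() if t not in seen]
        if not new:              # nothing unseen appeared -> we reached the end
            break
        seen.extend(new)
        scroll()
    return seen

# Simulated page: the second scroll reveals one more post, then the feed stops growing
batches = [["a", "b"], ["a", "b", "c"], ["a", "b", "c"]]
state = {"i": 0}
fetch = lambda: batches[min(state["i"], len(batches) - 1)]
scroll = lambda: state.update(i=state["i"] + 1)
print(collect_until_no_new(fetch, scroll))  # ['a', 'b', 'c']
```

In the real crawler, `fetch_titles` would wrap the `BeautifulSoup` parsing from the last code block and `scroll` would wrap `move.send_keys(Keys.PAGE_DOWN)` plus `time.sleep`.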
If I want to run this on Google Colab,
I don't have a chromedriver.exe there.
How can I change this line:
dirverPath = 'C:\spider\chromedriver.exe'
?
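A common workaround (an assumption on my part — Colab's preinstalled packages change over time, so verify in your own notebook): install a chromium driver with apt inside the notebook and start Chrome headless, so no Windows-style path is needed at all:

```python
# In a Colab cell, first install the browser and driver (shell commands):
#   !pip install selenium
#   !apt-get update
#   !apt-get install -y chromium-chromedriver
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')             # Colab has no display
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# The apt-installed chromedriver ends up on the PATH, so the dirverPath /
# executable_path line can simply be dropped:
browser = webdriver.Chrome(options=options)
```

The rest of the scraping code (the `browser.get(url)` and BeautifulSoup parts) should work unchanged.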