我想從蘋果日報的要聞網頁中抓取新聞內容,
網址為 https://tw.news.appledaily.com/headline/daily/20190329/38294459/
標題、新聞刊登日期都沒問題,只有內文中的第3段 (content3)抓取的內文是錯誤的,
請教高手,我應該如何修改 content3的語法,謝謝!
我真正要抓的content3內容,如圖所示!
程式碼如下:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import re
# 讀取蘋果日報每日頭版頭條之網頁,例如 : https://tw.news.appledaily.com/headline/daily/20190325/38290504/
target_url = "https://tw.news.appledaily.com/headline/daily/20190329/38294459/"
#driver = webdriver.Chrome('./chromedriver')
response = requests.get(target_url)
soup = BeautifulSoup(response.text, "lxml")
# 抓取標題
headline = soup.select('#article > div.wrapper > div > main > article > hgroup > h1')
print(headline[0].string)
# 抓取出版日期
publish_date = soup.select('#article > div.wrapper > div > main > article > hgroup > div')
print(publish_date[0].string)
# 抓取新聞內文
tag_p = soup.select("p ")
content1 = soup.select('#article > div.wrapper > div > main > article > div > div.ndArticle_contentBox > article > div > p:nth-of-type(1)')
print("content1 =", content1[0].text)
content2 = soup.select('#article > div.wrapper > div > main > article > div > div.ndArticle_contentBox > article > div > h2:nth-of-type(1)')
print("content2 =", content2[0].text)
content3 = soup.select('#article > div.wrapper > div > main > article > div > div.ndArticle_contentBox > article > div > p:nth-of-type(2)')
print("content3 =", content3[0].text)
content4 = soup.select('#article > div.wrapper > div > main > article > div > div.ndArticle_contentBox > article > div > h2:nth-of-type(2)')
print("content4 =", content4[0].text)
content5 = soup.select('#article > div.wrapper > div > main > article > div > div.ndArticle_contentBox > article > div > p:nth-of-type(3)')
print("content5 =", content5[0].text)
一般會建議你去連他們的rss會比較快。
雖然....這間也是我的拒絕往來戶。
所以不太清楚你發生的問題。