Python web scraper: unable to get the correct news body text

Web scraping

chihshanliao 2019-04-16 14:08:06 ‧ 6364 views

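I want to scrape the news content from an Apple Daily headline page.
The URL is https://tw.news.appledaily.com/headline/daily/20190329/38294459/
The title and the publication date come out fine; only the third block of the body (content3) returns the wrong text.
Could someone advise how I should change the selector for content3? Thanks!
The content3 text I actually want is the one shown in the screenshot.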

The code is as follows:

import requests 
from bs4 import BeautifulSoup 
from selenium import webdriver
import re

# Fetch an Apple Daily front-page headline article, e.g. https://tw.news.appledaily.com/headline/daily/20190325/38290504/
target_url = "https://tw.news.appledaily.com/headline/daily/20190329/38294459/" 

#driver = webdriver.Chrome('./chromedriver')
response = requests.get(target_url)
soup = BeautifulSoup(response.text, "lxml")

# Extract the headline
headline = soup.select('#article > div.wrapper > div > main > article > hgroup > h1')
print(headline[0].string)

# Extract the publication date
publish_date = soup.select('#article > div.wrapper > div > main > article > hgroup > div')
print(publish_date[0].string)

# Extract the article body
tag_p = soup.select("p")

content1 = soup.select('#article > div.wrapper > div > main > article > div > div.ndArticle_contentBox > article > div > p:nth-of-type(1)')
print("content1 =", content1[0].text)
content2 = soup.select('#article > div.wrapper > div > main > article > div > div.ndArticle_contentBox > article > div > h2:nth-of-type(1)')
print("content2 =", content2[0].text)
content3 = soup.select('#article > div.wrapper > div > main > article > div > div.ndArticle_contentBox > article > div > p:nth-of-type(2)')
print("content3 =", content3[0].text)
content4 = soup.select('#article > div.wrapper > div > main > article > div > div.ndArticle_contentBox > article > div > h2:nth-of-type(2)')
print("content4 =", content4[0].text)
content5 = soup.select('#article > div.wrapper > div > main > article > div > div.ndArticle_contentBox > article > div > p:nth-of-type(3)')
print("content5 =", content5[0].text)
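
A possible explanation, not verified against the live page: `p:nth-of-type(2)` counts every `<p>` sibling inside the container, including empty ones or ones that hold something other than visible body text, so it does not necessarily match the second paragraph a reader sees. The sketch below uses a made-up HTML fragment (an assumption, not the real Apple Daily markup) to show the effect, and a hypothetical helper `extract_blocks` that walks the container's direct `<p>` and `<h2>` children in page order and skips empty ones.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, NOT the real Apple Daily page: it only imitates a
# content box in which an empty <p> sits between the visible paragraphs.
SAMPLE_HTML = """
<div class="ndArticle_contentBox"><article><div>
<p>First paragraph.</p>
<h2>First subheading</h2>
<p></p>
<p>Second paragraph.</p>
<h2>Second subheading</h2>
<p>Third paragraph.</p>
</div></article></div>
"""


def extract_blocks(html):
    """Return (tag name, text) pairs for non-empty <p>/<h2> children, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.select_one("div.ndArticle_contentBox > article > div")
    if container is None:
        return []
    blocks = []
    # recursive=False keeps only direct children, so nested tags are not counted.
    for child in container.find_all(["p", "h2"], recursive=False):
        text = child.get_text(strip=True)
        if text:
            blocks.append((child.name, text))
    return blocks


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# nth-of-type counts every <p> sibling, so index 2 is the empty <p> here.
second_p = soup.select_one(
    "div.ndArticle_contentBox > article > div > p:nth-of-type(2)"
)
print("p:nth-of-type(2) =", repr(second_p.get_text()))

# Walking the children in order yields the blocks as a reader sees them.
for name, text in extract_blocks(SAMPLE_HTML):
    print(name, "=", text)
```

If the real page behaves like this fragment, printing every `<p>` under the container with its index should reveal which position actually holds the wanted text; if the text is absent from `response.text` altogether, it may be inserted by JavaScript, and `requests` alone would not see it.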