Web scraping: selector syntax for excluding tags gives unexpected results

web-crawler beautifulsoup requests html-tags-and-attributes

聰明貓 2021-05-31 17:16:09 ‧ 1183 views

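I'm scraping the full text of PanSci news articles, using BeautifulSoup and requests.

I ran into a problem when excluding tags: when I exclude the first or the last tag among the target tags, other tags that were previously selected correctly are affected as well.

For example, the article body is in the <p> tags under <div class="post-content-container">.

Heading text is in the <strong> inside each <h2>.

Text preceded by a special symbol (bullet) is in the <li> tags under <ul>.

The last <strong> heading, "延伸閱讀" (Further reading), is not part of the article, so it has to be excluded.

I used :not(:last-of-type) or :not(:nth-last-child(1)) to exclude the last tag: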
section#firstSingle div.post-content-container h2 strong:not(:last-of-type)

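Not only does the content of the last tag disappear, the headings in the middle that were previously captured disappear too.
Before:

After: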

Likewise, I used :not(:first-of-type) or :not(:first-child) to exclude the first tag:
section#firstSingle div.post-content-container ul li:not(:first-of-type)

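This is meant to exclude the first <li>, the author bio "蔣維倫 / 泛科學 PanSci 專欄作家...", but the content of one tag in the middle goes missing as well.
Before:

After:

Does anyone know why this happens, and how to fix it?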

Code:

import json

import requests
from bs4 import BeautifulSoup


class crawlerClass:
    def __init__(self):
        print("init")

    def PansciCrawler(self, url):
        # verify=False skips TLS certificate verification (kept from the original post)
        response = requests.get(url, verify=False)
        soup = BeautifulSoup(response.text, "html.parser")
        section = ""
        # p: article body; h2 strong: heading text; ul li: text preceded by a special symbol
        for tag in soup.select('section#firstSingle div.post-content-container p:not(:nth-last-child(4)), section#firstSingle div.post-content-container h2 strong, section#firstSingle div.post-content-container ul li'):
            # Keep only leaf tags, i.e. tags that contain no child elements
            children = tag.findChild()
            if children is None:
                if tag.get_text() != "":
                    section += tag.get_text()
                    section += "\n\n"
        article = {'status': 0, 'content': section}
        return json.dumps(article)


if __name__ == "__main__":
    crawler = crawlerClass()
    # ==== Pansci ====
    url = "https://pansci.asia/archives/320488"
    pansciJsonStr = crawler.PansciCrawler(url)
    # json.loads() no longer accepts an encoding argument (removed in Python 3.9)
    pansciContent = json.loads(pansciJsonStr)
    print("status:"+str(pansciContent['status']))
    print(pansciContent['content'])

1 answer

wrxue

iT邦好手 (level 1) ‧ 2021-05-31 19:25:19

Best answer

section#firstSingle div.post-content-container h2 strong:not(:last-of-type)

should be

section#firstSingle div.post-content-container h2:not(:last-of-type) strong

and

section#firstSingle div.post-content-container ul li:not(:first-of-type)

should be

section#firstSingle div.post-content-container ul:not(:first-of-type) li

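As you can see, both of your pseudo-classes are attached to the wrong element. This is a good opportunity to get a solid understanding of how pseudo-classes work.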
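
To see why the position matters: `:last-of-type` and `:first-of-type` are evaluated among an element's siblings inside its own parent, not across the whole list of matches. The sketch below illustrates this with a made-up HTML fragment. The headings, list items, and the `texts` helper are assumptions for illustration only, not the real PanSci markup. Each `<h2>` here holds a single `<strong>`, so every `<strong>` is the last of its type within its `<h2>`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the structure described in the question.
HTML = """
<div class="post-content-container">
  <ul><li>author bio</li></ul>
  <h2><strong>Heading A</strong></h2>
  <ul><li>point 1</li><li>point 2</li></ul>
  <h2><strong>Heading B</strong></h2>
  <h2><strong>Further reading</strong></h2>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
base = "div.post-content-container"


def texts(selector):
    """Return the text of every tag matched by a CSS selector (illustrative helper)."""
    return [tag.get_text() for tag in soup.select(selector)]


# Pseudo-class on <strong>: each <strong> is the only (and so the last)
# <strong> inside its own <h2>, so every heading is excluded.
wrong_headings = texts(f"{base} h2 strong:not(:last-of-type)")

# Pseudo-class on <h2>: only the last <h2> among its siblings is excluded.
right_headings = texts(f"{base} h2:not(:last-of-type) strong")

# Pseudo-class on <li>: the first <li> of every <ul> is excluded,
# which also drops "point 1" from the second list.
wrong_items = texts(f"{base} ul li:not(:first-of-type)")

# Pseudo-class on <ul>: only the first <ul> (the author bio) is excluded.
right_items = texts(f"{base} ul:not(:first-of-type) li")

print(wrong_headings)  # []
print(right_headings)  # ['Heading A', 'Heading B']
print(wrong_items)     # ['point 2']
print(right_items)     # ['point 1', 'point 2']
```

The real page may be structured differently, so treat this as an illustration of the rule rather than a drop-in selector.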

2 replies

聰明貓 iT邦新手 (level 3) ‧ 2021-06-02 09:19:29

OK, thanks~

聰明貓 iT邦新手 (level 3) ‧ 2021-06-02 15:15:53

I think this is what the following article covers:
https://ithelp.ithome.com.tw/articles/10218197
Posting it here for anyone else who is confused by this.
