爬蟲: 如何去除含有特定文字的標籤

網路爬蟲 beautifulsoup requests html標籤、屬性文字處理

聰明貓 2021-05-27 20:35:08 ‧ 2036 瀏覽

分享至

目前在抓取Yahoo新聞裡面完整的內容

是使用BeautifulSoup和requests來抓取

在抓取時遇到一個問題，就是目標tag中，有些tag是圖片底下的註解

舉例來說，新聞內容都放在<div> class="caas-body"底下的<p>裡面

其中有些<p>的內容像是 "▲▼籃籃是「樂天女孩」隊長。（圖／籃籃臉書）"

因為不是新聞內容所以不需要

想請問有沒有方法可以去除含特定文字(例如:▲或者是指定的文字)的tag？

PS: 因為是將每一個tag的內容變成string，再加起來變成字典
如果可以針對每個抓下來的string來判斷的方法也OK

程式碼:

import json
from bs4 import BeautifulSoup
import requests


class crawlerClass:
    def __init__(self):
        print("init")

    def YahooCrawler(self, url):
        response = requests.get(url, verify=False)
        soup = BeautifulSoup(response.text, "html.parser")

        section = ""
        for tag in soup.select('div.caas-body p'):
            # 看底下有沒有其他tag
            children = tag.findChild()
            # 沒有的話才是想要的
            if children == None:
                section += tag.get_text()
                section += "\n\n"
        article = {'status': 0, 'content': section}
        return json.dumps(article)


if __name__ == "__main__":
    crawler = crawlerClass()
    # ==== Yahoo ====
    url = "https://tw.news.yahoo.com/%E5%95%A6%E5%95%A6%E9%9A%8A%E5%A5%B3%E5%AD%A9-%E8%A4%B2%E5%BA%95-%E5%8C%85-%E7%B6%B2%E5%B4%A9%E6%BD%B0-104902369.html"
    yahooJsonStr = crawler.YahooCrawler(url)
    yahooContent = json.loads(yahooJsonStr, encoding="utf-8")
    print("status:"+str(yahooContent['status']))
    print(yahooContent['content'])

登入發表討論

直播研討會

1 個回答

wrxue

iT邦好手 1 級 ‧ 2021-05-27 21:03:44

最佳解答

excepts = ['▲', '▼', '或其他想排除的文字']
for tag in soup.select('div.caas-body p'):
    # 看底下有沒有其他tag
    children = tag.findChild()
    # 沒有的話才是想要的
    if children == None:
        # 查看 tag 中是否有想排除的文字
        if sum([x in tag.get_text() for x in excepts]) != 0:
            continue
        section += tag.get_text()
        section += "\n\n"