[Day 27] Processing Scraped Results in Scrapy - Item Pipelines

11th iThome Ironman Contest

DAY 28

AI & Data

Crawler in Hand, Data at My Command - 30 Days of Hands-On Scrapy Crawling, part 28 of the series



Rex Chien

2019-10-12 21:25:55


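Once a scrapy.Spider has scraped some data, the data is sent to the Item Pipeline, where it passes through a series of processing steps. Typical uses are:

Cleaning HTML data
Validating data
Checking for duplicates
Storing items in a database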
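
As a rough illustration of the first two uses, the sketch below strips HTML tags from a text field and rejects items that end up empty. The class name `CleanHtmlPipeline` and the `content` field are assumptions made for this example only, and the `try`/`except` fallback exists just so the snippet can run without Scrapy installed.

```python
import html
import re

try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass

# Naive tag matcher; good enough for a sketch, not a full HTML parser.
TAG_RE = re.compile(r'<[^>]+>')


class CleanHtmlPipeline:
    """Hypothetical pipeline: clean the assumed 'content' field, then validate it."""

    def process_item(self, item, spider):
        raw = item.get('content') or ''
        # Replace tags with spaces, decode entities such as &amp;,
        # then collapse runs of whitespace into single spaces.
        text = html.unescape(TAG_RE.sub(' ', raw))
        text = ' '.join(text.split())
        if not text:
            raise DropItem('content is empty after cleaning')
        item['content'] = text
        return item
```

A regular expression is used here only to keep the example dependency-free; a real project would more likely rely on a proper HTML parser.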

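Each component in the pipeline is a plain Python class. It does not need to inherit from any base class, but it must implement this method:

process_item(self, item, spider): the method that actually processes each scraped item. It should return the processed dict, return an Item object, return a Twisted Deferred, or raise a DropItem exception.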
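
To make the two most common outcomes concrete (return the item, or raise DropItem), here is a minimal sketch of a duplicate filter. The class name `DuplicatesPipeline` and the `url` field are assumptions for illustration; the fallback `DropItem` class is only there so the snippet runs without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass


class DuplicatesPipeline:
    """Hypothetical pipeline that drops items whose 'url' was already seen."""

    def __init__(self):
        # URLs seen so far during this crawl (kept in memory only).
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item['url']  # 'url' is an assumed field name
        if url in self.seen_urls:
            # Raising DropItem stops later pipeline components from seeing the item.
            raise DropItem(f'duplicate item: {url}')
        self.seen_urls.add(url)
        # Returning the item hands it to the next component.
        return item
```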

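Depending on your needs, you can also implement these optional methods:

open_spider(self, spider): called when the spider is opened.
close_spider(self, spider): called when the spider is closed.
from_crawler(cls, crawler): a classmethod used to create the pipeline component from a Crawler.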
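
The sketch below shows how these hooks might fit together in a component that writes each item as one JSON line. The class name `JsonLinesPipeline`, the `JSONL_PATH` setting, and the default file name `items.jl` are all made up for this example; only the method names and signatures come from the list above.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each item as one JSON line."""

    def __init__(self, path):
        self.path = path
        self.file = None

    @classmethod
    def from_crawler(cls, crawler):
        # JSONL_PATH is an invented setting name used only in this sketch.
        return cls(path=crawler.settings.get('JSONL_PATH', 'items.jl'))

    def open_spider(self, spider):
        # Acquire resources once, when the spider is opened.
        self.file = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Release resources when the spider is closed.
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + '\n')
        return item
```

Opening the file in `open_spider` rather than in `process_item` keeps the per-item work small, and reading the path in `from_crawler` lets the component be configured from the project settings.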

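Creating a Pipeline Component

As with Items, once the project has been created, the project directory contains a pipelines.py file. It holds an IthomeCrawlersPipeline class that Scrapy generated from the project name.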

class IthomeCrawlersPipeline(object):
    def process_item(self, item, spider):
        return item

Suppose we do not want to keep articles with fewer than 20 views. We could do this: (it is only an example, nothing more is meant by it)

from scrapy.exceptions import DropItem

class IthomeCrawlersPipeline(object):
    def process_item(self, item, spider):
        # Discard articles with fewer than 20 views.
        if item['view_count'] < 20:
            raise DropItem(f'[{item["title"]}] view count is below 20')
        return item
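
Because a pipeline component is a plain class, one quick way to check the filter above is to call `process_item` directly, outside of a crawl. The sketch below repeats the filter (with an English message) so that it is self-contained. The helper `run_pipeline` and the sample items are made up for illustration, and passing `spider=None` only works because this method never touches the spider.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass


class IthomeCrawlersPipeline:
    def process_item(self, item, spider):
        if item['view_count'] < 20:
            raise DropItem(f'[{item["title"]}] view count is below 20')
        return item


def run_pipeline(pipeline, items):
    """Hypothetical helper: feed items through one pipeline by hand."""
    kept, dropped = [], []
    for item in items:
        try:
            kept.append(pipeline.process_item(item, spider=None))
        except DropItem as exc:
            dropped.append(str(exc))
    return kept, dropped


# Made-up sample items, not real crawl results.
samples = [
    {'title': 'Post A', 'view_count': 5},
    {'title': 'Post B', 'view_count': 20},
    {'title': 'Post C', 'view_count': 3628},
]
kept, dropped = run_pipeline(IthomeCrawlersPipeline(), samples)
print([item['title'] for item in kept])
print(dropped)
```

Note that an item with exactly 20 views is kept, since the condition is a strict "less than".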

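Setting the Pipeline Execution Order

After creating a pipeline component, you still need to set the order in which the components run. The settings.py file in the project directory has a dict setting named ITEM_PIPELINES. Each key is the full import path of a component's class, and each value is an integer, by convention in the 0-1000 range. Components with smaller numbers run first.

With the component we just created added, it looks like this: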

ITEM_PIPELINES = {
    'ithome_crawlers.pipelines.IthomeCrawlersPipeline': 300,
}
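
To illustrate the ordering rule with more than one component, the sketch below adds two hypothetical entries (`CleanHtmlPipeline` and `DuplicatesPipeline` are assumed names, not classes from this project) and sorts the dict by value to mimic the order in which the components would run.

```python
# Hypothetical settings.py fragment; only IthomeCrawlersPipeline
# is a class that actually exists in this article's project.
ITEM_PIPELINES = {
    'ithome_crawlers.pipelines.IthomeCrawlersPipeline': 300,
    'ithome_crawlers.pipelines.CleanHtmlPipeline': 100,
    'ithome_crawlers.pipelines.DuplicatesPipeline': 200,
}

# Smaller values run first, so sorting the keys by value
# reproduces the execution order regardless of dict order.
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in execution_order:
    print(ITEM_PIPELINES[path], path)
```

Leaving gaps between the numbers (100, 200, 300) makes it easy to slot another component in later without renumbering the rest.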

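Running the Spider

Finally, run the spider with the scrapy crawl ithome -o ithome.csv command. The startup log shows that the component has been added to the pipeline.

While the spider runs, you may see log entries for dropped items, which means some articles were filtered out.

Afterwards, check that the output file ithome.csv contains no articles with fewer than 20 views.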
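
That last check can also be scripted. The sketch below assumes the exported CSV has a `view_count` column (matching the item field used in the pipeline); the function name `find_low_view_rows` is made up for this example.

```python
import csv


def find_low_view_rows(csv_path, threshold=20):
    """Return the rows whose view_count is below the threshold.

    Rows with a missing or empty view_count are skipped, since an
    export may also contain items that do not carry that field.
    """
    with open(csv_path, newline='', encoding='utf-8') as f:
        return [
            row for row in csv.DictReader(f)
            if row.get('view_count') and int(row['view_count']) < threshold
        ]
```

Calling `find_low_view_rows('ithome.csv')` after the crawl should return an empty list if the pipeline did its job.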

References

Item Pipeline — Scrapy 1.7.3 documentation
