[Day 15] Working with the Scrapy Item Pipeline

2019 iT 邦幫忙 Ironman Contest

AI & Data

Part 15 of the series "Scrapy爬蟲與資料處理30天筆記" (30 Days of Notes on Scrapy Crawling and Data Processing)

plusone

Team: NUTC_imac

2018-10-30 14:02:49

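Hi! The previous post covered how to define Fields and wrap scraped data in Items. Today we look at how to process that data once it has been scraped, which is the job of the Item Pipeline component. Each pipeline handles one specific task, and the pipelines run in an order you define.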

What is an Item Pipeline?

After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially.

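In other words, every item a spider scrapes is handed to the Item Pipeline, where a chain of components processes it one after another, like a data-processing assembly line.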

Typical uses of item pipelines are:

cleansing HTML data

validating scraped data (checking that the items contain certain fields)

checking for duplicates (and dropping them)

storing the scraped item in a database

In short, item pipelines are typically used for:

cleaning data
validating data
filtering out duplicates
storing data in a database

Now let's implement an Item Pipeline. When the project was created, Scrapy generated a file named pipelines.py with the following contents:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

class MyfirstscrapyprojectPipeline(object):
    def process_item(self, item, spider):
        return item

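A pipeline class does not need to inherit from any particular base class; it only has to implement certain methods. The key one is process_item(self, item, spider), which Scrapy calls for every item the spider scrapes. This method is the core of an Item Pipeline, and it returns the item so that the item continues on to the next pipeline.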

Let's see how to use it.

Convert the push count from a string to an integer:

class MyfirstscrapyprojectPipeline(object):
    def process_item(self, item, spider):
        item['push'] = int(item['push'])
        return item
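As a side note, the int() call above raises an exception if a push value is missing or not numeric. The following is only a minimal sketch of a more tolerant variant, not part of the original project: the class name PushToIntPipeline and the choice to fall back to 0 are assumptions made for illustration.

```python
class PushToIntPipeline:
    """Hypothetical variant of the pipeline above that tolerates bad input."""

    def process_item(self, item, spider):
        raw = item.get('push')  # None if the field was never set
        try:
            item['push'] = int(raw)
        except (TypeError, ValueError):
            # Assumption for this sketch: a missing or non-numeric
            # push count is stored as 0 instead of stopping the crawl.
            item['push'] = 0
        return item
```

Whether 0 is the right fallback depends on the data; dropping the item or keeping the raw string are equally reasonable choices.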

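Run the spider now and the push count is still a string. Why? Because the pipeline has not been enabled yet!

To enable the pipeline, open settings.py, find the ITEM_PIPELINES section (shown below), and uncomment it: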

ITEM_PIPELINES = {
 'myFirstScrapyProject.pipelines.MyfirstscrapyprojectPipeline': 300,
}

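Run the spider again and the push count in the terminal output is now an integer!
Because pipelines are switched on in the settings, we can enable only the ones we need.

The number 300 after the class path sets the execution order: pipelines with lower values run first. By convention these values fall in the 0 to 1000 range.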
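To make the ordering concrete, here is a small sketch of an ITEM_PIPELINES setting with two entries. DuplicatesPipeline is a hypothetical second pipeline that does not exist in this project, and the sorted() call only mimics the order in which Scrapy would run the entries; it is not how Scrapy itself is implemented.

```python
# settings.py style dictionary: class path -> order value
ITEM_PIPELINES = {
    # Hypothetical second pipeline, listed first on purpose.
    'myFirstScrapyProject.pipelines.DuplicatesPipeline': 400,
    'myFirstScrapyProject.pipelines.MyfirstscrapyprojectPipeline': 300,
}

# Lower values run first, regardless of the order in the dictionary.
run_order = [
    path.rsplit('.', 1)[-1]
    for path, order in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
print(run_order)
```

Leaving gaps between the values (300, 400, ...) makes it easy to slot another pipeline in between later.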

Next, a handy feature: when running the crawl command you can add the -o option followed by a file name, like this:

scrapy crawl ptt -o ptt.csv

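Once it finishes, a new ptt.csv file appears in the directory!

Scrapy's feed export wrote the scraped items to a CSV file automatically. Note that this is a separate feature from the Item Pipeline: items pass through the enabled pipelines first and are then exported. To export JSON instead, just change the file extension: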

scrapy crawl ptt -o ptt.json

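Now there is a ptt.json file in the directory as well, and because the items went through the pipeline before being exported, the push values in it are integers.

Item Pipelines also have a few other important methods, which will be covered in later posts: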

open_spider(self, spider)
close_spider(self, spider)
from_crawler(cls, crawler)
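As a preview of how these three methods fit together, here is a rough sketch of a pipeline that writes each item to a JSON Lines file. The class name JsonLinesWriterPipeline and the ITEMS_PATH setting are made up for this example; only the method names and their signatures come from Scrapy.

```python
import json


class JsonLinesWriterPipeline:
    """Hypothetical pipeline that writes every item as one JSON line."""

    def __init__(self, path):
        self.path = path
        self.file = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the pipeline, which gives access to
        # the project settings. ITEMS_PATH is an assumed, made-up setting.
        return cls(path=crawler.settings.get('ITEMS_PATH', 'items.jl'))

    def open_spider(self, spider):
        # Called once when the spider starts: acquire resources here.
        self.file = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes: release resources here.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every scraped item.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item
```

Opening the file in open_spider and closing it in close_spider means the file is opened once per crawl rather than once per item.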

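That covers the basics of the Item Pipeline. The next posts will show more ways to use it, including filtering out duplicate data and storing items in different databases.

That's all for today. See you tomorrow!

References: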
Item Pipeline — Scrapy 1.5.1 documentation
