[Day 32] Scraping iT 邦幫忙 Replies with Scrapy

11th iThome Ironman Contest

AI & Data

Crawler in Hand, Data in Reach - 30 Days of Hands-On Scrapy Crawling, Part 33 of the series

11th Ironman Contest, python, crawler, scrapy

Rex Chien

2019-10-22 17:35:12

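In Day 13 we scraped all of the replies. Today we port that logic to Scrapy and tidy up the existing code along the way. The related code is all on gist, and the sections below walk through each part.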

Items

The IthomeArticleItem class gains an _id field to hold the identifier that MongoDB generates when the data is inserted.

A new IthomeReplyItem class is also added to store reply data.

import scrapy

class IthomeReplyItem(scrapy.Item):
    _id = scrapy.Field()
    article_id = scrapy.Field()
    author = scrapy.Field()
    publish_time = scrapy.Field()
    content = scrapy.Field()

Replies were previously referred to as response. To avoid confusion with Scrapy's response object, they are called reply throughout from here on.

Pipelines

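A new AbstractMongoPipeline class pulls out the logic for opening and closing the MongoDB connection, so that other Pipelines components that use MongoDB can inherit from it. Each component only needs to define its own collection_name.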

import pymongo

class AbstractMongoPipeline(object):
    # Subclasses set this to the name of the collection they write to
    collection_name = None

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        # Open the connection once and keep a handle to the target collection
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        self.collection = self.db[self.collection_name]

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from the project's settings
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def close_spider(self, spider):
        # Release the connection when the spider finishes
        self.client.close()

For the components that handle articles and replies, see the source code on gist: the article component and the reply component.
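As a rough illustration of the inheritance pattern described above, the sketch below swaps MongoDB for an in-memory dictionary so that it runs without a database server. InMemoryStorePipeline, ArticleStore, ReplyStore, and the collection names "article" and "reply" are hypothetical stand-ins, not the classes in the gist. The only point is that each subclass supplies collection_name and inherits everything else.

```python
class InMemoryStorePipeline:
    """Stand-in for AbstractMongoPipeline: a dict plays the role of the database."""

    collection_name = None  # each subclass must override this

    def __init__(self, db):
        if self.collection_name is None:
            raise NotImplementedError("subclass must define collection_name")
        self.db = db
        # Look up (or create) the list that stands in for a MongoDB collection
        self.collection = self.db.setdefault(self.collection_name, [])

    def save(self, document):
        # Append the document to this component's own collection
        self.collection.append(document)


class ArticleStore(InMemoryStorePipeline):
    collection_name = "article"  # hypothetical collection name


class ReplyStore(InMemoryStorePipeline):
    collection_name = "reply"  # hypothetical collection name


db = {}
ArticleStore(db).save({"title": "Day 32"})
ReplyStore(db).save({"author": "someone"})
```

Both subclasses share one db object but write to separate collections, which mirrors how the real components would share one MongoDB database.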

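Note in particular that when different Pipelines components in the same project handle different Items, each component must also check whether the item it receives is of the class it expects. If it is not, the component should return the item as-is so that later components can handle it.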

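def process_item(self, item, spider):
    # Only handle the item when it is an article Item
    if type(item) is items.IthomeArticleItem:
        ...  # article-specific handling (see the gist)
    # Always return the item so later components can handle it
    return item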
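To make the hand-off between components concrete, here is a minimal sketch that runs without Scrapy. ArticleItem, ReplyItem, TypedHandler and its subclasses, and run_pipelines are all hypothetical stand-ins. In particular, run_pipelines only imitates passing each item through the components in order, which is an assumption about the behavior and not Scrapy's actual code.

```python
class ArticleItem(dict):
    """Stand-in for IthomeArticleItem."""


class ReplyItem(dict):
    """Stand-in for IthomeReplyItem."""


class TypedHandler:
    """Handles only items of handled_type and passes everything else along."""

    handled_type = None

    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        if type(item) is self.handled_type:
            self.seen.append(item)  # stand-in for the database write
        # Returning the item lets the next component inspect it too
        return item


class ArticleHandler(TypedHandler):
    handled_type = ArticleItem


class ReplyHandler(TypedHandler):
    handled_type = ReplyItem


def run_pipelines(item, components, spider=None):
    # Feed the item through each component in order
    for component in components:
        item = component.process_item(item, spider)
    return item


articles, replies = ArticleHandler(), ReplyHandler()
for item in [ArticleItem(title="a"), ReplyItem(content="r")]:
    run_pipelines(item, [articles, replies])
```

Each handler ends up with exactly the one item of its own type, even though both items pass through both handlers.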

Finally, remember to add the components to the execution sequence:

ITEM_PIPELINES = {
    'ithome_crawlers.pipelines.IthomeCrawlersPipeline': 300,
    'ithome_crawlers.pipelines.IthomeArticlePipeline': 400,
    'ithome_crawlers.pipelines.IthomeReplyPipeline': 410,
}
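Scrapy runs the components in ascending order of these integer values, so with the setting above the article component sees each item before the reply component does. The snippet below simply sorts the same dictionary to show the resulting order. It does not involve Scrapy itself, and execution_order is a name chosen here for illustration.

```python
ITEM_PIPELINES = {
    'ithome_crawlers.pipelines.IthomeCrawlersPipeline': 300,
    'ithome_crawlers.pipelines.IthomeArticlePipeline': 400,
    'ithome_crawlers.pipelines.IthomeReplyPipeline': 410,
}

# Lower values run first, so sort the component paths by their priority
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for path in execution_order:
    print(ITEM_PIPELINES[path], path)
```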

Spider

Lastly, add a parse_reply(self, response, article_id) method to the spider to parse the reply data, and call it at the end of the method that handles the article.

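def parse_article(self, response):
    # ...parse the article data

    yield article

    # Articles with fewer than 20 views are removed,
    # so they have no identifier from an insert.
    if '_id' in article:
        # By the time execution resumes after the yield above, the data
        # has already been written to the database. Because this is the
        # same object reference, the identifier is available here.
        article_id = article['_id']
        # On iThome the article and its replies are on the same page,
        # so the original response can be reused to parse the replies.
        # Otherwise a Request object would have to be yielded here:
        # yield scrapy.Request(url, callback=self.parse_reply)
        yield from self.parse_reply(response, article_id)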
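The check for '_id' in article relies on the yielded object being modified before the generator resumes. The sketch below reproduces that with plain Python and no Scrapy. fake_pipeline is a hypothetical stand-in that assigns an identifier to articles with at least 20 views, view_count is an assumed field name, and the loop in crawl stands in for the engine, assuming (as the article describes) that the pipeline finishes before the next value is requested.

```python
def fake_pipeline(item):
    # Stand-in for the database insert: only popular articles get an _id
    if item.get("view_count", 0) >= 20:
        item["_id"] = "generated-id"


def parse_reply(article_id):
    # Stand-in for reply parsing: yield one reply tied to the article
    yield {"article_id": article_id, "content": "first reply"}


def parse_article(article):
    yield article
    # Execution resumes here only after the consumer has handled the article
    if "_id" in article:
        yield from parse_reply(article["_id"])


def crawl(article):
    outputs = []
    for item in parse_article(article):
        if "article_id" not in item:
            # The same dict object the generator still holds is mutated
            fake_pipeline(item)
        outputs.append(item)
    return outputs


with_replies = crawl({"title": "popular", "view_count": 100})
without_replies = crawl({"title": "quiet", "view_count": 3})
```

The popular article produces a second output carrying its identifier, while the quiet one never reaches parse_reply because no _id was ever set on it.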