[Day 19] Data Persistence - NoSQL (2/2)

11th iThome Ironman Contest

DAY 20

AI & Data

Crawler in Hand, Data Is Mine - 30 Days of Scrapy Crawling in Practice, part 20 of the series


Tags: 11th Ironman Contest, python, crawler, mongodb

Rex Chien

2019-10-04 11:07:57


I'll start with the complete source code. Compared with the code from Day 17, only two functions changed, insert_article() and insert_responses(), and both now write to MongoDB.

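Inserting articles

When saving an article, we need to return an identifier so that responses can be linked back to the original post. The logic is therefore split into two branches:

Query by URL. If the article does not exist, insert a new document and take the ObjectId generated by the insert.
If it already exists, update it with the $set operator and return the _id of the document that was found.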

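from datetime import datetime

# article_collection is the MongoDB collection set up in the complete source code.


def insert_article(article):
    """Insert an article into the database, or update it if it already exists.

    :param article: article data

    :return: article ID
    :rtype: ObjectId
    """
    # Check whether a document with the same URL already exists
    doc = article_collection.find_one({'url': article['url']})
    article['update_time'] = datetime.now()

    if not doc:
        # Not found: insert it
        article_id = article_collection.insert_one(article).inserted_id
    else:
        # Already exists: update it
        article_collection.update_one(
            {'_id': doc['_id']},
            {'$set': article}
        )
        article_id = doc['_id']

    print(f'[{article["title"]}] saved successfully!')

    return article_id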
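
To try the two-branch flow without a running MongoDB server, here is a minimal sketch. FakeCollection is a hypothetical in-memory stand-in that imitates only the three collection calls used above, and save_article() is an assumed variant of insert_article() that takes the collection as a parameter. It is an illustration of the flow, not a replacement for pymongo.

```python
from datetime import datetime
from itertools import count
from types import SimpleNamespace


class FakeCollection:
    """Hypothetical in-memory stand-in for a pymongo collection (illustration only)."""

    def __init__(self):
        self.docs = []
        self._next_id = count(1)  # real MongoDB would generate an ObjectId instead

    def find_one(self, query):
        # Return the first document whose fields equal every key in the query.
        for doc in self.docs:
            if all(doc.get(key) == value for key, value in query.items()):
                return doc
        return None

    def insert_one(self, doc):
        # Add an _id to the caller's dict when it is missing, then store it.
        doc.setdefault('_id', next(self._next_id))
        self.docs.append(doc)
        return SimpleNamespace(inserted_id=doc['_id'])

    def update_one(self, query, update):
        # Only the $set operator is modelled here.
        doc = self.find_one(query)
        if doc is not None:
            doc.update(update['$set'])
        return SimpleNamespace(matched_count=0 if doc is None else 1)


def save_article(collection, article):
    """Same two-branch flow as insert_article(), with the collection passed in."""
    doc = collection.find_one({'url': article['url']})
    article['update_time'] = datetime.now()
    if doc is None:
        return collection.insert_one(article).inserted_id
    collection.update_one({'_id': doc['_id']}, {'$set': article})
    return doc['_id']


articles = FakeCollection()
first_id = save_article(articles, {'url': 'https://example.com/a', 'title': 'v1'})
second_id = save_article(articles, {'url': 'https://example.com/a', 'title': 'v2'})
```

Saving the same URL twice returns the same identifier and leaves a single document whose title is the newer one, which is the behavior the responses rely on to point at the right article.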

Inserting responses

MongoDB uses the _id field as the primary key by default. If you don't specify one when inserting, an ObjectId value is filled in automatically.
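
As a side note on what an auto-generated ObjectId carries: it is 12 bytes long, and the first 4 bytes hold the creation time in seconds since the Unix epoch. The sketch below decodes that timestamp with the standard library only. The function name objectid_created_at and the sample value are made up for illustration; with pymongo installed you would normally read the generation time from the ObjectId object itself.

```python
from datetime import datetime, timezone


def objectid_created_at(oid_hex):
    """Decode the creation time stored in the first 4 bytes of an ObjectId."""
    if len(oid_hex) != 24:
        raise ValueError('an ObjectId is 12 bytes, i.e. 24 hex characters')
    seconds = int(oid_hex[:8], 16)  # big-endian seconds since the Unix epoch
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Made-up ObjectId string for illustration: a known timestamp plus 8 filler bytes.
known = datetime(2019, 10, 4, 3, 7, 57, tzinfo=timezone.utc)
sample_hex = format(int(known.timestamp()), '08x') + '00' * 8
decoded = objectid_created_at(sample_hex)
```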

On line 116 of the complete source code, I use the response ID found in the HTML source directly as the identifier.

# Response ID
result['_id'] = int(response.find('a')['name'].replace('response-', ''))
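
If the anchor name ever arrives in an unexpected shape, replace() followed by int() fails with a fairly opaque error. A slightly stricter variant could look like the sketch below. The helper name parse_response_id and the pattern are my own assumptions, based only on the 'response-' prefix used above.

```python
import re

# Assumed anchor format, taken from the 'response-' prefix used above.
RESPONSE_NAME = re.compile(r'response-(\d+)')


def parse_response_id(anchor_name):
    """Return the numeric ID from a name such as 'response-12345'."""
    match = RESPONSE_NAME.fullmatch(anchor_name)
    if match is None:
        raise ValueError(f'unexpected anchor name: {anchor_name!r}')
    return int(match.group(1))


response_id = parse_response_id('response-12345')
```

The result is still an int, so it can be used as the _id value in the same way.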

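The logic for saving responses is simpler: pass an extra upsert=True argument when calling update_one(), and a new document is inserted automatically when no update target is found.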

def insert_responses(responses):
    """Insert responses into the database, updating any that already exist.

    :param responses: response data
    """
    for response in responses:
        response_collection.update_one(
            {'_id': response['_id']},
            {'$set': response},
            upsert=True
        )
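
To make the upsert behavior concrete, here is a rough in-memory imitation of update_one() with $set and upsert=True, keyed on _id. The function upsert_by_id and the plain dict used as a store are hypothetical stand-ins; the real call runs inside MongoDB.

```python
def upsert_by_id(store, doc):
    """Imitate update_one({'_id': ...}, {'$set': doc}, upsert=True) on a dict.

    `store` is a hypothetical stand-in for a collection: a dict keyed by _id.
    Returns True when a new document was created, False when one was updated.
    """
    existing = store.get(doc['_id'])
    if existing is None:
        store[doc['_id']] = dict(doc)  # no match: insert a copy
        return True
    existing.update(doc)  # match: $set overwrites only the given fields
    return False


responses = {}
first_created = upsert_by_id(
    responses, {'_id': 101, 'author': 'alice', 'content': 'first'}
)
second_created = upsert_by_id(responses, {'_id': 101, 'content': 'edited'})
```

The second call updates the existing entry instead of adding a new one, and fields that are not part of the update (author here) are left untouched. That is why re-running the crawler does not duplicate responses.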

Today's post is on the short side, because the main logic was already explained over the previous few days.

Tomorrow I'll cover some common anti-scraping techniques, and after that we'll start on Scrapy!