[Day 14] Scrapy Item&Field

2019 iT 邦幫忙 Ironman Contest (鐵人賽)

DAY 14

AI & Data

Part 14 of the series "Scrapy爬蟲與資料處理30天筆記" (30 Days of Notes on Scrapy Crawling and Data Processing)


plusone

Team: NUTC_imac

2018-10-29 13:08:50


Hi, it's Day 14. In yesterday's post we built a spider that crawls PTT (see yesterday's post for the code). Today we define the data items we want to collect.

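Why do we need this?

At the end of the previous post I wrote:

We store the data in Python dictionaries, but this has a drawback: a dictionary is convenient yet lacks structure, so it is easy to mistype a key or return inconsistent data, especially in a project with several spiders.
So tomorrow we'll cover the Item class, which wraps the scraped data, and explain why it is worth using.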
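
To make the drawback concrete, here is a minimal sketch in plain Python (no Scrapy involved). The two functions and their sample values are hypothetical; they stand in for two spiders that are supposed to return the same kind of record.

```python
# Two hypothetical spiders returning plain dicts for the same kind of record.
def parse_board_a():
    return {"title": "Hello", "author": "alice", "push": "10"}


def parse_board_b():
    # Typo: "auther" instead of "author". A dict accepts it silently.
    return {"title": "World", "auther": "bob", "push": "3"}


rows = [parse_board_a(), parse_board_b()]

# Downstream code reading "author" quietly gets None for the second row.
authors = [row.get("author") for row in rows]
print(authors)  # ['alice', None]
```

Nothing fails here; the bad key only shows up later as missing data, which is the kind of mistake a declared set of fields is meant to catch.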

Open items.py in the project and you'll see the following (it is generated automatically when the project is created):

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class MyfirstscrapyprojectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

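Scrapy provides two classes, Item and Field, which you can use to define your own data classes that wrap the scraped data. To use them, subclass scrapy.Item (as in the code above) and create Field objects the way the comment suggests, for example: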

name = scrapy.Field()

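You can attach any kind of metadata to a Field, and neither Field nor Item restricts the values you store, so you can define fields according to your own needs.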

Now let's define the item for the PTT post data:

class MyfirstscrapyprojectItem(scrapy.Item):
    # Fields for one entry in the PTT post list
    title = scrapy.Field()   # post title
    author = scrapy.Field()  # author ID
    push = scrapy.Field()    # push count shown next to the post
    href = scrapy.Field()    # link to the post
    date = scrapy.Field()    # post date

Next, go back to the spider code:

At the top, add a line that imports the class from our items.py:

from ..items import MyfirstscrapyprojectItem

Change the body of def parse(self, response) to:

def parse(self, response):
    # Assumes CloseSpider is imported from scrapy.exceptions and that
    # self.count_page is defined on the spider (neither is shown here).
    for q in response.css('div.r-ent'):
        # Create a fresh item per post so yielded items do not share state.
        items = MyfirstscrapyprojectItem()
        items['push'] = q.css('div.nrec > span.hl::text').extract_first()
        items['title'] = q.css('div.title > a::text').extract_first()
        items['href'] = q.css('div.title > a::attr(href)').extract_first()
        items['date'] = q.css('div.meta > div.date ::text').extract_first()
        items['author'] = q.css('div.meta > div.author ::text').extract_first()
        yield items
    # Take the link at index 3 of the action-bar buttons as the next page to crawl.
    next_page_url = response.css('div.action-bar > div.btn-group > a.btn::attr(href)')[3].extract()
    if next_page_url and self.count_page < 10:
        self.count_page = self.count_page + 1
        new = response.urljoin(next_page_url)
    else:
        raise CloseSpider('close it')
    yield scrapy.Request(new, callback=self.parse, dont_filter=True)

Note: remember to create the Item object with items = MyfirstscrapyprojectItem() before assigning any fields.
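
Where that object is created also matters. The sketch below uses plain dicts as stand-ins for Item objects (an assumption made so it runs without Scrapy) and made-up titles, to show what happens when one object is created once and reused across loop iterations, compared with creating one per row.

```python
def reuse_one_object(rows):
    record = {}                  # created once, outside the loop
    for row in rows:
        record["title"] = row
        yield record             # every yield hands out the same object


def fresh_object_per_row(rows):
    for row in rows:
        record = {}              # created once per row
        record["title"] = row
        yield record


titles = ["post-1", "post-2", "post-3"]
shared = list(reuse_one_object(titles))
fresh = list(fresh_object_per_row(titles))

print([r["title"] for r in shared])  # ['post-3', 'post-3', 'post-3']
print([r["title"] for r in fresh])   # ['post-1', 'post-2', 'post-3']
```

Whether the shared version causes trouble in a real crawl depends on whether anything downstream keeps a reference to an item after it has been yielded, but creating one object per row avoids the question entirely.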

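An Item behaves much like a dictionary, so it should feel familiar. In addition, an Item checks field names: assigning to a field that was not defined raises an error. Try changing the assignment to: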

items['pull'] = q.css('div.nrec > span.hl::text').extract_first()

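The field was defined as push, and now we assign to pull. Running the spider produces the error KeyError: 'MyfirstscrapyprojectItem does not support field: pull', which warns the user and guards against spelling mistakes.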

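Today we covered how to use the Item class. So far, though, we haven't exported anything, and crawling without output is pointless, so next we'll look at Item Pipelines. Exporting the scraped data becomes so quick that you may end up thinking:

What on earth were we covering in the earlier posts, if Scrapy makes all of that unnecessary?

That's all for today. See you tomorrow!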

