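Hi, it's day 12. Let's start with a quick look at how a Spider is written.
Yesterday we ran the command: scrapy genspider example example.com
As a result, there is now an example.py file in the spiders directory: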
# -*- coding: utf-8 -*-
import scrapy
class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/']
    def parse(self, response):
        pass
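Now let's walk through the code:
name = "example": name is the spider's name. Run scrapy list and you will see example, because this is the only spider file we have so far.
class ExampleSpider(scrapy.Spider): inherits from scrapy.Spider, which provides the interface that the Scrapy Engine calls as well as settings, which gives access to the values in the settings file.
start_urls: defines the crawler's starting points. It is a list that can hold a single URL or several; put every URL the crawl should start from here.
parse(self, response): the parsing function; this is where the parsing code goes.
So writing a spider comes down to a few steps: subclass scrapy.Spider, give the spider a name, define its starting points, and implement parse().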
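The reason we can drop URLs straight into start_urls without building any requests ourselves is that the spider already ships with a start_requests() method. You can override it in your own spider; the following example defines the crawl's starting point through start_requests():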
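import scrapy
class ExampleSpider(scrapy.Spider):
    name = 'ptt_movie'
    def start_requests(self):
        # Build the first request by hand so a custom header can be attached.
        yield scrapy.Request(
            "https://www.ptt.cc/bbs/movie/index.html",
            headers={'User-Agent': 'Mozilla/5.0'},
            callback=self.parse_article,  # must name a method defined on this spider
        )
    def parse_article(self, response):
        pass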
Here is the version that sets the starting points with start_urls instead:
import scrapy
class MySpider(scrapy.Spider):
    name = 'ptt'
    allowed_domains = ['ptt.cc']
    start_urls = [
        'https://www.ptt.cc/bbs/Food/index.html',
        'https://www.ptt.cc/bbs/movie/index.html',
    ]
    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
Run:
scrapy crawl ptt
and you will see (the order can differ from start_urls, since responses arrive asynchronously):
[ptt] INFO: A response from https://www.ptt.cc/bbs/movie/index.html just arrived!
[ptt] INFO: A response from https://www.ptt.cc/bbs/Food/index.html just arrived!
Defining start_requests(self) yourself works as well:
import scrapy
class MySpider(scrapy.Spider):
    name = 'ptt'
    allowed_domains = ['ptt.cc']
    def start_requests(self):
        yield scrapy.Request('https://www.ptt.cc/bbs/Food/index.html', self.parse)
        yield scrapy.Request('https://www.ptt.cc/bbs/movie/index.html', self.parse)
    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
The output is the same:
[ptt] INFO: A response from https://www.ptt.cc/bbs/movie/index.html just arrived!
[ptt] INFO: A response from https://www.ptt.cc/bbs/Food/index.html just arrived!
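Today we covered how a spider is written, along with the two ways to set a crawler's starting points:
start_urls
the start_requests() method
Tomorrow we will continue by using a spider to actually crawl PTT!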
Spiders — Scrapy 1.5.1 documentation