Hi, it's Day 12. Today let's take a quick look at how to write a Spider.
Yesterday we ran the command: scrapy genspider example example.com
So in the spiders directory you can now see a file named example.py:
# -*- coding: utf-8 -*-
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass
Now let's walk through the code:
- name = "example": name is the identifier of each spider. If you run scrapy list you will see example, because this is currently our only spider file.
- class ExampleSpider(scrapy.Spider): the spider inherits from scrapy.Spider, which is the interface the Scrapy Engine calls; it also exposes settings for reading values from the settings file.
- allowed_domains: the domains the spider is allowed to crawl; requests to other domains are filtered out.
- start_urls: the spider's starting points. It is a list, so it can hold a single URL or several; put every URL you want to start crawling from here.
- parse(self, response): the parse function; this is where the parsing code goes.
So we can see that writing a spider comes down to a few steps: subclass scrapy.Spider, give it a name, define the starting URLs, and implement parse() to handle the responses.
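Since the class inherits from scrapy.Spider, the running spider also gets helpers such as self.logger and self.settings. As a minimal sketch of the settings access mentioned above (the spider name and URL here are placeholders, not part of the original example), reading a built-in setting from inside parse() looks like this:

import scrapy


class SettingsDemoSpider(scrapy.Spider):
    # Hypothetical spider, only to illustrate access to the project settings.
    name = 'settings_demo'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # self.settings exposes the project settings; USER_AGENT is a built-in Scrapy setting.
        self.logger.info('Configured USER_AGENT: %s', self.settings.get('USER_AGENT'))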
The reason we can drop URLs straight into start_urls without issuing requests ourselves is that scrapy.Spider already ships with a start_requests() method that turns them into requests (a sketch of that default follows the next example). You can override it in your own spider code; below is an example that defines the starting requests through start_requests():
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'ptt_movie'

    def start_requests(self):
        yield scrapy.Request("https://www.ptt.cc/bbs/movie/index.html", headers={'User-Agent': 'Mozilla/5.0'}, callback=self.parse_article)

    def parse_article(self, response):
        pass
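For reference, the default behaviour we are overriding works roughly like the sketch below; this is a simplified illustration of what scrapy.Spider does with start_urls, not the exact library source:

import scrapy


class DefaultBehaviourSpider(scrapy.Spider):
    # Hypothetical spider, only to illustrate the built-in behaviour described above.
    name = 'default_behaviour'
    start_urls = ['http://example.com/']

    def start_requests(self):
        # Roughly what scrapy.Spider's default start_requests() does:
        # turn every URL in start_urls into a Request handled by parse().
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        pass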
Below is an example that sets the starting points with start_urls:
import scrapy


class MySpider(scrapy.Spider):
    name = 'ptt'
    allowed_domains = ['ptt.cc']
    start_urls = [
        'https://www.ptt.cc/bbs/Food/index.html',
        'https://www.ptt.cc/bbs/movie/index.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
Run:
scrapy crawl ptt
and you will see:
[ptt] INFO: A response from https://www.ptt.cc/bbs/movie/index.html just arrived!
[ptt] INFO: A response from https://www.ptt.cc/bbs/Food/index.html just arrived!
Defining start_requests(self) yourself instead, you can write:
import scrapy


class MySpider(scrapy.Spider):
    name = 'ptt'
    allowed_domains = ['ptt.cc']

    def start_requests(self):
        yield scrapy.Request('https://www.ptt.cc/bbs/Food/index.html', self.parse)
        yield scrapy.Request('https://www.ptt.cc/bbs/movie/index.html', self.parse)

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
You will see the same output:
[ptt] INFO: A response from https://www.ptt.cc/bbs/movie/index.html just arrived!
[ptt] INFO: A response from https://www.ptt.cc/bbs/Food/index.html just arrived!
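Both spiders above only log that a response arrived. As a small taste of tomorrow's PTT crawler, here is a hedged sketch of a parse() that pulls out post titles; it assumes the board index marks each title up as div.title a, so verify the selector against the real page before relying on it:

import scrapy


class PttTitleSpider(scrapy.Spider):
    # Hypothetical spider name, for illustration only.
    name = 'ptt_titles'
    start_urls = ['https://www.ptt.cc/bbs/movie/index.html']

    def parse(self, response):
        # Assumes each post title sits inside <div class="title"><a>...</a></div>.
        for title in response.css('div.title a::text').extract():
            yield {'title': title}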
Today we looked at how to write a spider, as well as the two ways to define where a crawl starts: start_urls and start_requests(). Tomorrow we will continue by using a spider to actually crawl PTT!
Reference: Spiders — Scrapy 1.5.1 documentation