Scrapy (1): The de facto standard for web crawling in Python

DAY 21

Part 20 of the series "Using Python to fetch open information (data) from web pages, analyze it, and plot charts"

python, Ironman Contest

timloo

2013-10-16 22:22:48


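Looking back at the earlier posts:
the library I started with (Beautiful Soup) had trouble parsing some pages,
so I switched to lxml. lxml covers a lot of ground: besides basic parsing, it can also build XML/HTML documents.
Along the way, because I was not yet comfortable with XPath, my scraping scripts ended up written inconsistently,
which made them hard even for me to maintain. Coming across Scrapy (http://scrapy.org/, "Scrapy | An open source web scraping framework for Python") at that point felt a bit like spotting a lamp in the dark.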

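Over the next few posts I will briefly walk through some usage examples.
This post draws on "Scrapy at a glance" (http://doc.scrapy.org/en/latest/intro/overview.html).
Scrapy's official documentation has also caught the attention of readers in mainland China, and some have started translating it from English into Chinese.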

Create a project;
scrapy generates a template directory for it.

scrapy startproject ironman6

timloo@timloo-home:~/iron/ironman6$ tree
.
├── ironman6
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── items.py
│   ├── items.pyc
│   ├── pipelines.py
│   ├── settings.py
│   ├── settings.pyc
│   └── spiders
│       ├── __init__.py
│       ├── __init__.pyc
│       ├── Ironman6_spider.py
│       └── Ironman6_spider.pyc
└── scrapy.cfg

2 directories, 12 files

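For a simple scraping example,
you only need to define the items to be scraped (items.py)
and the scraping rules (a spider under spiders/, here Ironman6_spider.py).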

In the empty items.py generated by the command, fill in the fields to scrape:

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class Ironman6Item(Item):
    # Fields filled in by the spider for each article page.
    subject = Field()   # article title
    summary = Field()   # paragraph text
    dwtime = Field()    # contents of the text_dwtime block




The scraping rules:
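# The import paths below are those of Scrapy 0.x, current when this was written;
# Scrapy 1.0 and later moved them to scrapy.spiders and scrapy.linkextractors.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from ironman6.items import Ironman6Item

class IronManSpider(CrawlSpider):

    name = 'Ironman6'
    # Stay on this domain and start from the author's contest page.
    allowed_domains = ['ithelp.ithome.com.tw']
    start_urls = ['http://ithelp.ithome.com.tw/ironman6/player/timloo']
    # Follow links whose URL matches /life/<digits> and pass each page to parse_iron.
    rules = [Rule(SgmlLinkExtractor(allow=[r'/life/\d+']), 'parse_iron')]

    def parse_iron(self, response):
        x = HtmlXPathSelector(response)

        article = Ironman6Item()
        # extract() returns a list of matched strings for each XPath.
        article['subject'] = x.select("//h1/text()").extract()
        article['summary'] = x.select("//p/text()").extract()
        # No text() step here, so the whole <div> markup is captured.
        article['dwtime'] = x.select('//div[@class="text_dwtime"]').extract()
        return article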
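
To get a feel for what the three XPath queries in parse_iron pick out, here is a minimal sketch that runs equivalent lookups on a tiny page. It is not Scrapy code: it uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so text() is emulated with the .text attribute. SAMPLE_PAGE and extract_fields are hypothetical names, and the markup is a made-up stand-in rather than the real site's HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for an article page (not the real markup).
SAMPLE_PAGE = """
<html>
  <body>
    <h1>Scrapy (1)</h1>
    <div class="text_dwtime">2013-10-16 22:22:48</div>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

def extract_fields(page):
    """Mimic the three queries in parse_iron with ElementTree lookups."""
    root = ET.fromstring(page)
    return {
        # //h1/text(): the text inside every <h1>
        "subject": [h1.text for h1 in root.iter("h1")],
        # //p/text(): the text inside every <p>
        "summary": [p.text for p in root.iter("p")],
        # //div[@class="text_dwtime"] has no text() step, so the result is
        # the serialized element, tags included.
        "dwtime": [
            ET.tostring(div, encoding="unicode").strip()
            for div in root.findall(".//div[@class='text_dwtime']")
        ],
    }

fields = extract_fields(SAMPLE_PAGE)
print(fields)
```

Every field comes back as a list, and dwtime holds markup rather than a bare timestamp. If only the timestamp text were wanted, adding a /text() step to that XPath would be the usual adjustment.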

Run it:
scrapy crawl Ironman6
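
When the crawl runs, the rule's allow pattern decides which links found on the start page get followed. The sketch below imitates that filtering with the standard re module, on the assumption that an allow pattern is applied as a search anywhere in the URL rather than as a full match. The URLs in links and the helper followed are hypothetical, made up for illustration.

```python
import re

# Same pattern as the spider's rule; a raw string keeps \d intact.
ALLOW = re.compile(r"/life/\d+")

# Hypothetical links such as might appear on the start page.
links = [
    "http://ithelp.ithome.com.tw/ironman6/player/timloo",
    "http://ithelp.ithome.com.tw/ironman6/player/timloo/life/1",
    "http://ithelp.ithome.com.tw/ironman6/player/timloo/life/20",
    "http://ithelp.ithome.com.tw/ironman6/player/timloo/tech/3",
]

def followed(urls, pattern=ALLOW):
    """Keep the URLs in which the pattern is found somewhere (search, not full match)."""
    return [url for url in urls if pattern.search(url)]

for url in followed(links):
    print(url)
```

Only the two /life/ links pass, so a page under another category would never reach parse_iron.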

To write the output to a file instead (JSON):
scrapy crawl Ironman6 -o scraped_data.json -t json
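
Once the file exists, it can be read back with the standard json module. The sketch below assumes the export is a JSON array with one object per item, where each field is a list because extract() returns a list. SAMPLE_FEED is a made-up sample in that shape, not real crawl output, and load_items is a hypothetical helper. Non-ASCII text may show up in the file as \uXXXX escapes; those are ordinary JSON escapes, and json.loads turns them back into characters.

```python
import json

# Hypothetical sample in the assumed shape of the export (not real output).
# The raw string keeps the \uXXXX and \" sequences as JSON escapes.
SAMPLE_FEED = r'''
[
  {"subject": ["Scrapy\uff08\u4e00\uff09"],
   "summary": ["first paragraph", "second paragraph"],
   "dwtime": ["<div class=\"text_dwtime\">2013-10-16 22:22:48</div>"]}
]
'''

def load_items(text):
    """Parse the export and join each list-valued field into a single string."""
    items = json.loads(text)
    return [
        {key: " ".join(values) for key, values in item.items()}
        for item in items
    ]

articles = load_items(SAMPLE_FEED)
for article in articles:
    print(article["subject"], "|", article["dwtime"])
```

For the real file, the same helper could be fed the text read from scraped_data.json, opened with encoding="utf-8".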

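This is a trial run that scrapes some data from this very site.
The handling of Chinese text still needs to be sorted out.
I will share more details tomorrow!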