[Day 28] 實戰：線上書店爬蟲流程

2019 iT 邦幫忙鐵人賽

DAY 28

AI & Data

Scrapy爬蟲與資料處理30天筆記系列第 28 篇

2019鐵人賽

plusone

團隊NUTC_imac

2018-11-12 20:54:35

4094 瀏覽

分享至

Day 28

嗨，倒數三天，因為內容差不多都說明完了，所以今天我們就來爬取書店網站吧，知道了爬取的流程其實就可以爬其他的網站了，因為基本上就是得到商品連結、爬取商品細節、切到下一頁面的流程。

這是我們這次範例的連結： All products | Books to Scrape - Sandbox

首先先建立專案：

scrapy startproject books

建立spider

scrapy genspider book_example  books.toscrape.com

建立好後，右鍵檢查網頁，可以看到每本書都在<div class="image_container></div>"內。你可能會想說直接抓<a>，就可以取到href？但是這樣可能會抓到其他不相關的，只要是a元素都會取到，但是限定class ="image_container"的話，就會只抓到商品的部分，不會抓到其他內容。

Imgur

理解了之後，就可以來撰寫spider了，我們先抓取全部的href，用urljoin組成url。

import scrapy
from bs4 import BeautifulSoup
class BookExampleSpider(scrapy.Spider):
    name = "book_example"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        products = soup.select('div.image_container')
        for p in products:
            link = response.urljoin(p.select_one('a').get('href'))

            yield scrapy.Request(link, callback=self.parse_books)

有了url後，我們就可以來切換頁面了，切頁的按鈕在右下角（如下圖），在<li>且class="next"內的<a>標籤內：

Imgur

next_page = soup.select_one('li.next a').get('href')
yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

這樣我們的parse方法就完成了：

def parse(self, response):
    soup = BeautifulSoup(response.text, 'lxml')
    products = soup.select('div.image_container')
    for p in products:
        link = response.urljoin(p.select_one('a').get('href'))
        yield scrapy.Request(link, callback=self.parse_books)
    next_page = soup.select_one('li.next a').get('href')
    print(response.urljoin(next_page))
    yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

現在，我們就可以開始爬取商品詳細資訊了，選擇一個商品：

Imgur

接著就剩下一個一個內容的擷取了，現在我們開始撰寫parse_books方法吧！

基本上都是平常會用到的，但這邊有一個比較特別的是findNext('p')，因為description的內容只包在一個<p>內，但我們如果只select('p')則會抓到其他不相關的資料，所以我們先抓'div#product_description'，再用findNext，去找它的下一個（不是裡面）。
評分在class內，所以用get('class')後再取後面的元素。

def parse_books(self, response):
    soup = BeautifulSoup(response.text, 'lxml')
    title = soup.select_one('div.product_main h1').text
    price = soup.select_one('p.price_color').text
    stock = soup.select_one('p.availability').text.strip()
    description = soup.select_one('div#product_description').findNext('p').text
    rating = soup.select_one('p.star-rating').get('class')[1]

可以直接寫成：
- 建議用BooksItem class，不過這裡確定只有一個spider才直接這樣寫。

item = {
            'title':soup.select_one('div.product_main h1').text,
            'price':soup.select_one('p.price_color').text,
            'stock':soup.select_one('p.availability').text.strip(),
            'description':soup.select_one('div#product_description').findNext('p').text,
            'rating': soup.select_one('p.star-rating').get('class')[1],
        }