Hi! Yesterday we went over storing data into MongoDB through an Item Pipeline. Today is a hands-on post: let's crawl the articles on TraNews (全球新聞網)!
The weather turned cold and I caught a cold, so my whole body aches and my throat hurts... everyone, keep warm! (Q_Q)
First, create a new Scrapy project:
scrapy startproject traNews
Then generate a Spider named news:
scrapy genspider news example.com
The project structure now looks like this:
.
├── scrapy.cfg
└── traNews
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── news.py
2 directories, 8 files
Open news.py inside the spiders folder and define allowed_domains, start_urls, plus a num attribute we'll use to keep track of the page number:
import scrapy
class NewsSpider(scrapy.Spider):
    name = "news"
    num = 1
    # listing pages for the 旅遊 (travel), 美食 (food), 藝文 (arts) and 休閒 (leisure) categories
    start_urls = ['http://blog.tranews.com/blog/category/%E6%97%85%E9%81%8A',
                  'http://blog.tranews.com/blog/%E7%BE%8E%E9%A3%9F',
                  'http://blog.tranews.com/blog/%E8%97%9D%E6%96%87',
                  'http://blog.tranews.com/blog/%E4%BC%91%E9%96%92']
    def parse(self, response):
        pass
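A quick note on allowed_domains: the genspider command above creates the spider with allowed_domains = ['example.com']. The snippet here simply leaves that line out, which works fine (no domain filtering); if you do keep the attribute, Scrapy's offsite filtering would normally drop the follow-up requests to blog.tranews.com unless you point it at the real domain, e.g. (the exact value below is my assumption):

    # assumption: restrict crawling to the TraNews domain instead of the
    # example.com placeholder generated by genspider
    allowed_domains = ['tranews.com']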

Inspecting the listing page, each article's link (the href) sits inside an h2 element with class="entry-title", so parse starts out like this:
def parse(self, response):
    soup = BeautifulSoup(response.text, 'lxml')
    titles = soup.select('h2.entry-title')
    for t in titles:
        link = t.select_one('a').get('href')
        title = t.text
        yield scrapy.Request(link, callback=self.article_parser)
t.text gives us the article title, t.select_one('a') reaches the a tag one level down, and .get('href') returns the article's link. We can stash these two values in meta so that the article_parser function can use them later.
So we can rewrite parse like this:
def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        titles = soup.select('h2.entry-title')
    
        for t in titles:
            meta = {
                'title':t.text,
                'link':t.select_one('a').get('href')
            }
            # link = t.select_one('a').get('href')
            # title = t.text
            yield scrapy.Request(meta['link'], callback=self.article_parser, meta=meta)
This grabs every title and link on that page, but it's still only one page of articles, so next we need to handle pagination. Watching the Network tab while scrolling down, you can see that each new batch of articles is fetched by simply bumping the page number at the end of the URL, for example:
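Based on how next_page is assembled in the code below, the pages for the 旅遊 category end up looking like this (these URLs are reconstructed from the code rather than copied from a captured request):

http://blog.tranews.com/blog/category/%E6%97%85%E9%81%8A/page/2
http://blog.tranews.com/blog/category/%E6%97%85%E9%81%8A/page/3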

This is exactly what the num attribute is for. The idea is to increment num to switch pages, and on every new page crawl all of its links again:
def parse(self, response):
    soup = BeautifulSoup(response.text, 'lxml')
    titles = soup.select('h2.entry-title')
    for t in titles:
        meta = {
            'title':t.text,
            'link':t.select_one('a').get('href')
        }
        yield scrapy.Request(meta['link'], callback=self.article_parser, meta=meta)
    self.num += 1
    next_page = self.start_urls[0] +'/page/'+ str(self.num)
    yield scrapy.Request(next_page, callback=self.parse)
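Two caveats about this version: only start_urls[0] ever gets paginated, and self.num keeps growing forever, so the spider will keep requesting higher and higher page numbers even after the articles run out. A minimal tweak (just a sketch, not part of the original post) is to only schedule the next page while the current page still has titles:

    # sketch: only keep paginating while the current page still has articles
    if titles:
        self.num += 1
        next_page = self.start_urls[0] + '/page/' + str(self.num)
        yield scrapy.Request(next_page, callback=self.parse)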
Next is the Item. We're not sure yet which fields we'll need, so for now let's define just these few in items.py (more can always be added later):
class TranewsItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    link = scrapy.Field()
    content = scrapy.Field()
    time = scrapy.Field()
    img = scrapy.Field()
To use the Item, import it at the top of news.py:
from ..items import TranewsItem
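Since the parse code uses BeautifulSoup as well, the imports at the top of news.py end up as follows (the bs4 import is used throughout the snippets even though it was never shown explicitly):

import scrapy
from bs4 import BeautifulSoup
from ..items import TranewsItem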

Looking at an article page, the body is made up of a series of <p> tags, so we need to extract the text of each p element and join them together. Since we haven't covered Scrapy's own selectors yet, I'll write it out step by step with BeautifulSoup to keep it easy to follow:
def article_parser(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        article = TranewsItem()
        article['title'] = response.meta['title']
        article['link'] = response.meta['link']
        
        contents = soup.select('div.entry-content p')
        article['content'] = ''
        for content in contents:
            article['content'] = article['content'] + content.text 
If you'd rather use Scrapy's built-in CSS selectors, the same extraction can be done in two lines inside the callback (the original sel was never defined, so use response.css, which is available on the response object the callback receives):
content = response.css('div.entry-content p::text').extract()[1:]
article['content'] = ','.join(content)
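.extract() still works, but newer Scrapy versions prefer .getall() as the name for the same thing:

content = response.css('div.entry-content p::text').getall()[1:]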
The article image and the publish time can be grabbed the same way with select_one:
article['img'] = soup.select_one('img').get('src')
article['time'] = soup.select_one('span.entry-date').text
Putting it all together, the whole article_parser function becomes:
def article_parser(self, response):
    soup = BeautifulSoup(response.text, 'lxml')
    article = TranewsItem()
    article['title'] = response.meta['title']
    article['link'] = response.meta['link']
    contents = soup.select('div.entry-content p')
    article['content'] = ''
    for content in contents:
        article['content'] = article['content'] + content.text
    article['img'] = soup.select_one('img').get('src')
    article['time'] = soup.select_one('span.entry-date').text
    return article
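At this point you can already give the spider a test run from the project root. The -o flag dumps every yielded item to a file so you can sanity-check the fields before we wire up a database tomorrow (the filename is only an example):

scrapy crawl news -o articles.json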

OK, that covers how to write the Spider. Tomorrow I'll continue with storing the data into a MySQL database. See you tomorrow!