2018 iT邦幫忙 Ironman Contest
DAY 7

[Day 7] Side Quest: XPath

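Good evening, everyone! First, a quick recap.
To scrape this season's new anime from 巴哈動畫瘋 (ani.gamer.com.tw), we are building a small crawler with the following package: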

composer require symfony/dom-crawler

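The method we need from it is filterXPath. I spent some time last night reading up on it,
and here are a few sites I recommend as a starting point for getting to know XPath!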

A rough understanding is enough to start writing a few examples. Without further ado, let's begin!
https://ithelp.ithome.com.tw/upload/images/20171212/20107380TqifH3m4xD.png

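  • First, with my keen eyes, I can see that all the information about an anime's latest episode can be found inside frame A!
  • Frame B shows whether the video has been added to your favorites (anyone logged in to a 巴哈 account can use this feature), but that is outside the scope of this exercise for now
  • Frame C is the link to the video
  • Frame D is the path of the image
  • Frames E and F are the anime title and the update date, respectively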

所以我們可以先寫出一個function如下:
CrawlerService.php

<?php
namespace App\Services;

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

class CrawlerService
{
    /** @var Client  */
    private $client;

    public function __construct()
    {
        $this->client = app(Client::class);
    }

    /**
     * @param string $path
     * @return Crawler
     */
    public function getOriginalData(string $path): Crawler
    {
        $content = $this->client->get($path)->getBody()->getContents();
        $crawler = new Crawler();

        $crawler->addHtmlContent($content);

        return $crawler;
    }

    /**
     * @deprecated 
     */
    public function getNewAnimationFromBaHaDeprecated(Crawler $crawler)
    {
        $target = $crawler->filterXPath('//div[contains(@class, "newanime")]');

        return $target;
    }
}

CrawlerServiceTest.php

(... omitted)
    /**
     * @deprecated 
     */
    public function testGetNewAnimationFromBaHaDeprecated()
    {
        $crawler = $this->crawlerService->getOriginalData('https://ani.gamer.com.tw/');
        $target = $this->crawlerService->getNewAnimationFromBaHaDeprecated($crawler);

        dd($target);
    }

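Hmm. Based on what I concluded from the references above, what I dd here should be a whole bunch of frame A elements,
that is, everything matching the expression //div[contains(@class, "newanime")].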
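One thing worth knowing about this expression: contains(@class, "newanime") is a substring test on the whole class attribute, so it also matches any div whose class merely contains that text. The sketch below illustrates this with PHP's built-in DOMDocument and DOMXPath (not DomCrawler, so it runs without Composer). The markup and the class names newanime-wrap and featured are hypothetical, made up only for illustration.

```php
<?php
// Hypothetical markup: only the class name "newanime" comes from the article.
$html = <<<'HTML'
<div class="newanime-wrap">
  <div class="newanime">A1</div>
  <div class="newanime featured">A2</div>
  <div class="other">X</div>
</div>
HTML;

libxml_use_internal_errors(true); // keep parser notices quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Substring match: "newanime-wrap" also contains "newanime".
$loose = $xpath->query('//div[contains(@class, "newanime")]');

// Whole-token match: pad the class list with spaces and look for " newanime ".
$strict = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " newanime ")]'
);

echo $loose->length, "\n";  // 3
echo $strict->length, "\n"; // 2
```

If the looser match pulls in wrapper elements on the real page, the whole-token form is one way to narrow it down.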
https://ithelp.ithome.com.tw/upload/images/20171212/20107380yzehu1OG9q.png
Hmm... it's hard to tell what any of this is; there's just too much output!
So let's adjust things a little!
CrawlerServiceTest.php

(... omitted)
    /**
     * @deprecated 
     */
    public function testGetNewAnimationFromBaHaDeprecated()
    {
        $crawler = $this->crawlerService->getOriginalData('https://ani.gamer.com.tw/');
        $target = $this->crawlerService->getNewAnimationFromBaHaDeprecated($crawler);
        
        $target->each(function ($node) {
            print_r($node->html());
        });
    }

Ooh, now it's starting to look like something! Seems I understood it correctly!
https://ithelp.ithome.com.tw/upload/images/20171212/20107380eW67E1Dybx.png

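Taking the screenshot above as an example, sometimes the information we want (frame C, for instance) is not inside frame A, so we still need to check once more at the end!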

Great! Now that we have pulled out all the A frames, all that's left is to extract the information we want inside each!
Let's grab the date next!
CrawlerService.php

(... omitted)
    /**
     * @deprecated 
     */
    public function getNewAnimationFromBaHaDeprecated(Crawler $crawler)
    {
        $target = $crawler->filterXPath('//div[contains(@class, "newanime")]')
            ->each(function (Crawler $node) {
                // Return the value so each() collects it; without a return every entry would be null.
                return $this->getDateForNewAnimationFromBaHa($node);
            });

        return $target;
    }
    
    private function getDateForNewAnimationFromBaHa(Crawler $node)
    {
        return $node->filterXPath('//span[contains(@class, "newanime-date")]')
            ->each(function (Crawler $node) {
                return $node->text();
            });
    }
}

A quick note here: why do we need yet another each?
Shouldn't it be enough to write it like this?

    private function getDateForNewAnimationFromBaHa(Crawler $node)
    {
        return $node->filterXPath('//span[contains(@class, "newanime-date")]')->text();
    }

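Mainly because, remember what I said above about the information we want not always being inside frame A?
Written this way, text() throws an error whenever the filter matches nothing.
(That is only my guess at the cause, though; if anyone spots a mistake, please let me know right away.)
If we trace through the code, we find: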
Crawler.php

<?php
namespace Symfony\Component\DomCrawler;

use Symfony\Component\CssSelector\CssSelectorConverter;

/**
 * Crawler eases navigation of a list of \DOMNode objects.
 *
 * @author Fabien Potencier <fabien@symfony.com>
 */
class Crawler implements \Countable, \IteratorAggregate
{
    (... omitted)
    
    /**
     * Returns the node value of the first node of the list.
     *
     * @return string The node value
     *
     * @throws \InvalidArgumentException When current node is empty
     */
    public function text()
    {
        var_dump($this->nodes);    // added by me: the last dump before it dies is an empty array
        if (!$this->nodes) {
            throw new \InvalidArgumentException('The current node list is empty.');
        }

        return $this->getNode(0)->nodeValue;
    }
    
    (... omitted)
}
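The empty-list case can be reproduced in a small sketch. It uses PHP's built-in DOMXPath instead of DomCrawler so that it runs on its own. The markup and the helper textsOf are hypothetical, and textsOf only imitates the idea of each() returning an empty array when nothing matches. Note that plain DOMXPath needs a leading "." to stay inside the current block, whereas DomCrawler's filterXPath, as far as I can tell, already evaluates a leading // relative to the current nodes.

```php
<?php
// Hypothetical markup: the second block has no date span at all.
$html = <<<'HTML'
<div class="newanime"><span class="newanime-date">12/12</span></div>
<div class="newanime"><p>no date here</p></div>
HTML;

libxml_use_internal_errors(true); // keep parser notices quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

/** Collect the text of every match; zero matches simply yields []. */
function textsOf(DOMXPath $xpath, string $expr, DOMNode $context): array
{
    $texts = [];
    foreach ($xpath->query($expr, $context) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

$dates = [];
foreach ($xpath->query('//div[contains(@class, "newanime")]') as $block) {
    // ".//" keeps the search inside $block when using plain DOMXPath.
    $found = textsOf($xpath, './/span[contains(@class, "newanime-date")]', $block);
    // Reading the first element of an empty array gives null instead of an exception.
    $dates[] = $found[0] ?? null;
}

var_dump($dates); // first entry is "12/12", second entry is NULL
```

Collecting into an array first means a missing element shows up as null, which can be filtered out later, rather than stopping the whole crawl.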

Back to the main topic! Now we can bang out our function in no time!
CrawlerService.php

    /**
     * @param Crawler $crawler
     * @return array
     */
    public function getNewAnimationFromBaHa(Crawler $crawler): array
    {
        $target = $crawler->filterXPath('//div[contains(@class, "newanime")]')
            ->each(function (Crawler $node) {
                $date = $this->getDateForNewAnimationFromBaHa($node);
                $link = $this->getLinkForNewAnimationFromBaHa($node);
                $image = $this->getImageForNewAnimationFromBaHa($node);
                $info = $this->getInfoForNewAnimationFromBaHa($node);

                $response = [
                    'date' => array_first($date),
                    'directUri' => array_first($link),
                    'imagePath' => array_first($image),
                    'label' => array_first($info),
                ];
                return in_array(null, array_values($response), true) ? null : $response;
            });
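        // Drop incomplete records, then re-index so the result is a 0-based list
        // (array_filter keeps the original keys, so $target[0] might otherwise be missing).
        $target = array_values(array_filter($target, function ($d) {
            return null !== $d;
        }));
        return $target;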
    }
    
    private function getDateForNewAnimationFromBaHa(Crawler $node)
    {
        return $node->filterXPath('//span[contains(@class, "newanime-date")]')
            ->each(function (Crawler $node) {
                return $node->text();
            });
    }

    private function getLinkForNewAnimationFromBaHa(Crawler $node)
    {
        return $node->filterXPath('//a[contains(@class, "newanime__content")]')
            ->evaluate('string(@href)');
    }

    private function getImageForNewAnimationFromBaHa(Crawler $node)
    {
        return $node->filterXPath('//img[contains(@class, "lazyload")]')
            ->evaluate('string(@data-src)');
    }

    private function getInfoForNewAnimationFromBaHa(Crawler $node)
    {
        return $node->filterXPath('//p[contains(@class, "newanime-title")]')
            ->each(function (Crawler $node) {
                return $node->text();
            });
    }
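For readers who want to try the same idea without Laravel or DomCrawler, here is a rough, self-contained sketch of the flow above using only PHP's built-in DOM extension. The class names are taken from the code above, but the nesting, the attribute values, and the helper firstOrNull are assumptions made for illustration; the real page may be structured differently.

```php
<?php
// Hypothetical fixture: class names follow the article, everything else is assumed.
$html = <<<'HTML'
<div class="newanime">
  <a class="newanime__content" href="animeVideo.php?sn=1"><img class="lazyload" data-src="https://example.com/a.jpg"></a>
  <p class="newanime-title">Example Title</p>
  <span class="newanime-date">12/12</span>
</div>
<div class="newanime">
  <p class="newanime-title">Title without link</p>
</div>
HTML;

libxml_use_internal_errors(true); // keep parser notices quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

/** Evaluate $valueExpr on the first node matching $nodeExpr, or return null if none matches. */
function firstOrNull(DOMXPath $xpath, string $nodeExpr, string $valueExpr, DOMNode $context): ?string
{
    $node = $xpath->query($nodeExpr, $context)->item(0);
    if ($node === null) {
        return null;
    }
    return trim($xpath->evaluate($valueExpr, $node));
}

$records = [];
foreach ($xpath->query('//div[contains(@class, "newanime")]') as $block) {
    $record = [
        'date' => firstOrNull($xpath, './/span[contains(@class, "newanime-date")]', 'string(.)', $block),
        'directUri' => firstOrNull($xpath, './/a[contains(@class, "newanime__content")]', 'string(@href)', $block),
        'imagePath' => firstOrNull($xpath, './/img[contains(@class, "lazyload")]', 'string(@data-src)', $block),
        'label' => firstOrNull($xpath, './/p[contains(@class, "newanime-title")]', 'string(.)', $block),
    ];
    // Keep only blocks where every field was found.
    if (!in_array(null, $record, true)) {
        $records[] = $record;
    }
}

print_r($records); // one complete record; the second block is skipped
```

Appending with $records[] keeps the result a 0-based list, so no re-indexing step is needed in this variant.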

CrawlerServiceTest.php

    public function testGetNewAnimationFromBaHa()
    {
        $crawler = $this->crawlerService->getOriginalData('https://ani.gamer.com.tw/');
        $target = $this->crawlerService->getNewAnimationFromBaHa($crawler);

        $this->assertArrayHasKey('date', $target[0]);
        $this->assertArrayHasKey('directUri', $target[0]);
        $this->assertArrayHasKey('imagePath', $target[0]);
        $this->assertArrayHasKey('label', $target[0]);
    }
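One detail to keep in mind when a test reads $target[0]: PHP's array_filter preserves the original keys, so if the first block happens to be incomplete, index 0 no longer exists after filtering. A minimal sketch with made-up records (the values are placeholders, not real data):

```php
<?php
// Made-up records: null stands for a block that was missing some field.
$records = [
    null,
    ['date' => '12/12', 'label' => 'A'],
    null,
    ['date' => '12/11', 'label' => 'B'],
];

$filtered = array_filter($records, function ($r) {
    return null !== $r;
});
print_r(array_keys($filtered));  // keys 1 and 3: there is no index 0

// array_values re-indexes the survivors from 0.
$reindexed = array_values($filtered);
print_r(array_keys($reindexed)); // keys 0 and 1
```

Re-indexing before returning, or asserting on the first element via reset() in the test, are two ways to avoid a flaky failure here.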

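With that, the scraping of this season's new anime from 動畫瘋 is done; all that remains is to wire it up with the push notification from earlier, and we're finished!
That's about all the time we have, so let's stop here for today and do the rest tomorrow!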

