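Good evening, everyone! First, a quick recap.
To scrape this season's new anime from 巴哈動畫瘋, we are going to build a small crawler. The package we use is: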
composer require symfony/dom-crawler
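Its filterXPath method
is used like this. I spent some time last night reading up on it,
and here are a few sites I recommend as a starting point for getting to know XPath!
A rough understanding is enough to start writing a few examples. Without further ado, let's begin!
So we can start by writing a function like the following:
CrawlerService.php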
<?php

namespace App\Services;

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

class CrawlerService
{
    /** @var Client */
    private $client;

    public function __construct()
    {
        $this->client = app(Client::class);
    }

    /**
     * @param string $path
     * @return Crawler
     */
    public function getOriginalData(string $path): Crawler
    {
        $content = $this->client->get($path)->getBody()->getContents();
        $crawler = new Crawler();
        $crawler->addHtmlContent($content);

        return $crawler;
    }

    /**
     * @deprecated
     */
    public function getNewAnimationFromBaHaDeprecated(Crawler $crawler)
    {
        $target = $crawler->filterXPath('//div[contains(@class, "newanime")]');

        return $target;
    }
}
CrawlerServiceTest.php
(... omitted)
    /**
     * @deprecated
     */
    public function testGetNewAnimationFromBaHaDeprecated()
    {
        $crawler = $this->crawlerService->getOriginalData('https://ani.gamer.com.tw/');
        $target = $this->crawlerService->getNewAnimationFromBaHaDeprecated($crawler);
        dd($target);
    }
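Hmm. Going by what I concluded from the reference links above, what dd prints here should be a bunch of the box-A elements,
that is, the elements matching //div[contains(@class, "newanime")],
the expression we passed in.
Hmm... it's hard to tell what any of this is; there is too much output!
So let's adjust the test a little:
CrawlerServiceTest.php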
(... omitted)
    /**
     * @deprecated
     */
    public function testGetNewAnimationFromBaHaDeprecated()
    {
        $crawler = $this->crawlerService->getOriginalData('https://ani.gamer.com.tw/');
        $target = $this->crawlerService->getNewAnimationFromBaHaDeprecated($crawler);
        // Print the inner HTML of every matched node instead of dumping the whole Crawler.
        $target->each(function ($node) {
            print_r($node->html());
        });
    }
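Oh, now it's starting to look like something! Looks like I understood it correctly!
Taking the screenshot above as an example, sometimes the information we want (box C, for instance) is not inside a box-A element, so we still need to check once more at the end!
Good! Now that we have pulled out the box-A elements, all that's left is to extract the information we want inside each.
Next, let's grab the date!
CrawlerService.php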
(... omitted)
    /**
     * @deprecated
     */
    public function getNewAnimationFromBaHaDeprecated(Crawler $crawler)
    {
        $target = $crawler->filterXPath('//div[contains(@class, "newanime")]')
            ->each(function (Crawler $node) {
                $date = $this->getDateForNewAnimationFromBaHa($node);

                // Return the value so each() collects it instead of null.
                return $date;
            });

        return $target;
    }

    private function getDateForNewAnimationFromBaHa(Crawler $node)
    {
        return $node->filterXPath('//span[contains(@class, "newanime-date")]')
            ->each(function (Crawler $node) {
                return $node->text();
            });
    }
}
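A quick note here: why add another each?
Shouldn't it be enough to just write it like this?
    private function getDateForNewAnimationFromBaHa(Crawler $node)
    {
        return $node->filterXPath('//span[contains(@class, "newanime-date")]')->text();
    }
Mainly because, remember what I said above about the information we want not always being inside a box-A element?
Written this way, text() gets called on an empty node list for those elements and throws an error.
(This is only my guess at the cause, though. If anyone notices I got it wrong, please point it out right away.)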
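One possible explanation for matched elements that have no date inside them: contains(@class, "newanime") is a substring test, so it also matches class names that merely start with or include "newanime". The sketch below shows this with PHP's built-in DOMXPath rather than the Crawler, so it runs without any package. The markup is made up (the real page structure is not shown in this post), and the class names newanime-wrap and newanime-block are hypothetical.

```php
<?php
// Hypothetical markup, only for illustrating how contains() behaves.
$html = <<<'HTML'
<div class="newanime-wrap">
  <div class="newanime">
    <span class="newanime-date">10/01</span>
  </div>
  <div class="newanime-block"></div>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Substring test: matches every div whose class attribute contains "newanime".
$loose = $xpath->query('//div[contains(@class, "newanime")]');

// Whole-token test: matches only divs that carry the exact class "newanime".
$strict = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " newanime ")]'
);

echo $loose->length, PHP_EOL;  // 3
echo $strict->length, PHP_EOL; // 1
```

If the real page behaves like this, the extra matches are simply wrapper or sibling divs, which would explain why some of them contain no date. That is an assumption, not something verified against the site.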
If we trace through the code, we find the following:
Crawler.php
<?php

namespace Symfony\Component\DomCrawler;

use Symfony\Component\CssSelector\CssSelectorConverter;

/**
 * Crawler eases navigation of a list of \DOMNode objects.
 *
 * @author Fabien Potencier <fabien@symfony.com>
 */
class Crawler implements \Countable, \IteratorAggregate
{
    (... omitted)
    /**
     * Returns the node value of the first node of the list.
     *
     * @return string The node value
     *
     * @throws \InvalidArgumentException When current node is empty
     */
    public function text()
    {
        var_dump($this->nodes); // Added by me: right before it dies, this dumps an empty array
        if (!$this->nodes) {
            throw new \InvalidArgumentException('The current node list is empty.');
        }
        return $this->getNode(0)->nodeValue;
    }
    (... omitted)
}
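To make the "check before you read" idea concrete, here is a minimal sketch of the same guard using PHP's built-in DOMXPath instead of the Crawler, so it runs without any package. The helper name firstTextOrNull, the markup, and the class names item and date are all hypothetical. Note that with plain DOMXPath the inner query needs a leading ".//" plus a context node to stay inside the current item.

```php
<?php
/**
 * Hypothetical helper: returns the text of the first match, or null if nothing matched.
 */
function firstTextOrNull(DOMXPath $xpath, string $expression, DOMNode $context): ?string
{
    $nodes = $xpath->query($expression, $context);
    if ($nodes === false || $nodes->length === 0) {
        return null; // nothing matched, so there is no text to read
    }

    return trim($nodes->item(0)->textContent);
}

// Made-up markup: the second item has no date span.
$html = <<<'HTML'
<div class="item"><span class="date">10/01</span></div>
<div class="item"><p>no date here</p></div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$dates = [];
foreach ($xpath->query('//div[@class="item"]') as $item) {
    // ".//" keeps the search inside the current item.
    $dates[] = firstTextOrNull($xpath, './/span[@class="date"]', $item);
}

var_dump($dates); // first entry is "10/01", second is null
```

With the Crawler itself, the analogous guard would be to check count() before calling text(), since the class implements \Countable as shown above. Using each(), as this post does, sidesteps the problem in a different way: an empty node list just yields an empty array.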
Back to the main topic! Now we can quickly knock out our function:
CrawlerService.php
    /**
     * @param Crawler $crawler
     * @return array
     */
    public function getNewAnimationFromBaHa(Crawler $crawler): array
    {
        $target = $crawler->filterXPath('//div[contains(@class, "newanime")]')
            ->each(function (Crawler $node) {
                $date = $this->getDateForNewAnimationFromBaHa($node);
                $link = $this->getLinkForNewAnimationFromBaHa($node);
                $image = $this->getImageForNewAnimationFromBaHa($node);
                $info = $this->getInfoForNewAnimationFromBaHa($node);

                // array_first() is a Laravel helper; it returns null for an empty array.
                $response = [
                    'date' => array_first($date),
                    'directUri' => array_first($link),
                    'imagePath' => array_first($image),
                    'label' => array_first($info),
                ];

                // Discard nodes that are missing any of the four fields.
                return in_array(null, array_values($response), true) ? null : $response;
            });

        // array_filter() preserves keys, so reindex to guarantee that $target[0] exists.
        $target = array_values(array_filter($target, function ($d) {
            return null !== $d;
        }));

        return $target;
    }

    private function getDateForNewAnimationFromBaHa(Crawler $node)
    {
        return $node->filterXPath('//span[contains(@class, "newanime-date")]')
            ->each(function (Crawler $node) {
                return $node->text();
            });
    }

    private function getLinkForNewAnimationFromBaHa(Crawler $node)
    {
        return $node->filterXPath('//a[contains(@class, "newanime__content")]')
            ->evaluate('string(@href)');
    }

    private function getImageForNewAnimationFromBaHa(Crawler $node)
    {
        return $node->filterXPath('//img[contains(@class, "lazyload")]')
            ->evaluate('string(@data-src)');
    }

    private function getInfoForNewAnimationFromBaHa(Crawler $node)
    {
        return $node->filterXPath('//p[contains(@class, "newanime-title")]')
            ->each(function (Crawler $node) {
                return $node->text();
            });
    }
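A small caveat about the link and image helpers, which use string(@href) and string(@data-src): in XPath 1.0, string() applied to an empty node-set yields an empty string. So an element that matches but lacks the attribute would produce "" rather than null, and a strict null check would not filter it out. Whether that ever happens on the real page is not verified here. The sketch below only demonstrates the XPath behaviour with PHP's built-in DOMXPath; the markup, the card class, and the URL are made up.

```php
<?php
// Hypothetical markup: the second link has no href attribute.
$html = '<a class="card" href="/anime/1">A</a><a class="card">B</a>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$hrefs = [];
foreach ($xpath->query('//a[@class="card"]') as $a) {
    // string(@href) converts the attribute node-set to a string;
    // when the attribute is absent, the result is "".
    $hrefs[] = $xpath->evaluate('string(@href)', $a);
}

var_dump($hrefs);                        // ["/anime/1", ""]
var_dump(in_array(null, $hrefs, true));  // false: "" is not null
```

If empty attributes turn out to matter, one option would be to treat "" the same as null before building the response array.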
CrawlerServiceTest.php
    public function testGetNewAnimationFromBaHa()
    {
        $crawler = $this->crawlerService->getOriginalData('https://ani.gamer.com.tw/');
        $target = $this->crawlerService->getNewAnimationFromBaHa($crawler);

        $this->assertArrayHasKey('date', $target[0]);
        $this->assertArrayHasKey('directUri', $target[0]);
        $this->assertArrayHasKey('imagePath', $target[0]);
        $this->assertArrayHasKey('label', $target[0]);
    }
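One thing to keep in mind when a test reads $target[0]: array_filter() keeps the original keys, so index 0 only exists after filtering if the first element survived or the result was reindexed with array_values(). A minimal sketch with made-up rows:

```php
<?php
// Made-up data: the first and third entries were rejected (null).
$rows = [null, ['date' => '10/01'], null, ['date' => '10/02']];

$filtered = array_filter($rows, function ($row) {
    return null !== $row;
});
var_dump(array_keys($filtered));  // [1, 3]: index 0 is gone

$reindexed = array_values($filtered);
var_dump(array_keys($reindexed)); // [0, 1]
```

So if the very first matched div happens to be one of the incomplete ones, $target[0] would be undefined unless the array is reindexed first.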
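With that, the part of our requirements that scrapes this season's new anime from 巴哈動畫瘋 is done. All that's left is to hook it up to the push notifications from earlier, and we're finished!
That's about all the time we have, so let's stop here for today. We'll do the rest tomorrow!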