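Over the next few days we will bring the basic API development to a milestone. Before that, though, we need real Podcast data in addition to the test data, so today we will start implementing a Podcast feed crawler that updates on a schedule.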
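According to this explanation, in short:
Podcasts are actually not that complicated. Podcast distribution mostly relies on the Podcast RSS feed, which lets listeners pick up the latest episodes.
We only need to refresh each podcast feed on a schedule to get the latest information.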
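To make the idea concrete, here is a minimal sketch of what "reading a feed" amounts to, using PHP's built-in SimpleXML rather than the parser package used later in this post. The feed content, the `example.com` URLs, and the `readFeed` helper are all made up for illustration; real feeds carry many more tags.

```php
<?php

declare(strict_types=1);

// Hand-written sample feed, used only for illustration (not a real podcast).
$xml = <<<'XML'
<rss version="2.0">
  <channel>
    <title>Example Show</title>
    <language>en-us</language>
    <item>
      <title>Episode 2</title>
      <guid>example-guid-2</guid>
      <enclosure url="https://example.com/audio/2.mp3" type="audio/mpeg" length="2048"/>
    </item>
    <item>
      <title>Episode 1</title>
      <guid>example-guid-1</guid>
      <enclosure url="https://example.com/audio/1.mp3" type="audio/mpeg" length="1024"/>
    </item>
  </channel>
</rss>
XML;

/**
 * Extract the channel title and a flat list of episodes from RSS 2.0 markup.
 * Hypothetical helper: the command in this post uses a parser package instead.
 */
function readFeed(string $xml): array
{
    $rss = new SimpleXMLElement($xml);
    $episodes = [];

    foreach ($rss->channel->item as $item) {
        $episodes[] = [
            'guid'  => (string) $item->guid,
            'title' => (string) $item->title,
            'audio' => (string) $item->enclosure['url'], // attribute of <enclosure>
        ];
    }

    return ['title' => (string) $rss->channel->title, 'episodes' => $episodes];
}

$feed = readFeed($xml);
echo $feed['title'], ': ', count($feed['episodes']), " episodes\n";
```

Fetching the same URL again later and re-reading the `<item>` entries is all it takes to notice new episodes.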
The feed update pipeline is made up of three main components. We will start by building the Data processor and a temporary Feed crawler.
First, we can search a few well-known Podcast indexing platforms for feeds to import.
I used the lukaswhite/podcast-feed-parser package to help parse the contents of the feed XML.
composer require lukaswhite/podcast-feed-parser
I started with a Laravel command to test importing and parsing the feed data.
First, define the main flow in the handle method:
public function handle(): void
{
    // Feeds to import; hard-coded while this command is only a temporary crawler.
    $this->feeds = [
        'https://feeds.simplecast.com/54nAGcIl', // The Daily, By: The New York Times
        'https://www.thisamericanlife.org/podcast/rss.xml', // This American Life
        'https://feeds.simplecast.com/PxEW_ipK', // Office Ladies
        'https://anchor.fm/s/599522d0/podcast/rss', // Lex Fridman Podcast | 5 minute podcast summaries
        'https://anchor.fm/s/8c1524bc/podcast/rss', // Y Combinator
    ];

    $bar = $this->output->createProgressBar(count($this->feeds));
    $bar->start();

    foreach ($this->feeds as $key => $feed) {
        // Fetch (or read from cache), parse, then persist each feed in turn.
        $data = $this->getPodcastData($this->getCurrentPodcastFeed($key));
        $this->saveToDatabase($this->parseData($data), $this->getCurrentPodcastFeed($key));
        $bar->advance();
    }

    $bar->finish();
}
Next come the methods for fetching the podcast feed data, parsing it, and storing the structured data in the database:
protected function getCurrentPodcastFeed(int $itemKey): string
{
    return $this->feeds[$itemKey];
}

protected function getPodcastData(string $feedLocation): string
{
    // Cache the raw feed body for an hour, keyed by a hash of the feed URL.
    $cacheKey = hash('sha3-256', $feedLocation);

    if (Cache::has($cacheKey)) {
        return Cache::get($cacheKey);
    }

    $response = Http::get($feedLocation)->body();

    return tap(
        $response,
        static fn ($content) => Cache::put($cacheKey, $content, 3600)
    );
}
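The fetch method above follows a cache-or-fetch pattern. As a rough, framework-free sketch of that logic, the snippet below uses a plain array in place of the Laravel cache store and passes the current time in explicitly so that expiry is easy to see. The `rememberFeed` name, the `expires_at` field, and the fake `$fetch` closure are assumptions made for this example only.

```php
<?php

declare(strict_types=1);

/**
 * Return the cached body for $url if it is still fresh; otherwise call $fetch
 * and store the result for $ttl seconds. $cache is a plain array standing in
 * for a real cache store (hypothetical helper, not part of Laravel).
 */
function rememberFeed(array &$cache, string $url, int $ttl, int $now, callable $fetch): string
{
    $key = hash('sha3-256', $url); // same key derivation as the command

    if (isset($cache[$key]) && $cache[$key]['expires_at'] > $now) {
        return $cache[$key]['body'];
    }

    $body = $fetch($url);
    $cache[$key] = ['body' => $body, 'expires_at' => $now + $ttl];

    return $body;
}

$cache = [];
$calls = 0;

// Stand-in for the HTTP request; counts how often we would hit the network.
$fetch = function (string $url) use (&$calls): string {
    $calls++;

    return "<rss><!-- body of {$url} --></rss>";
};

rememberFeed($cache, 'https://example.com/feed.xml', 3600, 1000, $fetch); // miss: fetches
rememberFeed($cache, 'https://example.com/feed.xml', 3600, 2000, $fetch); // hit: served from cache
rememberFeed($cache, 'https://example.com/feed.xml', 3600, 5000, $fetch); // expired: fetches again

echo "fetch calls: {$calls}\n";
```

While testing the import repeatedly, this keeps us from downloading the same feed on every run.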
/**
 * @throws Exception
 */
protected function parseData(string $content): Podcast
{
    return (new Parser())
        ->setContent($content)
        ->run();
}
/**
 * @throws \Throwable
 */
protected function saveToDatabase(Podcast $podcast, string $feedLocation): void
{
    DB::transaction(function () use ($podcast, $feedLocation) {
        // Channels are matched by title, so re-running the import updates the existing row.
        $channel = Channel::query()->updateOrCreate(
            [
                'title' => $podcast->getTitle(),
            ],
            [
                'locale' => Str::lower($podcast->getLanguage()),
                'cover_image' => self::replaceHttpWithHttps($podcast->getImage()?->getUrl() ?? $podcast->getArtwork()->getUri()),
                'slug' => Str::slug($podcast->getTitle()),
                'metadata' => [
                    'sub_title' => $podcast->getSubtitle(),
                    'summary' => $podcast->getDescription(),
                    'owner' => [
                        'name' => $podcast->getOwner()->getName(),
                        'email' => $podcast->getOwner()->getEmail(),
                    ],
                    'categories' => collect($podcast->getCategories())
                        ->transform(function (Category $category) {
                            return [
                                'name' => $category->getName(),
                                'type' => $category->getType(),
                                'children' => [...array_keys($category->getChildren())],
                            ];
                        }),
                    'copyright' => ! empty($podcast->getCopyright()) ? $podcast->getCopyright() : null,
                ],
                'source' => [
                    'origin' => $podcast->getNewFeedUrl() ?? $feedLocation,
                    ...($podcast->getRawvoiceSubscribe()?->getLinks() ?? []),
                ],
                'status' => ChannelStatus::published,
            ]
        );

        collect($podcast->getEpisodes())
            ->each(function (Episode $episode) use ($channel) {
                // Skip episodes that have no playable media URL.
                if (is_null($episode->getMedia()) || is_null($episode->getMedia()->getUri())) {
                    return;
                }

                // Episodes are matched per channel by the hash of their guid.
                \App\Models\Episode::query()->updateOrCreate(
                    [
                        'channel_id' => $channel->id,
                        'guid_hash' => hash('sha3-256', $episode->getGuid()),
                    ],
                    [
                        'guid' => $episode->getGuid(),
                        'title' => $episode->getTitle(),
                        'metadata' => [
                            'sub_title' => $episode->getSubtitle(),
                            'summary' => $episode->getDescription(),
                            'artwork' => $episode->getArtwork() ? self::replaceHttpWithHttps($episode->getArtwork()->getUri()) : null,
                            'link' => $episode->getLink(),
                        ],
                        'stream_url' => [self::replaceHttpWithHttps($episode->getMedia()->getUri())],
                        'published_at' => $episode->getPublishedDate(),
                        'status' => EpisodeStatus::published,
                    ]
                );
            });
    });
}
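The episode upsert above is keyed by the channel id plus a hash of the episode guid, which is what makes re-running the import safe. Below is a small sketch of that idea with a plain array standing in for the episodes table. The `upsertEpisode` helper and the sample guids are hypothetical, and my reading of why the guid is hashed (guids are free-form strings of arbitrary length, while the hash is always a fixed length) is an assumption rather than something the code states.

```php
<?php

declare(strict_types=1);

/**
 * Insert or update an episode row in $table, keyed by channel id plus a hash
 * of the feed's guid. Returns the key that was written.
 * Hypothetical helper standing in for Eloquent's updateOrCreate.
 */
function upsertEpisode(array &$table, int $channelId, string $guid, array $attributes): string
{
    // SHA3-256 always yields 64 hex characters, whatever the guid's length.
    $key = $channelId . ':' . hash('sha3-256', $guid);

    // Existing values are kept unless the new attributes overwrite them.
    $table[$key] = array_merge($table[$key] ?? [], ['guid' => $guid], $attributes);

    return $key;
}

$episodes = [];
$guid = 'https://example.com/episodes/1?utm=feed'; // guids are often long URLs

$first = upsertEpisode($episodes, 1, $guid, ['title' => 'Pilot']);
$again = upsertEpisode($episodes, 1, $guid, ['title' => 'Pilot (remastered)']); // same row, updated
$other = upsertEpisode($episodes, 2, $guid, ['title' => 'Pilot']);              // other channel, new row

echo count($episodes), " rows\n";
```

Importing the same feed twice therefore updates the existing rows instead of duplicating them, and the same guid can still appear under two different channels.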
protected static function replaceHttpWithHttps(string $url): string
{
    return str_replace('http://', 'https://', $url);
}
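One thing to keep in mind about the helper above: `str_replace` rewrites every `http://` in the string, not just the scheme at the front. If a media URL ever embeds another URL (for example in a redirect parameter), a stricter variant could look like the sketch below. The `upgradeScheme` name and the sample tracker URL are made up for illustration, and whether the embedded URL should be left alone is an assumption about what we want.

```php
<?php

declare(strict_types=1);

/**
 * Upgrade only a leading http:// scheme to https://, leaving the rest of the
 * URL (including any http:// embedded in the query string) untouched.
 * Hypothetical alternative to the helper used in the command.
 */
function upgradeScheme(string $url): string
{
    // preg_replace returns null on a regex error; fall back to the input then.
    return preg_replace('#^http://#i', 'https://', $url) ?? $url;
}

$tracked = 'http://tracker.example.com/redirect?to=http://cdn.example.com/ep1.mp3';

echo str_replace('http://', 'https://', $tracked), "\n"; // rewrites both occurrences
echo upgradeScheme($tracked), "\n";                      // rewrites only the leading scheme
```

For the feeds imported in this post the simple version is enough, so this is only a possible refinement.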
With that, real data can be imported into the database, giving us real podcasts to test against, which are a far better reference than fake data!