12. Effect in Practice 2: A Simple Web Scraper

2025 iThome Ironman Contest

DAY 13

Software Development

Effect Magic: Building Unbreakable Applications, part 13 of the series

17th Ironman Contest, typescript, effect, functional programming

DanSnow

2025-09-27 00:57:42


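This post shows how to build a simple web scraper with Effect. Simple as it is, we want it to be a "good" scraper: it should not flood the server with requests, and it should add delays where appropriate. A scraper that behaves this way is less likely to get you banned and puts less load on the server. Since this series is being published for the iThelp Ironman Contest, we will scrape iThelp's technical articles. The goal is to fetch one page of the technical article list and convert every article on it to markdown. In the age of AI, markdown is well on its way to becoming the standard data interchange format for AI.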

Besides Effect, we will use quite a few other packages. Here is a quick introduction:

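unstorage: a cache package with an API similar to the browser's local storage
ofetch: a more convenient fetch API
ohash: computes hashes, used here to build cache keys
linkedom: a simple DOM implementation, used here as an HTML parser
@mozilla/readability: made by Mozilla, extracts the main article block from a web page
turndown: converts HTML to markdown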

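As an aside, three of the six packages above (unstorage, ofetch, and ohash) come from unjs, a collection of small packages that the Nuxt.js team extracted while building Nuxt.js. I personally like their packages a lot and recommend browsing them when you get the chance; you may find something that helps your own project.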

The complete code for this post is at https://github.com/DanSnow/ithelp-2025-ironman-sample-codes/tree/main/ithelp-scraper

Fetching the list

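The first step is to get the article list. There is an RSS feed, but this time we will work with the HTML directly. We start with a thin wrapper around ofetch's $fetch so that it is easier to use later.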

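import { Effect, pipe } from "effect";
import { $fetch, type FetchError, type FetchOptions } from "ofetch";

// Wrap ofetch's $fetch in an Effect whose failure channel carries a FetchError
const $fetchText = Effect.fn("$fetch")(
  (url: string, options?: FetchOptions<"text">) =>
    Effect.tryPromise({
      // Effect passes in an AbortSignal, so interrupting the Effect also aborts the request
      try: (signal) => $fetch(url, { ...options, signal }),
      // This cast assumes every rejection coming out of $fetch is a FetchError
      catch: (err) => err as FetchError,
    })
);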

Effect.fn is used here. It may look no different from writing a plain function, but we will cover what it can do in detail in a later post.

Now we can use $fetchText to fetch the list:

pipe(
  // Fetch the article list page with the $fetchText defined above
  $fetchText("https://ithelp.ithome.com.tw/articles?tab=tech"),
  // Effect.tap lets us inspect the previous Effect's result without changing the value passed along
  Effect.tap((html) => {
    console.log(html);
  }),
  Effect.runPromise
);

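That is all it takes. Next we want to pull the article titles and links out of the list page. Before that, though, writing a scraper usually involves a lot of trial and error, so let's first cache whatever we fetch.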

Building the cache

We wrap unstorage in the same simple way to make it convenient to use:

import { createStorage, type Driver } from "unstorage";
import fs from "unstorage/drivers/fs-lite";

class Cache extends Effect.Service<Cache>()("Cache", {
  accessors: true,
  // Stored in the `.cache` directory by default
  effect: (driver: Driver = fs({ base: ".cache" })) =>
    Effect.gen(function* () {
      const storage = createStorage<string>({ driver });

      return {
        getItem: (key: string) => Effect.promise(() => storage.getItem(key)),
        setItem: (key: string, value: string) =>
          Effect.promise(() => storage.setItem(key, value)),
      };
    }),
}) {}

Next we wrap $fetchText once more, this time adding the cache:

import { hash } from "ohash";

const $fetchTextWithCache = Effect.fn("$fetchTextWithCache")(function* (
  url: string,
  options?: FetchOptions<"text">
) {
  // The URL is deliberately hashed before it is used as the key: some special characters have special meaning in unstorage keys, and hashing avoids problems with special characters in URLs
  const key = hash(url);
  const cacheItem = yield* Cache.getItem(key);
  if (cacheItem) {
    // Log so we can tell when the cache was hit
    console.log(url, "cache hit");
    return cacheItem;
  }
  const html = yield* $fetchText(url, options);
  yield* Cache.setItem(key, html);
  return html;
});

Swap the original $fetchText for this version and run the program twice; the second run should print cache hit.

pipe(
  $fetchTextWithCache("https://ithelp.ithome.com.tw/articles?tab=tech"),
  Effect.tap((html) => {
    console.log(html);
  }),
  Effect.provide(Cache.Default()),
  Effect.runPromise
);
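
To make the flow of $fetchTextWithCache easier to see in isolation, here is a minimal sketch of the same cache-aside idea in plain TypeScript, with no Effect, unstorage, or ohash involved. Everything in it is an assumption made for illustration: `fnv1aKey` is a hand-rolled 32-bit FNV-1a hash standing in for ohash's `hash`, the `Map` stands in for the storage driver, and `cachedFetch`, `fakeFetcher`, and `demoCache` are hypothetical names.

```typescript
// Hypothetical stand-in for ohash's `hash`: a 32-bit FNV-1a hash rendered as
// 8 hex characters, so the key never contains "/", "?", ":" or other URL characters.
function fnv1aKey(input: string): string {
  let h = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 unsigned bits
  }
  return h.toString(16).padStart(8, "0");
}

// Hypothetical cache-aside helper: look in the store first, call `fetcher` only on a miss.
async function cachedFetch(
  url: string,
  fetcher: (url: string) => Promise<string>,
  store: Map<string, string>
): Promise<string> {
  const key = fnv1aKey(url);
  const hit = store.get(key);
  if (hit !== undefined) return hit;
  const body = await fetcher(url);
  store.set(key, body);
  return body;
}

// Usage with a fake fetcher so nothing touches the network.
async function demoCache(): Promise<number> {
  const store = new Map<string, string>();
  let calls = 0;
  const fakeFetcher = async (url: string) => {
    calls++;
    return `<html>${url}</html>`;
  };
  await cachedFetch("https://example.com/a?x=1", fakeFetcher, store);
  await cachedFetch("https://example.com/a?x=1", fakeFetcher, store);
  return calls; // the second call is served from the store
}
```

A 32-bit hash is fine for a sketch but collides far more easily than a real hashing library would, so it should not replace ohash in the actual scraper.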

Extracting the article links

iThelp's CSS class names are really well chosen, so extracting the article links is not hard at all; a single selector is enough.

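// parseHTML comes from linkedom (its import appears in the content extraction snippet later on)
function parseArticleLinks(html: string) {
  const { document } = parseHTML(html);
  // Every article title on the list page is an <a> with this class
  const links = document.querySelectorAll("a.qa-list__title-link");
  const articles = Array.from(links).map((link) => ({
    title: link.textContent,
    url: link.href,
  }));
  console.log(articles);
  return articles;
}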

pipe(
  $fetchTextWithCache("https://ithelp.ithome.com.tw/articles?tab=tech"),
  Effect.tap((html) => parseArticleLinks(html)),
  Effect.provide(Cache.Default()),
  Effect.runPromise
);

Here I am using a small package called typed-query-selector. It gives querySelector a return type that depends on the element named in the selector, which is quite handy.

After running the code above you should see a list of articles. At this point we are already halfway there.
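
The raw `title` and `url` values come straight from the DOM, so it can help to clean them up before fetching. The following is a minimal sketch in plain TypeScript, assuming that some hrefs could be relative and that titles could carry stray whitespace (the iThelp list page may well use absolute links already). `normalizeLinks` and `ArticleLink` are hypothetical names, and the article paths in the sample data are made up.

```typescript
interface ArticleLink {
  title: string;
  url: string;
}

// Hypothetical helper: resolve each href against the page URL, trim titles, drop duplicates.
function normalizeLinks(
  raw: ReadonlyArray<{ title: string | null; href: string }>,
  pageUrl: string
): ArticleLink[] {
  const seen = new Set<string>();
  const result: ArticleLink[] = [];
  for (const { title, href } of raw) {
    // The URL constructor resolves relative hrefs against the base and keeps absolute ones
    const url = new URL(href, pageUrl).href;
    if (seen.has(url)) continue;
    seen.add(url);
    result.push({ title: (title ?? "").trim(), url });
  }
  return result;
}

// Made-up sample data: a relative link, a duplicate of it in absolute form, and a link without a title
const sampleLinks = normalizeLinks(
  [
    { title: "  First post \n", href: "/articles/1" },
    { title: "First post", href: "https://ithelp.ithome.com.tw/articles/1" },
    { title: null, href: "https://ithelp.ithome.com.tw/articles/2" },
  ],
  "https://ithelp.ithome.com.tw/articles?tab=tech"
);
console.log(sampleLinks);
```

Dropping duplicates here also means the same article is never requested twice within one run, independent of the cache.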

Fetching the article content

All we need to do is pass the links we collected back to $fetchTextWithCache to get the content of each article. We can use Effect.all with a concurrency limit to avoid sending too many requests at once.

pipe(
  $fetchTextWithCache("https://ithelp.ithome.com.tw/articles?tab=tech"),
  Effect.map((html) => parseArticleLinks(html)),
  Effect.flatMap((articles) =>
    Effect.all(
      articles.map((article) => $fetchTextWithCache(article.url)),
      { concurrency: 2 }
    )
  ),
  Effect.provide(Cache.Default()),
  Effect.runPromise
);

You can also use Effect.delay to throttle further. And if you run the program a second time, you will notice it is much faster, because the cache is doing its job.
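
For readers who want to see what a concurrency limit plus a delay amounts to, here is a rough plain-TypeScript analogue. It is a sketch of the idea only, not how Effect implements `concurrency` or `Effect.delay`; `mapPolitely`, `sleep`, and `demoPool` are hypothetical names, and pausing after each task (rather than before it) is my own choice.

```typescript
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: run `task` over `items` with at most `limit` tasks in flight,
// pausing `delayMs` after each task before the same worker picks up the next item.
async function mapPolitely<A, B>(
  items: readonly A[],
  limit: number,
  delayMs: number,
  task: (item: A) => Promise<B>
): Promise<B[]> {
  const results = new Array<B>(items.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so two workers never take the same item
      results[index] = await task(items[index] as A);
      if (delayMs > 0) await sleep(delayMs);
    }
  };
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results; // same order as `items`
}

// Usage with a fake task that records how many tasks run at the same time.
async function demoPool(): Promise<{ results: number[]; maxInFlight: number }> {
  let inFlight = 0;
  let maxInFlight = 0;
  const results = await mapPolitely([1, 2, 3, 4, 5], 2, 5, async (n) => {
    inFlight++;
    maxInFlight = Math.max(maxInFlight, inFlight);
    await sleep(10); // pretend this is a network request
    inFlight--;
    return n * 10;
  });
  return { results, maxInFlight };
}
```

With a limit of 2, each worker alternates between one request and one pause, so the server never sees more than two requests from the scraper at a time.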

Extracting the article content and converting it to markdown

This part has little to do with Effect. We combine three packages, linkedom + @mozilla/readability + turndown, to convert the article content to markdown.

import { Readability } from "@mozilla/readability";
import Turndown from "turndown";
import { parseHTML } from "linkedom";

const turndown = new Turndown();

function extractContent(html: string) {
  // First turn the HTML into a DOM
  const { document } = parseHTML(html);
  // Hand the DOM to Readability to extract the main article content
  const parser = new Readability(document);
  const parsed = parser.parse();
  // Convert the extracted content to markdown with turndown (empty string if nothing was extracted)
  const markdown = turndown.turndown(parsed?.content ?? "");
  console.log(markdown);
  return markdown;
}

The final flow looks like this:

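// Effect's Array module is aliased so that it does not shadow the global Array,
// which parseArticleLinks relies on for Array.from when everything lives in one file
import { Array as Arr, Effect, pipe } from "effect";

pipe(
  $fetchTextWithCache("https://ithelp.ithome.com.tw/articles?tab=tech"),
  Effect.map((html) => parseArticleLinks(html)),
  Effect.flatMap((articles) =>
    pipe(
      articles,
      // Build one Effect per article: fetch it, then convert it to markdown
      Arr.map((article) =>
        pipe(
          $fetchTextWithCache(article.url),
          Effect.map((html) => extractContent(html))
        )
      ),
      // Run them with at most two in flight at a time
      Effect.allWith({ concurrency: 2 })
    )
  ),
  Effect.provide(Cache.Default()),
  Effect.runPromise
);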