iT邦幫忙

2025 iThome Ironman Contest

DAY 9

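This post is not 100% pure tech, because I secretly sprinkled in a little magic dust.
Code is the incantation, the workflow is the magic circle, and error messages are dark curses.
If you only want a dry code walkthrough, this may not be the place for you; but if you are willing to treat engineering as an adventure, step right in.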


Preface

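Yesterday I explored the magic gate of email out in the wilderness; today I head deep into the enchanted forest to teach this Email Pipeline to generate a daily digest automatically and fly it precisely into every subscriber's inbox.

This quest is not just about fetching papers or sending mail. Every subscriber should receive their own distilled wisdom, as if the scroll could tell a qualified wizard from one who still needs more practice before opening it. Those who fail see only scrambled HTML tags and the 404 dark magic.

The scroll in my hand glows faintly. It is not just a collection of data but a secret passage to a treasury of knowledge.
I shoulder my pack and get ready to decipher the runes still unread on the scroll.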


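1. Subscriber list: the scroll's magical roster get_subscribed_users

Each user is a magic symbol on the scroll, carrying:

  • email (delivery magic)
  • user_language (language magic)
  • translate preference (translation magic: don't push English summaries onto classmates who only read Chinese 😅)

More "tailor-made" magic can be unlocked later, for example style magic.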

def get_subscribed_users(db: Session) -> List[Dict]:
    """Return one settings dict per user who opted in to the email digest."""
    subscribed_users = []
    try:
        # Join users with user_setting and keep only email subscribers
        query = (
            db.query(User, UserSetting)
            .join(UserSetting, User.id == UserSetting.user_id)
            .filter(UserSetting.subscribe_email)
        )
        for user, setting in query.all():
            try:
                email = get_user_email_from_firebase(user.id)
            except HTTPException:
                logger.warning(f"Skipping user {user.id}, no email found")
                continue

            subscribed_users.append(
                {
                    "user_id": user.id,
                    "email": email,
                    "translate": setting.translate,
                    "user_language": setting.user_language,
                    "temperature": setting.temperature,
                    "system_prompt": setting.system_prompt,
                    "top_k": setting.top_k,
                }
            )
    except Exception:
        # Log and re-raise so the caller never mistakes a failed query for an empty list
        logger.exception("Failed to load subscribed users")
        raise

    return subscribed_users

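2. Fetching papers: the newest region of the adventure map fetch_new_papers

Fetch the processed papers published on or after since_date, so that older papers are not pulled again. Each paper carries fields such as title, authors, abstract, pdf_url and published_date, which fetch_paper_info reads later.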

def fetch_new_papers(db: Session, since_date=None, limit: int = 500) -> list[Paper]:
    if since_date is None:
        since_date = datetime.utcnow() - timedelta(days=30)  # default: look back 30 days
    papers = (
        db.query(Paper)
        .filter(and_(Paper.pdf_processed, Paper.published_date >= since_date.date()))
        .limit(limit)
        .all()
    )
    return papers

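Worried the batch is too large? No need: limit works like a safety catch on the scroll, capping each run at 500 papers so the magic does not blow up.

What does with db_session() as db do?

Wrapping database work in with db_session() makes the connection handling safer.
db_session() is a context manager that automatically takes care of:

  1. Create the session → db = SessionLocal()
  2. Hand out the session → yield db
  3. Commit or roll back:
    • If the code in the block finishes normally → db.commit()
    • If an exception is raised → db.rollback(), then re-raise the exception
  4. Close the session → db.close()

So inside the with block you can safely query or write through db, and once the block ends the session is always closed, leaving no half-open connection behind.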

@contextmanager
def db_session():
    db = SessionLocal()
    try:
        yield db
        db.commit()
    except:
        db.rollback()
        raise
    finally:
        db.close()
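
To watch the commit / rollback / close order without a real database, here is a minimal sketch. It swaps SessionLocal for a hypothetical FakeSession that only records which methods were called; FakeSession, session_factory and the calls list are assumptions made for illustration and are not part of the project.

```python
from contextlib import contextmanager


class FakeSession:
    """Hypothetical stand-in for a SQLAlchemy session; it only records calls."""

    def __init__(self):
        self.calls = []

    def commit(self):
        self.calls.append("commit")

    def rollback(self):
        self.calls.append("rollback")

    def close(self):
        self.calls.append("close")


@contextmanager
def db_session(session_factory=FakeSession):
    db = session_factory()
    try:
        yield db
        db.commit()      # reached only if the with-block finished normally
    except Exception:
        db.rollback()    # undo partial work, then re-raise to the caller
        raise
    finally:
        db.close()       # always runs, success or failure


# Happy path: commit, then close.
with db_session() as ok:
    pass

# Failure path: rollback, then close, and the exception still propagates.
try:
    with db_session() as failed:
        raise ValueError("boom")
except ValueError:
    pass

print(ok.calls)      # ['commit', 'close']
print(failed.calls)  # ['rollback', 'close']
```

The point of the sketch is the ordering: close always comes last, and a failing block never reaches commit.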

3. Content aggregation: the Qdrant magic library fetch_paper_content_from_qdrant

def fetch_paper_content_from_qdrant(arxiv_id: str = None, title: str = None) -> str:
    """
    Fetch raw_content from Qdrant by arxiv_id and/or title
    """

    must_conditions = []
    if arxiv_id:
        must_conditions.append(
            models.FieldCondition(
                key="arxiv_id",
                match=models.MatchValue(value=arxiv_id),
            )
        )
    if title:
        must_conditions.append(
            models.FieldCondition(
                key="title",
                match=models.MatchValue(value=title),
            )
        )

    if not must_conditions:
        return ""


    result = qdrant_client.scroll(
        collection_name=COLLECTION_NAME,
        scroll_filter=models.Filter(must=must_conditions),
        limit=100
    )

    points, _ = result
    if not points:
        return ""

    # Only the first matching point's text is returned; any further chunks are ignored
    payload = points[0].payload
    raw_content = payload.get("text", "")
    return raw_content

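fetch_paper_content_from_qdrant pulls the original content (raw_content) from the vector database so that summary generation has enough material to work with.

🪄 Technical analogy:
scroll is like a lookup spell: it filters on metadata only and matches arxiv_id or title exactly.
search is more like a divination circle: it ranks content by vector similarity, so it is a poor fit for exact matching.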
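
To make the difference concrete without a running Qdrant instance, the sketch below imitates both behaviours in plain Python. It is only an analogy: the chunks list, the two-dimensional vectors and the helper names filter_by_id and top_k_similar are made up for illustration and do not call the Qdrant API.

```python
import math

# Hypothetical stored chunks: metadata plus a toy 2-D embedding.
chunks = [
    {"arxiv_id": "2501.00001", "vector": [1.0, 0.0], "text": "intro of paper A"},
    {"arxiv_id": "2501.00001", "vector": [0.0, 1.0], "text": "methods of paper A"},
    {"arxiv_id": "2501.00002", "vector": [0.9, 0.1], "text": "intro of paper B"},
]


def filter_by_id(items: list[dict], arxiv_id: str) -> list[dict]:
    """scroll-style lookup: exact metadata match, vectors are not consulted."""
    return [c for c in items if c["arxiv_id"] == arxiv_id]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k_similar(items: list[dict], query: list[float], k: int) -> list[dict]:
    """search-style lookup: rank every chunk by similarity to the query vector."""
    ranked = sorted(items, key=lambda c: cosine(c["vector"], query), reverse=True)
    return ranked[:k]


exact = filter_by_id(chunks, "2501.00001")
similar = top_k_similar(chunks, [1.0, 0.0], k=2)

print([c["text"] for c in exact])    # ['intro of paper A', 'methods of paper A']
print([c["text"] for c in similar])  # ['intro of paper A', 'intro of paper B']
```

The similarity lookup happily mixes chunks from two different papers, which is exactly what we do not want when collecting the content of one specific paper.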

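💡 Tricky point: limit=100 is like a library that hands over at most 100 books at a time. If more than 100 points match, scroll also returns a next-page offset, and you pass it back (as the offset argument in qdrant-client) to fetch the next batch.
Here I assume no paper has more than 100 chunks. That is not strictly safe, but it keeps the flow simple.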

points, next_offset = qdrant_client.scroll(
    collection_name="my_collection",
    scroll_filter=models.Filter(must=must_conditions),
    limit=100,
)

all_points = list(points)

# scroll returns (points, next_page_offset); the offset is None on the last page
while next_offset is not None:
    points, next_offset = qdrant_client.scroll(
        collection_name="my_collection",
        scroll_filter=models.Filter(must=must_conditions),
        limit=100,
        offset=next_offset,
    )
    all_points.extend(points)
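
The paging loop can be tried out without a Qdrant server. The sketch below uses a hypothetical FakeQdrantClient that only mimics the (points, next_page_offset) return shape of scroll; the class, the scroll_all helper and the integer offsets are assumptions for illustration, and a real collection would return point IDs as offsets.

```python
from typing import Optional


class FakeQdrantClient:
    """Hypothetical in-memory stand-in that mimics the (points, next_offset) shape."""

    def __init__(self, records: list[dict]):
        self._records = records

    def scroll(self, limit: int = 10, offset: Optional[int] = None):
        start = offset or 0
        page = self._records[start : start + limit]
        has_more = start + limit < len(self._records)
        return page, (start + limit if has_more else None)


def scroll_all(client, limit: int = 100) -> list[dict]:
    """Keep requesting pages until the client reports no further offset."""
    all_points: list[dict] = []
    offset = None
    while True:
        points, offset = client.scroll(limit=limit, offset=offset)
        all_points.extend(points)
        if offset is None:
            break
    return all_points


client = FakeQdrantClient([{"chunk": i} for i in range(250)])
chunks = scroll_all(client, limit=100)
print(len(chunks))  # 250
```

Checking for None rather than truthiness avoids stopping early if an offset ever happens to be a falsy value such as 0.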

4. Filtering what was already sent: sealing duplicate runes filter_already_sent_papers

def filter_already_sent_papers(user_id: int, papers: list[dict]) -> list[dict]:
    with db_session() as db:
        sent_arxiv_ids = {
            r.arxiv_id
            for r in db.query(UserSentPaper)
            .filter(UserSentPaper.user_id == user_id)
            .all()
        }
    new_papers = [p for p in papers if p["arxiv_id"] not in sent_arxiv_ids]
    return new_papers

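Every subscriber's scroll carries a seal: runes that were already read (already sent) do not appear again. This layer guarantees that:

  • no resources are wasted on sending duplicate summaries
  • the experience still feels personal to each user

That way every email opens like a present instead of yesterday's leftovers 😆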
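
The filter only works if delivered papers are recorded afterwards. The post does not show how UserSentPaper rows get written, so the sketch below is an assumption: it replaces the table with an in-memory set and uses two hypothetical helpers, filter_unsent and mark_as_sent, to show the read-then-record cycle.

```python
def filter_unsent(papers: list[dict], sent_ids: set[str]) -> list[dict]:
    """Keep only papers whose arxiv_id has not been sent to this user."""
    return [p for p in papers if p["arxiv_id"] not in sent_ids]


def mark_as_sent(papers: list[dict], sent_ids: set[str]) -> None:
    """Record delivered papers so the next run skips them."""
    sent_ids.update(p["arxiv_id"] for p in papers)


sent_ids = {"2501.00001"}
todays_papers = [
    {"arxiv_id": "2501.00001", "title": "Already delivered"},
    {"arxiv_id": "2501.00002", "title": "Fresh paper"},
]

first_run = filter_unsent(todays_papers, sent_ids)
mark_as_sent(first_run, sent_ids)  # only after the email actually went out
second_run = filter_unsent(todays_papers, sent_ids)

print([p["arxiv_id"] for p in first_run])  # ['2501.00002']
print(second_run)                          # []
```

Recording after a successful send, rather than before, means a failed delivery gets retried on the next run instead of being silently skipped.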

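5. Knowledge alchemy: summoning LLM summaries generate_summary

  • Collect the information for each paper
  • Call the LLM to generate a summary
  • Assemble everything into HTML, ready to send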

def fetch_paper_info(paper: dict, content_map: dict[str, str]) -> dict:
    """Collect the fields for a single paper, including raw_content when available"""
    paper_info = {
        "title": paper.get("title") or "No Title",
        "authors": paper.get("authors") or [],
        "abstract": paper.get("abstract") or "",
        "pdf_url": paper.get("pdf_url") or None,
        "published_date": paper.get("published_date") or None,
    }

    arxiv_id = paper.get("arxiv_id")
    if arxiv_id and arxiv_id in content_map:
        paper_info["raw_content"] = content_map[arxiv_id]

    return paper_info



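def summarize_paper(paper_info: dict, user: dict) -> str:
    """Call the LLM to generate a summary; fall back to a fixed message on failure"""
    try:
        return llm_summary(paper_info, user)
    except Exception as e:
        # Record why the summary failed instead of discarding the exception
        logger.warning(f"Summary generation failed: {e}")
        return "Summary generation failed."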


def generate_summary(
    papers_and_content: tuple[list[dict], dict[str, str]], user: dict
) -> str:
    """
    Generate an LLM summary for each paper and assemble the results into HTML
    """
    papers, content_map = papers_and_content

    if not papers:
        return "<p>No new papers today.</p>"

    papers_html = ""

    for idx, p in enumerate(papers, start=1):
        paper_info = fetch_paper_info(p, content_map)
        summary = summarize_paper(paper_info, user)
        pdf_url = paper_info.get("pdf_url")
        pdf_link_html = (
            f'<a href="{pdf_url}" target="_blank">Preview PDF</a>' if pdf_url else "N/A"
        )

        papers_html += f"""
        <div class="paper-summary">
            <div class="paper-title">{idx}. {paper_info["title"]}</div>
            <div class="paper-meta">
                <strong>Authors:</strong> {", ".join(paper_info.get("authors", []))} <br>
                <strong>Published:</strong> {paper_info.get("published_date") or "N/A"} <br>
                <strong>PDF:</strong> {pdf_link_html}
            </div>
            <div class="paper-abstract">
                {summary}
            </div>
        </div>
        """

    template_path = pathlib.Path(__file__).parent / "template.html"
    template_text = template_path.read_text(encoding="utf-8")
    final_html = Template(template_text).substitute(papers_html=papers_html)
 
    return final_html
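
One thing to watch when building HTML with f-strings: titles and author names are plain text, and a title containing < or & can break the markup. A minimal sketch of one way to guard against that with the standard library's html.escape follows. render_paper_card is a hypothetical, simplified helper and not the project's code; the summary is assumed to already be HTML produced by the LLM, so it is passed through untouched.

```python
import html


def render_paper_card(idx: int, title: str, authors: list[str], summary_html: str) -> str:
    """Escape plain-text metadata; the summary is assumed to already be HTML."""
    safe_title = html.escape(title)
    safe_authors = html.escape(", ".join(authors))
    return (
        '<div class="paper-summary">'
        f'<div class="paper-title">{idx}. {safe_title}</div>'
        f'<div class="paper-meta"><strong>Authors:</strong> {safe_authors}</div>'
        f'<div class="paper-abstract">{summary_html}</div>'
        "</div>"
    )


card = render_paper_card(
    1,
    "Learning when x < y & y > z",
    ["A. Author", "B. Author"],
    "<p>Short summary.</p>",
)
print(card)
```

Only the metadata is escaped here. Whether the LLM output itself should be sanitised is a separate decision this sketch does not cover.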

6. Rune delivery send_email

Once the magic scroll is generated, it has to reach the right wizard. That is send_email's job.


def send_email(
    subject: str,
    recipients: str,
    body: str,
    attachments: list[dict] = None,  # [{"filename": "summary.pdf", "content": bytes}]
):
    """
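    Send an HTML email plus attachments with smtplib, keeping the same
    function interface as the original FastMail version.
    """

    # Build a multipart message (HTML + attachments)
    msg = MIMEMultipart("mixed")
    msg["Subject"] = subject
    msg["From"] = settings.MAIL_FROM
    msg["To"] = recipients

    # Attach the HTML body
    msg.attach(MIMEText(body, "html"))

    # Attach any files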
    for att in attachments or []:
        part = MIMEBase("application", "octet-stream")
        part.set_payload(att["content"])
        encoders.encode_base64(part)
        part.add_header(
            "Content-Disposition", f'attachment; filename="{att["filename"]}"'
        )
        msg.attach(part)

    try:
        # Use Gmail or any other SMTP server
        with smtplib.SMTP(settings.MAIL_SERVER, settings.MAIL_PORT) as server:
            if settings.MAIL_TLS:
                server.starttls()
            server.login(settings.MAIL_USERNAME, settings.MAIL_PASSWORD)
            server.send_message(msg)
    except Exception as e:
        logger.error(f"Failed to send email to {recipients}: {e}")
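
Before pointing this at a real SMTP server, it can help to build the message on its own and inspect it. The sketch below separates message construction into a hypothetical build_message helper (my name, not the project's) and checks the parts without sending anything; the addresses and the placeholder PDF bytes are invented for the example.

```python
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_message(subject, sender, recipient, body_html, attachments=None):
    """Assemble the multipart message only; sending is left to the caller."""
    msg = MIMEMultipart("mixed")
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    msg.attach(MIMEText(body_html, "html", "utf-8"))
    for att in attachments or []:
        part = MIMEBase("application", "octet-stream")
        part.set_payload(att["content"])
        encoders.encode_base64(part)
        # Passing filename as a keyword lets the email package handle quoting
        part.add_header("Content-Disposition", "attachment", filename=att["filename"])
        msg.attach(part)
    return msg


msg = build_message(
    subject="Daily paper digest",
    sender="digest@example.com",
    recipient="reader@example.com",
    body_html="<p>No new papers today.</p>",
    attachments=[{"filename": "summary.pdf", "content": b"%PDF-1.4 placeholder"}],
)

parts = msg.get_payload()
print([p.get_content_type() for p in parts])  # ['text/html', 'application/octet-stream']
print(parts[1].get_filename())                # summary.pdf
```

Keeping construction apart from delivery also makes the SMTP part easy to swap or mock in tests.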


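Summary

Today I completed the second layer of magic on the scroll:

Fetch papers → aggregate content → filter duplicates → generate summaries → send
  • Each paper → a magic stone
  • LLM generation → turning stones into scrolls
  • HTML + CSS → the magical glow
  • send_email → precise delivery to the wizard (or, by accident, to the neighbor next door 😜)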
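
The whole chain can be sketched as one loop over subscribers. This is a rough outline under stated assumptions, not the project's actual orchestration: run_daily_digest is a hypothetical name, every step is injected as a callable so the sketch runs without a database, Qdrant or SMTP, and skipping the email when nothing is new is my own design choice.

```python
from typing import Callable


def run_daily_digest(
    users: list[dict],
    papers: list[dict],
    filter_sent: Callable[[int, list[dict]], list[dict]],
    summarize: Callable[[list[dict], dict], str],
    send: Callable[[str, str, str], None],
) -> int:
    """Filter, summarize and send per subscriber; return how many emails went out."""
    sent_count = 0
    for user in users:
        new_papers = filter_sent(user["user_id"], papers)
        if not new_papers:
            continue  # nothing new for this subscriber, skip the email
        body = summarize(new_papers, user)
        send("Daily paper digest", user["email"], body)
        sent_count += 1
    return sent_count


# Hypothetical stubs standing in for the real steps.
already_sent = {1: {"2501.00001"}, 2: set()}
outbox: list[tuple[str, str]] = []

count = run_daily_digest(
    users=[
        {"user_id": 1, "email": "a@example.com"},
        {"user_id": 2, "email": "b@example.com"},
    ],
    papers=[{"arxiv_id": "2501.00001", "title": "Only paper"}],
    filter_sent=lambda uid, ps: [p for p in ps if p["arxiv_id"] not in already_sent[uid]],
    summarize=lambda ps, user: f"<p>{len(ps)} new paper(s)</p>",
    send=lambda subject, to, body: outbox.append((to, body)),
)

print(count)   # 1
print(outbox)  # [('b@example.com', '<p>1 new paper(s)</p>')]
```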

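Every step is a small adventure that has to be precise and reliable before the gems of knowledge reach the user safely. If one fails, the whole email may turn into a 404 Not Found.

This proves once again that building an Email Pipeline is never just cold code; it is a genuine wizard's trial.
We are not merely debugging, we are negotiating with magical creatures; we are not merely scheduling, we are steering the flow of time.

In the chapters ahead I will keep recording this adventure, as both engineering notes and a journal of magical sightings.
If you are ready too, pick up your wand (keyboard) and let's explore this uncharted wilderness together.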


Easter egg

llm_summary

This borrows the LangChain wrapper from Day 6 | Hello Ollama: a first meeting with the Ollama models.


def llm_summary(paper: Dict, user: dict, max_words: int = 300) -> str:
    if not paper:
        return "No paper provided."

    is_translate = user.get("translate", False)
    user_language = user.get("user_language", "English")
    temperature = user.get("temperature", 0.5)
    system_prompt = user.get("system_prompt", "")

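    def _escape(text: str) -> str:
        # The formatted prompt is parsed again by ChatPromptTemplate,
        # so literal braces in paper text must be doubled.
        return text.replace("{", "{{").replace("}", "}}")

    title = _escape(paper.get("title") or "No Title")
    # Join the author list once, then escape the resulting string
    authors_str = _escape(", ".join(paper.get("authors") or []))

    content = _escape(paper.get("raw_content") or paper.get("abstract") or "")
    content_type = "Full Content" if paper.get("raw_content") else "Abstract"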

    translation_instruction = ""
    if is_translate:
        translation_instruction = (
            f"Translate the summary to {user_language}. Output ONLY in {user_language}."
        )

    # Load the prompt template
    template_text = PROMPT_FILE.read_text(encoding="utf-8")

    prompt_template = template_text.format(
        system_prompt=system_prompt,
        max_words=max_words,
        content_type=content_type,
        translation_instruction=translation_instruction,
        title=title,
        authors=authors_str,
        content=content,
    )

    chat_model = ChatOllama(
        model=MODEL_NAME,
        temperature=temperature,
        base_url=OLLAMA_API_URL,
    )

    prompt = ChatPromptTemplate.from_template(prompt_template)
    chain = prompt | chat_model

    try:
        resp = chain.invoke({})
        html_summary = resp.content.strip()
        html_summary = "\n".join(
            [line for line in html_summary.splitlines() if line.strip()]
        )
        return html_summary
    except Exception as e:
        return f"<p><strong>Summary generation failed:</strong> {e}</p>"
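
Why brace escaping matters here: the prompt is filled in with str.format first and then handed to ChatPromptTemplate.from_template, which by default parses {...} as variables a second time. Paper text often contains LaTeX braces, so unescaped content can break that second pass. The sketch below reproduces the effect with str.format alone, used as a stand-in for the second parser; escape_braces and the sample strings are made up for illustration.

```python
def escape_braces(text: str) -> str:
    """Double every brace so a later str.format-style pass treats it as literal."""
    return text.replace("{", "{{").replace("}", "}}")


template = "Title: {title}\nContent: {content}"
latex_snippet = r"We minimise \frac{1}{n} \sum_i x_i"

# First pass: fill the placeholders. Values are inserted verbatim.
unsafe_prompt = template.format(title="Demo", content=latex_snippet)
safe_prompt = template.format(title="Demo", content=escape_braces(latex_snippet))

# Second pass: stands in for a template engine that parses {...} again.
try:
    unsafe_prompt.format()
    second_pass_failed = False
except (KeyError, IndexError, ValueError):
    second_pass_failed = True

print(second_pass_failed)                     # True
print(safe_prompt.format() == unsafe_prompt)  # True
```

After the second pass the escaped version comes out as the original text, which is what the model should see.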


The core: the incantation prompt

{system_prompt}
You are a professional research assistant.
Summarize the following paper concisely, in no more than {max_words} words.
Keep it readable for an email newsletter.
(Note: the text provided is the paper's {content_type})
{translation_instruction}
OUTPUT MUST BE HTML and email-friendly.
Use headings (<h2>, <h3>), paragraphs (<p>), bold (<strong>), italics (<em>),
unordered lists (<ul>) and ordered lists (<ol>) for bullet points.
Do NOT use Markdown or plain text.
Ensure all tags are properly closed.

Instructions:
1. Base your answer STRICTLY on the provided paper excerpts.
2. Maintain academic accuracy and precision.
3. Structure your answer logically with clear paragraphs when appropriate.
4. DO NOT include any introductory paragraphs about the authors, affiliations, or background. Focus ONLY on the paper's content, key findings, methods, and important points.

Remember:
- Do NOT make up information not present in the excerpts.
- Do NOT use knowledge beyond what's provided in the paper excerpts.
- Always acknowledge uncertainty when the excerpts are ambiguous or incomplete.
- Prioritize relevance and clarity in your response.
- NEVER add introductory phrases or explanations before your HTML response.

Paper:
Title: {title}
Authors: {authors}
Content: {content}

Include key findings, methods, and any important points as bullet points or numbered lists.

The beauty secret template.html

<!DOCTYPE html>
<html lang="zh-Hant">

<head>
    <meta charset="UTF-8">
    <title>今日論文摘要</title>
    <style>
        body {
            font-family: "Segoe UI", Tahoma, Geneva, Verdana, sans-serif;
            line-height: 1.6;
            background-color: #f9fafc;
            /* light background */
            margin: 0;
            padding: 20px;
        }

        h2 {
            text-align: center;
            color: #2c3e50;
            margin-bottom: 30px;
            background: linear-gradient(90deg, #a0c4ff, #e0f7fa);
            -webkit-background-clip: text;
            -webkit-text-fill-color: transparent;
        }

        .paper-summary {
            border-radius: 12px;
            padding: 20px;
            margin-bottom: 20px;
            background: linear-gradient(145deg, #ffffff, #f0f4f8);
            /* light card */
            box-shadow: 0 4px 10px rgba(0, 0, 0, 0.08);
            transition: transform 0.3s, box-shadow 0.3s, background 0.3s;
            cursor: pointer;
        }

        .paper-summary:hover {
            transform: translateY(-5px);
            box-shadow: 0 8px 20px rgba(0, 0, 0, 0.15);
            background: linear-gradient(145deg, #f0f4f8, #e0ebfc);
            /* light hover */
        }

        .paper-title {
            font-size: 1.5em;
            font-weight: bold;
            margin-bottom: 10px;
            color: #f6b387;
            transition: color 0.3s;
        }

        .paper-title:hover {
            color: #ff5722;
        }

        .paper-meta {
            font-size: 0.9em;
            color: #616161;
            margin-bottom: 12px;
        }

        .paper-meta a {
            color: #1976d2;
            text-decoration: none;
            font-weight: bold;
        }

        .paper-meta a:hover {
            text-decoration: underline;
        }

        .paper-summary p,
        .paper-summary ul,
        .paper-summary ol {
            margin: 8px 0;
            color: #424242;
        }

        .label {
            display: inline-block;
            padding: 2px 6px;
            font-size: 0.75em;
            font-weight: bold;
            border-radius: 4px;
            margin-right: 6px;
            color: #fff;
        }

        .label-authors {
            background-color: #00796b;
        }

        .label-date {
            background-color: #f57c00;
        }

        .label-pdf {
            background-color: #1976d2;
        }

        .summary-content {
            display: none;
            margin-top: 10px;
        }

        .toggle-btn {
            display: inline-block;
            margin-top: 8px;
            padding: 5px 10px;
            background-color: #007bff;
            color: #fff;
            border-radius: 6px;
            font-size: 0.85em;
            cursor: pointer;
            transition: background 0.3s;
        }

        .toggle-btn:hover {
            background-color: #0056b3;
        }
    </style>
</head>

<body>
    <h2>今日論文摘要</h2>

    <!-- Papers Section Start -->
    ${papers_html}
    <!-- Papers Section End -->

    <p><em>本摘要僅供參考,最終請依原始論文與專業判斷。</em></p>

    <script>
        document.querySelectorAll('.paper-summary').forEach(function (card) {
            card.addEventListener('click', function () {
                const content = this.querySelector('.summary-content');
                if (content) {
                    content.style.display = content.style.display === 'none' ? 'block' : 'none';
                }
            });
        });
    </script>
</body>

</html>
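
A small aside on why string.Template suits this file: the stylesheet is full of { and }, which str.format would try to read as placeholders, while string.Template only reacts to $. The sketch below uses a trimmed-down stand-in for template.html (the shortened markup is an assumption for illustration) to show both behaviours.

```python
from string import Template

# Trimmed-down stand-in for template.html: CSS braces plus one $-placeholder.
template_text = """<style>
  .paper-summary { padding: 20px; }
</style>
<body>
  ${papers_html}
</body>"""

papers_html = '<div class="paper-summary">1. Demo paper</div>'

# string.Template only reacts to $, so the CSS braces pass through untouched.
final_html = Template(template_text).substitute(papers_html=papers_html)

# str.format would instead try to parse "{ padding: 20px; }" as a placeholder.
try:
    template_text.format(papers_html=papers_html)
    format_failed = False
except (KeyError, ValueError, IndexError):
    format_failed = True

print(format_failed)  # True
print(final_html)
```

One caveat with string.Template: a literal dollar sign in the template has to be written as $$, otherwise substitute raises an error.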


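Previous post
Day 8 | Email Pipeline technical breakdown (part 1): building the subscription system
Series
Paper Wanderings: me and AI exploring tools, composing workflows, and taking on a complete platform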