
2025 iThome 鐵人賽

DAY 15

⚡《AI 知識系統建造日誌》— This is not a purely technical article, but an engineer's magical adventure. Code is the spell, the pipeline is the magic circle, and error messages are dark curses. Ready your wand (keyboard): today we step into the academy's foundational magic class and build a stable, scalable AI knowledge system.


Introduction

Continuing from the earlier Ollama and RAG introductions. If you missed the previous installments, you can catch up with these posts:

Day14|RAG 魔法課 (上):Hybrid Search 與 Re-ranking 完整實戰
Day 6|你好 Ollama - 與 Ollama 模型初次見面
Day 7 | 穿越 RAG 魔法迷宮:打造智慧問答系統的秘訣 - RAG Pipeline

In today's adventure I bring my laptop and coffee, like a magic apprentice clutching an expedition map, ready to face the engineering challenges of a RAG system. Last time, we used Hybrid Search and Re-ranking to find the most relevant passages; today, I dig into how to tame these data spirits and build a solid, scalable knowledge tower in the real world.

I exhausted myself along the way, so a few extra cups of coffee are keeping this article alive. ☕


Pipeline recap

# Build the prompt: prefer the structured-output prompt, fall back to the plain RAG prompt.
try:
    prompt_data = ollama_client.prompt_builder.create_structured_prompt(
        query, chunks, system_settings.user_language
    )
    final_prompt = prompt_data["prompt"]
except Exception:
    final_prompt = ollama_client.prompt_builder.create_rag_prompt(
        query, chunks, system_settings.user_language
    )

# Generate the answer from the retrieved chunks (structured output enabled).
parsed_response, response = await ollama_client.generate_rag_answer(
    query=query,
    chunks=chunks,
    user_language=system_settings.user_language,
    use_structured_output=True,
    temperature=system_settings.temperature,
)

Prompt engineering - shaping the model output into the form you want - create_structured_prompt

1. Structured outputs

Ollama supports structured outputs, letting the LLM reply according to a structure we design, for example one that follows RAGResponse.

Official example:

from ollama import chat
from pydantic import BaseModel

class Country(BaseModel):
  name: str
  capital: str
  languages: list[str]

response = chat(
  messages=[
    {
      'role': 'user',
      'content': 'Tell me about Canada.',
    }
  ],
  model='llama3.1',
  format=Country.model_json_schema(),
)

country = Country.model_validate_json(response.message.content)
print(country)

Output:

name='Canada' capital='Ottawa' languages=['English', 'French']

2. The ideal is beautiful, but reality demands trade-offs

Ideally the LLM would follow the RAGResponse structure perfectly, but in practice there are trade-offs to make.

from typing import List, Optional

from pydantic import BaseModel, Field


class RAGResponse(BaseModel):
    """Structured response model for RAG queries."""

    answer: str = Field(
        description="Comprehensive answer based on the provided paper excerpts"
    )
    sources: List[str] = Field(
        default_factory=list,
        description="List of PDF URLs from papers used in the answer",
    )
    confidence: Optional[str] = Field(
        default=None,
        description="Confidence level: high, medium, or low based on excerpt relevance",
    )
    citations: Optional[List[str]] = Field(
        default=None,
        description="Specific arXiv IDs or paper titles referenced in the answer",
    )

The corresponding JSON Schema:

{
  "description": "Structured response model for RAG queries.",
  "properties": {
    "answer": {...},
    "sources": {...},
    "confidence": {...},
    "citations": {...}
  },
  "required": ["answer"],
  "title": "RAGResponse",
  "type": "object"
}
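
For reference, the schema above is exactly what Pydantic generates; a one-liner like the following (using the RAGResponse class defined above) reproduces it:

import json

# Print the JSON Schema that will later be passed to Ollama as the `format` parameter.
print(json.dumps(RAGResponse.model_json_schema(), indent=2))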

3. Prompt construction example

def create_rag_prompt(
    self, query: str, chunks: List[Dict[str, Any]], user_language: str = "English"
) -> str:
    """Create a RAG prompt with query and retrieved chunks.

    Args:
        query: User's question
        chunks: List of retrieved chunks with metadata from Qdrant

    Returns:
        Formatted prompt string
    """
    prompt = f"{self.system_prompt}\n\n"
    prompt += "### Context from Papers:\n\n"

    for i, chunk in enumerate(chunks, 1):
        # Get the actual chunk text
        chunk_text = chunk.get("chunk_text", chunk.get("content", ""))
        arxiv_id = chunk.get("arxiv_id", "")
        prompt += f"[{i}. arXiv:{arxiv_id}]\n{chunk_text}\n\n"

    prompt += f"### Question:\n{query}\n\n"
    prompt += "### Answer:\n"
    prompt += (
        "Provide a natural, conversational response (not JSON), cite sources using [arXiv:id] format.\n\n"
        f"and Translate to {user_language}. "
        f"Output ONLY in {user_language}, formatted clearly for readability"
    )

    return prompt

def create_structured_prompt(
    self, query: str, chunks: List[Dict[str, Any]], user_language: str = "English"
) -> Dict[str, Any]:
    """Create a prompt for Ollama with structured output format.

    Args:
        query: User's question
        chunks: List of retrieved chunks

    Returns:
        Dictionary with prompt and format schema for Ollama
    """
    return {
        "prompt": self.create_rag_prompt(query, chunks, user_language),
        "format": RAGResponse.model_json_schema(),
    }        

Sorry — my English is not great, so translation really matters to me 😢
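
To make the builder's output concrete, here is a small usage sketch. The chunk dictionaries are hypothetical; only the keys read by create_rag_prompt (chunk_text, arxiv_id) matter, and RAGPromptBuilder is the class shown in full at the end of this post:

builder = RAGPromptBuilder()

# Hypothetical retrieved chunks, shaped like the Qdrant payloads used above.
chunks = [
    {"arxiv_id": "2401.00001", "chunk_text": "Hybrid search combines BM25 with dense retrieval..."},
    {"arxiv_id": "2402.00002", "chunk_text": "Cross-encoder re-ranking improves precision at top-k..."},
]

prompt_data = builder.create_structured_prompt(
    "How does hybrid search work?", chunks, user_language="Traditional Chinese"
)

print(prompt_data["prompt"][:300])        # the rendered RAG prompt (system prompt + context + question)
print(prompt_data["format"]["required"])  # ['answer'] — matches the schema shown earlier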

4. Lessons learned

  • At first I spent a lot of time debugging, only to discover the cause: I was using gpt-oss:20b, while the official example uses llama3.1.
  • Only certain models (such as llama3.1) support structured output; other large models such as gpt-oss:20b do not honor it (a quick probe sketch follows below).
  • I switched decisively to the llama3 family and finally settled on llama3.2:3b.

This is exactly why coffee is needed to keep me alive.
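
As promised above, here is a quick probe to check whether a locally pulled model actually honors the format parameter. It is only a sketch, reusing the Country example from the official docs; the model list is whatever you happen to have pulled:

from ollama import chat
from pydantic import BaseModel, ValidationError


class Country(BaseModel):
    name: str
    capital: str
    languages: list[str]


def honors_structured_output(model: str) -> bool:
    """Return True if the model's reply validates against the requested schema."""
    response = chat(
        messages=[{"role": "user", "content": "Tell me about Canada."}],
        model=model,
        format=Country.model_json_schema(),
    )
    try:
        Country.model_validate_json(response.message.content)
        return True
    except ValidationError:
        return False


for model in ("llama3.2:3b", "gpt-oss:20b"):
    print(model, honors_structured_output(model))

This is the kind of five-minute check that would have saved me all that debugging time with gpt-oss:20b.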

Trade-off strategy

So now a trade-off has to be made: stick with the tasty gpt-oss:20b, which does not support structured outputs, or switch to another model that does.

All I can say is: next time, read the manual first.

  • Structured Output: tell the model directly to reply with a JSON structure such as RAGResponse.
    • Pros: the structure is explicit and can be parsed directly, which makes downstream processing easy
    • Cons: only certain models support it (e.g. llama3.2:3b); larger models such as gpt-oss:20b do not
  • Prompt-based Fallback: even when the model does not support structured output, we can spell out the format requirements inside the prompt itself (a wiring sketch follows the snippet below):
"Return EXACTLY in JSON matching this schema:\n"
"{\n"
'  "answer": "",\n'
'  "sources": [],\n'
'  "confidence": "",\n'
'  "citations": []\n'
"}\n"
f"and Translate to {user_language}. "
f"Output ONLY in {user_language}, formatted clearly for readability"

Answer generation flow (RAG + Ollama)

When generating a RAG answer with Ollama, we have to pass format= to the model in order to enable structured output:

    async def generate_rag_answer(
        self,
        query: str,
        chunks: List[Dict[str, Any]],
        use_structured_output: bool = False,
        temperature: float = 0.5,
        user_language: str = "English",
    ) -> Tuple[Dict[str, Any], Dict[str, Any]]:
        """
        Generate a RAG answer using retrieved chunks.

        Args:
            query: User's question
            chunks: Retrieved document chunks with metadata
            use_structured_output: Whether to use Ollama's structured output feature
            temperature: Sampling temperature passed to the model
            user_language: Language the final answer should be written in

        Returns:
            Tuple of (parsed answer dict, raw Ollama response)
        """
        try:
            if use_structured_output:
                # Use structured output with the RAGResponse JSON schema
                prompt_data = self.prompt_builder.create_structured_prompt(
                    query, chunks, user_language=user_language
                )

                response = await self.generate(
                    prompt=prompt_data["prompt"],
                    temperature=temperature,
                    top_p=0.9,
                    format=prompt_data["format"],  # NOTE: only honored by certain models
                )
            else:
                # Plain-text fallback
                prompt = self.prompt_builder.create_rag_prompt(
                    query, chunks, user_language=user_language
                )

                logger.info(f"promptprompt {prompt}")
                # Generate without format restrictions
                response = await self.generate(
                    prompt=prompt,
                    temperature=temperature,
                    top_p=0.9,
                )

            if response and "response" in response:
                answer_text = response["response"]
                logger.info(f"Raw LLM response: {answer_text}")

                if use_structured_output:
                    # Try to parse structured response if enabled
                    parsed_response = self.response_parser.parse_structured_response(
                        answer_text
                    )
                    logger.info(f"Parsed response:  {parsed_response}")
                    return parsed_response, response
                else:
                    # For plain text response, build simple response structure
                    sources = []
                    seen_urls = set()
                    for chunk in chunks:
                        arxiv_id = chunk.get("arxiv_id")
                        if arxiv_id:
                            arxiv_id_clean = (
                                arxiv_id.split("v")[0] if "v" in arxiv_id else arxiv_id
                            )
                            pdf_url = f"https://arxiv.org/pdf/{arxiv_id_clean}.pdf"
                            if pdf_url not in seen_urls:
                                sources.append(pdf_url)
                                seen_urls.add(pdf_url)

                    citations = list(
                        set(
                            chunk.get("arxiv_id")
                            for chunk in chunks
                            if chunk.get("arxiv_id")
                        )
                    )

                    return {
                        "answer": answer_text,
                        "sources": sources,
                        "confidence": "medium",
                        "citations": citations[:5],
                    }, response
            else:
                raise OllamaException("No response generated from Ollama")

        except Exception as e:
            logger.error(f"Error generating RAG answer: {e}")
            raise OllamaException(f"Failed to generate RAG answer: {e}")


I noticed that llama3.2:3b's output is usually too brief, while gpt-oss:20b does not support structured output. A bold strategy came to mind: generate the answer with gpt-oss:20b first, then hand it to llama3.2:3b to produce the structured output (a rough sketch appears at the end of this section). For now the idea stays on the shelf, because the challenges below have to be overcome first.

Practical challenges

  • gpt-oss:20b context limit = 8192 tokens
  • llama3.2:3b context limit = 4096 tokens
  • Inputs that are too long get truncated (truncating input prompt)

Sure enough, these are all the usual traps.

In an engineer's world there is no one-shot ideal solution. Every decision is a balance of trade-offs. >_<
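
For the record, the two-stage idea would look roughly like this, using the ollama package from the official example and the RAGResponse schema defined earlier. It is only a sketch; it is not implemented in this project, and the context limits above still apply:

from ollama import chat

# RAGResponse is the Pydantic model defined earlier in this post.


def two_stage_answer(rag_prompt: str) -> str:
    """Sketch: draft with gpt-oss:20b, then restructure with llama3.2:3b."""
    # Stage 1: the larger model writes a free-form draft answer.
    draft = chat(
        messages=[{"role": "user", "content": rag_prompt}],
        model="gpt-oss:20b",
    ).message.content

    # Stage 2: the smaller model reshapes the draft into RAGResponse-style JSON.
    structured = chat(
        messages=[{
            "role": "user",
            "content": f"Reformat the following answer as JSON matching the schema:\n\n{draft}",
        }],
        model="llama3.2:3b",
        format=RAGResponse.model_json_schema(),
    )
    return structured.message.content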


Summary

This article demonstrates an engineering mindset for RAG systems:

  • Clear schema design: define an explicit structured output so downstream processing is easy
  • Model capabilities and limits: make trade-offs based on each model's context limit and features
  • Prompt Engineering: design prompts that steer the model toward the desired format and content
  • Fallback and robustness: for models without structured output support, fall back to plain text or prompt-based JSON
  • Model composition strategy: consider combining multiple models with a division of labor to get the best result

The point is not the technology alone, but how an engineer designs a controllable, scalable, fault-tolerant system in the face of large-model uncertainty.


Supplement: OllamaClient


class OllamaClient:
    """Client for interacting with Ollama local LLM service."""

    def __init__(self, settings: Settings):
        """Initialize Ollama client with settings."""
        self.base_url = settings.OLLAMA_API_URL
        self.model_name = settings.MODEL_NAME
        self.timeout = httpx.Timeout(float(settings.OLLAMA_TIMEOUT))
        self.prompt_builder = RAGPromptBuilder()
        self.response_parser = ResponseParser()

    async def generate(
        self,
        prompt: str = "",
        **kwargs,
    ) -> Dict[str, Any]:
        """
        Generate text using specified model.

        Args:
            prompt: Input prompt for generation
            **kwargs: Additional generation parameters (e.g. temperature, format)

        Returns:
            Response dictionary from Ollama (raises OllamaException on failure)
        """
        try:
            async with httpx.AsyncClient(timeout=self.timeout) as client:
                data = {
                    "model": self.model_name,
                    "prompt": prompt,
                    "stream": False,
                    **kwargs,
                }

                response = await client.post(f"{self.base_url}/api/generate", json=data)

                if response.status_code == 200:
                    return response.json()
                else:
                    raise OllamaException(f"Generation failed: {response.status_code}")

        except httpx.ConnectError as e:
            raise OllamaConnectionError(f"Cannot connect to Ollama service: {e}")
        except httpx.TimeoutException as e:
            raise OllamaTimeoutError(f"Ollama service timeout: {e}")
        except OllamaException:
            raise
        except Exception as e:
            raise OllamaException(f"Error generating with Ollama: {e}")

    async def generate_rag_answer(
        self,
        query: str,
        chunks: List[Dict[str, Any]],
        use_structured_output: bool = False,
        temperature: float = 0.5,
        user_language: str = "English",
    ) -> Tuple[Dict[str, Any], Dict[str, Any]]:
        """
        Generate a RAG answer using retrieved chunks.

        Args:
            query: User's question
            chunks: Retrieved document chunks with metadata
            use_structured_output: Whether to use Ollama's structured output feature
            temperature: Sampling temperature passed to the model
            user_language: Language the final answer should be written in

        Returns:
            Tuple of (parsed answer dict, raw Ollama response)
        """
        try:
            if use_structured_output:
                # Use structured output with Pydantic model
                prompt_data = self.prompt_builder.create_structured_prompt(
                    query, chunks, user_language=user_language
                )

                logger.info(f"prompt_data {prompt_data}\n\n")
                # Generate with structured format
                response = await self.generate(
                    prompt=prompt_data["prompt"],
                    temperature=temperature,
                    top_p=0.9,
                    format=prompt_data["format"],  # NOTE: only honored by certain models
                )
            else:
                # Fallback to plain text mode
                prompt = self.prompt_builder.create_rag_prompt(
                    query, chunks, user_language=user_language
                )

                logger.info(f"promptprompt {prompt}")
                # Generate without format restrictions
                response = await self.generate(
                    prompt=prompt,
                    temperature=temperature,
                    top_p=0.9,
                )

            if response and "response" in response:
                answer_text = response["response"]
                logger.info(f"Raw LLM response: {answer_text}")

                if use_structured_output:
                    # Try to parse structured response if enabled
                    parsed_response = self.response_parser.parse_structured_response(
                        answer_text
                    )
                    logger.info(f"Parsed response: {parsed_response}")
                    return parsed_response, response
                else:
                    # For plain text response, build simple response structure
                    sources = []
                    seen_urls = set()
                    for chunk in chunks:
                        arxiv_id = chunk.get("arxiv_id")
                        if arxiv_id:
                            arxiv_id_clean = (
                                arxiv_id.split("v")[0] if "v" in arxiv_id else arxiv_id
                            )
                            pdf_url = f"https://arxiv.org/pdf/{arxiv_id_clean}.pdf"
                            if pdf_url not in seen_urls:
                                sources.append(pdf_url)
                                seen_urls.add(pdf_url)

                    citations = list(
                        set(
                            chunk.get("arxiv_id")
                            for chunk in chunks
                            if chunk.get("arxiv_id")
                        )
                    )

                    return {
                        "answer": answer_text,
                        "sources": sources,
                        "confidence": "medium",
                        "citations": citations[:5],
                    }, response
            else:
                raise OllamaException("No response generated from Ollama")

        except Exception as e:
            logger.error(f"Error generating RAG answer: {e}")
            raise OllamaException(f"Failed to generate RAG answer: {e}")


Supplement: RAGPromptBuilder & ResponseParser

import json
import re
from pathlib import Path
from typing import Any, Dict, List

from pydantic import ValidationError

# RAGResponse is the Pydantic model defined earlier in this post.


class RAGPromptBuilder:
    """Builder class for creating RAG prompts."""

    def __init__(self):
        """Initialize the prompt builder."""
        self.prompts_dir = Path(__file__).parent / "prompts"
        self.system_prompt = self._load_system_prompt()

    def _load_system_prompt(self) -> str:
        """Load the system prompt from the text file.

        Returns:
            System prompt string
        """
        prompt_file = self.prompts_dir / "rag_system.txt"
        if not prompt_file.exists():
            # Fallback to default prompt if file doesn't exist
            return (
                "You are an AI assistant specialized in answering questions about "
                "academic papers from arXiv. Base your answer STRICTLY on the provided "
                "paper excerpts."
            )
        return prompt_file.read_text().strip()

    def create_rag_prompt(
        self, query: str, chunks: List[Dict[str, Any]], user_language: str = "English"
    ) -> str:
        """Create a RAG prompt with query and retrieved chunks.

        Args:
            query: User's question
            chunks: List of retrieved chunks with metadata from Qdrant

        Returns:
            Formatted prompt string
        """
        prompt = f"{self.system_prompt}\n\n"
        prompt += "### Context from Papers:\n\n"

        for i, chunk in enumerate(chunks, 1):
            # Get the actual chunk text
            chunk_text = chunk.get("chunk_text", chunk.get("content", ""))
            arxiv_id = chunk.get("arxiv_id", "")

            # Only include minimal metadata - just arxiv_id for citation
            prompt += f"[{i}. arXiv:{arxiv_id}]\n"
            prompt += f"{chunk_text}\n\n"

        prompt += f"### Question:\n{query}\n\n"
        prompt += "### Answer:\n"
        prompt += (
            "Provide a natural, conversational response (not JSON), cite sources using [arXiv:id] format.\n\n"
            f"and Translate to {user_language}. "
            f"Output ONLY in {user_language}, formatted clearly for readability"
        )

        return prompt

    def create_structured_prompt(
        self, query: str, chunks: List[Dict[str, Any]], user_language: str = "English"
    ) -> Dict[str, Any]:
        """Create a prompt for Ollama with structured output format.

        Args:
            query: User's question
            chunks: List of retrieved chunks

        Returns:
            Dictionary with prompt and format schema for Ollama
        """
        return {
            "prompt": self.create_rag_prompt(query, chunks, user_language),
            "format": RAGResponse.model_json_schema(),
        }


class ResponseParser:
    """Parser for LLM responses."""

    @staticmethod
    def parse_structured_response(response: str) -> Dict[str, Any]:
        """Parse a structured response from Ollama.

        Args:
            response: Raw LLM response string

        Returns:
            Dictionary with parsed response
        """
        try:
            # Try to parse as JSON and validate with Pydantic
            parsed_json = json.loads(response)
            validated_response = RAGResponse(**parsed_json)
            return validated_response.model_dump()
        except (json.JSONDecodeError, ValidationError):
            # Fallback: try to extract JSON from the response
            return ResponseParser._extract_json_fallback(response)

    @staticmethod
    def _extract_json_fallback(response: str) -> Dict[str, Any]:
        """Extract JSON from response text as fallback.

        Args:
            response: Raw response text

        Returns:
            Dictionary with extracted content or fallback
        """
        # Try to find JSON in the response
        json_match = re.search(r"\{.*\}", response, re.DOTALL)
        if json_match:
            try:
                parsed = json.loads(json_match.group())
                # Validate with Pydantic, using defaults for missing fields
                validated = RAGResponse(**parsed)
                return validated.model_dump()
            except (json.JSONDecodeError, ValidationError):
                pass

        # Final fallback: return response as plain text
        return {
            "answer": response,
            "sources": [],
            "confidence": "low",
            "citations": [],
        }
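
And a quick illustration of the parser's three-level fallback, with made-up responses:

parser = ResponseParser()

# 1. Clean JSON: validated against RAGResponse and returned as a dict.
good = '{"answer": "Hybrid search mixes BM25 and dense retrieval.", "confidence": "high"}'
print(parser.parse_structured_response(good)["confidence"])   # high

# 2. Messy output: the regex fallback extracts the embedded JSON object.
messy = 'Sure! Here is the JSON:\n{"answer": "See arXiv:2401.00001.", "citations": ["2401.00001"]}'
print(parser.parse_structured_response(messy)["citations"])   # ['2401.00001']

# 3. No JSON at all: the raw text is returned as a low-confidence answer.
plain = "Hybrid search combines keyword and vector retrieval."
print(parser.parse_structured_response(plain)["confidence"])  # low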

