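2024 iThome Ironman Contest

DAY 9

Generative AI

A Python Beginner's AI Journey: Building Your Own AI / LLM Applications from Scratch, Part 9

[Day 9] A First Look at Retrieval-Augmented Generation (RAG) (2): Build a RAG Q&A System for Your Website!

16th Ironman Contest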

海狸大師

2024-09-23 17:59:25


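Introduction

Yesterday we got to know some cool terms: model hallucination and retrieval-augmented generation (RAG). I bet you're itching to build one yourself! Using the National Cheng Kung University (NCKU) Housing Service Division website as the example, I'll crawl the data with gpt-crawler, build a simple retrieval system with LlamaIndex, and combine it with the LLM API usage from the previous days to make a chatbot dedicated to that site.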

gpt-crawler

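If you don't know what a web crawler is, check out the intro video made by 學仁!

gpt-crawler is an open-source project released last November (2023). With it, we can crawl the data of almost any website without writing a single line of code; npm start is all it takes, which is super convenient. The crawled data can then be given to GPT as a knowledge base to build a custom GPT.

The project isn't written in Python, but that doesn't matter, because all we have to do is change a few variables.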

Installation

Before we start, make sure node and git are installed on your computer. If you're on a Mac, you can install them with Homebrew:
```
brew install node
brew install git
```
You can check whether the installation succeeded with the following two commands:
```
node -v # check the node version
git --version # check the git version
```
If version numbers show up, the installation succeeded.
Go to any folder you like, clone the gpt-crawler project into it, and enter the project directory:
```
git clone git@github.com:BuilderIO/gpt-crawler.git
cd gpt-crawler
```
Install the required node packages:
```
npm install
```

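Usage

Edit config.ts in gpt-crawler and change url and match to the site you want to crawl.

Here is what the more commonly used variables mean:
- url: the target website, where the crawl starts
- match: the URL pattern that decides which links on the site get crawled
- maxPagesToCrawl: the maximum number of pages to crawl
- outputFileName: the name of the output file
- maxTokens (optional): the maximum number of tokens the crawled data may amount to; the crawl stops once it is exceeded. This keeps a large target site from producing more data than ChatGPT can take as context.
For the other available parameters, see the README, which explains them quite clearly.
Run the command to start crawling: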
```
npm start
```
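When the crawl finishes, you'll find the JSON file in the current directory.

And now we have data! output-1.json is the crawl result.

Data cleaning

If you look closely at this JSON file, you'll notice that some entries have the title 404 Not Found. This is where "data cleaning" comes in: we remove them.

Let's write a simple Python script to clear out the unwanted data.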

import json

# Read the original JSON file
with open('output-1.json', 'r', encoding='utf-8') as file:
    data = json.load(file)

# Drop the items whose title is "404 Not Found"
edited_data = [item for item in data if item.get('title') != '404 Not Found']

# Write the cleaned data to a new JSON file
with open('edited.json', 'w', encoding='utf-8') as file:
    json.dump(edited_data, file, ensure_ascii=False, indent=4)

print("Done. Saved to edited.json")
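The filter above only drops entries whose title is exactly 404 Not Found. Crawled titles can differ in capitalization or surrounding whitespace, so a slightly more forgiving version may be handy. This is only a minimal sketch: the helper name drop_not_found and the sample records are made up for illustration and are not part of gpt-crawler.

```python
def drop_not_found(items, bad_title="404 not found"):
    """Return the items whose title is not a 404 page, ignoring case and padding."""
    kept = []
    for item in items:
        # Items without a title are kept; only explicit 404 titles are dropped
        title = (item.get("title") or "").strip().lower()
        if title != bad_title:
            kept.append(item)
    return kept


# Hypothetical sample records, loosely shaped like the crawl output
sample = [
    {"title": "Dorm fees"},
    {"title": "404 Not found"},
    {"title": " 404 NOT FOUND "},
]
print(len(drop_not_found(sample)))
```

In the script above, this could replace the list comprehension: edited_data = drop_not_found(data).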

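gpt-crawler doesn't guarantee that every page gets crawled, nor that the important information is captured; the site may have anti-crawler measures, permission settings, and so on. For this example it's good enough, though. After all, we just want a chatbot for the school's dorm website.

Introducing LlamaIndex

LlamaIndex is a framework designed for retrieval-augmented generation (RAG) and fine-tuning of large language models (LLMs), and it is especially suited to systems that need context augmentation. Its main goal is to help users get domain-specific information more effectively when using LLM APIs, particularly when specialized content such as company background, employee records, or customer-service interactions is involved. (We'll cover fine-tuning later.)

Development environment

Package installation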

pip install llama-index

Installing llama-index also installs the following related packages:

llama-index-core
llama-index-legacy # temporarily included
llama-index-llms-openai
llama-index-embeddings-openai
llama-index-program-openai
llama-index-question-gen-openai
llama-index-agent-openai
llama-index-readers-file
llama-index-multi-modal-llms-openai

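When you use LlamaIndex, it may download some files from HuggingFace; you can set LLAMA_INDEX_CACHE_DIR to change where those files are stored.

You can also install these packages separately; see the official documentation for details.

Environment variables

LlamaIndex uses OpenAI's gpt-3.5-turbo by default, so to get the simple examples below running we need to set an environment variable.

On Windows (replace the XXX with your own API key):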

set OPENAI_API_KEY=XXX

On Linux / macOS:

export OPENAI_API_KEY=XXXXX

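If you'd rather not do that, you can put the API key in a .env file and load it with dotenv, as we did before. In that case every script below needs to call load_dotenv(); if you need the value explicitly, read it with os.getenv("OPENAI_API_KEY"). The .env file contains: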

OPENAI_API_KEY=XXXXX

Load the environment variables at the very top of each script:

from dotenv import load_dotenv
load_dotenv()
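A missing key usually surfaces only later as an authentication error from the API, so it can help to fail early with a clear message. The following is a small sketch using only the standard library; require_env is a hypothetical helper name, not something provided by dotenv or LlamaIndex.

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable, or raise a clear error."""
    # Allow a custom mapping to be passed in, which makes the helper easy to test
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it or add it to .env")
    return value
```

Calling require_env("OPENAI_API_KEY") right after load_dotenv() would stop the script immediately when the key is absent.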

Implementation

Loading the data

First, put the data from earlier into a folder. Here I use a folder named data, which contains the cleaned crawl result edited.json.

from llama_index.core import SimpleDirectoryReader
from rich import print

documents = SimpleDirectoryReader('data').load_data()

I recommend using the rich package to make print output look nicer; see its Chinese README for usage.

# It's this handy, why not install it?
pip install rich

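Using the source-tracing trick from Day 3, you'll find that documents is a list of Document objects, each of which corresponds to a file under the data folder (in my case, edited.json). Even cooler, it represents the file's contents as plain text (in the text attribute).

Before building the index, I want to do a bit more data cleaning. As you may have noticed, every page on the dorm site has a header block above the body text, and I don't want that in my data.

Looking at the JSON file, the header differs from page to page. The part they currently seem to share is 跳到主要內容區\n學生事務長信箱\n聯絡我們\n網站地圖\nEnglish\n本校首頁\n回首頁; the other variants can be stripped out as well.

So the complete code for loading the data is: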

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from dotenv import load_dotenv
from rich import print
load_dotenv()

documents = SimpleDirectoryReader("data").load_data()
for doc in documents:
    doc.text = doc.text.replace("跳到主要內容區\n學生事務長信箱\n聯絡我們\n網站地圖\nEnglish\n本校首頁\n回首頁\n", "")
    doc.text = doc.text.replace("搜尋\n\n \n\n主選單\n最新消息 \n新生專區 [*]\n行事曆 \n單位介紹 \n法規與SOP \n表單下載\n住宿知多少 \n宿委會 \n常見Q&A\n連繫方式\n防疫專區\n性別友善專區\n宿舍場地借用\n業務分類\n宿舍申請\n110-113續住試辦計劃\n住宿費減免\n宿舍餐廳\n工程類進度\n東寧宿舍興建\n宿舍自修室\n宿舍簡易廚房\n服務學習三\n宿舍會議記錄\n", "")
    doc.text = doc.text.replace("Jump to the main content block\nOffice of Student Affairs\nContact us\nSite Map\n中文\nNCKU", "")
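As more header variants turn up, the chain of replace calls gets long. One way to keep it manageable is to list the fragments once and loop over them. This is a sketch only: strip_fragments is a hypothetical helper, and the fragments below are shortened stand-ins for the full header strings used above.

```python
def strip_fragments(text, fragments):
    """Remove every occurrence of each boilerplate fragment from text."""
    for fragment in fragments:
        text = text.replace(fragment, "")
    return text


# Shortened, hypothetical fragments; use the full header strings from your own crawl
HEADER_FRAGMENTS = [
    "跳到主要內容區\n學生事務長信箱\n",
    "Jump to the main content block\n",
]

page = "跳到主要內容區\n學生事務長信箱\n宿舍申請公告"
print(strip_fragments(page, HEADER_FRAGMENTS))
```

Inside the loop over documents, the three replace lines could then become doc.text = strip_fragments(doc.text, HEADER_FRAGMENTS).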

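Indexing & Embedding

In LlamaIndex, an index is built from Document objects and is what the LLM retrieves from. A VectorStoreIndex splits the text into nodes and then creates a vector embedding for the text of each node so that it can be queried by a large language model. So starting from this step, we're calling a model API (the embedding model, to be precise).

# Yep, the complicated thing from yesterday takes just one line with the package XD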
index = VectorStoreIndex.from_documents(documents)
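To get a feel for what that one line hides, here is a toy version of the same idea in plain Python: split text into nodes, turn each node into a vector, and rank nodes by cosine similarity to the query. It is not how LlamaIndex is implemented. The "embedding" here is just a word count, whereas a real embedding model returns dense vectors, and all function names are invented for this sketch.

```python
import math
from collections import Counter


def split_into_nodes(text, words_per_node=8):
    """Cut text into chunks of a few words (a stand-in for real node parsing)."""
    words = text.split()
    return [" ".join(words[i:i + words_per_node])
            for i in range(0, len(words), words_per_node)]


def toy_embed(text):
    """Toy 'embedding': lowercase word counts instead of a learned dense vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(texts, words_per_node=8):
    """Return a list of (node_text, vector) pairs."""
    nodes = [node for text in texts for node in split_into_nodes(text, words_per_node)]
    return [(node, toy_embed(node)) for node in nodes]


def search(index, query, top_k=1):
    """Return the top_k node texts most similar to the query."""
    query_vec = toy_embed(query)
    ranked = sorted(index, key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    return [node for node, _ in ranked[:top_k]]


toy_index = build_index(["dorm fee is paid in september", "the kitchen closes at ten pm"])
print(search(toy_index, "when is the dorm fee paid"))
```

The query shares four words with the first node and only one with the second, so the first node ranks highest. A real vector index follows the same split, embed, and compare flow, with a far better notion of similarity.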

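Storing & Loading the Index

Of course, we don't want to rebuild the index every time the program runs, so we save it. You can store it locally or in a vector database.

Storing the index locally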

from llama_index.core import StorageContext, load_index_from_storage

# Save the index
index.storage_context.persist(persist_dir="path/to/directory")

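Loading the index

# You can think of StorageContext as a relay station:
# it reads what was saved locally, and load_index_from_storage
# then turns it into a LlamaIndex Index
# (INDEX_PATH is the directory you passed to persist())
storage_context = StorageContext.from_defaults(persist_dir=INDEX_PATH)

# Load the index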
index = load_index_from_storage(storage_context)

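Using a vector database

Here I use ChromaDB to demonstrate. Install the packages first (the ChromaVectorStore import below lives in the separate llama-index-vector-stores-chroma package):

pip install chromadb llama-index-vector-stores-chroma

Create the database and store the index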

import chromadb
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext

# Same as before: load the documents
documents = SimpleDirectoryReader("./data").load_data()

# Initialize the client and choose where the vector database lives on disk
# We need this client to operate on the database
db = chromadb.PersistentClient(path="./chroma_db")

# Create the collection
chroma_collection = db.get_or_create_collection("my_collection")

# As in the local approach, use the LlamaIndex API to get a StorageContext
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create the index
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

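A collection here is a bit like a table in a database. One vector database can hold multiple collections, which you can use to separate different categories of data. Very convenient.

Loading the data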

import chromadb
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext

# Initialize the client
db = chromadb.PersistentClient(path="./chroma_db")

# Get the collection
chroma_collection = db.get_or_create_collection("my_collection")

# Get the StorageContext
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Turn it into an Index
index = VectorStoreIndex.from_vector_store(
    vector_store, storage_context=storage_context
)

Querying

The simplest way is to create a QueryEngine with as_query_engine():

query_engine = index.as_query_engine()
response = query_engine.query(
    "我要查有關宿舍費用的東東"
)
print(response)

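Yes, it's that simple. You don't even have to think about how to write the prompt, because those details are wrapped up inside QueryEngine. For the prompt details, look at default_prompts.py in the LlamaIndex source, or call get_prompts() to inspect the prompts directly.

prompts_dict = query_engine.get_prompts()
print(prompts_dict) # This prints the whole object; with the rich package it shouldn't look too ugly

The printed result breaks down into response_synthesizer:text_qa_template and response_synthesizer:refine_template. The two are often used together:

text_qa_template: handles the question-answering task, answering from the retrieved information
refine_template: improves an answer that already exists

Let's see how refine_template is written... hmm, nicely done: the instructions are explicit and clear, a good model for your own prompt practice!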

The original query is as follows: {query_str}\nWe have provided an existing answer: {existing_answer}\nWe have the opportunity to refine
the existing answer (only if needed) with some more context below.\n------------\n{context_msg}\n------------\nGiven the new context, refine the original 
answer to better answer the query. If the context isn't useful, return the original answer.\nRefined Answer:

text_qa_template

Context information is below.\n---------------------\n{context_str}\n---------------------\nGiven the context information and not prior 
knowledge, answer the query.\nQuery: {query_str}\nAnswer:
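One way to picture how the two templates cooperate: the first retrieved chunk is answered with text_qa_template, and each further chunk is offered to the model through refine_template together with the answer so far. The sketch below is a simplified illustration of that flow, not LlamaIndex's actual response synthesizer code. answer_with_refine is a made-up name, and llm stands for any function that takes a prompt string and returns text.

```python
TEXT_QA = (
    "Context information is below.\n---------------------\n{context_str}\n"
    "---------------------\nGiven the context information and not prior "
    "knowledge, answer the query.\nQuery: {query_str}\nAnswer: "
)
REFINE = (
    "The original query is as follows: {query_str}\n"
    "We have provided an existing answer: {existing_answer}\n"
    "We have the opportunity to refine the existing answer (only if needed) "
    "with some more context below.\n------------\n{context_msg}\n------------\n"
    "Given the new context, refine the original answer to better answer the query. "
    "If the context isn't useful, return the original answer.\nRefined Answer: "
)


def answer_with_refine(llm, query, chunks):
    """Answer from the first chunk, then offer each later chunk as a refinement."""
    if not chunks:
        raise ValueError("need at least one retrieved chunk")
    # First pass: plain question answering over the first chunk
    answer = llm(TEXT_QA.format(context_str=chunks[0], query_str=query))
    # Later passes: show the model its previous answer plus one more chunk
    for chunk in chunks[1:]:
        answer = llm(REFINE.format(query_str=query,
                                   existing_answer=answer,
                                   context_msg=chunk))
    return answer
```

Notice that the two templates use different placeholder names for the context: context_str in text_qa_template and context_msg in refine_template.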

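I won't go into the details of prompts_dict; this post would get too long XD. Here's a little code-tracing trick I learned in Jserv's class: if you're using a Linux / macOS terminal, you should have the grep command. Find where your Python packages are installed and search for a keyword directly to see which files use it. (You could also just search in VS Code, but this way is cooler.)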
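If grep isn't available (on Windows, for example), a rough Python analogue of the same keyword search can be sketched with pathlib. This is not the grep approach itself, just a stand-in for it; find_keyword is a hypothetical helper name.

```python
from pathlib import Path


def find_keyword(root, keyword, pattern="*.py"):
    """Yield (path, line_number, line) for every line under root containing keyword."""
    for path in sorted(Path(root).rglob(pattern)):
        try:
            lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
        except OSError:
            continue  # unreadable file, skip it
        for number, line in enumerate(lines, start=1):
            if keyword in line:
                yield path, number, line.strip()
```

To search the installed package, the root could be the folder that contains llama_index.core.__file__, and the keyword something like text_qa_template.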

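That was a bit of a tangent. But you've probably noticed that the prompts are all in English, so let's modify the prompt and tell the QueryEngine to answer in Chinese. See the official documentation for the prompt format. We'll operate on text_qa_template again and simply rewrite the prompt above in Chinese. Note that the {context_str} and {query_str} placeholders must stay untouched; everything else is fair game. Here's the modified prompt; I also ask it to append an "XD" to see what happens.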

qa_template = (
"""
以下是上下文
---------------------
{context_str}
---------------------
請根據上下文信息回答以下問題，不需要事先知識，並在最後加上一個 "XD" 笑臉符號。
問題: {query_str}
回答: 
"""
)

Updating the QueryEngine's prompt template

from llama_index.core import PromptTemplate

custom_qa_prompts = PromptTemplate(qa_template)
query_engine = index.as_query_engine()
query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": custom_qa_prompts}
)
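Since a custom template silently breaks if a placeholder is renamed or dropped, a quick check before handing it to PromptTemplate can save some confusion. The following sketch uses only the standard library; template_fields and missing_fields are hypothetical helper names, and the required placeholders are the ones that appear in the default text_qa_template shown earlier.

```python
from string import Formatter


def template_fields(template):
    """Return the set of {placeholder} names used in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_fields(template, required=("context_str", "query_str")):
    """Return the required placeholders that the template forgot to include."""
    return sorted(set(required) - template_fields(template))
```

For the Chinese template above, missing_fields(qa_template) should come back empty; a non-empty list names the placeholders that were lost during editing.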

Complete code (without ChromaDB)

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core import PromptTemplate
from dotenv import load_dotenv
from rich import print
import os
load_dotenv()

INDEX_PATH = "index"

# store if not exist
if not os.path.exists(INDEX_PATH):
    documents = SimpleDirectoryReader("data").load_data()
    for doc in documents:
        doc.text = doc.text.replace("跳到主要內容區\n學生事務長信箱\n聯絡我們\n網站地圖\nEnglish\n本校首頁\n回首頁\n", "")
        doc.text = doc.text.replace("搜尋\n\n \n\n主選單\n最新消息 \n新生專區 [*]\n行事曆 \n單位介紹 \n法規與SOP \n表單下載\n住宿知多少 \n宿委會 \n常見Q&A\n連繫方式\n防疫專區\n性別友善專區\n宿舍場地借用\n業務分類\n宿舍申請\n110-113續住試辦計劃\n住宿費減免\n宿舍餐廳\n工程類進度\n東寧宿舍興建\n宿舍自修室\n宿舍簡易廚房\n服務學習三\n宿舍會議記錄\n", "")
        doc.text = doc.text.replace("Jump to the main content block\nOffice of Student Affairs\nContact us\nSite Map\n中文\nNCKU", "")

    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(INDEX_PATH)
else:
    # rebuild storage context
    storage_context = StorageContext.from_defaults(persist_dir=INDEX_PATH)
    # load index
    index = load_index_from_storage(storage_context=storage_context)

qa_template = (
"""
以下是上下文
---------------------
{context_str}
---------------------
請根據上下文信息回答以下問題，不需要事先知識，並在最後加上一個 "XD" 笑臉符號。
問題: {query_str}
回答: 
"""
)

custom_qa_prompts = PromptTemplate(qa_template)
query_engine = index.as_query_engine()
query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": custom_qa_prompts}
)

response = query_engine.query("補宿申請？")
print(response.response)

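Output

Feeling annoyed that a game of next-word prediction has to be this complicated? Well, let me tell you, with LangChain it would be even more complicated, hahaha. That's also why I want to design workflows with a graphical interface later on. For details, see the video "Why we gave up on LangChain".

Improvements

This simple example still has plenty of room for improvement, and one area is the choice of model. Remember the benchmark by ihower that we mentioned yesterday, "Evaluating the retrieval ability of embedding models on Traditional Chinese"? LlamaIndex's default embedding model is text-embedding-ada-002. If you want to use OpenAI's text-embedding-3-small instead, set the model at the very beginning of your code. Note that Settings is a global variable.

This is the OpenAI setting: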

from llama_index.embeddings.openai import OpenAIEmbedding

customize_embedding = OpenAIEmbedding(model="text-embedding-3-small")

You can also pick a model from HuggingFace yourself; it will be downloaded to your computer:

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

customize_embedding = HuggingFaceEmbedding(
    model_name="jinaai/jina-embeddings-v2-base-zh" # put the model name here
)

Apply the setting

from llama_index.core import Settings

Settings.embed_model = customize_embedding

For more details, see YWC's "LlamaIndex study notes: using different embedding models".

We can also use just a Retriever to find similar nodes. It lets you set values such as top_k, and you design the rest of the prompt yourself; QueryEngine is the higher-level tool.
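As a rough illustration of the retriever-only route, the helper below asks an index for its top_k most similar nodes and returns their scores and text, leaving prompt construction to you. It assumes an index built as in the earlier sections; retrieve_top_k is a hypothetical wrapper name, while as_retriever, similarity_top_k, and retrieve are the LlamaIndex calls it relies on.

```python
def retrieve_top_k(index, question, top_k=3):
    """Return (score, text) pairs for the top_k nodes most similar to the question."""
    retriever = index.as_retriever(similarity_top_k=top_k)
    hits = retriever.retrieve(question)
    return [(hit.score, hit.node.get_content()) for hit in hits]
```

The returned text could then be pasted into a prompt of your own design, such as the Chinese qa_template shown earlier, before calling the LLM API directly.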

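Oh, by the way, even though the package is called LlamaIndex, I don't think I used any llama today, hahaha. LlamaIndex does support the Groq API, though; see its documentation.

For everything else, please read the documentation yourselves. **RTFM!!!**

Summary

Phew... this post is finally done. I hope you now have a basic idea of RAG. Isn't building your very own ChatGPT cool?

Whether it's LlamaIndex or LangChain, both provide a layer of abstraction at the code level, so you don't have to write prompts yourself or wire up all sorts of odds and ends (imagine doing the whole flow above with nothing but the OpenAI API; you'd go crazy, right?). They were really hot around the time OpenAI released its API, and they still are. They do have some problems, though (see this article for details), mainly being "too complex" and "not flexible enough". None of that really matters: those who use them will keep using them, and those who don't still won't. Tools are a means of making people more efficient; use whatever works, just don't put the cart before the horse.

I'm getting a bit tired at this point. Organizing these two days of content took me a lot of time QQ, but I still enjoyed writing it. Tomorrow we'll talk about something lighter and get to know Ollama. Stay tuned!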