2024 iThome Ironman Contest

DAY 17

Generative AI

LLM and Generative AI Notes, series part 17

Day 17: LangChain from beginner to proficient (building vector stores and retrievers with the OpenAI API)

16th Ironman Contest

中年一般人

2024-08-18 17:27:01


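Preface:

I originally wanted to use a Hugging Face API endpoint together with a free translation API, but I have been short on time these past few days, so that will have to wait.

Vector stores and retrievers:

This example is adapted from the tutorial linked in the original post and localized into Chinese.
This tutorial will familiarize you with LangChain's vector store and retriever abstractions. These abstractions are designed to support retrieving data from (vector) databases and other sources so that it can be integrated into LLM workflows. They are important for applications that fetch data to be reasoned over as part of model inference, such as retrieval-augmented generation (RAG) (see the RAG tutorial).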

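Concepts

This example focuses on retrieving text data. We will cover the following concepts:

Documents;
Vector stores;
Retrievers.

Setup

Jupyter Notebook

This tutorial, like the others, is probably most convenient to run in a Jupyter Notebook. See [here] for installation instructions.

Installation

This tutorial requires the following packages: langchain, langchain-chroma, and langchain-openai.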

!pip install langchain langchain-chroma langchain-openai

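Installation guide

For more details, see the installation guide.

LangSmith

Many of the applications you build with LangChain will contain multiple steps and multiple LLM calls. As these applications grow more complex, it becomes crucial to be able to inspect what exactly is happening inside your chain or agent. The best way to do this is with LangSmith.

After you sign up at the link above, make sure to set your environment variables to start logging traces: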

import os
from langchain_openai import ChatOpenAI

# Enable LangSmith tracing
os.environ["LANGCHAIN_TRACING_V2"] = "true"
# Replace with your LANGCHAIN_API_KEY
os.environ["LANGCHAIN_API_KEY"] = "replace-with-your-LANGCHAIN_API_KEY"

# Replace with your OPENAI_API_KEY
os.environ["OPENAI_API_KEY"] = "replace-with-your-OPENAI_API_KEY"

# Chat model used later in the RAG chain
llm = ChatOpenAI(model="gpt-4o-mini")

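Documents

LangChain implements a Document abstraction, which is intended to represent a unit of text and its associated metadata. It has two attributes:

page_content: a string representing the content.
metadata: a dictionary containing arbitrary metadata.

The metadata attribute can capture information about the source of the document, its relationship to other documents, and other information. Note that an individual Document object often represents a chunk of a larger document.

Let's generate some sample documents: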

from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Goldfish are popular pets for beginners, requiring relatively simple care.",
        metadata={"source": "fish-pets-doc"},
    ),
    Document(
        page_content="Parrots are intelligent birds capable of mimicking human speech.",
        metadata={"source": "bird-pets-doc"},
    ),
    Document(
        page_content="Rabbits are social animals that need plenty of space to hop around.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

documentsTW = [
    Document(
        page_content="狗是人類的好夥伴，以忠誠和友善著稱。",
        metadata={"source": "哺乳動物寵物文檔"},
    ),
    Document(
        page_content="貓是獨立的寵物，通常喜歡擁有自己的空間。",
        metadata={"source": "哺乳動物寵物文檔"},
    ),
    Document(
        page_content="金魚是適合初學者的熱門寵物，只需要相對簡單的照顧。",
        metadata={"source": "魚類寵物文檔"},
    ),
    Document(
        page_content="鸚鵡是聰明的鳥類，能夠模仿人類的說話。",
        metadata={"source": "鳥類寵物文檔"},
    ),
    Document(
        page_content="兔子是社交動物，需要足夠的空間來蹦蹦跳跳。",
        metadata={"source": "哺乳動物寵物文檔"},
    ),
]

API reference: Document

Here we have generated five documents in each list, with metadata identifying three distinct "sources".

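Vector stores

Vector search is a common way to store and search over unstructured data (such as unstructured text). The idea is to store numeric vectors that are associated with the text. Given a query, we can embed it as a vector of the same dimension and use vector similarity metrics to identify related data in the store.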
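
To make the idea concrete, here is a minimal sketch of vector search in plain Python. The three-dimensional vectors are made up for illustration (a real embedding model returns vectors with far more dimensions), and cosine_similarity and search are hypothetical helper names, not LangChain APIs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "store": each text is paired with a hand-made vector
# (assumed values for illustration, not real embeddings).
toy_store = [
    ("Cats are independent pets.", [0.9, 0.1, 0.0]),
    ("Dogs are great companions.", [0.7, 0.6, 0.1]),
    ("Goldfish are popular pets for beginners.", [0.1, 0.2, 0.9]),
]


def search(query_vector, store, k=2):
    """Return the k texts whose vectors are most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Pretend this vector is the embedding of the query "cat".
toy_query = [1.0, 0.0, 0.0]
top_texts = search(toy_query, toy_store)
print(top_texts)
```

A real vector store does the same thing at scale: it embeds the query with the same model used for the documents and ranks the stored vectors by a similarity or distance metric.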

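LangChain VectorStore objects contain methods for adding text and Document objects to the store, and for querying them using various similarity metrics. They are often initialized with embedding models, which determine how text data is translated into numeric vectors.

LangChain includes a suite of integrations with different vector store technologies. Some vector stores are hosted by a provider (for example, various cloud providers) and require specific credentials to use; some (such as Postgres) run in separate infrastructure that can be run locally or via a third party; others can run in memory for lightweight workloads. Here we demonstrate LangChain VectorStores using Chroma, which includes an in-memory implementation.

To instantiate a vector store, we usually need to provide an embedding model that specifies how text should be converted into a numeric vector. Here we will use OpenAI embeddings.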

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(
    documents,
    embedding=OpenAIEmbeddings(),
)

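API reference: OpenAIEmbeddings

Calling .from_documents here adds the documents to the vector store. VectorStore implements methods for adding documents that can also be called after the object is instantiated. Most implementations let you connect to an existing vector store, for example by providing a client, index name, or other information. See the documentation for a specific integration for more detail.

Once we have instantiated a VectorStore that contains documents, we can query it. VectorStore includes methods for querying:

Synchronously and asynchronously;
By string query and by vector;
With and without returning similarity scores;
By similarity and by maximum marginal relevance (to balance similarity to the query against diversity in the retrieved results).

These methods will generally include a list of Document objects in their outputs.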
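
Maximum marginal relevance is the least self-explanatory item in the list above, so here is a minimal sketch of the greedy selection it describes, in plain Python. The vectors are invented (unit vectors at chosen angles), and mmr_select and unit are hypothetical helper names; this illustrates the idea and is not the code LangChain runs.

```python
import math


def unit(angle_degrees):
    """Unit vector in 2D at the given angle (toy stand-in for an embedding)."""
    r = math.radians(angle_degrees)
    return [math.cos(r), math.sin(r)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: pick k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def mmr_score(i):
            relevance = cosine_similarity(query_vec, doc_vecs[i])
            # Similarity to the closest document that was already picked.
            redundancy = max(
                (cosine_similarity(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Assumed toy data: documents 0 and 1 are near-duplicates, document 2 differs.
toy_docs = [unit(0), unit(5), unit(60)]
toy_query = unit(20)

by_similarity = sorted(
    range(len(toy_docs)),
    key=lambda i: cosine_similarity(toy_query, toy_docs[i]),
    reverse=True,
)[:2]
by_mmr = mmr_select(toy_query, toy_docs, k=2, lambda_mult=0.5)
print(by_similarity, by_mmr)
```

Plain similarity returns the two near-duplicates, while MMR keeps the best match and then prefers the document that adds something new. A lambda_mult closer to 1 weights relevance more heavily; closer to 0 weights diversity.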

Examples:

Return documents based on similarity to a string query:

vectorstore.similarity_search("cat")

[Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Dogs are great companions, known for their loyalty and friendliness.'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Rabbits are social animals that need plenty of space to hop around.'),
 Document(metadata={'source': 'bird-pets-doc'}, page_content='Parrots are intelligent birds capable of mimicking human speech.')]

vectorstoreTW = Chroma.from_documents(
    documentsTW,
    embedding=OpenAIEmbeddings(),
)
vectorstoreTW.similarity_search("貓")

[Document(metadata={'source': '哺乳動物寵物文檔'}, page_content='貓是獨立的寵物，通常喜歡擁有自己的空間。'),
 Document(metadata={'source': '哺乳動物寵物文檔'}, page_content='兔子是社交動物，需要足夠的空間來蹦蹦跳跳。'),
 Document(metadata={'source': '哺乳動物寵物文檔'}, page_content='狗是人類的好夥伴，以忠誠和友善著稱。'),
 Document(metadata={'source': '魚類寵物文檔'}, page_content='金魚是適合初學者的熱門寵物，只需要相對簡單的照顧。')]

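The good news is that Chinese works here.

The bad news, visible in the outputs below, is that a search can return documents from both stores, as if it ran over all the data loaded by the code on this page. This may be because both stores use the same embedding, or it may just be a bug. A likely cause is that both stores were created without a collection_name, so they end up sharing the same default in-memory Chroma collection.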
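
One plausible explanation for the mixed results, offered as an assumption rather than a confirmed diagnosis: both calls to Chroma.from_documents omit collection_name, so they write into the same default collection of the same in-memory Chroma instance. The toy model below is plain Python; ToyClient and ToyStore are hypothetical names and do not reflect Chroma's real implementation. It shows how two store objects that share a default collection name see each other's documents, and how distinct names keep them apart.

```python
class ToyClient:
    """Stand-in for a shared in-memory database: collection name -> list of texts."""

    def __init__(self):
        self.collections = {}


class ToyStore:
    """Stand-in for a vector store handle bound to one named collection."""

    def __init__(self, client, texts, collection_name="default"):
        # Reuse the collection if it already exists, otherwise create it.
        self.items = client.collections.setdefault(collection_name, [])
        self.items.extend(texts)

    def all_texts(self):
        return list(self.items)


shared_client = ToyClient()

# Both stores fall back to the same default name, so their data is merged.
store_en = ToyStore(shared_client, ["Cats are independent pets."])
store_tw = ToyStore(shared_client, ["貓是獨立的寵物。"])
print(store_en.all_texts())  # holds both the English and the Chinese text

# Giving each store its own collection name keeps the data separate.
separate_client = ToyClient()
pets_en = ToyStore(separate_client, ["Cats are independent pets."], collection_name="pets_en")
pets_tw = ToyStore(separate_client, ["貓是獨立的寵物。"], collection_name="pets_tw")
print(pets_en.all_texts())  # holds only the English text
```

With langchain-chroma, the corresponding change would be to pass a different collection_name to each Chroma.from_documents call, for example "pets_en" and "pets_tw" (illustrative names).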

Async query:

await vectorstore.asimilarity_search("cat")

[Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.'),
 Document(metadata={'source': '哺乳動物寵物文檔'}, page_content='貓是獨立的寵物，通常喜歡擁有自己的空間。'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Dogs are great companions, known for their loyalty and friendliness.'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Rabbits are social animals that need plenty of space to hop around.')]

await vectorstoreTW.asimilarity_search("貓")

[Document(metadata={'source': '哺乳動物寵物文檔'}, page_content='貓是獨立的寵物，通常喜歡擁有自己的空間。'),
 Document(metadata={'source': '哺乳動物寵物文檔'}, page_content='兔子是社交動物，需要足夠的空間來蹦蹦跳跳。'),
 Document(metadata={'source': '哺乳動物寵物文檔'}, page_content='狗是人類的好夥伴，以忠誠和友善著稱。'),
 Document(metadata={'source': '魚類寵物文檔'}, page_content='金魚是適合初學者的熱門寵物，只需要相對簡單的照顧。')]

Returning scores:

# Note that providers implement different scores;
# Chroma here returns a distance metric that should vary inversely with similarity.

vectorstore.similarity_search_with_score("cat")

[(Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.'),
  0.375326931476593),
 (Document(metadata={'source': '哺乳動物寵物文檔'}, page_content='貓是獨立的寵物，通常喜歡擁有自己的空間。'),
  0.4653646945953369),
 (Document(metadata={'source': 'mammal-pets-doc'}, page_content='Dogs are great companions, known for their loyalty and friendliness.'),
  0.4833090305328369),
 (Document(metadata={'source': 'mammal-pets-doc'}, page_content='Rabbits are social animals that need plenty of space to hop around.'),
  0.4958883225917816)]

vectorstoreTW.similarity_search_with_score("貓")

[(Document(metadata={'source': '哺乳動物寵物文檔'}, page_content='貓是獨立的寵物，通常喜歡擁有自己的空間。'),
  0.2581682503223419),
 (Document(metadata={'source': '哺乳動物寵物文檔'}, page_content='兔子是社交動物，需要足夠的空間來蹦蹦跳跳。'),
  0.36962026357650757),
 (Document(metadata={'source': '哺乳動物寵物文檔'}, page_content='狗是人類的好夥伴，以忠誠和友善著稱。'),
  0.38884037733078003),
 (Document(metadata={'source': '魚類寵物文檔'}, page_content='金魚是適合初學者的熱門寵物，只需要相對簡單的照顧。'),
  0.40212905406951904)]
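
Because the score is a distance, a smaller number means a closer match. As a hedged illustration, assume the distance is squared Euclidean (L2) distance and the embedding vectors have unit length; under those assumptions distance and cosine similarity are tied together by distance = 2 - 2 * cosine. The sketch below checks that identity on made-up vectors. Whether a given Chroma collection actually uses this metric depends on its configuration, so treat the conversion as an assumption.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors, normalized to unit length.
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

distance = squared_l2(a, b)
similarity = cosine_similarity(a, b)

# For unit vectors: |a - b|^2 = |a|^2 + |b|^2 - 2 a.b = 2 - 2 * cosine
print(distance, 2 - 2 * similarity)
```

Under the same assumptions, the ordering of results by ascending distance is identical to the ordering by descending cosine similarity.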

Return documents based on similarity to an embedded query:

embedding = OpenAIEmbeddings().embed_query("cat")

vectorstore.similarity_search_by_vector(embedding)

[Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.'),
 Document(metadata={'source': '哺乳動物寵物文檔'}, page_content='貓是獨立的寵物，通常喜歡擁有自己的空間。'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Dogs are great companions, known for their loyalty and friendliness.'),
 Document(metadata={'source': 'mammal-pets-doc'}, page_content='Rabbits are social animals that need plenty of space to hop around.')]

embeddingTW = OpenAIEmbeddings().embed_query("貓")

vectorstoreTW.similarity_search_by_vector(embeddingTW)

[Document(metadata={'source': '哺乳動物寵物文檔'}, page_content='貓是獨立的寵物，通常喜歡擁有自己的空間。'),
 Document(metadata={'source': '哺乳動物寵物文檔'}, page_content='兔子是社交動物，需要足夠的空間來蹦蹦跳跳。'),
 Document(metadata={'source': '哺乳動物寵物文檔'}, page_content='狗是人類的好夥伴，以忠誠和友善著稱。'),
 Document(metadata={'source': '魚類寵物文檔'}, page_content='金魚是適合初學者的熱門寵物，只需要相對簡單的照顧。')]

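Learn more:

Retrievers

LangChain VectorStore objects do not subclass Runnable, so they cannot immediately be integrated into LangChain Expression Language (LCEL) chains.

LangChain Retrievers are Runnables, so they implement a standard set of methods (for example, synchronous and asynchronous invoke and batch operations) and are designed to be incorporated into LCEL chains.

We can create a simple version of a retriever ourselves, without subclassing Retriever. If we choose the method we want to use to retrieve documents, we can easily create a runnable. Below we build one around the similarity_search method: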

from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda

retriever = RunnableLambda(vectorstore.similarity_search).bind(k=1)  # select top result

retriever.batch(["cat", "shark"])

[[Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.')],
 [Document(metadata={'source': 'fish-pets-doc'}, page_content='Goldfish are popular pets for beginners, requiring relatively simple care.')]]

retrieverTW = RunnableLambda(vectorstoreTW.similarity_search).bind(k=1)  # select top result

retrieverTW.batch(["貓", "鯊魚"])

[[Document(metadata={'source': '哺乳動物寵物文檔'}, page_content='貓是獨立的寵物，通常喜歡擁有自己的空間。')],
 [Document(metadata={'source': '魚類寵物文檔'}, page_content='金魚是適合初學者的熱門寵物，只需要相對簡單的照顧。')]]

API reference: Document | RunnableLambda
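
To see why wrapping a search method this way yields something with invoke and batch, here is a toy stand-in written in plain Python. ToyRunnable is a hypothetical class that only mimics the shape of RunnableLambda(...).bind(k=1); it is not LangChain's implementation, and toy_similarity_search uses naive keyword counting instead of embeddings.

```python
class ToyRunnable:
    """Minimal object exposing bind, invoke and batch around a plain function."""

    def __init__(self, func, **bound_kwargs):
        self.func = func
        self.bound_kwargs = bound_kwargs

    def bind(self, **kwargs):
        """Return a new runnable with extra keyword arguments fixed in advance."""
        return ToyRunnable(self.func, **{**self.bound_kwargs, **kwargs})

    def invoke(self, value):
        """Call the wrapped function on one input."""
        return self.func(value, **self.bound_kwargs)

    def batch(self, values):
        """Call the wrapped function on each input, preserving order."""
        return [self.invoke(v) for v in values]


TOY_TEXTS = [
    "Cats are independent pets that often enjoy their own space.",
    "Goldfish are popular pets for beginners, requiring relatively simple care.",
]


def toy_similarity_search(query, k=4):
    """Rank texts by how often the query word occurs (a crude stand-in for similarity)."""
    ranked = sorted(
        TOY_TEXTS,
        key=lambda text: text.lower().count(query.lower()),
        reverse=True,
    )
    return ranked[:k]


toy_retriever = ToyRunnable(toy_similarity_search).bind(k=1)
results = toy_retriever.batch(["cat", "goldfish"])
print(results)
```

The point of the pattern is that once retrieval sits behind a uniform invoke/batch interface, it can be composed with prompts and models without those components knowing which search method is used underneath.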

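Vector stores implement an as_retriever method that generates a Retriever, specifically a VectorStoreRetriever. These retrievers include specific search_type and search_kwargs attributes that identify which methods of the underlying vector store to call and how to parameterize them. For instance, we can replicate the above with the following: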

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(["cat", "shark"])

[[Document(metadata={'source': 'mammal-pets-doc'}, page_content='Cats are independent pets that often enjoy their own space.')],
 [Document(metadata={'source': 'fish-pets-doc'}, page_content='Goldfish are popular pets for beginners, requiring relatively simple care.')]]

retrieverTW = vectorstoreTW.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retrieverTW.batch(["貓", "鯊魚"])

[[Document(metadata={'source': '哺乳動物寵物文檔'}, page_content='貓是獨立的寵物，通常喜歡擁有自己的空間。')],
 [Document(metadata={'source': '魚類寵物文檔'}, page_content='金魚是適合初學者的熱門寵物，只需要相對簡單的照顧。')]]

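VectorStoreRetriever supports the search types "similarity" (the default), "mmr" (maximum marginal relevance, described above), and "similarity_score_threshold". We can use the latter to threshold the documents output by the retriever by similarity score.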
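
The thresholding idea can be sketched without any library: keep only the candidates whose relevance score reaches a cutoff. The scores below are invented for illustration and filter_by_threshold is a hypothetical helper; in LangChain the cutoff is supplied as score_threshold inside search_kwargs when search_type is "similarity_score_threshold".

```python
def filter_by_threshold(scored_docs, score_threshold):
    """Keep (text, score) pairs whose relevance score is at least the threshold."""
    return [(text, score) for text, score in scored_docs if score >= score_threshold]


# Assumed relevance scores in [0, 1], where higher means more relevant.
scored = [
    ("Cats are independent pets.", 0.82),
    ("Dogs are great companions.", 0.55),
    ("Goldfish are popular pets.", 0.31),
]

kept = filter_by_threshold(scored, score_threshold=0.5)
print(kept)
```

Unlike a fixed k, a threshold can return any number of documents, including none, so downstream code should handle an empty context.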

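Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications that combine a given question with retrieved context into a prompt for an LLM. Below we show a minimal example.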

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

message = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages([("human", message)])

rag_chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | llm

API reference: ChatPromptTemplate | RunnablePassthrough

response = rag_chain.invoke("tell me about cats")

print(response.content)

Cats are independent pets that often enjoy their own space.

rag_chainTW = {"context": retrieverTW, "question": RunnablePassthrough()} | prompt | llm

responseTW = rag_chainTW.invoke("貓是怎樣的生物")

print(responseTW.content)

貓是獨立的寵物，通常喜歡擁有自己的空間。
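
What the chain does before calling the model can be approximated in plain Python: the same input string is sent to the retriever (to fill context) and passed through unchanged (to fill question), and both values are substituted into the template. The sketch below uses a hypothetical toy_retrieve function and str.format instead of LangChain's classes, joins the retrieved texts with newlines as a simplification, and stops before the LLM call.

```python
TEMPLATE = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""


def toy_retrieve(question):
    """Hypothetical retriever: always returns one fixed text, for illustration only."""
    return ["Cats are independent pets that often enjoy their own space."]


def build_prompt(question):
    """Mirror {"context": retriever, "question": passthrough} followed by the prompt."""
    inputs = {
        "context": "\n".join(toy_retrieve(question)),  # retrieved text
        "question": question,  # the input, passed through unchanged
    }
    return TEMPLATE.format(**inputs)


prompt_text = build_prompt("tell me about cats")
print(prompt_text)
```

Printing the assembled prompt like this is also a quick way to check what the model will actually see when an answer looks wrong.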

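Learn more:

Retrieval strategies can be rich and complex. For example:

We can derive hard-coded rules and filters from a query (for example, "use documents published after 2020");
We can return documents that are linked to the retrieved context in some way (for example, via some document taxonomy);
We can generate multiple embeddings for each unit of context;
We can ensemble results from multiple retrievers;
We can assign weights to documents, for example weighting recent documents higher.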
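
As one illustration of combining results from multiple retrievers, the sketch below uses reciprocal rank fusion (RRF), a common rank-merging technique chosen here as an example; the text above does not prescribe a method. Each document receives 1 / (c + rank) from every ranked list it appears in, and the summed scores decide the final order. The function name, the constant c = 60, and the sample rankings are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, c=60):
    """Merge several ranked lists of document ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Made-up outputs of two hypothetical retrievers for the same query.
keyword_results = ["cats", "dogs", "rabbits"]
vector_results = ["cats", "parrots", "dogs"]

fused = reciprocal_rank_fusion([keyword_results, vector_results])
print(fused)
```

Documents that rank well in both lists rise to the top, while documents found by only one retriever are still kept further down.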

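The "Retrievers" section of the how-to guides covers these and other built-in retrieval strategies.

It is also straightforward to extend the BaseRetriever class in order to implement custom retrievers. See the how-to guide for details.

The working environment is below:

Working Colab notebook

Takeaways:

When doing document search here, it is best to run only one search task with a single purpose at a time; otherwise you run into the mixed-results bug described above.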

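Introduction:
Tutorials:
1. Basics:
  1. Build a simple LLM application with LCEL
  2. Build a chatbot
  3. Build vector stores and retrievers
  4. Build an Agent
2. Bringing external information into an Agent's workflow
  5. Build a retrieval-augmented generation (RAG) application
  6. Build a conversational RAG application
  7. Build a question-answering system over SQL data
  8. Build a query analysis system
  9. Build a local RAG application
  10. Build a question-answering application over a graph database
  11. Build a PDF ingestion and question-answering system
3. Specific tasks or features
  12. Build a data extraction method
  13. Generate synthetic data
  14. Classify text into labels
  15. Summarize text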

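LangGraph:

Quick start:
Chatbots:
1. Customer support bot
2. Prompt generation from user requirements
3. Code assistant
RAG:
4. Adaptive RAG
5. Adaptive RAG with a local LLM
6. Agentic RAG
7. Corrective RAG
8. Corrective RAG with a local LLM
9. Self-RAG
10. Self-RAG with a local LLM
11. SQL Agent
Agent architectures:
1. Multi-agent systems:
  12. Collaboration between two Agents
  13. Supervision
  14. Hierarchical teams
2. Planning Agents:
  15. Plan-and-execute
  16. Reasoning without observation
  17. LLMCompiler
3. Reflection and critique:
  18. Basic reflection
  19. Reflexion
  20. Language Agent Tree Search
  21. Self-discover agent
Evaluation and analysis:
22. Agent-based evaluation
23. Evaluation in LangSmith
Experimental:
24. Web search Agent (STORM)
25. TNT-LLM
26. Web navigation Agent
27. Competitive programming
28. Complex data extraction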

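LangSmith:

Quick start:
Add observability to your LLM application
Evaluate your LLM application
Optimize a classifier
RAG evaluation
Backtesting
Agent evaluation
Optimize tracing spend on LangSmith

Tools that integrate LangGraph, and how to use them:

agent-service-toolkit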
