Day 25 - Computing Vectors & Creating the Data and Index - iT 邦幫忙

2024 iThome Ironman Contest

DAY 25

Generative AI

Part 25 of the series "A First Look at Langchain and LLM: Building a Simple Medical Consultation Chatbot"

16th Ironman Contest langchain python3 genai mongodb

熊熊工程師

Team 意蘭拉麵蔥基本五倍辣半熟鹽味蛋三片叉燒

2024-10-09 09:04:47

Each day's project is synced to GitLab, where you can browse it. If you are interested, feel free to leave a comment or email me at nickchen1998@gmail.com.

Today we will store the crawled data in MongoDB, compute the vector for refactor_question, and store that vector in MongoDB as well.

Creating the MongoDB Connection

Part 1: the `get_mongo_database` function

import contextlib
from typing import Iterator

from pymongo.mongo_client import MongoClient
from pymongo.database import Database
from env_settings import EnvSettings


@contextlib.contextmanager
def get_mongo_database() -> Iterator[Database]:
    env_settings = EnvSettings()  # Read the MongoDB connection parameters from the environment settings
    client = MongoClient(host=env_settings.MONGODB_ATLAS_URI)  # Create a pymongo MongoClient connected to MongoDB Atlas
    try:
        yield Database(client, name=env_settings.MONGODB_DATABASE)  # Hand the requested database object to the caller
    finally:
        client.close()  # Always close the MongoDB client connection once the caller is done

Explanation:

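contextlib.contextmanager: This decorator turns the function into a context manager, so it can be used in a with statement to simplify resource management. Here it manages the MongoDB client connection and ensures the connection is closed once the work is done.
EnvSettings: The MongoDB connection URI and database name are read from the custom EnvSettings class, which keeps sensitive values (such as the MongoDB Atlas URI) out of the source code.
MongoClient: The pymongo class used to establish a connection to MongoDB.
yield: Hands a Database object to the with block for the data operations that follow. When the block ends, whether normally or through an exception, the finally clause runs client.close() so the connection is closed safely.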
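To see the close-on-exit behaviour in isolation, here is a minimal sketch that swaps MongoClient for a hypothetical FakeClient stand-in (it is not part of pymongo), so it runs without a database. Under that assumption, it shows the finally clause running both on a normal exit and when the with body raises.

```python
import contextlib
from typing import Iterator


class FakeClient:
    """Hypothetical stand-in for MongoClient; it only records whether close() ran."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


@contextlib.contextmanager
def get_fake_database(client: FakeClient) -> Iterator[str]:
    try:
        # The caller's `with` body runs while the generator is paused here.
        yield "fake-database"
    finally:
        # Runs on a normal exit and also when the `with` body raises.
        client.close()


normal_client = FakeClient()
with get_fake_database(normal_client) as database:
    name_seen = database  # the yielded value is what `as` binds

failing_client = FakeClient()
try:
    with get_fake_database(failing_client):
        raise RuntimeError("simulated failure inside the with block")
except RuntimeError:
    pass  # the exception still reaches the caller, after close() has run
```

In both cases the client ends up closed, which is the guarantee get_mongo_database relies on.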

Part 2: the `insert_datas` function

poetry add langchain-mongodb

from pymongo.collection import Collection
from env_settings import EnvSettings
from langchain_openai import OpenAIEmbeddings
from langchain_mongodb.vectorstores import MongoDBAtlasVectorSearch
from uuid import uuid4
from langchain_core.documents import Document


def insert_datas(datas: list):
    env_settings = EnvSettings()

    # get_mongo_database is the context manager defined in Part 1
    with get_mongo_database() as database:
        vector_store = MongoDBAtlasVectorSearch(
            collection=Collection(database, name="illness"),
            embedding=OpenAIEmbeddings(model="text-embedding-3-small", api_key=env_settings.OPENAI_API_KEY),
            index_name="illness_refactor_question",
            relevance_score_fn="cosine",
        )

        # refactor_question is the text to embed; every other field becomes metadata
        documents = []
        for data in datas:
            documents.append(Document(
                page_content=data.pop("refactor_question"),  # pop() also removes the key from the caller's dict
                metadata=data
            ))

        # One unique id per document, passed by keyword
        uuids = [str(uuid4()) for _ in range(len(documents))]
        vector_store.add_documents(documents, ids=uuids)

Explanation:

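datas: list: The function takes a list of records, which are inserted into the MongoDB collection in one batch.
with get_mongo_database(): Enters the get_mongo_database context manager and obtains the MongoDB database object, so the connection is closed safely when the work finishes.
vector_store: A vector store created with MongoDBAtlasVectorSearch, the class that stores vectors in MongoDB and provides vector search over them.
for loop: Puts refactor_question into each Document as page_content and the remaining fields as metadata, then appends each Document to the documents list.
vector_store.add_documents: Computes the vectors and inserts all the documents into MongoDB in one call.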
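One detail of the loop is worth a closer look: data.pop("refactor_question") removes the key from the dictionaries the caller passed in. The sketch below shows the same page_content / metadata split without touching the input. SimpleDocument and build_documents are hypothetical names used only for this illustration; SimpleDocument stands in for langchain_core.documents.Document so the example runs without extra packages, and the sample records are made up.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_documents(datas: list[dict], content_key: str = "refactor_question") -> list[SimpleDocument]:
    documents = []
    for data in datas:
        # Copy every field except the content key, leaving the input dict unchanged.
        metadata = {key: value for key, value in data.items() if key != content_key}
        documents.append(SimpleDocument(page_content=data[content_key], metadata=metadata))
    return documents


# Made-up sample records, only for illustration.
records = [
    {"refactor_question": "Why do I get headaches at night?", "category": "neurology"},
    {"refactor_question": "Is a mild fever after a vaccine normal?", "category": "family medicine"},
]
documents = build_documents(records)
ids = [str(uuid4()) for _ in documents]  # one unique id per document, as in insert_datas
```

Whether to copy or pop is a design choice: popping avoids storing the question text twice, while copying keeps datas reusable after the insert.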

Modifying the Crawler Code

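In the code below, we use the get_content_embedding function we built earlier to get the vector of the refactored question, collect the fields into data, and append data to datas. At the end of the code, a single call to insert_datas inserts all the records at once: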

...
datas = []
...
for paragraph in browser.find_elements(By.CSS_SELECTOR, "ul.QAunit"):
    ...
    refactor_question = get_refactor_question(question)
    refactor_answer = get_refactor_answer(answer)
    refactor_question_embedding = get_content_embedding(refactor_question)

    data = dict(
        category=category,
        subject=subject,
        question=question,
        gender=gender,
        question_time=question_time,
        answer=answer,
        doctor_name=doctor_name,
        doctor_department=doctor_department,
        answer_time=answer_time,
        view_amount=view_amount,
        refactor_question=refactor_question,
        refactor_question_embeddings=refactor_question_embedding,
        refactor_answer=refactor_answer
    )
    datas.append(data)
    ...

insert_datas(datas=datas)
...

Let's look at the data after insertion:

[Image: data]

The red box marks the embedding, and the remaining fields were also inserted into the database successfully.

Setting Up the MongoDB Vector Index

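Here we jump straight to setting up the MongoDB vector index and skip steps such as creating the connection and the organization. For a complete tutorial, see the 玩轉 Python 與 MongoDB series.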

For how to set up a vector index, please refer to this article. The settings used this time are:

{
  "fields": [
    {
      "numDimensions": 1536,
      "path": "embedding",
      "similarity": "dotProduct",
      "type": "vector"
    },
    {
      "path": "category",
      "type": "filter"
    }
  ]
}
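The index above uses dotProduct, while insert_datas passes relevance_score_fn="cosine". For vectors of length 1 the two measures give the same value, which the following sketch checks in plain Python. It assumes the stored embeddings are unit-length (OpenAI describes its embeddings as normalized to length 1); the vectors are made-up three-dimensional examples, not real embeddings.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: list[float]) -> list[float]:
    # Scale the vector so that its length becomes 1.
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up 3-dimensional vectors, scaled to unit length for the comparison.
query = normalize([1.0, 2.0, 2.0])
stored = normalize([2.0, 1.0, 2.0])

dot_score = dot(query, stored)
cosine_score = cosine(query, stored)
```

If the vectors were not unit-length, the two scores would differ, and the index similarity would need to match the measure you intend to rank by.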