Found while going through the Code understanding tutorial on the official LangChain site.
(Disclaimer: the content below is collected and adapted from the web; very little of it is original, I'm mostly just porting it over.)
This part is officially still under development, so expect that even the examples may have bugs.
Code analysis is one of the most popular LLM applications (e.g. GitHub Copilot, Code Interpreter, Codium, and Codeium). Current use cases include Q&A over a code base to understand how it works, using LLMs to suggest refactors or improvements, and using LLMs to document code.
The Q&A flow for code analysis follows the same steps as document Q&A, with some differences. In particular, we can employ a splitting strategy that keeps each top-level function and class in its own document, puts the remaining code into a separate document, and retains metadata about where each split came from.
!pip install openai tiktoken chromadb langchain
!pip install gitpython
import os
import dotenv
os.environ["OPENAI_API_KEY"] ="這邊要放自己的 OPEN AI API KEY"
# Set env var OPENAI_API_KEY or load from a .env file
dotenv.load_dotenv()
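If you go the .env route mentioned in the comment, the file is just key=value pairs; a minimal sketch (the key itself is a placeholder):
# .env, placed next to the notebook (placeholder value)
OPENAI_API_KEY=sk-your-key-here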
We will follow the structure of this notebook and employ context-aware code splitting.
We will use langchain.document_loaders.TextLoader to load all of the Python project files. The following script iterates over the files in the LangChain repository and loads every .py file (a.k.a. document):
from git import Repo
from langchain.text_splitter import Language
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import LanguageParser
# Clone
repo_path = "/Users/rlm/Desktop/test_repo"
repo = Repo.clone_from("https://github.com/langchain-ai/langchain", to_path=repo_path)
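One caveat: Repo.clone_from fails if the target directory already exists, so on a re-run you may want to open the existing clone instead. A minimal guard, assuming the same repo_path as above:
import os
from git import Repo

# Reuse an existing clone on re-runs; clone only on the first run
if os.path.isdir(repo_path):
    repo = Repo(repo_path)
else:
    repo = Repo.clone_from("https://github.com/langchain-ai/langchain", to_path=repo_path)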
We load the .py code using LanguageParser, which implements the splitting strategy described above:
# Load
loader = GenericLoader.from_filesystem(
    repo_path + "/libs/langchain/langchain",
    glob="**/*",
    suffixes=[".py"],
    parser=LanguageParser(language=Language.PYTHON, parser_threshold=500),
)
documents = loader.load()
len(documents)
1546
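To see what the loader produced, we can peek at one document; LanguageParser records the source file (and, for split files, the content type) in the metadata. A quick sketch (the exact metadata keys may vary by version):
# Inspect one loaded document: its metadata and the start of its content
print(documents[0].metadata)
print(documents[0].page_content[:300])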
Next, we split each Document into chunks for embedding and vector storage. We can use RecursiveCharacterTextSplitter with the language parameter set.
from langchain.text_splitter import RecursiveCharacterTextSplitter
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=2000, chunk_overlap=200
)
texts = python_splitter.split_documents(documents)
len(texts)
4695
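As a sanity check on the splitter, we can look at a single chunk; a quick sketch:
# Each chunk is still a Document; print the beginning of the first one
print(texts[0].page_content[:300])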
We need to store the documents in a way that lets us semantically search over their content. The most common approach is to embed the contents of each document, then store the embedding and the document together in a vector store. When setting up the vector store retriever, we use max marginal relevance (MMR) search and return 8 documents:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
db = Chroma.from_documents(texts, OpenAIEmbeddings(disallowed_special=()))
retriever = db.as_retriever(
    search_type="mmr",  # Also test "similarity"
    search_kwargs={"k": 8},
)
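Before wiring up the chat chain, it's worth sanity-checking the retriever on its own; a quick sketch (the query is just an example):
# Try a sample retrieval and list which source files the hits came from
docs = retriever.get_relevant_documents("ConversationalRetrievalChain")
for d in docs:
    print(d.metadata.get("source"))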
Chat test: retrieving code information through a chatbot, just as we would with a regular chat assistant.
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationSummaryMemory
from langchain.chains import ConversationalRetrievalChain
llm = ChatOpenAI(model_name="gpt-4")
memory = ConversationSummaryMemory(llm=llm, memory_key="chat_history", return_messages=True)
qa = ConversationalRetrievalChain.from_llm(llm, retriever=retriever, memory=memory)
question = "How can I initialize a ReAct agent?"
result = qa(question)
result['answer']
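Because the chain carries ConversationSummaryMemory, a follow-up question can refer back to the previous turn; a small sketch (the follow-up wording is just an example):
# Follow-up that relies on the chat history stored in memory
followup = "What parameters does it take?"
result = qa(followup)
print(result["answer"])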
questions = [
"What is the class hierarchy?",
"What classes are derived from the Chain class?",
"What one improvement do you propose in code in relation to the class herarchy for the Chain class?",
]
for question in questions:
    result = qa(question)
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")
-> Question: What is the class hierarchy?
Answer: The class hierarchy for initializing a ReAct agent is as follows:
In this hierarchy, ReActDocstoreAgent is a subclass of the Agent class, which itself is a subclass of several classes including the BaseSingleActionAgent.
-> Question: What classes are derived from the Chain class?
Answer: The classes that are derived from the Chain class are:
-> Question: What one improvement do you propose in code in relation to the class hierarchy for the Chain class?
Answer: Based on the provided code, one improvement could be to include more explicit comments or docstrings for each class in the hierarchy. This would make it easier to understand the purpose and functionality of each class, especially for developers who are new to the codebase. For instance, it's not immediately clear what the purpose of classes like BaseConversationalRetrievalChain or SequentialChain are. Providing a brief explanation of each class in the hierarchy would improve readability and maintainability.
We can inspect the LangSmith trace to see what is happening under the hood.
We can also use Code LLaMA via the LlamaCpp or Ollama integrations.
Note: be sure to upgrade llama-cpp-python in order to use the new gguf file format:
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 /Users/rlm/miniforge3/envs/llama2/bin/pip install -U llama-cpp-python --no-cache-dir
!pip install llama-cpp-python
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.memory import ConversationSummaryMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
# Download the Code Llama model to Colab
import requests
url = "https://huggingface.co/TheBloke/CodeLlama-13B-Instruct-GGUF/resolve/e94db8d144152f0b5e153dcb0ac0a266f1588fc3/codellama-13b-instruct.Q4_K_M.gguf"
response = requests.get(url, stream=True)
response.raise_for_status()
with open("codellama-13b-instruct.Q4_K_M.gguf", "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)
print("Model downloaded successfully!")
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = LlamaCpp(
    model_path="/content/codellama-13b-instruct.Q4_K_M.gguf",
    n_ctx=5000,
    n_gpu_layers=1,
    n_batch=512,
    f16_kv=True,  # MUST set to True, otherwise you will run into problems after a couple of calls
    callback_manager=callback_manager,
    verbose=True,
)
llm("Question: In bash, how do I list all the text files in the current directory that have been modified in the last month? Answer:")
To get a list of all the text files in the current directory that have been modified in the last month.
You can use this bash command: find . -type f ( -iname '.txt' ) -mtime +30 -print This is because in order to perform this operation in bash, you need to use several different subcommands and options together when invoking the command find.
Here is an explanation of each of the main components that make up the bash command that is used to perform this particular operation in bash: . : This is a period symbol that is used as part of the name or path of the file that you want to perform this operation on.
Here is an example of how this symbol might be used: .txt : This is a slash symbol that is used as part of the name or path of the file that you want to perform this operation on.
Here is an example of how this symbol might be used: dir/ : This is a string literal that contains some text characters.
Here is an example of what this particular type of string literal is intended to represent in the context in which it is being used.
Note also that there are several different types of quotes that can be used in bash command line
from langchain.chains.question_answering import load_qa_chain
# Prompt
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template=template,
)
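With the custom prompt defined, it can be wired into a QA chain the same way the hub prompt is used below; a minimal sketch:
# Plug the custom prompt into a "stuff"-type QA chain
chain = load_qa_chain(llm, chain_type="stuff", prompt=QA_CHAIN_PROMPT)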
We can also use the LangChain Prompt Hub to store and fetch prompts. This works with your LangSmith API key. Let's try the default RAG prompt here:
!pip install langchainhub
from langchain import hub
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
export LANGCHAIN_API_KEY=<your-api-key>
QA_CHAIN_PROMPT = hub.pull("rlm/rag-prompt-default")
# Docs
question = "How can I initialize a ReAct agent?"
docs = retriever.get_relevant_documents(question)
# Chain
chain = load_qa_chain(llm, chain_type="stuff", prompt=QA_CHAIN_PROMPT)
# Run
chain({"input_documents": docs, "question": question}, return_only_outputs=True)
Llama.generate: prefix-match hit
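The chain returns a dict; with return_only_outputs=True, the answer should be under the output_text key (the default output key for load_qa_chain). A quick sketch for capturing it:
# Capture the result and print just the generated answer
out = chain({"input_documents": docs, "question": question}, return_only_outputs=True)
print(out["output_text"])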
Here is the RAG trace, showing the retrieved docs.