Found while going through the Code understanding tutorial on the official LangChain site.
(Disclaimer: the content below is collected and adapted from the web; very little of it is original, I'm mostly just porting it over.)
This part is officially still under development, so expect that even the examples may have bugs.
Code analysis is one of the most popular LLM applications (e.g. GitHub Copilot, Code Interpreter, Codium, and Codeium). Current use cases include Q&A over a code base to understand how it works, using LLMs to suggest refactors or improvements, and using LLMs to document code.
The Q&A flow for code analysis follows the same steps as document Q&A, with some differences. In particular, we can employ a splitting strategy that keeps each top-level function and class in its own document, puts the remaining code into a separate document, and retains metadata about where each split came from.
!pip install openai tiktoken chromadb langchain
!pip install gitpython
import os
import dotenv
os.environ["OPENAI_API_KEY"] ="這邊要放自己的 OPEN AI API KEY"
# Set env var OPENAI_API_KEY or load from a .env file
dotenv.load_dotenv()
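If you go the .env route mentioned in the comment, the file is just key=value pairs; a minimal sketch (the key itself is a placeholder):
# .env, placed next to the notebook (placeholder value)
OPENAI_API_KEY=sk-your-key-here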
We will follow the structure of this notebook and employ context-aware code splitting.
We will use langchain.document_loaders.TextLoader to load all of the Python project files. The following script iterates over the files in the LangChain repository and loads every .py file (a.k.a. document):
from git import Repo
from langchain.text_splitter import Language
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import LanguageParser
# Clone
repo_path = "/Users/rlm/Desktop/test_repo"
repo = Repo.clone_from("https://github.com/langchain-ai/langchain", to_path=repo_path)
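One caveat: Repo.clone_from fails if the target directory already exists, so on a re-run you may want to open the existing clone instead. A minimal guard, assuming the same repo_path as above:
import os
from git import Repo

# Reuse an existing clone on re-runs; clone only on the first run
if os.path.isdir(repo_path):
    repo = Repo(repo_path)
else:
    repo = Repo.clone_from("https://github.com/langchain-ai/langchain", to_path=repo_path)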
We load the .py code using LanguageParser, which implements the splitting strategy described above:
# Load
loader = GenericLoader.from_filesystem(
    repo_path + "/libs/langchain/langchain",
    glob="**/*",
    suffixes=[".py"],
    parser=LanguageParser(language=Language.PYTHON, parser_threshold=500),
)
documents = loader.load()
len(documents)
1546
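To see what the loader produced, we can peek at one document; LanguageParser records the source file (and, for split files, the content type) in the metadata. A quick sketch (the exact metadata keys may vary by version):
# Inspect one loaded document: its metadata and the start of its content
print(documents[0].metadata)
print(documents[0].page_content[:300])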
Next, we split each Document into chunks for embedding and vector storage. We can use RecursiveCharacterTextSplitter with the language parameter set.
from langchain.text_splitter import RecursiveCharacterTextSplitter
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=2000, chunk_overlap=200
)
texts = python_splitter.split_documents(documents)
len(texts)
4695
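As a sanity check on the splitter, we can look at a single chunk; a quick sketch:
# Each chunk is still a Document; print the beginning of the first one
print(texts[0].page_content[:300])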
We need to store the documents in a way that lets us semantically search over their content. The most common approach is to embed the contents of each document, then store the embedding and the document together in a vector store. When setting up the vector store retriever, we use max marginal relevance (MMR) search and return 8 documents:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
db = Chroma.from_documents(texts, OpenAIEmbeddings(disallowed_special=()))
retriever = db.as_retriever(
    search_type="mmr",  # Also test "similarity"
    search_kwargs={"k": 8},
)
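Before wiring up the chat chain, it's worth sanity-checking the retriever on its own; a quick sketch (the query is just an example):
# Try a sample retrieval and list which source files the hits came from
docs = retriever.get_relevant_documents("ConversationalRetrievalChain")
for d in docs:
    print(d.metadata.get("source"))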
Chat test: retrieving code information through a chatbot, just as we would with a regular chat assistant.
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationSummaryMemory
from langchain.chains import ConversationalRetrievalChain
llm = ChatOpenAI(model_name="gpt-4")
memory = ConversationSummaryMemory(llm=llm, memory_key="chat_history", return_messages=True)
qa = ConversationalRetrievalChain.from_llm(llm, retriever=retriever, memory=memory)
question = "How can I initialize a ReAct agent?"
result = qa(question)
result['answer']
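Because the chain carries ConversationSummaryMemory, a follow-up question can refer back to the previous turn; a small sketch (the follow-up wording is just an example):
# Follow-up that relies on the chat history stored in memory
followup = "What parameters does it take?"
result = qa(followup)
print(result["answer"])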
questions = [
"What is the class hierarchy?",
"What classes are derived from the Chain class?",
"What one improvement do you propose in code in relation to the class herarchy for the Chain class?",
]
for question in questions:
    result = qa(question)
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer']} \n")
-> Question: What is the class hierarchy?
Answer: The class hierarchy for initializing a ReAct agent is as follows:
In this hierarchy, ReActDocstoreAgent is a subclass of the Agent class, which itself is a subclass of several classes including the BaseSingleActionAgent.
-> Question: What classes are derived from the Chain class?
Answer: The classes that are derived from the Chain class are:
-> Question: What one improvement do you propose in code in relation to the class hierarchy for the Chain class?
Answer: Based on the provided code, one improvement could be to include more explicit comments or docstrings for each class in the hierarchy. This would make it easier to understand the purpose and functionality of each class, especially for developers who are new to the codebase. For instance, it's not immediately clear what the purpose of classes like BaseConversationalRetrievalChain or SequentialChain are. Providing a brief explanation of each class in the hierarchy would improve readability and maintainability.
We can inspect the LangSmith trace to see what is happening under the hood.
We can also use Code LLaMA via the LlamaCpp or Ollama integrations.
Note: be sure to upgrade llama-cpp-python in order to use the new gguf file format:
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 /Users/rlm/miniforge3/envs/llama2/bin/pip install -U llama-cpp-python --no-cache-dir
!pip install llama-cpp-python
from langchain.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.memory import ConversationSummaryMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
# Download the Code Llama model to Colab
import requests
url = "https://huggingface.co/TheBloke/CodeLlama-13B-Instruct-GGUF/resolve/e94db8d144152f0b5e153dcb0ac0a266f1588fc3/codellama-13b-instruct.Q4_K_M.gguf"
response = requests.get(url, stream=True)
response.raise_for_status()
with open("codellama-13b-instruct.Q4_K_M.gguf", "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)
print("Model downloaded successfully!")
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = LlamaCpp(
    model_path="/content/codellama-13b-instruct.Q4_K_M.gguf",
    n_ctx=5000,
    n_gpu_layers=1,
    n_batch=512,
    f16_kv=True,  # MUST set to True, otherwise you will run into problems after a couple of calls
    callback_manager=callback_manager,
    verbose=True,
)
llm("Question: In bash, how do I list all the text files in the current directory that have been modified in the last month? Answer:")
To get a list of all the text files in the current directory that have been modified in the last month.
You can use this bash command: find . -type f ( -iname '.txt' ) -mtime +30 -print This is because in order to perform this operation in bash, you need to use several different subcommands and options together when invoking the command find.
Here is an explanation of each of the main components that make up the bash command that is used to perform this particular operation in bash: . : This is a period symbol that is used as part of the name or path of the file that you want to perform this operation on.
Here is an example of how this symbol might be used: .txt : This is a slash symbol that is used as part of the name or path of the file that you want to perform this operation on.
Here is an example of how this symbol might be used: dir/ : This is a string literal that contains some text characters.
Here is an example of what this particular type of string literal is intended to represent in the context in which it is being used.
Note also that there are several different types of quotes that can be used in bash command line
from langchain.chains.question_answering import load_qa_chain
# Prompt
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate(
    input_variables=["context", "question"],
    template=template,
)
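With the custom prompt defined, it can be wired into a QA chain the same way the hub prompt is used below; a minimal sketch:
# Plug the custom prompt into a "stuff"-type QA chain
chain = load_qa_chain(llm, chain_type="stuff", prompt=QA_CHAIN_PROMPT)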
We can also use the LangChain Prompt Hub to store and fetch prompts. This works with your LangSmith API key. Let's try the default RAG prompt here:
!pip install langchainhub
from langchain import hub
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
export LANGCHAIN_API_KEY=<your-api-key>
QA_CHAIN_PROMPT = hub.pull("rlm/rag-prompt-default")
# Docs
question = "How can I initialize a ReAct agent?"
docs = retriever.get_relevant_documents(question)
# Chain
chain = load_qa_chain(llm, chain_type="stuff", prompt=QA_CHAIN_PROMPT)
# Run
chain({"input_documents": docs, "question": question}, return_only_outputs=True)
Llama.generate: prefix-match hit
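The chain returns a dict; with return_only_outputs=True, the answer should be under the output_text key (the default output key for load_qa_chain). A quick sketch for capturing it:
# Capture the result and print just the generated answer
out = chain({"input_documents": docs, "question": question}, return_only_outputs=True)
print(out["output_text"])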
Here is the RAG trace, showing the retrieved docs.