[D17] LangChain Project Implementation - Teaching Vocabulary Recommendation

15th Ironman Contest langchain chatgpt openai chatbot

Ted Chen

2023-09-21 05:31:19

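Today we take a close look at how the "teaching vocabulary recommendation" feature of our LangChain project is implemented and which tools it relies on. By working through it, we will see how to combine a text loader (TextLoader), a prompt template (PromptTemplate), and an output parser (OutputParser) to complete an end-to-end task. The overall flow is as follows:
First, we use LangChain's TextLoader to read a text file that simulates a video transcript and turn its contents into a Document object. We then pick a few sentences at random from that Document. These sentences stand in for the results of our "example sentence recommendation system" and serve as the base data from which the language model selects vocabulary suitable for teaching.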

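We will also introduce another key LangChain tool, the OutputParser. It can both generate instructions that ask the model for structured output and parse that structured data back out of the model's text reply.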

Now let's get into today's LangChain techniques and practical methods!

Document Loader

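LangChain ships with many data loaders that support a variety of sources, including txt files, web pages, and even YouTube subtitles. Each loader implements a load function that performs the read, and LangChain uses these loaders to bring the data in uniformly as Document objects. In this implementation we use TextLoader, as follows: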

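# External data: read a txt file with TextLoader
# (import path as of the 2023 LangChain releases)
from langchain.document_loaders import TextLoader
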
file_name = '/content/ironman2023/text_data/World Stories to Help You Learn _ practice English with Spotlight - English.txt'
loader = TextLoader(file_name, encoding="utf-8")
documents = loader.load()

print(documents)

--- Actual output ---
>> 
[Document(page_content='Welcome to Spotlight. I’m Adam Navis... We hope you can join us again for the next Spotlight program. Goodbye!', metadata={'source': '/content/ironman2023/text_data/World Stories to Help You Learn _ practice English with Spotlight - English.txt'})]
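
The flow described at the start of this article then picks a few sentences at random from the loaded Document. That sampling code is not shown here, so the following is only a minimal sketch under assumptions: the helper name `sample_sentences`, the regex-based sentence splitting, the sample size, and joining the result with newlines into `recommend_lines` are all hypothetical, and a made-up string stands in for `documents[0].page_content`.

```python
import random
import re


def sample_sentences(text, k=3, seed=None):
    """Split text into rough sentences and return up to k of them at random.

    Hypothetical helper for illustration; not part of LangChain.
    """
    # Split after '.', '!' or '?' followed by whitespace (a rough heuristic).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    rng = random.Random(seed)  # passing a seed makes the sample reproducible
    return rng.sample(sentences, min(k, len(sentences)))


# Stand-in for documents[0].page_content (made-up text, not the real transcript)
page_content = (
    "Welcome to the program. Today we tell three stories. "
    "The first story is about a turtle! Do you like stories? Goodbye."
)

# Assumption: the sampled sentences are joined into one newline-separated string
recommend_lines = "\n".join(sample_sentences(page_content, k=3, seed=42))
print(recommend_lines)
```

With the real data, `documents[0].page_content` would be passed in place of the stand-in string.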

Partial Prompt Templates

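In earlier articles we saw that LangChain prompts use placeholders to parameterize parts of the prompt text. As prompts grow more complex, you will sometimes already have the values of some variables before the prompt is finally formatted. This is where LangChain's "partial prompt templates" come in: they let us fill those known values into the template ahead of time. For example: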

from langchain.prompts import PromptTemplate

# The original template needs two inputs
prompt = PromptTemplate(template="{foo}{bar}", input_variables=["foo", "bar"])
prompt

--- Actual output ---
>> PromptTemplate(input_variables=['foo', 'bar'], output_parser=None, partial_variables={}, template='{foo}{bar}', template_format='f-string', validate_template=True)

Here is the prompt template after calling partial:

# Pre-fill a variable value with the partial function
partial_prompt = prompt.partial(foo="foo")
partial_prompt

--- Actual output ---
>> PromptTemplate(input_variables=['bar'], output_parser=None, partial_variables={'foo': 'foo'}, template='{foo}{bar}', template_format='f-string', validate_template=True)

LangChain provides two ways to pass partial variables into a prompt template. The first calls partial on an existing template:

# Partial prompt template, usage 1
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(template="{foo}{bar}", input_variables=["foo", "bar"])
partial_prompt = prompt.partial(foo="foo")  # note this line
print(partial_prompt.format(bar="baz"))

--- Actual output ---
>> foobaz

The second passes partial_variables when the template is created:

# Partial prompt template, usage 2
prompt = PromptTemplate(
    template="{foo}{bar}",
    input_variables=["bar"],
    partial_variables={"foo": "foo"},  # note this line
)
print(prompt.format(bar="baz"))

--- Actual output ---
>> foobaz
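
Conceptually, both usages do the same thing: they remember some values ahead of time and merge them with the remaining values when format is called. The following is a simplified stand-in written with only the standard library to illustrate that idea. The class `MiniPromptTemplate` is hypothetical and is not how LangChain implements `PromptTemplate`.

```python
class MiniPromptTemplate:
    """Toy template illustrating partial variables (hypothetical, not LangChain)."""

    def __init__(self, template, input_variables, partial_variables=None):
        self.template = template
        self.input_variables = list(input_variables)
        self.partial_variables = dict(partial_variables or {})

    def partial(self, **kwargs):
        # Return a new template: the filled-in names move from input_variables
        # to partial_variables, and the original template is left unchanged.
        remaining = [v for v in self.input_variables if v not in kwargs]
        merged = {**self.partial_variables, **kwargs}
        return MiniPromptTemplate(self.template, remaining, merged)

    def format(self, **kwargs):
        # Values stored earlier are combined with the ones supplied now.
        return self.template.format(**{**self.partial_variables, **kwargs})


prompt = MiniPromptTemplate("{foo}{bar}", ["foo", "bar"])
partial_prompt = prompt.partial(foo="foo")
print(partial_prompt.input_variables)    # ['bar']
print(partial_prompt.format(bar="baz"))  # foobaz
```

Returning a new object from partial, rather than mutating the template, mirrors what the outputs above show: the original prompt still lists both input variables after partial has been called.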

Output Parser

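Besides its prompt tooling, LangChain also provides a number of built-in output parsers. Each parser offers at least two basic functions: get_format_instructions and parse. Before parsing anything, we usually call get_format_instructions to generate format instructions that tell the language model how to shape its reply. For example, CommaSeparatedListOutputParser produces the following instructions: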

# Generate format instructions with CommaSeparatedListOutputParser
from langchain.output_parsers import CommaSeparatedListOutputParser

output_parser = CommaSeparatedListOutputParser()

print(output_parser.get_format_instructions())

--- Actual output ---
>> Your response should be a list of comma separated values, eg: `foo, bar, baz`

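Once this instruction is appended to our prompt, the language model will generally reply with a comma-separated list. Here is a concrete example, starting with the version that does not use an OutputParser: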

from langchain.prompts import (
    ChatPromptTemplate,
    PromptTemplate,
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
)

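# The prompt is kept in Chinese so that it matches the recorded output. In English:
# "You are a professional foreign-language teacher. Your specialty is finding content
# worth teaching in a video, and you are also good at planning lessons.
# Next I will give you the selected example sentences; please pick about 10 words
# worth teaching (teaching vocabulary) at random from all of them."
system_template = """你是一個專業的外語老師，你有一個特殊專長是在能夠以影片的內容找出值得教學的內容，你也很擅長做課程的規劃。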

接下來我會提供給挑選出來的教學例句，請在所有教學例句裏面隨機找出 10 個左右值得教學的詞彙（教學詞彙）。
"""

system_message_prompt = SystemMessagePromptTemplate.from_template(system_template)

# "教學例句" means "example sentences for teaching"
human_template = "教學例句：\n{captions}"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])

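# recommend_lines (the sampled example sentences) and chat_model (a chat model
# instance) are assumed to have been defined earlier; they are not shown here.
chat_messages = chat_prompt.format_messages(captions=recommend_lines)
chat_model(chat_messages)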

--- Actual output ---
>> AIMessage(content='教學詞彙：\n1. argued\n2. war\n3. stories\n4. laughed\n5. program\n6. Facebook\n7. YouTube\n8. friendship\n9. houses\n10. comment\n11. coat\n12. storytellers\n13. moral\n14. internet\n15. sea turtle\n16. celebrate\n17. smiled\n18. trickster\n19. Spotlight\n20. began', additional_kwargs={}, example=False)
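
The reply above is a free-form numbered list with a header line, so turning it into a Python list takes hand-written, format-specific code. Below is a minimal sketch of such ad hoc parsing, assuming the model keeps exactly this `N. word` layout. The helper name `parse_numbered_list` is hypothetical, and the sample string is a shortened copy of the output above.

```python
import re


def parse_numbered_list(text):
    """Extract items from lines that look like '1. item' (hypothetical helper)."""
    items = []
    for line in text.splitlines():
        match = re.match(r"^\s*\d+\.\s*(.+?)\s*$", line)
        if match:  # lines without a leading number, such as the header, are skipped
            items.append(match.group(1))
    return items


content = "教學詞彙：\n1. argued\n2. war\n3. stories\n15. sea turtle"
print(parse_numbered_list(content))  # ['argued', 'war', 'stories', 'sea turtle']
```

Code like this works only while the model keeps that layout, which is the motivation for asking for a fixed output format.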

Next is the version with the OutputParser added. Note that it also uses the partial prompt templates introduced earlier:

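# Add the OutputParser's format instructions to the vocabulary recommendation prompt.
# The Chinese prompt text is unchanged; {format_instructions} is appended at the end.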
system_template = """你是一個專業的外語老師，你有一個特殊專長是在能夠以影片的內容找出值得教學的內容，你也很擅長做課程的規劃。

接下來我會提供給挑選出來的教學例句，請在所有教學例句裏面隨機找出 10 個左右值得教學的詞彙（教學詞彙）。
{format_instructions}
"""

system_message_prompt = SystemMessagePromptTemplate(
    prompt=PromptTemplate(
        template=system_template,
        input_variables=[],
        partial_variables={"format_instructions": output_parser.get_format_instructions()}
    )
)

human_template = "教學例句：\n{captions}"
human_message_prompt = HumanMessagePromptTemplate.from_template(human_template)

chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])

chat_messages = chat_prompt.format_messages(captions=recommend_lines)
response = chat_model(chat_messages)

print(response)

--- Actual output ---
>> content='argued, war, stories, laughed, program, Facebook, YouTube, shouts, houses, comment, coat, storytellers, teach, advice, moral, celebrate, smiled, agreed, trickster, Spotlight' additional_kwargs={} example=False

Finally, LangChain's OutputParser also provides a parse function, which turns the string we received into a list of strings:

output_parser.parse(response.content)

--- Actual output ---
['argued',
 'war',
 'stories',
 'laughed',
 'program',
 'Facebook',
 'YouTube',
 'shouts',
 'houses',
 'comment',
 'coat',
 'storytellers',
 'teach',
 'advice',
 'moral',
 'celebrate',
 'smiled',
 'agreed',
 'trickster',
 'Spotlight']
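
For a comma-separated reply, parsing amounts to splitting the string on commas. The sketch below approximates that behavior with the standard library. `parse_comma_separated` is a hypothetical helper and not LangChain's actual implementation, which may treat whitespace or edge cases differently.

```python
def parse_comma_separated(text):
    """Split a comma-separated reply into a list of trimmed, non-empty items."""
    return [item.strip() for item in text.split(",") if item.strip()]


# Shortened copy of the model reply shown earlier
content = "argued, war, stories, laughed, program"
print(parse_comma_separated(content))
# ['argued', 'war', 'stories', 'laughed', 'program']
```

Trimming each item and dropping empty ones makes the sketch tolerant of stray spaces or a trailing comma in the model's reply.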