
2023 iThome Ironman Contest

DAY 26
AI & Data

Learning ML/LLM Development with Databricks Series, Part 26

Day26 - How to Do LLMOps with Databricks - Part 2


This post works through the LLMOps notebook from the edX course Large Language Models: Application through Production. It explains the notebook contents and verifies the results in the Databricks UI.

Classroom Setup

Run the Classroom Setup to configure the environment. This notebook uses the Extreme Summarization (XSum) dataset and Hugging Face's T5 (Text-to-Text Transfer Transformer) as the example dataset and language model.
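In the course repository this is typically a single %run cell that pulls in the shared setup; the include path below is an assumption and may differ between course versions:

%run ../Includes/Classroom-Setup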

Prepare the data

Fetch the dataset with load_dataset.

from datasets import load_dataset
from transformers import pipeline

xsum_dataset = load_dataset(
    "xsum", version="1.2.0", cache_dir=DA.paths.datasets
)  # Note: We specify cache_dir to use pre-cached data.
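A quick sanity check of what was loaded (a minimal sketch; document and summary are the XSum column names):

# Show the available splits, then peek at the first test document.
print(xsum_dataset)
print(xsum_dataset["test"][0]["document"][:200])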

Define prod_data_path and create test_spark_dataset, then save the test split as a Delta table.

prod_data_path = f"{DA.paths.working_dir}/m6_prod_data"
test_spark_dataset = spark.createDataFrame(xsum_dataset["test"].to_pandas())
test_spark_dataset.write.format("delta").mode("overwrite").save(prod_data_path)
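To verify the write, the Delta table can be read back from the same path (a sketch; display is the Databricks notebook built-in):

# Read the Delta table back and show a few rows.
prod_df = spark.read.format("delta").load(prod_data_path)
display(prod_df.limit(5))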

Develop an LLM pipeline

Build a Hugging Face summarizer pipeline.

from transformers import pipeline

# Later, we plan to log all of these parameters to MLflow.
# Storing them as variables here will help with that.
hf_model_name = "t5-small"
min_length = 20
max_length = 40
truncation = True
do_sample = True

summarizer = pipeline(
    task="summarization",
    model=hf_model_name,
    min_length=min_length,
    max_length=max_length,
    truncation=truncation,
    do_sample=do_sample,
    model_kwargs={"cache_dir": DA.paths.datasets},
)  # Note: We specify cache_dir to use pre-cached models.
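The MLflow run in the next section references xsum_sample and results, which the notebook produces by first running the pipeline on a small sample. A minimal sketch, assuming a 10-document sample of the train split:

# Summarize a small sample; xsum_sample and results are reused
# in the MLflow logging step below.
xsum_sample = xsum_dataset["train"].select(range(10))
results = summarizer(xsum_sample["document"])
print(results[0]["summary_text"])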

Track LLM development with MLflow

As covered in the previous post, the MLflow tracking server records experiment information. Here, mlflow.set_experiment sets the experiment path, mlflow.start_run starts the run, and mlflow.llm.log_predictions logs the pipeline's inputs and outputs within that run.

import mlflow

# Tell MLflow Tracking to use this explicit experiment path,
# which is in your home directory under the Workspace browser (left-hand sidebar).
mlflow.set_experiment(f"/Users/{DA.username}/LLM 06 - MLflow experiment")

with mlflow.start_run():
    # LOG PARAMS
    mlflow.log_params(
        {
            "hf_model_name": hf_model_name,
            "min_length": min_length,
            "max_length": max_length,
            "truncation": truncation,
            "do_sample": do_sample,
        }
    )

    # --------------------------------
    # LOG INPUTS (QUERIES) AND OUTPUTS
    # Logged `inputs` are expected to be a list of str, or a list of str->str dicts.
    results_list = [r["summary_text"] for r in results]

    # Our LLM pipeline does not have prompts separate from inputs, so we do not log any prompts.
    mlflow.llm.log_predictions(
        inputs=xsum_sample["document"],
        outputs=results_list,
        prompts=["" for _ in results_list],
    )

    # ---------
    # LOG MODEL
    # We next log our LLM pipeline as an MLflow model.
    # This packages the model with useful metadata, such as the library versions used to create it.
    # This metadata makes it much easier to deploy the model downstream.
    # Under the hood, the model format is simply the ML library's native format (Hugging Face for us), plus metadata.

    # It is valuable to log a "signature" with the model telling MLflow the input and output schema for the model.
    signature = mlflow.models.infer_signature(
        xsum_sample["document"][0],
        mlflow.transformers.generate_signature_output(
            summarizer, xsum_sample["document"][0]
        ),
    )
    print(f"Signature:\n{signature}\n")

    # For mlflow.transformers, if there are inference-time configurations,
    # those need to be saved specially in the log_model call (below).
    # This ensures that the pipeline will use these same configurations when re-loaded.
    inference_config = {
        "min_length": min_length,
        "max_length": max_length,
        "truncation": truncation,
        "do_sample": do_sample,
    }

    # Logging a model returns a handle `model_info` to the model metadata in the tracking server.
    # This `model_info` will be useful later in the notebook to retrieve the logged model.
    model_info = mlflow.transformers.log_model(
        transformers_model=summarizer,
        artifact_path="summarizer",
        task="summarization",
        inference_config=inference_config,
        signature=signature,
        input_example="This is an example of a long news article which this pipeline can summarize for you.",
    )
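The returned model_info carries the model URI, so the logged pipeline can be reloaded later, for example via the pyfunc flavor (a sketch):

# Reload the logged model from the tracking server and run one prediction.
loaded_summarizer = mlflow.pyfunc.load_model(model_info.model_uri)
print(loaded_summarizer.predict(xsum_sample["document"][0]))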

Query the MLflow Tracking server
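The notebook then searches the tracking server for the run just logged. A minimal sketch using the MLflow client API (the experiment name matches the one set above):

import mlflow
from mlflow.tracking import MlflowClient

# Look up the experiment by the path we set with mlflow.set_experiment,
# then list its runs as a pandas DataFrame.
client = MlflowClient()
experiment = client.get_experiment_by_name(f"/Users/{DA.username}/LLM 06 - MLflow experiment")
runs_df = mlflow.search_runs([experiment.experiment_id])
display(runs_df)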

