Day17 - Llama.cpp & Gemma.cpp

17th鐵人賽

cindy7020

2025-09-26 21:23:25

83 瀏覽

分享至

🧠 llama.cpp & gemma.cpp 是什麼？

專案名稱	開發者	主要功能	特點
`llama.cpp`	Georgi Gerganov	在 CPU / GPU 上高效執行 Meta 的 LLaMA 系列模型（含 LLaMA 2 / LLaMA 3）	支援量化、輕量、跨平台（Windows / macOS / Linux / Android / WebAssembly）
`gemma.cpp`	Google / 社群	在本地端執行 Google Gemma 模型	基於 llama.cpp 架構修改，針對 Gemma 模型格式優化

兩者都屬於 C++ 寫成的高效 LLM 推理引擎，目標是讓普通電腦甚至樹莓派都能跑得動大型語言模型

⚙️ 安裝與編譯

下載與編譯 llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

下載與編譯 gemma.cpp

git clone https://github.com/google/gemma.cpp
cd gemma.cpp
make

📦 模型準備與轉換

因為 .cpp 系列專案需要量化模型（.gguf 格式），需要先把原始 HuggingFace 模型轉成 .gguf

以 llama.cpp 為例

python3 convert.py /path/to/llama-2-hf
python3 quantize.py llama-2.gguf llama-2-q4_0.gguf q4_0

gemma.cpp 也有類似的 convert.py 與 quantize.py 腳本

💬 執行與中文互動

以 llama.cpp 為例

./main -m ./models/llama-2-q4_0.gguf -p "請用中文解釋費氏數列"

以 gemma.cpp 為例

./main -m ./models/gemma-7b-q4_0.gguf -p "請幫我用 Python 寫一個排序演算法"

📜 中文使用注意事項

💡模型選擇很重要

原生 LLaMA 與 Gemma 對中文支援有限，建議用 HuggingFace 上的中文微調版（例如:「Chinese-LLaMA」、「Gemma-zh」）

💡prompt 設計

在啟動時加 system prompt，確保語言一致

./main -m model.gguf -p "你是一位使用繁體中文的助理。請用中文回答問題。"

💡輸入編碼

確保終端機編碼是 UTF-8，否則中文可能亂碼

🖼️llama.cpp vs gemma.cpp 中文架構比較圖

+--------------------------------------------------------------+
|                         本地 LLM 架構                        |
+--------------------------------------------------------------+
| 使用者輸入（中文 / 英文）                                    |
+------------------------+-------------------------------------+
                         v
+------------------------+-------------------------------------+
| llama.cpp                                                  |
| - 支援 LLaMA 1/2/3 系列模型                                |
| - 多平台 (Win / Mac / Linux / Android / Web)                |
| - HuggingFace 社群有大量中文微調版本                         |
+------------------------+-------------------------------------+
| gemma.cpp                                                  |
| - 專為 Google Gemma 模型設計                                |
| - 架構與 llama.cpp 類似                                    |
| - 原生支援 Gemma 格式與最佳化                              |
+------------------------+-------------------------------------+
                         v
+--------------------------------------------------------------+
| 量化模型（.gguf）                                           |
| - 減少記憶體佔用，提升推理速度                               |
| - CPU / GPU 都可執行                                        |
+--------------------------------------------------------------+
                         v
+--------------------------------------------------------------+
| 本地推理引擎 (C++ 編寫，高效處理)                           |
+--------------------------------------------------------------+
                         v
+--------------------------------------------------------------+
| 中文回應輸出（可整合 API、Gradio、LangChain 等）             |
+--------------------------------------------------------------+

💻 範例：Python 調用 llama.cpp API（socket / bindings）

import subprocess

prompt = "請用 Python 寫一個計算質數的程式，並加上中文註解"
result = subprocess.run(
    ["./main", "-m", "./models/llama-2-q4_0.gguf", "-p", prompt],
    capture_output=True,
    text=True
)
print(result.stdout)

🔗 延伸應用

整合 Gradio 做中文網頁聊天
與 LangChain 接合成知識檢索系統
本地私有部署（離線客服、知識問答）
移植到手機或樹莓派

🧩 llama.cpp & gemma.cpp 中文最佳化流程

選模型

llama.cpp → 推薦 Chinese-LLaMA-2、Taiwan-LLaMA
gemma.cpp → 推薦 Gemma-7B-zh

轉換與量化（以 llama.cpp 為例）

python3 convert.py /path/to/model
python3 quantize.py model.gguf model-q4_0.gguf q4_0

中文啟動（System Prompt）

./main -m ./models/model-q4_0.gguf \
       -p "你是一位精通繁體中文與 Python 程式設計的助理。請用繁體中文回答。"
       ```
## 💻 Python 程式碼範例（通用 llama.cpp / gemma.cpp）
```python
import subprocess

# 中文提示詞
prompt = "請用 Python 寫一個快速排序演算法，並加上繁體中文註解"

# 執行本地推理引擎
result = subprocess.run(
    ["./main", "-m", "./models/model-q4_0.gguf", "-p", prompt],
    capture_output=True,
    text=True
)

# 輸出結果
print(result.stdout)

🌏 整合 Gradio 做中文網頁聊天

import gradio as gr
import subprocess

def chat_with_model(msg):
    result = subprocess.run(
        ["./main", "-m", "./models/model-q4_0.gguf", "-p", msg],
        capture_output=True,
        text=True
    )
    return result.stdout
gr.ChatInterface(chat_with_model, title="本地中文聊天機器人").launch()

📌 總結

功能	llama.cpp	gemma.cpp
`支援模型`	LLaMA 系列	Gemma 系列
`中文微調資源多寡`	✅ 多	⚠️ 較少
`編譯流程`	類似	類似
`速度`	高效	高效
`社群活躍度`	🔥 高	中
`適合用途`	通用 LLM	Google Gemma 專用