在 LicheePi 4A openEuler RISC-V 运行 llama.cpp

wyy · 2025 年7 月 29 日 23:46

在 `LicheepPi 4A openEuler RISC-V` 上运行 `llama.cpp`

演示视频：荔枝派4A openEuler RISC-V 运行 llama.cpp 跑 DeepSeek - 千早爱音代讲

0. 准备

硬件设备：荔枝派4A 16GB version

系统版本：openEuler-24.03-LTS-SP1

镜像下载：oerv-lpi4a

镜像烧录流程：sipeed wiki 烧录镜像

1. 编译 `llama.cpp`

下载 llama.cpp 源码

git clone https://github.com/ggml-org/llama.cpp

注意下载最新版本的，旧版本加载模型时会出现 error loading model vocabulary: unknown pre-tokenizer type 错误，新版本修复了这个问题。

编译

cd llama.cpp
cmake -B build -DGGML_RVV=OFF
cmake --build build --config Release -j 4

2. 准备模型

以 DeepSeek-R1-Distill-Qwen-7B 模型作为例子。

从 modelscope 下载模型文件

pip install modelscope
modelscope download --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B --local_dir DeepSeek-R1-Distill-Qwen-7B

下载好的模型是以 HuggingFace 的 safetensors 格式存放的，而 llama.cpp 使用的是 GGUF 格式，因此需要进行格式转换：

pip install -r requirements.txt
python convert_hf_to_gguf.py DeepSeek-R1-Distill-Qwen-7B/

荔枝派4A上暂时无法安装 requirements.txt 中的所有依赖，这一步可以在其他设备上完成之后再拷贝文件到荔枝派。

模型量化

FP16精度的模型跑起来会很慢，需要对模型进行 Q4_K_M 量化以提升推理速度。

./build/bin/llama-quantize DeepSeek-R1-Distill-Qwen-7B/DeepSeek-R1-Distill-Qwen-7B-F16.gguf DeepSeek-R1-Distill-Qwen-7B/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf Q4_K_M

最终得到了可以被 llama.cpp 加载的模型文件 DeepSeek-R1-Distill-Qwen-7B-F16.gguf。

3. 运行模型

命令行运行

./build/bin/llama-cli -m DeepSeek-R1-Distill-Qwen-7B/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -cnv

HTTP Server 模式运行

./build/bin/llama-server -m DeepSeek-R1-Distill-Qwen-7B/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf --port 7170

1dentity · 2025 年7 月 30 日 02:54

能跑多少token呀？教程里貌似没提及

wyy · 2025 年7 月 30 日 06:16

由于没有开启 rvv，即使是量化后的模型，速度也比较慢，输出速度大约为每 20 秒 1 个 token

wuwei · 2025 年7 月 30 日 06:17

看成了 20个token每秒

感觉还是有一些改进空间的。

2959363314 · 2025 年10 月 5 日 14:21

试了好久，原来是自己开了RVV