Preface

In most cases we can simply load the models that Ollama ships with. But when the model we need is not in Ollama's library, for example a model we fine-tuned ourselves, we have to package it ourselves. Here we use llama.cpp to convert and quantize our own model into a format (GGUF) that Ollama can run.

There are several ways to download models. If you are in mainland China, you can refer to the ModelScope download method described in "D003 - Saving Time: A Roundup of Reliable AI Model Download Options".

Download the phi-2 model

docker pull python:3.10-slim
docker run -d --name=downloader -v `pwd`:/models python:3.10-slim tail -f /etc/hosts
# the remaining commands are run inside the container
docker exec -it downloader bash
sed -i 's/snapshot.debian.org/mirrors.tuna.tsinghua.edu.cn/g' /etc/apt/sources.list.d/debian.sources
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
pip install modelscope
pip install "huggingface_hub[cli]"
python -c "from modelscope import snapshot_download;snapshot_download('AI-ModelScope/phi-2', cache_dir='./models/')"

Build a current version of llama.cpp

Configure the base system environment

sed -i 's/snapshot.debian.org/mirrors.tuna.tsinghua.edu.cn/g' /etc/apt/sources.list.d/debian.sources
sed -i 's/deb.debian.org/mirrors.tuna.tsinghua.edu.cn/g' /etc/apt/sources.list.d/debian.sources
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
apt update

Download and compile llama.cpp

  • Download the source code
  • Change into the working directory
  • Build llama.cpp in the default (CPU-only) mode
git clone https://github.com/ggerganov/llama.cpp.git --depth=1
cd llama.cpp
cmake -B build
cmake --build build --config Release

# On macOS, Apple Metal is enabled by default, so the regular build above already uses it.
# To build without Metal instead:
cmake -B build -DGGML_METAL=OFF
cmake --build build --config Release

# 如果你使用 Nvidia GPU
apt install nvidia-cuda-toolkit -y
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
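After the build finishes, it is worth checking that the binaries used later in this post were actually produced. A quick sanity check (paths assume the default build directory above; llama-cli's --version flag prints the build info):

ls build/bin/ | grep -E 'llama-(quantize|bench|simple|lookup-stats|cli)'
./build/bin/llama-cli --version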

Convert the model with llama.cpp

apt install python3-full
pip install numpy pyyaml torch safetensors transformers sentencepiece
./convert_hf_to_gguf.py /data/models/models/AI-ModelScope/phi-2/
./convert_hf_to_gguf.py /data/models/models/LLM-Research/Meta-Llama-3___1-8B-Instruct/

INFO:hf-to-gguf:Loading model: Meta-Llama-3___1-8B-Instruct
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00004.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> F16, shape = {4096, 128256}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.bfloat16 --> F16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.bfloat16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.bfloat16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.bfloat16 --> F16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.bfloat16 --> F16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_q.weight,         torch.bfloat16 --> F16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.0.attn_v.weight,         torch.bfloat16 --> F16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.1.attn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.1.ffn_down.weight,       torch.bfloat16 --> F16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.1.ffn_gate.weight,       torch.bfloat16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.1.ffn_up.weight,         torch.bfloat16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.1.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.1.attn_k.weight,         torch.bfloat16 --> F16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.1.attn_output.weight,    torch.bfloat16 --> F16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.1.attn_q.weight,         torch.bfloat16 --> F16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.1.attn_v.weight,         torch.bfloat16 --> F16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.2.attn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.2.ffn_down.weight,       torch.bfloat16 --> F16, shape = {14336, 4096}
INFO:hf-to-gguf:blk.2.ffn_gate.weight,       torch.bfloat16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.2.ffn_up.weight,         torch.bfloat16 --> F16, shape = {4096, 14336}
INFO:hf-to-gguf:blk.2.ffn_norm.weight,       torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.2.attn_k.weight,         torch.bfloat16 --> F16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.2.attn_output.weight,    torch.bfloat16 --> F16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.2.attn_q.weight,         torch.bfloat16 --> F16, shape = {4096, 4096}
INFO:hf-to-gguf:blk.2.attn_v.weight,         torch.bfloat16 --> F16, shape = {4096, 1024}
INFO:hf-to-gguf:blk.3.attn_norm.weight,      torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.3.ffn_down.weight,       torch.bfloat16 --> F16, shape = {14336, 4096}
...

Note: the torch package alone is quite large, close to 700 MB.
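One more note on the Python side: the Debian 12 system used here marks its Python as "externally managed", so a bare pip install may be refused (which is also why python3-full is installed above). If you hit that, a throwaway virtual environment works. A minimal sketch, with /opt/gguf-venv as an arbitrary location:

python3 -m venv /opt/gguf-venv
/opt/gguf-venv/bin/pip install numpy pyyaml torch safetensors transformers sentencepiece
/opt/gguf-venv/bin/python ./convert_hf_to_gguf.py /data/models/models/AI-ModelScope/phi-2/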

Verify the converted model

Use the following command to check that the converted model loads correctly:

./build/bin/llama-lookup-stats -m /data/models/models/AI-ModelScope/phi-2/phi-2.8B-2-F16.gguf 

root@debian:/tmp/llama.cpp-master# ./build/bin/llama-lookup-stats -m /data/models/models/AI-ModelScope/phi-2/phi-2.8B-2-F16.gguf 
build: 0 (unknown) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
llama_model_loader: loaded meta data with 31 key-value pairs and 453 tensors from /data/models/models/AI-ModelScope/phi-2/phi-2.8B-2-F16.gguf (version GGUF V3 (latest))

You can also benchmark the model with the following command:

./build/bin/llama-bench -m /data/models/models/AI-ModelScope/phi-2/phi-2.8B-2-F16.gguf 

root@debian:/tmp/llama.cpp-master# ./build/bin/llama-bench -m /data/models/models/AI-ModelScope/phi-2/phi-2.8B-2-F16.gguf 
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| phi2 3B F16                    |   5.18 GiB |     2.78 B | CPU        |       6 |         pp512 |         62.43 ± 1.72 |
| phi2 3B F16                    |   5.18 GiB |     2.78 B | CPU        |       6 |         tg128 |          3.58 ± 0.00 |

Alternatively, run the llama-simple example, which loads the model and decodes a short completion, giving you a load check and a rough speed number in one go:

./build/bin/llama-simple -m /data/models/models/AI-ModelScope/phi-2/phi-2.8B-2-F16.gguf 

main: decoded 17 tokens in 5.21 s, speed: 3.26 t/s

llama_perf_sampler_print:    sampling time =       0.60 ms /    18 runs   (    0.03 ms per token, 30201.34 tokens per second)
llama_perf_context_print:        load time =   23402.49 ms
llama_perf_context_print: prompt eval time =     446.55 ms /     4 tokens (  111.64 ms per token,     8.96 tokens per second)
llama_perf_context_print:        eval time =    4758.15 ms /    17 runs   (  279.89 ms per token,     3.57 tokens per second)
llama_perf_context_print:       total time =   28169.79 ms /    21 tokens
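llama-simple runs with a built-in prompt; to try your own prompt against the converted model, the llama-cli binary built in the same step can be used, for example:

./build/bin/llama-cli -m /data/models/models/AI-ModelScope/phi-2/phi-2.8B-2-F16.gguf -p "Write a short poem about quantization" -n 64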

Quantize the model with llama.cpp

The quantization types supported by llama.cpp are listed in the upstream source file https://github.com/ggerganov/llama.cpp/blob/master/examples/quantize/quantize.cpp:

static const std::vector<struct quant_option> QUANT_OPTIONS = {
    { "Q4_0",     LLAMA_FTYPE_MOSTLY_Q4_0,     " 4.34G, +0.4685 ppl @ Llama-3-8B",  },
    { "Q4_1",     LLAMA_FTYPE_MOSTLY_Q4_1,     " 4.78G, +0.4511 ppl @ Llama-3-8B",  },
    { "Q5_0",     LLAMA_FTYPE_MOSTLY_Q5_0,     " 5.21G, +0.1316 ppl @ Llama-3-8B",  },
    { "Q5_1",     LLAMA_FTYPE_MOSTLY_Q5_1,     " 5.65G, +0.1062 ppl @ Llama-3-8B",  },
    { "IQ2_XXS",  LLAMA_FTYPE_MOSTLY_IQ2_XXS,  " 2.06 bpw quantization",            },
    { "IQ2_XS",   LLAMA_FTYPE_MOSTLY_IQ2_XS,   " 2.31 bpw quantization",            },
    { "IQ2_S",    LLAMA_FTYPE_MOSTLY_IQ2_S,    " 2.5  bpw quantization",            },
    { "IQ2_M",    LLAMA_FTYPE_MOSTLY_IQ2_M,    " 2.7  bpw quantization",            },
    { "IQ1_S",    LLAMA_FTYPE_MOSTLY_IQ1_S,    " 1.56 bpw quantization",            },
    { "IQ1_M",    LLAMA_FTYPE_MOSTLY_IQ1_M,    " 1.75 bpw quantization",            },
    { "TQ1_0",    LLAMA_FTYPE_MOSTLY_TQ1_0,    " 1.69 bpw ternarization",           },
    { "TQ2_0",    LLAMA_FTYPE_MOSTLY_TQ2_0,    " 2.06 bpw ternarization",           },
    { "Q2_K",     LLAMA_FTYPE_MOSTLY_Q2_K,     " 2.96G, +3.5199 ppl @ Llama-3-8B",  },
    { "Q2_K_S",   LLAMA_FTYPE_MOSTLY_Q2_K_S,   " 2.96G, +3.1836 ppl @ Llama-3-8B",  },
    { "IQ3_XXS",  LLAMA_FTYPE_MOSTLY_IQ3_XXS,  " 3.06 bpw quantization",            },
    { "IQ3_S",    LLAMA_FTYPE_MOSTLY_IQ3_S,    " 3.44 bpw quantization",            },
    { "IQ3_M",    LLAMA_FTYPE_MOSTLY_IQ3_M,    " 3.66 bpw quantization mix",        },
    { "Q3_K",     LLAMA_FTYPE_MOSTLY_Q3_K_M,   "alias for Q3_K_M"                   },
    { "IQ3_XS",   LLAMA_FTYPE_MOSTLY_IQ3_XS,   " 3.3 bpw quantization",             },
    { "Q3_K_S",   LLAMA_FTYPE_MOSTLY_Q3_K_S,   " 3.41G, +1.6321 ppl @ Llama-3-8B",  },
    { "Q3_K_M",   LLAMA_FTYPE_MOSTLY_Q3_K_M,   " 3.74G, +0.6569 ppl @ Llama-3-8B",  },
    { "Q3_K_L",   LLAMA_FTYPE_MOSTLY_Q3_K_L,   " 4.03G, +0.5562 ppl @ Llama-3-8B",  },
    { "IQ4_NL",   LLAMA_FTYPE_MOSTLY_IQ4_NL,   " 4.50 bpw non-linear quantization", },
    { "IQ4_XS",   LLAMA_FTYPE_MOSTLY_IQ4_XS,   " 4.25 bpw non-linear quantization", },
    { "Q4_K",     LLAMA_FTYPE_MOSTLY_Q4_K_M,   "alias for Q4_K_M",                  },
    { "Q4_K_S",   LLAMA_FTYPE_MOSTLY_Q4_K_S,   " 4.37G, +0.2689 ppl @ Llama-3-8B",  },
    { "Q4_K_M",   LLAMA_FTYPE_MOSTLY_Q4_K_M,   " 4.58G, +0.1754 ppl @ Llama-3-8B",  },
    { "Q5_K",     LLAMA_FTYPE_MOSTLY_Q5_K_M,   "alias for Q5_K_M",                  },
    { "Q5_K_S",   LLAMA_FTYPE_MOSTLY_Q5_K_S,   " 5.21G, +0.1049 ppl @ Llama-3-8B",  },
    { "Q5_K_M",   LLAMA_FTYPE_MOSTLY_Q5_K_M,   " 5.33G, +0.0569 ppl @ Llama-3-8B",  },
    { "Q6_K",     LLAMA_FTYPE_MOSTLY_Q6_K,     " 6.14G, +0.0217 ppl @ Llama-3-8B",  },
    { "Q8_0",     LLAMA_FTYPE_MOSTLY_Q8_0,     " 7.96G, +0.0026 ppl @ Llama-3-8B",  },
    { "F16",      LLAMA_FTYPE_MOSTLY_F16,      "14.00G, +0.0020 ppl @ Mistral-7B",  },
    { "BF16",     LLAMA_FTYPE_MOSTLY_BF16,     "14.00G, -0.0050 ppl @ Mistral-7B",  },
    { "F32",      LLAMA_FTYPE_ALL_F32,         "26.00G              @ 7B",          },

A type like Q4_K_M keeps the file small without giving up too much quality; pick the quantization level that fits your own requirements.

After quantizing, you can optionally run the verification again:

./build/bin/llama-quantize /data/models/models/AI-ModelScope/phi-2/phi-2.8B-2-F16.gguf  Q4_K_M
./build/bin/llama-simple -m /data/models/models/AI-ModelScope/phi-2/ggml-model-Q4_K_M.gguf 


main: decoded 17 tokens in 1.64 s, speed: 10.36 t/s

llama_perf_sampler_print:    sampling time =       0.38 ms /    18 runs   (    0.02 ms per token, 47368.42 tokens per second)
llama_perf_context_print:        load time =    5653.80 ms
llama_perf_context_print: prompt eval time =     112.59 ms /     4 tokens (   28.15 ms per token,    35.53 tokens per second)
llama_perf_context_print:        eval time =    1522.58 ms /    17 runs   (   89.56 ms per token,    11.17 tokens per second)
llama_perf_context_print:       total time =    7181.82 ms /    21 tokens

Compared with the unquantized F16 model, the speedup is obvious (generation goes from roughly 3.6 t/s to about 11.2 t/s on this machine).
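To put numbers on the comparison, you can also rerun llama-bench against the quantized file and compare its pp512/tg128 rows with the F16 table above:

./build/bin/llama-bench -m /data/models/models/AI-ModelScope/phi-2/ggml-model-Q4_K_M.gguf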

Building the Ollama model

Create a clean directory, copy in the model we just quantized in the other directory, and write an Ollama Modelfile to make the following steps easier.

mkdir ollama
cd ollama
cp /data/models/models/AI-ModelScope/phi-2/ggml-model-Q4_K_M.gguf .
echo "FROM ./ggml-model-Q4_K_M.gguf" > Modelfile
docker run -d --gpus=all -v `pwd`:/root/.ollama -p 11434:11434 --name ollama-llama3 ollama/ollama:0.5.2
docker exec -it ollama-llama3 bash 
ollama create custom_llama_3_1 -f ~/.ollama/Modelfile
ollama show custom_llama_3_1
du -hs ~/.ollama/models/
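Once the model has been created, you can smoke-test it either from inside the container with ollama run, or from the host through the API exposed on port 11434 (the prompt below is just an example):

ollama run custom_llama_3_1 "Explain GGUF in one sentence."
curl http://localhost:11434/api/generate -d '{"model": "custom_llama_3_1", "prompt": "Explain GGUF in one sentence.", "stream": false}'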

GGUF quantization modes explained

GGUF: a binary model file format and the successor to GGML; the "GG" prefix comes from the author's initials (Georgi Gerganov). The format is optimized for fast reading and writing, and it carries metadata alongside the tensors, so unlike tensor-only formats such as safetensors, a GGUF file is all-in-one.
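Because the format is self-describing, you can see this directly: a GGUF file starts with the 4-byte magic "GGUF" followed by a little-endian version field (the "GGUF V3" in the loader output above). A quick look with standard tools, using the file converted earlier:

head -c 4 /data/models/models/AI-ModelScope/phi-2/phi-2.8B-2-F16.gguf    # prints: GGUF
xxd -l 8 /data/models/models/AI-ModelScope/phi-2/phi-2.8B-2-F16.gguf     # magic plus the 32-bit version field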

Quantization: a technique that reduces model size and compute cost by lowering the precision used to represent the parameters, for example converting single-precision FP32 weights to INT8 to save storage and compute. There are many quantization schemes; a common one is linear (affine) quantization, where a real value r is recovered from the quantized integer q as r = S(q - Z), with the scale factor S and zero point Z computed from the statistics of the parameter distribution.
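As a toy illustration of that formula (plain per-tensor affine quantization to int8, not the block-wise K/IQ schemes that llama.cpp actually implements), the snippet below quantizes a few values and prints the round-trip error:

python3 - <<'EOF'
# Toy affine quantization: r is approximated by S * (q - Z), mapping floats in [-1, 1] to int8.
r_min, r_max = -1.0, 1.0
q_min, q_max = -128, 127
S = (r_max - r_min) / (q_max - q_min)              # scale factor
Z = round(q_min - r_min / S)                       # zero point
for r in (-1.0, -0.33, 0.0, 0.5, 0.99):
    q = max(q_min, min(q_max, round(r / S + Z)))   # quantize and clamp to int8 range
    r_hat = S * (q - Z)                            # dequantize
    print(f"r={r:+.4f} -> q={q:+4d} -> dequant={r_hat:+.4f} (err={r - r_hat:+.5f})")
EOF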

Basic numeric types

Legacy quantization types

K-series quantization types

IQ-series quantization types


