Building and Running Llama 3 8B Instruct with Qualcomm AI Engine Direct Backend

This tutorial demonstrates how to export Llama 3 8B Instruct for the Qualcomm AI Engine Direct Backend and run the model on a Qualcomm device.

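Prerequisites

If you have not already done so, set up the ExecuTorch repository and development environment by following Setting up ExecuTorch.
Read the Building and Running ExecuTorch with Qualcomm AI Engine Direct Backend page to understand how to export and run a model with the Qualcomm AI Engine Direct Backend on a Qualcomm device.
Follow the README for executorch llama to learn how to run a llama model on mobile via ExecuTorch.
A Qualcomm device with 16GB RAM.
- We are continuing to optimize memory usage to ensure compatibility with lower-memory devices.
Qualcomm AI Engine Direct SDK version 2.26.0 or above.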

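Instructions

Step 1: Prepare the model checkpoint and the optimized matrix from Spin Quant

For the Llama 3 tokenizer and checkpoint, refer to https://github.com/meta-llama/llama-models/blob/main/README.md for instructions on how to download tokenizer.model, consolidated.00.pth, and params.json.
To get the optimized matrix, refer to SpinQuant on GitHub. You can download the optimized rotation matrices in the Quantized Models section. Choose LLaMA-3-8B/8B_W4A16KV16_lr_1.5_seed_0.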
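Before moving on to the export, it can help to confirm that the three files named above are actually present. The following is a minimal sketch, not part of ExecuTorch: `check_llama_files` is a hypothetical helper, and it assumes all three files were downloaded into the same directory.

```shell
# Hypothetical helper: report which of the expected Llama 3 files are
# missing from a download directory. File names come from Step 1.
check_llama_files() {
    local dir="$1" missing=0 f
    for f in tokenizer.model consolidated.00.pth params.json; do
        if [ ! -f "${dir}/${f}" ]; then
            echo "missing: ${dir}/${f}"
            missing=1
        fi
    done
    # Exit status 0 when everything is present, 1 otherwise.
    return "${missing}"
}
```

For example, `check_llama_files <path_to_download_dir> && echo "all files present"` prints one line per missing file, or the success message when none are missing.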

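Step 2: Export to ExecuTorch with Qualcomm AI Engine Direct Backend

Deploying large language models like Llama 3 on-device presents the following challenges:

The model is too large to fit in device memory for inference.
Model loading and inference take a long time.
Quantization is difficult.

To address these challenges, we have implemented the following solutions:

Use --pt2e_quantize qnn_16a4w to quantize activations and weights, which reduces the on-disk model size and alleviates memory pressure during inference.
Use --num_sharding 8 to shard the model into sub-parts.
Perform graph transformations to convert or decompose operations into more accelerator-friendly operations.
Use --optimized_rotation_path <path_to_optimized_matrix> to apply the R1 and R2 rotations of Spin Quant to improve accuracy.
Use --calibration_data "<|start_header_id|>system<|end_header_id|..." to ensure that, during quantization of Llama 3 8B Instruct, calibration includes the special tokens in the prompt template. For more details on the prompt template, refer to the model card of meta llama3 instruct.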
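As a rough illustration of why 4-bit weights and sharding help, the sketch below estimates weight storage for roughly 8 billion parameters. It assumes every weight is stored at the stated bit width and that shards are equal in size, neither of which the tutorial states. `weights_mb_per_shard` is a hypothetical helper, and quantization scales, activations, and the KV cache are ignored.

```shell
# Back-of-the-envelope weight storage, in MB (10^6 bytes) per shard.
# Assumption: uniform bit width for all weights and equally sized shards.
#   $1 = parameters in billions
#   $2 = bits per weight
#   $3 = number of shards
weights_mb_per_shard() {
    echo $(( $1 * $2 * 1000 / 8 / $3 ))
}
```

Under these assumptions, `weights_mb_per_shard 8 16 1` prints 16000 (about 16 GB of 16-bit weights in one piece), while `weights_mb_per_shard 8 4 8` prints 500 (about 0.5 GB per shard with 4-bit weights and 8 shards).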

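To export Llama 3 8B Instruct with the Qualcomm AI Engine Direct Backend, ensure the following:

The host machine has more than 100GB of memory (RAM + swap space).
The entire process can take a few hours.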
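The memory requirement above can be checked before starting a multi-hour export. This is a minimal sketch assuming a Linux host that exposes `/proc/meminfo`; `total_mem_gib` is a hypothetical helper, and it reports GiB, which is slightly stricter than the decimal GB the tutorial may intend.

```shell
# Sum MemTotal and SwapTotal (both reported in kB) from a meminfo-style
# file and print the total in whole GiB. Defaults to /proc/meminfo.
total_mem_gib() {
    awk '/^(MemTotal|SwapTotal):/ { kb += $2 }
         END { print int(kb / 1048576) }' "${1:-/proc/meminfo}"
}
```

A guard such as `[ "$(total_mem_gib)" -gt 100 ] || echo "less than 100 GiB of RAM + swap"` could then run ahead of the export command.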

# Please note that calibration_data must include the prompt template for special tokens.
python -m examples.models.llama2.export_llama -t <path_to_tokenizer.model> -p <path_to_params.json> -c <path_to_checkpoint_for_Meta-Llama-3-8B-Instruct> --use_kv_cache --qnn --pt2e_quantize qnn_16a4w --disable_dynamic_shape --num_sharding 8 --calibration_tasks wikitext --calibration_limit 1 --calibration_seq_length 128 --optimized_rotation_path <path_to_optimized_matrix> --calibration_data "<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
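The calibration string in the export command follows a fixed single-turn pattern: a system message, a user message, and an opening assistant header. The sketch below builds that string from two arguments so it can be reused with other prompts. `llama3_prompt` is a hypothetical helper, not an ExecuTorch utility, and it only reproduces the pattern visible in the command above.

```shell
# Hypothetical helper: build a single-turn Llama 3 Instruct prompt.
#   $1 = system message, $2 = user message
# The "\n" sequences are emitted literally (backslash followed by n),
# which is what the shell passes for "\n" inside double quotes.
llama3_prompt() {
    printf '%s' "<|start_header_id|>system<|end_header_id|>\\n\\n$1<|eot_id|><|start_header_id|>user<|end_header_id|>\\n\\n$2<|eot_id|><|start_header_id|>assistant<|end_header_id|>\\n\\n"
}
```

For instance, `--calibration_data "$(llama3_prompt "You are a funny chatbot." "Could you tell me about Facebook?")"` yields the same argument as the literal string used in the tutorial.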

Step 3: Invoke the runtime on an Android smartphone with a Qualcomm SoC

Build executorch with the Qualcomm AI Engine Direct Backend for Android

cmake \
    -DCMAKE_TOOLCHAIN_FILE="${ANDROID_NDK_ROOT}/build/cmake/android.toolchain.cmake" \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_PLATFORM=android-23 \
    -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
    -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
    -DEXECUTORCH_BUILD_QNN=ON \
    -DQNN_SDK_ROOT=${QNN_SDK_ROOT} \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -Bcmake-android-out .

cmake --build cmake-android-out -j16 --target install --config Release
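The configure step above reads `ANDROID_NDK_ROOT` and `QNN_SDK_ROOT` from the environment, and an unset variable tends to produce a confusing CMake error. A small pre-flight check along these lines may help; `require_env` is a hypothetical helper and only tests that each variable is non-empty, not that the path it holds is valid.

```shell
# Hypothetical helper: fail if any of the named environment variables
# is unset or empty. Variable names are passed as arguments.
require_env() {
    local name value status=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${${name}:-}"
        if [ -z "${value}" ]; then
            echo "error: ${name} is not set" >&2
            status=1
        fi
    done
    return "${status}"
}
```

Running `require_env ANDROID_NDK_ROOT QNN_SDK_ROOT || exit 1` at the top of a build script stops early with one message per missing variable.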

Build the llama runner for Android

    cmake \
        -DCMAKE_TOOLCHAIN_FILE="${ANDROID_NDK_ROOT}"/build/cmake/android.toolchain.cmake  \
        -DANDROID_ABI=arm64-v8a \
        -DANDROID_PLATFORM=android-23 \
        -DCMAKE_INSTALL_PREFIX=cmake-android-out \
        -DCMAKE_BUILD_TYPE=Release -DPYTHON_EXECUTABLE=python \
        -DEXECUTORCH_BUILD_QNN=ON \
        -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
        -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
        -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
        -Bcmake-android-out/examples/models/llama2 examples/models/llama2

    cmake --build cmake-android-out/examples/models/llama2 -j16 --config Release

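Run on Android via adb shell

Prerequisite: make sure USB debugging is enabled via the developer options on your phone.

3.1 Connect your Android phone

3.2 Push the required QNN libraries to the device.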

# Make sure you have write permission on the path below.
DEVICE_DIR=/data/local/tmp/llama
adb shell mkdir -p ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtp.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnSystem.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV69Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV73Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/aarch64-android/libQnnHtpV75Stub.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v69/unsigned/libQnnHtpV69Skel.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v73/unsigned/libQnnHtpV73Skel.so ${DEVICE_DIR}
adb push ${QNN_SDK_ROOT}/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so ${DEVICE_DIR}
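The commands above push the Stub and Skel libraries for three Hexagon versions (V69, V73, V75). If only one version is needed for a given phone, the list can be generated instead of typed out. The sketch below is an assumption-laden convenience: `qnn_libs_for_arch` is a hypothetical helper, it only mirrors the paths used in this tutorial, and which version matches a particular SoC is not covered here.

```shell
# Hypothetical helper: print the QNN libraries this tutorial pushes for
# one Hexagon version (69, 73 or 75), one path per line, relative to
# QNN_SDK_ROOT.
qnn_libs_for_arch() {
    local v="$1"
    printf '%s\n' \
        "lib/aarch64-android/libQnnHtp.so" \
        "lib/aarch64-android/libQnnSystem.so" \
        "lib/aarch64-android/libQnnHtpV${v}Stub.so" \
        "lib/hexagon-v${v}/unsigned/libQnnHtpV${v}Skel.so"
}
```

A loop such as `for lib in $(qnn_libs_for_arch 75); do adb push "${QNN_SDK_ROOT}/${lib}" "${DEVICE_DIR}"; done` would then push the four files for that version.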

3.3 Upload the model, tokenizer, and llama runner binary to the phone

adb push <model.pte> ${DEVICE_DIR}
adb push <tokenizer.model> ${DEVICE_DIR}
adb push cmake-android-out/lib/libqnn_executorch_backend.so ${DEVICE_DIR}
adb push cmake-android-out/examples/models/llama2/llama_main ${DEVICE_DIR}

3.4 Run the model

adb shell "cd ${DEVICE_DIR} && ./llama_main --model_path <model.pte> --tokenizer_path <tokenizer.model> --prompt \"<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n\" --seq_len 128"

You should see output like the following:

<|start_header_id|>system<|end_header_id|>\n\nYou are a funny chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nCould you tell me about Facebook?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHello! I'd be delighted to chat with you about Facebook. Facebook is a social media platform that was created in 2004 by Mark Zuckerberg and his colleagues while he was a student at Harvard University. It was initially called "Facemaker" but later changed to Facebook, which is a combination of the words "face" and "book". The platform was initially intended for people to share their thoughts and share information with their friends, but it quickly grew to become one of the

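What is coming?

Performance improvements for Llama 3 Instruct
Reduced memory pressure during inference to support 12GB Qualcomm devices
Support for more LLMs

FAQ

If you encounter any issues while reproducing this tutorial, please file a GitHub issue on the ExecuTorch repo and use the #qcom_aisw tag.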