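Batch Inference with TorchServe¶

Contents of this Document¶

Introduction¶

Batch inference is the process of aggregating inference requests and sending the aggregated requests through the ML/DL framework for inference all at once. TorchServe was designed to natively support batching of incoming inference requests. This lets you use your host resources optimally, because most ML/DL frameworks are optimized for batch requests. Better use of host resources in turn reduces the operational cost of hosting an inference service with TorchServe.

In this document we show examples of how to use batch inference when serving models with TorchServe locally or in Docker containers.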

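Prerequisites¶

Before reading this document, read the following documents:

Batch Inference with TorchServe's default handlers¶

TorchServe's default handlers support batch inference, except for the text_classifier handler.

Batch Inference with TorchServe using ResNet-152 model¶

To support batch inference, TorchServe needs the following:

TorchServe model configuration: configure batch_size and max_batch_delay by using the "POST /models" management API or the settings in config.properties. TorchServe needs to know the maximum batch size that the model can handle and the maximum time that TorchServe should wait to fill each batch.
Model handler code: TorchServe requires the model handler to be able to process batched inference requests.

For a full working example of a custom model handler with batch processing, see the Hugging Face Transformers generalized handler.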
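The handler requirement above can be sketched as follows. This is a minimal illustration, not one of TorchServe's shipped handlers: it assumes the module-level `handle(data, context)` entry point, where `data` is a list with one entry per request in the batch and each entry carries its payload under `data` or `body`. `fake_model` is a hypothetical stand-in for a real forward pass. The point to notice is that the handler returns exactly one result per request, in the same order.

```python
def fake_model(batch):
    """Hypothetical stand-in for a model forward pass over a whole batch."""
    return [len(item) for item in batch]


def handle(data, context):
    """Process one batch of requests and return one result per request."""
    if data is None:
        return None
    # Each batch entry is a dict; the payload sits under "data" or "body".
    payloads = [row.get("data") or row.get("body") for row in data]
    # Run the model once over the whole batch instead of once per request.
    outputs = fake_model(payloads)
    # The response list must match the batch in length and order.
    return [{"payload_size": size} for size in outputs]


# Simulate a batch of three requests as the server might deliver it.
batch = [{"body": b"abc"}, {"data": b"hello"}, {"body": b"x"}]
results = handle(batch, context=None)
```

A real handler would replace `fake_model` with preprocessing, a batched model call, and postprocessing, while keeping the one-result-per-request contract.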

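TorchServe Model Configuration¶

Starting from TorchServe 0.4.1, there are two ways to configure TorchServe to use the batching feature:

Provide the batch configuration information through the POST /models API.
Provide the batch configuration information through the configuration file, config.properties.

The configuration properties we are interested in are the following:

batch_size: This is the maximum batch size that a model is expected to handle.
max_batch_delay: This is the maximum time, in ms, that TorchServe waits to receive batch_size requests. If TorchServe does not receive batch_size requests before this timer expires, it sends whatever requests it has received to the model handler.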
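The interaction between batch_size and max_batch_delay described above can be modeled with a small queue-based sketch. It is a conceptual illustration only, not TorchServe's frontend code; `collect_batch` is a hypothetical helper, and the sketch assumes the delay timer starts once the first request of a batch has arrived.

```python
import queue
import time


def collect_batch(requests, batch_size, max_batch_delay_ms):
    """Gather up to batch_size items, waiting at most max_batch_delay_ms."""
    # Block until the first request arrives; the timer starts afterwards
    # (an assumption of this sketch).
    batch = [requests.get()]
    deadline = time.monotonic() + max_batch_delay_ms / 1000.0
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # timer expired: send whatever has been collected
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # no further request arrived before the deadline
    return batch


pending = queue.Queue()
for name in ["req-1", "req-2", "req-3", "req-4", "req-5"]:
    pending.put(name)

# Five queued requests with batch_size=3: one full batch, then a partial one.
full_batch = collect_batch(pending, batch_size=3, max_batch_delay_ms=50)
partial_batch = collect_batch(pending, batch_size=3, max_batch_delay_ms=50)
```

The first call returns as soon as three requests are available; the second waits out the 50 ms delay and then returns the two remaining requests. This is the trade-off the two settings control: a larger max_batch_delay fills batches more often but adds latency under light load.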

Let's look at an example that uses this configuration through the management API:

# The following command will register a model "resnet-152.mar" and configure TorchServe to use a batch_size of 8 and a max batch delay of 50 milliseconds.
curl -X POST "localhost:8081/models?url=resnet-152.mar&batch_size=8&max_batch_delay=50"
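The same registration request can be issued from Python with the standard library. The sketch below assumes the management address `http://localhost:8081` used in the curl command; `registration_url` and `register` are hypothetical helper names, and `register` needs a running TorchServe instance, so it is defined here but not called.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def registration_url(mar_url, batch_size, max_batch_delay,
                     management="http://localhost:8081"):
    """Build the POST /models URL with the batching parameters."""
    query = urlencode({
        "url": mar_url,
        "batch_size": batch_size,
        "max_batch_delay": max_batch_delay,
    })
    return f"{management}/models?{query}"


def register(mar_url, batch_size, max_batch_delay):
    """Send the registration request; requires a running TorchServe."""
    url = registration_url(mar_url, batch_size, max_batch_delay)
    with urlopen(Request(url, method="POST")) as response:
        return response.read().decode("utf-8")


url = registration_url("resnet-152.mar", batch_size=8, max_batch_delay=50)
```

Building the query string with `urlencode` also takes care of escaping when the `url` parameter is itself a full HTTP URL.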

Here is an example that uses this configuration through config.properties:

# The following config.properties entry registers the model "resnet-152.mar" and configures TorchServe to use a batch_size of 8 and a max_batch_delay of 50 milliseconds.

models={\
  "resnet-152": {\
    "1.0": {\
        "defaultVersion": true,\
        "marName": "resnet-152.mar",\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 8,\
        "maxBatchDelay": 50,\
        "responseTimeout": 120\
    }\
  }\
}
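When several models need such an entry, it can be convenient to generate the `models` property instead of typing the backslash continuations by hand. The following is a sketch under assumptions: `models_property` is a hypothetical helper, and the worker counts and response timeout are simply copied from the example above.

```python
import json


def models_property(model_name, version, mar_name, batch_size, max_batch_delay):
    """Render a models= entry for config.properties with line continuations."""
    settings = {
        model_name: {
            version: {
                "defaultVersion": True,
                "marName": mar_name,
                "minWorkers": 1,
                "maxWorkers": 1,
                "batchSize": batch_size,
                "maxBatchDelay": max_batch_delay,
                "responseTimeout": 120,
            }
        }
    }
    lines = json.dumps(settings, indent=2).splitlines()
    # A properties value continues onto the next line when the current
    # line ends with a backslash; the last line gets none.
    return "models=" + "\\\n".join(lines)


entry = models_property("resnet-152", "1.0", "resnet-152.mar", 8, 50)
```

Printing `entry` yields a block with the same structure as the example above, which can be appended to config.properties.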

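These configurations are used both by TorchServe and by the model's custom service code (that is, the handler code). TorchServe associates the batch-related configuration with each model. The frontend then tries to aggregate batch_size requests and send them to the backend.

Demo to configure TorchServe ResNet-152 model with batch-supported model¶

In this section, let's bring up the model server and launch the ResNet-152 model, which uses the default image_classifier handler for batch inference.

Setup TorchServe and Torch Model Archiver¶

First, follow the main Readme and install all the required packages, including torchserve.

Batch inference of Resnet-152 configured with management API¶

Start the model server. In this example, we start the model server to run on inference port 8080 and management port 8081.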

$ cat config.properties
...
inference_address=http://127.0.0.1:8080
management_address=http://127.0.0.1:8081
...
$ torchserve --start --model-store model_store

Verify that TorchServe is up and running

$ curl localhost:8080/ping
{
  "status": "Healthy"
}

Now let's launch the ResNet-152 model, which we have built to handle batch inference. Because this is an example, we launch 1 worker with a batch_size of 3 and a max_batch_delay of 10 ms.

$ curl -X POST "localhost:8081/models?url=https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar&batch_size=3&max_batch_delay=10&initial_workers=1"
{
  "status": "Processing worker updates..."
}

Verify that the workers were started properly.

curl http://localhost:8081/models/resnet-152-batch_v2

[
  {
    "modelName": "resnet-152-batch_v2",
    "modelVersion": "2.0",
    "modelUrl": "https://torchserve.pytorch.org/mar_files/resnet-152-batch_v2.mar",
    "runtime": "python",
    "minWorkers": 1,
    "maxWorkers": 1,
    "batchSize": 3,
    "maxBatchDelay": 10,
    "loadedAtStartup": false,
    "workers": [
      {
        "id": "9000",
        "startTime": "2021-06-14T23:18:21.793Z",
        "status": "READY",
        "memoryUsage": 1726554112,
        "pid": 19946,
        "gpu": true,
        "gpuUsage": "gpuId::0 utilization.gpu [%]::0 % utilization.memory [%]::0 % memory.used [MiB]::678 MiB"
      }
    ]
  }
]
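The describe-model response above can also be checked programmatically, for example in a deployment script. This sketch assumes the response has the shape shown above (a JSON list of model versions, each with a `workers` list); `workers_ready` is a hypothetical helper, and the sample string is an abbreviated copy of the output above.

```python
import json


def workers_ready(describe_response):
    """Return True if at least one worker exists and all are READY."""
    models = json.loads(describe_response)
    workers = [w for m in models for w in m.get("workers", [])]
    return bool(workers) and all(w["status"] == "READY" for w in workers)


sample = """[
  {
    "modelName": "resnet-152-batch_v2",
    "batchSize": 3,
    "maxBatchDelay": 10,
    "workers": [{"id": "9000", "status": "READY"}]
  }
]"""
ready = workers_ready(sample)
```

In practice the string would come from a GET request to `http://localhost:8081/models/resnet-152-batch_v2`.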

Now let's test this service.

Get an image to test this service

$ curl -LJO https://github.com/pytorch/serve/raw/master/examples/image_classifier/kitten.jpg

Run inference to test the model.

  $ curl http://localhost:8080/predictions/resnet-152-batch_v2 -T kitten.jpg
  {
      "tiger_cat": 0.5798614621162415,
      "tabby": 0.38344162702560425,
      "Egyptian_cat": 0.0342114195227623,
      "lynx": 0.0005819813231937587,
      "quilt": 0.000273319921689108
  }
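A single curl request like the one above reaches the handler as a batch of one once max_batch_delay expires. To see requests actually grouped, several must arrive within the delay window. The sketch below sends requests concurrently from Python; it assumes the inference endpoint shown above, and `post_image` and `send_concurrently` are hypothetical helper names. The sender is injectable so the fan-out logic can be tried without a running server.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

URL = "http://localhost:8080/predictions/resnet-152-batch_v2"


def post_image(payload, url=URL):
    """POST raw image bytes to the prediction endpoint (needs TorchServe)."""
    with urlopen(Request(url, data=payload, method="POST")) as response:
        return response.read().decode("utf-8")


def send_concurrently(payloads, send=post_image):
    """Send all payloads at the same time; results keep the input order."""
    with ThreadPoolExecutor(max_workers=len(payloads)) as pool:
        return list(pool.map(send, payloads))


# Dry run with a stand-in sender instead of a live server.
dry_run = send_concurrently([b"a", b"bb", b"ccc"], send=len)
```

Against the server started above, reading kitten.jpg into `image` and calling `send_concurrently([image] * 3)` should let all three requests arrive close together, so they can be grouped into one batch if they fall within the delay window.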

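Batch inference of Resnet-152 configured through config.properties¶

Here, we first set batch_size and max_batch_delay in config.properties. Make sure that the mar file is located in the model-store and that the version in the models setting matches the version of the mar file that was created. To read more about configurations, please refer to this document.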

load_models=resnet-152-batch_v2.mar
models={\
  "resnet-152-batch_v2": {\
    "2.0": {\
        "defaultVersion": true,\
        "marName": "resnet-152-batch_v2.mar",\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 3,\
        "maxBatchDelay": 5000,\
        "responseTimeout": 120\
    }\
  }\
}

Then start TorchServe, passing the config.properties file with the --ts-config option

torchserve --start --model-store model_store  --ts-config config.properties

Verify that TorchServe is up and running

$ curl localhost:8080/ping
{
  "status": "Healthy"
}

Verify that the workers were started properly.

curl http://localhost:8081/models/resnet-152-batch_v2

[
  {
    "modelName": "resnet-152-batch_v2",
    "modelVersion": "2.0",
    "modelUrl": "resnet-152-batch_v2.mar",
    "runtime": "python",
    "minWorkers": 1,
    "maxWorkers": 1,
    "batchSize": 3,
    "maxBatchDelay": 5000,
    "loadedAtStartup": true,
    "workers": [
      {
        "id": "9000",
        "startTime": "2021-06-14T22:44:36.742Z",
        "status": "READY",
        "memoryUsage": 0,
        "pid": 19116,
        "gpu": true,
        "gpuUsage": "gpuId::0 utilization.gpu [%]::0 % utilization.memory [%]::0 % memory.used [MiB]::678 MiB"
      }
    ]
  }
]

Now let's test this service.

Get an image to test this service

$ curl -LJO https://github.com/pytorch/serve/raw/master/examples/image_classifier/kitten.jpg

Run inference to test the model.

  $ curl http://localhost:8080/predictions/resnet-152-batch_v2 -T kitten.jpg
  {
      "tiger_cat": 0.5798614621162415,
      "tabby": 0.38344162702560425,
      "Egyptian_cat": 0.0342114195227623,
      "lynx": 0.0005819813231937587,
      "quilt": 0.000273319921689108
  }

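Demo to configure TorchServe ResNet-152 model with batch-supported model using Docker¶

Here, we show how to register a model with batch inference support when serving the model using Docker containers. We set batch_size and max_batch_delay in config.properties, as in the previous section; this file is used by dockerd-entrypoint.sh.

Batch inference of Resnet-152 using docker container¶

Set batch_size and max_batch_delay in the config.properties file referenced in dockerd-entrypoint.sh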

inference_address=http://127.0.0.1:8080
management_address=http://127.0.0.1:8081
metrics_address=http://127.0.0.1:8082
number_of_netty_threads=32
job_queue_size=1000
model_store=/home/model-server/model-store
load_models=resnet-152-batch_v2.mar
models={\
  "resnet-152-batch_v2": {\
    "2.0": {\
        "defaultVersion": true,\
        "marName": "resnet-152-batch_v2.mar",\
        "minWorkers": 1,\
        "maxWorkers": 1,\
        "batchSize": 3,\
        "maxBatchDelay": 100,\
        "responseTimeout": 120\
    }\
  }\
}

Build the target Docker image from here. In this example we use the GPU image

./build_image.sh -g -cv cu102

Start serving the model with the container and pass config.properties to the container

docker run --rm -it --gpus all -p 127.0.0.1:8080:8080 -p 127.0.0.1:8081:8081 --name mar -v /home/ubuntu/serve/model_store:/home/model-server/model-store -v <path to config.properties>:/home/model-server/config.properties pytorch/torchserve:latest-gpu

Verify that the workers were started properly.

curl http://localhost:8081/models/resnet-152-batch_v2

[
  {
    "modelName": "resnet-152-batch_v2",
    "modelVersion": "2.0",
    "modelUrl": "resnet-152-batch_v2.mar",
    "runtime": "python",
    "minWorkers": 1,
    "maxWorkers": 1,
    "batchSize": 3,
    "maxBatchDelay": 100,
    "loadedAtStartup": true,
    "workers": [
      {
        "id": "9000",
        "startTime": "2021-06-14T22:44:36.742Z",
        "status": "READY",
        "memoryUsage": 0,
        "pid": 19116,
        "gpu": true,
        "gpuUsage": "gpuId::0 utilization.gpu [%]::0 % utilization.memory [%]::0 % memory.used [MiB]::678 MiB"
      }
    ]
  }
]

Now let's test this service.

Get an image to test this service

$ curl -LJO https://github.com/pytorch/serve/raw/master/examples/image_classifier/kitten.jpg

Run inference to test the model.

  $ curl http://localhost:8080/predictions/resnet-152-batch_v2 -T kitten.jpg
  {
      "tiger_cat": 0.5798614621162415,
      "tabby": 0.38344162702560425,
      "Egyptian_cat": 0.0342114195227623,
      "lynx": 0.0005819813231937587,
      "quilt": 0.000273319921689108
  }
