Management API

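TorchServe provides a set of management APIs for managing models at runtime: registering a model, scaling its workers, describing its status, unregistering it, listing registered models, and setting a default version.

The management API listens on port 8081 and is only accessible from localhost by default. To change the default setting, see TorchServe Configuration.

The management APIs for registering and deleting models are disabled by default. Add --enable-model-api to the command line when running TorchServe to enable them. For more details and other ways to enable them, see Model API control.

For all management API requests, TorchServe requires the correct management token to be included, unless token authorization is disabled. For more details, see the token authorization documentation.

Similar to the inference API, the management API provides an API description that describes the management APIs using the OpenAPI 3.0 specification.

Alternatively, if you want to use KServe, TorchServe supports both the v1 and v2 APIs. For more details, see the KServe documentation.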

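Register a model

This API follows the ManagementAPIsService.RegisterModel gRPC API.

To use this API after TorchServe starts, model API control must be enabled. Add --enable-model-api to the command line when starting TorchServe to enable this API. For more details, see Model API control.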

POST /models

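url - Model archive download URL. The following locations are supported:
- A local model archive (.mar); the file must be in the model_store folder (and not in a subfolder).
- A URI using the HTTP(s) protocol. TorchServe can download .mar files from the Internet.
model_name - The name of the model; this name is used as {model_name} in the path of other APIs. If this parameter is not present, modelName in MANIFEST.json is used.
handler - The inference handler entry point. If present, this value overrides handler in MANIFEST.json. NOTE: Make sure that the given handler is in the PYTHONPATH. The format of handler is module_name:method_name.
runtime - The runtime for the model's custom service code. If present, this value overrides runtime in MANIFEST.json. The default value is PYTHON.
batch_size - The inference batch size. The default value is 1.
max_batch_delay - The maximum delay for batch aggregation. The default value is 100 milliseconds.
initial_workers - The number of initial workers to create. The default value is 0. TorchServe will not run inference until at least one worker is assigned.
synchronous - Whether the creation of workers is synchronous. The default value is false, in which case TorchServe creates new workers without waiting for acknowledgement that the previous worker is online.
response_timeout - If the model's backend worker does not respond with an inference response within this timeout period, the worker is deemed unresponsive and rebooted. The unit is seconds. The default value is 120 seconds.
startup_timeout - If the model's backend worker does not load the model within this timeout period, the worker is deemed unresponsive and rebooted. The unit is seconds. The default value is 120 seconds.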
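
The query parameters above can also be assembled programmatically. The following is a minimal sketch using only the Python standard library; build_register_url is a hypothetical helper, not part of TorchServe, and it only builds the URL without sending a request. It percent-encodes the value of url, which assumes the server decodes query strings in the usual way.

```python
from urllib.parse import urlencode


def build_register_url(base, mar_url, **params):
    """Hypothetical helper: build the POST /models URL from query parameters."""
    query = {"url": mar_url}
    for key, value in params.items():
        if value is None:
            continue  # omit unset parameters so the server-side defaults apply
        # Booleans are sent as lowercase "true"/"false", as in the curl examples.
        query[key] = str(value).lower() if isinstance(value, bool) else value
    return f"{base}/models?{urlencode(query)}"


url = build_register_url(
    "http://localhost:8081",
    "https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar",
    initial_workers=1,
    synchronous=True,
)
print(url)
```

The resulting string can be passed to any HTTP client as the target of a POST request.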

curl -X POST  "http://localhost:8081/models?url=https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar"

{
  "status": "Model \"squeezenet_v1.1\" Version: 1.0 registered with 0 initial workers. Use scale workers API to add workers for the model."
}

Encrypted model serving

To serve an encrypted model, set up S3 SSE-KMS with the following environment variables:

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_DEFAULT_REGION

Then set "s3_sse_kms=true" in the HTTP request.

For example, suppose model squeezenet1_1 is encrypted on S3 under your own private account, and its HTTP URL on S3 is https://torchserve.pytorch.org/sse-test/squeezenet1_1.mar.

If TorchServe will run on an EC2 instance (e.g. OS: ubuntu):

1. Add an IAM role (AWSS3ReadOnlyAccess) for the EC2 instance.
2. Run ts_scripts/get_aws_credential.sh to export AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
3. export AWS_DEFAULT_REGION=your_s3_bucket_region
4. Start TorchServe.
5. Register the encrypted model squeezenet1_1 by setting s3_sse_kms=true in the curl command.

curl -X POST  "http://localhost:8081/models?url=https://torchserve.pytorch.org/sse-test/squeezenet1_1.mar&s3_sse_kms=true"

{
  "status": "Model \"squeezenet_v1.1\" Version: 1.0 registered with 0 initial workers. Use scale workers API to add workers for the model."
}

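If TorchServe will run locally (e.g. OS: macOS):

1. Find your AWS access key and secret key. You can reset them if you have forgotten them.
2. export AWS_ACCESS_KEY_ID=your_aws_access_key
3. export AWS_SECRET_ACCESS_KEY=your_aws_secret_key
4. export AWS_DEFAULT_REGION=your_s3_bucket_region
5. Start TorchServe.
6. Register the encrypted model squeezenet1_1 by setting s3_sse_kms=true in the curl command (same as step 5 of the EC2 example).

You might want to create workers during registration. Because creating the initial workers can take some time, you can choose between a synchronous and an asynchronous call to make sure the initial workers are created properly.

The asynchronous call returns HTTP code 202 before trying to create workers.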

curl -v -X POST "http://localhost:8081/models?initial_workers=1&synchronous=false&url=https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar"

< HTTP/1.1 202 Accepted
< content-type: application/json
< x-request-id: 4dc54158-c6de-42aa-b5dd-ebcb5f721043
< content-length: 47
< connection: keep-alive
<
{
  "status": "Processing worker updates..."
}

The synchronous call returns HTTP code 200 after all workers have been adjusted.

curl -v -X POST "http://localhost:8081/models?initial_workers=1&synchronous=true&url=https://torchserve.pytorch.org/mar_files/squeezenet1_1.mar"

< HTTP/1.1 200 OK
< content-type: application/json
< x-request-id: ecd2e502-382f-4c3b-b425-519fbf6d3b85
< content-length: 89
< connection: keep-alive
<
{
  "status": "Model \"squeezenet1_1\" Version: 1.0 registered with 1 initial workers"
}

Scale workers

This API follows the ManagementAPIsService.ScaleWorker gRPC API. It returns the status of a model in the ModelServer.

PUT /models/{model_name}

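min_worker - (optional) The minimum number of worker processes. TorchServe will try to maintain this minimum for the specified model. The default value is 1.
max_worker - (optional) The maximum number of worker processes. TorchServe will create no more than this number of workers for the specified model. The default is the same as the setting for min_worker.
synchronous - Whether the call is synchronous. The default value is false.
timeout - The wait time for a worker to complete all pending requests. If it is exceeded, the worker process is terminated. Use 0 to terminate the backend worker process immediately. Use -1 to wait indefinitely. The default value is -1.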

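Use the Scale Worker API to dynamically adjust the number of workers for any version of a model to better serve different inference request loads.

There are two flavors of this API: synchronous and asynchronous.

The asynchronous call returns immediately with HTTP code 202: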

curl -v -X PUT "http://localhost:8081/models/noop?min_worker=3"

< HTTP/1.1 202 Accepted
< content-type: application/json
< x-request-id: 42adc58e-6956-4198-ad07-db6c620c4c1e
< content-length: 47
< connection: keep-alive
<
{
  "status": "Processing worker updates..."
}

The synchronous call returns HTTP code 200 after all workers have been adjusted.

curl -v -X PUT "http://localhost:8081/models/noop?min_worker=3&synchronous=true"

< HTTP/1.1 200 OK
< content-type: application/json
< x-request-id: b72b1ea0-81c6-4cce-92c4-530d3cfe5d4a
< content-length: 63
< connection: keep-alive
<
{
  "status": "Workers scaled to 3 for model: noop"
}

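To scale the workers of a specific version of a model, use the URI /models/{model_name}/{version}:

PUT /models/{model_name}/{version}

The following synchronous call returns HTTP code 200 after all workers for version "2.0" of model "noop" have been adjusted.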

curl -v -X PUT "http://localhost:8081/models/noop/2.0?min_worker=3&synchronous=true"

< HTTP/1.1 200 OK
< content-type: application/json
< x-request-id: 3997ccd4-ae44-4570-b249-e361b08d3d47
< content-length: 77
< connection: keep-alive
<
{
  "status": "Workers scaled to 3 for model: noop, version: 2.0"
}
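
The scale calls shown in this section can be built from Python as well. The sketch below uses only the standard library; build_scale_request and STATUS_MEANING are hypothetical names introduced for illustration. The helper constructs the PUT request but does not send it, so it runs without a TorchServe instance.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Meaning of the two success codes described in this section.
STATUS_MEANING = {
    202: "accepted; workers are still being adjusted",
    200: "all workers have been adjusted",
}


def build_scale_request(base, model_name, version=None, **params):
    """Hypothetical helper: build (but do not send) a PUT /models/... request."""
    path = f"{base}/models/{model_name}"
    if version is not None:
        path += f"/{version}"
    # Send booleans as lowercase "true"/"false", matching the curl examples.
    query = urlencode(
        {k: str(v).lower() if isinstance(v, bool) else v for k, v in params.items()}
    )
    return Request(f"{path}?{query}" if query else path, method="PUT")


req = build_scale_request(
    "http://localhost:8081", "noop", version="2.0", min_worker=3, synchronous=True
)
print(req.get_method(), req.full_url)
```

To actually send the request, pass it to urllib.request.urlopen against a running server; if token authorization is enabled, an Authorization header with the management token would also be needed.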

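Describe model

This API follows the ManagementAPIsService.DescribeModel gRPC API. It returns the status of a model in the ModelServer.

GET /models/{model_name}

Use the Describe Model API to get the detailed runtime status of the default version of a model: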

curl http://localhost:8081/models/noop
[
    {
      "modelName": "noop",
      "modelVersion": "1.0",
      "modelUrl": "noop.mar",
      "engine": "Torch",
      "runtime": "python",
      "minWorkers": 1,
      "maxWorkers": 1,
      "batchSize": 1,
      "maxBatchDelay": 100,
      "workers": [
        {
          "id": "9000",
          "startTime": "2018-10-02T13:44:53.034Z",
          "status": "READY",
          "gpu": false,
          "memoryUsage": 89247744
        }
      ],
      "jobQueueStatus": {
        "remainingCapacity": 100,
        "pendingRequests": 0
      }
    }
]

GET /models/{model_name}/{version}

Use the Describe Model API to get the detailed runtime status of a specific version of a model:

curl http://localhost:8081/models/noop/2.0
[
    {
      "modelName": "noop",
      "modelVersion": "2.0",
      "modelUrl": "noop_2.mar",
      "engine": "Torch",
      "runtime": "python",
      "minWorkers": 1,
      "maxWorkers": 1,
      "batchSize": 1,
      "maxBatchDelay": 100,
      "workers": [
        {
          "id": "9000",
          "startTime": "2018-10-02T13:44:53.034Z",
          "status": "READY",
          "gpu": false,
          "memoryUsage": 89247744
        }
      ],
      "jobQueueStatus": {
        "remainingCapacity": 100,
        "pendingRequests": 0
      }
    }
]

GET /models/{model_name}/all

Use the Describe Model API to get the detailed runtime status of all versions of a model:

curl http://localhost:8081/models/noop/all
[
    {
      "modelName": "noop",
      "modelVersion": "1.0",
      "modelUrl": "noop.mar",
      "engine": "Torch",
      "runtime": "python",
      "minWorkers": 1,
      "maxWorkers": 1,
      "batchSize": 1,
      "maxBatchDelay": 100,
      "workers": [
        {
          "id": "9000",
          "startTime": "2018-10-02T13:44:53.034Z",
          "status": "READY",
          "gpu": false,
          "memoryUsage": 89247744
        }
      ],
      "jobQueueStatus": {
        "remainingCapacity": 100,
        "pendingRequests": 0
      }
    },
    {
      "modelName": "noop",
      "modelVersion": "2.0",
      "modelUrl": "noop_2.mar",
      "engine": "Torch",
      "runtime": "python",
      "minWorkers": 1,
      "maxWorkers": 1,
      "batchSize": 1,
      "maxBatchDelay": 100,
      "workers": [
        {
          "id": "9000",
          "startTime": "2018-10-02T13:44:53.034Z",
          "status": "READY",
          "gpu": false,
          "memoryUsage": 89247744
        }
      ],
      "jobQueueStatus": {
        "remainingCapacity": 100,
        "pendingRequests": 0
      }
    }
]
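
A client will often want to reduce a describe response to a quick health summary. The sketch below is one way to do that, assuming the response has the shape shown above; ready_workers is a hypothetical helper, and the sample string is a trimmed copy of the /models/noop/all output rather than live data.

```python
import json

# Trimmed copy of the /models/noop/all response shown above.
sample = """
[
  {"modelName": "noop", "modelVersion": "1.0",
   "workers": [{"id": "9000", "status": "READY"}]},
  {"modelName": "noop", "modelVersion": "2.0",
   "workers": [{"id": "9000", "status": "READY"}]}
]
"""


def ready_workers(describe_response):
    """Map each model version to the number of workers reporting READY."""
    return {
        entry["modelVersion"]: sum(
            1 for worker in entry["workers"] if worker["status"] == "READY"
        )
        for entry in describe_response
    }


summary = ready_workers(json.loads(sample))
print(summary)  # {'1.0': 1, '2.0': 1}
```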

GET /models/{model_name}/{model_version}?customized=true
or
GET /models/{model_name}?customized=true

Use the Describe Model API to get the detailed runtime status and customized metadata of a version of a model:

Implement the describe_handle function. For example:

    def describe_handle(self):
        """Customized describe handler
        Returns:
            dict : A dictionary response.
        """
        output_describe = None

        logger.info("Collect customized metadata")

        return output_describe

If the handler does not inherit from BaseHandler, implement the _is_describe function, and then call _is_describe and describe_handle in handle.

    def _is_describe(self):
        if self.context and self.context.get_request_header(0, "describe"):
            if self.context.get_request_header(0, "describe") == "True":
                return True
        return False

    def handle(self, data, context):
        if self._is_describe():
            output = [self.describe_handle()]
        else:
            data_preprocess = self.preprocess(data)

            if not self._is_explain():
                output = self.inference(data_preprocess)
                output = self.postprocess(output)
            else:
                output = self.explain_handle(data_preprocess, data)

        return output

Call the _is_describe and describe_handle functions in handle. For example:

    def handle(self, data, context):
        """Entry point for default handler. It takes the data from the input request and returns
           the predicted outcome for the input.
        Args:
            data (list): The input data that needs to be made a prediction request on.
            context (Context): It is a JSON Object containing information pertaining to
                               the model artifacts parameters.
        Returns:
            list : Returns a list of dictionary with the predicted response.
        """

        # It can be used for pre or post processing if needed as additional request
        # information is available in context
        start_time = time.time()

        self.context = context
        metrics = self.context.metrics

        is_profiler_enabled = os.environ.get("ENABLE_TORCH_PROFILER", None)
        if is_profiler_enabled:
            output, _ = self._infer_with_profiler(data=data)
        else:
            if self._is_describe():
                output = [self.describe_handle()]
            else:
                data_preprocess = self.preprocess(data)

                if not self._is_explain():
                    output = self.inference(data_preprocess)
                    output = self.postprocess(output)
                else:
                    output = self.explain_handle(data_preprocess, data)

        stop_time = time.time()
        metrics.add_time('HandlerTime', round(
            (stop_time - start_time) * 1000, 2), None, 'ms')
        return output

Here is an example. "customizedMetadata" shows the metadata from the user's model. This metadata can be decoded into a dictionary.

curl "http://localhost:8081/models/noop-customized/1.0?customized=true"
[
    {
        "modelName": "noop-customized",
        "modelVersion": "1.0",
        "modelUrl": "noop-customized.mar",
        "runtime": "python",
        "minWorkers": 1,
        "maxWorkers": 1,
        "batchSize": 1,
        "maxBatchDelay": 100,
        "loadedAtStartup": false,
        "workers": [
          {
            "id": "9010",
            "startTime": "2022-02-08T11:03:20.974Z",
            "status": "READY",
            "memoryUsage": 0,
            "pid": 98972,
            "gpu": false,
            "gpuUsage": "N/A"
          }
        ],
        "jobQueueStatus": {
          "remainingCapacity": 100,
          "pendingRequests": 0
        },
        "customizedMetadata": "{\n  \"data1\": \"1\",\n  \"data2\": \"2\"\n}"
     }
]

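Decode customizedMetadata on the client side. For example:

import json

import requests

# Query the default version of the model and request its customized metadata.
response = requests.get('http://localhost:8081/models/noop-customized?customized=true').json()
# customizedMetadata arrives as a JSON-encoded string; decode it into a dictionary.
customizedMetadata = json.loads(response[0]['customizedMetadata'])
print(customizedMetadata)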

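Unregister a model

This API follows the ManagementAPIsService.UnregisterModel gRPC API. It returns the status of a model in the ModelServer.

To use this API after TorchServe starts, model API control must be enabled. Add --enable-model-api to the command line when starting TorchServe to enable this API. For more details, see Model API control.

DELETE /models/{model_name}/{version}

Use the Unregister Model API to free up system resources by unregistering a specific version of a model from TorchServe: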

curl -X DELETE http://localhost:8081/models/noop/1.0

{
  "status": "Model \"noop\" unregistered"
}

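List models

This API follows the ManagementAPIsService.ListModels gRPC API. It returns the status of a model in the ModelServer.

GET /models

limit - (optional) The maximum number of items to return. It is passed as a query parameter. The default value is 100.
next_page_token - (optional) The token used to query the next page. It is passed as a query parameter. This value is returned by a previous API call.

Use the Models API to query the default versions of the currently registered models:

curl "http://localhost:8081/models"

This API supports pagination: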

curl "http://localhost:8081/models?limit=2&next_page_token=2"

{
  "nextPageToken": "4",
  "models": [
    {
      "modelName": "noop",
      "modelUrl": "noop-v1.0"
    },
    {
      "modelName": "noop_v0.1",
      "modelUrl": "noop-v0.1"
    }
  ]
}
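
Walking through all pages amounts to feeding each nextPageToken back in as next_page_token. The sketch below illustrates that loop under one assumption: that the last page omits nextPageToken or leaves it empty. iter_models, fetch_page_http, and fake_fetch are hypothetical names; fake_fetch stands in for the server so the example runs without TorchServe.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_page_http(limit, token, base="http://localhost:8081"):
    """Fetch one page from GET /models (needs a running TorchServe)."""
    query = {"limit": limit}
    if token:
        query["next_page_token"] = token
    with urlopen(f"{base}/models?{urlencode(query)}") as resp:
        return json.load(resp)


def iter_models(fetch_page, limit=100):
    """Yield every model entry, following nextPageToken until it is absent."""
    token = None
    while True:
        page = fetch_page(limit, token)
        yield from page.get("models", [])
        token = page.get("nextPageToken")
        if not token:  # assumption: no token means this was the last page
            break


def fake_fetch(limit, token):
    """In-memory stand-in for the server, used only for this demonstration."""
    names = ["a", "b", "c"]
    start = int(token or 0)
    end = start + limit
    page = {"models": [{"modelName": name} for name in names[start:end]]}
    if end < len(names):
        page["nextPageToken"] = str(end)
    return page


model_names = [m["modelName"] for m in iter_models(fake_fetch, limit=2)]
print(model_names)  # ['a', 'b', 'c']
```

Against a live server, iter_models(fetch_page_http) would perform the same loop over real responses.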

API description

OPTIONS /

To view the full list of inference and management APIs, use the following commands:

# To view all inference APIs:
curl -X OPTIONS http://localhost:8080

# To view all management APIs:
curl -X OPTIONS http://localhost:8081

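The output is in OpenAPI 3.0.1 JSON format. You can use it to generate client code; see swagger codegen for details.

Example outputs are available for the inference and management API descriptions.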

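Set default version

This API follows the ManagementAPIsService.SetDefault gRPC API. It returns the status of a model in the ModelServer.

PUT /models/{model_name}/{version}/set-default

To set any registered version of a model as the default version, use: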

curl -v -X PUT http://localhost:8081/models/noop/2.0/set-default

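Token authorization API

TorchServe now enforces token authorization by default. For more information, see the token authorization documentation.

This API generates a new key to replace either the management key or the inference key.

Management example:

curl "localhost:8081/token?type=management" -H "Authorization: Bearer {API Token}"

This replaces the current management key in the key_file with a new one and updates the expiration time.

Inference example:

curl "localhost:8081/token?type=inference" -H "Authorization: Bearer {API Token}"

This replaces the current inference key in the key_file with a new one and updates the expiration time.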
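
The same key-rotation request can be prepared from Python. This is a minimal sketch using only the standard library; build_token_request is a hypothetical helper and "example-token" is a placeholder, not a real key. The request is constructed but not sent, so no server is needed to run it.

```python
from urllib.request import Request


def build_token_request(base, key_type, api_token):
    """Hypothetical helper: build the GET /token request for one key type."""
    if key_type not in ("management", "inference"):
        raise ValueError("key_type must be 'management' or 'inference'")
    return Request(
        f"{base}/token?type={key_type}",
        headers={"Authorization": f"Bearer {api_token}"},
    )


token_req = build_token_request("http://localhost:8081", "management", "example-token")
print(token_req.get_method(), token_req.full_url)
print(token_req.get_header("Authorization"))
```

Passing the request to urllib.request.urlopen against a running server would perform the rotation, after which the previous key of that type should no longer be used.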
