TorchServe Metrics¶

Introduction¶

TorchServe metrics can be broadly classified into frontend metrics and backend metrics.

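Frontend Metrics:¶

- API request status metrics
- Inference request metrics
- System utilization metrics

Note: System utilization metrics are collected periodically (default: once every minute).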

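Backend Metrics:¶

- Default model metrics
- Custom model metrics

Note: TorchServe provides an API to collect custom model metrics.

The default frontend and backend metrics are listed in the Default Metrics section.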

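Metric Modes¶

Three metric modes are supported: log, prometheus, and legacy. The default mode is log. The metrics mode can be configured with the metrics_mode configuration option in config.properties or with the TS_METRICS_MODE environment variable. For details on configuration through environment variables, see the TorchServe configuration documentation.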

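Log Mode¶

In log mode, metrics are logged and can be aggregated by metric agents. By default, metrics are collected at the following locations in log mode:

- Frontend metrics - log_directory/ts_metrics.log
- Backend metrics - log_directory/model_metrics.log

The locations of the log files and metric files can be configured in the log4j2.xml file.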
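
As a minimal sketch of locating these files, the helper below builds the two default paths from a log directory and reads the last few lines of each. The function names and the example directory name logs are assumptions made for illustration; the real directory depends on the log4j2.xml configuration.

```python
from collections import deque
from pathlib import Path


def metric_log_paths(log_directory):
    """Return the default frontend and backend metric log paths."""
    base = Path(log_directory)
    return {
        "frontend": base / "ts_metrics.log",
        "backend": base / "model_metrics.log",
    }


def tail(path, max_lines=5):
    """Return up to max_lines trailing lines, or [] if the file is missing."""
    path = Path(path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=max_lines)]


if __name__ == "__main__":
    # "logs" is a hypothetical log directory used only for this example.
    for kind, path in metric_log_paths("logs").items():
        print(kind, path, tail(path))
```

Reading only the tail keeps the check cheap on long-running servers, where these files can grow large.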

Prometheus Mode¶

In prometheus mode, metrics are made available in Prometheus format via the metrics API endpoint.
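
The following sketch shows one way to read that endpoint from Python using only the standard library. The parser handles the common name{label="value"} value lines of the Prometheus text format and skips comment lines. The address http://127.0.0.1:8082/metrics is an assumption about a local deployment and is not stated in this document, and both function names are illustrative; substitute the metrics address configured for your server.

```python
import re
from urllib.request import urlopen

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_prometheus_text(text):
    """Parse Prometheus text output into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url="http://127.0.0.1:8082/metrics"):
    """Fetch and parse metrics; the default URL is an assumed local address."""
    with urlopen(url, timeout=5) as response:
        return parse_prometheus_text(response.read().decode("utf-8"))
```

Keeping the parser separate from the HTTP call makes it possible to exercise the parsing logic on saved output without a running server.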

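Legacy Mode¶

legacy mode enables backward compatibility with TorchServe releases <= 0.7.1, in which:

- ts_inference_requests_total, ts_inference_latency_microseconds, and ts_queue_latency_microseconds are available only via the metrics API endpoint in Prometheus format.
- Frontend metrics are logged to log_directory/ts_metrics.log
- Backend metrics are logged to log_directory/model_metrics.log

Note: To enable full backward compatibility with releases <= 0.7.1, use legacy metrics mode with Model Metrics Auto-Detection enabled.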

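Getting Started¶

Using the custom metrics example as a reference:

Create a custom metrics configuration file, or use the default metrics.yaml file.
Set the metrics_config argument in the config.properties being used to the path of that yaml file: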
```
metrics_config=/<path>/<to>/<metrics>/<config>/<file>/metrics.yaml
```
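If the metrics_config argument is not specified, the default metrics.yaml configuration file is used.
Set the desired metrics mode with the metrics_mode configuration option in config.properties or with the TS_METRICS_MODE environment variable. If it is not set, log mode is used by default.
Use the custom metrics API to emit custom metrics (if any) in the handler.
Run torchserve and specify the path of the config.properties file after the --ts-config flag: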

torchserve --ncs --start --model-store model_store --models my_model=model.mar --ts-config /<path>/<to>/<config>/<file>/config.properties
Collect metrics according to the chosen mode:

If using log mode, check:
- Frontend metrics - log_directory/ts_metrics.log
- Backend metrics - log_directory/model_metrics.log
Otherwise, if using prometheus mode, use the metrics API endpoint.
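
The configuration steps above can be sketched in Python as follows. This is an illustrative helper, not part of TorchServe: write_ts_config and torchserve_command are hypothetical names, and the generated file contains only the two metrics options discussed here, so a real config.properties will usually hold more settings.

```python
from pathlib import Path

METRICS_MODES = ("log", "prometheus", "legacy")


def write_ts_config(config_path, metrics_yaml_path, metrics_mode="log"):
    """Write a config.properties containing only the two metrics options."""
    if metrics_mode not in METRICS_MODES:
        raise ValueError(f"unsupported metrics_mode: {metrics_mode!r}")
    lines = [
        f"metrics_config={metrics_yaml_path}",
        f"metrics_mode={metrics_mode}",
    ]
    Path(config_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def torchserve_command(config_path, model_store="model_store",
                       models="my_model=model.mar"):
    """Build the start command from the steps above as an argument list."""
    return [
        "torchserve", "--ncs", "--start",
        "--model-store", model_store,
        "--models", models,
        "--ts-config", str(config_path),
    ]
```

The returned list can be passed to subprocess.run without shell quoting concerns. The model store and model archive names are the placeholders from the command above.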

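Metrics Configuration¶

TorchServe defines its metrics configuration in a yaml file, covering both frontend metrics (i.e. ts_metrics) and backend metrics (i.e. model_metrics). When TorchServe starts, the metric definitions are loaded, and the corresponding metrics are made available either as logs or via the metrics API endpoint, depending on the configured metrics_mode.

Dynamic updates to the metrics configuration file are not supported. For changes to the metrics configuration file to take effect, TorchServe must be restarted.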

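Model Metrics Auto-Detection¶

By default, metrics that are not defined in the metrics configuration file are neither logged in the metrics log files nor made available via the metrics API endpoint. Backend model metrics can be auto-detected by setting model_metrics_auto_detect to true in config.properties or by using the TS_MODEL_METRICS_AUTO_DETECT environment variable. Model metrics auto-detection is disabled by default.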

Warning: Auto-detection of backend metrics has a performance impact in the form of latency overhead, typically at model load and at the first inference for a given model. This cold-start behavior arises because model load and first inference are when the backend typically emits new metrics, which the frontend must then detect and register. Subsequent inferences can also see a performance impact if new metrics are updated for the first time. For use cases where multiple models are loaded and unloaded often, the latency overhead can be mitigated by specifying known metrics in the metrics configuration file ahead of time.

Metrics Configuration Format¶

The metrics configuration yaml file is formatted using Prometheus metric type terminology:

dimensions: # dimension aliases
  - &model_name "ModelName"
  - &level "Level"

ts_metrics:  # frontend metrics
  counter:  # metric type
    - name: NameOfCounterMetric  # name of metric
      unit: ms  # unit of metric
      dimensions: [*model_name, *level]  # dimension names of metric (referenced from the above dimensions dict)
  gauge:
    - name: NameOfGaugeMetric
      unit: ms
      dimensions: [*model_name, *level]
  histogram:
    - name: NameOfHistogramMetric
      unit: ms
      dimensions: [*model_name, *level]

model_metrics:  # backend metrics
  counter:  # metric type
    - name: InferenceTimeInMS  # name of metric
      unit: ms  # unit of metric
      dimensions: [*model_name, *level]  # dimension names of metric (referenced from the above dimensions dict)
    - name: NumberOfMetrics
      unit: count
      dimensions: [*model_name]
  gauge:
    - name: GaugeModelMetricNameExample
      unit: ms
      dimensions: [*model_name, *level]
  histogram:
    - name: HistogramModelMetricNameExample
      unit: ms
      dimensions: [*model_name, *level]

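Note: When adding custom model_metrics to the metrics configuration file, make sure the ModelName and Level dimension names come at the end of the dimensions list, because the following custom metrics APIs include them by default: add_metric, add_counter, add_time, add_size, and add_percent.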
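
To illustrate the ordering rule in the note, the sketch below builds a dimensions list with any custom names first and ModelName and Level last. The helper names and the ImageFormat dimension are hypothetical; the dictionary simply mirrors the shape of one model_metrics entry in the yaml above.

```python
DEFAULT_DIMENSIONS = ["ModelName", "Level"]


def with_default_dimensions(custom_dimensions):
    """Place custom dimension names first, then ModelName and Level."""
    custom = [d for d in custom_dimensions if d not in DEFAULT_DIMENSIONS]
    return custom + DEFAULT_DIMENSIONS


def model_metric_entry(name, unit, custom_dimensions=()):
    """Build one model_metrics entry shaped like the yaml items above."""
    return {
        "name": name,
        "unit": unit,
        "dimensions": with_default_dimensions(custom_dimensions),
    }


if __name__ == "__main__":
    # "ImageFormat" is a made-up custom dimension for illustration.
    print(model_metric_entry("SizeOfImage", "MB", ["ImageFormat"]))
```

This ordering applies to metrics emitted through the APIs named in the note. A metric registered without default dimensions does not need the two trailing names.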

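Default Metrics Configuration¶

The default metrics are provided in the default metrics configuration file, metrics.yaml.

Metric Types¶

TorchServe metrics use metric types that are consistent with Prometheus metric types.

The metric type is an attribute of the Metric object. When adding custom metrics, users are restricted to the existing metric types.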

class MetricTypes(enum.Enum):
    COUNTER = "counter"
    GAUGE = "gauge"
    HISTOGRAM = "histogram"
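
Since the enum values match the section keys used in the metrics configuration file, a key can be mapped to its member by value. The sketch below redefines the enum locally so that it runs without TorchServe installed; metric_type_from_key is a hypothetical helper, not a TorchServe function.

```python
import enum


class MetricTypes(enum.Enum):
    # Mirrors the enum shown above so this sketch is self-contained.
    COUNTER = "counter"
    GAUGE = "gauge"
    HISTOGRAM = "histogram"


def metric_type_from_key(key):
    """Map a yaml section key such as "gauge" to its MetricTypes member."""
    try:
        return MetricTypes(key.strip().lower())
    except ValueError:
        allowed = ", ".join(member.value for member in MetricTypes)
        raise ValueError(
            f"unknown metric type {key!r}; expected one of: {allowed}"
        ) from None
```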

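Default Metrics¶

Default Frontend Metrics¶

Metric Name	Type	Unit	Dimensions	Semantics
Requests2XX	counter	Count	Level, Hostname	Total number of requests with a response in the 200-300 status code range
Requests4XX	counter	Count	Level, Hostname	Total number of requests with a response in the 400-500 status code range
Requests5XX	counter	Count	Level, Hostname	Total number of requests with a response status code above 500
ts_inference_requests_total	counter	Count	model_name, model_version, hostname	Total number of inference requests received
ts_inference_latency_microseconds	counter	Microseconds	model_name, model_version, hostname	Total inference latency in microseconds
ts_queue_latency_microseconds	counter	Microseconds	model_name, model_version, hostname	Total queue latency in microseconds
QueueTime	gauge	Milliseconds	Level, Hostname	Time spent by a job in the request queue, in milliseconds
WorkerThreadTime	gauge	Milliseconds	Level, Hostname	Time spent in the worker thread, excluding backend response time, in milliseconds
WorkerLoadTime	gauge	Milliseconds	WorkerName, Level, Hostname	Time taken by a worker to load the model, in milliseconds
CPUUtilization	gauge	Percent	Level, Hostname	CPU utilization on the host
MemoryUsed	gauge	Megabytes	Level, Hostname	Memory used on the host
MemoryAvailable	gauge	Megabytes	Level, Hostname	Memory available on the host
MemoryUtilization	gauge	Percent	Level, Hostname	Memory utilization on the host
DiskUsage	gauge	Gigabytes	Level, Hostname	Disk used on the host
DiskUtilization	gauge	Percent	Level, Hostname	Percentage of disk used on the host
DiskAvailable	gauge	Gigabytes	Level, Hostname	Disk available on the host
GPUMemoryUtilization	gauge	Percent	Level, DeviceId, Hostname	GPU memory utilization on the host, per DeviceId
GPUMemoryUsed	gauge	Megabytes	Level, DeviceId, Hostname	GPU memory used on the host, per DeviceId
GPUUtilization	gauge	Percent	Level, DeviceId, Hostname	GPU utilization on the host, per DeviceId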
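
The three ts_* metrics in this table are cumulative counters, so none of them is a per-request figure on its own. As a small worked example, and assuming the latency and request counters cover the same set of requests (an assumption the table does not state), a mean latency can be derived by division. The function names are illustrative.

```python
def mean_latency_ms(total_latency_microseconds, total_requests):
    """Average latency per request in milliseconds, or None with no requests."""
    if total_requests <= 0:
        return None
    return total_latency_microseconds / total_requests / 1000.0


def mean_latency_ms_between(previous, current):
    """Mean latency over the interval between two (latency_us, requests) samples."""
    latency_delta = current[0] - previous[0]
    request_delta = current[1] - previous[1]
    return mean_latency_ms(latency_delta, request_delta)
```

The interval form is usually the more informative one, because a mean over the whole server lifetime hides recent changes.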

Default Backend Metrics¶

Metric Name	Type	Unit	Dimensions	Semantics
HandlerTime	gauge	ms	ModelName, Level, Hostname	Time spent in the backend handler
PredictionTime	gauge	ms	ModelName, Level, Hostname	Backend prediction time

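Custom Metrics API¶

TorchServe enables handlers to emit custom metrics, which are then made available according to the configured metrics_mode.

See the custom handler example for a demonstration of the custom metrics API.

The custom handler code is given a context for the current request, and that context includes a metrics object: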

# Access metrics object in context as follows
def initialize(self, context):
    metrics = context.metrics

Note: The custom metrics API is not to be confused with the metrics API endpoint, which is an HTTP API used to fetch metrics in Prometheus format.

Default Dimensions¶

Metrics have the following default dimensions unless they are already specified:

ModelName: {name_of_model}
Level: Model

Create Dimension Object(s)¶

Dimensions for metrics can be defined as objects:

from ts.metrics.dimension import Dimension

# Dimensions are name-value pairs; the arguments below are placeholders
dim1 = Dimension(name, value)
dim2 = Dimension(some_name, some_value)
# ... further dimensions follow the same pattern ...
dimN = Dimension(name_n, value_n)

Add a Generic Metric¶

Generic metrics default to the COUNTER metric type.

Function API for adding a generic metric without default dimensions¶

    def add_metric_to_cache(
        self,
        metric_name: str,
        unit: str,
        dimension_names: list = [],
        metric_type: MetricTypes = MetricTypes.COUNTER,
    ) -> CachingMetric:
        """
        Create a new metric and add into cache. Override existing metric if already present.

        Parameters
        ----------
        metric_name str
            Name of metric
        unit str
            unit can be one of ms, percent, count, MB, GB or a generic string
        dimension_names list
            list of dimension name strings for the metric
        metric_type MetricTypes
            Type of metric Counter, Gauge, Histogram
        Returns
        -------
        newly created Metrics object
        """

CachingMetric APIs for updating a metric

    def add_or_update(
        self,
        value: int or float,
        dimension_values: list = [],
        request_id: str = "",
    ):
        """
        Update metric value, request id and dimensions

        Parameters
        ----------
        value : int, float
            metric to be updated
        dimension_values : list
            list of dimension value strings
        request_id : str
            request id to be associated with the metric
        """

    def update(
        self,
        value: int or float,
        request_id: str = "",
        dimensions: list = [],
    ):
        """
        BACKWARDS COMPATIBILITY: Update metric value

        Parameters
        ----------
        value : int, float
            metric to be updated
        request_id : str
            request id to be associated with the metric
        dimensions : list
            list of Dimension objects
        """

# Example usage
metrics = context.metrics
# Add metric
distance_metric = metrics.add_metric_to_cache(metric_name='DistanceInKM', unit='km', dimension_names=[...])
# Update metric
distance_metric.add_or_update(value=distance, dimension_values=[...], request_id=context.get_request_id())
# OR
distance_metric.update(value=distance, request_id=context.get_request_id(), dimensions=[...])

Note: Calling add_metric_to_cache does not emit the metric; add_or_update must be called on the metric object, as shown above.
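
Because add_or_update takes dimension values as a plain list, the values presumably need to line up positionally with the dimension_names given to add_metric_to_cache. The helper below is a small sketch for building that list from a name-to-value mapping. ordered_dimension_values and the Route dimension are hypothetical and not part of the TorchServe API.

```python
def ordered_dimension_values(dimension_names, values_by_name):
    """Return values as strings, ordered to match dimension_names."""
    missing = [name for name in dimension_names if name not in values_by_name]
    if missing:
        raise KeyError(f"missing dimension values for: {missing}")
    return [str(values_by_name[name]) for name in dimension_names]


if __name__ == "__main__":
    # Hypothetical dimension names and values, for illustration only.
    names = ["Route", "ModelName", "Level"]
    values = {"Level": "Model", "Route": "north", "ModelName": "my_model"}
    print(ordered_dimension_values(names, values))
```

Inside a handler, the resulting list would be passed as dimension_values to add_or_update on the metric object returned by add_metric_to_cache.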

Function API for adding a generic metric with default dimensions¶

    def add_metric(
        self,
        name: str,
        value: int or float,
        unit: str,
        idx: str = None,
        dimensions: list = [],
        metric_type: MetricTypes = MetricTypes.COUNTER,
    ):
        """
        Add a generic metric
            Default metric type is counter

        Parameters
        ----------
        name : str
            metric name
        value: int or float
            value of the metric
        unit: str
            unit of metric
        idx: str
            request id to be associated with the metric
        dimensions: list
            list of Dimension objects for the metric
        metric_type MetricTypes
            Type of metric Counter, Gauge, Histogram
        """

# Example usage
metrics = context.metrics
metric = metrics.add_metric(name='DistanceInKM', value=10, unit='km', dimensions=[...])

Add a Time-Based Metric¶

Time-based metrics default to the GAUGE metric type.

    def add_time(self, name: str, value: int or float, idx=None, unit: str = 'ms', dimensions: list = None,
                 metric_type: MetricTypes = MetricTypes.GAUGE):
        """
        Add a time based metric like latency, default unit is 'ms'
            Default metric type is gauge

        Parameters
        ----------
        name : str
            metric name
        value: int
            value of metric
        idx: int
            request_id index in batch
        unit: str
            unit of metric,  default here is ms, s is also accepted
        dimensions: list
            list of Dimension objects for the metric
        metric_type: MetricTypes
           type for defining different operations, defaulted to gauge metric type for Time metrics
        """

Note: The default unit is ms.

Supported units: ['ms', 's']

# Example usage
metrics = context.metrics
metrics.add_time(name='InferenceTime', value=end_time-start_time, idx=None, unit='ms', dimensions=[...])
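
One way to produce the value passed to add_time is a small context manager that measures a block of code. This is a sketch rather than a TorchServe utility: timed is a hypothetical helper, and RecordingMetrics is a stand-in for context.metrics so that the example runs without TorchServe. Note that time.perf_counter returns seconds, so the elapsed time is multiplied by 1000 before being reported with unit ms.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(metrics, name, idx=None):
    """Measure the enclosed block and report it via metrics.add_time in ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        metrics.add_time(name=name, value=elapsed_ms, idx=idx, unit="ms")


class RecordingMetrics:
    """Stand-in for context.metrics so the sketch runs without TorchServe."""

    def __init__(self):
        self.calls = []

    def add_time(self, name, value, idx=None, unit="ms", dimensions=None):
        self.calls.append((name, value, idx, unit))


if __name__ == "__main__":
    metrics = RecordingMetrics()
    with timed(metrics, "InferenceTime"):
        sum(range(100_000))  # placeholder for the real inference call
    print(metrics.calls[0][0], metrics.calls[0][3])
```

Reporting in the finally clause means the duration is still recorded when the timed block raises, which is often when latency data is most useful.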

Add a Size-Based Metric¶

Size-based metrics default to the GAUGE metric type.

    def add_size(self, name: str, value: int or float, idx=None, unit: str = 'MB', dimensions: list = None,
                 metric_type: MetricTypes = MetricTypes.GAUGE):
        """
        Add a size based metric
            Default metric type is gauge

        Parameters
        ----------
        name : str
            metric name
        value: int, float
            value of metric
        idx: int
            request_id index in batch
        unit: str
            unit of metric, default here is 'MB', 'kB', 'GB' also supported
        dimensions: list
            list of Dimension objects for the metric
        metric_type: MetricTypes
           type for defining different operations, defaulted to gauge metric type for Size metrics
        """

Note: The default unit is MB.

Supported units: ['MB', 'kB', 'GB', 'B']

# Example usage
metrics = context.metrics
metrics.add_size(name='SizeOfImage', value=img_size, idx=None, unit='MB', dimensions=[...])
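
A value such as img_size has to be expressed in the unit passed to add_size. The sketch below converts a byte count into one of the supported units. size_in_unit is a hypothetical helper, and the use of binary multiples (1 kB = 1024 B) is an assumption, since this document does not say which convention TorchServe expects.

```python
# Assumed binary multiples; switch to powers of 1000 if decimal units are wanted.
_BYTES_PER_UNIT = {"B": 1, "kB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def size_in_unit(num_bytes, unit="MB"):
    """Convert a byte count to one of the supported size units."""
    if unit not in _BYTES_PER_UNIT:
        raise ValueError(f"unsupported unit: {unit!r}")
    return num_bytes / _BYTES_PER_UNIT[unit]


if __name__ == "__main__":
    payload = bytes(3 * 1024 * 1024)  # placeholder for real request data
    print(size_in_unit(len(payload), "MB"))
```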

Add a Percentage-Based Metric¶

Percentage-based metrics default to the GAUGE metric type.

    def add_percent(self, name: str, value: int or float, idx=None, dimensions: list = None,
                    metric_type: MetricTypes = MetricTypes.GAUGE):
        """
        Add a percentage based metric
            Default metric type is gauge

        Parameters
        ----------
        name : str
            metric name
        value: int, float
            value of metric
        idx: int
            request_id index in batch
        dimensions: list
            list of Dimension objects for the metric
        metric_type: MetricTypes
           type for defining different operations, defaulted to gauge metric type for Percent metrics
        """

Inferred unit: percent

# Example usage
metrics = context.metrics
metrics.add_percent(name='MemoryUtilization', value=utilization_percent, idx=None, dimensions=[...])

Add a Counter-Based Metric¶

Counter-based metrics default to the COUNTER metric type.

    def add_counter(self, name: str, value: int or float, idx=None, dimensions: list = None):
        """
        Add a counter metric or increment an existing counter metric
            Default metric type is counter
        Parameters
        ----------
        name : str
            metric name
        value: int or float
            value of metric
        idx: int
            request_id index in batch
        dimensions: list
            list of Dimension objects for the metric
        """

# Example usage
metrics = context.metrics
metrics.add_counter(name='CallCount', value=call_count, idx=None, dimensions=[...])

Inferred unit: count

Get a Metric¶

Users can get a metric from the cache. A CachingMetric object is returned, so the user can call its methods to update the metric, i.e. CachingMetric.add_or_update(value, dimension_values) or CachingMetric.update(value, dimensions).

    def get_metric(
        self,
        metric_name: str,
        metric_type: MetricTypes = MetricTypes.COUNTER,
    ) -> CachingMetric:
        """
        Get a metric from the cache

        Parameters
        ----------
        metric_name str
            Name of metric

        metric_type MetricTypes
            Type of metric Counter, Gauge, Histogram

        Returns
        -------
        Metrics object or MetricsCacheKeyError if not found
        """

# Example usage
metrics = context.metrics
# Get metric
gauge_metric = metrics.get_metric(metric_name = "GaugeMetricName", metric_type = MetricTypes.GAUGE)
# Update metric
gauge_metric.add_or_update(value=gauge_metric_value, dimension_values=[...], request_id=context.get_request_id())
# OR
gauge_metric.update(value=gauge_metric_value, request_id=context.get_request_id(), dimensions=[...])
