Inference

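TorchRec provides easy-to-use APIs for transforming an authored TorchRec model into an optimized inference model for distributed inference, via eager module swaps.

This transforms TorchRec modules in the model, such as EmbeddingBagCollection, into quantized, sharded versions that can be compiled using torch.fx and TorchScript for inference in a C++ environment.

The intended use is to call quantize_inference_model on the model, followed by shard_quant_model.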

torchrec.inference.modules.quantize_inference_model(model: Module, quantization_mapping: Optional[Dict[str, Type[Module]]] = None, per_table_weight_dtype: Optional[Dict[str, dtype]] = None, fp_weight_dtype: dtype = torch.int8, quantization_dtype: dtype = torch.int8, output_dtype: dtype = torch.float32) → Module

Quantize the model, swapping TorchRec modules for their quantized counterparts (e.g. EmbeddingBagCollection -> QuantEmbeddingBagCollection).
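The idea of an eager module swap driven by a type mapping can be illustrated with a toy sketch. It uses plain Python classes rather than torch.nn.Module, and the classes, the from_float helper, and the use of the bare class name as the mapping key are all assumptions made for illustration, not TorchRec's implementation.

```python
from typing import Dict, Type


class Module:
    """Toy stand-in for torch.nn.Module: holds named child modules."""

    def __init__(self, **children: "Module") -> None:
        self.children: Dict[str, Module] = dict(children)


class EmbeddingBagCollection(Module):
    """Toy float module (hypothetical, not the TorchRec class)."""


class QuantEmbeddingBagCollection(Module):
    """Toy quantized counterpart built from a float module."""

    @classmethod
    def from_float(cls, float_module: Module) -> "QuantEmbeddingBagCollection":
        return cls(**float_module.children)


def swap_modules(root: Module, mapping: Dict[str, Type[Module]]) -> Module:
    """Recursively replace children whose type name appears in mapping."""
    for name, child in list(root.children.items()):
        target = mapping.get(type(child).__name__)
        if target is not None:
            # Matched: replace the child in place with its counterpart.
            root.children[name] = target.from_float(child)
        else:
            # Not matched: keep the child and look inside it.
            swap_modules(child, mapping)
    return root


model = Module(sparse=Module(ebc=EmbeddingBagCollection()), dense=Module())
swap_modules(model, {"EmbeddingBagCollection": QuantEmbeddingBagCollection})
print(type(model.children["sparse"].children["ebc"]).__name__)
```

Only modules whose type appears in the mapping are replaced; everything else in the tree is left untouched, which is why the rest of the model needs no changes.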

Parameters

model (torch.nn.Module) – the model to quantize
quantization_mapping (Optional[Dict[str, Type[torch.nn.Module]]]) – a mapping from original module type to quantized module type. If not provided, the default mapping is used: (EmbeddingBagCollection -> QuantEmbeddingBagCollection, EmbeddingCollection -> QuantEmbeddingCollection).
per_table_weight_dtype (Optional[Dict[str, torch.dtype]]) – a mapping from table name to weight dtype. If not provided, the default quantization dtype (int8) is used.
fp_weight_dtype (torch.dtype) – the desired quantized dtype for the feature processor weights in FeatureProcessedEmbeddingBagCollection, if used. Default is int8.

Returns

the quantized model

Return type:

torch.nn.Module

Example:

ebc = EmbeddingBagCollection(tables=eb_configs, device=torch.device("meta"))

module = DLRMPredictModule(
    embedding_bag_collection=ebc,
    dense_in_features=self.model_config.dense_in_features,
    dense_arch_layer_sizes=self.model_config.dense_arch_layer_sizes,
    over_arch_layer_sizes=self.model_config.over_arch_layer_sizes,
    id_list_features_keys=self.model_config.id_list_features_keys,
    dense_device=device,
)

quant_model = quantize_inference_model(module)
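To give a feel for what storing embedding weights in an 8-bit dtype involves, the following is a minimal sketch of one common scheme: each row is mapped to 8-bit codes with its own scale and bias. The function names and the choice of unsigned codes are assumptions of this sketch; the kernels TorchRec actually uses may differ in layout and rounding.

```python
from typing import List, Tuple


def quantize_row(row: List[float]) -> Tuple[List[int], float, float]:
    """Map a float row to 8-bit codes (0..255) plus a per-row scale and bias."""
    lo, hi = min(row), max(row)
    # A constant row has zero range; use scale 1.0 to avoid dividing by zero.
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in row]
    return codes, scale, lo


def dequantize_row(codes: List[int], scale: float, bias: float) -> List[float]:
    """Recover approximate floats from the codes."""
    return [c * scale + bias for c in codes]


row = [-1.0, 0.0, 0.5, 1.0]
codes, scale, bias = quantize_row(row)
restored = dequantize_row(codes, scale, bias)
# Rounding to the nearest code bounds the error by about half a step.
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(codes, max_error)
```

Each value is stored in one byte instead of four, at the cost of a per-element error of roughly half the row's step size.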

torchrec.inference.modules.shard_quant_model(model: Module, world_size: int = 1, compute_device: str = 'cuda', sharding_device: str = 'meta', sharders: Optional[List[ModuleSharder[Module]]] = None, device_memory_size: Optional[int] = None, constraints: Optional[Dict[str, ParameterConstraints]] = None, ddr_cap: Optional[int] = None) → Tuple[Module, ShardingPlan]

Shard a quantized TorchRec model. This is used to generate the most optimal model for inference and is necessary for distributed inference.

Parameters

model (torch.nn.Module) – the quantized model to shard
world_size (int) – the number of devices to shard the model across, default is 1
compute_device (str) – the device to run the model on, default is "cuda"
sharding_device (str) – the device to run the sharding on, default is "meta"
sharders (Optional[List[ModuleSharder[torch.nn.Module]]]) – the sharders to use for sharding the quantized model, default is QuantEmbeddingBagCollectionSharder, QuantEmbeddingCollectionSharder, QuantFeatureProcessedEmbeddingBagCollectionSharder
device_memory_size (Optional[int]) – the memory limit for the cuda device, default is None
constraints (Optional[Dict[str, ParameterConstraints]]) – the constraints to use for sharding, default is None, in which case the default constraints are applied and QuantEmbeddingBagCollection is sharded TableWise

Returns

the sharded model and the sharding plan

Return type:

Tuple[torch.nn.Module, ShardingPlan]

Example:

ebc = EmbeddingBagCollection(tables=eb_configs, device=torch.device("meta"))

module = DLRMPredictModule(
    embedding_bag_collection=ebc,
    dense_in_features=self.model_config.dense_in_features,
    dense_arch_layer_sizes=self.model_config.dense_arch_layer_sizes,
    over_arch_layer_sizes=self.model_config.over_arch_layer_sizes,
    id_list_features_keys=self.model_config.id_list_features_keys,
    dense_device=device,
)

quant_model = quantize_inference_model(module)
sharded_model, _ = shard_quant_model(quant_model)
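Table-wise sharding, the default mentioned for QuantEmbeddingBagCollection, places each embedding table whole on a single device. The sketch below illustrates that idea with a simple greedy balancer; the function, the table names, and the sizes are hypothetical, and TorchRec's planner makes its own placement decisions rather than following this heuristic.

```python
from typing import Dict, List


def table_wise_plan(table_sizes: Dict[str, int], world_size: int) -> Dict[str, int]:
    """Assign each whole table to one rank: largest first, least-loaded rank."""
    load: List[int] = [0] * world_size
    plan: Dict[str, int] = {}
    # Sort by size descending, then by name so that ties are deterministic.
    for name, size in sorted(table_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        rank = load.index(min(load))
        plan[name] = rank
        load[rank] += size
    return plan


# Hypothetical table sizes in arbitrary memory units.
tables = {"t_user": 400, "t_item": 300, "t_ad": 200, "t_geo": 100}
plan = table_wise_plan(tables, world_size=2)
print(plan)
```

No table is split across devices, so a lookup for a given table is served entirely by the rank that owns it.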
