Tensor Parallelism - torch.distributed.tensor.parallel

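Tensor Parallelism (TP) is built on top of PyTorch DistributedTensor (DTensor) and provides different parallelism styles: Colwise, Rowwise, and Sequence Parallelism.

Warning

Tensor Parallelism APIs are experimental and subject to change.

The entry point for parallelizing your nn.Module with Tensor Parallelism is: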

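torch.distributed.tensor.parallel.parallelize_module(module, device_mesh, parallelize_plan)

Apply Tensor Parallelism in PyTorch by parallelizing a module or its submodules according to a user-specified plan.

The module or its submodules are parallelized based on parallelize_plan. The plan contains ParallelStyle objects, which indicate how the user wants each module or submodule to be parallelized.

Users can also specify a different parallel style for each module fully qualified name (FQN).

Note that parallelize_module only accepts a 1-D DeviceMesh. If you have a 2-D or N-D DeviceMesh, first slice it into a 1-D sub-DeviceMesh and then pass that to this API (i.e. device_mesh["tp"]).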
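The slicing of a 2-D mesh into a 1-D tensor-parallel sub-mesh can be pictured with plain Python lists. This is only an illustrative sketch: build_mesh and tp_submesh are hypothetical helpers, and the row-major rank layout with dimensions ordered as (data-parallel, tensor-parallel) is an assumption made for the example, not something stated above.

```python
def build_mesh(shape):
    """Lay out ranks 0..N-1 row-major in a (dp, tp) grid (assumed layout)."""
    dp, tp = shape
    return [[r * tp + c for c in range(tp)] for r in range(dp)]


def tp_submesh(mesh, rank):
    """Return the 1-D row of ranks that forms the TP group containing `rank`."""
    for row in mesh:
        if rank in row:
            return row
    raise ValueError(f"rank {rank} is not in the mesh")


mesh_2d = build_mesh((2, 4))   # 8 ranks: 2 data-parallel groups x 4 TP ranks
print(mesh_2d)                 # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(tp_submesh(mesh_2d, 5))  # [4, 5, 6, 7]
```

With the real API, the corresponding steps would presumably be to create the mesh with init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp")) and pass mesh["tp"] to parallelize_module; the dimension names "dp" and "tp" are chosen here for illustration.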

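Parameters

module (nn.Module) – The module to be parallelized.
device_mesh (DeviceMesh) – Object that describes the mesh topology of devices for the DTensor.
parallelize_plan (Union[ParallelStyle, Dict[str, ParallelStyle]]) – The plan used to parallelize the module. It can be either a single ParallelStyle object, which describes how to prepare the input/output for Tensor Parallelism, or a dict mapping module FQNs to their corresponding ParallelStyle objects.

Returns

The parallelized nn.Module object.

Return type

Module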

Example::

>>> from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel, RowwiseParallel
>>> from torch.distributed.device_mesh import init_device_mesh
>>>
>>> # Define the module.
>>> m = Model(...)
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>> m = parallelize_module(m, tp_mesh, {"w1": ColwiseParallel(), "w2": RowwiseParallel()})
>>>

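Note

For complex module architectures such as Attention and MLP layers, we recommend composing different ParallelStyles together (i.e. ColwiseParallel and RowwiseParallel) and passing them as the parallelize_plan to achieve the desired sharded computation.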
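To see why a column-wise layer pairs naturally with a row-wise layer, the arithmetic can be checked without any distributed runtime. The sketch below is a plain-Python illustration under stated assumptions: a two-layer MLP with a ReLU in between, two simulated ranks, and a simple sum standing in for the all-reduce. It is not how the library implements these styles; all helper names are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(v, 0.0) for v in row] for row in m]


def split_cols(w, n):
    """Split a matrix into n equal column blocks (column-wise sharding)."""
    step = len(w[0]) // n
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(n)]


def split_rows(w, n):
    """Split a matrix into n equal row blocks (row-wise sharding)."""
    step = len(w) // n
    return [w[i * step:(i + 1) * step] for i in range(n)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


x = [[1.0, -2.0], [0.5, 3.0]]                            # replicated input (2, 2)
w1 = [[1.0, 0.0, -1.0, 2.0], [0.5, 1.0, 1.0, -1.0]]      # first layer (2, 4)
w2 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0], [-2.0, 0.5]]  # second layer (4, 2)

reference = matmul(relu(matmul(x, w1)), w2)

tp = 2  # number of simulated tensor-parallel ranks
partials = []
for w1_shard, w2_shard in zip(split_cols(w1, tp), split_rows(w2, tp)):
    hidden_local = relu(matmul(x, w1_shard))         # output sharded on last dim
    partials.append(matmul(hidden_local, w2_shard))  # per-rank partial result

combined = partials[0]
for p in partials[1:]:
    combined = add(combined, p)  # stands in for the all-reduce across ranks

print(combined)  # [[-8.0, 2.0], [9.5, 3.5]]
assert combined == reference
```

Because the hidden activation stays sharded on its last dimension between the two layers, only one reduction is needed at the end, which is the reason for composing the two styles in one plan.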

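Tensor Parallelism supports the following parallel styles:

class torch.distributed.tensor.parallel.ColwiseParallel(*, input_layouts=None, output_layouts=None, use_local_output=True)

Partition a compatible nn.Module in a column-wise fashion. Currently supports nn.Linear and nn.Embedding. Users can compose it with RowwiseParallel to shard more complicated modules (i.e. MLP, Attention).

Keyword Arguments

input_layouts (Placement, optional) – The DTensor layout of the input tensor for the nn.Module, used to annotate the input tensor so that it becomes a DTensor. If not specified, the input tensor is assumed to be replicated.
output_layouts (Placement, optional) – The DTensor layout of the output of the nn.Module, used to ensure the output of the nn.Module has the layout the user wants. If not specified, the output tensor is sharded on the last dimension.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module output. Default: True.

Returns

A ParallelStyle object that represents column-wise sharding of the nn.Module.

Example::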

>>> from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel
>>> from torch.distributed.device_mesh import init_device_mesh
>>> ...
>>> m = Model(...)  # m is a nn.Module that contains a "w1" nn.Linear submodule
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>>
>>> # By default, the input of the "w1" Linear will be converted to Replicated DTensor
>>> # and the output of "w1" will return :class:`torch.Tensor` that shards on the last dim.
>>>
>>> sharded_mod = parallelize_module(m, tp_mesh, {"w1": ColwiseParallel()})
>>> ...

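Note

By default, the ColwiseParallel output is sharded on the last dimension if output_layouts is not specified. If there are operators that require a specific tensor shape (i.e. before the paired RowwiseParallel), keep in mind that when the output is sharded, the operator may need to be adjusted to the sharded size.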
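A common case of "adjusting to the sharded size" is a reshape into attention heads after a column-wise projection. The following sketch only does shape arithmetic; the sizes, the helper local_last_dim, and the assumption of an evenly divisible last dimension are all illustrative.

```python
def local_last_dim(global_shape, tp_size):
    """Shape of one rank's local shard when the last dim is split evenly."""
    *leading, last = global_shape
    if last % tp_size != 0:
        raise ValueError("this sketch only covers evenly divisible sizes")
    return (*leading, last // tp_size)


batch, seq, n_heads, head_dim = 2, 16, 8, 64
tp_size = 4

global_out = (batch, seq, n_heads * head_dim)
local_out = local_last_dim(global_out, tp_size)
print(local_out)  # (2, 16, 128)

# A reshape into heads must use the per-rank head count, not the global one.
local_heads = n_heads // tp_size
assert local_out[-1] == local_heads * head_dim
print((batch, seq, local_heads, head_dim))  # (2, 16, 2, 64)
```

In a model, this typically means a view such as (batch, seq, n_heads, head_dim) would need n_heads replaced by n_heads // tp_size when it operates on the local output.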

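class torch.distributed.tensor.parallel.RowwiseParallel(*, input_layouts=None, output_layouts=None, use_local_output=True)

Partition a compatible nn.Module in a row-wise fashion. Currently supports nn.Linear and nn.Embedding. Users can compose it with ColwiseParallel to shard more complicated modules (i.e. MLP, Attention).

Keyword Arguments

input_layouts (Placement, optional) – The DTensor layout of the input tensor for the nn.Module, used to annotate the input tensor so that it becomes a DTensor. If not specified, the input tensor is assumed to be sharded on the last dimension.
output_layouts (Placement, optional) – The DTensor layout of the output of the nn.Module, used to ensure the output of the nn.Module has the layout the user wants. If not specified, the output tensor is replicated.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module output. Default: True.

Returns

A ParallelStyle object that represents row-wise sharding of the nn.Module.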

Example::

>>> from torch.distributed.tensor.parallel import parallelize_module, RowwiseParallel
>>> from torch.distributed.device_mesh import init_device_mesh
>>> ...
>>> m = Model(...)  # m is a nn.Module that contains a "w2" nn.Linear submodule
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>>
>>> # By default, the input of the "w2" Linear will be converted to DTensor that shards on the last dim
>>> # and the output of "w2" will return a replicated :class:`torch.Tensor`.
>>>
>>> sharded_mod = parallelize_module(m, tp_mesh, {"w2": RowwiseParallel()})
>>> ...

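class torch.distributed.tensor.parallel.SequenceParallel(*, sequence_dim=1, use_local_output=False)

SequenceParallel replicates the parameters of a compatible nn.Module and runs the sharded computation with the input sharded on the sequence dimension. It currently supports nn.LayerNorm, nn.Dropout, and the RMSNorm Python implementation.

This style implements the operation described in the paper Reducing Activation Recomputation in Large Transformer Models.

If the input passed to this nn.Module is a torch.Tensor, it is assumed to be already sharded on the sequence dimension and is converted to a DTensor sharded on the sequence dimension. If the input passed to this nn.Module is already a DTensor but is not sharded on the sequence dimension, the input is redistributed so that it is sharded on the sequence dimension.

The output of the nn.Module will be sharded on the sequence dimension.

Keyword Arguments

sequence_dim (int, optional) – The sequence dimension of the input tensor for the nn.Module, used to annotate the input tensor so that it becomes a DTensor sharded on the sequence dimension. Default: 1.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module output. Default: False.

Returns

A ParallelStyle object that represents Sequence Parallel of the nn.Module.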

Example::

>>> from torch.distributed.tensor.parallel import parallelize_module, SequenceParallel
>>> from torch.distributed.device_mesh import init_device_mesh
>>> ...
>>> m = Model(...)  # m is a nn.Module that contains a "norm" nn.LayerNorm submodule
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>>
>>> # By default, the input of the "norm" will be converted to DTensor that shards on the sequence dim
>>> # and the output of "norm" will return a :class:`DTensor` sharded on the sequence dimension.
>>>
>>> sharded_mod = parallelize_module(m, tp_mesh, {"norm": SequenceParallel()})
>>> ...

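Note

The SequenceParallel style assumes ones initialization if there are weights in the nn.Module (i.e. nn.LayerNorm or RMSNorm, which have ones initialization by default). If you use custom initialization for the weights of those modules, you need to broadcast the weights before/after parallelizing to ensure that they are replicated.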
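Sharding on the sequence dimension works for a module like LayerNorm because each token is normalized independently of the others. The plain-Python sketch below illustrates this for a single sample of shape (sequence, features), assuming weight 1 and bias 0 as in the default initialization; the helper names are hypothetical and the batch dimension is omitted for brevity.

```python
import math


def layer_norm(token, eps=1e-5):
    """Normalize one token vector over its feature dimension (weight=1, bias=0)."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]


def norm_sequence(tokens):
    """Apply layer_norm to every token of a (sequence, features) input."""
    return [layer_norm(t) for t in tokens]


sequence = [[1.0, 2.0, 3.0], [0.0, 4.0, -4.0], [5.0, 5.0, 6.0], [2.0, -1.0, 0.5]]

full = norm_sequence(sequence)

# Shard the sequence across 2 simulated ranks, normalize locally, then gather.
tp_size = 2
step = len(sequence) // tp_size
shards = [sequence[i * step:(i + 1) * step] for i in range(tp_size)]
gathered = [tok for shard in shards for tok in norm_sequence(shard)]

assert gathered == full
```

The equality only holds if every rank uses the same weights, which is why custom weight initialization has to be kept replicated across ranks.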

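To simply configure the inputs and outputs of an nn.Module with DTensor layouts and perform the necessary layout redistributions, without distributing the module parameters to DTensors, the following ParallelStyles can be used in the parallelize_plan when calling parallelize_module:

class torch.distributed.tensor.parallel.PrepareModuleInput(*, input_layouts=None, desired_input_layouts=None, input_kwarg_layouts=None, desired_input_kwarg_layouts=None, use_local_output=False)

Configure the inputs of the nn.Module so that its input tensors are converted to DTensors at runtime according to input_layouts, and perform layout redistribution according to desired_input_layouts.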
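As a rough mental model, going from an input layout of Shard(0) to a desired layout of Replicate() amounts to every rank ending up with the concatenation of all shards. The sketch below simulates that with lists; all_gather_dim0 is a hypothetical helper, and the text above does not state which collective the library actually issues for this redistribution.

```python
def all_gather_dim0(local_shards):
    """Concatenate every rank's dim-0 shard; each rank gets its own full copy."""
    full = [row for shard in local_shards for row in shard]
    return [[list(row) for row in full] for _ in local_shards]


# Two simulated ranks, each holding 2 rows of a (4, 3) activation sharded on dim 0.
rank_inputs = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]

replicated = all_gather_dim0(rank_inputs)
assert replicated[0] == replicated[1]
print(len(replicated[0]), len(replicated[0][0]))  # 4 3
```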

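Keyword Arguments

input_layouts (Union[Placement, Tuple[Optional[Placement]]]) – The DTensor layouts of the input tensors for the nn.Module, used to convert the input tensors to DTensors. If some inputs are not torch.Tensor or do not need to be converted to DTensors, None must be specified as a placeholder. Default: None.
desired_input_layouts (Union[Placement, Tuple[Optional[Placement]]]) – The desired DTensor layouts of the input tensors for the nn.Module, used to ensure the inputs of the nn.Module have the desired DTensor layouts. This argument must have the same length as input_layouts. Default: None.
input_kwarg_layouts (Dict[str, Placement]) – The DTensor layouts of the input kwargs for the nn.Module, used to convert the input kwarg tensors to DTensors. Default: None.
desired_input_kwarg_layouts (Dict[str, Placement]) – The desired DTensor layouts of the input kwargs for the nn.Module, used to ensure the inputs of the nn.Module have the desired DTensor layouts. Default: None.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module inputs. Default: False.

Returns

A ParallelStyle object that prepares the sharding layouts of the nn.Module's inputs.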

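Example::

>>> from torch.distributed.tensor import Replicate, Shard  # torch.distributed._tensor in older releases
>>> from torch.distributed.tensor.parallel import parallelize_module, PrepareModuleInput
>>> from torch.distributed.device_mesh import init_device_mesh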
>>> ...
>>> block = TransformerBlock(...)  # block is a nn.Module that contains an "attn" Attention submodule
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>>
>>> # According to the style specified below, the first input of attn will be annotated to Sharded DTensor
>>> # and then redistributed to Replicated DTensor.
>>> parallelize_module(
>>>     block, # this can be a submodule or module
>>>     tp_mesh,
>>>     parallelize_plan={
>>>         "attn": PrepareModuleInput(
>>>             input_layouts=(Shard(0), None, None, ...),
>>>             desired_input_layouts=(Replicate(), None, None, ...)
>>>         ),
>>>     }
>>> )

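class torch.distributed.tensor.parallel.PrepareModuleOutput(*, output_layouts, desired_output_layouts, use_local_output=True)

Configure the outputs of the nn.Module so that its output tensors are converted to DTensors at runtime according to output_layouts, and perform layout redistribution according to desired_output_layouts.

Keyword Arguments

output_layouts (Union[Placement, Tuple[Placement]]) – The DTensor layouts of the output tensors for the nn.Module, used to convert the output tensors to DTensors if they are torch.Tensor. If some outputs are not torch.Tensor or do not need to be converted to DTensors, None must be specified as a placeholder.
desired_output_layouts (Union[Placement, Tuple[Placement]]) – The desired DTensor layouts of the output tensors for the nn.Module, used to ensure the outputs of the nn.Module have the desired DTensor layouts.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module outputs. Default: True.

Returns

A ParallelStyle object that prepares the sharding layouts of the nn.Module's outputs.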

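Example::

>>> from torch.distributed.tensor import Replicate, Shard  # torch.distributed._tensor in older releases
>>> from torch.distributed.tensor.parallel import parallelize_module, PrepareModuleOutput
>>> from torch.distributed.device_mesh import init_device_mesh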
>>> ...
>>> block = TransformerBlock(...)  # block is a nn.Module that contains an "attn" Attention submodule
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>>
>>> # According to the style specified below, the output of the TransformerBlock will be converted to Replicated DTensor
>>> # and then redistributed to Sharded DTensor.
>>> parallelize_module(
>>>     block, # this can be a submodule or module
>>>     tp_mesh,
>>>     parallelize_plan = PrepareModuleOutput(
>>>         output_layouts=Replicate(),
>>>         desired_output_layouts=Shard(0)
>>>     )
>>> )

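Note

When using Shard(dim) as the input/output layout for the ParallelStyles above, we assume the input/output activation tensors are evenly sharded on tensor dimension dim across the DeviceMesh that TP operates on. For instance, since RowwiseParallel accepts input that is sharded on the last dimension, it assumes the input tensor has already been evenly sharded on the last dimension. For unevenly sharded activation tensors, one can pass a DTensor directly to the partitioned modules and use use_local_output=False to return a DTensor after each ParallelStyle, so that the DTensor can track the uneven sharding information.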
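The difference between even and uneven sharding comes down to whether the dimension size divides by the number of ranks. The helper below is a hypothetical sketch that splits a dimension into ceiling-sized chunks; that chunking scheme is an assumption chosen for illustration and is not taken from the text above.

```python
def shard_sizes(dim_size, num_shards):
    """Per-rank sizes when splitting into ceiling-sized chunks (assumed scheme)."""
    chunk = -(-dim_size // num_shards)  # ceiling division
    sizes = []
    remaining = dim_size
    for _ in range(num_shards):
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


def is_even(dim_size, num_shards):
    """True when every rank would hold a shard of the same size."""
    return dim_size % num_shards == 0


print(shard_sizes(16, 4), is_even(16, 4))  # [4, 4, 4, 4] True
print(shard_sizes(10, 4), is_even(10, 4))  # [3, 3, 3, 1] False
```

In the second case a plain local torch.Tensor carries no record that the ranks hold different sizes, which is the situation where keeping the activations as DTensors is suggested.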

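For models like Transformer, we recommend using ColwiseParallel and RowwiseParallel together in the parallelize_plan to achieve the desired sharding for the entire model (i.e. Attention and MLP).

Parallelized cross-entropy loss computation (loss parallelism) is supported via the following context manager:

torch.distributed.tensor.parallel.loss_parallel()

A context manager that enables loss parallelism, where efficient parallelized loss computation can be performed when the input is sharded on the class dimension. Currently only the cross-entropy loss is supported.

Within this context manager, one can use cross_entropy() or CrossEntropyLoss as usual, with the following assumptions on the input parameters. The corresponding backward() call, if any, also needs to happen under this context manager.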

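Parameters

input (DTensor) – Input logits. Assumed to be sharded on the class dimension.
target (Union[torch.Tensor, DTensor]) – Must be ground-truth class indices (class probabilities are currently not supported). Assumed to be replicated across the DeviceMesh.
weight (Union[torch.Tensor, DTensor], optional) – If given, assumed to be replicated across the DeviceMesh.
label_smoothing – Currently not supported.

Returns

A replicated DTensor.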

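Example

A sharded DTensor is created manually here to showcase the usage. In practice, it is usually the output of a TP module.

>>> import torch
>>> import torch.nn.functional as F
>>> from torch.distributed.tensor import Shard, distribute_tensor  # torch.distributed._tensor in older releases
>>> from torch.distributed.tensor.parallel import loss_parallel
>>> from torch.distributed.device_mesh import init_device_mesh
>>> ...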
>>> device_mesh = init_device_mesh("cuda", (8,))
>>> input = torch.randn(4, 16, device="cuda", requires_grad=True)
>>> dist_input = distribute_tensor(input, device_mesh, placements=[Shard(1)])
>>> target = torch.randint(16, (4,), device="cuda")
>>> with loss_parallel():
>>>     loss = F.cross_entropy(dist_input, target, reduction="mean")
>>>     loss.backward()
>>> ...

Warning

The loss_parallel API is experimental and subject to change.
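The reason cross-entropy can be computed efficiently from logits sharded on the class dimension is that the log-sum-exp decomposes into a few small reductions. The plain-Python sketch below illustrates that idea for a single sample; it is a conceptual illustration with hypothetical helper names, not a description of how loss_parallel is implemented.

```python
import math


def cross_entropy(logits, target):
    """Reference: -log softmax(logits)[target] for one sample."""
    m = max(logits)
    return math.log(sum(math.exp(v - m) for v in logits)) + m - logits[target]


def sharded_cross_entropy(logit_shards, target):
    """Same loss computed from class-dimension shards plus three reductions."""
    # Reduction 1: global max over per-shard maxima (numerical stability).
    global_max = max(max(shard) for shard in logit_shards)
    # Reduction 2: global sum of exponentials over per-shard sums.
    sum_exp = sum(
        sum(math.exp(v - global_max) for v in shard) for shard in logit_shards
    )
    # Reduction 3: only the shard that owns the target class contributes its logit.
    offset, target_logit = 0, 0.0
    for shard in logit_shards:
        if offset <= target < offset + len(shard):
            target_logit += shard[target - offset]
        offset += len(shard)
    return math.log(sum_exp) + global_max - target_logit


logits = [2.0, -1.0, 0.5, 3.0, 0.0, 1.5, -2.0, 0.25]
shards = [logits[0:4], logits[4:8]]  # class dimension split across 2 simulated ranks

for target in range(len(logits)):
    assert math.isclose(
        cross_entropy(logits, target), sharded_cross_entropy(shards, target)
    )
```

Each rank only exchanges scalars per sample (a max, a sum, and one logit) rather than the full logits, so the class-sharded activations never need to be gathered.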
