Tensor Parallelism - torch.distributed.tensor.parallel

Tensor Parallelism (TP) is built on top of the PyTorch DistributedTensor (DTensor) and provides different parallelism styles: Colwise and Rowwise Parallelism.

Warning

Tensor Parallelism APIs are experimental and subject to change.

The entry point for parallelizing your nn.Module with Tensor Parallelism is:

torch.distributed.tensor.parallel.parallelize_module(module, device_mesh, parallelize_plan, tp_mesh_dim=0)

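Apply Tensor Parallelism in PyTorch by parallelizing a module or its sub-modules based on a user-specified plan.

The module or sub-modules are parallelized according to a parallelize_plan. The parallelize_plan contains ParallelStyle objects, which indicate how the user wants each module or sub-module to be parallelized.

Users can also specify a different parallel style for each module, keyed by the module's fully qualified name (FQN).

Note that parallelize_module only accepts a 1-D DeviceMesh. If you have a 2-D or N-D DeviceMesh, slice it to a 1-D sub DeviceMesh first and then pass that to this API (i.e. device_mesh["tp"]).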
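The slicing step can be pictured with a small plain-Python model. This is only a conceptual sketch, not the DeviceMesh implementation: it assumes 8 ranks laid out row-major in a (2, 4) grid whose dimensions are named "dp" and "tp", and the helpers `build_mesh` and `tp_submesh` are hypothetical names introduced for illustration.

```python
def build_mesh(shape):
    """Lay out ranks 0..N-1 row-major in a 2-D grid of the given (rows, cols) shape."""
    rows, cols = shape
    return [[r * cols + c for c in range(cols)] for r in range(rows)]


def tp_submesh(mesh, rank):
    """Return the 1-D row of ranks that shares the "tp" dimension with `rank`."""
    for row in mesh:
        if rank in row:
            return row
    raise ValueError(f"rank {rank} is not in the mesh")


# Assumed layout: dim 0 is "dp" (2 groups), dim 1 is "tp" (4 ranks per group).
mesh_2d = build_mesh((2, 4))

# The 1-D group that rank 5 would use for tensor parallelism.
print(tp_submesh(mesh_2d, 5))
```

With a real mesh, the equivalent would presumably be a mesh created with named dimensions (for example via the `mesh_dim_names` argument of `init_device_mesh`) and then indexed as `device_mesh["tp"]`, as the note above describes.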

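Parameters

module (nn.Module) – Module to be parallelized.
device_mesh (DeviceMesh) – Object which describes the mesh topology of devices for the DTensor.
parallelize_plan (Union[ParallelStyle, Dict[str, ParallelStyle]]) – The plan used to parallelize the module. It can be either a single ParallelStyle object, which describes how inputs and outputs are prepared for Tensor Parallelism, or a dict mapping each module FQN to its corresponding ParallelStyle object.
tp_mesh_dim (int, deprecated) – The dimension of device_mesh on which Tensor Parallelism is performed. This field is deprecated and will be removed in the future. If you have a 2-D or N-D DeviceMesh, consider passing in device_mesh["tp"].

Returns

The parallelized nn.Module object.

Return type

Module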

Example::

>>> from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel, RowwiseParallel
>>> from torch.distributed.device_mesh import init_device_mesh
>>>
>>> # Define the module.
>>> m = Model(...)
>>> # Build a 1-D mesh over 8 devices, then shard "w1" column-wise and "w2" row-wise.
>>> tp_mesh = init_device_mesh("cuda", (8,))
>>> m = parallelize_module(m, tp_mesh, {"w1": ColwiseParallel(), "w2": RowwiseParallel()})

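Note

For complex module architectures such as Attention and MLP layers, we recommend composing different ParallelStyles together (i.e. ColwiseParallel and RowwiseParallel) and passing them as the parallelize_plan to achieve the desired sharded computation.

Tensor Parallelism supports the following parallel styles: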

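class torch.distributed.tensor.parallel.ColwiseParallel(*, input_layouts=None, output_layouts=None, use_local_output=True)

Partition a compatible nn.Module in a column-wise fashion. Currently supports nn.Linear and nn.Embedding. Users can compose it with RowwiseParallel to shard more complicated modules (i.e. MLP, Attention).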

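Keyword Arguments

input_layouts (Placement, optional) – The DTensor layout of the input tensor for the nn.Module, used to annotate the input tensor so that it becomes a DTensor. If not specified, the input tensor is assumed to be replicated.
output_layouts (Placement, optional) – The DTensor layout of the output of the nn.Module, used to ensure the output has the layout the user wants. If not specified, the output tensor is sharded on the last dimension.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module output, default: True.

Returns

A ParallelStyle object that represents Colwise sharding of the nn.Module.

Example::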

>>> from torch.distributed.tensor.parallel import parallelize_module, ColwiseParallel
>>> ...
>>> # By default, the input of the "w1" Linear will be annotated to Replicated DTensor
>>> # and the output of "w1" will return :class:`torch.Tensor` that shards on the last dim.
>>>
>>> parallelize_module(
>>>     module=block, # this can be a submodule or module
>>>     ...,
>>>     parallelize_plan={"w1": ColwiseParallel()},
>>> )
>>> ...
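The arithmetic behind column-wise sharding can be checked with a small plain-Python sketch. It is not the ColwiseParallel implementation: it simulates two ranks in one process, uses the y = x @ W convention with W of shape (in, out) as an assumption (nn.Linear stores its weight transposed), and `matmul` and `split_columns` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w, parts):
    """Split matrix w into `parts` equal blocks of columns."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


x = [[1, 2], [3, 4]]                # replicated input, shape (2, 2)
w = [[1, 0, 2, 1], [0, 1, 1, 3]]    # full weight, shape (2, 4)

# Each simulated rank multiplies the full input by its own column block,
# so its local output holds a slice of the last dimension.
local_outputs = [matmul(x, w_shard) for w_shard in split_columns(w, parts=2)]

# Concatenating the local outputs along the last dim recovers the full output.
gathered = [sum((out[i] for out in local_outputs), []) for i in range(len(x))]

print(gathered == matmul(x, w))
```

No communication is needed in the forward pass here, which is why the default output layout is sharded on the last dimension.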

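Note

By default, the ColwiseParallel output is sharded on the last dimension if output_layouts is not specified. If there are operators that require a specific tensor shape (i.e. before the paired RowwiseParallel), keep in mind that when the output is sharded the operator might need to be adjusted to the sharded size.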
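A typical case is reshaping an attention projection into heads. The sketch below only does the shape arithmetic, under the assumption that the projection is sharded evenly across `tp_size` ranks so that each rank holds `n_heads // tp_size` heads; the function name and the example sizes are hypothetical.

```python
def local_attention_shape(batch, seq_len, n_heads, head_dim, tp_size):
    """Shape to reshape a colwise-sharded projection output into on one rank."""
    if n_heads % tp_size != 0:
        raise ValueError("n_heads must be divisible by tp_size")
    # Only the head dimension shrinks: each rank keeps n_heads // tp_size heads.
    return (batch, seq_len, n_heads // tp_size, head_dim)


# Without tensor parallelism the reshape uses all 8 heads.
full_shape = local_attention_shape(2, 16, n_heads=8, head_dim=64, tp_size=1)

# With 4-way tensor parallelism each rank sees a last dim of 512 / 4 = 128,
# so the reshape must use 2 local heads instead of 8.
local_shape = local_attention_shape(2, 16, n_heads=8, head_dim=64, tp_size=4)

print(full_shape, local_shape)
```

A reshape hard-coded to the full head count would fail on the sharded output, since the local tensor has fewer elements than the full shape requires.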

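class torch.distributed.tensor.parallel.RowwiseParallel(*, input_layouts=None, output_layouts=None, use_local_output=True)

Partition a compatible nn.Module in a row-wise fashion. Currently supports nn.Linear only. Users can compose it with ColwiseParallel to shard more complicated modules (i.e. MLP, Attention).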

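Keyword Arguments

input_layouts (Placement, optional) – The DTensor layout of the input tensor for the nn.Module, used to annotate the input tensor so that it becomes a DTensor. If not specified, the input tensor is assumed to be sharded on the last dimension.
output_layouts (Placement, optional) – The DTensor layout of the output of the nn.Module, used to ensure the output has the layout the user wants. If not specified, the output tensor is replicated.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module output, default: True.

Returns

A ParallelStyle object that represents Rowwise sharding of the nn.Module.

Example::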

>>> from torch.distributed.tensor.parallel import parallelize_module, RowwiseParallel
>>> ...
>>> # By default, the input of the "w2" Linear will be annotated to DTensor that shards on the last dim
>>> # and the output of "w2" will return a replicated :class:`torch.Tensor`.
>>>
>>> parallelize_module(
>>>     module=block, # this can be a submodule or module
>>>     ...,
>>>     parallelize_plan={"w2": RowwiseParallel()},
>>> )
>>> ...
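Row-wise sharding can be checked the same way in plain Python. This is a single-process simulation of two ranks rather than the RowwiseParallel implementation; it assumes the y = x @ W convention with W of shape (in, out), and models the final reduction as an elementwise sum of the per-rank partial results. `matmul` is a hypothetical helper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


x = [[1, 2, 3, 4]]                          # full input, shape (1, 4)
w = [[1, 0], [0, 1], [2, 2], [1, 3]]        # full weight, shape (4, 2)
tp_size = 2
step = len(w) // tp_size

# Input is sharded on its last dim; the weight is sharded on its rows.
x_shards = [[row[r * step:(r + 1) * step] for row in x] for r in range(tp_size)]
w_shards = [w[r * step:(r + 1) * step] for r in range(tp_size)]

# Each simulated rank produces a partial result with the full output shape.
partials = [matmul(xs, ws) for xs, ws in zip(x_shards, w_shards)]

# Summing the partials (the role an all-reduce plays) yields the replicated output.
reduced = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

print(reduced == matmul(x, w))
```

The sum across ranks is why the default output layout of this style is replicated while its default input layout is sharded on the last dimension.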

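To simply configure the nn.Module's inputs and outputs with DTensor layouts and perform the necessary layout redistributions, without distributing the module parameters to DTensors, the following classes can be used in the parallelize_plan of parallelize_module: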

class torch.distributed.tensor.parallel.PrepareModuleInput(*, input_layouts, desired_input_layouts, use_local_output=False)

Configure the nn.Module's inputs to convert the input tensors of the nn.Module to DTensors at runtime according to input_layouts, and perform layout redistribution according to desired_input_layouts.

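Keyword Arguments

input_layouts (Union[Placement, Tuple[Placement]]) – The DTensor layouts of the input tensors for the nn.Module, used to convert the input tensors to DTensors. If some inputs are not torch.Tensor or do not need to be converted to DTensors, None needs to be specified as a placeholder.
desired_input_layouts (Union[Placement, Tuple[Placement]]) – The desired DTensor layouts of the input tensors for the nn.Module, used to ensure the inputs of the nn.Module have the desired DTensor layouts. This argument needs to have the same length as input_layouts.
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module inputs, default: False.

Returns

A ParallelStyle object that prepares the sharding layouts of the nn.Module's inputs.

Example::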

>>> from torch.distributed.tensor.parallel import parallelize_module, PrepareModuleInput
>>> ...
>>> # According to the style specified below, the first input of attn will be annotated to Sharded DTensor
>>> # and then redistributed to Replicated DTensor.
>>> parallelize_module(
>>>     module=block, # this can be a submodule or module
>>>     ...,
>>>     parallelize_plan={
>>>         "attn": PrepareModuleInput(
>>>             input_layouts=(Shard(0), None, None, ...),
>>>             desired_input_layouts=(Replicate(), None, None, ...)
>>>         ),
>>>     }
>>> )
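What the redistribution in this example amounts to can be modeled in plain Python. The sketch assumes an even split across two simulated ranks and treats the move from a Shard(0) layout to a Replicate() layout as an all-gather along dim 0; `shard_dim0` and `all_gather_dim0` are hypothetical helpers, not DTensor APIs.

```python
def shard_dim0(tensor, world_size):
    """Split rows evenly across ranks (models a Shard(0) layout)."""
    step = len(tensor) // world_size
    return [tensor[r * step:(r + 1) * step] for r in range(world_size)]


def all_gather_dim0(shards):
    """Concatenate per-rank shards along dim 0 and hand every rank a full copy."""
    full = [row for shard in shards for row in shard]
    return [list(full) for _ in shards]


global_input = [[0, 1], [2, 3], [4, 5], [6, 7]]

# Before the module runs, each rank only holds its own rows.
shards = shard_dim0(global_input, world_size=2)

# After redistribution to a replicated layout, every rank holds all rows.
replicated = all_gather_dim0(shards)

print(shards[1], len(replicated[0]))
```

Inputs marked with None in input_layouts are left out of this picture entirely, since they are passed through without conversion.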

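class torch.distributed.tensor.parallel.PrepareModuleOutput(*, output_layouts, desired_output_layouts, use_local_output=True)

Configure the nn.Module's outputs to convert the output tensors of the nn.Module to DTensors at runtime according to output_layouts, and perform layout redistribution according to desired_output_layouts.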

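Keyword Arguments

output_layouts (Union[Placement, Tuple[Placement]]) – The DTensor layouts of the output tensors for the nn.Module, used to convert the output tensors to DTensors if they are torch.Tensor. If some outputs are not torch.Tensor or do not need to be converted to DTensors, None needs to be specified as a placeholder.
desired_output_layouts (Union[Placement, Tuple[Placement]]) – The desired DTensor layouts of the output tensors for the nn.Module, used to ensure the outputs of the nn.Module have the desired DTensor layouts.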
use_local_output (bool, optional) – Whether to use a local torch.Tensor instead of a DTensor for the module outputs, default: True.

Returns

A ParallelStyle object that prepares the sharding layouts of the nn.Module's outputs.

Example::

>>> from torch.distributed.tensor.parallel import parallelize_module, PrepareModuleOutput
>>> ...
>>> # According to the style specified below, the output of "submodule" will be converted to a Replicated DTensor
>>> # and then redistributed to a Sharded DTensor.
>>> parallelize_module(
>>>     module=block, # this can be a submodule or module
>>>     ...,
>>>     parallelize_plan={
>>>         "submodule": PrepareModuleOutput(
>>>             output_layouts=Replicate(),
>>>             desired_output_layouts=Shard(0)
>>>         ),
>>>     }
>>> )
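The opposite direction, from a replicated output to a Shard(0) layout, can be modeled as each rank keeping only its own slice of rows. This plain-Python sketch assumes an even split and simulates two ranks in one process; `local_chunk_dim0` is a hypothetical helper, not a DTensor API.

```python
def local_chunk_dim0(replicated, rank, world_size):
    """Return the rows that `rank` keeps when a replicated tensor becomes Shard(0)."""
    if len(replicated) % world_size != 0:
        raise ValueError("this sketch assumes rows divide evenly across ranks")
    step = len(replicated) // world_size
    return replicated[rank * step:(rank + 1) * step]


# Every rank starts with the same full output.
output = [[0, 1], [2, 3], [4, 5], [6, 7]]

# Each simulated rank keeps its own slice; no data from other ranks is needed.
per_rank = [local_chunk_dim0(output, rank, world_size=2) for rank in range(2)]

print(per_rank)
```

Stitching the per-rank chunks back together in rank order gives the original output, so no information is lost by the change of layout.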

For models like Transformer, we recommend using ColwiseParallel and RowwiseParallel together in the parallelize_plan to achieve the desired sharding for the entire model (i.e. Attention and MLP).
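Why the two styles pair well can be seen in a plain-Python model of a two-layer MLP. This is a sketch under stated assumptions, not the library's implementation: it simulates two ranks in one process, uses the y = x @ W convention, picks ReLU as the elementwise activation, and models the final reduction as a sum of per-rank partial results. All helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Elementwise ReLU on a matrix given as a list of rows."""
    return [[max(v, 0) for v in row] for row in m]


x = [[2, -1]]                               # replicated input, shape (1, 2)
w1 = [[1, -1, 2, 0], [0, 1, 1, -2]]         # first layer, shape (2, 4)
w2 = [[1, 0], [0, 1], [1, 1], [2, 0]]       # second layer, shape (4, 2)
tp_size = 2
step = len(w2) // tp_size

# Unsharded reference result.
reference = matmul(relu(matmul(x, w1)), w2)

partials = []
for rank in range(tp_size):
    # Column block of w1 (colwise) and the matching row block of w2 (rowwise).
    w1_local = [row[rank * step:(rank + 1) * step] for row in w1]
    w2_local = w2[rank * step:(rank + 1) * step]
    # The hidden activation stays sharded; ReLU is elementwise, so it runs locally.
    hidden_local = relu(matmul(x, w1_local))
    partials.append(matmul(hidden_local, w2_local))

# One reduction at the end combines the per-rank partial outputs.
combined = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

print(combined == reference)
```

The sharded hidden activation produced by the column-wise layer is exactly the input layout the row-wise layer expects, so in this model the only cross-rank step is the final sum.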
