Distributed

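For distributed training, TorchX relies on the scheduler's gang scheduling capabilities to schedule n copies of nodes. Once launched, the application is expected to be written in a way that leverages this topology, for instance with PyTorch's DDP. You can express different node topologies by specifying multiple torchx.specs.Role in your component's AppDef. Each role maps to a homogeneous group of nodes that performs a "role" (function) in the overall training. Scheduling-wise, TorchX launches each role as a sub-gang.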

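A DDP-style training job has a single role: trainers. A training job that uses parameter servers has two roles: parameter server and trainer. For each role you can specify a different entrypoint (executable), number of replicas, resource requirements, and more.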
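To make the two topologies concrete, here is a minimal sketch that models roles with plain dataclasses. RoleSketch and AppSketch are hypothetical stand-ins written for illustration, not the torchx.specs classes, and the entrypoints, replica counts, and resource numbers are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RoleSketch:
    """Hypothetical stand-in for a role: a homogeneous group of nodes."""
    name: str
    entrypoint: str
    num_replicas: int
    cpu: int
    gpu: int
    memMB: int


@dataclass
class AppSketch:
    """Hypothetical stand-in for an application made of one or more roles."""
    name: str
    roles: List[RoleSketch] = field(default_factory=list)

    def total_nodes(self) -> int:
        # Each role contributes num_replicas nodes to the overall job.
        return sum(role.num_replicas for role in self.roles)


# DDP-style job: a single role of identical trainers.
ddp_app = AppSketch(
    name="ddp_job",
    roles=[RoleSketch("trainer", "main.py", 4, cpu=8, gpu=4, memMB=32768)],
)

# Parameter-server job: two roles with different entrypoints and resources.
ps_app = AppSketch(
    name="ps_job",
    roles=[
        RoleSketch("parameter_server", "ps_main.py", 2, cpu=4, gpu=0, memMB=16384),
        RoleSketch("trainer", "trainer_main.py", 8, cpu=8, gpu=4, memMB=32768),
    ],
)
```

In this model each role would be scheduled as its own sub-gang, so ps_app describes two groups of nodes while ddp_app describes one.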

DDP Builtin

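DDP-style trainers are common and easy to templatize since they are homogeneous single-role AppDefs, so there is a builtin for them: dist.ddp. Assuming your DDP training script is called main.py, launch it as: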

# locally, 1 node x 4 workers
$ torchx run -s local_cwd dist.ddp -j 1x4 --script main.py

# locally, 2 node x 4 workers (8 total)
$ torchx run -s local_cwd dist.ddp -j 2x4 --script main.py

# remote (optionally pass --rdzv_port to use a different master port than the default 29500)
$ torchx run -s kubernetes -cfg queue=default dist.ddp \
    -j 2x4 \
    --script main.py

# remote -- elastic/autoscaling with 2 minimum and max 5 nodes with 8
# workers each
$ torchx run -s kubernetes dist.ddp -j 2:5x8 --script main.py

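Note that the only difference compared to the local launch is the scheduler (-s). The dist.ddp builtin uses torchelastic (more specifically, torch.distributed.run) under the hood. See the torchelastic documentation for more information.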
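The -j values used in the commands above follow the pattern [{min_nnodes}:]{nnodes}x{nproc_per_node}. The following is a minimal sketch of how such a string could be interpreted; parse_j and JobShape are hypothetical names for illustration and are not part of TorchX.

```python
import re
from typing import NamedTuple


class JobShape(NamedTuple):
    min_nnodes: int
    max_nnodes: int
    nproc_per_node: int


# Optional "<min>:" prefix, then "<nnodes>x<nproc_per_node>".
_J_PATTERN = re.compile(r"^(?:(\d+):)?(\d+)x(\d+)$")


def parse_j(spec: str) -> JobShape:
    """Parse a -j style string such as '2x4' or '2:5x8' (illustrative only)."""
    match = _J_PATTERN.match(spec)
    if match is None:
        raise ValueError(f"invalid -j spec: {spec!r}")
    min_raw, nnodes_raw, nproc_raw = match.groups()
    max_nnodes = int(nnodes_raw)
    # Without a "<min>:" prefix the job is not elastic, so min equals max.
    min_nnodes = int(min_raw) if min_raw is not None else max_nnodes
    if min_nnodes > max_nnodes:
        raise ValueError(f"min_nnodes exceeds nnodes in {spec!r}")
    return JobShape(min_nnodes, max_nnodes, int(nproc_raw))


shape = parse_j("2x4")
total_workers = shape.max_nnodes * shape.nproc_per_node  # 2 nodes x 4 workers
elastic = parse_j("2:5x8")  # between 2 and 5 nodes, 8 workers each
```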

Components APIs

torchx.components.dist.ddp(*script_args: str, script: Optional[str] = None, m: Optional[str] = None, image: str = 'ghcr.io/pytorch/torchx:0.5.0', name: str = '/', h: Optional[str] = None, cpu: int = 2, gpu: int = 0, memMB: int = 1024, j: str = '1x2', env: Optional[Dict[str, str]] = None, max_retries: int = 0, rdzv_port: int = 29500, mounts: Optional[List[str]] = None, debug: bool = False) → AppDef

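Distributed data parallel style application (one role, multi-replica). Uses torch.distributed.run to launch and coordinate PyTorch worker processes. Defaults to using the c10d rendezvous backend on rendezvous_endpoint $rank_0_host:$rdzv_port. Note that the rdzv_port parameter is ignored when running on a single node; in that case port 0 is used, which instructs torchelastic to choose a free random port on the host.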
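The endpoint rule described above can be sketched as a small function. This is an illustration under stated assumptions, not the TorchX implementation: rendezvous_endpoint is a hypothetical helper, and using "localhost" as the single-node host is an assumption of this sketch.

```python
def rendezvous_endpoint(nnodes: int, rank0_host: str, rdzv_port: int = 29500) -> str:
    """Return a host:port string following the rule described in the text."""
    if nnodes < 1:
        raise ValueError("nnodes must be at least 1")
    if nnodes == 1:
        # Single node: rdzv_port is ignored; port 0 lets torchelastic pick
        # a free random port. The "localhost" host is an assumption here.
        return "localhost:0"
    # Multi-node: the c10d store is hosted on rank 0's host at rdzv_port.
    return f"{rank0_host}:{rdzv_port}"


single = rendezvous_endpoint(1, "node-0", rdzv_port=12345)
multi = rendezvous_endpoint(2, "node-0")
```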

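Note: the (cpu, gpu, memMB) parameters are mutually exclusive with h (named resource); if h is specified, it takes precedence for setting resource requirements. See the documentation on registering named resources.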
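The precedence rule can be illustrated with a short sketch. Everything here is hypothetical: ResourceSketch, NAMED_RESOURCES_SKETCH, the "gpu_large" entry, and resolve_resource are invented for illustration and do not reflect how TorchX stores or looks up named resources.

```python
from typing import Dict, NamedTuple, Optional


class ResourceSketch(NamedTuple):
    cpu: int
    gpu: int
    memMB: int


# Hypothetical registry of named resources (names and sizes are made up).
NAMED_RESOURCES_SKETCH: Dict[str, ResourceSketch] = {
    "gpu_large": ResourceSketch(cpu=32, gpu=8, memMB=262144),
}


def resolve_resource(
    cpu: int = 2, gpu: int = 0, memMB: int = 1024, h: Optional[str] = None
) -> ResourceSketch:
    """Pick the resource for a replica; h wins over cpu/gpu/memMB if given."""
    if h is not None:
        if h not in NAMED_RESOURCES_SKETCH:
            raise KeyError(f"unknown named resource: {h!r}")
        return NAMED_RESOURCES_SKETCH[h]
    return ResourceSketch(cpu=cpu, gpu=gpu, memMB=memMB)


explicit = resolve_resource(cpu=4, gpu=1, memMB=8192)
named = resolve_resource(cpu=4, gpu=1, memMB=8192, h="gpu_large")
```

Here named ignores the explicit cpu, gpu, and memMB values because h is set.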

Parameters

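script_args – arguments to the main module
script – script or binary to run within the image
m – the Python module path to run
image – image (e.g. docker)
name – job name override in one of the following formats: {experimentname}/{runname}, {experimentname}/, /{runname}, or {runname}. Uses the script or module name if {runname} is not specified.
cpu – number of CPUs per replica
gpu – number of GPUs per replica
memMB – CPU memory in MB per replica
h – a registered named resource (if specified, takes precedence over cpu, gpu, memMB)
j – [{min_nnodes}:]{nnodes}x{nproc_per_node}; for GPU hosts, nproc_per_node must not exceed the number of GPUs
env – environment variables to be passed to the run (e.g. ENV1=v1,ENV2=v2,ENV3=v3)
max_retries – the number of scheduler retries allowed
rdzv_port – the port on rank 0's host used for hosting the c10d store used for rendezvous. Only takes effect when running multi-node. When running on a single node, this parameter is ignored and a random free port is chosen.
mounts – mounts to mount into the worker environment/container (e.g. type=<bind/volume>,src=/host,dst=/job[,readonly]). See the scheduler documentation for more information.
debug – whether to run with preset debug flags enabled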
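Two of the string formats in the parameter list, the name override and the comma-separated env form, can be sketched as follows. split_job_name and parse_env are hypothetical helpers for illustration only; passing the fallback run name in explicitly (for example "main" for main.py) is an assumption about how the script or module name would be supplied.

```python
from typing import Dict, Optional, Tuple


def split_job_name(name: str, fallback_run: str) -> Tuple[Optional[str], str]:
    """Split 'exp/run', 'exp/', '/run', or 'run' into (experiment, run)."""
    # rpartition leaves the experiment empty when there is no slash at all.
    experiment, _, run = name.rpartition("/")
    # An empty run name falls back to the script or module name.
    return (experiment or None, run or fallback_run)


def parse_env(env: str) -> Dict[str, str]:
    """Parse the 'ENV1=v1,ENV2=v2' string form into a dict."""
    result: Dict[str, str] = {}
    for pair in env.split(","):
        if not pair:
            continue
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        result[key] = value
    return result


default_name = split_job_name("/", fallback_run="main")  # the default name '/'
env_vars = parse_env("ENV1=v1,ENV2=v2,ENV3=v3")
```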

torchx.components.dist._TORCH_DEBUG_FLAGS

These are commonly set environment variables for debugging PyTorch execution. Most of them are described in the PyTorch documentation.

CUDA_LAUNCH_BLOCKING
NCCL_DESYNC_DEBUG
TORCH_DISTRIBUTED_DEBUG
TORCH_SHOW_CPP_STACKTRACES
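As a rough illustration of what running with preset debug flags could look like, the sketch below merges a set of debug variables into a job's environment. The values in DEBUG_FLAGS_SKETCH are assumptions chosen for illustration (the page lists only the variable names), build_env is a hypothetical helper, and letting the presets overwrite user-supplied values is a choice made in this sketch rather than documented behavior.

```python
from typing import Dict, Optional

# Variable names come from the list above; the values are assumptions.
DEBUG_FLAGS_SKETCH: Dict[str, str] = {
    "CUDA_LAUNCH_BLOCKING": "1",
    "NCCL_DESYNC_DEBUG": "1",
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
    "TORCH_SHOW_CPP_STACKTRACES": "1",
}


def build_env(
    user_env: Optional[Dict[str, str]] = None, debug: bool = False
) -> Dict[str, str]:
    """Return the environment for a run, adding debug presets when asked."""
    env = dict(user_env or {})  # copy so the caller's dict is not mutated
    if debug:
        # In this sketch the presets overwrite any user-supplied values.
        env.update(DEBUG_FLAGS_SKETCH)
    return env


plain_env = build_env({"ENV1": "v1"})
debug_env = build_env({"ENV1": "v1"}, debug=True)
```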
