Distributed and Parallel Training Tutorials

Created On: Oct 04, 2022 | Last Updated: Oct 31, 2024 | Last Verified: Nov 05, 2024

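Distributed training is a model training paradigm that spreads the training workload across multiple worker nodes, which can significantly improve training speed and model accuracy. While distributed training can be used for any type of ML model training, it is most beneficial for large models and compute-demanding tasks such as deep learning.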

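There are a few ways to perform distributed training in PyTorch, each with its own advantages in certain use cases. The sections below cover DDP, FSDP, Tensor Parallel (TP), DeviceMesh, RPC, and custom extensions.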

Read more about these options in the Distributed Overview.

Learn DDP

DDP Intro Video Tutorials

A step-by-step video series on how to get started with DistributedDataParallel and advance to more complex topics

Code Video

https://pytorch.org/tutorials/beginner/ddp_series_intro.html?utm_source=distr_landing&utm_medium=ddp_series_intro

Getting Started with Distributed Data Parallel

This tutorial provides a short and gentle introduction to PyTorch DistributedDataParallel.

Code

https://pytorch.org/tutorials/intermediate/ddp_tutorial.html?utm_source=distr_landing&utm_medium=intermediate_ddp_tutorial
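
The core idea behind data-parallel training can be sketched without any PyTorch API. The example below is a plain-Python simulation, not DistributedDataParallel itself: each simulated rank computes a gradient on its own data shard, a stand-in `all_reduce_mean` averages the gradients, and every replica applies the same update. The function names, the toy model `y = w * x`, and the learning rate are assumptions made for illustration.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for y = w * x on one rank's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for a collective all-reduce: every rank receives the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def ddp_step(weights, shards, lr=0.01):
    """One synchronous step: local gradients, averaging, identical updates."""
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    synced = all_reduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, synced)]


# Toy dataset generated from y = 3 * x, split evenly across 4 simulated ranks.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
world_size = 4
shards = [data[rank::world_size] for rank in range(world_size)]

# Every replica starts from the same weight, as in data-parallel training.
weights = [0.0] * world_size
for _ in range(200):
    weights = ddp_step(weights, shards)
```

Because every replica starts from the same weight and applies the same averaged gradient, the replicas stay identical after every step. With equally sized shards, the averaged gradient also equals the full-batch gradient.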

Distributed Training with Uneven Inputs Using the Join Context Manager

This tutorial describes the Join context manager and demonstrates its use with DistributedDataParallel.

Code

https://pytorch.org/tutorials/advanced/generic_join.html?utm_source=distr_landing&utm_medium=generic_join
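
To see why uneven inputs need special handling, consider that a collective operation requires every rank to participate at every step. The sketch below is a plain-Python simulation of that situation and does not use the actual Join API: ranks that run out of batches keep participating with a placeholder value so the remaining ranks are not left waiting. The helper name `run_with_join` and the choice of zero as the placeholder contribution are assumptions for illustration.

```python
def run_with_join(batches_per_rank):
    """Simulate ranks with uneven batch counts sharing a per-step sum-reduce."""
    world_size = len(batches_per_rank)
    steps = max(len(batches) for batches in batches_per_rank)
    reduced = []
    for step in range(steps):
        contributions = []
        for rank in range(world_size):
            batches = batches_per_rank[rank]
            if step < len(batches):
                # Active rank: contributes its real value for this step.
                contributions.append(batches[step])
            else:
                # Rank that ran out of data: contributes a placeholder so
                # the simulated collective still has every participant.
                contributions.append(0.0)
        reduced.append(sum(contributions))
    return reduced


# Rank 0 has three batches, rank 1 has one, rank 2 has two.
totals = run_with_join([[1.0, 2.0, 3.0], [10.0], [100.0, 200.0]])
```

Without the placeholder branch, the ranks with more data would wait indefinitely on a collective that the finished rank never joins.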

Learn FSDP

Getting Started with FSDP

This tutorial demonstrates how to perform distributed training with FSDP on the MNIST dataset.

Code

https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html?utm_source=distr_landing&utm_medium=FSDP_getting_started
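
The sharding idea behind Fully Sharded Data Parallel can be illustrated with a small plain-Python sketch. This is not the FSDP API: it only shows a flat parameter list split into per-rank shards and then reassembled, which is roughly what happens when full parameters are gathered for computation. The helper names and the zero-padding scheme are assumptions for illustration.

```python
def shard_params(params, world_size):
    """Split a flat parameter list into equal contiguous shards, zero-padded."""
    per_rank = -(-len(params) // world_size)  # ceiling division
    padded = params + [0.0] * (per_rank * world_size - len(params))
    return [padded[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]


def all_gather(shards, original_len):
    """Stand-in for an all-gather: rebuild the full parameters, drop padding."""
    flat = [value for shard in shards for value in shard]
    return flat[:original_len]


params = [1.0, 2.0, 3.0, 4.0, 5.0]
shards = shard_params(params, world_size=2)   # each rank stores one shard
full = all_gather(shards, len(params))        # gathered only when needed
```

Each simulated rank stores only its own shard between steps, so per-rank parameter memory shrinks roughly in proportion to the number of ranks.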

Advanced FSDP

In this tutorial, you will learn how to fine-tune a HuggingFace (HF) T5 model with FSDP for text summarization.

Code

https://pytorch.org/tutorials/intermediate/FSDP_advanced_tutorial.html?utm_source=distr_landing&utm_medium=FSDP_advanced

Learn Tensor Parallel (TP)

Large Scale Transformer Model Training with Tensor Parallel (TP)

This tutorial demonstrates how to train a large Transformer-like model across hundreds to thousands of GPUs using Tensor Parallel and Fully Sharded Data Parallel.

Code

https://pytorch.org/tutorials/intermediate/TP_tutorial.html
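
As a rough illustration of how tensor parallelism splits a single layer, the sketch below partitions a weight matrix across simulated ranks in two ways and checks that both reproduce the unsharded result. It is plain Python and does not use the PyTorch Tensor Parallel API. The function names are hypothetical, and the sketch assumes the split dimension divides evenly by the number of ranks.

```python
def matmul(a, b):
    """Plain matrix multiply on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def column_parallel(x, w, world_size):
    """Each rank holds a slice of w's columns; outputs are concatenated."""
    per = len(w[0]) // world_size
    outs = [matmul(x, [row[r * per:(r + 1) * per] for row in w])
            for r in range(world_size)]
    return [sum((out[i] for out in outs), []) for i in range(len(x))]


def row_parallel(x, w, world_size):
    """Each rank holds a slice of w's rows; partial outputs are summed."""
    per = len(w) // world_size
    partials = [matmul([row[r * per:(r + 1) * per] for row in x],
                       w[r * per:(r + 1) * per])
                for r in range(world_size)]
    return [[sum(p[i][j] for p in partials) for j in range(len(w[0]))]
            for i in range(len(x))]


x = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0], [0, 1], [2, 1], [1, 2]]
reference = matmul(x, w)
col_result = column_parallel(x, w, world_size=2)
row_result = row_parallel(x, w, world_size=2)
```

Column-wise sharding needs a concatenation of per-rank outputs, while row-wise sharding needs a sum across ranks, which corresponds to a reduction in a real distributed setting.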

Learn DeviceMesh

Getting Started with DeviceMesh

In this tutorial, you will learn about DeviceMesh and how it can help with distributed training.

Code

https://pytorch.org/tutorials/recipes/distributed_device_mesh.html?highlight=devicemesh
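
One way to picture a device mesh is as a multi-dimensional grid of ranks, where each dimension defines a family of process groups. The plain-Python sketch below builds a 2D grid and lists the rank groups along each dimension. It does not call the DeviceMesh API, and the helper names and the row-major rank layout are assumptions for illustration.

```python
def build_mesh(shape):
    """Arrange ranks 0..N-1 into a 2D grid in row-major order."""
    rows, cols = shape
    return [[r * cols + c for c in range(cols)] for r in range(rows)]


def groups_along(mesh, dim):
    """Return the rank groups formed by varying only the given dimension."""
    if dim == 1:
        return [list(row) for row in mesh]        # ranks sharing a row
    return [list(col) for col in zip(*mesh)]      # ranks sharing a column


mesh = build_mesh((2, 4))            # 8 ranks as a 2 x 4 grid
dim0_groups = groups_along(mesh, 0)  # 4 groups of 2 ranks each
dim1_groups = groups_along(mesh, 1)  # 2 groups of 4 ranks each
```

In a hybrid setup, one dimension could carry one parallelism strategy and the other dimension a second one, which is the kind of bookkeeping a mesh abstraction is meant to simplify.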

Learn RPC

Getting Started with Distributed RPC Framework

This tutorial demonstrates how to get started with RPC-based distributed training.

Code

https://pytorch.org/tutorials/intermediate/rpc_tutorial.html?utm_source=distr_landing&utm_medium=rpc_getting_started

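Implementing a Parameter Server Using Distributed RPC Framework

This tutorial walks you through a simple example of implementing a parameter server using PyTorch's Distributed RPC framework.

Code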

https://pytorch.org/tutorials/intermediate/rpc_param_server_tutorial.html?utm_source=distr_landing&utm_medium=rpc_param_server_tutorial
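
The parameter server pattern can be sketched with threads standing in for remote workers. The example below is a single-process simulation using only the Python standard library, not the Distributed RPC framework: workers pull the current parameter and push gradients to one shared server object. The class name, method names, and values are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ParameterServer:
    """Holds one scalar parameter and applies gradients pushed by workers."""

    def __init__(self, initial, lr):
        self.param = initial
        self.lr = lr
        self._lock = threading.Lock()

    def pull(self):
        with self._lock:
            return self.param

    def push_gradient(self, grad):
        # The lock serializes concurrent updates from different workers.
        with self._lock:
            self.param -= self.lr * grad


def worker(server, grad):
    """A worker reads the parameter, then sends back its (fixed) gradient."""
    server.pull()
    server.push_gradient(grad)


server = ParameterServer(initial=10.0, lr=0.5)
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(lambda g: worker(server, g), [1.0, 2.0, 3.0, 4.0]))
final_param = server.pull()
```

In an RPC-based version, the calls to `pull` and `push_gradient` would be remote calls to the process hosting the server rather than local method calls.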

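Implementing Batch RPC Processing Using Asynchronous Executions

In this tutorial, you will build batch-processing RPC applications with the @rpc.functions.async_execution decorator.

Code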

https://pytorch.org/tutorials/intermediate/rpc_async_execution.html?utm_source=distr_landing&utm_medium=rpc_async_execution
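
The batching pattern behind asynchronous RPC handlers can be illustrated with standard-library futures. In the sketch below, each request immediately receives a future, and the server fulfils all pending futures once a full batch has arrived. It is a local simulation and does not use `@rpc.functions.async_execution`. The `BatchAdder` class and its batch-sum workload are hypothetical.

```python
from concurrent.futures import Future


class BatchAdder:
    """Collects requests and answers them together once a batch is full."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []

    def submit(self, value):
        # Return a future right away instead of blocking the caller.
        future = Future()
        self.pending.append((value, future))
        if len(self.pending) == self.batch_size:
            batch, self.pending = self.pending, []
            total = sum(v for v, _ in batch)  # one computation per batch
            for _, pending_future in batch:
                pending_future.set_result(total)
        return future


adder = BatchAdder(batch_size=3)
first = adder.submit(1)
second = adder.submit(2)
done_before_batch = first.done()   # batch not full yet
third = adder.submit(3)            # fills the batch and resolves all futures
results = [f.result() for f in (first, second, third)]
```

Returning a future lets the handler release its thread while the batch fills, so one expensive computation can serve several callers.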

Combining Distributed DataParallel with Distributed RPC Framework

In this tutorial, you will learn how to combine distributed data parallelism with distributed model parallelism.

Code

https://pytorch.org/tutorials/advanced/rpc_ddp_tutorial.html?utm_source=distr_landing&utm_medium=rpc_plus_ddp

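Custom Extensions

Customize Process Group Backends Using Cpp Extensions

In this tutorial, you will learn how to implement a custom ProcessGroup backend and integrate it using cpp extensions.

Code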

https://pytorch.org/tutorials/intermediate/process_group_cpp_extension_tutorial.html?utm_source=distr_landing&utm_medium=custom_extensions_cpp