
Distributed

Components for applications that run as distributed jobs. Many of the components in this section are simply topological, meaning that they define the layout of the nodes in a distributed setting and take the actual binaries that each group of nodes (specs.Role) runs.

torchx.components.dist.ddp(*script_args: str, image: str, entrypoint: str, rdzv_backend: Optional[str] = None, rdzv_endpoint: Optional[str] = None, resource: Optional[str] = None, nnodes: int = 1, nproc_per_node: int = 1, base_image: Optional[str] = None, name: str = 'test-name', role: str = 'worker', env: Optional[Dict[str, str]] = None) → torchx.specs.api.AppDef

Distributed data parallel style application (one role, multi-replica).

This uses Torch Elastic to manage the distributed workers.

Parameters
  • script_args – Script arguments.

  • image – Container image.

  • entrypoint – Script or binary to run within the image.

  • rdzv_backend – Rendezvous backend to use; allowed values can be found at https://github.com/pytorch/pytorch/blob/master/torch/distributed/elastic/rendezvous/registry.py

  • rdzv_endpoint – Controller endpoint. If rdzv_backend is etcd, this is an etcd endpoint; if it is c10d, this is the endpoint of one of the hosts.

  • resource – Optional named resource identifier. This parameter is ignored when running on the local scheduler.

  • nnodes – Number of nodes.

  • nproc_per_node – Number of processes per node.

  • base_image – Container base image (optional).

  • name – Name of the application.

  • role – Name of the ddp role.

  • env – Environment variables.

Returns

TorchX AppDef

Return type

specs.AppDef
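
Example: a minimal sketch of building an AppDef with this component, using the signature documented above. The image name, training script, and rendezvous endpoint are hypothetical placeholders.

    from torchx.components.dist import ddp

    # Build an AppDef for a 2-node, 4-processes-per-node elastic job.
    # "--epochs" and "10" are forwarded to the entrypoint as *script_args.
    app = ddp(
        "--epochs", "10",
        image="example.com/my_trainer:latest",  # hypothetical container image
        entrypoint="train.py",                  # hypothetical training script
        rdzv_backend="c10d",
        rdzv_endpoint="node0:29500",            # one of the participating hosts
        nnodes=2,
        nproc_per_node=4,
        name="trainer",
    )

The returned specs.AppDef describes one role with nnodes replicas; it can then be submitted to a scheduler, e.g. via the torchx CLI or the torchx.runner API.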
