Distributed
Components for applications that run as distributed jobs. Many of the
components in this section are purely topological: they define the layout
of the nodes in a distributed setting and take as arguments the actual
binaries that each group of nodes (specs.Role) runs.
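To make "topological" concrete, here is a minimal sketch of such a component. The trainer function is hypothetical (not part of TorchX): it fixes the layout (a single worker role with a configurable number of replicas) and takes the actual binary to run as an argument.

```python
from torchx import specs

# Hypothetical minimal component: it only describes *where* things run
# (one "worker" role with nnodes replicas); the binary that actually
# runs is supplied by the caller via `entrypoint`.
def trainer(entrypoint: str, image: str, nnodes: int = 1) -> specs.AppDef:
    return specs.AppDef(
        name="trainer",
        roles=[
            specs.Role(
                name="worker",
                image=image,
                entrypoint=entrypoint,
                num_replicas=nnodes,
            )
        ],
    )
```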
- torchx.components.dist.ddp(*script_args: str, image: str, entrypoint: str, rdzv_backend: Optional[str] = None, rdzv_endpoint: Optional[str] = None, resource: Optional[str] = None, nnodes: int = 1, nproc_per_node: int = 1, base_image: Optional[str] = None, name: str = 'test-name', role: str = 'worker', env: Optional[Dict[str, str]] = None) → torchx.specs.api.AppDef
Distributed data parallel style application (one role, multi-replica).
This uses Torch Elastic to manage the distributed workers.
- Parameters
script_args – Script arguments.
image – Container image.
entrypoint – Script or binary to run within the image.
rdzv_backend – Rendezvous backend to use; allowed values can be found at https://github.com/pytorch/pytorch/blob/master/torch/distributed/elastic/rendezvous/registry.py
rdzv_endpoint – Controller endpoint. If rdzv_backend is etcd, this is an etcd endpoint; if it is c10d, this is the endpoint of one of the hosts.
resource – Optional named resource identifier. This parameter is ignored when running on the local scheduler.
nnodes – Number of nodes.
nproc_per_node – Number of processes per node.
base_image – Container base image (optional).
name – Name of the application.
role – Name of the ddp role.
env – Environment variables.
- Returns
TorchX AppDef
- Return type
torchx.specs.api.AppDef
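A minimal usage sketch, assuming placeholder values for the image, rendezvous endpoint, and script arguments: calling ddp only builds the job spec, and the returned specs.AppDef can then be submitted through the TorchX runner or CLI.

```python
from torchx.components import dist

# Build a single-role, multi-replica AppDef for an elastic DDP job.
# The image, endpoint, and script below are placeholder values.
app = dist.ddp(
    "--epochs", "10",                    # *script_args, forwarded to the entrypoint
    image="my_registry/trainer:latest",  # container image holding the training code
    entrypoint="train.py",               # script to launch on every node
    rdzv_backend="c10d",                 # rendezvous backend (see registry.py link above)
    rdzv_endpoint="node0:29500",         # for c10d, the endpoint of one of the hosts
    nnodes=2,
    nproc_per_node=4,
    name="my-ddp-job",
)

# ddp() is purely declarative: it returns the spec without launching anything.
print(app.roles[0].name, app.roles[0].num_replicas)
```

Because the component is declarative, the same AppDef can be handed unchanged to any scheduler backend that TorchX supports.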