Distributed¶
For distributed training, TorchX relies on the scheduler’s gang scheduling capabilities to schedule n copies of nodes. Once launched, the application is expected to be written in a way that leverages this topology, for instance, with PyTorch’s DDP.
You can express a variety of node topologies with TorchX by specifying multiple torchx.specs.Role in your component’s AppDef. Each role maps to a homogeneous group of nodes that performs a “role” (function) in the overall training. Scheduling-wise, TorchX launches each role as a sub-gang.
A DDP-style training job has a single role: trainers. By contrast, a training job that uses parameter servers has two roles: parameter server and trainer. You can specify a different entrypoint (executable), number of replicas, resource requirements, and more for each role.
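For example, a two-role parameter-server topology could be sketched as follows (a minimal, hypothetical sketch; the image name, entrypoint modules, and resource figures are placeholders, not part of TorchX):

import torchx.specs as specs

# Hypothetical two-role AppDef: a sub-gang of parameter servers
# and a sub-gang of trainers.
app = specs.AppDef(
    name="ps_trainer",
    roles=[
        specs.Role(
            name="parameter_server",
            image="my_registry/my_image:latest",  # placeholder image
            entrypoint="python",
            args=["-m", "ps_main"],               # placeholder module
            num_replicas=2,
            resource=specs.Resource(cpu=4, gpu=0, memMB=8192),
        ),
        specs.Role(
            name="trainer",
            image="my_registry/my_image:latest",
            entrypoint="python",
            args=["-m", "trainer_main"],          # placeholder module
            num_replicas=4,
            resource=specs.Resource(cpu=8, gpu=1, memMB=16384),
        ),
    ],
)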
DDP Builtin¶
DDP-style trainers are common and easy to templatize since they are homogeneous single-role AppDefs, so there is a builtin: dist.ddp. Assuming your DDP training script is called main.py, launch it as:
# locally, 1 node x 4 workers
$ torchx run -s local_cwd dist.ddp -j 1x4 --script main.py

# locally, 2 nodes x 4 workers (8 total)
# (requires a local etcd server on port 2379 and the `python-etcd` library installed)
$ torchx run -s local_cwd dist.ddp \
    -j 2x4 \
    --rdzv_endpoint localhost:2379 \
    --script main.py

# remote (requires you to set up an etcd server first!)
$ torchx run -s kubernetes -cfg queue=default dist.ddp \
    -j 2x4 \
    --script main.py
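For reference, a minimal main.py that works with this launcher might look like the following (an illustrative sketch, not part of TorchX; torch.distributed.run sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables for each worker):

# main.py -- hypothetical minimal DDP training script
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torch.distributed.run has already set up the rendezvous env vars
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU hosts
    model = torch.nn.Linear(10, 10)
    ddp_model = DDP(model)
    loss = ddp_model(torch.randn(8, 10)).sum()
    loss.backward()  # gradients are all-reduced across workers here
    dist.destroy_process_group()

if __name__ == "__main__":
    main()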
There is a lot happening under the hood, so we strongly encourage you to continue reading the rest of this section to get an understanding of how everything works. Also note that while dist.ddp is convenient, you’ll find that authoring your own distributed component is not only easy (the simplest way is to just copy dist.ddp!) but also leads to better flexibility and maintainability down the road, since builtin APIs are subject to more changes than the more stable specs API. However, the choice is yours; feel free to rely on the builtins if they meet your needs.
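As a rough sketch of what such a component could look like (hypothetical names throughout; dist.ddp itself is the authoritative reference), a minimal single-role distributed component is simply a function that returns an AppDef:

import torchx.specs as specs

def my_ddp(
    script: str,
    image: str,
    nnodes: int = 1,
    nproc_per_node: int = 1,
    rdzv_endpoint: str = "localhost:29500",
) -> specs.AppDef:
    # Hypothetical single-role DDP component, loosely modeled on dist.ddp.
    # Each replica runs torch.distributed.run, which spawns the workers.
    return specs.AppDef(
        name="my_ddp",
        roles=[
            specs.Role(
                name="trainer",
                image=image,
                entrypoint="python",
                args=[
                    "-m", "torch.distributed.run",
                    "--rdzv_backend", "c10d",
                    "--rdzv_endpoint", rdzv_endpoint,
                    "--nnodes", str(nnodes),
                    "--nproc_per_node", str(nproc_per_node),
                    script,
                ],
                num_replicas=nnodes,
                resource=specs.Resource(cpu=4, gpu=0, memMB=8192),  # placeholder
            )
        ],
    )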
Distributed Training¶
Local Testing¶
Note
Please follow the prerequisites for running the examples first.
Running distributed training locally is a quick way to validate your training script. TorchX’s local scheduler will create a process per replica (--nnodes). The example below uses torchelastic as the main entrypoint of each node, which in turn spawns --nproc_per_node trainers. In total you’ll see nnodes * nproc_per_node trainer processes and nnodes elastic agent processes created on your local host (with --nnodes 2 --nproc_per_node 2 as below: 4 trainers and 2 agents).
$ torchx run -s local_docker ./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \
--nnodes 2 \
--nproc_per_node 2 \
--rdzv_backend c10d \
--rdzv_endpoint localhost:29500
Remote Launching¶
Note
Please follow the Prerequisites first.
The following example demonstrates launching the same job remotely on Kubernetes.
$ torchx run -s kubernetes -cfg queue=default \
./torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist \
--nnodes 2 \
--nproc_per_node 2 \
--rdzv_backend etcd \
--rdzv_endpoint etcd-server.default.svc.cluster.local:2379
torchx 2021-10-18 18:46:55 INFO Launched app: kubernetes://torchx/default:cv-trainer-pa2a7qgee9zng
torchx 2021-10-18 18:46:55 INFO AppStatus:
msg: <NONE>
num_restarts: -1
roles: []
state: PENDING (2)
structured_error_msg: <NONE>
ui_url: null
torchx 2021-10-18 18:46:55 INFO Job URL: None
Note that the only differences compared to the local launch are the scheduler (-s) and --rdzv_backend. etcd will also work in the local case, but we used c10d since it does not require additional setup. Note that this is a torchelastic requirement, not a TorchX one. Read more about rendezvous here.
Note
For GPU training, keep nproc_per_node equal to the number of GPUs on the host and change the resource requirements in the torchx/examples/apps/lightning_classy_vision/component.py:trainer_dist method. Modify resource_def to match the number of GPUs on your host.
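For example, for a host with four GPUs, the resource definition could look like this (a hypothetical sketch; the cpu and memMB values are placeholders to adjust for your hardware):

import torchx.specs as specs

# Hypothetical resource definition for a 4-GPU host;
# cpu and memMB are placeholders.
resource_def = specs.Resource(cpu=16, gpu=4, memMB=65536)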
Components APIs¶
- torchx.components.dist.ddp(*script_args: str, script: str, image: str = 'ghcr.io/pytorch/torchx:0.1.1dev0', name: Optional[str] = None, h: str = 'aws_t3.medium', j: str = '1x2', rdzv_endpoint: str = 'etcd-server.default.svc.cluster.local:2379') → torchx.specs.api.AppDef[source]¶
Distributed data parallel style application (one role, multi-replica). Uses torch.distributed.run to launch and coordinate pytorch worker processes.
- Parameters
script_args – arguments to the main module
script – script or binary to run within the image
image – container image (e.g. a Docker image)
name – job name override (uses the script name if not specified)
h – a registered named resource
j – {nnodes}x{nproc_per_node}; for GPU hosts, nproc_per_node must not exceed the number of GPUs
rdzv_endpoint – etcd server endpoint (only matters when nnodes > 1)
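As a usage sketch, the component can also be invoked programmatically when authoring workflows (the argument values here are illustrative, assuming a main.py in your image):

from torchx.components.dist import ddp

# Build an AppDef for a 2-node x 4-worker DDP job; values are illustrative.
app = ddp(
    "--epochs", "10",   # forwarded to main.py as script_args
    script="main.py",
    j="2x4",
    rdzv_endpoint="localhost:2379",
)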