Base¶
These components are meant to be used as convenience methods when constructing other components. Many methods in the base component library are factory methods for Role, Container, and Resources that are hooked up to TorchX’s configurable extension points.
- torchx.components.base.binary_component.binary_component(name: str, image: str, entrypoint: str, args: Optional[List[str]] = None, env: Optional[Dict[str, str]] = None, resource: torchx.specs.api.Resource = Resource(cpu=-1, gpu=-1, memMB=-1, capabilities={})) → torchx.specs.api.AppDef[source]¶
binary_component creates a single binary component from the provided arguments.
>>> from torchx.components.base.binary_component import binary_component
>>> binary_component(
...     name="datapreproc",
...     image="pytorch/pytorch:latest",
...     entrypoint="python3",
...     args=["--version"],
...     env={
...         "FOO": "bar",
...     },
... )
AppDef(name='datapreproc', ...)
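To make the shape of the return value concrete, here is a minimal, self-contained sketch of what binary_component assembles: a single Role wrapped in an AppDef. The Role, AppDef, and binary_component_sketch names below are simplified stand-ins for illustration, not the real torchx.specs.api classes, which carry additional fields such as resource and num_replicas.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Simplified stand-ins for torchx.specs.api.Role and AppDef
# (illustrative only; the real classes have more fields).
@dataclass
class Role:
    name: str
    image: str
    entrypoint: str
    args: List[str] = field(default_factory=list)
    env: Dict[str, str] = field(default_factory=dict)


@dataclass
class AppDef:
    name: str
    roles: List[Role] = field(default_factory=list)


def binary_component_sketch(name, image, entrypoint, args=None, env=None):
    # binary_component effectively builds a one-role AppDef from its arguments.
    role = Role(name=name, image=image, entrypoint=entrypoint,
                args=args or [], env=env or {})
    return AppDef(name=name, roles=[role])


app = binary_component_sketch(
    name="datapreproc",
    image="pytorch/pytorch:latest",
    entrypoint="python3",
    args=["--version"],
    env={"FOO": "bar"},
)
```

The key point is that the component is pure construction: it performs no validation or scheduling itself, it only packages the arguments into an AppDef for a scheduler to consume.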
- torchx.components.base.roles.create_torch_dist_role(name: str, image: str, entrypoint: str, resource: torchx.specs.api.Resource = Resource(cpu=-1, gpu=-1, memMB=-1, capabilities={}), base_image: Optional[str] = None, args: Optional[List[str]] = None, env: Optional[Dict[str, str]] = None, num_replicas: int = 1, max_retries: int = 0, port_map: Dict[str, int] = <factory: dict>, retry_policy: torchx.specs.api.RetryPolicy = <RetryPolicy.APPLICATION: 'APPLICATION'>, **launch_kwargs: Any) → torchx.specs.api.Role[source]¶
A Role for which the user-provided entrypoint is executed with the torchelastic agent (in the container). Note that the torchelastic agent invokes multiple copies of entrypoint.

For more information about torchelastic, see the torchelastic quickstart docs.
Important
It is the responsibility of the user to ensure that the role’s image includes torchelastic. Since TorchX has no control over the build process of the image, it cannot automatically include torchelastic in the role’s image.
The following example launches 2 replicas (nodes) of an elastic my_train_script.py that is allowed to scale between 2 to 4 nodes. Each node runs 8 workers, which are allowed to fail and restart a maximum of 3 times.

Warning

num_replicas MUST BE an integer between (inclusive) nnodes. That is, Role("trainer", nnodes="2:4", num_replicas=5) is invalid and will result in undefined behavior.

>>> from torchx.components.base.roles import create_torch_dist_role
>>> from torchx.specs.api import NULL_RESOURCE
>>> elastic_trainer = create_torch_dist_role(
...     name="trainer",
...     image="<NONE>",
...     resource=NULL_RESOURCE,
...     entrypoint="my_train_script.py",
...     args=["--script_arg", "foo", "--another_arg", "bar"],
...     num_replicas=4, max_retries=1,
...     nproc_per_node=8, nnodes="2:4", max_restarts=3)
... # effectively runs:
... #    python -m torch.distributed.launch
... #        --nproc_per_node 8
... #        --nnodes 2:4
... #        --max_restarts 3
... #        my_train_script.py --script_arg foo --another_arg bar
>>> elastic_trainer
Role(name='trainer', ...)
- Parameters
name – Name of the role
entrypoint – User binary or python script that will be launched.
resource – Resource that is requested by the scheduler
base_image – Optional base image, if schedulers support image overlay
args – User provided arguments
env – Environment variables that will be set on the worker process that runs the entrypoint
num_replicas – Number of role replicas to run
max_retries – Max number of retries
port_map – Port mapping for the role
retry_policy – Retry policy that is applied to the role
launch_kwargs – kwarg-style launch arguments that will be used to launch the torchelastic agent
- Returns
Role object that launches the user entrypoint via torchelastic as a proxy
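The "effectively runs" comment in the example above can be expressed as a small helper that assembles the launcher command line from the kwarg-style launch arguments. This is an illustrative sketch only: torch_dist_launch_cmd is a hypothetical name, not part of the TorchX API, and the exact agent flags may vary across torchelastic versions.

```python
from typing import List


def torch_dist_launch_cmd(entrypoint: str, script_args: List[str],
                          nproc_per_node: int = 8, nnodes: str = "2:4",
                          max_restarts: int = 3) -> List[str]:
    # Hypothetical helper mirroring the "effectively runs" comment above:
    # the role proxies the user entrypoint through torch.distributed.launch,
    # forwarding the kwarg-style launch arguments as agent flags.
    return [
        "python", "-m", "torch.distributed.launch",
        "--nproc_per_node", str(nproc_per_node),
        "--nnodes", str(nnodes),
        "--max_restarts", str(max_restarts),
        entrypoint,
    ] + script_args


cmd = torch_dist_launch_cmd(
    "my_train_script.py",
    ["--script_arg", "foo", "--another_arg", "bar"],
)
print(" ".join(cmd))
```

Note that the user's script arguments are appended after the entrypoint, so from the agent's perspective they are opaque and are passed through to each of the nproc_per_node worker copies unchanged.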