Quickstart

This is a self contained guide on how to write a simple app and start launching distributed jobs on local and remote clusters.

Installation

The first thing we need to do is install the TorchX python package, which includes the CLI and the library.

# install torchx with all dependencies
$ pip install torchx[dev]

See the README for more information on installation.

[1]:
%%sh
torchx --help
usage: torchx [-h] [--log_level LOG_LEVEL] [--version]
              {builtins,cancel,configure,describe,list,log,run,runopts,status,tracker}
              ...

torchx CLI

optional arguments:
  -h, --help            show this help message and exit
  --log_level LOG_LEVEL
                        Python logging log level
  --version             show program's version number and exit

sub-commands:
  Use the following commands to run operations, e.g.: torchx run ${JOB_NAME}

  {builtins,cancel,configure,describe,list,log,run,runopts,status,tracker}

Hello World

Let's start off with writing a simple "Hello World" python app. This is just a normal python program and can contain anything you'd like.

Note

This example uses the Jupyter Notebook %%writefile magic to create local files for demonstration purposes. Under normal usage you would have these as standalone files.

[2]:
%%writefile my_app.py

import sys

print(f"Hello, {sys.argv[1]}!")
Overwriting my_app.py
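Before launching it through TorchX, you can sanity check the script with plain Python; the expected greeting comes straight from the print statement above:

```
$ python my_app.py "your name"
Hello, your name!
```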

Launching

We can execute our app via torchx run. The local_cwd scheduler executes the app relative to the current directory.

For this we'll use the utils.python component:

[3]:
%%sh
torchx run --scheduler local_cwd utils.python --help
usage: torchx run <run args...> python  [--help] [-m M] [-c C]
                                        [--script SCRIPT] [--image IMAGE]
                                        [--name NAME] [--cpu CPU] [--gpu GPU]
                                        [--memMB MEMMB] [-h H]
                                        [--num_replicas NUM_REPLICAS]
                                        ...

Runs ``python`` with the specified module, command or script on the specified
...

positional arguments:
  args                  arguments passed to the program in sys.argv[1:]
                        (ignored with `--c`) (required)

optional arguments:
  --help                show this help message and exit
  -m M, --m M           run library module as a script (default: None)
  -c C, --c C           program passed as string (may error if scheduler has a
                        length limit on args) (default: None)
  --script SCRIPT       .py script to run (default: None)
  --image IMAGE         image to run on (default:
                        ghcr.io/pytorch/torchx:0.4.0dev0)
  --name NAME           name of the job (default: torchx_utils_python)
  --cpu CPU             number of cpus per replica (default: 1)
  --gpu GPU             number of gpus per replica (default: 0)
  --memMB MEMMB         cpu memory in MB per replica (default: 1024)
  -h H, --h H           a registered named resource (if specified takes
                        precedence over cpu, gpu, memMB) (default: None)
  --num_replicas NUM_REPLICAS
                        number of copies to run (each on its own container)
                        (default: 1)

The component takes in the script name and any extra arguments, which are passed to the script itself.

[4]:
%%sh
torchx run --scheduler local_cwd utils.python --script my_app.py "your name"
torchx 2022-12-29 22:58:54 INFO     Log directory not set in scheduler cfg. Creating a temporary log dir that will be deleted on exit. To preserve log directory set the `log_dir` cfg option
torchx 2022-12-29 22:58:54 INFO     Log directory is: /tmp/torchx_ywy1xi2_
torchx 2022-12-29 22:58:54 INFO     Waiting for the app to finish...
python/0 Hello, your name!
torchx 2022-12-29 22:58:55 INFO     Job finished: SUCCEEDED
local_cwd://torchx/torchx_utils_python-mmqvsj5qtplp9

We can run the exact same app via the local_docker scheduler. This scheduler will package up the local workspace as a layer on top of the specified image. This provides a very similar environment to the container based remote schedulers.

Note

This requires Docker to be installed and won't work in environments such as Google Colab. See the Docker install instructions: https://docs.docker.com/get-docker/

[5]:
%%sh
torchx run --scheduler local_docker utils.python --script my_app.py "your name"
torchx 2022-12-29 22:58:55 INFO     Checking for changes in workspace `file:///home/runner/work/torchx/torchx/docs/source`...
torchx 2022-12-29 22:58:55 INFO     To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2022-12-29 22:58:55 INFO     Workspace `file:///home/runner/work/torchx/torchx/docs/source` resolved to filesystem path `/home/runner/work/torchx/torchx/docs/source`
torchx 2022-12-29 22:58:56 INFO     Building workspace docker image (this may take a while)...
torchx 2022-12-29 22:59:06 INFO     Built new image `sha256:534b196968980a14822b00fb7a1ba296460b4d1f8dbd05981ef92d2b17587754` based on original image `ghcr.io/pytorch/torchx:0.4.0dev0` and changes in workspace `file:///home/runner/work/torchx/torchx/docs/source` for role[0]=python.
torchx 2022-12-29 22:59:06 INFO     Waiting for the app to finish...
python/0 Hello, your name!
torchx 2022-12-29 22:59:07 INFO     Job finished: SUCCEEDED
local_docker://torchx/torchx_utils_python-p56rntgrdw5dmd

TorchX defaults to using the ghcr.io/pytorch/torchx Docker container image, which contains the PyTorch libraries, TorchX and related dependencies.

Distributed

TorchX's dist.ddp component uses TorchElastic to manage the workers. This means you can launch multi worker and multi host jobs out of the box on all of the schedulers we support.

[6]:
%%sh
torchx run --scheduler local_docker dist.ddp --help
usage: torchx run <run args...> ddp  [--help] [--script SCRIPT] [-m M]
                                     [--image IMAGE] [--name NAME] [-h H]
                                     [--cpu CPU] [--gpu GPU] [--memMB MEMMB]
                                     [-j J] [--env ENV]
                                     [--max_retries MAX_RETRIES]
                                     [--rdzv_port RDZV_PORT] [--mounts MOUNTS]
                                     [--debug DEBUG]
                                     ...

Distributed data parallel style application (one role, multi-replica). ...

positional arguments:
  script_args           arguments to the main module (required)

optional arguments:
  --help                show this help message and exit
  --script SCRIPT       script or binary to run within the image (default:
                        None)
  -m M, --m M           the python module path to run (default: None)
  --image IMAGE         image (e.g. docker) (default:
                        ghcr.io/pytorch/torchx:0.4.0dev0)
  --name NAME           job name override (uses the script name if not
                        specified) (default: None)
  -h H, --h H           a registered named resource (if specified takes
                        precedence over cpu, gpu, memMB) (default: None)
  --cpu CPU             number of cpus per replica (default: 2)
  --gpu GPU             number of gpus per replica (default: 0)
  --memMB MEMMB         cpu memory in MB per replica (default: 1024)
  -j J, --j J           [{min_nnodes}:]{nnodes}x{nproc_per_node}, for gpu
                        hosts, nproc_per_node must not exceed num gpus
                        (default: 1x2)
  --env ENV             environment varibles to be passed to the run (e.g.
                        ENV1=v1,ENV2=v2,ENV3=v3) (default: None)
  --max_retries MAX_RETRIES
                        the number of scheduler retries allowed (default: 0)
  --rdzv_port RDZV_PORT
                        the port on rank0's host to use for hosting the c10d
                        store used for rendezvous. Only takes effect when
                        running multi-node. When running single node, this
                        parameter is ignored and a random free port is chosen.
                        (default: 29500)
  --mounts MOUNTS       mounts to mount into the worker environment/container
                        (ex.
                        type=<bind/volume>,src=/host,dst=/job[,readonly]). See
                        scheduler documentation for more info. (default: None)
  --debug DEBUG         whether to run with preset debug flags enabled
                        (default: False)

Let's create a slightly more interesting app to leverage the TorchX distributed support.

[7]:
%%writefile dist_app.py

import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
print(f"I am worker {dist.get_rank()} of {dist.get_world_size()}!")

a = torch.tensor([dist.get_rank()])
dist.all_reduce(a)
print(f"all_reduce output = {a}")
Writing dist_app.py

Let's launch a small job with 2 nodes and 2 worker processes per node:

[8]:
%%sh
torchx run --scheduler local_docker dist.ddp -j 2x2 --script dist_app.py
torchx 2022-12-29 22:59:09 INFO     Checking for changes in workspace `file:///home/runner/work/torchx/torchx/docs/source`...
torchx 2022-12-29 22:59:09 INFO     To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2022-12-29 22:59:09 INFO     Workspace `file:///home/runner/work/torchx/torchx/docs/source` resolved to filesystem path `/home/runner/work/torchx/torchx/docs/source`
torchx 2022-12-29 22:59:09 INFO     Building workspace docker image (this may take a while)...
torchx 2022-12-29 22:59:19 INFO     Built new image `sha256:5698b4d1cccf5c40e3a1d36326ab13ef63065f4813b7c31b44d9843cbb3d4abe` based on original image `ghcr.io/pytorch/torchx:0.4.0dev0` and changes in workspace `file:///home/runner/work/torchx/torchx/docs/source` for role[0]=dist_app.
torchx 2022-12-29 22:59:20 INFO     Waiting for the app to finish...
dist_app/0 WARNING:__main__:
dist_app/0 *****************************************
dist_app/0 Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
dist_app/0 *****************************************
dist_app/1 WARNING:__main__:
dist_app/1 *****************************************
dist_app/1 Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
dist_app/1 *****************************************
dist_app/0 [1]:I am worker 1 of 4!
dist_app/1 [1]:I am worker 3 of 4!
dist_app/1 [1]:all_reduce output = tensor([6])
dist_app/1 [0]:I am worker 2 of 4!
dist_app/1 [0]:all_reduce output = tensor([6])
dist_app/0 [1]:all_reduce output = tensor([6])
dist_app/0 [0]:I am worker 0 of 4!
dist_app/0 [0]:all_reduce output = tensor([6])
torchx 2022-12-29 22:59:29 INFO     Job finished: SUCCEEDED
local_docker://torchx/dist_app-z04qllzfcnk7p
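A quick sanity check on the all_reduce output above: with -j 2x2 there are 4 workers, and all_reduce defaults to summing the tensors across all ranks, so every replica ends up with the sum of ranks 0 through 3:

```python
# 2 nodes x 2 procs per node = 4 workers, each contributing its rank
world_size = 2 * 2
print(sum(range(world_size)))  # 0 + 1 + 2 + 3 = 6, matching tensor([6]) above
```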

Workspaces / Patching

For each scheduler there's a concept of an image. For local_cwd and slurm it uses the current working directory. For container based schedulers such as local_docker, kubernetes and aws_batch it uses a Docker container.

To provide the same environment between local and remote jobs, the TorchX CLI uses workspaces to automatically patch images for remote jobs on a per scheduler basis.

When you launch a job via torchx run, it will overlay the current directory on top of the provided image so your code is available in the launched job.

For Docker based schedulers, you'll need a local Docker daemon to build and push the image to your remote Docker repository.

.torchxconfig

Arguments to schedulers can be passed via command line flags to torchx run -s <scheduler> -cfg <args> or on a per scheduler basis via a .torchxconfig file.

[9]:
%%writefile .torchxconfig

[kubernetes]
queue=torchx
image_repo=<your docker image repository>

[slurm]
partition=torchx
Writing .torchxconfig
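.torchxconfig is a plain INI file, so as a quick check (not part of the TorchX API) you can inspect what a section resolves to with Python's standard configparser:

```python
import configparser

# same sections as the .torchxconfig written above (placeholder values omitted)
cp = configparser.ConfigParser()
cp.read_string("""
[kubernetes]
queue=torchx

[slurm]
partition=torchx
""")
print(cp["slurm"]["partition"])  # torchx
```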

Remote Schedulers

TorchX supports a large number of schedulers. Don't see yours? Request it!

Remote schedulers operate the exact same way the local schedulers do. The same run command used locally works out of the box on remote.

$ torchx run --scheduler slurm dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler kubernetes dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler aws_batch dist.ddp -j 2x2 --script dist_app.py
$ torchx run --scheduler ray dist.ddp -j 2x2 --script dist_app.py

Depending on the scheduler there may be a few extra configuration parameters so TorchX knows where to run the job and where to upload the built image. These can either be set via -cfg or in the per scheduler .torchxconfig file.
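For example, the kubernetes queue from the .torchxconfig above can equivalently be passed inline (illustrative only; the queue name and any other values depend on your cluster setup):

```
$ torchx run -s kubernetes -cfg queue=torchx dist.ddp -j 2x2 --script dist_app.py
```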

All the configuration options:

[10]:
%%sh
torchx runopts
local_docker:
    usage:
        [copy_env=COPY_ENV],[image_repo=IMAGE_REPO]

    optional arguments:
        copy_env=COPY_ENV (typing.List[str], None)
            list of glob patterns of environment variables to copy if not set in AppDef. Ex: FOO_*
        image_repo=IMAGE_REPO (str, None)
            (remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container

local_cwd:
    usage:
        [log_dir=LOG_DIR],[prepend_cwd=PREPEND_CWD],[auto_set_cuda_visible_devices=AUTO_SET_CUDA_VISIBLE_DEVICES]

    optional arguments:
        log_dir=LOG_DIR (str, None)
            dir to write stdout/stderr log files of replicas
        prepend_cwd=PREPEND_CWD (bool, False)
            if set, prepends CWD to replica's PATH env var making any binaries in CWD take precedence over those in PATH
        auto_set_cuda_visible_devices=AUTO_SET_CUDA_VISIBLE_DEVICES (bool, False)
            sets the `CUDA_AVAILABLE_DEVICES` for roles that request GPU resources. Each role replica will be assigned one GPU. Does nothing if the device count is less than replicas.

slurm:
    usage:
        [partition=PARTITION],[time=TIME],[comment=COMMENT],[constraint=CONSTRAINT],[mail-user=MAIL-USER],[mail-type=MAIL-TYPE],[job_dir=JOB_DIR]

    optional arguments:
        partition=PARTITION (str, None)
            The partition to run the job in.
        time=TIME (str, None)
            The maximum time the job is allowed to run for. Formats:             "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours",             "days-hours:minutes" or "days-hours:minutes:seconds"
        comment=COMMENT (str, None)
            Comment to set on the slurm job.
        constraint=CONSTRAINT (str, None)
            Constraint to use for the slurm job.
        mail-user=MAIL-USER (str, None)
            User to mail on job end.
        mail-type=MAIL-TYPE (str, None)
            What events to mail users on.
        job_dir=JOB_DIR (str, None)
            The directory to place the job code and outputs. The
            directory must not exist and will be created. To enable log
            iteration, jobs will be tracked in ``.torchxslurmjobdirs``.


kubernetes:
    usage:
        queue=QUEUE,[namespace=NAMESPACE],[service_account=SERVICE_ACCOUNT],[priority_class=PRIORITY_CLASS],[image_repo=IMAGE_REPO]

    required arguments:
        queue=QUEUE (str)
            Volcano queue to schedule job in

    optional arguments:
        namespace=NAMESPACE (str, default)
            Kubernetes namespace to schedule job in
        service_account=SERVICE_ACCOUNT (str, None)
            The service account name to set on the pod specs
        priority_class=PRIORITY_CLASS (str, None)
            The name of the PriorityClass to set on the job specs
        image_repo=IMAGE_REPO (str, None)
            (remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container

aws_batch:
    usage:
        queue=QUEUE,[user=USER],[share_id=SHARE_ID],[priority=PRIORITY],[image_repo=IMAGE_REPO]

    required arguments:
        queue=QUEUE (str)
            queue to schedule job in

    optional arguments:
        user=USER (str, runner)
            The username to tag the job with. `getpass.getuser()` if not specified.
        share_id=SHARE_ID (str, None)
            The share identifier for the job. This must be set if and only if the job queue has a scheduling policy.
        priority=PRIORITY (int, None)
            The scheduling priority for the job within the context of share_id. Higher number (between 0 and 9999) means higher priority. This will only take effect if the job queue has a scheduling policy.
        image_repo=IMAGE_REPO (str, None)
            (remote jobs) the image repository to use when pushing patched images, must have push access. Ex: example.com/your/container

gcp_batch:
    usage:
        [project=PROJECT],[location=LOCATION]

    optional arguments:
        project=PROJECT (str, None)
            Name of the GCP project. Defaults to the configured GCP project in the environment
        location=LOCATION (str, us-central1)
            Name of the location to schedule the job in. Defaults to us-central1

ray:
    usage:
        [cluster_config_file=CLUSTER_CONFIG_FILE],[cluster_name=CLUSTER_NAME],[dashboard_address=DASHBOARD_ADDRESS],[requirements=REQUIREMENTS]

    optional arguments:
        cluster_config_file=CLUSTER_CONFIG_FILE (str, None)
            Use CLUSTER_CONFIG_FILE to access or create the Ray cluster.
        cluster_name=CLUSTER_NAME (str, None)
            Override the configured cluster name.
        dashboard_address=DASHBOARD_ADDRESS (str, 127.0.0.1:8265)
            Use ray status to get the dashboard address you will submit jobs against
        requirements=REQUIREMENTS (str, None)
            Path to requirements.txt

lsf:
    usage:
        [lsf_queue=LSF_QUEUE],[jobdir=JOBDIR],[container_workdir=CONTAINER_WORKDIR],[host_network=HOST_NETWORK],[shm_size=SHM_SIZE]

    optional arguments:
        lsf_queue=LSF_QUEUE (str, None)
            queue name to submit jobs
        jobdir=JOBDIR (str, None)
            The directory to place the job code and outputs. The directory must not exist and will be created.
        container_workdir=CONTAINER_WORKDIR (str, None)
            working directory in container jobs
        host_network=HOST_NETWORK (bool, False)
            True if using the host network for jobs
        shm_size=SHM_SIZE (str, 64m)
            size of shared memory (/dev/shm) for jobs

Custom Images

Docker-based Schedulers

If you need more than the standard PyTorch libraries, you can add a custom Dockerfile or build your own Docker container and use it as the base image for your TorchX jobs.

[11]:
%%writefile timm_app.py

import timm

print(timm.models.resnet18())
Writing timm_app.py
[12]:
%%writefile Dockerfile.torchx

FROM pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime

RUN pip install timm

COPY . .
Writing Dockerfile.torchx

Once we have the Dockerfile created, we can launch as normal and TorchX will automatically build the image with the newly provided Dockerfile instead of the default one.

[13]:
%%sh
torchx run --scheduler local_docker utils.python --script timm_app.py
torchx 2022-12-29 22:59:31 INFO     loaded configs from /home/runner/work/torchx/torchx/docs/source/.torchxconfig
torchx 2022-12-29 22:59:31 INFO     Checking for changes in workspace `file:///home/runner/work/torchx/torchx/docs/source`...
torchx 2022-12-29 22:59:31 INFO     To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2022-12-29 22:59:31 INFO     Workspace `file:///home/runner/work/torchx/torchx/docs/source` resolved to filesystem path `/home/runner/work/torchx/torchx/docs/source`
torchx 2022-12-29 22:59:32 INFO     Building workspace docker image (this may take a while)...
torchx 2022-12-29 23:01:02 INFO     Built new image `sha256:953cbaf303631d338b5191b423e2dcfba2e45396ff8e985f838aaafad9d965ec` based on original image `ghcr.io/pytorch/torchx:0.4.0dev0` and changes in workspace `file:///home/runner/work/torchx/torchx/docs/source` for role[0]=python.
torchx 2022-12-29 23:01:02 INFO     Waiting for the app to finish...
python/0 ResNet(
python/0   (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
python/0   (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0   (act1): ReLU(inplace=True)
python/0   (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
python/0   (layer1): Sequential(
python/0     (0): BasicBlock(
python/0       (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (drop_block): Identity()
python/0       (act1): ReLU(inplace=True)
python/0       (aa): Identity()
python/0       (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (act2): ReLU(inplace=True)
python/0     )
python/0     (1): BasicBlock(
python/0       (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (drop_block): Identity()
python/0       (act1): ReLU(inplace=True)
python/0       (aa): Identity()
python/0       (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (act2): ReLU(inplace=True)
python/0     )
python/0   )
python/0   (layer2): Sequential(
python/0     (0): BasicBlock(
python/0       (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
python/0       (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (drop_block): Identity()
python/0       (act1): ReLU(inplace=True)
python/0       (aa): Identity()
python/0       (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (act2): ReLU(inplace=True)
python/0       (downsample): Sequential(
python/0         (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), bias=False)
python/0         (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       )
python/0     )
python/0     (1): BasicBlock(
python/0       (conv1): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (drop_block): Identity()
python/0       (act1): ReLU(inplace=True)
python/0       (aa): Identity()
python/0       (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (act2): ReLU(inplace=True)
python/0     )
python/0   )
python/0   (layer3): Sequential(
python/0     (0): BasicBlock(
python/0       (conv1): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
python/0       (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (drop_block): Identity()
python/0       (act1): ReLU(inplace=True)
python/0       (aa): Identity()
python/0       (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (act2): ReLU(inplace=True)
python/0       (downsample): Sequential(
python/0         (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(2, 2), bias=False)
python/0         (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       )
python/0     )
python/0     (1): BasicBlock(
python/0       (conv1): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (drop_block): Identity()
python/0       (act1): ReLU(inplace=True)
python/0       (aa): Identity()
python/0       (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (act2): ReLU(inplace=True)
python/0     )
python/0   )
python/0   (layer4): Sequential(
python/0     (0): BasicBlock(
python/0       (conv1): Conv2d(256, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
python/0       (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (drop_block): Identity()
python/0       (act1): ReLU(inplace=True)
python/0       (aa): Identity()
python/0       (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (act2): ReLU(inplace=True)
python/0       (downsample): Sequential(
python/0         (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
python/0         (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       )
python/0     )
python/0     (1): BasicBlock(
python/0       (conv1): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (drop_block): Identity()
python/0       (act1): ReLU(inplace=True)
python/0       (aa): Identity()
python/0       (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
python/0       (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
python/0       (act2): ReLU(inplace=True)
python/0     )
python/0   )
python/0   (global_pool): SelectAdaptivePool2d (pool_type=avg, flatten=Flatten(start_dim=1, end_dim=-1))
python/0   (fc): Linear(in_features=512, out_features=1000, bias=True)
python/0 )
torchx 2022-12-29 23:01:04 INFO     Job finished: SUCCEEDED
local_docker://torchx/torchx_utils_python-cdg0jxk7wwkmrc

Slurm

The slurm and local_cwd schedulers use the current environment, so you can use pip and conda normally.

Next steps

  1. Check out other features of the torchx CLI

  2. Take a look at the list of schedulers supported by the runner

  3. Browse through the collection of builtin components

  4. See which ML pipeline platforms you can run components on

  5. See a training app example
