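Utils

This contains TorchX utility components that are ready to use out of the box. These components simply execute well-known binaries (e.g. cp) and are meant to be used as tutorial material or as glue operations between meaningful stages in a workflow.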

torchx.components.utils.echo(msg: str = 'hello world', image: str = 'ghcr.io/pytorch/torchx:0.8.0dev0', num_replicas: int = 1) → AppDef

Echoes a message to stdout (calls echo).

Parameters

msg – message to echo
image – image to use
num_replicas – number of replicas to run
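The `torchx run` examples elsewhere on this page pass component parameters as command-line flags (for instance `--image`). The following is a minimal sketch, assuming the same naming applies to `echo`, so that `msg` becomes `--msg`. `build_run_command` is a hypothetical helper, not part of TorchX; it only assembles an argument list and does not launch anything.

```python
import shlex
from typing import List


def build_run_command(component: str, **params: object) -> List[str]:
    """Assemble a `torchx run` argv (hypothetical helper, nothing is executed).

    Assumption: each component parameter maps to a `--<name>` flag.
    """
    argv = ["torchx", "run", component]
    for name, value in params.items():
        argv += [f"--{name}", str(value)]
    return argv


cmd = build_run_command("utils.echo", msg="hello world", num_replicas=2)
# shlex.join quotes arguments that contain spaces so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Only string and integer parameters are handled here; list and dictionary parameters would need their own formatting.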

torchx.components.utils.touch(file: str, image: str = 'ghcr.io/pytorch/torchx:0.8.0dev0') → AppDef

Touches a file (calls touch).

Parameters

file – file to create
image – image to use

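torchx.components.utils.sh(*args: str, image: str = 'ghcr.io/pytorch/torchx:0.8.0dev0', num_replicas: int = 1, cpu: int = 1, gpu: int = 0, memMB: int = 1024, h: Optional[str] = None, env: Optional[Dict[str, str]] = None, max_retries: int = 0, mounts: Optional[List[str]] = None) → AppDef

Runs the provided command via sh. Currently sh does not support environment variable substitution.

Parameters

args – bash arguments
image – image to use
num_replicas – number of replicas to run
cpu – number of CPUs per replica
gpu – number of GPUs per replica
memMB – CPU memory per replica, in MB
h – a registered named resource (if specified, takes precedence over cpu, gpu, memMB)
env – environment variables to pass to the run (e.g. ENV1=v1,ENV2=v2,ENV3=v3)
max_retries – number of scheduler retries allowed
mounts – mounts to mount into the worker environment/container (e.g. type=<bind/volume>,src=/host,dst=/job[,readonly]). See the scheduler documentation for more information.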
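The `env` and `mounts` descriptions above each give a comma-separated string format. The sketch below produces strings in those two formats. `format_env` and `format_mount` are hypothetical helpers introduced for illustration, and the sketch accepts only the two mount types named above.

```python
from typing import Dict


def format_env(env: Dict[str, str]) -> str:
    """Render environment variables as `K1=v1,K2=v2` (hypothetical helper)."""
    return ",".join(f"{key}={value}" for key, value in env.items())


def format_mount(kind: str, src: str, dst: str, readonly: bool = False) -> str:
    """Render one mount as `type=<bind/volume>,src=...,dst=...[,readonly]`."""
    if kind not in ("bind", "volume"):
        # Assumption: this sketch only knows the two types listed in the text.
        raise ValueError(f"unsupported mount type: {kind!r}")
    parts = [f"type={kind}", f"src={src}", f"dst={dst}"]
    if readonly:
        parts.append("readonly")
    return ",".join(parts)


print(format_env({"ENV1": "v1", "ENV2": "v2"}))
print(format_mount("bind", "/host", "/job", readonly=True))
```

Which mount types and options a given scheduler honors is described in that scheduler's documentation.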

torchx.components.utils.copy(src: str, dst: str, image: str = 'ghcr.io/pytorch/torchx:0.8.0dev0') → AppDef

copy copies the file from src to dst. src and dst can be any valid fsspec URL.

This does not support recursive copies or directories.

Parameters

src – the source fsspec file location
dst – the destination fsspec file location
image – the image that contains the copy app

torchx.components.utils.python(*args: str, m: Optional[str] = None, c: Optional[str] = None, script: Optional[str] = None, image: str = 'ghcr.io/pytorch/torchx:0.8.0dev0', name: str = 'torchx_utils_python', cpu: int = 1, gpu: int = 0, memMB: int = 1024, h: Optional[str] = None, num_replicas: int = 1) → AppDef

Runs python with the specified module, command, or script on the specified image and host. Use -- to separate component args from program args (e.g. torchx run utils.python --m foo.main -- --args to --main).
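The `--` convention can be illustrated with a short sketch that splits an argument list at the first `--`. This shows the convention only; it is not how TorchX's own argument parser is implemented, and `split_at_separator` is a hypothetical name.

```python
from typing import List, Tuple


def split_at_separator(argv: List[str]) -> Tuple[List[str], List[str]]:
    """Split argv at the first `--` into (component args, program args)."""
    if "--" not in argv:
        return argv, []
    idx = argv.index("--")
    return argv[:idx], argv[idx + 1:]


# The arguments that follow `torchx run utils.python` in the example above.
argv = ["--m", "foo.main", "--", "--args", "to", "--main"]
component_args, program_args = split_at_separator(argv)
print(component_args)  # ['--m', 'foo.main']
print(program_args)    # ['--args', 'to', '--main']
```

Everything before the separator configures the component; everything after it is handed to the Python program.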

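Note: the (cpu, gpu, memMB) parameters are mutually exclusive with h (named resource); if h is specified, it takes precedence for setting resource requirements. See registering named resources.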
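The precedence rule in this note can be sketched as follows. `Resource`, `NAMED_RESOURCES`, and `resolve_resource` are stand-ins written for this illustration, and the registry entry is made up; real named resources are registered through TorchX's own mechanism.

```python
from typing import Dict, NamedTuple, Optional


class Resource(NamedTuple):
    """Stand-in resource type for illustration only."""
    cpu: int
    gpu: int
    memMB: int


# Hypothetical registry with one made-up entry.
NAMED_RESOURCES: Dict[str, Resource] = {
    "example_gpu_host": Resource(cpu=8, gpu=1, memMB=32768),
}


def resolve_resource(
    cpu: int = 1, gpu: int = 0, memMB: int = 1024, h: Optional[str] = None
) -> Resource:
    """Pick the resource request: `h` wins when given, else cpu/gpu/memMB."""
    if h is not None:
        return NAMED_RESOURCES[h]
    return Resource(cpu=cpu, gpu=gpu, memMB=memMB)


print(resolve_resource(cpu=4))
print(resolve_resource(cpu=4, h="example_gpu_host"))
```

In the second call the explicit `cpu=4` is ignored because a named resource was given.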

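Parameters

args – arguments passed to the program in sys.argv[1:] (ignored with -c)
m – run a library module as a script
c – program passed as a string (may fail if the scheduler has a length limit on args)
script – .py script to run
image – image to run on
name – name of the job
cpu – number of CPUs per replica
gpu – number of GPUs per replica
memMB – CPU memory per replica, in MB
h – a registered named resource (if specified, takes precedence over cpu, gpu, memMB)
num_replicas – number of replicas to run (each in its own container)

Returns

Usage is very similar to regular python, except that this supports remote launches. Example: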

# locally (cmd)
$ torchx run utils.python --image $FBPKG -c "import torch; print(torch.__version__)"

# locally (module)
$ torchx run utils.python --image $FBPKG -m foo.bar.main

# remote (cmd)
$ torchx run -s mast utils.python --image $FBPKG -c "import torch; print(torch.__version__)"

# remote (module)
$ torchx run -s mast utils.python --image $FBPKG -m foo.bar.main

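Note:

torchx run patches the current working directory (CWD) on top of $FBPKG for faster remote iteration.
The patch contains all changes to your local fbcode; however, the patch build is triggered only if CWD is a subdirectory of fbcode. If you run from the root of fbcode (e.g. ~/fbsource/fbcode), your job will NOT be patched!
Be careful not to abuse -c CMD. Schedulers have a length limit on arguments, so do not try to pass long CMDs; use it sparingly.
In -m MODULE, the module needs to be rooted at fbcode. Example: for ~/fbsource/fbcode/foo/bar/main.py the module is -m foo.bar.main.
Do not override base_module in the python_library buck rule. If you do, you are on your own and patching will not work.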
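The -m MODULE rule above maps a file path under fbcode to a dotted module name. A minimal sketch of that mapping, using a hypothetical helper `module_for` that treats paths as plain strings (so `~` is not expanded):

```python
from pathlib import PurePosixPath


def module_for(path: str, root: str) -> str:
    """Derive the dotted module name for a .py file located under `root`."""
    # relative_to raises ValueError when `path` is not under `root`.
    relative = PurePosixPath(path).relative_to(PurePosixPath(root))
    if relative.suffix != ".py":
        raise ValueError(f"not a Python source file: {path}")
    return ".".join(relative.with_suffix("").parts)


print(module_for("~/fbsource/fbcode/foo/bar/main.py", "~/fbsource/fbcode"))
```

For the path used in the note, this yields `foo.bar.main`, the value passed to -m.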

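Inline Script in Component

Note

Important: do not abuse this feature! It should be used sparingly. We reserve the right to remove it in the future.

A nice side effect of how TorchX and penv python are built is that you can do almost anything you would normally do with python, with the added benefit that it automatically patches your working directory and lets you run both locally and remotely. This means that python -c CMD also works. Here is an example that illustrates this: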

$ cd ~/fbsource/fbcode/torchx/examples/apps

$ ls
component.py  config  main.py  module  README.md  TARGETS

# let's try getting the version of torch from a prebuilt fbpkg or bento kernel
$ torchx run utils.python --image bento_kernel_pytorch_lightning -c "import torch; print(torch.__version__)"
torchx 2021-10-27 11:27:28 INFO     loaded configs from /data/users/kiuk/fbsource/fbcode/torchx/fb/example/.torchxconfig
2021-10-27 11:27:44,633 fbpkg.fetch INFO: completed download of bento_kernel_pytorch_lightning:405
2021-10-27 11:27:44,634 fbpkg.fetch INFO: extracted bento_kernel_pytorch_lightning:405 to bento_kernel_pytorch_lightning
2021-10-27 11:27:48,591 fbpkg.util WARNING: removing old version /home/kiuk/.torchx/fbpkg/bento_kernel_pytorch_lightning/403
All packages downloaded successfully
local_penv://torchx/torchx_utils_python_6effc4e2
torchx 2021-10-27 11:27:49 INFO     Waiting for the app to finish...
1.11.0a0+fb
torchx 2021-10-27 11:27:58 INFO     Job finished: SUCCEEDED

Now for a more interesting example, let's run a trivial all-reduce of a scalar tensor on 1 worker:

$ torchx run utils.python --image torchx_fb_example \
-c "import torch; import torch.distributed as dist; dist.init_process_group(backend='gloo', init_method='tcp://localhost:29500', rank=0, world_size=1); t=torch.tensor(1); dist.all_reduce(t); print(f'all reduce result: {t.item()}')"

torchx 2021-10-27 10:23:05 INFO     loaded configs from /data/users/kiuk/fbsource/fbcode/torchx/fb/example/.torchxconfig
2021-10-27 10:23:09,339 fbpkg.fetch INFO: checksums verified: torchx_fb_example:11
All packages verified
local_penv://torchx/torchx_utils_python_08a41456
torchx 2021-10-27 10:23:09 INFO     Waiting for the app to finish...
all reduce result: 1
torchx 2021-10-27 10:23:13 INFO     Job finished: SUCCEEDED

WARNING: Long inline scripts won't work, since schedulers typically have a character limit on the length of each argument.
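One way to respect this limit is to check the script length before launching. The sketch below assumes the caller already knows the scheduler's limit; `check_inline_script` is a hypothetical helper, and the number used in the usage lines is a placeholder rather than a documented limit.

```python
def check_inline_script(script: str, max_arg_len: int) -> str:
    """Return `script` unchanged if it fits, else raise ValueError.

    `max_arg_len` is scheduler-specific and must be supplied by the caller.
    """
    if len(script) > max_arg_len:
        raise ValueError(
            f"inline script is {len(script)} characters, limit is {max_arg_len}; "
            "consider moving it into a module and using -m instead"
        )
    return script


cmd = "import torch; print(torch.__version__)"
# 4096 is an arbitrary placeholder, not a documented scheduler limit.
print(check_inline_script(cmd, max_arg_len=4096))
```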

torchx.components.utils.booth(x1: float, x2: float, trial_idx: int = 0, tracker_base: str = '/tmp/torchx-util-booth', image: str = 'ghcr.io/pytorch/torchx:0.8.0dev0') → AppDef

Evaluates the booth function, f(x1, x2) = (x1 + 2*x2 - 7)^2 + (2*x1 + x2 - 5)^2. The output result is accessible via FsspecResultTracker(outdir)[trial_idx].

Parameters

x1 – x1
x2 – x2
trial_idx – ignored if not running HPO
tracker_base – URI of the tracker's base output directory (e.g. s3://foo/bar)
image – the image that contains the booth app
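For reference, the formula above can be evaluated directly in plain Python. This restates the math only; it is not the component itself, which returns an AppDef and reports its result through the tracker. `booth_value` is a name chosen for this sketch.

```python
def booth_value(x1: float, x2: float) -> float:
    """Evaluate f(x1, x2) = (x1 + 2*x2 - 7)^2 + (2*x1 + x2 - 5)^2."""
    return (x1 + 2 * x2 - 7) ** 2 + (2 * x1 + x2 - 5) ** 2


print(booth_value(1.0, 3.0))  # 0.0, the function's global minimum
print(booth_value(0.0, 0.0))  # 74.0
```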

torchx.components.utils.binary(*args: str, entrypoint: str, name: str = 'torchx_utils_binary', num_replicas: int = 1, cpu: int = 1, gpu: int = 0, memMB: int = 1024, h: Optional[str] = None) → AppDef

Test component

Parameters

args – arguments passed to the program in sys.argv[1:] (ignored with -c)
name – name of the job
num_replicas – number of replicas to run (each in its own container)
cpu – number of CPUs per replica
gpu – number of GPUs per replica
memMB – CPU memory per replica, in MB
h – a registered named resource (if specified, takes precedence over cpu, gpu, memMB)

Returns
