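CLI

The torchx CLI is a command-line tool built around torchx.runner.Runner. It lets users launch a torchx.specs.AppDef directly onto one of the supported schedulers without authoring a pipeline (also known as a workflow). This is convenient for iterating quickly on application logic without incurring the technical and cognitive overhead of learning, writing, and dealing with pipelines.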

Note

When in doubt, use torchx --help.

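Listing the builtin components

Most of the components under the torchx.components module are what the CLI considers "builtin" apps. Before writing your own component, browse the builtins to see whether one already fits your needs. If so, there is no need to even author an app spec!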

$ torchx builtins
Found <n> builtin configs:
 1. echo
 2. simple_example
 3. touch
 ... <omitted for brevity>

Listing the supported schedulers

To get a list of the supported schedulers that you can launch your job onto, run:

$ torchx schedulers

Running an app as a job

The run subcommand takes one of the following:

The name of a builtin:

$ torchx run --scheduler <sched_name> echo

The full Python module path of the component function:

$ torchx run --scheduler <sched_name> torchx.components.utils.echo

The path of the *.py file that defines the component, followed by the name of the component function in that file:

$ cat ~/my_trainer_spec.py
import torchx.specs as specs

def get_spec(foo: int, bar: str) -> specs.AppDef:
  <...spec file details omitted for brevity...>

$ torchx run --scheduler <sched_name> ~/my_trainer_spec.py get_spec

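Now that you know how to choose which app to launch, it is time to see which parameters need to be passed. There are three sets of parameters:

Arguments to the run subcommand. To see the list, run: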
```
$ torchx run --help
```

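Arguments to the scheduler (--scheduler_args, also known as run_options or run_configs). Each scheduler takes different arguments. To find out the arguments for a specific scheduler, run the runopts subcommand (shown below for the local scheduler):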

$ torchx runopts local
{ 'image_fetcher': { 'default': 'dir',
                 'help': 'image fetcher type',
                 'type': 'str'},
'log_dir': { 'default': 'None',
           'help': 'dir to write stdout/stderr log files of replicas',
           'type': 'str'}}

$ torchx run --scheduler local --scheduler_args image_fetcher=dir,log_dir=/tmp ...

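Arguments to the component (app arguments are included here). These also depend on the component and can be seen with the --help string of the component: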

$ torchx run --scheduler local echo --help
usage: torchx run echo.torchx [-h] [--msg MSG]

Echos a message

optional arguments:
-h, --help  show this help message and exit
--msg MSG   Message to echo
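
The --scheduler_args value shown earlier is a comma-separated list of key=value pairs. As a rough illustration of that format only, the following Python sketch splits such a string into a dictionary. The function name parse_scheduler_args is hypothetical and this is not TorchX's own parser, which may additionally convert value types or handle list values differently.

```python
from typing import Dict


def parse_scheduler_args(raw: str) -> Dict[str, str]:
    """Split 'k1=v1,k2=v2' into a dict (hypothetical helper, values stay strings)."""
    cfg: Dict[str, str] = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty segments such as a trailing comma
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        cfg[key] = value
    return cfg


print(parse_scheduler_args("image_fetcher=dir,log_dir=/tmp"))
# prints {'image_fetcher': 'dir', 'log_dir': '/tmp'}
```

Keeping every value as a string mirrors what arrives on the command line. Any type checking against the runopts listing is left out of this sketch.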

Putting everything together, running echo with the local scheduler:

$ torchx run --scheduler local --scheduler_args image_fetcher=dir,log_dir=/tmp echo --msg "hello $USER"
=== RUN RESULT ===
Launched app: local://torchx_kiuk/echo_ecd30f74

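The run subcommand does not block until the job finishes. Instead, it simply schedules the job on the specified scheduler and prints an app handle, which is a URL of the form $scheduler_name://torchx_$user/$job_id. Keep note of this handle, since it is what you need to provide to the other subcommands to identify your job.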
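
To make the shape of the app handle concrete, here is a minimal Python sketch that splits a handle of the form $scheduler_name://torchx_$user/$job_id into its three parts. The AppHandle type, its field names (in particular calling the middle part "session"), and the parse_handle function are assumptions made for illustration. This is not TorchX's own parsing code.

```python
from typing import NamedTuple


class AppHandle(NamedTuple):
    scheduler: str  # $scheduler_name
    session: str    # torchx_$user (field name is an assumption)
    app_id: str     # $job_id


def parse_handle(handle: str) -> AppHandle:
    """Split '<scheduler>://<session>/<job_id>' into its parts (hypothetical helper)."""
    scheduler, sep, rest = handle.partition("://")
    session, slash, app_id = rest.partition("/")
    if not (sep and slash and scheduler and session and app_id):
        raise ValueError(f"not a valid app handle: {handle!r}")
    return AppHandle(scheduler, session, app_id)


print(parse_handle("local://torchx_kiuk/echo_ecd30f74"))
# prints AppHandle(scheduler='local', session='torchx_kiuk', app_id='echo_ecd30f74')
```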

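Describing and querying the status of a job

The describe subcommand is essentially the inverse of the run command. That is, it prints the app spec for a given app_handle.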

$ torchx describe <app handle>

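Important

The describe command attempts to recreate the app spec by querying the scheduler for the job description, so what you see printed is not always exactly the same app spec that was given to run. How faithfully the runner can recreate the app spec depends on several factors, such as how descriptive the scheduler's describe_job API is, and whether any fields in the app spec were ignored when the job was submitted because the scheduler does not support that parameter or functionality. Never rely on the describe API as a storage function for your app spec. It is only there to help you spot-check things.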

To get the status of a running job:

$ torchx status <app_handle> # prints status for all replicas and roles
$ torchx status --role trainer <app_handle> # filters it down to the trainer role

Filtering by --role is useful for large jobs that have multiple roles.

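Viewing logs

Note

This functionality depends on how long your scheduler setup retains logs. TorchX does not archive logs on your behalf; it relies on the scheduler's get_log API to obtain them. Refer to your scheduler's user manual to set up log retention properly.

The log subcommand is a simple wrapper around the scheduler's get_log API that lets you pull and print the logs for all replicas and roles from one place. It also lets you pull replica-specific or role-specific logs. Below are a few useful, self-explanatory log access patterns: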

$ torchx log <app_handle>
Prints all logs from all replicas and roles (each log line is prefixed with role name and replica id)

$ torchx log --tail <app_handle>
If the job is still running, tails the logs

$ torchx log --regex ".*Exception.*" <app_handle>
Filters the logs down to lines that match the regex (here, lines mentioning an exception)

$ torchx log <app_handle>/<role>
$ torchx log <app_handle>/trainer
pulls all logs for the role trainer

$ torchx log <app_handle>/<role_name>/<replica_id>
$ torchx log <app_handle>/trainer/0,1
only pulls trainer 0 and trainer 1 (node not rank) logs
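
The identifier patterns above (app handle, optionally followed by a role name and a comma-separated list of replica ids) can be pictured with the following Python sketch. LogTarget and parse_log_target are hypothetical names, and the sketch assumes that neither the job id nor the role name contains a slash. It illustrates the addressing scheme only and is not how the torchx CLI parses its arguments.

```python
from typing import List, NamedTuple, Optional


class LogTarget(NamedTuple):
    app_handle: str
    role: Optional[str]   # None means "all roles"
    replicas: List[int]   # empty list means "all replicas"


def parse_log_target(identifier: str) -> LogTarget:
    """Parse '<app_handle>[/<role>[/<id>,<id>...]]' (hypothetical helper)."""
    scheme, sep, rest = identifier.partition("://")
    parts = rest.split("/")
    if not sep or not scheme or not 2 <= len(parts) <= 4 or not all(parts):
        raise ValueError(f"not a valid log identifier: {identifier!r}")
    # The first two path segments belong to the app handle itself.
    app_handle = f"{scheme}://{parts[0]}/{parts[1]}"
    role = parts[2] if len(parts) >= 3 else None
    replicas = [int(r) for r in parts[3].split(",")] if len(parts) == 4 else []
    return LogTarget(app_handle, role, replicas)


print(parse_log_target("local://torchx_kiuk/echo_ecd30f74/trainer/0,1"))
```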

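Note

Some schedulers do not support server-side regex filters. In that case the regex filter is applied on the client side, which means the full logs have to pass through the client. This can be very taxing on the local host. Use your best judgment when using the logs API.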
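
To show why client-side filtering is costly, here is a minimal Python sketch of the idea: every log line has to reach the client and be tested against the pattern, even though only the matches are kept. The filter_log_lines function and the sample lines (including their role/replica prefix format) are made up for illustration and do not reproduce TorchX's implementation or its actual log format.

```python
import re
from typing import Iterable, Iterator


def filter_log_lines(lines: Iterable[str], pattern: str) -> Iterator[str]:
    """Yield only the lines matching pattern; every line is still read by the client."""
    compiled = re.compile(pattern)
    for line in lines:
        if compiled.search(line):
            yield line


# Made-up log lines, used only to exercise the filter.
sample = [
    "trainer/0 epoch 1 loss=0.52",
    "trainer/1 RuntimeError: Exception while loading shard",
    "trainer/0 epoch 2 loss=0.47",
]

for line in filter_log_lines(sample, ".*Exception.*"):
    print(line)
```

The work done here grows with the total size of the logs rather than with the number of matches, which is the cost the note above warns about.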