torchx.runner

torchx.runner.get_runner(name: Optional[str] = None, **scheduler_params: Any) → torchx.runner.api.Runner: Convenience method to construct and get a Runner object.

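class torchx.runner.Runner(name: str, schedulers: Dict[str, torchx.schedulers.api.Scheduler], wait_interval: int = 10)

TorchX individual component runner. Provides the methods a user needs to act upon AppDefs. The Runner is stateful and represents the logical workspace of the user. It can be backed by a service (for example, a TorchX server) for persistence, or it can be standalone with no persistence, in which case the Runner lasts only for the lifetime of the hosting process (see the attach() API for the semantics of how apps are reparented between sessions).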

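describe(app_handle: str) → Optional[torchx.specs.api.AppDef]

Reconstructs the application (to the best extent possible) from the given app handle. Note that the reconstructed application may not be the complete app as submitted via the run API. How much of the app can be reconstructed depends on the scheduler.

Returns: An AppDef, or None if the app no longer exists or the scheduler does not support describing the app handle.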

dryrun(app: torchx.specs.api.AppDef, scheduler: str = 'default', cfg: Optional[torchx.specs.api.RunConfig] = None) → torchx.specs.api.AppDryRunInfo

Dry-runs the app on the given scheduler with the provided run config. Does not actually submit the app; instead it returns what would have been submitted. The returned AppDryRunInfo is pretty-formatted and can be printed or logged directly.

Usage:

dryrun_info = session.dryrun(app, scheduler="local", cfg=cfg)
print(dryrun_info)

list() → Dict[str, torchx.specs.api.AppDef]: Returns the applications that were run with this session, mapped by app handle. The persistence of the session is implementation dependent.

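log_lines(app_handle: str, role_name: str, k: int = 0, regex: Optional[str] = None, since: Optional[datetime.datetime] = None, until: Optional[datetime.datetime] = None, should_tail: bool = False) → Iterable[str]

Returns an iterator over the log lines of the specified job container.

Note

k is the node (host) id, NOT the rank.
since and until are not always honored (this depends on the scheduler).

Warning

The semantics and guarantees of the returned iterator are highly scheduler dependent. See torchx.specs.api.Scheduler.log_iter for the high-level semantics of this log iterator. For this reason, using this method to generate output that is passed to downstream functions or dependencies is strongly discouraged. This method does NOT guarantee that 100% of the log lines are returned. If the scheduler has already fully or partially purged the log records for the application, it is entirely valid for this method to return no lines or only some of them.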

Usage:

app_handle = session.run(app, scheduler="local", cfg=RunConfig())

print("== trainer node 0 logs ==")
for line in session.log_lines(app_handle, "trainer", k=0):
   print(line)

Discouraged anti-pattern:

# DO NOT DO THIS!
# parses accuracy metric from log and reports it for this experiment run
accuracy = -1
for line in session.log_lines(app_handle, "trainer", k=0):
   if matches_regex(line, "final model_accuracy:[0-9]*"):
       accuracy = parse_accuracy(line)
       break
report(experiment_name, accuracy)
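
To see why scraping metrics from logs is fragile, the sketch below swaps the real log iterator for a stand-in. `fake_log_lines` and `parse_accuracy_from_logs` are hypothetical helpers written only for this illustration and are not part of TorchX; the stand-in simulates a scheduler that has already purged the tail of the log.

```python
import re
from typing import Iterable, Iterator, List, Optional


def fake_log_lines(lines: List[str], purged_after: Optional[int] = None) -> Iterator[str]:
    """Hypothetical stand-in for Runner.log_lines(); NOT a TorchX API.

    `purged_after` simulates a scheduler that only kept the first N lines.
    """
    kept = lines if purged_after is None else lines[:purged_after]
    for line in kept:
        yield line


def parse_accuracy_from_logs(log_iter: Iterable[str]) -> float:
    """Mirrors the anti-pattern: scrape a metric out of the log lines."""
    accuracy = -1.0  # sentinel that silently survives if the line never arrives
    pattern = re.compile(r"final model_accuracy:([0-9.]+)")
    for line in log_iter:
        match = pattern.search(line)
        if match:
            accuracy = float(match.group(1))
            break
    return accuracy


full_log = ["epoch 0 done", "epoch 1 done", "final model_accuracy:0.93"]

# With the complete log the metric is found.
complete = parse_accuracy_from_logs(fake_log_lines(full_log))
# With a partially purged log the sentinel is returned without any error.
truncated = parse_accuracy_from_logs(fake_log_lines(full_log, purged_after=2))
print(complete, truncated)
```

In the truncated case nothing fails loudly: the sentinel value is simply passed on as if it were a real result, which is the risk the warning describes.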

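Parameters

app_handle – application handle
role_name – role within the app (e.g. trainer)
k – k-th replica of the role to fetch the logs for
regex – optional regex filter; returns all lines if left empty
since – datetime-based start cursor. If left empty, begins from the first log line (start of job).
until – datetime-based end cursor. If left empty, follows the log output until the job completes and all log lines have been consumed.

Returns

An iterator over the log lines of the role's k-th replica for the specified application.

Raises

UnknownAppException – if the app does not exist in the scheduler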
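
The `regex`, `since`, and `until` parameters can be pictured as filters over timestamped log records. The sketch below only illustrates those semantics. `filter_log_lines` and the `(timestamp, line)` tuple layout are hypothetical and do not reflect how any TorchX scheduler stores or serves logs, and treating both time bounds as inclusive is likewise an assumption.

```python
import re
from datetime import datetime
from typing import Iterable, Iterator, Optional, Tuple


def filter_log_lines(
    records: Iterable[Tuple[datetime, str]],
    regex: Optional[str] = None,
    since: Optional[datetime] = None,
    until: Optional[datetime] = None,
) -> Iterator[str]:
    """Hypothetical illustration of regex/since/until filtering (not TorchX code)."""
    pattern = re.compile(regex) if regex else None
    for timestamp, line in records:
        if since is not None and timestamp < since:
            continue  # before the start cursor
        if until is not None and timestamp > until:
            continue  # after the end cursor
        if pattern is not None and not pattern.search(line):
            continue  # does not match the regex filter
        yield line


records = [
    (datetime(2021, 1, 1, 12, 0, 0), "starting trainer"),
    (datetime(2021, 1, 1, 12, 0, 5), "epoch 0 loss=0.9"),
    (datetime(2021, 1, 1, 12, 0, 10), "epoch 1 loss=0.5"),
]

all_lines = list(filter_log_lines(records))  # no filters: every line
epoch_lines = list(filter_log_lines(records, regex=r"^epoch"))
late_lines = list(filter_log_lines(records, since=datetime(2021, 1, 1, 12, 0, 6)))
```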

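run(app: torchx.specs.api.AppDef, scheduler: str = 'default', cfg: Optional[torchx.specs.api.RunConfig] = None) → str

Runs the given application in the specified mode.

Note

Subclasses of Session should implement the schedule method rather than overriding this method directly.

Returns: An application handle that is used to call other action APIs on the app.
Raises: AppNotReRunnableException – if the session/scheduler does not support re-running attached apps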

run_from_path(component_path: str, app_args: List[str], scheduler: str = 'default', cfg: Optional[torchx.specs.api.RunConfig] = None, dryrun: bool = False) → str

Resolves and runs the application in the specified mode.

Retrieves the application based on component_path and runs it in the specified mode.

The component_path has the following resolution order (from high-pri to low-pri):

User-registered components. Users can register components via https://packaging.python.org/specifications/entry-points/. The method looks for entry points in the group torchx.components.
File-based components in format: $FILE_PATH:FUNCTION_NAME. Both relative and absolute paths are supported.
Builtin components relative to torchx.components. The path to the component should be the module name relative to torchx.components followed by the function name, in the format: $module.$function.

Usage:

runner.run_from_path("distributed.ddp", ...) - will be resolved to
   ``torchx.components.distributed`` module and ``ddp`` function.


runner.run_from_path("~/home/components.py:my_component", ...) - will be resolved to
   ``~/home/components.py`` file and ``my_component`` function.
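
The file-based and builtin formats can be told apart by the `:` separator. The following is a minimal sketch of that distinction only: `split_component_path` is a hypothetical helper, not the resolver TorchX uses, and it skips the entry-point lookup for user-registered components entirely.

```python
from typing import Tuple

# Package that builtin component paths are relative to (taken from the text above).
BUILTIN_PACKAGE = "torchx.components"


def split_component_path(component_path: str) -> Tuple[str, str, str]:
    """Hypothetical helper: returns (kind, location, function_name)."""
    if ":" in component_path:
        # File-based form: $FILE_PATH:FUNCTION_NAME (split on the last colon).
        file_path, _, function_name = component_path.rpartition(":")
        if not file_path or not function_name:
            raise ValueError(f"cannot resolve component path: {component_path!r}")
        return ("file", file_path, function_name)

    # Builtin form: $module.$function, relative to torchx.components.
    module, _, function_name = component_path.rpartition(".")
    if not module or not function_name:
        raise ValueError(f"cannot resolve component path: {component_path!r}")
    return ("builtin", f"{BUILTIN_PACKAGE}.{module}", function_name)


builtin = split_component_path("distributed.ddp")
file_based = split_component_path("~/home/components.py:my_component")
print(builtin)
print(file_based)
```

Raising ValueError for an unparseable path follows the Raises entry for this method; a real resolver would also need to import the module or load the file and check that the function exists.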

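Returns

An application handle that is used to call other action APIs on the app, or <NONE> if it is a dry run.

Raises

AppNotReRunnableException – if the session/scheduler does not support re-running attached apps
ValueError – if the component_path cannot be resolved.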

run_opts() → Dict[str, torchx.specs.api.runopts]

Returns the runopts for the supported scheduler backends.

Usage:

local_runopts = session.run_opts()["local"]
print(f"local scheduler run options: {local_runopts}")

Returns: A map of scheduler backend to its runopts

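schedule(dryrun_info: torchx.specs.api.AppDryRunInfo) → str

Actually runs the application from the given dry-run info. Useful when a parameter in the scheduler request needs to be overwritten and is not configurable through one of the object APIs.

Warning

Use sparingly. Abusing this method to overwrite many parameters in the raw scheduler request may lead to your usage of TorchX going out of compliance in the long term. This method is intended to unblock users who want to experiment with certain scheduler-specific features in the short term, without having to wait until TorchX exposes those features in its APIs.

Note

It is recommended that subclasses of Session implement this method instead of directly implementing the run method.

Usage:

dryrun_info = session.dryrun(app, scheduler="default", cfg=cfg)

# overwrite parameter "foo" to "bar"
dryrun_info.request.foo = "bar"

app_handle = session.schedule(dryrun_info)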
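
The relationship between dryrun, schedule, and run can be sketched with a toy class. Everything here (`ToyRunner`, `ToyDryRunInfo`, the dict-shaped request, the handle format) is hypothetical and stands in for the real TorchX types. The point is only that run can be expressed as schedule applied to the result of dryrun, so a request edited between the two steps is what gets submitted.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToyDryRunInfo:
    # Hypothetical stand-in for AppDryRunInfo: holds the raw scheduler request.
    request: Dict[str, Any] = field(default_factory=dict)


class ToyRunner:
    """Hypothetical runner used only to illustrate the call flow."""

    def __init__(self) -> None:
        self.submitted: List[Dict[str, Any]] = []

    def dryrun(self, app_name: str, scheduler: str = "default") -> ToyDryRunInfo:
        # Builds the request but does not submit anything.
        request = {"app": app_name, "scheduler": scheduler, "foo": "default"}
        return ToyDryRunInfo(request=request)

    def schedule(self, dryrun_info: ToyDryRunInfo) -> str:
        # Submits exactly what the (possibly modified) dry-run info contains.
        request = dict(dryrun_info.request)
        self.submitted.append(request)
        scheduler = request["scheduler"]
        app_name = request["app"]
        return f"{scheduler}://toy/{app_name}_{len(self.submitted)}"

    def run(self, app_name: str, scheduler: str = "default") -> str:
        # run is dryrun followed by schedule, with no edits in between.
        return self.schedule(self.dryrun(app_name, scheduler))


runner = ToyRunner()
info = runner.dryrun("trainer", scheduler="local")
submitted_before = len(runner.submitted)  # dryrun alone submits nothing
info.request["foo"] = "bar"  # overwrite a raw request parameter
handle = runner.schedule(info)
```

Keeping the submission logic in schedule means a subclass that overrides only schedule changes the behavior of both entry points.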

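scheduler_backends() → List[str]: Returns a list of all supported scheduler backends. All session implementations must support a "default" scheduler backend and document what the default scheduler is.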

status(app_handle: str) → Optional[torchx.specs.api.AppStatus]

Returns: The status of the application, or None if the app no longer exists (e.g. it was stopped in the past and removed from the scheduler's backend).

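stop(app_handle: str) → None: Stops the application, effectively directing the scheduler to cancel the job. Does nothing if the app does not exist.

Note

This method returns as soon as the cancel request has been submitted to the scheduler. The application remains in the RUNNING state until the scheduler actually terminates the job. If the scheduler successfully interrupts and terminates the job, the final state will be CANCELLED; otherwise it will be FAILED.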

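wait(app_handle: str) → Optional[torchx.specs.api.AppStatus]

Blocks (indefinitely) until the application completes. Possible implementation:

while True:
    app_status = status(app_handle)
    # status() returns None once the app no longer exists
    if app_status is None or app_status.is_terminal():
        return app_status
    sleep(10)

Returns: The terminal status of the application, or None if the app no longer exists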
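
A runnable version of such a polling loop might look like the sketch below. `wait_for_terminal`, `ToyStatus`, and the injected `get_status` callable are hypothetical stand-ins so that the example runs without a scheduler. Only the `is_terminal()` check and the None result for a vanished app come from the description above, and the SUCCEEDED state name is an assumption.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToyStatus:
    # Hypothetical stand-in for AppStatus.
    state: str

    def is_terminal(self) -> bool:
        # CANCELLED and FAILED are named in the text; SUCCEEDED is assumed.
        return self.state in ("SUCCEEDED", "FAILED", "CANCELLED")


def wait_for_terminal(
    get_status: Callable[[str], Optional[ToyStatus]],
    app_handle: str,
    wait_interval: float = 10.0,
) -> Optional[ToyStatus]:
    """Polls get_status until the app is terminal or no longer exists."""
    while True:
        app_status = get_status(app_handle)
        if app_status is None:
            return None  # the app no longer exists
        if app_status.is_terminal():
            return app_status
        time.sleep(wait_interval)


# Simulated scheduler: two RUNNING polls, then a terminal state.
states = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
final = wait_for_terminal(
    lambda handle: ToyStatus(next(states)), "local://toy/app", wait_interval=0.01
)
gone = wait_for_terminal(lambda handle: None, "local://toy/missing", wait_interval=0.01)
print(final, gone)
```

Passing the status lookup in as a callable keeps the loop testable; a short `wait_interval` is used here only so the example finishes quickly.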
