torchx.runner
- torchx.runner.get_runner(name: Optional[str] = None, **scheduler_params: Any) → torchx.runner.api.Runner
Convenience method to construct and get a Runner object.
- class torchx.runner.Runner(name: str, schedulers: Dict[str, torchx.schedulers.api.Scheduler], wait_interval: int = 10)
TorchX individual component runner. Has the methods for the user to act upon AppDefs. The Runner is stateful and represents a logical workspace of the user. It can be backed by a service (e.g. TorchX server) for persistence, or it can be standalone with no persistence, meaning that the Runner lasts only for the duration of the hosting process (see the attach() API for instructions on re-parenting apps between sessions).
- describe(app_handle: str) → Optional[torchx.specs.api.AppDef]
Reconstructs the application (to the best extent) given the app handle. Note that the reconstructed application may not be the complete app as it was submitted via the run API. How much of the app can be reconstructed is scheduler dependent.
- Returns
AppDef or None if the app does not exist anymore or if the scheduler does not support describing the app handle
- dryrun(app: torchx.specs.api.AppDef, scheduler: str = 'default', cfg: Optional[torchx.specs.api.RunConfig] = None) → torchx.specs.api.AppDryRunInfo
Dry runs an app on the given scheduler with the provided run configs. Does not actually submit the app but rather returns what would have been submitted. The returned AppDryRunInfo is pretty formatted and can be printed or logged directly.
Usage:
    dryrun_info = session.dryrun(app, scheduler="local", cfg=cfg)
    print(dryrun_info)
- list() → Dict[str, torchx.specs.api.AppDef]
Returns the applications that were run with this session mapped by the app handle. The persistence of the session is implementation dependent.
- log_lines(app_handle: str, role_name: str, k: int = 0, regex: Optional[str] = None, since: Optional[datetime.datetime] = None, until: Optional[datetime.datetime] = None, should_tail: bool = False) → Iterable[str]
Returns an iterator over the log lines of the specified job container.
Note
k is the node (host) id, NOT the rank. since and until need not always be honored (depends on the scheduler).
Warning
The semantics and guarantees of the returned iterator are highly scheduler dependent. See torchx.schedulers.api.Scheduler.log_iter for the high-level semantics of this log iterator. For this reason it is HIGHLY DISCOURAGED to use this method for generating output to pass to downstream functions/dependencies. This method DOES NOT guarantee that 100% of the log lines are returned. It is entirely valid for this method to return no or partial log lines if the scheduler has already totally or partially purged log records for the application.
Usage:
    app_handle = session.run(app, scheduler="local", cfg=RunConfig())
    print("== trainer node 0 logs ==")
    for line in session.log_lines(app_handle, "trainer", k=0):
        print(line)
Discouraged anti-pattern:
    # DO NOT DO THIS!
    # parses accuracy metric from log and reports it for this experiment run
    accuracy = -1
    for line in session.log_lines(app_handle, "trainer", k=0):
        if matches_regex(line, "final model_accuracy:[0-9]*"):
            accuracy = parse_accuracy(line)
            break
    report(experiment_name, accuracy)
- Parameters
app_handle – application handle
role_name – role within the app (e.g. trainer)
k – k-th replica of the role to fetch the logs for
regex – optional regex filter, returns all lines if left empty
since – datetime based start cursor. If left empty begins from the first log line (start of job).
until – datetime based end cursor. If left empty, follows the log output until the job completes and all log lines have been consumed.
- Returns
An iterator over the log lines of the k-th replica of the role in the specified application.
- Raises
UnknownAppException – if the app does not exist in the scheduler
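The regex parameter described above can be pictured as a filter applied to a stream of lines. The following is a minimal sketch of that behavior using only the standard library. The helper name filter_lines is hypothetical and not part of TorchX; the actual filtering is performed by the scheduler, and the sketch assumes re.search semantics, which a given scheduler may not share.

```python
import re
from typing import Iterable, Iterator, Optional


def filter_lines(lines: Iterable[str], regex: Optional[str] = None) -> Iterator[str]:
    """Yield lines matching ``regex``; yield every line if ``regex`` is empty.

    Hypothetical helper for illustration only (assumes ``re.search`` semantics).
    """
    if not regex:
        # no filter given: pass every line through unchanged
        yield from lines
        return
    pattern = re.compile(regex)
    for line in lines:
        if pattern.search(line):
            yield line


# made-up sample log lines, standing in for a scheduler's log stream
sample = ["epoch 1 loss=0.9", "warning: slow step", "epoch 2 loss=0.5"]
epoch_lines = list(filter_lines(sample, r"^epoch \d+"))
all_lines = list(filter_lines(sample))
```

Because the result is a lazy iterator, nothing is read from the underlying stream until the caller starts consuming it, which matches the iterator-style return value documented above.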
- run(app: torchx.specs.api.AppDef, scheduler: str = 'default', cfg: Optional[torchx.specs.api.RunConfig] = None) → str
Runs the given application in the specified mode.
Note
Sub-classes of Session should implement the schedule method rather than overriding this method directly.
- Returns
An application handle that is used to call other action APIs on the app.
- Raises
AppNotReRunnableException – if the session/scheduler does not support re-running attached apps
- run_from_path(component_path: str, app_args: List[str], scheduler: str = 'default', cfg: Optional[torchx.specs.api.RunConfig] = None, dryrun: bool = False) → str
Resolves and runs the application in the specified mode.
Retrieves the application based on component_path and runs it in the specified mode.
The component_path has the following resolution order (from high-pri to low-pri):
- User-registered components. Users can register components via https://packaging.python.org/specifications/entry-points/. The method looks for entry points in the group torchx.components.
- File-based components in the format $FILE_PATH:FUNCTION_NAME. Both relative and absolute paths are supported.
- Builtin components relative to torchx.components. The path to the component should be the module name relative to torchx.components and the function name, in the format $module.$function.
Usage:
    # resolved to the torchx.components.distributed module and the ddp function
    runner.run_from_path("distributed.ddp", ...)
    # resolved to the ~/home/components.py file and the my_component function
    runner.run_from_path("~/home/components.py:my_component", ...)
- Returns
An application handle that is used to call other action APIs on the app, or <NONE> if dryrun is specified.
- Raises
AppNotReRunnableException – if the session/scheduler does not support re-running attached apps
ValueError – if the component_path fails to resolve
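The resolution order for component_path can be sketched as a small classifier. This is an illustration of the documented precedence only, not TorchX's actual resolver. The function classify_component_path and its registered argument (standing in for names discovered through the torchx.components entry-point group) are hypothetical, and how registered names map to callables is left out.

```python
from typing import Iterable, Tuple


def classify_component_path(
    component_path: str, registered: Iterable[str] = ()
) -> Tuple[str, str, str]:
    """Return ``(kind, location, function_name)`` for a component path.

    Hypothetical helper: ``registered`` stands in for names found via the
    ``torchx.components`` entry-point group.
    """
    # 1. user-registered components have the highest priority
    if component_path in set(registered):
        return ("registered", component_path, "")
    # 2. file-based components: $FILE_PATH:FUNCTION_NAME
    if ":" in component_path:
        file_path, _, function_name = component_path.rpartition(":")
        return ("file", file_path, function_name)
    # 3. builtin components: $module.$function, relative to torchx.components
    module, _, function_name = component_path.rpartition(".")
    if not module or not function_name:
        raise ValueError(f"cannot resolve component_path: {component_path!r}")
    return ("builtin", f"torchx.components.{module}", function_name)


builtin = classify_component_path("distributed.ddp")
from_file = classify_component_path("~/home/components.py:my_component")
```

Splitting on the last colon (rather than the first) is a design choice of this sketch so that file paths which themselves contain a colon still yield the trailing function name.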
- run_opts() → Dict[str, torchx.specs.api.runopts]
Returns the runopts for the supported scheduler backends.
Usage:
    local_runopts = session.run_opts()["local"]
    print(f"local scheduler run options: {local_runopts}")
- Returns
A map of scheduler backend to its runopts
- schedule(dryrun_info: torchx.specs.api.AppDryRunInfo) → str
Actually runs the application from the given dryrun info. Useful when one needs to overwrite a parameter in the scheduler request that is not configurable from one of the object APIs.
Warning
Use sparingly since abusing this method to overwrite many parameters in the raw scheduler request may lead to your usage of TorchX going out of compliance in the long term. This method is intended to unblock the user from experimenting with certain scheduler-specific features in the short term without having to wait until TorchX exposes scheduler features in its APIs.
Note
It is recommended that sub-classes of Session implement this method instead of directly implementing the run method.
Usage:
    dryrun_info = session.dryrun(app, scheduler="default", cfg=cfg)
    # overwrite parameter "foo" to "bar"
    dryrun_info.request.foo = "bar"
    app_handle = session.schedule(dryrun_info)
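The relationship between dryrun, schedule, and run can be sketched with an in-memory stand-in. Everything here is hypothetical: FakeRunner and FakeDryRunInfo are not TorchX classes, the request is modeled as a plain dict, and the sketch assumes that run is simply dryrun followed by schedule, which the notes above suggest but do not state outright.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FakeDryRunInfo:
    """Hypothetical stand-in for AppDryRunInfo: wraps the raw request."""

    request: Dict[str, str]


@dataclass
class FakeRunner:
    """Hypothetical in-memory runner, used only to show the call order."""

    submitted: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def dryrun(self, app_name: str) -> FakeDryRunInfo:
        # builds the request but does not submit anything
        return FakeDryRunInfo(request={"name": app_name, "foo": "default"})

    def schedule(self, dryrun_info: FakeDryRunInfo) -> str:
        # submits exactly what the (possibly edited) dryrun info contains
        handle = f"fake://{dryrun_info.request['name']}"
        self.submitted[handle] = dict(dryrun_info.request)
        return handle

    def run(self, app_name: str) -> str:
        # assumption of this sketch: run == dryrun followed by schedule
        return self.schedule(self.dryrun(app_name))


runner = FakeRunner()
info = runner.dryrun("trainer")
info.request["foo"] = "bar"  # overwrite a raw request parameter
handle = runner.schedule(info)
```

Under this structure a sub-class only needs to supply schedule, and any edit made to the dry-run request between the two calls is what ends up being submitted.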
- scheduler_backends() → List[str]
Returns a list of all supported scheduler backends. All session implementations must support a “default” scheduler backend and document what the default scheduler is.
- status(app_handle: str) → Optional[torchx.specs.api.AppStatus]
- Returns
The status of the application, or None if the app does not exist anymore (e.g. was stopped in the past and removed from the scheduler’s backend).
- stop(app_handle: str) → None
Stops the application, effectively directing the scheduler to cancel the job. Does nothing if the app does not exist.
Note
This method returns as soon as the cancel request has been submitted to the scheduler. The application will be in a RUNNING state until the scheduler actually terminates the job. If the scheduler successfully interrupts the job and terminates it, the final state will be CANCELLED; otherwise it will be FAILED.
- wait(app_handle: str) → Optional[torchx.specs.api.AppStatus]
Block waits (indefinitely) for the application to complete. Possible implementation:
    while True:
        app_status = status(app)
        if app_status.is_terminal():
            return app_status
        sleep(10)
- Returns
The terminal status of the application, or None if the app does not exist anymore
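The polling behavior of wait, including the None return for an app that no longer exists, can be sketched as a standalone function. This is a hedged illustration, not the Runner's implementation: wait_for_terminal, get_status, and is_terminal are hypothetical names, statuses are modeled as plain strings, and the terminal set used in the demo contains only the states named on this page.

```python
import time
from typing import Callable, Optional


def wait_for_terminal(
    get_status: Callable[[], Optional[str]],
    is_terminal: Callable[[str], bool],
    wait_interval: float = 10.0,
) -> Optional[str]:
    """Poll ``get_status`` until it yields a terminal status or ``None``."""
    while True:
        status = get_status()
        if status is None:
            # the app does not exist anymore
            return None
        if is_terminal(status):
            return status
        time.sleep(wait_interval)


# simulated status sequence standing in for repeated status() calls
states = iter(["RUNNING", "RUNNING", "CANCELLED"])
final = wait_for_terminal(
    lambda: next(states),
    lambda s: s in {"CANCELLED", "FAILED"},
    wait_interval=0,
)
gone = wait_for_terminal(lambda: None, lambda s: True, wait_interval=0)
```

Passing the status lookup and the terminal check as callables keeps the loop independent of any particular scheduler, and wait_interval mirrors the constructor argument of the same name shown in the Runner signature.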