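Quickstart

This is a self-contained guide on how to write a simple app and component spec and then launch it via two different schedulers.

Installation

The first thing we need to do is install the TorchX Python package, which includes the CLI and the library.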

# install torchx with all dependencies
$ pip install torchx[dev]

See the README for more information on installation.

[1]:
%%sh
torchx --help
usage: torchx [-h] [--log_level LOG_LEVEL] [--version]
              {describe,log,run,builtins,runopts,status,configure} ...

torchx CLI

optional arguments:
  -h, --help            show this help message and exit
  --log_level LOG_LEVEL
                        Python logging log level
  --version             show program's version number and exit

sub-commands:
  Use the following commands to run operations, e.g.: torchx run ${JOB_NAME}

  {describe,log,run,builtins,runopts,status,configure}

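Hello World

Let's start by writing a simple "Hello World" Python app. This is just a normal Python program and can contain anything you'd like.

Note

This example uses the Jupyter Notebook %%writefile magic to create local files for example purposes. Under normal usage you would have these as standalone files.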

[2]:
%%writefile my_app.py

import sys
import argparse

def main(user: str) -> None:
    print(f"Hello, {user}!")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Hello world app"
    )
    parser.add_argument(
        "--user",
        type=str,
        help="the person to greet",
        required=True,
    )
    args = parser.parse_args(sys.argv[1:])

    main(args.user)
Writing my_app.py
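
Because the app reads its input through argparse, the parsing logic can be exercised without a shell. The following is a minimal sketch and not part of the original notebook: build_parser and greeting are hypothetical helpers that mirror the parser and the message in my_app.py, and the argument list is passed explicitly instead of being read from sys.argv.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical helper: mirrors the parser defined in my_app.py.
    parser = argparse.ArgumentParser(description="Hello world app")
    parser.add_argument(
        "--user",
        type=str,
        help="the person to greet",
        required=True,
    )
    return parser


def greeting(user: str) -> str:
    # Same message that my_app.main prints, returned as a string instead.
    return f"Hello, {user}!"


# Pass the arguments explicitly, as if the app had been started with
# `python my_app.py --user "your name"`.
args = build_parser().parse_args(["--user", "your name"])
message = greeting(args.user)
print(message)
```

Returning the message from a helper instead of printing it directly is only a convenience for checking the result; the app in this guide prints it.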

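Now that we have an app, we can write the component file for it. This component function allows us to reuse and share our app in a user-friendly way.

We can use this component from the torchx CLI or programmatically as part of a pipeline.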

[3]:
%%writefile my_component.py

import torchx.specs as specs

def greet(user: str, image: str = "my_app:latest") -> specs.AppDef:
    return specs.AppDef(
        name="hello_world",
        roles=[
            specs.Role(
                name="greeter",
                image=image,
                entrypoint="python",
                args=[
                    "-m", "my_app",
                    "--user", user,
                ],
            )
        ],
    )
Writing my_component.py
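
To make the role fields more concrete, the sketch below shows how the entrypoint and args of the greeter role combine into a single command line. It is an illustration only: RoleSketch is a hypothetical plain dataclass standing in for specs.Role so that the snippet runs without TorchX installed, and command_line is not a TorchX function.

```python
import shlex
from dataclasses import dataclass, field
from typing import List


@dataclass
class RoleSketch:
    # Hypothetical stand-in for specs.Role with only the fields used here.
    name: str
    image: str
    entrypoint: str
    args: List[str] = field(default_factory=list)


def command_line(role: RoleSketch) -> str:
    # The entrypoint followed by its args forms the command for the role.
    return shlex.join([role.entrypoint, *role.args])


role = RoleSketch(
    name="greeter",
    image="my_app:latest",
    entrypoint="python",
    args=["-m", "my_app", "--user", "your name"],
)
cmd = command_line(role)
print(cmd)
```

The image field does not appear in the command; it names the environment in which the scheduler is expected to launch that command.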

We can execute our component via torchx run. The local_cwd scheduler executes the component relative to the current directory.

[4]:
%%sh
torchx run --scheduler local_cwd my_component.py:greet --user "your name"
Hello, your name!
local_cwd://torchx/hello_world_49bdfa71
torchx 2021-10-20 19:04:18 INFO     Waiting for the app to finish...
torchx 2021-10-20 19:04:19 INFO     Job finished: SUCCEEDED
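
The line local_cwd://torchx/hello_world_49bdfa71 in the output is the handle of the launched app. Judging from the output, it has the shape scheduler://session/app_id. The sketch below splits such a handle into its parts; parse_handle and ParsedHandle are hypothetical names used for illustration and are not TorchX APIs.

```python
from typing import NamedTuple


class ParsedHandle(NamedTuple):
    scheduler: str
    session: str
    app_id: str


def parse_handle(handle: str) -> ParsedHandle:
    # Split "<scheduler>://<session>/<app_id>" into its three parts.
    scheduler, sep, rest = handle.partition("://")
    session, slash, app_id = rest.partition("/")
    if not (sep and slash and scheduler and app_id):
        raise ValueError(f"unexpected app handle: {handle!r}")
    return ParsedHandle(scheduler, session, app_id)


parsed = parse_handle("local_cwd://torchx/hello_world_49bdfa71")
print(parsed)
```

String partitioning is used instead of urllib.parse because the underscore in local_cwd is not a valid character in a URL scheme, so a URL parser would not recognize it as one.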

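If we want to run in other environments, we can build a Docker container so we can run our component in Docker-enabled environments such as Kubernetes or via the local Docker scheduler.

Note

This requires Docker to be installed and won't work in environments such as Google Colab. If you have not done so already, follow the install instructions at: https://docs.docker.com/get-docker/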

[5]:
%%writefile Dockerfile

FROM ghcr.io/pytorch/torchx:0.1.0rc1

ADD my_app.py .
Writing Dockerfile

Once we have the Dockerfile created, we can build our Docker image.

[6]:
%%sh
docker build -t my_app:latest -f Dockerfile .
Step 1/2 : FROM ghcr.io/pytorch/torchx:0.1.0rc1
0.1.0rc1: Pulling from pytorch/torchx
4bbfd2c87b75: Pulling fs layer
d2e110be24e1: Pulling fs layer
889a7173dcfe: Pulling fs layer
6009a622672a: Pulling fs layer
143f80195431: Pulling fs layer
eccbe17c44e1: Pulling fs layer
d4c7af0d4fa7: Pulling fs layer
06b5edd6bf52: Pulling fs layer
f18d016c4ccc: Pulling fs layer
c0ad16d9fa05: Pulling fs layer
30587ba7fd6b: Pulling fs layer
909695be1d50: Pulling fs layer
f119a6d0a466: Pulling fs layer
88d87059c913: Pulling fs layer
143f80195431: Waiting
eccbe17c44e1: Waiting
d4c7af0d4fa7: Waiting
06b5edd6bf52: Waiting
f18d016c4ccc: Waiting
c0ad16d9fa05: Waiting
30587ba7fd6b: Waiting
909695be1d50: Waiting
f119a6d0a466: Waiting
88d87059c913: Waiting
6009a622672a: Waiting
d2e110be24e1: Verifying Checksum
d2e110be24e1: Download complete
6009a622672a: Verifying Checksum
6009a622672a: Download complete
4bbfd2c87b75: Verifying Checksum
4bbfd2c87b75: Download complete
889a7173dcfe: Verifying Checksum
889a7173dcfe: Download complete
eccbe17c44e1: Verifying Checksum
eccbe17c44e1: Download complete
06b5edd6bf52: Verifying Checksum
06b5edd6bf52: Download complete
d4c7af0d4fa7: Verifying Checksum
d4c7af0d4fa7: Download complete
c0ad16d9fa05: Verifying Checksum
c0ad16d9fa05: Download complete
30587ba7fd6b: Verifying Checksum
30587ba7fd6b: Download complete
4bbfd2c87b75: Pull complete
909695be1d50: Verifying Checksum
909695be1d50: Download complete
f119a6d0a466: Verifying Checksum
f119a6d0a466: Download complete
88d87059c913: Verifying Checksum
88d87059c913: Download complete
d2e110be24e1: Pull complete
889a7173dcfe: Pull complete
f18d016c4ccc: Verifying Checksum
f18d016c4ccc: Download complete
143f80195431: Verifying Checksum
143f80195431: Download complete
6009a622672a: Pull complete
143f80195431: Pull complete
eccbe17c44e1: Pull complete
d4c7af0d4fa7: Pull complete
06b5edd6bf52: Pull complete
f18d016c4ccc: Pull complete
c0ad16d9fa05: Pull complete
30587ba7fd6b: Pull complete
909695be1d50: Pull complete
f119a6d0a466: Pull complete
88d87059c913: Pull complete
Digest: sha256:a738949601d82e7f100fa1efeb8dde0c35ce44c66726cf38596f96d78dcd7ad3
Status: Downloaded newer image for ghcr.io/pytorch/torchx:0.1.0rc1
 ---> 3dbec59e8049
Step 2/2 : ADD my_app.py .
 ---> ceb8e40109ec
Successfully built ceb8e40109ec
Successfully tagged my_app:latest

We can then launch it on the local Docker scheduler.

[7]:
%%sh
torchx run --scheduler local_docker my_component.py:greet --image "my_app:latest" --user "your name"
Client:
 Version:           20.10.9+azure-1
 API version:       1.41
 Go version:        go1.16.8
 Git commit:        c2ea9bc90bacf19bdbe37fd13eec8772432aca99
 Built:             Thu Sep 23 18:26:34 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.9+azure-1
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.8
  Git commit:       79ea9d3080181d755855d5924d0f4f116faa9463
  Built:            Thu Sep 23 18:26:18 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.11+azure
  GitCommit:        5b46e404f6b9f661a205e28d59c982d3634148f8
 runc:
  Version:          1.0.2
  GitCommit:        52b36a2dd837e8462de8e01458bf02cf9eea47dd
 docker-init:
  Version:          0.19.0
  GitCommit:
Hello, your name!
local_docker://torchx/hello_world_fce26e07
Error response from daemon: pull access denied for my_app, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
torchx 2021-10-20 19:06:48 WARNING  failed to fetch image my_app:latest, falling back to local: Command '['docker', 'pull', 'my_app:latest']' returned non-zero exit status 1.
torchx 2021-10-20 19:06:48 INFO     Waiting for the app to finish...
torchx 2021-10-20 19:06:49 INFO     Job finished: SUCCEEDED
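
In the command above, the flags --image and --user correspond to the parameters of the greet component function. As an illustration of that correspondence only (this is not how TorchX implements it), the sketch below derives command-line flags from a function signature. parser_from_signature is a hypothetical helper, and greet here is a simplified stand-in that returns a dict rather than a specs.AppDef.

```python
import argparse
import inspect


def greet(user: str, image: str = "my_app:latest") -> dict:
    # Simplified stand-in for the component: returns a dict, not an AppDef.
    return {"user": user, "image": image}


def parser_from_signature(fn) -> argparse.ArgumentParser:
    # One --flag per parameter; parameters without a default are required.
    parser = argparse.ArgumentParser(prog=fn.__name__)
    for name, param in inspect.signature(fn).parameters.items():
        if param.default is inspect.Parameter.empty:
            parser.add_argument(f"--{name}", type=str, required=True)
        else:
            parser.add_argument(f"--{name}", type=str, default=param.default)
    return parser


parser = parser_from_signature(greet)
args = parser.parse_args(["--image", "my_app:latest", "--user", "your name"])
result = greet(**vars(args))
print(result)
```

With this mapping, omitting --image falls back to the default declared in the function signature, while omitting --user is an error because that parameter has no default.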

If you have a Kubernetes cluster, you can use the Kubernetes scheduler to launch it on the cluster.

$ docker push my_app:latest
$ torchx run --scheduler kubernetes my_component.py:greet --image "my_app:latest" --user "your name"

Builtins

TorchX also provides a number of builtin components with premade images. You can discover them via:

[8]:
%%sh
torchx builtins
Found 7 builtin components:
  1. dist.ddp
  2. utils.booth
  3. utils.copy
  4. utils.echo
  5. utils.sh
  6. utils.touch
  7. serve.torchserve

You can use these from the CLI, from a pipeline, or programmatically like you would any other component.

[9]:
%%sh
torchx run utils.echo --msg "Hello :)"
Client:
 Version:           20.10.9+azure-1
 API version:       1.41
 Go version:        go1.16.8
 Git commit:        c2ea9bc90bacf19bdbe37fd13eec8772432aca99
 Built:             Thu Sep 23 18:26:34 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.9+azure-1
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.8
  Git commit:       79ea9d3080181d755855d5924d0f4f116faa9463
  Built:            Thu Sep 23 18:26:18 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.11+azure
  GitCommit:        5b46e404f6b9f661a205e28d59c982d3634148f8
 runc:
  Version:          1.0.2
  GitCommit:        52b36a2dd837e8462de8e01458bf02cf9eea47dd
 docker-init:
  Version:          0.19.0
  GitCommit:
Hello :)
local_docker://torchx/echo_1d65ce6f
0.1.0rc2: Pulling from pytorch/torchx
4bbfd2c87b75: Already exists
d2e110be24e1: Already exists
889a7173dcfe: Already exists
6009a622672a: Already exists
143f80195431: Already exists
eccbe17c44e1: Already exists
c8e67195524a: Pulling fs layer
b8164a4c6bc7: Pulling fs layer
e6dc31f41742: Pulling fs layer
72e3ffceef80: Pulling fs layer
2f0f90133f1e: Pulling fs layer
2f47096d98b8: Pulling fs layer
fbb305c7c317: Pulling fs layer
f9cd8d47efcc: Pulling fs layer
2f47096d98b8: Waiting
fbb305c7c317: Waiting
f9cd8d47efcc: Waiting
72e3ffceef80: Waiting
2f0f90133f1e: Waiting
b8164a4c6bc7: Verifying Checksum
b8164a4c6bc7: Download complete
72e3ffceef80: Verifying Checksum
72e3ffceef80: Download complete
c8e67195524a: Verifying Checksum
c8e67195524a: Download complete
2f0f90133f1e: Verifying Checksum
2f0f90133f1e: Download complete
fbb305c7c317: Verifying Checksum
fbb305c7c317: Download complete
f9cd8d47efcc: Verifying Checksum
f9cd8d47efcc: Download complete
c8e67195524a: Pull complete
2f47096d98b8: Verifying Checksum
2f47096d98b8: Download complete
b8164a4c6bc7: Pull complete
e6dc31f41742: Verifying Checksum
e6dc31f41742: Download complete
e6dc31f41742: Pull complete
72e3ffceef80: Pull complete
2f0f90133f1e: Pull complete
2f47096d98b8: Pull complete
fbb305c7c317: Pull complete
f9cd8d47efcc: Pull complete
Digest: sha256:f97df7bd3d2137b3dcb6b85e14f4eef9c1c7cdd826d12508cc7cafb08bb2f704
Status: Downloaded newer image for ghcr.io/pytorch/torchx:0.1.0rc2
ghcr.io/pytorch/torchx:0.1.0rc2
torchx 2021-10-20 19:07:54 INFO     Waiting for the app to finish...
torchx 2021-10-20 19:07:55 INFO     Job finished: SUCCEEDED

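Next Steps

  1. Check out other features of the torchx CLI

  2. Learn how to author more complex app specs by referencing specs

  3. Browse through the collection of builtin components

  4. See the list of schedulers supported by the runner

  5. See which ML pipeline platforms you can run components on

  6. See a training app example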
