AWS SageMaker¶
- class torchx.schedulers.aws_sagemaker_scheduler.AWSSageMakerScheduler(session_name: str, client: Optional[Any] = None, docker_client: Optional[DockerClient] = None)[source]¶
Bases:
DockerWorkspaceMixin, Scheduler[AWSSageMakerOpts]

AWSSageMakerScheduler is a TorchX scheduling interface to AWS SageMaker.
$ torchx run -s aws_sagemaker utils.echo --image alpine:latest --msg hello
aws_sagemaker://torchx_user/1234
$ torchx status aws_sagemaker://torchx_user/1234
...
Authentication is loaded from the environment using boto3 credential handling.

Config Options
usage:
    role=ROLE,instance_type=INSTANCE_TYPE,[instance_count=INSTANCE_COUNT],[user=USER],[keep_alive_period_in_seconds=KEEP_ALIVE_PERIOD_IN_SECONDS],[volume_size=VOLUME_SIZE],[volume_kms_key=VOLUME_KMS_KEY],[max_run=MAX_RUN],[input_mode=INPUT_MODE],[output_path=OUTPUT_PATH],[output_kms_key=OUTPUT_KMS_KEY],[base_job_name=BASE_JOB_NAME],[tags=TAGS],[subnets=SUBNETS],[security_group_ids=SECURITY_GROUP_IDS],[model_uri=MODEL_URI],[model_channel_name=MODEL_CHANNEL_NAME],[metric_definitions=METRIC_DEFINITIONS],[encrypt_inter_container_traffic=ENCRYPT_INTER_CONTAINER_TRAFFIC],[use_spot_instances=USE_SPOT_INSTANCES],[max_wait=MAX_WAIT],[checkpoint_s3_uri=CHECKPOINT_S3_URI],[checkpoint_local_path=CHECKPOINT_LOCAL_PATH],[debugger_hook_config=DEBUGGER_HOOK_CONFIG],[enable_sagemaker_metrics=ENABLE_SAGEMAKER_METRICS],[enable_network_isolation=ENABLE_NETWORK_ISOLATION],[disable_profiler=DISABLE_PROFILER],[environment=ENVIRONMENT],[max_retry_attempts=MAX_RETRY_ATTEMPTS],[source_dir=SOURCE_DIR],[git_config=GIT_CONFIG],[hyperparameters=HYPERPARAMETERS],[container_log_level=CONTAINER_LOG_LEVEL],[code_location=CODE_LOCATION],[dependencies=DEPENDENCIES],[training_repository_access_mode=TRAINING_REPOSITORY_ACCESS_MODE],[training_repository_credentials_provider_arn=TRAINING_REPOSITORY_CREDENTIALS_PROVIDER_ARN],[disable_output_compression=DISABLE_OUTPUT_COMPRESSION],[enable_infra_check=ENABLE_INFRA_CHECK],[image_repo=IMAGE_REPO],[quiet=QUIET]

required arguments:
    role=ROLE (str)
        an AWS IAM role (either name or full ARN). The Amazon SageMaker training jobs and APIs that create Amazon SageMaker endpoints use this role to access training data and model artifacts. After the endpoint is created, the inference code might use the IAM role, if it needs to access an AWS resource.
    instance_type=INSTANCE_TYPE (str)
        type of EC2 instance to use for training, for example, 'ml.c4.xlarge'

optional arguments:
    instance_count=INSTANCE_COUNT (int, 1)
        number of Amazon EC2 instances to use for training. Required if instance_groups is not set.
    user=USER (str, ec2-user)
        the username to tag the job with. `getpass.getuser()` if not specified.
    keep_alive_period_in_seconds=KEEP_ALIVE_PERIOD_IN_SECONDS (int, None)
        the duration of time in seconds to retain configured resources in a warm pool for subsequent training jobs.
    volume_size=VOLUME_SIZE (int, None)
        size in GB of the storage volume to use for storing input and output data during training (default: 30).
    volume_kms_key=VOLUME_KMS_KEY (str, None)
        KMS key ID for encrypting the EBS volume attached to the training instance.
    max_run=MAX_RUN (int, None)
        timeout in seconds for training (default: 24 * 60 * 60).
    input_mode=INPUT_MODE (str, None)
        the input mode that the algorithm supports (default: 'File').
    output_path=OUTPUT_PATH (str, None)
        S3 location for saving the training result (model artifacts and output files). If not specified, results are stored to a default bucket. If the bucket with the specific name does not exist, the estimator creates the bucket during the fit() method execution.
    output_kms_key=OUTPUT_KMS_KEY (str, None)
        KMS key ID for encrypting the training output (default: your IAM role's KMS key for Amazon S3).
    base_job_name=BASE_JOB_NAME (str, None)
        prefix for the training job name when the fit() method launches. If not specified, the estimator generates a default job name based on the training image name and current timestamp.
    tags=TAGS (typing.List[typing.Dict[str, str]], None)
        list of tags for labeling a training job.
    subnets=SUBNETS (typing.List[str], None)
        list of subnet ids. If not specified, the training job will be created without a VPC config.
    security_group_ids=SECURITY_GROUP_IDS (typing.List[str], None)
        list of security group ids. If not specified, the training job will be created without a VPC config.
    model_uri=MODEL_URI (str, None)
        URI where a pre-trained model is stored, either locally or in S3.
    model_channel_name=MODEL_CHANNEL_NAME (str, None)
        name of the channel where 'model_uri' will be downloaded (default: 'model').
    metric_definitions=METRIC_DEFINITIONS (typing.List[typing.Dict[str, str]], None)
        list of dictionaries that define the metric(s) used to evaluate the training jobs. Each dictionary contains two keys: 'Name' for the name of the metric, and 'Regex' for the regular expression used to extract the metric from the logs.
    encrypt_inter_container_traffic=ENCRYPT_INTER_CONTAINER_TRAFFIC (bool, None)
        specifies whether traffic between training containers is encrypted for the training job (default: False).
    use_spot_instances=USE_SPOT_INSTANCES (bool, None)
        specifies whether to use SageMaker Managed Spot instances for training. If enabled, the max_wait arg should also be set.
    max_wait=MAX_WAIT (int, None)
        timeout in seconds waiting for a spot training job.
    checkpoint_s3_uri=CHECKPOINT_S3_URI (str, None)
        S3 URI in which to persist checkpoints that the algorithm persists (if any) during training.
    checkpoint_local_path=CHECKPOINT_LOCAL_PATH (str, None)
        local path that the algorithm writes its checkpoints to.
    debugger_hook_config=DEBUGGER_HOOK_CONFIG (bool, None)
        configuration for how debugging information is emitted with SageMaker Debugger. If not specified, a default one is created using the estimator's output_path, unless the region does not support SageMaker Debugger. To disable SageMaker Debugger, set this parameter to False.
    enable_sagemaker_metrics=ENABLE_SAGEMAKER_METRICS (bool, None)
        enable SageMaker Metrics Time Series.
    enable_network_isolation=ENABLE_NETWORK_ISOLATION (bool, None)
        specifies whether the container will run in network isolation mode (default: False).
    disable_profiler=DISABLE_PROFILER (bool, None)
        specifies whether Debugger monitoring and profiling will be disabled (default: False).
    environment=ENVIRONMENT (typing.Dict[str, str], None)
        environment variables to be set for use during the training job.
    max_retry_attempts=MAX_RETRY_ATTEMPTS (int, None)
        number of times to move a job to the STARTING status. You can specify between 1 and 30 attempts.
    source_dir=SOURCE_DIR (str, None)
        absolute, relative, or S3 URI path to a directory with any other training source code dependencies aside from the entry point file (default: current working directory).
    git_config=GIT_CONFIG (typing.Dict[str, str], None)
        git configurations used for cloning files, including repo, branch, commit, 2FA_enabled, username, password, and token.
    hyperparameters=HYPERPARAMETERS (typing.Dict[str, str], None)
        dictionary containing the hyperparameters to initialize this estimator with.
    container_log_level=CONTAINER_LOG_LEVEL (int, None)
        log level to use within the container (default: logging.INFO).
    code_location=CODE_LOCATION (str, None)
        S3 prefix URI where custom code is uploaded.
    dependencies=DEPENDENCIES (typing.List[str], None)
        list of absolute or relative paths to directories with any additional libraries that should be exported to the container.
    training_repository_access_mode=TRAINING_REPOSITORY_ACCESS_MODE (str, None)
        specifies how SageMaker accesses the Docker image that contains the training algorithm.
    training_repository_credentials_provider_arn=TRAINING_REPOSITORY_CREDENTIALS_PROVIDER_ARN (str, None)
        Amazon Resource Name (ARN) of an AWS Lambda function that provides credentials to authenticate to the private Docker registry where your training image is hosted.
    disable_output_compression=DISABLE_OUTPUT_COMPRESSION (bool, None)
        when set to true, the model is uploaded to Amazon S3 without compression after training finishes.
    enable_infra_check=ENABLE_INFRA_CHECK (bool, None)
        specifies whether to run SageMaker built-in infra check jobs.
    image_repo=IMAGE_REPO (str, None)
        (remote jobs) the image repository to use when pushing patched images; must have push access. Ex: example.com/your/container
    quiet=QUIET (bool, False)
        whether to suppress verbose output for image building. Defaults to ``False``.

Compatibility
Feature                 Scheduler Support
Fetch Logs              ❌
Distributed Jobs        ✔️
Cancel Job              ✔️
Describe Job            Partial support. SageMakerScheduler will return job and replica status, but does not provide the complete original AppSpec.
Workspaces / Patching   ✔️
Mounts                  ❌
Elasticity              ❌
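As a usage illustration of the config options above, scheduler arguments are passed as a comma-separated list via `-cfg`. The IAM role ARN and image below are placeholders, not real resources:

```shell
# Launch a job on SageMaker; role and instance_type are required.
# The role ARN is a placeholder -- substitute your own IAM role.
torchx run -s aws_sagemaker \
    -cfg role=arn:aws:iam::123456789012:role/SageMakerRole,instance_type=ml.c4.xlarge,instance_count=2 \
    utils.echo --image alpine:latest --msg hello
```

This is a configuration sketch; the actual run requires valid boto3 credentials in the environment.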
- describe(app_id: str) → Optional[DescribeAppResponse][source]¶
Describes the specified application.
- Returns:
The AppDef description, or
None if the app does not exist.
- list() → List[ListAppResponse][source]¶
For apps launched on the scheduler, this API returns a list of ListAppResponse objects, each of which contains the app id and its status. Note: this API is in prototype phase and is subject to change.
- log_iter(app_id: str, role_name: str, k: int = 0, regex: Optional[str] = None, since: Optional[datetime] = None, until: Optional[datetime] = None, should_tail: bool = False, streams: Optional[Stream] = None) → Iterable[str][source]¶
Returns an iterator to the log lines of the kth replica of the role. The iterator ends when all qualifying log lines have been read.
If the scheduler supports time-based cursors for fetching log lines within custom time ranges, then the
since and until fields are honored, otherwise they are ignored. Not specifying since and until is equivalent to getting all available log lines. If until is empty, the iterator behaves like tail -f, following the log output until the job reaches a terminal state. The exact definition of what constitutes a log is scheduler specific. Some schedulers may consider stderr or stdout as the log; others may read the logs from a log file.
Behaviors and assumptions:
1. Produces undefined behavior if called on an app that does not exist. The caller should check that the app exists using
exists(app_id) prior to calling this method.
2. Is not stateful; calling this method twice with the same parameters returns a new iterator. Prior iteration progress is lost.
3. Does not always support log-tailing. Not all schedulers support live log iteration (e.g. tailing logs while the app is running). Refer to the specific scheduler's documentation for the iterator's behavior.
- 3.1 If the scheduler supports log-tailing, it should be controlled
by the
should_tail parameter.
4. Does not guarantee log retention. It is possible that by the time this method is called, the underlying scheduler may have purged the log records for this application. If so, this method raises an arbitrary exception.
5. If
should_tail is True, the method only raises a StopIteration exception when the accessible log lines have been fully exhausted and the app has reached a final state. For instance, if the app gets stuck and does not produce any log lines, the iterator blocks until the app is eventually killed (either via timeout or manually), at which point it raises a StopIteration exception. If
should_tail is False, the method raises StopIteration when there are no more logs.
6. Need not be supported by all schedulers.
7. Some schedulers may support line cursors by supporting
__getitem__ (e.g. iter[50] seeks to the 50th log line).
8. Whitespace is preserved; each new line should include
\n. To support interactive progress bars, the returned lines don't need to include
\n, but should then be printed without a newline to correctly handle \r carriage returns.
- Parameters:
streams – The IO output streams to select. One of: combined, stdout, stderr. If the selected stream isn't supported by the scheduler it will throw a ValueError.
- Returns:
An
Iterator over the log lines of the specified replica of the role.
- Raises:
NotImplementedError – if the scheduler does not support log iteration
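The whitespace rule above (point 8) can be sketched as follows. `lines` stands in for the output of a `log_iter(...)` call; since this scheduler reports log fetching as unsupported, the values here are purely illustrative:

```python
import sys

# Stand-in for an iterator returned by log_iter(...); the values are
# illustrative -- the '\r' carriage returns simulate a progress bar
# that redraws in place, and the final line ends with '\n' as usual.
lines = ["epoch 1/3\r", "epoch 2/3\r", "epoch 3/3\n"]

for line in lines:
    # Write without appending a newline, so '\r' overwrites in place
    # and lines that already end in '\n' are not double-spaced.
    sys.stdout.write(line)
    sys.stdout.flush()
```

Using `print(line)` here would append an extra `\n` after every chunk, breaking the in-place redraw that `\r` is meant to achieve.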
- schedule(dryrun_info: AppDryRunInfo[AWSSageMakerJob]) → str[source]¶
Same as
submit except that it takes an AppDryRunInfo. Implementers are encouraged to implement this method rather than directly implementing submit, since submit can be trivially implemented by:

    dryrun_info = self.submit_dryrun(app, cfg)
    return schedule(dryrun_info)
- class torchx.schedulers.aws_sagemaker_scheduler.AWSSageMakerJob(job_name: str, job_def: Dict[str, Any], images_to_push: Dict[str, Tuple[str, str]])[source]¶
Job defines the key values required to schedule a job. This will be the value of request in the AppDryRunInfo object.
job_name: defines the job name shown in SageMaker
job_def: defines the job description that will be used to schedule the job on SageMaker
images_to_push: used by torchx to push to image_repo
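A minimal sketch of the shape of these fields. All values are hypothetical: in practice the scheduler's dryrun step constructs them, and the exact `job_def` keys are an assumption based on the config options listed above:

```python
# Hypothetical AWSSageMakerJob field values, for illustration only.
job_name = "utils-echo-abc123"  # name that would be shown in SageMaker

# Assumed job-description keys mirroring the scheduler's config options;
# the real dict is built by submit_dryrun, not by hand.
job_def = {
    "role": "arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder ARN
    "instance_type": "ml.c4.xlarge",
    "instance_count": 1,
}

# local image id -> (remote repo to push to, remote tag); values assumed
images_to_push = {"sha256:abc123": ("example.com/your/container", "abc123")}
```

The `request` field of the resulting `AppDryRunInfo` would carry these three values together as one `AWSSageMakerJob`.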
Reference¶
- torchx.schedulers.aws_sagemaker_scheduler.create_scheduler(session_name: str, **kwargs: object) → AWSSageMakerScheduler[source]¶