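Getting started with environments, TED and transforms¶

Note

To run this tutorial in a notebook, add an installation cell at the beginning containing: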

!pip install tensordict
!pip install torchrl

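Welcome to the getting started tutorials!

This series covers a list of topics, each in its own tutorial.

If you are in a hurry, you can jump straight to the last tutorial, Your own first training loop. From there you can backtrack through the other "Getting Started" tutorials if anything is unclear or if you want to learn more about a specific topic!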

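Environments in RL¶

The standard RL (reinforcement learning) training loop involves a model, also known as a policy, which is trained to accomplish a task within a specific environment. Often, this environment is a simulator that accepts actions as input and produces an observation along with some metadata as output.

In this document, we will explore the environment API of TorchRL: we will learn how to create an environment, interact with it, and understand the data format it uses.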

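Creating an environment¶

In essence, TorchRL does not directly provide environments. Instead, it offers wrappers for other libraries that encapsulate the simulators. The envs module can be viewed as a provider of a generic environment API, as well as a central hub for simulation backends like gym (GymEnv), Brax (BraxEnv) or DeepMind Control Suite (DMControlEnv).

Creating your environment is typically as straightforward as the underlying backend API allows. Here is an example using gym: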

from torchrl.envs import GymEnv

env = GymEnv("Pendulum-v1")

/pytorch/rl/torchrl/envs/common.py:2989: DeprecationWarning: Your wrapper was not given a device. Currently, this value will default to 'cpu'. From v0.5 it will default to `None`. With a device of None, no device casting is performed and the resulting tensordicts are deviceless. Please set your device accordingly.
  warnings.warn(

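Running an environment¶

Environments in TorchRL have two crucial methods: reset(), which initiates an episode, and step(), which executes an action. In TorchRL, environment methods read and write TensorDict instances. Essentially, TensorDict is a generic key-based data carrier for tensors. The benefit of using TensorDict over plain tensors is that it lets us handle simple and complex data structures interchangeably. Because our function signatures are very generic, there is no need to accommodate different data formats. In simpler terms, after this brief tutorial you will be able to operate on both simple and highly complex environments, because their user-facing API is identical and simple!

Let's put the environment into action and see what a tensordict instance looks like: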

reset = env.reset()
print(reset)

TensorDict(
    fields={
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        observation: Tensor(shape=torch.Size([3]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        truncated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)

Now let's take a random action in the action space. First, sample the action:

reset_with_action = env.rand_action(reset)
print(reset_with_action)

TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        observation: Tensor(shape=torch.Size([3]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        truncated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)

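This tensordict has the same structure as the one obtained from reset(), with an additional "action" entry. You can access the action easily, as you would with a regular dictionary: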

print(reset_with_action["action"])

tensor([0.9439])

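We now need to pass this action to the environment. We pass the entire tensordict to the step method, since there might be more than one tensor to read in more advanced cases such as multi-agent RL or stateless environments: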

stepped_data = env.step(reset_with_action)
print(stepped_data)

TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([3]), device=cpu, dtype=torch.float32, is_shared=False),
                reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                truncated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False)},
            batch_size=torch.Size([]),
            device=cpu,
            is_shared=False),
        observation: Tensor(shape=torch.Size([3]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        truncated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)

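Again, this new tensordict is identical to the previous one except that it has a "next" entry (itself a tensordict!) containing the observation, reward and done state resulting from our action.

We call this format TED, for TorchRL Episode Data format. It is the ubiquitous way of representing data in the library, both dynamically, as here, and statically with offline datasets.

The last bit of information you need to run a rollout in the environment is how to bring that "next" entry to the root in order to perform the next step. TorchRL provides a dedicated step_mdp() function that does just that: it filters out the information you won't need and delivers a data structure corresponding to your observation after a step in the Markov Decision Process (MDP).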

from torchrl.envs import step_mdp

data = step_mdp(stepped_data)
print(data)

TensorDict(
    fields={
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        observation: Tensor(shape=torch.Size([3]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        truncated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)
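To make the "next"-to-root move concrete, here is a toy sketch on plain Python dicts. It is not TorchRL's implementation: the function name toy_step_mdp and the sample values are made up for illustration. It only mirrors what the printed output above shows, namely that the action and the reward are left behind while the post-step observation and done flags move to the root.

```python
def toy_step_mdp(stepped):
    """Toy stand-in for step_mdp() on plain dicts (not TorchRL's implementation)."""
    # Keep root entries that are neither the action nor the nested "next" dict.
    promoted = {k: v for k, v in stepped.items() if k not in ("action", "next")}
    # Overwrite them with the post-step values, leaving the reward behind.
    promoted.update({k: v for k, v in stepped["next"].items() if k != "reward"})
    return promoted


# Hypothetical values shaped like the Pendulum transition printed above.
stepped = {
    "action": [0.9439],
    "observation": [1.0, 0.0, 0.0],
    "done": [False],
    "terminated": [False],
    "truncated": [False],
    "next": {
        "observation": [0.99, 0.05, 1.1],
        "reward": [-0.02],
        "done": [False],
        "terminated": [False],
        "truncated": [False],
    },
}

data = toy_step_mdp(stepped)
print(sorted(data))  # ['done', 'observation', 'terminated', 'truncated']
```

The input dict is left untouched, so the full transition (with its reward) can still be stored before moving on.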

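Environment rollouts¶

Writing down those three steps (computing an action, making a step, moving in the MDP) can be a bit tedious and repetitive. Fortunately, TorchRL provides a convenient rollout() function that runs them in a closed loop for you: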

rollout = env.rollout(max_steps=10)
print(rollout)

TensorDict(
    fields={
        action: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                reward: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                truncated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
            batch_size=torch.Size([10]),
            device=cpu,
            is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        truncated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([10]),
    device=cpu,
    is_shared=False)

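This data looks very much like the stepped_data above, except for its batch size, which now equals the number of steps we provided through the max_steps argument. The magic of tensordict doesn't end there: if you're interested in a single transition of this environment, you can index the tensordict as you would index a tensor: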

transition = rollout[3]
print(transition)

TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([3]), device=cpu, dtype=torch.float32, is_shared=False),
                reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                truncated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False)},
            batch_size=torch.Size([]),
            device=cpu,
            is_shared=False),
        observation: Tensor(shape=torch.Size([3]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        truncated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)

TensorDict automatically checks whether the index you provided is a key (in which case we index along the key dimension) or a spatial index, as here.
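The key-versus-position dispatch can be illustrated with a small toy container. This is only a sketch of the idea, assuming plain Python lists in place of tensors. The class ToyBatch is hypothetical and does not reflect how TensorDict is implemented.

```python
class ToyBatch:
    """Toy container: named fields that all share a leading batch dimension."""

    def __init__(self, fields):
        self.fields = fields  # name -> list of per-step values, or a nested ToyBatch

    def __getitem__(self, index):
        if isinstance(index, str):
            # A string is a key: return the whole field.
            return self.fields[index]
        if isinstance(index, tuple) and all(isinstance(k, str) for k in index):
            # A tuple of strings is a nested key, such as ("next", "reward").
            value = self
            for key in index:
                value = value[key]
            return value
        # Anything else is treated as a positional index applied to every field.
        return ToyBatch({name: value[index] for name, value in self.fields.items()})


rollout = ToyBatch({
    "observation": [[0.0, 1.0, 0.1 * t] for t in range(10)],
    "next": ToyBatch({"reward": [[-0.5 * t] for t in range(10)]}),
})

transition = rollout[3]                   # positional index: one transition
print(transition["next", "reward"])       # [-1.5]
print(len(rollout["observation"]))        # 10
```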

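Executed this way (without a policy), the rollout method may seem rather useless: it just runs random actions. If a policy is available, it can be passed to the method and used to collect data.

Nevertheless, it can be useful to run a naive, policy-less rollout first to check at a glance what to expect from an environment.

To appreciate the versatility of TorchRL's API, consider that the rollout method is universally applicable. It works across all use cases, whether you're using a single environment like this one, multiple copies across various processes, a multi-agent environment, or even a stateless version of it!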
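As a rough picture of the closed loop that rollout() wraps, the three steps can be written by hand against a toy environment. Everything below is an assumption made for illustration: ToyEnv, toy_rollout and always_right are hypothetical names, the data is plain dicts rather than tensordicts, and the early stop on "done" is a choice of this sketch.

```python
import random


class ToyEnv:
    """Hypothetical 1-D environment: the observation is a running sum of actions."""

    def reset(self):
        self.position = 0.0
        return {"observation": self.position, "done": False}

    def rand_action(self, data):
        return {**data, "action": random.uniform(-1.0, 1.0)}

    def step(self, data):
        self.position += data["action"]
        nxt = {
            "observation": self.position,
            "reward": -abs(self.position),
            "done": abs(self.position) > 5.0,
        }
        return {**data, "next": nxt}


def toy_rollout(env, max_steps, policy=None):
    data = env.reset()
    transitions = []
    for _ in range(max_steps):
        # 1. Compute an action: random unless a policy is supplied.
        data = policy(data) if policy is not None else env.rand_action(data)
        # 2. Make a step: the result lands under "next".
        data = env.step(data)
        transitions.append(data)
        if data["next"]["done"]:
            break
        # 3. Move in the MDP: the next observation becomes the current one.
        data = {k: v for k, v in data["next"].items() if k != "reward"}
    return transitions


def always_right(data):
    # Hypothetical policy: reads the data and writes an "action" entry.
    return {**data, "action": 1.0}


random_steps = toy_rollout(ToyEnv(), max_steps=10)
policy_steps = toy_rollout(ToyEnv(), max_steps=10, policy=always_right)
print(len(policy_steps), policy_steps[-1]["next"]["observation"])  # 6 6.0
```

The policy here has the same shape as the random-action helper: it receives the current data and returns it with an "action" entry added, which is why the two are interchangeable inside the loop.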

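Transforming an environment¶

Most of the time, you'll want to modify the output of the environment to better suit your requirements. For example, you might want to monitor the number of steps executed since the last reset, resize images, or stack consecutive observations together.

In this section, we'll examine a simple transform, the StepCounter transform. The complete list of transforms is available in the TorchRL documentation.

The transform is integrated with the environment through a TransformedEnv: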

from torchrl.envs import StepCounter, TransformedEnv

transformed_env = TransformedEnv(env, StepCounter(max_steps=10))
rollout = transformed_env.rollout(max_steps=100)
print(rollout)

TensorDict(
    fields={
        action: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                reward: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                step_count: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.int64, is_shared=False),
                terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                truncated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
            batch_size=torch.Size([10]),
            device=cpu,
            is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        step_count: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.int64, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        truncated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([10]),
    device=cpu,
    is_shared=False)

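As you can see, our environment now has one more entry, "step_count", which tracks the number of steps since the last reset. Because we passed the optional argument max_steps=10 to the transform constructor, the trajectory was also truncated after 10 steps, instead of completing the full rollout of 100 steps requested in the rollout call. We can confirm that the trajectory was truncated by looking at the truncated entry: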

print(rollout["next", "truncated"])

tensor([[False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [ True]])
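The counting-and-truncating behaviour seen above can be sketched with a toy wrapper. This is an illustration under stated assumptions, not the StepCounter implementation: ToyStepCounter and base_step are hypothetical, and the wrapped "environment" is just a function returning a dict.

```python
class ToyStepCounter:
    """Toy wrapper that counts steps and truncates after max_steps (illustration only)."""

    def __init__(self, step_fn, max_steps=None):
        self.step_fn = step_fn      # any callable mapping an action to a dict of outputs
        self.max_steps = max_steps
        self.count = 0

    def reset(self):
        self.count = 0
        return {"step_count": self.count, "truncated": False}

    def step(self, action):
        out = dict(self.step_fn(action))
        self.count += 1
        out["step_count"] = self.count
        hit_limit = self.max_steps is not None and self.count >= self.max_steps
        out["truncated"] = out.get("truncated", False) or hit_limit
        return out


def base_step(action):
    # Hypothetical base environment step that never ends on its own.
    return {"observation": action * 2, "truncated": False}


env = ToyStepCounter(base_step, max_steps=10)
env.reset()
flags = []
for _ in range(100):          # ask for up to 100 steps
    out = env.step(1.0)
    flags.append(out["truncated"])
    if out["truncated"]:      # stop as soon as the episode is truncated
        break

print(len(flags), flags[-1])  # 10 True
```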

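That's all for this short introduction to TorchRL's environment API!

Next steps¶

To explore further what TorchRL's environments can do, check out the following:

The step_and_maybe_reset() method, which packs together step(), step_mdp() and reset().
Some environments, such as GymEnv, support rendering through the from_pixels argument. Check the class docstrings to learn more!
The batched environments, in particular ParallelEnv, which allows you to run multiple copies of the same (or different!) environments on multiple processes.
Design your own environment with the Pendulum tutorial and learn about specs and stateless environments.
See the more in-depth tutorial about environments in the dedicated tutorial.
Check the multi-agent environment API if you're interested in MARL.
TorchRL has many tools to interact with the Gym API, such as a way to register TorchRL environments in the Gym register through register_gym(), an API to read the info dictionaries through set_info_dict_reader(), and a way to control the gym backend through set_gym_backend().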

Total running time of the script: (0 minutes 42.862 seconds)

Estimated memory usage: 9 MB
