Using pretrained models

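This tutorial explains how to use pretrained models in TorchRL.

By the end of this tutorial, you will be able to use pretrained models for efficient image representation and to fine-tune them.

TorchRL provides pretrained models that can be used either as transforms or as components of a policy. Because the semantics are the same in both cases, they can be used interchangeably in either context. In this tutorial we use R3M (https://arxiv.org/abs/2203.12601), but other models (e.g. VIP) work equally well.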

import multiprocessing

import torch.cuda
from tensordict.nn import TensorDictSequential
from torch import nn
from torchrl.envs import R3MTransform, TransformedEnv
from torchrl.envs.libs.gym import GymEnv
from torchrl.modules import Actor

is_fork = multiprocessing.get_start_method() == "fork"
device = (
    torch.device(0)
    if torch.cuda.is_available() and not is_fork
    else torch.device("cpu")
)

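Let us first create an environment. For simplicity, we use a common Gym environment. In practice, the same approach applies to more challenging embodied-AI settings (see, for example, our Habitat wrappers).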

base_env = GymEnv("Ant-v4", from_pixels=True, device=device)

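Let us fetch our pretrained model. We request the pretrained weights by setting the `download=True` flag, which is off by default. Next, we append the transform to the environment. In practice, every batch of collected data goes through the transform and is mapped onto an `r3m_vec` entry in the output tensordict. Our policy, a small MLP with a single hidden layer, then reads this vector and computes the corresponding action.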

r3m = R3MTransform(
    "resnet50",
    in_keys=["pixels"],
    download=True,
)
env_transformed = TransformedEnv(base_env, r3m)
net = nn.Sequential(
    nn.LazyLinear(128, device=device),
    nn.Tanh(),
    nn.Linear(128, base_env.action_spec.shape[-1], device=device),
)
policy = Actor(net, in_keys=["r3m_vec"])

Downloading: "https://pytorch.s3.amazonaws.com/models/rl/r3m/r3m_50.pt" to /root/.cache/torch/hub/checkpoints/r3m_50.pt

100%|██████████| 374M/374M [00:04<00:00, 86.7MB/s]

Let us check how many parameter tensors the policy holds:

print("number of params:", len(list(policy.parameters())))

number of params: 4
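
The count of 4 refers to parameter tensors, not scalar weights: each of the two linear layers contributes one weight and one bias. The sketch below rebuilds a comparable head in plain PyTorch and feeds it a random 2048-dimensional vector as a stand-in for `r3m_vec`. The sizes 2048 and 8 are assumptions taken from the shapes printed in the rollouts of this tutorial.

```python
import torch
from torch import nn

# Stand-in for the policy head; 8 is the Ant action dimension seen in the rollouts.
head = nn.Sequential(
    nn.LazyLinear(128),
    nn.Tanh(),
    nn.Linear(128, 8),
)

# Lazy layers only materialize their weights on the first forward pass.
fake_embedding = torch.randn(1, 2048)
with torch.no_grad():
    action = head(fake_embedding)

# Count parameter tensors (what the tutorial prints) and scalar values.
num_tensors = len(list(head.parameters()))
num_scalars = sum(p.numel() for p in head.parameters())
print("parameter tensors:", num_tensors)
print("scalar parameters:", num_scalars)
```

With these sizes the head holds 2048·128 + 128 + 128·8 + 8 = 263,304 scalar values spread over 4 tensors.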

We collect a rollout of 32 steps and print its output:

rollout = env_transformed.rollout(32, policy)
print("rollout with transform:", rollout)

rollout with transform: TensorDict(
    fields={
        action: Tensor(shape=torch.Size([32, 8]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                r3m_vec: Tensor(shape=torch.Size([32, 2048]), device=cpu, dtype=torch.float32, is_shared=False),
                reward: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                truncated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
            batch_size=torch.Size([32]),
            device=cpu,
            is_shared=False),
        r3m_vec: Tensor(shape=torch.Size([32, 2048]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        truncated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([32]),
    device=cpu,
    is_shared=False)

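For fine-tuning, we integrate the transform into the policy after making its parameters trainable. In practice, it may be wiser to restrict training to a subset of the parameters (for example, the last layer of the MLP).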

r3m.train()
policy = TensorDictSequential(r3m, policy)
print("number of params after r3m is integrated:", len(list(policy.parameters())))

number of params after r3m is integrated: 163
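
Restricting fine-tuning to a subset of parameters, as suggested above, usually comes down to toggling `requires_grad`. The sketch below uses a small stand-in network rather than R3M itself, with arbitrary layer sizes chosen for illustration. It freezes everything and then re-enables gradients for the last linear layer only.

```python
import torch
from torch import nn

# Hypothetical stand-in for "pretrained encoder + MLP head".
model = nn.Sequential(
    nn.Linear(16, 32),  # plays the role of the pretrained encoder
    nn.ReLU(),
    nn.Linear(32, 32),
    nn.Tanh(),
    nn.Linear(32, 4),   # last layer: the only part we fine-tune
)

# Freeze every parameter, then unfreeze the last layer.
for param in model.parameters():
    param.requires_grad_(False)
for param in model[-1].parameters():
    param.requires_grad_(True)

# Hand only the trainable parameters to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)

# One dummy optimization step on random data.
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()
optimizer.step()

frozen_grads = [p.grad for p in model[0].parameters()]
print("trainable tensors:", len(trainable))
```

Frozen layers never receive gradients, so the optimizer state stays small and the pretrained features are left intact.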

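Again, we collect a rollout with R3M. The structure of the output has changed slightly, as the environment now returns pixels (and not an embedding). The embedding `r3m_vec` is an intermediate result of our policy.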

rollout = base_env.rollout(32, policy)
print("rollout, fine tuning:", rollout)

rollout, fine tuning: TensorDict(
    fields={
        action: Tensor(shape=torch.Size([32, 8]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                pixels: Tensor(shape=torch.Size([32, 480, 480, 3]), device=cpu, dtype=torch.uint8, is_shared=False),
                reward: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                truncated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
            batch_size=torch.Size([32]),
            device=cpu,
            is_shared=False),
        r3m_vec: Tensor(shape=torch.Size([32, 2048]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        truncated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([32]),
    device=cpu,
    is_shared=False)

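The ease with which we swapped the transform from the environment to the policy comes from the fact that both behave like a TensorDictModule: each has a set of `in_keys` and `out_keys` that make it easy to read and write outputs in different contexts.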
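
To make the key-based convention concrete, here is a toy sketch in plain Python. It is not the TorchRL or tensordict API: `KeyedModule`, `chain`, and the two lambda functions are hypothetical stand-ins that only mimic the idea of reading `in_keys` from a dictionary and writing `out_keys` back into it.

```python
class KeyedModule:
    """Toy stand-in for a key-based module: reads in_keys, writes out_keys."""

    def __init__(self, fn, in_keys, out_keys):
        self.fn = fn
        self.in_keys = in_keys
        self.out_keys = out_keys

    def __call__(self, data):
        outputs = self.fn(*(data[key] for key in self.in_keys))
        if len(self.out_keys) == 1:
            outputs = (outputs,)
        data.update(zip(self.out_keys, outputs))
        return data


def chain(*modules):
    """Apply modules one after another on the same dictionary."""
    def run(data):
        for module in modules:
            data = module(data)
        return data
    return run


# Hypothetical encoder and head operating on plain Python lists.
encoder = KeyedModule(lambda px: [sum(px) / len(px)], ["pixels"], ["embedding"])
head = KeyedModule(lambda emb: [2.0 * emb[0]], ["embedding"], ["action"])

# Option 1: the "environment" side applies the encoder, the policy is just the head.
obs = encoder({"pixels": [0.0, 1.0, 2.0]})
out_env_side = head(obs)

# Option 2: the policy owns the encoder and consumes raw pixels.
policy = chain(encoder, head)
out_policy_side = policy({"pixels": [0.0, 1.0, 2.0]})

print(out_env_side["action"], out_policy_side["action"])
```

Because each piece only depends on the keys it reads and writes, moving the encoder from one side to the other does not change the resulting action.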

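To conclude this tutorial, let us see how R3M can be used to read images stored in a replay buffer (for example, in an offline RL setting). First, let us build our dataset: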

from torchrl.data import LazyMemmapStorage, ReplayBuffer

storage = LazyMemmapStorage(1000)
rb = ReplayBuffer(storage=storage, transform=r3m)

We can now collect the data (random rollouts for our purpose) and fill the replay buffer with it:

total = 0
while total < 1000:
    tensordict = base_env.rollout(1000)
    rb.extend(tensordict)
    total += tensordict.numel()

Let us check what the replay buffer storage looks like. It should not contain the `r3m_vec` entry, since the transform has not been applied yet:

print("stored data:", storage._storage)

stored data: TensorDict(
    fields={
        action: MemoryMappedTensor(shape=torch.Size([1000, 8]), device=cpu, dtype=torch.float32, is_shared=False),
        done: MemoryMappedTensor(shape=torch.Size([1000, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: MemoryMappedTensor(shape=torch.Size([1000, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                pixels: MemoryMappedTensor(shape=torch.Size([1000, 480, 480, 3]), device=cpu, dtype=torch.uint8, is_shared=False),
                reward: MemoryMappedTensor(shape=torch.Size([1000, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: MemoryMappedTensor(shape=torch.Size([1000, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                truncated: MemoryMappedTensor(shape=torch.Size([1000, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
            batch_size=torch.Size([1000]),
            device=cpu,
            is_shared=False),
        pixels: MemoryMappedTensor(shape=torch.Size([1000, 480, 480, 3]), device=cpu, dtype=torch.uint8, is_shared=False),
        terminated: MemoryMappedTensor(shape=torch.Size([1000, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        truncated: MemoryMappedTensor(shape=torch.Size([1000, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([1000]),
    device=cpu,
    is_shared=False)
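
Storing raw frames is the expensive part of this setup. As a rough back-of-the-envelope estimate based on the shapes and dtypes printed above (ignoring the small non-image entries), the two `pixels` entries dominate the memory-mapped storage, whereas a 2048-dimensional float32 embedding would be far smaller per frame:

```python
# Shapes and dtypes taken from the printed storage above.
num_frames = 1000
pixel_bytes_per_frame = 480 * 480 * 3 * 1   # uint8: 1 byte per value
embedding_bytes_per_frame = 2048 * 4        # float32: 4 bytes per value

# Both "pixels" and ("next", "pixels") are stored, hence the factor 2.
pixel_storage_bytes = 2 * num_frames * pixel_bytes_per_frame
embedding_storage_bytes = 2 * num_frames * embedding_bytes_per_frame

print(f"raw pixels: {pixel_storage_bytes / 1e9:.2f} GB")
print(f"embeddings: {embedding_storage_bytes / 1e6:.2f} MB")
```

Keeping raw pixels costs more disk space but leaves the encoder free to change, which matters if it is being fine-tuned. Caching embeddings would be cheaper but ties the dataset to one fixed encoder.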

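When sampling, the data goes through the R3M transform, which gives us the processed data we wanted. In this way, we can train an algorithm offline on a dataset made of images: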

batch = rb.sample(32)
print("data after sampling:", batch)

data after sampling: TensorDict(
    fields={
        action: Tensor(shape=torch.Size([32, 8]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                pixels: Tensor(shape=torch.Size([32, 480, 480, 3]), device=cpu, dtype=torch.uint8, is_shared=False),
                reward: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                truncated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
            batch_size=torch.Size([32]),
            device=cpu,
            is_shared=False),
        r3m_vec: Tensor(shape=torch.Size([32, 2048]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        truncated: Tensor(shape=torch.Size([32, 1]), device=cpu, dtype=torch.bool, is_shared=False)},
    batch_size=torch.Size([32]),
    device=cpu,
    is_shared=False)
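
The pattern used here (store raw data, transform only at sampling time) can be illustrated with a toy buffer in plain Python. This is a sketch under stated assumptions, not how TorchRL implements `ReplayBuffer`: `ToyReplayBuffer` and `fake_encoder` are hypothetical names, and the "encoder" simply averages the pixel values.

```python
import random


class ToyReplayBuffer:
    """Toy buffer: stores raw items and applies `transform` only in sample()."""

    def __init__(self, capacity, transform=None):
        self.capacity = capacity
        self.transform = transform
        self.storage = []

    def extend(self, items):
        # Keep at most `capacity` of the most recent items.
        self.storage.extend(items)
        self.storage = self.storage[-self.capacity:]

    def sample(self, batch_size):
        batch = random.choices(self.storage, k=batch_size)
        if self.transform is not None:
            batch = [self.transform(item) for item in batch]
        return batch


def fake_encoder(item):
    """Hypothetical stand-in for R3M: swaps "pixels" for a tiny "embedding"."""
    out = dict(item)  # copy, so the stored item is left untouched
    pixels = out.pop("pixels")
    out["embedding"] = [sum(pixels) / len(pixels)]
    return out


random.seed(0)
buffer = ToyReplayBuffer(capacity=100, transform=fake_encoder)
buffer.extend({"pixels": [i, i + 1, i + 2], "reward": 0.0} for i in range(10))
batch = buffer.sample(4)

print("stored keys: ", sorted(buffer.storage[0]))
print("sampled keys:", sorted(batch[0]))
```

The stored items keep their `pixels` entry, while sampled items carry an `embedding` instead, mirroring how `r3m_vec` replaces the top-level `pixels` entry in the sampled batch above.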

Total running time of the script: (0 minutes 49.894 seconds)

Estimated memory usage: 2036 MB
