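Pendulum: Writing your environment and transforms with TorchRL

Author: Vincent Moens

Creating an environment (a simulator or an interface to a physical control system) is an integral part of reinforcement learning and control engineering.

TorchRL provides a set of tools to do this in multiple contexts. This tutorial demonstrates how to use PyTorch and TorchRL to code a pendulum simulator from the ground up. It is freely inspired by the Pendulum-v1 implementation from the OpenAI-Gym/Farama-Gymnasium control library.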

Figure: Simple pendulum

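Key learnings:

  • How to design an environment in TorchRL: writing specs (input, observation and reward) and implementing behavior (seeding, reset and step);

  • Transforming your environment inputs and outputs, and writing your own transforms;

  • How to use TensorDict to carry arbitrary data structures through the codebase.

In the process, we will touch on three crucial components of TorchRL: environments, transforms and models (policy and value function).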

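To give a sense of what can be achieved with TorchRL's environments, we will be designing a stateless environment. While stateful environments keep track of the latest physical state encountered and rely on it to simulate the state-to-state transition, stateless environments expect the current state to be provided to them, along with the action undertaken. TorchRL supports both types of environments, but stateless environments are more generic and hence cover a broader range of features of the environment API in TorchRL.

Modeling stateless environments gives users full control over the inputs and outputs of the simulator: one can reset an experiment at any stage or actively modify the dynamics from the outside. However, it assumes that we have some control over the task, which may not always be the case: solving a problem where we cannot control the current state is more challenging but has a much wider set of applications.

Another advantage of stateless environments is that they can enable batched execution of transition simulations. If the backend and the implementation allow it, an algebraic operation can be executed seamlessly on scalars, vectors or tensors. This tutorial gives such examples.

This tutorial is structured as follows:

  • We will first get acquainted with the environment properties: its shape (batch_size), its methods (mainly step() and reset()) and finally its specs.

  • After having coded our simulator, we will demonstrate how it can be used during training with transforms.

  • We will explore new avenues that follow from TorchRL's API, including: the possibility of transforming inputs, the vectorized execution of the simulation and the possibility of backpropagating through the simulation graph.

  • Finally, we will train a simple policy to solve the system we implemented.

A built-in version of this environment can be found in torchrl.envs.PendulumEnv.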

from collections import defaultdict
from typing import Optional

import numpy as np
import torch
import tqdm
from tensordict import TensorDict, TensorDictBase
from tensordict.nn import TensorDictModule
from torch import nn

from torchrl.data import Bounded, Composite, Unbounded
from torchrl.envs import (
    CatTensors,
    EnvBase,
    Transform,
    TransformedEnv,
    UnsqueezeTransform,
)
from torchrl.envs.transforms.transforms import _apply_to_composite
from torchrl.envs.utils import check_env_specs, step_mdp

DEFAULT_X = np.pi
DEFAULT_Y = 1.0

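There are four things you must take care of when designing a new environment class:

  • EnvBase._reset(), which codes for the resetting of the simulator at a (potentially random) initial state;

  • EnvBase._step(), which codes for the state transition dynamic;

  • EnvBase._set_seed(), which implements the seeding mechanism;

  • the environment specs.

Let us first describe the problem at hand: we would like to model a simple pendulum over which we can control the torque applied on its fixed point. Our goal is to place the pendulum in the upward position (angular position at 0 by convention) and have it stand still in that position. To design our dynamic system, we need to define two equations: the motion equation following an action (the torque applied) and the reward equation that will constitute our objective function.

For the motion equation, we will update the angular velocity following: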

\[\dot{\theta}_{t+1} = \dot{\theta}_t + (3 * g / (2 * L) * \sin(\theta_t) + 3 / (m * L^2) * u) * dt\]

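where \(\dot{\theta}\) is the angular velocity in rad/sec, \(g\) is the gravitational force, \(L\) is the pendulum length, \(m\) is its mass, \(\theta\) is its angular position and \(u\) is the torque. The angular position is then updated according to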

\[\theta_{t+1} = \theta_{t} + \dot{\theta}_{t+1} dt\]

We define our reward as

\[r = -(\theta^2 + 0.1 * \dot{\theta}^2 + 0.001 * u^2)\]

which is maximized when the angle is close to 0 (pendulum in the upward position), the angular velocity is close to 0 (no motion) and the torque is 0 too.
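
As a worked example of the two update equations and the reward, the following minimal sketch performs a single explicit Euler step in plain Python. The parameter values (g = 10, m = 1, L = 1, dt = 0.05) match the defaults used later in this tutorial; the helper name `euler_step` and the starting state are assumptions chosen for illustration, and the sketch skips the torque and speed clamping that the tensor-based implementation applies.

```python
import math


def euler_step(th, thdot, u, g=10.0, m=1.0, L=1.0, dt=0.05):
    """One explicit Euler step of the pendulum equations (hypothetical helper)."""
    # The reward is computed from the state and torque before the update.
    reward = -(th**2 + 0.1 * thdot**2 + 0.001 * u**2)
    # Angular velocity update, then angular position update.
    new_thdot = thdot + (3 * g / (2 * L) * math.sin(th) + 3 / (m * L**2) * u) * dt
    new_th = th + new_thdot * dt
    return new_th, new_thdot, reward


# Pendulum horizontal (th = pi/2), at rest, with no torque applied.
new_th, new_thdot, reward = euler_step(math.pi / 2, 0.0, 0.0)
print(new_th, new_thdot, reward)
```

With zero torque, gravity alone increases the angular velocity, so the angle moves further away from the upward position at 0, as expected for an unactuated pendulum.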

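Coding the effect of an action: _step()

The step method is the first thing to consider, as it will encode the simulation that is of interest to us. In TorchRL, the EnvBase class has an EnvBase.step() method that receives a tensordict.TensorDict instance with an "action" entry indicating what action is to be taken.

To facilitate reading from and writing to that tensordict, and to make sure that the keys are consistent with what the library expects, the simulation part has been delegated to a private abstract method, _step(), which reads input data from a tensordict and writes a new tensordict with the output data.

The _step() method should do the following:

  1. Read the input keys (such as "action") and execute the simulation based on these;

  2. Retrieve observations, done state and reward;

  3. Write the set of observation values along with the reward and done state at the corresponding entries in a new TensorDict.

Next, the step() method will merge the output of _step() into the input tensordict to enforce input/output consistency.

Typically, for stateful environments, this will look like this: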

>>> tensordict = policy(env.reset())
>>> print(tensordict)
TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)
>>> tensordict = env.step(tensordict)
>>> print(tensordict)
TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=cpu,
            is_shared=False),
        observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=cpu,
    is_shared=False)

Notice that the root tensordict has not changed: the only modification is the appearance of a new "next" entry that contains the new information.
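
The merge described above can be sketched with plain dictionaries standing in for TensorDict instances. This is only an illustration of the data layout, not TorchRL's implementation; `step` and `toy_step` are hypothetical names.

```python
def step(private_step_fn, td):
    """Mimic the merge done by the public step(): nest the raw output under "next"."""
    out = private_step_fn(td)  # analogous to calling the private _step()
    merged = dict(td)          # root entries are copied over unchanged
    merged["next"] = out       # the new information appears under "next"
    return merged


def toy_step(td):
    # Hypothetical transition: the observation moves by the value of the action.
    return {
        "observation": td["observation"] + td["action"],
        "reward": -1.0,
        "done": False,
    }


root = {"observation": 0.5, "action": 0.25, "done": False}
result = step(toy_step, root)
print(result)
```

The root entries ("observation", "action", "done") keep their values, and everything produced by the transition lives under "next".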

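In the Pendulum example, our _step() method will read the relevant entries from the input tensordict and compute the position and velocity of the pendulum after the force encoded by the "action" key has been applied onto it. We compute the new angular position of the pendulum, "new_th", as the previous position "th" plus the new velocity "new_thdot" over a time interval dt.

Since our goal is to turn the pendulum up and keep it still in that position, our cost (negative reward) function is lower for positions close to the target and for low speeds. Indeed, we want to discourage positions that are far from being "upward" and/or speeds that are far from 0.

In our example, EnvBase._step() is encoded as a static method since our environment is stateless. In stateful settings, the self argument is needed as the state needs to be read from the environment.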

def _step(tensordict):
    th, thdot = tensordict["th"], tensordict["thdot"]  # th := theta

    g_force = tensordict["params", "g"]
    mass = tensordict["params", "m"]
    length = tensordict["params", "l"]
    dt = tensordict["params", "dt"]
    u = tensordict["action"].squeeze(-1)
    u = u.clamp(-tensordict["params", "max_torque"], tensordict["params", "max_torque"])
    costs = angle_normalize(th) ** 2 + 0.1 * thdot**2 + 0.001 * (u**2)

    new_thdot = (
        thdot
        + (3 * g_force / (2 * length) * th.sin() + 3.0 / (mass * length**2) * u) * dt
    )
    new_thdot = new_thdot.clamp(
        -tensordict["params", "max_speed"], tensordict["params", "max_speed"]
    )
    new_th = th + new_thdot * dt
    reward = -costs.view(*tensordict.shape, 1)
    done = torch.zeros_like(reward, dtype=torch.bool)
    out = TensorDict(
        {
            "th": new_th,
            "thdot": new_thdot,
            "params": tensordict["params"],
            "reward": reward,
            "done": done,
        },
        tensordict.shape,
    )
    return out


def angle_normalize(x):
    return ((x + torch.pi) % (2 * torch.pi)) - torch.pi
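
To see what angle_normalize does to the cost, here is the same formula applied to plain floats with the standard math module (a float-based stand-in for the tensor version above). Angles are wrapped into the interval [-pi, pi), so a pendulum that has made a full turn is treated the same as one that has not.

```python
import math


def angle_normalize(x):
    # Same formula as the tensor version, applied to a Python float.
    return ((x + math.pi) % (2 * math.pi)) - math.pi


for angle in (0.0, 3 * math.pi / 2, -3 * math.pi / 2, 2 * math.pi + 0.1):
    print(f"{angle:+.4f} -> {angle_normalize(angle):+.4f}")
```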

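Resetting the simulator: _reset()

The second method we need to care about is the _reset() method. Like _step(), it should write the observation entries and possibly a done state in the tensordict it outputs (if the done state is omitted, it will be filled as False by the parent method reset()). In some contexts, the _reset method needs to receive a command from the function that called it (for example, in multi-agent settings we may want to indicate which agents need to be reset). This is why the _reset() method also expects a tensordict as input, although it may perfectly well be empty or None.

The parent EnvBase.reset() does some simple checks, as EnvBase.step() does, such as making sure that a "done" state is returned in the output tensordict and that the shapes match what is expected from the specs.

For us, the only important thing to consider is whether the output of EnvBase._reset() contains all the expected observations. Once more, since we are working with a stateless environment, we pass the configuration of the pendulum in a nested tensordict named "params".

In this example, we do not pass a done state as this is not mandatory for _reset() and our environment is non-terminating, so we always expect it to be False.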

def _reset(self, tensordict):
    if tensordict is None or tensordict.is_empty():
        # if no ``tensordict`` is passed, we generate a single set of hyperparameters
        # Otherwise, we assume that the input ``tensordict`` contains all the relevant
        # parameters to get started.
        tensordict = self.gen_params(batch_size=self.batch_size)

    high_th = torch.tensor(DEFAULT_X, device=self.device)
    high_thdot = torch.tensor(DEFAULT_Y, device=self.device)
    low_th = -high_th
    low_thdot = -high_thdot

    # for non batch-locked environments, the input ``tensordict`` shape dictates the number
    # of simulators run simultaneously. In other contexts, the initial
    # random state's shape will depend upon the environment batch-size instead.
    th = (
        torch.rand(tensordict.shape, generator=self.rng, device=self.device)
        * (high_th - low_th)
        + low_th
    )
    thdot = (
        torch.rand(tensordict.shape, generator=self.rng, device=self.device)
        * (high_thdot - low_thdot)
        + low_thdot
    )
    out = TensorDict(
        {
            "th": th,
            "thdot": thdot,
            "params": tensordict["params"],
        },
        batch_size=tensordict.shape,
    )
    return out

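Environment metadata: env.*_spec

The specs define the input and output domain of the environment. It is important that the specs accurately define the tensors that will be received at runtime, as they are often used to carry information about environments. They can also be used to instantiate lazily defined neural networks and test scripts without actually querying the environment (which can be costly with real-world physical systems, for instance).

There are four specs that we must code in our environment:

  • EnvBase.observation_spec: this will be a Composite instance (named CompositeSpec in older TorchRL releases) where each key is an observation (a Composite can be viewed as a dictionary of specs);

  • EnvBase.action_spec: it can be any type of spec, but it is required that it corresponds to the "action" entry in the input tensordict;

  • EnvBase.reward_spec: provides information about the reward space;

  • EnvBase.done_spec: provides information about the space of the done flag.

TorchRL specs are organized in two general containers: input_spec, which contains the specs of the information that the step function reads (divided between action_spec, containing the action, and state_spec, containing all the rest), and output_spec, which encodes the specs of what the step outputs (observation_spec, reward_spec and done_spec). In general, you should not interact directly with output_spec and input_spec but only with their content: observation_spec, reward_spec, done_spec, action_spec and state_spec. The reason is that the specs are organized in a non-trivial way within output_spec and input_spec, and neither of these should be modified directly.

In other words, observation_spec and the related properties are convenient shortcuts to the content of the output and input spec containers.

TorchRL offers multiple TensorSpec subclasses to encode the environment's input and output characteristics.

Specs shape

The leading dimensions of the environment specs must match the environment batch size. This is done to enforce that every component of an environment (including its transforms) has an accurate representation of the expected input and output shapes. This is something that should be accurately coded in stateful settings.

For non batch-locked environments, such as the one in our example (see below), this is irrelevant as the environment batch size will most likely be empty.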
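
The shape rule above can be illustrated with a small check on plain tuples. The helper `leading_dims_match` is hypothetical and is not part of TorchRL; tuples stand in for torch.Size objects.

```python
def leading_dims_match(spec_shape, batch_size):
    """Return True when ``spec_shape`` starts with the dimensions in ``batch_size``."""
    batch_size = tuple(batch_size)
    return tuple(spec_shape[: len(batch_size)]) == batch_size


# An empty batch size, as in a non batch-locked environment, matches any spec shape.
print(leading_dims_match((1,), ()))        # True
print(leading_dims_match((10, 1), (10,)))  # True
print(leading_dims_match((1,), (10,)))     # False
```

This is why an empty environment batch size makes the constraint trivial: there are no leading dimensions left to compare.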

def _make_spec(self, td_params):
    # Under the hood, this will populate self.output_spec["observation"]
    self.observation_spec = Composite(
        th=Bounded(
            low=-torch.pi,
            high=torch.pi,
            shape=(),
            dtype=torch.float32,
        ),
        thdot=Bounded(
            low=-td_params["params", "max_speed"],
            high=td_params["params", "max_speed"],
            shape=(),
            dtype=torch.float32,
        ),
        # we need to add the ``params`` to the observation specs, as we want
        # to pass it at each step during a rollout
        params=make_composite_from_td(td_params["params"]),
        shape=(),
    )
    # since the environment is stateless, we expect the previous output as input.
    # For this, ``EnvBase`` expects some state_spec to be available
    self.state_spec = self.observation_spec.clone()
    # action-spec will be automatically wrapped in input_spec when
    # `self.action_spec = spec` is called
    self.action_spec = Bounded(
        low=-td_params["params", "max_torque"],
        high=td_params["params", "max_torque"],
        shape=(1,),
        dtype=torch.float32,
    )
    self.reward_spec = Unbounded(shape=(*td_params.shape, 1))


def make_composite_from_td(td):
    # custom function to convert a ``tensordict`` in a similar spec structure
    # of unbounded values.
    composite = Composite(
        {
            key: make_composite_from_td(tensor)
            if isinstance(tensor, TensorDictBase)
            else Unbounded(dtype=tensor.dtype, device=tensor.device, shape=tensor.shape)
            for key, tensor in td.items()
        },
        shape=td.shape,
    )
    return composite

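Reproducible experiments: seeding

Seeding an environment is a common operation when initializing an experiment. The only goal of EnvBase._set_seed() is to set the seed of the contained simulator. If possible, this operation should not call reset() or interact with the environment execution. The parent EnvBase.set_seed() method incorporates a mechanism that allows seeding multiple environments with a different pseudo-random and reproducible seed.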

def _set_seed(self, seed: Optional[int]):
    rng = torch.manual_seed(seed)
    self.rng = rng
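
The effect of seeding on the reset can be sketched without PyTorch. In this illustration, random.Random stands in for the torch generator stored in self.rng, and `sample_initial_state` is a hypothetical helper that reproduces the uniform sampling of the initial angle and velocity used in _reset().

```python
import math
import random


def sample_initial_state(rng, high_th=math.pi, high_thdot=1.0):
    """Draw (th, thdot) uniformly from [-high, high), like the reset code above."""
    th = rng.random() * (2 * high_th) - high_th
    thdot = rng.random() * (2 * high_thdot) - high_thdot
    return th, thdot


rng_a = random.Random(0)
rng_b = random.Random(0)
states_a = [sample_initial_state(rng_a) for _ in range(3)]
states_b = [sample_initial_state(rng_b) for _ in range(3)]
print(states_a == states_b)  # True: identical seeds give identical resets
```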

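Wrapping things together

We can finally put together the pieces and design our environment class. The specs initialization needs to be performed during the environment construction, so we must take care of calling the _make_spec() method within PendulumEnv.__init__().

We add a static method PendulumEnv.gen_params() which deterministically generates a set of hyperparameters to be used during execution: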

def gen_params(g=10.0, batch_size=None) -> TensorDictBase:
    """Returns a ``tensordict`` containing the physical parameters such as gravitational force and torque or speed limits."""
    if batch_size is None:
        batch_size = []
    td = TensorDict(
        {
            "params": TensorDict(
                {
                    "max_speed": 8,
                    "max_torque": 2.0,
                    "dt": 0.05,
                    "g": g,
                    "m": 1.0,
                    "l": 1.0,
                },
                [],
            )
        },
        [],
    )
    if batch_size:
        td = td.expand(batch_size).contiguous()
    return td

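We define the environment as non batch-locked by setting the homonymous attribute, batch_locked, to False. This means that we will not enforce the input tensordict to have a batch size that matches the one of the environment.

The following code just puts together the pieces we have coded above.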

class PendulumEnv(EnvBase):
    metadata = {
        "render_modes": ["human", "rgb_array"],
        "render_fps": 30,
    }
    batch_locked = False

    def __init__(self, td_params=None, seed=None, device="cpu"):
        if td_params is None:
            td_params = self.gen_params()

        super().__init__(device=device, batch_size=[])
        self._make_spec(td_params)
        if seed is None:
            seed = torch.empty((), dtype=torch.int64).random_().item()
        self.set_seed(seed)

    # Helpers: _make_spec and gen_params
    gen_params = staticmethod(gen_params)
    _make_spec = _make_spec

    # Mandatory methods: _step, _reset and _set_seed
    _reset = _reset
    _step = staticmethod(_step)
    _set_seed = _set_seed

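Testing our environment

TorchRL provides a simple function, check_env_specs(), to check that a (transformed) environment has an input/output structure that matches the one dictated by its specs. Let us try it out: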

env = PendulumEnv()
check_env_specs(env)

We can have a look at our specs to get a visual representation of the environment signature:

print("observation_spec:", env.observation_spec)
print("state_spec:", env.state_spec)
print("reward_spec:", env.reward_spec)
observation_spec: Composite(
    th: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    thdot: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    params: Composite(
        max_speed: UnboundedDiscrete(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True)),
            device=cpu,
            dtype=torch.int64,
            domain=discrete),
        max_torque: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        dt: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        g: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        m: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        l: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        device=cpu,
        shape=torch.Size([])),
    device=cpu,
    shape=torch.Size([]))
state_spec: Composite(
    th: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    thdot: BoundedContinuous(
        shape=torch.Size([]),
        space=ContinuousBox(
            low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
            high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
        device=cpu,
        dtype=torch.float32,
        domain=continuous),
    params: Composite(
        max_speed: UnboundedDiscrete(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True)),
            device=cpu,
            dtype=torch.int64,
            domain=discrete),
        max_torque: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        dt: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        g: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        m: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        l: UnboundedContinuous(
            shape=torch.Size([]),
            space=ContinuousBox(
                low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
                high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
            device=cpu,
            dtype=torch.float32,
            domain=continuous),
        device=cpu,
        shape=torch.Size([])),
    device=cpu,
    shape=torch.Size([]))
reward_spec: UnboundedContinuous(
    shape=torch.Size([1]),
    space=ContinuousBox(
        low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True),
        high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True)),
    device=cpu,
    dtype=torch.float32,
    domain=continuous)

We can also execute a couple of commands to check that the output structure matches what is expected.

td = env.reset()
print("reset tensordict", td)
reset tensordict TensorDict(
    fields={
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False),
        terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=None,
    is_shared=False)

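We can run env.rand_step() to generate an action randomly from the action_spec domain. A tensordict containing the hyperparameters and the current state must be passed since our environment is stateless. In stateful contexts, env.rand_step() works perfectly too.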

td = env.rand_step(td)
print("random step tensordict", td)
random step tensordict TensorDict(
    fields={
        action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([]),
            device=None,
            is_shared=False),
        terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([]),
    device=None,
    is_shared=False)

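Transforming an environment

Writing environment transforms for stateless simulators is slightly more complicated than for stateful ones: transforming an output entry that needs to be read at the following iteration requires applying the inverse transform before calling step() at the next step. This is an ideal scenario to showcase all the features of TorchRL's transforms!

For instance, in the following transformed environment we unsqueeze the entries ["th", "thdot"] to be able to stack them along the last dimension. We also pass them as in_keys_inv to squeeze them back to their original shape once they are passed as input in the next iteration.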

env = TransformedEnv(
    env,
    # ``Unsqueeze`` the observations that we will concatenate
    UnsqueezeTransform(
        dim=-1,
        in_keys=["th", "thdot"],
        in_keys_inv=["th", "thdot"],
    ),
)
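
The forward and inverse directions of this transform can be mimicked with nested Python lists in place of tensors. The helpers `unsqueeze_last` and `squeeze_last` below are hypothetical stand-ins, not TorchRL functions; they only illustrate that the inverse applied through in_keys_inv undoes what the forward pass did to in_keys.

```python
def unsqueeze_last(x):
    """Append a trailing dimension of size 1: 0.3 -> [0.3], [a, b] -> [[a], [b]]."""
    if isinstance(x, list):
        return [unsqueeze_last(v) for v in x]
    return [x]


def squeeze_last(x):
    """Inverse of unsqueeze_last: drop the trailing dimension of size 1."""
    if len(x) == 1 and not isinstance(x[0], list):
        return x[0]
    return [squeeze_last(v) for v in x]


keys = ["th", "thdot"]
state = {"th": 0.3, "thdot": -1.0}

# Forward direction: what the policy and later transforms see.
observed = {k: unsqueeze_last(state[k]) for k in keys}
# Inverse direction: applied before the data reaches the base environment again.
restored = {k: squeeze_last(observed[k]) for k in keys}

print(observed)  # {'th': [0.3], 'thdot': [-1.0]}
print(restored)  # {'th': 0.3, 'thdot': -1.0}
```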

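Writing custom transforms

TorchRL's transforms may not cover all the operations one wants to execute after an environment has been executed. Writing a transform does not require much effort. As for the environment design, there are two steps in writing a transform:

  • getting the dynamics right (forward and inverse);

  • adapting the environment specs.

A transform can be used in two settings: on its own, it can be used as an nn.Module. It can also be appended to a TransformedEnv. The structure of the class allows customizing the behavior in the different contexts.

A Transform skeleton can be summarized as follows: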

class Transform(nn.Module):
    def forward(self, tensordict):
        ...
    def _apply_transform(self, tensordict):
        ...
    def _step(self, tensordict):
        ...
    def _call(self, tensordict):
        ...
    def inv(self, tensordict):
        ...
    def _inv_apply_transform(self, tensordict):
        ...

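There are three entry points (forward(), _step() and inv()) which all receive tensordict.TensorDict instances. The first two will eventually go through the keys indicated by in_keys and call _apply_transform() on each of these. The results will be written in the entries pointed to by Transform.out_keys if provided (if not, the in_keys will be updated with the transformed values). If inverse transforms need to be executed, a similar data flow will be executed but with the Transform.inv() and Transform._inv_apply_transform() methods and across the in_keys_inv and out_keys_inv lists of keys. The figure below summarizes this flow for environments and replay buffers.

Figure: Transform API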

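In some cases, a transform will not work on a subset of keys in a unitary manner, but will execute some operation on the parent environment or work with the entire input tensordict. In those cases, the _call() and forward() methods should be re-written, and the _apply_transform() method can be skipped.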
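
The key-wise data flow (read each entry of in_keys, apply the transform, write to the matching entry of out_keys) can be sketched with a toy class operating on plain dictionaries. `KeyedTransform` is a hypothetical illustration and does not reproduce TorchRL's Transform class.

```python
import math


class KeyedTransform:
    """Toy stand-in for the in_keys/out_keys data flow (not TorchRL's Transform)."""

    def __init__(self, fn, in_keys, out_keys=None):
        self.fn = fn
        self.in_keys = list(in_keys)
        # Without out_keys, the in_keys entries are overwritten with the new values.
        self.out_keys = list(out_keys) if out_keys is not None else list(in_keys)

    def __call__(self, data):
        out = dict(data)
        for in_key, out_key in zip(self.in_keys, self.out_keys):
            out[out_key] = self.fn(data[in_key])
        return out


data = {"th": 0.0, "thdot": 2.0}
with_sin = KeyedTransform(math.sin, in_keys=["th"], out_keys=["sin"])(data)
overwritten = KeyedTransform(abs, in_keys=["thdot"])({"thdot": -2.0})
print(with_sin)     # {'th': 0.0, 'thdot': 2.0, 'sin': 0.0}
print(overwritten)  # {'thdot': 2.0}
```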

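Let us code new transforms that will compute the sine and cosine values of the position angle, as these values are more useful to us for learning a policy than the raw angle value: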

class SinTransform(Transform):
    def _apply_transform(self, obs: torch.Tensor) -> torch.Tensor:
        return obs.sin()

    # The transform must also modify the data at reset time
    def _reset(
        self, tensordict: TensorDictBase, tensordict_reset: TensorDictBase
    ) -> TensorDictBase:
        return self._call(tensordict_reset)

    # _apply_to_composite will execute the observation spec transform across all
    # in_keys/out_keys pairs and write the result in the observation_spec which
    # is of type ``Composite``
    @_apply_to_composite
    def transform_observation_spec(self, observation_spec):
        return Bounded(
            low=-1,
            high=1,
            shape=observation_spec.shape,
            dtype=observation_spec.dtype,
            device=observation_spec.device,
        )


class CosTransform(Transform):
    def _apply_transform(self, obs: torch.Tensor) -> torch.Tensor:
        return obs.cos()

    # The transform must also modify the data at reset time
    def _reset(
        self, tensordict: TensorDictBase, tensordict_reset: TensorDictBase
    ) -> TensorDictBase:
        return self._call(tensordict_reset)

    # _apply_to_composite will execute the observation spec transform across all
    # in_keys/out_keys pairs and write the result in the observation_spec which
    # is of type ``Composite``
    @_apply_to_composite
    def transform_observation_spec(self, observation_spec):
        return Bounded(
            low=-1,
            high=1,
            shape=observation_spec.shape,
            dtype=observation_spec.dtype,
            device=observation_spec.device,
        )


t_sin = SinTransform(in_keys=["th"], out_keys=["sin"])
t_cos = CosTransform(in_keys=["th"], out_keys=["cos"])
env.append_transform(t_sin)
env.append_transform(t_cos)
TransformedEnv(
    env=PendulumEnv(),
    transform=Compose(
            UnsqueezeTransform(dim=-1, in_keys=['th', 'thdot'], out_keys=['th', 'thdot'], in_keys_inv=['th', 'thdot'], out_keys_inv=['th', 'thdot']),
            SinTransform(keys=['th']),
            CosTransform(keys=['th'])))
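
One plausible reason why the sine and cosine are more convenient than the raw angle (an interpretation, not something stated in this tutorial) is that the raw angle jumps at the boundary between pi and -pi, whereas the pair (sin, cos) varies continuously. The sketch below compares two nearly identical physical positions on each side of that boundary; the helper `encode` is hypothetical.

```python
import math


def encode(th):
    # (sin, cos) representation of an angle
    return math.sin(th), math.cos(th)


just_below_pi = math.pi - 1e-3
just_above_minus_pi = -math.pi + 1e-3

raw_gap = abs(just_below_pi - just_above_minus_pi)
enc_gap = math.dist(encode(just_below_pi), encode(just_above_minus_pi))
print(f"raw angle gap:  {raw_gap:.4f}")
print(f"(sin, cos) gap: {enc_gap:.4f}")
```

The raw angles differ by almost a full turn although the pendulum is nearly in the same place, while the encoded values stay close to each other.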

CatTensors concatenates the observations onto an "observation" entry. del_keys=False ensures that we keep these values for the next iteration.

cat_transform = CatTensors(
    in_keys=["sin", "cos", "thdot"], dim=-1, out_key="observation", del_keys=False
)
env.append_transform(cat_transform)
TransformedEnv(
    env=PendulumEnv(),
    transform=Compose(
            UnsqueezeTransform(dim=-1, in_keys=['th', 'thdot'], out_keys=['th', 'thdot'], in_keys_inv=['th', 'thdot'], out_keys_inv=['th', 'thdot']),
            SinTransform(keys=['th']),
            CosTransform(keys=['th']),
            CatTensors(in_keys=['cos', 'sin', 'thdot'], out_key=observation)))

Once more, let us check that our environment specs match what is received:

check_env_specs(env)

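Executing a rollout

Executing a rollout is a succession of simple steps:

  • reset the environment

  • while some condition is not met:

    • compute an action given a policy

    • execute a step given this action

    • collect the data

    • make an MDP step

  • gather the data and return

These operations have been conveniently wrapped in the rollout() method, of which we provide a simplified version below.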

def simple_rollout(steps=100):
    # preallocate:
    data = TensorDict({}, [steps])
    # reset
    _data = env.reset()
    for i in range(steps):
        _data["action"] = env.action_spec.rand()
        _data = env.step(_data)
        data[i] = _data
        _data = step_mdp(_data, keep_other=True)
    return data


print("data from rollout:", simple_rollout(100))
data from rollout: TensorDict(
    fields={
        action: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        cos: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                cos: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                done: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([100, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([100]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                sin: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([100]),
            device=None,
            is_shared=False),
        observation: Tensor(shape=torch.Size([100, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([100]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([100]),
    device=None,
    is_shared=False)
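
The "MDP step" in the loop above moves the content of "next" to the root so that the following iteration starts from the new state. The sketch below is a simplified dictionary analogue under stated assumptions: it assumes that the action and the reward are not carried over, and it redefines a toy `step_mdp` on plain dicts rather than using the TorchRL function of the same name.

```python
def step_mdp(td, keep_other=True):
    """Toy analogue for plain dicts: promote td["next"] to the root."""
    nxt = dict(td["next"])
    if keep_other:
        for key, value in td.items():
            # Carry over root entries absent from "next", except the consumed action.
            if key not in nxt and key not in ("next", "action"):
                nxt[key] = value
    # The reward belongs to the transition just taken, not to the new root.
    nxt.pop("reward", None)
    return nxt


td = {
    "th": 0.1,
    "params": {"dt": 0.05},
    "action": 1.0,
    "next": {"th": 0.2, "reward": -0.01, "done": False},
}
new_root = step_mdp(td)
print(new_root)
```

With keep_other=True, entries such as "params" that only live at the root are preserved, which matters for a stateless environment that needs them at every step.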

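Batching computations

The last unexplored end of our tutorial is the ability that we have to batch computations in TorchRL. Because our environment does not make any assumptions regarding the input data shape, we can seamlessly execute it over batches of data. Even better: for non batch-locked environments such as our Pendulum, we can change the batch size on the fly without recreating the environment. To do this, we just generate parameters with the desired shape.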

batch_size = 10  # number of environments to be executed in batch
td = env.reset(env.gen_params(batch_size=[batch_size]))
print("reset (batch size of 10)", td)
td = env.rand_step(td)
print("rand step (batch size of 10)", td)
reset (batch size of 10) TensorDict(
    fields={
        cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([10]),
    device=None,
    is_shared=False)
rand step (batch size of 10) TensorDict(
    fields={
        action: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([10]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10]),
            device=None,
            is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([10]),
    device=None,
    is_shared=False)
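
The reason batching comes for free can be seen in a minimal sketch in plain PyTorch. The function `pendulum_step` below is hypothetical (it is not the tutorial's `_step` method) and its default constants are illustrative. It uses only elementwise operations, so the same code should run on a single state or on a batch of any shape.

```python
import torch


def pendulum_step(th, thdot, u, g=10.0, m=1.0, l=1.0, dt=0.05):
    # Hypothetical helper: only elementwise ops, so no input shape is assumed.
    new_thdot = thdot + (3 * g / (2 * l) * torch.sin(th) + 3.0 / (m * l**2) * u) * dt
    new_th = th + new_thdot * dt
    return new_th, new_thdot


# A single pendulum (0-dimensional tensors)...
th1, thdot1 = pendulum_step(torch.tensor(0.1), torch.tensor(0.0), torch.tensor(0.5))

# ...and a batch of 10 pendulums, with no change to the function.
th10, thdot10 = pendulum_step(torch.rand(10), torch.zeros(10), torch.zeros(10))

print(th1.shape, th10.shape)
```

The output shapes simply mirror the input shapes, which is the property the environment relies on when it receives parameters of a new batch size.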

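Executing a rollout with a batch of data requires us to reset the environment outside of the rollout function, since we need to define the batch_size dynamically and this is not supported by rollout():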

rollout = env.rollout(
    3,
    auto_reset=False,  # we're executing the reset out of the ``rollout`` call
    tensordict=env.reset(env.gen_params(batch_size=[batch_size])),
)
print("rollout of len 3 (batch size of 10):", rollout)
rollout of len 3 (batch size of 10): TensorDict(
    fields={
        action: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        cos: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        done: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        next: TensorDict(
            fields={
                cos: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                done: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                observation: Tensor(shape=torch.Size([10, 3, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                params: TensorDict(
                    fields={
                        dt: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        g: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        l: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        m: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                        max_speed: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.int64, is_shared=False),
                        max_torque: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
                    batch_size=torch.Size([10, 3]),
                    device=None,
                    is_shared=False),
                reward: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                sin: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                terminated: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
                th: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
                thdot: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10, 3]),
            device=None,
            is_shared=False),
        observation: Tensor(shape=torch.Size([10, 3, 3]), device=cpu, dtype=torch.float32, is_shared=False),
        params: TensorDict(
            fields={
                dt: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                g: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                l: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                m: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
                max_speed: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.int64, is_shared=False),
                max_torque: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
            batch_size=torch.Size([10, 3]),
            device=None,
            is_shared=False),
        sin: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        terminated: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
        th: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
        thdot: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
    batch_size=torch.Size([10, 3]),
    device=None,
    is_shared=False)
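
The shapes printed above follow a simple rule: the time dimension is placed right after the batch dimensions and ahead of each entry's own feature dimension. A minimal sketch with plain tensors illustrates this (no TorchRL involved; the per-step reward values are made up for illustration):

```python
import torch

batch_size, n_steps = 10, 3

# One reward tensor of shape [10, 1] per step, as a single batched env step would produce.
per_step_rewards = [torch.full((batch_size, 1), -float(t)) for t in range(n_steps)]

# Stacking along dim=1 (the number of batch dimensions) gives [batch, time, feature].
stacked = torch.stack(per_step_rewards, dim=1)

# Select the final time step for every element of the batch.
last = stacked[:, -1]

print(stacked.shape, last.shape)
```

With a TensorDict of batch size [10, 3], indexing such as `rollout[..., -1]` acts on the batch dimensions, so it selects the last time step; on a plain tensor the equivalent is `[:, -1]`, as used in the sketch.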

Training a simple policy

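In this example, we will train a simple policy using the reward as a differentiable objective, that is, as a negative loss. We will take advantage of the fact that our dynamic system is fully differentiable to backpropagate through the trajectory return and adjust the weights of our policy to maximize this value directly. Of course, in many settings several of the assumptions we make do not hold, such as a differentiable system and full access to the underlying mechanics.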

Still, this is a very simple example that showcases how a training loop can be coded with a custom environment in TorchRL.
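
To make the idea of backpropagating through a trajectory concrete, here is a minimal sketch in plain PyTorch, without TorchRL. It assumes the motion equation given earlier and a Pendulum-v1-style reward; the constants, the `step` helper and the linear stand-in policy are all hypothetical and chosen for illustration only.

```python
import math

import torch

# Hypothetical constants; the tutorial keeps such values in a ``params`` tensordict.
G, MASS, LENGTH, DT, MAX_TORQUE, MAX_SPEED = 10.0, 1.0, 1.0, 0.05, 2.0, 8.0


def angle_normalize(th):
    # Map an angle to [-pi, pi) so that the upright position is 0.
    return ((th + math.pi) % (2 * math.pi)) - math.pi


def step(th, thdot, u):
    # One differentiable transition: compute the reward, then update the state.
    u = u.clamp(-MAX_TORQUE, MAX_TORQUE)
    reward = -(angle_normalize(th) ** 2 + 0.1 * thdot**2 + 0.001 * u**2)
    thdot = thdot + (3 * G / (2 * LENGTH) * th.sin() + 3.0 / (MASS * LENGTH**2) * u) * DT
    thdot = thdot.clamp(-MAX_SPEED, MAX_SPEED)
    th = th + thdot * DT
    return th, thdot, reward


torch.manual_seed(0)
policy = torch.nn.Linear(3, 1)  # stand-in for an MLP policy

th = torch.rand(4) * 2 * math.pi - math.pi  # batch of 4 initial angles
thdot = torch.zeros(4)
rewards = []
for _ in range(20):
    obs = torch.stack([th.sin(), th.cos(), thdot], dim=-1)
    u = policy(obs).squeeze(-1)
    th, thdot, r = step(th, thdot, u)
    rewards.append(r)

# Average reward over batch and time; its negative plays the role of a loss.
traj_return = torch.stack(rewards, dim=-1).mean()
(-traj_return).backward()  # gradients flow back through all 20 steps
grad_norm = policy.weight.grad.norm()
```

Because every operation in `step` is a differentiable tensor operation, autograd can trace the return back to the policy weights through the whole trajectory. No separate value estimate or policy-gradient estimator is needed in this setting.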

Let us first write the policy network:

torch.manual_seed(0)
env.set_seed(0)

net = nn.Sequential(
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(1),
)
policy = TensorDictModule(
    net,
    in_keys=["observation"],
    out_keys=["action"],
)

and our optimizer:
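
The training loop further down refers to an `optim` object. A minimal sketch of one way to build it follows, assuming Adam with a learning rate of 2e-3; both choices are assumptions rather than something the surrounding text confirms, and `make_optimizer` is a hypothetical helper.

```python
import torch
from torch import nn


def make_optimizer(module: nn.Module, lr: float = 2e-3) -> torch.optim.Optimizer:
    # Assumption: Adam and lr=2e-3 are illustrative choices.
    return torch.optim.Adam(module.parameters(), lr=lr)


# With the tutorial's objects this would read: optim = make_optimizer(policy)
# A small stand-in module keeps the snippet runnable on its own.
demo_optim = make_optimizer(nn.Linear(3, 1))
```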

Training loop

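We will successively:

  • generate a trajectory

  • sum the rewards

  • backpropagate through the graph defined by these operations

  • clip the gradient norm and make an optimization step

  • repeat

At the end of the training loop, we should have a final reward close to 0, which indicates that the pendulum is upright and still, as desired.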
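
The gradient-clipping step relies on `torch.nn.utils.clip_grad_norm_`. It rescales the gradients in place and returns the total norm measured before clipping, so a logged gradient norm can be much larger than the clipping threshold. A minimal sketch with a hand-set gradient (the parameter `w` is made up for illustration):

```python
import torch

w = torch.nn.Parameter(torch.zeros(2))
w.grad = torch.tensor([3.0, 4.0])  # gradient with an L2 norm of 5

# Returns the norm *before* clipping; the gradient itself is rescaled in place.
returned_norm = torch.nn.utils.clip_grad_norm_([w], max_norm=1.0)
clipped_norm = w.grad.norm()

print(returned_norm.item(), clipped_norm.item())
```

Here the returned value is 5, while the gradient left on `w` has a norm of about 1.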

batch_size = 32
pbar = tqdm.tqdm(range(20_000 // batch_size))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optim, 20_000)
logs = defaultdict(list)

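for _ in pbar:
    # Reset outside of ``rollout`` so that the batch size can be set through the parameters
    init_td = env.reset(env.gen_params(batch_size=[batch_size]))
    rollout = env.rollout(100, policy, tensordict=init_td, auto_reset=False)
    # Average reward over batch and time; maximizing it means minimizing its negative
    traj_return = rollout["next", "reward"].mean()
    (-traj_return).backward()
    # ``clip_grad_norm_`` returns the total norm measured before clipping
    gn = torch.nn.utils.clip_grad_norm_(net.parameters(), 1.0)
    optim.step()
    optim.zero_grad()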
    pbar.set_description(
        f"reward: {traj_return: 4.4f}, "
        f"last reward: {rollout[..., -1]['next', 'reward'].mean(): 4.4f}, gradient norm: {gn: 4.4}"
    )
    logs["return"].append(traj_return.item())
    logs["last_reward"].append(rollout[..., -1]["next", "reward"].mean().item())
    scheduler.step()


def plot():
    import matplotlib
    from matplotlib import pyplot as plt

    is_ipython = "inline" in matplotlib.get_backend()
    if is_ipython:
        from IPython import display

    with plt.ion():
        plt.figure(figsize=(10, 5))
        plt.subplot(1, 2, 1)
        plt.plot(logs["return"])
        plt.title("returns")
        plt.xlabel("iteration")
        plt.subplot(1, 2, 2)
        plt.plot(logs["last_reward"])
        plt.title("last reward")
        plt.xlabel("iteration")
        if is_ipython:
            display.display(plt.gcf())
            display.clear_output(wait=True)
        plt.show()


plot()
Figure: returns (left panel) and last reward (right panel), plotted against the training iteration.
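
The loop above builds `CosineAnnealingLR` with a period of 20,000 steps but calls `scheduler.step()` once per iteration, that is 625 times. A quick arithmetic check shows how far the learning rate has decayed by the end of training. It assumes the scheduler's default minimum learning rate of 0 and the closed-form cosine schedule; `cosine_lr_factor` is a hypothetical helper.

```python
import math


def cosine_lr_factor(step, t_max, eta_min_ratio=0.0):
    # Closed-form cosine annealing, expressed as a fraction of the initial learning rate.
    return eta_min_ratio + 0.5 * (1.0 - eta_min_ratio) * (1.0 + math.cos(math.pi * step / t_max))


n_updates = 20_000 // 32  # 625 iterations, as in the loop above
factor_at_end = cosine_lr_factor(n_updates, t_max=20_000)
print(f"{factor_at_end:.4f}")
```

Under these assumptions the factor is about 0.9976, so the learning rate stays nearly constant over the run. Passing the number of iterations as the period would let the schedule complete its decay.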
  0%|          | 0/625 [00:00<?, ?it/s]
reward: -6.0488, last reward: -5.0748, gradient norm:  8.519:   0%|          | 0/625 [00:00<?, ?it/s]
reward: -6.0488, last reward: -5.0748, gradient norm:  8.519:   0%|          | 1/625 [00:00<01:33,  6.69it/s]
reward: -7.0499, last reward: -7.4472, gradient norm:  5.073:   0%|          | 1/625 [00:00<01:33,  6.69it/s]
reward: -7.0499, last reward: -7.4472, gradient norm:  5.073:   0%|          | 2/625 [00:00<01:33,  6.63it/s]
reward: -7.0685, last reward: -7.0408, gradient norm:  5.552:   0%|          | 2/625 [00:00<01:33,  6.63it/s]
reward: -7.0685, last reward: -7.0408, gradient norm:  5.552:   0%|          | 3/625 [00:00<01:33,  6.63it/s]
reward: -6.5154, last reward: -5.9086, gradient norm:  2.527:   0%|          | 3/625 [00:00<01:33,  6.63it/s]
reward: -6.5154, last reward: -5.9086, gradient norm:  2.527:   1%|          | 4/625 [00:00<01:34,  6.60it/s]
reward: -6.2006, last reward: -5.9385, gradient norm:  8.155:   1%|          | 4/625 [00:00<01:34,  6.60it/s]
reward: -6.2006, last reward: -5.9385, gradient norm:  8.155:   1%|          | 5/625 [00:00<01:33,  6.60it/s]
reward: -6.2568, last reward: -5.4981, gradient norm:  6.223:   1%|          | 5/625 [00:00<01:33,  6.60it/s]
reward: -6.2568, last reward: -5.4981, gradient norm:  6.223:   1%|          | 6/625 [00:00<01:33,  6.62it/s]
reward: -5.8929, last reward: -8.4491, gradient norm:  4.581:   1%|          | 6/625 [00:01<01:33,  6.62it/s]
reward: -5.8929, last reward: -8.4491, gradient norm:  4.581:   1%|          | 7/625 [00:01<01:33,  6.63it/s]
reward: -6.3233, last reward: -9.0664, gradient norm:  7.596:   1%|          | 7/625 [00:01<01:33,  6.63it/s]
reward: -6.3233, last reward: -9.0664, gradient norm:  7.596:   1%|▏         | 8/625 [00:01<01:34,  6.56it/s]
reward: -6.1021, last reward: -9.5263, gradient norm:  0.9579:   1%|▏         | 8/625 [00:01<01:34,  6.56it/s]
reward: -6.1021, last reward: -9.5263, gradient norm:  0.9579:   1%|▏         | 9/625 [00:01<01:34,  6.53it/s]
reward: -6.5807, last reward: -8.8075, gradient norm:  3.212:   1%|▏         | 9/625 [00:01<01:34,  6.53it/s]
reward: -6.5807, last reward: -8.8075, gradient norm:  3.212:   2%|▏         | 10/625 [00:01<01:33,  6.56it/s]
reward: -6.2009, last reward: -8.5525, gradient norm:  2.914:   2%|▏         | 10/625 [00:01<01:33,  6.56it/s]
reward: -6.2009, last reward: -8.5525, gradient norm:  2.914:   2%|▏         | 11/625 [00:01<01:33,  6.58it/s]
reward: -6.2894, last reward: -8.0115, gradient norm:  52.06:   2%|▏         | 11/625 [00:01<01:33,  6.58it/s]
reward: -6.2894, last reward: -8.0115, gradient norm:  52.06:   2%|▏         | 12/625 [00:01<01:32,  6.59it/s]
reward: -6.0977, last reward: -6.1845, gradient norm:  18.09:   2%|▏         | 12/625 [00:01<01:32,  6.59it/s]
reward: -6.0977, last reward: -6.1845, gradient norm:  18.09:   2%|▏         | 13/625 [00:01<01:32,  6.60it/s]
reward: -6.1830, last reward: -7.4858, gradient norm:  5.233:   2%|▏         | 13/625 [00:02<01:32,  6.60it/s]
reward: -6.1830, last reward: -7.4858, gradient norm:  5.233:   2%|▏         | 14/625 [00:02<01:32,  6.61it/s]
reward: -6.2863, last reward: -5.0297, gradient norm:  1.464:   2%|▏         | 14/625 [00:02<01:32,  6.61it/s]
reward: -6.2863, last reward: -5.0297, gradient norm:  1.464:   2%|▏         | 15/625 [00:02<01:31,  6.63it/s]
reward: -6.4617, last reward: -5.5997, gradient norm:  2.904:   2%|▏         | 15/625 [00:02<01:31,  6.63it/s]
reward: -6.4617, last reward: -5.5997, gradient norm:  2.904:   3%|▎         | 16/625 [00:02<01:31,  6.63it/s]
reward: -6.1647, last reward: -6.0777, gradient norm:  4.901:   3%|▎         | 16/625 [00:02<01:31,  6.63it/s]
reward: -6.1647, last reward: -6.0777, gradient norm:  4.901:   3%|▎         | 17/625 [00:02<01:31,  6.64it/s]
reward: -6.4709, last reward: -6.6813, gradient norm:  0.8317:   3%|▎         | 17/625 [00:02<01:31,  6.64it/s]
reward: -6.4709, last reward: -6.6813, gradient norm:  0.8317:   3%|▎         | 18/625 [00:02<01:31,  6.64it/s]
reward: -6.3221, last reward: -6.5554, gradient norm:  1.276:   3%|▎         | 18/625 [00:02<01:31,  6.64it/s]
reward: -6.3221, last reward: -6.5554, gradient norm:  1.276:   3%|▎         | 19/625 [00:02<01:31,  6.64it/s]
reward: -6.3353, last reward: -7.9999, gradient norm:  4.701:   3%|▎         | 19/625 [00:03<01:31,  6.64it/s]
reward: -6.3353, last reward: -7.9999, gradient norm:  4.701:   3%|▎         | 20/625 [00:03<01:31,  6.65it/s]
reward: -5.8570, last reward: -7.6656, gradient norm:  5.463:   3%|▎         | 20/625 [00:03<01:31,  6.65it/s]
reward: -5.8570, last reward: -7.6656, gradient norm:  5.463:   3%|▎         | 21/625 [00:03<01:30,  6.65it/s]
reward: -5.7779, last reward: -6.6911, gradient norm:  6.875:   3%|▎         | 21/625 [00:03<01:30,  6.65it/s]
reward: -5.7779, last reward: -6.6911, gradient norm:  6.875:   4%|▎         | 22/625 [00:03<01:30,  6.66it/s]
reward: -6.0796, last reward: -5.7082, gradient norm:  5.308:   4%|▎         | 22/625 [00:03<01:30,  6.66it/s]
reward: -6.0796, last reward: -5.7082, gradient norm:  5.308:   4%|▎         | 23/625 [00:03<01:30,  6.65it/s]
reward: -6.0421, last reward: -6.1496, gradient norm:  12.4:   4%|▎         | 23/625 [00:03<01:30,  6.65it/s]
reward: -6.0421, last reward: -6.1496, gradient norm:  12.4:   4%|▍         | 24/625 [00:03<01:30,  6.65it/s]
reward: -5.5037, last reward: -5.1755, gradient norm:  22.62:   4%|▍         | 24/625 [00:03<01:30,  6.65it/s]
reward: -5.5037, last reward: -5.1755, gradient norm:  22.62:   4%|▍         | 25/625 [00:03<01:30,  6.65it/s]
reward: -5.5029, last reward: -4.9454, gradient norm:  3.665:   4%|▍         | 25/625 [00:03<01:30,  6.65it/s]
reward: -5.5029, last reward: -4.9454, gradient norm:  3.665:   4%|▍         | 26/625 [00:03<01:30,  6.65it/s]
reward: -5.9330, last reward: -6.2118, gradient norm:  5.444:   4%|▍         | 26/625 [00:04<01:30,  6.65it/s]
reward: -5.9330, last reward: -6.2118, gradient norm:  5.444:   4%|▍         | 27/625 [00:04<01:29,  6.65it/s]
reward: -6.0995, last reward: -6.6294, gradient norm:  11.69:   4%|▍         | 27/625 [00:04<01:29,  6.65it/s]
reward: -6.0995, last reward: -6.6294, gradient norm:  11.69:   4%|▍         | 28/625 [00:04<01:29,  6.65it/s]
reward: -6.3146, last reward: -7.2909, gradient norm:  5.461:   4%|▍         | 28/625 [00:04<01:29,  6.65it/s]
reward: -6.3146, last reward: -7.2909, gradient norm:  5.461:   5%|▍         | 29/625 [00:04<01:29,  6.66it/s]
reward: -5.9720, last reward: -6.1298, gradient norm:  19.91:   5%|▍         | 29/625 [00:04<01:29,  6.66it/s]
reward: -5.9720, last reward: -6.1298, gradient norm:  19.91:   5%|▍         | 30/625 [00:04<01:29,  6.66it/s]
reward: -5.9923, last reward: -7.0345, gradient norm:  3.464:   5%|▍         | 30/625 [00:04<01:29,  6.66it/s]
reward: -5.9923, last reward: -7.0345, gradient norm:  3.464:   5%|▍         | 31/625 [00:04<01:29,  6.65it/s]
reward: -5.3438, last reward: -4.3688, gradient norm:  2.424:   5%|▍         | 31/625 [00:04<01:29,  6.65it/s]
reward: -5.3438, last reward: -4.3688, gradient norm:  2.424:   5%|▌         | 32/625 [00:04<01:29,  6.64it/s]
reward: -5.6953, last reward: -4.5233, gradient norm:  3.411:   5%|▌         | 32/625 [00:04<01:29,  6.64it/s]
reward: -5.6953, last reward: -4.5233, gradient norm:  3.411:   5%|▌         | 33/625 [00:04<01:29,  6.64it/s]
reward: -5.4288, last reward: -2.8011, gradient norm:  10.82:   5%|▌         | 33/625 [00:05<01:29,  6.64it/s]
reward: -5.4288, last reward: -2.8011, gradient norm:  10.82:   5%|▌         | 34/625 [00:05<01:29,  6.64it/s]
reward: -5.5329, last reward: -4.2677, gradient norm:  15.71:   5%|▌         | 34/625 [00:05<01:29,  6.64it/s]
reward: -5.5329, last reward: -4.2677, gradient norm:  15.71:   6%|▌         | 35/625 [00:05<01:28,  6.64it/s]
reward: -5.6969, last reward: -3.7010, gradient norm:  1.376:   6%|▌         | 35/625 [00:05<01:28,  6.64it/s]
reward: -5.6969, last reward: -3.7010, gradient norm:  1.376:   6%|▌         | 36/625 [00:05<01:28,  6.65it/s]
reward: -5.9352, last reward: -4.7707, gradient norm:  15.49:   6%|▌         | 36/625 [00:05<01:28,  6.65it/s]
reward: -5.9352, last reward: -4.7707, gradient norm:  15.49:   6%|▌         | 37/625 [00:05<01:28,  6.65it/s]
reward: -5.6178, last reward: -4.5646, gradient norm:  3.348:   6%|▌         | 37/625 [00:05<01:28,  6.65it/s]
reward: -5.6178, last reward: -4.5646, gradient norm:  3.348:   6%|▌         | 38/625 [00:05<01:28,  6.65it/s]
reward: -5.7304, last reward: -3.9407, gradient norm:  4.942:   6%|▌         | 38/625 [00:05<01:28,  6.65it/s]
reward: -5.7304, last reward: -3.9407, gradient norm:  4.942:   6%|▌         | 39/625 [00:05<01:28,  6.64it/s]
reward: -5.3882, last reward: -3.7604, gradient norm:  9.85:   6%|▌         | 39/625 [00:06<01:28,  6.64it/s]
reward: -5.3882, last reward: -3.7604, gradient norm:  9.85:   6%|▋         | 40/625 [00:06<01:28,  6.63it/s]
reward: -5.3507, last reward: -2.8928, gradient norm:  1.258:   6%|▋         | 40/625 [00:06<01:28,  6.63it/s]
reward: -5.3507, last reward: -2.8928, gradient norm:  1.258:   7%|▋         | 41/625 [00:06<01:27,  6.64it/s]
reward: -5.6978, last reward: -4.4641, gradient norm:  4.549:   7%|▋         | 41/625 [00:06<01:27,  6.64it/s]
reward: -5.6978, last reward: -4.4641, gradient norm:  4.549:   7%|▋         | 42/625 [00:06<01:27,  6.64it/s]
reward: -5.5263, last reward: -3.6047, gradient norm:  2.544:   7%|▋         | 42/625 [00:06<01:27,  6.64it/s]
reward: -5.5263, last reward: -3.6047, gradient norm:  2.544:   7%|▋         | 43/625 [00:06<01:27,  6.64it/s]
reward: -5.5005, last reward: -4.4136, gradient norm:  11.49:   7%|▋         | 43/625 [00:06<01:27,  6.64it/s]
reward: -5.5005, last reward: -4.4136, gradient norm:  11.49:   7%|▋         | 44/625 [00:06<01:27,  6.64it/s]
reward: -5.2993, last reward: -6.3222, gradient norm:  32.53:   7%|▋         | 44/625 [00:06<01:27,  6.64it/s]
reward: -5.2993, last reward: -6.3222, gradient norm:  32.53:   7%|▋         | 45/625 [00:06<01:27,  6.65it/s]
reward: -5.4046, last reward: -5.7314, gradient norm:  7.275:   7%|▋         | 45/625 [00:06<01:27,  6.65it/s]
reward: -5.4046, last reward: -5.7314, gradient norm:  7.275:   7%|▋         | 46/625 [00:06<01:27,  6.65it/s]
reward: -5.6331, last reward: -4.9318, gradient norm:  6.961:   7%|▋         | 46/625 [00:07<01:27,  6.65it/s]
reward: -5.6331, last reward: -4.9318, gradient norm:  6.961:   8%|▊         | 47/625 [00:07<01:26,  6.65it/s]
reward: -4.8331, last reward: -4.1604, gradient norm:  26.26:   8%|▊         | 47/625 [00:07<01:26,  6.65it/s]
reward: -4.8331, last reward: -4.1604, gradient norm:  26.26:   8%|▊         | 48/625 [00:07<01:26,  6.66it/s]
reward: -5.4099, last reward: -4.4761, gradient norm:  8.125:   8%|▊         | 48/625 [00:07<01:26,  6.66it/s]
reward: -5.4099, last reward: -4.4761, gradient norm:  8.125:   8%|▊         | 49/625 [00:07<01:26,  6.65it/s]
reward: -5.4262, last reward: -3.6363, gradient norm:  2.382:   8%|▊         | 49/625 [00:07<01:26,  6.65it/s]
reward: -5.4262, last reward: -3.6363, gradient norm:  2.382:   8%|▊         | 50/625 [00:07<01:26,  6.65it/s]
reward: -5.3593, last reward: -5.7377, gradient norm:  22.62:   8%|▊         | 50/625 [00:07<01:26,  6.65it/s]
reward: -5.3593, last reward: -5.7377, gradient norm:  22.62:   8%|▊         | 51/625 [00:07<01:26,  6.64it/s]
reward: -5.2847, last reward: -3.3443, gradient norm:  2.867:   8%|▊         | 51/625 [00:07<01:26,  6.64it/s]
reward: -5.2847, last reward: -3.3443, gradient norm:  2.867:   8%|▊         | 52/625 [00:07<01:26,  6.64it/s]
reward: -5.3592, last reward: -6.4760, gradient norm:  8.441:   8%|▊         | 52/625 [00:07<01:26,  6.64it/s]
reward: -5.3592, last reward: -6.4760, gradient norm:  8.441:   8%|▊         | 53/625 [00:07<01:26,  6.64it/s]
reward: -5.9950, last reward: -10.8021, gradient norm:  11.77:   8%|▊         | 53/625 [00:08<01:26,  6.64it/s]
reward: -5.9950, last reward: -10.8021, gradient norm:  11.77:   9%|▊         | 54/625 [00:08<01:25,  6.65it/s]
reward: -6.3528, last reward: -7.1214, gradient norm:  7.708:   9%|▊         | 54/625 [00:08<01:25,  6.65it/s]
reward: -6.3528, last reward: -7.1214, gradient norm:  7.708:   9%|▉         | 55/625 [00:08<01:25,  6.64it/s]
reward: -6.4023, last reward: -7.3583, gradient norm:  9.041:   9%|▉         | 55/625 [00:08<01:25,  6.64it/s]
reward: -6.4023, last reward: -7.3583, gradient norm:  9.041:   9%|▉         | 56/625 [00:08<01:25,  6.64it/s]
reward: -6.3801, last reward: -7.0310, gradient norm:  120.1:   9%|▉         | 56/625 [00:08<01:25,  6.64it/s]
reward: -6.3801, last reward: -7.0310, gradient norm:  120.1:   9%|▉         | 57/625 [00:08<01:25,  6.64it/s]
reward: -6.4244, last reward: -6.2039, gradient norm:  15.48:   9%|▉         | 57/625 [00:08<01:25,  6.64it/s]
reward: -6.4244, last reward: -6.2039, gradient norm:  15.48:   9%|▉         | 58/625 [00:08<01:25,  6.64it/s]
reward: -6.4850, last reward: -6.8748, gradient norm:  4.706:   9%|▉         | 58/625 [00:08<01:25,  6.64it/s]
reward: -6.4850, last reward: -6.8748, gradient norm:  4.706:   9%|▉         | 59/625 [00:08<01:25,  6.63it/s]
reward: -6.4897, last reward: -5.9210, gradient norm:  11.63:   9%|▉         | 59/625 [00:09<01:25,  6.63it/s]
reward: -6.4897, last reward: -5.9210, gradient norm:  11.63:  10%|▉         | 60/625 [00:09<01:25,  6.64it/s]
reward: -6.2299, last reward: -7.8964, gradient norm:  13.35:  10%|▉         | 60/625 [00:09<01:25,  6.64it/s]
reward: -6.2299, last reward: -7.8964, gradient norm:  13.35:  10%|▉         | 61/625 [00:09<01:24,  6.65it/s]
reward: -6.0832, last reward: -9.3934, gradient norm:  4.456:  10%|▉         | 61/625 [00:09<01:24,  6.65it/s]
reward: -6.0832, last reward: -9.3934, gradient norm:  4.456:  10%|▉         | 62/625 [00:09<01:24,  6.65it/s]
reward: -5.8971, last reward: -10.2933, gradient norm:  10.74:  10%|▉         | 62/625 [00:09<01:24,  6.65it/s]
reward: -5.8971, last reward: -10.2933, gradient norm:  10.74:  10%|█         | 63/625 [00:09<01:24,  6.64it/s]
reward: -5.3377, last reward: -4.6996, gradient norm:  23.29:  10%|█         | 63/625 [00:09<01:24,  6.64it/s]
reward: -5.3377, last reward: -4.6996, gradient norm:  23.29:  10%|█         | 64/625 [00:09<01:24,  6.63it/s]
reward: -5.2274, last reward: -2.8916, gradient norm:  4.098:  10%|█         | 64/625 [00:09<01:24,  6.63it/s]
reward: -5.2274, last reward: -2.8916, gradient norm:  4.098:  10%|█         | 65/625 [00:09<01:24,  6.64it/s]
reward: -5.2660, last reward: -4.9110, gradient norm:  12.28:  10%|█         | 65/625 [00:09<01:24,  6.64it/s]
reward: -5.2660, last reward: -4.9110, gradient norm:  12.28:  11%|█         | 66/625 [00:09<01:24,  6.64it/s]
reward: -5.4503, last reward: -5.6956, gradient norm:  12.22:  11%|█         | 66/625 [00:10<01:24,  6.64it/s]
reward: -5.4503, last reward: -5.6956, gradient norm:  12.22:  11%|█         | 67/625 [00:10<01:23,  6.65it/s]
reward: -5.9172, last reward: -5.4026, gradient norm:  7.946:  11%|█         | 67/625 [00:10<01:23,  6.65it/s]
reward: -5.9172, last reward: -5.4026, gradient norm:  7.946:  11%|█         | 68/625 [00:10<01:23,  6.65it/s]
reward: -5.9229, last reward: -4.5205, gradient norm:  6.294:  11%|█         | 68/625 [00:10<01:23,  6.65it/s]
reward: -5.9229, last reward: -4.5205, gradient norm:  6.294:  11%|█         | 69/625 [00:10<01:23,  6.65it/s]
reward: -5.8872, last reward: -5.6637, gradient norm:  8.019:  11%|█         | 69/625 [00:10<01:23,  6.65it/s]
reward: -5.8872, last reward: -5.6637, gradient norm:  8.019:  11%|█         | 70/625 [00:10<01:23,  6.64it/s]
reward: -5.9281, last reward: -4.2082, gradient norm:  5.724:  11%|█         | 70/625 [00:10<01:23,  6.64it/s]
reward: -5.9281, last reward: -4.2082, gradient norm:  5.724:  11%|█▏        | 71/625 [00:10<01:23,  6.63it/s]
reward: -5.8561, last reward: -5.6574, gradient norm:  8.357:  11%|█▏        | 71/625 [00:10<01:23,  6.63it/s]
reward: -5.8561, last reward: -5.6574, gradient norm:  8.357:  12%|█▏        | 72/625 [00:10<01:23,  6.62it/s]
reward: -5.4138, last reward: -4.5230, gradient norm:  7.385:  12%|█▏        | 72/625 [00:11<01:23,  6.62it/s]
reward: -5.4138, last reward: -4.5230, gradient norm:  7.385:  12%|█▏        | 73/625 [00:11<01:23,  6.61it/s]
reward: -5.4065, last reward: -5.5642, gradient norm:  9.921:  12%|█▏        | 73/625 [00:11<01:23,  6.61it/s]
reward: -5.4065, last reward: -5.5642, gradient norm:  9.921:  12%|█▏        | 74/625 [00:11<01:23,  6.62it/s]
reward: -4.9786, last reward: -3.2894, gradient norm:  32.73:  12%|█▏        | 74/625 [00:11<01:23,  6.62it/s]
reward: -4.9786, last reward: -3.2894, gradient norm:  32.73:  12%|█▏        | 75/625 [00:11<01:23,  6.63it/s]
reward: -5.4129, last reward: -7.5831, gradient norm:  9.266:  12%|█▏        | 75/625 [00:11<01:23,  6.63it/s]
reward: -5.4129, last reward: -7.5831, gradient norm:  9.266:  12%|█▏        | 76/625 [00:11<01:22,  6.62it/s]
reward: -5.7723, last reward: -7.4152, gradient norm:  5.608:  12%|█▏        | 76/625 [00:11<01:22,  6.62it/s]
reward: -5.7723, last reward: -7.4152, gradient norm:  5.608:  12%|█▏        | 77/625 [00:11<01:22,  6.63it/s]
reward: -6.1604, last reward: -8.0898, gradient norm:  4.389:  12%|█▏        | 77/625 [00:11<01:22,  6.63it/s]
reward: -6.1604, last reward: -8.0898, gradient norm:  4.389:  12%|█▏        | 78/625 [00:11<01:22,  6.63it/s]
reward: -6.5155, last reward: -5.5376, gradient norm:  36.34:  12%|█▏        | 78/625 [00:11<01:22,  6.63it/s]
reward: -6.5155, last reward: -5.5376, gradient norm:  36.34:  13%|█▎        | 79/625 [00:11<01:22,  6.61it/s]
reward: -6.5616, last reward: -6.4094, gradient norm:  8.283:  13%|█▎        | 79/625 [00:12<01:22,  6.61it/s]
reward: -6.5616, last reward: -6.4094, gradient norm:  8.283:  13%|█▎        | 80/625 [00:12<01:22,  6.61it/s]
reward: -6.5333, last reward: -7.4803, gradient norm:  5.895:  13%|█▎        | 80/625 [00:12<01:22,  6.61it/s]
reward: -6.5333, last reward: -7.4803, gradient norm:  5.895:  13%|█▎        | 81/625 [00:12<01:22,  6.62it/s]
reward: -6.6566, last reward: -5.2588, gradient norm:  7.662:  13%|█▎        | 81/625 [00:12<01:22,  6.62it/s]
reward: -6.6566, last reward: -5.2588, gradient norm:  7.662:  13%|█▎        | 82/625 [00:12<01:21,  6.62it/s]
reward: -6.4732, last reward: -6.7503, gradient norm:  6.068:  13%|█▎        | 82/625 [00:12<01:21,  6.62it/s]
reward: -6.4732, last reward: -6.7503, gradient norm:  6.068:  13%|█▎        | 83/625 [00:12<01:22,  6.60it/s]
reward: -6.0714, last reward: -7.3370, gradient norm:  8.059:  13%|█▎        | 83/625 [00:12<01:22,  6.60it/s]
reward: -6.0714, last reward: -7.3370, gradient norm:  8.059:  13%|█▎        | 84/625 [00:12<01:21,  6.61it/s]
reward: -5.8612, last reward: -6.1915, gradient norm:  9.3:  13%|█▎        | 84/625 [00:12<01:21,  6.61it/s]
reward: -5.8612, last reward: -6.1915, gradient norm:  9.3:  14%|█▎        | 85/625 [00:12<01:21,  6.61it/s]
reward: -5.3855, last reward: -5.0349, gradient norm:  15.2:  14%|█▎        | 85/625 [00:12<01:21,  6.61it/s]
reward: -5.3855, last reward: -5.0349, gradient norm:  15.2:  14%|█▍        | 86/625 [00:12<01:21,  6.61it/s]
reward: -4.9644, last reward: -3.4538, gradient norm:  3.445:  14%|█▍        | 86/625 [00:13<01:21,  6.61it/s]
reward: -4.9644, last reward: -3.4538, gradient norm:  3.445:  14%|█▍        | 87/625 [00:13<01:21,  6.62it/s]
reward: -5.0392, last reward: -4.4080, gradient norm:  11.45:  14%|█▍        | 87/625 [00:13<01:21,  6.62it/s]
reward: -5.0392, last reward: -4.4080, gradient norm:  11.45:  14%|█▍        | 88/625 [00:13<01:21,  6.62it/s]
reward: -5.1648, last reward: -5.9599, gradient norm:  143.4:  14%|█▍        | 88/625 [00:13<01:21,  6.62it/s]
reward: -5.1648, last reward: -5.9599, gradient norm:  143.4:  14%|█▍        | 89/625 [00:13<01:20,  6.62it/s]
reward: -5.4284, last reward: -5.5946, gradient norm:  10.3:  14%|█▍        | 89/625 [00:13<01:20,  6.62it/s]
reward: -5.4284, last reward: -5.5946, gradient norm:  10.3:  14%|█▍        | 90/625 [00:13<01:20,  6.63it/s]
reward: -5.2590, last reward: -5.9181, gradient norm:  11.15:  14%|█▍        | 90/625 [00:13<01:20,  6.63it/s]
reward: -5.2590, last reward: -5.9181, gradient norm:  11.15:  15%|█▍        | 91/625 [00:13<01:20,  6.63it/s]
reward: -5.4621, last reward: -5.9075, gradient norm:  8.674:  15%|█▍        | 91/625 [00:13<01:20,  6.63it/s]
reward: -5.4621, last reward: -5.9075, gradient norm:  8.674:  15%|█▍        | 92/625 [00:13<01:20,  6.62it/s]
reward: -5.1772, last reward: -4.9444, gradient norm:  8.351:  15%|█▍        | 92/625 [00:14<01:20,  6.62it/s]
reward: -5.1772, last reward: -4.9444, gradient norm:  8.351:  15%|█▍        | 93/625 [00:14<01:20,  6.63it/s]
reward: -4.9391, last reward: -4.5595, gradient norm:  8.1:  15%|█▍        | 93/625 [00:14<01:20,  6.63it/s]
reward: -4.9391, last reward: -4.5595, gradient norm:  8.1:  15%|█▌        | 94/625 [00:14<01:20,  6.61it/s]
reward: -4.8673, last reward: -4.6240, gradient norm:  14.43:  15%|█▌        | 94/625 [00:14<01:20,  6.61it/s]
reward: -4.8673, last reward: -4.6240, gradient norm:  14.43:  15%|█▌        | 95/625 [00:14<01:20,  6.61it/s]
reward: -4.5919, last reward: -5.0018, gradient norm:  26.09:  15%|█▌        | 95/625 [00:14<01:20,  6.61it/s]
reward: -4.5919, last reward: -5.0018, gradient norm:  26.09:  15%|█▌        | 96/625 [00:14<01:20,  6.61it/s]
reward: -5.1071, last reward: -3.9127, gradient norm:  2.251:  15%|█▌        | 96/625 [00:14<01:20,  6.61it/s]
reward: -5.1071, last reward: -3.9127, gradient norm:  2.251:  16%|█▌        | 97/625 [00:14<01:19,  6.62it/s]
reward: -4.9799, last reward: -5.3131, gradient norm:  19.65:  16%|█▌        | 97/625 [00:14<01:19,  6.62it/s]
reward: -4.9799, last reward: -5.3131, gradient norm:  19.65:  16%|█▌        | 98/625 [00:14<01:19,  6.63it/s]
reward: -4.9612, last reward: -3.9705, gradient norm:  12.55:  16%|█▌        | 98/625 [00:14<01:19,  6.63it/s]
reward: -4.9612, last reward: -3.9705, gradient norm:  12.55:  16%|█▌        | 99/625 [00:14<01:19,  6.61it/s]
reward: -4.8741, last reward: -4.2230, gradient norm:  6.19:  16%|█▌        | 99/625 [00:15<01:19,  6.61it/s]
reward: -4.8741, last reward: -4.2230, gradient norm:  6.19:  16%|█▌        | 100/625 [00:15<01:19,  6.62it/s]
reward: -5.0972, last reward: -5.0337, gradient norm:  11.86:  16%|█▌        | 100/625 [00:15<01:19,  6.62it/s]
reward: -5.0972, last reward: -5.0337, gradient norm:  11.86:  16%|█▌        | 101/625 [00:15<01:19,  6.62it/s]
reward: -5.0350, last reward: -5.0654, gradient norm:  10.83:  16%|█▌        | 101/625 [00:15<01:19,  6.62it/s]
reward: -5.0350, last reward: -5.0654, gradient norm:  10.83:  16%|█▋        | 102/625 [00:15<01:18,  6.62it/s]
reward: -5.2441, last reward: -4.4596, gradient norm:  7.362:  16%|█▋        | 102/625 [00:15<01:18,  6.62it/s]
reward: -5.2441, last reward: -4.4596, gradient norm:  7.362:  16%|█▋        | 103/625 [00:15<01:18,  6.63it/s]
reward: -5.1664, last reward: -5.4362, gradient norm:  8.171:  16%|█▋        | 103/625 [00:15<01:18,  6.63it/s]
reward: -5.1664, last reward: -5.4362, gradient norm:  8.171:  17%|█▋        | 104/625 [00:15<01:18,  6.61it/s]
reward: -5.4041, last reward: -5.6907, gradient norm:  7.77:  17%|█▋        | 104/625 [00:15<01:18,  6.61it/s]
reward: -5.4041, last reward: -5.6907, gradient norm:  7.77:  17%|█▋        | 105/625 [00:15<01:18,  6.58it/s]
reward: -5.4664, last reward: -6.2760, gradient norm:  11.19:  17%|█▋        | 105/625 [00:15<01:18,  6.58it/s]
reward: -5.4664, last reward: -6.2760, gradient norm:  11.19:  17%|█▋        | 106/625 [00:15<01:18,  6.57it/s]
reward: -5.0299, last reward: -3.9712, gradient norm:  9.349:  17%|█▋        | 106/625 [00:16<01:18,  6.57it/s]
reward: -5.0299, last reward: -3.9712, gradient norm:  9.349:  17%|█▋        | 107/625 [00:16<01:18,  6.58it/s]
reward: -4.3332, last reward: -2.4479, gradient norm:  5.772:  17%|█▋        | 107/625 [00:16<01:18,  6.58it/s]
reward: -4.3332, last reward: -2.4479, gradient norm:  5.772:  17%|█▋        | 108/625 [00:16<01:18,  6.60it/s]
reward: -4.4357, last reward: -2.9591, gradient norm:  4.543:  17%|█▋        | 108/625 [00:16<01:18,  6.60it/s]
reward: -4.4357, last reward: -2.9591, gradient norm:  4.543:  17%|█▋        | 109/625 [00:16<01:18,  6.58it/s]
reward: -4.6216, last reward: -3.1353, gradient norm:  4.692:  17%|█▋        | 109/625 [00:16<01:18,  6.58it/s]
reward: -4.6216, last reward: -3.1353, gradient norm:  4.692:  18%|█▊        | 110/625 [00:16<01:18,  6.60it/s]
reward: -4.6261, last reward: -3.7086, gradient norm:  4.496:  18%|█▊        | 110/625 [00:16<01:18,  6.60it/s]
reward: -4.6261, last reward: -3.7086, gradient norm:  4.496:  18%|█▊        | 111/625 [00:16<01:17,  6.60it/s]
reward: -4.7758, last reward: -5.9818, gradient norm:  21.71:  18%|█▊        | 111/625 [00:16<01:17,  6.60it/s]
reward: -4.7758, last reward: -5.9818, gradient norm:  21.71:  18%|█▊        | 112/625 [00:16<01:17,  6.59it/s]
reward: -4.7772, last reward: -7.5055, gradient norm:  62.86:  18%|█▊        | 112/625 [00:17<01:17,  6.59it/s]
reward: -4.7772, last reward: -7.5055, gradient norm:  62.86:  18%|█▊        | 113/625 [00:17<01:17,  6.60it/s]
reward: -4.5840, last reward: -5.3180, gradient norm:  18.74:  18%|█▊        | 113/625 [00:17<01:17,  6.60it/s]
reward: -4.5840, last reward: -5.3180, gradient norm:  18.74:  18%|█▊        | 114/625 [00:17<01:17,  6.62it/s]
reward: -4.2976, last reward: -3.2083, gradient norm:  10.63:  18%|█▊        | 114/625 [00:17<01:17,  6.62it/s]
reward: -4.2976, last reward: -3.2083, gradient norm:  10.63:  18%|█▊        | 115/625 [00:17<01:17,  6.62it/s]
reward: -4.5275, last reward: -3.6873, gradient norm:  15.65:  18%|█▊        | 115/625 [00:17<01:17,  6.62it/s]
reward: -4.5275, last reward: -3.6873, gradient norm:  15.65:  19%|█▊        | 116/625 [00:17<01:16,  6.63it/s]
reward: -4.4107, last reward: -3.1624, gradient norm:  19.7:  19%|█▊        | 116/625 [00:17<01:16,  6.63it/s]
reward: -4.4107, last reward: -3.1624, gradient norm:  19.7:  19%|█▊        | 117/625 [00:17<01:16,  6.63it/s]
reward: -4.6372, last reward: -3.2571, gradient norm:  15.83:  19%|█▊        | 117/625 [00:17<01:16,  6.63it/s]
reward: -4.6372, last reward: -3.2571, gradient norm:  15.83:  19%|█▉        | 118/625 [00:17<01:16,  6.62it/s]
reward: -4.4039, last reward: -4.4428, gradient norm:  13.06:  19%|█▉        | 118/625 [00:17<01:16,  6.62it/s]
reward: -4.4039, last reward: -4.4428, gradient norm:  13.06:  19%|█▉        | 119/625 [00:17<01:16,  6.63it/s]
reward: -4.4728, last reward: -3.5628, gradient norm:  12.04:  19%|█▉        | 119/625 [00:18<01:16,  6.63it/s]
reward: -4.4728, last reward: -3.5628, gradient norm:  12.04:  19%|█▉        | 120/625 [00:18<01:16,  6.61it/s]
reward: -4.6767, last reward: -5.2466, gradient norm:  6.522:  19%|█▉        | 120/625 [00:18<01:16,  6.61it/s]
reward: -4.6767, last reward: -5.2466, gradient norm:  6.522:  19%|█▉        | 121/625 [00:18<01:16,  6.60it/s]
reward: -4.5873, last reward: -6.5072, gradient norm:  19.21:  19%|█▉        | 121/625 [00:18<01:16,  6.60it/s]
reward: -4.5873, last reward: -6.5072, gradient norm:  19.21:  20%|█▉        | 122/625 [00:18<01:16,  6.60it/s]
reward: -4.6548, last reward: -6.3766, gradient norm:  5.692:  20%|█▉        | 122/625 [00:18<01:16,  6.60it/s]
reward: -4.6548, last reward: -6.3766, gradient norm:  5.692:  20%|█▉        | 123/625 [00:18<01:16,  6.60it/s]
reward: -4.5134, last reward: -7.1955, gradient norm:  11.11:  20%|█▉        | 123/625 [00:18<01:16,  6.60it/s]
reward: -4.5134, last reward: -7.1955, gradient norm:  11.11:  20%|█▉        | 124/625 [00:18<01:15,  6.62it/s]
reward: -4.2481, last reward: -7.0591, gradient norm:  11.85:  20%|█▉        | 124/625 [00:18<01:15,  6.62it/s]
reward: -4.2481, last reward: -7.0591, gradient norm:  11.85:  20%|██        | 125/625 [00:18<01:15,  6.60it/s]
reward: -4.4500, last reward: -5.3368, gradient norm:  10.19:  20%|██        | 125/625 [00:19<01:15,  6.60it/s]
reward: -4.4500, last reward: -5.3368, gradient norm:  10.19:  20%|██        | 126/625 [00:19<01:15,  6.60it/s]
reward: -3.9708, last reward: -2.7059, gradient norm:  42.81:  20%|██        | 126/625 [00:19<01:15,  6.60it/s]
reward: -3.9708, last reward: -2.7059, gradient norm:  42.81:  20%|██        | 127/625 [00:19<01:15,  6.61it/s]
reward: -4.3031, last reward: -3.2534, gradient norm:  4.843:  20%|██        | 127/625 [00:19<01:15,  6.61it/s]
reward: -4.3031, last reward: -3.2534, gradient norm:  4.843:  20%|██        | 128/625 [00:19<01:15,  6.61it/s]
reward: -4.3327, last reward: -4.6193, gradient norm:  20.96:  20%|██        | 128/625 [00:19<01:15,  6.61it/s]
reward: -4.3327, last reward: -4.6193, gradient norm:  20.96:  21%|██        | 129/625 [00:19<01:15,  6.61it/s]
reward: -4.4831, last reward: -4.1172, gradient norm:  24.81:  21%|██        | 129/625 [00:19<01:15,  6.61it/s]
reward: -4.4831, last reward: -4.1172, gradient norm:  24.81:  21%|██        | 130/625 [00:19<01:14,  6.61it/s]
reward: -4.2593, last reward: -4.4219, gradient norm:  5.962:  21%|██        | 130/625 [00:19<01:14,  6.61it/s]
reward: -4.2593, last reward: -4.4219, gradient norm:  5.962:  21%|██        | 131/625 [00:19<01:14,  6.61it/s]
reward: -4.4800, last reward: -3.8380, gradient norm:  2.899:  21%|██        | 131/625 [00:19<01:14,  6.61it/s]
reward: -4.4800, last reward: -3.8380, gradient norm:  2.899:  21%|██        | 132/625 [00:19<01:14,  6.59it/s]
reward: -4.2721, last reward: -4.9048, gradient norm:  7.166:  21%|██        | 132/625 [00:20<01:14,  6.59it/s]
reward: -4.2721, last reward: -4.9048, gradient norm:  7.166:  21%|██▏       | 133/625 [00:20<01:14,  6.59it/s]
reward: -4.2419, last reward: -4.5248, gradient norm:  25.93:  21%|██▏       | 133/625 [00:20<01:14,  6.59it/s]
reward: -4.2419, last reward: -4.5248, gradient norm:  25.93:  21%|██▏       | 134/625 [00:20<01:14,  6.60it/s]
reward: -4.2139, last reward: -4.4278, gradient norm:  20.26:  21%|██▏       | 134/625 [00:20<01:14,  6.60it/s]
reward: -4.2139, last reward: -4.4278, gradient norm:  20.26:  22%|██▏       | 135/625 [00:20<01:14,  6.62it/s]
reward: -4.0690, last reward: -2.5140, gradient norm:  22.5:  22%|██▏       | 135/625 [00:20<01:14,  6.62it/s]
reward: -4.0690, last reward: -2.5140, gradient norm:  22.5:  22%|██▏       | 136/625 [00:20<01:13,  6.62it/s]
reward: -4.1140, last reward: -3.7402, gradient norm:  11.11:  22%|██▏       | 136/625 [00:20<01:13,  6.62it/s]
reward: -4.1140, last reward: -3.7402, gradient norm:  11.11:  22%|██▏       | 137/625 [00:20<01:13,  6.62it/s]
reward: -4.5356, last reward: -5.1636, gradient norm:  400.1:  22%|██▏       | 137/625 [00:20<01:13,  6.62it/s]
reward: -4.5356, last reward: -5.1636, gradient norm:  400.1:  22%|██▏       | 138/625 [00:20<01:13,  6.60it/s]
reward: -5.0671, last reward: -5.8798, gradient norm:  13.34:  22%|██▏       | 138/625 [00:20<01:13,  6.60it/s]
reward: -5.0671, last reward: -5.8798, gradient norm:  13.34:  22%|██▏       | 139/625 [00:20<01:13,  6.62it/s]
reward: -4.8918, last reward: -6.3298, gradient norm:  7.307:  22%|██▏       | 139/625 [00:21<01:13,  6.62it/s]
reward: -4.8918, last reward: -6.3298, gradient norm:  7.307:  22%|██▏       | 140/625 [00:21<01:13,  6.62it/s]
reward: -5.1779, last reward: -4.1915, gradient norm:  11.43:  22%|██▏       | 140/625 [00:21<01:13,  6.62it/s]
reward: -5.1779, last reward: -4.1915, gradient norm:  11.43:  23%|██▎       | 141/625 [00:21<01:13,  6.62it/s]
reward: -5.1771, last reward: -4.3624, gradient norm:  6.936:  23%|██▎       | 141/625 [00:21<01:13,  6.62it/s]
reward: -5.1771, last reward: -4.3624, gradient norm:  6.936:  23%|██▎       | 142/625 [00:21<01:12,  6.63it/s]
reward: -5.1683, last reward: -3.4810, gradient norm:  13.29:  23%|██▎       | 142/625 [00:21<01:12,  6.63it/s]
reward: -5.1683, last reward: -3.4810, gradient norm:  13.29:  23%|██▎       | 143/625 [00:21<01:13,  6.60it/s]
reward: -4.9373, last reward: -5.4435, gradient norm:  19.33:  23%|██▎       | 143/625 [00:21<01:13,  6.60it/s]
reward: -4.9373, last reward: -5.4435, gradient norm:  19.33:  23%|██▎       | 144/625 [00:21<01:12,  6.61it/s]
reward: -4.4396, last reward: -4.8092, gradient norm:  118.9:  23%|██▎       | 144/625 [00:21<01:12,  6.61it/s]
reward: -4.4396, last reward: -4.8092, gradient norm:  118.9:  23%|██▎       | 145/625 [00:21<01:12,  6.60it/s]
reward: -4.3911, last reward: -8.2572, gradient norm:  15.04:  23%|██▎       | 145/625 [00:22<01:12,  6.60it/s]
reward: -4.3911, last reward: -8.2572, gradient norm:  15.04:  23%|██▎       | 146/625 [00:22<01:12,  6.62it/s]
reward: -4.4212, last reward: -3.0260, gradient norm:  26.01:  23%|██▎       | 146/625 [00:22<01:12,  6.62it/s]
reward: -4.4212, last reward: -3.0260, gradient norm:  26.01:  24%|██▎       | 147/625 [00:22<01:12,  6.62it/s]
reward: -4.0939, last reward: -4.6478, gradient norm:  9.605:  24%|██▎       | 147/625 [00:22<01:12,  6.62it/s]
reward: -4.0939, last reward: -4.6478, gradient norm:  9.605:  24%|██▎       | 148/625 [00:22<01:12,  6.62it/s]
reward: -4.6606, last reward: -4.7289, gradient norm:  11.19:  24%|██▎       | 148/625 [00:22<01:12,  6.62it/s]
reward: -4.6606, last reward: -4.7289, gradient norm:  11.19:  24%|██▍       | 149/625 [00:22<01:11,  6.63it/s]
reward: -4.9300, last reward: -4.7193, gradient norm:  8.563:  24%|██▍       | 149/625 [00:22<01:11,  6.63it/s]
reward: -4.9300, last reward: -4.7193, gradient norm:  8.563:  24%|██▍       | 150/625 [00:22<01:11,  6.63it/s]
reward: -5.1166, last reward: -4.8514, gradient norm:  8.384:  24%|██▍       | 150/625 [00:22<01:11,  6.63it/s]
reward: -5.1166, last reward: -4.8514, gradient norm:  8.384:  24%|██▍       | 151/625 [00:22<01:11,  6.63it/s]
reward: -4.9108, last reward: -5.0672, gradient norm:  9.292:  24%|██▍       | 151/625 [00:22<01:11,  6.63it/s]
reward: -4.9108, last reward: -5.0672, gradient norm:  9.292:  24%|██▍       | 152/625 [00:22<01:11,  6.63it/s]
reward: -4.8591, last reward: -4.3768, gradient norm:  9.72:  24%|██▍       | 152/625 [00:23<01:11,  6.63it/s]
reward: -4.8591, last reward: -4.3768, gradient norm:  9.72:  24%|██▍       | 153/625 [00:23<01:11,  6.63it/s]
reward: -4.2721, last reward: -3.9976, gradient norm:  10.37:  24%|██▍       | 153/625 [00:23<01:11,  6.63it/s]
reward: -4.2721, last reward: -3.9976, gradient norm:  10.37:  25%|██▍       | 154/625 [00:23<01:11,  6.63it/s]
reward: -4.0576, last reward: -2.0067, gradient norm:  8.935:  25%|██▍       | 154/625 [00:23<01:11,  6.63it/s]
reward: -4.0576, last reward: -2.0067, gradient norm:  8.935:  25%|██▍       | 155/625 [00:23<01:10,  6.64it/s]
reward: -4.4199, last reward: -5.1722, gradient norm:  18.7:  25%|██▍       | 155/625 [00:23<01:10,  6.64it/s]
reward: -4.4199, last reward: -5.1722, gradient norm:  18.7:  25%|██▍       | 156/625 [00:23<01:10,  6.64it/s]
reward: -4.8310, last reward: -7.3466, gradient norm:  28.52:  25%|██▍       | 156/625 [00:23<01:10,  6.64it/s]
reward: -4.8310, last reward: -7.3466, gradient norm:  28.52:  25%|██▌       | 157/625 [00:23<01:10,  6.64it/s]
reward: -4.8631, last reward: -6.2492, gradient norm:  89.17:  25%|██▌       | 157/625 [00:23<01:10,  6.64it/s]
reward: -4.8631, last reward: -6.2492, gradient norm:  89.17:  25%|██▌       | 158/625 [00:23<01:10,  6.64it/s]
reward: -4.8763, last reward: -6.1277, gradient norm:  24.43:  25%|██▌       | 158/625 [00:24<01:10,  6.64it/s]
reward: -4.8763, last reward: -6.1277, gradient norm:  24.43:  25%|██▌       | 159/625 [00:24<01:10,  6.64it/s]
reward: -4.5562, last reward: -5.7446, gradient norm:  23.35:  25%|██▌       | 159/625 [00:24<01:10,  6.64it/s]
reward: -4.5562, last reward: -5.7446, gradient norm:  23.35:  26%|██▌       | 160/625 [00:24<01:10,  6.64it/s]
reward: -4.1082, last reward: -4.9830, gradient norm:  22.14:  26%|██▌       | 160/625 [00:24<01:10,  6.64it/s]
reward: -4.1082, last reward: -4.9830, gradient norm:  22.14:  26%|██▌       | 161/625 [00:24<01:09,  6.64it/s]
reward: -4.0946, last reward: -2.5229, gradient norm:  10.47:  26%|██▌       | 161/625 [00:24<01:09,  6.64it/s]
reward: -4.0946, last reward: -2.5229, gradient norm:  10.47:  26%|██▌       | 162/625 [00:24<01:09,  6.64it/s]
reward: -4.4574, last reward: -4.6900, gradient norm:  112.6:  26%|██▌       | 162/625 [00:24<01:09,  6.64it/s]
reward: -4.4574, last reward: -4.6900, gradient norm:  112.6:  26%|██▌       | 163/625 [00:24<01:09,  6.63it/s]
reward: -5.2229, last reward: -4.0318, gradient norm:  6.482:  26%|██▌       | 163/625 [00:24<01:09,  6.63it/s]
reward: -5.2229, last reward: -4.0318, gradient norm:  6.482:  26%|██▌       | 164/625 [00:24<01:09,  6.64it/s]
reward: -5.0543, last reward: -4.0817, gradient norm:  5.761:  26%|██▌       | 164/625 [00:24<01:09,  6.64it/s]
reward: -5.0543, last reward: -4.0817, gradient norm:  5.761:  26%|██▋       | 165/625 [00:24<01:09,  6.64it/s]
reward: -5.2809, last reward: -4.5118, gradient norm:  5.366:  26%|██▋       | 165/625 [00:25<01:09,  6.64it/s]
reward: -5.2809, last reward: -4.5118, gradient norm:  5.366:  27%|██▋       | 166/625 [00:25<01:08,  6.65it/s]
reward: -5.1142, last reward: -4.5635, gradient norm:  5.04:  27%|██▋       | 166/625 [00:25<01:08,  6.65it/s]
reward: -5.1142, last reward: -4.5635, gradient norm:  5.04:  27%|██▋       | 167/625 [00:25<01:08,  6.64it/s]
reward: -5.1949, last reward: -4.2327, gradient norm:  4.982:  27%|██▋       | 167/625 [00:25<01:08,  6.64it/s]
reward: -5.1949, last reward: -4.2327, gradient norm:  4.982:  27%|██▋       | 168/625 [00:25<01:08,  6.65it/s]
reward: -5.0967, last reward: -5.0387, gradient norm:  7.457:  27%|██▋       | 168/625 [00:25<01:08,  6.65it/s]
reward: -5.0967, last reward: -5.0387, gradient norm:  7.457:  27%|██▋       | 169/625 [00:25<01:08,  6.65it/s]
reward: -5.0782, last reward: -5.2150, gradient norm:  10.54:  27%|██▋       | 169/625 [00:25<01:08,  6.65it/s]
reward: -5.0782, last reward: -5.2150, gradient norm:  10.54:  27%|██▋       | 170/625 [00:25<01:08,  6.65it/s]
reward: -4.5222, last reward: -4.3725, gradient norm:  22.63:  27%|██▋       | 170/625 [00:25<01:08,  6.65it/s]
reward: -4.5222, last reward: -4.3725, gradient norm:  22.63:  27%|██▋       | 171/625 [00:25<01:08,  6.64it/s]
reward: -3.9288, last reward: -3.9837, gradient norm:  83.59:  27%|██▋       | 171/625 [00:25<01:08,  6.64it/s]
reward: -3.9288, last reward: -3.9837, gradient norm:  83.59:  28%|██▊       | 172/625 [00:25<01:08,  6.63it/s]
reward: -4.1416, last reward: -4.1099, gradient norm:  30.57:  28%|██▊       | 172/625 [00:26<01:08,  6.63it/s]
reward: -4.1416, last reward: -4.1099, gradient norm:  30.57:  28%|██▊       | 173/625 [00:26<01:08,  6.62it/s]
reward: -4.8620, last reward: -6.8475, gradient norm:  18.91:  28%|██▊       | 173/625 [00:26<01:08,  6.62it/s]
reward: -4.8620, last reward: -6.8475, gradient norm:  18.91:  28%|██▊       | 174/625 [00:26<01:08,  6.62it/s]
reward: -5.1807, last reward: -6.4375, gradient norm:  18.48:  28%|██▊       | 174/625 [00:26<01:08,  6.62it/s]
reward: -5.1807, last reward: -6.4375, gradient norm:  18.48:  28%|██▊       | 175/625 [00:26<01:07,  6.63it/s]
reward: -5.1148, last reward: -5.0645, gradient norm:  14.36:  28%|██▊       | 175/625 [00:26<01:07,  6.63it/s]
reward: -5.1148, last reward: -5.0645, gradient norm:  14.36:  28%|██▊       | 176/625 [00:26<01:07,  6.62it/s]
reward: -5.2751, last reward: -4.8313, gradient norm:  15.32:  28%|██▊       | 176/625 [00:26<01:07,  6.62it/s]
reward: -5.2751, last reward: -4.8313, gradient norm:  15.32:  28%|██▊       | 177/625 [00:26<01:07,  6.63it/s]
reward: -4.9286, last reward: -6.9770, gradient norm:  24.75:  28%|██▊       | 177/625 [00:26<01:07,  6.63it/s]
reward: -4.9286, last reward: -6.9770, gradient norm:  24.75:  28%|██▊       | 178/625 [00:26<01:07,  6.62it/s]
reward: -4.5735, last reward: -5.2837, gradient norm:  15.2:  28%|██▊       | 178/625 [00:27<01:07,  6.62it/s]
reward: -4.5735, last reward: -5.2837, gradient norm:  15.2:  29%|██▊       | 179/625 [00:27<01:07,  6.63it/s]
reward: -4.2926, last reward: -1.9489, gradient norm:  18.24:  29%|██▊       | 179/625 [00:27<01:07,  6.63it/s]
reward: -4.2926, last reward: -1.9489, gradient norm:  18.24:  29%|██▉       | 180/625 [00:27<01:07,  6.63it/s]
reward: -4.1507, last reward: -3.5593, gradient norm:  37.66:  29%|██▉       | 180/625 [00:27<01:07,  6.63it/s]
reward: -4.1507, last reward: -3.5593, gradient norm:  37.66:  29%|██▉       | 181/625 [00:27<01:06,  6.64it/s]
reward: -3.8724, last reward: -4.3567, gradient norm:  16.67:  29%|██▉       | 181/625 [00:27<01:06,  6.64it/s]
reward: -3.8724, last reward: -4.3567, gradient norm:  16.67:  29%|██▉       | 182/625 [00:27<01:06,  6.64it/s]
reward: -4.3574, last reward: -3.6140, gradient norm:  13.96:  29%|██▉       | 182/625 [00:27<01:06,  6.64it/s]
reward: -4.3574, last reward: -3.6140, gradient norm:  13.96:  29%|██▉       | 183/625 [00:27<01:06,  6.63it/s]
reward: -4.7895, last reward: -6.2518, gradient norm:  14.74:  29%|██▉       | 183/625 [00:27<01:06,  6.63it/s]
reward: -4.7895, last reward: -6.2518, gradient norm:  14.74:  29%|██▉       | 184/625 [00:27<01:06,  6.63it/s]
reward: -4.6146, last reward: -5.6969, gradient norm:  11.45:  29%|██▉       | 184/625 [00:27<01:06,  6.63it/s]
reward: -4.6146, last reward: -5.6969, gradient norm:  11.45:  30%|██▉       | 185/625 [00:27<01:06,  6.63it/s]
reward: -4.8776, last reward: -5.7358, gradient norm:  13.16:  30%|██▉       | 185/625 [00:28<01:06,  6.63it/s]
reward: -4.8776, last reward: -5.7358, gradient norm:  13.16:  30%|██▉       | 186/625 [00:28<01:06,  6.62it/s]
reward: -4.3722, last reward: -4.8428, gradient norm:  23.57:  30%|██▉       | 186/625 [00:28<01:06,  6.62it/s]
reward: -4.3722, last reward: -4.8428, gradient norm:  23.57:  30%|██▉       | 187/625 [00:28<01:06,  6.62it/s]
reward: -4.2656, last reward: -3.7955, gradient norm:  54.67:  30%|██▉       | 187/625 [00:28<01:06,  6.62it/s]
reward: -4.2656, last reward: -3.7955, gradient norm:  54.67:  30%|███       | 188/625 [00:28<01:05,  6.63it/s]
reward: -4.0092, last reward: -1.7106, gradient norm:  7.829:  30%|███       | 188/625 [00:28<01:05,  6.63it/s]
reward: -4.0092, last reward: -1.7106, gradient norm:  7.829:  30%|███       | 189/625 [00:28<01:30,  4.79it/s]
reward: -4.2264, last reward: -3.6919, gradient norm:  16.17:  30%|███       | 189/625 [00:28<01:30,  4.79it/s]
reward: -4.2264, last reward: -3.6919, gradient norm:  16.17:  30%|███       | 190/625 [00:28<01:23,  5.23it/s]
reward: -4.1438, last reward: -2.1362, gradient norm:  19.43:  30%|███       | 190/625 [00:29<01:23,  5.23it/s]
reward: -4.1438, last reward: -2.1362, gradient norm:  19.43:  31%|███       | 191/625 [00:29<01:17,  5.58it/s]
reward: -4.0618, last reward: -2.8217, gradient norm:  73.63:  31%|███       | 191/625 [00:29<01:17,  5.58it/s]
reward: -4.0618, last reward: -2.8217, gradient norm:  73.63:  31%|███       | 192/625 [00:29<01:13,  5.85it/s]
reward: -3.9420, last reward: -3.6765, gradient norm:  34.1:  31%|███       | 192/625 [00:29<01:13,  5.85it/s]
reward: -3.9420, last reward: -3.6765, gradient norm:  34.1:  31%|███       | 193/625 [00:29<01:11,  6.07it/s]
reward: -3.7745, last reward: -4.0709, gradient norm:  26.48:  31%|███       | 193/625 [00:29<01:11,  6.07it/s]
reward: -3.7745, last reward: -4.0709, gradient norm:  26.48:  31%|███       | 194/625 [00:29<01:09,  6.23it/s]
reward: -3.9478, last reward: -2.6867, gradient norm:  22.82:  31%|███       | 194/625 [00:29<01:09,  6.23it/s]
reward: -3.9478, last reward: -2.6867, gradient norm:  22.82:  31%|███       | 195/625 [00:29<01:07,  6.34it/s]
reward: -3.6507, last reward: -2.6225, gradient norm:  37.44:  31%|███       | 195/625 [00:29<01:07,  6.34it/s]
reward: -3.6507, last reward: -2.6225, gradient norm:  37.44:  31%|███▏      | 196/625 [00:29<01:06,  6.40it/s]
reward: -4.2244, last reward: -3.2195, gradient norm:  10.71:  31%|███▏      | 196/625 [00:29<01:06,  6.40it/s]
reward: -4.2244, last reward: -3.2195, gradient norm:  10.71:  32%|███▏      | 197/625 [00:29<01:07,  6.37it/s]
reward: -4.5385, last reward: -3.9263, gradient norm:  31.03:  32%|███▏      | 197/625 [00:30<01:07,  6.37it/s]
reward: -4.5385, last reward: -3.9263, gradient norm:  31.03:  32%|███▏      | 198/625 [00:30<01:06,  6.44it/s]
reward: -4.1878, last reward: -3.2374, gradient norm:  34.35:  32%|███▏      | 198/625 [00:30<01:06,  6.44it/s]
reward: -4.1878, last reward: -3.2374, gradient norm:  34.35:  32%|███▏      | 199/625 [00:30<01:05,  6.49it/s]
reward: -3.8054, last reward: -2.3504, gradient norm:  5.557:  32%|███▏      | 199/625 [00:30<01:05,  6.49it/s]
reward: -3.8054, last reward: -2.3504, gradient norm:  5.557:  32%|███▏      | 200/625 [00:30<01:05,  6.52it/s]
reward: -4.0766, last reward: -4.6825, gradient norm:  38.72:  32%|███▏      | 200/625 [00:30<01:05,  6.52it/s]
reward: -4.0766, last reward: -4.6825, gradient norm:  38.72:  32%|███▏      | 201/625 [00:30<01:04,  6.55it/s]
reward: -4.2011, last reward: -5.8393, gradient norm:  21.06:  32%|███▏      | 201/625 [00:30<01:04,  6.55it/s]
reward: -4.2011, last reward: -5.8393, gradient norm:  21.06:  32%|███▏      | 202/625 [00:30<01:04,  6.56it/s]
reward: -4.0803, last reward: -3.7815, gradient norm:  10.6:  32%|███▏      | 202/625 [00:30<01:04,  6.56it/s]
reward: -4.0803, last reward: -3.7815, gradient norm:  10.6:  32%|███▏      | 203/625 [00:30<01:04,  6.58it/s]
reward: -3.8363, last reward: -3.2460, gradient norm:  32.57:  32%|███▏      | 203/625 [00:30<01:04,  6.58it/s]
reward: -3.8363, last reward: -3.2460, gradient norm:  32.57:  33%|███▎      | 204/625 [00:30<01:03,  6.59it/s]
reward: -3.8643, last reward: -3.2191, gradient norm:  8.593:  33%|███▎      | 204/625 [00:31<01:03,  6.59it/s]
reward: -3.8643, last reward: -3.2191, gradient norm:  8.593:  33%|███▎      | 205/625 [00:31<01:03,  6.59it/s]
reward: -4.0773, last reward: -5.1343, gradient norm:  14.49:  33%|███▎      | 205/625 [00:31<01:03,  6.59it/s]
reward: -4.0773, last reward: -5.1343, gradient norm:  14.49:  33%|███▎      | 206/625 [00:31<01:03,  6.60it/s]
reward: -4.1400, last reward: -5.8657, gradient norm:  17.05:  33%|███▎      | 206/625 [00:31<01:03,  6.60it/s]
reward: -4.1400, last reward: -5.8657, gradient norm:  17.05:  33%|███▎      | 207/625 [00:31<01:03,  6.60it/s]
reward: -3.9304, last reward: -2.7584, gradient norm:  33.25:  33%|███▎      | 207/625 [00:31<01:03,  6.60it/s]
reward: -3.9304, last reward: -2.7584, gradient norm:  33.25:  33%|███▎      | 208/625 [00:31<01:03,  6.60it/s]
reward: -3.8752, last reward: -4.2307, gradient norm:  10.76:  33%|███▎      | 208/625 [00:31<01:03,  6.60it/s]
reward: -3.8752, last reward: -4.2307, gradient norm:  10.76:  33%|███▎      | 209/625 [00:31<01:02,  6.61it/s]
reward: -3.5250, last reward: -1.4869, gradient norm:  40.8:  33%|███▎      | 209/625 [00:31<01:02,  6.61it/s]
reward: -3.5250, last reward: -1.4869, gradient norm:  40.8:  34%|███▎      | 210/625 [00:31<01:02,  6.61it/s]
reward: -3.7837, last reward: -2.5762, gradient norm:  193.3:  34%|███▎      | 210/625 [00:32<01:02,  6.61it/s]
reward: -3.7837, last reward: -2.5762, gradient norm:  193.3:  34%|███▍      | 211/625 [00:32<01:02,  6.61it/s]
reward: -3.6661, last reward: -1.8600, gradient norm:  136.5:  34%|███▍      | 211/625 [00:32<01:02,  6.61it/s]
reward: -3.6661, last reward: -1.8600, gradient norm:  136.5:  34%|███▍      | 212/625 [00:32<01:02,  6.61it/s]
reward: -4.2502, last reward: -3.1752, gradient norm:  21.44:  34%|███▍      | 212/625 [00:32<01:02,  6.61it/s]
reward: -4.2502, last reward: -3.1752, gradient norm:  21.44:  34%|███▍      | 213/625 [00:32<01:02,  6.61it/s]
reward: -4.3075, last reward: -2.8871, gradient norm:  30.65:  34%|███▍      | 213/625 [00:32<01:02,  6.61it/s]
reward: -4.3075, last reward: -2.8871, gradient norm:  30.65:  34%|███▍      | 214/625 [00:32<01:02,  6.58it/s]
reward: -3.9406, last reward: -2.8090, gradient norm:  20.18:  34%|███▍      | 214/625 [00:32<01:02,  6.58it/s]
reward: -3.9406, last reward: -2.8090, gradient norm:  20.18:  34%|███▍      | 215/625 [00:32<01:02,  6.54it/s]
reward: -3.6291, last reward: -2.8923, gradient norm:  7.876:  34%|███▍      | 215/625 [00:32<01:02,  6.54it/s]
reward: -3.6291, last reward: -2.8923, gradient norm:  7.876:  35%|███▍      | 216/625 [00:32<01:02,  6.52it/s]
reward: -3.5112, last reward: -3.9504, gradient norm:  3.21e+03:  35%|███▍      | 216/625 [00:32<01:02,  6.52it/s]
reward: -3.5112, last reward: -3.9504, gradient norm:  3.21e+03:  35%|███▍      | 217/625 [00:32<01:02,  6.52it/s]
reward: -3.7431, last reward: -2.7880, gradient norm:  13.73:  35%|███▍      | 217/625 [00:33<01:02,  6.52it/s]
reward: -3.7431, last reward: -2.7880, gradient norm:  13.73:  35%|███▍      | 218/625 [00:33<01:02,  6.52it/s]
reward: -3.4463, last reward: -4.5432, gradient norm:  32.37:  35%|███▍      | 218/625 [00:33<01:02,  6.52it/s]
reward: -3.4463, last reward: -4.5432, gradient norm:  32.37:  35%|███▌      | 219/625 [00:33<01:02,  6.51it/s]
reward: -3.3793, last reward: -3.3313, gradient norm:  60.63:  35%|███▌      | 219/625 [00:33<01:02,  6.51it/s]
reward: -3.3793, last reward: -3.3313, gradient norm:  60.63:  35%|███▌      | 220/625 [00:33<01:02,  6.51it/s]
reward: -3.8843, last reward: -3.0369, gradient norm:  5.065:  35%|███▌      | 220/625 [00:33<01:02,  6.51it/s]
reward: -3.8843, last reward: -3.0369, gradient norm:  5.065:  35%|███▌      | 221/625 [00:33<01:02,  6.51it/s]
reward: -3.4828, last reward: -3.8391, gradient norm:  59.85:  35%|███▌      | 221/625 [00:33<01:02,  6.51it/s]
reward: -3.4828, last reward: -3.8391, gradient norm:  59.85:  36%|███▌      | 222/625 [00:33<01:01,  6.51it/s]
reward: -3.6265, last reward: -4.2913, gradient norm:  8.947:  36%|███▌      | 222/625 [00:33<01:01,  6.51it/s]
reward: -3.6265, last reward: -4.2913, gradient norm:  8.947:  36%|███▌      | 223/625 [00:33<01:01,  6.50it/s]
reward: -3.5541, last reward: -4.1252, gradient norm:  255.9:  36%|███▌      | 223/625 [00:34<01:01,  6.50it/s]
reward: -3.5541, last reward: -4.1252, gradient norm:  255.9:  36%|███▌      | 224/625 [00:34<01:01,  6.49it/s]
reward: -3.7342, last reward: -2.2396, gradient norm:  7.995:  36%|███▌      | 224/625 [00:34<01:01,  6.49it/s]
reward: -3.7342, last reward: -2.2396, gradient norm:  7.995:  36%|███▌      | 225/625 [00:34<01:01,  6.48it/s]
reward: -3.5936, last reward: -4.1924, gradient norm:  59.49:  36%|███▌      | 225/625 [00:34<01:01,  6.48it/s]
reward: -3.5936, last reward: -4.1924, gradient norm:  59.49:  36%|███▌      | 226/625 [00:34<01:01,  6.48it/s]
reward: -3.9975, last reward: -4.2045, gradient norm:  21.77:  36%|███▌      | 226/625 [00:34<01:01,  6.48it/s]
reward: -3.9975, last reward: -4.2045, gradient norm:  21.77:  36%|███▋      | 227/625 [00:34<01:01,  6.48it/s]
reward: -3.8367, last reward: -1.9540, gradient norm:  32.26:  36%|███▋      | 227/625 [00:34<01:01,  6.48it/s]
reward: -3.8367, last reward: -1.9540, gradient norm:  32.26:  36%|███▋      | 228/625 [00:34<01:01,  6.48it/s]
reward: -3.7259, last reward: -3.6743, gradient norm:  28.62:  36%|███▋      | 228/625 [00:34<01:01,  6.48it/s]
reward: -3.7259, last reward: -3.6743, gradient norm:  28.62:  37%|███▋      | 229/625 [00:34<01:01,  6.41it/s]
reward: -3.4827, last reward: -3.7528, gradient norm:  64.85:  37%|███▋      | 229/625 [00:34<01:01,  6.41it/s]
reward: -3.4827, last reward: -3.7528, gradient norm:  64.85:  37%|███▋      | 230/625 [00:34<01:01,  6.46it/s]
reward: -3.7361, last reward: -3.8756, gradient norm:  24.69:  37%|███▋      | 230/625 [00:35<01:01,  6.46it/s]
reward: -3.7361, last reward: -3.8756, gradient norm:  24.69:  37%|███▋      | 231/625 [00:35<01:00,  6.51it/s]
reward: -3.7646, last reward: -3.1116, gradient norm:  14.25:  37%|███▋      | 231/625 [00:35<01:00,  6.51it/s]
reward: -3.7646, last reward: -3.1116, gradient norm:  14.25:  37%|███▋      | 232/625 [00:35<01:00,  6.53it/s]
reward: -3.5426, last reward: -2.8385, gradient norm:  34.07:  37%|███▋      | 232/625 [00:35<01:00,  6.53it/s]
reward: -3.5426, last reward: -2.8385, gradient norm:  34.07:  37%|███▋      | 233/625 [00:35<00:59,  6.56it/s]
reward: -3.5662, last reward: -1.8585, gradient norm:  11.26:  37%|███▋      | 233/625 [00:35<00:59,  6.56it/s]
reward: -3.5662, last reward: -1.8585, gradient norm:  11.26:  37%|███▋      | 234/625 [00:35<00:59,  6.58it/s]
reward: -3.8234, last reward: -2.7930, gradient norm:  32.18:  37%|███▋      | 234/625 [00:35<00:59,  6.58it/s]
reward: -3.8234, last reward: -2.7930, gradient norm:  32.18:  38%|███▊      | 235/625 [00:35<00:59,  6.59it/s]
reward: -4.2648, last reward: -4.9309, gradient norm:  24.83:  38%|███▊      | 235/625 [00:35<00:59,  6.59it/s]
reward: -4.2648, last reward: -4.9309, gradient norm:  24.83:  38%|███▊      | 236/625 [00:35<00:59,  6.59it/s]
reward: -4.2039, last reward: -3.6817, gradient norm:  19.24:  38%|███▊      | 236/625 [00:36<00:59,  6.59it/s]
reward: -4.2039, last reward: -3.6817, gradient norm:  19.24:  38%|███▊      | 237/625 [00:36<00:58,  6.60it/s]
reward: -4.0943, last reward: -3.1533, gradient norm:  145.1:  38%|███▊      | 238/625 [00:36<00:58,  6.61it/s]
reward: -4.3045, last reward: -3.0483, gradient norm:  20.89:  38%|███▊      | 239/625 [00:36<00:58,  6.56it/s]
reward: -4.4128, last reward: -5.2528, gradient norm:  24.97:  38%|███▊      | 240/625 [00:36<00:58,  6.56it/s]
reward: -4.6415, last reward: -8.0201, gradient norm:  26.74:  39%|███▊      | 241/625 [00:36<00:58,  6.53it/s]
reward: -4.4437, last reward: -5.4365, gradient norm:  132.7:  39%|███▊      | 242/625 [00:36<00:58,  6.53it/s]
reward: -4.0358, last reward: -3.4943, gradient norm:  11.46:  39%|███▉      | 243/625 [00:36<00:58,  6.52it/s]
reward: -4.1272, last reward: -3.5003, gradient norm:  68.09:  39%|███▉      | 244/625 [00:37<00:58,  6.52it/s]
reward: -4.1180, last reward: -4.2637, gradient norm:  39.25:  39%|███▉      | 245/625 [00:37<00:58,  6.52it/s]
reward: -4.7197, last reward: -3.0873, gradient norm:  12.2:  39%|███▉      | 246/625 [00:37<00:58,  6.52it/s]
reward: -4.2917, last reward: -3.6656, gradient norm:  17.17:  40%|███▉      | 247/625 [00:37<00:58,  6.51it/s]
reward: -4.0160, last reward: -3.0738, gradient norm:  43.07:  40%|███▉      | 248/625 [00:37<00:57,  6.51it/s]
reward: -4.3689, last reward: -4.0120, gradient norm:  11.81:  40%|███▉      | 249/625 [00:37<00:57,  6.51it/s]
reward: -4.5570, last reward: -7.0475, gradient norm:  22.45:  40%|████      | 250/625 [00:38<00:57,  6.49it/s]
reward: -4.4423, last reward: -5.2220, gradient norm:  18.4:  40%|████      | 251/625 [00:38<00:57,  6.48it/s]
reward: -4.2118, last reward: -4.6803, gradient norm:  15.86:  40%|████      | 252/625 [00:38<00:57,  6.48it/s]
reward: -4.1465, last reward: -3.7214, gradient norm:  25.93:  40%|████      | 253/625 [00:38<00:57,  6.49it/s]
reward: -3.8801, last reward: -2.7034, gradient norm:  103.6:  41%|████      | 254/625 [00:38<00:57,  6.48it/s]
reward: -3.9136, last reward: -4.4076, gradient norm:  17.63:  41%|████      | 255/625 [00:38<00:56,  6.50it/s]
reward: -3.7589, last reward: -4.5013, gradient norm:  143.3:  41%|████      | 256/625 [00:38<00:56,  6.49it/s]
reward: -3.8150, last reward: -3.2241, gradient norm:  113.9:  41%|████      | 257/625 [00:39<00:56,  6.51it/s]
reward: -4.0753, last reward: -3.8081, gradient norm:  14.8:  41%|████▏     | 258/625 [00:39<00:56,  6.50it/s]
reward: -4.1951, last reward: -4.8314, gradient norm:  27.63:  41%|████▏     | 259/625 [00:39<00:56,  6.53it/s]
reward: -4.0038, last reward: -2.5333, gradient norm:  42.85:  42%|████▏     | 260/625 [00:39<00:56,  6.51it/s]
reward: -4.0889, last reward: -2.4616, gradient norm:  13.78:  42%|████▏     | 261/625 [00:39<00:55,  6.51it/s]
reward: -4.0655, last reward: -2.6873, gradient norm:  10.98:  42%|████▏     | 262/625 [00:39<00:55,  6.51it/s]
reward: -3.8333, last reward: -1.9476, gradient norm:  13.47:  42%|████▏     | 263/625 [00:40<00:55,  6.50it/s]
reward: -3.7554, last reward: -4.3798, gradient norm:  41.76:  42%|████▏     | 264/625 [00:40<00:55,  6.49it/s]
reward: -3.3717, last reward: -2.3947, gradient norm:  6.529:  42%|████▏     | 265/625 [00:40<00:55,  6.50it/s]
reward: -4.3060, last reward: -4.6495, gradient norm:  11.24:  43%|████▎     | 266/625 [00:40<00:55,  6.49it/s]
reward: -4.7467, last reward: -5.8889, gradient norm:  12.35:  43%|████▎     | 267/625 [00:40<00:55,  6.50it/s]
reward: -4.9281, last reward: -4.8457, gradient norm:  6.591:  43%|████▎     | 268/625 [00:40<00:55,  6.49it/s]
reward: -4.7137, last reward: -4.0536, gradient norm:  5.771:  43%|████▎     | 269/625 [00:40<00:54,  6.48it/s]
reward: -4.7197, last reward: -4.1651, gradient norm:  5.388:  43%|████▎     | 270/625 [00:41<00:54,  6.47it/s]
reward: -4.8246, last reward: -5.5709, gradient norm:  8.281:  43%|████▎     | 271/625 [00:41<00:54,  6.48it/s]
reward: -4.7502, last reward: -5.0521, gradient norm:  9.032:  44%|████▎     | 272/625 [00:41<00:54,  6.52it/s]
reward: -4.5475, last reward: -4.7253, gradient norm:  21.18:  44%|████▎     | 273/625 [00:41<00:53,  6.53it/s]
reward: -4.2856, last reward: -3.7130, gradient norm:  13.53:  44%|████▍     | 274/625 [00:41<00:53,  6.51it/s]
reward: -3.2778, last reward: -3.4122, gradient norm:  28.52:  44%|████▍     | 275/625 [00:41<00:53,  6.52it/s]
reward: -3.8368, last reward: -2.1841, gradient norm:  2.07:  44%|████▍     | 276/625 [00:42<00:53,  6.51it/s]
reward: -3.9622, last reward: -3.1603, gradient norm:  1.003e+03:  44%|████▍     | 277/625 [00:42<00:53,  6.53it/s]
reward: -4.0247, last reward: -2.9830, gradient norm:  8.346:  44%|████▍     | 278/625 [00:42<00:53,  6.51it/s]
reward: -4.2238, last reward: -4.6418, gradient norm:  14.55:  45%|████▍     | 279/625 [00:42<00:53,  6.51it/s]
reward: -4.0626, last reward: -4.2538, gradient norm:  17.88:  45%|████▍     | 280/625 [00:42<00:53,  6.50it/s]
reward: -4.0149, last reward: -3.7380, gradient norm:  13.13:  45%|████▍     | 281/625 [00:42<00:52,  6.50it/s]
reward: -4.2167, last reward: -2.8911, gradient norm:  11.41:  45%|████▌     | 282/625 [00:42<00:52,  6.47it/s]
reward: -3.8725, last reward: -4.1983, gradient norm:  18.88:  45%|████▌     | 283/625 [00:43<00:52,  6.48it/s]
reward: -2.8142, last reward: -2.3709, gradient norm:  43.73:  45%|████▌     | 284/625 [00:43<00:52,  6.47it/s]
reward: -3.2022, last reward: -2.4989, gradient norm:  11.14:  46%|████▌     | 285/625 [00:43<00:52,  6.50it/s]
reward: -3.6464, last reward: -1.6210, gradient norm:  43.37:  46%|████▌     | 286/625 [00:43<00:52,  6.50it/s]
reward: -3.9726, last reward: -3.0820, gradient norm:  39.93:  46%|████▌     | 287/625 [00:43<00:52,  6.49it/s]
reward: -3.6975, last reward: -2.9091, gradient norm:  29.46:  46%|████▌     | 288/625 [00:43<00:51,  6.49it/s]
reward: -3.4926, last reward: -2.4791, gradient norm:  160.7:  46%|████▌     | 289/625 [00:44<00:52,  6.40it/s]
reward: -3.0905, last reward: -1.3500, gradient norm:  31.38:  46%|████▋     | 290/625 [00:44<00:51,  6.46it/s]
reward: -3.2287, last reward: -2.7137, gradient norm:  26.31:  47%|████▋     | 291/625 [00:44<00:51,  6.50it/s]
reward: -2.9918, last reward: -1.5543, gradient norm:  29.73:  47%|████▋     | 292/625 [00:44<00:50,  6.54it/s]
reward: -2.9245, last reward: -0.6444, gradient norm:  2.631:  47%|████▋     | 293/625 [00:44<00:50,  6.56it/s]
reward: -3.0448, last reward: -0.4769, gradient norm:  7.266:  47%|████▋     | 294/625 [00:44<00:50,  6.58it/s]
reward: -2.8566, last reward: -1.7208, gradient norm:  25.22:  47%|████▋     | 295/625 [00:44<00:50,  6.60it/s]
reward: -2.8872, last reward: -1.0966, gradient norm:  8.247:  47%|████▋     | 296/625 [00:45<00:49,  6.61it/s]
reward: -2.5303, last reward: -0.1537, gradient norm:  2.023:  48%|████▊     | 297/625 [00:45<00:49,  6.62it/s]
reward: -2.6817, last reward: -0.2682, gradient norm:  7.564:  48%|████▊     | 298/625 [00:45<00:49,  6.63it/s]
reward: -2.4318, last reward: -0.5063, gradient norm:  14.87:  48%|████▊     | 299/625 [00:45<00:49,  6.62it/s]
reward: -2.7475, last reward: -1.4190, gradient norm:  21.66:  48%|████▊     | 300/625 [00:45<00:49,  6.63it/s]
reward: -2.8186, last reward: -2.5077, gradient norm:  22.4:  48%|████▊     | 301/625 [00:45<00:48,  6.63it/s]
reward: -3.1883, last reward: -1.5291, gradient norm:  7.472:  48%|████▊     | 302/625 [00:46<00:48,  6.63it/s]
reward: -2.1256, last reward: -0.3998, gradient norm:  11.01:  48%|████▊     | 303/625 [00:46<00:48,  6.63it/s]
reward: -2.3622, last reward: -0.0930, gradient norm:  1.626:  49%|████▊     | 304/625 [00:46<00:48,  6.63it/s]
reward: -1.9500, last reward: -0.0075, gradient norm:  0.5664:  49%|████▉     | 305/625 [00:46<00:48,  6.64it/s]
reward: -2.5697, last reward: -0.3024, gradient norm:  22.61:  49%|████▉     | 306/625 [00:46<00:48,  6.64it/s]
reward: -2.3117, last reward: -0.0052, gradient norm:  1.006:  49%|████▉     | 307/625 [00:46<00:47,  6.63it/s]
reward: -2.0981, last reward: -0.0018, gradient norm:  0.9312:  49%|████▉     | 308/625 [00:46<00:47,  6.63it/s]
reward: -2.5140, last reward: -0.3873, gradient norm:  3.93:  49%|████▉     | 309/625 [00:47<00:47,  6.63it/s]
reward: -2.0411, last reward: -0.2650, gradient norm:  3.183:  50%|████▉     | 310/625 [00:47<00:47,  6.63it/s]
reward: -2.1656, last reward: -0.0228, gradient norm:  2.004:  50%|████▉     | 311/625 [00:47<00:47,  6.62it/s]
reward: -2.1196, last reward: -0.2478, gradient norm:  11.78:  50%|████▉     | 312/625 [00:47<00:47,  6.62it/s]
reward: -2.7353, last reward: -3.0812, gradient norm:  82.91:  50%|█████     | 313/625 [00:47<00:47,  6.62it/s]
reward: -3.0995, last reward: -2.3022, gradient norm:  8.758:  50%|█████     | 314/625 [00:47<00:46,  6.62it/s]
reward: -3.1406, last reward: -2.4626, gradient norm:  15.99:  50%|█████     | 315/625 [00:47<00:46,  6.62it/s]
reward: -3.2156, last reward: -1.9055, gradient norm:  7.851:  51%|█████     | 316/625 [00:48<00:46,  6.62it/s]
reward: -3.1953, last reward: -2.3774, gradient norm:  19.78:  51%|█████     | 317/625 [00:48<00:46,  6.63it/s]
reward: -2.6385, last reward: -0.9917, gradient norm:  16.15:  51%|█████     | 318/625 [00:48<00:46,  6.62it/s]
reward: -2.2764, last reward: -0.0536, gradient norm:  2.905:  51%|█████     | 319/625 [00:48<00:46,  6.62it/s]
reward: -2.6391, last reward: -1.9317, gradient norm:  23.78:  51%|█████     | 320/625 [00:48<00:46,  6.62it/s]
reward: -2.9748, last reward: -4.2679, gradient norm:  59.43:  51%|█████▏    | 321/625 [00:48<00:45,  6.62it/s]
reward: -2.8495, last reward: -4.5125, gradient norm:  52.19:  52%|█████▏    | 322/625 [00:49<00:45,  6.62it/s]
reward: -2.8177, last reward: -2.6602, gradient norm:  52.75:  52%|█████▏    | 323/625 [00:49<00:45,  6.62it/s]
reward: -2.0704, last reward: -0.5776, gradient norm:  59.07:  52%|█████▏    | 324/625 [00:49<00:45,  6.62it/s]
reward: -1.9833, last reward: -0.1339, gradient norm:  4.402:  52%|█████▏    | 325/625 [00:49<00:45,  6.62it/s]
reward: -2.2760, last reward: -2.1238, gradient norm:  30.36:  52%|█████▏    | 326/625 [00:49<00:45,  6.63it/s]
reward: -2.9299, last reward: -5.0227, gradient norm:  100.5:  52%|█████▏    | 327/625 [00:49<00:44,  6.63it/s]
reward: -2.7727, last reward: -2.1607, gradient norm:  336.7:  52%|█████▏    | 328/625 [00:49<00:44,  6.62it/s]
reward: -2.3958, last reward: -0.3223, gradient norm:  2.763:  53%|█████▎    | 329/625 [00:50<00:44,  6.63it/s]
reward: -2.4742, last reward: -0.1797, gradient norm:  47.32:  53%|█████▎    | 330/625 [00:50<00:44,  6.63it/s]
reward: -2.0144, last reward: -0.0085, gradient norm:  4.791:  53%|█████▎    | 331/625 [00:50<00:44,  6.64it/s]
reward: -1.8284, last reward: -0.0428, gradient norm:  12.29:  53%|█████▎    | 332/625 [00:50<00:44,  6.64it/s]
reward: -2.5229, last reward: -0.0098, gradient norm:  0.7365:  53%|█████▎    | 333/625 [00:50<00:44,  6.63it/s]
reward: -2.4566, last reward: -0.0781, gradient norm:  2.086:  53%|█████▎    | 334/625 [00:50<00:43,  6.64it/s]
reward: -2.3355, last reward: -0.0230, gradient norm:  1.311:  54%|█████▎    | 335/625 [00:50<00:43,  6.63it/s]
reward: -1.9346, last reward: -0.0423, gradient norm:  1.076:  54%|█████▍    | 336/625 [00:51<00:43,  6.63it/s]
reward: -2.3711, last reward: -0.1335, gradient norm:  0.6855:  54%|█████▍    | 337/625 [00:51<00:43,  6.63it/s]
reward: -2.0304, last reward: -0.0023, gradient norm:  0.8459:  54%|█████▍    | 338/625 [00:51<00:43,  6.63it/s]
reward: -1.9998, last reward: -0.4399, gradient norm:  13.1:  54%|█████▍    | 339/625 [00:51<00:43,  6.64it/s]
reward: -2.2303, last reward: -2.1346, gradient norm:  45.99:  54%|█████▍    | 340/625 [00:51<00:43,  6.63it/s]
reward: -2.2915, last reward: -1.7116, gradient norm:  40.34:  55%|█████▍    | 341/625 [00:51<00:42,  6.62it/s]
reward: -2.5560, last reward: -0.0487, gradient norm:  1.195:  55%|█████▍    | 342/625 [00:52<00:42,  6.62it/s]
reward: -2.5119, last reward: -0.0358, gradient norm:  1.061:  55%|█████▍    | 343/625 [00:52<00:42,  6.63it/s]
reward: -2.3305, last reward: -0.3705, gradient norm:  1.957:  55%|█████▌    | 344/625 [00:52<00:42,  6.64it/s]
reward: -2.6068, last reward: -0.2112, gradient norm:  13.83:  55%|█████▌    | 345/625 [00:52<00:42,  6.64it/s]
reward: -2.5731, last reward: -1.8455, gradient norm:  66.75:  55%|█████▌    | 346/625 [00:52<00:42,  6.63it/s]
reward: -2.3897, last reward: -0.0376, gradient norm:  1.608:  56%|█████▌    | 347/625 [00:52<00:41,  6.62it/s]
reward: -2.2264, last reward: -0.0434, gradient norm:  2.012:  56%|█████▌    | 348/625 [00:52<00:41,  6.63it/s]
reward: -2.1300, last reward: -0.1215, gradient norm:  2.557:  56%|█████▌    | 349/625 [00:53<00:41,  6.62it/s]
reward: -2.0968, last reward: -0.0885, gradient norm:  3.389:  56%|█████▌    | 350/625 [00:53<00:41,  6.61it/s]
reward: -2.1348, last reward: -0.0073, gradient norm:  0.5052:  56%|█████▌    | 351/625 [00:53<00:41,  6.62it/s]
reward: -2.4184, last reward: -3.2817, gradient norm:  108.6:  56%|█████▋    | 352/625 [00:53<00:41,  6.61it/s]
reward: -2.3774, last reward: -1.8887, gradient norm:  54.07:  56%|█████▋    | 353/625 [00:53<00:41,  6.62it/s]
reward: -2.4779, last reward: -0.1009, gradient norm:  10.91:  57%|█████▋    | 354/625 [00:53<00:40,  6.61it/s]
reward: -2.2588, last reward: -0.0604, gradient norm:  2.599:  57%|█████▋    | 355/625 [00:54<00:40,  6.62it/s]
reward: -2.4486, last reward: -0.1176, gradient norm:  3.656:  57%|█████▋    | 356/625 [00:54<00:40,  6.62it/s]
reward: -2.2436, last reward: -0.0668, gradient norm:  2.724:  57%|█████▋    | 357/625 [00:54<00:40,  6.62it/s]
reward: -1.8849, last reward: -0.0012, gradient norm:  5.326:  57%|█████▋    | 358/625 [00:54<00:40,  6.62it/s]
reward: -2.7511, last reward: -0.8804, gradient norm:  13.6:  57%|█████▋    | 359/625 [00:54<00:40,  6.62it/s]
reward: -2.8870, last reward: -3.6728, gradient norm:  33.56:  58%|█████▊    | 360/625 [00:54<00:40,  6.62it/s]
reward: -2.8841, last reward: -2.5508, gradient norm:  30.93:  58%|█████▊    | 361/625 [00:54<00:40,  6.59it/s]
reward: -2.5242, last reward: -1.0268, gradient norm:  33.15:  58%|█████▊    | 362/625 [00:55<00:39,  6.60it/s]
reward: -2.3232, last reward: -0.0013, gradient norm:  0.6185:  58%|█████▊    | 363/625 [00:55<00:39,  6.61it/s]
reward: -2.1378, last reward: -0.0204, gradient norm:  1.337:  58%|█████▊    | 364/625 [00:55<00:39,  6.61it/s]
reward: -2.2677, last reward: -0.0355, gradient norm:  1.685:  58%|█████▊    | 365/625 [00:55<00:39,  6.61it/s]
reward: -2.4884, last reward: -0.0231, gradient norm:  1.213:  59%|█████▊    | 366/625 [00:55<00:39,  6.62it/s]
reward: -2.0770, last reward: -0.0014, gradient norm:  0.6793:  59%|█████▊    | 367/625 [00:55<00:39,  6.60it/s]
reward: -1.9834, last reward: -0.0349, gradient norm:  1.863:  59%|█████▉    | 368/625 [00:55<00:38,  6.60it/s]
reward: -2.6709, last reward: -0.1416, gradient norm:  5.462:  59%|█████▉    | 369/625 [00:56<00:38,  6.61it/s]
reward: -2.5199, last reward: -3.9790, gradient norm:  47.67:  59%|█████▉    | 370/625 [00:56<00:38,  6.62it/s]
reward: -2.9401, last reward: -3.7802, gradient norm:  32.47:  59%|█████▉    | 371/625 [00:56<00:38,  6.60it/s]
reward: -2.6723, last reward: -3.6507, gradient norm:  45.1:  60%|█████▉    | 372/625 [00:56<00:38,  6.60it/s]
reward: -2.2678, last reward: -0.6201, gradient norm:  32.94:  60%|█████▉    | 373/625 [00:56<00:38,  6.61it/s]
reward: -2.2184, last reward: -0.0075, gradient norm:  0.7385:  60%|█████▉    | 374/625 [00:56<00:37,  6.62it/s]
reward: -2.6344, last reward: -0.0576, gradient norm:  1.617:  60%|██████    | 375/625 [00:57<00:37,  6.62it/s]
reward: -1.9945, last reward: -0.0772, gradient norm:  2.567:  60%|██████    | 376/625 [00:57<00:37,  6.59it/s]
reward: -1.7576, last reward: -0.0398, gradient norm:  1.961:  60%|██████    | 377/625 [00:57<00:37,  6.61it/s]
reward: -2.3396, last reward: -0.0022, gradient norm:  1.094:  60%|██████    | 378/625 [00:57<00:37,  6.61it/s]
reward: -2.3073, last reward: -0.4018, gradient norm:  29.23:  61%|██████    | 379/625 [00:57<00:51,  4.77it/s]
reward: -2.3313, last reward: -1.1869, gradient norm:  38.62:  61%|██████    | 380/625 [00:57<00:47,  5.20it/s]
reward: -2.0481, last reward: -0.1117, gradient norm:  5.321:  61%|██████    | 381/625 [00:58<00:43,  5.56it/s]
reward: -1.6823, last reward: -0.0001, gradient norm:  1.981:  61%|██████    | 382/625 [00:58<00:41,  5.84it/s]
reward: -1.8305, last reward: -0.0210, gradient norm:  1.228:  61%|██████▏   | 383/625 [00:58<00:39,  6.06it/s]
reward: -1.4908, last reward: -0.0272, gradient norm:  1.538:  61%|██████▏   | 384/625 [00:58<00:38,  6.22it/s]
reward: -2.3267, last reward: -0.0111, gradient norm:  0.7965:  62%|██████▏   | 385/625 [00:58<00:37,  6.33it/s]
reward: -2.1796, last reward: -0.0039, gradient norm:  0.5396:  62%|██████▏   | 386/625 [00:58<00:37,  6.41it/s]
reward: -2.3757, last reward: -0.0490, gradient norm:  2.237:  62%|██████▏   | 387/625 [00:59<00:36,  6.47it/s]
reward: -2.1394, last reward: -0.4187, gradient norm:  52.11:  62%|██████▏   | 388/625 [00:59<00:36,  6.52it/s]
reward: -2.2986, last reward: -0.0038, gradient norm:  0.7954:  62%|██████▏   | 389/625 [00:59<00:36,  6.55it/s]
reward: -2.1274, last reward: -0.0063, gradient norm:  0.813:  62%|██████▏   | 390/625 [00:59<00:35,  6.57it/s]
reward: -1.8706, last reward: -0.0114, gradient norm:  3.325:  63%|██████▎   | 391/625 [00:59<00:35,  6.58it/s]
reward: -1.6922, last reward: -0.0004, gradient norm:  0.2423:  63%|██████▎   | 392/625 [00:59<00:35,  6.59it/s]
reward: -1.9115, last reward: -0.2602, gradient norm:  2.599:  63%|██████▎   | 393/625 [00:59<00:35,  6.60it/s]
reward: -2.2449, last reward: -0.0783, gradient norm:  5.199:  63%|██████▎   | 394/625 [01:00<00:34,  6.61it/s]
reward: -2.0631, last reward: -0.0057, gradient norm:  0.7444:  63%|██████▎   | 395/625 [01:00<00:34,  6.61it/s]
reward: -2.3339, last reward: -0.0167, gradient norm:  1.39:  63%|██████▎   | 396/625 [01:00<00:34,  6.62it/s]
reward: -2.4806, last reward: -0.0023, gradient norm:  2.317:  64%|██████▎   | 397/625 [01:00<00:34,  6.61it/s]
reward: -2.4171, last reward: -0.1438, gradient norm:  5.067:  64%|██████▎   | 398/625 [01:00<00:34,  6.62it/s]
reward: -2.2618, last reward: -0.5809, gradient norm:  20.39:  64%|██████▍   | 399/625 [01:00<00:34,  6.62it/s]
reward: -2.0115, last reward: -0.0054, gradient norm:  0.3364:  64%|██████▍   | 400/625 [01:01<00:34,  6.62it/s]
reward: -1.8733, last reward: -0.0184, gradient norm:  2.275:  64%|██████▍   | 401/625 [01:01<00:33,  6.61it/s]
reward: -1.9137, last reward: -0.0113, gradient norm:  1.025:  64%|██████▍   | 402/625 [01:01<00:33,  6.61it/s]
reward: -2.0386, last reward: -0.0625, gradient norm:  2.763:  64%|██████▍   | 403/625 [01:01<00:33,  6.62it/s]
reward: -2.1332, last reward: -0.0582, gradient norm:  0.7816:  65%|██████▍   | 404/625 [01:01<00:33,  6.63it/s]
reward: -1.8341, last reward: -0.0941, gradient norm:  5.854:  65%|██████▍   | 405/625 [01:01<00:33,  6.63it/s]
reward: -1.8615, last reward: -0.0968, gradient norm:  4.588:  65%|██████▍   | 406/625 [01:01<00:33,  6.62it/s]
reward: -2.0981, last reward: -0.3849, gradient norm:  6.008:  65%|██████▌   | 407/625 [01:02<00:32,  6.62it/s]
reward: -1.9395, last reward: -0.0765, gradient norm:  4.055:  65%|██████▌   | 408/625 [01:02<00:32,  6.62it/s]
reward: -2.2685, last reward: -0.2235, gradient norm:  1.688:  65%|██████▌   | 409/625 [01:02<00:32,  6.62it/s]
reward: -2.3052, last reward: -1.4249, gradient norm:  25.99:  66%|██████▌   | 410/625 [01:02<00:32,  6.61it/s]
reward: -2.6806, last reward: -1.6383, gradient norm:  30.59:  66%|██████▌   | 411/625 [01:02<00:32,  6.58it/s]
reward: -2.3721, last reward: -2.9981, gradient norm:  74.37:  66%|██████▌   | 412/625 [01:02<00:32,  6.60it/s]
reward: -2.1862, last reward: -0.0063, gradient norm:  1.822:  66%|██████▌   | 413/625 [01:02<00:32,  6.61it/s]
reward: -1.9811, last reward: -0.0171, gradient norm:  1.013:  66%|██████▌   | 414/625 [01:03<00:31,  6.62it/s]
reward: -2.0252, last reward: -0.0049, gradient norm:  0.6205:  66%|██████▋   | 415/625 [01:03<00:31,  6.62it/s]
reward: -2.1108, last reward: -0.4921, gradient norm:  23.74:  67%|██████▋   | 416/625 [01:03<00:31,  6.64it/s]
reward: -1.9142, last reward: -0.8130, gradient norm:  52.65:  67%|██████▋   | 417/625 [01:03<00:31,  6.64it/s]
reward: -2.1725, last reward: -0.0036, gradient norm:  0.3196:  67%|██████▋   | 418/625 [01:03<00:31,  6.65it/s]
reward: -1.7795, last reward: -0.0242, gradient norm:  1.799:  67%|██████▋   | 419/625 [01:03<00:31,  6.64it/s]
reward: -1.7737, last reward: -0.0138, gradient norm:  1.39:  67%|██████▋   | 420/625 [01:04<00:30,  6.64it/s]
reward: -2.1462, last reward: -0.0053, gradient norm:  0.47:  67%|██████▋   | 421/625 [01:04<00:30,  6.63it/s]
reward: -1.9226, last reward: -0.6139, gradient norm:  40.3:  68%|██████▊   | 422/625 [01:04<00:30,  6.63it/s]
reward: -1.9889, last reward: -0.0403, gradient norm:  1.112:  68%|██████▊   | 423/625 [01:04<00:30,  6.63it/s]
reward: -1.6194, last reward: -0.0032, gradient norm:  0.79:  68%|██████▊   | 424/625 [01:04<00:30,  6.62it/s]
reward: -2.3989, last reward: -0.0104, gradient norm:  1.134:  68%|██████▊   | 425/625 [01:04<00:30,  6.62it/s]
reward: -1.9960, last reward: -0.0009, gradient norm:  0.6009:  68%|██████▊   | 426/625 [01:04<00:30,  6.62it/s]
reward: -2.2697, last reward: -0.0914, gradient norm:  2.905:  68%|██████▊   | 427/625 [01:05<00:29,  6.62it/s]
reward: -2.4256, last reward: -0.1114, gradient norm:  2.102:  68%|██████▊   | 428/625 [01:05<00:29,  6.62it/s]
reward: -1.9862, last reward: -0.1932, gradient norm:  22.44:  69%|██████▊   | 429/625 [01:05<00:29,  6.62it/s]
reward: -2.0637, last reward: -0.0623, gradient norm:  3.082:  69%|██████▉   | 430/625 [01:05<00:29,  6.63it/s]
reward: -1.9906, last reward: -0.2031, gradient norm:  5.5:  69%|██████▉   | 431/625 [01:05<00:29,  6.63it/s]
reward: -1.9948, last reward: -0.0895, gradient norm:  3.456:  69%|██████▉   | 432/625 [01:05<00:29,  6.64it/s]
reward: -2.1970, last reward: -0.0256, gradient norm:  1.593:  69%|██████▉   | 433/625 [01:05<00:29,  6.62it/s]
reward: -2.4231, last reward: -0.0449, gradient norm:  3.644:  69%|██████▉   | 434/625 [01:06<00:28,  6.62it/s]
reward: -2.1039, last reward: -3.1973, gradient norm:  87.37:  70%|██████▉   | 435/625 [01:06<00:28,  6.62it/s]
reward: -2.4561, last reward: -0.1225, gradient norm:  6.119:  70%|██████▉   | 436/625 [01:06<00:28,  6.58it/s]
reward: -2.0211, last reward: -0.2125, gradient norm:  2.94:  70%|██████▉   | 437/625 [01:06<00:28,  6.57it/s]
reward: -2.3866, last reward: -0.0050, gradient norm:  0.7202:  70%|███████   | 438/625 [01:06<00:28,  6.54it/s]
reward: -1.6388, last reward: -0.0072, gradient norm:  0.8657:  70%|███████   | 439/625 [01:06<00:28,  6.54it/s]
reward: -2.1187, last reward: -0.0015, gradient norm:  0.5116:  70%|███████   | 440/625 [01:07<00:28,  6.52it/s]
reward: -2.0432, last reward: -0.0025, gradient norm:  0.7809:  71%|███████   | 441/625 [01:07<00:28,  6.52it/s]
reward: -2.1925, last reward: -0.0103, gradient norm:  2.83:  71%|███████   | 442/625 [01:07<00:27,  6.54it/s]
reward: -1.9570, last reward: -0.0002, gradient norm:  0.35:  71%|███████   | 443/625 [01:07<00:27,  6.53it/s]
reward: -2.0871, last reward: -0.0022, gradient norm:  0.5601:  71%|███████   | 444/625 [01:07<00:27,  6.51it/s]
reward: -2.0165, last reward: -0.0047, gradient norm:  0.6061:  71%|███████   | 445/625 [01:07<00:27,  6.49it/s]
reward: -2.2746, last reward: -0.0027, gradient norm:  0.7887:  71%|███████▏  | 446/625 [01:07<00:27,  6.50it/s]
reward: -2.1835, last reward: -0.0035, gradient norm:  0.855:  72%|███████▏  | 447/625 [01:08<00:27,  6.52it/s]
reward: -1.8420, last reward: -0.0103, gradient norm:  1.548:  72%|███████▏  | 448/625 [01:08<00:27,  6.51it/s]
reward: -2.2653, last reward: -0.0126, gradient norm:  0.9736:  72%|███████▏  | 449/625 [01:08<00:27,  6.52it/s]
reward: -2.0594, last reward: -0.0119, gradient norm:  0.6196:  72%|███████▏  | 450/625 [01:08<00:26,  6.50it/s]
reward: -2.4509, last reward: -0.0373, gradient norm:  11.44:  72%|███████▏  | 451/625 [01:08<00:26,  6.51it/s]
reward: -2.2528, last reward: -0.0620, gradient norm:  3.992:  72%|███████▏  | 452/625 [01:08<00:26,  6.51it/s]
reward: -1.6898, last reward: -0.3235, gradient norm:  6.687:  72%|███████▏  | 453/625 [01:09<00:26,  6.48it/s]
reward: -1.5879, last reward: -0.0905, gradient norm:  2.84:  73%|███████▎  | 454/625 [01:09<00:26,  6.49it/s]
reward: -1.8406, last reward: -0.0694, gradient norm:  2.288:  73%|███████▎  | 455/625 [01:09<00:26,  6.49it/s]
reward: -1.8259, last reward: -0.0235, gradient norm:  1.304:  73%|███████▎  | 456/625 [01:09<00:26,  6.49it/s]
reward: -1.8500, last reward: -0.0024, gradient norm:  1.416:  73%|███████▎  | 457/625 [01:09<00:25,  6.48it/s]
reward: -1.9649, last reward: -0.4054, gradient norm:  39.3:  73%|███████▎  | 458/625 [01:09<00:25,  6.49it/s]
reward: -2.2027, last reward: -0.0894, gradient norm:  4.275:  73%|███████▎  | 459/625 [01:09<00:25,  6.49it/s]
reward: -1.5966, last reward: -0.0113, gradient norm:  1.368:  74%|███████▎  | 460/625 [01:10<00:25,  6.48it/s]
reward: -1.6942, last reward: -0.0016, gradient norm:  0.4254:  74%|███████▍  | 461/625 [01:10<00:25,  6.48it/s]
reward: -1.6703, last reward: -0.0145, gradient norm:  2.142:  74%|███████▍  | 462/625 [01:10<00:25,  6.49it/s]
reward: -1.8124, last reward: -0.0218, gradient norm:  0.9196:  74%|███████▍  | 463/625 [01:10<00:24,  6.49it/s]
reward: -1.8657, last reward: -0.0188, gradient norm:  0.8986:  74%|███████▍  | 464/625 [01:10<00:24,  6.49it/s]
reward: -2.0884, last reward: -0.0084, gradient norm:  0.5624:  74%|███████▍  | 465/625 [01:10<00:24,  6.49it/s]
reward: -1.8862, last reward: -0.0006, gradient norm:  0.5384:  75%|███████▍  | 466/625 [01:11<00:24,  6.50it/s]
reward: -2.1973, last reward: -0.0022, gradient norm:  0.5837:  75%|███████▍  | 467/625 [01:11<00:24,  6.53it/s]
reward: -1.8954, last reward: -0.0101, gradient norm:  0.6751:  75%|███████▍  | 468/625 [01:11<00:24,  6.51it/s]
reward: -1.8063, last reward: -0.0122, gradient norm:  0.9635:  75%|███████▌  | 469/625 [01:11<00:23,  6.51it/s]
reward: -2.0692, last reward: -0.0027, gradient norm:  0.4216:  75%|███████▌  | 470/625 [01:11<00:23,  6.51it/s]
reward: -2.1227, last reward: -0.0586, gradient norm:  3.162e+03:  75%|███████▌  | 471/625 [01:11<00:23,  6.50it/s]
reward: -1.9690, last reward: -0.0074, gradient norm:  0.4166:  76%|███████▌  | 472/625 [01:11<00:23,  6.52it/s]
reward: -2.6324, last reward: -0.0119, gradient norm:  1.345:  76%|███████▌  | 473/625 [01:12<00:23,  6.54it/s]
reward: -2.0778, last reward: -0.0098, gradient norm:  1.166:  76%|███████▌  | 474/625 [01:12<00:23,  6.56it/s]
reward: -1.8548, last reward: -0.0017, gradient norm:  0.4408:  76%|███████▌  | 475/625 [01:12<00:22,  6.58it/s]
reward: -1.8125, last reward: -0.0003, gradient norm:  0.1515:  76%|███████▌  | 476/625 [01:12<00:22,  6.59it/s]
reward: -2.2733, last reward: -0.0044, gradient norm:  0.2836:  76%|███████▋  | 477/625 [01:12<00:22,  6.60it/s]
reward: -1.7497, last reward: -0.0149, gradient norm:  0.7681:  76%|███████▋  | 478/625 [01:12<00:22,  6.60it/s]
reward: -1.8547, last reward: -0.0105, gradient norm:  0.7212:  77%|███████▋  | 479/625 [01:13<00:22,  6.59it/s]
reward: -1.9848, last reward: -0.0019, gradient norm:  0.6498:  77%|███████▋  | 480/625 [01:13<00:21,  6.60it/s]
reward: -2.1987, last reward: -0.0011, gradient norm:  0.5473:  77%|███████▋  | 481/625 [01:13<00:21,  6.59it/s]
reward: -1.8991, last reward: -0.0033, gradient norm:  0.6091:  77%|███████▋  | 482/625 [01:13<00:21,  6.60it/s]
reward: -1.9189, last reward: -0.0032, gradient norm:  0.5771:  77%|███████▋  | 483/625 [01:13<00:21,  6.61it/s]
reward: -1.6781, last reward: -0.0004, gradient norm:  0.7542:  77%|███████▋  | 484/625 [01:13<00:21,  6.61it/s]
reward: -1.5959, last reward: -0.0064, gradient norm:  0.4295:  78%|███████▊  | 485/625 [01:13<00:21,  6.60it/s]
reward: -2.2547, last reward: -0.0103, gradient norm:  0.4641:  78%|███████▊  | 486/625 [01:14<00:21,  6.58it/s]
reward: -2.1509, last reward: -0.0636, gradient norm:  6.547:  78%|███████▊  | 487/625 [01:14<00:20,  6.60it/s]
reward: -2.0972, last reward: -0.0065, gradient norm:  0.2593:  78%|███████▊  | 488/625 [01:14<00:20,  6.61it/s]
reward: -2.1694, last reward: -0.0083, gradient norm:  0.5759:  78%|███████▊  | 489/625 [01:14<00:20,  6.60it/s]
reward: -2.0493, last reward: -0.0021, gradient norm:  0.7805:  78%|███████▊  | 490/625 [01:14<00:20,  6.61it/s]
reward: -2.0950, last reward: -0.0021, gradient norm:  0.497:  79%|███████▊  | 491/625 [01:14<00:20,  6.57it/s]
reward: -1.9717, last reward: -0.0012, gradient norm:  0.3672:  79%|███████▊  | 492/625 [01:15<00:20,  6.58it/s]
reward: -2.0207, last reward: -0.0009, gradient norm:  0.331:  79%|███████▉  | 493/625 [01:15<00:20,  6.59it/s]
reward: -1.8266, last reward: -0.0069, gradient norm:  0.5365:  79%|███████▉  | 494/625 [01:15<00:19,  6.57it/s]
reward: -2.2623, last reward: -0.0065, gradient norm:  0.5078:  79%|███████▉  | 495/625 [01:15<00:19,  6.59it/s]
reward: -2.0230, last reward: -0.0027, gradient norm:  0.4545:  79%|███████▉  | 496/625 [01:15<00:19,  6.60it/s]
reward: -1.6047, last reward: -0.0000, gradient norm:  0.09636:  80%|███████▉  | 497/625 [01:15<00:19,  6.60it/s]
reward: -1.8754, last reward: -0.0010, gradient norm:  0.2:  80%|███████▉  | 498/625 [01:15<00:19,  6.58it/s]
reward: -2.6216, last reward: -0.0031, gradient norm:  0.8269:  80%|███████▉  | 499/625 [01:16<00:19,  6.58it/s]
reward: -1.7361, last reward: -0.0023, gradient norm:  0.4082:  80%|████████  | 500/625 [01:16<00:18,  6.60it/s]
reward: -1.6642, last reward: -0.0006, gradient norm:  0.2284:  80%|████████  | 501/625 [01:16<00:18,  6.61it/s]
reward: -1.9130, last reward: -0.0008, gradient norm:  0.3031:  80%|████████  | 502/625 [01:16<00:18,  6.62it/s]
reward: -2.2944, last reward: -0.0035, gradient norm:  0.2986:  80%|████████  | 503/625 [01:16<00:18,  6.61it/s]
reward: -1.7624, last reward: -0.0056, gradient norm:  0.3858:  81%|████████  | 504/625 [01:16<00:18,  6.62it/s]
reward: -2.0890, last reward: -0.0042, gradient norm:  0.38:  81%|████████  | 505/625 [01:16<00:18,  6.63it/s]
reward: -1.7505, last reward: -0.0017, gradient norm:  0.2157:  81%|████████  | 506/625 [01:17<00:17,  6.64it/s]
reward: -1.8394, last reward: -0.0013, gradient norm:  0.3413:  81%|████████  | 507/625 [01:17<00:17,  6.64it/s]
reward: -1.9609, last reward: -0.0041, gradient norm:  0.6905:  81%|████████▏ | 508/625 [01:17<00:17,  6.65it/s]
reward: -1.8467, last reward: -0.0011, gradient norm:  0.4409:  81%|████████▏ | 509/625 [01:17<00:17,  6.65it/s]
reward: -2.0252, last reward: -0.0021, gradient norm:  0.213:  82%|████████▏ | 510/625 [01:17<00:17,  6.65it/s]
reward: -1.8128, last reward: -0.0073, gradient norm:  0.3559:  82%|████████▏ | 511/625 [01:17<00:17,  6.60it/s]
reward: -2.1479, last reward: -0.0264, gradient norm:  3.68:  82%|████████▏ | 512/625 [01:18<00:17,  6.61it/s]
reward: -2.1589, last reward: -0.0025, gradient norm:  5.566:  82%|████████▏ | 513/625 [01:18<00:16,  6.59it/s]
reward: -2.2756, last reward: -0.0046, gradient norm:  0.5266:  82%|████████▏ | 514/625 [01:18<00:16,  6.56it/s]
reward: -1.9873, last reward: -0.0112, gradient norm:  0.9314:  82%|████████▏ | 515/625 [01:18<00:16,  6.56it/s]
reward: -2.3791, last reward: -0.0721, gradient norm:  1.14:  83%|████████▎ | 516/625 [01:18<00:16,  6.53it/s]
reward: -2.4580, last reward: -0.0758, gradient norm:  0.6114:  83%|████████▎ | 517/625 [01:18<00:16,  6.53it/s]
reward: -1.9748, last reward: -0.0001, gradient norm:  0.2431:  83%|████████▎ | 518/625 [01:18<00:16,  6.52it/s]
reward: -2.1958, last reward: -0.0044, gradient norm:  0.5553:  83%|████████▎ | 519/625 [01:19<00:16,  6.51it/s]
reward: -1.8924, last reward: -0.0097, gradient norm:  17.34:  83%|████████▎ | 520/625 [01:19<00:16,  6.52it/s]
reward: -2.3737, last reward: -0.0234, gradient norm:  1.899:  83%|████████▎ | 521/625 [01:19<00:15,  6.52it/s]
reward: -1.9125, last reward: -0.0063, gradient norm:  0.4623:  84%|████████▎ | 522/625 [01:19<00:15,  6.52it/s]
reward: -2.3230, last reward: -0.0589, gradient norm:  0.3784:  84%|████████▎ | 522/625 [01:19<00:15,  6.52it/s]
reward: -2.3230, last reward: -0.0589, gradient norm:  0.3784:  84%|████████▎ | 523/625 [01:19<00:15,  6.55it/s]
reward: -1.9482, last reward: -0.0051, gradient norm:  1.105:  84%|████████▎ | 523/625 [01:19<00:15,  6.55it/s]
reward: -1.9482, last reward: -0.0051, gradient norm:  1.105:  84%|████████▍ | 524/625 [01:19<00:15,  6.54it/s]
reward: -2.1979, last reward: -0.0045, gradient norm:  0.6401:  84%|████████▍ | 524/625 [01:20<00:15,  6.54it/s]
reward: -2.1979, last reward: -0.0045, gradient norm:  0.6401:  84%|████████▍ | 525/625 [01:20<00:15,  6.53it/s]
reward: -2.1588, last reward: -0.0048, gradient norm:  0.6255:  84%|████████▍ | 525/625 [01:20<00:15,  6.53it/s]
reward: -2.1588, last reward: -0.0048, gradient norm:  0.6255:  84%|████████▍ | 526/625 [01:20<00:15,  6.53it/s]
reward: -1.6084, last reward: -0.0010, gradient norm:  0.3477:  84%|████████▍ | 526/625 [01:20<00:15,  6.53it/s]
reward: -1.6084, last reward: -0.0010, gradient norm:  0.3477:  84%|████████▍ | 527/625 [01:20<00:15,  6.52it/s]
reward: -2.1475, last reward: -0.0209, gradient norm:  0.3456:  84%|████████▍ | 527/625 [01:20<00:15,  6.52it/s]
reward: -2.1475, last reward: -0.0209, gradient norm:  0.3456:  84%|████████▍ | 528/625 [01:20<00:14,  6.52it/s]
reward: -1.7611, last reward: -0.1040, gradient norm:  18.52:  84%|████████▍ | 528/625 [01:20<00:14,  6.52it/s]
reward: -1.7611, last reward: -0.1040, gradient norm:  18.52:  85%|████████▍ | 529/625 [01:20<00:14,  6.51it/s]
reward: -2.0099, last reward: -0.0173, gradient norm:  1.643:  85%|████████▍ | 529/625 [01:20<00:14,  6.51it/s]
reward: -2.0099, last reward: -0.0173, gradient norm:  1.643:  85%|████████▍ | 530/625 [01:20<00:14,  6.51it/s]
reward: -2.8189, last reward: -1.4358, gradient norm:  46.61:  85%|████████▍ | 530/625 [01:20<00:14,  6.51it/s]
reward: -2.8189, last reward: -1.4358, gradient norm:  46.61:  85%|████████▍ | 531/625 [01:20<00:14,  6.51it/s]
reward: -2.9897, last reward: -2.4869, gradient norm:  51.23:  85%|████████▍ | 531/625 [01:21<00:14,  6.51it/s]
reward: -2.9897, last reward: -2.4869, gradient norm:  51.23:  85%|████████▌ | 532/625 [01:21<00:14,  6.52it/s]
reward: -2.1548, last reward: -0.9751, gradient norm:  72.21:  85%|████████▌ | 532/625 [01:21<00:14,  6.52it/s]
reward: -2.1548, last reward: -0.9751, gradient norm:  72.21:  85%|████████▌ | 533/625 [01:21<00:14,  6.55it/s]
reward: -1.6362, last reward: -0.0022, gradient norm:  0.7495:  85%|████████▌ | 533/625 [01:21<00:14,  6.55it/s]
reward: -1.6362, last reward: -0.0022, gradient norm:  0.7495:  85%|████████▌ | 534/625 [01:21<00:13,  6.53it/s]
reward: -2.1749, last reward: -0.0105, gradient norm:  0.9513:  85%|████████▌ | 534/625 [01:21<00:13,  6.53it/s]
reward: -2.1749, last reward: -0.0105, gradient norm:  0.9513:  86%|████████▌ | 535/625 [01:21<00:13,  6.51it/s]
reward: -1.7708, last reward: -0.0371, gradient norm:  1.432:  86%|████████▌ | 535/625 [01:21<00:13,  6.51it/s]
reward: -1.7708, last reward: -0.0371, gradient norm:  1.432:  86%|████████▌ | 536/625 [01:21<00:13,  6.51it/s]
reward: -2.2649, last reward: -0.0437, gradient norm:  2.327:  86%|████████▌ | 536/625 [01:21<00:13,  6.51it/s]
reward: -2.2649, last reward: -0.0437, gradient norm:  2.327:  86%|████████▌ | 537/625 [01:21<00:13,  6.49it/s]
reward: -2.5491, last reward: -0.0276, gradient norm:  1.246:  86%|████████▌ | 537/625 [01:22<00:13,  6.49it/s]
reward: -2.5491, last reward: -0.0276, gradient norm:  1.246:  86%|████████▌ | 538/625 [01:22<00:13,  6.49it/s]
reward: -2.6426, last reward: -0.7294, gradient norm:  1.078e+03:  86%|████████▌ | 538/625 [01:22<00:13,  6.49it/s]
reward: -2.6426, last reward: -0.7294, gradient norm:  1.078e+03:  86%|████████▌ | 539/625 [01:22<00:13,  6.49it/s]
reward: -1.9928, last reward: -0.0003, gradient norm:  1.576:  86%|████████▌ | 539/625 [01:22<00:13,  6.49it/s]
reward: -1.9928, last reward: -0.0003, gradient norm:  1.576:  86%|████████▋ | 540/625 [01:22<00:13,  6.48it/s]
reward: -1.7937, last reward: -0.0124, gradient norm:  0.9664:  86%|████████▋ | 540/625 [01:22<00:13,  6.48it/s]
reward: -1.7937, last reward: -0.0124, gradient norm:  0.9664:  87%|████████▋ | 541/625 [01:22<00:12,  6.49it/s]
reward: -2.3342, last reward: -0.0204, gradient norm:  1.81:  87%|████████▋ | 541/625 [01:22<00:12,  6.49it/s]
reward: -2.3342, last reward: -0.0204, gradient norm:  1.81:  87%|████████▋ | 542/625 [01:22<00:12,  6.48it/s]
reward: -2.2046, last reward: -0.0122, gradient norm:  1.004:  87%|████████▋ | 542/625 [01:22<00:12,  6.48it/s]
reward: -2.2046, last reward: -0.0122, gradient norm:  1.004:  87%|████████▋ | 543/625 [01:22<00:12,  6.49it/s]
reward: -2.0000, last reward: -0.0014, gradient norm:  0.5496:  87%|████████▋ | 543/625 [01:22<00:12,  6.49it/s]
reward: -2.0000, last reward: -0.0014, gradient norm:  0.5496:  87%|████████▋ | 544/625 [01:22<00:12,  6.50it/s]
reward: -2.0956, last reward: -0.0059, gradient norm:  1.425:  87%|████████▋ | 544/625 [01:23<00:12,  6.50it/s]
reward: -2.0956, last reward: -0.0059, gradient norm:  1.425:  87%|████████▋ | 545/625 [01:23<00:12,  6.50it/s]
reward: -2.9028, last reward: -0.5843, gradient norm:  21.12:  87%|████████▋ | 545/625 [01:23<00:12,  6.50it/s]
reward: -2.9028, last reward: -0.5843, gradient norm:  21.12:  87%|████████▋ | 546/625 [01:23<00:12,  6.48it/s]
reward: -2.0674, last reward: -0.0178, gradient norm:  0.797:  87%|████████▋ | 546/625 [01:23<00:12,  6.48it/s]
reward: -2.0674, last reward: -0.0178, gradient norm:  0.797:  88%|████████▊ | 547/625 [01:23<00:12,  6.50it/s]
reward: -2.2815, last reward: -0.0599, gradient norm:  1.227:  88%|████████▊ | 547/625 [01:23<00:12,  6.50it/s]
reward: -2.2815, last reward: -0.0599, gradient norm:  1.227:  88%|████████▊ | 548/625 [01:23<00:11,  6.50it/s]
reward: -3.1587, last reward: -0.9276, gradient norm:  20.56:  88%|████████▊ | 548/625 [01:23<00:11,  6.50it/s]
reward: -3.1587, last reward: -0.9276, gradient norm:  20.56:  88%|████████▊ | 549/625 [01:23<00:11,  6.50it/s]
reward: -3.8228, last reward: -2.9229, gradient norm:  308.2:  88%|████████▊ | 549/625 [01:23<00:11,  6.50it/s]
reward: -3.8228, last reward: -2.9229, gradient norm:  308.2:  88%|████████▊ | 550/625 [01:23<00:11,  6.49it/s]
reward: -1.6164, last reward: -0.0120, gradient norm:  2.259:  88%|████████▊ | 550/625 [01:24<00:11,  6.49it/s]
reward: -1.6164, last reward: -0.0120, gradient norm:  2.259:  88%|████████▊ | 551/625 [01:24<00:11,  6.49it/s]
reward: -1.6850, last reward: -0.0227, gradient norm:  0.9167:  88%|████████▊ | 551/625 [01:24<00:11,  6.49it/s]
reward: -1.6850, last reward: -0.0227, gradient norm:  0.9167:  88%|████████▊ | 552/625 [01:24<00:11,  6.50it/s]
reward: -2.3092, last reward: -0.0670, gradient norm:  0.9177:  88%|████████▊ | 552/625 [01:24<00:11,  6.50it/s]
reward: -2.3092, last reward: -0.0670, gradient norm:  0.9177:  88%|████████▊ | 553/625 [01:24<00:11,  6.49it/s]
reward: -2.1599, last reward: -0.0043, gradient norm:  1.195:  88%|████████▊ | 553/625 [01:24<00:11,  6.49it/s]
reward: -2.1599, last reward: -0.0043, gradient norm:  1.195:  89%|████████▊ | 554/625 [01:24<00:10,  6.50it/s]
reward: -2.4672, last reward: -0.0057, gradient norm:  0.6367:  89%|████████▊ | 554/625 [01:24<00:10,  6.50it/s]
reward: -2.4672, last reward: -0.0057, gradient norm:  0.6367:  89%|████████▉ | 555/625 [01:24<00:10,  6.49it/s]
reward: -2.3657, last reward: -0.1970, gradient norm:  4.202:  89%|████████▉ | 555/625 [01:24<00:10,  6.49it/s]
reward: -2.3657, last reward: -0.1970, gradient norm:  4.202:  89%|████████▉ | 556/625 [01:24<00:10,  6.49it/s]
reward: -2.6694, last reward: -0.1215, gradient norm:  1.324:  89%|████████▉ | 556/625 [01:24<00:10,  6.49it/s]
reward: -2.6694, last reward: -0.1215, gradient norm:  1.324:  89%|████████▉ | 557/625 [01:24<00:10,  6.50it/s]
reward: -2.2622, last reward: -0.0372, gradient norm:  0.4841:  89%|████████▉ | 557/625 [01:25<00:10,  6.50it/s]
reward: -2.2622, last reward: -0.0372, gradient norm:  0.4841:  89%|████████▉ | 558/625 [01:25<00:10,  6.50it/s]
reward: -2.2707, last reward: -0.0058, gradient norm:  5.757:  89%|████████▉ | 558/625 [01:25<00:10,  6.50it/s]
reward: -2.2707, last reward: -0.0058, gradient norm:  5.757:  89%|████████▉ | 559/625 [01:25<00:10,  6.49it/s]
reward: -2.2267, last reward: -0.0014, gradient norm:  0.5415:  89%|████████▉ | 559/625 [01:25<00:10,  6.49it/s]
reward: -2.2267, last reward: -0.0014, gradient norm:  0.5415:  90%|████████▉ | 560/625 [01:25<00:10,  6.49it/s]
reward: -2.4556, last reward: -0.0163, gradient norm:  1.146:  90%|████████▉ | 560/625 [01:25<00:10,  6.49it/s]
reward: -2.4556, last reward: -0.0163, gradient norm:  1.146:  90%|████████▉ | 561/625 [01:25<00:09,  6.47it/s]
reward: -2.1839, last reward: -0.0809, gradient norm:  0.6262:  90%|████████▉ | 561/625 [01:25<00:09,  6.47it/s]
reward: -2.1839, last reward: -0.0809, gradient norm:  0.6262:  90%|████████▉ | 562/625 [01:25<00:09,  6.49it/s]
reward: -2.0278, last reward: -0.0018, gradient norm:  1.327:  90%|████████▉ | 562/625 [01:25<00:09,  6.49it/s]
reward: -2.0278, last reward: -0.0018, gradient norm:  1.327:  90%|█████████ | 563/625 [01:25<00:09,  6.50it/s]
reward: -2.1112, last reward: -0.0011, gradient norm:  0.354:  90%|█████████ | 563/625 [01:26<00:09,  6.50it/s]
reward: -2.1112, last reward: -0.0011, gradient norm:  0.354:  90%|█████████ | 564/625 [01:26<00:09,  6.52it/s]
reward: -2.6155, last reward: -0.0004, gradient norm:  2.008:  90%|█████████ | 564/625 [01:26<00:09,  6.52it/s]
reward: -2.6155, last reward: -0.0004, gradient norm:  2.008:  90%|█████████ | 565/625 [01:26<00:09,  6.54it/s]
reward: -3.1427, last reward: -0.3582, gradient norm:  7.624:  90%|█████████ | 565/625 [01:26<00:09,  6.54it/s]
reward: -3.1427, last reward: -0.3582, gradient norm:  7.624:  91%|█████████ | 566/625 [01:26<00:08,  6.57it/s]
reward: -2.7870, last reward: -0.9490, gradient norm:  18.26:  91%|█████████ | 566/625 [01:26<00:08,  6.57it/s]
reward: -2.7870, last reward: -0.9490, gradient norm:  18.26:  91%|█████████ | 567/625 [01:26<00:08,  6.59it/s]
reward: -3.0439, last reward: -0.8796, gradient norm:  29.89:  91%|█████████ | 567/625 [01:26<00:08,  6.59it/s]
reward: -3.0439, last reward: -0.8796, gradient norm:  29.89:  91%|█████████ | 568/625 [01:26<00:08,  6.59it/s]
reward: -2.8026, last reward: -0.2720, gradient norm:  8.612:  91%|█████████ | 568/625 [01:26<00:08,  6.59it/s]
reward: -2.8026, last reward: -0.2720, gradient norm:  8.612:  91%|█████████ | 569/625 [01:26<00:08,  6.60it/s]
reward: -2.3147, last reward: -0.8486, gradient norm:  41.13:  91%|█████████ | 569/625 [01:27<00:08,  6.60it/s]
reward: -2.3147, last reward: -0.8486, gradient norm:  41.13:  91%|█████████ | 570/625 [01:27<00:11,  4.78it/s]
reward: -1.7917, last reward: -0.0129, gradient norm:  2.365:  91%|█████████ | 570/625 [01:27<00:11,  4.78it/s]
reward: -1.7917, last reward: -0.0129, gradient norm:  2.365:  91%|█████████▏| 571/625 [01:27<00:10,  5.22it/s]
reward: -1.9553, last reward: -0.0020, gradient norm:  0.6871:  91%|█████████▏| 571/625 [01:27<00:10,  5.22it/s]
reward: -1.9553, last reward: -0.0020, gradient norm:  0.6871:  92%|█████████▏| 572/625 [01:27<00:09,  5.57it/s]
reward: -2.3132, last reward: -0.0159, gradient norm:  0.8646:  92%|█████████▏| 572/625 [01:27<00:09,  5.57it/s]
reward: -2.3132, last reward: -0.0159, gradient norm:  0.8646:  92%|█████████▏| 573/625 [01:27<00:08,  5.85it/s]
reward: -1.5320, last reward: -0.0269, gradient norm:  1.02:  92%|█████████▏| 573/625 [01:27<00:08,  5.85it/s]
reward: -1.5320, last reward: -0.0269, gradient norm:  1.02:  92%|█████████▏| 574/625 [01:27<00:08,  6.05it/s]
reward: -2.2955, last reward: -0.0245, gradient norm:  1.267:  92%|█████████▏| 574/625 [01:27<00:08,  6.05it/s]
reward: -2.2955, last reward: -0.0245, gradient norm:  1.267:  92%|█████████▏| 575/625 [01:27<00:08,  6.21it/s]
reward: -2.3347, last reward: -0.0179, gradient norm:  1.528:  92%|█████████▏| 575/625 [01:28<00:08,  6.21it/s]
reward: -2.3347, last reward: -0.0179, gradient norm:  1.528:  92%|█████████▏| 576/625 [01:28<00:07,  6.33it/s]
reward: -1.9718, last reward: -0.1629, gradient norm:  8.804:  92%|█████████▏| 576/625 [01:28<00:07,  6.33it/s]
reward: -1.9718, last reward: -0.1629, gradient norm:  8.804:  92%|█████████▏| 577/625 [01:28<00:07,  6.42it/s]
reward: -2.4164, last reward: -0.0070, gradient norm:  0.4335:  92%|█████████▏| 577/625 [01:28<00:07,  6.42it/s]
reward: -2.4164, last reward: -0.0070, gradient norm:  0.4335:  92%|█████████▏| 578/625 [01:28<00:07,  6.47it/s]
reward: -2.2993, last reward: -0.0011, gradient norm:  1.371:  92%|█████████▏| 578/625 [01:28<00:07,  6.47it/s]
reward: -2.2993, last reward: -0.0011, gradient norm:  1.371:  93%|█████████▎| 579/625 [01:28<00:07,  6.52it/s]
reward: -3.3049, last reward: -0.9063, gradient norm:  34.23:  93%|█████████▎| 579/625 [01:28<00:07,  6.52it/s]
reward: -3.3049, last reward: -0.9063, gradient norm:  34.23:  93%|█████████▎| 580/625 [01:28<00:06,  6.56it/s]
reward: -2.8785, last reward: -0.3295, gradient norm:  10.91:  93%|█████████▎| 580/625 [01:28<00:06,  6.56it/s]
reward: -2.8785, last reward: -0.3295, gradient norm:  10.91:  93%|█████████▎| 581/625 [01:28<00:06,  6.58it/s]
reward: -2.5184, last reward: -0.0546, gradient norm:  21.09:  93%|█████████▎| 581/625 [01:28<00:06,  6.58it/s]
reward: -2.5184, last reward: -0.0546, gradient norm:  21.09:  93%|█████████▎| 582/625 [01:28<00:06,  6.59it/s]
reward: -2.4039, last reward: -0.4589, gradient norm:  10.86:  93%|█████████▎| 582/625 [01:29<00:06,  6.59it/s]
reward: -2.4039, last reward: -0.4589, gradient norm:  10.86:  93%|█████████▎| 583/625 [01:29<00:06,  6.61it/s]
reward: -2.4697, last reward: -0.2476, gradient norm:  4.689:  93%|█████████▎| 583/625 [01:29<00:06,  6.61it/s]
reward: -2.4697, last reward: -0.2476, gradient norm:  4.689:  93%|█████████▎| 584/625 [01:29<00:06,  6.60it/s]
reward: -2.0018, last reward: -0.2397, gradient norm:  8.393:  93%|█████████▎| 584/625 [01:29<00:06,  6.60it/s]
reward: -2.0018, last reward: -0.2397, gradient norm:  8.393:  94%|█████████▎| 585/625 [01:29<00:06,  6.61it/s]
reward: -2.4953, last reward: -0.1775, gradient norm:  24.17:  94%|█████████▎| 585/625 [01:29<00:06,  6.61it/s]
reward: -2.4953, last reward: -0.1775, gradient norm:  24.17:  94%|█████████▍| 586/625 [01:29<00:05,  6.61it/s]
reward: -2.2258, last reward: -0.0110, gradient norm:  0.7671:  94%|█████████▍| 586/625 [01:29<00:05,  6.61it/s]
reward: -2.2258, last reward: -0.0110, gradient norm:  0.7671:  94%|█████████▍| 587/625 [01:29<00:05,  6.62it/s]
reward: -2.3981, last reward: -0.0011, gradient norm:  1.617:  94%|█████████▍| 587/625 [01:29<00:05,  6.62it/s]
reward: -2.3981, last reward: -0.0011, gradient norm:  1.617:  94%|█████████▍| 588/625 [01:29<00:05,  6.63it/s]
reward: -1.8590, last reward: -0.0007, gradient norm:  1.131:  94%|█████████▍| 588/625 [01:29<00:05,  6.63it/s]
reward: -1.8590, last reward: -0.0007, gradient norm:  1.131:  94%|█████████▍| 589/625 [01:29<00:05,  6.64it/s]
reward: -1.9820, last reward: -0.4221, gradient norm:  49.4:  94%|█████████▍| 589/625 [01:30<00:05,  6.64it/s]
reward: -1.9820, last reward: -0.4221, gradient norm:  49.4:  94%|█████████▍| 590/625 [01:30<00:05,  6.64it/s]
reward: -2.1293, last reward: -0.0116, gradient norm:  0.868:  94%|█████████▍| 590/625 [01:30<00:05,  6.64it/s]
reward: -2.1293, last reward: -0.0116, gradient norm:  0.868:  95%|█████████▍| 591/625 [01:30<00:05,  6.63it/s]
reward: -2.1675, last reward: -0.0173, gradient norm:  0.5931:  95%|█████████▍| 591/625 [01:30<00:05,  6.63it/s]
reward: -2.1675, last reward: -0.0173, gradient norm:  0.5931:  95%|█████████▍| 592/625 [01:30<00:04,  6.64it/s]
reward: -2.2910, last reward: -0.0207, gradient norm:  0.5219:  95%|█████████▍| 592/625 [01:30<00:04,  6.64it/s]
reward: -2.2910, last reward: -0.0207, gradient norm:  0.5219:  95%|█████████▍| 593/625 [01:30<00:04,  6.63it/s]
reward: -2.2124, last reward: -0.1730, gradient norm:  5.737:  95%|█████████▍| 593/625 [01:30<00:04,  6.63it/s]
reward: -2.2124, last reward: -0.1730, gradient norm:  5.737:  95%|█████████▌| 594/625 [01:30<00:04,  6.63it/s]
reward: -2.2914, last reward: -0.0206, gradient norm:  0.485:  95%|█████████▌| 594/625 [01:30<00:04,  6.63it/s]
reward: -2.2914, last reward: -0.0206, gradient norm:  0.485:  95%|█████████▌| 595/625 [01:30<00:04,  6.63it/s]
reward: -2.0890, last reward: -0.0172, gradient norm:  0.3982:  95%|█████████▌| 595/625 [01:31<00:04,  6.63it/s]
reward: -2.0890, last reward: -0.0172, gradient norm:  0.3982:  95%|█████████▌| 596/625 [01:31<00:04,  6.63it/s]
reward: -2.0945, last reward: -0.0121, gradient norm:  0.4789:  95%|█████████▌| 596/625 [01:31<00:04,  6.63it/s]
reward: -2.0945, last reward: -0.0121, gradient norm:  0.4789:  96%|█████████▌| 597/625 [01:31<00:04,  6.63it/s]
reward: -2.3805, last reward: -0.0069, gradient norm:  0.4074:  96%|█████████▌| 597/625 [01:31<00:04,  6.63it/s]
reward: -2.3805, last reward: -0.0069, gradient norm:  0.4074:  96%|█████████▌| 598/625 [01:31<00:04,  6.62it/s]
reward: -2.3310, last reward: -0.0031, gradient norm:  0.5065:  96%|█████████▌| 598/625 [01:31<00:04,  6.62it/s]
reward: -2.3310, last reward: -0.0031, gradient norm:  0.5065:  96%|█████████▌| 599/625 [01:31<00:03,  6.62it/s]
reward: -2.6028, last reward: -0.0006, gradient norm:  0.6316:  96%|█████████▌| 599/625 [01:31<00:03,  6.62it/s]
reward: -2.6028, last reward: -0.0006, gradient norm:  0.6316:  96%|█████████▌| 600/625 [01:31<00:03,  6.62it/s]
reward: -2.6724, last reward: -0.0001, gradient norm:  0.6523:  96%|█████████▌| 600/625 [01:31<00:03,  6.62it/s]
reward: -2.6724, last reward: -0.0001, gradient norm:  0.6523:  96%|█████████▌| 601/625 [01:31<00:03,  6.63it/s]
reward: -2.2481, last reward: -0.0136, gradient norm:  0.4298:  96%|█████████▌| 601/625 [01:31<00:03,  6.63it/s]
reward: -2.2481, last reward: -0.0136, gradient norm:  0.4298:  96%|█████████▋| 602/625 [01:31<00:03,  6.63it/s]
reward: -2.3524, last reward: -0.0043, gradient norm:  0.2629:  96%|█████████▋| 602/625 [01:32<00:03,  6.63it/s]
reward: -2.3524, last reward: -0.0043, gradient norm:  0.2629:  96%|█████████▋| 603/625 [01:32<00:03,  6.62it/s]
reward: -2.2635, last reward: -0.0069, gradient norm:  0.7839:  96%|█████████▋| 603/625 [01:32<00:03,  6.62it/s]
reward: -2.2635, last reward: -0.0069, gradient norm:  0.7839:  97%|█████████▋| 604/625 [01:32<00:03,  6.63it/s]
reward: -2.6041, last reward: -0.8027, gradient norm:  11.7:  97%|█████████▋| 604/625 [01:32<00:03,  6.63it/s]
reward: -2.6041, last reward: -0.8027, gradient norm:  11.7:  97%|█████████▋| 605/625 [01:32<00:03,  6.63it/s]
reward: -4.4170, last reward: -3.4675, gradient norm:  60.04:  97%|█████████▋| 605/625 [01:32<00:03,  6.63it/s]
reward: -4.4170, last reward: -3.4675, gradient norm:  60.04:  97%|█████████▋| 606/625 [01:32<00:02,  6.63it/s]
reward: -4.3153, last reward: -2.9316, gradient norm:  53.11:  97%|█████████▋| 606/625 [01:32<00:02,  6.63it/s]
reward: -4.3153, last reward: -2.9316, gradient norm:  53.11:  97%|█████████▋| 607/625 [01:32<00:02,  6.63it/s]
reward: -3.0649, last reward: -0.9722, gradient norm:  30.84:  97%|█████████▋| 607/625 [01:32<00:02,  6.63it/s]
reward: -3.0649, last reward: -0.9722, gradient norm:  30.84:  97%|█████████▋| 608/625 [01:32<00:02,  6.63it/s]
reward: -2.7989, last reward: -0.0329, gradient norm:  1.261:  97%|█████████▋| 608/625 [01:33<00:02,  6.63it/s]
reward: -2.7989, last reward: -0.0329, gradient norm:  1.261:  97%|█████████▋| 609/625 [01:33<00:02,  6.64it/s]
reward: -2.1976, last reward: -0.6852, gradient norm:  20.33:  97%|█████████▋| 609/625 [01:33<00:02,  6.64it/s]
reward: -2.1976, last reward: -0.6852, gradient norm:  20.33:  98%|█████████▊| 610/625 [01:33<00:02,  6.64it/s]
reward: -2.4793, last reward: -0.1255, gradient norm:  14.69:  98%|█████████▊| 610/625 [01:33<00:02,  6.64it/s]
reward: -2.4793, last reward: -0.1255, gradient norm:  14.69:  98%|█████████▊| 611/625 [01:33<00:02,  6.64it/s]
reward: -2.4581, last reward: -0.0394, gradient norm:  2.429:  98%|█████████▊| 611/625 [01:33<00:02,  6.64it/s]
reward: -2.4581, last reward: -0.0394, gradient norm:  2.429:  98%|█████████▊| 612/625 [01:33<00:01,  6.64it/s]
reward: -2.2047, last reward: -0.0326, gradient norm:  1.147:  98%|█████████▊| 612/625 [01:33<00:01,  6.64it/s]
reward: -2.2047, last reward: -0.0326, gradient norm:  1.147:  98%|█████████▊| 613/625 [01:33<00:01,  6.63it/s]
reward: -1.8967, last reward: -0.0129, gradient norm:  0.8619:  98%|█████████▊| 613/625 [01:33<00:01,  6.63it/s]
reward: -1.8967, last reward: -0.0129, gradient norm:  0.8619:  98%|█████████▊| 614/625 [01:33<00:01,  6.63it/s]
reward: -2.5906, last reward: -0.0015, gradient norm:  0.6491:  98%|█████████▊| 614/625 [01:33<00:01,  6.63it/s]
reward: -2.5906, last reward: -0.0015, gradient norm:  0.6491:  98%|█████████▊| 615/625 [01:33<00:01,  6.63it/s]
reward: -1.6634, last reward: -0.0007, gradient norm:  0.4394:  98%|█████████▊| 615/625 [01:34<00:01,  6.63it/s]
reward: -1.6634, last reward: -0.0007, gradient norm:  0.4394:  99%|█████████▊| 616/625 [01:34<00:01,  6.64it/s]
reward: -2.0624, last reward: -0.0061, gradient norm:  0.5676:  99%|█████████▊| 616/625 [01:34<00:01,  6.64it/s]
reward: -2.0624, last reward: -0.0061, gradient norm:  0.5676:  99%|█████████▊| 617/625 [01:34<00:01,  6.64it/s]
reward: -2.3259, last reward: -0.0131, gradient norm:  0.7733:  99%|█████████▊| 617/625 [01:34<00:01,  6.64it/s]
reward: -2.3259, last reward: -0.0131, gradient norm:  0.7733:  99%|█████████▉| 618/625 [01:34<00:01,  6.65it/s]
reward: -1.7515, last reward: -0.0189, gradient norm:  0.5575:  99%|█████████▉| 618/625 [01:34<00:01,  6.65it/s]
reward: -1.7515, last reward: -0.0189, gradient norm:  0.5575:  99%|█████████▉| 619/625 [01:34<00:00,  6.64it/s]
reward: -1.9313, last reward: -0.0207, gradient norm:  0.6286:  99%|█████████▉| 619/625 [01:34<00:00,  6.64it/s]
reward: -1.9313, last reward: -0.0207, gradient norm:  0.6286:  99%|█████████▉| 620/625 [01:34<00:00,  6.64it/s]
reward: -2.4325, last reward: -0.0171, gradient norm:  0.7832:  99%|█████████▉| 620/625 [01:34<00:00,  6.64it/s]
reward: -2.4325, last reward: -0.0171, gradient norm:  0.7832:  99%|█████████▉| 621/625 [01:34<00:00,  6.63it/s]
reward: -2.1134, last reward: -0.0144, gradient norm:  1.96:  99%|█████████▉| 621/625 [01:34<00:00,  6.63it/s]
reward: -2.1134, last reward: -0.0144, gradient norm:  1.96: 100%|█████████▉| 622/625 [01:34<00:00,  6.63it/s]
reward: -2.4572, last reward: -0.0500, gradient norm:  0.5838: 100%|█████████▉| 622/625 [01:35<00:00,  6.63it/s]
reward: -2.4572, last reward: -0.0500, gradient norm:  0.5838: 100%|█████████▉| 623/625 [01:35<00:00,  6.63it/s]
reward: -2.3818, last reward: -0.0019, gradient norm:  0.8623: 100%|█████████▉| 623/625 [01:35<00:00,  6.63it/s]
reward: -2.3818, last reward: -0.0019, gradient norm:  0.8623: 100%|█████████▉| 624/625 [01:35<00:00,  6.62it/s]
reward: -2.1253, last reward: -0.0001, gradient norm:  0.6622: 100%|█████████▉| 624/625 [01:35<00:00,  6.62it/s]
reward: -2.1253, last reward: -0.0001, gradient norm:  0.6622: 100%|██████████| 625/625 [01:35<00:00,  6.63it/s]
reward: -2.1253, last reward: -0.0001, gradient norm:  0.6622: 100%|██████████| 625/625 [01:35<00:00,  6.55it/s]

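Conclusion

In this tutorial, we have learned how to code a stateless environment from scratch. The topics we covered include:

  • The four essential components that need attention when coding an environment (step, reset, seeding, and building specs). We saw how these methods and classes interact with the TensorDict class;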

  • How to test that an environment is properly coded using check_env_specs();

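  • How to append transforms in the context of stateless environments and how to write custom transforms;

  • How to train a policy on a fully differentiable simulator.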
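
As a compact recap of the stateless pattern, here is a minimal plain-Python sketch. It is not the TorchRL implementation built in this tutorial. The velocity update follows the equation of motion given earlier; the position update, the reward -(theta^2 + 0.1 * theta_dot^2 + 0.001 * u^2), the default constants (g = 10, m = 1, L = 1, dt = 0.05), the torque and speed limits, and the names stateless_step, angle_normalize and DEFAULT_PARAMS are assumptions borrowed from the Pendulum-v1 convention for illustration only.

```python
import math

# Hypothetical defaults, assumed from the Pendulum-v1 convention.
DEFAULT_PARAMS = {
    "g": 10.0,
    "m": 1.0,
    "length": 1.0,
    "dt": 0.05,
    "max_torque": 2.0,
    "max_speed": 8.0,
}


def angle_normalize(theta):
    """Wrap an angle to the interval [-pi, pi)."""
    return ((theta + math.pi) % (2 * math.pi)) - math.pi


def stateless_step(state, action, params=DEFAULT_PARAMS):
    """Return (next_state, reward) without storing anything internally.

    The caller supplies the current state on every call, which is what
    makes the simulator stateless.
    """
    th, thdot = state["th"], state["thdot"]
    g, m, length, dt = params["g"], params["m"], params["length"], params["dt"]

    # Clamp the torque to the allowed range.
    u = max(-params["max_torque"], min(params["max_torque"], action))

    # Assumed reward: penalize angle, angular velocity and control effort.
    reward = -(angle_normalize(th) ** 2 + 0.1 * thdot**2 + 0.001 * u**2)

    # Velocity update from the equation of motion, then clamp the speed.
    new_thdot = thdot + (
        3 * g / (2 * length) * math.sin(th) + 3.0 / (m * length**2) * u
    ) * dt
    new_thdot = max(-params["max_speed"], min(params["max_speed"], new_thdot))

    # Assumed position update: integrate the new velocity over one time step.
    new_th = th + new_thdot * dt

    return {"th": new_th, "thdot": new_thdot}, reward


# Usage: the same input always yields the same output.
state = {"th": 0.1, "thdot": 0.0}
next_state, reward = stateless_step(state, action=0.0)
print(next_state, reward)
```

Because the function keeps no internal state, it can be called with any state at any time, which is the property that makes resetting at arbitrary points and batched execution straightforward in the tensor-based version.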

Total running time of the script: (2 minutes 51.454 seconds)

Estimated memory usage: 320 MB

