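Pendulum: Writing your environment and transforms with TorchRL
Author: Vincent Moens
Creating an environment (a simulator or an interface to a physical control system) is an integral part of reinforcement learning and control engineering.
TorchRL provides a set of tools to do this in multiple contexts. This tutorial demonstrates how to use PyTorch and TorchRL to code a pendulum simulator from the ground up. It is freely inspired by the Pendulum-v1 implementation from the OpenAI-Gym/Farama-Gymnasium control library.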
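Key learnings:
- How to design an environment in TorchRL:
  - writing specs (input, observation and reward);
  - implementing behavior: seeding, reset and step.
- Transforming your environment inputs and outputs, and writing your own transforms;
- How to use TensorDict to carry arbitrary data structures through the codebase.
In the process, we will touch on three crucial components of TorchRL: environments, transforms, and models (policy and value function).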
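To give a sense of what can be achieved with TorchRL's environments, we will be designing a stateless environment. While stateful environments keep track of the latest physical state encountered and rely on it to simulate the state-to-state transition, stateless environments expect the current state to be provided to them at each step, along with the action undertaken. TorchRL supports both types of environments, but stateless environments are more generic and hence cover a broader range of features of the environment API in TorchRL.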
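The difference between the two designs can be sketched with a toy counter in plain Python. This is only an illustration: the names `StatefulCounter` and `stateless_step` are hypothetical and are not part of TorchRL.

```python
# Hypothetical toy example: a counter "environment" written both ways.
# None of these names come from TorchRL.


class StatefulCounter:
    """Keeps the latest state internally; step() only needs the action."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state = self.state + action
        return self.state


def stateless_step(state, action):
    """Holds no state; the caller passes the current state at every call."""
    return state + action


# Stateful: the object remembers where it is.
stateful = StatefulCounter()
stateful.step(1)
stateful_result = stateful.step(2)

# Stateless: the caller carries the state from one call to the next.
state = 0
state = stateless_step(state, 1)
stateless_result = stateless_step(state, 2)

# Because the state is an explicit input, any state can be replayed or
# modified from the outside without touching an environment object.
replayed = stateless_step(10, 2)
```

Both versions reach the same result for the same actions; the stateless one simply moves the bookkeeping to the caller, which is what the rest of this tutorial does with a tensordict.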
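Modeling stateless environments gives users full control over the inputs and outputs of the simulator: one can reset an experiment at any stage or actively modify the dynamics from the outside. However, it assumes that we have some control over the task, which may not always be the case: solving a problem where we cannot control the current state is more challenging but has a much wider set of applications.
Another advantage of stateless environments is that they can enable batched execution of transition simulations. If the backend and the implementation allow it, an algebraic operation can be executed seamlessly on scalars, vectors, or tensors. This tutorial gives such examples.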
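As a minimal sketch of that last point, the function below is written once and then called on a Python scalar, a vector, and a two-dimensional tensor. The helper `euler_step` and its default `dt` are assumptions made for illustration and do not appear in the tutorial code.

```python
import torch


def euler_step(position, velocity, dt=0.05):
    """Advance a position by one time step; hypothetical helper for illustration."""
    return position + velocity * dt


# Plain Python scalars.
scalar_out = euler_step(1.0, 2.0)

# A vector of 10 independent simulations.
vector_out = euler_step(torch.ones(10), torch.full((10,), 2.0))

# A 4 x 10 grid of simulations; the 10 velocities are broadcast over the rows.
tensor_out = euler_step(torch.ones(4, 10), torch.full((10,), 2.0))
```

Nothing in the function refers to a shape, so the batch dimensions are decided entirely by the inputs.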
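This tutorial is structured as follows:
- After having coded our simulator, we will demonstrate how it can be used during training with transforms.
- We will explore new avenues that follow from TorchRL's API, including the possibility of transforming inputs, the vectorized execution of the simulation, and the possibility of backpropagating through the simulation graph.
- Finally, we will train a simple policy to solve the system we implemented.
A built-in version of this environment can be found in torchrl.envs.PendulumEnv.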
from collections import defaultdict
from typing import Optional
import numpy as np
import torch
import tqdm
from tensordict import TensorDict, TensorDictBase
from tensordict.nn import TensorDictModule
from torch import nn
from torchrl.data import Bounded, Composite, Unbounded
from torchrl.envs import (
CatTensors,
EnvBase,
Transform,
TransformedEnv,
UnsqueezeTransform,
)
from torchrl.envs.transforms.transforms import _apply_to_composite
from torchrl.envs.utils import check_env_specs, step_mdp
DEFAULT_X = np.pi
DEFAULT_Y = 1.0
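There are four things you must take care of when designing a new environment class:
- EnvBase._reset(), which codes for the resetting of the simulator at a (potentially random) initial state;
- EnvBase._step(), which codes for the state transition dynamics;
- EnvBase._set_seed(), which implements the seeding mechanism;
- the environment specs.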
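Let us first describe the problem at hand: we would like to model a simple pendulum over which we can control the torque applied at its fixed point. Our goal is to place the pendulum in the upward position (angular position 0 by convention) and to keep it standing still there. To design our dynamic system, we need to define two equations: the motion equation following an action (the torque applied) and the reward equation that constitutes our objective function.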
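For the motion equation, we will update the angular velocity following:

$$\dot{\theta}_{t+1} = \dot{\theta}_t + \left(\frac{3g}{2L}\sin(\theta_t) + \frac{3}{mL^2}\,u\right) dt$$

where $\dot{\theta}$ is the angular velocity in rad/sec, $g$ is the gravitational force, $L$ is the pendulum length, $m$ is its mass, $\theta$ is its angular position and $u$ is the torque. The angular position is then updated according to

$$\theta_{t+1} = \theta_t + \dot{\theta}_{t+1}\, dt$$

We define our reward as

$$r = -\left(\theta^2 + 0.1\,\dot{\theta}^2 + 0.001\,u^2\right)$$

which is maximized when the angle is close to 0 (pendulum in the upward position), the angular velocity is close to 0 (no motion) and the torque is 0 too.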
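As a worked example of the motion and reward equations described above, here is a scalar sketch in plain Python. It uses the same coefficients as the `_step` code later in this tutorial, but it is a simplification: it assumes the angle already lies in [-pi, pi] and it leaves out the clamping of torque and speed. The function name `pendulum_step` and the default parameter values are chosen for illustration.

```python
import math


def pendulum_step(th, thdot, u, g=10.0, m=1.0, l=1.0, dt=0.05):
    """One update of the velocity, then the position, plus the reward.

    Simplified sketch: no clamping and no angle normalization.
    """
    new_thdot = thdot + (3 * g / (2 * l) * math.sin(th) + 3.0 / (m * l**2) * u) * dt
    new_th = th + new_thdot * dt
    reward = -(th**2 + 0.1 * thdot**2 + 0.001 * u**2)
    return new_th, new_thdot, reward


# Upright and at rest with no torque: nothing moves and the reward is maximal.
upright = pendulum_step(0.0, 0.0, 0.0)

# Horizontal and at rest with no torque: gravity accelerates the pendulum
# away from the upright position and the reward is negative.
horizontal = pendulum_step(math.pi / 2, 0.0, 0.0)
```

With th = pi/2 the velocity increment is 3g/(2L) * dt = 15 * 0.05 = 0.75 rad/sec, and the reward is -(pi/2)^2, roughly -2.47.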
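Coding the effect of an action: _step()
The step method is the first thing to consider, as it encodes the simulation that is of interest to us. In TorchRL, the EnvBase class has an EnvBase.step() method that receives a tensordict.TensorDict instance with an "action" entry indicating what action is to be taken.
To facilitate reading from and writing to that tensordict, and to make sure that the keys are consistent with what the library expects, the simulation part has been delegated to a private abstract method _step(), which reads input data from a tensordict and writes a new tensordict with the output data.
The _step() method should do the following:
1. Read the input keys (such as "action") and execute the simulation based on these;
2. Retrieve observations, done state and reward;
3. Write the set of observation values along with the reward and done state at the corresponding entries in a new TensorDict.
Next, the step() method will merge the output of _step() into the input tensordict to enforce input/output consistency.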
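The division of labor between the private and the public method can be sketched with plain dictionaries. This is only an analogy: `private_step` and `public_step` are hypothetical stand-ins, not TorchRL functions, and the real library performs additional checks.

```python
def private_step(data):
    """Hypothetical stand-in for ``_step``: returns only the new entries."""
    new_obs = data["observation"] + data["action"]
    return {"observation": new_obs, "reward": -abs(new_obs), "done": False}


def public_step(data):
    """Hypothetical stand-in for ``step``: nests the output under "next"."""
    result = dict(data)  # the root entries are kept unchanged
    result["next"] = private_step(data)
    return result


root = {"observation": 1.0, "action": 0.5, "done": False}
out = public_step(root)
```

After the call, `out` still holds the original observation and action at its root, and everything produced by the simulation sits under the "next" key.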
Typically, for stateful environments, this will look like this:
>>> policy(env.reset())
>>> print(tensordict)
TensorDict(
fields={
action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=cpu,
is_shared=False)
>>> env.step(tensordict)
>>> print(tensordict)
TensorDict(
fields={
action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=cpu,
is_shared=False),
observation: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=cpu,
is_shared=False)
Notice that the root tensordict has not changed; the only modification is the appearance of a new "next" entry that contains the new information.
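In the Pendulum example, our _step() method will read the relevant entries from the input tensordict and compute the position and velocity of the pendulum after the force encoded by the "action" key has been applied to it. We compute the new angular position of the pendulum "new_th" as the previous position "th" plus the new velocity "new_thdot" over a time interval dt.
Since our goal is to turn the pendulum up and keep it still in that position, our cost (negative reward) function is lower for positions close to the target and for low speeds. Indeed, we want to discourage positions that are far from being "upward" and/or speeds that are far from 0.
In our example, EnvBase._step() is encoded as a static method since our environment is stateless. In stateful settings, the self argument is needed as the state needs to be read from the environment.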
def _step(tensordict):
    th, thdot = tensordict["th"], tensordict["thdot"]  # th := theta

    # physical parameters travel with the data since the environment is stateless
    g_force = tensordict["params", "g"]
    mass = tensordict["params", "m"]
    length = tensordict["params", "l"]
    dt = tensordict["params", "dt"]
    u = tensordict["action"].squeeze(-1)
    u = u.clamp(-tensordict["params", "max_torque"], tensordict["params", "max_torque"])
    costs = angle_normalize(th) ** 2 + 0.1 * thdot**2 + 0.001 * (u**2)

    new_thdot = (
        thdot
        + (3 * g_force / (2 * length) * th.sin() + 3.0 / (mass * length**2) * u) * dt
    )
    new_thdot = new_thdot.clamp(
        -tensordict["params", "max_speed"], tensordict["params", "max_speed"]
    )
    new_th = th + new_thdot * dt
    reward = -costs.view(*tensordict.shape, 1)
    done = torch.zeros_like(reward, dtype=torch.bool)
    out = TensorDict(
        {
            "th": new_th,
            "thdot": new_thdot,
            "params": tensordict["params"],
            "reward": reward,
            "done": done,
        },
        tensordict.shape,
    )
    return out


def angle_normalize(x):
    return ((x + torch.pi) % (2 * torch.pi)) - torch.pi
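To see what angle_normalize does, here is a scalar re-implementation with the math module (an assumption made so that the sketch runs without tensors; the formula is the same). It wraps any angle into the interval [-pi, pi), so that a pendulum that has made a full turn is charged the same cost as one that has not.

```python
import math


def angle_normalize(x):
    """Scalar version of the tutorial's helper: wrap x into [-pi, pi)."""
    return ((x + math.pi) % (2 * math.pi)) - math.pi


# One full turn forward plus 0.5 rad is equivalent to 0.5 rad.
wrapped_forward = angle_normalize(2 * math.pi + 0.5)

# One full turn backward minus 0.5 rad is equivalent to -0.5 rad.
wrapped_backward = angle_normalize(-2 * math.pi - 0.5)

# An angle already inside the interval is left unchanged.
unchanged = angle_normalize(0.25)
```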
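Resetting the simulator: _reset()
The second method we need to care about is the _reset() method. Like _step(), it should write the observation entries and possibly a done state in the tensordict it outputs (if the done state is omitted, it will be filled as False by the parent method reset()). In some contexts, the _reset method must receive a command from the function that called it (for example, in multi-agent settings we may want to indicate which agents need to be reset). This is why the _reset() method also expects a tensordict as input, although it may perfectly well be empty or None.
The parent EnvBase.reset() performs some simple checks, as EnvBase.step() does, such as making sure that a "done" state is returned in the output tensordict and that the shapes match what is expected from the specs.
For us, the only important thing to consider is whether EnvBase._reset() contains all the expected observations. Once more, since we are working with a stateless environment, we pass the configuration of the pendulum in a nested tensordict named "params".
In this example, we do not pass a done state as this is not mandatory for _reset() and our environment is non-terminating, so we always expect it to be False.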
def _reset(self, tensordict):
    if tensordict is None or tensordict.is_empty():
        # if no ``tensordict`` is passed, we generate a single set of hyperparameters
        # Otherwise, we assume that the input ``tensordict`` contains all the relevant
        # parameters to get started.
        tensordict = self.gen_params(batch_size=self.batch_size)

    high_th = torch.tensor(DEFAULT_X, device=self.device)
    high_thdot = torch.tensor(DEFAULT_Y, device=self.device)
    low_th = -high_th
    low_thdot = -high_thdot

    # for non batch-locked environments, the input ``tensordict`` shape dictates the number
    # of simulators run simultaneously. In other contexts, the initial
    # random state's shape will depend upon the environment batch-size instead.
    th = (
        torch.rand(tensordict.shape, generator=self.rng, device=self.device)
        * (high_th - low_th)
        + low_th
    )
    thdot = (
        torch.rand(tensordict.shape, generator=self.rng, device=self.device)
        * (high_thdot - low_thdot)
        + low_thdot
    )
    out = TensorDict(
        {
            "th": th,
            "thdot": thdot,
            "params": tensordict["params"],
        },
        batch_size=tensordict.shape,
    )
    return out
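The sampling pattern used in _reset, stripped of the environment, looks like the sketch below. The seed value, the batch shape and the variable names are assumptions picked for illustration.

```python
import math

import torch

rng = torch.Generator()
rng.manual_seed(0)

low, high = -math.pi, math.pi
batch_shape = (5,)

# torch.rand samples from [0, 1); scaling by (high - low) and shifting by low
# maps the samples onto the interval between low and high.
th = torch.rand(batch_shape, generator=rng) * (high - low) + low
```

The shape passed to torch.rand is what decides how many initial states are drawn, which is why _reset uses the shape of the input tensordict.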
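Environment metadata: env.*_spec
The specs define the input and output domain of the environment. It is important that the specs accurately define the tensors that will be received at runtime, as they are often used to carry information about environments. They can also be used to instantiate lazily defined neural networks and test scripts without actually querying the environment (which can be costly with real-world physical systems, for instance).
There are four specs that we must code in our environment:
- EnvBase.observation_spec: a CompositeSpec instance (imported as Composite in this tutorial's code) where each key is an observation (a CompositeSpec can be viewed as a dictionary of specs);
- EnvBase.action_spec: it can be any type of spec, but it must correspond to the "action" entry in the input tensordict;
- EnvBase.reward_spec: provides information about the reward space;
- EnvBase.done_spec: provides information about the space of the done flag.
TorchRL specs are organized in two general containers: input_spec, which contains the specs of the information that the step function reads (divided between action_spec, containing the action, and state_spec, containing all the rest), and output_spec, which encodes the specs that the step outputs (observation_spec, reward_spec and done_spec).
In general, you should not interact directly with output_spec and input_spec but only with their content: observation_spec, reward_spec, done_spec, action_spec and state_spec. The reason is that the specs are organized in a non-trivial way within output_spec and input_spec, and neither of these should be modified directly.
In other words, observation_spec and the related properties are convenient shortcuts to the content of the output and input spec containers.
TorchRL offers multiple TensorSpec subclasses to encode the environment's input and output characteristics.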
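Specs shape
The leading dimensions of the environment specs must match the environment batch size. This is done to enforce that every component of an environment (including its transforms) has an accurate representation of the expected input and output shapes. This is something that should be accurately coded in stateful settings.
For non batch-locked environments, such as the one in our example (see below), this is irrelevant as the environment batch size will most likely be empty.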
def _make_spec(self, td_params):
    # Under the hood, this will populate self.output_spec["observation"]
    self.observation_spec = Composite(
        th=Bounded(
            low=-torch.pi,
            high=torch.pi,
            shape=(),
            dtype=torch.float32,
        ),
        thdot=Bounded(
            low=-td_params["params", "max_speed"],
            high=td_params["params", "max_speed"],
            shape=(),
            dtype=torch.float32,
        ),
        # we need to add the ``params`` to the observation specs, as we want
        # to pass it at each step during a rollout
        params=make_composite_from_td(td_params["params"]),
        shape=(),
    )
    # since the environment is stateless, we expect the previous output as input.
    # For this, ``EnvBase`` expects some state_spec to be available
    self.state_spec = self.observation_spec.clone()
    # action-spec will be automatically wrapped in input_spec when
    # `self.action_spec = spec` is called
    self.action_spec = Bounded(
        low=-td_params["params", "max_torque"],
        high=td_params["params", "max_torque"],
        shape=(1,),
        dtype=torch.float32,
    )
    self.reward_spec = Unbounded(shape=(*td_params.shape, 1))


def make_composite_from_td(td):
    # custom function to convert a ``tensordict`` into a similar spec structure
    # of unbounded values.
    composite = Composite(
        {
            key: make_composite_from_td(tensor)
            if isinstance(tensor, TensorDictBase)
            else Unbounded(dtype=tensor.dtype, device=tensor.device, shape=tensor.shape)
            for key, tensor in td.items()
        },
        shape=td.shape,
    )
    return composite
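Reproducible experiments: seeding
Seeding an environment is a common operation when initializing an experiment. The only goal of EnvBase._set_seed() is to set the seed of the contained simulator. If possible, this operation should not call reset() or interact with the environment execution. The parent EnvBase.set_seed() method incorporates a mechanism that allows seeding multiple environments with different pseudo-random and reproducible seeds.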
def _set_seed(self, seed: Optional[int]):
    # torch.manual_seed returns the seeded default generator
    rng = torch.manual_seed(seed)
    self.rng = rng
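What seeding buys us can be checked outside of any environment. The sketch below uses a dedicated torch.Generator rather than the global one, and the helper name `sample_initial_angles` is hypothetical: two generators seeded identically produce identical draws, while a different seed produces a different sequence.

```python
import torch


def sample_initial_angles(seed, n=3):
    """Draw n values from a generator seeded with ``seed`` (hypothetical helper)."""
    rng = torch.Generator()
    rng.manual_seed(seed)
    return torch.rand(n, generator=rng)


first = sample_initial_angles(seed=42)
second = sample_initial_angles(seed=42)
other = sample_initial_angles(seed=43)
```

This is the property the `generator=self.rng` argument relies on in _reset: with the same seed, the same initial states are drawn.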
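Wrapping things together: the EnvBase class
We can finally put together the pieces and design our environment class. The specs initialization needs to be performed during the environment construction, so we must take care of calling the _make_spec() method within PendulumEnv.__init__().
We add a static method PendulumEnv.gen_params() which deterministically generates a set of hyperparameters to be used during execution: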
def gen_params(g=10.0, batch_size=None) -> TensorDictBase:
    """Returns a ``tensordict`` containing the physical parameters such as gravitational force and torque or speed limits."""
    if batch_size is None:
        batch_size = []
    td = TensorDict(
        {
            "params": TensorDict(
                {
                    "max_speed": 8,
                    "max_torque": 2.0,
                    "dt": 0.05,
                    "g": g,
                    "m": 1.0,
                    "l": 1.0,
                },
                [],
            )
        },
        [],
    )
    if batch_size:
        # repeat the same parameters for every element of the batch
        td = td.expand(batch_size).contiguous()
    return td
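We define the environment as non-batch_locked by setting the homonymous attribute to False. This means that we will not require the input tensordict to have a batch size that matches the one of the environment.
The following code simply puts together the pieces we have coded above.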
class PendulumEnv(EnvBase):
    metadata = {
        "render_modes": ["human", "rgb_array"],
        "render_fps": 30,
    }
    batch_locked = False

    def __init__(self, td_params=None, seed=None, device="cpu"):
        if td_params is None:
            td_params = self.gen_params()

        super().__init__(device=device, batch_size=[])
        self._make_spec(td_params)
        if seed is None:
            seed = torch.empty((), dtype=torch.int64).random_().item()
        self.set_seed(seed)

    # Helpers: _make_spec and gen_params
    gen_params = staticmethod(gen_params)
    _make_spec = _make_spec

    # Mandatory methods: _step, _reset and _set_seed
    _reset = _reset
    _step = staticmethod(_step)
    _set_seed = _set_seed
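Testing our environment
TorchRL provides a simple function, check_env_specs(), to check that a (transformed) environment has an input/output structure that matches the one dictated by its specs.
Let us try it out: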
env = PendulumEnv()
check_env_specs(env)
We can have a look at our specs to get a visual representation of the environment signature:
print("observation_spec:", env.observation_spec)
print("state_spec:", env.state_spec)
print("reward_spec:", env.reward_spec)
observation_spec: Composite(
th: BoundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
thdot: BoundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
params: Composite(
max_speed: UnboundedDiscrete(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True)),
device=cpu,
dtype=torch.int64,
domain=discrete),
max_torque: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
dt: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
g: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
m: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
l: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
device=cpu,
shape=torch.Size([])),
device=cpu,
shape=torch.Size([]))
state_spec: Composite(
th: BoundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
thdot: BoundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
params: Composite(
max_speed: UnboundedDiscrete(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, contiguous=True)),
device=cpu,
dtype=torch.int64,
domain=discrete),
max_torque: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
dt: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
g: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
m: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
l: UnboundedContinuous(
shape=torch.Size([]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous),
device=cpu,
shape=torch.Size([])),
device=cpu,
shape=torch.Size([]))
reward_spec: UnboundedContinuous(
shape=torch.Size([1]),
space=ContinuousBox(
low=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True),
high=Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, contiguous=True)),
device=cpu,
dtype=torch.float32,
domain=continuous)
We can also execute a couple of commands to check that the output structure matches what is expected.
td = env.reset()
print("reset tensordict", td)
reset tensordict TensorDict(
fields={
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=None,
is_shared=False),
terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=None,
is_shared=False)
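We can run env.rand_step() to generate an action randomly from the action_spec domain. A tensordict containing the hyperparameters and the current state must be passed since our environment is stateless. In stateful contexts, env.rand_step() works perfectly too.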
td = env.rand_step(td)
print("random step tensordict", td)
random step tensordict TensorDict(
fields={
action: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
done: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=None,
is_shared=False),
reward: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=None,
is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=None,
is_shared=False),
terminated: Tensor(shape=torch.Size([1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([]),
device=None,
is_shared=False)
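Transforming an environment
Writing environment transforms for stateless simulators is slightly more complicated than for stateful ones: transforming an output entry that must be read at the following iteration requires applying the inverse transform before calling step() at the next step. This is an ideal scenario to showcase all the features of TorchRL's transforms!
For instance, in the following transformed environment we unsqueeze the entries ["th", "thdot"] to be able to stack them along the last dimension. We also pass them as in_keys_inv to squeeze them back to their original shape once they are passed as input in the next iteration.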
env = TransformedEnv(
env,
# ``Unsqueeze`` the observations that we will concatenate
UnsqueezeTransform(
dim=-1,
in_keys=["th", "thdot"],
in_keys_inv=["th", "thdot"],
),
)
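The forward and inverse directions of this transform amount to the following tensor operations. This sketch uses plain torch calls only and leaves TorchRL out, so it shows the shapes involved rather than the transform machinery; the batch of 10 is an arbitrary choice.

```python
import torch

# What the base environment produces and expects back: shape [10].
th = torch.rand(10)

# Forward direction (environment output): shape [10, 1], ready to be concatenated.
th_out = th.unsqueeze(-1)

# Inverse direction (before the next step): back to shape [10].
th_back = th_out.squeeze(-1)
```

Without the inverse, the stateless _step would receive "th" with an extra trailing dimension at the next iteration.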
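Writing custom transforms
TorchRL's transforms may not cover all the operations one wants to execute after an environment has been executed. Writing a transform does not require much effort. As for the environment design, there are two steps in writing a transform:
- getting the dynamics right (forward and inverse);
- adapting the environment specs.
A transform can be used in two settings: on its own, it can be used as a Module. It can also be appended to a TransformedEnv. The structure of the class allows customizing the behavior in the different contexts.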
class Transform(nn.Module):
    def forward(self, tensordict):
        ...

    def _apply_transform(self, tensordict):
        ...

    def _step(self, tensordict):
        ...

    def _call(self, tensordict):
        ...

    def inv(self, tensordict):
        ...

    def _inv_apply_transform(self, tensordict):
        ...
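There are three entry points (forward(), _step() and inv()) which all receive tensordict.TensorDict instances. The first two will eventually go through the keys indicated by in_keys and call _apply_transform() on each of these. The results will be written in the entries pointed to by Transform.out_keys if provided (if not, the in_keys will be updated with the transformed values).
If inverse transforms need to be executed, a similar data flow will be executed but with the Transform.inv() and Transform._inv_apply_transform() methods and across the in_keys_inv and out_keys_inv lists of keys.
The following figure summarizes this flow for environments and replay buffers.
[Figure: Transform API]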
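The forward data flow over in_keys and out_keys can be sketched with a dictionary-based toy class. `MiniTransform` and `MiniSin` are hypothetical and heavily simplified; they are not TorchRL's Transform and skip the inverse path, the specs, and the environment hooks.

```python
import math


class MiniTransform:
    """Hypothetical, simplified stand-in; this is not TorchRL's ``Transform``."""

    def __init__(self, in_keys, out_keys=None):
        self.in_keys = in_keys
        # Without out_keys, the results overwrite the input entries.
        self.out_keys = out_keys if out_keys is not None else in_keys

    def _apply_transform(self, value):
        raise NotImplementedError

    def forward(self, data):
        out = dict(data)
        # Visit each (in_key, out_key) pair and apply the per-entry function.
        for in_key, out_key in zip(self.in_keys, self.out_keys):
            out[out_key] = self._apply_transform(data[in_key])
        return out


class MiniSin(MiniTransform):
    def _apply_transform(self, value):
        return math.sin(value)


data = {"th": 0.0, "thdot": 1.0}
kept = MiniSin(in_keys=["th"], out_keys=["sin"]).forward(data)
overwritten = MiniSin(in_keys=["thdot"]).forward(data)
```

Subclasses only define the per-entry function; the loop over keys lives in the base class, which is the same split of responsibilities as between _apply_transform() and the entry points described above.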
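In some cases, a transform will not work on a subset of keys in a unitary manner, but will execute some operation on the parent environment or work with the entire input tensordict. In those cases, the _call() and forward() methods should be rewritten, and the _apply_transform() method can be skipped.
Let us code new transforms that will compute the sine and cosine values of the position angle, as these values are more useful to us for learning a policy than the raw angle value: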
class SinTransform(Transform):
    def _apply_transform(self, obs: torch.Tensor) -> torch.Tensor:
        return obs.sin()

    # The transform must also modify the data at reset time
    def _reset(
        self, tensordict: TensorDictBase, tensordict_reset: TensorDictBase
    ) -> TensorDictBase:
        return self._call(tensordict_reset)

    # _apply_to_composite will execute the observation spec transform across all
    # in_keys/out_keys pairs and write the result in the observation_spec which
    # is of type ``Composite``
    @_apply_to_composite
    def transform_observation_spec(self, observation_spec):
        return Bounded(
            low=-1,
            high=1,
            shape=observation_spec.shape,
            dtype=observation_spec.dtype,
            device=observation_spec.device,
        )


class CosTransform(Transform):
    def _apply_transform(self, obs: torch.Tensor) -> torch.Tensor:
        return obs.cos()

    # The transform must also modify the data at reset time
    def _reset(
        self, tensordict: TensorDictBase, tensordict_reset: TensorDictBase
    ) -> TensorDictBase:
        return self._call(tensordict_reset)

    # _apply_to_composite will execute the observation spec transform across all
    # in_keys/out_keys pairs and write the result in the observation_spec which
    # is of type ``Composite``
    @_apply_to_composite
    def transform_observation_spec(self, observation_spec):
        return Bounded(
            low=-1,
            high=1,
            shape=observation_spec.shape,
            dtype=observation_spec.dtype,
            device=observation_spec.device,
        )


t_sin = SinTransform(in_keys=["th"], out_keys=["sin"])
t_cos = CosTransform(in_keys=["th"], out_keys=["cos"])
env.append_transform(t_sin)
env.append_transform(t_cos)
TransformedEnv(
env=PendulumEnv(),
transform=Compose(
UnsqueezeTransform(dim=-1, in_keys=['th', 'thdot'], out_keys=['th', 'thdot'], in_keys_inv=['th', 'thdot'], out_keys_inv=['th', 'thdot']),
SinTransform(keys=['th']),
CosTransform(keys=['th'])))
We now concatenate the observations into an "observation" entry. del_keys=False ensures that we keep these values for the next iteration.
cat_transform = CatTensors(
in_keys=["sin", "cos", "thdot"], dim=-1, out_key="observation", del_keys=False
)
env.append_transform(cat_transform)
TransformedEnv(
env=PendulumEnv(),
transform=Compose(
UnsqueezeTransform(dim=-1, in_keys=['th', 'thdot'], out_keys=['th', 'thdot'], in_keys_inv=['th', 'thdot'], out_keys_inv=['th', 'thdot']),
SinTransform(keys=['th']),
CosTransform(keys=['th']),
CatTensors(in_keys=['cos', 'sin', 'thdot'], out_key=observation)))
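In tensor terms, the concatenation performed here corresponds to the sketch below, written with plain torch calls. The entries are concatenated in the key order shown in the printed transform above (cos, sin, thdot); the numeric values are arbitrary.

```python
import torch

# Single environment: each entry has a trailing dimension of size 1.
th = torch.tensor([0.3])
thdot = torch.tensor([-1.2])
observation = torch.cat([th.cos(), th.sin(), thdot], dim=-1)

# Batch of 10 environments: the same call works on shape [10, 1] entries.
th_b = torch.rand(10, 1)
thdot_b = torch.rand(10, 1)
observation_b = torch.cat([th_b.cos(), th_b.sin(), thdot_b], dim=-1)
```

This is why the entries were unsqueezed first: concatenating along the last dimension requires that dimension to exist.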
Once more, let us check that our environment specs match what is received:
check_env_specs(env)
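Executing a rollout
Executing a rollout is a succession of simple steps:
- reset the environment;
- while some condition is not met:
  - compute an action given a policy;
  - execute a step given this action;
  - collect the data;
  - make an MDP step;
- gather the data and return.
These operations have been conveniently wrapped in the rollout() method, from which we provide a simplified version below.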
def simple_rollout(steps=100):
    # preallocate:
    data = TensorDict({}, [steps])
    # reset
    _data = env.reset()
    for i in range(steps):
        _data["action"] = env.action_spec.rand()
        _data = env.step(_data)
        data[i] = _data
        # the "next" entries become the root entries of the following iteration
        _data = step_mdp(_data, keep_other=True)
    return data
print("data from rollout:", simple_rollout(100))
data from rollout: TensorDict(
fields={
action: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
cos: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
cos: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([100, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([100]),
device=None,
is_shared=False),
reward: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
sin: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([100]),
device=None,
is_shared=False),
observation: Tensor(shape=torch.Size([100, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([100]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([100]),
device=None,
is_shared=False),
sin: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([100, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([100]),
device=None,
is_shared=False)
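The role of the MDP step in the rollout loop can be illustrated with plain dictionaries. `toy_step` and `promote_next` are hypothetical stand-ins for env.step and step_mdp; they only mimic the movement of data, not the actual TorchRL behavior in full.

```python
def toy_step(data):
    """Hypothetical stand-in for ``env.step``: nests the result under "next"."""
    out = dict(data)
    out["next"] = {"observation": data["observation"] + data["action"]}
    return out


def promote_next(data):
    """Hypothetical stand-in for the MDP step: "next" becomes the new root."""
    return dict(data["next"])


current = {"observation": 0}
trajectory = []
for action in [1, 2, 3]:
    current["action"] = action        # compute an action (fixed here)
    current = toy_step(current)       # execute a step given this action
    trajectory.append(current)        # collect the data
    current = promote_next(current)   # make an MDP step
```

Each stored element keeps both the observation the action was computed from and the one it led to, while the loop itself only ever carries the latest state forward.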
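Batching computations
The last unexplored end of our tutorial is the ability to batch computations in TorchRL. Because our environment does not make any assumptions regarding the input data shape, we can seamlessly execute it over batches of data. Even better: for non-batch-locked environments such as our Pendulum, we can change the batch size on the fly without recreating the environment. To do this, we just generate parameters with the desired shape.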
batch_size = 10 # number of environments to be executed in batch
td = env.reset(env.gen_params(batch_size=[batch_size]))
print("reset (batch size of 10)", td)
td = env.rand_step(td)
print("rand step (batch size of 10)", td)
reset (batch size of 10) TensorDict(
fields={
cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10]),
device=None,
is_shared=False),
sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10]),
device=None,
is_shared=False)
rand step (batch size of 10) TensorDict(
fields={
action: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
cos: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10]),
device=None,
is_shared=False),
reward: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10]),
device=None,
is_shared=False),
observation: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([10]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10]),
device=None,
is_shared=False),
sin: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([10, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10]),
device=None,
is_shared=False)
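Executing a rollout with a batch of data requires us to reset the environment outside of the rollout function, since we need to define the batch_size dynamically and this is not supported by rollout():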
rollout = env.rollout(
3,
auto_reset=False, # we're executing the reset out of the ``rollout`` call
tensordict=env.reset(env.gen_params(batch_size=[batch_size])),
)
print("rollout of len 3 (batch size of 10):", rollout)
rollout of len 3 (batch size of 10): TensorDict(
fields={
action: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
cos: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
next: TensorDict(
fields={
cos: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
done: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
observation: Tensor(shape=torch.Size([10, 3, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10, 3]),
device=None,
is_shared=False),
reward: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
sin: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10, 3]),
device=None,
is_shared=False),
observation: Tensor(shape=torch.Size([10, 3, 3]), device=cpu, dtype=torch.float32, is_shared=False),
params: TensorDict(
fields={
dt: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
g: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
l: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
m: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False),
max_speed: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.int64, is_shared=False),
max_torque: Tensor(shape=torch.Size([10, 3]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10, 3]),
device=None,
is_shared=False),
sin: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
terminated: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.bool, is_shared=False),
th: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False),
thdot: Tensor(shape=torch.Size([10, 3, 1]), device=cpu, dtype=torch.float32, is_shared=False)},
batch_size=torch.Size([10, 3]),
device=None,
is_shared=False)
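The pattern above (reset outside, then roll out from the supplied state) can be sketched without TorchRL. The following is a minimal plain-Python illustration, not the library's implementation: the helper names `reset`, `step` and `rollout`, the dictionary state, and the constants (g=10, m=l=1, dt=0.05, no speed or torque clamping) are assumptions made for the example.

```python
import math
import random


def reset(batch_size, rng):
    """Hypothetical stateless reset: one fresh (th, thdot) state per batch element."""
    return [
        {"th": rng.uniform(-math.pi, math.pi), "thdot": rng.uniform(-1.0, 1.0)}
        for _ in range(batch_size)
    ]


def step(state, torque, dt=0.05, g=10.0, m=1.0, l=1.0):
    """Compute the next state from the current one; nothing is stored on an env object."""
    accel = 3 * g / (2 * l) * math.sin(state["th"]) + 3.0 / (m * l**2) * torque
    thdot = state["thdot"] + accel * dt
    th = state["th"] + thdot * dt
    return {"th": th, "thdot": thdot}


def rollout(initial_states, policy, num_steps):
    """Roll out from caller-supplied states; no reset happens in here."""
    trajectories = []
    for state in initial_states:
        trajectory = []
        for _ in range(num_steps):
            state = step(state, policy(state))
            trajectory.append(state)
        trajectories.append(trajectory)
    return trajectories


rng = random.Random(0)
initial_states = reset(batch_size=10, rng=rng)  # the batch size is chosen here
trajectories = rollout(initial_states, policy=lambda s: 0.0, num_steps=3)
print(len(trajectories), len(trajectories[0]))  # 10 trajectories of 3 steps each
```

Because the rollout never resets, the caller fully decides how many initial states exist, which mirrors the `[10, 3]` batch size printed above (batch dimension first, time dimension second).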
Training a simple policy
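In this example, we will train a simple policy using the reward as a differentiable objective, that is, as a negative loss. We will take advantage of the fact that our dynamic system is fully differentiable to backpropagate through the trajectory return and adjust the weights of our policy to maximize this value directly. Of course, in many settings the assumptions we make here do not hold, such as a differentiable system and full access to the underlying mechanics.
Still, this is a very simple example that showcases how a training loop can be coded with a custom environment in TorchRL.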
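To make "backpropagating through the trajectory return" concrete, here is a hand-written sketch in plain Python, with no autograd. It is a simplified stand-in rather than the tutorial's setup: it assumes a linear policy with two hypothetical gains `k1` and `k2` instead of a neural network, pendulum-like dynamics with constants a = 3g/(2l) = 15 and b = 3/(ml²) = 3, no clamping of speed or torque, and no angle normalization in the reward.

```python
import math


def return_and_grad(k1, k2, th0, thdot0, num_steps=20, dt=0.05, a=15.0, b=3.0):
    """Mean reward of a rollout and its gradient w.r.t. the policy gains (k1, k2).

    Assumed dynamics: thdot' = thdot + (a*sin(th) + b*u)*dt, th' = th + thdot'*dt,
    with a linear policy u = k1*th + k2*thdot.
    """
    ths, thdots, us = [th0], [thdot0], []
    total_reward = 0.0
    # Forward pass: simulate and store every intermediate state.
    for _ in range(num_steps):
        th, thdot = ths[-1], thdots[-1]
        u = k1 * th + k2 * thdot
        total_reward += -(th**2 + 0.1 * thdot**2 + 0.001 * u**2)
        new_thdot = thdot + (a * math.sin(th) + b * u) * dt
        new_th = th + new_thdot * dt
        us.append(u)
        thdots.append(new_thdot)
        ths.append(new_th)
    mean_reward = total_reward / num_steps

    # Backward pass: carry d(mean_reward)/d(state) from the last step to the first.
    g_th, g_thdot = 0.0, 0.0  # adjoints of the state at time t + 1
    g_k1, g_k2 = 0.0, 0.0
    for t in reversed(range(num_steps)):
        th, thdot, u = ths[t], thdots[t], us[t]
        g_new_thdot = g_thdot + g_th * dt  # through th' = th + thdot'*dt
        g_u = g_new_thdot * b * dt - 0.002 * u / num_steps
        g_k1 += g_u * th
        g_k2 += g_u * thdot
        g_th, g_thdot = (
            g_th + g_new_thdot * a * math.cos(th) * dt - 2.0 * th / num_steps + g_u * k1,
            g_new_thdot - 0.2 * thdot / num_steps + g_u * k2,
        )
    return mean_reward, (g_k1, g_k2)


# One gradient-ascent step on the return, starting from hypothetical gains.
k1, k2 = -1.0, -0.5
ret, (g1, g2) = return_and_grad(k1, k2, th0=0.3, thdot0=0.0)
k1, k2 = k1 + 0.01 * g1, k2 + 0.01 * g2
```

The backward loop is what an autograd engine would do automatically: every state produced by the simulator stays in the computation graph, so the gradient of the return reaches the policy parameters through all time steps.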
Let us first write the policy network:
torch.manual_seed(0)
env.set_seed(0)
net = nn.Sequential(
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(64),
    nn.Tanh(),
    nn.LazyLinear(1),
)
policy = TensorDictModule(
    net,
    in_keys=["observation"],
    out_keys=["action"],
)
and our optimizer:
optim = torch.optim.Adam(policy.parameters(), lr=2e-3)
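As an aside on the policy wrapper defined above: conceptually, a module configured with `in_keys` and `out_keys` reads the named entries from a container, calls the wrapped function, and writes the result back under the output keys. The sketch below mimics that pattern with a plain dictionary. `DictModule` and `toy_net` are hypothetical names for illustration and are not TorchRL or tensordict classes; the (sin, cos, thdot) ordering of the observation is also an assumption.

```python
class DictModule:
    """Hypothetical stand-in for the in_keys/out_keys pattern (not the TorchRL class)."""

    def __init__(self, fn, in_keys, out_keys):
        self.fn = fn
        self.in_keys = list(in_keys)
        self.out_keys = list(out_keys)

    def __call__(self, data):
        # Read the inputs by key, call the wrapped function, write outputs by key.
        inputs = [data[key] for key in self.in_keys]
        outputs = self.fn(*inputs)
        if len(self.out_keys) == 1:
            outputs = (outputs,)
        for key, value in zip(self.out_keys, outputs):
            data[key] = value
        return data


def toy_net(observation):
    """Toy linear controller over an assumed (sin, cos, thdot) observation."""
    sin_th, cos_th, thdot = observation
    return -2.0 * sin_th - 0.5 * thdot


toy_policy = DictModule(toy_net, in_keys=["observation"], out_keys=["action"])
data = toy_policy({"observation": (0.0, 1.0, 0.4)})
print(data["action"])  # -0.2
```

Keeping the keys outside the function is what lets the same data container flow from the environment to the policy and back without any glue code.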
Training loop
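We will successively:
generate a trajectory
aggregate the rewards (the code below uses their mean)
backpropagate through the graph defined by these operations
clip the gradient norm and make an optimization step
repeat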
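At the end of the training loop, we should have a final reward close to 0, which demonstrates that the pendulum is upright and still, as desired.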
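The gradient-clipping step in the list above can be sketched in plain Python. This is an illustration in the spirit of `torch.nn.utils.clip_grad_norm_`, not its source: the helper name `clip_grad_norm` and the flat list of floats standing in for parameter gradients are assumptions made for the example.

```python
import math


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Scale `grads` in place so that their global L2 norm is at most `max_norm`.

    Returns the norm measured before any scaling.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    for i in range(len(grads)):
        grads[i] *= scale
    return total_norm


grads = [3.0, 4.0]
norm_before = clip_grad_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in grads))
print(norm_before, norm_after)
```

Note that the returned value is the norm before clipping. The same holds for the PyTorch function, which is why a logged gradient norm can be far larger than the clipping threshold of 1.0 even though the update itself uses the rescaled gradients.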
batch_size = 32
pbar = tqdm.tqdm(range(20_000 // batch_size))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optim, 20_000)
logs = defaultdict(list)
for _ in pbar:
    # reset outside of ``rollout`` so that the batch size can be set here
    init_td = env.reset(env.gen_params(batch_size=[batch_size]))
    rollout = env.rollout(100, policy, tensordict=init_td, auto_reset=False)
    traj_return = rollout["next", "reward"].mean()
    # maximize the mean reward by minimizing its negative
    (-traj_return).backward()
    # ``clip_grad_norm_`` returns the total norm computed before clipping
    gn = torch.nn.utils.clip_grad_norm_(net.parameters(), 1.0)
    optim.step()
    optim.zero_grad()
    pbar.set_description(
        f"reward: {traj_return: 4.4f}, "
        f"last reward: {rollout[..., -1]['next', 'reward'].mean(): 4.4f}, gradient norm: {gn: 4.4}"
    )
    logs["return"].append(traj_return.item())
    logs["last_reward"].append(rollout[..., -1]["next", "reward"].mean().item())
    scheduler.step()
def plot():
    import matplotlib
    from matplotlib import pyplot as plt
    is_ipython = "inline" in matplotlib.get_backend()
    if is_ipython:
        from IPython import display
    with plt.ion():
        plt.figure(figsize=(10, 5))
        plt.subplot(1, 2, 1)
        plt.plot(logs["return"])
        plt.title("returns")
        plt.xlabel("iteration")
        plt.subplot(1, 2, 2)
        plt.plot(logs["last_reward"])
        plt.title("last reward")
        plt.xlabel("iteration")
        if is_ipython:
            display.display(plt.gcf())
            display.clear_output(wait=True)
        plt.show()
plot()
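A note on the learning-rate schedule used in the loop above: the scheduler is created with a horizon of 20,000 steps, while `scheduler.step()` is called once per iteration, that is 20,000 // 32 = 625 times. The sketch below evaluates the closed-form cosine annealing formula given in the PyTorch documentation, assuming the default minimum learning rate of 0; the helper name `cosine_annealing_lr` is illustrative.

```python
import math


def cosine_annealing_lr(step, base_lr, t_max, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at step 0, eta_min at step t_max."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max)) / 2


num_updates = 20_000 // 32  # number of scheduler steps taken by the loop
lr_end = cosine_annealing_lr(num_updates, base_lr=2e-3, t_max=20_000)
lr_if_matched = cosine_annealing_lr(num_updates, base_lr=2e-3, t_max=num_updates)
print(f"lr after {num_updates} steps with t_max=20000: {lr_end:.6f}")
print(f"lr after {num_updates} steps with t_max={num_updates}: {lr_if_matched:.6f}")
```

Under these assumptions the learning rate decays by less than 1% over the whole run, so training effectively proceeds at a near-constant rate of 2e-3. Setting the horizon to the number of updates would instead anneal it all the way to 0.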
0%| | 0/625 [00:00<?, ?it/s]
reward: -6.0488, last reward: -5.0748, gradient norm: 8.519: 0%| | 0/625 [00:00<?, ?it/s]
reward: -6.0488, last reward: -5.0748, gradient norm: 8.519: 0%| | 1/625 [00:00<01:33, 6.69it/s]
reward: -7.0499, last reward: -7.4472, gradient norm: 5.073: 0%| | 1/625 [00:00<01:33, 6.69it/s]
reward: -7.0499, last reward: -7.4472, gradient norm: 5.073: 0%| | 2/625 [00:00<01:33, 6.63it/s]
reward: -7.0685, last reward: -7.0408, gradient norm: 5.552: 0%| | 2/625 [00:00<01:33, 6.63it/s]
reward: -7.0685, last reward: -7.0408, gradient norm: 5.552: 0%| | 3/625 [00:00<01:33, 6.63it/s]
reward: -6.5154, last reward: -5.9086, gradient norm: 2.527: 0%| | 3/625 [00:00<01:33, 6.63it/s]
reward: -6.5154, last reward: -5.9086, gradient norm: 2.527: 1%| | 4/625 [00:00<01:34, 6.60it/s]
reward: -6.2006, last reward: -5.9385, gradient norm: 8.155: 1%| | 4/625 [00:00<01:34, 6.60it/s]
reward: -6.2006, last reward: -5.9385, gradient norm: 8.155: 1%| | 5/625 [00:00<01:33, 6.60it/s]
reward: -6.2568, last reward: -5.4981, gradient norm: 6.223: 1%| | 5/625 [00:00<01:33, 6.60it/s]
reward: -6.2568, last reward: -5.4981, gradient norm: 6.223: 1%| | 6/625 [00:00<01:33, 6.62it/s]
reward: -5.8929, last reward: -8.4491, gradient norm: 4.581: 1%| | 6/625 [00:01<01:33, 6.62it/s]
reward: -5.8929, last reward: -8.4491, gradient norm: 4.581: 1%| | 7/625 [00:01<01:33, 6.63it/s]
reward: -6.3233, last reward: -9.0664, gradient norm: 7.596: 1%| | 7/625 [00:01<01:33, 6.63it/s]
reward: -6.3233, last reward: -9.0664, gradient norm: 7.596: 1%|▏ | 8/625 [00:01<01:34, 6.56it/s]
reward: -6.1021, last reward: -9.5263, gradient norm: 0.9579: 1%|▏ | 8/625 [00:01<01:34, 6.56it/s]
reward: -6.1021, last reward: -9.5263, gradient norm: 0.9579: 1%|▏ | 9/625 [00:01<01:34, 6.53it/s]
reward: -6.5807, last reward: -8.8075, gradient norm: 3.212: 1%|▏ | 9/625 [00:01<01:34, 6.53it/s]
reward: -6.5807, last reward: -8.8075, gradient norm: 3.212: 2%|▏ | 10/625 [00:01<01:33, 6.56it/s]
reward: -6.2009, last reward: -8.5525, gradient norm: 2.914: 2%|▏ | 10/625 [00:01<01:33, 6.56it/s]
reward: -6.2009, last reward: -8.5525, gradient norm: 2.914: 2%|▏ | 11/625 [00:01<01:33, 6.58it/s]
reward: -6.2894, last reward: -8.0115, gradient norm: 52.06: 2%|▏ | 11/625 [00:01<01:33, 6.58it/s]
reward: -6.2894, last reward: -8.0115, gradient norm: 52.06: 2%|▏ | 12/625 [00:01<01:32, 6.59it/s]
reward: -6.0977, last reward: -6.1845, gradient norm: 18.09: 2%|▏ | 12/625 [00:01<01:32, 6.59it/s]
reward: -6.0977, last reward: -6.1845, gradient norm: 18.09: 2%|▏ | 13/625 [00:01<01:32, 6.60it/s]
reward: -6.1830, last reward: -7.4858, gradient norm: 5.233: 2%|▏ | 13/625 [00:02<01:32, 6.60it/s]
reward: -6.1830, last reward: -7.4858, gradient norm: 5.233: 2%|▏ | 14/625 [00:02<01:32, 6.61it/s]
reward: -6.2863, last reward: -5.0297, gradient norm: 1.464: 2%|▏ | 14/625 [00:02<01:32, 6.61it/s]
reward: -6.2863, last reward: -5.0297, gradient norm: 1.464: 2%|▏ | 15/625 [00:02<01:31, 6.63it/s]
reward: -6.4617, last reward: -5.5997, gradient norm: 2.904: 2%|▏ | 15/625 [00:02<01:31, 6.63it/s]
reward: -6.4617, last reward: -5.5997, gradient norm: 2.904: 3%|▎ | 16/625 [00:02<01:31, 6.63it/s]
reward: -6.1647, last reward: -6.0777, gradient norm: 4.901: 3%|▎ | 16/625 [00:02<01:31, 6.63it/s]
reward: -6.1647, last reward: -6.0777, gradient norm: 4.901: 3%|▎ | 17/625 [00:02<01:31, 6.64it/s]
reward: -6.4709, last reward: -6.6813, gradient norm: 0.8317: 3%|▎ | 17/625 [00:02<01:31, 6.64it/s]
reward: -6.4709, last reward: -6.6813, gradient norm: 0.8317: 3%|▎ | 18/625 [00:02<01:31, 6.64it/s]
reward: -6.3221, last reward: -6.5554, gradient norm: 1.276: 3%|▎ | 18/625 [00:02<01:31, 6.64it/s]
reward: -6.3221, last reward: -6.5554, gradient norm: 1.276: 3%|▎ | 19/625 [00:02<01:31, 6.64it/s]
reward: -6.3353, last reward: -7.9999, gradient norm: 4.701: 3%|▎ | 19/625 [00:03<01:31, 6.64it/s]
reward: -6.3353, last reward: -7.9999, gradient norm: 4.701: 3%|▎ | 20/625 [00:03<01:31, 6.65it/s]
reward: -5.8570, last reward: -7.6656, gradient norm: 5.463: 3%|▎ | 20/625 [00:03<01:31, 6.65it/s]
reward: -5.8570, last reward: -7.6656, gradient norm: 5.463: 3%|▎ | 21/625 [00:03<01:30, 6.65it/s]
reward: -5.7779, last reward: -6.6911, gradient norm: 6.875: 3%|▎ | 21/625 [00:03<01:30, 6.65it/s]
reward: -5.7779, last reward: -6.6911, gradient norm: 6.875: 4%|▎ | 22/625 [00:03<01:30, 6.66it/s]
reward: -6.0796, last reward: -5.7082, gradient norm: 5.308: 4%|▎ | 22/625 [00:03<01:30, 6.66it/s]
reward: -6.0796, last reward: -5.7082, gradient norm: 5.308: 4%|▎ | 23/625 [00:03<01:30, 6.65it/s]
reward: -6.0421, last reward: -6.1496, gradient norm: 12.4: 4%|▎ | 23/625 [00:03<01:30, 6.65it/s]
reward: -6.0421, last reward: -6.1496, gradient norm: 12.4: 4%|▍ | 24/625 [00:03<01:30, 6.65it/s]
reward: -5.5037, last reward: -5.1755, gradient norm: 22.62: 4%|▍ | 24/625 [00:03<01:30, 6.65it/s]
reward: -5.5037, last reward: -5.1755, gradient norm: 22.62: 4%|▍ | 25/625 [00:03<01:30, 6.65it/s]
reward: -5.5029, last reward: -4.9454, gradient norm: 3.665: 4%|▍ | 25/625 [00:03<01:30, 6.65it/s]
reward: -5.5029, last reward: -4.9454, gradient norm: 3.665: 4%|▍ | 26/625 [00:03<01:30, 6.65it/s]
reward: -5.9330, last reward: -6.2118, gradient norm: 5.444: 4%|▍ | 26/625 [00:04<01:30, 6.65it/s]
reward: -5.9330, last reward: -6.2118, gradient norm: 5.444: 4%|▍ | 27/625 [00:04<01:29, 6.65it/s]
reward: -6.0995, last reward: -6.6294, gradient norm: 11.69: 4%|▍ | 27/625 [00:04<01:29, 6.65it/s]
reward: -6.0995, last reward: -6.6294, gradient norm: 11.69: 4%|▍ | 28/625 [00:04<01:29, 6.65it/s]
reward: -6.3146, last reward: -7.2909, gradient norm: 5.461: 4%|▍ | 28/625 [00:04<01:29, 6.65it/s]
reward: -6.3146, last reward: -7.2909, gradient norm: 5.461: 5%|▍ | 29/625 [00:04<01:29, 6.66it/s]
reward: -5.9720, last reward: -6.1298, gradient norm: 19.91: 5%|▍ | 29/625 [00:04<01:29, 6.66it/s]
reward: -5.9720, last reward: -6.1298, gradient norm: 19.91: 5%|▍ | 30/625 [00:04<01:29, 6.66it/s]
reward: -5.9923, last reward: -7.0345, gradient norm: 3.464: 5%|▍ | 30/625 [00:04<01:29, 6.66it/s]
reward: -5.9923, last reward: -7.0345, gradient norm: 3.464: 5%|▍ | 31/625 [00:04<01:29, 6.65it/s]
reward: -5.3438, last reward: -4.3688, gradient norm: 2.424: 5%|▍ | 31/625 [00:04<01:29, 6.65it/s]
reward: -5.3438, last reward: -4.3688, gradient norm: 2.424: 5%|▌ | 32/625 [00:04<01:29, 6.64it/s]
reward: -5.6953, last reward: -4.5233, gradient norm: 3.411: 5%|▌ | 32/625 [00:04<01:29, 6.64it/s]
reward: -5.6953, last reward: -4.5233, gradient norm: 3.411: 5%|▌ | 33/625 [00:04<01:29, 6.64it/s]
reward: -5.4288, last reward: -2.8011, gradient norm: 10.82: 5%|▌ | 33/625 [00:05<01:29, 6.64it/s]
reward: -5.4288, last reward: -2.8011, gradient norm: 10.82: 5%|▌ | 34/625 [00:05<01:29, 6.64it/s]
reward: -5.5329, last reward: -4.2677, gradient norm: 15.71: 5%|▌ | 34/625 [00:05<01:29, 6.64it/s]
reward: -5.5329, last reward: -4.2677, gradient norm: 15.71: 6%|▌ | 35/625 [00:05<01:28, 6.64it/s]
reward: -5.6969, last reward: -3.7010, gradient norm: 1.376: 6%|▌ | 35/625 [00:05<01:28, 6.64it/s]
reward: -5.6969, last reward: -3.7010, gradient norm: 1.376: 6%|▌ | 36/625 [00:05<01:28, 6.65it/s]
reward: -5.9352, last reward: -4.7707, gradient norm: 15.49: 6%|▌ | 36/625 [00:05<01:28, 6.65it/s]
reward: -5.9352, last reward: -4.7707, gradient norm: 15.49: 6%|▌ | 37/625 [00:05<01:28, 6.65it/s]
reward: -5.6178, last reward: -4.5646, gradient norm: 3.348: 6%|▌ | 37/625 [00:05<01:28, 6.65it/s]
reward: -5.6178, last reward: -4.5646, gradient norm: 3.348: 6%|▌ | 38/625 [00:05<01:28, 6.65it/s]
reward: -5.7304, last reward: -3.9407, gradient norm: 4.942: 6%|▌ | 38/625 [00:05<01:28, 6.65it/s]
reward: -5.7304, last reward: -3.9407, gradient norm: 4.942: 6%|▌ | 39/625 [00:05<01:28, 6.64it/s]
reward: -5.3882, last reward: -3.7604, gradient norm: 9.85: 6%|▌ | 39/625 [00:06<01:28, 6.64it/s]
reward: -5.3882, last reward: -3.7604, gradient norm: 9.85: 6%|▋ | 40/625 [00:06<01:28, 6.63it/s]
reward: -5.3507, last reward: -2.8928, gradient norm: 1.258: 6%|▋ | 40/625 [00:06<01:28, 6.63it/s]
reward: -5.3507, last reward: -2.8928, gradient norm: 1.258: 7%|▋ | 41/625 [00:06<01:27, 6.64it/s]
reward: -5.6978, last reward: -4.4641, gradient norm: 4.549: 7%|▋ | 41/625 [00:06<01:27, 6.64it/s]
reward: -5.6978, last reward: -4.4641, gradient norm: 4.549: 7%|▋ | 42/625 [00:06<01:27, 6.64it/s]
reward: -5.5263, last reward: -3.6047, gradient norm: 2.544: 7%|▋ | 42/625 [00:06<01:27, 6.64it/s]
reward: -5.5263, last reward: -3.6047, gradient norm: 2.544: 7%|▋ | 43/625 [00:06<01:27, 6.64it/s]
reward: -5.5005, last reward: -4.4136, gradient norm: 11.49: 7%|▋ | 43/625 [00:06<01:27, 6.64it/s]
reward: -5.5005, last reward: -4.4136, gradient norm: 11.49: 7%|▋ | 44/625 [00:06<01:27, 6.64it/s]
reward: -5.2993, last reward: -6.3222, gradient norm: 32.53: 7%|▋ | 44/625 [00:06<01:27, 6.64it/s]
reward: -5.2993, last reward: -6.3222, gradient norm: 32.53: 7%|▋ | 45/625 [00:06<01:27, 6.65it/s]
reward: -5.4046, last reward: -5.7314, gradient norm: 7.275: 7%|▋ | 45/625 [00:06<01:27, 6.65it/s]
reward: -5.4046, last reward: -5.7314, gradient norm: 7.275: 7%|▋ | 46/625 [00:06<01:27, 6.65it/s]
reward: -5.6331, last reward: -4.9318, gradient norm: 6.961: 7%|▋ | 46/625 [00:07<01:27, 6.65it/s]
reward: -5.6331, last reward: -4.9318, gradient norm: 6.961: 8%|▊ | 47/625 [00:07<01:26, 6.65it/s]
reward: -4.8331, last reward: -4.1604, gradient norm: 26.26: 8%|▊ | 47/625 [00:07<01:26, 6.65it/s]
reward: -4.8331, last reward: -4.1604, gradient norm: 26.26: 8%|▊ | 48/625 [00:07<01:26, 6.66it/s]
reward: -5.4099, last reward: -4.4761, gradient norm: 8.125: 8%|▊ | 48/625 [00:07<01:26, 6.66it/s]
reward: -5.4099, last reward: -4.4761, gradient norm: 8.125: 8%|▊ | 49/625 [00:07<01:26, 6.65it/s]
reward: -5.4262, last reward: -3.6363, gradient norm: 2.382: 8%|▊ | 49/625 [00:07<01:26, 6.65it/s]
reward: -5.4262, last reward: -3.6363, gradient norm: 2.382: 8%|▊ | 50/625 [00:07<01:26, 6.65it/s]
reward: -5.3593, last reward: -5.7377, gradient norm: 22.62: 8%|▊ | 50/625 [00:07<01:26, 6.65it/s]
reward: -5.3593, last reward: -5.7377, gradient norm: 22.62: 8%|▊ | 51/625 [00:07<01:26, 6.64it/s]
reward: -5.2847, last reward: -3.3443, gradient norm: 2.867: 8%|▊ | 51/625 [00:07<01:26, 6.64it/s]
reward: -5.2847, last reward: -3.3443, gradient norm: 2.867: 8%|▊ | 52/625 [00:07<01:26, 6.64it/s]
reward: -5.3592, last reward: -6.4760, gradient norm: 8.441: 8%|▊ | 52/625 [00:07<01:26, 6.64it/s]
reward: -5.3592, last reward: -6.4760, gradient norm: 8.441: 8%|▊ | 53/625 [00:07<01:26, 6.64it/s]
reward: -5.9950, last reward: -10.8021, gradient norm: 11.77: 8%|▊ | 53/625 [00:08<01:26, 6.64it/s]
reward: -5.9950, last reward: -10.8021, gradient norm: 11.77: 9%|▊ | 54/625 [00:08<01:25, 6.65it/s]
reward: -6.3528, last reward: -7.1214, gradient norm: 7.708: 9%|▊ | 54/625 [00:08<01:25, 6.65it/s]
reward: -6.3528, last reward: -7.1214, gradient norm: 7.708: 9%|▉ | 55/625 [00:08<01:25, 6.64it/s]
reward: -6.4023, last reward: -7.3583, gradient norm: 9.041: 9%|▉ | 55/625 [00:08<01:25, 6.64it/s]
reward: -6.4023, last reward: -7.3583, gradient norm: 9.041: 9%|▉ | 56/625 [00:08<01:25, 6.64it/s]
reward: -6.3801, last reward: -7.0310, gradient norm: 120.1: 9%|▉ | 56/625 [00:08<01:25, 6.64it/s]
reward: -6.3801, last reward: -7.0310, gradient norm: 120.1: 9%|▉ | 57/625 [00:08<01:25, 6.64it/s]
reward: -6.4244, last reward: -6.2039, gradient norm: 15.48: 9%|▉ | 57/625 [00:08<01:25, 6.64it/s]
reward: -6.4244, last reward: -6.2039, gradient norm: 15.48: 9%|▉ | 58/625 [00:08<01:25, 6.64it/s]
reward: -6.4850, last reward: -6.8748, gradient norm: 4.706: 9%|▉ | 58/625 [00:08<01:25, 6.64it/s]
reward: -6.4850, last reward: -6.8748, gradient norm: 4.706: 9%|▉ | 59/625 [00:08<01:25, 6.63it/s]
reward: -6.4897, last reward: -5.9210, gradient norm: 11.63: 9%|▉ | 59/625 [00:09<01:25, 6.63it/s]
reward: -6.4897, last reward: -5.9210, gradient norm: 11.63: 10%|▉ | 60/625 [00:09<01:25, 6.64it/s]
reward: -6.2299, last reward: -7.8964, gradient norm: 13.35: 10%|▉ | 60/625 [00:09<01:25, 6.64it/s]
reward: -6.2299, last reward: -7.8964, gradient norm: 13.35: 10%|▉ | 61/625 [00:09<01:24, 6.65it/s]
reward: -6.0832, last reward: -9.3934, gradient norm: 4.456: 10%|▉ | 61/625 [00:09<01:24, 6.65it/s]
reward: -6.0832, last reward: -9.3934, gradient norm: 4.456: 10%|▉ | 62/625 [00:09<01:24, 6.65it/s]
reward: -5.8971, last reward: -10.2933, gradient norm: 10.74: 10%|▉ | 62/625 [00:09<01:24, 6.65it/s]
reward: -5.8971, last reward: -10.2933, gradient norm: 10.74: 10%|█ | 63/625 [00:09<01:24, 6.64it/s]
reward: -5.3377, last reward: -4.6996, gradient norm: 23.29: 10%|█ | 63/625 [00:09<01:24, 6.64it/s]
reward: -5.3377, last reward: -4.6996, gradient norm: 23.29: 10%|█ | 64/625 [00:09<01:24, 6.63it/s]
reward: -5.2274, last reward: -2.8916, gradient norm: 4.098: 10%|█ | 64/625 [00:09<01:24, 6.63it/s]
reward: -5.2274, last reward: -2.8916, gradient norm: 4.098: 10%|█ | 65/625 [00:09<01:24, 6.64it/s]
reward: -5.2660, last reward: -4.9110, gradient norm: 12.28: 10%|█ | 65/625 [00:09<01:24, 6.64it/s]
reward: -5.2660, last reward: -4.9110, gradient norm: 12.28: 11%|█ | 66/625 [00:09<01:24, 6.64it/s]
reward: -5.4503, last reward: -5.6956, gradient norm: 12.22: 11%|█ | 66/625 [00:10<01:24, 6.64it/s]
reward: -5.4503, last reward: -5.6956, gradient norm: 12.22: 11%|█ | 67/625 [00:10<01:23, 6.65it/s]
reward: -5.9172, last reward: -5.4026, gradient norm: 7.946: 11%|█ | 67/625 [00:10<01:23, 6.65it/s]
reward: -5.9172, last reward: -5.4026, gradient norm: 7.946: 11%|█ | 68/625 [00:10<01:23, 6.65it/s]
reward: -5.9229, last reward: -4.5205, gradient norm: 6.294: 11%|█ | 68/625 [00:10<01:23, 6.65it/s]
reward: -5.9229, last reward: -4.5205, gradient norm: 6.294: 11%|█ | 69/625 [00:10<01:23, 6.65it/s]
reward: -5.8872, last reward: -5.6637, gradient norm: 8.019: 11%|█ | 69/625 [00:10<01:23, 6.65it/s]
reward: -5.8872, last reward: -5.6637, gradient norm: 8.019: 11%|█ | 70/625 [00:10<01:23, 6.64it/s]
reward: -5.9281, last reward: -4.2082, gradient norm: 5.724: 11%|█ | 70/625 [00:10<01:23, 6.64it/s]
reward: -5.9281, last reward: -4.2082, gradient norm: 5.724: 11%|█▏ | 71/625 [00:10<01:23, 6.63it/s]
reward: -5.8561, last reward: -5.6574, gradient norm: 8.357: 11%|█▏ | 71/625 [00:10<01:23, 6.63it/s]
reward: -5.8561, last reward: -5.6574, gradient norm: 8.357: 12%|█▏ | 72/625 [00:10<01:23, 6.62it/s]
reward: -5.4138, last reward: -4.5230, gradient norm: 7.385: 12%|█▏ | 72/625 [00:11<01:23, 6.62it/s]
reward: -5.4138, last reward: -4.5230, gradient norm: 7.385: 12%|█▏ | 73/625 [00:11<01:23, 6.61it/s]
reward: -5.4065, last reward: -5.5642, gradient norm: 9.921: 12%|█▏ | 73/625 [00:11<01:23, 6.61it/s]
reward: -5.4065, last reward: -5.5642, gradient norm: 9.921: 12%|█▏ | 74/625 [00:11<01:23, 6.62it/s]
reward: -4.9786, last reward: -3.2894, gradient norm: 32.73: 12%|█▏ | 74/625 [00:11<01:23, 6.62it/s]
reward: -4.9786, last reward: -3.2894, gradient norm: 32.73: 12%|█▏ | 75/625 [00:11<01:23, 6.63it/s]
reward: -5.4129, last reward: -7.5831, gradient norm: 9.266: 12%|█▏ | 75/625 [00:11<01:23, 6.63it/s]
reward: -5.4129, last reward: -7.5831, gradient norm: 9.266: 12%|█▏ | 76/625 [00:11<01:22, 6.62it/s]
reward: -5.7723, last reward: -7.4152, gradient norm: 5.608: 12%|█▏ | 76/625 [00:11<01:22, 6.62it/s]
reward: -5.7723, last reward: -7.4152, gradient norm: 5.608: 12%|█▏ | 77/625 [00:11<01:22, 6.63it/s]
reward: -6.1604, last reward: -8.0898, gradient norm: 4.389: 12%|█▏ | 77/625 [00:11<01:22, 6.63it/s]
reward: -6.1604, last reward: -8.0898, gradient norm: 4.389: 12%|█▏ | 78/625 [00:11<01:22, 6.63it/s]
reward: -6.5155, last reward: -5.5376, gradient norm: 36.34: 12%|█▏ | 78/625 [00:11<01:22, 6.63it/s]
reward: -6.5155, last reward: -5.5376, gradient norm: 36.34: 13%|█▎ | 79/625 [00:11<01:22, 6.61it/s]
reward: -6.5616, last reward: -6.4094, gradient norm: 8.283: 13%|█▎ | 79/625 [00:12<01:22, 6.61it/s]
reward: -6.5616, last reward: -6.4094, gradient norm: 8.283: 13%|█▎ | 80/625 [00:12<01:22, 6.61it/s]
reward: -6.5333, last reward: -7.4803, gradient norm: 5.895: 13%|█▎ | 80/625 [00:12<01:22, 6.61it/s]
reward: -6.5333, last reward: -7.4803, gradient norm: 5.895: 13%|█▎ | 81/625 [00:12<01:22, 6.62it/s]
reward: -6.6566, last reward: -5.2588, gradient norm: 7.662: 13%|█▎ | 81/625 [00:12<01:22, 6.62it/s]
reward: -6.6566, last reward: -5.2588, gradient norm: 7.662: 13%|█▎ | 82/625 [00:12<01:21, 6.62it/s]
reward: -6.4732, last reward: -6.7503, gradient norm: 6.068: 13%|█▎ | 82/625 [00:12<01:21, 6.62it/s]
reward: -6.4732, last reward: -6.7503, gradient norm: 6.068: 13%|█▎ | 83/625 [00:12<01:22, 6.60it/s]
reward: -6.0714, last reward: -7.3370, gradient norm: 8.059: 13%|█▎ | 83/625 [00:12<01:22, 6.60it/s]
reward: -6.0714, last reward: -7.3370, gradient norm: 8.059: 13%|█▎ | 84/625 [00:12<01:21, 6.61it/s]
reward: -5.8612, last reward: -6.1915, gradient norm: 9.3: 13%|█▎ | 84/625 [00:12<01:21, 6.61it/s]
reward: -5.8612, last reward: -6.1915, gradient norm: 9.3: 14%|█▎ | 85/625 [00:12<01:21, 6.61it/s]
reward: -5.3855, last reward: -5.0349, gradient norm: 15.2: 14%|█▎ | 85/625 [00:12<01:21, 6.61it/s]
reward: -5.3855, last reward: -5.0349, gradient norm: 15.2: 14%|█▍ | 86/625 [00:12<01:21, 6.61it/s]
reward: -4.9644, last reward: -3.4538, gradient norm: 3.445: 14%|█▍ | 86/625 [00:13<01:21, 6.61it/s]
reward: -4.9644, last reward: -3.4538, gradient norm: 3.445: 14%|█▍ | 87/625 [00:13<01:21, 6.62it/s]
reward: -5.0392, last reward: -4.4080, gradient norm: 11.45: 14%|█▍ | 87/625 [00:13<01:21, 6.62it/s]
reward: -5.0392, last reward: -4.4080, gradient norm: 11.45: 14%|█▍ | 88/625 [00:13<01:21, 6.62it/s]
reward: -5.1648, last reward: -5.9599, gradient norm: 143.4: 14%|█▍ | 88/625 [00:13<01:21, 6.62it/s]
reward: -5.1648, last reward: -5.9599, gradient norm: 143.4: 14%|█▍ | 89/625 [00:13<01:20, 6.62it/s]
reward: -5.4284, last reward: -5.5946, gradient norm: 10.3: 14%|█▍ | 89/625 [00:13<01:20, 6.62it/s]
reward: -5.4284, last reward: -5.5946, gradient norm: 10.3: 14%|█▍ | 90/625 [00:13<01:20, 6.63it/s]
reward: -5.2590, last reward: -5.9181, gradient norm: 11.15: 14%|█▍ | 90/625 [00:13<01:20, 6.63it/s]
reward: -5.2590, last reward: -5.9181, gradient norm: 11.15: 15%|█▍ | 91/625 [00:13<01:20, 6.63it/s]
reward: -5.4621, last reward: -5.9075, gradient norm: 8.674: 15%|█▍ | 91/625 [00:13<01:20, 6.63it/s]
reward: -5.4621, last reward: -5.9075, gradient norm: 8.674: 15%|█▍ | 92/625 [00:13<01:20, 6.62it/s]
reward: -5.1772, last reward: -4.9444, gradient norm: 8.351: 15%|█▍ | 92/625 [00:14<01:20, 6.62it/s]
reward: -5.1772, last reward: -4.9444, gradient norm: 8.351: 15%|█▍ | 93/625 [00:14<01:20, 6.63it/s]
reward: -4.9391, last reward: -4.5595, gradient norm: 8.1: 15%|█▍ | 93/625 [00:14<01:20, 6.63it/s]
reward: -4.9391, last reward: -4.5595, gradient norm: 8.1: 15%|█▌ | 94/625 [00:14<01:20, 6.61it/s]
reward: -4.8673, last reward: -4.6240, gradient norm: 14.43: 15%|█▌ | 94/625 [00:14<01:20, 6.61it/s]
reward: -4.8673, last reward: -4.6240, gradient norm: 14.43: 15%|█▌ | 95/625 [00:14<01:20, 6.61it/s]
reward: -4.5919, last reward: -5.0018, gradient norm: 26.09: 15%|█▌ | 95/625 [00:14<01:20, 6.61it/s]
reward: -4.5919, last reward: -5.0018, gradient norm: 26.09: 15%|█▌ | 96/625 [00:14<01:20, 6.61it/s]
reward: -5.1071, last reward: -3.9127, gradient norm: 2.251: 15%|█▌ | 96/625 [00:14<01:20, 6.61it/s]
reward: -5.1071, last reward: -3.9127, gradient norm: 2.251: 16%|█▌ | 97/625 [00:14<01:19, 6.62it/s]
reward: -4.9799, last reward: -5.3131, gradient norm: 19.65: 16%|█▌ | 97/625 [00:14<01:19, 6.62it/s]
reward: -4.9799, last reward: -5.3131, gradient norm: 19.65: 16%|█▌ | 98/625 [00:14<01:19, 6.63it/s]
reward: -4.9612, last reward: -3.9705, gradient norm: 12.55: 16%|█▌ | 98/625 [00:14<01:19, 6.63it/s]
reward: -4.9612, last reward: -3.9705, gradient norm: 12.55: 16%|█▌ | 99/625 [00:14<01:19, 6.61it/s]
reward: -4.8741, last reward: -4.2230, gradient norm: 6.19: 16%|█▌ | 99/625 [00:15<01:19, 6.61it/s]
reward: -4.8741, last reward: -4.2230, gradient norm: 6.19: 16%|█▌ | 100/625 [00:15<01:19, 6.62it/s]
reward: -5.0972, last reward: -5.0337, gradient norm: 11.86: 16%|█▌ | 100/625 [00:15<01:19, 6.62it/s]
reward: -5.0972, last reward: -5.0337, gradient norm: 11.86: 16%|█▌ | 101/625 [00:15<01:19, 6.62it/s]
reward: -5.0350, last reward: -5.0654, gradient norm: 10.83: 16%|█▌ | 101/625 [00:15<01:19, 6.62it/s]
reward: -5.0350, last reward: -5.0654, gradient norm: 10.83: 16%|█▋ | 102/625 [00:15<01:18, 6.62it/s]
reward: -5.2441, last reward: -4.4596, gradient norm: 7.362: 16%|█▋ | 102/625 [00:15<01:18, 6.62it/s]
reward: -5.2441, last reward: -4.4596, gradient norm: 7.362: 16%|█▋ | 103/625 [00:15<01:18, 6.63it/s]
reward: -5.1664, last reward: -5.4362, gradient norm: 8.171: 16%|█▋ | 103/625 [00:15<01:18, 6.63it/s]
reward: -5.1664, last reward: -5.4362, gradient norm: 8.171: 17%|█▋ | 104/625 [00:15<01:18, 6.61it/s]
reward: -5.4041, last reward: -5.6907, gradient norm: 7.77: 17%|█▋ | 104/625 [00:15<01:18, 6.61it/s]
reward: -5.4041, last reward: -5.6907, gradient norm: 7.77: 17%|█▋ | 105/625 [00:15<01:18, 6.58it/s]
reward: -5.4664, last reward: -6.2760, gradient norm: 11.19: 17%|█▋ | 105/625 [00:15<01:18, 6.58it/s]
reward: -5.4664, last reward: -6.2760, gradient norm: 11.19: 17%|█▋ | 106/625 [00:15<01:18, 6.57it/s]
reward: -5.0299, last reward: -3.9712, gradient norm: 9.349: 17%|█▋ | 106/625 [00:16<01:18, 6.57it/s]
reward: -5.0299, last reward: -3.9712, gradient norm: 9.349: 17%|█▋ | 107/625 [00:16<01:18, 6.58it/s]
reward: -4.3332, last reward: -2.4479, gradient norm: 5.772: 17%|█▋ | 107/625 [00:16<01:18, 6.58it/s]
reward: -4.3332, last reward: -2.4479, gradient norm: 5.772: 17%|█▋ | 108/625 [00:16<01:18, 6.60it/s]
reward: -4.4357, last reward: -2.9591, gradient norm: 4.543: 17%|█▋ | 108/625 [00:16<01:18, 6.60it/s]
reward: -4.4357, last reward: -2.9591, gradient norm: 4.543: 17%|█▋ | 109/625 [00:16<01:18, 6.58it/s]
reward: -4.6216, last reward: -3.1353, gradient norm: 4.692: 17%|█▋ | 109/625 [00:16<01:18, 6.58it/s]
reward: -4.6216, last reward: -3.1353, gradient norm: 4.692: 18%|█▊ | 110/625 [00:16<01:18, 6.60it/s]
reward: -4.6261, last reward: -3.7086, gradient norm: 4.496: 18%|█▊ | 110/625 [00:16<01:18, 6.60it/s]
reward: -4.6261, last reward: -3.7086, gradient norm: 4.496: 18%|█▊ | 111/625 [00:16<01:17, 6.60it/s]
reward: -4.7758, last reward: -5.9818, gradient norm: 21.71: 18%|█▊ | 111/625 [00:16<01:17, 6.60it/s]
reward: -4.7758, last reward: -5.9818, gradient norm: 21.71: 18%|█▊ | 112/625 [00:16<01:17, 6.59it/s]
reward: -4.7772, last reward: -7.5055, gradient norm: 62.86: 18%|█▊ | 112/625 [00:17<01:17, 6.59it/s]
reward: -4.7772, last reward: -7.5055, gradient norm: 62.86: 18%|█▊ | 113/625 [00:17<01:17, 6.60it/s]
reward: -4.5840, last reward: -5.3180, gradient norm: 18.74: 18%|█▊ | 113/625 [00:17<01:17, 6.60it/s]
reward: -4.5840, last reward: -5.3180, gradient norm: 18.74: 18%|█▊ | 114/625 [00:17<01:17, 6.62it/s]
reward: -4.2976, last reward: -3.2083, gradient norm: 10.63: 18%|█▊ | 114/625 [00:17<01:17, 6.62it/s]
reward: -4.2976, last reward: -3.2083, gradient norm: 10.63: 18%|█▊ | 115/625 [00:17<01:17, 6.62it/s]
reward: -4.5275, last reward: -3.6873, gradient norm: 15.65: 18%|█▊ | 115/625 [00:17<01:17, 6.62it/s]
reward: -4.5275, last reward: -3.6873, gradient norm: 15.65: 19%|█▊ | 116/625 [00:17<01:16, 6.63it/s]
reward: -4.4107, last reward: -3.1624, gradient norm: 19.7: 19%|█▊ | 116/625 [00:17<01:16, 6.63it/s]
reward: -4.4107, last reward: -3.1624, gradient norm: 19.7: 19%|█▊ | 117/625 [00:17<01:16, 6.63it/s]
reward: -4.6372, last reward: -3.2571, gradient norm: 15.83: 19%|█▊ | 117/625 [00:17<01:16, 6.63it/s]
reward: -4.6372, last reward: -3.2571, gradient norm: 15.83: 19%|█▉ | 118/625 [00:17<01:16, 6.62it/s]
reward: -4.4039, last reward: -4.4428, gradient norm: 13.06: 19%|█▉ | 118/625 [00:17<01:16, 6.62it/s]
reward: -4.4039, last reward: -4.4428, gradient norm: 13.06: 19%|█▉ | 119/625 [00:17<01:16, 6.63it/s]
reward: -4.4728, last reward: -3.5628, gradient norm: 12.04: 19%|█▉ | 119/625 [00:18<01:16, 6.63it/s]
reward: -4.4728, last reward: -3.5628, gradient norm: 12.04: 19%|█▉ | 120/625 [00:18<01:16, 6.61it/s]
reward: -4.6767, last reward: -5.2466, gradient norm: 6.522: 19%|█▉ | 120/625 [00:18<01:16, 6.61it/s]
reward: -4.6767, last reward: -5.2466, gradient norm: 6.522: 19%|█▉ | 121/625 [00:18<01:16, 6.60it/s]
reward: -4.5873, last reward: -6.5072, gradient norm: 19.21: 19%|█▉ | 121/625 [00:18<01:16, 6.60it/s]
reward: -4.5873, last reward: -6.5072, gradient norm: 19.21: 20%|█▉ | 122/625 [00:18<01:16, 6.60it/s]
reward: -4.6548, last reward: -6.3766, gradient norm: 5.692: 20%|█▉ | 122/625 [00:18<01:16, 6.60it/s]
reward: -4.6548, last reward: -6.3766, gradient norm: 5.692: 20%|█▉ | 123/625 [00:18<01:16, 6.60it/s]
reward: -4.5134, last reward: -7.1955, gradient norm: 11.11: 20%|█▉ | 123/625 [00:18<01:16, 6.60it/s]
reward: -4.5134, last reward: -7.1955, gradient norm: 11.11: 20%|█▉ | 124/625 [00:18<01:15, 6.62it/s]
reward: -4.2481, last reward: -7.0591, gradient norm: 11.85: 20%|█▉ | 124/625 [00:18<01:15, 6.62it/s]
reward: -4.2481, last reward: -7.0591, gradient norm: 11.85: 20%|██ | 125/625 [00:18<01:15, 6.60it/s]
reward: -4.4500, last reward: -5.3368, gradient norm: 10.19: 20%|██ | 125/625 [00:19<01:15, 6.60it/s]
reward: -4.4500, last reward: -5.3368, gradient norm: 10.19: 20%|██ | 126/625 [00:19<01:15, 6.60it/s]
reward: -3.9708, last reward: -2.7059, gradient norm: 42.81: 20%|██ | 126/625 [00:19<01:15, 6.60it/s]
reward: -3.9708, last reward: -2.7059, gradient norm: 42.81: 20%|██ | 127/625 [00:19<01:15, 6.61it/s]
reward: -4.3031, last reward: -3.2534, gradient norm: 4.843: 20%|██ | 127/625 [00:19<01:15, 6.61it/s]
reward: -4.3031, last reward: -3.2534, gradient norm: 4.843: 20%|██ | 128/625 [00:19<01:15, 6.61it/s]
reward: -4.3327, last reward: -4.6193, gradient norm: 20.96: 20%|██ | 128/625 [00:19<01:15, 6.61it/s]
reward: -4.3327, last reward: -4.6193, gradient norm: 20.96: 21%|██ | 129/625 [00:19<01:15, 6.61it/s]
reward: -4.4831, last reward: -4.1172, gradient norm: 24.81: 21%|██ | 129/625 [00:19<01:15, 6.61it/s]
reward: -4.4831, last reward: -4.1172, gradient norm: 24.81: 21%|██ | 130/625 [00:19<01:14, 6.61it/s]
reward: -4.2593, last reward: -4.4219, gradient norm: 5.962: 21%|██ | 130/625 [00:19<01:14, 6.61it/s]
reward: -4.2593, last reward: -4.4219, gradient norm: 5.962: 21%|██ | 131/625 [00:19<01:14, 6.61it/s]
reward: -4.4800, last reward: -3.8380, gradient norm: 2.899: 21%|██ | 131/625 [00:19<01:14, 6.61it/s]
reward: -4.4800, last reward: -3.8380, gradient norm: 2.899: 21%|██ | 132/625 [00:19<01:14, 6.59it/s]
reward: -4.2721, last reward: -4.9048, gradient norm: 7.166: 21%|██ | 132/625 [00:20<01:14, 6.59it/s]
reward: -4.2721, last reward: -4.9048, gradient norm: 7.166: 21%|██▏ | 133/625 [00:20<01:14, 6.59it/s]
reward: -4.2419, last reward: -4.5248, gradient norm: 25.93: 21%|██▏ | 133/625 [00:20<01:14, 6.59it/s]
reward: -4.2419, last reward: -4.5248, gradient norm: 25.93: 21%|██▏ | 134/625 [00:20<01:14, 6.60it/s]
reward: -4.2139, last reward: -4.4278, gradient norm: 20.26: 21%|██▏ | 134/625 [00:20<01:14, 6.60it/s]
reward: -4.2139, last reward: -4.4278, gradient norm: 20.26: 22%|██▏ | 135/625 [00:20<01:14, 6.62it/s]
reward: -4.0690, last reward: -2.5140, gradient norm: 22.5: 22%|██▏ | 135/625 [00:20<01:14, 6.62it/s]
reward: -4.0690, last reward: -2.5140, gradient norm: 22.5: 22%|██▏ | 136/625 [00:20<01:13, 6.62it/s]
reward: -4.1140, last reward: -3.7402, gradient norm: 11.11: 22%|██▏ | 136/625 [00:20<01:13, 6.62it/s]
reward: -4.1140, last reward: -3.7402, gradient norm: 11.11: 22%|██▏ | 137/625 [00:20<01:13, 6.62it/s]
reward: -4.5356, last reward: -5.1636, gradient norm: 400.1: 22%|██▏ | 137/625 [00:20<01:13, 6.62it/s]
reward: -4.5356, last reward: -5.1636, gradient norm: 400.1: 22%|██▏ | 138/625 [00:20<01:13, 6.60it/s]
reward: -5.0671, last reward: -5.8798, gradient norm: 13.34: 22%|██▏ | 138/625 [00:20<01:13, 6.60it/s]
reward: -5.0671, last reward: -5.8798, gradient norm: 13.34: 22%|██▏ | 139/625 [00:20<01:13, 6.62it/s]
reward: -4.8918, last reward: -6.3298, gradient norm: 7.307: 22%|██▏ | 140/625 [00:21<01:13, 6.62it/s]
reward: -5.1779, last reward: -4.1915, gradient norm: 11.43: 23%|██▎ | 141/625 [00:21<01:13, 6.62it/s]
reward: -5.1771, last reward: -4.3624, gradient norm: 6.936: 23%|██▎ | 142/625 [00:21<01:12, 6.63it/s]
reward: -5.1683, last reward: -3.4810, gradient norm: 13.29: 23%|██▎ | 143/625 [00:21<01:13, 6.60it/s]
reward: -4.9373, last reward: -5.4435, gradient norm: 19.33: 23%|██▎ | 144/625 [00:21<01:12, 6.61it/s]
reward: -4.4396, last reward: -4.8092, gradient norm: 118.9: 23%|██▎ | 145/625 [00:21<01:12, 6.60it/s]
reward: -4.3911, last reward: -8.2572, gradient norm: 15.04: 23%|██▎ | 146/625 [00:22<01:12, 6.62it/s]
reward: -4.4212, last reward: -3.0260, gradient norm: 26.01: 24%|██▎ | 147/625 [00:22<01:12, 6.62it/s]
reward: -4.0939, last reward: -4.6478, gradient norm: 9.605: 24%|██▎ | 148/625 [00:22<01:12, 6.62it/s]
reward: -4.6606, last reward: -4.7289, gradient norm: 11.19: 24%|██▍ | 149/625 [00:22<01:11, 6.63it/s]
reward: -4.9300, last reward: -4.7193, gradient norm: 8.563: 24%|██▍ | 150/625 [00:22<01:11, 6.63it/s]
reward: -5.1166, last reward: -4.8514, gradient norm: 8.384: 24%|██▍ | 151/625 [00:22<01:11, 6.63it/s]
reward: -4.9108, last reward: -5.0672, gradient norm: 9.292: 24%|██▍ | 152/625 [00:22<01:11, 6.63it/s]
reward: -4.8591, last reward: -4.3768, gradient norm: 9.72: 24%|██▍ | 153/625 [00:23<01:11, 6.63it/s]
reward: -4.2721, last reward: -3.9976, gradient norm: 10.37: 25%|██▍ | 154/625 [00:23<01:11, 6.63it/s]
reward: -4.0576, last reward: -2.0067, gradient norm: 8.935: 25%|██▍ | 155/625 [00:23<01:10, 6.64it/s]
reward: -4.4199, last reward: -5.1722, gradient norm: 18.7: 25%|██▍ | 156/625 [00:23<01:10, 6.64it/s]
reward: -4.8310, last reward: -7.3466, gradient norm: 28.52: 25%|██▌ | 157/625 [00:23<01:10, 6.64it/s]
reward: -4.8631, last reward: -6.2492, gradient norm: 89.17: 25%|██▌ | 158/625 [00:23<01:10, 6.64it/s]
reward: -4.8763, last reward: -6.1277, gradient norm: 24.43: 25%|██▌ | 159/625 [00:24<01:10, 6.64it/s]
reward: -4.5562, last reward: -5.7446, gradient norm: 23.35: 26%|██▌ | 160/625 [00:24<01:10, 6.64it/s]
reward: -4.1082, last reward: -4.9830, gradient norm: 22.14: 26%|██▌ | 161/625 [00:24<01:09, 6.64it/s]
reward: -4.0946, last reward: -2.5229, gradient norm: 10.47: 26%|██▌ | 162/625 [00:24<01:09, 6.64it/s]
reward: -4.4574, last reward: -4.6900, gradient norm: 112.6: 26%|██▌ | 163/625 [00:24<01:09, 6.63it/s]
reward: -5.2229, last reward: -4.0318, gradient norm: 6.482: 26%|██▌ | 164/625 [00:24<01:09, 6.64it/s]
reward: -5.0543, last reward: -4.0817, gradient norm: 5.761: 26%|██▋ | 165/625 [00:24<01:09, 6.64it/s]
reward: -5.2809, last reward: -4.5118, gradient norm: 5.366: 27%|██▋ | 166/625 [00:25<01:08, 6.65it/s]
reward: -5.1142, last reward: -4.5635, gradient norm: 5.04: 27%|██▋ | 167/625 [00:25<01:08, 6.64it/s]
reward: -5.1949, last reward: -4.2327, gradient norm: 4.982: 27%|██▋ | 168/625 [00:25<01:08, 6.65it/s]
reward: -5.0967, last reward: -5.0387, gradient norm: 7.457: 27%|██▋ | 169/625 [00:25<01:08, 6.65it/s]
reward: -5.0782, last reward: -5.2150, gradient norm: 10.54: 27%|██▋ | 170/625 [00:25<01:08, 6.65it/s]
reward: -4.5222, last reward: -4.3725, gradient norm: 22.63: 27%|██▋ | 171/625 [00:25<01:08, 6.64it/s]
reward: -3.9288, last reward: -3.9837, gradient norm: 83.59: 28%|██▊ | 172/625 [00:25<01:08, 6.63it/s]
reward: -4.1416, last reward: -4.1099, gradient norm: 30.57: 28%|██▊ | 173/625 [00:26<01:08, 6.62it/s]
reward: -4.8620, last reward: -6.8475, gradient norm: 18.91: 28%|██▊ | 174/625 [00:26<01:08, 6.62it/s]
reward: -5.1807, last reward: -6.4375, gradient norm: 18.48: 28%|██▊ | 175/625 [00:26<01:07, 6.63it/s]
reward: -5.1148, last reward: -5.0645, gradient norm: 14.36: 28%|██▊ | 176/625 [00:26<01:07, 6.62it/s]
reward: -5.2751, last reward: -4.8313, gradient norm: 15.32: 28%|██▊ | 177/625 [00:26<01:07, 6.63it/s]
reward: -4.9286, last reward: -6.9770, gradient norm: 24.75: 28%|██▊ | 178/625 [00:26<01:07, 6.62it/s]
reward: -4.5735, last reward: -5.2837, gradient norm: 15.2: 29%|██▊ | 179/625 [00:27<01:07, 6.63it/s]
reward: -4.2926, last reward: -1.9489, gradient norm: 18.24: 29%|██▉ | 180/625 [00:27<01:07, 6.63it/s]
reward: -4.1507, last reward: -3.5593, gradient norm: 37.66: 29%|██▉ | 181/625 [00:27<01:06, 6.64it/s]
reward: -3.8724, last reward: -4.3567, gradient norm: 16.67: 29%|██▉ | 182/625 [00:27<01:06, 6.64it/s]
reward: -4.3574, last reward: -3.6140, gradient norm: 13.96: 29%|██▉ | 183/625 [00:27<01:06, 6.63it/s]
reward: -4.7895, last reward: -6.2518, gradient norm: 14.74: 29%|██▉ | 184/625 [00:27<01:06, 6.63it/s]
reward: -4.6146, last reward: -5.6969, gradient norm: 11.45: 30%|██▉ | 185/625 [00:27<01:06, 6.63it/s]
reward: -4.8776, last reward: -5.7358, gradient norm: 13.16: 30%|██▉ | 186/625 [00:28<01:06, 6.62it/s]
reward: -4.3722, last reward: -4.8428, gradient norm: 23.57: 30%|██▉ | 187/625 [00:28<01:06, 6.62it/s]
reward: -4.2656, last reward: -3.7955, gradient norm: 54.67: 30%|███ | 188/625 [00:28<01:05, 6.63it/s]
reward: -4.0092, last reward: -1.7106, gradient norm: 7.829: 30%|███ | 189/625 [00:28<01:30, 4.79it/s]
reward: -4.2264, last reward: -3.6919, gradient norm: 16.17: 30%|███ | 190/625 [00:28<01:23, 5.23it/s]
reward: -4.1438, last reward: -2.1362, gradient norm: 19.43: 31%|███ | 191/625 [00:29<01:17, 5.58it/s]
reward: -4.0618, last reward: -2.8217, gradient norm: 73.63: 31%|███ | 192/625 [00:29<01:13, 5.85it/s]
reward: -3.9420, last reward: -3.6765, gradient norm: 34.1: 31%|███ | 193/625 [00:29<01:11, 6.07it/s]
reward: -3.7745, last reward: -4.0709, gradient norm: 26.48: 31%|███ | 194/625 [00:29<01:09, 6.23it/s]
reward: -3.9478, last reward: -2.6867, gradient norm: 22.82: 31%|███ | 195/625 [00:29<01:07, 6.34it/s]
reward: -3.6507, last reward: -2.6225, gradient norm: 37.44: 31%|███▏ | 196/625 [00:29<01:06, 6.40it/s]
reward: -4.2244, last reward: -3.2195, gradient norm: 10.71: 32%|███▏ | 197/625 [00:29<01:07, 6.37it/s]
reward: -4.5385, last reward: -3.9263, gradient norm: 31.03: 32%|███▏ | 198/625 [00:30<01:06, 6.44it/s]
reward: -4.1878, last reward: -3.2374, gradient norm: 34.35: 32%|███▏ | 199/625 [00:30<01:05, 6.49it/s]
reward: -3.8054, last reward: -2.3504, gradient norm: 5.557: 32%|███▏ | 200/625 [00:30<01:05, 6.52it/s]
reward: -4.0766, last reward: -4.6825, gradient norm: 38.72: 32%|███▏ | 201/625 [00:30<01:04, 6.55it/s]
reward: -4.2011, last reward: -5.8393, gradient norm: 21.06: 32%|███▏ | 202/625 [00:30<01:04, 6.56it/s]
reward: -4.0803, last reward: -3.7815, gradient norm: 10.6: 32%|███▏ | 203/625 [00:30<01:04, 6.58it/s]
reward: -3.8363, last reward: -3.2460, gradient norm: 32.57: 33%|███▎ | 204/625 [00:30<01:03, 6.59it/s]
reward: -3.8643, last reward: -3.2191, gradient norm: 8.593: 33%|███▎ | 205/625 [00:31<01:03, 6.59it/s]
reward: -4.0773, last reward: -5.1343, gradient norm: 14.49: 33%|███▎ | 206/625 [00:31<01:03, 6.60it/s]
reward: -4.1400, last reward: -5.8657, gradient norm: 17.05: 33%|███▎ | 207/625 [00:31<01:03, 6.60it/s]
reward: -3.9304, last reward: -2.7584, gradient norm: 33.25: 33%|███▎ | 208/625 [00:31<01:03, 6.60it/s]
reward: -3.8752, last reward: -4.2307, gradient norm: 10.76: 33%|███▎ | 209/625 [00:31<01:02, 6.61it/s]
reward: -3.5250, last reward: -1.4869, gradient norm: 40.8: 34%|███▎ | 210/625 [00:31<01:02, 6.61it/s]
reward: -3.7837, last reward: -2.5762, gradient norm: 193.3: 34%|███▍ | 211/625 [00:32<01:02, 6.61it/s]
reward: -3.6661, last reward: -1.8600, gradient norm: 136.5: 34%|███▍ | 212/625 [00:32<01:02, 6.61it/s]
reward: -4.2502, last reward: -3.1752, gradient norm: 21.44: 34%|███▍ | 213/625 [00:32<01:02, 6.61it/s]
reward: -4.3075, last reward: -2.8871, gradient norm: 30.65: 34%|███▍ | 214/625 [00:32<01:02, 6.58it/s]
reward: -3.9406, last reward: -2.8090, gradient norm: 20.18: 34%|███▍ | 215/625 [00:32<01:02, 6.54it/s]
reward: -3.6291, last reward: -2.8923, gradient norm: 7.876: 35%|███▍ | 216/625 [00:32<01:02, 6.52it/s]
reward: -3.5112, last reward: -3.9504, gradient norm: 3.21e+03: 35%|███▍ | 217/625 [00:32<01:02, 6.52it/s]
reward: -3.7431, last reward: -2.7880, gradient norm: 13.73: 35%|███▍ | 218/625 [00:33<01:02, 6.52it/s]
reward: -3.4463, last reward: -4.5432, gradient norm: 32.37: 35%|███▌ | 219/625 [00:33<01:02, 6.51it/s]
reward: -3.3793, last reward: -3.3313, gradient norm: 60.63: 35%|███▌ | 220/625 [00:33<01:02, 6.51it/s]
reward: -3.8843, last reward: -3.0369, gradient norm: 5.065: 35%|███▌ | 221/625 [00:33<01:02, 6.51it/s]
reward: -3.4828, last reward: -3.8391, gradient norm: 59.85: 36%|███▌ | 222/625 [00:33<01:01, 6.51it/s]
reward: -3.6265, last reward: -4.2913, gradient norm: 8.947: 36%|███▌ | 223/625 [00:33<01:01, 6.50it/s]
reward: -3.5541, last reward: -4.1252, gradient norm: 255.9: 36%|███▌ | 224/625 [00:34<01:01, 6.49it/s]
reward: -3.7342, last reward: -2.2396, gradient norm: 7.995: 36%|███▌ | 225/625 [00:34<01:01, 6.48it/s]
reward: -3.5936, last reward: -4.1924, gradient norm: 59.49: 36%|███▌ | 226/625 [00:34<01:01, 6.48it/s]
reward: -3.9975, last reward: -4.2045, gradient norm: 21.77: 36%|███▋ | 227/625 [00:34<01:01, 6.48it/s]
reward: -3.8367, last reward: -1.9540, gradient norm: 32.26: 36%|███▋ | 228/625 [00:34<01:01, 6.48it/s]
reward: -3.7259, last reward: -3.6743, gradient norm: 28.62: 37%|███▋ | 229/625 [00:34<01:01, 6.41it/s]
reward: -3.4827, last reward: -3.7528, gradient norm: 64.85: 37%|███▋ | 230/625 [00:34<01:01, 6.46it/s]
reward: -3.7361, last reward: -3.8756, gradient norm: 24.69: 37%|███▋ | 231/625 [00:35<01:00, 6.51it/s]
reward: -3.7646, last reward: -3.1116, gradient norm: 14.25: 37%|███▋ | 232/625 [00:35<01:00, 6.53it/s]
reward: -3.5426, last reward: -2.8385, gradient norm: 34.07: 37%|███▋ | 233/625 [00:35<00:59, 6.56it/s]
reward: -3.5662, last reward: -1.8585, gradient norm: 11.26: 37%|███▋ | 234/625 [00:35<00:59, 6.58it/s]
reward: -3.8234, last reward: -2.7930, gradient norm: 32.18: 38%|███▊ | 235/625 [00:35<00:59, 6.59it/s]
reward: -4.2648, last reward: -4.9309, gradient norm: 24.83: 38%|███▊ | 236/625 [00:35<00:59, 6.59it/s]
reward: -4.2039, last reward: -3.6817, gradient norm: 19.24: 38%|███▊ | 237/625 [00:36<00:58, 6.60it/s]
reward: -4.0943, last reward: -3.1533, gradient norm: 145.1: 38%|███▊ | 238/625 [00:36<00:58, 6.61it/s]
reward: -4.3045, last reward: -3.0483, gradient norm: 20.89: 38%|███▊ | 239/625 [00:36<00:58, 6.56it/s]
reward: -4.4128, last reward: -5.2528, gradient norm: 24.97: 38%|███▊ | 240/625 [00:36<00:58, 6.56it/s]
reward: -4.6415, last reward: -8.0201, gradient norm: 26.74: 39%|███▊ | 241/625 [00:36<00:58, 6.53it/s]
reward: -4.4437, last reward: -5.4365, gradient norm: 132.7: 39%|███▊ | 242/625 [00:36<00:58, 6.53it/s]
reward: -4.0358, last reward: -3.4943, gradient norm: 11.46: 39%|███▉ | 243/625 [00:36<00:58, 6.52it/s]
reward: -4.1272, last reward: -3.5003, gradient norm: 68.09: 39%|███▉ | 244/625 [00:37<00:58, 6.52it/s]
reward: -4.1180, last reward: -4.2637, gradient norm: 39.25: 39%|███▉ | 245/625 [00:37<00:58, 6.52it/s]
reward: -4.7197, last reward: -3.0873, gradient norm: 12.2: 39%|███▉ | 246/625 [00:37<00:58, 6.52it/s]
reward: -4.2917, last reward: -3.6656, gradient norm: 17.17: 40%|███▉ | 247/625 [00:37<00:58, 6.51it/s]
reward: -4.0160, last reward: -3.0738, gradient norm: 43.07: 40%|███▉ | 248/625 [00:37<00:57, 6.51it/s]
reward: -4.3689, last reward: -4.0120, gradient norm: 11.81: 40%|███▉ | 249/625 [00:37<00:57, 6.51it/s]
reward: -4.5570, last reward: -7.0475, gradient norm: 22.45: 40%|████ | 250/625 [00:38<00:57, 6.49it/s]
reward: -4.4423, last reward: -5.2220, gradient norm: 18.4: 40%|████ | 251/625 [00:38<00:57, 6.48it/s]
reward: -4.2118, last reward: -4.6803, gradient norm: 15.86: 40%|████ | 252/625 [00:38<00:57, 6.48it/s]
reward: -4.1465, last reward: -3.7214, gradient norm: 25.93: 40%|████ | 253/625 [00:38<00:57, 6.49it/s]
reward: -3.8801, last reward: -2.7034, gradient norm: 103.6: 41%|████ | 254/625 [00:38<00:57, 6.48it/s]
reward: -3.9136, last reward: -4.4076, gradient norm: 17.63: 41%|████ | 255/625 [00:38<00:56, 6.50it/s]
reward: -3.7589, last reward: -4.5013, gradient norm: 143.3: 41%|████ | 256/625 [00:38<00:56, 6.49it/s]
reward: -3.8150, last reward: -3.2241, gradient norm: 113.9: 41%|████ | 257/625 [00:39<00:56, 6.51it/s]
reward: -4.0753, last reward: -3.8081, gradient norm: 14.8: 41%|████▏ | 258/625 [00:39<00:56, 6.50it/s]
reward: -4.1951, last reward: -4.8314, gradient norm: 27.63: 41%|████▏ | 259/625 [00:39<00:56, 6.53it/s]
reward: -4.0038, last reward: -2.5333, gradient norm: 42.85: 42%|████▏ | 260/625 [00:39<00:56, 6.51it/s]
reward: -4.0889, last reward: -2.4616, gradient norm: 13.78: 42%|████▏ | 261/625 [00:39<00:55, 6.51it/s]
reward: -4.0655, last reward: -2.6873, gradient norm: 10.98: 42%|████▏ | 262/625 [00:39<00:55, 6.51it/s]
reward: -3.8333, last reward: -1.9476, gradient norm: 13.47: 42%|████▏ | 263/625 [00:40<00:55, 6.50it/s]
reward: -3.7554, last reward: -4.3798, gradient norm: 41.76: 42%|████▏ | 264/625 [00:40<00:55, 6.49it/s]
reward: -3.3717, last reward: -2.3947, gradient norm: 6.529: 42%|████▏ | 265/625 [00:40<00:55, 6.50it/s]
reward: -4.3060, last reward: -4.6495, gradient norm: 11.24: 43%|████▎ | 266/625 [00:40<00:55, 6.49it/s]
reward: -4.7467, last reward: -5.8889, gradient norm: 12.35: 43%|████▎ | 267/625 [00:40<00:55, 6.50it/s]
reward: -4.9281, last reward: -4.8457, gradient norm: 6.591: 43%|████▎ | 268/625 [00:40<00:55, 6.49it/s]
reward: -4.7137, last reward: -4.0536, gradient norm: 5.771: 43%|████▎ | 269/625 [00:40<00:54, 6.48it/s]
reward: -4.7197, last reward: -4.1651, gradient norm: 5.388: 43%|████▎ | 270/625 [00:41<00:54, 6.47it/s]
reward: -4.8246, last reward: -5.5709, gradient norm: 8.281: 43%|████▎ | 271/625 [00:41<00:54, 6.48it/s]
reward: -4.7502, last reward: -5.0521, gradient norm: 9.032: 44%|████▎ | 272/625 [00:41<00:54, 6.52it/s]
reward: -4.5475, last reward: -4.7253, gradient norm: 21.18: 44%|████▎ | 273/625 [00:41<00:53, 6.53it/s]
reward: -4.2856, last reward: -3.7130, gradient norm: 13.53: 44%|████▍ | 274/625 [00:41<00:53, 6.51it/s]
reward: -3.2778, last reward: -3.4122, gradient norm: 28.52: 44%|████▍ | 275/625 [00:41<00:53, 6.52it/s]
reward: -3.8368, last reward: -2.1841, gradient norm: 2.07: 44%|████▍ | 276/625 [00:42<00:53, 6.51it/s]
reward: -3.9622, last reward: -3.1603, gradient norm: 1.003e+03: 44%|████▍ | 277/625 [00:42<00:53, 6.53it/s]
reward: -4.0247, last reward: -2.9830, gradient norm: 8.346: 44%|████▍ | 278/625 [00:42<00:53, 6.51it/s]
reward: -4.2238, last reward: -4.6418, gradient norm: 14.55: 45%|████▍ | 279/625 [00:42<00:53, 6.51it/s]
reward: -4.0626, last reward: -4.2538, gradient norm: 17.88: 45%|████▍ | 280/625 [00:42<00:53, 6.50it/s]
reward: -4.0149, last reward: -3.7380, gradient norm: 13.13: 45%|████▍ | 281/625 [00:42<00:52, 6.50it/s]
reward: -4.2167, last reward: -2.8911, gradient norm: 11.41: 45%|████▌ | 282/625 [00:42<00:52, 6.47it/s]
reward: -3.8725, last reward: -4.1983, gradient norm: 18.88: 45%|████▌ | 283/625 [00:43<00:52, 6.48it/s]
reward: -2.8142, last reward: -2.3709, gradient norm: 43.73: 45%|████▌ | 284/625 [00:43<00:52, 6.47it/s]
reward: -3.2022, last reward: -2.4989, gradient norm: 11.14: 46%|████▌ | 285/625 [00:43<00:52, 6.50it/s]
reward: -3.6464, last reward: -1.6210, gradient norm: 43.37: 46%|████▌ | 286/625 [00:43<00:52, 6.50it/s]
reward: -3.9726, last reward: -3.0820, gradient norm: 39.93: 46%|████▌ | 287/625 [00:43<00:52, 6.49it/s]
reward: -3.6975, last reward: -2.9091, gradient norm: 29.46: 46%|████▌ | 288/625 [00:43<00:51, 6.49it/s]
reward: -3.4926, last reward: -2.4791, gradient norm: 160.7: 46%|████▌ | 289/625 [00:44<00:52, 6.40it/s]
reward: -3.0905, last reward: -1.3500, gradient norm: 31.38: 46%|████▋ | 290/625 [00:44<00:51, 6.46it/s]
reward: -3.2287, last reward: -2.7137, gradient norm: 26.31: 47%|████▋ | 291/625 [00:44<00:51, 6.50it/s]
reward: -2.9918, last reward: -1.5543, gradient norm: 29.73: 47%|████▋ | 292/625 [00:44<00:50, 6.54it/s]
reward: -2.9245, last reward: -0.6444, gradient norm: 2.631: 47%|████▋ | 293/625 [00:44<00:50, 6.56it/s]
reward: -3.0448, last reward: -0.4769, gradient norm: 7.266: 47%|████▋ | 293/625 [00:44<00:50, 6.56it/s]
reward: -3.0448, last reward: -0.4769, gradient norm: 7.266: 47%|████▋ | 294/625 [00:44<00:50, 6.58it/s]
reward: -2.8566, last reward: -1.7208, gradient norm: 25.22: 47%|████▋ | 294/625 [00:44<00:50, 6.58it/s]
reward: -2.8566, last reward: -1.7208, gradient norm: 25.22: 47%|████▋ | 295/625 [00:44<00:50, 6.60it/s]
reward: -2.8872, last reward: -1.0966, gradient norm: 8.247: 47%|████▋ | 295/625 [00:45<00:50, 6.60it/s]
reward: -2.8872, last reward: -1.0966, gradient norm: 8.247: 47%|████▋ | 296/625 [00:45<00:49, 6.61it/s]
reward: -2.5303, last reward: -0.1537, gradient norm: 2.023: 47%|████▋ | 296/625 [00:45<00:49, 6.61it/s]
reward: -2.5303, last reward: -0.1537, gradient norm: 2.023: 48%|████▊ | 297/625 [00:45<00:49, 6.62it/s]
reward: -2.6817, last reward: -0.2682, gradient norm: 7.564: 48%|████▊ | 297/625 [00:45<00:49, 6.62it/s]
reward: -2.6817, last reward: -0.2682, gradient norm: 7.564: 48%|████▊ | 298/625 [00:45<00:49, 6.63it/s]
reward: -2.4318, last reward: -0.5063, gradient norm: 14.87: 48%|████▊ | 298/625 [00:45<00:49, 6.63it/s]
reward: -2.4318, last reward: -0.5063, gradient norm: 14.87: 48%|████▊ | 299/625 [00:45<00:49, 6.62it/s]
reward: -2.7475, last reward: -1.4190, gradient norm: 21.66: 48%|████▊ | 299/625 [00:45<00:49, 6.62it/s]
reward: -2.7475, last reward: -1.4190, gradient norm: 21.66: 48%|████▊ | 300/625 [00:45<00:49, 6.63it/s]
reward: -2.8186, last reward: -2.5077, gradient norm: 22.4: 48%|████▊ | 300/625 [00:45<00:49, 6.63it/s]
reward: -2.8186, last reward: -2.5077, gradient norm: 22.4: 48%|████▊ | 301/625 [00:45<00:48, 6.63it/s]
reward: -3.1883, last reward: -1.5291, gradient norm: 7.472: 48%|████▊ | 301/625 [00:46<00:48, 6.63it/s]
reward: -3.1883, last reward: -1.5291, gradient norm: 7.472: 48%|████▊ | 302/625 [00:46<00:48, 6.63it/s]
reward: -2.1256, last reward: -0.3998, gradient norm: 11.01: 48%|████▊ | 302/625 [00:46<00:48, 6.63it/s]
reward: -2.1256, last reward: -0.3998, gradient norm: 11.01: 48%|████▊ | 303/625 [00:46<00:48, 6.63it/s]
reward: -2.3622, last reward: -0.0930, gradient norm: 1.626: 48%|████▊ | 303/625 [00:46<00:48, 6.63it/s]
reward: -2.3622, last reward: -0.0930, gradient norm: 1.626: 49%|████▊ | 304/625 [00:46<00:48, 6.63it/s]
reward: -1.9500, last reward: -0.0075, gradient norm: 0.5664: 49%|████▊ | 304/625 [00:46<00:48, 6.63it/s]
reward: -1.9500, last reward: -0.0075, gradient norm: 0.5664: 49%|████▉ | 305/625 [00:46<00:48, 6.64it/s]
reward: -2.5697, last reward: -0.3024, gradient norm: 22.61: 49%|████▉ | 305/625 [00:46<00:48, 6.64it/s]
reward: -2.5697, last reward: -0.3024, gradient norm: 22.61: 49%|████▉ | 306/625 [00:46<00:48, 6.64it/s]
reward: -2.3117, last reward: -0.0052, gradient norm: 1.006: 49%|████▉ | 306/625 [00:46<00:48, 6.64it/s]
reward: -2.3117, last reward: -0.0052, gradient norm: 1.006: 49%|████▉ | 307/625 [00:46<00:47, 6.63it/s]
reward: -2.0981, last reward: -0.0018, gradient norm: 0.9312: 49%|████▉ | 307/625 [00:46<00:47, 6.63it/s]
reward: -2.0981, last reward: -0.0018, gradient norm: 0.9312: 49%|████▉ | 308/625 [00:46<00:47, 6.63it/s]
reward: -2.5140, last reward: -0.3873, gradient norm: 3.93: 49%|████▉ | 308/625 [00:47<00:47, 6.63it/s]
reward: -2.5140, last reward: -0.3873, gradient norm: 3.93: 49%|████▉ | 309/625 [00:47<00:47, 6.63it/s]
reward: -2.0411, last reward: -0.2650, gradient norm: 3.183: 49%|████▉ | 309/625 [00:47<00:47, 6.63it/s]
reward: -2.0411, last reward: -0.2650, gradient norm: 3.183: 50%|████▉ | 310/625 [00:47<00:47, 6.63it/s]
reward: -2.1656, last reward: -0.0228, gradient norm: 2.004: 50%|████▉ | 310/625 [00:47<00:47, 6.63it/s]
reward: -2.1656, last reward: -0.0228, gradient norm: 2.004: 50%|████▉ | 311/625 [00:47<00:47, 6.62it/s]
reward: -2.1196, last reward: -0.2478, gradient norm: 11.78: 50%|████▉ | 311/625 [00:47<00:47, 6.62it/s]
reward: -2.1196, last reward: -0.2478, gradient norm: 11.78: 50%|████▉ | 312/625 [00:47<00:47, 6.62it/s]
reward: -2.7353, last reward: -3.0812, gradient norm: 82.91: 50%|████▉ | 312/625 [00:47<00:47, 6.62it/s]
reward: -2.7353, last reward: -3.0812, gradient norm: 82.91: 50%|█████ | 313/625 [00:47<00:47, 6.62it/s]
reward: -3.0995, last reward: -2.3022, gradient norm: 8.758: 50%|█████ | 313/625 [00:47<00:47, 6.62it/s]
reward: -3.0995, last reward: -2.3022, gradient norm: 8.758: 50%|█████ | 314/625 [00:47<00:46, 6.62it/s]
reward: -3.1406, last reward: -2.4626, gradient norm: 15.99: 50%|█████ | 314/625 [00:47<00:46, 6.62it/s]
reward: -3.1406, last reward: -2.4626, gradient norm: 15.99: 50%|█████ | 315/625 [00:47<00:46, 6.62it/s]
reward: -3.2156, last reward: -1.9055, gradient norm: 7.851: 50%|█████ | 315/625 [00:48<00:46, 6.62it/s]
reward: -3.2156, last reward: -1.9055, gradient norm: 7.851: 51%|█████ | 316/625 [00:48<00:46, 6.62it/s]
reward: -3.1953, last reward: -2.3774, gradient norm: 19.78: 51%|█████ | 316/625 [00:48<00:46, 6.62it/s]
reward: -3.1953, last reward: -2.3774, gradient norm: 19.78: 51%|█████ | 317/625 [00:48<00:46, 6.63it/s]
reward: -2.6385, last reward: -0.9917, gradient norm: 16.15: 51%|█████ | 317/625 [00:48<00:46, 6.63it/s]
reward: -2.6385, last reward: -0.9917, gradient norm: 16.15: 51%|█████ | 318/625 [00:48<00:46, 6.62it/s]
reward: -2.2764, last reward: -0.0536, gradient norm: 2.905: 51%|█████ | 318/625 [00:48<00:46, 6.62it/s]
reward: -2.2764, last reward: -0.0536, gradient norm: 2.905: 51%|█████ | 319/625 [00:48<00:46, 6.62it/s]
reward: -2.6391, last reward: -1.9317, gradient norm: 23.78: 51%|█████ | 319/625 [00:48<00:46, 6.62it/s]
reward: -2.6391, last reward: -1.9317, gradient norm: 23.78: 51%|█████ | 320/625 [00:48<00:46, 6.62it/s]
reward: -2.9748, last reward: -4.2679, gradient norm: 59.43: 51%|█████ | 320/625 [00:48<00:46, 6.62it/s]
reward: -2.9748, last reward: -4.2679, gradient norm: 59.43: 51%|█████▏ | 321/625 [00:48<00:45, 6.62it/s]
reward: -2.8495, last reward: -4.5125, gradient norm: 52.19: 51%|█████▏ | 321/625 [00:49<00:45, 6.62it/s]
reward: -2.8495, last reward: -4.5125, gradient norm: 52.19: 52%|█████▏ | 322/625 [00:49<00:45, 6.62it/s]
reward: -2.8177, last reward: -2.6602, gradient norm: 52.75: 52%|█████▏ | 322/625 [00:49<00:45, 6.62it/s]
reward: -2.8177, last reward: -2.6602, gradient norm: 52.75: 52%|█████▏ | 323/625 [00:49<00:45, 6.62it/s]
reward: -2.0704, last reward: -0.5776, gradient norm: 59.07: 52%|█████▏ | 323/625 [00:49<00:45, 6.62it/s]
reward: -2.0704, last reward: -0.5776, gradient norm: 59.07: 52%|█████▏ | 324/625 [00:49<00:45, 6.62it/s]
reward: -1.9833, last reward: -0.1339, gradient norm: 4.402: 52%|█████▏ | 324/625 [00:49<00:45, 6.62it/s]
reward: -1.9833, last reward: -0.1339, gradient norm: 4.402: 52%|█████▏ | 325/625 [00:49<00:45, 6.62it/s]
reward: -2.2760, last reward: -2.1238, gradient norm: 30.36: 52%|█████▏ | 325/625 [00:49<00:45, 6.62it/s]
reward: -2.2760, last reward: -2.1238, gradient norm: 30.36: 52%|█████▏ | 326/625 [00:49<00:45, 6.63it/s]
reward: -2.9299, last reward: -5.0227, gradient norm: 100.5: 52%|█████▏ | 326/625 [00:49<00:45, 6.63it/s]
reward: -2.9299, last reward: -5.0227, gradient norm: 100.5: 52%|█████▏ | 327/625 [00:49<00:44, 6.63it/s]
reward: -2.7727, last reward: -2.1607, gradient norm: 336.7: 52%|█████▏ | 327/625 [00:49<00:44, 6.63it/s]
reward: -2.7727, last reward: -2.1607, gradient norm: 336.7: 52%|█████▏ | 328/625 [00:49<00:44, 6.62it/s]
reward: -2.3958, last reward: -0.3223, gradient norm: 2.763: 52%|█████▏ | 328/625 [00:50<00:44, 6.62it/s]
reward: -2.3958, last reward: -0.3223, gradient norm: 2.763: 53%|█████▎ | 329/625 [00:50<00:44, 6.63it/s]
reward: -2.4742, last reward: -0.1797, gradient norm: 47.32: 53%|█████▎ | 329/625 [00:50<00:44, 6.63it/s]
reward: -2.4742, last reward: -0.1797, gradient norm: 47.32: 53%|█████▎ | 330/625 [00:50<00:44, 6.63it/s]
reward: -2.0144, last reward: -0.0085, gradient norm: 4.791: 53%|█████▎ | 330/625 [00:50<00:44, 6.63it/s]
reward: -2.0144, last reward: -0.0085, gradient norm: 4.791: 53%|█████▎ | 331/625 [00:50<00:44, 6.64it/s]
reward: -1.8284, last reward: -0.0428, gradient norm: 12.29: 53%|█████▎ | 331/625 [00:50<00:44, 6.64it/s]
reward: -1.8284, last reward: -0.0428, gradient norm: 12.29: 53%|█████▎ | 332/625 [00:50<00:44, 6.64it/s]
reward: -2.5229, last reward: -0.0098, gradient norm: 0.7365: 53%|█████▎ | 332/625 [00:50<00:44, 6.64it/s]
reward: -2.5229, last reward: -0.0098, gradient norm: 0.7365: 53%|█████▎ | 333/625 [00:50<00:44, 6.63it/s]
reward: -2.4566, last reward: -0.0781, gradient norm: 2.086: 53%|█████▎ | 333/625 [00:50<00:44, 6.63it/s]
reward: -2.4566, last reward: -0.0781, gradient norm: 2.086: 53%|█████▎ | 334/625 [00:50<00:43, 6.64it/s]
reward: -2.3355, last reward: -0.0230, gradient norm: 1.311: 53%|█████▎ | 334/625 [00:50<00:43, 6.64it/s]
reward: -2.3355, last reward: -0.0230, gradient norm: 1.311: 54%|█████▎ | 335/625 [00:50<00:43, 6.63it/s]
reward: -1.9346, last reward: -0.0423, gradient norm: 1.076: 54%|█████▎ | 335/625 [00:51<00:43, 6.63it/s]
reward: -1.9346, last reward: -0.0423, gradient norm: 1.076: 54%|█████▍ | 336/625 [00:51<00:43, 6.63it/s]
reward: -2.3711, last reward: -0.1335, gradient norm: 0.6855: 54%|█████▍ | 336/625 [00:51<00:43, 6.63it/s]
reward: -2.3711, last reward: -0.1335, gradient norm: 0.6855: 54%|█████▍ | 337/625 [00:51<00:43, 6.63it/s]
reward: -2.0304, last reward: -0.0023, gradient norm: 0.8459: 54%|█████▍ | 337/625 [00:51<00:43, 6.63it/s]
reward: -2.0304, last reward: -0.0023, gradient norm: 0.8459: 54%|█████▍ | 338/625 [00:51<00:43, 6.63it/s]
reward: -1.9998, last reward: -0.4399, gradient norm: 13.1: 54%|█████▍ | 338/625 [00:51<00:43, 6.63it/s]
reward: -1.9998, last reward: -0.4399, gradient norm: 13.1: 54%|█████▍ | 339/625 [00:51<00:43, 6.64it/s]
reward: -2.2303, last reward: -2.1346, gradient norm: 45.99: 54%|█████▍ | 339/625 [00:51<00:43, 6.64it/s]
reward: -2.2303, last reward: -2.1346, gradient norm: 45.99: 54%|█████▍ | 340/625 [00:51<00:43, 6.63it/s]
reward: -2.2915, last reward: -1.7116, gradient norm: 40.34: 54%|█████▍ | 340/625 [00:51<00:43, 6.63it/s]
reward: -2.2915, last reward: -1.7116, gradient norm: 40.34: 55%|█████▍ | 341/625 [00:51<00:42, 6.62it/s]
reward: -2.5560, last reward: -0.0487, gradient norm: 1.195: 55%|█████▍ | 341/625 [00:52<00:42, 6.62it/s]
reward: -2.5560, last reward: -0.0487, gradient norm: 1.195: 55%|█████▍ | 342/625 [00:52<00:42, 6.62it/s]
reward: -2.5119, last reward: -0.0358, gradient norm: 1.061: 55%|█████▍ | 342/625 [00:52<00:42, 6.62it/s]
reward: -2.5119, last reward: -0.0358, gradient norm: 1.061: 55%|█████▍ | 343/625 [00:52<00:42, 6.63it/s]
reward: -2.3305, last reward: -0.3705, gradient norm: 1.957: 55%|█████▍ | 343/625 [00:52<00:42, 6.63it/s]
reward: -2.3305, last reward: -0.3705, gradient norm: 1.957: 55%|█████▌ | 344/625 [00:52<00:42, 6.64it/s]
reward: -2.6068, last reward: -0.2112, gradient norm: 13.83: 55%|█████▌ | 344/625 [00:52<00:42, 6.64it/s]
reward: -2.6068, last reward: -0.2112, gradient norm: 13.83: 55%|█████▌ | 345/625 [00:52<00:42, 6.64it/s]
reward: -2.5731, last reward: -1.8455, gradient norm: 66.75: 55%|█████▌ | 345/625 [00:52<00:42, 6.64it/s]
reward: -2.5731, last reward: -1.8455, gradient norm: 66.75: 55%|█████▌ | 346/625 [00:52<00:42, 6.63it/s]
reward: -2.3897, last reward: -0.0376, gradient norm: 1.608: 55%|█████▌ | 346/625 [00:52<00:42, 6.63it/s]
reward: -2.3897, last reward: -0.0376, gradient norm: 1.608: 56%|█████▌ | 347/625 [00:52<00:41, 6.62it/s]
reward: -2.2264, last reward: -0.0434, gradient norm: 2.012: 56%|█████▌ | 347/625 [00:52<00:41, 6.62it/s]
reward: -2.2264, last reward: -0.0434, gradient norm: 2.012: 56%|█████▌ | 348/625 [00:52<00:41, 6.63it/s]
reward: -2.1300, last reward: -0.1215, gradient norm: 2.557: 56%|█████▌ | 348/625 [00:53<00:41, 6.63it/s]
reward: -2.1300, last reward: -0.1215, gradient norm: 2.557: 56%|█████▌ | 349/625 [00:53<00:41, 6.62it/s]
reward: -2.0968, last reward: -0.0885, gradient norm: 3.389: 56%|█████▌ | 349/625 [00:53<00:41, 6.62it/s]
reward: -2.0968, last reward: -0.0885, gradient norm: 3.389: 56%|█████▌ | 350/625 [00:53<00:41, 6.61it/s]
reward: -2.1348, last reward: -0.0073, gradient norm: 0.5052: 56%|█████▌ | 350/625 [00:53<00:41, 6.61it/s]
reward: -2.1348, last reward: -0.0073, gradient norm: 0.5052: 56%|█████▌ | 351/625 [00:53<00:41, 6.62it/s]
reward: -2.4184, last reward: -3.2817, gradient norm: 108.6: 56%|█████▌ | 351/625 [00:53<00:41, 6.62it/s]
reward: -2.4184, last reward: -3.2817, gradient norm: 108.6: 56%|█████▋ | 352/625 [00:53<00:41, 6.61it/s]
reward: -2.3774, last reward: -1.8887, gradient norm: 54.07: 56%|█████▋ | 352/625 [00:53<00:41, 6.61it/s]
reward: -2.3774, last reward: -1.8887, gradient norm: 54.07: 56%|█████▋ | 353/625 [00:53<00:41, 6.62it/s]
reward: -2.4779, last reward: -0.1009, gradient norm: 10.91: 56%|█████▋ | 353/625 [00:53<00:41, 6.62it/s]
reward: -2.4779, last reward: -0.1009, gradient norm: 10.91: 57%|█████▋ | 354/625 [00:53<00:40, 6.61it/s]
reward: -2.2588, last reward: -0.0604, gradient norm: 2.599: 57%|█████▋ | 354/625 [00:54<00:40, 6.61it/s]
reward: -2.2588, last reward: -0.0604, gradient norm: 2.599: 57%|█████▋ | 355/625 [00:54<00:40, 6.62it/s]
reward: -2.4486, last reward: -0.1176, gradient norm: 3.656: 57%|█████▋ | 355/625 [00:54<00:40, 6.62it/s]
reward: -2.4486, last reward: -0.1176, gradient norm: 3.656: 57%|█████▋ | 356/625 [00:54<00:40, 6.62it/s]
reward: -2.2436, last reward: -0.0668, gradient norm: 2.724: 57%|█████▋ | 356/625 [00:54<00:40, 6.62it/s]
reward: -2.2436, last reward: -0.0668, gradient norm: 2.724: 57%|█████▋ | 357/625 [00:54<00:40, 6.62it/s]
reward: -1.8849, last reward: -0.0012, gradient norm: 5.326: 57%|█████▋ | 357/625 [00:54<00:40, 6.62it/s]
reward: -1.8849, last reward: -0.0012, gradient norm: 5.326: 57%|█████▋ | 358/625 [00:54<00:40, 6.62it/s]
reward: -2.7511, last reward: -0.8804, gradient norm: 13.6: 57%|█████▋ | 358/625 [00:54<00:40, 6.62it/s]
reward: -2.7511, last reward: -0.8804, gradient norm: 13.6: 57%|█████▋ | 359/625 [00:54<00:40, 6.62it/s]
reward: -2.8870, last reward: -3.6728, gradient norm: 33.56: 57%|█████▋ | 359/625 [00:54<00:40, 6.62it/s]
reward: -2.8870, last reward: -3.6728, gradient norm: 33.56: 58%|█████▊ | 360/625 [00:54<00:40, 6.62it/s]
reward: -2.8841, last reward: -2.5508, gradient norm: 30.93: 58%|█████▊ | 360/625 [00:54<00:40, 6.62it/s]
reward: -2.8841, last reward: -2.5508, gradient norm: 30.93: 58%|█████▊ | 361/625 [00:54<00:40, 6.59it/s]
reward: -2.5242, last reward: -1.0268, gradient norm: 33.15: 58%|█████▊ | 361/625 [00:55<00:40, 6.59it/s]
reward: -2.5242, last reward: -1.0268, gradient norm: 33.15: 58%|█████▊ | 362/625 [00:55<00:39, 6.60it/s]
reward: -2.3232, last reward: -0.0013, gradient norm: 0.6185: 58%|█████▊ | 362/625 [00:55<00:39, 6.60it/s]
reward: -2.3232, last reward: -0.0013, gradient norm: 0.6185: 58%|█████▊ | 363/625 [00:55<00:39, 6.61it/s]
reward: -2.1378, last reward: -0.0204, gradient norm: 1.337: 58%|█████▊ | 363/625 [00:55<00:39, 6.61it/s]
reward: -2.1378, last reward: -0.0204, gradient norm: 1.337: 58%|█████▊ | 364/625 [00:55<00:39, 6.61it/s]
reward: -2.2677, last reward: -0.0355, gradient norm: 1.685: 58%|█████▊ | 364/625 [00:55<00:39, 6.61it/s]
reward: -2.2677, last reward: -0.0355, gradient norm: 1.685: 58%|█████▊ | 365/625 [00:55<00:39, 6.61it/s]
reward: -2.4884, last reward: -0.0231, gradient norm: 1.213: 58%|█████▊ | 365/625 [00:55<00:39, 6.61it/s]
reward: -2.4884, last reward: -0.0231, gradient norm: 1.213: 59%|█████▊ | 366/625 [00:55<00:39, 6.62it/s]
reward: -2.0770, last reward: -0.0014, gradient norm: 0.6793: 59%|█████▊ | 366/625 [00:55<00:39, 6.62it/s]
reward: -2.0770, last reward: -0.0014, gradient norm: 0.6793: 59%|█████▊ | 367/625 [00:55<00:39, 6.60it/s]
reward: -1.9834, last reward: -0.0349, gradient norm: 1.863: 59%|█████▊ | 367/625 [00:55<00:39, 6.60it/s]
reward: -1.9834, last reward: -0.0349, gradient norm: 1.863: 59%|█████▉ | 368/625 [00:55<00:38, 6.60it/s]
reward: -2.6709, last reward: -0.1416, gradient norm: 5.462: 59%|█████▉ | 368/625 [00:56<00:38, 6.60it/s]
reward: -2.6709, last reward: -0.1416, gradient norm: 5.462: 59%|█████▉ | 369/625 [00:56<00:38, 6.61it/s]
reward: -2.5199, last reward: -3.9790, gradient norm: 47.67: 59%|█████▉ | 369/625 [00:56<00:38, 6.61it/s]
reward: -2.5199, last reward: -3.9790, gradient norm: 47.67: 59%|█████▉ | 370/625 [00:56<00:38, 6.62it/s]
reward: -2.9401, last reward: -3.7802, gradient norm: 32.47: 59%|█████▉ | 370/625 [00:56<00:38, 6.62it/s]
reward: -2.9401, last reward: -3.7802, gradient norm: 32.47: 59%|█████▉ | 371/625 [00:56<00:38, 6.60it/s]
reward: -2.6723, last reward: -3.6507, gradient norm: 45.1: 59%|█████▉ | 371/625 [00:56<00:38, 6.60it/s]
reward: -2.6723, last reward: -3.6507, gradient norm: 45.1: 60%|█████▉ | 372/625 [00:56<00:38, 6.60it/s]
reward: -2.2678, last reward: -0.6201, gradient norm: 32.94: 60%|█████▉ | 372/625 [00:56<00:38, 6.60it/s]
reward: -2.2678, last reward: -0.6201, gradient norm: 32.94: 60%|█████▉ | 373/625 [00:56<00:38, 6.61it/s]
reward: -2.2184, last reward: -0.0075, gradient norm: 0.7385: 60%|█████▉ | 373/625 [00:56<00:38, 6.61it/s]
reward: -2.2184, last reward: -0.0075, gradient norm: 0.7385: 60%|█████▉ | 374/625 [00:56<00:37, 6.62it/s]
reward: -2.6344, last reward: -0.0576, gradient norm: 1.617: 60%|█████▉ | 374/625 [00:57<00:37, 6.62it/s]
reward: -2.6344, last reward: -0.0576, gradient norm: 1.617: 60%|██████ | 375/625 [00:57<00:37, 6.62it/s]
reward: -1.9945, last reward: -0.0772, gradient norm: 2.567: 60%|██████ | 375/625 [00:57<00:37, 6.62it/s]
reward: -1.9945, last reward: -0.0772, gradient norm: 2.567: 60%|██████ | 376/625 [00:57<00:37, 6.59it/s]
reward: -1.7576, last reward: -0.0398, gradient norm: 1.961: 60%|██████ | 376/625 [00:57<00:37, 6.59it/s]
reward: -1.7576, last reward: -0.0398, gradient norm: 1.961: 60%|██████ | 377/625 [00:57<00:37, 6.61it/s]
reward: -2.3396, last reward: -0.0022, gradient norm: 1.094: 60%|██████ | 377/625 [00:57<00:37, 6.61it/s]
reward: -2.3396, last reward: -0.0022, gradient norm: 1.094: 60%|██████ | 378/625 [00:57<00:37, 6.61it/s]
reward: -2.3073, last reward: -0.4018, gradient norm: 29.23: 60%|██████ | 378/625 [00:57<00:37, 6.61it/s]
reward: -2.3073, last reward: -0.4018, gradient norm: 29.23: 61%|██████ | 379/625 [00:57<00:51, 4.77it/s]
reward: -2.3313, last reward: -1.1869, gradient norm: 38.62: 61%|██████ | 379/625 [00:57<00:51, 4.77it/s]
reward: -2.3313, last reward: -1.1869, gradient norm: 38.62: 61%|██████ | 380/625 [00:57<00:47, 5.20it/s]
reward: -2.0481, last reward: -0.1117, gradient norm: 5.321: 61%|██████ | 380/625 [00:58<00:47, 5.20it/s]
reward: -2.0481, last reward: -0.1117, gradient norm: 5.321: 61%|██████ | 381/625 [00:58<00:43, 5.56it/s]
reward: -1.6823, last reward: -0.0001, gradient norm: 1.981: 61%|██████ | 381/625 [00:58<00:43, 5.56it/s]
reward: -1.6823, last reward: -0.0001, gradient norm: 1.981: 61%|██████ | 382/625 [00:58<00:41, 5.84it/s]
reward: -1.8305, last reward: -0.0210, gradient norm: 1.228: 61%|██████ | 382/625 [00:58<00:41, 5.84it/s]
reward: -1.8305, last reward: -0.0210, gradient norm: 1.228: 61%|██████▏ | 383/625 [00:58<00:39, 6.06it/s]
reward: -1.4908, last reward: -0.0272, gradient norm: 1.538: 61%|██████▏ | 383/625 [00:58<00:39, 6.06it/s]
reward: -1.4908, last reward: -0.0272, gradient norm: 1.538: 61%|██████▏ | 384/625 [00:58<00:38, 6.22it/s]
reward: -2.3267, last reward: -0.0111, gradient norm: 0.7965: 61%|██████▏ | 384/625 [00:58<00:38, 6.22it/s]
reward: -2.3267, last reward: -0.0111, gradient norm: 0.7965: 62%|██████▏ | 385/625 [00:58<00:37, 6.33it/s]
reward: -2.1796, last reward: -0.0039, gradient norm: 0.5396: 62%|██████▏ | 385/625 [00:58<00:37, 6.33it/s]
reward: -2.1796, last reward: -0.0039, gradient norm: 0.5396: 62%|██████▏ | 386/625 [00:58<00:37, 6.41it/s]
reward: -2.3757, last reward: -0.0490, gradient norm: 2.237: 62%|██████▏ | 386/625 [00:59<00:37, 6.41it/s]
reward: -2.3757, last reward: -0.0490, gradient norm: 2.237: 62%|██████▏ | 387/625 [00:59<00:36, 6.47it/s]
reward: -2.1394, last reward: -0.4187, gradient norm: 52.11: 62%|██████▏ | 387/625 [00:59<00:36, 6.47it/s]
reward: -2.1394, last reward: -0.4187, gradient norm: 52.11: 62%|██████▏ | 388/625 [00:59<00:36, 6.52it/s]
reward: -2.2986, last reward: -0.0038, gradient norm: 0.7954: 62%|██████▏ | 388/625 [00:59<00:36, 6.52it/s]
reward: -2.2986, last reward: -0.0038, gradient norm: 0.7954: 62%|██████▏ | 389/625 [00:59<00:36, 6.55it/s]
reward: -2.1274, last reward: -0.0063, gradient norm: 0.813: 62%|██████▏ | 389/625 [00:59<00:36, 6.55it/s]
reward: -2.1274, last reward: -0.0063, gradient norm: 0.813: 62%|██████▏ | 390/625 [00:59<00:35, 6.57it/s]
reward: -1.8706, last reward: -0.0114, gradient norm: 3.325: 62%|██████▏ | 390/625 [00:59<00:35, 6.57it/s]
reward: -1.8706, last reward: -0.0114, gradient norm: 3.325: 63%|██████▎ | 391/625 [00:59<00:35, 6.58it/s]
reward: -1.6922, last reward: -0.0004, gradient norm: 0.2423: 63%|██████▎ | 391/625 [00:59<00:35, 6.58it/s]
reward: -1.6922, last reward: -0.0004, gradient norm: 0.2423: 63%|██████▎ | 392/625 [00:59<00:35, 6.59it/s]
reward: -1.9115, last reward: -0.2602, gradient norm: 2.599: 63%|██████▎ | 392/625 [00:59<00:35, 6.59it/s]
reward: -1.9115, last reward: -0.2602, gradient norm: 2.599: 63%|██████▎ | 393/625 [00:59<00:35, 6.60it/s]
reward: -2.2449, last reward: -0.0783, gradient norm: 5.199: 63%|██████▎ | 393/625 [01:00<00:35, 6.60it/s]
reward: -2.2449, last reward: -0.0783, gradient norm: 5.199: 63%|██████▎ | 394/625 [01:00<00:34, 6.61it/s]
reward: -2.0631, last reward: -0.0057, gradient norm: 0.7444: 63%|██████▎ | 394/625 [01:00<00:34, 6.61it/s]
reward: -2.0631, last reward: -0.0057, gradient norm: 0.7444: 63%|██████▎ | 395/625 [01:00<00:34, 6.61it/s]
reward: -2.3339, last reward: -0.0167, gradient norm: 1.39: 63%|██████▎ | 395/625 [01:00<00:34, 6.61it/s]
reward: -2.3339, last reward: -0.0167, gradient norm: 1.39: 63%|██████▎ | 396/625 [01:00<00:34, 6.62it/s]
reward: -2.4806, last reward: -0.0023, gradient norm: 2.317: 63%|██████▎ | 396/625 [01:00<00:34, 6.62it/s]
reward: -2.4806, last reward: -0.0023, gradient norm: 2.317: 64%|██████▎ | 397/625 [01:00<00:34, 6.61it/s]
reward: -2.4171, last reward: -0.1438, gradient norm: 5.067: 64%|██████▎ | 397/625 [01:00<00:34, 6.61it/s]
reward: -2.4171, last reward: -0.1438, gradient norm: 5.067: 64%|██████▎ | 398/625 [01:00<00:34, 6.62it/s]
reward: -2.2618, last reward: -0.5809, gradient norm: 20.39: 64%|██████▎ | 398/625 [01:00<00:34, 6.62it/s]
reward: -2.2618, last reward: -0.5809, gradient norm: 20.39: 64%|██████▍ | 399/625 [01:00<00:34, 6.62it/s]
reward: -2.0115, last reward: -0.0054, gradient norm: 0.3364: 64%|██████▍ | 399/625 [01:01<00:34, 6.62it/s]
reward: -2.0115, last reward: -0.0054, gradient norm: 0.3364: 64%|██████▍ | 400/625 [01:01<00:34, 6.62it/s]
reward: -1.8733, last reward: -0.0184, gradient norm: 2.275: 64%|██████▍ | 400/625 [01:01<00:34, 6.62it/s]
reward: -1.8733, last reward: -0.0184, gradient norm: 2.275: 64%|██████▍ | 401/625 [01:01<00:33, 6.61it/s]
reward: -1.9137, last reward: -0.0113, gradient norm: 1.025: 64%|██████▍ | 401/625 [01:01<00:33, 6.61it/s]
reward: -1.9137, last reward: -0.0113, gradient norm: 1.025: 64%|██████▍ | 402/625 [01:01<00:33, 6.61it/s]
reward: -2.0386, last reward: -0.0625, gradient norm: 2.763: 64%|██████▍ | 402/625 [01:01<00:33, 6.61it/s]
reward: -2.0386, last reward: -0.0625, gradient norm: 2.763: 64%|██████▍ | 403/625 [01:01<00:33, 6.62it/s]
reward: -2.1332, last reward: -0.0582, gradient norm: 0.7816: 64%|██████▍ | 403/625 [01:01<00:33, 6.62it/s]
reward: -2.1332, last reward: -0.0582, gradient norm: 0.7816: 65%|██████▍ | 404/625 [01:01<00:33, 6.63it/s]
reward: -1.8341, last reward: -0.0941, gradient norm: 5.854: 65%|██████▍ | 404/625 [01:01<00:33, 6.63it/s]
reward: -1.8341, last reward: -0.0941, gradient norm: 5.854: 65%|██████▍ | 405/625 [01:01<00:33, 6.63it/s]
reward: -1.8615, last reward: -0.0968, gradient norm: 4.588: 65%|██████▍ | 405/625 [01:01<00:33, 6.63it/s]
reward: -1.8615, last reward: -0.0968, gradient norm: 4.588: 65%|██████▍ | 406/625 [01:01<00:33, 6.62it/s]
reward: -2.0981, last reward: -0.3849, gradient norm: 6.008: 65%|██████▍ | 406/625 [01:02<00:33, 6.62it/s]
reward: -2.0981, last reward: -0.3849, gradient norm: 6.008: 65%|██████▌ | 407/625 [01:02<00:32, 6.62it/s]
reward: -1.9395, last reward: -0.0765, gradient norm: 4.055: 65%|██████▌ | 407/625 [01:02<00:32, 6.62it/s]
reward: -1.9395, last reward: -0.0765, gradient norm: 4.055: 65%|██████▌ | 408/625 [01:02<00:32, 6.62it/s]
reward: -2.2685, last reward: -0.2235, gradient norm: 1.688: 65%|██████▌ | 408/625 [01:02<00:32, 6.62it/s]
reward: -2.2685, last reward: -0.2235, gradient norm: 1.688: 65%|██████▌ | 409/625 [01:02<00:32, 6.62it/s]
reward: -2.3052, last reward: -1.4249, gradient norm: 25.99: 65%|██████▌ | 409/625 [01:02<00:32, 6.62it/s]
reward: -2.3052, last reward: -1.4249, gradient norm: 25.99: 66%|██████▌ | 410/625 [01:02<00:32, 6.61it/s]
reward: -2.6806, last reward: -1.6383, gradient norm: 30.59: 66%|██████▌ | 410/625 [01:02<00:32, 6.61it/s]
reward: -2.6806, last reward: -1.6383, gradient norm: 30.59: 66%|██████▌ | 411/625 [01:02<00:32, 6.58it/s]
reward: -2.3721, last reward: -2.9981, gradient norm: 74.37: 66%|██████▌ | 411/625 [01:02<00:32, 6.58it/s]
reward: -2.3721, last reward: -2.9981, gradient norm: 74.37: 66%|██████▌ | 412/625 [01:02<00:32, 6.60it/s]
reward: -2.1862, last reward: -0.0063, gradient norm: 1.822: 66%|██████▌ | 412/625 [01:02<00:32, 6.60it/s]
reward: -2.1862, last reward: -0.0063, gradient norm: 1.822: 66%|██████▌ | 413/625 [01:02<00:32, 6.61it/s]
reward: -1.9811, last reward: -0.0171, gradient norm: 1.013: 66%|██████▌ | 413/625 [01:03<00:32, 6.61it/s]
reward: -1.9811, last reward: -0.0171, gradient norm: 1.013: 66%|██████▌ | 414/625 [01:03<00:31, 6.62it/s]
reward: -2.0252, last reward: -0.0049, gradient norm: 0.6205: 66%|██████▌ | 414/625 [01:03<00:31, 6.62it/s]
reward: -2.0252, last reward: -0.0049, gradient norm: 0.6205: 66%|██████▋ | 415/625 [01:03<00:31, 6.62it/s]
reward: -2.1108, last reward: -0.4921, gradient norm: 23.74: 66%|██████▋ | 415/625 [01:03<00:31, 6.62it/s]
reward: -2.1108, last reward: -0.4921, gradient norm: 23.74: 67%|██████▋ | 416/625 [01:03<00:31, 6.64it/s]
reward: -1.9142, last reward: -0.8130, gradient norm: 52.65: 67%|██████▋ | 416/625 [01:03<00:31, 6.64it/s]
reward: -1.9142, last reward: -0.8130, gradient norm: 52.65: 67%|██████▋ | 417/625 [01:03<00:31, 6.64it/s]
reward: -2.1725, last reward: -0.0036, gradient norm: 0.3196: 67%|██████▋ | 417/625 [01:03<00:31, 6.64it/s]
reward: -2.1725, last reward: -0.0036, gradient norm: 0.3196: 67%|██████▋ | 418/625 [01:03<00:31, 6.65it/s]
reward: -1.7795, last reward: -0.0242, gradient norm: 1.799: 67%|██████▋ | 418/625 [01:03<00:31, 6.65it/s]
reward: -1.7795, last reward: -0.0242, gradient norm: 1.799: 67%|██████▋ | 419/625 [01:03<00:31, 6.64it/s]
reward: -1.7737, last reward: -0.0138, gradient norm: 1.39: 67%|██████▋ | 419/625 [01:04<00:31, 6.64it/s]
reward: -1.7737, last reward: -0.0138, gradient norm: 1.39: 67%|██████▋ | 420/625 [01:04<00:30, 6.64it/s]
reward: -2.1462, last reward: -0.0053, gradient norm: 0.47: 67%|██████▋ | 420/625 [01:04<00:30, 6.64it/s]
reward: -2.1462, last reward: -0.0053, gradient norm: 0.47: 67%|██████▋ | 421/625 [01:04<00:30, 6.63it/s]
reward: -1.9226, last reward: -0.6139, gradient norm: 40.3: 67%|██████▋ | 421/625 [01:04<00:30, 6.63it/s]
reward: -1.9226, last reward: -0.6139, gradient norm: 40.3: 68%|██████▊ | 422/625 [01:04<00:30, 6.63it/s]
reward: -1.9889, last reward: -0.0403, gradient norm: 1.112: 68%|██████▊ | 422/625 [01:04<00:30, 6.63it/s]
reward: -1.9889, last reward: -0.0403, gradient norm: 1.112: 68%|██████▊ | 423/625 [01:04<00:30, 6.63it/s]
reward: -1.6194, last reward: -0.0032, gradient norm: 0.79: 68%|██████▊ | 423/625 [01:04<00:30, 6.63it/s]
reward: -1.6194, last reward: -0.0032, gradient norm: 0.79: 68%|██████▊ | 424/625 [01:04<00:30, 6.62it/s]
reward: -2.3989, last reward: -0.0104, gradient norm: 1.134: 68%|██████▊ | 424/625 [01:04<00:30, 6.62it/s]
reward: -2.3989, last reward: -0.0104, gradient norm: 1.134: 68%|██████▊ | 425/625 [01:04<00:30, 6.62it/s]
reward: -1.9960, last reward: -0.0009, gradient norm: 0.6009: 68%|██████▊ | 425/625 [01:04<00:30, 6.62it/s]
reward: -1.9960, last reward: -0.0009, gradient norm: 0.6009: 68%|██████▊ | 426/625 [01:04<00:30, 6.62it/s]
reward: -2.2697, last reward: -0.0914, gradient norm: 2.905: 68%|██████▊ | 426/625 [01:05<00:30, 6.62it/s]
reward: -2.2697, last reward: -0.0914, gradient norm: 2.905: 68%|██████▊ | 427/625 [01:05<00:29, 6.62it/s]
reward: -2.4256, last reward: -0.1114, gradient norm: 2.102: 68%|██████▊ | 427/625 [01:05<00:29, 6.62it/s]
reward: -2.4256, last reward: -0.1114, gradient norm: 2.102: 68%|██████▊ | 428/625 [01:05<00:29, 6.62it/s]
reward: -1.9862, last reward: -0.1932, gradient norm: 22.44: 68%|██████▊ | 428/625 [01:05<00:29, 6.62it/s]
reward: -1.9862, last reward: -0.1932, gradient norm: 22.44: 69%|██████▊ | 429/625 [01:05<00:29, 6.62it/s]
reward: -2.0637, last reward: -0.0623, gradient norm: 3.082: 69%|██████▊ | 429/625 [01:05<00:29, 6.62it/s]
reward: -2.0637, last reward: -0.0623, gradient norm: 3.082: 69%|██████▉ | 430/625 [01:05<00:29, 6.63it/s]
reward: -1.9906, last reward: -0.2031, gradient norm: 5.5: 69%|██████▉ | 430/625 [01:05<00:29, 6.63it/s]
reward: -1.9906, last reward: -0.2031, gradient norm: 5.5: 69%|██████▉ | 431/625 [01:05<00:29, 6.63it/s]
reward: -1.9948, last reward: -0.0895, gradient norm: 3.456: 69%|██████▉ | 431/625 [01:05<00:29, 6.63it/s]
reward: -1.9948, last reward: -0.0895, gradient norm: 3.456: 69%|██████▉ | 432/625 [01:05<00:29, 6.64it/s]
reward: -2.1970, last reward: -0.0256, gradient norm: 1.593: 69%|██████▉ | 432/625 [01:05<00:29, 6.64it/s]
reward: -2.1970, last reward: -0.0256, gradient norm: 1.593: 69%|██████▉ | 433/625 [01:05<00:29, 6.62it/s]
reward: -2.4231, last reward: -0.0449, gradient norm: 3.644: 69%|██████▉ | 433/625 [01:06<00:29, 6.62it/s]
reward: -2.4231, last reward: -0.0449, gradient norm: 3.644: 69%|██████▉ | 434/625 [01:06<00:28, 6.62it/s]
reward: -2.1039, last reward: -3.1973, gradient norm: 87.37: 69%|██████▉ | 434/625 [01:06<00:28, 6.62it/s]
reward: -2.1039, last reward: -3.1973, gradient norm: 87.37: 70%|██████▉ | 435/625 [01:06<00:28, 6.62it/s]
reward: -2.4561, last reward: -0.1225, gradient norm: 6.119: 70%|██████▉ | 435/625 [01:06<00:28, 6.62it/s]
reward: -2.4561, last reward: -0.1225, gradient norm: 6.119: 70%|██████▉ | 436/625 [01:06<00:28, 6.58it/s]
reward: -2.0211, last reward: -0.2125, gradient norm: 2.94: 70%|██████▉ | 436/625 [01:06<00:28, 6.58it/s]
reward: -2.0211, last reward: -0.2125, gradient norm: 2.94: 70%|██████▉ | 437/625 [01:06<00:28, 6.57it/s]
reward: -2.3866, last reward: -0.0050, gradient norm: 0.7202: 70%|██████▉ | 437/625 [01:06<00:28, 6.57it/s]
reward: -2.3866, last reward: -0.0050, gradient norm: 0.7202: 70%|███████ | 438/625 [01:06<00:28, 6.54it/s]
reward: -1.6388, last reward: -0.0072, gradient norm: 0.8657: 70%|███████ | 438/625 [01:06<00:28, 6.54it/s]
reward: -1.6388, last reward: -0.0072, gradient norm: 0.8657: 70%|███████ | 439/625 [01:06<00:28, 6.54it/s]
reward: -2.1187, last reward: -0.0015, gradient norm: 0.5116: 70%|███████ | 439/625 [01:07<00:28, 6.54it/s]
reward: -2.1187, last reward: -0.0015, gradient norm: 0.5116: 70%|███████ | 440/625 [01:07<00:28, 6.52it/s]
reward: -2.0432, last reward: -0.0025, gradient norm: 0.7809: 70%|███████ | 440/625 [01:07<00:28, 6.52it/s]
reward: -2.0432, last reward: -0.0025, gradient norm: 0.7809: 71%|███████ | 441/625 [01:07<00:28, 6.52it/s]
reward: -2.1925, last reward: -0.0103, gradient norm: 2.83: 71%|███████ | 441/625 [01:07<00:28, 6.52it/s]
reward: -2.1925, last reward: -0.0103, gradient norm: 2.83: 71%|███████ | 442/625 [01:07<00:27, 6.54it/s]
reward: -1.9570, last reward: -0.0002, gradient norm: 0.35: 71%|███████ | 442/625 [01:07<00:27, 6.54it/s]
reward: -1.9570, last reward: -0.0002, gradient norm: 0.35: 71%|███████ | 443/625 [01:07<00:27, 6.53it/s]
reward: -2.0871, last reward: -0.0022, gradient norm: 0.5601: 71%|███████ | 443/625 [01:07<00:27, 6.53it/s]
reward: -2.0871, last reward: -0.0022, gradient norm: 0.5601: 71%|███████ | 444/625 [01:07<00:27, 6.51it/s]
reward: -2.0165, last reward: -0.0047, gradient norm: 0.6061: 71%|███████ | 444/625 [01:07<00:27, 6.51it/s]
reward: -2.0165, last reward: -0.0047, gradient norm: 0.6061: 71%|███████ | 445/625 [01:07<00:27, 6.49it/s]
reward: -2.2746, last reward: -0.0027, gradient norm: 0.7887: 71%|███████ | 445/625 [01:07<00:27, 6.49it/s]
reward: -2.2746, last reward: -0.0027, gradient norm: 0.7887: 71%|███████▏ | 446/625 [01:07<00:27, 6.50it/s]
reward: -2.1835, last reward: -0.0035, gradient norm: 0.855: 71%|███████▏ | 446/625 [01:08<00:27, 6.50it/s]
reward: -2.1835, last reward: -0.0035, gradient norm: 0.855: 72%|███████▏ | 447/625 [01:08<00:27, 6.52it/s]
reward: -1.8420, last reward: -0.0103, gradient norm: 1.548: 72%|███████▏ | 447/625 [01:08<00:27, 6.52it/s]
reward: -1.8420, last reward: -0.0103, gradient norm: 1.548: 72%|███████▏ | 448/625 [01:08<00:27, 6.51it/s]
reward: -2.2653, last reward: -0.0126, gradient norm: 0.9736: 72%|███████▏ | 448/625 [01:08<00:27, 6.51it/s]
reward: -2.2653, last reward: -0.0126, gradient norm: 0.9736: 72%|███████▏ | 449/625 [01:08<00:27, 6.52it/s]
reward: -2.0594, last reward: -0.0119, gradient norm: 0.6196: 72%|███████▏ | 449/625 [01:08<00:27, 6.52it/s]
reward: -2.0594, last reward: -0.0119, gradient norm: 0.6196: 72%|███████▏ | 450/625 [01:08<00:26, 6.50it/s]
reward: -2.4509, last reward: -0.0373, gradient norm: 11.44: 72%|███████▏ | 450/625 [01:08<00:26, 6.50it/s]
reward: -2.4509, last reward: -0.0373, gradient norm: 11.44: 72%|███████▏ | 451/625 [01:08<00:26, 6.51it/s]
reward: -2.2528, last reward: -0.0620, gradient norm: 3.992: 72%|███████▏ | 451/625 [01:08<00:26, 6.51it/s]
reward: -2.2528, last reward: -0.0620, gradient norm: 3.992: 72%|███████▏ | 452/625 [01:08<00:26, 6.51it/s]
reward: -1.6898, last reward: -0.3235, gradient norm: 6.687: 72%|███████▏ | 452/625 [01:09<00:26, 6.51it/s]
reward: -1.6898, last reward: -0.3235, gradient norm: 6.687: 72%|███████▏ | 453/625 [01:09<00:26, 6.48it/s]
reward: -1.5879, last reward: -0.0905, gradient norm: 2.84: 72%|███████▏ | 453/625 [01:09<00:26, 6.48it/s]
reward: -1.5879, last reward: -0.0905, gradient norm: 2.84: 73%|███████▎ | 454/625 [01:09<00:26, 6.49it/s]
reward: -1.8406, last reward: -0.0694, gradient norm: 2.288: 73%|███████▎ | 454/625 [01:09<00:26, 6.49it/s]
reward: -1.8406, last reward: -0.0694, gradient norm: 2.288: 73%|███████▎ | 455/625 [01:09<00:26, 6.49it/s]
reward: -1.8259, last reward: -0.0235, gradient norm: 1.304: 73%|███████▎ | 455/625 [01:09<00:26, 6.49it/s]
reward: -1.8259, last reward: -0.0235, gradient norm: 1.304: 73%|███████▎ | 456/625 [01:09<00:26, 6.49it/s]
reward: -1.8500, last reward: -0.0024, gradient norm: 1.416: 73%|███████▎ | 456/625 [01:09<00:26, 6.49it/s]
reward: -1.8500, last reward: -0.0024, gradient norm: 1.416: 73%|███████▎ | 457/625 [01:09<00:25, 6.48it/s]
reward: -1.9649, last reward: -0.4054, gradient norm: 39.3: 73%|███████▎ | 457/625 [01:09<00:25, 6.48it/s]
reward: -1.9649, last reward: -0.4054, gradient norm: 39.3: 73%|███████▎ | 458/625 [01:09<00:25, 6.49it/s]
reward: -2.2027, last reward: -0.0894, gradient norm: 4.275: 73%|███████▎ | 458/625 [01:09<00:25, 6.49it/s]
reward: -2.2027, last reward: -0.0894, gradient norm: 4.275: 73%|███████▎ | 459/625 [01:09<00:25, 6.49it/s]
reward: -1.5966, last reward: -0.0113, gradient norm: 1.368: 73%|███████▎ | 459/625 [01:10<00:25, 6.49it/s]
reward: -1.5966, last reward: -0.0113, gradient norm: 1.368: 74%|███████▎ | 460/625 [01:10<00:25, 6.48it/s]
reward: -1.6942, last reward: -0.0016, gradient norm: 0.4254: 74%|███████▎ | 460/625 [01:10<00:25, 6.48it/s]
reward: -1.6942, last reward: -0.0016, gradient norm: 0.4254: 74%|███████▍ | 461/625 [01:10<00:25, 6.48it/s]
reward: -1.6703, last reward: -0.0145, gradient norm: 2.142: 74%|███████▍ | 461/625 [01:10<00:25, 6.48it/s]
reward: -1.6703, last reward: -0.0145, gradient norm: 2.142: 74%|███████▍ | 462/625 [01:10<00:25, 6.49it/s]
reward: -1.8124, last reward: -0.0218, gradient norm: 0.9196: 74%|███████▍ | 462/625 [01:10<00:25, 6.49it/s]
reward: -1.8124, last reward: -0.0218, gradient norm: 0.9196: 74%|███████▍ | 463/625 [01:10<00:24, 6.49it/s]
reward: -1.8657, last reward: -0.0188, gradient norm: 0.8986: 74%|███████▍ | 463/625 [01:10<00:24, 6.49it/s]
reward: -1.8657, last reward: -0.0188, gradient norm: 0.8986: 74%|███████▍ | 464/625 [01:10<00:24, 6.49it/s]
reward: -2.0884, last reward: -0.0084, gradient norm: 0.5624: 74%|███████▍ | 464/625 [01:10<00:24, 6.49it/s]
reward: -2.0884, last reward: -0.0084, gradient norm: 0.5624: 74%|███████▍ | 465/625 [01:10<00:24, 6.49it/s]
reward: -1.8862, last reward: -0.0006, gradient norm: 0.5384: 74%|███████▍ | 465/625 [01:11<00:24, 6.49it/s]
reward: -1.8862, last reward: -0.0006, gradient norm: 0.5384: 75%|███████▍ | 466/625 [01:11<00:24, 6.50it/s]
reward: -2.1973, last reward: -0.0022, gradient norm: 0.5837: 75%|███████▍ | 466/625 [01:11<00:24, 6.50it/s]
reward: -2.1973, last reward: -0.0022, gradient norm: 0.5837: 75%|███████▍ | 467/625 [01:11<00:24, 6.53it/s]
reward: -1.8954, last reward: -0.0101, gradient norm: 0.6751: 75%|███████▍ | 467/625 [01:11<00:24, 6.53it/s]
reward: -1.8954, last reward: -0.0101, gradient norm: 0.6751: 75%|███████▍ | 468/625 [01:11<00:24, 6.51it/s]
reward: -1.8063, last reward: -0.0122, gradient norm: 0.9635: 75%|███████▍ | 468/625 [01:11<00:24, 6.51it/s]
reward: -1.8063, last reward: -0.0122, gradient norm: 0.9635: 75%|███████▌ | 469/625 [01:11<00:23, 6.51it/s]
reward: -2.0692, last reward: -0.0027, gradient norm: 0.4216: 75%|███████▌ | 469/625 [01:11<00:23, 6.51it/s]
reward: -2.0692, last reward: -0.0027, gradient norm: 0.4216: 75%|███████▌ | 470/625 [01:11<00:23, 6.51it/s]
reward: -2.1227, last reward: -0.0586, gradient norm: 3.162e+03: 75%|███████▌ | 470/625 [01:11<00:23, 6.51it/s]
reward: -2.1227, last reward: -0.0586, gradient norm: 3.162e+03: 75%|███████▌ | 471/625 [01:11<00:23, 6.50it/s]
reward: -1.9690, last reward: -0.0074, gradient norm: 0.4166: 75%|███████▌ | 471/625 [01:11<00:23, 6.50it/s]
reward: -1.9690, last reward: -0.0074, gradient norm: 0.4166: 76%|███████▌ | 472/625 [01:11<00:23, 6.52it/s]
reward: -2.6324, last reward: -0.0119, gradient norm: 1.345: 76%|███████▌ | 472/625 [01:12<00:23, 6.52it/s]
reward: -2.6324, last reward: -0.0119, gradient norm: 1.345: 76%|███████▌ | 473/625 [01:12<00:23, 6.54it/s]
reward: -2.0778, last reward: -0.0098, gradient norm: 1.166: 76%|███████▌ | 473/625 [01:12<00:23, 6.54it/s]
reward: -2.0778, last reward: -0.0098, gradient norm: 1.166: 76%|███████▌ | 474/625 [01:12<00:23, 6.56it/s]
reward: -1.8548, last reward: -0.0017, gradient norm: 0.4408: 76%|███████▌ | 474/625 [01:12<00:23, 6.56it/s]
reward: -1.8548, last reward: -0.0017, gradient norm: 0.4408: 76%|███████▌ | 475/625 [01:12<00:22, 6.58it/s]
reward: -1.8125, last reward: -0.0003, gradient norm: 0.1515: 76%|███████▌ | 475/625 [01:12<00:22, 6.58it/s]
reward: -1.8125, last reward: -0.0003, gradient norm: 0.1515: 76%|███████▌ | 476/625 [01:12<00:22, 6.59it/s]
reward: -2.2733, last reward: -0.0044, gradient norm: 0.2836: 76%|███████▌ | 476/625 [01:12<00:22, 6.59it/s]
reward: -2.2733, last reward: -0.0044, gradient norm: 0.2836: 76%|███████▋ | 477/625 [01:12<00:22, 6.60it/s]
reward: -1.7497, last reward: -0.0149, gradient norm: 0.7681: 76%|███████▋ | 477/625 [01:12<00:22, 6.60it/s]
reward: -1.7497, last reward: -0.0149, gradient norm: 0.7681: 76%|███████▋ | 478/625 [01:12<00:22, 6.60it/s]
reward: -1.8547, last reward: -0.0105, gradient norm: 0.7212: 76%|███████▋ | 478/625 [01:13<00:22, 6.60it/s]
reward: -1.8547, last reward: -0.0105, gradient norm: 0.7212: 77%|███████▋ | 479/625 [01:13<00:22, 6.59it/s]
reward: -1.9848, last reward: -0.0019, gradient norm: 0.6498: 77%|███████▋ | 479/625 [01:13<00:22, 6.59it/s]
reward: -1.9848, last reward: -0.0019, gradient norm: 0.6498: 77%|███████▋ | 480/625 [01:13<00:21, 6.60it/s]
reward: -2.1987, last reward: -0.0011, gradient norm: 0.5473: 77%|███████▋ | 480/625 [01:13<00:21, 6.60it/s]
reward: -2.1987, last reward: -0.0011, gradient norm: 0.5473: 77%|███████▋ | 481/625 [01:13<00:21, 6.59it/s]
reward: -1.8991, last reward: -0.0033, gradient norm: 0.6091: 77%|███████▋ | 481/625 [01:13<00:21, 6.59it/s]
reward: -1.8991, last reward: -0.0033, gradient norm: 0.6091: 77%|███████▋ | 482/625 [01:13<00:21, 6.60it/s]
reward: -1.9189, last reward: -0.0032, gradient norm: 0.5771: 77%|███████▋ | 482/625 [01:13<00:21, 6.60it/s]
reward: -1.9189, last reward: -0.0032, gradient norm: 0.5771: 77%|███████▋ | 483/625 [01:13<00:21, 6.61it/s]
reward: -1.6781, last reward: -0.0004, gradient norm: 0.7542: 77%|███████▋ | 483/625 [01:13<00:21, 6.61it/s]
reward: -1.6781, last reward: -0.0004, gradient norm: 0.7542: 77%|███████▋ | 484/625 [01:13<00:21, 6.61it/s]
reward: -1.5959, last reward: -0.0064, gradient norm: 0.4295: 77%|███████▋ | 484/625 [01:13<00:21, 6.61it/s]
reward: -1.5959, last reward: -0.0064, gradient norm: 0.4295: 78%|███████▊ | 485/625 [01:13<00:21, 6.60it/s]
reward: -2.2547, last reward: -0.0103, gradient norm: 0.4641: 78%|███████▊ | 485/625 [01:14<00:21, 6.60it/s]
reward: -2.2547, last reward: -0.0103, gradient norm: 0.4641: 78%|███████▊ | 486/625 [01:14<00:21, 6.58it/s]
reward: -2.1509, last reward: -0.0636, gradient norm: 6.547: 78%|███████▊ | 486/625 [01:14<00:21, 6.58it/s]
reward: -2.1509, last reward: -0.0636, gradient norm: 6.547: 78%|███████▊ | 487/625 [01:14<00:20, 6.60it/s]
reward: -2.0972, last reward: -0.0065, gradient norm: 0.2593: 78%|███████▊ | 487/625 [01:14<00:20, 6.60it/s]
reward: -2.0972, last reward: -0.0065, gradient norm: 0.2593: 78%|███████▊ | 488/625 [01:14<00:20, 6.61it/s]
reward: -2.1694, last reward: -0.0083, gradient norm: 0.5759: 78%|███████▊ | 488/625 [01:14<00:20, 6.61it/s]
reward: -2.1694, last reward: -0.0083, gradient norm: 0.5759: 78%|███████▊ | 489/625 [01:14<00:20, 6.60it/s]
reward: -2.0493, last reward: -0.0021, gradient norm: 0.7805: 78%|███████▊ | 489/625 [01:14<00:20, 6.60it/s]
reward: -2.0493, last reward: -0.0021, gradient norm: 0.7805: 78%|███████▊ | 490/625 [01:14<00:20, 6.61it/s]
reward: -2.0950, last reward: -0.0021, gradient norm: 0.497: 78%|███████▊ | 490/625 [01:14<00:20, 6.61it/s]
reward: -2.0950, last reward: -0.0021, gradient norm: 0.497: 79%|███████▊ | 491/625 [01:14<00:20, 6.57it/s]
reward: -1.9717, last reward: -0.0012, gradient norm: 0.3672: 79%|███████▊ | 491/625 [01:15<00:20, 6.57it/s]
reward: -1.9717, last reward: -0.0012, gradient norm: 0.3672: 79%|███████▊ | 492/625 [01:15<00:20, 6.58it/s]
reward: -2.0207, last reward: -0.0009, gradient norm: 0.331: 79%|███████▊ | 492/625 [01:15<00:20, 6.58it/s]
reward: -2.0207, last reward: -0.0009, gradient norm: 0.331: 79%|███████▉ | 493/625 [01:15<00:20, 6.59it/s]
reward: -1.8266, last reward: -0.0069, gradient norm: 0.5365: 79%|███████▉ | 493/625 [01:15<00:20, 6.59it/s]
reward: -1.8266, last reward: -0.0069, gradient norm: 0.5365: 79%|███████▉ | 494/625 [01:15<00:19, 6.57it/s]
reward: -2.2623, last reward: -0.0065, gradient norm: 0.5078: 79%|███████▉ | 494/625 [01:15<00:19, 6.57it/s]
reward: -2.2623, last reward: -0.0065, gradient norm: 0.5078: 79%|███████▉ | 495/625 [01:15<00:19, 6.59it/s]
reward: -2.0230, last reward: -0.0027, gradient norm: 0.4545: 79%|███████▉ | 495/625 [01:15<00:19, 6.59it/s]
reward: -2.0230, last reward: -0.0027, gradient norm: 0.4545: 79%|███████▉ | 496/625 [01:15<00:19, 6.60it/s]
reward: -1.6047, last reward: -0.0000, gradient norm: 0.09636: 79%|███████▉ | 496/625 [01:15<00:19, 6.60it/s]
reward: -1.6047, last reward: -0.0000, gradient norm: 0.09636: 80%|███████▉ | 497/625 [01:15<00:19, 6.60it/s]
reward: -1.8754, last reward: -0.0010, gradient norm: 0.2: 80%|███████▉ | 497/625 [01:15<00:19, 6.60it/s]
reward: -1.8754, last reward: -0.0010, gradient norm: 0.2: 80%|███████▉ | 498/625 [01:15<00:19, 6.58it/s]
reward: -2.6216, last reward: -0.0031, gradient norm: 0.8269: 80%|███████▉ | 498/625 [01:16<00:19, 6.58it/s]
reward: -2.6216, last reward: -0.0031, gradient norm: 0.8269: 80%|███████▉ | 499/625 [01:16<00:19, 6.58it/s]
reward: -1.7361, last reward: -0.0023, gradient norm: 0.4082: 80%|███████▉ | 499/625 [01:16<00:19, 6.58it/s]
reward: -1.7361, last reward: -0.0023, gradient norm: 0.4082: 80%|████████ | 500/625 [01:16<00:18, 6.60it/s]
reward: -1.6642, last reward: -0.0006, gradient norm: 0.2284: 80%|████████ | 500/625 [01:16<00:18, 6.60it/s]
reward: -1.6642, last reward: -0.0006, gradient norm: 0.2284: 80%|████████ | 501/625 [01:16<00:18, 6.61it/s]
reward: -1.9130, last reward: -0.0008, gradient norm: 0.3031: 80%|████████ | 501/625 [01:16<00:18, 6.61it/s]
reward: -1.9130, last reward: -0.0008, gradient norm: 0.3031: 80%|████████ | 502/625 [01:16<00:18, 6.62it/s]
reward: -2.2944, last reward: -0.0035, gradient norm: 0.2986: 80%|████████ | 502/625 [01:16<00:18, 6.62it/s]
reward: -2.2944, last reward: -0.0035, gradient norm: 0.2986: 80%|████████ | 503/625 [01:16<00:18, 6.61it/s]
reward: -1.7624, last reward: -0.0056, gradient norm: 0.3858: 80%|████████ | 503/625 [01:16<00:18, 6.61it/s]
reward: -1.7624, last reward: -0.0056, gradient norm: 0.3858: 81%|████████ | 504/625 [01:16<00:18, 6.62it/s]
reward: -2.0890, last reward: -0.0042, gradient norm: 0.38: 81%|████████ | 504/625 [01:16<00:18, 6.62it/s]
reward: -2.0890, last reward: -0.0042, gradient norm: 0.38: 81%|████████ | 505/625 [01:16<00:18, 6.63it/s]
reward: -1.7505, last reward: -0.0017, gradient norm: 0.2157: 81%|████████ | 505/625 [01:17<00:18, 6.63it/s]
reward: -1.7505, last reward: -0.0017, gradient norm: 0.2157: 81%|████████ | 506/625 [01:17<00:17, 6.64it/s]
reward: -1.8394, last reward: -0.0013, gradient norm: 0.3413: 81%|████████ | 506/625 [01:17<00:17, 6.64it/s]
reward: -1.8394, last reward: -0.0013, gradient norm: 0.3413: 81%|████████ | 507/625 [01:17<00:17, 6.64it/s]
reward: -1.9609, last reward: -0.0041, gradient norm: 0.6905: 81%|████████ | 507/625 [01:17<00:17, 6.64it/s]
reward: -1.9609, last reward: -0.0041, gradient norm: 0.6905: 81%|████████▏ | 508/625 [01:17<00:17, 6.65it/s]
reward: -1.8467, last reward: -0.0011, gradient norm: 0.4409: 81%|████████▏ | 508/625 [01:17<00:17, 6.65it/s]
reward: -1.8467, last reward: -0.0011, gradient norm: 0.4409: 81%|████████▏ | 509/625 [01:17<00:17, 6.65it/s]
reward: -2.0252, last reward: -0.0021, gradient norm: 0.213: 81%|████████▏ | 509/625 [01:17<00:17, 6.65it/s]
reward: -2.0252, last reward: -0.0021, gradient norm: 0.213: 82%|████████▏ | 510/625 [01:17<00:17, 6.65it/s]
reward: -1.8128, last reward: -0.0073, gradient norm: 0.3559: 82%|████████▏ | 510/625 [01:17<00:17, 6.65it/s]
reward: -1.8128, last reward: -0.0073, gradient norm: 0.3559: 82%|████████▏ | 511/625 [01:17<00:17, 6.60it/s]
reward: -2.1479, last reward: -0.0264, gradient norm: 3.68: 82%|████████▏ | 511/625 [01:18<00:17, 6.60it/s]
reward: -2.1479, last reward: -0.0264, gradient norm: 3.68: 82%|████████▏ | 512/625 [01:18<00:17, 6.61it/s]
reward: -2.1589, last reward: -0.0025, gradient norm: 5.566: 82%|████████▏ | 512/625 [01:18<00:17, 6.61it/s]
reward: -2.1589, last reward: -0.0025, gradient norm: 5.566: 82%|████████▏ | 513/625 [01:18<00:16, 6.59it/s]
reward: -2.2756, last reward: -0.0046, gradient norm: 0.5266: 82%|████████▏ | 513/625 [01:18<00:16, 6.59it/s]
reward: -2.2756, last reward: -0.0046, gradient norm: 0.5266: 82%|████████▏ | 514/625 [01:18<00:16, 6.56it/s]
reward: -1.9873, last reward: -0.0112, gradient norm: 0.9314: 82%|████████▏ | 514/625 [01:18<00:16, 6.56it/s]
reward: -1.9873, last reward: -0.0112, gradient norm: 0.9314: 82%|████████▏ | 515/625 [01:18<00:16, 6.56it/s]
reward: -2.3791, last reward: -0.0721, gradient norm: 1.14: 82%|████████▏ | 515/625 [01:18<00:16, 6.56it/s]
reward: -2.3791, last reward: -0.0721, gradient norm: 1.14: 83%|████████▎ | 516/625 [01:18<00:16, 6.53it/s]
reward: -2.4580, last reward: -0.0758, gradient norm: 0.6114: 83%|████████▎ | 516/625 [01:18<00:16, 6.53it/s]
reward: -2.4580, last reward: -0.0758, gradient norm: 0.6114: 83%|████████▎ | 517/625 [01:18<00:16, 6.53it/s]
reward: -1.9748, last reward: -0.0001, gradient norm: 0.2431: 83%|████████▎ | 517/625 [01:18<00:16, 6.53it/s]
reward: -1.9748, last reward: -0.0001, gradient norm: 0.2431: 83%|████████▎ | 518/625 [01:18<00:16, 6.52it/s]
reward: -2.1958, last reward: -0.0044, gradient norm: 0.5553: 83%|████████▎ | 518/625 [01:19<00:16, 6.52it/s]
reward: -2.1958, last reward: -0.0044, gradient norm: 0.5553: 83%|████████▎ | 519/625 [01:19<00:16, 6.51it/s]
reward: -1.8924, last reward: -0.0097, gradient norm: 17.34: 83%|████████▎ | 519/625 [01:19<00:16, 6.51it/s]
reward: -1.8924, last reward: -0.0097, gradient norm: 17.34: 83%|████████▎ | 520/625 [01:19<00:16, 6.52it/s]
reward: -2.3737, last reward: -0.0234, gradient norm: 1.899: 83%|████████▎ | 520/625 [01:19<00:16, 6.52it/s]
reward: -2.3737, last reward: -0.0234, gradient norm: 1.899: 83%|████████▎ | 521/625 [01:19<00:15, 6.52it/s]
reward: -1.9125, last reward: -0.0063, gradient norm: 0.4623: 83%|████████▎ | 521/625 [01:19<00:15, 6.52it/s]
reward: -1.9125, last reward: -0.0063, gradient norm: 0.4623: 84%|████████▎ | 522/625 [01:19<00:15, 6.52it/s]
reward: -2.3230, last reward: -0.0589, gradient norm: 0.3784: 84%|████████▎ | 522/625 [01:19<00:15, 6.52it/s]
reward: -2.3230, last reward: -0.0589, gradient norm: 0.3784: 84%|████████▎ | 523/625 [01:19<00:15, 6.55it/s]
reward: -1.9482, last reward: -0.0051, gradient norm: 1.105: 84%|████████▎ | 523/625 [01:19<00:15, 6.55it/s]
reward: -1.9482, last reward: -0.0051, gradient norm: 1.105: 84%|████████▍ | 524/625 [01:19<00:15, 6.54it/s]
reward: -2.1979, last reward: -0.0045, gradient norm: 0.6401: 84%|████████▍ | 524/625 [01:20<00:15, 6.54it/s]
reward: -2.1979, last reward: -0.0045, gradient norm: 0.6401: 84%|████████▍ | 525/625 [01:20<00:15, 6.53it/s]
reward: -2.1588, last reward: -0.0048, gradient norm: 0.6255: 84%|████████▍ | 525/625 [01:20<00:15, 6.53it/s]
reward: -2.1588, last reward: -0.0048, gradient norm: 0.6255: 84%|████████▍ | 526/625 [01:20<00:15, 6.53it/s]
reward: -1.6084, last reward: -0.0010, gradient norm: 0.3477: 84%|████████▍ | 526/625 [01:20<00:15, 6.53it/s]
reward: -1.6084, last reward: -0.0010, gradient norm: 0.3477: 84%|████████▍ | 527/625 [01:20<00:15, 6.52it/s]
reward: -2.1475, last reward: -0.0209, gradient norm: 0.3456: 84%|████████▍ | 527/625 [01:20<00:15, 6.52it/s]
reward: -2.1475, last reward: -0.0209, gradient norm: 0.3456: 84%|████████▍ | 528/625 [01:20<00:14, 6.52it/s]
reward: -1.7611, last reward: -0.1040, gradient norm: 18.52: 84%|████████▍ | 528/625 [01:20<00:14, 6.52it/s]
reward: -1.7611, last reward: -0.1040, gradient norm: 18.52: 85%|████████▍ | 529/625 [01:20<00:14, 6.51it/s]
reward: -2.0099, last reward: -0.0173, gradient norm: 1.643: 85%|████████▍ | 529/625 [01:20<00:14, 6.51it/s]
reward: -2.0099, last reward: -0.0173, gradient norm: 1.643: 85%|████████▍ | 530/625 [01:20<00:14, 6.51it/s]
reward: -2.8189, last reward: -1.4358, gradient norm: 46.61: 85%|████████▍ | 530/625 [01:20<00:14, 6.51it/s]
reward: -2.8189, last reward: -1.4358, gradient norm: 46.61: 85%|████████▍ | 531/625 [01:20<00:14, 6.51it/s]
reward: -2.9897, last reward: -2.4869, gradient norm: 51.23: 85%|████████▍ | 531/625 [01:21<00:14, 6.51it/s]
reward: -2.9897, last reward: -2.4869, gradient norm: 51.23: 85%|████████▌ | 532/625 [01:21<00:14, 6.52it/s]
reward: -2.1548, last reward: -0.9751, gradient norm: 72.21: 85%|████████▌ | 532/625 [01:21<00:14, 6.52it/s]
reward: -2.1548, last reward: -0.9751, gradient norm: 72.21: 85%|████████▌ | 533/625 [01:21<00:14, 6.55it/s]
reward: -1.6362, last reward: -0.0022, gradient norm: 0.7495: 85%|████████▌ | 533/625 [01:21<00:14, 6.55it/s]
reward: -1.6362, last reward: -0.0022, gradient norm: 0.7495: 85%|████████▌ | 534/625 [01:21<00:13, 6.53it/s]
reward: -2.1749, last reward: -0.0105, gradient norm: 0.9513: 85%|████████▌ | 534/625 [01:21<00:13, 6.53it/s]
reward: -2.1749, last reward: -0.0105, gradient norm: 0.9513: 86%|████████▌ | 535/625 [01:21<00:13, 6.51it/s]
reward: -1.7708, last reward: -0.0371, gradient norm: 1.432: 86%|████████▌ | 535/625 [01:21<00:13, 6.51it/s]
reward: -1.7708, last reward: -0.0371, gradient norm: 1.432: 86%|████████▌ | 536/625 [01:21<00:13, 6.51it/s]
reward: -2.2649, last reward: -0.0437, gradient norm: 2.327: 86%|████████▌ | 536/625 [01:21<00:13, 6.51it/s]
reward: -2.2649, last reward: -0.0437, gradient norm: 2.327: 86%|████████▌ | 537/625 [01:21<00:13, 6.49it/s]
reward: -2.5491, last reward: -0.0276, gradient norm: 1.246: 86%|████████▌ | 537/625 [01:22<00:13, 6.49it/s]
reward: -2.5491, last reward: -0.0276, gradient norm: 1.246: 86%|████████▌ | 538/625 [01:22<00:13, 6.49it/s]
reward: -2.6426, last reward: -0.7294, gradient norm: 1.078e+03: 86%|████████▌ | 538/625 [01:22<00:13, 6.49it/s]
reward: -2.6426, last reward: -0.7294, gradient norm: 1.078e+03: 86%|████████▌ | 539/625 [01:22<00:13, 6.49it/s]
reward: -1.9928, last reward: -0.0003, gradient norm: 1.576: 86%|████████▌ | 539/625 [01:22<00:13, 6.49it/s]
reward: -1.9928, last reward: -0.0003, gradient norm: 1.576: 86%|████████▋ | 540/625 [01:22<00:13, 6.48it/s]
reward: -1.7937, last reward: -0.0124, gradient norm: 0.9664: 86%|████████▋ | 540/625 [01:22<00:13, 6.48it/s]
reward: -1.7937, last reward: -0.0124, gradient norm: 0.9664: 87%|████████▋ | 541/625 [01:22<00:12, 6.49it/s]
reward: -2.3342, last reward: -0.0204, gradient norm: 1.81: 87%|████████▋ | 541/625 [01:22<00:12, 6.49it/s]
reward: -2.3342, last reward: -0.0204, gradient norm: 1.81: 87%|████████▋ | 542/625 [01:22<00:12, 6.48it/s]
reward: -2.2046, last reward: -0.0122, gradient norm: 1.004: 87%|████████▋ | 542/625 [01:22<00:12, 6.48it/s]
reward: -2.2046, last reward: -0.0122, gradient norm: 1.004: 87%|████████▋ | 543/625 [01:22<00:12, 6.49it/s]
reward: -2.0000, last reward: -0.0014, gradient norm: 0.5496: 87%|████████▋ | 543/625 [01:22<00:12, 6.49it/s]
reward: -2.0000, last reward: -0.0014, gradient norm: 0.5496: 87%|████████▋ | 544/625 [01:22<00:12, 6.50it/s]
reward: -2.0956, last reward: -0.0059, gradient norm: 1.425: 87%|████████▋ | 544/625 [01:23<00:12, 6.50it/s]
reward: -2.0956, last reward: -0.0059, gradient norm: 1.425: 87%|████████▋ | 545/625 [01:23<00:12, 6.50it/s]
reward: -2.9028, last reward: -0.5843, gradient norm: 21.12: 87%|████████▋ | 545/625 [01:23<00:12, 6.50it/s]
reward: -2.9028, last reward: -0.5843, gradient norm: 21.12: 87%|████████▋ | 546/625 [01:23<00:12, 6.48it/s]
reward: -2.0674, last reward: -0.0178, gradient norm: 0.797: 87%|████████▋ | 546/625 [01:23<00:12, 6.48it/s]
reward: -2.0674, last reward: -0.0178, gradient norm: 0.797: 88%|████████▊ | 547/625 [01:23<00:12, 6.50it/s]
reward: -2.2815, last reward: -0.0599, gradient norm: 1.227: 88%|████████▊ | 547/625 [01:23<00:12, 6.50it/s]
reward: -2.2815, last reward: -0.0599, gradient norm: 1.227: 88%|████████▊ | 548/625 [01:23<00:11, 6.50it/s]
reward: -3.1587, last reward: -0.9276, gradient norm: 20.56: 88%|████████▊ | 548/625 [01:23<00:11, 6.50it/s]
reward: -3.1587, last reward: -0.9276, gradient norm: 20.56: 88%|████████▊ | 549/625 [01:23<00:11, 6.50it/s]
reward: -3.8228, last reward: -2.9229, gradient norm: 308.2: 88%|████████▊ | 549/625 [01:23<00:11, 6.50it/s]
reward: -3.8228, last reward: -2.9229, gradient norm: 308.2: 88%|████████▊ | 550/625 [01:23<00:11, 6.49it/s]
reward: -1.6164, last reward: -0.0120, gradient norm: 2.259: 88%|████████▊ | 550/625 [01:24<00:11, 6.49it/s]
reward: -1.6164, last reward: -0.0120, gradient norm: 2.259: 88%|████████▊ | 551/625 [01:24<00:11, 6.49it/s]
reward: -1.6850, last reward: -0.0227, gradient norm: 0.9167: 88%|████████▊ | 551/625 [01:24<00:11, 6.49it/s]
reward: -1.6850, last reward: -0.0227, gradient norm: 0.9167: 88%|████████▊ | 552/625 [01:24<00:11, 6.50it/s]
reward: -2.3092, last reward: -0.0670, gradient norm: 0.9177: 88%|████████▊ | 552/625 [01:24<00:11, 6.50it/s]
reward: -2.3092, last reward: -0.0670, gradient norm: 0.9177: 88%|████████▊ | 553/625 [01:24<00:11, 6.49it/s]
reward: -2.1599, last reward: -0.0043, gradient norm: 1.195: 88%|████████▊ | 553/625 [01:24<00:11, 6.49it/s]
reward: -2.1599, last reward: -0.0043, gradient norm: 1.195: 89%|████████▊ | 554/625 [01:24<00:10, 6.50it/s]
reward: -2.4672, last reward: -0.0057, gradient norm: 0.6367: 89%|████████▊ | 554/625 [01:24<00:10, 6.50it/s]
reward: -2.4672, last reward: -0.0057, gradient norm: 0.6367: 89%|████████▉ | 555/625 [01:24<00:10, 6.49it/s]
reward: -2.3657, last reward: -0.1970, gradient norm: 4.202: 89%|████████▉ | 555/625 [01:24<00:10, 6.49it/s]
reward: -2.3657, last reward: -0.1970, gradient norm: 4.202: 89%|████████▉ | 556/625 [01:24<00:10, 6.49it/s]
reward: -2.6694, last reward: -0.1215, gradient norm: 1.324: 89%|████████▉ | 556/625 [01:24<00:10, 6.49it/s]
reward: -2.6694, last reward: -0.1215, gradient norm: 1.324: 89%|████████▉ | 557/625 [01:24<00:10, 6.50it/s]
reward: -2.2622, last reward: -0.0372, gradient norm: 0.4841: 89%|████████▉ | 557/625 [01:25<00:10, 6.50it/s]
reward: -2.2622, last reward: -0.0372, gradient norm: 0.4841: 89%|████████▉ | 558/625 [01:25<00:10, 6.50it/s]
reward: -2.2707, last reward: -0.0058, gradient norm: 5.757: 89%|████████▉ | 558/625 [01:25<00:10, 6.50it/s]
reward: -2.2707, last reward: -0.0058, gradient norm: 5.757: 89%|████████▉ | 559/625 [01:25<00:10, 6.49it/s]
reward: -2.2267, last reward: -0.0014, gradient norm: 0.5415: 89%|████████▉ | 559/625 [01:25<00:10, 6.49it/s]
reward: -2.2267, last reward: -0.0014, gradient norm: 0.5415: 90%|████████▉ | 560/625 [01:25<00:10, 6.49it/s]
reward: -2.4556, last reward: -0.0163, gradient norm: 1.146: 90%|████████▉ | 560/625 [01:25<00:10, 6.49it/s]
reward: -2.4556, last reward: -0.0163, gradient norm: 1.146: 90%|████████▉ | 561/625 [01:25<00:09, 6.47it/s]
reward: -2.1839, last reward: -0.0809, gradient norm: 0.6262: 90%|████████▉ | 561/625 [01:25<00:09, 6.47it/s]
reward: -2.1839, last reward: -0.0809, gradient norm: 0.6262: 90%|████████▉ | 562/625 [01:25<00:09, 6.49it/s]
reward: -2.0278, last reward: -0.0018, gradient norm: 1.327: 90%|████████▉ | 562/625 [01:25<00:09, 6.49it/s]
reward: -2.0278, last reward: -0.0018, gradient norm: 1.327: 90%|█████████ | 563/625 [01:25<00:09, 6.50it/s]
reward: -2.1112, last reward: -0.0011, gradient norm: 0.354: 90%|█████████ | 563/625 [01:26<00:09, 6.50it/s]
reward: -2.1112, last reward: -0.0011, gradient norm: 0.354: 90%|█████████ | 564/625 [01:26<00:09, 6.52it/s]
reward: -2.6155, last reward: -0.0004, gradient norm: 2.008: 90%|█████████ | 564/625 [01:26<00:09, 6.52it/s]
reward: -2.6155, last reward: -0.0004, gradient norm: 2.008: 90%|█████████ | 565/625 [01:26<00:09, 6.54it/s]
reward: -3.1427, last reward: -0.3582, gradient norm: 7.624: 90%|█████████ | 565/625 [01:26<00:09, 6.54it/s]
reward: -3.1427, last reward: -0.3582, gradient norm: 7.624: 91%|█████████ | 566/625 [01:26<00:08, 6.57it/s]
reward: -2.7870, last reward: -0.9490, gradient norm: 18.26: 91%|█████████ | 566/625 [01:26<00:08, 6.57it/s]
reward: -2.7870, last reward: -0.9490, gradient norm: 18.26: 91%|█████████ | 567/625 [01:26<00:08, 6.59it/s]
reward: -3.0439, last reward: -0.8796, gradient norm: 29.89: 91%|█████████ | 567/625 [01:26<00:08, 6.59it/s]
reward: -3.0439, last reward: -0.8796, gradient norm: 29.89: 91%|█████████ | 568/625 [01:26<00:08, 6.59it/s]
reward: -2.8026, last reward: -0.2720, gradient norm: 8.612: 91%|█████████ | 568/625 [01:26<00:08, 6.59it/s]
reward: -2.8026, last reward: -0.2720, gradient norm: 8.612: 91%|█████████ | 569/625 [01:26<00:08, 6.60it/s]
reward: -2.3147, last reward: -0.8486, gradient norm: 41.13: 91%|█████████ | 569/625 [01:27<00:08, 6.60it/s]
reward: -2.3147, last reward: -0.8486, gradient norm: 41.13: 91%|█████████ | 570/625 [01:27<00:11, 4.78it/s]
reward: -1.7917, last reward: -0.0129, gradient norm: 2.365: 91%|█████████ | 570/625 [01:27<00:11, 4.78it/s]
reward: -1.7917, last reward: -0.0129, gradient norm: 2.365: 91%|█████████▏| 571/625 [01:27<00:10, 5.22it/s]
reward: -1.9553, last reward: -0.0020, gradient norm: 0.6871: 91%|█████████▏| 571/625 [01:27<00:10, 5.22it/s]
reward: -1.9553, last reward: -0.0020, gradient norm: 0.6871: 92%|█████████▏| 572/625 [01:27<00:09, 5.57it/s]
reward: -2.3132, last reward: -0.0159, gradient norm: 0.8646: 92%|█████████▏| 572/625 [01:27<00:09, 5.57it/s]
reward: -2.3132, last reward: -0.0159, gradient norm: 0.8646: 92%|█████████▏| 573/625 [01:27<00:08, 5.85it/s]
reward: -1.5320, last reward: -0.0269, gradient norm: 1.02: 92%|█████████▏| 573/625 [01:27<00:08, 5.85it/s]
reward: -1.5320, last reward: -0.0269, gradient norm: 1.02: 92%|█████████▏| 574/625 [01:27<00:08, 6.05it/s]
reward: -2.2955, last reward: -0.0245, gradient norm: 1.267: 92%|█████████▏| 574/625 [01:27<00:08, 6.05it/s]
reward: -2.2955, last reward: -0.0245, gradient norm: 1.267: 92%|█████████▏| 575/625 [01:27<00:08, 6.21it/s]
reward: -2.3347, last reward: -0.0179, gradient norm: 1.528: 92%|█████████▏| 575/625 [01:28<00:08, 6.21it/s]
reward: -2.3347, last reward: -0.0179, gradient norm: 1.528: 92%|█████████▏| 576/625 [01:28<00:07, 6.33it/s]
reward: -1.9718, last reward: -0.1629, gradient norm: 8.804: 92%|█████████▏| 576/625 [01:28<00:07, 6.33it/s]
reward: -1.9718, last reward: -0.1629, gradient norm: 8.804: 92%|█████████▏| 577/625 [01:28<00:07, 6.42it/s]
reward: -2.4164, last reward: -0.0070, gradient norm: 0.4335: 92%|█████████▏| 577/625 [01:28<00:07, 6.42it/s]
reward: -2.4164, last reward: -0.0070, gradient norm: 0.4335: 92%|█████████▏| 578/625 [01:28<00:07, 6.47it/s]
reward: -2.2993, last reward: -0.0011, gradient norm: 1.371: 92%|█████████▏| 578/625 [01:28<00:07, 6.47it/s]
reward: -2.2993, last reward: -0.0011, gradient norm: 1.371: 93%|█████████▎| 579/625 [01:28<00:07, 6.52it/s]
reward: -3.3049, last reward: -0.9063, gradient norm: 34.23: 93%|█████████▎| 579/625 [01:28<00:07, 6.52it/s]
reward: -3.3049, last reward: -0.9063, gradient norm: 34.23: 93%|█████████▎| 580/625 [01:28<00:06, 6.56it/s]
reward: -2.8785, last reward: -0.3295, gradient norm: 10.91: 93%|█████████▎| 580/625 [01:28<00:06, 6.56it/s]
reward: -2.8785, last reward: -0.3295, gradient norm: 10.91: 93%|█████████▎| 581/625 [01:28<00:06, 6.58it/s]
reward: -2.5184, last reward: -0.0546, gradient norm: 21.09: 93%|█████████▎| 581/625 [01:28<00:06, 6.58it/s]
reward: -2.5184, last reward: -0.0546, gradient norm: 21.09: 93%|█████████▎| 582/625 [01:28<00:06, 6.59it/s]
reward: -2.4039, last reward: -0.4589, gradient norm: 10.86: 93%|█████████▎| 582/625 [01:29<00:06, 6.59it/s]
reward: -2.4039, last reward: -0.4589, gradient norm: 10.86: 93%|█████████▎| 583/625 [01:29<00:06, 6.61it/s]
reward: -2.4697, last reward: -0.2476, gradient norm: 4.689: 93%|█████████▎| 583/625 [01:29<00:06, 6.61it/s]
reward: -2.4697, last reward: -0.2476, gradient norm: 4.689: 93%|█████████▎| 584/625 [01:29<00:06, 6.60it/s]
reward: -2.0018, last reward: -0.2397, gradient norm: 8.393: 93%|█████████▎| 584/625 [01:29<00:06, 6.60it/s]
reward: -2.0018, last reward: -0.2397, gradient norm: 8.393: 94%|█████████▎| 585/625 [01:29<00:06, 6.61it/s]
reward: -2.4953, last reward: -0.1775, gradient norm: 24.17: 94%|█████████▎| 585/625 [01:29<00:06, 6.61it/s]
reward: -2.4953, last reward: -0.1775, gradient norm: 24.17: 94%|█████████▍| 586/625 [01:29<00:05, 6.61it/s]
reward: -2.2258, last reward: -0.0110, gradient norm: 0.7671: 94%|█████████▍| 586/625 [01:29<00:05, 6.61it/s]
reward: -2.2258, last reward: -0.0110, gradient norm: 0.7671: 94%|█████████▍| 587/625 [01:29<00:05, 6.62it/s]
reward: -2.3981, last reward: -0.0011, gradient norm: 1.617: 94%|█████████▍| 587/625 [01:29<00:05, 6.62it/s]
reward: -2.3981, last reward: -0.0011, gradient norm: 1.617: 94%|█████████▍| 588/625 [01:29<00:05, 6.63it/s]
reward: -1.8590, last reward: -0.0007, gradient norm: 1.131: 94%|█████████▍| 588/625 [01:29<00:05, 6.63it/s]
reward: -1.8590, last reward: -0.0007, gradient norm: 1.131: 94%|█████████▍| 589/625 [01:29<00:05, 6.64it/s]
reward: -1.9820, last reward: -0.4221, gradient norm: 49.4: 94%|█████████▍| 589/625 [01:30<00:05, 6.64it/s]
reward: -1.9820, last reward: -0.4221, gradient norm: 49.4: 94%|█████████▍| 590/625 [01:30<00:05, 6.64it/s]
reward: -2.1293, last reward: -0.0116, gradient norm: 0.868: 94%|█████████▍| 590/625 [01:30<00:05, 6.64it/s]
reward: -2.1293, last reward: -0.0116, gradient norm: 0.868: 95%|█████████▍| 591/625 [01:30<00:05, 6.63it/s]
reward: -2.1675, last reward: -0.0173, gradient norm: 0.5931: 95%|█████████▍| 591/625 [01:30<00:05, 6.63it/s]
reward: -2.1675, last reward: -0.0173, gradient norm: 0.5931: 95%|█████████▍| 592/625 [01:30<00:04, 6.64it/s]
reward: -2.2910, last reward: -0.0207, gradient norm: 0.5219: 95%|█████████▍| 592/625 [01:30<00:04, 6.64it/s]
reward: -2.2910, last reward: -0.0207, gradient norm: 0.5219: 95%|█████████▍| 593/625 [01:30<00:04, 6.63it/s]
reward: -2.2124, last reward: -0.1730, gradient norm: 5.737: 95%|█████████▍| 593/625 [01:30<00:04, 6.63it/s]
reward: -2.2124, last reward: -0.1730, gradient norm: 5.737: 95%|█████████▌| 594/625 [01:30<00:04, 6.63it/s]
reward: -2.2914, last reward: -0.0206, gradient norm: 0.485: 95%|█████████▌| 594/625 [01:30<00:04, 6.63it/s]
reward: -2.2914, last reward: -0.0206, gradient norm: 0.485: 95%|█████████▌| 595/625 [01:30<00:04, 6.63it/s]
reward: -2.0890, last reward: -0.0172, gradient norm: 0.3982: 95%|█████████▌| 595/625 [01:31<00:04, 6.63it/s]
reward: -2.0890, last reward: -0.0172, gradient norm: 0.3982: 95%|█████████▌| 596/625 [01:31<00:04, 6.63it/s]
reward: -2.0945, last reward: -0.0121, gradient norm: 0.4789: 95%|█████████▌| 596/625 [01:31<00:04, 6.63it/s]
reward: -2.0945, last reward: -0.0121, gradient norm: 0.4789: 96%|█████████▌| 597/625 [01:31<00:04, 6.63it/s]
reward: -2.3805, last reward: -0.0069, gradient norm: 0.4074: 96%|█████████▌| 597/625 [01:31<00:04, 6.63it/s]
reward: -2.3805, last reward: -0.0069, gradient norm: 0.4074: 96%|█████████▌| 598/625 [01:31<00:04, 6.62it/s]
reward: -2.3310, last reward: -0.0031, gradient norm: 0.5065: 96%|█████████▌| 598/625 [01:31<00:04, 6.62it/s]
reward: -2.3310, last reward: -0.0031, gradient norm: 0.5065: 96%|█████████▌| 599/625 [01:31<00:03, 6.62it/s]
reward: -2.6028, last reward: -0.0006, gradient norm: 0.6316: 96%|█████████▌| 599/625 [01:31<00:03, 6.62it/s]
reward: -2.6028, last reward: -0.0006, gradient norm: 0.6316: 96%|█████████▌| 600/625 [01:31<00:03, 6.62it/s]
reward: -2.6724, last reward: -0.0001, gradient norm: 0.6523: 96%|█████████▌| 600/625 [01:31<00:03, 6.62it/s]
reward: -2.6724, last reward: -0.0001, gradient norm: 0.6523: 96%|█████████▌| 601/625 [01:31<00:03, 6.63it/s]
reward: -2.2481, last reward: -0.0136, gradient norm: 0.4298: 96%|█████████▌| 601/625 [01:31<00:03, 6.63it/s]
reward: -2.2481, last reward: -0.0136, gradient norm: 0.4298: 96%|█████████▋| 602/625 [01:31<00:03, 6.63it/s]
reward: -2.3524, last reward: -0.0043, gradient norm: 0.2629: 96%|█████████▋| 602/625 [01:32<00:03, 6.63it/s]
reward: -2.3524, last reward: -0.0043, gradient norm: 0.2629: 96%|█████████▋| 603/625 [01:32<00:03, 6.62it/s]
reward: -2.2635, last reward: -0.0069, gradient norm: 0.7839: 96%|█████████▋| 603/625 [01:32<00:03, 6.62it/s]
reward: -2.2635, last reward: -0.0069, gradient norm: 0.7839: 97%|█████████▋| 604/625 [01:32<00:03, 6.63it/s]
reward: -2.6041, last reward: -0.8027, gradient norm: 11.7: 97%|█████████▋| 604/625 [01:32<00:03, 6.63it/s]
reward: -2.6041, last reward: -0.8027, gradient norm: 11.7: 97%|█████████▋| 605/625 [01:32<00:03, 6.63it/s]
reward: -4.4170, last reward: -3.4675, gradient norm: 60.04: 97%|█████████▋| 605/625 [01:32<00:03, 6.63it/s]
reward: -4.4170, last reward: -3.4675, gradient norm: 60.04: 97%|█████████▋| 606/625 [01:32<00:02, 6.63it/s]
reward: -4.3153, last reward: -2.9316, gradient norm: 53.11: 97%|█████████▋| 606/625 [01:32<00:02, 6.63it/s]
reward: -4.3153, last reward: -2.9316, gradient norm: 53.11: 97%|█████████▋| 607/625 [01:32<00:02, 6.63it/s]
reward: -3.0649, last reward: -0.9722, gradient norm: 30.84: 97%|█████████▋| 607/625 [01:32<00:02, 6.63it/s]
reward: -3.0649, last reward: -0.9722, gradient norm: 30.84: 97%|█████████▋| 608/625 [01:32<00:02, 6.63it/s]
reward: -2.7989, last reward: -0.0329, gradient norm: 1.261: 97%|█████████▋| 608/625 [01:33<00:02, 6.63it/s]
reward: -2.7989, last reward: -0.0329, gradient norm: 1.261: 97%|█████████▋| 609/625 [01:33<00:02, 6.64it/s]
reward: -2.1976, last reward: -0.6852, gradient norm: 20.33: 97%|█████████▋| 609/625 [01:33<00:02, 6.64it/s]
reward: -2.1976, last reward: -0.6852, gradient norm: 20.33: 98%|█████████▊| 610/625 [01:33<00:02, 6.64it/s]
reward: -2.4793, last reward: -0.1255, gradient norm: 14.69: 98%|█████████▊| 610/625 [01:33<00:02, 6.64it/s]
reward: -2.4793, last reward: -0.1255, gradient norm: 14.69: 98%|█████████▊| 611/625 [01:33<00:02, 6.64it/s]
reward: -2.4581, last reward: -0.0394, gradient norm: 2.429: 98%|█████████▊| 611/625 [01:33<00:02, 6.64it/s]
reward: -2.4581, last reward: -0.0394, gradient norm: 2.429: 98%|█████████▊| 612/625 [01:33<00:01, 6.64it/s]
reward: -2.2047, last reward: -0.0326, gradient norm: 1.147: 98%|█████████▊| 612/625 [01:33<00:01, 6.64it/s]
reward: -2.2047, last reward: -0.0326, gradient norm: 1.147: 98%|█████████▊| 613/625 [01:33<00:01, 6.63it/s]
reward: -1.8967, last reward: -0.0129, gradient norm: 0.8619: 98%|█████████▊| 613/625 [01:33<00:01, 6.63it/s]
reward: -1.8967, last reward: -0.0129, gradient norm: 0.8619: 98%|█████████▊| 614/625 [01:33<00:01, 6.63it/s]
reward: -2.5906, last reward: -0.0015, gradient norm: 0.6491: 98%|█████████▊| 614/625 [01:33<00:01, 6.63it/s]
reward: -2.5906, last reward: -0.0015, gradient norm: 0.6491: 98%|█████████▊| 615/625 [01:33<00:01, 6.63it/s]
reward: -1.6634, last reward: -0.0007, gradient norm: 0.4394: 98%|█████████▊| 615/625 [01:34<00:01, 6.63it/s]
reward: -1.6634, last reward: -0.0007, gradient norm: 0.4394: 99%|█████████▊| 616/625 [01:34<00:01, 6.64it/s]
reward: -2.0624, last reward: -0.0061, gradient norm: 0.5676: 99%|█████████▊| 616/625 [01:34<00:01, 6.64it/s]
reward: -2.0624, last reward: -0.0061, gradient norm: 0.5676: 99%|█████████▊| 617/625 [01:34<00:01, 6.64it/s]
reward: -2.3259, last reward: -0.0131, gradient norm: 0.7733: 99%|█████████▊| 617/625 [01:34<00:01, 6.64it/s]
reward: -2.3259, last reward: -0.0131, gradient norm: 0.7733: 99%|█████████▉| 618/625 [01:34<00:01, 6.65it/s]
reward: -1.7515, last reward: -0.0189, gradient norm: 0.5575: 99%|█████████▉| 618/625 [01:34<00:01, 6.65it/s]
reward: -1.7515, last reward: -0.0189, gradient norm: 0.5575: 99%|█████████▉| 619/625 [01:34<00:00, 6.64it/s]
reward: -1.9313, last reward: -0.0207, gradient norm: 0.6286: 99%|█████████▉| 619/625 [01:34<00:00, 6.64it/s]
reward: -1.9313, last reward: -0.0207, gradient norm: 0.6286: 99%|█████████▉| 620/625 [01:34<00:00, 6.64it/s]
reward: -2.4325, last reward: -0.0171, gradient norm: 0.7832: 99%|█████████▉| 620/625 [01:34<00:00, 6.64it/s]
reward: -2.4325, last reward: -0.0171, gradient norm: 0.7832: 99%|█████████▉| 621/625 [01:34<00:00, 6.63it/s]
reward: -2.1134, last reward: -0.0144, gradient norm: 1.96: 99%|█████████▉| 621/625 [01:34<00:00, 6.63it/s]
reward: -2.1134, last reward: -0.0144, gradient norm: 1.96: 100%|█████████▉| 622/625 [01:34<00:00, 6.63it/s]
reward: -2.4572, last reward: -0.0500, gradient norm: 0.5838: 100%|█████████▉| 622/625 [01:35<00:00, 6.63it/s]
reward: -2.4572, last reward: -0.0500, gradient norm: 0.5838: 100%|█████████▉| 623/625 [01:35<00:00, 6.63it/s]
reward: -2.3818, last reward: -0.0019, gradient norm: 0.8623: 100%|█████████▉| 623/625 [01:35<00:00, 6.63it/s]
reward: -2.3818, last reward: -0.0019, gradient norm: 0.8623: 100%|█████████▉| 624/625 [01:35<00:00, 6.62it/s]
reward: -2.1253, last reward: -0.0001, gradient norm: 0.6622: 100%|█████████▉| 624/625 [01:35<00:00, 6.62it/s]
reward: -2.1253, last reward: -0.0001, gradient norm: 0.6622: 100%|██████████| 625/625 [01:35<00:00, 6.63it/s]
reward: -2.1253, last reward: -0.0001, gradient norm: 0.6622: 100%|██████████| 625/625 [01:35<00:00, 6.55it/s]
Conclusion
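In this tutorial, we learned how to code a stateless environment from scratch. We covered the following topics:
The four essential components to take care of when coding an environment (step, reset, seeding, and building specs), and how these methods and classes interact with the TensorDict class;
How to append transforms in the context of stateless environments, and how to write custom transforms;
How to train a policy on a fully differentiable simulator.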
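As a compact reminder of what "stateless" means for the step, reset, seeding, and specs components listed above, here is a minimal plain-Python sketch. It deliberately does not use TorchRL: the names PARAMS, ACTION_BOUNDS, make_rng, reset, and step are hypothetical stand-ins, and the constants and update rule are assumed to follow the Pendulum-v1 defaults rather than being taken from this section.

```python
import math
import random

# Assumed constants, modeled on the Pendulum-v1 defaults (not defined in this section).
PARAMS = {"g": 10.0, "m": 1.0, "l": 1.0, "dt": 0.05, "max_torque": 2.0, "max_speed": 8.0}

# "Specs": plain bounds standing in for TorchRL's spec objects.
ACTION_BOUNDS = (-PARAMS["max_torque"], PARAMS["max_torque"])


def angle_normalize(th):
    """Wrap an angle to the interval [-pi, pi)."""
    return ((th + math.pi) % (2 * math.pi)) - math.pi


def make_rng(seed):
    """Seeding: a dedicated generator makes resets reproducible."""
    return random.Random(seed)


def reset(rng):
    """Reset: sample a random initial angle and angular velocity."""
    return {"th": rng.uniform(-math.pi, math.pi), "thdot": rng.uniform(-1.0, 1.0)}


def step(state, action, params=PARAMS):
    """Step: a pure function of (state, action); nothing is stored between calls."""
    g, m, l, dt = params["g"], params["m"], params["l"], params["dt"]
    u = min(max(action, ACTION_BOUNDS[0]), ACTION_BOUNDS[1])  # clip to the action bounds
    th, thdot = state["th"], state["thdot"]
    # Cost penalizes distance from upright, angular velocity, and torque.
    cost = angle_normalize(th) ** 2 + 0.1 * thdot**2 + 0.001 * u**2
    new_thdot = thdot + (3 * g / (2 * l) * math.sin(th) + 3.0 / (m * l**2) * u) * dt
    new_thdot = min(max(new_thdot, -params["max_speed"]), params["max_speed"])
    new_th = th + new_thdot * dt
    # Return a new state instead of mutating the input.
    return {"th": new_th, "thdot": new_thdot}, -cost


# Usage: the caller owns the state and passes it back in at every step.
rng = make_rng(0)
state = reset(rng)
for _ in range(3):
    state, reward = step(state, 0.5)
```

Because step receives the state explicitly, the caller can restart from any state or swap the physical parameters between calls, which is the flexibility the stateless design is meant to provide.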
Total running time of the script: (2 minutes 51.454 seconds)
Estimated memory usage: 320 MB