torchtune.rlhf

Components and losses for RLHF algorithms like PPO and DPO.

estimate_advantages

Estimates the advantages and returns for the PPO algorithm using Generalized Advantage Estimation: https://arxiv.org/pdf/1506.02438.pdf.
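
A minimal usage sketch follows; the positional (values, rewards, gamma, lmbda) argument order and the (advantages, returns) return pair are assumptions inferred from the summary above and the GAE formulation, so verify them against the installed torchtune version.

```python
import torch
from torchtune import rlhf

batch_size, response_len = 2, 8
values = torch.randn(batch_size, response_len)   # per-token value-head estimates
rewards = torch.randn(batch_size, response_len)  # per-token rewards, e.g. from get_rewards_ppo

# 1.0 is the discount factor gamma, 0.95 the GAE decay parameter lambda
advantages, returns = rlhf.estimate_advantages(values, rewards, 1.0, 0.95)
print(advantages.shape, returns.shape)  # both (batch_size, response_len)
```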

get_rewards_ppo

Calculates PPO rewards for the given scores, logprobs, and reference logprobs.
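
A hedged sketch of combining reward-model scores with a per-token KL penalty; the argument order and the three returned tensors are assumptions based on the summary above.

```python
import torch
from torchtune import rlhf

batch_size, response_len = 2, 8
scores = torch.randn(batch_size)                      # scalar reward-model score per sequence
logprobs = torch.randn(batch_size, response_len)      # policy logprobs of the sampled tokens
ref_logprobs = torch.randn(batch_size, response_len)  # logprobs under the frozen reference model

# 0.1 is the KL penalty coefficient; the function is assumed to return the
# total per-token rewards, the KL estimate, and the KL reward component
rewards, kl, kl_rewards = rlhf.get_rewards_ppo(scores, logprobs, ref_logprobs, 0.1)
```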

truncate_sequence_at_first_stop_token

Truncates sequence(s) after the first stop token and pads with fill_value.
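
A short sketch, assuming the function returns a boolean padding mask alongside the truncated sequences; the stop-token ids here are purely illustrative.

```python
import torch
from torchtune import rlhf

stop_tokens = torch.tensor([2, 869])  # e.g. EOS plus one extra stop id (illustrative)
sequences = torch.tensor(
    [
        [869, 30, 869, 11, 12],  # first stop token at position 0
        [15, 71, 2, 31, 11],     # first stop token at position 2
    ]
)
padding_mask, truncated = rlhf.truncate_sequence_at_first_stop_token(
    sequences, stop_tokens, fill_value=0
)
# everything after the first stop token is replaced with fill_value:
# truncated -> [[869, 0, 0, 0, 0], [15, 71, 2, 0, 0]]
```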

loss.PPOLoss

Proximal Policy Optimization (PPO) Loss module.
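
A sketch of one loss computation. The forward signature below (rollout-time and current policy logprobs, advantages, rollout-time and current value estimates, returns) and the five-element return are assumptions; check them against the class docstring.

```python
import torch
from torchtune.rlhf.loss import PPOLoss

loss_fn = PPOLoss()  # constructed with default clipping parameters

batch_size, response_len = 2, 8
pi_old_logprobs = torch.randn(batch_size, response_len)  # logprobs at rollout time
pi_logprobs = torch.randn(batch_size, response_len)      # logprobs under the current policy
advantages = torch.randn(batch_size, response_len)       # e.g. from estimate_advantages
phi_old_values = torch.randn(batch_size, response_len)   # value estimates at rollout time
phi_values = torch.randn(batch_size, response_len)       # current value estimates
returns = torch.randn(batch_size, response_len)          # e.g. from estimate_advantages

# the five-element return (total loss, policy loss, value loss, mean ratio,
# clip fraction) is an assumption
loss, policy_loss, value_loss, ratios, clipfrac = loss_fn(
    pi_old_logprobs, pi_logprobs, advantages, phi_old_values, phi_values, returns
)
```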

loss.DPOLoss

Direct Preference Optimization (DPO) loss module: https://arxiv.org/abs/2305.18290.
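
A sketch of one DPO step on a batch of preference pairs; the four-logprob forward signature and the (losses, chosen_rewards, rejected_rewards) return triple are assumptions to confirm against the class docstring.

```python
import torch
from torchtune.rlhf.loss import DPOLoss

loss_fn = DPOLoss()  # default beta and label smoothing

batch_size = 4
policy_chosen_logps = torch.randn(batch_size)     # summed logprobs of preferred responses
policy_rejected_logps = torch.randn(batch_size)   # summed logprobs of dispreferred responses
reference_chosen_logps = torch.randn(batch_size)  # same quantities under the frozen reference
reference_rejected_logps = torch.randn(batch_size)

losses, chosen_rewards, rejected_rewards = loss_fn(
    policy_chosen_logps,
    policy_rejected_logps,
    reference_chosen_logps,
    reference_rejected_logps,
)
loss = losses.mean()  # per-example losses are typically averaged before backward
```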

loss.RSOLoss

Statistical Rejection Sampling Optimization (RSO) or "hinge" loss module: https://arxiv.org/abs/2309.06657.
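
Assuming RSOLoss is a drop-in replacement for DPOLoss that swaps the sigmoid loss for a hinge loss, usage would look the same; the shared forward signature is an assumption.

```python
import torch
from torchtune.rlhf.loss import RSOLoss

loss_fn = RSOLoss()  # default hinge scaling
batch_size = 4
losses, chosen_rewards, rejected_rewards = loss_fn(
    torch.randn(batch_size),  # policy logprobs, chosen responses
    torch.randn(batch_size),  # policy logprobs, rejected responses
    torch.randn(batch_size),  # reference logprobs, chosen responses
    torch.randn(batch_size),  # reference logprobs, rejected responses
)
```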

loss.SimPOLoss

Simple Preference Optimization with a Reference-Free Reward: https://arxiv.org/abs/2405.14734.
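
Since SimPO is reference-free, the sketch below assumes forward takes only the policy logprobs (length-averaged, per the paper); both the two-argument signature and the return triple are assumptions.

```python
import torch
from torchtune.rlhf.loss import SimPOLoss

loss_fn = SimPOLoss()  # default beta and target reward margin

batch_size = 4
policy_chosen_logps = torch.randn(batch_size)    # average logprobs of preferred responses
policy_rejected_logps = torch.randn(batch_size)  # average logprobs of dispreferred responses

# no reference-model logprobs are needed
losses, chosen_rewards, rejected_rewards = loss_fn(policy_chosen_logps, policy_rejected_logps)
```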
