
PPOv2Trainer and RLOOTrainer: Remove the implicit assumption that the reward model & policy model share the same tokenizer #1979

Open
RylanSchaeffer opened this issue Aug 26, 2024 · 6 comments
Labels
✨ enhancement New feature or request 🙋 help from community wanted Open invitation for community members to contribute 🏋 Online DPO Related to Online DPO 🏋 PPO Related to PPO

Comments

@RylanSchaeffer
Contributor

RylanSchaeffer commented Aug 26, 2024

Feature request

Remove the implicit assumption in PPOv2Trainer and RLOOTrainer that the reward model & policy model share the same tokenizer

Motivation

Currently, as I understand it, PPOv2Trainer and RLOOTrainer both assume that the reward model and the policy model share the same tokenizer. This is often not the case; for instance, I want to try different reward models from RewardBench, and these are often based on different language models with different tokenizers.

This implicit assumption should be removed.

To see where this assumption shows up, see this example: https://github.com/huggingface/trl/blob/main/trl/trainer/ppov2_trainer.py#L599-L601

Note that the raw token IDs produced by the policy model's tokenizer are passed directly to the reward model. These tokens are not meaningful if the reward model does not use the same tokenizer.
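
A toy demonstration of why this matters (the model names here are only examples; any two models with different tokenizers show the same effect):

```python
from transformers import AutoTokenizer

# The same token IDs mean different things under different tokenizers, so
# passing the policy's raw IDs to a reward model built on another tokenizer
# corrupts the input it actually scores.
policy_tok = AutoTokenizer.from_pretrained("gpt2")
reward_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

ids = policy_tok("The quick brown fox", add_special_tokens=False)["input_ids"]
print(policy_tok.decode(ids))  # "The quick brown fox"
print(reward_tok.decode(ids))  # unrelated tokens from BERT's vocabulary
```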

Your contribution

If the developers agree, I'd be happy to discuss this change with them and how to best implement it.

@RylanSchaeffer
Contributor Author

@qgallouedec could you please comment? Once we have consensus on (1) whether this is indeed a problem and (2) what a reasonable solution looks like, I can work on a PR.

@RylanSchaeffer
Contributor Author

@kashif can I ask you to please weigh in on this? I want to know whether you agree with this proposed change, and if so, what a solution might look like. I'd be happy to (help) implement it.

@qgallouedec
Member

qgallouedec commented Oct 20, 2024

This assumption is made by every trainer that uses a reward model, so it also includes Online DPO and its variants XPO and Nash-MD. This would be a great improvement.

@qgallouedec qgallouedec added ✨ enhancement New feature or request 🏋 PPO Related to PPO 🏋 Online DPO Related to Online DPO labels Oct 20, 2024
@kashif
Collaborator

kashif commented Oct 20, 2024

Right, this would be a good improvement, I believe...

@qgallouedec qgallouedec added the 🙋 help from community wanted Open invitation for community members to contribute label Dec 14, 2024
@TheTahaaa

If anyone is struggling with this, it can be worked around easily: all you have to do is override the get_reward() function.

First, decode the generated input_ids using the policy model's tokenizer, then re-tokenize the resulting strings with the reward model's tokenizer, and finally pass those to the reward model for scoring.
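
A minimal sketch of that workaround (the helper name is mine, and it assumes the reward model is an AutoModelForSequenceClassification-style model with a scalar head, rather than matching TRL's internal get_reward signature exactly):

```python
import torch

def get_reward_with_retokenization(
    reward_model, query_responses, policy_tokenizer, reward_tokenizer
):
    # 1) Decode the policy-model token IDs back into text.
    texts = policy_tokenizer.batch_decode(query_responses, skip_special_tokens=True)

    # 2) Re-tokenize the text with the reward model's own tokenizer.
    reward_inputs = reward_tokenizer(
        texts, padding=True, truncation=True, return_tensors="pt"
    ).to(reward_model.device)

    # 3) Score with the reward model's scalar head.
    with torch.no_grad():
        scores = reward_model(**reward_inputs).logits.squeeze(-1)
    return scores
```

One caveat: decoding with skip_special_tokens drops the policy's special/chat-template tokens, so the reward model's own chat template may need to be re-applied before scoring.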

(P.S. I genuinely think this should be handled internally by the PPOTrainer 🤨)

@RekkimiARG

I also have a strong need for this.
