
PPOv2Trainer and RLOOTrainer: Remove the implicit assumption that the reward model & policy model share the same tokenizer #1979

Open
RylanSchaeffer opened this issue Aug 26, 2024 · 6 comments
Labels
✨ enhancement New feature or request 🙋 help from community wanted Open invitation for community members to contribute 🏋 Online DPO Related to Online DPO 🏋 PPO Related to PPO

Comments

@RylanSchaeffer
Contributor

RylanSchaeffer commented Aug 26, 2024

Feature request

Remove the implicit assumption in PPOv2Trainer and RLOOTrainer that the reward model & policy model share the same tokenizer

Motivation

Currently, as I understand it, PPOv2Trainer and RLOOTrainer both assume that the reward model and the policy model share the same tokenizer. This is often not the case; for instance, I want to try different reward models from RewardBench, and these are often based on different language models with different tokenizers.

This implicit assumption should be removed.

To see where this assumption shows up, see this example: https://github.com/huggingface/trl/blob/main/trl/trainer/ppov2_trainer.py#L599-L601

Note that the raw token IDs produced by the policy model's tokenizer are passed directly to the reward model. These tokens are not meaningful if the reward model does not use the same tokenizer.
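
A toy demonstration of why this matters (the model names here are only examples; any two models with different tokenizers show the same effect):

```python
from transformers import AutoTokenizer

# The same token IDs mean different things under different tokenizers, so
# passing the policy's raw IDs to a reward model built on another tokenizer
# corrupts the input it actually scores.
policy_tok = AutoTokenizer.from_pretrained("gpt2")
reward_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

ids = policy_tok("The quick brown fox", add_special_tokens=False)["input_ids"]
print(policy_tok.decode(ids))  # "The quick brown fox"
print(reward_tok.decode(ids))  # unrelated tokens from BERT's vocabulary
```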

Your contribution

If the developers agree, I'd be happy to discuss this change with them and how to best implement it.

@RylanSchaeffer
Contributor Author

@qgallouedec could you please comment? Once we have consensus on (1) whether this is indeed a problem and (2) what a reasonable solution looks like, I can work on a PR.

@RylanSchaeffer
Contributor Author

@kashif can I ask you to please weigh in on this? I want to know whether you agree with this proposed change, and if so, what a solution might look like. I'd be happy to (help) implement it.

@qgallouedec
Member

qgallouedec commented Oct 20, 2024

This assumption is made by every trainer that uses a reward model, so it also includes Online DPO and its variants XPO and Nash-MD. This would be a great improvement.

@qgallouedec qgallouedec added ✨ enhancement New feature or request 🏋 PPO Related to PPO 🏋 Online DPO Related to Online DPO labels Oct 20, 2024
@kashif
Collaborator

kashif commented Oct 20, 2024

Right, this would be a good improvement, I believe...

@qgallouedec qgallouedec added the 🙋 help from community wanted Open invitation for community members to contribute label Dec 14, 2024
@TheTahaaa

If anyone is struggling with this, it can be worked around easily: all you have to do is override the get_reward() function.

First, decode the generated input_ids using the policy model's tokenizer, then re-tokenize the resulting strings with the reward model's tokenizer, and finally pass those to the reward model for scoring.
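
A minimal sketch of that workaround (the helper name is mine, and it assumes the reward model is an AutoModelForSequenceClassification-style model with a scalar head, rather than matching TRL's internal get_reward signature exactly):

```python
import torch

def get_reward_with_retokenization(
    reward_model, query_responses, policy_tokenizer, reward_tokenizer
):
    # 1) Decode the policy-model token IDs back into text.
    texts = policy_tokenizer.batch_decode(query_responses, skip_special_tokens=True)

    # 2) Re-tokenize the text with the reward model's own tokenizer.
    reward_inputs = reward_tokenizer(
        texts, padding=True, truncation=True, return_tensors="pt"
    ).to(reward_model.device)

    # 3) Score with the reward model's scalar head.
    with torch.no_grad():
        scores = reward_model(**reward_inputs).logits.squeeze(-1)
    return scores
```

One caveat: decoding with skip_special_tokens drops the policy's special/chat-template tokens, so the reward model's own chat template may need to be re-applied before scoring.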

(P.S. I genuinely think this should be handled internally by the PPOTrainer 🤨)

@RekkimiARG

I also have a strong need for this.
