PPOv2Trainer and RLOOTrainer: Remove the implicit assumption that the reward model & policy model share the same tokenizer #1979
Labels: ✨ enhancement (New feature or request), 🙋 help from community wanted (Open invitation for community members to contribute), 🏋 Online DPO (Related to Online DPO), 🏋 PPO (Related to PPO)
Feature request
Remove the implicit assumption in PPOv2Trainer and RLOOTrainer that the reward model & policy model share the same tokenizer.

Motivation
Currently, as I understand it, PPOv2Trainer and RLOOTrainer both assume that the reward model and the policy model share the same tokenizer. This is often not the case; for instance, I want to try different reward models from RewardBench, and these are frequently built on different base language models with different tokenizers. This implicit assumption should be removed.
For an example of where this behavior shows up, see https://github.com/huggingface/trl/blob/main/trl/trainer/ppov2_trainer.py#L599-L601
Note that the raw token ids produced by the policy model's tokenizer are passed directly to the reward model. These ids are not meaningful if the reward model uses a different tokenizer.
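For reference, a minimal sketch of what decoupling the two tokenizers might look like: decode the policy-side token ids back to text, then re-tokenize with the reward model's own tokenizer before scoring. The helper name and the assumption of a sequence-classification reward head are illustrative, not part of the current trainer API.

```python
import torch


def score_with_reward_tokenizer(reward_model, reward_tokenizer, policy_tokenizer, query_response_ids):
    """Hypothetical helper: re-tokenize policy outputs for a reward model
    that uses a different tokenizer than the policy model."""
    # Decode the policy-side token ids back into plain text.
    texts = policy_tokenizer.batch_decode(query_response_ids, skip_special_tokens=True)

    # Re-tokenize the text with the reward model's tokenizer.
    reward_inputs = reward_tokenizer(
        texts,
        padding=True,
        truncation=True,
        return_tensors="pt",
    ).to(reward_model.device)

    # Score with the reward model (a sequence-classification head is assumed here).
    with torch.no_grad():
        scores = reward_model(**reward_inputs).logits.squeeze(-1)
    return scores
```

The cost is an extra decode/encode round-trip per batch, but it removes the coupling between the two tokenizers.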
Your contribution
If the developers agree, I'd be happy to discuss this change with them and how best to implement it.