
An Easy-to-use, Scalable and High-performance RLHF Framework based on Ray (PPO & GRPO & REINFORCE++ & LoRA & vLLM & RFT)


OpenRLHF logo

Open-source / Comprehensive / Lightweight / Easy-to-use


[ English | 中文 | 日本語 ]

OpenRLHF is the first easy-to-use, high-performance open-source RLHF framework built on Ray, vLLM, ZeRO-3 and HuggingFace Transformers, designed to make RLHF training simple and accessible:

  • Distributed Architecture with Ray
    OpenRLHF leverages Ray for efficient distributed scheduling. It separates the Actor, Reward, Reference, and Critic models across different GPUs, enabling scalable training for models up to 70B parameters.
    It also supports Hybrid Engine scheduling, allowing all models and vLLM engines to share GPU resources—minimizing idle time and maximizing GPU utilization.
  • vLLM Inference Acceleration + AutoTP
    RLHF training spends about 80% of its time on the sample generation stage. Powered by vLLM and Auto Tensor Parallelism (AutoTP), OpenRLHF delivers high-throughput, memory-efficient sample generation. Native integration with HuggingFace Transformers ensures seamless and fast generation, making it the fastest RLHF framework available.
  • Memory-Efficient Training with ZeRO-3 / AutoTP
    Built on DeepSpeed's ZeRO-3, deepcompile and AutoTP, OpenRLHF enables large model training without heavyweight frameworks. It works directly with HuggingFace for easy loading and fine-tuning of pretrained models.
  • Optimized PPO Implementation
    Incorporates advanced PPO tricks inspired by practical guides and community best practices, enhancing training stability and reward quality in RLHF workflows, with reference to the Zhihu write-up and Advanced Tricks for Training Large Language Models with Proximal Policy Optimization.

More details are in Slides | Technical Report | Documents

News

Features

Quick Start

Installation

To use OpenRLHF, first launch the Docker container (recommended) and then pip install openrlhf inside it:

# Launch the docker container
docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN -v $PWD:/openrlhf nvcr.io/nvidia/pytorch:24.07-py3 bash
sudo pip uninstall xgboost transformer_engine flash_attn pynvml -y

# pip install
pip install openrlhf

# If you want to use vLLM acceleration (Install vLLM 0.8.3)
pip install openrlhf[vllm]
# latest vLLM is also supported
pip install openrlhf[vllm_latest]
# Install vLLM, ring-flash-attention and Liger-Kernel
pip install openrlhf[vllm,ring,liger]

# pip install the latest version
pip install git+https://github.com/OpenRLHF/OpenRLHF.git

# Or git clone
git clone https://github.com/OpenRLHF/OpenRLHF.git
cd OpenRLHF
pip install -e .

Note

We recommend using vLLM 0.8.3 or higher. We also provide Dockerfiles for vLLM and a one-click installation script for Nvidia-Docker.
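
After installation, a quick sanity check (a sketch, assuming you installed the vllm extra) is to confirm that both packages resolve:

from importlib.metadata import version

print("openrlhf", version("openrlhf"))  # installed OpenRLHF version
print("vllm", version("vllm"))          # installed vLLM version (0.8.3+ recommended)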

Prepare Datasets

OpenRLHF provides multiple data processing methods in our dataset classes. For example, in the Prompt Dataset:

def preprocess_data(data, input_template=None, input_key="input", apply_chat_template=None) -> str:
    if apply_chat_template:
        chat = data[input_key]
        if isinstance(chat, str):
            chat = [{"role": "user", "content": chat}]
        prompt = apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
    else:
        prompt = data[input_key]
        if input_template:
            prompt = input_template.format(prompt)
    return prompt
  • We can use --input_key to specify the JSON key name of the input datasets --prompt_data {name or path} (PPO) or --dataset {name or path}, and use --apply_chat_template to utilize the chat_template from the HuggingFace Tokenizer.
  • If you don't want to use --apply_chat_template, you can use --input_template instead (see the short example after this list), or preprocess the datasets offline in advance.
  • OpenRLHF also supports mixing multiple datasets using --prompt_data_probs 0.1,0.4,0.5 (PPO) or --dataset_probs 0.1,0.4,0.5.
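
When --apply_chat_template is not set, preprocess_data simply inserts the raw string into --input_template. A minimal illustration, assuming the preprocess_data function above is in scope and using a hypothetical sample:

sample = {"input": "What is RLHF?"}

prompt = preprocess_data(sample, input_template="User: {}\nAssistant: ", input_key="input")
# -> "User: What is RLHF?\nAssistant: "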

How Chat Templating Works:

dataset = [{"input_key": [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},
]}]

tokenizer.apply_chat_template(dataset[0]["input_key"], tokenize=False)

"<s>[INST] Hello, how are you? [/INST]I'm doing great. How can I help you today?</s> [INST] I'd like to show off how chat templating works! [/INST]"

How to specify test datasets?

Please set test datasets path using --eval_dataset {name or path}.

Note

The JSON key options depend on the specific dataset. See the Reward Dataset and SFT Dataset.

Supervised Fine-tuning

OpenRLHF's model checkpoint is fully compatible with HuggingFace models. You can specify the model name or path using --pretrain {name or path}, --reward_pretrain {name or path} and --critic_pretrain {name or path}. We have provided some pre-trained checkpoints and datasets on HuggingFace OpenRLHF.

Then you can use the startup scripts we provide in the examples/scripts directory, or start the training using the following commands.

deepspeed --module openrlhf.cli.train_sft \
   --max_len 4096 \
   --dataset Open-Orca/OpenOrca \
   --input_key question \
   --output_key response \
   --input_template $'User: {}\nAssistant: ' \
   --train_batch_size 256 \
   --micro_train_batch_size 2 \
   --max_samples 500000 \
   --pretrain meta-llama/Meta-Llama-3-8B \
   --save_path ./checkpoint/llama3-8b-sft \
   --save_steps -1 \
   --logging_steps 1 \
   --eval_steps -1 \
   --zero_stage 2 \
   --max_epochs 1 \
   --packing_samples \
   --bf16 \
   --flash_attn \
   --learning_rate 5e-6 \
   --gradient_checkpointing \
   --use_wandb {wandb_token}

# Support HF tokenizer.apply_chat_template
# --apply_chat_template 
# --tokenizer_chat_template {HF Chat Template}

# Support RingAttention
# pip install ring_flash_attn
#   --ring_attn_size 2 \
#   --ring_head_stride 2 \

# Multi-turn fine-tuning loss
# --multiturn

# Can also be used for continued pre-training
# --pretrain_mode
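
For reference, each record in the dataset above must carry the keys named by --input_key and --output_key. A hypothetical record and the prompt it produces with the --input_template above:

# hypothetical JSONL record matching --input_key question / --output_key response
record = {"question": "What is the capital of France?", "response": "Paris."}

# with --input_template $'User: {}\nAssistant: ' the rendered prompt is
# "User: What is the capital of France?\nAssistant: " and the response is the training target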

Note

OpenRLHF SFT/DPO/RewardModel/PPO trainers support --packing_samples, which is built on --flash_attn.

Reward Model Training

deepspeed --module openrlhf.cli.train_rm \
   --save_path ./checkpoint/llama3-8b-rm \
   --save_steps -1 \
   --logging_steps 1 \
   --eval_steps -1 \
   --train_batch_size 256 \
   --micro_train_batch_size 1 \
   --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
   --bf16 \
   --max_epochs 1 \
   --max_len 8192 \
   --zero_stage 3 \
   --learning_rate 9e-6 \
   --dataset OpenRLHF/preference_dataset_mixture2_and_safe_pku \
   --apply_chat_template \
   --chosen_key chosen \
   --rejected_key rejected \
   --flash_attn \
   --packing_samples \
   --gradient_checkpointing \
   --use_wandb {wandb_token}

It is recommended to set the --value_prefix_head option of the Reward Model to score, so that we can load the model using AutoModelForSequenceClassification:

import torch
from transformers import AutoModelForSequenceClassification

reward_model = AutoModelForSequenceClassification.from_pretrained(
    reward_model_path,
    num_labels=1,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    use_cache=False,
)
inputs = ...  # left-padded input tokens (a tokenizer BatchEncoding)
reward = reward_model.model(**inputs).last_hidden_state
reward = reward_model.score(reward)[:, -1]
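
A minimal sketch of producing the left-padded inputs used above, assuming the reward model checkpoint also ships a tokenizer:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(reward_model_path)
tokenizer.padding_side = "left"  # left padding so the final position holds the last real token
inputs = tokenizer(
    ["User: Hello\nAssistant: Hi there!"],
    return_tensors="pt",
    padding=True,
)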

PPO/REINFORCE++ with Ray and vLLM

To improve RLHF training speed or support 70B models, we can use PPO with Ray and vLLM acceleration (Hybrid Engine):

# launch the master node of ray in container
ray start --head --node-ip-address 0.0.0.0 --num-gpus 8

# if you want to launch ray on more nodes, use
ray start --address {MASTER-NODE-ADDRESS}:6379  --num-gpus 8

ray job submit --address="http://127.0.0.1:8265" \
   --runtime-env-json='{"working_dir": "/openrlhf"}' \
   -- python3 -m openrlhf.cli.train_ppo_ray \
   --ref_num_nodes 1 \
   --ref_num_gpus_per_node 8 \
   --reward_num_nodes 1 \
   --reward_num_gpus_per_node 8 \
   --critic_num_nodes 1 \
   --critic_num_gpus_per_node 8 \
   --actor_num_nodes 1 \
   --actor_num_gpus_per_node 8 \
   --vllm_num_engines 4 \
   --vllm_tensor_parallel_size 2 \
   --colocate_all_models \
   --vllm_gpu_memory_utilization 0.5 \
   --pretrain OpenRLHF/Llama-3-8b-sft-mixture \
   --reward_pretrain OpenRLHF/Llama-3-8b-rm-700k \
   --save_path /openrlhf/examples/test_scripts/final/llama3-8b-rlhf \
   --ckpt_path /openrlhf/examples/test_scripts/ckpt/llama3-8b-rlhf \
   --save_hf_ckpt \
   --micro_train_batch_size 8 \
   --train_batch_size 128 \
   --micro_rollout_batch_size 16 \
   --rollout_batch_size 1024 \
   --n_samples_per_prompt 1 \
   --max_epochs 1 \
   --prompt_max_len 1024 \
   --max_samples 100000 \
   --generate_max_len 1024 \
   --zero_stage 3 \
   --bf16 \
   --actor_learning_rate 5e-7 \
   --critic_learning_rate 9e-6 \
   --init_kl_coef 0.01 \
   --prompt_data OpenRLHF/prompt-collection-v0.1 \
   --input_key context_messages \
   --apply_chat_template \
   --normalize_reward \
   --gradient_checkpointing \
   --packing_samples \
   --vllm_sync_backend nccl \
   --enforce_eager \
   --vllm_enable_sleep \
   --deepspeed_enable_sleep \
   --use_wandb {wandb_token}

# Support REINFORCE++  | RLOO | REINFORCE++-baseline | GRPO | Dr. GRPO
# --advantage_estimator reinforce | rloo | reinforce_baseline | group_norm | dr_grpo

# Setting --init_kl_coef to 0 will not launch the reference model

# Support remote reward model (HTTP)
# --remote_rm_url http://localhost:5000/get_reward

# Support N samples
# --n_samples_per_prompt 4

Note

You can also use setup_commands to let Ray automatically deploy the environment, such as --runtime-env-json='{"setup_commands": ["pip install openrlhf[vllm]"]}'.

Note

RLOO and REINFORCE++-baseline in OpenRLHF are modifications of REINFORCE++:

  • REINFORCE++ integrates key optimization techniques from PPO (such as advantage normalization and the PPO-clip loss) into REINFORCE while eliminating the need for a critic network.
  • REINFORCE++-baseline uses the mean reward of multiple samples from the same prompt as the baseline to reshape the rewards (with global batch normalization /std); see the sketch after this list.
  • RLOO in OpenRLHF modifies the original version by incorporating the per-token KL reward and utilizing the PPO-clip loss.
  • Dr. GRPO removes the group normalization /std in GRPO.
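
A minimal sketch of the REINFORCE++-baseline reward reshaping described above (not OpenRLHF's exact implementation): subtract the per-prompt mean reward, then apply global batch normalization /std.

import torch

def reinforce_baseline_reshape(rewards: torch.Tensor, n_samples_per_prompt: int) -> torch.Tensor:
    # rewards: (batch,) where each consecutive group of n_samples_per_prompt shares one prompt
    grouped = rewards.view(-1, n_samples_per_prompt)
    centered = grouped - grouped.mean(dim=1, keepdim=True)  # per-prompt mean as baseline
    flat = centered.flatten()
    return (flat - flat.mean()) / (flat.std() + 1e-8)       # global batch normalization /std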

Note

If you encounter an error related to index out of range when DeepSpeed sets up the GPU devices, you can try setting the environment variable RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES as a workaround.

# For NVIDIA GPUs:
export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1

The launch scripts and documentation for the supported algorithms are in examples/scripts and Documents - Usage.

Reinforced Fine-tuning

OpenRLHF supports convenient and efficient Reinforced Fine-tuning (RFT). You only need to implement a file containing a custom reward_func function and pass its path via the remote_rm_url parameter. For example:

# reward_func.py
import torch

def reward_func(queries, prompts, labels):
    # queries are prompts + responses
    # labels are the answers
    print(queries)
    return torch.randn(len(queries))

Then just set:

ray job submit --address="http://127.0.0.1:8265" \
  --runtime-env-json='{"working_dir": "/openrlhf"}' \
  -- python3 -m openrlhf.cli.train_ppo_ray \
  ...
  --remote_rm_url /path/to/reward_func.py \
  --label_key answer

where the label_key parameter is used to pass additional sample information, such as the answer, to the reward function.
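
For instance, a rule-based exact-match reward might look like the following sketch (the scoring logic here is hypothetical and task-specific; only the reward_func signature shown above is required):

# reward_func.py -- hypothetical exact-match reward
import torch

def reward_func(queries, prompts, labels):
    # queries are prompts + responses; labels carry the reference answers (--label_key answer)
    rewards = [1.0 if label.strip() and label.strip() in query else 0.0
               for query, label in zip(queries, labels)]
    return torch.tensor(rewards, dtype=torch.float)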

LoRA

If you use LoRA (Low-Rank Adaptation), OpenRLHF saves only the LoRA adapter by default rather than the full model weights. To use the model in downstream tasks, you should merge the adapter with the weights of your base model:

python -m openrlhf.cli.lora_combiner \
    --model_path meta-llama/Meta-Llama-3-8B \
    --lora_path ./checkpoint/llama3-8b-rm \
    --output_path ./checkpoint/llama-3-8b-rm-combined \
    --is_rm \
    --bf16
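
After merging, the combined reward model can be loaded like any HuggingFace checkpoint; a minimal sketch, assuming --value_prefix_head was set to score as recommended above:

import torch
from transformers import AutoModelForSequenceClassification

reward_model = AutoModelForSequenceClassification.from_pretrained(
    "./checkpoint/llama-3-8b-rm-combined",
    num_labels=1,
    torch_dtype=torch.bfloat16,
)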

Performance

We optimized DSChat's performance to the greatest extent possible by employing techniques such as enabling Adam offload, along with reward model (RM) and reference model (Ref) offload, to increase the micro-batch size during the inference stage and avoid out-of-memory issues. We even fixed some bugs in DSChat to enable the Hybrid Engine (HE) for LLaMA2. The average time (in seconds) to train 1024 prompts for 1 PPO epoch with Optimized DSChat and OpenRLHF:

Size   NVIDIA A800-80GB GPUs   Optimized DSChat (with Hybrid Engine)   OpenRLHF   Speedup
7B     16                      855.09                                  471.11     1.82x
13B    32                      1528.93                                 608.93     2.5x
34B    32                      3634.98                                 1526.4     2.4x
70B    32                      10407.0                                 4488.53    2.3x

Note

The data is outdated; please refer to the performance tuning section for re-testing.

Performance Tuning Guide

To achieve optimal performance, we recommend allocating nodes vLLM:Actor:Critic = 1:1:1.

  • For example, for a 70B model with 48 A100 GPUs, it is advised to allocate 16 A100 GPUs to the vLLM Engine, 16 GPUs to the Actor model, and the remaining 16 GPUs to the Critic model.
  • Use the Hybrid Engine (--colocate_all_models with --vllm_enable_sleep and --deepspeed_enable_sleep) rather than distributed RLHF when there is enough GPU memory.
  • Enable the --colocate_critic_reward, --colocate_actor_ref options to merge nodes.
  • You should increase the rollout_micro_batch_size (and minimize the TP size of the vLLM engine) as much as possible. During the training phase, a larger --micro_train_batch_size is better; also enable --packing_samples.
  • When there is enough GPU memory, please disable --adam_offload and enable --overlap_comm. Also enable --deepcompile to speed up training.
  • For vLLM, please use --vllm_sync_backend nccl
  • Enable enable_prefix_caching in vLLM generation when n_samples_per_prompt > 1.
  • For a large base model, if an OOM occurs, do not use any --colocate_xxxx options.

Companies and Organizations using OpenRLHF

  • Google
  • ByteDance
  • Tencent
  • Alibaba
  • Baidu
  • China Telecom
  • Vivo
  • Allen AI
  • NexusFlow
  • Jülich Supercomputing Centre (JSC)
  • Berkeley Starling Team
  • M-A-P
  • ...

Join Us

How to Join?

  1. Email us at janhu9527@gmail.com or join GitHub Organization. Please include the following details:
    • Your name
    • Your GitHub username
    • Your areas of interest
    • Your skills and experience related to NLP and/or AI
  2. You can also join us through the official GitHub OpenRLHF project page. Just create an issue about your interest in contributing and we will get back to you.

What can you do?

  1. Join the team and participate in the development of the OpenRLHF project.
  2. Contribute to the project by submitting pull requests.
  3. Help improve documentation, fix bugs, or create new features.
  4. Share the project and help us grow the community.

Sponsor Us

Your sponsorship can help us maintain and improve OpenRLHF. If you find this project useful, please consider sponsoring us. You can sponsor us on Open Collective.

Starchart

Star History Chart

Contributors

A big thank you to all our contributors! If you want to contribute, feel free to make a pull request or create an issue.

References & Acknowledgements

We would like to express our gratitude to the following projects and organizations for their contributions to the field of AI and NLP:

Our project would also like to thank ColossalChat and DeepSpeedChat; in the early stages of the project, we referred to their code design. We would also like to thank Netmind.AI for providing GPU support for developing ring attention.

(2024/7) Our GitHub organization has changed from OpenLLMAI to OpenRLHF.

Citation

@article{hu2024openrlhf,
  title={OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework},
  author={Jian Hu and Xibin Wu and Zilin Zhu and Xianyu and Weixun Wang and Dehao Zhang and Yu Cao},
  journal={arXiv preprint arXiv:2405.11143},
  year={2024}
}

OpenRLHF © 2025 OpenRLHF. All Rights Reserved.