See the survey for much more detail.
(Unless noted otherwise, RLHF and RLAIF in these tables use PPO as the underlying RL algorithm.)
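As background for that note: PPO optimizes a clipped surrogate objective. A minimal per-action sketch (the function name and scalar interface are illustrative, not taken from any of the systems below):

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate loss for one action (averaged over a batch in practice).

    The probability ratio pi_new/pi_old is computed from log-probabilities
    for numerical stability; clipping stops a single update from moving the
    policy too far, and we return the negative objective to minimize.
    """
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    return -min(unclipped, clipped)
```

With `clip_eps=0.2`, a ratio pushed above 1.2 (or below 0.8) is clipped, which is what keeps each policy update close to the model that generated the samples.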
2022
2023
| Model | Algorithm |
|---|---|
| GPT-4 | RLHF + Rule-Based Reward Model (RBRM) |
| Llama2 | RLHF + Rejection Sampling |
| Qwen | RLHF |
| Zephyr-7B | dDPO (direct distillation of LM alignment) |
| Starling-7B | RLAIF |
| Gemini | RLHF with a continuously updated reward model |
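Rejection sampling, as listed for Llama2 above, amounts to best-of-N selection under a reward model. A toy sketch (`generate` and `reward_model` are placeholder callables, not real APIs):

```python
def rejection_sample(prompt, generate, reward_model, n=4):
    """Best-of-N rejection sampling: draw n candidate responses for a prompt
    and keep only the one the reward model scores highest; the kept pairs
    are then typically reused as fine-tuning data."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model(prompt, c))

# Toy usage: a canned "generator" and a length-based "reward model".
canned = iter(["short", "a longer answer", "mid", "the longest answer of all"])
best = rejection_sample("prompt", lambda p: next(canned), lambda p, c: len(c))
```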
2024
| Model | Algorithm |
|---|---|
| DeepSeekMath | GRPO (proposed in this work) |
| Claude3 | RLAIF (see Constitutional AI) |
| InternLM2 | COOL RLHF |
| Reka | RLHF |
| Llama3 | Rejection sampling + DPO (plus additional tricks) |
| Phi-3 | DPO |
| Zephyr 141B-A39B | ORPO |
| DeepSeek-V2 | GRPO |
| Qwen2 | DPO |
| Nemotron-4 340B | DPO + RPO |
| ChatGLM | ChatGLM-RLHF |
| Hermes 3 | DPO + LoRA |
| Gemma2 | RLHF |
| Qwen2.5 | DPO (offline) + GRPO (online) |
| Hunyuan-Large | DPO |
| Phi-4 | DPO |
| DeepSeek-V3 | GRPO |
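DPO, which many of the 2024 entries above rely on, needs no reward model or RL loop: it trains directly on preference pairs against a frozen reference model. A minimal per-pair sketch (scalar sequence log-probs; `beta=0.1` is a common but not universal choice):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    Each argument is a full-sequence log-probability; the ref_* values come
    from the frozen reference model. Minimizing this raises the policy's
    relative likelihood of the chosen response over the rejected one.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

At initialization, where policy and reference agree, the margin is zero and the loss is log 2; favoring the chosen response drives it down.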
2025
| Model | Algorithm |
|---|---|
| MiniMax-01 | DPO (offline) + modified GRPO (online) |
| Kimi-k1.5 | MDPO for CoT |
| DeepSeek-R1 | GRPO for CoT |
| Qwen3 | GRPO for CoT |
| Phi-4-reasoning | GRPO for CoT |
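GRPO, which dominates these reasoning-focused 2025 entries, drops PPO's learned value model: it samples a group of responses per prompt and scores each reward against the group. A sketch of that advantage computation (the epsilon and the population-vs-sample std choice vary between implementations):

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: each response's reward is
    standardized against the mean and std of its own group of samples,
    so no separate value network is needed."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)  # some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in group_rewards]
```

Responses better than their group average get positive advantages and are reinforced; the rest are suppressed, which is enough signal for long CoT training.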