
Reinforcement Learning

Slides

Books

Papers

Surveys

Recommended:

Optional:

Basic RL Algorithms

Recommended:

Optional:

LLM Alignment Techniques

RLHF

RLAIF

More About Reward Models For Alignment
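Most RLHF-style pipelines first fit a scalar reward model on human preference pairs. Below is a minimal sketch of the standard Bradley-Terry pairwise loss used for this; the function and tensor names are illustrative, not from any particular codebase.

```python
# Bradley-Terry pairwise loss for reward-model training: push the reward
# of the human-preferred response above that of the rejected response.
# `chosen_rewards` / `rejected_rewards` are illustrative names for the
# scalar outputs of a reward head on each response in a preference pair.
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```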

DPO

Analysis:

See the survey above for much more detail.
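For quick reference, here is a minimal sketch of the DPO objective (Rafailov et al., 2023). It assumes per-sequence log-probabilities under the policy and the frozen reference model have already been computed; the argument names and the beta value are illustrative.

```python
# DPO loss: -log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected)),
# where each log-ratio is log(pi_theta / pi_ref) for that response.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between the implicit rewards of chosen and rejected.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```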

Aha Moment

Process Reward Model
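Unlike an outcome reward model, a process reward model (PRM) scores every intermediate reasoning step. A minimal sketch of the usual inference-time aggregation, assuming per-step scores are already available (product and minimum are both common choices):

```python
# Aggregate per-step PRM scores into one solution-level score for ranking.
import math

def prm_solution_score(step_scores, how="prod"):
    """step_scores: per-step correctness probabilities from a PRM."""
    if how == "prod":
        return math.prod(step_scores)  # solution correct iff every step is
    return min(step_scores)            # alternative: score of the weakest step
```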

Test-Time Scaling

Analysis:

Vertical Scaling:

Parallel Scaling:
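Parallel scaling is usually best-of-N: sample several candidate answers in parallel and keep the one a verifier or reward model prefers. A minimal sketch, where `generate` and `score` are hypothetical stand-ins for a sampler and a scoring model:

```python
def best_of_n(prompt, generate, score, n=16):
    """Sample n candidates and return the one the scorer ranks highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))
```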

Reward-Model-Based RL Algorithms

Reward-Model-Free RL Algorithms (mostly about DPO, thus optional)

Appendix: Math Provers

2025

(In the tables below, RLHF and RLAIF use PPO by default.)
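For reference, a minimal sketch of PPO's clipped surrogate loss, the policy-gradient step behind these RLHF/RLAIF pipelines. Here `advantages` would come from a value model (e.g. via GAE); all names are illustrative.

```python
import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Probability ratio pi_theta(a|s) / pi_old(a|s), computed in log space.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic bound: take the elementwise min, negate to get a loss.
    return -torch.min(unclipped, clipped).mean()
```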

2022

2023

| Model | Algorithm |
| --- | --- |
| GPT-4 | RLHF + Rule-Based Reward Model (RBRM) |
| Llama 2 | RLHF + rejection sampling |
| Qwen | RLHF |
| Zephyr-7B | dDPO (direct distillation of LM alignment) |
| Starling-7B | RLAIF |
| Gemini | RLHF with a continuously updated reward model |

2024

| Model | Algorithm |
| --- | --- |
| DeepSeekMath | proposed GRPO (see the sketch after this table) |
| Claude 3 | RLAIF (see Constitutional AI) |
| InternLM2 | COOL RLHF |
| Reka | RLHF |
| Llama 3 | rejection sampling + DPO with some tricks |
| Phi-3 | DPO |
| Zephyr 141B-A39B | ORPO |
| DeepSeek-V2 | GRPO |
| Qwen2 | DPO |
| Nemotron-4 340B | DPO + RPO |
| ChatGLM | ChatGLM-RLHF |
| Hermes 3 | DPO + LoRA |
| Gemma 2 | RLHF |
| Qwen2.5 | DPO (offline) + GRPO (online) |
| Hunyuan-Large | DPO |
| Phi-4 | DPO |
| DeepSeek-V3 | GRPO |
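GRPO (introduced in the DeepSeekMath row above) drops PPO's value model and instead normalizes rewards within a group of G samples drawn for the same prompt. A minimal sketch of the group-relative advantage, with illustrative tensor shapes:

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor) -> torch.Tensor:
    """group_rewards: (num_prompts, G) rewards for G samples per prompt."""
    mean = group_rewards.mean(dim=1, keepdim=True)
    std = group_rewards.std(dim=1, keepdim=True)
    # A_i = (r_i - mean(r_1..G)) / std(r_1..G), as in DeepSeekMath.
    return (group_rewards - mean) / (std + 1e-8)
```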

2025

| Model | Algorithm |
| --- | --- |
| MiniMax-01 | DPO (offline) + modified GRPO (online) |
| Kimi-k1.5 | MDPO for CoT |
| DeepSeek-R1 | GRPO for CoT |
| Qwen3 | GRPO for CoT |
| Phi-4-reasoning | GRPO for CoT |