
Reinforcement Learning

Slides

Books

Papers

Surveys

Recommended:

Optional:

Basic RL Algorithms

Recommended:

Optional:

LLM Alignment Techniques

RLHF

RLAIF

More About Reward Models For Alignment
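Most RLHF-style pipelines first fit a scalar reward model on human preference pairs. Below is a minimal sketch of the standard Bradley-Terry pairwise loss used for this; the function and tensor names are illustrative, not from any particular codebase.

```python
# Bradley-Terry pairwise loss for reward-model training: push the reward
# of the human-preferred response above that of the rejected response.
# `chosen_rewards` / `rejected_rewards` are illustrative names for the
# scalar outputs of a reward head on each response in a preference pair.
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```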

DPO

Analysis:

See the survey above for much more detail.
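For quick reference, here is a minimal sketch of the DPO objective (Rafailov et al., 2023). It assumes per-sequence log-probabilities under the policy and the frozen reference model have already been computed; the argument names and the beta value are illustrative.

```python
# DPO loss: -log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected)),
# where each log-ratio is log(pi_theta / pi_ref) for that response.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between the implicit rewards of chosen and rejected.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```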

Aha Moment

Process Reward Model
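Unlike an outcome reward model, a process reward model (PRM) scores every intermediate reasoning step. A minimal sketch of the usual inference-time aggregation, assuming per-step scores are already available (product and minimum are both common choices):

```python
# Aggregate per-step PRM scores into one solution-level score for ranking.
import math

def prm_solution_score(step_scores, how="prod"):
    """step_scores: per-step correctness probabilities from a PRM."""
    if how == "prod":
        return math.prod(step_scores)  # solution correct iff every step is
    return min(step_scores)            # alternative: score of the weakest step
```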

Test-Time Scaling

Analysis:

Vertical Scaling:

Parallel Scaling:
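Parallel scaling is usually best-of-N: sample several candidate answers in parallel and keep the one a verifier or reward model prefers. A minimal sketch, where `generate` and `score` are hypothetical stand-ins for a sampler and a scoring model:

```python
def best_of_n(prompt, generate, score, n=16):
    """Sample n candidates and return the one the scorer ranks highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))
```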

Reward-Model-Based RL Algorithms

Reward-Model-Free RL Algorithms (mostly about DPO, thus optional)

Appendix: Math Provers

2025

(In the tables below, RLHF and RLAIF use PPO by default.)
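For reference, a minimal sketch of PPO's clipped surrogate loss, the policy-gradient step behind these RLHF/RLAIF pipelines. Here `advantages` would come from a value model (e.g. via GAE); all names are illustrative.

```python
import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Probability ratio pi_theta(a|s) / pi_old(a|s), computed in log space.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic bound: take the elementwise min, negate to get a loss.
    return -torch.min(unclipped, clipped).mean()
```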

2022

2023

| Model | Algorithm |
| --- | --- |
| GPT-4 | RLHF + Rule-Based Reward Model (RBRM) |
| Llama 2 | RLHF + rejection sampling |
| Qwen | RLHF |
| Zephyr-7B | dDPO (direct distillation of LM alignment) |
| Starling-7B | RLAIF |
| Gemini | RLHF with a continuously updated reward model |

2024

| Model | Algorithm |
| --- | --- |
| DeepSeekMath | proposed GRPO (see the sketch after this table) |
| Claude 3 | RLAIF (see Constitutional AI) |
| InternLM2 | COOL RLHF |
| Reka | RLHF |
| Llama 3 | rejection sampling + DPO with some tricks |
| Phi-3 | DPO |
| Zephyr 141B-A39B | ORPO |
| DeepSeek-V2 | GRPO |
| Qwen2 | DPO |
| Nemotron-4 340B | DPO + RPO |
| ChatGLM | ChatGLM-RLHF |
| Hermes 3 | DPO + LoRA |
| Gemma 2 | RLHF |
| Qwen2.5 | DPO (offline) + GRPO (online) |
| Hunyuan-Large | DPO |
| Phi-4 | DPO |
| DeepSeek-V3 | GRPO |
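GRPO (introduced in the DeepSeekMath row above) drops PPO's value model and instead normalizes rewards within a group of G samples drawn for the same prompt. A minimal sketch of the group-relative advantage, with illustrative tensor shapes:

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor) -> torch.Tensor:
    """group_rewards: (num_prompts, G) rewards for G samples per prompt."""
    mean = group_rewards.mean(dim=1, keepdim=True)
    std = group_rewards.std(dim=1, keepdim=True)
    # A_i = (r_i - mean(r_1..G)) / std(r_1..G), as in DeepSeekMath.
    return (group_rewards - mean) / (std + 1e-8)
```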

2025

| Model | Algorithm |
| --- | --- |
| MiniMax-01 | DPO (offline) + modified GRPO (online) |
| Kimi-k1.5 | MDPO for CoT |
| DeepSeek-R1 | GRPO for CoT |
| Qwen3 | GRPO for CoT |
| Phi-4-reasoning | GRPO for CoT |