🩺 Dr. GRPO loss #3256

qgallouedec · 2025-04-07T17:00:32Z

What does this PR do?

This PR supersedes #3231 #3138
Closes #3178

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2025-04-07T17:49:40Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

trl/trainer/grpo_trainer.py

qgallouedec · 2025-04-07T19:01:08Z

Same effective batch size (256)

GPUs	Grad accum steps	Per device batch size
1	2	128
1	4	64
1	8	32
1	16	16
1	32	8
2	1	128
2	2	64
2	4	32
2	8	16
2	16	8
4	1	64
4	2	32
4	4	16
4	8	8
8	1	32
8	2	16
8	4	8
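
The rows above all satisfy GPUs × grad accum steps × per-device batch size = 256. As a minimal sketch (the helper name and the 8 to 128 per-device range are assumptions read off the table, not part of TRL), the same combinations can be enumerated like this:

```python
def configs_for_effective_batch_size(effective_bs, gpu_counts,
                                     min_per_device=8, max_per_device=128):
    """Hypothetical helper: list (gpus, grad_accum_steps, per_device_bs) triples
    whose product equals the requested effective batch size."""
    rows = []
    for gpus in gpu_counts:
        per_device = max_per_device
        while per_device >= min_per_device:
            # Gradient accumulation steps needed to reach the effective batch size
            steps, remainder = divmod(effective_bs, gpus * per_device)
            if remainder == 0 and steps >= 1:
                rows.append((gpus, steps, per_device))
            per_device //= 2  # halve the per-device batch size, as in the table
    return rows


rows = configs_for_effective_batch_size(256, gpu_counts=[1, 2, 4, 8])
for gpus, steps, per_device in rows:
    print(gpus, steps, per_device)
```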

from datasets import load_dataset
from trl import GRPOTrainer, GRPOConfig

dataset = load_dataset("trl-lib/tldr", split="train[:500]")

# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c)) for c in completions]

ga = 2
bs = 16

args = GRPOConfig(
    output_dir=f"DrGRPO_bs{bs}_ga_{ga}_4GPU",
    per_device_train_batch_size=bs,
    gradient_accumulation_steps=ga,
    num_train_epochs=1,
    logging_steps=1,
    max_prompt_length=64,
    max_completion_length=64,
    loss_type="drgrpo",  # renamed to "dr_grpo" in a later commit of this PR
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    args=args,
    reward_funcs=reward_num_unique_chars,
    train_dataset=dataset,
)
trainer.train()

lewtun

LGTM with some nits and a question about what BNPO refers to

docs/source/grpo_trainer.md

lewtun · 2025-04-08T09:24:28Z

trl/trainer/grpo_config.py

-            difficulty bias.
+            applied. The [Dr. GRPO paper](https://huggingface.co/papers/2503.14476) recommends not scaling the rewards,
+            as scaling by the standard deviation introduces a question-level difficulty bias.
+        loss_type (`str`, *optional*, defaults to `"bnpo"`):


What is bnpo? Would be good to have a reference to where it's defined (I thought we had DAPO as the default loss)

qgallouedec · 2025-04-08

In fact, I realized while doing this PR that it wasn't exactly DAPO that was being used, but a variant of BNPO as defined here:

Let me try to clarify. Per-token losses are normalized by:

- GRPO: the length of the sequence
- DAPO: the average sequence length in the group
- BNPO: the average sequence length in the batch
- TRL's BNPO: the average sequence length in the local batch*; this is what I call bnpo in the code, but the name is not 100% correct
- Dr GRPO: the maximum possible length of the completion

*A batch is made up of num_devices * gradient_accumulation_steps local batches.

Special cases:

- When per_device_batch_size == num_generations, TRL's BNPO is equivalent to DAPO.
- When per_device_batch_size == 1, TRL's BNPO is equivalent to GRPO.
- When gradient_accumulation_steps == 1 and num_devices == 1, TRL's BNPO is equivalent to the actual BNPO.
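
The normalizations described above can be sketched in plain Python. This is an illustration under stated assumptions, not the TRL implementation: each completion is a list of per-token losses (padding already removed), the function names are hypothetical, and the Dr GRPO constant is taken to be the number of sequences times `max_completion_length`.

```python
def grpo_loss(per_token_losses):
    """Normalize each sequence by its own length, then average over sequences."""
    per_sequence = [sum(seq) / max(len(seq), 1) for seq in per_token_losses]
    return sum(per_sequence) / len(per_sequence)


def bnpo_loss(per_token_losses):
    """Normalize by the total token count of the (local) batch, which is the
    average sequence length in that batch times the number of sequences."""
    total_tokens = sum(len(seq) for seq in per_token_losses)
    return sum(sum(seq) for seq in per_token_losses) / max(total_tokens, 1)


def dr_grpo_loss(per_token_losses, max_completion_length):
    """Normalize by a constant that does not depend on the sampled lengths."""
    total = sum(sum(seq) for seq in per_token_losses)
    return total / (len(per_token_losses) * max_completion_length)


# Two completions of different lengths in one local batch
batch = [[1.0, 1.0], [4.0, 4.0, 4.0, 4.0]]
grpo = grpo_loss(batch)           # (1.0 + 4.0) / 2
bnpo = bnpo_loss(batch)           # 18.0 / 6 tokens
dr_grpo = dr_grpo_loss(batch, 8)  # 18.0 / (2 * 8)

# Special case: with a single sequence per local batch, both normalizers coincide
single = [[1.0, 3.0]]
single_grpo = grpo_loss(single)
single_bnpo = bnpo_loss(single)
```

Calling `bnpo_loss` on exactly one group of generations gives the DAPO-style normalization, which mirrors the first special case listed above.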

lkevinzc · 2025-04-10

@qgallouedec Thanks for the comprehensive support! A minor comment for your future consideration: Dr. GRPO does not constrain the constant normalizer to be MAX_LEN (although it's easier to just use that). This can affect the update scale (related to your recent tweet https://x.com/QGallouedec/status/1908741708021457357). In fact, the different values of the constant x in the setting from your tweet can be absorbed into the constant normalizer we propose in the paper; MAX_LEN is just one convenient choice.
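
To make the remark about the update scale concrete, here is a minimal sketch. It assumes the loss is the sum of token losses divided by (number of sequences × constant); the function name and the values 64 and 128 are illustrative only. Changing the constant rescales the loss, and therefore its gradients, by a fixed factor that is independent of the sampled completion lengths.

```python
def constant_normalized_loss(per_token_losses, constant):
    """Hypothetical helper: sum all token losses, divide by num_sequences * constant."""
    total = sum(sum(seq) for seq in per_token_losses)
    return total / (len(per_token_losses) * constant)


batch = [[1.0, 1.0], [4.0, 4.0, 4.0, 4.0]]
loss_64 = constant_normalized_loss(batch, constant=64)
loss_128 = constant_normalized_loss(batch, constant=128)

# Doubling the constant halves the loss, whatever the lengths in the batch
ratio = loss_64 / loss_128
```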

lewtun · 2025-04-08T09:28:44Z

trl/trainer/grpo_config.py

+                slightly vary depending on the local batch size, despite a constant effective batch size.
+            - `"drgrpo"`: Token-level losses are aggregated by normalizing with a global constant. This method was
+                introduced in the [Dr. GRPO paper](https://huggingface.co/papers/2503.14476) to eliminate length bias.
+                The value of the constant corresponds to `max_completion_length`.


If I understand correctly, @edbeeching was trying something slightly different in #3231 that did local scaling per batch instead of a global constant. Do you know if there's much difference between the two?

edbeeching · 2025-04-08

They are roughly equivalent; I have closed my PR in favor of this one.

qgallouedec · 2025-04-08T13:22:11Z

Results mostly match, except for the loss and the grad norm: Dr GRPO seems to reduce their range, as expected.

qgallouedec added 3 commits April 7, 2025 16:45

hf papers link

281cc93

loss_type

f93c9b0

test

94a8cdc

qgallouedec marked this pull request as ready for review April 7, 2025 17:14

qgallouedec added 2 commits April 7, 2025 17:43

documentation

161f052

minor

a32e12c

qgallouedec requested review from kashif, edbeeching, lewtun and shirinyamani April 7, 2025 17:52

raise error when loss type isn't support by liger

411adcc

ScottHoang reviewed Apr 7, 2025

trl/trainer/grpo_trainer.py

qgallouedec changed the title ~~Dr. GRPO loss~~ 🩺 Dr. GRPO loss Apr 7, 2025

update doc

68e3afe

edbeeching mentioned this pull request Apr 8, 2025

[GRPO] Adds an option to scale the loss by a constant factor #3231

Closed

lewtun approved these changes Apr 8, 2025

qgallouedec and others added 11 commits April 8, 2025 14:28

clarify

f893e96

style and default

45f74a4

Update docs/source/grpo_trainer.md

2b97bc3

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

drgrpo -> dr_grpo

b1c9ee0

Merge branch 'dr-grpo-loss' of https://github.com/huggingface/trl int…

4b8c9ff

…o dr-grpo-loss

clarification

eb499fa

implement grpo

aac380e

Merge branch 'main' into dr-grpo-loss

fe989f8

Merge branch 'main' into dr-grpo-loss

ca030bb

oops

2d68d10

Merge branch 'main' into dr-grpo-loss

4be9e2d

qgallouedec merged commit 5e2e9cb into main Apr 9, 2025
10 checks passed

qgallouedec deleted the dr-grpo-loss branch April 9, 2025 18:13

qgallouedec mentioned this pull request Apr 9, 2025

Fix length bias for Dr GRPO #3138

Closed

5 tasks

lkevinzc mentioned this pull request Apr 10, 2025

Tracking related fixes in other open-source projects sail-sg/understand-r1-zero#23

Open
