[GRPO] Adds an option to scale the loss by a constant factor #3231


Closed
edbeeching wants to merge 2 commits

Conversation

@edbeeching edbeeching (Collaborator) commented Apr 4, 2025

What does this PR do?

This PR adds an option to scale the loss by a constant factor equal to the maximum possible number of tokens in a batch.

The reasoning behind this PR is that I believe the current normalization scheme, implemented in #2881, is not invariant to the ordering of samples across devices / gradient accumulation steps, which may cause instabilities in training.

Toy Example

Consider a DDP=2 setting with a per_device_train_batch_size=4. For this example, assume that the loss per token is 1.
With global normalization:
[figure: per-device loss computation with the current normalization (DDP=2, per_device_train_batch_size=4)]

Here each token contributes equally to the loss, but only within its own device. The loss is not comparable to a setting where the whole batch is on a single device; for example, consider a DDP=1 setting with per_device_train_batch_size=4:

[figure: loss computation in the DDP=1 setting]
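
To make the discrepancy concrete, here is a small numeric sketch (my own illustration; the token counts are made up). With the current per-device normalization, the weight a token's gradient receives depends on how many unmasked tokens happen to land on the same device:

```python
# Made-up unmasked completion-token counts for the 4 samples on each device.
tokens_dev0 = 40    # device 0 happens to receive short completions
tokens_dev1 = 400   # device 1 happens to receive long completions

# Current scheme: each device divides its summed per-token loss by its own
# token count; DDP then averages gradients, so each device contributes 1/2.
w_dev0 = 1 / tokens_dev0 / 2     # 0.0125   effective weight per token on device 0
w_dev1 = 1 / tokens_dev1 / 2     # 0.00125  effective weight per token on device 1

# If the same 8 samples were processed as one batch on a single device,
# every token would receive the same weight instead.
w_single = 1 / (tokens_dev0 + tokens_dev1)   # ~0.00227 per token

# Reshuffling samples between devices changes w_dev0 and w_dev1, so the
# effective gradient is not invariant to the sample ordering across devices.
print(w_dev0, w_dev1, w_single)
```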

One potential solution would be to gather the number of unmasked tokens across all devices and use this for normalization, but the same issue would still occur across gradient accumulation steps.

Proposed solution

Calculate a constant factor max_tokens_norm = per_device_train_batch_size * (max_prompt_length + max_completion_length) and always normalize the loss by this constant.

[figure: loss computation with the proposed constant max_tokens_norm normalization]
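
A minimal sketch of the proposed change, mirroring the lines added in this PR (the dummy tensors and hyperparameter values are only for illustration):

```python
import torch

per_device_train_batch_size = 4
max_prompt_length = 8
max_completion_length = 8

# Constant normalizer: the maximum possible number of tokens in a per-device batch.
max_tokens_norm = per_device_train_batch_size * (max_prompt_length + max_completion_length)

# Dummy per-token loss and completion mask of shape (batch, max_completion_length).
per_token_loss = torch.ones(per_device_train_batch_size, max_completion_length)
completion_mask = torch.randint(0, 2, per_token_loss.shape)

# Current scheme: normalize by the number of unmasked tokens seen on this device.
loss_current = (per_token_loss * completion_mask).sum() / completion_mask.sum()

# Proposed scheme: always divide by the same constant, so the scale of the loss
# does not depend on how tokens are split across devices / accumulation steps.
loss_proposed = (per_token_loss * completion_mask).sum() / max_tokens_norm
```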

The learning rate will probably need to be increased to get results comparable with our other baselines.

@edbeeching edbeeching requested review from qgallouedec and lewtun April 4, 2025 10:16
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lewtun lewtun (Member) left a comment


LGTM with a nit and suggestion on whether we should unit test the scaling

@@ -101,6 +101,8 @@ class GRPOConfig(TrainingArguments):
speed, but may be numerically unstable for long training runs.
num_iterations (`int`, *optional*, defaults to `1`):
Number of iterations per batch (denoted as μ in the algorithm).
use_max_tokens_norm (`bool`, *optional*, defaults to `False`):
Whether to use the max tokens norm. If `True`, the loss is normalized by a consant, the maximum possible number of tokens
Member

Suggestion to clarify what we mean by "maximum possible"

Suggested change
Whether to use the max tokens norm. If `True`, the loss is normalized by a consant, the maximum possible number of tokens
Whether to use the max tokens norm. If `True`, the loss is normalized by a constant factor that is determined by the total number of prompt and completions tokens in a batch.

use_max_tokens_norm: bool = field(
default=False,
metadata={
"help": "Whether to use the max tokens norm. If `True`, the loss is normalized by a constant, the maximum "
Member

Ditto here if you agree with the change above

loss = (per_token_loss * completion_mask).sum() / completion_mask.sum()

if self.use_max_tokens_norm:
    loss = (per_token_loss * completion_mask).sum() / self.max_tokens_norm
Member

I'm not sure how easy it is to unit test this, but would it make sense to do it so that we're sure the loss is being computed as your diagrams show?

E.g. an integration test would be to check that specifying the config params gives the expected scaling for some dummy inputs
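
For example, a check along these lines (names, shapes, and values are hypothetical, not code from the PR) would verify the constant-factor scaling on dummy tensors without spinning up the full trainer:

```python
import torch

def test_max_tokens_norm_scaling():
    # Hypothetical dummy inputs: 2 completions of length 4, some tokens masked out.
    per_token_loss = torch.full((2, 4), 0.5)
    completion_mask = torch.tensor([[1, 1, 0, 0], [1, 1, 1, 1]])

    per_device_train_batch_size = 2
    max_prompt_length, max_completion_length = 2, 4
    max_tokens_norm = per_device_train_batch_size * (max_prompt_length + max_completion_length)

    loss = (per_token_loss * completion_mask).sum() / max_tokens_norm

    # 6 unmasked tokens * 0.5 per-token loss / (2 * (2 + 4)) = 3.0 / 12 = 0.25
    assert torch.isclose(loss, torch.tensor(0.25))
```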

@@ -101,6 +101,8 @@ class GRPOConfig(TrainingArguments):
speed, but may be numerically unstable for long training runs.
num_iterations (`int`, *optional*, defaults to `1`):
Number of iterations per batch (denoted as μ in the algorithm).
use_max_tokens_norm (`bool`, *optional*, defaults to `False`):
Member

So this is the loss proposed in Dr GRPO, correct?
If so, I think it should be explicitly mentioned in the doc.

Collaborator Author

I am not sure actually, I thought that was our current implementation. I will take another look.

Member

Currently we use a modified version of DAPO where we normalize per local batch (and not per group).

[figure: comparison of loss normalization schemes, including BNPO and DAPO]

In the above figure, we use something between BNPO (hard to implement with grad accum) and DAPO.
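
For reference, a rough sketch of how these normalizations differ on a single local batch; this is my reading of the schemes rather than code from the trainer, and the constant in the last variant stands in for the one proposed in this PR / Dr GRPO:

```python
import torch

per_token_loss = torch.rand(4, 8)                 # (batch, max_completion_length)
completion_mask = torch.randint(0, 2, (4, 8))
batch_size, max_completion_length = per_token_loss.shape

# GRPO-style: average over each sequence's own tokens, then over sequences,
# so short and long completions get the same per-sequence weight.
per_seq = (per_token_loss * completion_mask).sum(-1) / completion_mask.sum(-1).clamp(min=1)
loss_grpo = per_seq.mean()

# DAPO-style, normalized per local batch (the current TRL behavior): one
# token-level average over all unmasked tokens seen on this device.
loss_dapo_local = (per_token_loss * completion_mask).sum() / completion_mask.sum()

# Constant-normalizer variant (Dr GRPO / this PR): divide by a fixed constant so
# the scale does not depend on the observed token count. The PR additionally
# folds max_prompt_length into the constant via max_tokens_norm.
loss_const = (per_token_loss * completion_mask).sum() / (batch_size * max_completion_length)
```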

@qgallouedec qgallouedec mentioned this pull request Apr 7, 2025
@edbeeching edbeeching (Collaborator, Author)

closing in favor of #3256

@edbeeching edbeeching closed this Apr 8, 2025