Fix length bias for Dr GRPO #3138
Conversation
Hi @qgallouedec, I'm new to this codebase, but I saw your earlier Dr. GRPO merge and happened to be experimenting with this code today, and couldn't help but wonder if this bit was missing?
Thanks, but I don't understand: why divide the loss by the max sequence length?
In the paper, under "Length Bias Also Exists in Open-Source PPO Implementation", they mention that the masked-mean calculation inherited from PPO codebases introduces a length bias. They provide a Listing 1 which seems to imply this is what they do in their implementation. It seems grpo_trainer.py still had this inherited bias? https://github.com/sail-sg/understand-r1-zero/blob/main/understand-r1-zero.pdf
Actually, I don't think so. Since #2881, we've been using global normalization instead of per-sequence normalization. I believe this is equivalent to what is described in the paper. Am I right? The only thing I'm not 100% sure about is whether we should divide by max_completion_length.
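For concreteness, here is a minimal sketch of the normalizations being compared in this thread. The tensor names, shapes, and values are illustrative, not the trainer's exact code:

```python
import torch

per_token_loss = torch.randn(4, 256)                  # (batch, seq_len), illustrative
completion_mask = (torch.rand(4, 256) > 0.3).float()  # 1 for generated tokens, 0 otherwise
max_completion_length = 256                           # fixed generation budget

# Per-sequence normalization (PPO-style masked mean): each sequence is divided by
# its own length, which is what the paper identifies as length-biased.
loss_per_seq = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()

# Global normalization (the behaviour since #2881): divide by the total number of
# active tokens in the batch.
loss_global = (per_token_loss * completion_mask).sum() / completion_mask.sum()

# Constant normalization (what this PR proposes): divide by a constant such as the
# max completion length, so the divisor never depends on the sampled lengths.
loss_const = ((per_token_loss * completion_mask).sum(dim=1) / max_completion_length).mean()
```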
Do they mean constant for the batch? In that case, …
Re-reading the part you quoted, they do mention the generation budget as a constant, which would be constant across the whole training? But then I think maybe my change should only divide by the max generation length. They say "e.g." as well, so does that mean it could be any constant? 🤷
Updated to use just the max gen length now (see https://x.com/zzlccc/status/1903674024150081998). BTW, thanks for reviewing my changes!
This part of S3.1 from https://github.com/sail-sg/understand-r1-zero/blob/main/understand-r1-zero.pdf appears to be missing. Is the max_completion_length used here right?
My current opinion is that there isn't really any data showing how these different ways of normalizing compare. To decide on the relevance of adding this option, more data would be needed. If someone in the community would like to carry out an experiment, that would be very useful.
We mean any constant during the whole training. I am writing a summary for this, which hopefully clarifies the length bias so that the community can fix it.
I get it. Let's use the max generation length. Or maybe even …
Sure! Any constant will do to remove the length bias; the scale of the constant will affect the gradient magnitude. In fact, DAPO's loss is something in between: they still use some form of length normalization (question-level), which is biased as well. May I suggest that we remove it to avoid future confusion?
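To make the bias argument concrete, here is a small sketch (with illustrative tensors, not DAPO's or the trainer's actual code) of why a length-dependent divisor still depends on the sampled completions while a constant divisor does not:

```python
import torch

per_token_loss = torch.randn(4, 256)
completion_mask = (torch.rand(4, 256) > 0.3).float()
constant = 256  # e.g. the generation budget / max completion length

# Length-dependent divisor: completion_mask.sum() changes with the lengths sampled in
# this batch, so each token's weight depends on how long the other completions are.
loss_token_level = (per_token_loss * completion_mask).sum() / completion_mask.sum()

# Constant divisor: every token always gets the same weight 1 / (batch_size * constant),
# regardless of the sampled lengths, which removes the length dependence.
loss_constant = (per_token_loss * completion_mask).sum() / (per_token_loss.shape[0] * constant)
```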
Yes, I think your approach should be the default. Maybe even the only option.
I'll run some experiments tomorrow.
Cool @qgallouedec, thank you so much for the effort in pushing open-source LLM RL forward! By the way, maybe we could reconsider the name of the flag. If your experiment results turn out well, maybe we could consider:
Thanks a lot for sharing this, that's very helpful. That's rather surprising. What surprises me the most is the red curve: why would the model start generating long sequences if the objective isn't biased toward long sequences?
Hey @minosvasilias, thank you a lot for the results! Instead of the one you tried,
Could you please try this:
The difference is that the latter does a constant normalization over the sequence length and then averages over samples. This should give a smaller loss magnitude (depending on loss.shape[0]) compared to your previous trial.
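The two code snippets referred to above are not reproduced in this thread, so the lines below are an assumption about the comparison based on the surrounding discussion: an earlier trial that sums over the whole batch before dividing once by the constant, versus the suggested per-sequence constant normalization followed by a mean over samples.

```python
import torch

per_token_loss = torch.randn(8, 512)                  # (batch, seq_len), illustrative
completion_mask = (torch.rand(8, 512) > 0.5).float()
max_completion_length = 512

# Assumed earlier trial: one global sum, divided once by the constant.
loss_prev = (per_token_loss * completion_mask).sum() / max_completion_length

# Suggested variant: constant normalization per sequence, then average over samples.
# Smaller by roughly a factor of per_token_loss.shape[0], the per-device batch size.
loss_new = ((per_token_loss * completion_mask).sum(dim=-1) / max_completion_length).mean()
```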
Pull Request Overview
This PR addresses the length bias issue in the GRPO trainer by modifying the loss normalization when using a completion mask.
- Updated loss calculation to replace division by the sum of the mask with division based on a fixed maximum completion length.
- Adjusted the aggregation and averaging of the per-token loss to mitigate length bias.
@lkevinzc thank you for the suggestion! I should have compute available to do at least one run tonight with that config.
I've gotten some pretty surprising results. Here is the first part, comparing the different loss types:

```python
if self.args.loss_type == "drgrpo":
    loss = ((per_token_loss * completion_mask).sum(dim=-1)).mean() / self.max_completion_length
elif self.args.loss_type == "dapo":
    loss = (per_token_loss * completion_mask).sum() / completion_mask.sum()
elif self.args.loss_type == "grpo":
    loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
```
Thanks for the results! Agreed that the reward scale matters when we do not scale it for the advantage. I guess that to get better stability we could do a global reward normalization (a common trick in RL), which is different from GRPO's intra-group normalization. BTW, may I know your task and reward function?

A few observations:
1. For the first part, it seems that the base policy may already have converged to a fixed behavior (not exploratory enough), so the response length nearly doesn't change? If this is the case, it would make exploitation of the length bias less likely, which is why the response lengths for all runs are about the same.
2. Also for the first part, it's surprising that drgrpo has larger losses than grpo. The denominator of the drgrpo loss should always be the largest (max completion length >= active tokens), so the average loss should be smaller?
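For clarity, a rough sketch of the two reward-normalization schemes being contrasted here; the shapes and the epsilon are illustrative assumptions, not the trainer's actual implementation:

```python
import torch

rewards = torch.randn(8, 4)  # (num_prompts, num_generations_per_prompt), illustrative

# GRPO-style intra-group normalization: each completion's advantage is measured
# relative to the other completions sampled for the same prompt.
group_adv = (rewards - rewards.mean(dim=1, keepdim=True)) / (rewards.std(dim=1, keepdim=True) + 1e-4)

# Global reward normalization (the common RL trick mentioned above): normalize
# against statistics of the whole batch instead of each prompt's group.
global_adv = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
```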
Apologies for the delayed update. Very similar behavior to the other runs, except for the (green) pre-0.16.0 loss. Looking at the loss graph, we can see that it spikes at a somewhat lower magnitude, as you predicted @lkevinzc.
Closed in favour of #3256
This part of S3.1 from https://github.com/sail-sg/understand-r1-zero/blob/main/understand-r1-zero.pdf appears to be missing?
In the paper, under "Length Bias Also Exists in Open-Source PPO Implementation", they mention that the masked-mean calculation inherited from PPO codebases introduces a length bias. They provide a Listing 1 which seems to imply this is what they do in their implementation.
It seems grpo_trainer.py still had this inherited bias?
See also: https://x.com/zzlccc/status/1903291175420961241
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.