Chunk Across Batch and Context length for logprob calculations for grpo #3628
base: nightly
Refactor grpo_trainer functions to handle log probabilities and entropies. Introduce mixed precision handling and improve input processing for model predictions.
Adapt logic to handle image sizes and chunk pixel values based on image grid dimensions.
Refactor padding logic to incorporate max_left_pad variable for better handling of prompt completion.
Refactor padding logic and remove commented code.
Added check for vllm_importance_sampling_correction in conditions using self.use_vllm.
Disable TRL's importance sampling logic in the function.
for more information, see https://pre-commit.ci
Summary of Changes (Gemini Code Assist)
This pull request significantly overhauls the log probability calculation mechanism for the GRPO trainer. The primary goal is to accurately compute logprobs by processing input sequences in smaller, manageable chunks, addressing challenges related to varying batch and context lengths, especially when dealing with padding. It also introduces specific handling for visual inputs and adjusts the application of importance sampling.
Code Review
This pull request introduces chunking for log probability calculations in GRPO to better manage memory usage. The changes primarily involve patching GRPO trainer functions in unsloth/models/rl_replacements.py, with the core logic for batch chunking implemented in the _get_per_token_logps_and_entropies function. My review identifies a potential bug concerning tensor device placement, along with suggestions to improve code maintainability by addressing code duplication and enhancing readability in line with Python's style guidelines.
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
unsloth/models/rl_replacements.py
Outdated
replacement_string = """ if "image_sizes" in prompt_inputs:
output["image_sizes"] = prompt_inputs["image_sizes"]
if self.use_vllm:
try:
if max_left_pad is not None:
output["max_left_pad"] = torch.tensor(sampling_per_token_logps.shape[0] * [max_left_pad]).unsqueeze(-1)
Protect max_left_pad output when sampling logprobs missing
The new block in _generate_and_score_completions builds output["max_left_pad"] using sampling_per_token_logps.shape before that variable is guaranteed to exist. In the common non-vLLM path (or when importance sampling correction is disabled), sampling_per_token_logps is never defined, so hitting this code will raise a NameError before any completions are returned, breaking GRPO training without vLLM. The guard below only protects the later assignment, so max_left_pad needs to be gated on the presence of sampling logprobs or sized from another tensor.
I fixed this to use prompt_ids.shape[0] to handle the case where vLLM is not used.
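The fix described in this thread can be sketched as follows. This is only an illustration under stated assumptions, not the PR's actual patch: the tensor shape and the value of `max_left_pad` are made up, and the only idea taken from the thread is that the batch dimension comes from `prompt_ids`, which exists on both the vLLM and non-vLLM paths.

```python
import torch

# Hypothetical stand-ins for values available inside _generate_and_score_completions.
prompt_ids = torch.zeros(4, 7, dtype=torch.long)  # [batch, prompt_len]
max_left_pad = 3
output = {}

# sampling_per_token_logps is only defined on the vLLM importance-sampling path,
# so the batch dimension is read from prompt_ids instead.
if max_left_pad is not None:
    output["max_left_pad"] = torch.tensor(
        prompt_ids.shape[0] * [max_left_pad]
    ).unsqueeze(-1)  # shape [batch, 1]
```

Because `prompt_ids` is always present, the block no longer depends on a variable that may never have been assigned.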
advantages = inputs["advantages"]
# per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)
# per_token_loss = -(per_token_loss - self.beta * per_token_kl)
# loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
old_hidden_states = inputs.get("old_per_token_logps", None)
old_logps = inputs.get("old_per_token_logps", None)
per-token loss branch references undefined hidden states
In compute_loss, the inputs are now read into ref_logps and old_logps, but the per-token-logits branch still uses ref_hidden_states/old_hidden_states, which are no longer defined. If _get_per_token_logps (or the TRL default) returns actual logits again, this block will raise a NameError before loss computation. The variable names here need to match the new inputs to avoid crashing when per-token logprobs are available.
We should not even be using the old compiled path, but I did address this regardless and took this suggestion.
if self.use_vllm:"""
function = function.replace(replace_part, new_replacement)

# Important note: we disable TRL's importance sampling logic
Can you please add the justification for disabling it here?
We disable it mainly because of the LLM path, where we move the left-pad tokens to the right side. We then need to adjust the sampling logprob tensor returned from vLLM to account for this.
I meant add that as a comment in the code :)
Done.
attention_mask = input_ids != self.processing_class.pad_token_id
attention_mask = attention_mask.to(attention_mask.dtype)
else:
max_left_pad = 0
Can you please mention: Essentially, for VLMs we do not go through the optimised implementation path in models/, so we do not encounter the flash-attention left-padding issue and don't need to worry about it.
Done.
attention_mask_chunks = torch.chunk(attention_mask, chunks = B, dim = 0)

def chunk_optional(tensor, chunks):
if tensor is None:
Are you sure that this doesn't raise an error for tensors? Something like an ambiguous truth value for a tensor; maybe use all or any or something like that?
I generally found x is not None more effective for tensors.
This is primarily for image-specific parameters that are not passed on the LLM path. For example, if I am training an LLM and not a VLM, pixel_values is None.
Does it work when pixel_values is not None, i.e. it's a filled tensor?
If it is a filled tensor, it is chunked correctly; I just verified. Unless you were talking about something else?
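The behaviour discussed in this thread can be sketched as below. The body of `chunk_optional` is not shown in the diff excerpt, so the `[None] * chunks` return value here is an assumption based on the neighbouring `image_grid_thw_chunks = [None] * B` lines, not the PR's confirmed implementation.

```python
import torch

def chunk_optional(tensor, chunks):
    # `is None` is an identity check, so it never evaluates the tensor's
    # truth value (which `if tensor:` would do, raising for multi-element tensors).
    if tensor is None:
        return [None] * chunks
    return list(torch.chunk(tensor, chunks=chunks, dim=0))

B = 4
none_chunks = chunk_optional(None, B)                             # LLM path: no image inputs
filled_chunks = chunk_optional(torch.arange(8).reshape(4, 2), B)  # filled tensor
```

Both cases yield a list of length `B`, so downstream code can zip the optional chunks with the `input_ids` chunks without special-casing the LLM path.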
unsloth/models/rl_replacements.py
Outdated
image_grid_thw_chunks = [None] * B
pixel_attention_mask_chunks = [None] * B

# This is the chunkng logit from trl 0.23.0
NIT: Fix typos
Done
if image_grid_thw is not None and pixel_values is not None:
if image_grid_thw.shape[0] != B:
raise ValueError(
f"This logic requires image_grid_thw.shape[0] ({image_grid_thw.shape[0]}) "
Question: Does this mean we enforce one image per prompt, or is it that we expect (bsz, num_images, img_shape) as input?
I believe this can still support multiple images; it's just that the batch sizes must match. I believe it is the latter in this case.
rows_per_sample = image_grid_thw.prod(dim = -1)
rows_per_sample_list = rows_per_sample.cpu().tolist()

pixel_values_chunks = list(
This list is only to ensure Python iterability, I suppose?
Can you please verify on some dummy inputs that this (just the split list of tensors, standalone and nothing else) doesn't replicate the data? Ideally it should not, but it's good to validate.
I checked memory usage when initializing the tensor and doing the list(chunk(tensor)) that trl does; it seems that it doesn't copy tensors and uses the same pointers.
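A check along these lines can be reproduced on dummy inputs. The sketch below assumes the uneven per-sample split is done with `torch.split` (the call inside `list(` is cut off in the diff excerpt), and the shapes and row counts are invented for illustration.

```python
import torch

pixel_values = torch.randn(10, 3)
rows_per_sample_list = [4, 6]  # hypothetical per-sample row counts

# torch.split (like torch.chunk) returns views; list() only wraps the resulting tuple.
pixel_values_chunks = list(torch.split(pixel_values, rows_per_sample_list, dim=0))

# Each view starts at an address inside the original tensor's memory.
base_ptr = pixel_values.data_ptr()
row_bytes = pixel_values.shape[1] * pixel_values.element_size()
offsets = [c.data_ptr() - base_ptr for c in pixel_values_chunks]
expected_offsets = [0, 4 * row_bytes]

# Writing through a view is visible in the original, which a copy would not allow.
pixel_values_chunks[1][0, 0] = 123.0
shares_memory = pixel_values[4, 0].item() == 123.0
```

Comparing `data_ptr()` values is a cheap way to confirm that no data was replicated, without relying on allocator statistics.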
logit_scale_multiply = 0
logit_scale_divide = getattr(model.config, "logits_scaling", 0)
if logit_scale_divide is None:
logit_scale_divide = 0
Is it smart to set the division factor to 0?
Also, I've always wondered if we should incorporate this within the multiply itself.
I set it to zero because in https://github.com/pluesclues/unsloth-zoo/blob/ccccf08e0ffef23a45d55cf1235fdcf1a4d918cc/unsloth_zoo/rl_replacements.py#L89-L99 we only do these logit operations if these parameters are passed and are not zero. This logic is also in https://github.com/unslothai/unsloth-zoo/blob/7209a76795401a08f43403c3d79c17a01ac6eedb/unsloth_zoo/rl_replacements.py#L237-L251.
Yes, I know we use zero as the "nothing" case here, but why do we even do that? Can't we refactor this?
@danielhanchen thoughts?
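To make the two positions in this thread concrete, here is a small sketch. The function names `scale_logits_sentinel` and `combined_scale` are hypothetical. The first mirrors the "zero means not configured" convention described above. The second is only one possible shape for the refactor the reviewer asks about, not something this PR implements.

```python
import torch

def scale_logits_sentinel(logits, logit_scale_multiply=0, logit_scale_divide=0):
    # 0 means "not configured": each operation is skipped, so no division by zero occurs.
    if logit_scale_multiply != 0:
        logits = logits * logit_scale_multiply
    if logit_scale_divide != 0:
        logits = logits / logit_scale_divide
    return logits

def combined_scale(logit_scale_multiply=0, logit_scale_divide=0):
    # Possible refactor: fold both settings into one factor, with 1.0 as the neutral value.
    scale = float(logit_scale_multiply) if logit_scale_multiply else 1.0
    if logit_scale_divide:
        scale /= logit_scale_divide
    return scale

logits = torch.tensor([2.0, -4.0, 8.0])
unchanged = scale_logits_sentinel(logits)
divided = scale_logits_sentinel(logits, logit_scale_divide=2.0)
folded = logits * combined_scale(logit_scale_divide=2.0)
```

With a single folded factor, the neutral case is an explicit `1.0` rather than a `0` that must never reach the division.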
unsloth/models/rl_replacements.py
Outdated
completion_input_ids_chunk = input_ids_chunk[
:, -logits_to_keep:
]
# breakpoint()
NIT: cleanup
Done
)

all_logprobs_list.append(logprobs_chunk)
logprobs = torch.cat(all_logprobs_list, dim = 0)
I'm very curious as to what the memory usage before and after this step looks like
It'd be great if you can get that info...
Essentially, the logit calculation is chunked which is great. But all logits are still materialised at once...
The resulting dim is [bsz, context_length] for logprobs, since we grab the specific token logprobs that are actually used in the sequence rather than storing [bsz, context_length, vocab_dim].
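The memory argument above can be illustrated with a toy example. Everything here is a stand-in: the random `hidden` and `lm_head` tensors replace a real model, the sizes are arbitrary, and `selected_logprobs` is a hypothetical helper rather than the function in the PR. The point is only that each chunk's `[b, T, V]` logits are reduced to `[b, T]` before the results are concatenated.

```python
import torch

torch.manual_seed(0)
B, T, H, V = 4, 5, 8, 11            # batch, tokens, hidden size, vocab (all hypothetical)
hidden = torch.randn(B, T, H)        # stand-in for model hidden states
lm_head = torch.randn(H, V)          # stand-in for the output projection
token_ids = torch.randint(0, V, (B, T))

def selected_logprobs(hidden_chunk, ids_chunk):
    logits = hidden_chunk @ lm_head  # [b, T, V], alive only inside this call
    picked = logits.gather(-1, ids_chunk.unsqueeze(-1)).squeeze(-1)
    return picked - torch.logsumexp(logits, dim=-1)  # [b, T]

all_logprobs_list = []
for h, ids in zip(torch.chunk(hidden, B, dim=0), torch.chunk(token_ids, B, dim=0)):
    all_logprobs_list.append(selected_logprobs(h, ids))
logprobs = torch.cat(all_logprobs_list, dim=0)  # [B, T], no vocab dimension

# Reference: the unchunked computation materialises [B, T, V] at once.
reference = (
    torch.log_softmax(hidden @ lm_head, dim=-1)
    .gather(-1, token_ids.unsqueeze(-1))
    .squeeze(-1)
)
```

The concatenated tensor is smaller than the full logits by a factor of the vocabulary size, so the `torch.cat` step is cheap compared with the per-chunk forward pass.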
unsloth/models/rl_replacements.py
Outdated
left_pad_tokens_per_prompt = calculate_pad_tokens_in_prompt(
input_ids, logits_to_keep, self.processing_class.pad_token_id
)
max_left_pad = max(left_pad_tokens_per_prompt).item()
torch op here as well
Done.
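The suggestion to use a torch op here can be sketched as follows. `count_left_pad` is a hypothetical stand-in for `calculate_pad_tokens_in_prompt`, whose body is not shown in this thread, and the token ids and pad id are invented.

```python
import torch

def count_left_pad(input_ids, logits_to_keep, pad_token_id):
    # Hypothetical stand-in: count leading pad tokens in the prompt part of each row.
    prompt = input_ids[:, :-logits_to_keep]
    is_pad = (prompt == pad_token_id).int()
    # cumprod stays 1 only while every token so far has been padding.
    return is_pad.cumprod(dim=1).sum(dim=1)

input_ids = torch.tensor([
    [0, 0, 5, 6, 7, 8],
    [0, 1, 2, 3, 4, 5],
])
left_pad_tokens_per_prompt = count_left_pad(input_ids, logits_to_keep=2, pad_token_id=0)

# Built-in max() iterates the tensor element by element in Python.
via_builtin = max(left_pad_tokens_per_prompt).item()
# A single tensor reduction gives the same value.
max_left_pad = torch.max(left_pad_tokens_per_prompt).item()
```

Both forms return the same number; the reduction simply avoids a Python-level loop over tensor elements.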
Add comments explaining the disabling of TRL's importance sampling logic.
Add comment explaining handling of VLMs in logits processing.
Relies on: unslothai/unsloth-zoo#357