@pluesclues
Collaborator

Refactor grpo_trainer functions to handle log probabilities and entropies. Introduce mixed precision handling and improve input processing for model predictions.
Adapt logic to handle image sizes and chunk pixel values based on image grid dimensions.
Refactor padding logic to incorporate max_left_pad variable for better handling of prompt completion.
Refactor padding logic and remove commented code.
Added check for vllm_importance_sampling_correction in conditions using self.use_vllm.
Disable TRL's importance sampling logic in the function.
@gemini-code-assist
Contributor

Summary of Changes

Hello @pluesclues, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly overhauls the log probability calculation mechanism for the GRPO trainer. The primary goal is to accurately compute logprobs by processing input sequences in smaller, manageable chunks, addressing challenges related to varying batch and context lengths, especially when dealing with padding. It also introduces specific handling for visual inputs and adjusts the application of importance sampling.

Highlights

  • GRPO Logprob Calculation Refinement: Introduced a new grpo_selective_log_softmax and refactored the _get_per_token_logps_and_entropies function to process inputs in chunks, improving log probability calculations for GRPO.
  • Padding Handling: Enhanced the handling of left padding by calculating and propagating max_left_pad through the logprob calculation pipeline, ensuring correct alignment for models with varying input lengths.
  • TRL Importance Sampling Adjustment: Explicitly disabled TRL's default importance sampling correction logic for vLLM within the GRPO trainer.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces chunking for log probability calculations in GRPO to better manage memory usage. The changes primarily involve patching GRPO trainer functions in unsloth/models/rl_replacements.py, with the core logic for batch chunking implemented in the _get_per_token_logps_and_entropies function. My review identifies a potential bug concerning tensor device placement, along with suggestions to improve code maintainability by addressing code duplication and enhancing readability in line with Python's style guidelines.

@danielhanchen danielhanchen changed the base branch from main to nightly November 27, 2025 03:37
@danielhanchen
Contributor

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines 325 to 339
replacement_string = """ if "image_sizes" in prompt_inputs:
output["image_sizes"] = prompt_inputs["image_sizes"]
if self.use_vllm:
try:
if max_left_pad is not None:
output["max_left_pad"] = torch.tensor(sampling_per_token_logps.shape[0] * [max_left_pad]).unsqueeze(-1)

P1: Protect max_left_pad output when sampling logprobs missing

The new block in _generate_and_score_completions builds output["max_left_pad"] using sampling_per_token_logps.shape before that variable is guaranteed to exist. In the common non-vLLM path (or when importance sampling correction is disabled), sampling_per_token_logps is never defined, so hitting this code will raise a NameError before any completions are returned, breaking GRPO training without vLLM. The guard below only protects the later assignment, so max_left_pad needs to be gated on the presence of sampling logprobs or sized from another tensor.

Collaborator Author

I fixed this to use prompt_ids.shape[0], which handles the case where vLLM is not used.

Comment on lines 684 to +807
advantages = inputs["advantages"]
# per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)
# per_token_loss = -(per_token_loss - self.beta * per_token_kl)
# loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
old_hidden_states = inputs.get("old_per_token_logps", None)
old_logps = inputs.get("old_per_token_logps", None)

P2: Per-token loss branch references undefined hidden states

In compute_loss, the inputs are now read into ref_logps and old_logps, but the per-token-logits branch still uses ref_hidden_states/old_hidden_states, which are no longer defined. If _get_per_token_logps (or the TRL default) returns actual logits again, this block will raise a NameError before loss computation. The variable names here need to match the new inputs to avoid crashing when per-token logprobs are available.

Collaborator Author

We should not even be using the old compiled path, but I did address this regardless and took this suggestion.

if self.use_vllm:"""
function = function.replace(replace_part, new_replacement)

# Important note: we disable TRL's importance sampling logic

Collaborator

Can you please add the justification for disabling it here?

Collaborator Author

We disable it mainly because of the LLM path, where we move the left-pad tokens to the right side. We therefore need to adjust the sampling logprob tensor returned from vLLM to account for this.

Collaborator

I meant add that as a comment in the code :)

Collaborator Author

Done.

attention_mask = input_ids != self.processing_class.pad_token_id
attention_mask = attention_mask.to(attention_mask.dtype)
else:
max_left_pad = 0

Collaborator

Can you please mention: essentially, for VLMs we do not go via the optimised implementation path in models/, so we do not encounter the flash-attention left-padding issue and do not need to worry about it.

Collaborator Author

Done.

attention_mask_chunks = torch.chunk(attention_mask, chunks = B, dim = 0)

def chunk_optional(tensor, chunks):
    if tensor is None:

Collaborator

Are you sure this doesn't raise an error for tensors? Something like the "ambiguous truth value" error for a tensor, which suggests using all or any instead?
I have generally found x is not None more reliable for tensors.

Collaborator Author

This is primarily for image-specific parameters that are not passed on the LLM path. For example, if I am training an LLM and not a VLM, pixel_values is None.

Collaborator

Does it work when pixel_values is not None, i.e. when it's a filled tensor?

Collaborator Author

If it is a filled tensor, it is chunked correctly; I just verified. Unless you were talking about something else?
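
For readers following this thread, here is a minimal sketch of the pattern being discussed. It is plain Python rather than the PR's code: `StrictTruth` is a hypothetical stand-in for a multi-element tensor whose truth value cannot be evaluated, and returning `[None] * chunks` for a missing input is an assumption based on the `[None] * B` defaults visible in the diff. The split is a simple even split and is not meant to reproduce `torch.chunk` exactly.

```python
class StrictTruth:
    """Hypothetical stand-in for a multi-element tensor: bool() raises."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        return self.rows[index]

    def __bool__(self):
        raise ValueError("truth value of a multi-element container is ambiguous")


def chunk_optional(values, chunks):
    # `is None` is an identity check, so it never calls __bool__ on `values`.
    if values is None:
        return [None] * chunks
    size = max(1, -(-len(values) // chunks))  # ceiling division, at least 1
    return [values[start:start + size] for start in range(0, len(values), size)]


missing = chunk_optional(None, 2)
filled = chunk_optional(StrictTruth([10, 20, 30, 40]), 2)
```

Writing `if not values:` instead would raise for the filled case here, which is the failure mode the reviewer was asking about; the identity check avoids it.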

image_grid_thw_chunks = [None] * B
pixel_attention_mask_chunks = [None] * B

# This is the chunkng logit from trl 0.23.0

Collaborator

NIT: Fix typos

Collaborator Author

Done

if image_grid_thw is not None and pixel_values is not None:
    if image_grid_thw.shape[0] != B:
        raise ValueError(
            f"This logic requires image_grid_thw.shape[0] ({image_grid_thw.shape[0]}) "

Collaborator

Question: does this mean we enforce one image per prompt, or do we expect (bsz, num_images, img_shape) as input?

Collaborator Author

I believe this can still support multiple images; it's just that the batch sizes must match. I believe it is the latter in this case.

rows_per_sample = image_grid_thw.prod(dim = -1)
rows_per_sample_list = rows_per_sample.cpu().tolist()

pixel_values_chunks = list(

Collaborator

This list is only to ensure Python iterability, I suppose?
Can you please check on some dummy inputs that this (just the split into a list of tensors on its own, nothing else) doesn't replicate the data? Ideally it should not, but it is good to validate.

Collaborator Author

I checked memory usage when initializing the tensor and doing the list(chunk(tensor)) that TRL does; it seems that it doesn't copy tensors and uses the same pointers.

logit_scale_multiply = 0
logit_scale_divide = getattr(model.config, "logits_scaling", 0)
if logit_scale_divide is None:
    logit_scale_divide = 0

Collaborator

Is it smart to set the division factor to 0?
Also, I have always wondered whether we should incorporate this into the multiply itself.

Collaborator Author

Collaborator

Yes, I know we use zero as the "nothing" case here, but why do we even do that? Can't we refactor this?
@danielhanchen thoughts?
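
One possible reading of the refactor being asked about is to fold both knobs into a single multiplicative factor, so that "no scaling" is 1.0 rather than a zero sentinel. The sketch below is illustrative only: `resolve_logit_scale` is a hypothetical helper, not part of Unsloth or TRL, and it assumes that 0 and None both mean "not configured", as the snippet above suggests.

```python
def resolve_logit_scale(multiply=0, divide=0):
    """Collapse the 0/None sentinels into one multiplicative factor."""
    scale = 1.0
    if multiply:  # 0 or None: no multiplicative scaling configured
        scale *= multiply
    if divide:  # 0 or None: no divisive scaling configured
        scale /= divide
    return scale


no_scaling = resolve_logit_scale(0, None)
both = resolve_logit_scale(multiply=2.0, divide=4.0)
```

Under this assumption a caller could apply `logits * scale` unconditionally, and a division by the zero sentinel cannot occur.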

completion_input_ids_chunk = input_ids_chunk[
    :, -logits_to_keep:
]
# breakpoint()

Collaborator

NIT: cleanup

Collaborator Author

Done

)

all_logprobs_list.append(logprobs_chunk)
logprobs = torch.cat(all_logprobs_list, dim = 0)

Collaborator

I'm very curious what the memory usage looks like before and after this step.
It would be great if you could get that info.

Essentially, the logit calculation is chunked which is great. But all logits are still materialised at once...

Collaborator Author

The resulting shape of logprobs is [bsz, context_length], since we keep only the logprobs of the tokens that actually appear in the sequence rather than storing [bsz, context_length, vocab_dim].
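
To make the shape argument concrete, here is a small plain-Python sketch of a selective log-softmax applied one sample at a time. It is not the PR's `grpo_selective_log_softmax`; `selective_log_softmax`, `chunked_logprobs`, and `toy_logits` are hypothetical names, and the generator merely stands in for per-chunk model forward passes.

```python
import math


def selective_log_softmax(logits, token_ids):
    """Log-probability of each chosen token; logits is [seq_len][vocab]."""
    picked = []
    for position_logits, token_id in zip(logits, token_ids):
        peak = max(position_logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in position_logits))
        picked.append(position_logits[token_id] - log_norm)
    return picked


def chunked_logprobs(logits_per_sample, token_ids_per_sample):
    # Each iteration holds one sample's [seq_len][vocab] logits and keeps
    # only its [seq_len] selected log-probabilities.
    return [
        selective_log_softmax(logits, token_ids)
        for logits, token_ids in zip(logits_per_sample, token_ids_per_sample)
    ]


def toy_logits():
    # Hypothetical generator standing in for per-chunk model forward passes.
    yield [[0.0, 0.0], [1.0, 1.0], [0.0, math.log(3.0)]]
    yield [[2.0, 2.0], [0.0, 0.0], [math.log(3.0), 0.0]]


logprobs = chunked_logprobs(toy_logits(), [[0, 1, 1], [1, 0, 0]])
```

The result has one value per token, so the concatenated output grows with bsz × context_length and not with the vocabulary size.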

left_pad_tokens_per_prompt = calculate_pad_tokens_in_prompt(
input_ids, logits_to_keep, self.processing_class.pad_token_id
)
max_left_pad = max(left_pad_tokens_per_prompt).item()

Contributor

torch op here as well

Collaborator Author

Done.
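
For context on what max_left_pad measures, here is a plain-Python sketch. It is not the PR's `calculate_pad_tokens_in_prompt`, whose implementation does not appear in this thread; `count_left_pad` is a hypothetical helper that assumes prompts are left-padded and that the last `logits_to_keep` positions of each row are the completion.

```python
def count_left_pad(input_ids, logits_to_keep, pad_token_id):
    """Count leading pad tokens in the prompt part of each row."""
    counts = []
    for row in input_ids:
        prompt = row[:len(row) - logits_to_keep]
        leading = 0
        for token in prompt:
            if token != pad_token_id:
                break
            leading += 1
        counts.append(leading)
    return counts


PAD = 0
batch = [
    [PAD, PAD, 5, 6, 7, 8],  # two pad tokens, prompt [5, 6], completion [7, 8]
    [3, 4, 5, 6, 7, PAD],    # no left padding; the trailing pad is completion
]
left_pad_per_prompt = count_left_pad(batch, logits_to_keep=2, pad_token_id=PAD)
max_left_pad = max(left_pad_per_prompt)
```

On tensors, the reviewer's note presumably corresponds to using a tensor reduction such as `left_pad_tokens_per_prompt.max()` in place of the built-in `max`.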
