Chunk Across Batch and Context length for logprob calculations for grpo #3628

pluesclues · 2025-11-21T21:20:50Z

Refactor grpo_trainer functions to handle log probabilities and entropies. Introduce mixed precision handling and improve input processing for model predictions.

Adapt logic to handle image sizes and chunk pixel values based on image grid dimensions.

Refactor padding logic to incorporate max_left_pad variable for better handling of prompt completion.

Refactor padding logic and remove commented code.

Added check for vllm_importance_sampling_correction in conditions using self.use_vllm.

Disable TRL's importance sampling logic in the function.

for more information, see https://pre-commit.ci

gemini-code-assist · 2025-11-21T21:21:18Z

Summary of Changes

Hello @pluesclues, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly overhauls the log probability calculation mechanism for the GRPO trainer. The primary goal is to accurately compute logprobs by processing input sequences in smaller, manageable chunks, addressing challenges related to varying batch and context lengths, especially when dealing with padding. It also introduces specific handling for visual inputs and adjusts the application of importance sampling.

Highlights

GRPO Logprob Calculation Refinement: Introduced a new grpo_selective_log_softmax and refactored the _get_per_token_logps_and_entropies function to process inputs in chunks, improving log probability calculations for GRPO.
Padding Handling: Enhanced the handling of left padding by calculating and propagating max_left_pad through the logprob calculation pipeline, ensuring correct alignment for models with varying input lengths.
TRL Importance Sampling Adjustment: Explicitly disabled TRL's default importance sampling correction logic for vLLM within the GRPO trainer.

gemini-code-assist

Code Review

This pull request introduces chunking for log probability calculations in GRPO to better manage memory usage. The changes primarily involve patching GRPO trainer functions in unsloth/models/rl_replacements.py, with the core logic for batch chunking implemented in the _get_per_token_logps_and_entropies function. My review identifies a potential bug concerning tensor device placement, along with suggestions to improve code maintainability by addressing code duplication and enhancing readability in line with Python's style guidelines.

danielhanchen · 2025-11-27T07:43:21Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

chatgpt-codex-connector · 2025-11-27T07:48:53Z

unsloth/models/rl_replacements.py

    replacement_string = """        if "image_sizes" in prompt_inputs:
            output["image_sizes"] = prompt_inputs["image_sizes"]
-        
-        if self.use_vllm:
-            try:
+        if max_left_pad is not None:
+            output["max_left_pad"] = torch.tensor(sampling_per_token_logps.shape[0] * [max_left_pad]).unsqueeze(-1)        


Protect max_left_pad output when sampling logprobs missing

The new block in _generate_and_score_completions builds output["max_left_pad"] using sampling_per_token_logps.shape before that variable is guaranteed to exist. In the common non-vLLM path (or when importance sampling correction is disabled), sampling_per_token_logps is never defined, so hitting this code will raise a NameError before any completions are returned, breaking GRPO training without vLLM. The guard below only protects the later assignment, so max_left_pad needs to be gated on the presence of sampling logprobs or sized from another tensor.

I fixed this to use prompt_ids.shape[0], which handles the case where vLLM is not used.
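A minimal sketch of the fix described above, assuming prompt_ids is a [batch, prompt_len] tensor. The helper name build_max_left_pad is hypothetical and not taken from the PR; it only illustrates sizing the output from a tensor that exists on both the vLLM and non-vLLM paths.

```python
import torch

def build_max_left_pad(prompt_ids: torch.Tensor, max_left_pad: int) -> torch.Tensor:
    # Hypothetical helper: size the output from prompt_ids rather than from
    # sampling_per_token_logps, which is only defined on the vLLM path.
    batch_size = prompt_ids.shape[0]
    # One entry per sample, shape [batch, 1], matching tensor(...).unsqueeze(-1)
    return torch.full((batch_size, 1), max_left_pad, dtype=torch.long)

prompt_ids = torch.zeros(4, 7, dtype=torch.long)
out = build_max_left_pad(prompt_ids, 3)
```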

chatgpt-codex-connector · 2025-11-27T07:48:54Z

unsloth/models/rl_replacements.py

        advantages = inputs["advantages"]
        # per_token_loss = torch.exp(per_token_logps - per_token_logps.detach()) * advantages.unsqueeze(1)
        # per_token_loss = -(per_token_loss - self.beta * per_token_kl)
        # loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
-        old_hidden_states = inputs.get("old_per_token_logps", None)
+        old_logps = inputs.get("old_per_token_logps", None)


per-token loss branch references undefined hidden states

In compute_loss, the inputs are now read into ref_logps and old_logps, but the per-token-logits branch still uses ref_hidden_states/old_hidden_states, which are no longer defined. If _get_per_token_logps (or the TRL default) returns actual logits again, this block will raise a NameError before loss computation. The variable names here need to match the new inputs to avoid crashing when per-token logprobs are available.

We should not even be using the old compiled path, but I did address this regardless and took this suggestion.

Datta0 · 2025-12-01T03:46:25Z

unsloth/models/rl_replacements.py

        if self.use_vllm:"""
            function = function.replace(replace_part, new_replacement)

+    # Important note: we disable TRL's importance sampling logic


Can you please add the justification for disabling it here?

We disable it mainly because of the LLM path, where we move the left pad tokens to the right side. We need to adjust the sampling logprob tensor returned from vLLM to account for this.

I meant add that as a comment in the code :)

Datta0 · 2025-12-01T03:56:06Z

unsloth/models/rl_replacements.py

+                attention_mask = input_ids != self.processing_class.pad_token_id
+                attention_mask = attention_mask.to(attention_mask.dtype)
+            else:
+                max_left_pad = 0


Can you please mention: Essentially, for VLMs we do not go via the optimised impl path in models/, so we do not hit the flash attn left padding issue and need not worry about it.

Datta0 · 2025-12-01T03:57:09Z

unsloth/models/rl_replacements.py

+            attention_mask_chunks = torch.chunk(attention_mask, chunks = B, dim = 0)
+
+            def chunk_optional(tensor, chunks):
+                if tensor is None:


Are you sure this doesn't raise an error for tensors? Something like the ambiguous truth value error for a tensor; maybe use all or any or something like that?
I generally found x is not None more effective for tensors

This is primarily for image-specific parameters that are not passed on the LLM path. For example, if I am training an LLM and not a VLM, pixel_values is None.

Does it work when pixel_values is not None, i.e. it's a filled tensor?

If it is a filled tensor, it is chunked correctly, I just verified. Unless you were talking about something else?
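A small sketch of the behaviour discussed in this thread. The body of chunk_optional below is an assumption (the diff only shows its first two lines); it completes the function in the obvious way to show that an identity check against None never evaluates the tensor's truth value.

```python
import torch

def chunk_optional(tensor, chunks):
    # `is None` is an identity check, so a filled tensor is never
    # converted to bool and no ambiguity error can be raised here.
    if tensor is None:
        return [None] * chunks
    # Assumed completion: split along the batch dimension.
    return list(torch.chunk(tensor, chunks=chunks, dim=0))

B = 4
none_chunks = chunk_optional(None, B)                             # LLM path
filled_chunks = chunk_optional(torch.arange(8).reshape(4, 2), B)  # VLM path
```

By contrast, writing `if tensor:` on a tensor with more than one element raises a RuntimeError, which is the failure mode the reviewer was asking about.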

Datta0 · 2025-12-01T04:00:04Z

unsloth/models/rl_replacements.py

+            image_grid_thw_chunks = [None] * B
+            pixel_attention_mask_chunks = [None] * B
+
+            # This is the chunkng logit from trl 0.23.0


NIT: Fix typos

Datta0 · 2025-12-01T04:01:07Z

unsloth/models/rl_replacements.py

+            if image_grid_thw is not None and pixel_values is not None:
+                if image_grid_thw.shape[0] != B:
+                    raise ValueError(
+                        f"This logic requires image_grid_thw.shape[0] ({image_grid_thw.shape[0]}) "


Qsn: Does this mean we enforce one image per prompt or is it that we expect (bsz, num_images, img_shape) as input?

I believe this can still support multiple images; it's just that the batch sizes must match. I believe it is the latter in this case.

Datta0 · 2025-12-01T04:06:37Z

unsloth/models/rl_replacements.py

+                rows_per_sample = image_grid_thw.prod(dim = -1)
+                rows_per_sample_list = rows_per_sample.cpu().tolist()
+
+                pixel_values_chunks = list(


This list is only to ensure python iterability I suppose?
Can you please verify on some dummy inputs that this (just the split-list of the tensor, standalone and nothing else) doesn't replicate the data? Ideally it should not, but good to validate.

I checked memory usage when initializing the tensor and doing the list(chunk(tensor)) that trl does; it doesn't seem to copy tensors and uses the same pointers.
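The check described above can be reproduced on dummy inputs. This is a sketch that assumes the split is done with torch.split over per-sample row counts derived from image_grid_thw; the shapes and values are made up for illustration.

```python
import torch

# One (t, h, w) row per sample; values are illustrative only.
image_grid_thw = torch.tensor([[1, 2, 2], [1, 4, 2]])
rows_per_sample = image_grid_thw.prod(dim=-1).cpu().tolist()  # [4, 8]
pixel_values = torch.randn(sum(rows_per_sample), 16)

# list(...) only turns the returned tuple into a list; it does not copy data.
pixel_values_chunks = list(torch.split(pixel_values, rows_per_sample, dim=0))

# Each chunk is a view: its data pointer lands inside the original storage.
offsets = [0, rows_per_sample[0]]
shares_memory = [
    chunk.data_ptr() == pixel_values[start].data_ptr()
    for chunk, start in zip(pixel_values_chunks, offsets)
]
```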

Datta0 · 2025-12-01T04:11:00Z

unsloth/models/rl_replacements.py

+                logit_scale_multiply = 0
+            logit_scale_divide = getattr(model.config, "logits_scaling", 0)
+            if logit_scale_divide is None:
+                logit_scale_divide = 0


Is it smart to set the division factor to 0?
Also, I've always wondered if we should incorporate this within the multiply itself.

I set it to zero because in https://github.com/pluesclues/unsloth-zoo/blob/ccccf08e0ffef23a45d55cf1235fdcf1a4d918cc/unsloth_zoo/rl_replacements.py#L89-L99 we only do these logit operations if these parameters are passed and are not zero. This logic is also in https://github.com/unslothai/unsloth-zoo/blob/7209a76795401a08f43403c3d79c17a01ac6eedb/unsloth_zoo/rl_replacements.py#L237-L251.

Yes ik, we do use zero as nothing case here, but like why do we even do that? Can't we refactor this?
@danielhanchen thoughts?
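To make the "zero means nothing" convention concrete, here is a rough sketch. The function name apply_logit_scaling is hypothetical and the body is an assumption based on the description in this thread, not a copy of the linked unsloth-zoo code.

```python
import torch

def apply_logit_scaling(logits, logit_scale_multiply=0.0, logit_scale_divide=0.0):
    # 0 acts as the "not configured" sentinel, so the division below
    # is skipped and never sees a zero divisor.
    if logit_scale_multiply != 0:
        logits = logits * logit_scale_multiply
    if logit_scale_divide != 0:
        logits = logits / logit_scale_divide
    return logits

logits = torch.tensor([2.0, 4.0])
unchanged = apply_logit_scaling(logits)                        # both sentinels
halved = apply_logit_scaling(logits, logit_scale_divide=2.0)   # divide only
```

Using None as the sentinel instead would make the intent more explicit, which is roughly the refactor being asked about.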

Datta0 · 2025-12-01T04:12:58Z

unsloth/models/rl_replacements.py

+                            completion_input_ids_chunk = input_ids_chunk[
+                                :, -logits_to_keep:
+                            ]
+                        # breakpoint()


NIT: cleanup

Datta0 · 2025-12-01T04:13:53Z

unsloth/models/rl_replacements.py

+                        )
+
+                        all_logprobs_list.append(logprobs_chunk)
+                    logprobs = torch.cat(all_logprobs_list, dim = 0)


I'm very curious as to what the memory usage before and after this step looks like
It'd be great if you can get that info...

Essentially, the logit calculation is chunked which is great. But all logits are still materialised at once...

The resulting dim is [bsz, context_length] for logprobs, since we grab the specific token logprobs that are actually used in the sequence rather than storing [bsz, context_length, vocab_dim].
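A simplified sketch of the shape argument above, assuming a plain linear LM head and chunking over the batch dimension only. The names selective_log_softmax and chunked_logprobs are illustrative and are not the PR's grpo_selective_log_softmax implementation.

```python
import torch

def selective_log_softmax(logits, index):
    # log p(token) = logit[token] - logsumexp(logits)
    selected = torch.gather(logits, dim=-1, index=index.unsqueeze(-1)).squeeze(-1)
    return selected - torch.logsumexp(logits, dim=-1)

def chunked_logprobs(hidden_states, lm_head, input_ids, chunks):
    all_logprobs_list = []
    hidden_chunks = torch.chunk(hidden_states, chunks, dim=0)
    id_chunks = torch.chunk(input_ids, chunks, dim=0)
    for h, ids in zip(hidden_chunks, id_chunks):
        logits_chunk = h @ lm_head.t()  # [chunk, T, V], only this chunk's logits
        all_logprobs_list.append(selective_log_softmax(logits_chunk, ids))  # [chunk, T]
    # Only the gathered values are concatenated: [B, T], not [B, T, V]
    return torch.cat(all_logprobs_list, dim=0)

torch.manual_seed(0)
B, T, H, V = 4, 5, 8, 32
hidden_states = torch.randn(B, T, H)
lm_head = torch.randn(V, H)
input_ids = torch.randint(0, V, (B, T))
logprobs = chunked_logprobs(hidden_states, lm_head, input_ids, chunks=B)
```

The concatenated result is smaller than the full logits by a factor of the vocabulary size, which is why the final torch.cat is not expected to dominate memory.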

danielhanchen · 2025-12-01T04:59:33Z

unsloth/models/rl_replacements.py

+                left_pad_tokens_per_prompt = calculate_pad_tokens_in_prompt(
+                    input_ids, logits_to_keep, self.processing_class.pad_token_id
+                )
+                max_left_pad = max(left_pad_tokens_per_prompt).item()


torch op here as well
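A sketch of the suggestion, using torch.max instead of Python's built-in max on a tensor. The helper count_left_pad is hypothetical and stands in for calculate_pad_tokens_in_prompt, whose actual signature and behaviour are not shown in full here.

```python
import torch

def count_left_pad(prompt_ids, pad_token_id):
    # A position counts as leading padding while no non-pad token
    # has been seen yet in that row.
    seen_token = (prompt_ids != pad_token_id).long().cumsum(dim=1) > 0
    return (~seen_token).sum(dim=1)

prompt_ids = torch.tensor([[0, 0, 5, 6], [0, 7, 0, 9], [1, 2, 3, 4]])
left_pad_tokens_per_prompt = count_left_pad(prompt_ids, pad_token_id=0)

# torch.max reduces on the tensor directly; built-in max iterates
# over the tensor element by element in Python.
max_left_pad = torch.max(left_pad_tokens_per_prompt).item()
```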

pluesclues added 17 commits November 7, 2025 16:37

make it compatible with chunked hidden states selective log softmax

5278458

Merge branch 'unslothai:main' into alternative_compute_chunked_loss

65d6d9f

Merge branch 'unslothai:main' into alternative_compute_chunked_loss

1e49528

Refactor grpo_trainer for logps and entropies handling

494f611

Refactor grpo_trainer functions to handle log probabilities and entropies. Introduce mixed precision handling and improve input processing for model predictions.

Update fmt.Println message from 'Hello World'

387939f

Merge branch 'unslothai:main' into alternative_compute_chunked_loss

eccd41d

Refactor chunking logic for pixel values and image grid

95abf46

Adapt logic to handle image sizes and chunk pixel values based on image grid dimensions.

Merge branch 'unslothai:main' into alternative_compute_chunked_loss

52b23ff

Refactor padding logic with max_left_pad handling

f2102c8

Refactor padding logic to incorporate max_left_pad variable for better handling of prompt completion.

Merge branch 'unslothai:main' into alternative_compute_chunked_loss

16f6be6

Clean up padding logic and remove unused comments

ac15b81

Refactor padding logic and remove commented code.

Merge branch 'unslothai:main' into alternative_compute_chunked_loss

f49bf4f

Update vllm usage conditions with importance sampling check

ea6964a

Added check for vllm_importance_sampling_correction in conditions using self.use_vllm.

Disable TRL importance sampling logic

ca6b826

Disable TRL's importance sampling logic in the function.

Refactor error handling in rl_replacements.py

f2b29ee

Refactor vllm_importance_sampling_correction checks

8d263b1

Add grpo_selective_log_softmax to RL replacements

d332e93

pluesclues mentioned this pull request Nov 21, 2025

Chunk Across Batch and Context length for logprob calculations for grpo unslothai/unsloth-zoo#357

Open

[pre-commit.ci] auto fixes from pre-commit.com hooks

4be35d8

for more information, see https://pre-commit.ci

gemini-code-assist bot reviewed Nov 21, 2025

pluesclues and others added 2 commits November 21, 2025 16:36

Refactor code for readability and consistency

84c56aa

[pre-commit.ci] auto fixes from pre-commit.com hooks

9b2539c

for more information, see https://pre-commit.ci

danielhanchen changed the base branch from main to nightly November 27, 2025 03:37

chatgpt-codex-connector bot reviewed Nov 27, 2025

pluesclues and others added 4 commits November 30, 2025 19:32

Merge branch 'nightly' into alternative_compute_chunked_loss

52cbcaf

[pre-commit.ci] auto fixes from pre-commit.com hooks

21573bd

for more information, see https://pre-commit.ci

Fix max_left_pad assignment to use prompt_ids shape

08b3237

Refactor loss computation to use log probabilities

052cb39

Datta0 reviewed Dec 1, 2025

danielhanchen reviewed Dec 1, 2025

pluesclues and others added 11 commits December 1, 2025 00:06

Replace max with torch.max for left padding calculation

5c4c8bc

Merge branch 'unslothai:main' into alternative_compute_chunked_loss

dd22916

Update rl.py

68ffcb7

Update _utils.py

fb9ea50

Updated max left pad handling for trl 0.24.0

fdf0bb6

[pre-commit.ci] auto fixes from pre-commit.com hooks

e478df7

for more information, see https://pre-commit.ci

Merge branch 'unslothai:main' into alternative_compute_chunked_loss

e2573a9

Clarify importance sampling logic in vLLM

de7ebb6

Add comments explaining the disabling of TRL's importance sampling logic.

Clarify VLMs handling in logits_chunk processing

63b092c

Add comment explaining handling of VLMs in logits processing.

Fix typo in comment for chunking logic

f0a61ed

Remove commented breakpoint from rl_replacements.py

e9966c1
