
Conversation

@mandy-li

This PR enables dequantization of fp8 weights that were quantized channel-wise with the compressed-tensors method.
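
As a rough illustration of the idea (not this PR's actual code; the tensor names and shapes below are assumptions for the example only), channel-wise dequantization multiplies each output channel of the fp8 weight by its own scale:

import torch

# Standalone sketch of channel-wise fp8 dequantization (illustrative only).
out_features, in_features = 4, 8
weight_fp8 = torch.randn(out_features, in_features).to(torch.float8_e4m3fn)
weight_scale = torch.rand(out_features, 1) + 0.5  # one scale per output channel

# Upcast the fp8 weight, then broadcast the per-channel scale along dim 0.
dequant = weight_fp8.to(weight_scale.dtype) * weight_scale
print(dequant.shape)  # torch.Size([4, 8])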

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
c309bb5245b6d05228c9d2f9c8f3e769c08d9194


def get_dequant_weights_func(self, ) -> Optional[Callable[[torch.nn.Module], torch.Tensor]]:
return self.dequant_fp8_weight

Contributor

It would be better to assign get_dequant_weights_func to the layer to stay consistent with the existing implementation, and no changes are required on the INC side.

else:
    # For INC path, we attach the dequant func to the layer
    layer.get_dequant_weights_func = types.MethodType(get_dequant_weights_func, layer)
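
A toy sketch of this binding pattern (the layer and function bodies here are illustrative, not the module's actual code): once the function is attached via types.MethodType, a consumer such as INC can look it up on the layer and invoke it directly, with no INC-side changes.

import types
import torch

def get_dequant_weights_func(self):
    # 'self' is the layer the function was bound to
    return self.weight * 2.0  # placeholder dequant logic

layer = torch.nn.Linear(4, 4)
layer.get_dequant_weights_func = types.MethodType(get_dequant_weights_func, layer)

# A caller (e.g. INC) can discover and invoke it on the layer:
if hasattr(layer, "get_dequant_weights_func"):
    dequant = layer.get_dequant_weights_func()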

@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
e924bbb4f4ac3258a71a18ac4c753c8056bc059f

@mandy-li
Author

@yiliu30 , addressed your comment by binding the dequant function to the linear layer after loading the weights. Please review.

Contributor

@yiliu30 yiliu30 left a comment

LGTM

@yiliu30
Contributor

yiliu30 commented Nov 25, 2025

@xuechendi Please be aware this change, thanks!

@mandy-li mandy-li force-pushed the main branch 2 times, most recently from 73d8ed6 to b91f94c on November 26, 2025 07:46
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
0353d2e162cbda776d9dbfe026e65303204a7f1f

@xuechendi
Collaborator

@skavulya @lkk12014402 , please help cross-review, since you're working on compressed-tensors.


# bind dequant function to layer for per-channel quantization
if layer.scheme.strategy == QuantizationStrategy.CHANNEL:
    hpu_ops.bind_dequant_func(layer)
Collaborator

If the PR is only for INC dynamic, we should not bind the dequant function for every per-channel case here, right?
What is the scope of this PR?

Author

For INC. I can check whether the QUANT_CONFIG env var is set or not, if you think that's necessary.
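
A minimal sketch of such a gate, assuming QUANT_CONFIG is the environment variable INC reads to locate its quantization config (the call site in the comment below is hypothetical):

import os

def is_inc_path() -> bool:
    # Treat INC as active only when a quantization config is provided.
    return bool(os.environ.get("QUANT_CONFIG"))

# Hypothetical usage at the binding site discussed above:
# if layer.scheme.strategy == QuantizationStrategy.CHANNEL and is_inc_path():
#     hpu_ops.bind_dequant_func(layer)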

Collaborator

Please add the check; we can't hijack the non-INC path.

Collaborator

Seems the change will also impact #552.
Please also do a check for the dynamic scheme.

Author

No, this should apply to static quant as well.

If we rename bind_dequant_func() to something like fp8_perchannel_linear_postprocess_weights, to be consistent with fp8_block_linear_postprocess_weights (which is not INC-specific), do I still need to check for the INC path?

def dequant_fp8_weight(self, layer: torch.nn.Module) -> torch.Tensor:
    if layer.scheme.strategy == QuantizationStrategy.CHANNEL:  # weights were quantized per-channel
        dequant_weight = layer.weight.to(layer.weight_scale.dtype) * layer.weight_scale.squeeze()
        return dequant_weight.to(torch.bfloat16).t()
Collaborator

Does this work for Gaudi2? Will it get NaN, since the scale might be out of range?

Author

No, it's for Gaudi3.

Collaborator

Oh, I checked CI; it seems Gaudi2 is not getting NaN, which is quite unexpected.
@yiliu30, are there any recent changes that fix the Gaudi2 scale issue? Or is it because "scale_method": "ACT_MAXABS_PCS_POW2_WEIGHT_MAXABS_PTS_POW2_HW" will keep the range under 244?

Collaborator

Never mind, I realized this is handled in create_weights.

@xuechendi
Collaborator

@yiliu30 , please help review. This PR enables INC dynamic for compressed_tensor; I would like to know whether it meets your initial design.

@github-actions

github-actions bot commented Dec 1, 2025

✅ CI Passed

All checks passed successfully against the following vllm commit:
0353d2e162cbda776d9dbfe026e65303204a7f1f

@skavulya
Contributor

skavulya commented Dec 1, 2025

LGTM

return wrapper


def bind_dequant_func(layer):
Collaborator

Suggest following the same naming pattern as the rest: fp8_perchannel_linear_postprocess_weights

@yiliu30
Contributor

yiliu30 commented Dec 2, 2025

@yiliu30 , please help review. This PR enables INC dynamic for compressed_tensor; I would like to know whether it meets your initial design.

Yes, it’s aligned with what we did for block-wise scaling.
