
Conversation

@zahrayousefijamarani

causal param should be True to generate correct answers during inference.
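For context, a minimal sketch of what the causal flag controls, using PyTorch's scaled_dot_product_attention as a stand-in for the flash attention call (this is not the backend's actual call site):

```python
# Illustrative only: shows the effect of the causal flag, not the project's code.
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 16, 64)  # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 16, 64)
v = torch.randn(1, 8, 16, 64)

# causal off: every query position also attends to future key positions,
# which is wrong for autoregressive decoding.
out_non_causal = F.scaled_dot_product_attention(q, k, v, is_causal=False)

# causal on: position i only attends to positions <= i.
out_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(out_non_causal, out_causal))  # False: the mask changes the outputs
```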
@matthewygf
Collaborator

@zahrayousefijamarani Thanks for the contribution.

While you are correct that the causal param should be true, the related issue seems to indicate that without the sparse attention impl, causal = true will lead to perf degradation. I am a bit hesitant to turn this on without the sparse attn impl. WDYT?

cc @egretwAlker @YaoJiayi

@egretwAlker
Contributor

Hi! LMCFlashAttnBackend is not called in our implementation; it is left there as a reference to LMCache. We use LMCAttnBackend, which is a PyTorch implementation using the correct causal mask. So in the flash attention impl, turning causal on or off degrades accuracy either way, because it doesn't take the selected token positions into account.
But if you find that using the flash attention causal mask doesn't degrade accuracy, let us know!
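To make the point about selected token positions concrete, here is a rough sketch of a position-aware causal mask (query_positions and key_positions are made-up values; this is not the LMCAttnBackend code):

```python
# Illustrative only: a causal mask built from the original token positions,
# rather than from row/column indices as a plain causal flag would use.
import torch

query_positions = torch.tensor([2, 5, 8])  # hypothetical selected token positions
key_positions = torch.arange(10)           # positions of all cached keys

# A query at position p may only attend to keys at positions <= p.
mask = key_positions.unsqueeze(0) <= query_positions.unsqueeze(1)  # (3, 10) bool

print(mask.long())
# tensor([[1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
#         [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])
```

A plain causal flag only sees the row/column indices (0..2 vs 0..9), not the real positions 2, 5, 8, so it cannot reproduce this mask whichever way it is set.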
