getting error while fine tuning gemma 3 #2376
I tried to fine-tune a Gemma 3 model using Unsloth, but I am getting the error below.
Hi, can you show your code here?
Hi @uscneps, I am getting the same error with the gemma-3-12b-it and gemma-3-4b-it models.

from unsloth import FastModel, unsloth_train

model, tokenizer = FastModel.from_pretrained(
model_name = "unsloth/gemma-3-12b-it",
max_seq_length = 2048, # Choose any for long context!
load_in_4bit = True, # 4 bit quantization to reduce memory
load_in_8bit = False,
full_finetuning = False, # [NEW!] We have full finetuning now!
)
model = FastModel.get_peft_model(
model,
finetune_vision_layers = False, # Turn off for just text!
finetune_language_layers = True, # Should leave on!
finetune_attention_modules = True, # Attention good for GRPO
finetune_mlp_modules = True, # Should leave on always!
r = 8, # Larger = higher accuracy, but might overfit
lora_alpha = 8, # Recommended alpha == r at least
lora_dropout = 0,
bias = "none",
random_state = 3407,
)
# <data preparation and related steps copied from the official Unsloth Gemma 3 Colab notebook>
from trl import SFTTrainer, SFTConfig
from transformers import TrainingArguments, DataCollatorForSeq2Seq, EarlyStoppingCallback
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
model = model,
data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
# tokenizer = tokenizer,
train_dataset = train_dataset,
eval_dataset = val_dataset,
callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
args=SFTConfig(
dataset_text_field = "text",
dataset_num_proc = 2,
max_seq_length = 2048,
packing = False,
learning_rate = 2e-4,
output_dir = "gemma-4b-it-checkpoints",
eval_strategy="epoch",
per_device_train_batch_size = 8,
per_device_eval_batch_size = 8,
gradient_accumulation_steps=4,
torch_empty_cache_steps = 16,
weight_decay = 0.01,
num_train_epochs=6,
lr_scheduler_type = "cosine",
warmup_steps = 50,
logging_steps = 1,
logging_nan_inf_filter = False,
save_strategy="epoch",
save_total_limit=3,
seed = 42,
bf16 = False,
fp16= False,
run_name="gemma3-4b-it-trial-01",
load_best_model_at_end=True,
metric_for_best_model="loss",
optim = "adamw_8bit",
report_to = "wandb",
),
)
trainer_stats = unsloth_train(trainer)

Error: Detailed Error Stack
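For reference, a quick way to record the installed library versions (a minimal standard-library sketch; the names below are the PyPI distribution names) is:

import importlib.metadata as md

# Print versions of the packages involved; Gemma 3 support depends on
# recent Unsloth / transformers builds, so this narrows things down quickly.
for pkg in ("unsloth", "unsloth_zoo", "transformers", "trl", "torch"):
    try:
        print(f"{pkg}: {md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")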
I got the same error when I tried to fine-tune "sarvamai/sarvam-translate", which is a Gemma-3-based model.
Oh my apologies, I did not notice that we already solved this issue back in late June 2025 and we did not notify any of you - so sorry! Gemma-3 works as expected, but you need to update Unsloth or rerun the Gemma-3 notebook, e.g. our Gemma 3 270M Chess example or Gemma 3 4B finetuning example.

To update Unsloth, please do:

pip install --upgrade --force-reinstall --no-cache-dir --no-deps unsloth unsloth_zoo

To enable full finetuning on Gemma-3, do:

model, tokenizer = FastModel.from_pretrained(
model_name = "unsloth/gemma-3-270m-it",
max_seq_length = max_seq_length, # Choose any for long context!
load_in_4bit = False, # 4 bit quantization to reduce memory
load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
full_finetuning = True, # [NEW!] We have full finetuning now!
)
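As a quick sanity check (a minimal sketch using plain PyTorch, nothing Unsloth-specific), you can verify that full finetuning leaves essentially all parameters trainable, whereas a LoRA setup only trains a small fraction:

# Compare trainable vs. total parameter counts on the loaded model.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / total: {total:,} ({100 * trainable / total:.1f}%)")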
To instead load Gemma-3 in float32, do:

model, tokenizer = FastModel.from_pretrained(
model_name = "unsloth/gemma-3-270m-it",
max_seq_length = max_seq_length, # Choose any for long context!
# full_finetuning = True, # [NEW!] We have full finetuning now!
torch_dtype = torch.float32,
)
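If you are unsure whether your GPU can use bfloat16 at all, a small check like this (an illustrative sketch with plain PyTorch) picks a dtype to pass as torch_dtype, falling back to float32 on older hardware:

import torch

# bfloat16 needs an Ampere or newer GPU; otherwise fall back to float32.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    dtype = torch.float32
print("Using torch_dtype =", dtype)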
If you experience OOMs with Gemma-3 270M full finetuning, remember to change

per_device_train_batch_size = 8,
gradient_accumulation_steps = 1, # Use GA to mimic batch size!

to

per_device_train_batch_size = 1,
gradient_accumulation_steps = 8, # Use GA to mimic batch size!

Due to our universal Gradient Accumulation bug fix, both of the above are equivalent, with the 2nd using much less memory (see the quick calculation below).

@peteparker123 @N-E-W-T-O-N @uscne @Preet-Sojitra So sorry, tagging you all is probably way too late.
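To make the equivalence concrete, here is a tiny illustrative calculation (not Unsloth code): the optimizer step is driven by the effective batch size, which is identical in both configurations, while the second keeps far fewer activations in memory at any time.

# Effective batch size = per-device batch size * gradient accumulation steps
# (* number of devices). Both settings give 8, so with the gradient
# accumulation fix they produce equivalent updates.
def effective_batch_size(per_device_bs, grad_accum_steps, num_devices=1):
    return per_device_bs * grad_accum_steps * num_devices

print(effective_batch_size(8, 1))  # 8, higher peak memory per step
print(effective_batch_size(1, 8))  # 8, roughly 1/8 the activation memory per step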