Skip to content

Conversation

@therealnaveenkamal
Copy link

What does this PR do ?

  • Add MLFlow integration for experiment tracking and artifact logging.

  • This PR adds comprehensive MLFlow support to Megatron Bridge, enabling users to log training metrics, configuration parameters, and checkpoint artifacts to MLFlow tracking servers. The integration includes automatic checkpoint artifact logging with iteration-based artifact paths.

  • MLFlow logging follows the same pattern as the existing W&B integration and can be enabled via LoggerConfig with mlflow_experiment and mlflow_run_name parameters.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
  • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

cc @Phlip79

Signed-off-by: Naveenraj Kamalakannan <[email protected]>
Signed-off-by: Naveenraj Kamalakannan <[email protected]>
@copy-pr-bot
Copy link

copy-pr-bot bot commented Dec 1, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Phlip79 Phlip79 linked an issue Dec 1, 2025 that may be closed by this pull request
@ericharper ericharper requested a review from Phlip79 December 1, 2025 16:37
@ericharper ericharper requested review from maanug-nv and removed request for maanug-nv December 1, 2025 16:38
@Phlip79 Phlip79 requested a review from maanug-nv December 2, 2025 02:35
@therealnaveenkamal
Copy link
Author

Hi @Phlip79 - implemented all the changes.

mlflow.set_experiment(logger_cfg.mlflow_experiment)

# Prepare tags and params
def safe_serialize(obj: Any) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move this to utils? it's duplicated with

def safe_serialize(obj):

@yaoyu-33
Copy link
Contributor

yaoyu-33 commented Dec 2, 2025

thanks for the contribution, all look good. Had one comment.

Copy link
Contributor

@maanug-nv maanug-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the support. Left 1 suggestion. Can I also ask you to add unit tests similar to what we do for wandb?

1) Install MLFlow (if not already available):

```bash
pip install mlflow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add this as an optional dependency, and maybe in validate, or LoggerConfig.finalize() assert that mlflow is importable. Maybe even a hard dep, since wandb is hard dep.

Signed-off-by: Naveenraj Kamalakannan <[email protected]>
Signed-off-by: Naveenraj Kamalakannan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Logging to MLFlow

4 participants