
Conversation

@jeffdaily (Contributor) commented Sep 22, 2025

Previously #6086 added ROCm support but after numerous rebases it lost critical changes. This PR restores the ROCm build.

CMakeLists.txt was updated to add the CUDA sources to the ROCm build and to mark *.cu files as HIP files. Many additions were made to include/LightGBM/cuda/cuda_rocm_interop.h. Running a script to hipify the sources in place prior to the build is no longer needed (the script was removed in #6086).
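For context, this kind of interop header typically works by aliasing the CUDA runtime symbols used by the shared *.cu sources to their HIP equivalents when building for ROCm. The actual contents of cuda_rocm_interop.h are not quoted in this PR, so the sketch below is illustrative only:

```cpp
// Illustrative sketch only -- not the actual contents of
// include/LightGBM/cuda/cuda_rocm_interop.h.
// The general idea: when building for ROCm, map the CUDA runtime names that
// the shared *.cu sources call onto their HIP counterparts, so the same
// sources compile with hipcc without an in-place hipify pass.
#ifdef USE_ROCM
#include <hip/hip_runtime.h>

#define cudaError_t           hipError_t
#define cudaSuccess           hipSuccess
#define cudaStream_t          hipStream_t
#define cudaMalloc            hipMalloc
#define cudaFree              hipFree
#define cudaMemcpyAsync       hipMemcpyAsync
#define cudaStreamCreate      hipStreamCreate
#define cudaStreamDestroy     hipStreamDestroy
#define cudaDeviceSynchronize hipDeviceSynchronize
#endif  // USE_ROCM
```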

There are many source file changes, but most were automated using the following:

```bash
for f in `grep -rl '#ifdef USE_CUDA'`
do
    sed -i 's@#ifdef USE_CUDA@#if defined(USE_CUDA) || defined(USE_ROCM)@g' $f
done

for f in `grep -rl '#endif  // USE_CUDA'`
do
    sed -i 's@#endif  // USE_CUDA@#endif  // USE_CUDA || USE_ROCM@g' $f
done
```
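For illustration (not a specific file from this PR), a guard that previously read `#ifdef USE_CUDA ... #endif  // USE_CUDA` comes out of the script as:

```cpp
// Illustrative example of a rewritten guard, assuming the ROCm build defines
// USE_ROCM rather than USE_CUDA; real files keep their existing CUDA code
// between the directives.
#if defined(USE_CUDA) || defined(USE_ROCM)
// ... CUDA/ROCm-only code ...
#endif  // USE_CUDA || USE_ROCM
```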
```
error: explicit specialization in non-namespace scope ‘class ArrowChunkedArrayTest’
```
@jameslamb (Collaborator) left a comment

Thanks for the PR. I put up some initial thoughts in a non-blocking review, because I'm about to be unavailable for an extended time and I don't want to block the PR.

I'm glad you're saying this will fix the ROCm builds but, to be honest, it's also frustrating to need this... I asked multiple times in #6086 if anyone involved had tested that PR, and it seems like it was merged without anyone even manually testing it. So I'm sorry, but I feel I have to ask... have you manually tested this? Both building the library and running the Python tests with device_type="cuda"? If not, can you please do that and let us know before this is merged? Or @shiyu1994, are you able to help with testing like that?

It would also be nice to see some installation docs added in https://github.com/microsoft/LightGBM/blob/master/docs/Installation-Guide.rst. There are currently no docs about the ROCm version.

@jameslamb added the fix and gpu (ROCm) labels Sep 23, 2025
@jameslamb changed the title from "[ROCm] re-add support after #6086" to "[ROCm] add ROCm support (pt. 2)" Sep 23, 2025
@jeffdaily (Contributor, Author) commented

@jameslamb pytest results:

```
====================================================== short test summary info ======================================================
FAILED tests/python_package_test/test_basic.py::test_dataset_construction_overwrites_user_provided_metadata_fields - lightgbm.basic.LightGBMError: Positions in learning to rank is not supported in CUDA version yet.
FAILED tests/python_package_test/test_engine.py::test_ranking_with_position_information_with_dataset_constructor - lightgbm.basic.LightGBMError: Positions in learning to rank is not supported in CUDA version yet.
===================== 2 failed, 1098 passed, 12 skipped, 8 xfailed, 4 xpassed, 913 warnings in 95.53s (0:01:35) =====================
```

I've also added docs for building with ROCm.

@jeffdaily (Contributor, Author) commented

@jameslamb The regenerate-parameters pre-commit check is failing for me. It wants to undo my changes to docs/Parameters.rst.

```diff
diff --git a/docs/Parameters.rst b/docs/Parameters.rst
index e2c7d8a1..1e9eacf7 100644
--- a/docs/Parameters.rst
+++ b/docs/Parameters.rst
@@ -264,7 +264,7 @@ Core Parameters
 
    -  ``cpu`` supports all LightGBM functionality and is portable across the widest range of operating systems and hardware
 
-   -  ``cuda`` offers faster training than ``gpu`` or ``cpu``, but only works on GPUs supporting CUDA or ROCm
+   -  ``cuda`` offers faster training than ``gpu`` or ``cpu``, but only works on GPUs supporting CUDA
 
    -  ``gpu`` can be faster than ``cpu`` and works on a wider range of GPUs than CUDA
 
@@ -272,7 +272,7 @@ Core Parameters
 
    -  **Note**: for the faster speed, GPU uses 32-bit float point to sum up by default, so this may affect the accuracy for some tasks. You can set ``gpu_use_dp=true`` to enable 64-bit float point, but it will slow down the training
 
-   -  **Note**: refer to `Installation Guide <./Installation-Guide.rst>`__ to build LightGBM with GPU, CUDA, or ROCm support
+   -  **Note**: refer to `Installation Guide <./Installation-Guide.rst>`__ to build LightGBM with GPU or CUDA support
 
 -  ``seed`` :raw-html:`<a id="seed" title="Permalink to this parameter" href="#seed">&#x1F517;&#xFE0E;</a>`, default = ``None``, type = int, aliases: ``random_seed``, ``random_state``
```

@jeffdaily requested a review from jameslamb September 23, 2025 18:30
@jameslamb (Collaborator) commented

> The regenerate-parameters pre-commit check is failing for me. It wants to undo my changes to docs/Parameters.rst.

We code-generate that file from comments in include/LightGBM/config.h.

  1. make the changes you want in include/LightGBM/config.h
  2. run `pre-commit run --all-files` (this should automatically update docs/Parameters.rst)
  3. push the changes here
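For reference, the generator reads comments above each parameter in config.h; the sketch below is illustrative only (see include/LightGBM/config.h for the exact keys and wording), so the ROCm mention needs to go into those `desc` comments rather than into docs/Parameters.rst directly:

```cpp
// Illustrative sketch of the comment format consumed by the Parameters.rst
// generator -- not a verbatim excerpt from include/LightGBM/config.h.
// type = enum
// options = cpu, gpu, cuda
// alias = device
// desc = device for the tree learning
// desc = ``cpu`` supports all LightGBM functionality and is portable ...
std::string device_type = "cpu";
```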

@jameslamb (Collaborator) left a comment

Thanks so much for the updates!

If you have any issues with the pre-commit stuff let me know, I'd be happy to push a fix here. Don't want to waste your valuable time fiddling with that.

This is easier to review now, and going through it I noticed a few other things, so I have a few more very minor suggestions.

But overall this is looking great to me. I really appreciate you adding docs and running the tests, and it's great to see that they mostly ran successfully! The 2 tests that failed did so because they use an environment variable (only set in CI) to detect whether you're running with CUDA support enabled:

```python
if getenv("TASK", "") == "cuda":
    position = None
```

You could set that environment variable to cover them, but it's not a big deal either way:

```bash
export TASK=cuda
```

> Both building the library and running the Python tests with device_type="cuda"?

I'm so sorry, but I should have mentioned explicitly... did you apply something like what we do in CI here to ensure LightGBM tries to use the GPU by default in all tests?

In CI, we patch the source code before compiling to make device_type="cuda" the default, so all tests will try to use the GPU:

LightGBM/.ci/test.sh, lines 189 to 192 (at 8d4ac68):

```bash
sed -i'.bak' 's/std::string device_type = "cpu";/std::string device_type = "cuda";/' ./include/LightGBM/config.h
grep -q 'std::string device_type = "cuda"' ./include/LightGBM/config.h || exit 1  # make sure that changes were really done
# by default ``gpu_use_dp=false`` for efficiency. change to ``true`` here for exact results in ci tests
sed -i'.bak' 's/gpu_use_dp = false;/gpu_use_dp = true;/' ./include/LightGBM/config.h
```

If not, can you please try that?

@jeffdaily (Contributor, Author) commented

@jameslamb edited include/LightGBM/config.h to set "cuda" as device_type and gpu_use_dp=true.

Reran pytest:

```
TASK=cuda pytest tests/python_package_test/

====================================================== short test summary info ======================================================
FAILED tests/python_package_test/test_engine.py::test_all_expected_params_are_written_out_to_model_text - AssertionError: assert '[force_row_wise: 1]' in 'tree\nversion=v4\nnum_class=1\nnum_tree_per_iteration=1\nlabel_index=0\nmax_fea...
===================== 1 failed, 1083 passed, 28 skipped, 8 xfailed, 4 xpassed, 911 warnings in 87.84s (0:01:27) =====================
```

And `testlightgbm` ran cleanly:

```
[----------] Global test environment tear-down
[==========] 31 tests from 8 test suites ran. (8445 ms total)
[  PASSED  ] 31 tests.
```

@jeffdaily requested a review from jameslamb September 25, 2025 17:17
@jameslamb (Collaborator) left a comment

Thanks very much!

That one test failure is a bit concerning, because device_type="cuda" really should always result in [force_row_wise: 1] in the params:

LightGBM/src/io/config.cpp, lines 410 to 416 (at 8d4ac68):

```cpp
} else if (device_type == std::string("cuda")) {
  // force row-wise for cuda version
  force_col_wise = false;
  force_row_wise = true;
  if (deterministic) {
    Log::Warning("Although \"deterministic\" is set, the results ran by GPU may be non-deterministic.");
  }
```

HOWEVER... I don't think it's worth blocking the PR over. I even think I might remember that test being flaky in the past.

I left one tiny suggestion on the Python build script, but I don't have any other suggestions, so I'm marking this "approve". I would really like one other reviewer who's more familiar with the C++ code (@guolinke or @shiyu1994?) to review before we merge this.


> The `original GPU version <#build-gpu-version>`__ of LightGBM (``device_type=gpu``) is based on OpenCL.
>
> The ROCm-based version (``device_type=cuda``) is a separate implementation. Yes, the ROCm version reuses ``device_type=cuda`` as a convenience for users. Use this version in Linux environments with an AMD GPU.
A collaborator commented on the docs excerpt above:

I'm ok with this, given that PyTorch does the same thing: https://docs.pytorch.org/docs/stable/notes/hip.html

(no action required, just commenting for other maintainers to see)

@jeffdaily (Contributor, Author) commented

> That one test failure is a bit concerning, because device_type="cuda" really should always result in [force_row_wise: 1] in the params:

@jameslamb This was my mistake. When I made the changes to config.h, I didn't reinstall the Python package. This test is now passing.

@jameslamb (Collaborator) commented

Oh excellent, thanks!!

@StrikerRUS (Collaborator) left a comment

@jeffdaily Thank you very much for coming back and providing this PR with fixes! We really appreciate it!

I don't have any comments for this PR.

Note to myself: add docs about ROCm compilation to python-package/README.rst after the merge.

@StrikerRUS (Collaborator) left a comment

Re-approving.

@StrikerRUS (Collaborator) commented

@shiyu1994 Could you please have a look?
