
Conversation

@jeffdaily (Contributor) commented Sep 22, 2025

Previously #6086 added ROCm support but after numerous rebases it lost critical changes. This PR restores the ROCm build.

CMakeLists.txt was updated to add the CUDA sources to the ROCm build and to mark *.cu files as HIP files. Many additions were made to include/LightGBM/cuda/cuda_rocm_interop.h. Running a script to hipify the sources in place prior to the build is no longer needed (the script was removed in #6086).
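For context, this kind of interop header typically works by aliasing the CUDA runtime symbols used by the shared *.cu sources to their HIP equivalents when building for ROCm. The actual contents of cuda_rocm_interop.h are not quoted in this PR, so the sketch below is illustrative only:

```cpp
// Illustrative sketch only -- not the actual contents of
// include/LightGBM/cuda/cuda_rocm_interop.h.
// The general idea: when building for ROCm, map the CUDA runtime names that
// the shared *.cu sources call onto their HIP counterparts, so the same
// sources compile with hipcc without an in-place hipify pass.
#ifdef USE_ROCM
#include <hip/hip_runtime.h>

#define cudaError_t           hipError_t
#define cudaSuccess           hipSuccess
#define cudaStream_t          hipStream_t
#define cudaMalloc            hipMalloc
#define cudaFree              hipFree
#define cudaMemcpyAsync       hipMemcpyAsync
#define cudaStreamCreate      hipStreamCreate
#define cudaStreamDestroy     hipStreamDestroy
#define cudaDeviceSynchronize hipDeviceSynchronize
#endif  // USE_ROCM
```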

There are many source file changes, but most were automated using the following:

```bash
for f in `grep -rl '#ifdef USE_CUDA'`
do
    sed -i 's@#ifdef USE_CUDA@#if defined(USE_CUDA) || defined(USE_ROCM)@g' $f
done

for f in `grep -rl '#endif  // USE_CUDA'`
do
    sed -i 's@#endif  // USE_CUDA@#endif  // USE_CUDA || USE_ROCM@g' $f
done
```
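For illustration (not a specific file from this PR), a guard that previously read `#ifdef USE_CUDA ... #endif  // USE_CUDA` comes out of the script as:

```cpp
// Illustrative example of a rewritten guard, assuming the ROCm build defines
// USE_ROCM rather than USE_CUDA; real files keep their existing CUDA code
// between the directives.
#if defined(USE_CUDA) || defined(USE_ROCM)
// ... CUDA/ROCm-only code ...
#endif  // USE_CUDA || USE_ROCM
```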
```
error: explicit specialization in non-namespace scope ‘class ArrowChunkedArrayTest’
```
@jameslamb (Collaborator) left a comment

Thanks for the PR. I put up some initial thoughts in a non-blocking review, because I'm about to be unavailable for an extended time and I don't want to block the PR.

I'm glad you're saying this will fix the ROCm builds but, to be honest, it's also frustrating to need this... I asked multiple times in #6086 if anyone involved had tested that PR, and it seems like it was merged without anyone even manually testing it. So I'm sorry, but I feel I have to ask... have you manually tested this? Both building the library and running the Python tests with device_type="cuda"? If not, can you please do that and let us know before this is merged? Or @shiyu1994, are you able to help with testing like that?

It would also be nice to see some installation docs added in https://github.com/microsoft/LightGBM/blob/master/docs/Installation-Guide.rst. There are currently no docs about the ROCm version.

@jameslamb added the fix and gpu (ROCm) labels Sep 23, 2025
@jameslamb changed the title from "[ROCm] re-add support after #6086" to "[ROCm] add ROCm support (pt. 2)" Sep 23, 2025
@jeffdaily (Contributor, Author) commented

@jameslamb pytest results:

```
====================================================== short test summary info ======================================================
FAILED tests/python_package_test/test_basic.py::test_dataset_construction_overwrites_user_provided_metadata_fields - lightgbm.basic.LightGBMError: Positions in learning to rank is not supported in CUDA version yet.
FAILED tests/python_package_test/test_engine.py::test_ranking_with_position_information_with_dataset_constructor - lightgbm.basic.LightGBMError: Positions in learning to rank is not supported in CUDA version yet.
===================== 2 failed, 1098 passed, 12 skipped, 8 xfailed, 4 xpassed, 913 warnings in 95.53s (0:01:35) =====================
```

I've also added docs for building with ROCm.

@jeffdaily (Contributor, Author) commented

@jameslamb The regenerate-parameters pre-commit check is failing for me. It wants to undo my changes to docs/Parameters.rst.

```diff
diff --git a/docs/Parameters.rst b/docs/Parameters.rst
index e2c7d8a1..1e9eacf7 100644
--- a/docs/Parameters.rst
+++ b/docs/Parameters.rst
@@ -264,7 +264,7 @@ Core Parameters
 
    -  ``cpu`` supports all LightGBM functionality and is portable across the widest range of operating systems and hardware
 
-   -  ``cuda`` offers faster training than ``gpu`` or ``cpu``, but only works on GPUs supporting CUDA or ROCm
+   -  ``cuda`` offers faster training than ``gpu`` or ``cpu``, but only works on GPUs supporting CUDA
 
    -  ``gpu`` can be faster than ``cpu`` and works on a wider range of GPUs than CUDA
 
@@ -272,7 +272,7 @@ Core Parameters
 
    -  **Note**: for the faster speed, GPU uses 32-bit float point to sum up by default, so this may affect the accuracy for some tasks. You can set ``gpu_use_dp=true`` to enable 64-bit float point, but it will slow down the training
 
-   -  **Note**: refer to `Installation Guide <./Installation-Guide.rst>`__ to build LightGBM with GPU, CUDA, or ROCm support
+   -  **Note**: refer to `Installation Guide <./Installation-Guide.rst>`__ to build LightGBM with GPU or CUDA support
 
 -  ``seed`` :raw-html:`<a id="seed" title="Permalink to this parameter" href="#seed">&#x1F517;&#xFE0E;</a>`, default = ``None``, type = int, aliases: ``random_seed``, ``random_state``
```

@jeffdaily requested a review from jameslamb September 23, 2025 18:30
@jameslamb (Collaborator) commented

> The regenerate-parameters pre-commit check is failing for me. It wants to undo my changes to docs/Parameters.rst.

We code-generate that file from comments in include/LightGBM/config.h.

  1. make the changes you want in include/LightGBM/config.h
  2. run `pre-commit run --all-files` (this should automatically update docs/Parameters.rst)
  3. push the changes here
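For reference, the generator reads comments above each parameter in config.h; the sketch below is illustrative only (see include/LightGBM/config.h for the exact keys and wording), so the ROCm mention needs to go into those `desc` comments rather than into docs/Parameters.rst directly:

```cpp
// Illustrative sketch of the comment format consumed by the Parameters.rst
// generator -- not a verbatim excerpt from include/LightGBM/config.h.
// type = enum
// options = cpu, gpu, cuda
// alias = device
// desc = device for the tree learning
// desc = ``cpu`` supports all LightGBM functionality and is portable ...
std::string device_type = "cpu";
```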

@jameslamb (Collaborator) left a comment

Thanks so much for the updates!

If you have any issues with the pre-commit stuff let me know, I'd be happy to push a fix here. Don't want to waste your valuable time fiddling with that.

This is easier to review now, and going through it I noticed a few other things, so I have a few more very minor suggestions.

But overall this is looking great to me. I really appreciate you adding docs and running the tests, and it's great to see that they mostly ran successfully! The 2 tests that failed did so because they use an environment variable (only set in CI) to detect whether you're running with CUDA support enabled:

```python
if getenv("TASK", "") == "cuda":
    position = None
```

You could set that environment variable to cover them, but it's not a big deal either way:

```bash
export TASK=cuda
```

> Both building the library and running the Python tests with device_type="cuda"?

I'm so sorry, but I should have mentioned explicitly... did you apply something like what we do in CI here to ensure LightGBM tries to use the GPU by default in all tests?

In CI, we patch the source code before compiling to make device_type="cuda" the default, so all tests will try to use the GPU:

LightGBM/.ci/test.sh, lines 189 to 192 (at 8d4ac68):

```bash
sed -i'.bak' 's/std::string device_type = "cpu";/std::string device_type = "cuda";/' ./include/LightGBM/config.h
grep -q 'std::string device_type = "cuda"' ./include/LightGBM/config.h || exit 1  # make sure that changes were really done
# by default ``gpu_use_dp=false`` for efficiency. change to ``true`` here for exact results in ci tests
sed -i'.bak' 's/gpu_use_dp = false;/gpu_use_dp = true;/' ./include/LightGBM/config.h
```

If not, can you please try that?

@jeffdaily (Contributor, Author) commented

@jameslamb edited include/LightGBM/config.h to set "cuda" as device_type and gpu_use_dp=true.

Reran pytest:

```
TASK=cuda pytest tests/python_package_test/

====================================================== short test summary info ======================================================
FAILED tests/python_package_test/test_engine.py::test_all_expected_params_are_written_out_to_model_text - AssertionError: assert '[force_row_wise: 1]' in 'tree\nversion=v4\nnum_class=1\nnum_tree_per_iteration=1\nlabel_index=0\nmax_fea...
===================== 1 failed, 1083 passed, 28 skipped, 8 xfailed, 4 xpassed, 911 warnings in 87.84s (0:01:27) =====================
```

And `testlightgbm` ran cleanly:

```
[----------] Global test environment tear-down
[==========] 31 tests from 8 test suites ran. (8445 ms total)
[  PASSED  ] 31 tests.
```

@jeffdaily requested a review from jameslamb September 25, 2025 17:17
@jameslamb (Collaborator) left a comment

Thanks very much!

That one test failure is a bit concerning, because device_type="cuda" really should always result in [force_row_wise: 1] in the params:

LightGBM/src/io/config.cpp, lines 410 to 416 (at 8d4ac68):

```cpp
} else if (device_type == std::string("cuda")) {
  // force row-wise for cuda version
  force_col_wise = false;
  force_row_wise = true;
  if (deterministic) {
    Log::Warning("Although \"deterministic\" is set, the results ran by GPU may be non-deterministic.");
  }
```

HOWEVER... I don't think it's worth blocking the PR over. I even think I might remember that test being flaky in the past.

I left one tiny suggestion on the Python build script, but I don't have any other suggestions, so I'm marking this "approve". I would really like one other reviewer who's more familiar with the C++ code (@guolinke or @shiyu1994?) to review before we merge this.


> The `original GPU version <#build-gpu-version>`__ of LightGBM (``device_type=gpu``) is based on OpenCL.
>
> The ROCm-based version (``device_type=cuda``) is a separate implementation. Yes, the ROCm version reuses ``device_type=cuda`` as a convenience for users. Use this version in Linux environments with an AMD GPU.
A collaborator commented on the docs excerpt above:

I'm ok with this, given that PyTorch does the same thing: https://docs.pytorch.org/docs/stable/notes/hip.html

(no action required, just commenting for other maintainers to see)

@jeffdaily (Contributor, Author) commented

> That one test failure is a bit concerning, because device_type="cuda" really should always result in [force_row_wise: 1] in the params:

@jameslamb This was my mistake. When I made the changes to config.h, I didn't reinstall the Python package. This test is now passing.

@jameslamb (Collaborator) commented

Oh excellent, thanks!!

@StrikerRUS (Collaborator) left a comment

@jeffdaily Thank you very much for coming back and providing this PR with fixes! We really appreciate it!

I don't have any comments for this PR.

Note to myself: add docs about ROCm compilation to python-package/README.rst after the merge.

@StrikerRUS (Collaborator) left a comment

Re-approving.

@StrikerRUS (Collaborator) commented

@shiyu1994 Could you please have a look?
