[ROCm] add ROCm support (pt. 2) #7039
base: master
Conversation
Previously microsoft#6086 added ROCm support but after numerous rebases it lost critical changes. This PR restores the ROCm build. There are many source file changes, but most were automated using the following:

```bash
for f in `grep -rl '#ifdef USE_CUDA'`
do
    sed -i 's@#ifdef USE_CUDA@#if defined(USE_CUDA) || defined(USE_ROCM)@g' $f
done

for f in `grep -rl '#endif // USE_CUDA'`
do
    sed -i 's@#endif // USE_CUDA@#endif // USE_CUDA || USE_ROCM@g' $f
done
```
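For illustration, the substitution rewrites each CUDA-only preprocessor guard so that it also covers the ROCm build; the declaration inside the guard below is a made-up placeholder, not code from this PR:

```cpp
// Before: compiled only when USE_CUDA is defined.
#ifdef USE_CUDA
void LaunchHistogramKernel();  // hypothetical declaration, for illustration only
#endif  // USE_CUDA

// After the sed substitution: the same block is also compiled for the ROCm build.
#if defined(USE_CUDA) || defined(USE_ROCM)
void LaunchHistogramKernel();  // hypothetical declaration, for illustration only
#endif  // USE_CUDA || USE_ROCM
```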
error: explicit specialization in non-namespace scope ‘class ArrowChunkedArrayTest’
jameslamb left a comment
Thanks for the PR. I put up some initial thoughts, in a non-blocking review because I'm about to be unavailable for an extended time so I don't want to block the PR.
I'm glad you're saying this will fix the ROCm builds, but to be honest it's also frustrating to need this... I asked multiple times in #6086 whether anyone involved had tested that PR, and it seems like it was merged without anyone even manually testing it. So I'm sorry, but I feel I have to ask: have you manually tested this? Both building the library and running the Python tests with device_type="cuda"? If not, can you please do that and let us know before this is merged? Or @shiyu1994, are you able to help with testing like that?
It would also be nice to see some installation docs added in https://github.com/microsoft/LightGBM/blob/master/docs/Installation-Guide.rst. There are currently no docs about the ROCm version.
This reverts commit e461e86.
Instead of replacing all `#ifdef USE_CUDA` guards, just add the USE_CUDA define to the ROCm build.
@jameslamb pytest results attached. I've also added docs for building with ROCm.
@jameslamb

```diff
diff --git a/docs/Parameters.rst b/docs/Parameters.rst
index e2c7d8a1..1e9eacf7 100644
--- a/docs/Parameters.rst
+++ b/docs/Parameters.rst
@@ -264,7 +264,7 @@ Core Parameters
 - ``cpu`` supports all LightGBM functionality and is portable across the widest range of operating systems and hardware
- - ``cuda`` offers faster training than ``gpu`` or ``cpu``, but only works on GPUs supporting CUDA or ROCm
+ - ``cuda`` offers faster training than ``gpu`` or ``cpu``, but only works on GPUs supporting CUDA
 - ``gpu`` can be faster than ``cpu`` and works on a wider range of GPUs than CUDA
@@ -272,7 +272,7 @@ Core Parameters
 - **Note**: for the faster speed, GPU uses 32-bit float point to sum up by default, so this may affect the accuracy for some tasks. You can set ``gpu_use_dp=true`` to enable 64-bit float point, but it will slow down the training
- - **Note**: refer to `Installation Guide <./Installation-Guide.rst>`__ to build LightGBM with GPU, CUDA, or ROCm support
+ - **Note**: refer to `Installation Guide <./Installation-Guide.rst>`__ to build LightGBM with GPU or CUDA support
 - ``seed`` :raw-html:`<a id="seed" title="Permalink to this parameter" href="#seed">🔗︎</a>`, default = ``None``, type = int, aliases: ``random_seed``, ``random_state``
```
We code-generate that file from comments in include/LightGBM/config.h.
jameslamb left a comment
Thanks so much for the updates!
If you have any issues with the pre-commit stuff let me know, I'd be happy to push a fix here. Don't want to waste your valuable time fiddling with that.
This is easier to review now, and thanks to that I noticed a few other very minor suggestions.
But overall this is looking great to me. I really appreciate you adding docs and running the tests, and it's great to see that they mostly ran successfully! The two that failed for you did so because they use an environment variable, set only in CI, to detect whether you're running with CUDA support enabled:
LightGBM/tests/python_package_test/test_basic.py, lines 513 to 514 in 8d4ac68:

```python
if getenv("TASK", "") == "cuda":
    position = None
```
You could set that environment variable to cover them (`export TASK=cuda`), but it's not a big deal either way.

> Both building the library and running the Python tests with device_type="cuda"?
I'm so sorry, but I should have mentioned explicitly... did you apply something like what we do in CI here to ensure LightGBM tries to use the GPU by default in all tests?
In CI, we patch the source code before compiling to make device_type="cuda" the default, so all tests will try to use the GPU (lines 189 to 192 in 8d4ac68):

```bash
sed -i'.bak' 's/std::string device_type = "cpu";/std::string device_type = "cuda";/' ./include/LightGBM/config.h
grep -q 'std::string device_type = "cuda"' ./include/LightGBM/config.h || exit 1  # make sure that changes were really done
# by default ``gpu_use_dp=false`` for efficiency. change to ``true`` here for exact results in ci tests
sed -i'.bak' 's/gpu_use_dp = false;/gpu_use_dp = true;/' ./include/LightGBM/config.h
```
If not, can you please try that?
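For clarity, after those two sed commands the patched defaults in include/LightGBM/config.h would read roughly as follows (a sketch reconstructed from the sed patterns above, not a verbatim excerpt of the file):

```cpp
// include/LightGBM/config.h (sketch of the two patched defaults)
// Every test now targets the GPU unless it explicitly overrides device_type.
std::string device_type = "cuda";
// Double-precision summation, so results are exact enough for the CI assertions.
bool gpu_use_dp = true;
```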
@jameslamb I edited include/LightGBM/config.h to set "cuda" as the device_type and gpu_use_dp=true, then reran pytest:

```
TASK=cuda pytest tests/python_package_test/
====================================================== short test summary info ======================================================
FAILED tests/python_package_test/test_engine.py::test_all_expected_params_are_written_out_to_model_text - AssertionError: assert '[force_row_wise: 1]' in 'tree\nversion=v4\nnum_class=1\nnum_tree_per_iteration=1\nlabel_index=0\nmax_fea...
===================== 1 failed, 1083 passed, 28 skipped, 8 xfailed, 4 xpassed, 911 warnings in 87.84s (0:01:27) =====================
```

And:

```
[----------] Global test environment tear-down
[==========] 31 tests from 8 test suites ran. (8445 ms total)
[  PASSED  ] 31 tests.
```
jameslamb left a comment
Thanks very much!
That one test failure is a bit concerning, because device_type="cuda" really should always result in [force_row_wise: 1] in the params:
Lines 410 to 416 in 8d4ac68:

```cpp
} else if (device_type == std::string("cuda")) {
  // force row-wise for cuda version
  force_col_wise = false;
  force_row_wise = true;
  if (deterministic) {
    Log::Warning("Although \"deterministic\" is set, the results ran by GPU may be non-deterministic.");
  }
```
HOWEVER... I don't think it's worth blocking the PR over. I even think I might remember that test being flaky in the past.
I left one tiny suggestion on the Python build script, but don't have any other suggestions so marking "approve". I would really like one other reviewer who's more familiar with C++ (@guolinke or @shiyu1994 ?) to review before we merge this.
The `original GPU version <#build-gpu-version>`__ of LightGBM (``device_type=gpu``) is based on OpenCL.
The ROCm-based version (``device_type=cuda``) is a separate implementation. Yes, the ROCm version reuses ``device_type=cuda`` as a convenience for users. Use this version in Linux environments with an AMD GPU.
I'm ok with this, given that PyTorch does the same thing: https://docs.pytorch.org/docs/stable/notes/hip.html
(no action required, just commenting for other maintainers to see)
@jameslamb This was my mistake. When I made the changes to config.h, I didn't reinstall the Python package. This test is passing.
Oh excellent, thanks!!
StrikerRUS left a comment
@jeffdaily Thank you very much for coming back and providing this PR with fixes! We really appreciate it!
I don't have any comments for this PR.
Note to myself: add docs about ROCm compilation to python-package/README.rst after the merge.
StrikerRUS left a comment
Re-approving.
@shiyu1994 Could you please have a look?
Previously #6086 added ROCm support but after numerous rebases it lost critical changes. This PR restores the ROCm build.
CMakeLists.txt was updated to add the CUDA sources to the ROCm build and to mark *.cu files as HIP files. Many additions were made to include/LightGBM/cuda/cuda_rocm_interop.h. Running a script to hipify the sources in-place prior to running the build is no longer needed (that step was removed in #6086).
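For readers unfamiliar with the approach, an interop header like this typically maps CUDA runtime names onto their HIP equivalents when building for ROCm, so the shared .cu sources compile unchanged under hipcc. The snippet below is only an illustrative sketch of that pattern and is not quoted from include/LightGBM/cuda/cuda_rocm_interop.h:

```cpp
// Illustrative sketch only; the real cuda_rocm_interop.h may look quite different.
#pragma once

#ifdef USE_ROCM
#include <hip/hip_runtime.h>
// Redirect a handful of CUDA runtime names to HIP so shared sources build with hipcc.
#define cudaError_t       hipError_t
#define cudaSuccess       hipSuccess
#define cudaMalloc        hipMalloc
#define cudaFree          hipFree
#define cudaMemcpy        hipMemcpy
#define cudaMemcpyKind    hipMemcpyKind
#define cudaStream_t      hipStream_t
#define cudaStreamCreate  hipStreamCreate
#define cudaStreamDestroy hipStreamDestroy
#else
#include <cuda_runtime.h>
#endif
```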