@seagater commented on Nov 21, 2025

Tune nThreadsPerBlock for message sizes in the 32KB to 256KB range for the FP8 and Half datatypes on MI300.

Before tuning:

(Counts are element counts: for a given message size, fp8 has twice as many elements as half, since half elements are 2 bytes each.)

| Msg Size | fp8 Counts | fp8e4m3 Out-of-place | fp8e4m3 In-place | fp8e4m3 Improvement | fp8e5m2 Out-of-place | fp8e5m2 In-place | fp8e5m2 Improvement | half Counts | half Out-of-place | half In-place |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1K | 1024 | 5.38 | 5.52 | -0.20% | 5.33 | 5.4 | 0.70% | 512 | 5.26 | 5.28 |
| 2K | 2048 | 5.44 | 5.51 | 2.50% | 5.42 | 5.51 | 2.90% | 1024 | 5.37 | 5.44 |
| 4K | 4096 | 5.54 | 5.6 | 6.40% | 5.54 | 5.62 | 6.40% | 2048 | 5.58 | 5.55 |
| 8K | 8192 | 5.95 | 6.08 | 7.60% | 5.95 | 6.07 | 7.60% | 4096 | 5.92 | 6 |
| 16K | 16384 | 6.5 | 6.56 | 25.90% | 6.48 | 6.57 | 26.10% | 8192 | 6.44 | 6.48 |
| 32K | 32768 | 9 | 9.1 | 1.70% | 8.96 | 9.03 | 2.20% | 16384 | 8.77 | 8.87 |
| 64K | 65536 | 9.35 | 9.45 | 4.50% | 9.32 | 9.43 | 4.80% | 32768 | 9.16 | 9.24 |
| 128K | 131072 | 11.72 | 11.89 | 3.10% | 11.73 | 11.89 | 3.00% | 65536 | 9.79 | 10.01 |
| 256K | 262144 | 12.37 | 12.51 | 10.10% | 12.34 | 12.51 | 10.30% | 131072 | 12.09 | 12.28 |
| 512K | 524288 | 13.96 | 14.04 | 27.20% | 13.99 | 14.07 | 27.00% | 262144 | 13.76 | 13.86 |
| 1M | 1048576 | 19.13 | 19.34 | 20.70% | 19.14 | 19.34 | 20.60% | 524288 | 19.17 | 19.33 |
| 2M | 2097152 | 24.55 | 24.55 | 32.90% | 24.49 | 24.55 | 33.10% | 1048576 | 24.11 | 24.13 |
| 4M | 4194304 | 37.25 | 37.3 | 38.70% | 37.25 | 37.23 | 38.70% | 2097152 | 36.58 | 36.45 |
| 8M | 8388608 | 61.36 | 61.75 | 43.00% | 61.32 | 61.69 | 43.10% | 4194304 | 60.72 | 61.02 |
| 16M | 16777216 | 109.3 | 109.5 | 44.70% | 109.2 | 109.6 | 44.70% | 8388608 | 107.7 | 108.2 |
| 32M | 33554432 | 200.7 | 201.6 | 47.70% | 200.8 | 201.6 | 47.70% | 16777216 | 197.6 | 198.3 |
| 64M | 67108864 | 388.9 | 389.5 | 48.30% | 389.1 | 389.3 | 48.30% | 33554432 | 384 | 384.8 |
| 128M | 134217728 | 763 | 761.9 | | 762.7 | 762.2 | | 67108864 | 752.6 | 752.9 |

After tuning:

| Msg Size | fp8 Counts | fp8e4m3 Out-of-place | fp8e4m3 In-place | fp8e4m3 Improvement | fp8e5m2 Out-of-place | fp8e5m2 In-place | fp8e5m2 Improvement | half Counts | half Out-of-place | half In-place |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1K | 1024 | 5.28 | 5.32 | 0.8% | 5.29 | 5.34 | 0.6% | 512 | 5.21 | 5.25 |
| 2K | 2048 | 5.36 | 5.49 | 3.1% | 5.37 | 5.5 | 2.9% | 1024 | 5.32 | 5.43 |
| 4K | 4096 | 5.51 | 5.6 | 6.5% | 5.53 | 5.59 | 6.1% | 2048 | 5.53 | 5.53 |
| 8K | 8192 | 5.9 | 6.03 | 8.2% | 5.92 | 6.03 | 7.9% | 4096 | 5.89 | 5.96 |
| 16K | 16384 | 6.45 | 6.54 | 18.7% | 6.48 | 6.53 | 18.3% | 8192 | 6.43 | 6.46 |
| 32K | 32768 | 8.14 | 8.21 | 5.3% | 8.14 | 8.21 | 5.3% | 16384 | 7.93 | 8.01 |
| 64K | 65536 | 8.83 | 8.91 | 7.2% | 8.83 | 8.95 | 7.2% | 32768 | 8.6 | 8.74 |
| 128K | 131072 | 9.23 | 9.41 | 21.7% | 9.25 | 9.44 | 21.5% | 65536 | 9.52 | 9.71 |
| 256K | 262144 | 10.32 | 10.62 | 24.8% | 10.33 | 10.6 | 24.8% | 131072 | 11.79 | 12.25 |
| 512K | 524288 | 13.93 | 14 | 27.1% | 13.97 | 14.05 | 26.9% | 262144 | 13.73 | 13.9 |
| 1M | 1048576 | 19.12 | 19.32 | 20.7% | 19.13 | 19.31 | 20.6% | 524288 | 19.12 | 19.29 |
| 2M | 2097152 | 24.51 | 24.53 | 32.8% | 24.51 | 24.6 | 32.8% | 1048576 | 24.1 | 24.03 |
| 4M | 4194304 | 37.07 | 37.23 | 38.8% | 37.21 | 37.23 | 38.6% | 2097152 | 36.49 | 36.51 |
| 8M | 8388608 | 61.36 | 61.63 | 43.0% | 61.48 | 61.74 | 42.9% | 4194304 | 60.58 | 60.99 |
| 16M | 16777216 | 109.2 | 109.7 | 44.8% | 109 | 109.5 | 44.9% | 8388608 | 107.7 | 108.3 |
| 32M | 33554432 | 200.4 | 201.4 | 47.8% | 200.9 | 201.8 | 47.7% | 16777216 | 197.7 | 198.1 |
| 64M | 67108864 | 388.4 | 388.6 | 48.4% | 388.9 | 388.9 | 48.3% | 33554432 | 384.2 | 384.7 |
| 128M | 134217728 | 761.5 | 761.6 | | 762.9 | 763 | | 67108864 | 752.6 | 753.3 |

@seagater requested review from Binyang2014, chhwang and Copilot and removed the request for chhwang and Copilot on November 21, 2025 18:07
Copilot finished reviewing on behalf of seagater November 21, 2025 18:09
Copilot AI left a comment


Pull request overview

This PR optimizes GPU kernel performance for AllReduce operations on AMD MI300 by tuning the nThreadsPerBlock parameter for FP8 (both e4m3 and e5m2 variants) and Half datatypes in the 32KB-256KB message size range. The tuning achieves significant performance improvements, particularly for FP8 datatypes at 128K-256K sizes (21-25% improvement) and 64K sizes (7% improvement).

Key changes:

  • Added AMD HIP platform-specific tuning for Half datatype with reduced thread counts (64 for 32KB, 128 for 64-256KB)
  • Added FP8-specific tuning with progressively larger thread counts (64 for 32KB, 128 for 64KB, 256 for 128-256KB)
  • Used nested preprocessor directives to ensure platform and type compatibility (a hedged sketch of this shape follows below)
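
The review's description suggests a selector of roughly the following shape. This is a minimal sketch assuming a host-side helper: the function name `nThreadsPerBlock`, the `DataType` enum, the exact boundary conditions, and the 1024-thread fallback are all illustrative assumptions, not taken from the PR's diff.

```cpp
#include <cstddef>

// Hypothetical stand-in for the datatypes involved; the real code presumably
// dispatches on its own type tags or C++ template parameters.
enum class DataType { kHalf, kFp8E4M3, kFp8E5M2, kOther };

// Hypothetical selector sketching the tuning described in the review:
// on the AMD HIP platform, FP8 uses 64/128/256 threads per block at
// 32KB/64KB/128-256KB, and Half uses 64 at 32KB and 128 at 64-256KB.
inline int nThreadsPerBlock(DataType dtype, std::size_t msgBytes) {
#if defined(__HIP_PLATFORM_AMD__)  // platform guard for the MI300 path
  constexpr std::size_t KiB = 1024;
  if (msgBytes >= 32 * KiB && msgBytes <= 256 * KiB) {
    bool isFp8 = (dtype == DataType::kFp8E4M3 || dtype == DataType::kFp8E5M2);
    if (isFp8) {
      if (msgBytes <= 32 * KiB) return 64;   // 32KB
      if (msgBytes <= 64 * KiB) return 128;  // 64KB
      return 256;                            // 128KB-256KB
    }
    if (dtype == DataType::kHalf) {
      return (msgBytes <= 32 * KiB) ? 64 : 128;  // 32KB vs 64-256KB
    }
  }
#endif
  (void)dtype;    // silence unused warnings when the guard is compiled out
  (void)msgBytes;
  return 1024;    // assumed untuned default outside the 32KB-256KB range
}
```

Smaller blocks at these mid-range sizes plausibly trade occupancy for lower per-block launch and synchronization overhead, which is consistent with the lower 32K-256K times in the after-tuning table above; that rationale is an inference, not something stated in the PR.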


@seagater requested a review from chhwang on November 21, 2025 19:13
