Releases · pytorch/helion
v0.2.3
What's Changed
- [CI] Fail the distributed CI job if any unit test fails by @yf225 in #1125
- Add DE-Surrogate hybrid autotuner algorithm + early stopping option for DE and DE-Surrogate by @FranciscoThiesen in #1096
- Update AGENTS.md by @jansel in #1128
- Add Settings.persistent_reserved_sms by @jansel in #1129
- Add Settings.autotune_force_persistent by @jansel in #1130 (usage of both new settings sketched after this list)
- [CI] Fix fbcode test_breakpoint error by @yf225 in #1132
- Auto-select index_dtype by @jansel in #1131
- Support tuple indexing by hl.static_range iterator by @yf225 in #1134
- Fix CI to surface errors correctly, fix all existing errors by @yf225 in #1138
- Workaround TRITON_INTERPRET bug breaking tests by @jansel in #1139
- Fix size 0 tensor handling by @jansel in #1140
- [Benchmark CI] Print generated Triton code for the best config by @yf225 in #1142
- Use pyrefly for type checking by @rchen152 in #1143
- fix pyrefly errors by @oulgen in #1144
- [CI] Skip TestBreakpoint in ref-eager CI job by @yf225 in #1141
- Bump pyrefly to 0.42.1 and remove 'sed' workaround. by @rchen152 in #1145
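Settings like the two above are normally supplied as keyword arguments to the `@helion.kernel` decorator, the same way `index_dtype` and `static_shapes` appear elsewhere in these notes. A minimal sketch, assuming `persistent_reserved_sms` takes an SM count and `autotune_force_persistent` a boolean (exact value types live in #1129/#1130 and may differ):

```python
import torch
import helion
import helion.language as hl

# Hedged sketch: settings passed as @helion.kernel kwargs. The value types
# for the two new settings are assumptions, not confirmed by these notes.
@helion.kernel(
    autotune_force_persistent=True,  # assumed: only consider persistent-kernel configs when autotuning
    persistent_reserved_sms=2,       # assumed: reserve 2 SMs for concurrent work
)
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] + y[tile]
    return out
```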
New Contributors
- @FranciscoThiesen made their first contribution in #1096
- @rchen152 made their first contribution in #1143
Full Changelog: v0.2.2...v0.2.3
v0.2.2
What's Changed
- [Benchmark] Update welford torch.compile function name by @yf225 in #1029
- chore: Bump actions/upload-artifact from 4 to 5 by @dependabot[bot] in #1030
- chore: Bump actions/download-artifact from 5 to 6 by @dependabot[bot] in #1031
- [Benchmark CI] Set welford num_inputs to 6 to avoid timeout by @yf225 in #1032
- Default config: reduce block_size and num_stages to avoid shared mem OOM by @yf225 in #1033
- Default config: reduce block_size further to avoid shared mem OOM by @yf225 in #1034
- Disable autotuner progress bar in fbcode unit test by @yf225 in #1035
- Always print cached config by @oulgen in #1036
- Fix dtype mismatch error in se_block example by @yf225 in #1040
- Upgrade clang version by @oulgen in #1043
- Fix missing static_shapes=False in deployment_autotuning.md by @jansel in #1042
- Fix matmul output dtype to match PyTorch eager behavior by @yf225 in #1044
- Fix layernorm bwd unit test by @yf225 in #1047
- Fix FlattenedTileStrategy to handle unit-sized block dimensions by @yf225 in #1048
- [CI] Fix debug_str() to be compatible with latest PyTorch nightly by @yf225 in #1050
- [Fix upcoming CI error] Set current node in inductor lowering by @yf225 in #1052
- Remove Section Navigation pane from Deployment and Autotuning page. by @choijon5 in #1051
- Add `settings.autotune_baseline_fn` to allow passing in custom baseline function to autotuner by @yf225 in #1054 (a usage sketch follows this list)
- Add `HELION_PRINT_REPRO=1` to print Helion kernel repro script to console by @yf225 in #1049
- Fix caching for CPUs by @oulgen in #1055
- Add get_num_sm for cpu by @oulgen in #1056
- Support torch.rand / torch.rand_like with dynamic tile sizes by @yf225 in #1057
- Remove line numbers from expected files by @oulgen in #1061
- Ignore passed in config when force autotune is turned on by @oulgen in #1060
- Update Watch Talk link to Triton conf talk. by @choijon5 in #1058
- Helion Puzzle docs bug fixes by @Athe-kunal in #1062
- Update test_persistent_kernels.expected by @jansel in #1070
- Make HELION_PRINT_REPRO=1 take effect in more error cases by @yf225 in #1066
- add geglu backward by @parsshar-RH in #1069
- [Unblock internal] Fix log capture issue on internal tests by @yf225 in #1076
- Add best effort triton-cpu support by @oulgen in #1037
- Update test_debug_utils.py by @oulgen in #1077
- Raise user error if device-loop is empty after DCE by @yf225 in #1074
- Add GRPO loss example by @ighoshsubho in #1063
- Use HELION_PRINT_REPRO=1 to print repro when device IR lowering or Triton codegen error by @yf225 in #1078
- add AMD demo link by @vivienfanghuagood in #1068
- Update test.yml by @oulgen in #1083
- Fix GRPO loss example unit tests by @yf225 in #1079
- Remove requirements.txt by @oulgen in #1088
- Relax requirements for inline_triton output_like=None by @jansel in #1087
- feat(autotuner): Make autotune cache class configurable via env var by @fulvius31 in #1071
- Add support for while and pass by @jansel in #1090
- Update sphinxtheme to pull from pypi package by @sekyondaMeta in #1091
- [Autotuner] Better error message for default config error by @yf225 in #1092
- Ignore illegal instruction errors by @jansel in #1093
- Update talk links to PTC version by @jansel in #1094
- Add autotuning log by @jansel in #1095
- Fix builtin min / max handling in device loop by @yf225 in #1085
- Add skipIfRocm to failing test on main by @jansel in #1101
- Fix lint in newer triton by @jansel in #1098
- Add AGENTS.md by @jansel in #1100
- Refactor _decorators.codegen to allow multiple backends by @jansel in #1099
- Add extra line before repro log; update repro log tests by @yf225 in #1102
- Refactor inductor_lowering.py into two files by @jansel in #1103
- Use CPU machine for triton-cpu by @oulgen in #1105
- Fix no libdw.so issue on AMD CI by @yf225 in #1107
- Fixes in helion puzzles by @Athe-kunal in #1104
- Add distributed CI job (4xH100) and example unit tests by @yf225 in #1106
- Generalize aten_lowering.py for multiple backends by @jansel in #1108
- Support tensor.T for transpose by @yf225 in #1110
- Add warning to discourage use of `acc += lhs @ rhs` pattern by @yf225 in #1111 (see the sketch after this list)
- Remove `@helion.jit` usage and advise use of `@helion.kernel` by @yf225 in #1116
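On the `acc += lhs @ rhs` warning above: the preferred pattern threads the accumulator through the matmul op instead of re-adding it, as in Helion's matmul example. A minimal sketch (the kernel body mirrors the public example; the warning text itself is in #1111):

```python
import torch
import helion
import helion.language as hl

@helion.kernel()
def matmul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    m, k = x.size()
    k2, n = y.size()
    assert k == k2
    out = torch.empty([m, n], dtype=x.dtype, device=x.device)
    for tile_m, tile_n in hl.tile([m, n]):
        acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
        for tile_k in hl.tile(k):
            # Discouraged (per #1111): acc += x[tile_m, tile_k] @ y[tile_k, tile_n]
            # Preferred: pass the accumulator into the op explicitly.
            acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n])
        out[tile_m, tile_n] = acc.to(x.dtype)
    return out
```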
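And for the repro/baseline entries (#1049, #1054): `HELION_PRINT_REPRO` is an environment variable, while `autotune_baseline_fn` is a setting. The sketch below assumes the setting is passed as a `@helion.kernel` kwarg and takes a callable with the kernel's signature; that is inferred from the entry wording, not a documented contract.

```python
import os
os.environ["HELION_PRINT_REPRO"] = "1"  # same effect as HELION_PRINT_REPRO=1 in the shell

import torch
import helion
import helion.language as hl

def eager_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Hypothetical baseline: the eager implementation the autotuner is
    # assumed to check candidate configs against.
    return x + y

@helion.kernel(autotune_baseline_fn=eager_add)  # assumed kwarg spelling, per #1054
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] + y[tile]
    return out
```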
New Contributors
- @Athe-kunal made their first contribution in #1062
- @parsshar-RH made their first contribution in #1069
- @ighoshsubho made their first contribution in #1063
- @vivienfanghuagood made their first contribution in #1068
- @fulvius31 made their first contribution in #1071
Full Changelog: v0.2.1...v0.2.2
v0.2.1
What's Changed
- No autotuning on block_ptr if tma is available by @PaulZhang12 in #997
- Add reps for benchmarking stability by @PaulZhang12 in #999
- Prioritize outermost loop for warp spec by @PaulZhang12 in #1000
- Add backward pass for softmax kernel by @karthickai in #744
- Fix linter in softmax by @oulgen in #1003
- Fix test_examples.expected by @oulgen in #1002
- Beef up caching tests by @oulgen in #1001
- Add HELION_ASSERT_CACHE_HIT to debug/explain cache miss by @oulgen in #1006
- Better error message for calling Helion kernel from another kernel by @yf225 in #1008
- Assert that we are cache hitting on the CI by @oulgen in #1007
- Always raise `FailedToUnpackTile` when `for tile_m, tile_d in hl.tile(m, d)` is used by @yf225 in #1009 (see the sketch after this list)
- Adding demo for running softmax kernel on Google Colab by @choijon5 in #944
- int4 gemm accurate baselines by @PaulZhang12 in #1010
- Add sitemap xml by @sekyondaMeta in #1013
- [helion] backward support for swiglu by @shunting314 in #756
- Raise informative error when `hl.dot` with 3D inputs has batch dim mismatch by @yf225 in #1012
- [CI] Fix AMD journal check errors by @yf225 in #1016
- Support `breakpoint()` in device code when interpret mode is on by @yf225 in #1020
- Sort requirements file by @oulgen in #1021
- Better type checking for eviction policies by @oulgen in #1024
- Bump linter versions by @jansel in #1018
- Garbage collect expected results by @jansel in #1017
- Make indexing choice a list by @oulgen in #1025
- [Docs] Add list of indexing autotuning docs by @oulgen in #1027
- Make store indexing also individually tunable by @oulgen in #1028
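On the `FailedToUnpackTile` entry above: `hl.tile(m, d)` is the (begin, end) overload and yields a single tile, so unpacking it into two loop variables fails; iterating over two dimensions takes a list of sizes. A minimal sketch, assuming the standard `hl.tile` signature:

```python
import torch
import helion
import helion.language as hl

@helion.kernel()
def double2d(x: torch.Tensor) -> torch.Tensor:
    m, d = x.size()
    out = torch.empty_like(x)
    # Correct: a list of sizes yields one tile per dimension.
    for tile_m, tile_d in hl.tile([m, d]):
        out[tile_m, tile_d] = x[tile_m, tile_d] * 2
    # Incorrect, raises FailedToUnpackTile (per #1009):
    #   for tile_m, tile_d in hl.tile(m, d): ...
    # because hl.tile(m, d) is the (begin, end) form and yields one tile.
    return out
```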
New Contributors
- @shunting314 made their first contribution in #756
Full Changelog: v0.2.0...v0.2.1
v0.2.0
What's Changed
- Verify compiled kernels in subprocess by @jansel in #914
- Auto-shrink autotune_precompile_jobs based on free memory by @jansel in #940
- Make HELION_FORCE_AUTOTUNE or kernel.autotune() skip the cache by @jansel in #930
- Support warp specialization on B200 by @oulgen in #935
- Update README.md by @oulgen in #943
- Register tile symbol origin, to support `tile + offset` use case in blackwell attention by @yf225 in #939
- [CI] Print failed tests by @oulgen in #942
- Update examples to use run_example by @jansel in #941
- blackwell attn with triton attr set by @v0i0 in #918
- Set static_shapes=True by @oulgen in #937 (setting sketched after this list)
- run.py env var to skip exception logging by @v0i0 in #946
- Fix bug with unit sized dims and block_sizes by @jansel in #932
- Update static_shapes docs by @jansel in #951
- Add tile.count by @oulgen in #955
- Auto detect low vram by @oulgen in #956
- [CI] Use official PyTorch 2.9 by @oulgen in #962
- Use interleaved_bench for run_example by @jansel in #945
- Generalize tile_with_offset pass by @jansel in #949
- Docstring updates by @jansel in #952
- Import updates by @jansel in #953
- Add missing environment variables to docs by @jansel in #957
- Print out errors vs timeouts in autotuning status by @jansel in #960
- Add HELION_AUTOTUNE_IGNORE_ERRORS by @jansel in #961
- Exit autotuning faster on KeyboardInterrupt by @jansel in #963
- Remove default settings by @jansel in #964
- Add missing settings environment variables by @jansel in #965
- Skip test_differential_evolution_search due to slowness by @jansel in #968
- [Benchmark CI] Give nightly job permissions by @oulgen in #970
- [Benchmark CI] Allow kicking off workflow dispatch by @oulgen in #971
- [Benchmark CI] Allow specifying custom env vars via UI by @yf225 in #972
- [blackwell attn example] qk scale as param by @v0i0 in #969
- [Benchmark CI] Allow specifying custom args to benchmark runner via UI by @yf225 in #974
- Add initial backwards compatibility tests by @oulgen in #958
- Remove unrolling + warp spec by @PaulZhang12 in #967
- [Benchmark CI] Set atol and rtol to 1e-2 by @yf225 in #976
- [Benchmark] Fix tritonbench auto-installation by @yf225 in #980
- [Autotuner] Fix fork-based autotuner to avoid re-initializing CUDA context in subprocess by @yf225 in #981
- Make fork default precompilation strategy by @oulgen in #979
- [benchmarks] change tritonbench path by @xuzhao9 in #966
- Add skipIfA10G decorator by @yf225 in #982
- Suggest HELION_AUTOTUNE_PRECOMPILE=spawn when IMA happens by @jansel in #984
- Layer Norm bwd kernel to support large B*M case used by internal by @yf225 in #973
- Fix timeouts in autotuning by @jansel in #985
- Log generated triton code at the DEBUG level rather than INFO by @jansel in #986
- Remove extra debug log for timeouts by @jansel in #987
- Add squeeze_and_excitation_net kernel by @mengluy0125 in #870
- Generalize test cases to support XPU by @EikanWang in #983
- Updated README with News section of upcoming events. Added link to GPU mode talk. by @choijon5 in #991
- Update README.md by @oulgen in #992
- Update README.md by @oulgen in #993
- Mamba2 Chunk Scan & State by @v0i0 in #950
- Remove unrolling with tma + pipelining by @PaulZhang12 in #994
- Add provenance annotations to output code by @jansel in #988
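On `static_shapes=True` above: the setting is a `@helion.kernel` keyword argument (the same spelling the deployment/autotuning docs use); with it the kernel specializes on exact input shapes instead of generating shape-dynamic code. A minimal sketch:

```python
import torch
import helion
import helion.language as hl

# static_shapes=True: compile one specialization per exact input shape
# rather than shape-dynamic code; new shapes trigger recompilation.
@helion.kernel(static_shapes=True)
def double(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] * 2
    return out
```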
Full Changelog: v0.1.8...v0.2.0
v0.1.8
What's Changed
- fix rmsnorm fwd tritonbench by @v0i0 in #840
- Update input shapes for example kernels by @yf225 in #845
- Extend eviction policy tests to all indexing types by @oulgen in #833
- [Docs] Remove early development warning by @oulgen in #846
- [Docs] Add link to gpumode discord by @oulgen in #847
- [Docs] Add PTC promotional material by @oulgen in #848
- [Benchmark] Add low mem dropout example by @karthickai in #641
- Update lint.yml by @oulgen in #854
- Remove `hl.register_reduction_dim` API by @yf225 in #834
- Error message for boolean masking or torch.nonzero by @yf225 in #687
- Remove hardcoded `block_size=1` usage in attention kernel example by @yf225 in #843
- Revert "Update to use the new attribute setting for tf32." by @choijon5 in #856
- Decrease `num_stages` default from 3 to 2, to avoid shared memory OOM by @yf225 in #841
- Allow user-defined specialization key by @jansel in #853
- [Benchmark CI] Use fewer num_inputs for flash_attention to avoid timeout by @yf225 in #857
- Remove legacy `register_inductor_lowering` code by @yf225 in #864
- Set setstate/getstate methods to Config by @jansel in #868
- [doc] Add deployment/autotuning guide by @jansel in #869
- [Benchmark CI] Use equally-spaced-k mode to sample input shapes by @yf225 in #861
- Fix sphinx warnings by @jansel in #871
- Normalize tl.sqrt and libdevice.sqrt for tests by @oulgen in #866
- [CI] Pin py3.10 and one py3.12 on pytorch2.9 by @oulgen in #858
- [Docs] Suggest PyTorch 2.9 or above by @oulgen in #859
- [Benchmark] Pin benchmarks to PyTorch 2.9 by @oulgen in #860
- Print Triton code when error for easier debugging by @yf225 in #874
- Terminate autotuning faster if progress is minimal by @oulgen in #855
- Update README.md by @oulgen in #877
- [CI] pin b200 to pytorch2.9 by @oulgen in #878
- [Autotuner] Run CUDA synchronize before / after candidate func call, to surface CUDA errors sooner by @yf225 in #872
- [Benchmark] bf16 x int16 helion kernel by @karthickai in #794
- Install git for benchmarks by @oulgen in #882
- Pin AMD to 6.4.4 by @oulgen in #883
- Faster int4 gemm by @PaulZhang12 in #751
- Pin AMD to 6.4.4 by @oulgen in #881
- Remove PyTorch requirement from deps so that it is easier to install arbitrary version of pytorch by @oulgen in #879
- [Benchmark CI] Use regular matmul instead of split-k by @yf225 in #884
- [Benchmark] Use bespoke setup-python action by @oulgen in #885
- [Benchmark] Drop memory bound kernels and replace them with gemms by @oulgen in #887
- Add dependabot by @oulgen in #888
- Update dependabot.yml by @oulgen in #891
- chore: Bump actions/setup-python from 5 to 6 by @dependabot[bot] in #893
- chore: Bump actions/download-artifact from 4 to 5 by @dependabot[bot] in #895
- chore: Bump actions/upload-pages-artifact from 3 to 4 by @dependabot[bot] in #894
- chore: Bump actions/checkout from 4 to 5 by @dependabot[bot] in #892
- Upgrade ruff==0.14.0 by @jansel in #889
- [Benchmark CI] grouped_gemm: include input preproc in timing measurement; update gemm backend name mapping by @yf225 in #898
- chore: Bump astral-sh/setup-uv from 6 to 7 by @dependabot[bot] in #896
- [Benchmark] use logger.exception for process errors by @oulgen in #902
- [Benchmark CI] Reduce num_inputs for grouped_gemm and gemm benchmarks by @yf225 in #903
- Query minimum dot size for XPU by @EikanWang in #900
- Add matmul/addmm bwd examples and add test coverage by @tianrengao in #748
- [CI] Pin amd to rocm7.0 by @oulgen in #907
- [Benchmark] Move benchmark kernel sharding to dispatch by @oulgen in #905
- [Benchmark] Provide a way to pass custom list of kernels by @oulgen in #906
- [Benchmark CI] Use triton_tutorial_matmul for triton matmul baseline by @yf225 in #911
- Remove cache around set_triton_allocator by @oulgen in #912
- Add int4_gemm by @oulgen in #917
- chore: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #916
- Catch missing cudnn error by @jansel in #873
- Add progress bar for precompiling by @jansel in #919
- Adding new setting, autotune_effort=[none/quick/full] by @choijon5 in #913
- Print error message for torch.chunk / torch.unbind to redirect users to hl.split by @yf225 in #921
- Avoid setting default `--input-sample-mode` to `equally-spaced-k` by @yf225 in #922
- Remove `triton_helpers.*` usage in lifted device function arguments by @yf225 in #849
- Set HELION_DEV_LOW_VRAM=1 on a10g CI machines by @yf225 in #923
- Suggest use of `@helion.kernel(index_dtype=torch.int64)` if index offset is out of bounds for int32 by @yf225 in #850 (see the sketch after this list)
- Deprecate use_default_config and replace all its uses with autotune_effort by @choijon5 in #924
- Support `hl.arange()` with non-power-of-2 input by @yf225 in #862
- Setting up RunLLM AI Chatbot by @sekyondaMeta in #925
- Generalize examples with the DEVICE variable by @adam-smnk in #915
- Fix lint error by @jansel in #926
- Add lint to make sure examples and tests use device=DEVICE by @oulgen in #929
- Support tile+offset and tensor descriptors by @jansel in #928
- Fix triton/torch.compile compatibility issue by @jansel in #927
- Fix CUDA IMA from combination of unrolling + pipelining by @PaulZhang12 in #920
- Update the Agent ID by @sekyondaMeta in #931
- [Benchmark CI] Use `--non-square` flag for gemm by @yf225 in #938
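A combined sketch for two settings from this release: autotune_effort (one of none/quick/full, per #913, replacing use_default_config per #924) and index_dtype=torch.int64 for kernels whose index offsets overflow int32 (per #850). Treat it as illustrative; the kwarg spellings follow the entries above.

```python
import torch
import helion
import helion.language as hl

@helion.kernel(
    autotune_effort="quick",   # none / quick / full, per #913
    index_dtype=torch.int64,   # 64-bit indexing for large tensors, per #850
)
def copy(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile]
    return out
```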
New Contributors
- @dependabot[bot] made their first contribution in #893
- @tianrengao made their first contribution in #748
Full Changelog: v0.1.7...v0.1.8
v0.1.7
What's Changed
- Generalize the cuda-bias test cases by replacing hardcoded "cuda" literal with the DEVICE variable by @EikanWang in #775
- Make progress bar prettier by @oulgen in #786
- Upgrade ruff==0.13.3 pyright==1.1.406 by @jansel in #790
- Add hl.split and hl.join by @jansel in #791
- Generalize test_print and test_tensor_descriptor to support different accelerators by @EikanWang in #801
- Limit rebench to 1000 iterations by @jansel in #789
- Turn down autotuner defaults by @jansel in #788
- Enable torch.xpu._XpuDeviceProperties in Helion kernel by @EikanWang in #798
- Better error message for augmented assignment (e.g. +=) on host tensor without subscript by @yf225 in #807
- Add Pattern Search autotuning algorithm to docs. by @choijon5 in #810
- Support 0dim tensor in output code printing by @oulgen in #806
- Set range_num_stages <= 1 if using tensor_descriptor, to avoid CUDA misaligned address error by @yf225 in #792
- Add hl.inline_triton API by @jansel in #811
- Add out_dtype arg to hl.dot by @jansel in #813 (see the sketch after this list)
- Add autotune_config_overrides by @jansel in #814
- Reduce initial_population to 100 by @jansel in #800
- Disable range_num_stages for kernels with aliasing by @jansel in #812
- Adding new setting, autotune_max_generations, that allows user to set the maximum number of generations for autotuning by @choijon5 in #796
- Increase tolerance for test_matmul_reshape_m_2 by @jansel in #816
- Update docs by @jansel in #815
- Fix torch version check by @adam-smnk in #818
- [Benchmark] Keep going when a single benchmark fails by @oulgen in #820
- Faster Helion JSD by @PaulZhang12 in #733
- Faster KL Div by @PaulZhang12 in #822
- Normalize device name and decorate cuda-only test cases by @EikanWang in #819
- Improved log messages for autotuning by @choijon5 in #817
- Apply simplification to range indexing in order to reuse block size symbols by @yf225 in #809
- Fix hl.rand to use tile specific offsets instead of fixed offsets, ensure unique random num per tile by @karthickai in #685
- Match cuda versions for benchmark by @oulgen in #828
- Print nvidia-smi/rocminfo by @oulgen in #827
- Dump nvidia-smi/rocminfo on benchmarks by @oulgen in #829
- Add 3.14 support by @oulgen in #830
- Remove py312 vanilla test by @oulgen in #831
- Pad to next power of 2 for hl.specialize'ed shape value used in device tensor creation by @yf225 in #804
- Autotune eviction policy by @oulgen in #823
- [Docs] Consistent pre-commit/lint by @oulgen in #836
- [Docs] Recommend venv instead of conda by @oulgen in #837
- [Docs] Helion works on 3.10 through 3.14 by @oulgen in #838
- [Docs] Add eviction policy by @oulgen in #839
- Update to use the new attribute setting for tf32. by @choijon5 in #835
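On the out_dtype argument to hl.dot above: a hedged sketch of a matmul whose per-tile dot products are produced in float32. The `out_dtype` spelling comes from #813; the rest of the `hl.dot` call here is an assumption.

```python
import torch
import helion
import helion.language as hl

@helion.kernel()
def matmul_fp32_acc(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    m, k = x.size()
    _, n = y.size()
    out = torch.empty([m, n], dtype=torch.float32, device=x.device)
    for tile_m, tile_n in hl.tile([m, n]):
        acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
        for tile_k in hl.tile(k):
            # assumed: out_dtype controls the dot result dtype, per #813
            acc = acc + hl.dot(x[tile_m, tile_k], y[tile_k, tile_n], out_dtype=torch.float32)
        out[tile_m, tile_n] = acc
    return out
```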
Full Changelog: v0.1.6...v0.1.7
v0.1.6
What's Changed
- ci: Always auth for benchmarking workflows by @seemethere in #719
- [Benchmark] jagged_sum kernel and test by @Sibylau in #676
- Skip default config printing if in ref eager mode by @yf225 in #721
- [Benchmark CI] Make benchmark runner respect custom CLI args by @yf225 in #723
- Upgrade rocm CI to 7.0 by @oulgen in #720
- Add eviction policy argument to tl.load by @oulgen in #714
- [CI] use complete rocm docker images by @oulgen in #724
- More inconsistent naming by @oulgen in #725
- [Benchmark] jagged_layer_norm kernel and test by @Sibylau in #704
- [Bug fix] Preserve masks on reduction inputs that depend on reduction outputs; fix layer_norm accuracy check failure by @yf225 in #722
- Support torch.matmul with 3D inputs by @yf225 in #715 (sketched after this list)
- Slightly improve logs by @angelayi in #740
- Autotuning Progress Bar by @msaroufim in #739
- make tritonbench optional in run.py so install works again by @v0i0 in #746
- fix new factory when size comes from kwargs by @v0i0 in #750
- Add linting instructions to README by @msaroufim in #763
- Add backward kernel for exp by @aditvenk in #736
- fix roll reduction meta for ops with none output (like wait), cl… by @v0i0 in #767
- Move upload benchmark results to a separate workflows by @huydhn in #758
- Add flash_attention to benchmarks by @oulgen in #769
- Fix jagged_layer_norm linter error by @yf225 in #770
- Add SIGINT handler for clean interrupt of autotuning background processes by @msaroufim in #766
- Enable tensor descriptor for XPU by @EikanWang in #765
- Fix the issue that the XPU kernels cannot be cached well by @EikanWang in #761
- Print Helion kernel source line in symbolic shape debugging by @yf225 in #771
- ci: Set fail-fast to false by @seemethere in #776
- Add XPU support for RNG operations by @EikanWang in #774
- Enable test_dot for XPU by @EikanWang in #773
- Handle XPU compilation error by @adam-smnk in #779
- Fix type prop for and/or by @oulgen in #781
- Make print output code more robust by @oulgen in #780
- Revert "Add SIGINT handler for clean interrupt of autotuning background processes" by @oulgen in #784
- Add torch compile unit test to helion by @oulgen in #782
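On 3D torch.matmul support above: a hedged batched-matmul sketch in the same style as the 2D pattern; tiling the batch dimension alongside m and n is an illustration, not the exact test from #715.

```python
import torch
import helion
import helion.language as hl

@helion.kernel()
def bmm(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    b, m, k = x.size()
    _, _, n = y.size()
    out = torch.empty([b, m, n], dtype=x.dtype, device=x.device)
    for tile_b, tile_m, tile_n in hl.tile([b, m, n]):
        acc = hl.zeros([tile_b, tile_m, tile_n], dtype=torch.float32)
        for tile_k in hl.tile(k):
            # 3D matmul on tiles keeps the batch dim, per #715
            acc = acc + torch.matmul(x[tile_b, tile_m, tile_k], y[tile_b, tile_k, tile_n])
        out[tile_b, tile_m, tile_n] = acc.to(x.dtype)
    return out
```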
New Contributors
- @seemethere made their first contribution in #719
- @angelayi made their first contribution in #740
- @msaroufim made their first contribution in #739
- @aditvenk made their first contribution in #736
- @EikanWang made their first contribution in #765
- @adam-smnk made their first contribution in #779
Full Changelog: v0.1.5...v0.1.6
v0.1.5
v0.1.4
What's Changed
- Update benchmark.yml by @oulgen in #570
- Update benchmark.yml by @oulgen in #571
- [Benchmark] Use custom kernel metric mappings list to accommodate inconsistent naming by @oulgen in #567
- Add rms norm and cross entropy by @oulgen in #568
- Update benchmark_dispatch.yml by @oulgen in #573
- Update linters by @oulgen in #569
- Print config for PassManager::run triton errors by @jansel in #565
- Error when invalid loop reduction number config is generated by @oulgen in #572
- Add `skipIfLowVRAM` or `use_default_config=False` to specific unit tests to enable local testing by @yf225 in #574
- Fix bug with block_size smaller than minimum by @jansel in #575
- Better shape errors for mismatched tile sizes by @jansel in #566
- Print warning if block_size is specified in interpret mode. by @choijon5 in #576
- Run all shapes for benchmarks by @oulgen in #578
- [Benchmarks] Cooldown the GPU before recording results by @oulgen in #579
- [Benchmark] Fix layer_norm accuracy issue by @yf225 in #580
- [Benchmark] Remove hardcoded num_inputs for rms_norm kernel by @yf225 in #581
- Do not benchmark twice by @oulgen in #583
- Add missing functions to docs by @jansel in #586
- hl.atomic_add: support 1D tensor as index by @yf225 in #587
- Add atomic and/or/min/max/cas/xchg by @jansel in #589 (see the sketch after this list)
- Add test shard with HELION_DEBUG_DTYPE_ASSERTS=1, only run one ref-eager shard by @jansel in #590
- Add link to github to docs by @jansel in #591
- Support layernorm without bias by @mengluy0125 in #585
- Allow passing tritonbench operator instance into kernel benchmark wrapper; Always return lambda for timing measurement by @yf225 in #596
- Add layer_norm backward kernels by @yf225 in #588
- Fix tf32 warning by @jansel in #592
- [Benchmark] geglu example and test by @Sibylau in #582
- Print default config when running with it by @oulgen in #599
- [Benchmark] swiglu example and test by @Sibylau in #584
- Login to Docker from the workflows by @huydhn in #601
- Add rms_norm backward kernels by @mengluy0125 in #597
- Revert "Login to Docker from the workflows" by @oulgen in #604
- Fix static shape typo by @oulgen in #609
- Add small dim size (<16) support to hl.dot and torch.addmm; always prefer using `tl.dot(acc=...)` for addmm / baddbmm by @yf225 in #564
- Fix rms_norm and layer_norm by @mengluy0125 in #603
- [Benchmark] jsd kernel and test by @Sibylau in #611
- Refactor autotune error handling by @jansel in #595
- Possible fix for CI failures by @jansel in #617
- [Benchmark] Welford kernel and example by @karthickai in #614
- [Benchmark] kl_div kernel and test by @Sibylau in #615
- Ignore TServiceRouterException errors while autotuning by @jansel in #618
- [Example] int4_gemm kernel example and tritonbench integration by @yf225 in #613
- Set requires_grad=True for rms_norm backward inputs by @yf225 in #629
- Adjust tolerance for test_rms_norm_bwd_dx by @yf225 in #628
- Add more kernels to benchmarking by @oulgen in #632
- Reorder benchmarks by @oulgen in #633
- [Ref Mode] Fix hl.store for complex mask pattern by @yf225 in #621
- Support using block size var outside of hl.tile loop by @yf225 in #619
- [Benchmark CI] Print input shapes and surface problematic Helion config by @yf225 in #626
- Fix ValueError: numel (2097152) exceeds triton maximum tensor numel (1048576) by @mengluy0125 in #625
- Always clear inductor cache before benchmark by @yf225 in #608
- Make hl.specialize work on sequences by @jansel in #636
- Better error for passing Tile to hl.tile by @jansel in #640
- [Example] grouped_gemm kernel example and tritonbench integration by @yf225 in #620
- int4_gemm: remove use_default_config=True by @yf225 in #639
- [Easy][Benchmark CI] Exit job on any exception, for easier error catching by @yf225 in #643
- Avoid skipping CUDA errors that crash the CUDA context by @yf225 in #645
- Add `HELION_AUTOTUNE_RANDOM_SEED` env var and `autotune_random_seed` setting by @yf225 in #644
- Bump linter by @oulgen in #647
- Skip test_autotune_random_seed_from_env_var on rocm by @oulgen in #648
- Fix lint related to welford and also local_cache by @yf225 in #646
- Skip test_autotune_random_seed_from_settings on rocm by @yf225 in #651
- PT Sphinx Theme Test by @sekyondaMeta in #600
- Print `static_shapes` settings value along with config for accurate repro by @yf225 in #649
- [Benchmark] gather_gemv kernel and test by @Sibylau in #635
- Add HELION_SKIP_CACHE env by @jansel in #653
- [lint] Remove UP038 reference by @jansel in #650
- Fix `register_block_size` codegen by @yf225 in #659
- Raise better error when `hl.atomic_*` is used on device tensor by @yf225 in #658
- [Autotune] Filter bad config with accuracy check by @yf225 in #655
- Add hl.rand op with seed arg lowering to tl.rand by @karthickai in #652
- Log autotune random seed for easier repro by @yf225 in #661
- Fix misaligned address error for matmul by @yf225 in #662
- skip gather_gemv code check for B200 and fb_code by @Sibylau in #666
- rms_norm: get weight from function args by @yf225 in #664
- skip full autotune if configs are provided by @xuanzhang816 in #670
- [example] fused_linear_jsd by @v0i0 in #494
- Fix CI by moving B200 to cuda13 and downgrade a100/h100 to cuda12.8 by @oulgen in #674
- No background image by @sekyondaMeta in #663
- Remove github link from index.md by @oulgen in #675
- [Autotune] Allow skipping Triton compilation error by @yf225 in #679
- [Benchmark CI] Run one kernel per gpu to maximize successful kernel reporting by @yf225 in #681
- Fix missing block size constexpr assignment in host code by @yf225 in #678
- [CI] Fix missing setuptools by @yf225 in #680
- faster rms norm backwards kernel by @v0i0 in #624
- [Benchmark CI] Use do_bench cudagraph to avoid profiler failure; select specific kernel impls to run by @yf225 in #682
- [Benchmark CI] use --op instead of --kernel for better tritonbench compat by @yf225 in #694
- Increase tolerance for _validate_against_baseline by @jansel in #691
- [Benchmark CI] Use equally-spaced K input shapes by @yf225 in #689
- Print bad default config if compute baseline fails by @yf225 in #688
- Support HELION_AUTOTUNE_ACCURACY_CHECK=0 by @jansel in #692
- rms norm: improve fwd perf by @v0i0 in #669
- Revert "Add hl.rand op with seed arg lowering to tl.rand (#652)" by @jansel in #698
- [Autotune] Skip Triton shared memory OOM by @yf225 in https://git...
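On the atomic entries above (hl.atomic_add with a 1D tensor index per #587, plus and/or/min/max/cas/xchg per #589): a histogram-style sketch. The `hl.atomic_add(target, index_list, value)` calling convention is inferred from the entries and may differ in detail.

```python
import torch
import helion
import helion.language as hl

@helion.kernel()
def histogram(indices: torch.Tensor, num_bins: int) -> torch.Tensor:
    out = torch.zeros([num_bins], dtype=torch.float32, device=indices.device)
    for tile in hl.tile(indices.size(0)):
        # assumed convention: a tile of indices scatters +1 into out
        hl.atomic_add(out, [indices[tile]], 1.0)
    return out
```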
v0.1.3
What's Changed
- Add torch compile to benchmark by @oulgen in #545
- Fix issues with wrong dtypes in generated code by @jansel in #542
- Limit concurrent precompile jobs while autotuning by @jansel in #543
- Create basic helion benchmark runner by @oulgen in #544
- Add multi selection radio buttons by @oulgen in #547
- Fix benchmark condition by @oulgen in #548
- Move to dispatcher model for benchmarking by @oulgen in #549
- Give permissions by @oulgen in #550
- Do not downgrade torch/triton by @oulgen in #551
- Use uv for pip freeze by @oulgen in #552
- Add jagged hstu attention example (i.e. ragged_attention) by @xuanzhang816 in #554
- Install quack/torchbench with no deps by @oulgen in #553
- Update test-reports dir by @oulgen in #556
- torch.rand_like and torch.randn_like support by @yf225 in #530 (see the sketch after this list)
- [Benchmark] add addmm example and test by @Sibylau in #555
- Kick off benchmarks at midnight by @oulgen in #559
- Use profiler instead of inductor_benchmarker by @oulgen in #560
- Shard kernels by @oulgen in #561
- Add layer_norm and softmax by @oulgen in #562
- [Fix CI] Convert tiles to sizes for all torch.* functions by @yf225 in #563
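On torch.rand_like / torch.randn_like support above: a hedged dropout-style sketch drawing the random tile inside the device loop; per-tile seeding behavior was refined later by the hl.rand work (#652, #685), so treat the randomness semantics here as approximate.

```python
import torch
import helion
import helion.language as hl

@helion.kernel()
def dropout(x: torch.Tensor, p: float) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(x.size()):
        vals = x[tile]
        noise = torch.rand_like(vals)  # random values shaped like the tile, per #530
        out[tile] = torch.where(noise > p, vals / (1.0 - p), torch.zeros_like(vals))
    return out
```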
Full Changelog: v0.1.2...v0.1.3