Releases · pytorch/helion
v0.2.3
What's Changed
- [CI] Fail the distributed CI job if any unit test fails by @yf225 in #1125
- Add DE-Surrogate hybrid autotuner algorithm + early stopping option for DE and DE-Surrogate by @FranciscoThiesen in #1096
- Update AGENTS.md by @jansel in #1128
- Add Settings.persistent_reserved_sms by @jansel in #1129
- Add Settings.autotune_force_persistent by @jansel in #1130 (usage of both new settings sketched after this list)
- [CI] Fix fbcode test_breakpoint error by @yf225 in #1132
- Auto-select index_dtype by @jansel in #1131
- Support tuple indexing by hl.static_range iterator by @yf225 in #1134
- Fix CI to surface errors correctly, fix all existing errors by @yf225 in #1138
- Workaround TRITON_INTERPRET bug breaking tests by @jansel in #1139
- Fix size 0 tensor handling by @jansel in #1140
- [Benchmark CI] Print generated Triton code for the best config by @yf225 in #1142
- Use pyrefly for type checking by @rchen152 in #1143
- fix pyrefly errors by @oulgen in #1144
- [CI] Skip TestBreakpoint in ref-eager CI job by @yf225 in #1141
- Bump pyrefly to 0.42.1 and remove 'sed' workaround. by @rchen152 in #1145
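Settings like the two above are normally supplied as keyword arguments to the `@helion.kernel` decorator, the same way `index_dtype` and `static_shapes` appear elsewhere in these notes. A minimal sketch, assuming `persistent_reserved_sms` takes an SM count and `autotune_force_persistent` a boolean (exact value types live in #1129/#1130 and may differ):

```python
import torch
import helion
import helion.language as hl

# Hedged sketch: settings passed as @helion.kernel kwargs. The value types
# for the two new settings are assumptions, not confirmed by these notes.
@helion.kernel(
    autotune_force_persistent=True,  # assumed: only consider persistent-kernel configs when autotuning
    persistent_reserved_sms=2,       # assumed: reserve 2 SMs for concurrent work
)
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] + y[tile]
    return out
```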
New Contributors
- @FranciscoThiesen made their first contribution in #1096
- @rchen152 made their first contribution in #1143
Full Changelog: v0.2.2...v0.2.3
v0.2.2
What's Changed
- [Benchmark] Update welford torch.compile function name by @yf225 in #1029
- chore: Bump actions/upload-artifact from 4 to 5 by @dependabot[bot] in #1030
- chore: Bump actions/download-artifact from 5 to 6 by @dependabot[bot] in #1031
- [Benchmark CI] Set welford num_inputs to 6 to avoid timeout by @yf225 in #1032
- Default config: reduce block_size and num_stages to avoid shared mem OOM by @yf225 in #1033
- Default config: reduce block_size further to avoid shared mem OOM by @yf225 in #1034
- Disable autotuner progress bar in fbcode unit test by @yf225 in #1035
- Always print cached config by @oulgen in #1036
- Fix dtype mismatch error in se_block example by @yf225 in #1040
- Upgrade clang version by @oulgen in #1043
- Fix missing static_shapes=False in deployment_autotuning.md by @jansel in #1042
- Fix matmul output dtype to match PyTorch eager behavior by @yf225 in #1044
- Fix layernorm bwd unit test by @yf225 in #1047
- Fix FlattenedTileStrategy to handle unit-sized block dimensions by @yf225 in #1048
- [CI] Fix debug_str() to be compatible with latest PyTorch nightly by @yf225 in #1050
- [Fix upcoming CI error] Set current node in inductor lowering by @yf225 in #1052
- Remove Section Navigation pane from Deployment and Autotuning page. by @choijon5 in #1051
- Add `settings.autotune_baseline_fn` to allow passing in custom baseline function to autotuner by @yf225 in #1054 (a usage sketch follows this list)
- Add `HELION_PRINT_REPRO=1` to print Helion kernel repro script to console by @yf225 in #1049
- Fix caching for CPUs by @oulgen in #1055
- Add get_num_sm for cpu by @oulgen in #1056
- Support torch.rand / torch.rand_like with dynamic tile sizes by @yf225 in #1057
- Remove line numbers from expected files by @oulgen in #1061
- Ignore passed in config when force autotune is turned on by @oulgen in #1060
- Update Watch Talk link to Triton conf talk. by @choijon5 in #1058
- Helion Puzzle docs bug fixes by @Athe-kunal in #1062
- Update test_persistent_kernels.expected by @jansel in #1070
- Make HELION_PRINT_REPRO=1 take effect in more error cases by @yf225 in #1066
- add geglu backward by @parsshar-RH in #1069
- [Unblock internal] Fix log capture issue on internal tests by @yf225 in #1076
- Add best effort triton-cpu support by @oulgen in #1037
- Update test_debug_utils.py by @oulgen in #1077
- Raise user error if device-loop is empty after DCE by @yf225 in #1074
- Add GRPO loss example by @ighoshsubho in #1063
- Use HELION_PRINT_REPRO=1 to print repro when device IR lowering or Triton codegen error by @yf225 in #1078
- add AMD demo link by @vivienfanghuagood in #1068
- Update test.yml by @oulgen in #1083
- Fix GRPO loss example unit tests by @yf225 in #1079
- Remove requirements.txt by @oulgen in #1088
- Relax requirements for inline_triton output_like=None by @jansel in #1087
- feat(autotuner): Make autotune cache class configurable via env var by @fulvius31 in #1071
- Add support for while and pass by @jansel in #1090
- Update sphinxtheme to pull from pypi package by @sekyondaMeta in #1091
- [Autotuner] Better error message for default config error by @yf225 in #1092
- Ignore illegal instruction errors by @jansel in #1093
- Update talk links to PTC version by @jansel in #1094
- Add autotuning log by @jansel in #1095
- Fix builtin min / max handling in device loop by @yf225 in #1085
- Add skipIfRocm to failing test on main by @jansel in #1101
- Fix lint in newer triton by @jansel in #1098
- Add AGENTS.md by @jansel in #1100
- Refactor _decorators.codegen to allow multiple backends by @jansel in #1099
- Add extra line before repro log; update repro log tests by @yf225 in #1102
- Refactor inductor_lowering.py into two files by @jansel in #1103
- Use CPU machine for triton-cpu by @oulgen in #1105
- Fix no libdw.so issue on AMD CI by @yf225 in #1107
- Fixes in helion puzzles by @Athe-kunal in #1104
- Add distributed CI job (4xH100) and example unit tests by @yf225 in #1106
- Generalize aten_lowering.py for multiple backends by @jansel in #1108
- Support tensor.T for transpose by @yf225 in #1110
- Add warning to discourage use of `acc += lhs @ rhs` pattern by @yf225 in #1111 (see the sketch after this list)
- Remove `@helion.jit` usage and advise use of `@helion.kernel` by @yf225 in #1116
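On the `acc += lhs @ rhs` warning above: the preferred pattern threads the accumulator through the matmul op instead of re-adding it, as in Helion's matmul example. A minimal sketch (the kernel body mirrors the public example; the warning text itself is in #1111):

```python
import torch
import helion
import helion.language as hl

@helion.kernel()
def matmul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    m, k = x.size()
    k2, n = y.size()
    assert k == k2
    out = torch.empty([m, n], dtype=x.dtype, device=x.device)
    for tile_m, tile_n in hl.tile([m, n]):
        acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
        for tile_k in hl.tile(k):
            # Discouraged (per #1111): acc += x[tile_m, tile_k] @ y[tile_k, tile_n]
            # Preferred: pass the accumulator into the op explicitly.
            acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n])
        out[tile_m, tile_n] = acc.to(x.dtype)
    return out
```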
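And for the repro/baseline entries (#1049, #1054): `HELION_PRINT_REPRO` is an environment variable, while `autotune_baseline_fn` is a setting. The sketch below assumes the setting is passed as a `@helion.kernel` kwarg and takes a callable with the kernel's signature; that is inferred from the entry wording, not a documented contract.

```python
import os
os.environ["HELION_PRINT_REPRO"] = "1"  # same effect as HELION_PRINT_REPRO=1 in the shell

import torch
import helion
import helion.language as hl

def eager_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Hypothetical baseline: the eager implementation the autotuner is
    # assumed to check candidate configs against.
    return x + y

@helion.kernel(autotune_baseline_fn=eager_add)  # assumed kwarg spelling, per #1054
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] + y[tile]
    return out
```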
New Contributors
- @Athe-kunal made their first contribution in #1062
- @parsshar-RH made their first contribution in #1069
- @ighoshsubho made their first contribution in #1063
- @vivienfanghuagood made their first contribution in #1068
- @fulvius31 made their first contribution in #1071
Full Changelog: v0.2.1...v0.2.2
v0.2.1
What's Changed
- No autotuning on block_ptr if tma is available by @PaulZhang12 in #997
- Add reps for benchmarking stability by @PaulZhang12 in #999
- Prioritize outermost loop for warp spec by @PaulZhang12 in #1000
- Add backward pass for softmax kernel by @karthickai in #744
- Fix linter in softmax by @oulgen in #1003
- Fix test_examples.expected by @oulgen in #1002
- Beef up caching tests by @oulgen in #1001
- Add HELION_ASSERT_CACHE_HIT to debug/explain cache miss by @oulgen in #1006
- Better error message for calling Helion kernel from another kernel by @yf225 in #1008
- Assert that we are cache hitting on the CI by @oulgen in #1007
- Always raise `FailedToUnpackTile` when `for tile_m, tile_d in hl.tile(m, d)` is used by @yf225 in #1009 (see the sketch after this list)
- Adding demo for running softmax kernel on Google Colab by @choijon5 in #944
- int4 gemm accurate baselines by @PaulZhang12 in #1010
- Add sitemap xml by @sekyondaMeta in #1013
- [helion] backward support for swiglu by @shunting314 in #756
- Raise informative error when `hl.dot` with 3D inputs has batch dim mismatch by @yf225 in #1012
- [CI] Fix AMD journal check errors by @yf225 in #1016
- Support `breakpoint()` in device code when interpret mode is on by @yf225 in #1020
- Sort requirements file by @oulgen in #1021
- Better type checking for eviction policies by @oulgen in #1024
- Bump linter versions by @jansel in #1018
- Garbage collect expected results by @jansel in #1017
- Make indexing choice a list by @oulgen in #1025
- [Docs] Add list of indexing autotuning docs by @oulgen in #1027
- Make store indexing also individually tunable by @oulgen in #1028
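On the `FailedToUnpackTile` entry above: `hl.tile(m, d)` is the (begin, end) overload and yields a single tile, so unpacking it into two loop variables fails; iterating over two dimensions takes a list of sizes. A minimal sketch, assuming the standard `hl.tile` signature:

```python
import torch
import helion
import helion.language as hl

@helion.kernel()
def double2d(x: torch.Tensor) -> torch.Tensor:
    m, d = x.size()
    out = torch.empty_like(x)
    # Correct: a list of sizes yields one tile per dimension.
    for tile_m, tile_d in hl.tile([m, d]):
        out[tile_m, tile_d] = x[tile_m, tile_d] * 2
    # Incorrect, raises FailedToUnpackTile (per #1009):
    #   for tile_m, tile_d in hl.tile(m, d): ...
    # because hl.tile(m, d) is the (begin, end) form and yields one tile.
    return out
```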
New Contributors
- @shunting314 made their first contribution in #756
Full Changelog: v0.2.0...v0.2.1
v0.2.0
What's Changed
- Verify compiled kernels in subprocess by @jansel in #914
- Auto-shrink autotune_precompile_jobs based on free memory by @jansel in #940
- Make HELION_FORCE_AUTOTUNE or kernel.autotune() skip the cache by @jansel in #930
- Support warp specialization on B200 by @oulgen in #935
- Update README.md by @oulgen in #943
- Register tile symbol origin, to support `tile + offset` use case in blackwell attention by @yf225 in #939
- [CI] Print failed tests by @oulgen in #942
- Update examples to use run_example by @jansel in #941
- blackwell attn with triton attr set by @v0i0 in #918
- Set static_shapes=True by @oulgen in #937 (setting sketched after this list)
- run.py env var to skip exception logging by @v0i0 in #946
- Fix bug with unit sized dims and block_sizes by @jansel in #932
- Update static_shapes docs by @jansel in #951
- Add tile.count by @oulgen in #955
- Auto detect low vram by @oulgen in #956
- [CI] Use official PyTorch 2.9 by @oulgen in #962
- Use interleaved_bench for run_example by @jansel in #945
- Generalize tile_with_offset pass by @jansel in #949
- Docstring updates by @jansel in #952
- Import updates by @jansel in #953
- Add missing environment variables to docs by @jansel in #957
- Print out errors vs timeouts in autotuning status by @jansel in #960
- Add HELION_AUTOTUNE_IGNORE_ERRORS by @jansel in #961
- Exit autotuning faster on KeyboardInterrupt by @jansel in #963
- Remove default settings by @jansel in #964
- Add missing settings environment variables by @jansel in #965
- Skip test_differential_evolution_search due to slowness by @jansel in #968
- [Benchmark CI] Give nightly job permissions by @oulgen in #970
- [Benchmark CI] Allow kicking off workflow dispatch by @oulgen in #971
- [Benchmark CI] Allow specifying custom env vars via UI by @yf225 in #972
- [blackwell attn example] qk scale as param by @v0i0 in #969
- [Benchmark CI] Allow specifying custom args to benchmark runner via UI by @yf225 in #974
- Add initial backwards compatibility tests by @oulgen in #958
- Remove unrolling + warp spec by @PaulZhang12 in #967
- [Benchmark CI] Set atol and rtol to 1e-2 by @yf225 in #976
- [Benchmark] Fix tritonbench auto-installation by @yf225 in #980
- [Autotuner] Fix fork-based autotuner to avoid re-initializing CUDA context in subprocess by @yf225 in #981
- Make fork default precompilation strategy by @oulgen in #979
- [benchmarks] change tritonbench path by @xuzhao9 in #966
- Add skipIfA10G decorator by @yf225 in #982
- Suggest HELION_AUTOTUNE_PRECOMPILE=spawn when IMA happens by @jansel in #984
- Layer Norm bwd kernel to support large B*M case used by internal by @yf225 in #973
- Fix timeouts in autotuning by @jansel in #985
- Log generated triton code at the DEBUG level rather than INFO by @jansel in #986
- Remove extra debug log for timeouts by @jansel in #987
- Add squeeze_and_excitation_net kernel by @mengluy0125 in #870
- Generalize test cases to support XPU by @EikanWang in #983
- Updated README with News section of upcoming events. Added link to GPU mode talk. by @choijon5 in #991
- Update README.md by @oulgen in #992
- Update README.md by @oulgen in #993
- Mamba2 Chunk Scan & State by @v0i0 in #950
- Remove unrolling with tma + pipelining by @PaulZhang12 in #994
- Add provenance annotations to output code by @jansel in #988
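On `static_shapes=True` above: the setting is a `@helion.kernel` keyword argument (the same spelling the deployment/autotuning docs use); with it the kernel specializes on exact input shapes instead of generating shape-dynamic code. A minimal sketch:

```python
import torch
import helion
import helion.language as hl

# static_shapes=True: compile one specialization per exact input shape
# rather than shape-dynamic code; new shapes trigger recompilation.
@helion.kernel(static_shapes=True)
def double(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] * 2
    return out
```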
Full Changelog: v0.1.8...v0.2.0
v0.1.8
What's Changed
- fix rmsnorm fwd tritonbench by @v0i0 in #840
- Update input shapes for example kernels by @yf225 in #845
- Extend eviction policy tests to all indexing types by @oulgen in #833
- [Docs] Remove early development warning by @oulgen in #846
- [Docs] Add link to gpumode discord by @oulgen in #847
- [Docs] Add PTC promotional material by @oulgen in #848
- [Benchmark] Add low mem dropout example by @karthickai in #641
- Update lint.yml by @oulgen in #854
- Remove `hl.register_reduction_dim` API by @yf225 in #834
- Error message for boolean masking or torch.nonzero by @yf225 in #687
- Remove hardcoded `block_size=1` usage in attention kernel example by @yf225 in #843
- Revert "Update to use the new attribute setting for tf32." by @choijon5 in #856
- Decrease `num_stages` default from 3 to 2, to avoid shared memory OOM by @yf225 in #841
- Allow user-defined specialization key by @jansel in #853
- [Benchmark CI] Use fewer num_inputs for flash_attention to avoid timeout by @yf225 in #857
- Remove legacy `register_inductor_lowering` code by @yf225 in #864
- Set setstate/getstate methods to Config by @jansel in #868
- [doc] Add deployment/autotuning guide by @jansel in #869
- [Benchmark CI] Use equally-spaced-k mode to sample input shapes by @yf225 in #861
- Fix sphinx warnings by @jansel in #871
- Normalize tl.sqrt and libdevice.sqrt for tests by @oulgen in #866
- [CI] Pin py3.10 and one py3.12 on pytorch2.9 by @oulgen in #858
- [Docs] Suggest PyTorch 2.9 or above by @oulgen in #859
- [Benchmark] Pin benchmarks to PyTorch 2.9 by @oulgen in #860
- Print Triton code when error for easier debugging by @yf225 in #874
- Terminate autotuning faster if progress is minimal by @oulgen in #855
- Update README.md by @oulgen in #877
- [CI] pin b200 to pytorch2.9 by @oulgen in #878
- [Autotuner] Run CUDA synchronize before / after candidate func call, to surface CUDA errors sooner by @yf225 in #872
- [Benchmark] bf16 x int16 helion kernel by @karthickai in #794
- Install git for benchmarks by @oulgen in #882
- Pin AMD to 6.4.4 by @oulgen in #883
- Faster int4 gemm by @PaulZhang12 in #751
- Pin AMD to 6.4.4 by @oulgen in #881
- Remove PyTorch requirement from deps so that it is easier to install arbitrary version of pytorch by @oulgen in #879
- [Benchmark CI] Use regular matmul instead of split-k by @yf225 in #884
- [Benchmark] Use bespoke setup-python action by @oulgen in #885
- [Benchmark] Drop memory bound kernels and replace them with gemms by @oulgen in #887
- Add dependabot by @oulgen in #888
- Update dependabot.yml by @oulgen in #891
- chore: Bump actions/setup-python from 5 to 6 by @dependabot[bot] in #893
- chore: Bump actions/download-artifact from 4 to 5 by @dependabot[bot] in #895
- chore: Bump actions/upload-pages-artifact from 3 to 4 by @dependabot[bot] in #894
- chore: Bump actions/checkout from 4 to 5 by @dependabot[bot] in #892
- Upgrade ruff==0.14.0 by @jansel in #889
- [Benchmark CI] grouped_gemm: include input preproc in timing measurement; update gemm backend name mapping by @yf225 in #898
- chore: Bump astral-sh/setup-uv from 6 to 7 by @dependabot[bot] in #896
- [Benchmark] use logger.exception for process errors by @oulgen in #902
- [Benchmark CI] Reduce num_inputs for grouped_gemm and gemm benchmarks by @yf225 in #903
- Query minimum dot size for XPU by @EikanWang in #900
- Add matmul/addmm bwd examples and add test coverage by @tianrengao in #748
- [CI] Pin amd to rocm7.0 by @oulgen in #907
- [Benchmark] Move benchmark kernel sharding to dispatch by @oulgen in #905
- [Benchmark] Provide a way to pass custom list of kernels by @oulgen in #906
- [Benchmark CI] Use triton_tutorial_matmul for triton matmul baseline by @yf225 in #911
- Remove cache around set_triton_allocator by @oulgen in #912
- Add int4_gemm by @oulgen in #917
- chore: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #916
- Catch missing cudnn error by @jansel in #873
- Add progress bar for precompiling by @jansel in #919
- Adding new setting, autotune_effort=[none/quick/full] by @choijon5 in #913
- Print error message for torch.chunk / torch.unbind to redirect users to hl.split by @yf225 in #921
- Avoid setting default `--input-sample-mode` to `equally-spaced-k` by @yf225 in #922
- Remove `triton_helpers.*` usage in lifted device function arguments by @yf225 in #849
- Set HELION_DEV_LOW_VRAM=1 on a10g CI machines by @yf225 in #923
- Suggest use of `@helion.kernel(index_dtype=torch.int64)` if index offset is out of bounds for int32 by @yf225 in #850 (see the sketch after this list)
- Deprecate use_default_config and replace all its uses with autotune_effort by @choijon5 in #924
- Support `hl.arange()` with non-power-of-2 input by @yf225 in #862
- Setting up RunLLM AI Chatbot by @sekyondaMeta in #925
- Generalize examples with the DEVICE variable by @adam-smnk in #915
- Fix lint error by @jansel in #926
- Add lint to make sure examples and tests use device=DEVICE by @oulgen in #929
- Support tile+offset and tensor descriptors by @jansel in #928
- Fix triton/torch.compile compatibility issue by @jansel in #927
- Fix CUDA IMA from combination of unrolling + pipelining by @PaulZhang12 in #920
- Update the Agent ID by @sekyondaMeta in #931
- [Benchmark CI] Use `--non-square` flag for gemm by @yf225 in #938
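A combined sketch for two settings from this release: autotune_effort (one of none/quick/full, per #913, replacing use_default_config per #924) and index_dtype=torch.int64 for kernels whose index offsets overflow int32 (per #850). Treat it as illustrative; the kwarg spellings follow the entries above.

```python
import torch
import helion
import helion.language as hl

@helion.kernel(
    autotune_effort="quick",   # none / quick / full, per #913
    index_dtype=torch.int64,   # 64-bit indexing for large tensors, per #850
)
def copy(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile]
    return out
```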
New Contributors
- @dependabot[bot] made their first contribution in #893
- @tianrengao made their first contribution in #748
Full Changelog: v0.1.7...v0.1.8
v0.1.7
What's Changed
- Generalize the cuda-bias test cases by replacing hardcoded "cuda" literal with the DEVICE variable by @EikanWang in #775
- Make progress bar prettier by @oulgen in #786
- Upgrade ruff==0.13.3 pyright==1.1.406 by @jansel in #790
- Add hl.split and hl.join by @jansel in #791
- Generalize test_print and test_tensor_descriptor to support different accelerators by @EikanWang in #801
- Limit rebench to 1000 iterations by @jansel in #789
- Turn down autotuner defaults by @jansel in #788
- Enable torch.xpu._XpuDeviceProperties in Helion kernel by @EikanWang in #798
- Better error message for augmented assignment (e.g. +=) on host tensor without subscript by @yf225 in #807
- Add Pattern Search autotuning algorithm to docs. by @choijon5 in #810
- Support 0dim tensor in output code printing by @oulgen in #806
- Set range_num_stages <= 1 if using tensor_descriptor, to avoid CUDA misaligned address error by @yf225 in #792
- Add hl.inline_triton API by @jansel in #811
- Add out_dtype arg to hl.dot by @jansel in #813 (see the sketch after this list)
- Add autotune_config_overrides by @jansel in #814
- Reduce initial_population to 100 by @jansel in #800
- Disable range_num_stages for kernels with aliasing by @jansel in #812
- Adding new setting, autotune_max_generations, that allows user to set the maximum number of generations for autotuning by @choijon5 in #796
- Increase tolerance for test_matmul_reshape_m_2 by @jansel in #816
- Update docs by @jansel in #815
- Fix torch version check by @adam-smnk in #818
- [Benchmark] Keep going when a single benchmark fails by @oulgen in #820
- Faster Helion JSD by @PaulZhang12 in #733
- Faster KL Div by @PaulZhang12 in #822
- Normalize device name and decorate cuda-only test cases by @EikanWang in #819
- Improved log messages for autotuning by @choijon5 in #817
- Apply simplification to range indexing in order to reuse block size symbols by @yf225 in #809
- Fix hl.rand to use tile specific offsets instead of fixed offsets, ensure unique random num per tile by @karthickai in #685
- Match cuda versions for benchmark by @oulgen in #828
- Print nvidia-smi/rocminfo by @oulgen in #827
- Dump nvidia-smi/rocminfo on benchmarks by @oulgen in #829
- Add 3.14 support by @oulgen in #830
- Remove py312 vanilla test by @oulgen in #831
- Pad to next power of 2 for hl.specialize'ed shape value used in device tensor creation by @yf225 in #804
- Autotune eviction policy by @oulgen in #823
- [Docs] Consistent pre-commit/lint by @oulgen in #836
- [Docs] Recommend venv instead of conda by @oulgen in #837
- [Docs] Helion works on 3.10 through 3.14 by @oulgen in #838
- [Docs] Add eviction policy by @oulgen in #839
- Update to use the new attribute setting for tf32. by @choijon5 in #835
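On the out_dtype argument to hl.dot above: a hedged sketch of a matmul whose per-tile dot products are produced in float32. The `out_dtype` spelling comes from #813; the rest of the `hl.dot` call here is an assumption.

```python
import torch
import helion
import helion.language as hl

@helion.kernel()
def matmul_fp32_acc(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    m, k = x.size()
    _, n = y.size()
    out = torch.empty([m, n], dtype=torch.float32, device=x.device)
    for tile_m, tile_n in hl.tile([m, n]):
        acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
        for tile_k in hl.tile(k):
            # assumed: out_dtype controls the dot result dtype, per #813
            acc = acc + hl.dot(x[tile_m, tile_k], y[tile_k, tile_n], out_dtype=torch.float32)
        out[tile_m, tile_n] = acc
    return out
```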
Full Changelog: v0.1.6...v0.1.7
v0.1.6
What's Changed
- ci: Always auth for benchmarking workflows by @seemethere in #719
- [Benchmark] jagged_sum kernel and test by @Sibylau in #676
- Skip default config printing if in ref eager mode by @yf225 in #721
- [Benchmark CI] Make benchmark runner respect custom CLI args by @yf225 in #723
- Upgrade rocm CI to 7.0 by @oulgen in #720
- Add eviction policy argument to tl.load by @oulgen in #714
- [CI] use complete rocm docker images by @oulgen in #724
- More inconsistent naming by @oulgen in #725
- [Benchmark] jagged_layer_norm kernel and test by @Sibylau in #704
- [Bug fix] Preserve masks on reduction inputs that depend on reduction outputs; fix layer_norm accuracy check failure by @yf225 in #722
- Support torch.matmul with 3D inputs by @yf225 in #715 (sketched after this list)
- Slightly improve logs by @angelayi in #740
- Autotuning Progress Bar by @msaroufim in #739
- make tritonbench optional in run.py so install works again by @v0i0 in #746
- fix new factory when size comes from kwargs by @v0i0 in #750
- Add linting instructions to README by @msaroufim in #763
- Add backward kernel for exp by @aditvenk in #736
- fix roll reduction meta for ops with none output (like wait), cl… by @v0i0 in #767
- Move upload benchmark results to a separate workflows by @huydhn in #758
- Add flash_attention to benchmarks by @oulgen in #769
- Fix jagged_layer_norm linter error by @yf225 in #770
- Add SIGINT handler for clean interrupt of autotuning background processes by @msaroufim in #766
- Enable tensor descriptor for XPU by @EikanWang in #765
- Fix the issue that the XPU kernels cannot be cached well by @EikanWang in #761
- Print Helion kernel source line in symbolic shape debugging by @yf225 in #771
- ci: Set fail-fast to false by @seemethere in #776
- Add XPU support for RNG operations by @EikanWang in #774
- Enable test_dot for XPU by @EikanWang in #773
- Handle XPU compilation error by @adam-smnk in #779
- Fix type prop for and/or by @oulgen in #781
- Make print output code more robust by @oulgen in #780
- Revert "Add SIGINT handler for clean interrupt of autotuning background processes" by @oulgen in #784
- Add torch compile unit test to helion by @oulgen in #782
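On 3D torch.matmul support above: a hedged batched-matmul sketch in the same style as the 2D pattern; tiling the batch dimension alongside m and n is an illustration, not the exact test from #715.

```python
import torch
import helion
import helion.language as hl

@helion.kernel()
def bmm(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    b, m, k = x.size()
    _, _, n = y.size()
    out = torch.empty([b, m, n], dtype=x.dtype, device=x.device)
    for tile_b, tile_m, tile_n in hl.tile([b, m, n]):
        acc = hl.zeros([tile_b, tile_m, tile_n], dtype=torch.float32)
        for tile_k in hl.tile(k):
            # 3D matmul on tiles keeps the batch dim, per #715
            acc = acc + torch.matmul(x[tile_b, tile_m, tile_k], y[tile_b, tile_k, tile_n])
        out[tile_b, tile_m, tile_n] = acc.to(x.dtype)
    return out
```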
New Contributors
- @seemethere made their first contribution in #719
- @angelayi made their first contribution in #740
- @msaroufim made their first contribution in #739
- @aditvenk made their first contribution in #736
- @EikanWang made their first contribution in #765
- @adam-smnk made their first contribution in #779
Full Changelog: v0.1.5...v0.1.6
v0.1.5
v0.1.4
What's Changed
- Update benchmark.yml by @oulgen in #570
- Update benchmark.yml by @oulgen in #571
- [Benchmark] Use custom kernel metric mappings list to accommodate inconsistent naming by @oulgen in #567
- Add rms norm and cross entropy by @oulgen in #568
- Update benchmark_dispatch.yml by @oulgen in #573
- Update linters by @oulgen in #569
- Print config for PassManager::run triton errors by @jansel in #565
- Error when invalid loop reduction number config is generated by @oulgen in #572
- Add `skipIfLowVRAM` or `use_default_config=False` to specific unit tests to enable local testing by @yf225 in #574
- Fix bug with block_size smaller than minimum by @jansel in #575
- Better shape errors for mismatched tile sizes by @jansel in #566
- Print warning if block_size is specified in interpret mode. by @choijon5 in #576
- Run all shapes for benchmarks by @oulgen in #578
- [Benchmarks] Cooldown the GPU before recording results by @oulgen in #579
- [Benchmark] Fix layer_norm accuracy issue by @yf225 in #580
- [Benchmark] Remove hardcoded num_inputs for rms_norm kernel by @yf225 in #581
- Do not benchmark twice by @oulgen in #583
- Add missing functions to docs by @jansel in #586
- hl.atomic_add: support 1D tensor as index by @yf225 in #587
- Add atomic and/or/min/max/cas/xchg by @jansel in #589 (see the sketch after this list)
- Add test shard with HELION_DEBUG_DTYPE_ASSERTS=1, only run one ref-eager shard by @jansel in #590
- Add link to github to docs by @jansel in #591
- Support layernorm without bias by @mengluy0125 in #585
- Allow passing tritonbench operator instance into kernel benchmark wrapper; Always return lambda for timing measurement by @yf225 in #596
- Add layer_norm backward kernels by @yf225 in #588
- Fix tf32 warning by @jansel in #592
- [Benchmark] geglu example and test by @Sibylau in #582
- Print default config when running with it by @oulgen in #599
- [Benchmark] swiglu example and test by @Sibylau in #584
- Login to Docker from the workflows by @huydhn in #601
- Add rms_norm backward kernels by @mengluy0125 in #597
- Revert "Login to Docker from the workflows" by @oulgen in #604
- Fix static shape typo by @oulgen in #609
- Add small dim size (<16) support to hl.dot and torch.addmm; always prefer using `tl.dot(acc=...)` for addmm / baddbmm by @yf225 in #564
- Fix rms_norm and layer_norm by @mengluy0125 in #603
- [Benchmark] jsd kernel and test by @Sibylau in #611
- Refactor autotune error handling by @jansel in #595
- Possible fix for CI failures by @jansel in #617
- [Benchmark] Welford kernel and example by @karthickai in #614
- [Benchmark] kl_div kernel and test by @Sibylau in #615
- Ignore TServiceRouterException errors while autotuning by @jansel in #618
- [Example] int4_gemm kernel example and tritonbench integration by @yf225 in #613
- Set requires_grad=True for rms_norm backward inputs by @yf225 in #629
- Adjust tolerance for test_rms_norm_bwd_dx by @yf225 in #628
- Add more kernels to benchmarking by @oulgen in #632
- Reorder benchmarks by @oulgen in #633
- [Ref Mode] Fix hl.store for complex mask pattern by @yf225 in #621
- Support using block size var outside of hl.tile loop by @yf225 in #619
- [Benchmark CI] Print input shapes and surface problematic Helion config by @yf225 in #626
- Fix ValueError: numel (2097152) exceeds triton maximum tensor numel (1048576) by @mengluy0125 in #625
- Always clear inductor cache before benchmark by @yf225 in #608
- Make hl.specialize work on sequences by @jansel in #636
- Better error for passing Tile to hl.tile by @jansel in #640
- [Example] grouped_gemm kernel example and tritonbench integration by @yf225 in #620
- int4_gemm: remove use_default_config=True by @yf225 in #639
- [Easy][Benchmark CI] Exit job on any exception, for easier error catching by @yf225 in #643
- Avoid skipping CUDA errors that crash the CUDA context by @yf225 in #645
- Add `HELION_AUTOTUNE_RANDOM_SEED` env var and `autotune_random_seed` setting by @yf225 in #644
- Bump linter by @oulgen in #647
- Skip test_autotune_random_seed_from_env_var on rocm by @oulgen in #648
- Fix lint related to welford and also local_cache by @yf225 in #646
- Skip test_autotune_random_seed_from_settings on rocm by @yf225 in #651
- PT Sphinx Theme Test by @sekyondaMeta in #600
- Print `static_shapes` settings value along with config for accurate repro by @yf225 in #649
- [Benchmark] gather_gemv kernel and test by @Sibylau in #635
- Add HELION_SKIP_CACHE env by @jansel in #653
- [lint] Remove UP038 reference by @jansel in #650
- Fix `register_block_size` codegen by @yf225 in #659
- Raise better error when `hl.atomic_*` is used on device tensor by @yf225 in #658
- [Autotune] Filter bad config with accuracy check by @yf225 in #655
- Add hl.rand op with seed arg lowering to tl.rand by @karthickai in #652
- Log autotune random seed for easier repro by @yf225 in #661
- Fix misaligned address error for matmul by @yf225 in #662
- skip gather_gemv code check for B200 and fb_code by @Sibylau in #666
- rms_norm: get weight from function args by @yf225 in #664
- skip full autotune if configs are provided by @xuanzhang816 in #670
- [example] fused_linear_jsd by @v0i0 in #494
- Fix CI by moving B200 to cuda13 and downgrade a100/h100 to cuda12.8 by @oulgen in #674
- No background image by @sekyondaMeta in #663
- Remove github link from index.md by @oulgen in #675
- [Autotune] Allow skipping Triton compilation error by @yf225 in #679
- [Benchmark CI] Run one kernel per gpu to maximize successful kernel reporting by @yf225 in #681
- Fix missing block size constexpr assignment in host code by @yf225 in #678
- [CI] Fix missing setuptools by @yf225 in #680
- faster rms norm backwards kernel by @v0i0 in #624
- [Benchmark CI] Use do_bench cudagraph to avoid profiler failure; select specific kernel impls to run by @yf225 in #682
- [Benchmark CI] use --op instead of --kernel for better tritonbench compat by @yf225 in #694
- Increase tolerance for _validate_against_baseline by @jansel in #691
- [Benchmark CI] Use equally-spaced K input shapes by @yf225 in #689
- Print bad default config if compute baseline fails by @yf225 in #688
- Support HELION_AUTOTUNE_ACCURACY_CHECK=0 by @jansel in #692
- rms norm: improve fwd perf by @v0i0 in #669
- Revert "Add hl.rand op with seed arg lowering to tl.rand (#652)" by @jansel in #698
- [Autotune] Skip Triton shared memory OOM by @yf225 in https://git...
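On the atomic entries above (hl.atomic_add with a 1D tensor index per #587, plus and/or/min/max/cas/xchg per #589): a histogram-style sketch. The `hl.atomic_add(target, index_list, value)` calling convention is inferred from the entries and may differ in detail.

```python
import torch
import helion
import helion.language as hl

@helion.kernel()
def histogram(indices: torch.Tensor, num_bins: int) -> torch.Tensor:
    out = torch.zeros([num_bins], dtype=torch.float32, device=indices.device)
    for tile in hl.tile(indices.size(0)):
        # assumed convention: a tile of indices scatters +1 into out
        hl.atomic_add(out, [indices[tile]], 1.0)
    return out
```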
v0.1.3
What's Changed
- Add torch compile to benchmark by @oulgen in #545
- Fix issues with wrong dtypes in generated code by @jansel in #542
- Limit concurrent precompile jobs while autotuning by @jansel in #543
- Create basic helion benchmark runner by @oulgen in #544
- Add multi selection radio buttons by @oulgen in #547
- Fix benchmark condition by @oulgen in #548
- Move to dispatcher model for benchmarking by @oulgen in #549
- Give permissions by @oulgen in #550
- Do not downgrade torch/triton by @oulgen in #551
- Use uv for pip freeze by @oulgen in #552
- Add jagged hstu attention example (i.e. ragged_attention) by @xuanzhang816 in #554
- Install quack/torchbench with no deps by @oulgen in #553
- Update test-reports dir by @oulgen in #556
- torch.rand_like and torch.randn_like support by @yf225 in #530 (see the sketch after this list)
- [Benchmark] add addmm example and test by @Sibylau in #555
- Kick off benchmarks at midnight by @oulgen in #559
- Use profiler instead of inductor_benchmarker by @oulgen in #560
- Shard kernels by @oulgen in #561
- Add layer_norm and softmax by @oulgen in #562
- [Fix CI] Convert tiles to sizes for all torch.* functions by @yf225 in #563
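On torch.rand_like / torch.randn_like support above: a hedged dropout-style sketch drawing the random tile inside the device loop; per-tile seeding behavior was refined later by the hl.rand work (#652, #685), so treat the randomness semantics here as approximate.

```python
import torch
import helion
import helion.language as hl

@helion.kernel()
def dropout(x: torch.Tensor, p: float) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(x.size()):
        vals = x[tile]
        noise = torch.rand_like(vals)  # random values shaped like the tile, per #530
        out[tile] = torch.where(noise > p, vals / (1.0 - p), torch.zeros_like(vals))
    return out
```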
Full Changelog: v0.1.2...v0.1.3