Releases: pytorch/helion

v0.2.3

18 Nov 18:21
2644d0a

What's Changed

  • [CI] Fail the distributed CI job if any unit test fails by @yf225 in #1125
  • Add DE-Surrogate hybrid autotuner algorithm + early stopping option for DE and DE-Surrogate by @FranciscoThiesen in #1096
  • Update AGENTS.md by @jansel in #1128
  • Add Settings.persistent_reserved_sms by @jansel in #1129
  • Add Settings.autotune_force_persistent by @jansel in #1130 (both settings are sketched after this list)
  • [CI] Fix fbcode test_breakpoint error by @yf225 in #1132
  • Auto-select index_dtype by @jansel in #1131
  • Support tuple indexing by hl.static_range iterator by @yf225 in #1134
  • Fix CI to surface errors correctly, fix all existing errors by @yf225 in #1138
  • Workaround TRITON_INTERPRET bug breaking tests by @jansel in #1139
  • Fix size 0 tensor handling by @jansel in #1140
  • [Benchmark CI] Print generated Triton code for the best config by @yf225 in #1142
  • Use pyrefly for type checking by @rchen152 in #1143
  • fix pyrefly errors by @oulgen in #1144
  • [CI] Skip TestBreakpoint in ref-eager CI job by @yf225 in #1141
  • Bump pyrefly to 0.42.1 and remove 'sed' workaround. by @rchen152 in #1145
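A minimal sketch of how the two persistent-kernel settings above (#1129, #1130) might be used, assuming they are passed as `@helion.kernel(...)` keyword settings like other Helion settings; the setting values and kernel body are illustrative only:

```python
import torch
import helion
import helion.language as hl

# Illustrative sketch: the setting names come from #1129/#1130; the exact
# accepted values are an assumption based on the PR titles.
@helion.kernel(
    autotune_force_persistent=True,  # restrict autotuning to persistent-kernel configs (#1130)
    persistent_reserved_sms=2,       # leave 2 SMs free for concurrent work (#1129)
)
def scaled_copy(x: torch.Tensor, scale: float) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] * scale
    return out
```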

Full Changelog: v0.2.2...v0.2.3

v0.2.2

12 Nov 18:58
51580b4

What's Changed

  • [Benchmark] Update welford torch.compile function name by @yf225 in #1029
  • chore: Bump actions/upload-artifact from 4 to 5 by @dependabot[bot] in #1030
  • chore: Bump actions/download-artifact from 5 to 6 by @dependabot[bot] in #1031
  • [Benchmark CI] Set welford num_inputs to 6 to avoid timeout by @yf225 in #1032
  • Default config: reduce block_size and num_stages to avoid shared mem OOM by @yf225 in #1033
  • Default config: reduce block_size further to avoid shared mem OOM by @yf225 in #1034
  • Disable autotuner progress bar in fbcode unit test by @yf225 in #1035
  • Always print cached config by @oulgen in #1036
  • Fix dtype mismatch error in se_block example by @yf225 in #1040
  • Upgrade clang version by @oulgen in #1043
  • Fix missing static_shapes=False in deployment_autotuning.md by @jansel in #1042
  • Fix matmul output dtype to match PyTorch eager behavior by @yf225 in #1044
  • Fix layernorm bwd unit test by @yf225 in #1047
  • Fix FlattenedTileStrategy to handle unit-sized block dimensions by @yf225 in #1048
  • [CI] Fix debug_str() to be compatible with latest PyTorch nightly by @yf225 in #1050
  • [Fix upcoming CI error] Set current node in inductor lowering by @yf225 in #1052
  • Remove Section Navigation pane from Deployment and Autotuning page. by @choijon5 in #1051
  • Add settings.autotune_baseline_fn to allow passing in custom baseline function to autotuner by @yf225 in #1054
  • Add HELION_PRINT_REPRO=1 to print Helion kernel repro script to console by @yf225 in #1049
  • Fix caching for CPUs by @oulgen in #1055
  • Add get_num_sm for cpu by @oulgen in #1056
  • Support torch.rand / torch.rand_like with dynamic tile sizes by @yf225 in #1057
  • Remove line numbers from expected files by @oulgen in #1061
  • Ignore passed in config when force autotune is turned on by @oulgen in #1060
  • Update Watch Talk link to Triton conf talk. by @choijon5 in #1058
  • Helion Puzzle docs bug fixes by @Athe-kunal in #1062
  • Update test_persistent_kernels.expected by @jansel in #1070
  • Make HELION_PRINT_REPRO=1 take effect in more error cases by @yf225 in #1066
  • add geglu backward by @parsshar-RH in #1069
  • [Unblock internal] Fix log capture issue on internal tests by @yf225 in #1076
  • Add best effort triton-cpu support by @oulgen in #1037
  • Update test_debug_utils.py by @oulgen in #1077
  • Raise user error if device-loop is empty after DCE by @yf225 in #1074
  • Add GRPO loss example by @ighoshsubho in #1063
  • Use HELION_PRINT_REPRO=1 to print repro when device IR lowering or Triton codegen error by @yf225 in #1078
  • add AMD demo link by @vivienfanghuagood in #1068
  • Update test.yml by @oulgen in #1083
  • Fix GRPO loss example unit tests by @yf225 in #1079
  • Remove requirements.txt by @oulgen in #1088
  • Relax requirements for inline_triton output_like=None by @jansel in #1087
  • feat(autotuner): Make autotune cache class configurable via env var by @fulvius31 in #1071
  • Add support for while and pass by @jansel in #1090
  • Update sphinxtheme to pull from pypi package by @sekyondaMeta in #1091
  • [Autotuner] Better error message for default config error by @yf225 in #1092
  • Ignore illegal instruction errors by @jansel in #1093
  • Update talk links to PTC version by @jansel in #1094
  • Add autotuning log by @jansel in #1095
  • Fix builtin min / max handling in device loop by @yf225 in #1085
  • Add skipIfRocm to failing test on main by @jansel in #1101
  • Fix lint in newer triton by @jansel in #1098
  • Add AGENTS.md by @jansel in #1100
  • Refactor _decorators.codegen to allow multiple backends by @jansel in #1099
  • Add extra line before repro log; update repro log tests by @yf225 in #1102
  • Refactor inductor_lowering.py into two files by @jansel in #1103
  • Use CPU machine for triton-cpu by @oulgen in #1105
  • Fix no libdw.so issue on AMD CI by @yf225 in #1107
  • Fixes in helion puzzles by @Athe-kunal in #1104
  • Add distributed CI job (4xH100) and example unit tests by @yf225 in #1106
  • Generalize aten_lowering.py for multiple backends by @jansel in #1108
  • Support tensor.T for transpose by @yf225 in #1110
  • Add warning to discourage use of acc += lhs @ rhs pattern by @yf225 in #1111 (see the hl.dot sketch after this list)
  • Remove @helion.jit usage and advise use of @helion.kernel by @yf225 in #1116
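For the accumulator warning in #1111, a sketch of the preferred pattern, modeled on Helion's stock matmul example: pass the accumulator into `hl.dot` instead of writing `acc += lhs @ rhs`, so codegen can keep the multiply-accumulate fused. Shapes and dtypes here are illustrative:

```python
import torch
import helion
import helion.language as hl

@helion.kernel
def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    m, k = a.size()
    k2, n = b.size()
    assert k == k2, "inner dimensions must match"
    out = torch.empty([m, n], dtype=a.dtype, device=a.device)
    for tile_m, tile_n in hl.tile([m, n]):
        acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
        for tile_k in hl.tile(k):
            # preferred over: acc += a[tile_m, tile_k] @ b[tile_k, tile_n]
            acc = hl.dot(a[tile_m, tile_k], b[tile_k, tile_n], acc=acc)
        out[tile_m, tile_n] = acc.to(out.dtype)
    return out
```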

Full Changelog: v0.2.1...v0.2.2

v0.2.1

26 Oct 23:16
c5dbbbe

What's Changed

Full Changelog: v0.2.0...v0.2.1

v0.2.0

20 Oct 20:54
3a0e975

What's Changed

  • Verify compiled kernels in subprocess by @jansel in #914
  • Auto-shrink autotune_precompile_jobs based on free memory by @jansel in #940
  • Make HELION_FORCE_AUTOTUNE or kernel.autotune() skip the cache by @jansel in #930 (see the sketch after this list)
  • Support warp specialization on B200 by @oulgen in #935
  • Update README.md by @oulgen in #943
  • Register tile symbol origin, to support tile + offset use case in blackwell attention by @yf225 in #939
  • [CI] Print failed tests by @oulgen in #942
  • Update examples to use run_example by @jansel in #941
  • blackwell attn with triton attr set by @v0i0 in #918
  • Set static_shapes=True by @oulgen in #937
  • run.py env var to skip exception logging by @v0i0 in #946
  • Fix bug with unit sized dims and block_sizes by @jansel in #932
  • Update static_shapes docs by @jansel in #951
  • Add tile.count by @oulgen in #955
  • Auto detect low vram by @oulgen in #956
  • [CI] Use official PyTorch 2.9 by @oulgen in #962
  • Use interleaved_bench for run_example by @jansel in #945
  • Generalize tile_with_offset pass by @jansel in #949
  • Docstring updates by @jansel in #952
  • Import updates by @jansel in #953
  • Add missing environment variables to docs by @jansel in #957
  • Print out errors vs timeouts in autotuning status by @jansel in #960
  • Add HELION_AUTOTUNE_IGNORE_ERRORS by @jansel in #961
  • Exit autotuning faster on KeyboardInterrupt by @jansel in #963
  • Remove default settings by @jansel in #964
  • Add missing settings environment variables by @jansel in #965
  • Skip test_differential_evolution_search due to slowness by @jansel in #968
  • [Benchmark CI] Give nightly job permissions by @oulgen in #970
  • [Benchmark CI] Allow kicking off workflow dispatch by @oulgen in #971
  • [Benchmark CI] Allow specifying custom env vars via UI by @yf225 in #972
  • [blackwell attn example] qk scale as param by @v0i0 in #969
  • [Benchmark CI] Allow specifying custom args to benchmark runner via UI by @yf225 in #974
  • Add initial backwards compatibility tests by @oulgen in #958
  • Remove unrolling + warp spec by @PaulZhang12 in #967
  • [Benchmark CI] Set atol and rtol to 1e-2 by @yf225 in #976
  • [Benchmark] Fix tritonbench auto-installation by @yf225 in #980
  • [Autotuner] Fix fork-based autotuner to avoid re-initializing CUDA context in subprocess by @yf225 in #981
  • Make fork default precompilation strategy by @oulgen in #979
  • [benchmarks] change tritonbench path by @xuzhao9 in #966
  • Add skipIfA10G decorator by @yf225 in #982
  • Suggest HELION_AUTOTUNE_PRECOMPILE=spawn when IMA happens by @jansel in #984
  • Layer Norm bwd kernel to support large B*M case used internally by @yf225 in #973
  • Fix timeouts in autotuning by @jansel in #985
  • Log generated triton code at the DEBUG level rather than INFO by @jansel in #986
  • Remove extra debug log for timeouts by @jansel in #987
  • Add squeeze_and_excitation_net kernel by @mengluy0125 in #870
  • Generalize test cases to support XPU by @EikanWang in #983
  • Updated README with News section of upcoming events. Added link to GPU mode talk. by @choijon5 in #991
  • Update README.md by @oulgen in #992
  • Update README.md by @oulgen in #993
  • Mamba2 Chunk Scan & State by @v0i0 in #950
  • Remove unrolling with tma + pipelining by @PaulZhang12 in #994
  • Add provenance annotations to output code by @jansel in #988
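A sketch tying together #937 (`static_shapes=True`) and #930 (cache skipping): `static_shapes` is a kernel setting, and a fresh autotune run can be forced either with the `HELION_FORCE_AUTOTUNE=1` environment variable or by calling `kernel.autotune()` directly. The kernel and argument shapes are illustrative:

```python
import torch
import helion
import helion.language as hl

@helion.kernel(static_shapes=True)  # specialize the kernel on exact input shapes (#937)
def double(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(x.size()):
        out[tile] = x[tile] * 2
    return out

# Per #930, either of these bypasses the config cache (argument shown is illustrative):
#   HELION_FORCE_AUTOTUNE=1 python my_script.py
#   double.autotune((torch.randn(1024, 1024, device="cuda"),))
```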

Full Changelog: v0.1.8...v0.2.0

v0.1.8

15 Oct 00:37
b77301f

What's Changed

  • fix rmsnorm fwd tritonbench by @v0i0 in #840
  • Update input shapes for example kernels by @yf225 in #845
  • Extend eviction policy tests to all indexing types by @oulgen in #833
  • [Docs] Remove early development warning by @oulgen in #846
  • [Docs] Add link to gpumode discord by @oulgen in #847
  • [Docs] Add PTC promotional material by @oulgen in #848
  • [Benchmark] Add low mem dropout example by @karthickai in #641
  • Update lint.yml by @oulgen in #854
  • Remove hl.register_reduction_dim API by @yf225 in #834
  • Error message for boolean masking or torch.nonzero by @yf225 in #687
  • Remove hardcoded block_size=1 usage in attention kernel example by @yf225 in #843
  • Revert "Update to use the new attribute setting for tf32." by @choijon5 in #856
  • Decrease num_stages default from 3 to 2, to avoid shared memory OOM by @yf225 in #841
  • Allow user-defined specialization key by @jansel in #853
  • [Benchmark CI] Use fewer num_inputs for flash_attention to avoid timeout by @yf225 in #857
  • Remove legacy register_inductor_lowering code by @yf225 in #864
  • Add setstate/getstate methods to Config by @jansel in #868
  • [doc] Add deployment/autotuning guide by @jansel in #869
  • [Benchmark CI] Use equally-spaced-k mode to sample input shapes by @yf225 in #861
  • Fix sphinx warnings by @jansel in #871
  • Normalize tl.sqrt and libdevice.sqrt for tests by @oulgen in #866
  • [CI] Pin py3.10 and one py3.12 on pytorch2.9 by @oulgen in #858
  • [Docs] Suggest PyTorch 2.9 or above by @oulgen in #859
  • [Benchmark] Pin benchmarks to PyTorch 2.9 by @oulgen in #860
  • Print Triton code on error for easier debugging by @yf225 in #874
  • Terminate autotuning faster if progress is minimal by @oulgen in #855
  • Update README.md by @oulgen in #877
  • [CI] pin b200 to pytorch2.9 by @oulgen in #878
  • [Autotuner] Run CUDA synchronize before / after candidate func call, to surface CUDA errors sooner by @yf225 in #872
  • [Benchmark] bf16 x int16 helion kernel by @karthickai in #794
  • Install git for benchmarks by @oulgen in #882
  • Pin AMD to 6.4.4 by @oulgen in #883
  • Faster int4 gemm by @PaulZhang12 in #751
  • Pin AMD to 6.4.4 by @oulgen in #881
  • Remove PyTorch requirement from deps so that it is easier to install an arbitrary version of PyTorch by @oulgen in #879
  • [Benchmark CI] Use regular matmul instead of split-k by @yf225 in #884
  • [Benchmark] Use bespoke setup-python action by @oulgen in #885
  • [Benchmark] Drop memory bound kernels and replace them with gemms by @oulgen in #887
  • Add dependabot by @oulgen in #888
  • Update dependabot.yml by @oulgen in #891
  • chore: Bump actions/setup-python from 5 to 6 by @dependabot[bot] in #893
  • chore: Bump actions/download-artifact from 4 to 5 by @dependabot[bot] in #895
  • chore: Bump actions/upload-pages-artifact from 3 to 4 by @dependabot[bot] in #894
  • chore: Bump actions/checkout from 4 to 5 by @dependabot[bot] in #892
  • Upgrade ruff==0.14.0 by @jansel in #889
  • [Benchmark CI] grouped_gemm: include input preproc in timing measurement; update gemm backend name mapping by @yf225 in #898
  • chore: Bump astral-sh/setup-uv from 6 to 7 by @dependabot[bot] in #896
  • [Benchmark] use logger.exception for process errors by @oulgen in #902
  • [Benchmark CI] Reduce num_inputs for grouped_gemm and gemm benchmarks by @yf225 in #903
  • Query minimum dot size for XPU by @EikanWang in #900
  • Add matmul/addmm bwd examples and add test coverage by @tianrengao in #748
  • [CI] Pin amd to rocm7.0 by @oulgen in #907
  • [Benchmark] Move benchmark kernel sharding to dispatch by @oulgen in #905
  • [Benchmark] Provide a way to pass custom list of kernels by @oulgen in #906
  • [Benchmark CI] Use triton_tutorial_matmul for triton matmul baseline by @yf225 in #911
  • Remove cache around set_triton_allocator by @oulgen in #912
  • Add int4_gemm by @oulgen in #917
  • chore: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #916
  • Catch missing cudnn error by @jansel in #873
  • Add progress bar for precompiling by @jansel in #919
  • Adding new setting, autotune_effort=[none/quick/full] by @choijon5 in #913
  • Print error message for torch.chunk / torch.unbind to redirect users to hl.split by @yf225 in #921
  • Avoid setting default --input-sample-mode to equally-spaced-k by @yf225 in #922
  • Remove triton_helpers.* usage in lifted device function arguments by @yf225 in #849
  • Set HELION_DEV_LOW_VRAM=1 on a10g CI machines by @yf225 in #923
  • Suggest use of @helion.kernel(index_dtype=torch.int64) if index offset is out of bound for int32 by @yf225 in #850
  • Deprecate use_default_config and replace all its uses with autotune_effort by @choijon5 in #924 (see the sketch after this list)
  • Support hl.arange() with non-power-of-2 input by @yf225 in #862
  • Setting up RunLLm AI Chatbot by @sekyondaMeta in #925
  • Generalize examples with the DEVICE variable by @adam-smnk in #915
  • Fix lint error by @jansel in #926
  • Add lint to make sure examples and tests use device=DEVICE by @oulgen in #929
  • Support tile+offset and tensor descriptors by @jansel in #928
  • Fix triton/torch.compile compability issue by @jansel in #927
  • Fix CUDA IMA from combination of unrolling + pipelining by @PaulZhang12 in #920
  • Update the Agent ID by @sekyondaMeta in #931
  • [Benchmark CI] Use --non-square flag for gemm by @yf225 in #938
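A sketch of the `autotune_effort` setting introduced in #913 and made the replacement for `use_default_config` in #924. The three levels come from the PR title; "none" corresponds to the old `use_default_config=True` behavior:

```python
import torch
import helion
import helion.language as hl

# autotune_effort accepts "none" / "quick" / "full" per #913:
# "none" skips autotuning entirely, "quick" runs a reduced search,
# "full" runs the complete search.
@helion.kernel(autotune_effort="none")
def relu(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(x.size()):
        out[tile] = torch.relu(x[tile])
    return out
```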

Full Changelog: v0.1.7...v0.1.8

v0.1.7

08 Oct 19:16
269deb3

What's Changed

  • Generalize the CUDA-biased test cases by replacing the hardcoded "cuda" literal with the DEVICE variable by @EikanWang in #775
  • Make progress bar prettier by @oulgen in #786
  • Upgrade ruff==0.13.3 pyright==1.1.406 by @jansel in #790
  • Add hl.split and hl.join by @jansel in #791
  • Generalize test_print and test_tensor_descriptor to support different accelerators by @EikanWang in #801
  • Limit rebench to 1000 iterations by @jansel in #789
  • Turn down autotuner defaults by @jansel in #788
  • Enable torch.xpu._XpuDeviceProperties in Helion kernel by @EikanWang in #798
  • Better error message for augmented assignment (e.g. +=) on host tensor without subscript by @yf225 in #807
  • Add Pattern Search autotuning algorithm to docs. by @choijon5 in #810
  • Support 0dim tensor in output code printing by @oulgen in #806
  • Set range_num_stages <= 1 if using tensor_descriptor, to avoid CUDA misaligned address error by @yf225 in #792
  • Add hl.inline_triton API by @jansel in #811
  • Add out_dtype arg to hl.dot by @jansel in #813
  • Add autotune_config_overrides by @jansel in #814 (see the sketch after this list)
  • Reduce initial_population to 100 by @jansel in #800
  • Disable range_num_stages for kernels with aliasing by @jansel in #812
  • Adding new setting, autotune_max_generations, that allows user to set the maximum number of generations for autotuning by @choijon5 in #796
  • Increase tolerance for test_matmul_reshape_m_2 by @jansel in #816
  • Update docs by @jansel in #815
  • Fix torch version check by @adam-smnk in #818
  • [Benchmark] Keep going when a single benchmark fails by @oulgen in #820
  • Faster Helion JSD by @PaulZhang12 in #733
  • Faster KL Div by @PaulZhang12 in #822
  • Normalize device name and decorate cuda-only test cases by @EikanWang in #819
  • Improved log messages for autotuning by @choijon5 in #817
  • Apply simplification to range indexing in order to reuse block size symbols by @yf225 in #809
  • Fix hl.rand to use tile specific offsets instead of fixed offsets, ensure unique random num per tile by @karthickai in #685
  • Match cuda versions for benchmark by @oulgen in #828
  • Print nvidia-smi/rocminfo by @oulgen in #827
  • Dump nvidia-smi/rocminfo on benchmarks by @oulgen in #829
  • Add 3.14 support by @oulgen in #830
  • Remove py312 vanilla test by @oulgen in #831
  • Pad to next power of 2 for hl.specialize'ed shape value used in device tensor creation by @yf225 in #804
  • Autotune eviction policy by @oulgen in #823
  • [Docs] Consistent pre-commit/lint by @oulgen in #836
  • [Docs] Recommend venv instead of conda by @oulgen in #837
  • [Docs] Helion works on 3.10 through 3.14 by @oulgen in #838
  • [Docs] Add eviction policy by @oulgen in #839
  • Update to use the new attribute setting for tf32. by @choijon5 in #835
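A sketch of `autotune_config_overrides` from #814, assuming it takes a mapping of config fields to pinned values that the autotuner holds fixed while searching the rest; `num_warps` and `num_stages` are standard Helion config fields, but treating them as valid override keys here is an assumption:

```python
import torch
import helion
import helion.language as hl

# Assumed usage: pin num_warps/num_stages and let the autotuner search
# the remaining config space (#814).
@helion.kernel(autotune_config_overrides={"num_warps": 4, "num_stages": 2})
def square(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(x.size()):
        out[tile] = x[tile] * x[tile]
    return out
```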

Full Changelog: v0.1.6...v0.1.7

v0.1.6

02 Oct 22:32
3322ca9

What's Changed

  • ci: Always auth for benchmarking workflows by @seemethere in #719
  • [Benchmark] jagged_sum kernel and test by @Sibylau in #676
  • Skip default config printing if in ref eager mode by @yf225 in #721
  • [Benchmark CI] Make benchmark runner respect custom CLI args by @yf225 in #723
  • Upgrade rocm CI to 7.0 by @oulgen in #720
  • Add eviction policy argument to tl.load by @oulgen in #714
  • [CI] use complete rocm docker images by @oulgen in #724
  • More inconsistent naming by @oulgen in #725
  • [Benchmark] jagged_layer_norm kernel and test by @Sibylau in #704
  • [Bug fix] Preserve masks on reduction inputs that depend on reduction outputs; fix layer_norm accuracy check failure by @yf225 in #722
  • Support torch.matmul with 3D inputs by @yf225 in #715 (see the sketch after this list)
  • Slightly improve logs by @angelayi in #740
  • Autotuning Progress Bar by @msaroufim in #739
  • make tritonbench optional in run.py so install works again by @v0i0 in #746
  • fix new factory when size comes from kwargs by @v0i0 in #750
  • Add linting instructions to README by @msaroufim in #763
  • Add backward kernel for exp by @aditvenk in #736
  • fix roll reduction meta for ops with None output (like wait), cl… by @v0i0 in #767
  • Move upload benchmark results to a separate workflows by @huydhn in #758
  • Add flash_attention to benchmarks by @oulgen in #769
  • Fix jagged_layer_norm linter error by @yf225 in #770
  • Add SIGINT handler for clean interrupt of autotuning background processes by @msaroufim in #766
  • Enable tensor descriptor for XPU by @EikanWang in #765
  • Fix the issue that the XPU kernels cannot be cached well by @EikanWang in #761
  • Print Helion kernel source line in symbolic shape debugging by @yf225 in #771
  • ci: Set fail-fast to false by @seemethere in #776
  • Add XPU support for RNG operations by @EikanWang in #774
  • Enable test_dot for XPU by @EikanWang in #773
  • Handle XPU compilation error by @adam-smnk in #779
  • Fix type prop for and/or by @oulgen in #781
  • Make print output code more robust by @oulgen in #780
  • Revert "Add SIGINT handler for clean interrupt of autotuning background processes" by @oulgen in #784
  • Add torch compile unit test to helion by @oulgen in #782
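For #715 (torch.matmul with 3D inputs), a batched-matmul sketch modeled on Helion's stock bmm example; `torch.baddbmm` is used for the fused accumulate, and all shapes are illustrative:

```python
import torch
import helion
import helion.language as hl

@helion.kernel
def bmm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    B, M, K = a.size()
    _, _, N = b.size()
    out = torch.empty([B, M, N], dtype=a.dtype, device=a.device)
    for tile_b, tile_m, tile_n in hl.tile([B, M, N]):
        acc = hl.zeros([tile_b, tile_m, tile_n], dtype=torch.float32)
        for tile_k in hl.tile(K):
            # batched multiply-accumulate on 3D tiles
            acc = torch.baddbmm(acc, a[tile_b, tile_m, tile_k], b[tile_b, tile_k, tile_n])
        out[tile_b, tile_m, tile_n] = acc.to(out.dtype)
    return out
```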

Full Changelog: v0.1.5...v0.1.6

v0.1.5

29 Sep 18:29
994aaf9

What's Changed

  • [Benchmark CI] Print config that causes tritonbench accuracy check failure by @yf225 in #716
  • Add AMD to benchmarks by @oulgen in #717
  • [Docs] Move docs requirements to docs/requirements.txt to make compatible with pypi by @oulgen in #718

Full Changelog: v0.1.4...v0.1.5

v0.1.4

29 Sep 16:17
0428d5d

What's Changed

  • Update benchmark.yml by @oulgen in #570
  • Update benchmark.yml by @oulgen in #571
  • [Benchmark] Use custom kernel metric mappings list to accommodate inconsistent naming by @oulgen in #567
  • Add rms norm and cross entropy by @oulgen in #568
  • Update benchmark_dispatch.yml by @oulgen in #573
  • Update linters by @oulgen in #569
  • Print config for PassManager::run triton errors by @jansel in #565
  • Error when invalid loop reduction number config is generated by @oulgen in #572
  • Add skipIfLowVRAM or use_default_config=False to specific unit tests to enable local testing by @yf225 in #574
  • Fix bug with block_size smaller than minimum by @jansel in #575
  • Better shape errors for mismatched tile sizes by @jansel in #566
  • Print warning if block_size is specified in interpret mode. by @choijon5 in #576
  • Run all shapes for benchmarks by @oulgen in #578
  • [Benchmarks] Cooldown the GPU before recording results by @oulgen in #579
  • [Benchmark] Fix layer_norm accuracy issue by @yf225 in #580
  • [Benchmark] Remove hardcoded num_inputs for rms_norm kernel by @yf225 in #581
  • Do not benchmark twice by @oulgen in #583
  • Add missing functions to docs by @jansel in #586
  • hl.atomic_add: support 1D tensor as index by @yf225 in #587
  • Add atomic and/or/min/max/cas/xchg by @jansel in #589 (atomics are sketched after this list)
  • Add test shard with HELION_DEBUG_DTYPE_ASSERTS=1, only run one ref-eager shard by @jansel in #590
  • Add link to github to docs by @jansel in #591
  • Support layernorm without bias by @mengluy0125 in #585
  • Allow passing tritonbench operator instance into kernel benchmark wrapper; Always return lambda for timing measurement by @yf225 in #596
  • Add layer_norm backward kernels by @yf225 in #588
  • Fix tf32 warning by @jansel in #592
  • [Benchmark] geglu example and test by @Sibylau in #582
  • Print default config when running with it by @oulgen in #599
  • [Benchmark] swiglu example and test by @Sibylau in #584
  • Login to Docker from the workflows by @huydhn in #601
  • Add rms_norm backward kernels by @mengluy0125 in #597
  • Revert "Login to Docker from the workflows" by @oulgen in #604
  • Fix static shape typo by @oulgen in #609
  • Add small dim size (<16) support to hl.dot and torch.addmm; Always prefer using tl.dot(acc=...) for addmm / baddbmm by @yf225 in #564
  • Fix rms_norm and layer_norm by @mengluy0125 in #603
  • [Benchmark] jsd kernel and test by @Sibylau in #611
  • Refactor autotune error handling by @jansel in #595
  • Possible fix for CI failures by @jansel in #617
  • [Benchmark] Welford kernel and example by @karthickai in #614
  • [Benchmark] kl_div kernel and test by @Sibylau in #615
  • Ignore TServiceRouterException errors while autotuning by @jansel in #618
  • [Example] int4_gemm kernel example and tritonbench integration by @yf225 in #613
  • Set requires_grad=True for rms_norm backward inputs by @yf225 in #629
  • Adjust tolerance for test_rms_norm_bwd_dx by @yf225 in #628
  • Add more kernels to benchmarking by @oulgen in #632
  • Reorder benchmarks by @oulgen in #633
  • [Ref Mode] Fix hl.store for complex mask pattern by @yf225 in #621
  • Support using block size var outside of hl.tile loop by @yf225 in #619
  • [Benchmark CI] Print input shapes and surface problematic Helion config by @yf225 in #626
  • Fix ValueError: numel (2097152) exceeds triton maximum tensor numel (1048576) by @mengluy0125 in #625
  • Always clear inductor cache before benchmark by @yf225 in #608
  • Make hl.specialize work on sequences by @jansel in #636
  • Better error for passing Tile to hl.tile by @jansel in #640
  • [Example] grouped_gemm kernel example and tritonbench integration by @yf225 in #620
  • int4_gemm: remove use_default_config=True by @yf225 in #639
  • [Easy][Benchmark CI] Exit job on any exception, for easier error catching by @yf225 in #643
  • Avoid skipping CUDA errors that crash the CUDA context by @yf225 in #645
  • Add HELION_AUTOTUNE_RANDOM_SEED env var and autotune_random_seed setting by @yf225 in #644
  • Bump linter by @oulgen in #647
  • Skip test_autotune_random_seed_from_env_var on rocm by @oulgen in #648
  • Fix lint related to welford and also local_cache by @yf225 in #646
  • Skip test_autotune_random_seed_from_settings on rocm by @yf225 in #651
  • PT Sphinx Theme Test by @sekyondaMeta in #600
  • Print static_shapes settings value along with config for accurate repro by @yf225 in #649
  • [Benchmark] gather_gemv kernel and test by @Sibylau in #635
  • Add HELION_SKIP_CACHE env by @jansel in #653
  • [lint] Remove UP038 reference by @jansel in #650
  • Fix register_block_size codegen by @yf225 in #659
  • Raise better error when hl.atomic_* is used on device tensor by @yf225 in #658
  • [Autotune] Filter bad config with accuracy check by @yf225 in #655
  • Add hl.rand op with seed arg lowering to tl.rand by @karthickai in #652
  • Log autotune random seed for easier repro by @yf225 in #661
  • Fix misaligned address error for matmul by @yf225 in #662
  • skip gather_gemv code check for B200 and fb_code by @Sibylau in #666
  • rms_norm: get weight from function args by @yf225 in #664
  • skip full autotune if configs are provided by @xuanzhang816 in #670
  • [example] fused_linear_jsd by @v0i0 in #494
  • Fix CI by moving B200 to cuda13 and downgrading a100/h100 to cuda12.8 by @oulgen in #674
  • No background image by @sekyondaMeta in #663
  • Remove github link from index.md by @oulgen in #675
  • [Autotune] Allow skipping Triton compilation error by @yf225 in #679
  • [Benchmark CI] Run one kernel per gpu to maximize successful kernel reporting by @yf225 in #681
  • Fix missing block size constexpr assignment in host code by @yf225 in #678
  • [CI] Fix missing setuptools by @yf225 in #680
  • faster rms norm backwards kernel by @v0i0 in #624
  • [Benchmark CI] Use do_bench cudagraph to avoid profiler failure; select specific kernel impls to run by @yf225 in #682
  • [Benchmark CI] use --op instead of --kernel for better tritonbench compat by @yf225 in #694
  • Increase tolerance for _validate_against_baseline by @jansel in #691
  • [Benchmark CI] Use equally-spaced K input shapes by @yf225 in #689
  • Print bad default config if compute baseline fails by @yf225 in #688
  • Support HELION_AUTOTUNE_ACCURACY_CHECK=0 by @jansel in #692
  • rms norm: improve fwd perf by @v0i0 in #669
  • Revert "Add hl.rand op with seed arg lowering to tl.rand (#652)" by @jansel in #698
  • [Autotune] Skip Triton shared memory OOM by @yf225 in https://git...
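For the atomics work above (#587, #589), a sketch of a histogram kernel that addresses `hl.atomic_add` with a 1D index tensor; the kernel body is illustrative and assumes the documented `hl.atomic_add(target, [index], value)` form:

```python
import torch
import helion
import helion.language as hl

@helion.kernel
def histogram(indices: torch.Tensor, num_bins: int) -> torch.Tensor:
    hist = torch.zeros([num_bins], dtype=torch.int32, device=indices.device)
    for tile in hl.tile(indices.size(0)):
        # a 1D tensor of bin ids serves as the atomic index (#587)
        hl.atomic_add(hist, [indices[tile]], 1)
    return hist
```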

v0.1.3

05 Sep 00:49
a61bd17

What's Changed

Full Changelog: v0.1.2...v0.1.3