I came across tokendagger, a drop-in replacement for tiktoken that appears to have some different performance characteristics.
Would there be any interest in porting those changes over to tiktoken? I see two big differences: one is how it handles special tokens, and the other is a set of regex-related modifications, i.e. using PCRE2 JIT (though Hyperscan might be even faster).
Happy to give it a shot and run a few benchmarks!
Here's a rough sketch of how tokendagger handles special tokens. It should be more performant when there's a large special-token vocabulary but the user is only working with a small subset of those tokens. That's not so useful for GPT's tokenization, which is the focus of tiktoken, but it could speed things up when the library is used for other tokenizers.
That is: we iterate through the allowed special tokens with a simple substring find and cache the locations of those tokens, instead of building out a combined special-token regex.
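A minimal Python sketch of that idea (the function name and signature are mine for illustration, not tokendagger's actual API):

```python
def find_special_tokens(
    text: str,
    special_tokens: dict[str, int],
    allowed: set[str],
) -> list[tuple[int, int, int]]:
    """Locate allowed special tokens via plain substring search.

    Instead of compiling one large alternation regex over every special
    token in the vocabulary, scan for each *allowed* token with str.find
    and cache the match spans. This wins when the special-token vocabulary
    is large but the caller only permits a small subset of it.
    """
    spans: list[tuple[int, int, int]] = []
    for tok in allowed:
        tok_id = special_tokens[tok]
        start = 0
        while True:
            idx = text.find(tok, start)
            if idx == -1:
                break
            # Cache (start, end, token_id) for this occurrence.
            spans.append((idx, idx + len(tok), tok_id))
            start = idx + len(tok)
    spans.sort()  # restore document order across different tokens
    return spans


# Example: a large special-token table, but only one token allowed.
specials = {"<|endoftext|>": 100257, "<|fim_prefix|>": 100258}
text = "hello<|endoftext|>world"
print(find_special_tokens(text, specials, {"<|endoftext|>"}))
# → [(5, 18, 100257)]
```

The cached spans can then be used to split the input, with ordinary BPE tokenization applied to the text between them.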