I came across tokendagger, a drop-in replacement for tiktoken that appears to have some different performance characteristics.
Would there be any interest in porting those changes over to tiktoken? I see two big differences: one is how it handles special tokens, and the other is a set of regex-related modifications, i.e. using PCRE2 JIT (though Hyperscan might be even faster).
Happy to give it a shot and run a few benchmarks!
Here's a rough sketch of how tokendagger handles special tokens. It should be more performant when there's a large special-token vocabulary but the user is only working with a small subset of those tokens. That's not so useful for GPT's tokenization, which is the focus of tiktoken, but it could speed things up when the library is used for other tokenizers.
That is: we iterate through the allowed special tokens with a simple substring find and cache the locations of those tokens, instead of building out a combined special-token regex.
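A minimal Python sketch of that idea (the function name and signature are mine for illustration, not tokendagger's actual API):

```python
def find_special_tokens(
    text: str,
    special_tokens: dict[str, int],
    allowed: set[str],
) -> list[tuple[int, int, int]]:
    """Locate allowed special tokens via plain substring search.

    Instead of compiling one large alternation regex over every special
    token in the vocabulary, scan for each *allowed* token with str.find
    and cache the match spans. This wins when the special-token vocabulary
    is large but the caller only permits a small subset of it.
    """
    spans: list[tuple[int, int, int]] = []
    for tok in allowed:
        tok_id = special_tokens[tok]
        start = 0
        while True:
            idx = text.find(tok, start)
            if idx == -1:
                break
            # Cache (start, end, token_id) for this occurrence.
            spans.append((idx, idx + len(tok), tok_id))
            start = idx + len(tok)
    spans.sort()  # restore document order across different tokens
    return spans


# Example: a large special-token table, but only one token allowed.
specials = {"<|endoftext|>": 100257, "<|fim_prefix|>": 100258}
text = "hello<|endoftext|>world"
print(find_special_tokens(text, specials, {"<|endoftext|>"}))
# → [(5, 18, 100257)]
```

The cached spans can then be used to split the input, with ordinary BPE tokenization applied to the text between them.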