
Interest in porting any of the implementation from TokenDagger? (With Possible PR) #441

@Jeffrharr

I came across TokenDagger, a drop-in replacement for tiktoken that appears to have some different performance characteristics.

Would there be any interest in porting those changes over to tiktoken? I see two big differences: one is how it handles special tokens, and the other is its regex-related modifications, i.e. using PCRE2 with JIT compilation (though Hyperscan might be even faster).

Happy to give it a shot and run a few benchmarks!


Here's a rough implementation of how TokenDagger handles special tokens. It should be more performant when there is a large special-token library but the user is only working with a small subset of those tokens. That is not so useful for GPT tokenization, which is the focus of tiktoken, but it could speed things up when tiktoken is used for other implementations.

I.e., we iterate through the tokens with a simple string find and cache the locations of those tokens instead of building out a dedicated special-token regex.

Jeffrharr#1
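
To make the idea concrete, here is a minimal Python sketch of the find-and-cache approach. It is not the code from the linked PR or from TokenDagger itself. The function names (`find_special_tokens`, `split_on_special_tokens`) and the tie-breaking rule (prefer the longest token when two start at the same position) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple, Union


def find_special_tokens(text: str, allowed: Dict[str, int]) -> List[Tuple[int, str]]:
    """Locate non-overlapping occurrences of the allowed special tokens.

    Hypothetical helper: scans with str.find once per allowed token,
    so the cost scales with the subset in use, not the full library.
    """
    hits: List[Tuple[int, str]] = []
    for token in allowed:
        start = text.find(token)
        while start != -1:
            hits.append((start, token))
            start = text.find(token, start + len(token))

    # Order by position; on ties prefer the longer token (an assumption).
    hits.sort(key=lambda hit: (hit[0], -len(hit[1])))

    # Drop any hit that overlaps one already accepted.
    result: List[Tuple[int, str]] = []
    end = 0
    for start, token in hits:
        if start >= end:
            result.append((start, token))
            end = start + len(token)
    return result


def split_on_special_tokens(
    text: str, allowed: Dict[str, int]
) -> List[Tuple[str, Union[str, int]]]:
    """Split text into ("text", str) and ("special", token_id) segments."""
    segments: List[Tuple[str, Union[str, int]]] = []
    cursor = 0
    for start, token in find_special_tokens(text, allowed):
        if start > cursor:
            segments.append(("text", text[cursor:start]))
        segments.append(("special", allowed[token]))
        cursor = start + len(token)
    if cursor < len(text):
        segments.append(("text", text[cursor:]))
    return segments
```

With `allowed = {"<|endoftext|>": 100257}`, calling `split_on_special_tokens("hello<|endoftext|>world", allowed)` yields a text segment, a special segment carrying the ID 100257, and a second text segment. The ordinary text segments would then go through the regular BPE path.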
