add built-in ability to limit search to first N bytes of file/stream (like with head -c) #3035

@moschroe

Describe your feature request

I recently had to search an nginx cache (using slicing) containing several hundred GiB of data. Each file contains the HTTP response header for its slice. The slices can be over a MiB in size (this varies with the nginx config), but their content is irrelevant: everything I needed was within the first 1-2 KiB of every file.

It was a pain to find certain cache slices because ripgrep has to stream the entire file whenever there is no match, which is the case for the vast majority of files. (Each file contains a binary header, so binary mode must be used.)

I thought of having a built-in equivalent of head -c that would prevent needless examination of file contents beyond a known header.
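To illustrate the idea, here is a minimal sketch of a byte-limited search using only the Rust standard library. It is not the code from my draft branch and does not use ripgrep's internals; the helper name head_contains and its parameters are hypothetical. It relies on std::io::Read::take to cap how many bytes are ever pulled from the underlying file or stream.

```rust
use std::io::{self, Read};

/// Hypothetical helper (not part of ripgrep): returns true if `needle`
/// occurs within the first `limit` bytes of `reader`.
/// Bytes beyond `limit` are never requested from the underlying reader.
fn head_contains<R: Read>(reader: R, limit: u64, needle: &[u8]) -> io::Result<bool> {
    // `take` wraps the reader so that it reports EOF after `limit` bytes.
    let mut head = Vec::new();
    reader.take(limit).read_to_end(&mut head)?;

    // An empty needle trivially matches; `windows(0)` would panic.
    if needle.is_empty() {
        return Ok(true);
    }

    // Naive substring scan over the buffered head of the input.
    Ok(head.windows(needle.len()).any(|w| w == needle))
}
```

With a std::fs::File as the reader and a limit of 2048, only the first 2 KiB of each cache file would be read, regardless of how large the slice body is.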

An anecdotal benchmark on my system shows search time reduced by a factor of 2000 with a cold filesystem cache and by a factor of 400 with a warm cache. Note that this was measured on a completely different dataset with fewer, larger files; the original system is unfortunately no longer accessible to me.

I implemented a draft of this feature here and would love a review and, hopefully, a merge: https://github.com/moschroe/ripgrep/tree/feat_head-bytes (should I open a PR?)
It is most likely not as clean as it could be, so I'd be happy to improve it until it is acceptable.

Labels: enhancement (An enhancement to the functionality of the software.)
