Description
Describe your feature request
I recently needed to search an nginx cache (using slicing) containing several hundred GiB of data. Each file contains the HTTP response header for its slice. A slice can be over a MiB in size (this varies with the nginx config), but the content is irrelevant: all I needed was within the first 1-2 KiB of every file.
Finding specific cache slices was painful because ripgrep has to stream the entire file whenever there is no match, which is the case for the vast majority of files. (Each file contains a binary header, so binary mode must be used.)
My idea is a built-in equivalent of head -c that limits the search to the first N bytes of each file, so that ripgrep does not needlessly examine file contents beyond a known header.
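To illustrate the idea, here is a minimal sketch using only the Rust standard library. It is not the code from my draft branch and not ripgrep's API: the function name `head_contains` and the `limit` parameter are made up for this example, and it does a plain byte-substring match instead of a regex search.

```rust
use std::io::{self, Read};

/// Hypothetical helper (not part of ripgrep): reports whether `needle`
/// occurs within the first `limit` bytes of `reader`.
fn head_contains<R: Read>(reader: R, needle: &[u8], limit: u64) -> io::Result<bool> {
    let mut head = Vec::new();
    // `Read::take` caps the reader, so at most `limit` bytes are read
    // no matter how large the underlying file is.
    reader.take(limit).read_to_end(&mut head)?;

    // `windows(0)` would panic, so treat an empty needle as a trivial match.
    if needle.is_empty() {
        return Ok(true);
    }

    // Naive substring scan over the captured head only.
    Ok(head.windows(needle.len()).any(|window| window == needle))
}
```

With a 2 KiB limit, a call such as `head_contains(File::open(path)?, b"KEY: ", 2048)` would read at most 2048 bytes of each cache file, regardless of how large the slice body is.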
An anecdotal benchmark on my system shows search time reduced by a factor of 2000 with a cold filesystem cache and by a factor of 400 with a warm cache. Note that this was measured on a completely different dataset with fewer, larger files, because the original system is unfortunately no longer accessible to me.
I implemented a draft of this feature here and would love a review and, hopefully, a merge: https://github.com/moschroe/ripgrep/tree/feat_head-bytes (should I open a PR?)
It is most likely not as clean as it could be, so I'd be happy to keep improving it until it is acceptable.