# LSHRS

Redis-backed locality-sensitive hashing toolkit that stores bucket membership in Redis while keeping the heavy vector payloads in your primary datastore.
- [Overview](#overview)
- [Architecture Snapshot](#architecture-snapshot)
- [Key Features](#key-features)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Ingestion Pipelines](#ingestion-pipelines)
- [Querying Modes](#querying-modes)
- [Persistence & Lifecycle](#persistence--lifecycle)
- [Performance & Scaling Guidelines](#performance--scaling-guidelines)
- [Troubleshooting](#troubleshooting)
- [API Surface Summary](#api-surface-summary)
- [Development & Testing](#development--testing)
- [License](#license)
## Overview

LSHRS orchestrates the full locality-sensitive hashing (LSH) workflow:
- Hash incoming vectors into stable banded signatures via random projections.
- Store only bucket membership in Redis for low-latency candidate enumeration.
- Optionally rerank candidates using cosine similarity with vectors fetched from your system of record.
The out-of-the-box configuration chooses bands/rows automatically, pipelines Redis operations, and exposes hooks for streaming data ingestion, persistence, and operational maintenance.
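For intuition, here is a minimal, self-contained sketch of how banded sign-projection signatures could map members to Redis-style bucket keys. The band layout, names, and key format below are illustrative assumptions, not LSHRS internals:

```python
import numpy as np

# Illustrative only: 32 bands x 8 rows = 256 sign bits per vector.
rng = np.random.default_rng(0)
dim, num_perm, num_bands = 768, 256, 32
rows = num_perm // num_bands
projections = rng.standard_normal((num_perm, dim))


def bucket_keys(vec: np.ndarray, prefix: str = "demo") -> list[str]:
    bits = (projections @ vec > 0).astype(np.uint8)  # one sign bit per hyperplane
    return [
        f"{prefix}:band:{b}:" + "".join(map(str, bits[b * rows:(b + 1) * rows]))
        for b in range(num_bands)
    ]


print(bucket_keys(rng.standard_normal(dim))[:3])
```

Similar vectors agree on many sign bits, so with high probability they share at least one band key; candidate enumeration is then a handful of set lookups in Redis.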
## Architecture Snapshot

| Concern | Component | Description |
|---|---|---|
| Hashing | `LSHHasher` | Generates banded random-projection signatures. |
| Storage | `RedisStorage` | Persists bucket membership using Redis sets and pipelines for batch writes. |
| Ingestion | `LSHRS.create_signatures()` | Streams vectors from PostgreSQL or Parquet via pluggable loaders. |
| Reranking | `top_k_cosine()` | Computes cosine similarity for candidate reranking. |
| Configuration | `get_optimal_config()` | Picks band/row counts that match a target similarity threshold. |
## Key Features

- Redis-native buckets: Uses Redis sets for O(1) membership updates and pipelined batch ingestion.
- Progressive indexing: Stream vectors from PostgreSQL (`iter_postgres_vectors()`) or Parquet (`iter_parquet_vectors()`) without exhausting memory.
- Dual retrieval modes: Choose fast top-k collision lookups or cosine-reranked top-p filtering through `LSHRS.query()`.
- Persistable hashing state: Save and reload projection matrices with `LSHRS.save_to_disk()` and `LSHRS.load_from_disk()`.
- Operational safety: Snapshot configuration with `LSHRS.stats()`, clear indices via `LSHRS.clear()`, and surgically delete members using `LSHRS.delete()`.
## Installation

```bash
uv add lshrs
```

Or, with the optional PostgreSQL extra:

```bash
uv add 'lshrs[postgres]'
```

To work on the project from source:

```bash
git clone https://github.com/mxngjxa/lshrs.git
cd lshrs
uv sync --extra dev
```

> [!NOTE]
> The project targets Python ≥ 3.13 as defined in `pyproject.toml`.

- PostgreSQL streaming requires `psycopg`. Install with `uv add 'lshrs[postgres]'` or `uv add 'psycopg[binary]'`.
- Parquet ingestion requires `pyarrow`. Install with `uv add pyarrow` or include it in your extras.
## Quick Start

```python
import numpy as np

from lshrs import LSHRS


def fetch_vectors(indices: list[int]) -> np.ndarray:
    # Replace with your vector store retrieval (PostgreSQL, disk, object store, etc.)
    embeddings = np.load("vectors.npy")
    return embeddings[indices]


lsh = LSHRS(
    dim=768,
    num_perm=256,
    redis_host="localhost",
    redis_prefix="demo",
    vector_fetch_fn=fetch_vectors,
)

# Stream index construction from PostgreSQL
lsh.create_signatures(
    format="postgres",
    dsn="postgresql://user:pass@localhost/db",
    table="documents",
    index_column="doc_id",
    vector_column="embedding",
)

# Insert an ad-hoc document
lsh.ingest(42, np.random.randn(768).astype(np.float32))

# Retrieve candidates
query = np.random.randn(768).astype(np.float32)
top10 = lsh.get_top_k(query, topk=10)
reranked = lsh.get_above_p(query, p=0.2)
```

The code above exercises `LSHRS.create_signatures()`, `LSHRS.ingest()`, `LSHRS.get_top_k()`, and `LSHRS.get_above_p()`.
## Ingestion Pipelines

`iter_postgres_vectors()` yields `(indices, vectors)` batches using server-side cursors:
```python
lsh.create_signatures(
    format="postgres",
    dsn="postgresql://reader:[email protected]/search",
    table="embeddings",
    index_column="item_id",
    vector_column="embedding",
    batch_size=5_000,
    where_clause="updated_at >= NOW() - INTERVAL '1 day'",
)
```

> [!TIP]
> Provide a custom `connection_factory` if you need pooled connections or TLS configuration.
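For example, a factory that forces TLS might look like the sketch below; the exact `connection_factory` signature expected by the loader is an assumption, so verify it against `iter_postgres_vectors()`:

```python
import psycopg


# Hypothetical factory: the loader would call this in place of a plain connect.
def tls_connection(dsn: str) -> psycopg.Connection:
    # Enforce TLS and a modest timeout on every connection the loader opens.
    return psycopg.connect(dsn, sslmode="require", connect_timeout=10)


lsh.create_signatures(
    format="postgres",
    dsn="postgresql://reader:[email protected]/search",
    table="embeddings",
    index_column="item_id",
    vector_column="embedding",
    connection_factory=tls_connection,
)
```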
`iter_parquet_vectors()` supports memory-friendly batch loads from Parquet files:

```python
from lshrs import iter_parquet_vectors

for ids, batch in iter_parquet_vectors(
    "captures/2024-01-embeddings.parquet",
    index_column="document_id",
    vector_column="embedding",
    batch_size=8_192,
):
    lsh.index(ids, batch)
```

> [!IMPORTANT]
> Install `pyarrow` prior to using the Parquet loader; otherwise `iter_parquet_vectors()` raises `ImportError`.
- `LSHRS.index()` ingests vector batches you already hold in memory.
- `LSHRS.ingest()` is ideal for realtime single-document updates.
- Under the hood, `RedisStorage.batch_add()` leverages Redis pipelines for throughput.
## Querying Modes

`LSHRS.query()` provides two complementary retrieval patterns:
| Mode | When to use | Result |
|---|---|---|
| Top-k (`top_p=None`) | Latency-critical scenarios that only require coarse candidates. | Returns `List[int]` ordered by band collisions. |
| Top-p (`top_p=0.0–1.0`) | Precision-sensitive flows that can rerank using original vectors. | Returns `List[Tuple[int, float]]` of (index, cosine_similarity) pairs. |
> [!CAUTION]
> Reranking requires configuring `vector_fetch_fn` when instantiating `LSHRS`; otherwise top-p queries raise `RuntimeError`.
Supporting helpers:

- `LSHRS.get_top_k()` wraps `query` for pure top-k retrieval.
- `LSHRS.get_above_p()` wraps `query` with a similarity-mass cutoff.
- Cosine scoring is provided by `cosine_similarity()` and `top_k_cosine()`.
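Both patterns through `LSHRS.query()` directly; a minimal sketch that assumes the `topk`/`top_p` keyword names shown in the Quick Start and the table above:

```python
# Coarse candidates only: top_p=None skips the rerank step entirely.
candidate_ids = lsh.query(query, topk=25, top_p=None)

# Cosine-reranked: requires vector_fetch_fn; yields (index, similarity) pairs.
for idx, score in lsh.query(query, topk=25, top_p=0.3):
    print(idx, round(score, 3))
```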
## Persistence & Lifecycle

| Operation | Purpose | Reference |
|---|---|---|
| Snapshot configuration | Inspect runtime parameters and Redis namespace. | `LSHRS.stats()` |
| Flush & clear | Remove all Redis buckets for the configured prefix. | `LSHRS.clear()` |
| Hard delete members | Remove specific indices across all buckets. | `LSHRS.delete()` |
| Persist projections | Save configuration and projection matrices to disk. | `LSHRS.save_to_disk()` |
| Restore projections | Rebuild an instance using saved matrices. | `LSHRS.load_from_disk()` |
> [!WARNING]
> `LSHRS.clear()` is irreversible: every key with the configured prefix is deleted. Back up state with `LSHRS.save_to_disk()` beforehand if you need to rebuild.
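A minimal lifecycle sketch; the save/load path argument and the `load_from_disk()` keyword names are assumptions, so check the method signatures:

```python
# Inspect runtime parameters, then persist projections before destructive ops.
print(lsh.stats())
lsh.save_to_disk("lsh_state/demo")  # path argument shape is assumed

# Rebuild an equivalent instance from the saved matrices (kwargs assumed).
restored = LSHRS.load_from_disk(
    "lsh_state/demo",
    redis_host="localhost",
    vector_fetch_fn=fetch_vectors,
)

lsh.delete([42])  # remove specific members across all buckets
lsh.clear()       # irreversible: drops every key under the configured prefix
```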
## Performance & Scaling Guidelines

- Choose sensible hash parameters: `get_optimal_config()` finds bands/rows that approximate your target similarity threshold. Inspect S-curve behavior with `compute_collision_probability()` (see the sketch after this list).
- Normalize inputs: Pre-normalize vectors or rely on `l2_norm()` for consistent cosine scores.
- Batch ingestion: When indexing large volumes, route operations through `LSHRS.index()` to let `RedisStorage.batch_add()` coalesce writes.
- Monitor bucket sizes: Large buckets indicate low selectivity. Adjust `num_perm`, `num_bands`, or the similarity threshold to trade precision vs. recall.
- Pipeline warmup: Flush outstanding operations with `LSHRS._flush_buffer()` (indirectly called) before measuring latency or persisting state.
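As a rule of thumb, with `b` bands of `r` rows each, two items whose hashes agree on any given row with probability `p` collide in at least one band with probability `1 - (1 - p^r)^b`; for sign random projections, `p = 1 - arccos(s)/pi` at cosine similarity `s`. The standalone sketch below is for intuition only; treat the library's `compute_collision_probability()` as the reference implementation:

```python
import math


def collision_probability(cos_sim: float, bands: int, rows: int) -> float:
    # Per-hyperplane agreement probability for sign random projections.
    p = 1.0 - math.acos(cos_sim) / math.pi
    # Probability of matching on at least one full band.
    return 1.0 - (1.0 - p**rows) ** bands


# Example: num_perm=256 split into 32 bands of 8 rows.
for s in (0.5, 0.7, 0.8, 0.9):
    print(f"cos_sim={s}: P(candidate) ~ {collision_probability(s, 32, 8):.4f}")
```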
## Troubleshooting

| Symptom | Likely Cause | Resolution |
|---|---|---|
| `ImportError: psycopg is required` | PostgreSQL loader invoked without the optional dependency. | Install `psycopg[binary]` or avoid `format="postgres"`. |
| `ValueError: Vectors must have shape (n, dim)` | Supplied batch dimension mismatched the configured `dim`. | Ensure all vectors match the `dim` passed to `LSHRS.__init__()`. |
| `ValueError: Cannot normalize zero vector` | Zero-length vectors were passed to cosine scoring utilities. | Filter zero vectors before reranking or normalize upstream. |
| Empty search results | Buckets never flushed to Redis. | Call `LSHRS.index()` (auto-flushes) or explicitly invoke `LSHRS._flush_buffer()` before querying. |
| Extremely large buckets | Similarity threshold too low / insufficient hash bits. | Increase `num_perm` or tweak the target threshold via `get_optimal_config()`. |
> [!TIP]
> Use Redis `SCAN` commands (e.g., `SCAN 0 MATCH lsh:*`) to inspect bucket distribution during tuning.
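The same inspection from Python with `redis-py`; a sketch that assumes buckets are Redis sets stored under your configured prefix (here `demo:*`, so adjust to your `redis_prefix`):

```python
from collections import Counter

import redis

r = redis.Redis(host="localhost")
sizes = Counter()
# SCAN is non-blocking; SCARD assumes every matched key is a set bucket.
for key in r.scan_iter(match="demo:*"):
    sizes[r.scard(key)] += 1

for size, count in sorted(sizes.items()):
    print(f"{count} buckets with {size} members")
```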
## API Surface Summary

| Area | Description | Primary Entry Point |
|---|---|---|
| Ingestion orchestration | Bulk streaming with source-aware loaders. | `LSHRS.create_signatures()` |
| Batch ingestion | Hash and store vectors already in memory. | `LSHRS.index()` |
| Single ingestion | Add or update one vector id on the fly. | `LSHRS.ingest()` |
| Candidate enumeration | General-purpose search with optional reranking. | `LSHRS.query()` |
| Hash persistence | Save and restore LSH projection matrices. | `LSHRS.save_to_disk()` / `LSHRS.load_from_disk()` |
| Redis maintenance | Prefix-aware key deletion and batch removal. | `RedisStorage.clear()` / `RedisStorage.remove_indices()` |
| Probability utilities | Analyze band/row trade-offs and false rates. | `compute_collision_probability()` / `compute_false_rates()` |
## Development & Testing

- Install development dependencies:

  ```bash
  uv sync --extra dev
  ```

- Run the test suite:

  ```bash
  uv run --dev pytest
  ```

- Lint (if you have `ruff` configured):

  ```bash
  uv run --dev ruff check
  ```
> [!NOTE]
> Example snippets in this README are intended to be run under Python 3.13 with NumPy 2.x and Redis ≥ 7 as specified in `pyproject.toml`.
## License

Licensed under the terms of [LICENSE](LICENSE).