LSHRS


Redis-backed locality-sensitive hashing toolkit that stores bucket membership in Redis while keeping the heavy vector payloads in your primary datastore.



Overview

LSHRS orchestrates the full locality-sensitive hashing (LSH) workflow:

  1. Hash incoming vectors into stable banded signatures via random projections.
  2. Store only bucket membership in Redis for low-latency candidate enumeration.
  3. Optionally rerank candidates using cosine similarity with vectors fetched from your system of record.

The out-of-the-box configuration chooses bands/rows automatically, pipelines Redis operations, and exposes hooks for streaming data ingestion, persistence, and operational maintenance.
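The hashing step above can be sketched with plain NumPy. The projection matrix, band layout, and bucket-key format below are illustrative assumptions, not LSHRS internals:

```python
import numpy as np

def banded_signature(vec, projections, num_bands):
    """Hash a vector into one bucket key per band via random projections.

    projections: (num_perm, dim) Gaussian matrix; the sign of each
    projection yields one bit, and bits are grouped into bands.
    """
    bits = (projections @ vec >= 0).astype(np.uint8)   # num_perm sign bits
    bands = bits.reshape(num_bands, -1)                # num_bands x rows
    # One hashable key per band; similar vectors collide in >= 1 band.
    return [f"{b}:{band.tobytes().hex()}" for b, band in enumerate(bands)]

rng = np.random.default_rng(0)
proj = rng.standard_normal((256, 768))
v = rng.standard_normal(768)
keys = banded_signature(v, proj, num_bands=32)
# A slightly perturbed vector shares most band keys with the original.
near = banded_signature(v + 0.01 * rng.standard_normal(768), proj, 32)
```

In the real system, each band key maps to a Redis set whose members are document indices; a query reads its own band keys and unions the sets to enumerate candidates.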

Architecture Snapshot

| Concern | Component | Description |
| --- | --- | --- |
| Hashing | LSHHasher | Generates banded random-projection signatures. |
| Storage | RedisStorage | Persists bucket membership using Redis sets and pipelines for batch writes. |
| Ingestion | LSHRS.create_signatures() | Streams vectors from PostgreSQL or Parquet via pluggable loaders. |
| Reranking | top_k_cosine() | Computes cosine similarity for candidate reranking. |
| Configuration | get_optimal_config() | Picks band/row counts that match a target similarity threshold. |

Key Features

Installation

PyPI

uv add lshrs

Or, to include the PostgreSQL extra:

uv add 'lshrs[postgres]'

From source checkout

git clone https://github.com/mxngjxa/lshrs.git
cd lshrs
uv sync

Note

The project targets Python ≥ 3.13 as defined in pyproject.toml.

Optional extras

  • PostgreSQL streaming requires psycopg. Install with uv add 'lshrs[postgres]' or uv add 'psycopg[binary]'.
  • Parquet ingestion requires pyarrow. Install with uv add pyarrow or include it in your extras.

Quick Start

import numpy as np
from lshrs import LSHRS

def fetch_vectors(indices: list[int]) -> np.ndarray:
    # Replace with your vector store retrieval (PostgreSQL, disk, object store, etc.)
    embeddings = np.load("vectors.npy")
    return embeddings[indices]

lsh = LSHRS(
    dim=768,
    num_perm=256,
    redis_host="localhost",
    redis_prefix="demo",
    vector_fetch_fn=fetch_vectors,
)

# Stream index construction from PostgreSQL
lsh.create_signatures(
    format="postgres",
    dsn="postgresql://user:pass@localhost/db",
    table="documents",
    index_column="doc_id",
    vector_column="embedding",
)

# Insert an ad-hoc document
lsh.ingest(42, np.random.randn(768).astype(np.float32))

# Retrieve candidates
query = np.random.randn(768).astype(np.float32)
top10 = lsh.get_top_k(query, topk=10)
reranked = lsh.get_above_p(query, p=0.2)

The code above exercises LSHRS.create_signatures(), LSHRS.ingest(), LSHRS.get_top_k(), and LSHRS.get_above_p().

Ingestion Pipelines

Streaming from PostgreSQL

iter_postgres_vectors() yields (indices, vectors) batches using server-side cursors:

lsh.create_signatures(
    format="postgres",
    dsn="postgresql://reader:[email protected]/search",
    table="embeddings",
    index_column="item_id",
    vector_column="embedding",
    batch_size=5_000,
    where_clause="updated_at >= NOW() - INTERVAL '1 day'",
)

Tip

Provide a custom connection_factory if you need pooled connections or TLS configuration.

Streaming from Parquet

iter_parquet_vectors() supports memory-friendly batch loads from Parquet files:

for ids, batch in iter_parquet_vectors(
    "captures/2024-01-embeddings.parquet",
    index_column="document_id",
    vector_column="embedding",
    batch_size=8_192,
):
    lsh.index(ids, batch)

Important

Install pyarrow prior to using the Parquet loader; otherwise iter_parquet_vectors() raises ImportError.

Manual or Buffered Ingestion

Querying Modes

LSHRS.query() provides two complementary retrieval patterns:

| Mode | When to use | Result |
| --- | --- | --- |
| Top-k (top_p=None) | Latency-critical scenarios that only require coarse candidates. | Returns List[int] ordered by band collisions. |
| Top-p (top_p=0.0–1.0) | Precision-sensitive flows that can rerank using original vectors. | Returns List[Tuple[int, float]] of (index, cosine_similarity) pairs. |

Caution

Reranking requires configuring vector_fetch_fn when instantiating LSHRS; otherwise top-p queries raise RuntimeError.
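The reranking step amounts to normalized dot products over the candidate vectors returned by vector_fetch_fn. A minimal sketch, assuming candidates arrive as an (n, dim) array (the helper name rerank_top_k is illustrative, not the library's top_k_cosine signature):

```python
import numpy as np

def rerank_top_k(query, candidate_ids, candidate_vectors, k):
    """Return the k (id, cosine_similarity) pairs most similar to query."""
    q = query / np.linalg.norm(query)
    norms = np.linalg.norm(candidate_vectors, axis=1)
    sims = (candidate_vectors @ q) / norms   # cosine per candidate
    order = np.argsort(-sims)[:k]            # indices by descending similarity
    return [(candidate_ids[i], float(sims[i])) for i in order]

ids = [7, 11, 42]
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]], dtype=np.float32)
query = np.array([1.0, 0.1], dtype=np.float32)
top2 = rerank_top_k(query, ids, vecs, k=2)
```

Note that zero-length candidate vectors would divide by zero here, which is why the library raises "Cannot normalize zero vector" (see Troubleshooting).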

Supporting helpers:

Persistence & Lifecycle

| Operation | Purpose | Reference |
| --- | --- | --- |
| Snapshot configuration | Inspect runtime parameters and Redis namespace. | LSHRS.stats() |
| Flush & clear | Remove all Redis buckets for the configured prefix. | LSHRS.clear() |
| Hard delete members | Remove specific indices across all buckets. | LSHRS.delete() |
| Persist projections | Save configuration and projection matrices to disk. | LSHRS.save_to_disk() |
| Restore projections | Rebuild an instance using saved matrices. | LSHRS.load_from_disk() |

Warning

LSHRS.clear() is irreversible—every key with the configured prefix is deleted. Back up state with LSHRS.save_to_disk() beforehand if you need to rebuild.

Performance & Scaling Guidelines

  • Choose sensible hash parameters: get_optimal_config() finds bands/rows that approximate your target similarity threshold. Inspect S-curve behavior with compute_collision_probability().
  • Normalize inputs: Pre-normalize vectors or rely on l2_norm() for consistent cosine scores.
  • Batch ingestion: When indexing large volumes, route operations through LSHRS.index() to let RedisStorage.batch_add() coalesce writes.
  • Monitor bucket sizes: Large buckets indicate low selectivity. Adjust num_perm, num_bands, or the similarity threshold to trade precision vs. recall.
  • Pipeline warmup: Ensure buffered operations are flushed (LSHRS._flush_buffer() runs internally during indexing) before measuring latency or persisting state.
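The S-curve that get_optimal_config() tunes follows the standard LSH banding bound: with b bands of r rows, two items whose per-hash collision probability is s become candidates with probability 1 - (1 - s^r)^b. A quick way to eyeball the trade-off (the function name here is for illustration only):

```python
def collision_probability(s: float, bands: int, rows: int) -> float:
    """P(candidate pair) for per-hash collision probability s,
    under `bands` bands of `rows` rows each."""
    return 1.0 - (1.0 - s ** rows) ** bands

# With 32 bands x 8 rows the curve is steep around s ~ 0.65:
for s in (0.4, 0.6, 0.8):
    print(f"s={s}: p={collision_probability(s, bands=32, rows=8):.3f}")
```

Raising rows pushes the threshold up and sharpens the curve (fewer false positives); raising bands pulls it down (fewer false negatives).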

Troubleshooting

| Symptom | Likely Cause | Resolution |
| --- | --- | --- |
| ImportError: psycopg is required | PostgreSQL loader invoked without optional dependency. | Install psycopg[binary] or avoid format="postgres". |
| ValueError: Vectors must have shape (n, dim) | Supplied batch dimension mismatched the configured dim. | Ensure all vectors match the dim passed to LSHRS.__init__(). |
| ValueError: Cannot normalize zero vector | Zero-length vectors were passed to cosine scoring utilities. | Filter zero vectors before reranking or normalize upstream. |
| Empty search results | Buckets never flushed to Redis. | Call LSHRS.index() (auto flushes) or explicitly invoke LSHRS._flush_buffer() before querying. |
| Extremely large buckets | Similarity threshold too low / insufficient hash bits. | Increase num_perm or tweak target threshold via get_optimal_config(). |

Tip

Use Redis SCAN commands (e.g., SCAN 0 MATCH lsh:*) to inspect bucket distribution during tuning.

API Surface Summary

| Area | Description | Primary Entry Point |
| --- | --- | --- |
| Ingestion orchestration | Bulk streaming with source-aware loaders. | LSHRS.create_signatures() |
| Batch ingestion | Hash and store vectors already in memory. | LSHRS.index() |
| Single ingestion | Add or update one vector id on the fly. | LSHRS.ingest() |
| Candidate enumeration | General-purpose search with optional reranking. | LSHRS.query() |
| Hash persistence | Save and restore LSH projection matrices. | LSHRS.save_to_disk() / LSHRS.load_from_disk() |
| Redis maintenance | Prefix-aware key deletion and batch removal. | RedisStorage.clear() / RedisStorage.remove_indices() |
| Probability utilities | Analyze band/row trade-offs and false rates. | compute_collision_probability() / compute_false_rates() |

Development & Testing

  1. Install development dependencies:

    uv sync
  2. Run the test suite:

    uv run --dev pytest
  3. Lint (if you have ruff configured):

    uv run --dev ruff check

Note

Example snippets in this README are intended to be run under Python 3.13 with NumPy 2.x and Redis ≥ 7 as specified in pyproject.toml.

License

Released under the MIT License; see the LICENSE file for details.

About

Locality Sensitive Hashing (LSH) based recommendation system. Integrates with Redis and your own database.
