LSHRS


Redis-backed locality-sensitive hashing toolkit that stores bucket membership in Redis while keeping the heavy vector payloads in your primary datastore.



Overview

LSHRS orchestrates the full locality-sensitive hashing (LSH) workflow:

  1. Hash incoming vectors into stable banded signatures via random projections.
  2. Store only bucket membership in Redis for low-latency candidate enumeration.
  3. Optionally rerank candidates using cosine similarity with vectors fetched from your system of record.

The out-of-the-box configuration chooses bands/rows automatically, pipelines Redis operations, and exposes hooks for streaming data ingestion, persistence, and operational maintenance.
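The hashing step above can be sketched with plain NumPy. The projection matrix, band layout, and bucket-key format below are illustrative assumptions, not LSHRS internals:

```python
import numpy as np

def banded_signature(vec, projections, num_bands):
    """Hash a vector into one bucket key per band via random projections.

    projections: (num_perm, dim) Gaussian matrix; the sign of each
    projection yields one bit, and bits are grouped into bands.
    """
    bits = (projections @ vec >= 0).astype(np.uint8)   # num_perm sign bits
    bands = bits.reshape(num_bands, -1)                # num_bands x rows
    # One hashable key per band; similar vectors collide in >= 1 band.
    return [f"{b}:{band.tobytes().hex()}" for b, band in enumerate(bands)]

rng = np.random.default_rng(0)
proj = rng.standard_normal((256, 768))
v = rng.standard_normal(768)
keys = banded_signature(v, proj, num_bands=32)
# A slightly perturbed vector shares most band keys with the original.
near = banded_signature(v + 0.01 * rng.standard_normal(768), proj, 32)
```

In the real system, each band key maps to a Redis set whose members are document indices; a query reads its own band keys and unions the sets to enumerate candidates.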

Architecture Snapshot

| Concern | Component | Description |
| --- | --- | --- |
| Hashing | LSHHasher | Generates banded random-projection signatures. |
| Storage | RedisStorage | Persists bucket membership using Redis sets and pipelines for batch writes. |
| Ingestion | LSHRS.create_signatures() | Streams vectors from PostgreSQL or Parquet via pluggable loaders. |
| Reranking | top_k_cosine() | Computes cosine similarity for candidate reranking. |
| Configuration | get_optimal_config() | Picks band/row counts that match a target similarity threshold. |

Key Features

Installation

PyPI

uv add lshrs

Or, to include the PostgreSQL extra:

uv add 'lshrs[postgres]'

From source checkout

git clone https://github.com/mxngjxa/lshrs.git
cd lshrs
uv sync

Note

The project targets Python ≥ 3.13 as defined in pyproject.toml.

Optional extras

  • PostgreSQL streaming requires psycopg. Install with uv add 'lshrs[postgres]' or uv add 'psycopg[binary]'.
  • Parquet ingestion requires pyarrow. Install with uv add pyarrow or include it in your extras.

Quick Start

import numpy as np
from lshrs import LSHRS

def fetch_vectors(indices: list[int]) -> np.ndarray:
    # Replace with your vector store retrieval (PostgreSQL, disk, object store, etc.)
    embeddings = np.load("vectors.npy")
    return embeddings[indices]

lsh = LSHRS(
    dim=768,
    num_perm=256,
    redis_host="localhost",
    redis_prefix="demo",
    vector_fetch_fn=fetch_vectors,
)

# Stream index construction from PostgreSQL
lsh.create_signatures(
    format="postgres",
    dsn="postgresql://user:pass@localhost/db",
    table="documents",
    index_column="doc_id",
    vector_column="embedding",
)

# Insert an ad-hoc document
lsh.ingest(42, np.random.randn(768).astype(np.float32))

# Retrieve candidates
query = np.random.randn(768).astype(np.float32)
top10 = lsh.get_top_k(query, topk=10)
reranked = lsh.get_above_p(query, p=0.2)

The code above exercises LSHRS.create_signatures(), LSHRS.ingest(), LSHRS.get_top_k(), and LSHRS.get_above_p().

Ingestion Pipelines

Streaming from PostgreSQL

iter_postgres_vectors() yields (indices, vectors) batches using server-side cursors:

lsh.create_signatures(
    format="postgres",
    dsn="postgresql://reader:[email protected]/search",
    table="embeddings",
    index_column="item_id",
    vector_column="embedding",
    batch_size=5_000,
    where_clause="updated_at >= NOW() - INTERVAL '1 day'",
)

Tip

Provide a custom connection_factory if you need pooled connections or TLS configuration.

Streaming from Parquet

iter_parquet_vectors() supports memory-friendly batch loads from Parquet files:

for ids, batch in iter_parquet_vectors(
    "captures/2024-01-embeddings.parquet",
    index_column="document_id",
    vector_column="embedding",
    batch_size=8_192,
):
    lsh.index(ids, batch)

Important

Install pyarrow prior to using the Parquet loader; otherwise iter_parquet_vectors() raises ImportError.

Manual or Buffered Ingestion

Querying Modes

LSHRS.query() provides two complementary retrieval patterns:

| Mode | When to use | Result |
| --- | --- | --- |
| Top-k (top_p=None) | Latency-critical scenarios that only require coarse candidates. | Returns List[int] ordered by band collisions. |
| Top-p (top_p=0.0–1.0) | Precision-sensitive flows that can rerank using original vectors. | Returns List[Tuple[int, float]] of (index, cosine_similarity) pairs. |

Caution

Reranking requires configuring vector_fetch_fn when instantiating LSHRS; otherwise top-p queries raise RuntimeError.
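The reranking step amounts to normalized dot products over the candidate vectors returned by vector_fetch_fn. A minimal sketch, assuming candidates arrive as an (n, dim) array (the helper name rerank_top_k is illustrative, not the library's top_k_cosine signature):

```python
import numpy as np

def rerank_top_k(query, candidate_ids, candidate_vectors, k):
    """Return the k (id, cosine_similarity) pairs most similar to query."""
    q = query / np.linalg.norm(query)
    norms = np.linalg.norm(candidate_vectors, axis=1)
    sims = (candidate_vectors @ q) / norms   # cosine per candidate
    order = np.argsort(-sims)[:k]            # indices by descending similarity
    return [(candidate_ids[i], float(sims[i])) for i in order]

ids = [7, 11, 42]
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]], dtype=np.float32)
query = np.array([1.0, 0.1], dtype=np.float32)
top2 = rerank_top_k(query, ids, vecs, k=2)
```

Note that zero-length candidate vectors would divide by zero here, which is why the library raises "Cannot normalize zero vector" (see Troubleshooting).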

Supporting helpers:

Persistence & Lifecycle

| Operation | Purpose | Reference |
| --- | --- | --- |
| Snapshot configuration | Inspect runtime parameters and Redis namespace. | LSHRS.stats() |
| Flush & clear | Remove all Redis buckets for the configured prefix. | LSHRS.clear() |
| Hard delete members | Remove specific indices across all buckets. | LSHRS.delete() |
| Persist projections | Save configuration and projection matrices to disk. | LSHRS.save_to_disk() |
| Restore projections | Rebuild an instance using saved matrices. | LSHRS.load_from_disk() |

Warning

LSHRS.clear() is irreversible—every key with the configured prefix is deleted. Back up state with LSHRS.save_to_disk() beforehand if you need to rebuild.

Performance & Scaling Guidelines

  • Choose sensible hash parameters: get_optimal_config() finds bands/rows that approximate your target similarity threshold. Inspect S-curve behavior with compute_collision_probability().
  • Normalize inputs: Pre-normalize vectors or rely on l2_norm() for consistent cosine scores.
  • Batch ingestion: When indexing large volumes, route operations through LSHRS.index() to let RedisStorage.batch_add() coalesce writes.
  • Monitor bucket sizes: Large buckets indicate low selectivity. Adjust num_perm, num_bands, or the similarity threshold to trade precision vs. recall.
  • Pipeline warmup: Ensure buffered operations are flushed (LSHRS._flush_buffer() runs internally during indexing) before measuring latency or persisting state.
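The S-curve that get_optimal_config() tunes follows the standard LSH banding bound: with b bands of r rows, two items whose per-hash collision probability is s become candidates with probability 1 - (1 - s^r)^b. A quick way to eyeball the trade-off (the function name here is for illustration only):

```python
def collision_probability(s: float, bands: int, rows: int) -> float:
    """P(candidate pair) for per-hash collision probability s,
    under `bands` bands of `rows` rows each."""
    return 1.0 - (1.0 - s ** rows) ** bands

# With 32 bands x 8 rows the curve is steep around s ~ 0.65:
for s in (0.4, 0.6, 0.8):
    print(f"s={s}: p={collision_probability(s, bands=32, rows=8):.3f}")
```

Raising rows pushes the threshold up and sharpens the curve (fewer false positives); raising bands pulls it down (fewer false negatives).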

Troubleshooting

| Symptom | Likely Cause | Resolution |
| --- | --- | --- |
| ImportError: psycopg is required | PostgreSQL loader invoked without optional dependency. | Install psycopg[binary] or avoid format="postgres". |
| ValueError: Vectors must have shape (n, dim) | Supplied batch dimension mismatched the configured dim. | Ensure all vectors match the dim passed to LSHRS.__init__(). |
| ValueError: Cannot normalize zero vector | Zero-length vectors were passed to cosine scoring utilities. | Filter zero vectors before reranking or normalize upstream. |
| Empty search results | Buckets never flushed to Redis. | Call LSHRS.index() (auto flushes) or explicitly invoke LSHRS._flush_buffer() before querying. |
| Extremely large buckets | Similarity threshold too low / insufficient hash bits. | Increase num_perm or tweak target threshold via get_optimal_config(). |

Tip

Use Redis SCAN commands (e.g., SCAN 0 MATCH lsh:*) to inspect bucket distribution during tuning.

API Surface Summary

| Area | Description | Primary Entry Point |
| --- | --- | --- |
| Ingestion orchestration | Bulk streaming with source-aware loaders. | LSHRS.create_signatures() |
| Batch ingestion | Hash and store vectors already in memory. | LSHRS.index() |
| Single ingestion | Add or update one vector id on the fly. | LSHRS.ingest() |
| Candidate enumeration | General-purpose search with optional reranking. | LSHRS.query() |
| Hash persistence | Save and restore LSH projection matrices. | LSHRS.save_to_disk() / LSHRS.load_from_disk() |
| Redis maintenance | Prefix-aware key deletion and batch removal. | RedisStorage.clear() / RedisStorage.remove_indices() |
| Probability utilities | Analyze band/row trade-offs and false rates. | compute_collision_probability() / compute_false_rates() |

Development & Testing

  1. Install development dependencies:

    uv sync
  2. Run the test suite:

    uv run --dev pytest
  3. Lint (if you have ruff configured):

    uv run --dev ruff check

Note

Example snippets in this README are intended to be run under Python 3.13 with NumPy 2.x and Redis ≥ 7 as specified in pyproject.toml.

License

Released under the MIT License; see the LICENSE file for details.

About

Locality Sensitive Hashing (LSH) based recommendation system. Integrates with Redis and your own database.
