Skip to content

Conversation

@jerry-024
Copy link
Contributor

@jerry-024 jerry-024 commented Dec 1, 2025

Purpose

Introduce vector index support for Paimon using Apache Lucene 9.x to enable:

  • Approximate nearest neighbor (ANN) search for vector embeddings
  • Support for semantic search use cases
  • Integration with ML/AI workflows

Implementation

  • Uses Lucene's HNSW algorithm for efficient ANN search
  • Supports float32 and byte vectors
  • Supports COSINE, DOT_PRODUCT, and EUCLIDEAN similarity metrics

Tests

  • VectorGlobalIndexTest

API and Format

Documentation

@jerry-024 jerry-024 marked this pull request as draft December 1, 2025 09:37
* upstream/master:
  [test] Fix test SparkWriteITCase.testTruncatePartitionValueNull
  [orc] Limiting Memory Usage of OrcBulkWriter When Writing VectorizedRowBatch (apache#6590)
  [arrow] Improve customization capabilities for data type conversion. (apache#6695)
  [spark] Fix NPE in spark truncate null partitions
  [core] Exclude .class files from sources.jar (apache#6707)
  [core] DataEvolutionFileStoreScan should not filter files by read type when it contains no physical columns. (apache#6714)
  [spark] Update scalafmt version to 3.10.2 (apache#6709)
  [variant] Extract only required columns when reading shredded variants (apache#6720)
  [python] Fix read large volume of blob data (apache#6701)
  [flink] Support cdc source (apache#6606)
  [hive] fix splitting for bucket tables (apache#6594)
  [spark] Update spark build topology for global index (2) (apache#6703)
  [test][spark] Fix the flaky test setDefaultDatabase (apache#6696)
  [spark] Update global index build topology (apache#6700)
  [spark] Introduce global file index builder on spark (apache#6684)
  [python] Fix with_shard feature for blob data (apache#6691)
  [test][spark] Add alter with incompatible col type test case (apache#6689)
  [variant] Introduce withVariantAccess in ReadBuilder (apache#6685)
  [python] Fix file name prefix in postpone mode. (apache#6668)
* upstream/master:
  [core] format table: throw exception when get partiiton from file system (apache#6730)
  [core] Enrich static creation methods for GlobalIndexResult
  [spark] Fix duplicate column error when merging on _ROW_ID (apache#6727)
  [core] Remove Builder of SplitReadProvider.Context
  [core] Intruduce bitmap64 for GlobalIndexResult and support with ranges filter push down (apache#6725)
  [spark] Push down partition filter when compactUnAwareBucketTable (apache#6663)
  [core] Support the new split description for reading data (apache#6711)
  [python] support paimon as  ray datasource for distributed processing (apache#6686)
  [docs] Add missing PaimonSparkSessionExtensions to Spark configs (apache#6729)
  Revert "[flink] Flink batch job support specify partition with max_pt() and max_two_pt()" without review
  [flink] Flink batch job support specify partition with max_pt() and max_two_pt()
  [spark] Update the display of size and timing metrics (apache#6722)
  [core] Append commit should check bucket number if latest commit user is different (apache#6723)
  [core] ShardScanner should not keep partition parameter (apache#6724)
@jerry-024 jerry-024 requested a review from Copilot December 4, 2025 07:46
Copilot finished reviewing on behalf of jerry-024 December 4, 2025 07:48
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new paimon-vector module that implements vector indexing capabilities using Apache Lucene 9.12.3. The implementation provides efficient approximate nearest neighbor (ANN) search using the HNSW algorithm with support for multiple similarity metrics (COSINE, DOT_PRODUCT, EUCLIDEAN, MAX_INNER_PRODUCT) and both float and byte vector types.

Key Changes:

  • New paimon-vector module with Lucene-based vector indexing for similarity search
  • Comprehensive test suite with 11 test cases covering various scenarios (dimensions, metrics, sharding)
  • Service provider interface integration for pluggable global indexer factory

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
pom.xml Adds paimon-vector profile activated for JDK 11+
paimon-vector/pom.xml Defines new module with Lucene 9.12.3 dependencies
VectorGlobalIndexerFactory.java Factory implementation for creating vector indexers via SPI
VectorGlobalIndexer.java Main indexer that creates writer and reader instances
VectorGlobalIndexWriter.java Implements vector index writing using Lucene HNSW with batch support
VectorGlobalIndexReader.java Implements KNN search across multiple index shards with Lucene
VectorIndexOptions.java Configuration options for dimension, metric, HNSW parameters
VectorIndexMetadata.java Serializable metadata for vector index parameters
VectorIndex.java Abstract base class for vector index entries
FloatVectorIndex.java Float vector implementation with Lucene KnnFloatVectorField
ByteVectorIndex.java Byte vector implementation with Lucene KnnByteVectorField
META-INF/.../GlobalIndexerFactory Service provider configuration registering VectorGlobalIndexerFactory
VectorGlobalIndexTest.java Comprehensive test suite covering write/read, metrics, dimensions, multi-shard scenarios

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jerry-024 jerry-024 requested a review from Copilot December 4, 2025 08:46
Copilot finished reviewing on behalf of jerry-024 December 4, 2025 08:49
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 10 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 388 to 389
// Apple [0.9, 0.1] and Banana [0.8, 0.2] and Orange [0.85, 0.15] are similar (fruits)
// Car [0.1, 0.9], Bike [0.15, 0.85], and Truck [0.2, 0.8] are different (vehicles)
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says "Apple [0.9, 0.1] and Banana [0.8, 0.2] and Orange [0.85, 0.15] are similar (fruits)" but this could be misleading because the similarity actually depends on the metric being used (EUCLIDEAN in this test). The comment describes conceptual similarity but doesn't clarify that these are close in Euclidean distance.

Suggested change
// Apple [0.9, 0.1] and Banana [0.8, 0.2] and Orange [0.85, 0.15] are similar (fruits)
// Car [0.1, 0.9], Bike [0.15, 0.85], and Truck [0.2, 0.8] are different (vehicles)
// Apple [0.9, 0.1], Banana [0.8, 0.2], and Orange [0.85, 0.15] are close in Euclidean distance (fruits)
// Car [0.1, 0.9], Bike [0.15, 0.85], and Truck [0.2, 0.8] are close to each other in Euclidean distance (vehicles), but far from the fruit vectors.

Copilot uses AI. Check for mistakes.
}
return byteOut.toByteArray();
}

Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The serializeMetadata method only serializes metadata but there's no corresponding deserializeMetadata method. This creates an asymmetric API and may cause confusion. Consider adding a matching deserialization method for consistency and future use.

Suggested change
public static VectorIndexMetadata deserializeMetadata(byte[] data) throws IOException {
try (java.io.DataInputStream dataIn = new java.io.DataInputStream(new java.io.ByteArrayInputStream(data))) {
int dimension = dataIn.readInt();
String similarityFunction = dataIn.readUTF();
int m = dataIn.readInt();
int efConstruction = dataIn.readInt();
return new VectorIndexMetadata(dimension, similarityFunction, m, efConstruction);
}
}

Copilot uses AI. Check for mistakes.
private static IndexWriterConfig getIndexWriterConfig(
int m, int efConstruction, int writeBufferSize) {
IndexWriterConfig config = new IndexWriterConfig();
config.setRAMBufferSizeMB(writeBufferSize); // Increase buffer for better performance
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says "Increase buffer for better performance" but this is setting the buffer to a configurable value, not necessarily increasing it. The comment should reflect that this is configuring the RAM buffer size based on user settings.

Suggested change
config.setRAMBufferSizeMB(writeBufferSize); // Increase buffer for better performance
config.setRAMBufferSizeMB(writeBufferSize); // Configure RAM buffer size based on user settings

Copilot uses AI. Check for mistakes.
Comment on lines 456 to 457
// Test 3: Query with vector similar to "Car" - should find Car first
float[] carQueryVector = new float[] {0.15f, 0.85f};
GlobalIndexResult carSearchResult = reader.search(carQueryVector, 1);

List<Long> carResultIds = new ArrayList<>();
carSearchResult.results().iterator().forEachRemaining(carResultIds::add);

assertThat(carResultIds).as("Query similar to Car should find Car").contains(carId);
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test comment and assertion are incorrect. The query vector [0.15f, 0.85f] is exactly equal to the Bike vector [0.15f, 0.85f] (line 394), not the Car vector [0.1f, 0.9f] (line 392). The test should either: (1) change the comment and assertion to expect Bike, or (2) change the query vector to be closer to Car, such as [0.12f, 0.88f].

Copilot uses AI. Check for mistakes.
Comment on lines 580 to 587
private Options createDefaultOptions() {
Options options = new Options();
options.setInteger("vector.dim", 128);
options.setString("vector.metric", "COSINE");
options.setInteger("vector.m", 16);
options.setInteger("vector.ef-construction", 100);
return options;
}
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The createDefaultOptions() method sets the metric to "COSINE" but many tests override this to use "EUCLIDEAN". Consider renaming this method to something like createBaseOptions() or documenting that these are base settings that tests typically override, to avoid confusion about what the "default" actually means.

Copilot uses AI. Check for mistakes.
# limitations under the License.
################################################################################

name: UTCase and ITCase Spark 4.x
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The workflow name "UTCase and ITCase Spark 4.x" is incorrect for a paimon-vector workflow. This should be something like "UTCase and ITCase Paimon Vector" to accurately reflect what this workflow tests.

Suggested change
name: UTCase and ITCase Spark 4.x
name: UTCase and ITCase Paimon Vector

Copilot uses AI. Check for mistakes.
.stringType()
.defaultValue("EUCLIDEAN")
.withDescription(
"The similarity metric for vector search (COSINE, DOT_PRODUCT, EUCLIDEAN, MAX_INNER_PRODUCT)");
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default similarity metric is "EUCLIDEAN" but the documentation in the PR description mentions COSINE, DOT_PRODUCT, and EUCLIDEAN without specifying which is the default. Consider documenting why EUCLIDEAN was chosen as the default, as COSINE is often preferred for semantic search use cases mentioned in the PR purpose.

Suggested change
"The similarity metric for vector search (COSINE, DOT_PRODUCT, EUCLIDEAN, MAX_INNER_PRODUCT)");
"The similarity metric for vector search. Options are COSINE, DOT_PRODUCT, EUCLIDEAN, MAX_INNER_PRODUCT. " +
"The default is EUCLIDEAN for compatibility with common vector index algorithms. " +
"Note: COSINE is often preferred for semantic search use cases, but EUCLIDEAN is used here as the default for broader applicability.");

Copilot uses AI. Check for mistakes.
Comment on lines +237 to +244
p -> {
try {
java.nio.file.Files.delete(p);
} catch (IOException e) {
// Ignore cleanup errors
}
});
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lambda silently ignores cleanup errors which could mask issues during testing or debugging. Consider at least logging these errors instead of completely ignoring them, or document why silent failure is acceptable here.

Copilot uses AI. Check for mistakes.
Comment on lines +218 to +225
.forEach(
p -> {
try {
java.nio.file.Files.delete(p);
} catch (IOException e) {
// Ignore cleanup errors
}
});
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lambda silently ignores cleanup errors which could mask issues during testing or debugging. Consider at least logging these errors instead of completely ignoring them, or document why silent failure is acceptable here. This is the same issue as in VectorGlobalIndexWriter.

Copilot uses AI. Check for mistakes.
Set<String> fieldsToLoad = Set.of(VectorIndex.ROW_ID_FIELD);
// Collect row IDs from results
for (org.apache.lucene.search.ScoreDoc scoreDoc : topDocs.scoreDocs) {
float rawScore = scoreDoc.score;
Copy link

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable rawScore is declared but never used. Consider removing it or using it for logging/debugging purposes if score information is relevant.

Suggested change
float rawScore = scoreDoc.score;

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant