Test vector index #6716

jerry-024 · 2025-12-01T09:37:27Z

Purpose

Introduce vector index support for Paimon using Apache Lucene 9.x to enable:

Approximate nearest neighbor (ANN) search for vector embeddings
Support for semantic search use cases
Integration with ML/AI workflows

Implementation

Uses Lucene's HNSW algorithm for efficient ANN search
Supports float32 and byte vectors
Supports COSINE, DOT_PRODUCT, and EUCLIDEAN similarity metrics

Tests

VectorGlobalIndexTest

API and Format

Documentation

* upstream/master: [test] Fix test SparkWriteITCase.testTruncatePartitionValueNull [orc] Limiting Memory Usage of OrcBulkWriter When Writing VectorizedRowBatch (apache#6590) [arrow] Improve customization capabilities for data type conversion. (apache#6695) [spark] Fix NPE in spark truncate null partitions [core] Exclude .class files from sources.jar (apache#6707) [core] DataEvolutionFileStoreScan should not filter files by read type when it contains no physical columns. (apache#6714) [spark] Update scalafmt version to 3.10.2 (apache#6709) [variant] Extract only required columns when reading shredded variants (apache#6720) [python] Fix read large volume of blob data (apache#6701) [flink] Support cdc source (apache#6606) [hive] fix splitting for bucket tables (apache#6594) [spark] Update spark build topology for global index (2) (apache#6703) [test][spark] Fix the flaky test setDefaultDatabase (apache#6696) [spark] Update global index build topology (apache#6700) [spark] Introduce global file index builder on spark (apache#6684) [python] Fix with_shard feature for blob data (apache#6691) [test][spark] Add alter with incompatible col type test case (apache#6689) [variant] Introduce withVariantAccess in ReadBuilder (apache#6685) [python] Fix file name prefix in postpone mode. (apache#6668)

* upstream/master: [core] format table: throw exception when get partiiton from file system (apache#6730) [core] Enrich static creation methods for GlobalIndexResult [spark] Fix duplicate column error when merging on _ROW_ID (apache#6727) [core] Remove Builder of SplitReadProvider.Context [core] Intruduce bitmap64 for GlobalIndexResult and support with ranges filter push down (apache#6725) [spark] Push down partition filter when compactUnAwareBucketTable (apache#6663) [core] Support the new split description for reading data (apache#6711) [python] support paimon as ray datasource for distributed processing (apache#6686) [docs] Add missing PaimonSparkSessionExtensions to Spark configs (apache#6729) Revert "[flink] Flink batch job support specify partition with max_pt() and max_two_pt()" without review [flink] Flink batch job support specify partition with max_pt() and max_two_pt() [spark] Update the display of size and timing metrics (apache#6722) [core] Append commit should check bucket number if latest commit user is different (apache#6723) [core] ShardScanner should not keep partition parameter (apache#6724)

Copilot

Pull request overview

This PR introduces a new paimon-vector module that implements vector indexing capabilities using Apache Lucene 9.12.3. The implementation provides efficient approximate nearest neighbor (ANN) search using the HNSW algorithm with support for multiple similarity metrics (COSINE, DOT_PRODUCT, EUCLIDEAN, MAX_INNER_PRODUCT) and both float and byte vector types.

Key Changes:

New paimon-vector module with Lucene-based vector indexing for similarity search
Comprehensive test suite with 11 test cases covering various scenarios (dimensions, metrics, sharding)
Service provider interface integration for pluggable global indexer factory

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
pom.xml	Adds `paimon-vector` profile activated for JDK 11+
paimon-vector/pom.xml	Defines new module with Lucene 9.12.3 dependencies
VectorGlobalIndexerFactory.java	Factory implementation for creating vector indexers via SPI
VectorGlobalIndexer.java	Main indexer that creates writer and reader instances
VectorGlobalIndexWriter.java	Implements vector index writing using Lucene HNSW with batch support
VectorGlobalIndexReader.java	Implements KNN search across multiple index shards with Lucene
VectorIndexOptions.java	Configuration options for dimension, metric, HNSW parameters
VectorIndexMetadata.java	Serializable metadata for vector index parameters
VectorIndex.java	Abstract base class for vector index entries
FloatVectorIndex.java	Float vector implementation with Lucene KnnFloatVectorField
ByteVectorIndex.java	Byte vector implementation with Lucene KnnByteVectorField
META-INF/.../GlobalIndexerFactory	Service provider configuration registering VectorGlobalIndexerFactory
VectorGlobalIndexTest.java	Comprehensive test suite covering write/read, metrics, dimensions, multi-shard scenarios

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

paimon-vector/src/test/java/org/apache/paimon/vector/VectorGlobalIndexTest.java

paimon-vector/src/main/java/org/apache/paimon/vector/VectorGlobalIndexReader.java

paimon-vector/src/main/java/org/apache/paimon/vector/VectorGlobalIndexWriter.java

paimon-vector/src/main/java/org/apache/paimon/vector/VectorGlobalIndexReader.java

paimon-vector/src/main/java/org/apache/paimon/vector/ByteVectorIndex.java

Copilot

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 10 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-12-04T08:57:24Z

paimon-vector/src/test/java/org/apache/paimon/vector/VectorGlobalIndexTest.java

+        // Apple [0.9, 0.1] and Banana [0.8, 0.2] and Orange [0.85, 0.15] are similar (fruits)
+        // Car [0.1, 0.9], Bike [0.15, 0.85], and Truck [0.2, 0.8] are different (vehicles)


The comment says "Apple [0.9, 0.1] and Banana [0.8, 0.2] and Orange [0.85, 0.15] are similar (fruits)" but this could be misleading because the similarity actually depends on the metric being used (EUCLIDEAN in this test). The comment describes conceptual similarity but doesn't clarify that these are close in Euclidean distance.

Suggested change

// Apple [0.9, 0.1] and Banana [0.8, 0.2] and Orange [0.85, 0.15] are similar (fruits)

// Car [0.1, 0.9], Bike [0.15, 0.85], and Truck [0.2, 0.8] are different (vehicles)

// Apple [0.9, 0.1], Banana [0.8, 0.2], and Orange [0.85, 0.15] are close in Euclidean distance (fruits)

// Car [0.1, 0.9], Bike [0.15, 0.85], and Truck [0.2, 0.8] are close to each other in Euclidean distance (vehicles), but far from the fruit vectors.

Copilot · 2025-12-04T08:57:24Z

paimon-vector/src/main/java/org/apache/paimon/vector/VectorIndexMetadata.java

+        }
+        return byteOut.toByteArray();
+    }
+


The serializeMetadata method only serializes metadata but there's no corresponding deserializeMetadata method. This creates an asymmetric API and may cause confusion. Consider adding a matching deserialization method for consistency and future use.

Suggested change

public static VectorIndexMetadata deserializeMetadata(byte[] data) throws IOException {

try (java.io.DataInputStream dataIn = new java.io.DataInputStream(new java.io.ByteArrayInputStream(data))) {

int dimension = dataIn.readInt();

String similarityFunction = dataIn.readUTF();

int m = dataIn.readInt();

int efConstruction = dataIn.readInt();

return new VectorIndexMetadata(dimension, similarityFunction, m, efConstruction);

}

}

Copilot · 2025-12-04T08:57:24Z

paimon-vector/src/main/java/org/apache/paimon/vector/VectorGlobalIndexWriter.java

+    private static IndexWriterConfig getIndexWriterConfig(
+            int m, int efConstruction, int writeBufferSize) {
+        IndexWriterConfig config = new IndexWriterConfig();
+        config.setRAMBufferSizeMB(writeBufferSize); // Increase buffer for better performance


The comment says "Increase buffer for better performance" but this is setting the buffer to a configurable value, not necessarily increasing it. The comment should reflect that this is configuring the RAM buffer size based on user settings.

Suggested change

config.setRAMBufferSizeMB(writeBufferSize); // Increase buffer for better performance

config.setRAMBufferSizeMB(writeBufferSize); // Configure RAM buffer size based on user settings

Copilot · 2025-12-04T08:57:25Z

paimon-vector/src/test/java/org/apache/paimon/vector/VectorGlobalIndexTest.java

+        // Test 3: Query with vector similar to "Car" - should find Car first
+        float[] carQueryVector = new float[] {0.15f, 0.85f};
+        GlobalIndexResult carSearchResult = reader.search(carQueryVector, 1);
+
+        List<Long> carResultIds = new ArrayList<>();
+        carSearchResult.results().iterator().forEachRemaining(carResultIds::add);
+
+        assertThat(carResultIds).as("Query similar to Car should find Car").contains(carId);


The test comment and assertion are incorrect. The query vector [0.15f, 0.85f] is exactly equal to the Bike vector [0.15f, 0.85f] (line 394), not the Car vector [0.1f, 0.9f] (line 392). The test should either: (1) change the comment and assertion to expect Bike, or (2) change the query vector to be closer to Car, such as [0.12f, 0.88f].

Copilot · 2025-12-04T08:57:25Z

paimon-vector/src/test/java/org/apache/paimon/vector/VectorGlobalIndexTest.java

+    private Options createDefaultOptions() {
+        Options options = new Options();
+        options.setInteger("vector.dim", 128);
+        options.setString("vector.metric", "COSINE");
+        options.setInteger("vector.m", 16);
+        options.setInteger("vector.ef-construction", 100);
+        return options;
+    }


The createDefaultOptions() method sets the metric to "COSINE" but many tests override this to use "EUCLIDEAN". Consider renaming this method to something like createBaseOptions() or documenting that these are base settings that tests typically override, to avoid confusion about what the "default" actually means.

Copilot · 2025-12-04T08:57:25Z

.github/workflows/utitcase-paimon-vector.yml

+# limitations under the License.
+################################################################################
+
+name: UTCase and ITCase Spark 4.x


The workflow name "UTCase and ITCase Spark 4.x" is incorrect for a paimon-vector workflow. This should be something like "UTCase and ITCase Paimon Vector" to accurately reflect what this workflow tests.

Suggested change

name: UTCase and ITCase Spark 4.x

name: UTCase and ITCase Paimon Vector

Copilot · 2025-12-04T08:57:25Z

paimon-vector/src/main/java/org/apache/paimon/vector/VectorIndexOptions.java

+                    .stringType()
+                    .defaultValue("EUCLIDEAN")
+                    .withDescription(
+                            "The similarity metric for vector search (COSINE, DOT_PRODUCT, EUCLIDEAN, MAX_INNER_PRODUCT)");


The default similarity metric is "EUCLIDEAN" but the documentation in the PR description mentions COSINE, DOT_PRODUCT, and EUCLIDEAN without specifying which is the default. Consider documenting why EUCLIDEAN was chosen as the default, as COSINE is often preferred for semantic search use cases mentioned in the PR purpose.

Suggested change

"The similarity metric for vector search (COSINE, DOT_PRODUCT, EUCLIDEAN, MAX_INNER_PRODUCT)");

"The similarity metric for vector search. Options are COSINE, DOT_PRODUCT, EUCLIDEAN, MAX_INNER_PRODUCT. " +

"The default is EUCLIDEAN for compatibility with common vector index algorithms. " +

"Note: COSINE is often preferred for semantic search use cases, but EUCLIDEAN is used here as the default for broader applicability.");

Copilot · 2025-12-04T08:57:26Z

paimon-vector/src/main/java/org/apache/paimon/vector/VectorGlobalIndexWriter.java

+                            p -> {
+                                try {
+                                    java.nio.file.Files.delete(p);
+                                } catch (IOException e) {
+                                    // Ignore cleanup errors
+                                }
+                            });


The lambda silently ignores cleanup errors which could mask issues during testing or debugging. Consider at least logging these errors instead of completely ignoring them, or document why silent failure is acceptable here.

Copilot · 2025-12-04T08:57:26Z

paimon-vector/src/main/java/org/apache/paimon/vector/VectorGlobalIndexReader.java

+                    .forEach(
+                            p -> {
+                                try {
+                                    java.nio.file.Files.delete(p);
+                                } catch (IOException e) {
+                                    // Ignore cleanup errors
+                                }
+                            });


The lambda silently ignores cleanup errors which could mask issues during testing or debugging. Consider at least logging these errors instead of completely ignoring them, or document why silent failure is acceptable here. This is the same issue as in VectorGlobalIndexWriter.

Copilot · 2025-12-04T08:57:26Z

paimon-vector/src/main/java/org/apache/paimon/vector/VectorGlobalIndexReader.java

+                Set<String> fieldsToLoad = Set.of(VectorIndex.ROW_ID_FIELD);
+                // Collect row IDs from results
+                for (org.apache.lucene.search.ScoreDoc scoreDoc : topDocs.scoreDocs) {
+                    float rawScore = scoreDoc.score;


The variable rawScore is declared but never used. Consider removing it or using it for logging/debugging purposes if score information is relevant.

Suggested change

float rawScore = scoreDoc.score;

jerry-024 added 3 commits November 28, 2025 17:27

init

f1b82e9

add demo

04f1c6e

fix

4401379

jerry-024 marked this pull request as draft December 1, 2025 09:37

jerry-024 added 7 commits December 2, 2025 11:28

fix

7ace2ab

change to lucene

785754a

fix

62f7db5

fix

e9e6ed5

fix

08477e1

jerry-024 requested a review from Copilot December 4, 2025 07:46

Copilot started reviewing on behalf of jerry-024 December 4, 2025 07:47 View session

Copilot finished reviewing on behalf of jerry-024 December 4, 2025 07:48

Copilot AI reviewed Dec 4, 2025

View reviewed changes

jerry-024 requested a review from Copilot December 4, 2025 08:46

Copilot started reviewing on behalf of jerry-024 December 4, 2025 08:46 View session

Copilot finished reviewing on behalf of jerry-024 December 4, 2025 08:49

Copilot AI reviewed Dec 4, 2025

View reviewed changes

add test script, split, notice

213332f

jerry-024 force-pushed the test_vector_index branch from 1179721 to 213332f Compare December 4, 2025 09:17

shade lucene

10fb06e

		// Apple [0.9, 0.1] and Banana [0.8, 0.2] and Orange [0.85, 0.15] are similar (fruits)
		// Car [0.1, 0.9], Bike [0.15, 0.85], and Truck [0.2, 0.8] are different (vehicles)

+    public static VectorIndexMetadata deserializeMetadata(byte[] data) throws IOException {
+        try (java.io.DataInputStream dataIn = new java.io.DataInputStream(new java.io.ByteArrayInputStream(data))) {
+            int dimension = dataIn.readInt();
+            String similarityFunction = dataIn.readUTF();
+            int m = dataIn.readInt();
+            int efConstruction = dataIn.readInt();
+            return new VectorIndexMetadata(dimension, similarityFunction, m, efConstruction);
+        }
+    }

	config.setRAMBufferSizeMB(writeBufferSize); // Increase buffer for better performance
	config.setRAMBufferSizeMB(writeBufferSize); // Configure RAM buffer size based on user settings

	name: UTCase and ITCase Spark 4.x
	name: UTCase and ITCase Paimon Vector

-                            "The similarity metric for vector search (COSINE, DOT_PRODUCT, EUCLIDEAN, MAX_INNER_PRODUCT)");
+                            "The similarity metric for vector search. Options are COSINE, DOT_PRODUCT, EUCLIDEAN, MAX_INNER_PRODUCT. " +
+                            "The default is EUCLIDEAN for compatibility with common vector index algorithms. " +
+                            "Note: COSINE is often preferred for semantic search use cases, but EUCLIDEAN is used here as the default for broader applicability.");

Test vector index #6716

Are you sure you want to change the base?

Test vector index #6716

Conversation

jerry-024 commented Dec 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Implementation

Tests

API and Format

Documentation

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Dec 4, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jerry-024 commented Dec 1, 2025 •

edited

Loading