-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Test vector index #6716
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Test vector index #6716
Conversation
* upstream/master: [test] Fix test SparkWriteITCase.testTruncatePartitionValueNull [orc] Limiting Memory Usage of OrcBulkWriter When Writing VectorizedRowBatch (apache#6590) [arrow] Improve customization capabilities for data type conversion. (apache#6695) [spark] Fix NPE in spark truncate null partitions [core] Exclude .class files from sources.jar (apache#6707) [core] DataEvolutionFileStoreScan should not filter files by read type when it contains no physical columns. (apache#6714) [spark] Update scalafmt version to 3.10.2 (apache#6709) [variant] Extract only required columns when reading shredded variants (apache#6720) [python] Fix read large volume of blob data (apache#6701) [flink] Support cdc source (apache#6606) [hive] fix splitting for bucket tables (apache#6594) [spark] Update spark build topology for global index (2) (apache#6703) [test][spark] Fix the flaky test setDefaultDatabase (apache#6696) [spark] Update global index build topology (apache#6700) [spark] Introduce global file index builder on spark (apache#6684) [python] Fix with_shard feature for blob data (apache#6691) [test][spark] Add alter with incompatible col type test case (apache#6689) [variant] Introduce withVariantAccess in ReadBuilder (apache#6685) [python] Fix file name prefix in postpone mode. (apache#6668)
* upstream/master: [core] format table: throw exception when get partiiton from file system (apache#6730) [core] Enrich static creation methods for GlobalIndexResult [spark] Fix duplicate column error when merging on _ROW_ID (apache#6727) [core] Remove Builder of SplitReadProvider.Context [core] Intruduce bitmap64 for GlobalIndexResult and support with ranges filter push down (apache#6725) [spark] Push down partition filter when compactUnAwareBucketTable (apache#6663) [core] Support the new split description for reading data (apache#6711) [python] support paimon as ray datasource for distributed processing (apache#6686) [docs] Add missing PaimonSparkSessionExtensions to Spark configs (apache#6729) Revert "[flink] Flink batch job support specify partition with max_pt() and max_two_pt()" without review [flink] Flink batch job support specify partition with max_pt() and max_two_pt() [spark] Update the display of size and timing metrics (apache#6722) [core] Append commit should check bucket number if latest commit user is different (apache#6723) [core] ShardScanner should not keep partition parameter (apache#6724)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces a new paimon-vector module that implements vector indexing capabilities using Apache Lucene 9.12.3. The implementation provides efficient approximate nearest neighbor (ANN) search using the HNSW algorithm with support for multiple similarity metrics (COSINE, DOT_PRODUCT, EUCLIDEAN, MAX_INNER_PRODUCT) and both float and byte vector types.
Key Changes:
- New
paimon-vectormodule with Lucene-based vector indexing for similarity search - Comprehensive test suite with 11 test cases covering various scenarios (dimensions, metrics, sharding)
- Service provider interface integration for pluggable global indexer factory
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| pom.xml | Adds paimon-vector profile activated for JDK 11+ |
| paimon-vector/pom.xml | Defines new module with Lucene 9.12.3 dependencies |
| VectorGlobalIndexerFactory.java | Factory implementation for creating vector indexers via SPI |
| VectorGlobalIndexer.java | Main indexer that creates writer and reader instances |
| VectorGlobalIndexWriter.java | Implements vector index writing using Lucene HNSW with batch support |
| VectorGlobalIndexReader.java | Implements KNN search across multiple index shards with Lucene |
| VectorIndexOptions.java | Configuration options for dimension, metric, HNSW parameters |
| VectorIndexMetadata.java | Serializable metadata for vector index parameters |
| VectorIndex.java | Abstract base class for vector index entries |
| FloatVectorIndex.java | Float vector implementation with Lucene KnnFloatVectorField |
| ByteVectorIndex.java | Byte vector implementation with Lucene KnnByteVectorField |
| META-INF/.../GlobalIndexerFactory | Service provider configuration registering VectorGlobalIndexerFactory |
| VectorGlobalIndexTest.java | Comprehensive test suite covering write/read, metrics, dimensions, multi-shard scenarios |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
paimon-vector/src/test/java/org/apache/paimon/vector/VectorGlobalIndexTest.java
Outdated
Show resolved
Hide resolved
paimon-vector/src/test/java/org/apache/paimon/vector/VectorGlobalIndexTest.java
Outdated
Show resolved
Hide resolved
paimon-vector/src/main/java/org/apache/paimon/vector/VectorGlobalIndexReader.java
Show resolved
Hide resolved
paimon-vector/src/main/java/org/apache/paimon/vector/VectorGlobalIndexReader.java
Outdated
Show resolved
Hide resolved
paimon-vector/src/main/java/org/apache/paimon/vector/VectorGlobalIndexWriter.java
Show resolved
Hide resolved
paimon-vector/src/main/java/org/apache/paimon/vector/VectorGlobalIndexReader.java
Show resolved
Hide resolved
paimon-vector/src/main/java/org/apache/paimon/vector/ByteVectorIndex.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 10 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Apple [0.9, 0.1] and Banana [0.8, 0.2] and Orange [0.85, 0.15] are similar (fruits) | ||
| // Car [0.1, 0.9], Bike [0.15, 0.85], and Truck [0.2, 0.8] are different (vehicles) |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment says "Apple [0.9, 0.1] and Banana [0.8, 0.2] and Orange [0.85, 0.15] are similar (fruits)" but this could be misleading because the similarity actually depends on the metric being used (EUCLIDEAN in this test). The comment describes conceptual similarity but doesn't clarify that these are close in Euclidean distance.
| // Apple [0.9, 0.1] and Banana [0.8, 0.2] and Orange [0.85, 0.15] are similar (fruits) | |
| // Car [0.1, 0.9], Bike [0.15, 0.85], and Truck [0.2, 0.8] are different (vehicles) | |
| // Apple [0.9, 0.1], Banana [0.8, 0.2], and Orange [0.85, 0.15] are close in Euclidean distance (fruits) | |
| // Car [0.1, 0.9], Bike [0.15, 0.85], and Truck [0.2, 0.8] are close to each other in Euclidean distance (vehicles), but far from the fruit vectors. |
| } | ||
| return byteOut.toByteArray(); | ||
| } | ||
|
|
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The serializeMetadata method only serializes metadata but there's no corresponding deserializeMetadata method. This creates an asymmetric API and may cause confusion. Consider adding a matching deserialization method for consistency and future use.
| public static VectorIndexMetadata deserializeMetadata(byte[] data) throws IOException { | |
| try (java.io.DataInputStream dataIn = new java.io.DataInputStream(new java.io.ByteArrayInputStream(data))) { | |
| int dimension = dataIn.readInt(); | |
| String similarityFunction = dataIn.readUTF(); | |
| int m = dataIn.readInt(); | |
| int efConstruction = dataIn.readInt(); | |
| return new VectorIndexMetadata(dimension, similarityFunction, m, efConstruction); | |
| } | |
| } |
| private static IndexWriterConfig getIndexWriterConfig( | ||
| int m, int efConstruction, int writeBufferSize) { | ||
| IndexWriterConfig config = new IndexWriterConfig(); | ||
| config.setRAMBufferSizeMB(writeBufferSize); // Increase buffer for better performance |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment says "Increase buffer for better performance" but this is setting the buffer to a configurable value, not necessarily increasing it. The comment should reflect that this is configuring the RAM buffer size based on user settings.
| config.setRAMBufferSizeMB(writeBufferSize); // Increase buffer for better performance | |
| config.setRAMBufferSizeMB(writeBufferSize); // Configure RAM buffer size based on user settings |
| // Test 3: Query with vector similar to "Car" - should find Car first | ||
| float[] carQueryVector = new float[] {0.15f, 0.85f}; | ||
| GlobalIndexResult carSearchResult = reader.search(carQueryVector, 1); | ||
|
|
||
| List<Long> carResultIds = new ArrayList<>(); | ||
| carSearchResult.results().iterator().forEachRemaining(carResultIds::add); | ||
|
|
||
| assertThat(carResultIds).as("Query similar to Car should find Car").contains(carId); |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test comment and assertion are incorrect. The query vector [0.15f, 0.85f] is exactly equal to the Bike vector [0.15f, 0.85f] (line 394), not the Car vector [0.1f, 0.9f] (line 392). The test should either: (1) change the comment and assertion to expect Bike, or (2) change the query vector to be closer to Car, such as [0.12f, 0.88f].
| private Options createDefaultOptions() { | ||
| Options options = new Options(); | ||
| options.setInteger("vector.dim", 128); | ||
| options.setString("vector.metric", "COSINE"); | ||
| options.setInteger("vector.m", 16); | ||
| options.setInteger("vector.ef-construction", 100); | ||
| return options; | ||
| } |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The createDefaultOptions() method sets the metric to "COSINE" but many tests override this to use "EUCLIDEAN". Consider renaming this method to something like createBaseOptions() or documenting that these are base settings that tests typically override, to avoid confusion about what the "default" actually means.
| # limitations under the License. | ||
| ################################################################################ | ||
|
|
||
| name: UTCase and ITCase Spark 4.x |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The workflow name "UTCase and ITCase Spark 4.x" is incorrect for a paimon-vector workflow. This should be something like "UTCase and ITCase Paimon Vector" to accurately reflect what this workflow tests.
| name: UTCase and ITCase Spark 4.x | |
| name: UTCase and ITCase Paimon Vector |
| .stringType() | ||
| .defaultValue("EUCLIDEAN") | ||
| .withDescription( | ||
| "The similarity metric for vector search (COSINE, DOT_PRODUCT, EUCLIDEAN, MAX_INNER_PRODUCT)"); |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default similarity metric is "EUCLIDEAN" but the documentation in the PR description mentions COSINE, DOT_PRODUCT, and EUCLIDEAN without specifying which is the default. Consider documenting why EUCLIDEAN was chosen as the default, as COSINE is often preferred for semantic search use cases mentioned in the PR purpose.
| "The similarity metric for vector search (COSINE, DOT_PRODUCT, EUCLIDEAN, MAX_INNER_PRODUCT)"); | |
| "The similarity metric for vector search. Options are COSINE, DOT_PRODUCT, EUCLIDEAN, MAX_INNER_PRODUCT. " + | |
| "The default is EUCLIDEAN for compatibility with common vector index algorithms. " + | |
| "Note: COSINE is often preferred for semantic search use cases, but EUCLIDEAN is used here as the default for broader applicability."); |
| p -> { | ||
| try { | ||
| java.nio.file.Files.delete(p); | ||
| } catch (IOException e) { | ||
| // Ignore cleanup errors | ||
| } | ||
| }); |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lambda silently ignores cleanup errors which could mask issues during testing or debugging. Consider at least logging these errors instead of completely ignoring them, or document why silent failure is acceptable here.
| .forEach( | ||
| p -> { | ||
| try { | ||
| java.nio.file.Files.delete(p); | ||
| } catch (IOException e) { | ||
| // Ignore cleanup errors | ||
| } | ||
| }); |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lambda silently ignores cleanup errors which could mask issues during testing or debugging. Consider at least logging these errors instead of completely ignoring them, or document why silent failure is acceptable here. This is the same issue as in VectorGlobalIndexWriter.
| Set<String> fieldsToLoad = Set.of(VectorIndex.ROW_ID_FIELD); | ||
| // Collect row IDs from results | ||
| for (org.apache.lucene.search.ScoreDoc scoreDoc : topDocs.scoreDocs) { | ||
| float rawScore = scoreDoc.score; |
Copilot
AI
Dec 4, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable rawScore is declared but never used. Consider removing it or using it for logging/debugging purposes if score information is relevant.
| float rawScore = scoreDoc.score; |
1179721 to
213332f
Compare
Purpose
Introduce vector index support for Paimon using Apache Lucene 9.x to enable:
Implementation
Tests
API and Format
Documentation