How does the "in" filter work?
#44935
4 comments · 21 replies
-
I believe the time cost is mainly due to parsing the extremely long string of the filter. You can use a "filter template" to pass the ids in list format. For example:
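The original example is not reproduced above. A minimal sketch of what a templated filter can look like, assuming a recent Milvus/pymilvus release that supports expression templating, and using hypothetical collection and field names:

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# Hypothetical id list; in the scenario described in this thread it would hold ~1M values.
doc_ids = [12, 57, 103, 9001]

res = client.search(
    collection_name="my_collection",      # hypothetical collection name
    data=[[0.1, 0.2, 0.3, 0.4]],          # query vector
    limit=10,
    filter="document_id in {ids}",        # placeholder instead of an inlined literal list
    filter_params={"ids": doc_ids},       # ids passed as a list, not as expression text
)
```

Because the ids are passed as a parameter list rather than spliced into the expression string, the server does not have to lexically parse millions of literals before executing the filter.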
-
I just tested 10M vectors (4-dim) with this script: 10M 4-dim vectors have a total size of 160 MB. If I set the segment size to 10 MB, 30~50 segments are generated. The performance summary: the key point is that if we search with a filter of 1 million ids, it must compute the distance of 1 million vectors to find the top-k items.
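The test script itself is not included in the thread. A rough sketch of a comparable setup, assuming pymilvus and hypothetical names (the 10 MB segment size is a server-side setting, e.g. in milvus.yaml, and is not controlled from this client code):

```python
import random
from pymilvus import DataType, MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# 10M rows of 4-dim float vectors: 10M * 4 dims * 4 bytes ≈ 160 MB of raw vector data.
schema = MilvusClient.create_schema(auto_id=False)
schema.add_field("document_id", DataType.INT64, is_primary=True)
schema.add_field("vector", DataType.FLOAT_VECTOR, dim=4)

index_params = client.prepare_index_params()
index_params.add_index(field_name="vector", index_type="FLAT", metric_type="L2")

client.create_collection("bench_10m", schema=schema, index_params=index_params)

# Insert in batches to keep individual RPCs small.
batch = 10_000
for start in range(0, 10_000_000, batch):
    client.insert("bench_10m", data=[
        {"document_id": i, "vector": [random.random() for _ in range(4)]}
        for i in range(start, start + batch)
    ])

# Filtered search with 1M ids (10% of the collection), passed via a template.
# Even with the template, the distances of all ~1M matching rows still have to
# be computed to produce the top-k, which is the key point above.
ids = random.sample(range(10_000_000), 1_000_000)
res = client.search(
    "bench_10m",
    data=[[random.random() for _ in range(4)]],
    limit=10,
    filter="document_id in {ids}",
    filter_params={"ids": ids},
)
```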
-
I checked the bitset preparation code and found an issue. The problem: this function runs for every segment's index and every batch of values. Since my filter has a lot of document_ids, for every batch and segment we run b index lookups (so O(b log s)), where b is the batch size and s is the segment size. We also create a bitset of size s, which takes at least O(s); see the rough sketch below. My proposed solution:
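A rough Python illustration of the access pattern described above. This is not the actual Milvus C++ bitset code and not the proposed solution; it only sketches where the O(b log s) lookups and O(s) allocations come from:

```python
import bisect

def prepare_bitset(sorted_segment_ids, value_batch):
    """Mark which rows of one segment match one batch of filter values."""
    bitset = [False] * len(sorted_segment_ids)              # O(s) allocation per call
    for v in value_batch:                                    # b lookups ...
        pos = bisect.bisect_left(sorted_segment_ids, v)      # ... each O(log s)
        if pos < len(sorted_segment_ids) and sorted_segment_ids[pos] == v:
            bitset[pos] = True
    return bitset

def filter_collection(segments, filter_values, batch_size):
    # Runs once per segment and once per batch of values, so the total work is
    # roughly num_segments * num_batches * (s + b * log s).
    bitsets = []
    for sorted_segment_ids in segments:
        for i in range(0, len(filter_values), batch_size):
            batch = filter_values[i:i + batch_size]
            bitsets.append(prepare_bitset(sorted_segment_ids, batch))
    return bitsets
```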
-
template } how does this look?
-
When I search the whole collection without any filters, it takes 200 ms. I am using a FLAT index. I have ~100 nodes in the k8s deployment and 10 TB of memory.
When I search with a filter like "document_id in [<>]" and pass around 10% of the document_ids in the array, it takes a lot longer, around 5 s. Total document_ids = 10M, values passed in the filter = 1M.
Why does it work like this? Why does filtering on these values take longer than computing so many dot products of huge vectors?
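A rough, hypothetical back-of-the-envelope on the parsing argument from the first reply: inlining 1M ids into the expression makes the filter itself several megabytes of text that must be lexed and parsed before any vector work starts.

```python
# Illustration only: how large the raw "in" expression gets with 1M inlined ids.
ids = list(range(1_000_000))
expr = f"document_id in {ids}"                     # e.g. "document_id in [0, 1, 2, ...]"
print(len(expr) / 1e6, "MB of expression text")    # several MB for 1M ids
```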