How does the "in" filter work?
#44935
4 comments · 21 replies
-
I believe the time cost is mainly due to parsing the extremely long string of the filter. You can use a "filter template" to pass the ids in list format. For example:
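The original example is not reproduced above. A minimal sketch of what a templated filter can look like, assuming a recent Milvus/pymilvus release that supports expression templating, and using hypothetical collection and field names:

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# Hypothetical id list; in the scenario described in this thread it would hold ~1M values.
doc_ids = [12, 57, 103, 9001]

res = client.search(
    collection_name="my_collection",      # hypothetical collection name
    data=[[0.1, 0.2, 0.3, 0.4]],          # query vector
    limit=10,
    filter="document_id in {ids}",        # placeholder instead of an inlined literal list
    filter_params={"ids": doc_ids},       # ids passed as a list, not as expression text
)
```

Because the ids are passed as a parameter list rather than spliced into the expression string, the server does not have to lexically parse millions of literals before executing the filter.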
-
I just tested 10M vectors (4-dim) with this script: 10M 4-dim vectors have a total size of 160 MB. If I set the segment size to 10 MB, 30~50 segments are generated. The performance summary: the key point is that if we search with a filter of 1 million ids, it must compute the distance of 1 million vectors to find the top-k items.
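The test script itself is not included in the thread. A rough sketch of a comparable setup, assuming pymilvus and hypothetical names (the 10 MB segment size is a server-side setting, e.g. in milvus.yaml, and is not controlled from this client code):

```python
import random
from pymilvus import DataType, MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# 10M rows of 4-dim float vectors: 10M * 4 dims * 4 bytes ≈ 160 MB of raw vector data.
schema = MilvusClient.create_schema(auto_id=False)
schema.add_field("document_id", DataType.INT64, is_primary=True)
schema.add_field("vector", DataType.FLOAT_VECTOR, dim=4)

index_params = client.prepare_index_params()
index_params.add_index(field_name="vector", index_type="FLAT", metric_type="L2")

client.create_collection("bench_10m", schema=schema, index_params=index_params)

# Insert in batches to keep individual RPCs small.
batch = 10_000
for start in range(0, 10_000_000, batch):
    client.insert("bench_10m", data=[
        {"document_id": i, "vector": [random.random() for _ in range(4)]}
        for i in range(start, start + batch)
    ])

# Filtered search with 1M ids (10% of the collection), passed via a template.
# Even with the template, the distances of all ~1M matching rows still have to
# be computed to produce the top-k, which is the key point above.
ids = random.sample(range(10_000_000), 1_000_000)
res = client.search(
    "bench_10m",
    data=[[random.random() for _ in range(4)]],
    limit=10,
    filter="document_id in {ids}",
    filter_params={"ids": ids},
)
```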
-
I checked the bitset preparation code and found an issue. The problem: this function runs for every segment's index and every batch of values. Since my filter has a lot of document_ids, for every batch and segment we run b index lookups (so O(b log s)), where b is the batch size and s is the segment size. We also create a bitset of size s, which takes at least O(s); see the rough sketch below. My proposed solution:
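A rough Python illustration of the access pattern described above. This is not the actual Milvus C++ bitset code and not the proposed solution; it only sketches where the O(b log s) lookups and O(s) allocations come from:

```python
import bisect

def prepare_bitset(sorted_segment_ids, value_batch):
    """Mark which rows of one segment match one batch of filter values."""
    bitset = [False] * len(sorted_segment_ids)              # O(s) allocation per call
    for v in value_batch:                                    # b lookups ...
        pos = bisect.bisect_left(sorted_segment_ids, v)      # ... each O(log s)
        if pos < len(sorted_segment_ids) and sorted_segment_ids[pos] == v:
            bitset[pos] = True
    return bitset

def filter_collection(segments, filter_values, batch_size):
    # Runs once per segment and once per batch of values, so the total work is
    # roughly num_segments * num_batches * (s + b * log s).
    bitsets = []
    for sorted_segment_ids in segments:
        for i in range(0, len(filter_values), batch_size):
            batch = filter_values[i:i + batch_size]
            bitsets.append(prepare_bitset(sorted_segment_ids, batch))
    return bitsets
```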
-
template } how does this look?
-
When I search the whole collection without any filters, it takes 200 ms. I am using a FLAT index. I have ~100 nodes in the k8s deployment and 10 TB of memory.
When I search with a filter like "document_id in [<>]" and pass around 10% of the document_ids in the array, it takes a lot longer, around 5 s. Total document_ids = 10M, values passed in the filter = 1M.
Why does it work like this? Why does filtering on these values take longer than computing so many dot products of huge vectors?
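A rough, hypothetical back-of-the-envelope on the parsing argument from the first reply: inlining 1M ids into the expression makes the filter itself several megabytes of text that must be lexed and parsed before any vector work starts.

```python
# Illustration only: how large the raw "in" expression gets with 1M inlined ids.
ids = list(range(1_000_000))
expr = f"document_id in {ids}"                     # e.g. "document_id in [0, 1, 2, ...]"
print(len(expr) / 1e6, "MB of expression text")    # several MB for 1M ids
```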