Add a new method `roaring_bitmap_to_bool_array`. #764

RinChanNOWWW · 2025-11-24T08:41:54Z

Currently, we can only convert the roaring bitmap to a sparse uint32 array or a dense bitset. In some case, we may need to get a dense boolean/uint8 array. For example, some database systems like ClickHouse, will use a uint8 array as a mask column to filter data.

This PR introduces roaring_bitmap_to_bool_array to convert a roaring bitmap to a bool array to meet the requirement. The implementation is quite simple, and optimizations (like utilize SIMD) could be introduced in later PRs if this PR is accepted by the community.

lemire · 2025-11-24T20:30:18Z

@RinChanNOWWW

I am not 100% convinced about your motivation. We already support conversion to a bitset:

roaring_bitmap_t *r1 = roaring_bitmap_create();
for (uint32_t i = 100; i < 100000; i+= 1 + (i%5)) {
     roaring_bitmap_add(r1, i);
}
for (uint32_t i = 100000; i < 500000; i+= 100) {
     roaring_bitmap_add(r1, i);
}
roaring_bitmap_add_range(r1, 500000, 600000);
bitset_t * bitset = bitset_create();
bool success = roaring_bitmap_to_bitset(r1, bitset);
assert(success); // could fail due to memory allocation.
assert(bitset_count(bitset) == roaring_bitmap_get_cardinality(r1));
// You can then query the bitset:
for (uint32_t i = 100; i < 100000; i+= 1 + (i%5)) {
    assert(bitset_get(bitset,i));
}
for (uint32_t i = 100000; i < 500000; i+= 100) {
    assert(bitset_get(bitset,i));
}
// you must free the memory:
bitset_free(bitset);
roaring_bitmap_free(r1);

A bitset instance is quite simple:

struct bitset_s {
    uint64_t *CROARING_CBITSET_RESTRICT array;
    /* For simplicity and performance, we prefer to have a size and a capacity
     * that is a multiple of 64 bits. Thus we only track the size and the
     * capacity in terms of 64-bit words allocated */
    size_t arraysize;
    size_t capacity;
};

typedef struct bitset_s bitset_t;

I guess we can always argue that it is good to have more ways to get the same work done, but expanding our API is not free.

If this is related to something ClickHouse needs, then sure... but do you have a related ClickHouse issue ? Or discussion thread ?

RinChanNOWWW · 2025-11-25T03:19:53Z

Hi @lemire, thanks for the response.

I also found that we can convert to bitset already. However, there are two main reasons the bitset API cannot meet my requirement:

As described above, database system like ClickHouse uses a uint8 vector to store boolean values to do filtering. If I want to convert a roaring bitmap to a uint8 vector by using bitset, I need to convert a roaring bitmap to bitset, and then convert the bitset to a uint8 vector.
There aren't a range API for bitset like roaring_bitmap_range_uint32_array. Sometimes I only need a subset of it.

There are already conversion from a roaring bitmap to a uint8 vector in ClickHouse, but the conversion is not efficient yet:

https://github.com/ClickHouse/ClickHouse/blob/e891253ffd874c91c6bb23398554f31c7a90450b/src/Storages/MergeTree/MergeTreeIndexReadResultPool.cpp#L264-L279

Current implementation is using a uint32 iterator to find all values in the roaring bitmap within a specific range, and fill the coresponding position to true in the uint8 vector. If we support roaring_bitmap_to_bool_array, we can utilize SIMD for optimzation.

Apache Doris also implement a similar way to iterate the roaring bitmap to fill a uint8 vector:

https://github.com/apache/doris/blob/a13241b76c7f09594ec4f9c1d1433af51bc80548/be/src/olap/rowset/segment_v2/segment_iterator.cpp#L2833-L2843

lemire · 2025-11-25T04:19:37Z

include/roaring/roaring.h

+ * e.g.
+ *     ans = malloc(roaring_bitmap_maximum(bitmap) * sizeof(bool));
+ *
+ * This function always returns `true`


If the function always return true, why have a return value at all?

If the function always return true, why have a return value at all?

I followed the style of existing APIs like this one:

CRoaring/include/roaring/roaring.h

Lines 558 to 573 in d5f8433

/**

* Convert the bitmap to a sorted array from `offset` by `limit`, output in

* `ans`.

*

* Caller is responsible to ensure that there is enough memory allocated, e.g.

*

* ans = malloc(roaring_bitmap_get_cardinality(limit) * sizeof(uint32_t));

*

* This function always returns `true`

*

* For more control, see `roaring_uint32_iterator_skip` and

* `roaring_uint32_iterator_read`, which can be used to e.g. tell how many

* values were actually read.

*/

bool roaring_bitmap_range_uint32_array(const roaring_bitmap_t *r, size_t offset,

size_t limit, uint32_t *ans);

Should we remove them all?

This function originally returned false on error, at least as per its specification. We updated the documentation a few months ago to reflect the fact that it cannot, in fact, return false. Given that the function has been around for a long time, it is easier to keep the bool return (doing otherwise might break existing code).

I am not sure that it is a sensible pattern that we should reproduce.

Right?

lemire · 2025-11-25T04:20:52Z

@RinChanNOWWW That's an excellent answer, but I am not sure what you have implemented would work for ClickHouse.

Look at the use case:

        roaring::api::roaring_uint32_iterator_t it;
        roaring_iterator_init(data.bitmap32, &it);
        if (!roaring_uint32_iterator_move_equalorlarger(&it, starting_row))
            return false;

        bool has_value = false;
        while (it.current_value < ending_row)
        {
            has_value = true;
            pos[it.current_value - starting_row] = 1;
            if (!roaring_uint32_iterator_advance(&it))
                break;
        }
        return has_value;

So it seems that it needs to operate over a range, doesn't it ?

Of course, you can always dump the whole thing to a temporary buffer and copy it over, but that's not efficient.

RinChanNOWWW · 2025-11-25T05:05:38Z

@lemire

So it seems that it needs to operate over a range, doesn't it ?

Yes.

Of course, you can always dump the whole thing to a temporary buffer and copy it over, but that's not efficient.

I would like to implement a range API roaring_bitmap_range_bool_array later if these APIs are welcome by the community. And in my opinion, these two APIs (roaring_bitmap_to_bool_array and roaring_bitmap_range_bool_array) should exist togehter just like the ones for uint32 array, so we can have a heuristic strategy to decide to use which one (the full one or the range one) in different scenarios.

lemire · 2025-11-25T05:23:12Z

@RinChanNOWWW I understand, but I am concerned about adding functions for which I see no obvious use. Your analysis is excellent, but it suggests that the range functions are what we want to have. If we had the ranged function, then your version would become unnecessary.

RinChanNOWWW · 2025-11-25T06:10:57Z

@lemire You are right. So let me implement the range function in this PR first.

RinChanNOWWW force-pushed the to_bool_array branch from 173ef04 to 57a91cd Compare November 24, 2025 08:45

RinChanNOWWW marked this pull request as draft November 24, 2025 11:58

Basic impl of roaring_bitmap_to_bool_array.

ff677b9

RinChanNOWWW force-pushed the to_bool_array branch from 57a91cd to ff677b9 Compare November 24, 2025 12:15

RinChanNOWWW marked this pull request as ready for review November 24, 2025 12:57

RinChanNOWWW added 2 commits November 24, 2025 21:02

Add unit tests.

441ff30

Add benchmarks.

c957dfa

RinChanNOWWW force-pushed the to_bool_array branch from b15d97d to c957dfa Compare November 24, 2025 13:02

lemire reviewed Nov 25, 2025

View reviewed changes

RinChanNOWWW marked this pull request as draft November 25, 2025 12:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add a new method `roaring_bitmap_to_bool_array`. #764

Add a new method `roaring_bitmap_to_bool_array`. #764

RinChanNOWWW commented Nov 24, 2025 •

edited

Loading

Uh oh!

lemire commented Nov 24, 2025

Uh oh!

RinChanNOWWW commented Nov 25, 2025 •

edited

Loading

Uh oh!

lemire Nov 25, 2025

Uh oh!

RinChanNOWWW Nov 25, 2025

Uh oh!

lemire Nov 25, 2025

Uh oh!

lemire commented Nov 25, 2025

Uh oh!

RinChanNOWWW commented Nov 25, 2025

Uh oh!

lemire commented Nov 25, 2025

Uh oh!

RinChanNOWWW commented Nov 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	/**
	* Convert the bitmap to a sorted array from `offset` by `limit`, output in
	* `ans`.
	*
	* Caller is responsible to ensure that there is enough memory allocated, e.g.
	*
	* ans = malloc(roaring_bitmap_get_cardinality(limit) * sizeof(uint32_t));
	*
	* This function always returns `true`
	*
	* For more control, see `roaring_uint32_iterator_skip` and
	* `roaring_uint32_iterator_read`, which can be used to e.g. tell how many
	* values were actually read.
	*/
	bool roaring_bitmap_range_uint32_array(const roaring_bitmap_t *r, size_t offset,
	size_t limit, uint32_t *ans);

Add a new method roaring_bitmap_to_bool_array. #764

Are you sure you want to change the base?

Add a new method roaring_bitmap_to_bool_array. #764

Conversation

RinChanNOWWW commented Nov 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lemire commented Nov 24, 2025

Uh oh!

RinChanNOWWW commented Nov 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lemire Nov 25, 2025

Choose a reason for hiding this comment

Uh oh!

RinChanNOWWW Nov 25, 2025

Choose a reason for hiding this comment

Uh oh!

lemire Nov 25, 2025

Choose a reason for hiding this comment

Uh oh!

lemire commented Nov 25, 2025

Uh oh!

RinChanNOWWW commented Nov 25, 2025

Uh oh!

lemire commented Nov 25, 2025

Uh oh!

RinChanNOWWW commented Nov 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Add a new method `roaring_bitmap_to_bool_array`. #764

Add a new method `roaring_bitmap_to_bool_array`. #764

RinChanNOWWW commented Nov 24, 2025 •

edited

Loading

RinChanNOWWW commented Nov 25, 2025 •

edited

Loading