Title Chunker and Token Chunker #11410

Yeon1007 · 2025-11-20T08:04:26Z


When selecting chunkers for the ingestion pipeline, if I use the Title chunker first and then the Token chunker (i.e., chain two different chunkers), will the Token chunker only take effect when a chunk produced by the hierarchy-based chunking exceeds the configured chunk size?


dosubot[bot] · 2025-11-20T08:08:19Z


Yes, when you use the Title chunker first and then the Token chunker in the ingestion pipeline, the Title chunker splits the document into chunks based on heading levels (like H1, H2, H3, H4). The Token chunker then only further splits those chunks that exceed the configured token size limit—so it acts as a fallback to ensure no chunk is too large after the initial hierarchical split. This approach preserves document structure as much as possible, only breaking up oversized chunks as needed. Placing the Title chunker after the Token chunker is not supported and may cause errors, so always keep the Title chunker before the Token chunker in your pipeline (Title chunker docs, Token chunker docs).
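The two-stage behavior described above can be sketched roughly as follows. This is only an illustration, not RAGFlow's code: the function names, the markdown-style heading regex, and the use of whitespace-separated words as "tokens" are all assumptions made for the example.

```python
import re

# Hypothetical heading pattern: markdown-style H1 to H4.
HEADING = re.compile(r"^#{1,4}\s+\S")

def split_by_title(text):
    """Start a new chunk at every heading line."""
    chunks, current = [], []
    for line in text.splitlines():
        if HEADING.match(line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

def split_oversized(chunks, max_tokens):
    """Keep chunks within the limit untouched; cut larger ones into windows."""
    out = []
    for chunk in chunks:
        tokens = chunk.split()  # whitespace words stand in for a real tokenizer
        if len(tokens) <= max_tokens:
            out.append(chunk)
            continue
        for i in range(0, len(tokens), max_tokens):
            out.append(" ".join(tokens[i:i + max_tokens]))
    return out

doc = "# Intro\nshort text\n# Details\n" + " ".join(["word"] * 12)
title_chunks = split_by_title(doc)
final_chunks = split_oversized(title_chunks, max_tokens=8)
```

In this toy input the "Intro" chunk passes through unchanged, while the "Details" chunk is over the limit and is cut into two pieces.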

To reply, just mention @dosu.


12 replies

dosubot[bot] bot Nov 26, 2025

RAGFlow’s Title Chunker applies your regexes independently to each line, so it can’t assign heading levels based on context or previous headings—there’s no way to make “3.1” an H2 in one section and an H3 in another using only regex config. This is a known limitation of the current implementation: heading level assignment is static and order-based, not context-aware (see code).

To handle your scenario, you have a few practical options:

- Preprocess your document before ingestion to standardize headings. For example, rewrite “3.1” under “III. INSTALLATION” as “III.3.1” or insert a marker so your regex can distinguish these cases.
- Postprocess the chunk output: after chunking, merge or reclassify small chunks based on their content and position in the hierarchy.
- Try alternative chunking methods like manual or delimiter-based chunking if your document structure allows it (see discussion).
- If you need a fully automated solution, you’d have to customize the HierarchicalMerger code to add context-sensitive logic.
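As a rough idea of what the preprocessing option could look like, here is a minimal sketch that prefixes numeric headings with the Roman numeral of the section they appear under. The function name and both regexes are hypothetical and would need tuning for a real document; in particular, body lines that happen to start with a number or a Roman-looking abbreviation would also match.

```python
import re

# Hypothetical patterns for this example only.
ROMAN = re.compile(r"^([IVXLC]+)\.\s+\S")       # e.g. "III. INSTALLATION"
NUMERIC = re.compile(r"^(\d+(?:\.\d+)*)\s+\S")  # e.g. "3.1 Requirements"

def qualify_headings(lines):
    """Rewrite '3.1 ...' as 'III.3.1 ...' when it follows a 'III.' heading."""
    section = None
    out = []
    for line in lines:
        roman = ROMAN.match(line)
        if roman:
            section = roman.group(1)   # remember the enclosing top-level section
        elif section and NUMERIC.match(line):
            line = f"{section}.{line}"
        out.append(line)
    return out

sample = ["III. INSTALLATION", "3.1 Requirements", "Some body text."]
rewritten = qualify_headings(sample)
```

After this pass, a separate regex for the `III.3.1` form can be assigned its own heading level in the Title chunker configuration.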

If you want help with a preprocessing or postprocessing script, feel free to share a sample of your document or chunk output!


dosubot[bot] bot Nov 27, 2025

You've reached the anonymous usage limit of 10 interactions per month. Your usage will reset on 2025-12-01. If you'd like to use Dosu before then, please register to increase your limit to 100 interactions per month for free.


Yeon1007 Nov 27, 2025
Author

If I use only the title chunker, I get about 250 chunks. Most have a chunk size of around 800, but some are too large (around 3000) and some are too small (under 100). In this situation, which of the following would be more effective, based on past experience: 1. using a token chunker after the title chunker (since I set the chunk size to 1024, many chunks will be merged back together, inevitably disrupting the chunk structure), or 2. using the built-in general chunking method? Regarding the previous answer: maintainers have noted that for smaller datasets (around 1,000 chunks), larger chunks often yield better results. What does "larger chunks" mean in terms of chunk size? Is 1024 considered large?

Also, there's an issue with the built-in general chunking method: it gets stuck at 89.44% parse progress. The log details show indexing is done, but the status remains running. The container log shows WARNING 18 Database connection issue (attempt 1/5): (0, ''). How can this be resolved? @dosu

dosubot[bot] bot Nov 27, 2025

For your dataset size (250 chunks, most around 800 tokens, some up to 3000, some <100), both the title+token chunker and the built-in general chunking method will merge small chunks to reach the configured chunk size (1024), which can disrupt your original chunk structure. This merging is by design in both methods and can't be disabled without lowering the chunk size or skipping the token chunker entirely. The general chunking method does not preserve heading hierarchy (it's purely size-based), while the title+token combo at least starts from your document's structure before merging occurs (see code).

For small datasets (≤1000 chunks), maintainers and user reports confirm that "larger chunks" (1024 tokens) are reasonable and often yield better retrieval results, as they provide more context per chunk (maintainer confirmation, discussion). 1024 is considered large compared to the default (512), but is within best-practice range for many use cases.

If you want to preserve semantic structure, stick with just the title chunker. If you need to enforce a hard chunk size limit, use the token chunker after the title chunker, but be aware of the merging effect. The general chunking method is best if you want uniform chunk sizes and don't care about headings.
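To get a feel for the merging effect mentioned above, here is a small sketch that works on chunk sizes only. It assumes a simple greedy rule (keep appending adjacent chunks while the running total stays within the limit); that rule and the function name are assumptions for illustration, not RAGFlow's actual merge logic, and the sample sizes are made up.

```python
def merge_small(sizes, limit):
    """Greedily merge adjacent chunk sizes while the total stays within limit."""
    merged, current = [], 0
    for size in sizes:
        # Close the running group if adding this chunk would exceed the limit.
        if current and current + size > limit:
            merged.append(current)
            current = 0
        current += size
    if current:
        merged.append(current)
    return merged

# Made-up sizes resembling the situation described: mostly medium,
# a few tiny chunks, and one oversized chunk.
sizes = [90, 800, 60, 100, 3000, 800]
result = merge_small(sizes, limit=1024)
print(result)  # [950, 100, 3000, 800]
```

Under this rule the tiny chunks get absorbed into their neighbors, which is the loss of structure described above. The 3000 chunk is left as-is here because this sketch only merges; splitting oversized chunks would be a separate step.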

The ingestion hang at 89.44% with a database connection warning is a known backend bug. The retry logic keeps trying to reconnect, but if the DB can't be reached, the pipeline hangs. Workarounds: clear unused knowledge bases and documents, delete failed/problematic files, or remove all data in the Infinity directory and restart the cluster before re-indexing. Avoid bulk indexing large batches at once to reduce risk (see issue). This will be fixed in a future release.


Yeon1007 Nov 27, 2025
Author

Regarding using the title chunker before the token chunker: suppose the title chunker produces four chunks with sizes of 1100, 500, 300, and 1000, and I set the chunk size of the token chunker to 1024. For the first 1100 chunk, wouldn't the portion exceeding 1024 be combined with the 500 and 300 chunks to form a new chunk (76 + 500 + 300 = 876)? This still wouldn't reach 1024, so the head of the fourth 1000 chunk would be split off as well, with part of it (148) added to the 876 to form a 1024-sized chunk. Am I right? @dosu
