I created a pipeline in the agent to chunk PDF files. When I selected DeepDoc as the parser, the progress in the log got stuck at 10%, and the total CPU usage of Docker became 430%/0% (16 CPUs available). What could be causing this? As far as I know, using DeepDoc shouldn't require any other prerequisites. I would also like to ask about the prerequisites for using MinerU and Docling: do I need to install them with pip or modify the Dockerfile first? Finally, is there a UI for monitoring the pipeline process, and how do I enable it?
Replies: 1 comment
Hi @imkkk777! I'm Dosu and I’m helping the ragflow team.

The DeepDoc parser often gets stuck at 10% with high CPU usage because it runs OCR and layout analysis on CPU by default, which is very resource-intensive, especially for image-heavy or complex PDFs. GPU acceleration is not used for DeepDoc parsing, even if you set GPU options; it's only used for embedding or LLM inference. If you don't need image recognition, switching to the "Naive" parser can significantly reduce resource usage and avoid these stalls. Also, make sure your server has enough RAM and that both the main server and task executor are running: parsing can hang if resources are low or services are unhealthy. Upgrading to the latest nightly build may resolve some known bugs with stuck parsing and parser selection [source].

For MinerU and Docling: both require installation before use. In Docker, set …

To use a different parser, select it in the dataset configuration UI before uploading your PDF. If you change the parser, you must delete and re-upload the PDF for the new parser to take effect; re-parsing does not update the parser for existing files [source].

For pipeline process monitoring, there is a built-in UI log sheet that shows pipeline progress and logs, and lets you cancel the run or export the logs as JSON. It is available in the agent pipeline UI and provides real-time feedback on parsing and chunking steps [source].
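On the point about switching to the "Naive" parser when image recognition isn't needed: one rough way to judge this is to check whether the PDF already has a text layer on most pages. The sketch below is only an illustration and not part of RAGFlow. The helper names, the 50-character cutoff, and the 0.9 threshold are all assumptions, and `pypdf` is a third-party library that would have to be installed separately.

```python
from typing import List


def text_page_ratio(page_texts: List[str], min_chars: int = 50) -> float:
    """Fraction of pages whose extracted text has at least `min_chars`
    non-whitespace characters. The cutoff is an arbitrary assumption."""
    if not page_texts:
        return 0.0
    texty = sum(1 for t in page_texts if len("".join(t.split())) >= min_chars)
    return texty / len(page_texts)


def suggest_parser(page_texts: List[str], threshold: float = 0.9) -> str:
    """Suggest "Naive" when nearly every page already has a text layer,
    otherwise "DeepDoc" (OCR and layout analysis are likely needed).
    The threshold is a hypothetical default, not a RAGFlow setting."""
    return "Naive" if text_page_ratio(page_texts) >= threshold else "DeepDoc"


def extract_page_texts(pdf_path: str) -> List[str]:
    """Read per-page text with pypdf (assumed installed: `pip install pypdf`)."""
    from pypdf import PdfReader  # lazy import so the helpers above work without it

    reader = PdfReader(pdf_path)
    return [(page.extract_text() or "") for page in reader.pages]
```

For a local file this would be called as `suggest_parser(extract_page_texts("my.pdf"))`, where the path is a placeholder. A scanned PDF with little or no extractable text would come back as "DeepDoc", in which case the heavy CPU load described above is to be expected.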