This project contains the tools used by the Loghi framework.
Make sure langident is installed:
git clone https://github.com/rvankoert/langident.git
cd langident
mvn clean package
mvn install:install-file -Dfile=target/langident-1.0.5-SNAPSHOT.jar -DgroupId=nl.knaw.huygens.pergamon.nlp -DartifactId=langident -Dpackaging=jar -Dversion=1.0.5Compiling loghi-tooling
cd ../loghi-tooling
mvn clean packageThese are commandline tools that help process the input and output for the Loghi framework.
| Minion | Purpose | Key Inputs / Notes |
|---|---|---|
| MinionCutFromImageBasedOnPageXMLNew | Cut text line image snippets based on PAGE XML coordinates | Image + PAGE XML folder (page/), supports styles & rescaling |
| MinionDetectLanguageOfPageXml | Predict document and region-level language | PAGE XML directory; optional training data folders |
| MinionExtractBaselines | Inject baselines/text lines into PAGE XML from segmentation PNG | PAGE XML + baseline PNG + config (P2PaLA/Laypa) |
| MinionExtractBaselinesStartEndNew | Baselines + start/end point images to improve rotated line detection | PAGE XML + baseline PNG + start/end PNGs |
| MinionExtractBaselinesStartEndNew3 | Updated rotated line baseline extraction (variant v3) | PAGE XML + baseline + start/end PNGs (new heuristics) |
| MinionConvertPageToTxt | Convert PAGE XML files to consolidated or line-based text | PAGE XML directory; flags for linebased/plaintext |
| MinionConvertToPdf | Produce a PDF from JPEG images, overlaying PAGE text at baselines | JPEG images + matching PAGE XML in page/ subdir |
| MinionFixPageXML | Normalize/fix PAGE XML; optionally remove text or words | PAGE XML directory; namespace flag (2013/2019) (bug: removetext sets words) |
| MinionGarbageCharacterCalculator | Compute percentage of disallowed characters | PAGE XML file + allowed characters file |
| MinionGeneratePageImages | Synthesize PAGE XML + rendered images from text and fonts | Text files + fonts directory |
| MinionLoghiHTRMergePageXML | Merge Loghi HTR output lines into existing PAGE XML | PAGE XML + HTR results file + config |
| MinionPyLaiaMergePageXML | Merge PyLaia HTR output lines into existing PAGE XML | PAGE XML + PyLaia results file |
| MinionRecalculateReadingOrderNew | Recompute reading order of regions in PAGE XML | PAGE XML directory (modifies in place) |
| MinionShrinkRegions | Shrink region polygons based on image content | Images + PAGE XML in page/ subfolder |
| MinionShrinkTextLines | Shrink text line polygons based on image content | Images + PAGE XML in page/ subfolder |
| MinionSplitPageXMLTextLineIntoWords | Derive word segmentation from text lines | PAGE XML directory |
| MinionConvertOCRResult | Convert OCR result files (ALTO -> PageXML2019 implemented) | Input directory of OCR XML; target format flag |
This tool will cut the textlines from an image based on the PAGE xml file. It expects that the PAGE xml file has same name as the image except for the extension.
./target/appassembler/bin/MinionCutFromImageBasedOnPageXMLNew -help./target/appassembler/bin/MinionCutFromImageBasedOnPageXMLNew -input_path /example/input_path -outputbase /example/output_path -output_type png -channels 4 -threads 5This call will take the images from /example/input_path.
It will take the page from /example/input_path/page.
The text lines will be stored as png in /example/output_path in a subfolder with the name of the original image.
The images will be stored with transparency information.
The minion will use 5 threads.
OPTIONAL PARAMETER: --use_tags for using HTML tag equivalents for the text styles taken from the pageXML:
IMPORTANT: this only works if --include_text_styles and --write_text_contents are passed as well.
- ␅ underline
- before: ␅u␅n␅d␅e␅r␅l␅i␅n␅e
- after: underline
- ␃ strikethrough
- before: ␃u␃n␃d␃e␃r␃l␃i␃n␃e
- after:
underline
- ␄ subscript;
- before: H␄2O
- after: H2 O
- ␆ superscript
- before: E=mc␆2
- after: E=mc2
This minion will try to predict the language of the PAGE file. If the PAGE file contains text regions it will also predict the language for each region.
./target/appassembler/bin/MinionDetectLanguageOfPageXml -help./target/appassembler/bin/MinionDetectLanguageOfPageXml -page /example/page -lang_train_data /example/lang_training_data/example/page contains the PAGE xml files of which the language has to be detected.
-lang_train_data is an optional argument, but it is highly recommended, to use your own training data.
The training data looks something like this:
lang_training_data
|
- Dutch
- English
Dutch and English are plain text files.
The names should comply with https://github.com/PRImA-Research-Lab/PAGE-XML/blob/4f40cd4b68d893b02a6396cf00df3e0e96db0d21/pagecontent/schema/pagecontent.xsd#L1675
The contents of the files should be a unicode text in the language of the file name.
This minion takes PAGE XML and a PNG containing baseline labels (from P2PaLA or Laypa) and extracts baseline / text line information, adding it to the PAGE XML regions. Supports optional splitting of merged baselines and configuration via P2PaLA or Laypa config files.
Typical call (P2PaLA style input where baselines are label 0 or 255 depending on config):
./target/appassembler/bin/MinionExtractBaselines -input_path_png /example/p2pala/result/png \
-input_path_page /example/p2pala/result/png \
-output_path_page /example/output/page/./target/appassembler/bin/MinionExtractBaseLines -help./target/appassembler/bin/MinionExtractBaseLines -input_path_png /example/p2pala/result/png -input_path_page /example/p2pala/result/png -output_path_page /example/output/page/ -invert_image./target/appassembler/bin/MinionExtractBaseLines -input_path_png /example/p2pala/result/png -input_path_page /example/p2pala/result/png -output_path_page /example/output/page/P2PaLA will output the page and pngs in the same folder.
So input_path_png and input_path_page will have the same value most of the time.
This minion expects:
- pageXML,
- a folder with pngs containing the baselines
- a folder with pngs containing the start points
- a folder with pngs containing the end points
- a folder to store the new PAGE xml It extracts info about the baselines from the images and add baseline/textline information to the regions in the pagexml. This version adds the ability to correctly detect rotated lines.
This minion expects pageXML, a png containing baselines and a png containing baseline start and ending as input. It extracts info about the baselines from the images and add baseline/textline information to the regions in the pagexml. This version adds the ability to correctly detect rotated lines.
./target/appassembler/bin/MinionExtractBaselinesStartEndNew3 -help./target/appassembler/bin/MinionExtractBaselinesStartEndNew3 -input_path_png /example/png_input/ -input_path_pagexml /example/page_input/ -output_path_pagexml /example/page_output/Converts PAGE XML files in a directory to plain text (.txt). Can output either:
- Concatenated text (default) preserving paragraph/line breaks
- Line-based text (one line per PAGE TextLine) using
-linebased - Plain text vs UTF8 extracted text using
-plaintext
Options:
-pagexmldir <dir>(required) Directory containing PAGE XML files.-no_overwriteDo not overwrite existing .txt files.-linebasedOutput one line per PAGE text line.-plaintextUseTextEquiv.PlainTextinstead of UTF8 text.
Example:
./target/appassembler/bin/MinionConvertPageToTxt -pagexmldir /data/pagexml -linebasedBuilds a PDF from a directory of JPEG images, overlaying PAGE XML textual content at baseline positions. It expects for each image a PAGE XML file in a page/ subdirectory with the same base filename.
Usage:
./target/appassembler/bin/MinionConvertToPdf output.pdf /path/to/imagesNotes:
- Font size auto-scales if x-height isn't available.
- Currently uses Times Roman and positions text at baseline start coordinates.
This tool reads and then writes the PAGE XML again, fixing some small problems that exist in existing PAGE XML files.
To fix a PAGE XML file, use the following command:
./target/appassembler/bin/MinionFixPageXML -input_path /path/to/input/pagexml -output_path /path/to/output/pagexml -namespace http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15-input_path: Path to the directory containing the input PAGE XML files. -namespace: Target namespace for the PAGE XML files. Default is http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 use http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 for PAGE 2013 and compatibility with Transkribus. -removetext: Optional flag to remove text from the PAGE XML. -removewords: Optional flag to remove words from the PAGE XML.
Correct options are:
-input_path <dir>(required) Path to PAGE XML files (recursive descent into subdirectories).-removetextRemove region and line level text (sets TextEquiv to null).-removewordsRemove word elements from lines.-namespace <ns>Target namespace (default 2019:http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15). Use 2013:http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15for compatibility with Transkribus.
Example:
./target/appassembler/bin/MinionFixPageXML -input_path /data/pagexml -namespace http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 -removewordsKnown issue: In the current code -removetext sets removeWords instead of removeText. This should be fixed in code for expected behavior.
This minion returns the characters that should not be in the text as a percentage of total amount of characters.
./target/appassembler/bin/MinionGarbageCharacterCalculator -help./target/appassembler/bin/MinionGarbageCharacterCalculator -page_file /path/to/page.xml -characters_file /path/supported_characters.txt/path/allowed_characters.txt is a plain text file, that contains the supported without any spaces in between.
It should look something like:
abcdefgABCDEFG
The minion will create PAGE xml and the image, written in a font.
./target/appassembler/bin/MinionGeneratePageImages -help./target/appassembler/bin/MinionGeneratePageImages -text_path /example/text_path/ -output_path /example/page_output/ -font_path /example/fonts text_pathcontains plain text filesfont_pathcontains font files (i.e. *.ttf)
For each text a synthetic will be created with a randomly chosen font.
This minion merges the HTR results of Loghi HTR with the existing PAGE files. The results file should look something like:
/example/image_name/page_name.xml-line_name1.png This is a text line
/example/image_name/page_name.xml-line_name2.png This is a text line too
./target/appassembler/bin/MinionLoghiHTRMergePageXML -help./target/appassembler/bin/MinionLoghiHTRMergePageXML -input_path /example/page -results_file /example/htr_results.txt -config_file /path/to/htr-config.jsonThis minion works similar to MinionLoghiHTRMergePageXML, but is made to process the HTR results of Pylaia.
./target/appassembler/bin/MinionPyLaiaMergePageXML -help./target/appassembler/bin/MinionPyLaiaMergePageXML -input_path /example/page -results_file /example/htr_results.txtThis minion will recalculate the reading order of the PAGE xml files.
It will change the page files the folder that is passed as the input_path_page-argument.
./target/appassembler/bin/MinionRecalculateReadingOrderNew -help./target/appassembler/bin/MinionRecalculateReadingOrderNew -input_path_page /example/page/This minion reevaluates the Text lines of the PAGE xml and shrinks them where needed. The results are based on the image the PAGE xml is describing.
./target/appassembler/bin/MinionShrinkRegions -help./target/appassembler/bin/MinionShrinkRegions -input /example/imagesinput should be a folder and should contain the images and a subfolder page with the PAGE xml files of the images.
This minion reevaluates the Text lines of the PAGE xml and shrinks them where needed. The results are based on the image the PAGE xml is describing.
./target/appassembler/bin/MinionShrinkTextLines -help./target/appassembler/bin/MinionShrinkTextLines -input /example/imagesinput should be a folder and should contain the images and a subfolder page with the PAGE xml files of the images.
This minion will split the text lines of a PAGE xml file into words.
./target/appassembler/bin/MinionSplitPageXMLTextLineIntoWords -help./target/appassembler/bin/MinionSplitPageXMLTextLineIntoWords -input_path /example/pageThis tool converts OCR result files between several supported formats. It currently implements conversion from ALTO XML to PageXML 2019. Other target formats are planned or partially stubbed.
Supported source format auto-detection (per file):
- PageXML 2013 (PcGts root with 2013-07-15 namespace)
- PageXML 2019 (PcGts root with 2019-07-15 namespace)
- ALTO (root element )
- hOCR (HTML containing elements with class="ocr")
Target formats (argument to -targetformat / -t):
- PageXML2013 (planned; conversion logic commented out)
- PageXML2019 (implemented for ALTO source)
- hOCR (placeholder)
- ALTO (placeholder for reverse conversion)
How it works:
- Reads every regular file in the input directory (non-recursive).
- Detects its format using lightweight string matching (namespace checks, root tag, presence of hOCR markers).
- For ALTO -> PageXML2019:
- Parses ALTO XML using AltoUtils.readAltoDocumentFromString into an AltoDocument.
- Converts AltoDocument -> DocumentPage -> PcGts (PageXML 2019) via DocumentTypeConverter.
- Serializes PcGts with Jackson XmlMapper (pretty-printed) and writes to output directory, preserving the original filename.
- Skips files with unknown or unsupported formats, printing a message to stderr.
- Ensures the output directory exists (creates it if absent).
CLI options:
- -i / --inputpath Path to directory containing OCR result files (default: /scratch/altotest/)
- -o / --outputpath Path to directory for converted files (default: /tmp/limited/)
- -t / --targetformat Desired target format (case-insensitive). Default: PageXML2013
Example: Convert a batch of ALTO XML files to PageXML 2019
./target/appassembler/bin/MinionConvertOCRResult -inputpath /example/alto -outputpath /example/pagexml -targetformat PageXML2019Notes & limitations:
- This is a work-in-progress tool. Expect incomplete functionality and breaking changes.
- Currently only ALTO -> PageXML2019 is fully implemented.
- PageXML 2013 -> PageXML 2019 conversion is indicated but commented out.
- hOCR and ALTO outputs are placeholders; invoking with those target formats will not perform conversions yet.
- Processing is non-recursive; organize files flat or invoke multiple times for subdirectories.
- Detection depends on simple string heuristics; malformed XML may be skipped.
Planned enhancements:
- Implement PageXML 2013 to 2019 conversion pathway.
- Add hOCR and ALTO generation (reverse conversions).
- Optional recursive directory traversal.
- More robust XML validation and error reporting.
These requests are on the public port of the webservice, that is by default 8080.
When running old P2PaLA (where 0 means baseline):
curl -X POST -F "mask=@/data/scratch/p2palaintermediate/5c52d146-34b1-48e8-8805-04885d39d96a.png" -F "xml=@/data/scratch/p2palaintermediate/5c52d146-34b1-48e8-8805-04885d39d96a.xml" -F "identifier=id" -F "invertImage=true" http://localhost:8080/extract-baselinesWhen running old P2PaLA with config file:
curl -X POST -F "mask=@/data/scratch/p2palaintermediate/5c52d146-34b1-48e8-8805-04885d39d96a.png" -F "xml=@/data/scratch/p2palaintermediate/5c52d146-34b1-48e8-8805-04885d39d96a.xml" -F "identifier=id" -F "invertImage=true" -F "p2palaa_config=@/path/to/p2pala_config.json" -F "config_white_list=num_workers" -F "config_white_list=line_alg" http://localhost:8080/extract-baselinesWhen running Laypa (where 255 means baseline):
curl -X POST -F "mask=@/data/scratch/p2palaintermediate/5c52d146-34b1-48e8-8805-04885d39d96a.png" -F "xml=@/data/scratch/p2palaintermediate/5c52d146-34b1-48e8-8805-04885d39d96a.xml" -F "identifier=id" http://localhost:8080/extract-baselinesWhen running Laypa with config file:
curl -X POST -F "mask=@/data/scratch/p2palaintermediate/5c52d146-34b1-48e8-8805-04885d39d96a.png" -F "xml=@/data/scratch/p2palaintermediate/5c52d146-34b1-48e8-8805-04885d39d96a.xml" -F "identifier=id" -F "laypa_config=@/path/to/laypa_config.yml" -F "config_white_list=DATASETS" -F "config_white_list=VERSION" http://localhost:8080/extract-baselinesRequest with all options
curl -X POST -F "mask=@/data/scratch/p2palaintermediate/5c52d146-34b1-48e8-8805-04885d39d96a.png" -F "xml=@/data/scratch/p2palaintermediate/5c52d146-34b1-48e8-8805-04885d39d96a.xml" -F "identifier=id" -F "margin=30" http://localhost:8080/extract-baselinesSimple request:
curl -X POST -F "image=@/data/scratch/p2palaintermediate/5c52d146-34b1-48e8-8805-04885d39d96a.png" -F "page=@/tmp/upload/id/extract_baselines.xml" -F "identifier=id" -F "output_type=png" -F "channels=4" http://localhost:8080/cut-from-image-based-on-page-xml-newoptional parameter: -F @use_tags=true for using HTML tag equivalents for the text styles taken from the pageXML:
IMPORTANT: this only works if -F @include_text_styles=true and -F @write_text_contents=true are passed as well.
- ␅ underline
- before: ␅u␅n␅d␅e␅r␅l␅i␅n␅e
- after: underline
- ␃ strikethrough
- before: ␃u␃n␃d␃e␃r␃l␃i␃n␃e
- after:
underline
- ␄ subscript;
- before: H␄2O
- after: H2 O
- ␆ superscript
- before: E=mc␆2
- after: E=mc2
Full request:
curl -X POST -F "image=@/data/scratch/p2palaintermediate/5c52d146-34b1-48e8-8805-04885d39d96a.png" \
-F "page=@/tmp/upload/id/extract_baselines.xml" \
-F "identifier=id" \
-F "output_type=png" \
-F "channels=4" \
-F "min_width=5" \
-F "min_height=5" \
-F "min_width_to_height_ratio=2" \
-F "write_text_contents=true" \
-F "rescale_height=20" \
-F "output_box_file=false" \
-F "output_txt_file=false" \
-F "recalculate_text_line_contours_from_baselines=false" \
-F "fixed_x_height=15" \
-F "min_x_height=10" \
-F "include_text_styles=true" \
http://localhost:8080/cut-from-image-based-on-page-xml-newSimple request
curl -X POST -F "identifier=id" -F "page=@/home/martijnm/workspace/images/loghi-htr/data/page/NL-0400410000_26_009015_000321.xml" -F "results=@/home/martijnm/workspace/images/loghi-htr/results.txt" -F "htr-config=@/home/martijnm/workspace/images/loghi-htr/output/config.json" http://localhost:8080/loghi-htr-merge-page-xmlFull request
curl -X POST -F "identifier=id" -F "page=@/home/martijnm/workspace/images/loghi-htr/data/page/NL-0400410000_26_009015_000321.xml" -F "results=@/home/martijnm/workspace/images/loghi-htr/results.txt" -F "htr-config=@/home/martijnm/workspace/images/loghi-htr/output/config.json" -F "comment=My comment" -F "config_white_list=seed" -F "config_white_list=batch_size" http://localhost:8080/loghi-htr-merge-page-xmlSimple request
curl -X POST -F "identifier=id" -F "page=@/home/martijnm/workspace/images/loghi-htr/data/page/NL-0400410000_26_009015_000321.xml" -F "border_margin=200" http://localhost:8080/recalculate-reading-order-newFull request
curl -X POST -F "identifier=id" -F "page=@/home/martijnm/workspace/images/loghi-htr/data/page/NL-0400410000_26_009015_000321.xml" -F "border_margin=200" -F "interline_clustering_multiplier=1.5" -F "dubious_size_width_multiplier=0.05" -F "dubious_size_width=1024" http://localhost:8080/recalculate-reading-order-newcurl -X POST -F "identifier=id" -F "xml=@/home/stefan/Documents/repos/laypa/tutorial/data/inference/page/NL-HaNA_1.01.02_3112_0395.xml" http://localhost:8080/split-page-xml-text-line-into-wordscurl -X POST -F "identifier=id" -F "page=@/home/martijnm/workspace/images/loghi-htr/data/page/NL-040041000_26_009015_000321.xml" -F "training_data=@/home/martijnm/workspace/images/loghi-tooling/minions/src/main/resources/lang-ident-training-data/Dutch" -F "training_data=@/home/martijnm/workspace/images/loghi-tooling/minions/src/main/resources/lang-ident-training-data/English" -F "training_data=@/home/martijnm/workspace/images/loghi-tooling/minions/src/main/resources/lang-ident-training-data/French" http://localhost:8080/detect-language-of-page-xmlThese requests use the admin port that is by default 8081.
These requests are GET requests and could be viewed in your browser.
http://localhost:8081/prometheus
http://localhost:8081/