Add a validation layer for YAML based configuration #1780

omri374 · 2025-11-11T10:58:42Z

This pull request introduces Pydantic-based validation for configuration files and objects throughout the Presidio Analyzer codebase, improving reliability and error reporting. It replaces legacy manual validation logic with centralized schema validation, updates the NerModelConfiguration class to use Pydantic, and refactors providers to leverage the new validation approach.

Configuration validation improvements:

Added the ConfigurationValidator class in input_validation/schemas.py to validate analyzer and NLP configurations using Pydantic models, including checks for language codes, file paths, score thresholds, and nested structures.
Updated input_validation/__init__.py to expose ConfigurationValidator and related schema models for use throughout the codebase.
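To make the idea concrete, here is a minimal sketch of declarative validation with Pydantic 2.x. The model `ExampleAnalyzerConfig`, its fields, and the function `validate_example_config` are hypothetical stand-ins, not the actual classes in input_validation/schemas.py; only the language-code pattern is taken from the diff discussed in this PR.

```python
import re
from typing import List

from pydantic import BaseModel, Field, ValidationError, field_validator

# Pattern quoted from the diff discussed in this PR.
LANGUAGE_CODE_PATTERN = re.compile(r"^[a-z]{2}(-[A-Z]{2})?$")


class ExampleAnalyzerConfig(BaseModel):
    """Hypothetical schema, for illustration only (not Presidio's real model)."""

    supported_languages: List[str] = Field(default_factory=lambda: ["en"])
    # ge/le constrain the threshold to the closed interval [0, 1].
    default_score_threshold: float = Field(default=0.0, ge=0.0, le=1.0)

    @field_validator("supported_languages")
    @classmethod
    def check_language_codes(cls, languages: List[str]) -> List[str]:
        for lang in languages:
            if not LANGUAGE_CODE_PATTERN.fullmatch(lang):
                raise ValueError(f"Invalid language code: {lang!r}")
        return languages


def validate_example_config(raw: dict) -> dict:
    """Validate a raw dict and return it with defaults filled in."""
    try:
        return ExampleAnalyzerConfig(**raw).model_dump()
    except ValidationError as e:
        raise ValueError("Invalid analyzer configuration") from e
```

Calling `validate_example_config({})` returns the defaults, while an out-of-range threshold or a malformed language code raises a `ValueError` whose cause is the underlying Pydantic error.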

Provider refactoring:

Refactored NlpEngineProvider in nlp_engine_provider.py to remove legacy validation methods and use ConfigurationValidator for all configuration checks, including file path and YAML structure validation.
Updated AnalyzerEngineProvider to validate analyzer configuration using ConfigurationValidator and log warnings instead of printing errors.

Model validation enhancements:

Replaced the legacy dataclass implementation of NerModelConfiguration with a Pydantic-based class, adding field-level validation and default value handling for configuration options.

Logging and error handling:

Improved error handling and logging for configuration parsing and validation, switching from print statements to logger warnings for better observability.
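As an illustration of the print-to-logger switch, the sketch below warns through the `logging` module when a value is missing. The function `read_threshold` is hypothetical and the key name is used only as an example; the logger name "presidio-analyzer" is the one that appears in the diff for yaml_recognizer_models.py.

```python
import logging

logger = logging.getLogger("presidio-analyzer")


def read_threshold(configuration: dict, default: float = 0.0) -> float:
    """Return the score threshold, warning (not printing) when it is missing."""
    threshold = configuration.get("default_score_threshold")
    if threshold is None:
        # A warning goes through the logging system, so callers can filter,
        # route, or capture it; print() would bypass all of that.
        logger.warning(
            "default_score_threshold is missing, using default value %s", default
        )
        return default
    return float(threshold)
```

Because the message is a log record rather than text on stdout, an application can silence it, raise its level, or send it to a file through standard logging configuration.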

HOW TO TEST?

Run the following script:

import logging
from pathlib import Path

from presidio_analyzer import AnalyzerEngineProvider

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)


def get_full_paths(analyzer_yaml, nlp_engine_yaml=None, recognizer_registry_yaml=None):
    this_path = Path(__file__).parent.absolute()

    analyzer_yaml_path = Path(this_path, analyzer_yaml) if analyzer_yaml else None
    nlp_engine_yaml_path = Path(this_path, nlp_engine_yaml) if nlp_engine_yaml else None
    recognizer_registry_yaml_path = (
        Path(this_path, recognizer_registry_yaml) if recognizer_registry_yaml else None
    )
    return analyzer_yaml_path, nlp_engine_yaml_path, recognizer_registry_yaml_path


if __name__ == "__main__":
    analyzer_yaml, nlp_yaml, reco_yaml = get_full_paths(
        "conf/default_analyzer_full.yaml"
    )
    logger.debug(analyzer_yaml)
    provider = AnalyzerEngineProvider(analyzer_yaml)
    analyzer = provider.create_engine()

    res = analyzer.analyze("John's email is 1341122341", language="en")
    print(analyzer.get_recognizers())
    logger.info(res)

Try different YAML configurations, including invalid ones, and check that the output and error messages you get make sense.

Copilot

Pull Request Overview

This PR introduces Pydantic-based validation for YAML configuration files, modernizing the configuration management system in Presidio Analyzer. The changes centralize validation logic, improve error handling, and reduce technical debt by removing legacy validation methods.

Key Changes

Added new Pydantic models for validating recognizer configurations and NLP engine settings
Replaced legacy manual validation with declarative Pydantic-based validation
Refactored NerModelConfiguration from dataclass to Pydantic model
Removed redundant validation methods from NlpEngineProvider

Reviewed Changes

Copilot reviewed 23 out of 24 changed files in this pull request and generated 13 comments.

File	Description
`pyproject.toml`	Added Pydantic 2.x dependency
`input_validation/schemas.py`	New centralized `ConfigurationValidator` class with static validation methods
`input_validation/yaml_recognizer_models.py`	New Pydantic models for recognizer registry configuration
`ner_model_configuration.py`	Converted from dataclass to Pydantic BaseModel with field validation
`nlp_engine_provider.py`	Removed legacy validation methods, integrated ConfigurationValidator
`pattern.py`	Added regex and score validation using `regex` library
`pattern_recognizer.py`	Added support for `supported_entities` to `supported_entity` conversion
`recognizers_loader_utils.py`	Enhanced to handle Pydantic-normalized fields and entity conversions
`recognizer_registry_provider.py`	Integrated ConfigurationValidator for registry validation
`analyzer_engine_provider.py`	Added configuration validation call
Test files	Comprehensive test coverage for new validation logic

presidio-analyzer/presidio_analyzer/input_validation/yaml_recognizer_models.py

presidio-analyzer/tests/test_configuration_validator.py

presidio-analyzer/presidio_analyzer/input_validation/schemas.py

presidio-analyzer/tests/test_recognizer_registry_provider.py

presidio-analyzer/presidio_analyzer/input_validation/yaml_recognizer_models.py

Copilot

Pull Request Overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 3 comments.

presidio-analyzer/presidio_analyzer/nlp_engine/nlp_engine_provider.py

presidio-analyzer/presidio_analyzer/input_validation/schemas.py

presidio-analyzer/presidio_analyzer/input_validation/yaml_recognizer_models.py

omri374 · 2025-11-20T10:11:09Z

presidio-analyzer/presidio_analyzer/input_validation/yaml_recognizer_models.py

@@ -0,0 +1,487 @@
+"""Pydantic models for YAML recognizer configurations."""


This file contains the Pydantic equivalents of the YAML structures. Because these do not always map 1:1 to Presidio objects, there is an intermediate Pydantic representation. The existing layer (AnalyzerEngineProvider, NlpEngineProvider, RecognizerRegistryProvider) still handles the actual creation of objects.

omri374 · 2025-11-20T10:11:45Z

presidio-analyzer/presidio_analyzer/nlp_engine/ner_model_configuration.py

-@dataclass
-class NerModelConfiguration:
-    """NER model configuration.
+class NerModelConfiguration(BaseModel):


Switched to BaseModel to improve validation

omri374 · 2025-11-20T10:13:38Z

presidio-analyzer/presidio_analyzer/recognizer_registry/recognizers_loader_utils.py

-        if "supported_language" in recognizer_conf:
-            return [PatternRecognizer.from_dict(recognizer_conf)]
+        # legacy recognizer (has supported_language set to a value, not None)
+        if recognizer_conf.get("supported_language"):


The legacy YAML-based recognizer supported a single language. The newer YAML config lets the user create recognizers with multiple languages, which translates to multiple instances of the same recognizer (one per language). The validation therefore needs to support both the legacy and the new structures.
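The change from `"supported_language" in recognizer_conf` to `recognizer_conf.get("supported_language")` matters once the dict comes from a Pydantic dump, where the key can be present with a value of None. The sketch below shows this under stated assumptions: the helper names are hypothetical, and the shape of the newer format (a `supported_languages` list of dicts with a `language` key) is assumed for illustration rather than taken from this PR.

```python
from typing import Any, Dict, List


def is_legacy_recognizer(recognizer_conf: Dict[str, Any]) -> bool:
    """Legacy configs carry one non-empty `supported_language` value."""
    return bool(recognizer_conf.get("supported_language"))


def expand_per_language(recognizer_conf: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one config per language (hypothetical helper)."""
    if is_legacy_recognizer(recognizer_conf):
        return [dict(recognizer_conf)]

    expanded = []
    # Assumption: the newer format lists languages under `supported_languages`
    # as dicts with a `language` key.
    for entry in recognizer_conf.get("supported_languages") or []:
        conf_copy = {
            k: v
            for k, v in recognizer_conf.items()
            if k not in ("supported_language", "supported_languages")
        }
        conf_copy["supported_language"] = entry["language"]
        expanded.append(conf_copy)
    return expanded
```

With a normalized dict such as `{"supported_language": None, "supported_languages": [...]}`, a plain `in` check would wrongly classify the recognizer as legacy, while the truthiness check does not.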

dorlugasigal · 2025-11-24T09:28:21Z

presidio-analyzer/presidio_analyzer/nlp_engine/nlp_engine_provider.py

+            # Validate using ConfigurationValidator - let Pydantic errors propagate
+            ConfigurationValidator.validate_nlp_configuration(nlp_configuration)
            self.nlp_configuration = nlp_configuration

-        if conf_file or conf_file == '':
-            self._validate_conf_file_path(conf_file)
+        if conf_file or conf_file == "":
+            if conf_file == "":
+                raise ValueError("conf_file is empty")
+            ConfigurationValidator.validate_file_path(conf_file)
            self.nlp_configuration = self._read_nlp_conf(conf_file)

        if conf_file is None and nlp_configuration is None:
            conf_file = self._get_full_conf_path()
            logger.debug(f"Reading default conf file from {conf_file}")
            self.nlp_configuration = self._read_nlp_conf(conf_file)


I think we are missing a call to ConfigurationValidator.validate_nlp_configuration(nlp_configuration) in the other if branches here (lines 66 and 71)?

Testing locally with

invalid_config = {
    "nlp_engine_name": "spacy",
    "models": "THIS_SHOULD_BE_A_LIST",  # BUG: should be a list!
}

bypasses the checks.
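One way to address the concern above is to resolve the configuration source first and validate once, after all branches. This is only a sketch: `resolve_nlp_configuration` and its `read_conf` parameter are hypothetical, `validate_nlp_configuration` is a simplified stand-in for the real ConfigurationValidator method, and the precedence between the sources is simplified.

```python
from typing import Any, Callable, Dict, Optional


def validate_nlp_configuration(conf: Dict[str, Any]) -> None:
    """Stand-in for ConfigurationValidator.validate_nlp_configuration."""
    if not isinstance(conf.get("models"), list):
        raise ValueError("'models' must be a list")


def resolve_nlp_configuration(
    nlp_configuration: Optional[Dict[str, Any]],
    conf_file: Optional[str],
    read_conf: Callable[[str], Dict[str, Any]],
    default_conf_file: str,
) -> Dict[str, Any]:
    """Pick the configuration source, then validate it exactly once."""
    if conf_file == "":
        raise ValueError("conf_file is empty")

    if nlp_configuration is not None:
        resolved = nlp_configuration
    elif conf_file is not None:
        resolved = read_conf(conf_file)
    else:
        resolved = read_conf(default_conf_file)

    # Single validation point: no branch above can bypass it.
    validate_nlp_configuration(resolved)
    return resolved
```

With this structure, the invalid `models` value is rejected whether it arrives as a dict, from a user-supplied file, or from the default file.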

ShakutaiGit

Great PR, just a few minor comments.

ShakutaiGit · 2025-11-27T07:37:32Z

presidio-analyzer/presidio_analyzer/analyzer_engine_provider.py

                with open(self._get_full_conf_path()) as file:
                    configuration = yaml.safe_load(file)

+        # Validate configuration using Pydantic-based ConfigurationValidator


Is this comment necessary?

ShakutaiGit · 2025-11-27T07:37:54Z

presidio-analyzer/presidio_analyzer/analyzer_engine_provider.py

                    configuration = yaml.safe_load(file)

+        # Validate configuration using Pydantic-based ConfigurationValidator
+        from presidio_analyzer.input_validation import ConfigurationValidator


Should this import be at the top of the file?

ShakutaiGit · 2025-11-27T07:52:31Z

presidio-analyzer/presidio_analyzer/input_validation/schemas.py

+            return validated_config.model_dump(exclude_unset=False)
+        except ValidationError as e:
+            raise ValueError(f"Invalid recognizer registry configuration: {e}")
+        except ImportError:


Do we actually need this fallback?
I see that we already load it here:
https://github.com/microsoft/presidio/pull/1780/files#diff-90164cd93bcc7aadd168773b72a4a398ab95778f3403c76471c790eab70b76b4R7

ShakutaiGit · 2025-11-27T07:54:19Z

presidio-analyzer/presidio_analyzer/input_validation/schemas.py

+            # Use model_dump() without exclude_unset to include default values
+            return validated_config.model_dump(exclude_unset=False)
+        except ValidationError as e:
+            raise ValueError(f"Invalid recognizer registry configuration: {e}")


Should we use something like this:

raise ValueError("Invalid recognizer registry configuration") from e

I think that using `from e` keeps the original Pydantic error details and traceback.
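The effect of `from e` can be seen with the standard library alone. In this sketch, `parse_threshold` is a hypothetical function used only to trigger a low-level error; the wrapping message is the one from the diff.

```python
def parse_threshold(raw: str) -> float:
    """Hypothetical parser used only to trigger a low-level error."""
    try:
        return float(raw)
    except ValueError as e:
        # `from e` stores the original exception on __cause__, so the traceback
        # shows both errors and the details are not lost.
        raise ValueError("Invalid recognizer registry configuration") from e


try:
    parse_threshold("not-a-number")
except ValueError as err:
    wrapped_message = str(err)
    original = err.__cause__
```

Even without `from e`, Python records the original exception on `__context__` when a new one is raised inside an `except` block; `from e` additionally marks it as the explicit cause, which makes the relationship clear in the traceback.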

ShakutaiGit · 2025-11-27T08:04:33Z

presidio-analyzer/presidio_analyzer/input_validation/schemas.py

+        :param languages: List of languages to validate.
+        """
+        for lang in languages:
+            if not re.match(r"^[a-z]{2}(-[A-Z]{2})?$", lang):


I see that this pattern recurs throughout the PR.
Can we create a shared constant for language validation and reuse it across both files?

LANGUAGE_CODE_PATTERN = re.compile(r"^[a-z]{2}(-[A-Z]{2})?$")
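A minimal sketch of how such a shared constant could be used follows. The function name `validate_language_codes` and the error message are assumptions, and the sketch uses the standard-library `re` module rather than the `regex` package imported in the diff.

```python
import re
from typing import Iterable

# Shared constant, as proposed in the review comment above.
LANGUAGE_CODE_PATTERN = re.compile(r"^[a-z]{2}(-[A-Z]{2})?$")


def validate_language_codes(languages: Iterable[str]) -> None:
    """Raise ValueError on the first malformed language code."""
    for lang in languages:
        # fullmatch also rejects a trailing newline, which `$` with match()
        # would accept.
        if not LANGUAGE_CODE_PATTERN.fullmatch(lang):
            raise ValueError(f"Invalid language code format: {lang!r}")
```

Codes such as "en" and "en-US" pass, while "EN", "eng", and "en-us" are rejected.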

ShakutaiGit · 2025-11-27T08:17:54Z

presidio-analyzer/presidio_analyzer/input_validation/yaml_recognizer_models.py

+import regex as re
+from pydantic import BaseModel, ConfigDict, Field, field_validator, model_validator
+
+logger = logging.getLogger("presidio-analyzer")


Unused logger?

ShakutaiGit · 2025-11-27T08:29:38Z

presidio-analyzer/presidio_analyzer/nlp_engine/ner_model_configuration.py

-            )
-            self.low_score_entity_names = LOW_SCORE_ENTITY_NAMES
-        if self.labels_to_ignore is None:
+    labels_to_ignore: Optional[Collection[str]] = Field(


Do we actually want Optional here, meaning we want to allow passing None, or should users simply omit the field to use the default?
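The difference between the two options in this question can be sketched with two hypothetical models. Only the field name `labels_to_ignore` comes from the diff; the element type is simplified to `List[str]` and the default is an assumption.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class WithOptional(BaseModel):
    """Hypothetical model: None is an accepted input value."""

    labels_to_ignore: Optional[List[str]] = Field(default_factory=list)


class WithoutOptional(BaseModel):
    """Hypothetical model: omit the field to get the default; None is rejected."""

    labels_to_ignore: List[str] = Field(default_factory=list)


# Explicit None is stored as-is when the annotation is Optional.
kept_none = WithOptional(labels_to_ignore=None).labels_to_ignore

# Omitting the field yields the default in both variants.
defaulted = WithoutOptional().labels_to_ignore

try:
    WithoutOptional(labels_to_ignore=None)
    none_rejected = False
except ValidationError:
    none_rejected = True
```

With `Optional`, downstream code has to handle a None value; without it, None fails validation and the only way to get the default is to omit the field.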

ShakutaiGit · 2025-11-27T08:34:34Z

presidio-analyzer/presidio_analyzer/nlp_engine/ner_model_configuration.py

-            self.labels_to_ignore = {}
+        return v

+    @field_validator("stride")


Should we add mode="before"?
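For context on this question: in Pydantic 2.x, `field_validator` defaults to "after" mode, which receives the value once it has been coerced to the annotated type, while mode="before" receives the raw input. The sketch below records what each validator sees; `StrideExample` and its default are hypothetical, and only the field name `stride` comes from the diff.

```python
from typing import Any, List

from pydantic import BaseModel, field_validator

seen: List[Any] = []  # records what each validator receives


class StrideExample(BaseModel):
    """Hypothetical model; only the field name `stride` comes from the diff."""

    stride: int = 14

    @field_validator("stride", mode="before")
    @classmethod
    def record_raw(cls, v: Any) -> Any:
        seen.append(("before", v))  # raw input, prior to type coercion
        return v

    @field_validator("stride")  # default mode is "after"
    @classmethod
    def check_positive(cls, v: int) -> int:
        seen.append(("after", v))  # already coerced to int
        if v <= 0:
            raise ValueError("stride must be positive")
        return v


StrideExample(stride="16")
```

A "before" validator is therefore the place for normalizing raw input, whereas range checks on a typed value fit the default "after" mode.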

ShakutaiGit · 2025-11-27T08:36:43Z

presidio-analyzer/presidio_analyzer/nlp_engine/nlp_engine_provider.py

+        if nlp_engines is None:
            nlp_engines = (SpacyNlpEngine, StanzaNlpEngine, TransformersNlpEngine)

+        # No legacy validation - just assign the engines


Are both comments here needed?

ShakutaiGit · 2025-11-27T08:49:29Z

presidio-analyzer/presidio_analyzer/recognizer_registry/recognizers_loader_utils.py

+
+            # Transform supported_entities -> supported_entity
+            # (PatternRecognizer expects singular)
+            if "supported_entities" in conf_copy:


This code section is repeated 3 times.
Can we extract it into a single helper method?
Lines 156-159, 176-179, 265-274
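A helper along these lines might look like the sketch below. The name `normalize_supported_entity` is hypothetical, and the rule that the list must contain exactly one entity is an assumption, since the body of the repeated block is not shown here.

```python
from typing import Any, Dict


def normalize_supported_entity(conf: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy where `supported_entities` becomes `supported_entity`.

    Hypothetical helper. Assumption: a pattern recognizer handles exactly one
    entity, so a list with any other length is treated as an error.
    """
    conf_copy = dict(conf)
    entities = conf_copy.pop("supported_entities", None)
    if entities and not conf_copy.get("supported_entity"):
        if len(entities) != 1:
            raise ValueError(
                f"Expected exactly one supported entity, got {len(entities)}"
            )
        conf_copy["supported_entity"] = entities[0]
    return conf_copy
```

Each of the three call sites could then call the helper once instead of repeating the transformation, and the input dict is left unmodified.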

omri374 added 4 commits October 19, 2025 15:03

initial code for pydantic based validation for yaml files

159315e

Validation layer for YAML based configuration - cont'd

cfc7c1b

Merge branch 'main' into omri/pydantic_validation

f7b54d4

# Conflicts: # presidio-analyzer/presidio_analyzer/nlp_engine/ner_model_configuration.py

linting

0fbd010

omri374 requested a review from Copilot November 11, 2025 14:32

Copilot started reviewing on behalf of omri374 November 11, 2025 14:33

Copilot finished reviewing on behalf of omri374 November 11, 2025 14:34

Copilot AI reviewed Nov 11, 2025

omri374 and others added 10 commits November 11, 2025 22:24

Update presidio-analyzer/presidio_analyzer/input_validation/yaml_reco…

c4841b5

…gnizer_models.py Co-authored-by: Copilot <[email protected]>

Update presidio-analyzer/presidio_analyzer/input_validation/yaml_reco…

3939fc9

…gnizer_models.py Co-authored-by: Copilot <[email protected]>

Update presidio-analyzer/tests/test_configuration_validator.py

d7cb69b

Co-authored-by: Copilot <[email protected]>

Update presidio-analyzer/presidio_analyzer/input_validation/schemas.py

3b61469

Co-authored-by: Copilot <[email protected]>

Update presidio-analyzer/presidio_analyzer/input_validation/schemas.py

251cefc

Co-authored-by: Copilot <[email protected]>

Update presidio-analyzer/presidio_analyzer/input_validation/schemas.py

1420bd5

Co-authored-by: Copilot <[email protected]>

Update presidio-analyzer/presidio_analyzer/input_validation/schemas.py

c677b79

Co-authored-by: Copilot <[email protected]>

Update presidio-analyzer/tests/test_recognizer_registry_provider.py

421e47d

Co-authored-by: Copilot <[email protected]>

Update presidio-analyzer/presidio_analyzer/input_validation/schemas.py

2901b13

Co-authored-by: Copilot <[email protected]>

Update presidio-analyzer/presidio_analyzer/input_validation/schemas.py

8b84370

Co-authored-by: Copilot <[email protected]>

omri374 marked this pull request as ready for review November 17, 2025 17:29

omri374 requested a review from a team as a code owner November 17, 2025 17:29

Merge branch 'main' into omri/pydantic_validation

1cf7678

omri374 marked this pull request as draft November 18, 2025 12:24

ruff on the entire analyzer codebase

108e3d0

omri374 marked this pull request as ready for review November 19, 2025 11:02

Merge branch 'main' into omri/pydantic_validation

b492b79

omri374 requested a review from Copilot November 19, 2025 11:02

Copilot started reviewing on behalf of omri374 November 19, 2025 11:03

Merge remote-tracking branch 'origin/omri/pydantic_validation' into o…

eb9f7c7

…mri/pydantic_validation # Conflicts: # presidio-analyzer/presidio_analyzer/input_validation/schemas.py

Copilot finished reviewing on behalf of omri374 November 19, 2025 11:06

Copilot AI reviewed Nov 19, 2025

omri374 and others added 4 commits November 19, 2025 13:41

Update presidio-analyzer/presidio_analyzer/input_validation/schemas.py

bfd067b

Co-authored-by: Copilot <[email protected]>

ruff and copilot review fixes

f2e7fd9

merge

18df02e

Delete presidio-analyzer/test-output.xml

41328cc

omri374 requested review from ShakutaiGit and dorlugasigal November 19, 2025 13:07

omri374 added 3 commits November 19, 2025 22:15

fixed bad test

bd2d045

ruff

11a8169

removed wrong test which assumes defaults

8054750

omri374 commented Nov 20, 2025

Clean up comments in recognizers_loader_utils.py

86baa12

Removed comments about excluded fields in recognizer initialization.

This was referenced Nov 20, 2025

Allow multiple instances of the same recognizer through YAML #1761

Draft

Improve input validation for YAML #1766

Open

omri374 linked an issue Nov 20, 2025 that may be closed by this pull request

Improve input validation for YAML #1766

Open

dorlugasigal reviewed Nov 24, 2025

ShakutaiGit reviewed Nov 27, 2025
