Skip to content

Conversation

@omri374
Copy link
Contributor

@omri374 omri374 commented Nov 11, 2025

This pull request introduces Pydantic-based validation for configuration files and objects throughout the Presidio Analyzer codebase, improving reliability and error reporting. It replaces legacy manual validation logic with centralized schema validation, updates the NerModelConfiguration class to use Pydantic, and refactors providers to leverage the new validation approach.

Configuration validation improvements:

  • Added the ConfigurationValidator class in input_validation/schemas.py to validate analyzer and NLP configurations using Pydantic models, including checks for language codes, file paths, score thresholds, and nested structures.
  • Updated input_validation/__init__.py to expose ConfigurationValidator and related schema models for use throughout the codebase.

Provider refactoring:

  • Refactored NlpEngineProvider in nlp_engine_provider.py to remove legacy validation methods and use ConfigurationValidator for all configuration checks, including file path and YAML structure validation. [1] [2]
  • Updated AnalyzerEngineProvider to validate analyzer configuration using ConfigurationValidator and log warnings instead of printing errors.

Model validation enhancements:

  • Replaced the legacy dataclass implementation of NerModelConfiguration with a Pydantic-based class, adding field-level validation and default value handling for configuration options. [1] [2]

Logging and error handling:

  • Improved error handling and logging for configuration parsing and validation, switching from print statements to logger warnings for better observability.

HOW TO TEST?

run

import logging
from pathlib import Path

from presidio_analyzer import AnalyzerEngineProvider

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)


def get_full_paths(analyzer_yaml, nlp_engine_yaml=None, recognizer_registry_yaml=None):
    this_path = Path(__file__).parent.absolute()

    analyzer_yaml_path = Path(this_path, analyzer_yaml) if analyzer_yaml else None
    nlp_engine_yaml_path = Path(this_path, nlp_engine_yaml) if nlp_engine_yaml else None
    recognizer_registry_yaml_path = (
        Path(this_path, recognizer_registry_yaml) if recognizer_registry_yaml else None
    )
    return analyzer_yaml_path, nlp_engine_yaml_path, recognizer_registry_yaml_path


if __name__ == "__main__":
    analyzer_yaml, reco_yaml, nlp_yaml = get_full_paths(
        "conf/default_analyzer_full.yaml"
    )
    logger.debug(analyzer_yaml)
    provider = AnalyzerEngineProvider(analyzer_yaml)
    analyzer = provider.create_engine()

    res = analyzer.analyze("John's email is 1341122341", language="en")
    print(analyzer.get_recognizers())
    logger.info(res)

play with different yaml configurations, including wrong ones, and see that the output you're getting makes sense.

@omri374 omri374 requested a review from Copilot November 11, 2025 14:32
Copilot finished reviewing on behalf of omri374 November 11, 2025 14:34
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces Pydantic-based validation for YAML configuration files, modernizing the configuration management system in Presidio Analyzer. The changes centralize validation logic, improve error handling, and reduce technical debt by removing legacy validation methods.

Key Changes

  • Added new Pydantic models for validating recognizer configurations and NLP engine settings
  • Replaced legacy manual validation with declarative Pydantic-based validation
  • Refactored NerModelConfiguration from dataclass to Pydantic model
  • Removed redundant validation methods from NlpEngineProvider

Reviewed Changes

Copilot reviewed 23 out of 24 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
pyproject.toml Added Pydantic 2.x dependency
input_validation/schemas.py New centralized ConfigurationValidator class with static validation methods
input_validation/yaml_recognizer_models.py New Pydantic models for recognizer registry configuration
ner_model_configuration.py Converted from dataclass to Pydantic BaseModel with field validation
nlp_engine_provider.py Removed legacy validation methods, integrated ConfigurationValidator
pattern.py Added regex and score validation using regex library
pattern_recognizer.py Added support for supported_entities to supported_entity conversion
recognizers_loader_utils.py Enhanced to handle Pydantic-normalized fields and entity conversions
recognizer_registry_provider.py Integrated ConfigurationValidator for registry validation
analyzer_engine_provider.py Added configuration validation call
Test files Comprehensive test coverage for new validation logic

@omri374 omri374 marked this pull request as ready for review November 17, 2025 17:29
@omri374 omri374 requested a review from a team as a code owner November 17, 2025 17:29
@omri374 omri374 marked this pull request as draft November 18, 2025 12:24
@omri374 omri374 marked this pull request as ready for review November 19, 2025 11:02
…mri/pydantic_validation

# Conflicts:
#	presidio-analyzer/presidio_analyzer/input_validation/schemas.py
Copilot finished reviewing on behalf of omri374 November 19, 2025 11:06
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 3 comments.

@@ -0,0 +1,487 @@
"""Pydantic models for YAML recognizer configurations."""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file contains the pydantic equivalent of the yaml structures. As these are not 1:1 with Presidio objects in some cases, there's an intermediate pydantic representation. The existing layer (AnalyzerEngineProvider, NlpEngineProvider, RecognizerRegistryProvider) still handles the actual creation of objects.

@dataclass
class NerModelConfiguration:
"""NER model configuration.
class NerModelConfiguration(BaseModel):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to BaseModel to improve validation

if "supported_language" in recognizer_conf:
return [PatternRecognizer.from_dict(recognizer_conf)]
# legacy recognizer (has supported_language set to a value, not None)
if recognizer_conf.get("supported_language"):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The legacy yaml based recognizer had a single language it supported. The newer YAML config allows the user to create recognizers with multiple languages which would translate to multiple instances of the same recognizer (one for each language). Therefore the validation needs to support both the legacy and the new structures.

Removed comments about excluded fields in recognizer initialization.
Comment on lines +59 to 72
# Validate using ConfigurationValidator - let Pydantic errors propagate
ConfigurationValidator.validate_nlp_configuration(nlp_configuration)
self.nlp_configuration = nlp_configuration

if conf_file or conf_file == '':
self._validate_conf_file_path(conf_file)
if conf_file or conf_file == "":
if conf_file == "":
raise ValueError("conf_file is empty")
ConfigurationValidator.validate_file_path(conf_file)
self.nlp_configuration = self._read_nlp_conf(conf_file)

if conf_file is None and nlp_configuration is None:
conf_file = self._get_full_conf_path()
logger.debug(f"Reading default conf file from {conf_file}")
self.nlp_configuration = self._read_nlp_conf(conf_file)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are missing
ConfigurationValidator.validate_nlp_configuration(nlp_configuration) on the other if cases here? lines 66 and 71?

testing locally with

invalid_config = {
    "nlp_engine_name": "spacy",
    "models": "THIS_SHOULD_BE_A_LIST"  # BUG: Should be a list!
}

bypasses the checks

Copy link
Collaborator

@ShakutaiGit ShakutaiGit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great PR, Few minor comments.

with open(self._get_full_conf_path()) as file:
configuration = yaml.safe_load(file)

# Validate configuration using Pydantic-based ConfigurationValidator
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment necessary?

configuration = yaml.safe_load(file)

# Validate configuration using Pydantic-based ConfigurationValidator
from presidio_analyzer.input_validation import ConfigurationValidator
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this import be on top of the file ?

return validated_config.model_dump(exclude_unset=False)
except ValidationError as e:
raise ValueError(f"Invalid recognizer registry configuration: {e}")
except ImportError:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# Use model_dump() without exclude_unset to include default values
return validated_config.model_dump(exclude_unset=False)
except ValidationError as e:
raise ValueError(f"Invalid recognizer registry configuration: {e}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use something like that:

raise ValueError("Invalid recognizer registry configuration") from e

i think that Using from e keeps the original Pydantic error details and traceback..

:param languages: List of languages to validate.
"""
for lang in languages:
if not re.match(r"^[a-z]{2}(-[A-Z]{2})?$", lang):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that this pattern is re occuring over the PR,
Can we create a shared constant for language validation and reuse it across both files?

LANGUAGE_CODE_PATTERN = re.compile(r"^[a-z]{2}(-[A-Z]{2})?$")

import regex as re
from pydantic import BaseModel, ConfigDict, Field, field_validator, model_validator

logger = logging.getLogger("presidio-analyzer")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused logger ?

)
self.low_score_entity_names = LOW_SCORE_ENTITY_NAMES
if self.labels_to_ignore is None:
labels_to_ignore: Optional[Collection[str]] = Field(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually want Optional here, meaning we want to allow passing None, or should users simply omit the field to use the default?

self.labels_to_ignore = {}
return v

@field_validator("stride")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add the mode="before"?

if nlp_engines is None:
nlp_engines = (SpacyNlpEngine, StanzaNlpEngine, TransformersNlpEngine)

# No legacy validation - just assign the engines
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both comments here are needed ?


# Transform supported_entities -> supported_entity
# (PatternRecognizer expects singular)
if "supported_entities" in conf_copy:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code section repeated 3 times,
Can we extract this into a single helper method ?
Lines 156-159, 176-179, 265-274

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve input validation for YAML

4 participants