-
Notifications
You must be signed in to change notification settings - Fork 861
Add a validation layer for YAML based configuration #1780
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
# Conflicts: # presidio-analyzer/presidio_analyzer/nlp_engine/ner_model_configuration.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces Pydantic-based validation for YAML configuration files, modernizing the configuration management system in Presidio Analyzer. The changes centralize validation logic, improve error handling, and reduce technical debt by removing legacy validation methods.
Key Changes
- Added new Pydantic models for validating recognizer configurations and NLP engine settings
- Replaced legacy manual validation with declarative Pydantic-based validation
- Refactored
NerModelConfigurationfrom dataclass to Pydantic model - Removed redundant validation methods from
NlpEngineProvider
Reviewed Changes
Copilot reviewed 23 out of 24 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
pyproject.toml |
Added Pydantic 2.x dependency |
input_validation/schemas.py |
New centralized ConfigurationValidator class with static validation methods |
input_validation/yaml_recognizer_models.py |
New Pydantic models for recognizer registry configuration |
ner_model_configuration.py |
Converted from dataclass to Pydantic BaseModel with field validation |
nlp_engine_provider.py |
Removed legacy validation methods, integrated ConfigurationValidator |
pattern.py |
Added regex and score validation using regex library |
pattern_recognizer.py |
Added support for supported_entities to supported_entity conversion |
recognizers_loader_utils.py |
Enhanced to handle Pydantic-normalized fields and entity conversions |
recognizer_registry_provider.py |
Integrated ConfigurationValidator for registry validation |
analyzer_engine_provider.py |
Added configuration validation call |
| Test files | Comprehensive test coverage for new validation logic |
presidio-analyzer/presidio_analyzer/input_validation/yaml_recognizer_models.py
Outdated
Show resolved
Hide resolved
presidio-analyzer/presidio_analyzer/input_validation/yaml_recognizer_models.py
Outdated
Show resolved
Hide resolved
presidio-analyzer/presidio_analyzer/input_validation/schemas.py
Outdated
Show resolved
Hide resolved
presidio-analyzer/presidio_analyzer/input_validation/schemas.py
Outdated
Show resolved
Hide resolved
presidio-analyzer/presidio_analyzer/input_validation/schemas.py
Outdated
Show resolved
Hide resolved
presidio-analyzer/presidio_analyzer/input_validation/schemas.py
Outdated
Show resolved
Hide resolved
presidio-analyzer/presidio_analyzer/input_validation/yaml_recognizer_models.py
Show resolved
Hide resolved
presidio-analyzer/presidio_analyzer/input_validation/yaml_recognizer_models.py
Show resolved
Hide resolved
…gnizer_models.py Co-authored-by: Copilot <[email protected]>
…gnizer_models.py Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
…mri/pydantic_validation # Conflicts: # presidio-analyzer/presidio_analyzer/input_validation/schemas.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 23 out of 24 changed files in this pull request and generated 3 comments.
presidio-analyzer/presidio_analyzer/nlp_engine/nlp_engine_provider.py
Outdated
Show resolved
Hide resolved
presidio-analyzer/presidio_analyzer/input_validation/schemas.py
Outdated
Show resolved
Hide resolved
presidio-analyzer/presidio_analyzer/input_validation/yaml_recognizer_models.py
Show resolved
Hide resolved
| @@ -0,0 +1,487 @@ | |||
| """Pydantic models for YAML recognizer configurations.""" | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file contains the pydantic equivalent of the yaml structures. As these are not 1:1 with Presidio objects in some cases, there's an intermediate pydantic representation. The existing layer (AnalyzerEngineProvider, NlpEngineProvider, RecognizerRegistryProvider) still handles the actual creation of objects.
| @dataclass | ||
| class NerModelConfiguration: | ||
| """NER model configuration. | ||
| class NerModelConfiguration(BaseModel): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switched to BaseModel to improve validation
| if "supported_language" in recognizer_conf: | ||
| return [PatternRecognizer.from_dict(recognizer_conf)] | ||
| # legacy recognizer (has supported_language set to a value, not None) | ||
| if recognizer_conf.get("supported_language"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The legacy yaml based recognizer had a single language it supported. The newer YAML config allows the user to create recognizers with multiple languages which would translate to multiple instances of the same recognizer (one for each language). Therefore the validation needs to support both the legacy and the new structures.
Removed comments about excluded fields in recognizer initialization.
| # Validate using ConfigurationValidator - let Pydantic errors propagate | ||
| ConfigurationValidator.validate_nlp_configuration(nlp_configuration) | ||
| self.nlp_configuration = nlp_configuration | ||
|
|
||
| if conf_file or conf_file == '': | ||
| self._validate_conf_file_path(conf_file) | ||
| if conf_file or conf_file == "": | ||
| if conf_file == "": | ||
| raise ValueError("conf_file is empty") | ||
| ConfigurationValidator.validate_file_path(conf_file) | ||
| self.nlp_configuration = self._read_nlp_conf(conf_file) | ||
|
|
||
| if conf_file is None and nlp_configuration is None: | ||
| conf_file = self._get_full_conf_path() | ||
| logger.debug(f"Reading default conf file from {conf_file}") | ||
| self.nlp_configuration = self._read_nlp_conf(conf_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are missing
ConfigurationValidator.validate_nlp_configuration(nlp_configuration) on the other if cases here? lines 66 and 71?
testing locally with
invalid_config = {
"nlp_engine_name": "spacy",
"models": "THIS_SHOULD_BE_A_LIST" # BUG: Should be a list!
}
bypasses the checks
ShakutaiGit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great PR, Few minor comments.
| with open(self._get_full_conf_path()) as file: | ||
| configuration = yaml.safe_load(file) | ||
|
|
||
| # Validate configuration using Pydantic-based ConfigurationValidator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comment necessary?
| configuration = yaml.safe_load(file) | ||
|
|
||
| # Validate configuration using Pydantic-based ConfigurationValidator | ||
| from presidio_analyzer.input_validation import ConfigurationValidator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this import be on top of the file ?
| return validated_config.model_dump(exclude_unset=False) | ||
| except ValidationError as e: | ||
| raise ValueError(f"Invalid recognizer registry configuration: {e}") | ||
| except ImportError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually need this fallback?
i see that we already load it here
https://github.com/microsoft/presidio/pull/1780/files#diff-90164cd93bcc7aadd168773b72a4a398ab95778f3403c76471c790eab70b76b4R7
| # Use model_dump() without exclude_unset to include default values | ||
| return validated_config.model_dump(exclude_unset=False) | ||
| except ValidationError as e: | ||
| raise ValueError(f"Invalid recognizer registry configuration: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use something like that:
raise ValueError("Invalid recognizer registry configuration") from ei think that Using from e keeps the original Pydantic error details and traceback..
| :param languages: List of languages to validate. | ||
| """ | ||
| for lang in languages: | ||
| if not re.match(r"^[a-z]{2}(-[A-Z]{2})?$", lang): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that this pattern is re occuring over the PR,
Can we create a shared constant for language validation and reuse it across both files?
LANGUAGE_CODE_PATTERN = re.compile(r"^[a-z]{2}(-[A-Z]{2})?$")| import regex as re | ||
| from pydantic import BaseModel, ConfigDict, Field, field_validator, model_validator | ||
|
|
||
| logger = logging.getLogger("presidio-analyzer") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused logger ?
| ) | ||
| self.low_score_entity_names = LOW_SCORE_ENTITY_NAMES | ||
| if self.labels_to_ignore is None: | ||
| labels_to_ignore: Optional[Collection[str]] = Field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually want Optional here, meaning we want to allow passing None, or should users simply omit the field to use the default?
| self.labels_to_ignore = {} | ||
| return v | ||
|
|
||
| @field_validator("stride") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add the mode="before"?
| if nlp_engines is None: | ||
| nlp_engines = (SpacyNlpEngine, StanzaNlpEngine, TransformersNlpEngine) | ||
|
|
||
| # No legacy validation - just assign the engines |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
both comments here are needed ?
|
|
||
| # Transform supported_entities -> supported_entity | ||
| # (PatternRecognizer expects singular) | ||
| if "supported_entities" in conf_copy: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code section repeated 3 times,
Can we extract this into a single helper method ?
Lines 156-159, 176-179, 265-274
This pull request introduces Pydantic-based validation for configuration files and objects throughout the Presidio Analyzer codebase, improving reliability and error reporting. It replaces legacy manual validation logic with centralized schema validation, updates the
NerModelConfigurationclass to use Pydantic, and refactors providers to leverage the new validation approach.Configuration validation improvements:
ConfigurationValidatorclass ininput_validation/schemas.pyto validate analyzer and NLP configurations using Pydantic models, including checks for language codes, file paths, score thresholds, and nested structures.input_validation/__init__.pyto exposeConfigurationValidatorand related schema models for use throughout the codebase.Provider refactoring:
NlpEngineProviderinnlp_engine_provider.pyto remove legacy validation methods and useConfigurationValidatorfor all configuration checks, including file path and YAML structure validation. [1] [2]AnalyzerEngineProviderto validate analyzer configuration usingConfigurationValidatorand log warnings instead of printing errors.Model validation enhancements:
NerModelConfigurationwith a Pydantic-based class, adding field-level validation and default value handling for configuration options. [1] [2]Logging and error handling:
HOW TO TEST?
run
play with different yaml configurations, including wrong ones, and see that the output you're getting makes sense.