@kmetra1910 kmetra1910 commented Apr 14, 2025

Drop-in replacement for GMM from gmr

Implements:

.from_samples(...)

.sample(...)

.predict(...)

.to_responsibilities(...)

.condition(...)

Carefully replicates gmr behavior, including:

Manual initialization when n_samples < 2

Safe handling of degenerate weights (e.g., nan or negative)

Adds warnings instead of exceptions for a better debugging experience
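For readers unfamiliar with the gmr-style API, here is a rough sketch of the sklearn calls such a wrapper presumably builds on. The method correspondences in the comments are assumptions about the mapping, not taken from the PR's actual code:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])

# ~ .from_samples(X): fit a 2-component mixture to the data
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

samples, _ = gmm.sample(10)    # ~ .sample(10)
resp = gmm.predict_proba(X)    # ~ .to_responsibilities(X); rows sum to 1
labels = gmm.predict(X)        # hard cluster assignments (note: gmr's
                               # .predict does conditional regression instead)
print(samples.shape, resp.shape, labels.shape)
```

Note that `.condition(...)` has no direct sklearn counterpart, which is presumably why the wrapper implements Gaussian conditioning itself.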


This pull request includes several changes across multiple files to improve code readability and consistency. The most important changes involve adding missing imports, reformatting code for better readability, and updating string formatting.

Minor Additions:

  • Added missing author information in test files.


Copilot AI left a comment


Copilot reviewed 26 out of 26 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

bamt/utils/gmm_wrapper.py:160

  • The assignment of given_values using indexing [0] may unintentionally drop dimensions when conditioning on multiple variables. Consider using np.array(given_values) without the [0] indexing so that the full array is preserved.
given_values = np.array(given_values)[0]
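To illustrate the concern with a hypothetical `given_values` (not taken from the PR): indexing with `[0]` keeps only the first row, which silently loses data when conditioning values are supplied for more than one point:

```python
import numpy as np

# hypothetical conditioning values for two query points (not from the PR)
given_values = [[1.0, 2.0], [3.0, 4.0]]

a = np.array(given_values)[0]  # shape (2,): the second row is silently dropped
b = np.array(given_values)     # shape (2, 2): the full array is preserved
print(a.shape, b.shape)
```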

@kmetra1910
Author

Updated based on the PR review; added comments, docstrings, and the rest.

jrzkaminski
jrzkaminski previously approved these changes Apr 18, 2025
@jrzkaminski
Collaborator

The only thing left is to remove gmr from requirements and pyproject.toml now.

Collaborator

@Roman223 Roman223 left a comment


NB: I didn't run these changes myself!

Low-Level Observations

  • The low-level implementation looks solid overall.
  • A few commented-out lines should be removed to clean things up.
  • Exception handling is clear and well-structured.
  • Computation is efficient and numerically stable — e.g., good use of np.linalg.solve instead of inverse matrix operations.
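As a side note on the `np.linalg.solve` point: solving the linear system directly avoids forming an explicit inverse, which is both cheaper and numerically safer. A generic illustration (unrelated to the PR's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4)) + 4.0 * np.eye(4)  # a well-conditioned matrix
b = rng.normal(size=4)

x_solve = np.linalg.solve(A, b)  # factorizes A once; no explicit inverse
x_inv = np.linalg.inv(A) @ b     # forms A^{-1} explicitly (less stable)

print(np.allclose(x_solve, x_inv))
```

On well-conditioned matrices the two agree; on ill-conditioned ones, `solve` typically loses less precision.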

High-Level Issues

Redundant Validations in from_samples

There's a manual check to ensure X.shape[0] >= 2, like:

if X.shape[0] < 2:
    raise ValueError("Need at least 2 samples...")

But this check already exists within sklearn.GaussianMixture through:

Snippet from fit_predict:

X = validate_data(self, X, dtype=[np.float64, np.float32], ensure_min_samples=2)

Which internally uses:

if ensure_min_samples > 0:
    n_samples = _num_samples(array)
    if n_samples < ensure_min_samples:
        raise ValueError(
            "Found array with %d sample(s) (shape=%s) while a"
            " minimum of %d is required%s."
            % (n_samples, array.shape, ensure_min_samples, context)
        )

Recommendation: These checks are redundant and can be removed or delegated to sklearn.
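A quick check (a minimal sketch, assuming a recent scikit-learn) confirms that `GaussianMixture.fit` already rejects fewer than two samples on its own:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[0.0, 1.0]])  # a single 2-D sample

raised = False
try:
    GaussianMixture(n_components=1).fit(X)
except ValueError as exc:
    raised = True  # sklearn's own validation fires before any custom check
    print("sklearn rejected it:", exc)
```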


manual_init Duplication

Instead of rewriting the initialization logic, it's more maintainable to reuse sklearn's internal _initialize method. For reference, here's the original:

def _initialize(self, X, resp):
    """Initialization of the Gaussian mixture parameters."""
    n_samples, _ = X.shape
    weights, means, covariances = None, None, None
    if resp is not None:
        weights, means, covariances = _estimate_gaussian_parameters(
            X, resp, self.reg_covar, self.covariance_type
        )
        if self.weights_init is None:
            weights /= n_samples

    self.weights_ = weights if self.weights_init is None else self.weights_init
    self.means_ = means if self.means_init is None else self.means_init

    if self.precisions_init is None:
        self.covariances_ = covariances
        self.precisions_cholesky_ = _compute_precision_cholesky(
            covariances, self.covariance_type
        )
    else:
        self.precisions_cholesky_ = _compute_precision_cholesky_from_precisions(
            self.precisions_init, self.covariance_type
        )

Recommendation: Subclass GaussianMixture and override only what you need, e.g.:

from sklearn.mixture import GaussianMixture

class CustomGMM(GaussianMixture):
    def _initialize(self, X, resp):
        super()._initialize(X, resp)
        # Add any custom behavior here

This avoids code duplication and benefits from future improvements in scikit-learn.

Testing Strategy

  • Using gmr as a reference is acceptable for transitional verification.
  • However, the long-term goal is to remove gmr from bamt, which means the testing strategy should evolve.

Suggestions:

  • Shift from comparison-based testing to behavioral testing:
    • Check that likelihood improves after training.
    • Verify cluster consistency on toy datasets.
  • Add integration tests on synthetic datasets with known structure.
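A behavioral test along these lines could look like the following sketch (an assumed test shape, not from the PR): since EM never decreases the likelihood, a longer run from the same initialization should score at least as well as a one-iteration run:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 1.0, size=(150, 1)),
               rng.normal(3.0, 1.0, size=(150, 1))])

# identical initialization (same random_state), different iteration budgets
ll_short = GaussianMixture(n_components=2, max_iter=1,
                           random_state=0).fit(X).score(X)
ll_long = GaussianMixture(n_components=2, max_iter=100,
                          random_state=0).fit(X).score(X)

# EM is monotone in average log-likelihood, so this holds regardless of gmr
print(ll_long >= ll_short - 1e-9)
```

Such invariants keep passing after gmr is removed, unlike output-comparison tests.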

Action Items

  1. Remove commented-out lines and any redundant stability checks.
  2. Refactor by subclassing sklearn.GaussianMixture and override only required methods.
  3. Redesign tests to focus on correctness and invariants instead of output comparison with gmr.

@Roman223
Collaborator

Roman223 commented Apr 18, 2025

@jrzkaminski then it should be removed carefully -- e.g., moved into the testing dependency group.

@kmetra1910
Author

Removed redundant validation from from_samples() method and cleaned up outdated comments.
Confirmed that sklearn.GaussianMixture already handles insufficient data with its internal validation logic (as @Roman223 suggested).

All tests pass.
I suggest moving the gmr removal and the behavioral test redesign into the next PR, along with a cleaner initialization approach (as suggested by Roman) using sklearn subclassing or a controlled parameter setup.

This will allow us to fully drop dependency on gmr and transition to robust behavioral testing (log-likelihood checks, cluster validation).
