Skip to content

Conversation

@talagayev
Copy link
Member

@talagayev talagayev commented Nov 20, 2025

Fixes #4666.

Changes made in this Pull Request:

  • Addition of parallelization to WaterBridgeAnalysis class in analysis.hydrogenbonds.wbridge_analysis.
  • Addition of client_WaterBridgeAnalysis to conftest.py.
  • Adjustment of tests in test_wbridge.py to work with parallelization.

Here currently I adjusted analysis.hydrogenbonds.wbridge_analysis to work with parallelization and added an example of pytests that will show that it works.

The problem that I have encountered is that I had to modify universe_DA_PBC, since if I directly add the client_WaterBridgeAnalysis it will led to an infinite pytest on my local PC, mainly connected with us using here StringIO that appearantly has difficulties with parallelization.

The second thing is that currently the tests in test_wbridge.py are mainly with 1 frame in mind, so to showcase that parallelization works I could add additional tests similar like universe_DA_PBC_10frames().

PR Checklist

  • Issue raised/referenced?
  • Tests updated/added?
  • Documentation updated/added?
  • package/CHANGELOG file updated?
  • Is your name in package/AUTHORS? (If it is not, add it!)

Developers Certificate of Origin

I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.


📚 Documentation preview 📚: https://mdanalysis--5151.org.readthedocs.build/en/5151/

@codecov
Copy link

codecov bot commented Nov 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.72%. Comparing base (6b51eac) to head (3b4812f).
⚠️ Report is 1 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #5151   +/-   ##
========================================
  Coverage    92.72%   92.72%           
========================================
  Files          180      180           
  Lines        22458    22466    +8     
  Branches      3186     3187    +1     
========================================
+ Hits         20824    20832    +8     
  Misses        1177     1177           
  Partials       457      457           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@talagayev
Copy link
Member Author

@marinegor I would ping you in this PR, since it is parallelization :)

Here I would need your opinion on how to approach the pytest cases. I didn't add client_WaterBridgeAnalysis to the tests with StringIO since these would lead to the tests being infinite. The question is, is there any goaround for parallelization with StringIO or do we just modify the pytests to work?

Also as mentioned with the pytests being only for 1 frame, should I add for each test cases with multiple frames and test them as in the example that I have now or what would be the prefereable approach?

@marinegor
Copy link
Contributor

@talagayev I don't see a reason why these files take StringIO instead of just a filename, hence I'd suggest re-writing these tests to accept filenames, and put respective filenames (ADA.gro, DWA.gro, ...) in e.g. testsuite/MDAnalysisTests/data or in a new folder testsuite/MDAnalysisTests/data/waterbridge. Then everything should work as for other classes, I guess.

@talagayev
Copy link
Member Author

talagayev commented Nov 21, 2025

@talagayev I don't see a reason why these files take StringIO instead of just a filename, hence I'd suggest re-writing these tests to accept filenames, and put respective filenames (ADA.gro, DWA.gro, ...) in e.g. testsuite/MDAnalysisTests/data or in a new folder testsuite/MDAnalysisTests/data/waterbridge. Then everything should work as for other classes, I guess.

Agree, we can do them similar to the other cases with the files. I will create the test filles, put them into testsuite/MDAnalysisTests/data/waterbridge and adjust the pytests to use them. I will then do for now for the tests the case where I check for now for for each case a test with 1 frame, basically as they are now and a test for multiple frames, so similar like the added test_donor_accepter_pbc_multi, just name them more correspondingly to the tests with 1 frame.

@marinegor
Copy link
Contributor

@talagayev ok, sure! just @ me here or request a review when you're done.

@talagayev
Copy link
Member Author

@marinegor okay should be ready to be reviewed now, had to do adjust timesteps for parallelization cases, due to an error that was appearing, mainly that one that was appearing

analysis_func = <function TestWaterBridgeAnalysis.test_count_by_time_weight.<locals>.analysis at 0x000002322B345A80>, kwargs = {}, result = []

    def count_by_time(self, analysis_func=None, **kwargs):
        """Counts the number of water bridges per timestep.

        The counting behaviour can be adjusted by supplying analysis_func.
        See :ref:`wb_count_by_time` for details.

        Returns
        -------
        counts : list
             Returns a time series ``N(t)`` where ``N`` is the total
             number of observed water bridges at time ``t``.

        """
        if analysis_func is None:
            analysis_func = self._count_by_time_analysis
        if self.results.network:
            result = []
>           for time, frame in zip(self.timesteps, self.results.network):
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E           TypeError: 'NoneType' object is not iterable

..\..\..\package\MDAnalysis\analysis\hydrogenbonds\wbridge_analysis.py:1966: TypeError

So had to adjust timesteps to not be NoneType during parallelization.

As for the testfiles, I did create those that were created in the tests, put them in the data and import them similar as in the other tests.

@marinegor marinegor self-requested a review November 26, 2025 23:31
@talagayev
Copy link
Member Author

@marinegor Ah and I can update the CHANGELOG and add the Documentation if this looks good, wasn't sure if something needs to be modified so I kept it for last :)

@marinegor
Copy link
Contributor

@talagayev

I'm slightly worried about these timesteps -- shouldn't they be actual timestamps, and not just consecutive integers?

I.e. if I modify lines 1954-ish it to:

        if self.results.network:
            result = []
            print(f'{self.timesteps=}')
            assert False
            for time, frame in zip(self.timesteps, self.results.network):

and run tests, I get [0.0] before assertion error.

I guess the issue here is that attribute timesteps are implicitly expected to exist after the run is finished, which is not the case when we run it in parallel. Hence they should be moved to self.results.timesteps if we rely on them in _conclude or elsewhere.

It's weirdly the only class that actually does this:

❯ rg -t py self.timesteps
package/MDAnalysis/analysis/hydrogenbonds/wbridge_analysis.py
1027:        self.timesteps = None  # time for each frame
1314:        self.timesteps = []
1409:        self.timesteps.append(self._ts.time)
1969:        if self.timesteps is None:
1972:            timesteps = self.timesteps
2036:            if self.timesteps is None:
2039:                timesteps = self.timesteps
2140:        for t, hframe in zip(self.timesteps, timeseries):

@MDAnalysis/coredevs do you have any opinion on moving timesteps to results.timesteps instead? If I'm understanding correctly, the right way to do that is to deprecate self.timesteps in favor of self.results.timesteps, while for now copying self.results.timesteps to timesteps.

@talagayev
Copy link
Member Author

@talagayev

I'm slightly worried about these timesteps -- shouldn't they be actual timestamps, and not just consecutive integers?

I.e. if I modify lines 1954-ish it to:

        if self.results.network:
            result = []
            print(f'{self.timesteps=}')
            assert False
            for time, frame in zip(self.timesteps, self.results.network):

and run tests, I get [0.0] before assertion error.

I guess the issue here is that attribute timesteps are implicitly expected to exist after the run is finished, which is not the case when we run it in parallel. Hence they should be moved to self.results.timesteps if we rely on them in _conclude or elsewhere.

It's weirdly the only class that actually does this:

❯ rg -t py self.timesteps
package/MDAnalysis/analysis/hydrogenbonds/wbridge_analysis.py
1027:        self.timesteps = None  # time for each frame
1314:        self.timesteps = []
1409:        self.timesteps.append(self._ts.time)
1969:        if self.timesteps is None:
1972:            timesteps = self.timesteps
2036:            if self.timesteps is None:
2039:                timesteps = self.timesteps
2140:        for t, hframe in zip(self.timesteps, timeseries):

@MDAnalysis/coredevs do you have any opinion on moving timesteps to results.timesteps instead? If I'm understanding correctly, the right way to do that is to deprecate self.timesteps in favor of self.results.timesteps, while for now copying self.results.timesteps to timesteps.

True, yes was not sure about the solution, if it will work for other cases, that are not covered by the pytests 🙈 Yes should be timesteps, which is currently tricky to get during multiprocessing, which is as you mentioned that they have to exist after the run, which works well with serial, but difficult with multiprocessing.

hm, yes the moving to of timesteps to results.timesteps could be an option I think, would need to look more into it as well.

@marinegor
Copy link
Contributor

I vaguely remember having this conversation already, actually.

I think #3720 and #3735 (reading changelog) implemented something like that, so we could in principle do this right away too🫣 But I need other coderev opinions.

@talagayev
Copy link
Member Author

I vaguely remember having this conversation already, actually.

I think #3720 and #3735 (reading changelog) implemented something like that, so we could in principle do this right away too🫣 But I need other coderev opinions.

Yep, sounds good, let's see what the others say, in the meantime I can check the PR and how it was done there.

@orbeckst
Copy link
Member

timesteps is not part of the AnalysisBase API (as far as I can tell) so I don't see an issue putting it under results.

If it was documented as a top level attribute of the specific class before then you have to deprecate it and keep a pointer around until we can remove it in 3.0.

(Apologies, quick comment before I am now starting my 🦃 days.)

4ALA H 4 0.500 0.000 0.000
4ALA N 5 0.600 0.000 0.000
1.0 1.0 1.0"""
u = MDAnalysis.Universe(StringIO(grofile), format="gro")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow the whole StringIO thing - but being able to read a StringIO is a core component of MDAnalysis. Are we saying that this is no longer possible?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem in the current iteration is that when StringIO is combined with parallelization it led to infinite pytests, so reading StringIO + doing analysis with dask/multiprocessing is not compatible. Maybe an option would to add then an exception case and some check before the running of an analysis if the file was read from StringIO 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@talagayev can you perhaps provide a minimal running example of what was happening with StringIO, so that we could try running it locally and see the infiniteness for ourselves?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I can try maybe to do a small push, put one StringIO case back so that you can then git clone it and run/test locally. Can do it in around 30 minutes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay got the test_loop to the intial state where it reads the StringIO to create the universe. It currently fails the tests here due to the time limit, so you can git clone and run it locally

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for anyone wondering, here's the standalone code:

from io import StringIO

import MDAnalysis as mda
from MDAnalysis.analysis.hydrogenbonds.wbridge_analysis import (
    WaterBridgeAnalysis,
)

s = """Test gro file
5
    1ALA      O    1   0.000   0.001   0.000
    2SOL     OW    2   0.300   0.001   0.000
    2SOL    HW1    3   0.200   0.002   0.000
    2SOL    HW2    4   0.200   0.000   0.000
    4ALA      O    5   0.600   0.000   0.000
  1.0   1.0   1.0"""
grofile = StringIO(s)


wb = WaterBridgeAnalysis(
    mda.Universe(grofile, format="gro"),
    "protein and (resid 1)",
    "protein and (resid 1 or resid 4)",
)
wb.run(backend="multiprocessing")

and here's the stacktrace:

In [9]: wb.run(backend='multiprocessing')
/home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/reduction.py:51: UserWarning: Reader has no dt information, set to 1.0 ps
  cls(buf, protocol).dump(obj)
Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/process.py", line 313, in _bootstrap
    self.run()
    ~~~~~~~~^^
  File "/home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/pool.py", line 114, in worker
    task = get()
  File "/home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/queues.py", line 387, in get
    return _ForkingPickler.loads(res)
           ~~~~~~~~~~~~~~~~~~~~~^^^^^
  File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__
    return getattr(self.stream, x)
                   ^^^^^^^^^^^
  File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__
    return getattr(self.stream, x)
                   ^^^^^^^^^^^
  File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__
    return getattr(self.stream, x)
                   ^^^^^^^^^^^
  [Previous line repeated 2970 more times]
RecursionError: maximum recursion depth exceeded
Exception ignored in: <function NamedStream.__del__ at 0x70e968095800>
Traceback (most recent call last):
  File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__
    return getattr(self.stream, x)
  File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__
    return getattr(self.stream, x)
  File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__
    return getattr(self.stream, x)
  [Previous line repeated 997 more times]
RecursionError: maximum recursion depth exceeded
^CProcess ForkPoolWorker-2:

@talagayev I'd suggest to make a separate issue out of it and not include it in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not include it in this PR

If you decide to move everything out of the StringIO representation, I would ask that you open an issue to fix it once the underlying issue is fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer if we didn't need to add all these files and could continue using the in-line test snippets.

More broadly, before working around issues we should understand what's going wrong.

@marinegor
Copy link
Contributor

also @talagayev you can have a look at InterRDS_s and DeprecationWarning there to have a look at how things get moved to results. I'd do something similar here as well.

Comment on lines +1969 to +1972
if self.timesteps is None:
timesteps = range(len(self.results.network))
else:
timesteps = self.timesteps
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be actual timesteps, and not integer indices. Solution would be to save them to results._timesteps and later use them where needed, while adding deprecation warning to self.timesteps alike InterRDS_s's approach.

"SURFACE_PDB", # 111 FCC lattice topology for NSGrid bug #2345
"SURFACE_TRR", # full precision coordinates for NSGrid bug #2345
"DSSP", # DSSP test suite
"WB_AD",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add desriptors for all these files, the filenames themselves won't make a ton of sense a couple months down the line.

return request.param


# MDAnalysis.analysis.hydrogenbonds.wbridge_analysis
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove?

@orbeckst
Copy link
Member

orbeckst commented Dec 1, 2025

@talagayev why is StringIO not working here?

@marinegor
Copy link
Contributor

marinegor commented Dec 1, 2025 via email

@talagayev
Copy link
Member Author

I pasted traceback above -- seems that StringIO doesn't work well with multiprocessing.

Could it be connected somehow to the case, that at some point in the parallelization if it uses ReaderBase to read the file it would try to read the file and would not be able to access it, since it gets only a string that is not pickable and iteratable due to this part of the code in the coordinates/base.py? Since here he only exception being NamedStream files.

    if isinstance(filename, NamedStream):
        self.filename = filename
    else:
        self.filename = str(filename)

maybe we can adjust it somehow for StringIO cases. How is it currently regulated with NamedStream cases? I assume those would possibly also fall into the same line of them being difficult to parallelize?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MDAnalysis.analysis.hydrogenbonds.WaterBridgeAnalysis: Implement parallelization or mark as unparallelizable

4 participants