-
Notifications
You must be signed in to change notification settings - Fork 746
Enabling of parallelization of analysis.hydrogenbonds.WaterBridgeAnalysis
#5151
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #5151 +/- ##
========================================
Coverage 92.72% 92.72%
========================================
Files 180 180
Lines 22458 22466 +8
Branches 3186 3187 +1
========================================
+ Hits 20824 20832 +8
Misses 1177 1177
Partials 457 457 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@marinegor I would ping you in this PR, since it is parallelization :) Here I would need your opinion on how to approach the pytest cases. I didn't add Also as mentioned with the pytests being only for 1 frame, should I add for each test cases with multiple frames and test them as in the example that I have now or what would be the prefereable approach? |
|
@talagayev I don't see a reason why these files take |
Agree, we can do them similar to the other cases with the files. I will create the test filles, put them into |
|
@talagayev ok, sure! just |
|
@marinegor okay should be ready to be reviewed now, had to do adjust So had to adjust As for the testfiles, I did create those that were created in the tests, put them in the |
|
@marinegor Ah and I can update the |
|
I'm slightly worried about these timesteps -- shouldn't they be actual timestamps, and not just consecutive integers? I.e. if I modify lines 1954-ish it to: if self.results.network:
result = []
print(f'{self.timesteps=}')
assert False
for time, frame in zip(self.timesteps, self.results.network):and run tests, I get I guess the issue here is that attribute It's weirdly the only class that actually does this: @MDAnalysis/coredevs do you have any opinion on moving |
True, yes was not sure about the solution, if it will work for other cases, that are not covered by the pytests 🙈 Yes should be timesteps, which is currently tricky to get during multiprocessing, which is as you mentioned that they have to exist after the run, which works well with serial, but difficult with multiprocessing. hm, yes the moving to of |
Yep, sounds good, let's see what the others say, in the meantime I can check the PR and how it was done there. |
|
If it was documented as a top level attribute of the specific class before then you have to deprecate it and keep a pointer around until we can remove it in 3.0. (Apologies, quick comment before I am now starting my 🦃 days.) |
| 4ALA H 4 0.500 0.000 0.000 | ||
| 4ALA N 5 0.600 0.000 0.000 | ||
| 1.0 1.0 1.0""" | ||
| u = MDAnalysis.Universe(StringIO(grofile), format="gro") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow the whole StringIO thing - but being able to read a StringIO is a core component of MDAnalysis. Are we saying that this is no longer possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem in the current iteration is that when StringIO is combined with parallelization it led to infinite pytests, so reading StringIO + doing analysis with dask/multiprocessing is not compatible. Maybe an option would to add then an exception case and some check before the running of an analysis if the file was read from StringIO 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@talagayev can you perhaps provide a minimal running example of what was happening with StringIO, so that we could try running it locally and see the infiniteness for ourselves?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I can try maybe to do a small push, put one StringIO case back so that you can then git clone it and run/test locally. Can do it in around 30 minutes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay got the test_loop to the intial state where it reads the StringIO to create the universe. It currently fails the tests here due to the time limit, so you can git clone and run it locally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for anyone wondering, here's the standalone code:
from io import StringIO
import MDAnalysis as mda
from MDAnalysis.analysis.hydrogenbonds.wbridge_analysis import (
WaterBridgeAnalysis,
)
s = """Test gro file
5
1ALA O 1 0.000 0.001 0.000
2SOL OW 2 0.300 0.001 0.000
2SOL HW1 3 0.200 0.002 0.000
2SOL HW2 4 0.200 0.000 0.000
4ALA O 5 0.600 0.000 0.000
1.0 1.0 1.0"""
grofile = StringIO(s)
wb = WaterBridgeAnalysis(
mda.Universe(grofile, format="gro"),
"protein and (resid 1)",
"protein and (resid 1 or resid 4)",
)
wb.run(backend="multiprocessing")and here's the stacktrace:
In [9]: wb.run(backend='multiprocessing')
/home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/reduction.py:51: UserWarning: Reader has no dt information, set to 1.0 ps
cls(buf, protocol).dump(obj)
Process ForkPoolWorker-1:
Traceback (most recent call last):
File "/home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/process.py", line 313, in _bootstrap
self.run()
~~~~~~~~^^
File "/home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/pool.py", line 114, in worker
task = get()
File "/home/marinegor/.local/share/uv/python/cpython-3.13.7-linux-x86_64-gnu/lib/python3.13/multiprocessing/queues.py", line 387, in get
return _ForkingPickler.loads(res)
~~~~~~~~~~~~~~~~~~~~~^^^^^
File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__
return getattr(self.stream, x)
^^^^^^^^^^^
File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__
return getattr(self.stream, x)
^^^^^^^^^^^
File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__
return getattr(self.stream, x)
^^^^^^^^^^^
[Previous line repeated 2970 more times]
RecursionError: maximum recursion depth exceeded
Exception ignored in: <function NamedStream.__del__ at 0x70e968095800>
Traceback (most recent call last):
File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__
return getattr(self.stream, x)
File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__
return getattr(self.stream, x)
File "/home/marinegor/github/mdanalysis/package/MDAnalysis/lib/util.py", line 736, in __getattr__
return getattr(self.stream, x)
[Previous line repeated 997 more times]
RecursionError: maximum recursion depth exceeded
^CProcess ForkPoolWorker-2:
@talagayev I'd suggest to make a separate issue out of it and not include it in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not include it in this PR
If you decide to move everything out of the StringIO representation, I would ask that you open an issue to fix it once the underlying issue is fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer if we didn't need to add all these files and could continue using the in-line test snippets.
More broadly, before working around issues we should understand what's going wrong.
get one StringIO case back for showcase
add missing
import addition
|
also @talagayev you can have a look at |
| if self.timesteps is None: | ||
| timesteps = range(len(self.results.network)) | ||
| else: | ||
| timesteps = self.timesteps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be actual timesteps, and not integer indices. Solution would be to save them to results._timesteps and later use them where needed, while adding deprecation warning to self.timesteps alike InterRDS_s's approach.
| "SURFACE_PDB", # 111 FCC lattice topology for NSGrid bug #2345 | ||
| "SURFACE_TRR", # full precision coordinates for NSGrid bug #2345 | ||
| "DSSP", # DSSP test suite | ||
| "WB_AD", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add desriptors for all these files, the filenames themselves won't make a ton of sense a couple months down the line.
| return request.param | ||
|
|
||
|
|
||
| # MDAnalysis.analysis.hydrogenbonds.wbridge_analysis |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove?
|
@talagayev why is StringIO not working here? |
|
I pasted traceback above -- seems that StringIO doesn't work well with multiprocessing.
|
Could it be connected somehow to the case, that at some point in the parallelization if it uses maybe we can adjust it somehow for |
Fixes #4666.
Changes made in this Pull Request:
WaterBridgeAnalysisclass inanalysis.hydrogenbonds.wbridge_analysis.client_WaterBridgeAnalysistoconftest.py.test_wbridge.pyto work with parallelization.Here currently I adjusted
analysis.hydrogenbonds.wbridge_analysisto work with parallelization and added an example of pytests that will show that it works.The problem that I have encountered is that I had to modify
universe_DA_PBC, since if I directly add theclient_WaterBridgeAnalysisit will led to an infinite pytest on my local PC, mainly connected with us using hereStringIOthat appearantly has difficulties with parallelization.The second thing is that currently the tests in
test_wbridge.pyare mainly with 1 frame in mind, so to showcase that parallelization works I could add additional tests similar likeuniverse_DA_PBC_10frames().PR Checklist
package/CHANGELOGfile updated?package/AUTHORS? (If it is not, add it!)Developers Certificate of Origin
I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.
📚 Documentation preview 📚: https://mdanalysis--5151.org.readthedocs.build/en/5151/