How to gracefully handle OOM #3582

Open
svandenhaute opened this issue Jul 27, 2024 · 2 comments


@svandenhaute

Is it possible to deal with OOM situations gracefully?

I'm dealing with input structures that are somewhat unphysical in terms of box size and/or particle positions. It's hard to filter these out beforehand. In any case, when CP2K tries to construct the PW basis, it runs out of memory, probably because the box is unnecessarily large and the number of plane waves is therefore exceptionally high.

The problem is that this triggers the kind of OOM that kills everything else that was running on the node. Ideally, I'd use e.g. ulimit to handle this more gracefully. When I use ulimit -v and set it to 2 GB (roughly the available memory per core), those errors are in fact successfully caught by the OS and no longer make everything crash. However, it sometimes also triggers OOM errors in cases where CP2K is clearly not running out of memory. Presumably, the ulimit kicks in a little too early, i.e. before all the separate processes have been spawned, since in those cases there hasn't been any CP2K output whatsoever.
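
For illustration, the limit that `ulimit -v` sets (RLIMIT_AS, the address-space limit) can also be applied to a single child process only, between fork and exec, rather than to the launching shell. The following is a minimal Python sketch for Linux; `run_with_memory_limit` is a hypothetical helper, not part of CP2K. Whether this avoids the premature failures described above is untested, and with MPI each rank would have to be wrapped separately.

```python
import resource
import subprocess
import sys


def run_with_memory_limit(cmd, limit_bytes):
    """Run cmd with an address-space limit applied only to the child process.

    Hypothetical helper: RLIMIT_AS is the limit that `ulimit -v` controls
    (ulimit takes KiB, setrlimit takes bytes).
    """
    def set_limit():
        # Runs in the child after fork and before exec, so the parent
        # (e.g. a workflow manager) keeps its own limits untouched.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = limit_bytes
        if hard != resource.RLIM_INFINITY:
            soft = min(limit_bytes, hard)  # never exceed the hard limit
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    return subprocess.run(cmd, preexec_fn=set_limit,
                          capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in for the real job: a child that tries to allocate 4 GiB
    # under a 1 GiB limit fails on its own without affecting the node.
    result = run_with_memory_limit(
        [sys.executable, "-c", "x = bytearray(4 * 1024**3)"],
        1 * 1024**3,
    )
    print("return code:", result.returncode)
```

A nonzero return code from the child can then be handled by the caller like any other failed job.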

@oschuett
Member
oschuett commented Jul 30, 2024

> Is it possible to deal with OOM situations gracefully?

It depends ;-)

Many methods have a hard requirement and there is no way to perform the calculation with less memory. The plane wave grids are a good example of that.
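
To make the scaling concrete: the number of plane waves inside a cutoff sphere grows linearly with the cell volume, N ≈ V·G_max³/(6π²), with G_max = sqrt(E_cut) in atomic units when E_cut is given in Rydberg. The sketch below uses this textbook estimate as a rough pre-filter for oversized boxes. The function name and the example threshold are hypothetical, and the memory CP2K actually needs for its grids will differ from this count, so treat it only as an order-of-magnitude indicator.

```python
import math

ANGSTROM_PER_BOHR = 0.529177210903  # CODATA 2018 Bohr radius in angstrom


def estimate_plane_waves(volume_angstrom3, cutoff_rydberg):
    """Rough count of plane waves with kinetic energy below the cutoff.

    Hypothetical helper: counts reciprocal-lattice points inside a sphere
    of radius G_max = sqrt(E_cut [Ry]) (in 1/bohr) for a cell of volume V.
    """
    volume_bohr3 = volume_angstrom3 / ANGSTROM_PER_BOHR**3
    g_max = math.sqrt(cutoff_rydberg)
    return volume_bohr3 * g_max**3 / (6.0 * math.pi**2)


if __name__ == "__main__":
    # Doubling every box edge multiplies the estimate by 8.
    small = estimate_plane_waves(10.0**3, 400.0)
    large = estimate_plane_waves(20.0**3, 400.0)
    print(f"{small:.3e} -> {large:.3e}")

    # Example pre-filter with an arbitrary, user-chosen threshold.
    MAX_PLANE_WAVES = 5.0e6  # assumption, tune for the machine at hand
    print("skip structure:", large > MAX_PLANE_WAVES)
```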

However, there are a few cases where we can trade memory for compute time, e.g. caching of ERIs or COSMA.
And indeed there has been a recent proposal to behave smarter in those situations: #3565.

That being said, for the time being I'm afraid you'll have to use trial and error to find the right settings.

@svandenhaute
Author
svandenhaute commented Jul 30, 2024

When I enable ulimit within the container, it seems to be OK.

As a side note, I've observed that when using the CP2K docker recipes and converting them to .sif, the container performance is still not reproducible across platforms with similar-performing hardware. In particular, I've noticed that in some cases it is necessary to use mpirun -bind-to core, whereas in other cases that tanks performance by a factor of 10.
How is this possible given that I'm using the --compat flag for apptainer exec? The MPI library is also located within the container...
