Hit assert in ABTI_mem_pool_alloc() #333
Thank you for reporting an issue! The pool structure looks broken. This should not happen if the algorithm works correctly. This pool uses somewhat complicated logic (#183), but our CI (covering numerous OSs, compilers, and CPU architectures) has never hit this issue so far (see https://www.argobots.org/tests/ for the tested combinations). I haven't tested the Azure/Docker/VM combination, though. Regarding the line number of
I would really appreciate a reproducer to investigate this issue, even if the reproducing code is not small. |
Thanks for looking into this.
Many thanks, I'll keep you informed if there are any new findings. |
This was actually hit in our test suite. We're trying to see what we can achieve in github-actions, and this is one of the failures that we saw there; an example run is here: https://github.com/ashleypittman/daos/runs/2567827884 Generally, running under github-actions hasn't been that stable for us. We've found a few issues that all seem to relate to resource starvation or timeouts, which is not entirely unexpected given the constraints. We've since trimmed back the PR in question to a core set of functionality and landed it, but I can expand it again to see if I can hit upon a more reliable reproducer. Argobots is built from your v1.1 tag. I'll create another PR to reproduce the settings I was using before to see if I can trigger this again. It was occurring regularly for a couple of days for me last week. |
Thank you for your replies.
I thought Argobots might encounter a bug in 128-bit atomic CAS, which is used for this memory pool algorithm, but a widely used compiler (e.g., GCC) + x86/64 should not cause an issue. This feature is checked in https://github.com/pmodels/argobots/blob/main/src/include/asm/abtd_asm_int128_cas.h#L20-L36
Thank you. It is very helpful! We will investigate this issue, but as the program is large, please do not expect that I can find a bug very soon.
Regarding resource management, Argobots 1.1 fixed error handling paths, so Argobots itself should properly return resource allocation errors (e.g., memory allocation failure in this memory pool) to the user application unless the error is catastrophic. Those paths should be well tested (#309).
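As a rough illustration of why a lock-free pool would update its head with one wide CAS, here is a minimal sketch in C. It is not Argobots' code: it is a scaled-down analogue that packs a 32-bit node index and a 32-bit version tag into a single 64-bit word so that it runs with plain C11 atomics, whereas the pool discussed above relies on a 128-bit CAS. The names `tagged_stack_t`, `tagged_stack_push`, and `tagged_stack_pop` are hypothetical, and the idea that the extra bits act as a version tag against ABA problems is an assumption made for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define POOL_CAPACITY 8          /* hypothetical number of nodes */
#define NIL UINT32_MAX           /* marks "no node" */

typedef struct {
    _Atomic uint64_t head;                 /* (tag << 32) | index of top node */
    _Atomic uint32_t next[POOL_CAPACITY];  /* next[i]: node below node i */
} tagged_stack_t;

static uint64_t pack(uint32_t tag, uint32_t idx)
{
    return ((uint64_t)tag << 32) | idx;
}
static uint32_t idx_of(uint64_t word) { return (uint32_t)word; }
static uint32_t tag_of(uint64_t word) { return (uint32_t)(word >> 32); }

void tagged_stack_init(tagged_stack_t *s)
{
    atomic_init(&s->head, pack(0, NIL));
}

/* Push node index `node`; the tag is bumped on every successful update. */
void tagged_stack_push(tagged_stack_t *s, uint32_t node)
{
    assert(node < POOL_CAPACITY);
    uint64_t old_head = atomic_load(&s->head);
    uint64_t new_head;
    do {
        atomic_store(&s->next[node], idx_of(old_head));
        new_head = pack(tag_of(old_head) + 1, node);
        /* On failure, old_head is refreshed with the current value. */
    } while (!atomic_compare_exchange_weak(&s->head, &old_head, new_head));
}

/* Pop the top node index, or return NIL if the stack is empty. */
uint32_t tagged_stack_pop(tagged_stack_t *s)
{
    uint64_t old_head = atomic_load(&s->head);
    uint64_t new_head;
    do {
        if (idx_of(old_head) == NIL)
            return NIL;
        uint32_t below = atomic_load(&s->next[idx_of(old_head)]);
        new_head = pack(tag_of(old_head) + 1, below);
    } while (!atomic_compare_exchange_weak(&s->head, &old_head, new_head));
    return idx_of(old_head);
}
```

Because index and tag change together in one CAS, a head that was popped and pushed back in the meantime no longer compares equal. The sketch also shows why such a structure is fragile: if anything else overwrites `head` or `next`, the invariants break and an internal assertion is the likely symptom.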
Thanks! Tag v1.1 of Argobots has not been updated since March 31, so it would be helpful to know which commits directly reveal this issue (that potentially existed in Argobots). |
I could not reproduce this issue in the 4-5 runs I tried (shintaro-iwasaki/daos-copy#1). I will write a heavily threaded program and check this memory pool implementation in Argobots, but at this point I would suspect either a ULT stack overflow or a bug (e.g., illegal memory access) in DAOS. |
I am hitting this assert on Summit. This is running with ASAN, which is reporting no errors ahead of the assert failure.
dspaces_server: ../src/include/abti_mem_pool.h:123: ABTI_mem_pool_alloc: Assertion `num_headers_in_cur_bucket >= 1' failed. |
@philip-davis Thank you very much. The error seems very similar to what @NiuYawei reported. I will check this issue again. |
I tested Argobots' memory-pool operations on a Summit-like POWER9 machine at Argonne, but I could not reproduce this issue.
What I did
Argobots v1.1 + POWER9 + GCC 9.3, Spack-default configuration. The test creates 10 million ULTs (no cutoff Fibonacci(34)) and schedules them in a random-work-stealing manner. I repeated this test with various numbers of ESs 500 times in total (which took a few hours).
## Environment
$ gcc --version
gcc (Spack GCC) 9.3.0
$ cat /proc/cpuinfo
...
cpu : POWER9, altivec supported
...
## Configure Argobots (the same as the default "spack install argobots")
$ git checkout v1.1
$ sh autogen.sh
$ ./configure --prefix=$(pwd)/install --enable-perf-opt
## Build and run modified fibonacci
$ gcc fib.c -labt -L install/lib -I install/include/ -Wl,-rpath=$(pwd)/install/lib -o fib.out
$ cat test.sh
for repeat in $(seq 5); do
for es in $(seq 100); do
date
echo "./fib.out -n 35 -e $es"
./fib.out -n 35 -e $es
done
done
$ sh test.sh
Code
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <unistd.h>
#include <stdarg.h>
#include <abt.h>

#define DEFAULT_NUM_XSTREAMS 4
#define DEFAULT_N 10

ABT_pool *pools;

typedef struct {
    int n;
    int ret;
} fibonacci_arg_t;

void fibonacci(void *arg)
{
    int n = ((fibonacci_arg_t *)arg)->n;
    int *p_ret = &((fibonacci_arg_t *)arg)->ret;
    if (n <= 1) {
        *p_ret = 1;
    } else {
        fibonacci_arg_t child1_arg = { n - 1, 0 };
        fibonacci_arg_t child2_arg = { n - 2, 0 };
        int rank;
        ABT_xstream_self_rank(&rank);
        ABT_pool target_pool = pools[rank];
        ABT_thread child1;
        /* Calculate fib(n - 1). */
        ABT_thread_create(target_pool, fibonacci, &child1_arg,
                          ABT_THREAD_ATTR_NULL, &child1);
        /* Calculate fib(n - 2). We do not create another ULT. */
        fibonacci(&child2_arg);
        ABT_thread_free(&child1);
        *p_ret = child1_arg.ret + child2_arg.ret;
    }
}

int fibonacci_seq(int n)
{
    if (n <= 1) {
        return 1;
    } else {
        int i;
        int fib_i1 = 1; /* Value of fib(i - 1) */
        int fib_i2 = 1; /* Value of fib(i - 2) */
        for (i = 3; i <= n; i++) {
            int tmp = fib_i1;
            fib_i1 = fib_i1 + fib_i2;
            fib_i2 = tmp;
        }
        return fib_i1 + fib_i2;
    }
}

int main(int argc, char **argv)
{
    int i, j;
    /* Read arguments. */
    int num_xstreams = DEFAULT_NUM_XSTREAMS;
    int n = DEFAULT_N;
    while (1) {
        int opt = getopt(argc, argv, "he:n:");
        if (opt == -1)
            break;
        switch (opt) {
            case 'e':
                num_xstreams = atoi(optarg);
                break;
            case 'n':
                n = atoi(optarg);
                break;
            case 'h':
            default:
                printf("Usage: ./fibonacci [-e NUM_XSTREAMS] [-n N]\n");
                return -1;
        }
    }

    /* Allocate memory. */
    ABT_xstream *xstreams =
        (ABT_xstream *)malloc(sizeof(ABT_xstream) * num_xstreams);
    pools = (ABT_pool *)malloc(sizeof(ABT_pool) * num_xstreams);
    ABT_sched *scheds = (ABT_sched *)malloc(sizeof(ABT_sched) * num_xstreams);

    /* Initialize Argobots. */
    ABT_init(argc, argv);

    /* Create pools. */
    for (i = 0; i < num_xstreams; i++) {
        ABT_pool_create_basic(ABT_POOL_FIFO, ABT_POOL_ACCESS_MPMC, ABT_TRUE,
                              &pools[i]);
    }

    /* Create schedulers. */
    for (i = 0; i < num_xstreams; i++) {
        ABT_pool *tmp = (ABT_pool *)malloc(sizeof(ABT_pool) * num_xstreams);
        for (j = 0; j < num_xstreams; j++) {
            tmp[j] = pools[(i + j) % num_xstreams];
        }
        ABT_sched_create_basic(ABT_SCHED_DEFAULT, num_xstreams, tmp,
                               ABT_SCHED_CONFIG_NULL, &scheds[i]);
        free(tmp);
    }

    /* Set up a primary execution stream. */
    ABT_xstream_self(&xstreams[0]);
    ABT_xstream_set_main_sched(xstreams[0], scheds[0]);

    /* Create secondary execution streams. */
    for (i = 1; i < num_xstreams; i++) {
        ABT_xstream_create(scheds[i], &xstreams[i]);
    }

    for (int i = 2; i <= n; i++) {
        fibonacci_arg_t arg = { i, 0 };
        fibonacci(&arg);
        int ret = arg.ret;
        int ans = fibonacci_seq(i);
        /* Check the results. */
        printf("Fibonacci(%d) = %d (ans: %d)\n", i, ret, ans);
    }

    /* Join secondary execution streams. */
    for (i = 1; i < num_xstreams; i++) {
        ABT_xstream_join(xstreams[i]);
        ABT_xstream_free(&xstreams[i]);
    }

    /* Finalize Argobots. */
    ABT_finalize();

    /* Free allocated memory. */
    free(xstreams);
    free(pools);
    free(scheds);
    return 0;
}
Although I have not confirmed the reason, I would first suggest you set
ABT_THREAD_STACKSIZE=256000 ./your_app.out
Explanation
If a ULT runs out of its function stack, it can overwrite [EDIT] 4KB is wrong. 16KB is correct. I'm not sure of Margo's default stack size, but if Margo does not explicitly set it, the program possibly caused a stack overflow, considering the depth of the function stack @philip-davis reported. By default it is To examine this, the latest Argobots ( |
Margo sets ABT_THREAD_STACKSIZE to 2097152 by default, so I doubt that's the issue, but I could be wrong. |
@mdorier Thank you. I will check the memory pool implementation again. |
Specifically speaking to my issue, I am initializing Argobots outside of
Margo, so Margo doesn't have the opportunity to change the value of
ABT_THREAD_STACKSIZE. When I increase the stack size as suggested, the
error (and a number of other hard to track down errors) disappear. This
appears to have been the problem. Thank you.
|
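For applications that initialize Argobots themselves rather than through Margo, one option is to provide a default stack size from within the program before initialization. The following is a minimal sketch, assuming that Argobots reads `ABT_THREAD_STACKSIZE` from the environment during `ABT_init()`; the helper name `set_default_ult_stacksize` is hypothetical, and the value 2097152 is simply the Margo default quoted earlier in this thread, not a recommendation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: give ABT_THREAD_STACKSIZE a default value while
 * respecting anything the user already exported.  It is meant to be called
 * before ABT_init() (assumption: the variable is read at initialization).
 * Returns 0 on success, -1 on failure (see setenv(3)). */
int set_default_ult_stacksize(const char *bytes)
{
    /* overwrite == 0: an existing value in the environment is kept. */
    return setenv("ABT_THREAD_STACKSIZE", bytes, 0);
}
```

A caller would invoke `set_default_ult_stacksize("2097152")` first and `ABT_init(argc, argv)` afterwards. Running the application as `ABT_THREAD_STACKSIZE=256000 ./your_app.out` still takes precedence, because the helper never overwrites an existing value.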
Thank you very much for the update. Based on your comments, I've tried updating to the tip of argobots (2202510), building with --enable-debug=most, and setting ABT_STACK_OVERFLOW_CHECK=mprotect. This has converted the general instability I was seeing into a constant, reproducible segfault, and as a result I've managed to identify at least two areas of our code which require attention. |
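To illustrate why an mprotect-based check turns silent corruption into a reproducible segfault, here is a minimal sketch of the general guard-page technique in C. It is not Argobots' implementation, and it assumes a POSIX system with `mmap`, `mprotect`, and `fork`; the names `guarded_stack_t`, `guarded_stack_create`, `guarded_stack_destroy`, and `guarded_stack_overflow_traps` are hypothetical.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc in strict modes */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef struct {
    void *base;         /* start of the mapping; the guard page lives here */
    char *usable;       /* lowest usable byte, directly above the guard */
    size_t usable_size; /* usable bytes, rounded up to whole pages */
    size_t total_size;  /* guard page + usable bytes */
} guarded_stack_t;

/* Map a stack region with one inaccessible page below it.  Stacks grow
 * downward on common platforms, so an overflow hits the guard first. */
int guarded_stack_create(guarded_stack_t *s, size_t min_size)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || min_size == 0)
        return -1;
    size_t pagesz = (size_t)page;
    size_t usable = ((min_size + pagesz - 1) / pagesz) * pagesz;
    size_t total = usable + pagesz;
    void *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;
    if (mprotect(base, pagesz, PROT_NONE) != 0) {
        munmap(base, total);
        return -1;
    }
    s->base = base;
    s->usable = (char *)base + pagesz;
    s->usable_size = usable;
    s->total_size = total;
    return 0;
}

void guarded_stack_destroy(guarded_stack_t *s)
{
    munmap(s->base, s->total_size);
}

/* Simulate an overflow in a child process: write one byte below the usable
 * area.  Returns 1 if the child was killed by a signal, 0 if the write
 * went through silently, -1 on error. */
int guarded_stack_overflow_traps(const guarded_stack_t *s)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        volatile char *below = s->usable - 1; /* last byte of guard page */
        *below = 1;
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? 1 : 0;
}
```

Without the guard page, the same write would land in whatever memory happens to sit below the stack, and the damage would only surface later, for example as a failed assertion in an unrelated allocator.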
I can confirm that once we'd tried a build with the ABT_STACK_OVERFLOW_CHECK=memcheck feature and fixed two issues that were causing segfaults with that feature enabled, we've not seen this again. In fact, the system has been remarkably stable since, so I think we can confirm that the problems we were seeing were the result of stack overflow, and I'd be happy to close this bug report now. |
I've seen a couple of CI failures with this assertion failure, but do not have a good reproducer. This is running on Azure, so under docker on a shared VM, and I expect there to be extreme CPU and memory pressure in these cases.
ERROR: daos_engine:0 daos_engine: ../src/include/abti_mem_pool.h:123: ABTI_mem_pool_alloc: Assertion `num_headers_in_cur_bucket >= 1' failed.
ERROR: daos_engine:0 *** Process 43149 received signal 6 ***
Associated errno: Success (0)
/lib64/libpthread.so.0(+0x12b20)[0x7f660fdc8b20]
/lib64/libc.so.6(gsignal+0x10f)[0x7f660f1767ff]
/lib64/libc.so.6(abort+0x127)[0x7f660f160c35]
/lib64/libc.so.6(+0x21b09)[0x7f660f160b09]
/lib64/libc.so.6(+0x2fde6)[0x7f660f16ede6]
/opt/daos/bin/../prereq/release/argobots/lib/libabt.so.1(+0x10bf2)[0x7f660fb9ebf2]
/opt/daos/bin/../prereq/release/argobots/lib/libabt.so.1(ABT_thread_create+0x92)[0x7f660fb9eda2]
/opt/daos/bin/daos_engine[0x44b49f]
/opt/daos/bin/daos_engine[0x44af6e]
/opt/daos/bin/daos_engine(dss_ult_create+0x45)[0x44ada5]
/opt/daos/bin/daos_engine[0x417e20]
/opt/daos/bin/daos_engine[0x417a2b]
/opt/daos/bin/daos_engine[0x4174f5]
/opt/daos/bin/daos_engine[0x417105]
/opt/daos/bin/daos_engine(drpc_progress+0x27e)[0x4165ee]
/opt/daos/bin/daos_engine[0x415622]
/opt/daos/bin/../prereq/release/argobots/lib/libabt.so.1(+0x17dba)[0x7f660fba5dba]
/opt/daos/bin/../prereq/release/argobots/lib/libabt.so.1(+0x17f51)[0x7f660fba5f51]
DEBUG 21:05:28.522056 procmon.go:246: Cleaning Pool f04361ee-06fe-4c34-8ecc-8f1dd3a55c49 failed:pool evict failed: rpc error: code = Unknown desc = failed to send 92B message: dRPC recv: EOF
instance 0, pid 43149, rank 0 exited with status: /opt/daos/bin/daos_engine exited: signal: aborted (core dumped)