memory leak in get_num_procs() #2003


Closed
jbrandmeyer opened this issue Feb 6, 2019 · 3 comments

@jbrandmeyer

Seen in version 0.3.5+ds-1 as packaged by the Debian project for Debian Sid, on an AMD Threadripper 2950X system.

Some of my test programs are tripping GCC's memory-leak detection built into its implementation of ASan. It looks like the error-handling paths coming out of get_num_procs() are being triggered, and they leak memory.

Hand-walking through get_num_procs():

  • sysconf(_SC_NPROCESSORS_CONF) returns 32
  • CPU_ALLOC_SIZE(32) returns 8. However, sizeof(cpu_set_t) is 128.
  • Calling sched_getaffinity(0, CPU_ALLOC_SIZE(32), <a previously allocated cpu_set_t *>) returns -1: Invalid argument (EINVAL). However, sched_getaffinity(0, sizeof(cpu_set_t), <the same pointer>) succeeds.
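
For reference, here is a minimal standalone sketch (my own test program, not OpenBLAS code) of the sequence above; the sizes in the comments are what I see on this machine:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        long nprocs = sysconf(_SC_NPROCESSORS_CONF);  /* reports 32 here */
        size_t size = CPU_ALLOC_SIZE(nprocs);         /* 8 bytes for 32 CPUs */
        cpu_set_t *cpusetp = CPU_ALLOC(nprocs);
        if (cpusetp == NULL) return 1;

        printf("CPU_ALLOC_SIZE(%ld) = %zu, sizeof(cpu_set_t) = %zu\n",
               nprocs, size, sizeof(cpu_set_t));

        /* Fails with EINVAL on this machine. */
        if (sched_getaffinity(0, size, cpusetp) != 0)
            perror("sched_getaffinity(CPU_ALLOC_SIZE)");
        CPU_FREE(cpusetp);

        /* The fixed-size variant succeeds. */
        cpu_set_t fixed;
        if (sched_getaffinity(0, sizeof(fixed), &fixed) == 0)
            printf("fixed-size call sees %d CPUs\n", CPU_COUNT(&fixed));
        return 0;
    }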

It's not clear to me just from reading the manpages whether glibc or OpenBLAS is at fault here. It certainly seems reasonable to pass CPU_ALLOC_SIZE(nprocs) instead of sizeof(cpu_set_t) as the second parameter.

That said, once ret is nonzero, get_num_procs() returns without CPU_FREE()'ing the previously CPU_ALLOC()'ed cpusetp, and memory is leaked.
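
In other words (a paraphrased sketch of the pattern, not the verbatim OpenBLAS source), the error path needs a CPU_FREE() before it bails out:

    cpu_set_t *cpusetp = CPU_ALLOC(nums);
    if (cpusetp == NULL) return nums;
    size_t size = CPU_ALLOC_SIZE(nums);
    if (sched_getaffinity(0, size, cpusetp) != 0) {
        CPU_FREE(cpusetp);   /* the missing cleanup; without it, cpusetp leaks */
        return nums;
    }
    int n = CPU_COUNT_S(size, cpusetp);
    CPU_FREE(cpusetp);
    return n;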

@martin-frbg
Collaborator

You are talking about memory.c here, I assume? There is similar code in init.c that, I think, does not return (too) early.
That code was cobbled together from the manpages, so it may well be wrong. On rereading, it seems to me that libc can use a fixed-size cpu_set_t that is bigger than the bare minimum number of bytes returned by CPU_ALLOC_SIZE. So perhaps that size should be the maximum of the two, to cover both systems with up to 1024 CPUs and those with more? (The manpage's suggestion of probing different sizes on a big system until sched_getaffinity() succeeds looks strangely lame.)
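
Something along these lines, perhaps (an untested sketch, variable names made up):

    /* Untested: take the larger of the two candidate sizes, and make the
       allocation at least that big so the kernel can write the whole mask. */
    size_t size = CPU_ALLOC_SIZE(nums);
    if (size < sizeof(cpu_set_t))
        size = sizeof(cpu_set_t);
    cpu_set_t *cpusetp = CPU_ALLOC(size * 8);  /* CPU_ALLOC takes a CPU count */
    if (cpusetp != NULL) {
        if (sched_getaffinity(0, size, cpusetp) == 0)
            nums = CPU_COUNT_S(size, cpusetp);
        CPU_FREE(cpusetp);
    }
    return nums;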

@jbrandmeyer
Author

Yes, driver/others/memory.c, from the release-0.3.0 branch, inside some feature-test macros at line 226.

I, for one, do not have access to such a well-endowed shared-memory machine to test empirically. I think you'll have to ask the glibc developers exactly what the API is supposed to be here, since it's particular to glibc. That said, once upon a time, when Mono faced this problem, they did use sizeof(cpu_set_t):

https://github.com/mono/mono/blob/17aa73d0ae216529b59d5aa3d0ee68007bb035c4/mono/utils/mono-proclib.c#L774-L780
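
i.e. roughly this shape (my paraphrase of the approach at that link, not the verbatim Mono code):

    cpu_set_t set;
    int count = 0;
    if (sched_getaffinity(getpid(), sizeof(set), &set) == 0)
        count = CPU_COUNT(&set);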

@jbrandmeyer
Author

Further digging in the kernel sources:

sched_getaffinity() will return EINVAL if:

  • the len argument is not a multiple of sizeof(unsigned long)
  • the len argument isn't large enough to hold nr_cpu_ids bits
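
In other words, roughly this validation (simplified from my reading of kernel/sched/core.c):

    if ((len * 8) < nr_cpu_ids)             /* too small to hold all CPU ids */
        return -EINVAL;
    if (len & (sizeof(unsigned long) - 1))  /* not a multiple of word size */
        return -EINVAL;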

For AMD Threadripper 2950X, I see the following in dmesg:

[ 0.000000] smpboot: Allowing 128 CPUs, 96 hotplug CPUs
[ 0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:128 nr_node_ids:1

Sure enough, if I build up a cpu_set_t that is 128 bits large, everything works. Maybe the bug is that _SC_NPROCESSORS_CONF is returning 32 instead of 128.
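
The arithmetic matches (sizes as computed by glibc's macros on x86-64):

    CPU_ALLOC_SIZE(32)   /* ->  8 bytes =  64 bits: fails the nr_cpu_ids check */
    CPU_ALLOC_SIZE(128)  /* -> 16 bytes = 128 bits: passes, and the call works */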

martin-frbg added a commit that referenced this issue Feb 10, 2019
* Fix potential memory leak in cpu enumeration with glibc

An early return after a failed call to sched_getaffinity would leak the previously allocated cpu_set_t. Wrong calculation of the size argument in that call increased the likelihood of that failure. Fixes #2003