-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Further improvements to memory.c. #1625
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
40ac5ea
to
0d77548
Compare
I get core dumps with this patch applied unfortunately I'm not seeing an actual performance difference |
It should be about the same as using TLS from before, but should be as fast for systems that don't support compiler-generated TLS. The delta between the two was about 15-20% before, now it should be around 1%. What are you building with (just USE_THREADS=1)? |
This patch is not quite right, on ARM64, it leads to improvements for small matrices (even without threading), but large matrix GEMM has gotten worse pretty uniformly (results on pixel2). Old: 3313e4b Since I am testing single threaded performance on ARM64, none of the other changes has any effect. These numbers are fairly repeatable too.
|
Older compilers do not like the |
Ahh, I’ll just use the same pattern I’ve seen elsewhere for the
definition. I’ll check out the test; I wonder if it’s actualy starting too
many threads. I did make the assumption that MAX_CPU_NUMBER is the max
number of threads that can be created, is that a false assumption?
At any rate, I put this up so that you all could take a look at it since it
has a fair number of changes. I’ll dig back into it tomorrow, but thanks
for testing it out!
|
Just trying to redo the failed CI jobs locally to find the causes that are hidden by the QUIET_MAKE setting. (Removing that option only makes it hit the output limit in travis...). And the threads issue is probably related to what I warned sandwichmaker about in #1616 - NUM_BUFFERS is/was allocated as NUM_THREADS*2 to leave room for level3 blas functions to spawn their own threads (fork utest spawns cblas_dgemm jobs). |
Seems my problem with the fork utest is directly related to the small NUM_THREADS chosen at build time (4 on the dual-core laptop I used yesterday, while the CI runs set NUM_THREADS=32). So perhaps it is this default that needs to be changed (although that test used to complete in seconds even with NUM_THREADS=4. Now if I just increase the limits in memory.c to get past the "too many threads" warning, I see one thread each spinning sched_yield with three others waiting on the lock in get_memory_table().) |
The issue is actually that at fork() we need to reinitialize the memory table for the child process. Well that and also increase the max thread number. I think I've also resolved the performance issue for large matrices (needed to pad the alloc_t structure for proper alignment). I'll update the PR soon. |
Ok updated. @sandwichmaker and I have both seen some small improvements now with large matrices on x86_64, and he's going to test with ARM hopefully later today. This also resolves the fork test (that's a very handy test, btw, I never would have realized we needed to reinitialize things without it). |
Number from @sandwichmaker
|
Seems get_memory_table does not have an implementation on Windows yet, and next_memory_table_pos only exists in the COMPILER_TLS case (which breaks the xcode builds). |
79ff1f8
to
e9d6864
Compare
Whoops, missed the ALLOC_MMAP ifdef, that should have been below the new code (should fix windows), and also fixed the ifdef. I should really get set up on more platforms.... |
driver/others/memory.c
Outdated
void (*func)(struct alloc_t *); | ||
/* Pad to 64-byte alignment */ | ||
#ifdef __64BIT__ | ||
char pad[48]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems fragile, is there no way for the compiler to do the padding automatically, or by using a macro of some sort here which will deduce the size and do the right thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
driver/others/memory.c
Outdated
#endif | ||
}; | ||
|
||
static const int allocation_block_size = BUFFER_SIZE + sizeof(struct alloc_t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, and added more documentation for the struct and why it exists, too.
driver/others/memory.c
Outdated
|
||
static const int allocation_block_size = BUFFER_SIZE + sizeof(struct alloc_t); | ||
|
||
/* TLS is supported from clang 2.8, gcc 4.1, MSVC 2005, and XCode 8 */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a rather convoluted test, perhaps break this into multiple more easily checkable/modifiable define statements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
along the lines of the android test you have below, which is considerably easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
driver/others/memory.c
Outdated
|
||
/* Holds pointers to allocated memory */ | ||
#if defined(SMP) && !defined(USE_OPENMP) | ||
#define MAX_ALLOCATING_THREADS MAX_CPU_NUMBER * 2 * MAX_PARALLEL_NUMBER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the logic for this number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use indentation here to make the logic here easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. This is how we do it internally.
driver/others/memory.c
Outdated
static DWORD local_storage_key; | ||
#else | ||
static pthread_key_t local_storage_key; | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since the else and endifs are so nested, consider adding
#endif // foo to indicate what is ending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done here and elsewhere.
driver/others/memory.c
Outdated
|
||
} | ||
|
||
static void *alloc_windows(void *address){ | ||
void *map_address; | ||
|
||
map_address = VirtualAlloc(address, | ||
BUFFER_SIZE, | ||
allocation_block_size, | ||
MEM_RESERVE | MEM_COMMIT, | ||
PAGE_READWRITE); | ||
|
||
if (map_address == (void *)NULL) map_address = (void *)-1; | ||
|
||
if (map_address != (void *)-1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
document case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not sure what you mean here.
|
||
/* Memory allocation routine */ | ||
/* procpos ... indicates where it comes from */ | ||
/* 0 : Level 3 functions */ | ||
/* 1 : Level 2 functions */ | ||
/* 2 : Thread */ | ||
|
||
static void blas_memory_init(){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see note about indentation and better documenting the else and endifs earlier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
driver/others/memory.c
Outdated
release_info[release_pos].attr = fd; | ||
release_info[release_pos].func = alloc_hugetlbfile_free; | ||
release_pos ++; | ||
struct alloc_t *alloc_info = (struct alloc_t *)map_address; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this type of if statement and code pattern is repeated many times. any way this can be made simpler via a macro perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
driver/others/memory.c
Outdated
@@ -770,9 +845,9 @@ static void *alloc_devicedirver(void *address){ | |||
|
|||
#ifdef ALLOC_SHM | |||
|
|||
static void alloc_shm_free(struct release_t *release){ | |||
static void alloc_shm_free(struct alloc_t *release){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
release -> alloc_info everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
driver/others/memory.c
Outdated
release_info[release_pos].attr = fd; | ||
release_info[release_pos].func = alloc_devicedirver_free; | ||
release_pos ++; | ||
struct alloc_t *alloc_info = (struct alloc_t *)map_address; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace with macro?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is going to work since I have to force update the branch, but I've resolved all the comments.
driver/others/memory.c
Outdated
void (*func)(struct alloc_t *); | ||
/* Pad to 64-byte alignment */ | ||
#ifdef __64BIT__ | ||
char pad[48]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
driver/others/memory.c
Outdated
#endif | ||
}; | ||
|
||
static const int allocation_block_size = BUFFER_SIZE + sizeof(struct alloc_t); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, and added more documentation for the struct and why it exists, too.
driver/others/memory.c
Outdated
|
||
static const int allocation_block_size = BUFFER_SIZE + sizeof(struct alloc_t); | ||
|
||
/* TLS is supported from clang 2.8, gcc 4.1, MSVC 2005, and XCode 8 */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
driver/others/memory.c
Outdated
|
||
/* Holds pointers to allocated memory */ | ||
#if defined(SMP) && !defined(USE_OPENMP) | ||
#define MAX_ALLOCATING_THREADS MAX_CPU_NUMBER * 2 * MAX_PARALLEL_NUMBER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. This is how we do it internally.
driver/others/memory.c
Outdated
static DWORD local_storage_key; | ||
#else | ||
static pthread_key_t local_storage_key; | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done here and elsewhere.
driver/others/memory.c
Outdated
|
||
} | ||
|
||
static void *alloc_windows(void *address){ | ||
void *map_address; | ||
|
||
map_address = VirtualAlloc(address, | ||
BUFFER_SIZE, | ||
allocation_block_size, | ||
MEM_RESERVE | MEM_COMMIT, | ||
PAGE_READWRITE); | ||
|
||
if (map_address == (void *)NULL) map_address = (void *)-1; | ||
|
||
if (map_address != (void *)-1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not sure what you mean here.
driver/others/memory.c
Outdated
release_info[release_pos].attr = fd; | ||
release_info[release_pos].func = alloc_devicedirver_free; | ||
release_pos ++; | ||
struct alloc_t *alloc_info = (struct alloc_t *)map_address; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
driver/others/memory.c
Outdated
@@ -770,9 +845,9 @@ static void *alloc_devicedirver(void *address){ | |||
|
|||
#ifdef ALLOC_SHM | |||
|
|||
static void alloc_shm_free(struct release_t *release){ | |||
static void alloc_shm_free(struct alloc_t *release){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
driver/others/memory.c
Outdated
release_info[release_pos].attr = fd; | ||
release_info[release_pos].func = alloc_hugetlbfile_free; | ||
release_pos ++; | ||
struct alloc_t *alloc_info = (struct alloc_t *)map_address; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
/* Memory allocation routine */ | ||
/* procpos ... indicates where it comes from */ | ||
/* 0 : Level 3 functions */ | ||
/* 1 : Level 2 functions */ | ||
/* 2 : Thread */ | ||
|
||
static void blas_memory_init(){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Hmm... looks like GitHub's review system doesn't work like gerrit or our internal tools; you can't diff with the previous push, since you have to overwrite the commit. |
I believe one can still view changes relative to your original commit (by choosing that option in the "Changes from" dropdown on the diff page), and reviewers have an additional option to "view changes since your last review" there. |
Aha. I think that will work if @sandwichmaker sends his comments as a review, then I can diff against that. Right now it doesn't have anything in "changes since last review." |
Something wrong with the Windows ::TlsGetValue/::TlsSetValue now, perhaps just missing an include ? |
- Compiler TLS is now used only used when the compiler supports it - If compiler TLS is unsupported, we use platform-specific TLS - Only one variable (an index) is now in TLS - We only access TLS once per alloc, and never when freeing - Allocation / release info is now stored within the allocation itself, by over-allocating; this saves having external structures do the bookkeeping, and reduces some of the redundant data that was being stored (such as addresses) - We never hit the alloc lock when not using SMP or when using OpenMP (that was my fault) - Now that there are fewer tracking structures I think this is a bit easier to read than before
When I refactored the HAS_COMPILER_TLS block I left out the MSVC one to use __declspec(thread). Those functions should be defined in winbase.h, which comes in from windows.h, which is included through common.h. Either way, this should resolve it, though I wonder if we can just remove the Tls functions since it's doubtful anyone is still using MSVC pre-2005. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for making the changes. The code looks really nice to read.
One minor comment about MAX_ALLOCATING_THREADS, but otherwise good to go from me.
/* Holds pointers to allocated memory */ | ||
#if defined(SMP) && !defined(USE_OPENMP) | ||
/* This is the number of threads than can be spawned by the server, which is the | ||
server plus the number of threads in the thread pool */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not understand how the comment leads to this particular computation of number of threads.
nice cleanup @oon3m0oo. @martin-frbg dropping the ifdefs that come from MSVC pre 2005 is a good idea. Does that mean we will be able to get rid of the windows block for compilers without TLS? |
If we dropped support for pre-2005 then yes, the windows block would go away, and we'd only have the pthreads part. I believe the number is discussed in #1616, since this determines the number of threads. |
I am always a bit sentimental about removing working code for old systems, there could be someone somewhere still relying on it. Twelve years may be a bit too old to be useful though (but it matches the ancient hardware still supported, Core2 would have been brand new back then). |
ready for merging ? |
Hmm... before you merge this, I see some fortran test failures on my side with a couple of tests, let me see if that's this change. |
Ok this was only in my internal branch that has lots of other changes to help build things internally, everything passes fine on my github repo. |
The issue I see is an illegal instruction in zgemv.c when using STACK_ALLOC, where the allocation size requested is larger than MAX_STACK_ALLOC, which ends up creating an array of size 0, which is indeed illegal. This doesn't seem to be related to the memory.c changes, though. Aha, the issue only appears in unoptimized builds. I'll send another PR to fix this. |
Sounds unrelated but decidedly unhealthy, probably best to create a new issue for that |
Sent out #1631. |
over-allocating; this saves having external structures do the bookkeeping, and
reduces some of the redundant data that was being stored (such as addresses)
my fault)
read than before