Add new example for section 5 #7
Conversation
The job queue exhibits incorrect behavior and is currently under rework. To reproduce the incorrect behavior, add new jobs after finishing the current ones. A segmentation fault occurs simply because
Solved. Operations on atomic types, including the flag type, covered in C11 standard 7.17.7 and 7.17.8 are all used in the example code. Next I will start revising section 5 based on this example. The undesired result caused by not using atomic operations will be provided as well, along with more clarification on why we need read-modify-write as an atomic step.
Relaxed operations are also beneficial for managing flags shared between threads.
For example, a thread might continuously run until it receives a signal to exit:
Relaxed operations are beneficial for managing flags shared between threads.
Should we mention the discussion around relaxed atomics?
like this one: https://lukegeeson.com/blog/2023-10-17-A-Proposal-For-Relaxed-Atomics/
To demonstrate the application of Test and Set in rmw_example.c: as stated in the Concurrency Primer (§5.2), "We could use Test and Set to build a simple spinlock." In rmw_example.c, Compare and Swap is used to avoid race conditions, while the Test and Set operation via atomic_flag_test_and_set() is only employed for initialization, which underutilizes its locking properties. To illustrate the distinct applications of Test and Set versus Compare and Swap in rmw_example.c, another shared resource would have to be introduced; currently, the only shared resource is the thrd_pool. Consider a scenario where multiple workers all need to access the same shared variable to perform operations. In that case, a mutex can provide functionality similar to a Test and Set flag. In rmw_example.c, "Exchange, Test and Set, Fetch and ..., and Compare and Swap" are all performed using atomic operations from stdatomic.h. However, as mentioned in Chapter 1: "System programmers are familiar with tools such as mutexes, semaphores, and condition variables. Nevertheless, a question remains: How do these tools function, and how can we write concurrent code in their absence?" Therefore, it is not mandatory to use atomic_flag_test_and_set() for Test and Set; a mutex can also achieve the locking and unlocking effects.
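For reference, a minimal Test-and-Set spinlock in the spirit of §5.2 could look like the sketch below; this is an illustration for the discussion, not code taken from rmw_example.c.

```c
#include <stdatomic.h>

/* A minimal spinlock built on Test and Set, as §5.2 suggests.
 * Illustrative sketch only, not taken from rmw_example.c. */
static atomic_flag lock = ATOMIC_FLAG_INIT;

static void spin_lock(void)
{
    /* test_and_set returns the previous value: keep spinning while
     * another thread already holds the lock. */
    while (atomic_flag_test_and_set(&lock))
        ;
}

static void spin_unlock(void)
{
    atomic_flag_clear(&lock);
}
```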
Another idea for Chapter 5 is to retain the original structure, keeping the descriptions of the concepts in sections 5.1, 5.2, 5.3, and 5.4. However, the original examples are not intuitive, because all of these sections are actually about RMW operations. It is important for readers to understand that they differ only in technique, each aimed at different scenarios. At the same time, explain specifically where the need for atomic operations arises; implementing RMW does not necessarily require tools like those found in stdatomic.h. In a subsequent new subsection 5.5, further explanation will detail the atomic functions available to implement these four concepts, and rmw_example.c will demonstrate their application in specific scenarios.
Force-pushed from 5490f3d to bd9ce93 (Compare).
Rebase onto the latest main branch for reviewing.
A new directory /examples is added for example code. .clang-format is copied from the sysprog21/lkmpg project. This commit provides a simplified implementation of a thread pool. After initializing the thread pool with a thread count, jobs can be added. The job queue is an SPMC ring buffer. To keep the implementation minimal, the producer is not protected; as a result, the thread pool cannot run automatically when jobs are added, or the worker may try to get a job before it is fully enqueued. Padding is added in thread_pool_t to avoid false sharing. The number "40" is the sum of the sizes of the struct members, including alignment, before the first padding. There should be a better way to determine this value, since structure packing is implementation defined. The test in the main function results in a non-deterministic order of jobs echoing their ids. A mechanism to wait for all jobs to complete should be added later instead of using sleep. `thread_pool_destroyed()` is not functional yet.
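For context, the padding scheme described above might look roughly like the sketch below; the member names and the byte count are assumptions (a typical LP64 layout), not copied from the commit.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <threads.h>

typedef struct job job_t;

/* Illustrative layout only: fields written by different threads are
 * pushed onto separate 64-byte cache lines to avoid false sharing. */
typedef struct thread_pool {
    atomic_flag initialized;   /* 1 byte, padded to 4 (assumed)          */
    _Atomic int state;         /* 4 bytes                                */
    size_t thread_count;       /* 8 bytes                                */
    thrd_t *threads;           /* 8 bytes                                */
    job_t *head;               /* 8 bytes                                */
    _Atomic size_t queue_head; /* 8 bytes -> 40 bytes so far (assumed)   */
    char padding[64 - 40];     /* "40" depends on implementation-defined
                                  packing, hence the fragility noted above */
    _Atomic size_t queue_tail; /* consumed by workers on another line    */
} thread_pool_t;
```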
Code executed under the running condition is placed in the same scope.
Both next and prev in job_t are struct job pointers, thus they are declared on the same line.
An assert is added before malloc. Also, the type of size is changed to size_t.
Two macros `CAST_JOB(job, type)` and `PREV_JOB(job)` are added to simplify long expressions and improve readability.
A new struct `idle_job` is added. There are several ways to make `thrd_pool->head->prev` (the original tail) atomic: (1) `_Atomic`, used as either a specifier or a qualifier in C11, acts on an object, not a region, so we cannot have only the idle `job_t` carry an atomic `prev`; all `job_t` would have an atomic `prev`. However, `_Atomic` is only allowed to act on a complete type, meaning that `_Atomic(struct job *)` and `_Atomic(void *)` are not allowed in the declaration of `job_t`. `atomic_uintptr` has already shown enough casting chaos in the previous commit. (2) Embed `job_t` in a new struct `idle_job` alongside `_Atomic(job_t *) prev;`. In the worker function, the last job is accessed through `thrd_pool->head->prev`, the same as in (1). The only difference is how the idle job is initialized and how the first job is added. Padding could also be added around `prev` to avoid false sharing. The test in main demonstrates that a series of jobs is added after finishing the existing ones. `thread_pool_destroy` is implemented to cancel and free the pool. An additional sleep could be added before destroy to observe the second series of jobs. Notice that freeing the memory of a job in the worker directly after using it may leave dangling pointers in other threads. Safe memory reclamation should be introduced to avoid this completely, or a memory pool could be used for jobs.
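A minimal sketch of option (2), with names assumed rather than copied from the commit:

```c
#include <stdatomic.h>

typedef struct job {
    void *args;
    struct job *next, *prev;   /* non-atomic links inside the list */
} job_t;

/* Option (2): wrap the idle (sentinel) job instead of making every prev
 * atomic. Only this sentinel needs an atomic prev, and padding can be
 * placed around it to keep it on its own cache line. */
typedef struct idle_job {
    _Atomic(job_t *) prev;     /* the original tail, updated with RMW ops */
    char padding[64 - sizeof(_Atomic(job_t *))];
    job_t job;                 /* the embedded sentinel job itself */
} idle_job_t;
```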
An atomic flag is used for checking whether the given thread pool has been initialized. The flag is initialized when the thread pool struct is declared and reset to false when the thread pool is destroyed. Atomic exchange obtains the previous state when destroying the thread pool and gives a warning message if the state was running. Atomic fetch-and-AND with zero demonstrates a way to set the state to idle.
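The three operations described above could look roughly like this sketch (names and state values are assumptions, not copied from the commit):

```c
#include <stdatomic.h>
#include <stdio.h>

enum { IDLE = 0, RUNNING = 1 };

static atomic_flag initialized = ATOMIC_FLAG_INIT;
static _Atomic int state = IDLE;

void pool_init(void)
{
    /* Test and Set: only the first caller sees `false` and initializes. */
    if (atomic_flag_test_and_set(&initialized))
        return;  /* already initialized */
    /* ... set up the pool ... */
}

void pool_destroy(void)
{
    /* Exchange: grab the previous state while forcing it to IDLE. */
    if (atomic_exchange(&state, IDLE) == RUNNING)
        fprintf(stderr, "warning: destroying a running pool\n");
    atomic_flag_clear(&initialized);  /* allow re-initialization */
}

void pool_idle(void)
{
    /* Fetch and AND with zero: another way to force the state to IDLE. */
    atomic_fetch_and(&state, 0);
}
```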
Both job_count and thread_count were meant to be constant in the given test scenario. Thus they were specified as macros instead of variables.
Clarification on read-modify-write is first added at the beginning of the section. To base the discussion on atomic load/store, more information is supplemented at the end of section 2. Example code is included using the minted package. Each subsection is revised according to the atomic library usage in the example. At the end of section 5, a new subsection "Further improvements" is added to discuss leveraging other memory orders, false sharing, and safe memory reclamation. The first two topics are forward-referenced to the corresponding chapters. The last one is not covered in this book, so it has no reference.
rmw_example.c is committed in the last commit.
A diff file is added to patch the original example into one that can cause races. Substituting `threads.h` with `pthread.h` is also included in the diff because the sanitizer does not support C11 threads yet. How the sanitizer works and how to use it are added as well, followed by explanations of the warning messages from TSan. The part mentioning safe memory reclamation is moved to this subsection because the TSan warning messages mention it. A missing reference to the spinlock (originally one of the RMW examples) in section 9 was added back as a new code block.
The original statement is only true for the successful operation. For the other, failed operations, it is the successful one that finished before they did. The new statement still does not fully cover the characteristics of atomic operations: it is the generated cmpxchg or LL/SC loop that makes the operation keep retrying and eventually finish. But considering that the purpose of this paragraph is to paint a big picture of ordering and atomicity, more details on atomic operations should be covered in the section "Atomic as building blocks". More references to compiler and CPU manufacturer documents should be taken into consideration then.
The intro of section 10 originally referenced back to the spinlock in section 5. It is now replaced with the new example. Section 10.2 originally referenced back to the UI thread in section 5. That reference is now removed, as the new example presented below is used to explain the relaxed memory model. The new example is used here as well because it is exactly what the original example was describing.
Static linkage is added for better practice. A new inline function wait_until is added to serve the need of waiting for the thread pool to reach a given state, which removes sleep() and the corresponding header. The weak version of compare and swap is used instead because: 1. there is really nothing else on the same cache line to cause a spurious failure; 2. the retry cost is considered lower than a nested loop.
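To illustrate the reasoning behind the weak variant, a generic retry loop might look like the following; this is only a sketch of the pattern (names are assumptions), not the exact code in the example.

```c
#include <stdatomic.h>

/* A spurious failure of the weak CAS simply costs one more iteration of
 * the loop that is already there, which is judged cheaper than nesting a
 * strong CAS inside another loop. */
static void set_state(_Atomic int *state, int from, int to)
{
    int expected = from;
    while (!atomic_compare_exchange_weak(state, &expected, to))
        expected = from;  /* reset and retry until the transition succeeds */
}
```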
1. Use the Bailey–Borwein–Plouffe formula to approximate pi (see the sketch after this list).
   - Reference: https://github.com/sysprog21/concurrent-programs/blob/master/tpool/tpool.c
2. Add a PRECISION constant with value 100.
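For reference, one term of the BBP series could be computed as in the sketch below (function name and signature are assumptions; the job function in the linked tpool.c may differ). Summing terms k = 0 .. PRECISION approximates pi.

```c
#include <math.h>   /* link with -lm */

/* One term of the Bailey–Borwein–Plouffe series for pi. */
static double bbp(int k)
{
    return 1.0 / pow(16, k) *
           (4.0 / (8 * k + 1) - 2.0 / (8 * k + 4) -
            1.0 / (8 * k + 5) - 1.0 / (8 * k + 6));
}
```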
1. Add the `tpool_future` variable (sketched below).
   - `tpool_future` passes the result to the main thread.
   - A mutex lock and a condition variable ensure correct synchronization.
2. The main thread sequentially accumulates the BBP results calculated by each worker.
   - It waits in `tpool_future_get()` until the condition variable is broadcast, confirming that the result has been marked as `__FUTURE_FINISHED`.
3. Rename `thread_pool` to `tpool` to improve readability.
4. Add the Makefile.
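A sketch of the future object and the blocking wait described in items 1 and 2; the struct members are assumptions, and C11 threads are used here for consistency with the rest of the examples, while the actual code may use a different threading API.

```c
#include <threads.h>

#define __FUTURE_FINISHED (1 << 0)

typedef struct tpool_future {
    int flag;          /* bit field of future states                     */
    void *result;      /* written by the worker, read by the main thread */
    mtx_t mutex;       /* protects flag and result                       */
    cnd_t cond;        /* broadcast when the job finishes                */
} tpool_future_t;

/* Block until the worker broadcasts that the result is ready. */
static void *tpool_future_get(tpool_future_t *future)
{
    mtx_lock(&future->mutex);
    while (!(future->flag & __FUTURE_FINISHED))
        cnd_wait(&future->cond, &future->mutex);
    mtx_unlock(&future->mutex);
    return future->result;
}
```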
1. Directly show the scenarios using Test and Set and its atomic operations.
   - Use `atomic_flag_test_and_set()` and `atomic_flag_clear()` to implement the original mutex lock and unlock mechanism.
   - Replace the original condition-variable wait mechanism with `atomic_flag_test_and_set()` combined with a `while` loop.
2. Avoid deadlock in `tpool_future_get()`.
   - The main thread must first wait for the worker to complete the "BBP formula" job.
   - Subsequently, it should wait for the worker to unlock.
   - These two operations must occur in this order; swapping them leads to deadlock.
1. Check if `future->result` is NULL.
   - If `future->result` is NULL, the job is still in progress.
   - If `future->result` is not NULL, the job has been completed by the worker.
1. When allocating memory for `future`, if the allocation fails, do not simply return NULL. Instead, release the memory allocated for `job` beforehand to avoid a memory leak.
1. When creating the future, set the future's flag, which is akin to assigning the job. Afterward, transfer ownership to the worker. Once the worker completes the job, clear the flag and return ownership, which is akin to submitting the job. Then the main thread can regain ownership. By doing this, the main thread can wait directly for the result through test and set, without checking whether the result is NULL. This avoids the situation where the flag could be set to true by the main thread before the worker starts the job. Additionally, the worker does not need to check with test and set before performing the job. (A sketch of this protocol follows below.)
2. Drop the `atomic_flag_clear` in the `tpool_future_wait` function and directly free the pointer of the future and its result in the `tpool_future_destroy` function.
3. Rename the variable `lock` in the future structure to `flag`. Rename the function `tpool_future_get` to `tpool_future_wait`.
Co-authored-by: Chih-Wei Chien <[email protected]> Signed-off-by: Wei-Hsin Yeh <[email protected]>
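A sketch of the ownership handoff in item 1; names besides `tpool_future_wait` are assumptions, and the flag is assumed to start cleared when the future is allocated.

```c
#include <stdatomic.h>

typedef struct tpool_future {
    atomic_flag flag;  /* set = job owned by the worker, clear = finished */
    void *result;
} tpool_future_t;

/* Main thread: setting the flag when the future is created is akin to
 * assigning the job and handing ownership to the worker. */
static void future_assign(tpool_future_t *f)
{
    atomic_flag_test_and_set(&f->flag);
}

/* Worker: clearing the flag after the job is done hands ownership (and
 * the result) back, akin to submitting the job. */
static void future_submit(tpool_future_t *f, void *result)
{
    f->result = result;
    atomic_flag_clear(&f->flag);
}

/* Main thread: test-and-set keeps returning true while the worker still
 * owns the flag; the first false both signals completion and regains
 * ownership, so no NULL check on the result is needed. */
static void *tpool_future_wait(tpool_future_t *f)
{
    while (atomic_flag_test_and_set(&f->flag))
        ;
    return f->result;
}
```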
1. When allocating memory for the product, if the allocation fails, it returns NULL. Co-authored-by: Chih-Wei Chien <[email protected]> Signed-off-by: Wei-Hsin Yeh <[email protected]>
1. Use two figures to connect concepts from the first three sections.
   - Figure atomic_rmw illustrates that an atomic operation consists not of a single operation but of a group of operations that need to be performed atomically.
   - Figure rmw_communicate shows how this atomic group of operations can be used on a shared resource for communication.
2. Discuss how to ensure that the operations accessing the shared resource for communication between concurrent threads are correct.
   - Use Test and Set and Compare and Swap as examples to illustrate how this can be achieved.
3. Compare the usage scenarios of Exchange and Fetch and ...
4. Introduce the concept that we can utilize atomic operations to ensure that a group of operations is performed atomically.
Introducing the thread sanitizer here may be an unexpected surprise for readers who are new to concurrency. Here we focus on RMW atomic operations instead, thus the related content and diff file are removed. The proper place for this topic could be a dedicated section on "testing, debugging and verifying concurrent programs". This also aligns with the decision to stick to C11 threads. Co-authored-by: Wei-Hsin Yeh <[email protected]>
Since the spinlock is added back in section 5.2, the original content is restored. As with the RMW example, the goal is to provide an easy-to-understand example first and improve it later on.
References to the C11 standard were added when explaining the properties of atomic types and operations. More information on code generation for atomic operations is added as a footnote, with a link to LLVM's documentation as an example.
Add a description of atomic instructions to let readers know there is a difference between using fetch-and-..., which is only a programming tool, and its actual execution as an atomic operation, which depends on the compiler (see the sketch below). Simplify the rmw_example code to provide more flexible examples.
- Initially, all worker threads will be initialized. The main thread will ask all workers to start running. If there is no job or the job is completed, the worker will become idle. Next, the main thread will continue to add more jobs and ask the workers to start running again. Meanwhile, the main thread will also wait for the results of the work.
- Use the struct `tpool_future` to record all the information required for the job.
Co-authored-by: Chih-Wei Chien <[email protected]>
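As a small illustration of the gap between the source-level tool and the generated code (hypothetical snippet, not from the example): the same fetch-and-add call may compile to a single atomic instruction or to a retry loop, depending on the compiler and target.

```c
#include <stdatomic.h>

static _Atomic int counter;

void bump(void)
{
    /* Source level: one library call. Machine level: the compiler decides
     * whether this becomes a single atomic instruction (e.g. lock xadd on
     * x86) or a load-linked/store-conditional loop on other targets. */
    atomic_fetch_add(&counter, 1);
}
```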
Thanks to @idoleat for contributing!
A new directory /examples is added for example code. .clang-format is copied from the sysprog21/lkmpg project.
This draft PR provides a simplified implementation of a thread pool. After initializing the thread pool with a thread count, jobs can be added. The job queue is an SPMC ring buffer. To keep the implementation minimal, the producer is not protected; as a result, the thread pool cannot run automatically when jobs are added, or the worker may try to get a job before it is fully enqueued.
Padding is added in `thread_pool_t` to avoid false sharing. The number "40" is the sum of the sizes of the struct members, including alignment, before the first padding. There should be a better way to determine this value, since structure packing is implementation defined. The test in the main function results in a non-deterministic order of jobs echoing their ids. A mechanism to wait for all jobs to complete should be added later instead of using sleep. `thread_pool_destroyed()` is not functional yet.
To explain `Exchange`, `Test and set`, `Fetch and ...`, and `Compare and swap` in sections 5.1~5.4 using this example, the following issues should be resolved:
- `Exchange` is not in use.
- `Test and set` on the `initialized` flag is useless. Testing the flag on the first thread pool initialization retrieves a non-deterministic value. Currently, re-initialization is still possible. If preventing re-initialization is not crucial in this example, we need to find another way to use test and set.
- `Fetch and add` to change the thread pool state seems too intentional.
Should we break the example into pieces to explain individually? Or list the code first?