Add new example for section 5 #7

Merged
merged 28 commits into from
Jul 22, 2024

Conversation

idoleat
Collaborator

@idoleat idoleat commented Apr 17, 2024

A new directory, /examples, is added for example code. .clang-format is copied from the sysprog21/lkmpg project.

This draft PR provides a simplified thread pool implementation. After the pool is initialized with a thread count, jobs can be added. The job queue is an SPMC ring buffer. To keep the implementation minimal, the producer side is not protected. As a result, the thread pool cannot start running automatically when jobs are added; otherwise a worker might try to take a job before it is fully enqueued.

Padding is added to thread_pool_t to avoid false sharing. The number "40" is the total size of the struct members, including alignment, that precede the first padding. There should be a better way to determine this value, since structure layout is implementation-defined.
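
One possible way to avoid a hand-computed constant such as "40" is to let the compiler place each frequently written member on its own cache line with C11 `alignas`. The sketch below is an assumption, not the code in this PR: the struct `padded_pool`, its members, and the 64-byte `CACHE_LINE` value are all hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumed cache-line size; 64 bytes is common on x86-64 but not universal. */
#define CACHE_LINE 64

/* Hypothetical pool layout: every frequently written member starts on its
 * own cache line, so the compiler inserts the padding for us. */
typedef struct {
    alignas(CACHE_LINE) atomic_int state;   /* written by the main thread */
    alignas(CACHE_LINE) atomic_size_t head; /* written by the producer */
    alignas(CACHE_LINE) atomic_size_t tail; /* written by the workers */
} padded_pool;

/* Compile-time checks replace the manual size arithmetic. */
static_assert(offsetof(padded_pool, head) % CACHE_LINE == 0,
              "head must start a cache line");
static_assert(offsetof(padded_pool, tail) - offsetof(padded_pool, head) >=
                  CACHE_LINE,
              "head and tail must not share a cache line");
```

The trade-off is a larger struct (three cache lines here), in exchange for a layout that no longer depends on counting member sizes by hand.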

The test in the main function prints job IDs in a nondeterministic order. A mechanism for waiting until all jobs complete should be added later to replace the sleep. thread_pool_destroyed() is not functional yet.

To explain Exchange, Test and Set, Fetch and ..., and Compare and Swap in sections 5.1 to 5.4 using this example, the following issues should be resolved:

  • Exchange is not used.
  • Test and set on the initialized flag is useless. Testing the flag during the first thread pool initialization retrieves an indeterminate value. Re-initialization is currently still possible. If preventing re-initialization is not crucial in this example, we need to find another use for test and set.
  • Using Fetch and add to change the thread pool state seems too contrived.
  • Pick a better job type, such as BBP pi approximation.
  • Finish the implementation mentioned in the paragraphs above.

Should we break the example into pieces to explain individually? Or list the code first?

@jserv jserv changed the title Add new example for chapter 5 Add new example for section 5 Apr 17, 2024
@idoleat
Collaborator Author

idoleat commented Apr 20, 2024

The job queue exhibits incorrect behavior and is currently under rework. To reproduce the incorrect behavior, add new jobs after finishing the current ones. A segmentation fault occurs simply because thrd_pool->tail is an independent variable, not a pointer to the real tail. It still holds the old address when new jobs are added. I overthought the situation, assuming that prev in job_t would be constantly changed, and thus made it a distinct struct member, even with padding.

@idoleat
Collaborator Author

idoleat commented Apr 21, 2024

the following issues should be resolved:

  • Exchange is not used.
  • Test and set on the initialized flag is useless. Testing the flag during the first thread pool initialization retrieves an indeterminate value. Re-initialization is currently still possible. If preventing re-initialization is not crucial in this example, we need to find another use for test and set.
  • Using Fetch and add to change the thread pool state seems too contrived.

Solved.

The operations on atomic types, including the flag type, covered in sections 7.17.7 and 7.17.8 of the C11 standard are all used in the example code. Next, I will start revising section 5 based on this example. The undesired results caused by not using atomic operations will be provided as well, along with more clarification of why read-modify-write must be a single atomic step.
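
To make the "undesired result" concrete, here is a minimal sketch of a lost update. It is not taken from rmw_example.c; the function names are hypothetical, and the two "threads" are interleaved by hand inside one thread so that the outcome is deterministic.

```c
#include <assert.h>
#include <stdatomic.h>

/* Plain counter: an increment is really load, add, store. The two
 * "threads" are interleaved by hand so the lost update always happens. */
static int lost_update_demo(void)
{
    int counter = 0;
    int seen_by_a = counter; /* thread A reads 0 */
    int seen_by_b = counter; /* thread B reads 0 before A writes back */
    counter = seen_by_a + 1; /* A stores 1 */
    counter = seen_by_b + 1; /* B stores 1, overwriting A's update */
    return counter;          /* 1, although two increments ran */
}

/* Atomic counter: each fetch-and-add is one indivisible step, so no
 * other access can slip in between its read and its write. */
static int atomic_update_demo(void)
{
    atomic_int counter = 0;
    atomic_fetch_add(&counter, 1); /* "thread A" */
    atomic_fetch_add(&counter, 1); /* "thread B" */
    return atomic_load(&counter);  /* 2 */
}
```

With real threads the first version would fail only occasionally, which is exactly why the read, the modification, and the write need to form one atomic step.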

@idoleat idoleat marked this pull request as ready for review May 3, 2024 10:56

Relaxed operations are also beneficial for managing flags shared between threads.
For example, a thread might continuously run until it receives a signal to exit:
Relaxed operations are beneficial for managing flags shared between threads.
Collaborator Author

@idoleat idoleat May 30, 2024

Should we mention the discussion around relaxed atomics?
like this one: https://lukegeeson.com/blog/2023-10-17-A-Proposal-For-Relaxed-Atomics/
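
For context, the flag pattern in the line under review could be sketched as below. This is only an illustration under assumptions: `stop_requested`, `request_stop()`, and `run_worker()` are hypothetical names, and relaxed ordering is sufficient here only because no other data is published together with the flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical exit flag shared between a controller and a worker. */
static atomic_bool stop_requested;

/* Controller side: a relaxed store is enough when the flag is the only
 * thing being communicated. */
static void request_stop(void)
{
    atomic_store_explicit(&stop_requested, true, memory_order_relaxed);
}

/* Worker side: polls the flag between units of work. `max_rounds` only
 * bounds the loop so that this sketch always terminates. */
static unsigned run_worker(unsigned max_rounds)
{
    unsigned rounds = 0;
    while (rounds < max_rounds &&
           !atomic_load_explicit(&stop_requested, memory_order_relaxed)) {
        rounds++; /* stand-in for one unit of real work */
    }
    return rounds;
}
```

If the worker had to read data written before the flag was set, the store and load would need release and acquire ordering instead.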

@weihsinyeh
Collaborator

To demonstrate the application of Test and Set in rmw_example.c:

As stated in the Concurrency Primer (§ 5.2) regarding Test and Set: "We could use Test and Set to build a simple spinlock."

In rmw_example.c, Compare and Swap is used to avoid race conditions. However, Test and Set via atomic_flag_test_and_set() is used only for initialization, which underuses its ability to act as a lock.

To illustrate the distinct applications of Test and Set versus Compare and Swap:
Test and Set: if the test fails, the caller spins.
Compare and Swap: if the compare fails, the caller continues executing.

However, to discuss this in rmw_example.c, another shared resource must be introduced. Currently, the only shared resource is thrd_pool.

Consider a scenario where multiple workers all need to access the same shared variable to perform operations. In this case, a mutex can provide functionality similar to a Test and Set flag.

In rmw_example.c, all four operations (Exchange, Test and Set, Fetch and ..., and Compare and Swap) are performed with atomic operations from stdatomic.h. However, as mentioned in Chapter 1: "System programmers are familiar with tools such as mutexes, semaphores, and condition variables. Nevertheless, a question remains: How do these tools function, and how can we write concurrent code in their absence?"

Therefore, it is not mandatory to use atomic_flag_test_and_set() for Test and Set. A mutex can also provide locking and unlocking.
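
The contrast between the two operations can be sketched as follows. This is a minimal illustration, not code from rmw_example.c; the type `tas_lock` and the functions `spin_trylock`, `spin_lock`, `spin_unlock`, and `cas_double` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Test and Set: a failed test means the lock is held, so the caller spins. */
typedef struct {
    atomic_flag held;
} tas_lock;

static bool spin_trylock(tas_lock *l)
{
    /* test_and_set returns the previous value: false means we took it. */
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void spin_lock(tas_lock *l)
{
    while (!spin_trylock(l))
        ; /* busy-wait until the holder clears the flag */
}

static void spin_unlock(tas_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Compare and Swap: a failed compare writes the current value into
 * `seen`, and the caller recomputes from it instead of waiting. */
static int cas_double(atomic_int *value)
{
    int seen = atomic_load(value);
    while (!atomic_compare_exchange_weak(value, &seen, seen * 2))
        ; /* `seen` now holds the fresh value; retry with it */
    return seen * 2;
}
```

In the spinlock, a failed attempt makes no progress until another thread releases the lock. In the CAS loop, a failed attempt still hands back up-to-date information that the next attempt uses.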

@weihsinyeh
Collaborator

Another idea for Chapter 5 is to retain the original structure, keeping the concept descriptions in sections 5.1, 5.2, 5.3, and 5.4. However, the original examples are not intuitive, because all of these sections are actually about RMW operations. Readers should understand that the four differ only in technique, each suited to different scenarios. The text should also explain specifically where the need for atomic operations arises.

Implementing RMW does not necessarily require tools like those found in stdatomic.h.

A new subsection 5.5 would then detail the atomic functions available to implement these four concepts, and rmw_example.c would demonstrate their application in specific scenarios.

@weihsinyeh weihsinyeh force-pushed the ch5-examples branch 2 times, most recently from 5490f3d to bd9ce93 Compare June 23, 2024 16:31
Contributor

@jserv jserv left a comment

Rebase the latest main branch for reviewing.

idoleat added 7 commits June 25, 2024 14:12
A new directory, /examples, is added for example code. .clang-format is
copied from the sysprog21/lkmpg project.

This commit provides a simplified thread pool implementation. After the
pool is initialized with a thread count, jobs can be added. The job
queue is an SPMC ring buffer. To keep the implementation minimal, the
producer side is not protected. As a result, the thread pool cannot
start running automatically when jobs are added; otherwise a worker
might try to take a job before it is fully enqueued.

Padding is added to thread_pool_t to avoid false sharing. The number
"40" is the total size of the struct members, including alignment, that
precede the first padding. There should be a better way to determine
this value, since structure layout is implementation-defined.

The test in the main function prints job IDs in a nondeterministic
order. A mechanism for waiting until all jobs complete should be added
later to replace the sleep. `thread_pool_destroyed()` is not functional
yet.
Code executed under the running condition is placed in the same scope.
Both next and prev in job_t are of type struct job, so they are declared
on the same line.
An assert is added before malloc. The type of size is also changed to
size_t.
Two macros, `CAST_JOB(job, type)` and `PREV_JOB(job)`, are added to
shorten long expressions and improve readability.
A new struct `idle_job` is added. There are several ways to make
`thrd_pool->head->prev` (the original tail) atomic:

(1) _Atomic, used as either a specifier or a qualifier in C11, acts on an
object, not a region, so we cannot make `prev` atomic only in the idle
`job_t`; every job_t would have an atomic `prev`. However, _Atomic is
only allowed on complete types, meaning that `_Atomic(struct job *)` and
`_Atomic(void *)` are not allowed in the declaration of `job_t`.
`atomic_uintptr_t` has already shown enough casting chaos in the previous
commit.

(2) Embed `job_t` in a new struct `idle_job` alongside
`_Atomic(job_t *) prev;`. In the worker function, the last job is
accessed through `thrd_pool->head->prev`, the same as in (1). The only
differences are how the idle job is initialized and how the first job is
added. Padding could also be added around `prev` to avoid false sharing.

The test in main demonstrates adding a series of jobs after the existing
ones finish. `thread_pool_destroy` is implemented to cancel and free the
pool. An additional sleep could be added before destroy to observe the
second series of jobs.

Note that freeing a job's memory in the worker directly after use may
leave dangling pointers in other threads. Safe memory reclamation should
be introduced to avoid this completely, or a memory pool could be used
for jobs.
idoleat and others added 16 commits June 25, 2024 14:12
An atomic flag is used to check whether the given thread pool has been
initialized. The flag is initialized when the thread pool struct is
declared and reset to false when the thread pool is destroyed.

Atomic exchange obtains the previous state when the thread pool is
destroyed, and a warning message is printed if the state was running.

Atomic fetch-and-AND with zero demonstrates a way to set the state to idle.
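
The three uses described in this commit message could look roughly like the sketch below. It is written under assumptions: the struct `demo_pool`, the state constants, and the function names are hypothetical and are not copied from rmw_example.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state encoding; idle is 0 so that AND with 0 yields idle. */
enum { POOL_IDLE = 0, POOL_RUNNING = 1, POOL_CANCELLED = 2 };

typedef struct {
    atomic_flag initialized; /* start it from ATOMIC_FLAG_INIT */
    atomic_int state;
} demo_pool;

/* Test and set: only the first caller sees `false` and may initialize. */
static bool pool_init(demo_pool *p)
{
    if (atomic_flag_test_and_set(&p->initialized))
        return false; /* already initialized */
    atomic_store(&p->state, POOL_IDLE);
    return true;
}

/* Fetch-and-AND with zero clears every bit, leaving POOL_IDLE. The
 * return value is the state the pool was in just before. */
static int pool_set_idle(demo_pool *p)
{
    return atomic_fetch_and(&p->state, 0);
}

/* Exchange: install the new state and learn the old one in one step,
 * so the caller can warn if the pool was still running. */
static bool pool_destroy(demo_pool *p)
{
    int prev = atomic_exchange(&p->state, POOL_CANCELLED);
    atomic_flag_clear(&p->initialized); /* allow a later re-init */
    return prev == POOL_RUNNING; /* true: caller should print a warning */
}
```

Declaring the pool as `demo_pool p = { ATOMIC_FLAG_INIT, 0 };` gives the flag a determinate initial value, which addresses the earlier concern about testing a flag that was never initialized.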
Both job_count and thread_count were meant to be constant in the given
test scenario. Thus they were specified as macros instead of variables.
Clarification of read-modify-write is added at the beginning of the
section. So that the discussion can build on atomic load/store, more
information is added at the end of section 2.

Example code is included using the minted package. Each subsection is
revised according to the atomic library usage in the example.

At the end of section 5, a new subsection "further improvements" is
added to discuss leveraging other memory orders, false sharing, and
safe memory reclamation. The first two topics are forward-referenced to
the corresponding chapters. The last one is not covered in this book,
so it has no reference.
rmw_example.c is committed in the last commit.
A diff file is added that patches the original example into one that can
cause races. Replacing `threads.h` with `pthread.h` is also included in
the diff because the sanitizer does not support C11 threads yet.
Explanations of how the sanitizer works and how to use it are added as
well, followed by explanations of the warning messages from TSan.

The part mentioning safe memory reclamation is moved to this subsection
because the warning messages from TSan mention it.

A missing reference to the spinlock (originally one of the rmw examples)
in section 9 was added back as a new code block.
The original statement is true only for the successful operation. From
the point of view of the failed operations, it is the successful one
that finished before them.

The new statement does not fully cover the characteristics of atomic
operations, though. It is the generated cmpxchg or LL/SC loop that makes
the operation keep trying and eventually finish. But since the purpose
of this paragraph is to give a big picture of order and atomicity, more
details on atomic operations should be covered in the section "Atomic as
building blocks". More references to compiler and CPU manufacturer
documents should be taken into consideration then.
The intro of section 10 originally referred back to the spinlock in
section 5. It is now replaced by the new example.

Section 10.2 originally referred back to the UI thread in section 5.
That reference is now removed, as a new example is presented below to
explain the relaxed memory model. The new example is used here as well
because it is exactly what the original example was describing.
Static linkage is added as better practice. A new inline function,
wait_until, is added to wait until the thread pool reaches a given
state, which removes sleep() and the corresponding header.

The weak version of compare and swap is used instead because
1. There is really nothing else in the same cache line to cause
   spurious failure.
2. The retry cost is considered lower than that of a nested loop.
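
A rough sketch of the two pieces mentioned above follows. The PR text does not show the real signature of `wait_until`, so `wait_until_state` and `transition` are hypothetical stand-ins, and the integer state encoding is assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Spin until `*state` equals `wanted`; a hypothetical stand-in for the
 * wait_until() helper described in the commit message. */
static inline void wait_until_state(atomic_int *state, int wanted)
{
    while (atomic_load(state) != wanted)
        ; /* busy-wait */
}

/* Move `*state` from `from` to `to` with the weak CAS. The weak form may
 * fail spuriously, so it sits in a loop; a genuine mismatch ends it. */
static inline bool transition(atomic_int *state, int from, int to)
{
    int expected = from;
    while (!atomic_compare_exchange_weak(state, &expected, to)) {
        if (expected != from)
            return false; /* another thread changed the state first */
        /* spurious failure: expected still equals from, so retry */
    }
    return true;
}
```

On LL/SC architectures the strong CAS is typically compiled to a retry loop of its own, so putting it inside another loop nests two loops. The weak form keeps a single loop, which is the retry-cost argument in item 2.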
1. Use the Bailey–Borwein–Plouffe formula to approximate pi
-  Reference: https://github.com/sysprog21/concurrent-programs/blob/master/tpool/tpool.c

2. Add a PRECISION constant with value 100
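
For readers unfamiliar with the formula, the series is pi = sum over k >= 0 of 16^(-k) * (4/(8k+1) - 2/(8k+4) - 1/(8k+5) - 1/(8k+6)). A minimal single-threaded sketch is shown below; the function names `bbp_term` and `bbp_pi` are hypothetical, and treating PRECISION as the number of summed terms is an assumption about how the constant is used.

```c
#include <assert.h>

#define PRECISION 100 /* assumed to mean the number of series terms */

/* k-th term of the Bailey-Borwein-Plouffe series; `scale` is 16^-k. */
static double bbp_term(int k, double scale)
{
    return scale * (4.0 / (8 * k + 1) - 2.0 / (8 * k + 4) -
                    1.0 / (8 * k + 5) - 1.0 / (8 * k + 6));
}

/* Sum the first `terms` terms. No term depends on another, which is why
 * each one could be handed to the thread pool as an independent job. */
static double bbp_pi(int terms)
{
    double sum = 0.0, scale = 1.0;
    for (int k = 0; k < terms; k++) {
        sum += bbp_term(k, scale);
        scale /= 16.0;
    }
    return sum;
}
```

Calling `bbp_pi(PRECISION)` yields pi to double precision. The series converges quickly, so most of the 100 terms add nothing a `double` can represent, but they still serve as a supply of independent jobs.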
1. Add the tpool_future variable
- tpool_future passes the result to the main thread.
- The mutex lock and the condition variable synchronize access to it.

2. The main thread sequentially accumulates the BBP results
calculated by each worker.
- Wait in `tpool_future_get()` until the condition variable is
broadcast, confirming that the result has been marked as
__FUTURE_FINISHED.

3. Change `thread_pool` to `tpool` to improve readability.

4. Add the Makefile.
1. Directly show the scenarios using Test and Set and
its atomic operations.
- Use `atomic_flag_test_and_set()` and `atomic_flag_clear()` to
implement the original mutex lock and unlock mechanism.
- Replace the original condition variable wait mechanism with
`atomic_flag_test_and_set()` combined with a `while` loop.

2. Avoid deadlock in `tpool_future_get()`.
- The main thread must first wait for the worker to complete the "BBP
formula" job.
- Subsequently, it should wait for the worker to unlock.
- These two operations must occur in this order to avoid deadlock.
Swapping them will lead to deadlock.
1. Check if `future->result` is NULL.
- If `future->result` is NULL, the job is still in progress.
- If `future->result` is not NULL, the job has been completed by the
worker.
1. When allocating memory for future, if the allocation fails, do not
simply return NULL. Instead, release the memory allocated for job
beforehand to avoid memory leaks.
1. When creating the future, set the future's flag, which is akin to
assigning the job. Afterward, transfer the ownership to the worker.
Once the worker completes the job, clear the flag and return the
ownership, which is akin to submitting a job. Then, the main thread
can regain ownership. By doing this, the main thread can wait directly
for the result through test and set without checking if the result is
NULL. This avoids the situation where the flag could be set to true by
the main thread before the worker starts the job. Additionally, the
worker does not need to check with test and set before performing the
job.

2. Drop the `atomic_flag_clear` in `tpool_future_wait` function and then
directly free the pointer of future and its result in
`tpool_future_destroy` function.

3. Rename the variable 'lock' in the future structure to 'flag'. Rename
the function name `tpool_future_get` to `tpool_future_wait`.
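
The ownership hand-off described in item 1 can be sketched as follows. This is an illustration under assumptions, not the PR's `tpool_future` code: the struct `future_sketch` and the three functions are hypothetical, and the result is stored inline as a `double` rather than behind a pointer.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical future: the flag is set while the worker owns the result.
 * Start the flag from ATOMIC_FLAG_INIT before calling future_init(). */
typedef struct {
    atomic_flag flag;
    double result;
} future_sketch;

/* Creating the future sets the flag, handing ownership to the worker. */
static void future_init(future_sketch *f)
{
    f->result = 0.0;
    atomic_flag_test_and_set(&f->flag);
}

/* Worker: write the result first, then clear the flag with release
 * order so the result is visible to whoever next acquires the flag. */
static void future_complete(future_sketch *f, double value)
{
    f->result = value;
    atomic_flag_clear_explicit(&f->flag, memory_order_release);
}

/* Main thread: spin until test-and-set observes a cleared flag. That
 * call also sets the flag again, so the main thread now owns the future. */
static double future_wait(future_sketch *f)
{
    while (atomic_flag_test_and_set_explicit(&f->flag, memory_order_acquire))
        ; /* worker still owns the future */
    return f->result;
}
```

Because the flag is already set when the future is created, the main thread cannot win the test-and-set before the worker has finished, so no separate NULL check on the result is needed.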

Co-authored-by: Chih-Wei Chien <[email protected]>
Signed-off-by: Wei-Hsin Yeh <[email protected]>
1. When allocating memory for the product, if the allocation fails,
it returns NULL.

Co-authored-by: Chih-Wei Chien <[email protected]>
Signed-off-by: Wei-Hsin Yeh <[email protected]>
1. Use two figures to connect concepts from the first 3 sections.
- Figure atomic_rmw illustrates that an atomic operation can consist of
not only a single operation but also a group of operations that must be
performed atomically.

- Figure rmw_communicate shows how this atomic group of operations can
be used on a shared resource for communication.

2. Discuss how to ensure that the operations accessing the shared
resource for communication between concurrent threads are correct:
- Use Test and Set and Compare and Swap as examples to illustrate how
this can be achieved.

3. Compare the usage scenarios of Exchange and Fetch and ...

4. Introduce the concept that atomic operations can be used to ensure
that a group of operations is performed atomically.
idoleat and others added 2 commits June 26, 2024 15:56
Introducing the thread sanitizer here may be an unexpected detour for
readers who are new to concurrency. Here we focus on rmw atomic
operations instead, so the related content and the diff file are removed.
The proper place for this topic could be a dedicated section on
"testing, debugging, and verifying concurrent programs".

This is also consistent with the decision to stick to C11 threads.


Co-authored-by: Wei-Hsin Yeh <[email protected]>
Since the spinlock is added back in section 5.2, the original content is
restored. As with the rmw example, the goal is to provide an
easy-to-understand example first and improve it later.
References to the C11 standard were added when explaining the properties
of atomic types and operations. More information on codegen for atomic
operations is added as a footnote, with a link to LLVM's documentation as
an example.
Add a description of atomic instructions so readers know there is a
difference between fetch and ..., which is only a programming tool, and
its actual execution as an atomic operation, which depends on the
compiler.

Simplify the rmw_example code to provide more flexible examples.

 - Initially, all worker threads will be initialized. The main thread
will ask all workers to start running. If there is no job or the job is
completed, the worker will become idle. Next, the main thread will
continue to add more jobs and ask the worker to start running again.
Meanwhile, the main thread will also wait for the results of the work.

 - Use the struct `tpool_future` to record all the information required
for the job.

Co-authored-by: Chih-Wei Chien <[email protected]>
@idoleat idoleat requested a review from jserv July 22, 2024 03:47
@jserv jserv merged commit 8d9a4b9 into sysprog21:main Jul 22, 2024
@jserv
Contributor

jserv commented Jul 22, 2024

Thank @idoleat for contributing!

3 participants