
Commit ba88488

Add a new section Shared Resources.
1. Align with Chapter 5's perspective on shared resources.
   - Expand the programmer's view from Chapter 5 into a cache-line perspective in this section.
2. Show that, at the cache-line level, shared resources involve data transfer across chips.
   - Move "Cache effects and false sharing" into Section 6: Shared Resources.
3. Address false-sharing issues within shared resources.
4. Proceed to bottlenecks in shared resources caused by communication.
   - Discuss how cache coherence limits the scalability of spinlocks, connecting to the problems caused by blocking in the next chapter and preparing for the later discussion of lock-free mechanisms.
1 parent 8a7c0a2 commit ba88488

File tree

3 files changed: +100 −40 lines


concurrency-primer.tex

Lines changed: 100 additions & 40 deletions
@@ -499,6 +499,66 @@ \subsection{Compare and swap}
499499
}
500500
\end{cppcode}
501501

502+
\section{Shared Resources}
503+
\label{false-sharing}
504+
From \secref{rmw}, we understand that two types of shared resources need to be considered.
505+
The first type comprises shared resources that concurrent threads access in order to collaborate on a common goal.
506+
The second type comprises shared resources that serve as a communication channel for concurrent threads,
507+
ensuring correct access to resources of the first type.
508+
However, all of these considerations stem from a programmer's perspective,
509+
where we only distinguish between shared resources and private resources.
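As a minimal sketch of this programmer's view (the names here are illustrative, not from the text above): the counter below is a resource of the first type, which threads collaborate on, while the mutex is a resource of the second type, serving purely as a communication channel.
\begin{cppcode}
#include <mutex>

int counter = 0;        // first type: data that threads collaborate on
std::mutex counter_mtx; // second type: a communication channel that
                        // ensures correct access to the data

void increment()
{
    std::lock_guard<std::mutex> guard(counter_mtx);
    ++counter; // safe: only one thread at a time holds the channel
}
\end{cppcode}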
510+
511+
Given all the complexities to consider, modern hardware adds another layer to the puzzle,
512+
as depicted in \fig{ideal-machine}.
513+
Remember, memory moves between main \textsc{RAM} and the \textsc{CPU} in segments known as cache lines.
514+
These lines also represent the smallest unit of data transferred between cores and their caches.
515+
When one core writes a value and another reads it,
516+
the entire cache line containing that value must be transferred from the first core's cache(s) to the second,
517+
ensuring a coherent ``view'' of memory across cores. This dynamic can significantly affect performance.
518+
519+
This slowdown is even more insidious when it occurs between unrelated variables that happen to share a cache line,
520+
which then becomes an unintended shared resource, as shown in \fig{fig:false-sharing}.
521+
When designing concurrent data structures or algorithms,
522+
this \introduce{false sharing} must be taken into account.
523+
One way to avoid it is to pad atomic variables with a cache line of private data,
524+
but this is obviously a large space-time trade-off.
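A minimal sketch of such padding, assuming C++17 and a 64-byte fallback line size (the type and constant names are illustrative):
\begin{cppcode}
#include <atomic>
#include <cstddef>
#include <new> // std::hardware_destructive_interference_size

#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t kLine =
    std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t kLine = 64; // assumed fallback
#endif

// alignas() rounds sizeof(PaddedCounter) up to kLine, so each
// counter occupies its own cache line and writers on different
// cores no longer invalidate each other's lines.
struct alignas(kLine) PaddedCounter {
    std::atomic<int> value{0};
};

PaddedCounter a; // written by thread 1
PaddedCounter b; // written by thread 2; no longer falsely shared with a
\end{cppcode}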
525+
526+
\includegraphics[keepaspectratio, width=0.6\linewidth]{images/false-sharing}
527+
\captionof{figure}{Processor 1 and Processor 2 operate independently on variables A and B.
528+
Simultaneously, they read the cache line containing these two variables.
529+
In the next time step, each processor separately modifies its own variable in its private L1 cache.
530+
Subsequently, both processors write their modified cache line to the shared L2 cache.
531+
At this point, the scope of the shared resource has expanded to the entire cache line, making cache coherence an unavoidable concern.}
532+
\label{fig:false-sharing}
533+
534+
We must consider not only shared data itself,
535+
but also shared resources that serve as a communication channel, e.g.\ a spinlock (see \secref{spinlock}).
536+
Processors communicate through cache lines when using locks.
537+
When a processor broadcasts the release of a lock,
538+
multiple processors on different cores attempt to acquire the lock simultaneously.
539+
To ensure a consistent state of the lock across all L1 cache lines,
540+
which is part of maintaining cache coherence,
541+
the cache line containing the lock will be continually transferred among the caches of those cores.
542+
Unless the critical sections are considerably lengthy,
543+
the time spent managing this cache line movement could exceed the time spent within the critical sections themselves,\punckern\footnote{%
544+
This situation underlines how some systems may experience a cache miss that is substantially more costly than an atomic \textsc{RMW} operation,
545+
as discussed in Paul~E.\ McKenney's
546+
\href{https://www.youtube.com/watch?v=74QjNwYAJ7M}{talk from CppCon~2017}
547+
for a deeper exploration.}
548+
despite the algorithm's non-blocking nature.
549+
550+
Despite these high communication costs, only one processor can succeed in re-acquiring the lock, whether it is a mutex or a spinlock, as shown in \fig{fig:spinlock}.
551+
The other processors, having failed to acquire the lock, must continue to wait,
552+
resulting in little practical benefit (only one processor gains the lock) and significant communication overhead.
553+
This disparity severely limits the scalability of spinlocks.
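To make the communication cost concrete, here is a hedged sketch (class and member names are illustrative, not the text's own implementation) of a test-and-test-and-set spinlock, a common mitigation: waiters spin on plain loads of their locally cached copy of the line and attempt the coherence-traffic-generating \textsc{RMW} only once the lock looks free.
\begin{cppcode}
#include <atomic>

class SpinLock {
    std::atomic<bool> locked{false};

public:
    void lock()
    {
        for (;;) {
            // RMW: pulls the cache line into this core's cache in
            // exclusive state, invalidating everyone else's copy.
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
            // Spin on plain loads: each waiter reads its own cached
            // copy, generating no coherence traffic until the
            // holder's store invalidates it.
            while (locked.load(std::memory_order_relaxed)) {
            }
        }
    }

    void unlock()
    {
        // The release store invalidates every waiter's copy;
        // all waiters then race to re-acquire the line.
        locked.store(false, std::memory_order_release);
    }
};
\end{cppcode}
Even so, every unlock still triggers one invalidation per waiting core, so this softens, rather than removes, the scalability limit described above.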
554+
555+
\includegraphics[keepaspectratio, width=0.9\linewidth]{images/spinlock}
556+
\captionof{figure}{Three processors use a lock to communicate, ensuring that access operations to the shared L2 cache are correct.
557+
Processors 2 and 3 are trying to acquire a lock that is held by Processor 1.
558+
Therefore, when Processor 1 unlocks,
559+
the state of the lock needs to be updated in the other processors' L1 caches.}
560+
\label{fig:spinlock}
561+
502562
\section{Atomic operations as building blocks}
503563

504564
Atomic loads, stores, and \textsc{RMW} operations are the building blocks for every single concurrency tool.
@@ -802,7 +862,7 @@ \subsection{Acquire and release}
802862
On \textsc{Arm} and other weakly-ordered architectures, this enables us to eliminate one of the memory barriers in each operation,
803863
such that
804864

805-
\begin{cppcode}
865+
\begin{cppcode}
806866
int acquireFoo()
807867
{
808868
return foo.load(memory_order_acquire);
@@ -1049,45 +1109,45 @@ \section{Hardware convergence}
10491109
\textsc{Arm}v8 processors offer dedicated load-acquire and store-release instructions: \keyword{lda} and \keyword{stl}.
10501110
Hopefully, future \textsc{CPU} architectures will follow suit.
10511111

1052-
\section{Cache effects and false sharing}
1053-
\label{false-sharing}
1054-
1055-
Given all the complexities to consider, modern hardware adds another layer to the puzzle.
1056-
Remember, memory moves between main \textsc{RAM} and the \textsc{CPU} in segments known as cache lines.
1057-
These lines also represent the smallest unit of data transferred between cores and their caches.
1058-
When one core writes a value and another reads it,
1059-
the entire cache line containing that value must be transferred from the first core's cache(s) to the second,
1060-
ensuring a coherent ``view'' of memory across cores.
1061-
1062-
This dynamic can significantly affect performance.
1063-
Take a readers-writer lock, for example,
1064-
which prevents data races by allowing either a single writer or multiple readers access to shared data but not simultaneously.
1065-
At its most basic, this concept can be summarized as follows:
1066-
\begin{cppcode}
1067-
struct RWLock {
1068-
int readers;
1069-
bool hasWriter; // Zero or one writers
1070-
};
1071-
\end{cppcode}
1072-
Writers must wait until the \cc|readers| count drops to zero,
1073-
while readers can acquire the lock through an atomic \textsc{RMW} operation if \cc|hasWriter| is \cpp|false|.
1074-
1075-
At first glance, this approach might seem significantly more efficient than exclusive locking mechanisms (e.g., mutexes or spinlocks) in scenarios where shared data is read more frequently than written.
1076-
However, this perspective overlooks the impact of cache coherence.
1077-
If multiple readers on different cores attempt to acquire the lock simultaneously,
1078-
the cache line containing the lock will constantly be transferred among the caches of those cores.
1079-
Unless the critical sections are considerably lengthy,
1080-
the time spent managing this cache line movement could exceed the time spent within the critical sections themselves,\punckern\footnote{%
1081-
This situation underlines how some systems may experience a cache miss that is substantially more costly than an atomic \textsc{RMW} operation,
1082-
as discussed in Paul~E.\ McKenney's
1083-
\href{https://www.youtube.com/watch?v=74QjNwYAJ7M}{talk from CppCon~2017}
1084-
for a deeper exploration.}
1085-
despite the algorithm's non-blocking nature.
1086-
1087-
This slowdown is even more insidious when it occurs between unrelated variables that happen to be placed on the same cache line.
1088-
When designing concurrent data structures or algorithms,
1089-
this \introduce{false sharing} must be taken into account.
1090-
One way to avoid it is to pad atomic variables with a cache line of unshared data, but this is obviously a large space-time tradeoff.
1112+
% \section{Cache effects and false sharing}
1113+
% \label{false-sharing}
1114+
1115+
% Given all the complexities to consider, modern hardware adds another layer to the puzzle.
1116+
% Remember, memory moves between main \textsc{RAM} and the \textsc{CPU} in segments known as cache lines.
1117+
% These lines also represent the smallest unit of data transferred between cores and their caches.
1118+
% When one core writes a value and another reads it,
1119+
% the entire cache line containing that value must be transferred from the first core's cache(s) to the second,
1120+
% ensuring a coherent ``view'' of memory across cores.
1121+
1122+
% This dynamic can significantly affect performance.
1123+
% Take a readers-writer lock, for example,
1124+
% which prevents data races by allowing either a single writer or multiple readers access to shared data but not simultaneously.
1125+
% At its most basic, this concept can be summarized as follows:
1126+
% \begin{cppcode}
1127+
% struct RWLock {
1128+
% int readers;
1129+
% bool hasWriter; // Zero or one writers
1130+
% };
1131+
% \end{cppcode}
1132+
% Writers must wait until the \cc|readers| count drops to zero,
1133+
% while readers can acquire the lock through an atomic \textsc{RMW} operation if \cc|hasWriter| is \cpp|false|.
1134+
1135+
% At first glance, this approach might seem significantly more efficient than exclusive locking mechanisms (e.g., mutexes or spinlocks) in scenarios where shared data is read more frequently than written.
1136+
% However, this perspective overlooks the impact of cache coherence.
1137+
% If multiple readers on different cores attempt to acquire the lock simultaneously,
1138+
% the cache line containing the lock will constantly be transferred among the caches of those cores.
1139+
% Unless the critical sections are considerably lengthy,
1140+
% the time spent managing this cache line movement could exceed the time spent within the critical sections themselves,\punckern\footnote{%
1141+
% This situation underlines how some systems may experience a cache miss that is substantially more costly than an atomic \textsc{RMW} operation,
1142+
% as discussed in Paul~E.\ McKenney's
1143+
% \href{https://www.youtube.com/watch?v=74QjNwYAJ7M}{talk from CppCon~2017}
1144+
% for a deeper exploration.}
1145+
% despite the algorithm's non-blocking nature.
1146+
1147+
% This slowdown is even more insidious when it occurs between unrelated variables that happen to be placed on the same cache line.
1148+
% When designing concurrent data structures or algorithms,
1149+
% this \introduce{false sharing} must be taken into account.
1150+
% One way to avoid it is to pad atomic variables with a cache line of unshared data, but this is obviously a large space-time tradeoff.
10911151

10921152
\section{If concurrency is the question, \texttt{volatile} is not the answer.}
10931153
% Todo: Add ongoing work from JF's CppCon 2019 talk?

images/false-sharing.pdf

9.06 KB
Binary file not shown.

images/spinlock.pdf

9.68 KB
Binary file not shown.
