1. Align with Chapter 5's perspective on shared resources.
   - Expand the programmer's view of Chapter 5 into a cache-line perspective in this section.
2. Show that, at the cache-line level, shared resources involve issues of data transfer across cores and chips.
   - Move "Cache effects and false sharing" into Section 6: Shared Resources.
3. Address false sharing within shared resources.
4. Proceed to the communication bottlenecks of shared resources.
   - Discuss how cache coherence limits the scalability of spin locks, connecting to the problems caused by blocking in the next chapter and preparing for the later discussion of lock-free mechanisms.
\captionof{figure}{Processor 1 and Processor 2 operate independently on variables A and B.
Simultaneously, they read the cache line containing these two variables.
In the next time step, each processor modifies A and B in its private L1 cache separately.
Subsequently, both processors write their modified cache lines back to the shared L2 cache.
At this point, expanding the scope of shared resources to encompass cache lines highlights the importance of considering cache coherence issues.}
\label{fig:false-sharing}

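To make the scenario in \fig{fig:false-sharing} concrete, here is a minimal sketch of false sharing in action (the struct and iteration counts are illustrative assumptions, not code from the text): each thread updates only its own variable, yet because the two variables share a cache line, every write invalidates the other core's copy.
\begin{cppcode}
#include <atomic>
#include <thread>

struct Shared {
    std::atomic<int> a{0}; // Written only by thread 1
    std::atomic<int> b{0}; // Written only by thread 2, but likely on the same cache line as a
};

int main() {
    Shared s;
    std::thread t1([&] {
        for (int i = 0; i < 1'000'000; ++i)
            s.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (int i = 0; i < 1'000'000; ++i)
            s.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
}
\end{cppcode}
The two threads never touch the same variable, yet they contend for the same cache line, exactly as the figure illustrates.
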
Beyond the shared data itself,
we also need to consider shared resources that serve as a communication channel, e.g., a spinlock (see \secref{spinlock}).
Processors communicate through cache lines when using locks.
When a processor broadcasts the release of a lock,
multiple processors on different cores attempt to acquire the lock simultaneously.
To ensure a consistent state of the lock across all L1 caches,
which is a part of cache coherence,
the cache line containing the lock will be continually transferred among the caches of those cores.
Unless the critical sections are considerably lengthy,
the time spent managing this cache-line movement could exceed the time spent within the critical sections themselves,\punckern\footnote{%
This situation underlines how some systems may experience a cache miss that is substantially more costly than an atomic \textsc{RMW} operation;
see Paul~E.\ McKenney's
\href{https://www.youtube.com/watch?v=74QjNwYAJ7M}{talk from CppCon~2017}
for a deeper exploration.}
despite the algorithm's non-blocking nature.

With these high communication costs, only one processor can succeed in re-acquiring the lock in the case of a mutex or spinlock, as shown in \fig{fig:spinlock}.
The other processors that have not successfully acquired the lock will continue to wait,
resulting in little practical benefit (only one processor gains the lock) at a significant communication overhead.
This disparity severely limits the scalability of the spin lock.
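A standard way to mitigate (though not eliminate) this ping-pong is the test-and-test-and-set spinlock: waiters spin on plain loads of their locally cached copy and attempt the atomic \textsc{RMW} only when the lock looks free. The following is a minimal sketch, not code from the text:
\begin{cppcode}
#include <atomic>

class TTASSpinlock {
    std::atomic<bool> locked{false};

public:
    void lock() {
        for (;;) {
            // Spin on plain loads: every waiter reads its own cached
            // copy, so the line stays shared instead of bouncing
            // between cores on each test.
            while (locked.load(std::memory_order_relaxed)) { /* spin */ }
            // Attempt the atomic RMW only once the lock looks free.
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
        }
    }

    void unlock() { locked.store(false, std::memory_order_release); }
};
\end{cppcode}
Each release still invalidates every waiter's copy at once, so this softens rather than removes the scalability limit described above.
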
\textsc{Arm}v8 processors offer dedicated load-acquire and store-release instructions: \keyword{lda} and \keyword{stl}.
Hopefully, future \textsc{CPU} architectures will follow suit.
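As a brief illustration (a sketch; exact code generation depends on the compiler and target), C++'s acquire and release orderings are what these instructions implement on \textsc{Arm}v8:
\begin{cppcode}
#include <atomic>

std::atomic<int> ready{0};
int payload = 0;

void producer() {
    payload = 42;
    // Store-release: on Armv8 this can compile to a single stl/stlr.
    ready.store(1, std::memory_order_release);
}

void consumer() {
    // Load-acquire: on Armv8 this can compile to a single lda/ldar.
    while (ready.load(std::memory_order_acquire) == 0) { /* spin */ }
    // The release-acquire pair guarantees payload reads 42 here.
    int observed = payload;
    (void)observed;
}
\end{cppcode}
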
\section{Cache effects and false sharing}
\label{false-sharing}

Given all the complexities to consider, modern hardware adds another layer to the puzzle.
Remember, memory moves between main \textsc{RAM} and the \textsc{CPU} in segments known as cache lines.
These lines also represent the smallest unit of data transferred between cores and their caches.
When one core writes a value and another reads it,
the entire cache line containing that value must be transferred from the first core's cache(s) to the second,
ensuring a coherent ``view'' of memory across cores.

This dynamic can significantly affect performance.
Take a readers-writer lock, for example,
which prevents data races by allowing either a single writer or multiple readers access to shared data, but not simultaneously.
At its most basic, this concept can be summarized as follows:
\begin{cppcode}
struct RWLock {
    int readers;
    bool hasWriter; // Zero or one writers
};
\end{cppcode}
Writers must wait until the \cc|readers| count drops to zero,
while readers can acquire the lock through an atomic \textsc{RMW} operation if \cc|hasWriter| is \cpp|false|.

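To make the readers' acquisition step concrete, here is a hypothetical sketch. It assumes, purely for illustration (the text does not specify this), that both fields are packed into a single \cpp|std::atomic<uint64_t>| so that one compare-and-swap can check \cc|hasWriter| and bump \cc|readers| atomically:
\begin{cppcode}
#include <atomic>
#include <cstdint>

// Hypothetical packing: reader count in the low 32 bits,
// the writer flag in bit 32.
std::atomic<uint64_t> state{0};
constexpr uint64_t WRITER_BIT = uint64_t{1} << 32;

bool tryLockShared() {
    uint64_t s = state.load(std::memory_order_relaxed);
    while ((s & WRITER_BIT) == 0) { // Proceed only if hasWriter is false
        // Atomic RMW: bump the reader count if the state is unchanged.
        if (state.compare_exchange_weak(s, s + 1,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed))
            return true;
        // On failure, s holds the freshly loaded state; the loop
        // re-checks the writer bit before retrying.
    }
    return false; // A writer holds the lock; the caller must wait.
}
\end{cppcode}
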
At first glance, this approach might seem significantly more efficient than exclusive locking mechanisms (e.g., mutexes or spinlocks) in scenarios where shared data is read more frequently than written.
However, this perspective overlooks the impact of cache coherence.
If multiple readers on different cores attempt to acquire the lock simultaneously,
the cache line containing the lock will constantly be transferred among the caches of those cores.
Unless the critical sections are considerably lengthy,
the time spent managing this cache-line movement could exceed the time spent within the critical sections themselves,\punckern\footnote{%
This situation underlines how some systems may experience a cache miss that is substantially more costly than an atomic \textsc{RMW} operation;
see Paul~E.\ McKenney's
\href{https://www.youtube.com/watch?v=74QjNwYAJ7M}{talk from CppCon~2017}
for a deeper exploration.}
despite the algorithm's non-blocking nature.

This slowdown is even more insidious when it occurs between unrelated variables that happen to be placed on the same cache line.
When designing concurrent data structures or algorithms,
this \introduce{false sharing} must be taken into account.
One way to avoid it is to pad atomic variables with a cache line of unshared data, but this is obviously a large space-time tradeoff.
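As a sketch of that padding approach (the fallback line size of 64 bytes is an assumption; C++17's \cpp|std::hardware_destructive_interference_size| reports the real value where available):
\begin{cppcode}
#include <atomic>
#include <cstddef>
#include <new>

#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t kLine = std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t kLine = 64; // Assumed typical cache-line size
#endif

// alignas pads each counter out to its own cache line, so threads
// updating different counters no longer invalidate one another,
// at the cost of roughly kLine bytes per 4-byte counter.
struct alignas(kLine) PaddedCounter {
    std::atomic<int> value{0};
};

PaddedCounter counters[8]; // e.g., one counter per worker thread
\end{cppcode}
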
\section{If concurrency is the question, \texttt{volatile} is not the answer.}
% Todo: Add ongoing work from JF's CppCon 2019 talk?