Commit d4a9a04

Author: 3rdCore (committed)
added explanation about DQN
1 parent aa1a61e commit d4a9a04

File tree

3 files changed: +60 -4 lines changed


SeaPearl.html (+60 -4)
@@ -140,13 +140,13 @@
<br>
The RL algorithm used is <a href="https://www.researchgate.net/figure/The-Q-learning-algorithm-taken-from-Sutton-Barto-1998_fig1_337089781" target="_blank">Deep Q-Learning</a> (DQN). The associated <b class="term">reward</b> is based on the objective function, i.e. the quality of the solution returned.
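For reference, a DQN agent is typically trained by minimizing a temporal-difference loss of the standard textbook form below (this is the generic formulation, not a SeaPearl-specific detail; $\theta^{-}$ denotes the target-network parameters and $\gamma$ the discount factor):
\begin{equation*}
\mathcal{L}(\theta) = \mathbb{E}\left[\left(r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a'; \theta^{-}) - \hat{Q}(s_t, a_t; \theta)\right)^{2}\right]
\end{equation*}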

-<h3>Generic Graph representation</h3>
+<h3>Generic graph representation:</h3>
Every CP problem is defined by a set of variables, the values these variables can take, and a set of constraints on these variables. The idea is to encode each of these entities as a node in a graph and to connect the nodes according to whether:
<ul>
<li>A <b class="term">Value</b> is part of a <b class="term">Variable</b>'s domain of definition</li>
<li>A <b class="term">Variable</b> is involved in a <b class="term">Constraint</b></li>
</ul>
-Here is a little example just to get a sense of things :
+This graph is naturally updated dynamically as instances are solved during training. Here is a small example to get a sense of things:
</tr>
<br>
<br>
@@ -157,10 +157,66 @@ <h3>Generic Graph representation</h3>
</tr>
<br>
<tr>
-The advantage of this method is that it allows the entire information to be encoded in a structure (a graph) that can be processed by a Neural Networks. Each node comes with a feature vector allowing to identify -among others- the type of constraints of a Constraint node or the value of a Value node.
+The advantage of this method is that it allows all of the information to be encoded in a structure (a graph) that can be processed by a neural network. Each node comes with a node embedding that identifies, among other things, the constraint type of a Constraint node or the value of a Value node.
+
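To make the encoding concrete, here is a minimal sketch (Python with networkx, purely illustrative and independent of SeaPearl's own data structures) of how a toy instance with two variables and one constraint could be turned into such a tripartite graph:
<pre>
# Toy CP instance: variables x, y with domain {1, 2, 3} and the constraint x != y.
import networkx as nx

variables   = {"x": {1, 2, 3}, "y": {1, 2, 3}}   # variable -> domain
constraints = {"c_neq": ["x", "y"]}              # constraint -> scope

G = nx.Graph()
for var, domain in variables.items():
    G.add_node(("var", var), kind="variable")
    for val in domain:
        G.add_node(("val", val), kind="value")
        G.add_edge(("var", var), ("val", val))   # value belongs to the variable's domain
for con, scope in constraints.items():
    G.add_node(("con", con), kind="constraint")
    for var in scope:
        G.add_edge(("con", con), ("var", var))   # variable is involved in the constraint

# As propagation prunes values during search, the corresponding variable-value
# edges are removed, which is how the graph tracks the solver state.
</pre>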
+<h3>Neural pipeline for variable/value assignment:</h3>
+
+Now that we have defined our input, recall that our goal is to infer a variable/value assignment. Let's assume for the moment that the variable selection heuristic is deterministic, so that the input is the graph representation <b class="term">and</b> a variable on which to branch. This is where we are:
+</tr>
<br>
<br>
-Now that we defined our input, recall that our goal is to infer a variable/value assignation. Let's consider for the moment that the variable selection heuristic is deterministic, so that the input is the graph representation <b class="term">and</b> a variable on which to branch. This is where we are :
+<tr>
+<div style="text-align: center;">
+<img src="images/scheme.drawio.png" alt="nthu" width="650" height="240">
+</div>
+</tr>
+<br>
+<tr>
+Given a state, the RL agent is parameterized by a DQN network that outputs the Q-values associated with every possible value selection in the domain of the selected variable $v$. The tripartite state is fed into a neural approximator, the <b class="term">learner</b> $\hat{Q}$, which consists of two parts: an <b class="term">encoder</b> that learns contextual embeddings of the nodes of the graph representation, and a <b class="term">decoder</b> that, given these node embeddings, estimates the Q-values from which the value-selection heuristic is derived.
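Schematically, and with purely illustrative names (a sketch of the idea, not SeaPearl's actual API), the learner splits into the two parts detailed below:
<pre>
import torch.nn as nn

class Learner(nn.Module):
    """Sketch of the learner Q-hat: a graph encoder followed by a Q-value decoder."""
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder   # heterogeneous GNN over the tripartite graph
        self.decoder = decoder   # MLPs producing one Q-value per candidate value

    def forward(self, graph, branching_variable, candidate_values):
        embeddings = self.encoder(graph)  # contextual node embeddings
        # one Q-value for each value in the domain of the branching variable
        return self.decoder(embeddings, branching_variable, candidate_values)
</pre>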
+
+<h4>Graph neural network encoder:</h4>
+
+Graph convolutional networks (GCNs) are a convenient way to learn contextual node embeddings and have been widely used for this purpose in reinforcement learning for discrete optimization. Given the heterogeneous nature of our representation, we opted for a heterogeneous GCN composed of several graph convolutional layers.
+<br>
+<br>
+Considering a variable $v_i$, a constraint $c_j$ and a value $u_k$, they are respectively described by raw features $V_i, C_j, U_k$ of respective dimensions $d_v, d_c, d_u$. First, a type-wise linear combination of the raw features computes the input features of dimension $d$ for the GNN, such that $\mathbf{h}_{v_{i}}^{0} = V_{i}w_v$, $\mathbf{h}_{c_{j}}^{0} = C_{j}w_c$ and $\mathbf{h}_{u_{k}}^{0} = U_{k}w_u$.
+Then, we recursively perform $N$ graph-convolution operations on the nodes of the graph representation. At step $t$, a convolution can be formulated as:
+<br>
+<br>
+\begin{align*}
+\mathbf{h}_{v_{i}}^{t+1}&=\phi_{v}\left(\mathbf{V}_{i}: \mathbf{h}_{v_{i}}^{t}: \bigoplus_{c_{j} \in \mathcal{N}_{c}\left(v_{i}\right)} \mathbf{h}_{c_{j}}^{t}: \bigoplus_{u_{k} \in \mathcal{N}_{u}\left(v_{i}\right)} \mathbf{h}_{u_{k}}^{t}\right)\\
+\mathbf{h}_{c_{j}}^{t+1}&=\phi_{c}\left(\mathbf{C}_{j}: \mathbf{h}_{c_{j}}^{t}: \bigoplus_{v_{i} \in \mathcal{N}_{v}\left(c_{j}\right)} \mathbf{h}_{v_{i}}^{t}\right)\\
+\mathbf{h}_{u_{k}}^{t+1}&=\phi_{u}\left(\mathbf{U}_{k}: \mathbf{h}_{u_{k}}^{t}: \bigoplus_{v_{i} \in \mathcal{N}_{v}\left(u_{k}\right)} \mathbf{h}_{v_{i}}^{t}\right)
+\end{align*}
+
+where $\phi_{v}, \phi_{c}, \phi_{u}$ are one-layer perceptrons (an affine transformation followed by an activation function), $:$ denotes concatenation, $\mathcal{N}_{v},\mathcal{N}_{c},\mathcal{N}_{u}$ denote the type-specific neighborhoods, and $\bigoplus$ is a feature-wise aggregation function.
+<br>
+<br>
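As an illustration of one such convolution step, here is a minimal PyTorch sketch (the tensor shapes, sum aggregation and ReLU activations are assumptions made for the example, not SeaPearl's exact implementation):
<pre>
import torch
import torch.nn as nn

class HeteroConvLayer(nn.Module):
    """One heterogeneous convolution step over the tripartite (variable, constraint, value) graph."""
    def __init__(self, d_v, d_c, d_u, d):
        super().__init__()
        # phi_v sees [V_i : h_v : agg(h_c) : agg(h_u)]; phi_c and phi_u each see one neighbour type.
        self.phi_v = nn.Sequential(nn.Linear(d_v + 3 * d, d), nn.ReLU())
        self.phi_c = nn.Sequential(nn.Linear(d_c + 2 * d, d), nn.ReLU())
        self.phi_u = nn.Sequential(nn.Linear(d_u + 2 * d, d), nn.ReLU())

    def forward(self, V, C, U, h_v, h_c, h_u, A_vc, A_vu):
        # V, C, U: raw features; h_*: current embeddings of dimension d;
        # A_vc, A_vu: 0/1 adjacency matrices (variables x constraints, variables x values).
        agg_c_for_v = A_vc @ h_c        # sum over constraint neighbours of each variable
        agg_u_for_v = A_vu @ h_u        # sum over value neighbours of each variable
        agg_v_for_c = A_vc.t() @ h_v    # sum over variable neighbours of each constraint
        agg_v_for_u = A_vu.t() @ h_v    # sum over variable neighbours of each value
        new_h_v = self.phi_v(torch.cat([V, h_v, agg_c_for_v, agg_u_for_v], dim=-1))
        new_h_c = self.phi_c(torch.cat([C, h_c, agg_v_for_c], dim=-1))
        new_h_u = self.phi_u(torch.cat([U, h_u, agg_v_for_u], dim=-1))
        return new_h_v, new_h_c, new_h_u
</pre>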
+<h4>Downstream neural network decoder:</h4>
+
+Once the contextual node embeddings are computed, a decoder is used to convert them into an actionable policy.
+<br>
+<br>
+Our architecture consists of first transforming the embedding of the variable $v$ currently branched on and the embeddings of all possible values by feeding them into two different multi-layer perceptrons. The transformed embedding of each value $u \in \mathcal{D}_v$, concatenated with the transformed embedding of the branching variable $v$, passes through a final multi-layer perceptron that outputs the approximated Q-value for each pair ($v$, $u$).
+<br>
+<br>
+\begin{equation}
+\hat{Q}(S_t,a) = \hat{Q}\left(\{s_t,v\},u\right) = \phi_q\left(\phi_v(\mathbf{h}_{v}^{N}):\phi_u(\mathbf{h}_{u}^{N})\right)
+\end{equation}
+where $\phi_{q}, \phi_{u}, \phi_{v}$ are multi-layer perceptrons.
+<br>
+<br>
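A minimal PyTorch-style sketch of this decoder (layer sizes and names are assumptions for illustration only):
<pre>
import torch
import torch.nn as nn

class QDecoder(nn.Module):
    """Turns node embeddings into one Q-value per candidate value of the branching variable."""
    def __init__(self, d, hidden=64):
        super().__init__()
        self.phi_v = nn.Sequential(nn.Linear(d, hidden), nn.ReLU())   # transforms the variable embedding
        self.phi_u = nn.Sequential(nn.Linear(d, hidden), nn.ReLU())   # transforms the value embeddings
        self.phi_q = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))              # scores each (v, u) pair

    def forward(self, h_variable, h_values):
        # h_variable: (d,) embedding of the branching variable v
        # h_values:   (k, d) embeddings of the k values currently in D_v
        zv = self.phi_v(h_variable).expand(h_values.size(0), -1)   # repeat v's embedding for each value
        zu = self.phi_u(h_values)
        return self.phi_q(torch.cat([zv, zu], dim=-1)).squeeze(-1)  # (k,) Q-values
</pre>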
+Once the Q-values are approximated, the <b class="term">explorer</b> can either exploit the learned values and greedily choose the best action, or decide otherwise (for example, take a random action with probability $\epsilon$). This tradeoff between exploitation and exploration is necessary early in learning, when the Q-value estimates are still poor and many states have never been visited.
+</tr>
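For illustration, an $\epsilon$-greedy explorer over the decoder's Q-values can be sketched as follows (generic, not SeaPearl's exploration code):
<pre>
import random
import torch

def epsilon_greedy(q_values: torch.Tensor, epsilon: float) -> int:
    """Pick a value index: random with probability epsilon, otherwise the argmax of the Q-values."""
    if random.random() < epsilon:
        return random.randrange(q_values.size(0))   # explore
    return int(torch.argmax(q_values).item())       # exploit
</pre>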
+<br>
+<br>
+<tr>
+<div style="text-align: center;">
+<img src="images/SeaPearl_recap.png" alt="nthu" width="650" height="330">
+</div>
+</tr>
+<br>
+<tr>
<br>
<br>
My website is still under construction. I will add the rest of the explanations soon :)!

images/SeaPearl_recap.png (335 KB)

images/scheme.drawio.png (41.8 KB)
