<br>
The RL algorithm used is <a href="https://www.researchgate.net/figure/The-Q-learning-algorithm-taken-from-Sutton-Barto-1998_fig1_337089781" target="_blank">Deep Q-Learning</a> (DQN). The associated <b class="term">reward</b> is based on the objective function/quality of the solution returned.
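As a purely illustrative sketch (hypothetical Python, not SeaPearl's actual reward definition), a reward based on the objective value of a minimization problem could simply be its negation once a solution is returned :
<pre>
# Purely illustrative sketch, not SeaPearl's actual reward definition:
# for a minimization problem, reward the episode with the negated objective
# value of the returned solution, so better solutions yield larger rewards.
def episode_reward(objective_value):
    return -objective_value
</pre>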
<h3>Generic graph representation :</h3>
Every CP problem is defined by a set of variables, the values that these variables can take, and a set of constraints on these variables. The idea is to encode each of these entities as a node in a graph and connect these nodes according to whether :
<ul>
<li>A <b class="term">Value</b> is part of a <b class="term">Variable</b>'s domain of definition</li>
<li>A <b class="term">Variable</b> is involved in a <b class="term">Constraint</b></li>
</ul>
This graph is naturally updated dynamically throughout the resolution of instances during the training process. Here is a small example just to get a sense of things :
The advantage of this method is that it allows the entire information to be encoded in a structure (a graph) that can be processed by a neural network. Each node comes with node embeddings that allow one to identify, among others, the type of constraint of a Constraint node or the value of a Value node.
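As a rough illustration (hypothetical Python, not SeaPearl's actual data structures), the graph of a toy instance could be built like this, with one typed feature dictionary per node :
<pre>
# Hypothetical sketch of the tripartite graph encoding (illustration only).
# Nodes are variables, values and constraints; edges follow the two rules above.
variables   = {"x": [1, 2, 3], "y": [2, 3]}   # variable -> current domain
constraints = {"c1": ("x", "y")}              # e.g. a not_equal constraint over (x, y)

features, edges = {}, []

for var in variables:
    features[var] = {"type": "variable"}
for val in {v for dom in variables.values() for v in dom}:
    features[f"val_{val}"] = {"type": "value", "value": val}
for con in constraints:
    features[con] = {"type": "constraint", "kind": "not_equal"}

# Rule 1: a Value node is connected to every Variable whose domain contains it.
for var, dom in variables.items():
    edges += [(var, f"val_{val}") for val in dom]
# Rule 2: a Variable node is connected to every Constraint it is involved in.
for con, scope in constraints.items():
    edges += [(var, con) for var in scope]
</pre>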
<h3>Neural pipeline for variable/value assignment :</h3>
Now that we have defined our input, recall that our goal is to infer a variable/value assignment. Let's consider for the moment that the variable selection heuristic is deterministic, so that the input is the graph representation <b class="term">and</b> a variable on which to branch. This is where we are :
<br>
<br>
Given a state, the RL agent is parameterized by a DQN-network that outputs the Q-values associated with every possible value selection in the domain of the selected variable $v$. The tripartite state is fed into a neural approximator model, the <b class="term">learner</b> $\hat{Q}$, which consists of two parts : an <b class="term">encoder</b> for learning contextual embeddings of the nodes of the graph representation, and a <b class="term">decoder</b> which, given these node embeddings, estimates the optimal policy to design a powerful heuristic.
<h4>Graph neural network encoder :</h4>
Graph convolutional networks (GCN) constitute a very convenient solution to learn contextual node embeddings, and have been widely used for this purpose in reinforcement learning for discrete optimization. Due to the heterogeneous nature of our representation, we opted for a heterogeneous GCN composed of several graph convolutional layers.
<br>
<br>
Considering a variable $v_i$, a constraint $c_j$ and a value $u_k$, they are respectively defined by raw features $V_i, C_j, U_k$, with respective dimensions $d_v, d_c, d_u$. First, a type-wise linear combination of the raw features computes the input features of dimension $d$ for the GNN, such that : $\mathbf{h}_{v_{i}}^{0} = V_{i}w_v$, $\mathbf{h}_{c_{j}}^{0} = C_{j}w_c$ and $\mathbf{h}_{u_{k}}^{0} = U_{k}w_u$.
Then, we recursively perform $N$ graph convolution operations on the nodes of the graph representation. At step $t$, a convolution can be formulated as :
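As an illustrative sketch only (the exact formulation may differ), a plausible variable-node update using the notation defined just below is
$$\mathbf{h}_{v_{i}}^{t+1} = \phi_{v}\Big(\mathbf{h}_{v_{i}}^{t} : \bigoplus_{c_{j} \in \mathcal{N}_{c}(v_{i})} \mathbf{h}_{c_{j}}^{t} : \bigoplus_{u_{k} \in \mathcal{N}_{u}(v_{i})} \mathbf{h}_{u_{k}}^{t}\Big),$$
with analogous updates for constraint and value nodes,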
where $\phi_{v}, \phi_{c}, \phi_{u}$ are one-layer perceptrons, composed of an affine transformation followed by an activation function, $:$ is the concatenation operation, $\mathcal{N}_{v},\mathcal{N}_{c},\mathcal{N}_{u}$ represent the type-specific neighborhoods, and $\bigoplus$ is the feature-wise aggregation function.
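As a hedged sketch (assuming sum aggregation and a ReLU activation; the actual layer may differ), one such convolution step on a toy graph could look like this :
<pre>
import numpy as np

rng = np.random.default_rng(0)
d = 4  # embedding dimension after the type-wise input projection

# Toy tripartite graph: one variable, one constraint, two values.
node_types = {"x": "variable", "c1": "constraint", "val_1": "value", "val_2": "value"}
edges = [("x", "c1"), ("x", "val_1"), ("x", "val_2")]           # undirected

h = {n: rng.normal(size=d) for n in node_types}                 # current embeddings h^t

# One one-layer perceptron (affine map + ReLU) per node type; its input is the
# concatenation of the node's own embedding with the two aggregated neighborhoods.
W = {t: 0.1 * rng.normal(size=(d, 3 * d)) for t in ("variable", "constraint", "value")}

def neighborhood(node, wanted_type):
    """Type-specific neighborhood: neighbors of `node` having type `wanted_type`."""
    out = []
    for a, b in edges:
        if a == node and node_types[b] == wanted_type:
            out.append(b)
        if b == node and node_types[a] == wanted_type:
            out.append(a)
    return out

def conv_step(h):
    """One heterogeneous graph convolution (assumed sum aggregation, ReLU activation)."""
    new_h = {}
    for node, ntype in node_types.items():
        parts = [h[node]]
        for t in ("variable", "constraint", "value"):
            if t == ntype:
                continue                                        # tripartite: no same-type edges
            parts.append(sum((h[n] for n in neighborhood(node, t)), np.zeros(d)))
        new_h[node] = np.maximum(W[ntype] @ np.concatenate(parts), 0.0)
    return new_h

h = conv_step(h)   # one of the N recursive convolutions: h^t -> h^{t+1}
</pre>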
<br>
<br>
<h4>Downstream neural network decoder :</h4>
Once the contextual node embeddings are computed, a decoder should be used to convert them into an actionable policy.
<br>
<br>
Our architecture consists in first transforming the embeddings of the variable $v$ currently branched on and of all the possible values, by feeding them into two different multi-layer perceptrons. The transformed embedding of each value $u \in \mathcal{D}_v$, concatenated with the transformed embedding of the branching variable $v$, passes through a last multi-layer perceptron that outputs the approximated Q-value for each pair ($v$,$u$).
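Schematically, writing $\mathbf{h}^{N}$ for the final embeddings produced by the encoder (an illustrative formulation) :
$$\hat{Q}(v, u) = \phi_{q}\big(\phi_{v}(\mathbf{h}_{v}^{N}) : \phi_{u}(\mathbf{h}_{u}^{N})\big), \qquad u \in \mathcal{D}_{v},$$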
where $\phi_{q}, \phi_{u}, \phi_{v}$ are multi-layer perceptrons.
<br>
<br>
Once the Q-values are approximated, the <b class="term">explorer</b> can exploit the learned values and greedily choose the best action, or decide otherwise (for example, a random action with probability $\epsilon$). This tradeoff between exploitation and exploration is necessary in early learning, when the estimate of the Q-values is very poor and many states have never been visited before.
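A minimal sketch of such an $\epsilon$-greedy explorer (illustration only, not SeaPearl's actual explorer) :
<pre>
import random

# Minimal epsilon-greedy action selection (illustrative sketch).
# q_values maps each value u in the domain of the branching variable to its
# approximated Q-value; epsilon is the probability of exploring.
def select_value(q_values, epsilon):
    if epsilon > random.random():
        return random.choice(list(q_values))      # explore: pick a random value
    return max(q_values, key=q_values.get)        # exploit: pick the greedy value

# epsilon is typically decayed from ~1.0 towards a small value as training progresses.
chosen = select_value({1: 0.2, 2: 0.7, 3: 0.1}, epsilon=0.05)
</pre>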