With everything that was presented in Part 1, I hope you now understand how a neural network works in general, with its two main steps of forward propagation and backpropagation. That is already quite a lot, since there are many concepts to digest! In this part, again without going into too much detail, I will introduce a few notions that take the use of neural networks a little further.
Initializing weights and biases
We have hardly talked about it so far, but to start the forward propagation step, all the weights and biases of the model must have a value at the beginning (before being modified by the gradient descent), otherwise the model cannot produce a prediction… So how do we initialize our model? Do we set all weights and biases to 0? This is not a very good idea… If all weights are initialized to 0, every neuron in a given layer computes the same output and receives the same gradient update, so the neurons of a layer remain identical to one another throughout training. It is then like having a single neuron in each layer, and we don’t do much better than a linear model. This is called the symmetry problem because all the neurons of a hidden layer stay the same. We will generally prefer to initialize weights and biases randomly, for example with a Gaussian distribution centered on 0 with a standard deviation of 1. Some authors also recommend, for the weights, a Gaussian distribution centered on 0 with a standard deviation of \frac{1}{\sqrt{n_{in}}}, with n_{in} the number of input weights of each neuron, in order to avoid the saturation of the neurons (which would make them learn less quickly).
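To make the saturation argument a little more concrete, here is a minimal NumPy sketch that compares the two Gaussian initializations on a single sigmoid layer. The layer sizes (1000 inputs, 50 neurons), the all-ones input and the 0.01/0.99 saturation thresholds are arbitrary assumptions chosen for illustration, and `init_layer` is a hypothetical helper, not a function from any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_layer(n_in, n_out, scaled, rng):
    """Draw Gaussian weights and biases for a layer of n_out neurons with n_in inputs each."""
    # scaled=True uses a standard deviation of 1/sqrt(n_in), otherwise 1
    std = 1.0 / np.sqrt(n_in) if scaled else 1.0
    weights = rng.normal(0.0, std, size=(n_out, n_in))
    biases = rng.normal(0.0, 1.0, size=n_out)
    return weights, biases

rng = np.random.default_rng(0)
x = np.ones(1000)  # hypothetical input sample with 1000 features

saturated_fraction = {}
for scaled in (False, True):
    weights, biases = init_layer(1000, 50, scaled, rng)
    a = sigmoid(weights @ x + biases)
    # A neuron is counted as saturated when its activation is very close to 0 or 1
    saturated_fraction[scaled] = float(np.mean((a < 0.01) | (a > 0.99)))
    print(f"scaled={scaled}: fraction of saturated neurons = {saturated_fraction[scaled]:.2f}")
```

With a standard deviation of 1, the weighted sum of 1000 inputs is spread very widely, so most neurons should start in the flat part of the sigmoid; with the scaled version, the weighted sum stays of the order of 1 whatever the number of inputs.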
The cross-entropy function
As we have seen, to evaluate the quality of a neural network model and to perform the backpropagation step, we need a cost function, simply because the model needs to know when it predicts well and when it predicts poorly. The cost function presented at the beginning of the article is the quadratic function, a classic choice that penalizes predictions more and more strongly as they move away from the expected values. Nevertheless, this function can suffer from slow learning (the “learning slowdown”, which I describe below), and another function is often preferred, the cross-entropy:
C = - \frac{1}{n} \sum_{x} [y \times \log(a) + (1-y) \times \log(1-a)]
As we have repeated several times, the objective of the backpropagation step is to calculate the derivative of the cost function with respect to the weights of the model (always the famous \frac{\partial C_{x}}{\partial w_{1}}). With the quadratic function, we had calculated earlier (I let you refer to the corresponding equations):
\frac {\partial C_{x}}{\partial w_{1}} = \frac {\partial z(x)}{\partial w_{1}} \times \frac {\partial C_{x}}{\partial z(x)} = a_{1}(x) \times 2 \times (\phi(z(x))-y(x)) \times \phi'(z(x))
The problem of learning slowdown comes from the last term of the equation: \phi'(z(x)). In the case of the sigmoid activation function, we saw that the curve is flat at both ends, which means that its derivative can be very small when the activation value of a neuron is close to 1 or 0. If the derivative is very small, so is the gradient, and the model therefore learns very slowly (and we have already talked about the risks of very slow learning and of getting stuck in a local minimum of the cost function). With the cross-entropy function there is no such problem because, believe it or not, when you calculate the derivative of the cost function with respect to the weights of the model, the term \phi'(z(x)) disappears, so we are saved! This is not a coincidence either, since the cross-entropy function was designed for that purpose… I’ll let you go and look at the derivatives if you’re interested.
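For readers who want to see the effect numerically, here is a small sketch, assuming a sigmoid output neuron. In that case the per-sample cross-entropy derivative works out to \frac {\partial C_{x}}{\partial w_{1}} = a_{1}(x) \times (\phi(z(x))-y(x)), because the \phi'(z(x)) factor cancels out. The values chosen for a_{1}, y and z are arbitrary, and the two function names are hypothetical helpers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def quadratic_grad(a1, z, y):
    """dC_x/dw_1 for the quadratic cost: a1 * 2 * (phi(z) - y) * phi'(z)."""
    a = sigmoid(z)
    return a1 * 2.0 * (a - y) * a * (1.0 - a)  # phi'(z) = phi(z) * (1 - phi(z))

def cross_entropy_grad(a1, z, y):
    """dC_x/dw_1 for the cross-entropy cost with a sigmoid output: a1 * (phi(z) - y)."""
    return a1 * (sigmoid(z) - y)

# The expected output is 0; the larger z is, the more wrong (and saturated) the neuron is
a1, y = 1.0, 0.0
for z in (0.0, 2.0, 5.0):
    print(f"z={z}: quadratic={quadratic_grad(a1, z, y):.4f}  "
          f"cross-entropy={cross_entropy_grad(a1, z, y):.4f}")
```

The more confidently wrong the neuron is, the smaller the quadratic gradient becomes, whereas the cross-entropy gradient keeps growing with the error.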
Note: There is indeed a negative sign in front of the cross-entropy function, it is not a mistake! Since the activation a lies between 0 and 1, both logarithms are negative, so the cost will always be positive, don’t worry!
In general, the cross-entropy function tends to be used for classification problems and the quadratic function for regression problems. In our very simple example at the beginning of the article, I preferred to use the quadratic cost function for the backpropagation step because it is easier to understand (the quadratic function is still a little less scary than the cross-entropy function).
Regularization and Drop Out
One of the classic problems you face when trying to model a phenomenon is overlearning, or overfitting. As its name suggests, overfitting occurs when the model learns the data provided to it too closely and is not able to generalize its predictions to new input data. This is why you should get used to splitting your dataset into a training set, on which the model learns, and a validation set, on which you check whether the model generalizes to data it has not seen during training. To be completely rigorous, you would even need a third dataset that is independent of your initial dataset (acquired under other conditions, for example) to be sure of the generalizability of the model. On the other hand, validating your model on the same data that was used to train it is meaningless… To be avoided! What can be done in the case of neural networks to limit this problem of overfitting? Several solutions are available to us! I will only present three of them here:
First solution: Add samples to your initial dataset. Well, easier said than done… If it means going back out for a new sampling campaign, we’re not out of the woods yet. It’s still a good option! When you are not able to add real samples to your dataset, there is still the possibility of adding artificial samples. For example, in the case of weed images, we can imagine several pre-processing operations to artificially enlarge the dataset: rotation, distortion, adding noise to the image…
Second solution: Update the weights and biases of the network without using all the neurons in the model. This is the “Dropout” method. For each sample that is passed through the model to estimate the weight and bias variations needed to improve it, some neurons are randomly and temporarily removed from the network (they are not permanently removed, you can think of them as ghost neurons). Once the sample has been passed, the removed neurons reappear and a new random draw decides which ones will be removed for the next sample. Thanks to this Dropout approach, the model is more flexible in the sense that it does not always learn with the same neurons, which makes it easier to generalize predictions to new data.
Third solution: Regularize the cost function used so that the weights used in the model are not too large. Several regularization methods exist but we will focus on the one corresponding to the so-called L2 or “weight decay” regularization. The adjusted cost function is as follows:
C = \frac {1}{n} \sum C_{x} + \frac {\lambda}{2n} \sum w^2
Here you will recognize the cost formula that was used previously, C = \frac {1}{n} \sum C_{x} (regardless of which one, whether it is the quadratic, cross-entropy or another function), to which the sum of the squares of the network weights, weighted by the ratio \frac {\lambda}{2n}, has been added. The \lambda factor is called the regularization parameter (\lambda is positive). The term n is still the number of training samples.
If \lambda is set to a very small value, the regularized cost function simply becomes the initial cost function again, since the regularization term no longer matters. On the other hand, if \lambda is set to a high value, then the regularization term becomes more important! Remember that our main objective is to produce the best possible estimates with our model, so what we want is the lowest possible cost value. If \lambda is large, then the sum of the squared weights of the model will have to be small, otherwise there is little chance of having a low cost value. Regularization therefore favours small weights in neural network models. When the model is made up of small weights, it is not too strongly affected by sudden changes in the input data, and therefore it is not too specific to these input data, nor to the noise they may contain. It is a way of staying relatively flexible and of being able to generalize predictions to new data. If the weights of the model are too large, we can imagine that a new sample that is a little different from the others could completely upset the model, which is not what we want.
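The second and third solutions above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `dropout_mask` and `l2_regularized_cost` are hypothetical helper names, the drop rate, \lambda and the toy numbers are arbitrary, and the sketch leaves out the rescaling of activations that practical Dropout implementations apply.

```python
import numpy as np

def dropout_mask(n_neurons, drop_rate, rng):
    """Random 0/1 mask for one pass: 0 marks a neuron that is temporarily removed."""
    return (rng.random(n_neurons) >= drop_rate).astype(float)

def l2_regularized_cost(per_sample_costs, weights, lam):
    """C = (1/n) * sum(C_x) + (lambda / 2n) * sum(w^2), summed over all weight arrays."""
    n = len(per_sample_costs)
    data_term = np.mean(per_sample_costs)
    penalty = lam / (2.0 * n) * sum(np.sum(w ** 2) for w in weights)
    return data_term + penalty

rng = np.random.default_rng(0)

# Dropout: a new mask is drawn for every pass, and "ghost" neurons output 0
hidden_activations = np.array([0.2, 0.9, 0.5, 0.7])  # toy activations of a hidden layer
for step in range(3):
    mask = dropout_mask(hidden_activations.size, drop_rate=0.5, rng=rng)
    print("pass", step, "-> activations used:", hidden_activations * mask)

# L2 regularization: the same per-sample costs with and without the penalty
per_sample_costs = np.array([0.2, 0.4])
weights = [np.array([[1.0, -2.0]]), np.array([3.0])]  # toy weights of two layers
print("lambda = 0   :", l2_regularized_cost(per_sample_costs, weights, lam=0.0))
print("lambda = 0.1 :", l2_regularized_cost(per_sample_costs, weights, lam=0.1))
```

With \lambda = 0 the function returns the plain average cost; increasing \lambda makes large weights more and more expensive, which is what pushes the gradient descent towards small weights.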
The bias-variance dilemma
In the previous section, we focused on the problem of overfitting, that is, not being able to generalize our predictions to new input data. We can also talk about the opposite problem, underfitting, where the model has not learned enough from the data and is therefore not able to make good predictions. If our predictions are wrong, how do we know whether we are facing a problem of underfitting or overfitting? This is where the bias/variance dilemma, or compromise, comes in. The bias of a model can be seen as its inability to learn the right decision rules (so it is more related to underfitting). The variance of a model can be seen as its sensitivity to small fluctuations in the training sample (so it is more related to overfitting). We talk about a bias/variance dilemma because it is difficult to reduce both components at the same time… If we reduce the bias of the model too much, we overlearn on the training data and end up with a high variance, and vice versa. So the whole game is to reach a compromise… In the following table, I give you some examples to judge, depending on the prediction error you get on your training set and on your validation set, the state of the bias and variance of your model (the bias/variance compromise is obviously not specific to neural networks).
Prediction error (%) | Case 1 | Case 2 | Case 3 | Case 4 |
Training set | 2 | 12 | 12 | 1 |
Validation set | 15 | 13 | 25 | 2 |
Interpretation | High variance | High bias | High bias and variance | Low bias and variance |
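The reading of the table can be written as a simple rule of thumb. The sketch below is only one possible formalization: the function name `diagnose` and the two thresholds of 5 percentage points are assumptions of mine, and in practice what counts as an acceptable error depends entirely on the problem.

```python
def diagnose(train_error, validation_error, max_train_error=5.0, max_gap=5.0):
    """Rough bias/variance reading from two prediction errors given in percent.

    max_train_error and max_gap are arbitrary thresholds chosen for illustration.
    """
    # A high training error suggests the model has not learned the decision rules
    high_bias = train_error > max_train_error
    # A large gap between validation and training error suggests poor generalization
    high_variance = (validation_error - train_error) > max_gap
    if high_bias and high_variance:
        return "High bias and variance"
    if high_bias:
        return "High bias"
    if high_variance:
        return "High variance"
    return "Low bias and variance"

# The four columns of the table
for train_error, validation_error in [(2, 15), (12, 13), (12, 25), (1, 2)]:
    print(train_error, validation_error, "->", diagnose(train_error, validation_error))
```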
A quick return to hyperparameters, with a few new ones on the list
You may have started to feel it, but there are quite a few hyperparameters to set in neural networks: for example, we have already talked about the learning rate (\eta or \alpha depending on the notations), dropout (with the percentage of neurons removed), and regularization with \lambda. We distinguish these general settings (the hyperparameters) from the model parameters that are calculated internally by the model (weights, biases…). Let’s look at the few other hyperparameters we may have to tune:
- Batch size: This is the number of samples that are propagated through the neural network before the parameters are updated. Instead of, for example, passing your 1000 samples all at once and updating the parameters of the model (biases, weights…) only once, you can propagate your samples in 10 groups of 100 samples and update the parameters more regularly. The main advantage is that this procedure uses less memory and also lets the model learn faster. However, be careful not to use a group size that is too small, which would greatly degrade the gradient descent (since the parameters would be constantly re-evaluated on a very small number of samples that are not really representative…).
- Epoch: An epoch is one pass of the complete dataset through the neural network, and the number of epochs is the number of times the complete dataset will be propagated. By passing the complete dataset several times with different batches, we get closer to the optimum of our model.
- Iterations: This is the number of batches needed to complete one epoch. To repeat the previous example, if I divide my dataset of 1000 samples into batches of 100 samples, it will take 10 iterations to complete an epoch.
- Momentum: This hyperparameter, generally noted \mu (or \beta depending on the notations), adds a notion of speed to the gradient descent we saw earlier. The momentum varies between 0 and 1. So far, during gradient descent, we have looked for the most interesting direction to move towards the optimum of the cost function. We essentially talked about direction (i.e. are we going the right way or not), but we didn’t ask ourselves how fast we were moving in that direction. In the initial version of the gradient descent, I explained how to update the weights of the neural network:
w^{l} = w^{l} - \eta \times \frac {\partial C}{\partial w^{l}}
With the notion of momentum, we add a short-term memory to the gradient descent (v is here a velocity vector):
v^{l} = \mu \times v^{l} - \eta \times \frac {\partial C}{\partial w^{l}}
And then we update the weights of the model
w^{l} = w^{l} + v^{l}
By choosing a momentum of 0, we recover the classic gradient descent that we used so far. By choosing a momentum of 1, the velocity keeps all the past gradient terms and a new one is added at each iteration, which accelerates the gradient descent more and more. It’s interesting at first, but the problem is that we need this descent to slow down as the model gets closer to its optimum. This is why values strictly between 0 and 1 are generally chosen: they add a notion of friction, or deceleration, to the descent.
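To tie the batch, epoch, iteration and momentum notions together, here is a minimal training loop on a toy problem. Everything about the setup is an assumption made for illustration: a linear model with a quadratic cost instead of a full neural network, synthetic noiseless data, and arbitrary values for \eta, \mu and the number of epochs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 1000 samples, 3 input variables, outputs generated by known weights
n_samples, batch_size, n_epochs = 1000, 100, 30
X = rng.normal(size=(n_samples, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w

eta, mu = 0.05, 0.9  # learning rate and momentum
w = np.zeros(3)      # weights to learn
v = np.zeros(3)      # velocity vector, the short-term memory of the descent

iterations_per_epoch = n_samples // batch_size  # 1000 / 100 = 10 iterations
n_updates = 0
for epoch in range(n_epochs):
    # Shuffle so that each epoch sees different batches
    order = rng.permutation(n_samples)
    for i in range(iterations_per_epoch):
        idx = order[i * batch_size:(i + 1) * batch_size]
        error = X[idx] @ w - y[idx]
        # Gradient of the quadratic cost averaged over the batch
        grad = 2.0 * X[idx].T @ error / batch_size
        v = mu * v - eta * grad  # v = mu * v - eta * dC/dw
        w = w + v                # w = w + v
        n_updates += 1

print("iterations per epoch:", iterations_per_epoch)
print("total parameter updates:", n_updates)
print("learned weights:", np.round(w, 3))
```

Setting `mu = 0.0` in this sketch gives back the plain gradient descent update, which makes it easy to compare how quickly the two versions approach the known weights.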
Supervised, unsupervised and reinforcement learning
In this article, we have talked about learning several times! Our objective was indeed for our neural network model to learn. In the examples that were put in place, the models had to learn to predict wheat yield on a plot from a number of input parameters. So far, we have worked exclusively on supervised learning. We had a set of input variables (rainfall, temperature, soil type) and we knew what the result of the model should be: low yield or high yield. The model still had to adjust all its weights and biases to reach the best prediction rate. Learning is supervised because the user knows exactly what they want to achieve and therefore supervises the algorithm to reach that goal. Supervised learning remains by far the most widely used form of learning with neural networks, particularly since it gives quite impressive results in terms of prediction quality.
Let’s now talk about unsupervised learning. In this case, we start from the same base as in supervised learning, i.e. we have a dataset with a number of input parameters, but this time the output of the model is totally unknown. The user has no (or few) preconceptions about what the model should return. The user actually asks the model to extract particular trends, patterns, groups, or data arrangements from the dataset provided as input. If we take the example of wheat yield prediction, instead of telling the model “these input parameters led to low yield and these led to high yield, find me how I can predict the yield of new input data” (supervised learning), we would tell it “I have all these input data, find me a way to group them into different subgroups” (unsupervised learning). At the end of this unsupervised work, one can hope to obtain two subgroups characterized by low and high yield. Unsupervised learning remains the most complex form of learning (the model manages on its own and is not guided), but you can imagine that it is the form of learning that everyone would like to be able to achieve.
The third learning case that I briefly present here is reinforcement learning. This form of learning was brought to light quite recently with the success of DeepMind’s AlphaGo computer program, which succeeded in beating the world champion at the game of Go, a particularly complex strategy board game. In this form of learning, the model learns from the consequences of its own actions. The algorithm operates in an environment where all the actions it can perform are known, each associated with a reward or a penalty. The objective is to minimize a cost function (for example, in the case of AlphaGo, one could imagine that it is to lose as few games as possible). The algorithm therefore chains a set of known actions (placing a stone here, placing a stone there, etc.) and each of the actions is rewarded or not depending on the result of the game. The algorithm does not necessarily choose to chain immediate rewards, since a large penalty can be hidden behind a reward (in the example of chess, capturing a pawn is good, but if it means losing your queen on the next move, it is not very interesting…). It’s a bit like a human way of learning. For example, when we learn to ride a bike, we can consider that our cost function is to avoid falling and to be able to go fast enough. We also know all the actions we can take (pedaling, leaning, accelerating, slowing down…). As we try, we get a better understanding of how to ride a bike and of what not to do (for example, if we lean too much, we fall). Through this form of learning, the AlphaGo program was able to play thousands of games against itself and find out for itself how to minimize its cost function. Some experts were surprised by moves played by the program that later proved to be particularly effective (because the algorithm had learned that these moves would lead to a positive reward later in the game).
I insist on the fact that, in this form of learning, the algorithm must know all the actions it is allowed to perform (the rules of the game, in a way). If you ask it to play chess and you don’t teach it that the queen can capture diagonally, it won’t be able to learn that on its own… But if it knows all the rules of the game well, it may well be able to react to new situations (perhaps in a very surprising way). This type of learning is also beginning to be used in real-time strategy video games (e.g. StarCraft) to try to beat players from all over the world, where the complexity is absolutely staggering (each player has their own strategy and can perform an impressive number of different actions).