Here are my responses to the questions in the exercises -
Using dropout1=0.2 and dropout2=0.5, I compared the loss and accuracy for the default stochastic gradient descent (SGD) optimizer and the Adam optimizer over 100 epochs-
SGD optimizer (learning rate = 0.5)
Adam optimizer (learning rate = 1e-3)
For comparison, without any dropout and with the SGD optimizer, I get-
As the number of epochs increases, the gap between train and test accuracy widens, indicating that the network is over-training. This is true irrespective of the optimizer used. There are bumps in the SGD curves because the learning rate of 0.5 is high and could probably be lowered. From these curves alone, it is not clear whether dropout1=0.2 and dropout2=0.5 are doing much, since the generalization gap appears to be about the same.
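For reference, here is a minimal sketch of the setup I am comparing. The layer sizes (784-256-256-10) are my assumption for the usual Fashion-MNIST MLP; the actual training loop is the chapter's, run for 100 epochs with each optimizer.

```python
import torch
from torch import nn

def make_net(dropout1=0.2, dropout2=0.5):
    # Two hidden layers, with dropout applied to the activations of each.
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(784, 256), nn.ReLU(), nn.Dropout(dropout1),
        nn.Linear(256, 256), nn.ReLU(), nn.Dropout(dropout2),
        nn.Linear(256, 10),
    )

net_sgd, net_adam = make_net(), make_net()
opt_sgd = torch.optim.SGD(net_sgd.parameters(), lr=0.5)
opt_adam = torch.optim.Adam(net_adam.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
# Each net is then trained for 100 epochs on the same data and the
# train/test loss and accuracy curves are compared.
```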
Keeping dropout2=0.2, num_epochs=100, and the Adam optimizer with learning rate 1e-3, I swept dropout1 from 0 to 0.9 and this is what I got-
Keeping dropout1=0.5, num_epochs=100, and the Adam optimizer with learning rate 1e-3, I swept dropout2 from 0 to 0.99 and this is what I got-
Controlling dropout1 has more impact on regularization than dropout2. But as observed by others, this regularization does not have a huge impact on generalization for this shallow network, though it definitely helps reduce over-training: as we increase dropout (in both layer 1 and layer 2), the gap between train and test accuracy shrinks. Although I don't show it, the curves don't change whether the dropout is applied before or after the activation.
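A sketch of how I ran the dropout1 sweep (reusing make_net from the sketch above; `train` is a stand-in for the chapter's training helper, not a real API):

```python
import numpy as np

results = {}
for p1 in np.arange(0.0, 1.0, 0.1):              # dropout1 from 0 to 0.9
    net = make_net(dropout1=float(p1), dropout2=0.2)
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    tr_acc, te_acc = train(net, opt, num_epochs=100)   # assumed helper
    # Record the generalization gap for each dropout1 value.
    results[round(float(p1), 1)] = (tr_acc, te_acc, tr_acc - te_acc)
```

The dropout2 sweep is identical with the roles of dropout1 and dropout2 swapped.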
Q: What is the variance of the activations in each hidden layer when dropout is and is not applied?
With dropout2 = 0.5, I swept dropout1 from 0 to 1. This is the resulting plot-
With dropout1 = 0.6, I swept dropout2 from 0 to 1. This is the resulting plot-
Without any dropout in either the first or second layer, the values are-
tr_loss, te_loss = 0.01, 0.80
tr_acc, te_acc = 0.99, 0.89
lin1_var, lin2_var, lin3_var = 0.03, 0.02, 0.07
dropout1 has a significantly stronger influence on the variance in all three layers than dropout2.
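The per-layer variances quoted above can be collected with forward hooks (reusing make_net from the first sketch); here is a sketch. Measuring at the outputs of the three Linear layers, and the random stand-in batch, are my assumptions:

```python
activations = {}

def save_output(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

net = make_net(dropout1=0.0, dropout2=0.0)          # no-dropout baseline
net[1].register_forward_hook(save_output('lin1'))   # first Linear layer
net[4].register_forward_hook(save_output('lin2'))   # second Linear layer
net[7].register_forward_hook(save_output('lin3'))   # output Linear layer

X = torch.randn(256, 1, 28, 28)                     # stand-in for a test batch
net.eval()
with torch.no_grad():
    net(X)
lin1_var, lin2_var, lin3_var = (activations[k].var().item()
                                for k in ('lin1', 'lin2', 'lin3'))
```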
Dropout is a way to incentivize the model to find good features. What the model learns is spread out across all the weights, so using only some of the features at test time would be suboptimal. The other reasons given in previous discussions, such as the results during evaluation not being deterministic, are also valid.
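This is also easy to check directly: nn.Dropout is only active in training mode and reduces to the identity in eval mode, so evaluation is deterministic and uses all the learned features.

```python
drop = nn.Dropout(0.5)
x = torch.ones(1, 8)
drop.train()
print(drop(x))   # random mask, surviving entries scaled by 1/(1-p)
drop.eval()
print(drop(x))   # returns x unchanged
```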
Q: What happens when we apply dropout to individual weights of the weight matrix rather than the activations?
Here, I applied dropout1, dropout2 = 0.2, 0.5 to activations in hidden1 and hidden2 respectively-
Here, I applied dropout1, dropout2 to weights between hidden1-hidden2 and hidden2-out respectively.
Here, I applied dropout1, dropout2, dropout3 = 0.2, 0.2, 0.5 to input-hidden1, hidden1-hidden2 and hidden2-output respectively.
Here, I applied dropout1, dropout2 = 0.2, 0.5 to input-hidden1 and hidden1-hidden2 respectively (no dropout for hidden2-output)-
It appears that applying dropout1, dropout2 to the input-hidden1 and hidden1-hidden2 weights is much more effective at reducing over-training than applying them to the activations, at least for this shallow network.
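For concreteness, this is roughly how dropout on the individual weights can be implemented (a DropConnect-style linear layer); the class name and details are my own sketch, not the chapter's code:

```python
import torch.nn.functional as F

class DropConnectLinear(nn.Linear):
    """Linear layer that drops individual weights instead of activations."""
    def __init__(self, in_features, out_features, p=0.5):
        super().__init__(in_features, out_features)
        self.p = p

    def forward(self, x):
        if self.training and self.p > 0:
            # Zero each weight independently with probability p and rescale
            # the survivors, mirroring inverted dropout on activations.
            mask = (torch.rand_like(self.weight) > self.p).float()
            weight = self.weight * mask / (1.0 - self.p)
        else:
            weight = self.weight
        return F.linear(x, weight, self.bias)
```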
One way to think about dropout is as follows- each layer in a network learns to create new features using "inputs" from previous layers. Dropout in the higher layers forces the lower layers to generate features that are informative for a number of different features in the higher layers, rather than "inputs" (to the higher layers) that are too specialized.
Dropout controls for variance indirectly, compared to the l2-norm penalty, which constrains the weights explicitly. So dropout, too, is effective in controlling model capacity.
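As a small illustration of the two knobs (values illustrative only, reusing make_net from above): dropout perturbs the activations inside the network, while the l2 penalty is applied directly to the weights through the optimizer's weight_decay term.

```python
# Regularize via dropout on the activations.
net_dropout = make_net(dropout1=0.2, dropout2=0.5)
opt_dropout = torch.optim.Adam(net_dropout.parameters(), lr=1e-3)

# Regularize via an explicit l2 penalty on the weights instead.
net_l2 = make_net(dropout1=0.0, dropout2=0.0)
opt_l2 = torch.optim.Adam(net_l2.parameters(), lr=1e-3, weight_decay=1e-4)
```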