Finding the right subnetwork in a randomly initialized neural network

Turns out randomly initialized weights have something in them...


What is hidden inside the weights of a randomly initialized neural network?

If you want to dive headfirst into the code, you can go directly here.

Earlier this week, a researcher I follow on Twitter retweeted this thread written by another researcher.

The paper mentioned can be accessed through this link. After reading the abstract and eventually the whole paper, I found the idea to be pretty cool and wanted to give it a go myself.

TL;DR of the paper: Chonky bois (wide, randomly initialized neural nets) contain wisdom (a subnetwork that performs well on a given task).

The main idea of the paper is that hidden inside a randomly initialized neural network is a subnetwork that can reach good performance on a given task, similar to (if not better than) a trained model. The problem is that, because of all the possible combinations, the number of subnetworks within a single network is huge. That makes it unlikely that we stumble on the right, or let's say good enough, subnetwork by chance. But the probability is still not zero, as you'll see in a bit.
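Just to get a feel for how big that haystack is, here's a quick back-of-the-envelope count (my own illustration, using a layer size that shows up in the experiment later in this post):

```python
from math import comb

# How many different top-50% masks can a single 784x100 fully connected
# layer have? (78,400 weights, half of them kept; numbers are just for scale.)
n_weights = 784 * 100
n_kept = n_weights // 2
n_masks = comb(n_weights, n_kept)

print(len(str(n_masks)))   # about 23,600 digits: the search space is astronomically large
```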

The idea of optimizing a neural network without changing the weights has been explored in many papers. The most recent one I read (and also mentioned in the paper covered in this post) is Weight Agnostic Neural Networks (WANN), which performs architecture search with a single shared value for all weights to reach good performance on certain tasks.

So, how do we find our needle in the haystack? Let’s say we already have our network. For simplicity’s sake, I’m gonna use a Fully-Connected Neural Network as an example. How do we decide which subnet is the best for our use case?

The approach mentioned in the paper is to assign each individual weight in the network its own score. During feedforward, for each layer, only the top-K% of weights with the highest scores are used; the rest are zeroed out. In this paper, only the weights are considered, and biases of any kind (including the ones in BatchNorm) are not used.

This equation kinda describes what happens during feedforward:

L_i = σ((W_i ⊙ M_i) L_{i-1})

With L_i being the output vector of the i-th layer, W_i the weights, M_i the mask for the weights, σ the activation function, and ⊙ elementwise multiplication. The weights, scores, and mask all have the same dimensions. The mask contains only 1s and 0s: 1 where the corresponding score (the one in the same position as the weight) is among the top-K% highest, 0 otherwise.
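To make the masking concrete, here is a minimal sketch of one masked layer in NumPy (the notebook uses CuPy, which shares the same API). The function names are mine, the activation is assumed to be ReLU, and ties in the scores are ignored:

```python
import numpy as np

def top_k_mask(scores, k=0.5):
    """Binary mask: 1 where the score is in the top-k fraction, 0 elsewhere."""
    n_keep = int(k * scores.size)
    threshold = np.sort(scores.ravel())[-n_keep]   # smallest score that still gets kept
    return (scores >= threshold).astype(scores.dtype)

def forward_layer(x, W, scores, k=0.5):
    """One masked, bias-free layer: only the top-k% scored weights participate."""
    M = top_k_mask(scores, k)
    return np.maximum(0.0, x @ (W * M))            # ReLU of x times (W masked by M)
```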

So far so good. But the scores are random, too! How do we make sure that eventually, the right scores will be assigned to the relevant weights?

Well, of course, gradients.

Normally, we use the gradients of the loss function with respect to the weights to nudge the weights in the right direction, slowly but surely adjusting them so our loss decreases over time.

Here, to update the scores, we find the gradients of the loss function with respect to the scores and update them in the same manner we update the weights in normal training. But there's a catch..

Recall that the mask M depends on whether a score is in the top-K% or not, and it only has 2 possible values, 1 and 0!

As a result, the gradient is 0 on both sides of the threshold. This is why, in the paper, the authors estimate the gradient using a straight-through estimator, which means the gradients pass straight through the mask and directly into the scores. In the future, better ways of scoring the weights could be explored, but looking at the results in the paper, this is good enough.
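Concretely, the straight-through estimator just pretends the 0/1 mask is the identity during the backward pass, so the gradient that would have reached a (masked) weight lands on its score instead. A rough sketch, reusing `top_k_mask` and the layer conventions from the code above:

```python
def backward_scores(x, W, scores, grad_out, k=0.5):
    """dL/d(scores) for one layer, using the straight-through estimator.

    `grad_out` is the gradient of the loss w.r.t. the layer's output.
    The step function inside the mask is treated as identity, so the
    gradient flows straight through M and onto the scores.
    """
    M = top_k_mask(scores, k)
    pre = x @ (W * M)
    grad_pre = grad_out * (pre > 0)          # ReLU derivative
    grad_mask = (x.T @ grad_pre) * W         # dL/dM (elementwise with W)
    return grad_mask                         # STE: dM/ds is taken as 1, so dL/ds is dL/dM
```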

Now we have it. We know how to run the forward pass and the backward pass, and we have the gradients w.r.t. the scores. Just grab an optimizer, chuck in the params and the gradients, and call it a day.
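Put together, the training loop looks like ordinary SGD, except the only thing being updated is the score tensor. A hedged sketch using the functions from the earlier snippets (`batches` and `d_loss_d_out` are placeholders; the notebook's loss and optimizer may differ):

```python
lr = 0.1                                    # illustrative learning rate

for x_batch, y_batch in batches:            # `batches`: your data iterator
    out = forward_layer(x_batch, W, scores, k=0.5)
    grad_out = d_loss_d_out(out, y_batch)   # placeholder: gradient of your loss w.r.t. `out`
    grad_s = backward_scores(x_batch, W, scores, grad_out, k=0.5)
    scores -= lr * grad_s                   # gradient step on the scores only
    # W is never touched: the weights stay at their random initialization.
```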

TL;DR so far

  • Weights are NEVER updated.
  • Biases aren’t used.
  • Each weight is assigned its own score.
  • During the forward pass, only weights whose scores are among the top-K% are considered. The rest are zeroed out.
  • To update the scores so that the right weights get selected, we calculate the gradient of the loss function w.r.t. the scores and update them just like we would update the weights.

Important parameters to note, from the paper

  • The value of K is important, and for simplicity it is the same for every layer. Generally, K between 30% and 70% is good. Too high and the predictions become close to random; too low and the subnetwork doesn’t have enough capacity.
  • The wider, the better. This gives the model more parameters and more possible subnetworks to choose from. But the benefit doesn’t come just from having more parameters: observations made in the paper show that even at the same parameter count (by adjusting K), the performance is still great, and gets even closer to the performance of a normally trained model.
  • Weight initialization matters. The Kaiming Uniform init and its scaled variant give the best results, as observed in the paper (see the sketch after this list).
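For reference, the (unscaled) Kaiming/He uniform init for a ReLU layer samples from U(-b, b) with b = sqrt(6 / fan_in); a minimal NumPy version is below. As I understand it, the scaled variant additionally accounts for only a fraction K of the weights being active, but check the paper for the exact formulation.

```python
import numpy as np

def kaiming_uniform(fan_in, fan_out, rng=None):
    """He/Kaiming uniform init: U(-b, b) with b = sqrt(6 / fan_in) (ReLU gain)."""
    rng = rng or np.random.default_rng()
    bound = np.sqrt(6.0 / fan_in)
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

W = kaiming_uniform(784, 250)        # frozen random weights, never trained
scores = kaiming_uniform(784, 250)   # the scores need an initial value too
```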

Code Implementation

At the time of implementation, I hadn’t learned PyTorch yet, so everything you see here is implemented using CuPy.

You can check the notebook; I think it should be pretty much self-documented. I used a normal fully connected neural network. I think this can be extended to CNNs, with the scores applied the same way. One could probably also experiment with having a score dedicated to each filter instead of to each individual weight in the CNN filter tensor; a rough sketch of that idea is below.
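For what it’s worth, here is my own sketch (not from the paper or the notebook) of what per-filter scoring could look like: one score per output filter, broadcast over that filter’s weights.

```python
import numpy as np

rng = np.random.default_rng(0)
filters = rng.standard_normal((32, 3, 3, 3))   # (out_channels, in_channels, kh, kw)
filter_scores = rng.standard_normal(32)        # one score per filter, not per weight

k = 0.5
n_keep = int(k * filter_scores.size)
threshold = np.sort(filter_scores)[-n_keep]
mask = (filter_scores >= threshold).astype(filters.dtype)

masked_filters = filters * mask[:, None, None, None]   # whole filters kept or dropped
```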

In my experiment, I used the same number of hidden layers for both variants; the only difference is the width. While the normal NN has layer sizes of [784,100,100,10], for the score NN I used [784,250,250,10] with K set to 50%. I also used the same loss function, the same initializer, and the same optimizer for both variants; even the batch size and number of epochs are the same. The only other difference is in alpha (the learning rate), because it seems like the scores can be trained with a higher learning rate (implications of faster convergence?).

At the end of the experiment, the normal NN achieves 97.5% accuracy on the training data and 95.5% on the test data, while the score NN achieves 93.3% on the training data and 92.26% on the test data. (Maybe the smaller gap between the train and test scores of the score NN signifies something?) This shows that it is possible to make use of the random weights of a NN by finding the right subnetwork within them.

I have no idea if this has any real-world implications, but I just thought it was a pretty cool idea.