Training Networks

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

Unit: Deep Learning

  1. Perceptron
  2. Energy Based Neural Networks
  3. Training Networks
  4. Deep Learning


A Learning Algorithm for Boltzmann Machines. By Ackley, Hinton, and Sejnowski in 1985.

Learning representations by back-propagating errors. By Rumelhart, Hinton, and Williams in 1986.

Other Useful References


How do you actually train neural networks?

Hopefully the past few posts have piqued your interest in neural networks. Maybe you even want to unleash a neural network on some data. How do you actually train the neural network?

I’m actually going to keep this brief for two reasons. First, detailed derivations can already be found elsewhere (for Boltzmann see Appendix of the original paper as well as MacKay, for backpropagation see Nielsen). Second, I firmly believe that algorithms are best learned by actually stepping through the updates, so any explanation I attempt will not be sufficient for you to truly learn the algorithm. I will provide some general context as well as some questions you should be able to answer, but please go do it yourself!

There are three general classes of machine learning based on the information received:

  • Unsupervised – data only. Boltzmann machine.
  • Supervised – data with labels. MLP with backpropagation.
  • Reinforcement – data, actions, and scores associated with each action. Deserves its own detailed post, but check out papers by DeepMind for cool applications.

The Boltzmann machine learning rule is an example of maximum likelihood. In practice, the original learning rule is too computationally expensive, so a modified algorithm called contrastive divergence (or variants such as persistent contrastive divergence) is utilized instead. See the Hinton guide to RBMs for more details.

Backpropagation is a computationally-efficient writing of the chain rule from calculus, so besides the above paper which popularized it, there is actually a long history of this algorithm being discovered and rediscovered.

Fundamental Questions

  • What is maximum likelihood?
  • Why can one interpret the learning terms in the BM algorithm as “waking” and “sleeping”?
  • Why are BM hidden layers so important?
  • Why are restricted Boltzmann machines, RBMs, much easier to train?
  • Why is backpropagation more computationally efficient than the finite difference method?
  • Derive the 4 backpropagation equations!

Advanced Questions

  • Follow Hinton’s RBM guide and implement your own Boltzmann machine
  • Use Nielsen’s code to train your own MLP

Energy Based Neural Networks

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

Unit: Deep Learning

  1. Perceptron
  2. Energy Based Neural Networks
  3. Training Networks
  4. Deep Learning


Neural networks and physical systems with emergent collective computational abilities. By Hopfield in 1982.

A Learning Algorithm for Boltzmann Machines. By Ackley, Hinton, and Sejnowski in 1985. (Note: Only section 1 and 2 are covered here, rest of paper covered in next topic).

Other Useful References

  • Hopfield Network – Wikipedia and Scholarpedia
  • Boltzmann Machine – Wikipedia and Scholarpedia
  • MacKay, Ch 31 (optional intro to Ising model), Ch 42 (Hopfield), and Ch 43 (Boltzmann).
  • Amit – This provides a detailed, physics-based, analysis of the Hopfield model


Why should you care?

The Hopfield paper provides an explicit, decently biologically plausible, mechanism by which a system of (artificial) neurons can store memories. The central goal of the paper is to demonstrate a method for content-addressible memory. Standard computer memory is location-addressible (ie your computer looks to a specific place on your disk). The idea of content-addressible memory is that a partial (perhaps faulty) presentation of the memory should be sufficient to obtain the full memory. I love this quote:

An example asks you to recall ‘An American politician who was very intelligent and whose politician father did not like broccoli’. Many people think of president [George W.] Bush –even though one of the cues contains an error.

MacKay Ch 38, pg 469.

Hopfield Neural Network

So how does Hopfield actually store memories? The idea is that a system can be constructed such that the stable states of the system are the desired memories. I’m going to change notation from the original Hopfield paper so that it matches standard physics notation.

In the actual paper, Hopfield uses threshold neurons but all arguments can be easily extended to a tanh neuron. I will define a neuron’s state as S and neuron i (out of N total) is firing if S_i=1 and not firing if S_i = -1.

How does the system actually store memories? I will call the memories (p of them) that one wants to store as \xi^\mu where \mu=1\ldots p. The idea is to define the interactions between neurons as:

J_{ij} = \frac{1}{N}\sum_{\mu=1}^p \xi_i^\mu \xi_j^\mu

Why are memories actual stable? We can write the activation function of a neuron (also known as the local field in physics terms) as

a_i = \sum_{j\neq i} J_{ij} S_j

Now let’s check if a memory is stable (say memory 1) by examining the activation function of the first neuron.

a_1 = \sum_{j=2}^N J_{1j} \xi_j^1= \frac{1}{N} \sum_{j=2}^N \sum_{\mu=1}^p \xi_1^\mu \xi_j^\mu \xi_j^1

a_1 = \frac{1}{N} \sum_{j=2}^N \xi_1^1 +\frac{1}{N}\sum_{j=2}^N \sum_{\mu=2}^p \xi_1^\mu \xi_j^\mu \xi_j^1

a_1 = \frac{N-1}{N} \xi_1^1 +\frac{1}{N}\sum_{j=2}^N \sum_{\mu=2}^p \xi_1^\mu \xi_j^\mu \xi_j^1

The first term of the activation function is exactly what we need for the first neuron to be stable. The second term is noise, but how big is it? We need a couple more facts to figure it out. First, we can safely assume N\gg1. Second, we will assume that memories are not biased (equal numbers of neurons are on and off). Third, we will assume that we are storing random memories. Then using the central limit theorem, we get that on average, a_1 = \xi_1. (Advanced question: what is the standard deviation of the noise term? What does that imply about stability of memories?)

So this establishes that memories are stable. But how should we actually think about these memories? By examining the energy of the system, one can show that these memories are global minima and have basins of attraction. The energy is defined as

H = -\frac{1}{2} \sum_{ij} S_i J_{ij} S_j

where we have defined the self-interactions to be zero (J_{ii}=0). Using similar arguments as above, one can show that on average each memory has an energy of -\frac{N}{2} and that a flipping a single spin leads to higher energy. (Fundamental question: prove these statements!)

Therefore, using the prescription outlined in the Hopfield paper, one can take a set of memories, \xi, and create a dynamical system with these memories as global minima.


Boltzmann Machine

Hopfield networks are great if you already know the states of the desired memories. But what if you are only given data? How would you actually train a neural network to store the data?

The next journal club will get to actual training, but it is convenient to introduce at this time a Boltzmann Machine (BM). This is an extension of Hopfield networks that can actually learn to store data. In the most general Boltzmann machine, neurons are divided into visible (actually interact with the data) and hidden (only see data through interactions with visible neurons). This leads to an energy function of:

H = -\frac{1}{2} \sum_{ij}v_i J_{ij} v_j- \sum_{ij}v_i w_{ij} h_j-\frac{1}{2} \sum_{ij}h_i K_{ij} h_j

where v are visible neurons and h are hidden neurons (if present, not a requirement). There are three different types of interactions, those amongst visible neurons only (J), those amongst hidden neurons only (K), and those between visible and hidden neurons (w).

As will be explained in the next journal club, the full Boltzmann machine takes a long time to train. So instead, it is common to use a Restricted Boltzmann Machine (RBM) which has no self interactions amongst layers:

H = - \sum_{ij}v_i w_{ij} h_j


Fundamental Questions

  • Why is content-addressible memory considered associative, software and hardware fault tolerant, and distributed? Why is this closer to biology than location-addressible memory?
  • Why is the Hopfield storage prescription Hebbian?
  • Do the calculation to show that the memories are global minima.
  • Hopfield says “In many physical systems, the nature of the emergent collective properties are insensitive to the details inserted in the model.” What are some assumptions that Hopfield relaxes in the simulations?
  • Next time we will see that RBMs are easier to train to BM. Can you see why?

Advanced Questions

  • For the activation function argument, what is the standard deviation of the noise term? What does that imply about stability of memories?
  • What happens to the capacity if the memories are not equally \pm 1 and/or correlated with each other?