February | 2016 | N 2 Infinity and Beyond

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

Unit: Deep Learning

Papers

A Learning Algorithm for Boltzmann Machines. By Ackley, Hinton, and Sejnowski in 1985.

Learning representations by back-propagating errors. By Rumelhart, Hinton, and Williams in 1986.

Other Useful References

Boltzmann Machine (BM) – Wikipedia and Scholarpedia
MacKay Ch 43 (Boltzmann).
Hinton guide to RBMs
Backpropagation – Wikipedia
Multilayer perceptron (MLP) – Wikipedia
Michael Nielsen Chapter 2

How do you actually train neural networks?

Hopefully the past few posts have piqued your interest in neural networks. Maybe you even want to unleash a neural network on some data. How do you actually train the neural network?

I’m going to keep this brief for two reasons. First, detailed derivations can already be found elsewhere (for Boltzmann machines, see the Appendix of the original paper as well as MacKay; for backpropagation, see Nielsen). Second, I firmly believe that algorithms are best learned by actually stepping through the updates, so any explanation I attempt will not be sufficient for you to truly learn the algorithm. I will provide some general context as well as some questions you should be able to answer, but please go do it yourself!

There are three general classes of machine learning based on the information received:

Unsupervised – data only. Boltzmann machine.
Supervised – data with labels. MLP with backpropagation.
Reinforcement – data, actions, and scores associated with each action. Deserves its own detailed post, but check out papers by DeepMind for cool applications.

The Boltzmann machine learning rule is an example of maximum likelihood learning. In practice, the original learning rule is too computationally expensive, so a modified algorithm called contrastive divergence (or a variant such as persistent contrastive divergence) is used instead. See the Hinton guide to RBMs for more details.
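To make the idea concrete, here is a minimal sketch of a single contrastive divergence step (CD-1) for a restricted Boltzmann machine. It is written in plain Python so it runs without dependencies, and it makes several simplifying assumptions that are mine rather than the papers': binary 0/1 units, no bias terms, one data vector per update, and hidden probabilities (rather than sampled states) in the update statistics. The function names, toy data, and learning rate are all hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sample_hidden(v, w, rng):
    """Return (probabilities, sampled 0/1 states) of the hidden units given v."""
    probs = [sigmoid(sum(v[i] * w[i][j] for i in range(len(v))))
             for j in range(len(w[0]))]
    return probs, [1 if rng.random() < p else 0 for p in probs]


def sample_visible(h, w, rng):
    """Return (probabilities, sampled 0/1 states) of the visible units given h."""
    probs = [sigmoid(sum(w[i][j] * h[j] for j in range(len(h))))
             for i in range(len(w))]
    return probs, [1 if rng.random() < p else 0 for p in probs]


def cd1_update(v0, w, lr, rng):
    """One CD-1 step on a single data vector v0; returns the updated weights."""
    h0_probs, h0 = sample_hidden(v0, w, rng)   # data-driven (clamped) phase
    _, v1 = sample_visible(h0, w, rng)         # one-step reconstruction
    h1_probs, _ = sample_hidden(v1, w, rng)    # model-driven phase
    return [[w[i][j] + lr * (v0[i] * h0_probs[j] - v1[i] * h1_probs[j])
             for j in range(len(w[0]))]
            for i in range(len(w))]


# Toy usage: 4 visible units, 2 hidden units, two made-up binary patterns.
rng = random.Random(0)
data = [[1, 1, 0, 0], [0, 0, 1, 1]]
weights = [[rng.gauss(0.0, 0.01) for _ in range(2)] for _ in range(4)]
for _ in range(100):
    for v in data:
        weights = cd1_update(v, weights, lr=0.1, rng=rng)
```

The update is the difference between a correlation measured with the data clamped and the same correlation measured after the model has taken one sampling step on its own. Stepping through `cd1_update` by hand for one pattern is a good warm-up before reading the Hinton guide.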

Backpropagation is a computationally efficient way of applying the chain rule from calculus. Besides the above paper, which popularized it, there is a long history of this algorithm being discovered and rediscovered.
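As a rough illustration, the sketch below computes the gradient of a tiny network in two ways: with backpropagation and with central finite differences. Everything about the setup is an assumption chosen for brevity: a 2-3-1 network of sigmoid neurons, a quadratic cost, a single training example, and made-up names such as `backprop_grad` and `finite_difference_grad`.

```python
import math
import random

N_IN, N_HID = 2, 3  # hypothetical toy sizes: 2 inputs, 3 hidden units, 1 output


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def unpack(theta):
    """Split the flat parameter list into (W1, b1, W2, b2)."""
    k = N_HID * N_IN
    W1 = [theta[j * N_IN:(j + 1) * N_IN] for j in range(N_HID)]
    b1 = theta[k:k + N_HID]
    W2 = theta[k + N_HID:k + 2 * N_HID]
    b2 = theta[k + 2 * N_HID]
    return W1, b1, W2, b2


def forward(theta, x):
    """Return the hidden activations and the scalar output."""
    W1, b1, W2, b2 = unpack(theta)
    a1 = [sigmoid(sum(W1[j][i] * x[i] for i in range(N_IN)) + b1[j])
          for j in range(N_HID)]
    a2 = sigmoid(sum(W2[j] * a1[j] for j in range(N_HID)) + b2)
    return a1, a2


def cost(theta, x, y):
    """Quadratic cost for a single example."""
    return 0.5 * (forward(theta, x)[1] - y) ** 2


def backprop_grad(theta, x, y):
    """Gradient of the cost from one forward pass and one backward pass."""
    _, _, W2, _ = unpack(theta)
    a1, a2 = forward(theta, x)
    delta2 = (a2 - y) * a2 * (1.0 - a2)               # output-layer error
    delta1 = [W2[j] * delta2 * a1[j] * (1.0 - a1[j])  # error pushed back one layer
              for j in range(N_HID)]
    grad_W1 = [delta1[j] * x[i] for j in range(N_HID) for i in range(N_IN)]
    grad_W2 = [delta2 * a1[j] for j in range(N_HID)]
    return grad_W1 + delta1 + grad_W2 + [delta2]


def finite_difference_grad(theta, x, y, eps=1e-6):
    """Central differences: two cost evaluations for every single parameter."""
    grad = []
    for k in range(len(theta)):
        up, down = theta[:], theta[:]
        up[k] += eps
        down[k] -= eps
        grad.append((cost(up, x, y) - cost(down, x, y)) / (2 * eps))
    return grad


rng = random.Random(0)
theta = [rng.uniform(-1, 1) for _ in range(N_HID * N_IN + 2 * N_HID + 1)]
x, y = [0.5, -0.3], 1.0
g_bp = backprop_grad(theta, x, y)
g_fd = finite_difference_grad(theta, x, y)
max_gap = max(abs(a - b) for a, b in zip(g_bp, g_fd))
print(f"largest disagreement between the two gradients: {max_gap:.2e}")
```

Counting how many times each routine has to run the network forward is a useful exercise before tackling the questions below.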

Fundamental Questions

What is maximum likelihood?
Why can one interpret the learning terms in the BM algorithm as “waking” and “sleeping”?
Why are BM hidden layers so important?
Why are restricted Boltzmann machines, RBMs, much easier to train?
Why is backpropagation more computationally efficient than the finite difference method?
Derive the 4 backpropagation equations!

Advanced Questions

Follow Hinton’s RBM guide and implement your own Boltzmann machine
Use Nielsen’s code to train your own MLP

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

Unit: Deep Learning

Perceptron
Energy Based Neural Networks
Training Networks
Deep Learning

Papers

Neural networks and physical systems with emergent collective computational abilities. By Hopfield in 1982.

A Learning Algorithm for Boltzmann Machines. By Ackley, Hinton, and Sejnowski in 1985. (Note: Only sections 1 and 2 are covered here; the rest of the paper is covered in the next topic.)

Other Useful References

Hopfield Network – Wikipedia and Scholarpedia
Boltzmann Machine – Wikipedia and Scholarpedia
MacKay, Ch 31 (optional intro to Ising model), Ch 42 (Hopfield), and Ch 43 (Boltzmann).
Amit – This provides a detailed, physics-based, analysis of the Hopfield model

Why should you care?

The Hopfield paper provides an explicit, decently biologically plausible mechanism by which a system of (artificial) neurons can store memories. The central goal of the paper is to demonstrate a method for content-addressable memory. Standard computer memory is location-addressable (i.e., your computer looks to a specific place on your disk). The idea of content-addressable memory is that a partial (perhaps faulty) presentation of the memory should be sufficient to obtain the full memory. I love this quote:

An example asks you to recall ‘An American politician who was very intelligent and whose politician father did not like broccoli’. Many people think of president [George W.] Bush –even though one of the cues contains an error.

MacKay Ch 38, pg 469.

Hopfield Neural Network

So how does Hopfield actually store memories? The idea is that a system can be constructed such that the stable states of the system are the desired memories. I’m going to change notation from the original Hopfield paper so that it matches standard physics notation.

In the actual paper, Hopfield uses threshold neurons, but all arguments can be easily extended to a $\tanh$ neuron. I will denote the state of neuron $i$ (out of $N$ total) by $S_i$: the neuron is firing if $S_i=1$ and not firing if $S_i=-1$.

How does the system actually store memories? I will call the $p$ memories that one wants to store $\xi^\mu$, where $\mu=1\ldots p$. The idea is to define the interactions between neurons as:

$J_{ij} = \frac{1}{N}\sum_{\mu=1}^p \xi_i^\mu \xi_j^\mu$
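A minimal sketch of this storage prescription in plain Python follows. The function names and the toy sizes are hypothetical, the memories are drawn at random with entries $\pm 1$, and the diagonal is set to zero so that a neuron does not couple to itself.

```python
import random


def random_memories(p, n, rng):
    """Draw p unbiased random patterns of n neurons with entries +1 or -1."""
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(p)]


def hebbian_couplings(memories):
    """J_ij = (1/N) * sum_mu xi_i^mu * xi_j^mu, with J_ii set to zero."""
    n = len(memories[0])
    return [[0.0 if i == j else sum(xi[i] * xi[j] for xi in memories) / n
             for j in range(n)]
            for i in range(n)]


rng = random.Random(0)
memories = random_memories(p=3, n=8, rng=rng)
J = hebbian_couplings(memories)
```

Note that each memory contributes to every coupling, and that the resulting matrix is symmetric by construction.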

Why are memories actually stable? We can write the activation function of a neuron (also known as the local field in physics terms) as

$a_i = \sum_{j\neq i} J_{ij} S_j$

Now let’s check if a memory is stable (say memory $1$) by setting the network state to that memory, $S_j=\xi_j^1$, and examining the activation function of the first neuron.

$a_1 = \sum_{j=2}^N J_{1j} \xi_j^1= \frac{1}{N} \sum_{j=2}^N \sum_{\mu=1}^p \xi_1^\mu \xi_j^\mu \xi_j^1$

$a_1 = \frac{1}{N} \sum_{j=2}^N \xi_1^1 +\frac{1}{N}\sum_{j=2}^N \sum_{\mu=2}^p \xi_1^\mu \xi_j^\mu \xi_j^1$

$a_1 = \frac{N-1}{N} \xi_1^1 +\frac{1}{N}\sum_{j=2}^N \sum_{\mu=2}^p \xi_1^\mu \xi_j^\mu \xi_j^1$

The first term of the activation function is exactly what we need for the first neuron to be stable. The second term is noise, but how big is it? We need a few more assumptions to figure it out. First, we can safely assume $N\gg1$. Second, we will assume that memories are not biased (equal numbers of neurons are on and off). Third, we will assume that we are storing random memories. The noise term is then a sum of many independent $\pm 1$ terms with zero mean, so by the central limit theorem it averages to zero and we get that, on average, $a_1 = \xi_1^1$. (Advanced question: what is the standard deviation of the noise term? What does that imply about stability of memories?)
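The split into a signal term and a noise term can also be checked numerically. The sketch below evaluates both terms directly from the random memories, without ever building $J$. The sizes ($N=200$, $p=5$), the seed, and the function names are hypothetical choices, and index 0 in the code plays the role of memory $1$ in the text.

```python
import random
import statistics


def crosstalk(memories, mu, i):
    """Noise term in the local field of neuron i when the state is memory mu."""
    n = len(memories[0])
    target = memories[mu]
    total = 0
    for nu, xi in enumerate(memories):
        if nu == mu:
            continue  # the nu == mu contribution is the signal term
        total += xi[i] * sum(xi[j] * target[j] for j in range(n) if j != i)
    return total / n


def local_field(memories, mu, i):
    """Signal term (N-1)/N * xi_i^mu plus the crosstalk noise."""
    n = len(memories[0])
    return (n - 1) / n * memories[mu][i] + crosstalk(memories, mu, i)


rng = random.Random(0)
n, p = 200, 5
memories = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(p)]

fields = [local_field(memories, 0, i) for i in range(n)]
noise = [crosstalk(memories, 0, i) for i in range(n)]

# A neuron is stable when its local field has the same sign as its stored value.
stable_fraction = sum(a * s > 0 for a, s in zip(fields, memories[0])) / n
noise_std = statistics.pstdev(noise)
print(f"stable neurons: {stable_fraction:.3f}, empirical noise std: {noise_std:.3f}")
```

Rerunning with different values of `p` and `n` and watching how `noise_std` changes is one way to get a feel for the advanced question before working it out analytically.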

So this establishes that memories are stable. But how should we actually think about these memories? By examining the energy of the system, one can show that these memories are global minima and have basins of attraction. The energy is defined as

$H = -\frac{1}{2} \sum_{ij} S_i J_{ij} S_j$

where we have defined the self-interactions to be zero ($J_{ii}=0$). Using similar arguments as above, one can show that on average each memory has an energy of approximately $-\frac{N}{2}$ and that flipping a single spin leads to higher energy. (Fundamental question: prove these statements!)
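Before doing the proof, it can help to see both statements hold numerically. The following sketch, with hypothetical sizes ($N=200$, $p=5$) and a made-up helper `energy`, builds the couplings from random memories, evaluates $H$ for one stored memory, and then flips a single spin.

```python
import random


def energy(J, s):
    """H = -(1/2) * sum_ij S_i J_ij S_j for a state s with entries +1 or -1."""
    n = len(s)
    return -0.5 * sum(s[i] * J[i][j] * s[j] for i in range(n) for j in range(n))


rng = random.Random(1)
n, p = 200, 5
memories = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(p)]

# Hebbian couplings with zero self-interaction (J_ii = 0).
J = [[0.0 if i == j else sum(xi[i] * xi[j] for xi in memories) / n
      for j in range(n)]
     for i in range(n)]

stored = memories[0]
e_memory = energy(J, stored)

flipped = list(stored)
flipped[0] = -flipped[0]          # flip one spin of the stored memory
e_flipped = energy(J, flipped)

print(f"-N/2 = {-n / 2}, memory energy = {e_memory:.2f}, "
      f"after one flip = {e_flipped:.2f}")
```

This is only a spot check on one random draw, not a substitute for the calculation.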

Therefore, using the prescription outlined in the Hopfield paper, one can take a set of memories, $\xi$ , and create a dynamical system with these memories as global minima.
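To see the content-addressable behavior in action, the sketch below stores three random memories, corrupts one of them, and lets the network relax. The update rule is an assumption on my part, since it is not spelled out above: neurons are visited one at a time in random order and set to the sign of their local field, $S_i \leftarrow \mathrm{sign}(a_i)$, with ties leaving the neuron unchanged. Sizes, seed, and function names are hypothetical.

```python
import random


def hebbian_couplings(memories):
    """J_ij = (1/N) * sum_mu xi_i^mu * xi_j^mu, with J_ii set to zero."""
    n = len(memories[0])
    return [[0.0 if i == j else sum(xi[i] * xi[j] for xi in memories) / n
             for j in range(n)]
            for i in range(n)]


def recall(J, cue, rng, max_sweeps=5):
    """Asynchronous threshold updates until no neuron changes (or sweeps run out)."""
    s = list(cue)
    n = len(s)
    for _ in range(max_sweeps):
        order = list(range(n))
        rng.shuffle(order)
        changed = False
        for i in order:
            a = sum(J[i][j] * s[j] for j in range(n))   # local field of neuron i
            new = s[i] if a == 0 else (1 if a > 0 else -1)
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            break
    return s


rng = random.Random(0)
n, p = 100, 3
memories = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(p)]
J = hebbian_couplings(memories)

# Corrupt 10 of the 100 entries of the first memory to make a faulty cue.
cue = list(memories[0])
for i in rng.sample(range(n), 10):
    cue[i] = -cue[i]

recovered = recall(J, cue, rng)
cue_overlap = sum(a * b for a, b in zip(cue, memories[0])) / n
overlap = sum(a * b for a, b in zip(recovered, memories[0])) / n
print(f"overlap with stored memory: cue = {cue_overlap:.2f}, recovered = {overlap:.2f}")
```

An overlap of 1 means the stored memory was recovered exactly. Increasing the number of corrupted entries, or the number of stored memories, shows where retrieval starts to fail.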

Boltzmann Machine

Hopfield networks are great if you already know the states of the desired memories. But what if you are only given data? How would you actually train a neural network to store the data?

The next journal club will get to actual training, but it is convenient to introduce at this time a Boltzmann Machine (BM). This is an extension of Hopfield networks that can actually learn to store data. In the most general Boltzmann machine, neurons are divided into visible (actually interact with the data) and hidden (only see data through interactions with visible neurons). This leads to an energy function of:

$H = -\frac{1}{2} \sum_{ij}v_i J_{ij} v_j- \sum_{ij}v_i w_{ij} h_j-\frac{1}{2} \sum_{ij}h_i K_{ij} h_j$

where $v$ are visible neurons and $h$ are hidden neurons (if present, not a requirement). There are three different types of interactions, those amongst visible neurons only ( $J$ ), those amongst hidden neurons only ( $K$ ), and those between visible and hidden neurons ( $w$ ).

As will be explained in the next journal club, the full Boltzmann machine takes a long time to train. So instead, it is common to use a Restricted Boltzmann Machine (RBM), which has no interactions within a layer (that is, $J=0$ and $K=0$):

$H = - \sum_{ij}v_i w_{ij} h_j$
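The two energy functions can be written down directly in code. This is only a sketch: the function names are made up, the couplings are plain nested lists, and the small numbers in the usage example are arbitrary.

```python
def boltzmann_energy(v, h, J, w, K):
    """General BM energy with visible-visible (J), visible-hidden (w),
    and hidden-hidden (K) interactions."""
    nv, nh = len(v), len(h)
    visible = -0.5 * sum(v[i] * J[i][j] * v[j] for i in range(nv) for j in range(nv))
    cross = -sum(v[i] * w[i][j] * h[j] for i in range(nv) for j in range(nh))
    hidden = -0.5 * sum(h[i] * K[i][j] * h[j] for i in range(nh) for j in range(nh))
    return visible + cross + hidden


def rbm_energy(v, h, w):
    """RBM energy: only the visible-hidden term survives."""
    return -sum(v[i] * w[i][j] * h[j]
                for i in range(len(v)) for j in range(len(h)))


# Toy usage: 2 visible neurons, 1 hidden neuron, arbitrary couplings.
v, h = [1, -1], [1]
w = [[0.5], [0.25]]
J_zero = [[0.0, 0.0], [0.0, 0.0]]
K_zero = [[0.0]]
print(boltzmann_energy(v, h, J_zero, w, K_zero), rbm_energy(v, h, w))
```

With $J$ and $K$ set to zero the general expression reduces to the RBM one, so the two printed values agree.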

Fundamental Questions

Why is content-addressable memory considered associative, software and hardware fault tolerant, and distributed? Why is this closer to biology than location-addressable memory?
Why is the Hopfield storage prescription Hebbian?
Do the calculation to show that the memories are global minima.
Hopfield says “In many physical systems, the nature of the emergent collective properties are insensitive to the details inserted in the model.” What are some assumptions that Hopfield relaxes in the simulations?
Next time we will see that RBMs are easier to train than BMs. Can you see why?

Advanced Questions

For the activation function argument, what is the standard deviation of the noise term? What does that imply about stability of memories?
What happens to the capacity if the memories are biased (unequal numbers of $+1$ and $-1$) and/or correlated with each other?


A physics PhD's adventures in machine learning, biophysics, computational neuroscience, and beyond
