Energy Based Neural Networks

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

Unit: Deep Learning

  1. Perceptron
  2. Energy Based Neural Networks
  3. Training Networks
  4. Deep Learning

Papers

Neural networks and physical systems with emergent collective computational abilities. By Hopfield in 1982.

A Learning Algorithm for Boltzmann Machines. By Ackley, Hinton, and Sejnowski in 1985. (Note: only Sections 1 and 2 are covered here; the rest of the paper is covered in the next topic.)

Other Useful References

  • Hopfield Network – Wikipedia and Scholarpedia
  • Boltzmann Machine – Wikipedia and Scholarpedia
  • MacKay, Ch 31 (optional intro to Ising model), Ch 42 (Hopfield), and Ch 43 (Boltzmann).
  • Amit – This provides a detailed, physics-based analysis of the Hopfield model

 

Why should you care?

The Hopfield paper provides an explicit, reasonably biologically plausible mechanism by which a system of (artificial) neurons can store memories. The central goal of the paper is to demonstrate a method for content-addressable memory. Standard computer memory is location-addressable (i.e., your computer looks at a specific place on your disk). The idea of content-addressable memory is that a partial (perhaps faulty) presentation of a memory should be sufficient to retrieve the full memory. I love this quote:

An example asks you to recall ‘An American politician who was very intelligent and whose politician father did not like broccoli’. Many people think of president [George W.] Bush – even though one of the cues contains an error.

MacKay Ch 38, pg 469.

Hopfield Neural Network

So how does Hopfield actually store memories? The idea is that a system can be constructed such that the stable states of the system are the desired memories. I’m going to change notation from the original Hopfield paper so that it matches standard physics notation.

In the actual paper, Hopfield uses threshold neurons, but all arguments extend easily to tanh neurons. I will write a neuron’s state as S: neuron i (out of N total) is firing if S_i = 1 and not firing if S_i = -1.

How does the system actually store memories? I will call the p memories that one wants to store \xi^\mu, where \mu = 1, \ldots, p. The idea is to define the interactions between neurons as:

J_{ij} = \frac{1}{N}\sum_{\mu=1}^p \xi_i^\mu \xi_j^\mu
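As a concrete illustration, here is a minimal NumPy sketch of this storage prescription (the network size, number of memories, and random patterns are illustrative choices, not from the paper):

```python
import numpy as np

# A minimal sketch of the Hopfield storage prescription.
# N neurons storing p random, unbiased +/-1 patterns (illustrative values).
N, p = 100, 5
rng = np.random.default_rng(0)
xi = rng.choice([-1, 1], size=(p, N))  # xi[mu, i] = state of neuron i in memory mu

# Hebbian outer-product rule: J_ij = (1/N) sum_mu xi_i^mu xi_j^mu
J = (xi.T @ xi) / N
np.fill_diagonal(J, 0)  # no self-interactions: J_ii = 0
```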

Why are memories actually stable? We can write the activation function of a neuron (also known as the local field in physics terms) as

a_i = \sum_{j\neq i} J_{ij} S_j

Now let’s check if a memory is stable (say memory 1) by examining the activation function of the first neuron.

a_1 = \sum_{j=2}^N J_{1j} \xi_j^1= \frac{1}{N} \sum_{j=2}^N \sum_{\mu=1}^p \xi_1^\mu \xi_j^\mu \xi_j^1

Separating out the \mu = 1 term and using (\xi_j^1)^2 = 1, we get

a_1 = \frac{1}{N} \sum_{j=2}^N \xi_1^1 +\frac{1}{N}\sum_{j=2}^N \sum_{\mu=2}^p \xi_1^\mu \xi_j^\mu \xi_j^1

a_1 = \frac{N-1}{N} \xi_1^1 +\frac{1}{N}\sum_{j=2}^N \sum_{\mu=2}^p \xi_1^\mu \xi_j^\mu \xi_j^1

The first term of the activation function is exactly what we need for the first neuron to be stable. The second term is noise, but how big is it? We need a couple more facts to figure that out. First, we can safely assume N\gg1. Second, we will assume that memories are unbiased (equal numbers of neurons are on and off). Third, we will assume that we are storing random memories. Then, using the central limit theorem, we get that on average a_1 = \xi_1^1. (Advanced question: what is the standard deviation of the noise term? What does that imply about the stability of memories?)
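A quick numerical check of this argument, continuing the sketch above:

```python
# Local fields when the network sits in memory 1: a_i = sum_j J_ij xi_j^1.
a = J @ xi[0]
print(np.all(np.sign(a) == xi[0]))  # True when the load p/N is small

# Isolate the crosstalk (noise) term; its spread grows roughly like sqrt(p/N).
print(np.std(a - (N - 1) / N * xi[0]))
```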

So this establishes that memories are stable. But how should we actually think about these memories? By examining the energy of the system, one can show that these memories are global minima and have basins of attraction. The energy is defined as

H = -\frac{1}{2} \sum_{ij} S_i J_{ij} S_j

where we have defined the self-interactions to be zero (J_{ii}=0). Using similar arguments as above, one can show that on average each memory has an energy of -\frac{N}{2} and that flipping a single spin leads to higher energy. (Fundamental question: prove these statements!)

Therefore, using the prescription outlined in the Hopfield paper, one can take a set of memories, \xi, and create a dynamical system with these memories as global minima.
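The energy and the attractor dynamics can both be sketched directly, continuing the code above (the corruption level and update schedule are illustrative choices):

```python
def energy(S, J):
    # H = -(1/2) sum_ij S_i J_ij S_j
    return -0.5 * S @ J @ S

def recall(S, J, sweeps=10):
    # Asynchronous threshold dynamics: each neuron aligns with its local field.
    S = S.copy()
    for _ in range(sweeps):
        for i in rng.permutation(len(S)):
            S[i] = 1 if J[i] @ S >= 0 else -1
    return S

probe = xi[0].copy()
flip = rng.choice(N, size=N // 10, replace=False)
probe[flip] *= -1                               # corrupt 10% of memory 1
print(np.array_equal(recall(probe, J), xi[0]))  # typically True at low load
print(energy(xi[0], J))                         # close to -N/2, as argued above
```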

 

Boltzmann Machine

Hopfield networks are great if you already know the states of the desired memories. But what if you are only given data? How would you actually train a neural network to store the data?

The next journal club will get to actual training, but it is convenient to introduce the Boltzmann Machine (BM) now. This is an extension of the Hopfield network that can actually learn to store data. In the most general Boltzmann machine, neurons are divided into visible (interact directly with the data) and hidden (see the data only through interactions with visible neurons). This leads to an energy function of:

H = -\frac{1}{2} \sum_{ij}v_i J_{ij} v_j- \sum_{ij}v_i w_{ij} h_j-\frac{1}{2} \sum_{ij}h_i K_{ij} h_j

where v are visible neurons and h are hidden neurons (if present; they are not a requirement). There are three different types of interactions: those amongst visible neurons only (J), those amongst hidden neurons only (K), and those between visible and hidden neurons (w).

As will be explained in the next journal club, the full Boltzmann machine takes a long time to train. So instead, it is common to use a Restricted Boltzmann Machine (RBM), which has no interactions within a layer (J = K = 0):

H = - \sum_{ij}v_i w_{ij} h_j
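In code, the two energy functions differ only in which couplings survive; a minimal sketch (function names and shapes are illustrative):

```python
import numpy as np

def bm_energy(v, h, J, w, K):
    # Full Boltzmann machine: visible-visible (J), visible-hidden (w),
    # and hidden-hidden (K) couplings.
    return -0.5 * v @ J @ v - v @ w @ h - 0.5 * h @ K @ h

def rbm_energy(v, h, w):
    # Restricted Boltzmann machine: only visible-hidden couplings remain.
    return -v @ w @ h
```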

 

Fundamental Questions

  • Why is content-addressable memory considered associative, software and hardware fault tolerant, and distributed? Why is this closer to biology than location-addressable memory?
  • Why is the Hopfield storage prescription Hebbian?
  • Do the calculation to show that the memories are global minima.
  • Hopfield says “In many physical systems, the nature of the emergent collective properties are insensitive to the details inserted in the model.” What are some assumptions that Hopfield relaxes in the simulations?
  • Next time we will see that RBMs are easier to train than BMs. Can you see why?

Advanced Questions

  • For the activation function argument, what is the standard deviation of the noise term? What does that imply about stability of memories?
  • What happens to the capacity if the memories are not unbiased (equal numbers of \pm 1) and/or are correlated with each other?

Perceptron

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

Unit: Deep Learning

  1. Perceptron
  2. Energy Based Neural Networks
  3. Training Networks
  4. Deep Learning

Paper

The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain by Rosenblatt in 1958


Motivation for Perceptron

I’ll let Rosenblatt introduce the important questions leading to the perceptron himself by quoting his first paragraph:

If we are eventually to understand the capability of higher organisms for perceptual recognition, generalization, recall, and thinking, we must first have answers to three fundamental questions:

  1. How is information about the physical world sensed, or detected, by the biological system?
  2. In what form is information stored, or remembered?
  3. How does information in storage, or in memory, influence recognition and behavior?

The perceptron is a first attempt to answer the second and third questions. In the years leading up to the perceptron, there were two dominant themes of theoretical research on the brain. One focused on the general computational properties of the brain (McCulloch and Pitts 1943) and showed that simple binary neurons could form a computer (i.e., they can compute any possible function). Another theme focused on abstracting away the details of experiments to get at general principles that relate to computation in the brain (Hebb 1949 and his synapse learning rules).

The perceptron opened up a third avenue of theoretical research. The central goal is to devise neuron-inspired algorithms that learn from real data and can be used to make a decision.

What is a Perceptron?

Basics

I find the math in the original perceptron paper pretty confusing. This is partly due to a generational difference in terminology, and partly due to poor explanations in the paper. This is definitely a paper that benefited from the passage of time and later synthesis into a more concise topic. Therefore, I recommend focusing attention on the introduction and conclusion, while below I’ll introduce the modern notation of the perceptron (see MacKay Ch 39/40 for similar details).

[Figure: schematic of a perceptron]

The perceptron consists of a set of inputs, x, that are fed into the perceptron, with each input receiving its own weight, w. The activity of the perceptron is given by a = \sum_i w_i x_i.

Note that the perceptron can have a bias that is independent of the inputs. However, we don’t need to write this out separately: we can instead include an input that is always set to 1, independent of the data, whose weight plays the role of the bias.

This activity is then evaluated by the activation function, f(a), to determine the output, y. There are lots of different possible activation rules (a code sketch follows the list), with some popular ones including

  • Linear:

y(a) = a

  • Rectified Linear:

\begin{aligned} y &= 0 \quad\text{if}\quad a \leq 0 \\ y&= a\quad\text{if}\quad a>0 \end{aligned}

  • Sigmoid:

y(a) = \frac{1}{1+\exp{(-a)}}

  • Threshold:

\begin{aligned} y &= 0 \quad\text{if}\quad a \leq 0 \\ y&= 1\quad\text{if}\quad a>0 \end{aligned}
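Here is a minimal sketch of the activity and the activation functions above, using the bias-as-constant-input trick from earlier (all names are illustrative):

```python
import numpy as np

def activity(w, x):
    # a = w . x, with a constant 1 prepended so w[0] plays the role of the bias
    return w @ np.concatenate(([1.0], x))

linear    = lambda a: a
relu      = lambda a: np.maximum(0.0, a)        # rectified linear
sigmoid   = lambda a: 1.0 / (1.0 + np.exp(-a))
threshold = lambda a: np.where(a > 0, 1.0, 0.0)
```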

 

The end result is that we can take the output of a perceptron and use it to make a decision. The sigmoid and threshold activation functions return an answer between 0 and 1 and hence have a natural interpretation as a probability. From now on, we will work with sigmoid activation functions.

Training

Now that the basics of a perceptron have been introduced, how do we actually train it? In other words, if I gave you a set of data, X, where each entry x_n is N dimensional, how would you evaluate the perceptron’s handling of the data? For now, we will focus on using the perceptron as a binary classifier (we only need to decide between two groups: 0 and 1). Since we are using sigmoid activation functions, we can interpret the output as the probability that the data is in group 1.

The standard way to train a binary classifier is to have a training set that consists of pairs of data, x_n, and correct labels, t_n. Then training proceeds by seeing whether the output label of the perceptron matches the correct label. If everything is correct, perfect! If not, we need to do something with that error.

For a sigmoid activation, the commonly used error function is the cross-entropy:

\mathcal{E} = - \sum_n \left[ t_n \ln y_n + \left(1-t_n \right) \ln \left(1-y_n \right) \right]

The output y is a function of the weights w. We can then take the derivative of the error with respect to the weights, which in the case of the sigmoid activation and cross-entropy error is simply \delta \mathcal{E}_n = -\left(t_n - y_n\right) x_n.
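A sketch of the error and its gradient for a batch of data, where X is an array of inputs with one row per data point (the clipping guard is an implementation detail I have added):

```python
import numpy as np

def cross_entropy(t, y, eps=1e-12):
    # E = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ]
    y = np.clip(y, eps, 1 - eps)  # avoid log(0)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def error_gradient(t, y, X):
    # dE/dw = -sum_n (t_n - y_n) x_n, as derived above
    return -(t - y) @ X
```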

The simplest possible update algorithm is to perform gradient descent on the weights, moving against the gradient: \Delta w_n = -\delta \mathcal{E}_n = \left(t_n - y_n\right) x_n. This is a greedy algorithm (always improves the current error, long-term consequences be damned!). Gradient descent comes in several closely related varieties: online, batch, and mini-batch. Let’s start with the mini-batch. First the data is divided up into small random sets (say 10 data points each). Then we loop through the mini-batches, and for each one we calculate the output and error and update the weights. Online learning is when the mini-batches each contain exactly 1 data point, while batch learning is when the mini-batch is the whole dataset. The current standard is to use a mini-batch of between 10 and 100, which is a compromise between speed (batches are faster) and accuracy (online finds better solutions).

Putting it all together, the training algorithm is as follows (a minimal sketch in code follows the list):

  1. Calculate the activation function and output with respect to a mini-batch of data
  2. Calculate the errors of the output
  3. Update the weights
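A minimal mini-batch training sketch of those three steps, reusing `sigmoid` and `error_gradient` from the earlier sketches (the epoch count, batch size, and learning rate are illustrative assumptions; X is assumed to carry a leading column of 1s for the bias):

```python
def train(X, t, epochs=100, batch_size=10, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for batch in np.array_split(order, max(1, len(X) // batch_size)):
            y = sigmoid(X[batch] @ w)                      # 1. activities and outputs
            grad = error_gradient(t[batch], y, X[batch])   # 2. errors
            w -= lr * grad                                 # 3. gradient-descent update
    return w
```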

 

And now you’ve got all the basics of a perceptron down! On to the more difficult questions…

Fundamental Questions

  • What are similarities and differences between a perceptron and a neuron? Do different activation functions lead to distinct interpretations?
  • What is connectionism? How does this relate to the perceptron? How does this contrast with computers?
  • What class of learning algorithm is the perceptron? Possible answers: unsupervised, supervised, or reinforcement learning
  • What type of functions can a perceptron compute? Compare the standard OR gate vs the exclusive OR (XOR) gate for a perceptron with 2 weights.
  • Does the perceptron return a unique answer? Does the perceptron return the “best” answer (you need to define “best”)? Check out Support Vector Machines for one answer to the “best”.
  • Under what conditions can the perceptron generalize to data it has never seen before? Look into Rosenblatt’s “differentiated environment”.

Additional Questions

  • There are other possibilities for the error functions. Why is the cross-entropy a wise choice for the sigmoid activation?
  • The weight updates can be multiplied by a “learning rate” that controls the size of updates, while I implicitly assumed a learning rate of 1. How would you actually determine a sensible learning rate?
  • The standard learning algorithm puts no constraints on the possible weights. What are some possible problems with unconstrained weights? Can you think of a possible solution? How does this change the generalization properties of a perceptron?
  • Threshold activation functions produce simpler output (only two possible values) than sigmoid activation functions. Despite this simpler output, threshold activation functions are more difficult to train. Can you figure out why?
  • What is the information storage capacity of a perceptron? The exact answer is difficult, but you can get the right order of magnitude in the limit of large number of data points and large number of weights.

 

JC: Computational Neuroscience

This is part of the “journal club for credit” series. Below are the included units and details for each week.

Unit: Deep Learning

  1. Perceptron
  2. Energy Based Neural Networks
  3. Training Networks
  4. Deep Learning

Unit: Diffusion

Organized by Ben Regner

  1. Standard Diffusion
  2. Anomalous Diffusion
  3. Life at Low Reynolds Number

Journal Club For Credit

My favorite course of all time is one that I had the chance to TA. It was based on a Princeton course originally organized by Ned Wingreen and David Botstein (see this paper for their teaching philosophy), and brought to BU by my advisor Pankaj Mehta.

The class was intended for upper-level undergrads and graduate students from a variety of backgrounds including biology, physics, engineering, etc. In order to establish a common vocabulary and shared knowledge base, each week we read and discussed foundational papers in quantitative biology. The papers were a mix of theoretical papers and experimental papers that contributed key concepts (we did not read overly mathematical theory papers or experimentally detailed protocol papers). By the end of the course, everyone had a shared set of fundamental concepts that both theorists and experimentalists could understand.

Since I’m relatively new to computational neuroscience, I’m trying to start up a journal club. However, computational neuroscience is a grab-bag of topics with only the brain as a unifier. Additionally, journal clubs usually cover the latest breaking research, which in computational neuroscience would lead to papers from week to week that have little in common.

So, inspired by the Wingreen and Botstein course, we will be using an approach that I’m calling “journal club for credit”. We are going to try to blend the best ideas from a course based on fundamental papers with a journal club that covers the latest research. We are organizing around units that will last 2-4 weeks. Each unit will be a self-contained introduction to a topic. The first weeks cover the essential papers needed to understand the background, while the final week will discuss current research.

Since I suggested this organization, I’m starting us off with a unit on Deep Learning. My intent is to blog about each unit and topic. In order to encourage others to actually read the papers, my blog posts will be deliberately vague. My plan is to provide the needed background to get you interested (the WHY you should care) and start you towards understanding (define WHAT the topic is), but avoid explaining the topic in enough detail that you do not feel compelled to read the papers (I want you to discover the detailed HOW and WHY of the topic on your own). I will outline a set of fundamental questions that everyone should understand, as well as additional questions that go further into supplementary points or advanced topics.

I’m not exactly sure how many units will get covered (highly dependent on the rest of the journal club!), but my dream is that by the end of my postdoc, we will have covered enough topics in computational neuroscience to have a “course” in a philosophy similar to that of Wingreen and Botstein.

For the details of the papers, check out Journal Club for Computational Neuroscience.