# Deep Learning Seminar Course

This semester Terry Sejnowski is teaching a graduate seminar course that is focused on Deep Learning. The course meets weekly for two hours to discuss papers. Here I’ll just outline the course and in later posts I’ll add some thoughts on each specific week.

# Life at Low Reynolds Number

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

## Unit: Diffusion

Organized by Ben Regner

1. Standard Diffusion
2. Anomalous Diffusion
3. Life at Low Reynolds Number

## Papers

Life at Low Reynolds Number. By Purcell in 1977.

## Introduction

This is one of my favorite papers. The presentation style is extremely fun and readable without sacrificing any scientific integrity. I think it serves as a great introduction to fluid mechanics at low Reynolds number. I don’t have too many comments since I think the paper explains it best, but I will provide a few supplementary details for a more in-depth exploration of the ideas from the paper.

And just to get you excited about fluid dynamics, I present an example of laminar flow:

## Basics of Fluid Mechanics

The fundamental equation of fluid mechanics is Navier-Stokes. The relevant version for this paper is the incompressible flow equations with pressure but no other external fields:

$\frac{\partial \vec{u}}{\partial t}+ \vec{u}\cdot\nabla\vec{u} +\frac{1}{\rho}\nabla p -\nu\nabla^2\vec{u}=0$

where $\vec{u}$ is the velocity vector, $\rho$ is density, $p$ is pressure, and $\nu$ is the kinematic viscosity; the gradients are taken with respect to position $\vec{x}$. This equation can be made non-dimensional by introducing a characteristic velocity $U$, a characteristic length $L$, and the dynamic viscosity $\eta=\rho\nu$. This gives the following dimensionless variables:

$u^* = \frac{u}{U}$

$x^* = \frac{x}{L}$

$p^* = \frac{pL}{\eta U}$

$t^* = \frac{tU}{L}$

Substituting in these dimensionless variables and doing some algebra, one arrives at the simplified equation:

$R\frac{\partial \vec{u^*}}{\partial t^*}+ R\vec{u^*}\cdot\nabla^*\vec{u^*} +\nabla^* p^*-(\nabla^*)^2\vec{u^*}=0$

with only one dimensionless constant, the Reynolds number, defined as:

$R = \frac{UL\rho}{\eta} = \frac{UL}{\nu}$

As explained in the paper, the Reynolds number is one of the essential constants describing a flow. High Reynolds number leads to turbulent (chaotic) flow, while low Reynolds number leads to laminar (smooth) flow. For extremely small Reynolds number, Navier-Stokes simplifies to:

$\nabla^* p^* = (\nabla^*)^2\vec{u^*}$

which is also just called the Stokes equation.
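As a rough numerical illustration of the Reynolds number $R = UL/\nu$ defined above, here is a minimal Python sketch. The sizes, speeds, and the viscosity of water are order-of-magnitude assumptions chosen for illustration, not values taken from the paper.

```python
def reynolds_number(speed, length, kinematic_viscosity):
    """Return R = U * L / nu, with all quantities in SI units."""
    return speed * length / kinematic_viscosity


# Approximate kinematic viscosity of water in m^2/s (assumed value).
NU_WATER = 1.0e-6

# Assumed order-of-magnitude sizes and speeds, for illustration only.
bacterium = reynolds_number(speed=30e-6, length=1e-6, kinematic_viscosity=NU_WATER)
swimmer = reynolds_number(speed=1.0, length=1.0, kinematic_viscosity=NU_WATER)

print(f"bacterium:     R ~ {bacterium:.0e}")
print(f"human swimmer: R ~ {swimmer:.0e}")
```

With these assumed numbers the two Reynolds numbers differ by roughly ten orders of magnitude, which is why intuition from everyday swimming does not transfer to bacteria.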

At the end of the paper, Purcell describes another dimensionless number which he calls $S$ and in a footnote identifies as the Sherwood number. However, Ben Regner pointed out that Purcell’s $S$ would actually be called the Péclet number today.

## Basics of E. coli Chemotaxis

Chemotaxis and cellular sensing really deserve their own series of papers. But in the meantime, I recommend the following resources

## Video Proof of Purcell’s Scallop Theorem

Reversible kicking does fine in water (high Reynolds number)…

… but the same motion has issues in corn syrup (low Reynolds number).

Here is a solution similar to what E. coli and other bacteria employ.

## Fundamental Questions

• Purcell does an amazing job, so I have nothing to add.

• What are some other strategies that are employed in biology to get around the issue of mobility at low Reynolds number? Hint: I already linked to a video of one strategy. There are at least two other strategies, but to find these you will need to think about the assumptions leading to the basic Navier-Stokes equations.

# Anomalous Diffusion

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

## Unit: Diffusion

Organized by Ben Regner

1. Standard Diffusion
2. Anomalous Diffusion
3. Life at Low Reynolds Number

## What is anomalous diffusion?

If one measures the mean square displacement vs time, it can be parameterized as

$\langle x^2 \rangle \propto t^\alpha$

where $\alpha=1$ is Brownian (standard diffusion), $0<\alpha<1$ is subdiffusive, $1<\alpha<2$ is superdiffusive, and ballistic is $\alpha=2$. So the technical definition of anomalous diffusion is $0<\alpha<1$ or $1<\alpha<2$.
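In practice the exponent $\alpha$ is usually read off as the slope of the mean square displacement on a log-log plot. The sketch below assumes NumPy and uses synthetic, noise-free power laws as input; the function names `estimate_alpha` and `classify` and the tolerance are hypothetical choices for illustration.

```python
import numpy as np


def estimate_alpha(times, msd):
    """Fit log(msd) = alpha * log(t) + const and return the slope alpha."""
    slope, _intercept = np.polyfit(np.log(times), np.log(msd), 1)
    return slope


def classify(alpha, tol=0.05):
    """Map a fitted exponent onto the regimes named in the text."""
    if abs(alpha - 1.0) <= tol:
        return "Brownian"
    if abs(alpha - 2.0) <= tol:
        return "ballistic"
    if 0.0 < alpha < 1.0:
        return "subdiffusive"
    if 1.0 < alpha < 2.0:
        return "superdiffusive"
    return "outside the range discussed here"


# Synthetic mean square displacements with a known exponent.
times = np.linspace(1.0, 100.0, 50)
for true_alpha in (0.5, 1.0, 1.5, 2.0):
    alpha_hat = estimate_alpha(times, 3.0 * times**true_alpha)
    print(true_alpha, round(alpha_hat, 3), classify(alpha_hat))
```

With real data the fit is noisy, so the tolerance around $\alpha=1$ would have to be chosen from the uncertainty of the measurement rather than fixed in advance.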

## How to describe anomalous diffusion?

Currently, there is no “best” or “simple” description of anomalous diffusion in the general case. However, continuous-time random walks (CTRW) are one paradigm that I find helpful as a conceptual and simulation framework.

In the simplest discrete random walk (DRW), a particle makes a jump of fixed size at every time step; the only question is the direction. The next generalization still has the particle jump at every time step, but now it draws the jump size from a distribution.

The idea of a CTRW is that there is now a distribution both of the waiting time between jumps, and the jump size. If the waiting time follows the exponential distribution and the jump size follows the normal distribution, one ends up with the Wiener process aka standard diffusion and Brownian motion.
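To make the CTRW picture concrete, here is a minimal simulation sketch, assuming NumPy. It draws exponential waiting times and normally distributed jumps, records each walker's position at a few observation times, and fits the exponent of the mean square displacement. The function name `simulate_ctrw`, the number of walkers, and the observation window are arbitrary illustrative choices.

```python
import numpy as np


def simulate_ctrw(n_walkers, t_obs, rng, mean_wait=1.0, jump_std=1.0):
    """Positions of independent CTRW walkers at the observation times t_obs."""
    t_obs = np.asarray(t_obs, dtype=float)
    # Draw enough jumps that every walker passes t_obs.max() with very high probability.
    n_jumps = int(3 * t_obs.max() / mean_wait) + 50
    waits = rng.exponential(mean_wait, size=(n_walkers, n_jumps))
    jumps = rng.normal(0.0, jump_std, size=(n_walkers, n_jumps))
    jump_times = np.cumsum(waits, axis=1)
    # Prepend 0 so that column k holds the position after k jumps.
    path = np.concatenate(
        [np.zeros((n_walkers, 1)), np.cumsum(jumps, axis=1)], axis=1
    )
    positions = np.empty((n_walkers, t_obs.size))
    for w in range(n_walkers):
        # Number of jumps completed by each observation time.
        n_done = np.searchsorted(jump_times[w], t_obs, side="right")
        positions[w] = path[w, n_done]
    return positions


rng = np.random.default_rng(0)
t_obs = np.linspace(10.0, 100.0, 10)
positions = simulate_ctrw(2000, t_obs, rng)
msd = np.mean(positions**2, axis=0)
alpha, _ = np.polyfit(np.log(t_obs), np.log(msd), 1)
print(f"fitted alpha = {alpha:.2f}")
```

The fitted exponent should come out close to 1 for this choice of distributions. Swapping `rng.exponential` or `rng.normal` for a heavier-tailed distribution is one way to experiment with the anomalous cases.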

## What causes anomalous diffusion?

Just as a reminder, there are three conditions that need to be satisfied for Brownian motion (standard diffusion):
1. Increments are independent
2. Increments are wide-sense stationary: the first moment and autocovariance don’t depend on time (this is a weaker condition than complete stationarity)
3. Zero mean

The third condition is often ignored by examining the motion relative to the mean displacement (i.e. the actual displacement is not Brownian, but fluctuations in the displacement could be Brownian). So really, the first two are the more important conditions. Therefore, anomalous diffusion arises from non-independent increments and/or increments whose mean and/or standard deviation change in time.

The CTRW allows one to think more precisely about different mechanisms that can give rise to anomalous diffusion. There is not one single way to get sub- or superdiffusion in a CTRW, since there are two, potentially dependent, distributions (waiting time and jump size). However, there are a few common situations that seem to arise often in biology and elsewhere (see Random walk models in biology, Box 2 for original idea). Subdiffusion in biology is often caused by waiting time distributions with longer tails than the exponential, or by molecular crowding, while superdiffusion occurs when jump sizes are drawn from heavy-tailed distributions, as in a Lévy flight (for example, alpha-stable distributions).

## Examples

For further exploration of anomalous diffusion in biology, I recommend these papers

• This is an interesting paper that introduces a renormalization group approach to classifying diffusion processes

# Standard Diffusion

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

## Unit: Diffusion

Organized by Ben Regner

1. Standard Diffusion
2. Anomalous Diffusion
3. Life at Low Reynolds Number

## Papers

Brownian Motion. By Einstein in 1905.

Brownian Motion. By Langevin in 1908.

An Introduction to Fractional Diffusion. By Henry, Langlands, and Straka in 2010.

## What is diffusion?

Diffusion is the general process by which small particles move from regions of high concentration to low concentration. Check out the link to the Wikipedia articles above for some cool videos and animations. Diffusion is ubiquitous and plays an essential role in biology. For example, oxygen diffuses from your lungs to unoxygenated blood, which then delivers it to the rest of your body where it diffuses out of your blood and into your cells. Additionally, signals between neurons are transmitted by several different diffusing molecules.

Mathematically, standard diffusion is described by two fundamental equations.

Fick’s First Law: Particles move from high-to-low concentration.

$j=-D\frac{\partial n}{\partial x}$

where $n$ is the concentration (number density) of particles, $x$ is position, $D$ is the diffusion constant, and $j$ is the flux of particles.

Fick’s Second Law: Conservation of particles combined with Fick’s First Law leads to the diffusion equation.

If particles cannot be created or destroyed, they follow a conservation law:

$\frac{\partial n}{\partial t} = -\frac{\partial j}{\partial x}$

Combining the conservation law with Fick’s First Law gives us the diffusion equation:

$\frac{\partial n}{\partial t} = D \frac{\partial^2 n}{\partial x^2}$
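The two-step structure above (Fick's First Law for the flux, then the conservation law) maps directly onto a simple explicit finite-difference scheme. The following is a minimal sketch, assuming NumPy; the grid spacing, time step, and diffusion constant are arbitrary illustrative values, and the explicit scheme is just one simple way to integrate the equation.

```python
import numpy as np

D = 1.0                  # diffusion constant (arbitrary units, assumed)
dx = 0.1                 # grid spacing
dt = 0.4 * dx**2 / D     # the explicit scheme needs D*dt/dx**2 <= 1/2 to be stable
x = np.arange(-10.0, 10.0 + dx / 2, dx)

n = np.zeros_like(x)
n[x.size // 2] = 1.0 / dx   # all particles start at x = 0 (unit total)

steps = 500
for _ in range(steps):
    flux = -D * np.diff(n) / dx           # Fick's first law on the cell edges
    n[1:-1] -= dt * np.diff(flux) / dx    # conservation law for the interior cells

t = steps * dt
total = np.sum(n) * dx
variance = np.sum(x**2 * n) * dx / total
print(f"total = {total:.4f}, variance = {variance:.3f}, 2Dt = {2 * D * t:.3f}")
```

The total number of particles stays essentially constant (the profile never reaches the edges of the grid in this run), and the variance of the profile grows as $2Dt$, which is the linear-in-time spreading characteristic of standard diffusion in one dimension.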

## Brownian Motion

In 1827 Robert Brown looked at pollen in water under a microscope (see the Wikipedia page for simulations of the observations). Much to his surprise, the pollen acted as if it were alive! Brown verified that life was not required: any small, inorganic particle followed similar motion. In 1905, during his miracle year, Einstein wrote a paper giving an atomistic description of Brownian motion. In 1908 Langevin used a different approach (that is “infinitely simpler” in his words) to describe Brownian motion. The general explanations are outlined below.

1. Einstein’s Derivation

Einstein’s goal was a probability based description of Brownian motion that connects to Fick’s law. Einstein makes several assumptions about the particles, including

In the end, Einstein finds a solution that is Gaussian, implying that the mean square displacement is linear in time for Brownian motion:

$\langle x^2 \rangle \propto t$

More generally, the mean square displacement could depend on some power of time, usually parameterized as

$\langle x^2 \rangle \propto t^\alpha$

where $\alpha=1$ is Brownian, $0<\alpha<1$ is subdiffusive, $1<\alpha<2$ is superdiffusive, and ballistic is $\alpha=2$. Note, one can get up to $\alpha=3$ in certain turbulent regimes.

2. Langevin’s Derivation
The Langevin approach is to start with a particle-based description. The first assumption is the equipartition theorem, which fixes the average kinetic energy (KE):
$\langle KE \rangle = \frac{1}{2} m \left\langle \left(\frac{dx}{dt}\right)^2 \right\rangle = \frac{k_B T}{2}$

Then, one looks at the actual forces on the particle:

mass × acceleration = Stokes drag + stochastic force
$m \frac{d^2 x}{dt^2} = -6 \pi \eta r \frac{dx}{dt} + X$
where $X$ is a stochastic force. It is assumed to have zero mean, unit variance, and no time correlations, aka white noise.

After multiplying both sides of the equation by $x$, doing some algebra, and then averaging, one arrives at the same result as Einstein (after ignoring a short-time transient).
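For readers who want to see the skipped algebra, here is a sketch of the steps. It starts from Newton's law with Stokes drag, $m \frac{d^2 x}{dt^2} = -6 \pi \eta r \frac{dx}{dt} + X$, and assumes that the stochastic force is uncorrelated with position, $\langle xX \rangle = 0$, and that equipartition gives $m \langle (dx/dt)^2 \rangle = k_B T$. Multiplying by $x$ and using the identities

$x\frac{dx}{dt} = \frac{1}{2}\frac{d(x^2)}{dt}, \qquad x\frac{d^2x}{dt^2} = \frac{1}{2}\frac{d^2 (x^2)}{dt^2} - \left(\frac{dx}{dt}\right)^2$

and then averaging gives

$\frac{m}{2}\frac{d^2\langle x^2\rangle}{dt^2} - k_B T = -3\pi\eta r \frac{d\langle x^2\rangle}{dt}$

This is a first-order equation for $z = \frac{d\langle x^2\rangle}{dt}$ with solution

$z = \frac{k_B T}{3\pi\eta r} + C e^{-6\pi\eta r t/m}$

The exponential is the short-time transient. Once it has decayed,

$\langle x^2 \rangle = \frac{k_B T}{3\pi\eta r} t$

which is linear in time, with the prefactor playing the role of $2D$ in one dimension.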

3. Random Walk Derivation.

There is a third way to derive Brownian motion that is laid out in the book chapter above. The idea is to look at a single particle and do a microscopic random walk. One can set up a recursive definition that defines a binomial probability solution. After a large number of steps, the central limit theorem applies and we end up with a Gaussian solution.
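The random walk argument is easy to check numerically. The sketch below, assuming NumPy, simulates many independent walkers taking $\pm 1$ steps and looks at two things: whether the mean square displacement grows linearly with the number of steps, and whether the final positions look Gaussian (using the ratio of the fourth moment to the squared second moment, which is 3 for a Gaussian). The number of walkers and steps are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n_walkers, n_steps = 5000, 400

# Each step is +1 or -1 with equal probability.
steps = rng.choice([-1, 1], size=(n_walkers, n_steps))
paths = np.cumsum(steps, axis=1).astype(float)

# The mean square displacement after k steps should be close to k.
msd = np.mean(paths**2, axis=0)
print("MSD after 100, 200, 400 steps:", msd[99], msd[199], msd[399])

# Gaussian check on the final positions: fourth moment / variance^2 should be near 3.
final = paths[:, -1]
kurtosis = np.mean(final**4) / np.mean(final**2) ** 2
print(f"kurtosis of final positions: {kurtosis:.2f}")
```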

How do we get Brownian motion?

In general, there are three conditions that need to be satisfied for Brownian motion:
1. Increments are independent
2. Increments are wide-sense stationary: the first moment and autocovariance don’t depend on time (this is a weaker condition than complete stationarity)
3. Zero mean

The third condition is often ignored by examining the motion relative to the mean displacement (i.e. the actual displacement is not Brownian, but fluctuations in the displacement could be Brownian). So really, the first two are the more important conditions.

## Fundamental Questions

• Einstein made three major assumptions in his derivation. Two of the three are often violated in biology. Which assumption is relatively safe?
• What biological processes do you think are actually diffusive vs sub/super-diffusive? Think about the 3 conditions for Brownian motion listed above. Note, this is a preview for the next post.

# Deep Learning

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

## Papers

Deep Learning. By LeCun, Bengio, and Hinton in 2015.

Deep learning in neural networks: An overview. By Schmidhuber in 2015.

## What is deep learning?

In previous weeks we have introduced perceptrons, multilayer perceptrons (MLP), Hopfield neural networks, and Boltzmann machines.

But what is deep learning? I think it is really two things:

1. Successful training of multilayered neural networks that perform better (higher classification accuracy, etc.) and involve more layers than previous implementations
2. Just a rebranding of neural networks

Here is my summary of the history of deep learning; see both reviews above for extended details. In 2006, Deep Belief Networks (DBN) were introduced in two papers (Reducing the Dimensionality of Data with Neural Networks and A Fast Learning Algorithm for Deep Belief Nets). The idea of a DBN is to train a series of restricted Boltzmann machines (RBM). The first RBM is trained on the original data and then, given that data, produces an output of hidden layer activations. The second RBM uses the first RBM’s hidden layer activations as inputs, and trains on that “data”. This is continued to the desired depth. At this point, the DBN can be used for unsupervised learning, or one can use it as pretraining for an MLP which will utilize backpropagation for supervised learning.

So on the one hand, there were technical breakthroughs that enabled neural networks to utilize more layers than previous iterations and achieve state-of-the-art performance. However, the actual components (RBMs and MLPs) have been around since the 1980s, so it would also be fair to deem deep learning a rebranding of neural networks.

Therefore, neural network winter (mid 90s to mid 00s) officially ended in 2006. I propose calling 2006-2012 neural network spring. While interest in neural networks increased and new advances were made, the general machine learning community was not obsessed with deep learning. That changed in 2012 when the neural network summer began. This paper presented at NIPS revolutionized the computer vision community by cutting the error rate on ImageNet nearly in half! The ImageNet challenge was viewed as a serious benchmark that all computer vision systems should address. By blowing previous results out of the water, the revolution was completed.

So for now, enjoy neural network summer, but always remember, winter is coming.

## Fundamental Questions

• When do extra layers help in a neural network? When do they hurt?
• Why was pretraining originally needed, but is no longer used in practice? Check out these papers for details: Glorot and Saxe.
• Learn about convolutional and recurrent neural networks. These are extremely popular right now!

• Do research on unsupervised learning! It is definitely less popular today, but all the big-shots think it is the long-term future of neural networks.

# Training Networks

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

## Unit: Deep Learning

1. Perceptron
2. Energy Based Neural Networks
3. Training Networks
4. Deep Learning

## Papers

A Learning Algorithm for Boltzmann Machines. By Ackley, Hinton, and Sejnowski in 1985.

Learning representations by back-propagating errors. By Rumelhart, Hinton, and Williams in 1986.

## How do you actually train neural networks?

Hopefully the past few posts have piqued your interest in neural networks. Maybe you even want to unleash a neural network on some data. How do you actually train the neural network?

I’m actually going to keep this brief for two reasons. First, detailed derivations can already be found elsewhere (for Boltzmann see Appendix of the original paper as well as MacKay, for backpropagation see Nielsen). Second, I firmly believe that algorithms are best learned by actually stepping through the updates, so any explanation I attempt will not be sufficient for you to truly learn the algorithm. I will provide some general context as well as some questions you should be able to answer, but please go do it yourself!

There are three general classes of machine learning based on the information received:

• Unsupervised – data only. Boltzmann machine.
• Supervised – data with labels. MLP with backpropagation.
• Reinforcement – data, actions, and scores associated with each action. Deserves its own detailed post, but check out papers by DeepMind for cool applications.

The Boltzmann machine learning rule is an example of maximum likelihood. In practice, the original learning rule is too computationally expensive, so a modified algorithm called contrastive divergence (or variants such as persistent contrastive divergence) is utilized instead. See the Hinton guide to RBMs for more details.

Backpropagation is a computationally-efficient writing of the chain rule from calculus, so besides the above paper which popularized it, there is actually a long history of this algorithm being discovered and rediscovered.

## Fundamental Questions

• What is maximum likelihood?
• Why can one interpret the learning terms in the BM algorithm as “waking” and “sleeping”?
• Why are BM hidden layers so important?
• Why are restricted Boltzmann machines, RBMs, much easier to train?
• Why is backpropagation more computationally efficient than the finite difference method?
• Derive the 4 backpropagation equations!

• Use Nielsen’s code to train your own MLP

# Energy Based Neural Networks

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

## Unit: Deep Learning

1. Perceptron
2. Energy Based Neural Networks
3. Training Networks
4. Deep Learning

## Papers

A Learning Algorithm for Boltzmann Machines. By Ackley, Hinton, and Sejnowski in 1985. (Note: Only section 1 and 2 are covered here, rest of paper covered in next topic).

## Other Useful References

• Hopfield Network – Wikipedia and Scholarpedia
• Boltzmann Machine – Wikipedia and Scholarpedia
• MacKay, Ch 31 (optional intro to Ising model), Ch 42 (Hopfield), and Ch 43 (Boltzmann).
• Amit – This provides a detailed, physics-based, analysis of the Hopfield model

## Why should you care?

The Hopfield paper provides an explicit, decently biologically plausible, mechanism by which a system of (artificial) neurons can store memories. The central goal of the paper is to demonstrate a method for content-addressable memory. Standard computer memory is location-addressable (i.e. your computer looks to a specific place on your disk). The idea of content-addressable memory is that a partial (perhaps faulty) presentation of the memory should be sufficient to obtain the full memory. I love this quote:

An example asks you to recall ‘An American politician who was very intelligent and whose politician father did not like broccoli’. Many people think of president [George W.] Bush –even though one of the cues contains an error.

MacKay Ch 38, pg 469.

## Hopfield Neural Network

So how does Hopfield actually store memories? The idea is that a system can be constructed such that the stable states of the system are the desired memories. I’m going to change notation from the original Hopfield paper so that it matches standard physics notation.

In the actual paper, Hopfield uses threshold neurons but all arguments can be easily extended to a $\tanh$ neuron. I will define a neuron’s state as $S$, and neuron $i$ (out of $N$ total) is firing if $S_i=1$ and not firing if $S_i = -1$.

How does the system actually store memories? I will call the memories ($p$ of them) that one wants to store as $\xi^\mu$ where $\mu=1\ldots p$. The idea is to define the interactions between neurons as:

$J_{ij} = \frac{1}{N}\sum_{\mu=1}^p \xi_i^\mu \xi_j^\mu$

Why are memories actually stable? We can write the activation function of a neuron (also known as the local field in physics terms) as

$a_i = \sum_{j\neq i} J_{ij} S_j$

Now let’s check if a memory is stable (say memory $1$) by examining the activation function of the first neuron.

$a_1 = \sum_{j=2}^N J_{1j} \xi_j^1= \frac{1}{N} \sum_{j=2}^N \sum_{\mu=1}^p \xi_1^\mu \xi_j^\mu \xi_j^1$

$a_1 = \frac{1}{N} \sum_{j=2}^N \xi_1^1 +\frac{1}{N}\sum_{j=2}^N \sum_{\mu=2}^p \xi_1^\mu \xi_j^\mu \xi_j^1$

$a_1 = \frac{N-1}{N} \xi_1^1 +\frac{1}{N}\sum_{j=2}^N \sum_{\mu=2}^p \xi_1^\mu \xi_j^\mu \xi_j^1$

The first term of the activation function is exactly what we need for the first neuron to be stable. The second term is noise, but how big is it? We need a couple more facts to figure it out. First, we can safely assume $N\gg1$. Second, we will assume that memories are not biased (equal numbers of neurons are on and off). Third, we will assume that we are storing random memories. Then using the central limit theorem, we get that on average, $a_1 = \xi_1^1$. (Advanced question: what is the standard deviation of the noise term? What does that imply about stability of memories?)

So this establishes that memories are stable. But how should we actually think about these memories? By examining the energy of the system, one can show that these memories are global minima and have basins of attraction. The energy is defined as

$H = -\frac{1}{2} \sum_{ij} S_i J_{ij} S_j$

where we have defined the self-interactions to be zero ($J_{ii}=0$). Using similar arguments as above, one can show that on average each memory has an energy of approximately $-\frac{N}{2}$ and that flipping a single spin leads to higher energy. (Fundamental question: prove these statements!)

Therefore, using the prescription outlined in the Hopfield paper, one can take a set of memories, $\xi$, and create a dynamical system with these memories as global minima.
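The storage prescription and the energy function can be tried out in a few lines. This is a minimal sketch, assuming NumPy, random unbiased memories, and asynchronous threshold updates $S_i \leftarrow \mathrm{sign}(a_i)$. The network size, the number of memories, and the 10% corruption level are arbitrary illustrative choices, not values from the Hopfield paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 5                             # neurons and memories (assumed sizes)
xi = rng.choice([-1, 1], size=(p, N))     # random, unbiased memories

J = (xi.T @ xi) / N                       # J_ij = (1/N) sum_mu xi_i^mu xi_j^mu
np.fill_diagonal(J, 0.0)                  # no self-interactions


def energy(S):
    """H = -(1/2) sum_ij S_i J_ij S_j."""
    return -0.5 * S @ J @ S


def recall(S, sweeps=5):
    """Asynchronous threshold updates: S_i <- sign(a_i), in random order."""
    S = S.copy()
    for _ in range(sweeps):
        for i in rng.permutation(N):
            a_i = J[i] @ S
            S[i] = 1 if a_i >= 0 else -1
    return S


# Corrupt memory 0 by flipping 10% of its neurons, then let the network relax.
probe = xi[0].copy()
flipped = rng.choice(N, size=N // 10, replace=False)
probe[flipped] *= -1
recovered = recall(probe)

print("overlap before:", probe @ xi[0] / N)
print("overlap after: ", recovered @ xi[0] / N)
print("energy of memory:", energy(xi[0]), "vs -N/2 =", -N / 2)
```

This is content-addressable memory in action: a faulty cue is enough for the dynamics to fall back into the stored pattern. Increasing `p` at fixed `N` is a quick way to explore when recall starts to fail.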

## Boltzmann Machine

Hopfield networks are great if you already know the states of the desired memories. But what if you are only given data? How would you actually train a neural network to store the data?

The next journal club will get to actual training, but it is convenient to introduce at this time a Boltzmann Machine (BM). This is an extension of Hopfield networks that can actually learn to store data. In the most general Boltzmann machine, neurons are divided into visible (actually interact with the data) and hidden (only see data through interactions with visible neurons). This leads to an energy function of:

$H = -\frac{1}{2} \sum_{ij}v_i J_{ij} v_j- \sum_{ij}v_i w_{ij} h_j-\frac{1}{2} \sum_{ij}h_i K_{ij} h_j$

where $v$ are visible neurons and $h$ are hidden neurons (if present, not a requirement). There are three different types of interactions, those amongst visible neurons only ($J$), those amongst hidden neurons only ($K$), and those between visible and hidden neurons ($w$).

As will be explained in the next journal club, the full Boltzmann machine takes a long time to train. So instead, it is common to use a Restricted Boltzmann Machine (RBM), which has no interactions within a layer ($J=0$ and $K=0$):

$H = - \sum_{ij}v_i w_{ij} h_j$
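One consequence of the restricted energy function can be checked by brute force on a tiny network. The sketch below, assuming NumPy, binary 0/1 units, unit temperature, and no bias terms (all assumptions made here for illustration), compares the probability that each hidden unit is on given a fixed visible configuration, computed two ways: by enumerating every hidden state with weight $e^{-H}$, and by treating each hidden unit independently.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 4, 3
w = rng.normal(0.0, 1.0, size=(n_v, n_h))   # visible-hidden couplings w_ij
v = np.array([1, 0, 1, 1])                  # one fixed visible configuration


def rbm_energy(v, h, w):
    """H = -sum_ij v_i w_ij h_j."""
    return -v @ w @ h


# Brute force: weight every hidden state by exp(-H) and normalise.
hidden_states = np.array(list(itertools.product([0, 1], repeat=n_h)))
weights = np.exp([-rbm_energy(v, h, w) for h in hidden_states])
probs = weights / weights.sum()
p_on_brute = probs @ hidden_states          # P(h_j = 1 | v) for each j

# Factorised form: each hidden unit only sees its own input from the visible layer.
p_on_factorised = 1.0 / (1.0 + np.exp(-(v @ w)))

print(p_on_brute)
print(p_on_factorised)
```

The two results agree, which means the hidden units can be sampled all at once given the visible units. The brute-force sum, by contrast, grows exponentially with the number of hidden units.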

## Fundamental Questions

• Why is content-addressable memory considered associative, software and hardware fault tolerant, and distributed? Why is this closer to biology than location-addressable memory?
• Why is the Hopfield storage prescription Hebbian?
• Do the calculation to show that the memories are global minima.
• Hopfield says “In many physical systems, the nature of the emergent collective properties are insensitive to the details inserted in the model.” What are some assumptions that Hopfield relaxes in the simulations?
• Next time we will see that RBMs are easier to train than BMs. Can you see why?

• For the activation function argument, what is the standard deviation of the noise term? What does that imply about stability of memories?
• What happens to the capacity if the memories are not equally $\pm 1$ and/or correlated with each other?

# Perceptron

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

## Paper

The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain by Rosenblatt in 1958

## Motivation for Perceptron

I’ll let Rosenblatt introduce the important questions leading to the perceptron himself by quoting his first paragraph:

If we are eventually to understand the capability of higher organisms for perceptual recognition, generalization, recall, and thinking, we must first have answers to three fundamental questions:

1. How is information about the physical world sensed, or detected, by the biological system?
2. In what form is information stored, or remembered?
3. How does information in storage, or in memory, influence recognition and behavior?

The perceptron is a first attempt to answer the second and third questions. In the years leading up to the perceptron, there were two dominant themes of theoretical research on the brain. One focused on the general computational properties of the brain (McCulloch and Pitts 1943) and showed that simple binary neurons could form a computer (i.e. they can compute any possible function). Another theme focused on abstracting away the details of experiments to get at general principles that relate to computation in the brain (Hebb 1949 and his synapse learning rules).

The perceptron opened up a third avenue of theoretical research. The central goal is to devise neuron-inspired algorithms that learn from real data and can be used to make a decision.

## What is a Perceptron?

### Basics

I find the math in the original perceptron paper pretty confusing. This is partly due to a generational difference in terminology, and partly due to poor explanations in the paper. This is definitely a paper that benefited from the passage of time and future synthesis into a more concise topic. Therefore, I recommend focusing attention on the introduction and conclusion, while below I’ll introduce the modern notation of the perceptron (see MacKay Ch 39/40 for similar details).

The perceptron consists of a set of inputs, $x$, that are fed into the perceptron, with each input receiving its own weight, $w$. The activity of the perceptron is given by $a = \sum_i w_i x_i$.

Note that the perceptron can have a bias that is independent of inputs. However, we don’t need to write this out separately and can instead include an input that is always set to 1, independently of the data.

This activity is then evaluated by the activation function, $f(a)$, to determine the output, $y$. There are lots of different possible activation rules with some popular ones including

• Linear:

$y(a) = a$

• Rectified Linear:

$\begin{aligned} y &= 0 \quad\text{if}\quad a \leq 0 \\ y&= a\quad\text{if}\quad a>0 \end{aligned}$

• Sigmoid:

$y(a) = \frac{1}{1+\exp{(-a)}}$

• Threshold:

$\begin{aligned} y &= 0 \quad\text{if}\quad a \leq 0 \\ y&= 1\quad\text{if}\quad a>0 \end{aligned}$

The end result is that we can take the output of a perceptron and use this output to make a decision. The sigmoid and threshold activation functions return an answer between 0 and 1 and hence have a natural interpretation as a probability. From now on, we will work with sigmoid activation functions.
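The four activation rules listed above are short enough to write down directly. Here is a minimal sketch, assuming NumPy; the weights and inputs are made-up numbers, with the last input fixed to 1 so that its weight acts as the bias.

```python
import numpy as np


def linear(a):
    return a


def rectified_linear(a):
    return np.maximum(a, 0.0)


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def threshold(a):
    return np.where(a > 0, 1.0, 0.0)


# Activity a = sum_i w_i x_i, with the bias folded in as an input fixed to 1.
x = np.array([0.5, -1.0, 1.0])   # last entry is the constant bias input
w = np.array([2.0, 1.0, -0.5])   # last entry plays the role of the bias
a = w @ x
for f in (linear, rectified_linear, sigmoid, threshold):
    print(f.__name__, f(a))
```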

### Training

Now that the basics of a perceptron have been introduced, how do we actually train it? In other words, if I gave you a set of data, $X$, where each entry $x_n$ is $N$ dimensional, how would I evaluate the perceptron’s handling of the data? For now, we will focus on using the perceptron as a binary classifier (only need to decide between two groups: 0 and 1). Since we are using sigmoid activation functions, we can interpret the output as the probability that the data is in group 1.

The standard way to train a binary classifier is to have a training set which consists of pairs of data, $x_n$, and correct labels, $t_n$. Then training proceeds by seeing if the output label of a perceptron matches the correct label. If everything is correct, perfect! If not, we need to do something with that error.

For a sigmoid activation, the commonly used error function is the cross-entropy:

$\mathcal{E} = - \sum_n \left[ t_n \ln y_n + \left(1-t_n \right) \ln \left(1-y_n \right) \right]$

The output $y$ is a function of the weights $w$. We can then take the derivative of the error with respect to the weights, which in the case of the sigmoid activation and cross-entropy error is, for a single data point $n$, simply $\delta \mathcal{E}_n = -\left(t_n - y_n\right) x_n$.
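A quick way to convince yourself of this gradient is to compare it with a finite-difference estimate of the cross-entropy for a single data point. The sketch below assumes NumPy; the weights, input, and step size `eps` are arbitrary illustrative values.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def cross_entropy(w, x, t):
    """Cross-entropy error for a single data point x with label t."""
    y = sigmoid(w @ x)
    return -(t * np.log(y) + (1 - t) * np.log(1 - y))


def analytic_gradient(w, x, t):
    """The closed form -(t - y) x quoted in the text."""
    return -(t - sigmoid(w @ x)) * x


def numerical_gradient(w, x, t, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (cross_entropy(w + step, x, t) - cross_entropy(w - step, x, t)) / (2 * eps)
    return grad


w = np.array([0.3, -0.7, 0.1])
x = np.array([1.0, 2.0, 1.0])
t = 1
print(analytic_gradient(w, x, t))
print(numerical_gradient(w, x, t))
```

The same kind of check is a useful habit when deriving the backpropagation equations later on.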

The simplest possible update algorithm is to perform gradient descent on the weights, stepping against the gradient: $\Delta w_n = -\delta \mathcal{E}_n = \left(t_n - y_n\right) x_n$. This is a greedy algorithm (always improves current error, long-term consequences be damned!). Gradient descent comes in several closely related varieties: online, batch, and mini-batch. Let’s start with the mini-batch. First the data is divided up into small random sets (say 10 data points each). Then we loop through the mini-batches, and for each one we calculate the output and error and update the weights. Online learning is when the mini-batches each contain exactly 1 data point, while batch learning is when the mini-batch is the whole dataset. The current standard is to use a mini-batch of between 10 and 100 data points, which is a compromise between speed (batches are faster) and accuracy (online finds better solutions).

Putting it all together, the training algorithm is as follows

1. Calculate the activation function and output with respect to a mini-batch of data
2. Calculate the errors of the output
3. Update the weights
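The three steps above can be sketched as a short mini-batch loop. This is a minimal illustration, assuming NumPy, a sigmoid activation, the cross-entropy gradient $-\left(t_n - y_n\right) x_n$, and a learning rate of 1. The function name `train_perceptron`, the OR-gate data set, the batch size, and the number of epochs are all illustrative choices.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def train_perceptron(X, t, epochs=200, batch_size=2, learning_rate=1.0, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))            # reshuffle the data every epoch
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]
            y = sigmoid(X[batch] @ w)              # 1. activity and output
            grad = -(t[batch] - y) @ X[batch]      # 2. error gradient, summed over the batch
            w -= learning_rate * grad              # 3. step against the gradient
    return w


# OR gate; the third input column is fixed to 1 and acts as the bias.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
t = np.array([0, 1, 1, 1], dtype=float)

w = train_perceptron(X, t)
predictions = (sigmoid(X @ w) > 0.5).astype(int)
print("weights:", w)
print("predictions:", predictions)
```

Setting `batch_size=1` gives online learning and `batch_size=len(X)` gives batch learning. Replacing the labels with those of an XOR gate is a good experiment to run before tackling the questions below.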

And now you’ve got all the basics of a perceptron down! On to the more difficult questions…

## Fundamental Questions

• What are similarities and differences between a perceptron and a neuron? Do different activation functions lead to distinct interpretations?
• What is connectionism? How does this relate to the perceptron? How does this contrast with computers?
• What class of learning algorithm is the perceptron? Possible answers: unsupervised, supervised, or reinforcement learning
• What type of functions can a perceptron compute? Compare the standard OR gate vs the exclusive OR (XOR) gate for a perceptron with 2 weights.
• Does the perceptron return a unique answer? Does the perceptron return the “best” answer (you need to define “best”)? Check out Support Vector Machines for one answer to the “best”.
• Under what conditions can the perceptron generalize to data it has never seen before? Look into Rosenblatt’s “differentiated environment”.

• There are other possibilities for the error functions. Why is the cross-entropy a wise choice for the sigmoid activation?
• The weight updates can be multiplied by a “learning rate” that controls the size of updates, while I implicitly assumed a learning rate of 1. How would you actually determine a sensible learning rate?
• The standard learning algorithm puts no constraints on the possible weights. What are some possible problems with unconstrained weights? Can you think of a possible solution? How does this change the generalization properties of a perceptron?
• Threshold activation functions produce simpler output (only two possible values) than sigmoid activation functions. Despite this simpler output, threshold activation functions are more difficult to train. Can you figure out why?
• What is the information storage capacity of a perceptron? The exact answer is difficult, but you can get the right order of magnitude in the limit of large number of data points and large number of weights.

# JC: Computational Neuroscience

This is part of the “journal club for credit” series. Below are the included units and details for each week.

## Unit: Diffusion

Organized by Ben Regner

# Journal Club For Credit

My favorite course of all time is one that I had the chance to TA. It was based on a Princeton course originally organized by Ned Wingreen and David Botstein (see this paper for their teaching philosophy), and brought to BU by my advisor Pankaj Mehta.

The class was intended for upper level undergrads and graduate students from a variety of backgrounds including biology, physics, engineering, etc. In order to establish a common vocabulary and shared knowledge base, each week we read and discussed foundational papers in quantitative biology. The papers were a mix of theoretical papers and experimental papers that contributed key concepts (we did not read overly mathematical theory papers or experimentally detailed protocol papers). By the end of the course, everyone had a shared set of fundamental concepts that both theorists and experimentalists could understand.

Since I’m relatively new to computational neuroscience, I’m trying to start up a journal club. However, computational neuroscience is a grab-bag of topics that only have the brain as a unifier. Additionally, journal clubs usually cover the latest breaking research, which in computational neuroscience would lead to papers from week to week that may have little in common.

So inspired by the Wingreen and Botstein course, we will be using an approach that I’m calling “journal club for credit”. We are going to try and blend the best ideas from a course based on fundamental papers with a journal club that covers the latest research. We are organizing around units that will last 2-4 weeks. Each unit will be a self-contained introduction to a topic. The first weeks cover the essential papers needed to understand the background, while the final week will discuss current research.

Since I suggested this organization, I’m starting us off with a unit on Deep Learning. My intent is to blog about each unit and topic. In order to encourage others to actually read the paper, my blog posts will be deliberately vague. My plan is to provide the needed background to get you interested (the WHY you should care) and start you towards understanding (define WHAT the topic is), but to avoid explaining the topic in so much detail that you no longer feel compelled to read the papers (I want you to discover the detailed HOW and WHY of the topic on your own). I will outline a set of fundamental questions that everyone should understand as well as additional questions that go further into supplementary points or advanced topics.

I’m not exactly sure how many units will get covered (highly dependent on the rest of the journal club!), but my dream is that by the end of my postdoc, we have covered enough topics in computational neuroscience that we have a “course” with a similar philosophy to Wingreen and Botstein’s.

For the details of the papers, check out Journal Club for Computational Neuroscience