# Perceptron

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

## Paper

*The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain* by Rosenblatt (1958)

## Motivation for Perceptron

I’ll let Rosenblatt himself introduce the important questions leading to the perceptron by quoting his first paragraph:

> If we are eventually to understand the capability of higher organisms for perceptual recognition, generalization, recall, and thinking, we must first have answers to three fundamental questions:
>
> 1. How is information about the physical world sensed, or detected, by the biological system?
> 2. In what form is information stored, or remembered?
> 3. How does information in storage, or in memory, influence recognition and behavior?

The perceptron is a first attempt to answer the second and third questions. In the years leading up to the perceptron, there were two dominant themes of theoretical research on the brain. One focused on the general computational properties of the brain (McCulloch and Pitts 1943) and showed that simple binary neurons could form a computer (i.e., they can compute any possible function). The other focused on abstracting away the details of experiments to get at general principles that relate to computation in the brain (Hebb 1949 and his synaptic learning rules).

The perceptron opened up a third avenue of theoretical research. The central goal is to devise neuron-inspired algorithms that learn from real data and can be used to make a decision.

## What is a Perceptron?

### Basics

I find the math in the original perceptron paper pretty confusing. This is partly due to a generational difference in terminology, and partly due to poor explanations in the paper. This is definitely a paper that benefited from the passage of time and later synthesis into a more concise form. Therefore, I recommend focusing attention on the introduction and conclusion, while below I’ll introduce the modern notation of the perceptron (see MacKay Ch 39/40 for similar details).

The perceptron consists of a set of inputs, $x$, that are fed into the perceptron, with each input receiving its own weight, $w$. The activity of the perceptron is given by the weighted sum $a = \mathbf{w} \cdot \mathbf{x} = \sum_i w_i x_i$.

Note that the perceptron can have a bias that is independent of inputs. However, we don’t need to write this out separately and can instead include an input that is always set to 1, independently of the data.
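This bias trick is easy to see in a quick NumPy sketch (the specific values and variable names here are made up for illustration):

```python
import numpy as np

# A hypothetical 2-dimensional input point
x = np.array([0.5, -1.2])

# Append a constant input of 1 so the bias rides along with the weights
x_aug = np.append(x, 1.0)

# The last weight now plays the role of the bias
w = np.array([0.8, 0.3, -0.5])  # [w1, w2, bias]

a = w @ x_aug  # identical to w1*x1 + w2*x2 + bias, i.e. about -0.46
```

The advantage is that the update rules below never need a special case for the bias: it is just another weight.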

This activity is then evaluated by the activation function, $f(a)$, to determine the output, $y$. There are lots of different possible activation functions, with some popular ones including:

• Linear:

$y(a) = a$

• Rectified Linear:

$$y(a) = \begin{cases} 0 & \text{if } a \leq 0 \\ a & \text{if } a > 0 \end{cases}$$

• Sigmoid:

$y(a) = \frac{1}{1+\exp{(-a)}}$

• Threshold:

$$y(a) = \begin{cases} 0 & \text{if } a \leq 0 \\ 1 & \text{if } a > 0 \end{cases}$$

The end result is that we can take the output of a perceptron and use this output to make a decision. The sigmoid and threshold activation functions return an answer between 0 and 1 and hence have a natural interpretation as a probability. From now on, we will work with sigmoid activation functions.
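The four activation functions above are one-liners in NumPy (a minimal sketch; the function names are my own):

```python
import numpy as np

def linear(a):
    # identity: the output equals the activation
    return a

def rectified_linear(a):
    # zero for a <= 0, the activation itself otherwise
    return np.maximum(0.0, a)

def sigmoid(a):
    # squashes any real activation into (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

def threshold(a):
    # hard 0/1 decision at a = 0
    return np.where(np.asarray(a) > 0, 1.0, 0.0)
```

Note that `sigmoid(0.0)` is exactly 0.5, the "maximally uncertain" output, which is one reason it reads naturally as a probability.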

### Training

Now that the basics of a perceptron have been introduced, how do we actually train it? In other words, if I gave you a set of data, $X$, where each entry $x_n$ is $N$-dimensional, how would you evaluate the perceptron’s handling of the data? For now, we will focus on using the perceptron as a binary classifier (we only need to decide between two groups: 0 and 1). Since we are using sigmoid activation functions, we can interpret the output as the probability that the data is in group 1.

The standard way to train a binary classifier is to have a training set consisting of pairs of data, $x_n$, and correct labels, $t_n$. Then training proceeds by checking whether the output label of the perceptron matches the correct label. If everything is correct, perfect! If not, we need to do something with that error.

For a sigmoid activation, the commonly used error function is the cross-entropy:

$\mathcal{E} = - \sum_n \left[ t_n \ln y_n + \left(1-t_n \right) \ln \left(1-y_n \right) \right]$

The output $y$ is a function of the weights $w$. We can then take the derivative of the error with respect to the weights, which in the case of the sigmoid activation and cross-entropy error is simply $\delta \mathcal{E}_n = -\left(t_n - y_n\right) x_n$.
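This analytic gradient can be sanity-checked against a finite-difference estimate (a minimal NumPy sketch; the data point, label, and weights are made up):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, x, t):
    # error for a single data point
    y = sigmoid(w @ x)
    return -(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

# Made-up data point, label, and weights
x = np.array([0.5, -1.0, 1.0])
t = 1.0
w = np.array([0.2, 0.4, -0.1])

# Analytic gradient: -(t - y) * x
y = sigmoid(w @ x)
grad = -(t - y) * x

# Central finite-difference estimate of the first gradient component
eps = 1e-6
w_hi, w_lo = w.copy(), w.copy()
w_hi[0] += eps
w_lo[0] -= eps
numeric = (cross_entropy(w_hi, x, t) - cross_entropy(w_lo, x, t)) / (2.0 * eps)
```

The two estimates agree to several decimal places, which is a quick way to convince yourself the algebra is right.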

The simplest possible update algorithm is to perform gradient descent on the weights, defining $\Delta w_n = -\delta \mathcal{E}_n = \left(t_n - y_n\right) x_n$ (note the minus sign: we step down the gradient to reduce the error). This is a greedy algorithm (always improves the current error, long-term consequences be damned!). Gradient descent comes in several closely related varieties: online, batch, and mini-batch. Let’s start with the mini-batch. First the data is divided up into small random sets (say 10 data points each). Then we loop through the mini-batches, and for each one we calculate the output and error and update the weights. Online learning is when the mini-batches each contain exactly 1 data point, while batch learning is when the mini-batch is the whole dataset. The current standard is to use a mini-batch of between 10 and 100 data points, a compromise between speed (batches are faster) and accuracy (online finds better solutions).

Putting it all together, the training algorithm is as follows

1. Calculate the activations and outputs for a mini-batch of data
2. Calculate the errors of the outputs
3. Update the weights
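The three steps above can be sketched as a minimal NumPy training loop (the function names, hyperparameters, and toy dataset are illustrative assumptions, not from the paper):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_perceptron(X, t, n_epochs=200, batch_size=10, lr=0.5, seed=0):
    # Mini-batch gradient descent on the cross-entropy error
    rng = np.random.default_rng(seed)
    n_points, n_dims = X.shape
    w = np.zeros(n_dims)
    for _ in range(n_epochs):
        order = rng.permutation(n_points)
        for start in range(0, n_points, batch_size):
            idx = order[start:start + batch_size]
            y = sigmoid(X[idx] @ w)       # step 1: activations and outputs
            err = t[idx] - y              # step 2: errors
            w += lr * X[idx].T @ err      # step 3: weight update (down the gradient)
    return w

# Toy linearly separable problem: label is 1 when x1 + x2 > 0
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
X = np.hstack([X, np.ones((200, 1))])  # constant input plays the role of the bias
t = (X[:, 0] + X[:, 1] > 0).astype(float)

w = train_perceptron(X, t)
accuracy = ((sigmoid(X @ w) > 0.5) == t).mean()
```

On a linearly separable toy problem like this, the learned weights should classify nearly all of the training points correctly.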

And now you’ve got all the basics of a perceptron down! On to the more difficult questions…

## Fundamental Questions

• What are similarities and differences between a perceptron and a neuron? Do different activation functions lead to distinct interpretations?
• What is connectionism? How does this relate to the perceptron? How does this contrast with computers?
• What class of learning algorithm is the perceptron? Possible answers: unsupervised, supervised, or reinforcement learning
• What type of functions can a perceptron compute? Compare the standard OR gate vs the exclusive OR (XOR) gate for a perceptron with 2 weights.
• Does the perceptron return a unique answer? Does the perceptron return the “best” answer (you need to define “best”)? Check out Support Vector Machines for one answer to the “best”.
• Under what conditions can the perceptron generalize to data it has never seen before? Look into Rosenblatt’s “differentiated environment”.

• There are other possibilities for the error functions. Why is the cross-entropy a wise choice for the sigmoid activation?
• The weight updates can be multiplied by a “learning rate” that controls the size of updates, while I implicitly assumed a learning rate of 1. How would you actually determine a sensible learning rate?
• The standard learning algorithm puts no constraints on the possible weights. What are some possible problems with unconstrained weights? Can you think of a possible solution? How does this change the generalization properties of a perceptron?
• Threshold activation functions produce simpler output (only two possible values) than sigmoid activation functions. Despite this simpler output, threshold activation functions are more difficult to train. Can you figure out why?
• What is the information storage capacity of a perceptron? The exact answer is difficult, but you can get the right order of magnitude in the limit of large number of data points and large number of weights.

# JC: Computational Neuroscience

This is part of the “journal club for credit” series. Below are the included units and details for each week.

## Unit: Diffusion

Organized by Ben Regner

# Journal Club For Credit

My favorite course of all time is one that I had the chance to TA. It was based on a Princeton course originally organized by Ned Wingreen and David Botstein (see this paper for their teaching philosophy), and brought to BU by my advisor Pankaj Mehta.

The class was intended for upper level undergrads and graduate students from a variety of backgrounds including biology, physics, engineering, etc. In order to establish a common vocabulary and shared knowledge base, each week we read and discussed foundational papers in quantitative biology. The papers were a mix of theoretical papers and experimental papers that contributed key concepts (we did not read overly mathematical theory papers or experimentally detailed protocol papers). By the end of the course, everyone had a shared set of fundamental concepts that both theorists and experimentalists could understand.

Since I’m relatively new to computational neuroscience, I’m trying to start up a journal club. However, computational neuroscience is a grab-bag of topics that have only the brain as a unifier. Additionally, journal clubs usually cover the latest breaking research, which in computational neuroscience would lead to papers from week to week that may have little in common.

So, inspired by the Wingreen and Botstein course, we will be using an approach that I’m calling “journal club for credit”. We are going to try to blend the best ideas from a course based on fundamental papers with a journal club that covers the latest research. We are organizing around units that will last 2-4 weeks. Each unit will be a self-contained introduction to a topic. The first weeks cover the essential papers needed to understand the background, while the final week will discuss current research.

Since I suggested this organization, I’m starting us off with a unit on Deep Learning. My intent is to blog about each unit and topic. In order to encourage others to actually read the paper, my blog posts will be deliberately vague. My plan is to provide the needed background to get you interested (the WHY you should care), start you towards understanding (define WHAT the topic is), but avoid explaining the topic in so much detail that you no longer feel compelled to read the papers (I want you to discover the detailed HOW and WHY of the topic on your own). I will outline a set of fundamental questions that everyone should understand as well as additional questions that go further into supplementary points or advanced topics.

I’m not exactly sure how many units will get covered (highly dependent on the rest of the journal club!), but my dream is that by the end of my postdoc, we will have covered enough topics in computational neuroscience to have a “course” in a similar philosophy to Wingreen and Botstein’s.

For the details of the papers, check out Journal Club for Computational Neuroscience.

# Initial Conditions

I (i.e., Alex Lang) am a physics PhD currently doing postdoctoral research at the Salk Institute in San Diego. I work on a variety of research topics such as physics, computational neuroscience, machine learning and theoretical biophysics. As an outsider, that probably looks like a jumble of topics, but I swear, there is a theme! In my research, I apply techniques (both conceptual and mathematical) from statistical physics to a variety of problems. Statistical physics is the domain of physics that applies to large systems (number of “particles” $N$ when $N \gg 1$). In many ways large systems are simpler than small systems, so taking the extremely large system size limit ($N \to \infty$) often brings useful insights into a problem. So the blog name is inspired by statistical physics, my broad interests, and of course, Buzz Lightyear.

The blog will focus on research topics of interest to me and hopefully others. I will also blog about research in general, what academia is like, and other science-like things (including science-fiction!).

I will focus on occasional, but detailed posts. My personal goal for 2016 is 25 posts of substance, so one every two weeks. I’m hoping the journal club we are starting up at the Salk will provide plenty of material, more details on that soon.