Deep Learning: 0-60 in a few hours?

Here, I will try to outline the fastest possible path to go from zero understanding of deep learning to an understanding of the basic ideas. In a follow up post, I’ll outline some deep learning packages where you could actually implement these ideas.

I think by far the best introduction to deep learning is Michael Nielsen’s ebook. Before you get started with it, I think the minimum required mathematics includes an understanding of the following:

  • Vector and Matrix multiplication – especially when written in summation notation
  • Exponents and Logarithms
  • Derivatives and Partial Derivatives
  • Probability, mainly Bayes Theorem (not actually needed for Michael Nielsen’s book, but it is essential for later topics)

I really think that if you understand those mathematical topics, you can start reading the ebook.

Here is my proposed learning strategy. Iterate between reading the ebook (Chapters 1-5 only) and playing with this cool interactive neural network every time a new idea is mentioned. For a first pass, just read the ebook and don’t do the exercises or worry about actual code implementation. Additionally, chapter 6 introduces convolutional neural networks which are a more advanced topic that can be saved for later.

Once you have some intuition about neural networks, I recommend reading this review by several of the big names in deep learning. This will give you a flavor of the current status of the field.

Now you are ready to start coding!

PS. If you want to get into more advanced deep learning topics, check out my previous Deep Learning Unit. And to really get up to speed on research, there is a deep learning book that should be published soon.


Training Networks

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

Unit: Deep Learning

  1. Perceptron
  2. Energy Based Neural Networks
  3. Training Networks
  4. Deep Learning


A Learning Algorithm for Boltzmann Machines. By Ackley, Hinton, and Sejnowski in 1985.

Learning representations by back-propagating errors. By Rumelhart, Hinton, and Williams in 1986.

Other Useful References


How do you actually train neural networks?

Hopefully the past few posts have piqued your interest in neural networks. Maybe you even want to unleash a neural network on some data. How do you actually train the neural network?

I’m actually going to keep this brief for two reasons. First, detailed derivations can already be found elsewhere (for Boltzmann see Appendix of the original paper as well as MacKay, for backpropagation see Nielsen). Second, I firmly believe that algorithms are best learned by actually stepping through the updates, so any explanation I attempt will not be sufficient for you to truly learn the algorithm. I will provide some general context as well as some questions you should be able to answer, but please go do it yourself!

There are three general classes of machine learning based on the information received:

  • Unsupervised – data only. Boltzmann machine.
  • Supervised – data with labels. MLP with backpropagation.
  • Reinforcement – data, actions, and scores associated with each action. Deserves its own detailed post, but check out papers by DeepMind for cool applications.

The Boltzmann machine learning rule is an example of maximum likelihood. In practice, the original learning rule is too computationally expensive, so a modified algorithm called contrastive divergence (or variants such as persistent contrastive divergence) is utilized instead. See the Hinton guide to RBMs for more details.

Backpropagation is a computationally-efficient writing of the chain rule from calculus, so besides the above paper which popularized it, there is actually a long history of this algorithm being discovered and rediscovered.

Fundamental Questions

  • What is maximum likelihood?
  • Why can one interpret the learning terms in the BM algorithm as “waking” and “sleeping”?
  • Why are BM hidden layers so important?
  • Why are restricted Boltzmann machines, RBMs, much easier to train?
  • Why is backpropagation more computationally efficient than the finite difference method?
  • Derive the 4 backpropagation equations!

Advanced Questions

  • Follow Hinton’s RBM guide and implement your own Boltzmann machine
  • Use Nielsen’s code to train your own MLP


This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

Unit: Deep Learning

  1. Perceptron
  2. Energy Based Neural Networks
  3. Training Networks
  4. Deep Learning


The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain by Rosenblatt in 1958

Other Useful References

Motivation for Perceptron

I’ll let Rosenblatt introduce the important questions leading to the perceptron himself by quoting his first paragraph:

If we are eventually to understand the capability of higher organisms for perceptual recognition, generalization, recall, and thinking, we must first have answers to three fundamental questions:

  1. How is information about the physical world sensed, or detected, by biological system?
  2. In what form is information stored, or remembered?
  3. How does information in storage, or in memory, influence recognition and behavior?

The perceptron is a first attempt to answer second and third questions. In the years leading up to the perceptron, there were two dominate themes of theoretical research on the brain. One focused on the general computational properties of the brain (McCollough and Pitts 1943) and showed that simple binary neurons could form a computer (ie they can compute any possible function). Another theme focused on abstracting away the details of experiments to get at general principles that relate to computation in the brain (Hebb 1949 and his synapse learning rules).

The perceptron opened up a third avenue of theoretical research. The central goal is to devise neuron-inspired algorithms that learn from real data and can be used to make a decision.

What is a Perceptron?


I find the math in the original perceptron paper pretty confusing. This is partly due to a generational difference in terminology, and partly due to poor explanations in the paper. This is definitely a paper that benefited from the passage of time and future synthesis into a more concise topic. Therefore, I recommend focusing attention on the introduction and conclusion, while below I’ll introduce the modern notation of the perceptron (see MacKay Ch 39/40 for similar details).


The perceptron consists of a set of inputs, x, that are fed into the perceptron, with each input receiving its own weight, w. The activity of the percepton is given by a = wx

Note that the perceptron can have a bias that is independent of inputs. However, we don’t need to write this out separately and can instead include an input that is always set to 1, independently of the data.

This activity is then evaluated by the activation function, f(a), to determine the output, y. There are lots of different possible activation rules with some popular ones including

  • Linear:

y(a) = a

  • Rectified Linear:

\begin{aligned} y &= 0 \quad\text{if}\quad a \leq 0 \\ y&= a\quad\text{if}\quad a>0 \end{aligned}

  • Sigmoid:

y(a) = \frac{1}{1+\exp{(-a)}}

  • Threshold:

\begin{aligned} y &= 0 \quad\text{if}\quad a \leq 0 \\ y&= 1\quad\text{if}\quad a>0 \end{aligned}


The end result is that we can take the output of a perceptron and use this output to make a decision. The sigmoid and threshold activation functions return an answer between 0 and 1 and hence have a natural interpretation as a probability. From now on, we will work with sigmoid activation functions.


Now that the basics of a perceptron have been introduced, how do we actually train it? In other words, if I gave you a set of data, X, where each entry x_n is N dimensional, how would I evaluate perceptron’s handling of the data? For now, we will focus on using the perceptron as a binary classifier (only need to decide between two groups: 0 and 1). Since we are using sigmoid activation functions, we can interpret the output as the probability that the data is in group 1.

The standard way to train a binary classier is to have a training set which consists of pairs of data, x_n, and correct labels, t_n. Then training proceeds by seeing if the output label of a perceptron matches the correct label. If everything is correct, perfect! If not, we need to do something with that error.

For a sigmoid activation, the commonly used error function is the cross-entropy:

\mathcal{E} = - \sum_n \left[ t_n \ln y_n + \left(1-t_n \right) \ln \left(1-y_n \right) \right]

The output y is a function of the weights w. We can then take the derivate of the error with respect to the weights, which in the case of the sigmoid activation and cross-entropy error is simply \delta \mathcal{E}_n = -\left(t_n - y_n\right) x_n.

The simplest possible update algorithm is to perform gradient descent on the weights and define \Delta w_n =\delta \mathcal{E}_n. This is a greedy algorithm (always improves current error, longterm consequences be damned!). Gradient descent comes in several closely related varieties: online, batch, and mini-batch. Let’s start with the mini-batch. First the data is divided up into small random sets (say 10 data points each). Then we loop through the mini-batches, and for each one we calculate the output and error and update the weights. Online learning is when the mini-batches each contain exactly 1 data point, while batch learning is when the mini-batch is the whole dataset. The current standard is to use a mini-batch of between 10 to 100 which is a compromise between speed (batches are faster) and accuracy (online finds better solutions).

Putting it all together, the training algorithm is as follows

  1. Calculate the activation function and output with respect to a mini-batch of data
  2. Calculate the errors of the output
  3. Update the weights


And now you’ve got all the basics of a perceptron down! On to the more difficult questions…

Fundamental Questions

  • What are similarities and differences between a perceptron and a neuron? Do different activation functions lead to distinct interpretations?
  • What is connectivisim? How does this relate to the perceptron? How does this contrast with computers?
  • What class of learning algorithm is the perceptron? Possible answers: unsupervised, supervised, or reinforcement learning
  • What type of functions can a perceptron compute? Compare the standard OR gate vs the exclusive OR (XOR) gate for a perceptron with 2 weights.
  • Does the perceptron return a unique answer? Does the perceptron return the “best” answer (you need to define “best”)? Check out Support Vector Machines for one answer to the “best”.
  • Under what conditions can the perceptron generalize to data it has never seen before? Look into Rosenblatt’s “differentiated environment”.

Additional Questions

  • There are other possibilities for the error functions. Why is the cross-entropy a wise choice for the sigmoid activation?
  • The weight updates can be multiplied by a “learning rate” that controls the size of updates, while I implicitly assumed a learning rate of 1. How would you actually determine a sensible learning rate?
  • The standard learning algorithm puts no constraints on the possible weights. What are some possible problems with unconstrained weights? Can you think of a possible solution? How does this change the generalization properties of a perceptron?
  • Threshold activation functions produce simpler output (only two possible values) than sigmoid activation functions. Despite this simpler output, threshold activation functions are more difficult to train. Can you figure out why?
  • What is the information storage capacity of a perceptron? The exact answer is difficult, but you can get the right order of magnitude in the limit of large number of data points and large number of weights.