deep learning

Posted on July 13, 2017 General science, Research

Deep Learning Tips

I thought I would write up some general tips and tricks that I have learned by experimenting with neural networks. My focus is on tips that apply to any problem and any neural network architecture, and in fact, some of these tips apply more generally to any machine learning algorithm. So what have I learned over the years?

Data Splits

Before doing anything else, you need to split the dataset into training and testing. But how much data should go into each split? This depends on your number of samples and the number of classes. For example, MNIST has only 10 digits with little variation in each digit, so the standard split is around 80% train and 20% test. ImageNet has over a million samples of 1000 diverse classes, so they use around 50% train and 50% test. So if you have an easy problem and/or a small dataset, I would suggest 80% train and 20% test. If you have a very tough problem and/or a large dataset, I would suggest 50% train and 50% test.

The test data should now be put in a lock box and only used on your final model.

Next you should also set aside some of the training data for validation, which is used to estimate generalization when tuning hyperparameters. I would suggest using around 20% of the training data as a validation set.

Finally, I do a little bit of cheating and I data snoop. I usually take a very tiny amount of the data, maybe 1-5% and play around with it. I will inspect the data to make sure that it looks good, and use the small number of samples to debug my initial code and very roughly tune the hyperparameters. This saves you the headache of doing a long training session only to find out that you had a bug in your code or grossly misunderstood where to start your hyperparameter search.
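Here is a minimal sketch of the splits described above, using only the standard library. The function name `split_dataset`, the default fractions, and the choice to draw the small "snoop" subset from the training portion are assumptions for illustration, not a prescribed recipe.

```python
import random


def split_dataset(samples, test_frac=0.2, val_frac=0.2, snoop_frac=0.02, seed=0):
    """Shuffle once, then carve out test, validation, and a tiny 'snoop' subset."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)

    # The test split goes into the lock box first.
    n_test = int(round(test_frac * len(indices)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Validation is a fraction of the *training* data, not of the whole dataset.
    n_val = int(round(val_frac * len(train_idx)))
    val_idx, train_idx = train_idx[:n_val], train_idx[n_val:]

    # Tiny subset (assumed here to come from the training data) for debugging.
    n_snoop = max(1, int(round(snoop_frac * len(train_idx))))
    snoop_idx = train_idx[:n_snoop]

    def pick(idx):
        return [samples[i] for i in idx]

    return {
        "train": pick(train_idx),
        "val": pick(val_idx),
        "test": pick(test_idx),
        "snoop": pick(snoop_idx),
    }


splits = split_dataset(list(range(1000)))
```

With 1000 samples and the defaults, this gives 200 test, 160 validation, and 640 training samples, with 13 of the training samples doubling as the snoop subset.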

Data Preprocessing

As a general rule, the data should be standardized by preprocessing. I’ll discuss some specific standardizations below, but a general issue is whether to standardize by the whole dataset, per sample, or per feature. I tend to default to per sample, but I don’t have a good scientific reason why that is the best. If you standardize by the whole dataset or per feature, you need to make sure you only use the training data to set the scales. If you standardize per feature, make sure that all of your features have significant variation before doing so (see MNIST for an example where per feature standardization can lead to weird results since many features have a standard deviation of zero).

Mean

All numerical data should be mean centered, no questions asked. If your classes can be robustly classified just by the mean difference, then you don’t need a neural network. You have a very simple problem and should just use a simple threshold discriminator.

Scaling

I highly recommend scaling the data so that it is all order 1. This can speed up training because most weight initialization schemes assume that the data is mean centered and has values around the size of 1. There are two possible ways to scale your data: by the standard deviation or by the range. If your data looks normally distributed, then standard deviation makes sense. Otherwise I just divide by the maximum of the absolute value.
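As a rough sketch of per-sample mean centering and scaling with NumPy: the function name `standardize_per_sample`, the `method` switch, and the small `eps` guard against division by zero are all assumptions added for illustration.

```python
import numpy as np


def standardize_per_sample(x, method="std", eps=1e-8):
    """Mean center each row (sample), then scale it to be order 1."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=1, keepdims=True)
    if method == "std":
        # Reasonable when the values look roughly normally distributed.
        scale = centered.std(axis=1, keepdims=True)
    elif method == "maxabs":
        # Otherwise divide by the maximum absolute value.
        scale = np.abs(centered).max(axis=1, keepdims=True)
    else:
        raise ValueError("method must be 'std' or 'maxabs'")
    return centered / (scale + eps)


data = np.array([[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]])
by_std = standardize_per_sample(data, method="std")
by_range = standardize_per_sample(data, method="maxabs")
```

Because the statistics are computed per sample, nothing leaks from the test data into the training data. A per-feature or whole-dataset version would need to compute the mean and scale on the training split only.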

Correlations

In theory, it can also be helpful to remove correlations between features by using PCA or ZCA whitening. However, in practice you may run into numerical stability issues since you will need to invert a matrix. So this is worth considering, but takes some more careful application.
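A minimal ZCA whitening sketch with NumPy follows. The helper names `fit_zca` and `apply_zca` are hypothetical, and the small `eps` added to the eigenvalues is one common way (an assumption here, not the only option) to tame the stability problem when the covariance matrix is close to singular.

```python
import numpy as np


def fit_zca(train_x, eps=1e-5):
    """Estimate the mean and ZCA whitening matrix from training data only."""
    mean = train_x.mean(axis=0)
    centered = train_x - mean
    cov = centered.T @ centered / centered.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)  # covariance is symmetric
    # eps keeps near-zero eigenvalues from blowing up the inverse square root.
    whitener = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mean, whitener


def apply_zca(x, mean, whitener):
    """Whiten new data with statistics fitted on the training set."""
    return (x - mean) @ whitener


rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.5, 0.3]])
train_x = rng.normal(size=(500, 3)) @ mixing  # correlated toy features
mean, whitener = fit_zca(train_x)
white = apply_zca(train_x, mean, whitener)
white_cov = white.T @ white / white.shape[0]
```

On this toy data the covariance of the whitened features is close to the identity matrix. Larger values of `eps` trade exact decorrelation for stability.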

Data Augmentation

More training data is always better, but obtaining that data can be expensive. So I always try hard to find a way to do data augmentation. However, the correct data augmentation is usually problem specific, so I won’t go into details here.

Early Stopping

The no free lunch theorem of machine learning states that there is no general learning algorithm that will solve all problems. However, Geoff Hinton has pointed out that early stopping is as close to a free lunch as we can get. Early stopping is the easiest way for any machine learning algorithm to avoid overfitting, and you can read more about the technical justifications for it at Distill’s momentum article.
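One simple way to implement early stopping is a patience rule on the validation loss. The sketch below is framework independent; the function name `early_stopping_epoch` and the `patience` parameter are assumptions for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for per-epoch validation losses.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Ran out of epochs before the patience was exhausted.
    return best_epoch, len(val_losses) - 1


losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
best, stopped = early_stopping_epoch(losses, patience=3)
```

In a real training loop you would save the weights whenever `best_epoch` changes and restore them after stopping.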

Optimizer

SGD vs Adam

In practice, all optimizers for neural networks involve some form of stochastic gradient descent (SGD). The only question is whether you need to manually tune the learning rate and other parameters, or whether you use an adaptive version of SGD that automatically adjusts the learning rates. I think the best adaptive method is Adam (and Nadam when possible, see later subsection on momentum). So for me the choice is simple: either plain SGD or Adam/Nadam. For a more complete comparison of SGD variants, I highly recommend this blog post.

Learning Rate

If you are using Adam, you will rarely need to tune the learning rate. But for SGD, the learning rate is by far the most important parameter to tune. A nice tip from Yoshua Bengio is this: the optimal learning rate is often an order of magnitude lower than the smallest learning rate that blows up the loss. So this means, start with a high learning rate and work your way down a half order of magnitude at a time (for example: 1, 0.3, 0.1, …). Then start your fine grained learning rate search about an order of magnitude below the last time the loss blew up.
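The coarse search can be sketched as follows. Both helper names are hypothetical, and the "blew up" flags would come from your own training runs.

```python
def coarse_learning_rates(start=1.0, n_steps=7):
    """Half-order-of-magnitude grid: start, 0.3*start, 0.1*start, ..."""
    rates = []
    for k in range(n_steps):
        exponent = -(k // 2)
        mantissa = 1.0 if k % 2 == 0 else 0.3
        rates.append(start * mantissa * 10.0 ** exponent)
    return rates


def fine_search_start(rates, blew_up):
    """Return a rate one order of magnitude below the smallest rate that diverged.

    `blew_up` is a list of booleans parallel to `rates`.
    """
    diverged = [lr for lr, bad in zip(rates, blew_up) if bad]
    if not diverged:
        return None  # nothing diverged, so try a higher starting rate
    return min(diverged) / 10.0


rates = coarse_learning_rates()
# Pretend the loss blew up for 1, 0.3, and 0.1 but not for smaller rates.
suggested = fine_search_start(rates, [True, True, True, False, False, False, False])
```

Here the smallest diverging rate is 0.1, so the fine grained search would start around 0.01.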

Another useful tweak on the learning rate is to have it decay over the course of training. I find that this slightly improves the final performance, but more importantly leads to consistent training results. There are a variety of ways to implement the decay, but I’m not sure they make that much of a difference. My standard implementation is

$l_{batch} = \frac{l_{start}}{1 + decay \cdot N_{batches}}$

where $N_{batches}$ is the number of minibatches seen so far during training. I then set decay so that the final learning rate at the end of all the epochs is 1/10th the starting learning rate.
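Solving the formula for decay with a final learning rate of one tenth the starting rate gives decay = 9 / (total number of minibatches). A small sketch, where the function names and the epoch and batch counts are illustrative assumptions:

```python
def decay_for_final_ratio(total_batches, final_ratio=0.1):
    """Solve l_start / (1 + decay * total_batches) = final_ratio * l_start."""
    return (1.0 / final_ratio - 1.0) / total_batches


def learning_rate(l_start, decay, n_batches):
    """Learning rate after n_batches minibatches have been seen."""
    return l_start / (1.0 + decay * n_batches)


total = 50 * 600  # e.g. 50 epochs of 600 minibatches each (made-up numbers)
decay = decay_for_final_ratio(total)
final_lr = learning_rate(0.1, decay, total)
```

With these numbers decay is 0.0003 and the learning rate falls from 0.1 to 0.01 over training.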

Momentum

Momentum is very useful for neural networks, but in practice I spend minimal time tuning the momentum rate because I have a few default settings that I strongly recommend.

First, I really only consider three possible momentum values: 0.5, 0.9, and 0.99. Since the maximum effect of momentum is $\frac{1}{1-momentum}$ , my default values are roughly spaced by an order of magnitude. I always start with 0.9 and go from there.
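One way to see where the maximum effect comes from, assuming the classical momentum update $v_k = m \, v_{k-1} + g$ with a constant gradient $g$ (a simplification for illustration): the velocity is a geometric series,

$v_\infty = g \sum_{k=0}^{\infty} m^k = \frac{g}{1-m}$

so momentum values of 0.5, 0.9, and 0.99 amplify a persistent gradient by factors of 2, 10, and 100 respectively.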

Also, I always choose Nesterov momentum whenever possible. Most packages, like Keras, have Nesterov as an option for SGD, and Keras also has Nadam, which is Adam with Nesterov momentum. For more details on Nesterov, see here. The short explanation is that it leads to the same maximum effect of $\frac{1}{1-momentum}$ , but it does so in a more gradual manner. In practice, this means that while standard momentum gets very unstable above 0.9, Nesterov momentum can be safely set to 0.99.

Another useful tip is to set the momentum to a smaller value (say half your standard value) for the final few epochs (maybe the last 5-10% of epochs). The intuition for why this is helpful is that hopefully by the end of training, the neural network is close to good weights, but it might be rocking back and forth around the optimal weights. Since the neural network weight space is highly non-convex, by tuning down the momentum, you force the neural network to settle down into these non-convex “valleys” that may contain the best weights.

The final tip, originally suggested here, is to exponentially ramp up and down the momentum anytime you want to change the momentum rate during training. This gives the weight updates time to adjust to the new momentum rates. I personally have found this gives a very slight improvement in performance, but more importantly it leads to consistent training results.

Summary of my momentum tips:

Peak momentum values of: 0.5, 0.9, or 0.99
Always choose Nesterov momentum if possible
Start momentum initially at half the desired peak value and exponentially ramp up
Towards the end of training, exponentially ramp down momentum to half the desired peak value.
Train for 5-10% of epochs at the desired smaller momentum.
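The summary above can be sketched as a per-epoch momentum schedule. Several details are assumptions made for illustration: the function name `momentum_schedule`, the length of each ramp (`ramp_frac`), and reading "exponentially ramp" as geometric interpolation between the two momentum values.

```python
def momentum_schedule(n_epochs, peak=0.9, ramp_frac=0.1, final_frac=0.1):
    """Per-epoch momentum: ramp up, hold at peak, ramp down, hold at peak / 2."""
    low = peak / 2.0
    n_ramp = max(1, int(round(ramp_frac * n_epochs)))
    n_final = max(1, int(round(final_frac * n_epochs)))
    n_hold = n_epochs - 2 * n_ramp - n_final
    if n_hold < 0:
        raise ValueError("n_epochs is too small for the requested ramps")

    def ramp(start, end, steps):
        # Geometric interpolation that reaches `end` on the last step.
        return [start * (end / start) ** ((i + 1) / steps) for i in range(steps)]

    return (
        ramp(low, peak, n_ramp)      # ramp up from half the peak value
        + [peak] * n_hold            # bulk of training at the peak value
        + ramp(peak, low, n_ramp)    # ramp down towards the end
        + [low] * n_final            # last 5-10% of epochs at the smaller value
    )


schedule = momentum_schedule(100)
```

For 100 epochs and the defaults this gives 10 epochs ramping up, 70 at the peak of 0.9, 10 ramping down, and the final 10 at 0.45.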

Initialization

All weights should be initialized to an orthogonal matrix. This is extremely important for recurrent neural networks (as explained here), but I have also found it to be useful for all neural networks.
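For a sense of what orthogonal initialization involves, here is a minimal NumPy sketch based on a QR decomposition of a random Gaussian matrix. The function name `orthogonal_init` is hypothetical; libraries such as Keras ship their own orthogonal initializers, which may differ in details.

```python
import numpy as np


def orthogonal_init(n_in, n_out, seed=0):
    """Return an (n_in, n_out) weight matrix with orthonormal rows or columns."""
    rng = np.random.default_rng(seed)
    # Factor a tall random matrix so that q has orthonormal columns.
    a = rng.normal(size=(max(n_in, n_out), min(n_in, n_out)))
    q, r = np.linalg.qr(a)
    # Common convention: flip column signs to remove the sign ambiguity of QR.
    q = q * np.sign(np.diag(r))
    return q if n_in >= n_out else q.T


w_tall = orthogonal_init(64, 32)   # columns are orthonormal
w_wide = orthogonal_init(16, 48)   # rows are orthonormal
```

A non-square weight matrix cannot be fully orthogonal, so the sketch makes the shorter dimension orthonormal, which is the usual compromise.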

Activation Function

The standard is that all hidden layers are ReLUs unless you need the hidden layers to be a valid probability, in which case you should use a sigmoid.

Loss

Choosing the right loss function is very problem dependent, so I will leave that for another day. However, whatever loss function you do choose, make sure the output layer activation function is complementary to that loss; see Michael Nielsen’s book for details on why sigmoid outputs and cross-entropy losses are complementary.

Regularization

Weights

Weight regularization is almost always a requirement to prevent overfitting and to get good generalization. The two main choices are L1 or L2 regularization. L1 will ensure that small weights are set to zero, and hence will lead to a sparser set of weights. L2 prevents weights from becoming too large, but does not sparsify the weights. Personally, rather than choosing between the two, I tend to default to both. I set L1 to be very small so that I at least get slightly sparser weights, but then I mainly focus on tuning L2 to control overfitting.
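A combined L1 and L2 penalty, and its (sub)gradient, can be sketched in a few lines of NumPy. The function names and the default strengths are assumptions for illustration; the only point carried over from the text is that L1 is set much smaller than L2.

```python
import numpy as np


def l1_l2_penalty(weights, l1=1e-6, l2=1e-4):
    """Penalty added to the loss: l1 * sum|w| + l2 * sum(w^2)."""
    w = np.asarray(weights, dtype=float)
    return l1 * np.abs(w).sum() + l2 * np.square(w).sum()


def l1_l2_penalty_grad(weights, l1=1e-6, l2=1e-4):
    """(Sub)gradient of the penalty with respect to the weights."""
    w = np.asarray(weights, dtype=float)
    # L1 pushes every weight towards zero by a constant amount,
    # while L2 shrinks each weight in proportion to its size.
    return l1 * np.sign(w) + 2.0 * l2 * w


penalty = l1_l2_penalty([1.0, -2.0], l1=0.1, l2=0.01)
grad = l1_l2_penalty_grad([1.0, -2.0], l1=0.1, l2=0.01)
```

The constant push from the L1 term is what drives small weights to exactly zero, whereas the L2 term mostly affects large weights.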

Activity

Dropout and batch normalization are not regularizers in the traditional sense, but in practice they help reduce overfitting by controlling the activation outputs. Additionally, it is extremely difficult to train very deep neural networks without using either dropout or batchnorm. Dropout was the standard for several years, but now it is usually replaced by batchnorm.

Parameter Tuning

Neural networks have a lot of interdependent hyperparameters to tune, so picking which ones to tune first is kind of a chicken and the egg problem. Personally, I start off with an adaptive optimizer (like Adam or Nadam) and then tune the architecture. Next I will roughly tune the regularization. Once that leads to acceptable results, I will switch the optimizer to SGD and only focus on tuning the learning rate. If SGD seems promising, I will then tune other parameters like decay and momentum. Hopefully by this point, you are achieving pretty good results. I will then use this neural network as the starting point for a systematic hyperparameter search to truly find the best results.

Final Tips

Don’t take my word for anything, try it out yourself! I strongly recommend experimenting with every option you can find in Keras and seeing for yourself what actually works. I also suggest getting opinions from as many people as possible (see Yoshua Bengio’s tips). I think that about 90% of the advice will overlap, but everyone has their own bias. So hopefully by reading enough independent sources, you can average out all our mistakes. Good luck!

Posted on February 23, 2017 Experts, Research

Deep Learning Seminar Course

This semester Terry Sejnowski is teaching a graduate seminar course that is focused on Deep Learning. The course meets weekly for two hours to discuss papers. Here I’ll just outline the course and in later posts I’ll add some thoughts on each specific week.

Week 1: Perceptrons

Week 2: Hopfield Nets and Boltzmann Machines

Week 3: Backprop

Week 4: Independent Component Analysis (ICA)

Week 5: Convolutional Neural Networks (CNN)

Week 6: Recurrent Neural Networks (RNN)

Week 7: Reinforcement Learning

Week 8: Information and Control Theory

Posted on August 20, 2016 General science, Research, Teaching

Best Machine Learning Resources

Machine learning is a rapidly evolving field that is generating an intense interest from a wide audience. So how can you get started?

For now, I’m going to assume that you already have the basic programming (ie general introduction to programming and experience with matrices) and mathematical skills (calculus and some probability and linear algebra).

These are the best current books on machine learning:

Murphy. This is a comprehensive introduction to the whole field.
Learning From Data. This is a brief introduction to a subset of topics.
Deep Learning. Also check out my previous post.

These are some out of date books that still contain some useful sections (for example, Murphy several times refers you to Bishop or MacKay for more details).

Bishop. Predecessor to Murphy.
MacKay. Free pdf!
Hastie, Tibshirani, and Friedman. Free pdf!

Here is a list of other potential resources:

Posted on June 15, 2016 Experts, Research

Deep Learning in Python

So maybe after reading some of my past posts, you are fired up to start programming a deep neural network in Python. How should you get started?

If you want to be able to run anything but the simplest neural networks on easy problems, you will find that pure Python, being an interpreted language, is too slow. Does that mean we have to give up and write our own C++ code? Luckily GPUs and other programmers come to your rescue by offering a 5-100X speedup (I would estimate my average speedup at 10X, but it varies for specific tasks).

There are two main Python packages, Theano and TensorFlow, that are designed to let you write Python code that can either run on a CPU or a GPU. In essence, they are each their own mini-language with the following changes from standard Python:

Tensors (generalizations of matrices) are the primary variable type and treated as abstract mathematical objects (don’t need to specify actual values immediately).
Computational graphs are utilized to organize operations on the tensors.
When one wants to actually evaluate the graph on some data, it is stored in a shared variable that when possible gets sent to the GPU. This data is then processed by the graph (in place of the original tensor placeholders).
Automatic differentiation (ie it understands derivatives symbolically).
Built in numerical optimizations.

So to get started you will want to install either Theano (pip install theano), TensorFlow (details here), or both. I personally have only used Theano, but if Google keeps up the developmental progress of TensorFlow, I may end up switching to it.

At the end of the day, that means that if one wants to actually implement neural networks in Theano or TensorFlow, you essentially will learn another language. However, people have built various libraries that are abstractions on top of these mini-languages. Lasagne is one example that basically organizes Theano code so that you have to interact less with Theano, but you will still need to understand Theano. I initially started with Theano and Lasagne, but I am now a convert to Keras.

I advocate for Keras (pip install keras) for two major reasons:

High level abstraction. You can write standard Python code and get a deep neural network up and running very quickly.
Back-end agnostic. Keras can run on either Theano or TensorFlow.

So it seems like a slam dunk, right? Unfortunately life is never that simple, and there are two catches:

Mediocre documentation (using Numpy as a gold standard, or even comparing to Lasagne). You can get the standard things up and running based on their docs. But if you want to do anything advanced, you will find yourself looking into their source code on GitHub, which has some hidden, but useful, comments.
Back-end agnostic. This means if you do want to introduce a modification to the back-end, and you want it to always work in Keras, you need to implement it in both Theano and TensorFlow. In practice this isn’t too bad since Keras has done a good job of implementing low-level operations.

Fortunately, the pros definitely outweigh the cons for Keras and I highly endorse it. Here are a few tips I have learned from my experience with Keras:

Become familiar with the Keras documentation.
I recommend only using the functional API which allows you to implement more complicated networks. The sequential API allows you to write simple models in fewer lines of code, but you lose flexibility (for example, you can’t access intermediate layers) and the code won’t generalize to complex models. So just embrace the functional API.
Explore the examples (here and here).
Check out the Keras GitHub.
Names for layers are optional keywords, but definitely use them! It will significantly help you when you are debugging.

Now start coding your own deep neural networks!

Posted on May 22, 2016 General science, Research

Deep Learning: 0-60 in a few hours?

Here, I will try to outline the fastest possible path to go from zero understanding of deep learning to an understanding of the basic ideas. In a follow up post, I’ll outline some deep learning packages where you could actually implement these ideas.

I think by far the best introduction to deep learning is Michael Nielsen’s ebook. Before you get started with it, I think the minimum required mathematics includes an understanding of the following:

Vector and Matrix multiplication – especially when written in summation notation
Exponents and Logarithms
Derivatives and Partial Derivatives
Probability, mainly Bayes Theorem (not actually needed for Michael Nielsen’s book, but it is essential for later topics)

I really think that if you understand those mathematical topics, you can start reading the ebook.

Here is my proposed learning strategy. Iterate between reading the ebook (Chapters 1-5 only) and playing with this cool interactive neural network every time a new idea is mentioned. For a first pass, just read the ebook and don’t do the exercises or worry about actual code implementation. Additionally, chapter 6 introduces convolutional neural networks which are a more advanced topic that can be saved for later.

Once you have some intuition about neural networks, I recommend reading this review by several of the big names in deep learning. This will give you a flavor of the current status of the field.

Now you are ready to start coding!

PS. If you want to get into more advanced deep learning topics, check out my previous Deep Learning Unit. And to really get up to speed on research, there is a deep learning book that should be published soon.

Posted on March 6, 2016 General science, Research, Teaching

Deep Learning

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

Unit: Deep Learning

Papers

Deep Learning. By LeCun, Bengio, and Hinton in 2015.

Deep learning in neural networks: An overview. By Schmidhuber in 2015.

Other Useful References

Deep Learning – Wikipedia
Deep Learning Book (Under Development)
Michael Nielsen EBook

What is deep learning?

In previous weeks we have introduced perceptrons, multilayer perceptrons (MLP), Hopfield neural networks, and Boltzmann machines.

But what is deep learning? I think it is really two things:

Successful training of multilayered neural networks that perform better (higher classification accuracy, etc) and involve more layers than previous implementations
Just a rebranding of neural networks

Here is my summary of the history of deep learning, see both reviews above for extended details. In 2006, Deep Belief Networks (DBN) were introduced in two papers (Reducing the Dimensionality of Data with Neural Networks and A Fast Learning Algorithm for Deep Belief Nets). The idea of a DBN is to train a series of restricted Boltzmann machines (RBM). The first RBM is trained and given the original data, produces an output of hidden layer activations. The second RBM uses the first RBM’s hidden layer activations as inputs, and trains on that “data”. This is continued to the desired depth. At this point, the DBN can be used for unsupervised learning, or one can use it as pretraining for a MLP which will utilize backpropagation for supervised learning.

So on the one hand, there were technical breakthroughs that enabled neural networks to utilize more layers than previous iterations and achieve state of the art performance. However, the actual components (RBMs and MLPs) have been around since the 1980s, so it would also be fair to deem deep learning a rebranding of neural networks.

Therefore, neural network winter (mid 90s to mid 00s) officially ended in 2006. I propose calling 2006-2012 neural network spring. While interest in neural networks increased and new advances were made, the general machine learning community was not obsessed with deep learning. That changed in 2012 when the neural network summer began. This paper presented at NIPS revolutionized the computer vision community by cutting the error rate on Imagenet in half! The Imagenet challenge was viewed as a serious benchmark that all computer vision systems should address. By blowing previous results out of the water, the revolution was completed.

So for now, enjoy neural network summer, but always remember, winter is coming.

Fundamental Questions

When do extra layers help in a neural network? When do they hurt?
Why was pretraining originally needed, but is no longer used in practice? Check out these papers for details: Glorot and Saxe.
Learn about convolutional and recurrent neural networks. These are extremely popular right now!

Advanced Questions

Do research on unsupervised learning! It is definitely less popular today, but all the big shots think it is the long-term future of neural networks.

Posted on January 21, 2016, updated March 6, 2016 General science, Research, Teaching

Perceptron

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

Unit: Deep Learning

Paper

The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain by Rosenblatt in 1958

Other Useful References

Wikipedia
MacKay, Ch 39 and 40.
Nielsen, mainly Ch 1.
Goodfellow, Bengio, and Courville. Kind of covered in Ch 6.

Motivation for Perceptron

I’ll let Rosenblatt introduce the important questions leading to the perceptron himself by quoting his first paragraph:

If we are eventually to understand the capability of higher organisms for perceptual recognition, generalization, recall, and thinking, we must first have answers to three fundamental questions:

How is information about the physical world sensed, or detected, by the biological system?

In what form is information stored, or remembered?

How does information in storage, or in memory, influence recognition and behavior?

The perceptron is a first attempt to answer the second and third questions. In the years leading up to the perceptron, there were two dominant themes of theoretical research on the brain. One focused on the general computational properties of the brain (McCulloch and Pitts 1943) and showed that simple binary neurons could form a computer (ie they can compute any possible function). Another theme focused on abstracting away the details of experiments to get at general principles that relate to computation in the brain (Hebb 1949 and his synapse learning rules).

The perceptron opened up a third avenue of theoretical research. The central goal is to devise neuron-inspired algorithms that learn from real data and can be used to make a decision.

What is a Perceptron?

Basics

I find the math in the original perceptron paper pretty confusing. This is partly due to a generational difference in terminology, and partly due to poor explanations in the paper. This is definitely a paper that benefited from the passage of time and future synthesis into a more concise topic. Therefore, I recommend focusing attention on the introduction and conclusion, while below I’ll introduce the modern notation of the perceptron (see MacKay Ch 39/40 for similar details).

Perceptron

The perceptron consists of a set of inputs, $x$ , that are fed into the perceptron, with each input receiving its own weight, $w$ . The activity of the perceptron is given by the weighted sum $a = \sum_i w_i x_i$

Note that the perceptron can have a bias that is independent of inputs. However, we don’t need to write this out separately and can instead include an input that is always set to 1, independently of the data.

This activity is then evaluated by the activation function, $f(a)$ , to determine the output, $y$ . There are lots of different possible activation rules with some popular ones including

Linear:

$y(a) = a$

Rectified Linear:

$\begin{aligned} y &= 0 \quad\text{if}\quad a \leq 0 \\ y&= a\quad\text{if}\quad a>0 \end{aligned}$

Sigmoid:

$y(a) = \frac{1}{1+\exp{(-a)}}$

Threshold:

$\begin{aligned} y &= 0 \quad\text{if}\quad a \leq 0 \\ y&= 1\quad\text{if}\quad a>0 \end{aligned}$

The end result is that we can take the output of a perceptron and use this output to make a decision. The sigmoid and threshold activation functions return an answer between 0 and 1 and hence have a natural interpretation as a probability. From now on, we will work with sigmoid activation functions.
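The four activation rules listed above translate directly into Python. The function names and the helper `perceptron_output` are illustrative choices, and the example weights and inputs are made up.

```python
import math


def linear(a):
    return a


def rectified_linear(a):
    return a if a > 0 else 0.0


def sigmoid(a):
    # Fine for moderate a; very negative a would overflow math.exp.
    return 1.0 / (1.0 + math.exp(-a))


def threshold(a):
    return 1.0 if a > 0 else 0.0


def perceptron_output(weights, inputs, activation=sigmoid):
    """Weighted sum of the inputs followed by the activation function."""
    a = sum(w * x for w, x in zip(weights, inputs))
    return activation(a)


# The last input is fixed at 1, so the last weight plays the role of the bias.
y_linear = perceptron_output([1.0, -1.0, 0.5], [2.0, 1.0, 1.0], activation=linear)
y_sigmoid = perceptron_output([1.0, -1.0, 0.5], [2.0, 1.0, 1.0])
```

Here the activity is 2 - 1 + 0.5 = 1.5, which the sigmoid maps to a value between 0.5 and 1.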

Training

Now that the basics of a perceptron have been introduced, how do we actually train it? In other words, if I gave you a set of data, $X$ , where each entry $x_n$ is $N$ dimensional, how would I evaluate the perceptron’s handling of the data? For now, we will focus on using the perceptron as a binary classifier (only need to decide between two groups: 0 and 1). Since we are using sigmoid activation functions, we can interpret the output as the probability that the data is in group 1.

The standard way to train a binary classifier is to have a training set which consists of pairs of data, $x_n$ , and correct labels, $t_n$ . Then training proceeds by seeing if the output label of a perceptron matches the correct label. If everything is correct, perfect! If not, we need to do something with that error.

For a sigmoid activation, the commonly used error function is the cross-entropy:

$\mathcal{E} = - \sum_n \left[ t_n \ln y_n + \left(1-t_n \right) \ln \left(1-y_n \right) \right]$

The output $y$ is a function of the weights $w$ . We can then take the derivative of the error with respect to the weights, which in the case of the sigmoid activation and cross-entropy error is simply $\delta \mathcal{E}_n = -\left(t_n - y_n\right) x_n$ .
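For readers who want the intermediate steps, the result follows from the chain rule applied to a single term $\mathcal{E}_n$ of the sum, writing $a_n$ for the activity on sample $n$ :

$\frac{\partial \mathcal{E}_n}{\partial w} = \frac{\partial \mathcal{E}_n}{\partial y_n} \frac{\partial y_n}{\partial a_n} \frac{\partial a_n}{\partial w}$

$\frac{\partial \mathcal{E}_n}{\partial y_n} = -\frac{t_n}{y_n} + \frac{1-t_n}{1-y_n} = -\frac{t_n - y_n}{y_n \left(1-y_n\right)}, \qquad \frac{\partial y_n}{\partial a_n} = y_n \left(1-y_n\right), \qquad \frac{\partial a_n}{\partial w} = x_n$

The factor $y_n \left(1-y_n\right)$ from the sigmoid cancels the denominator from the cross-entropy, leaving $-\left(t_n - y_n\right) x_n$ .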

The simplest possible update algorithm is to perform gradient descent on the weights, stepping against the gradient: $\Delta w_n = -\delta \mathcal{E}_n = \left(t_n - y_n\right) x_n$ . This is a greedy algorithm (always improves current error, longterm consequences be damned!). Gradient descent comes in several closely related varieties: online, batch, and mini-batch. Let’s start with the mini-batch. First the data is divided up into small random sets (say 10 data points each). Then we loop through the mini-batches, and for each one we calculate the output and error and update the weights. Online learning is when the mini-batches each contain exactly 1 data point, while batch learning is when the mini-batch is the whole dataset. The current standard is to use a mini-batch of between 10 and 100 data points, which is a compromise between speed (batches are faster) and accuracy (online finds better solutions).

Putting it all together, the training algorithm is as follows

Calculate the activation function and output with respect to a mini-batch of data
Calculate the errors of the output
Update the weights
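The three steps above can be sketched with NumPy as a mini-batch training loop for a sigmoid perceptron. The function names, the toy OR-gate data, the batch size of 2, and the epoch count are assumptions chosen to keep the example tiny; the learning rate is left at 1.

```python
import numpy as np


def add_bias_input(x):
    """Append a constant input of 1 so the last weight acts as the bias."""
    return np.hstack([x, np.ones((x.shape[0], 1))])


def train_perceptron(x, t, n_epochs=500, batch_size=2, learning_rate=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = add_bias_input(x)
    w = np.zeros(x.shape[1])
    for _ in range(n_epochs):
        order = rng.permutation(x.shape[0])  # new random mini-batches each epoch
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            y = 1.0 / (1.0 + np.exp(-(x[batch] @ w)))   # 1. sigmoid output
            error = t[batch] - y                          # 2. error of the output
            w += learning_rate * error @ x[batch]         # 3. step against the gradient
    return w


def predict(w, x):
    return 1.0 / (1.0 + np.exp(-(add_bias_input(x) @ w)))


# OR gate: output is 1 unless both inputs are 0.
x_or = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t_or = np.array([0.0, 1.0, 1.0, 1.0])
w_or = train_perceptron(x_or, t_or)
labels = (predict(w_or, x_or) > 0.5).astype(int)
```

Setting `batch_size=1` gives online learning and `batch_size=len(x)` gives batch learning. Swapping in the XOR labels is a quick way to explore what a single perceptron cannot learn.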

And now you’ve got all the basics of a perceptron down! On to the more difficult questions…

Fundamental Questions

What are similarities and differences between a perceptron and a neuron? Do different activation functions lead to distinct interpretations?
What is connectionism? How does this relate to the perceptron? How does this contrast with computers?
What class of learning algorithm is the perceptron? Possible answers: unsupervised, supervised, or reinforcement learning
What type of functions can a perceptron compute? Compare the standard OR gate vs the exclusive OR (XOR) gate for a perceptron with 2 weights.
Does the perceptron return a unique answer? Does the perceptron return the “best” answer (you need to define “best”)? Check out Support Vector Machines for one answer to the “best”.
Under what conditions can the perceptron generalize to data it has never seen before? Look into Rosenblatt’s “differentiated environment”.

Additional Questions

There are other possibilities for the error functions. Why is the cross-entropy a wise choice for the sigmoid activation?
The weight updates can be multiplied by a “learning rate” that controls the size of updates, while I implicitly assumed a learning rate of 1. How would you actually determine a sensible learning rate?
The standard learning algorithm puts no constraints on the possible weights. What are some possible problems with unconstrained weights? Can you think of a possible solution? How does this change the generalization properties of a perceptron?
Threshold activation functions produce simpler output (only two possible values) than sigmoid activation functions. Despite this simpler output, threshold activation functions are more difficult to train. Can you figure out why?
What is the information storage capacity of a perceptron? The exact answer is difficult, but you can get the right order of magnitude in the limit of large number of data points and large number of weights.
