Deep Learning Tips

I thought I would write up some general tips and tricks that I have learned by experimenting with neural networks. My focus is on tips that apply to any problem and any neural network architecture, and in fact, some of these tips apply more generally to any machine learning algorithm. So what have I learned over the years?

Data Splits

Before doing anything else, you need to split the dataset into training and testing. But how much data should go into each split? This depends on your number of samples and the number of classes. For example, MNIST has only 10 digits with little variation in each digit, so the standard split is around 80% train and 20% test. ImageNet has over a million samples of 1000 diverse classes, so they use around 50% train and 50% test. So if you have an easy problem and/or a small dataset, I would suggest 80% train and 20% test. If you have a very tough problem and/or a large dataset, I would suggest 50% train and 50% test.

The test data should now be put in a lock box and only used on your final model.

Next, you should also set aside some of the training data for validation, which is used to estimate generalization when tuning hyperparameters. I would suggest using around 20% of the training data for validation.
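The splits described above can be sketched in a few lines. This is only an illustration: the function name `split_indices`, the fixed seed, and the 1000-sample dataset are assumptions made for the example, and the fractions are the 20% defaults suggested in the text.

```python
import random

def split_indices(n_samples, test_frac=0.2, val_frac=0.2, seed=0):
    """Shuffle sample indices and split them into train / validation / test.

    test_frac is a fraction of the whole dataset; val_frac is a fraction of
    what remains after the test set has been removed.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # The test set comes off first and goes into the "lock box".
    n_test = int(round(n_samples * test_frac))
    test = indices[:n_test]
    remaining = indices[n_test:]

    # The validation set is carved out of the remaining training data.
    n_val = int(round(len(remaining) * val_frac))
    val = remaining[:n_val]
    train = remaining[n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 640 160 200
```

Splitting indices rather than the data itself keeps the sketch independent of how the samples are stored.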

Finally, I do a little bit of cheating and I data snoop. I usually take a very tiny amount of the data, maybe 1-5% and play around with it. I will inspect the data to make sure that it looks good, and use the small number of samples to debug my initial code and very roughly tune the hyperparameters. This saves you the headache of doing a long training session only to find out that you had a bug in your code or grossly misunderstood where to start your hyperparameter search.

Data Preprocessing

As a general rule, the data should be standardized by preprocessing. I’ll discuss some specific standardizations below, but a general issue is whether to standardize by the whole dataset, per sample, or per feature. I tend to default to per sample, but I don’t have a good scientific reason why that is the best. If you standardize by the whole dataset or per feature, you need to make sure you only use the training data to set the scales. If you standardize per feature, make sure that all of your features have significant variation before doing so (see MNIST for an example where per feature standardization can lead to weird results since many features have a standard deviation of zero).
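For the per feature case, here is a rough sketch of fitting the scales on the training data only and then reusing them on other splits. The helper names (`fit_feature_scaler`, `apply_feature_scaler`) and the `min_std` cutoff are assumptions for illustration. Features with essentially no variation are left unscaled, which is one possible way to handle the zero standard deviation problem mentioned above.

```python
def fit_feature_scaler(train_rows, min_std=1e-6):
    """Compute per-feature mean and standard deviation from training rows only."""
    n = len(train_rows)
    columns = list(zip(*train_rows))
    means = [sum(col) / n for col in columns]
    stds = []
    for col, mean in zip(columns, means):
        variance = sum((x - mean) ** 2 for x in col) / n
        std = variance ** 0.5
        # Features with (near) zero variation keep a scale of 1 to avoid
        # dividing by zero.
        stds.append(std if std > min_std else 1.0)
    return means, stds

def apply_feature_scaler(rows, means, stds):
    """Standardize any split using statistics fitted on the training split."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

train_rows = [[0.0, 1.0, 5.0], [0.0, 3.0, 5.0], [0.0, 5.0, 5.0]]
val_rows = [[0.0, 3.0, 6.0]]

means, stds = fit_feature_scaler(train_rows)
scaled_val = apply_feature_scaler(val_rows, means, stds)
print(scaled_val)  # [[0.0, 0.0, 1.0]]
```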


All numerical data should be mean centered, no questions asked. If your classes can be robustly classified just by the mean difference, then you don’t need a neural network. You have a very simple problem and should just use a simple threshold discriminator.


I highly recommend scaling the data so that it is all order 1. This can speed up training because most weight initialization schemes assume that the data is mean centered and has values around the size of 1. There are two possible ways to scale your data: by the standard deviation or by the range. If your data looks normally distributed, then the standard deviation makes sense. Otherwise I just divide by the maximum of the absolute value.
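A minimal per sample sketch of the two scaling options, combined with mean centering. The function name `standardize_sample`, the `method` argument, and the small `eps` guard against constant samples are assumptions added for the example.

```python
def standardize_sample(sample, method="std", eps=1e-8):
    """Mean-center one sample, then scale it so its values are order 1.

    method="std"    divides by the standard deviation (roughly normal data)
    method="maxabs" divides by the maximum absolute value (everything else)
    """
    n = len(sample)
    mean = sum(sample) / n
    centered = [x - mean for x in sample]
    if method == "std":
        scale = (sum(x * x for x in centered) / n) ** 0.5
    elif method == "maxabs":
        scale = max(abs(x) for x in centered)
    else:
        raise ValueError(f"unknown method: {method}")
    # eps keeps the division finite when every value in the sample is equal.
    return [x / (scale + eps) for x in centered]

by_std = standardize_sample([1.0, 2.0, 3.0], method="std")
by_max = standardize_sample([1.0, 2.0, 3.0], method="maxabs")
print(by_std)
print(by_max)
```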


In theory, it can also be helpful to remove correlations between features by using PCA or ZCA whitening. However, in practice you may run into numerical stability issues since you will need to invert a matrix. So this is worth considering, but takes some more careful application.

Data Augmentation

More training data is always better, but obtaining that data can be expensive. So I always try hard to find a way to do data augmentation. However, the correct data augmentation is usually problem specific, so I won’t go into details here.

Early Stopping

The no free lunch theorem of machine learning states that there is no general learning algorithm that will solve all problems. However, Geoff Hinton has pointed out that early stopping is as close to a free lunch as we can get. Early stopping is the easiest way for any machine learning algorithm to avoid overfitting, and you can read more about the technical justifications for it at Distill’s momentum article.
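One common way to implement early stopping is to watch the validation loss and stop once it has not improved for a set number of epochs. The sketch below assumes that variant; the class name, the `patience` and `min_delta` parameters, and the list of validation losses are all made up for illustration.

```python
class EarlyStopping:
    """Stop training once the validation loss stops improving."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # smallest change that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def update(self, epoch, val_loss):
        """Record this epoch's validation loss; return True when it is time to stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses standing in for a real training loop.
val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
stopper = EarlyStopping(patience=3)
for epoch, loss in enumerate(val_losses):
    if stopper.update(epoch, loss):
        break

print(epoch, stopper.best_epoch)  # 5 2
```

In a real run you would also save the weights at `best_epoch` and restore them after stopping.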


SGD vs Adam

In practice, all optimizers for neural networks involve some form of stochastic gradient descent (SGD). The only question is whether you need to manually tune the learning rate and other parameters, or whether you use an adaptive version of SGD that automatically adjusts the learning rates. I think the best adaptive method is Adam (and Nadam when possible, see later subsection on momentum). So for me the choice is simple: either plain SGD or Adam/Nadam. For a more complete comparison of SGD variants, I highly recommend this blog post.

Learning Rate

If you are using Adam, you will rarely need to tune the learning rate. But for SGD, the learning rate is by far the most important parameter to tune. A nice tip from Yoshua Bengio is this: the optimal learning rate is often an order of magnitude lower than the smallest learning rate that blows up the loss. So this means, start with a high learning rate and work your way down a half order of magnitude at a time (for example: 1, 0.3, 0.1, …). Then start your fine grained learning rate search about an order of magnitude below the last time the loss blew up.
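The coarse search can be sketched as follows. Everything here is a toy: `diverges` stands in for "train briefly and check whether the loss blew up" by running gradient descent on a one-dimensional quadratic, and the function names and the `curvature` value are assumptions for the example.

```python
import math

def half_decade_grid(start=1.0, n=8):
    """Learning rates stepping down half an order of magnitude: 1, 0.3, 0.1, 0.03, ..."""
    grid = []
    for i in range(n):
        mantissa = 1.0 if i % 2 == 0 else 0.3
        grid.append(start * mantissa * 10.0 ** -(i // 2))
    return grid

def diverges(lr, curvature=10.0, steps=50):
    """Toy check for a blown-up loss: gradient descent on 0.5 * curvature * w**2."""
    w = 1.0
    for _ in range(steps):
        w -= lr * curvature * w
        if not math.isfinite(w) or abs(w) > 1e6:
            return True
    return False

def coarse_search(grid, blows_up):
    """Walk from high to low rates; return the last rate that still blew up."""
    last_bad = None
    for lr in grid:
        if not blows_up(lr):
            break
        last_bad = lr
    return last_bad

grid = half_decade_grid()
last_bad = coarse_search(grid, diverges)
fine_start = last_bad / 10.0  # begin the fine search an order of magnitude lower
print(last_bad, fine_start)
```

On this toy problem the loss blows up at 1 and 0.3 but not at 0.1, so the fine grained search would start around 0.03.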

Another useful tweak on the learning rate is to have it decay over the course of training. I find that this slightly improves the final performance, but more importantly leads to consistent training results. There are a variety of ways to implement the decay, but I’m not sure they make that much of a difference. My standard implementation is

l_{batch} = \frac{l_{start}}{1 + decay \cdot N_{batches}}

where N_{batches} is the number of minibatches seen so far during training. I then set decay so that the final learning rate at the end of all the epochs is 1/10th the starting learning rate.
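Setting the final learning rate to 1/10th of the starting rate fixes the decay constant: solving l_{start} / (1 + decay * N_{total}) = l_{start} / 10 gives decay = 9 / N_{total}, where N_{total} is the total number of minibatches in the run. A small sketch, with function names chosen only for this example:

```python
def decay_for_final_fraction(total_batches, final_fraction=0.1):
    """Decay constant that makes the last learning rate final_fraction * l_start."""
    return (1.0 / final_fraction - 1.0) / total_batches

def learning_rate(l_start, decay, n_batches):
    """Learning rate after n_batches minibatches have been seen."""
    return l_start / (1.0 + decay * n_batches)

total_batches = 1000  # e.g. 10 epochs of 100 minibatches each
decay = decay_for_final_fraction(total_batches)

print(learning_rate(0.1, decay, 0))              # starting rate
print(learning_rate(0.1, decay, total_batches))  # roughly one tenth of it
```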


Momentum

Momentum is very useful for neural networks, but in practice I spend minimal time tuning the momentum rate because I have a few default settings that I strongly recommend.

First, I really only consider three possible momentum values: 0.5, 0.9, and 0.99. Since the maximum effect of momentum is \frac{1}{1-momentum}, my default values are roughly spaced by an order of magnitude. I always start with 0.9 and go from there.
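The \frac{1}{1-momentum} factor comes from a geometric series: with a constant gradient g, the velocity update v = momentum * v + g converges to g / (1 - momentum). A quick numerical check, with the function name chosen only for this example:

```python
def terminal_velocity(momentum, gradient=1.0, steps=10_000):
    """Repeat v <- momentum * v + gradient with a constant gradient."""
    v = 0.0
    for _ in range(steps):
        v = momentum * v + gradient
    return v

for m in (0.5, 0.9, 0.99):
    print(m, terminal_velocity(m), 1.0 / (1.0 - m))
```

The three defaults amplify a steady gradient by roughly 2, 10, and 100.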

Also, I always choose Nesterov momentum whenever possible. Most packages, like Keras, have Nesterov as an option for SGD, and Keras also has Nadam, which is Adam with Nesterov momentum. For more details on Nesterov, see here. The short explanation is that it leads to the same maximum effect of \frac{1}{1-momentum}, but it does so in a more gradual manner. In practice, this means that while standard momentum gets very unstable above 0.9, Nesterov momentum can be safely set to 0.99.

Another useful tip is to set the momentum to a smaller value (say half your standard value) for the final few epochs (maybe the last 5-10% of epochs). The intuition for why this is helpful is that hopefully by the end of training, the neural network is close to good weights, but it might be rocking back and forth around the optimal weights. Since the neural network weight space is highly non-convex, by tuning down the momentum, you force the neural network to settle down into these non-convex “valleys” that may contain the best weights.

The final tip, originally suggested here, is to exponentially ramp up and down the momentum anytime you want to change the momentum rate during training. This gives the weight updates time to adjust to the new momentum rates. I personally have found this gives a very slight improvement in performance, but more importantly it leads to consistent training results.

Summary of my momentum tips:

  • Peak momentum values of: 0.5, 0.9, or 0.99
  • Always choose Nesterov momentum if possible
  • Start momentum initially at half the desired peak value and exponentially ramp up
  • Towards the end of training, exponentially ramp down momentum to half the desired peak value.
  • Train for 5-10% of epochs at the desired smaller momentum.
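The schedule in this summary could be sketched per epoch as below. Several details are assumptions rather than part of the tips: the ramps are taken to be geometric interpolation between the two momentum values, each ramp lasts 10% of the epochs, and the run is assumed long enough for the phases not to overlap. The function and parameter names are hypothetical.

```python
def momentum_schedule(epoch, n_epochs, peak=0.9, ramp_frac=0.1, final_frac=0.1):
    """Momentum for a given epoch: ramp up, hold at peak, ramp down, hold low."""
    low = peak / 2.0
    n_ramp = max(1, int(n_epochs * ramp_frac))    # length of each ramp
    n_final = max(1, int(n_epochs * final_frac))  # epochs spent at the low value
    ramp_down_start = n_epochs - n_final - n_ramp

    if epoch < n_ramp:                    # ramp up: low -> peak
        t = epoch / n_ramp
        return low * (peak / low) ** t
    if epoch < ramp_down_start:           # hold at the peak value
        return peak
    if epoch < n_epochs - n_final:        # ramp down: peak -> low
        t = (epoch - ramp_down_start) / n_ramp
        return peak * (low / peak) ** t
    return low                            # final epochs at half the peak

schedule = [momentum_schedule(e, 100) for e in range(100)]
print(schedule[0], schedule[5], schedule[50], schedule[85], schedule[99])
```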


All weights should be initialized to an orthogonal matrix. This is extremely important for recurrent neural networks (as explained here), but I have also found it to be useful for all neural networks.
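One way to build an orthogonal matrix is to draw Gaussian vectors and orthonormalize them with Gram-Schmidt. The sketch below does this in plain Python for a square matrix; the function name and seed are assumptions for the example, and in practice you would normally rely on your framework's orthogonal initializer, which also handles non-square layers.

```python
import random

def orthogonal_matrix(n, seed=0):
    """Return an n x n matrix (list of rows) with orthonormal rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Remove the components along the rows accepted so far.
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:  # skip the (very unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows

W = orthogonal_matrix(4)

# Check that W times its transpose is the identity, up to rounding error.
gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in W] for r1 in W]
print([[round(x, 6) for x in row] for row in gram])
```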

Activation Function

The standard is that all hidden layers are ReLUs unless you need the hidden layers to be a valid probability, in which case you should use a sigmoid.


Choosing the right loss function is very problem dependent, so I will leave that for another day. However, whatever loss function you do choose, make sure the output layer activation function is complementary to that loss; see Michael Nielsen’s book for details on why sigmoid outputs and crossentropy losses are complementary.
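A quick way to see the sigmoid and crossentropy pairing: the gradient of the crossentropy loss with respect to the pre-activation z reduces to sigmoid(z) - y, so the sigmoid's saturating derivative cancels out. The sketch below checks this numerically; the function names are chosen only for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(z, y):
    """Binary crossentropy for a sigmoid output with pre-activation z and label y."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def numerical_grad(z, y, h=1e-6):
    """Central-difference estimate of d(loss)/dz."""
    return (cross_entropy(z + h, y) - cross_entropy(z - h, y)) / (2.0 * h)

for z, y in [(2.0, 1.0), (-1.5, 0.0), (0.3, 1.0)]:
    print(numerical_grad(z, y), sigmoid(z) - y)  # the two columns should agree
```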



Weight regularization is almost always a requirement to prevent overfitting and to get good generalization. The two main choices are L1 or L2 regularization. L1 will ensure that small weights are set to zero, and hence will lead to a sparser set of weights. L2 prevents weights from becoming too large, but does not sparsify the weights. Personally, rather than choosing between the two, I tend to default to both. I set L1 to be very small so that I at least get slightly sparser weights, but then I mainly focus on tuning L2 to control overfitting.
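Using both penalties at once just means adding two terms to the loss. A minimal sketch, where the function name and the default coefficients are placeholders rather than recommended values:

```python
def weight_penalty(weights, l1=1e-6, l2=1e-4):
    """Combined L1 + L2 penalty to add to the training loss."""
    l1_term = l1 * sum(abs(w) for w in weights)  # encourages sparse weights
    l2_term = l2 * sum(w * w for w in weights)   # discourages large weights
    return l1_term + l2_term

penalty = weight_penalty([1.0, -2.0], l1=0.1, l2=0.01)
print(penalty)  # 0.1 * 3 + 0.01 * 5, i.e. about 0.35
```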


Dropout and batch normalization are not regularizers in the traditional sense, but in practice they help reduce overfitting by controlling the activation outputs. Additionally, it is extremely difficult to train very deep neural networks without using either dropout or batchnorm. Dropout was the standard for several years, but now it is usually replaced by batchnorm.

Parameter Tuning

Neural networks have a lot of interdependent hyperparameters to tune, so picking which ones to tune first is kind of a chicken-and-egg problem. Personally, I start off with an adaptive optimizer (like Adam or Nadam) and then tune the architecture. Next I will roughly tune the regularization. Once that leads to acceptable results, I will switch the optimizer to SGD and only focus on tuning the learning rate. If SGD seems promising, I will then tune other parameters like decay and momentum. Hopefully by this point, you are achieving pretty good results. I will then use this neural network as the starting point for a systematic hyperparameter search to truly find the best results.

Final Tips

Don’t take my word for anything, try it out yourself! I strongly recommend experimenting with every option you can find in Keras and seeing for yourself what actually works. I also suggest getting opinions from as many people as possible (see Yoshua Bengio’s tips). I think that about 90% of the advice will overlap, but everyone has their own bias. So hopefully by reading enough independent sources, you can average out all our mistakes. Good luck!

Research Experience for Undergrads (REU)

This National Science Foundation program is designed to give undergraduates, especially those from smaller schools, a chance to gain real research experience for a summer. Personally I participated in one official REU and one program modeled on REUs. I learned a lot (and they were tons of fun!). The best part is not the specific topic you research, but the opportunity to learn how to be a researcher.
Most of the applications are due in February. Check out the official NSF REU website for the latest details.
When you are ready to apply, go here to search for programs of REUs in various subjects. Also, search the internet for other research opportunities; Harvard has a nice list of research programs for undergrads. For more detailed tips on applications, I recommend this site.
If you want to get an idea of what an REU is like, here are some interviews of past Math REU participants. And also keep in mind these research tips for undergrads if you do get an REU.

QFT Resources

Quantum Field Theory is a notoriously difficult subject to learn, but I found the following resources to be extremely helpful when I took the course a few years ago. I just learned about a few resources that I wish I had then, so here are my current tips for learning QFT. 
Tony Zee’s book QFT in a Nutshell provides a great intuition into what QFT is all about. If you actually want to do calculations, then Peskin and Schroeder’s book is a nice complement. These two books were the heart of my studies into QFT.
David Tong’s Notes:
Great set of lecture notes that provides a different perspective.
Sidney Coleman’s Lectures:
Apparently, all modern QFT books are based on Coleman (since all the authors learned QFT from him or his students), and you can still see the original videos. For years there was a set of hand-written notes that served as a transcript of the videos, but this was recently LaTeXed and shared on the arXiv.

Deep Learning Seminar Course

This semester Terry Sejnowski is teaching a graduate seminar course that is focused on Deep Learning. The course meets weekly for two hours to discuss papers. Here I’ll just outline the course and in later posts I’ll add some thoughts on each specific week.

Week 1: Perceptrons

Week 2: Hopfield Nets and Boltzmann Machines

Week 3: Backprop

Week 4: Independent Component Analysis (ICA)

Week 5: Convolutional Neural Networks (CNN)

Week 6: Recurrent Neural Networks (RNN)

Week 7: Reinforcement Learning

Week 8: Information and Control Theory

NSF GRFP 2016-2017

For a couple of years now, I have had a website with my thoughts on the National Science Foundation Graduate Research Fellowship (NSF GRFP) and examples of successful essays. The popularity of the site in the past few years has grown well beyond what I expected, so this year I’m going to use this blog to try out a few new things.


Questions from You

I end up getting lots of emails asking for advice. While sometimes the advice really does merit an individualized response, many of the questions are applicable to everyone. So in the interest of efficiently answering questions, here is my plan this year.

  1. Before asking me, make sure you’ve read my advice, checked out the NSF GRFP FAQ, skimmed GradCafe, read my FAQ (next section), and checked out the comments for this blog post.
  2. I will not answer any questions about eligibility due to gaps in graduate school because I am honestly clueless on it.
  3. If you feel comfortable asking the question publicly, post it by commenting below.
  4. If you want to ask me privately, send me an email (my full name at, include NSF GRFP Question in subject line). I will try and answer you and also work with you on a public question/answer that I can include here.



Here are some past questions I have been asked and/or questions I anticipate being asked this year.

  • My research is closely related to medicine. Am I still eligible?
    • I think the best test for this is to ask your advisor if they would apply to NSF or NIH for grants on this topic. If NSF, you are definitely good, but if NIH, you will need to reframe the research to fit into NSF.
  • I am a first year graduate student. Should I apply this year or wait until my second year? (New issue this year since incoming graduate students can only apply once).
    • This is the toughest question for me since no one has had to make this choice yet. However, here is how I would personally decide. The important thing to remember is that undergrads, first year grads, and second year grads are each separately graded relative to their respective years. So you really need to decide how you currently rank relative to your peers versus how you will rank next year. If you did a bunch of undergrad research, have papers, etc., definitely apply as a first year. If you didn’t, it might pay off to wait, but only if your program lets you get right into research. If you will just be taking classes, I’m less confident your relative standing will improve. Good luck to everyone with this tough choice!


Requests for Essay Reading

Unfortunately, I now get more requests to read essays than I can reasonably accomplish. But I am still willing to read over a few and here is how I will decide on the essays to read.

  1. If you are in San Diego, and you think I am a better fit for you than the other local people on the experienced resource list, send me an email with the subject NSF GRFP Experienced Resource List.
  2. If you are not in San Diego, first check out the experienced resource list and also ask around your school for other resources.
  3. If you can’t find anyone to read your essays, fill out this form. I will semi-randomly select essays to read.

What do I mean by semi-randomly? Well, in the interest of supporting the NSF GRFP’s goal of increasing the diversity of graduate school, I will give priority to undergrads who are without a local person on the experienced resource list and/or are from underrepresented groups. The NSF GRFP specifically “encourages women, members of underrepresented minority groups, persons with disabilities, and veterans to apply”, and I am willing to extremely loosely define minority group by race, ethnicity, sexual orientation, family socio-economic status, geography, colleges that traditionally send few students to graduate school, etc. The form is fill in the blank, so feel free to justify your inclusion in any other underrepresented group that I did not explicitly list.

I’ll then take the prioritized list and make some random selection. The number of people I select this way will depend on the number of local people I end up advising, but I will definitely read at least 2 non-local applications.


Here is my timeline for essay reading:

  • Sept 16th – Random drawing number 1
  • Sept 30th Extended to Oct 5th – Random drawing number 2 (I’ll include everyone again, so early birds get double the chances of being selected)
  • Oct 21st – Last day I will help people (sorry I’m traveling near the deadline)



Best Machine Learning Resources

Machine learning is a rapidly evolving field that is generating an intense interest from a wide audience. So how can you get started?

For now, I’m going to assume that you already have the basic programming skills (i.e., a general introduction to programming and experience with matrices) and mathematical skills (calculus and some probability and linear algebra).

These are the best current books on machine learning:

These are some out of date books that still contain some useful sections (for example, Murphy several times refers you to Bishop or MacKay for more details).

Here is a list of other potential resources:


I3: International Institute for Intelligence

While I was previously discussing my opinion of OpenAI, I mentioned that I would do something different if I were in charge. Here is my dream.


What OpenAI is Missing

Helping everyday people throughout the whole world.

OpenAI’s stated goal is:

OpenAI is a non-profit artificial intelligence research company. Our goal is to advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return.

In the short term, we’re building on recent advances in AI research and working towards the next set of breakthroughs.

However, based on their actions so far, this interview with Ilya Sutskever, and popular press articles, the main focus of OpenAI appears to be advanced research in artificial intelligence with a stress on open source, as well as thinking long term about the impacts of letting advanced artificial intelligence systems control large aspects of our life. While I strongly support these goals, in reality, these will not benefit all of humanity. Instead, they only benefit those with either the necessary training (which is a minimum of a bachelor’s, but usually means a master’s or PhD) or money (to hire top people, buy the required computing resources, etc.) to take advantage of the advanced research. So this leaves out the developing world as well as the poor in developed countries, i.e., contrary to their stated goal, OpenAI is missing the vast majority of humanity.

One can argue that because OpenAI’s research is open source, it will eventually trickle down and help a wider swath of humanity. However, the current trend suggests that large corporations are poised to benefit the most from the next revolution (I mean, who is more likely to invent a self-driving car: Google, or someone in a developing country?). Additionally, these innovations focus on first world problems (since these are the highest paying customers). And finally, each round of innovation ends up creating fewer and fewer jobs (so the number of unemployed in developed countries may expand). I firmly believe that unless there is a global educational effort (and probably an implementation of basic income), the benefits of AI will be directed towards a tiny sliver of the world’s population.


My Proposal: I3

Here I lay out my proposal for a new institute that would actually expand the benefits of recent and future advances in machine learning / artificial intelligence to a wider swath of humanity. I don’t claim that it would truly benefit all of humanity (again, see basic income), but it is a way for research advances to reach a larger proportion of it.

I propose a new education and research institute focused on artificial intelligence, machine learning, and computational neuroscience which I’ll call the International Institute for Intelligence. I like alliterations, and since I think it should focus on three types of intelligence, I especially like the idea of calling it I3 or I-Cubed for short.

Why these three research areas? Well, machine learning is currently revolutionizing how companies use data and is facilitating new technological advances every day. Designing artificial intelligence systems on top of these machine learning algorithms seems like a realistic possibility in the near future. The less conventional choice is computational neuroscience. I think it is important to include for two reasons. First, the brain is the best example we have of an intelligent system, so until we actually design an artificial intelligence, it seems best to understand and mimic the best example (this is the philosophy of DeepMind according to Demis Hassabis). Second, the US Brain Initiative and similar international efforts are injecting significant resources into neuroscience, with the hopes of sparking a revolution similar in spirit and magnitude to the widespread effect the Human Genome Project had on biotechnology and genomics. So I figure we might as well prepare everyone for this future.

So what would be the actual purpose of I3? Sticking with the theme of threes, I propose three initiatives that I will list in my order of importance as well as some bonus points.


1. International PhD Education

The central goal is to create a program similar to ICTP (International Centre for Theoretical Physics) but with a different research emphasis. So what is ICTP? It was founded by Nobel Prize winner Abdus Salam and it has several programs to promote research in developing countries, including:

  • Predoctoral program – students get a 1 year course to prep them for PhDs
  • Visiting PhD program – students in a developing nation PhD program get to spend a couple of months each year for 3 years at ICTP to participate in their research
  • Conferences
  • Regional offices (currently São Paulo, Brazil, but more in the planning)

So the idea is to implement a similar program but with the research emphasis now focused on machine learning, artificial intelligence, and computational neuroscience. While I think the main thing is to get the predoctoral program and visiting PhD program started, eventually it would be great to have 5 regional offices spread throughout the developing world. For example, I think one is needed in South America (Lima, Peru?), one in Africa (Nairobi, Kenya?), and 2 in Asia (India, and China, but not in a traditional technological center). And assuming I3 is based in the US (see my case for San Diego below), it would be great to have an affiliate office in Europe, maybe in Trieste next to ICTP.

One additional initiative that I think could be useful would be paying people to not leave their country and instead help them establish a research center at their local universities. This could also wait until later because it might be easiest to convince some of the future alumni of the predoctoral or visiting PhD programs to return/stay in their home country.

A second additional initiative would be to encourage professors from developed and developing countries to take their sabbatical at I3. This would provide a fresh stream of mentors and set up potential future collaborations. This is a blend of two programs at KITP (this and that).


2. US Primary School Education

The science pipeline analogy is overused, but I don’t have a better one yet. So currently, the researchers in I3 focused areas are predominately male, white or Asian, and middle to upper class. So not a very representative sample of the US (or world) population. Therefore, the best longterm solution is to get a more diverse set of students interested in the research at a young age.

Technically this should have a higher priority over the next initiative (US College Education), but since there are other non-profits interested in this (for example, CodeNow), maybe I3 does not need to be a leader in this and instead can play a supporting role.


3. US College Education

And again back to the science pipeline analogy: if we are to have a more diverse set of researchers, we need to encourage a diverse set of undergrads to pursue relevant majors and continue on into graduate programs. This won’t be solved by any single program, but here are some potential ideas.

  • US underrepresented students could apply for the same 1 year program that is offered to international students.
  • Assist universities in establishing bridge programs that partner research universities with colleges that have significant minority populations. A great example of this is the Vanderbilt-Fisk Physics program.
  • US colleges would also benefit from the proposed sabbatical program offered to international researchers. I also like the KITP idea of extending it to undergraduate-only institutions (especially those with large minority populations) as a way to get more undergrads interested in research.
  • Establish a complete set of free college curriculum for machine learning, artificial intelligence, and computational neuroscience. While there are many useful MOOCs on these topics, I still don’t think they beat an actual course.


Bonus #1 : Research

ICTP has proven that it is possible to further global educational goals and still succeed at research. I would argue that the people working at I3 should mainly be evaluated for tenure based on their mentorship and teaching of students. Research of course will play a role (otherwise it would be poor mentorship of future researchers), but I think there shouldn’t be huge pressure to bring in grants, high-profile publications, etc. But even without that emphasis, there is no way that a group of smart people with motivated students will not lead to great research.


Bonus #2: International Primary and College Education

This is longer term, but if there are successful programs in improving the US primary and college education, international regional offices, and PhD alumni who are in their home countries, it seems like it should be possible to leverage those connections into a global initiative to improve primary and college education.


Final Thoughts

So Elon Musk, Peter Thiel, and friends, if you have another billion you want to donate (or OpenAI funds to redirect), here is my proposal. In reality, implementing all of my ideas would probably cost several billion, but once you got the center founded, I think that it would be easy to get tech companies, the US government, and even UNESCO to help provide funding.

My final point is that I think San Diego would be a perfect location. I know I’m biased since I live here now, but there are many legitimate reasons San Diego is great for this institute.

  1. UCSD already partners with outside research institutes (Salk, Scripps, etc)
  2. UCSD (and Salk, etc) are leaders in all of these research areas
  3. It is extremely easy to convince people to take a sabbatical in San Diego

While there are many other great potential locations, I strongly suggest that I3 not be in the Bay Area, Seattle, Boston, or New York City. These cities already have plenty of tech jobs, so please spread the wealth to other parts of the US.

Anyways, I’ll keep dreaming that someday I’ll get to work at a place like the one I just described.