The training time of a neural network is under-discussed in the academic machine learning literature, but it is super important in practice. By making training faster, you reap two benefits: (A) a smaller compute budget and (B) the ability to test hypotheses faster.
For the purposes of this post, I’m assuming the neural network of interest already meets its inference runtime and performance requirements, and the focus is now purely on optimizing training time. When speeding up training, I usually find myself alternating between solutions that are primarily software related and solutions that are primarily hardware related.
Software Solutions
1. Optimizer and Schedules
The software solution that I tune the most is definitely the optimizer and its associated hyperparameters. I find that after almost any other change it is worth double-checking that your optimizer and its hyperparameters are still optimal for your new setup.
When first prototyping a network, I prefer to use SGD with a constant learning rate and momentum. With momentum left at a standard default like 0.9, this reduces the optimizer hyperparameters to essentially just the learning rate.
Once a network prototype is working, a reliable trick to boost performance is to introduce learning rate step drops when the learning appears to have stalled out. However, determining the exact iteration at which to perform the learning rate step can be nuanced.
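For example, here is a minimal sketch of that baseline plus step drops, assuming PyTorch; `model`, `train_loader`, `loss_fn`, and the milestone epochs are all hypothetical placeholders that need tuning to your problem:

```python
# Minimal sketch, assuming PyTorch; model, train_loader, and loss_fn are hypothetical.
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Drop the learning rate by 10x at epochs where training appears to stall.
# The milestones below are placeholders and need to be tuned per problem.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)

for epoch in range(90):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()  # advance the step-drop schedule once per epoch
```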
I personally have had great success with the One Cycle learning rate policy. However, it opens up a lot of new knobs to potentially tune (min/max learning rate, the exact shape of the curve, etc.). I strongly recommend using AdamW + the Fast.ai One Cycle schedule (which uses cosine annealing) as a starting policy and tuning from there.
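As a concrete starting point, here is a sketch of that policy using PyTorch's built-in AdamW and OneCycleLR with cosine annealing (similar in spirit to the Fast.ai schedule); the learning rate values and the `model`/`train_loader`/`loss_fn` names are placeholders, not recommendations:

```python
# Minimal sketch, assuming PyTorch; model, train_loader, and loss_fn are hypothetical.
import torch

epochs = 20
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-3,                        # the main knob to tune
    steps_per_epoch=len(train_loader),
    epochs=epochs,
    anneal_strategy="cos",              # cosine-shaped warmup/anneal curve
)

for epoch in range(epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        scheduler.step()  # OneCycleLR steps once per batch, not per epoch
```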
2. Initialization
The proper initialization of a neural network can have a large impact on the training speed. There are three major options:
1. Random initialization. A standard choice is Kaiming initialization combined with the Focal Loss bias trick for the final classification layer (see the sketch after this list).
2. Pretraining w/ other datasets. This includes data from other tasks (i.e., use ImageNet for backbone pretraining when doing object detection) or from the same task but a different source (i.e., still doing object detection, but generalizing from COCO to nuScenes).
3. Pretraining w/ same dataset. If one has determined that some subset of a dataset is less valuable for training, it could still serve a useful role in pretraining. See the next section on Curriculum for ideas on how to determine the usefulness of data.
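For option 1, a minimal sketch of Kaiming initialization plus the Focal-Loss-style bias on the classification head might look like the following (PyTorch assumed; `model` and `cls_head` are hypothetical names):

```python
# Minimal sketch, assuming PyTorch; model and model.cls_head are hypothetical.
import math
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    """Kaiming initialization for conv and linear layers."""
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model.apply(init_weights)

# Focal-Loss-style bias on the final classification layer, so rare positives start
# with a low predicted probability (prior pi = 0.01, as in the RetinaNet paper).
prior = 0.01
nn.init.constant_(model.cls_head.bias, -math.log((1 - prior) / prior))
```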
3. Curriculum
Given a fixed dataset, there is still an open question about the best way to present the data to the neural network. While the default of defining an epoch as one randomly shuffled pass through the dataset works great as a starting point, there is a lot of room for improvement in how you select and present a curriculum to the network.
Some ideas in this area include:
1. Sampling of data. Not all data samples are equally valuable; for example, repeat factor sampling can be used to prioritize rare annotations (see the sketch after this list).
2. Multitask prioritization. When teaching a neural network multiple tasks, one strategy is to teach those tasks from easiest to hardest.
3. Progressive resizing. When training image-based tasks with a fully convolutional neural network, one approach is to start with small resolution images and progressively increase the size.
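For the sampling idea above, one simple approximation of repeat factor sampling is to oversample rare classes with a weighted sampler. A minimal sketch, assuming PyTorch and a hypothetical `labels` list with one class id per sample in a hypothetical `dataset`:

```python
# Minimal sketch, assuming PyTorch; dataset and labels are hypothetical.
from collections import Counter

from torch.utils.data import DataLoader, WeightedRandomSampler

counts = Counter(labels)
# Give each sample a weight inversely proportional to its class frequency,
# so rare classes are drawn more often than a plain shuffled pass would allow.
weights = [1.0 / counts[label] for label in labels]

sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
train_loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```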
I have found that the effort-to-reward ratio favors spending a significant amount of time on tuning the curriculum.
4. Architecture
Since I am assuming that one has already achieved the required inference runtime, the neural network architecture may not need to be optimized further. However, if you are using a stochastic step during training like dropout, removing it can significantly speed up your training time. Otherwise, tuning the architecture for training speed is usually not worth a large time investment.
Hardware-Specific Solutions
1. Better Hardware
If you just sit still, your training time will likely go down thanks to the continued advances in CPUs, GPUs, RAM, etc. A great way to set yourself up for these improvements is to not own your own hardware and instead train in the cloud.
2. Data Serving
The goal is to keep your GPUs fed: push your non-GPU code (data loading, decoding, augmentation) to be slightly faster than your GPU step so that the GPUs are always busy. Be prepared to revisit this step often as you make advances in optimizing other parts of your training time. There is no way around it: this step is often difficult because there is no one-size-fits-all solution. Instead, you will need a nuanced solution that depends on your specific hardware and dataset.
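As one example of what this tuning looks like in practice, here is a sketch of a PyTorch DataLoader configured to keep the GPU fed; `dataset` and every number below are placeholders that must be tuned to your hardware and data pipeline:

```python
# Minimal sketch, assuming PyTorch; dataset and all numbers are hypothetical.
from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=8,            # parallel CPU workers decoding/augmenting data
    pin_memory=True,          # enables faster, asynchronous host-to-GPU copies
    prefetch_factor=4,        # batches each worker keeps queued ahead of the GPU
    persistent_workers=True,  # avoid respawning workers every epoch
)

for inputs, targets in train_loader:
    inputs = inputs.cuda(non_blocking=True)    # overlap the copy with compute
    targets = targets.cuda(non_blocking=True)
    ...
```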
3. GPU Utilization
Modern neural network architectures favor 2D CNNs, which are highly optimized for GPUs. And if you can show how to simplify a complicated architecture down to a 2D CNN, you might get a paper out of it ;).
Personally, I have found the payoff from working on GPU utilization to be highly bimodal. If you have an architecture that is not GPU efficient, go all in on making it more efficient; the rewards are huge! But if you have already optimized for inference runtime, you likely already have a modern architecture that uses the GPU efficiently. Once you have the right architecture, you often only need minor tweaks to the batch size to fully utilize the GPU.
4. Mixed Precision Training
With the correct hardware, training with mixed precision (float32 and float16) is an easy way to cut training time in half, and modern libraries make it as simple as a single setting. The only catch is that it may make your training more unstable. But if you can tame the extra training noise, this is a no-brainer.
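With PyTorch, for example, the near-single-setting version looks roughly like this sketch (`model`, `optimizer`, `train_loader`, and `loss_fn` are hypothetical placeholders):

```python
# Minimal sketch of automatic mixed precision, assuming PyTorch on a CUDA GPU;
# model, optimizer, train_loader, and loss_fn are hypothetical.
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in train_loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # run the forward pass in float16 where safe
        loss = loss_fn(model(inputs.cuda()), targets.cuda())
    scaler.scale(loss).backward()        # scale the loss to avoid float16 underflow
    scaler.step(optimizer)
    scaler.update()
```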
5. Multiple GPUs
Eventually, once you have pulled off all the other speedups, you will be left with one real option: scale up your compute! If you can really push the number of GPUs you can leverage, you can pull off crazy headlines like:
- Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (cool in 2017)
- Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes (cool in 2018)
Distributed computing has been and will continue to be a hot topic for years to come.
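For reference, here is a rough sketch of what single-node multi-GPU data parallelism looks like with PyTorch's DistributedDataParallel; `MyModel`, `dataset`, and the launch command are hypothetical placeholders:

```python
# Minimal sketch, assuming PyTorch; MyModel and dataset are hypothetical.
# Launch with, e.g.: torchrun --nproc_per_node=8 train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(MyModel().cuda(), device_ids=[local_rank])
sampler = DistributedSampler(dataset)   # each rank sees a disjoint shard of the data
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

for epoch in range(10):
    sampler.set_epoch(epoch)            # reshuffle the shards each epoch
    for inputs, targets in loader:
        ...                             # usual forward/backward/step per rank
```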
Conclusion
I hope this helps shed light on some of the potential ways to speed up your neural network training. Let me know what type of training speedups you can achieve with this advice!