Training Time Speedups

The training time of a neural network is a topic that is under-discussed in the machine learning academic literature, but it is super important in practice. By making training faster, you reap the benefits of (A) reducing your compute budget and (B) being able to test hypotheses faster.

For the purposes of this post, I’m assuming the neural network of interest has already achieved its inference runtime and performance requirements, and the focus is now purely on optimizing training time. When speeding up training time, I usually find myself alternating between solutions that are primarily software related and solutions that are primarily hardware related.

Software Solutions

1. Optimizer and Schedules

The software solution that I tune the most is definitely the optimizer and its associated hyperparameters. I find that after almost any other change it is worth double-checking that your optimizer and its hyperparameters are still optimal for your new setup.

When first prototyping a network, I prefer to use SGD with a constant learning rate and momentum. This reduces the optimizer hyperparameters to essentially one, the learning rate, since momentum can usually be left at a standard value like 0.9.
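As a rough illustration, a minimal PyTorch setup for this starting point might look like the following (the `Linear` layer is just a stand-in for your real network):

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for your real network

# Plain SGD with momentum: the only knob left to tune is lr,
# since momentum is left at a standard 0.9.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```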

Once a network prototype is achieved, a reliable trick to boost performance is to introduce learning rate step drops when the learning appears to have stalled out. However, determining the exact iteration at which to perform the learning rate step can be nuanced.
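One way to sidestep hand-picking the drop iteration is PyTorch’s ReduceLROnPlateau, which triggers the step automatically when a monitored metric stalls. A minimal sketch, assuming a hypothetical `train_and_validate` helper that returns the validation loss:

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for your real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Drop the learning rate 10x whenever validation loss stops improving,
# rather than hand-picking the exact iteration of each step drop.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5)

for epoch in range(100):
    val_loss = train_and_validate(model, optimizer)  # hypothetical helper
    scheduler.step(val_loss)
```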

I personally have had great success with the One Cycle learning rate policy. However, it opens up a lot of new knobs to potentially tune (min/max learning rate, the exact shape of the curve, etc.). I strongly recommend using AdamW + the Fast.ai One Cycle schedule (which follows a cosine annealing) as a starting policy and tuning from there.
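Here is a rough PyTorch equivalent of that starting policy using the built-in OneCycleLR scheduler; the model and training length are placeholders:

```python
import torch

model = torch.nn.Linear(10, 2)     # stand-in for your real network
epochs, steps_per_epoch = 10, 500  # hypothetical training length

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-3,                   # peak of the one-cycle curve
    total_steps=epochs * steps_per_epoch,
    anneal_strategy="cos",         # cosine annealing after the warmup ramp
)

# Inside the training loop, step the scheduler once per batch:
#   optimizer.step(); scheduler.step()
```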

2. Initialization

The proper initialization of a neural network can have a large impact on the training speed. There are three major options:
1. Random initialization. A standard choice is Kaiming initialization with the Focal Loss bias trick (see the sketch after this list).
2. Pretraining w/ other datasets. This includes data from other tasks (e.g. using ImageNet for backbone pretraining when doing object detection) or from the same task but a different source (e.g. still doing object detection, but generalizing from COCO to nuScenes).
3. Pretraining w/ the same dataset. If you have determined that some subset of a dataset is less valuable for training, it can still serve a useful role in pretraining. See the next section on Curriculum for ideas on how to determine the usefulness of data.
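For option 1, here is a minimal sketch of Kaiming initialization plus the Focal Loss bias trick from the RetinaNet paper; `model.cls_head` is a hypothetical name for the final classification layer:

```python
import math
import torch.nn as nn

def init_detector(model: nn.Module, prior: float = 0.01):
    # Kaiming (He) initialization for all conv layers.
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
            if m.bias is not None:
                nn.init.zeros_(m.bias)
    # Focal Loss bias trick: start the classification head so that every
    # location predicts ~1% foreground, preventing the flood of easy
    # negatives from dominating the early loss.
    # `model.cls_head` is a hypothetical attribute name for that layer.
    nn.init.constant_(model.cls_head.bias, -math.log((1 - prior) / prior))
```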

3. Curriculum

Given a fixed dataset, there is still an open question about the best way to present the data to the neural network. While the default of defining an epoch as one random pass through the dataset works great as a starting point, there is a lot of room for improvement in how you select and present a curriculum to the neural network.

Some ideas in this area include:
1. Sampling of data. Not all data samples are equally valuable; for example, repeat factor sampling can be used to prioritize rare annotations (see the sketch after this list).
2. Multitask prioritization. When teaching a neural network multiple tasks, one strategy is to teach those tasks from easiest to hardest.
3. Progressive resizing. When training image based tasks with a fully convolutional neural network, one approach is to start with small resolution images and progressively increase the size.
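For idea 1, here is a sketch of weighted sampling in PyTorch, which captures the spirit of repeat factor sampling; the `dataset` and the per-example `sample_weights` are assumed to exist:

```python
from torch.utils.data import DataLoader, WeightedRandomSampler

# `sample_weights` is assumed: one weight per example, larger for
# examples containing rare annotations.
sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(dataset),  # an "epoch" still covers len(dataset) draws
    replacement=True,
)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```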

I have found that the effort-to-reward ratio favors spending a significant amount of time on tuning the curriculum.

4. Architecture

Since I am assuming that you have already achieved the required inference runtime, the neural network architecture may not need further optimization. However, if you are using a stochastic step during training like dropout, removing it can significantly speed up your training time. Otherwise, tuning the architecture for training speed is usually not worth a large time investment.
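If your model is defined in PyTorch, one quick way to test the cost of dropout is to swap the layers out for no-ops; a minimal sketch:

```python
import torch.nn as nn

def strip_dropout(model: nn.Module) -> nn.Module:
    # Swap every Dropout layer for a no-op Identity to measure how much
    # the stochastic step is costing you in training time.
    for name, child in model.named_children():
        if isinstance(child, (nn.Dropout, nn.Dropout2d)):
            setattr(model, name, nn.Identity())
        else:
            strip_dropout(child)
    return model
```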

Hardware Specific Solutions

1. Better Hardware

If you just sit still, your training time will likely go down thanks to continued advances in CPUs, GPUs, RAM, etc. A great way to set yourself up for these improvements is to not own your own hardware, but instead train in the cloud.

2. Data Serving

The goal is to keep your GPUs fed: your non-GPU code should be slightly faster than your GPU step so that the GPUs are always busy. Be prepared to revisit this step often as you optimize other parts of your training. There is no way around it: this step is often difficult because there is no one-size-fits-all solution. Instead, you will need a nuanced solution that depends on your specific hardware and dataset.
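In PyTorch, the usual first knobs live on the DataLoader; a sketch with an assumed `dataset`, where the right worker count depends on your hardware:

```python
from torch.utils.data import DataLoader

# Tune num_workers until the GPU stops waiting on data.
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=8,            # parallel CPU processes for decode/augment
    pin_memory=True,          # page-locked memory speeds host-to-GPU copies
    prefetch_factor=4,        # batches each worker keeps queued ahead
    persistent_workers=True,  # avoid re-forking workers every epoch
)
```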

3. GPU Utilization

Modern neural network architectures favor 2D CNNs, which are highly optimized for GPUs. And if you can show how to simplify a complicated architecture down to a 2D CNN, you might get a paper out of it ;).

Personally, I have found the payoff from working on GPU utilization to be highly bimodal. If you have an architecture that is not GPU efficient, go all in on making it more efficient; the rewards are huge! But if you have already optimized for inference runtime, you likely already have a modern architecture that uses GPUs efficiently. Once you have the right architecture, you often only need minor tweaks to the batch size to fully utilize the GPU.
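For that last tweak, here is a rough empirical search for the largest batch that survives a training step; it assumes the model is already on the GPU and `make_batch` is a hypothetical helper that returns a batch of n samples:

```python
import torch

def find_max_batch_size(model, make_batch, limit=4096):
    # Double the batch size until CUDA runs out of memory, then return
    # the last size that survived a forward + backward pass.
    best, bs = None, 2
    while bs <= limit:
        try:
            out = model(make_batch(bs).cuda())
            out.sum().backward()
            model.zero_grad(set_to_none=True)
            best, bs = bs, bs * 2
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise
            torch.cuda.empty_cache()
            break
    return best
```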

4. Mixed Precision Training

With the correct hardware, training with mixed precision (float32 and float16) is an easy way to cut training time in half, and modern libraries make it as simple as a single setting. The only catch is that it may make your training more unstable. But if you can tame the extra training noise, this is a no-brainer.
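In PyTorch, for example, automatic mixed precision is a few lines around the training loop; `model`, `loader`, `criterion`, and `optimizer` are assumed to exist:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, targets in loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():        # forward pass in mixed precision
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()          # scale loss to avoid fp16 underflow
    scaler.step(optimizer)                 # unscale grads, then step
    scaler.update()                        # adjust the scale factor
```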

5. Multiple GPUs

Eventually, once you have pulled off all the other speedups, you will be left with one real option: scale up your compute! If you can really push the number of GPUs you can leverage, you can pull off crazy headlines like training ImageNet in minutes instead of days.
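At this scale, the standard starting point is distributed data parallel training. A minimal PyTorch DDP sketch, assuming a launch via torchrun:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launched via `torchrun --nproc_per_node=<gpus> train.py`, which sets
# LOCAL_RANK and the process-group environment variables for us.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 2).cuda(local_rank)  # stand-in network
model = DDP(model, device_ids=[local_rank])
# From here, train as usual: gradients are all-reduced across GPUs
# automatically on each backward pass.
```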

Distributed computing has been and will continue to be a hot topic for years to come.

Conclusion

I hope this helps shed light on some of the potential ways to speed up your neural network training. Let me know what type of training speedups you can achieve with this advice!

How students can impress hiring managers

How can a student without any experience outside the classroom pique the interest of a hiring manager?

Have a project that:
1. Is eye-catching when boiled down to a few bullet points on a resume
2. Leads to an engaging 20 minute conversation

As a hiring manager looking for machine learning researchers, I’ve reviewed thousands of resumes and conducted hundreds of interviews, and the toughest resumes for me to evaluate remain new grads without internships or publications.

Why? Let me compare how my initial conversations go with different types of candidates.

  • Experienced candidate with a prior job: “Let’s chat about when you deployed X network in production to achieve result Y”.
  • New grad with an internship: “Give me more details about what you did for company C during the summer.”
  • New grad with a paper: “I was reading your paper, and I had a question about ablation study A”.
  • Other new grads: “Let me look over your resume…, um yeah, I guess I’ll try and pay attention as you tell me about yet another class project that involved applying a pretrained ResNet model to MNIST.”

At this point, there are enough students doing ML that it is no longer sufficient to stand out by just having ML classes on your resume. But if you can get an internship or publication, that continues to stand out. So are you screwed without an internship or pub? Not at all! But you do need to do some work to spice up your class or capstone projects.

What can you do to make a project that stands out? Below are a few ideas biased towards my work on neural networks for real time embedded systems.

  1. Open source code. An estimated 10% of published papers have open-sourced code. So take a cool new paper and code it up! Here is a very impressive repo that is well beyond a class project, but for a simpler project, code up one single paper.
  2. Faster. Most academic papers and leaderboards focus on performance, often to the detriment of runtime. I’m always impressed when someone can take a paper and speed it up with minimal drop in performance. Some ways to speed up networks include changing the architecture, pruning, and quantization.
  3. Smaller. Networks are massively overparameterized and much larger than they need to be. Grab a network, squeeze it down, and show you can fit it onto an edge device. Check out SqueezeNet and the Lottery Ticket Hypothesis for interesting papers in this area.
  4. Cheaper. Training state of the art neural networks is extremely time-consuming and costly. Demonstrate how to train a network with a limited GPU hour budget and still get reasonable performance. Check out some ideas from Fast.ai for fast training and this article for training on a single GPU.
  5. Multitask. Academic networks are usually single task, but real time networks are usually Frankensteins with a shared feature map supporting multiple tasks. I recommend this review paper as well as this more recent paper to get started.

Hope that helps! I look forward to chatting with you about these cool projects!

Tips for a New Engineering Manager

Welcome to engineering management! You’ve entered a brave new world that is significantly different from your life as an individual contributor. Here are some tips I’ve learned over time.

You have one goal: maximize your team’s productive work.

That’s it. Do that and you will succeed. Do anything else and you fail. Life is simple.

Let’s break that down into a few key ideas.

Maximize Your Team Member’s Potential

First, let’s assess the type of engineers you may have on your team.

  • Solid = Can solve well defined problems.
  • Bad = Can’t solve well defined problems.
  • Junior = Bad engineers who have the excuse of being very early in their career or being put into a new situation.
  • Leaders = Can determine the problems that need to be solved and help others get the necessary work done.

How should you manage the different types of engineers?

Engineering Leaders. If you are a new manager, you are likely the only Leader on your team. If you are lucky enough to have team members who are Leaders, your life will be easy. Just keep giving them tough problems and stay out of their way.

Solid Engineers. These engineers will hopefully make up most of your team. They can tackle well defined problems, so to maximize their current utility, work on giving them the minimum definition they need and then get out of their way. To maximize their long term success, you want to try and turn Solid Engineers into Leaders. One way to do this is to keep giving them tougher problems with less definition and let them grow into Leaders.

Bad Engineers. Hopefully you don’t inherit any Bad Engineers. If you do, you need to quickly assess whether it is worth the management effort to grow them into Solid Engineers, or whether to get them off your team (either by firing them or transferring them to another team).

Junior Engineers. As a new manager, you are likely to have several Junior Engineers. I’ve found that everyone is happiest if both the Junior Engineer and you acknowledge that Junior Engineers do not produce useful work today. Instead, both of you should focus your efforts on turning them into a Solid Engineer as soon as possible. As a manager, you can accelerate their growth by (1) giving them training tasks, i.e. tasks that may not be the most relevant to your team’s productivity today but help them grow as individuals, and (2) giving them extra 1:1 attention. While this extra effort will consume significant time now, it can produce a Solid Engineer within a few months. Plus, this effort is nothing compared to what you would later need to spend moving someone from Bad Engineer to Solid Engineer.

Maximize Your Team Output

Now that we have covered the basics of maximizing individual team member’s performance, let’s move on to some tips for maximizing your whole team’s output.

Your Output

First, you should plan on your personal coding output being basically zero. You may occasionally do some coding, but remember that you are now judged on your team’s output, not just your own. Any coding project you take on needs to be weighed against the opportunity cost of spending that same time improving your team members. In practice, this likely means you will only end up coding (1) early prototypes, (2) refactorings, or (3) nice-to-have projects. Instead, you need to switch to thinking about your team’s output.

Meetings

As a new manager, your life can easily be consumed by meetings. You will now be in charge of 1:1s and team meetings, and you will also get invited to lots of cross team meetings. My biggest regret as an early manager was saying yes to every meeting. I was fine with delegating work, but my weakness was that I still wanted to know everything that was going on. Before you know it, you will spend all day running from meeting to meeting. This is bad for three reasons: (1) you don’t have time to think, (2) you can’t adapt to emergencies, and (3) your team members will become too shy to schedule 1:1s with you since they assume you are busy doing something important.

So fight the good fight and minimize the number of meetings. I advocate the following meeting philosophy:

  • All meetings should have a predefined agenda, even 1:1s.
  • Have the minimum number of people in the meeting (i.e. prefer 1:1s or small groups)
  • Minimize the number of meetings that are just status updates. Most of these can likely be replaced by email, Slack, or a document.
  • Group meetings into blocks. I’ve had success with the following type of meeting schedule:
    • Monday AM: Sprint starts / big team meetings.
    • Monday PM: 1:1s.
    • Tuesday: No recurring meetings.
    • Wednesday: Cross team meetings.
    • Thursday: No recurring meetings.
    • Friday AM: Sprint check ins / sprint end meetings.
    • Friday PM: 1:1s.

Documents

A well written document is one of your two new best friends. The power of a well written document is that it (1) forces you to clarify your thoughts, (2) provides a clear documentation trail, and (3) prevents you from repeating yourself. There is nothing worse than a meeting with a lot of people that leads to some very important decisions, and then no one being able to remember what those decisions were. Write, write, and write some more.

Do Productive Work

At this point your team should be doing lots of work. But how do you guarantee that this work is productive? Let me introduce you to your second new best friend: Objectives and Key Results (OKRs).

The main principles of OKRs are:

  • Objectives should be clearly defined aspirational goals
  • Key results are specific quantitative measures
  • Each objective should be supported by 3-5 key results
  • Ideally key results can be measured on a 0-100% scale
  • To make key results push the boundaries of your team, it is recommended that
    • A key result is considered achieved if >=70% is achieved
    • Key results are not tied to pay

Many companies use a mix of yearly and quarterly OKRs. I personally find quarterly OKRs the most useful as a management tool. Yearly OKRs are more useful for upper management to set clear company wide goals, but I’ve found that setting yearly team goals is not very useful unless they tie directly into a company OKR. Instead, I would focus on writing solid quarterly team OKRs.

For OKRs to truly work, you need to get buy in from the whole team. In order to get buy in, you should:

  1. Do the necessary prep work to deeply understand the company priorities and OKRs.
  2. Involve your team members in developing the team OKRs.
  3. Set up a central dashboard to track your progress on OKRs.
  4. Make that dashboard a central part of your meetings. Kick off team meetings by looking at the dashboard and calling out progress.
  5. Have a retrospective at the end of each OKR period to understand why your team succeeded or failed to achieve its OKRs.
  6. Rinse and repeat. Your first several quarters of OKRs likely won’t go well. But please don’t give up until you have tried for at least 4 quarters in a row.

Final Thoughts

Good luck, hope these tips help!