# Training Time Speed Ups

The training time of a neural network is a topic that is under discussed in machine learning academic literature, but is super important in practice. By making training time faster, one reaps the benefits of (A) reducing their compute budget and (B) being able to test hypothesis faster.

For the purposes of this post, I’m assuming the neural network of interest has already achieved the inference runtime and performance requirements, and now the focus is purely on optimizing the training time. When I am speeding up training time, I usually find myself alternating between solutions that are either primarily software or hardware related.

## Software Solutions

### 1. Optimizer and Schedules

The software solution that I tune the most is definitely the optimizer and its associated hyper parameters. I find that after almost any other change it is worth double checking that your optimizer and its hyperparameters are still optimal for your new setup.

When first prototyping a network, I prefer to use SGD with a constant learning rate and momentum. This reduces the number of optimizer hyperparameters to just setting the learning rate.

Once a network prototype is achieved, a reliable trick to boost performance is to introduce learning rate step drops when the learning appears to have stalled out. However, determining the exact iteration at which to perform the learning rate step can be nuanced.

I personally have had great success with the One Cycle learning rate policy. However, it opens up a lot of new knobs to potentially tune (min/max lr, exact lr curve, etc). I strongly recommend using AdamW + Fast.ai One Cycle schedule (which follows a cosine annealing) as a starting policy and tuning from there.

### 2. Initialization

The proper initialization of a neural network can have a large impact on the training speed. There are three major options:
1. Random initialization. A standard choice is Kaiming initialization with Focal Loss bias.
2. Pretraining w/ other datasets. This includes data from other tasks (ie use ImageNet for backbone pretraining when doing object detection) or from same task but different source (ie always doing object detection, but generalizing from COCO to nuScenes).
3. Pretraining w/ same dataset. If one has determined that some subset of a dataset is less valuable for training, it could still serve a useful role in pretraining. See next section on Curriculum for ideas on how to determine the usefulness of data.

### 3. Curriculum

Given a fixed dataset, there is still an open question on the best way to present the data to the neural network. While the default to define an epoch as one random pass through the dataset works great as a starting point, there are lots of potential improvements in the selection and presentation of a curriculum for the neural network.

Some ideas in this area include:
1. Sampling of data. Not all data samples are equally valuable, for example repeat factor sampling can be used to prioritize rare annotations.
2. Multitask prioritization. When teaching a neural network multiple tasks, one strategy is to teach those tasks from easiest to hardest.
3. Progressive resizing. When training image based tasks with a fully convolution neural network, one approach is to start with small resolution images and progressively increase the size.

I have found that the effort to reward ratio favors spending a significant amount of time on tuning the curriculum.

### 4. Architecture

Since I am assuming that one has already achieved the required inference runtime, the neural network architecture may not need to be optimized further. However, if you are using a stochastic step during training like drop out, the removal of this step can significantly speed up your training time. Otherwise, tuning the architecture for training speed is usually not worth a large time investment.

## Hardware Specific Solutions

### 1. Better Hardware

If you just sit still, your training time likely will go down from the continued advances in CPUs, GPUS, RAM, etc. A great way to set yourself up for these changes is by not owning your own hardware but instead training in the cloud.

### 2. Data Serving

The goal is to keep your GPUs fed, so one should push your non-GPU code to be slightly faster than your GPU step so that your GPUs are always busy. Be prepared to often revisit this step as you make advances in optimizing other parts of your training time. There is no way around it, solving this step is often difficult since there is no one size fits all solution. Instead, you will constantly need to find a nuanced solution that depends on your specific hardware and dataset.

### 3. GPU Utilization

Modern neural network architectures favor 2D CNNs which are highly optimized for GPUs. And if you can show how to simplify a complicated architecture to a 2D CNN, you might get a paper out of it ;).

So personally I have found the effort to be put into GPU utilization to be highly bimodal. If you have an architecture that is not GPU efficient, go all in on making it more efficient, the rewards are huge! But if you have already optimized for inference runtime, you likely already have a modern architecture that efficiently utilizes GPUs. Once you have the right architecture, then you often only need to make minor tweaks to the batch size to fully utilize the GPU.

### 4. Mixed Precision Training

With the correct hardware, training with mixed precision (float32 and float16) is an easy way to cut training time in half, and modern libraries make it as simple as a single setting. The only catch is that you may make your training more unstable. But if you can tame the extra training noise, this is a no brainer.

### 5. Multiple GPUs

Eventually, once you have pulled off all the other speedups, you will be left with one real option: scale up your compute! If you can really push the number of GPUs you can leverage, you can pull off crazy headlines like:

Distributed computing has been and will continue to be a hot topic for years to come.

## Conclusion

I hope this helps shed light on some of the potential ways to speed up your neural network training. Let me know what type of training speedups you can achieve with this advice!

# How students can impress hiring managers

How can a student without any experience outside the classroom pique the interest of a hiring manager?

Have a project that:
1. Is eye catching when boiled down to a few bullets points on a resume
2. Leads to an engaging 20 minute conversation

As a hiring manager looking for machine learning researchers, I’ve reviewed 1000’s of resumes and conducted 100’s of interviews, and the toughest resumes for me to evaluate remain new grads without internships or publications.

Why? Let me compare how my initial conversations go with different types of candidates.

• Experienced candidate with a prior job: “Let’s chat about when you deployed X network in production to achieve result Y”.
• New grad with an internship: “Give me more details about what you did for company C during the summer.”
• Other new grads: “Let me look over your resume…, um yeah, I guess I’ll try and pay attention as you tell me about yet another class project that involved applying a pretrained ResNet model to MNIST.”

At this point, there are enough students doing ML that it is no longer sufficient to stand out by just having ML classes on your resume. But if you can get an internship or publication, that continues to stand out. So are you screwed without an internship or pub? Not at all! But you do need to do some work to spice up your class or capstone projects.

What can you do to make a project that stands out? Below are a few ideas biased towards my work on neural networks for real time embedded systems.

1. Open source code. An estimated 10% of published papers have open sourced code. So take a cool new paper and code it up! Here is a very impressive repo that is well beyond a class project, but for a simpler project code up one single paper.
2. Faster. Most academic papers and leader boards focus on performance, often to the detriment of runtime. I’m always impressed when someone can take a paper and speed it up with minimal drop in performance. Some ways to speed up networks include changing the architecture, pruning, and quantization.
3. Smaller. Networks are massively over parametrized and much larger than they need to be. Grab a network, squeeze it down, and show you can fit it into an edge device. Check out SqueezeNet and the Lottery Ticket Hypothesis for interesting papers in this area.
4. Cheaper. Training state of the art neural networks is extremely time consuming and costly. Demonstrate how to train a network with a limited GPU hour budget and still get reasonable performance. Check out some ideas from Fast.ai for fast training and this article for training on a single GPU.
5. Multitask. Academic networks are usually single task, but real time networks are usually Frankensteins with a shared feature map supporting multiple tasks. I recommend this review paper as well as this more recent paper to get started.

Hope that helps! I look forward to chatting with you about these cool projects!

# Writing Goals for 2021

After a long hiatus (having a young kid is time consuming…), my goal is to write more in 2021. Here is my New Year’s resolution for 2021.

Objective: Be a positive contributor to the machine learning and tech Internet community.

• Key Result 1: Write 10 blog posts that have >1000 views each.
• Key Result 2: Write 1 blog post that has >10,000 views.
• Key Result 3: Have >1000 Twitter followers.
• Key Result 4: Write >25 tweets that have >100 engagements each.

Some footnotes:

• I wrote 26 posts in 2016, 5 in 2017, 1 in 2018, 1 in 2019, 1 in 2020.
• My PointPillars post had ~900 views in 2019.
• My most popular post of all time is about the NSF GRFP and it has ~9000 views in 4 years.
• I have long been a reader of tweets, but only recently actually started using the app and tweeting. My default is to write long blog posts, but I’d like to see how much I can shift my writing mentality to contribute to the more agile Twitter discussions.
• I currently have ~35 followers and assume I can get to ~100 just by people I know in real life. So I see 1000 followers as proof of interacting with a larger community.
• For reference, my current most engaged tweet is a lame tweet about Die Hard that generated 27 engagements.

Look forward to writing more in 2021! Happy to hear some feedback on what topics I should cover.

# PointPillars: Fast Encoders for Object Detection from Point Clouds

I’m excited to finally be able to share some of the stuff I have been working on since joining nuTonomy: an Aptiv company. We recently released our paper on PointPillars (with code), a cutting edge method for object detection using point clouds.

First, what problem are we trying to solve? Our company goal is to create the software stack to run a self driving taxi. Specifically, the machine learning team’s charter is to tackle the problems that are too tough to model explicitly. Therefore, our method of choice is deep learning, and we usually work closely with the raw sensor data. Our cars have a 360 degree coverage through multiple lidars, cameras, and radars (check out nuScenes for our actual data!), but of these lidar is the most important sensor. Lidar is a laser ranging sensor that provides sparse, yet accurate, points in the 3D world. These point clouds are the key inputs for 3D object detection since they allow precise localization in the real world.

The ideal deep learning model would incorporate all sensor modalities (lidar, cameras, and radar), but a first step is to separately model each sensor. Images are relatively easy since a multitude of methods exist in the literature, so our research has focused on how to do lidar and radar. I’m going to keep focusing on lidar since it is the main sensor, but everything that follows about PointPillars could equally well be used on radar after a few minor changes as I’ll explain later.

So, what was the state of the art for lidar only object detection when we started our research? There were two main schools of thought that are best represented by PIXOR and VoxelNet. The fundamental difference is how to represent the sparse lidar point cloud. One school of thought (PIXOR, MV3D, …) is to create a set of fixed, hand crafted features. The other school (PointNet, Frustum PointNet, VoxelNet, SECOND) believes in end to end learning and just lets the network learn directly from the point cloud. From a performance and engineering perspective, end to end learning is always better because (1) the network should always be able to match (and usually far exceed) fixed encodings and (2) we let the network do the hard work of finding the encoder, rather than having to devote engineer’s time to discover the right encoding. So we should all do end to end learning!

But there is always a catch. The issue with VoxelNet is that it is too slow to run in realtime. The central problem is that they chose to do end to end learning on voxels. This forces them to use 3D convolutions which are extremely slow. In contrast, PIXOR can just use 2D convolutions which are well optimized for GPU computing. If only there was a way to blend the performance of end to end learning with the speed of fixed encoders.

It turns out, we found a method to do so: PointPillars. The fundamental realization (courtesy of Oscar Beijbom) was that pillars are the best representation. A pillar is a vertical column that can extend infinitely up and down. By learning end to end on pillars, we achieved state of the art detection performance on the KITTI leaderboard at blazing fast speeds (60 to >100 Hz), for a 2-4 fold improvement in runtime.

A few more details on why we are so fast. First, by using pillars, we eliminate 3D convolutions since we immediately learn a 2D representation. Second, we sped up the network by eliminating parameters in the encoder and network. Third, while our initial model for training is in PyTorch, we convert that model to a NVIDIA TensorRT planfile which allows additional optimizations for GPUs.

So where do we go from here? The next step is to work on sensor fusion. First, we need a radar network, which at first glance looks like it might require more work. But it turns out, radar is also a sparse point cloud of range returns. While lidar points return the x, y, z position and reflectance of an object, radar returns the radial range, angular velocity, and a host of other features. So we can just plug in radar point clouds to PointPillar and go! Since the radar returns have worse spatial localization than lidar, it turns out the radar only network doesn’t give great performance. Finally, now that we have separate networks for lidar, images, and radar, it is time to fuse them together! We are actively working on this now and hopefully I can share some of our tricks soon.

# NSF GRFP 2018-2019

Thanks to everyone who has been sending me new essays to host. Its crazy that my advice page went from only my essays and thoughts to 93 different examples!

I have to give a disclaimer that I haven’t been following changes to the NSF GRFP as closely this past year, but I think that my general advice still holds. If something seems outdated or wrong, please let me know. I always highly recommend getting multiple opinions and I’m glad that many people who have shared their essays are also writing about their experiences. Also check out the Grad Cafe for useful discussions.

Good luck everyone!

# Deep Learning Tips

I thought I would write up some general tips and tricks that I have learned by experimenting with neural networks. My focus is on tips that apply to any problem and any neural network architecture, and in fact, some of these tips apply more generally to any machine learning algorithm. So what I have learned over the years?

# Data Splits

Before doing anything else, you need to split the dataset into training and testing. But how much data should go into each split? This depends on your number of samples and the number of classes. For example, MNIST has only 10 digits with little variation in each digit, so the standard split is around 80% train and 20% test. ImageNet has over a million samples of 1000 diverse classes, so they use around 50% train and 50% test. So if you have an easy problem and/or a small dataset, I would suggest 80% train and 20% test. If you have a very tough problem and/or a large dataset, I would suggest 50% train and 50% test.

The test data should now be put in a lock box and only used on your final model.

Next you also should set aside some of the training data for validation which is used to determine generalization results when tuning hyperparameters. I would suggest around 20% of the training data to be used as a validation.

Finally, I do a little bit of cheating and I data snoop. I usually take a very tiny amount of the data, maybe 1-5% and play around with it. I will inspect the data to make sure that it looks good, and use the small number of samples to debug my initial code and very roughly tune the hyperparameters. This saves you the headache of doing a long training session only to find out that you had a bug in your code or grossly misunderstood where to start your hyperparameter search.

# Data Preprocessing

As a general rule, the data should be standardized by preprocessing. I’ll discuss some specific standardizations below, but a general issue is whether to standardize by the whole dataset, per sample, or per feature. I tend to default to per sample, but I don’t have a good scientific reason why that is the best. If you standardize by the whole dataset or per feature, you need to make sure you only use the training data to set the scales. If you standardize per feature, make sure that all of your features have significant variation before doing so (see MNIST for an example where per feature standardization can lead to weird results since many features have a standard deviation of zero).

## Mean

All numerical data should be mean centered, no questions asked. If you classes can be robustly classified just by the mean difference, then you don’t need a neural network. You have a very simple problem and should just use a simple threshold discriminator.

## Scaling

I highly recommend scaling the data so that it is all order 1. This can speed up training because most initialization schemes of weights assume that the data is mean centered and has values around the size of 1. But there are two possible ways to scale your data: standard deviation or by the range. If you data looks normally distributed, then standard deviation makes sense. Otherwise I just divide by the maximum of the absolute value.

## Correlations

In theory, it can also be helpful to remove correlations between features by using PCA or ZCA whitening. However, in practice you may run into numerical stability issues since you will need to invert a matrix. So this is worth considering, but takes some more careful application.

# Data Augmentation

More training data is always better, but obtaining that data can be expensive. So I always try hard to find a way to do data augmentation. However, the correct data augmentation is usually problem specific, so I won’t go into details here.

# Early Stopping

The no free lunch theorem of machine learning states that there is no general learning algorithm that will solve all problems. However, Geoff Hinton has pointed out that early stopping is as close to a free lunch as we can get. Early stopping is the easiest way for any machine learning algorithm to avoid overfitting, and you can read more about the technical justifications for it at Distill’s momentum article.

# Optimizer

In practice, all optimizers for neural networks involve some form of stochastic gradient descent (SGD). The only questions is whether you need to manually tune the learning rate and other parameters, or whether you use an adaptive version of SGD that automatically adjusts the learning rates. I think the best adaptive method is Adam (and Nadam when possible, see later subsection on momentum). So for me the choice is simple: either plain SGD or Adam/Nadam. For a more complete comparison of SGD variants, I highly recommend this blog post.

### Learning Rate

If  you are using Adam, you will rarely need to tune the learning rate. But for SGD, the learning rate is by far the most important parameter to tune. A nice tip from Yoshua Bengio is this: the optimal learning rate is often an order of magnitude lower than the smallest learning rate that blows up the loss. So this means, start with a high learning rate and work your way down a half order of magnitude at a time (for example: 1, 0.3, 0.1, …). Then start your fine grained learning rate search about an order of magnitude below the last time the loss blew up.

Another useful tweak on the learning rate is to have it decay over the course of training. I find that this slightly improves the final performance, but more importantly leads to consistent training results. There are a variety of ways to implement the decay, but I’m not sure they make that much of a difference. My standard implementation is

$l_{batch} = \frac{l_{start}}{1+decay*(N_{batches})}$

where $N_{batches}$ is the number of minibatches seen so far during training. I then set decay so that the final learning rate at the end of all the epochs is 1/10th the starting learning rate.

### Momentum

Momentum is very useful for neural networks, but in practice I spend minimal time tuning the momentum rate because I have a few default settings that I strongly recommend.

First, I really only consider three possible momentum values: 0.5, 0.9, and 0.99. Since the maximum effect of momentum is $\frac{1}{1-momentum}$, my default values are roughly spaced by an order of magnitude. I always start with 0.9 and go from there.

Also, I always choose Nesterov momentum whenever possible. Most packages, like Keras, have Nesterov as an option for SGD, and Keras also has Nadam, which is Adam with Nesterov momentum. For more details on Nesterov, see here. The short explanation is that it leads to the same maximum effect of $\frac{1}{1-momentum}$, but it does so in a more gradual manner. In practice, this means that while standard momentum gets very unstable above 0.9, Nesterov momentum can be safely set to 0.99.

Another useful tip is to set the momentum to a smaller value (say half your standard value) for the final few epochs (maybe the last 5-10% of epochs). The intuition for why this is helpful is that hopefully by the end of training, the neural network is close to good weights, but it might be rocking back and forth around the optimal weights. Since the neural network weight space is highly non-convex, by tuning down the momentum, you force the neural network to settle down into these non-convex “valleys” that may contain the best weights.

The final tip, originally suggested here, is to exponentially ramp up and down the momentum anytime you want to change the momentum rate during training. This gives the weights updates time to adjust to the new momentum rates. I personally have found this gives a very slight improvement in performance, but more importantly it leads to consistent training results.

Summary of my momentum tips:

• Peak momentum values of: 0.5, 0.9, or 0.99
• Always choose Nesterov momentum if possible
• Start momentum initially at half the desired peak value and exponentially ramp up
• Towards the end of training, exponentially ramp down momentum to half the desired peak value.
• Train for 5-10% of epochs at the desired smaller momentum.

# Initialization

All weights should be initialized to an orthogonal matrix. This is extremely important for recurrent neural networks (as explained here), but I have also found it to be useful for all neural networks.

# Activation Function

The standard is that all hidden layers are ReLUs unless you need the hidden layers to be a valid probability, in which case you should use a sigmoid.

# Loss

Choosing the right loss function is very problem dependent, so I will leave that for another day. However, whatever loss function you do choose, make sure the output layer activation function is complimentary to that loss, see Michael Nielsen’s book for details on why sigmoid outputs and crossentropy losses are complimentary.

# Regularization

## Weights

Weight regularization is almost always a requirement to prevent overfitting and to get good generalization. The two main choices are L1 or L2 regularization. L1 will ensure that small weights are set to zero, and hence will lead to a sparser set of weights. L2 prevents weights from becoming too large, but does not sparsify the weights. Personally, rather than choosing between the two, I tend to default to both. I set L1 to be very small so that I at least get slightly sparser weights, but then I mainly focus on tuning L2 to control overfitting.

## Activity

Dropout and batch normalization are not regularizers in the traditional sense, but in practice they help reduce overfitting by controlling the activation outputs. Additionally, it is extremely difficult to train very deep neural networks without using either dropout or batchnorm. Dropout was the standard for several years, but now it is usually replaced by batchnorm.

# Parameter Tuning

Neural networks have a lot of interdependent hyperparameters to tune, so picking which ones to tune first is kind of a chicken and the egg problem. Personally, I start off with an adaptive optimizer (like Adam or Nadam) and then tune the architecture. Next I will roughly tune the regularization. Once that leads to acceptable results, I will switch the optimizer to SGD and only focus on tuning the learning rate. If SGD seems promising, I will then tune other parameters like decay and momentum. Hopefully by this point, you are achieving pretty good results. I will then use this neural network as the starting point for a systematic hyperparameter search to truly find the best results.

# Final Tips

Don’t take my word for anything, try it out yourself! I strongly recommend experimenting with every option you can find in Keras and see for yourself what actually will work. I also suggest getting opinions from as many people as possible (see Yoshua Bengio’s tips). I think that about 90% of the advice will overlap, but everyone has their own bias. So hopefully be reading enough independent sources, you can average out all our mistakes. Good luck!

# Temporal Difference Learning

How can humans or machines interact with an environment and learn a strategy for selecting actions that are beneficial to their goals? Answers to this question fall under the artificial intelligence category of reinforcement learning. Here I am going to provide an introduction to temporal difference (TD) learning, which is the algorithm at the heart of reinforcement learning.

I will be presenting TD learning from a computational neuroscience background. My post has been heavily influenced by Dayan and Abbott Ch 9, but I have added some additional points. The ultimate reference for reinforcement learning is the book by Sutton and Barto, and their chapter 6 dives into TD learning.

## Conditioning

To start, let’s review conditioning. The most famous example of conditional is Pavlov’s dogs. The dogs naturally learned to salivate upon the delivery of food, but Pavlov realized that he could condition dogs to associate the ringing of a bell with the delivery of food. Eventually, the ringing of the bell on its own was enough to cause dogs to salivate.

The specific example of Pavlov’s dogs is an example of classical conditioning. In classical conditioning, no action needs to be taken. However, animals can also learn to associate actions with rewards and this is called operant conditioning.

Before I introduce some specific conditioning paradigms, here are the important definitions:

• $s$ = stimulus
• $r$ = reward
• $x$ = no reward
• $v$ = value, or expected reward (generally a function of $r$, $x$)
• $u$ = binary, indicator variable, of stimulus (1 if stimulus present, 0 otherwise)

Here are the conditioning paradigms I want to discuss:

• Pavlovian
• Extinction
• Blocking
• Inhibitory
• Secondary

For each of these paradigms, I will introduce the necessary training stages and the final result. The statement, $a \rightarrow b$, means that $a$ becomes associated ($\rightarrow$) with $b$.

#### Pavlovian

Training: $s \rightarrow r$. The stimulus is trained with a reward.

Results: $s \rightarrow v[r]$. The stimulus is associated with the expectation of a reward.

#### Extinction

Training 1: $s \rightarrow r$. The stimulus is trained with a reward. This eventually leads to successful Pavlovian training.

Training 2: $s \rightarrow x$. The stimulus is trained with a no reward.

Results: $s \rightarrow v[x]$. The stimulus is associated with the expectation of no reward. Extinction of the previous Pavlovian training.

#### Blocking

Training 1: $s_1 \rightarrow r$. The first stimulus is trained with a reward. This eventually leads to successful Pavlovian training.

Training 2: $s_1 + s_2 \rightarrow r$. The first stimulus and a second stimulus is trained with a reward.

Results: $s_1 \rightarrow v[r]$, and $s_2 \rightarrow v[x]$. The first stimulus completely explains the reward and hence “blocks” the second stimulus from being associated with the reward.

#### Inhibitory

Training: $s_1+s_2 \rightarrow x$, and $s_1 \rightarrow r$. The combination of two stimuli leads to no reward, but the first stimuli is trained with a reward.

Results: $s_1 \rightarrow v[r]$, and $s_2 \rightarrow -v[r]$. The first stimuli is associated with the expectation of the reward while the second stimuli is associated with the negative of the reward.

#### Secondary

Training 1: $s_1 \rightarrow r$. The first stimulus is trained with a reward. This eventually leads to successful Pavlovian training.

Training 2: $s_2 \rightarrow s_1$. The second stimulus is trained with the first stimulus.

Results: $s_2 \rightarrow v[r]$. Eventually the second stimulus is associated with the reward despite never being directly associated with the reward.

## Rescorla-Wagner Rule

How do we turn the various conditioning paradigms into a mathematical framework of learning? The Rescorla Wagner rule (RW) is a very simple model that can explain many, but not all, of the above paradigms.

The RW rule is a linear prediction model that requires these three equations:

1. $v=w \cdot u$
2. $\delta = r-v$
3. $w_{new} = w_{old}+\epsilon \delta u$

and introduces the following new terms:

• $w$ = weights associated with stimuli state
• $\epsilon$ = learning rate, with $0 \le \epsilon \le 1$

What do each of these equations actually mean?

1. The expected reward, $v$, is a linear dot product of a vector of weights, $w$, associated with each stimuli, $u$.
2. But there may be a mismatch, or error, between the true actual reward, $r$, and the expected reward, $v$.
3. Therefore we should update our weights of each stimuli. We do this by adding a term that is proportional to a learning rate $\epsilon$, the error $\delta$, and the stimuli $u$.

During a Pavlovian pairing of stimuli with reward, the RW rule predicts an exponential approach of the weight to $w = \langle ru\rangle$ over the course of several trials for most values of $\epsilon$ (if $\epsilon=1$ it would instantly update to the final value. Why is this usually bad?). Then if the reward stops being paired with the stimuli, the weight will exponential decay over the course of the next trials.

The RW rule will also continue to work when the reward/stimulus pairing is stochastic instead of deterministic and the will will still approach the final value of $w = \langle ru\rangle$.

How does blocking fit into this framework? Well the RW rule says that after the first stage of training, the weights are $w_1 = r$ and $w_2 = 0$ (since we have not presented stimulus two). When we start the second stage of training and try and associate stimulus two with the reward, we find that we cannot learn that association. The reason is that there is no error (hence $\delta = 0$) and therefore $w_2 = 0$ forever. If instead we had only imperfectly learned the weight of the first stimulus, then there is still some error and hence some learning is possible.

One thing that the RW rule incorrectly predicts is secondary conditioning. In this case, during the learning of the first stimulus, $s_1$, the learned weight becomes $w_1 >0$. The RW rule predicts that the second stimulus, $s_2$, will become $w_2 <0$. This is because this paradigm is exactly the same as inhibitory conditioning, according to the RW rule. Therefore, a more complicated rule is required to successfully have secondary conditioning

One final note. The RW rule can provide an even better match to biology by assuming a non-linear relationship between $v$ and the animal behavior. This function is often something that exponentially saturates at the maximal reward (ie an animal is much more motivated to go from 10% to 20% of the max reward rather than from 80% to 90% of the max reward). While this provides a better fit to many biological experiments, it still cannot explain the secondary conditioning paradigm.

## Temporal Difference Learning

To properly model secondary conditioning, we need to explicitly add in time to our equations. For ease, one can assume that time, $t$, is discrete and that a trial lasts for total time $T$ and therefore $0 \le t \le T$.

The straightforward (but wrong) extension of the RW rule to time is:

1. $v[t]=w[t-1] \cdot u[t]$
2. $\delta[t] = r[t]-v[t]$
3. $w[t] = w[t-1]+\epsilon \delta[t] u[t]$

where we will say that it takes one time unit to update the weights.

Why is this naive RW with time wrong? Well, psychology and biology experiments show that animals expected rewards does NOT reflect the past history of rewards nor just reflect the next time step, but instead reflects the expected rewards during the WHOLE REMAINDER of the trial. Therefore a better match to biology is:

1. $v[t]=w[t-1] \cdot u[t]$
2. $R[t]= \langle \sum_{\tau=0}^{T-t} r[t+\tau] \rangle$
3. $\delta[t] = R[t]-v[t]$
4. $w[t] = w[t-1]+\epsilon \delta[t] u[t]$

where $R[t]$ is the full reward expected over the remainder of the trial while $r[t]$ remains the reward at a single time step. This is closer to biology, but we are still missing a key component. Not all future rewards are treated equally. Instead, rewards that happen sooner are valued higher than rewards in the distant future (this is called discounting). So the best match to biology is the following:

1. $v[t]=w[t-1] \cdot u[t]$
2. $R[t]= \langle \sum_{\tau=0}^{T-t} \gamma^\tau r[t+\tau] \rangle$
3. $\delta[t] = R[t]-v[t]$
4. $w[t] = w[t-1]+\epsilon \delta[t] u[t]$

where $0 \le \gamma \le 1$ is the discounting factor for future rewards. A small discounting factor implies we prefer rewards now while a large discounting factor means we are patient for our rewards.

We have managed to write down a set of equations that accurately summarize biological reinforcement. But how can we actually learn with this system? As currently written, we would need to know the average reward over the remainder of the whole trial. Temporal difference learning makes the following assumptions in order to solve for the expected future rewards:

1. Future rewards are Markovian
2. Current observed estimate of reward is close enough to the typical trial

A Markov process is memoryless in that the next future step only depends on the current state of the system and has no other history dependence. By assuming rewards follow this structure, we can make the following approximation:

• $R[t]= \langle r[t+1] \rangle + \gamma \langle \sum_{\tau=1}^{T-t} \gamma^{\tau-1} r[t+\tau]$
• $R[t]= \langle r[t+1] \rangle + \gamma R[t+1]$

The second approximation is called bootstrapping. We will use the currently observed values rather than the full estimate for future rewards. So finally we end up at the temporal difference learning equations:

1. $v[t]=w[t-1] \cdot u[t]$
2. $R[t] = r[t+1] + \gamma v[t+1]$
3. $\delta[t] =r[t+1] + \gamma v[t+1]-v[t]$
4. $w[t] = w[t-1]+\epsilon \delta[t] u[t]$

Dayan and Abbott, Figure 9.2. This illustrates TD learning in action.

I have included an image from Dayan and Abbott about how TD learning evolves over consecutive trials, please read their Chapter 9 for full details.

Finally, I should mention that in practice, people often use the TD-Lambda algorithm. This version introduces a new parameter, lambda, which controls how far back in time one can make adjustments. Lambda 0 implies one time step only, while lambda 1 implies all past time steps. This allows TD learning to excel even if the full system is not Markovian.

## Dopamine and Biology’s TD system

So does biology actually implement TD learning? Animals definitely utilize reinforcement learning and there is strong evidence that temporal difference learning plays an essential role. The leading contender for the reward signal is dopamine. This is a widely used neurotransmitter that evolved in early animals and remains widely conserved. There are a relatively small number of dopamine neurons (in the basal ganglia and VTA in humans) that project widely throughout the brain. These dopamine neurons can produce an intense sensation of pleasure (and in fact the “high” of drugs often comes about either through stimulating dopamine production or preventing its reuptake).

There are two great computational neuroscience papers that highlight the important connection between TD learning and dopamine that analyze two different biological systems:

Both of these papers deserved to be read in detail, but I’ll give a brief summary of the bee foraging paper here. Experiments were done that tracked bees in an controlled environment consisting of “yellow flowers” and “blue flowers” (which were basically just different colored cups). These flowers had the same amount of nectar on average, but were either consistent or highly variable. The bees quickly learned to only target the consistent flowers. These experimental results were very well modeled by assuming the bee was performing TD learning with a relatively small discount factor (driving it to value recent rewards).

## TD Learning and Games

Playing games is the perfect test bed for TD learning. A game has a final objective (win), but throughout play it can be difficult to determine your probability of winning. TD learning provides a systematic framework to associate the value of a given game state with the eventual probability of learning. Below I highlight the games that have most significantly showcased the usefulness of reinforcement learning.

#### Backgammon

Backgammon is a two person game of perfect information (neither player has hidden knowledge) with an element of chance (rolling dice to determine one’s possible moves). Gerald Tesauro’s TD-Gammon was the first program to showcase the value of TD learning, so I will go through it in more detail.

Before getting into specifics, I need to point out that there are actually two (often competing) branches in artificial intelligence:

Symbolic logic tends to be a set of formal rules that a system needs to follow. These rules need to be designed by humans. The connectionist approach uses artificial neural networks and other approaches like TD learning that attempt to mimic biological neural networks. The idea is that humans set up the overall architecture and model of the neural network, but the specific connections between “neurons” is determined by the learning algorithm as it is fed real data examples.

Tesauro actually created two versions of a backgammon program. The first was called Neurogammon. It was trained using supervised learning where it was given expert games as well as games Tesauro played against himself and told to learn to mimic the human moves. Neurogammon was able to play at an intermediate human level.

Tesauro’s next version of a backgammon program was TD-Gammon since it used the TD learning rule. Instead of trying to mimic the human moves, TD-Gammon used to the TD learning rule to assign a score to each move throughout a game. The additional innovation is that the TD-Gammon program was trained by playing games against itself. This initial version of TD-Gammon soon matched Neurogammon (ie intermediate human level). TD-Gammon was able to beat experts by both using a supervised phase on expert games as well as a reinforcement phase.

Despite being able to beat experts, TD-Gammon still had a weakness in the endgame. Since it only looked two-moves ahead, it could miss key moves that would have been found by a more thorough analytical approach. This is where symbolic logic excels and hence TD-Gammon was a great demonstration of the complimentary strength and weaknesses of symbolic vs connectionist logic.

#### Go

Go is a two person game of perfect information with no element of chance. Despite this perfect knowledge, the game is complex enough that there are around $10^170$ possible games (for reference, there are only about $10^80$ atoms in the whole universe). So despite the perfect information, there are just too many possible games to determine the optimal move.

Recently AlphaGo made a huge splash by beating one of the world’s top players of Go. Most Go players, and even many artificial intelligence researchers, thoughts an expert level Go program was years away. So the win was just as surprising as when DeepBlue beat Kasparov in chess. AlphaGo is a large program with many different parts, but at the heart of it is a reinforcement learning module that utilizes TD learning (see here or here for details).

#### Poker

The final frontier in gaming is poker, specifically multi-person No-Limit Texas Hold’em. The reason this is the toughest game left is that it is a multi-player game with imperfect information and an element of chance.

Last winter the computer systems won against professionals for the first time in a series of heads up matches (computer vs only one human). Further improvements are needed to actually beat the best professionals at a multi-person table, but these results seem encouraging for future successes. The interesting thing to me is that both AI system seems to have used only a limited amount of reinforcement learning. I think that fully embracing reinforcement and TD learning should be the top priority for these research teams and might provide the necessary leap in ability. And they should hurry since others might beat them to it!

# Research Experience for Undergrads (REU)

This National Science Foundation program is designed to give undergraduates, especially those from smaller schools, a chance to gain real research experience for a summer. Personally I participated in one official REU and one program modeling on REUs. I learned a lot (and they were tons of fun!). The best part is not the specific topic you research, but the opportunity to learn how to be a researcher.
Most of the applications are due in February. Check out the the official NSF REU website for the latest details.

When you are ready to apply, go here to search for programs of REUs in various subjects. Also, search the internet for other research opportunities; Harvard has a nice list of research programs for undergrads. For more detailed tips on applications, I recommend this site

If you want to get an idea of what an REU is like, here are some interviews of past Math REU participants. And also keep in mind these research tips for undergrads if you do get an REU.

# QFT Resources

Quantum Field Theory is a notoriously difficult subject to learn, but I found the following resources to be extremely helpful when I took the course a few years ago. I just learned about a few resources that I wish I had then, so here are my current tips for learning QFT.

Books:
Tony Zee’s book QFT in a Nutshell provides a great intuition into what QFT is all about. If you actually want to do calculations, then Peskin and Schroeder’s book is a nice compliment. These two books were the heart of my studies into QFT.

David Tong’s Notes:
Great set of lecture notes that provides a different perspective.

Sidney Coleman’s Lectures:
Apparently, all modern QFT books are based on Coleman (since all the authors learned QFT from him or his students), and you can still see the original videos.  For years there was a set of hand-written notes that served as a transcript of the video but this was recently LaTeXed and shared on the ArXiv.