Teaching

Posted on January 25, 2021 General science, Teaching

Training Time Speed Ups

The training time of a neural network is under-discussed in the academic machine learning literature, but is super important in practice. By making training faster, one reaps the benefits of (A) reducing their compute budget and (B) being able to test hypotheses faster.

For the purposes of this post, I’m assuming the neural network of interest has already achieved the inference runtime and performance requirements, and now the focus is purely on optimizing the training time. When I am speeding up training time, I usually find myself alternating between solutions that are either primarily software or hardware related.

Software Solutions

1. Optimizer and Schedules

The software solution that I tune the most is definitely the optimizer and its associated hyperparameters. I find that after almost any other change it is worth double checking that your optimizer and its hyperparameters are still optimal for your new setup.

When first prototyping a network, I prefer to use SGD with a constant learning rate and momentum. This reduces the number of optimizer hyperparameters to just setting the learning rate.

Once a network prototype is achieved, a reliable trick to boost performance is to introduce learning rate step drops when the learning appears to have stalled out. However, determining the exact iteration at which to perform the learning rate step can be nuanced.

I personally have had great success with the One Cycle learning rate policy. However, it opens up a lot of new knobs to potentially tune (min/max lr, exact lr curve, etc). I strongly recommend using AdamW + Fast.ai One Cycle schedule (which follows a cosine annealing) as a starting policy and tuning from there.
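
To make the shape of such a schedule concrete, here is a minimal sketch of a cosine one-cycle learning rate curve in plain Python. The names `cosine_interp` and `one_cycle_lr` and the default values (`div`, `final_div`, `pct_start`) are illustrative assumptions on my part, not the exact Fast.ai settings; they are simply the knobs mentioned above.

```python
import math

def cosine_interp(start, end, frac):
    """Move from `start` (frac=0) to `end` (frac=1) along half a cosine wave."""
    return end + (start - end) * (1.0 + math.cos(math.pi * frac)) / 2.0

def one_cycle_lr(step, total_steps, lr_max=1e-3, div=25.0, final_div=1e4, pct_start=0.3):
    """Learning rate at `step` (0-indexed): warm up to lr_max, then anneal far below the start."""
    lr_start = lr_max / div          # learning rate at the first step (assumed ratio)
    lr_end = lr_start / final_div    # learning rate at the last step (assumed ratio)
    warmup_steps = int(pct_start * total_steps)
    if step < warmup_steps:
        return cosine_interp(lr_start, lr_max, step / warmup_steps)
    anneal_steps = max(1, total_steps - 1 - warmup_steps)
    return cosine_interp(lr_max, lr_end, (step - warmup_steps) / anneal_steps)

# One learning rate per optimizer step for a 100-step run.
schedule = [one_cycle_lr(s, 100) for s in range(100)]
```

In a real training loop you would usually let your framework's scheduler produce this curve; the sketch is only meant to show which quantities there are to tune.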

2. Initialization

The proper initialization of a neural network can have a large impact on the training speed. There are three major options:
1. Random initialization. A standard choice is Kaiming initialization with Focal Loss bias.
2. Pretraining w/ other datasets. This includes data from other tasks (ie use ImageNet for backbone pretraining when doing object detection) or from same task but different source (ie always doing object detection, but generalizing from COCO to nuScenes).
3. Pretraining w/ same dataset. If one has determined that some subset of a dataset is less valuable for training, it could still serve a useful role in pretraining. See next section on Curriculum for ideas on how to determine the usefulness of data.
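
As a rough sketch of option 1, here is what the random initialization could look like in plain Python. I am assuming that "Focal Loss bias" refers to setting the final classification bias so that the initial predicted probability equals a small prior; the prior of 0.01 and the helper names `kaiming_normal` and `prior_bias` are assumptions for illustration.

```python
import math
import random

def kaiming_normal(fan_in, fan_out, rng):
    """Weights drawn from N(0, 2 / fan_in), the Kaiming (He) scheme for ReLU layers."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def prior_bias(prior=0.01):
    """Bias b such that sigmoid(b) equals `prior`, so rare classes start out rare."""
    return -math.log((1.0 - prior) / prior)

rng = random.Random(0)
weights = kaiming_normal(fan_in=256, fan_out=64, rng=rng)
bias = prior_bias(0.01)   # about -4.6
```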

3. Curriculum

Given a fixed dataset, there is still an open question on the best way to present the data to the neural network. While the default to define an epoch as one random pass through the dataset works great as a starting point, there are lots of potential improvements in the selection and presentation of a curriculum for the neural network.

Some ideas in this area include:
1. Sampling of data. Not all data samples are equally valuable, for example repeat factor sampling can be used to prioritize rare annotations.
2. Multitask prioritization. When teaching a neural network multiple tasks, one strategy is to teach those tasks from easiest to hardest.
3. Progressive resizing. When training image-based tasks with a fully convolutional neural network, one approach is to start with small resolution images and progressively increase the size.
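
To illustrate the first idea, here is a minimal sketch of repeat factor sampling. I am assuming the common formulation where a category that appears in a fraction `f` of the images gets a repeat factor `max(1, sqrt(threshold / f))`, and an image takes the largest factor among its categories. The function name, the toy labels, and the threshold value are all hypothetical.

```python
import math
from collections import Counter

def repeat_factors(image_labels, threshold):
    """Return one repeat factor per image; images with rare categories get factors above 1."""
    n_images = len(image_labels)
    # Number of images that contain each category at least once.
    counts = Counter(c for labels in image_labels for c in set(labels))
    category_rf = {
        c: max(1.0, math.sqrt(threshold / (k / n_images))) for c, k in counts.items()
    }
    return [max((category_rf[c] for c in set(labels)), default=1.0) for labels in image_labels]

# Toy dataset: "car" is in every image, "unicycle" is in only 1 of 100.
image_labels = [["car"]] * 99 + [["car", "unicycle"]]
factors = repeat_factors(image_labels, threshold=0.04)
```

An image with factor 2 would then be shown about twice per epoch, while the common images keep a factor of 1.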

I have found that the effort to reward ratio favors spending a significant amount of time on tuning the curriculum.

4. Architecture

Since I am assuming that one has already achieved the required inference runtime, the neural network architecture may not need to be optimized further. However, if you are using a stochastic step during training like dropout, the removal of this step can significantly speed up your training time. Otherwise, tuning the architecture for training speed is usually not worth a large time investment.

Hardware Specific Solutions

1. Better Hardware

If you just sit still, your training time will likely go down from the continued advances in CPUs, GPUs, RAM, etc. A great way to set yourself up for these changes is by not owning your own hardware but instead training in the cloud.

2. Data Serving

The goal is to keep your GPUs fed, so you should push your non-GPU code to be slightly faster than your GPU step so that your GPUs are always busy. Be prepared to revisit this step often as you optimize other parts of your training pipeline. There is no way around it: solving this step is often difficult since there is no one-size-fits-all solution. Instead, you will constantly need to find a nuanced solution that depends on your specific hardware and dataset.
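
One generic pattern for keeping the GPU fed is to load data ahead of time in the background. Below is a minimal sketch using a thread and a bounded queue; the helper name `prefetch` and the buffer size are assumptions, and real data loaders typically use multiple worker processes and handle worker errors, which this sketch does not.

```python
import queue
import threading

def prefetch(batches, buffer_size=4):
    """Yield items from `batches` while a background thread loads ahead of the consumer."""
    buf = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the data

    def worker():
        for batch in batches:
            buf.put(batch)   # blocks while the buffer is full
        buf.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = buf.get()    # blocks only if loading is slower than training
        if batch is done:
            return
        yield batch

# Stand-in for a slow data source; a training loop would consume these in order.
loaded = list(prefetch(iter(range(10)), buffer_size=2))
```

If the consumer regularly finds the buffer empty, data loading is the bottleneck; if the buffer is usually full, the GPU step is.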

3. GPU Utilization

Modern neural network architectures favor 2D CNNs which are highly optimized for GPUs. And if you can show how to simplify a complicated architecture to a 2D CNN, you might get a paper out of it ;).

Personally, I have found that the effort worth putting into GPU utilization is highly bimodal. If you have an architecture that is not GPU efficient, go all in on making it more efficient; the rewards are huge! But if you have already optimized for inference runtime, you likely already have a modern architecture that efficiently utilizes GPUs. Once you have the right architecture, you often only need to make minor tweaks to the batch size to fully utilize the GPU.

4. Mixed Precision Training

With the correct hardware, training with mixed precision (float32 and float16) is an easy way to cut training time roughly in half, and modern libraries make it as simple as a single setting. The only catch is that it may make your training more unstable. But if you can tame the extra training noise, this is a no-brainer.
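
One source of that instability is that small gradients underflow to zero in float16. A common remedy is loss scaling: multiply the loss (and hence the gradients) by a large constant before the float16 step and divide it back out in float32. Here is a tiny sketch of the effect using only the standard library; the gradient value and the scale factor are hypothetical.

```python
import struct

def to_float16(x):
    """Round a Python float to the nearest float16 value and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8                # a small gradient, below the smallest positive float16 value
loss_scale = 2.0 ** 16     # assumed scale factor

naive = to_float16(grad)                              # underflows to 0.0
scaled = to_float16(grad * loss_scale) / loss_scale   # survives the float16 round trip
```

Without scaling the update is lost entirely, while the scaled version recovers the gradient to within float16 rounding error.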

5. Multiple GPUs

Eventually, once you have pulled off all the other speedups, you will be left with one real option: scale up your compute! If you can really push the number of GPUs you can leverage, you can pull off crazy headlines like:

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (cool in 2017)
Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes (cool in 2018)

Distributed computing has been and will continue to be a hot topic for years to come.

Conclusion

I hope this helps shed light on some of the potential ways to speed up your neural network training. Let me know what type of training speedups you can achieve with this advice!

Posted on October 8, 2018 Everybody, General science, Teaching

NSF GRFP 2018-2019

Thanks to everyone who has been sending me new essays to host. It’s crazy that my advice page went from only my essays and thoughts to 93 different examples!

I have to give a disclaimer that I haven’t been following changes to the NSF GRFP as closely this past year, but I think that my general advice still holds. If something seems outdated or wrong, please let me know. I always highly recommend getting multiple opinions and I’m glad that many people who have shared their essays are also writing about their experiences. Also check out the Grad Cafe for useful discussions.

Good luck everyone!

Posted on May 16, 2017May 16, 2017 General science, Teaching

Temporal Difference Learning

How can humans or machines interact with an environment and learn a strategy for selecting actions that are beneficial to their goals? Answers to this question fall under the artificial intelligence category of reinforcement learning. Here I am going to provide an introduction to temporal difference (TD) learning, which is the algorithm at the heart of reinforcement learning.

I will be presenting TD learning from a computational neuroscience background. My post has been heavily influenced by Dayan and Abbott Ch 9, but I have added some additional points. The ultimate reference for reinforcement learning is the book by Sutton and Barto, and their chapter 6 dives into TD learning.

Conditioning

To start, let’s review conditioning. The most famous example of conditioning is Pavlov’s dogs. The dogs naturally learned to salivate upon the delivery of food, but Pavlov realized that he could condition dogs to associate the ringing of a bell with the delivery of food. Eventually, the ringing of the bell on its own was enough to cause dogs to salivate.

The specific example of Pavlov’s dogs is an example of classical conditioning. In classical conditioning, no action needs to be taken. However, animals can also learn to associate actions with rewards and this is called operant conditioning.

Before I introduce some specific conditioning paradigms, here are the important definitions:

$s$ = stimulus
$r$ = reward
$x$ = no reward
$v$ = value, or expected reward (generally a function of $r$ , $x$ )
$u$ = binary, indicator variable, of stimulus (1 if stimulus present, 0 otherwise)

Here are the conditioning paradigms I want to discuss:

Pavlovian
Extinction
Blocking
Inhibitory
Secondary

For each of these paradigms, I will introduce the necessary training stages and the final result. The statement, $a \rightarrow b$ , means that $a$ becomes associated ( $\rightarrow$ ) with $b$ .

Pavlovian

Training: $s \rightarrow r$ . The stimulus is trained with a reward.

Results: $s \rightarrow v[r]$ . The stimulus is associated with the expectation of a reward.

Extinction

Training 1: $s \rightarrow r$ . The stimulus is trained with a reward. This eventually leads to successful Pavlovian training.

Training 2: $s \rightarrow x$ . The stimulus is trained with no reward.

Results: $s \rightarrow v[x]$ . The stimulus is associated with the expectation of no reward. Extinction of the previous Pavlovian training.

Blocking

Training 1: $s_1 \rightarrow r$ . The first stimulus is trained with a reward. This eventually leads to successful Pavlovian training.

Training 2: $s_1 + s_2 \rightarrow r$ . The first stimulus and a second stimulus are trained together with a reward.

Results: $s_1 \rightarrow v[r]$ , and $s_2 \rightarrow v[x]$ . The first stimulus completely explains the reward and hence “blocks” the second stimulus from being associated with the reward.

Inhibitory

Training: $s_1+s_2 \rightarrow x$ , and $s_1 \rightarrow r$ . The combination of two stimuli leads to no reward, but the first stimulus is trained with a reward.

Results: $s_1 \rightarrow v[r]$ , and $s_2 \rightarrow -v[r]$ . The first stimulus is associated with the expectation of the reward while the second stimulus is associated with the negative of the reward.

Secondary

Training 1: $s_1 \rightarrow r$ . The first stimulus is trained with a reward. This eventually leads to successful Pavlovian training.

Training 2: $s_2 \rightarrow s_1$ . The second stimulus is trained with the first stimulus.

Results: $s_2 \rightarrow v[r]$ . Eventually the second stimulus is associated with the reward despite never being directly associated with the reward.

Rescorla-Wagner Rule

How do we turn the various conditioning paradigms into a mathematical framework of learning? The Rescorla-Wagner (RW) rule is a very simple model that can explain many, but not all, of the above paradigms.

The RW rule is a linear prediction model that requires these three equations:

$v=w \cdot u$
$\delta = r-v$
$w_{new} = w_{old}+\epsilon \delta u$

and introduces the following new terms:

$w$ = weights associated with stimuli state
$\epsilon$ = learning rate, with $0 \le \epsilon \le 1$

What do each of these equations actually mean?

The expected reward, $v$ , is the dot product of a vector of weights, $w$ , with the vector of stimulus indicators, $u$ (one weight per stimulus).
But there may be a mismatch, or error, $\delta$ , between the actual reward, $r$ , and the expected reward, $v$ .
Therefore we should update the weight of each stimulus. We do this by adding a term that is proportional to a learning rate $\epsilon$ , the error $\delta$ , and the stimulus $u$ .

During a Pavlovian pairing of stimulus with reward, the RW rule predicts an exponential approach of the weight to $w = \langle ru\rangle$ over the course of several trials for most values of $\epsilon$ (if $\epsilon=1$ it would instantly update to the final value. Why is this usually bad?). Then if the reward stops being paired with the stimulus, the weight will exponentially decay over the course of the next trials.
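
Here is a minimal sketch of the three RW equations applied to Pavlovian acquisition followed by extinction, with a single stimulus. The learning rate, the reward size, and the number of trials are arbitrary choices for illustration.

```python
def rw_update(w, u, r, eps):
    """One Rescorla-Wagner trial: v = w . u, delta = r - v, w <- w + eps * delta * u."""
    v = sum(wi * ui for wi, ui in zip(w, u))
    delta = r - v
    return [wi + eps * delta * ui for wi, ui in zip(w, u)]

eps = 0.1
w = [0.0]

acquisition = []                 # stimulus paired with reward r = 1
for _ in range(50):
    w = rw_update(w, [1.0], 1.0, eps)
    acquisition.append(w[0])

extinction = []                  # same stimulus, reward withheld
for _ in range(50):
    w = rw_update(w, [1.0], 0.0, eps)
    extinction.append(w[0])
```

After $n$ rewarded trials the weight is $1-(1-\epsilon)^n$, which is the exponential approach described above; during extinction it shrinks by a factor of $1-\epsilon$ per trial.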

The RW rule will also continue to work when the reward/stimulus pairing is stochastic instead of deterministic, and the weight will still approach the final value of $w = \langle ru\rangle$ .

How does blocking fit into this framework? Well the RW rule says that after the first stage of training, the weights are $w_1 = r$ and $w_2 = 0$ (since we have not presented stimulus two). When we start the second stage of training and try and associate stimulus two with the reward, we find that we cannot learn that association. The reason is that there is no error (hence $\delta = 0$ ) and therefore $w_2 = 0$ forever. If instead we had only imperfectly learned the weight of the first stimulus, then there is still some error and hence some learning is possible.
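
The blocking argument can be checked numerically. This is a sketch under the same RW equations with two stimuli; the trial counts and learning rate are arbitrary choices for illustration.

```python
def rw_update(w, u, r, eps):
    """One Rescorla-Wagner trial: v = w . u, delta = r - v, w <- w + eps * delta * u."""
    v = sum(wi * ui for wi, ui in zip(w, u))
    delta = r - v
    return [wi + eps * delta * ui for wi, ui in zip(w, u)]

def two_stage(stage1_trials, stage2_trials=200, eps=0.1):
    """Stage 1: s1 -> r. Stage 2: s1 + s2 -> r. Returns the final weights [w1, w2]."""
    w = [0.0, 0.0]
    for _ in range(stage1_trials):
        w = rw_update(w, [1.0, 0.0], 1.0, eps)
    for _ in range(stage2_trials):
        w = rw_update(w, [1.0, 1.0], 1.0, eps)
    return w

blocked = two_stage(stage1_trials=200)   # s1 fully learned first: w2 stays near 0
partial = two_stage(stage1_trials=5)     # s1 only partly learned: w2 picks up the rest
```

With full stage-one training the error is essentially zero when the second stimulus appears, so its weight never moves. With only five stage-one trials the remaining error is shared between the two stimuli.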

One thing that the RW rule incorrectly predicts is secondary conditioning. In this case, during the learning of the first stimulus, $s_1$ , the learned weight becomes $w_1 >0$ . The RW rule predicts that the weight of the second stimulus, $s_2$ , will become $w_2 <0$ . This is because this paradigm is exactly the same as inhibitory conditioning, according to the RW rule. Therefore, a more complicated rule is required to successfully capture secondary conditioning.
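
A short sketch of this failure: because the RW rule has no notion of time within a trial, I am assuming that presenting $s_2$ followed by $s_1$ with no reward looks to it like the compound $s_1+s_2 \rightarrow x$. The trial counts and learning rate are arbitrary.

```python
def rw_update(w, u, r, eps):
    """One Rescorla-Wagner trial: v = w . u, delta = r - v, w <- w + eps * delta * u."""
    v = sum(wi * ui for wi, ui in zip(w, u))
    delta = r - v
    return [wi + eps * delta * ui for wi, ui in zip(w, u)]

eps = 0.1
w = [0.0, 0.0]
for _ in range(200):                          # Training 1: s1 -> r
    w = rw_update(w, [1.0, 0.0], 1.0, eps)
for _ in range(200):                          # Training 2: s2 with s1, no reward
    w = rw_update(w, [1.0, 1.0], 0.0, eps)

w1, w2 = w   # w2 ends up negative, the opposite of what animals show
```

The missing reward produces a negative error that is split between the two stimuli, so the second stimulus ends up with a negative weight instead of a positive one.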

One final note. The RW rule can provide an even better match to biology by assuming a non-linear relationship between $v$ and the animal behavior. This function is often something that exponentially saturates at the maximal reward (ie an animal is much more motivated to go from 10% to 20% of the max reward rather than from 80% to 90% of the max reward). While this provides a better fit to many biological experiments, it still cannot explain the secondary conditioning paradigm.

Temporal Difference Learning

To properly model secondary conditioning, we need to explicitly add in time to our equations. For ease, one can assume that time, $t$ , is discrete and that a trial lasts for total time $T$ and therefore $0 \le t \le T$ .

The straightforward (but wrong) extension of the RW rule to time is:

$v[t]=w[t-1] \cdot u[t]$
$\delta[t] = r[t]-v[t]$
$w[t] = w[t-1]+\epsilon \delta[t] u[t]$

where we will say that it takes one time unit to update the weights.

Why is this naive RW with time wrong? Well, psychology and biology experiments show that an animal’s expected reward does NOT reflect the past history of rewards, nor does it just reflect the next time step; instead it reflects the expected rewards during the WHOLE REMAINDER of the trial. Therefore a better match to biology is:

$v[t]=w[t-1] \cdot u[t]$
$R[t]= \langle \sum_{\tau=0}^{T-t} r[t+\tau] \rangle$
$\delta[t] = R[t]-v[t]$
$w[t] = w[t-1]+\epsilon \delta[t] u[t]$

where $R[t]$ is the full reward expected over the remainder of the trial while $r[t]$ remains the reward at a single time step. This is closer to biology, but we are still missing a key component. Not all future rewards are treated equally. Instead, rewards that happen sooner are valued higher than rewards in the distant future (this is called discounting). So the best match to biology is the following:

$v[t]=w[t-1] \cdot u[t]$
$R[t]= \langle \sum_{\tau=0}^{T-t} \gamma^\tau r[t+\tau] \rangle$
$\delta[t] = R[t]-v[t]$
$w[t] = w[t-1]+\epsilon \delta[t] u[t]$

where $0 \le \gamma \le 1$ is the discounting factor for future rewards. A small discounting factor implies we prefer rewards now while a large discounting factor means we are patient for our rewards.
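
For a single trial with known rewards, the discounted sum inside $R[t]$ is easy to compute directly. A small sketch, where the reward sequence is a made-up example with one reward at the last time step:

```python
def discounted_return(rewards, t, gamma):
    """Sum of gamma**tau * r[t + tau] over the remainder of the trial."""
    return sum(gamma ** tau * r for tau, r in enumerate(rewards[t:]))

rewards = [0.0, 0.0, 0.0, 1.0]                         # reward arrives at t = 3

patient = discounted_return(rewards, 0, gamma=1.0)     # 1.0, no discounting
typical = discounted_return(rewards, 0, gamma=0.9)     # 0.9**3, about 0.729
impatient = discounted_return(rewards, 0, gamma=0.5)   # 0.5**3 = 0.125
```

The same reward three steps away is worth the full amount to a patient agent and only an eighth of that when $\gamma = 0.5$.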

We have managed to write down a set of equations that accurately summarize biological reinforcement. But how can we actually learn with this system? As currently written, we would need to know the average reward over the remainder of the whole trial. Temporal difference learning makes the following assumptions in order to solve for the expected future rewards:

Future rewards are Markovian
Current observed estimate of reward is close enough to the typical trial

A Markov process is memoryless in that the next future step only depends on the current state of the system and has no other history dependence. By assuming rewards follow this structure, we can make the following approximation:

$R[t]= \langle r[t+1] \rangle + \gamma \langle \sum_{\tau=1}^{T-t} \gamma^{\tau-1} r[t+\tau] \rangle$
$R[t]= \langle r[t+1] \rangle + \gamma R[t+1]$

The second approximation is called bootstrapping. We will use the currently observed values rather than the full estimate for future rewards. So finally we end up at the temporal difference learning equations:

$v[t]=w[t-1] \cdot u[t]$
$R[t] = r[t+1] + \gamma v[t+1]$
$\delta[t] =r[t+1] + \gamma v[t+1]-v[t]$
$w[t] = w[t-1]+\epsilon \delta[t] u[t]$
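
Here is a minimal sketch of these updates on a toy trial. I am assuming a one-hot representation where each time step of the trial is its own "stimulus", so that $v[t]$ is simply the weight for step $t$, that the value after the trial ends is zero, and that a single reward of 1 arrives at the end. The trial length, $\gamma$ , and $\epsilon$ are arbitrary choices.

```python
def td0_trial(w, r, gamma, eps):
    """Run one trial of TD(0) in place; w[t] is the value estimate v[t] for time step t."""
    T = len(w)
    for t in range(T):
        v_next = w[t + 1] if t + 1 < T else 0.0    # value after the trial ends is 0
        delta = r[t + 1] + gamma * v_next - w[t]   # delta[t] = r[t+1] + gamma v[t+1] - v[t]
        w[t] += eps * delta                        # u[t] is one-hot, so only w[t] changes
    return w

T, gamma, eps = 5, 0.9, 0.2
r = [0.0] * T + [1.0]          # r[T] = 1 is the only reward
w = [0.0] * T

td0_trial(w, r, gamma, eps)
after_first = list(w)          # only the last step has learned anything so far

for _ in range(499):
    td0_trial(w, r, gamma, eps)
```

After the first trial only the step right before the reward has a nonzero value. Over later trials the value creeps backward in time, one step per trial, until `w[t]` settles at `gamma ** (T - 1 - t)`.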

Dayan and Abbott, Figure 9.2. This illustrates TD learning in action.

I have included an image from Dayan and Abbott about how TD learning evolves over consecutive trials; please read their Chapter 9 for full details.

Finally, I should mention that in practice, people often use the TD-Lambda algorithm. This version introduces a new parameter, lambda, which controls how far back in time one can make adjustments. Lambda 0 implies one time step only, while lambda 1 implies all past time steps. This allows TD learning to excel even if the full system is not Markovian.

Dopamine and Biology’s TD system

So does biology actually implement TD learning? Animals definitely utilize reinforcement learning and there is strong evidence that temporal difference learning plays an essential role. The leading contender for the reward signal is dopamine. This is a widely used neurotransmitter that evolved in early animals and remains widely conserved. There are a relatively small number of dopamine neurons (in the basal ganglia and VTA in humans) that project widely throughout the brain. These dopamine neurons can produce an intense sensation of pleasure (and in fact the “high” of drugs often comes about either through stimulating dopamine production or preventing its reuptake).

There are two great computational neuroscience papers that highlight the important connection between TD learning and dopamine that analyze two different biological systems:

Both of these papers deserve to be read in detail, but I’ll give a brief summary of the bee foraging paper here. Experiments were done that tracked bees in a controlled environment consisting of “yellow flowers” and “blue flowers” (which were basically just different colored cups). These flowers had the same amount of nectar on average, but were either consistent or highly variable. The bees quickly learned to only target the consistent flowers. These experimental results were very well modeled by assuming the bee was performing TD learning with a relatively small discount factor (driving it to value recent rewards).

TD Learning and Games

Playing games is the perfect test bed for TD learning. A game has a final objective (win), but throughout play it can be difficult to determine your probability of winning. TD learning provides a systematic framework to associate the value of a given game state with the eventual probability of winning. Below I highlight the games that have most significantly showcased the usefulness of reinforcement learning.

Backgammon

Backgammon is a two person game of perfect information (neither player has hidden knowledge) with an element of chance (rolling dice to determine one’s possible moves). Gerald Tesauro’s TD-Gammon was the first program to showcase the value of TD learning, so I will go through it in more detail.

Before getting into specifics, I need to point out that there are actually two (often competing) branches in artificial intelligence: symbolic logic and connectionism.

Symbolic logic tends to be a set of formal rules that a system needs to follow. These rules need to be designed by humans. The connectionist approach uses artificial neural networks and other approaches like TD learning that attempt to mimic biological neural networks. The idea is that humans set up the overall architecture and model of the neural network, but the specific connections between “neurons” is determined by the learning algorithm as it is fed real data examples.

Tesauro actually created two versions of a backgammon program. The first was called Neurogammon. It was trained using supervised learning where it was given expert games as well as games Tesauro played against himself and told to learn to mimic the human moves. Neurogammon was able to play at an intermediate human level.

Tesauro’s next backgammon program was called TD-Gammon since it used the TD learning rule. Instead of trying to mimic the human moves, TD-Gammon used the TD learning rule to assign a score to each move throughout a game. The additional innovation was that TD-Gammon was trained by playing games against itself. This initial version of TD-Gammon soon matched Neurogammon (ie intermediate human level). TD-Gammon was then able to beat experts by using both a supervised phase on expert games and a reinforcement phase.

Despite being able to beat experts, TD-Gammon still had a weakness in the endgame. Since it only looked two moves ahead, it could miss key moves that would have been found by a more thorough analytical approach. This is where symbolic logic excels, and hence TD-Gammon was a great demonstration of the complementary strengths and weaknesses of symbolic vs connectionist logic.

Go

Go is a two person game of perfect information with no element of chance. Despite this perfect knowledge, the game is complex enough that there are around $10^{170}$ legal board positions (for reference, there are only about $10^{80}$ atoms in the observable universe). So despite the perfect information, there are just too many possible games to determine the optimal move.

Recently AlphaGo made a huge splash by beating one of the world’s top players of Go. Most Go players, and even many artificial intelligence researchers, thought an expert level Go program was years away. So the win was just as surprising as when Deep Blue beat Kasparov in chess. AlphaGo is a large program with many different parts, but at the heart of it is a reinforcement learning module that utilizes TD learning (see here or here for details).

Poker

The final frontier in gaming is poker, specifically multi-person No-Limit Texas Hold’em. The reason this is the toughest game left is that it is a multi-player game with imperfect information and an element of chance.

Last winter, computer systems won against professionals for the first time in a series of heads up matches (computer vs only one human). Further improvements are needed to actually beat the best professionals at a multi-person table, but these results seem encouraging for future successes. The interesting thing to me is that both AI systems seem to have used only a limited amount of reinforcement learning. I think that fully embracing reinforcement and TD learning should be the top priority for these research teams and might provide the necessary leap in ability. And they should hurry since others might beat them to it!

Posted on August 22, 2016 (updated October 7, 2018) General science, Research, Teaching

NSF GRFP 2016-2017

For a couple of years now, I have had a website with my thoughts on the National Science Foundation Graduate Research Fellowship (NSF GRFP) and examples of successful essays. The popularity of the site in the past few years has grown well beyond what I expected, so this year I’m going to use this blog to try out a few new things.

Questions from You

I end up getting lots of emails asking for advice. While sometimes the question really does merit an individualized response, many of the questions are applicable to everyone. So in the interest of efficiently answering questions, here is my plan this year.

Before asking me, make sure you’ve read my advice, checked out the NSF GRFP FAQ, skimmed GradCafe, read my FAQ (next section), and checked out the comments for this blog post.
I will not answer any questions about eligibility due to gaps in graduate school because I am honestly clueless on it.
If you feel comfortable asking the question publicly, post it by commenting below.
If you want to ask me privately, send me an email (my full name at gmail.com, include NSF GRFP Question in subject line). I will try and answer you and also work with you on a public question/answer that I can include here.

FAQ

Here are some past questions I have been asked and/or questions I anticipate being asked this year.

My research is closely related to medicine. Am I still eligible?
- I think the best test for this is to ask your advisor if they would apply to NSF or NIH for grants on this topic. If NSF you are definitely good, but if NIH, you will need to reframe the research to fit into NSF.
I am a first year graduate student. Should I apply this year or wait until my second year? (New issue this year since incoming graduate students can only apply once).
- This is the toughest question for me since no one has had to make this choice yet. However, here is how I would personally decide. The important thing to remember is that undergrads and graduate students are each separately graded. So you really need to decide how you currently rank relative to your peers versus how you will rank next year. If you did a bunch of undergrad research, have papers, etc, definitely apply as a first year. If you didn’t, it might pay off to wait, but only if your program lets you get right into research. If you will just be taking classes, I’m less confident your relative standing will improve. Good luck to everyone with this tough choice!

Requests for Essay Reading

Unfortunately, I now get more requests to read essays than I can reasonably accomplish. But I am still willing to read over a few and here is how I will decide on the essays to read.

If you are in San Diego, and you think I am a better fit for you than the other local people on the experienced resource list, send me an email with the subject NSF GRFP Experienced Resource List.
If you are not in San Diego, first check out the experienced resource list and also ask around your school for other resources.
If you can’t find anyone to read your essays, fill out this form. I will semi-randomly select essays to read.

What do I mean by semi-randomly? Well, in the interest of supporting the NSF GRFP’s goal of increasing the diversity of graduate school, I will give priority to undergrads who are without a local person on the experienced resource list and/or are from underrepresented groups. The NSF GRFP specifically “encourages women, members of underrepresented minority groups, persons with disabilities, and veterans to apply”, and I am willing to extremely loosely define minority group by race, ethnicity, sexual orientation, family socio-economic status, geography, colleges that traditionally send few students to graduate school, etc. The form is fill in the blank, so feel free to justify your inclusion in any other underrepresented group that I did not explicitly list.

I’ll then take the prioritized list and make some random selection. The number of people I select this way will depend on the number of local people I end up advising, but I will definitely read at least 2 non-local applications.

Here is my timeline for essay reading:

Sept 16th – Random drawing number 1
~~Sept 30th~~ Extended to Oct 5th – Random drawing number 2 (I’ll include everyone again, so early birds get double the chances of being selected)
Oct 21st – Last day I will help people (sorry I’m traveling near the deadline)

Posted on August 20, 2016 General science, Research, Teaching

Best Machine Learning Resources

Machine learning is a rapidly evolving field that is generating an intense interest from a wide audience. So how can you get started?

For now, I’m going to assume that you already have the basic programming (ie general introduction to programming and experience with matrices) and mathematical skills (calculus and some probability and linear algebra).

These are the best current books on machine learning:

Murphy. This is a comprehensive introduction to the whole field.
Learning From Data. This is a brief introduction to a subset of topics.
Deep Learning. Also check out my previous post.

These are some out of date books that still contain some useful sections (for example, Murphy several times refers you to Bishop or MacKay for more details).

Bishop. Predecessor to Murphy.
MacKay. Free pdf!
Hastie, Tibshirani, and Friedman. Free pdf!

Here is a list of other potential resources:

Posted on March 24, 2016 General science, Research, Teaching

Life at Low Reynolds Number

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

Unit: Diffusion

Organized by Ben Regner

Standard Diffusion
Anomalous Diffusion
Life at Low Reynolds Number

Papers

Life at Low Reynolds Number. By Purcell in 1977.

Other Useful References

Wired Article on Fluid Dynamics

Introduction

This is one of my favorite papers. The presentation style is extremely fun and readable without sacrificing any scientific integrity. I think it serves as a great introduction to fluid mechanics at low Reynolds number. I don’t have too many comments since I think the paper explains it best, but I will provide a few supplementary details for a more in-depth exploration of the ideas from the paper.

And just to get you excited about fluid dynamics, I present an example of laminar flow:

Basics of Fluid Mechanics

The fundamental equation of fluid mechanics is Navier-Stokes. The relevant version for this paper is the incompressible flow equations with pressure but no other external fields:

$\frac{\partial \vec{u}}{\partial t}+ \vec{u}\cdot\nabla\vec{u} +\frac{1}{\rho}\nabla p -\nu\nabla^2\vec{u}=0$

where $\vec{u}$ is the velocity vector, $\vec{x}$ is position, $\rho$ is density, $p$ is pressure, and $\nu$ is the kinematic viscosity. This equation can be made non-dimensional by introducing a characteristic velocity $U$ , a characteristic length $L$ , and the dynamic viscosity $\eta=\nu\rho$ . This gives the following dimensionless variables:

$u^* = \frac{u}{U}$

$x^* = \frac{x}{L}$

$p^* = \frac{pL}{\eta U}$

$t^* = \frac{tU}{L}$

Substituting in these characteristic length scales and doing some algebra, one arrives at the simplified equations:

$R\frac{\partial \vec{u^*}}{\partial t^*}+ R\vec{u^*}\cdot\nabla^*\vec{u^*} +\nabla^* p^*-(\nabla^*)^2\vec{u^*}=0$

with only one dimensionless constant, the Reynolds number, defined as:

$R = \frac{UL\rho}{\eta} = \frac{UL}{\nu}$
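
To get a feel for the scales involved, here is a quick sketch that evaluates $R = UL/\nu$ for a bacterium and for a human swimmer in water. The speeds, lengths, and viscosity are rough, order-of-magnitude assumptions chosen for illustration.

```python
def reynolds_number(speed, length, kinematic_viscosity):
    """R = U * L / nu, with all quantities in SI units."""
    return speed * length / kinematic_viscosity

NU_WATER = 1e-6   # m^2/s, approximate kinematic viscosity of water near room temperature

# Assumed scales: a bacterium about 1 micron long moving at about 30 microns per second.
bacterium = reynolds_number(speed=30e-6, length=1e-6, kinematic_viscosity=NU_WATER)

# Assumed scales: a swimmer about 1 m long moving at about 1 m/s.
swimmer = reynolds_number(speed=1.0, length=1.0, kinematic_viscosity=NU_WATER)
```

With these numbers the two values differ by more than ten orders of magnitude, which is why intuition from swimming does not transfer to bacteria.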

As explained in the paper, the Reynolds number is one of the essential constants describing a flow. High Reynolds number leads to turbulent (chaotic) flow, while low Reynolds number leads to laminar (smooth) flow. For extremely small Reynolds number, Navier-Stokes simplifies to:

$\nabla^* p^* = (\nabla^*)^2\vec{u^*}$

which is also just called the Stokes equation.

At the end of the paper, Purcell describes another dimensionless number which he calls $S$ and in a footnote identifies as the Sherwood number. However, Ben Regner pointed out that Purcell’s $S$ would actually be called the Peclet number today.

Basics of E. coli Chemotaxis

Chemotaxis and cellular sensing really deserve their own series of papers. But in the meantime, I recommend the following resources:

Chemotaxis on Wikipedia
Howard Berg’s videos on individual E. coli
Howard Berg’s videos on swarms of E. coli
Berg and Purcell, Physics of chemoreception, 1977.

Video Proof of Purcell’s Scallop Theorem

Reversible kicking does fine in water (high Reynolds number)…

… but the same motion has issues in corn syrup (low Reynolds number).

Here is a solution similar to what E. coli and other bacteria employ.

Fundamental Questions

Purcell does an amazing job, so I have nothing to add.

Advanced Questions

What are some other strategies that are employed in biology to get around the issue of mobility at low Reynolds number? Hint: I already linked to a video of one strategy. There are at least two other strategies, but to find these you will need to think about the assumptions leading to the basic Navier-Stokes equations.

Posted on March 24, 2016 General science, Research, Teaching

Anomalous Diffusion

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

Unit: Diffusion

Organized by Ben Regner

Papers

Random walk models in biology. By Codling, Plank, and Benhamou in 2008.

Other Useful References

Anomalous Diffusion

What is anomalous diffusion?

If one measures the mean square displacement vs time, it can be parameterized as

$\langle x^2 \rangle \propto t^\alpha$

where $\alpha=1$ is Brownian (standard diffusion), $0<\alpha<1$ is subdiffusive, $1<\alpha<2$ is superdiffusive, and $\alpha=2$ is ballistic. So the technical definition of anomalous diffusion is $0<\alpha<1$ or $1<\alpha<2$.
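In practice, the exponent is usually read off as the slope of the mean square displacement on a log-log plot. The sketch below does that with an ordinary least-squares fit. The function names and the tolerance used to call something "Brownian" or "ballistic" are my own assumptions for illustration, and the input curves are synthetic power laws rather than real data.

```python
import math


def fit_alpha(times, msd):
    """Least-squares slope of log(msd) versus log(times), i.e. the exponent alpha."""
    xs = [math.log(t) for t in times]
    ys = [math.log(m) for m in msd]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


def classify(alpha, tol=0.05):
    """Label an exponent; tol is an arbitrary tolerance chosen for this sketch."""
    if abs(alpha - 1.0) <= tol:
        return "Brownian"
    if abs(alpha - 2.0) <= tol:
        return "ballistic"
    if 0.0 < alpha < 1.0:
        return "subdiffusive"
    if 1.0 < alpha < 2.0:
        return "superdiffusive"
    return "outside the 0 to 2 range"


# Synthetic mean square displacement curves with a known exponent.
times = [1, 2, 4, 8, 16, 32]
for true_alpha in (0.5, 1.0, 1.5, 2.0):
    msd = [3.0 * t ** true_alpha for t in times]
    alpha = fit_alpha(times, msd)
    print(f"fitted alpha = {alpha:.2f} -> {classify(alpha)}")
```

With real trajectories the fit is noisy and the exponent can change with the time window, so the fitted value should be treated as an estimate over a stated range of times.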

How to describe anomalous diffusion?

Currently, there is no “best” or “simple” description of anomalous diffusion in the general case. However, continuous-time random walks (CTRW) are one paradigm that I find helpful as a conceptual and simulation framework.

In the simplest discrete random walk (DRW), a particle makes a jump of fixed size at every time step, and the only question is the direction. The next generalization has the particle make a jump at every time step, but now it draws the jump size from a distribution.

The idea of a CTRW is that there is now a distribution both of the waiting time between jumps, and the jump size. If the waiting time follows the exponential distribution and the jump size follows the normal distribution, one ends up with the Wiener process aka standard diffusion and Brownian motion.
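Here is a minimal simulation sketch of a CTRW with exponential waiting times and normal jumps. The function names, the unit rate, the unit jump variance, and the number of walkers are all assumptions made for illustration. With these choices the mean square displacement should grow roughly linearly with the observation time.

```python
import random


def ctrw_position(t_obs, draw_wait, draw_jump):
    """Position at time t_obs of one walker that starts at x = 0 at t = 0."""
    t, x = 0.0, 0.0
    while True:
        t += draw_wait()      # wait before the next jump
        if t > t_obs:         # the next jump would happen after we stop watching
            return x
        x += draw_jump()


def mean_square_displacement(t_obs, n_walkers, draw_wait, draw_jump):
    """Average of x^2 at time t_obs over independent walkers."""
    total = 0.0
    for _ in range(n_walkers):
        total += ctrw_position(t_obs, draw_wait, draw_jump) ** 2
    return total / n_walkers


rng = random.Random(0)


def exponential_wait():
    return rng.expovariate(1.0)   # mean waiting time of 1


def normal_jump():
    return rng.gauss(0.0, 1.0)    # zero mean, unit variance


for t_obs in (10.0, 20.0, 40.0):
    msd = mean_square_displacement(t_obs, 4000, exponential_wait, normal_jump)
    print(f"t = {t_obs:5.1f}   <x^2> ~ {msd:.1f}")
```

The two samplers are passed in as functions so that either distribution can be swapped out, for example for a heavy-tailed waiting time, without touching the walker itself.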

What causes anomalous diffusion?

Just as a reminder, there are three conditions that need to be satisfied for Brownian motion (standard diffusion):
1. Increments are independent
2. Increments are wide sense stationary. 1st moment and autocovariance don’t depend on time (this is a weaker condition than complete stationarity)
3. Zero mean

The third condition is often ignored by examining the motion relative to the mean displacement (ie the actual displacement is not Brownian, but fluctuations in the displacement could be Brownian). So really, the first two are the more important conditions. Therefore, anomalous diffusion arises due to non-independent increments and/or correlations in time of the mean and/or standard deviation.

The CTRW allows one to think more precisely about different mechanisms that can give rise to anomalous diffusion. There is not one single way to get sub- or superdiffusion in a CTRW, since there are two, potentially dependent, distributions (waiting time and jump size). However, there are a few common situations that seem to arise often in biology and elsewhere (see Random walk models in biology, Box 2 for the original idea). Subdiffusion in biology is often caused by longer waiting time distributions (compared to exponential) or by molecular crowding, while superdiffusion occurs when jump sizes are drawn from a Lévy flight or other alpha-stable distribution.

Examples

For further exploration of anomalous diffusion in biology, I recommend these papers

Advanced Questions

This is an interesting paper that introduces a renormalization group approach to classifying diffusion processes

Posted on March 12, 2016 (updated March 24, 2016) General science, Research, Teaching

Standard Diffusion

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

Unit: Diffusion

Organized by Ben Regner

Papers

Brownian Motion. By Einstein in 1905.

Brownian Motion. By Langevin in 1908.

An Introduction to Fractional Diffusion. By Henry, Langlands, and Straka in 2010.

Other Useful References

What is diffusion?

Diffusion is the general process by which small particles move from regions of high concentration to low concentration. Check out the link to the Wikipedia articles above for some cool videos and animations. Diffusion is ubiquitous and plays an essential role in biology. For example, oxygen diffuses from your lungs to unoxygenated blood, which then delivers it to the rest of your body where it diffuses out of your blood and into your cells. Additionally, signals between neurons are transmitted by several different diffusing molecules.

Mathematically, standard diffusion is described by two fundamental equations.

Fick’s First Law: Particles move from high-to-low concentration.

$j=-D\frac{\partial n}{\partial x}$

where $n$ is the particle concentration (number of particles per unit length in one dimension), $x$ is position, $D$ is the diffusion constant, and $j$ is the flux of particles.

Fick’s Second Law: Conservation of particles combined with Fick’s First Law leads to the diffusion equation.

If particles cannot be created or destroyed, they follow a conservation law:

$\frac{\partial n}{\partial t} = -\frac{\partial j}{\partial x}$

Combining the conservation law with Fick’s First Law gives us the diffusion equation:

$\frac{\partial n}{\partial t} = D \frac{\partial^2 n}{\partial x^2}$
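The two-step structure (flux from Fick’s First Law, then conservation) maps directly onto a simple numerical scheme. The sketch below is an explicit finite-difference update on a one-dimensional grid with closed walls. The grid size, time step, and initial condition are arbitrary choices for illustration, and the function names are my own.

```python
def diffuse(n, D, dx, dt, steps):
    """Explicit finite-difference solution of dn/dt = D d2n/dx2 with zero-flux walls."""
    if D * dt / dx ** 2 > 0.5:
        raise ValueError("explicit scheme is unstable for D*dt/dx^2 > 1/2")
    n = list(n)
    for _ in range(steps):
        # Fick's First Law: flux across the face between cell i and cell i + 1.
        flux = [-D * (n[i + 1] - n[i]) / dx for i in range(len(n) - 1)]
        flux = [0.0] + flux + [0.0]   # no particles cross the outer walls
        # Conservation law: dn/dt = -dj/dx, with flux[i] the left face of cell i.
        n = [n[i] - dt * (flux[i + 1] - flux[i]) / dx for i in range(len(n))]
    return n


def variance(n, dx):
    """Spatial variance of the concentration profile."""
    total = sum(n)
    xs = [i * dx for i in range(len(n))]
    mean = sum(x * c for x, c in zip(xs, n)) / total
    return sum((x - mean) ** 2 * c for x, c in zip(xs, n)) / total


D, dx, dt, steps = 1.0, 1.0, 0.2, 200
start = [0.0] * 201
start[100] = 1000.0                   # all particles begin in the middle cell
end = diffuse(start, D, dx, dt, steps)

print("particles before and after:", sum(start), round(sum(end), 6))
print("variance of profile:", round(variance(end, dx), 3))
print("2 * D * t:", 2 * D * dt * steps)
```

Two checks are worth making on the output: the total number of particles should not change, and the variance of the profile should track $2Dt$ as long as the profile has not reached the walls.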

Brownian Motion

In 1827 Robert Brown looked at pollen in water under a microscope (see the Wikipedia page for simulations of the observations). Much to his surprise, the pollen acted as if it were alive! Brown verified that the pollen was not alive and that any small, inorganic particle followed similar motion. In 1905, during his miracle year, Einstein wrote a paper giving an atomistic description of Brownian motion. In 1908 Langevin used a different approach (that is “infinitely simpler” in his words) to describe Brownian motion. The general explanations are outlined below.

1. Einstein’s Derivation

Einstein’s goal was a probability based description of Brownian motion that connects to Fick’s law. Einstein makes several assumptions about the particles, including

Changes in boundaries do not affect the particles
Particles are well-separated
Stokes’ relation, which implies low Reynolds number

In the end, Einstein finds a solution that is Gaussian, implying that the mean square displacement is linear in time for Brownian motion:

$\langle x^2 \rangle = 2Dt$

More generally, the mean square displacement could depend on some power of time, usually parameterized as

$\langle x^2 \rangle \propto t^\alpha$

where $\alpha=1$ is Brownian, $0<\alpha<1$ is subdiffusive, $1<\alpha<2$ is superdiffusive, and $\alpha=2$ is ballistic. Note, one can get up to $\alpha=3$ in certain turbulent regimes.

2. Langevin’s Derivation
The Langevin approach is to start with a particle-based description. The first assumption is the equipartition theorem, which fixes the average kinetic energy (KE):
$KE = \frac{1}{2} m \left\langle \left(\frac{dx}{dt}\right)^2 \right\rangle = \frac{k_B T}{2}$

Then, one looks at the actual forces on the particle:

mass × acceleration = Stokes drag + stochastic force
$m \frac{d^2 x}{dt^2} = -6 \pi \eta r \frac{dx}{dt} + X$
where $\eta$ is the viscosity of the fluid, $r$ is the radius of the particle, and $X$ is a stochastic force. $X$ is assumed to have zero mean, unit variance, and no time correlations, aka white noise.

After multiplying both sides of the equation by $x$, doing some algebra, and then taking the average, one arrives at the same result as Einstein (after ignoring a short-time transient).
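For readers who want the algebra spelled out, here is a sketch of the steps. It starts from Newton’s law with Stokes drag and a random force, $m \frac{d^2 x}{dt^2} = -\gamma \frac{dx}{dt} + X$, where $\gamma = 6\pi\eta r$ is shorthand introduced here. It also assumes $\langle xX \rangle = 0$, meaning the random force is uncorrelated with position.

Multiply by $x$ and use the identities $x\frac{d^2 x}{dt^2} = \frac{1}{2}\frac{d^2 (x^2)}{dt^2} - \left(\frac{dx}{dt}\right)^2$ and $x\frac{dx}{dt} = \frac{1}{2}\frac{d(x^2)}{dt}$:

$\frac{m}{2}\frac{d^2 (x^2)}{dt^2} - m\left(\frac{dx}{dt}\right)^2 = -\frac{\gamma}{2}\frac{d(x^2)}{dt} + xX$

Average over many particles and apply equipartition, $m\left\langle \left(\frac{dx}{dt}\right)^2 \right\rangle = k_B T$:

$\frac{m}{2}\frac{d^2 \langle x^2 \rangle}{dt^2} + \frac{\gamma}{2}\frac{d \langle x^2 \rangle}{dt} = k_B T$

This is a first-order linear equation for $\frac{d\langle x^2 \rangle}{dt}$, with solution

$\frac{d \langle x^2 \rangle}{dt} = \frac{2 k_B T}{\gamma} + C e^{-\gamma t/m}$

The exponential term is the short-time transient. Once it has decayed,

$\langle x^2 \rangle = \frac{2 k_B T}{\gamma} t = \frac{k_B T}{3 \pi \eta r} t$

which is linear in time. Comparing with $\langle x^2 \rangle = 2Dt$ identifies the diffusion constant as $D = \frac{k_B T}{6 \pi \eta r}$.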

3. Random Walk Derivation.

There is a third way to derive Brownian motion that is laid out in the book chapter above. The idea is to look at a single particle and do a microscopic random walk. One can set up a recursive definition that defines a binomial probability solution. After a large number of steps, the central limit theorem applies and we end up with a Gaussian solution.
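A small numerical sketch makes the binomial-to-Gaussian step concrete. It assumes the simplest version of the walk (unit steps of +1 or -1 with equal probability), which may differ in detail from the setup in the book chapter. The function names are my own.

```python
import math
import random


def walk_endpoints(n_steps, n_walkers, rng):
    """Final positions of walkers that each take n_steps jumps of +1 or -1."""
    return [sum(rng.choice((-1, 1)) for _ in range(n_steps)) for _ in range(n_walkers)]


def binomial_probability(n_steps, position):
    """Exact probability of ending at `position` after n_steps steps of +/-1."""
    if (n_steps + position) % 2 or abs(position) > n_steps:
        return 0.0   # wrong parity or out of reach
    n_right = (n_steps + position) // 2
    return math.comb(n_steps, n_right) / 2 ** n_steps


def gaussian_probability(n_steps, position):
    """Gaussian approximation; the factor 2 is because reachable sites are 2 apart."""
    norm = math.sqrt(2.0 * math.pi * n_steps)
    return 2.0 * math.exp(-position ** 2 / (2.0 * n_steps)) / norm


n_steps = 100
for position in (0, 10, 20):
    exact = binomial_probability(n_steps, position)
    approx = gaussian_probability(n_steps, position)
    print(f"x = {position:3d}   binomial {exact:.4f}   Gaussian {approx:.4f}")

endpoints = walk_endpoints(n_steps, 5000, random.Random(0))
mean_square = sum(x * x for x in endpoints) / len(endpoints)
print(f"simulated <x^2> after {n_steps} steps: {mean_square:.1f}")
```

The mean square displacement of the simulated walkers should come out close to the number of steps, which is the discrete version of the linear-in-time result above.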

How do we get Brownian motion?

In general, there are three conditions that need to be satisfied for Brownian motion:
1. Increments are independent
2. Increments are wide sense stationary. 1st moment and autocovariance don’t depend on time (this is a weaker condition than complete stationarity)
3. Zero mean

Fundamental Questions

Einstein made three major assumptions in his derivation. Two of the three are often violated in biology. Which assumption is relatively safe?
What biological processes do you think are actually diffusive vs sub/super-diffusive? Think about the 3 conditions for Brownian motion listed above. Note, this is a preview for the next post.

Advanced Questions

Learn about stochastic differential equations and their application to diffusion.

Posted on March 6, 2016 General science, Research, Teaching

Deep Learning

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

Unit: Deep Learning

Papers

Deep Learning. By LeCun, Bengio, and Hinton in 2015.

Deep learning in neural networks: An overview. By Schmidhuber in 2015.

Other Useful References

Deep Learning – Wikipedia
Deep Learning Book (Under Development)
Michael Nielsen EBook

What is deep learning?

In previous weeks we have introduced perceptrons, multilayer perceptrons (MLP), Hopfield neural networks, and Boltzmann machines.

But what is deep learning? I think it is really two things:

Successful training of multilayered neural networks that perform better (higher classification accuracy, etc.) and involve more layers than previous implementations
Just a rebranding of neural networks

Here is my summary of the history of deep learning; see both reviews above for extended details. In 2006, Deep Belief Networks (DBN) were introduced in two papers (Reducing the Dimensionality of Data with Neural Networks and A Fast Learning Algorithm for Deep Belief Nets). The idea of a DBN is to train a series of restricted Boltzmann machines (RBM). The first RBM is trained and, given the original data, produces an output of hidden layer activations. The second RBM uses the first RBM’s hidden layer activations as inputs, and trains on that “data”. This is continued to the desired depth. At this point, the DBN can be used for unsupervised learning, or one can use it as pretraining for an MLP, which will utilize backpropagation for supervised learning.

So on the one hand, there were technical breakthroughs that enabled neural networks to utilize more layers than previous iterations and achieve state of the art performance. However, the actual components (RBMs and MLPs) have been around since the 1980s, so it would also be fair to deem deep learning a rebranding of neural networks.

Therefore, neural network winter (mid 90s to mid 00s) officially ended in 2006. I propose calling 2006-2012 neural network spring. While interest in neural networks increased and new advances were made, the general machine learning community was not obsessed with deep learning. That changed in 2012 when the neural network summer began. This paper presented at NIPS revolutionized the computer vision community by cutting the error rate on ImageNet in half! The ImageNet challenge was viewed as a serious benchmark that all computer vision systems should address. By blowing previous results out of the water, the revolution was completed.

So for now, enjoy neural network summer, but always remember, winter is coming.

Fundamental Questions

When do extra layers help in a neural network? When do they hurt?
Why was pretraining originally needed, but is no longer used in practice? Check out these papers for details: Glorot and Saxe.
Learn about convolutional and recurrent neural networks. These are extremely popular right now!

Advanced Questions

Do research on unsupervised learning! It is definitely less popular today, but all the big-shots think it is the long-term future of neural networks.

Posted on February 8, 2016 (updated March 6, 2016) General science, Research, Teaching

Training Networks

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

Unit: Deep Learning

Papers

A Learning Algorithm for Boltzmann Machines. By Ackley, Hinton, and Sejnowski in 1985.

Learning representations by back-propagating errors. By Rumelhart, Hinton, and Williams in 1986.

Other Useful References

Boltzmann Machine (BM) – Wikipedia and Scholarpedia
MacKay Ch 43 (Boltzmann).
Hinton guide to RBMs
Backpropagation – Wikipedia
Multilayer perceptron (MLP) – Wikipedia
Michael Nielsen Chapter 2

How do you actually train neural networks?

Hopefully the past few posts have piqued your interest in neural networks. Maybe you even want to unleash a neural network on some data. How do you actually train the neural network?

I’m actually going to keep this brief for two reasons. First, detailed derivations can already be found elsewhere (for Boltzmann see Appendix of the original paper as well as MacKay, for backpropagation see Nielsen). Second, I firmly believe that algorithms are best learned by actually stepping through the updates, so any explanation I attempt will not be sufficient for you to truly learn the algorithm. I will provide some general context as well as some questions you should be able to answer, but please go do it yourself!

There are three general classes of machine learning based on the information received:

Unsupervised – data only. Boltzmann machine.
Supervised – data with labels. MLP with backpropagation.
Reinforcement – data, actions, and scores associated with each action. Deserves its own detailed post, but check out papers by DeepMind for cool applications.

The Boltzmann machine learning rule is an example of maximum likelihood. In practice, the original learning rule is too computationally expensive, so a modified algorithm called contrastive divergence (or variants such as persistent contrastive divergence) is utilized instead. See the Hinton guide to RBMs for more details.
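To give a sense of what contrastive divergence looks like in code, here is a minimal sketch of a CD-1 update for a binary restricted Boltzmann machine. It is a toy illustration under my own assumptions (full-batch updates, a fixed learning rate, a six-unit toy dataset, hypothetical function names), not the exact recipe from the Hinton guide, which covers many practical details this skips.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cd1_update(v0, W, b_vis, b_hid, lr, rng):
    """One CD-1 step on a batch of binary visible rows v0. Updates parameters in place."""
    # Positive ("data") phase: hidden units driven by the data.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative ("model") phase: one Gibbs step instead of running to equilibrium.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b_hid)
    # Approximate likelihood gradient: data correlations minus model correlations.
    batch = v0.shape[0]
    W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / batch
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    # Reconstruction error is only a rough progress monitor, not the likelihood.
    return float(np.mean((v0 - p_v1) ** 2))


rng = np.random.default_rng(0)
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = np.repeat(patterns, 50, axis=0)   # toy dataset: 100 binary vectors
n_vis, n_hid = data.shape[1], 2
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_vis = np.zeros(n_vis)
b_hid = np.zeros(n_hid)

errors = [cd1_update(data, W, b_vis, b_hid, lr=0.2, rng=rng) for _ in range(2000)]
print(f"reconstruction error: first {errors[0]:.3f}, last {errors[-1]:.3f}")
```

The design choice that makes this cheap is the single Gibbs step in the negative phase. The original learning rule would need the chain to reach equilibrium before measuring the model correlations.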

Backpropagation is a computationally-efficient writing of the chain rule from calculus, so besides the above paper which popularized it, there is actually a long history of this algorithm being discovered and rediscovered.
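As a concrete check of the chain-rule view, the sketch below computes gradients for a tiny two-layer network in two ways: one backward pass, and a finite-difference estimate that perturbs one parameter at a time. The network shape, the tanh hidden layer, and the squared-error loss are my own choices for illustration, not something prescribed by the paper.

```python
import numpy as np


def forward(params, x):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    a1 = np.tanh(z1)
    y = W2 @ a1 + b2          # linear output layer
    return z1, a1, y


def loss(params, x, target):
    return 0.5 * np.sum((forward(params, x)[2] - target) ** 2)


def backprop(params, x, target):
    """Gradients for every parameter from one forward and one backward pass."""
    W1, b1, W2, b2 = params
    z1, a1, y = forward(params, x)
    delta2 = y - target                          # dL/dy
    delta1 = (W2.T @ delta2) * (1.0 - a1 ** 2)   # chain rule through tanh
    return [np.outer(delta1, x), delta1, np.outer(delta2, a1), delta2]


def finite_difference(params, x, target, eps=1e-6):
    """Same gradients, but with two extra loss evaluations per parameter."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            up = loss(params, x, target)
            p[idx] = old - eps
            down = loss(params, x, target)
            p[idx] = old                         # restore the parameter
            g[idx] = (up - down) / (2.0 * eps)
        grads.append(g)
    return grads


rng = np.random.default_rng(0)
params = [rng.normal(size=(4, 3)), rng.normal(size=4),
          rng.normal(size=(2, 4)), rng.normal(size=2)]
x, target = rng.normal(size=3), rng.normal(size=2)

analytic = backprop(params, x, target)
numeric = finite_difference(params, x, target)
worst = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))
print(f"largest disagreement between the two methods: {worst:.2e}")
```

The two methods should agree to within numerical error, but the finite-difference version needs a number of loss evaluations proportional to the number of parameters, while backpropagation needs a single backward pass.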

Fundamental Questions

What is maximum likelihood?
Why can one interpret the learning terms in the BM algorithm as “waking” and “sleeping”?
Why are BM hidden layers so important?
Why are restricted Boltzmann machines, RBMs, much easier to train?
Why is backpropagation more computationally efficient than the finite difference method?
Derive the 4 backpropagation equations!

Advanced Questions

Follow Hinton’s RBM guide and implement your own Boltzmann machine
Use Nielsen’s code to train your own MLP
