# Deep Learning Seminar Course

This semester Terry Sejnowski is teaching a graduate seminar course that is focused on Deep Learning. The course meets weekly for two hours to discuss papers. Here I’ll just outline the course and in later posts I’ll add some thoughts on each specific week.

# I3: International Institute for Intelligence

While I was previously discussing my opinion of OpenAI, I mentioned that I would do something different if I were in charge. Here is my dream.

# What OpenAI is Missing

Helping everyday people throughout the whole world.

OpenAI’s stated goal is:

OpenAI is a non-profit artificial intelligence research company. Our goal is to advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return.

In the short term, we’re building on recent advances in AI research and working towards the next set of breakthroughs.

However, based on their actions so far, this interview with Ilya Sutskever, and popular press articles, the main focus of OpenAI appears to be advanced research in artificial intelligence with an emphasis on open source, as well as thinking long term about the impacts of letting advanced artificial intelligence systems control large aspects of our lives. While I strongly support these goals, in reality, they will not benefit all of humanity. Instead, they only benefit those with either the necessary training (a bachelor’s at minimum, but usually a master’s or PhD) or money (to hire top people, buy the required computing resources, etc.) to take advantage of the advanced research. So this leaves out the developing world as well as the poor in developed countries, i.e., contrary to their stated goal, OpenAI is missing the vast majority of humanity.

One can argue that because OpenAI’s research is open source, it will eventually trickle down and help a wider swath of humanity. However, the current trend suggests that large corporations are poised to benefit the most from the next revolution (I mean, who is more likely to invent a self-driving car, Google, or someone in a developing country?). Additionally, these innovations focus on first world problems (since these are the highest paying customers). And finally, each round of innovation ends up creating fewer and fewer jobs (so the number of unemployed in developed countries may expand). I firmly believe that unless there is a global educational effort (and probably an implementation of basic income), the benefits of AI will be directed towards a tiny sliver of the world’s population.

# My Proposal: I3

Here I lay out my proposal for a new institute that would actually expand the benefits of recent and future advances in machine learning / artificial intelligence to a wider swath of humanity. I don’t claim that it would truly benefit all of humanity (again, see basic income), but it is a way for research advances to reach a larger proportion of it.

I propose a new education and research institute focused on artificial intelligence, machine learning, and computational neuroscience which I’ll call the International Institute for Intelligence. I like alliterations, and since I think it should focus on three types of intelligence, I especially like the idea of calling it I3 or I-Cubed for short.

Why these three research areas? Well, machine learning is currently revolutionizing how companies use data and is facilitating new technological advances every day. Designing artificial intelligence systems on top of these machine learning algorithms seems like a realistic possibility in the near future. The less conventional choice is computational neuroscience. I think it is important to include for two reasons. First, the brain is the best example we have of an intelligent system, so until we actually design an artificial intelligence, it seems best to understand and mimic the best example (this is the philosophy of DeepMind according to Demis Hassabis). Second, the US Brain Initiative and similar international efforts are injecting significant resources into neuroscience, with the hopes of sparking a revolution similar in spirit and magnitude to the widespread effect the Human Genome Project had on biotechnology and genomics. So I figure we might as well prepare everyone for this future.

So what would be the actual purpose of I3? Sticking with the theme of threes, I propose three initiatives that I will list in my order of importance as well as some bonus points.

# 1. International PhD Education

The central goal is to build a program similar to ICTP (International Centre for Theoretical Physics) but with a different research emphasis. So what is ICTP? It was founded by Nobel Prize winner Abdus Salam and it has several programs to promote research in developing countries, including:

• Predoctoral program – students get a 1 year course to prep them for PhDs
• Visiting PhD program – students in a developing nation PhD program get to spend a couple of months each year for 3 years at ICTP to participate in their research
• Conferences
• Regional offices (currently São Paulo, Brazil, but more in the planning)

So the idea is to implement a similar program but with the research emphasis now focused on machine learning, artificial intelligence, and computational neuroscience. While I think the main thing is to get the predoctoral program and visiting PhD program started, eventually it would be great to have 5 regional offices spread throughout the developing world. For example, I think one is needed in South America (Lima, Peru?), one in Africa (Nairobi, Kenya?), and 2 in Asia (India, and China, but not in a traditional technological center). And assuming I3 is based in the US (see my case for San Diego below), it would be great to have an affiliate office in Europe, maybe in Trieste next to ICTP.

One additional initiative that I think could be useful would be paying people to not leave their country and instead help them establish a research center at their local universities. This could also wait until later because it might be easiest to convince some of the future alumni of the predoctoral or visiting PhD programs to return/stay in their home country.

A second additional initiative would be to encourage professors from developed and developing countries to take their sabbatical at I3. This would provide a fresh stream of mentors and set up potential future collaborations. This is a blend of two programs at KITP (this and that).

# 2. US Primary School Education

The science pipeline analogy is overused, but I don’t have a better one yet. So currently, the researchers in I3 focused areas are predominately male, white or Asian, and middle to upper class. So not a very representative sample of the US (or world) population. Therefore, the best longterm solution is to get a more diverse set of students interested in the research at a young age.

Technically this should have a higher priority than the next initiative (US College Education), but since there are other non-profits interested in this (for example, CodeNow), maybe I3 does not need to be a leader in this and instead can play a supporting role.

# 3. US College Education

And again back to the science pipeline analogy, if we are to have a more diverse set of researchers, we need to encourage a diverse set of undergrads to pursue relevant majors and continue on into graduate programs. This won’t be solved by any single program, but here are some potential ideas.

• US underrepresented students could apply for the same 1 year program that is offered to international students.
• Assist universities in establishing bridge programs that partner research universities with colleges that have significant minority populations. A great example of this is the Vanderbilt-Fisk Physics program.
• US colleges would also benefit from the proposed sabbatical program offered to international researchers. I also like the KITP idea of extending it to undergraduate only institutes (especially those with large minority populations) as a way to get more undergrads interested in research.
• Establish a complete, free college curriculum for machine learning, artificial intelligence, and computational neuroscience. While there are many useful MOOCs on these topics, I still don’t think they beat an actual course.

# Bonus #1 : Research

ICTP has proven that it is possible to further global educational goals and still succeed at research. I would argue that the people working at I3 should mainly be evaluated for tenure based on their mentorship and teaching of students. Research of course will play a role (otherwise it would be poor mentorship of future researchers), but I think there shouldn’t be huge pressure to bring in grants, high-profile publications, etc. But even without that emphasis, there is no way that a group of smart people with motivated students will not lead to great research.

# Bonus #2: International Primary and College Education

This is longer term, but if there are successful programs in improving the US primary and college education, international regional offices, and PhD alumni who are in their home countries, it seems like it should be possible to leverage those connections into a global initiative to improve primary and college education.

# Final Thoughts

So Elon Musk, Peter Thiel, and friends, if you have another billion you want to donate (or OpenAI funds to redirect), here is my proposal. In reality, implementing all of my ideas would probably cost several billion, but once you got the center founded, I think that it would be easy to get tech companies, the US government, and even UNESCO to help provide funding.

My final point is that I think San Diego would be a perfect location. I know I’m biased since I live here now, but there are many legitimate reasons San Diego is great for this institute.

1. UCSD already partners with outside research institutes (Salk, Scripps, etc)
2. UCSD (and Salk, etc) are leaders in all of these research areas
3. It is extremely easy to convince people to take a sabbatical in San Diego

While there are many other great potential locations, I strongly suggest that I3 not be in the Bay Area, Seattle, Boston, or New York City. These cities already have plenty of tech jobs, so please spread the wealth to other parts of the US.

Anyways, I’ll keep dreaming that someday I’ll get to work at a place like the one I just described.

# Life at Low Reynolds Number

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

## Unit: Diffusion

Organized by Ben Regner

1. Standard Diffusion
2. Anomalous Diffusion
3. Life at Low Reynolds Number

## Papers

Life at Low Reynolds Number. By Purcell in 1977.

## Introduction

This is one of my favorite papers. The presentation style is extremely fun and readable without sacrificing any scientific integrity. I think it serves as a great introduction to fluid mechanics at low Reynolds number. I don’t have too many comments since I think the paper explains it best, but I will provide a few supplementary details for a more in-depth exploration of the ideas from the paper.

And just to get you excited about fluid dynamics, I present an example of laminar flow:

## Basics of Fluid Mechanics

The fundamental equation of fluid mechanics is Navier-Stokes. The relevant version for this paper is the incompressible flow equations with pressure but no other external fields:

$\frac{\partial \vec{u}}{\partial t}+ \vec{u}\cdot\nabla\vec{u} +\frac{1}{\rho}\nabla p -\nu\nabla^2\vec{u}=0$

where $\vec{u}$ is the velocity vector, $\vec{x}$ is position, $\rho$ is density, $p$ is pressure, and $\nu$ is the kinematic viscosity. This equation can be made non-dimensional by introducing a characteristic velocity $U$, a characteristic length $L$, and the dynamic viscosity $\eta=\nu\rho$. This gives the following dimensionless variables:

$u^* = \frac{u}{U}$

$x^* = \frac{x}{L}$

$p^* = \frac{pL}{\eta U}$

$t^* = \frac{tU}{L}$

Substituting in these dimensionless variables and doing some algebra, one arrives at the simplified equations:

$R\frac{\partial \vec{u^*}}{\partial t^*}+ R\vec{u^*}\cdot\nabla^*\vec{u^*} +\nabla^* p^*-(\nabla^*)^2\vec{u^*}=0$

with only one dimensionless constant, the Reynolds number, defined as:

$R = \frac{UL\rho}{\eta} = \frac{UL}{\nu}$
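For anyone who wants the algebra spelled out, here is a sketch of the substitution. It assumes the time scale is $L/U$ (so $t = t^* L/U$) and uses $\nu = \eta/\rho$, which follows from the definition of $R$ above. Each term of Navier-Stokes transforms as

$\frac{\partial \vec{u}}{\partial t} = \frac{U^2}{L}\frac{\partial \vec{u^*}}{\partial t^*}, \quad \vec{u}\cdot\nabla\vec{u} = \frac{U^2}{L}\vec{u^*}\cdot\nabla^*\vec{u^*}, \quad \frac{1}{\rho}\nabla p = \frac{\eta U}{\rho L^2}\nabla^* p^*, \quad \nu\nabla^2\vec{u} = \frac{\nu U}{L^2}(\nabla^*)^2\vec{u^*}$

Multiplying the whole equation by $L^2/(\nu U)$ then gives

$\frac{UL}{\nu}\left(\frac{\partial \vec{u^*}}{\partial t^*}+ \vec{u^*}\cdot\nabla^*\vec{u^*}\right) +\frac{\eta}{\rho\nu}\nabla^* p^*-(\nabla^*)^2\vec{u^*}=0$

and since $\eta/(\rho\nu) = 1$, the only parameter left is $UL/\nu = R$.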

As explained in the paper, the Reynolds number is one of the essential constants describing a flow. High Reynolds number leads to turbulent (chaotic) flow, while low Reynolds number leads to laminar (smooth) flow. For extremely small Reynolds number, Navier-Stokes simplifies to:

$\nabla^* p^* = (\nabla^*)^2\vec{u^*}$

which is also just called the Stokes equation.
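To get a feel for the scales involved, here is a minimal sketch that evaluates $R = UL/\nu$ for two cases. The sizes and speeds are rough, order-of-magnitude assumptions on my part (a micron-sized bacterium swimming at tens of microns per second, and a meter-sized swimmer moving at about a meter per second), and the viscosity is an approximate value for water.

```python
def reynolds_number(speed, length, kinematic_viscosity):
    """Reynolds number R = U * L / nu, with all inputs in SI units."""
    return speed * length / kinematic_viscosity


# Approximate kinematic viscosity of water near room temperature (assumed value).
NU_WATER = 1.0e-6  # m^2 / s

# Illustrative order-of-magnitude inputs, not measurements.
bacterium = reynolds_number(speed=30e-6, length=1e-6, kinematic_viscosity=NU_WATER)
swimmer = reynolds_number(speed=1.0, length=1.0, kinematic_viscosity=NU_WATER)

print(f"bacterium:     R ~ {bacterium:.0e}")
print(f"human swimmer: R ~ {swimmer:.0e}")
```

With these inputs the two cases differ by roughly ten orders of magnitude, which is why intuition from swimming in a pool does not transfer to a bacterium.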

At the end of the paper, Purcell describes another dimensionless number which he calls $S$ and in a footnote identifies as the Sherwood number. However, Ben Regner pointed out that Purcell’s $S$ would actually be called the Péclet number today.

## Basics of E. coli Chemotaxis

Chemotaxis and cellular sensing really deserve their own series of papers. But in the meantime, I recommend the following resources

## Video Proof of Purcell’s Scallop Theorem

Reversible kicking does fine in water (high Reynolds number)…

… but the same motion has issues in corn syrup (low Reynolds number).

Here is a solution similar to what E. coli and other bacteria employ.

## Fundamental Questions

• Purcell does an amazing job, so I have nothing to add.

• What are some other strategies that are employed in biology to get around the issue of mobility at low Reynolds number? Hint: I already linked to a video of one strategy. There are at least two other strategies, but to find these you will need to think about the assumptions leading to the basic Navier-Stokes equations.

# Anomalous Diffusion

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

## Unit: Diffusion

Organized by Ben Regner

1. Standard Diffusion
2. Anomalous Diffusion
3. Life at Low Reynolds Number

## What is anomalous diffusion?

If one measures the mean square displacement vs time, it can be parameterized as

$\langle x^2 \rangle \propto t^\alpha$

where $\alpha=1$ is Brownian (standard diffusion), $0<\alpha<1$ is subdiffusive, $1<\alpha<2$ is superdiffusive, and ballistic is $\alpha=2$. So the technical definition of anomalous diffusion is $0<\alpha<1$ or $1<\alpha<2$.
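In practice $\alpha$ is usually estimated as the slope of the mean square displacement on a log-log plot. Here is a minimal sketch of that procedure on simulated unbiased $\pm 1$ random walks, where the answer should come out close to $\alpha=1$. The function names, the number of walkers, and the plain least-squares fit are my own illustrative choices.

```python
import math
import random


def mean_square_displacement(n_walkers, n_steps, rng):
    """Return <x^2> at t = 1..n_steps for unbiased +/-1 random walks."""
    msd = [0.0] * n_steps
    for _ in range(n_walkers):
        x = 0
        for t in range(n_steps):
            x += rng.choice((-1, 1))
            msd[t] += x * x
    return [total / n_walkers for total in msd]


def fit_exponent(times, msd):
    """Least-squares slope of log(msd) against log(time)."""
    xs = [math.log(t) for t in times]
    ys = [math.log(m) for m in msd]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


rng = random.Random(0)  # fixed seed so the run is repeatable
n_steps = 200
msd = mean_square_displacement(n_walkers=2000, n_steps=n_steps, rng=rng)
alpha = fit_exponent(range(1, n_steps + 1), msd)
print(f"estimated alpha = {alpha:.2f}")
```

The same fit applied to experimental trajectories is what places a process in the subdiffusive, Brownian, or superdiffusive category.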

## How to describe anomalous diffusion?

Currently, there is no “best” or “simple” description of anomalous diffusion in the general case. However, continuous-time random walks (CTRW) are one paradigm that I find helpful as a conceptual and simulation framework.

In the simplest discrete random walk (DRW), a particle makes a jump of fixed size at every time step, and the only question is the direction. The next generalization has the particle make a jump at every time step, but now it draws the jump size from a distribution.

The idea of a CTRW is that there is now a distribution both of the waiting time between jumps, and the jump size. If the waiting time follows the exponential distribution and the jump size follows the normal distribution, one ends up with the Wiener process aka standard diffusion and Brownian motion.
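As a rough sketch of how a CTRW can be simulated, the code below lets each walker wait, jump, and repeat, and records its position at a few sample times. The function name `ctrw_positions` and the parameter choices (unit-rate exponential waits, unit-variance Gaussian jumps) are assumptions for illustration. With these choices the mean square displacement should grow linearly, so doubling the time should roughly double it.

```python
import random


def ctrw_positions(sample_times, draw_wait, draw_jump):
    """Position of one CTRW walker at each of the (sorted) sample times."""
    positions = []
    x = 0.0
    next_jump_time = draw_wait()  # the walker waits before its first jump
    for t_sample in sample_times:
        # Apply every jump that happens at or before this sample time.
        while next_jump_time <= t_sample:
            x += draw_jump()
            next_jump_time += draw_wait()
        positions.append(x)
    return positions


rng = random.Random(1)  # fixed seed so the run is repeatable
sample_times = [50.0, 100.0]
n_walkers = 2000
msd = [0.0 for _ in sample_times]

for _ in range(n_walkers):
    xs = ctrw_positions(
        sample_times,
        draw_wait=lambda: rng.expovariate(1.0),  # exponential waiting times
        draw_jump=lambda: rng.gauss(0.0, 1.0),   # normally distributed jumps
    )
    for i, x in enumerate(xs):
        msd[i] += x * x / n_walkers

for t, value in zip(sample_times, msd):
    print(f"t = {t:6.1f}   <x^2> = {value:7.2f}")
```

Passing the two distributions in as functions means other waiting time or jump size distributions can be swapped in without touching the walker itself.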

## What causes anomalous diffusion?

Just as a reminder, there are three conditions that need to be satisfied for Brownian motion (standard diffusion):
1. Increments are independent
2. Increments are wide sense stationary. 1st moment and autocovariance don’t depend on time (this is a weaker condition than complete stationarity)
3. Zero mean

The third condition is often ignored by examining the motion relative to the mean displacement (ie the actual displacement is not Brownian, but fluctuations in the displacement could be Brownian). So really, the first two are the more important conditions. Therefore, anomalous diffusion arises due to non-independent increments and/or correlations in time of the mean and/or standard deviation.

The CTRW allows one to think more precisely about different mechanisms that can give rise to anomalous diffusion. There is not one single way to get sub- or superdiffusion in a CTRW, since there are two, potentially dependent, distributions (waiting time and jump size). However, there are a few common situations that seem to arise often in biology and elsewhere (see Random walk models in biology, Box 2 for original idea). Subdiffusion in biology is often caused by longer waiting time distributions (compared to exponential), or molecular crowding, while superdiffusion occurs when jump sizes are drawn from a Lévy flight or other alpha-stable distributions.
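Here is a small sketch of the waiting time mechanism. When jumps are independent, zero mean, and independent of the waits, the mean square displacement is just the jump variance times the mean number of jumps so far, so it is enough to count jumps. The heavy-tailed case uses a Pareto distribution with shape 0.5 as a stand-in for "longer than exponential" waits. That choice, the two-point slope estimate, and the function names are all my own assumptions, and the estimate is crude.

```python
import math
import random


def jumps_before(t_max, draw_wait):
    """Count how many jumps one walker has made by time t_max."""
    clock = draw_wait()
    count = 0
    while clock <= t_max:
        count += 1
        clock += draw_wait()
    return count


def msd_exponent(draw_wait, t_short, t_long, n_walkers):
    """Two-point estimate of alpha from the mean jump count at two times."""
    mean_short = sum(jumps_before(t_short, draw_wait) for _ in range(n_walkers)) / n_walkers
    mean_long = sum(jumps_before(t_long, draw_wait) for _ in range(n_walkers)) / n_walkers
    return math.log(mean_long / mean_short) / math.log(t_long / t_short)


rng = random.Random(2)  # fixed seed so the run is repeatable

# Exponential waits: the standard diffusion baseline.
alpha_exponential = msd_exponent(lambda: rng.expovariate(1.0), 10.0, 100.0, 2000)

# Pareto waits with shape 0.5: the mean waiting time is infinite.
alpha_heavy_tail = msd_exponent(lambda: rng.paretovariate(0.5), 100.0, 10000.0, 2000)

print(f"exponential waits: alpha ~ {alpha_exponential:.2f}")
print(f"heavy-tailed waits: alpha ~ {alpha_heavy_tail:.2f}")
```

The exponential case should land near 1, while the heavy-tailed case should land well below 1 because walkers occasionally get stuck for very long times.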

## Examples

For further exploration of anomalous diffusion in biology, I recommend these papers

• This is an interesting paper that introduces a renormalization group approach to classifying diffusion processes

# Standard Diffusion

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

## Unit: Diffusion

Organized by Ben Regner

1. Standard Diffusion
2. Anomalous Diffusion
3. Life at Low Reynolds Number

## Papers

Brownian Motion. By Einstein in 1905.

Brownian Motion. By Langevin in 1908.

An Introduction to Fractional Diffusion. By Henry, Langlands, and Straka in 2010.

## What is diffusion?

Diffusion is the general process by which small particles move from regions of high concentration to low concentration. Check out the link to the Wikipedia articles above for some cool videos and animations. Diffusion is ubiquitous and plays an essential role in biology. For example, oxygen diffuses from your lungs to unoxygenated blood, which then delivers it to the rest of your body where it diffuses out of your blood and into your cells. Additionally, signals between neurons are transmitted by several different diffusing molecules.

Mathematically, standard diffusion is described by two fundamental equations.

Fick’s First Law: Particles move from high-to-low concentration.

$j=-D\frac{\partial n}{\partial x}$

where $n$ is the number density (concentration) of particles, $x$ is position, $D$ is the diffusion constant, and $j$ is the flux of particles.

Fick’s Second Law: Conservation of particles combined with Fick’s First Law leads to the diffusion equation.

If particles cannot be created or destroyed, they follow a conservation law:

$\frac{\partial n}{\partial t} = -\frac{\partial j}{\partial x}$

Combining the conservation law with Fick’s First Law gives us the diffusion equation:

$\frac{\partial n}{\partial t} = D \frac{\partial^2 n}{\partial x^2}$
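To see the two laws working together numerically, here is a minimal sketch on a 1D grid. Each step computes the flux between neighboring cells from Fick's First Law and then updates the concentration with the conservation law. The explicit time stepping, the closed boundaries, and all parameter values are assumptions chosen for illustration (the scheme is only stable when $D\,\Delta t/\Delta x^2 \le 1/2$).

```python
def diffuse(n, D, dx, dt, n_steps):
    """Evolve concentrations n using Fick's first law plus conservation."""
    n = list(n)
    for _ in range(n_steps):
        # Fick's first law at the interface between cells i and i + 1.
        flux = [-D * (n[i + 1] - n[i]) / dx for i in range(len(n) - 1)]
        # Closed boundaries: no flux through either end of the domain.
        flux = [0.0] + flux + [0.0]
        # Conservation law: dn/dt = -dj/dx, with flux[i] on the left of cell i.
        n = [n[i] - dt * (flux[i + 1] - flux[i]) / dx for i in range(len(n))]
    return n


D, dx, dt, n_steps = 1.0, 1.0, 0.2, 500
cells = 201
center = cells // 2

# Start with all particles in the middle cell.
initial = [0.0] * cells
initial[center] = 1.0

final = diffuse(initial, D, dx, dt, n_steps)
total = sum(final)
variance = sum(((i - center) * dx) ** 2 * value for i, value in enumerate(final)) / total

print(f"total particles = {total:.6f}")
print(f"variance = {variance:.3f}, 2 D t = {2 * D * dt * n_steps:.3f}")
```

The total stays fixed because whatever leaves one cell enters its neighbor, and the variance of the spreading profile tracks $2Dt$, which is the linear-in-time growth of the mean square displacement.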

## Brownian Motion

In 1827 Robert Brown looked at pollen in water under a microscope (see the Wikipedia page for simulations of the observations). Much to his surprise, the pollen acted as if it were alive! Brown verified that the pollen was not alive and that any small, inorganic particle followed similar motion. In 1905, during his miracle year, Einstein wrote a paper giving an atomistic description of Brownian motion. In 1908 Langevin used a different approach (that is “infinitely simpler” in his words) to describe Brownian motion. The general explanations are outlined below.

1. Einstein’s Derivation

Einstein’s goal was a probability based description of Brownian motion that connects to Fick’s law. Einstein makes several assumptions about the particles, including

In the end, Einstein finds a solution that is Gaussian, implying that the mean square displacement is linear in time for Brownian motion:

$\langle x^2 \rangle \propto t$

More generally, the mean square displacement could depend on some power of time, usually parameterized as

$\langle x^2 \rangle \propto t^\alpha$

where $\alpha=1$ is Brownian, $0<\alpha<1$ is subdiffusive, $1<\alpha<2$ is superdiffusive, and ballistic is $\alpha=2$. Note, one can get up to $\alpha=3$ in certain turbulent regimes.

2. Langevin’s Derivation
The Langevin approach is to start with a particle based description. The first assumption is the equipartition theorem to determine the average kinetic energy (KE)
$KE = \frac{1}{2} m \left\langle \left(\frac{dx}{dt}\right)^2 \right\rangle = \frac{k_B T}{2}$

Then, one looks at the actual forces on the particle:

mass × acceleration = Stokes drag + stochastic force
$m \frac{d^2 x}{dt^2} = -6 \pi \eta r \frac{dx}{dt} + X$
where $X$ is a stochastic force. It is assumed to be zero mean, unit variance, and no time correlations, aka white noise.

After multiplying both sides of the equation by $x$, doing some algebra, and then taking the average, one arrives at the same result as Einstein (after ignoring a short time transient).
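Here is a sketch of the "some algebra" step, in case it helps. I am assuming the starting point is Newton's second law with Stokes drag, writing $\gamma = 6\pi\eta r$ as shorthand (my notation, not Langevin's), and that the noise is uncorrelated with position so $\langle xX\rangle = 0$:

$m\frac{d^2x}{dt^2} = -\gamma\frac{dx}{dt} + X$

Multiplying by $x$ and using $x\frac{dx}{dt} = \frac{1}{2}\frac{d(x^2)}{dt}$ and $x\frac{d^2x}{dt^2} = \frac{1}{2}\frac{d^2(x^2)}{dt^2} - \left(\frac{dx}{dt}\right)^2$ gives

$\frac{m}{2}\frac{d^2(x^2)}{dt^2} - m\left(\frac{dx}{dt}\right)^2 = -\frac{\gamma}{2}\frac{d(x^2)}{dt} + xX$

Averaging, with equipartition supplying $m\langle (dx/dt)^2\rangle = k_B T$:

$\frac{m}{2}\frac{d^2\langle x^2\rangle}{dt^2} + \frac{\gamma}{2}\frac{d\langle x^2\rangle}{dt} = k_B T$

whose solution is

$\frac{d\langle x^2\rangle}{dt} = \frac{2k_BT}{\gamma} + Ce^{-\gamma t/m}$

The exponential is the short time transient. Dropping it leaves $\langle x^2\rangle = \frac{2k_BT}{\gamma}t = \frac{k_BT}{3\pi\eta r}t$, which is linear in time.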

3. Random Walk Derivation.

There is a third way to derive Brownian motion that is laid out in the book chapter above. The idea is to look at a single particle and do a microscopic random walk. One can set up a recursive definition that defines a binomial probability solution. After a large number of steps, the central limit theorem applies and we end up with a Gaussian solution.
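The recursion is easy to run directly. The sketch below propagates the exact probabilities of an unbiased $\pm 1$ walk, where half of the probability at each site moves left and half moves right, and then compares the result with a Gaussian of the same variance. The step size, the number of steps, and the function names are illustrative assumptions.

```python
import math


def random_walk_distribution(n_steps):
    """Exact probability of each position after n_steps unbiased +/-1 steps."""
    probs = {0: 1.0}
    for _ in range(n_steps):
        new_probs = {}
        for x, p in probs.items():
            # Half of the probability moves left, half moves right.
            new_probs[x - 1] = new_probs.get(x - 1, 0.0) + 0.5 * p
            new_probs[x + 1] = new_probs.get(x + 1, 0.0) + 0.5 * p
        probs = new_probs
    return probs


def gaussian_on_lattice(x, n_steps):
    """Gaussian with variance n_steps, scaled for the lattice spacing of 2."""
    # After an even number of steps only even positions are reachable,
    # so each reachable site carries the density of an interval of width 2.
    return 2.0 * math.exp(-x * x / (2.0 * n_steps)) / math.sqrt(2.0 * math.pi * n_steps)


n_steps = 100
probs = random_walk_distribution(n_steps)
variance = sum(x * x * p for x, p in probs.items())
max_gap = max(abs(p - gaussian_on_lattice(x, n_steps)) for x, p in probs.items())

print(f"variance after {n_steps} steps = {variance:.3f}")
print(f"largest difference from the Gaussian = {max_gap:.1e}")
```

The variance equals the number of steps, so the mean square displacement is again linear in time, and the binomial probabilities are already very close to the Gaussian after 100 steps.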

How do we get Brownian motion?

In general, there are three conditions that need to be satisfied for Brownian motion:
1. Increments are independent
2. Increments are wide sense stationary. 1st moment and autocovariance don’t depend on time (this is a weaker condition than complete stationarity)
3. Zero mean

The third condition is often ignored by examining the motion relative to the mean displacement (ie the actual displacement is not Brownian, but fluctuations in the displacement could be Brownian). So really, the first two are the more important conditions.

## Fundamental Questions

• Einstein made three major assumptions in his derivation. Two of the three are often violated in biology. Which assumption is relatively safe?
• What biological processes do you think are actually diffusive vs sub/super-diffusive? Think about the 3 conditions for Brownian motion listed above. Note, this is a preview for the next post.

# Deep Learning

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

## Papers

Deep Learning. By LeCun, Bengio, and Hinton in 2015.

Deep learning in neural networks: An overview. By Schmidhuber in 2015.

## What is deep learning?

In previous weeks we have introduced perceptrons, multilayer perceptrons (MLP), Hopfield neural networks, and Boltzmann machines.

But what is deep learning? I think it is really two things:

1. Successful training of multilayered neural networks that perform better (higher classification accuracy, etc.) and involve more layers than previous implementations
2. Just a rebranding of neural networks

Here is my summary of the history of deep learning, see both reviews above for extended details. In 2006, Deep Belief Networks (DBN) were introduced in two papers (Reducing the Dimensionality of Data with Neural Networks and A Fast Learning Algorithm for Deep Belief Nets). The idea of a DBN is to train a series of restricted Boltzmann machines (RBM). The first RBM is trained and given the original data, produces an output of hidden layer activations. The second RBM uses the first RBM’s hidden layer activations as inputs, and trains on that “data”. This is continued to the desired depth. At this point, the DBN can be used for unsupervised learning, or one can use it as pretraining for a MLP which will utilize backpropagation for supervised learning.

So on the one hand, there were technical breakthroughs that enabled neural networks to utilize more layers than previous iterations and achieve state of the art performance. However, the actual components (RBMs and MLPs) have been around since the 1980s, so it would also be fair to deem deep learning a rebranding of neural networks.

Therefore, neural network winter (mid 90s to mid 00s) officially ended in 2006. I propose calling 2006-2012 neural network spring. While interest in neural networks increased and new advances were made, the general machine learning community was not obsessed with deep learning. That changed in 2012 when the neural network summer began. This paper presented at NIPS revolutionized the computer vision community by cutting the error rate on ImageNet nearly in half! The ImageNet challenge was viewed as a serious benchmark that all computer vision systems should address. By blowing previous results out of the water, the revolution was completed.

So for now, enjoy neural network summer, but always remember, winter is coming.

## Fundamental Questions

• When do extra layers help in a neural network? When do they hurt?
• Why was pretraining originally needed, but is no longer used in practice? Check out these papers for details: Glorot and Saxe.
• Learn about convolutional and recurrent neural networks. These are extremely popular right now!

• Do research on unsupervised learning! It is definitely less popular today, but all the big-shots think it is the long-term future of neural networks.

# Training Networks

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

## Unit: Deep Learning

1. Perceptron
2. Energy Based Neural Networks
3. Training Networks
4. Deep Learning

## Papers

A Learning Algorithm for Boltzmann Machines. By Ackley, Hinton, and Sejnowski in 1985.

Learning representations by back-propagating errors. By Rumelhart, Hinton, and Williams in 1986.

## How do you actually train neural networks?

Hopefully the past few posts have piqued your interest in neural networks. Maybe you even want to unleash a neural network on some data. How do you actually train the neural network?

I’m actually going to keep this brief for two reasons. First, detailed derivations can already be found elsewhere (for Boltzmann see Appendix of the original paper as well as MacKay, for backpropagation see Nielsen). Second, I firmly believe that algorithms are best learned by actually stepping through the updates, so any explanation I attempt will not be sufficient for you to truly learn the algorithm. I will provide some general context as well as some questions you should be able to answer, but please go do it yourself!

There are three general classes of machine learning based on the information received:

• Unsupervised – data only. Boltzmann machine.
• Supervised – data with labels. MLP with backpropagation.
• Reinforcement – data, actions, and scores associated with each action. Deserves its own detailed post, but check out papers by DeepMind for cool applications.

The Boltzmann machine learning rule is an example of maximum likelihood. In practice, the original learning rule is too computationally expensive, so a modified algorithm called contrastive divergence (or variants such as persistent contrastive divergence) is utilized instead. See the Hinton guide to RBMs for more details.

Backpropagation is a computationally efficient way of applying the chain rule from calculus, so besides the above paper which popularized it, there is actually a long history of this algorithm being discovered and rediscovered.
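To make the chain rule claim concrete, here is a minimal sketch that computes the gradients of a tiny network two ways: with backpropagation, and with central finite differences. The network (two inputs, three sigmoid hidden units, one sigmoid output, no biases, squared error) and its weights are hypothetical and kept as small as possible. This is not a substitute for stepping through the updates by hand.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w1, w2, x):
    """One hidden layer of sigmoid units followed by a sigmoid output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    output = sigmoid(sum(w * h for w, h in zip(w2, hidden)))
    return hidden, output


def loss(w1, w2, x, target):
    _, output = forward(w1, w2, x)
    return 0.5 * (output - target) ** 2


def backprop(w1, w2, x, target):
    """All gradients from one forward pass and one backward pass."""
    hidden, output = forward(w1, w2, x)
    delta_out = (output - target) * output * (1.0 - output)
    grad_w2 = [delta_out * h for h in hidden]
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(w2, hidden)]
    grad_w1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_w1, grad_w2


def finite_difference(w1, w2, x, target, eps=1e-6):
    """Central differences: two extra loss evaluations for every weight."""
    grad_w1 = [[0.0] * len(row) for row in w1]
    for i in range(len(w1)):
        for j in range(len(w1[i])):
            original = w1[i][j]
            w1[i][j] = original + eps
            up = loss(w1, w2, x, target)
            w1[i][j] = original - eps
            down = loss(w1, w2, x, target)
            w1[i][j] = original  # restore the weight
            grad_w1[i][j] = (up - down) / (2.0 * eps)
    grad_w2 = []
    for k in range(len(w2)):
        original = w2[k]
        w2[k] = original + eps
        up = loss(w1, w2, x, target)
        w2[k] = original - eps
        down = loss(w1, w2, x, target)
        w2[k] = original  # restore the weight
        grad_w2.append((up - down) / (2.0 * eps))
    return grad_w1, grad_w2


# Hypothetical weights and a single training example.
w1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
w2 = [0.3, -0.1, 0.2]
x, target = [1.0, 0.5], 1.0

bp_w1, bp_w2 = backprop(w1, w2, x, target)
fd_w1, fd_w2 = finite_difference(w1, w2, x, target)

gaps = [abs(a - b) for row_a, row_b in zip(bp_w1, fd_w1) for a, b in zip(row_a, row_b)]
gaps += [abs(a - b) for a, b in zip(bp_w2, fd_w2)]
max_gap = max(gaps)
print(f"largest disagreement between the two gradients = {max_gap:.1e}")
```

The two methods agree to numerical precision, but the finite difference version needs two full loss evaluations per weight, while backpropagation gets every gradient from a single forward and backward pass. The same comparison is a handy sanity check when implementing backpropagation yourself.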

## Fundamental Questions

• What is maximum likelihood?
• Why can one interpret the learning terms in the BM algorithm as “waking” and “sleeping”?
• Why are BM hidden layers so important?
• Why are restricted Boltzmann machines, RBMs, much easier to train?
• Why is backpropagation more computationally efficient than the finite difference method?
• Derive the 4 backpropagation equations!