# Temporal Difference Learning

How can humans or machines interact with an environment and learn a strategy for selecting actions that are beneficial to their goals? Answers to this question fall under the artificial intelligence category of reinforcement learning. Here I am going to provide an introduction to temporal difference (TD) learning, which is the algorithm at the heart of reinforcement learning.

I will be presenting TD learning from a computational neuroscience background. My post has been heavily influenced by Dayan and Abbott Ch 9, but I have added some additional points. The ultimate reference for reinforcement learning is the book by Sutton and Barto, and their chapter 6 dives into TD learning.

## Conditioning

To start, let’s review conditioning. The most famous example of conditioning is Pavlov’s dogs. The dogs naturally learned to salivate upon the delivery of food, but Pavlov realized that he could condition dogs to associate the ringing of a bell with the delivery of food. Eventually, the ringing of the bell on its own was enough to cause dogs to salivate.

The specific example of Pavlov’s dogs is an example of classical conditioning. In classical conditioning, no action needs to be taken. However, animals can also learn to associate actions with rewards and this is called operant conditioning.

Before I introduce some specific conditioning paradigms, here are the important definitions:

• $s$ = stimulus
• $r$ = reward
• $x$ = no reward
• $v$ = value, or expected reward (generally a function of $r$, $x$)
• $u$ = binary, indicator variable, of stimulus (1 if stimulus present, 0 otherwise)

Here are the conditioning paradigms I want to discuss:

• Pavlovian
• Extinction
• Blocking
• Inhibitory
• Secondary

For each of these paradigms, I will introduce the necessary training stages and the final result. The statement, $a \rightarrow b$, means that $a$ becomes associated ($\rightarrow$) with $b$.

#### Pavlovian

Training: $s \rightarrow r$. The stimulus is trained with a reward.

Results: $s \rightarrow v[r]$. The stimulus is associated with the expectation of a reward.

#### Extinction

Training 1: $s \rightarrow r$. The stimulus is trained with a reward. This eventually leads to successful Pavlovian training.

Training 2: $s \rightarrow x$. The stimulus is trained with no reward.

Results: $s \rightarrow v[x]$. The stimulus is associated with the expectation of no reward. Extinction of the previous Pavlovian training.

#### Blocking

Training 1: $s_1 \rightarrow r$. The first stimulus is trained with a reward. This eventually leads to successful Pavlovian training.

Training 2: $s_1 + s_2 \rightarrow r$. The first stimulus and a second stimulus are trained together with a reward.

Results: $s_1 \rightarrow v[r]$, and $s_2 \rightarrow v[x]$. The first stimulus completely explains the reward and hence “blocks” the second stimulus from being associated with the reward.

#### Inhibitory

Training: $s_1+s_2 \rightarrow x$, and $s_1 \rightarrow r$. The combination of two stimuli leads to no reward, but the first stimulus is trained with a reward.

Results: $s_1 \rightarrow v[r]$, and $s_2 \rightarrow -v[r]$. The first stimulus is associated with the expectation of the reward while the second stimulus is associated with the negative of the reward.

#### Secondary

Training 1: $s_1 \rightarrow r$. The first stimulus is trained with a reward. This eventually leads to successful Pavlovian training.

Training 2: $s_2 \rightarrow s_1$. The second stimulus is trained with the first stimulus.

Results: $s_2 \rightarrow v[r]$. Eventually the second stimulus is associated with the reward despite never being directly associated with the reward.

## Rescorla-Wagner Rule

How do we turn the various conditioning paradigms into a mathematical framework of learning? The Rescorla-Wagner rule (RW) is a very simple model that can explain many, but not all, of the above paradigms.

The RW rule is a linear prediction model that requires these three equations:

1. $v=w \cdot u$
2. $\delta = r-v$
3. $w_{new} = w_{old}+\epsilon \delta u$

and introduces the following new terms:

• $w$ = weights associated with stimuli state
• $\epsilon$ = learning rate, with $0 \le \epsilon \le 1$

What do each of these equations actually mean?

1. The expected reward, $v$, is the dot product of a vector of weights, $w$, with the vector of stimulus indicators, $u$ (one weight per stimulus).
2. But there may be a mismatch, or error, between the actual reward, $r$, and the expected reward, $v$.
3. Therefore we should update the weight of each stimulus. We do this by adding a term that is proportional to a learning rate $\epsilon$, the error $\delta$, and the stimulus indicator $u$.

During a Pavlovian pairing of a stimulus with a reward, the RW rule predicts an exponential approach of the weight to $w = \langle ru\rangle$ over the course of several trials for most values of $\epsilon$ (if $\epsilon=1$ it would instantly update to the final value. Why is this usually bad?). Then if the reward stops being paired with the stimulus, the weight will exponentially decay over the course of the next trials.
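Here is a minimal sketch of the three RW equations applied to Pavlovian training followed by extinction. The function names, the learning rate of 0.1, and the 100 trials per stage are my own illustrative choices, not values from Dayan and Abbott.

```python
def rw_update(w, u, r, eps):
    """One Rescorla-Wagner trial; returns the updated weights and the error."""
    v = sum(wi * ui for wi, ui in zip(w, u))                  # v = w . u
    delta = r - v                                             # delta = r - v
    w_new = [wi + eps * delta * ui for wi, ui in zip(w, u)]   # w += eps * delta * u
    return w_new, delta


def run_trials(w, trials, eps):
    """Apply a list of (u, r) trials in order, recording the weights after each."""
    history = []
    for u, r in trials:
        w, _ = rw_update(w, u, r, eps)
        history.append(list(w))
    return w, history


eps = 0.1                              # illustrative learning rate (assumption)
acquisition = [([1.0], 1.0)] * 100     # stimulus present, reward delivered
extinction = [([1.0], 0.0)] * 100      # stimulus present, no reward

w, hist_acq = run_trials([0.0], acquisition, eps)
w_after_acq = w[0]                     # has climbed exponentially toward r = 1
w, hist_ext = run_trials(w, extinction, eps)
w_after_ext = w[0]                     # has decayed exponentially back toward 0
```

Each trial closes a fraction $\epsilon$ of the remaining gap between $w$ and its target, which is where the exponential approach and decay come from.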

The RW rule will also continue to work when the reward/stimulus pairing is stochastic instead of deterministic, and the weight will still approach the final value of $w = \langle ru\rangle$.
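The stochastic case can be checked with a short simulation. This is only a sketch: the 50% reward probability, the learning rate, the number of trials, and the function name are assumptions I picked for illustration.

```python
import random


def rw_stochastic(p_reward, eps, n_trials, seed=0):
    """RW rule with a single stimulus that is rewarded with probability p_reward.

    Returns the weight averaged over the second half of training, since the
    weight keeps fluctuating around its mean when rewards are random.
    """
    rng = random.Random(seed)
    w = 0.0
    tail = []
    for trial in range(n_trials):
        r = 1.0 if rng.random() < p_reward else 0.0
        delta = r - w              # u = 1 on every trial, so v = w
        w += eps * delta
        if trial >= n_trials // 2:
            tail.append(w)
    return sum(tail) / len(tail)


w_avg = rw_stochastic(p_reward=0.5, eps=0.05, n_trials=20000)
```

With $u = 1$ and $r \in \{0, 1\}$, the target $\langle ru \rangle$ is just the reward probability, so `w_avg` should sit near 0.5. A smaller $\epsilon$ gives smaller fluctuations around that value at the cost of slower learning.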

How does blocking fit into this framework? Well the RW rule says that after the first stage of training, the weights are $w_1 = r$ and $w_2 = 0$ (since we have not presented stimulus two). When we start the second stage of training and try and associate stimulus two with the reward, we find that we cannot learn that association. The reason is that there is no error (hence $\delta = 0$) and therefore $w_2 = 0$ forever. If instead we had only imperfectly learned the weight of the first stimulus, then there is still some error and hence some learning is possible.
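The blocking argument can be sketched in a few lines. The learning rate and trial counts below are illustrative assumptions; the second run shows the "imperfectly learned" case, where stage one is cut short after three trials.

```python
def rw_update(w, u, r, eps):
    """One Rescorla-Wagner trial for a vector of stimulus weights."""
    v = sum(wi * ui for wi, ui in zip(w, u))
    delta = r - v
    return [wi + eps * delta * ui for wi, ui in zip(w, u)]


def blocking_experiment(stage1_trials, stage2_trials, eps=0.2, r=1.0):
    w = [0.0, 0.0]
    for _ in range(stage1_trials):       # stage 1: s1 alone, rewarded
        w = rw_update(w, [1.0, 0.0], r, eps)
    for _ in range(stage2_trials):       # stage 2: s1 and s2 together, rewarded
        w = rw_update(w, [1.0, 1.0], r, eps)
    return w


w_blocked = blocking_experiment(stage1_trials=200, stage2_trials=200)
w_partial = blocking_experiment(stage1_trials=3, stage2_trials=200)
```

In the first run $w_1$ has already reached $r$, so $\delta \approx 0$ throughout stage two and $w_2$ never moves. In the second run $w_1$ is only 0.488 when stage two starts, and the remaining error of 0.512 is split evenly between the two stimuli, leaving $w_2 = 0.256$.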

One thing that the RW rule incorrectly predicts is secondary conditioning. In this case, during the learning of the first stimulus, $s_1$, the learned weight becomes $w_1 >0$. The RW rule predicts that the second stimulus, $s_2$, will become $w_2 <0$. This is because this paradigm is exactly the same as inhibitory conditioning, according to the RW rule. Therefore, a more complicated rule is required to successfully capture secondary conditioning.
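To see the failure concretely, here is a sketch of secondary conditioning under the RW rule. Since the rule has no notion of time, I am assuming the stage-two pairing of $s_2$ with $s_1$ has to be encoded as a trial where both stimuli are present and no reward arrives; the learning rate and trial counts are also illustrative.

```python
def rw_update(w, u, r, eps):
    """One Rescorla-Wagner trial for a vector of stimulus weights."""
    v = sum(wi * ui for wi, ui in zip(w, u))
    delta = r - v
    return [wi + eps * delta * ui for wi, ui in zip(w, u)]


eps = 0.1
w = [0.0, 0.0]

# Stage 1: s1 alone is paired with a reward.
for _ in range(200):
    w = rw_update(w, [1.0, 0.0], 1.0, eps)
w_stage1 = list(w)

# Stage 2: s2 is paired with s1 and no reward is delivered.
# Assumption: without time, this is just "both stimuli present, r = 0".
for _ in range(200):
    w = rw_update(w, [1.0, 1.0], 0.0, eps)
w_stage2 = list(w)
```

The error in stage two is negative, and it is shared equally between the two stimuli, so $w_2$ is driven to about $-0.5$ while $w_1$ falls to about $0.5$. The model makes $s_2$ an inhibitor rather than a predictor of reward, which is the opposite of the experimental result.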

One final note. The RW rule can provide an even better match to biology by assuming a non-linear relationship between $v$ and the animal behavior. This function is often something that exponentially saturates at the maximal reward (ie an animal is much more motivated to go from 10% to 20% of the max reward rather than from 80% to 90% of the max reward). While this provides a better fit to many biological experiments, it still cannot explain the secondary conditioning paradigm.

## Temporal Difference Learning

To properly model secondary conditioning, we need to explicitly add in time to our equations. For ease, one can assume that time, $t$, is discrete and that a trial lasts for total time $T$ and therefore $0 \le t \le T$.

The straightforward (but wrong) extension of the RW rule to time is:

1. $v[t]=w[t-1] \cdot u[t]$
2. $\delta[t] = r[t]-v[t]$
3. $w[t] = w[t-1]+\epsilon \delta[t] u[t]$

where we will say that it takes one time unit to update the weights.

Why is this naive RW with time wrong? Well, psychology and biology experiments show that an animal’s expected reward does NOT reflect the past history of rewards, nor just the next time step, but instead reflects the expected rewards during the WHOLE REMAINDER of the trial. Therefore a better match to biology is:

1. $v[t]=w[t-1] \cdot u[t]$
2. $R[t]= \langle \sum_{\tau=0}^{T-t} r[t+\tau] \rangle$
3. $\delta[t] = R[t]-v[t]$
4. $w[t] = w[t-1]+\epsilon \delta[t] u[t]$

where $R[t]$ is the full reward expected over the remainder of the trial while $r[t]$ remains the reward at a single time step. This is closer to biology, but we are still missing a key component. Not all future rewards are treated equally. Instead, rewards that happen sooner are valued higher than rewards in the distant future (this is called discounting). So the best match to biology is the following:

1. $v[t]=w[t-1] \cdot u[t]$
2. $R[t]= \langle \sum_{\tau=0}^{T-t} \gamma^\tau r[t+\tau] \rangle$
3. $\delta[t] = R[t]-v[t]$
4. $w[t] = w[t-1]+\epsilon \delta[t] u[t]$

where $0 \le \gamma \le 1$ is the discounting factor for future rewards. A small discounting factor implies we prefer rewards now while a large discounting factor means we are patient for our rewards.
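As a small worked example of discounting, the sketch below evaluates the discounted sum for a single made-up trial (so the trial average $\langle \cdot \rangle$ is dropped). The reward sequence and the two values of $\gamma$ are assumptions chosen for illustration.

```python
def discounted_returns(rewards, gamma):
    """R[t] = sum over tau of gamma**tau * r[t + tau], for one observed trial."""
    n = len(rewards)
    return [
        sum(gamma ** tau * rewards[t + tau] for tau in range(n - t))
        for t in range(n)
    ]


rewards = [0.0, 0.0, 0.0, 1.0]        # a single reward three steps in the future

patient = discounted_returns(rewards, gamma=0.9)
impatient = discounted_returns(rewards, gamma=0.5)
```

At $t=0$ the patient learner values the upcoming reward at $0.9^3 = 0.729$, while the impatient learner values it at only $0.5^3 = 0.125$. Both agree that it is worth 1 at the moment it arrives.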

We have managed to write down a set of equations that accurately summarize biological reinforcement. But how can we actually learn with this system? As currently written, we would need to know the average reward over the remainder of the whole trial. Temporal difference learning makes the following assumptions in order to solve for the expected future rewards:

1. Future rewards are Markovian
2. Current observed estimate of reward is close enough to the typical trial

A Markov process is memoryless in that the next future step only depends on the current state of the system and has no other history dependence. By assuming rewards follow this structure, we can make the following approximation:

• $R[t]= \langle r[t+1] \rangle + \gamma \langle \sum_{\tau=1}^{T-t} \gamma^{\tau-1} r[t+\tau] \rangle$
• $R[t]= \langle r[t+1] \rangle + \gamma R[t+1]$

The second approximation is called bootstrapping. We will use the currently observed values rather than the full estimate for future rewards. So finally we end up at the temporal difference learning equations:

1. $v[t]=w[t-1] \cdot u[t]$
2. $R[t] = r[t+1] + \gamma v[t+1]$
3. $\delta[t] =r[t+1] + \gamma v[t+1]-v[t]$
4. $w[t] = w[t-1]+\epsilon \delta[t] u[t]$
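The four equations above can be sketched as a short simulation of a stimulus followed a few time steps later by a reward. To give the weights something to attach to at each moment, I am assuming a one-hot "time since stimulus onset" encoding for $u[t]$; that encoding, the timings, $\epsilon = 0.3$, and $\gamma = 1$ are my own choices for illustration, not details taken from the text.

```python
def features(t, t_stim, n_lags):
    """One-hot vector marking time since stimulus onset (all zeros before onset)."""
    u = [0.0] * n_lags
    lag = t - t_stim
    if 0 <= lag < n_lags:
        u[lag] = 1.0
    return u


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def td_trial(w, rewards, t_stim, eps, gamma):
    """Run one trial of TD learning; returns updated weights and delta[t]."""
    n_lags = len(w)
    deltas = []
    for t in range(len(rewards) - 1):
        u_now = features(t, t_stim, n_lags)
        u_next = features(t + 1, t_stim, n_lags)
        v_now = dot(w, u_now)                               # v[t]
        v_next = dot(w, u_next)                             # v[t+1]
        delta = rewards[t + 1] + gamma * v_next - v_now     # TD error
        w = [wi + eps * delta * ui for wi, ui in zip(w, u_now)]
        deltas.append(delta)
    return w, deltas


T, t_stim, t_reward = 10, 2, 6        # illustrative timings (assumptions)
rewards = [0.0] * (T + 1)
rewards[t_reward] = 1.0
w = [0.0] * (T - t_stim + 1)

w, first_deltas = td_trial(w, rewards, t_stim, eps=0.3, gamma=1.0)
for _ in range(300):
    w, last_deltas = td_trial(w, rewards, t_stim, eps=0.3, gamma=1.0)
```

On the first trial the only non-zero error is `first_deltas[5]`, the step that leads into the reward. After training, that error has vanished and the only non-zero error is `last_deltas[1]`, the step that leads into the stimulus. The prediction error has moved from the reward to the earliest reliable predictor of it.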

Dayan and Abbott, Figure 9.2. This illustrates TD learning in action.

I have included an image from Dayan and Abbott showing how TD learning evolves over consecutive trials. Please read their Chapter 9 for full details.

Finally, I should mention that in practice, people often use the TD-Lambda algorithm. This version introduces a new parameter, lambda, which controls how far back in time one can make adjustments. Lambda 0 implies one time step only, while lambda 1 implies all past time steps. This allows TD learning to excel even if the full system is not Markovian.
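One common way to formulate TD-Lambda is with an eligibility trace, a decaying memory of recently active stimuli that receives a share of each error. The sketch below uses accumulating traces and the same assumed one-hot "time since stimulus onset" encoding as a stand-in for $u[t]$; the names, timings, and parameter values are illustrative assumptions.

```python
def td_lambda_trial(w, rewards, t_stim, eps, gamma, lam):
    """One trial of TD(lambda) with accumulating eligibility traces."""
    n = len(w)
    trace = [0.0] * n
    for t in range(len(rewards) - 1):
        lag_now = t - t_stim
        lag_next = t + 1 - t_stim
        v_now = w[lag_now] if 0 <= lag_now < n else 0.0
        v_next = w[lag_next] if 0 <= lag_next < n else 0.0
        delta = rewards[t + 1] + gamma * v_next - v_now

        # Decay every trace, then mark the currently active feature.
        trace = [gamma * lam * e for e in trace]
        if 0 <= lag_now < n:
            trace[lag_now] += 1.0

        # Every recently active feature gets a share of the error.
        w = [wi + eps * delta * ei for wi, ei in zip(w, trace)]
    return w


T, t_stim, t_reward = 10, 2, 6
rewards = [0.0] * (T + 1)
rewards[t_reward] = 1.0
zeros = [0.0] * (T - t_stim + 1)

w_lam0 = td_lambda_trial(list(zeros), rewards, t_stim, eps=0.3, gamma=1.0, lam=0.0)
w_lam1 = td_lambda_trial(list(zeros), rewards, t_stim, eps=0.3, gamma=1.0, lam=1.0)
```

After a single trial, `w_lam0` has changed only the weight for the step right before the reward, while `w_lam1` has already credited every step since stimulus onset. Intermediate values of lambda interpolate between the two by shrinking the credit given to older time steps.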

## Dopamine and Biology’s TD system

So does biology actually implement TD learning? Animals definitely utilize reinforcement learning and there is strong evidence that temporal difference learning plays an essential role. The leading contender for the reward signal is dopamine. This is a widely used neurotransmitter that evolved in early animals and remains widely conserved. There are a relatively small number of dopamine neurons (in the basal ganglia and VTA in humans) that project widely throughout the brain. These dopamine neurons can produce an intense sensation of pleasure (and in fact the “high” of drugs often comes about either through stimulating dopamine production or preventing its reuptake).

There are two great computational neuroscience papers that highlight the important connection between TD learning and dopamine that analyze two different biological systems:

Both of these papers deserve to be read in detail, but I’ll give a brief summary of the bee foraging paper here. Experiments were done that tracked bees in a controlled environment consisting of “yellow flowers” and “blue flowers” (which were basically just different colored cups). These flowers had the same amount of nectar on average, but were either consistent or highly variable. The bees quickly learned to only target the consistent flowers. These experimental results were very well modeled by assuming the bee was performing TD learning with a relatively small discount factor (driving it to value recent rewards).

## TD Learning and Games

Playing games is the perfect test bed for TD learning. A game has a final objective (win), but throughout play it can be difficult to determine your probability of winning. TD learning provides a systematic framework to associate the value of a given game state with the eventual probability of winning. Below I highlight the games that have most significantly showcased the usefulness of reinforcement learning.
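As a toy illustration of learning win probabilities, here is a sketch using a made-up "game": a token starts in the middle of five squares and moves left or right at random, winning if it exits on the right and losing if it exits on the left. The game, the table of one value per square, and all parameter values are assumptions of mine; real game programs replace the table with a function approximator such as a neural network.

```python
import random


def td0_random_walk(n_states=5, episodes=20000, eps=0.01, gamma=1.0, seed=0):
    """Tabular TD learning of the value (win probability) of each square."""
    rng = random.Random(seed)
    v = [0.5] * n_states                       # initial guess for every square
    for _ in range(episodes):
        s = n_states // 2                      # start in the middle
        while True:
            s_next = s + (1 if rng.random() < 0.5 else -1)
            if s_next < 0:                     # fell off the left edge: loss
                r, v_next, done = 0.0, 0.0, True
            elif s_next >= n_states:           # fell off the right edge: win
                r, v_next, done = 1.0, 0.0, True
            else:                              # game continues, no reward yet
                r, v_next, done = 0.0, v[s_next], False
            v[s] += eps * (r + gamma * v_next - v[s])   # TD update
            if done:
                break
            s = s_next
    return v


values = td0_random_walk()
true_values = [(k + 1) / 6 for k in range(5)]   # exact win probabilities
```

The only reward is a 1 at the moment of winning, so with $\gamma = 1$ the value of a square is exactly the probability of eventually winning from it. The learned `values` should land close to `true_values` (1/6 through 5/6) even though the learner is never told those probabilities.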

#### Backgammon

Backgammon is a two person game of perfect information (neither player has hidden knowledge) with an element of chance (rolling dice to determine one’s possible moves). Gerald Tesauro’s TD-Gammon was the first program to showcase the value of TD learning, so I will go through it in more detail.

Before getting into specifics, I need to point out that there are actually two (often competing) branches in artificial intelligence:

• Symbolic logic
• Connectionist approach

Symbolic logic tends to be a set of formal rules that a system needs to follow. These rules need to be designed by humans. The connectionist approach uses artificial neural networks and other approaches like TD learning that attempt to mimic biological neural networks. The idea is that humans set up the overall architecture and model of the neural network, but the specific connections between “neurons” is determined by the learning algorithm as it is fed real data examples.

Tesauro actually created two versions of a backgammon program. The first was called Neurogammon. It was trained using supervised learning where it was given expert games as well as games Tesauro played against himself and told to learn to mimic the human moves. Neurogammon was able to play at an intermediate human level.

Tesauro’s next version of a backgammon program was called TD-Gammon since it used the TD learning rule. Instead of trying to mimic the human moves, TD-Gammon used the TD learning rule to assign a score to each move throughout a game. The additional innovation is that the TD-Gammon program was trained by playing games against itself. This initial version of TD-Gammon soon matched Neurogammon (ie intermediate human level). TD-Gammon was able to beat experts by using both a supervised phase on expert games and a reinforcement phase.

Despite being able to beat experts, TD-Gammon still had a weakness in the endgame. Since it only looked two moves ahead, it could miss key moves that would have been found by a more thorough analytical approach. This is where symbolic logic excels and hence TD-Gammon was a great demonstration of the complementary strengths and weaknesses of symbolic vs connectionist logic.

#### Go

Go is a two person game of perfect information with no element of chance. Despite this perfect knowledge, the game is complex enough that there are around $10^{170}$ possible board positions (for reference, there are only about $10^{80}$ atoms in the whole universe). So despite the perfect information, there are just too many possible games to determine the optimal move.

Recently AlphaGo made a huge splash by beating one of the world’s top players of Go. Most Go players, and even many artificial intelligence researchers, thought an expert level Go program was years away. So the win was just as surprising as when Deep Blue beat Kasparov in chess. AlphaGo is a large program with many different parts, but at the heart of it is a reinforcement learning module that utilizes TD learning (see here or here for details).

#### Poker

The final frontier in gaming is poker, specifically multi-person No-Limit Texas Hold’em. The reason this is the toughest game left is that it is a multi-player game with imperfect information and an element of chance.

Last winter the computer systems won against professionals for the first time in a series of heads up matches (computer vs only one human). Further improvements are needed to actually beat the best professionals at a multi-person table, but these results seem encouraging for future successes. The interesting thing to me is that both AI systems seem to have used only a limited amount of reinforcement learning. I think that fully embracing reinforcement and TD learning should be the top priority for these research teams and might provide the necessary leap in ability. And they should hurry since others might beat them to it!

# Best Machine Learning Resources

Machine learning is a rapidly evolving field that is generating an intense interest from a wide audience. So how can you get started?

For now, I’m going to assume that you already have the basic programming (ie general introduction to programming and experience with matrices) and mathematical skills (calculus and some probability and linear algebra).

These are the best current books on machine learning:

These are some out of date books that still contain some useful sections (for example, Murphy several times refers you to Bishop or MacKay for more details).

Here is a list of other potential resources:

# I3: International Institute for Intelligence

While I was previously discussing my opinion of OpenAI, I mentioned that I would do something different if I was in charge. Here is my dream.

# What OpenAI is Missing

Helping everyday people throughout the whole world.

OpenAI’s stated goal is:

OpenAI is a non-profit artificial intelligence research company. Our goal is to advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return.

In the short term, we’re building on recent advances in AI research and working towards the next set of breakthroughs.

However, based on their actions so far, this interview with Ilya Sutskever, and popular press articles, the main focus of OpenAI appears to be advanced research in artificial intelligence with an emphasis on open source, as well as thinking longterm about the impacts of letting advanced artificial intelligence systems control large aspects of our life. While I strongly support these goals, in reality, these will not benefit all of humanity. Instead, they only benefit those with either the necessary training (which is a minimum of a bachelor’s, but usually means a master’s or PhD) or money (to hire top people, buy the required computing resources, etc) to take advantage of the advanced research. So this leaves out the developing world as well as the poor in developed countries, ie contrary to their stated goal, OpenAI is missing the vast majority of humanity.

One can argue that by making OpenAI’s research open source, eventually it will trickle down and help a wider swath of humanity. However, the current trend suggests that large corporations are best poised to benefit the most from the next revolution (I mean, who is more likely to invent a self driving car, Google, or someone in a developing country?). Additionally, these innovations focus on first world problems (since these are the highest paying customers). And finally, each round of innovation ends up creating fewer and fewer jobs (so the number of unemployed in developed countries may expand). I firmly believe that unless there is a global educational effort (and probably an implementation of basic income), the benefits of AI will be directed towards a tiny sliver of the world’s population.

# My Proposal: I3

Here I lay out my proposal for a new institute that would actually expand the benefits of recent and future advances in machine learning / artificial intelligence to a wider swath of humanity. I don’t claim that it would truly benefit all of humanity (again, see basic income), but it is a way for research advances to reach a larger proportion of it.

I propose a new education and research institute focused on artificial intelligence, machine learning, and computational neuroscience which I’ll call the International Institute for Intelligence. I like alliterations, and since I think it should focus on three types of intelligence, I especially like the idea of calling it I3 or I-Cubed for short.

Why these three research areas? Well, machine learning is currently revolutionizing how companies use data and is facilitating new technological advances every day. Designing artificial intelligence systems on top of these machine learning algorithms seems like a realistic possibility in the near future. The less conventional choice is computational neuroscience. I think it is important to include for two reasons. First, the brain is the best example we have of an intelligent system, so until we actually design an artificial intelligence, it seems best to understand and mimic the best example (this is the philosophy of DeepMind according to Demis Hassabis). Second, the US Brain Initiative and similar international efforts are injecting significant resources into neuroscience, with the hopes of sparking a revolution similar in spirit and magnitude to the widespread effect the Human Genome Project had on biotechnology and genomics. So I figure we might as well prepare everyone for this future.

So what would be the actual purpose of I3? Sticking with the theme of threes, I propose three initiatives that I will list in my order of importance as well as some bonus points.

# 1. International PhD Education

The central goal is to build a program similar to ICTP (International Centre for Theoretical Physics) but with a different research emphasis. So what is ICTP? It was founded by Nobel Prize winner Abdus Salam and it has several programs to promote research in developing countries, including:

• Predoctoral program – students get a 1 year course to prep them for PhDs
• Visiting PhD program – students in a developing nation PhD program get to spend a couple of months each year for 3 years at ICTP to participate in their research
• Conferences
• Regional offices (currently São Paulo, Brazil, but more in the planning)

So the idea is to implement a similar program but with the research emphasis now focused on machine learning, artificial intelligence, and computational neuroscience. While I think the main thing is to get the predoctoral program and visiting PhD program started, eventually it would be great to have 5 regional offices spread throughout the developing world. For example, I think one is needed in South America (Lima, Peru?), one in Africa (Nairobi, Kenya?), and 2 in Asia (India, and China, but not in a traditional technological center). And assuming I3 is based in the US (see my case for San Diego below), it would be great to have an affiliate office in Europe, maybe in Trieste next to ICTP.

One additional initiative that I think could be useful would be paying people to not leave their country and instead help them establish a research center at their local universities. This could also wait until later because it might be easiest to convince some of the future alumni of the predoctoral or visiting PhD programs to return/stay in their home country.

A second additional initiative would be to encourage professors from developed and developing countries to take their sabbatical at I3. This would provide a fresh stream of mentors and set up potential future collaborations. This is a blend of two programs at KITP (this and that).

# 2. US Primary School Education

The science pipeline analogy is overused, but I don’t have a better one yet. So currently, the researchers in I3 focused areas are predominately male, white or Asian, and middle to upper class. So not a very representative sample of the US (or world) population. Therefore, the best longterm solution is to get a more diverse set of students interested in the research at a young age.

Technically this should have a higher priority over the next initiative (US College Education), but since there are other non-profits interested in this (for example, CodeNow), maybe I3 does not need to be a leader in this and instead can play a supporting role.

# 3. US College Education

And again back to the science pipeline analogy, if we are to have a more diverse set of researchers, we need to encourage a diverse set of undergrads to pursue relevant majors and continue on into graduate programs. This won’t be solved by any single program, but here are some potential ideas.

• US underrepresented students could apply for the same 1 year program that is offered to international students.
• Assist universities in establishing bridge programs that partner research universities with colleges that have significant minority populations. A great example of this is the Vanderbilt-Fisk Physics program.
• US colleges would also benefit from the proposed sabbatical program offered to international researchers. I also like the KITP idea of extending it to undergraduate only institutes (especially those with large minority populations) as a way to get more undergrads interested in research.
• Establish a complete set of free college curriculum for machine learning, artificial intelligence, and computational neuroscience. While there are many useful MOOCs on these topics, I still don’t think they beat an actual course.

# Bonus #1 : Research

ICTP has proven that it is possible to further global educational goals and still succeed at research. I would argue that the people working at I3 should mainly be evaluated for tenure based on their mentorship and teaching of students. Research of course will play a role (otherwise it would be poor mentorship of future researchers), but I think there shouldn’t be huge pressure to bring in grants, high-profile publications, etc. But even without that emphasis, there is no way that a group of smart people with motivated students will not lead to great research.

# Bonus #2: International Primary and College Education

This is longer term, but if there are successful programs in improving the US primary and college education, international regional offices, and PhD alumni who are in their home countries, it seems like it should be possible to leverage those connections into a global initiative to improve primary and college education.

# Final Thoughts

So Elon Musk, Peter Thiel, and friends, if you have another billion you want to donate (or OpenAI funds to redirect), here is my proposal. In reality, implementing all of my ideas would probably cost several billion, but once you got the center founded, I think that it would be easy to get tech companies, the US government, and even UNESCO to help provide funding.

My final point is that I think San Diego would be a perfect location. I know I’m biased since I live here now, but there are many legitimate reasons San Diego is great for this institute.

1. UCSD already partners with outside research institutes (Salk, Scripps, etc)
2. UCSD (and Salk, etc) are leaders in all of these research areas
3. It is extremely easy to convince people to take a sabbatical in San Diego

While there are many other great potential locations, I strongly suggest that I3 is not in the Bay Area, Seattle, Boston, or New York City. These cities already have plenty of tech jobs, please spread the wealth to other parts of the US.

Anyways, I’ll keep dreaming that someday I’ll get to work at a place like the one I just described.

# Deep Learning in Python

So maybe after reading some of my past posts, you are fired up to start programming a deep neural network in Python. How should you get started?

If you want to be able to run anything but the simplest neural networks on easy problems, you will find that since pure Python is an interpreted language, it is too slow. Does that mean we have to give up and write our own C++ code? Luckily GPUs and other programmers come to your rescue by offering between 5-100X speedup (I would estimate my average speedup at 10X, but it varies for specific tasks).

There are two main Python packages, Theano and TensorFlow, that are designed to let you write Python code that can either run on a CPU or a GPU. In essence, they are each their own mini-language with the following changes from standard Python:

• Tensors (generalizations of matrices) are the primary variable type and treated as abstract mathematical objects (don’t need to specify actual values immediately).
• Computational graphs are utilized to organize operations on the tensors.
• When one wants to actually evaluate the graph on some data, it is stored in a shared variable that when possible gets sent to the GPU. This data is then processed by the graph (in place of the original tensor placeholders).
• Automatic differentiation (ie it understands derivatives symbolically).
• Built in numerical optimizations.
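To make "computational graph" and "automatic differentiation" a bit more concrete, here is a toy sketch in plain Python. It is NOT the Theano or TensorFlow API: the `Node` class and `backward` function are hypothetical names I made up, it only handles scalars with addition and multiplication, and it computes values eagerly rather than building a symbolic graph first.

```python
class Node:
    """A value in a computational graph that remembers how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # tuple of (parent_node, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(output):
    """Reverse-mode differentiation: push gradients from the output to the inputs."""
    order, seen = [], set()

    def visit(node):                # depth-first topological sort of the graph
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):    # each node is finished before its parents
        for parent, local in node.parents:
            parent.grad += node.grad * local    # chain rule


x, w, b = Node(3.0), Node(2.0), Node(1.0)
y = w * x + b          # a one-neuron "network" with no nonlinearity
loss = y * y           # squared output
backward(loss)
```

Here `loss.value` is 49, and the gradients come out as `w.grad = 42`, `x.grad = 28`, and `b.grad = 14`, matching $2y \cdot x$, $2y \cdot w$, and $2y$ with $y = 7$. The real packages apply the same idea to whole tensors and add the GPU support and numerical optimizations listed above.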

So to get started you will want to install either Theano (pip install theano), TensorFlow (details here), or both. I personally have only used Theano, but if Google keeps up the developmental progress of TensorFlow, I may end up switching to it.

At the end of the day, that means that if one wants to actually implement neural networks in Theano or TensorFlow, you essentially will learn another language. However, people have built various libraries that are abstractions on top of these mini-languages. Lasagne is one example that basically organizes Theano code so that you have to interact less with Theano, but you will still need to understand Theano. I initially started with Theano and Lasagne, but I am now a convert to Keras.

Instead, I advocate for Keras (pip install keras) for two major reasons:

1. High level abstraction. You can write standard Python code and get a deep neural network up and running very quickly.
2. Back-end agnostic. Keras can run on either Theano or TensorFlow.

So it seems like a slam dunk right? Unfortunately life is never that simple, instead there are two catches:

1. Mediocre documentation (using Numpy as a gold standard, or even comparing to Lasagne). You can get the standard things up and running based on their docs. But if you want to do anything advanced, you will find yourself looking into their source code on GitHub, which has some hidden, but useful, comments.
2. Back-end agnostic. This means if you do want to introduce a modification to the back-end, and you want it to always work in Keras, you need to implement it in both Theano and TensorFlow. In practice this isn’t too bad since Keras has done a good job of implementing low-level operations.

Fortunately, the pros definitely outweigh the cons for Keras and I highly endorse it. Here are a few tips I have learned from my experience with Keras:

• Become familiar with the Keras documentation.
• I recommend only using the functional API which allows you to implement more complicated networks. The sequential API allows you to write simple models in fewer lines of code, but you lose flexibility (for example, you can’t access intermediate layers) and the code won’t generalize to complex models. So just embrace the functional API.
• Explore the examples (here and here).
• Check out the Keras GitHub.
• Names for layers are optional keywords, but definitely use them! It will significantly help you when you are debugging.

Now start coding your own deep neural networks!

# Thoughts on OpenAI

OpenAI was started just over 6 months ago, and I feel like they have done enough to warrant a review of what they have done so far, and my thoughts of what they should do next.

## What is OpenAI?

OpenAI was announced in December 2015 and their stated mission is:

OpenAI is a non-profit artificial intelligence research company. Our goal is to advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return.

In the short term, we’re building on recent advances in AI research and working towards the next set of breakthroughs.

## What have they done so far?

1. Started a new, small (so far) research center
2. Experimented with a novel organization of the research center
3. Hired a variety of smart people
4. Released a toolkit for reinforcement learning (RL)

Since it has only been six months and they are still getting setup, it is still difficult to assess how well they have done. But here are my first impressions of the above points.

1. Always great to have more places hiring researchers!
2. Way too early to assess. I’m always intrigued by experiments of new ways to organize research, since there are three dominant types of organizations today (academia, industry focused on development, and industry focused on longterm research).
3. Bodes well for their future success.
4. I have yet to use it, but it looks awesome. Supervised learning was sped along by datasets such as UC Irvine’s Machine Learning Repository, MNIST, and ImageNet, and I think their toolkit could have a similar impact on RL.

## What do I think they should do?

This blog post was motivated by me having a large list of things that I think OpenAI should be doing. After I started writing, I realized that many of the things on my wish list would probably be better run by a new research institute, which I will detail in a future post. So here, I focus on my research wish-list for OpenAI.

### Keep the Data Flowing

As Neil Lawrence pointed out shortly after OpenAI’s launch, data is king. So I am very happy with OpenAI’s RL toolkit. I hope that they keep adding new datasets or environments that machine learners can use. Some future ideas include supporting new competitions (maybe in partnership with Kaggle?), partnering with organizations to open up their data, and introducing datasets for unsupervised learning.

### Unsupervised Learning

But maybe I’m putting the cart (data) before the horse (algorithms and understanding). Unsupervised learning is tough because of a series of interconnected issues:

• What are good test cases / datasets for unsupervised learning?
• How does one assess learning success?
• Are our current algorithms even close to the “best”?

The reason supervised learning is easier is that algorithms require data with labels, there are lots of established metrics for evaluating success (for example, accuracy of label predictions), and we know for most metrics what the best possible score is (100% correct label predictions). Reinforcement learning has some of that (data and a score), but is much less well defined than supervised learning.

So while I think the progress on reinforcement learning will definitely lead to new ideas for unsupervised learning, more work needs to be done directly on unsupervised learning. And since they have no profit motives or tenure pressure, I really hope OpenAI focuses on this extremely tough area.

### Support Deep Learning Libraries

We currently have a very good problem: lots of deep learning libraries, to the point of almost being too many. A few years ago, everyone had to essentially code their own library, but now one can choose from Theano and TensorFlow for low-level libraries, to Lasagne and Keras for high-level libraries, just to name a few examples from Python.

I think that OpenAI could play a useful role in standardization and testing of libraries. While there are tons of great existing libraries, their documentation quality varies significantly, and in general is subpar (for example, compared to NumPy). Additionally, besides choosing a language (I strongly advocate Python), one usually needs to choose a backend library (Theano vs TensorFlow), and then a high-level library.

So my specific proposal for OpenAI is the following initiatives:

1. Help establish some deep learning standards so people can verify the accuracy of a library and assess its quality and speed
2. Set up some meetings between Theano, TensorFlow, and others to help standardize the backend (and include them in the setting of standards)
3. Support initiatives for developers to improve documentation of their libraries
4. Support projects that are agnostic to the backend (like Keras) and/or help other packages that are backend specific (like Lasagne) become backend agnostic

As a recent learner of deep learning, and someone who interacts extensively with non-machine learners, I think the above initiatives would allow a wider population of researchers to incorporate deep learning in their research.

### Support Machine Learning Education

I believe this is the crucial area that OpenAI is missing, and it will prevent them from fulfilling their stated mission to help all of humanity.

Check out a future post for my proposed solution…

# Deep Learning

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

## Papers

Deep Learning. By LeCun, Bengio, and Hinton in 2015.

Deep learning in neural networks: An overview. By Schmidhuber in 2015.

## What is deep learning?

In previous weeks we have introduced perceptrons, multilayer perceptrons (MLP), Hopfield neural networks, and Boltzmann machines.

But what is deep learning? I think it is really two things:

1. Successful training of multilayered neural networks that perform better (higher classification accuracy, etc.) and involve more layers than previous implementations
2. Just a rebranding of neural networks

Here is my summary of the history of deep learning; see both reviews above for extended details. In 2006, Deep Belief Networks (DBN) were introduced in two papers (Reducing the Dimensionality of Data with Neural Networks and A Fast Learning Algorithm for Deep Belief Nets). The idea of a DBN is to train a series of restricted Boltzmann machines (RBM). The first RBM is trained on the original data and then produces hidden layer activations as its output. The second RBM uses the first RBM’s hidden layer activations as inputs, and trains on that “data”. This is continued to the desired depth. At this point, the DBN can be used for unsupervised learning, or one can use it as pretraining for an MLP which will utilize backpropagation for supervised learning.
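The greedy, layer-by-layer stacking described above can be sketched in a few lines of NumPy. This is a toy illustration under my own assumptions, not the procedure from the original papers: each RBM is binary, trained with full-batch one-step contrastive divergence, and passes hidden probabilities (rather than samples) up to the next layer. The class name `RBM`, the helper `train_dbn`, the layer sizes, learning rate, and random toy data are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Binary RBM trained with one-step contrastive divergence (CD-1)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_visible = np.zeros(n_visible)
        self.b_hidden = np.zeros(n_hidden)

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_hidden)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_visible)

    def cd1_update(self, v0, lr=0.1):
        # Positive phase: hidden activations driven by the data.
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        # Negative phase: one reconstruction step away from the data.
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_visible += lr * (v0 - v1).mean(axis=0)
        self.b_hidden += lr * (h0 - h1).mean(axis=0)

def train_dbn(data, layer_sizes, epochs=5, seed=0):
    """Train one RBM per layer, feeding each RBM the previous hidden activations."""
    rng = np.random.default_rng(seed)
    rbms, layer_input = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(layer_input.shape[1], n_hidden, rng)
        for _ in range(epochs):
            rbm.cd1_update(layer_input)
        rbms.append(rbm)
        # The hidden activations become the "data" for the next RBM.
        layer_input = rbm.hidden_probs(layer_input)
    return rbms, layer_input

# Toy binary data: 32 samples with 12 visible units each.
toy_data = (np.random.default_rng(1).random((32, 12)) < 0.5).astype(float)
rbms, top_features = train_dbn(toy_data, layer_sizes=[8, 4])
```

The learned weights in `rbms` could then initialize an MLP for supervised fine-tuning, which is the pretraining use mentioned above.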

So on the one hand, there were technical breakthroughs that enabled neural networks to utilize more layers than previous iterations and achieve state-of-the-art performance. However, the actual components (RBMs and MLPs) have been around since the 1980s, so it would also be fair to deem deep learning a rebranding of neural networks.

Therefore, neural network winter (mid 90s to mid 00s) officially ended in 2006. I propose calling 2006-2012 neural network spring. While interest in neural networks increased and new advances were made, the general machine learning community was not obsessed with deep learning. That changed in 2012 when the neural network summer began. This paper presented at NIPS revolutionized the computer vision community by cutting the error rate on ImageNet nearly in half! The ImageNet challenge was viewed as a serious benchmark that all computer vision systems should address. By blowing previous results out of the water, this paper completed the revolution.

So for now, enjoy neural network summer, but always remember, winter is coming.

## Fundamental Questions

• When do extra layers help in a neural network? When do they hurt?
• Why was pretraining originally needed, but is no longer used in practice? Check out these papers for details: Glorot and Saxe.
• Learn about convolutional and recurrent neural networks. These are extremely popular right now!

• Do research on unsupervised learning! It is definitely less popular today, but all the big shots think it is the long-term future of neural networks.

# Training Networks

This is part of my “journal club for credit” series. You can see the other computational neuroscience papers in this post.

## Unit: Deep Learning

1. Perceptron
2. Energy Based Neural Networks
3. Training Networks
4. Deep Learning

## Papers

A Learning Algorithm for Boltzmann Machines. By Ackley, Hinton, and Sejnowski in 1985.

Learning representations by back-propagating errors. By Rumelhart, Hinton, and Williams in 1986.

## How do you actually train neural networks?

Hopefully the past few posts have piqued your interest in neural networks. Maybe you even want to unleash a neural network on some data. How do you actually train the neural network?

I’m actually going to keep this brief for two reasons. First, detailed derivations can already be found elsewhere (for Boltzmann see Appendix of the original paper as well as MacKay, for backpropagation see Nielsen). Second, I firmly believe that algorithms are best learned by actually stepping through the updates, so any explanation I attempt will not be sufficient for you to truly learn the algorithm. I will provide some general context as well as some questions you should be able to answer, but please go do it yourself!

There are three general classes of machine learning based on the information received:

• Unsupervised – data only. Boltzmann machine.
• Supervised – data with labels. MLP with backpropagation.
• Reinforcement – data, actions, and scores associated with each action. Deserves its own detailed post, but check out papers by DeepMind for cool applications.

The Boltzmann machine learning rule is an example of maximum likelihood. In practice, the original learning rule is too computationally expensive, so a modified algorithm called contrastive divergence (or variants such as persistent contrastive divergence) is utilized instead. See the Hinton guide to RBMs for more details.
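As a rough guide to what these rules look like, here is the standard textbook form of the maximum likelihood update for a weight $w_{ij}$. The notation is my own choice rather than the original paper's: $s_i$ is the state of unit $i$, $\epsilon$ is a learning rate, and $\langle \cdot \rangle$ denotes an average.

$$\Delta w_{ij} = \epsilon \left( \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}} \right)$$

The first average is taken with the visible units clamped to the data, and the second with the network running freely at equilibrium, which is the expensive part. One-step contrastive divergence replaces the model average with statistics gathered after a single reconstruction of the data:

$$\Delta w_{ij} \approx \epsilon \left( \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{recon}} \right)$$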

Backpropagation is a computationally efficient way of applying the chain rule from calculus, so besides the above paper which popularized it, there is actually a long history of this algorithm being discovered and rediscovered.
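To see the chain rule at work, here is a minimal sketch in plain Python. The tiny network (one input, one tanh hidden unit, one linear output, squared error loss) and all parameter values are my own illustrative choices. The backpropagation gradient is checked against a central finite difference estimate.

```python
import math

def loss(params, x, target):
    """Forward pass of a 1-1-1 network followed by a squared error loss."""
    w1, b1, w2, b2 = params
    hidden = math.tanh(w1 * x + b1)
    output = w2 * hidden + b2
    return 0.5 * (output - target) ** 2

def backprop_gradient(params, x, target):
    """Gradient of the loss w.r.t. (w1, b1, w2, b2) via the chain rule."""
    w1, b1, w2, b2 = params
    hidden = math.tanh(w1 * x + b1)
    output = w2 * hidden + b2
    d_output = output - target              # dL/d(output)
    d_hidden = d_output * w2                # dL/d(hidden)
    d_pre = d_hidden * (1.0 - hidden ** 2)  # through tanh: dL/d(w1*x + b1)
    return [d_pre * x, d_pre, d_output * hidden, d_output]

def finite_difference_gradient(params, x, target, eps=1e-6):
    """Central differences: two extra forward passes per parameter."""
    grads = []
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += eps
        minus[i] -= eps
        grads.append((loss(plus, x, target) - loss(minus, x, target)) / (2 * eps))
    return grads

params = [0.5, -0.3, 0.8, 0.1]
analytic = backprop_gradient(params, x=1.5, target=0.2)
numeric = finite_difference_gradient(params, x=1.5, target=0.2)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
```

Note that the backpropagation version reuses one forward pass for all four parameters, while the finite difference version repeats the forward pass for every parameter.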

## Fundamental Questions

• What is maximum likelihood?
• Why can one interpret the learning terms in the BM algorithm as “waking” and “sleeping”?
• Why are BM hidden layers so important?
• Why are restricted Boltzmann machines, RBMs, much easier to train?
• Why is backpropagation more computationally efficient than the finite difference method?
• Derive the 4 backpropagation equations!