Deep Learning Tips

I thought I would write up some general tips and tricks that I have learned by experimenting with neural networks. My focus is on tips that apply to any problem and any neural network architecture; in fact, some of these apply more generally to any machine learning algorithm. So what have I learned over the years?

Data Splits

Before doing anything else, you need to split the dataset into training and testing. But how much data should go into each split? This depends on your number of samples and the number of classes. For example, MNIST has only 10 digits with little variation in each digit, so the standard split is around 80% train and 20% test. ImageNet has over a million samples of 1000 diverse classes, so they use around 50% train and 50% test. So if you have an easy problem and/or a small dataset, I would suggest 80% train and 20% test. If you have a very tough problem and/or a large dataset, I would suggest 50% train and 50% test.

The test data should now be put in a lock box and only used on your final model.

Next, you should also set aside some of the training data for validation, which is used to estimate generalization performance when tuning hyperparameters. I would suggest using around 20% of the training data as a validation set.

Finally, I do a little bit of cheating and I data snoop. I usually take a very tiny amount of the data, maybe 1-5% and play around with it. I will inspect the data to make sure that it looks good, and use the small number of samples to debug my initial code and very roughly tune the hyperparameters. This saves you the headache of doing a long training session only to find out that you had a bug in your code or grossly misunderstood where to start your hyperparameter search.
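For concreteness, here is a minimal sketch of these splits using scikit-learn; the arrays are random placeholders, and the exact fractions (80/20 test, 20% validation, ~2% snoop) are just the suggestions from above.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Hypothetical data: 1000 samples, 20 features, binary labels.
    X, y = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)

    # 80% train / 20% test (the easy-problem/small-dataset suggestion).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    # Set aside 20% of the training data for validation.
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.2, stratify=y_train, random_state=0)

    # A tiny (~2%) subset for data snooping: inspection, debugging,
    # and very rough hyperparameter tuning.
    X_snoop, y_snoop = X_train[:20], y_train[:20]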

Data Preprocessing

As a general rule, the data should be standardized by preprocessing. I’ll discuss some specific standardizations below, but a general issue is whether to standardize by the whole dataset, per sample, or per feature. I tend to default to per sample, but I don’t have a good scientific reason why that is the best. If you standardize by the whole dataset or per feature, you need to make sure you only use the training data to set the scales. If you standardize per feature, make sure that all of your features have significant variation before doing so (see MNIST for an example where per feature standardization can lead to weird results since many features have a standard deviation of zero).

Mean

All numerical data should be mean centered, no questions asked. If your classes can be robustly classified just by the mean difference, then you don’t need a neural network. You have a very simple problem and should just use a simple threshold discriminator.

Scaling

I highly recommend scaling the data so that it is all order 1. This can speed up training because most weight initialization schemes assume that the data is mean centered and has values around the size of 1. But there are two natural ways to scale your data: by the standard deviation or by the range. If your data looks normally distributed, then the standard deviation makes sense. Otherwise, I just divide by the maximum of the absolute value.
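As a rough sketch, per-sample centering and scaling might look like the following; the per-sample axis and the zero-variance guard are my own defaults, not a universal recipe.

    import numpy as np

    def standardize_per_sample(X, normal=True):
        # Mean center each sample (row).
        X = X - X.mean(axis=1, keepdims=True)
        if normal:
            # Data looks normally distributed: scale by the standard deviation.
            scale = X.std(axis=1, keepdims=True)
        else:
            # Otherwise: scale by the maximum absolute value.
            scale = np.abs(X).max(axis=1, keepdims=True)
        # Guard against zero-variance samples to avoid dividing by zero.
        return X / np.where(scale == 0, 1.0, scale)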

Correlations

In theory, it can also be helpful to remove correlations between features by using PCA or ZCA whitening. However, in practice you may run into numerical stability issues since you will need to invert a matrix. So this is worth considering, but takes some more careful application.
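A sketch of ZCA whitening fit on the training data only; the small eps added to the eigenvalues is the usual guard against the numerical issues mentioned above.

    import numpy as np

    def fit_zca(X_train, eps=1e-5):
        # Covariance of the mean-centered training data.
        X = X_train - X_train.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        # Eigendecomposition; eps guards against tiny/zero eigenvalues.
        vals, vecs = np.linalg.eigh(cov)
        W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
        return W

    # Apply the SAME whitening matrix (and training mean) to train and test:
    # X_white = (X - train_mean) @ W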

Data Augmentation

More training data is always better, but obtaining that data can be expensive. So I always try hard to find a way to do data augmentation. However, the correct data augmentation is usually problem specific, so I won’t go into details here.

Early Stopping

The no free lunch theorem of machine learning states that there is no general learning algorithm that will solve all problems. However, Geoff Hinton has pointed out that early stopping is as close to a free lunch as we can get. Early stopping is the easiest way for any machine learning algorithm to avoid overfitting, and you can read more about the technical justifications for it at Distill’s momentum article.

Optimizer

SGD vs Adam

In practice, all optimizers for neural networks involve some form of stochastic gradient descent (SGD). The only question is whether you need to manually tune the learning rate and other parameters, or whether you use an adaptive version of SGD that automatically adjusts the learning rates. I think the best adaptive method is Adam (and Nadam when possible, see the later subsection on momentum). So for me the choice is simple: either plain SGD or Adam/Nadam. For a more complete comparison of SGD variants, I highly recommend this blog post.

Learning Rate

If you are using Adam, you will rarely need to tune the learning rate. But for SGD, the learning rate is by far the most important parameter to tune. A nice tip from Yoshua Bengio is this: the optimal learning rate is often an order of magnitude lower than the smallest learning rate that blows up the loss. This means you should start with a high learning rate and work your way down half an order of magnitude at a time (for example: 1, 0.3, 0.1, …). Then start your fine-grained learning rate search about an order of magnitude below the last time the loss blew up.
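A sketch of that coarse-to-fine search; the grid values are illustrative, and here I pretend 0.3 was the smallest rate that blew up the loss.

    import numpy as np

    # Coarse search: drop half an order of magnitude per step.
    coarse_rates = [1.0, 0.3, 0.1, 0.03, 0.01, 0.003, 0.001]

    # Suppose 0.3 was the smallest rate that blew up the loss. Center the
    # fine-grained search about an order of magnitude below that point.
    fine_rates = np.logspace(np.log10(0.3) - 1.5, np.log10(0.3) - 0.5, num=5)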

Another useful tweak on the learning rate is to have it decay over the course of training. I find that this slightly improves the final performance, but more importantly leads to consistent training results. There are a variety of ways to implement the decay, but I’m not sure they make that much of a difference. My standard implementation is

l_{batch} = \frac{l_{start}}{1 + decay \cdot N_{batches}}

where N_{batches} is the number of minibatches seen so far during training. I then set decay so that the final learning rate at the end of all the epochs is 1/10th the starting learning rate.
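The Keras versions I have used implement exactly this schedule through the decay argument of SGD, which makes setting the final rate to 1/10th of the starting rate a one-line calculation (the epoch and batch counts below are placeholders):

    from keras.optimizers import SGD

    # Solve l_start / (1 + decay * N_batches) = l_start / 10
    # for decay, giving decay = 9 / N_batches.
    epochs, batches_per_epoch = 100, 500  # placeholder totals
    l_start = 0.1
    decay = 9.0 / (epochs * batches_per_epoch)

    optimizer = SGD(lr=l_start, decay=decay, momentum=0.9, nesterov=True)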

Momentum

Momentum is very useful for neural networks, but in practice I spend minimal time tuning the momentum rate because I have a few default settings that I strongly recommend.

First, I really only consider three possible momentum values: 0.5, 0.9, and 0.99. Since the maximum effect of momentum is \frac{1}{1-momentum}, my default values are roughly spaced by an order of magnitude (maximum effects of 2, 10, and 100, respectively). I always start with 0.9 and go from there.

Also, I always choose Nesterov momentum whenever possible. Most packages, like Keras, have Nesterov as an option for SGD, and Keras also has Nadam, which is Adam with Nesterov momentum. For more details on Nesterov, see here. The short explanation is that it leads to the same maximum effect of \frac{1}{1-momentum}, but it does so in a more gradual manner. In practice, this means that while standard momentum gets very unstable above 0.9, Nesterov momentum can be safely set to 0.99.

Another useful tip is to set the momentum to a smaller value (say half your standard value) for the final few epochs (maybe the last 5-10% of epochs). The intuition for why this is helpful is that hopefully by the end of training, the neural network is close to good weights, but it might be rocking back and forth around the optimal weights. Since the neural network weight space is highly non-convex, by tuning down the momentum, you force the neural network to settle down into these non-convex “valleys” that may contain the best weights.

The final tip, originally suggested here, is to exponentially ramp the momentum up and down anytime you want to change the momentum rate during training. This gives the weight updates time to adjust to the new momentum rates. I personally have found this gives a very slight improvement in performance, but more importantly, it leads to consistent training results.

Summary of my momentum tips (a sketch implementing the schedule follows the list):

  • Peak momentum values of: 0.5, 0.9, or 0.99
  • Always choose Nesterov momentum if possible
  • Start momentum initially at half the desired peak value and exponentially ramp up
  • Towards the end of training, exponentially ramp down momentum to half the desired peak value.
  • Train for 5-10% of epochs at the desired smaller momentum.
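Here is a rough Keras callback implementing that schedule; the exponential time constant of 5 epochs is my own guess at a reasonable default, and this relies on Keras’ SGD storing momentum as a backend variable (true in the versions I have used).

    import numpy as np
    from keras import backend as K
    from keras.callbacks import Callback

    class MomentumSchedule(Callback):
        """Ramp momentum from peak/2 up to peak, then back down to
        peak/2 over the final ~10% of epochs."""
        def __init__(self, peak=0.9, epochs=100, tail=0.1, tau=5.0):
            super(MomentumSchedule, self).__init__()
            self.peak, self.epochs, self.tail, self.tau = peak, epochs, tail, tau

        def on_epoch_begin(self, epoch, logs=None):
            ramp_down = int((1.0 - self.tail) * self.epochs)
            if epoch < ramp_down:
                # Exponential ramp up from peak/2 toward peak.
                m = self.peak - (self.peak / 2.0) * np.exp(-epoch / self.tau)
            else:
                # Exponential ramp down back toward peak/2.
                m = (self.peak / 2.0) * (1.0 + np.exp(-(epoch - ramp_down) / self.tau))
            K.set_value(self.model.optimizer.momentum, m)

Usage would be something like model.fit(..., callbacks=[MomentumSchedule(peak=0.9, epochs=100)]), with epochs matching the fit call.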

Initialization

All weights should be initialized to an orthogonal matrix. This is extremely important for recurrent neural networks (as explained here), but I have also found it to be useful for all neural networks.
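In Keras this is a one-argument change per layer (a sketch; the layer size is arbitrary):

    from keras.layers import Dense
    from keras.initializers import Orthogonal

    # Initialize the weight matrix to be orthogonal.
    layer = Dense(64, activation='relu', kernel_initializer=Orthogonal())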

Activation Function

The standard is that all hidden layers are ReLUs unless you need the hidden layers to be a valid probability, in which case you should use a sigmoid.

Loss

Choosing the right loss function is very problem dependent, so I will leave that for another day. However, whatever loss function you do choose, make sure the output layer activation function is complementary to that loss; see Michael Nielsen’s book for details on why sigmoid outputs and cross-entropy losses are complementary.
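As a concrete pairing, here is a minimal binary classifier with a sigmoid output and cross-entropy loss; the architecture and input shape are placeholders.

    from keras.models import Sequential
    from keras.layers import Dense

    model = Sequential([
        Dense(64, activation='relu', input_shape=(20,)),
        Dense(1, activation='sigmoid'),  # output is a valid probability
    ])
    # Cross-entropy is the complementary loss for a sigmoid output.
    model.compile(optimizer='nadam', loss='binary_crossentropy',
                  metrics=['accuracy'])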

Regularization

Weights

Weight regularization is almost always a requirement to prevent overfitting and to get good generalization. The two main choices are L1 or L2 regularization. L1 will ensure that small weights are set to zero, and hence will lead to a sparser set of weights. L2 prevents weights from becoming too large, but does not sparsify the weights. Personally, rather than choosing between the two, I tend to default to both. I set L1 to be very small so that I at least get slightly sparser weights, but then I mainly focus on tuning L2 to control overfitting.
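In Keras, that default looks something like this; the specific coefficients are placeholders meant to show the small-L1/tuned-L2 split, not recommendations.

    from keras.layers import Dense
    from keras.regularizers import l1_l2

    # Tiny L1 for slight sparsity; tune L2 to control overfitting.
    layer = Dense(64, activation='relu',
                  kernel_regularizer=l1_l2(l1=1e-6, l2=1e-4))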

Activity

Dropout and batch normalization are not regularizers in the traditional sense, but in practice they help reduce overfitting by controlling the activation outputs. Additionally, it is extremely difficult to train very deep neural networks without using either dropout or batchnorm. Dropout was the standard for several years, but now it is usually replaced by batchnorm.
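For reference, here is where each would sit in a Keras model (a sketch; the layer sizes and dropout rate are arbitrary):

    from keras.models import Sequential
    from keras.layers import Dense, Dropout, BatchNormalization

    model = Sequential([
        Dense(128, activation='relu', input_shape=(20,)),
        BatchNormalization(),  # the more common modern choice
        Dense(128, activation='relu'),
        Dropout(0.5),          # the older standard
        Dense(1, activation='sigmoid'),
    ])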

Parameter Tuning

Neural networks have a lot of interdependent hyperparameters to tune, so picking which ones to tune first is a bit of a chicken-and-egg problem. Personally, I start off with an adaptive optimizer (like Adam or Nadam) and then tune the architecture. Next, I will roughly tune the regularization. Once that leads to acceptable results, I will switch the optimizer to SGD and only focus on tuning the learning rate. If SGD seems promising, I will then tune other parameters like decay and momentum. Hopefully by this point, you are achieving pretty good results. I will then use this neural network as the starting point for a systematic hyperparameter search to truly find the best results.

Final Tips

Don’t take my word for anything, try it out yourself! I strongly recommend experimenting with every option you can find in Keras and seeing for yourself what actually works. I also suggest getting opinions from as many people as possible (see Yoshua Bengio’s tips). I think that about 90% of the advice will overlap, but everyone has their own bias. So hopefully, by reading enough independent sources, you can average out all of our mistakes. Good luck!

Temporal Difference Learning

How can humans or machines interact with an environment and learn a strategy for selecting actions that are beneficial to their goals? Answers to this question fall under the artificial intelligence category of reinforcement learning. Here I am going to provide an introduction to temporal difference (TD) learning, which is the algorithm at the heart of reinforcement learning.

I will be presenting TD learning from a computational neuroscience background. My post has been heavily influenced by Dayan and Abbott Ch 9, but I have added some additional points. The ultimate reference for reinforcement learning is the book by Sutton and Barto, and their chapter 6 dives into TD learning.

Conditioning

To start, let’s review conditioning. The most famous example of conditioning is Pavlov’s dogs. The dogs naturally salivated upon the delivery of food, but Pavlov realized that he could condition dogs to associate the ringing of a bell with the delivery of food. Eventually, the ringing of the bell on its own was enough to cause the dogs to salivate.

The specific example of Pavlov’s dogs is an example of classical conditioning. In classical conditioning, no action needs to be taken. However, animals can also learn to associate actions with rewards and this is called operant conditioning.

Before I introduce some specific conditioning paradigms, here are the important definitions:

  • s = stimulus
  • r = reward
  • x = no reward
  • v = value, or expected reward (generally a function of r, x)
  • u = binary indicator variable of stimulus (1 if the stimulus is present, 0 otherwise)

Here are the conditioning paradigms I want to discuss:

  • Pavlovian
  • Extinction
  • Blocking
  • Inhibitory
  • Secondary

For each of these paradigms, I will introduce the necessary training stages and the final result. The statement, a \rightarrow b, means that a becomes associated (\rightarrow) with b.

Pavlovian

Training: s \rightarrow r. The stimulus is trained with a reward.

Results: s \rightarrow v[r]. The stimulus is associated with the expectation of a reward.

Extinction

Training 1: s \rightarrow r. The stimulus is trained with a reward. This eventually leads to successful Pavlovian training.

Training 2: s \rightarrow x. The stimulus is trained with no reward.

Results: s \rightarrow v[x]. The stimulus is associated with the expectation of no reward. Extinction of the previous Pavlovian training.

Blocking

Training 1: s_1 \rightarrow r. The first stimulus is trained with a reward. This eventually leads to successful Pavlovian training.

Training 2: s_1 + s_2 \rightarrow r. The first stimulus and a second stimulus are trained together with a reward.

Results: s_1 \rightarrow v[r], and s_2 \rightarrow v[x]. The first stimulus completely explains the reward and hence “blocks” the second stimulus from being associated with the reward.

Inhibitory

Training: s_1+s_2 \rightarrow x, and s_1 \rightarrow r. The combination of the two stimuli leads to no reward, but the first stimulus alone is trained with a reward.

Results: s_1 \rightarrow v[r], and s_2 \rightarrow -v[r]. The first stimulus is associated with the expectation of the reward, while the second stimulus is associated with the negative of the reward.

Secondary

Training 1: s_1 \rightarrow r. The first stimulus is trained with a reward. This eventually leads to successful Pavlovian training.

Training 2: s_2 \rightarrow s_1. The second stimulus is trained with the first stimulus.

Results: s_2 \rightarrow v[r]. Eventually the second stimulus is associated with the reward despite never being directly associated with the reward.

Rescorla-Wagner Rule

How do we turn the various conditioning paradigms into a mathematical framework of learning? The Rescorla-Wagner (RW) rule is a very simple model that can explain many, but not all, of the above paradigms.

The RW rule is a linear prediction model that requires these three equations:

  1. v=w \cdot u
  2. \delta = r-v
  3. w_{new} = w_{old}+\epsilon \delta u

and introduces the following new terms:

  • w = weights associated with stimuli state
  • \epsilon = learning rate, with 0 \le \epsilon \le 1

What do each of these equations actually mean?

  1. The expected reward, v, is a linear dot product of a vector of weights, w, associated with each stimulus, u.
  2. But there may be a mismatch, or error, between the true actual reward, r, and the expected reward, v.
  3. Therefore, we should update the weight of each stimulus. We do this by adding a term that is proportional to the learning rate \epsilon, the error \delta, and the stimulus u (see the simulation sketch after this list).
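To make this concrete, here is a tiny simulation of Pavlovian acquisition followed by extinction under the RW rule; the 100-trial blocks and \epsilon = 0.1 are arbitrary choices.

    import numpy as np

    # One stimulus (u = 1 every trial); reward r = 1 for the first
    # 100 trials (acquisition), r = 0 afterwards (extinction).
    epsilon, n_trials = 0.1, 200
    w = 0.0
    weights = []
    for trial in range(n_trials):
        u = 1.0
        r = 1.0 if trial < 100 else 0.0
        v = w * u                     # (1) expected reward
        delta = r - v                 # (2) prediction error
        w = w + epsilon * delta * u   # (3) weight update
        weights.append(w)
    # `weights` rises exponentially toward 1, then decays exponentially
    # back toward 0: the acquisition and extinction curves described below.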

During a Pavlovian pairing of a stimulus with a reward, the RW rule predicts an exponential approach of the weight to w = \langle ru\rangle over the course of several trials for most values of \epsilon (if \epsilon=1, it would instantly update to the final value. Why is this usually bad?). Then, if the reward stops being paired with the stimulus, the weight will exponentially decay over the course of the next trials.

The RW rule will also continue to work when the reward/stimulus pairing is stochastic instead of deterministic, and the weight will still approach the final value of w = \langle ru\rangle.

How does blocking fit into this framework? Well, the RW rule says that after the first stage of training, the weights are w_1 = r and w_2 = 0 (since we have not presented stimulus two). When we start the second stage of training and try to associate stimulus two with the reward, we find that we cannot learn that association. The reason is that there is no error (hence \delta = 0), and therefore w_2 = 0 forever. If instead we had only imperfectly learned the weight of the first stimulus, then there would still be some error, and hence some learning is possible.

One thing that the RW rule incorrectly predicts is secondary conditioning. In this case, during the learning of the first stimulus, s_1, the learned weight becomes w_1 > 0. The RW rule predicts that the second stimulus, s_2, will become w_2 < 0. This is because this paradigm is exactly the same as inhibitory conditioning, according to the RW rule. Therefore, a more complicated rule is required to successfully produce secondary conditioning.

One final note: the RW rule can provide an even better match to biology by assuming a non-linear relationship between v and the animal’s behavior. This function is often something that exponentially saturates at the maximal reward (i.e., an animal is much more motivated to go from 10% to 20% of the max reward than from 80% to 90%). While this provides a better fit to many biological experiments, it still cannot explain the secondary conditioning paradigm.

Temporal Difference Learning

To properly model secondary conditioning, we need to explicitly add in time to our equations. For ease, one can assume that time, t, is discrete and that a trial lasts for total time T and therefore 0 \le t \le T.

The straightforward (but wrong) extension of the RW rule to time is:

  1. v[t]=w[t-1] \cdot u[t]
  2. \delta[t] = r[t]-v[t]
  3. w[t] = w[t-1]+\epsilon \delta[t] u[t]

where we will say that it takes one time unit to update the weights.

Why is this naive RW rule with time wrong? Well, psychology and biology experiments show that an animal’s expected reward does NOT reflect just the past history of rewards, nor just the next time step, but instead reflects the expected rewards during the WHOLE REMAINDER of the trial. Therefore, a better match to biology is:

  1. v[t]=w[t-1] \cdot u[t]
  2. R[t]= \langle \sum_{\tau=0}^{T-t} r[t+\tau] \rangle
  3. \delta[t] = R[t]-v[t]
  4. w[t] = w[t-1]+\epsilon \delta[t] u[t]

where R[t] is the full reward expected over the remainder of the trial while r[t] remains the reward at a single time step. This is closer to biology, but we are still missing a key component. Not all future rewards are treated equally. Instead, rewards that happen sooner are valued higher than rewards in the distant future (this is called discounting). So the best match to biology is the following:

  1. v[t]=w[t-1] \cdot u[t]
  2. R[t]= \langle \sum_{\tau=0}^{T-t} \gamma^\tau r[t+\tau] \rangle
  3. \delta[t] = R[t]-v[t]
  4. w[t] = w[t-1]+\epsilon \delta[t] u[t]

where 0 \le \gamma \le 1 is the discounting factor for future rewards. A small discounting factor implies we prefer rewards now while a large discounting factor means we are patient for our rewards.

We have managed to write down a set of equations that accurately summarize biological reinforcement. But how can we actually learn with this system? As currently written, we would need to know the average reward over the remainder of the whole trial. Temporal difference learning makes the following assumptions in order to solve for the expected future rewards:

  1. Future rewards are Markovian
  2. Current observed estimate of reward is close enough to the typical trial

A Markov process is memoryless in that the next future step only depends on the current state of the system and has no other history dependence. By assuming rewards follow this structure, we can make the following approximation:

  • R[t]= \langle r[t+1] \rangle + \gamma \langle \sum_{\tau=1}^{T-t} \gamma^{\tau-1} r[t+\tau] \rangle
  • R[t]= \langle r[t+1] \rangle + \gamma R[t+1]

The second approximation is called bootstrapping: we will use the currently observed values rather than the full estimate of future rewards. So finally we end up at the temporal difference learning equations (a small simulation sketch follows the list):

  1. v[t]=w[t-1] \cdot u[t]
  2. R[t] =  r[t+1] + \gamma v[t+1]
  3. \delta[t] =r[t+1] + \gamma v[t+1]-v[t]
  4. w[t] = w[t-1]+\epsilon \delta[t] u[t]
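Here is a minimal tabular simulation of these equations for a single trial type. It is a sketch under simplifying assumptions: the state is simply the time step, so v[t] is learned directly, and T = 25, a reward at t = 15, \gamma = 1, and \epsilon = 0.2 are arbitrary choices.

    import numpy as np

    T, epsilon, gamma = 25, 0.2, 1.0
    t_reward = 15
    r = np.zeros(T + 1)
    r[t_reward] = 1.0
    v = np.zeros(T + 1)  # v[T] = 0: nothing is expected after the trial

    for trial in range(500):
        for t in range(T):
            delta = r[t + 1] + gamma * v[t + 1] - v[t]  # TD error (eq. 3)
            v[t] += epsilon * delta                     # update (eq. 4)
    # v[t] converges to the expected future reward (1 before the reward
    # time, 0 after it), and over trials the TD error propagates backwards
    # from the reward time toward the start of the trial.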

 


Dayan and Abbott, Figure 9.2. This illustrates TD learning in action.

I have included an image from Dayan and Abbott about how TD learning evolves over consecutive trials, please read their Chapter 9 for full details.

Finally, I should mention that in practice, people often use the TD(\lambda) algorithm. This version introduces a new parameter, \lambda, which controls how far back in time adjustments can be made: \lambda = 0 implies one time step only, while \lambda = 1 implies all past time steps. This allows TD learning to excel even if the full system is not Markovian.
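Continuing the same toy trial as above, here is a sketch of the eligibility-trace version; \lambda = 0.9 and the use of replacing traces are arbitrary choices.

    import numpy as np

    T, epsilon, gamma, lam = 25, 0.2, 1.0, 0.9
    t_reward = 15
    r = np.zeros(T + 1)
    r[t_reward] = 1.0
    v = np.zeros(T + 1)

    for trial in range(500):
        e = np.zeros(T + 1)  # eligibility trace: how far back credit flows
        for t in range(T):
            delta = r[t + 1] + gamma * v[t + 1] - v[t]
            e[t] = 1.0                # the current time step becomes eligible
            v += epsilon * delta * e  # update every eligible time step
            e *= gamma * lam          # decay traces: lam = 0 is one-step TD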

Dopamine and Biology’s TD system

So does biology actually implement TD learning? Animals definitely utilize reinforcement learning and there is strong evidence that temporal difference learning plays an essential role. The leading contender for the reward signal is dopamine. This is a widely used neurotransmitter that evolved in early animals and remains widely conserved. There are a relatively small number of dopamine neurons (in the basal ganglia and VTA in humans) that project widely throughout the brain. These dopamine neurons can produce an intense sensation of pleasure (and in fact the “high” of drugs often comes about either through stimulating dopamine production or preventing its reuptake).

There are two great computational neuroscience papers that highlight the important connection between TD learning and dopamine that analyze two different biological systems:

Both of these papers deserve to be read in detail, but I’ll give a brief summary of the bee foraging paper here. Experiments were done that tracked bees in a controlled environment consisting of “yellow flowers” and “blue flowers” (which were basically just different colored cups). These flowers had the same amount of nectar on average, but were either consistent or highly variable. The bees quickly learned to only target the consistent flowers. These experimental results were very well modeled by assuming the bee was performing TD learning with a relatively small discount factor (driving it to value recent rewards).

TD Learning and Games

Playing games is the perfect test bed for TD learning. A game has a final objective (win), but throughout play it can be difficult to determine your probability of winning. TD learning provides a systematic framework to associate the value of a given game state with the eventual probability of winning. Below I highlight the games that have most significantly showcased the usefulness of reinforcement learning.

Backgammon

Backgammon is a two person game of perfect information (neither player has hidden knowledge) with an element of chance (rolling dice to determine one’s possible moves). Gerald Tesauro’s TD-Gammon was the first program to showcase the value of TD learning, so I will go through it in more detail.

Before getting into specifics, I need to point out that there are actually two (often competing) branches in artificial intelligence: symbolic logic and connectionism.

Symbolic logic tends to be a set of formal rules that a system needs to follow. These rules need to be designed by humans. The connectionist approach uses artificial neural networks and other approaches like TD learning that attempt to mimic biological neural networks. The idea is that humans set up the overall architecture and model of the neural network, but the specific connections between “neurons” is determined by the learning algorithm as it is fed real data examples.

Tesauro actually created two versions of a backgammon program. The first was called Neurogammon. It was trained using supervised learning where it was given expert games as well as games Tesauro played against himself and told to learn to mimic the human moves. Neurogammon was able to play at an intermediate human level.

Tesauro’s next version of a backgammon program was TD-Gammon, since it used the TD learning rule. Instead of trying to mimic the human moves, TD-Gammon used the TD learning rule to assign a score to each move throughout a game. The additional innovation is that TD-Gammon was trained by playing games against itself. This initial version of TD-Gammon soon matched Neurogammon (i.e., intermediate human level). TD-Gammon was eventually able to beat experts by combining a supervised phase on expert games with a reinforcement learning phase.

Despite being able to beat experts, TD-Gammon still had a weakness in the endgame. Since it only looked two moves ahead, it could miss key moves that would have been found by a more thorough analytical approach. This is where symbolic logic excels, and hence TD-Gammon was a great demonstration of the complementary strengths and weaknesses of the symbolic vs. connectionist approaches.

Go

Go is a two person game of perfect information with no element of chance. Despite this perfect knowledge, the game is complex enough that there are around 10^{170} possible games (for reference, there are only about 10^{80} atoms in the observable universe). So despite the perfect information, there are simply too many possible games to determine the optimal move.

Recently AlphaGo made a huge splash by beating one of the world’s top players of Go. Most Go players, and even many artificial intelligence researchers, thought an expert-level Go program was years away. So the win was just as surprising as when Deep Blue beat Kasparov in chess. AlphaGo is a large program with many different parts, but at the heart of it is a reinforcement learning module that utilizes TD learning (see here or here for details).

Poker

The final frontier in gaming is poker, specifically multi-person No-Limit Texas Hold’em. The reason this is the toughest game left is that it is a multi-player game with imperfect information and an element of chance.

Last winter, computer systems won against professionals for the first time in a series of heads-up matches (computer vs. only one human). Further improvements are needed to actually beat the best professionals at a multi-person table, but these results seem encouraging for future successes. The interesting thing to me is that both AI systems seem to have used only a limited amount of reinforcement learning. I think that fully embracing reinforcement and TD learning should be the top priority for these research teams and might provide the necessary leap in ability. And they should hurry, since others might beat them to it!

Research Experience for Undergrads (REU)

This National Science Foundation program is designed to give undergraduates, especially those from smaller schools, a chance to gain real research experience for a summer. Personally, I participated in one official REU and one program modeled on REUs. I learned a lot (and they were tons of fun!). The best part is not the specific topic you research, but the opportunity to learn how to be a researcher.
Most of the applications are due in February. Check out the official NSF REU website for the latest details.
 
When you are ready to apply, go here to search for REU programs in various subjects. Also, search the internet for other research opportunities; Harvard has a nice list of research programs for undergrads. For more detailed tips on applications, I recommend this site.
 
If you want to get an idea of what an REU is like, here are some interviews of past Math REU participants. And also keep in mind these research tips for undergrads if you do get an REU.

QFT Resources

Quantum Field Theory is a notoriously difficult subject to learn, but I found the following resources to be extremely helpful when I took the course a few years ago. I just learned about a few resources that I wish I had then, so here are my current tips for learning QFT. 
 
Books:
Tony Zee’s book QFT in a Nutshell provides a great intuition into what QFT is all about. If you actually want to do calculations, then Peskin and Schroeder’s book is a nice complement. These two books were the heart of my studies into QFT.
 
David Tong’s Notes:
Great set of lecture notes that provides a different perspective.
 
Sidney Coleman’s Lectures:
Apparently, all modern QFT books are based on Coleman (since all the authors learned QFT from him or his students), and you can still see the original videos. For years there was a set of hand-written notes that served as a transcript of the videos, but these were recently LaTeXed and shared on the arXiv.

Deep Learning Seminar Course

This semester Terry Sejnowski is teaching a graduate seminar course that is focused on Deep Learning. The course meets weekly for two hours to discuss papers. Here I’ll just outline the course and in later posts I’ll add some thoughts on each specific week.

Week 1: Perceptrons

Week 2: Hopfield Nets and Boltzmann Machines

Week 3: Backprop

Week 4: Independent Component Analysis (ICA)

Week 5: Convolutional Neural Networks (CNN)

Week 6: Recurrent Neural Networks (RNN)

Week 7: Reinforcement Learning

Week 8: Information and Control Theory

2016 Election Thoughts: Part 2/2

This post has absolutely nothing to do with science and is just some of my thoughts on the recent US Presidential election. I started writing up my thoughts and I realized it was easiest to organize my thoughts by things I would like to say to Anti-Trump vs Trump voters. In reality, both posts are relevant to either side, but it was a convenient way to cleanly separate my points. Since I respect everyone’s right to a private vote, I’m writing these thoughts as open letters to both sides.



Dear Trump Voters Who Love Me,

I cried.

I’m scared and I cried.

I need you to understand that. This fear of Trump has not gotten better since the election. In fact, it took me until Friday November 11th at 8PM PT for the full implications of the election of President Trump to set in. I finally truly understood what this election meant to me.

I need you to know that when I fully understood what this election meant to me, I cried. Uncontrollable sobbing. It hit me while walking down the hallway towards my apartment. I held it together long enough to go inside, sit down in the dark, and sob uncontrollably by myself. I cried because I was scared. I cried because of innocence lost, both my own and my future children’s. I cried because I didn’t do enough to prevent me from crying. I cried for being naive and stupid and taking this long to truly see the world. I cried for not figuring it out in time to communicate my viewpoint with Trump voters. I cried because I was crying. I cried out of despair and frustration because I realized my future children, at a much younger age, would feel a much worse pain. I cried because I had entered the Dark Forest.

I need you to know that I will remember that cry for a long time. I cry rarely enough that I am pretty sure I can name every event since my teenage years. This is something I won’t forget anytime soon.

And I realized, that more than anything else, I need you to understand why I cried. I need you to understand why President Donald J. Trump can never be just another politician to me. I need you to realize that you have unleashed a political weapon on me that scares the shit out of me. I need you to understand why this just became a defining point in my life. I need you to understand that I have entered the Dark Forest and what it means for me.

First, what is the Dark Forest? I am stealing this from a science fiction series, the Three Body Problem. While the book focuses on interactions between alien civilizations, I think it is also a useful analogy for politics today, since both sides seem to be alien to each other. The Dark Forest translated to democracy is this:

Axiom 1: A voter’s goal is to survive
Axiom 2: Resources are finite
Axiom 3: Voters and politicians have limited communication
Axiom 4: Strangers have limited communication

Consequence 1, The Light Forest: The combination of axiom 1 and 2 mean that we are all hunters in a forest, competing for resources. This by itself is a perfectly fine world and democracy. Yes we are competing with each other, but since we have plenty of light, we can stay safe. We don’t need to worry that we will mistake each other for the animals we are hunting.

Consequence 2, Chains of Suspicion: The combination of axioms 3 and 4 leads to Chains of Suspicion. The extreme distance between strangers creates an insurmountable ‘Chain of Suspicion’ where the two strangers cannot communicate fast enough to relieve mistrust, making conflict inevitable.

Consequence 3, The Dark Forest: The Chains of Suspicion cast a dark shadow over the Forest, turning it dark. In the Dark Forest, other hunters become threats. I no longer know if the noise I hear in the dark is an animal or another hunter. I also know that the other hunter has the same problem. I know that this other hunter may shoot me, either by accident, out of fear, or worse, on purpose. Therefore, I can only guarantee my safety if I shoot first and ask questions later.

I need you to realize that politicians’ words matter. Trump and I will never talk in person. I will never be able to truly get to know Trump. That means that when Trump says or tweets authoritarian or racist things, I will never know his true intent. It means that Trump and I have an insurmountable Chain of Suspicion.

Looking back, Trump and I have had this Chain of Suspicion for a long time. This Chain did not directly drive me into the Dark Forest of distrust largely because of you. I love and trust you. I know that we may have political differences, but I am confident we can work them out. But you and I are not the issue. You and I are not strangers.

What drove me into the Dark Forest is that Chains of Suspicion multiply like a virus. In the Dark Forest, Trump’s words matter because they are him broadcasting his potential future actions. Maybe Trump’s threats are just a bluff. Maybe those words won’t lead to actions. But I need you to understand, there are others that scare me to my core and I am afraid that Trump has given them more power. Trump has reinforced their terrible ideas and made them seem slightly more normal.

I need you to know that Trump is not a standard politician to me. Trump successfully won election despite doing two things that I thought individually would be disqualifying in modern society:

  1. disregard for democracy
  2. explicit racism

I need you to understand that when Trump combined those two together, he crossed a line that should never be crossed in a functioning democracy. Trump crossed the safety tape separating democracy and fascism. Trump himself has NOT taken us to fascism. But I am afraid he made fascism seem just a little more mainstream to extremists.

One major reason words speak louder than actions is that there are certain words that can’t be unsaid. Trump proclaimed in a nationally televised debate that he may not accept the outcome of the election if he does not win. I need you to really think about the future consequences of that. You need to understand what those words mean to me and my insurmountable Chain of Suspicion with Trump.

Imagine this scenario that scares the shit out of me and needs to scare you too. Trump in 4 years, as the sitting President (maybe with a Republican House and Senate) says in a presidential debate that he may not accept the outcome of the election if he doesn’t win.

What am I supposed to believe if Trump wins again by a small margin like this year? Should I believe that the election was fair? Or should I worry that Trump used his power as president to ensure his own victory?

If you don’t understand this fear, and why the MERE POSSIBILITY of this fear itself should scare you too, please reconsider. Learn more about history. You need to understand the Dark Forest that I am in now. Talk to me until you understand my fear. A democracy CANNOT survive long if even a small percentage of voters fear the integrity of future votes. I have this fear. This fear leads to a Dark Forest where democracy will struggle.

This fear needs to be extinguished now, because when it combines with my next issue, I am afraid it leads to an even Darker Forest where democracy is guaranteed to die. Trump has created an insurmountable racial Chain of Suspicion with me. Trump has engaged in a variety of terrible racial rhetoric, but there are two things that especially stick with me. The first is Trump’s attack on Judge Curiel, which even Paul Ryan called “the textbook definition of a racist comment.”

I need you to know that since I have a Chain of Suspicion with Trump, I cannot avoid taking that attack personally. Trump attacked Judge Curiel for his Mexican heritage despite the judge being born in the United States. Judge Curiel is clearly not American enough for Trump. It doesn’t matter that Tina has Chinese heritage. I need you to know that I see an attack on one minority as an attack on all. I need you to know that I see it as an attack against Tina and our future kids. Will they be American enough for Trump? I just don’t know.

But I really need you to understand the final realization that made me break down crying and pushed me deep into the Dark Forest. I had managed to forget about Trump’s strange relationship with David Duke (KKK member); see here for details. Trump’s refusal to disavow David Duke in 2016, despite doing so in 2000, scares me. I realized I truly don’t understand Trump.

What drove me to tears was that I realized, even if Trump made an innocent mistake, the damage is done. Trump broadcast a message to David Duke and other racists that can never be unsaid. Trump (unintentionally or intentionally) screamed to them: I can win the presidency despite authoritarian and racist rhetoric. It is not Trump I am scared of. It is the dark hunters he just empowered. I had no illusions that racial extremists did not exist, but now, due to Chains of Suspicion, I am no longer optimistic that their numbers are small.

I need you to realize that this is when I personally entered the Dark Forest. I was walking back from my car to the apartment when I walked past a large group of white men. I unconsciously started doing some math, trying to calculate the odds that they had voted for Trump, and specifically voted for Trump because of his racial rhetoric. Before I could finish the math, I realized I was deciding if I was safe around them and started tearing up. This is when I cried uncontrollably. This is when I realized that I had been naive and living in a false world. I thought I was realistic and understood the darkness that existed in the world. But I was living in a Light Forest that was only a product of many factors including but not limited to me being: male, white, upper middle class, well-educated, etc. I truly saw the Dark Forest.

I cried because I got the tiniest possible sliver of understanding of what it truly means to be a minority and I couldn’t handle the truth. As a minority, they live in the Dark Forest. They have heard and felt the racism. They know that not everyone can be trusted. They know that people can attack them when least expected and they must be suspicious. But I cried because it’s worse: minorities live in the Dark Forest but have a permanent spotlight on them. They are emitting light into this darkness. They don’t blend in. They always stand out in this vast darkness. That means they are always a target for those that hunt minorities.

I cried because I realized that I live in a Dark Forest and that Tina and our future children will always have a spotlight on them. I cried because the tiny glimpse of the darkness scared me. I cried because I realized that my future children will learn the nature of the Dark Forest at an age that is much too young. I cried because I know the Dark Forest my children will live in is worse than the one I am in. I cried because I am scared of hunters like David Duke. I cried because President Trump doesn’t seem to understand that his words empower these hunters. I cried because I was too stupid to put this all into words sooner. I cried because I don’t know how to protect Tina and our future children. I cried because my natural response to that helplessness was to lash out at others in the same way they want to attack Tina. And I had one final burst of tears when I realized the deep irony that David Duke had just made me into an inverse of himself and made me racist against random white people. I laughed, probably like a maniac, because I realized that after that, I am so far lost in the Dark Forest of distrust that I had managed to become the type of hunter that probably scares David Duke the most.

But most of all, I need you to understand that I love you and look forward to working with you to end the Dark Forest of distrust. I am sorry for not communicating better with you. I don’t know why you voted for Trump. Maybe you are already in the Dark Forest of distrust. Maybe you hated Hillary and had an insurmountable Chain of Suspicion with her. Maybe you thought Trump was a standard Republican candidate.

I know you didn’t mean to scare me. But I need you to realize that Trump is not a standard candidate to me. I need you to realize that I can never personally trust Trump based on the words he has said. I need you to realize that I am especially scared of Trump and the people he might either intentionally or accidentally empower.

And I especially need you to realize that what I am actually more scared about is the fact that I am scared. The part of me that remembers the Light Forest thinks the fear is irrational. But the part of me that has seen the Dark Forest of distrust thinks the fear is rational and maybe that I am not scared enough. I see how the Chains of Distrust multiply. If even a few people share my distrust, it must be extinguished now before it grows too strong.

We have to break taboos. We need to talk about politics. We need to establish ground rules for the type of political discourse and political tactics that are allowed in America. We need to talk about race and discrimination. The only way to turn the Dark Forest into the Light Forest is to break Chains of Suspicion by better communication. We can’t wait four years to discuss these issues. We had a deep divide in this country before the election and Trump made the divide wider. We can only heal this distrust if we start soon.

And finally, I want you to know that I have made peace with this election. I want to sincerely thank you for voting for President Trump. I can now see the world clearer than before. My naivety was dangerous to Tina and our future children. I was complacent. I assumed my children would grow up in a Light Forest. I now realize that they cannot. But I will fight to make the Dark Forest just a little bit brighter. I will fight to extend the time that my children think they are only in a Light Forest. And I now realize the true depths of the Dark Forest, and that I can only fight it with your help. I look forward to working with you to bring Light to the Dark Forest.

With all the love in my heart,
Alex

PS. This is not the world’s weirdest baby announcement. These children I discuss are still in the future. But I still cried for the hypothetical children.

PPS. Dave Chappelle and SNL are very wise. I admit thinking I was more realistic about the US than the people in the skit, but I was just in a slightly different bubble than they were.

2016 Election Thoughts: Part 1/2

This post has absolutely nothing to do with science and is just some of my thoughts on the recent US Presidential election. I started writing up my thoughts and I realized it was easiest to organize my thoughts by things I would like to say to Anti-Trump vs Trump voters. In reality, both posts are relevant to either side, but it was a convenient way to cleanly separate my points. Since I respect everyone’s right to a private vote, I’m writing these thoughts as open letters to both sides.



Dear Anti-Trump Voters Who Love Me,

We fucked up.

Don’t get me wrong, I voted against Trump and you voted against Trump, but that doesn’t mean I don’t still have issues with both you and myself. We didn’t do enough. You can read my letter to Trump Voters to realize the pain I felt.

I have several central ideas and several additional points later.

1. Don’t Disrespect Democracy

We lost and we lost fair and square. I am 100% in support of electoral college reform for 2020 and beyond. I am 0% in any attempt to change it in 2016. Don’t sow seeds of doubt. Accept the results and move forward.

2. Think Long and Hard about WHY People Supported Trump

Spend a lot of time thinking about the chart in this article. The automation and elimination of jobs is real and will only accelerate. The pain and despair are real. Trump addressed the anger and angst felt by people in these counties. These issues are not going away. I don’t claim to have an answer, but if you want to win over the hearts of Trump supporters, this is a great starting point. Also, despite being on a comedy website, this article also makes many serious points. It’s time to win over Trump supporters, not demonize them.

3. Words Matter: Stop Crying Wolf

A recent conversation with a wise office mate of mine involved us reminiscing about the good old days when, with Mitt Romney, we only had to worry about binders full of women and terrible renditions of Who Let the Dogs Out. Those were not real issues, but we cried wolf. Well, the real wolf just got elected, and we blew all our credibility too soon.

Trump must be opposed. But it’s time to reserve the harsh words for him and others who are truly racist, sexist, etc. Don’t use the same rhetoric on other Republicans. The false equivalence will continue to cause a credibility gap in the future.

4. Governance Reform Starts Now and MUST Continue When Democrats Win

The political system is broken and we were part of the problem. It doesn’t matter who did it first, last, or most. Both sides have abused weird technicalities in our process of government and that must stop.

I have ideas for more sweeping reforms, but for now, I will just focus on a few of the major problems I see.

A. President: Limit Executive Power
Executive power is like heroin. It might feel great while you are high and in charge, but it sucks the rest of the time. We let Obama do too much. The withdrawal is going to suck majorly.

B. House of Representatives: Gerrymandering
The Republicans are going to win just over 52% of the two party vote but around 55% of the seats. Not all of that is due to gerrymandering, but at least part of it is. Check out the Texas districts. Both Democrats and Republicans should learn about California’s new redistricting commission. I can attest that the districts seem more reasonable and that the “jungle” primary is quite fun.

C. Senate: Filibuster

Let’s all agree to just end the filibuster now. Just because the Republicans successfully used the filibuster to block a Supreme Court nominee for nearly a year does not mean that Democrats should turn around and do the same. It is time to end the filibuster and just let the majority of the Senate govern. This will really hurt in the short term. But it will be much better in the long term.

D. Electoral College Reform

Again, I am 100% in support of electoral college reform for 2020 and beyond. I am 0% in any attempt to change it in 2016.

Any argument in favor of the electoral college has to explain this fact for me: Hillary Clinton will probably win the popular vote by about 1% and lose the electoral college by 6.5%. That huge discrepancy goes against every principle of one person, one vote. Look back at past elections; this popular vote result is way out of sync with the electoral college.

Love,
Alex

PS Points:

1. #TrumpIsOurPresident

While I understand the spirit of #NotOurPresident is that you disagree with Trump, no one gets to pretend that Trump isn’t truly our president. We are all responsible for Trump. I know I personally didn’t do enough to oppose him, since I honestly didn’t truly think he would win. But Trump did win and this is on everyone now.

2. Please Protest Peacefully
I am 1000% behind everyone’s right to protest. Just please don’t turn violent, that will only play into Trump’s hand and give his paranoid rants more legitimacy.

3. Stop Crashing Canada’s Immigration Website
Back to PS point 1, Trump is our president. Deal with it here. You don’t get to flee.

4. Stop Imagining Alternative Pasts
What if Bernie Sanders was the nominee? What if the third party vote was different? Etc, etc, etc. The election is done. Now don’t get me wrong, it is worth learning from mistakes. But learn from the past to make the future you dream of a reality, instead of only dreaming about the past.

5. California Doesn’t Get to Secede
Just stop, it’s stupid.