Best Machine Learning Resources

Machine learning is a rapidly evolving field that is generating intense interest from a wide audience. So how can you get started?

For now, I’m going to assume that you already have basic programming skills (i.e., a general introduction to programming and experience with matrices) and mathematical skills (calculus, plus some probability and linear algebra).

These are the best current books on machine learning:

These are some out-of-date books that still contain useful sections (for example, Murphy several times refers you to Bishop or MacKay for more details).

Here is a list of other potential resources:

 


I3: International Institute for Intelligence

While I was previously discussing my opinion of OpenAI, I mentioned that I would do something different if I were in charge. Here is my dream.

 

What OpenAI is Missing

Helping everyday people throughout the whole world.

OpenAI’s stated goal is:

OpenAI is a non-profit artificial intelligence research company. Our goal is to advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return.

In the short term, we’re building on recent advances in AI research and working towards the next set of breakthroughs.

However, based on their actions so far, this interview with Ilya Sutskever, and popular press articles, the main focus of OpenAI appears to be advancing artificial intelligence research with an emphasis on open source, as well as thinking long-term about the impacts of letting advanced artificial intelligence systems control large aspects of our lives. While I strongly support these goals, in reality they will not benefit all of humanity. Instead, they only benefit those with either the necessary training (a minimum of a bachelor’s degree, but usually a master’s or PhD) or money (to hire top people, buy the required computing resources, etc.) to take advantage of the advanced research. This leaves out the developing world as well as the poor in developed countries; i.e., contrary to their stated goal, OpenAI is missing the vast majority of humanity.

One can argue that by making OpenAI’s research open source, it will eventually trickle down and help a wider swath of humanity. However, the current trend suggests that large corporations are best poised to benefit the most from the next revolution (I mean, who is more likely to invent a self-driving car: Google, or someone in a developing country?). Additionally, these innovations focus on first-world problems (since those are the highest-paying customers). And finally, each round of innovation ends up creating fewer and fewer jobs (so the number of unemployed in developed countries may expand). I firmly believe that unless there is a global educational effort (and probably an implementation of basic income), the benefits of AI will be directed towards a tiny sliver of the world’s population.

 

My Proposal: I3

Here I lay out my proposal for a new institute that would actually expand the benefits of recent and future advances in machine learning / artificial intelligence to a wider swath of humanity. I don’t claim that it would truly benefit all of humanity (again, see basic income), but it is a way for research advances to reach a larger proportion of it.

I propose a new education and research institute focused on artificial intelligence, machine learning, and computational neuroscience which I’ll call the International Institute for Intelligence. I like alliterations, and since I think it should focus on three types of intelligence, I especially like the idea of calling it I3 or I-Cubed for short.

Why these three research areas? Well, machine learning is currently revolutionizing how companies use data and is facilitating new technological advances every day. Designing artificial intelligence systems on top of these machine learning algorithms seems like a realistic possibility in the near future. The less conventional choice is computational neuroscience. I think it is important to include for two reasons. First, the brain is the best example we have of an intelligent system, so until we actually design an artificial intelligence, it seems best to understand and mimic the best example (this is the philosophy of DeepMind, according to Demis Hassabis). Second, the US BRAIN Initiative and similar international efforts are injecting significant resources into neuroscience, with the hopes of sparking a revolution similar in spirit and magnitude to the widespread effect the Human Genome Project had on biotechnology and genomics. So I figure we might as well prepare everyone for this future.

So what would be the actual purpose of I3? Sticking with the theme of threes, I propose three initiatives that I will list in my order of importance as well as some bonus points.

 

1. International PhD Education

The central goal is a program similar to ICTP (the International Centre for Theoretical Physics) but with a different research emphasis. So what is ICTP? It was founded by Nobel laureate Abdus Salam, and it runs several programs to promote research in developing countries, including:

  • Predoctoral program – students get a one-year course to prepare them for PhD programs
  • Visiting PhD program – students in a developing-nation PhD program spend a couple of months each year, for three years, at ICTP to participate in its research
  • Conferences
  • Regional offices (currently São Paulo, Brazil, with more in the planning stages)

So the idea is to implement a similar program but with the research emphasis now focused on machine learning, artificial intelligence, and computational neuroscience. While I think the main thing is to get the predoctoral and visiting PhD programs started, eventually it would be great to have five regional offices spread throughout the developing world. For example, I think one is needed in South America (Lima, Peru?), one in Africa (Nairobi, Kenya?), and two in Asia (India and China, but not in a traditional technological center). And assuming I3 is based in the US (see my case for San Diego below), it would be great to have an affiliate office in Europe, maybe in Trieste next to ICTP.

One additional initiative that I think could be useful would be paying people not to leave their countries, instead helping them establish research centers at their local universities. This could also wait until later, because it might be easiest to convince some future alumni of the predoctoral or visiting PhD programs to return to (or stay in) their home countries.

A second additional initiative would be to encourage professors from developed and developing countries to take their sabbatical at I3. This would provide a fresh stream of mentors and set up potential future collaborations. This is a blend of two programs at KITP (this and that).

 

2. US Primary School Education

The science pipeline analogy is overused, but I don’t have a better one yet. Currently, the researchers in I3’s focus areas are predominantly male, white or Asian, and middle to upper class: not a very representative sample of the US (or world) population. Therefore, the best long-term solution is to get a more diverse set of students interested in the research at a young age.

Technically this should have higher priority than the next initiative (US College Education), but since there are other non-profits interested in this space (for example, CodeNow), maybe I3 does not need to be a leader here and can instead play a supporting role.

 

3. US College Education

Returning to the science pipeline analogy: if we are to have a more diverse set of researchers, we need to encourage a diverse set of undergrads to pursue relevant majors and continue on into graduate programs. This won’t be solved by any single program, but here are some potential ideas.

  • US underrepresented students could apply for the same 1 year program that is offered to international students.
  • Assist universities in establishing bridge programs that partner research universities with colleges that have significant minority populations. A great example of this is the Vanderbilt-Fisk Physics program.
  • US colleges would also benefit from the proposed sabbatical program offered to international researchers. I also like the KITP idea of extending it to undergraduate-only institutions (especially those with large minority populations) as a way to get more undergrads interested in research.
  • Establish a complete, free college curriculum for machine learning, artificial intelligence, and computational neuroscience. While there are many useful MOOCs on these topics, I still don’t think they beat an actual course.

 

Bonus #1 : Research

ICTP has proven that it is possible to further global educational goals and still succeed at research. I would argue that the people working at I3 should be evaluated for tenure mainly on their mentorship and teaching of students. Research will of course play a role (otherwise it would be poor mentorship of future researchers), but I think there shouldn’t be huge pressure to bring in grants, high-profile publications, etc. Even without that emphasis, there is no way that a group of smart people with motivated students will fail to produce great research.

 

Bonus #2: International Primary and College Education

This is longer term, but if there are successful programs improving US primary and college education, international regional offices, and PhD alumni back in their home countries, it should be possible to leverage those connections into a global initiative to improve primary and college education.

 

Final Thoughts

So Elon Musk, Peter Thiel, and friends, if you have another billion you want to donate (or OpenAI funds to redirect), here is my proposal. In reality, implementing all of my ideas would probably cost several billion, but once the center is founded, I think it would be easy to get tech companies, the US government, and even UNESCO to help provide funding.

My final point is that I think San Diego would be a perfect location. I know I’m biased since I live here now, but there are many legitimate reasons San Diego is great for this institute.

  1. UCSD already partners with outside research institutes (Salk, Scripps, etc)
  2. UCSD (and Salk, etc) are leaders in all of these research areas
  3. It is extremely easy to convince people to take a sabbatical in San Diego

While there are many other great potential locations, I strongly suggest that I3 not be located in the Bay Area, Seattle, Boston, or New York City. These cities already have plenty of tech jobs; please spread the wealth to other parts of the US.

Anyways, I’ll keep dreaming that someday I’ll get to work at a place like the one I just described.

 

Deep Learning in Python

So maybe after reading some of my past posts, you are fired up to start programming a deep neural network in Python. How should you get started?

If you want to run anything but the simplest neural networks on easy problems, you will find that pure Python, as an interpreted language, is too slow. Does that mean we have to give up and write our own C++ code? Luckily, GPUs and other programmers come to your rescue, offering between a 5X and 100X speedup (I would estimate my average speedup at 10X, but it varies by task).

There are two main Python packages, Theano and TensorFlow, that are designed to let you write Python code that can either run on a CPU or a GPU. In essence, they are each their own mini-language with the following changes from standard Python:

  • Tensors (generalizations of matrices) are the primary variable type and treated as abstract mathematical objects (don’t need to specify actual values immediately).
  • Computational graphs are utilized to organize operations on the tensors.
  • When one wants to actually evaluate the graph on some data, the data is stored in a shared variable that is sent to the GPU when possible. The graph then processes this data in place of the original tensor placeholders.
  • Automatic differentiation (i.e., derivatives are handled symbolically).
  • Built in numerical optimizations.

So to get started you will want to install either Theano (pip install theano), TensorFlow (details here), or both. I personally have only used Theano, but if Google keeps up the developmental progress of TensorFlow, I may end up switching to it.

At the end of the day, this means that if you want to actually implement neural networks in Theano or TensorFlow, you will essentially be learning another language. However, people have built various libraries as abstractions on top of these mini-languages. Lasagne is one example: it organizes Theano code so that you interact with Theano less, but you will still need to understand Theano. I initially started with Theano and Lasagne, but I am now a convert to Keras.

Instead, I advocate for Keras (pip install keras) for two major reasons:

  1. High-level abstraction. You can write standard Python code and get a deep neural network up and running very quickly.
  2. Back-end agnostic. Keras can run on either Theano or TensorFlow.

So it seems like a slam dunk, right? Unfortunately, life is never that simple; there are two catches:

  1. Mediocre documentation (using NumPy as a gold standard, or even compared to Lasagne). You can get the standard things up and running based on their docs. But if you want to do anything advanced, you will find yourself looking into the source code on GitHub, which has some hidden, but useful, comments.
  2. Back-end agnostic. This means that if you want to introduce a modification to the back-end, and you want it to always work in Keras, you need to implement it in both Theano and TensorFlow. In practice this isn’t too bad, since Keras has done a good job of implementing the low-level operations.

Fortunately, the pros definitely outweigh the cons for Keras and I highly endorse it. Here are a few tips I have learned from my experience with Keras:

  • Become familiar with the Keras documentation.
  • I recommend only using the functional API which allows you to implement more complicated networks. The sequential API allows you to write simple models in fewer lines of code, but you lose flexibility (for example, you can’t access intermediate layers) and the code won’t generalize to complex models. So just embrace the functional API.
  • Explore the examples (here and here).
  • Check out the Keras GitHub.
  • Names for layers are optional keywords, but definitely use them! It will significantly help you when you are debugging.
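To tie a few of these tips together, here is a minimal functional-API model with named layers (the architecture and sizes are an arbitrary toy example, not a recommendation):

```python
# Minimal Keras functional-API sketch; all layer sizes are illustrative.
from keras.layers import Input, Dense
from keras.models import Model

# Name every layer: it significantly helps when debugging.
inputs = Input(shape=(784,), name='pixels')
hidden = Dense(64, activation='relu', name='hidden')(inputs)
outputs = Dense(10, activation='softmax', name='digit_probs')(hidden)

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy')

# The functional API lets you grab intermediate layers by name,
# something the sequential API makes awkward.
features = Model(inputs=inputs, outputs=model.get_layer('hidden').output)
```

The same network in the sequential API would be a couple of lines shorter, but you would lose direct access to the `hidden` layer, which is exactly the flexibility argued for above.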

Now start coding your own deep neural networks!

Basic Income

Recently the Swiss voted no on their referendum to implement basic income. Personally, I think that we should strongly consider implementing a basic income in the United States. At the minimum, I think that we deserve a national conversation on poverty that should include a serious discussion of the pros and cons of basic income. Therefore, I got really pissed off by this recent piece in the New York Times (or this, etc).

The author dismisses basic income out of hand for two major reasons:

  1. Cost (proposal of $10,000 to everyone over 21 for a total of $3 trillion)
  2. Negative effect on the poor (through government cuts due to the cost of implementing the above basic income)

I’ll go through the details below, but some rudimentary math shows that basic income could be paid for in the United States by tax increases that would not be a burden on the poor (or even most of the middle class). Thus, no government programs would need to be cut.

The purpose of this exercise is not to propose a foolproof implementation of basic income. Instead, I want to show that dismissing basic income due to cost is incorrect. If you want to debate basic income, the real issue is how our employment-centered economy would be changed by altering people’s motivation to work.

Here is the conclusion of the calculations; please read on for all the details. I estimate that $11,500 (i.e., the US poverty line) could be paid to every adult not receiving Social Security, and $5,750 (half the adult payment) to every child, by adding a new flat tax of 26.6% on income (adjusted gross income) (see the appendix for alternative proposals that include a lower flat tax). This means that any individual who makes less than $49,500 would get MORE money from the government under this simple plan. Therefore, around 70% of US adults not on Social Security would get more money from the government.

 

Simple Calculation

Estimate Cost

I will start by calculating the cost of basic income. First, how many people do we need to cover? I am going to ignore the 65 million people on Social Security. My reasoning is that Social Security is almost a basic income (or could be, with a few reforms) and that it is financially secure if we eliminate the cap ($118,500) on the payroll tax but do not increase benefits (see here and here for details).

Looking at the US census facts, there are 74 million children under 18, leaving 183 million US adults not on Social Security. I propose paying a half benefit for children, so adult benefits will be paid to an effective population of 220 million.

Therefore, if each individual gets approximately the US poverty line ($11,500), this results in a total cost of $2.53 trillion.

Estimate Flat Tax

So how could we pay for this? The simplest possible mechanism would be a new flat tax on personal income.

The total US personal income in 2014 was $14.7 trillion. However, not all of that is taxable income (standard deductions, mortgage interest deduction, etc), so the actual taxable personal income is the adjusted gross income (AGI). Using some old numbers on AGI, I estimate that the total US AGI was $9.5 trillion in 2014.

Since the cost is $2.53 trillion, and US AGI is $9.5 trillion, that gives a flat tax rate of 26.6%.

Estimate Break Even Point

For a single individual, the standard deduction is $6,300 (this amount of income is not taxed). It would take a taxable income of $43,180 for the flat tax burden to equal the new basic income. Combine that with the standard deduction, round a bit, and the conclusion is that anyone making under $49,500 would gain money from the basic income/flat tax proposal.
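The arithmetic above is simple enough to check in a few lines of Python (the figures are the same rounded estimates used in this post):

```python
# Reproduce the back-of-the-envelope basic income estimates.
poverty_line = 11_500   # annual basic income per adult ($)
adults = 183e6          # US adults not on Social Security
children = 74e6         # children receive a half benefit
agi = 9.5e12            # total US adjusted gross income ($)
deduction = 6_300       # standard deduction, single filer ($)

effective_pop = adults + children / 2            # 220 million
cost = effective_pop * poverty_line              # ~$2.53 trillion
flat_tax = cost / agi                            # ~26.6%
break_even = poverty_line / flat_tax + deduction # just under $49,500

print(f"cost: ${cost / 1e12:.2f} trillion")
print(f"flat tax: {flat_tax:.1%}")
print(f"break-even income: ${break_even:,.0f}")
```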

 

Conclusion

Please don’t dismiss basic income purely out of cost. As the estimates above show, one could introduce basic income and pay for it with a new tax in a manner that preserves all other government programs.

I think there are two major reasons to embrace basic income:

  1. Fairness
  2. Needed security due to potential changes in employment

Maybe the fairness argument doesn’t fit with everyone’s political leanings, but I think the future of employment is strong motivation. Each new technological revolution seems to require fewer and fewer workers (compare Ford’s workforce vs Google’s). Since I don’t see that trend reversing and machine learning / artificial intelligence should actually accelerate it, I think we need to be proactive and provide a floor for people before we have large unemployment.

However, I recognize that basic income is a very controversial idea. That is why I am interested in seeing experimental implementations of it. While this experiment is nice, it really is too small to truly learn anything from. Instead, I would love to see a national trial. Why not start at a very small amount and slowly increase it over time? That would allow us to adapt to the changing culture (i.e., potentially not work-centric) and make sure that there are no adverse incentives. For example, my dumb proposal of a half benefit for children would need to be carefully monitored to ensure that people do not have children just for the sake of collecting the benefit.

No matter what you think, the debate isn’t going away. So we might as well start examining it now.

 


Appendix: Other Possible Tax Plans

Here I outline my own personal preferences for tax reforms (in addition to a flat tax) that could be used to pay for basic income. Note that all dollar amounts are per year.

Tax Reforms Within Current System

All numbers listed below are the estimated cost per year of the various deductions.

These are some tax reforms that many economists support:

I think this program is superseded by the introduction of a basic income:

And here are some additional reforms I support:

  • Tax capital gains and dividends as regular income ($85 billion)
  • Limit deductions for the wealthy ($25 billion)
  • Variety of corporate tax reforms ($40 billion) (I don’t understand depreciation, so I only include the other items on the list)

This comes to a total reform of $335 billion.

Note: I could have included food stamps or unemployment benefits in the superseded cost savings, but I’m going to assume that the benefits get reformed, but the money still is diverted towards health and employment initiatives respectively.

VAT

I would propose adding a federal value added tax as is common in Europe (see here for pros/cons). Bloomberg estimates that a VAT of 10% on a broad base of items would raise $750 billion per year. For ease of collection, this should be accompanied by a local/state replacement of the standard sales tax (which generates around $500 billion per year, or effectively a 6.66% VAT). I propose a 15% VAT (similar to European rates) that is split evenly between local/state governments and the federal government. This would generate $560 billion additional federal revenue (as a bonus the states get an additional $60 billion).

Financial Transaction Tax

This would impose a small fee on all financial transactions. If we implemented a 0.05% transaction fee (ie 50 cents on every $1000), this would raise an additional $90 billion.

Estate / Transfer Taxes

Currently $1.2 trillion is inherited per year, but estate taxes only bring in $8 billion in revenue. I suggest an estate / transfer tax reform to collect more revenue from this. I would structure it in a progressive manner (ie increasing with wealth), but I again just want to estimate the necessary average rate. If an average rate of 25% was applied to estates, this would lead to an additional $290 billion.

Personal Income Flat Tax

In the end, we still need $1.255 trillion in new revenue. And since the US AGI is $9.5 trillion (it would be slightly higher after the above reforms, but I will ignore that), that implies that we would need to implement a 13.2% flat tax to raise $1.255 trillion.
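The tally behind that remaining $1.255 trillion can be checked in a few lines, using the rounded figures from this appendix:

```python
# Tally the alternative revenue sources from this appendix (billions/year).
revenue = {
    'tax reforms':                  335,
    'VAT (federal share of 15%)':   560,
    'financial transaction tax':     90,
    'estate / transfer tax':        290,
}
cost = 2530                               # total basic income cost (billions)
remaining = cost - sum(revenue.values())  # 1255, i.e., $1.255 trillion
agi = 9500                                # US adjusted gross income (billions)
flat_tax = remaining / agi                # ~13.2%

print(f"remaining: ${remaining} billion, flat tax: {flat_tax:.1%}")
```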

Summary

Computing a similar flat-tax break-even point, people making below $93,400 would receive more in basic income than they pay in the flat tax. However, this is misleading, since I have no easy way to estimate the increased taxes due to the VAT. A worst-case scenario is that people spend their complete income every year on VAT-taxable items. Since there is currently an effective VAT of 6.66%, this is a VAT increase of 8.33%. So the break-even point for the combined flat tax / VAT rate (21.55%) would be $59,600. All the other taxes are much more complicated, so I have no easy estimate for them.

The main point of this detailed appendix is that one could replace the flat income tax with a diverse set of taxes that again would not be an unfair burden on the poor or middle class. Additionally, the tax base would be diversified and less prone to swings in the economy.

General Programming Tips

I thought I would put together some useful programming tips that I have learned over the years. Most of these are general tips, but they are tailored towards Python.

  1. Zen of Python. Even if you don’t use Python, these are good ideas to internalize.
  2. The language documentation (Python’s standard library), StackOverflow, and Google searches are your best friends.
  3. Utilize modern IDEs (like Spyder for Python) and tab-completion to reduce the number of basic errors.
  4. Comments are not optional. The general logic of functions and objects should be understandable from the comments. Every block of code logic should have a short comment to aid future changes. If you find a chunk of code confusing now, it will be just as confusing if not worse in the future!
  5. Use sensible variable names. This cuts down on the number/length of comments.
  6. Try to adhere to the language standards (Python’s), but don’t obsess over it.
  7. Set your own consistent standards (Do variable names end in s or not? Do boolean variables have similar style names? Etc).
  8. When starting a project, do your best to quickly get up to a basic working prototype. Working but incomplete code is always better than non-working code. Quick coding is aided by the next point…
  9. Outline your code before starting. My tips for outlining in Python are detailed after this list.
  10. Write modular code. Common tasks should be made into functions or objects.
  11. Avoid magic numbers and hard coded values. Better to include a set of named parameters in one section of your code where the basic logic of these variables is explained.
  12. Avoid multiple inheritance (check out this fun explanation of why this is bad).
  13. A program should have a standard interface, which I like to call main, and a way to run the standard interface with some default values. In Python, utilize if __name__ == '__main__': to define standard parameters and then call main(parameters). This aids the goal of always having working code, as well as making it easier for different programs to interact.
  14. Check out these Python tricks (1-23 are the best, rest are more advanced).
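As an illustration of tips 11 and 13, a minimal skeleton might look like this (all names and parameter values are made up for the example):

```python
# Named parameters live in one place (no magic numbers buried in the logic).
DEFAULT_PARAMETERS = {
    'n_iterations': 100,  # how many passes the main loop makes
    'tolerance': 1e-6,    # convergence threshold
}

def main(parameters):
    """Standard interface: every run of this program goes through here."""
    results = []
    for _ in range(parameters['n_iterations']):
        results.append(parameters['tolerance'])  # placeholder logic
    return results

if __name__ == '__main__':
    # Running the file directly uses sensible defaults; other programs
    # can instead import main() and pass their own parameters.
    main(DEFAULT_PARAMETERS)
```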

 

Here are details on how I outline code in Python. I try my best to have running code at all times, even if it does absolutely nothing. If it isn’t real code, I leave it as a comment. Therefore, my programming tends to proceed as follows.

  1. Outline the general logic of the code in comments. Define needed functions, but at first have it take no actual variables (utilize pass to keep it as functioning Python code). In the comments inside a function, list the data type you think it should take in, what it should do, and what it should return.
  2. If you start to code a function or series of logic, you can safely leave it incomplete by having it raise NotImplementedError.
  3. Use assert to check any of your assumptions. A custom assert statement will save you lots of time later.
  4. While the Pythonic way is to utilize duck typing, I still prefer to do some type checking if there is potential for confusion. So I like to utilize things like isinstance or implement checks on attributes.
  5. Take advantage of your IDE’s additional formatting options. For example, Spyder specially highlights comments that start with TODO: with a little checkmark. Additionally, it supports code blocks and defines them by #%%. This lets you quickly run small chunks of a larger code.
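Putting those outlining steps together, an early-stage file of mine might look something like this (the function names and file name are hypothetical):

```python
# Outline stage: the file runs even though it does almost nothing yet.

def load_data(filename):
    # Takes: path to a CSV file. Returns: list of (x, y) float pairs.
    # TODO: implement the real parser.
    pass

def fit_model(data, n_steps=100):
    # Takes: list of (x, y) pairs. Does: fits a line. Returns: (slope, intercept).
    raise NotImplementedError  # fails loudly if called, instead of silently

def evaluate(model, data):
    # Some type checking despite duck typing, since confusion is likely here.
    assert isinstance(data, list), 'expected a list of (x, y) pairs'
    raise NotImplementedError

#%% quick experiment chunk (Spyder can run this block on its own)
data = load_data('measurements.csv')  # made-up file name; returns None for now
# model = fit_model(data)             # not real code yet, so left as a comment
```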

What are the major advantages of coding like this?

  1. If your code always runs, it allows you to quickly find syntax errors and typos.
  2. You avoid implementing unused code. It sucks to work hard on a code section only to realize later that you didn’t actually need it.
  3. You spend your time on the standard case and can add certain options or take care of edge cases when the appropriate time arises. Because sometimes that time will never arise…

Anyways, enough advice, start coding!


Thoughts on OpenAI

OpenAI was started just over six months ago, and I feel like they have done enough to warrant a review of what they have done so far and my thoughts on what they should do next.

What is OpenAI?

OpenAI was announced in December 2015 and their stated mission is:

OpenAI is a non-profit artificial intelligence research company. Our goal is to advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return.

In the short term, we’re building on recent advances in AI research and working towards the next set of breakthroughs.

 

What have they done so far?

  1. Started a new, small (so far) research center
  2. Experimented with a novel organization of the research center
  3. Hired a variety of smart people
  4. Released a toolkit for reinforcement learning (RL)

Since it has only been six months and they are still getting set up, it is difficult to assess how well they have done. But here are my first impressions of the above points.

  1. Always great to have more places hiring researchers!
  2. Way too early to assess. I’m always intrigued by experiments with new ways to organize research, since there are three dominant types of organizations today (academia, industry focused on development, and industry focused on long-term research).
  3. Bodes well for their future success.
  4. I have yet to use it, but it looks awesome. Supervised learning was sped along by datasets such as UC Irvine’s Machine Learning Repository, MNIST, and ImageNet, and I think their toolkit could have a similar impact on RL.

 

What do I think they should do?

This blog post was motivated by me having a large list of things that I think OpenAI should be doing. After I started writing, I realized that many of the things on my wish list would probably be better run by a new research institute, which I will detail in a future post. So here, I focus on my research wish-list for OpenAI.

Keep the Data Flowing

As Neil Lawrence pointed out shortly after OpenAI’s launch, data is king. So I am very happy with OpenAI’s RL toolkit. I hope that they keep adding new datasets or environments that machine learners can use. Some future ideas include supporting new competitions (maybe in partnership with Kaggle?), partnering with organizations to open up their data, and introducing datasets for unsupervised learning.

Unsupervised Learning

But maybe I’m putting the cart (data) before the horse (algorithms and understanding). Unsupervised learning is tough for a series of interconnected issues:

  • What are good test cases / datasets for unsupervised learning?
  • How does one assess learning success?
  • Are our current algorithms even close to the “best”?

Supervised learning is easier because the data comes with labels, there are lots of established metrics for evaluating success (for example, accuracy of label predictions), and for most metrics we know what the best possible result is (100% correct label predictions). Reinforcement learning has some of that (data and a score), but it is much less well defined than supervised learning.

So while I think the progress on reinforcement learning will definitely lead to new ideas for unsupervised learning, more work needs to be done directly on unsupervised learning. And since they have no profit motives or tenure pressure, I really hope OpenAI focuses on this extremely tough area.

Support Deep Learning Libraries

We currently have a very good problem: lots of deep learning libraries, to the point of almost being too many. A few years ago, everyone had to essentially code their own library, but now one can choose from Theano and TensorFlow for low-level libraries, to Lasagne and Keras for high-level libraries, just to name a few examples from Python.

I think that OpenAI could play a useful role in the standardization and testing of libraries. While there are tons of great existing libraries, their documentation quality varies significantly and is in general subpar (compared to NumPy, for example). Additionally, besides choosing a language (I strongly advocate Python), one usually needs to choose a back-end library (Theano vs TensorFlow) and then a high-level library.

So my specific proposal for OpenAI is the following initiatives:

  1. Help establish some deep learning standards so people can verify the accuracy of a library and assess its quality and speed
  2. Set up some meetings between Theano, TensorFlow, and others to help standardize the backend (and include them in the settings of standards)
  3. Support initiatives for developers to improve documentation of their libraries
  4. Support projects that are agnostic to the backend (like Keras) and/or help other packages that are backend specific (like Lasagne) become backend agnostic

As a recent learner of deep learning, and someone who interacts extensively with non-machine-learning researchers, I think the above initiatives would allow a wider population of researchers to incorporate deep learning into their research.

Support Machine Learning Education

I believe this is the crucial area that OpenAI is missing, and the gap will prevent them from achieving their stated mission of helping all of humanity.

Check out a future post for my proposed solution…

Basic Bash

Basically these are the only things I know about Bash :). Not all of these are strictly Bash commands; rather, they are common commands that everyone should know when using the Linux or Mac command line.

 

Notes

First, the command follows the $ and is listed in bold, the == is just a spacer, and everything else is a description of the command. The <> symbols designate where some other name should go (a file, folder, username, etc.). The * symbol is a wildcard and can be used in conjunction with a partial search term: *.txt matches file.txt, file1.txt, etc.

 

Very Basics Commands

$ Ctrl + C  == Kill whatever is running in the foreground

$ tab == complete current typing

$ <command> --help == lists options of a given command

$ Ctrl + A == Go to the beginning of the line you are currently typing on

$ Ctrl + E == Go to the end of the line you are currently typing on

$ Ctrl + U == Clears the line before the cursor position. If you are at the end of the line, clears the entire line.

 

Where Am I and How Do I Move Elsewhere?

$ pwd == print working directory

$ cd <folder> == change directory (cd alone goes to your home directory)

$ cd / == go to root

$ cd .. == go up one level

$ ls == list files and folders

$ ls -a == list all files and folders (including hidden)

 

Basic File/Folder Manipulation

$ mkdir <folder name> == create new folder

$ cp <old file name>  <new file name> == copy and rename a file

$ mv <file> <folder> == move file to a folder

$ rm <file> == delete file

$ rm -r <folder> == delete folder

$ cat <file> == show content of file

$ head -n <number of lines> <file> == show top n lines of file

$ tail -n <number of lines> <file> == show last n lines of file

 

Nano

This is a basic file editor

$ nano <existing file> == opens up a file

$ nano <non-existing file> == creates and opens up a file

Within nano, the needed commands are listed at the bottom where the ^ symbol stands for Ctrl.

 

What is currently running? Can I stop it?

$ top == list all processes running on a computer

$ top -u <username> == lists processes being run by username

$ kill <pid> == kill a process identified by pid, which can be found by using top

$ killall -u <username> == kills all processes being run by username

 

How to run stuff in the background

$ <command> & == runs command in background

If you want to run jobs on a remote server, there will be some queueing system. Check out useful commands like qsub, qstat, etc. However, if you just want to run multiple processes on a single computer (even after logging off), Screen and Tmux are the tools you need. I personally use screen, and below are some useful commands.

 

Screen

$ screen -S <name> == new screen with name

$ Ctrl-a d == detach from screen

$ screen -r <name> == reattach to screen

$ screen -ls == list of screens

$ exit == kills currently attached screen

Example Running Code in the Background

Here is an example set of commands that will run a Python script in the background.

$ screen -S test

$ python test.py > out.txt 2>&1 &

$ Ctrl-a d

You now can safely exit and the computer will do all the tough work for you!

I did add one new command in there: the redirect (>). Anything test.py would normally print to the terminal gets redirected and saved into the file out.txt, and the 2>&1 sends error messages there as well. Without the redirect, when you reattach to the screen named test, you would only see the recent terminal output (there is usually some display length limit).

Deep Learning: 0-60 in a few hours?

Here, I will try to outline the fastest possible path to go from zero understanding of deep learning to an understanding of the basic ideas. In a follow up post, I’ll outline some deep learning packages where you could actually implement these ideas.

I think by far the best introduction to deep learning is Michael Nielsen’s ebook. Before you get started with it, I think the minimum required mathematics includes an understanding of the following:

  • Vector and Matrix multiplication – especially when written in summation notation
  • Exponents and Logarithms
  • Derivatives and Partial Derivatives
  • Probability, mainly Bayes Theorem (not actually needed for Michael Nielsen’s book, but it is essential for later topics)
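If you want a quick self-test on the Bayes Theorem point, here is a classic worked example (the disease and test numbers below are made up for illustration):

```python
# Hypothetical numbers, purely for illustration:
p_disease = 0.01      # prior: 1% of people have the disease
p_pos_given_d = 0.99  # test sensitivity
p_pos_given_h = 0.05  # false positive rate among the healthy

# Total probability of testing positive
p_pos = p_pos_given_d * p_disease + p_pos_given_h * (1 - p_disease)

# Bayes Theorem: P(disease | positive test)
p_d_given_pos = p_pos_given_d * p_disease / p_pos
print(round(p_d_given_pos, 3))  # 0.167 -- far lower than most people guess
```

If that result (a positive test still only means a ~17% chance of disease) makes sense to you, your probability background is in good shape.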

I really think that if you understand those mathematical topics, you can start reading the ebook.

Here is my proposed learning strategy. Iterate between reading the ebook (Chapters 1-5 only) and playing with this cool interactive neural network every time a new idea is mentioned. For a first pass, just read the ebook and don’t do the exercises or worry about actual code implementation. Additionally, chapter 6 introduces convolutional neural networks which are a more advanced topic that can be saved for later.

Once you have some intuition about neural networks, I recommend reading this review by several of the big names in deep learning. This will give you a flavor of the current status of the field.

Now you are ready to start coding!

PS. If you want to get into more advanced deep learning topics, check out my previous Deep Learning Unit. And to really get up to speed on research, there is a deep learning book that should be published soon.

 

Learning Python for Science

Here I outline how to learn Python on your own with emphasis on solving science problems. The first section applies to anyone, but the end is specialized towards computational problems that arise in science.

Python Basics

I recommend the following two tutorials:

Some additional resources that may be helpful include:

My suggested workflow:

  1. Do Codecademy and Python the Hard Way at the same time.
  2. If Codecademy/Python the Hard Way is too difficult, also read A Byte of Python.
  3. If Codecademy/Python the Hard Way is easy, use Think Python as an additional resource.
  4. If you are confused about a specific chunk of code, put it into Python Tutor which will walk you step by step through the program.
  5. Additionally, Google and Stack Overflow are extremely useful for coding questions or go to the original Python documentation.

The essential things one needs to learn about Python include:

  • data types: int, float, string
  • data structures: lists, dictionaries, tuples, sets
  • control statements: for, while, if else
  • print function
  • open / write to a text file
  • custom functions and objects
  • list comprehensions – comes up less often in numerical code, but still good to know
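A tiny script that touches most of those essentials (the element data and file name below are made up):

```python
# Data types and structures
counts = {"hydrogen": 1, "helium": 2}          # dictionary
elements = ["hydrogen", "helium", "lithium"]   # list

# Control statements and a custom function
def describe(name):
    if name in counts:
        return "%s has %d proton(s)" % (name, counts[name])
    return "%s is not in the table yet" % name

for element in elements:
    print(describe(element))

# List comprehension and writing to a text file
lengths = [len(e) for e in elements]
with open("elements.txt", "w") as f:
    for e in elements:
        f.write(e + "\n")
print(lengths)  # [8, 6, 7]
```

If every line above reads naturally to you, you have the basics needed for the scientific modules below.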

 

Numpy and Scipy

Numpy is the essential mathematics module in Python and is part of the larger Scipy project. All standard numerical needs are covered in Numpy, while more advanced functions are in Scipy.

I recommend the following tutorials:

  • Numpy’s tutorial
  • This fun tutorial that programs the Game of Life (GoL). I only recommend the section that implements GoL in Numpy, the rest of it is not essential. It also has a useful quick reference guide.
  • This Numpy tutorial from Scipy Lecture Notes

 

Matplotlib

Visualizing data is essential to understanding and communicating science ideas. Matplotlib is the standard plotting module. While it has its limitations, I still personally use it for my everyday plots. For more advanced plot types, check out Plotly, Seaborn, Mayavi, ggplot, and Bokeh.

And assuming you are new to making scientific figures, there are some good habits you should get into. First, read these tips from PLOS. Second, never ever use the rainbow color map (aka jet). Color-challenged people like myself will hate you. Please stick with a color map that uses shading sensibly. Besides making me happy, it also prints to grayscale more easily.
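As a minimal sketch of that advice (toy data and a hypothetical file name), here is a heatmap using the perceptually uniform viridis color map instead of jet:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Toy 2D data: a smooth bump, purely illustrative
x = np.linspace(-2, 2, 100)
z = np.exp(-(x[None, :] ** 2 + x[:, None] ** 2))

fig, ax = plt.subplots()
im = ax.imshow(z, cmap="viridis")  # shades sensibly, survives grayscale printing
fig.colorbar(im, ax=ax)
fig.savefig("bump.png")
```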

 

Python on a Mac

I personally do most of my coding on my laptop, which is a Mac. Eventually that code gets run on a Linux server, but all initial coding, exploratory data analysis, etc is done on my laptop. And since I advocate for Python, I thought I would lay out all the steps I needed to do to setup my Mac in the easiest manner. (Note: probably similar steps on Windows, but I haven’t used a Windows computer in so long that I don’t know the potential differences).


Unfortunately, the Python 2.x vs 3.x divide exists, and so far I have been unable to completely commit to 3.x due to a few packages with legacy issues. Luckily, there is a pretty easy workaround below. Note that your Mac has Python preinstalled (go to the terminal and type python to start coding…). However, if you want to update any packages, you can quickly run into issues, so it is easiest to install your own version of Python.

  1. Install Anaconda (I advocate version 2.7, Anaconda will call this environment root)
  2. I recommend using Anaconda Navigator and using Spyder for an IDE
  3. Install version 3.5 and make an environment (in Anaconda Navigator or terminal commands below):
    $ conda create -n python3.5 python=3.5 anaconda
  4. You can switch between Python environments {root, python3.5}:
    $ source activate {insert environment name here}
  5. To add new python packages use conda or pip (anaconda has made its own pip the default)
  6. WARNING: always close Spyder before using conda update or pip. I once got stuck in a state where Spyder would no longer launch; apparently this can happen if Spyder is open while its underlying packages are changed.

To get around the 2.x vs 3.x issue, go to your terminal and use pip install for the following packages: future, importlib, unittest2, and argparse. See each package's website for details of any differences. Then, start your Python code with the following lines:

from __future__ import (absolute_import, division, print_function, unicode_literals)

from builtins import *

For nearly all scientific computing applications, you are essentially writing Python 3 code. So make sure to read the correct documentation!
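As a quick sanity check of what those imports buy you, integer division is the classic gotcha: under Python 2 without the import, 1/2 is 0.

```python
from __future__ import division, print_function

# With the __future__ import, / is true division in Python 2 as well
print(1 / 2)    # 0.5 under both Python 2 and 3
print(7 // 2)   # 3 -- use // when you really want integer division
```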

Personally, I found Anaconda to be a lifesaver. Otherwise, I got stuck in some weird infinite update loop to install all required packages for machine learning (specifically Theano).

Now you are ready to code! If you aren’t familiar with Python, my recommended tutorials will be in a future post.