## February 28, 2015

### Matthew Rocklin

#### Ising models and Numba

This work is supported by Continuum Analytics and the XDATA Grant as part of the Blaze Project

tl;dr I play with Numba and find it effective.

## Introduction

Confession, I’ve never actually used Numba. Well that’s not quite true; I’ve indirectly used Numba thousands of times because Blaze auto-generates numba ufuncs. Still I’ve never used it for a particular problem. I usually define problems with large array operations and compile those down. Numba takes a different approach and translates Python for loops to efficient LLVM code. This is all lower in the hardware stack than where I usually think.

But when I was looking for applications to motivate recent work in nearest-neighbor communications in dask a friend pointed me towards the Ising model, a simple physical system that is both easy to code up and has nice macro-scale properties. I took this as an example to play with Numba. This post details my experience.

## Ising Model

Disclaimer: I am not a physicist

The Ising model represents a regular grid of points where each point has two possible states, spin up and spin down. States like to have the same spin as their immediate neighbors so when a spin-down state is surrounded by more spin-up states it will switch to spin-up and vice versa. Also, due to random fluctuations, points might switch spins, even if this switch is not favorable. In pseudocode an Ising update step might look like the following

for every point in the grid:
energy = my spin * sum of all of the spins (+1 or -1) of neighboring points
if energy is improved by switching:
switch
else if we're particulalry unlucky
switch anyway


For this kind of algorithm imperative for-loopy code is probably the most clear. You can do this with high-level array operations too (e.g. NumPy/Blaze/Theano), but it’s a mess.

## Numba code

Here is my solution to the problem with Numba and a gif of the solution

import numba
import numpy as np
from math import exp, log, e, sqrt

kT = 2 / log(1 + sqrt(2), e)

@numba.jit(nopython=True)
def _update(x, i, j):
n, m = x.shape
dE = 2* x[i, j] * (
x[(i-1)%n, (j-1)%m]
+ x[(i-1)%n,  j     ]
+ x[(i-1)%n, (j+1)%m]

+ x[ i     , (j-1)%m]
+ x[ i     , (j+1)%m]

+ x[(i+1)%n, (j-1)%m]
+ x[(i+1)%n,  j     ]
+ x[(i+1)%n, (j+1)%m]
)
if dE <= 0 or exp(-dE / kT) > np.random.random():
x[i, j] *= -1

@numba.jit(nopython=True)
def update(x):
n, m = x.shape

for i in range(n):
for j in range(0, m, 2):  # Even columns first to avoid overlap
_update(x, i, j)

for i in range(n):
for j in range(1, m, 2):  # Odd columns second to avoid overlap
_update(x, i, j)

if __name__ == '__main__':
x = np.random.randint(2, size=(1000, 1000)).astype('i1')
x[x == 0] = -1
for i in range(100):
update(x)

My old beater laptop executes one update step on a 1000x1000 grid in 50ms. Without Numba this takes 12s. This wasn’t a canned demo by an expert user / numba developer; this was just my out-of-the-box experience.

## Thoughts

I really like the following:

• I can remove @numba.jit and use the Python debugger
• I can assert that I’m only using LLVM with nopython=True
• I can manage data with NumPy (or dask.array) separately from managing computation with Numba

I ran in to some issues and learned some things too:

• random is only present in the developer preview builds of Numba (conda install -c numba numba). It will be officially released in the 0.18 version this March.
• You don’t have to provide type signature strings. I tried providing these at first but I didn’t know the syntax and so repeatedly failed to write down the type signature correctly. Turns out the cost of not writing it down is that Numba will jit whenever it sees a new signature. For my application this is essentially free.

## February 26, 2015

### NeuralEnsemble

#### Workshop Announcement - "HBP Hippocamp CA1: Collaborative and Integrative Modeling of Hippocampal Area CA1"

Registration is now open for the workshop "HBP Hippocamp CA1: Collaborative and Integrative Modeling of Hippocampal Area CA1", to be held March 31st - April 1st, 2015 at UCL School of Pharmacy in London, supported by the Human Brain Project (www.humanbrainproject.eu).

In short, the aims of the workshop are two-fold. First, to engage the larger community of experimentalists and modelers working on hippocampus, and highlight existing modeling efforts and strategic datasets for modeling hippocampal area CA1. Second, to define and bootstrap an inclusive community-driven model and data-integration process to achieve open pre-competitive reference models of area CA1 (and, ultimately, the rest of the hippocampus), which are well documented, validated, and released at regular intervals (supported in part by IT infrastructure funded by HBP). Involvement from the community interested in characterization and modeling of the hippocampus is highly encouraged. To keep the meeting focused on the task, participation will be limited to ~30 people, so registration is required.

Please consult the meeting website at http://neuralensemble.org/meetings/HippocampCA1/for registration and further details.

Organizing committee:Jo Falck (UCL), Szabolcs Káli (Hungarian Academy of Sciences), Sigrun Lange (UCL), Audrey Mercer (UCL), Eilif Muller (EPFL), Armando Romani (EPFL) and Alex Thomson (UCL).

## February 24, 2015

### Jake Vanderplas

#### Optimizing Python in the Real World: NumPy, Numba, and the NUFFT

Donald Knuth famously quipped that "premature optimization is the root of all evil." The reasons are straightforward: optimized code tends to be much more difficult to read and debug than simpler implementations of the same algorithm, and optimizing too early leads to greater costs down the road. In the Python world, there is another cost to optimization: optimized code often is written in a compiled language like Fortran or C, and this leads to barriers to its development, use, and deployment.

Too often, tutorials about optimizing Python use trivial or toy examples which may not map well to the real world. I've certainly been guilty of this myself. Here, I'm going to take a different route: in this post I will outline the process of understanding, implementing, and optimizing a non-trivial algorithm in Python, in this case the Non-uniform Fast Fourier Transform (NUFFT). Along the way, we'll dig into the process of optimizing Python code, and see how a relatively straightforward pure Python implementation, with a little help from Numba, can be made to nearly match the performance of a highly-optimized Fortran implementation of the same algorithm.

## Why a Python Implementation?

First, I want to answer the inevitable question: why spend the time to make a Python implementation of an algorithm that's already out there in Fortran? The reason is that I've found in my research and teaching that pure-Python implementations of algorithms are far more valuable than C or Fortran implementations, even if they might be a bit slower. This is for a number of reasons:

• Pure-Python code is easier to read, understand, and contribute to. Good Python implementations are much higher-level than C or Fortran, and abstract-away loop indices, bit twiddling, workspace arrays, and other sources of code clutter. A typical student reading good Python code can immediately understand and modify the algorithm, while the same student would be lost trying to understand typical optimized Fortran code.

• Pure-python packages are much easier to install than Python-wrapped C or Fortran code. This is especially true on non-Linux systems. Fortran in particular can require some installation prerequisites that are non-trivial for many users. In practice, I've seen people give up on better tools when there is an installation barrier for those tools.

• Pure-python code often works for many data types. Because of the way it is written, pure Python code is often automatically applicable to single or double precision, and perhaps even to extensions to complex numbers. For compiled packages, supporting and compiling for all possible types can be a burden.

• Pure-python is easier to use at scale. Because it does not require complicated installation, pure Python packages can be much easier to install on cloud VMs and/or shared clusters for computation at scale. If you can easily pip-install a pure-Python package on a VM, then services like AWS and TravisCI are much easier to set up.

Certainly code speed will overcome these considerations if the performance gap is great enough, but I've found that for many applications a pure Python package, cleverly designed and optimized, can be made fast enough that these larger considerations win-out. The challenge is making the Python fast. We'll explore this below.

## Background: The Non-Uniform Fast Fourier Transform

The Fast Fourier Transform (FFT) is perhaps the most important and fundamental of modern numerical algorithms. It provides a fast, $$O[N\log N]$$ method of computing the discrete Fourier transform: $Y_k^\pm = \sum_{n=0}^{N-1} y_n e^{\pm i k n / N}$ You can read more about the FFT in my previous post on the subject.

One important limitation of the FFT is that it requires that input data be evenly-spaced: that is, we can think of the values $$y_n$$ as samples of a function $$y_n = y(x_n)$$ where $$x_n = x_0 + n\Delta x$$ is a regular grid of points. But what about when your grid is not uniform? That is, what if you want to compute this result: $Y_k^\pm = \sum_{j=1}^N y(x_j) e^{\pm i k x_j}$ where $$y(x)$$ is evaluated at an arbitrary set of points $$x_j$$? In this case, the FFT is no longer directly applicable, and you're stuck using a much slower $$O[N^2]$$ direct summation.

Stuck, that is, until the NUFFT came along.

The NUFFT is an algorithm which converts the non-uniform transform into an approximate uniform transform, not with error-prone interpolation, but instead using a clever "gridding" operation motivated by the convolution theorem. If you'd like to read about the algorithm in detail, the Courant Institute's NUFFT page has a nice set of resources.

Below we'll take a look at implementing this algorithm in Python.

## Direct Non-Uniform Fourier Transform

When developing optimized code, it is important to start with something easy to make sure you're on the right track. Here we'll start with a straightforward direct version of the non-uniform Fourier transform. We'll allow non-uniform inputs $$x_j$$, but compute the output on a grid of $$M$$ evenly-spaced frequencies in the range $$-M/2 \le f/\delta f < M/2$$. This is what the NUFFT group calls the Type-1 NUFFT.

First we'll implement nufftfreqs(), which returns the frequency grid for a given $$M$$, and nudft() which computes the non-uniform discrete Fourier transform using a slow direct method. The arguments for the latter include iflag, which is a positive or negative number indicating the desired sign of the exponent:

In [1]:
from __future__ import print_function, division
import numpy as np

def nufftfreqs(M, df=1):
"""Compute the frequency range used in nufft for M frequency bins"""
return df * np.arange(-(M // 2), M - (M // 2))

def nudft(x, y, M, df=1.0, iflag=1):
"""Non-Uniform Direct Fourier Transform"""
sign = -1 if iflag < 0 else 1
return (1 / len(x)) * np.dot(y, np.exp(sign * 1j * nufftfreqs(M, df) * x[:, np.newaxis]))


Again, I can't emphasize this enough: when writing fast code, start with a slow-and-simple version of the code which you know gives the correct result, and then optimize from there.

## Comparing to the Fortran NUFFT

We can double-check that this is producing the desired result by comparing to the Fortran NUFFT implementation, using Python wrappers written by Dan Foreman-Mackey, available at http://github.com/dfm/python-nufft/:

In [2]:
# Install nufft from http://github.com/dfm/python-nufft/
from nufft import nufft1 as nufft_fortran

x = 100 * np.random.random(1000)
y = np.sin(x)

Y1 = nudft(x, y, 1000)
Y2 = nufft_fortran(x, y, 1000)

np.allclose(Y1, Y2)

Out[2]:
True


The results match! A quick check shows that, as we might expect, the Fortran algorithm is orders of magnitude faster:

In [3]:
%timeit nudft(x, y, 1000)
%timeit nufft_fortran(x, y, 1000)

10 loops, best of 3: 47.9 ms per loop
1000 loops, best of 3: 265 µs per loop



On top of this, for $$N$$ points and $$N$$ frequencies, the Fortran NUFFT will scale as $$O[N\log N]$$, while our simple implementation will scale as $$O[N^2]$$, making the difference even greater as $$N$$ increases! Let's see if we can do better.

## NUFFT with Python

Here we'll attempt a pure-Python version of the fast, FFT-based NUFFT. We'll follow the basics of the algorithm presented on the NUFFT page, using NumPy broadcasting tricks to push loops into the compiled layer of NumPy. For later convenience, we'll start by defining a utility to compute the grid parameters as detailed in the NUFFT paper.

In [4]:
def _compute_grid_params(M, eps):
# Choose Msp & tau from eps following Dutt & Rokhlin (1993)
if eps <= 1E-33 or eps >= 1E-1:
raise ValueError("eps = {0:.0e}; must satisfy "
"1e-33 < eps < 1e-1.".format(eps))
ratio = 2 if eps > 1E-11 else 3
Msp = int(-np.log(eps) / (np.pi * (ratio - 1) / (ratio - 0.5)) + 0.5)
Mr = max(ratio * M, 2 * Msp)
lambda_ = Msp / (ratio * (ratio - 0.5))
tau = np.pi * lambda_ / M ** 2
return Msp, Mr, tau

def nufft_python(x, c, M, df=1.0, eps=1E-15, iflag=1):
"""Fast Non-Uniform Fourier Transform with Python"""
Msp, Mr, tau = _compute_grid_params(M, eps)
N = len(x)

# Construct the convolved grid
ftau = np.zeros(Mr, dtype=c.dtype)
Mr = ftau.shape[0]
hx = 2 * np.pi / Mr
mm = np.arange(-Msp, Msp)
for i in range(N):
xi = (x[i] * df) % (2 * np.pi)
m = 1 + int(xi // hx)
spread = np.exp(-0.25 * (xi - hx * (m + mm)) ** 2 / tau)
ftau[(m + mm) % Mr] += c[i] * spread

# Compute the FFT on the convolved grid
if iflag < 0:
Ftau = (1 / Mr) * np.fft.fft(ftau)
else:
Ftau = np.fft.ifft(ftau)
Ftau = np.concatenate([Ftau[-(M//2):], Ftau[:M//2 + M % 2]])

# Deconvolve the grid using convolution theorem
k = nufftfreqs(M)
return (1 / N) * np.sqrt(np.pi / tau) * np.exp(tau * k ** 2) * Ftau


Let's compare this to the previous results. For convenience, we'll define a single routine which validates the results and times the execution:

In [5]:
from time import time

def test_nufft(nufft_func, M=1000, Mtime=100000):
# Test vs the direct method
print(30 * '-')
name = {'nufft1':'nufft_fortran'}.get(nufft_func.__name__,
nufft_func.__name__)
print("testing {0}".format(name))
rng = np.random.RandomState(0)
x = 100 * rng.rand(M + 1)
y = np.sin(x)
for df in [1, 2.0]:
for iflag in [1, -1]:
F1 = nudft(x, y, M, df=df, iflag=iflag)
F2 = nufft_func(x, y, M, df=df, iflag=iflag)
assert np.allclose(F1, F2)
print("- Results match the DFT")

# Time the nufft function
x = 100 * rng.rand(Mtime)
y = np.sin(x)
times = []
for i in range(5):
t0 = time()
F = nufft_func(x, y, Mtime)
t1 = time()
times.append(t1 - t0)
print("- Execution time (M={0}): {1:.2g} sec".format(Mtime, np.median(times)))

In [6]:
test_nufft(nufft_python)
test_nufft(nufft_fortran)

------------------------------
testing nufft_python
- Results match the DFT
- Execution time (M=100000): 1.6 sec
------------------------------
testing nufft_fortran
- Results match the DFT
- Execution time (M=100000): 0.043 sec



The good news is that our Python implementation works; the bad news is that it remains several orders of magnitude slower than the Fortran result!

Let's make it faster.

### Data

Beautiful code is kind of a goal unto itself, but we really want to demonstrate how useful SciPy is in real scientific analysis. Therefore, although cute examples on synthetic data can illustrate quite well what a piece of code does, it would be extremely useful for us if you can point to examples with real scientific data behind them.

Thank you, and we hope to hear from you!

## February 17, 2015

### Continuum Analytics

#### Bokeh 0.8 Released

We are excited to announce the release of version 0.8 of Bokeh, an interactive web plotting library for Python… and other languages! This release includes many major new features:

• New and updated language bindings: R, JavaScript, Julia, Scala, and Lua now available
• More bokeh-server features geared towards production deployments
• Live gallery of server examples and apps!
• Simpler, more easily extensible design for charts API, plus new Horizon chart
• New build automation and substantial documentation improvements
• Shaded grid bands, configurable hover tool, and pan/zoom for categorical plots
• Improved and more robust crossfilter application
• AjaxDataSource for clients to stream data without a Bokeh server

### Matthew Rocklin

#### Towards Out-of-core ND-Arrays -- Dask + Toolz = Bag

This work is supported by Continuum Analytics and the XDATA Grant as part of the Blaze Project

tl;dr We use dask to build a parallel Python list.

## Introduction

This is the seventh in a sequence of posts constructing an out-of-core nd-array using NumPy, and dask. You can view these posts here:

Today we take a break from ND-Arrays and show how task scheduling can attack other collections like the simple list of Python objects.

## Unstructured Data

Often before we have an ndarray or a table/DataFrame we have unstructured data like log files. In this case tuned subsets of the language (e.g. numpy/pandas) aren’t sufficient, we need the full Python language.

My usual approach to the inconveniently large dump of log files is to use Python streaming iterators along with multiprocessing or IPython Parallel on a single large machine. I often write/speak about this workflow when discussing toolz.

This workflow grows complex for most users when you introduce many processes. To resolve this we build our normal tricks into a new dask.Bag collection.

## Bag

In the same way that dask.array mimics NumPy operations (e.g. matrix multiply, slicing), dask.bag mimics functional operations like map, filter, reduce found in the standard library as well as many of the streaming functions found in toolz.

• Dask bag = Python/Toolz + processes

## Example

Here’s the obligatory wordcount example

>>> from dask.bag import Bag

>>> b = Bag.from_filenames('data/*.txt')

>>> def stem(word):
...     """ Stem word to primitive form """
...     return word.lower().rstrip(",.!:;'-\"").lstrip("'\"")

>>> dict(b.map(str.split).map(concat).map(stem).frequencies())
{...}

We use all of our cores and stream through memory on each core. We use multiprocessing but could get fancier with some work.

There are a lot of much larger much more powerful systems that have a similar API, notably Spark and DPark. If you have Big Data and need to use very many machines then you should stop reading this and go install them.

1. It was very easy given the work already done on dask.array
2. I often only need multiprocessing + a heavy machine
3. I wanted something trivially pip installable that didn’t use the JVM

But again, if you have Big Data, then this isn’t for you.

## Design

As before, a Bag is just a dict holding tasks, along with a little meta data.

>>> d = {('x', 0): (range, 5),
...      ('x', 1): (range, 5),
...      ('x', 2): (range, 5)}

>>> b = Bag(d, 'x', npartitions=3)

In this way we break up one collection

[0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]


into three independent pieces

[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]


When we abstractly operate on the large collection…

>>> b2 = b.map(lambda x: x * 10)

… we generate new tasks to operate on each of the components.

>>> b2.dask
{('x', 0): (range, 5),
('x', 1): (range, 5),
('x', 2): (range, 5)}
('bag-1', 0): (map, lambda x: x * 10, ('x', 0)),
('bag-1', 1): (map, lambda x: x * 10, ('x', 1)),
('bag-1', 2): (map, lambda x: x * 10, ('x', 2))}

And when we ask for concrete results (the call to list) we spin up a scheduler to execute the resulting dependency graph of tasks.

>>> list(b2)
[0, 10, 20, 30, 40, 0, 10, 20, 30, 40, 0, 10, 20, 30, 40]

More complex operations yield more complex dasks. Beware, dask code is pretty Lispy. Fortunately these dasks are internal; users don’t interact with them.

>>> iseven = lambda x: x % 2 == 0
{'bag-3': (sum, [('bag-2', 1), ('bag-2', 2), ('bag-2', 0)]),
('bag-2', 0): (count,
(filter, iseven, (range, 5))),
('bag-2', 1): (count,
(filter, iseven, (range, 5))),
('bag-2', 2): (count,
(filter, iseven, (range, 5)))}

The current interface for Bag has the following operations:

all             frequencies         min
any             join                product
count           map                 std
filter          map_partitions      sum
fold            max                 topk
foldby          mean                var


Manipulations of bags create task dependency graphs. We eventually execute these graphs in parallel.

## Execution

We repurpose the threaded scheduler we used for arrays to support multiprocessing to provide parallelism even on pure Python code. We’re careful to avoid unnecessary data transfer. None of the operations listed above requires significant communication. Notably we don’t have any concept of shuffling or scatter/gather.

We use dill to take care to serialize functions properly and collect/report errors, two issues that plague naive use of multiprocessing in Python.

>>> list(b.map(lambda x: x * 10))  # This works!
[0, 10, 20, 30, 40, 0, 10, 20, 30, 40, 0, 10, 20, 30, 40]

>>> list(b.map(lambda x: x / 0))   # This errs gracefully!
ZeroDivisionError:  Execption in remote Process

integer division or modulo by zero

Traceback:
...

These tricks remove need for user expertise.

## Productive Sweet Spot

I think that there is a productive sweet spot in the following configuration

1. Pure Python functions
2. Streaming/lazy data
3. Multiprocessing
4. A single large machine or a few machines in an informal cluster

This setup is common and it’s capable of handling terabyte scale workflows. In my brief experience people rarely take this route. They use single-threaded in-memory Python until it breaks, and then seek out Big Data Infrastructure like Hadoop/Spark at relatively high productivity overhead.

Your workstation can scale bigger than you think.

## Example

Here is about a gigabyte of network flow data, recording which computers made connections to which other computers on the UC-Berkeley campus in 1996.

846890339:661920 846890339:755475 846890340:197141 168.237.7.10:1163 83.153.38.208:80 2 8 4294967295 4294967295 846615753 176 2462 39 GET 21068906053917068819..html HTTP/1.0

846890340:989181 846890341:2147 846890341:2268 13.35.251.117:1269 207.83.232.163:80 10 0 842099997 4294967295 4294967295 64 1 38 GET 20271810743860818265..gif HTTP/1.0

846890341:80714 846890341:90331 846890341:90451 13.35.251.117:1270 207.83.232.163:80 10 0 842099995 4294967295 4294967295 64 1 38 GET 38127854093537985420..gif HTTP/1.0


This is actually relatively clean. Many of the fields are space delimited (not all) and I’ve already compiled and run the decade old C-code needed to decompress it from its original format.

Lets use Bag and regular expressions to parse this.

In [1]: from dask.bag import Bag, into

In [2]: b = Bag.from_filenames('UCB-home-IP*.log')

In [3]: import re

In [4]: pattern = """
...: (?P<request_time>\d+:\d+)
...: (?P<response_start>\d+:\d+)
...: (?P<response_end>\d+:\d+)
...: (?P<client_ip>\d+\.\d+\.\d+\.\d+):(?P<client_port>\d+)
...: (?P<server_ip>\d+\.\d+\.\d+\.\d+):(?P<server_port>\d+)
...: (?P<if_modified_since>\d+)
...: (?P<response_data_length>\d+)
...: (?P<request_url_length>\d+)
...: (?P<expires>\d+)
...: (?P<last_modified>\d+)
...: (?P<method>\w+)
...: (?P<domain>\d+..)\.(?P<extension>\w*)(?P<rest_of_url>\S*)
...: (?P<protocol>.*)""".strip().replace('\n', '\s+')

In [5]: prog = re.compile(pattern)

In [6]: records = b.map(prog.match).map(lambda m: m.groupdict())

This returns instantly. We only compute results when necessary. We trigger computation by calling list.

In [7]: list(records.take(1))
Out[7]:
'client_ip': '168.237.7.10',
'client_port': '1163',
'domain': '21068906053917068819.',
'expires': '2462',
'extension': 'html',
'if_modified_since': '4294967295',
'last_modified': '39',
'method': 'GET',
'protocol': 'HTTP/1.0',
'request_time': '846890339:661920',
'request_url_length': '176',
'response_data_length': '846615753',
'response_end': '846890340:197141',
'response_start': '846890339:755475',
'rest_of_url': '',
'server_ip': '83.153.38.208',
'server_port': '80'}]

Because bag operates lazily this small result also returns immediately.

To demonstrate depth we find the ten client/server pairs with the most connections.

In [8]: counts = records.pluck(['client_ip', 'server_ip']).frequencies()

In [9]: %time list(counts.topk(10, key=lambda x: x[1]))
CPU times: user 11.2 s, sys: 1.15 s, total: 12.3 s
Wall time: 50.4 s
Out[9]:
[(('247.193.34.56', '243.182.247.102'), 35353),
(('172.219.28.251', '47.61.128.1'), 22333),
(('240.97.200.0', '108.146.202.184'), 17492),
(('229.112.177.58', '47.61.128.1'), 12993),
(('146.214.34.69', '119.153.78.6'), 12554),
(('17.32.139.174', '179.135.20.36'), 10166),
(('97.166.76.88', '65.81.49.125'), 8155),
(('55.156.159.21', '157.229.248.255'), 7533),
(('55.156.159.21', '124.77.75.86'), 7506),
(('55.156.159.21', '97.5.181.76'), 7501)]

## Comparison with Spark

First, it is silly and unfair to compare with PySpark running locally. PySpark offers much more in a distributed context.

In [1]: import pyspark

In [2]: sc = pyspark.SparkContext('local')

In [3]: from glob import glob
In [4]: filenames = sorted(glob('UCB-home-*.log'))
In [5]: rdd = sc.parallelize(filenames, numSlices=4)

In [6]: import re
In [7]: pattern = ...
In [8]: prog = re.compile(pattern)

In [9]: lines = rdd.flatMap(lambda fn: list(open(fn)))
In [10]: records = lines.map(lambda line: prog.match(line).groupdict())
In [11]: ips = records.map(lambda rec: (rec['client_ip'], rec['server_ip']))

In [12]: from toolz import topk
In [13]: %time dict(topk(10, ips.countByValue().items(), key=1))
CPU times: user 1.32 s, sys: 52.2 ms, total: 1.37 s
Wall time: 1min 21s
Out[13]:
{('146.214.34.69', '119.153.78.6'): 12554,
('17.32.139.174', '179.135.20.36'): 10166,
('172.219.28.251', '47.61.128.1'): 22333,
('229.112.177.58', '47.61.128.1'): 12993,
('240.97.200.0', '108.146.202.184'): 17492,
('247.193.34.56', '243.182.247.102'): 35353,
('55.156.159.21', '124.77.75.86'): 7506,
('55.156.159.21', '157.229.248.255'): 7533,
('55.156.159.21', '97.5.181.76'): 7501,
('97.166.76.88', '65.81.49.125'): 8155}

So, given a compute-bound mostly embarrassingly parallel task (regexes are comparatively expensive) on a single machine they are comparable.

Reasons you would want to use Spark

• You want to use many machines and interact with HDFS
• Shuffling operations

Reasons you would want to use dask.bag

• Trivial installation
• No mucking about with JVM heap sizes or config files
• Nice error reporting. Spark error reporting includes the typical giant Java Stack trace that comes with any JVM solution.
• Easier/simpler for Python programmers to hack on. The implementation is 350 lines including comments.

Again, this is really just a toy experiment to show that the dask model isn’t just about arrays. I absolutely do not want to throw Dask in the ring with Spark.

## Conclusion

However I do want to stress the importance of single-machine parallelism. Dask.bag targets this application well and leverages common hardware in a simple way that is both natural and accessible to most Python programmers.

A skilled developer could extend this to work in a distributed memory context. The logic to create the task dependency graphs is separate from the scheduler.

Special thanks to Erik Welch for finely crafting the dask optimization passes that keep the data flowly smoothly.

## February 13, 2015

### Matthew Rocklin

#### Towards Out-of-core ND-Arrays -- Slicing and Stacking

This work is supported by Continuum Analytics and the XDATA Grant as part of the Blaze Project

tl;dr Dask.arrays can slice and stack. This is useful for weather data.

## Introduction

This is the sixth in a sequence of posts constructing an out-of-core nd-array using NumPy, and dask. You can view these posts here:

Now we talk about slicing and stacking. We use meteorological data as an example use case.

## Slicing

Dask.array now supports most of the NumPy slicing syntax. In particular it supports the following:

• Slicing by integers and slices x[0, :5]
• Slicing by a list/np.ndarray of integers x[[1, 2, 4]]
• Slicing by a list/np.ndarray of booleans x[[False, True, True, False, True]]

It does not currently support the following:

• Slicing one dask.array with another x[x > 0]
• Slicing with lists in multiple axes x[[1, 2, 3], [3, 2, 1]]

## Stack and Concatenate

We often store large arrays on disk in many different files. We want to stack or concatenate these arrays together into one logical array. Dask solves this problem with the stack and concatenate functions, which stitch many arrays together into a single array, either creating a new dimension with stack or along an existing dimension with concatenate.

### Stack

We stack many existing dask arrays into a new array, creating a new dimension as we go.

>>> import dask.array as da
>>> arrays = [from_array(np.ones((4, 4)), blockshape=(2, 2))
...            for i in range(3)]  # A small stack of dask arrays

>>> da.stack(arrays, axis=0).shape
(3, 4, 4)

>>> da.stack(arrays, axis=1).shape
(4, 3, 4)

>>> da.stack(arrays, axis=2).shape
(4, 4, 3)

This creates a new dimension with length equal to the number of slices

### Concatenate

We concatenate existing arrays into a new array, extending them along an existing dimension

>>> import dask.array as da
>>> arrays = [from_array(np.ones((4, 4)), blockshape=(2, 2))
...            for i in range(3)]  # small stack of dask arrays

>>> da.concatenate(arrays, axis=0).shape
(12, 4)

>>> da.concatenate(arrays, axis=1).shape
(4, 12)

## Case Study with Meteorological Data

To test this new functionality we download meteorological data from the European Centre for Medium-Range Weather Forecasts. In particular we have the temperature for the Earth every six hours for all of 2014 with spatial resolution of a quarter degree. We download this data using this script (please don’t hammer their servers unnecessarily) (Thanks due to Stephan Hoyer for pointing me to this dataset).

As a result, I now have a bunch of netCDF files!

$into myfile.csv ssh://hostname:myfile.json --delimter ','  We also have docs! http://into.readthedocs.org/en/latest/ ## February 09, 2015 ### numfocus #### NumFOCUS 2014 Review While not represented in our blog activity, 2014 turned out to be a very busy year for NumFOCUS. We sponsored more projects, gained new board members, and added programs. It has been hard to keep up with all the activity within the organization, but perhaps much harder for those outside. Let me go through a few of the highlights from the year to get folks caught up. First and foremost, we had the first election process. This has allowed us to welcome Lorena Barba, Brian Granger, Stefan Karpinski and Cindee Madison to the board of directors. In an effort to grow the number of people helping out we have also formed two new ways for people to contribute: Advisor Council and Vice President roles. The inaugural advisor council consists of Fernando Perez and Travis Oliphant. They will help guide the organization as it grows in both scope and operations. The new Vice President are Sheila Miguez, James Powell and Ben Zaitlen, all committed community members who have helped in various ways to get us started. We are always looking for more volunteers to help build our organization, but with these added team members we hope to continue to deliver high quality services to our community. Over the year we held four PyData conferences, supported one John Hunter Technology Fellow, lead Women in Technology bootcamps, helped between 20 and 30 conference attendees travel to important conferences, and more. As the new year starts, we will be doing many of these things plus working with the Training Up program. We see the need to not only sponsor projects but to create important programs that get the community together. As always, we are always looking for more volunteers for these events, so please do get in touch. Perhaps one of the most exciting things that we did this year is expand our language scope from a Python centric organization to include several non-Python projects. We have ROpenSci and Julia now a part of our line up. Joining us on the education front are Software Carpentry and sister organization Data Carpentry. We also finished signing our Pyhton projects that we had been working with for a while including: SymPy, IPython, and AstroPy. While fiscal sponsorship doesn’t make sense for every project, it is a service we happily provide to help projects sustain and grow in natural ways. In the coming year, we plan to be a great deal more vocal about our activities. Personally, two major goals I have is to expand our reach to Europe and start making headway on project governance recommendations. But that is another story. ## February 08, 2015 ### Titus Brown #### Review: &quot;Accurate multi-kb reads resolve complex populations and detect rare microorganisms&quot; Here is the top bit of a review I wrote of a very nice paper by Itai Sharon et al., from Jill Banfield's lab, on using Illumina TruSeq long reads (a.k.a. Moleculo), to look at complex metagenomes. The paper is newly available here (although it is behind a paywall ;(. Citation: Accurate, multi-kb reads resolve complex populations and detect rare microorganisms Genome Res. gr.183012.114. Published in Advance February 9, 2015. doi: 10.1101/gr.183012.114 This is an excellent application of new long-read technology to further illuminate the characteristics of several medium-to-high complexity microbial communities. The methods are expert, the results are fascinating, and the discussion is well done. Objectives: 1. test the efficacy of assembling Moleculo reads to improve short-read contigs; 2. evaluate accuracy of curated short-read assemblies; 3. look at organisms present at very low abundance levels; 4. evaluate levels of sequence variation & genomic content in strains that could not otherwise be resolved by short-read assembly; Results: 1. Long-read data revealed many very abundant organisms... 2. ...that were entirely missed in short-read assemblies. 3. Genome architecture and metabolic potential were reconstructed using a new synteny based method. 4. "Long tail" of low-abundance organisms belong to phyla represented by highly abundant organisms. 5. Diversity of closely-related strains & rare organisms account for major portion of the communities. The portion of the results that is most novel and most fascinating is the extensive analysis of rare sequences and the disparity in observations from Illumina (assemblies) and Moleculo (long reads and assemblies). The basic results are, on first examination, counter-intuitive: many long-read sequences are obtained from abundant organisms that simply don't show up in Illumina short-read assemblies. The statement is made that this is because of strain variation in the community, i.e. that Illumina assemblies are fragmented due to strain variation and this blocks the observation of the majority of the community. This is to some extent born out by the low mapping percentages (which is the primary evidence offered by the authors), and also matches our own observations. --titus ## February 07, 2015 ### Titus Brown #### How we develop software (2015 version) A colleague who is starting their own computational lab just asked me for some advice on how to run software projects, and I wrote up the following. Comments welcome! A brief summary of what we've converged on for our own needs is this: --titus ## February 05, 2015 ### Continuum Analytics #### Continuum Analytics - February Tech Events The Continuum Analytics team believes in engaging with our community and fostering relationships throughout the tech world. To keep you all in the loop, we’re updating our blog each month with a list of events where you can find us attending, hosting, or participating! ## February 04, 2015 ### Juan Nunez-Iglesias #### jnuneziglesias It’s official! Harriet Dashnow, Stéfan van der Walt, and I will be writing an O’Reilly book about the SciPy library and the surrounding ecosystem. The book is called Elegant SciPy, and is intended to teach SciPy to fledgling Pythonistas, guided by the most elegant SciPy code examples we can find. So, if you recently came across scientific Python code that made you go “Wow!” with its elegance, simplicity, cleverness, or power, please point us to it! As an example, have a look at Vighnesh Birodkar’s code to build a region adjacency graph from an n-dimensional image, which I highlighted previously here. Each chapter will have one or two snippets that we will work towards. Each of these will be credited as “written by/nominated by”, and needs to be published under a permissive license such as MIT, BSD, or public domain to be considered for inclusion. We would especially like nominations in the following categories: • statistics (using scipy.stats) • image processing and computer vision • Fourier transforms • sparse matrix operations • eigendecomposition and linear algebra • optimization • streaming data analysis • spatial and geophysical data analysis We’ll also consider other parts of the SciPy library and ecosystem. We invite you to submit code snippets for inclusion in the book. We’d also appreciate a small description of about one paragraph explaining what the code is used for and why you think it’s elegant, even though this is often self-evident. =) ### How to submit Thank you, Juan, Harriet, and Stéfan. ## February 03, 2015 ### Matthew Rocklin #### ReIntroducing Into This work is supported by Continuum Analytics and the XDATA Grant as part of the Blaze Project tl;dr into efficiently migrates data between formats. ## Motivation We spend a lot of time migrating data from common interchange formats, like CSV, to efficient computation formats like an array, a database or binary store. Worse, many don’t migrate data to efficient formats because they don’t know how or can’t manage the particular migration process for their tools. Your choice of data format is important. It strongly impacts performance (10x is a good rule of thumb) and who can easily use and interpret your data. When advocating for Blaze I often say “Blaze can help you query your data in a variety of formats.” This assumes that you’re able to actually get it in to that format. ## Enter the into project The into function efficiently migrates data between formats. These formats include both in-memory data structures like the following: list, set, tuple, Iterator numpy.ndarray, pandas.DataFrame, dynd.array Streaming Sequences of any of the above  as well as persistent data living outside of Python like the following: CSV, JSON, line-delimited-JSON Remote versions of the above HDF5 (both standard and Pandas formatting), BColz, SAS SQL databases (anything supported by SQLAlchemy), Mongo  The into project migrates data between any pair of these formats efficiently by using a network of pairwise conversions. (visualized towards the bottom of this post) ## How to use it The into function takes two arguments, a source and a target. It moves data in the source to the target. The source and target can take the following forms  Target Source Example Object Object A particular DataFrame or list String String 'file.csv', 'postgresql://hostname::tablename' Type Like list or pd.DataFrame So the following would be valid calls to into >>> into(list, df) # create new list from Pandas DataFrame >>> into([], df) # append onto existing list >>> into('myfile.json', df) # Dump dataframe to line-delimited JSON >>> into(Iterator, 'myfiles.*.csv') # Stream through many CSV files >>> into('postgresql://hostname::tablename', df) # Migrate dataframe to Postgres >>> into('postgresql://hostname::tablename', 'myfile.*.csv') # Load CSVs to Postgres >>> into('myfile.json', 'postgresql://hostname::tablename') # Dump Postgres to JSON >>> into(pd.DataFrame, 'mongodb://hostname/db::collection') # Dump Mongo to DataFrame Note that into is a single function. We’re used to doing this with various to_csv, from_sql methods on various types. The into api is very small; Here is what you need in order to get started: $ pip install into

>>> from into import into

See into on github

## Examples

We now show some of those same examples in more depth.

Turn list into numpy array

>>> import numpy as np
>>> into(np.ndarray, [1, 2, 3])
array([1, 2, 3])

Load CSV file into Python list

>>> into(list, 'accounts.csv')
[(1, 'Alice', 100),
(2, 'Bob', 200),
(3, 'Charlie', 300),
(4, 'Denis', 400),
(5, 'Edith', 500)]

Translate CSV file into JSON

>>> into('accounts.json', 'accounts.csv')
$head accounts.json {"balance": 100, "id": 1, "name": "Alice"} {"balance": 200, "id": 2, "name": "Bob"} {"balance": 300, "id": 3, "name": "Charlie"} {"balance": 400, "id": 4, "name": "Denis"} {"balance": 500, "id": 5, "name": "Edith"}  Translate line-delimited JSON into a Pandas DataFrame >>> import pandas as pd >>> into(pd.DataFrame, 'accounts.json') balance id name 0 100 1 Alice 1 200 2 Bob 2 300 3 Charlie 3 400 4 Denis 4 500 5 Edith ## How does it work? This is challenging. Robust and efficient conversions between any two pairs of formats is fraught with special cases and bizarre libraries. The common solution is to convert through a common format like a DataFrame, or streaming in-memory lists, dicts, etc. (see dat) or through a serialization format like ProtoBuf or Thrift. These are excellent options and often what you want. Sometimes however this can be slow, particularly when dealing with live computational systems or with finicky storage solutions. Consider for example, migrating between a numpy.recarray and a pandas.DataFrame. We can migrate this data very quickly in place. The bytes of data don’t need to change, only the metadata surrounding them. We don’t need to serialize to an interchange format or translate to intermediate pure Python objects. Consider migrating data from a CSV file to a PostgreSQL database. Using Python iterators through SQLAlchemy we rarely exceed migration speeds greater than 2000 records per second. However using direct CSV loaders native to PostgreSQL we can achieve speeds greater than 50000 records per second. This is the difference between an overnight job and a cup of coffee. However this requires that we’re flexible enough to use special code in special situations. Expert pairwise interactions are often an order of magnitude faster than generic solutions. Into is a network of these pairwise migrations. We visualize that network below: Into's decentralized migration scheme. Complex but powerful Each node is a data format. Each directed edge is a function that transforms data between two formats. A single call to into may traverse multiple edges and multiple intermediate formats. For example, we when migrate a CSV file to a Mongo database we might take the following route: • Load in to a DataFrame (pandas.read_csv) • Convert to np.recarray (DataFrame.to_records) • Then to a Python Iterator (np.ndarray.tolist) • Finally to Mongo (pymongo.Collection.insert) Alternatively we could write a special function that uses MongoDB’s native CSV loader and shortcut this entire process with a direct edge CSV -> Mongo. To find the most efficient route we weight the edges of this network with relative costs (measured ad-hoc.) We use networkx to find the shortest path before we start the migration. If for some reason an edge fails (raises NotImplementedError) we can reroute automatically. In this way we’re both efficient and robust to failure. Note that we color some nodes red. These nodes can be larger than memory. When we migrate between two red nodes (both the input and output may be larger than memory) then we limit our path to the red subgraph to ensure that we don’t blow up mid-migration. One format to note is chunks(...) like chunks(DataFrame) which is an iterable of in-memory DataFrames. This convenient meta-format allows us to use compact data structures like numpy arrays and pandas DataFrames on large data while keeping only a few tens of megabytes in memory at a time. The networked approach allows developers to write specialized code for special situations and know that this code will only be used in the right situation. This approach allows us to handle a very complex problem in an isolated and separable manner. The central dispatching system keeps us sane. ## History I wrote about into long ago in connection to Blaze. I then promptly shut up about it. This was because the old implementation (before the network approach) was difficult to extend/maintain and wasn’t ready for prime-time. I am very happy with this network. Unexpected applications very often just work and into is now ready for prime-time. It’s also available independently from Blaze, both via conda and via pip. The major dependencies are NumPy, Pandas, and NetworkX so it’s relatively lightweight for most people who read my blog. If you want to take advantage of some of the higher performing formats, like HDF5, you’ll need to install those libraries as well (pro-tip, use conda). ## How do I get started? You should download a recent version. $ pip install --upgrade git+https://github.com/ContinuumIO/into
or
\$ conda install into --channel blaze


You then might want to go through the first half of this tutorial

Or just give it a shot without reading anything. My hope is that the interface is simple enough (just one function!) that users can pick it up naturally. If you run in to issues then I’d love to hear about them at blaze-dev@continuum.io

## February 02, 2015

### Titus Brown

#### khmer development sprint: Feb 19-20 and 23-25th, 2015

Michael R. Crusoe and I are throwing a sprint!

Somewhat in the vein of last year's mini-Hackathon, Michael and I and other members of the lab are going to focus in on reviewing contributions and closing issues on the khmer project for a 5 day period.

To track the sprint, subscribe to the github issue.

Michael and I will be working ~9-5pm Eastern, Thu/Fri/Mon/Tues/Wed, Feb 19-25th (so weekend excepted), and people are welcome to drop in anytime. We are planning to focus on khmer, screed, khmer-protocols, and khmer-recipes; other lab projects (like paper pipelines or workshop materials) are fair game, too. We'll have a list of issues posted for people who are looking for a small task.

We don't have any particular physical location reserved, but you're welcome to join us at Michigan State University to hang out in person. We also plan to be fully online-enabled within the 9-5pm EDT period, and will have a Google Hangout running that anyone can join. We can also set up one-to-one video conferences and screen sharing with people who need help. (We will not be working outside of those hours: family, sanity, etc.)

• you're interested in seeing the "github flow" model in action, for scientific software (including automated tests, continuous integration, code review, and a pre-merge checklist!);
• you have a particular bug or problem you want to fix in any of our software;
• you work for or with the lab;
• you want to see some of the technical underpinnings of our approach(es) to open science;

Again, subscribe here to be kept in the loop as our plans progress. And check out our WSSPE report on last July's hackathon!

cheers, --titus

## February 01, 2015

### Titus Brown

#### Lab for Data Intensive Biology at UC Davis joins Software Carpentry as an Affiliate

We are pleased to announce that the Laboratory for Data Intensive Biology at UC Davis has joined the Software Carpentry Foundation as an Affiliate Member for three years, starting in January 2015.

"We've been long-term supporters of Software Carpentry, and Affiliate status lets us support the Software Carpentry Foundation in a tangible way," said Dr. C. Titus Brown, the lab director. "This status also gives us the opportunity to include Software Carpentry as part of a larger biological data science training program at UC Davis."

Software Carpentry is a volunteer organisation whose goal is to make scientists more productive, and their work more reliable, by teaching them basic computing skills. Founded in 1998, it runs short, intensive workshops that cover program design, version control, testing, and task automation. It has taught over 10,000 learners, and primarily uses a two-day workshop format.

In October 2014, a non-profit Software Carpentry Foundation was created to act as a governing body for the project, with Dr. Greg Wilson as the founding Executive Director. Affiliate status with the SCF provides preferential access to instructor training as well as waiving per-workshop fees.

The Laboratory for Data Intensive Biology (DIB) is newly started with Dr. Brown's move to UC Davis in January 2015. The lab is in the Department of Population Health and Reproduction in the School of Veterinary Medicine, and is part of both the Data Science Initiative and the Genome Center at UC Davis. DIB does fundamental and applied research on making use of the increasing volume and variety of biological data, and Dr. Brown currently has funding from the NIH, the USDA, and the Moore Foundation.

Read more about Software Carpentry at the Software Carpentry blog, and also see Software Carpentry's announcement of UC Davis' affiliate status.

## January 31, 2015

### The Nipy blog

#### Ugly speaking

I just skip-read this post about Python 3 by Armin Ronacher. I skip-read it because I found it so ugly.

There was a BBC sitcom called One foot in the grave. It had a character called Victor Meldrew who was always angry. His catchphrase was an enraged I can't believe it.

For some reason, this genre seems to be a standard for blog posts. The author just can't believe that X (a language or operating system) could be so stupid as to do Y.

It is a little puzzling, because I guess most people reading someone saying how stupid they are, do not say "ah yes, thank you" but "f#@& you asshole". So, if we do write in that style, we make it less likely the problem will be fixed. I suppose that's obvious to the person writing, so there must be some other thing that is attractive about labeling the other person / system / idea as a loser.

We technical types don't like to think about soft issues like tone. Like the drunk under the lamp-post we prefer technical problems, and ignore other important problems such as the way we are making our readers feel.

When I am reading a blog post of the "I can't believe it" type, I feel like someone with bad breath is shouting in my face, and my first and strong reaction is to move away and avoid that person whenever I see them. For example, I suppose I won't click on a link to blog post by Armin Ronacher again, because I predict it will make me feel bad. I am sure that is a shame for me, because I have every reason to think Mr Ronacher is competent, intelligent and well-informed. It is just I don't want to feel that way. I find it harder to work when I feel that way. I am less gentle with my friends when I feel that way.

### Randy Olson

#### Python usage survey 2014

Remember that Python usage survey that went around the interwebs late last year? Well, the results are finally out and I’ve visualized them below for your perusal. This survey has been running for two years now (2013-2014), so where we have data for both years, I’ve charted the results so we can see the changes in Python usage over time.

I’ll note that a big focus of this survey is to find out if Python users are transitioning over to Python 3, and if they aren’t, then why they aren’t making that transition. If you’re on the fence about switching from Python 2 to 3, there are some great articles out there about the key differences between the two versions and the many, many, …many advantages that Python 3 offers over Python 2.

If you want to check out the underlying data yourself, head on over to Bruno Cauet’s page.

The first big question that we all we know is: How many Python users have actually used Python 2 and 3?

As expected, nearly every Python user has used Python 2 at some point in their career. We also see good news for Python 3 over the past year: Python 3 usage increased 12 percentage points in 2014, up to nearly 3/4 of the surveyed Python users.

Of course, “writing code with Python 3 once” doesn’t mean that they actually use it regularly. The question below gets at that question more directly.

Here we see yet more good news for Python 3: As much as 10% more Python users are primarily using Python 3 than in 2013, now accounting for 1/3 of all Python users. It seems the transition to Python 3 will be slow but steady for the next few years.

The transition we see above may be caused by project managers at work. What version do Python users go to when working on their personal projects?

Here we see a more close divide between the two versions: Nearly half of all users will start up their own projects with Python 3, whereas the other half still ardently remains pro-Python 2.

Let’s break these usage patterns down by the specific version now.

Somehow, Python v. 2.5 and 2.6 are still in use in some places, but 2.7 still dominates the Python 2 landscape. We also see telltale signs that Python 3 is becoming a mature language in its own right, with users stuck in the older 3.2 and 3.3 versions.

To summarize so far: Over 2/3 of all Python users are still using Python 2, with the majority of them sitting at 2.7. This tells us that Python is still very much a divided language, with a large portion of the user base unwilling to upgrade to the latest version despite the fact that Python 2 nearly 5 years old now. (That’s ancient in programming language years!)

A common complaint of the ardent Python 2 users is that Python 3 was a huge mistake. How does that turn out in this survey?

Surprisingly, it seems the complainers are a minority: Only 12% of the Python users surveyed think Python 3 was a mistake. 1/3 of Python users think Python 3 is great (probably the ones who made the switch!), whereas over half of Python users think the Python developers could’ve made the Python 2 -> 3 transition more fluid.

So, most Python users don’t think Python 3 was a mistake, but 2 in 3 of them still haven’t made the switch. That leaves us to wonder: Why haven’t the Python 2 users made the switch yet?

Package dependencies are — by far — the most-cited reasons for refusing to switch to Python 3. Many of us rely on specific packages — such as numpy, scipy, pandas, etc. — for our day-to-day work. If the packages haven’t been upgraded to Python 3 yet, then why should we? Lucky for us, there’s a web site dedicated to tracking the packages that are Python 3 ready. Are all of your packages on there? Then maybe it’s time to make the switch.

Another common reason for refusing to switch to Python 3 is that there’s no incentive. We’re comfortable with Python 2, we know its ins and outs, and everything works fine. Why bother upgrading? If you fall in this category, I’ll point you to this article again that compares the key differences between Python 2 and 3. And don’t forget about the 2 pounds of reasons why Python 3 is a huge upgrade over Python 2.

The last two major reasons for refusing to switch to Python 3 are a little harder to address. If you have a large legacy code base or your manager simply refuses to make the upgrade, then it’s up to you to convince management (or yourself) that the upgrade is worth the work. The links I included in the above paragraph are a good start.

Finally, the survey wanted to look at cross-compatibility between Python 2 and 3.

Here we start to see more good news for the future of Python 3: We saw a significant increase in the number of users who have ported their code from Python 2 to 3.

It’s not very well-known that there Python has packages for converting code from Python 2 to 3 and Python 3 to 2. When the survey asked users whether they’d used either of these packages even once, well, see for yourself…

It seems like the Python devs need to do a better job of raising awareness of these code conversion packages.

Because Python 2 and 3 really aren’t that different, it’s not necessary to write code for just one version. As the famous taco shell commercial goes: “Por que no los dos?” (Why not both?)

To that end, the survey also polled users on whether they write Python 2- and 3- compatible code.

Apparently only 1 in 3 Python developers bother to write multi-version compatible code, and there hasn’t been much of a change since 2013.

I look forward to seeing how the Python 3 transition progresses in 2014. If you have any more resources for helping (and convincing!) Python developers to make the transition, please share them in the comments below.

# Advice for applying Machine Learning¶

This post is based on a tutorial given in a machine learning course at University of Bremen. It summarizes some recommendations on how to get started with machine learning on a new problem. This includes

• ways of visualizing your data
• choosing a machine learning method suitable for the problem at hand
• identifying and dealing with over- and underfitting
• dealing with large (read: not very small) datasets
• pros and cons of different loss functions.

The post is based on "Advice for applying Machine Learning" from Andrew Ng. The purpose of this notebook is to illustrate the ideas in an interactive way. Some of the recommendations are debatable. Take them as suggestions, not as strict rules.

In [1]:
import time
import numpy as np
np.random.seed(0)

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
Expand Code
# Modified from http://scikit-learn.org/stable/auto_examples/plot_learning_curve.html
from sklearn.learning_curve import learning_curve
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
train_sizes=np.linspace(.1, 1.0, 5)):
"""
Generate a simple plot of the test and traning learning curve.

Parameters
----------
estimator : object type that implements the "fit" and "predict" methods
An object of that type which is cloned for each validation.

title : string
Title for the chart.

X : array-like, shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and
n_features is the number of features.

y : array-like, shape (n_samples) or (n_samples, n_features), optional
Target relative to X for classification or regression;
None for unsupervised learning.

ylim : tuple, shape (ymin, ymax), optional
Defines minimum and maximum yvalues plotted.

cv : integer, cross-validation generator, optional
If an integer is passed, it is the number of folds (defaults to 3).
Specific cross-validation objects can be passed, see
sklearn.cross_validation module for the list of possible objects
"""

plt.figure()
train_sizes, train_scores, test_scores = learning_curve(
estimator, X, y, cv=5, n_jobs=1, train_sizes=train_sizes)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score")

plt.xlabel("Training examples")
plt.ylabel("Score")
plt.legend(loc="best")
plt.grid("on")
if ylim:
plt.ylim(ylim)
plt.title(title)


### Dataset¶

We will generate some simple toy data using sklearn's make_classification function:

In [4]:
from sklearn.datasets import make_classification
X, y = make_classification(1000, n_features=20, n_informative=2,
n_redundant=2, n_classes=2, random_state=0)

from pandas import DataFrame
df = DataFrame(np.hstack((X, y[:, None])),
columns = range(20) + ["class"])


Notice that we generate a dataset for binary classification consisting of 1000 datapoints and 20 feature dimensions. We have used the DataFrame class from pandas to encapsulate the data and the classes into one joint data structure. Let's take a look at the first 5 datapoints:

In [5]:
df[:5]

Out[5]:
0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 class
0 -1.063780 0.676409 1.069356 -0.217580 0.460215 -0.399167 -0.079188 1.209385 -0.785315 -0.172186 ... -0.993119 0.306935 0.064058 -1.054233 -0.527496 -0.074183 -0.355628 1.057214 -0.902592 0
1 0.070848 -1.695281 2.449449 -0.530494 -0.932962 2.865204 2.435729 -1.618500 1.300717 0.348402 ... 0.225324 0.605563 -0.192101 -0.068027 0.971681 -1.792048 0.017083 -0.375669 -0.623236 1
2 0.940284 -0.492146 0.677956 -0.227754 1.401753 1.231653 -0.777464 0.015616 1.331713 1.084773 ... -0.050120 0.948386 -0.173428 -0.477672 0.760896 1.001158 -0.069464 1.359046 -1.189590 1
3 -0.299517 0.759890 0.182803 -1.550233 0.338218 0.363241 -2.100525 -0.438068 -0.166393 -0.340835 ... 1.178724 2.831480 0.142414 -0.202819 2.405715 0.313305 0.404356 -0.287546 -2.847803 1
4 -2.630627 0.231034 0.042463 0.478851 1.546742 1.637956 -1.532072 -0.734445 0.465855 0.473836 ... -1.061194 -0.888880 1.238409 -0.572829 -1.275339 1.003007 -0.477128 0.098536 0.527804 0

5 rows × 21 columns

It is hard to get any clue of the problem by looking at the raw feature values directly, even on this low-dimensional example. Thus, there is a wealth of ways of providing more accessible views of your data; a small subset of these is discussed in the next section.

### Visualization¶

First step when approaching a new problem should nearly always be visualization, i.e., looking at your data.

Seaborn is a great package for statistical data visualization. We use some of its functions to explore the data.

First step is to generate scatter-plots and histograms using the pairplot. The two colors correspond to the two classes and we use a subset of the features and only the first 50 datapoints to keep things simple.

In [6]:
_ = sns.pairplot(df[:50], vars=[8, 11, 12, 14, 19], hue="class", size=1.5)


Based on the histograms, we can see that some features are more helpful to distinguish the two classes than other. In particular, feature 11 and 14 seem to be informative. The scatterplot of those two features shows that the classes are almost linearly separable in this 2d space. A further thing to note is that feature 12 and 19 are highly anti-correlated. We can explore correlations more systematically by using corrplot:

In [7]:
plt.figure(figsize=(12, 10))
_ = sns.corrplot(df, annot=False)


We can see our observations from before confirmed here: feature 11 and 14 are strongly correlated with the class (they are informative). They are also strongly correlated with each other (via the class). Furthermore, feature 12 is highly anti-correlated with feature 19, which in turn is correlated with feature 14. We have thus some redundant features. This can be problematic for some classifiers, e.g., naive Bayes which assumes the features being independent given the class. The remaining features are mostly noise; they are neither correlated with each other nor with the class.

Notice that data visualization becomes more challenging if you have more feature dimensions and less datapoints. We give an example for visualiszing high-dimensional data later.

### Choice of the method¶

Once we have visually explored the data, we can start applying machine learning to it. Given the wealth of methods for machine learning, it is often not easy to decide which method to try first. This simple cheat-sheet (credit goes to Andreas Müller and the sklearn-team) can help to select an appropriate ML method for your problem (see http://dlib.net/ml_guide.svg for an alternative cheat sheet).

In [8]:
from IPython.display import Image
Image(filename='ml_map.png', width=800, height=600)

Out[8]: