June 18, 2013

Josef Perkoltd

Quasi-Random Numbers with Halton Sequences

Just two quick plots.

For maximum simulated likelihood estimation and for some other cases, we need to integrate the likelihood function with respect to a distribution that reflects unobserved heterogeneity. When numeric integration is too difficult, then we can integrate by simulating the underlying distribution.

However, using random draws from the underlying distribution can be inefficient for integration, and there are several ways of speeding up the integration or of increasing the accuracy for the same amount of time.

One possibility is to use sequences that mimic random draws from the underlying distribution but have a better coverage of the underlying space, examples for low-discrepancy_sequences are Sobol and Halton sequences.
Read more »

by Josef Perktold (noreply@blogger.com) at June 18, 2013 08:02 PM

Blake Griffith

week 3 - Boolean Operations, != and == for CSR, CSC, and BSR

This week focused on implementing support for the == and != boolean comparisons for the BSR, CSC, CSR sparse matrix types. The culmination of my work this week is this pull request.

!=

For sparse matrices, the != (not equal to) generally has an output that can be efficiently represented by sparse matrices. With the exception of comparing the whole matrix with a nonzero scalar.

==

The boolean equal to comparison in general has an output that is not efficiently represented by sparse matrices, since all the zero entries return True. The exception being comparison with a nonzero scalar.

Adding a c routine for this operation with the existing code, would be problematic. The binop routines only apply the given binop to element pairs in which one of them is nonzero. So all the co-occurring zero elements would not return True like they should. Because of this, and == being in general inefficient, I did not bother to implement the == operation in sparsetools. Instead, the == operation is computed using the != operation and inverting it.

Implementation

C++ & SWIG level

Implementing the routines for != and == in the sparsetools/ was pretty easy, using the existing code. There are handy csr_binop_csr and bsr_binop_bsr functions which take c++ a functional as an argument for the binop. The only problem is that the output type is not the same as the input type. So the output is not bool by default.

Python level

In Python, != and == operations are can be overridden by defining the methods __ne__ and __eq__. I did this to handle the case wher the other object is a scalar, dense, or a sparse matrix.

I defined these methods in compressed.py which defines the base class for BSR, CSC, and CSR.

In defined these methods in base.py, here they just convert the sparse matrix to CSR and preform the desired operation.

__bool__

There was however some complication with how these worked when a dense ndarray is the first argument, ideally it would ask the sparse matrix what to do, however this was not happening whenever the sparse matrix's __bool__ (for Python3, __nonzero__ for python2.x) method returned True or False, I still don't know why. It did however work when __bool__ raised an error or returned something other than T/F. Since NumPy's ndarrays raise an error when __bool__ is called, I thought this was a reasonable change to make to sparse matrices __bool__. Although ideally, I'd like to better understand what is going on so I don't have to make an incompatible change to SciPy's API.

This change broke one test in scipy/io/matlab/tests/test_mio.py there could be more serious issues relating to this change, currently they are being discussed on the pull request

June 18, 2013 03:00 PM

June 17, 2013

Continuum Analytics

Parallel Python with IPCluster and Wakari

IPCluster provides an easy way to do parallel computing in Python, allowing you to connect to and control Python processes running on multiple compute cores, either locally or across a cluster. This introduction should take about 20 minutes and will show you how to start a cluster, access engines, and execute commands. We will use our web-based Wakari analytics environment that will allow you to get started right away, without needing to install or configure anything.

by Ian Stokes-Rees at June 17, 2013 12:00 PM

Paul Ivanov

What happens to a talk deferred...

I'll be in Austin, TX for SciPy next week, and I'll be giving an updated talk about tools for reproducible research.

What follows is the abstract of a talk that I really wanted to give and submitted to the general track which did not get in. I care deeply about this topic, and I'm hoping to have conversations with others about it. The talk was deferred to a poster, so if you don't bump into me at other times, seek me out during the poster session (mine's poster #15, 10:35 AM - 11:35 AM on June 27th). This post is intended as a pre-conference warm up to that.

Navigating the Scientific Python Communities - the missing guide.

The awful truth: our newcomers have a deluge of options to wade through as they begin their journey. What tools are available, how do I install them, how do I make them work together -- all hard questions facing a budding scipythonista.

On developer-friendly platforms, the popular approach is to just install numpy, scipy, matplotlib, and ipython using package management facilities provided by the operating system. On Mac OS X and Windows, the least painful, bootstrapping approach is to use a Python distribution like Python(X,Y), EPD, Anaconda, or Sage.

Both of these paths obscure a reality which must be stated explicitly: the development of packages is fundamentally decentralized. The scientific python ecosystem consists of a loose but thriving confederation of projects and communities.

The chaos of installation options and lack of centralization around a canonical solution, which on the surface appears to be a point of weakness that is detrimental to community growth, is a symptom of one of its greatest strengths. Namely, what we have is a direct democracy for user-developers. They vote with their feet: filing bug reports, testing pre-release versions, participating in mailing lists, writing and reviewing pull requests, and advertising the tools they use in talks, papers, and conversations at conferences.

I will discuss why this is the case, why this is a good thing, and how we can embrace it. The talk will present the ethos and expectations of an effective newcomer, as well as resources and strategies for incremental progress toward becoming a master scipythonista.

by Paul Ivanov at June 17, 2013 07:00 AM

June 15, 2013

Jake Vanderplas

Numba vs. Cython: Take 2

Last summer I wrote a post comparing the performance of Numba and Cython for optimizing array-based computation. Since posting, the page has received thousands of hits, and resulted in a number of interesting discussions. But in the meantime, the Numba package has come a long way both in its interface and its performance.

Here I want to revisit those timing comparisons with a more recent Numba release, using the newer and more convenient autojit syntax, and also add in a few additional benchmarks for completeness. I've also written this post entirely within an IPython notebook, so it can be easily downloaded and modified.

As before, I'll use a pairwise distance function. This will take an array representing M points in N dimensions, and return the M x M matrix of pairwise distances. This is a nice test function for a few reasons. First of all, it's a very clean and well-defined test. Second of all, it illustrates the kind of array-based operation that is common in statistics, datamining, and machine learning. Third, it is a function that results in large memory consumption if the standard numpy broadcasting approach is used (it requires a temporary array containing M * M * N elements), making it a good candidate for an alternate approach.

We'll start by defining the array which we'll use for the benchmarks: one thousand points in three dimensions.

In [1]:
import numpy as np
X = np.random.random((1000, 3))

Numpy Function With Broadcasting

We'll start with a typical numpy broadcasting approach to this problem. Numpy broadcasting is an abstraction that allows loops over array indices to be executed in compiled C. For many applications, this is extremely fast and efficient. Unfortunately, there is a problem with broadcasting approaches that comes up here: it ends up allocating hidden temporary arrays which can eat up memory and cause computational overhead. Nevertheless, it's a good comparison to have. The function looks like this:

In [2]:
def pairwise_numpy(X):
    return np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))
%timeit pairwise_numpy(X)
10 loops, best of 3: 111 ms per loop

Pure Python Function

A loop-based solution avoids the overhead associated with temporary arrays, and can be written like this:

In [3]:
def pairwise_python(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, M), dtype=np.float)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D
%timeit pairwise_python(X)
1 loops, best of 3: 13.4 s per loop

As we see, it is over 100 times slower than the numpy broadcasting approach! This is due to Python's dynamic type checking, which can drastically slow down nested loops. With these two solutions, we're left with a tradeoff between efficiency of computation and efficiency of memory usage. This is where tools like Numba and Cython become vital

I should note that there exist alternative Python interpreters which improve on the computational inefficiency of the Python run-time, one of which is the popular PyPy project. PyPy is extremely interesting. However, it's currently all but useless for scientific applications, because it does not support NumPy, and by extension cannot run code based on SciPy, scikit-learn, matplotlib, or virtually any other package that makes Python a useful tool for scientific computing. For that reason, I won't consider PyPy here.

Numba Wrapper

Numba is an LLVM compiler for python code, which allows code written in Python to be converted to highly efficient compiled code in real-time. Due to its dependencies, compiling it can be a challenge. To experiment with Numba, I recommend using a local installation of Anaconda, the free cross-platform Python distribution which includes Numba and all its prerequisites within a single easy-to-install package.

Numba is extremely simple to use. We just wrap our python function with autojit (JIT stands for "just in time" compilation) to automatically create an efficient, compiled version of the function:

In [4]:
from numba import double
from numba.decorators import jit, autojit

pairwise_numba = autojit(pairwise_python)

%timeit pairwise_numba(X)
1 loops, best of 3: 9.12 ms per loop

Adding this simple expression speeds up our execution by over a factor of over 1400! For those keeping track, this is about 50% faster than the version of Numba that I tested last August on the same machine.

Optimized Cython Function

Cython is another package which is built to convert Python-like statemets into compiled code. The language is actually a superset of Python which acts as a sort of hybrid between Python and C. By adding type annotations to Python code and running it through the Cython interpreter, we obtain fast compiled code. Here is a highly-optimized Cython version of the pairwise distance function, which we compile using IPython's Cython magic:

In [5]:
%load_ext cythonmagic
In [6]:
%%cython
import numpy as np
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise_cython(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = sqrt(d)
    return np.asarray(D)
In [7]:
%timeit pairwise_cython(X)
100 loops, best of 3: 9.87 ms per loop

The Cython version, despite all the optimization, is a few percent slower than the result of the simple Numba decorator! I should emphasize here that I have years of experience with Cython, and in this function I've used every Cython optimization there is (if any Cython super-experts are out there and would like to correct me on that, please let me know in the blog comment thread!) By comparison, the Numba version is a simple, unadorned wrapper around plainly-written Python code.

Fortran/F2Py

Another option for fast computation is to write a Fortran function directly, and use the f2py package to interface with the function. We can write the function as follows:

In [8]:
%%file pairwise_fort.f

      subroutine pairwise_fort(X,D,m,n)
          integer :: n,m
          double precision, intent(in) :: X(m,n)
          double precision, intent(out) :: D(m,m) 
          integer :: i,j,k
          double precision :: r 
          do i = 1,m 
              do j = 1,m 
                  r = 0
                  do k = 1,n 
                      r = r + (X(i,k) - X(j,k)) * (X(i,k) - X(j,k)) 
                  end do 
                  D(i,j) = sqrt(r) 
              end do 
          end do 
      end subroutine pairwise_fort
Writing pairwise_fort.f

We can then use the shell interface to compile the Fortran function. In order to hide the output of this operation, we direct it into /dev/null (note: I tested this on Linux, and it may have to be modified for Mac or Windows).

In [9]:
# Compile the Fortran with f2py.
# We'll direct the output into /dev/null so it doesn't fill the screen
!f2py -c pairwise_fort.f -m pairwise_fort > /dev/null

We can import the resulting code into Python to time the execution of the function. To make sure we're being fair, we'll first convert the test array to Fortran-ordering so that no conversion needs to happen in the background:

In [10]:
from pairwise_fort import pairwise_fort
XF = np.asarray(X, order='F')
%timeit pairwise_fort(XF)
100 loops, best of 3: 16.7 ms per loop

The result is nearly a factor of two slower than the Cython and Numba versions.

Now, I should note here that I am most definitely not an expert on Fortran, so there may very well be optimizations missing from the above code. If you see any obvious problems here, please let me know in the blog comments.

Scipy Pairwise Distances

Because pairwise distances are such a commonly used application in scientific computing, both Scipy and scikit-learn have optimized routines to compute them. The Scipy version is a Python wrapper of C code, and can be called as follows:

In [11]:
from scipy.spatial.distance import cdist
%timeit cdist(X, X)
100 loops, best of 3: 12.9 ms per loop

cdist is about 50% slower than Numba.

Scikit-learn Pairwise Distances

Scikit-learn contains the euclidean_distances function, works on sparse matrices as well as numpy arrays, and is implemented in Cython:

In [12]:
from sklearn.metrics import euclidean_distances
%timeit euclidean_distances(X, X)
10 loops, best of 3: 35.6 ms per loop

euclidean_distances is several times slower than the Numba pairwise function on dense arrays.

Comparing the Results

Out of all the above pairwise distance methods, unadorned Numba is the clear winner, with highly-optimized Cython coming in a close second. Both beat out the other options by a large amount.

As a summary of the results, we'll create a bar-chart to visualize the timings:

Edit: I changed the "fortran" label to "fortran/f2py" to make clear that this is not raw Fortran.

In [13]:
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type &aposhelp(pylab)&apos.
In [14]:
labels = ['python\nloop', 'numpy\nbroadc.', 'sklearn', 'fortran/\nf2py', 'scipy', 'cython', 'numba']
timings = [13.4, 0.111, 0.0356, 0.0167, 0.0129, 0.00987, 0.00912]
x = np.arange(len(labels))

ax = plt.axes(xticks=x, yscale='log')
ax.bar(x - 0.3, timings, width=0.6, alpha=0.4, bottom=1E-6)
ax.grid()
ax.set_xlim(-0.5, len(labels) - 0.5)
ax.set_ylim(1E-3, 1E2)
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda i, loc: labels[int(i)]))
ax.set_ylabel('time (s)')
ax.set_title("Pairwise Distance Timings")
Out[14]:
<matplotlib.text.Text at 0x4174810>

Note that this is log-scaled, so the vertical space between two grid lines indicates a factor of 10 difference in computation time!

When I compared Cython and Numba last August, I found that Cython was about 30% faster than Numba. Since then, Numba has had a few more releases, and both the interface and the performance has improved. On top of being much easier to use (i.e. automatic type inference by autojit) it's now about 50% faster, and is even a few percent faster than the Cython option.

And though I've seen similar things for months, I'm still incredibly impressed by the results enabled by Numba: a single function decorator results in a 1300x speedup of simple Python code. I'm becoming more and more convinced that Numba is the future of fast scientific computing in Python.

This post was written entirely as an IPython notebook. The full notebook can be downloaded here, or viewed statically on nbviewer

by Jake Vanderplas at June 15, 2013 03:00 PM

June 13, 2013

Matthieu Brucher

Book review: Numpy Beginner’s Guide

I had the opportunity from Packt Publishing to review the second edition of Numpy Beginner’s Guide. Many thanks to the publisher for this and let’s go to the review.

Content and opinions

The first chapter introduces the reader to the scientific ecosystem. The main modules are covered, but there are some small mistakes (ipython –pylab does not import Matplotlib, Numpy ans Scipy, just a subset of the first that gives a Matlab-like interface aith some renamed Numpy functions) and what I think is a bad habit (importing everything from Pylab and using it as is). Nothing is lost because in the second chapter, Numpy is properly introduced and imported explicitly. There is a link missing between the two chapters because I didn’t understand why Numpy was used this way and not with pylab.

So the second chapter is about the core Numpy data structure: the multidimensional array. The author browses through different ways of creating them, by stacking them, flattening them… The next chapter deals with universal functions, or put it in an other way, functions that run on array element by elements. There are many different way to do so, with specifics that are tackled properly.

Before the chapter of matrices, there is a useful chapter on correlation, convolution and polynomials. Although one may want to go up to Scipy for signal processing, more often than not, Numpy is enough. There enough examples to understand how everything work. Then, the much hated matrix class is introduced. There are many discussions online on whether this class is actually useful, and I won’t delve on this. Suffice to say that the power of this class can be seen in the examples.

The following chapter is about the different submodules inside Numpy: linear algebra, fft, random number. The proper pointers to Scipy are provided, as well as explicit samples. I only can regret the time shifting example is not perfect, as a filter is applied on the amplitude and the inverse FFT is applied on the amplitudes only. So the transformed signal loses all the phase information, which may be why the image is similar but not that similar to the original one.

The seventh chapter tackles different extra functions, mainly finance-related, and I have to say that I don’t know their use enough to comment. Of course, this is why they are in the middle of the book and not in the first pages as the other Numpy functions. Still, there are some useful functions here (some I didn’t know about), as sorting, searching…

When one code scientific code, one often forget about testing. And Numpy has a nice module for scientific testing. It is nice to know that this aspect of science is not forgotten here and has a proper introduction. (also don’t forget about version control!)

The last three chapters introduces additional modules. The first one was partial addressed before, Matplotlib, and if you need something more advanced, there is always the Matplotlib book also published by Packt. Then there are some examples with Scipy, Scikits (soon a new book on Machine Learning with scikit-learn will be available, also by Packt, for which I was a technical reviewer, and it is really great) and other tools. The final chapter is about Pygame, but I don’t code games 100% in Python, so I didn’t really read it!

Conclusion

It’s hard to be mad at the author for the import issue. But I find it also difficult not to, as the philosophy changes depending on the chapter without saying why. Still, for an introductory book to Numpy, it is great if not excellent. A lot of simple examples, a lot of checks, the good pointers to write efficient code…

by Matt at June 13, 2013 07:59 AM

June 10, 2013

PythonXY

Python(x, y) 2.7.5.0 Released!

Hi All,

I'm happy to announce that Python(x, y) 2.7.5.0 is available for immediate download. It has been over 6 months since the previous release. Hopefully the amount of changes and updates offsets the long wait.

As always this release can be downloaded from any of the mirrors.

Major Changes

  • A dozen new plugins were added - mahotas, cffi, lxml, paramiko, Babel, PycURL and others. 
  • Eliminated most duplicate dependencies between packages - HDF5, zlib, bzip2, libpng, jpeg, tiff etc. These are part of their respective python wrappers or grouped under base_libraries.
  • base_python plugin created to host numerous infrastructure, python3 compatibility and utility packages (six, decorator, curses, pep8 etc).
  • Image manipulation was greatly improved with the addition of FreeImage python wrappers and replacing PIL with pillow.
  • WindowsXP is no longer officially supported. HDF5 cannot be compiled to be thread safe on it.

Notable Plugin Changes:

  • Sypder - updated to 2.2.0
  • ETS updated to 4.3.0.
  • VTK - updated to 5.10.1, Enabled: TextAnalysis, QVTK, WrapSIP, Oggrtheora,
  • ITK - updated to v4.4.0, Enabled: VTKGlue.
  • NetCDF4 - Enabled support for DAP and HDF4.
  • lxml - libxml & libxslt rebuilt with iconv, ICU, zlib and LZMA.
  • Numpy, Scipy and IPython updated.

The full change log can be viewed here.

Please post your comments and suggestions to the mailing list.

-Gabi Davar

by Gabi Davar (noreply@blogger.com) at June 10, 2013 11:06 PM

June 06, 2013

Continuum Analytics

Realtime Analytics on Streaming Twitter Data

From weather readings to financial tickers, many interesting data come in the form of streams. An issue with streaming data analysis in IPython is that one often has to either be collecting the data from the stream, or analyzing it, but not both. In this tutorial, I demonstrate a single-notebook IPython workflow for simultaneous collection and analysis of realtime streaming data:

by Clayton Davis at June 06, 2013 10:00 AM

June 04, 2013

Blake Griffith

SciPy Workflow

There are a few things that have taken some getting used to in working with SciPy. These things are aggregated here:

Keeping my SciPy repo up-to-date

I need to upadate my copy of SciPy regularly from upstream, to avoid merge conflicts. I have done this plenty of times, but for some reason I forget how every time. There is a good howto on stackoverflow on this. From said article:

# Add the remote, call it "upstream":

git remote add upstream git://github.com/whoever/whatever.git

# Fetch all the branches of that remote into remote-tracking branches,
# such as upstream/master:

git fetch upstream

# Make sure that you're on your master branch:

git checkout master

# Rewrite your master branch so that any commits of yours that
# aren't already in upstream/master are replayed on top of that
# other branch:

git rebase upstream/master

Pushing new local branches to a remote Git repo

Every time a make a new local branch, I forget how to push it to github as new branch. Google always brings me to this article. Basically if I have a new local branch call it new-feature and want to push it to my github repo and create a new new-feature branch there too. I do:

git push -u origin new-feature

and if I wanted to delete this remote branch, I could do:

git push origin :new-feature

and delete it locally with:

git branch -D new-feature

SciPy's commit messages

SciPy require commit messages start with some standard acronyms.

API: an (incompatible) API change
BLD: change related to building numpy
BUG: bug fix
DEP: deprecate something, or remove a deprecated object
DEV: development tool or utility
DOC: documentation
ENH: enhancement
MAINT: maintenance commit (refactoring, typos, etc.)
REV: revert an earlier commit
STY: style fix (whitespace, PEP8)
TST: addition or modification of tests
REL: related to releasing numpy

The acronyms are easy to remember, but adding the acronyms is easy to forget. Which brings me to my next topic.

Changing commit messages, after you have pushed

This is not the simplest thing to do in git, fortunately my mentor Pauli showed me a great way to do it. Assuming no one has pulled from your repo yet.

  • First do git log and look for the last COMMIT befor the commit messages you want to change.

  • Do git rebase -i COMMIT

  • Replace pick with r

  • Then Git will prompt you to change the commit messages one at a time.

  • Push your changes with git push -f

June 04, 2013 12:36 PM

June 02, 2013

Blake Griffith

Week 1 - bool dtype support

I got accepted to Google Summer of Code! Thank you everyone who has helped me get this far, and Amber for making me a congratulatory google cake!

google cake

On my proposal timeline I gave myself 2 weeks to add support to sparse matrices for dtype=bool. So far, dtype=bool works implementation was not too hard, and most of my time was spent figuring out how SWIG works. However there is still a lot of work to do in testing bool support. This will be more time consuming than it should be because the sparse test suite is messy.

bool dtype implementation

sparsetools

The sparsetools/ directory contains several *.cxx, *.h, *.i, and *.py files. The header files (*.h) contain C++ routines these are wrapped by SWIG according to instructions in the *.i interface files. SWIG uses the header and interface files to generate python *.py files C++ code in *.cxx files. Once these *.cxx files are compiled and linked the functions they define can be called by the generated *.py files.

Where do I add bool support?

Contained in sparsetools/ is a file called complex_ops.h which translates numpy's complex types to C++ classes. Well, that is what I want to do, except for numpy's bool types instead! My new file bool_ops.h is much simpler though. It basically does

typedef npy_int8 npy_bool_wrapper;

Yep that's it, (for now anyway). This file basically says to treat the npy_bool_wrapper type like the npy_int8. Because boolean algebra is almost integer algebra.

npy_bool_wrapper is used in numpy.i to define a typemap from numpy's npy_bool type to our npy_bool_wrapper.

In sparsetools.i we have to declare the new data type and add it to the INSTANTIATE_ALL macro which is used in all the *.i files to call the functions we want from the *.h files.

What else?

Adding bool to supported_dtypes in sputils.py prevents it from upcast. Then tests need to be added, but this isn't too easy. The test suite does not have a way to take some general data or dtype. So I am going through adding tests one-by-one where it seems appropriate. I'm also trying to add tests for other dtypes because there don't seem to be any tests for anything other than float64 and int64.

While I was doing this I discovered a few bugs relating to how sparse handles uint dtypes.

Once I'm done adding tests and fixing what ever does not pass, I'll move on to the next stage in my proposal. Adding support for boolean operations.

June 02, 2013 06:05 AM

June 01, 2013

Jake Vanderplas

IPython Notebook: Javascript/Python Bi-directional Communication

I've been working with javascript and the IPython notebook recently, and found myself in need of a way to pass data back and forth between the Javascript runtime and the IPython kernel. There's a bit of information about this floating around on various mailing lists and forums, but no real organized tutorial on the subject. Partly this is because the tools are relatively specialized, and partly it's because the functionality I'll outline here is planned to be obsolete in the 2.0 release of IPython.

Nevertheless, I thought folks might be interested to hear what I've learned. What follows are a few basic examples of moving data back and forth between the IPython kernel and the browser's javascript. Note that if you're viewing this statically (i.e. on the blog or on nbviewer) then the javascript calls below will not work: to see the code in action, you'll need to download the notebook and open it in IPython.

Executing Python Statements From Javascript

The key functionality needed for interaction between javascript and the IPython kernel is the kernel object in the IPython Javascript package. A python statement can be executed from javascript as follows:

var kernel = IPython.notebook.kernel;
kernel.execute(command);

where command is a string containing python code.

Here is a short example where we use HTML elements and javascript callbacks to execute a statement in the Python kernel from Javascript, using the kernel.execute command:

In [1]:
from IPython.display import HTML

input_form = """
<div style="background-color:gainsboro; border:solid black; width:300px; padding:20px;">
Variable Name: <input type="text" id="var_name" value="foo"><br>
Variable Value: <input type="text" id="var_value" value="bar"><br>
<button onclick="set_value()">Set Value</button>
</div>
"""

javascript = """
<script type="text/Javascript">
    function set_value(){
        var var_name = document.getElementById('var_name').value;
        var var_value = document.getElementById('var_value').value;
        var command = var_name + " = '" + var_value + "'";
        console.log("Executing Command: " + command);
        
        var kernel = IPython.notebook.kernel;
        kernel.execute(command);
    }
</script>
"""

HTML(input_form + javascript)
Out[1]:
Variable Name:
Variable Value:

After pressing above with the default arguments, the value of the variable foo is set in the Python kernel, and can be accessed from Python:

In [2]:
print foo
bar

Examining the code, we see that when the button is clicked, the set_value() function is called, which constructs a simple Python statement assigning var_value to the variable given by var_name. As mentioned above, the key to interaction between Javascript and the notebook kernel is to use the IPython.notebook.kernel.execute() command, passing valid Python code in a string. We also log the result to the javascript console, which can be helpful for Javascript debugging.

Accessing Python Output In Javascript

Executing Python statements from Javascript is one thing, but we'd really like to be able to do something with the output.

In order to process the output of a Python statement executed in the kernel, we need to add a callback function to the execute statement. The full extent of callbacks is a bit involved, but the first step is to set a callback which does something with the output attribute.

To set an output, we pass a Javascript callback object to the execute call, looking like this:

var kernel = IPython.notebook.kernel;
function callback(out_type, out_data){
    // do_something
}
kernel.execute(command, {"output": callback});

Using this, we can execute a Python command and do something with the result. The python command can be as simple as a variable name: in this case, the value returned is simply the value of that variable.

To demonstrate this, we'll first import pi and sin from the math package in Python:

In [3]:
from math import pi, sin

And then we'll manipulate this value via Javascript:

In [4]:
# Add an input form similar to what we saw above
input_form = """
<div style="background-color:gainsboro; border:solid black; width:600px; padding:20px;">
Code: <input type="text" id="code_input" size="50" height="2" value="sin(pi / 2)"><br>
Result: <input type="text" id="result_output" size="50" value="1.0"><br>
<button onclick="exec_code()">Execute</button>
</div>
"""

# here the javascript has a function to execute the code
# within the input box, and a callback to handle the output.
javascript = """
<script type="text/Javascript">
    function handle_output(out_type, out){
        console.log(out_type);
        console.log(out);
        var res = null;
         // if output is a print statement
        if(out_type == "stream"){
            res = out.data;
        }
        // if output is a python object
        else if(out_type === "pyout"){
            res = out.data["text/plain"];
        }
        // if output is a python error
        else if(out_type == "pyerr"){
            res = out.ename + ": " + out.evalue;
        }
        // if output is something we haven't thought of
        else{
            res = "[out type not implemented]";   
        }
        document.getElementById("result_output").value = res;
    }
    
    function exec_code(){
        var code_input = document.getElementById('code_input').value;
        var kernel = IPython.notebook.kernel;
        var callbacks = {'output' : handle_output};
        document.getElementById("result_output").value = "";  // clear output box
        var msg_id = kernel.execute(code_input, callbacks, {silent:false});
        console.log("button pressed");
    }
</script>
"""

HTML(input_form + javascript)
Out[4]:
Code:
Result:

Pressing above will call kernel.execute with the contents of the Code box, passing a callback which displays the result in the result box.

The reason the callback has so many conditionals is because there are several types of outputs we need to handle. Note that the output handler is given as the output attribute of a Javascript object, and passed to the kernel.execute function. Again, we use console.log to allow us to inspect the objects using the Javascript console.

Application: An On-the-fly Matplotlib Animation

In a previous post I introduced a javascript viewer for matplotlib animations. This viewer pre-computes all the matplotlib frames, embeds them in the notebook, and offers some tools to view them.

Here we'll explore a different strategy: rather than precomputing all the frames before displaying them, we'll use the javascript/python kernel communication and generate the frames as needed.

Note that if you're viewing this statically (e.g. in nbviewer or on my blog), it will be relatively unexciting: with no IPython kernel available, calls to the kernel will do nothing. To see this in action, please download the notebook and open it with a running IPython notebook instance.

In [6]:
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline].
For more information, type &aposhelp(pylab)&apos.
In [7]:
from IPython.display import HTML
from cStringIO import StringIO

# We'll use HTML to create a control panel with an
# empty image and a number of navigation buttons.

disp_html = """
<div class="animation" align="center">
<img id="anim_frame" src=""><br>
<button onclick="prevFrame()">Prev Frame</button>
<button onclick="reverse()">Reverse</button>
<button onclick="pause()">Pause</button>
<button onclick="play()">Play</button>
<button onclick="nextFrame()">Next Frame</button>
</div>
"""

# now the javascript to drive it.  The nextFrame() and prevFrame()
# functions will call the kernel and pull-down the frame which
# is generated.  The play() and reverse() functions use timeouts
# to repeatedly call nextFrame() and prevFrame().

javascript = """
<script type="text/Javascript">
var count = -1;  // keep track of frame number
var animating = 0;  // keep track of animation direction
var timer = null;
var kernel = IPython.notebook.kernel;

function output(out_type, out){
    data = out.data["text/plain"];
    document.getElementById("anim_frame").src = data.substring(1, data.length - 1);
    
    if(animating > 0){
        timer = setTimeout(nextFrame, 0);
    }
    else if(animating < 0){
        timer = setTimeout(prevFrame, 0);
    }
}

var callbacks = {'output' : output};

function pause(){
    animating = 0;
    if(timer){
        clearInterval(timer);
        timer = null;
    }
}

function play(){
    pause();
    animating = +1;
    nextFrame();
}

function reverse(){
    pause();
    animating = -1;
    prevFrame();
}

function nextFrame(){
    count += 1;
    var msg_id = kernel.execute("disp._get_frame_data(" + count + ")", callbacks, {silent:false});
}

function prevFrame(){
    count -= 1;
    var msg_id = kernel.execute("disp._get_frame_data(" + count + ")", callbacks, {silent:false});
}

// display the first frame
setTimeout(nextFrame, 0);

</script>
"""

# Here we create a class whose HTML representation is the above
# HTML and javascript.  Note that we've hard-coded the global
# variable name `disp` in the Javascript, so you'll have to assign
# the resulting object to this name in order to view it.

class DisplayAnimation(object):
    def __init__(self, anim):
        self.anim = anim
        self.fig = anim._fig
        plt.close(self.fig)
        
    def _get_frame_data(self, i):
        anim._draw_frame(i)
        buffer = StringIO()
        fig.savefig(buffer, format='png')
        buffer.reset()
        data = buffer.read().encode('base64')
        return "data:image/png;base64,{0}".format(data.replace('\n', ''))
    
    def _repr_html_(self):
        return disp_html + javascript

This code should be considered a proof-of-concept: in particular, it requires the display object to be named disp in the global namespace. But making it more robust would be a relatively simple process.

Here we'll test the result by creating a simple animation and displaying it dynamically:

In [8]:
from matplotlib import animation

fig = plt.figure()
ax = plt.axes(xlim=(0, 10), ylim=(-2, 2))
line, = ax.plot([], [], lw=2)

def init():
    line.set_data([], [])
    return line,

def animate(i):
    x = np.linspace(0, 10, 1000)
    y = np.cos(i * 0.02 * np.pi) * np.sin(x - i * 0.02 * np.pi)
    line.set_data(x, y)
    return line,

anim = animation.FuncAnimation(fig, animate, init_func=init,
                               frames=100, interval=30)

# For now, we need to name this `disp` for it to work
disp = DisplayAnimation(anim)
disp
Out[8]:

Once again, if you're viewing this statically, you'll see nothing above the buttons. The kernel needs to be running in order to see this: you can download the notebook and run it to see the results (To see a statically-viewable version of the animation, refer to the previous post). But I assure you, it works! We've created an animation viewer which uses bi-directional communication between javascript and matplotlib to generate the frames in real-time.

Note that this is still rather limited, and should be considered a proof-of-concept more than a finished result. In particular, on my four-year-old linux box, I can only achieve a frame-rate of about 10 frames/sec. Part of this is due to the reliance on png images saved within matplotlib, as we can see by profiling the function:

In [10]:
def save_to_mem(fig):
    buffer = StringIO()
    fig.savefig(buffer, format='png')
    buffer.reset()
    data = buffer.read().encode('base64')
    return "data:image/png;base64,{0}".format(data.replace('\n', ''))

fig, ax = plt.subplots()
ax.plot(rand(200))
Out[10]:
[<matplotlib.lines.Line2D at 0x3a26f50>]
In [11]:
%timeit save_to_mem(fig)
10 loops, best of 3: 54.2 ms per loop

That's a cap of about 20 frames per second on matplotlib's end, not including the time required to serialize and send the data, or to render each frame on the page. I'm certain there are much better ways to do this particular application, but they would take a bit more thought.

I hope this post was helpful to you, as unpolished as the results are, and please let me know if you have ideas about how to do this more effectively! Also, keep in mind that Javascript support should be improving immensely in IPython 2.0, which (according to the roadmap) should be released in December of 2013. At that point I may have more to say on the subject!

This post was composed entirely in IPython notebook. The source notebook can be downloaded here

by Jake Vanderplas at June 01, 2013 05:00 PM

May 31, 2013

Enthought

Avoiding “Excel Hell!” using a Python-based Toolchain

Didrik Pinte gave an informative, provocatively-titled presentation at the second, in-person New York Quantitative Python User’s Group (NY QPUG) meeting earlier this month. There are a lot of examples in the press of Excel workflow mess-ups and spreadsheet errors contributing to some eye-popping mishaps in the finance world (e.g. JP Morgan’s spreadsheet issues may have [...]

by mtranby at May 31, 2013 08:00 PM

Fernando Perez

Exploring Open Data with Pandas and IPython at the Berkeley I School

"Working with Open Data", a course by Raymond Yee

This will be a guest post, authored by Raymond Yee from the UC Berkeley School of Information (or I School, as it is known around here). This spring, Raymond has been teaching a course titled "Working with Open Data", where students learn how to work with openly available data sets with Python.

Raymond has been using IPython and the notebook since the start of the course, as well as hosting lots of materials directly using github. He kindly invited me to lecture in his course a few weeks ago, and I gave his students an overview of the IPython project as well as our vision of reproducible research and of building narratives that are anchored in code and data that are always available for inspection, discussion and further modification.

Towards the end of the course, his students had to develop a final project, organizing themselves in groups of 2-4 and producing a final working system that would use open datasets to produce an interesting analysis of their choice.

I recently had the chance to see the final projects, and I have to say that I walked out blown away by the results. This is a group of students who don't come from traditionally computationally-intensive backgrounds, as the only requirement was some basic Python experience. And in a matter of a few weeks, they created very compelling tools, that dove into different problem domains from health care to education and even sports, producing interesting results complete with a narrative, plots and even interactive JavaScript controls and SVG output elements. Keep in mind that the tools to do some of this stuff aren't even really documented or explained much in IPython yet, as we haven't really dug into that part in earnest (that is our Fall 2013 plan).

The students obviously did run into some issues, and I took notes on what we can do to improve the situtaion. We had a follow-up meeting on campus where we gave them pointers on how to do certain things more easily. But to me, these results validate the idea that we can construct computational narratives based on code and data that will ultimately lead to a more informed discourse.

I am very much looking forward to future collaborations with Raymond: he has shown that we can create an educational experience around data-driven discovery, using IPython, pandas, matplotlib and the rest of the SciPy stack, that is completely accessible to students without major computational training, and that lets them produce interesting results around socially interesting and relevant datasets.

I will now pass this on to Raymond, so he can describe a little bit more about the course, challenges they encountered, and highlight some of the course projects that are already available on github for further forking and collaboration.

As always, this post is available as an IPython notebook. The rest of the post is Raymond's authorship.

The course

Over a 15-week period, my students and I met twice a week to study open data, using Python to access, process, and interpret that data. Twenty-four students completed the course: 17 Masters students from the I School and 7 undergraduates from electrical engineering/computer science, statistics, and business.

We covered about half of our textbook Python for Data Analysis by Wes McKinney. Accordingly, a fair amount of our energy was directed to studying the pandas library. The prerequisite programming background was the one semester Python minimum requirement for I School Masters students. The students learned a lot (while having a good time overall, so I'm told) about how to program pandas and use the IPython notebook by working through a series of assignments. Students filled in missing code in IPython notebooks I had created to illustrate various techniques for data analysis. (Most of the resources for the course are contained in the course github repository.)

My students and I were particularly grateful for the line-up of guest speakers (which included Fernando Perez), who shared their expertise on topics ranging from scraping the web to archive legal cases, the UC Berkeley Library Data Lab, open data in archaeology, open access web crawling, and scientific reproducibility.

Final projects

The culmination of the course came in the final projects, where groups of two to four students designed and implemented data analysis projects. The final deliverable for the course was an IPython notebook that was expected to contain the following attributes:

  • a clear articulation of the problem being addressed in the project
  • a clear description of what was originally proposed and what the students ended up doing, describing what led the students to go from the proposed work to the final outcome
  • a thorough description of what was behind-the-scenes: the sources of data from which the students drew and the code they wrote to analyze the data
  • a clear description of the results and precise details on how to reproduce the results
  • if the students were to continue their project beyond the course, what would be this future work
  • a paragraph outlining how the work was split among group members.

The students and I welcomed enthusiastic visitors to the Open House -- in which the I School community and, in fact, the larger campus community was invited to attend.

Working with Open Data 2013 Open House

Working with Open Data 2013 Open House

Here are abstracts for the eight projects, where each screenshot is a link to its IPython notebook. Enjoy!

Stock Performance of Product Releases
Edward Lee, Eugene Kim

Drawing connections for open data available pertaining to Apple in order to examine how Apple's stock performance was impacted by a certain product. We examine Wikipedia data for detailed information on Apple's product releases, make use of Yahoo Finance's API for specific stock performance metrics, and openly available Form 10-Q's for internal (financial) changes to Apple. The main purpose is to examine available data to draw new conclusions centered on the time around the product release date.

Education First
Carl Shan, Bharathkumar Gunasekaran, Haroon Rasheed Paul Mohammed, Sumedh Sawant

Most parents nowadays have a general sense of the significant factors for choosing a school for their children. However, with a lack of existing tools and information sources, most of these parents have a hard time measuring, weighing and comparing these factors in relation to geographical areas when they are trying to pick the best place to live in with the best schools.

Thus, our team aims to address this problem by visualizing the statistical data from the NCES with geo-data to help the parents through the process of picking the best area to live in with the best schools. Parents can specify exactly what parameters they consider important in their decision process and we will generate a heat-map of the state they’re interested in living in and dynamically color it according to how closely each county matches their preferences. The heat map will be displayed with a web browser.

The League of Champions
Natarajan Chakrapani, Mark Davidoff, Kuldeep Kapade

In the soccer world, there is a lot of money involved in transfer of players in the premier leagues around the world. Focusing on the English Premier League, our project - “The League Of Champions” aims to analyze the return on investment on soccer transfer done by teams in the English premier league. It aims to measure club return on each dollar spent on their acquired players for a season, on parameters like Goals scored, active time on the field, assists etc. In addition, we also look to analyze how big a factor player age is, in commanding a high transfer fee, and if clubs prefer to pay large amounts for specialist players in specific field positions.

All About TEDx
Chan Kim, JT Huang

TED is a nonprofit devoted to Ideas Worth Spreading. It started out (in 1984) as a conference bringing together people from three worlds: Technology, Entertainment, Design. The TED Open Translation Project brings TED Talks beyond the English-speaking world by offering subtitles, interactive transcripts and the ability for any talk to be translated by volunteers worldwide. The project was launched with 300 translations, 40 languages and 200 volunteer translators; now, there are more than 32,000 completed translations from the thousands-strong community. The TEDx program is designed to give communities the opportunity to stimulate dialogue through TED-like experiences at the local level.

Our project wants to encourage people to translate TEDx Talk as well by showing how TEDx Talk videos are translated and spreaded among different languages, places and topics, and comparing the spreading status with TED Talk videos.

Environmental Health Gap
Rohan Salantry, Deborah Linton, Alec Hanefeld, Eric Zan

There is growing evidence to support environmental factors trigger chronic diseases such as asthma that result in billions in health care costs. However a gap in knowledge exists concerning the extent of the link and where it is more prevalent. We aim to create a framework for closing this gap by integrating health and environmental condition data sets. Specifically, the project will link emissions data from the EPA and the California Department of Public Health in an attempt to find a correlation between incidences of asthma treatments and emissions seen as triggers for asthma.

The project hopes to be a stepping stone for policy decisions concerning the value tradeoff between health care treatment and environmental regulation as well as where to concentrate resources based on severity of need.

World Bank Data Analysis
Aisha Kigongo, Sydney Friedman, Ignacio Pérez

Our goal was to use a variety of tools to investigate the impact of project funding in developing countries. In order to do so, we looked at open data from the World Bank, which keeps a strong track of every project that gets funded, who funds it, and the goal of the project whether agricultural, economic or related to health. By using Python, we used an index of the World Bank to see where the most funded countries were and how they related to various indicators such as the Human Development Index, the Freedom Index, and for the future, health, educational and other economic indexes. Our secondary goal is to analyze what insight open data can give us as to how effective initiatives and funding actually is as opposed to what it’s meant to be.

Dr. Book
AJ Renold, Shohei Narron, Alice Wang

When we read a book, all the information is contained in that resource. But what if you could learn more about a concept, historical figure, or location presented in a chapter? Dr. Book expands your reading experience by connecting people, places, topics and concepts within a book to render a webpage linking these resources to Wikipedia.

Book Hunters
Fred Chasen, Luis Aguilar, Sonali Sharma

When we search for books on the internet we are often overwhelmed with results coming from various sources. It’s difficult to get direct trusted urls to books. Project Gutenberg, HathiTrust and Open Library all provide an extensive library of books online, each with their own large repository titles. By combining their catalogs, Book Hunters enables querying for a book across those different sources, our project will highlight key statistics about the three datasets. These statistics include: number of books in all the three data sources, formats, language, publishing date. Apart from that we will ask users to search for a particular book of interest and we will return combined results from all the three resources and also provide the direct link to the pdf, text or epub format of the book. This will be an exercise to filter out results for the users and provide them with easy access to the books that they are looking for.

by Fernando Perez (noreply@blogger.com) at May 31, 2013 01:00 AM

May 30, 2013

Continuum Analytics

Follow the Money with IOPro

One of IOPro’s more advanced features is the ability to work with csv data directly from Amazon’s S3 service. Combining this with Wakari’s notebook sharing feature gives us a powerful way to collaborate on data analysis using data stored in the cloud.

by Jay Bourque at May 30, 2013 11:01 AM

May 29, 2013

Jake Vanderplas

A Simple Animation: The Magic Triangle

I've been spending a lot of time lately playing with animations in IPython notebook. Here's my latest one - a slightly puzzling rearrangement of shapes into two different triangles:

Where does the extra block go? Look close and you might be able to figure it out.

I created the animation using just a few lines of matplotlib. Here we'll display it using the Javascript Animation widget that I wrote:

In [1]:
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline].
For more information, type &aposhelp(pylab)&apos.
In [2]:
# JS Animation import is available at http://github.com/jakevdp/JSAnimation
from JSAnimation.IPython_display import display_animation
from matplotlib import animation

# Set up the axes, making sure the axis ratio is equal
fig = plt.figure(figsize=(6.5, 2.5))
ax = fig.add_axes([0, 0, 1, 1], xlim=(-0.02, 13.02), ylim=(-0.02, 5.02),
                  xticks=range(14), yticks=range(6), aspect='equal', frameon=False)
ax.grid(True)

# Define the shapes of the polygons
P1 = np.array([[0, 0], [5, 0], [5, 2], [0, 0]])
P2 = np.array([[0, 0], [8, 0], [8, 3], [0, 0]])
P3 = np.array([[0, 0], [5, 0], [5, 1], [3, 1], [3, 2], [0, 2], [0, 0]])
P4 = np.array([[0, 1], [3, 1], [3, 0], [5, 0], [5, 2], [0, 2], [0, 1]])

# Draw the empty polygons for the animation
kwds = dict(ec='k', alpha=0.5)
patches = [ax.add_patch(plt.Polygon(0 * P1, fc='g', **kwds)),
           ax.add_patch(plt.Polygon(0 * P2, fc='b', **kwds)),
           ax.add_patch(plt.Polygon(0 * P3, fc='y', **kwds)),
           ax.add_patch(plt.Polygon(0 * P4, fc='r', **kwds))]

# This function moves the polygons as a function of the frame i
Nframes = 30
def animate(nframe):
    f = nframe / (Nframes - 1.0)
    patches[0].set_xy(P1 + (8 - 8 * f, 3 - 3 * f + 0.5 * np.sin(f * np.pi)))
    patches[1].set_xy(P2 + (5 * f, 2 * f - 0.5 * np.sin(f * np.pi)))
    patches[2].set_xy(P3 + (8 - 3 * f, 0))
    patches[3].set_xy(P4 + (8, 1 - f))
    return patches
    
anim = animation.FuncAnimation(fig, animate, frames=Nframes, interval=50)
display_animation(anim, default_mode='once')
Out[2]:


Once Loop Reflect

For a brief tutorial on creating matplotlib animations, see my previous post on the subject.

By the way, the animated gif at the head of this post was created using the following script. It's similar to above, except we add some extra frames to create the patrol-loop and the pause at each end.

Note that saving the animation as an animated gif requires both matplotlib version 1.3+ and the imagemagick command.

In [3]:
def animate_as_gif(nframe):
    if nframe >= Nframes and nframe < Nframes + 15:
        nframe = Nframes - 1
    elif nframe >= Nframes + 15 and nframe < 2 * Nframes + 15:
        nframe = 2 * Nframes + 14 - nframe
    elif nframe >= 2 * Nframes + 15:
        nframe = 0
    return animate(nframe)
    
anim = animation.FuncAnimation(fig, animate_as_gif,
                               frames=2 * Nframes + 30, interval=50)
anim.save('MagicTriangle.gif', writer='imagemagick')

Enjoy!

This post was written in an IPython notebook, which can be downloaded here, or viewed statically here.

by Jake Vanderplas at May 29, 2013 04:00 AM

Arink Verma

Performance parity between numpy arrays and Python scalars

Small numpy arrays are very similar to Python scalars but numpy incurs a fair amount of extra overhead for simple operations. For large arrays this doesn't matter, but for code that manipulates a lot of small pieces of data, it can be a serious bottleneck.

For example:

 
In [1]: x = 1.0

In [2]: numpy_x = np.asarray(x)

In [3]: timeit x + x
10000000 loops, best of 3: 61 ns per loop

In [4]: timeit numpy_x + numpy_x
1000000 loops, best of 3: 1.66 us per loop

I tried to introduced, a short path (at present) for integer and float addition of numpy array. In umath/ufunc_type_resolution.c , ufunc lookup loop find best data types based on input operands types. In short path, rather than going to loop again for addition operation, it return the best known data type.
 
/* Short path for addition of int + int */
int key = 0;
for (j = 0; j < nargs; ++j) {
key = (key<<5 data-blogger-escaped-dtypes="" data-blogger-escaped-j="">type_num;
}
NPY_UF_DBG_PRINT1("key is %d\n",key);

if(strcmp(ufunc_name,"add")==0){
int rent = -1;
if(key == 7399)
rent = 7;
else if(key == 12684)
rent = 13;

NPY_UF_DBG_PRINT1("rent is %d\n",rent);
if(rent > 0){
*out_innerloop = ufunc->functions[rent];
*out_innerloopdata = ufunc->data[rent];
NPY_UF_DBG_PRINT1("type @ hashposition %d\n",rent);
return 0;
}
}


Following are the benchmark result based on vbench for numpy


All operations have been run 1000000, with timeit function. And results for commits after 2013, May, 1 have been taken.

Multiplication 

timeit.timeit('x * x',setup='import numpy as np;x = np.asarray(1)')
numpy.asarray(1) * numpy.asarray(1) 


timeit.timeit('x * x',setup='import numpy as np;x = np.asarray(1.0)')
numpy.asarray(1.0) * numpy.asarray(1.0) 


timeit.timeit('x * y',setup='import numpy as np;x = np.asarray(1.0);y = np.asarray(1)')
numpy.asarray(1) * numpy.asarray(1.0) 

Addition

timeit.timeit('x + x',setup='import numpy as np;x = np.asarray(1)')
numpy.asarray(1) + numpy.asarray(1)


timeit.timeit('x + y',setup='import numpy as np;x = np.asarray(1.0)';y = np.asarray(1)')
numpy.asarray(1) + numpy.asarray(1.0)


timeit.timeit('x + x',setup='import numpy as np;x = np.asarray(1.0)')
numpy.asarray(1.0) + numpy.asarray(1.0)


Benchmark for numpy can be found on github

by Arink Verma (noreply@blogger.com) at May 29, 2013 01:04 AM

May 27, 2013

The Nipy blog

A separation too early

I followed a Google+ link to an article that has just come out in Science magazine. The article is "Troubling Trends in Scientific Software Use" by Lucas Joppa and others 1.

I did not find Joppa's article very satisfying, but there were some interesting references. For example (from Joppa et al):

From fields such as high-performance computing, we learn key insights and best practices for how to develop, standardize, and implement software (11)

Reference 11 is an article by Victor Basili and others on "Understanding the High-Performance-Computing Community" 2. It is a thoughtful reflection on the experience of academic software engineers working with scientists using massively parallel computing. The message I took from the article was that scientists may have good reasons to reject suggestions from academic software engineers. This is often because the solutions are too general to be useful for a particular problem. The last sentence from the article is "We've much to learn from each other".

Another couple of sentences in the Joppa article led me to interesting places:

Reliance on personal recommendations and trust is a strategy with risks to science and scientist. "End-user developers" commonly create scientific software (17, 21, 22) 1

Reference (22) is "A Software Chasm: Software Engineering and Scientific Computing" by Diane Kelley 3. Kelley also has some interesting thoughts on the separation of science and software engineering:

Creating the chasm

In a 1997 paper, Iris Vessey described a shift in computing philosophy that occurred in the late 1960s. She quoted George E. Forsyth, the ACM's president at that time, as stating that computer science was a field of its own, separate from the application domains. The result, as Vessey points out, was the insistence that anyone who wanted to be taken seriously in the field of computing, and software engineering, must develop domain-independent techniques, methods, and paradigms. The pressure was such that claims of broad applicability became commonplace. So, the bridge between software engineering and any application domain became a single massive structure that everyone had to use.

This is the relevant quote from Vessey (1997)4:

Then followed a debate, which endured for over a decade, that pitted academia against the computing profession. The notions of traditional computer scientists steeped in the notion of theory and the stigma of all things "applied" led to the following statement by George E. Forsyth, President of ACM, in 1965: "... the core of computer science has become and will remain a field of its own, ahead of, and separate from the application domain specialists." This philosophy led, not only to the creation of generic "solutions" to problems, but also to claims that any "creation" was applicable in all situations, because to state otherwise would have branded the creator as lacking in academic respectability.

I wonder if this separation has had a larger effect on our thinking than we recognize.

For example, our original NIPY grant application asked for salaries for two full-time programmers to help us apply software engineering methods to writing neuroimaging software. We did hire two programmers but we had given them an impossible job. To do a good job, they had to now become experts in a scientific field they did not know. Useful code and documentation had to be written so that other scientists in the field would find it easy to follow. To do this, you need to know the field. We scientists spend more time than we think working out how to talk to our colleagues. That experience is very hard to teach.

These articles gave me a new way to think about the false separation of "software" and "science". It is mysterious that we make this separation, because my whole experience tells me they are closely linked. I wonder whether we make this separation because we have come under the unconscious influence of the movement that Kelley and Vessey describe. It is easy to see how tempting it is. How easy life would be if we really could pay a programmer to write up our ideas as we sketch them airily on a white-board and retire to the garden, to think great thoughts.


  1. Joppa, Lucas N., et al. "Troubling Trends in Scientific Software Use." Science 340.6134 (2013): 814-815. 

  2. Basili, Victor R., et al. "Understanding the High-Performance-Computing Community." (2008). 

  3. Kelly, Diane F. "A software chasm: Software engineering and scientific computing." Software, IEEE 24.6 (2007): 120-119. 

  4. Vessey, Iris. "Problems versus solutions: the role of the application domain in software." Papers presented at the seventh workshop on Empirical studies of programmers. ACM, 1997. 

by Matthew Brett at May 27, 2013 07:00 PM

May 26, 2013

The Nipy blog

Unscientific programming

We various of the Berkeley NIPY crowd were in a Mexican restaurant yesterday lunchtime. We discussed the IPython notebook and scientific programming and reproducible research.

One question struck me as fundamental: do scientists have to learn to be software engineers?

I think the common answer to this is "No - scientific coding is different". The "No" camp points out that scientists use code in a different way to software engineers. Science is exploration, and so is scientific coding. Scientific code is provisional, often changing. There is no specfication, there is no production system that must just work. We are trying stuff out, seeing what works, adjusting to our better understanding of the data and the ideas.

Maybe the scientist's idea of the canonical software engineer is someone who writes and hosts a web application with some database behind it.

Accepting the "No", we look at things that software engineers should learn:

  1. Version control. Essential for a software engineer, apparently, because they need to be able to rollback their changes, and keep track of released and development versions. Desirable maybe for a scientist, but not essential, because the code never goes into production, and we always use the latest version of the code.
  2. Testing. Essential for a software engineer, otherwise they may release code that corrupts a database of financial transactions or delivers the wrong item to your address. Desirable maybe for a scientist, but not essential because the nature of the code changes so often that it would only slow things down to write exhaustive tests, only for the code to change track and make the tests obsolete.
  3. Code review. Essential for a software engineer, because they work in teams with other software engineers. This is their job, to write code, and they can learn from each other. Desirable maybe for a scientist, but not essential because each scientist owns their own problem and so only they can understand their code. Explaining the code is time-consuming and slows progress in developing new ideas. Others in the lab are not software engineers, so they are not trained to review code, they will not enjoy it and they will not do a good job.
  4. Releases. Essential for a software engineer because they may be paid to provide code that others can use. The release labels code that can be used. The software engineer can and should make sure the code works as advertised. Desirable maybe for a scientist, but not essential, because the code changes often, and the code may be written to solve a very specific problem that could not be of much interest to another researcher. If the other researcher is interested, they should make the effort to work out how to use the code, that is not the job of the author-scientist, they are not paid to support software, but to produce science.
  5. Documentation. Essential for a software engineer because they want their code to be used correctly by other people. Desirable but not essential for the scientist because scientific code is not for distribution except to other scientists who must take responsibility for the code themselves.

I think that these set of beliefs lie at the heart of the problem for reproducible science.

The beliefs rest on the following model of writing scientific code:

A folk model of scientific code

A scientist named A writes some code. They check the code. They confirm it is working. Most likely this means the code is free of major errors.

Another scientist named B is reading some results published by A. They can assume that the results are free of major errors.

Living with the folk model

Now everything makes sense. Version control, testing, code review, releases, documentation are great, of course, but if we can assume there are no major errors, then - why would B want to see my code? If B gets my code, why would she need to understand it? Why would she need to know what version it was, or how that related to my paper? She should assume that there are no major errors.

Rejecting the folk model

The folk model is not just too simple, it is flat-out wrong. We make mistakes all the time. All the time. If you know that, then you know that B needs to check your stuff. They can't do science without checking your stuff. If they need to check your stuff, you need to write your code for production. Then, nothing is optional. We all need; version control, testing, code review, releases, documentation.

by Matthew Brett at May 26, 2013 10:30 PM

The ubiquity of error

I was talking to an esteemed colleague about six months ago about how easy it was to believe that we do not make errors, unless we check.

He said that he had noticed this change when his students and research assistants started to use their own computer programs to analyze data. Before this, when he and others added up a column of numbers, they would check several times that they were correct. After, the student or RA would analyze some data with a program they had written themselves, and my friend would say "are you sure the program is right" and they would say "yes I'm sure". My friend would then go through the results and check; they would often find mistakes. He thought that there was something about using a computer that made it easy to believe that the result was right, even if the process of getting the result had the same chance of error as a simpler task like adding numbers.

To quote David Donoho:

In my own experience, error is ubiquitous in scientific computing, and one needs to work very diligently and energetically to eliminate it. One needs a very clear idea of what has been done in order to know where to look for likely sources of error. I often cannot really be sure what a student or colleague has done from his/her own presentation, and in fact often his/her description does not agree with my own understanding of what has been done, once I look carefully at the scripts. Actually, I find that researchers quite generally forget what they have done and misrepresent their computations.

Computing results are now being presented in a very loose, “breezy” way—in journal articles, in conferences, and in books. All too often one simply takes computations at face value. This is spectacularly against the evidence of my own experience. I would much rather that at talks and in referee reports, the possibility of such error were seriously examined.

-- David L. Donoho (2010). An invitation to reproducible computational research. Biostatistics Volume 11, Issue 3 Pp. 385-388

There is something about computational science that seems to make the idea of error less worrying or important:

In stark contrast to the sciences relying on deduction or empiricism, computational science is far less visibly concerned with the ubiquity of error. At conferences and in publications, it’s now completely acceptable for a researcher to simply say, “here is what I did, and here are my results.” Presenters devote almost no time to explaining why the audience should believe that they found and corrected errors in their computations. The presentation’s core isn’t about the struggle to root out error — as it would be in mature fields — but is instead a sales pitch: an enthusiastic presentation of ideas and a breezy demo of an implementation. Computational science has nothing like the elaborate mechanisms of formal proof in mathematics or meta-analysis in empirical science. Many users of scientific computing aren’t even trying to follow a systematic, rigorous discipline that would in principle allow others to verify the claims they make. How dare we imagine that computational science, as routinely practiced, is reliable!

-- David L. Donoho, Arian Maleki, Inam Ur Rahman, Morteza Shahram and Victoria Stodden (2009) Reproducible Research in Computational Harmonic Analysis. Computing in Science and Engineering 11(1) pp 8-18

by Matthew Brett at May 26, 2013 10:00 PM

May 24, 2013

Paul Ivanov

tags-vs-categories in Pelican

So, following up on my transition to pelican, here's a little help for navigating a pelican site. If you've ever wondered why sometimes there's a highlighted menu item at the top, and other times there isn't, read on.

Categories

With Pelican, every post belongs to exactly one category - that category is 'misc' if one is not provided (and you can override that named by setting DEFAULT_CATEGORY = 'somethingelse' in your pelicanconf.py file). Alternatively, if the post is in a subdirectory, it's category becomes the subdirectory name. So if you have a situation like this:

.../content/unicorns/lady-rainicorn.md

The lady-rainicorn post will be filed under the category of unicorns. You can explicitly override this category-from-subdirectory behavior, by specifying the category in the header of the file, like so:

title: All about Lady Rainicorn
slug: lady-rainicorn
category: adventure-time

Categories are where a lot of the pelican themes get the top navigation bar, which is referred to as the menu in Pelican. You'll notice that this post is filed under technology. If you're viewing this on the main page, that won't be highlighted, but if you click on the title of this post, you'll see that technology will become highlighted. Same deal if you just click on technology, which will get you a listing of the latest post in that category, and a pagination of all previous ones.

Pages

The menu items need not be limited to categories, however. With the popular notmyidea template (which is what I've customized for my journal here), if you create any pages (by putting the file in a content/pages/ directory), then the title of that page will appear in front of all of the categories in the menu. These pages don't show up in RSS feeds, though.

Menuitems

Additionally, you can specify other MENUITEMS in pelicanconf.py, which will go in front of all pages and categories. MENUITEMS is a list of tuples, each tuple should be composed of a link name, and a url. This is how I link to the listing of all of my blog posts:

MENUITEMS = [('all', '/blog/archives.html')]

Tags

Finally, you can have zero or more tags attached to a post. Tags, by default, do not get a feed generated for them. To enable a tag feed, you'll have to set a line like this in your pelicanconf.py files:

TAG_FEED_ATOM = "feeds/tag_%s.atom.xml"

Menu highlighting

The notmyidea template only contains logic for highlighting the active category. This is why, if you click on 'All' at the top of mine, it doesn't stay highlighted when you're on that page.

On the other hand, if you click at cycling at the menu bar, it will stay highlighted, and you will see a listing of the articles in that category. But if you click on the cycling tag in one of those posts - the menu item will not be highlighted, and you will see some more articles - in particular, right now my review of Just Ride will also be listed, but it lives in the category of books, because that's where I'm keeping book reviews.

Summary

So there you have it - now you know the differences between categories, tags, pages, and menu items, as far as Pelican is concerned. For me, the bottom line is that I'll probably stick to just using categories, and will try to use tags sparingly. In my transition from WordPress, I cleaned up and removed a whole bunch of tags (since they were either redundant (the same tags kept appearing together), or were too specific (most tags were only used for one post).

by Paul Ivanov at May 24, 2013 09:55 PM

embracing my hypertextuality

Well, it's happened again, I've jumped (back) on the static blog engine bandwagon. Early versions of my site were generated literally using #define ,#include, gcc, and a Makefile...). Back then I was transitioning away from using livejournal, and decided to use WordPress so I wouldn't have to roll my own RSS feed generator.

I tolerated WP for a while - and the frequency of my posts was so low, that it wasn't much of an issue.

Except for the security upgrades. I logged into the wp-admin console way more times than I cared to just to press the little "upgrade" box. The reason is that wordpress keeps everything in a database that gets queried every time someone hits the site. I was never comfortable with the fact, because content can be lost in case of database corruption during either an upgrade or a security breech. Also, my content just isn't that dynamic. The WP-cache stuff just seemed overkill, since I don't get that many visitors.

But lately, I've found myself wanting to write more, to post more, but also shying away from it because I hate dealing with the WordPress editor. And I also hate being uncertain about whether any of it will survive the next upgrade, or the next security hole, whichever I happen to stub my toe on first.

And the thing is, I really like to use version control for everything I do. I liked my blog posts to be just text files I can check into version control. I also like typing "make" to generate the blog, and now I get to!

For added fun, I'm hoping that writing my posts in markdown will make it easier to coordinate my gopher presence, since it's pretty close.

For posterity, I'm capturing what the first version of my Pelican-based blog looked like. I did the same thing when I moved to WordPress.

Embracing my
hypertextuality

I ran into some confusing things about transitioning to Pelican, so I thought I'd note them here, for the benefit of others.

Unadulturated code blocks

I like to use indentation as a proxy for the venerable <tt> tag - which uses a monospace font.

If you just want to use an indentation, but do not want the indented text parsed as a programming language, put a :::text at the top of that block. Here's what I mean. Take this Oscar Wilde quote, where I've inserted a line break for drammatic effect, for example:

A gentleman is one who never hurts anyone's feelings
unintentionally.

Now, you see that ugly red box around the apostrophe? Well, That's because all I did was indent the two lines. If I just put a :::text above the quote, indented to the same level,

    :::text
    A gentleman is one who never hurts anyone's feelings
    unintentionally.

the result will render like this.

A gentleman is one who never hurts anyone's feelings
unintentionally.

As the pelican documentation specifies, this is also the way you can also specify the specific programming language you want, so :::python would be one way to not make pygments guess. You can get a list of all supported languages here, just use one of the short names for your language of choice.

And you should really do this, even if you aren't bothered by the red marks, because the code highlighting plugin goes in and tokenizes all of those words and surrounds them in <span> tags. Here's the HTML generated for the first version:

<div class="codehilite"><pre><span class="n">A</span> <span class="n">gentleman</span> <span class="n">is</span> <span class="n">one</span> <span class="n">who</span> <span class="n">never</span> <span class="n">hurts</span> <span class="n">anyone</span><span class="err">&#39;</span><span class="n">s</span> <span class="n">feelings</span>
<span class="n">unintentionally</span><span class="p">.</span>
</pre></div>

and here's the version generated when you add the :::text line at the top:

<div class="codehilite"><pre>A gentleman is one who never hurts anyone&#39;s feelings
unintentionally.
</pre></div>

nikola?

At some point, having a dialogue with myself, I wrote in here "or should I not use pelican and use nicola instead?"

Ok, tried it - nikola takes too long to startup -

nikola --help
real    0m1.202s
user    0m0.876s
sys     0m0.300s

pelican --help
real    0m0.639s
user    0m0.496s
sys 0m0.132s

I'm sure it's a fine static blogging engine - and Damian Avila's already written IPython Notebook converters for it, but it just feels like it tries to do too many things. Constraints are good. I'll stick with Pelican for now. (Though I did use the nikola wordpress import tool to grab the wp-upload images from my WordPress blog)

another set of instructions I consulted: Kevin Deldycke's WordPress to Pelican, which is how I did get my articles out using exitwp which I patched slightly, so files got saved as .md, and preserve other format properties.

Redirects

also known as: not breaking the web

I wanted to preserve rss feeds, and also not break old WordPress style /YYYY/MM/DD urls - the Nikola wp-import script had created a url remapping scheme in a file called url_map.csv.

I don't have that many posts, so I just added them in by hand:

Options +FollowSymLinks
RewriteEngine on
RedirectMatch 301 /blog/feed/ /blog/feeds/all.atom.xml
RedirectMatch 301 /blog/2006/10/18/todd-chritien-greens-choice-voting/ /blog/todd-chritien-greens-choice-voting.html
RedirectMatch 301 /blog/2007/01/04/changelogs-with-dates-gui-goodness/ /blog/changelogs-with-dates-gui-goodness.html
...

Enabling table of contents for posts

If you want to include a table of contents within a post using [TOC], you must enabled the markdown toc processor with a line like this is your pelicanconf.py:

MD_EXTENSIONS =  [ 'toc', 'codehilite','extra']

Categories and tags

Ok, so this was never clear to me in wordpress, either - but what's the difference between a tag and a category? is it the case that a post can only belong to one category, whereas it can have any number of tags?

I think I used categories as tags on wordpress. Looks like all posts on Pelican can have at most one category. Turns out this little aside was long enough to turn into its own post, so if you're interested, pelican tags-vs-categories has got you covered.

That's it for now

Thanks to Preston Holmes (@ptone) for encouraging me to transition away from WordPress, and pointing me to this post by Gabe Weatherhead (@MacDrifter) for how to do that. It should be said that the pelican documentation itself is very good for getting you going. Additionally, I consulted this post by Steve George which has a good description to get you started, and also covers a bunch of little gotchas, and lots of pointers. Also, thanks to Jake Vanderplas (@jakevdp) for his writeup on transitioning to Pelican, which I will consult later for incorporating IPython notebooks into my markdown posts, in the future. This is good enough for now. LTS.

by Paul Ivanov at May 24, 2013 09:41 PM

May 23, 2013

Continuum Analytics

Accelerating Python Libraries with Numba (Part 2)

Welcome. This post is part of a series of Continuum Analytics Open Notebooks showcasing our projects, products, and services.

In this Continuum Open Notebook, you’ll learn more about how Numba works and how it reduces your programming effort, and see that it achieves comparable performance to C and Cython over a range of benchmarks.

by Aron Ahmadia at May 23, 2013 09:57 AM

Paul Ivanov

Cycling

A page about cycling in the East Bay (and the San Francisco Bay Area in general).

This started off as the sort of resource that I wish I had when visiting a new town: so hi there, out-of-towner, welcome to our corner of the world. Hope you enjoy your stay and find these resources helpful.

You need a bike, a water bottle, a jersey or a saddle bag (to store your snack, phone, patch kit). Everything else is optional.

Routes

Though I consider myself an avid cyclist, the reality is that I am only mildly so. I have not been on a ton of rides in the area, and don't know them well at all outside of the East Bay - I have yet to cycle across the Golden Gate Bridge - and the Marin Headlands (an omission I plan to rectify soon). I did mountain bike a bit in Marin when I was a kid, and rode a fair bit around the Peninsula while I was in high school. I got my first roadie the summer before heading off to UC Davis. It was a wonderfully light steel blue Peugeot from the 70s (you can see a picture of it here). I convinced (my then future roommate) Philip to get a matching white one. But it was taken from me my first year in grad school (2006). I hadn't been cycling much since, until I started up again recently. Below, you will find a concise guide to some of my favorite rides in the area. As far as my actual experiences, you can get a flavor of the rides I go on by reading my cycling log.

20-40 km

Here is the overview: If you want a shorter ride with a great view - I recommend this 24 km (15 mi) loop from UC Berkeley to Grizzly Peak. The ascent to Grizzly Peak Rd is steady and manageable. The road does get pretty rough when you turn off of College until you get to Broadway, but you avoid the dangerous Ashby route to Tunnel Road. Recently, I've been doing a similar ride that avoids that rough patch by going down College Ave all the way to Broadway. As an alternative option, instead of going up to Grizzly Peak, you can turn off for a slightly shorter (but really fun) descent that ends at the historic Claremont Hotel.

What's nice about these loops is that you can do them back-to-back for a total of ~50 km. But If you want to do a ride that long with more variety, read on for the longer routes.

50 km and beyond

If you have some time (about twice the time it takes to do the shorter routes above), the Redwood-Pinehurst route will take you through an incredible canyon that's filled with redwoods! You can take the loop in the other direction - Pinehurst-Redwood.

Another local favorite is the "Three Bears" ride, which has some truly scenic views all throughout (62 km, ~1 km gain).

The longest ride I've done so far was pretty pleasant, and had a wide variety of terrain, see the map (93 km, ~1.5 km gain).

Centuries

I haven't yet completed a century. But there are some local ones that I have my eye on.

Grizzly Peak Century is one that goes through some of the local East Bay routes I described above.

Best of The Bay Century likewise starts off in the East Bay routes I'm most familiar with, and heads south all the way to San Jose, before returning a bit to Fremont BART station.

Foxy's Fall Century is an organized ride based out of Davis, CA, where I was an undergrad.

Davis Double Century is in my sights as well. Maybe next year I'll get on the board of the California Triple Crown.

There are a lot of long rides listed on the Different Spokes San Francisco website. San Francisco Randonneurs run what looks like a very regular brevet series.

Routes to avoid

Ohlone Greenway: it's not fun - there are too many stops and unprotected crossings. This is a nice route to take a stroll through, on a bike it's just frustrating.

Local Bike Shops

I want to keep track of reasonable places that rent bicycles in the area (in particularly road bikes), as well as routes you should be sure to checkout (with variations, caveats, and general overview).

Missing Link Bicycle Cooperative - One can't say enough good things about this place, super friendly atmosphere. They're located right downtown Berkeley, they both sell gear and fix bikes, and even have two complete sets of tools and stands you can use to fix up your bike for free!

Blue Heron Bikes - a relatively new shop (opened in 2012), in "Westbrae" right of the Ohlone Greenway, I cycle past it every day on my commute. Soon after they first opened, I chatted with the owner, Rob, and he's a great guy.

Wheels of Justice Cyclery - The wheels of justice grind slowly, but exceedingly fine.

Resources

East Bay Bike Coalition - a very active local group.

Albany Strollers and Rollers - Albany, California, is a little 1.5 square mile city just north of Berkeley which I call home.

Berkeley Critical Mass - 2nd Friday of each month - departs at 6pm from Downtown Berkeley BART station.

East Bay Bike Party - a monthly evening ride.

Just Ride - my review of Grant Petersen's book.


Here's some ASCII art I made that I now use as my email signature:

                   _
                  / \
                A*   \^   -
             ,./   _.`\\ / \
            / ,--.S    \/   \
           /  `"~,_     \    \
     __o           ?
   _ \<,_         /:\
--(_)/-(_)----.../ | \
--------------.......J

by Paul Ivanov at May 23, 2013 07:00 AM

May 21, 2013

Continuum Analytics news

Continuum Analytics Launches Full-Featured, In-Browser Data Analytics Environment

Continuum Analytics, the premier provider of Python-based data analytics solutions and services, today announced the release of Wakari version 1.0, an easy-to-use, cloud-based, collaborative Python environment for analyzing, exploring and visualizing large data sets

by Corinna Bahr at May 21, 2013 10:00 AM

May 20, 2013

Enthought

Enthought awarded $1M DOE SBIR grant to develop open-source Python HPC framework

We are excited to announce that Enthought is undertaking a multi-year project to bring the strengths of NumPy to high-performance distributed computing.  The goal is to provide a more intuitive and user-friendly interface to both distributed array computing and to high-performance parallel libraries.  We will release the project as open source, providing another tool in [...]

by mtranby at May 20, 2013 04:11 PM

Philip Herron

Woo SunShine!

Forgot to mention i am in San Francisco at the moment with Work just took this picture for the lulz:

whoop1

by redbrain at May 20, 2013 04:41 AM

Dealing with Flack

So this is kind of a rant/political type post. When i started my project GCCPY as soon as it was announced as part of Google Summer of code 2010. A person who i should probably leave nameless nearly immediately jumped on #gcc irc.oftc.net, and started super questioning me on why the hell i would approach python in this way. To the point where he said that i should not bother and just contribute to PyPy. And the moderators in the channel had to kind of step in because it was just way too much.

So ok fair enough, then everything was fairly quiet i get the odd email from interested people then i noticed my project was on reddit http://www.reddit.com/r/Python/comments/1dc0tl/python_frontend_to_gcc_xpost_from_rgcc/

Same guy on there generally a negative slate on my project again. Partly i don’t blame them since my wiki is not that detailed. But i tend to feel people don’t think for themselves and just start believing people from established projects are rock stars, to the point were new projects are pointless to them.

Everyone is entiled to their opinion but what i will say is, think for yourself. And _dont_ compare a project like GCCPY to PyPy. Event just look at PyPy’s main page they have over 25k in donations!! What do i have, my spare time at work and doing everything from scratch in C where as pypy is basically python and jit reusing libpython.so. So before you start slating something as not worth its and cant do x,y,z. I have my own opinion that i believe this is a valid way of implementing python. Yes its _no_ where near python 2 complete! Make it a full time job it would be done in a year. I will say from my benchmarks which i want to post at the end of the year at python con IE; gccpy is looking truly amazing.

Think for yourself reddit!

by redbrain at May 20, 2013 04:38 AM

May 19, 2013

Fabian Pedregosa

Numerical optimizers for Logistic Regression

Following a challenge proposed by Gael to my group I compared several implementations of Logistic Regression. The task was to implement a Logistic Regression model using standard optimization tools from scipy.optimize and compare them against state of the art implementations such as LIBLINEAR.

In this blog post I'll write down all the implementation details of this model, in the hope that not only the conclusions but also the process would be useful for future comparisons and benchmarks.

Function evaluation

The loss function for the $\ell_2$-regularized logistic regression, i.e. the function to be minimized is

$$ \mathcal{L}(w, \lambda, X, y) = - \sum_{i=1}^n \log(\phi(y_i w^T X_i)) + \lambda w^T w $$

where $\phi(t) = 1. / (1 + \exp(-t))$ is the logistic function, $\lambda w^T w$ is the regularization term and $X, y$ is the input data, with $X \in \mathbb{R}^{n \times p}$ and $y \in \{-1, 1\}^n$. However, this formulation is not great from a practical standpoint. Even for not unlikely values of $t$ such as $t = -100$, $\exp(100)$ will overflow, assigning the loss an (erroneous) value of $+\infty$. For this reason 1, we evaluate $\log(\phi(t))$ as

$$ \log(\phi(t)) = \begin{cases} - \log(1 + \exp(-t)) \text{ if } t > 0 \\ t - \log(1 + \exp(t)) \text{ if } t \leq 0\\ \end{cases} $$

The gradient of the loss function is given by

$$ \nabla_w \mathcal{L} = \sum_{i=1}^n y_i X_i (\phi(y_i w^T X_i) - 1) + \lambda w $$

Similarly, the logistic function $\phi$ used here can be computed in a more stable way using the formula

$$ \phi(t) = \begin{cases} 1 / (1 + \exp(-t)) \text{ if } t > 0 \\ \exp(t) / (1 + \exp(t)) \text{ if } t \leq 0\\ \end{cases} $$

Finally, we will also need the Hessian for some second-order methods, which is given by

$$ \nabla_w ^2 \mathcal{L} = X^T D X + \lambda I $$

where $I$ is the identity matrix and $D$ is a diagonal matrix given by $D_{ii} = \phi(y_i w^T X_i)(1 - \phi(y_i w^T X_i))$.

In Python, these function can be written as

def phi(t):
    # logistic function, returns 1 / (1 + exp(-t))
    idx = t > 0
    out = np.empty(t.size, dtype=np.float)
    out[idx] = 1. / (1 + np.exp(-t[idx]))
    exp_t = np.exp(t[~idx])
    out[~idx] = exp_t / (1. + exp_t)
    return out

def loss(w, X, y, alpha):
    # logistic loss function, returns Sum{-log(phi(t))}
    z = X.dot(w)
    yz = y * z
    idx = yz > 0
    out = np.zeros_like(yz)
    out[idx] = np.log(1 + np.exp(-yz[idx]))
    out[~idx] = (-yz[~idx] + np.log(1 + np.exp(yz[~idx])))
    out = out.sum() + .5 * alpha * w.dot(w)
    return out

def gradient(w, X, y, alpha):
    # gradient of the logistic loss
    z = X.dot(w)
    z = phi(y * z)
    z0 = (z - 1) * y
    grad = X.T.dot(z0) + alpha * w
    return grad

def hessian(s, w, X, y, alpha):
    # hessian of the logistic loss
    z = X.dot(w)
    z = phi(y * z)
    d = z * (1 - z)
    wa = d * X.dot(s)
    Hs = X.T.dot(wa)
    out = Hs + alpha * s
    return  out

Benchmarks

I tried several methods to estimate this $\ell_2$-regularized logistic regression. There is one first-order method (that is, it only makes use of the gradient and not of the Hessian), Conjugate Gradient whereas all the others are Quasi-Newton methods. The method I tested are:

  • CG = Conjugate Gradient as implemented in scipy.optimize.fmin_cg
  • TNC = Truncated Newton as implemented in scipy.optimize.fmin_tnc
  • BFGS = Broyden–Fletcher–Goldfarb–Shanno method, as implemented in scipy.optimize.fmin_bfgs.
  • L-BFGS = Limited-memory BFGS as implemented in scipy.optimize.fmin_l_bfgs_b. Contrary to the BFGS algorithm, which is written in Python, this one wraps a C implementation.
  • Trust Region = Trust Region Newton method 1. This is the solver used by LIBLINEAR that I've wrapped to accept any Python function in the package pytron

To assure the most accurate results across implementations, all timings were collected by callback functions that were called from the algorithm on each iteration. Finally, I plot the maximum absolute value of the gradient (=the infinity norm of the gradient) with respect to time.

The synthetic data used in the benchmarks was generated as described in 2 and consists primarily of the design matrix $X$ being Gaussian noise, the vector of coefficients is drawn also from a Gaussian distribution and the explained variable $y$ is generated as $y = \text{sign}(X w)$. We then perturb matrix $X$ by adding Gaussian noise with covariance 0.8. The number of samples and features was fixed to $10^4$ and $10^3$ respectively. The penalization parameter $\lambda$ was fixed to 1.

In this setting variables are typically uncorrelated and most solvers perform decently:

Benchmark Logistic

Here, the Trust Region and L-BFGS solver perform almost equally good, with Conjugate Gradient and Truncated Newton falling shortly behind. I was surprised by the difference between BFGS and L-BFGS, I would have thought that when memory was not an issue both algorithms should perform similarly.

To make things more interesting, we now make the design to be slightly more correlated. We do so by adding a constant term of 1 to the matrix $X$ and adding also a column vector of ones this matrix to account for the intercept. These are the results:

Benchmark Logistic

Here, we already see that second-order methods dominate over first-order methods (well, except for BFGS), with Trust Region clearly dominating the picture but with TNC not far behind.

Finally, if we force the matrix to be even more correlated (we add 10. to the design matrix $X$), then we have:

Benchmark Logistic

Here, the Trust-Region method has the same timing as before, but all other methods have got substantially worse.The Trust Region method, unlike the other methods is surprisingly robust to correlated designs.

To sum up, the Trust Region method performs extremely well for optimizing the Logistic Regression model under different conditionings of the design matrix. The LIBLINEAR software uses this solver and thus has similar performance, with the sole exception that the evaluation of the logistic function and its derivatives is done in C++ instead of Python. In practice, however, due to the small number of iterations of this solver I haven't seen any significant difference.


  1. A similar development can be found in the source code of LIBLINEAR, and is probably also used elsewhere. 

  2. "A comparison of numerical optimizers for logistic regression", P. Minka, URL 

  3. "Newton's Method for Large Bound-Constrained Optimization Problems", Chih-Jen Lin, Jorge J. More URL 

  4. IPython Notebook to reproduce the benchmarks source 

by Fabian Pedregosa at May 19, 2013 10:00 PM

Jake Vanderplas

A Javascript Viewer for Matplotlib Animations

I'll cut to the chase: here's what I've created: a javascript-based animation viewer, with hooks to embed it in IPython. It's best viewed in a modern browser (unfortunately Firefox does not currently qualify as "modern" due to its lack of HTML5 support)

In [1]:
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline].
For more information, type &aposhelp(pylab)&apos.
In [2]:
# get the JSAnimation import at https://github.com/jakevdp/JSAnimation
from JSAnimation import examples
examples.basic_animation()
Out[2]:


Once Loop Reflect

I think the result is pretty good, if I do say so myself.

The Background

Last week, Fernando Perez visited UW to give a talk for the eScience institute. Over lunch we were discussing the possibility of building a Javascript-based animation viewer which could be embedded in IPython notebooks. I had written a short hack to embed mp4 movies in IPython, which works quite well: Michael Kuhlen of Berkeley ran with the idea and made this notebook, which embeds a 3D rendering of orbits within an N-body simulation.

The problem with this mp4 approach is that it requires installation of ffmpeg or mencoder with the proper video codec libraries. What we wanted was something that only requires Python and a web browser: something that could use Javascript to display frames rendered by Matplotlib.

Above you see the result of a week's worth of evenings hacking on Python, html, and Javascript -- my first real foray into the latter. The result is a small python package, available on my github page: https://github.com/jakevdp/JSAnimation. See the README and examples on that page for details of how this can be used.

So what's going on here?

You can dig into the code to see how it works, but here's the short version:

The package adds an IPython representation hook to the animation object, similar to the one I showed here. When the animation is displayed, IPython calls the new HTMLWriter to convert the animation to an embeddable html document. This writer is capable of saving any animation to a stand-alone HTML file, with the frames either embedded or in a separate directory: this stand-alone file is created, read-in, and embedded into the document as raw HTML.

For IPython, the animation creates frames that are embedded directly in the HTML source via the base-64 representation. A base-64 representation is a standard way of encoding binary data to a normal string of text, which looks like this:

In [3]:
fig, ax = plt.subplots()
ax.plot(random.rand(100))
# write the figure to a temporary file, and encode the results to base64
import tempfile
with tempfile.NamedTemporaryFile(suffix='.png') as f:
    fig.savefig(f.name)
    data = open(f.name, 'rb').read().encode('base64')
    
# close the figure and display the data
plt.close(fig)
print data[:460]
iVBORw0KGgoAAAANSUhEUgAAAbAAAAEgCAYAAADVKCZpAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz
AAALEgAACxIB0t1+/AAAIABJREFUeJztnXt0XVW977877zZN2iZ9QJNAgYQ+KLRotaCi4VFLKxQF
PNRzREdFrGgH6pFzz7l6zj3gHVaq5w4Pw3od9QEHAQvq8VpQCMgjqC1thVYKtEJ5tKTpO02aNGne
6/7xy+xee+251l6POddj799njIw2yc7ea6+91vrO7/f3m3OlDMMwwDAMwzAJoyjqDWAYhmEYP7CA
MQzDMImEBYxhGIZJJCxgDMMwTCJhAWMYhmESCQsYwzAMk0hYwBiGYZhEwgLGMAzDJBIWMIZhGCaR
sIAxDMMwiYQFjGEYhkkkLGAMwzBMImEBYxiGYRIJCxjDMAyTSFjAGIZhmETCAsYwDMMkEhYwhmE

We only print the first few sections of the data, as it is a rather large string. The magic of this is that contained in that string is all the information needed to reconstruct the original PNG image. We can see that directly by inserting the string into an HTML image tag, which results in a frame embedded in the document itself:

In [4]:
from IPython.display import HTML
HTML('<img src="data:image/png;base64,{0}">'.format(data))
Out[4]:

This sort of thing is similar to what goes on in the background every time you use embedded figures in an IPython notebook.

By embedding all the frames this way, we're able to use Javascript to switch between them at a given frame-rate using the javascript setInterval() function. The rest is just straightforward javascript event handling.

Embedding Your Own Animation

If you'd like to use this to create your own animation, you can follow the suggestions in the animation tutorial. To embed your animation in the notebook, import IPython_display from the JSAnimation package. This will add the _repr_html_ method to the animation class, so that creating the animation will lead to it being displayed via the Javascript embedding.

Here is an example showing how the above animation was created:

In [5]:
from matplotlib import animation
from JSAnimation import IPython_display

fig = plt.figure()
ax = plt.axes(xlim=(0, 10), ylim=(-2, 2))
line, = ax.plot([], [], lw=2)

def init():
    line.set_data([], [])
    return line,

def animate(i):
    x = np.linspace(0, 10, 1000)
    y = np.cos(i * 0.02 * np.pi) * np.sin(x - i * 0.02 * np.pi)
    line.set_data(x, y)
    return line,

animation.FuncAnimation(fig, animate, init_func=init,
                        frames=100, interval=30)
Out[5]:


Once Loop Reflect

Remaining Issues

One of my goals in this was to make it relatively lightweight: For this reason, it doesn't depend on JQuery, JQuery-UI, and other nice packages that might be suited for this type of application.

For that reason, the frame slider uses the HTML5 slider element, which is not yet supported by all browsers. In particular, if you're using older versions of IE or Firefox, the frame dragger will appear as an ugly numerical input box. Making this compatible with non-HTML5-compliant browsers would be possible, but would require a lot more javascript hacking.

Second, this does not scale well to large animations. Because each frame is individually embedded in the document, the size of the notebook can become very large very quickly. Typical web-ready video formats involve a lot of image compression. This tends to be easy for video: most frames look very similar to the last, with just a few changes. Implementing this sort of compression in Javascript would certainly be possible, but is well beyond my minimal Javascript hacking abilities. It would be very cool if someone could run with this idea and create what would amount to a Javascript video codec to reduce the size of these embedded animations. I think it could be done with some effort.

As it is, though, I think this is a pretty neat widget and will definitely find good uses.

I hope you've found this useful, and thanks for reading! This was a fun one.

This post was written in the IPython notebook: The notebook can be downloaded here, or viewed statically here

by Jake Vanderplas at May 19, 2013 02:00 PM

May 17, 2013

Paul Ivanov

Just Ride

This is a pretty good book - though it's more of a set of short little blog posts combined into a book. It's nice to read the very opinionated thoughts of someone who's been dealing with bicycles for a very long time.

The whole time I was reading this, I felt like this is the type of stuff that my pal Jon G would have said, and in a similar style to the way he would have said it. I look forward to reading Jon's treatises on cycling in the future.

Reading this book made me proud to and convinced me to continue to never wear clipless pedals and cycling shoes (I have toe clips), and having a kickstand on my roadie. I'm an unracer - and to quote the title of one of the little chapters: "Racing ruins the breed".

I also learned how to corner and turn properly - which has been invaluable in my windy descents lately. Finally, I was reminded to not just count miles, because every ride counts ("No ride too short"). Also it makes as much sense to count time spent on the bike, and the amount of elevation gain, and the number of days biked, period. Thanks to my commute, this means my days biked per week stays above 5 during the most of the year in California.

I also learned a new word: Beausage (byoo-sidj). My current bike doesn't have any, but my last laptop certainly does.

It also felt kind of cool that the author is local. Grant Petersen founded and still runs Rivendell Bicycle Works in Walnut Creek, CA.

by Paul Ivanov at May 17, 2013 07:00 AM

May 16, 2013

Continuum Analytics

Wakari and Financial Health Checks

Accrual Ratios are an important index for assessing the performance and continued viability of every publicly traded company. Using Wakari, TR CONNECT, and Thomson Reuters Financial Data I calculated accrual ratios for a variety of companies and explored news events, which may correlate with strong signals in the data. You can easily download and edit the notebook used in this post into your Wakari account.

by Benjamin Zaitlen at May 16, 2013 01:45 PM

May 13, 2013

The Nipy blog

Nipy World blog moved from there to here

The Nipy World blog used to be on Blogspot at http://nipyworld.blogspot.com. I'm restarting the blog here because it's easier and more fun to write the blog using Pelican. I hope that it will be easier for my fellow NIPistas to add their own posts using Github.

by Matthew Brett at May 13, 2013 02:42 AM

Nipy World blog moved from there to here

The Nipy World blog used to be on Blogspot at http://nipyworld.blogspot.com. I'm restarting the blog here because it's easier and more fun to write the blog using Pelican. I hope that it will be easier for my fellow NIPistas to add their own posts using Github.

by Matthew Brett at May 13, 2013 02:42 AM

Jake Vanderplas

Embedding Matplotlib Animations in IPython Notebooks

I've spent a lot of time on this blog working with matplotlib animations (see the basic tutorial here, as well as my examples of animating a quantum system, an optical illusion, the Lorenz system in 3D, and recreating Super Mario). Up until now, I've not have not combined the animations with IPython notebooks. The problem is that so far the integration of IPython with matplotlib is entirely static, while animations are by their nature dynamic. There are some efforts in the IPython and matplotlib development communities to remedy this, but it's still not an ideal setup.

I had an idea the other day about how one might get around this limitation in the case of animations. By creating a function which saves an animation and embeds the binary data into an HTML string, you can fairly easily create automatically-embedded animations within a notebook.

The Animation Display Function

As usual, we'll start by enabling the pylab inline mode to make the notebook play well with matplotlib.

In [1]:
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.kernel.zmq.pylab.backend_inline].
For more information, type &aposhelp(pylab)&apos.

Now we'll create a function that will save an animation and embed it in an html string. Note that this will require ffmpeg or mencoder to be installed on your system. For reasons entirely beyond my limited understanding of video encoding details, this also requires using the libx264 encoding for the resulting mp4 to be properly embedded into HTML5.

In [2]:
from tempfile import NamedTemporaryFile

VIDEO_TAG = """<video controls>
 <source src="data:video/x-m4v;base64,{0}" type="video/mp4">
 Your browser does not support the video tag.
</video>"""

def anim_to_html(anim):
    if not hasattr(anim, '_encoded_video'):
        with NamedTemporaryFile(suffix='.mp4') as f:
            anim.save(f.name, fps=20, extra_args=['-vcodec', 'libx264'])
            video = open(f.name, "rb").read()
        anim._encoded_video = video.encode("base64")
    
    return VIDEO_TAG.format(anim._encoded_video)

With this HTML function in place, we can use IPython's HTML display tools to create a function which will show the video inline:

In [3]:
from IPython.display import HTML

def display_animation(anim):
    plt.close(anim._fig)
    return HTML(anim_to_html(anim))

Example of Embedding an Animation

The result looks something like this -- we'll use a basic animation example taken from my earlier Matplotlib Animation Tutorial post:

In [4]:
from matplotlib import animation

# First set up the figure, the axis, and the plot element we want to animate
fig = plt.figure()
ax = plt.axes(xlim=(0, 2), ylim=(-2, 2))
line, = ax.plot([], [], lw=2)

# initialization function: plot the background of each frame
def init():
    line.set_data([], [])
    return line,

# animation function.  This is called sequentially
def animate(i):
    x = np.linspace(0, 2, 1000)
    y = np.sin(2 * np.pi * (x - 0.01 * i))
    line.set_data(x, y)
    return line,

# call the animator.  blit=True means only re-draw the parts that have changed.
anim = animation.FuncAnimation(fig, animate, init_func=init,
                               frames=100, interval=20, blit=True)

# call our new function to display the animation
display_animation(anim)
Out[4]:
Your browser does not support the video tag.

Making the Embedding Automatic

We can go a step further and use IPython's display hooks to automatically represent animation objects with the correct HTML. We'll simply set the _repr_html_ member of the animation base class to our HTML converter function:

In [5]:
animation.Animation._repr_html_ = anim_to_html

Now simply creating an animation will lead to it being automatically embedded in the notebook, without any further function calls:

In [6]:
animation.FuncAnimation(fig, animate, init_func=init,
                        frames=100, interval=20, blit=True)
Out[6]:
Your browser does not support the video tag.

So simple! I hope you'll find this little hack useful!

This post was created entirely in IPython notebook. Download the raw notebook here, or see a static view on nbviewer.

by Jake Vanderplas at May 13, 2013 02:00 AM

May 12, 2013

Mathieu Blondel

Large-scale sparse multiclass classification

I’m thrilled to announce that my paper “Block Coordinate Descent Algorithms for Large-scale Sparse Multiclass Classification” (published in the Machine Learning journal) is now online: PDF, BibTeX [*].

Abstract

Over the past decade, l1 regularization has emerged as a powerful way to learn classifiers with implicit feature selection. More recently, mixed-norm (e.g., l1/l2) regularization has been utilized as a way to select entire groups of features. In this paper, we propose a novel direct multiclass formulation specifically designed for large-scale and high-dimensional problems such as document classification. Based on a multiclass extension of the squared hinge loss, our formulation employs l1/l2 regularization so as to force weights corresponding to the same features to be zero across all classes, resulting in compact and fast-to-evaluate multiclass models. For optimization, we employ two globally-convergent variants of block coordinate descent, one with line search (Tseng and Yun, 2009) and the other without (Richtárik and Takáč, 2012). We present the two variants in a unified manner and develop the core components needed to efficiently solve our formulation. The end result is a couple of block coordinate descent algorithms specifically tailored to our multiclass formulation. Experimentally, we show that block coordinate descent performs favorably to other solvers such as FOBOS, FISTA and SpaRSA. Furthermore, we show that our formulation obtains very compact multiclass models and outperforms l1/l2- regularized multiclass logistic regression in terms of training speed, while achieving comparable test accuracy.

Code

The code of the proposed multiclass method is available in my Python/Cython machine learning library, lightning. Below is an example of how to use it on the News20 dataset.

from sklearn.datasets import fetch_20newsgroups_vectorized
from lightning.primal_cd import CDClassifier
 
bunch = fetch_20newsgroups_vectorized(subset="all")
X = bunch.data
y = bunch.target
 
clf = CDClassifier(penalty="l1/l2",
                   loss="squared_hinge",
                   multiclass=True,
                   max_iter=20,
                   alpha=1e-4,
                   C=1.0 / X.shape[0],
                   tol=1e-3)
clf.fit(X, y)
# accuracy
print clf.score(X, y) 
# percentage of selected features
print clf.n_nonzero(percentage=True)

To use the variant without line search (as presented in the paper), add the max_steps=0 option to CDClassifier.

Data

I also released the Amazon7 dataset used in the paper. It contains 1,362,109 reviews of Amazon products. Each review may belong to one of 7 categories (apparel, book, dvd, electronics, kitchen & housewares, music, video) and is represented as a 262,144-dimensional vector. It is, to my knowledge, one of the largest publically available multiclass classification dataset.

[*] The final publication is available here.

by Mathieu at May 12, 2013 12:52 PM

May 11, 2013

Matthieu Brucher

Annoucement: scikits.optimization 0.3

I’m please to announce a new version for scikits.optimization. The main focus of this iteration was to finish usual unconstrained optimization algorithms.

Changelog

  • Fixes on the Simplex state implementation
  • Added several Quasi-Newton steps (BFGS, rank 1 update…)

The scikit can be installed with pip/easy_install or downloaded from PyPI

Old announces:

Buy Me a Coffee!



Other Amount:



Your Email Address :



by Matt at May 11, 2013 06:58 AM

May 08, 2013

Continuum Analytics news

Continuum Analytics Releases Anaconda v1.5

Continuum Analytics, the premier provider of Python-based data analytics solutions and services, announced today the release of the latest version of Anaconda, its premium collection of libraries for Python. Anaconda enables big data management, analysis, and cross-platform visualization for business intelligence, scientific analysis, engineering, machine learning, and more. The latest release, version 1.5, includes the addition of netCDF4, as well as updates to Python, NumPy, SciPy, IPython, MatPlotLib, Pandas, and Cython.

by Corinna Bahr at May 08, 2013 03:00 PM

Jake Vanderplas

Migrating from Octopress to Pelican

After nine months on Octopress, I've decided to move on.

I should start by saying that Octopress is a great platform for static blogging: it's powerful, flexible, well-supported, well-integrated with GitHub pages, and has tools and plugins to do just about anything you might imagine. There's only one problem:

It's written in Ruby.

Now I don't have anything against Ruby per se. However, it was starting to seem a bit awkward that a blog called Pythonic Perambulations was built with Ruby, especially given the availability of so many excellent Python-based static site generators (Hyde, Nikola, and Pelican in particular).

Additionally, a few things with Octopress were starting to become difficult: first, I wanted a way to easily insert IPython notebooks into posts. Sure, I developed a hackish solution to notebooks in Octopress which had worked well enough for a while, but a cleaner method would have involved digging into the Ruby source code and writing a full-fledged Octopress extension, based on nbconvert. This would have involved a fair bit of effort to learn Ruby and figure out how to best interface it with the Python nbconvert code. Second, Ruby has so many strange and difficult pieces: GemFiles, RVM, rake... and I never took the time to really understand the real purpose of all of them (self-reflection: what parts of Python would seem strange and difficult if I hadn't been using them for so many years?). The black-box nature of the process, at least in my own case, was starting to bother me.

But the kicker was this: In January I got a new computer, and after a reasonable amount of effort was unable to successfully build my blog. I've been writing my posts exclusively on my old laptop which I somehow managed to successfully set up last August. But that laptop now has a sorely outdated Ubuntu distro that I couldn't upgrade for fear of losing the ability to update my blog. Needless to say, this was not the most effective setup.

It was time to switch my blog engine to Python.

Choosing a Python Static Generator

I started asking around, and found that there were three solid contenders for a Python-based platform to replace Octopress: Hyde, Nikola, and Pelican. I gave Hyde a test-run a few weeks ago in re-making my website, and I really like it: it's clean, straightforward, powerful, and easy to use. The documentation is a bit lacking, though, and I think it would take a fair bit more effort at this point to build a more complicated site with Hyde.

Nikola and Pelican both seem to be well-loved by their users, but I had to choose one. I went with Pelican for one simple reason: it has more GitHub forks. I'm sure this is entirely unfair to Nikola and all the contributors who have poured their energy into the project, but I had to choose one way or another. I'm pleased to say that Pelican has not been a disappointment: I've found it to be flexible and powerful. It has an active developer-base, and makes available a wide array of themes and plugins. For the few extra pieces I needed, I found the plugin and theming API to be well-documented and straightforward to use.

Migrating to Pelican from Octopress

I won't attempt to write a one-size-fits-all guide to migrating to Pelican from Octopress: there are too many possibile combinations of formats, plugins, themes, etc. But I will walk through my own process in some detail, in hopes that it might help others who find themselves in a similar predicament.

I had several goals when doing this migration:

  • I wanted, as much as possible, to maintain the look and feel of the blog. I like the default Octopress theme: it's simple, clean, compact, and includes all the aspects I need for a good blog.
  • I wanted, as much as possible, to leave the source of my posts unmodified: luckily Pelican supports writing posts in markdown and allows easy insertion of custom plugins, so this was relatively easy to accommodate.
  • I wanted to maintain the history of Disqus comments for each page, as well as the Twitter and Google Pages tools.
  • I wanted, as much as possible, to maintain the same URLs for all content, including posts, notebooks, images, and videos.
  • I wanted a clean way to insert html-formatted IPython notebooks into blog posts. Nearly half my posts are written as notebooks, and the old way of including them was becoming much too cumbersome.

I was able to suitably address all these goals with Pelican in a few evenings' effort. Some of it was already built-in to the Pelican settings architecture, some required customization of themes and extensions, and some required writing some brand new plugins. I'll summarize these aspects below:

Blog theme

As I mentioned, I wanted to keep the look and feel of the blog consistent. Luckily, someone had gone before me and created an octopress Pelican theme which did most of the heavy lifting. I contributed a few additional features, including the ability to specify Disqus tags and maintain comment history, to add Twitter, Google Plus, and Facebook links in the sidebar and footer, to add a custom site search box which appears in the upper right of the navigation panel, as well as a few other tweaks. The result is what you see here: nearly identical to the old layout, with all the bells and whistles included.

Octopress Markdown to Pelican Markdown

Octopress has a few plugins which add some syntactic sugar to the markdown language. These are tags specified in Liquid-style syntax:

{% tag arg1 arg2 ... %}

I have made extensive use of these in my octopress posts, primarily to insert videos, images, and code blocks from file. In order to accommodate this in Pelican, I wrote a Pelican plugin which wraps a custom Markdown preprocessor written via the Markdown extension API which can correctly interpret these types of tags. The tags ported from octopress thus far are:

The Image Tag

The image tag allows insertion of an image into the post with a specified size and position:

{% img [position] /url/to/img.png [width] [height] [title] [alt] %}

Here is an example of the result of the image tag:

A Galaxy

The Video Tag

The video tag allows embedding of an HTML5/Flash-compatible video into the post:

{% video /url/to/video.mp4 [width] [height] [/url/to/poster.png] %}

Here is an example of the output of the video tag:

(see this post for a description of this video).

The Code Include Tag

The include_code tag allows the insertion of formatted lines from a file into the post, with a title and a link to the source file:

{% include_code filename.py [lang:python] [title] %}

Here is an example of the output of the code include tag:

Hello World hello_world.py download
import sys
import os
print("hello_world")

For more information on using these tags, refer to the module doc-strings.

Maintaining Disqus Comment Threads

Static blogs are fast, lightweight, and easy to deploy. A disadvantage, though, is the inability to natively include dynamic elements such as comment threads. Disqus is a third-party service that skirts this disadvantage very seamlessly. All it takes is to add a small javascript snippet with some identifiers in the appropriate place on your page, and Disqus takes care of the rest. To keep the comment history on each page required assuring that the site identifier and page identifiers remained the same between blog versions. This part happens within the theme, and my Disqus PR to the Pelican Octopress theme made this work correctly.

Maintaining the URL structure

By default, Octopress stores posts with a structure looking like blog/YYYY/MM/DD/article-slug/. The Pelican default is different, but easy enough to change. In the pelicanconf.py settings file, this corresponds to the following:

ARTICLE_URL = 'blog/{date:%Y}/{date:%m}/{date:%d}/{slug}/'
ARTICLE_SAVE_AS = 'blog/{date:%Y}/{date:%m}/{date:%d}/{slug}/index.html'

Next, at the top of the markdown file for each article, the metadata needs to be slightly modified from the form used by Octopress -- here is the actual metadata used in the document that generates this page:

Title: Migrating from Octopress to Pelican
date: 2013-05-07 17:00
slug: migrating-from-octopress-to-pelican

Additionally, the static elements of the blog (images, videos, IPython notebooks, code snippets, etc.) must be put within the correct directory structure. These static files should be put in paths which are specified via the STATIC_PATHS setting:

STATIC_PATHS = ['images', 'figures', 'downloads']

Pelican presented a challenge here: as of the time of this writing, Pelican has a hard-coded 'static' subdirectory where these static paths are saved. I submitted a pull request to Pelican that replaces this hard-coded setting with a configurable path: because the change conflicts with a bigger refactoring of the code which is ongoing, the PR will not be merged. But until this new refactoring is finished, I'll be using my own branch of Pelican to make this blog, and specify the correct static paths.

Inserting Notebooks

The ability to seamlessly insert IPython notebooks into posts was one of the biggest drivers of my switch to Pelican. Pelican has an ipython notebook plugin available, but I wasn't completely happy with it. The plugin implements a reader which treats the notebooks themselves as the source of a post, leading to the requirement to insert blog metadata into the notebook itself. This is a suitable solution, but for my own purposes I much prefer a solution in which the content of a notebook could be inserted into a stand-alone post, such that the notebook and the blog metadata are completely separate.

To accomplish this, I added a submodule to my liquid_tags Pelican plugin which allows the insertion of notebooks using the following syntax:

{% notebook path/to/notebook.ipynb [cells[i:j]] %}

This inserts the notebook at the given location in the post, optionally selecting a given range of notebook cells with standard Python list slicing syntax.

The formatting of notebooks requires some special CSS styles which must be inserted into the header of each page where notebooks are shown. Rather than requiring the theme to be customized to support notebooks, I decided on a solution where an EXTRA_HEADER setting is used to specify html and CSS which should be added to the header of the main page. The notebook plugin saves the required header to a file called _nb_header.html within the main source directory. To insert the appropriate formatting, we add the following lines to the settings file, pelicanconf.py:

EXTRA_HEADER = open('_nb_header.html').read().decode('utf-8')

In the theme file, within the header tag, we add the following:

 {% if EXTRA_HEADER %}
 {{ EXTRA_HEADER }}
 {% endif %}

Here is the result: a short notebook inserted directly into the post:

This Is An IPython Notebook

Here is some code:

In [1]:
import numpy
print numpy.random.random(10)
[ 0.25463203  0.55637185  0.02743774  0.57380221  0.52378531  0.95099357
  0.70975568  0.19575853  0.589227    0.06959599]

Here is some math:

$$e^{i\pi} + 1 = 0$$

With all those things in place, the blog was ready to be built. The result is right in front of you: you're reading it. If you'd like to see the source from which this blog built, it's available at http://www.github.com/jakevdp/PythonicPerambulations. Feel free to adapt the configurations and theme to suit your own needs.

I'm glad to be working purely in Python from now on!

by Jake Vanderplas at May 08, 2013 12:00 AM

May 07, 2013

Continuum Analytics

Accelerating Python Libraries with Numba (Part 1)

Welcome. This post is part of a series of Continuum Analytics Open Notebooks showcasing our projects, products, and services.

In this Continuum Open Notebook, you’ll see how Numba accelerates the performance of the GrowCut image segmentation window function by three orders of magnitude in two lines of Python.

by Aron Ahmadia at May 07, 2013 11:07 AM

May 02, 2013

Matthieu Brucher

Comparison of optimization algorithms

In the next version of scikits.optimization, I’ve added some Quasi-Newton steps. Before this version is released, I thought I would compare several methods of optimizing the Rosenbrock function.

Optimizers

What is great with the Rosenbrock cost function can be summed up in a few points:

  1. It is hard to optimize
  2. Gradient can be easily computed

I’ve decided to compare the number of function and gradient calls as well as the cost behavior for several usual optimization algorithms. So the contestants will be:

  • A Nelder-Mead/Polytope/simplex optimizer
  • SSA, a simplex with simulated annealing (think of amotsa from Numerical Recipes)
  • GA, a genetic algorithm
  • a gradient-based optimizer
  • CG, a conjugate-gradient optimizer
  • BFGS, a quasi-Newton optimizer

The first 4 are from scikits.optimization, the last 2 are based on proprietary code that cannot be published, but it’s interesting to compare with other tools that are used to compare gradient-free complex cost functions.

Results

I’ve made a small slideshow with the derivative-free algorithms. First you have for each of the three algorithms the number of function calls versus iteration, the cost versus iteration and finally the location of testing parameters.
Click to view slideshow.

This slideshow is for the derivative-based algorithms.
Click to view slideshow.

Conclusion

I was quite surprised by some algorithm behaviors. Clearly, the conjugate gradient algorithm behaves far better than the simple gradient, but the BFGS followed the Rosenbrock valley far better. A good quasi-Newton can be really efficient (not a brent because it needs to solve a linear equation), although a conjugate gradient can be enough in some cases.

For the gradient-free algorithms, SSA really behaved badly. This is mainly due because the hyperparameters that must be adequately tuned. This function is quite simple, but my first trial at setting these parameters was far more efficient for GA or the simplex than for SSA. So I would go for GA for gradient-free optimization: few and easy hyper parameters and a good browse of the search space.

The code for the 4 first tests and display plots

by Matt at May 02, 2013 07:03 AM

May 01, 2013

Fabian Pedregosa

Logistic Ordinal Regression

TL;DR: I've implemented a logistic ordinal regression or proportional odds model. Here is the Python code

The logistic ordinal regression model, also known as the proportional odds was introduced in the early 80s by McCullagh [1, 2] and is a generalized linear model specially tailored for the case of predicting ordinal variables, that is, variables that are discrete (as in classification) but which can be ordered (as in regression). It can be seen as an extension of the logistic regression model to the ordinal setting.

Given $X \in \mathbb{R}^{n \times p}$ input data and $y \in \mathbb{N}^n$ target values. For simplicity we assume $y$ is a non-decreasing vector, that is, $y_1 \leq y_2 \leq ...$. Just as the logistic regression models posterior probability $P(y=j|X_i)$ as the logistic function, in the logistic ordinal regression we model the cummulative probability as the logistic function. That is,

$$ P(y \leq j|X_i) = \phi(\theta_j - w^T X_i) = \frac{1}{1 + \exp(w^T X_i - \theta_j)} $$

where $w, \theta$ are vectors to be estimated from the data and $\phi$ is the logistic function defined as $\phi(t) = 1 / (1 + \exp(-t))$.

Toy example with three classes denoted in different colors. Also shown the vector of coefficients $w$ and the thresholds $\theta_0$ and $\theta_1$

Compared to multiclass logistic regression, we have added the constrain that the hyperplanes that separate the different classes are parallel for all classes, that is, the vector $w$ is common across classes. To decide to which class will $X_i$ be predicted we make use of the vector of thresholds $\theta$. If there are $K$ different classes, $\theta$ is a non-decreasing vector (that is, $\theta_1 \leq \theta_2 \leq ... \leq \theta_{K-1}$) of size $K-1$. We will then assign the class $j$ if the prediction $w^T X$ (recall that it's a linear model) lies in the interval $[\theta_{j-1}, \theta_{j}[$. In order to keep the same definition for extremal classes, we define $\theta_{0} = - \infty$ and $\theta_K = + \infty$.

The intuition is that we are seeking a vector $w$ such that $X w$ produces a set of values that are well separated into the different classes by the different thresholds $\theta$. We choose a logistic function to model the probability $P(y \leq j|X_i)$ but other choices are possible. In the proportional hazards model 1 the probability is modeled as $-\log(1 - P(y \leq j | X_i)) = \exp(\theta_j - w^T X_i)$. Other link functions are possible, where the link function satisfies $\text{link}(P(y \leq j | X_i)) = \theta_j - w^T X_i$. Under this framework, the logistic ordinal regression model has a logistic link function and the proportional hazards model has a log-log link function.

The logistic ordinal regression model is also known as the proportional odds model, because the ratio of corresponding odds for two different samples $X_1$ and $X_2$ is $\exp(w^T(X_1 - X_2))$ and so does not depend on the class $j$ but only on the difference between the samples $X_1$ and $X_2$.

Optimization

Model estimation can be posed as an optimization problem. Here, we minimize the loss function for the model, defined as minus the log-likelihood:

$\begin{align} \mathcal{L}(w, \theta) &= - \sum_{i=1}^n \log(\phi(\theta_{y_i} - w^T X_i) - \phi(\theta_{y_i -1} - w^T X_i)) \\ &= \sum_{i=1}^n w^T X_i - \log(\exp(\theta_{y_i}) - \exp(\theta_{y_i-1})) + \log(\phi(\theta_{y_i -1} - w^T X_i)) + \log(\phi(\theta_{y_i} - w^T X_i)) \\ \end{align}$

In this sum all terms are convex on $w$, thus the loss function is convex over $w$. It might be also jointly convex over $w$ and $\theta$, although I haven't checked. I use the function fmin_slsqp in scipy.optimize to optimize $\mathcal{L}$ under the constraint that $\theta$ is a non-decreasing vector. There might be better options, I don't know. If you do know, please leave a comment!.

Using the formula $\log(\phi(t))^\prime = (1 - \phi(t))$, we can compute the gradient of the loss function as

$\begin{align} \nabla_w \mathcal{L}(w, \theta) &= \sum_{i=1}^n X_i (1 - \phi(\theta_{y_i} - w^T X_i)) - \phi(\theta_{y_i-1} - w^T X_i)) \\ % \nabla_\theta \mathcal{L}(w, \theta) &= \sum_{i=1}^n - \frac{e_{y_i} \exp(\theta_{y_i}) - e_{y_i -1} \exp(\theta_{y_i -1})}{\exp(\theta_{y_i}) - \exp(\theta_{y_i-1})} \\ \nabla_\theta \mathcal{L}(w, \theta) &= \sum_{i=1}^n e_{y_i} \left(1 - \phi(\theta_{y_i} - w^T X_i) - \frac{1}{1 - \exp(\theta_{y_i -1} - \theta_{y_i})}\right) \\ & \qquad + e_{y_i -1}\left(1 - \phi(\theta_{y_i -1} - w^T X_i) - \frac{1}{1 - \exp(- (\theta_{y_i-1} - \theta_{y_i}))}\right) \end{align}$

where $e_i$ is the $i$th canonical vector.

Code

I've implemented a Python version of this algorithm using Scipy's optimize.fmin_slsqp function. This takes as arguments the loss function, the gradient denoted before and a function that is > 0 when the inequalities on $\theta$ are satisfied.

Code can be found here as part of the minirank package, which is my sandbox for code related to ranking and ordinal regression. At some point I would like to submit it to scikit-learn but right now the I don't know how the code will scale to medium-scale problems, but I suspect not great. On top of that I'm not sure if there is a real demand of these models for scikit-learn and I don't want to bloat the package with unused features.

Performance

I compared the prediction accuracy of this model in the sense of mean absolute error (IPython notebook) on the boston house-prices dataset. To have an ordinal variable, I rounded the values to the closest integer, which gave me a problem of size 506 $\times$ 13 with 46 different target values. Although not a huge increase in accuracy, this model did give me better results on this particular dataset:

Here, ordinal logistic regression is the best-performing model, followed by a Linear Regression model and a One-versus-All Logistic regression model as implemented in scikit-learn.


  1. "Regression models for ordinal data", P. McCullagh, Journal of the royal statistical society. Series B (Methodological), 1980 

  2. "Generalized Linear Models", P. McCullagh and J. A. Nelder (Book) 

  3. "Loss Functions for Preference Levels : Regression with Discrete Ordered Labels", Jason D. M. Rennie, Nathan Srebro 

by Fabian Pedregosa at May 01, 2013 10:00 PM

Josef Perkoltd

Power Plots in statsmodels

I just want to show another two plots for the statistical power of a test, since I didn't have time for this earlier

The code to produce them is just calling the methods of the power classes, for example for the one sample t-test.

Read more »

by Josef Perktold (noreply@blogger.com) at May 01, 2013 10:21 PM

April 30, 2013

Continuum Analytics

MKL Optimizations for Anaconda

We are happy to announce a new Anaconda add-on product called MKL Optimizations”, which allows packages in Anaconda to take advantage of the Math Kernel Library (MKL) by Intel.

by Ilan Schnell at April 30, 2013 03:15 PM

April 29, 2013

Jake Vanderplas

Benchmarking Nearest Neighbor Searches in Python

I recently submitted a scikit-learn pull request containing a brand new ball tree and kd-tree for fast nearest neighbor searches in python. In this post I want to highlight-ipynb some of the features of the new ball tree and kd-tree code that's part of this pull request, compare it to what's available in the scipy.spatial.cKDTree implementation, and run a few benchmarks showing the performance of these methods on various data sets.

My first-ever open source contribution was a C++ Ball Tree code, with a SWIG python wrapper, that I submitted to scikit-learn. A Ball Tree is a data structure that can be used for fast high-dimensional nearest-neighbor searches: I'd written it for some work I was doing on nonlinear dimensionality reduction of astronomical data (work that eventually led to these two papers), and thought that it might find a good home in the scikit-learn project, which Gael and others had just begun to bring out of hibernation.

After a short time, it became clear that the C++ code was not performing as well as it could be. I spent a bit of time writing a Cython adaptation of the Ball Tree, which is what currently resides in the sklearn.neighbors module. Though this implementation is fairly fast, it still has several weaknesses:

  • It only works with a Minkowski distance metric (of which Euclidean is a special case). In general, a ball tree can be written to handle any true metric (i.e. one which obeys the triangle inequality).
  • It implements only the single-tree approach, not the potentially faster dual-tree approach in which a ball tree is constructed for both the training and query sets.
  • It implements only nearest-neighbors queries, and not any of the other tasks that a ball tree can help optimize: e.g. kernel density estimation, N-point correlation function calculations, and other so-called Generalized N-body Problems.

I had started running into these limits when creating astronomical data analysis examples for astroML, the Python library for Astronomy and Machine Learning Python that I released last fall. I'd been thinking about it for a while, and finally decided it was time to invest the effort into updating and enhancing the Ball Tree. It took me longer than I planned (in fact, some of my first posts on this blog last August came out of the benchmarking experiments aimed at this task), but just a couple weeks ago I finally got things working and submitted a pull request to scikit-learn with the new code.

Features of the New Ball Tree and KD Tree

The new code is actually more than simply a new ball tree: it's written as a generic N dimensional binary search tree, with specific methods added to implement a ball tree and a kd-tree on top of the same core functionality. The new trees have a lot of very interesting and powerful features:

  • The ball tree works with any of the following distance metrics, which match those found in the module scipy.spatial.distance:

    ['euclidean', 'minkowski', 'manhattan', 'chebyshev', 'seuclidean', 'mahalanobis', 'wminkowski', 'hamming', 'canberra', 'braycurtis', 'matching', 'jaccard', 'dice', 'kulsinski', 'rogerstanimoto', 'russellrao', 'sokalmichener', 'sokalsneath', 'haversine']

Alternatively, the user can specify a callable Python function to act as the distance metric. While this will be quite a bit slower than using one of the optimized metrics above, it adds nice flexibility.

  • The kd-tree works with only the first four of the above metrics. This limitation is primarily because the distance bounds are less efficiently calculated for metrics which are not axis-aligned.

  • Both the ball tree and kd-tree implement k-neighbor and bounded neighbor searches, and can use either a single tree or dual tree approach, with either a breadth-first or depth-first tree traversal. Naive nearest neighbor searches scale as $\mathcal{O}[N^2]$; the tree-based methods here scale as $\mathcal{O}[N \log N]$.

  • Both the ball tree and kd-tree have their memory pre-allocated entirely by numpy: this not only leads to code that's easier to debug and maintain (no memory errors!), but means that either data structure can be serialized using Python's pickle module. This is a very important feature in some contexts, most notably when estimators are being sent between multiple machines in a parallel computing framework.

  • Both the ball tree and kd-tree implement fast kernel density estimation (KDE), which can be used within any of the valid distance metrics. The supported kernels are

    ['gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine']

the combination of these kernel options with the distance metric options above leads to an extremely large number of effective kernel forms. Naive KDE scales as $\mathcal{O}[N^2]$; the tree-based methods here scale as $\mathcal{O}[N \log N]$.

  • Both the ball tree and kd-tree implement fast 2-point correlation functions. A correlation function is a statistical measure of the distribution of data (related to the Fourier power spectrum of the density distribution). Naive 2-point correlation calculations scale as $\mathcal{O}[N^2]$; the tree-based methods here scale as $\mathcal{O}[N \log N]$.

Comparison with cKDTree

As mentioned above, there is another nearest neighbor tree available in the SciPy: scipy.spatial.cKDTree. There are a number of things which distinguish the cKDTree from the new kd-tree described here:

  • like the new kd-tree, cKDTree implements only the first four of the metrics listed above.

  • Unlike the new ball tree and kd-tree, cKDTree uses explicit dynamic memory allocation at the construction phase. This means that the trained tree object cannot be pickled, and must be re-constructed in place of being serialized.

  • Because of the flexibility gained through the use of dynamic node allocation, cKDTree can implement a more sophisticated building methods: it uses the "sliding midpoint rule" to ensure that nodes do not become too long and thin. One side-effect of this, however, is that for certain distributions of points, you can end up with a large proliferation of the number of nodes, which may lead to a huge memory footprint (even memory errors in some cases) and potentially inefficient searches.

  • The cKDTree builds its nodes covering the entire $N$-dimensional data space. this leads to relatively efficient build times because node bounds do not need to be recomputed at each level. However, the resulting tree is not as compact as it could be, which potentially leads to slower query times. The new ball tree and kd tree code shrinks nodes to only cover the part of the volume which contains points.

With these distinctions, I thought it would be interesting to do some benchmarks and get a detailed comparison of the performance of the three trees. Note that the cKDTree has just recently been re-written and extended, and is much faster than its previous incarnation. For that reason, I've run these benchmarks with the current bleeding-edge scipy.

Preparing the Benchmarks

But enough words. Here we'll create some scripts to run these benchmarks. There are several variables that will affect the computation time for a neighbors query:

  • The number of points $N$: for a brute-force search, the query will scale as $\mathcal{O}[N^2]$ . Tree methods usually bring this down to $\mathcal{O}[N \log N]$ .
  • The dimension of the data, $D$ : both brute-force and tree-based methods will scale approximately as $\mathcal{O}[D]$ . For high dimensions, however, the curse of dimensionality can make this scaling much worse.
  • The desired number of neighbors, $k$ : $k$ does not affect build time, but affects query time in a way that is difficult to quantify
  • The tree leaf size, leaf_size: The leaf size of a tree roughly specifies the number of points at which the tree switches to brute-force, and encodes the tradeoff between the cost of accessing a node, and the cost of computing the distance function.
  • The structure of the data: though data structure and distribution do not affect brute-force queries, they can have a large effect on the query times of tree-based methods.
  • Single/Dual tree query: A single-tree query searches for neighbors of one point at a time. A dual tree query builds a tree on both sets of points, and traverses both trees at the same time. This can lead to significant speedups in some cases.
  • Breadth-first vs Depth-first search: This determines how the nodes are traversed. In practice, it seems not to make a significant difference, so it won't be explored here.
  • The chosen metric: some metrics are slower to compute than others. The metric may also affect the structure of the data, the geometry of the tree, and thus the query and build times.

In reality, query times depend on all seven of these variables in a fairly complicated way. For that reason, I'm going to show several rounds of benchmarks where these variables are modified while holding the others constant. We'll do all our tests here with the most common Euclidean distance metric, though others could be substituted if desired.

We'll start by doing some imports to get our IPython notebook ready for the benchmarks. Note that at present, you'll have to install scikit-learn off my development branch for this to work. In the future, the new KDTree and BallTree will be part of a scikit-learn release.

In [1]:
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type &aposhelp(pylab)&apos.
In [2]:
import numpy as np
from scipy.spatial import cKDTree
from sklearn.neighbors import KDTree, BallTree

Data Sets

For spatial tree benchmarks, it's important to use various realistic data sets. In practice, data rarely looks like a uniform distribution, so running benchmarks on such a distribution will not lead to accurate expectations of the algorithm performance.

For this reason, we'll test three datasets side-by-side: a uniform distribution of points, a set of pixel values from images of hand-written digits, and a set of flux observations from astronomical spectra.

In [3]:
# Uniform random distribution
uniform_N = np.random.random((10000, 4))
uniform_D = np.random.random((1797, 128))
In [4]:
# Digits distribution
from sklearn.datasets import load_digits
digits = load_digits()

print digits.images.shape
(1797, 8, 8)
In [5]:
# We need more than 1797 digits, so let's stack the central
# regions of the images to inflate the dataset.
digits_N = np.vstack([digits.images[:, 2:4, 2:4],
                      digits.images[:, 2:4, 4:6],
                      digits.images[:, 4:6, 2:4],
                      digits.images[:, 4:6, 4:6],
                      digits.images[:, 4:6, 5:7],
                      digits.images[:, 5:7, 4:6]])
digits_N = digits_N.reshape((-1, 4))[:10000]

# For the dimensionality test, we need up to 128 dimesnions, so
# we'll combine some of the images.
digits_D = np.hstack((digits.data,
                      np.vstack((digits.data[:1000], digits.data[1000:]))))
# The edge pixels are all basically zero.  For the dimensionality tests
# to be reasonable, we want the low-dimension case to probe interir pixels
digits_D = np.hstack([digits_D[:, 28:], digits_D[:, :28]])
In [6]:
# The spectra can be downloaded with astroML: see http://www.astroML.org
from astroML.datasets import fetch_sdss_corrected_spectra
spectra = fetch_sdss_corrected_spectra()['spectra']
spectra.shape
Out[6]:
(4000, 1000)
In [7]:
# Take sections of spectra and stack them to reach N=10000 samples
spectra_N = np.vstack([spectra[:, 500:504],
                       spectra[:, 504:508],
                       spectra[:2000, 508:512]])
# Take a central region of the spectra for the dimensionality study
spectra_D = spectra[:1797, 400:528]
In [8]:
print uniform_N.shape, uniform_D.shape
print digits_N.shape, digits_D.shape
print spectra_N.shape, spectra_D.shape
(10000, 4) (1797, 128)
(10000, 4) (1797, 128)
(10000, 4) (1797, 128)

We now have three datasets with similar sizes. Just for the sake of visualization, let's visualize two dimensions from each as a scatter-plot:

In [9]:
titles = ['Uniform', 'Digits', 'Spectra']
datasets_D = [uniform_D, digits_D, spectra_D]
datasets_N = [uniform_N, digits_N, spectra_N]

fig, ax = plt.subplots(1, 3, figsize=(12, 3.5))

for axi, title, dataset in zip(ax, titles, datasets_D):
    axi.plot(dataset[:, 1], dataset[:, 2], '.k')
    axi.set_title(title, size=14)

We can see how different the structure is between these three sets. The uniform data is randomly and densely distributed throughout the space. The digits data actually comprise discrete values between 0 and 16, and more-or-less fill certain regions of the parameter space. The spectra display strongly-correlated values, such that they occupy a very small fraction of the total parameter volume.

Benchmarking Scripts

Now we'll create some scripts that will help us to run the benchmarks. Don't worry about these details for now -- you can simply scroll down past these and get to the plots.

In [10]:
from time import time

def average_time(executable, *args, **kwargs):
    """Compute the average time over N runs"""
    N = 5
    t = 0
    for i in range(N):
        t0 = time()
        res = executable(*args, **kwargs)
        t1 = time()
        t += (t1 - t0)
    return res, t * 1. / N
In [11]:
TREE_DICT = dict(cKDTree=cKDTree, KDTree=KDTree, BallTree=BallTree)
colors = dict(cKDTree='black', KDTree='red', BallTree='blue', brute='gray', gaussian_kde='black')

def bench_knn_query(tree_name, X, N, D, leaf_size, k,
                    build_args=None, query_args=None):
    """Run benchmarks for the k-nearest neighbors query"""
    Tree = TREE_DICT[tree_name]
    
    if build_args is None:
        build_args = {}
        
    if query_args is None:
        query_args = {}
        
    NDLk = np.broadcast(N, D, leaf_size, k)
        
    t_build = np.zeros(NDLk.size)
    t_query = np.zeros(NDLk.size)
    
    for i, (N, D, leaf_size, k) in enumerate(NDLk):
        XND = X[:N, :D]
        
        if tree_name == 'cKDTree':
            build_args['leafsize'] = leaf_size
        else:
            build_args['leaf_size'] = leaf_size
        
        tree, t_build[i] = average_time(Tree, XND, **build_args)
        res, t_query[i] = average_time(tree.query, XND, k, **query_args)
        
    return t_build, t_query
In [12]:
def plot_scaling(data, estimate_brute=False, suptitle='', **kwargs):
    """Plot the scaling comparisons for different tree types"""
    # Find the iterable key
    iterables = [key for (key, val) in kwargs.iteritems() if hasattr(val, '__len__')]
    if len(iterables) != 1:
        raise ValueError("A single iterable argument must be specified")
    x_key = iterables[0]
    x = kwargs[x_key]
    
    # Set some defaults
    if 'N' not in kwargs:
        kwargs['N'] = data.shape[0]
    if 'D' not in kwargs:
        kwargs['D'] = data.shape[1]
    if 'leaf_size' not in kwargs:
        kwargs['leaf_size'] = 15
    if 'k' not in kwargs:
        kwargs['k'] = 5
    
    fig, ax = plt.subplots(1, 2, figsize=(10, 4),
                           subplot_kw=dict(yscale='log', xscale='log'))
        
    for tree_name in ['cKDTree', 'KDTree', 'BallTree']:
        t_build, t_query = bench_knn_query(tree_name, data, **kwargs)
        ax[0].plot(x, t_build, color=colors[tree_name], label=tree_name)
        ax[1].plot(x, t_query, color=colors[tree_name], label=tree_name)
        
        if tree_name != 'cKDTree':
            t_build, t_query = bench_knn_query(tree_name, data,
                                               query_args=dict(breadth_first=True, dualtree=True),
                                               **kwargs)
            ax[0].plot(x, t_build, color=colors[tree_name], linestyle='--')
            ax[1].plot(x, t_query, color=colors[tree_name], linestyle='--')
            
    if estimate_brute:
        Nmin = np.min(kwargs['N'])
        Dmin = np.min(kwargs['D'])
        kmin = np.min(kwargs['k'])
        
        # get a baseline brute force time by setting the leaf size large,
        # ensuring a brute force calculation over the data
        _, t0 = bench_knn_query('KDTree', data, N=Nmin, D=Dmin, leaf_size=2 * Nmin, k=kmin)
        
        # use the theoretical scaling: O[N^2 D]
        if x_key == 'N':
            exponent = 2
        elif x_key == 'D':
            exponent = 1
        else:
            exponent = 0
            
        t_brute = t0 * (np.array(x, dtype=float) / np.min(x)) ** exponent
        ax[1].plot(x, t_brute, color=colors['brute'], label='brute force (est.)')
            
    for axi in ax:
        axi.grid(True)
        axi.set_xlabel(x_key)
        axi.set_ylabel('time (s)')
        axi.legend(loc='upper left')
        axi.set_xlim(np.min(x), np.max(x))
        
    info_str = ', '.join([key + '={' + key + '}' for key in ['N', 'D', 'k'] if key != x_key])
    ax[0].set_title('Tree Build Time ({0})'.format(info_str.format(**kwargs)))
    ax[1].set_title('Tree Query Time ({0})'.format(info_str.format(**kwargs)))
    
    if suptitle:
        fig.suptitle(suptitle, size=16)
    
    return fig, ax

Benchmark Plots

Now that all the code is in place, we can run the benchmarks. For all the plots, we'll show the build time and query time side-by-side. Note the scales on the graphs below: overall, the build times are usually a factor of 10-100 faster than the query times, so the differences in build times are rarely worth worrying about.

A note about legends: we'll show single-tree approaches as a solid line, and we'll show dual-tree approaches as dashed lines. In addition, where it's relevant, we'll estimate the brute force scaling for ease of comparison.

Scaling with Leaf Size

We will start by exploring the scaling with the leaf_size parameter: recall that the leaf size controls the minimum number of points in a given node, and effectively adjusts the tradeoff between the cost of node traversal and the cost of a brute-force distance estimate.

In [13]:
leaf_size = 2 ** np.arange(10)
for title, dataset in zip(titles, datasets_N):
    fig, ax = plot_scaling(dataset, N=2000, leaf_size=leaf_size, suptitle=title)

Note that with larger leaf size, the build time decreases: this is because fewer nodes need to be built. For the query times, we see a distinct minimum. For very small leaf sizes, the query slows down because the algorithm must access many nodes to complete the query. For very large leaf sizes, the query slows down because there are too many pairwise distance computations. If we were to use a less efficient metric function, the balance between these would change and a larger leaf size would be warranted. This benchmark motivates our setting the leaf size to 15 for the remaining tests.

Scaling with Number of Neighbors

Here we'll plot the scaling with the number of neighbors $k$. This should not effect the build time, because $k$ does not enter there. It will, however, affect the query time:

In [14]:
k = 2 ** np.arange(1, 10)
for title, dataset in zip(titles, datasets_N):
    fig, ax = plot_scaling(dataset, N=4000, k=k, suptitle=title,
                           estimate_brute=True)

Naively you might expect linear scaling with $k$, but for large $k$ that is not the case. Because a priority queue of the nearest neighbors must be maintained, the scaling is super-linear for large $k$.

We also see that brute force has no dependence on $k$ (all distances must be computed in any case). This means that if $k$ is very large, a brute force approach will win out (though the exact value for which this is true depends on $N$, $D$, the structure of the data, and all the other factors mentioned above).

Note that although the cKDTree build time is a factor of ~3 faster than the others, the absolute time difference is less than two milliseconds: a difference which is orders of magnitude smaller than the query time. This is due to the shortcut mentioned above: the cKDTree doesn't take the time to shrink the bounds of each node.

Scaling with the Number of Points

This is where things get interesting: the scaling with the number of points $N$ :

In [15]:
N = (10 ** np.linspace(2, 4, 10)).astype(int)
for title, dataset in zip(titles, datasets_N):
    plot_scaling(dataset, N=N, estimate_brute=True, suptitle=title)

We have set d = 4 and k = 5 in each case for ease of comparison. Examining the graphs, we see some common traits: all the tree algorithms seem to be scaling as approximately $\mathcal{O}[N\log N]$, and both kd-trees are beating the ball tree. Somewhat surprisingly, the dual tree approaches are slower than the single-tree approaches. For 10,000 points, the speedup over brute force is around a factor of 50, and this speedup will get larger as $N$ further increases.

Additionally, the comparison of datasets is interesting. Even for this low dimensionality, the tree methods tend to be slightly faster for structured data than for uniform data. Surprisingly, the cKDTree performance gets worse for highly structured data. I believe this is due to the use of the sliding midpoint rule: it works well for evenly distributed data, but for highly structured data can lead to situations where there are many very sparsely-populated nodes.

Scaling with the Dimension

As a final benchmark, we'll plot the scaling with dimension.

In [16]:
D = 2 ** np.arange(8)
for title, dataset in zip(titles, datasets_D):
    plot_scaling(dataset, D=D, estimate_brute=True, suptitle=title)