April 16, 2015

Titus Brown

Please destroy this software after publication. kthxbye.

tl;dr? A while back I wrote that there are three uses of research software: replication, reproduction, and reuse. The world of computational science would be better off if people clearly delineated whether or not they wanted anyone else to reuse their software, and I think it's a massive mistake to expect that everyone's software should be reusable.

A few months back, I reviewed a pretty exciting paper - one I will probably highlight on my blog, when it comes out. The paper outlined a fairly simple concept for comparing sequences and then used that to develop some new ultra-scalable functionality. The theory seemed novel, the computational results were pretty good, and I recommended acceptance (or minor revisions). This was in spite of the fact that the authors stated quite clearly that they had produced largely unusable software.

Other reviewers were not quite so forgiving, however -- one reviewer declined to review the paper until they could run the software on their own data.

This got me thinking - I think I have a reputation as wanting people to release their software in a useful manner, but I've actually shied away from requiring it on several occasions. Here was a situation that was a pretty direct conflict: neat result, with software that was not intended for reuse. Interestingly, I've drawn this line before, although without much personal comment. In my blog post on review criteria for bioinformatics papers, there's nothing there about whether or not the software is reusable - it must just be legally readable and executable. But I'm also pretty loud-mouthed about wanting good quality (or at least better quality) software out there in the bioinformatics world!

So what gives? I felt that the new theory looked pretty awesome, and would be tremendously useful, while the implementation was (as stated) unlikely to be something I (or others) used. So what? Publish!

I think this highlights that there are two different possible goals for bioinformatics papers. One goal is the standard scientific goal: to demonstrate a new method or technique, whether it be mathematical or computational. The other goal is different, and in some ways much harder: to provide a functioning tool for use and reuse. These should have different review standards, and that maybe the authors should be given the opportunity to distinguish clearly between the two goals.

There's actually a lot of commonality between what I would request of the software from either kind of paper, a technique paper or a tool paper.

  • Both need to be accessible for download and viewing - otherwise, how can I understand the details of the implementation?
  • Both types of software need to be usable enough to reproduce the results in the paper, in theory (e.g. given sufficient access to compute infrastructure).
  • Both should be in a publicly accessible and archived format, to avoid loss of the software from personal Web sites, etc.
  • Both should show evidence of decent principles of basic software engineering, including the use of version control, some form of testing (albeit unit testing or functional testing or even just defined input with known good output), release/version information, installation/dependency information, and the like.

However, there are some interesting differences. Off the top of my head, I'm thinking that:

  • Crucially, the software from the technique paper would not need to be open source - by the OSI definition, the technique code would not need to be freely modifiable or re-sharable.

    (To be clear, I know of neither any formal (journal) requirements nor ethical requirements that a particular implementation be modifiable or redistributable.)

  • Nor need the software from the technique paper be written in a general way (e.g. to robustly process different formats), or for broader re-use. In particular, this means that documentation and unit/functional tests might be minimal - enough to support replication but no more.

  • The software from the technique paper should be accessible to basic review, but should not be subject to code review on style or implementation - correctness only.

  • Software from a "tools" paper, by contrast, should be held to much higher standards, and be subject to code review (in parts) and examination for documentation and installation and ... oh, heck, just start with the sustainability evaluation checklist at the SSI!

I'm aware that by having such relaxed constraints on technique publication I'm more or less directly contradicting myself in my earlier blog post on automated testing and research software - all of that definitely holds for software that you hope to be reused.

I'm not sure how or where to draw the line here, exactly. It's certainly reasonable to say that software that doesn't have unit tests is likely to be wrong, and therefore unit tests should be required - but, in science, we should never rely on a single study to prove something anyway, so I'm not sure why it matters if software is wrong in some details. This is where the difference between "replicability" and "reproducibility" becomes important. If I can't replicate your computation (at least in theory) then you have no business publishing it; but reproducing it is something that is a much larger task, outside the scope of any given paper.

I want to quote David States, who wrote a comment two years ago on my blog:

Too often, developers work in isolation, and this creates a high risk for code errors and science errors. Good code needs to be accessible and this includes not just sharing of the source code itself but also use of effective style, inclusion of tests and validation procedures and appropriate documentation.

I think I agree - but what's the minimum, here, for a technique paper that is meant to be a demonstration of a technique and no more?

One final point: in all of this we should recognize that the current situation is quite poor, in that quite a bit of software is simply inaccessible for replication purposes. (This mirrors my personal experiences in bioinformatics review, too.)

Improving this situation is important, but I think we need to be precise about what the minimim is. I don't think we're going to get very far by insisting that all code be held to high standards; that's a generational exercise (and part of why I'm so bullish on Software Carpentry).

So: what's the minimum necessary for decent science?


p.s. In case anyone is wondering, I don't think our software really meets my own criteria for tool publication, although it's getting closer.

p.p.s. Drawing this distinction leads in some very good directions for publishers and funding bodies to think about, too. More on that in another blog post, if I get the chance.

p.p.p.s. My 2004 paper (Brown and Callan) has a table that's wrong due to a fencepost error. But it's not seriously wrong. shrug

by C. Titus Brown at April 16, 2015 10:00 PM

The PyCon 2015 Ally's Workshop

At PyCon 2015, I had the pleasure of attending the Ally Skills Workshop, organized by @adainitiative (named after Ada Lovelace).

The workshop was a 3 hour strongly guided discussion centering around 4-6 person group discussion of short scenarios. There's a guide to running them here, although I personally would not have wanted to run one without attending one first!

I attended the workshop for at least three reasons --

First, I want to do better myself. I have put some effort into (and received a lot of encouragement for) making my lab an increasingly open and welcoming place. While I have heard concerns about being insufficiently critical and challenging of bad ideas in science (and I have personally experienced a few rather odd situations where obviously bad ideas weren't called out in my past labs), I don't see any inherent conflict between being welcoming and being intellectually critical - in fact, I rather suspect they are mutually supportive, especially for the more junior people.

But, doing better is surprisingly challenging; everyone needs a mentor, or at least guideposts. So when I heard about this workshop, I leapt at the chance to attend!

Second, I am interested in connecting these kinds of things to my day job in academia, where I am now a professor at UC Davis. UC Davis is the home of the somewhat notorious Jonathan Eisen, who is notorious for many reasons that include boycotting and calling out conferences that have low diversity. UC Davis also has an effort to increase diversity at the faculty level, and I think that this is an important effort. I'm hoping to be involved in this when I actually take up residence in Davis, and learning to be a male ally is one way to help. More, I think that Davis would be a natural home to some of these ally workshops, and so I attended the Ally Skills workshop to explore this.

And third, I was just curious! It's surprisingly tricky to confront and talk about sexism effectively, and I thought seeing how the the pros did it would a good way to start.

Interestingly, 2/3 of my lab attended the workshop, too - without me requesting it. I think they found it valuable, too.

The workshop itself

Valerie Aurora ran the workshop, and it's impossible to convey how good it was, but I'll try by picking out some choice quotes:

"You shouldn't expect praise or credit for behaving like a decent human being."

"Sometimes, you just need a flame war to happen." (paraphrase)

"LPT: Read Captain Awkward. And read the comments."

"It's not up to the victim whether you enforce your code of conduct."

"The physiological effects of alcohol are actually limited, and most effects of alcohol are socially and/or culturally mediated."

"Avoid rules lawyering. I don't now if you've ever worked with lawyers, but software engineers are almost as bad."

"One problem for male allies is the assumption that you are only talking to a woman because you are sexually interested in them."

"Trolls are good at calibrating their level of awfulness to something that you will feel guilty about moderating."

Read the blog post "Tone policing only goes one way.

Overall, a great experience and something I hope to help host more of at UC Davis.


by C. Titus Brown at April 16, 2015 10:00 PM

Continuum Analytics

Find Continuum at PyData Dallas

PyData Dallas, the first PyData conference in Texas, is taking place next week, April 24-26. PyData has been a wonderful conference for fostering the Python community and giving developers and other Python enthusiasts the opportunity to share their ideas, projects and the future of Python. Continuum Analytics is proud to be a founding sponsor for such an innovative, community-driven conference.

by Continuum at April 16, 2015 12:00 AM

April 15, 2015

Titus Brown

The three porridge bowls of sustainable scientific software development

(The below issues are very much on my mind as I think about how to apply for another NIH grant to fund continued development on the khmer project.)

Imagine that we have a graph of novel functionality versus software engineering effort for a particular project, cast in the shape of a tower or pyramid, i.e. a support structure for cool science.


Fig 1. Novel functionality (height) vs software engineering effort (area under curve).

The more novel functionality implemented, the taller the building, and the broader the software engineering base needs to be to support the building. If you have too much novel functionality with too little software engineering base, the tower will have too little support and catastrophe can ensue - either no new functionality can be added past a certain point, or we discover that much of the implemented functionality is actually unstable and incorrect.


Fig 2. One failure mode for scientific software development, where too much novel functionality (height) is supported by too little investment in software engineering effort (area under curve). This results in structural instability and incorrectness.

Since everybody likes novel functionality - for example, it's how we grade grants in science -- this is a very common failure mode. It is particularly problematic in situations where we have built a larger structure by placing many of these individual buildings on top of others; the entire structure is not much stronger than its weakest (least supported) component.

Another possible failure mode is if the base becomes too big too soon:


Fig 3. Another failure mode for scientific software development, where too little novel functionality (height) is developed, relative to too much investment in software engineering effort (area under curve).

That is, if too much effort is spent on software engineering at the expense of building novel functionality on top of it, then the building remains the same height while the base broadens. This is a failure for an individual project, because no new functionality gets built, and the project falls out of funding.

In the worst case, the base can become over-wrought and be designed to support functionality that doesn't yet exist. In most situations, this work will be entirely wasted, either because the base was designed for the wrong functionality, or because the extra work put into the base will delay the work put into new features.

Where projects are designed to be building blocks from the start, as opposed to a leap into the unknown like most small-lab computational science projects, a different structure is worth investing in -- but I'm skeptical that this is ever the way to start a project.


Fig 4. What a building block might look like - purposely eschewing novel functionality (height) for the purpose of building out a support platform for other science.

Supporting this kind of project is something that Dan Katz has written and presented about; see (for example) A Method to Select e-Infrastructure Components to Sustain.

And, of course, the real danger is that we end up in a situation where a poorly engineered structure is used to support a much larger body of scientific work:


Fig 5. The danger of the first failure mode is that we build new science (bowl-like shape) on top of a bunch of novel functionality (height of spike), with too little engineering (area in the spike).

The question that I am trying to understand is this: what are the lifecycle stages for research software, and how should we design for them (as researchers), and how should we think about funding them (as reviewers and program officers)?

To bring things back to the title, how do we make sure we mix the right amount of software development (cold porridge) with novel functionality (hot porridge) to make something edible for little bears?


by C. Titus Brown at April 15, 2015 10:00 PM

Matthieu Brucher

Announcement: ATKStereoCompressor 1.0.0

I’m happy to announce the release of a stereo compressor based on the Audio Toolkit. It is available on Windows and OS X (min. 10.8) in different formats. This stereo compressor can work on two channels, left/right or middle/side, possibly in linked mode (only one set of parameters), and can be set up to mix the input signal with the compressed signal (serial/parallel compression).


The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

Direct link for ATKSteroCompressor

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.

ATK SD1, ATKCompressor and ATKUniversalDelay were upgraded after AU validation failed. This is now fixed.

Buy Me a Coffee!

Other Amount:

Your Email Address :

by Matt at April 15, 2015 07:36 AM

April 14, 2015

Jan Hendrik Metzen

Probability calibration

As a follow-up of my previous post on reliability diagrams, I have worked jointly with Alexandre Gramfort, Mathieu Blondel and Balazs Kegl (with reviews by the whole team, in particular Olivier Grisel) on adding probability calibration and reliability diagrams to scikit-learn. Those have been added in the recent 0.16 release of scikit-learn as CalibratedClassifierCV and calibration_curve.

This post contains an interactive version of the documentation in the form of an IPython notebook; parts of the text/code are thus due to my coauthors.

Note that the 0.16 release of scikit-learn contains a bug in IsotonicRegression, which has been fixed in the 0.16.1 release. For obtaining correct results with this notebook, you need to use 0.16.1 or any later version.

Reliability curves

In [1]:
Expand Code
import numpy as np

import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline

from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.metrics import (brier_score_loss, precision_score, recall_score,
                             f1_score, log_loss)
from sklearn.cross_validation import train_test_split

When performing classification you often want not only to predict the class label, but also obtain a probability of the respective label. This probability gives you some kind of confidence on the prediction. Some models can give you poor estimates of the class probabilities and some even do not not support probability prediction. The calibration module allows you to better calibrate the probabilities of a given model, or to add support for probability prediction.

Well calibrated classifiers are probabilistic classifiers for which the output of the predict_proba method can be directly interpreted as a confidence level. For instance, a well calibrated (binary) classifier should classify the samples such that among the samples to which it gave a predict_proba value close to 0.8, approximately 80% actually belong to the positive class. The following plot compares how well the probabilistic predictions of different classifiers are calibrated:

In [2]:
Expand Code
X, y = datasets.make_classification(n_samples=100000, n_features=20,
                                    n_informative=2, n_redundant=2)

train_samples = 100  # Samples used for training the models

X_train = X[:train_samples]
X_test = X[train_samples:]
y_train = y[:train_samples]
y_test = y[train_samples:]

# Create classifiers
lr = LogisticRegression()
gnb = GaussianNB()
svc = LinearSVC(C=1.0)
rfc = RandomForestClassifier(n_estimators=100)
In [3]:
Expand Code
plt.figure(figsize=(9, 9))
ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
ax2 = plt.subplot2grid((3, 1), (2, 0))

ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
for clf, name in [(lr, 'Logistic'),
                  (gnb, 'Naive Bayes'),
                  (svc, 'Support Vector Classification'),
                  (rfc, 'Random Forest')]:
    clf.fit(X_train, y_train)
    if hasattr(clf, "predict_proba"):
        prob_pos = clf.predict_proba(X_test)[:, 1]
    else:  # use decision function
        prob_pos = clf.decision_function(X_test)
        prob_pos = \
            (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())
    fraction_of_positives, mean_predicted_value = \
        calibration_curve(y_test, prob_pos, n_bins=10)

    ax1.plot(mean_predicted_value, fraction_of_positives, "s-",
             label="%s" % (name, ))

    ax2.hist(prob_pos, range=(0, 1), bins=10, label=name,
             histtype="step", lw=2)

ax1.set_ylabel("Fraction of positives")
ax1.set_ylim([-0.05, 1.05])
ax1.legend(loc="lower right")
ax1.set_title('Calibration plots  (reliability curve)')

ax2.set_xlabel("Mean predicted value")
ax2.legend(loc="upper center", ncol=2)


LogisticRegression returns well calibrated predictions by default as it directly optimizes log-loss. In contrast, the other methods return biased probabilities; with different biases per method:

  • Naive Bayes (GaussianNB) tends to push probabilties to 0 or 1 (note the counts in the histograms). This is mainly because it makes the assumption that features are conditionally independent given the class, which is not the case in this dataset which contains 2 redundant features.

  • RandomForestClassifier shows the opposite behavior: the histograms show peaks at approximately 0.2 and 0.9 probability, while probabilities close to 0 or 1 are very rare. An explanation for this is given by Niculescu-Mizil and Caruana [4]: "Methods such as bagging and random forests that average predictions from a base set of models can have difficulty making predictions near 0 and 1 because variance in the underlying base models will bias predictions that should be near zero or one away from these values. Because predictions are restricted to the interval [0,1], errors caused by variance tend to be one-sided near zero and one. For example, if a model should predict p = 0 for a case, the only way bagging can achieve this is if all bagged trees predict zero. If we add noise to the trees that bagging is averaging over, this noise will cause some trees to predict values larger than 0 for this case, thus moving the average prediction of the bagged ensemble away from 0. We observe this effect most strongly with random forests because the base-level trees trained with random forests have relatively high variance due to feature subseting." As a result, the calibration curve shows a characteristic sigmoid shape, indicating that the classifier could trust its "intuition" more and return probabilties closer to 0 or 1 typically.

  • Linear Support Vector Classification (LinearSVC) shows an even more sigmoid curve as the RandomForestClassifier, which is typical for maximum-margin methods (compare Niculescu-Mizil and Caruana [4]), which focus on hard samples that are close to the decision boundary (the support vectors).

Calibration of binary classifiers

Two approaches for performing calibration of probabilistic predictions are provided: a parametric approach based on Platt's sigmoid model and a non-parametric approach based on isotonic regression (sklearn.isotonic). Probability calibration should be done on new data not used for model fitting. The class CalibratedClassifierCV uses a cross-validation generator and estimates for each split the model parameter on the train samples and the calibration of the test samples. The probabilities predicted for the folds are then averaged. Already fitted classifiers can be calibrated by CalibratedClassifierCV via the paramter cv="prefit". In this case, the user has to take care manually that data for model fitting and calibration are disjoint.

The following images demonstrate the benefit of probability calibration. The first image present a dataset with 2 classes and 3 blobs of data. The blob in the middle contains random samples of each class. The probability for the samples in this blob should be 0.5.

In [4]:
Expand Code
n_samples = 50000
n_bins = 3  # use 3 bins for calibration_curve as we have 3 clusters here

# Generate 3 blobs with 2 classes where the second blob contains
# half positive samples and half negative samples. Probability in this
# blob is therefore 0.5.
centers = [(-5, -5), (0, 0), (5, 5)]
X, y = datasets.make_blobs(n_samples=n_samples, n_features=2, cluster_std=1.0,
                           centers=centers, shuffle=False, random_state=42)

y[:n_samples // 2] = 0
y[n_samples // 2:] = 1
sample_weight = np.random.RandomState(42).rand(y.shape[0])

# split train, test for calibration
X_train, X_test, y_train, y_test, sw_train, sw_test = \
    train_test_split(X, y, sample_weight, test_size=0.9, random_state=42)

y_unique = np.unique(y)
colors = cm.rainbow(np.linspace(0.0, 1.0, y_unique.size))
for this_y, color in zip(y_unique, colors):
    this_X = X_train[y_train == this_y]
    this_sw = sw_train[y_train == this_y]
    plt.scatter(this_X[:, 0], this_X[:, 1], s=this_sw * 50, c=color, alpha=0.5,
                label="Class %s" % this_y)
<matplotlib.text.Text at 0x5b37b10>

The following image shows on the data above the estimated probability using a Gaussian naive Bayes classifier without calibration, with a sigmoid calibration and with a non-parametric isotonic calibration. One can observe that the non-parametric model provides the most accurate probability estimates for samples in the middle, i.e., 0.5.

In [5]:
Expand Code
# Gaussian Naive-Bayes with no calibration
clf = GaussianNB()
clf.fit(X_train, y_train)  # GaussianNB itself does not support sample-weights
prob_pos_clf = clf.predict_proba(X_test)[:, 1]

# Gaussian Naive-Bayes with isotonic calibration
clf_isotonic = CalibratedClassifierCV(clf, cv=2, method='isotonic')
clf_isotonic.fit(X_train, y_train, sw_train)
prob_pos_isotonic = clf_isotonic.predict_proba(X_test)[:, 1]

# Gaussian Naive-Bayes with sigmoid calibration
clf_sigmoid = CalibratedClassifierCV(clf, cv=2, method='sigmoid')
clf_sigmoid.fit(X_train, y_train, sw_train)
prob_pos_sigmoid = clf_sigmoid.predict_proba(X_test)[:, 1]

print("Brier scores: (the smaller the better)")

clf_score = brier_score_loss(y_test, prob_pos_clf, sw_test)
print("No calibration: %1.3f" % clf_score)

clf_isotonic_score = brier_score_loss(y_test, prob_pos_isotonic, sw_test)
print("With isotonic calibration: %1.3f" % clf_isotonic_score)

clf_sigmoid_score = brier_score_loss(y_test, prob_pos_sigmoid, sw_test)
print("With sigmoid calibration: %1.3f" % clf_sigmoid_score)
Brier scores: (the smaller the better)
No calibration: 0.104
With isotonic calibration: 0.084
With sigmoid calibration: 0.109
In [6]:
Expand Code
order = np.lexsort((prob_pos_clf, ))
plt.plot(prob_pos_clf[order], 'r', label='No calibration (%1.3f)' % clf_score)
plt.plot(prob_pos_isotonic[order], 'g', linewidth=3,
         label='Isotonic calibration (%1.3f)' % clf_isotonic_score)
plt.plot(prob_pos_sigmoid[order], 'b', linewidth=3,
         label='Sigmoid calibration (%1.3f)' % clf_sigmoid_score)
plt.plot(np.linspace(0, y_test.size, 51)[1::2],
         y_test[order].reshape(25, -1).mean(1),
         'k', linewidth=3, label=r'Empirical')
plt.ylim([-0.05, 1.05])
plt.xlabel("Instances sorted according to predicted probability "
           "(uncalibrated GNB)")
plt.legend(loc="upper left")
plt.title("Gaussian naive Bayes probabilities")
<matplotlib.text.Text at 0x623ce90>

The following experiment is performed on an artificial dataset for binary classification with 100.000 samples (1.000 of them are used for model fitting) with 20 features. Of the 20 features, only 2 are informative and 10 are redundant. The figure shows the estimated probabilities obtained with logistic regression, a linear support-vector classifier (SVC), and linear SVC with both isotonic calibration and sigmoid calibration. The calibration performance is evaluated with Brier score brier_score_loss, reported in the legend (the smaller the better).

In [7]:
Expand Code
# Create dataset of classification task with many redundant and few
# informative features
X, y = datasets.make_classification(n_samples=100000, n_features=20,
                                    n_informative=2, n_redundant=10,

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.99,
In [8]:
Expand Code
def plot_calibration_curve(est, name, fig_index):
    """Plot calibration curve for est w/o and with calibration. """
    # Calibrated with isotonic calibration
    isotonic = CalibratedClassifierCV(est, cv=2, method='isotonic')

    # Calibrated with sigmoid calibration
    sigmoid = CalibratedClassifierCV(est, cv=2, method='sigmoid')

    # Logistic regression with no calibration as baseline
    lr = LogisticRegression(C=1., solver='lbfgs')

    fig = plt.figure(fig_index, figsize=(9, 9))
    ax1 = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
    ax2 = plt.subplot2grid((3, 1), (2, 0))

    ax1.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
    for clf, name in [(lr, 'Logistic'),
                      (est, name),
                      (isotonic, name + ' + Isotonic'),
                      (sigmoid, name + ' + Sigmoid')]:
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        if hasattr(clf, "predict_proba"):
            prob_pos = clf.predict_proba(X_test)[:, 1]
        else:  # use decision function
            prob_pos = clf.decision_function(X_test)
            prob_pos = \
                (prob_pos - prob_pos.min()) / (prob_pos.max() - prob_pos.min())

        clf_score = brier_score_loss(y_test, prob_pos, pos_label=y.max())
        print("%s:" % name)
        print("\tBrier: %1.3f" % (clf_score))
        print("\tPrecision: %1.3f" % precision_score(y_test, y_pred))
        print("\tRecall: %1.3f" % recall_score(y_test, y_pred))
        print("\tF1: %1.3f\n" % f1_score(y_test, y_pred))

        fraction_of_positives, mean_predicted_value = \
            calibration_curve(y_test, prob_pos, n_bins=10)

        ax1.plot(mean_predicted_value, fraction_of_positives, "s-",
                 label="%s (%1.3f)" % (name, clf_score))

        ax2.hist(prob_pos, range=(0, 1), bins=10, label=name,
                 histtype="step", lw=2)

    ax1.set_ylabel("Fraction of positives")
    ax1.set_ylim([-0.05, 1.05])
    ax1.legend(loc="lower right")
    ax1.set_title('Calibration plots  (reliability curve)')

    ax2.set_xlabel("Mean predicted value")
    ax2.legend(loc="upper center", ncol=2)

In [9]:
# Plot calibration cuve for Linear SVC
plot_calibration_curve(LinearSVC(), "SVC", 2)
	Brier: 0.099
	Precision: 0.872
	Recall: 0.851
	F1: 0.862

	Brier: 0.163
	Precision: 0.872
	Recall: 0.852
	F1: 0.862

SVC + Isotonic:
	Brier: 0.100
	Precision: 0.853
	Recall: 0.878
	F1: 0.865

SVC + Sigmoid:
	Brier: 0.099
	Precision: 0.874
	Recall: 0.849
	F1: 0.861

One can observe here that logistic regression is well calibrated as its curve is nearly diagonal. Linear SVC's calibration curve has a sigmoid curve, which is typical for an under-confident classifier. In the case of LinearSVC, this is caused by the margin property of the hinge loss, which lets the model focus on hard samples that are close to the decision boundary (the support vectors). Both kinds of calibration can fix this issue and yield nearly identical results. The next figure shows the calibration curve of Gaussian naive Bayes on the same data, with both kinds of calibration and also without calibration.

In [10]:
# Plot calibration cuve for Gaussian Naive Bayes
plot_calibration_curve(GaussianNB(), "Naive Bayes", 1)
	Brier: 0.099
	Precision: 0.872
	Recall: 0.851
	F1: 0.862

Naive Bayes:
	Brier: 0.118
	Precision: 0.857
	Recall: 0.876
	F1: 0.867

Naive Bayes + Isotonic:
	Brier: 0.098
	Precision: 0.883
	Recall: 0.836
	F1: 0.859

Naive Bayes + Sigmoid:
	Brier: 0.109
	Precision: 0.861
	Recall: 0.871
	F1: 0.866

One can see that Gaussian naive Bayes performs very badly but does so in an other way than linear SVC: While linear SVC exhibited a sigmoid calibration curve, Gaussian naive Bayes' calibration curve has a transposed-sigmoid shape. This is typical for an over-confident classifier. In this case, the classifier's overconfidence is caused by the redundant features which violate the naive Bayes assumption of feature-independence.

Calibration of the probabilities of Gaussian naive Bayes with isotonic regression can fix this issue as can be seen from the nearly diagonal calibration curve. Sigmoid calibration also improves the brier score slightly, albeit not as strongly as the non-parametric isotonic calibration. This is an intrinsic limitation of sigmoid calibration, whose parametric form assumes a sigmoid rather than a transposed-sigmoid curve. The non-parametric isotonic calibration model, however, makes no such strong assumptions and can deal with either shape, provided that there is sufficient calibration data. In general, sigmoid calibration is preferable if the calibration curve is sigmoid and when there is few calibration data while isotonic calibration is preferable for non- sigmoid calibration curves and in situations where many additional data can be used for calibration.

Multi-class classification

CalibratedClassifierCV can also deal with classification tasks that involve more than two classes if the base estimator can do so. In this case, the classifier is calibrated first for each class separately in an one-vs-rest fashion. When predicting probabilities for unseen data, the calibrated probabilities for each class are predicted separately. As those probabilities do not necessarily sum to one, a postprocessing is performed to normalize them.

The next image illustrates how sigmoid calibration changes predicted probabilities for a 3-class classification problem. Illustrated is the standard 2-simplex, where the three corners correspond to the three classes. Arrows point from the probability vectors predicted by an uncalibrated classifier to the probability vectors predicted by the same classifier after sigmoid calibration on a hold-out validation set. Colors indicate the true class of an instance (red: class 1, green: class 2, blue: class 3).

In [11]:
Expand Code

# Generate data
X, y = datasets.make_blobs(n_samples=1000, n_features=2, random_state=42,
X_train, y_train = X[:600], y[:600]
X_valid, y_valid = X[600:800], y[600:800]
X_train_valid, y_train_valid = X[:800], y[:800]
X_test, y_test = X[800:], y[800:]
In [12]:
Expand Code
# Train uncalibrated random forest classifier on whole train and validation
# data and evaluate on test data
clf = RandomForestClassifier(n_estimators=25)
clf.fit(X_train_valid, y_train_valid)
clf_probs = clf.predict_proba(X_test)
score = log_loss(y_test, clf_probs)

# Train random forest classifier, calibrate on validation data and evaluate
# on test data
clf = RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)
clf_probs = clf.predict_proba(X_test)
sig_clf = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_valid, y_valid)
sig_clf_probs = sig_clf.predict_proba(X_test)
sig_score = log_loss(y_test, sig_clf_probs)
In [13]:
Expand Code
# Plot changes in predicted probabilities via arrows
plt.figure(0, figsize=(10, 8))
colors = ["r", "g", "b"]
for i in range(clf_probs.shape[0]):
    plt.arrow(clf_probs[i, 0], clf_probs[i, 1],
              sig_clf_probs[i, 0] - clf_probs[i, 0],
              sig_clf_probs[i, 1] - clf_probs[i, 1],
              color=colors[y_test[i]], head_width=1e-2)

# Plot perfect predictions
plt.plot([1.0], [0.0], 'ro', ms=20, label="Class 1")
plt.plot([0.0], [1.0], 'go', ms=20, label="Class 2")
plt.plot([0.0], [0.0], 'bo', ms=20, label="Class 3")

# Plot boundaries of unit simplex
plt.plot([0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], 'k', label="Simplex")

# Annotate points on the simplex
plt.annotate(r'($\frac{1}{3}$, $\frac{1}{3}$, $\frac{1}{3}$)',
             xy=(1.0/3, 1.0/3), xytext=(1.0/3, .23), xycoords='data',
             arrowprops=dict(facecolor='black', shrink=0.05),
             horizontalalignment='center', verticalalignment='center')
plt.plot([1.0/3], [1.0/3], 'ko', ms=5)
plt.annotate(r'($\frac{1}{2}$, $0$, $\frac{1}{2}$)',
             xy=(.5, .0), xytext=(.5, .1), xycoords='data',
             arrowprops=dict(facecolor='black', shrink=0.05),
             horizontalalignment='center', verticalalignment='center')
plt.annotate(r'($0$, $\frac{1}{2}$, $\frac{1}{2}$)',
             xy=(.0, .5), xytext=(.1, .5), xycoords='data',
             arrowprops=dict(facecolor='black', shrink=0.05),
             horizontalalignment='center', verticalalignment='center')
plt.annotate(r'($\frac{1}{2}$, $\frac{1}{2}$, $0$)',
             xy=(.5, .5), xytext=(.6, .6), xycoords='data',
             arrowprops=dict(facecolor='black', shrink=0.05),
             horizontalalignment='center', verticalalignment='center')
plt.annotate(r'($0$, $0$, $1$)',
             xy=(0, 0), xytext=(.1, .1), xycoords='data',
             arrowprops=dict(facecolor='black', shrink=0.05),
             horizontalalignment='center', verticalalignment='center')
plt.annotate(r'($1$, $0$, $0$)',
             xy=(1, 0), xytext=(1, .1), xycoords='data',
             arrowprops=dict(facecolor='black', shrink=0.05),
             horizontalalignment='center', verticalalignment='center')
plt.annotate(r'($0$, $1$, $0$)',
             xy=(0, 1), xytext=(.1, 1), xycoords='data',
             arrowprops=dict(facecolor='black', shrink=0.05),
             horizontalalignment='center', verticalalignment='center')
# Add grid
for x in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    plt.plot([0, x], [x, 0], 'k', alpha=0.2)
    plt.plot([0, 0 + (1-x)/2], [x, x + (1-x)/2], 'k', alpha=0.2)
    plt.plot([x, x + (1-x)/2], [0, 0 + (1-x)/2], 'k', alpha=0.2)

plt.title("Change of predicted probabilities after sigmoid calibration")
plt.xlabel("Probability class 1")
plt.ylabel("Probability class 2")
plt.xlim(-0.05, 1.05)
plt.ylim(-0.05, 1.05)

print("Log-loss of")
print(" * uncalibrated classifier trained on 800 datapoints: %.3f "
      % score)
print(" * classifier trained on 600 datapoints and calibrated on "
      "200 datapoint: %.3f" % sig_score)
Log-loss of
 * uncalibrated classifier trained on 800 datapoints: 1.280 
 * classifier trained on 600 datapoints and calibrated on 200 datapoint: 0.536

The base classifier is a random forest classifier with 25 base estimators (trees). If this classifier is trained on all 800 training datapoints, it is overly confident in its predictions and thus incurs a large log-loss. Calibrating an identical classifier, which was trained on 600 datapoints, with method='sigmoid' on the remaining 200 datapoints reduces the confidence of the predictions, i.e., moves the probability vectors from the edges of the simplex towards the center:

In [14]:
Expand Code
# Illustrate calibrator
plt.figure(1, figsize=(10, 8))
# generate grid over 2-simplex
p1d = np.linspace(0, 1, 20)
p0, p1 = np.meshgrid(p1d, p1d)
p2 = 1 - p0 - p1
p = np.c_[p0.ravel(), p1.ravel(), p2.ravel()]
p = p[p[:, 2] >= 0]

calibrated_classifier = sig_clf.calibrated_classifiers_[0]
prediction = np.vstack([calibrator.predict(this_p)
                        for calibrator, this_p in
                        zip(calibrated_classifier.calibrators_, p.T)]).T
prediction /= prediction.sum(axis=1)[:, None]

# Ploit modifications of calibrator
for i in range(prediction.shape[0]):
    plt.arrow(p[i, 0], p[i, 1],
              prediction[i, 0] - p[i, 0], prediction[i, 1] - p[i, 1],
              head_width=1e-2, color=colors[np.argmax(p[i])])
# Plot boundaries of unit simplex
plt.plot([0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], 'k', label="Simplex")

for x in [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:
    plt.plot([0, x], [x, 0], 'k', alpha=0.2)
    plt.plot([0, 0 + (1-x)/2], [x, x + (1-x)/2], 'k', alpha=0.2)
    plt.plot([x, x + (1-x)/2], [0, 0 + (1-x)/2], 'k', alpha=0.2)

plt.title("Illustration of sigmoid calibrator")
plt.xlabel("Probability class 1")
plt.ylabel("Probability class 2")
plt.xlim(-0.05, 1.05)
plt.ylim(-0.05, 1.05)
(-0.05, 1.05)

This calibration results in a lower log-loss. Note that an alternative would have been to increase the number of base estimators which would have resulted in a similar decrease in log-loss.


In summary, the newly added CalibratedClassifierCV allows to improve the quality of predicted class probabilities of binary and multi-class classifiers. It provides both a parametric calibration assuming a sigmoid shape of the calibration curve and a non-parametric calibration based on isotonic regression, which can cope with any shape of the calibration curve, provided that sufficient calibration data exists. The main bottlenecks of the method are the increased computation time due to the internal cross-validation loop and the necessity of additional calibration data, which can be alleviated by cross-validation.


[1] Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers, B. Zadrozny & C. Elkan, ICML 2001

[2] Transforming Classifier Scores into Accurate Multiclass Probability Estimates, B. Zadrozny & C. Elkan, (KDD 2002)

[3] Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods, J. Platt, (1999)

[4] Predicting Good Probabilities with Supervised Learning, A. Niculescu-Mizil & R. Caruana, ICML 2005

In [15]:
%load_ext watermark
%watermark -a "Jan Hendrik Metzen" -d -v -m -p numpy
Jan Hendrik Metzen 14/04/2015 

CPython 2.7.5+
IPython 2.4.1

numpy 1.9.2

compiler   : GCC 4.8.1
system     : Linux
release    : 3.16.0-28-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit

This post was written as an IPython notebook. You can download this notebook.

by Jan Hendrik Metzen at April 14, 2015 12:30 PM

April 13, 2015

Titus Brown

PyCon sprints - playing with Docker

I'm at the PyCon 2015 sprints (day 2), and I took the opportunity to play around with Docker a bit.

First, I created a local docker container that contained an installed version of khmer. I ran a blank docker container:

docker run -it ubuntu

and then installed the khmer prereqs on it, followed by khmer (run on the docker container):

apt-get update
apt-get install -y python-pip python-dev
pip install khmer

then I logged out, and created a named docker container from that:

docker commit -m "installed khmer" -a "Titus Brown" 28138ab4095d titus/khmer:v1.3

And now I can run the khmer scripts like this --

docker run -it titus/khmer:v1.3 /usr/local/bin/normalize-by-median.py

Next, I automated all of this with a Dockerfile:

cat > Dockerfile <<EOF
FROM ubuntu:14.04
MAINTAINER Titus Brown <titus@idyll.org>
RUN apt-get update && apt-get install -y python-dev python-pip
RUN pip install -U setuptools
RUN pip install khmer==1.3

docker build -t titus/khmer:v1.3 .

Either container lets me run normalize by median on a file outside of the container like so:

mkdir foo
curl https://raw.githubusercontent.com/ged-lab/khmer/master/data/100k-filtered.fa > foo/sequence.fa

docker run -v $PWD/foo:/data -it titus/khmer:v1.3 normalize-by-median.py /data/sequence.fa -o /data/result.fa

Here, I'm running 'normalize-by-median' on the input file 'foo/sequence.fa', and the output is placed in 'foo/result.fa'. The trick is that the './foo/' directory is mounted on the docker container as '/data/', so normalize-by-median.py sees it as '/data/sequence.fa'.

Nice and easy, so far...


by C. Titus Brown at April 13, 2015 10:00 PM

PyCon sprints - playing with named pipes and streaming and khmer

I'm at the PyCon 2015 sprints (day 2), and I took the opportunity to play around with named pipes.

I was reminded of named pipes by Vince Buffalo in this great blog post, and since we at the khmer project are very interested in streaming, and named pipes fit well with a streaming perspective, I thought I'd check out named pipes as a way to communicate sequences between different khmer scripts.

First, I tried using named pipes to tie digital normalization together with splitting reads out into two output files, left and right:

mkfifo aa
mkfifo bb

# set up diginorm, reading from 'aa' and writing to 'bb'
normalize-by-median.py aa -o bb &

# split reads into left and right, reading from 'bb' and outputting to
# output.1 / output.2
split-paired-reads.py bb -1 output.1 -2 output.2 &

# feed in your sequences!
cat sequence.fa > aa

Here the setup is a bit weird, because you have to set up all the commands that are going to operate on the data before feeding in the data. But it all works!

Next, I tried process substitution. This does essentially the same thing as above, but is much nicer and more succinct:

normalize-by-median.py sequence.fa -o >(split-paired-reads.py /dev/stdin -1 output.1 -2 output.2)

Finally, I tried to make use of functionality from our new semi-streaming paper to run 3-pass digital normalization using semi-streaming --

trim-low-abund.py -Z 20 -C 2 sequence.fa -o >(normalize-by-median.py /dev/stdin -o result.fa

but this fell apart because 'trim-low-abund.py' doesn't support -o ;). This led to a few issues being filed in our issue tracker...

Very neat stuff. It has certainly given me a strong reason to make sure all of our scripts support streaming input and output!


by C. Titus Brown at April 13, 2015 10:00 PM

Taking grad students to PyCon

I am still up at PyCon 2015 in Montreal, and most of my lab is here with me.

On Saturday, I told Terry Peppers and some others that PyCon had been one of my (limited) lifelines to (limited) sanity during my early tenure-track years. Whenever I was in danger of buying into too much of the academic bubble, events like PyCon and the various Twitter and blogging interactions I had in the Python world helped remind me how ridiculous some aspects of the academic world are.

On Sunday, a colleague at another university (who had not been to PyCon before, I think) was talking to me and said (paraphrased): "I wish I could afford to bring my graduate students. This conference is way more relevant to what they'll probably be doing down the road than any academic conference."

On Monday, I gave a talk in the Biology program at McGill, where I was invited by the grad student association. I spent a fair amount of time talking with grad students about career options, and when I mentioned that PyCon was just finishing up, a few of them said they wished they'd known about it. I think they'd have had a good time, especially at the job fair.

Moral of the story, for me: keep on bringing my grad students to PyCon.


by C. Titus Brown at April 13, 2015 10:00 PM

April 09, 2015

Titus Brown

PyCon 2015 talk notes for &quot;How to interpret your own genome&quot;

Here are talk notes and links for my PyCon 2015 talk.

The talk slides are up on SlideShare.

General background

You should definitely check out Mike Lin's great blog posts on "Blogging my genome".

I found SNPedia through this wonderful blog post on how to use 23andMe irresponsibly, on Slate Star Codex.

My introduction to bcbio came from Brad Chapman's excellent blog post on evaluating and comparing variant detection methods.

There are several openly available benchmarking data sets for human genetics/genomics. The Ashkenazim data set I used for my talk is here, and you can see the Personal Genome Project profile for the son, here. The raw data is available here, and you can see the resequencing report for the son, here.

The Personal Genome Project is something worth checking out.

More and more of human genetics and genomics is "open" -- check out the Variant Call Format (VCF) spec, now on github.


To run the bcbio variant calling pipeline I discuss in the talk, or examine the SNPs in the Ashkenazim trio with Gemini, take a look at my pipeline notes. The Gemini part will let you examine SNPs for the three individuals in the Ashkenazi trio, starting from the VCF files.

Slide notes

Slide 4: this link explains recombination and inheritance REALLY well.

This John Hawks' blog post is my source for 300-600 novel mutations per generation.

Slide 19:

You can read more about the Ashkenazi Jews here.

The data sets are available here.

Slide 27: Canavan Disease

Slides 30 and 31 from Demographic events and evolutionary forces shaping European genetic diversity by Veeramah and Novembre, 2014.

Slide 32 from Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls, 2007.

Slide 35: the "narcissome" link

Slide 36: a paper on lack of concordance amongst variant callers.

Slide 37: the gene drive link.

How to get involved

I asked the bcbio and gemini folk if there were any opportunities for Python folk to get involved in their work.

Here are some of their thoughts:

by C. Titus Brown at April 09, 2015 10:00 PM

Continuum Analytics

What's new in Blaze

tl; dr: We discuss the latest features and user facing API changes of Blaze.

Blaze has undergone quite a bit of development in the last year. There are some exciting features to discuss with some important user facing API changes.

by Phillip Cloud at April 09, 2015 07:01 PM

April 08, 2015


Elephant 0.1.0 released

We are pleased to announce the first release of the Elephant toolbox for the analysis of neurophysiology data.

Elephant builds on the Python scientific stack (NumPy, SciPy) to provide a set of well-tested analysis functions for spike train data and time series recordings from electrodes, such as spike train statistics, power spectrum analysis, filtering, cross-correlation and spike-triggered averaging. The toolbox also provides tools to generate artificial spike trains, either from stochastic processes or by controlled perturbations of real spike trains. Elephant is built on the Neo data model, and takes advantage of the broad file-format support provided by the Neo library. A bridge to the Pandas data analysis library is also provided.

Elephant is a community-based effort, aimed at providing a common platform to test and distribute code from different laboratories, with the goal of improving the reproducibility of modern neuroscience research. If you are a neuroscience researcher or student using Python for data analysis, please consider joining us, either to contribute your own code or to help with code review and testing.

Elephant is the direct successor to NeuroTools and maintains ties to complementary projects such as OpenElectrophy and SpykeViewer. It is also the default tool for electrophysiology data analysis in the Human Brain Project.

As a simple example, let's generate some artificial spike train data using a homogeneous Poisson process:

from elephant.spike_train_generation import homogeneous_poisson_process
from quantities import Hz, s, ms
spiketrains = [
    homogeneous_poisson_process(rate=10.0*Hz, t_start=0.0*s, t_stop=100.0*s)
    for i in range(100)]

and visualize it in Matplotlib:

import matplotlib.pyplot as plt
import numpy as np
for i, spiketrain in enumerate(spiketrains):
    t = spiketrain.rescale(ms)
    plt.plot(t, i * np.ones_like(t), 'k.', markersize=2)
plt.xlim(0, 1000)
plt.xlabel('Time (ms)', fontsize=16)
plt.ylabel('Spike Train Index', fontsize=16)
plt.gca().tick_params(axis='both', which='major', labelsize=14)
Now we calculate the coefficient of variation of the inter-spike interval for each of the 100 spike trains.

from elephant.statistics import isi, cv
cv_list = [cv(isi(spiketrain)) for spiketrain in spiketrains]

As expected for a Poisson process, the values cluster around 1:

plt.xlabel('CV', fontsize=16)
plt.ylabel('count', fontsize=16)
plt.gca().tick_params(axis='both', which='major', labelsize=14)


by Andrew Davison (noreply@blogger.com) at April 08, 2015 05:01 PM

April 06, 2015

Titus Brown

Towards a bioinformatics middle class

Note: Turns out Nick Loman is a C programmer. Well, that's what happens when I make assumptions, folks ;).

Jared Simpson just posted a great blog entry on nanopolish, an HMM-based consensus caller for Oxford Nanopore data. In it he describes how he moved from a Python prototype to a standalone C++ program. It's a great blog post, but it struck one discordant note for me. In the post, Jared says:

I was not satisfied with the Python/C++ hybrid design. I am sensitive to installation issues when releasing software as I have found that installing dependencies is a major source of problems for the user (...). I admire Heng Li's software where one usually just needs to run git clone and make to build the program.

Fundamentally, moving from a lightweight Python layer on top of a heavier, optimized C++ library into a standalone binary seems like a step backwards to me. I wrote on Twitter,

I worry ... that short-term convenience is lost at expense of composability and flexibility. Thoughts?

which spawned an interesting conversation about dependency hell, software engineering done proper, funding, open source process and upstream contributions, and design vs language.

I don't have a single coherent message to give, but I wanted to make a few points in a longer format (hence this blog post).

First, format hell is directly caused by everyone developing their own programs, which consume and emit semi-standard formats. In the absence of strong format standards (which I don't particularly want), our choice may be between this format hell, and internal manipulation of rich formats and data structures. (Maybe I'm over-optimistic about the latter?)

Second, a component ecosystem, based on scripting language wrappers around C/C++, strikes me as a major step forward in terms of flexibility and composability. (That's what we're working towards with some of our khmer development.) A bunch of lightweight components that each do one interesting thing, well, would let us separate specific function from overall processing logic and format parsing. Moreover, it would be testable - a major concern with our current "stack" in bioinformatics, which is only amenable to high level testing.

Third, and this is my real concern - C++ is an utterly opaque language to most people. For example, Nick Loman - a pretty sophisticated bioinformatics dude, IMO - is almost certainly completely incapable of doing anything with nanopolish's internals. This is because his training is in biology, not in programming. I'm picking on Nick because he's Jared's partner in nanopolish, but I think this is a generally true statement of many quite capable bioinformaticians. Heck, I'm perfectly capable in C and can scratch my way through C++ programming, but I do my best to avoid packages that have only a C++ interface because of the procedural overhead involved.

I disagree strongly with Jared's black & white statement that "this isn't a language problem" -- part of it absolutely is! Scripting languages enable a much more flexible and organic interaction with algorithms than languages like Java and C++, in my extensive personal experience. People can also pick up scripting languages much more easily than they can C++ or Java, because the learning curve is not so steep (although the ultimate distance climbed may be long).

This leads into my choice of title - at the end of the day, what do we want? Do we want a strong division between bioinformatics "haves" - those who can grok serious C++ code at a deep level, and interact with the C++ interface when they need to adapt code, vs those who consume text I/O at the command line or via hacked-together pipelines? Or do we want a thicker "middle class", who appreciate the algorithms and can use and reuse them in unexpected ways via a scripting language, but without the investment and time commitment of digging into the underlying library code?

I've put my energy squarely on the latter vision, with teaching, training, software development, and publication, so you know where I stand :). Maybe I'm naive in thinking that this approach will build a better, stronger set of approaches to bioinformatics and data-intensive biology than the other approach, but I'm giving it the ol' college try.


p.s. I'm hoping to post a powerful demonstration of the component/library approach in a few weeks; I'll revisit the topic then.

p.p.s. It should go without saying that Jared and Nick and Heng Li are all great; this is not a personal diatribe, and given the amount of time and energy we put into building khmer as a Python library, I wouldn't recommend this approach in the current funding climate. But I want to (re)start the discussion!

by C. Titus Brown at April 06, 2015 10:00 PM

April 04, 2015

Juan Nunez-Iglesias


Over the past few weeks, I’ve been heavily promoting the SciPy conference, a meeting about scientific programming in Python. I’ve been telling everyone who would listen that they should submit a talk abstract and go, because scientific programming is increasingly common in any scientist’s work and SciPy massively improves how you do that.

I have also been guiltily ommitting that the speaker and attendee diversity at SciPy is shockingly bad. Last year, for example, 15% of attendees were women, and that was an improvement over the ratio three years ago, when just 3% (!!!) were women.

I rationalised continuing to promote this conference because there was talk from past organisers about making efforts to improve. (And indeed, the past three years have been on an upward trajectory.)

A couple of days ago, however, the full list of keynote speakers was announced, and lo and behold, it’s three white guys. I have to acknowledge that they are extremely accomplished in the SciPy universe, and, if diversity were not more generally a problem at this conference and in tech in general, I wouldn’t bat an eye. Excellent choice of speakers, really. Looking forward to it.

But diversity is a problem. It’s an enormous problem. I’m inclined to call it catastrophic.

Let me try to quantify it. Men and women are equally capable scientific programmers. So out of a total pool of 100 potential SciPy attendees/contributors, 50 are women and 50 are men. Now, let’s assume the male side of the community is working at near-optimum capacity, so, 50 of 50 those men are at SciPy. 15% of attendees being women means just 9 of the 50 potential female contributors are making it out to the conference (9/59 ≈ 15%). Or, slice it another way, a whopping (50 – 9) / 50 = 82% of women who could be contributing to SciPy are missing.

Now, think of where we would be if we took 82% of male science-Pythonistas and just erased their talks, their discussions, and their code contributions. The SciPy ecosystem would suck. Yet that is exactly how many coders are missing from our community.

Now, we can sit here and play female conference speaker bingo until the cows come home to roost, but that is missing the point: these are all only excuses for not doing enough. “Not my fault” is not good enough anymore. It is everyone’s fault who does not make an active and prolonged effort to fix things.

The keynote speakers are an excellent place to make a difference, because you can eliminate all sorts of confounders. I have a certain sympathy, for example, for the argument that one should pick the best abstracts/scholarship recipients, rather than focus on race or gender. (Though the process should be truly blind to remove pervasive bias, as studies and the experience of orchestra auditions have repeatedly shown.) For keynotes though, organisers are free to pursue any agenda they like. For example, they can make education a theme, and get Lorena Barba and Greg Wilson in, as they did last year.

Until the gender ratio begins to even remotely approach 1:1, diversity as an agenda should be a priority for the organisers. This doesn’t mean invite the same number of women and men to give keynotes. This means keep inviting qualified women until you have at least one confirmed female keynote speaker, and preferably two. Then, and only then, you can look into inviting men.

Women have been found to turn down conference invitations more often than men, irrespective of ability or accomplishment. I don’t know why, but I suspect one reason is lack of role models, in the form of previous female speakers. That’s why this keynote roster is so disappointing. There’s tons of accomplished female Pythonistas out there, and there would be even more if we all made a concerted effort to improve things.

I don’t want to retread the same territory that Jonathan Eisen (@phylogenomics) has already covered in “Calling attention to meeting with skewed gender ratios, even when it hurts“. In particular, see that article for links to many others with ideas to improve gender ratios. But this is my contribution in the exact same series: love SciPy. See my previous posts for illustration.

Even looking back at my recent post, when I looked for a picture that I thought captured the collegial, collaborative feel of the conference, I unintentionally picked one featuring only men. This needs to improve, massively, if I’m going to keep supporting this conference. I really hope the organisers place diversity at the centre of their agenda for every decision going forward.

I thank Jonathan Eisen, Andy Ray Terrel, and April Wright for comments on earlier versions of this article.

by Juan Nunez-Iglesias at April 04, 2015 02:46 AM

April 02, 2015

Titus Brown

OpenCon webcast: &quot;How to get tenure (while being open)&quot;

This is a stub blog post for the talk notes for my OpenCon talk on how to get tenure as an open scientist.

A few links --

More soon!


playing on easy

by C. Titus Brown at April 02, 2015 10:00 PM

April 01, 2015

Titus Brown

Cultural confusions about data - the intertidal zone between two styles of biology

A few weeks back, a journalist contacted me about my old blog post comparing physics and biology, and amidst other conversation, I pointed them at my latest blog post on data and said that I thought a lot of (molecular) biologists were "culturally confused about data". The next question was, perhaps obviously, "what do you mean by that?" and I wrote in response:

... for molecular biologists, "data" is what they collect piece by piece with PCR, qPCR, clone sequencing, perturbation experiments, image observations. It is so individual and specialized to a problem that to share it prior to publication makes no sense; no one could understand it all unless it were fit into a narrative as part of a pub, and the only useful product of the data is the publication; access to the data is only useful for verifying that it wasn't manufactured.

Sequencing data was one of the first outputs (as opposed to things like reagents, antibodies and QPCR primers) that was useful beyond a particular narrative. I might sequence a gene because I want to knock it down and need to know its leader sequence for that, but then you might care about its exonic structure for evolutionary reasons, and Phil over there might be really interested in its protein domains, while Kim might be looking at an allele of that gene that is only in part of the population.

I'm probably overstating that distinction but it helps explain a LOT of what I've seen in terms of cultural differences between my grad/pd labs (straight up bio) and where I think bio is going.

I'm sure I'm wrong (certainly incomplete) about lots of this, but it does fit my own personal observations. Other perspectives welcome!

I decided to write this up as a blog post because I read Carly Strasser's excellent blog post introducing open science, which emphasizes data, and it made me think about my response above. I think it's interesting to think about how "data" can be interpreted by different fields, and I'd like to stress how important it is that we bridge the gap between these high-level views and day-to-day practice in each subdomain - the culture and language can vary so significantly between even neighboring fields!

Oh, and Carly Strasser is now one of the Moore Data Driven Discovery Initiative Program Officers - I'm really happy to see the Moore Foundation confronting these aspects of data head on by hiring someone with Carly's experience and expertise, and I look forward to interacting with her more on these issues!


by C. Titus Brown at April 01, 2015 10:00 PM

Gaël Varoquaux

Job offer: working on open source data processing in Python

We, Parietal team at INRIA, are recruiting software developers to work on open source machine learning and neuroimaging software in Python.

In general, we are looking for people who:

  • have a mathematical mindset,
  • are curious about data (ie like looking at data and understanding it)
  • have an affinity for problem-solving tradeoffs
  • love high-quality code
  • worry about users
  • are good scientific Python coders,
  • enjoy interacting with a community of developers

We welcome candidates people without all the skills, but are strongly motivated to acquire them. Prior open-source experience is a big plus.

One example of such position with application in Neuroimaging is: http://gael-varoquaux.info/programming/hiring-a-programmer-for-a-brain-imaging-machine-learning-library.html Which was opened a year ago and has now resulted in nilearn: http://nilearn.github.io/

Other positions may be more focused on general machine learning or computing tools such as scikit-learn and joblib, which are reference open-source libraries for data processing in Python.

We are a tightly knit team, with a high degree of programming, data analysis and neuroimaging skills.

Please contact me and Olivier Grisel if you are interested,

by Gaël Varoquaux at April 01, 2015 10:00 PM


ANN: HoloViews 1.0 data visualization and ImaGen 2.0 pattern generation in Python

We are pleased to announce the first public release of HoloViews, a free Python package for scientific and engineering data visualization:


and version 2.0 of ImaGen, a free Python package for generating two-dimensional patterns useful for vision research and computational modeling:


HoloViews provides composable, sliceable, declarative data structures for building even complex visualizations of any scientific data very easily.  With HoloViews, you can see your data as publication-quality figures almost instantly, so that you can focus on the data itself, rather than on laboriously putting together your figures.  Even complex multi-subfigure layouts and animations are very easily built using HoloViews.

ImaGen provides highly configurable, resolution-independent input patterns, directly visualizable using HoloViews but also available without any plotting package so that they can easily be incorporated directly into your computational modeling or visual stimulus generation code.  With ImaGen, any software with a Python interface can immediately support configurable streams of 0D, 1D, or 2D patterns, without any extra coding.

HoloViews and ImaGen are very general tools, but they were designed to solve common problems faced by vision scientists and computational modelers.  HoloViews makes it very easy to visualize data from vision research, whether it is visual patterns, neural activity patterns, or more abstract measurements or analyses.  Essentially, HoloViews provides a set of general, compositional, multidimensional data structures suitable for both discrete and continuous real-world data, and pairs them with separate customizable plotting classes to visualize them without extensive coding.

ImaGen 2.0 uses the continuous coordinate systems provided by HoloViews to implement flexible resolution-independent generation of streams of patterns, with parameters controlled by the user and allowing randomness or other arbitrary changes over time.  These patterns can be used for visual stimulus generation, testing or training computational models, initializing connectivity in models, or any other application where random or dynamic but precisely controlled streams of patterns are needed.


- Freely available under a BSD license
- Python 2 and 3 compatible
- Minimal external dependencies -- easy to integrate into your workflow
- Declarative approach provides powerful compositionality with minimal coding
- Include extensive, continuously tested IPython Notebook tutorials
- Easily reconfigurable using documented and validated parameters
- Animations are supported natively, with no extra work
- Supports reproducible research -- simple specification, archived in an IPython Notebook, providing a recipe for regenerating your results
- HoloViews is one of three winners of the 2015 UK Open Source Awards

To get started, check out ioam.github.io/holoviews and ioam.github.io/imagen!

Jean-Luc Stevens
Philipp Rudiger
Christopher Ball
James A. Bednar

by Jim Bednar (noreply@blogger.com) at April 01, 2015 04:46 AM

March 31, 2015

Continuum Analytics

Continuum Analytics - April Tech Events

Catch us this month at PyCon Montreal and PyData Dallas - the first PyData Conference in Texas - plus all the events below. Email info@continuum.io to schedule a meeting or to let us know about your meetup/event!

by Continuum at March 31, 2015 12:00 PM

Matthieu Brucher

Book: Building Machine Learning Systems in Python

Almost 18 months ago, I posted a small post on the first version of this book (http://matt.eifelle.com/2013/09/04/book-building-machine-learning-systems-in-python/). At the time, I was really eager to see the second edition of it. And there it is!

I had once again the privilege of being a technical reviewer for this book, and I havce to say that the quality didn’t lower one bit, it went even higher. Of course, there is still room for a better book, when all Python module for Machine Learning are even better. I guess that will be for the third edition!

To get the book from the publisher: https://www.packtpub.com/big-data-and-business-intelligence/building-machine-learning-systems-python-second-edition

On other matters, the blog was quiet for a long time, I’m hoping to get some time to post a few new posts soon, but it is quite hard as I’m currently studying for another master’s degree!

by Matt at March 31, 2015 07:13 AM

March 30, 2015

Titus Brown

Some ideas for workshops and unconference models for data-intensive biology

Here at the Lab for Data-Intensive Biology (TM) we are constantly trying to explore new ideas for advancing the practice of biological data sciences. Below are some ideas that originated with or were sharpened by conversations with Greg Wilson (Executive Director, Software Carpentry) and Tracy Teal (Project Lead, Data Carpentry) that I am interested in turning into reality, as part of my training efforts in Data Intensive Biology at UC Davis (also see my blog post).

Quarterly in-depth "unconferences" on developing advanced domain-specific data analyses

I spend an increasing amount of time working to teach people how to analyze sequencing data, but in practice we kinda suck at analyzing this data, especially when it's from non-model systems. We need some workshops to advance those efforts, too.

So, I am thinking of running quarterly (4x year) week-long unconferences, each on a different topic. The idea is to get together a small group of people (~15-25) actively and openly working on various aspects of a specific type of data analysis, to hang out and collaborate/coordinate their efforts for a week. The plan would be to intersperse a few presentations with lots of hacking and communication time, with the goal of making progress on topics of mutual interest and nucleating collaborative online interactions.

The following topics are areas where I can easily imagine a week-long technical workshop making some substantive progress:

  • vet/ag genome (re)annotation and curation
  • non-model transcriptome analysis
  • geomicrobiology/function-focused metagenome analysis
  • bioinformatics training (e.g. train the trainers)
  • reproducibility in genomic analysis
  • computational protocols and benchmarking
  • advanced statistical approaches to data integration

Importantly, these workshops would be inexpensive and largely unfunded - I would ask participants to fund themselves, rather than seeking to write grants to support a bunch of people. If we can locate them in an inexpensive place then the total cost would be in the $1000-2000/person range, which most active research labs could probably support. I would seek funding for scholarships to increase diversity of participants, but beyond that my goal would be to make these workshops so useful that active and funded researchers want to come. (I mean, I wouldn't turn down money that dropped into my lap, but I've had too many workshop proposals rejected as not quite what the PMs wanted to get on that merry-go-round again, unless it's critical to make something super important happen.)

One non-negotiable component for me would be that everything worked on at these meetings would be under open licenses, and already being developed openly. Although I suppose we could have a meeting where people interested in opening their software could get together to do so in a guided fashion... proponents of open science, take note!

Such workshops would not need to be hosted at UC Davis, or by me; I just want to make 'em happen and am happy to co-organize or coordinate, and think I could do ~4 a year myself. There are a lot of people invested in progressing on these issues who already have some money, and so one option for moving forward more generally would be to find those people and co-opt them :).

A workshop consisting of half-day focused lessons

Last week, I ran a workshop on starting a new project with reproducibility in mind --

Reproducible Computational Analysis - How to start a new project

Description: Computational science projects, from data analysis to modeling, can benefit dramatically from a little up-front investment in automation; starting off with version control and automated building of results will pay off in efficiency, agility, and both transparency and reproducibility of the results. However, most computational researchers have never been exposed to a completely automated analysis pipeline. I will demonstrate the process of initiating a new project, building a few initial scripts, and automating the generation of results, as well as building some graphs. While the topic will be from my own research in bioinformatics, the overall approach should apply to anyone doing data analysis or simulations.

Technology used will include git, IPython Notebook, and 'make'.

This is an interactive seminar intended for computational science researchers with some experience in version control and scripting (for example, if you've taken a Software Carpentry workshop, you will be at a good starting point).

This is an idea that originated with Greg - he nucleated the idea, and then I went ahead and tried it out. More on that workshop later, but... why not do this a lot?

We are thinking about how to do a focused series of these ~3 hour learning opportunities, either all demo or half-demo/half participation, each on a different topic. For example, my lab could do a section on k-mer analyses of large sequencing data sets, or on GitHub Flow, or on software testing, or on whatever; the important thing is that there are tons of Software Carpentry instructors with deep roots in one discipline or another, and it'd be a fun way to learn from each other while teaching to a larger audience.

This is something we might try during the third week of our NGS summer course; if you're a badged SWC instructor and want to demo something related to sequence analysis, drop me a note with a brief proposal.

Instructor gatherings for lesson development and testing

Tracy Teal, Jason Williams, Mike Smorul, Mary Shelley, Shari Ellis, and Hilmar Lapp just ran a Data Carpentry hackathon focused on lesson development and assessment. Riffing off of that, what about getting instructors together to do lesson development and testing on a regular basis, and then present it in front of a more advanced crowd? This would be an opportunity for people to develop and test lessons for Software Carpentry and Data Carpentry on a tolerant audience, with other instructors around to offer help and advice, and without the challenges of a completely novice audience for the first time through.

This is also something we might try during the third week of our NGS summer course; if you're a badged SWC instructor and want to do something on sequence analysis, please drop me a note and tell me what!

Any other thoughts on things that have worked, or might work, for advancing training and practice in a hands-on manner?



by C. Titus Brown at March 30, 2015 10:00 PM

March 27, 2015

Gaël Varoquaux

Euroscipy 2015: Call for paper

EuroScipy 2015, the annual conference on Python in science will take place in Cambridge, UK on 26-30 August 2015. The conference features two days of tutorials followed by two days of scientific talks & posters and an extra day dedicated to developer sprints. It is the major event in Europe in the field of technical/scientific computing within the Python ecosystem. Scientists, PhD’s, students, data scientists, analysts, and quants from more than 20 countries attended the conference last year.

The topics presented at EuroSciPy are very diverse, with a focus on advanced software engineering and original uses of Python and its scientific libraries, either in theoretical or experimental research, from both academia and the industry.

Submissions for posters, talks & tutorials (beginner and advanced) are welcome on our website at http://www.euroscipy.org/2015/ Sprint proposals should be addressed directly to the organisation at euroscipy-org@python.org

Important dates:

  • Apr 30, 2015 Talk and tutorials submission deadline
  • May 1, 2015 Registration opens
  • May 30, 2015 Final program announced
  • Jun 15, 2015 Early-bird registration ends
  • Aug 26-27, 2015 Tutorials
  • Aug 28-29, 2015 Main conference
  • Aug 30, 2015 Sprints

We look forward to an exciting conference and hope to see you in Cambridge

The EuroSciPy 2015 Team - http://ww.euroscipy.org/2015/

by Gaël Varoquaux at March 27, 2015 11:00 PM

March 26, 2015

Titus Brown

A new preprint on semi-streaming analysis of large data sets

We just posted a new preprint (well, ok, a few weeks back)! The preprint title is "Crossing the streams: a framework for streaming analysis of short DNA sequencing reads", by Qingpeng Zhang, Sherine Awad, and myself. Note that like our other recent papers, this paper is 100% reproducible, with all source code in github. we haven't finished writing up the instructions yet, but happy to share (see AWS.md for notes).

This paper's major point is that you can use the massive redundancy of deep short-read sequencing to analyze data as it comes in, and that this could easily be integrated into existing k-mer based error correctors and variant callers. Conveniently the algorithm doesn't care what you're sequencing - no assumptions are made about uniformity of coverage, so you can apply the same ol' spectral approaches you use for genomes to transcriptomes, metagenomes, and probably single-cell data. Which is, and let's be frank here, super awesome.

The paper also provides two useful tools, both implemented as part of khmer: one is an efficient approach to k-mer-abundance-based error trimming of short reads, and the other is a streaming approach to looking at per-position error profiles of short-read sequencing experiments.

A colleague, Erich Schwarz, suggested that I more strongly make the point that is really behind this work: we're in for more data. Scads and scads of it. Coming it with ways of efficiently dealing with it at an algorithmic level is important. (We didn't strengthen this point in the posted preprint - the feedback came too late -- but we will hopefully get a chance to stick it in in the revisions.)

The really surprising thing for me is that the general semi-streaming approach has virtually no drawbacks - the results are very similar to the full two-pass offline approaches used currently for error correction etc. Without implementing a huge amount of stuff we had to make this argument transitively, but I think it's solid.

For entertainment, take a look at the error profile in Figure 6. That's from a real data set, published in Nature something or other...

And, finally, dear prospective reviewers: the biggest flaws that I see are these:

  • we chose to make most of our arguments by analyzing real data, and we didn't spend any time developing theory. This is a choice that our lab frequently makes -- to implement effective methods rather than developing the underlying theory -- but it leaves us open for a certain type of criticism.
  • to extend the previous point, the algorithmic performance depends critically on details of the data set. We didn't know how to discuss this and so the discussion is maybe a bit weak. We'd love reviewers to ask pointed questions that we can address in order to shore it up.
  • Repeats! How does all this stuff work with repeats!? I did a lot of work simulating repetitive sequences and couldn't find any place where repeats actually caused problems. My intuition now tells me that repeats are not actually a problem for methods that interrogate De Bruijn graphs using entire reads as an index into the graph, but I'd welcome someone telling me I'm wrong and either telling me where to look, or asking concrete questions that illuminate better directions to take it.


by C. Titus Brown at March 26, 2015 11:00 PM

March 25, 2015

Matthew Rocklin

Partition and Shuffle

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

This post primarily targets developers.

tl;dr We partition out-of-core dataframes efficiently.

Partition Data

Many efficient parallel algorithms require intelligently partitioned data.

For time-series data we might partition into month-long blocks. For text-indexed data we might have all of the “A”s in one group and all of the “B”s in another. These divisions let us arrange work with foresight.

To extend Pandas operations to larger-than-memory data efficient partition algorithms are critical. This is tricky when data doesn’t fit in memory.

Partitioning is fundamentally hard

Data locality is the root of all performance
    -- A Good Programmer

Partitioning/shuffling is inherently non-local. Every block of input data needs to separate and send bits to every block of output data. If we have a thousand partitions then that’s a million little partition shards to communicate. Ouch.

Shuffling data between partitions

Consider the following setup

  100GB dataset
/ 100MB partitions
= 1,000 input partitions

To partition we need shuffle data in the input partitions to a similar number of output partitions

  1,000 input partitions
* 1,000 output partitions
= 1,000,000 partition shards

If our communication/storage of those shards has even a millisecond of latency then we run into problems.

  1,000,000 partition shards
x 1ms
= 18 minutes

Previously I stored the partition-shards individually on the filesystem using cPickle. This was a mistake. It was very slow because it treated each of the million shards independently. Now we aggregate shards headed for the same out-block and write out many at a time, bundling overhead. We balance this against memory constraints. This stresses both Python latencies and memory use.

BColz, now for very small data

Fortunately we have a nice on-disk chunked array container that supports append in Cython. BColz (formerly BLZ, formerly CArray) does this for us. It wasn’t originally designed for this use case but performs admirably.

Briefly, BColz is…

  • A binary store (like HDF5)
  • With columnar access (useful for tabular computations)
  • That stores data in cache-friendly sized blocks
  • With a focus on compression
  • Written mostly by Francesc Alted (PyTables) and Valentin Haenel

It includes two main objects:

  • carray: An on-disk numpy array
  • ctable: A named collection of carrays to represent a table/dataframe

Partitioned Frame

We use carray to make a new data structure pframe with the following operations:

  • Append DataFrame to collection, and partition it along the index on known block divisions blockdivs
  • Extract DataFrame corresponding to a particular partition

Internally we invent two new data structures:

  • cframe: Like ctable this stores column information in a collection of carrays. Unlike ctable this maps perfectly onto the custom block structure used internally by Pandas. For internal use only.
  • pframe: A collection of cframes, one for each partition.

Partitioned Frame design

Through bcolz.carray, cframe manages efficient incremental storage to disk. PFrame partitions incoming data and feeds it to the appropriate cframe.


Create test dataset

In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a': [1, 2, 3, 4],
...                        'b': [1., 2., 3., 4.]},
...                       index=[1, 4, 10, 20])

Create pframe like our test dataset, partitioning on divisions 5, 15. Append the single test dataframe.

In [3]: from pframe import pframe
In [4]: pf = pframe(like=df, blockdivs=[5, 15])
In [5]: pf.append(df)

Pull out partitions

In [6]: pf.get_partition(0)
   a  b
1  1  1
4  2  2

In [7]: pf.get_partition(1)
    a  b
10  3  3

In [8]: pf.get_partition(2)
    a  b
20  4  4

Continue to append data…

In [9]: df2 = pd.DataFrame({'a': [10, 20, 30, 40],
...                         'b': [10., 20., 30., 40.]},
...                        index=[1, 4, 10, 20])
In [10]: pf.append(df2)

… and partitions grow accordingly.

In [12]: pf.get_partition(0)
    a   b
1   1   1
4   2   2
1  10  10
4  20  20

We can continue this until our disk fills up. This runs near peak I/O speeds (on my low-power laptop with admittedly poor I/O.)


I’ve partitioned the NYCTaxi trip dataset a lot this week and posting my results to the Continuum chat with messages like the following

I think I've got it to work, though it took all night and my hard drive filled up.
Down to six hours and it actually works.
Three hours!
By removing object dtypes we're down to 30 minutes
20!  This is actually usable.
OK, I've got this to six minutes.  Thank goodness for Pandas categoricals.
Down to about three and a half with multithreading, but only if we stop blosc from segfaulting.

And thats where I am now. It’s been a fun week. Here is a tiny benchmark.

>>> import pandas as pd
>>> import numpy as np
>>> from pframe import pframe

>>> df = pd.DataFrame({'a': np.random.random(1000000),
                       'b': np.random.poisson(100, size=1000000),
                       'c': np.random.random(1000000),
                       'd': np.random.random(1000000).astype('f4')}).set_index('a')

Set up a pframe to match the structure of this DataFrame Partition index into divisions of size 0.1

>>> pf = pframe(like=df,
...             blockdivs=[.1, .2, .3, .4, .5, .6, .7, .8, .9],
...             chunklen=2**15)

Dump the random data into the Partition Frame one hundred times and compute effective bandwidths.

>>> for i in range(100):
...     pf.append(df)

CPU times: user 39.4 s, sys: 3.01 s, total: 42.4 s
Wall time: 40.6 s

>>> pf.nbytes

>>> pf.nbytes / 40.6 / 1e6  # MB/s

>>> pf.cbytes / 40.6 / 1e6  # Actual compressed bytes on disk

We partition and store on disk random-ish data at 68MB/s (cheating with compression). This is on my old small notebook computer with a weak processor and hard drive I/O bandwidth at around 100 MB/s.

Theoretical Comparison to External Sort

There isn’t much literature to back up my approach. That concerns me. There is a lot of literature however on external sorting and they often site our partitioning problem as a use case. Perhaps we should do an external sort?

I thought I’d quickly give some reasons why I think the current approach is theoretically better than an out-of-core sort; hopefully someone smarter can come by and tell me why I’m wrong.

We don’t need a full sort, we need something far weaker. External sort requires at least two passes over the data while the method above requires one full pass through the data as well as one additional pass through the index column to determine good block divisions. These divisions should be of approximately equal size. The approximate size can be pretty rough. I don’t think we would notice a variation of a factor of five in block sizes. Task scheduling lets us be pretty sloppy with load imbalance as long as we have many tasks.

I haven’t implemented a good external sort though so I’m only able to argue theory here. I’m likely missing important implementation details.

  • PFrame code lives in a dask branch at the moment. It depends on a couple of BColz PRs (#163, #164)

March 25, 2015 12:00 AM

March 24, 2015

Martin Fitzpatrick

Live Demo - Wooey: Web UIs for Python scripts

A new live demo of Wooey is now up and running with a few simple example scripts. Features:

  • Web UIs for Python scripts generated automatically from argparse
  • An automated background worker to run scripts and collect outputs
  • Run Queue to schedule (and prioritise) jobs
  • Automatic rendering of viewable outputs, downloadable .zips for everything else

Try it out, fork it or leave feedback below.

by Martin Fitzpatrick at March 24, 2015 11:35 AM

Juan Nunez-Iglesias

Photo by Ian Rees (from the SciPy 2012 conference website)

SciPy is my favourite conference. My goal with this post is to convince someone to go who hasn’t had that chance yet.

Photo by Ian Rees (from the SciPy 2012 conference website)

Why SciPy?

Most scientists go to conferences in their own field: neuroscientists go to the monstrous Society for Neuroscience (SfN); Bioinformaticians go to RECOMB, ISMB, or PSB; and so on.

People go to these to keep up with the latest advances in their field, and often, to do a bit of networking.

SciPy is a different kind of conference. It changes the way you do science. You learn about the latest free and open source software to help you with your work. You learn to write functions and interfaces instead of scripts, and to write tests so you don’t break your code. You also learn to contribute these to bigger projects, maximising the reach and impact of your work (see “sprints”, below).

And you learn these things by doing them, with the people who are the best at this, rather than by reading books and blog posts. (Which maybe I shouldn’t knock, since I’m writing a book about all this and you are reading my blog!)

Attendees to SciPy have scientific software in common, but come from diverse fields, including physics, pure maths, data visualisation, geosciences, climatology, and yes, biology and bioinformatics. Mingling with such a diverse group is a fantastic way to get your creative juices flowing!

The conference lasts a full week and is broken up into three parts: tutorials, main conference, and sprints.

the tutorials

With a few exceptions, you won’t learn about your own field. But you will learn an enormous amount about tools that will help you be a better scientist. If you are a newbie to Python, you can go to the beginner tutorials track and learn about the fantastic scientific libraries available in Python. If you already use NumPy, SciPy, pandas, and the IPython notebook, you can go to the intermediate or advanced tracks, and learn new things about those. Even as an advanced SciPy user I still get tons of value from the tutorials. (Last year Min RK gave a wild demo of IPython parallel’s capabilities by crawling wikipedia remotely while building up a graph visualisation on his live notebook.) (Fast-forward to the 1h mark to see just the payoff.) Here’s last year’s tutorial schedule for an idea of what to expect.

the main conference track

You will also hear about the latest advances in the scientific libraries you know and use, and about libraries you didn’t know about but will find useful (such as scikit-bio, yt or epipy). The main conference track features software advances written by presenters from many different fields. Hearing about these from the authors of the software lets you ask much different questions compared to hearing someone say, for example, “we used the Matlab image processing toolbox”. If you ever had a feature request for your favourite library, or you wondered why they do something in a particular way, there’s no better opportunity to get some closure.

The crosstalk between different fields is phenomenal. Hearing how a diverse set of people deal with their software problems really opens your mind to completely different approaches to what you had previously considered.

the sprints

Finally, there’s two days of coding sprints. Even if you are a complete beginner in software development, do yourself a favour and participate in one of these.

Two and a half years after my first SciPy in 2012, I’m writing a scientific Python book for O’Reilly, and I can 100% trace it to participating in the scikit-image sprint that year. With their guidance, I wrote my first ever GitHub pull request and my first ever unit test. Both were tiny and cute, and I would consider them trivial now, but that seed grew into massive improvements in my code-writing practice and many more contributions to open source projects.

And this is huge: now, instead of giving up when a software package doesn’t do what I need it to do, I just look at the source code and figure out how I can add what I want. Someone else probably wants that functionality, and by putting it into a major software library instead of in my own code, I get it into the hands of many more users. It’s a bit counterintuitive but there is nothing more gratifying than having some random person you’ve never met complain when you break something! This never happens when all your code is in your one little specialised repository containing functionality for your one paper.

How SciPy

The SciPy calls for tutorials, talks, posters, and its plotting contest are all out. There’s specialised tracks and most of you reading this are probably interested in the computational biology and medicine track. It’s taken me a while to write this post, so there’s just one week left to submit something: the deadline is April 1st Update: the deadline for talks and posters has been extended to April 10th!

Even if you don’t get something in, I encourage you to participate. Everything I said above still applies if you’re not presenting. You might have a bit more trouble convincing your funders to pay for your travels, but if that’s the case I encourage you to apply for financial assistance from the conference.

I’ve written about SciPy’s diversity problem before, so I’m happy to report that this year there’s specific scholarships for women and minorities. (This may have been true last year, I forget.) And, awesomely, Matt Davis has offered to help first-time submitters with writing their proposals.

Give SciPy a try: submit here and/or register here. And feel free to email me or comment below if you have questions!

Update: A colleague pointed out that I should also mention the awesomeness of the conference venue, so here goes: Austin in July is awesome. If you love the heat like I do, well, it doesn’t get any better. If you don’t, don’t worry: the AT&T Conference Center AC is on friggin overdrive the whole time. Plus, there’s some nearby cold springs to swim in. The center itself is an excellent hotel and the conference organises massive discounts for attendees. There’s a couple of great restaurants on-site; and the Mexican and Texas BBQ in the area are incredible — follow some Enthought and Continuum folks around to experience amazing food. Finally, Austin is a great city to bike in: last time I rented a road bike for the whole week from Mellow Johnny’s, and enjoyed quite a few lunchtime and evening rides.

by Juan Nunez-Iglesias at March 24, 2015 04:54 AM

March 23, 2015

Titus Brown

Metagenomics related/relevant workshops - a list

The other day I was contacted by someone whose student wants to attend the MSU NGS course in 2015, because they are interested in learning how to data integration with (among other things) metagenome data. My response was "we don't cover that in the course", which isn't very helpful ;).

So, I went hunting, and got the following list of metagenome relevant workshops from a program manager. Note that I asked them to cast a broad net, so this goes far beyond "mere" computational analysis of environmental microbes and their functional -- but it should be pretty inclusive of what's out there. If I'm missing something relevant, please let me know!

JGI 2015 workshops - fungal genomics, genomic technologies, KBase, sample QC, and synthetic biology. (Ongoing now, but keep an eye out for next year.)

QIIME workshops, ~monthly and/or by invitation.

The International GeoBiology workshop series, application deadline in February - keep an eye out for next year. (This is the workshop I need/want to attend myself, so I can learn more biogeochemistry!)

Bigelow Third Microbial Single Cell Genomics Workshop, in June - deadline Mar 29th!!

iMicrobe workshops at ASM this year - see Workshop WS18. Note, registration deadline Mar 30th!!

MBL STAMPS (Applications closed for 2015)

MBL Microbial Diversity (applications closed for 2015) - the course is looking interesting this year, with some focus on intersections between sequencing and good ol' fashioned physiology/biogeochemistry.

EDAMAME (applications closed for 2015) - mostly focused on microbial ecology.

My only frustration with this list is that it seems like there's very little out there that really digs into the bioinformatics of shotgun metagenomics and biogeochemistry - the MicDiv course at MBL may well dip a good portion of a leg into this pond, and I'll find out more this year because I'm participating in it. But that's it. Am I wrong?


by C. Titus Brown at March 23, 2015 11:00 PM

&quot;Open Source, Open Science&quot; Meeting Report - March 2015

On March 19th and 20th, the Center for Open Science hosted a small meeting in Charlottesville, VA, convened by COS and co-organized by Kaitlin Thaney (Mozilla Science Lab) and Titus Brown (UC Davis). People working across the open science ecosystem attended, including publishers, infrastructure non-profits, public policy experts, community builders, and academics.

Open Science has emerged into the mainstream, primarily due to concerted efforts from various individuals, institutions, and initiatives. This small, focused gathering brought together several of those community leaders. The purpose of the meeting was to define common goals, discuss common challenges, and coordinate on common efforts.

We had good discussions about several issues at the intersection of technology and social hacking including badging, improving standards for scientific APIs, and developing shared infrastructure. We also talked about coordination challenges due to the rapid growth of the open science community. At least three collaborative projects emerged from the meeting as concrete outcomes to combat the coordination challenges.

A repeated theme was how to make the value proposition of open science more explicit. Why should scientists become more open, and why should institutions and funders support open science? We agreed that incentives in science are misaligned with practices, and we identified particular pain points and opportunities to nudge incentives. We focused on providing information about the benefits of open science to researchers, funders, and administrators, and emphasized reasons aligned with each stakeholders' interests. We also discussed industry interest in "open", both in making good use of open data, and also in participating in the open ecosystem. One of the collaborative projects emerging from the meeting is a paper or papers to answer the question "Why go open?" for researchers.

Many groups are providing training for tools, statistics, or workflows that could improve openness and reproducibility. We discussed methods of coordinating training activities, such as a training "decision tree" defining potential entry points and next steps for researchers. For example, Center for Open Science offers statistics consulting, rOpenSci offers training on tools, and Software Carpentry, Data Carpentry, and Mozilla Science Lab offer training on workflows. A federation of training services could be mutually reinforcing and bolster collective effectiveness, and facilitate sustainable funding models.

The challenge of supporting training efforts was linked to the larger challenge of funding the so-called "glue" - the technical infrastructure that is only noticed when it fails to function. One such collaboration is the SHARE project, a partnership between the Association of Research Libraries, its academic association partners, and the Center for Open Science. There is little glory in training and infrastructure, but both are essential elements for providing knowledge to enable change, and tools to enact change.

Another repeated theme was the "open science bubble". Many participants felt that they were failing to reach people outside of the open science community. Training in data science and software development was recognized as one way to introduce people to open science. For example, data integration and techniques for reproducible computational analysis naturally connect to discussions of data availability and open source. Re-branding was also discussed as a solution - rather than "post preprints!", say "get more citations!" Another important realization was that researchers who engage with open practices need not, and indeed may not want to, self-identify as "open scientists" per se. The identity and behavior need not be the same.

A number of concrete actions and collaborative activities emerged at the end, including a more coordinated effort around badging, collaboration on API connections between services and producing an article on best practices for scientific APIs, and the writing of an opinion paper outlining the value proposition of open science for researchers. While several proposals were advanced for "next meetings" such as hackathons, no decision has yet been reached. But, a more important decision was clear - the open science community is emerging, strong, and ready to work in concert to help the daily scientific practice live up to core scientific values.


Tal Yarkoni, University of Texas at Austin

Kara Woo, NCEAS

Andrew Updegrove, Gesmer Updegrove and ConsortiumInfo.org

Kaitlin Thaney, Mozilla Science Lab

Jeffrey Spies, Center for Open Science

Courtney Soderberg, Center for Open Science

Elliott Shore, Association of Research Libraries

Andrew Sallans, Center for Open Science

Karthik Ram, rOpenSci and Berkeley Institute for Data Science

Min Ragan-Kelley, IPython and UC Berkeley

Brian Nosek, Center for Open Science and University of Virginia

Erin C. McKiernan, Wilfrid Laurier University

Jennifer Lin, PLOS

Amye Kenall, BioMed Central

Mark Hahnel, figshare

C. Titus Brown, UC Davis

Sara D. Bowman, Center for Open Science

by C. Titus Brown at March 23, 2015 11:00 PM

March 22, 2015

Titus Brown

What to do with lots of (sequencing) data

On a recent west coast speaking junket where I spoke at OSU, OHSU, and VanBUG (Brown PNW '15!), I put together a new talk that tried to connect our past work on scaling metagenome assembly with our future work on driving data sharing and data integration. As you can maybe guess from the first few talk slides, the motivating chain was something like

  1. We want to help biologists move more quickly to hypotheses;
  2. This can in part be done by aiding in hypothesis generation and refinement;
  3. Right now it's painful to analyze large sequencing data sets;
  4. Let's make it less painful!

At both OHSU and OSU, where I gave very similar talks, I got important and critical feedback on these points. The three most valuable points of feedback were,

  • what exactly do you mean by data integration, anyway, Titus?
  • you never talked about generating hypotheses!
  • no, seriously, you never talked about how to actually generate hypotheses!?

The culmination point of this satori-like experience was when Stephen Giovannoni leaned across the table at dinner and said, "Perhaps you can tell me what all this data is actually good for?" This led to a very robust and energetic dinner conversation which led to this blog post :). (Note, Stephen clearly had his own ideas, but wanted to hear mine!)

The underlying point is this: I, and others, are proselytizing the free and open availability of large data sets; we're pushing the development of big data analytic tools; and we're arguing that this is important to the future of science. Why? What good is it all? Why should we worry about this, rather than ignoring it and focusing instead on (for example) physiological characterization of complex environments, clinical trials, etc. etc.?

So, without further ado,

What is all this data potentially good for?

Suppose you set out to use a body of sequencing data in your scientific research. This sequencing data is maybe something you generated yourself, or perhaps it's from a public data set, or from multiple public data sets. Either way, it's a bunch of data that you would like to make use of. What could you do with it?

(This isn't specific to sequencing data, although I think the exploratory approaches are particularly important in biology and sequencing data are well suited to exploratory analysis.)

  1. Computational hypothesis falsification.

    "I thought A was happening. If A is happening, then when I looked at my data I should have seen B. I didn't see B. Therefore A is probably not happening."

    For example, if you are convinced that a specific biogeochemical process is at work, but can't find the relevant molecules in a metagenomic survey, then either you did something wrong, had insufficient sensitivity, or your hypothesis is incorrect in the first place.

    This is one place where pre-existing data can really accelerate the scientific process, and where data availability is really important.

  2. Determination of model sufficiency.

    "I have a Boolean or quantitative model that I believe captures the essential components of my system under investigation. When I fit my actual data to this model, I see several important mismatches. Therefore my model needs more work."

    For gene regulatory networks, or metabolic modeling, this kind of approach is where we need to go. See, for example, work from my graduate lab on sea urchin GRNs - this approach is used there implicitly to drive forward the investigation of underdetermined parts of the GRN.

  3. Comparison with a null or neutral model.

    "If interesting interactions were happening, I would see patterns that deviated from my model of what an uninteresting system should look like. I don't, therefore my model of what is interesting or uninteresting needs to change."

    Somewhat of an elaboration of the above "model sufficiency", here we are choosing an explicit "null model" to interpret our data and concluding that our data is either interesting or boring. For me, the difference is that these models need not be mechanistic, while the models in the previous point are often mechanistic. One example I'd point to is Ethan White's work on maximum entropy models.

  4. Hypothesis generation (or, "fishing expedition.")

    "I have no idea what processes are at work here. Let's look at the data and come up with some ideas."

    A thoroughly underappreciated yet increasingly default approach in various areas of biology, fishing expeditions can feed the masses. (Get it? Hah!)

    But, seriously, this is an important part of biology; I wrote about why at some length back in 2011. All of the previous points rely on us already knowing or believing something, while in reality most of biology is poorly understood and in many cases we have almost no idea what is going on mechanistically. Just looking at systems can be very informative in this situation.

So, this is my first take on the reasons why I think large-scale data generation, availability, analysis, and integration can and should be first class citizens in biology. But I'd be interested in pushback and other thoughts, as well as references to places where this approach has worked well (or poorly) in biology!


p.s. Thanks to Stephen Giovannoni, Thomas Sharpton, Ryan Mueller, and
David Koslicki for the dinner conversation at OSU!

by C. Titus Brown at March 22, 2015 11:00 PM

March 20, 2015

Titus Brown

A personal perspective on the Open Source, Open Science meeting (2015)

I'm returning from a small, excellent meeting on "Open Source, Open Science", held at the Center for Open Science in Charlottesville, VA. We'll post a brief meeting report soon, but I wanted to share my particular highlights --

First, I got a chance to really dig into what the Center for Open Science is doing technically with the Open Science Framework. I still have to explore it a bit more myself, but the OSF has built some excellent infrastructure around collaborative workflows in scientific practice, and I'm tremendously impressed. Meeting Brian Nosek and Jeff Spies, and having the chance to speak at length with several members of the OSF team, was worth the trip all on its own. They were also wonderful hosts, and everything ran incredibly smoothly!

Second, I finally met a bunch of tweeps in person, including Kara Woo of NCEAS (@kara_woo), Erin McKiernan (@emckiernan13), Mark Hahnel of Figshare (@markhahnel), Jason Priem of ImpactStory (@jasonpriem), and probably others that I'm forgetting (sorry ;(. I spent a lot of time talking with Erin McKiernan in particular - we chatted at length about career challenges in open science and science more generally.

Third, the willingness and enthusiasm of funders (both federal and foundation) to engage with open science was palpable. We got some great advice on "thinking like funders" - it's all fine and well to say that open science is better, but it's important to make the case to people who are open-science naive and want to know exactly which of their priorities and interests are better enabled by open science practices.

Fourth, I was completely blown away by what Figshare is doing. Mark Hahnel gave a great 5 minute talk on how Figshare is growing into the data sharing space being created by funder mandates. Tremendously impressive.


by C. Titus Brown at March 20, 2015 11:00 PM

March 19, 2015

Continuum Analytics

Microsoft Chooses Anaconda for Azure Machine Learning Service

Anaconda, the Python distribution for large-scale data processing, is now included in Microsoft’s Azure Machine Learning service. Microsoft announced on February 18 at the Strata Hadoop World Conference the general availability of Azure ML, including some enhancements to the platform.

by Continuum at March 19, 2015 12:00 AM

March 17, 2015

Titus Brown

An mRNAseq workshop at UC Davis - my first as a Davis prof

Two weeks ago, I ran a workshop at UC Davis on mRNAseq analysis for semi-model organisms, which focused on building new gene models ab initio -- with a reference genome. This was a milestone for me - the first time I taught a workshop at UC Davis as a professor there! My co-instructors were Tamer Mansour, a new postdoc in my lab, and Isabelle Laforest-Lapointe, a microbial ecologist visiting Jonathan Eisen's lab from Montreal - about whom more, later.

This is the third workshop in my planned workshop series -- the first workshop was a Software Carpentry instructor training workshop given by Greg Wilson and Tracy Teal, and the second was a Data Carpentry workshop given by Tracy Teal.

Tracy Teal will be running a one-day workshop in April, on mothur, and I will be giving a two-day workshop in early May (May 4-5), on de novo mRNAseq analysis, for transcriptome analysis in organisms without a reference genome. I will also be teaching at the MBL Microbial Diversity Course and attending the beginning of the MBL STAMPS course in Woods Hole. Finally, I will once again host the two week summer NGS course back in Michigan in August. Then, in September, I return to UC Davis and get to actually spin up my full workshop series -- these early ones are just tasters :).

Note: Software Carpentry instructors are awesome

I needed some help with an R section in the mRNAseq workshop, so I advertised on the Software Carpentry instructors mailing list; this is a mailing list for everyone who is an accredited Software Carpentry instructor, meaning they've been through the training and a bit more. Lo and behold, one of the people who responded was Isabelle, who was visiting UC Davis to do some work in the Eisen Lab - no travel needed, and willing to help out both days. We corresponded a bit to make sure she could make the given days, and then just arranged to meet up at 8:30am on the first day of the workshop.

We showed up, she showed up, and it was a great success ;).

Now, I ask you - where else in the world can you e-mail a few hundred competent, capable, friendly people, and find one who will reliably show up at a given place and time to teach?

Software Carpentry, peeps. It's awesome.


by C. Titus Brown at March 17, 2015 11:00 PM

March 16, 2015

Matthew Rocklin

Efficiently Store Pandas DataFrames

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr We benchmark several options to store Pandas DataFrames to disk. Good options exist for numeric data but text is a pain. Categorical dtypes are a good option.


For dask.frame I need to read and write Pandas DataFrames to disk. Both disk bandwidth and serialization speed limit storage performance.

  • Disk bandwidth, between 100MB/s and 800MB/s for a notebook hard drive, is limited purely by hardware. Not much we can do here except buy better drives.
  • Serialization cost though varies widely by library and context. We can be smart here. Serialization is the conversion of a Python variable (e.g. DataFrame) to a stream of bytes that can be written raw to disk.

Typically we use libraries like pickle to serialize Python objects. For dask.frame we really care about doing this quickly so we’re going to also look at a few alternatives.


  • pickle - The standard library pure Python solution
  • cPickle - The standard library C solution
  • pickle.dumps(data, protocol=2) - pickle and cPickle support multiple protocols. Protocol 2 is good for numeric data.
  • json - using the standardlib json library, we encode the values and index as lists of ints/strings
  • json-no-index - Same as above except that we don’t encode the index of the DataFrame, e.g. 0, 1, ... We’ll find that JSON does surprisingly well on pure text data.
  • msgpack - A binary JSON alternative
  • CSV - The venerable pandas.read_csv and DataFrame.to_csv
  • hdfstore - Pandas’ custom HDF5 storage format

Additionally we mention but don’t include the following:

  • dill and cloudpickle- formats commonly used for function serialization. These perform about the same as cPickle
  • hickle - A pickle interface over HDF5. This does well on NumPy data but doesn’t support Pandas DataFrames well.


Disclaimer: We’re about to issue performance numbers on a toy dataset. You should not trust that what follows generalizes to your data. You should look at your own data and run benchmarks yourself. My benchmarks lie.

We create a DataFrame with two columns, one with numeric data, and one with text. The text column has repeated values (1000 unique values, each repeated 1000 times) while the numeric column is all unique. This is fairly typical of data that I see in the wild.

df = pd.DataFrame({'text': [str(i % 1000) for i in range(1000000)],
                   'numbers': range(1000000)})

Now we time the various dumps and loads methods of the different serialization libraries and plot the results below.

Time costs to serialize numeric data

As a point of reference writing the serialized result to disk and reading it back again should take somewhere between 0.05s and 0.5s on standard hard drives. We want to keep serialization costs below this threshold.

Thank you to Michael Waskom for making those charts (see twitter conversation and his alternative charts)

Gist to recreate plots here: https://gist.github.com/mrocklin/4f6d06a2ccc03731dd5f

Further Disclaimer: These numbers average from multiple repeated calls to loads/dumps. Actual performance in the wild is likely worse.


We have good options for numeric data but not for text. This is unfortunate; serializing ASCII text should be cheap. We lose here because we store text in a Series with the NumPy dtype ‘O’ for generic Python objects. We don’t have a dedicated variable length string dtype. This is tragic.

For numeric data the successful systems systems record a small amount of metadata and then dump the raw bytes. The main takeaway from this is that you should use the protocol=2 keyword argument to pickle. This option isn’t well known but strongly impacts preformance.

Note: Aaron Meurer notes in the comments that for Python 3 users protocol=3 is already default. Python 3 users can trust the default protocol= setting to be efficient and should not specify protocol=2.

Time costs to serialize numeric data

Some thoughts on text

  1. Text should be easy to serialize. It’s already text!

  2. JSON-no-index serializes the text values of the dataframe (not the integer index) as a list of strings. This assumes that the data are strings which is why it’s able to outperform the others, even though it’s not an optimized format. This is what we would gain if we had a string dtype rather than relying on the NumPy Object dtype, 'O'.

  3. MsgPack is surpsingly fast compared to cPickle

  4. MsgPack is oddly unbalanced, it can dump text data very quickly but takes a while to load it back in. Can we improve msgpack load speeds?

  5. CSV text loads are fast. Hooray for pandas.read_csv.

Some thoughts on numeric data

  1. Both pickle(..., protocol=2) and msgpack dump raw bytes. These are well below disk I/O speeds. Hooray!

  2. There isn’t much reason to compare performance below this level.

Categoricals to the Rescue

Pandas recently added support for categorical data. We use categorical data when our values take on a fixed number of possible options with potentially many repeats (like stock ticker symbols.) We enumerate these possible options (AAPL: 1, GOOG: 2, MSFT: 3, ...) and use those numbers in place of the text. This works well when there are many more observations/rows than there are unique values. Recall that in our case we have one million rows but only one thousand unique values. This is typical for many kinds of data.

This is great! We’ve shrunk the amount of text data by a factor of a thousand, replacing it with cheap-to-serialize numeric data.

>>> df['text'] = df['text'].astype('category')
>>> df.text
0      0
1      1
2      2
3      3
999997    997
999998    998
999999    999
Name: text, Length: 1000000, dtype: category
Categories (1000, object): [0 < 1 < 10 < 100 ... 996 < 997 < 998 < 999]

Lets consider the costs of doing this conversion and of serializing it afterwards relative to the costs of just serializing it.

Serialize Original Text 1.042523
Convert to Categories 0.072093
Serialize Categorical Data 0.028223

When our data is amenable to categories then it’s cheaper to convert-then-serialize than it is to serialize the raw text. Repeated serializations are just pure-win. Categorical data is good for other reasons too; computations on object dtype in Pandas generally happen at Python speeds. If you care about performance then categoricals are definitely something to roll in to your workflow.

Final Thoughts

  1. Several excellent serialization options exist, each with different strengths.
  2. A combination of good serialization support for numeric data and Pandas categorical dtypes enable efficient serialization and storage of DataFrames.
  3. Object dtype is bad for PyData. String dtypes would be nice. I’d like to shout out to DyND a possible NumPy replacement that would resolve this.
  4. MsgPack provides surprisingly good performance over custom Python solutions, why is that?
  5. I suspect that we could improve performance by special casing Object dtypes and assuming that they contain only text.

March 16, 2015 12:00 AM

March 13, 2015

Titus Brown

Request for suggestions on a generator API for khmer

I've been putting together a streaming API for khmer that would let us use generators to do sequence analysis, and I'd be interested in thoughts on how to do it in a good Pythonic way.

Some background: a while back, Alex Jironkin asked us for high level APIs, which turned into an issue on GitHub. More recently, we posted a paper on streaming and semi-streaming approaches to spectral sequence analysis, and so for my VanBUG talk I was inspired to put together a little API using generators.

The code looks like this, currently (see versioned permalink to khmer_api.py for full implementation):

graph = khmer.new_counting_hash(20, 1e7, 4)
out_fp = open(os.path.basename(filename) + '.abundtrim', 'w')

## khmer scripts/trim-low-abund.py -V, using generators
input_iter = screed.open(filename)
input_iter = broken_paired_reader(input_iter)
input_iter = clean_reads(input_iter)
input_iter = streamtrim(input_iter, graph, 20, 3)
output_reads(input_iter, out_fp)

Briefly, what this code does is

  1. create the core data structure we use
  2. open an output file
  3. open an input file
  4. create a generator that takes in a stream of data and groups records;
  5. create a generator that takes in records, does necessary preprocessing, and outputs them;
  6. create a generator that does our semi-streaming error trimming (from the semi-streaming preprint);
  7. outputs the reads to the given output fp.

The key bit is that this uses generators, so all of this is happening with full-on streaming. The one exception to this is the 'streamtrim' which has to cache a subset of the reads for processing more than once.

Interestingly, this is an implementation of the core functionality for the 'trim-low-abund.py' script that we will be releasing with the next version of khmer (the release after khmer 1.3 - not sure if it's 1.3.1 or 1.4 yet).

You can also replace the 'streamtrim' line with:

input_iter = diginorm(input_iter, graph, 20)

if you want to do digital normalization. That turns this into an implementation of the core functionality for the 'normalize-by-median.py' script that has been in khmer for a while.

Obviously these generator implementations are not yet production-ready, although they do give identical results to the current command line scripts.

The question I have, though, is what should the actual API look like?

The two basic options I've come up with are method chaining and UNIX-style pipes.

Method chaining might look like this:

read(in_fp). \
    clean_reads(). \
    streamtrim(graph, 20, 3). \

and piping would be

read(in_fp) | \
     clean_reads() | \
     streamtrim(graph, 20, 3) | \

...so I guess that's really pretty minor. I don't know which is more Pythonic, though, or what would permit more flexibility in terms of an underlying flexibility. Thoughts?

There are some other decisions to be made -- configuration and parameters.

For configuration, ideally we would be able to specify multiple input and output files to be paired with them, and/or set default parameters for multiple steps. Runtime reporting, error handling, and benchmarking should all be put into the mix. Should there be a simple object with hooks to handle all of this, or what? For example,

s = Streamer(...)  # configure inputs and outputs

s | clean_reads() | streamtrim(graph, 20, 3) | output()

where 's' could help by holding metadata or the like. I'm not sure - I've never done this before ;).

As for parameters, I personally like named parameters, so streamtrim above would better be streamtrim(graph, coverage=20, cutoff=3). But perhaps passing in a dictionary of parameters would be more flexible - should we support both? (Yes, I'm aware that in Python you can use ** - is there a preference?)

I'd love some examples of more mature APIs that have had the rough edges rubbed off 'em; this is really a UX question and I'm just not that well equipped to do this kind of design.

Thoughts? Help? Examples?

cheers, --titus

p.s. Down the road we're going to add correct_errors and eventually assemble to the steps, and perhaps also adapter trimming, quality trimming, and mapping. Won't that be fun? Imagine:

assembly = stream_input(filename1).trim().assemble()

to do your assembly... betcha we can stream it all, too.

by C. Titus Brown at March 13, 2015 11:00 PM

March 12, 2015


Students: spend the summer improving brain research software tools in Google Summer of Code

From Malin Sandström at the INCF:

Are you a student interested in brain research and software development? Or do you know one?
This year again, INCF is participating as mentoring organization in the Google Summer of Code, a global program that offers students stipends to spend the summer writing code for open source projects. INCF has 27 project proposals offered by mentors from the international research community, many of them with a computational neuroscience slant. All projects deal with development and/or improvement of open source tools that are used in the neuroscience community.

You can see our full list of projects here: https://incf.org/gsoc/2015/proposals

To be eligible, students must fulfill the Google definition of 'student': an individual enrolled in or accepted into an accredited institution including (but not necessarily limited to) colleges, universities, masters programs, PhD programs and undergraduate programs.

Student applications open on Monday, March 16.

GSoC questions welcome to: gsoc@incf.org

by Andrew Davison (noreply@blogger.com) at March 12, 2015 12:15 PM

March 11, 2015

Matthew Rocklin

Towards Out-of-core DataFrames

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

This post primarily targets developers. It is on experimental code that is not ready for users.

tl;dr Can we build dask.frame? One approach involves indexes and a lot of shuffling.

Dask arrays work

Over the last two months we’ve watched the creation of dask, a task scheduling specification, and dask.array a project to implement the out-of-core nd-arrays using blocked algorithms. (blogposts: 1, 2, 3, 4, 5, 6). This worked pretty well. Dask.array is available on the main conda channel and on PyPI and, for the most part, is a pleasant drop-in replacement for a subset of NumPy operations. I’m really happy with it.

conda install dask
pip install dask

There is still work to do, in particular I’d like to interact with people who have real-world problems, but for the most part dask.array feels ready.

On to dask frames

Can we do for Pandas what we’ve just done for NumPy?

Question: Can we represent a large DataFrame as a sequence of in-memory DataFrames and perform most Pandas operations using task scheduling?

Answer: I don’t know. Lets try.

Naive Approach

If represent a dask.array as an N-d grid of NumPy ndarrays, then maybe we should represent a dask.frame as a 1-d grid of Pandas DataFrames; they’re kind of like arrays.

dask.array Naive dask.frame

This approach supports the following operations:

  • Elementwise operations df.a + df.b
  • Row-wise filtering df[df.a > 0]
  • Reductions df.a.mean()
  • Some split-apply-combine operations that combine with a standard reduction like df.groupby('a').b.mean(). Essentially anything you can do with df.groupby(...).agg(...)

The reductions and split-apply-combine operations require some cleverness. This is how Blaze works now and how it does the does out-of-core operations in these notebooks: Blaze and CSVs, Blaze and Binary Storage.

However this approach does not support the following operations:

  • Joins
  • Split-apply-combine with more complex transform or apply combine steps
  • Sliding window or resampling operations
  • Anything involving multiple datasets

Partition on the Index values

Instead of partitioning based on the size of blocks we instead partition on value ranges of the index.

Partition on block size Partition on index value

This opens up a few more operations

  • Joins are possible when both tables share the same index. Because we have information about index values we we know which blocks from one side need to communicate to which blocks from the other.
  • Split-apply-combine with transform/apply steps are possible when the grouper is the index. In this case we’re guaranteed that each group is in the same block. This opens up general df.gropuby(...).apply(...)
  • Rolling or resampling operations are easy on the index if we share a small amount of information between blocks as we do in dask.array for ghosting operations.

We note the following theme:

Complex operations are easy if the logic aligns with the index

And so a recipe for many complex operations becomes:

  1. Re-index your data along the proper column
  2. Perform easy computation

Re-indexing out-of-core data

To be explicit imagine we have a large time-series of transactions indexed by time and partitioned by day. The data for every day is in a separate DataFrame.

Block 1
                     credit    name
2014-01-01 00:00:00     100     Bob
2014-01-01 01:00:00     200   Edith
2014-01-01 02:00:00    -300   Alice
2014-01-01 03:00:00     400     Bob
2014-01-01 04:00:00    -500  Dennis

Block 2
                     credit    name
2014-01-02 00:00:00     300    Andy
2014-01-02 01:00:00     200   Edith

We want to reindex this data and shuffle all of the entries so that now we partiion on the name of the person. Perhaps all of the A’s are in one block while all of the B’s are in another.

Block 1
                       time  credit
Alice   2014-04-30 00:00:00     400
Alice   2014-01-01 00:00:00     100
Andy    2014-11-12 00:00:00    -200
Andy    2014-01-18 00:00:00     400
Andy    2014-02-01 00:00:00    -800

Block 2
                       time  credit
Bob     2014-02-11 00:00:00     300
Bob     2014-01-05 00:00:00     100

Re-indexing and shuffling large data is difficult and expensive. We need to find good values on which to partition our data so that we get regularly sized blocks that fit nicely into memory. We also need to shuffle entries from all of the original blocks to all of the new ones. In principle every old block has something to contribute to every new one.

We can’t just call DataFrame.sort because the entire data might not fit in memory and most of our sorting algorithms assume random access.

We do this in two steps

  1. Find good division values to partition our data. These should partition the data into blocks of roughly equal size.
  2. Shuffle our old blocks into new blocks along the new partitions found in step one.

Find divisions by external sorting

One approach to find new partition values is to pull out the new index from each block, perform an out-of-core sort, and then take regularly spaced values from that array.

  1. Pull out new index column from each block

    indexes = [block['new-column-index'] for block in blocks]
  2. Perform out-of-core sort on that column

    sorted_index = fancy_out_of_core_sort(indexes)
  3. Take values at regularly spaced intervals, e.g.

    partition_values = sorted_index[::1000000]

We implement this using parallel in-block sorts, followed by a streaming merge process using the heapq module. It works but is slow.

Possible Improvements

This could be accelerated through one of the following options:

  1. A streaming numeric solution that works directly on iterators of NumPy arrays (numtoolz anyone?)
  2. Not sorting at all. We only actually need approximate regularly spaced quantiles. A brief literature search hints that there might be some good solutions.


Now that we know the values on which we want to partition we ask each block to shard itself into appropriate pieces and shove all of those pieces into a spill-to-disk dictionary. Another process then picks up these pieces and calls pd.concat to merge them in to the new blocks.

For the out-of-core dict we’re currently using Chest. Turns out that serializing DataFrames and writing them to disk can be tricky. There are several good methods with about an order of magnitude performance difference between them.

This works but my implementation is slow

Here is an example with snippet of the NYCTaxi data (this is small)

In [1]: import dask.frame as dfr

In [2]: d = dfr.read_csv('/home/mrocklin/data/trip-small.csv', chunksize=10000)

In [3]: d.head(3)   # This is fast
                          medallion                      hack_license  \
0  89D227B655E5C82AECF13C3F540D4CF4  BA96DE419E711691B9445D6A6307C170
1  0BD7C8F5BA12B88E0B67BED28BEA73D8  9FD8F69F0804BDB5549F40E9DA1BE472
2  0BD7C8F5BA12B88E0B67BED28BEA73D8  9FD8F69F0804BDB5549F40E9DA1BE472

  vendor_id  rate_code store_and_fwd_flag      pickup_datetime  \
0       CMT          1                  N  2013-01-01 15:11:48
1       CMT          1                  N  2013-01-06 00:18:35
2       CMT          1                  N  2013-01-05 18:49:41

      dropoff_datetime  passenger_count  trip_time_in_secs  trip_distance  \
0  2013-01-01 15:18:10                4                382            1.0
1  2013-01-06 00:22:54                1                259            1.5
2  2013-01-05 18:54:23                1                282            1.1

   pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude
0        -73.978165        40.757977         -73.989838         40.751171
1        -74.006683        40.731781         -73.994499         40.750660
2        -74.004707        40.737770         -74.009834         40.726002

In [4]: d2 = d.set_index(d.passenger_count, out_chunksize=10000)   # This takes some time

In [5]: d2.head(3)
                                        medallion  \
0                3F3AC054811F8B1F095580C50FF16090
1                4C52E48F9E05AA1A8E2F073BB932E9AA
1                FF00E5D4B15B6E896270DDB8E0697BF7

                                     hack_license vendor_id  rate_code  \
0                E00BD74D8ADB81183F9F5295DC619515       VTS          5
1                307D1A2524E526EE08499973A4F832CF       VTS          1
1                0E8CCD187F56B3696422278EBB620EFA       VTS          1

                store_and_fwd_flag      pickup_datetime     dropoff_datetime  \
0                              NaN  2013-01-13 03:25:00  2013-01-13 03:42:00
1                              NaN  2013-01-13 16:12:00  2013-01-13 16:23:00
1                              NaN  2013-01-13 15:05:00  2013-01-13 15:15:00

                 passenger_count  trip_time_in_secs  trip_distance  \
0                              0               1020           5.21
1                              1                660           2.94
1                              1                600           2.18

                 pickup_longitude  pickup_latitude  dropoff_longitude  \
0                      -73.986900        40.743736         -74.029747
1                      -73.976753        40.790123         -73.984802
1                      -73.982719        40.767147         -73.982170

0                       40.741348
1                       40.758518
1                       40.746170

In [6]: d2.blockdivs  # our new partition values
Out[6]: (2, 3, 6)

In [7]: d.blockdivs   # our original partition values
Out[7]: (10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000)

Some Problems

  • First, we have to evaluate the dask as we go. Every set_index operation (and hence many groupbys and joins) forces an evaluation. We can no longer, as in the dask.array case, endlessly compound high-level operations to form more and more complex graphs and then only evaluate at the end. We need to evaluate as we go.

  • Sorting/shuffling is slow. This is for a few reasons including the serialization of DataFrames and sorting being hard.

  • How feasible is it to frequently re-index a large amount of data? When do we reach the stage of “just use a database”?

  • Pandas doesn’t yet release the GIL, so this is all single-core. See post on PyData and the GIL.

  • My current solution lacks basic functionality. I’ve skipped the easy things to first ensure sure that the hard stuff is doable.


I know less about tables than about arrays. I’m ignorant of the literature and common solutions in this field. If anything here looks suspicious then please speak up. I could really use your help.

Additionally the Pandas API is much more complex than NumPy’s. If any experienced devs out there feel like jumping in and implementing fairly straightforward Pandas features in a blocked way I’d be obliged.

March 11, 2015 12:00 AM

March 10, 2015

Matthew Rocklin

PyData and the GIL

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr Many PyData projects release the GIL. Multi-core parallelism is alive and well.


Machines grow more cores every year. My cheap laptop has four cores and a heavy workstation rivals a decent cluster without the hardware hassle. When I bring this up in conversation people often ask about the GIL and whether or not this poses a problem to the PyData ecosystem and shared-memory parallelism.

Q: Given the growth of shared-memory parallelism, should the PyData ecosystem be concerned about the GIL?

A: No, we should be very excited about this growth. We’re well poised to exploit it.

For those who aren’t familiar, the Global Interpreter Lock (GIL) is a CPython feature/bug that stops threads from manipulating Python objects in parallel. This cripples Pure Python shared-memory parallelism.

This sounds like a big deal but it doesn’t really affect the PyData stack (NumPy/Pandas/SciKits). Most PyData projects don’t spend much time in Python code. They spend 99% of their time in C/Fortran/Cython code. This code can often release the GIL. The following projects release the GIL at various stages:

  • NumPy
  • SciPy
  • Numba (if requested) (example docs)
  • SciKit Learn
  • Anything that mostly uses the above projects
  • if you add more in the comments then I will post them here

Our software stack has roots in scientific computing which has an amazing relationship with using all-of-the-hardware. I would like to see the development community lean in to the use of shared-memory parallelism. This feels like a large low-hanging fruit.

Quick Example with dask.array

As a quick example, we compute a large random dot product with dask.array and look at top. Dask.array computes large array operations by breaking arrays up in to many small NumPy arrays and then executing those array operations in multiple threads.

In [1]: import dask.array as da

In [2]: x = da.random.random((10000, 10000), blockshape=(1000, 1000))

In [3]: float(x.dot(x.T).sum())
Out[3]: 250026827523.19141

Full resource utilization with Python

Technical note: my BLAS is set to use one thread only, the parallelism in the above example is strictly due to multiple Python worker threads, and not due to parallelism in the underlying native code.

Note the 361.0% CPU utilization in the ipython process.

Because the PyData stack is fundamentally based on native compiled code, multiple Python threads can crunch data in parallel without worrying about the GIL. The GIL does not have to affect us in a significant way.

That’s not true, the GIL hurts Python in the following cases


We don’t have a good C/Fortran/Cython solution for text. When given a pile-of-text-files we often switch from threads to processes and use the multiprocessing module. This limits inter-worker communication but this is rarely an issue for this kind of embarrassingly parallel work.

The multiprocessing workflow is fairly simple. I’ve written about this in the toolz docs and in a blogpost about dask.bag.


Pandas does not yet release the GIL in computationally intensive code. It probably could though. This requires momentum from the community and some grunt-work by some of the Pandas devs. I have a small issue here and I think that Phil Cloud is looking into it.

PyData <3 Shared Memory Parallelism

If you’re looking for more speed in compute-bound applications then consider threading and heavy workstation machines. I personally find this approach to be more convenient than spinning up a cluster.

March 10, 2015 12:00 AM

March 09, 2015

Martin Fitzpatrick

Wooey: Simple, automated web UIs for Python scripts

Wooey is a simple web interface (built on Flask) to run command line Python scripts. Think of it as an easy way to get your scripts up on the web for routine data analysis, file processing, or anything else.

Inspired by what Gooey can do, turning ArgumentParser-based command-line scripts into WxWidgets-based GUIs, I thought I’d see if I could do the same for the web. I’m still not sure if the result is beautiful or horrific.

Wooey (see what I did there?) uses the same, albeit slightly modified, back-end conversion of ArgumentParser instances to JSON definitions. These definitions are used to construct a web-based UI with type-dependent widgets. Submitted configurations are parsed, using the JSON definition, to command line arguments that are then submitted to a job queue.

Jobs in the queue are automatically run and the results made available in the job view, with smart handling of outputs such as images (CSV, etc. to be supported via pandas, possibly some kind of plugin system) into a tabbed output viewer. Support for downloading of zipped output files is to follow.

The use case for myself was as a simple platform to allow running of routine data-processing and analysis scripts within a research group, but I’m sure there are other possibilities. However, I wouldn’t recommend putting this on the public web just yet (pre-alpha warning). It’s somewhat comparable to things like Shiny for R, except multi-user out of the box. Support for multiple command-line formats is on my todo.

Enjoy and please fork liberally.

Built on Flask, using cookiecutter-flask then modified to use the Foundation framework. This is My First Flask App! so please feel free to critique & give pointers.


The front page of a wooey install presents a list of installed scripts:


Each script has it’s own UI form based on the config parameters defined in the ArgumentParser:

bar_config example script

Documentation can be specified either manually via the JSON, or my providing a Markdown-format file alongside the script or config file.

plot_some_numbers script with docs

Logged-in users get a nice listing of their previous jobs:

User job listing

The output from successful jobs is available via an inline viewer (images only presently, .csv support via Pandas to follow):

Job with success 1 Job with success 2

Errors are output to the inline console:

Job with error console

by Martin Fitzpatrick at March 09, 2015 04:37 PM

Juan Nunez-Iglesias


Prompted in part by some discussions with Ed Schofield, creator of python-future.org, I’ve been going on a bit of a porting spree to Python 3. I just finished with my gala segmentation library. (Find it on GitHub and ReadTheDocs.) Overall, the process is nowhere near as onerous as you might think it is. Getting started really is the hardest part. If you have more than yourself as a user, you should definitely just get on with it and port.

The second hardest part is the testing. In particular, you will need to be careful with dictionary iteration, pickled objects, and file persistence in general. I’ll go through these gotchas in more detail below.

Reminder: the order of dictionary items is undefined

This is one of those duh things that I forget over and over and over. In my porting, some tests that depended on a scikit-learn RandomForest object were failing. I assumed that there was some difference between the random seeding in Python 2 and Python 3, leading to slightly different models between the two versions of the random forest.

This was a massive red herring that took me forever to figure out. In actuality, the seeding was completely fine. However, gala uses networkx as its graph backend, which itself uses an adjacency dictionary to store edges. So when I asked for graph.edges() to get a set of training examples, I was getting the edges in a random order that was deterministic within Python 2.7: the edges returned were always in the same shuffled order. This went out the window when switching to Python 3.4, with the training examples now in a different order, resulting in a different random forest and thus a different learning outcome… And finally a failed test.

The solution should have been to use a classifier that is not sensitive to ordering of the training data. However, although many classifiers satisfy this property, in practice they suffer from slight numerical instability which is sufficient to throw the test results off between shufflings of the training data.

So I’ve trained a Naive Bayes classifier in Python 2.7, and which I then load up in Python 3.4 and check whether the parameters are close to a newly trained one. The actual classification results can differ slightly, and this becomes much worse in gala, where classification tasks are sequential, so a single misstep can throw off everything that comes after it.

When pickling, remember to open files in binary mode

I’ve always felt that the pickle module was deficient for not accepting filenames as input to dump. Instead, it takes an open, writeable file. This is all well and good but it turns out that you should always open files in binary mode when using pickle! I got this far without knowing that, surely an indictment of pickle’s API!

Additionally, you’ll have specify a encoding='bytes' when loading a Python 2 saved file in the Python 3 version of pickle.

Even when you do, objects may not map cleanly between Python 2 and 3 (for some libraries)

In Python 2:

>>> from sklearn.ensemble import RandomForestClassifier as RF
>>> rf = RF()
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> rf = rf.fit(iris.data, iris.target)
>>> with open('rf', 'wb') as fout:
...     pck.dump(r, fout, protocol=2)

Then, in Python 3:

>>> with open('rf', 'rb') as fin:
...     rf = pck.load(fin, encoding='bytes')
KeyError                                  Traceback (most recent call last)
<ipython-input-9-674ee92b354d> in <module>()
      1 with open('rf', 'rb') as fin:
----> 2     rf = pck.load(fin, encoding='bytes')

/Users/nuneziglesiasj/anaconda/envs/py3k/lib/python3.4/site-packages/sklearn/tree/_tree.so in sklearn.tree._tree.Tree.__setstate__ (sklearn/tree/_tree.c:18115)()

KeyError: 'node_count'

When all is said and done, your code will probably run slower on Python 3

I have to admit: this just makes me angry. After a lot of hard work ironing out all of the above kinks, gala’s tests run about 2x slower in Python 3.4 than in 2.7. I’d heard quite a few times that Python 3 is slower than 2, but that’s just ridiculous.

Nick Coghlan’s enormous Q&A has been cited as required reading before complaining about Python 3. Well, I’ve read it (which took days), and I’m still angry that the CPython core development team are generally dismissive of anyone wanting faster Python. Meanwhile, Google autocompletes “why is Python” with “so slow”. And although Nick asserts that those of us complaining about this “misunderstand the perspective of conservative users”, community surveys show a whopping 40% of Python 2 users citing “no incentive” as the reason they don’t switch.

In conclusion…

In the end, I’m glad I ported my code. I learned a few things, and I feel like a better Python “citizen” for having done it. But that’s the point: those are pretty weak reasons. Most people just want to get their work done and move on. Why would they bother porting their code if it’s not going to help them do that?

by Juan Nunez-Iglesias at March 09, 2015 06:33 AM

March 05, 2015

Continuum Analytics

Continuum Analytics - March Tech Events

This month, the Continuum team will be at a variety of meetups and conferences talking about the power of Python in analytics. Take a look at where you can find us, and reach out at info@continuum.io if you’re interested in meeting up, or if you would like us to participate in your event.

by Continuum at March 05, 2015 12:00 AM

February 28, 2015

Matthew Rocklin

Ising models and Numba

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr I play with Numba and find it effective.

Ising model with numba


Confession, I’ve never actually used Numba. Well that’s not quite true; I’ve indirectly used Numba thousands of times because Blaze auto-generates numba ufuncs. Still I’ve never used it for a particular problem. I usually define problems with large array operations and compile those down. Numba takes a different approach and translates Python for loops to efficient LLVM code. This is all lower in the hardware stack than where I usually think.

But when I was looking for applications to motivate recent work in nearest-neighbor communications in dask a friend pointed me towards the Ising model, a simple physical system that is both easy to code up and has nice macro-scale properties. I took this as an example to play with Numba. This post details my experience.

Ising Model

Disclaimer: I am not a physicist

The Ising model represents a regular grid of points where each point has two possible states, spin up and spin down. States like to have the same spin as their immediate neighbors so when a spin-down state is surrounded by more spin-up states it will switch to spin-up and vice versa. Also, due to random fluctuations, points might switch spins, even if this switch is not favorable. In pseudocode an Ising update step might look like the following

for every point in the grid:
    energy = my spin * sum of all of the spins (+1 or -1) of neighboring points
    if energy is improved by switching:
    else if we're particulalry unlucky
        switch anyway

For this kind of algorithm imperative for-loopy code is probably the most clear. You can do this with high-level array operations too (e.g. NumPy/Blaze/Theano), but it’s a mess.

Numba code

Here is my solution to the problem with Numba and a gif of the solution

Ising model with numba

import numba
import numpy as np
from math import exp, log, e, sqrt

kT = 2 / log(1 + sqrt(2), e)

def _update(x, i, j):
    n, m = x.shape
    dE = 2* x[i, j] * (
                     x[(i-1)%n, (j-1)%m]
                   + x[(i-1)%n,  j     ]
                   + x[(i-1)%n, (j+1)%m]

                   + x[ i     , (j-1)%m]
                   + x[ i     , (j+1)%m]

                   + x[(i+1)%n, (j-1)%m]
                   + x[(i+1)%n,  j     ]
                   + x[(i+1)%n, (j+1)%m]
    if dE <= 0 or exp(-dE / kT) > np.random.random():
        x[i, j] *= -1

def update(x):
    n, m = x.shape

    for i in range(n):
        for j in range(0, m, 2):  # Even columns first to avoid overlap
            _update(x, i, j)

    for i in range(n):
        for j in range(1, m, 2):  # Odd columns second to avoid overlap
            _update(x, i, j)

if __name__ == '__main__':
    x = np.random.randint(2, size=(1000, 1000)).astype('i1')
    x[x == 0] = -1
    for i in range(100):

My old beater laptop executes one update step on a 1000x1000 grid in 50ms. Without Numba this takes 12s. This wasn’t a canned demo by an expert user / numba developer; this was just my out-of-the-box experience.


I really like the following:

  • I can remove @numba.jit and use the Python debugger
  • I can assert that I’m only using LLVM with nopython=True
  • I can manage data with NumPy (or dask.array) separately from managing computation with Numba

I ran in to some issues and learned some things too:

  • random is only present in the developer preview builds of Numba (conda install -c numba numba). It will be officially released in the 0.18 version this March.
  • You don’t have to provide type signature strings. I tried providing these at first but I didn’t know the syntax and so repeatedly failed to write down the type signature correctly. Turns out the cost of not writing it down is that Numba will jit whenever it sees a new signature. For my application this is essentially free.

February 28, 2015 12:00 AM

February 26, 2015


Workshop Announcement - "HBP Hippocamp CA1: Collaborative and Integrative Modeling of Hippocampal Area CA1"

Registration is now open for the workshop "HBP Hippocamp CA1: Collaborative and Integrative Modeling of Hippocampal Area CA1", to be held March 31st - April 1st, 2015 at UCL School of Pharmacy in London, supported by the Human Brain Project (www.humanbrainproject.eu).

 In short, the aims of the workshop are two-fold. First, to engage the larger community of experimentalists and modelers working on hippocampus, and highlight existing modeling efforts and strategic datasets for modeling hippocampal area CA1. Second, to define and bootstrap an inclusive community-driven model and data-integration process to achieve open pre-competitive reference models of area CA1 (and, ultimately, the rest of the hippocampus), which are well documented, validated, and released at regular intervals (supported in part by IT infrastructure funded by HBP). Involvement from the community interested in characterization and modeling of the hippocampus is highly encouraged. To keep the meeting focused on the task, participation will be limited to ~30 people, so registration is required.

Please consult the meeting website at http://neuralensemble.org/meetings/HippocampCA1/for registration and further details.

Organizing committee:Jo Falck (UCL), Szabolcs Káli (Hungarian Academy of Sciences), Sigrun Lange (UCL), Audrey Mercer (UCL), Eilif Muller (EPFL), Armando Romani (EPFL) and Alex Thomson (UCL).

by eilif (noreply@blogger.com) at February 26, 2015 03:50 PM

February 24, 2015

Jake Vanderplas

Optimizing Python in the Real World: NumPy, Numba, and the NUFFT

Donald Knuth famously quipped that "premature optimization is the root of all evil." The reasons are straightforward: optimized code tends to be much more difficult to read and debug than simpler implementations of the same algorithm, and optimizing too early leads to greater costs down the road. In the Python world, there is another cost to optimization: optimized code often is written in a compiled language like Fortran or C, and this leads to barriers to its development, use, and deployment.

Too often, tutorials about optimizing Python use trivial or toy examples which may not map well to the real world. I've certainly been guilty of this myself. Here, I'm going to take a different route: in this post I will outline the process of understanding, implementing, and optimizing a non-trivial algorithm in Python, in this case the Non-uniform Fast Fourier Transform (NUFFT). Along the way, we'll dig into the process of optimizing Python code, and see how a relatively straightforward pure Python implementation, with a little help from Numba, can be made to nearly match the performance of a highly-optimized Fortran implementation of the same algorithm.

Why a Python Implementation?

First, I want to answer the inevitable question: why spend the time to make a Python implementation of an algorithm that's already out there in Fortran? The reason is that I've found in my research and teaching that pure-Python implementations of algorithms are far more valuable than C or Fortran implementations, even if they might be a bit slower. This is for a number of reasons:

  • Pure-Python code is easier to read, understand, and contribute to. Good Python implementations are much higher-level than C or Fortran, and abstract-away loop indices, bit twiddling, workspace arrays, and other sources of code clutter. A typical student reading good Python code can immediately understand and modify the algorithm, while the same student would be lost trying to understand typical optimized Fortran code.

  • Pure-python packages are much easier to install than Python-wrapped C or Fortran code. This is especially true on non-Linux systems. Fortran in particular can require some installation prerequisites that are non-trivial for many users. In practice, I've seen people give up on better tools when there is an installation barrier for those tools.

  • Pure-python code often works for many data types. Because of the way it is written, pure Python code is often automatically applicable to single or double precision, and perhaps even to extensions to complex numbers. For compiled packages, supporting and compiling for all possible types can be a burden.

  • Pure-python is easier to use at scale. Because it does not require complicated installation, pure Python packages can be much easier to install on cloud VMs and/or shared clusters for computation at scale. If you can easily pip-install a pure-Python package on a VM, then services like AWS and TravisCI are much easier to set up.

Certainly code speed will overcome these considerations if the performance gap is great enough, but I've found that for many applications a pure Python package, cleverly designed and optimized, can be made fast enough that these larger considerations win-out. The challenge is making the Python fast. We'll explore this below.

Background: The Non-Uniform Fast Fourier Transform

The Fast Fourier Transform (FFT) is perhaps the most important and fundamental of modern numerical algorithms. It provides a fast, \(O[N\log N]\) method of computing the discrete Fourier transform: \[ Y_k^\pm = \sum_{n=0}^{N-1} y_n e^{\pm i k n / N} \] You can read more about the FFT in my previous post on the subject.

One important limitation of the FFT is that it requires that input data be evenly-spaced: that is, we can think of the values \(y_n\) as samples of a function \(y_n = y(x_n)\) where \(x_n = x_0 + n\Delta x\) is a regular grid of points. But what about when your grid is not uniform? That is, what if you want to compute this result: \[ Y_k^\pm = \sum_{j=1}^N y(x_j) e^{\pm i k x_j} \] where \(y(x)\) is evaluated at an arbitrary set of points \(x_j\)? In this case, the FFT is no longer directly applicable, and you're stuck using a much slower \(O[N^2]\) direct summation.

Stuck, that is, until the NUFFT came along.

The NUFFT is an algorithm which converts the non-uniform transform into an approximate uniform transform, not with error-prone interpolation, but instead using a clever "gridding" operation motivated by the convolution theorem. If you'd like to read about the algorithm in detail, the Courant Institute's NUFFT page has a nice set of resources.

Below we'll take a look at implementing this algorithm in Python.

Direct Non-Uniform Fourier Transform

When developing optimized code, it is important to start with something easy to make sure you're on the right track. Here we'll start with a straightforward direct version of the non-uniform Fourier transform. We'll allow non-uniform inputs \(x_j\), but compute the output on a grid of \(M\) evenly-spaced frequencies in the range \(-M/2 \le f/\delta f < M/2\). This is what the NUFFT group calls the Type-1 NUFFT.

First we'll implement nufftfreqs(), which returns the frequency grid for a given \(M\), and nudft() which computes the non-uniform discrete Fourier transform using a slow direct method. The arguments for the latter include iflag, which is a positive or negative number indicating the desired sign of the exponent:

In [1]:
from __future__ import print_function, division
import numpy as np

def nufftfreqs(M, df=1):
    """Compute the frequency range used in nufft for M frequency bins"""
    return df * np.arange(-(M // 2), M - (M // 2))

def nudft(x, y, M, df=1.0, iflag=1):
    """Non-Uniform Direct Fourier Transform"""
    sign = -1 if iflag < 0 else 1
    return (1 / len(x)) * np.dot(y, np.exp(sign * 1j * nufftfreqs(M, df) * x[:, np.newaxis]))

Again, I can't emphasize this enough: when writing fast code, start with a slow-and-simple version of the code which you know gives the correct result, and then optimize from there.

Comparing to the Fortran NUFFT

We can double-check that this is producing the desired result by comparing to the Fortran NUFFT implementation, using Python wrappers written by Dan Foreman-Mackey, available at http://github.com/dfm/python-nufft/:

In [2]:
# Install nufft from http://github.com/dfm/python-nufft/
from nufft import nufft1 as nufft_fortran

x = 100 * np.random.random(1000)
y = np.sin(x)

Y1 = nudft(x, y, 1000)
Y2 = nufft_fortran(x, y, 1000)

np.allclose(Y1, Y2)

The results match! A quick check shows that, as we might expect, the Fortran algorithm is orders of magnitude faster:

In [3]:
%timeit nudft(x, y, 1000)