May 14, 2012

Vytautas Jancauskas

First pull request

Created some really simple test cases for scatter matrix diagonal plotting and submitted a first pull request https://github.com/pydata/pandas/pull/1237. This was really useful to familiarize myself with the development flow in pydata/pandas. Also while I am quite used to Mercurial SCM I am constantly finding git to be infuriatingly different. :)

by Vytautas Jančauskas (noreply@blogger.com) at May 14, 2012 03:17 PM

May 13, 2012

Wes McKinney

Mastering high performance data algorithms I: Group By

I’m on my way back from R/Finance 2012. Those guys did a nice job of organizing the conference and was great to meet everyone there.

As part of pandas development, I have had to develop a suite of high performance data algorithms and implementation strategies which are the heart and soul of the library. I get asked a lot why pandas’s performance is much better than R and other data manipulation tools. The reasons are two-fold: careful implementation (in Cython and and C, so minimizing the computational friction) and carefully-thought-out algorithms.

Here are some of the more important tools and ideas that I make use of on a day-to-day basis:

  • Hash tables I use klib. It’s awesome and easy to use. I have this for critical algorithms like unique, factorize (unique + integer label assignment)
  • O(n) sort, known in pandas as groupsort, on integers with known range. This is a variant of counting sort; if you have N integers with known range from 0 to K – 1, this can be sorted in O(N) time. Combining this tool with factorize (hash table-based), you can categorize and sort a large data set in linear time. Failure to understand these two algorithms will force you to pay O(N log N), dominating the runtime of your algorithm.
  • Vectorized data movement and subsetting routines: take, put, putmask, replace, etc.
  • Let me give you a prime example from a commit yesterday of me applying these ideas to great effect. Suppose I had a time series (or DataFrame containing time series) that I want to group by year, month, and day:

    In [6]: rng = date_range('1/1/2000', periods=20, freq='4h')
    
    In [7]: ts = Series(np.random.randn(len(rng)), index=rng)
    
    In [8]: ts
    Out[8]: 
    2000-01-01 00:00:00   -0.891761
    2000-01-01 04:00:00    0.204853
    2000-01-01 08:00:00    0.690581
    2000-01-01 12:00:00    0.454010
    2000-01-01 16:00:00   -0.123102
    2000-01-01 20:00:00    0.300111
    2000-01-02 00:00:00   -1.052215
    2000-01-02 04:00:00    0.094484
    2000-01-02 08:00:00    0.318417
    2000-01-02 12:00:00    0.779984
    2000-01-02 16:00:00   -1.514042
    2000-01-02 20:00:00    2.550011
    2000-01-03 00:00:00    0.983423
    2000-01-03 04:00:00   -0.710861
    2000-01-03 08:00:00   -1.350554
    2000-01-03 12:00:00   -0.464388
    2000-01-03 16:00:00    0.817372
    2000-01-03 20:00:00    1.057514
    2000-01-04 00:00:00    0.743033
    2000-01-04 04:00:00    0.925849
    Freq: 4H
    
    In [9]: by = lambda x: lambda y: getattr(y, x)
    
    In [10]: ts.groupby([by('year'), by('month'), by('day')]).mean()
    Out[10]: 
    2000   1      1        0.105782
                  2        0.196106
                  3        0.055418
                  4        0.834441

    This is a small dataset, but imagine you have millions of observations and thousands or even millions of groups. How does that look algorithmically? I guarantee if you take a naive approach, you will crash and burn when the data increases in size. I know, because I did just that (take a look at the vbenchmarks). Laying down the infrastructure for doing a better job is not simple. Here are the steps for efficiently aggregating data like this:

  • Factorize the group keys, computing integer label arrays (O(N) per key)
  • Compute a “cartesian product number” or group index for each observation (since you could theoretically have K_1 * … * K_p groups observed, where K_i is the number of unique values in key i). This is again O(n) work.
  • If the maximum number of groups is large, “compress” the group index by using the factorize algorithm on it again. imagine you have 1000 uniques per key and 3 keys; most likely you do not actually observe 1 billion key combinations but rather some much smaller number. O(n) work again
  • For simple aggregations, like mean, go ahead and aggregate the data in one linear sweep without moving anything around.
  • Otherwise, for arbitrary user-defined aggregations, we have two options for chopping up the data set: sorting it (O(N) using groupsort!) or create index arrays indicating which observations belong to each group. The latter would be preferable in many cases to sorting a large data set (reordering the data, even though linear time, can be very costly).
  • I worked on speeding up the latter part of this last bullet point yesterday. The resulting code looked like this:

    def _get_indices_dict(label_list, keys): 
        # Accepts factorized labels and unique key values
    
        shape = [len(x) for x in keys]
    
        group_index = get_group_index(label_list, shape) # Compute group index
    
        sorter, _ = lib.groupsort_indexer(com._ensure_int64(group_index),
                                          np.prod(shape))
    
        # Reorder key labels and group index
        sorted_labels = [lab.take(sorter) for lab in label_list]
        group_index = group_index.take(sorter)
    
        # Compute dict of {group tuple -> location NumPy array for group}
        index = np.arange(len(group_index)).take(sorter)
        return lib.indices_fast(index, group_index, keys, sorted_labels)

    The details of lib.indices_fast aren’t that interesting; it chops up np.arange(len(group_index)).take(sorter), the sorted indirect indices, to produce the index dictionary. Running %lprun to get a line profiling on a large-ish data set:

    In [11]: rng = date_range('1/1/2000', '12/31/2005', freq='H')
    
    In [12]: year, month, day = rng.year, rng.month, rng.day
    
    In [13]: ts = Series(np.random.randn(len(rng)), index=rng)
    
    In [14]: lprun -f gp._get_indices_dict for i in xrange(100): ts.groupby([year, month, day]).indices
    Timer unit: 1e-06 s
    
    File: pandas/core/groupby.py
    Function: _get_indices_dict at line 1975
    Total time: 0.628506 s
    
    Line #      Hits         Time  Per Hit   % Time  Line Contents
    ==============================================================
      1975                                           def _get_indices_dict(label_list, keys):
      1976       400          695      1.7      0.1      shape = [len(x) for x in keys]
      1977       100       114388   1143.9     18.2      group_index = get_group_index(label_list, shape)
      1978                                           
      1979       100          320      3.2      0.1      sorter, _ = lib.groupsort_indexer(com._ensure_int64(group_index),
      1980       100        62007    620.1      9.9                                        np.prod(shape))
      1981                                           
      1982       400        53530    133.8      8.5      sorted_labels = [lab.take(sorter) for lab in label_list]
      1983       100        19516    195.2      3.1      group_index = group_index.take(sorter)
      1984       100        20189    201.9      3.2      index = np.arange(len(group_index)).take(sorter)
      1985                                           
      1986       100       357861   3578.6     56.9      return lib.indices_fast(index, group_index, keys, sorted_labels)

    You might say, well, this seems like a lot of work and maybe we should just zip the keys (forming an array of Python tuples) and do a dumb algorithm? The speed difference ends up being something like an order of magnitude or more faster by being careful in this way and working with indirect integer index arrays.

    Anyway, in conclusion, it’s these kinds of algorithms and ideas why pandas is perhaps the best-performing open-source data analysis toolkit for in memory data (I’m going to get to out-of-core data processing and “big data” eventually, just hang tight). It goes beyond language features and data structure internals (though this naturally also has a major impact, a lot of the things I do are easy to express in Python but would be very awkward or impossible to do in, say, R. Maybe I should write a whole article on this.); carefully thought-out algorithms are a major piece of the puzzle.

    by Wes McKinney at May 13, 2012 06:46 PM

    Andreas Mueller

    ICML 2012 Deep Learning and Unsupervised Feature Extraction Reading List

    The ICML2012 accepted papers are officially online.

    On twitter, Andrej Kaparthy complained that the list is a bit hard to browse trough. I agree and even though this is probably not the nice visualization he had in mind, I felt like having topical reading lists would somehow mitigate this problem.
    Here is my reading list on deep learning and unsupervised feature extraction:

    A Generative Process for Contractive Auto-Encoders

    Salah Rifai, Yann Dauphin, Pascal Vincent, Yoshua Bengio
    – Accepted
    Abstract: The contractive auto-encoder learns a representation of the input data that captures the local manifold structure around each data point, through the leading singular vectors of the Jacobian of the transformation from input to representation. The corresponding singular values specify how much local variation is plausible in directions associated with the corresponding singular vectors, while remaining in a high-density region of the input space. This paper proposes a procedure for generating samples that are consistent with the local structure captured by a contractive auto-encoder. The associated stochastic process defines a distribution from which one can sample, and which experimentally appears to converge quickly and mix well between modes, compared to Restricted Boltzmann Machines and Deep Belief Networks. The intuitions behind this procedure can also be used to train the second layer of contraction that pools lower-level features and learns to be invariant to the local directions of variation discovered in the first layer. We show that this can help learn and represent invariances present in the data and improve classification error.


    Building high-level features using large scale unsupervised learning

    Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Greg Corrado, Kai Chen, Jeff Dean, Andrew Ng
    – Accepted
    Abstract: We consider the challenge of building feature detectors for high-level concepts from only unlabeled data. For example, we would like to understand if it is possible to learn a face detector using only unlabeled images downloaded from the Internet. To answer this question, we trained a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (which has 10 million images, each image has 200x200 pixels). On contrary to what appears to be a widely-held negative belief, our experimental results reveal that it is possible to achieve a face detector via only unlabeled data. Control experiments show that the feature detector is robust not only to translation but also to scaling and 3D rotation. Also via recognition and visualization, we find that the same network is sensitive to other high-level concepts such as cat faces and human bodies.

    Evaluating Bayesian and L1 Approaches for Sparse Unsupervised Learning

    Shakir Mohamed, Katherine Heller, Zoubin Ghahramani
    – Accepted
    Abstract: The use of L_1 regularisation for sparse learning has generated immense research interest, with many successful applications in diverse areas such as signal acquisition, image coding, genomics and collaborative filtering. While existing work highlights the many advantages of L_1 methods, in this paper we find that L_1 regularisation often dramatically under-performs in terms of predictive performance when compared with other methods for inferring sparsity. We focus on unsupervised latent variable models, and develop L_1 minimising factor models, Bayesian variants of “L_1”, and Bayesian models with a stronger L_0-like sparsity induced through spike-and-slab distributions. These spike-and-slab Bayesian factor models encourage sparsity while accounting for uncertainty in a principled manner, and avoid unnecessary shrinkage of non-zero values. We demonstrate on a number of data sets that in practice spike-and-slab Bayesian methods outperform L_1 minimisation, even on a computational budget. We thus highlight the need to re-assess the wide use of L_1 methods in sparsity-reliant applications, particularly when we care about generalising to previously unseen data, and provide an alternative that, over many varying conditions, provides improved generalisation performance.

    On multi-view feature learning

    Roland Memisevic
    – Accepted
    Abstract: Sparse coding is a common approach to learning local features for object recognition. Recently, there has been an increasing interest in learning features from spatio-temporal, binocular, or other multi-observation data, where the goal is to encode the relationship between images rather than the content of a single image. We discuss the role of multiplicative interactions and of squaring non-linearities in learning such relations. In particular, we show that training a sparse coding model whose filter responses are multiplied or squared amounts to jointly diagonalizing a set of matrices that encode image transformations. Inference amounts to detecting rotations in the shared eigenspaces. Our analysis helps explain recent experimental results showing that Fourier features and circular Fourier features emerge when training complex cell models on translating or rotating images. It also shows how learning about transformations makes it possible to learn invariant features.

     

    Deep Mixtures of Factor Analysers

    Yichuan Tang, Ruslan Salakhutdinov, Geoffrey Hinton
    – Accepted
    Abstract: An efficient way to learn deep density models that have many layers of latent variables is to learn one layer at a time using a model that has only one layer of latent variables. After learning each layer, samples from the posterior distributions for that layer are used as training data for learning the next layer. This approach is commonly used with Restricted Boltzmann Machines, which are undirected graphical models with a single hidden layer, but it can also be used with Mixtures of Factor Analysers (MFAs) which are directed graphical models. In this paper, we present a greedy layer-wise learning algorithm for Deep Mixtures of Factor Analysers (DMFAs). Even though a DMFA can be converted to an equivalent shallow MFA by multiplying together the factor loading matrices at different levels, learning and inference are much more efficient in a DMFA and the sharing of each lower-level factor loading matrix by many different higher level MFAs prevents overfitting. We demonstrate empirically that DMFAs learn better density models than both MFAs and two types of Restricted Boltzmann Machines on a wide variety of datasets.

     

     Learning Local Transformation Invariance with Restricted Boltzmann Machines

    Kihyuk Sohn, Honglak Lee
    – Accepted
    Abstract: The difficulty of developing feature learning algorithms that are robust to the novel transformations (e.g., scale, rotation, or translation) has been a challenge in many applications (e.g., object recognition problems). In this paper, we address this important problem of transformation invariant feature learning by introducing the transformation matrices into the energy function of the restricted Boltzmann machines. Specifically, the proposed transformation-invariant restricted Boltzmann machines not only learn the diverse patterns by explicitly transforming the weight matrix, but it also achieves the invariance of the feature representation via probabilistic max pooling of hidden units over the set of transformations. Furthermore, we show that our transformation-invariant feature learning framework is not limited to this specific algorithm, but can be also extended to many unsupervised learning methods, such as an autoencoder or sparse coding. To validate, we evaluate our algorithm on several benchmark image databases such as MNIST variation, CIFAR-10, and STL-10 as well as the customized digit datasets with significant transformations, and show very competitive classification performance to the state-of-the-art. Besides the image data, we apply the method for phone classification tasks on TIMIT database to show the wide applicability of our proposed algorithms to other domains, achieving state-of-the-art performance.

    Large-Scale Feature Learning With Spike-and-Slab Sparse Coding

    Ian Goodfellow, Aaron Courville, Yoshua Bengio
    – Accepted
    Abstract: We consider the problem of object recogni- tion with a large number of classes. In or- der to scale existing feature learning algo- rithms to this setting, we introduce a new feature learning and extraction procedure based on a factor model we call spike-and- slab sparse coding (S3C). Prior work on this model has not prioritized the ability to ex- ploit parallel architectures and scale to the enormous problem sizes needed for object recognition. We present an inference proce- dure appropriate for use with GPUs which allows us to dramatically increase both the training set size and the amount of latent factors. We demonstrate that this approach improves upon the supervised learning ca- pabilities of both sparse coding and the ss- RBM on the CIFAR-10 dataset. We use the CIFAR-100 dataset to demonstrate that our method scales to large numbers of classes bet- ter than previous methods. Finally, we use our method to win the NIPS 2011 Workshop on Challenges In Learning Hierarchical Mod- els’ Transfer Learning Challenge.

    Deep Lambertian Networks

    Yichuan Tang, Ruslan Salakhutdinov, Geoffrey Hinton
    – Accepted
    Abstract: Visual perception is a challenging problem in part due to illumination variations. A possible solution is to first estimate an illumination invariant representation before using it for recognition. The object albedo and surface normals are examples of such representation. In this paper, we introduce a multilayer generative model where the latent variables include the albedo, surface normals, and the light source. Combining Deep Belief Nets with the Lambertian reflectance assumption, our model can learn good priors over the albedo from 2D images. Illumination variations can be explained by changing only the lighting latent variable in our model. By transferring learned knowledge from similar objects, albedo and surface normals estimation from a single image is possible in our model. Experiments demonstrate that our model is able to generalize as well as improve over standard baselines in one-shot face recognition.

    Scene parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers

    Clément Farabet, Camille Couprie, Laurent Najman, Yann LeCun
    – Accepted
    Abstract: Scene parsing consists in labeling each pixel in an image with the category of the object it belongs to. We propose a method that uses a multiscale convolutional network trained from raw pixels to extract dense feature vectors that encode regions of multiple sizes centered on each pixel. The method alleviates the need for engineered features. In parallel to feature extraction, a tree of segments is computed from a graph of pixel dissimilarities. The feature vectors associated with the segments covered by each node in the tree are aggregated and fed to a classifier which produces an estimate of the distribution of object categories contained in the segment. A subset of tree nodes that cover the image are then selected so as to maximize the average 'purity' of the class distributions, hence maximizing the overall likelihood that each segment will contain a single object. The system yields record accuracies on the the Sift Flow Dataset (33 classes) and the Barcelona Dataset (170 classes) and near-record accuracy on Stanford Background Dataset (8 classes), while being an order of magnitude faster than competing approaches, producing a 320x240 image labeling in less than 1 second, including feature extraction.

    by Andreas Mueller (noreply@blogger.com) at May 13, 2012 04:05 PM

    Superpixels for Python - pretty SLIC

    Yesterday I wanted to try out a "new" superpixel algorithm that seemed quite successful: SLIC superpixels.
    This is actually a very simple algorithm, basically doing KMeans in the color+(x,y) space. I'm a bit bummed that they named that, since I already tried the same approach a couple of years ago and didn't think it was very useful. Well, apparently it is.
    The authors have a nice website with some examples. Unfortunately the linux binary didn't run on my box and building on linux seemed somewhat non-trivial.

    So I did what I always do: wrote some Python wrappers. You can find them on github. The whole thing is pretty small, easy to build and easy to use. Also damn fast (less than a second per image).

    There are two variations, one where you can specify the number of superpixels and one where you can specify the number of pixels in a superpixel. Both have an additional parameter, the "compactness", which is a trade-off between the similarity in colorspace and (x,y) space.
    Results for varying parameter settings look something like this:

    Compare to my (former) favorite, quickshift:
    Both are done in RGB colorspace and could probably benefit from going to Lab.
    The SLIC implementation converts to Lab, while I didn't do the conversion for quickshift (which I probably should have done).

    by Andreas Mueller (noreply@blogger.com) at May 13, 2012 12:32 AM

    May 12, 2012

    Enthought

    GPU-palooza: May 21, NYC

    Enthought is proud to sponsor the NYC HPC-GPU Supercomputing meetup on May 21st, 2012. This meeting will focus on the GPU and Python (what else?). A happy coincidence brings together three GPU wise men at this meetup. Andreas Klockner is the author of PyCUDA and PyOpenCL. Nicolas Pinto comes to us via MIT with an extensive [...]

    by David Kim at May 12, 2012 09:06 PM

    May 10, 2012

    Josef Perkoltd

    Regression Plots - Part 1

    I started to work on improving the documentation for the regressions plot in statsmodels. (However, I realized I have to improve them a bit.)

    For now, just a question: Can you spot the mis-specification of the model?

    I simulate a model, run a linear regression on three variables and a constant. Here is the estimation summary, which looks quite good, large R-squared, all variables significant, no obvious problems:
    >>> print res.summary()
                                                            OLS Regression Results
    ==============================================================================
    Dep. Variable:                      y   R-squared:                       0.901
    Model:                            OLS   Adj. R-squared:                  0.898
    Method:                 Least Squares   F-statistic:                     290.3
    Date:                Thu, 10 May 2012   Prob (F-statistic):           5.31e-48
    Time:                        13:15:22   Log-Likelihood:                -173.85
    No. Observations:                 100   AIC:                             355.7
    Df Residuals:                      96   BIC:                             366.1
    Df Model:                           3
    ==============================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
    ------------------------------------------------------------------------------
    x1             0.4872      0.024     20.076      0.000         0.439     0.535
    x2             0.5408      0.045     12.067      0.000         0.452     0.630
    x3             0.5136      0.030     16.943      0.000         0.453     0.574
    const          4.6294      0.372     12.446      0.000         3.891     5.368
    ==============================================================================
    Omnibus:                        0.945   Durbin-Watson:                   1.570
    Prob(Omnibus):                  0.624   Jarque-Bera (JB):                1.031
    Skew:                          -0.159   Prob(JB):                        0.597
    Kurtosis:                       2.617   Cond. No.                         33.2
    ==============================================================================
    
    The following three graphs are refactored versions of the regression plots. Each graph looks at the data and estimation results with respect to one of the three variables. (The graphs look better in original size.)
    The short lines in the first subplot of each graph are the prediction confidence intervals for each observation.
    The code is short, if we have the (still unpublished) helper functions.
    res is an OLS results instance
    from regressionplots_new import plot_regress_exog
    
    fig9 = plot_regress_exog(res, exog_idx=0)
    add_lowess(fig9, ax_idx=1, lines_idx=0)
    add_lowess(fig9, ax_idx=2, lines_idx=0)
    add_lowess(fig9, ax_idx=3, lines_idx=0)
    
    fig10 = plot_regress_exog(res, exog_idx=1)
    add_lowess(fig10, ax_idx=1, lines_idx=0)
    add_lowess(fig10, ax_idx=2, lines_idx=0)
    add_lowess(fig10, ax_idx=3, lines_idx=0)
    
    fig11 = plot_regress_exog(res, exog_idx=2)
    add_lowess(fig11, ax_idx=1, lines_idx=0)
    add_lowess(fig11, ax_idx=2, lines_idx=0)
    add_lowess(fig11, ax_idx=3, lines_idx=0)
    

    by Josef Perktold (noreply@blogger.com) at May 10, 2012 08:54 PM

    Plots in statsmodels: qqplot

    Other news first, since I haven't managed to catch up with the blogs:
    • statsmodels has four students in GSoC, the first four projects described in my previous post. Congratulations to Alexandre, Divyanshu, George and Justin
    • statsmodels 0.4.0 has been release with new name without scikits in front, more on pypi
    statsmodels has a graphics subdirectory, where we started to collect some of the common statistical plots. To make the documentation a bit more exciting, I am adding plots directly to the docstrings for the individual functions. Currently, we don't have many of them in the online documentation yet, two examples violin_plot and bean_plot.
    A note on the documentation: Skipper improved the frontpage, which makes it easier to find the documentation for the latest released version and for the development version. Currently, the development version is better and is improving, and it is incompatible with the 0.4.0 release in only one part.

    quantile-quantile plot: qqplot

    The documentation for the function is here. The function signature is
    qqplot(data, dist=stats.norm, distargs=(), a=0, loc=0, scale=1, fit=False, line=False, ax=None)
    
    I am not copying the entire docstring, what I would like to present here are some examples and how to work with the plots.
    The first example is from the docstring. I don't like the default, so I kept adding keyword arguments until the plot is more to my taste.
    • The first plot uses no keywords and assumes normal distribution, and does not standardize the data.
    • The second plot adds line='s', which according to the docstring
      's' - standardized line, the expected order statistics are scaled
            by the standard deviation of the given sample and have the mean
            added to them
      
      corresponds to the line after fitting location and scale for the normal distribution
    • The third plot adds fit=True to get standardized sample quantiles and plots the 45 degree line. That's the plot I would prefer.
    • The fourth plot is similar to the third plot, but with the t distribution instead of the normal distribution. I was surprised that the third and fourth plot look the same, until I checked and it turned out that the fitted t distribution has a huge degrees of freedom parameter and so is essentially identical to the normal distribution.


    I will go over the code to produce this below.
    I started the second example to see whether fitting the t distribution works correctly. Instead of using real data, I generate 1000 observations with a t distribution with df=4 and standard location(0) and scale (1).
    • The first plot fits a normal distribution, keywords: line='45', fit=True
    • The second plot fits the t distribution, keywords: dist=stats.t, line='45', fit=True
    • The third plot is the same as the second plot, but I fit the t distribution myself, instead of having qqplot do it. keywords: dist=stats.t, distargs=(dof,), loc=loc, scale=scale, line='45'. I added the estimated parameters into a text insert in the plot. qqplot currently doesn't tell us what the fitted parameters are.


    The Code

    Here was my first attempt, following the docstring example
    from scipy import stats
    import statsmodels.api as sm
    
    #estimate to get the residuals
    data = sm.datasets.longley.load()
    data.exog = sm.add_constant(data.exog)
    mod_fit = sm.OLS(data.endog, data.exog).fit()
    res = mod_fit.resid
    
    fig = sm.graphics.qqplot(res, dist=stats.t, line='45', fit=True)
    fig.show()
    
    It works but the x-axis goes from -3 to 3, while there are only values from -2 to 2.
    Detour to some background
    A while ago we had a discussion on the mailing list what a plot in statsmodels should return. With the helpful comments of John Hunter, we finally agreed that plots should take an ax (matplotlib axis) argument if it's meaningful, and always return a figure instance fig. If ax is None, or the plot is a combination plot (several plots in one figure), then a figure is created and returned. If ax is given, then that is used to attach the plot elements. Ralf Gommers converted our plot functions to follow this pattern, besides that, he also wrote several of the plots that are currently in statsmodels.
    So, to change the axis limits in the above graph, all I have to add is:
    fig.axes[0].set_xlim(-2, 2)
    
    The resulting plot is then the same as the third plot in the first graph above.
    The first graph
    Here is now the script for the first graph in several stages:
    First I import some modules and calculate the residuals following the example
    from scipy import stats
    from matplotlib import pyplot as plt
    import statsmodels.api as sm
    
    #example from docstring
    data = sm.datasets.longley.load()
    data.exog = sm.add_constant(data.exog)
    mod_fit = sm.OLS(data.endog, data.exog).fit()
    res = mod_fit.resid
    
    Then I hardcode a left position for text inserts, and create a matplotlib figure instance
    left = -1.8
    fig = plt.figure()
    
    Next we can add the first subplot. The only keyword arguments for qqplot is ax to tell qqplot to attach the plot to my first subplot. Since I want to insert a text to describe the keywords, I needed to spend some time with the matplotlib documentation. As we have a reference to the axis instance, it is easy to change or add plot elements
    ax = fig.add_subplot(2, 2, 1)
    sm.graphics.qqplot(res, ax=ax)
    top = ax.get_ylim()[1] * 0.75
    txt = ax.text(left, top, 'no keywords', verticalalignment='top')
    txt.set_bbox(dict(facecolor='k', alpha=0.1))
    
    The other subplots follow the same pattern. I didn't try to generalize or avoid hardcoding
    ax = fig.add_subplot(2, 2, 2)
    sm.graphics.qqplot(res, line='s', ax=ax)
    top = ax.get_ylim()[1] * 0.75
    txt = ax.text(left, top, "line='s'", verticalalignment='top')
    txt.set_bbox(dict(facecolor='k', alpha=0.1))
    
    ax = fig.add_subplot(2, 2, 3)
    sm.graphics.qqplot(res, line='45', fit=True, ax=ax)
    ax.set_xlim(-2, 2)
    top = ax.get_ylim()[1] * 0.75
    txt = ax.text(left, top, "line='45', \nfit=True", verticalalignment='top')
    txt.set_bbox(dict(facecolor='k', alpha=0.1))
    
    ax = fig.add_subplot(2, 2, 4)
    sm.graphics.qqplot(res, dist=stats.t, line='45', fit=True, ax=ax)
    ax.set_xlim(-2, 2)
    top = ax.get_ylim()[1] * 0.75
    txt = ax.text(left, top, "dist=stats.t, \nline='45', \nfit=True",
                  verticalalignment='top')
    txt.set_bbox(dict(facecolor='k', alpha=0.1))
    
    The final step is to adjust the layout, so that axis labels don't overlap with other subplots if the graph is not very large
    fig.tight_layout()
    
    The second graph
    The second graph follows the same pattern with a few changes.
    First we generate a random sample using scipy.stats which under the hood uses the random numbers from numpy. You can notice here that I am cheating. I ran the script several times to find "nice" seeds. Especially in smaller samples, qqplot might often not be very good in distinguishing normal and t distributions.
    import numpy as np
    seed = np.random.randint(1000000)
    print 'seed', seed
    seed = 461970  #nice seed for nobs=1000
    #seed = 571478  #nice seed for nobs=100
    #seed = 247819  #for nobs=100, estimated t is essentially normal
    np.random.seed(seed)
    rvs = stats.t.rvs(4, size=1000)
    
    The first two subplot are very similar to what is in the first graph
    fig2 = plt.figure()
    ax = fig2.add_subplot(2, 2, 1)
    fig2 = sm.graphics.qqplot(rvs, dist=stats.norm, line='45', fit=True, ax=ax)
    top = ax.get_ylim()[1] * 0.75
    xlim = ax.get_xlim()
    frac = 0.1
    left = xlim[0] * (1-frac) + xlim[1] * frac
    txt = ax.text(left, top, "normal", verticalalignment='top')
    txt.set_bbox(dict(facecolor='k', alpha=0.1))
    
    ax = fig2.add_subplot(2, 2, 2)
    fig2 = sm.graphics.qqplot(rvs, dist=stats.t, line='45', fit=True, ax=ax)
    top = ax.get_ylim()[1] * 0.75
    xlim = ax.get_xlim()
    frac = 0.1
    left = xlim[0] * (1-frac) + xlim[1] * frac
    txt = ax.text(left, top, "t", verticalalignment='top')
    txt.set_bbox(dict(facecolor='k', alpha=0.1))
    
    For the third plot, I estimate the parameters of the t-distribution to see whether I get the same results as in the second plot (I do), and so I can insert the parameter estimates into the plot
    params = stats.t.fit(rvs)
    dof, loc, scale = params
    ax = fig2.add_subplot(2, 2, 4)
    fig2 = sm.graphics.qqplot(rvs, dist=stats.t, distargs=(dof,), loc=loc,
                     scale=scale, line='45', fit=False, ax=ax)
    top = ax.get_ylim()[1] * 0.75
    xlim = ax.get_xlim()
    frac = 0.1
    left = xlim[0] * (1-frac) + xlim[1] * frac
    txt = ax.text(left, top, "t \ndof=%3.2F \nloc=%3.2F, \nscale=%3.2F" % tuple(params),
                  verticalalignment='top')
    txt.set_bbox(dict(facecolor='k', alpha=0.1))
    
    That's it for the plots, now I need to add them to the statsmodels documentation.

    PS: normality tests, details left for another day

    qqplots give us a visual check whether a sample follows a specific distribution. The case that we are interested in most often, is a test for normality. Scipy.stats and statsmodels have several normality tests. The ones I have written recently are Anderson-Darling and Lillifors. Lillifors is a Kolmogorov-Smirnov test for normality when mean and variance are estimated. Calculating a statistical test provides a more reliable test than a "vague" visual inspection, but these tests do not point us to a specific alternative and provide less information about the direction in which a null hypothesis might be incorrect.
    Using the residuals in the first example, neither test rejects the Null Hypothesis that the residuals come from a normal distribution
    >>> normal_ad(res)
    (0.43982328207860633, 0.25498161947268855)
    >>> lillifors(res)
    (0.17229856392873188, 0.2354638181341876)
    
    On the other hand, in the second example with 1000 observations from the t distribution, the assumption that the data comes from a normal distribution is clearly rejected
    >>> normal_ad(rvs)
    (6.5408483355136013, 4.7694160497092537e-16)
    >>> lillifors(rvs)
    (0.05919821253474411, 8.5872265678140885e-09)
    
    PPS:
    I'm reluctant to publish the import path, because I had forgotten to add them to a proper place for 0.4.0, and the import location will not stay where it is. It took me a few minutes to find out that they are not on any recommended import path when I wrote these scripts
    >>> from statsmodels.stats.adnorm import normal_ad
    >>> from statsmodels.stats.lilliefors import lillifors
    

    by Josef Perktold (noreply@blogger.com) at May 10, 2012 06:44 PM

    May 08, 2012

    Gaël Varoquaux

    Update on scikit-learn: recent developments for machine learning in Python

    Yesterday, we released version 0.11 of the scikit-learn toolkit for machine learning in Python, and there was much rejoincing.

    Major features gained in the last releases

    In the last 6 months, there have been many things happening with the scikit-learn. While I do not whish to give an exhaustive summary of features added (it can be found here), let me list a few of the additions that I personnally find exciting.

    Non-linear prediction models

    For complex prediction problems where there is no simple model available, as in computer vision, non-linear models are handy. A good example of such models are those based on decisions trees and model averaging. For instance random forests are used in the Kinect to locate body parts. As they are intrinsically complex, they may need a large amount of training data. For this reason, they have been implemented in the scikit-learn with special attention to computational efficiency.

    Dealing with unlabeled instances

    It is often easy to gather unlabeled observations than labeled observation. While prediction of a quantity of interest is then harder or simply impossible, mining this data can be useful.

    Semi-supervised
    learning
    : using unlabeled observations together with labeled one for better prediction.

     
    Outlier/novelty detection: detect deviant observations.

     
    Manifold learning: discover a non-linear low-dimensional structure in the data.

     
    Clustering with an algorithm that can scale to really large datasets using an online approach: fitting small portions of the data on after the other (Mini-batch k-means).

     
    Dictionary learning: learning patterns in the data that represent it sparsely: each observation is a combination of a small number patterns.

    Sparse models: when very few descriptors are relevant

    In general, finding which descriptors are useful when there are many of them is like find a needle in a haystack: it is a very hard problem. However, you know that only a few of these descriptors actually carry information, you are in a so-called sparse problem, for specific approaches can work well.

    Orthogonal matching pursuit: a greedy and fast algorithm for very sparse linear models

     
    Randomized sparsity (randomized Lasso): selecting the relevant descriptors in noisy high-dimensional observations

     
    Sparse inverse covariance: learning graphs of connectivity from correlations in the data

    Getting developpers together: the Granada sprint

    Of course, such developments happen only because we have a great team of dedicated coders.

    Getting along and working together is a critical part of the project. In December 2011, we held the first international scikit-learn sprint in Granada, on the side of the NIPS conference. That was a while ago, and I haven’t found time to blog about it, maybe because I was too busy merging in the code produced :). Here is a small report from my point of view. Better late than never.

    Participants from all over the globe

    This sprint was a big deal for us, because for the first time, thanks to sponsor money, we were able to fly contributors from overseas and meet the team in person. For the first time I was able to see the faces behind many of the fantastic people that I knew only from the mailing list.

    I really think that we must thank our sponsors, Google and tinyclues, but also The PSF, that is in particular Jesse Noller but especially Steve Holden, whose help was absolutely instrumental in getting sponsor money. This money is what made it possible to unite a good fraction of the team, and it opened the door to great moments of coding, and more.

    Producing code lines and friendship

    An important aspect of the sprint for me was that I really felt the team being united. Granada is a great city and we spent fantastic moments together. Now when I review code, I can often put a face on the author of that code and remember a walk below the Alhambra or an evening in a bar. I am sure it helps reviewing code!

    Was it worth the money?


    I really appreciate that the sponsors did not ask for specific returns on investment beyond acknowledgments, but I think that it is useful for us to ask the question: was it worth the money? After all, we got around $5000, and that’s a lot of money. First of all, as a side effect of the sprint, people who had invested a huge amount of time in a machine learning toolkit without asking anything in return got help to go to a major machine learning conference.

    But was there a return over investment in terms of code? If you look at the number of lines of code modified weekly (figure on the right), there is a big spike in December 2011. That’s our sprint! Importantly, if you look at the months following the sprint, there still is a lot of activity in the months following the sprint. This is actually unusual, as the active developments happen more in the summer break than during the winter, as our developpers are busy working on papers or teaching.

    The explaination is simple: we where thrilled by the sprint. Overall, it was incredibly beneficial to the project. I am looking forward to the next ones.

    by gael at May 08, 2012 11:12 PM

    May 03, 2012

    Enthought

    BioInformatics at Scipy: It’s Alive!

    We are pleased to announce the 2012 SciPy Bioinformatics Workshop held in conjunction with SciPy 2012 this July in Austin, TX. Python in biology is not dead yet… in fact, it’s alive and well! Remember just a few short years ago when BioPerl ruled the world? Just one minor paradigm shift* later and Python now [...]

    by David Kim at May 03, 2012 09:03 PM

    May 02, 2012

    Andreas Mueller

    Python tidbits: inverting the nesting of a nested list.

    More than once I came across the problem of rearranging a nested list. I had a nested list of the form


    X = [['a', 'b', 'c'], ['d', 'e', 'f']]

    And I want


    Y = [['a', 'd'], ['b', 'e'], ['c', 'f']]

    without having to resort to an ugly list comprehension over 3 lines. A friend told me to use


    Y = zip(*X)

    So easy! Kind of obvious but I didn't find it on the web. So I thought I'd write it down. Enjoy!

    by Andreas Mueller (noreply@blogger.com) at May 02, 2012 09:48 AM

    May 01, 2012

    EuroSciPy

    Deadline extension -- May 7th

    The committee of the Euroscipy 2012 conference has extended the deadline for abstract submission to Monday May 7th, midnight (Brussels time). Up to then, new abstracts may be submitted on http://www.euroscipy.org/conference/euroscipy2012, and already-submitted abstracts can be modified.

    We are very much looking forward to your submissions to the conference.

    Euroscipy 2012 is the annual European conference for scientists using Python. It will be held August 23-27 2012 in Brussels, Belgium.

    It is also still possible to propose sprints that will take place after the conference, please write to Berkin Malkoc (malkocb at itu.edu.tr) for practical organization (rooms, ...).

    Emmanuelle

    by Emmanuelle Gouillart at May 01, 2012 12:28 AM

    April 30, 2012

    Enthought

    PyGotham Sneak Peek: GPU Computing

    This week, we are highlighting GPUs. Sean Ross Ross has put together a brief sneak peak screencast (above) of his GPU computing PyGotham talk that covers some basic terminology and a short example. The example ports a python/numpy algorithm to the GPU via CLyther. CLyther offers a higher level of abstraction that allows programmers to [...]

    by David Kim at April 30, 2012 10:19 PM

    April 29, 2012

    Vytautas Jancauskas

    Added KDE to Scatter Matrix Diagonals

    Added the ability to plot KDE on the diagonals of scatter matrix plot. An argument in scatter_matrix can be used to set if one wants to have a histogram or a KDE plot there. Examples when plotting IRIS data set are presented below. Along with the iris.csv file that was used.

    This is the histogram version which was drawn using the little script below:
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas
    iris = pandas.read_csv("/home/vytas/iris.csv")
    df = pandas.DataFrame(iris, columns=['slength', 'swidth', 'plength', 'pwidth'])
    pandas.tools.plotting.scatter_matrix(df, alpha=0.2, diagonal='hist')
    plt.show()

    This is the KDE version on the same data. It was drawn using the script below:
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas
    iris = pandas.read_csv("/home/vytas/iris.csv")
    df = pandas.DataFrame(iris, columns=['slength', 'swidth', 'plength', 'pwidth'])
    pandas.tools.plotting.scatter_matrix(df, alpha=0.2, diagonal='kde')
    plt.show()

    The data can be found here.

    by Vytautas Jančauskas (noreply@blogger.com) at April 29, 2012 05:53 AM

    April 28, 2012

    PythonXY

    Python(x, y) 2.7.2.2 Released!

    We are pleased to announce that Python(x,y) 2.7.2.2 has been released. This release was delayed for 3 months - both Pierre and me could not spare as much time as we wanted too.

    The focus of this version was the numerous reported installation issues. Also, many packages were updated and some new ones added.

    The next release will be based on Python 2.7.3.

    Python(x,y) is a free Python distribution providing a ready-to-use scientific development software for numerical computations, data analysis and data visualization based on Python programming language, Qt graphical user interfaces (and development framework) and Spyder interactive development environment. Its purpose is to help scientific programmers used to interpreted languages (such as MATLAB or IDL) or compiled languages (C/C++ or Fortran) to switch to Python.

    New Plugins

    • ITK 3.20 (without the itkvtkglue feature which is not compatible with VTK 5.8.0) 
    • pyparsing 1.5.6 - upgraded from additional plugin status. 
    • pyfits 3.0.5 - Hidden under Veusz, upgraded to core plugin status. 
    • pandas 0.7.0 


    Updated Plugins

    • Console 2.0.148.5 
    • cvxopt 1.1.4 
    • cx_Freeze 4.2.3.1 
    • Cython 0.15.1.1 
    • Distribute 0.6.24 
    • docutils 0.8.1.2 
    • Enthought Tool Suite 4.1.0 
    • ETS 4.1.0.2 
    • GDAL 1.9.0.1 
    • gettext 0.14.4.2 
    • gnuplot 1.8.0.3 
    • guidata 1.4.2.3 
    • guiqwt 2.1.6.3 
    • IPython 0.10.2.5 
    • jinja2 2.6.0.1 
    • MDP 3.3 
    • MinGW 4.5.2.2 
    • mx 3.2.3 
    • netcdf4 0.9.9 
    • NetworkX 1.6 
    • nose 1.1.2.1 
    • numexpr 2.0.1 
    • numpy 1.6.1.1 
    • pandas 0.7.1 
    • PIL 1.1.7.2 
    • Pip 1.1.0 
    • PP 1.6.1.1 
    • pylint 0.25.1.1 
    • PyQt 4.8.6.3 
    • PySerial 2.6.0.1 
    • PyTables 2.3.1 
    • Pywin32 2.17 
    • QtHelp 4.7.4 
    • scikits-learn 0.10.0.1 
    • scikits.image 0.5.0 
    • SciPy 0.10.0.1 
    • SciTE 3.0.3.2 
    • simplejson 2.3.3 
    • Sphinx 1.1.3.1 
    • Spyder 2.1.9 
    • SWIG 2.0.4.1 
    • Veusz 1.14.3 
    • vitables 2.1.0.3 
    • VPython 5.72 
    • VTK 5.8.0.1 
    • WinMerge 2.12.4.2 
    • winpdb 1.4.8.3 
    • wxPython 2.8.12.1 
    • xy 1.2.14.3 
    • xydoc 1.0.4.2 

    Improvements and Fixes

    • Issue 393 : Mayavi2 does not start 
    • Fixed explorer context menu console startup entries 
    • Fixed many issues in the SciTE API generation script 
    • All shell shortcuts have their working directory set to USERPROFILE. 
    • All file associations, menu shortcuts and environment variables are created based on current context. 
    • Issue 379 : installing vitables should automatically add PyQT4 
    • Issue 374 : PyQt4-4.8.5_py27 user install bug 
    • Issue 373 : installing python(x,y) breaks existing python install without warning 
    • Issue 359 : Scripts exe's won't launch corresponding -script.py when installing python in custom dir 
    • Issue 329 : Left click menu shortcuts for console are broken 
    • Issue 292 : Python(x,y) 2.7.2.0 installs registry to HKCU instead of HKLM even if "for all users" is selected 
    • Issue 274 : Installation Issues and Enhancement Requests 
    • Issue 107 : Installation to many user accounts 

    -Gabi Davar

    by Grizzly Nyo (noreply@blogger.com) at April 28, 2012 02:41 PM

    April 27, 2012

    Andreas Mueller

    Region connectivity graphs in Python [edit: minor bug]

    Recently I started playing with CRFs on superpixels for image segmentation.

    While doing this I noticed that Python has very little methods for morphological operations on images. For example I did not find any functions to exctract connected components inside images.

    For CRFs one obviously needs the superpixel neighbourhood graph to work on and I didn't find any ready made function to obtain it.

    After some pondering, I came up with something. Since I didn't find anything else online
    I thought I'd share it.

    def make_graph(grid):
    # get unique labels
    vertices = np.unique(grid)

    # map unique labels to [1,...,num_labels]
    reverse_dict = dict(zip(vertices,np.arange(len(vertices))))
    grid = np.array([reverse_dict[x] for x in grid.flat]).reshape(grid.shape)

    # create edges
    down = np.c_[grid[:-1, :].ravel(), grid[1:, :].ravel()]
    right = np.c_[grid[:, :-1].ravel(), grid[:, 1:].ravel()]
    all_edges = np.vstack([right, down])
    all_edges = all_edges[all_edges[:, 0] != all_edges[:, 1], :]
    all_edges = np.sort(all_edges,axis=1)
    num_vertices = len(vertices)
    edge_hash = all_edges[:,0] + num_vertices * all_edges[:, 1]
    # find unique connections
    edges = np.unique(edge_hash)
    # undo hashing
    edges = [[vertices[x%num_vertices],
    vertices[x/num_vertices]] for x in edges]

    return vertices, edges
    This code assumes a grid is given to it with uniquely labeled connected components.
    First, the unique labels are mapped to [1,..,num_labels] (which is not strictly necessary but I ran into problems since my labels were huge).

    Then I create "down" and "right" edges using a technique I saw at scikits.learn. I stack the edges and sort them, then hash them and find unique edges.
    [edit] I also remove self-connections. Missed that on the first try. [/edit]

    Hashing is necessary since "unique" only works on numbers. I could also have used a set, I guess.
    Then I undo the hash and I'm done.
    I find this code to be pretty fast and I thought it would be a lot more work to do this.
    Numpy makes stuff so easy :) - I love it.

    Now we want to visualize our graph. I start from the label-image in grid and [vertices, edges]:
    # compute region centers:
    gridx, gridy = np.mgrid[:grid.shape[0], :grid.shape[1]]
    centers = dict()
    for v in vertices:
    centers[v] = [gridy[grid == v].mean(), gridx[grid == v].mean()]

    # plot labels
    plt.imshow(grid)

    # overlay graph:
    for edge in edges:
    plt.plot([centers[edge[0]][0],centers[edge[1]][0]],
    [centers[edge[0]][1],centers[edge[1]][1]])

    And done.

    Give it a try together with quickshift.

    by Andreas Mueller (noreply@blogger.com) at April 27, 2012 08:22 PM

    Vytautas Jancauskas

    Histograms and Scatter Matrix

    I added histograms to diagonals of the plot generated by pandas.tools.plotting.scatter_matrix function.

    The code used to generate this plot is presented below:


    import numpy as np
    import matplotlib.pyplot as plt
    import pandas
    df = pandas.DataFrame(np.random.randn(1000, 4), columns=['a', 'b', 'c', 'd'])
    pandas.tools.plotting.scatter_matrix(df, alpha=0.2)
    plt.show()

    by Vytautas Jančauskas (noreply@blogger.com) at April 27, 2012 04:02 PM

    Official SymPy blog

    GSoC 2012 Results Announced

    Google has announced the results for Google Summer of Code. I am proud to announce that we got six slots from Google. The following projects have been accepted:

    (Project, Student, Mentor)
    - Category Theory Module, Sergiu Ivanov, Tom Bachmann
    - Density Operators for Quantum Module in sympy.physics.quantum, Guru Devanla, Brian Granger (co-mentor Sean Vig)
    - Enhancements to sympy.physics.mechanics, Angadh Nanjangud, Gilbert Gede
    - Group Theory, Aleksandar Makelov, David Joyner (co-mentor Aaron Meurer)
    - Implicit Plotting Module, Bharath M R, Aaron Meurer
    - Vector Analysis, Stefan Krastanov, Matthew Rocklin


    Join me in congratulating these students on their acceptance.

    In case you don't know, Google Summer of Code is a program where Google pays students to write code for open source projects. SymPy was accepted as a mentoring organization this year. The goal of the program is to help the students learn new skills, in particular in our case:

    * contributing to opensource * working with the community * learn git, pull requests, reviews * teach them how to review other's people patches * do useful work for SymPy * have fun, and encourage the students to stay around

    To all the students who are accepted, you should be receiving an email from your mentor soon to discuss how you will be communicating over the summer about your project. You should meet with your mentor about once a week during the summer to go over your progress. You should either meet on a public channel (like IRC), or else post minutes of your meeting in some public channel, so that the whole community can see your progress too.

    Some of you may also be assigned a backup mentor or co-mentor. These people will also help out in mentoring your project. If you have two mentors and one is not available for something, or does not know the answer, you can ask your co-mentor.

    I would like all of us to strongly encourage students this summer to submit pull requests early and often. This will go a long ways towards making sure that you don't end the summer with a ton of code written that never gets merged. Students should help review pull requests by other students, so that we don't get bogged down reviewing so much code.

    We also require that all students keep a weekly blog of their work over the summer. If you don't already have a blog, you should start one. I recommend using either Wordpress or Blogger (I personally use and prefer Wordpress because it has a cleaner interface and lets you do $latex math$, but I recommend you test out both). You can also use some other service too if you like. The only requirement is that it has an RSS feed, so we can put it on planet.sympy.org. Once you have set up your blog, send me the url so I can add it there.

    Starting on the week of May 21 (when the GSoC period officially begins), We will expect you to have at least one blog post a week, describing your progress for that week, or something interesting about your project. If you don't have a post by the beginning of the day on Saturday, your mentor or I will email you to remind you about it.

    I invite other mentors who have blogs to blog as well. And I encourage all community members to follow the student blogs, so you can see their progress.

    I would like to thank all the students who applied this year and everyone who submitted a patch. We received way more feedback this year than we ever have before. I would also like to thank all the mentors for helping review patches and proposals.

    This summer is looking to be another very productive one for SymPy, and I look forward to it!

    by Aaron Meurer (noreply@blogger.com) at April 27, 2012 03:52 PM

    April 24, 2012

    Philip Herron

    Should I?

    Well today i had a thought where i have spent about 3 to 4 years of my life working on my own projects and Open Source trying to learn outside of University. In my Degree path there is no Final Year project or anything, this always annoyed me. I only have 2 exams at the moment which i don’t fee that confident in. And since Gccpy is actually starting to get interesting where its starting to work and i completed my own interpreter to a decent amount crules and i created my own static compiler cmod a modula compiler for i386. I already had a thesis document started along side Gccpy and i am thinking of finishing it up this week to some extend detailing all 3 of these projects focusing on gccpy of course and handing this into my university i feel this must count towards something and i just want this monkey off my back i guess and i feel this might help. You can view the code for any of these projects over at:

    code.redbrain.co.uk

    Reminder the Gccpy code has been moved to the gccpy repository on that server and stable *ish code is maintined on the gccpy branch of the GCC git repository when i reach more stable code i will push the svn server also.

    by redbrain at April 24, 2012 02:51 PM

    Fabian Pedregosa

    line-by-line memory usage of a Python program

    My newest project is a Python library for monitoring memory consumption of arbitrary process, and one of its most useful features is the line-by-line analysis of memory usage for Python code.

    I wrote a basic prototype six months ago after being surprised by the lack of related tools. I wanted to plot memory consumption of a couple of Python functions but did not find a python module to do the job. I came to the conclusion that there is no standard way to get the memory usage of the Python interpreter from within Python, so I resorted to reading for from /proc/$PID/statm. From there on I realized that one the fetching of memory is done, making a line-by-line report wouldn’t be hard.

    Back to today. I’ve been using the line-by-line memory monitoring to diagnose poor memory management (hidden temporaries, unused allocation, etc.) for some time. It seems to work on two different computers, so full of confidence as I am, I’ll write a blog post about it …

    How to use it?

    The easiest way to get it is to install from the Python Package Index:

        $ easy_install -U memory_profiler # pip install -U memory_profiler

    but other options include fetching the latests from github or dropping it on your current working directory or somewhere else on your PYTHONPATH since it consist of a single file.

    Then next step is to write some python code to profile. It can be just about any function, but for the purpose of this blog post I’ll create a function my_func() with mostly memory allocations and save it to a file named example.py:

    import numpy as np

    @profile
    def my_func():
        a = np.zeros((100, 100))
        b = np.zeros((1000, 1000))
        c = np.zeros((10000, 1000))
        return a, b, c

    if __name__ == '__main__':
        my_func()

    Note that I’ve decorated the function with @profile. This tells the profiler to look into function my_func and gather the memory consumption for each line.

    Wake up the cookie monster

    To start profiling and output the result to stdout, run the script as usual and append the options “-m memory_profiler -l -v” to the python interpreter.

    $ python -m memory_profiler -l -v example.py
    Line #    Mem usage   Line Contents
    ===================================
         3                @profile
         4     13.68 MB   def my_func():
         5     13.77 MB       a = np.zeros((100, 100))
         6     21.40 MB       b = np.zeros((1000, 1000))
         7     97.70 MB       c = np.zeros((10000, 1000))
         8     97.70 MB       return a, b, c

    voilá! Each line is prefixed by the memory usage in MB of the Python interpreter after that line has been executed.

    by fabian at April 24, 2012 05:04 AM

    April 23, 2012

    Gaël Varoquaux

    3 Google summer of code for scikit-learn and more…

    The scikit-learn got 3 students accepted for the Google summer of code.

    In addition, other related projects have exciting projects, for instance statsmodels:

    and Cython:

    finally, in Pandas:

    Congratulations to all of the students. This is going to be an exciting summer.

    by gael at April 23, 2012 09:25 PM

    Enthought

    PyGotham Sneak Peek: The Three “P’s”

    As we announced last week, Enthought is sponsoring an open-source High Performance Python track at PyGotham this year. The video above is meant to give you a sneak peek at the Parallel Python class. Please remember that the class will cover a broader array of topics (as described below), so don’t worry if you aren’t [...]

    by David Kim at April 23, 2012 08:29 PM

    April 22, 2012

    EuroSciPy

    Abstract deadline approaching, and sprints

    Just a quick reminder of the approaching deadline for abstract submission at the Euroscipy 2012 conference: the deadline is April 30, in one week.

    Euroscipy 2012 will be held in Brussels, August 23-27, at the Université Libre de Bruxelles (ULB, Solbosch Campus).

    The EuroSciPy meeting is a cross-disciplinary gathering focused on the use and development of the Python language in scientific research and industry. This event strives to bring together both users and developers of scientific tools, as well as academic research and state of the art industry.

    More information about the conference, including practical information, are found on the conference website http://www.euroscipy.org/conference/euroscipy2012

    We are soliciting talks and posters that discuss topics related to scientific computing using Python. These include applications, teaching, future development directions, and research. We welcome contributions from the industry as well as the academic world.

    Submission guidelines are found on http://www.euroscipy.org/card/euroscipy2012_call_for_contributions

    Also, rooms are available at the ULB for sprints on Tuesday August 28th and Wednesday 29th. If you wish to organize a sprint at Euroscipy, please get in touch with Berkin Malkoc (malkocb AT itu dot edu dot tr).

    by Emmanuelle Gouillart at April 22, 2012 05:24 PM

    April 21, 2012

    Philip Herron

    Gccpy Update!

    I have been very quite this last long while. Things are actually finally starting to come together for me now i am about to finish my degree when i got a few job interviews out of the blue and the one i really wanted i have now got the offer for the job and i am taking it. I have 2 exams to finish and i am just studying for them at the moment but not worrying about it that much because i quite frankly just don’t care that much because it wont affect any part of my life any-more. I have dreamed of this day for years and it feels so good. I have a small dilemma in that i went for Google Summer of code again for the 3rd year for Cython my proposal you can see here: gsoc 2012 proposal . Which i don’t feel i can accept if i get accepted but i shall see the start dates for my new job are set yet as i have to wait on the paper work coming though still and i will have to complete a course on financial software. I will give more info on the job because its quite exciting when i get everything set it stone more.

     

    What i want to talk about is that my project in creating a Python Front-end to GCC is finally starting to do stuff now its quite exciting as of 1 hour ago you can compile this amount of python and have it work:

     

    1.  
    2. class date:
    3.   day = 0
    4.   month = 0
    5.   year = 0
    6.   print 11
    7.  
    8.   def __init__ (self, x, y, z):
    9.     self.day = x
    10.     self.month = y
    11.     self.year = z
    12.     print 22
    13.  
    14.   def print_date (self):
    15.     print 33
    16.     print self.day, self.month, self.year
    17.     print 34
    18.  
    19. print 0
    20. x = date (1, 2, 3)
    21. print 1
    22. x.print_date ()
    23. print 2
    24. x.day = 10
    25. print 3
    26. x.print_date ()
    27. print 4
    28.  

     

    Although this is an extremely basic example this shows a helluva lot in how data is accessed and addressed which is all the core of dynamic typing. Right now i am just flesing things out with conditionals and loops now which will lead me on to creating the yield instruction and returns (which is half implemented) which i plan to use Setjmp.h to implement. Which leads on to exceptions etc. Then we move on to imports and some standard library stuff before i would work on a basic garbage collector.

     

    Anyone interested i output gimple code for that example:

     

    1.  
    2. t_date_1.py.__main_start__ ()
    3. {
    4.   struct gpy_object_attrib_t * D.200;
    5.   struct gpy_object_attrib_t * D.201;
    6.   struct gpy_object_attrib_t * D.202;
    7.   struct gpy_object_attrib_t * D.203;
    8.   struct gpy_object_attrib_t * D.204;
    9.   struct gpy_object_attrib_t * D.205;
    10.   struct gpy_object_t * D.206;
    11.   struct gpy_object_t * D.207;
    12.   struct gpy_object_t * * __GPY_GLOBL_RR_STACK_POINTER.37;
    13.   struct gpy_object_t * * __GPY_GLOBL_RR_STACK_POINTER.38;
    14.   struct gpy_object_t * D.210;
    15.   struct gpy_object_t * * __GPY_GLOBL_RR_STACK_POINTER.39;
    16.   struct gpy_object_t * D.212;
    17.   struct gpy_object_t * D.213;
    18.   struct gpy_object_t * * __GPY_GLOBL_RR_STACK_POINTER.40;
    19.   struct gpy_object_t * D.215;
    20.   struct gpy_object_t * * __GPY_GLOBL_RR_STACK_POINTER.41;
    21.   struct gpy_object_t * D.217;
    22.   struct gpy_object_t * * __GPY_GLOBL_RR_STACK_POINTER.42;
    23.   struct gpy_object_t * D.219;
    24.   struct gpy_object_t * D.220;
    25.   void <retval>;
    26.  
    27.   gpy_rr_extend_runtime_stack (3);
    28.   D.200 = gpy_rr_fold_attribute ("month", 0B, 20);
    29.   D.201 = gpy_rr_fold_attribute ("day", 0B, 16);
    30.   D.202 = gpy_rr_fold_attribute ("__field_init__", t_date_1.py.date.__field_init__, 12);
    31.   D.203 = gpy_rr_fold_attribute ("print_date", t_date_1.py.date.print_date, 8);
    32.   D.204 = gpy_rr_fold_attribute ("__init__", t_date_1.py.date.__init__, 4);
    33.   D.205 = gpy_rr_fold_attribute ("year", 0B, 0);
    34.   C.13 = gpy_rr_fold_attrib_list (6, D.205, D.204, D.203, D.202, D.201, D.200);
    35.   A.14 = __GPY_GLOBL_RR_STACK_POINTER;
    36.   D.206 = gpy_rr_fold_class_decl (C.13, 24, "t_date_1.py.date");
    37.   *A.14 = D.206;
    38.   P.15 = gpy_rr_fold_integer (0);
    39.   gpy_rr_eval_print (1, 1, P.15);
    40.   A.16 = __GPY_GLOBL_RR_STACK_POINTER;
    41.   P.17 = gpy_rr_fold_integer (1);
    42.   P.18 = gpy_rr_fold_integer (2);
    43.   P.19 = gpy_rr_fold_integer (3);
    44.   D.207 = *A.16;
    45.   R.20 = gpy_rr_fold_call (D.207, 3, P.17, P.18, P.19);
    46.   __GPY_GLOBL_RR_STACK_POINTER.37 = __GPY_GLOBL_RR_STACK_POINTER;
    47.   A.21 = __GPY_GLOBL_RR_STACK_POINTER.37 + -4;
    48.   *A.21 = R.20;
    49.   P.22 = gpy_rr_fold_integer (1);
    50.   gpy_rr_eval_print (1, 1, P.22);
    51.   __GPY_GLOBL_RR_STACK_POINTER.38 = __GPY_GLOBL_RR_STACK_POINTER;
    52.   A.24 = __GPY_GLOBL_RR_STACK_POINTER.38 + -4;
    53.   D.210 = *A.24;
    54.   T.23 = gpy_rr_eval_attrib_reference (D.210, "print_date");
    55.   __GPY_GLOBL_RR_STACK_POINTER.39 = __GPY_GLOBL_RR_STACK_POINTER;
    56.   A.25 = __GPY_GLOBL_RR_STACK_POINTER.39 + -4;
    57.   D.212 = *A.25;
    58.   D.213 = *T.23;
    59.   R.26 = gpy_rr_fold_call (D.213, 1, D.212);
    60.   P.27 = gpy_rr_fold_integer (2);
    61.   gpy_rr_eval_print (1, 1, P.27);
    62.   P.29 = gpy_rr_fold_integer (10);
    63.   __GPY_GLOBL_RR_STACK_POINTER.40 = __GPY_GLOBL_RR_STACK_POINTER;
    64.   A.30 = __GPY_GLOBL_RR_STACK_POINTER.40 + -4;
    65.   D.215 = *A.30;
    66.   T.28 = gpy_rr_eval_attrib_reference (D.215, "day");
    67.   *T.28 = P.29;
    68.   P.31 = gpy_rr_fold_integer (3);
    69.   gpy_rr_eval_print (1, 1, P.31);
    70.   __GPY_GLOBL_RR_STACK_POINTER.41 = __GPY_GLOBL_RR_STACK_POINTER;
    71.   A.33 = __GPY_GLOBL_RR_STACK_POINTER.41 + -4;
    72.   D.217 = *A.33;
    73.   T.32 = gpy_rr_eval_attrib_reference (D.217, "print_date");
    74.   __GPY_GLOBL_RR_STACK_POINTER.42 = __GPY_GLOBL_RR_STACK_POINTER;
    75.   A.34 = __GPY_GLOBL_RR_STACK_POINTER.42 + -4;
    76.   D.219 = *A.34;
    77.   D.220 = *T.32;
    78.   R.35 = gpy_rr_fold_call (D.220, 1, D.219);
    79.   P.36 = gpy_rr_fold_integer (4);
    80.   gpy_rr_eval_print (1, 1, P.36);
    81. }
    82.  
    83.  
    84. t_date_1.py.date.__field_init__ (struct t_date_1.py.date * __object_state__)
    85. {
    86.   void <retval>;
    87.  
    88.   P.0 = gpy_rr_fold_integer (0);
    89.   __object_state__->day = P.0;
    90.   P.1 = gpy_rr_fold_integer (0);
    91.   __object_state__->month = P.1;
    92.   P.2 = gpy_rr_fold_integer (0);
    93.   __object_state__->year = P.2;
    94.   P.3 = gpy_rr_fold_integer (11);
    95.   gpy_rr_eval_print (1, 1, P.3);
    96. }
    97.  
    98.  
    99. t_date_1.py.date.print_date (struct gpy_object_t * self, struct gpy_object_t * * __arguments__)
    100. {
    101.   struct gpy_object_t * D.116;
    102.   struct gpy_object_t * D.117;
    103.   struct gpy_object_t * D.118;
    104.   void <retval>;
    105.  
    106.   P.8 = gpy_rr_fold_integer (33);
    107.   gpy_rr_eval_print (1, 1, P.8);
    108.   T.9 = gpy_rr_eval_attrib_reference (self, "day");
    109.   T.10 = gpy_rr_eval_attrib_reference (self, "month");
    110.   T.11 = gpy_rr_eval_attrib_reference (self, "year");
    111.   D.116 = *T.11;
    112.   D.117 = *T.10;
    113.   D.118 = *T.9;
    114.   gpy_rr_eval_print (1, 3, D.118, D.117, D.116);
    115.   P.12 = gpy_rr_fold_integer (34);
    116.   gpy_rr_eval_print (1, 1, P.12);
    117. }
    118.  
    119.  
    120. t_date_1.py.date.__init__ (struct gpy_object_t * self, struct gpy_object_t * * __arguments__)
    121. {
    122.   void <retval>;
    123.  
    124.   x = *__arguments__;
    125.   y = MEM[(struct gpy_object_t * *)__arguments__ + 4B];
    126.   z = MEM[(struct gpy_object_t * *)__arguments__ + 8B];
    127.   T.4 = gpy_rr_eval_attrib_reference (self, "day");
    128.   *T.4 = x;
    129.   T.5 = gpy_rr_eval_attrib_reference (self, "month");
    130.   *T.5 = y;
    131.   T.6 = gpy_rr_eval_attrib_reference (self, "year");
    132.   *T.6 = z;
    133.   P.7 = gpy_rr_fold_integer (22);
    134.   gpy_rr_eval_print (1, 1, P.7);
    135. }
    136.  

     

    Its very exciting because its taken almost 3 years of my life in my spare time and 2 Google summer of Code to get this far but finally the huge task of getting the core working its nerly there. I will need to write up alot more about this project from now because its at a state where its much easier to work with people because creating a code base from scratch which can change alot with new ideas its hard untill your sure how each part will work but when its such a big project it can be difficult unless your working together fulltime almost.

    by redbrain at April 21, 2012 06:33 PM

    April 19, 2012

    Andreas Mueller

    Learning Gabor filters with ICA and scikit-learn

    My colleague Hannes works in deep learning and started on a new feature extraction method this week.
    As with all feature extraction algorithms, it was obviously of utmost importance to be able to learn Gabor filters.

    Inspired by his work and Natural Image Statistics, a great book on the topic of feature extraction from images, I wanted to see how hard it is to learn Gabor filters with my beloved scikit-learn.

    I chose independent component analysis, since this is discusses to some depth in the book.

    Luckily mldata had some image patches that I could use for the task.

    Here goes:

    import numpy as np
    import matplotlib.pyplot as plt

    from sklearn.datasets import fetch_mldata
    from sklearn.decomposition import FastICA


    # fetch natural image patches
    image_patches = fetch_mldata("natural scenes data")
    X = image_patches.data

    # 1000 patches a 32x32
    # not so much data, reshape to 16000 patches a 8x8
    X = X.reshape(1000, 4, 8, 4, 8)
    X = np.rollaxis(X, 3, 2).reshape(-1, 8 * 8)

    # perform ICA
    ica = FastICA(n_components=49)
    ica.fit(X)
    filters = ica.unmixing_matrix_

    # plot filters
    plt.figure()
    for i, f in enumerate(filters):
    plt.subplot(7, 7, i + 1)
    plt.imshow(f.reshape(8, 8), cmap="gray")
    plt.axis("off")
    plt.show()

    And the result (takes ~40sec):
    The data is somewhat normalized already, which makes the job easier. With larger patches, it does not look as clean. I guess it is because there is not enough data. It would be possible to use PCA to get rid of "high frequencies" but at this small patch size, it did not seem to matter.

    Edit:
    As Andrej suggested, here the k-means filters on whitened data:


    For reference, the PCA whitening filters:
    which are the expected wavelets.
    You can find the updated gist here.

    by Andreas Mueller (noreply@blogger.com) at April 19, 2012 02:40 PM

    April 16, 2012

    Vytautas Jancauskas

    KDE Plots in Pandas

    I have implemented a simple KDE plot functionality in Pandas and created a pull request.

    The picture above was generated by using two streams of random numbers from Gaussian distribution with different means and deviations. The code used to produce it is presented below:

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas

    samples1 = np.random.normal(-5.0, 1.0, 500)
    samples2 = np.random.normal(5.0, 3.0, 500)
    samples = np.concatenate((samples1, samples2))
    print min(samples), max(samples)
    np.random.shuffle(samples)
    ts = pandas.Series(samples)
    ts.plot(kind="kde", label="Synthetic Data")
    plt.legend()
    plt.show()

    by Vytautas Jančauskas (noreply@blogger.com) at April 16, 2012 11:21 PM

    April 15, 2012

    Vlad Niculae

    GSoC 2012 proposal: Need for scikit-learn speed

    This summer I hope to be able to put in another full-time amount of effort into scikit-learn. After a successful Google Summer of Code project last year on dictionary learning, I now plan to do some low-level work. The title of my proposal is: “Need for scikit-learn speed” and, in a nutshell, will make the scikit go faster and will help it stay that way.

    Scikit-learn has always enforced standards of quality that kept all implementations at a non-trivial level (i.e. faster than using the generic optimizers in scipy). However, not all modules are equal: some have received more attention for speed than others (for example the SGD classes). I intend to raise the bar towards a more uniform level.

    Are you crazy, can you really do this?

    Well, of course. This might not the usual GSoC proposal, but I can show how I can do it and how it’s easily quantifiable. Actually, a very important part of the work will be to make scikit-learn’s speed easily measurable.

    As for the specific speed-ups, I have shown in the past that I can do algorithmic and memory layout optimizations in numerical code. There are parts in the scikit-learn that can benefit from such work: for example only recently Peter merged this pull request significantly improving SGDClassifier’s test time performance by switching the memory layout of the coefficients: they were laid out optimally for the training phase, not for the prediction phase.

    There are certainly more opportunities for such speed improvements in the scikit. Of course there is a lot of code that can’t reasonably be made any faster (I have a feeling that SGDClassifier is at the moment such a case, but we can’t know for sure without heavy profiling). But generally there are many speed fixes that could weigh a lot: for example, a Cython implementation of the euclidean_distances function that is able to use preallocated memory will improve the performance of raw NearestNeighbours queries as well as of the KMeans and hierarchical clustering algorithms.

    How will we be able to tell if you succeed?

    A key part of the GSoC project is setting up a CI-style benchmark platform. The point is to be able to track how the speed of certain operations evolves in time. For such purposes, Wes McKinney developed the vbench project, introduced in this blog post. The goal is for every scikit-learn module to have several such benchmarks, for differently shaped and structured data.

    Having such a benchmark suite available is the equivalent of a test suite, in terms of performance. It makes developers be extra conscious of the effect of their changes. It also makes it more fun to chase speed improvements, thanks to the positive reinforcement it gives.

    There are some static benchmarks comparing the performance of scikit-learn algorithms with other well-known libraries in the ml-benchmarks project. It would be very helpful to have such a benchmark suite that automatically keeps up-to-date.

    Side effects

    The cool thing about such a project is that it should raise the overall quality of the scikit. The refactoring will lead to an increase in test coverage, because the low-coverage modules are expected to be less optimized as well. Also, the benchmarks will lead to well-backed summaries in the documentation, such as the one recently added in the clustering section.

    Since the scikit is reaching a state where many well-known algorithms are available, the 1.0 release is slowly approaching. My Google Summer of Code project should bring the scikit significantly closer to that milestone.

    by vene at April 15, 2012 10:37 PM

    April 13, 2012

    Vlad Niculae

    Romanian people and coffee

    So I got my hands of the Google N-gram data for the Romanian language. It’s noisy as hell, has some other subtle issues too, but here’s the first thing I noticed:

    The Romanian word for coffee is cafea, and the more you crave it, the longer you pronunce the final a: I really need some cafeaaaa right now.

    Thanks to Google, here are the numbers:

    Distribution of the length of the final letter in the Romanian word for coffee.

    Post scriptum: I hope you like the theme: I installed the sane matplotlib color scheme from Huy Nguyen.

    by vene at April 13, 2012 07:20 PM

    April 11, 2012

    Josef Perkoltd

    Statsmodels and Google Summer of Code 2012

    I didn't have much time or motivation to work on my blog these last weeks, mainly because I was busy discussing Google Summer of Code and preparing a new release for statsmodels.

    So here is just an update on our Google Summer of Code candidates and their projects. This year was a successful year in attracting student proposals. We have six proposals, five of them we discussed quite extensively on our mailing list before the application.

    Of the five projects, the first two are must-haves for econometrics or statistical packages, one on System of Equations, the other on Nonlinear Least-Squares and Nonlinear Robust Models. The next two are nonparametric or semi-parametric methods, one more traditional kernel estimation, the other using Empirical Likelihood which is a relatively new approach that has become popular in recent research both in econometrics and in statistics. The fifth is on Dynamic Linear Models mainly using Kalman filter and a Bayesian approach, which would extend the depth of statsmodels in time series analysis.

    All topics would be valuable extensions to statsmodels and significantly increase our coverage of statistics and econometrics. From the discussion on the mailing list I think that all candidates are qualified to bring the projects to a successful finish.

    Estimating System of Equations

    This is a standard econometrics topic, but I only recently found that graphical models and causal models discussed in other fields have a large overlap with this. In the case of a system of simultaneous equations, we have several variables that depend on each other. The simplest case in economics is a market equilibrium, where the demanded and supplied quantities depend on the price, and the price depends on the supply and demand. The estimation methods commonly used in this area are two-stage and three-stage least-squares and limited and full information maximum likelihood estimation. The first part of the project starts with the simpler case when we have several response variables, but they don't depend on each other simultaneously, although they can depend on past values of other response variables. I'm very glad that someone is picking this one up.

    Extension of Linear to Non Linear Models

    This project has two parts, the first is extending the linear least-squares model to the non-linear case, the second part is to implement non-linear models for robust estimation. Non-linear least squares is available in scipy for example with scipy.optimize.curve_fit. However, in the statsmodels version we want to provide all the usual results statistics and statistical tests. The second part will implement two robust estimators for non-linear model, that have been shown to be most successful in recent Monte Carlo studies comparing different robust estimators for non-linear equations. Robust estimation here refers to the case when there are possibly many outliers in the data. My guess is that these will become the most used models of all proposals.

    Nonparametric Estimation

    This project extends the kernel based method in statsmodels from the univariate to the multivariate case, will provide better bandwidth selection, and then implement nonparametric function estimation. Multivariate kernel density estimation should complement scipy.stats.gaussian_kde which only works well with distributions that are approximately normal shaped or have only a single peak. Another extension is to provide kernels and estimation methods for discrete variables. These methods have been on our wishlist for a while, but only the univariate case has been included in statsmodels so far.

    Empirical Likelihood

    This is a relatively new approach in statistics and econometrics that avoids the distributional assumptions in estimation and in statistical tests. Instead of relying on a known distribution in small samples, where we often assume normal distribution, or instead of relying on the asymptotic normal distribution in large samples, this approach estimates the distribution in a nonparametric way. This is similar, to some extend, to the difference between, for example, a t-test and a rank-based Mann–Whitney U or Wilcoxon test, which are available in scipy.stats. The advantages are that in smaller samples the estimates and tests are more accurate when the distribution is not known, and in many cases, for example in finance, most tests show that the assumption of normal distribution does not hold. For this project, I still have to catch up with some readings because I'm only familiar with a small part of this literature, mainly on empirical likelihood in relation to Generalized Method of Moments (GMM).

    Dynamic Linear Models

    This covers statespace models implemented by Kalman Filter for multivariate time series models, both from a likelihood and a Bayesian perspective. The project expands the coverage of statsmodels in linear time series analysis, the first area where we get a good coverage of models. Currently, we have univariate AR and ARIMA, vector autoregressive models VAR, and structural VAR. Part of this project would be to get a good cython based implementation of Kalman filters. Wes has started a libray, statlib, for this last year, however, it is still incomplete and needs to be integrated with statsmodels. Another advantage of this project is that it increases our infrastructure and models for macro-econometrics, estimation of macroeconomic models and dynamic stochastic general equilibrium DSGE models, which is currently still Matlab dominated, as far as I can tell.

    Now we still have to see how many GSoC slots we will get, but we have the chance this year to get a large increase in the speed of development of statsmodels, and we can reduce the number of cases where someone needs to run to R, or Stata, or Matlab because there is no implementation for a statistical analysis available in Python.

    by Josef Perktold (noreply@blogger.com) at April 11, 2012 10:25 AM

    April 10, 2012

    Enthought

    Enthought at PyGotham: June 8th & 9th

    Enthought is a proud sponsor of the second annual PyGotham conference in New York City (June 8th and 9th). As part of our commitment, we are also offering a High Performance Python track that will illustrate how to build applications and utilize parallel computing techniques with various open source projects. Stayed tuned for more details [...]

    by David Kim at April 10, 2012 03:08 AM

    April 08, 2012

    Titus Brown

    Why I don't *really* practice open science

    I'm a pretty big advocate of anything open -- open source, open access, and open science, in particular. I always have been. And now that I'm a professor, I've been trying to figure out how to actually practice open science effectively

    What is open science? Well, I think of it as talking regularly about my unpublished research on the Internet, generally in my blog or on some other persistent, explicitly public forum. It should be done regularly, and it should be done with a certain amount of depth or self-reflection. (See, for example, the wunnerful Rosie Redfield and Nature's commentary on her blogging of the arsenic debacle & tests thereof.)

    Most of my cool, sexy bloggable work is in bioinformatics; I do have a wet lab, and we're starting to get some neat stuff out of that (incl. both some ascidian evo-devo and some chick transcriptomics) but that's not as mature as the computational stuff I'm doing. And, as you know if you've seen any of my recent posts on this, I'm pretty bullish about the computational work we've been doing: the de novo assembly sequence foo is, frankly, super awesome and seems to solve most of the scaling problems we face in short-read assembly. And it provides a path to solving the problems that it doesn't outright solve. (I'm talking about partitioning and digital normalization.)

    While I think we're doing awesome work, I've been uncharacteristically (for me) shy about proselytizing it prior to having papers ready. I occasionally reference it on mailing lists, or in blog posts, or on twitter, but the most I've talked about the details has been in talks -- and I've rarely posted those talks online. When I have, I don't point out the nifty awesomeness in the talks, either, which of course means it goes mostly unnoticed. This seems to be at odds with my oft-loudly stated position that open-everything is the way to go. What's going on?? That's what this blog post is about. I think it sheds some interesting light on how science is actually practiced, and why completely open science might waste a lot of people's time.

    I'd like to dedicate this blog post to Greg Wilson. He and I chat irregularly about research, and he's always seemed interested in what I'm doing but is stymied because I don't talk about it much in public venues. And he's been a bit curious about why. Which made me curious about why. Which led to this blog post, explaining why I think why. (I've had it written for a few months, but was waiting until I posted diginorm.)


    For the past two years or so, I've been unusually focused on the problem of putting together vast amounts of data -- the problem of de novo assembly of short-read sequences. This is because I work on unusual critters -- soil microbes & non-model animals -- that nobody has sequenced before, and so we can't make use of prior work. We're working in two fields primarily, metagenomics (sampling populations of wild microbes) and mRNAseq (quantitative sequencing of transcriptomes, mostly from non-model organisms).

    The problems in this area are manifold, but basically boil down to two linked issues: vast underlying diversity, and dealing with the even vaster amounts of sequence necessary to thoroughly sample this diversity. There's lots of biology motivating this, but the computational issues are, to first order, dominant: we can generate more sequence than we can assemble. This is the problem that we've basically solved.

    A rough timeline of our work is:

    • mid/late 2009: Likit, a graduate student in my lab, points out that we're getting way better gene models from assembly of chick mRNAseq than from reference-based approaches. Motivates interest in assembly.
    • mid/late 2009: our lamprey collaborators deliver vast amounts of lamprey mRNAseq to us. Reference genome sucks. Motivates interest in assembly.
    • mid/late 2009: the JGI starts delivering ridiculous amount of soil sequencing data to us (specifically, Adina). We do everything possible to avoid assembly.
    • early 2010: we realize that the least insane approach to analyzing soil sequencing data relies on assembly.
    • early 2010: Qingpeng, a graduate student, convinces me that existing software for counting k-mers (tallymer, specifically) doesn't scale to samples with 20 or 30 billion unique k-mers. (He does this by repeatedly crashing our lab servers.)
    • mid-2010: a computational cabal within the lab (Jason, Adina, Rose) figures out how to count k-mers really efficiently, using a CountMin Sketch data structure (which we reinvent, BTW, but eventually figure out isn't novel. o well). We implement this in khmer. (see k-mer filtering)
    • mid-2010: We use khmer to figure out just how much Illumina sequence sucks. (see Illumina read phenomenology)
    • mid-2010: Arend joins our computational cabal, bringing detailed and random knowledge of graph theory with him. We invent an actually novel use of Bloom filters for storing de Bruijn graphs. (blog post) The idea of partitioning large metagenomic data sets into (disconnected) components is born. (Not novel, as it turns out -- see MetaVelvet and MetaIDBA.)
    • late 2010: Adina and Rose figure out that Illumina suckage prevents us from actually getting this to work.
    • first half of 2011: Spent figuring out capacity of de Bruijn graph representation (Jason/Arend) and the right parameters to actually de-suckify large Illumina data sets (Adina). We slowly progress towards actually being able to partition large metagenomic data sets reliably. A friend browbeats me into applying the same technique to his ugly genomic data set, which magically seems to solve his assembly problems.
    • fall 2011: the idea of digital normalization is born: throwing away redundant data FTW. Early results are very promising (we throw away 95% of data, get identical assembly) but it doesn't scale assembly as well as I'd hoped.
    • October 2011: JGI talk at the metagenome informatics workshop - SLYT, where we present our ideas of partitioning and digital normalization, together, for the first time. We point out that this potentially solves all the scaling problems.
    • November 2011: We figure out the right parameters for digital normalization, turning up the awesomeness level dramatically.
    • through present: focus on actually writing this stuff up. See: de Bruijn graph preprint; digital normalization preprint.

    If you read this timeline (yeah, I know it's long, just skim) and look at the dates of "public disclosure", there's a 12-14 month gap between talking about k-mer counting (July 2010) and partitioning/etc (Oct 2011, metagenome informatics talk). And then there's another several-month gap before I really talk about digital normalization as a good solution (basically, mid/late January 2012).

    Why??

    1. I was really freakin' busy actually getting the stuff to work, not to mention teaching, traveling, and every now and then actually being at home.
    2. I was definitely worried about "theft" of ideas. Looking back, this seems a mite ridiculous, but: I'm junior faculty in a fast-moving field. Eeek! I also have a duty to my grads and postdocs to get them published, which wouldn't be helped by being "scooped".
    3. We kept on coming up with new solutions and approaches! Digital normalization didn't exist until August 2011, for example; appropriate de-suckifying of Illumina data took until April or May of 2011; and proving that it all worked was, frankly, quite tough and took until October. (More on this below.)
    4. The code wasn't ready to use, and we hadn't worked out all the right parameters, and I wasn't ready to do the support necessary to address lots of people using the software.

    All of these things meant I didn't talk about things openly on my blog. Is this me falling short of "open science" ideals??

    In my defense, on the "open science" side:

    • I gave plenty of invited talks in this period, including a few (one at JGI and one at UMD CBCB) attended by experts who certainly understood everything I was saying, probably better than me.
    • I posted some of these talks on slideshare.
    • all of our software development has been done on github, under github.com/ctb/khmer/. It's all open source, available, etc.

    ...but these are sad excuses for open science. None of these activities really disseminated my research openly. Why?

    Well, invited talks by junior faculty like me are largely attended out of curiosity and habit, rather than out of a burning desire to understand what they're doing; odds are, the faculty in question hasn't done anything particularly neat, because if they had, they'd be well known/senior, right? And who the heck goes through other people's random presentations on slideshare? So that's not really dissemination, especially when the talks are given to an in group already.

    What about the source code? The "but all my source code is available" dodge is particularly pernicious. Nobody, but nobody, looks at other people's source code in science, unless it's (a) been released, (b) been documented, and (c) claims to solve YOUR EXACT ACTUAL PROBLEM RIGHT NOW RIGHT NOW. The idea that someone is going to come along and swoop your awesome solution out of your repository seems to me to be ridiculous; you'd be lucky to be that relevant, frankly.

    So I don't think any of that is a good way to disseminate what you've done. It's necessary for science, but it's not at all sufficient.

    --

    What do I think is sufficient for dissemination? In my case, how do you build solutions and write software that actually has an impact, either on the way people think or (even better) on actual practice? And is it compatible with open science?

    1. Write effective solutions to common problems. The code doesn't have to be pretty or even work all that well, but it needs to work well enough to run and solve a common problem.
    2. Make your software available. Duh. It doesn't have to be open source, as far as I can tell; I think it should be, but plenty of people have restrictive licenses on their code and software, and it gets used.
    3. Write about it in an open setting. Blogs and mailing lists are ok; SeqAnswers is probably a good place for my field; but honestly, you've got to write it all down in a nice, coherent, well-thought out body of text. And if you're doing that? You might as well publish it. Here is where Open Access really helps, because The Google will make it possible for people to find it, read it, and then go out and find your software.

    The interesting thing about this list is that in addition to all the less-than-salutary reasons (given above) for not blogging more regularly about our stuff, I had one very good reason for not doing so.

    It's a combination of #1 and #3.

    You see, until near to the metagenome informatics meeting, I didn't know if partitioning or digital normalization really worked. We had really good indications that partitioning worked, but it was never solid enough for me to push it strongly as an actual solution to big data problems. And digital normalization made so much sense that it almost had to work, but, um, proving it was a different problem. Only in October did we do a bunch of cross-validation that basically convinced me that partitioning worked really well, and only in November did we figure out how awesome digital normalization was.

    So we thought we had solutions, but we weren't sure they were effective, and we sure didn't have it neatly wrapped in a bow for other people to use. So #1 wasn't satisfied.

    And, once we did have it working, we started to put a lot of energy into demonstrating that it worked and writing it up for publication -- #3 -- which took a few months.

    In fact, I would actually argue that before October 2011, we could have wasted people's time by pushing our solutions out for general use when we basically didn't know if they worked well. Again, we thought they did, but we didn't really know.

    This is a conundrum for open science: how do you know that someone else's work is worth your attention? Research is really hard, and it may take months or years to nail down all the details; do you really want to invest significant time or effort in someone else's research before that's done? And when they are done -- well, that's when they submit it for publication, so you might as well just read that first!

    --

    This is basically the format for open science I'm evolving. I'll blog as I see fit, I'll post code and interact with people that I know who need solutions, but I will wait until we have written a paper to really open up about what we're doing. A big part of that is trying to only push solid science, such that I don't waste other people's time, energy, and attention.

    So: I'm planning to continue to post all my senior-author papers to arXiv just before their first submission. The papers will come with open source and the full set of data necessary to recapitulate our results. And I'll blog about the papers, and the code, and the work, and try to convince people that it's nifty and awesome and solves some useful problems, or addresses cool science. But I don't see any much point in broadly discussing my stuff before a preprint is available.

    Is this open science? I don't really think so. I'd really like to talk more openly about our actual research, but for all the reasons above, it doesn't seem like a good idea. So I'll stick to trying to give presentations on our stuff at conferences, and maybe posting the presentations to slideshare when I think of it, and interacting with people privately where I can understand what problems they're running into.

    What I'm doing is more about open access than open science: people won't find out details of our work until I think it's ready for publication, but they also won't have to wait for the review process to finish. While I'm not a huge fan of the way peer review is done, I accept it as a necessary evil for getting my papers into a journal. By the time I submit a paper, I'll be prepared to argue, confidently and with actual evidence, that the approach is sound. If the reviewers disagree with me and find an actual mistake, I'll fix the paper and apologize profusely & publicly; if reviewers just want more experiments done to round out the story, I'll do 'em, but it's easy to argue that additional experiments generally don't detract from the paper unless they discover flaws (see above, re "apologize"). The main thing reviewers seem to care about is softening grandiose claims, anyway; this can be dealt with by (a) not making them and (b) sending to impact-oblivious journals like PLoS One. I see no problem with posting the paper, in any of these circumstances.

    Maybe I'm wrong; experience will tell if this is a good idea. It'll be interesting to see where I am once we get these papers out... which may take a year or two, given all the stuff we are writing up.

    I've also come to realize that most people don't have the time or (mental) energy to spare to really come to grips with other people's research. We were doing some pretty weird stuff (sketch graph representations? streaming sketch algorithms for throwing away data?), and I don't have a prior body of work in this area; most people probably wouldn't be able to guess at whether I was a quack without really reading through my code and presentations, and understanding it in depth. That takes a lot of effort. And most people don't really understand the underlying issues anyway; those who do probably care about them sufficiently to have their own research ideas and are pursuing them instead, and don't have time to understand mine. The rest just want a solution that runs and isn't obviously wrong.

    In the medium term, the best I can hope for is that preprints and blog posts will spur people to either use our software and approaches, or that -- even better -- they will come up with nifty new approaches that solve the problems in some new way that I'd never have thought of. And then I can read their work and build on their ideas. This is what we should strive for in science: the shortest round trip between solid scientific inspiration in different labs. This does not necessarily mean open notebooks.

    Overall, it's been an interesting personal journey from "blind optimism" about openness to a more, ahem, "nuanced" set of thoughts (i.e., I was wrong before :). I'd be interested to hear what other people have to say... drop me a note or make a comment.

    --titus

    p.s. I recognize that it's too early to really defend the claim that our stuff provides a broad set of solutions. That's not up to me to say, for one thing. For another, it'll take years to prove out. So I'm really talking about the hypothetical solution where it is widely useful in practice, and how that intersects with open science goals & practice.

    April 08, 2012 12:07 AM

    April 06, 2012

    Titus Brown

    Big Data Biology - why efficiency matters

    I'm going to pick on Mick Watson today. (It's OK. He's just a foil for this discussion, and I hope he doesn't take it too personally.)

    Mick made the following comment on my earlier Big Data Biology blog post:

    I do wonder whether there is just a bit too much hand wringing about "big data".

    For e.g., the rumen metagenomic data you mentioned above, I can assemble using MetaVelvet on our server in less than a day (admittedly it has 512Gb of RAM, but doesn't everyone?). I can count the 17mers in it using Jellyfish in a few hours.

    So I just set the processes running, two days later, I have my analysis. What's the problem? Does it matter that you can do it quicker?

    Big data doesn't really worry me.

    ...

    I know I am being flippant, but really to me the challenge isn't the data, it's the biology. I don't care if it takes 2 hours, 2 days or 2 weeks to process the data.

    Improve your computing efficiency by 100x, I don't care; improve your ability to extract biological information by 100x, then I'm interested :)

    He makes one very, very, very good point -- who cares if you can run an analysis (whatever it is) and it doesn't provide any value? The end goal of my sequencing analysis is to provide insight into biological processes; I might as well just delete the data (an O(1) "analysis" operation, if one with a big constant in front of it..) if the analysis isn't going to yield useful information.

    But he also seems to think that speed and efficiency of analyses doesn't matter for science. And I don't just think he's dead wrong, I know he's dead wrong.

    This is both an academic point and a practical point. And, in fact, an algorithmic point.

    The academic reason why efficient computation is good for science

    The academic point is simple: our ability to do thorough exploratory analysis of a large sequencing data set is limited by at least four things. These four things are:

    1. Our ability to do initial processing on the data - error trimming and correction, and data summary (mapping and assembly, basically).

    2. The information available for cross-reference. Most (99.9%) of our bioinformatic analyses rely on homology (for inference of function) and annotation.

      (This is why Open Access of data is so freakin' important to us bioinformaticians. If you hide your database from us, it might as well not exist for all we care.)

    3. Statistics. We do a lot of sensitive signal analysis and multiple testing, and we are really quite bad at computing FDRs and other statistical properties. Each statistical advance is greeted with joy.

    4. The ability to complete computations on (1), (2), and (3).

    Every 100gb data set takes a day to process. Mapping and assembly can take hours to days to weeks. Each database search costs time and effort (in part because the annotations are all in different formats). Each MCMC simulation or background calculation takes significant time, even if it's automated.

    Inefficient computation thus translates to an economic penalty on science (in time, effort, and attention span). This, in turn, leads directly to science that is not as good as it could be (as do poor computational science skills, badly written software, inflexible workflows, opaque pipelines, and too quick a rush to hypotheses -- hey, look, a central theme to my blog posts!)

    Anecdote: someone recently e-mailed us to tell us about how they could assemble a comparable soil data set to ours in a mere week and 3 TB of memory. Our internal estimates suggest that for full sensitivity, we need to do 5-10 assemblies of that data set (each with different parameters) followed by a similarly expensive post-assembly merging -- so, minimally, 6 weeks of compute requiring 3 TB of memory, full-time, on as many cores as possible. You've gotta imagine that there's going to be a lot of internal pressure to get results in less time (surely we can get away with only 1 assembly?) with less parameter searching (what, you think we can tell you which parameters are going to work?) and this pressure is going to translate to doing less in the way of data set exploration. (Never mind the actual economics -- since this data set would take about 1 week of sequencer time, and $10,000 or so, to generate today, I think they don't make sense either.)

    I can point you to at least three big metagenome Illumina assembly papers where I know these computational limitations truncated their exploration of the data set. (Wait, you say there are only three? Well, I'm not going to tell you which three they are.)

    The practical reason why efficient computation is good for science

    This one's a bit more obvious, but, interestingly, Mick also treads all over it. He says "...I can assemble using MetaVelvet on our server in less than a day (admittedly it has 512 Gb of RAM, but doesn't everyone?"

    Well, no, they don't.

    We didn't have access to such a big server until recently. We had plenty of offers for occasional access, but when we explained that we needed them for a few weeks of dedicated compute (for parameter exploration -- see above) and also that no, we weren't willing to sign copyright or license for our software over to a national lab for that access, somewhat oddly a lot of the offers came to naught.

    It turns out most people don't have access to such bigmem computers, or even big compute clusters; and when they do, those computers and clusters aren't configured for biologists to use.

    Democratization of sequencing should mean democratization of analysis, too. Every year our next-gen sequence analysis course gets tons of applicants from small colleges and universities where the compute infrastructure is small and what does exist is overwhelmed by Monte Carlo calculations. Our course explicitly teaches them to use Amazon to do their compute -- with that, they can take that knowledge home, and spend small amounts of money to buy IaaS, or apply for an AWS education grant to do their analysis. We feel for them because we were in their situation until recently.

    Expensive compute translates to a penalty on the very ability of many scientists and teachers to access computational science. (Insert snide comment on similar limitations in practical access to US education, health care, and justice).

    The algorithmic reason why efficient computation is good for science

    Assemblers kinda suck. Everyone knows it, and recent contests & papers have done a pretty good job of highlighting the limitations (see GAGE and Assemblathon). This is not because the field is full of stupid people, but rather because assembly is a really, really hard problem (see Nagarajan & Pop) -- so hard that really smart people have worked for decades on it. (In many ways, the fact that it works at all is a tribute to their brilliance.)

    Advances in assembly algorithms have led to our current crop of assemblers, but assemblers are still relatively slow and relatively memory consumptive. Our diginorm paper benchmarks Trinity as requiring 38 hours in 42gb of RAM for 100m mouse mRNAseq reads; genome and metagenome assemblers require similar size resources, although the variance depends on the sample, of course. SGA and Cortex seem unreasonably memory efficient to me :), but I understand that they perform less well on things other than single genomes (like, say, metagenomic data) -- in part because the underlying data structures are targeted at specific features of their data.

    What's the plan for the future, in which we will be applying next-gen sequencing to non-model organisms, evolutionary experiments, and entire populations of novel critters? These sequencing data sets will have different features from the ones we are used to tackling with current tech -- including higher heterozygosity and strong GC-rich biases.

    I personally think the next big advances in assembly will come through the systematic application of sample- or sub-sample specific, compute-expensive algorithms like EMIRGE to our data sets. While perfect assembly may be a pipe dream, significant and useful incremental advances seem very achievable, especially if the practical cost of current assembly algorithms drops.

    Not so parenthetically, this is one of the reasons I'm so excited about digital normalization (the general concept, not only our implementation) --

    I bet more algorithmically expensive solutions would be investigated, implemented, and applied if memory and time requirements dropped, don't you?

    Or if the data could be made less error-prone and simpler?

    Or if the volume of data could be reduced without losing much information?

    I will take one side of that bet...

    ---

    Of course, I'm more than a wee bit biased on this whole topic. A big focus of my group has been in spending the last three years fighting the trend of "just use a bigger computer and it will all be OK". Diginorm and partitioning are two of the results, and a few more will be emerging soon. I happen to think it's incredibly important; I would have done something else with my time, energy, and money if not. Hopefully you can agree that it's important, even if you're interested in other things.

    So: yes, computational efficiency is not the only thing. And it's a surprisingly convenient moving target; frequently, you yourself can just wait a few months or buy a bigger computer, and achieve similar results. But sometimes that attitude masks the fact that efficient computation can bring better, cheaper, and broader science. We need to pay attention to that, too.

    And, Mick? I don't think I can improve your ability to extract biological information by 100x. On metagenomes, would 2-10x be a good enough start?

    --titus

    April 06, 2012 01:36 PM

    What is digital normalization, anyway?

    I'm out at a Cloud Computing for the Human Microbiome Workshop and I've been trying to convince people of the importance of digital normalization. When I posted the paper the reaction was reasonably positive, but I haven't had much luck explaining why it's so awesome.

    At the workshop, people were still confused. So I tried something new.

    I first made a simulated metagenome by taking three genomes worth of data from the Chitsaz et al. (2011) paper (see http://bix.ucsd.edu/projects/singlecell/) and shuffling them together. I combined the sequences in a ratio of 10:25:50 for the E. coli sequences, the Staph sequences, and the SAR sequences, respectively; the latter two were single-cell MDA genomic DNA. I took the first 10m reads of this mix and then estimated the coverage.

    You can see the coverage of these genomic data sets estimated by using the known reference sequences in the first figure. E. coli looks nice and Gaussian; Staph is smeared from here to heck; and much of the SAR sequence is low coverage. This reflects the realities of single cell sequencing: you get really weird copy number biases out of multiple displacement amplification.

    Then I applied three-pass digital normalization (see the paper) and plotted the new abundances. As a reminder, this operates without knowing the reference in advance; we're just using the known reference here to check the effects.

    http://ivory.idyll.org/permanent/raw-coverage.png

    Coverage of genome read mix, calculated by mapping the mixed reads onto the known reference genomes.

    http://ivory.idyll.org/permanent/norm-coverage.png

    Coverage post-digital-normalization, again calculated by mapping the mixed reads onto the known reference genomes.

    As you can see, digital normalization literally "normalizes" the data to the best of its ability. That is, it cannot create higher coverage where high coverage doesn't exist (for the SAR), but it can convert the existing high coverage into nice, Gaussian distributions centered around a much lower number. You also discard quite a bit of data (look at the X axes -- about 85% of the reads were discarded in downsampling the coverage like this).

    When you assemble this, you get as good or better results than assembling the unnormalized data, despite having discarded so much data. This is because no low-coverage data is discarded, so you still retain as much overall covered bases -- just in fewer reads. To boot, it works pretty generically for single genomes, MDA genomes, transcriptomes, and metagenomes.

    And, as a reminder? Digital normalization does this in fixed, low memory; in a single pass; and without any reference sequence needed.

    Pretty neat.

    --titus

    April 06, 2012 11:17 AM

    April 02, 2012

    Titus Brown

    Our approach to replication in computational science

    I'm pretty proud of our most recently posted paper, which is on a sequence analysis concept we call digital normalization. I think the paper is pretty kick-ass, but so is the way in which we're approaching replication. This blog post is about the latter.

    (Quick note re "replication" vs "reproduction": The distinction between replication and reproducibility is, from what I understand, that "replicable" means "other people get exactly the same results when doing exactly the same thing", while "reproducible" means "something similar happens in other people's hands". The latter is far stronger, in general, because it indicates that your results are not merely some quirk of your setup and may actually be right.)

    So what did we do to make this paper extra super replicable?

    If you go to the paper Web site, you'll find:

    • a link to the paper itself, in preprint form, stored at the arXiv site;
    • a tutorial for running the software on a Linux machine hosted in the Amazon cloud;
    • a git repository for the software itself (hosted on github);
    • a git repository for the LaTeX paper and analysis scripts (also hosted on github), including an ipython notebook for generating the figures (more about that in my next blog post);
    • instructions on how to start up an EC2 cloud instance, install the software and paper pipeline, and build most of the analyses and all of the figures from scratch;
    • the data necessary to run the pipeline;
    • some of the output data discussed in the paper.

    (Whew, it makes me a little tired just to type all that...)

    What this means is that you can regenerate substantial amounts (but not all) of the data and analyses underlying the paper from scratch, all on your own, on a machine that you can rent for something like 50 cents an hour. (It'll cost you about $4 -- 8 hours of CPU -- to re-run everything, plus some incidental costs for things like downloads.)

    Not only can you do this, but if you try it, it will actually work. I've done my best to make sure the darn thing works, and this is the actual pipeline we ourselves ran to produce the figures in the paper. All the data is there, and all of the code used to process the data, analyze the results, and produce the figures is also there. In version control.

    When you combine that with the ability to run this on a specific EC2 instance -- a combination of a frozen virtual machine installation and a specific set of hardware -- I feel pretty confident that at least this component of our paper is something that can be replicated.

    A few thoughts on replicability, and effort

    Why did I go to all this trouble??

    Wasn't it a lot of work?

    Well, interestingly enough, it wasn't that much work. I already use version control for everything, including paper text; posting it all to github was a matter of about three commands.

    Writing the code, analysis scripts, and paper was an immense amount of work. But I had to do that anyway.

    The most extra effort I put in was making sure that the big data files were available. I didn't want to add the the 2gb E. coli resequencing data set to git, for example. So I ended up tarballing those files sticking them on S3.

    The Makefile and analysis scripts are ugly, but suffice to remake everything from scratch; they were already needed to make the paper, so in order to post them all I had to do was put in a teensy bit of effort to remove some unintentional dependencies.

    The ipython notebook used to generate the figures (again -- next blog post) was probably the most effort, because I had to learn how to use it, which took about 20 minutes. But it was one of the smoothest transitions into using a new tool I've ever experienced in my ~25 years of coding.

    Overall, it wasn't that much extra effort on my part.

    Why bother in the first place??

    The first and shortest answer is, because I could, and because I believe in replication and reproducibility, and wanted to see how tough it was to actually do something like this. (It's a good deal above and beyond what most bioinformaticians do.)

    Perhaps the strongest reason is that our group has been bitten a lot in recent months by irreplicable results. I won't name names, but several Science and PNAS and PLoS One papers of interest to us turned out to be basically impossible for us to replicate. And, since we are engaged in developing new computational methods that must be compared to previous work, an inability to regenerate exactly the results in those other papers meant we had to work harder than we should have, simply to reproduce what they'd done.

    A number of these problems came from people discarding large data sets after publishing, under the mistaken belief that their submission to the Short Read Archive could be used to regenerate their results. (Often SRA submissions are unfiltered, and no one keeps the filtering parameters around...right?) In some cases, I got the right data sets from the authors and could replicate (kudos to Brian Haas of Trinity for this!), but in most cases, ixnay on the eplicationre.

    Then there were the cases where authors clearly were simply being bad computational scientists. My favorite example is a very high profile paper (coauthored by someone I admire greatly), in which the script they sent to us -- a script necessary for the initial analyses -- had a syntax error in it. In that case, we were fairly sure that the authors weren't sending us the script they'd actually used... (It was Perl, so admittedly it's hard to tell a syntax error from legitimate code, but even the Perl interpreter was choking on this.)

    (A few replication problems came from people using closed or unpublished software, or being hand-wavy about the parameters they used, or using version X of some Web-hosted pipeline for which only version Y was now available. Clearly these are long-term issues that need to be discussed with respect to replication in comp. bio., but that's another topic.)

    Thus, my group has wasted a lot of time replicating other people's work. I wanted to avoid making other people go through that.

    A third reason is that I really, really, really want to make it easy for people to pick up this tool and use it. Digital normalization is super ultra awesome and I want as little as possible to stand in the way of others using it. So there's a strong element of self-interest in doing things this way, and I hope it makes diginorm more useful. (I know about a dozen people that have already tried it out in the week or so since I made the paper available, which is pretty cool. But citations will tell.)

    What use is replication?

    Way back when, Jim Graham politely schooled me in the true meaning of reproducibility, as opposed to replication. He was about 2/3 right, but then he went a bit too far and said

    But let's drop the idea that I'm going to take your data and your code and "reproduce" your result. I'm not. First, I've got my own work to do. More importantly, the odds are that nobody will be any wiser when I'm done."

    Well, let's take a look at that concern, shall we?

    With the benefit of about two years of further practice, I can tell you this is a dangerously wrong way to think, at least in the field of bioinformatics. My objections hinge on a few points:

    First, based on our experiences so far, I'd be surprised if the authors themselves could replicate their own computational results -- too many files and parameters are missing. We call that "bad science".

    Second, odds are, the senior professor has little or no detailed understanding of what bioinformatic steps were taken in processing the data, and moreover is uninterested in the details; that's why they're not in the Methods. Why is that a problem? Because the odds are quite good that many biological analyses hinge critically on such points. So the peer reviewers and the community at large need to be able to evaluate them (see this RNA editing kerfuffle for an excellent example of reviewer fail). Yet most bioinformatic pipelines are so terribly described that even with some WAG I can't figure out what, roughly speaking, is going on. I certainly couldn't replicate it, and generating specific critiques is quite difficult in that kind of circumstance.

    Parenthetically, Graham does refer to the climate sciences struggles with reproducibility and replication. If only they put the same effort into replication and data archiving they did into arguing with climate change deniers...

    Third, Graham may be guilty of physics chauvinism (just like I'm almost certainly guilty of bioinformatics chauvinism...) Physics and biology are quite different: in physics, you often have a theoretical framework to go by, and results should at least roughly adhere to that or else they are considered guilty until proven innocent. In biology, we usually have no good idea of what we're expecting to see, and often we're looking at a system for the very first time. In that environment, I think it's important to make the underlying computation WAY more solid than you would demand in physics (see RNA editing above).

    As Narayan Desai pointed out to me (following which I then put it in my PyCon talk (slide 5)), physics and biology are quite different in the way data is generated and analyzed. There's fewer sources of data generation in physics, there's more of a computational culture, and there's more theory. Having worked with physicists for much of my scientific life (and having published a number of papers with physicists) I can tell you that replication is certainly a big problem over there, but the consequences don't seem as big -- eventually the differences between theory and computation will be worked out, because they're far more noticeable when you have theory, like in physics. Not so in biology.

    Fourth, a renewed emphasis on computational methods (and therefore on replicability of computational results) is a natural part of the transition to Big Data biology. The quality of analysis methods matters A LOT when you are dealing with massive data sets with weak signals and many systematic biases. (I'll write about this more later.)

    Fifth, and probably most significant from a practical perspective, Graham misses the point of reuse. In bioinformatics, it behooves us to reuse proven (aka published) tools -- at least we know they worked for someone, at least once, which is not usually the case for newly written software. I don't pretend that it's the responsibility of people to write awesome reusable tools for every paper, but sure as heck I should expect to be able to run them on some combination of hardware and software. Often that's not the case, which means I get to reinvent the wheel (yay...) even when I'm doing the same stupid thing the last five pubs did.

    For our paper, khmer and screed should be quite reusable. The analysis pipeline for the paper? It's not that great. But at least you can run it, and potentially steal code from it, too.

    When I was talking to a colleague about the diginorm paper, he said something jokingly: "wow, you're making it way too easy for people!" -- presumably he meant it would be way to easy for people to criticize or otherwise complain about the specific way we're doing things. Then, a day or two later he said, "hmm, but now that I think of it, no one ever uses the software we publish, and you seem to have had better luck with that..." -- recognizing that if you are barely able to run your own software, perhaps others might find it even more difficult.

    Heck, the diginorm paper itself would have been far harder to write without the data sets from the Trinity paper and the Velvet-SC paper. Having those nice, fresh, well-analyzed data sets already at hand was fantastic. Being able to run Trinity and reproduce their results was wonderful.

    There's a saying in software engineering: "one of the main people you should be programming for is yourself, in 6 months." That's also true in science -- I'm sure I won't remember the finer details of the diginorm paper analysis in 2 years -- but I can always go look into version control. More importantly, new graduate students can go look and really see what's going on. (And I can use it for teaching, too.) And so can other people working with me. So there's a lot of utility in simply nailing everything down and making it runnable.

    Replication is by no means sufficient for good science. But I'll be more impressed by the argument that "replication isn't all that important" when I see lack of replication as the exception rather than the rule. Replication is essential, and good, and useful. I long for the day when it's not interesting, because it's so standard. In the meantime I would argue that it certainly doesn't do any harm to emphasize it.

    (Note that I really appreciate Jim Graham's commentary, as I think he is at worst usefully wrong on these points, and substantially correct in many ways. I'm just picking on him because he wrote it all down in one place for me to link to, and chose to use the word 'sic' when reproducing my spelling mistake. Low blow ;)

    The future

    I don't pretend to have all, or even many, of the answers; I just like to think about what form they might take.

    I don't want to argue that this approach is a panacea or a high-quality template for others to use, inside or out of bioinformatics. For one thing, I haven't automated some of the analyses in the paper; it's just too much work for too little benefit at this point. (Trust me, they're easy to reproduce... :). For another, our paper used a fairly small amount of data overall; only a few dozen gigabytes all told. This makes it easy to post the data for others to use later on. Several of our next few papers will involve over a half terabyte of raw data, plus several hundred gb of ancillary and intermediate results; no idea what we'll do for them.

    Diginorm is also a somewhat strange bioinformatics paper. We just analyzed other people's data sets (an approach which for some reason isn't in favor in high impact bioinformatics, probably because high impact journal subs are primarily reviewed by biologists who want to see cool new data that we don't understand, not boring old data that we don't understand). There's no way we can or should argue that biological replicates done in a different lab should replicate the results; that's where reproducibility becomes important.

    But I would like it if people considered this approach (or some other approach) to making their analyses replicable. I don't mind people rejecting good approaches because they don't fit; to each their own. But this kind of limited enabling of replication isn't that difficult, frankly, and even if it were, it has plenty of upsides. It's definitely not irrelevant to the practice of science -- I would challenge anyone to try to make that claim in good faith.

    --titus

    p.s. I think I have to refer to this cancer results not reproducible paper somewhere. Done.

    April 02, 2012 02:29 PM

    March 31, 2012

    PythonXY

    Release Candidate Ready - 2.7.2.2-rc1

    Hi All,

    The first (and hopefully only) release candidate of 2.7.7.2 is ready for testing. Please use the project issue list for reporting issues - make sure you fill out the form!

    Python(x,y) 2.7.2.2rc1 can be downloaded at Mirror1 - NTUA

    Changes:

    Fixed:

    • Issue 393 : Mayavi2 does not start
    • Backwards compatibility restored.
    • Fixed explorer context menu console start-up entries.
    • Fixed many issues in the SciTE API generation script.

    Updated:

    • Pip 1.1.0
    • Spyder 2.1.8.1
    • scikits.image 0.5.0
    • SciPy 0.10.1
    • simplejson 2.3.3
    • pandas 0.7.1
    • wxPython 2.8.12.1
    • ETS 4.1.0.2
    • Sphinx 1.1.3
    • PyQt 4.8.6.3
    • gnuplot 1.8.0.3
    • guidata 1.4.2.3
    • guiqwt 2.1.6.3
    • IPython 0.10.2.5
    • Veusz 1.14.3
    • vitables 2.1.0.3
    • winpdb 1.4.8.3
    • xy 1.2.14.2
    • xydoc 1.0.4.2
    • Console 2.0.148.5
    • SciTE 3.0.3.2
    • WinMerge 2.12.4.2


    -Gabi Davar 

    by Grizzly Nyo (noreply@blogger.com) at March 31, 2012 02:05 PM

    March 27, 2012

    Matthieu Brucher

    Cover tree for nearest-neighbors

    I’ve looked on github for a good C++ implementation of Cover Trees for nearest-neighbors search, but I didn’t find one. I may have overlooked some repositories, but in the end, implementing it myself wasn’t that difficult.

    Implementation

    Contrary to kdtrees, the cover tree nodes always represent one point. So you have as many nodes as you have data points. Each of them can have several children, sorted by “level”, or distance hierarchy. Each level halves the distance between the node and its children. When constructing the tree instance, the distance you use is given as parameters (and its type is a template parameter, as well as the point type and the data type used for comparisons).

    Construction of the tree is done by searching in the tree for the closest node and the lowest hierarchy possible. In all time, the min and max levels are tracked in the tree. On insertion, the max level is tried, and if it fails, the max level is increased by one. The failure occurs if the new point is furthest of the root point that the maximum distance on the max level. At each level, the set of the nodes that are closer to the new point than the current maximum distance are kept (function populate_set_from_node). If the least distance is greater than the next level maximum level (function find_min_dist, the recursion stops and the new node is inserted in the first node in the set.

    When doing a search, a list of distances and points (of type NearestNodesStructure) is created and its size is kept at the number of neighbors k. On entrance of a new level (level_traversal), all nodes that have children that could be close enough (current distance less than maximum level distance + distance to the current k-th neighbors found with find_k_dist) are added to the list. In order to speed search, a map is not used, but a partial sort is done at each iteration (which makes find_k_dist work without sorting the container again). Without this, the performance is worse than a linear knn search.

    I didn’t implement other functions like removal or traversal. I don’t need them, but feel free to add them if you need. Sidenote: I will try to find time to refactor the code, because it is not very clean at the moment (especially the knn method).

    Results

    So some figures. I’ve populated a tree with 100k elements, and then made a 10-neighbors search with the cover tree, and with a standard linear search. Without the construction time, the search is about ten times faster than a linear one. Of course, it all depends on the cover tree balance. You could design a cover tree that has too many levels (but also the data would be unusual) and that would result in an almost linear time.

    I’ve done a profile with callgrind, but the majority of the time is in the tree construction. No point in checking the search and displaying it here as the following timings show:

    Build time 00:09:38.530397
    Out time (linear) 00:00:00.238989
    Out time (cover_tree) 00:00:00.012518
    

    What must be optimized is clearly the build time.

    The test code is in the main.cpp file on github.

    Cover Tree code on Github is here and the publication with detailed algorithms here.


    by Matt at March 27, 2012 08:30 AM

    March 23, 2012

    EuroSciPy

    EuroSciPy 2012 - Tutorials program

    The EuroSciPy 2012 team has put the tutorials program online.

    The tutorials last for two days and are a unique training opportunity for beginner or trained scientists. The tutorials range from Array manipulations with NumPy to Parallel Computing or Statistics with Pandas.

    Consider coming for the tutorials and staying for the scientific track or the opposite!

    All information is found on the conference webpage.

    Reminder The call for abstracts is open.

    by Pierre de Buyl at March 23, 2012 04:06 PM

    Enthought

    Pycon Startup Row and Cloak!

    It’s been brought to our attention that Cloak, one of the start-ups featured on Startup Row at Pycon, was recently featured on Lifehacker. I imagine the press brought them some welcome attention. Well, as it happens, we interviewed the founders of Cloak at Pycon (along with some other folks). Unfortunately, the focus is a bit [...]

    by David Kim at March 23, 2012 03:31 AM

    March 22, 2012

    Titus Brown

    Digital normalization of short-read shotgun data

    We just posted a pre-submission paper to arXiv.org:

    A single pass approach to reducing sampling variation, removing errors, and scaling de novo assembly of shotgun sequences

    Authors: C. Titus Brown, Adina Howe, Qingpeng Zhang, Alexis B. Pyrkosz, and Timothy H. Brom

    arXiv link

    Paper Web site, with source code and tutorials

    Abstract:

    Deep shotgun sequencing and analysis of genomes, transcriptomes, amplified single-cell genomes, and metagenomes enable the sensitive investigation of a wide range of biological phenomena. However, it is increasingly difficult to deal with the volume of data emerging from deep short-read sequencers, in part because of random and systematic sampling variation as well as a high sequencing error rate. These challenges have led to the development of entire new classes of short-read mapping tools, as well as new de novo assemblers. Even newer assembly strategies for dealing with transcriptomes, single-cell genomes, and metagenomes have also emerged. Despite these advances, algorithms and compute capacity continue to be challenged by the continued improvements in sequencing technology throughput. We here describe an approach we term digital normalization, a single-pass computational algorithm that discards redundant data and both sampling variation and the number of errors present in deep sequencing data sets. Digital normalization substantially reduces the size of data sets and accordingly decreases the memory and time requirements for de novo sequence assembly, all without significantly impacting content of the generated contigs. In doing so, it converts high random coverage to low systematic coverage. Digital normalization is an effective and efficient approach to normalizing coverage, removing errors, and reducing data set size for shotgun sequencing data sets. It is particularly useful for reducing the compute requirements for de novo sequence assembly. We demonstrate this for the assembly of microbial genomes, amplified single-cell genomic data, and transcriptomic data. The software is freely available for use and modification.

    ---

    I'll blog more about this stuff over the next few days, but, briefly, this paper presents a single pass, fixed-memory approach to downsampling sequencing data (yeah, the stuff I've been talking about for a while now). This approach is called "digital normalization", or "diginorm" for sure. It eliminates lots of data, evens out coverage, and removes errors from shotgun sequencing data sets. The net effect is an often massive amount of data reduction combined with significant scaling of de novo assembly.

    Or, to put it another way, it's a streaming lossy compression algorithm that primarily "loses" errors in sequencing data.

    We've implemented it as a preprocessing filter that should work on any data set you want to assemble, with potentially any assembler. It's written in C++, with a Python wrapper, as part of khmer. And, of course, it's freely available for use, re-use, modification, and redistribution under the BSD license, 'cause why the heck not?

    If you want to try it out, we've linked to some tutorials for running microbial genome assemblies with Velvet, as well as eukaryotic transcriptome assemblies with Oases and Trinity, on the paper Web site

    We'll submit this paper to PNAS on Friday; I'm still waiting for some final proofreading.

    Together with our other paper, which is currently under revision for resubmission to PNAS, these two papers form the theoretical basis for our attack on soil metagenome assembly. (Our plan is sheer elegance in its simplicity!)

    --titus

    March 22, 2012 01:38 AM

    March 20, 2012

    Enthought

    Yet Another Pycon Recap

    It’s been a while since the Enthought blog was updated with something other than the odd EPD release announcement. Pycon would have been a good time to recommit ourselves to blogging, but of course technical difficulties with the blog threw a wrench into that plan. Pycon recaps are now all over the internet and it [...]

    by David Kim at March 20, 2012 11:05 PM

    March 16, 2012

    Official SymPy blog

    Google Summer of Code 2012

    SymPy was accepted by Google once again to participate in Google Summer of Code for 2012. Please go to http://www.google-melange.com/gsoc/org/google/gsoc2012/sympy for more information about how to apply and get started.

    by Aaron Meurer (noreply@blogger.com) at March 16, 2012 02:55 PM

    March 10, 2012

    Andreas Mueller

    Matplotlib Errorbars Weirdness

    Some time ago, I was having trouble with using error bars with a legend in matplotlib. I did some hack to fix it that time.
    Today, I came across the same issue again at the scikit-learn sprint.
    Obviously this time we wanted to do it right^TM.

    The problem is that when doing a plot with error bars, the colors in the legend don't correspond to the colors of the lines.

    The problem seems to be solved in never versions, though.

    After fiddling a bit with it I found the problem:
    The errorbars command actually returns a list of artists, corresponding to the lines AND the individual error bars. That seems to really confuse the legend.

    You can solve the problem by doing

    errorbar  = pl.errorbars(x, y, err)
    line = errorbar[0]
    legend(line, 'the line')
    taking the zeroth element of the tuple extracts just the line from the collection of
    artists and everything works out :)

    by Andreas Mueller (noreply@blogger.com) at March 10, 2012 07:08 PM

    Generating Data for benchmarking clustering algorithms

    As I have been working on some clustering algorithms recently, I invested some time last weekend to refactor some code inside sklearn to generate some toy data sets to visualize the results of clustering algorithms.
    That looks something like this:

    While the first two are nice to show off that your algorithm can handle non-convex clusters, these data sets obviously look nothing like the data you'll see in practice.

    So I wanted to have some a bit more general data set generator.
    What I ended up doing is a nonparametric mixture of Gaussians.
    While Gaussians are a bit boring, combining them with a non-parametric prior makes them somewhat more general.

    As I didn't found some very easy to use package to do that (though David pointed out pymc) I went ahead and wrote the generative model down myself.

    It's a mixture of Gaussians with a Chinese restaurant process as prior for the mixture components and Wishard-Gaussian priors for mean and variance.
    You can find the code here.
    With this class, you can generate a dataset by:

    dpgmm = DPGMMSampler(alpha=10., deg=10, sigma=3, n_features=2)
    X = dpgmm.sample(n_samples=100)
    Where alpha is the parameter of the Chinese restaurant process, deg is the degrees of freedom of the (assumed diagonal) Wishart prior and sigma is the (diagonal) standard-deviation of the Gaussian prior over means. Here are some examples of what X might look like given the above parameters.
    Here, the means of the Gaussians are marked with red diamonds, while the entries of X are blue dots.
    The code was pretty straight-forward (although it's not as fast as it could be I guess), except for drawing from the Wishart distribution. That was a bit annoying.
    I got the idea how to do it from this. It would be great if scipy could integrate something similar in the future.

    Btw, I have not really proved correctness of my code, as the main point was to generate some nice samples. If you need to know that the model is correct, you might want to check the above reference and the code ;)

    by Andreas Mueller (noreply@blogger.com) at March 10, 2012 02:49 PM

    William Stein

    Sage and Python 3?

    Have you ever wondered how difficult it might be to switch Sage from Python 2.7 to Python 3.x?  Some students in my Sage course made a webpage that summarizes the Python 3 support status of the Python packages on which Sage depends.

    Interesting examples:
    • Matplotlib doesn't support Python 3.x at all yet.
    • Mercurial doesn't support Python 3.x, and they don't have any plans to do so. (Which is another point in favor switching Sage to GIT.)

    by William Stein (noreply@blogger.com) at March 10, 2012 07:57 AM

    March 08, 2012

    Gaël Varoquaux

    Want features? Just code

    Somebody just sent an email on a user’s mailing list for an open-source scientific package entitled Feature foo: why is package bar not up to the task?” (names hidden to avoid pointing directly to the responsible of my wrath). To quote him:

    Is there ANY plan for having such a module in package bar?? I think (personally) that this is a MUST DO. This is typically the type of routines that I hear people use in e.g., idl etc. If this could be an optimised, fast (and easy to use) routine, all the better.

    As some one who spends a fair amount of time working on open source software I hear such remarks quite often. I am finding it harder and harder not to react negatively to these emails. Now I cannot consider myself as a contributor to package bar, and thus I can claim that I am not taking your comment personally.

    Why aren’t package not up to the task? Will, the answer is quite simple: because they are developed by volunteers that do it on their spare time, late at night too often, or companies that put some of their benefits in open source rather in locking down a market. 90% of the time the reason the feature isn’t as good as you would want it is because of lack of time.

    I personally find that suggesting that somebody else should put more of the time and money they are already giving away in improving a feature that you need is almost insulting.

    I am aware that people do not realize how small the group of people that develop and maintain their toys is. Borrowing the figure below from Fernando Perez’s talk at Euroscipy, the number of people that do 90% of the grunt work to get the core scientific Python ecosystem going is around two handfuls:

    Commits per contributor in various scientific Python packages, from Fernando Perez

    I’d like to think that this recruitment problem is a lack of skill set: users that have the ability to contribute are just too rare. This is not entirely true, there are scores of skilled people on the mailing lists. The poster himself mentioned his email that he was developing a package. I personally started contribution not knowing anything about software development. I struggled, I did the grunt work like maintaining wikis, answer questions on mailing list, and writing documentation. These easier tasks were useful to the community, I think, but must importantly, they taught me a lot because I was investing energy in them.

    If people want things to improve, they will have more successes sending in pull requests than messages on mailing list that sound condescending to my ears.
    I hope that I haven’t overreacted too badly :), that email turned me on. That said, I am not sure that people realize how much they owe to the open source developers breaking their backs on the packages they use.
    All credit for images goes to Fernando Perez

    by gael at March 08, 2012 09:46 PM

    William Stein

    How to get access to Magma

    Despite Sage doing a lot, people still write to me asking about how to get access to Magma. Here is my advice:
    1. You can buy a copy of Magma here for only $1180.
    2. If your calculation will take less than 20 seconds (?), you can do it over the web for free. (Historical note: I wrote the first version of that website.)
    3. In Sage, use the command magma_free('some string'), though I just checked and the Magma calculator has changed in a way that breaks this command again. I'll be posting a patch to fix this in Sage.
    4. Most people I know who have Magma do not buy it, but get it in exchange for contributing to Magma (I used to get it free for that reason). Study the list of
      past and current contributors to Magma and ask the one you know the best. This is a list of people who definitely have a copy of Magma and know how to use it.

    by William Stein (noreply@blogger.com) at March 08, 2012 02:57 PM

    March 07, 2012

    Titus Brown

    Top 12 reasons you know you are a Big Data biologist

    A few people have recently asked me what this "Big Data" thing is in biology. It's not an easy question to answer, though, because biology's a bit peculiar, and a lot of Big Data researchers are not working in bio. While I was thinking about this I kept on coming up with anecdotes -- and, well, that turned into this: the Top 12 Reasons You Know You Are a Big Data Biologist.

    ---

    1. You no longer get enthused by the prospect of more data.

    I was at a conference a few months back, and an Brit colleague of mine rushed up to me and said, "Hey! We just got an Illumina HiSeq! Do you have anything you want sequenced?" My immediate, visceral response was "Hell no! We can't even deal with a 10th of the sequence we've already got! PLEASE don't send me any more! Can I PAY you to not send me any more?"

    1. Analysis is the bottleneck.

    I'm dangerously close to genomics bingo here, but:

    I was chatting with a colleague in the hallway here at MSU, pitching some ideas about microbiome work, and he said, "What a good idea! I have some samples here that I'd like to sequence that could help with that." I responded, "How about we stop producing data and think about the analysis?" He seemed only mildly offended -- I have a rep, you see -- but biology, as a field, is wholly unprepared to go from data to an analyses.

    My lab has finally turned the corner from "hey, let's generate more data!" to "cool, data that we can analyze and build hypotheses from!" -- yeah, we're probably late bloomers, but after about 3 years of tech development, there's very little we can't do at a basic sequence level. Analysis at that level is no longer the bottleneck. Next step? Trying to do some actual biology! But we are working in a few very specific areas, and I think the whole field needs to undergo this kind of shift.

    1. You've finally learned that 'for' loops aren't all they're cracked up to be.

    This was one of my early lessons. Someone had dumped a few 10s of millions of reads on me -- a mere gigabyte or so of data -- and I was looking for patterns in the reads. I sat down to write my usual opening lines of Python to look at the data: "for sequence in dataset:" and ... just stopped. The data was just too big to do iteration in a scripting language with it! Cleverness of some sort needed to be applied.

    Corollary: not just Excel, but BLAST, start to look like Really Bad Ideas That Don't Scale. Sean Eddy's HMMER 3 FTW...

    1. Addressing small questions just isn't that interesting.

    Let's face it, if you've just spent a few $k generating dozens of gigabytes amounts of hypothesis-neutral data on gene expression from (say) chick, the end goal of generating a list of genes that are differentially regulated is just not that exciting. (Frankly, you probably could have done it with qPCR, if you didn't want that cachet of "mRNAseq!") What more can you do with the data?

    (This point comes courtesy of Jim Tiedje, who says (I paraphrase): "The problem for assistant professors these days is that many of may not be thinking big enough." Also see: Don't be afraid of failure: really go for it)

    The biggest training challenge (in my opinion) is going to be training people in how to push past the obvious analysis of the data and go for deeper integrative insight. This will require training not just in biology but in data analysis, computational thinking, significant amounts of informed skepticism, etc. (See our course.) I think about it like this: generating hypotheses from large amounts of data isn't that interesting -- I can do that with publicly available data sets without spending any money! Constraining the space of hypotheses with big data sets is far more interesting, because it gives you the space of hypotheses that aren't ruled out; and its putting your data to good use. I'll, uh, let you know if I ever figure out how to do this myself...

    I think there are plenty of people that can learn to do this, but as Greg Wilson correctly points out, there has to be a tradeoff: what do you take out of existing training curricula, to be replaced with training in data analysis? I wish I knew.

    1. Transferring data around is getting more than a bit tedious.

    For some recent papers, I had to copy some big files from EC2 over to my lab computer, and from there to our HPC system. It was Slow, in the "time for lunch" sense. And these were small test data sets, compressed. Transferring our big data sets around is getting tedious. Luckily we have a lot of them, so I can usually work on analysis of one while another is transferring.

    5. Your sysadmin/HPC administrator yells at you on a regular basis about your disk usage.

    We regularly get nastygrams from our local sysadmins accusing of using too many terabytes of disk space. This is in contrast to the good ol' days of physics (which is where I got my sysadmin chops), where your sysadmin would yell at you for using too much CPU...

    1. You regularly crash big memory machines.

    My favorite quote of the year so far is from the GAGE paper), in which Salzberg et al (2012) say "For larger genomes, the choice of assemblers is often limited to those that will run without crashing." Basically, they took a reasonably big computer, threw some big data at various assembly packages, and watched the computer melt down. Repeatedly.

    Someone recently sent me an e-mail saying "hey, we did it! we took 3 Gb of sequence data from soil and assembled it in only 1 week in 3 TB of RAM!" Pretty cool stuff -- but consider that at least four to six runs would need to be done (parameter sweep!), and it takes only about 1 week and $10k to generate twice that data. In the long run, this does not seem cost effective. (It currently takes us 1-2 months to analyze this data in 300 GB of RAM. I'm not saying we have the answer. ;)

    1. Throwing away data looks better and better.

    I made a kind of offhanded comment to a NY Times reporter once (hint: don't do this) about how at some point we're going to need to start throwing away data. He put it as the ultimate quote in his article. People laughed at me for it. (BUT I WAS RIGHT! HISTORY WILL SHOW!)

    But seriously, if someone came up to you and said "we can get rid of 90% of your data for you and give you an answer that's just as good", many biologists would have an instant negative response. But I think there's a ground truth in there: a lot of Big Data is noise. If you can figure out how to get rid of it... why wouldn't you? This is an interesting shift in thinking from the "every data point is precious and special" that you adopt when it takes you a !#%!#$ week to generate each data point.

    I attended a talk that David Haussler gave at Caltech recently. He was talking about how eventually we would need to resequence millions of individual cancer cells to look for linked sets of mutations. At 50-300 Gb of sequence per cell, that's a lot of data. But most of that data is going to be uninteresting -- wouldn't it be great if we could pick out the interesting people and then throw the rest away? It would certainly help with data archiving and analysis...

    8. Big computer companies call you because they're curious about why you're buying such big computers.

    True story: a Big Computer Company called up our local HPC to ask why we were buying so many bigmem machines. They said "It's the damned biologists -- they keep on wanting more memory. Why? We don't know - we suspect the software's badly written, but can't tell. Why don't you talk to Titus? He pretends to understand this stuff." I don't think it's weird to get calls trying to sell me stuff -- but it is a bit weird to have our local HPC's buying habits be so out of character, due to work that I and others are doing, that Big Computer Companies notice.

    (Note: the software's not mine, and it's not badly written, either.)

    9. Your choice must increasingly be "improve algorithms" rather than "buy bigger computers"

    I've been banging this drum for a while. Sequencing capacity is outpacing Moore's Law, and so we need to rethink algorithms. An algorithm that was nlogn used to be good enough; now, if analysis requires a supra-linear algorithm, we need to figure out how to make it linear. (Sublinear would be better.)

    Anecdote: we developed a nifty data structure for attacking metagenome assembly (see: http://arxiv.org/abs/1112.4193). It scaled (scales) assembly by a factor of about 20x, which got us pretty excited -- that meant we could in theory assemble things like MetaHIT and rumen on commodity hardware without doing abundance filtering. Literally the day that we told our collaborators we had it working, they dumped 10x more data on us and told that they could send us more any time we wanted. (...and that, boys and girls, was our introduction to the HiSeq!) 20x wasn't enough. Sigh.

    The MG-RAST folk have told me a similar story. They did some awesomely cool engineering and got their pipeline running about 100x faster. That'll hold them for a year or so against the tidalwave of data.

    Corollary: don't waste your time with 2% improvements in sensitivity and specificity unless you also deliver 10x in compute performance.

    10. You spend a lot of time worrying about biased noise, cross-validation, and the incorrect statistical models used.

    We were delayed in some of our research by about a year, because of some systematic biases being placed in our sequencing data by Illumina. Figuring out that these non-biological features were there took about two months; figuring out how to remove them robustly took another 6 months; and then making sure that removing didn't screw up the actual biological signal took another four months.

    This is a fairly typical story from people who do a lot of data analysis. We developed a variety of cross-validation techniques and ways of intuiting whether or not something was "real" or "noise", and we spent a certain amount of time discussing what statistical approaches to use to assess significance. In the end we more or less gave up and pointed out that on simulated data what we were doing didn't screw things up.

    1. Silicon Valley wants to hire your students to go work on non-biology problems.

    Hey, it's all Big Data, right?

    ---

    So: what is Big Data in biology?

    First, I've talked mostly about DNA sequence analysis, because that's what I work on. But I know that proteomics and image analysis people are facing similar dilemmas. So it's not just sequence data.

    Second, compute technology is constantly improving. So I think we need moving definitions.

    Here are some more serious points that I think bear on what, exactly, problems for Big Data in biology. (They're not all specific to biology, but I can defend them on my home ground more easily, you see.)

    1. You have archival issues on a large scale.

    You have lots of homogeneously formatted data that probably contains answers you don't know you're looking for yet, so you need to save it, metadata it, and catalog it. For a long time.

    1. The rate at which data is arriving is itself increasing.

    You aren't just getting one data set. You're getting dozens (or hundreds) this year. And you'll get more than that next year.

    One implication of this is that you'd better have a largely automated analysis pipeline, or else you will need an increasing number of people just to work with the data, much less do anything interesting. Another implication is that software reuse becomes increasingly important: if you're building custom software for each data set, you will fall behind. A third implication is that you need a long-term plan for scaling your compute capacity.

    1. Data structure and algorithm research is increasingly needed.

    You cannot rely on many heavyweight iterations over your data, or simple data structures for lookup: the data is just too big and existing algorithms are tailored to smaller data. For example, BLAST works fine for a few gigabytes of data; past that, it becomes prohibitively slow.

    1. Infrastructure designers are needed.

    Issues of straightforward data transfer, network partitioning, and bus bandwidth start to come to the forefront. Bottleneck analysis needs to be done regularly. In the past, you could get away with "good enough", but as throughput continues to increase, bottlenecks will need to be tackled on a regular basis. For this, you need a person who is immersed in your problems on a regular basis; they are hard to find and hard to keep.

    One interesting implication here is for cloud computing, where smart people set up a menu of infrastructure options and you can tailor your software to those options. So far I like the idea, but I'm told by aficionados that (for example) Amazon still falls short.

    1. You have specialized hardware needs.

    Sort of a corollary of the above: what kind of analyses do you need to do? And what's the hardware bottleneck? That's where you'll get the most benefit from focused hardware investment.

    1. Hardware, infrastructure design, and algorithms all need to work together.

    Again, a corollary of the above, but: if your bottleneck is memory, focus on memory improvements. If your bottleneck is disk I/O, focus on hardware speed and caching. If your bottleneck is data transfer, try to bring your compute to your data.

    1. Software needs to change to be reusable and portable.

    Robust, reusable software platforms are needed, with good execution guarantees; that way you have a foundation to build on. This software needs to be flexible (practically speaking, scriptable in a language like Python or Ruby or Perl), well developed and tested, and should fade into the background so that you can focus on more interesting things like your actual analysis. It should also be portable so that you can "scale out" -- bring the compute to your data, rather than vice versa. This is where Hadoop and Pig and other such approaches fit now, and where we seriously need to build software infrastructure in biology.

    1. Analysis thinking needs to change.

    Comprehensively analyzing your data sets is tough when your data sets are really big and noisy. Extracting significant signals from them is potentially much easier, and some approaches and algorithms for doing this in biology exist or are being developed (see especially Lior Pachter's eXpress). But this is a real shift in algorithmic thinking, and it's also a real shift in scientific thinking, because you're no longer trying do understand the entire data set -- you're trying to focus on the bits that might be interesting.

    1. Analyses are increasingly integrative.

    It's hard to make sense of lots of data on its own: you need to link it in to other data sets. Data standards and software interoperability and "standard" software pipelines are incredibly important for doing this.

    1. The interesting problems are still discipline-specific.

    There are many people working on Big Data, and there is big business in generic solutions. There's lots of Open Source stuff going on, too. Don't reinvent those wheels; figure out how to connect them to your biology, and then focus on the bits that are interesting to you and important for your science.

    11. New machine learning, data mining, and statistical models need to be developed for data-intensive biological science.

    As data volume increases, and integrative hypothesis development proceeds, we need to figure out how to assess the quality and significance of hypotheses. Right now, a lot of people throw their data at several programs, pick their favorite answer, and then recite the result as if it's correct. Since often they will have tried out many programs, this presents an obvious multiple testing problem. And, since users are generally putting in data that violates one or more of the precepts of the program developers, the results may not be applicable.

    1. A lack of fear of computational approaches is a competitive advantage.

    The ability to approach computational analyses as just another black box upon which controls must be placed is essential. Even if you can open up the black box and understand what's inside, it needs to be evaluated not on what you think it's doing but on what it's actually doing at the black box level. If there's one thing I try to teach to students, it's to engage with the unknown without fear, so that they can think critically about new approaches.

    ---

    Well, that's it for now. I'd be interested in hearing about what other people think I've missed. And, while I'm at it, a hat tip to Erich Schwarz, Greg Wilson, and Adina Howe for reading and commenting on a draft.

    --titus

    March 07, 2012 02:42 PM

    March 06, 2012

    Matthieu Brucher

    Registry and automatic plugin system

    A few years ago, I mentioned that the registry pattern was my favorite pattern in Python. Well, it may also be my favorite C++ pattern.

    I’ve implemented this pattern in most of my production code, when different algorithms are needed for some computation. Some weeks ago, a colleague asked to use one of my production codes, because the parallel I/O were hidden inside a job scheduler. The only thing he had to do was deriving a base class, implement the algorithm, and he could use the massively distributed framework. Of course, the derived class builder was registered in a map, registry pattern inside.

    The idea of the plugin emerged when he didn’t want to compile the whole distributed system. As his class was automatically registered on startup, the plugin system was quite easy to implement: just dlopen() the library, and the build would be registered. Usually, you have to extract some function from the library before being able to use it. With a registry pattern, this is no longer the case. So I did just that, and now all the complicated algorithms can now be exported as plugins, which eases the building process (through SCons) and of course the algorithm development (big time).

    Now, I just need to optimize the distributed framework for I/O…


    by Matt at March 06, 2012 08:27 AM

    March 05, 2012

    PythonXY

    New Beta Ready - 2.7.2.2 b4

    Hi All,

    A beta of 2.7.2.2 is ready for testing. Please download and test - your feedback is vital!

    Why beta?

    1. Major changes were made to the way packages are installed. Specifically to eliminate "all users" vs. "current user" troubles.
    2. Testing was limited to a token few systems (WinXP SP3 & Win7 32bit).
    3. The sheer amount of package updates.
    Upgrading was not tested at all and may yield unpredictable results. Please uninstall any previous Python(x,y) installs before installing the beta. 

    Changes:

    Added

    • ITK 3.20 (without the itkvtkglue feature which is not compatible with VTK 5.8.0)
    • pyparsing 1.5.6 - upgraded from additional plugin status.
    • pyfits 3.0.5 - Hidden under Veusz, upgraded to core plugin status.
    • Veusz converted from an "other" plugin to a python plugin.
    • pandas 0.7.0

    Fixed

    • All shell shortcuts have their working directory set to USERPROFILE.
    • All file assosiactions, menu shorcuts and environment variables are created based on current context.
    • The Python CHM is no longer unpacked.
    •  Issue 379 : installing vitables should automatically add pyQT
    •  Issue 374 : PyQt4-4.8.5_py27 user install bug
    •  Issue 373 : installing python(x,y) breaks existing python install without warning
    •  Issue 359 : Scripts exe's won't launch corresponding -script.py when installing python in custom dir
    •  Issue 329 : Left click menu shortcuts for console are broken
    •  Issue 292 : Python(x,y) 2.7.2.0 installs registry to HKCU instead of HKLM even if "for all users" is selected
    •  Issue 274 : Installation Issues and Enhancement Requests
    •  Issue 107 : Installation to many user accounts

    Updated

    • Spyder 2.1.7.1
    • guidata 1.4.2.2
    • guiqwt 2.1.6.2
    • Distribute 0.6.24
    • nose 1.1.2.1
    • Sphinx 1.1.2
    • Console 2.0.148.2
    • PyQt? 4.8.6.2
    • QtHelp? 4.7.4
    • SciTE 3.0.3.1
    • PySerial? 2.6.0.1
    • MDP 3.3
    • netcdf4 0.9.9
    • PyTables? 2.3.1
    • numexpr 2.0.1
    • SciPy? 0.10.0.1
    • scikits.image 0.4.2
    • numpy 1.6.1.1
    • xy 1.2.14.1
    • IPython 0.10.2.3
    • simplejson 2.3.2
    • pylint 0.25.1.1
    • NetworkX 1.6
    • Enthought Tool Suite 4.1.0
    • cvxopt 1.1.4
    • mx 3.2.3
    • VPython 5.72
    • scikits-learn 0.10.0.1
    • jinja2 2.6.0.1
    • Cython 0.15.1.1
    • Pip 1.0.2.1
    • cx_Freeze 4.2.3.1
    • docutils 0.8.1.2
    • PIL 1.1.7.2
    • PP 1.6.1.1
    • winpdb 1.4.8.2
    • GDAL 1.9.0.1
    • Veusz 1.14.2
    • wxPython 2.9.3.1
    • Pywin32 2.17
    • ETS 4.1.0.1
    • gnuplot 1.8.0.2
    • VTK 5.8.0.1
    • vitables 2.1.0.2
    • WinMerge? 2.12.4.1
    • MinGW 4.5.2.2
    • SWIG 2.0.4.1
    • gettext 0.14.4.2

    by Grizzly Nyo (noreply@blogger.com) at March 05, 2012 11:11 AM

    March 04, 2012

    Josef Perkoltd

    Data "Analysis" in Python

    I'm catching up with some Twitter feeds and other information on the internet about the PyData Workshop

    There is a big effort in the Python/Numpy/SciPy community to get into the "Big Data" and data processing market.

    Even the creator of Python was at the workshop and took not of it.

    Guido van Rossum  -  Yesterday 9:05 PM  -  Public
    Pandas: a data analysis library for Python, poised to give R a run for its money
    

    I think Python is well suited for this, Python in combination with numpy and scipy has been for 4 years my favorite language for coding for statistics and econometrics. I have been working for several years now on improving "Statistics in Python", both in scipy.stats and statsmodels.

    Since the PyData Workshop didn't include anything about statistics or econometrics, it looks like my view is a bit out of mainstream. The blogoshpere is awash with articles about what's hype and what's reality behind BIG DATA. (I don't find the links to the articles I liked, but SAS might have a realistic view Is big data overhyped )

    However, what came to my mind reading the buzz surrounding the PyData Workshop is more personal and specific to software developement in Python.

    My first thoughts can be roughly summarized with

    You know that you are out of date, if

    • you like mailing lists. [1]
    • you signed up for Twitter and never posted anything.
    • you signed up for Google plus and never posted anything.
    • you read the Twitter feed of others once a month.
    • you don't even know how to link to a Twitter message.

    You know you don't do the popular things, if

    • you spend two days checking the numerical accuracy of your algorithm for a case with bad data instead of trying to calculate it in the cloud.
    • you spend a week writing test cases verifying your code against other packages, instead of waiting for the bug reports from users.
    • you spend your time figuring out skew, kurtosis and fat tails, and everyone thinks the world is normal, (normally distributed, that is).
    • you think you can to "fancy" econometrics in Python, when users can just use STATA.
    • you think you can to "fancy" statistics in Python, when users can just use R.
    • you think "Data Analysis" requires statistics and econometrics.

    You know you are missing the boat (or the point), if

    • "all the best and brightest in the scipy/numpy community are doing a startup" [2], and you are not among them.
    • you are looking for your business plan, and you realize you never came up with one.
    • the "community" of your open source project consists mostly of two developers.
    [1]example
    [2]from this feed

    by Josef Perktold (noreply@blogger.com) at March 04, 2012 05:48 PM

    10 lines of code and it took you 3 weeks?!

    A question on the statsmodels mailing list reminded me about some features of software development that seem to occur to me quite frequently.

    The question was about estimating the parameters of a statistical distribution based on the characteristic function with the general method of moments (GMM) with a continuum of moment conditions. I'm not going into details, just say the theory behind it is "a bit" technical. The suggested paper is a survey paper that summarizes the basic ideas and several applications. Of course, there are not enough details to base an implementation on a paper like this.

    But this is not about GMM, in this case I would be reasonably familiar with it. The case I have in mind is p-value correction for multiple testing.

    Close to two years ago there was the question on the statsmodels mailing list about some multiple testing corrections in post-hoc tests, specifically Tukey's_range_test

    Never heard, what's that? A quick look at Wikipedia but that's all I know.

    But the search also shows that this topic seems to be pretty important. However, Tukey and the other post-hoc tests are a bit old, and looking for "multiple testing" and "multiple comparisons" shows a large number of papers published in recent years, and there is FDR, false discovery rate. Now, what's that?

    Fast forward:

    Reading and skimming articles, especially survey articles and articles with Monte Carlo Studies, and some documentation of SAS, to figure out what is worth the trouble, and what is not, and then papers that look like they have enough details to base an implementation on it.

    Starting to code and trying to translate the verbal descriptions of various step-up and step-down procedures into code, with lots of false starts and convoluted loops or code. (summer break in the coffeshop next to the swimming pool)

    Finally, I figure out the pattern, and some R packages to compare my results with. (Christmas break)

    The final result is something like

    elif method.lower() in ['hs', 'holm-sidak']:
        pvals_corrected_raw = 1 - np.power((1. - pvals), np.arange(ntests, 0, -1))
        pvals_corrected = np.maximum.accumulate(pvals_corrected_raw)
    
    elif method.lower() in ['sh', 'simes-hochberg']:
        pvals_corrected_raw = np.arange(ntests, 0, -1) * pvals
        pvals_corrected = np.minimum.accumulate(pvals_corrected_raw[::-1])[::-1]
    

    available in statsmodels as function multipletests. I didn't know we can do things like np.minimum.accumulate with numpy before.

    The code is actually still located in the sandbox although imported through the main parts of statsmodels, since I ran out of steam after getting it to work and verifying it's correctness against R. Actually, I ran out of steam (and Christmas break was over and I had to take care of some tickets for scipy.stats) after getting the cdf for the multivariate t distribution and multiple comparison based on the full covariance matrix, kind of, to work. But here I found a helpful book just published for this for R.

    So, since then statsmodels has a function where part of the docstring is

    method : string
        Method used for testing and adjustment of pvalues. Can be either the
        full name or initial letters. Available methods are ::
    
        `bonferroni` : one-step correction
        `sidak` : on-step correction
        `holm-sidak` :
        `holm` :
        `simes-hochberg` :
        `hommel` :
        `fdr_bh` : Benjamini/Hochberg
        `fdr_by` : Benjamini/Yekutieli
    

    and some other functions.

    PS: The title is exagerated, it's a few hundred lines of code, the sandbox module has close to 2000 lines that is partly not cleaned up and verified, but might be useful when I work on this or similar again. The story is correct in the sense that I spend three weeks or more with maybe 4 or 5 hours a day to get the few lines of code that is at the core of these functions, and to understand what I was doing.

    PS: This was about code developement, multiple testing deserves its own story, and more emphasis and more advertising which I only did on the statsmodels mailing list. I recently read Joseph P. Romano, Michael Wolf : Stepwise Multiple Testing as Formalized Data Snooping which is interesting also in terms of application stories in economics and finance, but I haven't implemented their approach yet.

    PS: This would be a lot easier if I were writing GPL instead of BSD licensed code, since then I could just check the R source.

    PS: I don't have to work so long just to get started with the right implementation, if the problem is more traditional econometrics. There, it's usually just checking which variation people use and some details.

    PS: There are too many PS.

    by Josef Perktold (noreply@blogger.com) at March 04, 2012 05:48 PM

    Starting a list of diagnostic and specification tests

    I started already several times an inventory of statistical tests in python, scipy.stats and statsmodels, compared to R.

    Here is another try.

    It is mainly the comparison with the R package lmtest. I just spend most of two days, writing tests against R, and before that some days of writing tests against Gretl, and before that outlier measures against SAS, as described in the previous posts.

    I don't know yet how easy it will be to maintain a list like this, the current version is mainly based on parsing the lmtest index page. lmtest is not very complete, and there are tests covered in other packages and additional tests covered in Gretl.

    For now I just keep it in a boring python module, with a string that's easy to manipulate and convert.

    #cols: category, name, statsmodels, r_lmtest, gretl
    s4 = '''\
    acorr | Breusch-Godfrey Test | acorr_breush_godfrey | bgtest
    acorr | Durbin-Watson Test | location_no_pvalues | dwtest
    het | Breusch-Pagan Test | het_breush_pagan | bptest
    het | Goldfeld-Quandt Test | het_goldfeld_quandt | gqtest
    het | Harrison-McCabe test | missing | hmctest
    het | White test | het_white | - |
    causality | Test for Granger Causality | grangercausalitytest | grangertest
    linear | Harvey-Collier Test | missing | harvtest
    linear | PE Test for Linear vs. Log-Linear Specifications | missing | petest
    linear | Rainbow Test | missing | raintest
    func form | RESET Test | with outliers | resettest
    compare nonnested | Cox Test | compare_cox | coxtest
    compare nonnested | J Test | compare_j | jtest
    compare nonnested | Encompassing Test | missing | encomptest
    compare nested | Likelihood Ratio Test nested | compare_lr | lrtest
    compare nested | Wald Test nested | compare_ftest | waldtest
    coef | Testing Estimated Coefficients | t_test | coeftest
    coef | Testing Estimated Coefficients | missing | coeftest.breakpointsfull
    '''

    add another separator
    print '\n'.join(line + '|' for line in s4.split('\n'))
    convert to list of list 
    def str2list(ss, sep='|', keep_empty=4):
    Unfortunately copying into blogger doesn't preserve intend, so skip this. And convert separator to tabs, so that google spreadsheet separates the cells:
    print '\n'.join(line + '|' for line in s4.split('\n'))

    Tomorrow, I will start looking for the diagnostic tests that are not yet on the list. R stats has some, for example (fm is my test case linear model result)

    > names(ls.diag(fm))
     [1] "std.dev"      "hat"          "std.res"      "stud.res"     "cooks"        "dfits"      
     [7] "correlation"  "std.err"      "cov.scaled"   "cov.unscaled"
    I figured out that json works pretty well transferring data from some R animals to python.

    In other news:
    The basic R doesn't save automatically the sessionlog or history. I was playing with Rcommander last night to see what default diagnostics they are proposing, and it crashed R after a while. Unfortunately, I hadn't saved my sessionlog and script file, so a day or two of work died with it. I didn't think about safeguarding against crashes anymore, Windows never crashes, not even the kids manage to turn it off anymore, spyder and firefox always recover after a crash with an existing history or session log. R never crashed before.


    by Josef Perktold (noreply@blogger.com) at March 04, 2012 05:48 PM

    Using rst2blogger and Breush-Godfrey test

    This is mainly a test to see if rst2blogger works. The description sounded like just what I needed to make going from rst file to blogger less work. Some setup inconveniences: Why do python programs need to install distribute and destroy my setuptools? Lxml didn't have windows binaries, and easy_install fails because the source doesn't build. In contrast to sphinx, the plain docutils do not highlight python code, at least not right away. If there is an exception in rendering the html, then the message is completely uninformative: except Exception

    (Reminds me of the sound I'm hearing when the kids are just guessing in a numbers game: "Try again", "Try again", ...)

    try a rst list: some categories of tests that I have been working on recently.

    • diagnostic tests
      • autocorrelation
      • heteroscedasticity
      • multicollinearity
      • functional form
      • normality
      • outliers
      • influence
    • tests for coefficients/parameters
      • t test for linear restrictions
      • F test for linear restrictions
      • Wald test for linear restrictions missing
      • Likelihood Ratio test to compare nested models
      • Wald/F test test to compare nested models
      • comparing non-nested models: J, Cox

    Writing Breush-Godfrey autocorrelation test in 5 minutes

    The background story: I have written several diagnostics tests a while ago, maybe one and a half years ago. Many of the diagnostic test have a similar structure and it's easy to just copy the pattern. For one test, acorr_lm, I took the code and basic idea of the augmented dickey fuller test for unit roots, and wanted to make a Lagrange Multiplier test for autocorrelation out of it. Some time later, I added some comments that this could be used as Engle's ARCH test but that there is a difference to Breush-Godfrey's serial correlation test.

    While I was writing tests comparing my diagnostic tests with R, I saw that Breush-Godfrey was still missing. Since this time I had unit tests, it was very quick work, copy and paste, and checking Wikipedia .

    The test passed after a few changes, and I barely had to think.

    Build the matrix of lagged residuals, column_stack them with the original regressors, get the matrix for the linear restrictions that the joint effect of the lagged residuals is zero. Most of the work is done by existing functions.

    I haven't changed the docstring yet, and there is still cleanup work to do.

    def acorr_breush_godfrey(results, nlags=None, store=False):
        '''Lagrange Multiplier tests for autocorrelation
    
        not checked yet, copied from unitrood_adf with adjustments
        check array shapes because of the addition of the constant.
        written/copied without reference
        This is not Breush-Godfrey. BG adds lags of residual to exog in the
        design matrix for the auxiliary regression with residuals as endog,
        see Greene 12.7.1.
    
        Notes
        -----
        If x is calculated as y^2 for a time series y, then this test corresponds
        to the Engel test for autoregressive conditional heteroscedasticity (ARCH).
        TODO: get details and verify
    
        '''
    
        x = np.concatenate((np.zeros(nlags), results.resid))
        exog_old = results.model.exog
        x = np.asarray(x)
        nobs = x.shape[0]
        if nlags is None:
            #for adf from Greene referencing Schwert 1989
            nlags = 12. * np.power(nobs/100., 1/4.)#nobs//4  #TODO: check default, or do AIC/BIC
    
        xdall = lagmat(x[:,None], nlags, trim='both')
        nobs = xdall.shape[0]
        xdall = np.c_[np.ones((nobs,1)), xdall]
        xshort = x[-nobs:]
        exog = np.column_stack((exog_old, xdall))
        k_vars = exog.shape[1]
    
        if store: resstore = ResultsStore()
    
        resols = sm.OLS(xshort, exog).fit()
        ft = resols.f_test(np.eye(nlags, k_vars, k_vars - nlags))
        fval = ft.fvalue
        fpval = ft.pvalue
        fval = np.squeeze(fval)[()]   #TODO: fix this in ContrastResults
        fpval = np.squeeze(fpval)[()]
        lm = nobs * resols.rsquared
        lmpval = stats.chi2.sf(lm, nlags)
        # Note: degrees of freedom for LM test is nvars minus constant = usedlags
        #return fval, fpval, lm, lmpval
    
        if store:
            resstore.resols = resols
            resstore.usedlag = nlags
            return fval, fpval, lm, lmpval, resstore
        else:
            return fval, fpval, lm, lmpval
    

    Just a brief explanation to the main tools:

    lagmat just creates a matrix of lagged values without having to worry about how to shift and cut the arrays.

    >>> from scikits.statsmodels.tsa.tsatools import lagmat
    >>> lagmat(np.arange(8), 3, trim='both')
    array([[ 2.,  1.,  0.],
           [ 3.,  2.,  1.],
           [ 4.,  3.,  2.],
           [ 5.,  4.,  3.],
           [ 6.,  5.,  4.]])
    

    numpy.eye is very flexible and allows shifting of the diagonal

    >>> np.eye(3, 5, 5-3)
    array([[ 0.,  0.,  1.,  0.,  0.],
           [ 0.,  0.,  0.,  1.,  0.],
           [ 0.,  0.,  0.,  0.,  1.]])
    

    The rest are mainly calls to OLS and f_test. One part I needed to change was that R takes the initial errors to be zero instead of truncating the regression, so I also had to concatenate some zeros in front of the regression residuals array.

    by Josef Perktold (noreply@blogger.com) at March 04, 2012 05:46 PM

    Numerical Accuracy in Linear Least Squares and Rescaling

    The Problem

    (Warning upfront: there are problems to replicate this, more below)

    A week ago, I stumbled on this Numerical_Comparison_of_Statistical_Software which presents some test results for numerical accuracy of statistical packages.

    For linear regression, there is one test, Longley, that we have in the datasets in statsmodels. But I wanted to look at Filip which sounds like a difficult case, neither SAS nor SPSS produced a solution. Let's see how badly statsmodels and numpy are doing, or maybe not.

    The model is a polynomial of degree 10. Description, data, certified values and a plot are on the NIST website here

    1 Predictor Variable
    82 Observations
    Higher Level of Difficulty
    Model: Polynomial, 11 Parameters
    

    I parsed the data into an array dta, first column is the endogeous, y, variable second column is the exogenous, x, variable. I saved y in endog. I also parsed the main NIST result in params_nist, first column parameters, second column their standard deviation.

    Fitting a Polynomial

    Since it is a polynomial problem, let us treat it as such and use the polynomials from numpy.

    First try, use the old polyfit function

    >>> p_params = np.polyfit(dta[:,1], endog, 10)[::-1]
    >>> p_params
    array([-1467.48963361, -2772.17962811, -2316.37111156, -1127.97395547,
            -354.47823824,   -75.1242027 ,   -10.87531817,    -1.062215  ,
              -0.06701912,    -0.00246781,    -0.0000403 ])
    >>> log_relative_error(p_params, params_nist[:,0])
    array([ 7.87929761,  7.88443445,  7.88840683,  7.89138269,  7.89325784,
            7.89395336,  7.89341841,  7.89162977,  7.88859034,  7.88432427,
            7.87887292])
    

    Not bad, following the description on the Wikipedia page, I wrote a function log_relative_error that tells us how many significant digits agreement is between the two arrays. polyfit agrees at 7 to 8 significant digits, that's about the same as S-Plus on the Wikipedia page.

    Let's work properly with polynomials and use the new polynomial package in numpy. Charles Harris wrote it and is still expanding and improving it.

    >>> poly = np.polynomial.Polynomial.fit(dta[:,1],endog, 10)
    >>> poly
    Polynomial([ 0.88784146,  0.10879327, -0.53636698,  0.28747072,  2.20567367,
           -1.31072158, -4.21841581,  1.76229897,  3.8096025 , -0.77251557,
           -1.30327368], [-8.78146449, -3.13200249])
    

    Oops, these numbers don't look like the NIST numbers. The last numbers, [-8.78146449, -3.13200249], show the domain of the polynomial, our values have been transformed. A bit of introspection, and we figure out how to change the domain. To get the "standard" representation, we can transform the domain back to the standard domain (-1, 1).

    >>> poly.convert(domain=(-1,1))
    Polynomial([-1467.48961423, -2772.17959193, -2316.37108161, -1127.97394098,
            -354.4782337 ,   -75.12420174,   -10.87531804,    -1.06221499,
              -0.06701912,    -0.00246781,    -0.0000403 ], [-1.,  1.])
    

    Now, this looks more like NIST, it even agrees at 13 to 14 significant digits

    >>> log_relative_error(poly.convert(domain=(-1,1)).coef, params_nist[:,0])
    array([ 13.72347502,  13.84056851,  13.81494335,  13.70878715,
            13.78207216,  13.79374075,  13.6729684 ,  13.71128925,
            13.75445952,  13.68695573,  13.67736436])
    

    Nice job Charles. No problem fitting this polynomial with numpy.

    Linear Regression

    In the previous part we knew we were fitting a polynomial, but lets treat it just as a standard linear regression problem and use scikits.statsmodels.

    First try: just create the design matrix in the simplest way and estimate

    >>> exog0 = dta[:,1:]**np.arange(11)
    >>> res0 = sm.OLS(endog, exog0).fit()
    >>> res0.params
    array([ 8.443046917097718,  1.364996059973237, -5.350750046084954,
           -3.34190287892045 , -0.406458629495091,  0.257727311296307,
            0.119771653524165,  0.023140891929892,  0.002403995206457,
            0.000131618839544,  0.000002990001222])
    >>> log_relative_error(res0.params, params_nist[:,0])
    array([-0.002491507096328, -0.000213790029725,  0.00100436814061 ,
            0.001288615104161,  0.000498264786078, -0.00148737673275 ,
           -0.004756810105056, -0.009359738327099, -0.015305377783833,
           -0.022566206229652, -0.031085341541384])
    

    Bummer, 0 significant digits, way off.

    We can print the full summary of the results, maybe we see something

    >>> print res0.summary()
                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                      y   R-squared:                       0.996
    Model:                            OLS   Adj. R-squared:                  0.995
    Method:                 Least Squares   F-statistic:                     2390.
    Date:                Sat, 03 Mar 2012   Prob (F-statistic):           1.85e-84
    Time:                        23:47:45   Log-Likelihood:                 344.73
    No. Observations:                  82   AIC:                            -673.5
    Df Residuals:                      74   BIC:                            -654.2
    Df Model:                           7
    ==============================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
    ------------------------------------------------------------------------------
    const          8.4430     12.864      0.656      0.514       -17.189    34.075
    x1             1.3650      6.496      0.210      0.834       -11.578    14.308
    x2            -5.3508      9.347     -0.572      0.569       -23.974    13.273
    x3            -3.3419     11.702     -0.286      0.776       -26.659    19.975
    x4            -0.4065      5.923     -0.069      0.945       -12.209    11.396
    x5             0.2577      1.734      0.149      0.882        -3.197     3.712
    x6             0.1198      0.321      0.373      0.710        -0.520     0.759
    x7             0.0231      0.038      0.604      0.548        -0.053     0.099
    x8             0.0024      0.003      0.838      0.405        -0.003     0.008
    x9             0.0001      0.000      1.072      0.287        -0.000     0.000
    x10          2.99e-06   2.29e-06      1.303      0.197     -1.58e-06  7.56e-06
    ==============================================================================
    Omnibus:                        1.604   Durbin-Watson:                   1.627
    Prob(Omnibus):                  0.449   Jarque-Bera (JB):                1.379
    Skew:                          -0.317   Prob(JB):                        0.502
    Kurtosis:                       2.961   Cond. No.                        -1.#J
    ==============================================================================
    
    The smallest eigenvalue is  -0.38. This might indicate that there are
    strong multicollinearity problems or that the design matrix is singular.
    

    R square is 0.996, so we are fitting the curve pretty well, but our design matrix with the polynomial is not positive definite. There is even a negative eigenvalue. A negative eigenvalue sounds strange, a quadratic form shouldn't have them. Just to make sure that this is not a bug, check with numpy

    >>> np.linalg.eigvalsh(np.dot(exog0.T, exog0)).min()
    -0.38006444279775781
    >>> np.linalg.eigvals(np.dot(exog0.T, exog0)).min()
    -0.00011161167956978682
    

    I'm still suspicious, but I delay the detour into numpy's and scipy's linalg modules.

    One more check of our regression results, the residual standard error is not very far away from the Nist numbers:

    >>> np.sqrt(res0.mse_resid), 0.334801051324544E-02,
    (0.0038044343586352601, 0.0033480105132454399)
    

    Conclusion: If you try to fit a linear regression with a non-positive definite design matrix, then the parameters are not identified, but we can still get a good fit.

    (Technical aside: statsmodels uses by default the generalized inverse, pinv, for linear regression. So it just drops the eigenvalues below a threshold close to zero. The parameter estimates will be closer to a penalized Ridge regression. But don't quote me on the last part since I don't remember where I read that pinv is the limit of a Ridge problem.)

    The question for statsmodels is what to do about it.

    One solution that works in this case, as we have seen with numpy polynomials, is to rescale the explanatory variables or design matrix. I'm showing one example below. My working title for this post was: Don't do this, or do we have to do it for you? Is it the responsibility of the user not to use a design matrix that numerically doesn't make much sense and we can only warn, or should we automatically transform the design matrix to make it numerically more stable. The latter will be costly and might not be required in 99% of the cases?

    Another issue is that there are many different ways to do the linear algebra, and we have not investigated much what might work better or worse in different cases. See Addendum below for the effect that linear algebra can have in numerically unstable problems.

    Rescaling the Design Matrix

    Our design matrix looks pretty bad, variables vary in a large range and the correlation is very high

    >>> np.set_printoptions(precision=3)
    >>> print np.max(np.abs(exog0),0)
    [  1.000e+00   8.781e+00   7.711e+01   6.772e+02   5.947e+03   5.222e+04
       4.586e+05   4.027e+06   3.536e+07   3.105e+08   2.727e+09]
    
    >>> print np.corrcoef(exog0[:,1:], rowvar=0)
    [[ 1.    -0.991  0.969 -0.938  0.904 -0.87   0.838 -0.808  0.782 -0.758]
     [-0.991  1.    -0.993  0.975 -0.951  0.925 -0.899  0.874 -0.851  0.83 ]
     [ 0.969 -0.993  1.    -0.994  0.981 -0.963  0.943 -0.924  0.904 -0.886]
     [-0.938  0.975 -0.994  1.    -0.996  0.986 -0.973  0.958 -0.943  0.928]
     [ 0.904 -0.951  0.981 -0.996  1.    -0.997  0.99  -0.98   0.968 -0.957]
     [-0.87   0.925 -0.963  0.986 -0.997  1.    -0.998  0.992 -0.985  0.976]
     [ 0.838 -0.899  0.943 -0.973  0.99  -0.998  1.    -0.998  0.994 -0.988]
     [-0.808  0.874 -0.924  0.958 -0.98   0.992 -0.998  1.    -0.999  0.995]
     [ 0.782 -0.851  0.904 -0.943  0.968 -0.985  0.994 -0.999  1.    -0.999]
     [-0.758  0.83  -0.886  0.928 -0.957  0.976 -0.988  0.995 -0.999  1.   ]]
    

    Now we can use just the simplest transform, limit the maximum absolute value to be one:

    exog1 = exog0 / np.max(np.abs(exog0),0)
    

    After running the regression on the rescaled design matrix, we see an agreement with the NIST benchmark results at around 7 to 8 significant digits both for the parameters and for the standard deviation of the parameter estimates, bse in statsmodels:

    >>> res1 = sm.OLS(endog, exog1).fit()
    >>> params_rescaled = res1.params / np.max(np.abs(exog0), 0)
    >>> log_relative_error(params_rescaled, params_nist[:,0])
    array([ 7.419,  7.414,  7.409,  7.402,  7.394,  7.384,  7.373,  7.36 ,
            7.346,  7.331,  7.314])
    >>> bse_rescaled = res1.bse / np.max(np.abs(exog0),0)
    >>> log_relative_error(bse_rescaled, params_nist[:,1])
    array([ 8.512,  8.435,  8.368,  8.308,  8.255,  8.207,  8.164,  8.124,
            8.089,  8.057,  8.028])
    

    Also R squared and the standard deviation of the residuals (using appropriate degrees of freedom) agrees with the NIST numbers at around 10 and 7 digits, resp.

    >>> log_relative_error(res1.rsquared, 0.996727416185620)
    10.040156510920081
    
    >>> log_relative_error(np.sqrt(res1.mse_resid), 0.334801051324544E-02)
    7.8575015681097371
    

    So we are doing pretty well just with a simple rescaling of the variables. Although, the comment at the end of print res1.summary() still reports a smallest eigenvalue of -1.51e-15, so essentially zero. But I worry about this later. I looked initially at another way of rescaling the design matrix but didn't check yet how the choice of the rescaling will affect the results

    Addendum 1: Smallest eigenvalue of ill-conditioned array

    Going back to the original design matrix without rescaling, define the moment matrix X'X:

    >>> xpx0 = np.dot(exog0.T, exog0)
    

    the eigenvalues, assuming a symmetric matrix, are

    >>> np.sort(np.linalg.eigvalsh(xpx0))
    array([ -3.79709e-01,   1.14869e-05,   4.40507e-03,   3.20670e+00,
             7.91804e+02,   1.05833e+03,   3.98410e+05,   2.31485e+08,
             4.28415e+11,   1.93733e+15,   5.17955e+19])
    

    This looks very badly conditioned. the largest eigenvalue is 5e19, the smallest is "around zero".

    We can compare different algorithms to calculate the smallest eigenvalues (splinalg is scipy.linalg)

    >>> np.sort(np.linalg.eigvals(xpx0))[:4]
    array([  3.41128e-04,   5.58946e-04,   1.23213e-02,   3.33365e+00])
    >>> np.sort(splinalg.eigvalsh(xpx0))[:4]
    array([ -2.14363e+03,  -2.00323e-01,   1.26094e-05,   4.40956e-03])
    >>> np.sort(splinalg.eigvals(xpx0))[:4]
    array([ -3.66973e-05+0.j,   1.61750e-04+0.j,   7.90465e-03+0.j,
             2.01592e+00+0.j])
    
    >>> np.sort(np.linalg.svd(xpx0)[1])[:4]
    array([  2.84057e-05,   4.91555e-04,   7.28252e-03,   3.41739e+00])
    >>> np.sort(splinalg.svd(xpx0)[1])[:4]
    array([  2.19202e-05,   7.11920e-04,   7.00790e-03,   3.28229e+00])
    
    >>> np.sort(np.linalg.svd(exog0)[1]**2)[:4]
    array([  1.65709e-11,   3.08225e-08,   2.48138e-05,   1.08036e-02])
    >>> np.sort(splinalg.svd(exog0)[1]**2)[:4]
    array([  1.65708e-11,   3.08225e-08,   2.48138e-05,   1.08036e-02])
    

    So, we see that they are pretty much all over the place, from -0.38 to 2.8e-05. The last version with singular value decomposition is the closest to what statsmodels uses with pinv. It also looks like I picked the worst algorithm for the regression summary in this case.

    Warning: Calculations at machine precision are not necessarily deterministic, in the sense that if you run it repeatedly you might not always get the same results. There are several cases on the scipy and numpy mailing lists that report that the results might "randomly" switch between several patterns. And the results won't agree on different operating systems, compilers and versions of the linear algebra libraries. So, I don't expect that these results can be replicated in exactly the same way.

    Addendum 2: scipy.linalg versus numpy.linalg

    To avoid getting these changing results whenever I re-ran the script while preparing this post, I changed the statsmodels source to use scipy.linalg.pinv instead of numpy.linalg.pinv. I expected more replicable results, however what I found is:

    >>> exog0 = dta[:,1:]**np.arange(11)
    >>> res0 = sm.OLS(endog, exog0).fit()
    >>> log_relative_error(res0.params, params_nist[:,0])
    array([ 5.31146488,  5.7400516 ,  6.53794562,  6.81318335,  6.81855769,
            7.22333339,  8.13319742,  7.38788711,  7.24457806,  7.18580677,
            7.12494176])
    >>> log_relative_error(res0.bse, params_nist[:,1])
    array([ 2.25861611,  2.25837872,  2.25825903,  2.25822427,  2.2582245 ,
            2.25823174,  2.25823693,  2.25823946,  2.25824058,  2.25824108,
            2.25824165])
    

    Just by changing the algorithm that calculates the generalized inverse, I get agreement with the NIST data at 5 to 7 significant digits for the parameters and 2 digits for the standard error of the parameter estimates even with the very ill-conditioned original design matrix. That doesn't look so bad, much better than when using the numpy.linalg version.

    (But I need to write proper tests and look at this when I can trust the results. I now have two python sessions open, one that imported the original source, and one that imported the source after changing the statsmodels source. Also, if I run this regression repeatedly the numbers changed once, but remained within the same neighborhood. Besides different algorithm there is also rcond which defines the cutoff in pinv. I didn't check whether that differs in the numpy and scipy versions.)

    Conclusions

    I think this test case on the NIST site is very well "cooked" to test the numerical accuracy of a linear regression program. The main lesson is that we shouldn't throw a numerically awful problem at a statistical package, unless we know that the package takes care for us of the basic tricks for making the problem numerically more stable. It's safer to make sure our design matrix is numerically sound.

    Also, if we just want to estimate a polynomial function, then use the information and use a specialized algorithm, or, even better, use an orthogonal polynomial basis instead of power polynomials.

    What does it mean for statistical analysis?

    That, I'm not so sure. Multicollinearity is a serious issue, and there a various approaches for dealing with it. But my attitude so far has been:

    If you work with real data and run into numerical problems, it's not a problem with numerical accuracy but with your data, or with your model.

    We should still use basic precautions like scaling our variables appropriately, but if we have high multicollinearity, then it mainly means that the model that we specified is asking for information that's not in the data. In certain directions the data is not informative enough to reliably identify some parameters. Given measurement errors, noise in the data and misspecification, there are many other parts to worry about before machine precision becomes important. For a related discussion see this thread on the statsmodels mailinglist.

    I tried before to come up with a case where standardizing (zscoring) the design matrix helps in improving the precision of the estimates but I didn't manage. Whether I zscored or not, the results where essentially the same. Now, I have a test case to add to statsmodels. I am sceptical about automatic rescaling, but I started a while ago to look into how to make it easier for users to use predefined transforms in statsmodels, instead of having to code them from scratch.

    I'm not an expert in numerical analysis and don't intend to become one, my "numerical incompetence" has improved only a bit since this although I know now a bit more linear algebra.

    I put a script with the NIST case in this gist. I haven't yet copied over the parts from the interpreter sessions.

    A final comment:

    I don't like long interpreter sessions, I usually convert everything as fast as possible to a script. For this, I copied everything directly from the session. After cleaning up the original script a bit, I'm getting different numbers for the log relative error (LRE). I'm now using scipy.linalg.pinv inside statsmodels, and LRE is in this case a measure for the behavior at machine precision, and bounces anywhere between 5 and 8. This is a good result in that we can still get estimates with a reasonable precision, but it makes LRE unreliable for replicating the results. I will make a proper script and unittest later, so that I can be more certain about how much the numbers change and whether there isn't a bug somewhere in my "live" session.

    by Josef Perktold (noreply@blogger.com) at March 04, 2012 05:46 PM

    February 28, 2012

    Andreas Mueller

    Simplistic Minimum Spanning Tree in Numpy [update]

    I started working with spanning trees for euclidean distance graphs today. The first think I obviously needed to do was compute the spanning tree.

    There are MST algorithms in Python, for example in pygraph and networkx. These use their native graph formats, though, which would have meant I'd have to construct a graph from my point set.

    I didn't see a way on how to do this and set the edge weights without iterating over all edges.

    That would probably take longer than the computation of the MST, so I decided to do my own small implementation using numpy.
    This is an instantiation of Prim's algorithm based on numpy matrices. The input is a dense matrix of distances, the output a list of edges. It is not as pretty as I would have hoped but still reasonably short. If any one has suggestions how to make this prettier, I'd love that.

    [edit] Using line_profiler I had a quick look a the code and made some minor improvements. It's now significantly faster than networkx and also a bit prettier. Most time is now spent on the argmin, which seems reasonable. It would probably be even faster if I didn't use a full matrix for representing the weights but only the upper triangle. But that seems to be some extra effort.[/edit]

    Here it goes:
    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    import matplotlib.pyplot as plt


    def minimum_spanning_tree(X, copy_X=True):
    """X are edge weights of fully connected graph"""
    if copy_X:
    X = X.copy()

    if X.shape[0] != X.shape[1]:
    raise ValueError("X needs to be square matrix of edge weights")
    n_vertices = X.shape[0]
    spanning_edges = []

    # initialize with node 0:
    visited_vertices = [0]
    num_visited = 1
    # exclude self connections:
    diag_indices = np.arange(n_vertices)
    X[diag_indices, diag_indices] = np.inf

    while num_visited != n_vertices:
    new_edge = np.argmin(X[visited_vertices], axis=None)
    # 2d encoding of new_edge from flat, get correct indices
    new_edge = divmod(new_edge, n_vertices)
    new_edge = [visited_vertices[new_edge[0]], new_edge[1]]
    # add edge to tree
    spanning_edges.append(new_edge)
    visited_vertices.append(new_edge[1])
    # remove all edges inside current tree
    X[visited_vertices, new_edge[1]] = np.inf
    X[new_edge[1], visited_vertices] = np.inf
    num_visited += 1
    return np.vstack(spanning_edges)


    def test_mst():
    P = np.random.uniform(size=(50, 2))

    X = squareform(pdist(P))
    edge_list = minimum_spanning_tree(X)
    plt.scatter(P[:, 0], P[:, 1])

    for edge in edge_list:
    i, j = edge
    plt.plot([P[i, 0], P[j, 0]], [P[i, 1], P[j, 1]], c='r')
    plt.show()

    if __name__ == "__main__":
    test_mst()
    The algorithms basically works on a dense distance matrix and finds the best possible edge that is reachable from a set of active nodes.
    All edges that connect nodes already inside the tree are set to infinity so as not to create cycles.

    I haven't really benchmarked this and it probably loses against any cython implementation but I would hope it is reasonably fast and straight-forward.

    The output looks something like this:

    by Andreas Mueller (noreply@blogger.com) at February 28, 2012 07:51 PM

    Stefan van der Walt

    scikits-image 0.5

    We’re happy to announce the 0.5 release of scikits-image, our image processing toolbox for SciPy.

    For more information, please visit our website

    http://scikits-image.org

    New Features

    • Consistent intensity rescaling and improved range conversion.
    • Random walker segmentation.
    • Harris corner detection.
    • Otsu thresholding.
    • Block views, window views and montage.
    • Plugin for Christoph Gohlke’s “tifffile”.
    • Peak detection.
    • Improved FreeImage wrappers and meta-data reading.
    • 8-neighbor and background labelling.

    … along with updates to the documentation and website, and a number of bug fixes.

    Contributors to this release

    • Andreas Mueller
    • Brian Holt
    • Christoph Gohlke
    • Emmanuelle Gouillart
    • Michael Aye
    • Nelle Varoquaux
    • Nicolas Pinto
    • Nicolas Poilvert
    • Pieter Holtzhausen
    • Stefan van der Walt
    • Tony S Yu
    • Warren Weckesser
    • Zachary Pincus

    Permalink | Leave a comment  »

    February 28, 2012 02:42 AM

    February 24, 2012

    Trichech

    Wenn die Anleihemärkte zittern

    Auf den Anleihemärkten steigt seit der Schuldenkrise die Nervosität zunehmend. Neben der immer wieder aufflammenden Krise Griechenlands herrscht weiterhin die Rezessionsangst. Dies macht sich auch für Spanien bemerkbar. Bei den Dreijahreskrediten werden höhere Zinsen fällig. Für Staatsanleihen musste der spanische Staat eine Rendite von 3,332 % bezahlen. Vor zwei Wochen verlangten die Investoren nur 2,861%.

    Aufgrund dieser Anleihen nahm die spanische Hauptstadt erneut 4,1 Mrd. Euro auf. Durch die Zinssteigerung nimmt natürlich auch das Zittern an den Bondmärkten erneut zu, da neben Griechenland nun auch schlechte Nachrichten aus Spanien zu verzeichnen sind: Hier sank die spanische Wirtschaftsleistung um 0,3 %.

    Trotz aller Befürchtungen schätzen die Analysten die Ergebnisse nicht negativ ein. Aufgrund der Renditensteigerung wurde die Nachfrage nach spanischen Staatsanleihen erneut belebt, berichtet die Nachrichtenagentur Reuters. Trotzdem seien die Renditen noch nicht so hoch, wie im letzten Quartal des Jahres 2011. In diesem Zeitraum musste der spanische Staat Renditen von über 5 % bezahlen.

    Im Gegensatz zu Spanien kann Frankreich einen erheblichen Rückgang der Renditen verzeichnen. Anleger forderten für zweijährige Staatsanleihen eine Rendite von 0,89 %. Zu Beginn des Jahres 2012 waren es noch 1,05 %. Dies scheint eine positive Entwicklung zu sein, obwohl die Ratingagentur Moody’s Frankreich ziemlich kritisierte. Sie drohte sogar damit, dem französischen Staat die Bonitätsnote Aaa abzuerkennen. Dieses erstklassige Rating wurde von Standard & Poor’s bereits entzogen. Beide Ratingagenturen stehen in einem offenen Konkurrenzkampf.

    Die Nachrichtenagentur Bloomberg berichtet über eine Long-Term Refinancing Operation, die es erlaubt, drei-Jahres-Darlehen zu vergeben. Daraufhin wurden 490 Milliarden Euro von etwa 500 Banken genutzt. Diese Liquiditätsspritze der Europäischen Zentralbank soll zu dem befriedigenden Ergebnis Frankreichs geführt haben.

    by praktikant at February 24, 2012 09:22 AM

    Gute Chancen für die deutsche Wirtschaft

    Die deutsche Wirtschaft steigt wieder stetig. Trotz „außenwirtschaftlicher Bremsfaktoren“, die sie im ersten Quartal des Jahres 2012 noch beeinflussen, so der Notenbank-Monatsbericht für Februar, befindet sich die Wirtschaft Deutschlands wieder auf einem grünen Zweig. Seit der Rezession im Jahr 2009 schrumpfte die Wirtschaft erheblich. Nach dem Schlussquartal des vergangenen Jahres kann jedoch wieder neuer Mut geschöpft werden.

    Zwar sank das BIP von Oktober bis Dezember um 0,2%, diese Verringerung fiel jedoch geringer aus, als es zu erwarten war.

    Wirtschaftswachstum aufgrund der hohen Baunachfrage

    Die Bundesbank berichtet über gute Chancen für die deutsche Wirtschaft. Aufgrund einer hohen Baunachfrage wird auch die deutsche Konjunktur wieder nachhaltig gestärkt.

    Einen Grund für diese Nachfrage sehen Experten in den zahlreichen Förderprogrammen von Bund und Ländern für Sanierung und Hausbau. Wer nachhaltig und energieeffizient saniert oder baut, kann gute Zinssätze und Fördergelder erwarten.

    Besonders die Lage auf dem Immobilienmarkt entspannte sich. Aufgrund der stetigen Nachfrage stieg der Preis für Immobilien während des Jahres um 5,5 %. Die Bundesbank erläutert des Weiteren, dass erstmals seit der Wiedervereinigung wieder ein Aufschwung der Konjunktur mit Preiserhöhungen auf dem Immobilienmarkt verbunden sei.

    Privater Konsum stärkt die Konjunktur

    Vor allem die Kauflaune und der allgemeine Konsum von Privatleuten spielt eine große Rolle im Bereich der Konjunktursteigerung. Bereits im vergangenen Jahr konnte die deutsche Wirtschaft von dem privaten Konsum profitieren. Die positive Stimmung des Endverbrauchers ist Indiz für eine stabile Arbeitsmarktlage, die den Arbeitnehmern Sicherheit verspricht. Im Vordergrund stehen dabei eher Investitionen. Das Rekordtief des Leitzinses im Euroraum von nur einem Prozent veranlasst den Verbraucher von daher eher auszugeben, als zu sparen.

    by praktikant at February 24, 2012 09:21 AM

    Energieeffizienz ermöglicht Förderung durch Bund und Länder bei der Baufinanzierung

    Für die Sanierung oder den Hausbau bieten Bund und Länder ausgezeichnete Konditionen mit einem festen Zinssatz über dreißig Jahre. Dabei kann der Bauherr zwischen verschiedenen Angeboten wählen. Die Bedingung für eine Förderung der Baufinanzierung ist jedoch die energieeffiziente Versorgung.

    Der Staat ermöglicht bei dem Bau eines energieeffizienten Hauses finanzielle Unterstützungen. Darüber hinaus offerieren Bund und Länder verschiedene Förderprogramme, welche durch die Förderbank des Bundes gewährleistet werden. So können Darlehen zu einem günstigen Zins oder auch Direktzuschüsse erhalten werden. Die Höhe richtet sich jedoch nach dem Grad der vorgesehenen Energieeffizienz.

    Das Programm „Energieeffizient Bauen“

    Mit Hilfe dieses Programmes können Bauherren mit einer finanziellen Unterstützung von etwa 50.000 € je Wohneinheit rechnen. Dabei müssen jedoch spezielle Vorgaben im Bereich Energiebedarf, Wärmeverlust und Energiesparen eingehalten werden. Die Förderung umfasst dabei Bau- sowie Kaufkosten von Häusern oder Wohnungen.

    Möglicher Tilgungszuschuss

    Dieser Zuschuss ist erneut von der Energieeffizienz des Hauses oder der Wohnung abhängig. Wird bei einem Haus extrem energieeffizient gebaut und eingerichtet, ermöglicht die Förderbank des Bundes einen Tilgungszuschuss. Werden im Bereich des Energiebedarfs dabei Werte erreicht, die 45% unter dem Mindeststandard liegen, erhält der Bauherr einen Zuschuss von 5%. Liegt der Wert 60% unter dem Standard, kann sogar ein Zuschuss von 10% erwartet werden.

    Zu beachten sei jedoch, dass alle Förderanträge vor dem Hausbau bzw. vor dem Kauf des Eigenheims eingereicht und bearbeitet werden müssen. Der Antrag muss bei der eigenen Hausbank gestellt werden.

    Förderprogramme der Länder

    Die Bundesländer ermöglichen Bauherren mittels Investitionsbanken und Kreditanstalten verschiedene Zuschüsse, die sich meist auch nach der Energieeffizienz richten.

    Wer also ein Haus sowie eine Wohnung bauen oder sanieren möchte, sollte sich bereits vorzeitig um verschiedene Fördermöglichkeiten bemühen. Besonders in Zeiten von Rohölknappheit und erneuerbaren Energien ist es möglich, langfristig gesehen Geld durch innovative Bauweise und Energiemodelle zu sparen.

    by praktikant at February 24, 2012 09:19 AM

    February 21, 2012

    EuroSciPy

    Euroscipy 2012: Brussels, August 23-27!

    It is our pleasure to announce this year's Euroscipy conference, that will be held in Brussels, August 23-27, at the Université Libre de Bruxelles (ULB, Solbosch Campus).

    This Euroscipy meeting will be the 5th edition of this cross-disciplinary gathering, focused on the use and development of the Python language in scientific research and industry. Previous conferences took place in Leipzig and Paris, and gathered a very nice crowd of researchers, engineers, programmers, students, hackers, etc. Many thanks to the two new conferences chairs, Pierre de Buyl and Didrik Pinte, and to the local organizing committee, who took over the organization after two years in Paris!

    As for the last editions, Euroscipy 2012 will consist in two days of tutorials and two days of conference. Thanks to the last Euroscipys, we have been able to gather a lot of tutorial materials that are available on http://scipy-lectures.github.com/ (CC-by license). These lecture notes can be used either in html or in pdf format; they cover introductory and more advanced topics related to Scientific Python, and each section can be used as a basis for a two- or three-hour tutorial. We hope to take advantage of this new conference to improve and extend http://scipy-lectures.github.com!

    As for the conference, we are very excited to welcome David Beazley (http://www.dabeaz.com) as our keynote speaker. David Beazley created SWIG, a software development tool that connects programs written in C and C++ with a variety of high-level programming languages such as Python. He has also authored the acclaimed Python Essential Reference. The call for abstracts is now open; abstracts will be selected by our program committee for oral presentations and posters. We look forward to hearing about your recent breakthroughs using Python! More details on the submission of abstracts can be found here. The deadline for abstract submission is Monday April 30. More details about the organization are found on the webpage of the conference: http://www.euroscipy.org/conference/euroscipy2012

    An innovation of last year was the organization of satellite meetings following Euroscipy, one on Python for neuro-sciences and the other on Python for Physics. If you are interested in organizing a satellite meeting in Brussels (and preferably have some local contacts who can help for the practical organization), you should contact the organizing team at org-team@lists.euroscipy.org.

    We look forward to meeting you in Brussels next summer!

    Emmanuelle

    Euroscipy 2011 (ENS Paris)

    by Emmanuelle Gouillart at February 21, 2012 12:14 AM

    February 17, 2012

    NeuralEnsemble

    G-Node Workshop on Neuronal GPU Computing

    Announcement by Michael Schmuker, Christian Kellner and Thomas Wachtler of a very interesting workshop:

    Graphics processing units (GPUs) offer a low-cost approach to parallel high-performance computing. Neuronal simulations can be parallelized efficiently and are particularly well suited for implementation on GPUs. There is also great potential for GPU-based high-throughput analysis of neuronal data. The field is progressing at rapid pace, and has reached a point where it may strongly benefit from some kind of convergence between the different approaches.

    To facilitate communication and foster collaboration in the field, the German INCF Node (G-Node) organizes a one-day symposium on neuronal GPU computing with an adjoint two-day developer workshop.

    Symposium

    Invited Speakers (preliminary):

    • Romain Brette (École Normale Supérieure, Paris)
    • Andreas Fidjeland (Imperial College, London)
    • Dan Goodman (École Normale Supérieure, Paris)
    • Thomas Nowotny (University of Sussex, Brighton-Falmer)
    • Pierre Yger (Imperial College, London)

    Applications are encouraged for talks at the symposium. Topics may cover one or more of the following:

    • GPU-based neuronal simulation: development, applications, user reports
    • GPU-based data analysis: software and use-cases
    • Reports on GPU-powered neuronal research
    • Comparison of GPU-based neuronal applications with other high-throughput technologies (e.g. clusters, neuromorphic hardware)

    Participation in the symposium is free, but registration is required.

    Developer workshop

    We encourage applications for participation in the developer workshop. The workshop's aim is to bring together developers of GPU-based applications for neuroscience and to enable exchange of ideas, knowledge, and code. Enthusiastic users of GPU-based tools with programming skills are also warmly invited. The number of participants in the workshop is limited to 20.

    Invited symposium speakers will also be present at the developer workshop.

    Application and registration:

    To apply for a presentation slot at the symposium, send us an abstract (approx. 500 words) of your presentation. A note with your name and affiliation is sufficient if you only want to register for the symposium. To apply for the developer workshop, please send a us a short letter of motivation stating your background, why you want to participate, and what you could contribute to the workshop.

    Contact:

    Direct your applications, registrations and any questions to gpu-computing@g-node.org .

    Deadline for application: 28 Feb 2012

    Workshop website: https://portal.g-node.org/gpu-workshop-2012/Current information about speakers will be posted there.

    Dates:

    April 11, 2012 (Symposium)
    April 12-13, 2012 (Developer Workshop)

    Venue:

    LMU Biocenter
    Großhaderner Str. 2
    82152 Planegg-Martinsried

    Hope to see you in Munich in April!

    The organizers

    Michael Schmuker, Freie Universität Berlin & BCCN Berlin
    Christian Kellner and Thomas Wachtler, G-Node, LMU München

    by Andrew Davison (noreply@blogger.com) at February 17, 2012 11:04 AM