August 03, 2015

Matthew Rocklin

Caching

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: Caching improves performance under repetitive workloads. Traditional LRU policies don’t fit data science well. We propose a new caching policy.

Humans Repeat Stuff

Consider the dataset that you’ve worked on most recently. How many times have you loaded it from disk into memory? How many times have you repeated almost the same computations on that data?

Exploratory data science workloads involve repetition of very similar computations. These computations share structure. By caching frequently used results (or parts of results) we may be able to considerably speed up exploratory data analysis.

Caching in other contexts

The web community loves caching. Database backed web applications almost always guard their database lookups with a system like memcached which devotes some amount of memory to caching frequent and recent queries. Because humans visit mostly the same pages over and over again this can reduce database load by an order of magnitude. Even if humans visit different pages with different inputs these pages often share many elements.

Limited caching resources

Given infinite memory we would cache every result that we’ve ever computed. This would give us instant recall of anything that wasn’t novel. Sadly our memory resources are finite and so we evict cached results that don’t seem to be worth keeping around.

Traditionally we use a policy like Least Recently Used (LRU). This policy evicts results that have not been requested for a long time. This is cheap and works well for web and systems applications.

LRU doesn’t fit analytic workloads

Unfortunately LRU doesn’t fit analytic workloads well. Analytic workloads have a large spread of computation times and of storage costs. While most web application database queries take roughly the same amount of time (100ms-1000ms) and take up roughly the same amount of space to store (1-10kb), the computation and storage costs of analytic computations can easily vary by many orders of magnitude (spreads in the millions or billions are common).

Consider the following two common computations of a large NumPy array:

  1. x.std() # costly to recompute, cheap to store
  2. x.T # cheap to recompute, costly to store

In the first case, x.std(), the computation might take a second on a large array (somewhat expensive) but the result takes only a few bytes to store. This result is so cheap to store that we’re happy to keep it in our cache for a long time, even if it’s infrequently requested.

In the second case, x.T, the result is cheap to compute (just a metadata change in the array) and executes in microseconds. However it might take gigabytes of memory to store. We don’t want to keep this in our cache, even if it’s very frequently requested; it takes up all of the space for other potentially more useful (and smaller) results and we can recompute it trivially anyway.
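A rough way to see this asymmetry (a quick sketch; timings will vary by machine):

import time
import numpy as np

x = np.random.random((10000, 10000))    # ~800 MB of float64

t0 = time.time()
s = x.std()                             # full reduction over 10**8 elements
print(time.time() - t0, s.nbytes)       # on the order of a second, 8 bytes to store

t0 = time.time()
xt = x.T                                # metadata-only view, no data copied
print(time.time() - t0, xt.nbytes)      # microseconds, ~800 MB if we cached a copy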

So we want to keep cached results that have the following properties:

  1. Costly to recompute (in seconds)
  2. Cheap to store (in bytes)
  3. Frequently used
  4. Recently used

Proposed Caching Policy

Here is an alternative to LRU that respects the objectives stated above.

Every time someone accesses an entry in our cache, we increment the score associated with that entry by the following value:

    score += compute_time / nbytes * (1 + epsilon) ** t

where compute_time is the time it took to compute the result in the first place, nbytes is the number of bytes that it takes to store the result, epsilon is a small number that determines the halflife of what “recently” means, and t is an auto-incrementing tick time increased at every access.

This has units of inverse bandwidth (s/byte), gives more importance to new results through a slowly growing exponential weight, and amplifies the score of frequently requested results in a roughly linear fashion.

We maintain these scores in a heap, keep track of the total number of bytes, and cull the cache as necessary to keep storage costs beneath a fixed budget. Updates cost O(log(k)) for k the number of elements in the cache.
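A minimal sketch of this bookkeeping (illustrative only, not cachey’s actual implementation; for brevity it evicts with a linear scan rather than the heap described above):

class ScoredCache:
    def __init__(self, available_bytes, halflife=1000):
        self.available_bytes = available_bytes
        self.eps = 1.0 / halflife   # controls how quickly "recent" decays
        self.tick = 0               # auto-incrementing access counter
        self.scores = {}            # key -> score
        self.data = {}              # key -> (value, nbytes)
        self.total_bytes = 0

    def put(self, key, value, compute_time, nbytes):
        self.tick += 1
        # inverse-bandwidth score, inflated for recent accesses
        self.scores[key] = (self.scores.get(key, 0)
                            + compute_time / nbytes * (1 + self.eps) ** self.tick)
        if key not in self.data:
            self.data[key] = (value, nbytes)
            self.total_bytes += nbytes
        self._shrink()

    def _shrink(self):
        # evict the lowest-scored entries until we fit within the budget
        while self.total_bytes > self.available_bytes:
            key = min(self.scores, key=self.scores.get)
            _, nbytes = self.data.pop(key)
            del self.scores[key]
            self.total_bytes -= nbytes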

Cachey

I wrote this up into a tiny library called cachey. This is experimental code and subject to wild API changes (including renaming.)

The central object is a Cache, which asks for the following:

  1. Number of available bytes to devote to the cache
  2. A halflife on importance (the number of accesses that must occur to reduce the importance of a cached result by half) (default 1000)
  3. A lower limit on costs to consider entry to the cache (default 0)

Example

Here is a tiny example:

>>> from cachey import Cache
>>> c = Cache(available_bytes=1e9)  # 1 GB

>>> c.put('Hello', 'world!', cost=3)

>>> c.get('Hello')
'world!'

More interesting example

The cache includes a memoize decorator. Let’s memoize pd.read_csv.

In [1]: import pandas as pd
In [2]: from cachey import Cache

In [3]: c = Cache(1e9)
In [4]: read_csv = c.memoize(pd.read_csv)

In [5]: %time df = read_csv('accounts.csv')
CPU times: user 262 ms, sys: 27.7 ms, total: 290 ms
Wall time: 303 ms

In [6]: %time df = read_csv('accounts.csv')  # second read is free
CPU times: user 77 µs, sys: 16 µs, total: 93 µs
Wall time: 93 µs

In [7]: c.total_bytes / c.available_bytes
Out[7]: 0.096

So we create a new function read_csv that operates exactly like pandas.read_csv except that it holds on to recent results in c, a cache. This particular CSV file created a dataframe that filled a tenth of our cache space. The more often we request this CSV file the more its score will grow and the more likely it is to remain in the cache into the future. If other memoized functions using this same cache produce more valuable results (costly to compute, cheap to store) and we run out of space then this result will be evicted and we’ll have to recompute our result if we ask for read_csv('accounts.csv') again.

Memoize everything

Just memoizing read_csv isn’t very interesting; pd.read_csv operates at a fairly constant data bandwidth of around 100 MB/s. The caching policies in cachey really shine when they get to see all of our computations. For example, it could be that we don’t want to hold on to the results of read_csv because they take up a lot of space. If we find ourselves repeating the same groupby computations then we might prefer to use our gigabyte of caching space to store these instead (a sketch follows the list below), both because:

  1. groupby computations take a long time to compute
  2. groupby results are often very compact in memory
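For illustration, a hedged sketch of memoizing a derived computation as well as the raw load (the 'name' and 'amount' columns are made up for the example):

import pandas as pd
from cachey import Cache

c = Cache(available_bytes=1e9)

read_csv = c.memoize(pd.read_csv)

def totals_by_name(filename):
    df = read_csv(filename)
    return df.groupby('name').amount.sum()   # slow to compute, tiny to store

totals_by_name = c.memoize(totals_by_name)

totals_by_name('accounts.csv')   # slow the first time
totals_by_name('accounts.csv')   # likely served from the cache afterwards

Because the groupby result is costly in seconds and cheap in bytes, the policy above will tend to keep it around even if the raw dataframe gets evicted.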

Cachey and Dask

I’m slowly working on integrating cachey into dask’s shared memory scheduler.

Dask is in a good position to apply cachey to many computations. It can look at every task in a task graph and consider the result for inclusion into the cache. We don’t need to explicitly memoize every function we want to use; dask can do this for us on the fly.

Additionally dask has a nice view of our computation as a collection of sub-tasks. Similar computations (like mean and variance) often share sub-tasks.
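As a rough sketch of what that integration might look like (based on the work-in-progress pull requests; the exact names could change):

import dask.array as da
from dask.cache import Cache   # wraps a cachey cache around the scheduler

cache = Cache(1e9)   # 1 GB budget for opportunistic caching
cache.register()     # task results that score well are kept automatically

x = da.random.random((10000, 10000), chunks=(1000, 1000))
x.mean().compute()   # no explicit memoization needed
x.std().compute()    # shared sub-task results may be reused from the cache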

Future Work

Cachey is new and untested but potentially useful now, particularly through the memoize method shown above. It’s a small and simple codebase. I would love to hear if people find this kind of caching policy valuable.

I plan to think about the following in the near future:

  1. How do we build a hierarchy of caches to share between memory and disk?
  2. How do we cleanly integrate cachey into dask (see PR #502) and how do we make dask amenable to caching (see PR #510)?

August 03, 2015 12:00 AM

July 28, 2015

Matthieu Brucher

Audio Toolkit: Different low pass filters

There are several different low pass filters, and as many high pass, band pass, band stop… filters. In Audio Toolkit, several of the usual implementations are available:

  • Bessel
  • Butterworth
  • Chebyshev type 1
  • Chebyshev type 2
  • Second order
  • Linkwitz-Riley
  • RBJ

and it is possible to implement others, and other orders, as well…

Here is a display of the behavior of these filters (order 2 for those that can have different orders), in amplitude and phase:

Low-pass filter comparison

What is interesting is that for low pass filters, the Butterworth filter is the same as the second order filter implementation. Linkwitz-Riley filters have a lower amplitude than the others, and Chebyshev type 2 behaves quite differently from all the others (its prototype is different: it is based on a band-stop design rather than a band-pass one).
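For readers who want to reproduce part of this comparison outside of Audio Toolkit, here is a sketch using SciPy's analog filter designs (order 2; the ripple and stop-band values are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt
from scipy import signal

wc = 1000.0 * 2 * np.pi   # 1 kHz cutoff, in rad/s
designs = {
    'Butterworth': signal.butter(2, wc, 'low', analog=True),
    'Bessel': signal.bessel(2, wc, 'low', analog=True),
    'Chebyshev 1 (1 dB ripple)': signal.cheby1(2, 1, wc, 'low', analog=True),
    'Chebyshev 2 (20 dB stop)': signal.cheby2(2, 20, wc, 'low', analog=True),
}

for name, (b, a) in designs.items():
    w, h = signal.freqs(b, a, worN=np.logspace(2, 6, 500))
    plt.semilogx(w / (2 * np.pi), 20 * np.log10(np.abs(h)), label=name)

plt.xlabel('Frequency (Hz)')
plt.ylabel('Amplitude (dB)')
plt.legend()
plt.show()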

With all these options, I think Audio Toolkit has all the bases covered.

by Matt at July 28, 2015 07:21 AM

July 27, 2015

Martin Fitzpatrick

Create simple GUI Applications with Python and Qt

So, for some unknown reason I recently thought it might be good to start putting together some video tutorials for Python. I thought I’d start with something I was sure I knew a fair bit about: writing GUI applications with PyQt (+PySide).

(Course image: create-simple-gui-applications.jpeg)

It turned out to be a lot more work than I thought, but rewarding to get the finished product. Like most programming tutorials it is predominantly screencast based, but I’ve included short lecture-like segments in each part to get across the key points. For those interested in such things, the setup I used was —

  • Screen recording - ScreenFlow (Mac)
  • Microphone - Samson GO Mic USB
  • Audio editing - Audacity

For hosting the course I thought I’d try out Udemy. The list price is $35 but if you’re reading this you can get it for free. There is also a subset of the lectures on YouTube which will be extended with more free content as the main course grows.

The course is built around communicating the basics needed to get up and running, and then moves on to create a demo application — a tabbed web browser — in PyQt. Since it’s my first attempt at doing this I’m more interested in getting feedback on:

  • Is it any good?
  • How can I improve it?

I would also be interested to hear your thoughts on Udemy as a platform - both on course quality and value for money. It’s obvious that the pricing strategy there is price high and discount hard which seems a bit disingenuous. But there is no arguing with the audience that it brings.

So sign up and let me know what you think, in either a review, a message or the comments.

by Martin Fitzpatrick at July 27, 2015 07:00 PM

Wei Xue

GSoC Week 8, 9 and Progress Report 2

Week 8 and 9

In weeks 8 and 9, I implemented DirichletProcessGaussianMixture, but its behavior looks similar to BayesianGaussianMixture. Both of them can infer the best number of components. DirichletProcessGaussianMixture took slightly more iterations than BayesianGaussianMixture to converge on the Old Faithful data set, around 60 iterations.

If we solve a Dirichlet process mixture by Gibbs sampling, we don't need to specify the truncation level T; the concentration parameter $\alpha$ is enough. On the other hand, with variational inference, we still need to specify the maximal possible number of components, i.e., the truncation level.
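As a concrete illustration of the truncation level, here is roughly what this looks like with the variational Dirichlet process interface that later shipped in scikit-learn (0.18+); the class names and defaults here are not the work-in-progress GSoC ones:

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

X = np.random.randn(500, 2)   # toy data

# n_components plays the role of the truncation level T: an upper bound
# on how many components the variational approximation may use.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type='dirichlet_process',
    weight_concentration_prior=1.0,   # the concentration parameter alpha
)
dpgmm.fit(X)
print(dpgmm.weights_.round(3))   # unused components end up with ~zero weight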

At first, the lower bound of DirichletProcessGaussianMixture seemed a little strange. It was not always going up: when some clusters disappeared, it went down a little bit, then went straight up. I thought it was because the estimation of the parameters is ill-posed when these clusters have fewer data samples than the number of features. I did the math derivation of Dirichlet process mixture models again, and found it was a bug in the coding of a very long equation.

I also finished the code of BayesianGaussianMixture for 'tied', 'diag' and 'spherical' precision.

My mentor pointed out the style problems in my code and docstrings. I knew the PEP 8 convention, but had no idea there was also a convention for docstrings, PEP 257. It took me a lot of time to fix the style problems.

Progress report 2

During the last 5 weeks (since progress report 1), I finished the following:

  1. GaussianMixture with four kinds of covariance
  2. Most test cases of GaussianMixture
  3. BayesianGaussianMixture with four kinds of covariance
  4. DirichletProcessGaussianMixture

Although I spent some time on unsuccessful attempts, such as decoupling the observation models and hidden models into mixin classes and double-checking the DP equations, I did finish the most essential part of my project and did some visualization. In the following 4 weeks, I will finish all the test cases for BayesianGaussianMixture and DirichletProcessGaussianMixture, and work on some optional tasks, such as different covariance estimators and incremental GMM.

July 27, 2015 02:59 PM

July 26, 2015

Titus Brown

jclub: Bloom Filter Trie - a data structure for pan-genome storage

Note: this is a blog post from the DIB Lab journal club.

Jump to Questions and Comments.


The paper:

http://www.techfak.uni-bielefeld.de/~stoye/dropbox/wabi2015final.pdf

"Bloom Filter Trie: a data structure for pan-genome storage."

by Guillaume Holley, Roland Wittler, and Jens Stoye.

Background

  • Pan Genome: Represents genes in a clade that comprises hundreds/thousands of strains that share large sequence parts but differ by individual mutations from one another.
  • Colored De Bruijn Graph (C-DBG) - A directed graph where each vertex represents a kmer and is associated with a color that represents the genome(s) in which the kmer occurs. An edge between vertex x and vertex y exists if x[2..k] = y[1..k-1] (sketched below).
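A tiny sketch of the edge condition (1-based x[2..k] = y[1..k-1] is just a suffix/prefix overlap check):

def cdbg_edge(x, y):
    # an edge x -> y exists iff the (k-1)-suffix of x equals the (k-1)-prefix of y
    return x[1:] == y[:-1]

# colors: which genomes each kmer occurs in, e.g. a set of genome indices per kmer
colors = {'ACG': {0, 2}, 'CGT': {0}}
assert cdbg_edge('ACG', 'CGT')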

Existing pan-genome Data structures

  • Burrows-Wheeler Transform (BWT) - rearranges the input data to enable better compression by aggregating characters with similar contexts.
  • FM-Index - counts and locates the occurrences of a substring in the BWT.
  • Bloom Filter (BF) - a bit array of n elements initialized with zeros; it uses a set of hash functions for look-up and insertion in constant time (a minimal sketch follows this list).
  • Sequence Bloom Tree (SBT) - a binary tree with BFs as vertices. A query sequence is broken into a set of kmers which are checked against a vertex's bloom filter. If they are present, the search is repeated on the children; if not, the search proceeds to other vertices.
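A minimal, non-optimized Bloom filter sketch (salted hashing is just one way to get several hash functions):

import hashlib

class BloomFilter:
    def __init__(self, n_bits=2**20, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        for i in range(self.n_hashes):
            h = hashlib.sha1(("%d:%s" % (i, item)).encode()).hexdigest()
            yield int(h, 16) % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add("ACGTACG")
assert "ACGTACG" in bf   # no false negatives; false positives are possible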

Paper Scope: New pan-genome Data Structure

Proposes a new data structure for indexing and compressing a pan-genome as a C-DBG, the Bloom Filter Trie (BFT). The trie stores kmers and their colors. The new representation allows for compressing and indexing shared substrings. It also allows quick checking for the existence of a given substring. It is incremental, alignment and reference free, and allows for any format.

How the BFT Works

Colors Representation Before Compression

Colors are represented by a bit array initialized with 0s. Each color has an index i_color: color_x[i_color] = 1 means kmer x has color i_color.
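In code, this uncompressed representation is just a bit set per kmer (a sketch; Python integers are used here as growable bit arrays):

def set_color(color_bits, i_color):
    return color_bits | (1 << i_color)

def has_color(color_bits, i_color):
    return bool(color_bits & (1 << i_color))

colors_of_x = 0
colors_of_x = set_color(colors_of_x, 3)   # kmer x occurs in genome 3
assert has_color(colors_of_x, 3)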

Colors Representation After Compression

If we can put each color set at a specific position in an array such that this position encodes into a number smaller than the color set size, then we can store the color set in the BFT using less space. The color set size is the number of kmers sharing the color multiplied by its size. So basically, they sort the color sets in a list by size in decreasing order, then add each color set to an external array incrementally (if the integer encoding the position of the color set in the array is less than the size of the color set). Finally, each color set is replaced in the BFT by its position in the external array.

Uncompressed Container

A set of tuples <s, color_ps> with capacity c, where the kmer x = p·s and p is the path from the root to v.

When the number of suffixes exceeds c, the container is burst.

In the bursting process, each suffix s is split into a prefix s_pref and a suffix s_suff.

Prefixes are stored in a new compressed container.

Suffixes and their colors are stored in new uncompressed containers.

Compressed Container and Bursting Procedure

An uncompressed container is replaced by a compressed one that stores q suffix prefixes, with links to the children containing the suffixes.

  • quer is a BF, represented as an m-bit array with f hash functions, that records and filters the presence of the q suffix prefixes.
  • The q suffix prefixes are stored in the bloom filter; the q suffixes are stored in an array suf, in the lexicographic order of their prefixes.
  • pref[𝛂] = 1 if a prefix a exists with 𝛂 as its binary representation.
  • clust is an array of q bits, one per suffix in array suf. A cluster is a list of consecutive suffixes in suf that share the same prefix.

BFT Operations

Look up

To check whether a suffix prefix ab, with binary representation 𝛂𝜷, exists in a compressed container cc, the BF quer is checked and the existence of prefix a is verified in the array pref. Remember that suffix prefixes are stored in quer during the bursting process.

If a exists, then its hamming weight is computed, which gives the index of the cluster in which suffix b is likely located, with i the start position of that cluster. Remember that a cluster is a list of consecutive suffixes that share the same prefix, so b is compared to the suffixes in that cluster.
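In other words, pref doubles as a rank structure: the hamming weight (popcount) of pref up to position 𝛂 tells which cluster of suf to search. A rough sketch of that idea (not the paper's exact bookkeeping):

def cluster_index(pref_bits, alpha):
    # number of set bits at positions <= alpha, minus one, is the index of
    # the cluster of suffixes whose prefix encodes to alpha
    return sum(pref_bits[:alpha + 1]) - 1

pref_bits = [0, 1, 0, 1, 1, 0]   # prefixes with binary values 1, 3 and 4 are present
assert cluster_index(pref_bits, 3) == 1   # prefix 3 maps to the second cluster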

To check whether a kmer x exists in a BFT t, the look-up process traverses t and, for each vertex v, checks its containers one after another. Remember that suffixes and their colors are stored in uncompressed containers during the bursting process, hence a vertex now either represents a suffix from an uncompressed container or a suffix rooted in its compressed container.

  • For the first case, since the uncompressed container has no children, a match indicates the presence of the kmer.
  • For the second case, quer is checked for x_suff[1..l]. If it is found, the traversal continues recursively into the corresponding children. The absence of x_suff[1..l] means the absence of the kmer, as it cannot be located in another container of vertex v. Remember that k is a multiple of l, so a kmer decomposes into k/l equal-length substrings.

Insertion

If the kmer already exists, only its set of colors is modified. Otherwise, the lookup process continues until:

  • The prefix of the searched suffix does not exist
  • The kmer suffix does not exist

Then the kmer is inserted. Insertion is simple if the container is uncompressed. If the container is compressed, the insertion of s_pref = ab is more involved:

Remember that in the look-up process, the existence of the prefix a is verified by checking the pref array. If it does not exist, the quer hit was a false positive, and we can insert now by setting pref[𝛂] to 1, so that the next look-up will be a true positive; the index and the start position pos of the cluster are recomputed and updated. If it does exist, then the suffix b is inserted into suf[pos].

Evaluation

Experiments presented in the paper show that BFT is faster than SBT while utilizing less memory.

  • KmerGenie was used to get the optimal k size and minimal kmer count
  • Data insertion (loading) and kmer query times were compared between SBT and BFT. Traversal time was also evaluated for BFT.
  • BFT was multiple times faster than the SBT on the building time while using about 1.5 times less memory. The BFT was about 30 times faster than the SBT for querying a k-mer.

Questions and Comments:

  • Essentially, a nice fast data structure for querying for k-mers and retrieving their colors. I guess this is for pangenomes, among other things.

  • They essentially use compressed nodes in the tree to efficiently store prefixes for large sections of the tree.

  • We worry about the peak mem usage diagram. It seems like a fair amount of memory is used in the making. How does this compare to the SBT? Do they compare peak memory usage or merely compressed memory usage?

  • It seems like one advantage that the SBT has is that with the BFT you cannot store/query for presence in individual data sets. So, for example, if you wanted to build indices for data sets spread across many different machines, you would have to do it by gathering all of the data sets in one place.

  • Both SBT and BFT get their compression mainly from bloom filters. The authors did not discuss why there is a difference in compression ratio. Bloom filter size? The FP rate of the bloom filters used in SBT was mentioned as 7.2%, but the FP rate of the bloom filters in BFT was not mentioned in the paper.

  • Another catch in the evaluation is that 1) the loading CPU time difference between SBT and BFT in Table 1 may come from kmer counting (Jellyfish vs. KMC2); 2) when comparing unique kmer query time, the unique kmers were divided into subsets due to a memory limit. Not sure whether this was a fair comparison.

  • How does the false positive rate of all the bloom filters (on all nodes) affect the overall error rate? E.g., if the BFT is converted back to k-mers, how many sequence errors are there? (None, we think.)

  • PanCake (alignment based) and RCSI (Reference based) were mentioned but not included in evaluation, which gave us the impression that they are not as efficient. Do they have any advantage?

  • BFT or SBT vs. khmer? (mentioned in intro but not discussed)

  • Pan genome (and transcriptome, proteome!) storage is super cool. (Might not be a relevant question here, but I am wondering:) How are genomes defined as "highly similar", which is what the authors restricted their test data sets to? At what point do species diverge to become too distant to analyze in this manner? E.g. how close is close, and what is too far?

    (CTB answer: it has something to do with how many k-mers they share, but I don't know that this has been really quantified. Kostas Konstantinidis et al's latest work on species defn might be good reading (http://nar.oxfordjournals.org/content/early/2015/07/06/nar.gkv657.full) as well as his Average Nucleotide Identity metric.)

  • Wondering how the BFT might scale. The authors only tested prokaryotic sequences: 473 clinical isolates of Pseudomonas aeruginosa from 34 patients = 844.37 GB. Simulated data were 6 million reads of 100 b length, for 31 GB. In comparison, the MMETSP data are transcriptomic data from 678 cultured samples of 306 marine eukaryotic species representing more than 40 phyla (see Figure 2, Keeling et al. 2014). Not sure how large the entire MMETSP data set is, but probably on the order of TB? http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001889 http://www.ncbi.nlm.nih.gov/nuccore?term=231566%5BBioProject%5D

  • Although they discussed SBT as an existing data structure, and graphalign in khmer, it wasn't clear until the end that one of the main goals of the paper, besides describing the BFT, was to compare their BFT to the SBT (Solomon and Kingsford 2015, http://biorxiv.org/content/biorxiv/early/2015/03/26/017087.full.pdf). I feel this should have been noted in the Abstract.

  • Speed can partly come from being able to abort searches for k-mers partway through.

  • BFT is really specialized for the pangenome situation, where many k-mers are in common. The cluster approach will break down if the genomes aren't mostly the same?

  • We would have liked a more visual representation of the data structure to help build intuition.

Points for clarification or discussion:

  • c is defined as capacity, but this is not well-described. What is capacity?
  • BFT to khmer graphalign comparison?

by Sherine Awad, Lisa Cohen, Jiarong Guo, and C. Titus Brown at July 26, 2015 10:00 PM

July 24, 2015

Abraham Escalante

A glimpse into the future (my future, of course)

Hello again,

Before I get started I just want to let you know that in this post I will talk about the future of my career and moving beyond the GSoC so this will only be indirectly related to the summer of code.

As you may or may not know, I will start my MSc in Applied Computing at the University of Toronto in September (2015, in case you're reading this in the future). Well, I have decided to steer towards topics like Machine Learning, Computer Vision and Natural Language Processing.

While I still don't know what I will end up having as my main area of focus nor where this new journey will take me, I am pretty sure it will have to do with Data Science and Python. I am also sure that I will keep contributing to SciPy and most likely start contributing to other related communities like NumPy, pandas and scikit-learn, so you could say that the GSoC has had a positive impact by helping me find areas that make my motivation soar and introducing me to people who have been working in this realm for a very long time and know a ton of stuff that makes me want to pick up a book and learn.

In my latest meeting with Ralf (my mentor), we had a discussion regarding the growing scope of the GSoC project and my concern about dealing with all the unforeseen and ambiguous details that arise along the way. He seemed oddly pleased when I proposed to keep in touch with the project even after the "pencils down" date for the GSoC. He then explained that this is the purpose of the summer of code (to bring together students and organisations) and that their hope when they choose a student to participate is that he/she will become a long-term active member of the community, which is precisely what I would like to do.

I have many thanks to give and there is still a lot of work to be done with the project so I will save the thank you speech for later. For now I just want to say that this has been a great experience and I have already gotten more out of it than I had hoped (which was a lot).

Until my next post,
Abraham.

by noreply@blogger.com (Abraham Escalante) at July 24, 2015 07:42 PM

Progress Report

Hello all,

A lot of stuff has happened in the last couple of weeks. The project is coming along nicely and I am now getting into some of the bulky parts of it.

There is an issue with the way NaN (not a number) checks are handled that spans beyond SciPy. Basically, there is no consensus on how to deal with NaN values when they show up. In statistics they are often assumed to be missing values (e.g. there was a problem when gathering statistical data and the value was lost), but there is also the IEEE NaN which is defined as 'undefined' and can be used to indicate out-of-domain values that may point to a bug in one's code or a similar problem.

Long story short, the outcome of this will largely depend on the way projects like pandas and NumPy decide to deal with it in the future, but right now for SciPy we decided that we should not get into the business of assuming that NaN values signify 'missing', because that is not always the case and it may end up silently hiding bugs, leading to incorrect results without the user's knowledge. Therefore, I am now implementing a backwards compatible API addition that will allow the user to define whether to ignore NaN values (assume they are missing), treat them as undefined, or raise an exception. This is a long-term effort that may span the entire stats module and beyond, so the work I am doing now is set to spearhead future development.
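For a flavour of what that API addition looks like in practice (using the keyword as it eventually shipped in scipy.stats; the exact name was still settling at the time of writing):

import numpy as np
from scipy import stats

a = np.array([1.0, 2.0, np.nan, 4.0])
b = np.array([1.5, 2.5, 3.5, np.nan])

print(stats.ttest_ind(a, b, nan_policy='propagate'))   # default: NaNs flow into the result
print(stats.ttest_ind(a, b, nan_policy='omit'))        # treat NaNs as missing data

try:
    stats.ttest_ind(a, b, nan_policy='raise')          # error out on NaN input
except ValueError as e:
    print(e)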

Another big issue is the consistency of the `scipy.stats` module with its masked array counterpart `scipy.stats.mstats`. The implementation will probably not be complicated, but it encompasses somewhere around 60 to 80 functions, so I expect it to be a large and time-consuming effort. I expect to work on this for the next month or so.

During the course of the last month or two there have been some major developments in my life that are indirectly related to the project, so I feel like they should be addressed, but I intend to do so in a separate post. For now I bid you farewell and thank you for reading.

Cheers,
Abraham.

by noreply@blogger.com (Abraham Escalante) at July 24, 2015 06:21 PM

July 23, 2015

Jake Vanderplas

Learning Seattle's Work Habits from Bicycle Counts (Updated!)

Update, July 25, 2015: I included some new plots suggested by my colleague Ariel Rokem. Scroll to the end!

Last year I wrote a blog post examining trends in Seattle bicycling and how they relate to weather, daylight, day of the week, and other factors.

Here I want to revisit the same data from a different perspective: rather than making assumptions in order to build models that might describe the data, I'll instead wipe the slate clean and ask what information we can extract from the data themselves, without reliance on any model assumptions. In other words, where the previous post examined the data using a supervised machine learning approach for data modeling, this post will examine the data using an unsupervised learning approach for data exploration.

Along the way, we'll see some examples of importing, transforming, visualizing, and analyzing data in the Python language, using mostly Pandas, Matplotlib, and Scikit-learn. We will also see some real-world examples of the use of unsupervised machine learning algorithms, such as Principal Component Analysis and Gaussian Mixture Models, in exploring and extracting meaning from data.

To spoil the punchline (and perhaps whet your appetite) what we will find is that from analysis of bicycle counts alone, we can make some definite statements about the aggregate work habits of Seattleites who commute by bicycle.

The Data

The data we will use here are the hourly bicycle counts on Seattle's Fremont Bridge. These data come from an automated bicycle counter, installed in late 2012, which has inductive sensors under the sidewalks on either side of the bridge. The daily or hourly bicycle counts can be downloaded from http://data.seattle.gov/; here is the direct link to the hourly dataset. To download the data directly, you can uncomment the following curl command:

In [1]:
# !curl -o FremontBridge.csv https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD

Once this is downloaded, we can use the pandas library to load the data:

In [2]:
import pandas as pd
data = pd.read_csv('FremontBridge.csv', index_col='Date', parse_dates=True)
data.head()
Out[2]:
Fremont Bridge West Sidewalk Fremont Bridge East Sidewalk
Date
2012-10-03 00:00:00 4 9
2012-10-03 01:00:00 4 6
2012-10-03 02:00:00 1 1
2012-10-03 03:00:00 2 3
2012-10-03 04:00:00 6 1

We'll do some quick data cleaning: we'll rename the columns to the shorter "West" and "East", set any missing values to zero, and add a "Total" column:

In [3]:
data.columns = ['West', 'East']
data.fillna(0, inplace=True)
data['Total'] = data.eval('East + West')

We can get a better idea of the dataset as a whole through a simple visualization; for example, we can resample the data to see the weekly trend in trips over the nearly three-year period:

In [4]:
# first some standard imports
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()  # plot styling
import numpy as np

data.resample('W', how='sum').plot()
plt.ylabel('weekly trips');

The counts show both a strong seasonal variation, as well as a local structure that can be partially accounted for by temperature, time of year, precipitation, and other factors.

Extracting Knowledge from the Data

From here, we could do a variety of other visualizations based on our intuition about what might affect bicycle counts. For example, we could look at the effect of the days of the week, the effect of the weather, and other factors that I explored previously. But we could also proceed by letting the dataset speak for itself, and use unsupervised machine learning techniques (that is, machine learning without reference to data labels) to learn what the data have to tell us.

We will consider each day in the dataset as its own separate entity (or sample, in usual machine learning parlance). For each day, we have 48 observations: two observations (east and west sidewalk sensors) for each of the 24 hour-long periods. By examining the days in light of these observations and doing some careful analysis, we should be able to extract meaningful quantitative statements from the data themselves, without the need to lean on any other assumptions.

Transforming the Data

The first step in this approach is to transform our data; essentially we will want a two-dimensional matrix, where each row of the matrix corresponds to a day, and each column of the matrix corresponds to one of the 48 observations. We can arrange the data this way using the pivot_table() function in Pandas. We want the "East" and "West" column values, indexed by date, and separated by hour of the day. Any missing values we will fill with zero:

In [5]:
pivoted = data.pivot_table(['East', 'West'],
                           index=data.index.date,
                           columns=data.index.hour,
                           fill_value=0)
pivoted.head()
Out[5]:
East ... West
0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
2012-10-03 9 6 1 3 1 10 50 95 146 104 ... 77 72 133 192 122 59 29 25 24 5
2012-10-04 11 0 6 3 1 11 51 89 134 94 ... 63 73 114 154 137 57 27 31 25 11
2012-10-05 7 4 3 2 2 7 37 101 119 81 ... 63 80 120 144 107 42 27 11 10 16
2012-10-06 7 5 2 2 1 2 15 16 47 55 ... 89 115 107 107 41 40 25 18 14 15
2012-10-07 5 5 1 2 2 3 8 12 26 36 ... 126 122 132 118 68 26 19 12 9 5

5 rows × 48 columns

Next we extract the raw values and put them in a matrix:

In [6]:
X = pivoted.values
X.shape
Out[6]:
(1001, 48)

Our data consists of just over 1000 days, each with the aforementioned 48 measurements.

Visualizing the Data

We can think of this data now as representing 1001 distinct objects which live in a 48-dimensional space: the value of each dimension is the number of bicycle trips measured on a particular side of the bridge at a particular hour. Visualizing 48-dimensional data is quite difficult, so instead we will use a standard dimensionality reduction technique to project this to a more manageable size.

The technique we'll use is Principal Component Analysis (PCA), a fast linear projection which rotates the data such that the projection preserves the maximum variance. We can ask for components preserving 90% of the variance as follows:

In [7]:
from sklearn.decomposition import PCA
Xpca = PCA(0.9).fit_transform(X)
Xpca.shape
Out[7]:
(1001, 2)

The output has two dimensions, which means that these two projected components describe at least 90% of the total variance in the dataset. While 48-dimensional data is difficult to plot, we certainly know how to plot two-dimensional data: we'll do a simple scatter plot, and for reference we'll color each point according to the total number of trips taken that day:

In [8]:
total_trips = X.sum(1)
plt.scatter(Xpca[:, 0], Xpca[:, 1], c=total_trips,
            cmap='cubehelix')
plt.colorbar(label='total trips');

We see that the days lie in two quite distinct groups, and that the total number of trips increases along the length of each projected cluster. Further, the two groups begin to be less distinguishable when the number of trips during the day is very small.

I find this extremely interesting: from the raw data, we can determine that there are basically two primary types of days for Seattle bicyclists. Let's model these clusters and try to figure out what these types-of-day are.

Unsupervised Clustering

When you have groups of data you'd like to automatically separate, but no previously-determined labels for the groups, the type of algorithm you are looking at is a clustering algorithm. There are a number of clustering algorithms out there, but for nicely-defined oval-shaped blobs like we see above, Gaussian Mixture Models are a very good choice. We can compute the Gaussian Mixture Model of the data using, again, scikit-learn, and quickly plot the predicted labels for the points:

In [9]:
from sklearn.mixture import GMM
gmm = GMM(2, covariance_type='full', random_state=0)
gmm.fit(Xpca)
cluster_label = gmm.predict(Xpca)
plt.scatter(Xpca[:, 0], Xpca[:, 1], c=cluster_label);

This clustering seems to have done the job, and separated the two groups we are interested in. Let's join these inferred cluster labels to the initial dataset:

In [10]:
pivoted['Cluster'] = cluster_label
data = data.join(pivoted['Cluster'], on=data.index.date)
data.head()
Out[10]:
West East Total Cluster
Date
2012-10-03 00:00:00 4 9 13 0
2012-10-03 01:00:00 4 6 10 0
2012-10-03 02:00:00 1 1 2 0
2012-10-03 03:00:00 2 3 5 0
2012-10-03 04:00:00 6 1 7 0

Now we can find the average trend by cluster and time using a GroupBy within this updated dataset

In [11]:
by_hour = data.groupby(['Cluster', data.index.time]).mean()
by_hour.head()
Out[11]:
West East Total
Cluster
0 00:00:00 5.312139 6.213873 11.526012
01:00:00 2.713873 2.969653 5.683526
02:00:00 2.294798 1.732659 4.027457
03:00:00 1.570809 1.426301 2.997110
04:00:00 4.179191 2.650289 6.829480

Finally, we can plot the average hourly trend among the days within each cluster:

In [12]:
fig, ax = plt.subplots(1, 2, figsize=(14, 5))
hourly_ticks = 4 * 60 * 60 * np.arange(6)

for i in range(2):
    by_hour.ix[i].plot(ax=ax[i], xticks=hourly_ticks)
    ax[i].set_title('Cluster {0}'.format(i))
    ax[i].set_ylabel('average hourly trips')

These plots give us some insight into the interpretation of the two clusters: the first cluster shows a sharp bimodal traffic pattern, while the second shows a wide unimodal pattern.

In the bimodal cluster, we see a peak around 8:00am which is dominated by cyclists on the west sidewalk, and another peak around 5:00pm which is dominated by cyclists on the east sidewalk. This is very clearly a commute pattern, with the majority of cyclists riding toward downtown Seattle in the morning, and away from downtown Seattle in the evening.

In the unimodal cluster, we see fairly steady traffic in each direction beginning early in the morning and going until late at night, with a peak around 2:00 in the afternoon. This is very clearly a recreational pattern of use, with people out riding through the entire day.

I find this is fascinating: from simple unsupervised dimensionality reduction and clustering, we've discovered two distinct classes of days in the data, and found that these classes have very intuitive explanations.

Seattle's Work Habits

Let's go one step deeper and figure out what we can learn about people (well, bicycle commuters) in Seattle from just this hourly commute data. As a rough approximation, you might guess that these two classes of data might be largely reflective of workdays in the first cluster, and non-work days in the second. We can check this intuition by re-plotting our projected data, except labeling them by day of the week:

In [13]:
dayofweek = pd.to_datetime(pivoted.index).dayofweek
plt.scatter(Xpca[:, 0], Xpca[:, 1], c=dayofweek,
            cmap=plt.cm.get_cmap('jet', 7))
cb = plt.colorbar(ticks=range(7))
cb.set_ticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
plt.clim(-0.5, 6.5);

We see that the weekday/weekend intuition holds, but only to a point: in particular, it is clear that there are a handful of weekdays which follow the typical weekend pattern! Further, it's interesting to note that Fridays tend to be pulled closer to weekend days in this plot, though as a whole they still fall solidly in the work-day cluster.

Let's take a closer look at the "special" weekdays that fall in the "wrong" cluster. We start by constructing a dataset listing the cluster id and the day of the week for each of the dates in our dataset:

In [14]:
results = pd.DataFrame({'cluster': cluster_label,
                        'is_weekend': (dayofweek > 4),
                        'weekday': pivoted.index.map(lambda x: x.strftime('%a'))},
                       index=pivoted.index)
results.head()
Out[14]:
cluster is_weekend weekday
2012-10-03 0 False Wed
2012-10-04 0 False Thu
2012-10-05 0 False Fri
2012-10-06 1 True Sat
2012-10-07 1 True Sun

First, let's see how many weekend days fall in the first, commute-oriented cluster

In [15]:
weekend_workdays = results.query('cluster == 0 and is_weekend')
len(weekend_workdays)
Out[15]:
0

zero! Apparently, there is not a single weekend during the year where Seattle cyclists as a whole decide to go to work.

Similarly, we can see how many weekdays fall in the second, recreation-oriented cluster:

In [16]:
midweek_holidays = results.query('cluster == 1 and not is_weekend')
len(midweek_holidays)
Out[16]:
23

There were 23 weekdays over the past several years in which Seattle cyclists as a whole did not go to work. To label these, let's load the US Federal holiday calendar available in Pandas:

In [17]:
from pandas.tseries.holiday import USFederalHolidayCalendar
cal = USFederalHolidayCalendar()
holidays = cal.holidays('2012', '2016', return_name=True)
holidays.head()
Out[17]:
2012-01-02                 New Years Day
2012-01-16    Dr. Martin Luther King Jr.
2012-02-20                Presidents Day
2012-05-28                   MemorialDay
2012-07-04                      July 4th
dtype: object

Just for completeness, we will add to the list the day before and day after each of these holidays:

In [18]:
holidays_all = pd.concat([holidays,
                         "Day Before " + holidays.shift(-1, 'D'),
                         "Day After " + holidays.shift(1, 'D')])
holidays_all = holidays_all.sort_index()
holidays_all.head()
Out[18]:
2012-01-01                 Day Before New Years Day
2012-01-02                            New Years Day
2012-01-03                  Day After New Years Day
2012-01-15    Day Before Dr. Martin Luther King Jr.
2012-01-16               Dr. Martin Luther King Jr.
dtype: object

Note that these are observed holidays, which is why New Years Day 2012 falls on January 2nd. With this ready to go, we can compute the complete list of non-weekend days on which Seattle bicycle commuters as a whole chose to stay home from work:

In [19]:
holidays_all.name = 'name'  # required for join
joined = midweek_holidays.join(holidays_all)
set(joined['name'])
Out[19]:
{'Christmas',
 'Day After Christmas',
 'Day After Thanksgiving',
 'Day Before Christmas',
 'July 4th',
 'Labor Day',
 'MemorialDay',
 'New Years Day',
 'Thanksgiving'}

On the other side of things, here are the Federally recognized holidays where Seattle bicycle commuters chose to go to work anyway:

In [20]:
set(holidays) - set(joined.name)
Out[20]:
{'Columbus Day',
 'Dr. Martin Luther King Jr.',
 'Presidents Day',
 'Veterans Day'}

Update: What's up with Fridays?

A colleague of mine, Ariel Rokem, saw the first version of this post and noticed something interesting. For the most part, Fridays tend to lie on the upper side of the weekday cluster, closer in this parameter space to the typical weekend pattern. This pattern holds nearly universally for Fridays, all except for three strange outliers which lie far on the other side of the cluster.

We can see these more clearly if we highlight the Friday points in the plot:

In [21]:
fridays = (dayofweek == 4)
plt.scatter(Xpca[:, 0], Xpca[:, 1], c='gray', alpha=0.2)
plt.scatter(Xpca[fridays, 0], Xpca[fridays, 1], c='yellow');

The yellow points in the bottom-left of the plot are unique – they're far different from the other Fridays, and they even stand out in comparison to the other work days! Let's see what they represent:

In [22]:
weird_fridays = pivoted.index[fridays & (Xpca[:, 0] < -600)]
weird_fridays
Out[22]:
Index([2013-05-17, 2014-05-16, 2015-05-15], dtype='object')

All three of these outlying Fridays fall in the middle of May. Curious!

Let's quickly visualize the daily stats for these, along with the mean trend over all days. We can arrange the data this way with a pivot table operation:

In [23]:
all_days = data.pivot_table('Total', index=data.index.time, columns=data.index.date)
all_days.loc[:, weird_fridays].plot();
all_days.mean(1).plot(color='gray', lw=5, alpha=0.3,
                      xticks=hourly_ticks);

Apparently these three strange Fridays are days with extreme amounts of bicycle commuting. But what makes them so special?

After some poking-around on the internet, the answer becomes clear: we've discovered Seattle's annual bike to work day. Mystery solved!

Summary

We have seen here that by taking a close look at raw bicycle counts and using some basic visualization and unsupervised machine learning, we can make some very definite statements about the overall work habits of people in Seattle who bicycle to work across the Fremont bridge. In summary, this is what we have learned:

  • Seattle cyclists, as a whole, tend to take off work New Year's Day, Memorial Day, Independence Day, Labor Day, and the days surrounding Thanksgiving and Christmas.
  • Seattle cyclists, as a whole, tend to head to the office on the more minor US holidays: Columbus Day, Martin Luther King Jr. Day, Presidents Day, and Veterans Day.
  • Seattle cyclists, as a whole, would never, ever, be caught at work on a weekend.

Thanks for reading!

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

by Jake Vanderplas at July 23, 2015 03:00 PM

Continuum Analytics

Find Continuum Analytics at PyData Seattle!

Starting Friday, July 24th, the Continuum Analytics team will be in Washington state for PyData Seattle. Join us at the Microsoft Conference Center for three days of Python tutorials, talks, and social events.

by Continuum at July 23, 2015 02:00 PM

Empowering Data Scientists for the Journey

Three years ago, Travis and I founded a company with a mission to promote and foster the growth of better tools for computational science. Both of us are scientists by training, and both of us have been math and science nerds from childhood. Neither of us can resist geeking out on math and physics when we chat with customers and users about their analytical problems.

by Peter Wang at July 23, 2015 10:00 AM

Continuum Analytics news

Continuum Analytics Secures $24 Million Series A Round to Empower Next Phase of Data Science

General Catalyst Partners and BuildGroup Co-Invest

Enterprise-Ready Anaconda Platform Supports New Generation of Data-Driven Companies


AUSTIN, TX. — July 23, 2015 — Continuum Analytics, which develops Anaconda, the leading modern open source analytics platform powered by Python, has secured a $24 million Series A funding round led by General Catalyst Partners and BuildGroup. The funds will be leveraged to accelerate product development, expand sales and marketing, and continue to invest in strengthening the Python and Anaconda communities. The latest round of $24 million in funding comes after $10 million in initial funding from a combination of angel investors and a DARPA award, bringing total funding to-date to $34 million.

by Continuum at July 23, 2015 07:00 AM

Matthew Rocklin

Custom Parallel Workflows

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We motivate the expansion of parallel programming beyond big collections. We discuss the usability of custom dask graphs.

Recent Parallel Work Focuses on Big Collections

Parallel databases, Spark, and Dask collections all provide large distributed collections that handle parallel computation for you. You put data into the collection, program with a small set of operations like map or groupby, and the collections handle the parallel processing. This idea has become so popular that there are now a dozen projects promising big and friendly Pandas clones.

This is good. These collections provide usable, high-level interfaces for a large class of common problems.

Custom Workloads

However, many workloads are too complex for these collections. Workloads might be complex either because they come from sophisticated algorithms (as we saw in a recent post on SVD) or because they come from the real world, where problems tend to be messy.

In these cases I tend to see people do two things

  1. Fall back to multiprocessing, MPI or some other explicit form of parallelism
  2. Perform mental gymnastics to fit their problem into Spark using a clever choice of keys. These cases often fail to achieve much speedup.

Direct Dask Graphs

Historically I’ve recommended the manual construction of dask graphs in these cases. Manual construction of dask graphs lets you specify fairly arbitrary workloads that then use the dask schedulers to execute in parallel. The dask docs hold the following example of a simple data processing pipeline:

def load(filename):
    ...
def clean(data):
    ...
def analyze(sequence_of_data):
    ...
def store(result):
    ...

dsk = {'load-1': (load, 'myfile.a.data'),
       'load-2': (load, 'myfile.b.data'),
       'load-3': (load, 'myfile.c.data'),
       'clean-1': (clean, 'load-1'),
       'clean-2': (clean, 'load-2'),
       'clean-3': (clean, 'load-3'),
       'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
       'store': (store, 'analyze')}

from dask.multiprocessing import get
get(dsk, 'store')  # executes in parallel

Feedback from users is that this is interesting and powerful but that programming directly in dictionaries is not intuitive, doesn’t integrate well with IDEs, and is prone to error.

Introducing dask.do

To create the same custom parallel workloads using normal-ish Python code we use the dask.do function. This do function turns any normal Python function into a delayed version that adds to a dask graph. The do function lets us rewrite the computation above as follows:

from dask import do

loads = [do(load)('myfile.a.data'),
         do(load)('myfile.b.data'),
         do(load)('myfile.c.data')]

cleaned = [do(clean)(x) for x in loads]

analysis = do(analyze)(cleaned)
result = do(store)(analysis)

The explicit function calls here don’t perform work directly; instead they build up a dask graph which we can then execute in parallel with our choice of scheduler.

from dask.multiprocessing import get
result.compute(get=get)

This interface was suggested by Gael Varoquaux based on his experience with joblib. It was implemented by Jim Crist in PR (#408).

Example: Nested Cross Validation

I sat down with a machine learning student, Gabriel Krummenacher, and worked to parallelize a small piece of code that does nested cross validation. Below is a comparison of a sequential implementation and a version that has been parallelized using dask.do:

You can safely skip reading this code in depth. The take-away is that it’s somewhat involved but that the addition of parallelism is light.

Parallelized cross validation code

The parallel version runs about four times faster on my notebook. Disclaimer: the sequential version presented here is just a downgraded version of the parallel code, which is why they look so similar. This is available on github.
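Purely as a hedged sketch of the pattern (toy stand-ins, not Krummenacher's code, and not a full nested cross validation), turning an ordinary loop of model evaluations into a dask graph with do looks something like this:

import numpy as np
from dask import do
from dask.multiprocessing import get

def fit_and_score(train, test, params):
    return np.random.random()          # stand-in for real fitting and scoring

def best_params(scores, grid):
    return grid[int(np.argmax(scores))]

data = np.arange(100)
folds = [(data[20:], data[:20]), (data[:80], data[80:])]
param_grid = [0.1, 1.0, 10.0]

outer_scores = []
for train, test in folds:              # ordinary Python loops
    inner = [do(fit_and_score)(train, test, p) for p in param_grid]
    best = do(best_params)(inner, param_grid)
    outer_scores.append(do(fit_and_score)(train, test, best))

test_score = do(np.mean)(outer_scores)
print(test_score.compute(get=get))     # executes the whole graph in parallel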

So the result of our normal imperative-style for-loop code is a fully parallelizable dask graph. We visualize that graph below.

test_score.visualize()

Cross validation dask graph

Help!

Is this a useful interface? It would be great if people could try this out and generate feedback on dask.do.

For more information on dask.do see the dask imperative documentation.

July 23, 2015 12:00 AM

July 21, 2015

Titus Brown

DIB jclub: Fast and sensitive mapping of error-prone nanopore sequencing reads with GraphMap

Note: at the Lab for Data Intensive Biology, we're trying out a new journal club format where we summarize our thoughts on the paper in a blog post. For this blog post, Luiz wrote the majority of the text and the rest of us added questions and comments.


The paper:

Fast and sensitive mapping of error-prone nanopore sequencing reads with GraphMap.

Relevant background:

Summary

This paper presents a new read mapper, GraphMap.

The method is described as a five-stage read funneling approach, which successively reduces the number and complexity of the decisions that need to be made.

Stage 1: Region selection

Gapped spaced seeds are not the same as in our last journal club, but they're an interesting strategy for selecting seeds from which to extend alignments.

Based on Levenshtein distance, it uses gaps inside k-mers to consider mismatches and indels. Three shapes are used, with one or two positions being DC ("don't care") bases (a small sketch follows the list):

  • 1110111: DC base can be either a match or mismatch
  • 111111: one deletion at the specified position (?)
  • 11100111: DC base and following base are skipped. At most one insertion and one match/mismatch.
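A sketch of how a gapped shape masks a k-mer, so that a mismatch at a don't-care position still produces the same seed (positions marked '0' are the DC bases):

def apply_shape(kmer, shape):
    # keep only the positions marked '1' in the seed shape
    return ''.join(base for base, bit in zip(kmer, shape) if bit == '1')

# ACGTACG and ACGAACG differ only at the DC position of shape 1110111,
# so they produce the same seed and still generate a hit
assert apply_shape('ACGTACG', '1110111') == apply_shape('ACGAACG', '1110111')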

(Can we use longer shapes? These ones are for fairly small _k_; if we can extend the idea to arbitrary _k_ it might be useful for seeding in graphalign.)

Hough transforms are used in a clever way to bin seeds into viable regions, but it depends on reference coordinates (so it is not so useful for us in the absence of a decent reference).

Stage 2: Graph-based vertex-centric construction of anchors

The 'graph' part of GraphMap comes from the next step, where the seeds are processed to construct alignment anchors. Given a target and a query, a "kmer mapping graph" (a DAG) is built for the target. In a "kmer mapping graph" distinct nodes can represent the same k-mer. For example, the first step in constructing a 2-mer mapping graph for CTAATATC would be CT -> TA -> AA -> AT -> TA -> AT -> TC (note nodes TA, AT appearing twice). Then, for each vertex an edge is added for every successor until the last node is reached.
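A literal (and hedged) sketch of that construction, not GraphMap's code:

def kmer_mapping_graph(seq, k):
    # one node per position, so repeated k-mers get distinct nodes;
    # each node gets an edge to every one of its successors
    nodes = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    edges = [(i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))]
    return nodes, edges

nodes, edges = kmer_mapping_graph('CTAATATC', 2)
print(nodes)   # ['CT', 'TA', 'AA', 'AT', 'TA', 'AT', 'TC'] (TA and AT repeat)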

These graphs are small, since the target and query are a read sequence and a single region of the reference (for memory consumption purposes, read sequences are usually smaller and so are used as the target). A new index is constructed for the target on the fly, with a much smaller seed size (defaults to k=6). For k < 10 perfect hashing is used, for k >= 10 a suffix array.

After the graph is ready, the mapping works by walking along target and query simultaneously. The query is processed as a sliding window, and an edge is followed to extend the walk each step in the target, while keeping track of all walks corresponding to potential mapping sites.

There are similarities to how partial order alignment works, but how is this stage any different from just doing DP?

Stage 3: Extending anchors into alignments using LCS

(nothing here)

Stage 4: Refining alignments using $L_1$ linear regression / Stage 5: Construction of final alignment

Just a summary of stages 4 and 5: after we have extended anchors in stage 3, we have a set of points representing the alignments from LCSk, mixed with noise (indels, sequencing errors, etc.). To refine these alignments, we need to draw a line that best fits these points. This is done using linear regression, which fits the alignment's "predictive model" from these observed points (the list of anchors).

The points that lie within a given distance d_L1 on either side of the line represent our best alignments; points that deviate more are discarded.

Then the standard deviation of the anchors from this line, the number of exact kmers covered by anchors around the line, the length of the query that matched the target, the number of bases covered by anchors (normalized) and the read length are used to compute an f score. The region with the highest f score is picked for the final alignment.

(I think the reference coordinates c can be estimated from the position on the read and the position of the hit, so we can still use the Hough transform?)

Stages 4 and 5 seem heavyweight.


Questions and comments:

  • We are unclear on how the index would be constructed in the absence of a reference, in the case of a de novo assembly. The last paragraph of the Discussion states: "GraphMap’s sensitivity and specificity as a mapper could thus serve as the basis for fast computation of overlap alignments and de novo assemblies in the future." But the algorithm from Figure 1b and Methods Stage I (region selection) seems to be based on seed finding between query and reference. What is used as the reference in the absence of a reference?
  • GraphMap showed improved runtime (Suppl Table 2) compared to "gold standard" BLAST alignment. In Figure 2 with synthetic data, data platform type made the biggest difference in GraphMap performance compared to "gold-standard" BLAST aligner. Oxford Nanopore 2D data (double-stranded) had consistently among the highest precision relative to other platforms, although PacBio was close behind in both C. elegans and H. sapiens sets. Interesting that genome size (from 2.2 Mbp = N. meningitidis to 198 Mbp = H. sapiens ch3) didn't make much difference in precision (mapped location on read; seed hit, k-point) or recall (alignment) (https://en.wikipedia.org/wiki/Precision_and_recall). Recall was much lower in all ONT data regardless of species.
  • Impressed with Figure 3a, the difference between the GraphMap and LAST mapping tools in consensus sequences (colored = low confidence, grey = higher confidence). We would have liked to see their BWA-MEM and BLASR results, although Figure 3b suggests LAST was closer to GraphMap at higher coverage.
  • Interesting applications are outlined in Results. Is a benefit of GraphMap to reduce the pipeline resources required for ONP data? Mick Watson suggests BLAST (https://biomickwatson.wordpress.com/2015/06/01/analysing-minion-data-with-no-bioinformatics-expertise-nor-infrastructure/).
  • Reasons why we care include the new ONT technology in field applications, e.g. identifying pathogens in a remote location with a local install. For the species predictions in Table 3, F1 (the harmonic mean of precision and recall) is higher for GraphMap. There is a need for more testing with real ONT data (just 3 species were tested in this paper) and with higher complexity, e.g. pathogenic microbial eukaryotes?
  • We were a bit surprised that longest common subsequence works so well with ONT data, but that is presumably why they apply it only to the subsequences extracted after the graph-based approach.
  • "Our comparisons with BLAST suggest that reads that cannot be mapped by GraphMap may essentially be unmappable." Did they characterize these reads at all?
  • What's the memory usage? Largely or entirely unmentioned.
  • We were confused by the gapped spaced seeds / gapped q-gram filter material (p. 10).
  • We do not think they tested genome scaling appropriately; they need to show an example for, say, a whole human genome. As Lisa noticed, there is no change in precision with their bigger genomes.
  • The clinical application is very interesting. They did not compare the precision of other mappers using strain-specific sequences.

by Luiz Irber, Sherine Awad, Camille Scott, Lisa Cohen, Tamer Mansour, C. Titus Brown at July 21, 2015 10:00 PM

Matthieu Brucher

Announcement: ATKSideChainCompressor

I’m happy to announce the release of a side-chain stereo compressor based on the Audio Toolkit. It is available on Windows and OS X (min. 10.8) in different formats. This stereo compressor can work on two channels, left/right or middle/side, possibly in linked mode (only one set of parameters), and can be set up to mix the input signal with the compressed signal (serial/parallel compression). The side chain channels can be used to steer the gain stage (the same setup will be used, right/left or middle/side).

ATKSideChainCompressor

The supported formats are:

  • VST2 (32bits/64bits on Windows, 64bits on OS X)
  • VST3 (32bits/64bits on Windows, 64bits on OS X)
  • Audio Unit (64bits, OS X)

Direct link for ATKSideChainCompressor.

The files, as well as the previous plugins and the source code, can be downloaded from SourceForge.


by Matt at July 21, 2015 07:16 AM

Announcement: Audio TK 0.7.0

The focus of this release was performance. As such, the core functions were optimized, as well as parts of the Tools and EQ modules.

A new filter dedicated to fast convolution (using a fixed-size partition with a mix of FFT convolution and explicit FIR filter) with 0 latency was added.

Download link: ATK 0.7.0

Changelog:

0.7.0
* Fixed the FFT implementation
* Added a fast zero latency convolution filter with tests and comparison with a basic FIR implementation
* Enhanced global framework performance (Core, EQ, Tools)
* Enhanced dynamic gain filters behavior by switching between LUT or direct computation depending on the LUT state


by Matt at July 21, 2015 07:08 AM

July 19, 2015

Titus Brown

A year-old editorial regarding computational infrastructure sustainability

Note: A year ago, I wrote this in response to an editorial request. Ultimately they weren't interested in publishing it, and I got distracted and this languished on my hard disk. So when I remembered it recently, I decided to just push it out to my blog, where I should have put it in the first place. --titus

All things come to an end eventually, including funding for computational services. What is a field to do?

As biology steadily becomes more and more data intensive, the need for community-wide analysis software, databases, data curation, and Internet services grows. While writing software and gathering data are both expensive, the cost of maintaining software and curating data sets over time can dwarf the upfront costs; in the software industry, for example, software maintenance and support costs can be 10- or 100-fold the initial investment to develop the software. Despite this, there is often little sustained academic funding for software maintenance, data curation, and Internet service support; in part, this may be because the maintenance costs for existing data and software could easily consume all of the available funding! Yet this lack of infrastructure is increasingly problematic, as data set sizes and curation needs grow, and tools to interpret and integrate data sets become ever more critical to forward progress in biology. How can we develop and maintain robust infrastructure services while enabling creative development of new resources and software?

The volume, velocity, and variety of data in biology is stunning, and would challenge even the handling capacity of more established data-intensive fields with larger computational investments. For example, by analogy with astronomy, Golden et al. (2012) propose bringing processing pipelines closer to the data generating instrument (in this case, the sequencing machine). While this approach can certainly help address the volume and velocity of sequencing data, it fails to address the variety -- there are dozens of types of sequencing output, with perhaps hundreds of different processing pipelines, the choice of which depends critically on the biological system being analyzed and the questions being asked of the data. Some subfields of biology may well be able to standardize -- for example, variation analysis for the human genome is increasingly using only a few processing pipelines -- but for environmental sequencing, the types of systems and the metadata being gathered are extremely diverse and nowhere near standardization. We must recognize that our knowledge of the natural biological world is so shallow, and the data gathering needs so great, that the field is very immature compared to other data-intensive sciences like particle physics and astronomy.

How can we build sustained computational analysis and data storage services, in the face of increasingly large and diverse biological data sets, with fast-moving analysis needs? This question has been brought into sharp relief in recent years, with the lapses in funding of TAIR, Tranche, and CAMERA.

While substantial investments have been made in a variety of genomic and transcriptomic analysis services, only a few projects have achieved sustained funding independent of large host institutes. Chief among these are the biomedically relevant projects, which include Wormbase, Flybase, and SGD, all of which have been funded for well over a decade by the NIH. Many others, including iPlant Collaborative and KBase, are in a ramp-up phase and are still exploring options for long-term support. With rare exceptions, it is safe to say that no large cyberinfrastructure effort has successfully weaned itself from continued large-scale support from a granting agency - and some have failed to find this continued funding, and have no clear future.

The challenges for sustainability of cyberinfrastructure are significant. The necessary mix of data storage, research software development, database curation, and service hosting requires substantial and diverse computational expertise, large compute resources, and extensive community involvement to ensure relevance. Even individually, these can be hard to find, and yet projects often try to combine all four of these: to a large extent they buy their own hardware, manage it with their own infrastructure software, develop their own research analysis software, store their data, and curate their databases. Hybrid models exist -- for example, iPlant Collaborative works with a number of external computational biologists to develop and integrate tools -- but these efforts are often centrally managed and continue to require substantial funding for this integration.

Another challenge is that of maintaining innovation in algorithm and software development while continuing to provide robust services. Many innovative computational tools have emerged from small labs and become more broadly useful, but it can be hard for small labs to engage with large, centralized infrastructure projects. Moreover, even in these models, the larger efforts can only engage deeply with a few collaborators; these choices privilege some tools over others, and may not be based on technical merit or community need. This may also arise from the tension between engineering and research needs: large projects prize engineering stability, while research innovation is inherently unstable.

There is the hope of a more sustainable path, rooted in two approaches -- one old, and one new. The old and proven approach is that of open source. The open source community has existed for almost half a century now, and has proven to be remarkably capable: open source languages such as R and Python are widely used in data analysis and modeling, and the Linux operating system dominates scientific computing. Moreover, the open source workflow closely tracks the ideal of a scientific community, with a strong community ethic, widespread collaboration, and high levels of reproducibility and good computational practice (Perez and Millman, 2014). The new approach is cloud computing, where the advent of ubiquitous virtualization technology has made it possible for entire companies to dynamically allocate disk and compute infrastructure as needed with no upfront hardware cost. Open source approaches provide an avenue for long-term research software sustainability, while cloud computing allows cyberinfrastructure projects to avoid upfront investment in hardware and lets them grow with their needs.

Interestingly, two notable exceptions to the cyberinfrastructure sustainability dilemma exploit both open source practices and cloud computing. The Galaxy Project develops and maintains an open source Web-based workflow interface that can be deployed on any virtual machine, and in recent years has expanded to include cloud-enabled services that lets users manage larger clusters of computers for more compute-intensive tasks. Importantly, users pay for their own compute usage in the cloud: tasks that consume more compute resources will cost more. Since Galaxy is also locally deployable, heavy users can eschew the cost of the cloud by installing it on existing local compute resources. And, finally, large providers such as iPlant Collaborative can host Galaxy instances for their user communities. Likewise, the Synapse project is an open source project developed by Sage Bionetworks that hosts data and provenance information for cooperative biomedical analysis projects. While less broadly used than Galaxy, Synapse is -- from an infrastructure perspective -- infinitely expandable: the Sage Bionetworks developers rely entirely on the Amazon cloud to host their infrastructure, and scale up their computing hardware as their computing needs increase.

A general approach using open source and cloud computing approaches could separate data set storage from provision of services, active database curation, and software development. One example could look like this: first, long-term community-wide cyberinfrastructure efforts would focus on static data storage and management, with an emphasis on building and extending metadata standards and metadata catalogs. These efforts would place data in centralized cloud storage, accessible to everyone. There, separately funded data services would actively index and serve the data to address the questions and software stacks of specific fields. In tandem, separately funded new data curation and research software development efforts would work to refine and extend capabilities.

If we follow this path, substantial upfront investment in tool development and data curation will still be needed -- there's no such thing as a free lunch. However, when the project sunsets or funding lapses, with the open source/open data route there will still be usable products at the end. And, if it all rests on cloud computing infrastructure, communities can scale their infrastructure up or down with their needs and only pay for what they use.

Funders can help push their projects in this direction by requiring open data and open source licenses, encouraging or even requiring multiple deployments of the core infrastructure on different cloud platforms, and ultimately by only funding projects that build in sustainability from the beginning. Ultimately, funders must request, and reviewers require, an end-of-life plan for all infrastructure development efforts, and this end-of-life plan should be "baked in" to the project from the very beginning.

In the end, providing robust services while developing research software and storing and curating data is both challenging and expensive, and this is not likely to change with a top-down funding or management paradigm. We need new processes and approaches that enable bottom-up participation by small and large research groups; open approaches and cloud computing infrastructure can be a solution.

by C. Titus Brown at July 19, 2015 10:00 PM

July 17, 2015

Titus Brown

A review of "Tools and techniques for computational reproducibility"

Note: Last week, I submitted my review of Stephen R. Piccolo, Adam B. Lee, and Michael B. Frampton's paper, Tools and techniques for computational reproducibility. Soon after, Dan Katz wrote a blog post about notebooks, and in a comment I mentioned Piccolo's paper; and, after dropping a note to Dr. Piccolo, he put it up on bioRxiv! This, in turn, meant that I felt comfortable posting my review - since I sign my reviews, the authors already know who I am, so where's the harm? See below.


In this paper, Piccolo et al. do a nice (and I think comprehensive?) job of outlining six strategies for computational reproducibility. The point is well made that science is increasingly dependent on computational reproducibility (and that in theory we should be able to do computational reproducibility easily and well) and hence we should explore effective approaches that are actually being used.

I know of no other paper that covers this array of material, and this is a quite nice exposition that I would recommend to many. I can't evaluate how broadly it will appeal to a diverse audience but it seems very readable to me.

The following comments are offered as helpful suggestions, not criticisms -- make of them what you will.

The paper almost completely ignores HPC. I'm good with that, but it's a bit surprising (since many computational scientists seem to think that reproducible orchestration of many processors is an unachievable task). Noted in passing.

I was somewhat surprised by the lack of emphasis of version control systems. These are really critical in programming for ensuring reproducibility. I also found a missing citation! You should look at journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745 (yes, sorry, I'm on the paper).

Speaking of which, I appreciate the completeness of references (and even the citation of my blog post ;) but it would be interesting to see if Millman and Perez have anything to offer: http://www.jarrodmillman.com/oss-chapter.html. Certainly a good citation (I think you hit the book, but this is a particularly good chapter.)

I would suggest (in the section that mentions version control systems, ~line 170 of p9) recommending that authors "tag" specific versions for the publication, even if they later recommend using updated versions. (Too many people say "use this repo!" without specifying a revision.)

The section on literate programming could usefully mention that these literate programming environments do not offer good mechanisms for long running programs, so they may not be appropriate for things that take more than a few minutes to run.

Also, and perhaps most important, these literate programming environments provide a REPL and can thus track exploratory data analysis and "harden" it when it works and the author moves on to another data analysis - so even if the authors don't want to clean up their notebook before publication, you can track exactly how they got their final results. I think this is important for practical reproducibility. I don't know quite what to suggest in the context of the paper but it seems like an important point to me.

Both the virtual machine and container sections should mention the challenges of raw data bundling, which is one of the major drawbacks here - not only is the VM large, but (unless you are partnering with e.g. Amazon to "scale out") you must distribute potentially large data sets. I think this is one of the biggest practical issues facing data intensive sciences. (There was a nice commentary recently by folk in human genomics begging the NIH to make human genomic data available via the cloud; I can track it down if the authors haven't seen it.)

I think it's important to emphasize how transparent most Dockerfiles are (and how this is a different culture than the VM deployment scene, where configuration systems are often not particularly emphasized except in the devops community). I view this as one of the most important cultural differences driving container adoption, and for once it's good for science!

The docker ecosystem also seems quite robust, which is important, I think.

[ ... specific typos etc omitted ... ]

Signed,

C. Titus Brown, ctbrown@ucdavis.edu

by C. Titus Brown at July 17, 2015 10:00 PM

July 16, 2015

Matthieu Brucher

Convert HPCToolkit files to callgrind format

After my post on HPCToolkit, I felt that I preferred QCacheGrind as a GUI to explore profiling results. So here is a gist with a Python script to convert XML HPCToolkit experiments to the callgrind format: https://gist.github.com/mbrucher/6cad31e38beca770523b

For instance, this is a display of an Audio Toolkit test of L2 cache misses:
ATK L2 cache misses profile

Enjoy!

by Matt at July 16, 2015 07:05 PM

July 14, 2015

Matthieu Brucher

Profiling with HPC Toolkit

I’ve started working with the HPC Toolkit, especially the part where it can use PAPI to read hardware counters and display the result in a nice GUI.

Installation

Installation is a little bit of a pain, but it is simplified somewhat, as all external dependencies are available in a single repository, except for PAPI, which you have to install by hand.

What is also handy is that all the GUIs are available on all platforms and can be used right away. The not-so-handy part is that HPC Toolkit can be used on different platforms, but PAPI cannot. And as I like PAPI…

Usage

Usage is really simple. Multithreaded applications are handled, with all threads and even MPI processes tracked separately; in the viewer, the results are shown with a specific [thread, process] index. This is really neat.

But before you can do that, there are a couple of commands to execute. The first one is hpcstruct, which just creates an XML file with all the known functions inside, named binary.hpcstruct:

hpcstruct binary

After this, the profiling can be done:

hpcrun -t -e PAPI_TOT_CYC@15000000 -e PAPI_L1_TCM@400000 -e PAPI_L2_TCM@400000 -e WALLCLOCK@5000 -o outputfolder binary arguments

Here there are 3 PAPI counters and one general counter (WALLCLOCK is relevant because current processors have a variable frequency, meaning that the total number of cycles is not enough to infer the time actually spent). After the @ comes the sampling period. I haven’t said yet what kind of profiling PAPI provides; it is now clear that it is sampling. Sampling means that from time to time the application stops and the profiler looks at its state. If the sampling period is too short, the profile will be biased and will show profiling errors, or it can even break if the period is so small that the profiler gets hit while it is writing the report…

A list of available counters can be obtained with hpcrun -L, or papi_avail -a for PAPI counters.

Once this is done, the profile has to be converted into a “database”:

hpcprof outputfolder -S binary.hpcstruct -I include_folder

The include folder is relevant if you want to browse through the source folders while analyzing your profile. There are two applications that can be used to review profiles, available on all platforms: hpcviewer or hpctraceviewer.

Here is an example of hpcviewer:

HPCViewer main window

There are three ways of displaying results:

  • Top-down, starting with the top functions and diving into more specific functions, following the call tree
  • Bottom-up, starting with all the leaves of the call tree, a really informative view that captures the biggest offenders at a glance
  • A flat view, file by file, then function by function and finally line by line

Discussion

There is one main warning about these trees: if you compiled in optimized mode with GCC, some layers will be missing where GCC inlined functions. So it takes some agility to actually understand what happens. Another solution is to use the Intel compiler, which has an option to properly generate debug info. The case where this bugs me the most is lambda functions, as these get displayed as an operator() function! Without the source code, there is no way of knowing where such a function actually comes from.

Another thing to remember when checking values is what the third view actually shows. Each time a sample is taken, the program is at some point in its execution, and the whole sampling period is attributed to the line where the program was stopped/sampled. It’s like ergodicity for a random variable: we assume that, over enough samples, averaging over the samples is the same as averaging over time. So the total profiling time has to be large compared to the number of functions profiled and the sampling period.

Also, when comparing with other tools like Parallel Studio or Valgrind, I miss at least a bar allowing me to compare the cost of different functions (whether for implicit or explicit counters). I think a callgrind-like profile can be created out of the HPC Toolkit profile. For the moment, it’s an XML file, but they plan on moving to another, binary format…

I thought I could do something with the CSV export, but it only exports timings for unfolded lines, without line numbers or file names and without any hierarchy. It is doable from the original database, though, as profiles are stored as XML and the callgrind file format is open.

One key advantage of HPC Toolkit over Callgrind is the way the profile is computed (sampling vs. emulation): profiles are faster to create, and they show a good picture of what happens within the processor, which an emulation profile can’t exactly show (it depends on the model of the processor, which is never exactly the right model). But it is not reproducible, as it depends on the CPU load, so more care is required to analyze these profiles. Also, it is only usable on Linux…

It is a great complement to Valgrind/Callgrind, and the fact that it is open source means that there is no excuse for not profiling your application for bottlenecks.


by Matt at July 14, 2015 07:19 AM

July 13, 2015

Wei Xue

GSoC Week 6/7

In weeks 6 and 7, I coded BayesianGaussianMixture for the full covariance type. Now it runs smoothly on synthetic data and the Old Faithful data. Take a peek at the demo.

from sklearn.mixture.bayesianmixture import BayesianGaussianMixture as BGM
bgm = BGM(n_init=1, n_iter=100, n_components=7, verbose=2, init_params='random',
         precision_type='full')
bgm.fit(X)

BayesianGaussianMixture on old-faithful dataset. n_components=6, alpha=1e-3

The demo repeats the experiment of PRML, page 480, Figure 10.6. VB on BGMM has shown its capability of inferring the number of components automatically. It converged in 47 iterations.

The lower bound of the log-likelihood, a.k.a ELBO

The ELBO looks a little weird. It does not always go up: when some clusters disappear, the ELBO goes down a little bit, then goes straight up again. I think this is because the estimation of the parameters is ill-posed when those clusters have fewer data samples than the number of features.

BayesianGaussianMixture has many more parameters than GaussianMixture; there are six parameters per component. I feel it is not easy to keep so many functions and parameters under control. The initial design of BaseMixture is also not so good. I took a look at bnpy, which is a more complicated implementation of VB on various mixture models. Though I don't need such a complicated implementation, the decoupling of the observation model, i.e. $X$, $\mu$, $\Lambda$, from the mixture model, i.e. $Z$, $\pi$, is quite nice. So I tried to use mixin classes to represent these two models. I split MixtureBase into three abstract classes: ObsMixin, HiddenMixin and MixtureBase(ObsMixin, HiddenMixin). I also implemented subclasses for the Gaussian mixture: ObsGaussianMixin(ObsMixin), MixtureMixin(HiddenMixin), GaussianMixture(MixtureBase, ObsGaussianMixin, MixtureMixin), but Python does not allow me to do this because there is no consistent MRO. :-| I changed them back, but this unsuccessful experiment gave me a nice base class, MixtureBase.

I also tried to use cached_property to store intermediate variables such as $\ln \pi$, $\ln \Lambda$, and the Cholesky-decomposed $W^{-1}$, but didn't get much benefit. It is almost the same as saving these variables as private attributes on the instances.

The numerical issue comes from responsibilities that are extremely small: when estimating resp * log resp, it gives NaN. I simply avoid the computation when resp < 10 * EPS. Still, the ELBO seems suspicious.
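
A minimal sketch of that guard (illustrative only; the function name is ours and this is not the actual branch code). Since resp * log(resp) tends to 0 as resp tends to 0, entries below the threshold can simply be set to zero:

    import numpy as np

    EPS = np.finfo(float).eps

    def resp_log_resp(resp):
        """Compute resp * log(resp), treating entries below 10*EPS as zero
        so that (nearly) empty components do not produce NaN."""
        out = np.zeros_like(resp)
        mask = resp > 10 * EPS
        out[mask] = resp[mask] * np.log(resp[mask])
        return out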

The current implementation of VBGMM in scikit-learn cannot learn the correct parameters on old-faithful data.

VBGMM(alpha=0.0001, covariance_type='full', init_params='wmc',
   min_covar=None, n_components=6, n_iter=100, params='wmc',
   random_state=None, thresh=None, tol=0.001, verbose=0)

It gives only one component. The weights_ are

 array([  7.31951611e-07,   7.31951611e-07,   7.31951611e-07,
         7.31951611e-07,   7.31951611e-07,   9.99996340e-01])

I also implemented DirichletProcessGaussianMixture, but currently it looks the same as BayesianGaussianMixture. Both of them can infer the best number of components, though DirichletProcessGaussianMixture took slightly more iterations than BayesianGaussianMixture. If we infer a Dirichlet process mixture by Gibbs sampling, we don't need to specify the truncation level; only alpha, the concentration parameter, is enough. But with variational inference, we still need to give the model the maximal possible number of components, i.e., the truncation level $T$.

July 13, 2015 09:17 PM

July 10, 2015

Titus Brown

A response to "The myths of bioinformatics software"

This is a response to (parts of) Dr. Lior Pachter's post, "The myths of bioinformatics software". (You can also see my post on bioinformatics software licensing for at least some of the background arguments.)

I agree with a lot of what Lior says: most bioinformatics software is not very good quality (#1), most bioinformatics software is not built by a team (#2), licensing is at best a minor component of what makes software widely used (#3), software should have an expiration date (#5), most URLs are unstable (#6), software should not be "idiot proof" (#7), and it shouldn't matter whether you use a specific programming language (#8).

I strongly disagree with Lior's point #4, in almost every way. I try to make my software free for everyone, including companies, for both philosophical reasons and for simplicity; I explained my reasoning in my blog post. (Anyone who doesn't think linking against GPL software is reasonably complicated and nuanced should read through the tweets and comments on that post!) From my few involvements with working on non-free software, I would also add that selling software is a tough business, and not one that automatically leads to any profits; there's a long tail, just as with everything else, and I long ago decided that my time is worth more to me than the expected income from selling software would be. (I would be thrilled if a student wanted to try to make money off of our work, but my academic work would remain open source.)

Regardless, Lior's opinion isn't obviously wrong, and I appreciate the discussion.


What surprises me most about Lior's post, though, is that he's describing the present situation rather accurately, but he's not angry about it. I'm angry, frustrated, and upset by it, and I really want to see a better future -- I'm in science, and biology, partly because I think it can have a real impact on society and health. Software is a key part of that.

Biology and genomics are changing. Large scale data analysis is becoming more and more important to the biomedical sciences, and software packages like kallisto and khmer are almost certainly going to be used in the clinic at some point. (I believe some of Broad's variant calling software is already used in diagnosis and treatment for cancer, for example, although I don't know the details.) Our software is certainly being used by people doing basic biomedical research, although it may not be directly clinical yet - and I think the quality of computation in basic research matters too.

And this means bioinformatics should grow up a bit. If bioinformatics is a core component of the future of biology (which I think is obvious), then the quality of bioinformatics software matters.

To quote Lior, "Who wants to read junk software, let alone try to edit it or build on it?" Certainly not me - but then why are we producing it? Are we settling for this kind of software in biomedical research? Are we just giving up on producing decent quality software altogether, because, uh, it's hard? How is this different from doing bad math, or publishing bad biology - topics that Lior and others get really mad about?

Lior also quotes a Computational Biology interview with James Taylor, who says,

A lot of traditional software engineering is about how to build software effectively with large teams, whereas the way most scientific software is developed is (and should be) different. Scientific software is often developed by one or a handful of people.

That was true a decade ago, and it may have been a reasonable reason to avoid using decent software engineering techniques then, but the landscape has changed significantly in the last decade, with a wide variety of rapid prototyping, test-driven development, and lean/agile methodologies being put into practice in startups and large companies. So I think James is mistaken here.

I wager that the reason a lot of scientists do bad software engineering is because they can get away with it, not because there are no techniques they could profitably use. Heck, if they wanted to learn something about it, Software Carpentry will come teach workshops for you on this very topic, and I'd be happy to offer both Lior and James a workshop to bring them up to speed. (Note: I don't think either of them needs my advice, which is actually kind of my point.)

(As for languages, Lior's point #8, there is a persistent expansion of the Python and R toolchains around bioinformatics and a convergence on them as the daily workhorses of bioinformatics data analysis. So even that's changing.)

Fundamentally the blithe acceptance of badly engineered software in science baffles me. I can understand (and even endorse) not requiring good software engineering for algorithmic proofs of concept, but clearly we want to have good, robust libraries for serious work. To claim otherwise would seem to lead to the conclusion that much of bioinformatics and genomics should seek to be incorrect and irrelevant.


I want there to be a robust community of computational scientists and software developers in biology. I want people to be able to build a new variant caller without having to reimplement a FASTQ or SAM parser. I think we need people to file bug reports, catch weird off-by-one problems, and otherwise spot check all the software they are using. And I don't think it's impossible or even terribly difficult to achieve this.

The open source community has been developing software with distributed teams, with no single employer, and with uncertain funding for decades. It's not easy, but it's not impossible. And in the end I do think that the open source community has a lot of the solutions the computational science community needs, and in fact is simply a much better exemplar for how to work reproducibly and with high technical quality. Why we continue to ignore this mystifies me, although I would guess it has to do with how credit is assigned in academic software development.

If we went back to the 80s and 90s we'd see that many of the same arguments that Lior is making were applied to open source software in contrast to commercial software. We know how that ended - open source software now runs most of the Internet infrastructure. And open source has had other benefits, too; to quote Bill Janeway, "open source and the cloud have dramatically decreased the friction of innovating", and the scientific community has certainly benefited from the tremendous innovation in software and high tech. I would love to see that same innovation enabled in genomics and bioinformatics. And that's why we try to practice good software development in my lab; that's why we release our software under the BSD license; and that's why we encourage people to do whatever they want with our software.

Ultimately I think we should develop our software (and choose our licenses), for the future we want to see, not the present that we're stuck with.

--titus

by C. Titus Brown at July 10, 2015 10:00 PM

July 09, 2015

Fabian Pedregosa

Job Offer: data scientist in Paris

chaire Havas-Dauphine

Alexandre D'Aspremont and I are looking for a data scientist to work on large-scale non-smooth optimization methods and other topics. You can find more information at this link.

The job description is intentionally vague, and depending on the candidate this can be a postdoc-like job with most of their time devoted to research or an engineering-like job with more emphasis on coding tasks and contributing to open source projects.

Feel free to contact me (f@bianp.net) or Alexandre D'Aspremont for more information.

by Fabian Pedregosa at July 09, 2015 10:00 PM

July 07, 2015

Matthieu Brucher

Audio Toolkit: Zero Latency Efficient Convolution

Convolution is an algorithm that is often used for reverberation. While the equation is easy to understand and to implement, a direct implementation is costly. The other way of doing it is to use the Fast Fourier Transform (FFT), but the crude FFT implementation introduces latency. Even though it is possible to optimize the basic convolution code, it is sometimes more interesting to use a different algorithm, as is the case here.

Basic theory

Let’s start with the convolution equation, with x[n] the input signal, y[n] the output signal and i[n] the impulse we want to convolve the input signal with:
Convolution equation: $y[n] = \sum_{k=0}^{K-1} i[k] \, x[n-k]$
The cost of the operation is the size of the impulse, O(K), which can be quite high. For N output elements, the cost is thus O(KN).

The other way of doing this is in the frequency domain, where the convolution becomes a simple multiplication. The issue is that you require N elements from the start. The transform to the frequency domain is in O(M log M) (and the multiplication in O(M)), with M the number of elements in the FFT. As we are talking about convolution, M = N + K - 1. And since the convolution produces more output elements than the N we keep, we will need to add the extra ones to the following convolution.

The issue is the selection of the size N. With a big N, we get efficiency, but we also require N elements before being able to output any signal at all. This is worrying, as there is no way of going around this limitation.

Rewriting the equation

Ok, now let’s rewrite the equation as a double sum. Let’s split the impulse into S pieces of K elements each (let’s say 64 elements, assuming the original length is a multiple of 64):
Split convolution equation: $y[n] = \sum_{s=0}^{S-1} \sum_{k=0}^{K-1} i[sK+k] \, x[n-sK-k]$

What is interesting is that the x[k] in the equation can also be regrouped into pieces of K elements. This means that we have S convolutions of K elements each. There are several advantages to thinking about the input samples as pieces. The original algorithm had to keep a memory of S*K elements: either we use a circular buffer, or we have to shift the entire signal, and both solutions have their drawbacks. By regrouping samples into pieces, there is no buffer management and no shifting of the signal (or, more precisely, far less).

Unfortunately, we still have latency: we need K elements before we can use the FFT. Another approach is to process the first convolution in the usual way and the rest with FFTs. Each time we have processed K samples, we create a new chunk, transform it into the frequency domain and push it onto the list of FFT chunks. Then we can directly multiply the partial impulses (also transformed into the frequency domain) by their respective input chunks, sum them together and transform the result back into the time domain. After the inverse transform we get 2*K elements: the first K elements are added to the next K output elements after the discrete convolution, and the last K elements are put aside and added to the first K elements of the next convolution.
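
To make the bookkeeping concrete, here is a minimal numpy sketch of a uniformly partitioned FFT convolution (the function and variable names are mine, not Audio Toolkit’s). For simplicity it computes every partition with FFTs, so it buffers K samples before producing output; the zero-latency variant described above replaces the first partition by the direct FIR computation.

    import numpy as np

    def partitioned_fft_convolve(x, h, K=64):
        """Uniformly partitioned overlap-add convolution: the impulse h is split
        into blocks of K samples, each stored as an FFT of size 2*K, and the
        input is processed K samples at a time through a frequency-domain
        delay line."""
        x = np.asarray(x, dtype=float)
        h = np.asarray(h, dtype=float)
        S = -(-len(h) // K)                                   # number of impulse partitions
        h = np.pad(h, (0, S * K - len(h)), mode='constant')   # pad impulse to a multiple of K
        H = np.fft.rfft(h.reshape(S, K), n=2 * K, axis=1)

        nblocks = -(-len(x) // K)
        x = np.pad(x, (0, nblocks * K - len(x)), mode='constant')
        fdl = np.zeros((S, K + 1), dtype=complex)             # FFTs of the last S input blocks
        overlap = np.zeros(K)
        out = np.empty(nblocks * K)

        for b in range(nblocks):
            fdl = np.roll(fdl, 1, axis=0)                     # age the delay line
            fdl[0] = np.fft.rfft(x[b * K:(b + 1) * K], n=2 * K)
            y = np.fft.irfft((fdl * H).sum(axis=0), n=2 * K)
            out[b * K:(b + 1) * K] = y[:K] + overlap          # first half plus previous tail
            overlap = y[K:]                                   # keep the tail for the next block
        return out

As a sanity check, out[:n] should agree with np.convolve(x, h)[:n] for n = min(len(out), len(x) + len(h) - 1), up to floating-point error.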

Implementation in Audio Toolkit

This is a pretty basic approach to partitioned convolution. It’s just a rewrite of the convolution equation, with no fancy science behind it.

The reference implementation is the FIR filter (just reverse the impulse). One of the nice features of ATK is that you can make sure that you always have K input elements available in the buffer: even if you are asked to process 1 element, the framework has a way of always keeping the last K values, so there is no additional bookkeeping needed. Once we have processed K elements, we can directly create a new chunk. This is done through the following code in the main process method:

    int processed_size = 0;
    do
    {
      // We can only process split_size elements at a time, but if we already have some elements in the buffer,
      // we need to take them into account.
      int64_t size_to_process = std::min(split_size - split_position, size - processed_size);

      // Process first part of the impulse
      process_impulse_beginning(processed_size, size_to_process);

      split_position += size_to_process;
      processed_size += size_to_process;
      if(split_position == split_size)
      {
        process_new_chunk(processed_size);
        split_position = 0;
      }
    }while(processed_size != size);

process_new_chunk() triggers the copy of the last K input samples and computes an FFT on them. Once this is finished, the FFT result (2*K elements) is stored, and the last K elements of the FFT result are added to the buffer. This buffer part will be added to the convolution in process_impulse_beginning.

Some profiling

I always like to profile my code, so I used a small 10000-sample impulse (a ramp) and convolved it with a triangle signal, comparing the performance to the reference algorithm (the FIR filter). With pieces of 64 elements, the new implementation is 10 times faster, and it improves further if the pieces are bigger or the impulse longer:

First, a simple block convolution already improves timings by a factor of 3. Then, using FFTs to compute all the convolutions except the first, we can go further. Let’s see a profile of my original implementation:

Original profile with FFT convolution

Now, all the complex and vector calls inside the two convolution main methods can be removed to streamline the code:

Final profile after optimizations

Besides the optimization of the FFT code, the classic IIR code was also optimized somewhat, leading to a 15% performance increase. The conversion code was also enhanced, leading to a gain of more than 50% in input/output handling for all plugins.

Conclusion

There is still room for improvement, as there are several copies that are still required, but a basic implementation is actually easy to achieve. There is a compromise to be found between the convolution through FFTs, dominated by the complex multiplication and accumulation before the inverse FFT, and the convolution through a FIR filter for the first block. If the impulse is long (several seconds), it seems that a size of 256 to 512 elements is the best compromise, whereas for a shorter impulse (1 second), a block size of 64 elements is best.

Of course, by improving the complex multiplication and addition (SIMD? FMA?), the compromise would shift towards smaller block sizes.

I didn’t put any big equations on how the block convolution is achieved. This would almost require a paper or a book chapter…

The full implementation is available on Github.


by Matt at July 07, 2015 04:09 PM

July 06, 2015

Titus Brown

On licensing bioinformatics software: use the BSD, Luke.

If a piece of bioinformatics software is not fully open source, my lab and I will generally seek out alternatives to it for research, teaching and training. This holds whether or not the software is free for academic use.

If a piece of bioinformatics software is only available under the GNU Public License or another "copyleft" license, then my lab and I will absolutely avoid integrating any part of it into our own source code, which is mostly BSD.

Why avoid non-open-source software?

We avoid non-open-source software because it saves us future headaches.

Contrary to some assumptions, this is not because I'm anti-company or against making money from software, although I have certainly chosen to forego that in my own life. It's almost entirely because such software is an evolutionary dead end, and hence time spent working with it is ultimately wasted.

More specifically, here are some situations that I want to avoid:

  • we invest a lot of time in building training materials for a piece of software, only to find out that some of our trainees can't make use of the software in their day job.
  • we need to talk to lawyers about whether or not we can use a piece of software or include some of its functionality in a workflow we're building.
  • we find a small but critical bug in a piece of bioinformatics software, and can't reach any of the original authors to OK a new release.

This is the reason why I won't be investing much time or effort in using kallisto in my courses and my research: it's definitely not open source.

Why avoid copyleft?

The typical decision tree for an open source license is between a "permissive" BSD-style license vs a copyleft license like the GPL; see Jake Vanderplas's excellent post on licensing scientific code for specifics.

There is an asymmetry in these licenses.

Our software, khmer, is available under the BSD license. Any open source project (indeed, any project) is welcome to take any part of our source code and include it in theirs.

However, we cannot use GPL software in our code base at all. We can't call GPL library functions, we can't include GPL code in our codebase, and I'm not 100% sure we can even look closely at GPL code. This is because if we do so, we must license our own software under the GPL.

This is the reason that I will be avoiding bloomtree code, and in fact we will probably be following through on our reimplementation -- bloomtree relies on both Jellyfish and sdsl-lite, which are GPL.

Why did we choose BSD and not GPL for our own code?

Two reasons: first, I'm an academic, funded by government grants; second, I want to maximize the utility of my work, which means choosing a license that encourages the most participation in the project, and encourages the most reuse of my code in other projects.

Jake covers the second line of reasoning really nicely in his blog post, so I will simply extract his summary of John Hunter's reasoning:

To summarize Hunter's reasoning: the most important two predictors of success for a software project are the number of users and the number of contributors. Because of the restrictions and subtle legal issues involved with GPL licenses, many for-profit companies will not touch GPL-licensed code, even if they are happy to contribute their changes back to the community. A BSD license, on the other hand, removes these restrictions: Hunter mentions several specific examples of vital industry partnership in the case of matplotlib. He argues that in general, a good BSD-licensed project will, by virtue of opening itself to the contribution of private companies, greatly grow its two greatest assets: its user-base and its developer-base.

I also think maximizing remixability is a basic scientific goal, and this is something that the GPL fails.

The first line of reasoning is a little more philosophical, but basically it comes down to a wholesale rejection of the logic in the Bayh-Dole act, which tries to encourage innovation and commercialization of federally funded research by assigning intellectual property to the university. I think this approach is bollocks. While I am not an economic expert, I think it's clear that most innovation in the university is probably not worth that much and should be made maximally open. From talking to Dr. Bill Janeway, I gather he agrees that pre-Bayh-Dole was a time of more openness, although I'm not sure of the evidence for more innovation during this period. Regardless, to me it's intuitively obvious that the prospect of commercialization causes more researchers to keep their research closed, and I think this is obviously bad for science. (The Idea Factory talks a lot about how Bell Labs spurred immense amounts of innovation because so much of their research was open for use. Talent Wants to be Free is a pop-sci book that outlines research supporting openness leading to more innovation.)

So, basically, I think my job as an academic is to come up with cool stuff and make it as open as possible, because that encourages innovation and progress. And the BSD fits that bill. If a company wants to make use of my code, that's great! Please don't talk to us - just grab it and go!

I should say that I'm very aware of the many good reasons why GPL promotes a better long-term future, and until I became a grad student I was 100% on board. Once I got more involved in scientific programming, though, I switched to a more selfish rationale, which is that my reputation is going to be enhanced by more people using my code, and the way to do that is with the BSD. If you have good arguments about why I'm wrong and everyone should use the GPL, please do post them (or links to good pieces) in the comments; I'm happy to promote that line of reasoning, but for now I've settled on BSD for my own work.

One important note: universities like releasing things under the GPL, because they know that it virtually guarantees no company will use it in a commercial product without paying the university to relicense it specifically for the company. While this may be in the best short-term interests of the university, I think it says all you need to know about the practical outcome of the GPL on scientific innovation.

Why am I OK with the output of commercial equipment?

Lior Pachter drew a contrast between my refusal to teach non-free software and my presumed teaching on sequencing output from commercial Illumina machines. I think there are at least four arguments to be made in favor of continuing to use Illumina while avoiding the use of kallisto.

  • pragmatically, Illumina is the only game in town for most of my students, while there are plenty of RNA-seq analysis programs. So unless I settled on kallisto being the super-end-all-be-all of RNAseq analysis, I can indulge my leanings towards freedom by ignoring kallisto and teaching something else that's free-er.
  • Illumina has a clear pricing model and their sequencing is essentially a commodity that needs little to no engagement from me. This is not generally how bioinformatics software works :)
  • There's no danger of Illumina claiming dibs on any of my results or extensions - we're all clear that I pays my money, and I gets my sequence. I'm honestly not sure what would happen if I modified kallisto or built on it to do something cool, and then wanted to let a company use it. (I bet it would involve talking to a lot of lawyers, which I'm not interested in doing.)
  • James Taylor made the excellent points that limited training and development time is best spent on tools that are maximally available, and that don't involve licenses that they can't enforce.

So that's my reasoning. I don't want to pour fuel on any licensing fire, but I wanted to explain my reasoning to people. I also think that people should fight hard to make their bioinformatics software available under a permissive license, because it will benefit everyone :).

I should say that Manoj Samanta has been following this line of thought for much longer than me, and has written several blog posts on this topic (see also this, for example).

--titus

by C. Titus Brown at July 06, 2015 10:00 PM

Jake Vanderplas

The Model Complexity Myth

(or, Yes You Can Fit Models With More Parameters Than Data Points)

An oft-repeated rule of thumb in any sort of statistical model fitting is "you can't fit a model with more parameters than data points". This idea appears to be as wide-spread as it is incorrect. On the contrary, if you construct your models carefully, you can fit models with more parameters than datapoints, and this is much more than mere trivia with which you can impress the nerdiest of your friends: as I will show here, this fact can prove to be very useful in real-world scientific applications.

A model with more parameters than datapoints is known as an under-determined system, and it's a common misperception that such a model cannot be solved in any circumstance. In this post I will consider this misconception, which I call the model complexity myth. I'll start by showing where this model complexity myth holds true, first from an intuitive point of view, and then from a more mathematically-heavy point of view. I'll build from this mathematical treatment and discuss how underdetermined models may be addressed from a frequentist standpoint, and then from a Bayesian standpoint. (If you're unclear about the general differences between frequentist and Bayesian approaches, I might suggest reading my posts on the subject). Finally, I'll discuss some practical examples of where such an underdetermined model can be useful, and demonstrate one of these examples: quantitatively accounting for measurement biases in scientific data.

The Root of the Model Complexity Myth

While the model complexity myth is not true in general, it is true in the specific case of simple linear models, which perhaps explains why the myth is so pervasive. In this section I first want to motivate the reason for the underdetermination issue in simple linear models, first from an intuitive view, and then from a more mathematical view.

I'll start by defining some functions to create plots for the examples below; you can skip reading this code block for now:

In [1]:
# Code to create figures
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('ggplot')

def plot_simple_line():
    rng = np.random.RandomState(42)
    x = 10 * rng.rand(20)
    y = 2 * x + 5 + rng.randn(20)
    p = np.polyfit(x, y, 1)
    xfit = np.linspace(0, 10)
    yfit = np.polyval(p, xfit)
    plt.plot(x, y, 'ok')
    plt.plot(xfit, yfit, color='gray')
    plt.text(9.8, 1,
             "y = {0:.2f}x + {1:.2f}".format(*p),
             ha='right', size=14);
    
def plot_underdetermined_fits(p, brange=(-0.5, 1.5), xlim=(-3, 3),
                              plot_conditioned=False):
    rng = np.random.RandomState(42)
    x, y = rng.rand(2, p).round(2)
    xfit = np.linspace(xlim[0], xlim[1])
    for r in rng.rand(20):
        # add a datapoint to make model specified
        b = brange[0] + r * (brange[1] - brange[0])
        xx = np.concatenate([x, [0]])
        yy = np.concatenate([y, [b]])
        theta = np.polyfit(xx, yy, p)
        yfit = np.polyval(theta, xfit)
        plt.plot(xfit, yfit, color='#BBBBBB')
    plt.plot(x, y, 'ok')
    
    if plot_conditioned:
        X = x[:, None] ** np.arange(p + 1)
        theta = np.linalg.solve(np.dot(X.T, X)
                                + 1E-3 * np.eye(X.shape[1]),
                                np.dot(X.T, y))
        Xfit = xfit[:, None] ** np.arange(p + 1)
        yfit = np.dot(Xfit, theta)
        plt.plot(xfit, yfit, color='black', lw=2)

def plot_underdetermined_line():
    plot_underdetermined_fits(1)
        
def plot_underdetermined_cubic():
    plot_underdetermined_fits(3, brange=(-1, 2),
                            xlim=(0, 1.2))
    
def plot_conditioned_line():
    plot_underdetermined_fits(1, plot_conditioned=True)

Fitting a Line to Data

The archetypical model-fitting problem is that of fitting a line to data: A straight-line fit is one of the simplest of linear models, and is usually specified by two parameters: the slope m and intercept b. For any observed value \(x\), the model prediction for \(y\) under the model \(M\) is given by

\[ y_M = m x + b \]

for some particular choice of \(m\) and \(b\). Given \(N\) observed data points \(\{x_i, y_i\}_{i=1}^N\), it is straightforward (see below) to compute optimal values for \(m\) and \(b\) which fit this data:

In [2]:
plot_simple_line()

The simple line-plus-intercept is a two-parameter model, and it becomes underdetermined when fitting it to fewer than two datapoints. This is easy to understand intuitively: after all, you can draw any number of perfectly-fit lines through a single data point:

In [3]:
plot_underdetermined_line()

The single point simply isn't enough to pin down both a slope and an intercept, and the model has no unique solution.

Fitting a More General Linear Model

While it's harder to see intuitively, this same notion extends to models with more terms. For example, let's consider fitting a general cubic curve to data. In this case our model is

\[ y_M = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 \]

Note that this is still a linear model: the "linear" refers to the linearity of model parameters \(\theta\) rather than linearity of the dependence on the data \(x\). Our cubic model is a four-parameter linear model, and just as a two-parameter model is underdetermined for fewer than two points, a four-parameter model is underdetermined for fewer than four points. For example, here are some of the possible solutions of the cubic model fit to three randomly-chosen points:

In [4]:
plot_underdetermined_cubic()

For any such simple linear model, an underdetermined system will lead to a similar result: an infinite set of best-fit solutions.

The Mathematics of Underdetermined Models

To make more progress here, let's quickly dive into the mathematics behind these linear models. Going back to the simple straight-line fit, we have our model

\[ y_M(x~|~\theta) = \theta_0 + \theta_1 x \]

where we've replaced our slope \(m\) and intercept \(b\) by a more generalizable parameter vector \(\theta = [\theta_0, \theta_1]\). Given some set of data \(\{x_n, y_n\}_{n=1}^N\) we'd like to find \(\theta\) which gives the best fit. For reasons I'll not discuss here, this is usually done by minimizing the sum of squared residuals from each data point, often called the \(\chi^2\) of the model in reference to its expected theoretical distribution:

\[ \chi^2 = \sum_{n=1}^N [y_n - y_M(x_n~|~\theta)]^2 \]

We can make some progress by re-expressing this model in terms of matrices and vectors; we'll define the vector of \(y\) values:

\[ y = [y_1, y_2, y_3, \cdots y_N] \]

We'll also define the design matrix which we'll call \(X\); this contains all the information about the form of the model:

\[ X = \left[ \begin{array}{ll} 1 & x_1 \\ 1 & x_2 \\ \vdots &\vdots \\ 1 & x_N \\ \end{array} \right] \]

With this formalism, the vector of model values can be expressed as a matrix-vector product:

\[ y_M = X\theta \]

and the \(\chi^2\) can be expressed as a simple linear product as well:

\[ \chi^2 = (y - X\theta)^T(y - X\theta) \]

We'd like to minimize the \(\chi^2\) with respect to the parameter vector \(\theta\), which we can do by the normal means of differentiating with respect to the vector \(\theta\) and setting the result to zero (yes, you can take the derivative with respect to a vector!):

\[ \frac{d\chi^2}{d\theta} = -2X^T(y - X\theta) = 0 \]

Solving this for \(\theta\) gives the Maximum Likelihood Estimate (MLE) for the parameters,

\[ \hat{\theta}_{MLE} = [X^T X]^{-1} X^T y \]

Though this matrix formalism may seem a bit over-complicated, the nice part is that it straightforwardly generalizes to a host of more sophisticated linear models. For example, the cubic model considered above requires only a larger design matrix \(X\):

\[ X = \left[ \begin{array}{llll} 1 & x_1 & x_1^2 & x_1^3\\ 1 & x_2 & x_2^2 & x_2^3\\ \vdots & \vdots & \vdots & \vdots\\ 1 & x_N & x_N^2 & x_N^3\\ \end{array} \right] \]

The added model complexity is completely encapsulated in the design matrix, and the expression to compute \(\hat{\theta}_{MLE}\) from \(X\) is unchanged!
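
As an aside (again a sketch with assumed synthetic data, not the notebook's code), this expression can be evaluated directly with NumPy; np.linalg.lstsq solves the same least-squares problem without forming \(X^TX\) explicitly:

import numpy as np

# Synthetic data roughly following a cubic (assumed values, for illustration only)
rng = np.random.RandomState(1)
x = np.sort(10 * rng.rand(20))
y = 1 - 2 * x + 0.5 * x ** 2 - 0.03 * x ** 3 + rng.randn(20)

X = np.vander(x, 4, increasing=True)   # design matrix with columns [1, x, x^2, x^3]

# Normal-equation solution: theta = (X^T X)^{-1} X^T y
theta_mle = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))

# The numerically preferred route gives the same answer
theta_lstsq = np.linalg.lstsq(X, y)[0]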

Why Underdetermined Models Break

Taking a look at this Maximum Likelihood solution for \(\theta\), we see that there is only one place that it might go wrong: the inversion of the matrix \(X^T X\). If this matrix is not invertible (i.e. if it is a singular matrix) then the maximum likelihood solution will not be well-defined.

The number of rows in \(X\) corresponds to the number of data points, and the number of columns in \(X\) corresponds to the number of parameters in the model. It turns out that a matrix \(C = X^TX\) will always be singular if \(X\) has fewer rows than columns, and this is the source of the problem. For underdetermined models, \(X^TX\) is a singular matrix, and so the maximum likelihood fit is not well-defined.

Let's take a look at this in the case of fitting a line to the single point shown above, \((x=0.37, y=0.95)\). For this value of \(x\), here is the design matrix:

In [5]:
X = np.array([[1, 0.37]])

We can use this to compute the normal matrix, which is the standard name for \(X^TX\):

In [6]:
C = np.dot(X.T, X)

If we try to invert this, we will get an error telling us the matrix is singular:

In [7]:
np.linalg.inv(C)
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-7-a6d66d9f99af> in <module>()
----> 1 np.linalg.inv(C)

/Users/jakevdp/anaconda/envs/py34/lib/python3.4/site-packages/numpy/linalg/linalg.py in inv(a)
    518     signature = 'D->D' if isComplexType(t) else 'd->d'
    519     extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 520     ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
    521     return wrap(ainv.astype(result_t))
    522 

/Users/jakevdp/anaconda/envs/py34/lib/python3.4/site-packages/numpy/linalg/linalg.py in _raise_linalgerror_singular(err, flag)
     88 
     89 def _raise_linalgerror_singular(err, flag):
---> 90     raise LinAlgError("Singular matrix")
     91 
     92 def _raise_linalgerror_nonposdef(err, flag):

LinAlgError: Singular matrix

Evidently, if we want to fix the underdetermined model, we'll need to figure out how to modify the normal matrix so it is no longer singular.

Fixing an Underdetermined Model: Conditioning

One easy way to make a singular matrix invertible is to condition it: that is, you add to it some multiple of the identity matrix before performing the inversion (in many ways this is equivalent to "fixing" a divide-by-zero error by adding a small value to the denominator). Mathematically, that looks like this:

\[ C = X^TX + \sigma I \]

For example, by adding \(\sigma = 10^{-3}\) to the diagonal of the normal matrix, we condition the matrix so that it can be inverted:

In [8]:
cond = 1E-3 * np.eye(2)
np.linalg.inv(C + cond)
Out[8]:
array([[ 121.18815362, -325.16038316],
       [-325.16038316,  879.69065823]])

Carrying this conditioned inverse through the computation, we get the following intercept and slope for our underdetermined problem:

In [9]:
b, m = np.linalg.solve(C + cond,
                       np.dot(X.T, [0.95]))
print("Conditioned best-fit model:")
print("y = {0:.3f} x + {1:.3f}".format(m, b))
Conditioned best-fit model:
y = 0.309 x + 0.835

In [10]:
plot_conditioned_line()

This conditioning caused the model to settle on a particular one of the infinite possibilities for a perfect fit to the data. Numerically we have fixed our issue, but this arbitrary conditioning is more than a bit suspect: why is this particular result chosen, and what does it actually mean in terms of our model fit? In the next two sections, we will briefly discuss the meaning of this conditioning term from both a frequentist and Bayesian perspective.

Frequentist Conditioning: Regularization

In a frequentist approach, this type of conditioning is known as regularization. Regularization is motivated by a desire to penalize large values of model parameters. For example, in the underdetermined fit above (with \((x, y) = (0.37, 0.95)\)), you could fit the data perfectly with a slope of one billion and an intercept near negative 370 million, but in most real-world applications this would be a silly fit. To prevent this sort of canceling parameter divergence, in a frequentist setting you can "regularize" the model by adding a penalty term to the \(\chi^2\):

\[ \chi^2_{reg} = \chi^2 + \lambda~\theta^T\theta \]

Here \(\lambda\) is the degree of regularization, which must be chosen by the person implementing the model.

Using the expression for the regularized \(\chi^2\), we can minimize with respect to \(\theta\) by again taking the derivative and setting it equal to zero:

\[ \frac{d\chi^2_{reg}}{d\theta} = -2[X^T(y - X\theta) - \lambda\theta] = 0 \]

This leads to the following regularized maximum likelihood estimate for \(\theta\):

\[ \hat{\theta}_{MLE} = [X^TX + \lambda I]^{-1} X^T y \]

Comparing this to our conditioning above, we see that the regularization degree \(\lambda\) is identical to the conditioning term \(\sigma\). That is, regularization of this form is nothing more than a simple conditioning of \(X^T X\), with \(\lambda = \sigma\). The result of this conditioning is to push the absolute values of the parameters toward zero, and in the process make an ill-defined problem solvable.

I'll add that the above form of regularization is known variously as L2-regularization or Ridge Regularization, and is only one of the possible regularization approaches. Another useful form of regularization is L1-regularization, also known as Lasso Regularization, which has the interesting property that it favors sparsity in the model.
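
As a concrete sketch (not the notebook's code, and assuming scikit-learn is available), the regularized estimate for the single-point example can be computed either from the closed form above or with scikit-learn's Ridge estimator, which solves the same penalized least-squares problem:

import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[1.0, 0.37]])   # design matrix for the single observed point
y = np.array([0.95])
lam = 1E-3

# Closed-form regularized solution: theta = (X^T X + lambda I)^{-1} X^T y
theta_closed = np.linalg.solve(np.dot(X.T, X) + lam * np.eye(2), np.dot(X.T, y))

# Ridge's alpha plays the role of lambda; fit_intercept=False because the
# intercept is already encoded as the first column of X
theta_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(theta_closed, theta_ridge)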

Bayesian Conditioning: Priors

Regularization illuminates the meaning of matrix conditioning, but it still sometimes seems like a bit of black magic. What does this penalization of model parameters within the \(\chi^2\) actually mean? Here, we can make progress in understanding the problem by examining regularization from a Bayesian perspective.

As I pointed out in my series of posts on Frequentism and Bayesianism, for many simple problems, the frequentist likelihood (proportional to the exponential of the negative \(\chi^2\)) is equivalent to the Bayesian posterior, albeit with a subtly but fundamentally different interpretation.

The Bayesian posterior probability on the model parameters \(\theta\) is given by

\[ P(\theta~|~D, M) = \frac{P(D~|~\theta, M) P(\theta~|~M)}{P(D~|~M)} \]

where the most important terms are the likelihood \(P(D~|~\theta, M)\) and the prior \(P(\theta~|~M)\). From the expected correspondence with the frequentist result, we can write:

\[ P(D~|~\theta, M) P(\theta~|~M) \propto \exp[- \chi^2] \]

Because the right-hand side contains no factor that depends on \(\theta\) alone, we can immediately identify

\[ P(D~|~\theta, M) \propto \exp[-\chi^2]\\ P(\theta~|~M) \propto 1 \]

That is, the simple frequentist likelihood is equivalent to the Bayesian posterior for the model with an implicit flat prior on \(\theta\).

To understand the meaning of regularization, let's repeat this exercise with the regularized \(\chi^2\):

\[ P(D~|~\theta, M) P(\theta~|~M) \propto \exp[- \chi^2 - \lambda~|\theta|^2] \]

The regularization term in the \(\chi^2\) becomes a second term in the product which depends only on \(\theta\), thus we can immediately write

\[ P(D~|~\theta, M) \propto \exp[-\chi^2]\\ P(\theta~|~M) \propto \exp[- \lambda~|\theta|^2] \]

So we see that ridge regularization is equivalent to applying a Gaussian prior to your model parameters, centered at \(\theta=0\) and with a width \(\sigma_P = (2\lambda)^{-1/2}\). This insight lifts the cover off the black box of regularization, and reveals that it is simply a roundabout way of adding a Bayesian prior within a frequentist paradigm. The stronger the regularization, the narrower the implicit Gaussian prior is.
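
To see where this width comes from, match the implicit prior against the standard Gaussian form:

\[ \exp[-\lambda~|\theta|^2] = \exp\left[-\frac{|\theta|^2}{2\sigma_P^2}\right] \quad\Longrightarrow\quad \sigma_P^2 = \frac{1}{2\lambda} \quad\Longrightarrow\quad \sigma_P = (2\lambda)^{-1/2} \]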

Returning to our single-point example above, we can quickly see how this intuition explains the particular model chosen by the regularized fit; it is equivalent to fitting the line while taking into account prior knowledge that both the intercept and slope should be near zero:

In [11]:
plot_conditioned_line()

The benefit of the Bayesian view is that it helps us understand exactly what this conditioning means for our model, and given this understanding we can easily extend it to use more general priors. For example, what if you have reason to believe your slope is near 1, but have no prior information on your intercept? In the Bayesian approach, it is easy to add such information to your model in a rigorous way.

But regardless of which approach you use, this central fact remains clear: you can fit models with more parameters than data points, if you restrict your parameter space through the use of frequentist regularization or Bayesian priors.

Underdetermined Models in Action

There are a few places where these ideas about underdetermined models come up in real life. I'll briefly discuss a couple of them here, and then walk through the solution of a simple (but rather interesting) problem that demonstrates these ideas.

Compressed Sensing: Image Reconstruction

One area where underdetermined models are often used is in the field of Compressed Sensing. Compressed sensing comprises a set of models in which underdetermined linear systems are solved using a sparsity prior, the classic example of which is the reconstruction of a full image from just a handful of its pixels. As a simple linear model this would fail, because there are far more unknown pixels than known pixels. But by carefully training a model on the structure of typical images and applying priors based on sparsity, this seemingly impossible problem becomes tractable. This 2010 Wired article has a good popular-level discussion of the technique and its applications, and includes this image showing how a partially-hidden input image can be iteratively reconstructed from a handful of pixels using a sparsity prior:

In [12]:
from IPython.display import Image
Image('http://www.wired.com/magazine/wp-content/images/18-03/ff_algorithm2_f.jpg')
Out[12]:
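
While real compressed-sensing reconstructions are far more involved, the core idea described above (an underdetermined linear system made solvable by a sparsity prior) can be illustrated with a toy sketch. Everything below, including the problem sizes and the use of scikit-learn's Lasso as the L1 solver, is an assumption for illustration rather than material from the article:

import numpy as np
from sklearn.linear_model import Lasso

# Toy compressed-sensing setup (assumed sizes): recover a sparse 200-dimensional
# vector from only 60 random linear measurements.
rng = np.random.RandomState(0)
n_features, n_measurements, n_nonzero = 200, 60, 5

truth = np.zeros(n_features)
truth[rng.choice(n_features, n_nonzero, replace=False)] = rng.randn(n_nonzero)

A = rng.randn(n_measurements, n_features)   # underdetermined "design matrix"
b = np.dot(A, truth)

# An L1 penalty (a sparsity prior) makes the underdetermined system solvable
recovered = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000).fit(A, b).coef_
print("largest recovery error:", np.abs(recovered - truth).max())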

Kernel-based Methods: Gaussian Processes

Another area where a classically underdetermined model is solved is in the case of Gaussian Process Regression. Gaussian Processes are an interesting beast, and one way to view them is that rather than fitting, say, a two-parameter line or four-parameter cubic curve, they actually fit an infinite-dimensional model to data! They accomplish this by judicious use of certain priors on the model, along with a so-called "kernel trick" which solves the infinite dimensional regression implicitly using a finite-dimensional representation constructed based on these priors.
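
As a small, assumed illustration (not from the article), scikit-learn's GaussianProcessRegressor fits such a model to a handful of points and returns a posterior mean and uncertainty over a continuum of inputs:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# A few noisy observations (assumed toy data)
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(8))[:, None]
y = np.sin(X).ravel() + 0.1 * rng.randn(8)

# The RBF kernel encodes the prior; alpha accounts for the observation noise
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.01)
gp.fit(X, y)

X_new = np.linspace(0, 5, 100)[:, None]
y_mean, y_std = gp.predict(X_new, return_std=True)   # posterior mean and std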

In my opinion, the best resource to learn more about Gaussian Process methods is the Gaussian Processes for Machine Learning book, which is available for free online (though it is a bit on the technical side). You might also take a look at the scikit-learn Gaussian Process documentation. If you'd like to experiment with a rather fast Gaussian Process implementation in Python, check out the george library.

Imperfect Detectors: Extrasolar Planets

Another place I've seen effective use of the above ideas is in situations where the data collection process has unavoidable imperfections. There are two basic ways forward when working with such noisy and biased data:

  • you can pre-process the data to try to remove and/or correct data imperfections, and then fit a conventional model to the corrected data.
  • you can account for the data imperfections along with interesting model parameters as part of a unified model: this type of unified approach is often preferable in scientific settings, where even the most careful pre-processing can lead to biased results.

If you'd like to see a great example of this style of forward-modeling analysis, check out the efforts of David Hogg's group in finding extrasolar planets in the Kepler survey's K2 data; there's a nice Astrobites post which summarizes some of these results. While the group hasn't yet published any results based on truly underdetermined models, it is easy to imagine how this style of comprehensive forward-modeling analysis could be pushed to such an extreme.

Example: Fitting a Model to Biased Data

While it might be fun to dive into the details of models for noisy exoplanet searches, I'll defer that to the experts. Instead, as a more approachable example of an underdetermined model, I'll revisit a toy example in which a classically underdetermined model is used to account for imperfections in the input data.

Imagine you have some observed data you would like to model, but you know that your detector is flawed such that some observations (you don't know which) will have a bias that is not reflected in the estimated error: in short, there are outliers in your data. How can you fit a model to this data while accounting for the possibility of such outliers?

To make this more concrete, consider the following data, which is drawn from a line with noise, and includes several outliers:

In [13]:
rng = np.random.RandomState(42)
theta = [-10, 2]
x = 10 * rng.rand(10)
dy = 2 + 2 * rng.rand(10)
y = theta[0] + theta[1] * x + dy * rng.randn(10)
y[4] += 15
y[7] -= 10
plt.errorbar(x, y, dy, fmt='ok', ecolor='gray');

If we try to fit a line using the standard \(\chi^2\) minimization approach, we will find an obviously biased result:

In [14]:
from scipy import optimize

def chi2(theta, x=x, y=y, dy=dy):
    y_model = theta[0] + theta[1] * x
    return np.sum(0.5 * ((y - y_model) / dy) ** 2)

theta1 = optimize.fmin(chi2, [0, 0], disp=False)

xfit = np.linspace(0, 10)
plt.errorbar(x, y, dy, fmt='ok', ecolor='gray')
plt.plot(xfit, theta1[0] + theta1[1] * xfit, '-k')
plt.title('Standard $\chi^2$ Minimization');

This reflects a well-known deficiency of \(\chi^2\) minimization: it is not robust to the presence of outliers.

What we would like to do is propose a model which somehow accounts for the possibility that each of these points may be the result of a biased measurement. One possible route is to add \(N\) new model parameters: one associated with each point, indicating whether it is an outlier or not. If a point is an inlier, we use the standard model likelihood; if it is an outlier, we use a likelihood with a much larger error. The result for our straight-line fit will be a model with \(N + 2\) parameters, where \(N\) is the number of data points. An overzealous application of lessons from simple linear models might lead you to believe this model can't be solved. But, if carefully constructed, it can!

Let's show how it can be done.

Our linear model is:

\[ y_M(x~|~\theta) = \theta_0 + \theta_1 x \]

For a non-outlier (let's call it an "inlier") point at \(x\), \(y\), with error on \(y\) given by \(dy\), the likelihood is

\[ L_{in, i}(D~|~\theta) = \frac{1}{\sqrt{2\pi dy_i^2}} \exp\frac{-[y_i - y_M(x_i~|~\theta)]^2}{2 dy_i^2} \]

For an "outlier" point, the likelihood is

\[ L_{out, i}(D~|~\theta) = \frac{1}{\sqrt{2\pi \sigma_y^2}} \exp\frac{-[y_i - y_M(x_i~|~\theta)]^2}{2 \sigma_y^2} \]

where \(\sigma_y\) is the standard deviation of the \(y\) data: note that the only difference between the "inlier" and "outlier" likelihood is the width of the Gaussian distribution.

Now we'll specify \(N\) additional binary model parameters \(\{g_i\}_{i=1}^N\) which indicate whether point \(i\) is an outlier \((g_i = 1)\) or an inlier \((g_i = 0)\). With this, the overall Likelihood becomes:

\[ L(D~|~\theta, g) = \prod_i \left[(1 - g_i)~L_{in, i} + g_i~L_{out, i}\right] \]

We will put a prior on these indicator variables \(g\) which encourages sparsity of outliers; this can be accomplished with a simple L1 prior, which penalizes the sum of the \(g\) terms:

\[ P(g) = \exp\left[-\sum_i g_i\right] \]

where, recall, \(g_i \in \{0, 1\}\).

Though you could likely solve for a point estimate of this model, I find the Bayesian approach to be much more straightforward and interpretable for a model this complex. To fit this, I'll make use of the excellent emcee package. Because emcee doesn't have categorical variables, we'll instead allow \(g_i\) to range continuously between 0 and 1, so that any single point will be some mixture of "outlier" and "inlier".

We start by defining a function which computes the log-posterior given the data and model parameters, using some computational tricks for the sake of floating-point accuracy:

In [15]:
# theta will be an array of length 2 + N, where N is the number of points
# theta[0] is the intercept, theta[1] is the slope,
# and theta[2 + i] is the weight g_i

def log_prior(theta):
    g = theta[2:]
    #g_i needs to be between 0 and 1
    if (np.any(g < 0) or np.any(g > 1)):
        return -np.inf # recall log(0) = -inf
    else:
        return -g.sum()

def log_likelihood(theta, x, y, dy):
    sigma_y = np.std(y)
    y_model = theta[0] + theta[1] * x
    g = np.clip(theta[2:], 0, 1)  # g<0 or g>1 leads to NaNs in logarithm
    # log-likelihood for in-lier
    logL_in = -0.5 * (np.log(2 * np.pi * dy ** 2) + ((y - y_model) / dy)** 2)
    # log-likelihood for outlier
    logL_out = -0.5 * (np.log(2 * np.pi * sigma_y ** 2) + ((y - y_model) / sigma_y) ** 2)
    return np.sum(np.logaddexp(np.log(1 - g) + logL_in,
                               np.log(g) + logL_out))

def log_posterior(theta, x, y, dy):
    return log_prior(theta) + log_likelihood(theta, x, y, dy)

Now we use the emcee package to run this model. Note that because of the high dimensionality of the model, the run_mcmc command below will take a couple minutes to complete:

In [16]:
import emcee

ndim = 2 + len(x)  # number of parameters in the model
nwalkers = 50  # number of MCMC walkers
nburn = 10000  # "burn-in" period to let chains stabilize
nsteps = 15000  # number of MCMC steps to take

# set walkers near the maximum likelihood
# adding some random scatter
rng = np.random.RandomState(0)
starting_guesses = np.zeros((nwalkers, ndim))
starting_guesses[:, :2] = rng.normal(theta1, 1, (nwalkers, 2))
starting_guesses[:, 2:] = rng.normal(0.5, 0.1, (nwalkers, ndim - 2))

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[x, y, dy])
sampler.run_mcmc(starting_guesses, nsteps)

sample = sampler.chain  # shape = (nwalkers, nsteps, ndim)
sample = sampler.chain[:, nburn:, :].reshape(-1, ndim)
-c:21: RuntimeWarning: divide by zero encountered in log
-c:22: RuntimeWarning: divide by zero encountered in log

The Runtime warnings here are normal – they just indicate that we've hit log(0) = -inf for some pieces of the calculation.

With the sample chain determined, we can plot the marginalized distribution of samples to get an idea of the value and uncertainty of slope and intercept with this model:

In [17]:
plt.plot(sample[:, 0], sample[:, 1], ',k', alpha=0.1)
plt.xlabel('intercept')
plt.ylabel('slope');

Finally, we can make use of the marginalized values of all \(N + 2\) parameters and plot both the best-fit model and a model-derived indication of whether each point is an outlier:

In [18]:
theta2 = np.mean(sample[:, :2], 0)
g = np.mean(sample[:, 2:], 0)
outliers = (g > 0.5)

plt.errorbar(x, y, dy, fmt='ok', ecolor='gray')
plt.plot(xfit, theta1[0] + theta1[1] * xfit, color='gray')
plt.plot(xfit, theta2[0] + theta2[1] * xfit, color='black')
plt.plot(x[outliers], y[outliers], 'ro', ms=20, mfc='none', mec='red')
plt.title('Bayesian Fit');

The red circles mark the points that were determined to be outliers by our model, and the black line shows the marginalized best-fit slope and intercept. For comparison, the grey line is the standard maximum likelihood fit.

Notice that we have successfully fit an \((N + 2)\)-parameter model to \(N\) data points, and the best-fit parameters are actually meaningful in a deep way – the \(N\) extra parameters give us individual estimates of whether each of the \(N\) data points has misreported errors. I think this is a striking example of how a model that would be considered impossible under the model complexity myth can be solved in practice to produce very useful results!

Conclusion

I hope you will see after reading this post that the model complexity myth, while rooted in a solid understanding of simple linear models, should not be assumed to apply to all possible models. In fact, it is possible to fit models with more parameters than datapoints, and for the types of noisy, biased, and heterogeneous data we often encounter in scientific research, you can make a lot of progress by taking advantage of such models. Thanks for reading!

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

by Jake Vanderplas at July 06, 2015 03:00 PM

July 03, 2015

NeuralEnsemble

Sumatra 0.7 released

We would like to announce the release of version 0.7.0 of Sumatra, a tool for automated tracking of simulations and computational analyses, so that they can easily be replicated at a later date.

This version of Sumatra brings some major improvements for users, including an improved web browser interface, improved support for the R language, Python 3 compatibility, a plug-in interface making Sumatra easier to extend and customize, and support for storing data using WebDAV.
In addition, there have been many changes under the hood, including a move to Github and improvements to the test framework, largely supported by the use of Docker.
Last but not least, we have changed licence from the CeCILL licence (GPL-compatible) to a BSD 2-Clause Licence, which should make it easier for other developers to use Sumatra in their own projects.

Updated and extended web interface

Thanks to Felix Hoffman’s Google Summer of Code project, the web browser interface now provides the option of viewing the history of your project either in a “process-centric” view, as in previous versions, where each row in the table represents a computation, or in a “data-centric” view, where each row is a data file. Where the output from one computation is the input to another, additional links make it possible to follow these connections.
The web interface has also had a cosmetic update and several other improvements, including a more powerful comparison view (see screenshot). Importantly, the interface layout no longer breaks in narrower browser windows.
[Screenshot: comparing two records in the updated web interface]

BSD licence

The Sumatra project aims to provide not only tools for scientists as end users (such as smt and smtweb), but also library components for developers to add Sumatra’s functionality to their own tools. To support this second use, we have switched licence from CeCILL (GPL-compatible) to the BSD 2-Clause Licence.

Python 3 support

In version 0.6.0, Sumatra already supported provenance capture for projects using Python 3, but required Python 2.6 or 2.7 to run. Thanks to Tim Tröndle, Sumatra now also runs in Python 3.4.

Plug-in interface

To support the wide diversity of workflows in scientific computing, Sumatra has always had an extensible architecture. It is intended to be easy to add support for new database formats, new programming languages, new version control systems, or new ways of launching computations.
Until now, adding such extensions has required that the code be included in Sumatra’s code base. Version 0.7.0 adds a plug-in interface, so you can define your own local extensions, or use other people’s.
For more information, see Extending Sumatra with plug-ins.

WebDAV support

The option to archive output data files has been extended to allow archiving to a remote server using the WebDAV protocol.

Support for the R language

Sumatra will now attempt to determine the versions of all external packages loaded by an R script.

Other changes

For developers, there has been a significant change: the project has moved from Mercurial to Git and is now hosted on GitHub. Testing has also been significantly improved, with more system/integration testing and the use of Docker for testing PostgreSQL and WebDAV support.
Parsing of command-line parameters has been improved. The ParameterSet classes now have a diff() method, making it easier to see the difference between two parameter sets, especially for large and hierarchical sets.
Following the recommendation of the Mercurial developers, and to enable the change of licence to BSD, we no longer use the Mercurial internal API. Instead we use the Mercurial command line interface via the hgapi package.

Bug fixes

A fair number of bugs have been fixed.

Download, support and documentation


The easiest way to get the latest version of Sumatra is

  $ pip install sumatra

Alternatively, Sumatra 0.7.0 may be downloaded from PyPI or from the INCF Software Center. Support is available from the sumatra-users Google Group. Full documentation is available on Read the Docs.

by Andrew Davison (noreply@blogger.com) at July 03, 2015 03:55 PM

Abraham Escalante

Mid-term summary

Hello all,

We're reaching the halfway mark for the GSoC and it's been a great journey so far.

I have had some off court issues. I was hesitant to write about them because I don't want my blog to turn into me ranting and complaining but I have decided to briefly mention them in this occasion because they are relevant and at this point they are all but overcome.

Long story short, I was denied the scholarship that I needed to be able to go to Sheffield so I had to start looking for financing options from scratch. Almost at the same time I was offered a place at the University of Toronto (which was originally my first choice). The reason why this is relevant to the GSoC is because it coincided with the beginning of the program so I was forced to cope with not just the summer of code but also with searching/applying for funding and paperwork for the U of T which combined to make for a lot of work and a tough first month.

I will be honest and say that I got a little worried at around week 3 and week 4 because things didn't seem to be going the way I had foreseen in my proposal to the GSoC. In my previous post I wrote about how I had to make a change to my approach and I knew I had to commit to it so it would eventually pay off.

At this point I am feeling pretty good with the way the project is shaping up. As I mentioned, I had to make some changes, but out of about 40 open issues, only 23 now remain; I have lined up PRs for another 8 and I have started discussion (either with the community or with my mentor) on almost all that remain, including some of the longer ones like NaN handling, which will span the entire scipy.stats module and is likely to become a long-term community effort depending on what road Numpy and Pandas take on this matter in the future.

I am happy to look at the things that are still left and find that I at least have a decent idea of what I must do. This was definitely not the case three or four weeks ago and I'm glad with the decision that I made when choosing a community and a project. My mentor is always willing to help me understand unknown concepts and point me in the right direction so that I can learn for myself and the community is engaging and active which helps me keep things going.

My girlfriend, Hélène has also played a major role in helping me keep my motivation when it seems like things amount to more than I can handle.

I realise that this blog (since the first post) has been a lot more about my personal journey than technical details about the project. I do apologise if this is not what you expect but I reckon that this makes it easier to appreciate for a reader who is not familiarised with 'scipy.stats', and if you are familiarised you probably follow the issues or the developer's mailing list (where I post a weekly update) so technical details would be redundant to you.  I also think that the setup of the project, which revolves around solving many issues makes it too difficult to write about specific details without branching into too many tangents for a reader to enjoy.

If you would like to know more about the technical aspect of the project you can look at the PRs, contact me directly (via a comment here or the SciPy community) or even better, download SciPy and play around with it. If you find something wrong with the statistics module, chances are it's my fault, feel free to let me know. If you like it, you can thank guys like Ralf Gommers (my mentor), Evgeni Burovski and Josef Perktold (to name just a few of the most active members in 'scipy.stats') for their hard work and support to the community.

I encourage anyone who is interested enough to go here to see my proposal or go here to see currently open tasks to find out more about the project. I will be happy to fill you in on the details if you reach me personally.

Sincerely,
Abraham.

by noreply@blogger.com (Abraham Escalante) at July 03, 2015 01:07 AM

July 02, 2015

Continuum Analytics

Find Continuum at SciPy Austin 2015

Continuum is a sponsor of this year’s SciPy Conference in Austin, TX. We invite you to join us at the fourteenth annual Scientific Computing with Python conference, and to check out some really fantastic talks and tutorials from our talented engineers and developers.

by Continuum at July 02, 2015 12:00 AM

Continuum Analytics - July Tech Events

The Continuum Team is hitting some big conferences this month, and we hope you’ll join us for exciting talks on Bokeh, Numba, Blaze, Anaconda, and more. From our hometown of Austin, Texas all the way to Spain, our team is ready to talk Python.

by Continuum at July 02, 2015 12:00 AM

July 01, 2015

PythonXY

Python(x, y) 2.7.10.0 Released!

Hi All,

I'm happy to announce that Python(x, y) 2.7.10.0 is available for immediate download from any of the mirrors. The full change log can be viewed here. Please post your comments and suggestions to the mailing list.

Work on the Python 3.x 64 bit version will resume once Visual Studio 2015 RTM is released.

What's new in general:

  • All packages have been updated to their latest versions (NumPy, SciPy, etc.).
  • IPython has been held back at 2.4.1 to avoid potential compatibility issues.
  • OpenCV has been held back at 2.4.11 to avoid potential compatibility issues.

New noteworthy packages:

  • yappi - Yet Another Python Profiler.
  • libnacl - A ctypes wrapper around libsodium.
Have fun!

-Gabi Davar

by Gabi Davar (noreply@blogger.com) at July 01, 2015 11:16 PM

June 30, 2015

Matthieu Brucher

Audio Toolkit: Anatomy of a transient shaper

When I first heard about transient shapers, I was like “what’s the difference with a compressor? Is there one?”. I tried to see how to extract these transients without relying on the transient energy, using a relative power filter (the ratio between the instantaneous power and the mean power), or its derivative, but nothing worked. Then someone explained that the gain is driven by the difference between a fast-attack filtered power and a slower one. So here it goes.

First, let’s have a look at the filter graph:
Transient shaper pipeline

I’ve surrounded the transient-shaper-specific part with a dotted line. This is the difference with a compressor/limiter/expander: the way the signal that steers the gain computation is generated.

Let’s start from a kick. The fast envelope follower is generated first (red curve), then the slow envelope follower (green curve). Their difference is always positive (provided the two followers share the same release time), so we can use it to compute a gain through our usual GainCompressorFilter.
Depending on whether you want to increase the transient or attenuate it, the ratio will be below or higher than 1 (respectively), which is what is shown in the last two plots here:
Transient shaper plot

In the end, it’s all about the right algorithms. If you have a proper framework, you may already have everything you need for some filters. In the case of a transient shaper, I already had all the filters in my toolbox. Now I just need to make a plugin out of this simple pipeline!
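
To make the pipeline concrete, here is a minimal NumPy sketch of the idea. This is not Audio Toolkit code: the follower coefficients and the final gain law are placeholders standing in for the real attack/release filters and the GainCompressorFilter.

import numpy as np

def follow(power, attack, release):
    # One-pole envelope follower on the instantaneous power
    env = np.zeros_like(power)
    state = 0.0
    for i, p in enumerate(power):
        coeff = attack if p > state else release
        state += coeff * (p - state)
        env[i] = state
    return env

def transient_shaper(signal, fast_attack=0.9, slow_attack=0.1, release=0.01, ratio=0.5):
    power = np.asarray(signal, dtype=float) ** 2
    fast = follow(power, fast_attack, release)   # red curve in the plots
    slow = follow(power, slow_attack, release)   # green curve in the plots
    diff = np.maximum(fast - slow, 0.0)          # non-negative when both share the release time
    # Placeholder gain law: ratio < 1 boosts the transients, ratio > 1 attenuates them
    gain = (1.0 + diff) ** (1.0 / ratio - 1.0)
    return gain * signal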

The code for the last plots can be found