## August 03, 2015

### Matthew Rocklin

#### Caching

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: Caching improves performance under repetitive workloads. Traditional LRU policies don’t fit data science well. We propose a new caching policy.

## Humans Repeat Stuff

Consider the dataset that you’ve worked on most recently. How many times have you loaded it from disk into memory? How many times have you repeated almost the same computations on that data?

Exploratory data science workloads involve repetition of very similar computations. These computations share structure. By caching frequently used results (or parts of results) we may be able to considerably speed up exploratory data analysis.

## Caching in other contexts

The web community loves caching. Database backed web applications almost always guard their database lookups with a system like memcached which devotes some amount of memory to caching frequent and recent queries. Because humans visit mostly the same pages over and over again this can reduce database load by an order of magnitude. Even if humans visit different pages with different inputs these pages often share many elements.

## Limited caching resources

Given infinite memory we would cache every result that we’ve ever computed. This would give us instant recall on anything that wasn’t novel. Sadly our memory resources are finite, and so we evict cached results that don’t seem to be worth keeping around.

Traditionally we use a policy like Least Recently Used (LRU). This policy evicts results that have not been requested for a long time. This is cheap and works well for web and systems applications.

## LRU doesn’t fit analytic workloads

Unfortunately LRU doesn’t fit analytic workloads well. Analytic workloads have a large spread in both computation times and storage costs. While most web application database queries take roughly the same amount of time (100ms-1000ms) and take up roughly the same amount of space to store (1-10kb), the computation and storage costs of analytic computations can easily vary by many orders of magnitude (spreads in the millions or billions are common).

Consider the following two common computations of a large NumPy array:

1. `x.std()`: costly to recompute, cheap to store
2. `x.T`: cheap to recompute, costly to store

In the first case, `x.std()` might take a second on a large array (somewhat expensive) but the result takes only a few bytes to store. This result is so cheap to store that we’re happy to keep it in our cache for a long time, even if it’s infrequently requested.

In the second case, `x.T` is cheap to compute (just a metadata change in the array) and executes in microseconds. However, the result might take gigabytes of memory to store. We don’t want to keep this in our cache, even if it’s very frequently requested; it takes up all of the space for other, potentially more useful (and smaller) results, and we can recompute it trivially anyway.
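The asymmetry is easy to see in NumPy itself. A small sketch (sizes are exact, timings are machine-dependent):

```python
import numpy as np

# Illustrative sketch of the cost/size asymmetry.
x = np.random.random((1000, 1000))   # 8 MB of float64

std = x.std()   # scans every element: relatively costly to compute
xt = x.T        # a strides-only view: computed in microseconds

print(std.nbytes)  # 8 -- the scalar result costs 8 bytes to store
print(xt.nbytes)   # 8000000 -- materializing the transpose costs the full array
```

On a truly large array the compute time of `std` and the storage cost of `T` each grow by further orders of magnitude, so the two results sit at opposite corners of the cost/size plane.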

So we want to keep cached results that have the following properties:

1. Costly to recompute (in seconds)
2. Cheap to store (in bytes)
3. Frequently used
4. Recently used

## Proposed Caching Policy

Here is an alternative to LRU that respects the objectives stated above.

Every time someone accesses an entry in our cache, we increment the score associated with that entry by the following value:

```python
score += compute_time / nbytes * (1 + epsilon) ** t
```

where compute_time is the time it took to compute the result in the first place, nbytes is the number of bytes that it takes to store the result, epsilon is a small number that determines the halflife of what “recently” means, and t is an auto-incrementing tick time increased at every access.

This has units of inverse bandwidth (s/byte), gives more importance to recent results through a slowly growing exponential factor, and amplifies the score of frequently requested results in a roughly linear fashion.

We maintain these scores in a heap, keep track of the total number of bytes, and cull the cache as necessary to keep storage costs beneath a fixed budget. Updates cost O(log(k)) for k the number of elements in the cache.
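A minimal sketch of this policy in pure Python follows. The class and method names here are illustrative, not cachey’s actual internals, and where the text describes a heap with O(log(k)) updates, this sketch uses a simple linear scan for brevity:

```python
class ScoredCache:
    """Sketch of the scoring policy described above (illustrative only)."""

    def __init__(self, available_bytes, epsilon=1e-3):
        self.available_bytes = available_bytes
        self.epsilon = epsilon
        self.tick = 0          # incremented on every access
        self.total_bytes = 0
        self.entries = {}      # key -> (value, compute_time, nbytes)
        self.scores = {}       # key -> cumulative score

    def _bump(self, key):
        value, compute_time, nbytes = self.entries[key]
        self.tick += 1
        # Score increment: units of s/byte, amplified for recency
        self.scores[key] = (self.scores.get(key, 0.0)
                            + compute_time / nbytes
                              * (1 + self.epsilon) ** self.tick)

    def put(self, key, value, compute_time, nbytes):
        if key not in self.entries:
            self.total_bytes += nbytes
        self.entries[key] = (value, compute_time, nbytes)
        self._bump(key)
        self._shrink()

    def get(self, key):
        self._bump(key)        # frequent access grows the score
        return self.entries[key][0]

    def _shrink(self):
        # Evict the lowest-scored entries until we fit the byte budget.
        while self.total_bytes > self.available_bytes and self.scores:
            key = min(self.scores, key=self.scores.get)
            self.total_bytes -= self.entries.pop(key)[2]
            del self.scores[key]
```

Under this policy a slow-to-compute, tiny result (like `x.std()`) accumulates a large score and lingers, while a fast-to-compute, huge result (like a materialized `x.T`) scores near zero and is the first to go.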

## Cachey

I wrote this up into a tiny library called cachey. This is experimental code and subject to wild API changes (including renaming.)

The central object is a Cache, which asks for the following:

1. Number of available bytes to devote to the cache
2. Halflife on importance, i.e. the number of accesses needed to reduce the importance of a cached result by half (default 1000)
3. A lower limit on costs to consider entry to the cache (default 0)

## Example

Here is a tiny example:

```python
>>> from cachey import Cache
>>> c = Cache(available_bytes=1e9)  # 1 GB

>>> c.put('Hello', 'world!', cost=3)

>>> c.get('Hello')
'world!'
```

## More interesting example

The cache includes a memoize decorator. Let’s memoize `pd.read_csv`.

```
In [1]: import pandas as pd
In [2]: from cachey import Cache

In [3]: c = Cache(1e9)

In [4]: read_csv = c.memoize(pd.read_csv)

In [5]: %time df = read_csv('accounts.csv')
CPU times: user 262 ms, sys: 27.7 ms, total: 290 ms
Wall time: 303 ms

In [6]: %time df = read_csv('accounts.csv')
CPU times: user 77 µs, sys: 16 µs, total: 93 µs
Wall time: 93 µs

In [7]: c.total_bytes / c.available_bytes
Out[7]: 0.096
```

So we create a new function read_csv that operates exactly like pandas.read_csv except that it holds on to recent results in c, a cache. This particular CSV file created a dataframe that filled a tenth of our cache space. The more often we request this CSV file the more its score will grow and the more likely it is to remain in the cache into the future. If other memoized functions using this same cache produce more valuable results (costly to compute, cheap to store) and we run out of space then this result will be evicted and we’ll have to recompute our result if we ask for read_csv('accounts.csv') again.

## Memoize everything

Just memoizing read_csv isn’t very interesting. The pd.read_csv function operates at a constant data bandwidth of around 100 MB/s. The caching policies in cachey really shine when they get to see all of our computations. For example it could be that we don’t want to hold on to the results of read_csv because these take up a lot of space. If we find ourselves doing the same groupby computations then we might prefer to use our gigabyte of caching space to store those instead, because

1. groupby computations take a long time to compute
2. groupby results are often very compact in memory
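Both points are easy to see with a stand-in for a groupby. This is pure Python; `group_sum` here is a hypothetical toy aggregation, not a pandas call:

```python
from collections import defaultdict

def group_sum(pairs):
    """Stand-in for an expensive groupby: aggregate (key, value) pairs."""
    out = defaultdict(float)
    for key, value in pairs:
        out[key] += value
    return dict(out)

data = [('a', 1.0), ('b', 2.0), ('a', 3.0)] * 100_000  # large input
result = group_sum(data)   # one full pass over the data: slow to compute...
print(len(result))         # 2 -- ...but only two groups: tiny to store
```

The compute-time-per-byte-stored ratio of such a result is enormous, which is exactly what the scoring policy rewards.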

I’m slowly working on integrating cachey into dask’s shared memory scheduler.

Dask is in a good position to apply cachey to many computations. It can look at every task in a task graph and consider the result for inclusion into the cache. We don’t need to explicitly memoize every function we want to use, dask can do this for us on the fly.

Additionally dask has a nice view of our computation as a collection of sub-tasks. Similar computations (like mean and variance) often share sub-tasks.

## Future Work

Cachey is new and untested but potentially useful now, particularly through the memoize method shown above. It’s a small and simple codebase. I would love to hear if people find this kind of caching policy valuable.

I plan to think about the following in the near future:

1. How do we build a hierarchy of caches to share between memory and disk?
2. How do we cleanly integrate cachey into dask (see PR #502) and how do we make dask amenable to caching (see PR #510)?

## July 28, 2015

### Matthieu Brucher

#### Audio Toolkit: Different low pass filters

There are several different low pass filters, and just as many high pass, band pass, band stop… filters. In Audio Toolkit, several of the usual implementations are available:

• Bessel
• Butterworth
• Chebyshev type 1
• Chebyshev type 2
• Second order
• RBJ

and it is possible to implement others, at different orders as well…

Here is a display of the behavior of these filters (at order 2, for those that support different orders), in amplitude and phase:

Low-pass filter comparison

What is interesting is that for low pass filters, the Butterworth filter is the same as the second order filter implementation. Linkwitz-Riley filters have a lower amplitude than the others, and Chebyshev type 2 behaves quite differently from all the others (because its prototype is different: it is based not on a band pass but on a band stop).

With all these options, I think Audio Toolkit has all the bases covered.

## July 27, 2015

### Martin Fitzpatrick

#### Create simple GUI Applications with Python and Qt

So, for some unknown reason I recently thought it might be good to start putting together some video tutorials for Python. I thought I’d start with something I was sure I knew a fair bit about: writing GUI applications with PyQt (+PySide).

It turned out to be a lot more work than I thought, but rewarding to get the finished product. Like most programming tutorials it is predominantly screencast based, but I’ve included short lecture-like segments in each part to get across the key points. For those interested in such things, the setup I used was —

• Screen recording - ScreenFlow (Mac)
• Microphone - Samson GO Mic USB
• Audio editing - Audacity

#### Enterprise-Ready Anaconda Platform Supports New Generation of Data-Driven Companies

AUSTIN, TX. — July 23, 2015 — Continuum Analytics, which develops Anaconda, the leading modern open source analytics platform powered by Python, has secured a $24 million Series A funding round led by General Catalyst Partners and BuildGroup. The funds will be leveraged to accelerate product development, expand sales and marketing, and continue to invest in strengthening the Python and Anaconda communities. The latest round of $24 million in funding comes after $10 million in initial funding from a combination of angel investors and a DARPA award, bringing total funding to date to $34 million.

### Matthew Rocklin

#### Custom Parallel Workflows

This work is supported by Continuum Analytics and the XDATA Program as part of the Blaze Project

tl;dr: We motivate the expansion of parallel programming beyond big collections. We discuss the usability of custom dask graphs.

## Recent Parallel Work Focuses on Big Collections

Parallel databases, Spark, and Dask collections all provide large distributed collections that handle parallel computation for you. You put data into the collection, program with a small set of operations like map or groupby, and the collections handle the parallel processing. This idea has become so popular that there are now a dozen projects promising big and friendly Pandas clones.

This is good. These collections provide usable, high-level interfaces for a large class of common problems.

However, many workloads are too complex for these collections. Workloads might be complex either because they come from sophisticated algorithms (as we saw in a recent post on SVD) or because they come from the real world, where problems tend to be messy.

In these cases I tend to see people do two things

1. Fall back to multiprocessing, MPI or some other explicit form of parallelism
2. Perform mental gymnastics to fit their problem into Spark using a clever choice of keys. These cases often fail to achieve much speedup.

Historically I’ve recommended the manual construction of dask graphs in these cases. Manual construction of dask graphs lets you specify fairly arbitrary workloads that then use the dask schedulers to execute in parallel. The dask docs hold the following example of a simple data processing pipeline:

```python
def load(filename):
    ...

def clean(data):
    ...

def analyze(sequence_of_data):
    ...

def store(result):
    ...

dsk = {'load-1': (load, 'myfile.a.data'),
       'load-2': (load, 'myfile.b.data'),
       'load-3': (load, 'myfile.c.data'),
       'clean-1': (clean, 'load-1'),
       'clean-2': (clean, 'load-2'),
       'clean-3': (clean, 'load-3'),
       'analyze': (analyze, ['clean-%d' % i for i in [1, 2, 3]]),
       'store': (store, 'analyze')}

from dask.multiprocessing import get
get(dsk, 'store')  # executes in parallel
```

Feedback from users is that this is interesting and powerful, but that programming directly in dictionaries is not intuitive, doesn’t integrate well with IDEs, and is prone to error.

To create the same custom parallel workloads using normal-ish Python code we use the dask.do function. This do function turns any normal Python function into a delayed version that adds to a dask graph. The do function lets us rewrite the computation above as follows:

```python
from dask import do

loads = [do(load)(fn) for fn in filenames]
cleaned = [do(clean)(x) for x in loads]
analysis = do(analyze)(cleaned)
result = do(store)(analysis)
```

The explicit function calls here don’t perform work directly; instead they build up a dask graph which we can then execute in parallel with our choice of scheduler.

```python
from dask.multiprocessing import get
result.compute(get=get)
```

This interface was suggested by Gael Varoquaux based on his experience with joblib. It was implemented by Jim Crist in PR (#408).

## Example: Nested Cross Validation

I sat down with a machine learning student, Gabriel Krummenacher, and worked to parallelize a small piece of code that does nested cross validation. Below is a comparison of a sequential implementation that has been parallelized using dask.do:

You can safely skip reading this code in depth. The take-away is that it’s somewhat involved but that the addition of parallelism is light.

The parallel version runs about four times faster on my notebook. Disclaimer: the sequential version presented here is just a downgraded version of the parallel code, which is why they look so similar. This is available on github.

So the result of our normal imperative-style for-loop code is a fully parallelizable dask graph. We visualize that graph below.

```python
test_score.visualize()
```


## Help!

Is this a useful interface? It would be great if people could try this out and generate feedback on dask.do.

For more information on dask.do see the dask imperative documentation.

## July 21, 2015

### Titus Brown

#### DIB jclub: Fast and sensitive mapping of error-prone nanopore sequencing reads with GraphMap

Note: at the Lab for Data Intensive Biology, we're trying out a new journal club format where we summarize our thoughts on the paper in a blog post. For this blog post, Luiz wrote the majority of the text and the rest of us added questions and comments.

The paper:

Fast and sensitive mapping of error-prone nanopore sequencing reads with GraphMap.

Relevant background:

## Summary

This paper presents a new read mapper, GraphMap.

The method is described as a five-stage “read funneling” approach, which successively reduces the number and complexity of the decisions that need to be made.

### Stage 1: Region selection

Gapped spaced seeds are not the same as our last journal club, but it's an interesting strategy for selecting seeds to extend alignments.

Based on Levenshtein distance, it uses gaps inside k-mers to consider mismatches and indels. Three shapes are used, with one or two positions being DC ("don't care") bases:

• 1110111: DC base can be either a match or mismatch
• 111111: one deletion at the specified position (?)
• 11100111: DC base and following base are skipped. At most one insertion and one match/mismatch.

(Can we use longer shapes? These one are for fairly small _k_, if we can extend the idea to arbitrary _k_ it might be useful for seeding on graphalign).

Hough transforms are used in a clever way to bin seeds into viable regions, but it depends on reference coordinates (so it is not so useful for us in the absence of a decent reference).

### Stage 2: Graph-based vertex-centric construction of anchors

The 'graph' part of GraphMap comes from the next step, where the seeds are processed to construct alignment anchors. Given a target and a query, a "kmer mapping graph" (a DAG) is built for the target. In a "kmer mapping graph" distinct nodes can represent the same k-mer. For example, the first step in constructing a 2-mer mapping graph for CTAATATC would be CT -> TA -> AA -> AT -> TA -> AT -> TC (note nodes TA, AT appearing twice). Then, for each vertex an edge is added for every successor until the last node is reached.
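The construction described above can be sketched in a few lines of Python. This is a hypothetical helper, not GraphMap’s code; following the text, the same k-mer occurring at different positions gets distinct nodes, and each node gets a forward edge to every successor:

```python
from collections import defaultdict

def kmer_mapping_graph(seq, k=2):
    """Build a 'kmer mapping graph': one node per k-mer occurrence
    (indexed by position, so repeated k-mers are distinct nodes),
    with an edge from each node to every later node."""
    nodes = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    edges = defaultdict(list)
    for i in range(len(nodes) - 1):
        for j in range(i + 1, len(nodes)):
            edges[i].append(j)   # node i links forward to node j
    return nodes, dict(edges)

nodes, edges = kmer_mapping_graph("CTAATATC", k=2)
print(nodes)  # ['CT', 'TA', 'AA', 'AT', 'TA', 'AT', 'TC']
```

This reproduces the CTAATATC example from the text, with TA and AT each appearing as two distinct nodes.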

These graphs are small, since target and query are a read sequence and a single region of the reference (for memory consumption purposes, read sequence are usually smaller and so used as target). A new index is constructed for the target on the fly, with a much smaller seed size (defaults to k=6). For k < 10 perfect hashing is used, for k >= 10 a suffix array.

After the graph is ready, the mapping works by walking along target and query simultaneously. The query is processed as a sliding window, and an edge is followed to extend the walk each step in the target, while keeping track of all walks corresponding to potential mapping sites.

There are similarities to how partial order alignment works, but how is this stage different from just doing DP?

### Stage 3: Extending anchors into alignments using LCSk

(nothing here)

### Stage 4: Refining alignments using $L_1$ linear regression / Stage 5: Construction of final alignment

Just a summary of stages 4 and 5: after we have extended anchors in stage 3, we will have a set of points representing the alignments from LCSk, mixed with a set of noise (indels, sequencing errors, etc.). To refine these alignments, we need to draw a line that best fits these points. This is done using linear regression, which fits the alignment’s “predictive model” from the observed points (the list of anchors).

The points that lie within a given $d_{L1}$ on either side of the line represent our best alignments; points that deviate should be discarded.

Then the standard deviation of anchors from this line, the number of exact k-mers covered by anchors around the line, the length of the query that matched the target, the number of bases covered by anchors (normalized), and the read length are used to compute an f score. The region with the highest f score is picked for final alignment.

(I think the reference coordinates c can be estimated from the position on the read and the position of the hit, so we still can use Hough Transform?)

Stages 4 and 5 seem heavyweight.

• we are unclear on how the index would be constructed in the absence of a reference, in the case of a de novo assembly. The last paragraph of the Discussion states: "GraphMap’s sensitivity and specificity as a mapper could thus serve as the basis for fast computation of overlap alignments and de novo assemblies in the future." But the algorithm from Figure 1b and Methods (Stage I: region selection) seems to be based on seed finding between query and reference. What is used as the reference in the absence of one?
• GraphMap showed improved runtime (Suppl Table 2) compared to "gold standard" BLAST alignment. In Figure 2 with synthetic data, data platform type made the biggest difference in GraphMap performance compared to "gold-standard" BLAST aligner. Oxford Nanopore 2D data (double-stranded) had consistently among the highest precision relative to other platforms, although PacBio was close behind in both C. elegans and H. sapiens sets. Interesting that genome size (from 2.2 Mbp = N. meningitidis to 198 Mbp = H. sapiens ch3) didn't make much difference in precision (mapped location on read; seed hit, k-point) or recall (alignment) (https://en.wikipedia.org/wiki/Precision_and_recall). Recall was much lower in all ONT data regardless of species.
• Impressed with figure 3a, difference between GraphMap and LAST mapping tool with difference in consensus sequences (colored = low confidence, grey = higher confidence). Would liked to have seen their BWA-MEM and BLASR results, although Figure 3b suggests LAST was closer to GraphMap with higher coverage.
• Interesting applications outline in Results. Benefit of GraphMap to reduce pipeline resources required for ONP data? Mick Watson suggests BLAST (https://biomickwatson.wordpress.com/2015/06/01/analysing-minion-data-with-no-bioinformatics-expertise-nor-infrastructure/).
• Reasons for why we care include new ONP technology in field applications, e.g. identifying pathogens in remote location with local install. Species predictions in Table 3, F1 (mean of precision and recall) higher for GraphMap. Need for more testing with real ONP data (just 3 species were tested in this paper) and with higher complexity, e.g. pathogenic microbial eukaryotes?
• We were a bit surprised that longest-common-subsequence works so well with ONT, but that's why they did it with only the subsequences extracted after the graph approach.
• "Our comparisons with BLAST suggest that reads that cannot be mapped by GraphMap may essentially be unmappable." Did they characterize these reads at all?
• What's the memory usage? Largely or entirely unmentioned.
• We were confused by the gapped spaced seeds/gapped q-gram filter stuff. (p10)
• We do not think they tested for genome scaling appropriately. They need to show an example for, say, a whole human genome. As Lisa noticed, there is no change in precision with their bigger genomes.
• The clinical application is very interesting. They did not compare precision of other mappers using strain specific sequence.

### Matthieu Brucher

#### Announcement: ATKSideChainCompressor

I’m happy to announce the release of a side-chain stereo compressor based on the Audio Toolkit. It is available on Windows and OS X (min. 10.8) in different formats. This stereo compressor can work on two channels, left/right or middle/side, possibly in linked mode (only one set of parameters), and can be set up to mix the input signal with the compressed signal (serial/parallel compression). The side chain channels can be used to steer the gain stage (the same setup will be used, right/left or middle/side).

ATKSideChainCompressor

The supported formats are:

• VST2 (32bits/64bits on Windows, 64bits on OS X)
• VST3 (32bits/64bits on Windows, 64bits on OS X)
• Audio Unit (64bits, OS X)

The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code.

#### Announcement: Audio TK 0.7.0

The focus of this release was on performance. As such the core functions were optimized, as well as some of the tools and EQs.

A new filter dedicated to fast convolution (using a fixed-size partition with a mix of FFT convolution and explicit FIR filter) with 0 latency was added.

Changelog:

0.7.0
* Fixed the FFT implementation
* Added a fast zero latency convolution filter with tests and comparison with a basic FIR implementation
* Enhanced global framework performance (Core, EQ, Tools)
* Enhanced dynamic gain filters behavior by switching between LUT or direct computation depending on the LUT state

## July 19, 2015

### Titus Brown

#### A year-old editorial regarding computational infrastructure sustainability

Note: A year ago, I wrote this in response to an editorial request. Ultimately they weren't interested in publishing it, and I got distracted and this languished on my hard disk. So when I remembered it recently, I decided to just push it out to my blog, where I should have put it in the first place. --titus

All things come to an end eventually, including funding for computational services. What is a field to do?

As biology steadily becomes more and more data intensive, the need for community-wide analysis software, databases, data curation, and Internet services grows. While writing software and gathering data are both expensive, the cost of maintaining software and curating data sets over time can dwarf the upfront costs; in the software industry, for example, software maintenance and support costs can be 10- or 100-fold the initial investment to develop the software. Despite this, there is often little sustained academic funding for software maintenance, data curation, and Internet service support; in part, this may be because the maintenance costs for existing data and software could easily consume all of the available funding! Yet this lack of infrastructure is increasingly problematic, as data set sizes and curation needs grow, and tools to interpret and integrate data sets become ever more critical to forward progress in biology. How can we develop and maintain robust infrastructure services while enabling creative development of new resources and software?

The volume, velocity, and variety of data in biology is stunning, and would challenge even the handling capacity of more established data-intensive fields with larger computational investments. For example, by analogy with astronomy, Golden et al. (2012) propose bringing processing pipelines closer to the data generating instrument (in this case, the sequencing machine). While this approach can certainly help address the volume and velocity of sequencing data, it fails to address the variety -- there are dozens of types of sequencing output, with perhaps hundreds of different processing pipelines, the choice of which depends critically on the biological system being analyzed and the questions being asked of the data. Some subfields of biology may well be able to standardize -- for example, variation analysis for the human genome is increasingly using only a few processing pipelines -- but for environmental sequencing, the types of systems and the metadata being gathered are extremely diverse and nowhere near standardization. We must recognize that our knowledge of the natural biological world is so shallow, and the data gathering needs so great, that the field is very immature compared to other data-intensive sciences like particle physics and astronomy.

How can we build sustained computational analysis and data storage services, in the face of increasingly large and diverse biological data sets, with fast-moving analysis needs? This question has been brought into sharp relief in recent years, with the lapses in funding of TAIR, Tranche, and CAMERA.

While substantial investments have been made in a variety of genomic and transcriptomic analysis services, only a few projects have achieved sustained funding independent of large host institutes. Chief among these are the biomedically relevant projects, which include Wormbase, Flybase, and SGD, all of which have been funded for well over a decade by the NIH. Many others, including iPlant Collaborative and KBase, are in a ramp-up phase and are still exploring options for long-term support. With rare exceptions, it is safe to say that no large cyberinfrastructure effort has successfully weaned itself from continued large-scale support from a granting agency - and some have failed to find this continued funding, and have no clear future.

The challenges for sustainability of cyberinfrastructure are significant. The necessary mix of data storage, research software development, database curation, and service hosting requires substantial and diverse computational expertise, large compute resources, and extensive community involvement to ensure relevance. Even individually, these can be hard to find, and yet projects often try to combine all four of these: to a large extent they buy their own hardware, manage it with their own infrastructure software, develop their own research analysis software, store their data, and curate their databases. Hybrid models exist -- for example, iPlant Collaborative works with a number of external computational biologists to develop and integrate tools -- but these efforts are often centrally managed and continue to require substantial funding for this integration.

Another challenge is that of maintaining innovation in algorithm and software development while continuing to provide robust services. Many innovative computational tools have emerged from small labs and become more broadly useful, but it can be hard for small labs to engage with large, centralized infrastructure projects. Moreover, even in these models, the larger efforts can only engage deeply with a few collaborators; these choices privilege some tools over others, and may not be based on technical merit or community need. This may also arise from the tension between engineering and research needs: large projects prize engineering stability, while research innovation is inherently unstable.

There is the hope of a more sustainable path, rooted in two approaches -- one old, and one new. The old and proven approach is that of open source. The open source community has existed for almost half a century now, and has proven to be remarkably capable: open source languages such as R and Python are widely used in data analysis and modeling, and the Linux operating system dominates scientific computing. Moreover, the open source workflow closely tracks the ideal of a scientific community, with a strong community ethic, widespread collaboration, and high levels of reproducibility and good computational practice (Perez and Millman, 2014). The new approach is cloud computing, where the advent of ubiquitous virtualization technology has made it possible for entire companies to dynamically allocate disk and compute infrastructure as needed with no upfront hardware cost. Open source approaches provide an avenue for long-term research software sustainability, while cloud computing allows cyberinfrastructure projects to avoid upfront investment in hardware and lets them grow with their needs.

Interestingly, two notable exceptions to the cyberinfrastructure sustainability dilemma exploit both open source practices and cloud computing. The Galaxy Project develops and maintains an open source Web-based workflow interface that can be deployed on any virtual machine, and in recent years has expanded to include cloud-enabled services that lets users manage larger clusters of computers for more compute-intensive tasks. Importantly, users pay for their own compute usage in the cloud: tasks that consume more compute resources will cost more. Since Galaxy is also locally deployable, heavy users can eschew the cost of the cloud by installing it on existing local compute resources. And, finally, large providers such as iPlant Collaborative can host Galaxy instances for their user communities. Likewise, the Synapse project is an open source project developed by Sage Bionetworks that hosts data and provenance information for cooperative biomedical analysis projects. While less broadly used than Galaxy, Synapse is -- from an infrastructure perspective -- infinitely expandable: the Sage Bionetworks developers rely entirely on the Amazon cloud to host their infrastructure, and scale up their computing hardware as their computing needs increase.

A general approach using open source and cloud computing approaches could separate data set storage from provision of services, active database curation, and software development. One example could look like this: first, long-term community-wide cyberinfrastructure efforts would focus on static data storage and management, with an emphasis on building and extending metadata standards and metadata catalogs. These efforts would place data in centralized cloud storage, accessible to everyone. There, separately funded data services would actively index and serve the data to address the questions and software stacks of specific fields. In tandem, separately funded new data curation and research software development efforts would work to refine and extend capabilities.

If we follow this path, substantial upfront investment in tool development and data curation will still be needed -- there's no such thing as a free lunch. However, when the project sunsets or funding lapses, with the open source/open data route there will still be usable products at the end. And, if it all rests on cloud computing infrastructure, communities can scale their infrastructure up or down with their needs and only pay for what they use.

Funders can help push their projects in this direction by requiring open data and open source licenses, encouraging or even requiring multiple deployments of the core infrastructure on different cloud platforms, and ultimately by only funding projects that build in sustainability from the beginning. Ultimately, funders must request, and reviewers require, an end-of-life plan for all infrastructure development efforts, and this end-of-life plan should be “baked in” to the project from the very beginning.

In the end, providing robust services while developing research software and storing and curating data is both challenging and expensive, and this is not likely to change with a top-down funding or management paradigm. We need new processes and approaches that enable bottom-up participation by small and large research groups; open approaches and cloud computing infrastructure can be a solution.

## July 17, 2015

### Titus Brown

#### A review of "Tools and techniques for computational reproducibility"

Note: Last week, I submitted my review of Stephen R. Piccolo, Adam B. Lee, and Michael B. Frampton's paper, Tools and techniques for computational reproducibility. Soon after, Dan Katz wrote a blog post about notebooks, and in a comment I mentioned Piccolo's paper; and, after dropping a note to Dr. Piccolo, he put it up on bioRxiv! This, in turn, meant that I felt comfortable posting my review - since I sign my reviews, the authors already know who I am, so where's the harm? See below.

In this paper, Piccolo et al. do a nice (and I think comprehensive?) job of outlining six strategies for computational reproducibility. The point is well made that science is increasingly dependent on computational reproducibility (and that in theory we should be able to do computational reproducibility easily and well) and hence we should explore effective approaches that are actually being used.

I know of no other paper that covers this array of material, and this is a quite nice exposition that I would recommend to many. I can't evaluate how broadly it will appeal to a diverse audience but it seems very readable to me.

The following comments are offered as helpful suggestions, not criticisms -- make of them what you will.

The paper almost completely ignores HPC. I'm good with that, but it's a bit surprising (since many computational scientists seem to think that reproducible orchestration of many processors is an unachievable task). Noted in passing.

I was somewhat surprised by the lack of emphasis on version control systems. These are really critical in programming for ensuring reproducibility. I also found a missing citation! You should look at journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745 (yes, sorry, I'm on the paper).

Speaking of which, I appreciate the completeness of references (and even the citation of my blog post ;) but it would be interesting to see if Millman and Perez have anything to offer: http://www.jarrodmillman.com/oss-chapter.html. Certainly a good citation (I think you hit the book, but this is a particularly good chapter.)

I would suggest (in the section that mentions version control systems, ~line 170 of p9) recommending that authors "tag" specific versions for the publication, even if they later recommend using updated versions. (Too many people say "use this repo!" without specifying a revision.)

The section on literate programming could usefully mention that these literate programming environments do not offer good mechanisms for long-running programs, so they may not be appropriate for things that take more than a few minutes to run.

Also, and perhaps most important, these literate programming environments provide a REPL and can thus track exploratory data analysis and "harden" it when it works and the author moves on to another data analysis - so even if the authors don't want to clean up their notebook before publication, you can track exactly how they got their final results. I think this is important for practical reproducibility. I don't know quite what to suggest in the context of the paper but it seems like an important point to me.

Both the virtual machine and container sections should mention the challenges of raw data bundling, which is one of the major drawbacks here - not only is the VM large, but (unless you are partnering with e.g. Amazon to "scale out") you must distribute potentially large data sets. I think this is one of the biggest practical issues facing data intensive sciences. (There was a nice commentary recently by folk in human genomics begging the NIH to make human genomic data available via the cloud; I can track it down if the authors haven't seen it.)

I think it's important to emphasize how transparent most Dockerfiles are (and how this is a different culture than the VM deployment scene, where configuration systems are often not particularly emphasized except in the devops community). I view this as one of the most important cultural differences driving container adoption, and for once it's good for science!

The docker ecosystem also seems quite robust, which is important, I think.

[ ... specific typos etc omitted ... ]

Signed,

C. Titus Brown, ctbrown@ucdavis.edu

## July 16, 2015

### Matthieu Brucher

#### Convert HPCToolkit files to callgrind format

After my post on HPCToolkit, I felt that I preferred QCacheGrind as a GUI to explore profiling results. So here is a gist with a Python script to convert XML HPCToolkit experiments to callgrind format: https://gist.github.com/mbrucher/6cad31e38beca770523b

For instance, this is a display of an Audio Toolkit test of L2 cache misses:
ATK L2 cache misses profile

Enjoy!

## July 14, 2015

### Matthieu Brucher

#### Profiling with HPC Toolkit

I’ve started working with the HPC Toolkit, especially the part where it can use PAPI to read hardware counters and display the result in a nice GUI.

# Installation

Installation is a little bit of a pain, but it is somewhat simplified, as all external dependencies are available in a single repository, except for PAPI, which you have to install by hand.

What is also handy is that all the GUIs are available on all platforms and can be used right away. The not-so-handy part is that HPC Toolkit can be used on different platforms, but PAPI cannot. And as I like PAPI…

# Usage

Usage is really simple. Multithreaded applications are handled out of the box: all threads and even MPI processes are profiled separately, and in the viewer the results are shown with a specific [thread, process] index. This is really neat.

But before you can do that, there are a couple of commands to execute. The first one is hpcstruct, which creates an XML file, named binary.hpcstruct, listing all the known functions:

hpcstruct binary


After this, the profiling can be done:

hpcrun -t -e PAPI_TOT_CYC@15000000 -e PAPI_L1_TCM@400000 -e PAPI_L2_TCM@400000 -e WALLCLOCK@5000 -o outputfolder binary arguments


Here there are 3 PAPI counters and one general counter (WALLCLOCK is relevant because current processors have a variable frequency, meaning that the total number of cycles is not enough to actually infer the time spent). After the @ comes the sampling period. I haven't yet said what kind of profiling PAPI provides; it's now clear that it's sampling. Sampling means that from time to time the application stops and the profiler looks at its state. If the sampling period is too short, the profile will be biased and show profiling errors, or the run can even break if the profiler gets interrupted while writing its report…

A list of available counters can be obtained with hpcrun -L, or papi_avail -a for PAPI counters.

Once this is done, the profile has to be converted into a "database":

hpcprof outputfolder -S binary.hpcstruct -I include_folder


The include folder is relevant if you want to browse through the source folders while analyzing your profile. There are two applications that can be used to review profiles, available on all platforms: hpcviewer or hpctraceviewer.

Here is an example of hpcviewer:

HPCViewer main window

There are three ways of displaying results:

• Top-down, starting with top functions and diving into more specific functions, following the call tree
• Bottom-up, starting with all the leaves of the call tree, a really informative view that captures the biggest offenders at a glance
• A flat view, file by file, then function by function and finally line by line

# Discussion

There is one main warning about these trees: if you compiled in optimized mode with GCC, some layers will be missing where GCC inlined functions. So it takes some agility to actually understand what happens. Another solution is to use the Intel compiler, which has an option to properly generate debug info. The case where this bugs me the most is lambda functions, as they get displayed as operator()! Without the source code, there is no way of knowing where such a function actually comes from.

Another thing to remember when checking values is what the third view shows. Each time a sample is taken, the program is at some point in its execution, and the whole sampling period is attributed to the line where the program stopped/was sampled. It's like ergodicity for a random variable: we assume that, given enough samples, averaging over the samples is the same as averaging over the timings. So the total profiling time has to be long compared to the number of functions profiled and the sampling period.
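This attribution model can be sketched as a toy calculation (the line labels and counts here are made up for illustration, not HPCToolkit output):

```python
def estimate_line_times(sample_counts, period_us):
    # Each sample charges one full sampling period to the line where the
    # program was interrupted; the estimate converges as samples accumulate.
    return {line: n * period_us for line, n in sample_counts.items()}

# 300 samples on one line with a 5000 us period -> ~1.5 s attributed to it.
times = estimate_line_times({"kernel.c:10": 300, "kernel.c:42": 100}, period_us=5000)
```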

Also, when comparing with other tools like Parallel Studio or Valgrind, I miss at least a bar allowing me to compare the costs of different functions (be it for implicit or explicit counters). I think a callgrind-like profile can be created out of the HPC Toolkit profile. For the moment it's an XML file, but they plan on moving to a binary format…

I thought I could do something with the CSV export, but it only exports timings for unfolded lines, without line or file names and without any hierarchy. It is doable from the original database, though, as it is stored as XML and the callgrind file format is open.

One key advantage of HPC Toolkit over Callgrind is the way the profile is computed (sampling vs emulation): profiles are faster to create, and they show a good picture of what happens within the processor, which an emulation profile can't exactly show (it depends on a model of the processor, which is never exactly the right model). But sampling is not reproducible, as it depends on the CPU load, so more care is required to analyze these profiles. Also, it is only usable on Linux…

It is a great complement to Valgrind/Callgrind, and the fact that it is open source means that there is no excuse for not profiling your application for bottlenecks.

## July 13, 2015

### Wei Xue

#### GSoC Week 6/7

In weeks 6 and 7, I coded BayesianGaussianMixture for the full covariance type. It now runs smoothly on synthetic data and the Old Faithful data. Take a peek at the demo.

from sklearn.mixture.bayesianmixture import BayesianGaussianMixture as BGM
bgm = BGM(n_init=1, n_iter=100, n_components=7, verbose=2,
          init_params='random', precision_type='full')
bgm.fit(X)


The demo repeats the experiment of PRML, page 480, Figure 10.6. VB on BGMM has shown its capability of inferring the number of components automatically. It converged in 47 iterations.

The ELBO looks a little weird. It is not always going up: when some clusters disappear, the ELBO goes down a little bit, then goes straight up. I think it is because the estimation of the parameters is ill-posed when these clusters have fewer data samples than the number of features.

BayesianGaussianMixture has many more parameters than GaussianMixture: six per component. I feel it is not easy to manage so many functions and parameters, and the initial design of BaseMixture is also not so good. I took a look at bnpy, which is a more complicated implementation of VB on various mixture models. Though I don't need such a complicated implementation, the decoupling of the observation model, i.e. $X$, $\mu$, $\Lambda$, from the mixture model, i.e. $Z$, $\pi$, is quite nice. So I tried to use mixin classes to represent these two models. I split MixtureBase into three abstract classes: ObsMixin, HiddenMixin and MixtureBase(ObsMixin, HiddenMixin). I also implemented subclasses for the Gaussian mixture, ObsGaussianMixin(ObsMixin), MixtureMixin(HiddenMixin), and GaussianMixture(MixtureBase, ObsGaussianMixin, MixtureMixin), but Python does not allow me to do this because there is no consistent MRO. :-| I changed them back, but this unsuccessful experiment gave me a nicer base class, MixtureBase.

I also tried to use cached_property to store intermediate variables such as $\ln \pi$, $\ln \Lambda$, and the Cholesky-decomposed $W^{-1}$, but didn't get much benefit. It is almost the same as saving these variables as private attributes on the instances.
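The pattern being described can be sketched with the modern functools.cached_property (the class and attribute names here are hypothetical, not the scikit-learn code):

```python
import functools
import math

class EStepCache:
    # Hypothetical sketch: cache intermediate quantities (e.g. ln pi) on the
    # instance so repeated E-steps reuse them instead of recomputing.
    def __init__(self, weights):
        self.weights = weights

    @functools.cached_property
    def log_pi(self):
        # Computed once on first access, then served from the instance __dict__.
        return [math.log(w) for w in self.weights]

e = EStepCache([0.25, 0.75])
assert e.log_pi is e.log_pi  # second access hits the cache, same object
```

As the post notes, this buys little over assigning the values to private attributes once per iteration; the caching only pays off when the attribute may not be needed at all.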

A numerical issue comes from responsibilities that are extremely small. When estimating resp * log resp, this gives NaN. I simply avoid the computation when resp < 10*EPS. Still, the ELBO seems suspicious.
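The workaround can be sketched like this (the helper name is mine, not scikit-learn's; it relies on the fact that x log x goes to 0 as x goes to 0):

```python
import numpy as np

def safe_resp_log_resp(resp, eps=10 * np.finfo(float).eps):
    # resp * log(resp) -> 0 as resp -> 0, so skip tiny responsibilities
    # instead of evaluating 0 * log(0) = NaN in the ELBO.
    out = np.zeros_like(resp)
    mask = resp > eps
    out[mask] = resp[mask] * np.log(resp[mask])
    return out

resp = np.array([0.0, 1e-300, 0.5, 1.0])
assert not np.isnan(safe_resp_log_resp(resp)).any()
```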

The current implementation of VBGMM in scikit-learn cannot learn the correct parameters on old-faithful data.

VBGMM(alpha=0.0001, covariance_type='full', init_params='wmc',
      min_covar=None, n_components=6, n_iter=100, params='wmc',
      random_state=None, thresh=None, tol=0.001, verbose=0)


It gives only one component. The weights_ are:

array([  7.31951611e-07,   7.31951611e-07,   7.31951611e-07,
         7.31951611e-07,   7.31951611e-07,   9.99996340e-01])


I also implemented DirichletProcessGaussianMixture, but currently it behaves the same as BayesianGaussianMixture. Both of them can infer the best number of components; DirichletProcessGaussianMixture took slightly more iterations than BayesianGaussianMixture. If we infer a Dirichlet process mixture by Gibbs sampling, we don't need to specify the truncation level; alpha, the concentration parameter, is enough. But with variational inference, we still need to give the model the maximal possible number of components, i.e., the truncation level $T$.

## July 10, 2015

### Titus Brown

#### A response to "The myths of bioinformatics software"

This is a response to (parts of) Dr. Lior Pachter's post, "The myths of bioinformatics software". (You can also see my post on bioinformatics software licensing for at least some of the background arguments.)

I agree with a lot of what Lior says: most bioinformatics software is not very good quality (#1), most bioinformatics software is not built by a team (#2), licensing is at best a minor component of what makes software widely used (#3), software should have an expiration date (#5), most URLs are unstable (#6), software should not be "idiot proof" (#7), and it shouldn't matter whether you use a specific programming language (#8).

I strongly disagree with Lior's point #4, in almost every way. I try to make my software free for everyone, including companies, for both philosophical reasons and for simplicity; I explained my reasoning in my blog post. (Anyone who doesn't think linking against GPL software is reasonably complicated and nuanced should read through the tweets and comments on that post!) From my few involvements with working on non-free software, I would also add that selling software is a tough business, and not one that automatically leads to any profits; there's a long tail, just as with everything else, and I long ago decided that my time is worth more to me than the expected income from selling software would be. (I would be thrilled if a student wanted to try to make money off of our work, but my academic work would remain open source.)

Regardless, Lior's opinion isn't obviously wrong, and I appreciate the discussion.

What surprises me most about Lior's post, though, is that he's describing the present situation rather accurately, but he's not angry about it. I'm angry, frustrated, and upset by it, and I really want to see a better future -- I'm in science, and biology, partly because I think it can have a real impact on society and health. Software is a key part of that.

Biology and genomics are changing. Large scale data analysis is becoming more and more important to the biomedical sciences, and software packages like kallisto and khmer are almost certainly going to be used in the clinic at some point. (I believe some of Broad's variant calling software is already used in diagnosis and treatment for cancer, for example, although I don't know the details.) Our software is certainly being used by people doing basic biomedical research, although it may not be directly clinical yet - and I think the quality of computation in basic research matters too.

And this means bioinformatics should grow up a bit. If bioinformatics is a core component of the future of biology (which I think is obvious), then the quality of bioinformatics software matters.

To quote Lior, "Who wants to read junk software, let alone try to edit it or build on it?" Certainly not me - but then why are we producing it? Are we settling for this kind of software in biomedical research? Are we just giving up on producing decent quality software altogether, because, uh, it's hard? How is this different from doing bad math, or publishing bad biology - topics that Lior and others get really mad about?

Lior also quotes a Computational Biology interview with James Taylor, who says,

A lot of traditional software engineering is about how to build software effectively with large teams, whereas the way most scientific software is developed is (and should be) different. Scientific software is often developed by one or a handful of people.

That was true a decade ago, and it may have been a reasonable excuse to avoid decent software engineering techniques then, but the landscape has changed significantly since, with a wide variety of rapid prototyping, test-driven development, and lean/agile methodologies being put into practice in startups and large companies. So I think James is mistaken here.

I wager that the reason a lot of scientists do bad software engineering is because they can get away with it, not because there are no techniques they could profitably use. Heck, if they wanted to learn something about it, Software Carpentry will come teach workshops for you on this very topic, and I'd be happy to offer both Lior and James a workshop to bring them up to speed. (Note: I don't think either of them needs my advice, which is actually kind of my point.)

(As for languages, Lior's point #8, there is a persistent expansion of the Python and R toolchains around bioinformatics and a convergence on them as the daily workhorses of bioinformatics data analysis. So even that's changing.)

Fundamentally the blithe acceptance of badly engineered software in science baffles me. I can understand (and even endorse) not requiring good software engineering for algorithmic proofs of concept, but clearly we want to have good, robust libraries for serious work. To claim otherwise would seem to lead to the conclusion that much of bioinformatics and genomics should seek to be incorrect and irrelevant.

I want there to be a robust community of computational scientists and software developers in biology. I want people to be able to build a new variant caller without having to reimplement a FASTQ or SAM parser. I think we need people to file bug reports, catch weird off-by-one problems, and otherwise spot check all the software they are using. And I don't think it's impossible or even terribly difficult to achieve this.

The open source community has been developing software with distributed teams, with no single employer, and with uncertain funding for decades. It's not easy, but it's not impossible. And in the end I do think that the open source community has a lot of the solutions the computational science community needs, and in fact is simply a much better exemplar for how to work reproducibly and with high technical quality. Why we continue to ignore this mystifies me, although I would guess it has to do with how credit is assigned in academic software development.

If we went back to the 80s and 90s we'd see that many of the same arguments that Lior is making were applied to open source software in contrast to commercial software. We know how that ended - open source software now runs most of the Internet infrastructure. And open source has had other benefits, too; to quote Bill Janeway, "open source and the cloud have dramatically decreased the friction of innovating", and the scientific community has certainly benefited from the tremendous innovation in software and high tech. I would love to see that same innovation enabled in genomics and bioinformatics. And that's why we try to practice good software development in my lab; that's why we release our software under the BSD license; and that's why we encourage people to do whatever they want with our software.

Ultimately I think we should develop our software (and choose our licenses), for the future we want to see, not the present that we're stuck with.

--titus

## July 09, 2015

### Fabian Pedregosa

#### Job Offer: data scientist in Paris

Alexandre D'Aspremont and I are looking for a data scientist to work on large-scale non-smooth optimization methods and other topics. You can find more information in this link.

The job description is intentionally vague, and depending on the candidate this can be a postdoc-like job with most of their time devoted to research or an engineering-like job with more emphasis on coding tasks and contributing to open source projects.

Feel free to contact me (f@bianp.net) or Alexandre D'Aspremont for more information.

## July 07, 2015

### Matthieu Brucher

#### Audio Toolkit: Zero Latency Efficient Convolution

Convolution is an algorithm that is often used for reverberation. While the equation is easy to understand and to implement, the implementation is costly. The other way of doing it is to use the Fast Fourier Transform (FFT), but the direct/crude FFT implementation introduces latency. While it is possible to optimize the basic convolution code, it is sometimes more interesting to use a different algorithm, as is the case here.

# Basic theory

Let’s start with the convolution equation, with x[n] the input signal, y[n] the output signal and i[n] the impulse we want to convolve the input signal with:
Convolution equation
The cost of the operation per output element is proportional to the size of the impulse, O(K). For N output elements, the cost is thus O(KN).
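The direct form can be sketched in plain Python (an illustration of the O(KN) cost only; ATK's actual FIR filter is C++ and vectorized):

```python
def direct_convolve(x, i):
    # y[n] = sum_k i[k] * x[n-k]: O(K) work per output sample, O(K*N) total.
    K = len(i)
    N = len(x) + K - 1
    y = [0.0] * N
    for n in range(N):
        for k in range(K):
            if 0 <= n - k < len(x):
                y[n] += i[k] * x[n - k]
    return y

assert direct_convolve([1, 2, 3], [1, 1]) == [1, 3, 5, 3]
```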

The other way of doing this is in the frequency domain, where the convolution becomes a simple multiplication. The issue is that you require all N elements from the start. The transform to the frequency domain costs O(M log M), with M the number of elements in the FFT. As we are talking about convolution, M=N+K-1. So as we have more output elements after the convolution than input elements, we will need to add the extra ones to the following convolution.

The issue is the selection of the size N. With a big N we get efficiency, but we also require N elements before being able to output any signal at all. This is worrying, as there seems to be no way of going around this limitation.
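As a sanity check, the frequency-domain route can be sketched in a few lines of NumPy (this is an illustration, not ATK code):

```python
import numpy as np

def fft_convolve(x, i):
    # Frequency-domain convolution: one multiplication per FFT bin,
    # but it needs all the input samples up front (hence the latency).
    M = len(x) + len(i) - 1  # FFT size covering the full linear convolution
    return np.fft.irfft(np.fft.rfft(x, M) * np.fft.rfft(i, M), M)

x = np.array([1.0, 2.0, 3.0])
i = np.array([1.0, 1.0])
assert np.allclose(fft_convolve(x, i), np.convolve(x, i))
```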

# Rewriting the equation

Ok, now let's rewrite the equation as a double sum. Let's split the impulse into S pieces of K elements each (say 64 elements, with the original length a multiple of 64):
Split convolution equation

What is interesting is that the x[k] in the equation can also be regrouped into pieces of K elements. This means that we have S convolutions of K elements each. There are several advantages to thinking about these input samples as pieces. The original algorithm had to keep a memory of S*K elements: either we use a circular buffer, or we have to shift the entire signal, and both solutions have their drawbacks. By regrouping the samples into pieces, there is no buffer management and no shifting of the signal (or, more precisely, far less).

Unfortunately, we still have latency: we need K elements before we can use the FFT. Another approach is to process the first partial convolution in the usual way and the rest with FFTs. Each time we have processed K samples, we create a new chunk, transform it into the frequency domain and push it onto the list of FFT chunks. Then we can directly multiply the partial impulses (transformed into the frequency domain) by their respective input chunks, sum them together and transform the result back into the time domain. After the transform we get 2*K elements: the first K elements are added to the next K elements after the discrete convolution, and the last K elements are put aside and added to the first K elements of the next convolution.
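The scheme above can be sketched in NumPy as uniformly partitioned convolution with a frequency-delay line. Note the hedges: the function name and block handling are mine, not ATK's, and for clarity this sketch processes whole K-sample blocks via FFT rather than running the first partition as a direct FIR for zero latency.

```python
import numpy as np

def partitioned_convolve(x, h, K=64):
    x, h = np.asarray(x, float), np.asarray(h, float)
    Lx, Lh = len(x), len(h)
    S = -(-Lh // K)                             # number of K-sample impulse partitions
    h = np.pad(h, (0, S * K - Lh))
    # Precompute the FFT (size 2K) of each impulse partition.
    H = np.array([np.fft.rfft(h[s*K:(s+1)*K], 2*K) for s in range(S)])
    n_out = Lx + Lh - 1
    n_blocks = -(-n_out // K)
    x = np.pad(x, (0, n_blocks * K - Lx))
    fdl = np.zeros((S, K + 1), dtype=complex)   # frequency-delay line of input spectra
    prev = np.zeros(K)
    out = []
    for b in range(n_blocks):
        blk = x[b*K:(b+1)*K]
        # Overlap-save buffer: previous block followed by the current one.
        X = np.fft.rfft(np.concatenate([prev, blk]), 2*K)
        fdl = np.roll(fdl, 1, axis=0)           # age the stored spectra by one block
        fdl[0] = X
        y = np.fft.irfft((H * fdl).sum(axis=0), 2*K)
        out.append(y[K:])                       # the last K output samples are valid
        prev = blk
    return np.concatenate(out)[:n_out]

rng = np.random.default_rng(0)
sig, imp = rng.standard_normal(40), rng.standard_normal(13)
assert np.allclose(partitioned_convolve(sig, imp, K=4), np.convolve(sig, imp))
```

The multiply-accumulate over the frequency-delay line is exactly the "multiply the partial impulses by their respective input chunks and sum" step described above.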

# Implementation in Audio Toolkit

This is a pretty basic approach to partitioned convolution. It's just a rewrite of the convolution equation, with no fancy science behind it.

The reference implementation is the FIR filter (just reverse the impulse). One of the nice features of ATK is that you can make sure you always have K input elements available in the buffer: even if you are asked to process 1 element, the framework has a way of always keeping the last K values, so there is no additional bookkeeping needed. Once we have processed K elements, we can directly create a new chunk. This is done through the following code in the main process method:

    int processed_size = 0;
    do
    {
      // We can only process split_size elements at a time, but if we already have
      // some elements in the buffer, we need to take them into account.
      int64_t size_to_process = std::min(split_size - split_position, size - processed_size);

      // Process first part of the impulse
      process_impulse_beginning(processed_size, size_to_process);

      split_position += size_to_process;
      processed_size += size_to_process;
      if(split_position == split_size)
      {
        process_new_chunk(processed_size);
        split_position = 0;
      }
    } while(processed_size != size);


process_new_chunk() triggers the copy of the last K input samples and computes an FFT on them. Once this is finished, the FFT result (2*K elements) is stored, and its last K elements are added to the buffer. This buffer part will be added to the convolution in process_impulse_beginning.

# Some profiling

I always like to profile my code, so I took a small 10000-sample impulse (a ramp), convolved it with a triangle signal, and compared the performance to the reference algorithm (the FIR filter). With pieces of 64 elements, the new implementation is 10 times faster, and it improves further if the pieces are bigger or the impulse longer:

First, a simple block convolution already improves timings by a factor of 3. Then, using FFTs to compute all the convolutions except the first, we can go further. Let's see a profile of my original implementation:

Original profile with FFT convolution

Now, all the complex and vector calls inside the two convolution main methods can be removed to streamline the code:

Final profile after optimizations

Besides the optimization of the FFT code, the classic IIR code was also optimized somewhat, leading to a 15% performance increase. The conversion code was also enhanced, leading to a gain of more than 50% in input/output handling for all plugins.

# Conclusion

There is still room for improvement, as several copies are still required, but a basic implementation is actually easy to achieve. There is a compromise to be found between the computation of the convolution through the FFT, dominated by the complex multiplication and accumulation before the inverse FFT, and the convolution of the first block through the FIR filter. If the impulse is long (several seconds), it seems that a block size of 256 to 512 elements is the best compromise, whereas for a shorter impulse (1 second), a block size of 64 elements is best.

Of course, by improving the complex multiplication and addition (SIMD? FMA?), the compromise would shift towards smaller block sizes.

I didn’t put any big equations on how the block convolution is achieved. This would almost require a paper or a book chapter…

The full implementation is available on Github.

## July 06, 2015

### Titus Brown

#### On licensing bioinformatics software: use the BSD, Luke.

If a piece of bioinformatics software is not fully open source, my lab and I will generally seek out alternatives to it for research, teaching and training. This holds whether or not the software is free for academic use.

If a piece of bioinformatics software is only available under the GNU Public License or another "copyleft" license, then my lab and I will absolutely avoid integrating any part of it into our own source code, which is mostly BSD.

## Why avoid non-open-source software?

We avoid non-open-source software because it saves us future headaches.

Contrary to some assumptions, this is not because I'm anti-company or against making money from software, although I have certainly chosen to forego that in my own life. It's almost entirely because such software is an evolutionary dead end, and hence time spent working with it is ultimately wasted.

More specifically, here are some situations that I want to avoid:

• we invest a lot of time in building training materials for a piece of software, only to find out that some of our trainees can't make use of the software in their day job.
• we need to talk to lawyers about whether or not we can use a piece of software or include some of its functionality in a workflow we're building.
• we find a small but critical bug in a piece of bioinformatics software, and can't reach any of the original authors to OK a new release.

This is the reason why I won't be investing much time or effort in using kallisto in my courses and my research: it's definitely not open source.

## Why avoid copyleft?

The typical decision tree for an open source license is between a "permissive" BSD-style license vs a copyleft license like the GPL; see Jake Vanderplas's excellent post on licensing scientific code for specifics.

There is an asymmetry in these licenses.

Our software, khmer, is available under the BSD license. Any open source project (indeed, any project) is welcome to take any part of our source code and include it in theirs.

However, we cannot use GPL software in our code base at all. We can't call GPL library functions, we can't include GPL code in our codebase, and I'm not 100% sure we can even look closely at GPL code. This is because if we do so, we must license our own software under the GPL.

This is the reason that I will be avoiding bloomtree code, and in fact we will probably be following through on our reimplementation -- bloomtree relies on both Jellyfish and sdsl-lite, which are GPL.

## Why did we choose BSD and not GPL for our own code?

Two reasons: first, I'm an academic, funded by government grants; second, I want to maximize the utility of my work, which means choosing a license that encourages the most participation in the project, and encourages the most reuse of my code in other projects.

Jake covers the second line of reasoning really nicely in his blog post, so I will simply extract his summary of John Hunter's reasoning:

To summarize Hunter's reasoning: the most important two predictors of success for a software project are the number of users and the number of contributors. Because of the restrictions and subtle legal issues involved with GPL licenses, many for-profit companies will not touch GPL-licensed code, even if they are happy to contribute their changes back to the community. A BSD license, on the other hand, removes these restrictions: Hunter mentions several specific examples of vital industry partnership in the case of matplotlib. He argues that in general, a good BSD-licensed project will, by virtue of opening itself to the contribution of private companies, greatly grow its two greatest assets: its user-base and its developer-base.

I also think maximizing remixability is a basic scientific goal, and this is something that the GPL fails.

The first line of reasoning is a little more philosophical, but basically it comes down to a wholesale rejection of the logic in the Bayh-Dole act, which tries to encourage innovation and commercialization of federally funded research by assigning intellectual property to the university. I think this approach is bollocks. While I am not an economics expert, I think it's clear that most innovation in the university is probably not worth that much and should be made maximally open. From talking to Dr. Bill Janeway, I gather he agrees that pre-Bayh-Dole was a time of more openness, although I'm not sure of the evidence for more innovation during this period. Regardless, to me it's intuitively obvious that the prospect of commercialization causes more researchers to keep their research closed, and I think this is obviously bad for science. (The Idea Factory talks a lot about how Bell Labs spurred immense amounts of innovation because so much of their research was open for use. Talent Wants to be Free is a pop-sci book that outlines research supporting openness leading to more innovation.)

So, basically, I think my job as an academic is to come up with cool stuff and make it as open as possible, because that encourages innovation and progress. And the BSD fits that bill. If a company wants to make use of my code, that's great! Please don't talk to us - just grab it and go!

I should say that I'm very aware of the many good reasons why GPL promotes a better long-term future, and until I became a grad student I was 100% on board. Once I got more involved in scientific programming, though, I switched to a more selfish rationale, which is that my reputation is going to be enhanced by more people using my code, and the way to do that is with the BSD. If you have good arguments about why I'm wrong and everyone should use the GPL, please do post them (or links to good pieces) in the comments; I'm happy to promote that line of reasoning, but for now I've settled on BSD for my own work.

One important note: universities like releasing things under the GPL, because they know that it virtually guarantees no company will use it in a commercial product without paying the university to relicense it specifically for the company. While this may be in the best short-term interests of the university, I think it says all you need to know about the practical outcome of the GPL on scientific innovation.

## Why am I OK with the output of commercial equipment?

Lior Pachter drew a contrast between my refusal to teach non-free software and my presumed teaching on sequencing output from commercial Illumina machines. I think there are at least four arguments to be made in favor of continuing to use Illumina while avoiding the use of kallisto.

• pragmatically, Illumina is the only game in town for most of my students, while there are plenty of RNA-seq analysis programs. So unless I settled on kallisto being the super-end-all-be-all of RNA-seq analysis, I can indulge my leanings towards freedom by ignoring kallisto and teaching something else that's free-er.
• Illumina has a clear pricing model and their sequencing is essentially a commodity that needs little to no engagement from me. This is not generally how bioinformatics software works :)
• There's no danger of Illumina claiming dibs on any of my results or extensions - we're all clear that I pays my money, and I gets my sequence. I'm honestly not sure what would happen if I modified kallisto or built on it to do something cool, and then wanted to let a company use it. (I bet it would involve talking to a lot of lawyers, which I'm not interested in doing.)
• James Taylor made the excellent points that limited training and development time is best spent on tools that are maximally available, and that don't involve licenses that they can't enforce.

So that's my reasoning. I don't want to pour fuel on any licensing fire, but I wanted to explain my reasoning to people. I also think that people should fight hard to make their bioinformatics software available under a permissive license, because it will benefit everyone :).

I should say that Manoj Samanta has been following this line of thought for much longer than me, and has written several blog posts on this topic (see also this, for example).

--titus

## The Model Complexity Myth (or, Yes You Can Fit Models With More Parameters Than Data Points)

An oft-repeated rule of thumb in any sort of statistical model fitting is "you can't fit a model with more parameters than data points". This idea appears to be as widespread as it is incorrect. On the contrary, if you construct your models carefully, you can fit models with more parameters than datapoints, and this is much more than mere trivia with which you can impress the nerdiest of your friends: as I will show here, this fact can prove to be very useful in real-world scientific applications.

A model with more parameters than datapoints is known as an under-determined system, and it's a common misconception that such a model cannot be solved in any circumstance. In this post I will consider this misconception, which I call the model complexity myth. I'll start by showing where this model complexity myth holds true, first from an intuitive point of view, and then from a more mathematically-heavy point of view. I'll build from this mathematical treatment and discuss how underdetermined models may be addressed from a frequentist standpoint, and then from a Bayesian standpoint. (If you're unclear about the general differences between frequentist and Bayesian approaches, I might suggest reading my posts on the subject). Finally, I'll discuss some practical examples of where such an underdetermined model can be useful, and demonstrate one of these examples: quantitatively accounting for measurement biases in scientific data.

## The Root of the Model Complexity Myth

While the model complexity myth is not true in general, it is true in the specific case of simple linear models, which perhaps explains why the myth is so pervasive. In this section I want to motivate the reason for the underdetermination issue in simple linear models, first from an intuitive view and then from a more mathematical one.

I'll start by defining some functions to create plots for the examples below; you can skip reading this code block for now:

In [1]:
# Code to create figures
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('ggplot')

def plot_simple_line():
    rng = np.random.RandomState(42)
    x = 10 * rng.rand(20)
    y = 2 * x + 5 + rng.randn(20)
    p = np.polyfit(x, y, 1)
    xfit = np.linspace(0, 10)
    yfit = np.polyval(p, xfit)
    plt.plot(x, y, 'ok')
    plt.plot(xfit, yfit, color='gray')
    plt.text(9.8, 1,
             "y = {0:.2f}x + {1:.2f}".format(*p),
             ha='right', size=14);

def plot_underdetermined_fits(p, brange=(-0.5, 1.5), xlim=(-3, 3),
                              plot_conditioned=False):
    rng = np.random.RandomState(42)
    x, y = rng.rand(2, p).round(2)
    xfit = np.linspace(xlim[0], xlim[1])
    for r in rng.rand(20):
        # add a datapoint to make model specified
        b = brange[0] + r * (brange[1] - brange[0])
        xx = np.concatenate([x, [0]])
        yy = np.concatenate([y, [b]])
        theta = np.polyfit(xx, yy, p)
        yfit = np.polyval(theta, xfit)
        plt.plot(xfit, yfit, color='#BBBBBB')
    plt.plot(x, y, 'ok')

    if plot_conditioned:
        X = x[:, None] ** np.arange(p + 1)
        theta = np.linalg.solve(np.dot(X.T, X)
                                + 1E-3 * np.eye(X.shape[1]),
                                np.dot(X.T, y))
        Xfit = xfit[:, None] ** np.arange(p + 1)
        yfit = np.dot(Xfit, theta)
        plt.plot(xfit, yfit, color='black', lw=2)

def plot_underdetermined_line():
    plot_underdetermined_fits(1)

def plot_underdetermined_cubic():
    plot_underdetermined_fits(3, brange=(-1, 2),
                              xlim=(0, 1.2))

def plot_conditioned_line():
    plot_underdetermined_fits(1, plot_conditioned=True)


### Fitting a Line to Data

The archetypical model-fitting problem is that of fitting a line to data: A straight-line fit is one of the simplest of linear models, and is usually specified by two parameters: the slope m and intercept b. For any observed value $$x$$, the model prediction for $$y$$ under the model $$M$$ is given by

$y_M = m x + b$

for some particular choice of $$m$$ and $$b$$. Given $$N$$ observed data points $$\{x_i, y_i\}_{i=1}^N$$, it is straightforward (see below) to compute optimal values for $$m$$ and $$b$$ which fit this data:

In [2]:
plot_simple_line()


The simple line-plus-intercept is a two-parameter model, and it becomes underdetermined when fitting it to fewer than two datapoints. This is easy to understand intuitively: after all, you can draw any number of perfectly-fit lines through a single data point:

In [3]:
plot_underdetermined_line()


The single point simply isn't enough to pin down both a slope and an intercept, and the model has no unique solution.

### Fitting a More General Linear Model

While it's harder to see intuitively, this same notion extends to models with more terms. For example, let's consider fitting a general cubic curve to data. In this case our model is

$y_M = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$

Note that this is still a linear model: the "linear" refers to the linearity of model parameters $$\theta$$ rather than linearity of the dependence on the data $$x$$. Our cubic model is a four-parameter linear model, and just as a two-parameter model is underdetermined for fewer than two points, a four-parameter model is underdetermined for fewer than four points. For example, here are some of the possible solutions of the cubic model fit to three randomly-chosen points:

In [4]:
plot_underdetermined_cubic()


For any such simple linear model, an underdetermined system will lead to a similar result: an infinite set of best-fit solutions.

### The Mathematics of Underdetermined Models

To make more progress here, let's quickly dive into the mathematics behind these linear models. Going back to the simple straight-line fit, we have our model

$y_M(x~|~\theta) = \theta_0 + \theta_1 x$

where we've replaced our slope $$m$$ and intercept $$b$$ by a more generalizable parameter vector $$\theta = [\theta_0, \theta_1]$$. Given some set of data $$\{x_n, y_n\}_{n=1}^N$$ we'd like to find $$\theta$$ which gives the best fit. For reasons I'll not discuss here, this is usually done by minimizing the sum of squared residuals from each data point, often called the $$\chi^2$$ of the model in reference to its expected theoretical distribution:

$\chi^2 = \sum_{n=1}^N [y_n - y_M(x_n~|~\theta)]^2$

We can make some progress by re-expressing this model in terms of matrices and vectors; we'll define the vector of $$y$$ values:

$y = [y_1, y_2, y_3, \cdots y_N]$

We'll also define the design matrix which we'll call $$X$$; this contains all the information about the form of the model:

$X = \left[ \begin{array}{ll} 1 & x_1 \\ 1 & x_2 \\ \vdots &\vdots \\ 1 & x_N \\ \end{array} \right]$

With this formalism, the vector of model values can be expressed as a matrix-vector product:

$y_M = X\theta$

and the $$\chi^2$$ can be expressed as a simple linear product as well:

$\chi^2 = (y - X\theta)^T(y - X\theta)$

We'd like to minimize the $$\chi^2$$ with respect to the parameter vector $$\theta$$, which we can do by the normal means of differentiating with respect to the vector $$\theta$$ and setting the result to zero (yes, you can take the derivative with respect to a vector!):

$\frac{d\chi^2}{d\theta} = -2X^T(y - X\theta) = 0$

Solving this for $$\theta$$ gives the Maximum Likelihood Estimate (MLE) for the parameters,

$\hat{\theta}_{MLE} = [X^T X]^{-1} X^T y$

Though this matrix formalism may seem a bit over-complicated, the nice part is that it straightforwardly generalizes to a host of more sophisticated linear models. For example, the cubic model considered above requires only a larger design matrix $$X$$:

$X = \left[ \begin{array}{llll} 1 & x_1 & x_1^2 & x_1^3\\ 1 & x_2 & x_2^2 & x_2^3\\ \vdots & \vdots & \vdots & \vdots\\ 1 & x_N & x_N^2 & x_N^3\\ \end{array} \right]$

The added model complexity is completely encapsulated in the design matrix, and the expression to compute $$\hat{\theta}_{MLE}$$ from $$X$$ is unchanged!
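To see this concretely, here is a quick numerical check (my own sketch, not a cell from the original notebook): building the straight-line design matrix by hand and solving the normal equations reproduces NumPy's built-in polynomial fit.

```python
import numpy as np

# the same synthetic straight-line data used in plot_simple_line()
rng = np.random.RandomState(42)
x = 10 * rng.rand(20)
y = 2 * x + 5 + rng.randn(20)

# design matrix for the straight-line model: one column of ones, one of x
X = np.vstack([np.ones_like(x), x]).T

# normal-equation solution: theta = (X^T X)^{-1} X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

`np.polyfit(x, y, 1)` returns the same coefficients (in highest-power-first order), so the two routes agree.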

### Why Underdetermined Models Break

Taking a look at this Maximum Likelihood solution for $$\theta$$, we see that there is only one place that it might go wrong: the inversion of the matrix $$X^T X$$. If this matrix is not invertible (i.e. if it is a singular matrix) then the maximum likelihood solution will not be well-defined.

The number of rows in $$X$$ corresponds to the number of data points, and the number of columns in $$X$$ corresponds to the number of parameters in the model. It turns out that a matrix $$C = X^TX$$ will always be singular if $$X$$ has fewer rows than columns, and this is the source of the problem. For underdetermined models, $$X^TX$$ is a singular matrix, and so the maximum likelihood fit is not well-defined.
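This is easy to verify numerically (a quick sketch of my own): the rank of $$X^TX$$ can never exceed the number of rows of $$X$$, so with fewer data points than parameters the normal matrix is necessarily rank-deficient.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(3, 5)   # 3 data points, 5 model parameters
C = X.T @ X          # 5x5 normal matrix

# rank(C) <= rank(X) <= 3 < 5, so C is singular and cannot be inverted
rank = np.linalg.matrix_rank(C)
```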

Let's take a look at this in the case of fitting a line to the single point shown above, $$(x=0.37, y=0.95)$$. For this value of $$x$$, here is the design matrix:

In [5]:
X = np.array([[1, 0.37]])


We can use this to compute the normal matrix, which is the standard name for $$X^TX$$:

In [6]:
C = np.dot(X.T, X)


If we try to invert this, we will get an error telling us the matrix is singular:

In [7]:
np.linalg.inv(C)

---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
<ipython-input-7-a6d66d9f99af> in <module>()
----> 1 np.linalg.inv(C)

/Users/jakevdp/anaconda/envs/py34/lib/python3.4/site-packages/numpy/linalg/linalg.py in inv(a)
518     signature = 'D->D' if isComplexType(t) else 'd->d'
519     extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 520     ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
521     return wrap(ainv.astype(result_t))
522

/Users/jakevdp/anaconda/envs/py34/lib/python3.4/site-packages/numpy/linalg/linalg.py in _raise_linalgerror_singular(err, flag)
88
89 def _raise_linalgerror_singular(err, flag):
---> 90     raise LinAlgError("Singular matrix")
91
92 def _raise_linalgerror_nonposdef(err, flag):

LinAlgError: Singular matrix

Evidently, if we want to fix the underdetermined model, we'll need to figure out how to modify the normal matrix so it is no longer singular.

### Fixing an Underdetermined Model: Conditioning

One easy way to make a singular matrix invertible is to condition it: that is, you add to it some multiple of the identity matrix before performing the inversion (in many ways this is equivalent to "fixing" a divide-by-zero error by adding a small value to the denominator). Mathematically, that looks like this:

$C = X^TX + \sigma I$

For example, by adding $$\sigma = 10^{-3}$$ to the diagonal of the normal matrix, we condition the matrix so that it can be inverted:

In [8]:
cond = 1E-3 * np.eye(2)
np.linalg.inv(C + cond)

Out[8]:
array([[ 121.18815362, -325.16038316],
[-325.16038316,  879.69065823]])


Carrying this conditioned inverse through the computation, we get the following intercept and slope for our underdetermined problem:

In [9]:
b, m = np.linalg.solve(C + cond,
                       np.dot(X.T, [0.95]))
print("Conditioned best-fit model:")
print("y = {0:.3f} x + {1:.3f}".format(m, b))

Conditioned best-fit model:
y = 0.309 x + 0.835


In [10]:
plot_conditioned_line()


This conditioning caused the model to settle on a particular one of the infinite possibilities for a perfect fit to the data. Numerically we have fixed our issue, but this arbitrary conditioning is more than a bit suspect: why is this particular result chosen, and what does it actually mean in terms of our model fit? In the next two sections, we will briefly discuss the meaning of this conditioning term from both a frequentist and Bayesian perspective.

## Frequentist Conditioning: Regularization

In a frequentist approach, this type of conditioning is known as regularization. Regularization is motivated by a desire to penalize large values of model parameters. For example, in the underdetermined fit above (with $$(x, y) = (0.37, 0.95)$$), you could fit the data perfectly with a slope of one billion and an intercept near negative 370 million, but in most real-world applications this would be a silly fit. To prevent this sort of canceling parameter divergence, in a frequentist setting you can "regularize" the model by adding a penalty term to the $$\chi^2$$:

$\chi^2_{reg} = \chi^2 + \lambda~\theta^T\theta$

Here $$\lambda$$ is the degree of regularization, which must be chosen by the person implementing the model.

Using the expression for the regularized $$\chi^2$$, we can minimize with respect to $$\theta$$ by again taking the derivative and setting it equal to zero:

$\frac{d\chi^2}{d\theta} = -2[X^T(y - X\theta) - \lambda\theta] = 0$

This leads to the following regularized maximum likelihood estimate for $$\theta$$:

$\hat{\theta}_{MLE} = [X^TX + \lambda I]^{-1} X^T y$

Comparing this to our conditioning above, we see that the regularization degree $$\lambda$$ is identical to the conditioning term $$\sigma$$ that we considered above. That is, regularization of this form is nothing more than a simple conditioning of $$X^T X$$, with $$\lambda = \sigma$$. The result of this conditioning is to push the absolute values of the parameters toward zero, and in the process make an ill-defined problem solvable.

I'll add that the above form of regularization is known variously as L2 regularization or Ridge regularization, and is only one of the possible regularization approaches. Another useful form of regularization is L1 regularization, also known as Lasso regularization, which has the interesting property that it favors sparsity in the model.
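To make the sparsity property concrete, here is a small sketch contrasting the two penalties on an underdetermined system (this uses scikit-learn's Ridge and Lasso estimators, an addition of mine — the post itself doesn't use scikit-learn):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(10, 20)             # 10 data points, 20 parameters
true_coef = np.zeros(20)
true_coef[:3] = [5, -4, 3]        # only three features actually matter
y = X @ true_coef + 0.01 * rng.randn(10)

ridge = Ridge(alpha=0.1).fit(X, y)  # L2 shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)  # L1 sets most of them exactly to zero
```

Both produce a well-defined fit despite having twice as many parameters as points, but only the L1 penalty zeroes out irrelevant coefficients.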

## Bayesian Conditioning: Priors

Regularization illuminates the meaning of matrix conditioning, but it still sometimes seems like a bit of black magic. What does this penalization of model parameters within the $$\chi^2$$ actually mean? Here, we can make progress in understanding the problem by examining regularization from a Bayesian perspective.

As I pointed out in my series of posts on Frequentism and Bayesianism, for many simple problems, the frequentist likelihood (proportional to the exponential of the negative $$\chi^2$$) is equivalent to the Bayesian posterior – albeit with a subtle but fundamentally different interpretation.

The Bayesian posterior probability on the model parameters $$\theta$$ is given by

$P(\theta~|~D, M) = \frac{P(D~|~\theta, M) P(\theta~|~M)}{P(D~|~M)}$

where the most important terms are the likelihood $$P(D~|~\theta, M)$$ and the prior $$P(\theta~|~M)$$. From the expected correspondence with the frequentist result, we can write:

$P(D~|~\theta, M) P(\theta~|~M) \propto \exp[- \chi^2]$

Because the right-hand side contains no $$\theta$$ dependence beyond the $$\chi^2$$ term itself, we can immediately see that

$P(D~|~\theta, M) \propto \exp[-\chi^2]\\ P(\theta~|~M) \propto 1$

That is, the simple frequentist likelihood is equivalent to the Bayesian posterior for the model with an implicit flat prior on $$\theta$$.

To understand the meaning of regularization, let's repeat this exercise with the regularized $$\chi^2$$:

$P(D~|~\theta, M) P(\theta~|~M) \propto \exp[- \chi^2 - \lambda~|\theta|^2]$

The regularization term in the $$\chi^2$$ becomes a second term in the product which depends only on $$\theta$$, thus we can immediately write

$P(D~|~\theta, M) \propto \exp[-\chi^2]\\ P(\theta~|~M) \propto \exp[- \lambda~|\theta|^2]$

So we see that ridge regularization is equivalent to applying a Gaussian prior to your model parameters, centered at $$\theta=0$$: matching $$\exp[-\lambda~|\theta|^2]$$ to a Gaussian $$\exp[-|\theta|^2 / (2\sigma_P^2)]$$ gives a width $$\sigma_P = (2\lambda)^{-1/2}$$. This insight lifts the cover off the black box of regularization, and reveals that it is simply a roundabout way of adding a Bayesian prior within a frequentist paradigm. The stronger the regularization, the narrower the implicit Gaussian prior is.

Returning to our single-point example above, we can quickly see how this intuition explains the particular model chosen by the regularized fit; it is equivalent to fitting the line while taking into account prior knowledge that both the intercept and slope should be near zero:

In [11]:
plot_conditioned_line()


The benefit of the Bayesian view is that it helps us understand exactly what this conditioning means for our model, and given this understanding we can easily extend it to use more general priors. For example, what if you have reason to believe your slope is near 1, but have no prior information on your intercept? In the Bayesian approach, it is easy to add such information to your model in a rigorous way.

But regardless of which approach you use, this central fact remains clear: you can fit models with more parameters than data points, if you restrict your parameter space through the use of frequentist regularization or Bayesian priors.
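As a closing numerical check (my own sketch, not part of the original post): for the single-point example above, directly minimizing the regularized $$\chi^2$$ agrees with the closed-form estimate $$[X^TX + \lambda I]^{-1} X^T y$$, and both reproduce the conditioned fit found earlier.

```python
import numpy as np
from scipy import optimize

X = np.array([[1, 0.37]])   # design matrix for the single point (0.37, 0.95)
y = np.array([0.95])
lam = 1E-3                  # regularization degree (= conditioning term)

# closed-form regularized maximum likelihood estimate
theta_closed = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

# direct numerical minimization of chi^2 + lam * theta^T theta
def chi2_reg(theta):
    r = y - X @ theta
    return r @ r + lam * theta @ theta

theta_num = optimize.fmin(chi2_reg, [0, 0], xtol=1e-10, ftol=1e-10, disp=False)
```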

## Underdetermined Models in Action

There are a few places where these ideas about underdetermined models come up in real life. I'll briefly discuss a couple of them here, and then walk through the solution of a simple (but rather interesting) problem that demonstrates these ideas.

### Compressed Sensing: Image Reconstruction

One area where underdetermined models are often used is in the field of Compressed Sensing. Compressed sensing comprises a set of models in which underdetermined linear systems are solved using a sparsity prior, the classic example of which is the reconstruction of a full image from just a handful of its pixels. As a simple linear model this would fail, because there are far more unknown pixels than known pixels. But by carefully training a model on the structure of typical images and applying priors based on sparsity, this seemingly impossible problem becomes tractable. This 2010 Wired article has a good popular-level discussion of the technique and its applications, and includes this image showing how a partially-hidden input image can be iteratively reconstructed from a handful of pixels using a sparsity prior:

In [12]:
from IPython.display import Image
Image('http://www.wired.com/magazine/wp-content/images/18-03/ff_algorithm2_f.jpg')

Out[12]:

### Kernel-based Methods: Gaussian Processes

Another area where a classically underdetermined model is solved is in the case of Gaussian Process Regression. The Gaussian Process is an interesting beast, and one way to view it is that rather than fitting, say, a two-parameter line or four-parameter cubic curve, it actually fits an infinite-dimensional model to data! Gaussian processes accomplish this by judicious use of certain priors on the model, along with a so-called "kernel trick" which solves the infinite-dimensional regression implicitly using a finite-dimensional representation constructed from these priors.

In my opinion, the best resource to learn more about Gaussian Process methods is the Gaussian Processes for Machine Learning book, which is available for free online (though it is a bit on the technical side). You might also take a look at the scikit-learn Gaussian Process documentation. If you'd like to experiment with a rather fast Gaussian Process implementation in Python, check out the george library.
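To give a flavor of how an "infinite-dimensional" fit reduces to finite linear algebra, here is a bare-bones Gaussian process regression sketch in plain NumPy (my own illustration, assuming a squared-exponential kernel and evenly spaced noiseless training points; for real work use the libraries above):

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # squared-exponential (RBF) covariance between two point sets
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

# noiseless observations of a smooth function
x_train = np.linspace(0, 5, 8)
y_train = np.sin(x_train)
x_test = np.linspace(0, 5, 50)

# despite the model being infinite-dimensional, the posterior mean
# requires only an 8x8 linear solve on the training points
K = rbf_kernel(x_train, x_train) + 1e-6 * np.eye(len(x_train))  # jitter
weights = np.linalg.solve(K, y_train)
y_pred = rbf_kernel(x_test, x_train) @ weights
```

The kernel trick is visible here: all computation happens through the n × n matrix K, never through an explicit infinite-dimensional feature space.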

### Imperfect Detectors: Extrasolar Planets

Another place I've seen effective use of the above ideas is in situations where the data collection process has unavoidable imperfections. There are two basic ways forward when working with such noisy and biased data:

• you can pre-process the data to try to remove and/or correct data imperfections, and then fit a conventional model to the corrected data.
• you can account for the data imperfections along with interesting model parameters as part of a unified model: this type of unified approach is often preferable in scientific settings, where even the most careful pre-processing can lead to biased results.

If you'd like to see a great example of this style of forward-modeling analysis, check out the efforts of David Hogg's group in finding extrasolar planets in the Kepler survey's K2 data; there's a nice astrobites post which summarizes some of these results. While the group hasn't yet published any results based on truly underdetermined models, it is easy to imagine how this style of comprehensive forward-modeling analysis could be pushed to such an extreme.

### Example: Fitting a Model to Biased Data

While it might be fun to dive into the details of models for noisy exoplanet searches, I'll defer that to the experts. Instead, as a more approachable example of an underdetermined model, I'll revisit a toy example in which a classically underdetermined model is used to account for imperfections in the input data.

Imagine you have some observed data you would like to model, but you know that your detector is flawed such that some observations (you don't know which) will have a bias that is not reflected in the estimated error: in short, there are outliers in your data. How can you fit a model to this data while accounting for the possibility of such outliers?

To make this more concrete, consider the following data, which is drawn from a line with noise, and includes several outliers:

In [13]:
rng = np.random.RandomState(42)
theta = [-10, 2]
x = 10 * rng.rand(10)
dy = 2 + 2 * rng.rand(10)
y = theta[0] + theta[1] * x + dy * rng.randn(10)
y[4] += 15
y[7] -= 10
plt.errorbar(x, y, dy, fmt='ok', ecolor='gray');


If we try to fit a line using the standard $$\chi^2$$ minimization approach, we will find an obviously biased result:

In [14]:
from scipy import optimize

def chi2(theta, x=x, y=y, dy=dy):
    y_model = theta[0] + theta[1] * x
    return np.sum(0.5 * ((y - y_model) / dy) ** 2)

theta1 = optimize.fmin(chi2, [0, 0], disp=False)

xfit = np.linspace(0, 10)
plt.errorbar(x, y, dy, fmt='ok', ecolor='gray')
plt.plot(xfit, theta1[0] + theta1[1] * xfit, '-k')
plt.title('Standard $\chi^2$ Minimization');


This reflects a well-known deficiency of $$\chi^2$$ minimization: it is not robust to the presence of outliers.
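As an aside, and not the route this post takes: a common frequentist remedy for outliers is to swap the squared loss for a robust loss such as Huber's, which is quadratic for small residuals but only linear for large ones, so gross outliers can't dominate the fit. A minimal sketch on synthetic data of my own, mirroring the flavor of the dataset above:

```python
import numpy as np
from scipy import optimize

def huber_loss(theta, x, y, dy, c=3):
    # quadratic within c standard errors, linear beyond
    t = (y - (theta[0] + theta[1] * x)) / dy
    return np.sum(np.where(np.abs(t) < c,
                           0.5 * t ** 2,
                           c * np.abs(t) - 0.5 * c ** 2))

# a line with unit errors, plus two grossly biased points
rng = np.random.RandomState(42)
x = 10 * rng.rand(20)
y = 2 * x - 10 + rng.randn(20)
y[3] += 20
y[12] -= 15
dy = np.ones_like(y)

theta_robust = optimize.fmin(huber_loss, [0, 0], args=(x, y, dy), disp=False)
```

Unlike the Bayesian mixture model developed below, this gives no per-point verdict on which observations are outliers; it only downweights them.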

What we would like to do is propose a model which somehow accounts for the possibility that each of these points may be the result of a biased measurement. One possible route is to add $$N$$ new model parameters: one associated with each point, indicating whether it is an outlier or not. If a point is an inlier, we use the standard model likelihood; if it is an outlier, we use a likelihood with a much larger error. The result for our straight-line fit will be a model with $$N + 2$$ parameters, where $$N$$ is the number of data points. An overzealous application of lessons from simple linear models might lead you to believe this model can't be solved. But, if carefully constructed, it can!

Let's show how it can be done.

Our linear model is:

$y_M(x~|~\theta) = \theta_0 + \theta_1 x$

For a non-outlier (let's call it an "inlier") point at $$x$$, $$y$$, with error on $$y$$ given by $$dy$$, the likelihood is

$L_{in, i}(D~|~\theta) = \frac{1}{\sqrt{2\pi dy_i^2}} \exp\frac{-[y_i - y_M(x_i~|~\theta)]^2}{2 dy_i^2}$

For an "outlier" point, the likelihood is

$L_{out, i}(D~|~\theta) = \frac{1}{\sqrt{2\pi \sigma_y^2}} \exp\frac{-[y_i - y_M(x_i~|~\theta)]^2}{2 \sigma_y^2}$

where $$\sigma_y$$ is the standard deviation of the $$y$$ data: note that the only difference between the "inlier" and "outlier" likelihood is the width of the Gaussian distribution.

Now we'll specify $$N$$ additional binary model parameters $$\{g_i\}_{i=1}^N$$ which indicate whether point $$i$$ is an outlier $$(g_i = 1)$$ or an inlier $$(g_i = 0)$$. With this, the overall Likelihood becomes:

$L(D~|~\theta, g) = \prod_i \left[(1 - g_i)~L_{in, i} + g_i~L_{out, i}\right]$

We will put a prior on these indicator variables $$g$$ which encourages sparsity of outliers; this can be accomplished with a simple L1 prior, which penalizes the sum of the $$g$$ terms:

$P(g) = \exp\left[-\sum_i g_i\right]$

where, recall, $$g_i \in \{0, 1\}$$.

Though you could likely solve for a point estimate of this model, I find the Bayesian approach to be much more straightforward and interpretable for a model this complex. To fit this, I'll make use of the excellent emcee package. Because emcee doesn't have categorical variables, we'll instead allow $$g_i$$ to range continuously between 0 and 1, so that any single point will be some mixture of "outlier" and "inlier".

We start by defining a function which computes the log-posterior given the data and model parameters, using some computational tricks for the sake of floating-point accuracy:

In [15]:
# theta will be an array of length 2 + N, where N is the number of points
# theta[0] is the intercept, theta[1] is the slope,
# and theta[2 + i] is the weight g_i

def log_prior(theta):
    g = theta[2:]
    # g_i needs to be between 0 and 1
    if np.any(g < 0) or np.any(g > 1):
        return -np.inf  # recall log(0) = -inf
    else:
        return -g.sum()

def log_likelihood(theta, x, y, dy):
    sigma_y = np.std(y)
    y_model = theta[0] + theta[1] * x
    g = np.clip(theta[2:], 0, 1)  # g < 0 or g > 1 leads to NaNs in logarithm
    # log-likelihood for in-lier
    logL_in = -0.5 * (np.log(2 * np.pi * dy ** 2) + ((y - y_model) / dy) ** 2)
    # log-likelihood for outlier
    logL_out = -0.5 * (np.log(2 * np.pi * sigma_y ** 2) + ((y - y_model) / sigma_y) ** 2)
    return np.sum(np.logaddexp(np.log(1 - g) + logL_in,
                               np.log(g) + logL_out))

def log_posterior(theta, x, y, dy):
    return log_prior(theta) + log_likelihood(theta, x, y, dy)


Now we use the emcee package to run this model. Note that because of the high dimensionality of the model, the run_mcmc command below will take a couple minutes to complete:

In [16]:
import emcee

ndim = 2 + len(x)  # number of parameters in the model
nwalkers = 50  # number of MCMC walkers
nburn = 10000  # "burn-in" period to let chains stabilize
nsteps = 15000  # number of MCMC steps to take

# set walkers near the maximum likelihood
rng = np.random.RandomState(0)
starting_guesses = np.zeros((nwalkers, ndim))
starting_guesses[:, :2] = rng.normal(theta1, 1, (nwalkers, 2))
starting_guesses[:, 2:] = rng.normal(0.5, 0.1, (nwalkers, ndim - 2))

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[x, y, dy])
sampler.run_mcmc(starting_guesses, nsteps)

sample = sampler.chain  # shape = (nwalkers, nsteps, ndim)
sample = sampler.chain[:, nburn:, :].reshape(-1, ndim)

-c:21: RuntimeWarning: divide by zero encountered in log
-c:22: RuntimeWarning: divide by zero encountered in log



The Runtime warnings here are normal – they just indicate that we've hit log(0) = -inf for some pieces of the calculation.

With the sample chain determined, we can plot the marginalized distribution of samples to get an idea of the value and uncertainty of slope and intercept with this model:

In [17]:
plt.plot(sample[:, 0], sample[:, 1], ',k', alpha=0.1)
plt.xlabel('intercept')
plt.ylabel('slope');


Finally, we can make use of the marginalized values of all $$N + 2$$ parameters and plot both the best-fit model, along with a model-derived indication of whether each point is an outlier:

In [18]:
theta2 = np.mean(sample[:, :2], 0)
g = np.mean(sample[:, 2:], 0)
outliers = (g > 0.5)

plt.errorbar(x, y, dy, fmt='ok', ecolor='gray')
plt.plot(xfit, theta1[0] + theta1[1] * xfit, color='gray')
plt.plot(xfit, theta2[0] + theta2[1] * xfit, color='black')
plt.plot(x[outliers], y[outliers], 'ro', ms=20, mfc='none', mec='red')
plt.title('Bayesian Fit');


The red circles mark the points that were determined to be outliers by our model, and the black line shows the marginalized best-fit slope and intercept. For comparison, the grey line is the standard maximum likelihood fit.

Notice that we have successfully fit an $$(N + 2)$$-parameter model to $$N$$ data points, and the best-fit parameters are actually meaningful in a deep way – the $$N$$ extra parameters give us individual estimates of whether each of the $$N$$ data points has misreported errors. I think this is a striking example of how a model that would be considered impossible under the model complexity myth can be solved in practice to produce very useful results!

## Conclusion

I hope you will see after reading this post that the model complexity myth, while rooted in a solid understanding of simple linear models, should not be assumed to apply to all possible models. In fact, it is possible to fit models with more parameters than datapoints: and for the types of noisy, biased, and heterogeneous data we often encounter in scientific research, you can make a lot of progress by taking advantage of such models. Thanks for reading!

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

## July 03, 2015

### NeuralEnsemble

#### Sumatra 0.7 released

We would like to announce the release of version 0.7.0 of Sumatra, a tool for automated tracking of simulations and computational analyses so as to be able to easily replicate them at a later date.

This version of Sumatra brings some major improvements for users, including an improved web browser interface, improved support for the R language, Python 3 compatibility, a plug-in interface making Sumatra easier to extend and customize, and support for storing data using WebDAV.
In addition, there have been many changes under the hood, including a move to Github and improvements to the test framework, largely supported by the use of Docker.
Last but not least, we have changed licence from the CeCILL licence (GPL-compatible) to a BSD 2-Clause Licence, which should make it easier for other developers to use Sumatra in their own projects.

## Updated and extended web interface

Thanks to Felix Hoffman’s Google Summer of Code project, the web browser interface now provides the option of viewing the history of your project either in a “process-centric” view, as in previous versions, where each row in the table represents a computation, or in a “data-centric” view, where each row is a data file. Where the output from one computation is the input to another, additional links make it possible to follow these connections.
The web interface has also had a cosmetic update and several other improvements, including a more powerful comparison view (see screenshot). Importantly, the interface layout no longer breaks in narrower browser windows.

## BSD licence

The Sumatra project aims to provide not only tools for scientists as end users (such as smt and smtweb), but also library components for developers to add Sumatra’s functionality to their own tools. To support this second use, we have switched licence from CeCILL (GPL-compatible) to the BSD 2-Clause Licence.

## Python 3 support

In version 0.6.0, Sumatra already supported provenance capture for projects using Python 3, but required Python 2.6 or 2.7 to run. Thanks to Tim Tröndle, Sumatra now also runs in Python 3.4.

## Plug-in interface

To support the wide diversity of workflows in scientific computing, Sumatra has always had an extensible architecture. It is intended to be easy to add support for new database formats, new programming languages, new version control systems, or new ways of launching computations.
Until now, adding such extensions has required that the code be included in Sumatra’s code base. Version 0.7.0 adds a plug-in interface, so you can define your own local extensions, or use other people’s.

## WebDAV support

The option to archive output data files has been extended to allow archiving to a remote server using the WebDAV protocol.

## Support for the R language

Sumatra will now attempt to determine the versions of all external packages loaded by an R script.

## Other changes

For developers, there has been a significant change - the project has moved from Mercurial to Git, and is now hosted on Github. Testing has also been significantly improved, with more system/integration testing, and the use of Docker for testing PostgreSQL and WebDAV support.
Parsing of command-line parameters has been improved. The ParameterSet classes now have a diff() method, making it easier to see the difference between two parameter sets, especially for large and hierarchical sets.
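The idea behind diff() can be illustrated with a generic recursive comparison of nested parameter dictionaries. This is only an illustrative sketch of the concept, not Sumatra's actual API; the function name and output format are my own.

```python
def dict_diff(a, b, path=""):
    """Recursively compare two (possibly nested) parameter dicts,
    returning {dotted.path: (value_in_a, value_in_b)} for every
    key whose values differ -- the idea behind ParameterSet.diff()."""
    out = {}
    for key in sorted(set(a) | set(b)):
        p = f"{path}.{key}" if path else key
        va, vb = a.get(key), b.get(key)
        if isinstance(va, dict) and isinstance(vb, dict):
            # descend into hierarchical sub-sets
            out.update(dict_diff(va, vb, p))
        elif va != vb:
            out[p] = (va, vb)
    return out
```

For large hierarchical sets, reporting only the dotted paths that changed is far easier to scan than comparing two full parameter files by eye.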
Following the recommendation of the Mercurial developers, and to enable the change of licence to BSD, we no longer use the Mercurial internal API. Instead we use the Mercurial command line interface via the hgapi package.

## Bug fixes

A fair number of bugs have been fixed.

Sumatra 0.7.0 can be installed from PyPI using pip:

  $ pip install sumatra

Alternatively, Sumatra 0.7.0 may be downloaded from PyPI or from the INCF Software Center. Support is available from the sumatra-users Google Group. Full documentation is available on Read the Docs.

### Abraham Escalante

#### Mid-term summary

Hello all,

We're reaching the halfway mark for the GSoC and it's been a great journey so far.

I have had some off-court issues. I was hesitant to write about them because I don't want my blog to turn into me ranting and complaining, but I have decided to briefly mention them on this occasion because they are relevant and, at this point, all but overcome.

Long story short, I was denied the scholarship that I needed to be able to go to Sheffield, so I had to start looking for financing options from scratch. Almost at the same time I was offered a place at the University of Toronto (which was originally my first choice). The reason this is relevant to the GSoC is that it coincided with the beginning of the program, so I had to cope not just with the summer of code but also with searching and applying for funding and with paperwork for the U of T, which combined to make for a lot of work and a tough first month.

I will be honest and say that I got a little worried at around week 3 and week 4 because things didn't seem to be going the way I had foreseen in my proposal to the GSoC. In my previous post I wrote about how I had to make a change to my approach and I knew I had to commit to it so it would eventually pay off.

At this point I am feeling pretty good with the way the project is shaping up. As I mentioned, I had to make some changes, but out of about 40 open issues, now only 23 remain, I have lined up PRs for another 8 and I have started discussion (either with the community or with my mentor) on almost all that remain, including some of the longer ones like NaN handling which will span over the entire scipy.stats module and is likely to become a long term community effort depending on what road Numpy and Pandas take on this matter in the future.

I am happy to look at the things that are still left and find that I at least have a decent idea of what I must do. This was definitely not the case three or four weeks ago and I'm glad with the decision that I made when choosing a community and a project. My mentor is always willing to help me understand unknown concepts and point me in the right direction so that I can learn for myself and the community is engaging and active which helps me keep things going.

My girlfriend, Hélène has also played a major role in helping me keep my motivation when it seems like things amount to more than I can handle.

I realise that this blog (since the first post) has been a lot more about my personal journey than technical details about the project. I do apologise if this is not what you expect, but I reckon it makes the blog easier to appreciate for a reader who is not familiar with 'scipy.stats'; and if you are familiar with it, you probably follow the issues or the developer's mailing list (where I post a weekly update), so technical details would be redundant to you. I also think that the setup of the project, which revolves around solving many issues, makes it too difficult to write about specific details without branching into too many tangents for a reader to enjoy.

If you would like to know more about the technical aspect of the project, you can look at the PRs, contact me directly (via a comment here or the SciPy community) or, even better, download SciPy and play around with it. If you find something wrong with the statistics module, chances are it's my fault; feel free to let me know. If you like it, you can thank guys like Ralf Gommers (my mentor), Evgeni Burovski and Josef Perktold (to name just a few of the most active members in 'scipy.stats') for their hard work and support to the community.

I encourage anyone who is interested enough to go here to see my proposal or go here to see currently open tasks to find out more about the project. I will be happy to fill you in on the details if you reach me personally.

Sincerely,
Abraham.

## July 02, 2015

### Continuum Analytics

#### Find Continuum at SciPy Austin 2015

Continuum is a sponsor of this year’s SciPy Conference in Austin, TX. We invite you to join us at the fourteenth annual Scientific Computing with Python conference, and to check out some really fantastic talks and tutorials from our talented engineers and developers.

#### Continuum Analytics - July Tech Events

The Continuum Team is hitting some big conferences this month, and we hope you’ll join us for exciting talks on Bokeh, Numba, Blaze, Anaconda, and more. From our hometown of Austin, Texas all the way to Spain, our team is ready to talk Python.

## July 01, 2015

### PythonXY

#### Python(x, y) 2.7.10.0 Released!

Hi All,

I'm happy to announce that Python(x, y) 2.7.10.0 is available for immediate download from any of the mirrors. The full change log can be viewed here. Please post your comments and suggestions to the mailing list.

Work on the Python 3.x 64 bit version will resume once Visual Studio 2015 RTM is released.

### What's new in general:

• All packages have been updated to their latest versions (NumPy, SciPy, etc.).
• IPython has been held back at 2.4.1 to avoid potential compatibility issues.
• OpenCV has been held back at 2.4.11 to avoid potential compatibility issues.

### New noteworthy packages:

• yappi - Yet Another Python Profiler.
• libnacl - A ctypes wrapper around libsodium.

Have fun!

-Gabi Davar

## June 30, 2015

### Matthieu Brucher

#### Audio Toolkit: Anatomy of a transient shaper

When I first heard about transient shapers, my reaction was: “what’s the difference with a compressor? Is there one?”. I tried to extract the transients without relying on the transient energy, using a relative power filter (ratio between the instantaneous power and the mean power) or its derivative, but nothing worked. Then someone explained that the gain is driven by the difference between a fast-attack filtered power and a slower one. So here it goes.

First, let’s have a look at the filter graph:
Transient shaper pipeline

I’ve surrounded the transient-shaper-specific part with a dotted line. This is the difference with a compressor/limiter/expander: the way the signal that steers the gain computation is generated.

Let’s start from a kick. The fast envelope follower (red curve) can then be generated, as well as the slow envelope follower (green curve). Their difference is always positive (provided the two followers have the same release time), so we can use it to compute a gain through our usual GainCompressorFilter.
Depending on whether you want to increase the transient or attenuate it, the ratio will be lower or higher than 1 (respectively), which is what is shown in the last two plots here:
Transient shaper plot

In the end, it’s all about the right algorithms. If you have a proper framework, you may already have everything you need for some filters. In the case of a transient shaper, I already had all the filters in my toolbox. Now I just need to make a plugin out of this simple pipeline!
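
The whole pipeline can be sketched in a few lines of Python/NumPy. This is an illustration under assumed time constants, not Audio Toolkit’s actual implementation, and the final gain law here is a simple power of the envelope difference rather than ATK’s GainCompressorFilter (so `amount` below is not the compressor-style ratio from the text):

```python
import numpy as np

def envelope(x, attack, release, rate=44100.0):
    """One-pole envelope follower: fast rise (attack) and slow fall
    (release), with time constants given in seconds."""
    a_att = np.exp(-1.0 / (attack * rate))
    a_rel = np.exp(-1.0 / (release * rate))
    env = np.empty_like(x)
    level = 0.0
    for i, v in enumerate(np.abs(x)):
        coeff = a_att if v > level else a_rel
        level = coeff * level + (1.0 - coeff) * v
        env[i] = level
    return env

def transient_gain(x, amount=2.0, rate=44100.0):
    """Gain curve steered by the difference between a fast-attack and a
    slow-attack envelope that share the same release time.  Here
    amount > 1 boosts transients and amount < 1 softens them (an
    illustrative mapping, not ATK's gain-stage convention)."""
    fast = envelope(x, attack=0.001, release=0.1, rate=rate)
    slow = envelope(x, attack=0.010, release=0.1, rate=rate)
    diff = fast - slow  # non-negative when both release times match
    return amount ** diff
```

Feeding a decaying burst (a kick-like signal) through `transient_gain` produces a gain curve that peaks right at the onset, where the fast envelope has risen but the slow one has not yet caught up, and settles back to 1 afterwards.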

The code for the last plots can be found