## September 16, 2014

### Continuum Analytics

#### Introducing Blaze - Migrations

tl;dr Blaze migrates data efficiently between a variety of data stores.

In our last post on Blaze expressions we showed how Blaze can execute the same tabular query on a variety of computational backends. However, this ability is only useful if you can migrate your data to the new computational system in the first place.

To help with this, Blaze provides the into function which moves data from one container type to another.

### Matthieu Brucher

#### Book review: Machine Learning for Adaptive Many-Core Machines – A Practical Approach

Nice title, surfing on the many core hype, and with a practical approach! What more could one expect from a book on such an interesting subject?

#### Content and opinions

Let’s start first with the table of contents. What the authors call many-core machines is GPUs. You won’t see anything else, like MPI or OpenMP. One of the central topics is also big data, but expect for the part that GPU can process data for some algorithms faster than a CPU, it’s not really big data. I would call it large datasets, but not big data, as in terabytes of data!

The book starts with an introduction on Machine Learning, what will be addressed, and the authors’ GPU library for Machine Learning. If you don’t know anything about GPUs, you will be given a short introduction, but it may not be enough for you to understand the difference between CPUs and GPUs, and why we don’t run everything on GPUs.

The next part tackles supervised learning. Each time, you get a general description of algorithms, then the GPU implementation, and then results and discussions. The authors’ favorite algorithms seem to be neural networks, as everything is compared to these. It’s the purpose of the first chapter in this part, with the classic Back-Propagation algorithm, or the Multiple BP algorithm. They also cover their own tool to automate training, which is nice, as it is usually the issue with neural networks. The next chapter handles missing data and the different kind of missing data. The way it is handled by the library is through an activation mechanism that seems to work, but I don’t know how reliable the training of such NN is, although the results are quite good, so the training system must work. Then we have Support Vector Machines, and curiously, it’s not the “usual” kernel that is mainly used in the book. Also, for once, you don’t the full definition of a kernel, but in a way, you don’t need it for the purpose of this book. The last algorithm of this part is the Incremental Hypersphere Classifier. After a shorter presentation, the authors deal with the experimental setups, and they also chain IHC with SVM. All things considered, this is a small set of supervised learning algorithms. Compared to what you can find in scikit-learn, it is really a small subset.

The third part is about unsupervised and semi-supervised learning. The first algorithm is the Non-Negative Matrix Factorization, presented as a non linear algorithm (when it obviously is), so quite strange. The semi-supervised approach is to divide the set of original features in subsets that have meaning together (eyes, mouth, nose…). Two different implementation are provided in the library, and presented with results in the book. The second chapter is about the current hype in neural networks: deep learning. After the presentation of the basic tool of Deep Belief Networks, the Restricted Boltzmann Machines, you directly jump to implementation and results. I have to say that the fact that the book didn’t describe the architecture of DBNs properly, and it quite difficult to know what DBNs actually are and how they can give such nice results.

The last part consists of the final chapter, a conclusion and a sum-up of the algorithms described in the book.

#### Conclusion

The book doesn’t display all the code that was developed. I certainly didn’t want it to do that, because the code is available on SourceForge. It was a good description of several machine learning algorithms ported on the GPU, with results and comparisons, but it didn’t live up to the expectation of the book title and the introduction. A GPU is kind of many-cores, but we know that the future many-cores won’t be that kind of cores only. It also falls short for the big data approach.

## September 14, 2014

### Jan Hendrik Metzen

#### Unsupervised feature learning with Sparse Filtering

An illustration of unsupervised learning of features for images from the Olivetti faces dataset using the sparse filtering algorithm. This work is based on the paper "Sparse Filtering" by the authors Jiquan Ngiam, Pang Wei Koh, Zhenghao Chen, Sonia Bhaskar, and Andrew Y. Ng published in NIPS 2011.

The sparse filtering algorithm does not try to model the data's distribution but rather to learn features which are sparsely activated, in the sense that

• for each example, only a small subset of features is activated ("Population Sparsity")
• each feature is only activated on a small subset of he examples ("Lifetime Sparsity")
• features are roughly activated equally often ("High Dispersal")

This sparsity is encoded as an objective function and L-BFGS is used to minimize this function.

This notebook illustrates the algorithm on the Olivetti faces dataset.

The source code of the sparse-filtering implementation is avalaible at https://github.com/jmetzen/sparse-filtering

In [2]:
import numpy as np
import pylab as plt

np.random.seed(0)
%matplotlib inline

In [3]:
# Install with "pip install sparse_filtering"
from sparse_filtering import SparseFiltering


### Prepare dataset¶

This will load the Olivetti faces dataset, normalize the examples (global and local centering), and convert each example into a 2D structure (64*64 pixel image). Thereupon, 25 patches of size 16x16pixels are extracted randomly from each image.

In [4]:
from sklearn.datasets import fetch_olivetti_faces
dataset = fetch_olivetti_faces(shuffle=True)
faces = dataset.data

n_samples, _ = faces.shape

faces_centered = faces - faces.mean(axis=0)  # global centering

faces_centered -= \
faces_centered.mean(axis=1).reshape(n_samples, -1)  # local centering

faces_centered = \
faces_centered.reshape(n_samples, 64, 64)  # Reshaping to 64*64 pixel images

plt.imshow(faces_centered[0], cmap=plt.get_cmap('gray'))
_ = plt.title("One example from dataset with n=%s example" % n_samples)

In [5]:
# Extract 25 16x16 patches randomly from each image
from sklearn.feature_extraction.image import extract_patches_2d
patch_width = 16

patches = [extract_patches_2d(faces_centered[i], (patch_width, patch_width),
max_patches=25, random_state=i)
for i in range(n_samples)]
patches = np.array(patches).reshape(-1, patch_width * patch_width)

In [6]:
# Show 25 exemplary patches
plt.figure(figsize=(8, 8))
for i in range(25):
plt.subplot(5, 5, i+1)
plt.imshow(patches[i].reshape(patch_width, patch_width), cmap=plt.get_cmap('gray'))
plt.xticks([])
plt.yticks([])
_ = plt.suptitle("25 exemplary extracted patches")


### Train Sparse Filtering estimator¶

The Sparse Filtering estimator is trained on entires dataset and 64 features extractors are learned. Note that this is computationally expensive and might take several minutes. After training, the corresponding features are extracted for the whole dataset.

In [7]:
n_features = 64   # How many features are learned
estimator = SparseFiltering(n_features=n_features,
maxfun=200,  # The maximal number of evaluations of the objective function
iprint=10)  # after how many function evaluations is information printed
# by L-BFGS. -1 for no information
features = estimator.fit_transform(patches)


The following graphic illustrates the learned feature detectors. Note that most of these correspond to Gabor-like edge detectors, while some seem to correspond to noise. The latter is potentially either because L-BFGS was stopped before convergence or because the number of features (64) was chosen too large.

In [8]:
plt.figure(figsize=(12, 10))
for i in range(estimator.w_.shape[0]):
plt.subplot(int(np.sqrt(n_features)), int(np.sqrt(n_features)), i + 1)
plt.pcolor(estimator.w_[i].reshape(patch_width, patch_width),
cmap=plt.cm.RdYlGn, vmin=estimator.w_.min(),
vmax=estimator.w_.max())
plt.xticks(())
plt.yticks(())
plt.title("Feature %4d" % i)
plt.tight_layout()


# Activation histogram

We check if the learned feature extractors have the desired sparsity properties:

In [9]:
plt.hist(features.flat, bins=50)
plt.xlabel("Activation")
plt.ylabel("Count")
_ = plt.title("Feature activation histogram")


The graphic confirms that features have small activation typically.

In [10]:
activated_features = (features > 0.1).mean(0)
plt.hist(activated_features)
plt.xlabel("Feature activation ratio over all examples")
plt.ylabel("Count")


Lifetime Sparsity: The figure confirms that each feature is only active (i.e., has an activation above 0.1) for a small ratio of the examples (approx. 25%)

In [11]:
activated_features = (features > 0.1).mean(1)
plt.hist(activated_features, bins=10)
plt.xlabel("Ratio of active features in example")
plt.ylabel("Count")
_ = plt.title("Population Sparsity Histogram")


Population Sparsity: Each example activates only few features (approx. 25-30% of the features)

In [12]:
plt.hist((features**2).mean(0))
plt.xlabel("Mean Squared Feature Activation")
plt.ylabel("Count")
_ = plt.title("Dispersal Histogram")


High Dispersal: All features have similar statistics; no one feature has significantly more “activity” than the other features. This is checked by plotting the mean squared activations of each feature obtained by averaging the squared values in the feature matrix across the examples.

### Learn a Second Layer¶

Learning a second layer of features using greedy layer-wise stacking on the features produced by the first layer.

In [13]:
n_features2 = 10   # How many features are learned
estimator2 = SparseFiltering(n_features=n_features2,
maxfun=200,  # The maximal number of evaluations of the objective function
iprint=10)  # after how many function evaluations is information printed
# by L-BFGS. -1 for no information
features2 = estimator2.fit_transform(features)

In [14]:
plt.figure(figsize=(10, 6))
for i in range(n_features2):
# Select the top-6 layer-1 features used within the layer-2 features
indices = np.argsort(estimator2.w_[i])[-6:][::-1]
for j, ind in enumerate(indices):
plt.subplot(6, n_features2, j*n_features2 + i + 1)
plt.pcolor(estimator.w_[ind].reshape(patch_width, patch_width),"
cmap=plt.cm.RdYlGn, vmin=estimator.w_.min(),
vmax=estimator.w_.max())
plt.xticks(())
plt.yticks(())


Each column of the graphic shows the top-6 layer-1 feature extractors used within one of the 10 layer-2 features. According to the paper, this "discovers meaningful features that pool the first layer features". This, however, is only partially reproducable on the Olivetti dataset.

### Discussion¶

In summary

• sparse filtering is a unsupervsied feature learning algorithm that does not try to model the data's distribution but rather to learn features directly which are sparsely activated
• the learned feature extractors obey specific sparsity conditions (lifetime sparsity, population sparsity, and high dispersal)
• deep archictectures can be trained using greedy layer-wise stacking
• the only parameter requiring tuning is the number of features per layer

We will analyse the quality of the learned features in a predictive setting in a follow-up post.

This post was written as an IPython notebook. You can download this notebook.

## September 11, 2014

### Continuum Analytics

#### Bokeh 0.6 Released!

We are excited to announce the release of version 0.6 of Bokeh, an interactive web plotting library for Python!

This release includes some major new features:

• Abstract Rendering recipes for large data sets: isocontour, heatmap
• New charts in bokeh.charts: Time Series and Categorical Heatmap
• Sophisticated Hands-on Table widget
• Complete Python 3 support for bokeh-server
• Much expanded User Guide and Dev Guide
• Multiple axes and ranges now supported
• Object query interface to help with plot styling
• Hit-testing (hover tool support) for patch glyphs

## September 09, 2014

### Titus Brown

#### The MHacks IV hackathon - a student report

Note: this hackathon report was written by ArisKnight Winfree.

This past weekend was the University of Michigan’s hackathon, MHacks IV. Over 1100 students from universities and high schools across the United States took over Ann Arbor from September 5th through September 7th to participate in one of the most anticipated hackathons in the country.

A hackathon is an event where students come together to dream up and build amazing things together -- but under a strict time limit. At MHacks, students spent 36 hours straight collaborating with each other and working with mentors (alongside representatives of some of the most competitive companies in the world) to learn new things and solve interesting problems.

Over $15,000 dollars worth of prizes were awarded in MHacks IV. We had over 21 students representing Michigan State University in Ann Arbor this weekend -- their names are highlighted in bold. An all Michigan State team that made GoalKeeper ended up winning FOUR prizes: Best URL from Namecheap, Best use of Venmo API, Best use of Ziggeo API, and a prize from IBM for using their hosting platform. Goalkeeper is a social platform for holding yourself accountable, whether you want to get in shape, establish good study habits, or write the next great American novel. Users establish goals, then post videos showing that they're keeping up, both logging their achievements and encouraging their friends to keep them on track. In addition to peer pressure, users are incentivized to stick with their goals because every failure to check in means the user is expected to donate$5 to charity.

Team Members: Caitlin McDonald, Erin Hoffman

Backseat Driver is a real time manual transmission driving instructor that could be displayed on a Chrysler vehicle's radio/head unit, reading information from the car's OBD II sensors using the OpenXC platform tools from Chrysler. They won awards for most connected car from Chrysler and best use of mobile sensor data from FarmLogs.

Team Members: Katelyn Dunaski, Michael Ray, Colin Szechy, Riyu Banerjee.

DrinkNoDrive is an app that prevents drinking and driving by monitoring a person’s alcohol use. The app keeps track of your consumption, calculate your BAC, and will order an Uber for you when it’s time to go home. They were awarded the Best Use of the Uber API prize.

Team Members: Chris McGrath, Drew Laske, Tommy Truong, Anh Nguyen.

First place winner: Power Glove - Power Glove 2.0 is an affordable and customizable virtual reality system, focused on finger and hand movement interaction with the computer.

Second place winner: Android for iPhone - We built a full Android environment running inside your iPhone. We even used the newest version of Android possible, the L developer preview!

Third Place winner: Smash Connect – while Super Smash Bros. 64 is a lot of fun to play with a controller, our project uses your body as a controller. Two Microsoft Kinect cameras are used to [Camera 1] Capture user specific commands (Kick, Punch, Special Move, Jump) and [Camera 2] is used to watch both players fight each other to play the game.

List of Companies at Mhacks: Quixey. Mashery. Nest. Thiel fellowship. IBM. Apple. Chrysler. Facebook. Thalmiclabs. Union Pacific. Codility. FarmLogs. Twillio. Ziggeo. Capital One. Microsoft. MLH. Yo. Pillar. Meetup. Sales force. Jawbone. State Farm. Namecheap. Castle. Firebase. Sendgrid. MongoDb, Ebay, Paypal. stackoverflow. Bloomberg. Goldman Sachs. Chrysler. Moxtra. Pillar. Andreessen Horowitz. Uber. Techsmith. Indeed.

The students of Michigan State University were so inspired by the amazing energy and innovation at the University of Michigan’s hackathon (and others) that we are planning on hosting our own student-run hackathon. It is tentatively planned for Spring of 2015.

### Continuum Analytics

#### New Anaconda Launcher - 1.0

We are pleased to announce a new version of the Anaconda Launcher, a graphical user interface that allows Anaconda users to easily discover, install, update, and launch applications with conda.

### Matthieu Brucher

#### Book review: It Sounded Good When We Started – A Project Manager’s Guide to Working with People on Projects

There are a lot of books on software project management best practices. But usually, they are not guides to work with people. And it is people who make projects, not money or computers.

#### Content and opinions

Strangely, the book is not about software projects, its content mainly stems from a hardware project. I find this great, as there are no reasons why people issues are different between different types of project.

I won’t dwell in the details of all chapters of the book, as it is quite long (25 chapters), with no subpart. The basic content of a chapter is an introduction based on stories that happened mainly during a project the authors call Delphi. Based on these stories, the authors talk about what was badly done, then advice on what should have been done, ending in a conclusion for signs to detect before everything goes bad and solutions.

Of course, during the 25 chapters, there are some common good practices, mainly around communication that help solving several issues. But there are enough stories inside that you don’t get bored when reading the same piece of advice twice.

#### Conclusion

For once, a book on software management (I found it on the IEEE website, so I consider it as such) that can be applied on any project. It’s refreshing, still accurate, although the book is almost a decade old.

## September 07, 2014

### Titus Brown

#### The NIH #ADDSup meeting - training breakout report

At the NIH ADDS meeting, we had several breakout sessions. Michelle Dunn and I led the training session. For this breakout, we had the following agenda:

First, build "sticky-note clouds", with one sticky-note cloud with notes for each of the following topics:

1. Desired outcomes of biomedical data science training;
2. Other relevant training initiatives and connections;
3. Topics that are part of data science;
4. Axes of issues (e.g. online training, vs in person?)

Second, coalesce and summarize each topic;

Third, refine concepts and issues;

and fourth, come back with:

1. three recommended actionable items that NIH can do _now_
2. three things that NIH should leave to others (but be aware of)

This was not the greatest agenda, because we ended up spending only a very little time at the end talk about specific actionable items, but otherwise it was a great brainstorming session!

You can read the output document here; below, I expand and elaborate on what _I_ think everything meant.

## Desired outcomes:

In no particular order,

• Increased appreciation by biomedical scientists of the role that data science could play;
• More biomedical scientists biologists using data science tools;
• More data scientists developing tools for biology;
• Increased cross-training of data scientists in biology;
• Map(s) and guidebooks to the data science landscape;
• Increased communication between biomedical scientists and data scientists;
• Making sure that all biomed grad students have a basic understanding of data science;
• Increased data science skills for university staff who provide services

## Data science topics list

Data science topics list:

Stat & experimental design, modeling, math, prob, stochastic processing, NLP/text mining, Machine learning, inference, viz for exploration, regression, classification, Graphical models, data integration

CS - principles of metadata, metadata format algorithms, databases, optimization, complexity, programming, scripting, data management, data wrangling, distributed processing, cloud computing, software development, scripting, Workflows, tool usage

Presentation of results, computational thinking, statistical thinking

We arranged these topics on two diagrams, one set on the "data analysis workflow" chart and one set on the "practical to abstract skill" chart.

Patches welcome -- I did these in OmniGraffle, and you can grab the files here and here.

## Axes/dichotomies of training:

The underlying question here was, how multidimensional is the biomedical training subspace? (Answer: D is around 10.)

• Clinical to biomedical to basic research - what topics do you target with your training?
• Just-in-time vs. background training - do you train for urgent research needs, or for broad background skills?
• Formal vs. informal - is the training formal and classroom oriented, or more informal and unstructured?
• Beginner vs. advanced - do you target beginner level or advanced level students?
• In-person vs. online - do you develop online courses or in-person courses?
• Short vs. long - short (two-day, two week) or long (semester, longer) or longer (graduate career) training?
• Centralized physically vs. federated/distributed physically - training centers, to which everyone travels? Or do you fly the trainers to the trainees? Or do you train the trainers and then have them train locally?
• Andragogy v. pedagogy (Full professor through undergrads) - do your target more senior or more junior people?
• Project-based vs. structured - do you assign open projects, or fairly stereotyped ones, or just go with pure didactic teaching?
• Individual learning vs team learning - self-explanatory.

I should say that I, at least, have no strong evidence that any one point on any of these spectra will be a good fit for even a large minority of potential trainees and goals; we're just trying to lay out the options here.

## Recommendations:

(The first two were from the group; the last three were from the meeting more broadly)

1. Solicit “crazier” R25s in existing BD2K framework; maybe open up another submission round before April.

Provide a list of topics and axes from our discussion; fund things like good software development practice, group-project camp for biomedical data driven discovery, development of good online data sets and explicit examples, hackathons, data management and metadata training, software development training, and communication training.

2. Solicit proposals for training coordination cores (2-3) to gather, catalog, and annotate existing resources, cooperate nationally and internationally, and collaborate on offering training. Also, develop evaluation and assessment criteria that can be applied across workshops.

3. CTB: fund early stage more broadly; don’t just limit training to graduate students in biomedical fields. For example, the biology grad student of today is the biomedical post doc of tomorrow.

4. Mercè Crosas: Fund internships, hands-on collaborative contributions in multi-disciplinary Data Science Labs

5. Leverage T-32 existing by adding req to include data science. (Addendum: there are already supplements available to do this)

6. Right now we're evaluating education grants individually; it would be good to also have a wide and deep evaluation done across grants.

Training should cover senior and junior people

About 3% of the NIH budget is spent on training

Also, Daniel Mietchen (via online document comment) said: Consider using open formats. A good example is the Knight Challenge, currently asking for proposals around the future of libraries: https://www.newschallenge.org/challenge/libraries/brief.html

and that's all, folks!

--titus

## September 06, 2014

### Titus Brown

#### Recipe: Estimate genome size and coverage from shotgun sequencing data

In Extracting shotgun reads based on coverage in the data set, we showed how to get a read coverage spectrum for a shotgun data set. This is a useful diagnostic tool that can be used to estimate total genome size, average coverage, and repetitive content.

Uses for this recipe include estimating the sequencing depth of a particular library, estimating the repeat content, and generally characterizing your shotgun data.

See Estimating (meta)genome size from shotgun data for a recipe that estimates the single-copy genome size; that recipe can be used for any metagenomic or genomic data set, including ones subjected to whole genome amplification, while this recipe can only be used for data from unamplified/unbiased whole genome shotgun sequencing.

Let's assume you have a genome that you've sequenced to some unknown coverage, and you want to estimate the total genome size, repeat content, and sequencing coverage. If your reads are in reads.fa, you can generate a read coverage spectrum from your genome like so:

load-into-counting.py -x 1e8 -k 20 reads.kh reads.fa


and the result looks like this:

For this (simulated) data set, you can see three peaks: one on the far right, which contains the high-abundance reads from repeats; one in the middle, which contains the reads from the single-copy genome; and one all the way on the left at ~1, which contains all of the highly erroneous reads

Pick a coverage number near the peak that you think represents the single-copy genome - with sufficient coverage, and in the absence of significant heterozygosity, it should be the first peak that's well to the right of zero. Here, we'll pick 100.

Then run

./estimate-total-genome-size.py reads.fa reads-cov.dist 100


and you should see:

Estimated total genome size is: 7666 bp
Estimated size > 5x repeats is: 587 bp
Estimated single-genome coverage is: 109.8


The true values (for this simulated genome) are 7500bp genome size, 500 bp > 5x repeat, and 150x coverage, so this isn't a terribly bad set of estimates. The single-genome coverage estimate is probably low because we're estimating read coverage with a k-mer based approach, which is sensitive to highly erroneous reads.

A few notes:

• Yes, you could automate the selection of the first peak, but I think it's always a good idea to look at your data!
• In the case of highly polymorphic diploid genomes, you would have to make sure to include the haploid reads but only count them 50% towards the final genome size.

## September 03, 2014

### Titus Brown

#### The NIH #ADDSup meeting on the next five years of data science at the NIH

Here is a links roundup and some scattered thoughts on the recent meeting on "the next five years of data science at the NIH"; this meeting was hosted by Phil Bourne, the new Associate Director for Data Science at the NIH.

Before I go any further, let me make it clear that I do not speak in any way, shape or form for the NIH, or for Phil Bourne, or anyone else other than myself!

Introduction and commentary:

Phil Bourne took on the role of Associate Director for Data Science at the NIH in March 2014, with the mission of "changing the culture of the NIH" with respect to data science. The #ADDSup meeting was convened on September 3rd with about 50 people attending, after the previous night's dinner and sauna party at Phil's house. (One highlight of dinner was the NIH director, Francis Collins, leading a data science singalong/kumbaya session. I kid you not.) The meeting was incredibly diverse, with a range of academic faculty attending along with representatives from funders, tech companies, biotech, and publishers/data infrastructure folk.

It's very hard to summarize the information-dense discussion in any meaningful way, but notes were taken throughout the meeting if you're interested. I should also note that (like a previous meeting on Software Discovery, #bd2kSDW) tweeting was encouraged with the hashtag #ADDSup -- here's the storify.

I ended up co-leading the training breakout session with Michelle Dunn (NIH), and I am writing a blog post on that separately.

1. Data and Informatics Working Group report
2. Phil Bourne's job application statement.
3. Phil Bourne's "universities as Big Data".
4. Phil Bourne's 10 week report.
5. The final report from the May Software Discovery Workshop (storify here, with video links.)
6. Uduak Thomas' excellent BioInform article (open PDF here) summarizing Phil Bourne's keynote on the NIH Commons, from the 2014 Bioinformatics Open Source Conference. Also see video and slides;

During-meeting coverage:

## Some fragmented thoughts.

Again, all opinions my own :).

It's on us. The NIH can fund things, and mandate things, but cannot _do_ all that much. If you want biomedical data science to advance, figure out what needs to be done and talk to Phil to propose it.

Two overwhelming impressions: the NIH moves very slowly. And the NIH has an awful lot of money.

Nobody in the academic community is interested in computational infrastructure building, unless there's a lot of money involved, in which case we will do it badly (closed source, monolithic architecture, closed data, etc.) Contractors would love to do it for us, but the odds are poor. There may be exceptions but it was hard to think of any extramural infrastructure project that had, long term, met community expectations and been sustainable (counter-examples in comments, please!)

Very few people in the biomedical community are particularly interested in training, either, although they will feign interest if it supports their graduate students in doing research (see: T-32s).

Because of this, the NIH (and the ADDS more specifically) is left carrying water with a sieve. Data science depends critically on software, data sharing, computational infrastructure, and training.

Open was missing. (Geoffrey Bilder pointed this out to me midway through the morning.) That having been said, most of the meeting attendees clearly "got it", but oops?

Use cases! The NIH ADDS is looking for use cases! What do you want to enable, and what would it look like?

The point was made that the commercial data science sector is way more active and advanced than the academic data science sector. There are lots of links, of course, but are we taking advantage of them? I would also counter that this is IMO not true in the case of biomedical data science, where I am unimpressed with what I have seen commercially so far. But maybe I'm just picky.

--titus

### Continuum Analytics

#### Introducing Blaze - Expressions

tl;dr Blaze abstracts tabular computation, providing uniform access to a variety of database technologies

## Introduction

NumPy and Pandas couple a high level interface with fast low-level computation. They allow us to manipulate data intuitively and efficiently.

Occasionally we run across a dataset that is too big to fit in our computer’s memory. In this case NumPy and Pandas don’t fit our needs and we look to other tools to manage and analyze our data. Popular choices include databases like Postgres and MongoDB, out-of-disk storage systems like PyTables and BColz and the menagerie of tools on top of the Hadoop File System (Hadoop, Spark, Impala and derivatives.)

## September 02, 2014

### Jake Vanderplas

#### On Frequentism and Fried Chicken

My recent series of posts on Frequentism and Bayesianism have drawn a lot of comments, but recently Frederick J. Ross, a UW colleague whom I have not yet had the pleasure of meeting, penned a particularly strong-worded critique: Bayesian vs frequentist: squabbling among the ignorant. Here I want to briefly explore and respond to the points he makes in the post.

Mr. Ross doesn't mince words. He starts as follows:

Every so often some comparison of Bayesian and frequentist statistics comes to my attention. Today it was on a blog called Pythonic Perambulations. It's the work of amateurs.

He goes on to lodge specific complaints about subtleties I glossed-over in the four posts, all of which seem to miss one salient detail: the posts were an explicit response to my observation that "many scientific researchers never have opportunity to learn the distinctions between Frequentist and Bayesian methods and the different practical approaches that result..." That is, I aimed the discussion not toward someone with a deep background in statistics, but at someone who can't even name the fundamental differences between frequentism and Bayesianism.

Did I gloss over advanced subtleties in this introductory primer? Certainly. As interesting as it may have been for Mr. Ross and other well-read folks had I delved into, say, the deeper questions of assumptions implicit in frequentist constraints vs. Bayesian priors, it would have distracted from the purpose of the posts, and would have lost the very readers for whom the posts were written.

## Rethinking the Debate

Thus, we see that his first set of complaints can be chalked-up to a simple misunderstanding of the intended audience: that's an honest mistake, and I won't make more of it. But he goes beyond this, and proposes his own final answer to the centuries-old debate between frequentists and Bayesians. As he writes: "Which one is right? The answer, as usual when faced with a dichotomy, is neither."

This should pique your interest: he's claiming that not only am I, a humble blogger, an ignorant amateur (which may be true), but that luminaries of the science and statistics world — people like Neyman, Pearson, Fisher, Jaynes, Jeffries, Savage, and many others who sometimes ardently addressed this question — are simply ignorant squabblers within the field which they all but created. I doubt I'm alone in finding this sweeping indictment a bit suspect.

But let's skip these charges and dig further: what third route does Mr. Ross propose to trample all this precedent? The answer is decision theory:

Probability, as a mathematical theory, has no need of an interpretation... the real battleground is statistics, and the real purpose is to choose an action based on data. The formulation that everyone uses for this, from machine learning to the foundations of Bayesian statistics, is decision theory.

His argument is that frequentist and Bayesian methods, in a reductionist sense, are both simply means of reaching a decision based on data, and can therefore be viewed as related branches of decision theory. He goes on to define some notation which explains how any statistical procedure can be formulated as a question of progressing from data, via some loss function, to a particular decision. Frequentist and Bayesian approaches are simply manifestations of this unified theory which use particular loss functions, and thus squabbling about them is the pastime of the ignorant.

I'd like to offer an analogy in response to this idea.

## Baked or Fried?

One day in the kitchen, two chefs begin arguing about who makes the best chicken. Chef Hugh prefers his chicken fried: the quick action of the hot oil results light, crispy spiced outer breading complementing the tender meat it encloses. Chef Wolfgang, on the other hand, swears by baked chicken, asserting that its gentler process leaves more moisture, and allows more time for complex flavors to seep into the meat. They decide to have a cook-off: Fried vs. Baked, to decide once and for all which method is the best.

They're just beginning their preparations in the test kitchen when Rick, the local Food Theorist, storms through the door. He follows these chefs on Twitter, and has heard about this great Fried vs. Baked debate. Given his clear expertise on the matter, he wants to offer his final say on the question. As Food Theorists are wont to do, he starts lecturing them:

"Truly, I'm not really sure what this whole contest is about. Don't you know that baking and frying are both doing essentially the same thing? Proteins denature as they heat. Water evaporates, sugar caramelizes, and the Maillard Reaction turns carbohydrates and amino acids into a crispy crust. If you could just get into your ignorant heads that any cooking method is simply a manifestation of these simple principles, you'd realize that neither method is better, and we wouldn't need to waste our time on this silly competition."

At this point, Chef Hugh and Chef Wolfgang pause, catch each other's gaze for a moment, and burst into a hearty laughter. They turn around continue the task of actually turning the raw chicken meat into an edible meal, enjoying the craft and camaraderie of cooking even in the midst of their friendly disagreement. Rick slips out the door and heads home to gaze at his navel while eating a dinner of microwaved pizza bagels.

## Baked or Fried? Bayesian or Frequentist?

So what's my point here? The fact is that everything our Food Theorist has said is technically correct: from a completely reductionist perspective, cooking meat is nothing more than controlled denaturing of proteins, evaporation of water, and other well-understood chemical processes. But to actually prepare a meal, you can't stop with the theory. You have to figure out how to apply that knowledge in practice, and that requires decisions about whether to use an oven, a deep fryer, or a charcoal grill.

Similarly, everything Mr. Ross said in his blog post is more or less true, but you can't stop there. Applying his decision theory in practice requires making some choices: despite his protests, you actually do have to decide how to map your theory of probability onto reality reflected in data, and that requires some actual philosophical choices about how you treat probability, which lead to fundamentally different questions being answered.

## Frequentism vs. Bayesianism, Again

This brings us back to the original question Mr. Ross (not I) posed: Frequentism vs. Bayesianism: which is correct? As I've maintained throughout my posts (and as Mr. Ross seems to have overlooked when reading them): neither is correct. Or both. It really depends on the situation. As I have attempted to make clear, if you're asking questions about long-term limiting frequencies of repeated processes, classical frequentist approaches are probably your best bet. If you're hoping to update your knowledge about the world based on a finite set of data, Bayesian approaches are more appropriate.

While I have argued that Frequentist approaches answer the wrong question in most scientific settings, I have never claimed that frequentism is fundamentally flawed, or that it is "wrong": on the contrary, in that particular post I went to great length to use Monte Carlo simulations to show that the frequentist approach does in fact give the correct answer to the question it asks. Frequentist and Bayesian approaches answer different statistical questions, and that is a fact you must realize in order to use them.

So where does this leave us? Well, Mr. Ross seems to have based his critique largely on misunderstandings: my intended audience was novices rather than experts, and despite his claims otherwise I have never held either the frequentist or Bayesian approach as globally correct at the expense of the other. His protests notwithstanding, I maintain that in practice, frequentism and Bayesianism remain as different as fried and baked chicken: you can huff and puff about unified theoretical frameworks until your face is blue, but at the end of the day you need to choose between the oven and the fryer.

### Matthieu Brucher

#### Book review: A Brief Introduction to Continuous Evolutionary Optimization

When I developed my first tool based on genetic algorithms, it was to replace local optimization algorithm (“Simulated Annealing”, as advertised by Numerical Recipes) as a global optimization algorithm. Now a couple of years later, I found a small book on GE that seemed on topic with what I had to implement (I relied at the time on another book that I never reviewed ; perhaps another time). Is it a good brief introduction as the book says?

First of all, the book is really small, only 100 pages. At more than 50$, that’s quite a high page rate, especially when there are only 70 pages of actual content… There are three parts in the book, the foundation, advanced optimization and learning. Let’s start with the first one. The first part also contains the introduction. It mainly describes what evolutionary optimization means, and how to compare it to the other techniques, and a description of the other chapters. So the first interesting/technical chapter is the second one. It explains what a genetic algorithm actually is, and the way the algorithm is described is better that what I found in bigger books. There are also additional variants of algorithms used by GE (different types of recombinations for instance), other usual may be missing (like selection by tournament). As the book is about evolutionary optimization, and not GE, there is also the introduction of other algorithms, namely Particle Swarm Optimization, and Covariance Matrix Adaptation Evolution Strategies. Those were properly defined, better than most online resources as well. The third chapter is about defining the parameters of the algorithms, and this is an interesting chapter as the main topic is to allow this evolutionary algorithm to evolve the parameters during the optimization. This was really interesting to see, and I’m looking forward to implementing this in my tool. More evolution in part 2, with using evolution to handle constraints (really funny, yet something else to try on my case), and then a variant of the Powell method mixed with evolutionary optimization. It is less useful perhaps, as it seems to be a local optimization algorithm, related to simulated annealing in some way, but with the population approach. Still interesting, but you won’t apply this as often as what you learned in the first part or the chapter before. The last chapter in this part tackles multiple objective optimization. This is really a good approach (a colleague of mine made his PhD on this kind of problems, now I understand his thesis better!) to multiple objective optimization. Usually, you have to combine them in some way, but as we use populations, we get actual borders, and possibly an infinite number of possible solutions, without having to choose which function to favor. The last chapter is called “Learning”, but I would have called it “Applications”. The applications are model learning, but that’s about it. It is nice though that the book tackles more or less textbook functions in the first parts, and then tackles a problem that makes some actual sense (no, I don’t think the Rosenbrock cost function optimization can compare to solving an actual problem). Someone thought about what was missing in the other books! #### Conclusion In the end, the main issue is the cost of the book, although compared to other scientific book, it is not that expensive. Otherwise, I found it really clear, efficient, covering several subjects properly. I think I found my new reference for people wanting to start evolutionary computation! ## September 01, 2014 ### Titus Brown #### Running Software Carpentry instructor training at UC Davis in January, 2015 I am pleased to announce that Dr. Greg Wilson will be giving a two-day Software Carpentry Instructor Training workshop at UC Davis, January 6-7, 2015. This will be an in-person version of the instructor training that Greg runs every quarter; see my blog post about the first such instructor training, here to get some of idea of what it's like. Note that the training is necessary for those who want to become accredited Software Carpentry or Data Carpentry instructors. If you're interested in further information, go subscribe to this github issue; we should have room for a dozen extra people to attend. Anyone is welcome, including people unaffiliated with UC Davis. NOTE: This workshop is hosted by my new lab, but I am seeking additional support; drop me a line if you're interested in co-sponsoring the workshop. On January 8th, I plan to run a third day of activities. The three things I have tentatively planned are, 1. run people through the GitHub Flow process of contributing to a github repository, as in our hackathon (also see our writeup here); 2. walk people through our process for generating maintainable command-line tutorials, recipes, and protocols (more on that at ABIC); 3. have an open discussion about what and how to do training in data intensive biology at Davis. This third day of activities is going to be more restricted in attendance than the first two days, but if you're into teaching computation and biology and want to see how we're doing things, drop me a note (t@idyll.org) and I'll see if we can fit you in. --titus ## August 30, 2014 ### Titus Brown #### Some naive ideas about training efforts at UC Davis As I mentioned, I am hoping to significantly scale up my training efforts at UC Davis; it's one of the reasons they hired me, it's a big need in biology, and I'm enthusiastic about the whole thing! A key point is that, at least at the beginning, it may replace some or all of my for-credit teaching. (Note that the first four years of Analyzing Next-Generation Sequencing Data counted as outreach, not teaching, at MSU.) I don't expect to fully spool up before fall 2015, but I wanted to start outlining my thoughts. The ideas below came in large part from conversations with Tracy Teal, a Software Carpentry instructor who is one of the people driving Data Carpentry, and who also was one of the EDAMAME course instructors. ## How much training, how often, and to whom? I think my initial training efforts will center on Software Carpentry-style workshops, on a variety of (largely bio-specific) topics. These would be two-day in-person workshops, 9-5am, each focused on a specific topic. I think I can sustainably lead one a month, with perhaps a few months where I organize two in the same week (M/Tu and Th/Fri, perhaps). These would be on top of at least one NGS course a year, too. I also expect I will participate in various Genome Center training workshops. The classes would be targeted at grad students, postdocs, and faculty -- same as the current NGS course. I would give attendees from VetMed some priority, followed by attendees with UC Davis affiliations, and then open to anyone. I imagine doing this in a tiered way, so that some outsiders could always come; variety and a mixed audience are good things! ## On what topics? I have a laundry list of ideas, but I'm not sure what to start with or how to make decisions about what to teach when. ...suggestions welcome. (I also can't teach all of these myself, but I want to get the list of ideas down!) I'd like to preface this list with a few comments: I've been teaching and training in these topics for five years (at least) now, so I'm not naive about how hard (or easy) it is to teach this to computationally inexperienced biologists. It's clear that there's a progression of skills that need to be taught for most of these, as well as a need for careful lesson planning, tutorial design, and pre/post assessment. These workshops would also be but one arrow in the quiver -- I have many other efforts that contribute to my lab's teaching and training. With that having been said, here's a list of general things I'd like to teach: • Shell and UNIX (long running commands, remote commands, file and path management) • Scripting and automation (writing scripts, make, etc.) • Bioinformatics and algorithms • "Big data" statistics • Data integration for sequencing data • Software engineering (testing, version control, code review, etc.) on the open source model • Practical bioinformatics (See topics below) • Modeling and simulations • Workflows and replication tracking • Software Carpentry • Data Carpentry I have many specific topics that I think people know they want to learn: • Mapping and variant calling • Genome assembly and evaluation (microbial & large genomes both) • Transcriptome assembly and evaluation (reference free & reference based) • Genome annotation • Differential expression analysis • ChIP-seq • Metagenome assembly • Microbial ecology and 16s approaches • Functional inference (pathway annotations) • Phylogenomics • Marker development • Genotyping by sequencing • Population genomics And finally, here are two shorter workshop ideas that I find particularly neat: experimental design (from sample prep through validation), and sequencing case studies (success and failure stories). In the former, I would get together a panel of two or three people to talk through the issues involved in doing a particular experiment, with the goal of helping them write a convincing grant For the latter, I would find both success and failure stories and then talk about what other approaches could have rescued the failures, as well as what made the successful stories successful. ## To what end? Community building and collaborations. Once I started focusing in on NGS data at MSU as an assistant professor, I quickly realized that I could spend all my time in collaborations. I learned to say "no" fairly fast :). But all those people still need to do data analysis. What to do? I had no clear answer at MSU, but this was one reason I focused on training. At Davis, I hope to limit my formal collaborations to research topics, and concentrate on training everybody to deal with their own data; in addition to being the only scalable approach, this is career-building for them. This means not only investing in training, but trying to build a community around the training topics. So I'd like to do regular (weekly? fortnightly?) "help desk" afternoons for the campus, where people can come talk about their issue du jour. Crucially, I would limit this to people that have gone through some amount of training - hopefully both incentivizing people to do the training, and making sure that some minimal level of effort has been applied. The goal would be to move towards a self-sustaining community of people working in bioinformatic data analysis across multiple levels. ## Cost and materials. Since UCD VetMed is generously supporting my salary, I am naively expecting to charge nothing more than a nominal fee -- something that would discourage people from frivolously signing up or canceling. Perhaps lunch money? (This might have to be modified for people from outside of VetMed, or off-campus attendees.) All materials would continue to be CC0 and openly available, of course. 'cause life's too short to limit the utility of materials. ## Other thoughts I'd love to put together a slush fund so that I can invite out speakers to run workshops on topics that I don't know that well (most of 'em). How about a workshop focused on teaching people how to teach with the materials we put together? (I would expect most of these workshops to be cloud-based.) --titus p.s. In addition to Tracy, thanks to Keith Bradnam, Aaron Darling, Matt MacManes and Ethan White, for their comments and critiques on a draft. #### The Critical Assessment of Metagenome Interpretation and why I'm not a fan If you're into metagenomics, you may have heard of CAMI, the Critical Assessment of Metagenome Interpretation. I've spoken to several people about it in varying amounts of detail, and it seems like the CAMI group is working to generate some new shotgun metagenome data sets and will then encourage tool developers to bang on them. (You can also read a short Methagora blog on CAMI.) I've been asked by about a dozen people now what I think of it, so I'm blogging about it now rather than answering people individually :). tl; dr? I'm not a fan. First, what's this subscription nonsense? (We'll keep you in the loop if you register here.) There are a plethora of ways to keep people in the loop without asking them to subscribe to anything (blogs, Twitter, Web sites, e-mail lists...). Is there a reason to keep the details of the contest secret or hidden behind a subscriber wall? (Answer: no.) Second, I am told that the contest will be run once, with software that is blind to the content of the metagenome. I understand the idea of blind software evaluation, but this is not a long-term sustainable approach; we need to generate a nice set of orthogonal data sets and then LART people who fine-tune their software against one or another. Third, it looks like the CAMI folk will make the same mistake as the Assemblathon 2 folk, and not require that the analyses be completely replicable. So in the end there will be a massive expenditure of effort that results in a paper, which will then be a nice static record of how things were back in 2014. Given the pace of tool and technology change, this will have a very short shelf-life (although no doubt tool developers will cite it for years to come, to prove that IDBA was once worse than their own assembler is now). Why not re-run it every 6 months with the latest versions of softwares X, Y, and Z? We have plenty of ways to automate analyses, and there is simply no excuse for not doing so at this point. (ht to Aaron Darling for alerting me to nucleotid.es.) Fourth, there are several mock metagenome data sets that are already underanalyzed. For example, we're currently working with the Shakya et al. (2013) data set, but I don't think anyone else is (and it's pretty clear most people don't realize what a stinging indictment of 16s this paper is - or, at least, don't care. Discussion for another time.). So why are we not poking at existing data first? I don't know, but I suspect it speaks to an underlying bias against re-analyzing published data sets, which we really need to get over in this field. As I said during the Q&A for my BOSC 2014 keynote, I'm not a big fan of how we do benchmarks in bioinformatics; I think this is a fine exemplar. I probably won't participate in the CAMI benchmark, not just for the above reasons but because I simply don't have the people to do so at the moment. ...plus, as with Assemblathon 2, they're going to need reviewers, right? ;) Note that Aaron Darling raised many similar points at ISME, and they were apparently very well received. Maybe CAMI will adapt their approach in response, which would be great! --titus p.s. Thanks to Alice McHardy and Aaron Darling for their detailed discussions of this with me! ## August 29, 2014 ### Titus Brown #### Some software development plans for Davis I've been thinking a lot about what I want to do at UC Davis, and I would like to announce the following plans: 1. I hope to write new and better transcriptome and metagenome assemblers. The current assemblers are hopelessly bad and I think we can probably eke out a 1% improvement, at least, by starting from scratch. 2. I've been very disappointed with the workflow systems that are out there, so for people that want to run our new assembler, we'll be packaging that with a new workflow management system. 3. We'll also be developing a new Web site for running analyses of biological data. Galaxy may have market penetration, but I really dislike their default CSS, and I think my lab can probably do a better job if we start from scratch. 4. Needless to say, I'll need my own physical cloud hardware to run it all. So I'm planning a considerable expansion of our lab's cluster. I think we can probably make a big impact by writing our own virtualization management software, too; the existing systems are written by amateurs like Amazon and OpenStack, after all! The bad news is that after some serious consideration, I have decided not to invest in a new mapping program; on top of the above it's a bit too much. I think I trust Heng Li's bwa, for now; we may revisit in a few months. I will be seeking large scale NSF, NIH, USDA, and DOE funding for the above. Let me know if you want to sign on! --titus ### Stefan Pfenninger #### IPython notebook extensions to ease day-to-day work Since I use the IPython notebook for a lot of explorative data analysis, I wanted some extra functionality to make day-to-day notebook work just a little bit more pleasant. I use several of Min RK’s excellent notebook extensions, particularly the floating table of contents and the Gist button. This post describes some additional extensions I created to do things I need in day-to-day work, as well as how to leverage the customization possibilities of the CodeMirror editor to add vertical rulers to code cells. The notebook extensions described here are available on GitHub. ## Notifications Sometimes larger analysis tasks take more than a couple of seconds to compute, and I want to alt-tab to do something else while I wait. The IPython-notify extension ensures that you don’t forget what you were originally doing, by displaying a notification once the kernel becomes idle again after a set minimum busy time: When the plugin is activated, a button to request notification permissions is shown in the toolbar. After notification permissions have been granted, this button is replaced by a dropdown menu with five choices: Disabled, 0, 5, 10 and 30. To activate notifications, select a minimum kernel busy time required to trigger a notification (e.g. if selecting 5, a notification will only be shown if the kernel was busy for more than 5 seconds). The selection is saved in the notebook’s metadata and restored when the notebook is re-opened. ## Theme switching As described in a previous post, depending on ambient lighting and time of day I prefer seeing my code in light text on a dark background, so I created a simple extension to switch between two themes. The default CSS included only changes code cells (this has the added effect that it’s easier to tell code cells apart from markdown or raw cells): The CSS files can easily be exchanged for a different ones, for example a Base16 theme. ## Customizing the CodeMirror editor IPython notebooks use the CodeMirror editor, which includes a wide range of addons. IPython also allows the user to add additional javascript and CSS via the custom.js and custom.css files in the static/custom directory inside the IPython profile (likely ~/.ipython/profile_default/static/custom on Linux or OS X). I want to display some rulers to indicate line lengths I shouldn’t exceed to conform with PEP8. Thanks to CodeMirror’s flexible configuration system, this can be achieved by adding the following code to custom.js, making use of the rulers addon: require(["components/codemirror/addon/display/rulers"]); var clsname = "ipynb_ruler"; var rulers = [{column: 79, className: clsname}, {column: 99, className: clsname}]; IPython.Cell.options_default.cm_config.rulers = rulers;  By adding the class ipynb_ruler to custom.css, the look of the rulers can be tweaked, for example to use a lighter color: .ipynb_ruler { border-color: #DFDFDF; }  Unfortunately, there is a bug in older CodeMirror versions that in some browsers causes extra scrollbars when enabling rulers. For the time being, this can be fixed by placing a fixed rulers.js into ~/.ipython/profile_default/static/components/codemirror/addon/display (adjust as needed if your IPython profile directory is different). I backported a fix into the rulers.js shipped with IPython 2.2, available here. ## August 27, 2014 ### Titus Brown #### Estimating (meta)genome size from shotgun data This is a recipe that provides a time- and memory- efficient way to loosely estimate the likely size of your assembled genome or metagenome from the raw reads alone. It does so by using digital normalization to assess the size of the coverage-saturated de Bruijn assembly graph given the reads provided by you. It does take into account coverage, so you need to specify a desired assembly coverage - we recommend starting with a coverage of 20. Uses for this recipe include estimating the amount of memory required to achieve an assembly and providing a lower bound for metagenome assembly size and single-copy genome diversity. This recipe will provide inaccurate estimates on transcriptomes (where splice variants will end up confusing the issue - this looks at single-copy sequence only) or for metagenomes with high levels of strain variation (where the assembler may collapse strain variants that this estimate will split). Note: at the moment, the khmer script normalize-by-median.py needs to be updated from the master branch of khmer to run this code properly. Once we've cut a new release, we'll remove this note and simply specify the khmer release required. Let's assume you have a simple genome with some 5x repeats, and you've done some shotgun sequencing to a coverage of 150 or higher. If your reads are in reads.fa, you can get an estimate of the single-copy genome size (here known to be 5500 bp) by running normalize-by-median.py ~/dev/khmer/scripts/normalize-by-median.py -x 1e8 -k 20 -C 20 -R report.txt reads.fa ./estimate-genome-size.py -C 20 -k 20 reads.fa.keep report.txt  This yields the output: Estimated (meta)genome size is: 8727 bp  This is off by about 50% for reasons that we don't completely understand. Note that you can get more accurate estimates for this data set by increasing C and decreasing k, but 20/20 should work about this well for most data sets. (For an E. coli data set, it returns 6.5 Mbp, which is only about 25% off.) #### Estimate whether your sequencing has saturated your sample to a given coverage This recipe provides a time-efficient way to determine whether you've saturated your sequencing depth, i.e. how much new information is likely to arrive with your next set of sequencing reads. It does so by using digital normalization to generate a "collector's curve" of information collection. Uses for this recipe include evaluating whether or not you should do more sequencing. This approach is more accurate for low coverage than normalize-by-median's reporting, because it collects the redundant reads rather than discarding them. Note: at the moment, the khmer script normalize-by-median.py needs to be updated from the master branch of khmer to run this code properly. Once we've cut a new release, we'll remove this note and simply specify the khmer release required. Let's assume you have a simple genome with some 5x repeats, and you've done some shotgun sequencing, and you want to know whether or not you've saturated to a coverage of 5 with your sequencing. You can use a variant of digital normalization, saturate-by-median, to run a collector's curve: ~/dev/khmer/sandbox/saturate-by-median.py -x 1e8 -k 20 -C 5 -R report.txt --report-frequency 10 reads.fa  Then, plot the resulting saturation curve: ./plot-saturation-curve.py report.txt saturation.png --xmin 0 --ymin 0 --xmax 1500  The x axis here is the number of reads examined (column 1 in report.txt), while the y axis (column 2) is the number of reads that are below a coverage of 5 in the data set at that point. You can see here that by the time you had sampled 1000 reads, you'd stopped seeing new coverage=5 reads, which suggests that further sequencing is unnecessary. If you zoom out on the graph, you'll see that the curve keeps on climbing, albeit much more slowly. This is due to the influence of error rate on prediction of "novel" reads, and is something we have to fix. ### William Stein #### What is SageMathCloud: let's clear some things up [PDF version of this blog post] "You will have to close source and commercialize Sage. It's inevitable." -- Michael Monagan, cofounder of Maple, told me this in 2006. SageMathCloud (SMC) is a website that I first launched in April 2013, through which you can use Sage and all other open source math software online, edit Latex documents, IPython notebooks, Sage worksheets, track your todo items, and many other types of documents. You can write, compile, and run code in most programming languages, and use a color command line terminal. There is realtime collaboration on everything through shared projects, terminals, etc. Each project comes with a default quota of 5GB disk space and 8GB of RAM. SMC is fun to use, pretty to look at, frequently backs up your work in many ways, is fault tolerant, encourages collaboration, and provides a web-based way to use standard command-line tools. ### The Relationship with the SageMath Software The goal of the SageMath software project, which I founded in 2005, is to create a viable free open source alternative to Magma, Mathematica, Maple, and Matlab. SMC is not mathematics software -- instead, SMC is best viewed by analogy as a browser-based version of a Linux desktop environment like KDE or Gnome. The vast majority of the code we write for SMC involves text editor issues (problems similar to those confronted by Emacs or Vim), personal information management, support for editing LaTeX documents, terminals, file management, etc. There is almost no mathematics involved at all. That said, the main software I use is Sage, so of course support for Sage is a primary focus. SMC is a software environment that is being optimized for its users, who are mostly college students and teachers who use Sage (or Python) in conjunction with their courses. A big motivation for the existence of SMC is to make Sage much more accessible, since growth of Sage has stagnated since 2011, with the number one show-stopper obstruction being the difficulty of students installing Sage. #### Sage is Failing Measured by the mission statement, Sage has overall failed. The core goal is to provide similar functionality to Magma (and the other Ma's) across the board, and the Sage development model and community has failed to do this across the board, since after 9 years, based on our current progress, we will never get there. There are numerous core areas of research mathematics that I'm personally familiar with (in arithmetic geometry), where Sage has barely moved in years and Sage does only a few percent of what Magma does. Unless there is a viable plan for the areas to all be systematically addressed in a reasonable timeframe, not just with arithmetic geometry in Magma, but with everything in Mathematica, Maple., etc, we are definitely failing at the main goal I have for the Sage math software project. I have absolutely no doubt that money combined with good planning and management would make it possible to achieve our mission statement. I've seen this hundreds of times over at a small scale at Sage Days workshops during the last decade. And let's not forget that with very substantial funding, Linux now provides a viable free open source alternative to Microsoft Windows. Just providing Sage developers with travel expenses (and 0 salary) is enough to get a huge amount done, when possible. But all my attempts with foundations and other clients to get any significant funding, at even the level of 1% of the funding that Mathematica gets each year, has failed. For the life of the Sage project, we've never got more than maybe 0.1% of what Mathematica gets in revenue. It's just a fact that the mathematics community provides Mathematica$50+ million a year, enough to fund over 600 fulltime positions, and they won't provide enough to fund one single Sage developer fulltime.

But the Sage mission statement remains, and even if everybody else in the world gives up on it, I HAVE NOT. SMC is my last ditch strategy to provide resources and visibility so we can succeed at this goal and give the world a viable free open source alternative to the Ma's. I wish I were writing interesting mathematical software, but I'm not, because I'm sucking it up and playing the long game.

### The Users of SMC

During the last academic year (e.g., April 2014) there were about 20K "monthly active users" (as defined by Google Analytics), 6K weekly active users, and usually around 300 simultaneous connected users. The summer months have been slower, due to less teaching.

Numerically most users are undergraduate students in courses, who are asked to use SMC in conjunction with a course. There's also quite a bit of usage of SMC by people doing research in mathematics, statistics, economics, etc. -- pretty much all computational sciences. Very roughly, people create Sage worksheets, IPython notebooks, and Latex documents in somewhat equal proportions.

### What SMC runs on

Technically, SMC is a multi-datacenter web application without specific dependencies on particular cloud provider functionality. In particular, we use the Cassandra database, and custom backend services written in Node.js (about 15,000 lines of backend code). We also use Amazon's Route 53 service for geographically aware DNS. There are two racks containing dedicated computers on opposites sides of campus at University of Washington with 19 total machines, each with about 1TB SSD, 4TB+ HDD, and 96GB RAM. We also have dozens of VM's running at 2 Google data centers to the east.

A substantial fraction of the work in implementing SMC has been in designing and implementing (and reimplementing many times, in response to real usage) a robust replicated backend infrastructure for projects, with regular snapshots and automatic failover across data centers. As I write this, users have created 66677 projects; each project is a self-contained Linux account whose files are replicated across several data centers.

### The Source Code of SMC

The underlying source of SMC, both the backend server and frontend client, is mostly written in CoffeeScript. The frontend (which is nearly 20,000 lines of code) is implemented using the "progressive refinement" approach to HTML5/CSS/Javascript web development. We do not use any Javascript single page app frameworks, though we make heavy use of Bootstrap3 and jQuery. All of the library dependencies of SMC, e.g., CodeMirror, Bootstrap, jQuery, etc. for SMC are licensed under very permissive BSD/MIT, etc. libraries. In particular, absolutely nothing in the Javascript software stack is GPL or AGPL licensed. The plan is that any SMC source code that will be open sourced will be released under the BSD license. Some of the SMC source code is not publicly available, and is owned by University of Washington. But other code, e.g., the realtime sync code, is already available.
Some of the functionality of SMC, for example Sage worksheets, communicate with a separate process via a TCP connection. That separate process is in some cases a GPL'd program such as Sage, R, or Octave, so the viral nature of the GPL does not apply to SMC. Also, of course the virtual machines are running the Linux operating system, which is mostly GPL licensed. (There is absolutely no AGPL-licensed code anywhere in the picture.)

Note that since none of the SMC server and client code links (even at an interpreter level) with any GPL'd software, that code can be legally distributed under any license (e.g., from BSD to commercial).
Also we plan to create a fully open source version of the Sage worksheet server part of SMC for inclusion with Sage. This is not our top priority, since there are several absolutely critical tasks that still must be finished first on SMC, e.g., basic course management.

The University of Washington Center for Commercialization (C4C) has been very involved and supportive since the start of the projects. There are no financial investors or separate company; instead, funding comes from UW, some unspent grant funds that were about to expire, and a substantial Google "Academic Education Grant" ($60K). Our first customer is the "US Army Engineer Research and Development Center", which just started a support/license agreement to run their own SMC internally. We don't currently offer a SaaS product for sale yet -- the options for what can be sold by UW are constrained, since UW is a not-for-profit state university. Currently users receive enhancements to their projects (e.g., increased RAM or disk space) in exchange for explaining to me the interesting research or teaching they are doing with SMC. The longterm plan is to start a separate for-profit company if we build a sufficient customer base. If this company is successful, it would also support fulltime development of Sage (e.g., via teaching buyouts for faculty, support of students, etc.), similar to how Magma (and Mathematica, etc.) development is funded. In conclusion, in response to Michael Monagan, you are wrong. And you are right. #### You don't really think that Sage has failed, do you? I just received an email from a postdoc in Europe, and very longtime contributor to the Sage project. He's asking for a letter of recommendation, since he has to leave the world of mathematical software development (after a decade of training and experience), so that he can take a job at hedge fund. He ends his request with the question: > P.S. You don't _really_ think that Sage has failed, do you? After almost exactly 10 years of working on the Sage project, I absolutely do think it has failed to accomplish the stated goal of the mission statement: "Create a viable free open source alternative to Magma, Maple, Mathematica and Matlab.". When it was only a few years into the project, it was really hard to evaluate progress against such a lofty mission statement. However, after 10 years, it's clear to me that not only have we not got there, we are not going to ever get there before I retire. And that's definitely a failure. Here's a very rough quote I overheard at lunch today at Sage Days 61, from John Voight, who wrote much quaternion algebra code in Magma: "I'm making a list of what is missing from Sage that Magma has for working with quaternion algebras. However, it's so incredibly daunting, that I don't want to put in everything. I've been working on Magma's quaternion algebras for over 10 years, as have several other people. It's truly daunting how much functionality Magma has compared to Sage." The only possible way Sage will not fail at the stated mission is if I can get several million dollars a year in money to support developers to work fulltime on implementing interesting core mathematical algorithms. This is something that Magma has had for over 20 years, and that Maple, Matlab, and Mathematica also have. That I don't have such funding is probably why you are about to take a job at a hedge fund. If I had the money, I would try to hire a few of the absolute best people (rather than a bunch of amateurs), people like you, Robert Bradshaw, etc. -- we know who is good. (And clearly I mean serious salaries, not grad student wages!) So yes, I think the current approach to Sage has failed. I am going to try another approach, namely SageMathCloud. If it works, maybe the world will get a free open source alternative to Magma, Mathematica, etc. Otherwise, maybe the world never ever will. If you care like I do about having such a thing, and you're teaching course, or whatever, maybe try using SageMathCloud. If enough people use SageMathCloud for college teaching, then maybe a business model will emerge, and Sage will get proper funding. ## August 26, 2014 ### Matthieu Brucher #### Announcement: ATKUniversalDelay 1.1.0 I’m happy to announce the release of a mono fixed delay line on the Audio Toolkit. It is available on Windows and OS X (min. 10.8) in different formats. The three knobs adjust the direct signal (blend), the delayed signal (feedforward) as well as the feedback signal from the delay line injected in the input. The delay can be set from 0 ms to 1 s by steps of 0.1 ms. ATKUniversalDelay The supported formats are: • VST2 (32bits/64bits on Windows, 64bits on OS X) • VST3 (32bits/64bits on Windows, 64bits on OS X) • Audio Unit (64bits, OS X) Direct link for ATKUniversalDelay The files as well as the previous plugins can be downloaded on SourceForge, as well as the source code. Buy Me a Coffee! Other Amount: Your Email Address : ## August 25, 2014 ### Paul Ivanov #### pedestrian musings I walk in monologue through Berkeley's Hills Feet pressing into sidewalk firmly I eat the pensive mood solitude brings And bite into the juiciness of self-reflection I write, first time in years, free verse impromptu Taking few dozen steps between each pair of lines I yearn, on tip-toes stretching high, to be expressive A mode of being longtime self-denied I'm walking home - from job I'll soon be leaving To find myself believing once again That which I do defines me not and feeling That which I am is good. enough. a lot.  ## August 24, 2014 ### Titus Brown #### The first six years of faculty life Inspired by Sarah Bisbing's excellent post on her first year as a faculty member, here are the questions I remember asking myself during my first six years: Year 0: What science do I want to do? Year 1: What the hell am I doing all day and why am I always so exhausted? Year 2: Why don't my grad students know how to do anything yet? Year 3: What does this data mean? Year 4: Why are grant and paper reviewers so dumb? Year 5: Why are grant and paper authors so dumb? Year 6: Will I get tenure, and if so, do I really want to be doing this for the rest of my life? --titus ## August 23, 2014 ### Titus Brown #### Downsampling shotgun reads to a given average coverage (assembly-free) The below is a recipe for subsetting a high-coverage data set to a given average coverage. This differs from digital normalization because the relative abundances of reads should be maintained -- what changes is the average coverage across all the reads. Uses for this recipe include subsampling reads from a super-high coverage data set for the purpose of assembly, as well as more esoteric reasons (see the bottom of the post). This approach won't work on digitally normalized reads, and is primarily intended for genomes and low-complexity metagenomes. For high-complexity metagenomes we recommend partitioning. Note: at the moment, the khmer scripts collect-reads.py and slice-reads-by-coverage.py are in the khmer repository under branch feature/collect_reads. Once we've merged it into the master branch and cut a release, we'll remove this note and simply specify the khmer release required. This recipe uses code from khmer-recipes and dbg-graph-null. Let's assume you have a simple genome with some 5x repeats, and you've done some shotgun sequencing to a coverage of 150 or higher. If your reads are in reads.fa load-into-counting.py -x 1e8 -k 20 reads.kh reads.fa ~/dev/khmer/sandbox/calc-median-distribution.py reads.kh reads.fa reads-cov.dist ./plot-coverage-dist.py reads-cov.dist reads-cov.png --xmax=600 --ymax=500  and looks like this: You see the same peaks at roughly the same places. Now use collect-reads.py to subset the data to a lower average coverage of 50 ~/dev/khmer/sandbox/collect-reads.py -x 1e8 -C 50 -k 20 reads.subset.kh reads.fa -o reads.subset.fa  Here, collect-reads.py is walking through the data set and computing a running average of the coverage of the last 1000 reads. Once it hits the specified average coverage of 50 (-C 50) it stops collecting the reads. Take a look at the read coverage spectrum for the subsetted data: ~/dev/khmer/sandbox/calc-median-distribution.py reads.subset.kh reads.subset.fa reads-subset.dist ./plot-coverage-dist.py reads-subset.dist reads-subset.png --xmax=600 --ymax=500  and compare the resulting plot with the one above -- Here you can see that the coverage spectrum has been shifted left and down by the subsampling (which is what you'd expect). Note that picking the coverage that you want is a bit tricky, because it will be the average across the reads. If you have a highly repetitive genome you may need to go for something higher than your desired single-copy genome coverage, because the repeats will skew your average to the right. ## Esoterica If the peaks look good, you can use the output counting table reads.subset.kh as an argument to slice-reads-by-coverage (see this post). If you use the original reads, this will then give you _all_ the reads that cluster by coverage with that peak. For example, ~/dev/khmer/sandbox/slice-reads-by-coverage.py reads.subset.kh reads.fa reads-repeats.fa -m 100 -M 200  will give you all the reads from the repetitive component, which will be much higher coverage in the combined data set; take a look: load-into-counting.py -x 1e8 -k 20 reads-repeats.kh reads-repeats.fa ~/dev/khmer/sandbox/calc-median-distribution.py reads-repeats.kh reads-repeats.fa reads-repeats.dist ./plot-coverage-dist.py reads-repeats.dist reads-repeats.png --xmax=600 --ymax=500  Here the slice specified (-m and -M) is with respect to the read abundances in reads.subset.kh). This allows you to more explore and subset large data sets than you would otherwise be able to, and also avoids some khmer-specific issues with counting k-mers that are higher abundance than 255. #### Extracting shotgun reads based on coverage in the data set (assembly-free) In recent days, we've gotten several requests, including two or three on the khmer mailing list, for ways to extract shotgun reads based on their coverage with respect to the reference. This is fairly easy if you have an assembled genome, but what if you want to avoid doing an assembly? khmer can do this fairly easily using techniques taken from the digital normalization work, which allows you to estimate read coverage directly from the data. The below is a recipe for computing coverage spectra and slicing reads out of a data set based on their coverage, with no assembly required. It's the first of what I hope to be many practical and useful recipes for working with shotgun data. Let me know what you think! Uses for extracting reads by coverage include isolating repeats or pulling out mitochondrial DNA. This approach won't work on digitally normalized reads, and is primarily intended for genomes and low-complexity metagenomes. For high-complexity metagenomes we recommend partitioning. Note: at the moment, the khmer script slice-reads-by-coverage is in the khmer repository under branch feature/collect_reads. Once we've merged it into the master branch and cut a release, we'll remove this note and simply specify the khmer release required. This recipe uses code from khmer-recipes and dbg-graph-null. Let's assume you have a simple genome with some 5x repeats, and you've done some shotgun sequencing to a coverage of 150. If your reads are in reads.fa, you can generate a k-mer spectrum from your genome with k=20 load-into-counting.py -x 1e8 -k 20 reads.kh reads.fa abundance-dist.py -s reads.kh reads.fa reads.dist ./plot-abundance-dist.py reads.dist reads-dist.png --ymax=300  and it would look something like this: For this (simulated) data set, you can see three peaks: one on the far right, which contains the high-abundance k-mers from your repeats; one in the middle, which contains the k-mers from the single-copy genome; and one all the way on the left at ~1, which contains all of the erroneous k-mers. This is a useful diagnostic tool, but if you wanted to extract one peak or another, you'd have to compute a summary statistic of some sort on the reads. The khmer package includes just such a 'read coverage' estimator. On this data set, the read coverage spectrum can be generated like so:: ~/dev/khmer/sandbox/calc-median-distribution.py reads.kh reads.fa reads-cov.dist ./plot-coverage-dist.py reads-cov.dist reads-cov.png --xmax=600 --ymax=500  and looks like this: You see the same peaks at roughly the same places. While superficially similar to the k-mer spectrum, this is actually more useful in its own right -- because now you can grab the reads and do things with them. We provide a script in khmer to extract the reads; slice-reads-by-coverage will take either a min coverage, or a max coverage, or both, and extract the reads that fall in the given interval. First, let's grab the reads between 50 and 200 coverage -- these are the single-copy genome components. We'll put them in reads-genome.fa. ~/dev/khmer/sandbox/slice-reads-by-coverage.py reads.kh reads.fa reads-genome.fa -m 50 -M 200  Next, grab the reads greater in abundance than 200; these are the repeats. We'll put them in reads-repeats.fa. ~/dev/khmer/sandbox/slice-reads-by-coverage.py reads.kh reads.fa reads-repeats.fa -m 200  Now let's replot the read coverage spectra, first for the genome: load-into-counting.py -x 1e8 -k 20 reads-genome.kh reads-genome.fa ~/dev/khmer/sandbox/calc-median-distribution.py reads-genome.kh reads-genome.fa reads-genome.dist ./plot-coverage-dist.py reads-genome.dist reads-genome.png --xmax=600 --ymax=500  and then for the repeats: load-into-counting.py -x 1e8 -k 20 reads-repeats.kh reads-repeats.fa ~/dev/khmer/sandbox/calc-median-distribution.py reads-repeats.kh reads-repeats.fa reads-repeats.dist ./plot-coverage-dist.py reads-repeats.dist reads-repeats.png --xmax=600 --ymax=500  and voila! As you can see we have the reads of high coverage in reads-repeats.fa, and the reads of intermediate coverage in reads-genome.fa. If you look closely, you might note that some reads seem to fall outside the specified slice categories above -- that's presumably because their coverage was predicated on the coverage of other reads in the whole data set, and now that we've sliced out various reads their coverage has dropped. ## August 22, 2014 ### Titus Brown #### A Review: Neanderthal Man: In Search of Lost Genomes I just finished reading Svante Paabo's autobiography, Neanderthal Man: In Search of Lost Genomes. The book is perfect -- if you're a biologist of any kind, you'll understand most of it without any trouble, and even physicists can probably get a lot out of the story (heh). The book describes Svante Paabo's journey towards sequencing the Neanderthal and Denisovan genomes, and the attendant scientific and popular science implications. It's a fantastic portrayal of how science really works, from the perspective of a driven, charismatic, and successful scientist. Beyond the scientific story, which I had not known much about at all, there were two particularly interesting parts of the book. First, I was surprised at the candor and simultaneous shallowness with which Dr. Paabo discussed his various relationships and bisexuality. Throughout the book there is occasional mention of men, women, marriage, and relationships, and while I don't get much of a sense of how impactful all of this was on him personally, it is striking how little it seems to have impacted his scientific life. There was one particular bit in the middle that I found very understated in reporting, where he and a couple all move into the same apartment building, and then depart with different spouses. I guess I was surprised at the choice to report the personal relationships while avoiding any depth whatsoever. (Craig Venter still takes the cake with a single paragraph in A Life Decoded where he starts the paragraph with a divorce and ends with an engagement. It's a long paragraph, but still.) Second, it was somewhat dispiriting to watch Dr. Paabo's transition from an enthusiastic young scientist concerned primarily with getting accurate scientific knowledge out there to one who was very focused not only on scientific correctness but on publicity. There's a great section at the beginning where he talks about publishing the first mtDNA sequence from Neanderthals, and he chooses Cell because it allowed longer articles and a better representation of the science; he also spends a lot of time making sure his mtDNA sequence can be reproducibly generated by another lab. By the end, however, he's publishing in Science and Nature, and agonizing over whether or not he'll be first with the publication. (The latter approach seem much more harmful to good scientific practice to me; both the secrecy & competitiveness and the desire to push out papers in Science and Nature are problematic.) I don't know how much self-reflection went into the narrative, but it would be interesting to know why his approach towards publication changed so much during his career. Is it because science changed as a whole towards requiring the splashy high impact papers? Or is it because he grew in stature and funding requirements to the point where he needed the splashy high impact publications to justify the funding? Or (as he says towards the end) is it because of the perceived career needs of his junior colleagues? Or all three? Or more? Anyway, a great read. Highly recommended for the sci-curious. --titus ### Jake Vanderplas #### Hacking Academia: Data Science and the University A reflection on our SciFoo breakout session, where we discussed issues of data science within academia. Almost a year ago, I wrote a post I called the Big Data Brain Drain, lamenting the ways that academia is neglecting the skills of modern data-intensive research, and in doing so is driving away many of the men and women who are perhaps best equipped to enable progress in these fields. This seemed to strike a chord with a wide range of people, and has led me to some incredible opportunities for conversation and collaboration on the subject. One of those conversations took place at the recent SciFoo conference, and this article is my way of recording some reflections on that conversation. SciFoo is an annual gathering of several hundred scientists, writers, and thinkers sponsored by Digital Science, Nature, O'Reilly Media & Google. SciFoo brings together an incredibly eclectic group of people: I met philosophers, futurists, alien hunters, quantum physicists, mammoth cloners, magazine editors, science funders, astrophysicists, musicians, mycologists, mesmerists, and many many more: the list could go on and on. The conference is about as unstructured as it can be: the organizers simply provide food, drink, and a venue for conversation, and attendees put together breakout discussions on nearly any imaginable topic. If you ever get the chance to go, my advice is to drop everything else and attend. It was one of the most quirky and intellectually stimulating weekends I've ever spent. The SciFoo meeting is by invitation only, and given the incredible work of other attendees, I'm still not quite sure how I ended up on the invite list (it was perhaps the worst flare-up of impostor syndrome I've ever had!) I forced myself to get over it, though, and teamed-up with Chris Mentzel, a program director in the Moore Foundation, and led a session: we called it Hacking Academia from Inside and Out. The session was in many ways a conversation around the general topic of my Brain Drain post, though it was clear that many of the folks in attendance had been thinking in these terms long before I penned that particular essay. ## The Problem The problem we discussed is laid out in some detail in my Brain Drain post, but a quick summary is this: scientific research in many disciplines is becoming more and more dependent on the careful analysis of large datasets. This analysis requires a skill-set as broad as it is deep: scientists must be experts not only in their own domain, but in statistics, computing, algorithm building, and software design as well. Many researchers are working hard to attain these skills; the problem is that academia's reward structure is not well-poised to reward the value of this type of work. In short, time spent developing high-quality reusable software tools translates to less time writing and publishing, which under the current system translates to little hope for academic career advancement. In my Brain Drain post, I observed the rise of data-intensive research and lamented that researchers with the requisite interdisciplinary knowledge are largely unrecognized and unrewarded in academia, even as they are highly-sought-after in the tech industry. I previously labeled this type of person a "new breed of scientist", but since then it's become clear that the working label for this type of person has become (for better or worse) a data scientist. ## Defining Data Science The term "Data Science" generally seems to get a bad rap: it's variously dismissed as misleading, an empty buzzword, or begrudgingly conceded to be flawed, but useful. Perhaps "Data Scientist" can be understood as just a more subdued term for the "sexy statistician" that Hal Varian predicted would become the top career of this decade. I think the best illustration of data science's definition comes from Drew Conway's Data Science Venn Diagram, which applies the label "Data Science" to the intersection of hacking skills, statistical knowledge, and domain expertise. The key is that in addition to the normal depth of knowledge in one's own field, there as a rare breadth to the knowledge and skill-set of a data scientist. In the words of Alex Szalay, these sorts of researchers must be "Pi-shaped" as opposed to the more traditional "T-shaped" researcher. In Szalay's view, a classic PhD program generates T-shaped researchers: scientists with wide-but-shallow general knowledge, but deep skill and expertise in one particular area. The new breed of scientific researchers, the data scientists, must be Pi-shaped: that is, they maintain the same wide breadth, but push deeper both in their own subject area and in the statistical or computational methods that help drive modern research: Perhaps neither of these labels or descriptions is quite right. Another school of thought on data science is Jim Gray's idea of the "Fourth Paradigm" of scientific discovery: First came the observational insights of empirical science; second were the mathematically-driven insights of theoretical science; third were the simulation-driven insights of computational science. The fourth paradigm involves primarily data-driven insights of modern scientific research. Perhaps just as the scientific method morphed and grew through each of the previous paradigmatic transitions, so should the scientific method across all disciplines be modified again for this new data-driven realm of knowledge. Regardless of what metaphor, definition, or label you apply to this class of researcher, it is clear that their skill set is highly valuable in both academia and industry: the brain drain that many have observed comes from the unfortunate fact that academia fails to properly reward the valuable skill set of the data scientist. ## Our Discussion: Academia and Data Science With this label in mind, our SciFoo discussion focused largely around the following questions: 1. Where does Data Science fit within the current structure of the university? 2. What is it that academic data scientists want from their career? How can academia offer that? 3. What drivers might shift academia toward recognizing & rewarding data scientists in domain fields? 4. Recognizing that graduates will go on to work in both academia and industry, how do we best prepare them for success in both worlds? I'll go through some of the thoughts we discussed below: ### 1. Where does Data Science fit within the current structure of the university? The question of data science's place in academia drew a variety of responses and ideas: The "Fourth Paradigm": data science is simply a label for a new skill-set, and shouldn't be separated from the departments in which it is useful. The thinking here is that data science is simply an umbrella term for an essential skill in modern scientific research. For example, laboratory biologists are dependent on pipetting skills: this doesn't mean that the university should create a new "Department of Applied Pipetting". On the contrary, it simply means that pipetting technique should be part of a laboratory biologist's normal training. Similarly, departments across the university should simply incorporate relevant data science techniques into their normal curriculum. Data science as a consulting service. Perhaps data science is more like Information Technologies (IT). All modern science labs depend on some sort of computer infrastructure, and most universities long ago realized that it's counter-productive to expect their specialized researchers to effectively manage that infrastructure. Instead, IT organizations were created which provide these services to multiple departments. Data science might be treated the same way: we can't expect every scientist to be fluent in the statistical and computational methods required to work with large datasets, so we might instead out-source these tasks to data science experts. Data science as an applied computer science department. There has been a trend in the 20th-century of academic subjects splitting into "pure" and "applied" sub-domains. Many Universities have departments of "applied math" and "applied physics", which (loosely speaking) distinguish themselves from the non-applied version by employing the techniques of the field within practical rather than theoretical contexts. Perhaps data science is best viewed as an applied branch of computer science or of statistics which should become its own academic department. Data science as a new role for libraries. It is no secret that digitization is changing the role of libraries on university campuses. The general public thinks of libraries little more than warehouses for books, but those in the field see printed books as just one particular manifestation of their focus, which has always been data curation. Many library scientists I've talked with recently are excited about the role that new methods and technologies can play in this task of curating and extracting information from their stores of data. From this perspective, Library & Information Science departments may be a natural home for interdisciplinary data science. Data science as a new interdisciplinary institute. A middle ground to the above approaches may be to organize data science within an interdisciplinary institute; this is a common approach for topics that are inherently multi-disciplinary. Such institutes are already common in academia: for example, the University of Washington is home to the Joint Institute for the Study of Atmosphere and Ocean, which brings together dozens of department, schools, and labs to collaborate on topics related to the climate and the environment. Perhaps such an umbrella institute is the place for data science in the University. ### 2. What is it that academic data scientists want from their job? How can academia offer that? Moving from university-level issues to personal-level issues, we brainstormed a list of goals that drive data scientists within academia and industry. While scientists are by no means a homogeneous group, most are driven by some combination of the following concerns: • Salary & other financial compensation • Stability: the desire to live in one place rather than move every few years • Opportunity for Advancement • Respect of Peers • Opportunity to work on open source software projects • Opportunity to travel & attend conferences • Flexibility to work on interesting projects • Opportunity to publish / freedom from the burden of publishing • Opportunity to teach / freedom from the burden of teaching • Opportunity to mentor students / freedom from the burden of mentoring students I've tried to generally order these from "perks of industry" to "perks of academia" from the perspective of a recently-minted PhD, but this rough one-dimensional categorization misses the variety of opportunities in both worlds. For example, some academic research scientists do not spend time teaching or mentoring students, and some tech industry jobs contain the type of flexibility usually associated with academic research. Those younger participants in our conversation who have most recently been "in the game", so to speak, especially noted problems in academia with the first three points. Compared to an industry data scientist position, an academic post-doc has some distinct disadvantages: • Money: the NIH postdoc salary hovers somewhere around$\$$40,000 - \$$50,000 per year. Industry often starts recent PhDs at several times that salary.
• Stability: before finding a permanent position, it is common now for young scientists to do several short-term post-doctoral appointments, often in different corners of the country. This means that many academics cannot expect to root themselves in a community until their mid-30s or later. For an industry data scientist, on the other hand, there may be many attractive and permanent job options near home.
• Advancement: those who focus on software and data in academia, rather than a breakneck pace of publication, have little hope for a tenure-track position under the current system. A researcher who spends significant time building software tools can expect little reward within academia, no matter how influential those tools might be.

Anecdotally, these seem to be the top three issues for talented data scientists in academia. With the tech-industry market for data scientists as hot as it is, skilled PhD candidates have many compelling incentives to leave their field of research.

### 3. What drivers might shift academia toward recognizing & rewarding data scientists in domain fields?

We discussed several drivers that might make academia a more friendly place for data scientists. Generally, these centered around notions of changing the current broken reward structure to recognize success metrics other than classic publication or citation counts. These drivers may be divided into two different categories: those outside academia who might push for change (e.g. funding agencies, publishers, etc.) and those inside academia who might implement the change themselves (e.g. university leadership and department leadership). We discussed the following ideas:

#### From the outside:

• Publishers might provide a means of publishing code as a primary research project. Introducing a DOI for software with easy means of updating contributor lists over time, for example, might lead to increased citation and recognition of alternative research activities.
• Publishers might push for reproducibility as a requirement for publication. Reproducibility of data-intensive research requires the well-engineered code that data scientists can produce, and would increase the value placed on these skills.
• Funding agencies might provide funding specifically for interdisciplinary data scientists within specific domain fields.
• Funding agencies might encourage interdisciplinary collaboration through specific grant requirements.

#### From the inside:

• University leadership might set aside funding for new data science departments or for interdisciplinary institutes focused on data science.
• University leadership might create joint positions: e.g. paying half of a professor or researcher's salary so that he or she can focus that time on tasks such as creating, maintaining, or teaching about important software tools.
• Department leadership might take the lead to adapt their curriculum to include data science topics, and specifically hire faculty who emphasize these areas in their work.
• Department leadership might adjust their hiring practice to recognize alternative metrics that go beyond the H-index: for example, recognizing the importance of an open source software tool that is well-used, but may not generate a classic citation record.

### 4. Recognizing that graduates will go on to work in both academia and industry, how do we best prepare them for success in both worlds?

This question is the flip-side of the Brain Drain theme: the number of PhDs granted each year far exceeds the number of academic positions available, so it is simply impossible for every graduate to remain in academia.

This is, for some reason, a somewhat taboo subject in academia: I've talked to many who at the end of their PhD program were leaning toward leaving academia, and dreaded having "the talk" with their thesis advisor. But academic departments should take seriously the job prospects for their graduates, and that involves making sure they have marketable skills upon graduation.

Fortunately, these marketable skills overlap highly with the skills that can lead to success in modern data-intensive scientific research: the ability to design and write good software, to create tools that others can reuse, to collaboratively work on large software projects, to effectively extract insight from large datasets, etc. Unfortunately, development of this skill set takes time and energy at the level of both graduate coursework and research mentorship. Departments should push to make sure that their students are learning these skills, and are rewarded for exercising them: this will have the happy side-effect of increasing demand within academia for the scientists who nurture these skills.

At the beginning of the SciFoo conference, each of the ~250 people present were invited to introduce themselves with a one line description of who they are and what they spend their time thinking about. The title of this post, Hacking Academia, is a phrase that I borrowed from Chris Mentzel's one-line description of his own focus.

This is hacking not in the Hollywood sense of using computers for malicious purposes, but hacking in the sense of working within a rigid system to accomplish something it is not necessarily designed to do. This type of "institutional hacking" is the best description of what we're after: finding shortcuts that will mold the world of academic scientific research into a more effective version of itself.

Complaining about the state of academia is a favorite past-time of academics, but it is far rarer to see actual solutions to these problems. One of the best pieces of the SciFoo discussion was just this: hearing about the steps that various individuals, foundations, and institutions are taking to address the concerns outlined above. I'll mention a few examples here:

### NSF: the IGERT program

The NSF's long-standing Interactive Graduate Education and Research Traineeship (IGERT) program provides funding for interdisciplinary training for PhDs and postdocs, and recent grants in particular have had a data science focus. UW was recently awarded an IGERT grant for an interdisciplinary data-focused PhD program, and the first group of these students will be starting this fall. An important piece of this was that home departments agreed to allow these students to forego some of their normal degree requirements to make room for joint courses on data science skills and techniques. IGERT remains an active NSF program, and has similar interdisciplinary grants to schools and departments around the United States.

### UW: Provost's Data Science Initiative

At UW, the university-wide leadership is also thinking along these lines with some concrete actions of their own: the provost has set aside funding for half-positions to encourage hiring of interdisciplinary faculty. The next Astronomy professor hired by UW, for example, might spend half of his or her time working and teaching through the eScience institute on general computational and data science methods.

### Publishing Code: the Journal of Statistical Software

Though this isn't a recent initiative, the Journal of Statistical Software (JSS) is an example of a non-profit publisher which is having a positive impact in the area of statistical software, by giving scientists a forum to publish the software they write and cite the software they use. Perhaps in part because of the extreme usefulness of well-written, reusable software, JSS is very highly-ranked (see, for example, the SCImago rankings). More journals like this, which place explicit value on reproducible computation and well-written software tools, could be a huge benefit to the academic data scientist. (full disclosure: I'm on the editorial board of JSS).

### Harvard University: Initiative in Innovative Computing & Institute for Advanced Computational Science

Alyssa Goodman of Harvard was part of our discussion, and mentioned that nearly a decade ago Harvard foresaw and began addressing the value of interdisciplinary data-intensive science and research. They created a short-lived Initiative in Innovative Computing (IIC), which existed from 2005-2009, until the global financial crisis led its funding to be cut. At its peak, the IIC was supported to the tune of around $4 million per year and was home to roughly 40 staff, most working jointly between the IIC and other departments. After the IIC funding dissipated, it seems that most of this momentum (and many of the IIC staff) moved to the Harvard's Institute for Advanced Computational Science (IACS), started by Tim Kaxiras and the Harvard School of Engineering in 2010. Though IACS has traditionally focused more on simulation and computation, it has recently begun to branch out to visualization and data science as well. Alyssa seems optimistic that momentum in this area is again building, and mentioned also Harvard's Institute for Quantitative Social Science, the Seamless Astronomy Program, and the Library as key players. ### Michigan State: new Applied Computation Department Another active participant in the discussion was Steve Hsu, Vice President for Research at Michigan State University. He shared some exciting news about a new development at Michigan State: the creation of a new department, tentatively called Applied Computation and Mathematics. According to Steve, MSU will soon be hiring around twenty interdisciplinary tenure-track faculty, who will have one foot in their home department and one foot in this new department. MSU already has many impressive researchers working in data-intensive areas across the university: with this effort I'm excited to see the dynamic interdisciplinary community they will build. ## Conclusion The SciFoo discussion was excellent, as was the weekend as a whole: I would like to take this chance to thank O'Reilly Media, Digital Science, Nature Publishing Group, and Google for making it all possible. Nearly a year after my Brain Drain brain dump, I am thrilled to see that there are so many good people thinking about and working on these problems, and so many real solutions both in process and on the horizon. While the lure of a well-funded tech industry will doubtless still attract a steady stream of scientists away from their academic research, I'm heartened to see such focused effort toward carving-out niches within academia for those who have so much to contribute. ## August 21, 2014 ### Manoj Kumar #### manojbits Hi, I was postponing the last post for the last of my Pull Requests to get merged. Now since it got merged, I do not have any reason to procrastinate. This is the work that I have done across summer, with a short description of each, (Just in case you were wondering why the “another” in the title, http://manojbits.wordpress.com/2013/09/27/the-end-of-a-journey/ ) 1. Improved memory mangement in the coordinate descent code. Status: merged Pull Request: https://github.com/scikit-learn/scikit-learn/pull/3102 Changing the backend from multiprocessing to threading by removing the GIL, and replacing the function calls with pure cblas. A huge improvement 3x – 4x in terms of memory was seen without compromising much on speed. 2. Randomised coordinate descent Status: merged Pull Request:https://github.com/scikit-learn/scikit-learn/pull/3335 Updating a feature randomnly with replacement instead of doing an update across all features can make descent converge quickly. 3. Logistic Regression CV Status: merged Pull Request: https://github.com/scikit-learn/scikit-learn/pull/2862 Fitting a cross validation path across a grid of Cs, with new solvers based on newton_cg and lbfgs. For high dimensional data, the warm start makes these solvers converge faster. 4. Multinomial Logistic Regression Status: merged Pull Request: https://github.com/scikit-learn/scikit-learn/pull/3490 Minimising the cross-entropy loss instead of doing a OvA across all classes. This results in better probability estimates of the predicted classes. 5. Strong Rules for coordinate descent Status: Work in Progress Pull Request: https://github.com/scikit-learn/scikit-learn/pull/3579 Rules which help skip over non-active features. I am working on this and it should be open for review in a few days. Apart from these I have worked on a good number of minor bug fixes and enhancements, including exposing the n_iter parameter across all estimates, fixing incomplete download of newsgroup datasets, and soft coding the max_iter param in liblinear. I would like to thank my mentor Alex who is the best mentor one can possibly have, (I’m not just saying this because of hope that he will pass me :P), Jaidev, Olivier, Vlad, Arnaud, Andreas, Joel, Lars, and the entire scikit-learn community for helping me to complete an important project to an extent of satisfaction. (It is amazing how people manage to contribute so much, inspite of having other full time jobs). I will be contributing to scikit-learn full-time till December at least as part of my internship. EDIT: And of course Gael (how did I forget), the awesome project manager who is always full of enthusiasm and encouragement. As they say one journey ends for the other to begin. The show must go on. ## August 20, 2014 ### Titus Brown #### The August Carnival of Evolution Every month, Bjørn Østman finds another sucker^W^W^W organizes a Carnival of Evolution blog post, that does a roundup of blogs on evolution from a previous month. This month, I'm hosting it -- it's a bit late, due to some teaching duties, so apologies! Trigger warning: This blog post contains discussions of evolution, which may cause anxiety in those who don't want to be exposed to ideas with which they are pretty sure they disagree. --- My favorite blog post from July was the Marc Srour's post on Cone snail venoms and their awesomeness. Marc reviews a paper, Dutertre et al. (2014), that discusses how one cone snail venom duct manufactures different kinds of defensive and offensive venoms, presumably in response to the different needs of defense and predation. Jane Hu's post on how the largest known flying dinosaur avoided crashing reviews Han et al., 2014, which describes how the long feathers on a raptor helped stabilize it during flight. I found this post on why there are no ring species by Jerry Coyne to be interesting from two perspectives. First, I'd never read about the ring species concept before; and second, I thought it was interesting to devote so much discussion to why no actual examples of this concept seemed to exist :). Bjørn's post on death of the fittest is a nice example of how simple bottom-up simulations can help you understand evolutionary concepts. I'm mildly disappointed that Bjørn didn't make the MATLAB code available as a link in the blog, though - what's with that? Craig Benkman's blog on the outsized impact of a small mammal explores his PNAS paper, Talluto and Benkman (2014) on selection pressures from seed predation and fire. The short version is that pine squirrels prefer to harvest hard woody pine cones that also help repopulate the forest quickly after fires. In what I think is the best line in the blog post, "when squirrel densities exceed one and a half squirrels per hectare", the anti-hard-woody-pine-cone selection pressure from squirrels eating them dominates over the pro-repopulate-after-fire selection pressure. Turning to humans, Bradly Alicea has a nice discussion of how dual process models that take into account both genetic fitness and cultural adaptation could be a better way to understand human biological variation. Veering to something much smaller, Viking wannabe Jeff Morris wrote a nice blog post on microbial ecosystems and the Black Queen Hypothesis, talking about how the Black Queen Hypothesis can foster certain kinds of apparent "cooperation". Next, returning to Jerry Coyne and Why Evolution is True, check out this great blog post on Poelstra et al. 2014, looking at the genomic and transcriptomic underpinnings of two closely related crows. Despite very little genetic variation, these crows maintain distinct appearance and territories. tl; dr? The closer we look at the concept of "species", the harder it is to draw clear lines. Also, the primary point of difference between the two ...subspecies? seems to be located at one very small portion of their chromosome 18. How odd! Wrapping up my Carnival, Samuel Perez blogs about stamen size in some populations of Arabidopsis thaliana. There is a mystery here: what is causing populations at low altitudes to lose their short stamens? And that's it for this month! From a personal perspective, it was nice to compare and contrast so many different topics and styles of blog posts -- it's clear there are many ways to blog, and this is a fine selection of blogs! Thanks for roping me in, Bjørn! --titus The next host: the September edition will be at Sam Hardman's blog Ecologica (http://ecologicablog.wordpress.com). #### In which I declare my intentions to move to UC Davis This past weekend, I accepted an offer to join UC Davis as an Associate Professor of Genetics in the Department of Population Health and Reproduction, in the School of Veterinary Medicine. The appointment is still pending tenure review, but I expect to join Davis whether or not they give me tenure (sshh! don't tell them!) I am very sad to be leaving many good friends and colleagues in Michigan, but the personal and professional opportunities available at UC Davis are simply outstanding; we decided as a family that Davis was our future. I'm sure you must all have lots of questions. So here's a handy list of answers! 1. Wait, tenure review? Didn't you have tenure at MSU?! Nope. 2. VetMed... wait, what? Yep. 3. Really? You're going to leave it there? Turns out veterinary animals have genomes, too! And PHR and VetMed are extremely broad in their research programs -- they have great people in ecology, evolution, microbiology, and many other fields. UC Davis overall is truly excellent, but I am really enthusiastic about joining VetMed specifically. There are several other recent faculty hires that I'm thrilled about, and the existing faculty are just outstanding; I expect Davis VetMed to offer a wonderful and fertile ground for the growth of my research program. 4. OK, seriously, why did they even interview you, much less hire you? Well, I agree that my fit for the position description as posted is not quite perfect. So I asked the same question! Among other things, several members of the hiring committee said that they really liked my education efforts. Without a strong research program, they would probably not have looked seriously at my application; but, once they did, they said that they really liked the mix of research and education and outreach. Just as a reminder, I've made all of my application materials available here. 5. What talk did you give? Your soil metagenomics talk?! How'd that go? My talk is posted on Slideshare so you can see for yourself -- it was almost entirely about the work that my student Dr. Likit Preeyanon did on Marek's Disease resistance in chicken. Yeah, I work on that stuff, too :). 6. Did you apply for any other jobs? Yep. I applied for about six academic positions, including positions in Big Data, mol bio/ecology/evolution/bioinformatics, and microbiology. Got one interview, and one job offer. shrug 7. Are you going to continue doing ... whatever it is you do? Yep. And more! 8. Did Jonathan Eisen have anything to do with this? Jonathan was one of my references, but AFAIK he was uninvolved in the decision past that. Needless to say, however, the fact that UC Davis is supportive of his social media and open access efforts was a strong positive at Davis (although MSU is no slouch there either). We do hope his open access policies extend to his backyard pool. 9. When are you going!? TBD. Sometime in early 2015, I think. 10. Will you be hiring? Yeppers. More on that down the road. 11. Do you even like granola? I do! Note that I went to Reed and Caltech for undergrad and grad school respectively, so I think Davis is a good average between the two -- and not just geographically :). 12. How did you get hired without a passel of Cell/Nature/Science papers? shrug I have a few high profile papers, most notably a bunch o' PNAS papers. But my publication record was never at any point brought up by anyone, so I don't know what they thought of it. 13. It was your klout score, wasn't it! That's why they hired you! Honestly, as far as I can tell they were largely unaware of my social media and open science interests. The search committee seemed interested in it over dinner, though. 14. What else makes you excited about Davis? The new Big Data initiative at Davis. The opportunities for interactions with faculty from the CS, Ag, Microbiology, and Developmental Biology parts of campus. The proximity to the Bay Area, the JGI, UC Berkeley and BIDS, Stanford, Silicon Valley, and of course the Perlstein Lab. The granola. The proximity to Big Sur, a.k.a. "the most beautiful area in the world." The weather. 15. Do you have any plans to scale up (or back) your education and training efforts? Funny you should ask! Yes, I do plan to scale up my training efforts significantly! More on that once I figure out where my office will be, how to get paid, and where all the campus coffee shops are. 16. Do you plan to change the name of your lab? Well, right now it's called the Lab of Genomics, Evolution, and Development (GED), which has become increasingly inaccurate :). I'm think of naming it the Lab for Data Intensive Biology (DIB). Other suggestions welcome; given my inability to focus, the Lab of Life, the Universe, and Everything might work just as well... 17. How do you feel about losing your short e-mail address? It doesn't get much shorter than 'ctb@msu.edu'! Hopefully I can get themagnificenttitus@ucdavis.edu. That will more than make up for it. 18. The City of Davis just received a new armored vehicle. Coincidence? No comment. --titus ### NeuralEnsemble #### GSoC Open Source Brain: Cortical Connections Cortical Connections # Cortical Connections In the same vein that the post before this one we will show here how to construct the connections between the cortical layers. In order to do so we will construct a function that works in general for any arbitrary connectivity, we describe in the following its structure. First, as in the thalamo-cortical connectivity, we have again the same structure of a function that loops over the target population extracting the relevant parameters that characterize these neurons. Furthermore we have another function that loops over the source population creating the corresponding tuples for the connection list. Is in this last function where the particular connectivity rule is implemented. In the particular case of the Troyer model the connectivity between the cortical cells is determined by the correlation between the receptive fields of the neurons, the receptive fields here being Gabor functions. In more detail the neurons whose receptive fields are more correlated will be the ones more likely to have excitatory connections between them. On the other hand the ones whose receptive fields are less correlated will be more likely to receive inhibitory connections. In this post we show two schemes that accomplish this connectivity. The first one uses the fact the parameters of the receptive field to calculate a connectivity and the second one uses the receptive fields directly to calculate the correlations. We present the determining functions in the stated order down here. Now we present the function that creates the connectivity for a given neuron par. The circular distance between the orientation and phases are calculated as a proxy to estimate how similar the receptive fields of the neurons are. After that, the distance between them is weighted and normalized with a normal function in order to obtain a value that we can interpret as a probability value. Finally in order to calculate the connectivity we sample n_pick times with the given probability value to see hwo strong a particular connection should be.  def cortical_to_cortical_connection(target_neuron_index, connections, source_population, n_pick, g, delay, source_orientations, source_phases, orientation_sigma, phase_sigma, target_neuron_orientation, target_neuron_phase, target_type): """ Creates the connections from the source population to the target neuron """ for source_neuron in source_population: # Extract index, orientation and phase of the target source_neuron_index = source_population.id_to_index(source_neuron) source_neuron_orientation = source_orientations[source_neuron_index] source_neuron_phase = source_phases[source_neuron_index] # Now calculate phase and orientation distances or_distance = circular_dist(target_neuron_orientation, source_neuron_orientation, 180) if target_type: phase_distance = circular_dist(target_neuron_phase, source_neuron_phase, 360) else: phase_distance = 180 - circular_dist(target_neuron_phase, source_neuron_phase, 360) # Now calculate the gaussian function or_gauss = normal_function(or_distance, mean=0, sigma=orientation_sigma) phase_gauss = normal_function(phase_distance, mean=0, sigma=phase_sigma) # Now normalize by guassian in zero or_gauss = or_gauss / normal_function(0, mean=0, sigma=orientation_sigma) phase_gauss = phase_gauss / normal_function(0, mean=0, sigma=phase_sigma) # Probability is the product probability = or_gauss * phase_gauss probability = np.sum(np.random.rand(n_pick) probability) # Samples synaptic_weight = (g / n_pick) * probability if synaptic_weight > 0: connections.append((source_neuron_index, target_neuron_index, synaptic_weight, delay)) return connections  Note that the overall strength is weighted by the conductivity value g that is passed as an argument. Furthermore a delay that is also passed as an argument is added to the list as the last element of the tuple. Secondly we present the full correlation scheme. In this scheme we utilize the kernels directly to calculate the spatial correlation between them. In particular after we have flattened our kernels Z to have a series instead of a matrix we use the function perasonr from scipy.stats to calculate the correlation. Again as in the case above we use this probability to sample n_pick times and then calculate the relative connectivity strength with this.  def cortical_to_cortical_connection_corr(target_neuron_index, connections, source_population, n_pick, g, delay, source_orientations, source_phases, target_neuron_orientation, target_neuron_phase, Z1, lx, dx, ly, dy, sigma, gamma, w, target_type): """ Creates the connections from the source population to the target neuron """ for source_neuron in source_population: # Extract index, orientation and phase of the target x_source, y_source = source_neuron.position[0:2] source_neuron_index = source_population.id_to_index(source_neuron) source_neuron_orientation = source_orientations[source_neuron_index] source_neuron_phase = source_phases[source_neuron_index] Z2 = gabor_kernel(lx, dx, ly, dy, sigma, gamma, source_neuron_phase, w, source_neuron_orientation, x_source, y_source) if target_type: probability = pearsonr(Z1.flat, Z2.flat)[0] else: probability = (-1) * pearsonr(Z1.flat, Z2.flat)[0] probability = np.sum(np.random.rand(n_pick) probability) # Samples synaptic_weight = (g / n_pick) * probability if synaptic_weight > 0: connections.append((source_neuron_index, target_neuron_index, synaptic_weight, delay)) return connections  Note that the overall strength is weighted by the conductivity value g that is passed as an argument. Furthermore a delay that is also passed as an argument is added to the list as the last element of the tuple. We now show how a plot that illustrates how the probabilities change when the parameters that determined the gabor function are changed for each scheme. In the figure above we have int he upper part how the probability for the first scheme a neuron with phase 0 and orientation 0 change as we vary the phase (left) and orientation (right). In the two graphs bellow we have the same for the second scheme we presented #### GSoC Open Source Brain: Thalamo-Cortical Connections Thalamo-cortical connections # Thalamo-cortical connections In this post I will show how to build arbitrary custom connections in PyNN. We will illustrate the general technique in the particular case of the Troyer model. In the Troyer model the connections from the LGN to the cortex are determined with a gabor-profile therefore I am going to describe the required functions to achieve such an aim. In the PyNN documentation we find that one of the ways of implementing arbitrary connectivity patterns is to use the FromListConnector utility. In this format we have to construct a list of tuples with a tuple for each connection. In each tuple we need to include the index of the source neuron (the neuron from which the synapse originates), the index of the target neuron (the neuron into which the synapse terminates), the weight and the delay. For example (0, 1, 5, 0.1) would indicate that we have a connection from the neuron 0 to the neuron 1 with a synaptic weight of 5 and a delay of 0.1. In the light of the explanation above we need to construct a function that is able to construct a list with the appropriate weights given a target and a source populations. In order to start moving towards this goal we will first write a function that connects a given neuron in the target population to all the neurons in the source population. We first present the function here bellow and we will explain it later:  def lgn_to_cortical_connection(cortical_neuron_index, connections, lgn_neurons, n_pick, g, delay, polarity, sigma, gamma, phi, w, theta, x_cortical, y_cortical): """ Creates connections from the LGN to the cortex with a Gabor profile. This function adds all the connections from the LGN to the cortical cell with index = cortical_neuron_index. It requires as parameters the cortical_neruon_index, the current list of connections, the lgn population and also the parameters of the Gabor function. Parameters ---- cortical_neuron_index : the neuron in the cortex -target- that we are going to connect to connections: the list with the connections to which we will append the new connnections lgn_neurons: the source population n_pick: How many times we will sample per neuron g: how strong is the connection per neuron delay: the time it takes for the action potential to arrive to the target neuron from the source neuron polarity: Whether we are connection from on cells or off cells sigma: Controls the decay of the exponential term gamma: x:y proportionality factor, elongates the pattern phi: Phase of the overall pattern w: Frequency of the pattern theta: Rotates the whole pattern by the angle theta x_cortical, y_cortical : The spatial coordinate of the cortical neuron """ for lgn_neuron in lgn_neurons: # Extract position x, y = lgn_neuron.position[0:2] # Calculate the gabbor probability probability = polarity * gabor_probability(x, y, sigma, gamma, phi, w, theta, x_cortical, y_cortical) probability = np.sum(np.random.rand(n_pick) probability) # Samples synaptic_weight = (g / n_pick) * probability lgn_neuron_index = lgn_neurons.id_to_index(lgn_neuron) # The format of the connector list should be pre_neuron, post_neuron, w, tau_delay if synaptic_weight > 0: connections.append((lgn_neuron_index, cortical_neuron_index, synaptic_weight, delay))  The first thing to note from the function above are its arguments. It contains the source population and the particular target neuron that we want to connect to. It also contains all the connectivity and gabor-function related parameters. In the body of the function we have one loop over the whole source population that decides whether we add a connection from a particular cell or not. In order to decide if we add a connection we have to determine the probability from the gabor function. Once we have this we sample n_pick times and add a weighted synaptic weight accordingly to the list for each neuron. In the function above we have the values of the gabor function passed as arguments. However, the values of each gabor function depend on the nature of the cell of the target population. In the light of this we will construct another function that loops over the target population and extracts the appropriate gabor values for each function in this population. We again present the function and then explain it: def create_lgn_to_cortical(lgn_population, cortical_population, polarity, n_pick, g, delay, sigma, gamma, phases, w, orientations): """ Creates the connection from the lgn population to the cortical population with a gabor profile. It also extracts the corresponding gabor parameters that are needed in order to determine the connectivity. """ print 'Creating connection from ' + lgn_population.label + ' to ' + cortical_population.label # Initialize connections connections = [] for cortical_neuron in cortical_population: # Set the parameters x_cortical, y_cortical = cortical_neuron.position[0:2] cortical_neuron_index = cortical_population.id_to_index(cortical_neuron) theta = orientations[cortical_neuron_index] phi = phases[cortical_neuron_index] # Create the connections from lgn to cortical_neuron #lgn_to_cortical_connection(cortical_neuron_index, connections, lgn_population, n_pick, g, polarity, sigma, #gamma, phi, w, theta, x_cortical, y_cortical) lgn_to_cortical_connection(cortical_neuron_index, connections, lgn_population, n_pick, g, delay, polarity, sigma, gamma, phi, w, theta, 0, 0) return connections  This function requires as arguments the source and target populations as well as the necessary parameters that characterize each cell connectivity: orientation and phase. In the body of the function we have a loop over the cortical population that extracts the relevant parameters -position, orientation and phase- and then calls the function that we already describe previously in order to create the connectivity from the source population to the cell in place. So now we have the necessary functions to construct a list. Now, we can use FromListConnector to transform the list into a connector. And the use this to define a Projection. We define both the excitatory and inhibitory connections. We abstract this complete set into the following function: def create_thalamocortical_connection(source, target, polarity, n_pick, g, delay, sigma, gamma, w, phases, orientations, simulator): """ Creates a connection from a layer in the thalamus to a layer in the cortex through the mechanism of Gabor sampling """ # Produce a list with the connections connections_list = create_lgn_to_cortical(source, target, polarity, n_pick, g, delay, sigma, gamma, phases, w, orientations) # Transform it into a connector connector = simulator.FromListConnector(connections_list, column_names=["weight", "delay"]) # Create the excitatory and inhibitory projections simulator.Projection(source, target, connector, receptor_type='excitatory') simulator.Projection(source, target, connector, receptor_type='inhibitory')  With this we can create in general connections from one target population to the other. We can even change change the gabor function for whatever we want if we want to experiment with other connectivity patterns. Finally we present down here an example of a sampling from a Gabor function with the aglorithm we just constructed: So in the image we show in the left the sampling from the ideal Gabor function in the right. ## August 19, 2014 ### Titus Brown #### The fifth workshop on Analyzing Next Generation Sequencing Data The fifth annual Analyzing Next Generation Sequencing Data workshop just finished - #ngs2014. As usual the schedule and all of the materials are openly available. tl; dr? Good stuff. We've been running this thing since 2010, and we now have almost 120 alumni (5 classes of roughly 24 students each). The students come from all over the world (although I think we're missing attendees from Africa and Antarctica), and from many different types of institutions, including top tier research universities, biotech, and non-profit research centers. The "students" vary in rank from graduate students and staff to full professors; the age range goes from low 20s to a fair bit older than me. The class invariably starts with an introduction to cloud computing, followed by a number of step-by-step tutorials through command line BLAST, quality analysis of Illumina data, mapping, and assembly. We cover basic analysis of your data, statistical considerations in sample design and data analysis, automation, version control, R, Python, IPython Notebook, mRNAseq analysis, and a variety of other topics. The students take full advantage of the opportunity to ask an immense variety of questions, ranging from algorithmic and technical to fundamental scientific questions. It's truly awesome, and I am unreasonably proud of this workshop! This year the previous workshop faculty and instructors (Ian Dworkin, Istvan Albert, Chris Chandler, Adina Howe) were joined by a newer cohort, including Meg Staton, Matt MacManes, Daniel Standage, Aaron Darling, and Martin Schilling. The younger group of scientists (Chris, Adina, Meg, Matt, Daniel, Aaron, and Martin) make me feel old: they're young, enthusiastic, all about open science, and totally into reproducibility and Software Carpentry and good bioinformatics. Many of them are now junior faculty, which really gives me hope for the future! Of particular note, Adina Howe took the course in 2010 (the first time we offered it) and is now taking a position as a big data biology junior faculty member -- how cool is that!? Our TAs this year did a fantastic job. Amanda Charbonneau was the overall "cruise director", and Aswathy Sebastian, Will Pitchers, Elijah Lowe, and Qingpeng Zhang served as TAs. ## How did the course go this year? Each year we do an end-of-course discussion where I try to stay quiet as the students dissect all of the bad decisions I made in organizing the course. This year, I took a page from Greg Wilson's handbook and made the students offer up one good thing and one bad thing -- every student had to provide at least one of each, and they had to be non-overlapping. We didn't entirely succeed at getting completely non-overlapping feedback, but the lists are still interesting and informative: I think my favorite is "Good: covered a lot of material; bad: covered A LOT of material!" although "I am worried that we now have a false sense of hope" comes in as a close second. Meg Staton should feel proud that one of the comments boiled down to "more Meg", although it came out as "more of Meg's flowcharts." And of course there's the always popular opinion that "if only you'd given us more to read up front, we'd have come better prepared", which in my experience is an incredibly over-optimistic lie, if well intentioned :). ## Assessing the workshop This is now the third year we've run assessments on the workshop. Our expert assessment company, StemEd LLC, hasn't yet finished the assessment report for 2014, but you can read the 2012 and 2013 evaluations here and here. These aren't complete assessments -- we are still working on processing the long-form answers -- and they are somewhat superficial, but overall paint a very positive picture of the workshop. This is in line with what we hear from the students both informally throughout, as well as more formally at the end-of-workshop discussions. One thing that surprised me this year (and I'm mentioning it because I wouldn't have noticed if someone hadn't said it for their "one good thing") was that people commented very positively on the diversity of students, who came from all over the world (we covered all but two South American countries!), had many different research backgrounds and interests, and were at many different career stages. While we are in fact tempted to roll dice to choose the students (this year, we accepted 24 of 170 applicants), we actually do spend some time trying to balance the class -- and it seemed to work well this year! ## Next years' plans We are planning to run the workshop against next year, at about the same time and in the same place, but there are some potential complications. (More about those soon.) The biggest change that I think I'm going to put into place next year is that the first ~5 days will be paced much more carefully. Some people come in with a lot of self-confidence and some solid expertise, while others come in with little of either; the latter group is really sensitive to being overwhelmed, while the former group is usually eager to drink from the firehose (sometimes until they see just how high we can turn up the pressure, hah). I plan to address this by making the first 5 days all about gentle-but-thorough introductions to UNIX, mapping, assembly, and scripting. In my experience, even the people who come in with some knowledge get a lot out of the more thorough introductions. Then in the second week we'll go crazy with a range of subjects. As part of this change, I may restrict the first week lecturers to trained Software Carpentry instructors. This would include Adina Howe, Meg Staton, and Martin Schilling, as well as myself (although I'd like to offer to pay for other instructors to do the in-person training if they were interested). From what I observed, people who haven't gone through Greg Wilson's training bootcamp are really lacking in an ability to "read" the classroom - for one, they don't pay enough attention to stickies, and for another, they are often too enthusiastic to get to the cool stuff. These instructors are incredibly valuable after people have learned to tread water, but can too easily drown students in the first week. I got a lot of feedback this year that they needed to be introduced more carefully. (There are some people that are just ill-suited to instructing non-experts, but I tend not to invite them -- I'm thinking of the first lecture at another workshop, which (literally) started with "OK, now after compiling and installing my software package, fire up vi and edit the config file to reflect your local system settings. Then run the program on the first demo file XXXX.") Something else I need to make sure of is that I (or someone) remains heavily involved in the course throughout. This year I was distracted by several different things, including three (!) thesis defenses on main campus that took place during the course, and I didn't bring the intensity. Next year, more intensity (or perhaps a new workshop director ;). So that's my report for this year. --titus ### Matthieu Brucher #### Announcement: ATKLimiter 1.0.0, ATKCompressor 1.1.0, ATKExpander 1.1.0 I’m happy to announce the release of one new limiter plugin based on the Audio Toolkit. It is available on Windows and OS X (min. 10.8) in different formats. I also updated the compressor and the expander with improved UI controls. The compressor also has now a dry/wet knob, allowing to use it for parallel compression. ATKLimiter ATKCompressor ATKExpander The supported formats are: • VST2 (32bits/64bits on Windows, 64bits on OS X) • VST3 (32bits/64bits on Windows, 64bits on OS X) • Audio Unit (64bits, OS X) The files can be downloaded on SourceForge, as well as the source code. Direct link for ATKLimiter Direct link for ATKCompressor Direct link for ATKExpander Buy Me a Coffee! Other Amount: Your Email Address : ## August 18, 2014 ### Hamzeh Alsalhi #### Google Summer of Code 2014 Final Summary Now at the end of this GSoC I have contributed four pull requests that have been merged into the code base. There is one planed pull request that has not been started and another pull request nearing its final stages. The list below gives details of each pull request and what was done or needs to be done in the future. This GSoC has been an excellent experience. I wan't to thank the members of the scikit-learn community, most of all Vlad, Gael, Joel, Oliver, and my mentor Arnaud, for their guidance and input which improved the quality of my projects immeasurably. ## Sparse Input for Ensemble Methods PR #3161 - Sparse Input for AdaBoost StatusCompleted and Merged Summary of the work done: The ensemble/weighted_boosting class was edited to avoid densifying the input data and to simply pass along sparse data to the base classifiers to allow them to proceed with training and prediction on sparse data. Tests were written to validate correctness of the AdaBoost classifier and AdaBoost regressor when using sparse data by making sure training and prediction on sparse and dense formats of the data gave identical results, as well verifying the data remained in sparse format when the base classifier supported it. Go to the AdaBoost blog post to see the results of sparse input with AdaBoost visualized. PR - Sparse input Gradient Boosted Regression Trees (GBRT) StatusTo be started Summary of the work to be done: Very similar to sparse input support for AdaBoost, the classifier will need modification to support passing sparse data to its base classifiers and similar tests will be written to ensure correctness of the implementation. The usefulness of this functionality depends on the sparse support for decision trees which is a pending mature pull request here PR #3173. ## Sparse Output Support PR #3203 - Sparse Label Binarizer StatusCompleted and Merged Summary of the work done: The label binarizing function in scikit-learns label code was modified to support conversion from sparse formats and helper functions to this function from the utils module were modified to be able to detect the representation type of the target data when it is in sparse format. Read about the workings of the label binarizer. PR #3276 - Sparse Output One vs. Rest StatusCompleted and Merged Summary of the work done: The fit and predict functions for one vs. rest classifiers modified to detect sparse target data and handle it without densifying the entire matrix at once, instead the fit function iterates over densified columns of the target data and fits an individual classifier for each column and the predict uses binarizaion on the results from each classifier individually before combining the results into a sparse representation. A test was written to ensure that classifier accuracy was within a suitable range when using sparse target data. PR #3438 - Sparse Output Dummy Classifier StatusCompleted and Merged Summary of the work done: The fit and predict functions were adjusted to accept the sparse format target data. To reproduce the same behavior of prediction on dense target data first a sparse class distribution function was written to get the classes of each column in the sparse matrix, second a random sampling function was created to provide a sparse matrix of randomly drawn values from a user specified distribution. Read the blog post to see detailed results of the sparse output dummy pull request. PR #3350 - Sparse Output KNN Classifier StatusNearing Completion Summary of the work done: In the predict function of the classifier the dense target data is indexed one column at a time. The main improvement made here is to leave the target data in sparse format and only convert a column to a dense array when it is necessary. This results in a lower peak memory consumption, the improvement is proportional to the sparsity and overall size of the target matrix. ## Future Directions It is my goal for the Fall semester to support the changes I have made to the scikit-learn code base the best I can. I also hope to see myself finalize the remaining two pull requests. ## August 17, 2014 ### Vighnesh Birodkar #### vighneshbirodkar This years GSoC coding period has nearly come to an end. This post aims to briefly summarize everything that happened during the last three months. My task was to implement Region Adjacency Graph based segmentation algorithms for scikit-image. This post provides a good explanation about them. Below I will list out my major contributions. ## Contributions Region Adjacency Graphs Fixing the API for RAGs was very important, since it was directly going to affect everything else that followed. After a long discussion and some benchmarks we finally decided to have NetworkX as a dependency. This helped a lot, since I had a lot of graph algorithms already implemented for me. The file rag.py implements the RAG class and the RAG construction methods. I also implemented threshold_cut, a function which segments images by simply thresholding edge weights. To know more, you can visit, RAG Introduction. Normalized Cut The function cut_normazlied, implements the Normalized Cut algorithm for RAGs. You can visit Normalized Cut on RAGs to know more. See the videos at the end to get a quick idea of how NCut works. Also see, A closer look at NCut, where I have benchmarked the function and indicated bottlenecks. Drawing Regions Adjacency Graphs In my posts, I had been using a small piece of code I had written to display RAGs. This Pull Request implements the same functionality for scikit-image. This would be immensely useful for anyone who is experimenting with RAGs. For a more detailed explanation, check out Drawing RAGs. Hierarchical Merging of Region Adjacency Graphs This Pull Request implements a simple form of Hierarchical Merging. For more details, see Hierarchical Merging of Region Adjacency Graphs. This post also contains videos at the end, do check them out. This can also be easily extended to a boundary map based approach, which I plan to do post-GSoC ## Final Comments The most important thing for me is that I am a better Python programmer as compared to what I was before GSoC began this year. I was able to see how some graph based segmentation methods work at their most basic level. Although GSoC has come to an end, I don’t think my contributions to scikit-image have. Contributing to it has been a tremendous learning experience and plan to continue doing so. I have been been fascinated with Image Processing since me and my friends wrote an unholy piece of Matlab code about 3 years ago to achieve this. And as far as I can see its a fascination I will have for the rest of my life. Finally, I would like to thank my mentors Juan, Johannes Schönberger and Guillaume Gay. I would also like to thank Stefan for reviewing my Pull Requests. #### baseball Region Adjacency Graphs model regions in an image as nodes of a graph with edges between adjacent regions. Superpixel methods tend to over segment images, ie, divide into more regions than necessary. Performing a Normalized Cut and Thresholding Edge Weights are two ways of extracting a better segmentation out of this. What if we could combine two small regions into a bigger one ? If we keep combining small similar regions into bigger ones, we will end up with bigger regions which are significantly different from its adjacent ones. Hierarchical Merging explores this possibility. The current working code can be found at this Pull Request ## Code Example The merge_hierarchical function performs hierarchical merging on a RAG. It picks up the smallest weighing edge and combines the regions connected by it. The new region is adjacent to all previous neighbors of the two combined regions. The weights are updated accordingly. It continues doing so till the minimum edge weight in the graph in more than the supplied thresh value. The function takes a RAG as input where smaller edge weight imply similar regions. Therefore, we use the rag_mean_color function with the default "distance" mode for RAG construction. Here is a minimal code snippet. from skimage import graph, data, io, segmentation, color img = data.coffee() labels = segmentation.slic(img, compactness=30, n_segments=400) g = graph.rag_mean_color(img, labels) labels2 = graph.merge_hierarchical(labels, g, 40) g2 = graph.rag_mean_color(img, labels2) out = color.label2rgb(labels2, img, kind='avg') out = segmentation.mark_boundaries(out, labels2, (0, 0, 0)) io.imsave('out.png',out)  I arrived at the threshold 40 after some trial and error. Here is the output. The drawback here is that the thresh argument can vary significantly depending on image to image. ## Comparison with Normalized Cut Loosely speaking the normalized cut follows a top-down approach where as the hierarchical merging follow a bottom-up approach. Normalized Cut starts with the graph as a whole and breaks it down into smaller parts. On the other hand hierarchical merging, starts with individual regions and merges them into bigger ones till a criteria is reached. The Normalized Cut however, is much more robust and requires little tuning of its parameters as images change. Hierarchical merging is a lot faster, even though most of its computation logic is written in Python. ## Effect of change in threshold Setting a very low threshold, will not merge any regions and will give us back the original image. A very large threshold on the other hand would merge all regions and give return the image as just one big blob. The effect is illustrated below. #### threshold=10 #### threshold=20 #### threshold=40 #### threshold=70 #### threshold=100 ## Hierarchical Merging in Action With this modification the following code can output the effect of all the intermediate segmentation during each iteration. from skimage import graph, data, io, segmentation, color import time from matplotlib import pyplot as plt img = data.coffee() labels = segmentation.slic(img, compactness=30, n_segments=400) g = graph.rag_mean_color(img, labels) labels2 = graph.merge_hierarchical(labels, g, 60) c = 0 out = color.label2rgb(graph.graph_merge.seg_list[-10], img, kind='avg') for label in graph.graph_merge.seg_list: out = color.label2rgb(label, img, kind='avg') out = segmentation.mark_boundaries(out, label, (0, 0, 0)) io.imsave('/home/vighnesh/Desktop/agg/' + str(c) + '.png', out) c += 1  I then used avconv -f image2 -r 3 -i %d.png -r 20 car.mp4 to output a video. Below are a few examples. In each of these videos, at every frame, a boundary dissapears. This means that the two regions separated by that boundary are merged. The frame rate is 5 FPS, so more than one region might be merged at a time. ### Coffee Image ### Car Image ### Baseball Image ### Issam Laradji #### (GSoC 2014) Final Summary (Neural Networks) GSoC 2014 has been an extraordinary experience. Not only did it encourage me to develop much needed open-source implementation of neural network algorithms, but also exposed me to a great, diverse community. I also learned useful practices for maintaining clean, quality code and writing accessible documentation. This prepared me to work well, and efficiently under pressure, since quality work had to be produced in a short period of time. In terms of the requirements, the three algorithms mentioned in the proposal - (1) Multi-layer perceptron, (2) Multi-layer perceptron with pre-training, and (3) Extreme Learning Machines - are completed (see below for a comprehansive description). In terms of specific requirements, there has been a lot of changes in order to accommodate positive suggestions, especially for MLP, and ELM. While a part of MLP was completed prior to the start of GSoC, the code went through a complete renovation, which made it faster, more readable, more scalable, and more robust. In fact, most of the work involved was optimizing the speed of execution, improving readability - this includes proper naming and convenient infrastructure of the code base, and writing a comprehensive documentation. The algorithms are explained in more detail below. Acknowledgements This wouldn't have been possible without the profound, sincere assistance of my mentor Olivier Grisel, and the scikit-learn team - including, Arnaud Joly, Gael Varoquaux, Kyle Kastner, Jnothman, Lars Buitinck, and many more. I sincerely thank the PSA team for emphasizing on summarizing my work as blog posts here and I do greatly appreciate Google's significant support it offered, which was instrumental in the successful completion of this project. (1) Multi-layer perceptron (MLP) (link: #3204) ---------------------------------------------------------------------------- Figure 1: One hidden layer MLP This implements the classic backpropagation algorithm supporting one or more hidden layers (see Figure 1). Depending on the problem type (classification or regression), backpropagation optimizes an objective function whose main purpose is to have the predicted output as close to the target as possible, though subject to some constraints like regularization. The MLP supports L2 regularization which controls the degree to which it is overfitting. Increased regularization constrains the trained weights to be of smaller value which makes the decision function more linear. We also added a renowned activation function known as rectified linear unit (ReLU) function, which not only is faster, but allows training more than one hidden layer efficiently, at least more than hyperbolic tan and logistic [1]. Unit testing was made thorough as 99.17% of the statements were covered. A gradient unit test helped make sure the activation functions - hyperbolic tan, logistic, and ReLU - work as expected. After the mid-term, much of the code was renovated. Many methods were combined to simplify the code and improve readability. Performance was improved by removing redundant calls and taking advantage of pre-allocation of matrices - including, values of activation layers, gradients, and weights. Many private variables were removed, making pickling less prone to error and less dense. MLP might benefit from a scheme known as pre-training which is explained in section 2. (2) Multi-layer perceptron with pre-training (link: ------------------------------------------------------------------- Figure 2: Pre-training scheme using restricted boltzmann machines. Prior to running backpropagation, an unsupervised learner can provide the MLP with initial weights that might be better than randomized weights. The parameters optimized by the unsupervised learner - after training on the dataset - can be assigned to the MLP weight parameters as starting points. The motivation is that these initial weights are meant to allow backpropagation to converge in a better local optima than otherwise [2]. Figure 2 illustrates the scheme of using pre-training with multi-layer perceptron. For each set of weights between two layers, a restricted boltzmann machine (RBMs) trains on the input data of the previous layer and the final parameters are assigned to these set of weights in the large multi-layer perceptron. An example was set to compare the performance of multi-layer perceptron (MLP) with and without pre-training using RBMs [3]. MLP without pre-training had its parameters initialized using scaled, random distribution. For pre-training, an RBM trains on the digits dataset and the resultant parameters are given to MLP as initial coefficient and intercept parameters. Below are the testing scores against the digits dataset [4], Testing accuracy of mlp without pretraining: 0.967 Testing accuracy of mlp with pretraining: 0.978 However, it is not always the case that pretraining improves performance. In some occasions, especially when dealing with large training sets, it could even decrease the score. (3) Extreme Learning Machines (link: #3306) ---------------------------------------------------- Figure 3: Neural network for ELM The main focus after the mid-term evaluations was on developing extreme learning machines (ELMs). First we implemented the standard algorithm of ELMs that optimize an objective function using least-square solutions. An ELM has a similar network as a one hidden layer MLP, except the output layer has no bias (see Figure 3). ELM, basically, trains a network through these 3 steps, • it applies a random projection to the input space, onto a possibly higher dimensional space; • the result passes through an element-wise non-linear activation function, typically a sigmoid such as the tanh, and logistic functions; and • last, it trains a linear one vs. rest classifier or a multi-output ridge regression model. The algorithm trains a single-hidden layer feedforward network by computing the hidden layer values using randomized parameters, then solving for the output weights using least-square solutions. For prediction, after computing the forward pass, the continuous output values pass through a gate function converting them to integers that represent classes. The function representing ELM is given as,$y=\beta\cdot f(W^T \cdot X + b ) $where matrices$X$and$y$represent the input samples and target values, respectively; matrices$W$and$b$are randomly generated based on a uniform distribution; matrix$\beta$contains unknown variables; and$f(\cdot)$is the non-linear, component-wise activation function. ELM solves for$\beta$using the ridge regression implementation, given as,$(H^T H + (1 / C) * I)^{-1} H^T y$where$H = f(W^TX+b)$,$C$is a regularization term which controls the linearity of the decision function, and$I\$ is the identity matrix.
We demonstrated the effects of tuning two main hyperparameters,
• weight_scale, which controls the variance of the random projection weights, the higher the value the more the less the regularization and therefore more overfitting.
• C, which controls the regularization strength of the output linear model, which regularizes the hidden-to-output weights in the same way as weight_scale regularizes the input-to-hidden weights.

and another main hyperparameter,
• n_hidden, which controls the number of hidden layer nodes.

Below are 3 plots that illustrate the effect of these parameters on score,

Figure 4: Effect of weight_scale on decision function.

Figure 5: Effect of C on decision function.

Figure 6: Effect of weight_scale and C on the scores against the Digits dataset.

Figures 4 and 5 show how increasing the regularization terms C would lead to a more non-linear decision function.

Figure 6 shows a colour map representing scores returned by grid-search illustrating the fact that a balance between C and weight_scale is important to have a higher score. C=1.0 and weight_scale=10  achieved the highest score as indicated by the darkest shade of the relevant blue square.

We re-used ridge regression [5] implementation for solving the least-square solution as it optimizes training speed for different data types. Next, we implemented the sequential algorithm of the ELM. It allows ELM to train on the dataset in batches, while, interestingly, the end result is exactly the same as though the whole dataset is put into memory. However, decreasing the size of the batches, can potentially increase training time. Below is a benchmark showing the training time in seconds of training ELMs with different batch sizes on a 10000 image MNIST dataset.

batch_size        50 hidden neurons            500 hidden neurons
----------------------------------------------------------------------
None                0.32s                                 2.21s
10                     0.71s                                13.62s
100                   0.33s                                3.30s
1000                 0.32s                                2.44s

batch_size=None means that the whole dataset is put into memory. As shown in the benchmark, Training on larger batch sizes improves speed but could cause memory error if the batch size is too large. Nevertheless, using batches the algorithm supports on-line learning and therefore it can update its solutions as the data arrives in chunks.

Also, support for kernel was added to ELM, which was later removed for reasons that will soon appear. ELM originally transforms the input data into hidden activations depending on the number of hidden neurons. Similarly, kernels, like in SVM, transform the input data into another dimensional space where hidden neurons play no role. Empirically, kernels were found to be slow, yet lead to no accuracy improvement over the standard ELM. For that reason and to avoid feature creep, we removed kernel support.

However, we added support of the ReLU activation function, hyperbolic tan, and logistic. They were put in an external file so that they can be shared between different modules in scikit-learn .

Further, we updated another file [6] that is responsible for assigning class weights, useful for several algorithms that support weighted classification. We added method that computes the weights  corresponding to each sample as a vector to allow ridge-regression to run weighted least-square solutions in the ELM.

We also improved testing coverage. ELM has a coverage of 100% of the code, making it reliable. Testings were made to make sure, that weighted ELM does improve results in instances of imbalanced datasets; that higher number of hidden neurons does improve the training score; and that whether the algorithm runs using batch-based or not should produce the same end result.

To conclude, this experience was special and useful in that it brought me closer to the scikit-learn community and other open-source communities. It also encouraged me to satisfy my long ambition of implementing useful algorithms and writing accessible documentation for any user who wish to delve into the world of neural networks.

I sincerely look forward to continue working with the scikit-learn team for the years to come and I sincerely look forward to participating in GSoC 2015, either as a mentor or as a student.

References
--------------

[1] Maas, Andrew L., Awni Y. Hannun, and Andrew Y. Ng. "Rectifier nonlinearities improve neural network acoustic models." ICML Workshop on Deep Learning for Audio, Speech, and Language Processing. 2013.

[2] Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.

[4] http://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html

[5] http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

------------------

1) Multi-layer perceptron: https://github.com/scikit-learn/scikit-learn/pull/3204

2) Multi-layer perceptron with pre-training: https://github.com/scikit-learn/scikit-learn/pull/3281

3) Extreme learning machines: https://github.com/scikit-learn/scikit-learn/pull/3306

### Maheshakya Wijewardena

#### Performance comparison among LSH Forest, ANNOY and FLANN

Finally, it is time to compare performance of Locality Sensitive Hashing Forest(approximate nearest neighbor search implementation in scikit-learn), Spotify Annoy and FLANN.

## Criteria

Synthetic datasets of different sizes (varying n_samples and n_features) are used for this evalutation. For each data set, following measures were calculated.

1. Index building time of each ANN (Approximate Nearest Neighbor) implementation.
2. Accuracy of nearest neighbors queries with their query times.

Python code used for this evaluation can be found in this Gist. Parameters of LSHForest (n_estimators=10 and n_candidates=50) are kept fixed during this experiment. Accuracies can be raised by tuning these parameters.

## Results

For each dataset, two graphs have been plotted according to the measures expressed in the above section. It is evident that index building times of LSH Forest and FLANN are almost incomparable to that of Annoy for almost all the datasets. Moreover, for larger datasets, LSH Forest outperforms Annoy at large margins with respect to accuracy and query speed. Observations from these graphs prove that LSH Forest is competitive with FLANN for large datasets.

#### A demonstration of the usage of Locality Sensitive Hashing Forest in approximate nearest neighbor search

This is a demonstration to explain how to use the approximate nearest neighbor search implementation using locality sensitive hashing in scikit-learn and to illustrate the behavior of the nearest neighbor queries as the parameters vary. This implementation has an API which is essentially as same as the NearestNeighbors module as approximate nearest neighbor search is used to speed up the queries at the cost of accuracy when the database is very large.

Before beginning the demonstration, background has to be set. First, the required modules are loaded and a synthetic dataset is created for testing.

import time
import numpy as np
from sklearn.datasets.samples_generator import make_blobs
from sklearn.neighbors import LSHForest
from sklearn.neighbors import NearestNeighbors

# Initialize size of the database, iterations and required neighbors.
n_samples = 10000
n_features = 100
n_iter = 30
n_neighbors = 100
rng = np.random.RandomState(42)

# Generate sample data
X, _ = make_blobs(n_samples=n_samples, n_features=n_features,
centers=10, cluster_std=5, random_state=0)

There are two main parameters which affect queries in the LSH Forest implementation.

1. n_estimators : Number of trees in the LSH Forest.
2. n_candidates : Number of candidates chosen from each tree for distance calculation.

In the first experiment, average accuracies are measured as the value of n_estimators vary. n_candidates is kept fixed. slearn.neighbors.NearestNeighbors used to obtain the true neighbors so that the returned approximate neighbors can be compared against.

# Set n_estimators values
n_estimators_values = np.linspace(1, 30, 5).astype(np.int)
accuracies_trees = np.zeros(n_estimators_values.shape[0], dtype=float)

# Calculate average accuracy for each value of n_estimators
for i, n_estimators in enumerate(n_estimators_values):
lshf = LSHForest(n_candidates=500, n_estimators=n_estimators,
n_neighbors=n_neighbors)
nbrs = NearestNeighbors(n_neighbors=n_neighbors, algorithm='brute')

lshf.fit(X)
nbrs.fit(X)
for j in range(n_iter):
query = X[rng.randint(0, n_samples)]
neighbors_approx = lshf.kneighbors(query, return_distance=False)
neighbors_exact = nbrs.kneighbors(query, return_distance=False)

intersection = np.intersect1d(neighbors_approx,
neighbors_exact).shape[0]
ratio = intersection/float(n_neighbors)
accuracies_trees[i] += ratio

accuracies_trees[i] = accuracies_trees[i]/float(n_iter)

Similarly, average accuracy vs n_candidates is also measured.

# Set n_candidate values
n_candidates_values = np.linspace(10, 500, 5).astype(np.int)
accuracies_c = np.zeros(n_candidates_values.shape[0], dtype=float)

# Calculate average accuracy for each value of n_candidates
for i, n_candidates in enumerate(n_candidates_values):
lshf = LSHForest(n_candidates=n_candidates, n_neighbors=n_neighbors)
nbrs = NearestNeighbors(n_neighbors=n_neighbors, algorithm='brute')
# Fit the Nearest neighbor models
lshf.fit(X)
nbrs.fit(X)
for j in range(n_iter):
query = X[rng.randint(0, n_samples)]
# Get neighbors
neighbors_approx = lshf.kneighbors(query, return_distance=False)
neighbors_exact = nbrs.kneighbors(query, return_distance=False)

intersection = np.intersect1d(neighbors_approx,
neighbors_exact).shape[0]
ratio = intersection/float(n_neighbors)
accuracies_c[i] += ratio

accuracies_c[i] = accuracies_c[i]/float(n_iter)

You can get a clear view of the behavior of queries from the following plots.

The next experiment demonstrates the behavior of queries for different database sizes (n_samples).

# Initialize the range of n_samples
n_samples_values = [10, 100, 1000, 10000, 100000]
average_times = []
# Calculate the average query time
for n_samples in n_samples_values:
X, labels_true = make_blobs(n_samples=n_samples, n_features=10,
centers=10, cluster_std=5,
random_state=0)
# Initialize LSHForest for queries of a single neighbor
lshf = LSHForest(n_candidates=1000, n_neighbors=1)
lshf.fit(X)

average_time = 0

for i in range(n_iter):
query = X[rng.randint(0, n_samples)]
t0 = time.time()
approx_neighbors = lshf.kneighbors(query,
return_distance=False)[0]
T = time.time() - t0
average_time = average_time + T

average_time = average_time/float(n_iter)
average_times.append(average_time)

n_samples space is defined as [10, 100, 1000, 10000, 100000]. Query time for a single neighbor is measure for these different values of n_samples.

### NeuralEnsemble

#### GSoC Open Source Brain: Arbitrary Spike-trains in PyNN

Arbitrary Spikes in PyNN

# Arbitrary Spike-trains in PyNN

In this example we are going to create a population of cells with arbitrary spike trains. We will load the spike train from a file where they are stored as a list of arrays with the times at which they occurred. In order to so we are going to use the SpikeSourceArray class model of PyNN

First we start importing PyNN in with nest and all the other required libraries. Furthermore we start the simulators and given some general parameters. We assume that we already have produced spikes for the cells with 0.50 contrast:

      import numpy as np      import matplotlib.pyplot as plt      import cPickle      import pyNN.nest as simulator      contrast = 0.50      Nside_lgn = 30      Ncell_lgn = Nside_lgn * Nside_lgn      N_lgn_layers = 4      t = 1000  # ms      simulator.setup(timestep=0.1, min_delay=0.1, max_delay=5.0)

So we are going to suppose that we have our data stored in './data'. The spike-trains are lists as long as the cell population that contain for each element an array with the times at which the spikes occurred for that particular neuron. In order to load them we will use the following code

      directory = './data/'      format = '.cpickle'      spikes_on = []      spikes_off = []      for layer in xrange(N_lgn_layers):      #  Layer 1      layer = '_layer' + str(layer)      polarity = '_on'      contrast_mark = str(contrast)      mark = '_spike_train'      spikes_filename = directory + contrast_mark + mark + polarity + layer + format      f2 = open(spikes_filename, 'rb')      spikes_on.append(cPickle.load(f2))      f2.close()      polarity = '_off'      contrast_mark = str(contrast)      mark = '_spike_train'      spikes_filename = directory + contrast_mark + mark + polarity + layer + format      f2 = open(spikes_filename, 'rb')      spikes_off.append(cPickle.load(f2))      f2.close()

Now this is the crucial part. If we want to utilize the SpikeSourceArray model for a cell in PyNN we can define a function that pass the spike-train for each cell in the population. In order to so we use the following code:

      def spike_times(simulator, layer, spikes_file):         return [simulator.Sequence(x) for x in spikes_file[layer]]

Note that we have to change every spike-train array to a sequence before using it as a spike-train. After defining this function we can create the LGN models:

      # Cells models for the LGN spikes (SpikeSourceArray)      lgn_spikes_on_models = []      lgn_spikes_off_models = []      for layer in xrange(N_lgn_layers):         model = simulator.SpikeSourceArray(spike_times=spike_times(simulator, layer, spikes_on))         lgn_spikes_on_models.append(model)         model = simulator.SpikeSourceArray(spike_times=spike_times(simulator, layer, spikes_off))         lgn_spikes_off_models.append(model)

Now that we have the corresponding model for the cells we can create the populations in the usual way:

 # LGN Popluations lgn_on_populations = [] lgn_off_populations = [] for layer in xrange(N_lgn_layers): population = simulator.Population(Ncell_lgn, lgn_spikes_on_models[layer], label='LGN_on_layer_' + str(layer)) lgn_on_populations.append(population) population = simulator.Population(Ncell_lgn, lgn_spikes_off_models[layer], label='LGN_off_layer_' + str(layer)) lgn_off_populations.append(population)

In order to analyze the spike-trains patterns for each population we need to declare a recorder for each population:

       layer = 0  # We declare here the layer of our interest       population_on = lgn_on_populations[layer]      population_off = lgn_off_populations[layer]      population_on.record('spikes')      population_off.record('spikes')

Note here that we can chose the layer of our interest by modifying the value of the layer variable. Finally we run the model with the usual instructions and extract the spikes:

      #############################      # Run model      #############################      simulator.run(t)  # Run the simulations for t ms      simulator.end()      #############################      # Extract the data      #############################      data_on = population_on.get_data()  # Creates a Neo Block      data_off = population_off.get_data()      segment_on = data_on.segments[0]  # Takes the first segment      segment_off = data_off.segments[0]

In order to visualize the spikes we use the following function:

       # Plot spike trains      def plot_spiketrains(segment):          """          Plots the spikes of all the cells in the given segments          """          for spiketrain in segment.spiketrains:              y = np.ones_like(spiketrain) * spiketrain.annotations['source_id']              plt.plot(spiketrain, y, '*b')              plt.ylabel('Neuron number')              plt.xlabel('Spikes')

Here the spiketrain variable contains the spike-train for each cell, that is, an array with the times at which the action potentials happened for each cell. In order to tell them apart we assigned them the value of the cell id. Finally we can plot the spikes of the on and off cells with the following code:

      plt.subplot(2, 1, 1)      plt.title('On cells ')      plot_spiketrains(segment_on)      plt.subplot(2, 1, 2)      plt.title('Off cells ')      plot_spiketrains(segment_off)      plt.show()

We now show the plot produced by the code above. Note that the on and off cells are off-phase by 180.

#### GSoC Open Source Brain: Firing Rate Induced by a Sinus Grating

Firing Rate Induced by a Sinus Grating

# Firing Rate induced by a Sinus Grating

Now that we know how to do convolutions with our center-surround kernel we can chose any other kind of stimulus to carry this out. In the neuoscience of vision it is very common to use a sinus grating in a wide array of experimental setings so we are going to use it now. In short, in this post we are going to see the see what signal does a center-surround kernel produces when is convolved with a sinus grating.

## Center-Surround Kernel

In order to do the convolution we are going to define the kernel in the usual way using a function that we have utilized from our work before:

      # First we define the size and resolution of the space in which the convolution is going to happen      dx = 0.05      dy = 0.05      lx = 6.0  # In degrees      ly = 6.0  # In degrees      # Now we define the temporal parameters of the kernel      dt_kernel = 5.0  # ms      kernel_duration = 150  # ms      kernel_size = int(kernel_duration / dt_kernel)      #  Now the center surround parameters      factor = 1  # Controls the overall size of the center-surround pattern      sigma_center = 0.25 * factor  # Corresponds to 15'      sigma_surround = 1 * factor  # Corresponds to 1 degree      # Finally we create the kernel      kernel_on = create_kernel(dx, lx, dy, ly, sigma_surround, sigma_center, dt_kernel, kernel_size)

## Sinus Grating

Now we are going to construct our sinus grating. But first, we need to think on how long our stimulus is going to last which is a function of how long the we want to simulate the convolution and of the resolutions of the stimulus and the simulation:

      ## Now we define the temporal l parameters of the sinus grating      dt_stimuli = 5.0  # ms      # We also need to add how long do we want to convolve      dt = 1.0  # Simulation resolution      T_simulation = 1 * 10 ** 3.0 # ms      T_simulation += int(kernel_size * dt_kernel)  # Add the size of the kernel      Nt_simulation = int(T_simulation / dt)  # Number of simulation points      N_stimuli = int(T_simulation / dt_stimuli)  # Number of stimuli points

Finally we now present the parameters that determine the sinus grating. First the spatial frequency (K), followed by the spatial phase (Phi) and orientation (Theta). Furthermore we have also a parameter for the amplitude and the temporal frequency:

      # And now the spatial parameters of the sinus grating      K = 0.8  # Cycles per degree      Phi = 0  # Spatial phase       Theta = 0 # Orientation       A = 1 # Amplitude       # Temporal frequency of sine grating      w = 3  # Hz

Now with all the spatial parameters in our possession we can call the function that produces the sine grating, we define it as the following function that we present below:

       stimuli = sine_grating(dx, lx, dy, ly, A, K, Phi, Theta, dt_stimuli, N_stimuli, w)            def sine_grating(dx, Lx, dy, Ly, A, K, Phi, Theta, dt_stimuli, N_stimuli, w):       '''       Returns a sine grating stimuli       '''       Nx = int(Lx / dx)       Ny = int(Ly / dy)       # Transform to appropriate units       K = K * 2 * np.pi # Transforms K to cycles per degree       w = w / 1000.0 # Transforms w to kHz       x = np.arange(-Lx/2, Lx/2, dx)       y = np.arange(-Ly/2, Ly/2, dy)       X, Y = np.meshgrid(x, y)       Z = A * np.cos(K * X *cos(Theta) + K * Y * sin(Theta) - Phi)       t = np.arange(0, N_stimuli * dt_stimuli, dt_stimuli)       f_t = np.cos(w * 2 * np.pi *  t )       stimuli = np.zeros((N_stimuli, Nx, Ny))       for k, time_component in enumerate(f_t):           stimuli[k, ...] = Z * time_component       return stimuli

## Convolution

Now that we have the stimulus and the kernel we can do the convolution, in order to do that we use again our functions and indexes that we use in the last post:

       ## Now we can do the convolution      # First we define the necessary indexes to the convolution      signal_indexes, delay_indexes, stimuli_indexes = create_standar_indexes(dt, dt_kernel, dt_stimuli, kernel_size, Nt_simulation)      working_indexes, kernel_times = create_extra_indexes(kernel_size, Nt_simulation)      # Now we calculate the signal      signal = np.zeros(Nt_simulation)      for index in signal_indexes:          signal[index] = convolution(index, kernel_times, delay_indexes, stimuli_indexes, kernel_on, stimuli)

We can visualize signal with the following code:

       #Plot the signal       t = np.arange(kernel_size*dt_kernel, T_simulation, dt)      plt.plot(t, signal[signal_indexes])      plt.show()

We can see that the signal is also a sinus with a frequency that is consistent with the one from the sinus grating.

## Variation with number of regions

In this post I explained how the Normalized Cut works and demonstrated some examples of it. This post aims to take a closer look at the code. I ran the following code to monitor the time taken by NCut with respect to initial number of regions.

from __future__ import print_function
from skimage import graph, data, io, segmentation, color
import time
from matplotlib import pyplot as plt

image = data.coffee()
segment_list = range(50, 801, 50)

for nseg in segment_list:
labels = segmentation.slic(image, compactness=30, n_segments=nseg)
rag = graph.rag_mean_color(image, labels, mode='similarity')
T = time.time()
new_labels = graph.ncut(labels, rag)
time_taken = time.time() - T
out = color.label2rgb(new_labels, image, kind='avg')
io.imsave('/home/vighnesh/Desktop/ncut/' + str(nseg) + '.png', out)
print(nseg, time_taken)


Here is the output sequentially.

By a little guess-work, I figured that the curve approximately varies as x^2.2. For 800 nodes, the time taken is around 35 seconds.

## Line Profile

I used line profiler to examine the time taken by each line of code in threshold_normalized. Here are the results.

   218                                           @profile
219                                           def _ncut_relabel(rag, thresh, num_cuts):
220                                               """Perform Normalized Graph cut on the Region Adjacency Graph.
221
222                                               Recursively partition the graph into 2, until further subdivision
223                                               yields a cut greather than thresh or such a cut cannot be computed.
224                                               For such a subgraph, indices to labels of all its nodes map to a single
225                                               unique value.
226
227                                               Parameters
228                                               ----------
229                                               labels : ndarray
230                                                   The array of labels.
231                                               rag : RAG
233                                               thresh : float
234                                                   The threshold. A subgraph won't be further subdivided if the
235                                                   value of the N-cut exceeds thresh.
236                                               num_cuts : int
237                                                   The number or N-cuts to perform before determining the optimal one.
238                                               map_array : array
239                                                   The array which maps old labels to new ones. This is modified inside
240                                                   the function.
241                                               """
242        59       218937   3710.8      3.2      d, w = _ncut.DW_matrices(rag)
243        59          151      2.6      0.0      m = w.shape[0]
244
245        59           61      1.0      0.0      if m > 2:
246        44         3905     88.8      0.1          d2 = d.copy()
247                                                   # Since d is diagonal, we can directly operate on its data
248                                                   # the inverse of the square root
249        44          471     10.7      0.0          d2.data = np.reciprocal(np.sqrt(d2.data, out=d2.data), out=d2.data)
250
251                                                   # Refer Shi & Malik 2001, Equation 7, Page 891
252        44        26997    613.6      0.4          vals, vectors = linalg.eigsh(d2 * (d - w) * d2, which='SM',
253        44      6577542 149489.6     94.9                                       k=min(100, m - 2))
254
255                                                   # Pick second smallest eigenvector.
256                                                   # Refer Shi & Malik 2001, Section 3.2.3, Page 893
257        44          618     14.0      0.0          vals, vectors = np.real(vals), np.real(vectors)
258        44          833     18.9      0.0          index2 = _ncut_cy.argmin2(vals)
259        44         2408     54.7      0.0          ev = _ncut.normalize(vectors[:, index2])
260
261        44        22737    516.8      0.3          cut_mask, mcut = get_min_ncut(ev, d, w, num_cuts)
262        44           78      1.8      0.0          if (mcut < thresh):
263                                                       # Sub divide and perform N-cut again
264                                                       # Refer Shi & Malik 2001, Section 3.2.5, Page 893
265        29        78228   2697.5      1.1              sub1, sub2 = partition_by_cut(cut_mask, rag)
266
267        29          175      6.0      0.0              _ncut_relabel(sub1, thresh, num_cuts)
268        29           92      3.2      0.0              _ncut_relabel(sub2, thresh, num_cuts)
269        29           32      1.1      0.0              return
270
271                                               # The N-cut wasn't small enough, or could not be computed.
272                                               # The remaining graph is a region.
273                                               # Assign ncut label by picking any label from the existing nodes, since
274                                               # labels are unique, new_label is also unique.
275        30          685     22.8      0.0      _label_all(rag, 'ncut label')


As you can see above 95% of the time is taken by the call to eigsh.

To take a closer look at it, I plotted time while ensuring only one iteration. This commit here takes care of it. Also, I changed the eigsh call to look for the largest eigenvectors instead of the smallest ones, with this commit here. Here are the results.

A single eignenvalue computation is bounded by O(n^1.5) as mentioned in the original paper. The recursive NCuts are pushing the time required towards more than O(n^2).eigsh solves the eigenvalue problem for a symmetric hermitian matrix. It in turn relies on a library called ARPack. As documented here ARPack isn’t very good at finding the smallest eigenvectors. If the value supplied as the argument k is too small, we get the ArpackNoConvergence Exception. As seen from the above plot, finding the largest eigenvectors is much more efficient using the eigsh function.

Since the problem is specific to ARPack, using other libraries might lead to faster computation. slepc4py is one such BSD licensed library. The possibility of optionally importing slec4py should be certainly explored in the near future.

Also, we can optionally ask the user for a function to solve the eigenvalue problem, so that he can use a matrix library of his choice if he/she so desires.

## Final Thoughts

Although the current Normalized Cut implementation takes more than quadratic time, the preceding over segmentation method does most of the heavy lifting. With something like SLIC, we can be sure of the number of nodes irrespective of the input image size. Although, a better eigenvalue finding technique for smallest eigenvectors would immensely improve its performance.

### Jan Hendrik Metzen

#### Compare Classifier Predictions using Reliability Diagrams

This notebook generates reliability diagrams for some classifiers on an artificial data set. Reliability diagrams allow checking if the predicted probabilities of a binary classifier are well calibrated. For perfectly calibrated predictions, the curve in a reliability diagram should be as close as possible to the diagonal/identity. This would correspond to a situation in which among $$N$$ instances for which a classifier predicts probability $$p$$ for class $$A$$, the ratio of instances which actually belong to class $$A$$ is approx. $$p$$ (for any $$p$$ and sufficiently large $$N$$).

This notebook reproduces some of the results from the paper Predicting Good Probabilities with Supervised Learning.

In [1]:
%load_ext watermark
%watermark -a "Jan Hendrik Metzen" -d -v -m -p numpy,scikit-learn

Jan Hendrik Metzen 16/08/2014

CPython 2.7.7
IPython 2.1.0

numpy 1.8.1
scikit-learn 0.14.1

compiler   : GCC 4.1.2 20080704 (Red Hat 4.1.2-54)
system     : Linux
release    : 3.13.0-29-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit


In [2]:
import numpy as np
np.random.seed(0)

from sklearn import datasets

from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression

In [3]:
import matplotlib.pyplot as plt
%matplotlib inline


### Function for reliability curve computation¶

In [4]:
def reliability_curve(y_true, y_score, bins=10, normalize=False):
"""Compute reliability curve

Reliability curves allow checking if the predicted probabilities of a
binary classifier are well calibrated. This function returns two arrays
which encode a mapping from predicted probability to empirical probability.
For this, the predicted probabilities are partitioned into equally sized
bins and the mean predicted probability and the mean empirical probabilties
in the bins are computed. For perfectly calibrated predictions, both
quantities whould be approximately equal (for sufficiently many test
samples).

Note: this implementation is restricted to binary classification.

Parameters
----------

y_true : array, shape = [n_samples]
True binary labels (0 or 1).

y_score : array, shape = [n_samples]
Target scores, can either be probability estimates of the positive
class or confidence values. If normalize is False, y_score must be in
the interval [0, 1]

bins : int, optional, default=10
The number of bins into which the y_scores are partitioned.
Note: n_samples should be considerably larger than bins such that
there is sufficient data in each bin to get a reliable estimate
of the reliability

normalize : bool, optional, default=False
Whether y_score needs to be normalized into the bin [0, 1]. If True,
the smallest value in y_score is mapped onto 0 and the largest one
onto 1.

Returns
-------
y_score_bin_mean : array, shape = [bins]
The mean predicted y_score in the respective bins.

empirical_prob_pos : array, shape = [bins]
The empirical probability (frequency) of the positive class (+1) in the
respective bins.

References
----------
.. [1] Predicting Good Probabilities with Supervised Learning
<http://machinelearning.wustl.edu/mlpapers/paper_files/icml2005_Niculescu-MizilC05.pdf>_

"""
if normalize:  # Normalize scores into bin [0, 1]
y_score = (y_score - y_score.min()) / (y_score.max() - y_score.min())

bin_width = 1.0 / bins
bin_centers = np.linspace(0, 1.0 - bin_width, bins) + bin_width / 2

y_score_bin_mean = np.empty(bins)
empirical_prob_pos = np.empty(bins)
for i, threshold in enumerate(bin_centers):
# determine all samples where y_score falls into the i-th bin
bin_idx = np.logical_and(threshold - bin_width / 2 < y_score,
y_score <= threshold + bin_width / 2)
# Store mean y_score and mean empirical probability of positive class
y_score_bin_mean[i] = y_score[bin_idx].mean()
empirical_prob_pos[i] = y_true[bin_idx].mean()
return y_score_bin_mean, empirical_prob_pos


### Training data¶

Generate a toy dataset on which different classifiers are compared. Among the 20 features, only 2 are actually informative. 2 further features are redundant, i.e., linear combinations of the 2 informative features. The remaining features are just noise.

In [5]:
X, y = datasets.make_classification(n_samples=100000, n_features=20,
n_informative=2, n_redundant=2)

In [6]:
bins = 25
train_samples = 100  # Samples used for training the models
calibration_samples =  400  # Additional samples for claibration using Isotonic Regression

X_train = X[:train_samples]
X_calibration = X[train_samples:train_samples+calibration_samples]
X_test = X[train_samples+calibration_samples:]
y_train = y[:train_samples]
y_calibration = y[train_samples:train_samples+calibration_samples]
y_test = y[train_samples+calibration_samples:]


### Compute reliability curves for different classifiers¶

Compute reliability curves for different classifiers: * Logistic Regression * Naive Bayes * Random Forest * Support-Vector Classification (scores) * Support-Vector Classification + Isotonoc Calibration

In [7]:
classifiers = {"Logistic regression": LogisticRegression(),
"Naive Bayes": GaussianNB(),
"Random Forest": RandomForestClassifier(n_estimators=100),
"SVC": SVC(kernel='linear', C=1.0),
"SVC + IR": SVC(kernel='linear', C=1.0)}

In []:
reliability_scores = {}
y_score = {}
for method, clf in classifiers.items():
clf.fit(X_train, y_train)
if method == "SVC + IR":  # Calibrate SVC scores using isotonic regression.
n_plus = (y_calibration == 1.0).sum()  # number of positive examples
n_minus = (y_calibration == 0.0).sum()  # number of negative examples
# Determine target values for isotonic calibration. See
# "Predicting Good Probabilities with Supervised Learning"
# for details.
y_target = np.where(y_calibration == 0.0,
1.0 / (n_minus + 2.0),
(n_plus + 1.0) / (n_plus + 2.0))
# Perform actual calibration using isotonic calibration
svm_score = clf.decision_function(X_calibration)[:, 0]
ir = IsotonicRegression(out_of_bounds='clip').fit(svm_score, y_target)
y_score[method] = ir.transform(clf.decision_function(X_test)[:, 0])
reliability_scores[method] = \
reliability_curve(y_test, y_score[method], bins=bins, normalize=False)
elif method == "SVC":
# Use SVC scores (predict_proba returns already calibrated probabilities)
y_score[method] = clf.decision_function(X_test)[:, 0]
reliability_scores[method] = \
reliability_curve(y_test, y_score[method], bins=bins, normalize=True)
else:
y_score[method] = clf.predict_proba(X_test)[:, 1]
reliability_scores[method] = \
reliability_curve(y_test, y_score[method], bins=bins, normalize=False)


### Plot reliability diagram¶

In [9]:
plt.figure(0, figsize=(8, 8))
plt.subplot2grid((3, 1), (0, 0), rowspan=2)
plt.plot([0.0, 1.0], [0.0, 1.0], 'k', label="Perfect")
for method, (y_score_bin_mean, empirical_prob_pos) in reliability_scores.items():
scores_not_nan = np.logical_not(np.isnan(empirical_prob_pos))
plt.plot(y_score_bin_mean[scores_not_nan],
empirical_prob_pos[scores_not_nan], label=method)
plt.ylabel("Empirical probability")
plt.legend(loc=0)

plt.subplot2grid((3, 1), (2, 0))
for method, y_score_ in y_score.items():
y_score_ = (y_score_ - y_score_.min()) / (y_score_.max() - y_score_.min())
plt.hist(y_score_, range=(0, 1), bins=bins, label=method,
histtype="step", lw=2)
plt.xlabel("Predicted Probability")
plt.ylabel("Count")
plt.legend(loc='upper center', ncol=2)

Out[9]:
<matplotlib.legend.Legend at 0x7f8e75ee1890>


The following observations can be made:

• Logistic regression returns well-calibrated probabilities close to the "perfect" line
• Naive Bayes tends to push probabilties to 0 or 1 (note the counts in the histograms). This is mainly because it makes the assumption that features are conditionally independent given the class, which is not the case in this dataset which contains 2 redundant features.
• Random Forest shows the opposite behavior: The histograms show peaks at approx. 0.1 and 0.9 probability, while probabilities close to 0 or 1 are very rare. An explanation for this is given by Niculescu-Mizil and Caruana: "Methods such as bagging and random forests that average predictions from a base set of models can have difficulty making predictions near 0 and 1 because variance in the underlying base models will bias predictions that should be near zero or one away from these values. Because predictions are restricted to the interval [0,1], errors caused by variance tend to be one-sided near zero and one. For example, if a model should predict p = 0 for a case, the only way bagging can achieve this is if all bagged trees predict zero. If we add noise to the trees that bagging is averaging over, this noise will cause some trees to predict values larger than 0 for this case, thus moving the average prediction of the bagged ensemble away from 0. We observe this effect most strongly with random forests because the base-level trees trained with random forests have relatively high variance due to feature subseting." As a result, the calibration curve shows a characteristic sigmoid shape, indicating that the classifier could trust its "intuition" more and return probabilties closer to 0 or 1 typically. A post-processing such as Platt-calibration, which fits a sigmoid to the probabilities on a separate calibration dataset, would typically help if the calibration curve is sigmoid.
• The scores of a Support Vector Classification (SVC), which are linearly related to the distance of the sample from the hyperplane, show a similar but even stronger effect as the Random Forest. This is not too surprising as the scores are in no sense probabilties and must not be interpreted as such as the curve shows.
• One alternative to Platt-calibration is Isotonic Regression. While Platt-calibration fits a sigmoid, Isotonic Regression fits an arbitrary increasing (isotonic) function. Thus, it has a weaker inducttive bias and can be applied more broadly (also in situations where the calibration curve is not sigmoid). The downside is that it typically requires more calibration data because its inductive bias is weaker. This can also be seen in the SVC + IR curve: While the sigmoid shape of the pure SVC scores is removed and the calibration curve does not show a clear bias, it is quite noisy, indicating much variance. Thus, the used calibration dataset (even thoug 4 times larger than the training data) is too small in this case.

For a further discussion and more extensive experiments, please refer to Niculescu-Mizil and Caruana.

This post was written as an IPython notebook. You can download this notebook.

#### Compare Classifier Predictions using Reliability Diagrams

This notebook generates reliability diagrams for some classifiers on an artificial data set. Reliability diagrams allow checking if the predicted probabilities of a binary classifier are well calibrated. For perfectly calibrated predictions, the curve in a reliability diagram should be as close as possible to the diagonal/identity. This would correspond to a situation in which among $$N$$ instances for which a classifier predicts probability $$p$$ for class $$A$$, the ratio of instances which actually belong to class $$A$$ is approx. $$p$$ (for any $$p$$ and sufficiently large $$N$$).

This notebook reproduces some of the results from the paper Predicting Good Probabilities with Supervised Learning.

In [1]:
%load_ext watermark
%watermark -a "Jan Hendrik Metzen" -d -v -m -p numpy,scikit-learn

Jan Hendrik Metzen 16/08/2014

CPython 2.7.7
IPython 2.1.0

numpy 1.8.1
scikit-learn 0.14.1

compiler   : GCC 4.1.2 20080704 (Red Hat 4.1.2-54)
system     : Linux
release    : 3.13.0-29-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit


In [2]:
import numpy as np
np.random.seed(0)

from sklearn import datasets

from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression

In [3]:
import matplotlib.pyplot as plt
%matplotlib inline


### Function for reliability curve computation¶

In [4]:
def reliability_curve(y_true, y_score, bins=10, normalize=False):
"""Compute reliability curve

Reliability curves allow checking if the predicted probabilities of a
binary classifier are well calibrated. This function returns two arrays
which encode a mapping from predicted probability to empirical probability.
For this, the predicted probabilities are partitioned into equally sized
bins and the mean predicted probability and the mean empirical probabilties
in the bins are computed. For perfectly calibrated predictions, both
quantities whould be approximately equal (for sufficiently many test
samples).

Note: this implementation is restricted to binary classification.

Parameters
----------

y_true : array, shape = [n_samples]
True binary labels (0 or 1).

y_score : array, shape = [n_samples]
Target scores, can either be probability estimates of the positive
class or confidence values. If normalize is False, y_score must be in
the interval [0, 1]

bins : int, optional, default=10
The number of bins into which the y_scores are partitioned.
Note: n_samples should be considerably larger than bins such that
there is sufficient data in each bin to get a reliable estimate
of the reliability

normalize : bool, optional, default=False
Whether y_score needs to be normalized into the bin [0, 1]. If True,
the smallest value in y_score is mapped onto 0 and the largest one
onto 1.

Returns
-------
y_score_bin_mean : array, shape = [bins]
The mean predicted y_score in the respective bins.

empirical_prob_pos : array, shape = [bins]
The empirical probability (frequency) of the positive class (+1) in the
respective bins.

References
----------
.. [1] Predicting Good Probabilities with Supervised Learning
<http://machinelearning.wustl.edu/mlpapers/paper_files/icml2005_Niculescu-MizilC05.pdf>_

"""
if normalize:  # Normalize scores into bin [0, 1]
y_score = (y_score - y_score.min()) / (y_score.max() - y_score.min())

bin_width = 1.0 / bins
bin_centers = np.linspace(0, 1.0 - bin_width, bins) + bin_width / 2

y_score_bin_mean = np.empty(bins)
empirical_prob_pos = np.empty(bins)
for i, threshold in enumerate(bin_centers):
# determine all samples where y_score falls into the i-th bin
bin_idx = np.logical_and(threshold - bin_width / 2 < y_score,
y_score <= threshold + bin_width / 2)
# Store mean y_score and mean empirical probability of positive class
y_score_bin_mean[i] = y_score[bin_idx].mean()
empirical_prob_pos[i] = y_true[bin_idx].mean()
return y_score_bin_mean, empirical_prob_pos


### Training data¶

Generate a toy dataset on which different classifiers are compared. Among the 20 features, only 2 are actually informative. 2 further features are redundant, i.e., linear combinations of the 2 informative features. The remaining features are just noise.

In [5]:
X, y = datasets.make_classification(n_samples=100000, n_features=20,
n_informative=2, n_redundant=2)

In [6]:
bins = 25
train_samples = 100  # Samples used for training the models
calibration_samples =  400  # Additional samples for claibration using Isotonic Regression

X_train = X[:train_samples]
X_calibration = X[train_samples:train_samples+calibration_samples]
X_test = X[train_samples+calibration_samples:]
y_train = y[:train_samples]
y_calibration = y[train_samples:train_samples+calibration_samples]
y_test = y[train_samples+calibration_samples:]


### Compute reliability curves for different classifiers¶

Compute reliability curves for different classifiers: * Logistic Regression * Naive Bayes * Random Forest * Support-Vector Classification (scores) * Support-Vector Classification + Isotonoc Calibration

In [7]:
classifiers = {"Logistic regression": LogisticRegression(),
"Naive Bayes": GaussianNB(),
"Random Forest": RandomForestClassifier(n_estimators=100),
"SVC": SVC(kernel='linear', C=1.0),
"SVC + IR": SVC(kernel='linear', C=1.0)}

In []:
reliability_scores = {}
y_score = {}
for method, clf in classifiers.items():
clf.fit(X_train, y_train)
if method == "SVC + IR":  # Calibrate SVC scores using isotonic regression.
n_plus = (y_calibration == 1.0).sum()  # number of positive examples
n_minus = (y_calibration == 0.0).sum()  # number of negative examples
# Determine target values for isotonic calibration. See
# "Predicting Good Probabilities with Supervised Learning"
# for details.
y_target = np.where(y_calibration == 0.0,
1.0 / (n_minus + 2.0),
(n_plus + 1.0) / (n_plus + 2.0))
# Perform actual calibration using isotonic calibration
svm_score = clf.decision_function(X_calibration)[:, 0]
ir = IsotonicRegression(out_of_bounds='clip').fit(svm_score, y_target)
y_score[method] = ir.transform(clf.decision_function(X_test)[:, 0])
reliability_scores[method] = \
reliability_curve(y_test, y_score[method], bins=bins, normalize=False)
elif method == "SVC":
# Use SVC scores (predict_proba returns already calibrated probabilities)
y_score[method] = clf.decision_function(X_test)[:, 0]
reliability_scores[method] = \
reliability_curve(y_test, y_score[method], bins=bins, normalize=True)
else:
y_score[method] = clf.predict_proba(X_test)[:, 1]
reliability_scores[method] = \
reliability_curve(y_test, y_score[method], bins=bins, normalize=False)


### Plot reliability diagram¶

In [9]:
plt.figure(0, figsize=(8, 8))
plt.subplot2grid((3, 1), (0, 0), rowspan=2)
plt.plot([0.0, 1.0], [0.0, 1.0], 'k', label="Perfect")
for method, (y_score_bin_mean, empirical_prob_pos) in reliability_scores.items():
scores_not_nan = np.logical_not(np.isnan(empirical_prob_pos))
plt.plot(y_score_bin_mean[scores_not_nan],
empirical_prob_pos[scores_not_nan], label=method)
plt.ylabel("Empirical probability")
plt.legend(loc=0)

plt.subplot2grid((3, 1), (2, 0))
for method, y_score_ in y_score.items():
y_score_ = (y_score_ - y_score_.min()) / (y_score_.max() - y_score_.min())
plt.hist(y_score_, range=(0, 1), bins=bins, label=method,
histtype="step", lw=2)
plt.xlabel("Predicted Probability")
plt.ylabel("Count")
plt.legend(loc='upper center', ncol=2)

Out[9]:
<matplotlib.legend.Legend at 0x7f8e75ee1890>


The following observations can be made:

• Logistic regression returns well-calibrated probabilities close to the "perfect" line
• Naive Bayes tends to push probabilties to 0 or 1 (note the counts in the histograms). This is mainly because it makes the assumption that features are conditionally independent given the class, which is not the case in this dataset which contains 2 redundant features.
• Random Forest shows the opposite behavior: The histograms show peaks at approx. 0.1 and 0.9 probability, while probabilities close to 0 or 1 are very rare. An explanation for this is given by Niculescu-Mizil and Caruana: "Methods such as bagging and random forests that average predictions from a base set of models can have difficulty making predictions near 0 and 1 because variance in the underlying base models will bias predictions that should be near zero or one away from these values. Because predictions are restricted to the interval [0,1], errors caused by variance tend to be one-sided near zero and one. For example, if a model should predict p = 0 for a case, the only way bagging can achieve this is if all bagged trees predict zero. If we add noise to the trees that bagging is averaging over, this noise will cause some trees to predict values larger than 0 for this case, thus moving the average prediction of the bagged ensemble away from 0. We observe this effect most strongly with random forests because the base-level trees trained with random forests have relatively high variance due to feature subseting." As a result, the calibration curve shows a characteristic sigmoid shape, indicating that the classifier could trust its "intuition" more and return probabilties closer to 0 or 1 typically. A post-processing such as Platt-calibration, which fits a sigmoid to the probabilities on a separate calibration dataset, would typically help if the calibration curve is sigmoid.
• The scores of a Support Vector Classification (SVC), which are linearly related to the distance of the sample from the hyperplane, show a similar but even stronger effect as the Random Forest. This is not too surprising as the scores are in no sense probabilties and must not be interpreted as such as the curve shows.
• One alternative to Platt-calibration is Isotonic Regression. While Platt-calibration fits a sigmoid, Isotonic Regression fits an arbitrary increasing (isotonic) function. Thus, it has a weaker inducttive bias and can be applied more broadly (also in situations where the calibration curve is not sigmoid). The downside is that it typically requires more calibration data because its inductive bias is weaker. This can also be seen in the SVC + IR curve: While the sigmoid shape of the pure SVC scores is removed and the calibration curve does not show a clear bias, it is quite noisy, indicating much variance. Thus, the used calibration dataset (even thoug 4 times larger than the training data) is too small in this case.

For a further discussion and more extensive experiments, please refer to Niculescu-Mizil and Caruana.

This post was written as an IPython notebook. You can download this notebook.

## August 15, 2014

The deadlines ringing do remind that 3 months have flown past in a jiffy. Time to bid sayonara! Good times seem to roll by too soon, always; sad thing! Nevertheless it has been a great experience under a brilliant mentoring. The post has come up much later the previous one. Things have taken shape since then.

1. Ellipsoidal Harmonics:
Ellipsoidal harmonic functions, the first kind; also known as Lames functions have been implemented in Cython and had been integrated with ufuncs.
Ellipsoidal Harmonic function of the second kind and the calculation normalization constant for Lames function were implemented as .pyx files due to the involvement of global variables. The calculation of normalization constant was implemented in 2 different ways, using integration and using recurrence. Though recurrence seemed to be the more basic and faster way of implementation, the numerical stability wasn't that good; so we adopted integration.
The process involved many new things to me, like the integration of awesome LAPACK library, the speed and awesomeness of Cython etc!
The pull request is here: https://github.com/scipy/scipy/pull/3811
2. Hypergeometric functions:
The present implementation of hypergeometric functions is buggy. For real values of x, C implementation of the function from Cephes library has few errors while the FORTRAN implementation for complex values of x suffers with errors for much wider domain. There has been an attempt to re-implement the function in Cython and make it less ridden with errors. Though the shortage of time denied a bug-free implementation a few bugs have been removed successfully.
The implementation so far has been posted here

I would yet again stress on the fact that the flipping of bits this summer has been of great fun and well as a great skill and knowledge booster. Never was my summer so productive!

Signing off with loads of great memories and experiences
Janani

### Vighnesh Birodkar

#### coins

A lot of Image Processing algorithms are based on intuition from visual cues. Region Adjacency Graphs would also benefit if they were somehow drawn back on the images they represent. If we are able to see the nodes, edges, and the edges weights, we can fine tune our parameters and algorithms to suit our needs. I had written a small hack in this blog post to help better visualize the results. Later, Juan suggested I port if for scikit-image. It will indeed be a very helpful tool for anyone who wants to explore RAGs in scikit-image.

## Getting Started

You will need to pull for this Pull Request to be able to execute the code below. I’ll start by defining a custom show_image function to aid displaying in IPython notebooks.

from skimage import graph, data, io, segmentation, color
from matplotlib import pyplot as plt
from skimage.measure import regionprops
import numpy as np
from matplotlib import colors

def show_image(img):
width = img.shape[1] / 50.0
height = img.shape[0] * width/img.shape[1]
f = plt.figure(figsize=(width, height))
plt.imshow(img)


We will start by loading a demo image just containing 3 bold colors to help us see how the draw_rag function works.

image = io.imread('/home/vighnesh/Desktop/images/colors.png')
show_image(image)


We will now use the SLIC algorithm to give us an over-segmentation, on which we will build our RAG.

labels = segmentation.slic(image, compactness=30, n_segments=400)


Here’s what the over-segmentation looks like.

border_image = segmentation.mark_boundaries(image, labels, (0, 0, 0))
show_image(border_image)


## Drawing the RAGs

We can now form out RAG and see how it looks.

rag = graph.rag_mean_color(image, labels)
out = graph.draw_rag(labels, rag, border_image)
show_image(out)


In the above image, nodes are shown in yellow whereas edges are shown in green. Each region is represented by its centroid. As Juan pointed out, many edges will be difficult to see because of low contrast between them and the image, as seen above. To counter this we support the desaturate option. When set to True the image is converted to grayscale before displaying. Hence all the image pixels are a shade of gray, while the edges and nodes stand out.

out = graph.draw_rag(labels, rag, border_image, desaturate=True)
show_image(out)


Although the above image does very well to show us individual regions and their adjacency relationships, it does nothing to show us the magnitude of edges. To give us more information about the magnitude of edges, we have the colormap option. It colors edges between the first and the second color depending on their weight.

blue_red = colors.ListedColormap(['blue', 'red'])
out = graph.draw_rag(labels, rag, border_image, desaturate=True,
colormap=blue_red)
show_image(out)


As you can see, the edges between similar regions are blue, whereas edges between dissimilar regions are red. draw_rag also accepts a thresh option. All edges above thresh are not considered for drawing.

out = graph.draw_rag(labels, rag, border_image, desaturate=True,
colormap=blue_red, thresh=10)
show_image(out)


Another clever trick is to supply a blank image, this way, we can see the RAG unobstructed.

cyan_red = colors.ListedColormap(['cyan', 'red'])
out = graph.draw_rag(labels, rag, np.zeros_like(image), desaturate=True,
colormap=cyan_red)
show_image(out)


Ahhh, magnificent.

Here is a small piece of code which produces a typical desaturated color-distance RAG.

image = data.coffee()
labels = segmentation.slic(image, compactness=30, n_segments=400)
rag = graph.rag_mean_color(image, labels)
cmap = colors.ListedColormap(['blue', 'red'])
out = graph.draw_rag(labels, rag, image, border_color=(0,0,0), desaturate=True,
colormap=cmap)
show_image(out)


If you notice the above image, you will find some edges crossing over each other. This is because, some regions are convex. Hence their centroid lies outside their boundary and edges emanating from it can cross other edges.

## Examples

I will go over some examples of RAG drawings, since most of it is similar, I won’t repeat the code here. The Ncut technique, wherever used, was with its default parameters.

### Color distance RAG of Coffee after applying NCut

Notice how the centroid of the white rim of the cup is placed at its centre. It is the one adjacent to the centroid of the gray region of the upper part of the spoon, connected to it via a blue edge. Notice how this edge crosses others.

## Further Improvements

• A point that was brought up in the PR as well is that thick lines would immensely enhance the visual
appeal of the output. As and when they are implemented, rag_draw should be modified to support drawing
thick edges.
• As centroids don’t always lie in within an objects boundary, we can represent regions by a point other than their centroid, something which always lies within the boundary. This would allow for better visualization of the actual RAG from its drawing.

## August 12, 2014

### Matthieu Brucher

#### Announcement: Audio TK 0.5.0

Audio Toolkit is now almost ready for its first stable release. Its content will now move toward more advanced DSP algorithms (zero delay filters, amplifiers).

Changelog:

0.5.0
* Renamed slope attribute to ratio for Gain Compressor and Expander Filters
* Renamed the Chamberlin filter
* Added a StereoUniversalFixedDelayLineFilter that can make mix two channels together with different delay for each channel with Python wrappers
* Added a GainLimiterFilter (maximum ratio) with Python wrappers
* Added a MaxFilter with Python wrappers
* Added a DryWetFilter with Python wrappers

0.4.2
* Bug fixes

0.4.1
* Added a PipelineGlobalSinkFilter with Python wrapper
* Changed the MiddleSideFilter scale (no more dividing by 2 in the code)
* Added a second order all pass filter with Python wrappers

Other Amount:

## August 11, 2014

### Richard Tsai

#### GSoC2014: Recent progress

Hi! It has been several weeks since I talked about my work last time. In the past serveral weeks I mainly worked on the optimization of cluster.hierarchy.

The most important optimization is the SLINK alogrithm1 for single linkage. The naive hierarchical agglomerative clustering (HAC) algorithm has a $$O(n ^ 3)$$ time complexity, while SLINK is $$O(n ^ 2)$$ and very easy to implement (even easier than the naive algorithm).

### The Pointer Representation

SLINK is very good in performance but requires a special linkage representation – the pointer representation, which is different from what cluster.hierarchy is using. The pointer representation can be described as follows.

• A cluster is represented by the member with the largest index
• $$\Pi[i] (i \in [0, n - 1])$$ is the first cluster that cluster $$i$$ joins, $$\Lambda[i] (i \in [0, n - 1]$$ is the distance between cluster $$i$$ and cluster $$\Pi[i]$$ when they join

For example, the pointer representation of the following dendrogram is

• $$\Pi[i] = \{6, 3, 3, 5, 9, 9, 8, 9, 9, 9\}$$
• $$\Lambda[i] = \{1.394, 0.419, 0.831, 1.561, 3.123, 10.967, 1.633, 1.198, 4.990, \infty\}$$

### Implementation

The implementation of SLINK is very simple. There’s a pesudo-code in the orignal paper. The following Cython code need two pre-defined function condensed_index, which calculate the index of element (i, j) in a square condensed matrix, and from_pointer_representation, which convert the pointer representation to what you need.

def slink(double[:] dists, double[:, :] Z, int n):
cdef int i, j
cdef double[:] M = np.ndarray(n, dtype=np.double)
cdef double[:] Lambda = np.ndarray(n, dtype=np.double)
cdef int[:] Pi = np.ndarray(n, dtype=np.int32)

Pi[0] = 0
Lambda[0] = NPY_INFINITYF
for i in range(1, n):
Pi[i] = i
Lambda[i] = NPY_INFINITYF

for j in range(i):
M[j] = dists[condensed_index(n, i, j)]

for j in range(i):
if Lambda[j] >= M[j]:
M[Pi[j]] = min(M[Pi[j]], Lambda[j])
Lambda[j] = M[j]
Pi[j] = i
else:
M[Pi[j]] = min(M[Pi[j]], M[j])

for j in range(i):
if Lambda[j] >= Lambda[Pi[j]]:
Pi[j] = i

from_pointer_representation(Z, Lambda, Pi, n)


### Performance

On a N = 2000 dataset, the improvement is significant.

In [20]: %timeit _hierarchy.slink(dists, Z, N)
10 loops, best of 3: 29.7 ms per loop

In [21]: %timeit _hierarchy.