December 17, 2014

PythonXY

Python(x, y) 2.7.9.0 Released!

Hi All,

I'm happy to announce that Python(x, y) 2.7.9.0 is available for immediate download from any of the mirrors. The full change log can be viewed here. Please post your comments and suggestions to the mailing list.

Work on the 64 bit & 3.x versions progresses slowly but surely. They will happen eventually :)

What's new in general:

  • The official Visual C++ Compiler for Python 2.7 has replaced MinGW as the bundled compiler. Finally, everyone can enjoy native Python binary extensions for win32. MinGW is still available as an additional plugin.
  • Console2 has been replaced with ConsoleZ - an actively developed fork which adds split views and many other improvements.
  • Maintaining a 2/3 code base is much easier with the inclusion of python-modernize. Also, all backports have been updated and expanded with unittest2.
  • All packages updated to their latest versions (at the time) - IPython, GDAL, SciPy, ITK, VTK, SQLAlchemy, Astropy, SWIG, etc.
  • User local install support has been revived - it is still experimental, though.
  • Numpy has been held back at 1.8.2 to avoid potential compatibility issues.

New noteworthy packages:

  • cvxpy - A domain-specific language for modeling convex optimization problems in Python (see the short sketch after this list).
  • grin - Utility which searches directories of source code better than grep or find.
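
To give a flavor of cvxpy, here is a minimal sketch of my own (not part of the release notes; the variables and numbers are just for illustration) that solves a tiny convex problem:

import cvxpy as cp

# Minimize a simple quadratic objective subject to linear constraints.
x = cp.Variable()
y = cp.Variable()
prob = cp.Problem(cp.Minimize(cp.square(x - 1) + cp.square(y + 2)),
                  [x + y == 1, x >= 0])
prob.solve()
print(x.value, y.value)  # optimal values found by the solver
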
Have fun!

-Gabi Davar

by Gabi Davar (noreply@blogger.com) at December 17, 2014 10:27 PM

December 16, 2014

Titus Brown

The post-apocalyptic world of binary containers

The apocalypse is nigh. Soon, binary executables and containers in object stores will join the many Web-based pipelines and the several virtual machine images on the dystopic wasteland of "reproducible science."


Anyway.

I had a conversation a few weeks back with a senior colleague about container-based approaches (like Docker) wherein they advocated the shipping around of executable binary blobs with APIs. I pointed out that blobs of executable code were not a good replacement for understanding or a path to progress (see my blog post on that) and they vehemently disagreed, saying that they felt it was an irrelevant point to the progress of science.

That made me sad.

One of the things I really like about Docker is that the community emphasizes devops-style clean installs and configurations over deployment and distribution of binary blobs (images, VMs, etc.). Let's make sure to keep that; I think it's important for scientific progress to be able to remix software.

I'll just close with this comment:

The issue of whether I can use your algorithm is largely orthogonal to the issue of whether I can understand your algorithm. The former is engineering progress; the latter is scientific progress.

--titus

p.s. While I do like to pick on the Shock/Awe/MG-RAST folk because their pipeline is utterly un-reusable by anyone, anywhere, ever, I am being extremely unfair in linking to their paper as part of this blog post. They're doing something neat that I am afraid will ultimately lead in bad directions, but they're not espousing a binary-only view of the world at all. I'm just being funny. Honest.

p.p.s. Bill Howe and I also agree. So I'm being multiply unfair. I know.

by C. Titus Brown at December 16, 2014 11:00 PM

December 14, 2014

Gaël Varoquaux

PRNI 2016: call for organization

PRNI (Pattern Recognition for NeuroImaging) is an IEEE conference about applying pattern recognition and machine learning to brain imaging. It is a mid-sized conference (about 150 attendees), and is a satellite of OHBM (the annual “Human Brain Mapping” meeting).

The steering committee is calling for bids to organize the conference in June 2016, in Europe, as a satellite of the OHBM meeting in Geneva.

by Gaël Varoquaux at December 14, 2014 04:00 PM

December 11, 2014

Filipe Saraiva

Moving to Stockholm

Stockholm

[This post is off-topic for some Planet readers, sorry for it. I just expect to get some help with free software communities.]

Exciting times are coming to me before the end of the year. Next week (probably) I am going to Stockholm to live for 5 months. I am getting a visiting researcher position at KTH – Royal Institute of Technology as part of my PhD course. I will work with the PSMIX group – Power System Management with related Information Exchange, led by Prof. Lars Nordström.

The subject of my project is the modelling and simulation of smart grid features (a power system plus a layer of communication and decision-making automation) using multi-agent systems. I expect to work with the simulation platform developed by PSMIX, based on Raspberry Pi and the SPADE framework. The platform is described in this paper.

Well, I am very anxious about this trip because of two things: communicating in English and the Swedish winter. The second is my main concern. Gosh, going from the everyday Brazilian summer to the Nordic winter! :'( But ♫ I will survive ♪ (I expect). ;)

If you know someone who can help me with tips about renting an apartment, I will be very glad. Renting accommodation in Stockholm is very hard and expensive.

Thanks and I will send news!

by Filipe Saraiva at December 11, 2014 12:41 PM

December 08, 2014

Philip Herron

redbrain812

Cython is an epic project, let's just be up front about it. I make, and have made, some use of it professionally in previous projects.

There is one thing that keeps coming up and it really bugs me: “Cython to speed up your Python programs”. This, in my opinion, is a misrepresentation of what Cython is, but I am, however, being “hyper-critical”.

Cython has many uses, and I have illustrated more of them than most, with different cases aimed at the hacker. Extending pure C/C++ projects directly, without writing any C/C++ and doing it all in Python, is where it can seriously shine, and it is insane how powerful it can become if you set up your project accordingly. My frustration comes from some small criticism that I don't really talk about speeding up _your_ Python code.

There are several reasons why I seriously avoided talking about this in the book and in my recent Cython article for Linux Format. But to be clear, you need to understand what Cython is and what it does.

Cython provides a way of natively working with C/C++ types and code. The side effect is that you can compile Cython code - which just so happens to look a lot like Python - to be really efficient, stripping out the use of the Python runtime since you are directly using C types. This is fine, but let's be clear: Cython and Python are two very different programming languages; it just so happens that Cython can compile (most) Python code, but Python cannot run Cython code. This is essentially the problem.

So when you brand Cython as a way to speed up Python programs, you are essentially saying that you need to re-write your code in Cython. And outside of toy examples of single Cython files it will not work. Remember, Cython compiles single files into single Python modules, meaning you cannot point Cython at a package or set of modules and compile it all down magically without some serious downsides for project structure and common sense.
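
To make the single-file point concrete, here is a minimal build sketch of my own (the module name fastmath.pyx is hypothetical, not something from the book or the article): each .pyx file becomes exactly one importable extension module.

# setup.py - compile one Cython source file into one Python extension module
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("fastmath.pyx"))

After running something like python setup.py build_ext --inplace, import fastmath works like any other module - but only for that one file; a whole package does not get compiled this way.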

Cython just so happens to fit extremely well into scientific computing, which is an entirely different topic from normal software engineering; scientific code is an entirely different beast of engineering. In the scientific world, having some Python scripts alongside single files that do heavy computation - using Cython to work with C types and compiling them as shared modules to be imported into Python code - works extremely well. Even on lists or maps, scientific people will use the libpython API to directly access data.

I feel there were two sides to the coin: those who are scientific, who will want this material, and another group of people who want to know what they can do with Cython in their code (games, servers, applications) - really anything outside of statistical computation programs. Not only that, but a solid understanding of the uses I demonstrate will make this side seem trivial, apart from the more fine-grained ways of making things just that little bit faster using Cython and libpython specifics, which would fit better in a paper than in a technical book since it would be subject to much more change.


by redbrain812 at December 08, 2014 03:37 PM

Continuum Analytics

Bokeh 0.7 Released!

We are excited to announce the release of version 0.7 of Bokeh, an interactive web plotting library for Python! This release includes many major new features:

  • IPython widgets and animations without a Bokeh server
  • Touch UI working for tools on mobile devices
  • Vastly improved linked data table
  • More new (and improving) bokeh.charts (high level charting interface)
  • Color mappers on the python side
  • Improved toolbar
  • Many new tools: lasso, poly, and point selection, crosshair inspector
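
As a quick taste, here is a minimal sketch of my own using the bokeh.plotting interface (not taken from the release notes); it writes a standalone HTML file, so no Bokeh server is needed:

from bokeh.plotting import figure, output_file, show

output_file("lines.html")                  # standalone HTML output
p = figure(title="simple line example")    # create a figure
p.line([1, 2, 3, 4, 5], [6, 7, 2, 4, 5])   # add a line glyph
show(p)                                    # open the result in a browser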

by Bryan Van de Ven at December 08, 2014 08:30 AM

December 07, 2014

Titus Brown

Letter of resignation


Dear <chairs>,

I am resigning my Assistant Professor position at Michigan State University effective January 2nd, 2015.

Sincerely,

CTB.


Anticipated FAQ:

  • Why? I'm moving to UC Davis.
  • Do you have an employment contract with UC Davis?? Nope. But I'm starting there in January, anyway. Or that's the plan. And yes, that's how this kind of thing happens. I'm carefully refraining from geopolitical commentary on Twitter at the moment :).
  • Do you have tenure there? Nope. Wheels are still turning.
  • Aren't you nervous about that? Not really. Even if I don't get tenure, they've promised to hire me. And if that doesn't work out, someone will hire me... right? Guys? Guys? Why did it go silent all of a sudden?
  • Why didn't you use 'I am hereby...' at the beginning? I don't know, seemed pretentious.

--titus

by C. Titus Brown at December 07, 2014 11:00 PM

December 04, 2014

Fabian Pedregosa

Data-driven hemodynamic response function estimation

My latest research paper[1] deals with the estimation of the hemodynamic response function (HRF) from fMRI data.

This is an important topic since the knowledge of a hemodynamic response function is what makes it possible to extract the brain activation maps that are used in most of the impressive applications of machine learning to fMRI, such as (but not limited to) the reconstruction of visual images from brain activity [2] [3] or the decoding of numbers [4].

Besides the more traditional paper that describes the method, I've put online the code I used for the experiments. The code at this stage is far from perfect but it should help in reproducing the results or improving the method. I've also put online an ipython notebook with the analysis of a small piece of data. I'm obviously glad to receive feedback/bug reports/patches for this code.


  1. Pedregosa, Fabian, et al. "Data-driven HRF estimation for encoding and decoding models." NeuroImage (2014). Also available here as an arXiv preprint.

  2. Miyawaki, Yoichi, et al. "Visual Image Reconstruction from Human Brain Activity using a Combination of Multiscale Local Image Decoders." Neuron (2008).

  3. Kay, Kendrick N., et al. "Identifying natural images from human brain activity." Nature 452.7185 (2008).

  4. Eger, Evelyn, et al. "Deciphering cortical number coding from human brain activity patterns." Current Biology 19.19 (2009).

by Fabian Pedregosa at December 04, 2014 11:00 PM

December 03, 2014

Spyder

Hi all,

On behalf of Spyder's development team, I'm pleased to announce that Spyder 2.3.2 has been released and is available for Windows XP/Vista/7/8, GNU/Linux and Mac OS X: https://bitbucket.org/spyder-ide/spyderlib/downloads

This release represents more than 2 months of development since 2.3.1 and introduces major enhancements and new features:

- Editor
  - Improve cells visualization
  - Add support for drag selection and improve look of line number area
  - Open any text file present in the Variable Explorer directly in the Editor
  - View and edit IPython notebooks as JSON files
  - Syntax highlighting for JSON and YAML files

- Variable Explorer:
  - Import CSV files as Pandas DataFrames
  - Improve browsing speed for NumPy arrays and DataFrames with more than 1e5 elements

- IPython Console
  - Add a stop button to easily stop computations

We fixed almost 40 bugs, merged 13 pull requests from 8 authors and added about 150 commits between these two releases.

This is a very important bugfix release which solves a lot of unicode problems in our consoles, the variable explorer and the main interface, so everyone is encouraged to update.

For a full list of fixes, see our changelog.

by Carlos (noreply@blogger.com) at December 03, 2014 01:26 PM

December 01, 2014

Titus Brown

What version of Python should we use for computational science courses?

Brian O'Shea (a physics prof at Michigan State) asked me the following, and I thought I'd post it on my blog to get a broader set of responses. I know the answer is "Python 3", but I would appreciate specific thoughts from people with experience either with the specific packages below, or in teaching Python 3 more generally.

(For the record, I continue to teach and develop in Python 2, but each year comes closer to switching to Python 3...)


We're going to start teaching courses for the new CMSE [computational science] department next year, and we're using Python as the language. I need to decide whether to teach it using python 2.x or python 3.x. I'm curious about which one you have chosen to use when teaching your own courses, and why you made that choice. (Also, it'd be interesting to know if/why you regret that choice?)

The intro courses are aimed at students who plan to use computational and data science for research, other classes, and ultimately in their academic/industrial careers. We anticipate that it'll mostly be science/math and engineering students in the beginning, but there's significant interest from social science and humanities folks on campus. Given the audience and goals, my choice of programming language is fairly utilitarian - we want to introduce students to standard numerical packages (numpy, scipy, matplotlib, h5py, pandas, ipython) and also some of the discipline-specific packages, and get them into the mindset of "I have a problem, somebody has probably already written a tool to address this, let's not reinvent the wheel." So, I want to choose the version of Python that's likely to work well with pretty much any relatively widely-used Python package. My impression, based on a variety of blog posts and articles that I've found, is that the mainstream libraries work just fine with Python 3 (e.g., matplotlib), but a lot of other stuff simply doesn't work at this point.

This course is going to be the gateway course for our new minor/major, and a lot of later courses will be based on it (the graduate version of the course will be the gateway course for the certificate, and presumably taken by lots of grad students here at MSU). I'd like to make the most sensible choice given that we'll be creating course materials that will be used by other faculty, and which may stick around for a while... Anyway, any thoughts you have would be appreciated!


Brian sent me these links as examples of the discussion:

http://sjbyrnes.com/?page_id=67

http://cyrille.rossant.net/whats-wrong-with-scientific-python/

https://jakevdp.github.io/blog/2013/01/03/will-scientists-ever-move-to-python-3/

http://nbviewer.ipython.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-0-Scientific-Computing-with-Python.ipynb


My strongest personal advice at this point is that Brian should invest in the Anaconda distribution as a teaching foundation.

Thoughts? Comments?

--titus

by C. Titus Brown at December 01, 2014 11:00 PM

November 29, 2014

Titus Brown

Annotating papers with pipeline steps - suggestions?

A few months ago, I wrote a short description of how we make our papers replicable in the lab. One problem with this process is that for complex pipelines, it's not always obvious how to connect a number in the paper to the steps in the pipeline that produced it -- there are lots of files and outputs involved in all of this. You can sometimes figure it out by looking at the numbers and correlating with what's in the paper, and/or trying to understand the process from the bottom up, but that's quite difficult. Even my students and I (who can meet in person) sometimes have trouble tracking things down quickly.

Now, on my latest paper, I've started annotating results in the paper source code with Makefile targets. For example, if you are interested in how we got these results, you can use the comment in the LaTeX file to go straight to the associated target in the Makefile.

That's pretty convenient, but then I got to thinking -- how do we communicate this to the reader more directly? Not everyone wants to go to the github repo, read the LaTeX source, and then go find the relevant target in the Makefile (and "not everyone" is a bit of an understatement :). But there's no reason we couldn't link directly to the Makefile target in the PDF, is there? And, separately, right now it is a reasonably large burden to copy the results from the output of the scripts into the LaTeX file. Surely there's a way to get the computer to do this, right?

So, everyone -- two questions!

First, assuming that I'm going the nouveau traditional publishing route of producing a PDF for people to download from a journal site, is there a good, or suggested, or even better yet journal-supported way to link directly to the actual computational methods? (Yes, yes, I know it's suboptimal to publish into PDFs. Let me know when you've got that figured out, and I'll be happy to try it out. 'til then kthxbye.) I'd like to avoid intermediaries like individual DOIs for each method, if at all possible; direct links to github FTW.

Ideally, it would make it through the publishing process. I could, of course, make it available as a preprint, and point interested people at that.

Second, what's a good way to convey results from a script into a LaTeX document (or, more generally, any text document)? With LaTeX I could make use of the 'include' directive; or I could do something with templating; or...?

Well, and third, is there any (good, scriptable) way to do both (1) and (2) at the same time?
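
For the templating option, here is the kind of minimal sketch I have in mind (the file names and result names are made up for illustration): the pipeline writes its key numbers out as LaTeX macros, and the paper pulls them in with an \input directive.

# write_results.py - hypothetical post-processing step run from the Makefile
results = {"totalreads": 1234567, "meancoverage": 42.7}  # made-up numbers

with open("results.tex", "w") as fp:
    for name, value in results.items():
        # produces e.g. \newcommand{\totalreads}{1234567}
        fp.write("\\newcommand{\\%s}{%s}\n" % (name, value))

The paper then does \input{results} and uses \totalreads in the text, so the numbers always come from the pipeline rather than from copy-and-paste.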

Anyway, hit me with your wisdom; would love to hear that there's already a good solution!

--titus

by C. Titus Brown at November 29, 2014 11:00 PM

November 21, 2014

Filipe Saraiva

Presenting a Season of KDE 2014 student – Minh Ngo

Season of KDE is an outreach program hosted by the KDE community. This year I am working as a mentor on a long-requested project related to Cantor – the development of a Python 3 backend. You can read more about Cantor in my blog (texts in English and Portuguese). So, let's say welcome and good luck to Minh Ngo, the student behind this project!

Hi,

My name is Minh,

Minh Ngo

I’m BSc graduated student. I’m Vietnamese, but unlike other Vietnamese students spent most of my life in Ukraine. Currently, I’m preparing myself to the Master degree that will start in the next semester.

Open source is my free-time hobby, so I would like to make something that is useful for the community. Previously, I participated in the GSoC 2013 program and in several open source projects. Some of my personal projects are available on my github page https://github.com/Ignotus; they are not as popular as other cool projects, but several are used by other people, and this fact makes me very happy :).

Cantor is one of the opportunities to spend time creating a useful thing and to win an exclusive KDE T-shirt :). I decided to start my contribution with the Python 3 backend because, a few months ago, I studied several courses related to Machine Learning, so I was looking for a stable desktop frontend for IPython. I do not entirely like the notebook version of IPython, and its qtconsole version doesn't satisfy me in terms of functionality, so I decided to find some existing frontend for IPython that I could tune for myself. And the story with Cantor began after that :)

Happy hacking!

by Filipe Saraiva at November 21, 2014 04:32 PM

November 19, 2014

Matthew Rocklin

Blaze Datasets

tl;dr Blaze aids exploration by supporting full databases and collections of datasets.

This post was composed using Blaze version 0.6.6. You can follow along with the following conda command.

conda install -c blaze blaze=0.6.6

When we encounter new data we need to explore broadly to find what exists before we can meaningfully perform analyses. The tools for this task are often overlooked.

This post outlines how Blaze explores collections and hierarchies of datasets, cases where one entity like a database or file or directory might hold many tables or arrays. We use examples from HDF5 files and SQL databases. Blaze understands how the underlying libraries work so that you don’t have to.

Motivating problem - Intuitive HDF5 File Navigation

For example, if we want to understand the contents of this set of HDF5 files encoding meteorological data then we need to navigate a hierarchy of arrays. This is common among HDF5 files.

Typically we navigate these files in Python with h5py or pytables.

>>> import h5py
>>> f = h5py.File('OMI-Aura_L2-OMAERO_2014m1105t2304-o54838_v003-2014m1106t215558.he5')

HDF5 files organize datasets with an internal file system. The h5py library accesses this internal file system through successive calls to .keys() and item access.

>>> f.keys()
['HDFEOS', 'HDFEOS INFORMATION']

>>> f['/HDFEOS'].keys()
['ADDITIONAL', 'SWATHS']

>>> f['/HDFEOS/SWATHS'].keys()
['ColumnAmountAerosol']

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol'].keys()
['Data Fields', 'Geolocation Fields']

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol/Data Fields'].keys()
['AerosolIndexUV',
 'AerosolIndexVIS',
 'AerosolModelMW',
 'AerosolModelsPassedThreshold',
 'AerosolOpticalThicknessMW',
 ...

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol/Data Fields/TerrainPressure']
<HDF5 dataset "TerrainPressure": shape (1643, 60), type "<i2">

>>> f['/HDFEOS/SWATHS/ColumnAmountAerosol/Data Fields/TerrainPressure'][:]
array([[1013, 1013, 1013, ..., 1013, 1013, 1013],
       [1013, 1013, 1013, ..., 1013, 1013, 1013],
       [1013, 1013, 1013, ..., 1013, 1013, 1013],
       ...,
       [1010,  992,  986, ..., 1013, 1013, 1013],
       [ 999,  983,  988, ..., 1013, 1013, 1013],
       [1006,  983,  993, ..., 1013, 1013, 1013]], dtype=int16)

This interaction between programmer and interpreter feels like a long and awkward conversation.

Blaze improves the exploration user experience by treating the entire HDF5 file as a single Blaze variable. This allows users to both explore and compute on larger collections of data in the same workflow. This isn’t specific to HDF5 but works anywhere many datasets are bundled together.

Exploring a SQL Database

For example, a SQL database can be viewed as a collection of tables. Blaze makes it easy to access a single table in a database using a string URI specifying both the database and the table name.

>>> iris = Data('sqlite:////my.db::iris')  # protocol://database::table-name
>>> iris
    sepal_length  sepal_width  petal_length  petal_width      species
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
...

This only works if we know what table we want ahead of time. The approach above assumes that the user is already familiar with their data. To resolve this problem we omit the table name and access the database as a variable instead. We use the same interface to access the entire database as we would a specific table.

>>> db = Data('sqlite:////my.db')  # protocol://database
>>> db
Data:       Engine(sqlite:////home/mrocklin/workspace/blaze/blaze/examples/data/iris.db)
DataShape:  {
  iris: var * {
    sepal_length: ?float64,
    sepal_width: ?float64,
    petal_length: ?float64,
    petal_width: ?float64,
    species: ?string
  ...

The db expression points to a SQLAlchemy engine. We print the engine alongside a truncated datashape showing the metadata of the tables in that database. Note that the datashape maps table names to datashapes of those tables, in this case a variable length collection of records with fields for measurements of flowers.

Blaze isn’t doing any of the grunt work here; SQLAlchemy is. SQLAlchemy is a mature Python library that interacts with a wide variety of SQL databases. It provides both database reflection (as we see above) along with general querying (as we see below). Blaze provides a convenient front-end.

We seamlessly transition from exploration to computation. We query for the shortest and longest sepal per species.

>>> by(db.iris.species, shortest=db.iris.sepal_length.min(),
...                      longest=db.iris.sepal_length.max())
           species  longest  shortest
0      Iris-setosa      5.8       4.3
1  Iris-versicolor      7.0       4.9
2   Iris-virginica      7.9       4.9

Blaze doesn’t pull data into local memory; instead it generates SQLAlchemy expressions, which generate SQL, which executes on the foreign database. The (much smaller) result is pulled back into memory and rendered nicely using Pandas.

A Larger Database

Improved metadata discovery on SQL databases overlaps with the excellent work done by yhat on db.py. We borrow their example, the Lahman Baseball statistics database, as a richer database with a greater variety of tables.

>>> lahman = Data('sqlite:///baseball-archive-2012.sqlite')
>>> lahman.dshape  # ask for the entire datashape
dshape("""{
  battingpost: var * {
    yearID: ?int32,
    round: ?string,
    playerID: ?string,
    teamID: ?string,
    lgID: ?string,
    G: ?int32,
    AB: ?int32,
    R: ?int32,
    H: ?int32,
    2B: ?int32,
    3B: ?int32,
    HR: ?int32,
    RBI: ?int32,
    SB: ?int32,
    CS: ?int32,
    BB: ?int32,
    SO: ?int32,
    IBB: ?int32,
    HBP: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32
    },
  awardsmanagers: var * {
    managerID: ?string,
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    tie: ?string,
    notes: ?string
    },
  hofold: var * {
    hofID: ?string,
    yearid: ?int32,
    votedBy: ?string,
    ballots: ?int32,
    votes: ?int32,
    inducted: ?string,
    category: ?string
    },
  salaries: var * {
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    playerID: ?string,
    salary: ?float64
    },
  pitchingpost: var * {
    playerID: ?string,
    yearID: ?int32,
    round: ?string,
    teamID: ?string,
    lgID: ?string,
    W: ?int32,
    L: ?int32,
    G: ?int32,
    GS: ?int32,
    CG: ?int32,
    SHO: ?int32,
    SV: ?int32,
    IPouts: ?int32,
    H: ?int32,
    ER: ?int32,
    HR: ?int32,
    BB: ?int32,
    SO: ?int32,
    BAOpp: ?float64,
    ERA: ?float64,
    IBB: ?int32,
    WP: ?int32,
    HBP: ?int32,
    BK: ?int32,
    BFP: ?int32,
    GF: ?int32,
    R: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32
    },
  managers: var * {
    managerID: ?string,
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    inseason: ?int32,
    G: ?int32,
    W: ?int32,
    L: ?int32,
    rank: ?int32,
    plyrMgr: ?string
    },
  teams: var * {
    yearID: ?int32,
    lgID: ?string,
    teamID: ?string,
    franchID: ?string,
    divID: ?string,
    Rank: ?int32,
    G: ?int32,
    Ghome: ?int32,
    W: ?int32,
    L: ?int32,
    DivWin: ?string,
    WCWin: ?string,
    LgWin: ?string,
    WSWin: ?string,
    R: ?int32,
    AB: ?int32,
    H: ?int32,
    2B: ?int32,
    3B: ?int32,
    HR: ?int32,
    BB: ?int32,
    SO: ?int32,
    SB: ?int32,
    CS: ?int32,
    HBP: ?int32,
    SF: ?int32,
    RA: ?int32,
    ER: ?int32,
    ERA: ?float64,
    CG: ?int32,
    SHO: ?int32,
    SV: ?int32,
    IPouts: ?int32,
    HA: ?int32,
    HRA: ?int32,
    BBA: ?int32,
    SOA: ?int32,
    E: ?int32,
    DP: ?int32,
    FP: ?float64,
    name: ?string,
    park: ?string,
    attendance: ?int32,
    BPF: ?int32,
    PPF: ?int32,
    teamIDBR: ?string,
    teamIDlahman45: ?string,
    teamIDretro: ?string
    },
  teamshalf: var * {
    yearID: ?int32,
    lgID: ?string,
    teamID: ?string,
    Half: ?string,
    divID: ?string,
    DivWin: ?string,
    Rank: ?int32,
    G: ?int32,
    W: ?int32,
    L: ?int32
    },
  fieldingpost: var * {
    playerID: ?string,
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    round: ?string,
    POS: ?string,
    G: ?int32,
    GS: ?int32,
    InnOuts: ?int32,
    PO: ?int32,
    A: ?int32,
    E: ?int32,
    DP: ?int32,
    TP: ?int32,
    PB: ?int32,
    SB: ?int32,
    CS: ?int32
    },
  master: var * {
    lahmanID: ?int32,
    playerID: ?string,
    managerID: ?string,
    hofID: ?string,
    birthYear: ?int32,
    birthMonth: ?int32,
    birthDay: ?int32,
    birthCountry: ?string,
    birthState: ?string,
    birthCity: ?string,
    deathYear: ?int32,
    deathMonth: ?int32,
    deathDay: ?int32,
    deathCountry: ?string,
    deathState: ?string,
    deathCity: ?string,
    nameFirst: ?string,
    nameLast: ?string,
    nameNote: ?string,
    nameGiven: ?string,
    nameNick: ?string,
    weight: ?int32,
    height: ?float64,
    bats: ?string,
    throws: ?string,
    debut: ?string,
    finalGame: ?string,
    college: ?string,
    lahman40ID: ?string,
    lahman45ID: ?string,
    retroID: ?string,
    holtzID: ?string,
    bbrefID: ?string
    },
  fieldingof: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    Glf: ?int32,
    Gcf: ?int32,
    Grf: ?int32
    },
  pitching: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    teamID: ?string,
    lgID: ?string,
    W: ?int32,
    L: ?int32,
    G: ?int32,
    GS: ?int32,
    CG: ?int32,
    SHO: ?int32,
    SV: ?int32,
    IPouts: ?int32,
    H: ?int32,
    ER: ?int32,
    HR: ?int32,
    BB: ?int32,
    SO: ?int32,
    BAOpp: ?float64,
    ERA: ?float64,
    IBB: ?int32,
    WP: ?int32,
    HBP: ?int32,
    BK: ?int32,
    BFP: ?int32,
    GF: ?int32,
    R: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32
    },
  managershalf: var * {
    managerID: ?string,
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    inseason: ?int32,
    half: ?int32,
    G: ?int32,
    W: ?int32,
    L: ?int32,
    rank: ?int32
    },
  appearances: var * {
    yearID: ?int32,
    teamID: ?string,
    lgID: ?string,
    playerID: ?string,
    G_all: ?int32,
    G_batting: ?int32,
    G_defense: ?int32,
    G_p: ?int32,
    G_c: ?int32,
    G_1b: ?int32,
    G_2b: ?int32,
    G_3b: ?int32,
    G_ss: ?int32,
    G_lf: ?int32,
    G_cf: ?int32,
    G_rf: ?int32,
    G_of: ?int32,
    G_dh: ?int32,
    G_ph: ?int32,
    G_pr: ?int32
    },
  halloffame: var * {
    hofID: ?string,
    yearid: ?int32,
    votedBy: ?string,
    ballots: ?int32,
    needed: ?int32,
    votes: ?int32,
    inducted: ?string,
    category: ?string
    },
  awardsplayers: var * {
    playerID: ?string,
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    tie: ?string,
    notes: ?string
    },
  schoolsplayers: var * {
    playerID: ?string,
    schoolID: ?string,
    yearMin: ?int32,
    yearMax: ?int32
    },
  schools: var * {
    schoolID: ?string,
    schoolName: ?string,
    schoolCity: ?string,
    schoolState: ?string,
    schoolNick: ?string
    },
  teamsfranchises: var * {
    franchID: ?string,
    franchName: ?string,
    active: ?string,
    NAassoc: ?string
    },
  allstarfull: var * {
    playerID: ?string,
    yearID: ?int32,
    gameNum: ?int32,
    gameID: ?string,
    teamID: ?string,
    lgID: ?string,
    GP: ?int32,
    startingPos: ?int32
    },
  tmp_batting: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    teamID: ?string,
    lgID: ?string,
    G: ?int32,
    G_batting: ?int32,
    AB: ?int32,
    R: ?int32,
    H: ?int32,
    2B: ?int32,
    3B: ?int32,
    HR: ?int32,
    RBI: ?int32,
    SB: ?int32,
    CS: ?int32,
    BB: ?int32,
    SO: ?int32,
    IBB: ?int32,
    HBP: ?int32,
    SH: ?int32,
    SF: ?int32,
    GIDP: ?int32,
    G_old: ?int32
    },
  seriespost: var * {
    yearID: ?int32,
    round: ?string,
    teamIDwinner: ?string,
    lgIDwinner: ?string,
    teamIDloser: ?string,
    lgIDloser: ?string,
    wins: ?int32,
    losses: ?int32,
    ties: ?int32
    },
  awardssharemanagers: var * {
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    managerID: ?string,
    pointsWon: ?int32,
    pointsMax: ?int32,
    votesFirst: ?int32
    },
  awardsshareplayers: var * {
    awardID: ?string,
    yearID: ?int32,
    lgID: ?string,
    playerID: ?string,
    pointsWon: ?float64,
    pointsMax: ?int32,
    votesFirst: ?float64
    },
  fielding: var * {
    playerID: ?string,
    yearID: ?int32,
    stint: ?int32,
    teamID: ?string,
    lgID: ?string,
    POS: ?string,
    G: ?int32,
    GS: ?int32,
    InnOuts: ?int32,
    PO: ?int32,
    A: ?int32,
    E: ?int32,
    DP: ?int32,
    PB: ?int32,
    WP: ?int32,
    SB: ?int32,
    CS: ?int32,
    ZR: ?float64
    }
  }""")

Seeing at once all the tables in the database, all the columns in those tables, and all the types of those columns provides a clear and comprehensive overview of our data. We represent this information as a datashape, a type system that covers everything from NumPy arrays to SQL databases and Spark collections.

We use standard Blaze expressions to navigate more deeply. Things like auto-complete work fine.

>>> lahman.<tab>
lahman.allstarfull          lahman.fieldingpost         lahman.pitchingpost
lahman.appearances          lahman.fields               lahman.relabel
lahman.awardsmanagers       lahman.halloffame           lahman.salaries
lahman.awardsplayers        lahman.hofold               lahman.schema
lahman.awardssharemanagers  lahman.isidentical          lahman.schools
lahman.awardsshareplayers   lahman.like                 lahman.schoolsplayers
lahman.battingpost          lahman.managers             lahman.seriespost
lahman.data                 lahman.managershalf         lahman.teams
lahman.dshape               lahman.map                  lahman.teamsfranchises
lahman.fielding             lahman.master               lahman.teamshalf
lahman.fieldingof           lahman.pitching             lahman.tmp_batting

And we finish with a fun multi-table computation, joining two tables on year, team, and league and then computing the average salary by team name and year.

>>> joined = join(lahman.teams, lahman.salaries, ['yearID', 'teamID', 'lgID'])

>>> by(joined[['name', 'yearID']], average_salary=joined.salary.mean())
                    name  yearID  average_salary
0         Anaheim Angels    1997  1004370.064516
1         Anaheim Angels    1998  1214147.058824
2         Anaheim Angels    1999  1384704.150000
3         Anaheim Angels    2000  1715472.233333
4         Anaheim Angels    2001  1584505.566667
5         Anaheim Angels    2002  2204345.250000
6         Anaheim Angels    2003  2927098.777778
7         Anaheim Angels    2004  3723506.185185
8   Arizona Diamondbacks    1998   898527.777778
9   Arizona Diamondbacks    1999  2020705.852941
...

Looks good. We compute the result and store it to a CSV file with into:

>>> into('salaries.csv',
...      by(joined[['name', 'yearID']], average_salary=joined.salary.mean()))

(Final result here: salaries.csv)

Beyond SQL

SQL isn’t unique, many systems hold collections or hierarchies of datasets. Blaze supports navigation over Mongo databases, Blaze servers, HDF5 files, or even just dictionaries of pandas DataFrames or CSV files.

>>> d = {'accounts 1': CSV('accounts_1.csv'),
...      'accounts 2': CSV('accounts_2.csv')}

>>> accounts = Data(d)

>>> accounts.dshape
dshape("""{
  accounts 1: var * {id: ?int64, name: string, amount: ?int64},
  accounts 2: var * {id: ?int64, name: string, amount: ?int64}
  }""")

None of this behavior was special-cased in Blaze. The same mechanisms that select a table from a SQL database select a column from a Pandas DataFrame.

Finish with HDF5 example

To conclude we revisit our motivating problem, HDF5 file navigation.

Raw H5Py

Recall that we previously had a long back-and-forth conversation with the Python interpreter.

>>> import h5py
>>> f = h5py.File('OMI-Aura_L2-OMAERO_2014m1105t2304-o54838_v003-2014m1106t215558.he5')

>>> f.keys()
['HDFEOS', 'HDFEOS INFORMATION']

>>> f['/HDFEOS'].keys()
['ADDITIONAL', 'SWATHS']
...

H5Py with Blaze expressions

With Blaze we get a quick high-level overview and an API that is shared with countless other systems.

>>> from blaze import Data
>>> d = Data(f)  # Wrap h5py file with Blaze interactive expression
>>> d
Data:       <HDF5 file "OMI-Aura_L2-OMAERO_2014m1105t2304-o54838_v003-2014m1106t215558.he5" (mode r+)>
DataShape:  {
  HDFEOS: {
    ADDITIONAL: {FILE_ATTRIBUTES: {}},
    SWATHS: {
      ColumnAmountAerosol: {
        Data Fields: {
          AerosolIndexUV: 1643 * 60 * int16,
  ...

By default we see the data and a truncated datashape.

We ask for the datashape explicitly to see the full picture.

>>> d.dshape
dshape("""{
  HDFEOS: {
    ADDITIONAL: {FILE_ATTRIBUTES: {}},
    SWATHS: {
      ColumnAmountAerosol: {
        Data Fields: {
          AerosolIndexUV: 1643 * 60 * int16,
          AerosolIndexVIS: 1643 * 60 * int16,
          AerosolModelMW: 1643 * 60 * uint16,
          AerosolModelsPassedThreshold: 1643 * 60 * 10 * uint16,
          AerosolOpticalThicknessMW: 1643 * 60 * 14 * int16,
          AerosolOpticalThicknessMWPrecision: 1643 * 60 * int16,
          AerosolOpticalThicknessNUV: 1643 * 60 * 2 * int16,
          AerosolOpticalThicknessPassedThreshold: 1643 * 60 * 10 * 9 * int16,
          AerosolOpticalThicknessPassedThresholdMean: 1643 * 60 * 9 * int16,
          AerosolOpticalThicknessPassedThresholdStd: 1643 * 60 * 9 * int16,
          CloudFlags: 1643 * 60 * uint8,
          CloudPressure: 1643 * 60 * int16,
          EffectiveCloudFraction: 1643 * 60 * int8,
          InstrumentConfigurationId: 1643 * uint8,
          MeasurementQualityFlags: 1643 * uint8,
          NumberOfModelsPassedThreshold: 1643 * 60 * uint8,
          ProcessingQualityFlagsMW: 1643 * 60 * uint16,
          ProcessingQualityFlagsNUV: 1643 * 60 * uint16,
          RootMeanSquareErrorOfFitPassedThreshold: 1643 * 60 * 10 * int16,
          SingleScatteringAlbedoMW: 1643 * 60 * 14 * int16,
          SingleScatteringAlbedoMWPrecision: 1643 * 60 * int16,
          SingleScatteringAlbedoNUV: 1643 * 60 * 2 * int16,
          SingleScatteringAlbedoPassedThreshold: 1643 * 60 * 10 * 9 * int16,
          SingleScatteringAlbedoPassedThresholdMean: 1643 * 60 * 9 * int16,
          SingleScatteringAlbedoPassedThresholdStd: 1643 * 60 * 9 * int16,
          SmallPixelRadiancePointerUV: 1643 * 2 * int16,
          SmallPixelRadiancePointerVIS: 1643 * 2 * int16,
          SmallPixelRadianceUV: 6783 * 60 * float32,
          SmallPixelRadianceVIS: 6786 * 60 * float32,
          SmallPixelWavelengthUV: 6783 * 60 * uint16,
          SmallPixelWavelengthVIS: 6786 * 60 * uint16,
          TerrainPressure: 1643 * 60 * int16,
          TerrainReflectivity: 1643 * 60 * 9 * int16,
          XTrackQualityFlags: 1643 * 60 * uint8
          },
        Geolocation Fields: {
          GroundPixelQualityFlags: 1643 * 60 * uint16,
          Latitude: 1643 * 60 * float32,
          Longitude: 1643 * 60 * float32,
          OrbitPhase: 1643 * float32,
          SolarAzimuthAngle: 1643 * 60 * float32,
          SolarZenithAngle: 1643 * 60 * float32,
          SpacecraftAltitude: 1643 * float32,
          SpacecraftLatitude: 1643 * float32,
          SpacecraftLongitude: 1643 * float32,
          TerrainHeight: 1643 * 60 * int16,
          Time: 1643 * float64,
          ViewingAzimuthAngle: 1643 * 60 * float32,
          ViewingZenithAngle: 1643 * 60 * float32
          }
        }
      }
    },
  HDFEOS INFORMATION: {
    ArchiveMetadata.0: string[65535, 'A'],
    CoreMetadata.0: string[65535, 'A'],
    StructMetadata.0: string[32000, 'A']
    }
  }""")

When we see metadata for everything in this dataset all at once, the full structure becomes apparent. We have two collections of arrays, mostly shaped (1643, 60); the collection in Data Fields holds what appear to be weather measurements while the collection in Geolocation Fields holds coordinate information.

Moreover we can quickly navigate this structure to compute relevant information.

>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Geolocation_Fields.Latitude
array([[-67., -67., -67., ..., -61., -60., -60.],
       [-67., -67., -67., ..., -61., -60., -60.],
       [-67., -67., -68., ..., -61., -61., -60.],
       ...,
       [ 69.,  70.,  71., ...,  85.,  84.,  84.],
       [ 69.,  70.,  71., ...,  85.,  85.,  84.],
       [ 69.,  70.,  71., ...,  85.,  85.,  84.]], dtype=float32)

>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Geolocation_Fields.Longitude
array([[  46.,   43.,   40., ...,   -3.,   -5.,   -7.],
       [  46.,   43.,   40., ...,   -3.,   -5.,   -7.],
       [  46.,   43.,   40., ...,   -4.,   -5.,   -7.],
       ...,
       [ 123.,  124.,  124., ..., -141., -131., -121.],
       [ 123.,  124.,  124., ..., -141., -130., -120.],
       [ 123.,  123.,  124., ..., -140., -130., -119.]], dtype=float32)

>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Data_Fields.TerrainPressure
array([[1013, 1013, 1013, ..., 1013, 1013, 1013],
       [1013, 1013, 1013, ..., 1013, 1013, 1013],
       [1013, 1013, 1013, ..., 1013, 1013, 1013],
       ...,
       [1010,  992,  986, ..., 1013, 1013, 1013],
       [ 999,  983,  988, ..., 1013, 1013, 1013],
       [1006,  983,  993, ..., 1013, 1013, 1013]], dtype=int16)

>>> d.HDFEOS.SWATHS.ColumnAmountAerosol.Data_Fields.TerrainPressure.min()
620

Final Thoughts

Often the greatest challenge is finding what you already have.

Discovery and exploration are just as important as computation. By extending Blaze’s expression system to hierarchies of datasets we create a smooth user experience from first introductions to data all the way to analytic queries and saving results.

This work was easy. The pluggable architecture of Blaze made it surprisingly simple to extend the Blaze model from tables and arrays to collections of tables and arrays. We wrote about 40 significant lines of code for each supported backend.

November 19, 2014 12:00 AM

November 18, 2014

Titus Brown

RFC: The khmer project: what are we, and what are our goals?

As we think about the next few years of khmer development, it is helpful to explore what khmer is, and what our goals for khmer development are. This can provide guiding principles for development, refactoring, extension, funding requests, and collaborations.

Comments solicited!


Definition

khmer is an open source project that serves as:

  • a stable research platform for novel CS/bio research on data structures and algorithms, mostly k-mer based;
  • a set of command line tools for various kinds of data transformations;
  • a test bed for software engineering practice in science;
  • a Python library for working with k-mers and graph structures;
  • an exercise in community building in scientific software engineering;
  • an effort to participate in and sustainably grow the bioinformatics ecosystem.

Goals

Our long term goals for khmer, in some rough order of priority, are:

  • Keep khmer versatile and agile enough to easily enable the CS and bio we want to do.

    Practical implications: limit complexity of internals as much as possible.

  • Continue community building.

    Practical implications: run khmer as a real open source project, with everything done in the open; work nicely with other projects.

  • Build, sustain, and maintain a set of protocols and recipes around khmer.

    Practical implications: take workflow design into account.

  • Improve the efficiency (time/memory) of khmer implementations.

    Practical implications: optimize, but not at expense of clean code.

    Some specifics: streaming; variable sized counters.

  • Lower barriers to an increasing user base.

    Practical implications: find actual pain points, address if it’s easy or makes good sense.

    Some specifics: hash function k > 32, stranded hash function, integrate efficient k-mer cardinality counting, implement dynamically sized data structures.

  • Keep khmer technologically up to date.

    Practical implications: transition to Python 3.


--titus

p.s. Thanks to Qingpeng Zhang, Travis Collier, and Matt MacManes for comments in the draft stage!

by C. Titus Brown at November 18, 2014 11:00 PM

Thomas Wiecki

A modern guide to getting started with Data Science and Python

Python has an extremely rich and healthy ecosystem of data science tools. Unfortunately, to outsiders this ecosystem can look like a jungle (cue snake joke). In this blog post I will provide a step-by-step guide to venturing into this PyData jungle.

What's wrong with the many lists of PyData packages out there already, you might ask? I think that providing too many options can easily overwhelm someone who is just getting started. So instead, I will keep a very narrow scope and focus on the 10% of tools that allow you to do 90% of the work. After you have mastered these essentials you can browse the long lists of PyData packages to decide which to try next.

The upside is that the few tools I will introduce already allow you to do most things a data scientist does in his day-to-day (i.e. data i/o, data munging, and data analysis).

Installation

It has happened quite a few times that people came up to me and said "I heard Python is amazing for data science so I wanted to start learning it but spent two days installing Python and all the other modules!". It's quite reasonable to think that you have to install Python if you want to use it but indeed, installing the full PyData stack manually when you don't know which tools you actually need is quite an undertaking. So I strongly recommend against doing that.

Fortunately for you, the fine folks at Continuum have created the Anaconda Python distribution that installs most of the PyData stack for you, and the modules it doesn't provide out of the box can easily be installed via a GUI. The distribution is also available for all major platforms so save yourself the two days and just use that!

IPython Notebook

After Python is installed, most people start by launching it. Again, very reasonable but unfortunately dead wrong. I don't know a single SciPythonista that uses the Python command shell directly (YMMV). Instead, IPython, and specifically the IPython Notebook, are incredibly powerful Python shells that are used ubiquitously in PyData. I strongly recommend you directly start using the IPython Notebook (IPyNB) and don't bother with anything else; you won't regret it. In brief, the IPyNB is a Python shell that you access via your web browser. It allows you to mix code, text, and graphics (even interactive ones). This blog post was written in an IPyNB, and it's rare to find a talk at a Python conference that does not use the IPython Notebook. It comes preinstalled with Anaconda so you can just start using it. Here's an example of what it looks like:

In [1]:
print('Hello World')
Hello World

This thing is a rocket -- every time I hear one of the core devs talk at a conference I am flabbergasted by all the new things they cooked up. To get an idea for some of the advanced capabilities, check out this short tutorial on IPython widgets. These allow you to attach sliders to control a plot interactively:

In [1]:
from IPython.display import YouTubeVideo
YouTubeVideo('wxVx54ax47s') # Yes, it can also embed youtube videos.
Out[1]:
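
As a tiny taste of the widget API (a sketch of my own using the IPython 2.x import location, not code from the linked tutorial), interact turns a plain function into a slider-controlled one:

from IPython.html.widgets import interact  # location of the widgets in IPython 2.x/3.x

def f(x):
    return x * 2

interact(f, x=(0, 10))  # renders an integer slider for x in the notebook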

Pandas

Normally, people recommend you start by learning NumPy (pronounced num-pie, not num-pee!), which is the library that provides multi-dimensional arrays. Certainly this was the way to go a few years ago, but I hardly use NumPy at all today. The reason is that NumPy became more of a core library that's used by other libraries which provide a much nicer interface. Thus, the main library to use for working with data is Pandas. It can input and output data from all kinds of formats (including databases), do joins and other SQL-like operations for shaping the data, handle missing values with ease, support time series, provide basic plotting capabilities and basic statistical functionality, and much more. There is certainly a learning curve to all its features, but I strongly suggest you go through most of the documentation as a first step. Trust me, the time you invest will be paid off a thousand-fold by being more efficient in your data munging. Here are a few quick tricks to whet your appetite:

In [18]:
import pandas as pd

df = pd.DataFrame({ 'A' : 1.,
                    'B' : pd.Timestamp('20130102'),
                    'C' : pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D' : pd.Series([1, 2, 1, 2], dtype='int32'),
                    'E' : pd.Categorical(["test", "train", "test", "train"]),
                    'F' : 'foo' })
In [19]:
df
Out[19]:
   A          B  C  D      E    F
0  1 2013-01-02  1  1   test  foo
1  1 2013-01-02  1  2  train  foo
2  1 2013-01-02  1  1   test  foo
3  1 2013-01-02  1  2  train  foo

Columns can be accessed by name:

In [17]:
df.B
Out[17]:
0   2013-01-02
1   2013-01-02
2   2013-01-02
3   2013-01-02
Name: B, dtype: datetime64[ns]

Compute the sum of D for each category in E:

In [21]:
df.groupby('E').sum().D
Out[21]:
E
test     2
train    4
Name: D, dtype: int32

Doing this in NumPy (or *gasp* Matlab!) would be much more clunky.

There's a ton more. If you're not convinced, check out 10 minutes to pandas where I borrowed this from.
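
Here is one more quick sketch (my own minimal illustration, not taken from the 10 minutes guide) of the SQL-style joins and missing-value handling mentioned above:

import numpy as np
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'rval': [3, 4]})
joined = pd.merge(left, right, on='key')   # SQL-style join on the 'key' column

s = pd.Series([1.0, np.nan, 3.0])
s_filled = s.fillna(s.mean())              # replace missing values with the mean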

Seaborn

The main plotting library of Python is Matplotlib. However, I don't recommend using it directly, for the same reason I don't recommend spending time learning NumPy initially. While Matplotlib is very powerful, it is its own jungle and requires lots of tweaking to make your plots look shiny. So instead, I recommend starting out with Seaborn. Seaborn essentially treats Matplotlib as a core library (just like Pandas does with NumPy). I will briefly illustrate the main benefits of seaborn. Specifically, it:

  1. creates aesthetically pleasing plots by default (for one thing, it does not default to the jet colormap),
  2. creates statistically meaningful plots, and
  3. understands the pandas DataFrame so the two work well together.

While pandas comes prepackaged with anaconda, seaborn is not directly included but can easily be installed with conda install seaborn.

Statistically meaningful plots

In [5]:
%matplotlib inline # IPython magic to create plots within cells
In [7]:
import seaborn as sns

# Load one of the data sets that come with seaborn
tips = sns.load_dataset("tips")

sns.jointplot("total_bill", "tip", tips, kind='reg');

As you can see, with just one line we create a pretty complex statistical plot including the best fitting linear regression line along with confidence intervals, marginals and the correlation coefficients. Recreating this plot in matplotlib would take quite a bit of (ugly) code, including calls to scipy to run the linear regression and manually applying the linear regression formula to draw the line (and I don't even know how to do the marginal plots and confidence intervals off the top of my head). This and the next example are taken from the tutorial on quantitative linear models.

Works well with Pandas DataFrame

Data has structure. Often, there are different groups or categories we are interested in (pandas' groupby functionality is amazing in this case). For example, the tips data set looks like this:

In [9]:
tips.head()
Out[9]:
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

We might ask if smokers tip differently than non-smokers. Without seaborn, this would require a pandas groupby together with the complex code for plotting a linear regression. With seaborn, we can provide the name of the column we wish to split by as the col keyword argument:

In [11]:
sns.lmplot("total_bill", "tip", tips, col="smoker");

Pretty neat, eh?

As you dive deeper you might want to control certain details of these plots at a more fine-grained level. Because seaborn is just calling into matplotlib, you probably will want to start learning that library at that point. For most things I'm pretty happy with what seaborn provides, however.

Conclusions

The idea of this blog post was to provide a very select number of packages which maximize your efficiency when starting with data science in Python.

Acknowledgements

Thanks to Katie Green and Andreas Dietzel for helpful feedback on an earlier draft.

by Thomas Wiecki at November 18, 2014 03:00 PM

Continuum Analytics

Blaze Datasets

tl;dr Blaze aids exploration by supporting full databases and collections of datasets.

by Matt Rocklin at November 18, 2014 08:00 AM

November 14, 2014

Titus Brown

A Saturday morning conversation about publishing inconclusions

Here's an excerpt from an e-mail to a student whose committee I'm on; they were asking me about a comment their advisor had made that they shouldn't put a result in a paper because "It'll confuse the reviewer."

One thing to keep in mind is that communicating the results _is_ important. "It'll confuse the reviewer" is not always shorthand for "well, it's true but we can't say it because it's too confusing", which is how I think you're interpreting your advisor's comment; it is also shorthand for "well, we don't really have the depth of data/information to understand why the result is this way, so there's something more complicated going on, but we can't talk about it because we don't understand it ourselves." These can lead to the most productive bits of a project, if followed up (but it takes a lot of time to do so...)

Their response:

I'm curious, however, if we don't understand what's going on ourselves, shouldn't that be all the more reason to publish something? Because then other people with more knowledge can read it and they may know what's going on? Or at least help us?

And my response:


Well, that's definitely not how it works. Let me see if I can think of some reasons why...

First, it's much easier to get confused than it is to fight your way out of confusion - and you usually learn something in the process, obviously. So you have a lot more people who are confused than have learned something, and the latter is considered more interesting and worthy of communication than the former.

Second, writing papers is a lot of work and takes a lot of time. If you're confused (and hence probably wrong about any conclusions) it's probably not worth the effort...

Third, reading papers is also a lot of work! Most papers are never read seriously by anyone other than the reviewers. So doubling or tripling the number of papers to include confusing or inconclusive data would probably not be helpful.

Fourth, most conclusions require multiple lines of evidence, most of which are often not developed until you have at least some solid hypotheses about why you're seeing what you're seeing from one line of evidence. So a lot of those pubs would be wrong.

Fifth, failure is frequently underdetermined -- that is, we may have wrong or incorrect results and not know it, or not know why. It's rarely worth chasing down exactly why something didn't work unless it doesn't work reproducibly and it's critical and important to your results.

That having been said, there's a movement to make publishing data easier and bring data citations into standard practice, which I think is partly targeted at the same problem you're asking about.


What am I missing?

thanks, --titus

by C. Titus Brown at November 14, 2014 11:00 PM

William Stein

SageMathCloud Notifications are Now Better

I just made live a new notifications system for SageMathCloud, which I spent all week writing.




These notifications are what you see when you click the bell in the upper right. This new system replaces the one I made live two weeks ago. Whenever somebody actively *edits* (using the web interface) any file in any project you collaborate on, a notification will get created or updated. If a person *comments* on any file in any project you collaborate on (using the chat interface to the right), then not only does the notification get updated, there is also a little red counter on top of the bell and in the title of the SageMathCloud tab. In particular, people will now be much more likely to see the chats you make on files.




NOTE: I have not yet enabled any sort of daily email notification summary, but that is planned. 

Some technical details: why did this take all week? It's because the technology that makes it work behind the scenes was fairly difficult for me to figure out how to implement. I implemented a way to create an object that can be used simultaneously by many clients and supports realtime synchronization... but is stored by the distributed Cassandra database instead of a file in a project. Any changes to that object get synchronized around very quickly. It's similar to how synchronized text editing (with several people at once) works, but I rethought differential synchronization carefully, and also figured out how to synchronize using an eventually consistent database.

This will be useful for implementing a lot of other things in SageMathCloud that operate at a different level than "one single project". For example, I plan to add functions so you can access these same "synchronized databases" from Python processes -- then you'll be able to have Sage worksheets (say) running on several different projects, but all saving their data to some common synchronized place (backed by the database). Another application will be a listing of the last 100 (say) files you've opened, with easy ways to store extra info about them. It will also be easy to make account and project settings more realtime, so when you change something, it automatically takes effect and is also synchronized across other browser tabs you may have open.

If you're into modern Single Page App web development, this might remind you of Angular or React or Hoodie or Firebase -- what I did this week is probably kind of like some of the sync functionality of those frameworks, but I use Cassandra (instead of MongoDB, say) and differential synchronization.
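For readers wondering what differential synchronization looks like in practice, here is a minimal, hypothetical Python sketch of the core idea: each participant keeps a shadow copy of the last state everyone agreed on, and only edit scripts (diffs against that shadow) travel over the wire. This is emphatically not SageMathCloud's actual code, and it ignores everything that makes the real problem hard: concurrent edits, fuzzy patching, and persistence in an eventually consistent database like Cassandra.

import difflib

def make_patch(old, new):
    # Edit script turning `old` into `new`: (start, end, replacement) triples.
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, old, new).get_opcodes():
        if tag != 'equal':
            ops.append((i1, i2, new[j1:j2]))
    return ops

def apply_patch(text, patch):
    # Apply edits right-to-left so earlier offsets stay valid.
    for i1, i2, replacement in sorted(patch, reverse=True):
        text = text[:i1] + replacement + text[i2:]
    return text

class SyncedDoc(object):
    # One participant's view: the live text plus a shadow of the last synced state.
    def __init__(self, text=''):
        self.text = text
        self.shadow = text

    def local_changes(self):
        # Diff against the shadow, then advance the shadow.
        patch = make_patch(self.shadow, self.text)
        self.shadow = self.text
        return patch

    def merge(self, patch):
        # Apply the remote edits and resynchronize the shadow.
        self.text = apply_patch(self.text, patch)
        self.shadow = self.text

# Toy round trip between two participants starting from the same state.
a, b = SyncedDoc("2 + 2 ="), SyncedDoc("2 + 2 =")
a.text += " 4"
b.merge(a.local_changes())
assert b.text == "2 + 2 = 4"

The real system has to do this continuously, in both directions, and through a database rather than a direct connection, but the shadow-plus-diffs loop is the basic shape of differential synchronization.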

I BSD-licensed the differential synchronization code  that I wrote as part of the above. 


by William Stein (noreply@blogger.com) at November 14, 2014 02:31 PM

November 12, 2014

Jake Vanderplas

The Hipster Effect: An IPython Interactive Exploration

This week I started seeing references all over the internet to this paper: The Hipster Effect: When Anticonformists All Look The Same. It describes a simple mathematical model of conformity and non-conformity among a mutually interacting population, and finds some interesting results: namely, that conformity among a population of self-conscious non-conformists is similar to a phase transition in a time-delayed thermodynamic system. In other words, with enough hipsters around responding to delayed fashion trends, a plethora of facial hair and fixed gear bikes is a natural result.

Also naturally, upon reading the paper I wanted to try to reproduce the work. The paper solves the problem analytically for a continuous system and shows the precise values of certain phase transitions within the long-term limit of the postulated system. Though such theoretical derivations are useful, I often find it more intuitive to simulate systems like this in a more approximate manner to gain hands-on understanding. By the end of this notebook, we'll be using IPython's incredible interactive widgets to explore how the inputs to this model affect the results.

Mathematically Modeling Hipsters

We'll start by defining the problem, and going through the notation suggested in the paper. We'll consider a group of \(N\) people, and define the following quantities:

  • \(\epsilon_i\) : this value is either \(+1\) or \(-1\). \(+1\) means person \(i\) is a hipster, while \(-1\) means they're a conformist.
  • \(s_i(t)\) : this is also either \(+1\) or \(-1\). This indicates person \(i\)'s choice of style at time \(t\). For example, \(+1\) might indicate a bushy beard, while \(-1\) indicates clean-shaven.
  • \(J_{ij}\) : The influence matrix. This is a non-negative value which indicates how much person \(j\) influences person \(i\).
  • \(\tau_{ij}\) : The delay matrix. This is an integer telling us the length of delay for the style of person \(j\) to affect the style of person \(i\).

The idea of the model is this: on any given day, person \(i\) looks at the world around him or her, and sees some previous day's version of everyone else. This information is \(s_j(t - \tau_{ij})\).

The amount that person \(j\) influences person \(i\) is given by the influence matrix, \(J_{ij}\), and after putting all the information together, we see that person \(i\)'s mean impression of the world's style is

\[ m_i(t) = \frac{1}{N} \sum_j J_{ij} \cdot s_j(t - \tau_{ij}) \]

Given the problem setup, we can quickly check whether this impression matches their own current style:

  • if \(m_i(t) \cdot s_i(t) > 0\), then person \(i\) matches those around them
  • if \(m_i(t) \cdot s_i(t) < 0\), then person \(i\) looks different than those around them

A hipster who notices that their style matches that of the world around them will risk giving up all their hipster cred if they don't change quickly; a conformist will have the opposite reaction. Because \(\epsilon_i\) = \(+1\) for a hipster and \(-1\) for a conformist, we can encode this observation in a single value which tells us which way the person will lean that day:

\[ x_i(t) = -\epsilon_i m_i(t) s_i(t) \]

Simple! If \(x_i(t) > 0\), then person \(i\) is more likely to switch their style that day, and if \(x_i(t) < 0\), person \(i\) is more likely to keep the same style as the previous day. So we have a formula for how to update each person's style based on their preferences, their influences, and the world around them.

But the world is a noisy place. Each person might have other things going on that day, so instead of using this value directly, we can turn it into a probabilistic statement. Consider the function

\[ \phi(x;\beta) = \frac{1 + \tanh(\beta \cdot x)}{2} \]

We can plot this function quickly:

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Use seaborn styles for nice-looking plots
import seaborn; seaborn.set()
In [2]:
x = np.linspace(-1, 1, 1000)
for beta in [0.5, 1, 5]:
    plt.plot(x, 0.5 * (1 + np.tanh(beta * x)),
             label=r'$\beta = {0:.1f}$'.format(beta))
plt.legend(loc='upper left', fontsize=18)
plt.xlabel('$x$', size=18); plt.ylabel(r'$\phi(x;\beta)$', size=18);

This gives us a nice way to move from our preference \(x_i\) to a probability of switching styles. Here \(\beta\) is inversely related to noise. For large \(\beta\), the noise is small and we basically map \(x > 0\) to a 100% probability of switching, and \(x<0\) to a 0% probability of switching. As \(\beta\) gets smaller, the probabilities get less and less distinct.

The Code

Let's see this model in action. We'll start by defining a class which implements everything we've gone through above:

In [3]:
class HipsterStep(object):
    """Class to implement hipster evolution
    
    Parameters
    ----------
    initial_style : length-N array
        values > 0 indicate one style, while values <= 0 indicate the other.
    is_hipster : length-N array
        True or False, indicating whether each person is a hipster
    influence_matrix : N x N array
        Array of non-negative values. influence_matrix[i, j] indicates
        how much influence person j has on person i
    delay_matrix : N x N array
        Array of positive integers. delay_matrix[i, j] indicates the
        number of days delay between person j's influence on person i.
    """
    def __init__(self, initial_style, is_hipster,
                 influence_matrix, delay_matrix,
                 beta=1, rseed=None):
        self.initial_style = initial_style
        self.is_hipster = is_hipster
        self.influence_matrix = influence_matrix
        self.delay_matrix = delay_matrix
        
        self.rng = np.random.RandomState(rseed)
        self.beta = beta
        
        # make s array consisting of -1 and 1
        self.s = -1 + 2 * (np.atleast_2d(initial_style) > 0)
        N = self.s.shape[1]
        
        # make eps array consisting of -1 and 1
        self.eps = -1 + 2 * (np.asarray(is_hipster) > 0)
        
        # create influence_matrix and delay_matrix
        self.J = np.asarray(influence_matrix, dtype=float)
        self.tau = np.asarray(delay_matrix, dtype=int)
        
        # validate all the inputs
        assert self.s.ndim == 2
        assert self.s.shape[1] == N
        assert self.eps.shape == (N,)
        assert self.J.shape == (N, N)
        assert np.all(self.J >= 0)
        assert np.all(self.tau > 0)

    @staticmethod
    def phi(x, beta):
        return 0.5 * (1 + np.tanh(beta * x))
            
    def step_once(self):
        N = self.s.shape[1]
        
        # iref[i, j] gives the index for the j^th individual's
        # time-delayed influence on the i^th individual
        iref = np.maximum(0, self.s.shape[0] - self.tau)
        
        # sref[i, j] gives the previous state of the j^th individual
        # which affects the current state of the i^th individual
        sref = self.s[iref, np.arange(N)]

        # m[i] is the mean of weighted influences of other individuals
        m = (self.J * sref).sum(1) / self.J.sum(1)
        
        # From m, we use the sigmoid function to compute a transition probability
        transition_prob = self.phi(-self.eps * m * self.s[-1], beta=self.beta)
        
        # Now choose steps stochastically based on this probability
        new_s = np.where(transition_prob > self.rng.rand(N), -1, 1) * self.s[-1]
        
        # Add this to the results, and return
        self.s = np.vstack([self.s, new_s])
        return self.s
    
    def step(self, N):
        for i in range(N):
            self.step_once()
        return self.s
    
    def trend(self, hipster=True, return_std=True):
        if hipster:
            subgroup = self.s[:, self.eps > 0]
        else:
            subgroup = self.s[:, self.eps < 0]

        if return_std:
            return subgroup.mean(1), subgroup.std(1)
        return subgroup.mean(1)

Now we'll create a function which plots the trend for a certain number of time steps:

In [4]:
def plot_results(Npeople=500, Nsteps=200,
                 hipster_frac=0.8, initial_state_frac=0.5,
                 delay=20, log10_beta=0.5, rseed=42):
    rng = np.random.RandomState(rseed)
    
    initial_state = (rng.rand(1, Npeople) < initial_state_frac)
    is_hipster = (rng.rand(Npeople) < hipster_frac)

    influence_matrix = abs(rng.randn(Npeople, Npeople))
    influence_matrix.flat[::Npeople + 1] = 0
    
    delay_matrix = 1 + rng.poisson(delay, size=(Npeople, Npeople))

    h = HipsterStep(initial_state, is_hipster,
                    influence_matrix, delay_matrix=delay_matrix,
                    beta=10 ** log10_beta, rseed=rseed)
    h.step(Nsteps)
    
    def beard_formatter(y, loc):
        if y == 1:
            return 'bushy-\nbeard'
        elif y == -1:
            return 'clean-\nshaven'
        else:
            return ''
        
    t = np.arange(Nsteps + 1)

    fig, ax = plt.subplots(2, sharex=True, figsize=(8, 6))
    ax[0].imshow(h.s.T, cmap='binary', interpolation='nearest')
    ax[0].set_ylabel('individual')
    ax[0].axis('tight')
    ax[0].grid(False)
    
    mu, std = h.trend(True)
    ax[1].plot(t, mu, c='red', label='hipster')
    ax[1].fill_between(t, mu - std, mu + std, color='red', alpha=0.2)
    
    mu, std = h.trend(False)
    ax[1].plot(t, mu, c='blue', label='conformist')
    ax[1].fill_between(t, mu - std, mu + std, color='blue', alpha=0.2)
    
    ax[1].set_xlabel('time')
    ax[1].set_ylabel('Trend')
    ax[1].legend(loc='best')
    ax[1].set_ylim(-1.1, 1.1);
    ax[1].set_xlim(0, Nsteps)
    ax[1].yaxis.set_major_formatter(plt.FuncFormatter(beard_formatter))

Exploring the Result

With this code in place, we can now explore the result. We'll start by seeing what happens when just 10% of the population is made up of non-conformist hipsters:

In [5]:
plot_results(hipster_frac=0.1)

Let's describe this plot briefly: the top panel has 500 rows and 200 columns: each row represents an individual person, and the color (white or black) along the row represents the style of that person at that time.

In the bottom panel, we see the mean and standard deviation of the styles of all hipsters (red) and all conformists (blue).

This plot shows something relatively unsurprising: when there are only a few hipsters in the population, we quickly reach an equilibrium where hipsters all have one style (a bushy beard) while the norm-core conformists have the opposite (clean shaven faces).

Let's see what happens if there are more hipsters in the population:

In [6]:
plot_results(hipster_frac=0.5)

With half the population made up of hipsters, the trend washes out. There is so much noise and confusion about styles, that both the hipsters and the conformists have a wide range of styles at any given time.

Now let's see what happens when we have even more hipsters:

In [7]:
plot_results(hipster_frac=0.8)

Now this is getting interesting! With a population dominated by hipsters, we end up approaching steady cycles in trends. The hipsters start trying to distance themselves from the "normal" style, and then the normal style moves to catch up with them. The hipsters then swing the other way, and the process repeats. This is an example of the "phase transition" that the author of the original paper talked about in analogy to thermodynamic systems: above certain critical values of the model parameters, a qualitatively new type of behavior appears out of the noise. This oscillation can be thought of as a rough and simplistic mathematical model for recurrent cycles of cultural and fashion trends that anyone over a couple decades old has probably noticed over time.

But let's explore this even more.

Fully Interactive

One of the nice pieces of the IPython notebook is the ability to quickly create interactive visualizations. Unfortunately this only works when you're viewing the notebook live (i.e. a static HTML view on a blog post won't give you any interaction). If you're reading this on my blog or on nbviewer, then you can download the notebook and run it with IPython to see these features.

What we'll do is to call IPython's interactive tools on our Python function, which will create javascript sliders allowing us to explore the parameter space of this hipster conformity model. I'd encourage you to download the notebook and try it out!

In [8]:
from IPython.html.widgets import interact, fixed

interact(plot_results, hipster_frac=(0.0, 1.0), delay=(1, 50),
         initial_state_frac=(0.0, 1.0), log10_beta=(-2.0, 2.0),
         Npeople=fixed(500), Nsteps=fixed(200), rseed=fixed(42));

Again, unless you download the notebook and run it on a local Python kernel, all you'll see is a static graphic above. But with the interactive version, you can really start to see how these various parameters affect the system.

Conclusion

This has been a lot of fun, and if you've read this far I hope this helped you understand the mathematics of Hipster-dom! For more information and analysis, go read the paper. It goes much deeper than the rough, discrete approximation I've used here.

For further ideas, I'd love to see a simulation of how this looks if we add in spatial information, and create a delay related to that information. Would you start to see pockets of people adopting similar styles? My guess is yes, but I'm not entirely sure... there's only one way to find out.

Happy coding!

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

by Jake Vanderplas at November 12, 2014 05:00 AM

November 10, 2014

Titus Brown

Some thoughts on Journal Impact Factor

A colleague just e-mailed me to ask me how I felt about journal impact factor being such a big part of the Academic Ranking of World Universities - they say that 20% of the ranking weight comes from # of papers published in Nature and Science. So what do I think?

On evaluations

I'm really not a big fan of rankings and evaluations in the first place. This is largely because I feel that evaluations are rarely objective. For one very specific example, last year at MSU I got formal evaluations from both of my departments. Starting with the same underlying data (papers published, students graduated, grants submitted/awarded, money spent, classes taught, body mass/height ratio, gender weighting, eye color, miles flown, talks given, liquid volume of students' tears generated, committees served on, etc.) one department gave me a "satisfactory/satisfactory/satisfactory" while the other department gave me an "excellent/excellent/excellent." What fun! (I don't think the difference was the caliber of the departments.)

Did I mention that these rankings helped determine my raise for the year?

Anyhoo, I find the rating and ranking scheme within departments at MSU to be largely silly. It's done in an ad hoc manner by untrained people, and as far as I can tell, is biased against people who are not willing to sing their own praises. (I brought up the Dunning-Kruger effect in my last evaluation meeting. Heh.) That's not to say there's not serious intent -- they are factored into raises, and at least one other purpose of evaluating assistant professors is so that once you fire their ass (aka "don't give them tenure") there's a paper trail of critical evaluations where you explained that they were in trouble.

Metrics are part of the job, though; departments evaluate their faculty so they can see who, if anyone, needs help or support or mentoring, and to do this, they rely at least in part on metrics. Basically, if someone has lots of funding and lots of papers, they're probably not failing miserably at being a research professor; if they're grant-poor and paper-poor, they're targets for further investigation. There are lots of ways to evaluate, but metrics seem like an inextricable part of it.

Back to the impact factor

Like faculty evaluations, ranking by the impact factor of the journals that university faculty publish in is an attempt to predict future performance using current data.

But impact factor is extremely problematic for many reasons. It's based on citations, which (over the long run) may be an OK measure of impact, but are subject to many confounding factors, including field-specific citation patterns. It's an attempt to predict the success of individual papers on a whole-journal basis, which falls apart in the face of variable editorial decisions. High-impact journals are also often read more widely by people than low-impact journals, which yields a troubling circularity in terms of citation numbers (you're more likely to cite a paper you've read!) Worse, the whole system is prone to being gamed in various ways, which is leading to high rates of retractions for high-impact journals, as well as outright fraud.

Impact factor is probably a piss-poor proxy for paper impact, in other words.

If impact factor was just a thing that didn't matter, I wouldn't worry. The real trouble is that impact factors have real-world effect - many countries use impact factor of publications as a very strong weight in funding and promotion decisions. Interestingly, the US is not terribly heavy handed here - most universities seem pretty enlightened about considering the whole portfolio of a scientist, at least anecdotally. But I can name a dozen countries that care deeply about impact factor for promotions and raises.

And apparently impact factor affects university rankings, too!

Taking a step back, it's not clear that any good ranking scheme can exist, and if it does, we're certainly not using it. All of this is a big problem if you care about fostering good science.

The conundrum is that many people like rankings, and it seems futile to argue against measuring and ranking people and institutions. However, any formalized ranking system can be gamed and perverted, which ends up sometimes rewarding the wrong kind of people, and shutting out some of the right kind of people. (The Reed College position on the US News & World Report ranking system is worth reading here.) More generally, in any ecosystem, the competitive landscape is evolving, and a sensible measure today may become a lousy measure tomorrow as the players evolve their strategies; the stricter the rules of evaluation, and the more entrenched the evaluation system, the less likely it is to adapt, and the more misranking will result. So ranking systems need to evolve continuously.

At its heart, this is a scientific management challenge. Rankings and metrics do pretty explicitly set the landscape of incentives and competition. If our goal in science is to increase knowledge for the betterment of mankind, then the challenge for scientific management is to figure out how to incentivize behaviors that trend in that direction in the long term. If you use bad or outdated metrics, then you incentivize the wrong kind of behavior, and you waste precious time, energy, and resources. Complicating this is the management structure of academic science, which is driven by many things that include rankings and reputation - concepts that range from precise to fuzzy.

My position on all of this is always changing, but it's pretty clear that the journal system is kinda dumb and rewards the wrong behavior. (For the record, I'm actually a big fan of publications, and I think citations are probably not a terribly bad measure of impact when measured on papers and individuals, although I'm always happy to engage in discussions on why I'm wrong.) But the impact factor is especially horrible. The disproportionate effect that high-IF glamour mags like Cell, Nature and Science have on our scientific culture is clearly a bad thing - for example, I'm hearing more and more stories about editors at these journals warping scientific stories directly or indirectly to be more press-worthy - and when combined with the reproducibility crisis I'm really worried about the short-term future of science. Journal Impact Factor and other simple metrics are fundamentally problematic and are contributing to the problem, along with the current peer review culture and a whole host of other things. (Mike Eisen has written about this a lot; see e.g. this post.)

In the long term I think a much more experimental culture of peer review and alternative metrics will emerge. But what do we do for now?

More importantly:

How can we change?

I think my main advice to faculty is "lead, follow, or get out of the way."

Unless you're a recognized big shot, or willing to take somewhat insane gambles with your career, "leading" may not be productive - following or getting out of the way might be best. But there are a lot of things you can do here that don't put you at much risk, including:

  • be open to a broader career picture when hiring and evaluating junior faculty;
  • argue on behalf of alternative metrics in meetings on promotion and tenure;
  • use sites like Google Scholar to pick out some recent papers to read in depth when hiring faculty and evaluating grants;
  • avoid making (or push back at) cheap shots at people who don't have a lot of high-impact-factor papers;
  • invest in career mentoring that is more nuanced than "try for lots of C-N-S papers or else" - you'd be surprised how often this is the main advice assistant professors take away...
  • believe in and help junior faculty that seem to have a plan, even if you don't agree with the plan (or at least leave them alone ;)

What if you are a recognized big shot? Well, there are lots of things you can do. You are the people who set the tone in the community and in your department, and it behooves you to think scientifically about the culture and reward system of science. The most important thing you can do is think and investigate. What evidence is there behind the value of peer review? Are you happy with C-N-S editorial policies, and have you talked to colleagues who get rejected at the editorial review stage more than you do? Have you thought about per-article metrics? Do you have any better thoughts on how to improve the system than 'fund more people', and how would you effect changes in this direction by recognizing alternate metrics during tenure and grant review?

The bottom line is that the current evaluation systems are the creation of scientists, for scientists. It's our responsibility to critically evaluate them, and perhaps evolve them when they're inadequate; we shouldn't just complain about how the current system is broken and wait for someone else to fix it.

Addendum: what would I like to see?

Precisely predicting the future importance of papers is obviously kind of silly - see this great 1994 paper by Gans and Shepherd on rejected classics papers, for example -- and is subject to all sorts of confounding effects. But this is nonetheless what journals are accustomed to doing: editors at most journals, especially the high impact factor ones, select papers based on projected impact before sending them out for review, and/or ask the reviewers to review impact as well.

So I think we should do away with impact review and review for correctness instead. This is why I'm such a big fan of PLOS One and PeerJ, who purport to do exactly that.

But then, I get asked, what do we do about selecting out papers to read? Some (many?) scientists claim that they need the filtering effect of these selective journals to figure out what they should be reading.

There are a few responses to this.

First, it's fundamentally problematic to outsource your attention to editors at journals, for reasons mentioned above. There's some evidence that you're being drawn into a manipulated and high-retraction environment by doing that, and that should worry you.

But let's say you feel you need something to tell you what to read.

Well, second, this is technologically solvable - that's what search engines already do. There's a whole industry of search engines that give great results based on integrating free text search, automatic content classification, and citation patterns. Google Scholar does a great job here, for example.

Third, social media (aka "people you know") provides some great recommendation systems! People who haven't paid much attention to Twitter or blogging may not have noticed, but in addition to person-to-person recommendations, there are increasingly good recommendation systems coming on line. I personally get most of my paper recs from online outlets (mostly people I follow, but I've found some really smart people to follow on Twitter!). It's a great solution!

Fourth, if one of the problems is that many journals review for correctness AND impact together, why not separate them? For example, couldn't journals like Science or Nature evolve into literature overlays that highlight papers published in impact-blind journals like PLOS One or PeerJ? I can imagine a number of ways that this could work, but if we're so invested in having editors pick papers for us, why not have them pick papers that have been reviewed for scientific correctness first, and then elevate them to our attention with their magic editorial pen?

I don't see too many drawbacks to this vs the current approach, and many improvements. (Frankly this is where I see most of scientific literature going, once preprint archives become omnipresent.)

So that's where I want and expect to see things going. I don't see ranking based on predicted impact going away, but I'd like to see it more reflective of actual impact (and be measured in more diverse ways).

--titus

p.s. People looking for citations of high retraction rate, problematic peer review, and the rest could look at one of my earlier blog posts on problems with peer review. I'd be interested in more citations, though!

by C. Titus Brown at November 10, 2014 11:00 PM

November 06, 2014

Fabian Pedregosa

Plot memory usage as a function of time

One of the lesser known features of the memory_profiler package is its ability to plot memory consumption as a function of time. This was implemented by my friend Philippe Gervais, previously a colleague at INRIA and now working for Google.

With this feature it is very easy to generate a plot of the memory consumption as a function of time. The result will be something like this:

where you can see the memory used (on the y-axis) as a function of time (x-axis). Furthermore, we have used two functions, test1 and test2, and the colored brackets show at what time these functions start and finish.

This plot was generated with the following simple script:

import time

@profile
def test1():
    n = 10000
    a = [1] * n
    time.sleep(1)
    return a

@profile
def test2():
    n = 100000
    b = [1] * n
    time.sleep(1)
    return b

if __name__ == "__main__":
    test1()
    test2()

What happens here is that we have two functions, test1() and test2(), in which we create two lists of different sizes (the one in test2 is bigger). We call time.sleep() for one second so that the functions do not return too soon and we have time to get reliable memory measurements.

The decorator @profile is optional; it is useful so that memory_profiler knows when each function is called and can plot the brackets indicating that. If you don't use the decorator, the example will still work fine, except that the brackets will not appear in your plot.

Suppose we have saved the script as test1.py. We run the script as

$ mprof run test1.py

where mprof is an executable provided by memory_profiler. If the above command is successful, it will print something like this:

$ mprof run test1.py
mprof: Sampling memory every 0.1s
running as a Python program...

The above command will create a .dat file in your current working directory, something like mprofile_20141108113511.dat. This file (you can inspect it; it's a text file) contains the memory measurements for your program.

You can now plot the memory measurements with the command

$ mprof plot

This will open a matplotlib window and show you the plot:

As you can see, attention has been paid to the default values so that the plot already looks decent without much effort. The not-so-nice part is that, at least as of November 2014, if you want to customize the plot, well, you'll have to look at and modify the mprof script itself. Some refactoring is still needed to make the plots easier to customize (work in progress).
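In the meantime, nothing stops you from reading the .dat file yourself and plotting it however you like. Here is a hedged sketch, assuming that memory samples are stored on lines of the form MEM <memory in MiB> <timestamp>; the format may change between memory_profiler versions, so check your own .dat file first. The filename below is the one created in the example above.

import matplotlib.pyplot as plt

def load_mem_samples(filename):
    # Collect 'MEM <MiB> <timestamp>' samples from an mprof .dat file.
    times, mem = [], []
    with open(filename) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 3 and fields[0] == 'MEM':
                mem.append(float(fields[1]))
                times.append(float(fields[2]))
    return times, mem

times, mem = load_mem_samples('mprofile_20141108113511.dat')
t0 = times[0]
plt.plot([t - t0 for t in times], mem)
plt.xlabel('time (s)')
plt.ylabel('memory used (MiB)')
plt.show()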

by Fabian Pedregosa at November 06, 2014 11:00 PM

November 02, 2014

Titus Brown

Doing biology, however it needs to be done

Sean Eddy wrote an interesting blog post on how scripting is something every biologist should learn to do. This spurred a few discussions on Twitter and elsewhere, most of which devolved into the usual arguments about what, precisely, biologists should be taught.

I always find these discussions not merely predictable but rather beside the point. Statisticians will always complain that people need a better appreciation of stats; bioinformaticians will point to alignment or sequence comparison or whatnot; evolutionary biologists will talk about how important evolution is; physicists will point to more math; etc. But the truth is there's very, very little that is necessary to be a biologist.

My perspective on this is informed by my own background, which is a tad idiosyncratic. Despite being an assistant professor in a Microbiology department and (soon) a professor in a VetMed program, I have taken more physics courses than biology courses; heck, I've taught more biology courses than I've taken. I've never taken a chemistry course and know virtually nothing about biochemistry, in particular. Most of my evolutionary knowledge derives from my reading for my research on Avida; I've never been taught anything about ecology and still know very little. I failed my first qual exam at Caltech because I didn't actually know anything about cells (which event, ahem, spurred me to learn). Despite this utter lack of formal background in biology, I spent 5-10 years of my life doing experimental molecular biology for developmental biology research (after 3-5 years learning basic molecular biology), and I've learned what I think is a reasonable amount about cell biology into the bargain. I know a fair bit about developmental biology, evo-devo, molecular biology, and genomics. My PhD is actually in Biology, with a Developmental Biology focus. But I learned it all as I needed it. (My undergrad degree is in Unapplied Mathematics.)

My weakest formal training is probably in stats, where I know enough to point out where whatever system I'm studying is violating standard statistical requirements, but not enough to point out how to rescue our approach.

Despite having "run" a bioinformatics lab for the last few years, my formal bioinformatics background is basically nil - I took a strong programming background, learned a bunch of biology and genomics, and then realized that much of bioinformatics is fairly obvious at that point. I don't really understand Hidden Markov Models or sequence alignment (but shh, don't tell anyone!)

With all of this, what do I call myself? Well, I definitely consider myself a biologist, as do at least a few different hiring panels, including one at UC Davis :). And when I talk to other biologists, I think that at least some of them agree - I'm focused on answering biological questions. I do so primarily in collaboration at this point, and primarily from the standpoint of data, but: biology.

So here's my conclusion: to be a biologist, one must be seriously trying to study biology. Period. Clearly you must know something about biology in order to be effective here, and critical thinking is presumably pretty important; I think "basic competency in scientific practice" is probably the minimum bar, but even there you can imagine lab techs or undergraduates putting in useful work at a pretty introductory level. I think there are many useful skills to have, but I have a hard time concluding that any of them are strictly necessary.

The more interesting question, to my mind, is what should we be teaching undergraduates and graduate students in biology? And there I unequivocally agree with the people who prioritize some reasonable background in stats, and some reasonable background in data analysis (with R or Python - something more than Excel). What's more important than teaching any one thing in specific, though, is that the whole concept that biologists can avoid math or computing in their training (be it stats, modeling, simulation, programming, data science/data analysis, or whatever) needs to die. That is over. Dead, done, over.

One particular challenge we are facing now is that we don't have many people capable of teaching these younger biologists the appropriate data analysis skills, because most biologists (including the non-research-active faculty that do most of the teaching) don't know anything about them, and data analysis in biology is, at its heart, about the biology -- you can't just drop in a physicist or an engineer to teach this stuff.

At the end of the day, though, a scientist either learns what they need to know in order to do their research, or they collaborate with others to do it. As data becomes ever more important in biology, I expect more and more biologists will learn how to do their own analysis. One of my interests is in figuring out how to help biologists to make this transition if they want to.

So perhaps we can shift from talking about what you must know in order to practice biology, and talk about what we're going to teach, to whom, and when, to people who are the biologists of the future?

--titus

by C. Titus Brown at November 02, 2014 11:00 PM

October 27, 2014

Juan Nunez-Iglesias

GitHub's hidden PR comments

It’s time to draw my “continuous integration in Python” series to a close. This final post ties all six previous posts together and is the preferred write-up to share more widely and on which to provide feedback.

Almost everything I know about good Python development I’ve learned from Stéfan van der Walt, Tony Yu, and the rest of the scikit-image team. But a few weeks ago, I was trying to emulate the scikit-image CI process for my own project: cellom2tif, a tool to liberate images from a rather useless proprietary format. (I consider this parenthetical comment sufficient fanfare to announce the 0.2 release!) As I started copying and editing config files, I found that even from a complete template, getting started was not very straightforward. First, scikit-image has much more complicated requirements, so that a lot of the .travis.yml file was just noise for my purposes. And second, as detailed in the previous posts, a lot of the steps are not found or recorded anywhere in the repository, but rather must be navigated to on the webpages of GitHub, Travis, and Coveralls. I therefore decided to write this series as both a notetaking exercise and a guide for future CI novices. (Such as future me.)

To recap, here are my six steps to doing continuous integration in Python with pytest, Travis, and Coveralls:

If you do all of the above at the beginning of your projects, you’ll be in a really good place one, two, five years down the line, when many academic projects grow far beyond their original scope in unpredictable ways and end up with much broken code. (See this wonderful editorial by Zeeya Merali for much more on this topic.)

Reducing the boilerplate with PyScaffold

But it’s a lot of stuff to do for every little project. I was about to make myself some minimal setup.cfg and .travis.yml template files so that I could have these ready for all new projects, when I remembered PyScaffold, which sets up a Python project’s basic structure automatically (setup.py, package_name/__init__.py, etc.). Sure enough, PyScaffold has a --with-travis option that implements all my recommendations, including pytest, Travis, and Coveralls. If you set up your projects with PyScaffold, you’ll just have to turn on Travis-CI on your GitHub repo admin and Coveralls on coveralls.io, and you’ll be good to go.
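For example, creating a new project with Travis support enabled looks something like this (a sketch only: my_project is a placeholder name, and the exact flags may vary between PyScaffold versions):

$ putup my_project --with-travis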

When Travises attack

I’ve made a fuss about how wonderful Travis-CI is, but it breaks more often than I’d like. You’ll make some changes locally, and ensure that the tests pass, but when you push them to GitHub, Travis fails. This can happen for various reasons:

  • your environment is different (e.g. NumPy versions differ between your local build and Travis’s VMs).
  • you’re testing a function that depends on random number generation and have failed to set the seed.
  • you depend on some web resource that was temporarily unavailable when you pushed.
  • Travis has updated its VMs in some incompatible way.
  • you have more memory/CPUs locally than Travis allows.
  • some other, not-yet-understood-by-me reason.

Of these, the first three are acceptable. You can use conda to match your environments both locally and on Travis, and you should always set the seed for randomised tests. For network errors, Travis provides a special function, travis_retry, that you can prefix your commands with.
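As an illustration, the prefix goes directly in front of the command in your .travis.yml; a minimal sketch, assuming a pip-based install step:

install:
  - travis_retry pip install -r requirements.txt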

Travis VM updates should theoretically be benign and not cause any problems, but, in recent months, they have been a significant source of pain for the scikit-image team: every monthly update by Travis broke our builds. That’s disappointing, to say the least. For simple builds, you really shouldn’t run into this. But for major projects, this is an unnecessary source of instability.

Further, Travis VMs don’t have unlimited memory and disk space for your builds (naturally), but the limits are not strictly defined (unnaturally). This means that builds requiring “some” memory or disk space randomly fail. Again, disappointing. Travis could, for example, guarantee some minimal specs that everyone could program against — and request additional space either as special exemptions or at a cost.

Finally, there’s the weird failures. I don’t have any examples on hand but I’ll just note that sometimes Travis builds fail, where your local copy works fine every single time. Sometimes rebuilding fixes things, and other times you have to change some subtle but apparently inconsequential thing before the build is fixed. These would be mitigated if Travis allowed you to clone their VM images so you could run them on a local VM or on your own EC2 allocation.

Heisenbug
A too-common Travis occurrence: randomly failing tests

In all though, Travis is a fantastic resource, and you shouldn’t let my caveats stop you from using it. They are just something to keep in mind before you pull all your hair out.

The missing test: performance benchmarks

Testing helps you maintain the correctness of your code. However, as Michael Droettboom eloquently argued at SciPy 2014, all projects are prone to feature creep, which can progressively slow code down. Airspeed Velocity is to benchmarks what pytest is to unit tests, and allows you to monitor your project’s speed over time. Unfortunately, benchmarks are a different beast to tests, because you need to keep the testing computer’s specs and load constant for each benchmark run. Therefore, a VM-based CI service such as Travis is out of the question.

If your project has any performance component, it may well be worth investing in a dedicated machine only to run benchmarks. The machine could monitor your GitHub repo for changes and PRs, check them out when they come in, run the benchmarks, and report back. I have yet to do this for any of my projects, but will certainly consider this strongly in the future.

Some reservations about GitHub

The above tools all work great as part of GitHub’s pull request (PR) development model. It’s a model that is easy to grok, works well with new programmers, and has driven massive growth in the open-source community. Lately, I recommend it with a bit more trepidation than I used to, because it does have a few high-profile detractors, notably Linux and git creator Linus Torvalds, and OpenStack developer Julien Danjou. To paraphrase Julien, there are two core problems with GitHub’s chosen workflow, both of which are longstanding and neither of which shows any sign of improving.

First, comments on code diffs are buried by subsequent changes, whether the changes are a rebase or they simply change the diff. This makes it very difficult for an outside reviewer to assess what discussion, if any, resulted in the final/latest design of a PR. This could be a fairly trivial fix (colour-code outdated diffs, rather than hiding them), so I would love to see some comments from GitHub as to what is taking so long.

GitHub's hidden PR comments
Expect to see a lot of these when using pull requests.

Second, bisectability is broken by fixup commits. The GitHub development model is not only geared towards small, incremental commits being piled onto a history, but it actively encourages these with its per-commit badging of a user’s contribution calendar. Fixup commits make bug hunting with git bisect more difficult, because some commits will not be able to run a test suite at all. This could be alleviated by considering only commits that merge GitHub PRs, whose commit messages start with Merge pull request #, but I don’t know how to get git to do this automatically (ideas welcome in the comments).
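One possible direction (an untested sketch, and it keeps every merge commit rather than only PR merges; good and bad are placeholders for your known-good and known-bad refs) is to tell bisect up front to skip all non-merge commits:

$ git bisect start bad good
$ git bisect skip $(git rev-list --no-merges good..bad)
$ git bisect run py.test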

I disagree with Julien that there is “no value in the social hype [GitHub] brings.” In fact, GitHub has dramatically improved my coding skills, and no doubt countless others’. For many, it is their first experience with code review. Give credit where it is due: GitHub is driving the current, enormous wave of open-source development. But there is no doubt it needs improvement, and it’s sad to see GitHub’s developers apparently ignoring their critics. I hope the latter will be loud enough soon that GitHub will have no choice but to take notice.

Final comments

This series, including this post, sums up my current thinking on CI in Python. It’s surely incomplete: I recently came across a curious “Health: 88%” badge on Mitchell Stanton-Cook’s BanzaiDB README. Clicking it took me to the project’s landscape.io page, which appears to do for coding style what Travis does for builds/tests and Coveralls does for coverage. How it measures “style” is not yet clear to me, but it might be another good CI tool to keep track of. Nevertheless, since it’s taken me a few years to get to this stage in my software development practice, I hope this series will help other scientists get there faster.

If any more experienced readers think any of my advice is rubbish, please speak up in the comments! I’ll update the post(s) accordingly. CI is a big rabbit hole and I’m still finding my way around.


by Juan Nunez-Iglesias at October 27, 2014 02:03 PM

October 22, 2014

Titus Brown

Infrastructure for Data Intensive Biology - a statement of work

Since being chosen as a Moore Foundation Data Driven Discovery Investigator, I've been putting together the paperwork at UC Davis to actually receive the money. Part of that is putting together a budget and a Statement of Work to help guide the conversation between me, Davis, and the Moore Foundation. Here's what I sent Chris Mentzel at Moore:


Title: Infrastructure for Data Intensive Biology

OUTCOME: In support of demonstrating the high level of scientific impact that data scientists deliver through their focus on interdisciplinary data-driven research, funds from this award will be used to better understand gene function in non-model organisms through the development of new workflows and better data sharing technology for large-scale data analysis.

Research direction 1: Develop and extend protocols for non-model genomic and transcriptomic analysis.

Research direction 2: Integrate and extend existing workflow and data analysis software into a cloud-enabled deployment system with a Web interface for executing protocols.

Research direction 3: Investigate and develop a distributed graph database system and distributed query functionality to support distributed data-driven discovery. (DDDD :)


For more of the background, see my full award submission, my presentation, and a science fiction story that I would like to enable.

Comments and pointers welcome!

--titus

by C. Titus Brown at October 22, 2014 10:00 PM

October 17, 2014

Juan Nunez-Iglesias

Readme badges

We’re finally ready to wrap up this topic. By now you can:

But, much as exercise is wasted if your bathroom scale doesn’t automatically tweet about it, all this effort is for naught if visitors to your GitHub page can’t see it!

Most high-profile open-source projects these days advertise their CI efforts. Above, I cheekily called this showing off, but it’s truly important: anyone who lands on your GitHub page is a potential user or contributor, and if they see evidence that your codebase is stable and well-tested, they are more likely to stick around.

Badging your README is easy. (You do have a README, don’t you?) In Travis, go to your latest build. Near the top right, click on the “build passing” badge:

Travis-CI badge

You’ll get an overlay, with a pull-down menu for all the different options for getting the badge. You can grab the image URL, or you can directly grab the Markdown to put into your markdown-formatted README, or a bunch of other options, including RST:

Travis-CI badge URLs

Just copy and paste the appropriate code and add it to your README file wherever you please.

Meanwhile, on your repository’s Coveralls page, on the right-hand side, you will find another pull-down menu with the appropriate URLs:

Coveralls badge URLs

Again, just grab whichever URL is appropriate for your needs (I prefer Markdown-formatted READMEs), and add it to your README, usually next to the Travis badge.
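For reference, the end result in a Markdown-formatted README typically looks something like the two lines below. Treat this as an illustrative sketch only: username and reponame are placeholders, and the authoritative URLs are whatever you copied from the Travis and Coveralls menus above, since the exact badge paths have changed over time.

[![Build Status](https://travis-ci.org/username/reponame.svg?branch=master)](https://travis-ci.org/username/reponame)
[![Coverage Status](https://coveralls.io/repos/username/reponame/badge.png?branch=master)](https://coveralls.io/r/username/reponame?branch=master)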

And you’re done! You’ve got automated tests, tons of test coverage, you’re running everything correctly thanks to configuration files, and all this is getting run on demand thanks to Travis and Coveralls. And thanks to badging, the whole world knows about it:

Readme badges


by Juan Nunez-Iglesias at October 17, 2014 01:37 PM

William Stein

A Non-technical Overview of the SageMathCloud Project

What problems is the SageMathCloud project trying to solve? What pain points does it address? Who are the competitors and what is the state of the technology right now?


What problems are you trying to solve, and why are they a problem?

  • Computational Education: How can I teach a course that involves mathematical computation and programming?
  • Computational Research: How can I carry out collaborative computational research projects?
  • Cloud computing: How can I get easy user-friendly collaborative access to a remote Linux server?

What are the pain points of the status quo and who feels the pain?

  • Student/Teacher pain:
    • Getting students to install software needed for a course on their computers is a major pain; sometimes it is just impossible, due to no major math software (not even Sage) supporting all recent versions of Windows/Linux/OS X/iOS/Android.
    • Getting university computer labs to install the software you need for a course is frustrating and expensive (time and money).
    • Even if computer labs worked, they are often being used by another course, stuffy, and students can't possibly do all their homework there, so computation gets short shrift. Lab keyboards, hardware, etc., all hard to get used to. Crappy monitors.
    • Painful confusing problems copying files around between teachers and students.
    • Helping a student or collaborator with their specific problem is very hard without physical access to their computer.
  • Researcher pain:
    • Making backups every few minutes of the complete state of everything when doing research is often hard and distracting, but important for reproducibility.
    • Copying around documents, emailing or pushing/pulling them to revision control is frustrating and confusing.
    • Installing obscure software is frustrating and distracts from the research they really want to do.
  • Everybody:
    • It is frustrating not having LIVE working access to your files wherever you are. (Dropbox/Github doesn't solve this, since files are static.)
    • It is difficult to leave computations running remotely.

Why is your technology poised to succeed?

  • When it works, SageMathCloud solves every pain point listed above.
  • The timing is right, due to massive improvements in web browsers during the last 3 years.
  • I am on full sabbatical this year, so at least success isn't totally impossible due to not working on the project.
  • I have been solving the above problems in less scalable ways for myself, colleagues and students since the 1990s.
  • SageMathCloud has many users that provide valuable feedback.
  • We have already solved difficult problems since I started this project in Summer 2012 (and launched first version in April 2013).

Who are your competitors?

There are no competitors with a similar range of functionality. However, there are many webapps that have some overlap in capabilities:
  • Mathematical overlap: Online Mathematica: "Bring Mathematica to life in the cloud"
  • Python overlap: Wakari: "Web-based Python Data Analysis"; also PythonAnywhere
  • Latex overlap: ShareLaTeX, WriteLaTeX
  • Web-based IDE's/terminals: target writing webapps (not research or math education): c9.io, nitrous.io, codio.com, terminal.com
  • Homework: WebAssign and WebWork
Right now, SageMathCloud gives away for free far more than any other similar site, due to very substantial temporary financial support from Google, the NSF and others.

What’s the total addressable market?

Though our primary focus is the college mathematics classroom, there is a larger market:
  • Students: all undergrad/high school students in the world taking a course involving programming or mathematics
  • Teachers: all teachers of such courses
  • Researchers: anybody working in areas that involve programming or data analysis
Moreover, the web-based platform for computing that we're building lends itself to many other collaborative applications.

What stage is your technology at?

  • The site is up and running and has 28,413 monthly active users
  • There are still many bugs
  • I have a precise todo list that would take me at least 2 months fulltime to finish.

Is your solution technically feasible at this point?

  • Yes. It is only a matter of time until the software is very polished.
  • Moreover, we have compute resources to support significantly more users.
  • But without money (from paying customers or investment), if growth continues at the current rate then we will have to clamp down on free quotas for users.

What technical milestones remain?

  • Infrastructure for creating automatically-graded homework problems.
  • Fill in lots of details and polish.

Do you have external credibility with technical/business experts and customers?

  • Business experts: I don't even know any business experts.
  • Technical experts: I founded the Sage math software, which is 10 years old and relatively well known by mathematicians.
  • Customers: We have no customers; we haven't offered anything for sale.

Market research?

  • I know about math software and its users as a result of founding the Sage open source math software project, NSF-funded projects I've been involved in, etc.

Is the intellectual property around your technology protected?

  • The IP is software.
  • The website software is mostly new Javascript code we wrote that is copyright Univ. of Washington and mostly not open source; it depends on various open source libraries and components.
  • The Sage math software is entirely open source.

Who are the team members to move this technology forward?

I am the only person working on this project fulltime right now.
  • Everything: William Stein -- UW professor
  • Browser client code: Jon Lee, Andy Huchala, Nicholas Ruhland -- UW undergrads
  • Web design, analytics: Harald Schilly -- Austrian grad student
  • Hardware: Keith Clawson

Why are you the ideal team?

  • We are not the ideal team.
  • If I had money maybe I could build the ideal team, leveraging my experience and connections from the Sage project...

by William Stein (noreply@blogger.com) at October 17, 2014 01:04 PM

October 16, 2014

Jake Vanderplas

How Bad Is Your Colormap?

(Or, Why People Hate Jet – and You Should Too)

I made a little code snippet that I find helpful, and you might too:

In [1]:
import numpy as np
import matplotlib.pyplot as plt

def grayify_cmap(cmap):
    """Return a grayscale version of the colormap"""
    cmap = plt.cm.get_cmap(cmap)
    colors = cmap(np.arange(cmap.N))
    
    # convert RGBA to perceived greyscale luminance
    # cf. http://alienryderflex.com/hsp.html
    RGB_weight = [0.299, 0.587, 0.114]
    luminance = np.sqrt(np.dot(colors[:, :3] ** 2, RGB_weight))
    colors[:, :3] = luminance[:, np.newaxis]
    
    return cmap.from_list(cmap.name + "_grayscale", colors, cmap.N)

What this function does is to give you a luminance-correct grayscale version of any matplotlib colormap. I've found this useful for quickly checking how my plots might appear if printed in black and white, but I think it's probably even more useful for stoking the flame of the internet's general rant against jet.

If you want to take a step toward joining the in-crowd of chromatically-sensitive data viz geeks, your best bet is to start by bashing jet. Even if you don't know it by name, I can guarantee that if you've read many scientific papers, you've seen jet before. For example, here's a snapshot of a plot from a neuroscience journal which is skewered by an appropriately ranty blogpost on the subject:

Jet is the default colormap originally used by MATLAB, and this default was inherited in the early days of Python's matplotlib package. The reasons not to use jet are numerous, and you can find good arguments against it across the web. For some more subdued and nuanced arguments, I'd start with the paper Rainbow Color Map (Still) Considered Harmful and, for more general visualization tips, Ten Simple Rules for Better Figures.

So what do I have to add to this discussion that hasn't been already said? Well, nothing really, except the code snippet I shared above. Let me show you what it does.

Taking the Color Out of Jet

Let's start by defining some data and seeing how it looks with the default jet colormap:

In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 6)
y = np.linspace(0, 3)[:, np.newaxis]
z = 10 * np.cos(x ** 2) * np.exp(-y)

We'll use matplotlib's imshow command to visualize this. By default, it will use the "jet" colormap:

In [4]:
plt.imshow(z)
plt.colorbar();

At first glance this might look OK. But upon closer examination, you might notice that jet's luminance profile is incredibly complicated. Because your eye has different levels of sensitivity to light of different color, the luminance is not simply the sum of the RGB values as you might naively expect, but some weighted Euclidean sum of the individual values. You can find more information than you'd ever need to know on imagemagick's website.
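To make that weighting concrete, here's a small sketch (using the same weights as grayify_cmap above) of the perceived luminance of the pure red, green, and blue primaries:

import numpy as np

# perceived-luminance weights for R, G, B (cf. http://alienryderflex.com/hsp.html)
RGB_weight = [0.299, 0.587, 0.114]

for name, rgb in [('red', [1., 0., 0.]),
                  ('green', [0., 1., 0.]),
                  ('blue', [0., 0., 1.])]:
    luminance = np.sqrt(np.dot(np.array(rgb) ** 2, RGB_weight))
    print("%s: %.2f" % (name, luminance))

Green (~0.77) reads as much brighter than red (~0.55) or blue (~0.34), even though all three are fully saturated primaries -- which is exactly why a colormap that cycles through them ends up with such an uneven luminance profile.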

When you take the jet colormap used above and convert it to luminance using the code snippet above, you get this:

In [5]:
plt.imshow(z, cmap=grayify_cmap('jet'))
plt.colorbar();

It's a mess! The greyscale-only version of this colormap has strange luminance spikes in the middle, and makes it incredibly difficult to figure out what's going on in a plot with a modicum of complexity. Much better is to use a colormap with a uniform luminance gradient, such as the built-in grayscale colormap. Let's plot this beside the previous two:

In [6]:
cmaps = [plt.cm.jet, grayify_cmap('jet'), plt.cm.gray]
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
fig.subplots_adjust(wspace=0)

for cmap, ax in zip(cmaps, axes):
    im = ax.imshow(z, cmap=cmap)
    ax.set_title(cmap.name)
    fig.colorbar(im, ax=ax)

In particular, notice that in the left panel, your eye is drawn to the yellow and cyan regions, because the luminance is higher. This can have the unfortunate side-effect of highlighting "features" in your data which may not actually exist!

We can see this luminance spike more clearly if we look at the color profile of jet up close:

In [7]:
def show_colormap(cmap):
    im = np.outer(np.ones(10), np.arange(100))
    fig, ax = plt.subplots(2, figsize=(6, 1.5),
                           subplot_kw=dict(xticks=[], yticks=[]))
    fig.subplots_adjust(hspace=0.1)
    ax[0].imshow(im, cmap=cmap)
    ax[1].imshow(im, cmap=grayify_cmap(cmap))
    
show_colormap('jet')

Once you have the grayscale lined-up with the color version, it's easy to point out these luminance spikes in the jet spectrum. By comparison, take a look at the Cube Helix colormap:

In [8]:
show_colormap('cubehelix')

This is a rainbow-like colormap which – by design – has a uniform luminance gradient across its progression of colors. It's certainly not the best choice in all situations, but you could easily argue that it's always a better choice than jet.

All the Colormaps!

It's useful to see this sort of visualization for all the available colormaps in matplotlib. Here's a quick script that does this:

In [18]:
fig, axes = plt.subplots(36, 6, figsize=(10, 7))
fig.subplots_adjust(left=0, right=1, bottom=0, top=1,
                    hspace=0.1, wspace=0.1)

im = np.outer(np.ones(10), np.arange(100))

cmaps = [m for m in plt.cm.datad if not m.endswith("_r")]
cmaps.sort()

axes = axes.T.ravel()
for ax in axes:
    ax.axis('off')

for cmap, color_ax, gray_ax, null_ax in zip(cmaps, axes[1::3], axes[2::3], axes[::3]):
    del null_ax
    color_ax.set_title(cmap, fontsize=10)
    color_ax.imshow(im, cmap=cmap)
    gray_ax.imshow(im, cmap=grayify_cmap(cmap))

There are some colormaps in here that have very nice, linear luminance gradients, and this is something you should keep in mind when choosing your color map.

Much more could be written about choosing an appropriate color map for any given data; for a more in-depth discussion of matplotlib's maps (and some interesting luminance illustrations), you can refer to matplotlib's choosing a colormap documentation.
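If you'd rather not think about this for every single figure, one low-effort option (a sketch only; cubehelix is just one reasonable choice among the better-behaved built-in maps) is to change matplotlib's default colormap at the top of your script:

import matplotlib.pyplot as plt

# every subsequent imshow/pcolormesh call defaults to cubehelix instead of jet,
# unless an explicit cmap argument is passed
plt.rcParams['image.cmap'] = 'cubehelix'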

If you're interested in streamlined statistical plotting in Python with well thought-out default color choices, I'd suggest taking a look at Michael Waskom's seaborn project, and especially the associated Color Palette Tutorial.

I hope you find this grayify_cmap snippet helpful, and thanks for reading!

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.

by Jake Vanderplas at October 16, 2014 04:30 PM

William Stein

Public Sharing in SageMathCloud, Finally

SageMathCloud (SMC) is a free (NSF, Google and UW supported) website that lets you collaboratively work with Sage worksheets, IPython notebooks, LaTeX documents and much, much more. All work is snapshotted every few minutes, and copied out to several data centers, so if something goes wrong with a project running on one machine (right before your lecture begins or homework assignment is due), it will pop up on another machine. We designed the backend architecture from the ground up to be very horizontally scalable and have no single points of failure.

This post is about an important new feature: You can now mark a folder or file so that all other users can view it, and very easily copy it to their own project.




This solves several problems:
  • Problem: You create a "template" project, e.g., with pre-installed software, worksheets, IPython notebooks, etc., and want other users to easily be able to clone it as a new project. Solution: Mark the home directory of the project public, and share the link widely.

  • Problem: You create a syllabus for a course, an assignment, a worksheet full of 3d images, etc., that you want to share with a group of students. Solution: Make the syllabus or worksheet public, and share the link with your students via an email and on the course website. (Note: You can also use a course document to share files with all students privately.) For example...


  • Problem: You run into a problem using SMC and want help. Solution: Make the worksheet or code that isn't working public, and post a link in a forum asking for help.
  • Problem: You write a blog post explaining how to solve a problem and write related code in an SMC worksheet, which you want your readers to see. Solution: Make that code public and post a link in your blog post.
Here's a screencast.

Each SMC project has its own local "project server", which takes some time to start up, and serves files, coordinates Sage, terminal, and IPython sessions, etc. Public sharing completely bypasses the project server -- it works fine, and is always fast with no startup delay, even when the project server is not running. Moreover, public sharing reads the live files from your project, so you can update the files in a publicly shared directory, add new files, etc., and users will see these changes (when they refresh, since it's not automatic).
As an example, here is the cloud-examples github repo as a share. If you click on it (and have a SageMathCloud account), you'll see this:


What Next?

There is an enormous amount of natural additional functionality to build on top of public sharing.

For example, not all document types can be previewed in read-only mode right now; in particular, IPython notebooks, task lists, LaTeX documents, images, and PDF files must be copied from the public share to another project before people can view them. It is better to release a first usable version of public sharing before systematically going through and implementing the additional features needed to support all of the above. You can make complicated Sage worksheets with embedded images and 3d graphics, and those can be previewed before copying them to a project.
Right now, the only way to visit a public share is to paste the URL into a browser tab and load it. Soon the projects page will be re-organized so you can search for publicly shared paths, see all public shares that you have previously visited, who shared them, how many +1's they've received, comments, etc.

Also, I plan to eventually make it so public shares will be visible to people who have not logged in, and when viewing a publicly shared file or directory, there will be an option to start it running in a limited project, which will vanish from existence after a period of inactivity (say).

There are also dozens of details that are not yet implemented. For example, it would be nice to be able to directly download files (and directories!) to your computer from a public share. And it's also natural to share a folder or file with a specific list of people, rather than sharing it publicly. If somebody is viewing a public file and you change it, they should likely see the update automatically. Right now when viewing a share, you don't even know who shared it, and if you open a worksheet it can automatically execute Javascript, which is potentially unsafe.  Once public content is easily found, if somebody posts "evil" content publicly, there needs to be an easy way for users to report it.

Sharing will permeate everything

Sharing has been thought about a great deal during the last few years in the context of sites such as Github, Facebook, Google+ and Twitter. With SMC, we've developed a foundation for interactive collaborative computing in a browser, and will introduce sharing on top of that in a way that is motivated by your problems. For example, as with Github or Google+, when somebody makes a copy of your publicly shared folder, this copy should be listed (under "copies") and it could start out public by default. There is much to do.

One reason it took so long to release the first version of public sharing is that I kept imagining that sharing would happen at the level of complete projects, just like sharing in Github. However, when thinking through your problems, it makes way more sense in SMC to share individual directories and files. Technically, sharing at this level works well for read-only access, but not for read-write access, since projects are mapped to Linux accounts. Another reason I have been very hesitant to support sharing is that I've had enormous trouble over the years with spammers posting content that gets me in trouble (with my University -- it is illegal for UW to host advertisements). However, by not letting search engines index content, the motivation for spammers to post nasty content is greatly reduced.

Imagine publicly sharing recipes for automatically gradable homework problems, which use the full power of everything installed in SMC, get forked, improved, used, etc.

by William Stein (noreply@blogger.com) at October 16, 2014 02:29 PM

Juan Nunez-Iglesias

Coveralls dashboard

In this series of posts, we've covered:

  • automated tests with pytest
  • measuring test coverage
  • test configuration files
  • continuous integration with Travis-CI

Today I will show you how to continuously check your test coverage using Coveralls.

Travis runs whatever commands you tell it to run in your .travis.yml file. Normally, that’s just installing your program and its requirements, and running your tests. If you wanted instead to launch some nuclear missiles, you could do that. (Assuming you were happy to put the launch keys in a public git repository… =P)

The Coveralls service, once again free for open-source repositories, takes advantage of this: you just need to install an extra piece of software from PyPI, and run it after your tests have passed. Do so by adding the line pip install coveralls to your before_install section, and just the coveralls command to a new after_success section:

language: python
python:
    - "2.7"
    - "3.4"
before_install:
    - pip install pytest pytest-cov
    - pip install coveralls
script:
    - py.test
after_success:
    - coveralls

This will push your coverage report to Coveralls every time Travis is run, i.e., every time you push to your GitHub repository. You will need to turn on the service by:

  • going to coveralls.io and signing in with your GitHub credentials
  • clicking on “Repositories”, and “Add Repo” (and perhaps on “Sync GitHub Repos”, top right)
  • switching your repository’s switch to “On”

Adding a repo to Coveralls

That’s it! With just two small changes to your .travis.yml and the flick of a switch, you will get continuous monitoring of your test coverage with each push, and also with each pull request. By default, Coveralls will even comment on your pull requests, so that you instantly know whether someone is not testing their submitted code!

Plus, you get a swanky web dashboard to monitor your coverage!

Coveralls dashboard

Tune in tomorrow for the final chapter in continuous integration in Python: showing off! No, it’s not a blog post about blogging about it. =P

(Addendum: the above .travis.yml will run Coveralls twice, once per Python version, so it will also comment twice for your PRs, notify you twice, etc. If you want to test more Python versions, the problem multiplies. Therefore, you can choose to call Coveralls only for a specific Python version, like so:

after_success:
    - if [[ $ENV == python=3.4* ]]; then
          coveralls;
      fi

This is the approach taken by the scikit-image project.)


by Juan Nunez-Iglesias at October 16, 2014 06:11 AM

October 15, 2014

Continuum Analytics news

The Leading Enterprise Python Distribution for Data Analytics Is Moving to the Hadoop and Spark Infrastructures

Continuum Analytics, the premier provider of Python-based data analytics solutions and services, today announced the release of Anaconda Cluster. Continuum’s enterprise Python distribution for data analytics, Anaconda Server, is moving to the Hadoop and Spark infrastructures. Anaconda Cluster is a highly scalable cluster resource management tool that lets data scientists streamline deployment without rework.

“We would love to pass off clustering to the engineering and IT team, but we are the only ones that know how to manage it right now.” - a Hadoop/Python user, in response to the pain points that Anaconda Cluster addresses

by Continuum at October 15, 2014 12:09 PM

Juan Nunez-Iglesias

Travis build page

This is the fourth post in my Continuous integration (CI) in Python series, and the one that puts the “continuous” in CI! For the introductory three posts, see:

Introduction to Travis-CI

Once you’ve set up your tests locally, it does you no good if you don’t remember to run them! Travis-CI makes this seamless, because it will check out your code and run your tests for each and every commit you push to GitHub! (This is even more important when you are receiving pull requests on GitHub: the tests will be run online, without you having to individually check out each PR and run the tests on your machine!)

This is what continuous integration is all about. Once upon a time, the common practice was to pile new features onto a codebase. Then, come release time, there would be a feature freeze, and some time would be spent cleaning up code and removing bugs. In continuous integration, instead, no new feature is allowed into the codebase until it is bug-free, as demonstrated by the test suite.

What to do

You need to first add a .travis.yml file to the root of your project. This tells Travis how to install your program’s dependencies, install your program, and run the tests. Here’s an example file to run tests and coverage on our maths.py sample project:

language: python
python:
    - "2.7"
    - "3.4"
before_install:
    - pip install pytest pytest-cov
script:
    - py.test

Pretty simple: tell Travis the language of your project, the Python version (you can specify multiple versions, one per line, to test on multiple Python versions!), and how to install test dependencies. Finally, give the command to run your tests. (This can be anything, not just pytest or another testing framework, as long as a shell exit status of 0 means “success” and anything else means “failure”.)

You can read more about the syntax and options for your .travis.yml in the Travis documentation. There are other sections you can add, such as “virtualenv” to set up Python virtual environments, “install” to add compilation and other installation steps for your library, before testing, and “after_success” to enable e.g. custom notifications. (We will use this last one in the next post.)

Once you have created .travis.yml for your project, you need to turn it on for your repo. This is, currently, a fairly painful process, and I can’t wait for the day that it’s enabled by default on GitHub. [Update: see below] In the meantime though:

[Update 2014-10-28: Thanks to @hugovk for pointing out that the first four points above can be skipped. It turns out that when you first log in to Travis-CI using your GitHub account, you give them write access to your webhooks. So, when you add a repo from their end, they go ahead and add themselves on the GitHub end! Boom. Way easier.]

Voilà! Every push and pull-request to your repository will trigger a job on Travis’s servers, building your dependencies and your software, running your tests, and emailing you if anything went wrong! Amazingly, this is completely free for open source projects, so you really have no excuse for not using it!

Travis build page

Follow this blog to learn how to continuously check test coverage using Coveralls, coming in the next post!


by Juan Nunez-Iglesias at October 15, 2014 11:00 AM

October 14, 2014

Matthieu Brucher

Book review: The Prize – The Epic Quest for Oil, Money & Power

Yet another book that a colleague suggested I read. I also discussed it with another colleague who told me the same thing: this is a book that anyone in the oil and gas field should read. But what about people outside this industry?

Content and opinions

The book is primarily focused on the big scale, from a geopolitical and strategic point of view, with anecdotes on what happened behind the scenes. But even at the global scale, there are many things in the book that I didn’t know about: how Rockefeller got so rich in the first place, the role of gasoline in World War 1, what drove the two oil shocks… The book is quite big, and split into five parts, mainly driven by fundamental changes in how the world approached oil. Sometimes it is difficult to follow the main thread of the story as the author has to go back in time to follow a different side of History, and sometimes the names of the biggest companies are confusing, albeit precise.

The first part reads like a Far West story. It starts in the middle of the 19th century and finishes before World War 1. Basically this is the story of Rockefeller’s fortune and the fall of his company, as well as a nice description of what happens when everyone looks for oil without regulation. It is “funny” to compare this to the current rush for shale gas in the US and how it is exploited (i.e. without regard for the global economy).

The second chapter describes World War 1 as well as the Depression that followed. Here, the politicians start noticing the importance of oil: how it played a crucial role in World War 1, as well as how it was a great help during the recession. But it also indicated that oil needed a minimum price, which is what I think is currently forgotten in the shale gas exploitation: the gas companies are sacrificed on the altar of growth. At least Roosevelt understood that the whole economy has to grow, and that the oil and gas industry couldn’t be sacrificed…

The third chapter tackles World War 2 and the different theatres of war. Clearly the different opponents knew about the importance of oil, and the winners were those who managed to keep a steady supply of it. Also mentioned here is the issue we still face: what energy do you use when you don’t have oil anymore?

The fourth chapter is about the growth years before the first oil shock. Clearly the balance shifted then between companies and countries, and even though this part is less interesting in terms of events (there is still the birth of OPEC and the Suez crisis), it still provides some background for the current power balance that is not even hinted at in the media. The corruption of countries by money is also introduced.

The fifth chapter describes the current balance, with the oil sword that was so important during the second oil shock and the war between Israel and the Arab nations. The way it was used is not something I had linked with the second oil shock at first, although it seems so blatant in retrospect. This is also where the Middle East story ends, with the first Iraq War. Of course, the Middle East is actually the central piece of the whole book, and there are so many pieces in the book that gravitate around it that it is sometimes incredible.

There is a small epilogue to sum up what came after the first Iraq War and then the advent of the war on terrorism. This epilogue is lacking some depth, but it is really difficult to get perspective on events that are happening while you are writing the book!

Conclusion

I thought I understood the story of the 20th century quite well. I was very mistaken. Even though it is a long book, there is so much to discuss that it is sometimes not enough. It is simple enough that anyone can read it. And everyone should definitely read it.

by Matt at October 14, 2014 07:11 AM

Juan Nunez-Iglesias

jnuneziglesias

This is the third post in my series on continuous integration (CI) in Python. For the first two posts, see 1: automated tests with pytest, and 2: measuring test coverage.

By now, you’ve got a bunch of tests and doctests set up in your project, which you run with the command:

$ py.test --doctest-modules --cov . --cov-report term-missing

If you happen to have a few script and setup files lying around that you don’t want to test, this might expand to

$ py.test --doctest-modules --cov . --cov-report term-missing --ignore script.py

As you can see, this is getting cumbersome. You can fix this by adding a setup.cfg file to the root of your project, containing configuration options for your testing. This follows standard INI file syntax.

[pytest]
addopts = --ignore script.py --doctest-modules --cov-report term-missing --cov .

Note that the “ignore” flag only tells pytest not to run that file during testing. You also have to tell the coverage module that you don’t want those lines counted for coverage. Do this in a .coveragerc file, with similar syntax:

[run]
omit = *script.py

Once you’ve done this, you can invoke your tests with a simple, undecorated call to just py.test.

To find out more about your configuration options, see the pytest basic test configuration page, and Ned Batchelder’s excellent .coveragerc syntax guide.

That’s it for this entry of my CI series. Follow this blog for the next two entries, setting up Travis CI and Coveralls.


by Juan Nunez-Iglesias at October 14, 2014 02:57 AM

October 13, 2014

Titus Brown

The emerging discipline of Data Intensive Biology

Yesterday I gave my third keynote address ever, at the Australasian Genomics Technology Association's annual meeting in Melbourne (talk slides here). On my personal scale of talks, it was a 7 or 8 out of 10: I gave it a lot of energy, and I think the main messages got across, but I ended up conveying a few things that I regretted - in particular, the Twitter feed pointed out that I'd slagged on biologists a few times (slag, v: British informal. Criticize (someone) in an abusive and insulting manner). Absolutely not my intent!

I consider myself a biologist, albeit one who works primarily on biological data analysis. So I'm starting to call myself a data-intensive biologist. And I thought now would be a good time to talk about the emerging discipline of Data Intensive Biology.

Before I go on, this post is dedicated to my Australian hosts. I've had a wonderful time so far, and I've been tremendously impressed with the Australasian genomics and bioinformatics research communities. They've got something great going on here and I look forward to further interactions!

Who are Data Intensive Biologists?

In my talk, I included the de rigueur picture of a tidal wave, reproduced below.

../static/images/tidalwave.jpg

This tidal wave is intended to represent the data deluge in biology, the combination of lots of -omics and sensor data that is starting to hit the field of biology. If you look closely at the picture, you'll see three groups of researchers.

  • the inland researchers, who are up on the shore well away from the boardwalk. They are looking at the tidal wave, wondering if the water's going to reach them; they're far away from it, though, so they're not really too worried about it.
  • the boardwalk researchers, who are on the little walkway at the base of the crashing wave. Some of them are looking at the wave in horror, aware that this is going to be painful; others are busy putting on little lifevests, in the vain hope that this will save them; and the third group are looking the other way, wondering what everyone is talking about.
  • the surfer dudes and dudettes, who are having the time of their lives, surfing down the face of the wave. Every now and then they fall off, but the water's deep and they can get right back on the board. (OK, it's an imperfect analogy.)

The surfers are the data intensive biologists: they love the data, they love the thrill of data analysis, and they're going somewhere fast, although maybe not in a forward direction.

The anatomy of a Data Intensive Biologist

In my ABiC 2014 keynote (keynote #2), I listed out five character types that participate in bioinformatics. These were:

  1. Computer scientists
  2. Software engineers
  3. Data scientists
  4. Statisticians
  5. Biologists

(I missed a 6th, database maintainers, and possibly a 7th, data curators.)

In another miscommunication, I meant to say (but did not say during my talk) that almost every effective bioinformatics researcher is some linear combination of these seven characters. I think that data intensive biologists can be defined on this set of axes, too.

So: data intensive biologists are biology researchers who are focused on biological questions, but who have substantial footings in many or all of the other fields above. That is, their focus is on making biological progress, but they are using tools from computer science, software engineering, data science, statistics, and databases to study biology.

Some additional characteristics

Data Intensive Biologists:

  • are usually well grounded in at least one field of biology;
  • understand that data is not information, but that it's a darn good start;
  • know that most of our current biological knowledge is limited or wrong, but that we've got to rely on it anyway;
  • are aware that investing in automation is sometimes repaid 10x in efficiency, except when it's not;
  • realize that reproducible computational analyses are a great idea;
  • write software when they have to, but only because they have to;
  • think data science training is neat, because it gives an enduring competitive edge;
  • constantly rebalance themselves between the CS, software engineering, data science, and stats perspectives -- almost as if they were surfing across those fields;
  • get that open science is obvious if not always practical;
  • are confused as to why biology graduate programs aren't teaching more data analysis;

and, finally, data intensive biologists

  • shouldn't worry about their career options.

Concluding thoughts

If you had to nail down a definition - people like to do that, for some reason :) - I would go with:

Data intensive biologist: a researcher focused on addressing biological research questions primarily through large scale data analysis or integration.

I don't have any interest in being exclusionary with this definition. If you're tackling biological questions in any way, shape, or form, and you're using lots of data to do it, you fit my definition!

Oh, and by the way? There are already workshops and faculty positions in this area. Although they're all at Hopkins, so maybe the best you can say is that James Taylor and I agree on terminology :).

--titus

by C. Titus Brown at October 13, 2014 10:00 PM

October 10, 2014

Titus Brown

How good is MEGAHIT?

A few weeks back, Nick Loman (via Manoj Samanta) brought MEGAHIT to our attention on Twitter. MEGAHIT promised "an ultra-fast single-node solution for large and complex metagenome assembly" and they provided a preprint and some open source software. This is a topic near and dear to my heart (see Pell et al., 2012 and Howe et al., 2014), so I was immediately interested - especially since the paper used our Howe et al. data set to prove out their results. (The twitterati also pointed out that the preprint engaged in some bashing of this previous work, presumably to egg me on. ;)

So I thought, heck! Let's take MEGAHIT out for a spin! So my postdoc Sherine Awad and I tried it out.

tl; dr? MEGAHIT seems pretty awesome to me, although IDBA and SPAdes still seem to beat it by a bit on the actual assembly results.

Installing MEGAHIT

We ran into some small compilation problems but got it working on an Ubuntu 12.04 system easily enough.

Running it was also a snap. It took a few minutes to work through the required options, and voila, we got it running and producing results. (You can see some example command lines here.)

First question --

How does it do on E. coli?

One of the claims made in the paper is that this approach performs well on low-coverage data. To evaluate this, I took a 5m read subset from the E. coli MG1655 dataset (see Chitsaz et al., 2011) and further subsetted it to 1m reads and 500k reads, to get (roughly) 100x, 20x, and 10x data sets. I then ran MEGAHIT with default parameters, specifying 1 GB of memory, and limiting only the upper k size used (because otherwise it crashed) -- again, see the Makefile.

For comparison, I also ran SPAdes on the lowest-coverage data, looking only at the contigs (not the scaffolds).

After it was all done assembling, I ran QUAST on the results.

Measure               100x    20x     10x     10x (SPAdes)
N50                   73736   52352   9067    18124
Largest alignment     221kb   177kb   31kb    62kb
bp in contigs > 1kb   4.5mb   4.5mb   4.5mb   4.5mb
Genome fraction       98.0%   98.0%   97.4%   97.9%
Misassembled length   2kb     40.8kb  81.3kb  63.6kb

(Data: MEGAHIT 100x, 20x, and 10x; and SPAdes 10x.)

In summary, it does pretty well - with even pretty low coverage, you're getting 97.4% of the genome in contigs > 500bp (QUAST's default cutoff). Misassemblies grow significantly at low coverage, but you're still only at 2% in misassembled contigs.

In comparison to SPAdes at low coverage, the results are ok also. SPAdes performs better in every category, which I would expect -- it's a great assembler! - but MEGAHIT performs well enough to be usable. MEGAHIT is also much, much faster - seconds vs minutes.

Next question -

How well does it do on on a metagenomic data set?

Sherine has been running benchmarks for Velvet, SPAdes, and IDBA on the data set from Shakya et al, 2013, a mock community data set. So I asked her to add some MEGAHIT results. She did quality trimming as specified in Kalamazoo, and ran MEGAHIT with 10 GB of RAM. She then used QUAST to evaluate the results against the known good genomes.

Measure                   MEGAHIT   SPAdes    IDBA
# contigs > 1kb           19,982    16,387    16,191
Length in contigs > 1kb   190.7mb   192.5mb   194.8mb
# misassemblies           698       894       1223
Bp in misassemblies       12.7mb    28.2mb    21.8mb
Metagenome fraction       89.96%    90.42%    90.97%

Again, the answer is "MEGAHIT works pretty well." Fewer misassemblies, but also more contigs and a bit less coverage of the known genome.

Third question --

How fast and memory efficient was MEGAHIT?

Very. We didn't actually measure it, but, like, really fast. And low memory, also. We're doing systematic benchmarking on this front for our own paper, and we'll provide details as we get them.

(We didn't measure MEGAHIT's performance because we don't have numbers for SPAdes and IDBA yet. We didn't measure SPAdes and IDBA yet because actually doing the benchmarking well is really painful - they take a long time to run. 'nuff said :)

So, what are your conclusions?

So far, +1. Highly recommended to people who are good at command line stuff and general all-around UNIX-y folk. I'd want to play around with it a bit more before strongly recommending it to anyone who wasn't a seasoned bioinformatician. It's rough around the edges, and I haven't looked at the code much yet. It also breaks in various edge cases, but at least it's pretty robust when you just hand it a straight up FASTQ file!

That having been said, it works shockingly well and is quite fast and memory efficient. If you're having trouble achieving an assembly any other way I would definitely recommend investing the time to try out MEGAHIT.

--titus

p.p.s. Thanks to Rayan Chikhi and Lex Nederbragt for reading and commenting on a draft version of this post!

Appendix: MEGAHIT and digital normalization

In the MEGAHIT paper, they commented that they believed that digital normalization could lead to loss of information. So I thought I'd compare MEGAHIT on 100x against MEGAHIT and SPAdes running on digitally normalized 100x:

Measure               100x    DN (w/MEGAHIT)   DN (w/SPAdes)
N50                   73736   82753            132872
Largest alignment     221kb   222kb            224kb
bp in contigs > 1kb   4.5mb   4.5mb            4.6mb
Genome fraction       98.0%   98.1%            98.2%
Misassembled length   2kb     120kb            48kb

(Data: MEGAHIT 100x, MEGAHIT DN, and SPAdes DN.)

The short version is, I don't see any evidence that diginorm leads to incompleteness, but clearly diginorm leads to lots of misassemblies when used in conjunction with MEGAHIT or SPAdes on high-coverage genomes. (We have some (ok, lots) of evidence that this doesn't happen with lower coverage genomes, or metagenomes.) That having been said, it's clearly rather assembler-specific, since SPAdes does a much better job than MEGAHIT on dn data.

The shorter version? You probably won't need to use diginorm with MEGAHIT, and you shouldn't. That's OK. (There are lots of reasons why you shouldn't use diginorm.)

I still don't have any evidence that diginorm drops information in non-polyploid situations. Let me know if you've seen this happen!

Appendix II: Running your own evaluation

All of the E. coli numbers above are available in the 2014-megahit-evaluation github repo. See README.md in that repo for basic install instructions, and Makefile for what I ran and how to run it. Feel free to reproduce, extend, and update!

by C. Titus Brown at October 10, 2014 10:00 PM

October 09, 2014

Continuum Analytics

Tips for More Effective Numba Usage

Last week, I wrote about the features Numba has, and what our long term development goals are. Today, I want to dig into the details of how to more effectively use Numba on your programs.

by Stanley Seibert at October 09, 2014 03:15 PM

October 07, 2014

Matthieu Brucher

Book review: Fundamentals of Reservoir Engineering

I worked for a long time in the seismic department of my company, and switched to the reservoir department only last year. The problems that are tackled are quite different, and so is the way they are solved. So it has nothing to do with the book I reviewed a long time ago. After two trainings in reservoir simulation, I also read this book, which a colleague of mine labeled as the reference book.

Content and opinions

The book has a nice progression in complexity. Although it won’t cover anything like model generation (that is someone else’s job, after all), it tackles the basic questions that a reservoir engineer has to ask himself. The progression is also close to what I had during my reservoir engineering training.

The first four chapters are an introduction to the basic physics: how to compute the volume of oil in the reservoir, how the reservoir works, how much you can get out of it, the basic thermodynamics of oil (the black oil model), and the Darcy law.

Then, the next four chapters deal with wells and pressure, which are the most important things in the job. The main values you can get to describe your reservoir come from well tests and the analysis of their results. They may seem basic, with a lot of approximations, but they are still the first maths you do when appraising a reservoir, even 40 years later…

The last two chapters are a little bit different, but equally important. I didn’t think much about aquifers before I realized that they are the main means of recovering oil… Without them, you quite rapidly have to resort to enhanced recovery techniques. And they behave quite differently depending on their size and their location. The last chapter deals with relative permeability. It’s something that I found “missing” from the previous chapters, as it was always mentioned during my trainings, and then I noticed that the last chapter tries to cover everything known about the topic.

Conclusion

Although the book is quite old and reservoir simulation has made tremendous progress since, the basic approaches are still used today, and that is what the book is about. The placement of the exercises, and the fact that they are solved right after, is also much appreciated.

by Matt at October 07, 2014 07:37 AM

October 04, 2014

Titus Brown

Putting together an online presence for a diffuse academic community - how?

I would like to build a community site. Or, more precisely, I would like to recognize, collect, and collate information from an already existing but rather diffuse community.

The focus of the community will be academic data science, or "data driven discovery". This is spurred largely by the recent selection of the Moore Data Driven Discovery Investigators, as well as the earlier Moore and Sloan Data Science Environments, and more broadly by the recognition that academia is broken when it comes to data science.

So, where to start?

For a variety of reasons -- including the main practical one, that most academics are not terribly social media integrated and we don't want to try to force them to learn new habits -- I am focusing on aggregating blog posts and Twitter.

So, the main question is... how can we most easily collect and broadcast blog posts and articles via a Web site? And how well can we integrate with Twitter?

First steps and initial thoughts

Following Noam Ross's suggestions in the above storify, I put together a WordPress blog that uses the RSS Multi Importer to aggregate RSS feeds as blog posts (hosted on NFSN). I'd like to set this up for the DDD Investigators who have blogs; those who don't can be given accounts if they want to post something. This site also uses a Twitter feed plugin to pull in tweets from the list of DDD Investigators.

The resulting RSS feed from the DDDI can be pulled into a River of News site that aggregates a much larger group of feeds.

The WordPress setup was fairly easy and I'm going to see how stable it is (I assume it will be very stable, but shrug time will tell :). I'm upgrading my own hosting setup and once that's done, I'll try out River4.

Next steps and longer-term thoughts

Ultimately a data-driven-discovery site that has a bit more information would be nice; I could set up a mostly static site, post it on github, authorize a few people to merge, and then solicit pull requests when people want to add their info or feeds.

One thing to make sure we do is track only a portion of feeds for prolific bloggers, so that I, for example, have to tag a post specifically with 'ddd' to make it show up on the group site. This will avoid post overload.

I'd particularly like to get a posting set up that integrates well with how I consume content. In particular, I read a lot of things via my phone and tablet, and the ability to post directly from there -- probably via e-mail? -- would be really handy. Right now I mainly post to Twitter (and largely by RTing) which is too ephemeral, or I post to Facebook, which is a different audience. (Is there a good e-mail-to-RSS feed? Or should I just do this as a WordPress blog with the postie plug-in?)

The same overall setup could potentially work for a Software Carpentry Instructor community site, a Data Carpentry Instructor community site, trainee info sites for SWC/DC trainees, and maybe also a bioinformatics trainee info site. But I'd like to avoid anything that involves a lot of administration.

Things I want to avoid

Public forums.

Private forums that I have to administer or that aren't integrated with my e-mail (which is where I get most notifications, in the end).

Overly centralized solutions; I'm much more comfortable with light moderation ("what feeds do I track?") than anything else.


Thoughts?

--titus

by C. Titus Brown at October 04, 2014 10:00 PM

October 02, 2014

Continuum Analytics

Advanced Features of Conda Part 2

Conda is the package manager that comes with Continuum’s Anaconda distribution. Conda makes it easy to install and manage all your packages and environments on Windows, Mac OS X, and Linux.

If you are new to conda, I recommend starting at the documentation, and watching my presentation from SciPy 2014.

In this post, I will assume that you are already familiar with conda and its basic usage, both for installing and building packages. I will look at a few advanced features or tips that may not be well known, even among advanced users of conda. These features will help you with feature discovery, customization, and allow you to manage your packages and environments in more advanced ways.

In part 1, we looked at --help, configuration, conda update --all, conda list --export and conda create --file, conda clean, and pinning packages. In this post, we will look at some tools that make building packages easier and some more features that help manage your environments and use conda more effectively from the command line.

by Aaron Meurer at October 02, 2014 12:44 PM

Juan Nunez-Iglesias

jnuneziglesias

(Edit: I initially thought it would be cute to number from 0. But it turns out it becomes rather obnoxious to relate English (first, second, …) to 0-indexing. So this was formerly volume 1. But everything else remains the same.)

This is the second post in a series about setting up continuous integration for a Python project from scratch. For the first post, see Automated tests with pytest.

After you’ve written some test cases for a tiny project, it’s easy to check what code you have automatically tested. For even moderately big projects, you will need tools that automatically check what parts of your code are actually tested. The proportion of lines of code that are run at least once during your tests is called your test coverage.

For the same reasons that testing is important, measuring coverage is important. Pytest can measure coverage for you with the coverage plugin, found in the pytest-cov package. Once you’ve installed the extension, a test coverage measurement is just a command-line option away:

 ~/projects/maths $ py.test --doctest-modules --cov .
============================= test session starts ==============================
platform darwin -- Python 2.7.8 -- py-1.4.25 -- pytest-2.6.3
plugins: cov
collected 2 items 

maths.py .
test_maths.py .
--------------- coverage: platform darwin, python 2.7.8-final-0 ----------------
Name         Stmts   Miss  Cover
--------------------------------
maths            2      0   100%
test_maths       4      0   100%
--------------------------------
TOTAL            6      0   100%

=========================== 2 passed in 0.07 seconds ===========================

(The --cov option takes a directory as input, which I find obnoxious, given that py.test so naturally defaults to the current directory. But it is what it is.)

Now, if I add a function without a test, I’ll see my coverage drop:

def sqrt(x):
    """Return the square root of `x`."""
    return x * 0.5

(The typo is intentional.)

--------------- coverage: platform darwin, python 2.7.8-final-0 ----------------
Name         Stmts   Miss  Cover
--------------------------------
maths            4      1    75%
test_maths       4      0   100%
--------------------------------
TOTAL            8      1    88%

With one more option, --cov-report term-missing, I can see which lines I haven’t covered, so I can try to design tests specifically for that code:

--------------- coverage: platform darwin, python 2.7.8-final-0 ----------------
Name         Stmts   Miss  Cover   Missing
------------------------------------------
maths            4      1    75%   24
test_maths       4      0   100%   
------------------------------------------
TOTAL            8      1    88%   

Do note that 100% coverage does not ensure correctness. For example, suppose I test my sqrt function like so:

def sqrt(x):
    """Return the square root of `x`.

    Examples
    --------
    >>> sqrt(4.0)
    2.0
    """
    return x * 0.5

Even though my test is correct, and I now have 100% test coverage, I haven’t detected my mistake. Oops!

But, keeping that caveat in mind, full test coverage is a wonderful thing, and if you don’t test something, you’re guaranteed not to catch errors. Further, my example above is quite contrived, and in most situations full test coverage will spot most errors.
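For instance, one extra test value (a hypothetical addition to test_maths.py) would have exposed the bug immediately, since the buggy implementation returns 9.0 * 0.5 = 4.5:

from maths import sqrt

def test_sqrt():
    # sqrt(4.0) happens to hide the bug (4.0 * 0.5 == 2.0),
    # so also check a value where halving and square-rooting disagree
    assert sqrt(9.0) == 3.0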

That’s it for this part. Tune in next time to learn how to turn on Travis continuous integration for your GitHub projects!


by Juan Nunez-Iglesias at October 02, 2014 08:04 AM

October 01, 2014

Titus Brown

Announcement: Moore Data Driven Discovery Investigator Award

I am very, very happy to announce that I have been selected to be one of the fourteen Moore Data Driven Discovery Investigators.

This is a signal investment by the Moore Foundation into the burgeoning area of data-intensive science, and it is quite a career booster. It will provide my lab with $1.5m over a period of 5 years, enabling us to dive into certain aspects of infrastructure building that we would otherwise be unable to support. I'm also particularly enthusiastic about the implicit recognition of our efforts in open science, open source, and replicability!

My talk and proposal for the competition are linked here:

http://ivory.idyll.org/blog/2014-moore-ddd-talk.html

You can see the list of all fourteen DDD Investigators, along with various and sundry information that includes talk titles, proposals, and Twitter handles, here.

--titus


Affiliations and links:

Dr. C. Titus Brown

Visiting Associate Professor, Population Health and Reproduction, UC Davis School of Veterinary Medicine.

Member, UC Davis Data Science Initiative

Member, UC Davis Genome Center

External faculty, BEACON Center for the Study of Evolution in Action.

Member, Software Carpentry Scientific Advisory Board.

Blog: http://ivory.idyll.org/blog/

Twitter: @ctitusbrown

Slightly outdated Lab Web site: http://ged.msu.edu/

by C. Titus Brown at October 01, 2014 10:00 PM

Introducing the Moore Foundation's Data Driven Discovery (DDD) Investigators

Note: the source data for this is available on github at https://github.com/ctb/dddi

Today, the Moore Foundation announced that they have selected fourteen Moore Data Driven Discovery Investigators.

In reverse alphabetical order, they are:


Dr. Ethan White, University of Florida

Proposal: Data-intensive forecasting and prediction for ecological systems

Source code repository: https://github.com/weecology

Google Scholar: http://scholar.google.com/citations?user=pHmra8cAAAAJ&hl=en

ImpactStory: https://impactstory.org/EthanWhite

Blog: http://jabberwocky.weecology.org/

Twitter: @ethanwhite


Dr. Laura Waller, UC Berkeley

Source code repository: https://github.com/Waller-Lab/CalOptrics


Dr. Matt Turk, University of Illinois at Urbana-Champaign

Proposal: Changing How We Conduct Inquiry

Source code repository: http://bitbucket.org/MatthewTurk/

Google Scholar: http://scholar.google.com/citations?user=QTmv2p0AAAAJ&hl=en

Blog: https://plus.google.com/+MatthewTurk

Twitter: @powersoffour


Dr. Blair D. Sullivan, North Carolina State University

Proposal title: Enabling Science via Structural Graph Algorithms

Source code repository: https://github.com/bdsullivan/

Google Scholar: http://scholar.google.com/citations?user=oZBtvZMAAAAJ&hl=en&oi=ao

Twitter: @BlairDSullivan


Dr. Matthew Stephens, University of Chicago

Proposal title: Gene Regulation and Dynamic Statistical Comparisons

Source code repository: https://github.com/stephens999

Google Scholar: http://scholar.google.com/citations?user=qOQFhUkAAAAJ&hl=en

Blog: http://randomdeviation.blogspot.com/

Twitter: @mstephens999


Dr. Amit Singer, Princeton University

Proposal title: Maximum Likelihood Estimation by Semidefinite Programming: Application to Cryo-Electron Microscopy

Google Scholar: http://scholar.google.com/citations?user=h67w8QsAAAAJ&hl=en


Dr. Kimberly Reynolds, UT Southwestern

Proposal title: Decoding the Genome: Finding Effective Variables from Evolutionary Ensembles

Google Scholar: http://scholar.google.com/citations?user=6bWFU7MAAAAJ&hl=en&oi=ao


Dr. Chris Re, Stanford

Google Scholar: http://scholar.google.com/citations?user=DnnCWN0AAAAJ&hl=en


Dr. Laurel Larsen, UC Berkeley

Proposal title: Developing the informationscape approach to environmental change detection.


Dr. Carl Kingsford, Carnegie Mellon University

Proposal title: Lightweight Algorithms for Petabase-scale Genomics

Google Scholar: http://scholar.google.com/citations?user=V_cvqKcAAAAJ

Twitter: @ckingsford


Dr. Jeffrey Heer, U. Washington Seattle

Proposal: Interactive Data Analysis

Google Scholar: http://scholar.google.com/citations?user=vlgs4G4AAAAJ

Twitter: @jeffrey_heer


Dr. Casey Greene, Dartmouth

Proposal title: Learning the context of publicly available genome-wide data

Google Scholar: http://scholar.google.com/citations?user=ETJoidYAAAAJ&hl=en

Twitter: @GreeneScientist


Dr. C. Titus Brown, UC Davis

Proposal: Infrastructure for Data Intensive Biology

Source code repository: http://github.com/ged-lab/

Google Scholar: http://scholar.google.com/citations?user=O4rYanMAAAAJ

ImpactStory: https://impactstory.org/TitusBrown

Blog: http://ivory.idyll.org/blog/

Twitter: @ctitusbrown


Dr. Joshua Bloom, UC Berkeley

Proposal title: Efficient Data-Driven Astrophysical Inquiry with Machine Learning

Google Scholar: http://scholar.google.com/citations?user=fHkUYk0AAAAJ

Blog: http://5nf5.blogspot.com/

Twitter: @profjsb


--titus

by C. Titus Brown at October 01, 2014 10:00 PM

Filipe Saraiva

Cantor: new features in KDE 4.14

KDE 4.14 was released in August 2014 but I did not have time to write about new features in Cantor for that release.

So, let’s fix it now!

New backend: Lua

The Cantor family of backends has a new member: Lua, using the luajit implementation.

This backend has a lot of features: syntax highlighting, tab completion, figures in the worksheet, a script editor, and more.

Cantor + Lua in action

The Lua backend was developed by Lucas Negri, a Brazilian guy, which is a reason for me to be very happy. Welcome aboard, Lucas!

You can read more about this backend in a post on Lucas's blog.

Use utf8 on LaTeX entries

When you export the worksheet to LaTeX, utf8 will be used by default. This improvement was developed by Lucas.

Support for the packaging extension in Sage and Octave backends

Now these backends have an assistant to import packages/modules/libraries.

Support for auto-running scripts


Auto run scripts/commands in Python 2 backend

Now the Python 2, Scilab, Octave, Sage, Maxima, Qalculate, and KAlgebra backends support auto-running scripts. You can configure a set of scripts or commands and they will run automatically after the worksheet launches!

Add CTRL+Space as an alternative code completion command in the worksheet

The default code completion command in the worksheet is the TAB key, but now we have an alternative command too: CTRL + Space. This maintains consistency between the script editor (where the default code completion is CTRL + Space) and the worksheet.

Initial support for linear algebra and plot assistants in Python 2

I developed initial support for two amazing plugins in the Python 2 backend: the linear algebra plugin and the plot plugin.

First, let's look at the linear algebra plugin. In the menu bar, go to Linear Algebra > Create Matrix. A matrix creation window will open, as below. You just put the values in the cells.

Matrix creation assistant

After pushing the 'Ok' button, the matrix command from the numpy module will be loaded into the worksheet automatically.

New matrix created

For now this plugin implements just matrix creation.

Now let's see the plot plugin. You can use it to create 2D and 3D plots. Let's do x = numpy.arange(0.0, 2.0, 0.01) and, in the menu bar, go to Graphics > Graphics 2D. The window below will open.

Plotting 2D assistant

You can set an expression for the Y axis (in this case I am using numpy.sin) and a variable name for the X axis (in this case, 2 * x * numpy.pi). You could put just x in the variable name field to plot against the values of x.

After pushing the 'Ok' button, a command using pylab will be loaded into the worksheet to make the graphic.

The 3D plotting assistant creates its pictures in a similar way.

As you can see, to use these assistants we need to have some Python modules in the workspace, and they must have the same names used in the plugins. There are a lot of ways to import modules into the Python environment (import foo; import foo as [anyname]; from foo import *; etc.), so a fully generic approach is impossible (well, if you have some idea I would like to hear it).
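To illustrate the ambiguity (a sketch only, with numpy standing in for any module), the same module can end up in the namespace under many different names:

import numpy                  # used as numpy.matrix(...)
import numpy as np            # used as np.matrix(...)
from numpy import matrix      # used as matrix(...)
from numpy import *           # everything dumped into the namespace

The plugins can only generate code for one of these spellings, which is why the modules need to be available under the exact names the plugins expect.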

My choice was to import numpy, scipy, matplotlib and pylab when the Python 2 backend is loaded by Cantor. I intend to change this, because those modules would become mandatory to use the Python 2 backend correctly, and pylab is no longer recommended in recent matplotlib versions. So, expect some changes in this plugin soon.

In any case, I would like to hear the opinions of the scientific Python community about these features.

Future

For now we are working on the Cantor port to Qt5/KF5. You can follow the work in the ‘frameworks‘ branch of the Cantor repository.

Donations

If you use or appreciate my work on Cantor or another free software project, please consider making a donation to me, so I can continue my contributions to improve Cantor.

You can also consider making a donation to KDE, and help with the maintenance of this great free software community and its products.

by Filipe Saraiva at October 01, 2014 05:58 PM

William Stein

SageMathCloud Course Management

by William Stein (noreply@blogger.com) at October 01, 2014 01:05 PM

Juan Nunez-Iglesias

jnuneziglesias

(Edit: I initially thought it would be cute to number from 0. But it turns out it becomes rather obnoxious to relate English (first, second, …) to 0-indexing. So this was formerly volume 0. But everything else remains the same.)

I just finished the process of setting up continuous integration from scratch for one of my projects, cellom2tif, a simple image file converter/liberator. I thought I would write a blog post about that process, but it has slowly mutated into a hefty document that I thought would work better as a series. I’ll cover automated testing, test coverage, and how to get these to run automatically for your project with Travis-CI and Coveralls.

Without further ado, here goes the first post: how to set up automated testing for your Python project using pytest.

Automated tests, and why you need them

Software engineering is hard, and it’s incredibly easy to mess things up, so you should write tests for all your functions, which ensure that nothing obviously stupid is going wrong. Tests can take a lot of different forms, but here’s a really basic example. Suppose this is a function in your file, maths.py:

def square(x):
    return x ** 2

Then, elsewhere in your package, you should have a file test_maths.py with the following function definition:

from maths import square
def test_square():
    x = 4
    assert square(x) == 16

This way, if someone (such as your future self) comes along and messes with the code in square(), test_square will tell you whether they broke it.

Testing in Python with pytest

A whole slew of testing frameworks, such as nose, will then traverse your project, look for files and functions whose names begin with test_, run them, and report any errors or assertion failures.

I’ve chosen to use pytest as my framework because:

  • it is a strict superset of both nose and Python’s built-in unittest, so that if you run it on projects set up with those, it’ll work out of the box;
  • its output is more readable than nose’s; and
  • its fixtures provide a very nice syntax for test setup and cleanup.
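To give a flavour of that last point, here is a minimal sketch of a pytest fixture (the numbers fixture and test_sum function are made-up names, purely for illustration):

import pytest

@pytest.fixture
def numbers():
    # setup code: runs before each test that requests this fixture
    return [1, 2, 3]

def test_sum(numbers):
    # pytest passes the fixture's return value in as the argument
    assert sum(numbers) == 6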

For a quick intro to pytest, you can watch Andreas Pelme’s talk at this year’s EuroPython conference:

But the basics are very simple: sprinkle files named test_something.py throughout your project, each containing one or more test_function() definitions; then type py.test on the command line (at your project root directory), and voila! Pytest will traverse all your subdirectories, gather up all the test files and functions, and run your tests.

Here’s the output for the minimal maths project described above:

~/projects/maths $ py.test
============================= test session starts ==============================
platform darwin -- Python 2.7.8 -- py-1.4.20 -- pytest-2.5.2
collected 1 items

test_maths.py .

=========================== 1 passed in 0.03 seconds ===========================

In addition to the test functions described above, there is a Python standard called doctest in which properly formatted usage examples in your documentation are automatically run as tests. Here's an example:

def square(x):
    """Return the square of a number `x`.

    [...]

    Examples
    --------
    >>> square(5)
    25
    """
    return x ** 2

(See my post on NumPy docstring conventions for what should go in the ellipsis above.)

Depending on the complexity of your code, doctests, test functions, or both will be the appropriate route. Pytest supports doctests with the --doctest-modules flag. (This runs both your standard tests and doctests.)

~/projects/maths $ py.test --doctest-modules
============================= test session starts ==============================
platform darwin -- Python 2.7.8 -- py-1.4.20 -- pytest-2.5.2
collected 2 items

maths.py .
test_maths.py .

=========================== 2 passed in 0.06 seconds ===========================

Test-driven development

That was easy! And yet most people, my past self included, neglect tests, thinking they’ll do them eventually, when the software is ready. This is backwards. You’ve probably heard the phrase “Test-driven development (TDD)”; this is what they’re talking about: writing your tests before you’ve written the functionality to pass them. It might initially seem like wasted effort, like you’re not making progress in what you actually want to do, which is write your software. But it’s not:

By spending a bit of extra effort to prevent bugs down the road, you will get to where you want to go faster.
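
As a hypothetical sketch of that workflow (the mean() function and the file names are mine, not the post's): you write the failing test first, then just enough code to make it pass.

# test_stats.py -- written first, before stats.py exists, so this test fails
from stats import mean

def test_mean():
    assert mean([1, 2, 3, 4]) == 2.5

# stats.py -- written second, with just enough code to make the test pass
def mean(values):
    return sum(values) / float(len(values))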

That’s it for volume 0! Watch out for the next post: ensuring your tests are thorough by measuring test coverage.


by Juan Nunez-Iglesias at October 01, 2014 08:40 AM

September 30, 2014

Matthieu Brucher

Book review: Intelligent Systems in Oil Field Development under Uncertainty

How do you know whether or not to produce an oil field, and how to do it? The oil industry is used to running a lot of simulations to satisfy the SEC rules, and to know how much oil it can drill today and in the future. This book goes further than the usual question of how much can be extracted from a field under specific conditions, by adding the development plan as well as the oil price to the equation.

Content and opinions

The book seems to be based on things Petrobras, the Brazilian national oil company, does, so it’s fair to assume they actually do far more than what is presented in the book.

There are four chapters that make up the core of the book, exposing the different algorithms, explaining bits of them (you will still need to dig further to understand what they actually are, and obviously even more to duplicate the results), but also giving lots of examples.

Let’s start with the first two chapters; they are not in the “core” of the book. The first one explains why you need to handle uncertainties in the oil world. Main reason: why would you spend billions of dollars when you are not sure you will get them back? The second chapter explains some financial theory. The usual uncertainty analysis (the one for the SEC, for instance) doesn’t involve financial studies. It just requires you to give your P1/proven oil and gas reserves (a 10% chance that you get this oil out of the field). Then, for your management, you want to get the P2 (50%). Here, you go one step further, because these two values don’t show how much money you will get. So the financial framework is important for understanding how you can get to the final step of the simulation.

The third chapter deals with the Machine Learning algorithms that will be used throughout the book. Several evolutionary algorithms (genetic algorithms and genetic programming) are now mainstream, but some of them are not (cultural algorithms or quantum-inspired GA). It’s always interesting to see new algorithms, especially as they are applied to a concrete problem in one of the biggest industries. Then the authors tackle neural networks, with a simple approach (compared to my previous review), fuzzy logic (I hadn’t heard about it for quite some time), and the mix of the two, neuro-fuzzy models.

The fourth chapter is about the exploration process and mainly uses evolutionary algorithms to optimize the development plan, that is, the number of wells and their types, so as to maximize the number of oil barrels at the end of the life of the field. It is interesting to see that they don’t simulate an actual complete development plan (when the wells are drilled and then started). Injecting prior knowledge is also addressed, and it doesn’t always make the result better, which is interesting to know!

The fifth chapter starts introducing market behavior by adding the option to activate new parts of the development plan, depending on several parameters. Here, the authors use genetic algorithms and fuzzy models, also adding a simulation of the oil price during the life of the field, and ending up with Monte Carlo methods to define the “option” price. I have to say that I didn’t understand everything in their examples, but I understood the global ideas, without knowing much of the financial side. The explanation of how fuzzy numbers and statistics differ in the end is lacking, although they are quite close. It’s a different approach, with a different theory anyway.

The sixth chapter, the last one in the “core”, brings everything together and tries to take it even further. It introduces additional mathematical tools but isn’t as thorough as the previous three chapters. It is still interesting, but obviously taking advantage of this one requires mastering the others first.

The last chapter presents the tools used in production and the architecture decisions behind them. It has less value, as you can’t get the tool…

Conclusion

I picked up the book because I’m currently in the oil industry, and I have an interest in intelligent systems. Even though I can’t use the tools now, it gives some good leads for building such a system, and hopefully I may work on such systems some day (OK, I’ve already designed the core framework around the reservoir simulator, before I even read the book).

by Matt at September 30, 2014 03:42 PM

Continuum Analytics news

Continuum Analytics Releases Anaconda 2.1

Continuum Analytics, the premier provider of Python-based data analytics solutions and services, announced today the release of the latest version of Anaconda, its free, enterprise-ready collection of libraries for Python.

Anaconda enables big data management, analysis, and cross-platform visualization for business intelligence, scientific analysis, engineering, machine learning, and more. The latest release, version 2.1, adds a new version of the Anaconda Launcher and PyOpenSSL, and updates NumPy, Blaze, Bokeh, Numba, and 50 other packages.

by Corinna Bahr at September 30, 2014 08:00 AM

September 28, 2014

Gaël Varoquaux

Improving your programming style in Python

Here are some references on software development techniques and patterns to help write better code. They are intended for the casual programmer, and certainly not an advanced developer.

They are listed in order of difficulty.

Software carpentry

http://swc.scipy.org.

These are the original notes from Greg Wilson’s course on software engineering at the University of Toronto. This course is specifically intended for scientists, rather than computer science students. It is very basic and does not cover design issues.

A tutorial introduction to Python

http://www.informit.com/articles/article.asp?p=23100&seqNum=3&rl=1.

This tutorial is easier to follow than Guido’s tutorial, though it does not go into as much depth.

Python Essential Reference

http://www.informit.com/articles/article.asp?p=453682&rl=1

http://www.informit.com/articles/article.asp?p=459269&rl=1

These are two chapters out of David Beazley’s excellent book Python Essential Reference. They allow you to understand more deeply how Python works. I strongly recommend this book to anybody serious about Python.

An Introduction to Regular Expressions

http://www.informit.com/articles/article.asp?p=20454&rl=1

If you are going to do any sort of text manipulation, you absolutely need to know how to use regular expressions: powerful search and replace patterns.
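
For instance, a small search-and-replace sketch with Python's re module (the date-reformatting task is just an illustration of mine, not from the article):

import re

text = "Released on 2014-09-28, updated on 2014-12-17."
# Rewrite ISO dates (YYYY-MM-DD) as DD/MM/YYYY using capture groups
print(re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text))
# Released on 28/09/2014, updated on 17/12/2014.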

Software design for maintainability

My own post

A case of shameless plug: this is a post that I wrote a few years ago. I think that it is still relevant.

Writing a graphical application for scientific programming using TraitsUI

http://gael-varoquaux.info/computers/traits_tutorial/index.html

Building interactive graphical applications is a difficult problem. I have found that the TraitsUI module provides a great answer to this problem. This is a tutorial intended for the non-programmer.
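
As a very small sketch of the flavour of TraitsUI code (a toy example of mine, not from the tutorial): you declare typed attributes on a HasTraits class and get an editable dialog almost for free.

from traits.api import HasTraits, Float, Str
from traitsui.api import View, Item

class Lens(HasTraits):
    name = Str('50mm prime')
    focal_length = Float(50.0)
    aperture = Float(1.8)

    # A simple declarative description of the dialog layout
    traits_view = View(Item('name'), Item('focal_length'), Item('aperture'),
                       title='Lens parameters')

if __name__ == '__main__':
    Lens().configure_traits()  # opens a small editable window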

An introduction to Python iterators

http://www.informit.com/articles/article.asp?p=26895&rl=1

This article may not be terribly easy to follow, but iterators are a great feature of Python, so it is definitely worth reading.
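
As a tiny sketch (my own example, not from the article): a generator function is the easiest way to write an iterator, producing values one at a time instead of building a whole list in memory.

def squares(n):
    """Yield the squares of 0..n-1 lazily, one value at a time."""
    for i in range(n):
        yield i ** 2

for value in squares(5):
    print(value)  # 0, 1, 4, 9, 16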

Functional programming

http://www.ibm.com/developerworks/linux/library/l-prog.html?open&l=766,t=gr,p=PrmgPyth

Functional programming is a programming style where mathematical functions are successively applied to immutable objects to go from the inputs of the program to its outputs in a succession of transformations. It is appreciated by some because it is easy to analyze and prove correct. In certain cases it can be very readable.
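
A small sketch of the style (my own example): each step applies a pure function to the previous result, and nothing is modified in place.

from functools import reduce  # built in on Python 2, imported from functools on Python 3

prices = (12.0, 7.5, 3.25)
with_tax = map(lambda p: p * 1.2, prices)         # apply a pure function to each element
expensive = filter(lambda p: p > 5.0, with_tax)   # keep only some elements
total = reduce(lambda a, b: a + b, expensive, 0)  # fold what is left into a single value
print(round(total, 2))  # 23.4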

Patterns in Python

http://www.suttoncourtenay.org.uk/duncan/accu/pythonpatterns.html.

This document exposes a few design patterns in Python. Design patterns are solutions to recurring development problems using object-oriented programming. I suggest this reading only if you are familiar with OOP.
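
As a small sketch of the idea (my own example, not taken from that document), here is the Strategy pattern: the behaviour that varies is passed in as an interchangeable object instead of being hard-coded.

import json

class Exporter(object):
    """Strategy pattern: the output format is an interchangeable callable."""
    def __init__(self, formatter):
        self.formatter = formatter

    def export(self, record):
        return self.formatter(record)

def as_json(record):
    return json.dumps(record)

def as_csv(record):
    return ",".join("%s=%s" % item for item in sorted(record.items()))

print(Exporter(as_json).export({"name": "pytest", "stars": 100}))
print(Exporter(as_csv).export({"name": "pytest", "stars": 100}))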

General Object-Oriented programming advice

Designing object-oriented code actually requires some care: when you are building your set of abstractions, you are designing the world in which you are going to be condemned to live (or actually code). I would advise people to keep things as simple as possible and follow the SOLID principles:

http://mmiika.wordpress.com/oo-design-principles/

Using decorators to do meta-programming in Python

http://www-128.ibm.com/developerworks/linux/library/l-cpdecor.html.

A very beautiful article for the advanced Python user. Meta-programming is a programming technique that involves changing the program at run time. This allows you to add new abstractions to the code the programmer writes, thus creating a “meta-language”. This article shows this very well.
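
A classic sketch of the technique (my own toy example): a decorator wraps a function to add behaviour, here a timing report, without touching the function's body.

import functools
import time

def timed(func):
    """Report how long each call to `func` takes."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print("%s took %.4f s" % (func.__name__, time.time() - start))
        return result
    return wrapper

@timed
def slow_sum(n):
    return sum(range(n))

slow_sum(10 ** 6)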

A Primer on Python Metaclass Programming

http://www.onlamp.com/lpt/a/3388

Metaclasses allow you to define new styles of objects, which can have different calling, creation, or inheritance rules. This is way over my head, but I am referencing it here for the record.
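
For the record as well, a minimal sketch (mine, written with the Python 2 class syntax of the era): a metaclass intercepts class creation and can, for instance, rewrite the attributes it receives.

class UpperAttrMeta(type):
    """Metaclass that upper-cases attribute names at class-creation time."""
    def __new__(mcs, name, bases, namespace):
        renamed = dict((key if key.startswith('__') else key.upper(), value)
                       for key, value in namespace.items())
        return super(UpperAttrMeta, mcs).__new__(mcs, name, bases, renamed)

class Config(object):
    __metaclass__ = UpperAttrMeta  # Python 2 syntax; Python 3 would use `class Config(metaclass=UpperAttrMeta)`
    debug = True

print(Config.DEBUG)  # True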

Iterators in Python

https://docs.python.org/2/library/itertools.html#recipes

Learn to use the itertools (but don’t abuse them)!
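
A small sketch of what they buy you (my own example): grouping runs of consecutive items lazily with itertools.groupby.

import itertools

measurements = [('mon', 3), ('mon', 5), ('tue', 2), ('tue', 4)]
# groupby yields (key, lazy group) pairs for runs of consecutive equal keys
for day, group in itertools.groupby(measurements, key=lambda pair: pair[0]):
    print("%s %d" % (day, sum(value for _, value in group)))
# mon 8
# tue 6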

Related to the producer/consumer problem with iterators, see:

http://www.oluyede.org/blog/2007/04/09/producerconsumer-in-python/

by Gaël Varoquaux at September 28, 2014 10:00 PM

September 23, 2014

Continuum Analytics

Introducing Blaze - HMDA Practice

We look at data from the Home Mortgage Disclosure Act, a collection of actions taken on housing loans by various governmental agencies (gzip-ed csv file here) (thanks to Aron Ahmadia for the pointer). Uncompressed, this dataset is around 10GB on disk, so we don’t want to load it into memory on a modern commercial notebook.

Instead, we use Blaze to investigate the data, select down to the data we care about, and then migrate that data into a suitable computational backend.

by Matt Rocklin at September 23, 2014 10:00 AM

September 19, 2014

Gaël Varoquaux

Hiring an engineer to mine large functional-connectivity databases

Work with us to leverage leading-edge machine learning for neuroimaging

At Parietal, my research team, we work on improving the way brain images are analyzed, for medical diagnostic purposes, or to understand the brain better. We develop new machine-learning tools and investigate new methodologies for quantifying brain function from MRI scans.

One of our important avenues of contribution is in deciphering “functional connectivity”: analyzing the correlation of brain activity to measure interactions across the brain. This direction of research is exciting because it can be used to probe the neural support of functional deficits in incapacitated patients, and thus lead to new biomarkers of functional pathologies, such as autism. Indeed, functional connectivity can be computed without resorting to complicated cognitive tasks, unlike most functional imaging approaches. The flip side is that exploiting such “resting-state” signal requires advanced multivariate statistics tools, something at which the Parietal team excels.

For such multivariate processing of brain imaging data, Parietal has an ecosystem of leading-edge, high-quality tools. In particular, we have built the foundations of the most successful Python machine learning library, scikit-learn, and we are growing a dedicated software package, nilearn, that leverages machine learning for neuroimaging. To support this ecosystem, we have dedicated top-notch programmers, led by the well-known Olivier Grisel.

We are looking for a data-processing engineer to join our team and work on applying our tools to very large neuroimaging databases to learn specific biomarkers of pathologies. For this, the work will be shared with the CATI, the French platform for multicentric neuroimaging studies, located in the same building as us. The general context of the job is the NiConnect project, a multi-organisational research project that I lead and that focuses on improving diagnostic tools based on resting-state functional connectivity. We have access to unique algorithms and datasets, before they are published. What we