April 24, 2014

Filipe Saraiva

KDE packagers: give some love to Cantor


I have some posts to write about Cantor, but first I would like to ask for help from the KDE packagers of the various Linux distributions around the world.

I have received some e-mails from users asking “how can I use Python in Cantor?” or “where is the Python support in Cantor?”. Well, the python2-backend has been available in Cantor since the KDE 4.12 release. If you are using KDE >= 4.12 but cannot use Python in Cantor, the package was probably not built correctly.

The Python 2 development library (commonly packaged as python-devel in some Linux distributions) is required to build the python2-backend, and Python 2 itself is required to use Cantor with Python 2.

So if you are a Cantor user and cannot use Cantor with Python, please file a bug report in your distribution's bug tracker. You can also link to this post in the report.

Either way, whether or not your distribution ships the python2-backend, please leave a comment below and I will compile a table with this information.

by Filipe Saraiva at April 24, 2014 12:47 AM

April 23, 2014

Gaël Varoquaux

Google summer of code projects for scikit-learn

I’d like to welcome the four students that were accepted for the GSoC this year:

Welcome to all of you. Your submissions were excellent, and you demonstrated a genuine willingness to integrate into the project, with its social and coding dynamics. It is a privilege to work with you.

I’d also like to thank all the mentors, Alex, Arnaud, Daniel, James, Jaidev, Olivier, Robert and Vlad. It is a lot of work to mentor, and mentors are not only making it possible for great code to enter scikit-learn, but also shaping a future generation of scikit-learn contributors.

by gael at April 23, 2014 05:56 AM

April 22, 2014

Titus Brown

A first science fair

So my daughter just participated in her first science fair, at the age of 6. ("Conclusion: science can be fun! and sticky!")

Over dinner, my wife and I came up with some ideas for her next fair. She was having trouble dissolving sugar in ice water, so we suggested maybe something where we measured the effect of temperature on sugar dissolution in water.

Our daughter then came up with her own idea for experiments, involving many different sugar amounts dissolved in lemonade and a test of the resulting sweetness. (It sounded suspiciously self-serving, frankly, but as long as she files the right Conflict of Interest forms, I have no problem with that.)

I suggested crowdsourcing it, by doing a double-blind study and asking our neighbors to score the lemonade. I said I knew a guy, Ethan Perlstein, that could probably give her some advice. But either way I told her that if she wanted funding, she'd have to do a write up of her idea before anyone would consider it. Fair's fair -- we have many demands on our time and attention. (And heck, if it was crazy enough, maybe she could farm it to the Sloan Foundation.)

My wife then said, "Taking the joy out of science as soon as possible, eh?" I acknowledged the point and told our daughter, "Don't worry about it too much; we'll talk about it more at your annual review."

I didn't bring up the issue of high impact publications. There's time for that when she's seven.

--titus

by C. Titus Brown at April 22, 2014 10:00 PM

April 21, 2014

Luis Pedro Coelho

luispedro

New release of Jug: 1.0 release candidate

I’ve put out a new release of jug, which I’m calling 1.0-rc0. This is a release candidate for version 1.0 and if there are no bugs, in a few days, I’ll just call it 1.0!

There are only a few changes from the previous version, but the package has now reached maturity.

For the future, I want to start developing the hooks interface and use hooks for more functionality.

In the meanwhile, download jug from PyPI, watch the video or read the tutorials or the full documentation.
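If you have never used jug, here is a minimal sketch of what a jugfile looks like (the function names below are my own illustrative choices; only TaskGenerator and the jug subcommands come from jug itself):


from jug import TaskGenerator

@TaskGenerator
def double(x):
    # each call becomes a task whose result is memoised to disk
    return 2 * x

@TaskGenerator
def add_all(values):
    # receives the computed values once the double() tasks have run
    return sum(values)

results = [double(i) for i in range(10)]
total = add_all(results)

Running "jug execute" on such a file (possibly in several shells at once) executes the tasks, and "jug status" reports how far along the computation is.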


by luispedro at April 21, 2014 02:27 PM

April 18, 2014

NeuralEnsemble

Sumatra 0.6 released

We would like to announce the release of version 0.6.0 of Sumatra, a tool for automated tracking of simulations and computational analyses so as to be able to easily replicate them at a later date.

This version of Sumatra sees many new features, improvements to existing ones, and bug fixes:

New export formats: LaTeX and bash script

You can now export your project history as a LaTeX document, allowing you to easily generate a PDF listing details of all the simulations or data analyses you've run.


$ smt list -l -f latex > myproject.tex

Using tags, you can also restrict the output to only a subset of the history.

You can also generate a shell script, which can be executed to repeat the sequence of computations captured by Sumatra, or saved as an executable record of your work. The shell script includes all version control commands needed to ensure the correct version of the code is used at each step.


$ smt list -l -f shell > myproject.sh

Working with multiple VCS branches

Previous versions of Sumatra would always update to the latest version of the code in your version control repository before running the computation.

Now, "smt run" will never change the working copy, which makes it much easier to work with multiple version control branches and to go back to running earlier versions of your code.

Programs that read from stdin or write to stdout

Programs that read from standard input or write to standard output can now be run with "smt run". For example, if the program is normally run using:


$ myprog input > output

You can run it with Sumatra using:


$ smt run -e myprog -i input -o output

Improvements to the Web browser interface

There have been a number of small improvements to the browser interface:

  • a slightly more compact and easier-to-read (table-like) layout for parameters;
  • the working directory is now displayed in the record detail view;
  • the interface now works if record labels contain spaces or forward slashes.

The minimal Django version needed is now 1.4.

Support for PostgreSQL

The default record store for Sumatra is based on the Django ORM, using SQLite as the backend. It is now possible to use PostgreSQL instead of SQLite, which gives better performance and allows the Sumatra record store to be placed on a separate server.

To set up a new Sumatra project using PostgreSQL, you will first have to create a database using the PostgreSQL tools ("psql", etc.). You then configure Sumatra as follows:


$ smt init --store=postgres://username:password@hostname/databasename MyProject

Note that the database tables will not be created until after the first "smt run".

Integration with the SLURM resource manager

Preliminary support for launching MPI computations via SLURM has been added.


$ smt configure --launch_mode=slurm-mpi
$ smt run -n 256 input_file1 input_file2

This will launch 256 tasks using "salloc" and "mpiexec".

Command-line options for SLURM can also be set using "smt configure":


$ smt configure --launch_mode_options=" --tasks-per-node=1"

If you are using "mpiexec" on its own, without a resource manager, you can set MPI command-line options in the same way.

Improvements to the @capture decorator

The "@capture" decorator, which makes it easy to add Sumatra support to Python scripts (as an alternative to using the "smt" command), now captures stdout and stderr.

Improvements to parameter file handling

Where a parameter file has a standard mime type (like json, yaml), Sumatra uses the appropriate extension if rewriting the file (e.g. to add parameters specified on the command line), rather than the generic ".param".

Migrating data files

If you move your input and/or output data files, either within the filesystem on your current computer or to a new computer, you need to tell Sumatra about it so that it can still find your files. For this there is a new command, "smt migrate". This command also handles changes to the location of the data archive, if you are using one, and changes to the base URL of any mirrored data stores. For usage information, run:


$ smt help migrate

Miscellaneous improvements

  • "smt run" now passes unknown keyword args on to the user program. There is also a new option '--plain' which prevents arguments of the form "x=y" being interpreted by Sumatra; instead they are passed straight through to the program command line.
  • When repeating a computation, the label of the original is now stored in the 'repeats' attribute of 'Record', rather than appending "_repeat" to the original label. The new record will get a new unique label, or a label specified by the user. This means a record can be repeated more than once, and is a more reliable method of indicating a repeat.
  • A Sumatra project now knows the version of Sumatra with which it was created.
  • "smt list" now has an option '-r'/'--reverse' which lists records oldest to newest.

Bug fixes

A fair number of bugs have been fixed.

Download, support and documentation

The easiest way to get the latest version of Sumatra is:


$ pip install sumatra

Alternatively, Sumatra 0.6.0 may be downloaded from PyPI or from the INCF Software Center. Support is available from the sumatra-users Google Group. Full documentation is available on pythonhosted.org.

by Andrew Davison (noreply@blogger.com) at April 18, 2014 06:52 PM

April 16, 2014

Luis Pedro Coelho

luispedro

Modernity

The Bourne shell was first released in 1977 (37 years ago).

The C Programming Language book was published in 1978 (36 years ago), describing the C language which had been out for a few years.

Python was first released in 1991 (23 years ago). The first version that looks very much like the current Python was version 2.2, released in 2001 (13 years ago), but the code from Python 1.0.1 is already pretty familiar to my eyes.

The future might point in the direction of functional languages such as Haskell, which first appeared in 1990 (24 years ago), although the first modern version is from 1998 (Haskell 98).

Vim was first released in 1991, based on vi, released in 1976. Some people prefer Emacs, released a bit earlier (GNU Emacs, however, is fairly recent, only released in 1985, so 29 years ago).

The web was first implemented in 1989 with some preliminary work in the early 1980s (although the idea of hypertext had been around for longer).

§

The only really new pieces of software I use regularly are distributed version control systems, which are less than 20 years old in both implementation and ideas.

Edit: the original version of the post had a silly arithmetic error pointed out in the comments. Thanks


by luispedro at April 16, 2014 08:49 AM

April 15, 2014

William Stein

SageMathCloud's new storage architecture

Keywords: ZFS, bup, rsync, Sage

SageMathCloud (SMC) is a browser-based hosted cloud computing environment for easily collaborating on Python programs, IPython notebooks, Sage worksheets and LaTeX documents. I spent the last four months wishing very much that fewer people would use SMC. Today that has changed, and this post explains some of the reasons why.

Consistency Versus Availability

Consistency and availability are competing requirements. It is trivial to keep the files in a SageMathCloud project consistent if we store it in exactly one place; however, when the machine that project is on goes down for any reason, the project stops working, and the users of the project are very unhappy. By making many copies of the files in a project, it's fairly easy to ensure that the project is always available, even if network switches in multiple data centers completely fail, etc. Unfortunately, if there are too many users and the synchronization itself puts too heavy of a load on the overall system, then machines will fail more frequently, and though projects are available, files do not stay consistent and data is lost to the user (though still "out there" somewhere for me to find).

Horizontal scalability of file storage and availability of files are also competing requirements. If there are a few compute machines in one place, then they can all mount user files from one central file server. Unfortunately, this approach leads to horrible performance if instead the network is slow and has high latency; it also doesn't scale up to potentially millions of users. A benchmark I care about is downloading a Sage binary (630MB) and extracting it (creating over 70,000 files); I want this to take at most 3 minutes total, which is hard using a networked filesystem served over the general Internet between data centers. Instead, in SMC, we store the files for user projects on the compute machines themselves, which provides optimal speed. Moreover, we use a compressed filesystem, so in many cases read and write speeds are nearly twice as fast as they might be otherwise.

New Architecture of SageMathCloud

An SMC project with id project_id consists of two directories of files, replicated across several machines using rsync:
  1. The HOME directory: /projects/project_id
  2. A bup repository: /bup/bups/project_id
Users can also create files they don't care too much about in /scratch, which is a compressed and deduplicated ZFS filesystem. It is not backed up in any way, and is local to that compute machine.

The /projects directory is one single big ZFS filesystem, which is both lz4 compressed and deduplicated. ZFS compression is just plain awesome. ZFS deduplication is much more subtle, as deduplication is tricky to do right. Since data can be deleted at any time, one can't just use a bloom filter to very efficiently tell whether data is already known to the filesystem, and instead ZFS uses a much less memory efficient data structure. Nonetheless, deduplication works well in our situation, since the compute machines all have sufficient RAM (around 30-60GB), and the total data stored in /projects is well under 1TB. In fact, right now most compute machines have about 100GB stored in /projects.

The /bup/bups directory is also one single big ZFS filesystem; however, it is neither compressed nor deduplicated. It contains bup repositories, where bup is an awesome git-based backup tool written in Python that is designed for storing snapshots of potentially large collections of arbitrary files in a compressed and highly deduplicated way. Since the git pack format is already compressed and deduplicated, and bup itself is highly efficient at deduplication, we would gain almost nothing by using compression or deduplication directly on this ZFS filesystem. When bup deduplicates data, it does so using a sliding window through the file, unlike ZFS which simply breaks the file up into blocks, so bup does a much better job at deduplication. Right now, most compute machines have about 50GB stored in /bup/bups.

When somebody actively uses a project, the "important" working files are snapshotted about once every two minutes. These snapshots are done using bup and stored in /bup/bups/project_id, as mentioned above. After a snapshot is successfully created, the files in the working directory and in the bup repository are copied via rsync to each replica node. The users of the project do not have direct access to /bup/bups/project_id, since it is of vital importance that these snapshots cannot be corrupted or deleted, e.g., if you are sharing a project with a fat fingered colleague, you want peace of mind that even if they mess up all your files, you can easily get them back. However, all snapshots are mounted at /projects/project_id/.snapshots and browseable by the user; this uses bup's FUSE filesystem support, enhanced with some patches I wrote to support file permissions, sizes, change times, etc. Incidentally, the bup snapshots have no impact on the user's disk quota.

We also back up all of the bup archives (and the database nodes) to a single large bup archive, which we regularly back up offsite on encrypted USB drives. Right now, with nearly 50,000 projects, the total size of this large bup archive is under 250GB (!), and we can use it to efficiently recover any particular version of any file in any project. The size is relatively small due to the excellent deduplication and compression that bup provides.

In addition to the bup snapshots, we also create periodic snapshots of the two ZFS filesystems mentioned above... just in case. Old snapshots are regularly deleted. These are accessible to users if they search around enough with the command line, but are not consistent between different hosts of the project, hence using them is not encouraged. This ensures that even if the whole replication/bup system were to somehow mess up a project, I can still recover everything exactly as it was before the problem happened; so far there haven't been any reports of problems.

Capacity

Right now there are about 6000 unique weekly users of SageMathCloud and often about 300-400 simultaneous users, and there are nearly 50,000 distinct projects. Our machines are at about 20% disk space capacity, and most of them can easily be expanded by a factor of 10 (from 1TB to 12TB). Similarly, disk space for our Google compute engine nodes is $0.04 per GB per month. So space-wise we could scale up by a factor of 100 without too much trouble. The CPU load is at about 10% as I write this, during a busy afternoon with 363 clients connected very actively modifying 89 projects. The architecture that we have built could scale up to a million users, if only they would come our way...

by William Stein (noreply@blogger.com) at April 15, 2014 04:02 PM

April 12, 2014

Titus Brown

Imagine...

Links, software, thoughts -- all solicited! Add 'em below or send 'em to me, t@idyll.org.

---

Imagine... a rolling 48 hour hackathon, internationally teleconferenced, on reproducing analyses in preprints and papers. Each room of contributors could hack on things collaboratively while awake, then pass it on to others in overlapping timezones and go to sleep. The next day, they can wake up to a set of pull requests, documentation, tests, and learned information, together with a friendly set of about-to-sleep people who can tell them what's going on. Work could be done on cloud machines in a standardized environment, with literate computing and version control. (Hat tip to Greg Wilson, who introduced me to this concept via Random Hacks of Kindness.)

Imagine... open, remixable education materials on building reproducible computational analyses, with lesson plans, homeworks, and workshop exercises under CC0. Video lectures would be posted, together with the "lecture script" used to produce them. (Hat tip to Jake Vanderplas.)

Imagine... a monthly journal club on reproducibility and reproducibility engineering, done as with the IPython lab meetings: via Google Hangout broadcast to YouTube. We could pick apart papers that have been implemented reproducibly, run them, understand design tradeoffs, and invite the author to discuss their choices and talk about what they would do now. (Hat tip to Randy LeVeque, Bill Howe, and Steven Roberts, with whom I conversed about this in October; and Olga Botvinnik and Gabe Pratt, who developed the idea further at PyCon.)

Imagine... a declarative metadata standard that you can use to tell a Linux VM how to download your data, execute your computational analysis, and spin up an interface to a literate computing environment like IPython Notebook with the analysis preloaded. Then we can provide buttons on scientific papers that say "run this analysis on Rackspace! or Amazon! Estimated cost: $25." (Hat tip to Fernando Perez and Jake Vanderplas, who helped me think through this.)

Imagine... automated integration tests for papers, where you provide the metadata to run your analysis (see one paragraph above) while you're working on your paper and a service automatically pulls down your analysis source and data, runs it, and generates your figures for you to check. Then when the paper is ready to submit, the journal takes your metadata format and verifies it themselves, and passes it on to reviewers with a little "reproducible!" tick mark. (Hat tip to Fernando Perez and Jake Vanderplas, who helped me think through this.)

---

Also see a previous list of ideas: http://ivory.idyll.org/blog/w4s-tech-wanted.html

--titus

by C. Titus Brown at April 12, 2014 10:00 PM

April 11, 2014

April 09, 2014

Luis Pedro Coelho

sausage

I was in Denmark last week, teaching software carpentry. The students were very enthusiastic, but they had very different starting points, which made teaching harder.

For a course introducing complete beginners to programming, I typically rely heavily on the Python Tutor created by Philip Guo, which is an excellent tool. Then, my goal is to get them to understand names, objects, and the flow of control.

I don’t use the term variable when discussing Python as I don’t think it’s a very good concept. C has variables, which work like little boxes you put values in. If you’re thinking of little boxes in Python, things get confusing. If you try to think of little boxes plus pointers (or references), it’s still not a very good map of what Python is actually doing.

For more intermediate students (the kind that have used one programming language), I typically still go through this closely. I find that many still have major faults in their mental model of how names and objects work. However, for these intermediate students, this can go much faster [1]. If it’s the first time they are even seeing the idea of writing code, then it naturally needs to be slow.

Last week, because the class was so mixed, it was probably too slow for some and too fast for others.

§

A bit of Danish weirdness:

sausage

 A sausage display at a local pub

[1] I suppose if students knew Haskell quite well but no imperative programming, this may no longer apply, but teaching Python to Haskell programmers is not a situation I have been in.


by luispedro at April 09, 2014 08:31 AM

April 08, 2014

Continuum Analytics

OpenSSL 1.0.1g fixes critical vulnerability

OpenSSL 1.0.1g packages are now available via Anaconda. These packages patch the “heartbleed” vulnerability described in CVE-2014-0160. This is a critical update and we strongly recommend that all Anaconda users - especially those in corporate or sensitive environments - follow the steps outlined here.

by Matthew Willis at April 08, 2014 01:00 PM

Matthieu Brucher

Announcement: Audio TK 0.1.0

A long time ago, I started implementing audio filters with a Qt GUI. I also started other pet projects in the same area, but I didn’t have a proper audio support library in C++ for them. Although Qt plugins are no longer an option (for me), I still hope to implement new filters in the future.


Enter Audio ToolKit. It is loosely inspired by ITK and VTK, with a pipeline concept that should allow for composition and extensibility. The first official release supports my previous example with a simple overdrive.

It is currently only available on OS X, as it is now my main development platform. It should be quite easy to compile ATK 0.1.0 on Windows and Linux once the FFT filter based on Accelerate can support other FFT implementations (which is a goal for 0.2.0).

Download link: ATK 0.1.0

Changelog:

0.1.0
* Butterworth low pass filter
* Python wrappers for Distortion filters

0.0.3
* Audio files input/output filters based on libsndfile
* Input and output filters based on pointers
* Python wrappers for Core filters
* Python wrappers for EQ filters
* Python wrappers for Tools filters

0.0.2
* Middle Side separator filter for stereo channels
* Sinus generator filter for Mock tests
* Frequency tester based on Accelerate FFT for Mock tests
* Second order EQ filters
* Decimation filters
* Oversampling filters
* Basic Wav input/output filters
* Overdrive filter implementation

0.0.1
* Base filter with automatic type conversion to help assemble plugins with different processing types
* Mock filters for generating and checking some signals
* Pan filters with different laws
* 0 dB center, sin/cos taper, constant power
* -3 dB center, sin/cos taper, constant power
* 0 dB center, square-root taper, constant power
* -3 dB center, square-root taper, constant power
* -6 dB center, linear taper
* 0 dB center, balance control
* Volume filter, with gain input in dB and no dB
* Sum filter

Buy Me a Coffee!



by Matthieu Brucher at April 08, 2014 07:01 AM

April 03, 2014

Paul Ivanov

indenting with tabs

This post was written as an IPython Notebook. You can view it on nbviewer, or download it.

Greg Wilson asked on the IPython mailing list:

Subject: easiest way to insert a literal tab character in a code
cell? 
Greg Wilson, on 2014-04-03 18:37,  wrote:
> Hi,
> I'd like to put literal tab characters in cells, but of course tab means 
> "indent" to the editor.  What's the easiest way to do this?
> Thanks,
> Greg
> p.s. because I'm going to write Makefiles in the notebook...

The easiest way to do this is to just get a tab character somewhere that you can copy, and then paste it in.

In [1]:
print("\t")



In [2]:
    # I copy pasted the output of the cell above here

An alternative solution is to make a string with tabs and insert it into another cell, using IPython machinery.

In [3]:
ip = get_ipython()

In [4]:
ip.set_next_input("\tMakefiles are awesome")

In []:
    Makefiles are awesome

If you have a file on disk or on the web, you can also just use the %load magic to do this.

In [5]:
%load /home/pi/file_with_tabs

In []:
default:
    cat /etc/issue
    whoami

Such files can be written with the %%writefile cell magic... but of course you need to have inserted tabs there in some manner.

In [6]:
%%writefile ~/file_with_tabs
default:
    cat /etc/issue
    whoami

Overwriting /home/pi/file_with_tabs

In [7]:
!make -f /home/pi/file_with_tabs

cat /etc/issue
Debian GNU/Linux jessie/sid \n \l

whoami
pi

The more involved, but more usable way

We can set up CodeMirror to insert tabs instead of spaces.

In [8]:
%%javascript

IPython.tab_as_tab_everywhere = function(use_tabs) {
    if (use_tabs === undefined) {
        use_tabs = true; 
    }

    // apply setting to all current CodeMirror instances
    IPython.notebook.get_cells().map(
        function(c) {  return c.code_mirror.options.indentWithTabs=use_tabs;  }
    );
    // make sure new CodeMirror instances created in the future also use this setting
    CodeMirror.defaults.indentWithTabs=use_tabs;

    };

The reason we attach tab_as_tab_everywhere to IPython is that when we use the %%javascript magic, any variables we define get their own closure, so they can only be used from the same cell that defined them. This is done so that javascript errors in one cell don't break the rest of the notebook's javascript. We could have attached it to window or CodeMirror or anything else that's already in javascript-land.

I covered how to add functions like this to the custom.js file in your profile in my post about disabling blinking in the notebook. That way these little functions are available in every notebook, without you having to insert a cell defining them.

Now we've got code that allows us to apply the change to all current and future cells. We leave it as an exercise for the interested reader to modify that code and make a little button in the toolbar, to toggle it on a per-cell basis.

Hints:

You can get to the code mirror instance via IPython.notebook.get_selected_cell().code_mirror

This post was written as an IPython Notebook. You can view it on nbviewer, or download it.

by Paul Ivanov at April 03, 2014 07:00 AM

April 01, 2014

Continuum Analytics

Python 1.0 installable using conda

Python 1.0.1 is now available using conda (on Unix, because this version of Python does not support Windows). This blog post explains how easy it is to create a Python 1.0 environment and some of the features of this Python version.

by Ilan Schnell at April 01, 2014 01:29 PM

Paul Ivanov

bipython 0.1.0

bipython logo

the boldly indiscriminate python interpreter

"...because you shouldn't have to choose."

PROLOGUE

Two interpreters, both alike in dignity,
In fair Pythona, where we lay our scene,
From ancient grudge break to new mutiny,
Where civil code makes git commits unclean.
From forth the fatal loins of these two foes
A newer kind of stranger's given life;
Whose misadventured piteous overthrows
Doth with its birth bury its parents' strife.

ACT I

Enter bpython and ipython

bpython

I'm a fancy terminal-based interface to the Python interpreter. I give you
inline syntax highlighting and auto-completion prompts as you type, and I'll
even automatically show you a little tooltip with a docstring and parameter
list as soon as you hit ( to make the function call, so you always know
what you're doing! I'm svelte and proud of it - I don't try to do all of the
shenanigans that ipython does with the shell and the web, but the cool kids
love my rewind feature for demos. I strive to make interactive python coding
a joy!

ipython

I'm an awesome suite of interactive computing ideas that work together.
For millennia, I've given you tab-completion and object introspection via
obj? instead of help(obj) in Python. I also have sweet shell features,
special magic commands (%run, %timeit, %matplotlib, etc.) and a
history mechanism for both input (command history) and output (results
caching).

More recently, I've decoupled the REPL into clients and kernels, allowing
them to run independently of each other. One popular client is the
IPython Notebook which allows you to write code and prose using a web
browser, sending code to the kernel for execution and getting rich media
results back inline. The decoupling of clients and kernels also allows
multiple clients to interact with the same kernel, so you can hook-up to
that same running kernel from the terminal. The terminal workflow makes
more sense for some things, but my user interface there isn't as polished
as bpython's.

Enter bipython

bipython

By your powers combined... I am bipython!

Exeunt

The Power is Yours!

pip install  bipython
easy_install bipython

bipython requires ipython, pyzmq, bpython, and urwid.

For now, you'll need to have a running ipython kernel before running bipython. You can do this by either opening a notebook or running ipython console. It won't always be like this; I'll fix it as soon as I can, but it'll be sooner with your help over at ivanov/bipython.

After that, just run bipython and enjoy the ride.

Here's a walkthrough of ipython, bpython, and bipython:

The screencast is 20 minutes long, but here I'll play it back at double speed. There's no sound, and you can pause at any time and select / copy portions of the text as you like. Changing the browser font size in the usual way works, too. (click here if the embed didn't work)

by Paul Ivanov at April 01, 2014 07:00 AM

March 31, 2014

Titus Brown

Announcing khmer 1.0

The khmer team is pleased to announce the release of khmer version 1.0. khmer is our software for working efficiently with fixed length DNA words, or k-mers, for research and work in computational biology.

Links:

khmer v1.0 is the culmination of about 9 months of development work by Michael Crusoe (@biocrusoe). Michael is the software engineer I hired on our NIH BIG DATA grant, and he's spent at least half his time since being hired wrangling the project into submission.

What is new?

The last nine months has been a process+code extravaganza. Here are some of the things we did for 1.0, with an emphasis on the parts of the process we changed:

  1. Added continuous integration. Thanks in part to a Rackspace VM, pull requests on github trigger builds and unit tests via Jenkins!
  2. Moved to a pull request model for development.
  3. Instituted code reviews and a development checklist.
  4. Made khmer pip-installable.
  5. Moved to a unified cross-platform build system.
  6. Normalized our command line arguments and made the CLI documentation be auto-generated from the code.
  7. Moved to semantic versioning.
  8. Built a Galaxy interface.
  9. Added code coverage analysis of both C++ and Python code to our continuous integration system.
  10. Introduced a CITATION file and modified the scripts to output citation information.
  11. Wrote a citation handle for the software.
  12. Built better user-focused documentation.
  13. Started using parts of the khmer protocols for acceptance testing.

Why did we do all this work!?

The short answer is, we like it when our software works. khmer is becoming increasingly broadly used, and it would be good if the software were to continue working.

A slightly longer answer is that we are continuing to improve khmer -- we're making it faster, stronger, and better -- while we're also doing research with it. We want to make sure that the old stuff keeps working even as we add new stuff.

But it's not just that. There are now about five people working on fixing, improving, and extending khmer -- several of them are graduate students, working on kick-ass new functionality as part of their PhDs. By having testing infrastructure that better ensures the stability and reliability of our software, each graduate student can work out on their own long branch of functionality, and -- when they're ready -- they can merge their functionality back into the stable core, and we can all take advantage of it.

Actually, it's even more. We're building an edifice here, and we need to have a stable foundation. One very important addition that just came last weekend was the addition of khmer acceptance testing -- making sure that khmer not only works as we expect it to, but integrates with other tools. For this, we turned to the khmer protocols, our nascent effort to build open, integrated pipelines for certain kinds of sequence analysis.

Our acceptance testing consists of running from raw data through the assembly stage of the Eel Pond mRNAseq assembly protocol, albeit with a carefully chosen data subset. This takes less than an hour to run, but in that hour it tests some of the core functionality of khmer at the command line, on real data. We hope to extend this into the majority of our functionality over time -- for now we're mostly just testing digital normalization and read manipulation.

Having passing acceptance tests before a major release is both extraordinarily reassuring and quite useful: in fact, it caught several last minute bugs that we'd missed because of either incomplete unit and functional tests, OR because they were bugs at the integration level and not at the unit/functional level.

And one interesting side note -- our acceptance tests encompass Trimmomatic, fastx, and Trinity. That's right, we're passively-aggressively testing other people's software in our acceptance tests, too ;).

Better Science Through Superior Software

Ultimately, this improved infrastructure and process lets us confidently move forward with new features in khmer, lets my group work in concert on orthogonal new features, enables larger processes and pipelines with less fear, uncertainty, and doubt, and -- ultimately -- should result in significant time savings as we extend our research program. My firm belief is that this will allow us to do better science as we move forward.

Watch This Space. We'll let you know how it goes.

--titus

by C. Titus Brown at March 31, 2014 10:00 PM

Philip Herron

redbrain812

So when I was travelling to work this morning I couldn't help but read this article http://www.bbc.co.uk/news/science-environment-26810559, and I have several things to say about it.

I personally find that when articles like this come out, it is overwhelming how under-prepared our society supposedly is for the future. I find it depressing and disgusting that we simply believe we are all so unprepared for the future, when you consider how much mankind has already endured. And now, all of a sudden, people are worried we're all going to have to live off insects to survive.

The article basically hits on 2 main stress points:

  • Food shortages
  • Energy supply shortages

I am going to show you several things that might open your eyes, and also discuss my experience of growing up in Northern Ireland and how the complacency of man has created these problems out of nothing.

TL;DR: I believe we are going to be OK. Tackling food waste and regulating investment banking will stop making the already rich even richer, allow the ordinary farmer to be paid more fairly, and free up more money for research and development in farming, as well as for education about food waste, nutrition and food heritage. I also believe the energy problem is already worked out, but that there are no financial gains in it for the already wealthy until it is needed. We will be OK and we will be prosperous as a race.


Food Shortages and Crop Yields

When you consider this, most journalists and scientists at this high level really don't know even the simplest of details. For instance, I live on a farm and I see the effects of complacency more strongly than you would dare to believe.

For instance, the fields behind where I live are owned by our landlord. It is a vast amount of land, and he has even more in the west of Northern Ireland, all for growing crops. Wheat is a big issue at the moment; my girlfriend's father was almost put out of business because of it.

Northern Ireland has been dealing with huge rainfall for the last year or so. This means dairy and meat farmers are forced to bring cattle indoors. What does this mean? Usually, in good weather, you can leave your cattle out to feed on the grass in the fields, and you rotate them around so they always have grass to feed on. This is great: the farmer has no need to buy artificial feeds made from wheat and so on.

With the heavy downpours, the demand for wheat has increased but the supply has been lower than in previous years (not drastically lower, but lower), and more and more smaller farmers have been forced to buy from abroad. The demand has increased because in the rain farmers are forced to bring the animals into shelters and feed them a mixture of wheat feeds, straw and cut grass. But why can't you just leave them out in the rain the whole time? Because if you do, your fields will be utterly RUINED and you will NOT be able to grow grass usable to feed your animals for probably the best part of a year. So you will be forced to buy feed, and you cannot grow even some grass of your own.

So that is one aspect: the effect on dairy farmers. It sounds bad; with the extra rain they cannot feed their animals off their own land, and the pay for milk is so low, which by the way is insane: milk is cheaper than water, which makes no sense at all.

Now consider the people growing the wheat; remember, I said I live on a farm. With all the rain, the fields are entirely waterlogged, haven't produced anything in the last four months, and are only now being repaired. I even had a chat with the greengrocers in my town about how most of the vegetables we all eat here come from Spain and Holland. These are mostly conjectures rather than hard truth, but there are many factors to consider here:

  • Adequate drainage was never implemented for these fields ever
  • Poor poor maintenance of the fields
  • Poor maintenance of equipment
  • Not enough workers and low pay

So you can see there are quite a few reasons for the situation the fields are in. There are many effects and problems to consider in order to fully appreciate it. Let's think first about low pay, which is the easiest for most skilled workers in western society to understand.

Supply and Demand of Raw Produce

When we first moved in, the fields behind our house included several wheat fields brimming with produce and busy, happy workers. But as soon as winter even vaguely approached, everything shut down. There were equipment problems, and not enough money available to warrant the amount of effort required to make a living, never mind the completely unprepared fields with no drainage, which was down to the farmer's complacency. When I had that conversation with the greengrocers, all they had to say on the matter was "you'll never beat the Dutch at growing". So why is this statement true? Why can't we in Northern Ireland grow as well as the Dutch? Our climate probably has something to do with it, as does the fact that they grow marijuana there, which has had an awful lot of research and development behind it that probably carries over to normal growing through a far greater understanding. But why isn't there more research into bio domes, and more funding available for bio domes and micro-climates for vegetable growing, to make farming a much more vibrant industry for a newer, younger generation to be involved in?

I wouldn't say the work is any more difficult; for me as a software engineer, sometimes helping on my partner's family's farm is the first peace of mind I have felt in a long time. The stress and anxiety of the corporate world is overwhelming at times. But this leads into the next issue: banks and large supermarkets.


Banks and Investment Banking

I am not going to pretend I don't go to Tesco or Sainsbury's, but do you ever look at BBC News and see that Tesco's operating profits have dropped, and notice Tesco credit cards, Tesco banks, Marks and Spencer bank accounts, ISAs and so on?

Well, what the hell is this? Since when did the place where I buy my potatoes and sausages suddenly turn into a bank where I can get a loan and have my wages paid in? This is where it gets complicated, and intentionally so, you might find. When you see these articles about supermarkets' profits dropping or growing, it is never that lots of customers suddenly decided "I don't like Tesco" and went elsewhere; generally the same customers keep going to these shops.

So what has happened that they are losing or gaining money? When you spend money in these supermarkets, they use the profits to start putting money into investments abroad: housing mortgages, stock exchanges, commodities or shares in companies. They are gambling their profits to make money.

For example, the Co-op bank has been on BBC News a lot recently, but the Co-op is simply a small shop round the corner. Since when did it become a bank? (It doesn't matter.) What is happening is that these small shops started out doing fairly well, but over time they ended up investing their profits into anything and everything, and if they put money into Apple shares in 2002, then by 2004, well, they are probably loaded. But this is why investment banking is so insane: it is just a gamble with money. The 2008 financial crisis was a similar but more complicated issue; we as taxpayers pay for their gambling debt, and I say gambling debt because that is what it is.

But wait, how does this relate to the issue of wheat supply throughout the world? Well, when wheat prices affect the dairy farmers, which is a good example, creameries come under so much pressure from supermarkets such as Tesco to drive prices DOWN DOWN DOWN. So the effect hits the dairy farmer; he is the one taking the hit here, and believe you me it is a big hit: small milk farming of around 100 cows is almost dead because of this, even though it actually produces a fair amount of raw produce you can trade.

And what is worse is that this is almost never talked about and goes unheard of. Most people might think 100 cows is a lot, and it is. Consider, for example, the artificial insemination costs of breeding cattle, the different breeds with their different yields and qualities of milk, and then the health costs of cattle these days, which keep going up and up: veterinary bills and so on.

My partner and I were with her father and had to bring tepid water to a cow that was bleeding internally after a bad calving; the calf at least lived. Even though we had the vet out, and they took blood from another cow to help and sewed her up and everything, the cow did not make it. My partner and I went back to work the next day, and that night we were told she had died. It was so sad; I remember that day so well, where I was sitting and what we were doing.

It is one cow, and the farmer bore the cost, but one cow gives so much to society. Farmers these days are even required by law to pay the government to dispose of the animal when it dies, and the farmer bears that cost as well.

Beef and milk are the raw produce from a single cow. We can make yoghurt, cheese, milk, cream, buttermilk, steak… so much. The costs of doing this are becoming so great for smaller farmers that small local producers have almost died out because of the competition from big farms. So OK, there are huge farms here and in England, but are you going to bet your whole food supply on one guy?

And the other issue is that supermarkets are hiding the cost of all this: when they drive down the prices they pay to creameries, the creameries are forced to take it, because with such high competition and supermarket pressure they have to accept the best deal they can find.

The farmer is the most important man in the world, and no one notices. I only chose the dairy and wheat farmer as examples because I have first-hand experience of them. So when you think of massive monopolies on food, like the one Microsoft had on software for years, this is not good for society or for the people around us. Competition is difficult, but choice is just so important in our buy/sell, trading-style economics: it keeps vendors in check and fair to the end customers, because the buyer or consumer has a choice.

You can argue these supermarkets are not doing anything wrong by branching into investment banking, but it has created this gambling society, throwing money out of our country when it fails and giving vast wealth to the lazy and already wealthy when it goes well.

People creating the commodities are completely forgotten about.

The other side of the coin is that this type of investment banking in stock exchanges, funding companies and tech companies, has brought about so much research and development and has vastly changed how we think and communicate with technology. But it is time investment banking got tougher and tougher regulation. I don't believe supermarkets should be allowed to do any kind of investment banking, because you are playing with local rural economies, which are already weak from lower population and investment, and consolidating wealth and prosperity into single cities, such as London in the United Kingdom's case. The same job here in Northern Ireland will offer me around £30k a year, but in London they will offer in excess of £60-70k a year, and in many cases the end result from London will be worse than from the lower-paid guy here.

The consolidation of money is the weakness of our economy; high liquidity keeps everything in check, and to have high liquidity you need more competition from smaller vendors and to avoid huge monopolies on anything. For example, if one person owned all the oil in the world and all the food in the world, what would the world look like? And if lots of people each owned a little bit, so we could trade fairly, what would the world look like?

Then what about tackling food shortages? I really don't believe in any way that we are all going to have to start eating insects, even though something like 50% of the world does eat insects as part of a healthy diet. What is happening, in my experience, is that no one has any life skills any more. It is only in the last two years that brisket, as a cut of meat, has become known in the United Kingdom; think of the amount of food waste there must have been for such a long time. The same goes for unfashionable vegetables such as the simple turnip, which grows easily here, or any root vegetable; there are even crops that like harsher weather. The means for us to survive already exist but are forgotten about.

For food security, the United Kingdom must respect its old food heritage of nose-to-tail eating and traditional game meats and offal. It is only recently that I really started to take the notion of eating liver seriously; calves' liver is very good, and even ox heart is not organ-like at all, it is meant to be like a steak and, when cut and prepared properly, beautiful. But I can't just go down the street and get this. As for fish and shellfish, we live on an island and there is one part-time fishmonger in Belfast, open Wednesday to Friday, plus Friday and Saturday in the market.

So tackle the issue of food waste, then tell me we are all going to have to eat insects, and I might believe you. There is simply so much we are overlooking in food that is wasted, exported or never eaten that I refuse to believe that statement. I am not saying the issue is a simple one, but education on food and food preparation should be mandatory in school, and not only that, food waste and nutrition must be discussed too. As for me, I work full time; I leave my house at 8am and am not home until 6pm, so getting to my local greengrocer and butcher is almost impossible.

Except if I come home early and work from home more. I think working life needs an adjustment: the corporate world rewards compliance and mundane process over end results, which simply makes no sense in a changing world. Attitudes need to change so that we have much more freedom as human beings.

It is even getting to the stage in the United Kingdom where having a family (a traditional nuclear family, I think they call it) is almost impossible; the pressures of money, housing and living costs are pushing people to live alone, the cost of having a family being too great. For me, I think it will be possible to find a job where I work from home full time, so that at least my partner might be able to resume something of a career and live her life the way she wants, and I will support her in every way, just as she has supported me so strongly through my frustrations with where I work.


When you then think about energy supplies, you immediately think of fossil fuels, coal, oil and gas, and how we have been told for the last 30 years that they are going to run out, that no one knows what we are going to do, and that it is all going to go to hell.

I really hate this statement; you see it made all the time on the news and in school, and it just scares people. All you need to do is watch this video and you will feel reassured that the world is not going to end for a long time.

I believe the human race will adapt like it always has; it will work, we will survive, and we will be prosperous and happy. When articles like this come about, people forget where we all came from and how we have dealt with things in the past.

Thanks for reading and: BE EXCELLENT TO EACH OTHER!


by Philip Herron at March 31, 2014 12:12 PM

March 27, 2014

Matthieu Brucher

On the importance of not reinventing the wheel in distributed applications

Sometimes it’s so tempting to rewrite some existing code because it doesn’t exactly fit the bill.

I just saw an example of this with an all-to-all communication that was written by hand. The goal was to share how many elements would be sent from one MPI process to another, and these elements were stored on one process in different structure instances, one for each MPI process. So in the end, you had n structures on each of the n MPI processes.

MPI_Alltoall cannot map directly onto this scattered structure, so it sounds fair to assume that using MPI_Isend and MPI_Irecv would be simpler to implement. The issue is that this pattern uses buffers on each process for every other process it sends values to or receives values from. A lot of MPI libraries allocate these buffers when needed, but never let go of the memory until the end. So you end up with memory consumption that doesn’t scale. In my case, when using more than 1000 cores, the MPI library uses more than 1GB per MPI process when it hits these calls, just for these additional hidden buffers. This is just not manageable.

Now, if you use MPI_Alltoall, two things happen:

  • there are no additional buffers allocated, so this scales nicely when you increase the number of cores
  • it is actually faster than your custom implementation

Now that the MPI 3 standard has non-blocking collective operations, there is absolutely no reason to try to outsmart the library when you need a collective operation. The library has heuristics that kick in when it knows it is doing a collective call, so let them work. You won’t be smarter than the library by trying, but you will be if you use it.

In my case, the code to retrieve all values and store them in an intermediate buffer was smaller than the one with the Isend/Irecv.
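To make the pattern concrete, here is a minimal sketch of the counts exchange expressed in mpi4py terms (the original code is presumably C++; the dummy send counts are only there to keep the example self-contained):


from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# one entry per destination rank: how many elements this rank will send to it
# (dummy values here; in real code these come from the per-destination structures)
send_counts = [rank + dest for dest in range(size)]

# a single collective call replaces size*size point-to-point messages;
# recv_counts[src] is the number of elements rank src will send to us
recv_counts = comm.alltoall(send_counts)
print(rank, recv_counts)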

by Matthieu Brucher at March 27, 2014 10:18 AM

March 25, 2014

Philip Herron

redbrain812

So today I was listening to a lecture on the lean software development process and, my god, what the fuck has happened, people? I mean, seriously! The reason it is intriguing is that in so many software shops people are maintaining really horrible old code, and maybe one or two people know what they are doing with it.

So this is some sort of formal process for "let's write our own code and let the people who know what they are doing do it". Even if you have a bad manager and a bad process, this happens anyway. Seriously, anyone who puts this ahead of anything other than purely writing software and getting stuff done is kidding themselves; these managerial processes are so full of crap.

Look at open source and free software and how much they have taken off for everything. Is this crap attached to every project? NO. It is NOT NEEDED. I mean, the number of bullshitters in this world who come up with this stuff.

 


by Philip Herron at March 25, 2014 02:20 PM

March 20, 2014

Andreas Mueller

Off-topic: speed reading like spritz

As the title suggests, this is a non-machine-learning, non-vision, non-python post *gasp*.
Some people in my network posted about spritz, a startup that recently came out of stealth mode. They make a pretty cool app for speed reading. See this Huffington Post article for a quick demo and explanation.
They say they are still in development, so the app is not available for the public.

The app seems pretty neat but also pretty easy to do. I said that, and people came back with "they probably do a lot of natural language processing and parsing the sentence to align to the proper character" and similar remarks.
So I reverse engineered it. By which I mean opening the demo gifs in Gimp and counting letters. And, surprise surprise: they just count letters. So the letter they highlight (at least in the demo) depends only on the length of the word.
The magic formula is
highlight = 1 + ceil(word_length / 4). They might also be timing the transitions differently, haven't really looked at that, I must admit.
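As a quick sanity check, here is a tiny Python sketch of that formula (the helper name is mine, and whether spritz does exactly this outside the demo gifs is of course an assumption):


from math import ceil

def highlight_index(word):
    # 1-based position of the highlighted letter, per the formula above
    return 1 + int(ceil(len(word) / 4.0))

for w in ["word", "reading", "comprehension"]:
    idx = highlight_index(w)
    print(w, idx, w[idx - 1])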
After this revelation, I coded up my own little version in javascript.
Obviously the real value of an app like that is integration with various ecosystems, browser integration etc...
But  if you are just interested in the concept, you can paste your favorite (or maybe rather least favorite given the nature of the app?) e-book into my webpage and give it a shot.
The code is obviously on github and CC0.
I wouldn't try to use it commercially though, without talking to spritz.

by Andreas Mueller (noreply@blogger.com) at March 20, 2014 10:52 PM

March 17, 2014

Thomas Wiecki

The Best Of Both Worlds: Hierarchical Linear Regression in PyMC3

Today's blog post is co-written by Danne Elbers who is doing her masters thesis with me on computational psychiatry using Bayesian modeling. This post also borrows heavily from a Notebook by Chris Fonnesbeck.

The power of Bayesian modelling really clicked for me when I was first introduced to hierarchical modelling. In this blog post we will:

  • provide an intuitive explanation of hierarchical/multi-level Bayesian modeling;
  • show how this type of model can easily be built and estimated in PyMC3;
  • demonstrate the advantage of using hierarchical Bayesian modelling as opposed to non-hierarchical Bayesian modelling by comparing the two;
  • visualize the "shrinkage effect" (explained below); and
  • highlight connections to the frequentist version of this model.

Having multiple sets of related measurements comes up all the time. In mathematical psychology, for example, you test multiple subjects on the same task. We then want to estimate a computational/mathematical model that describes the behavior on the task by a set of parameters. We could thus fit a model to each subject individually, assuming they share no similarities; or, pool all the data and estimate one model assuming all subjects are identical. Hierarchical modeling allows the best of both worlds by modeling subjects' similarities but also allowing estimation of individual parameters. As an aside, software from our lab, HDDM, allows hierarchical Bayesian estimation of a widely used decision making model in psychology. In this blog post, however, we will use a more classical example of hierarchical linear regression to predict radon levels in houses.

This is the 3rd blog post on the topic of Bayesian modeling in PyMC3; see my two previous posts for an introduction.

The data set

Gelman et al.'s (2007) radon dataset is a classic for hierarchical modeling. In this dataset the amount of the radioactive gas radon has been measured among different households in all counties of several states. Radon gas is known to be the leading cause of lung cancer in non-smokers. It is believed to be more strongly present in households containing a basement and to differ in amount present among types of soil. Here we'll investigate these differences and try to make predictions of radon levels in different counties based on the county itself and the presence of a basement. In this example we'll look at Minnesota, a state that contains 85 counties in which different measurements are taken, ranging from 2 to 116 measurements per county.

[figure: radon]

First, we'll load the data:

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pymc as pm 
import pandas as pd

data = pd.read_csv('radon.csv')

county_names = data.county.unique()
county_idx = data['county_code'].values
n_counties = len(data.county.unique())

The relevant part of the data we will model looks as follows:

In [2]:
data[['county', 'log_radon', 'floor']].head()
Out[2]:
county log_radon floor
0 AITKIN 0.832909 1
1 AITKIN 0.832909 0
2 AITKIN 1.098612 0
3 AITKIN 0.095310 0
4 ANOKA 1.163151 0

5 rows × 3 columns

As you can see, we have multiple radon measurements (log-converted to be on the real line) -- one row for each house -- in a county and whether the house has a basement (floor == 0) or not (floor == 1). We are interested in whether having a basement increases the radon measured in the house.

The Models

Pooling of measurements

Now you might say: "That's easy! I'll just pool all my data and estimate one big regression to assess the influence of a basement across all counties." In math-speak that model would be:

\[radon_{i, c} = \alpha + \beta*\text{floor}_{i, c} + \epsilon\]

Where \(i\) represents the measurement, \(c\) the county and floor contains a 0 or 1 if the house has a basement or not, respectively. If you need a refresher on Linear Regressions in PyMC, check out my previous blog post. Critically, we are only estimating one intercept and one slope for all measurements over all counties pooled together as illustrated in the graphic below (\(\theta\) represents \((\alpha, \beta)\) in our case and \(y_i\) are the measurements of the \(i\)th county).

[figure: pooled model diagram]
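Not in the original post, but for concreteness, here is a minimal sketch of what this fully pooled model could look like in the same PyMC3 style used for the models implemented further below (variable names are mine):

# Hypothetical sketch: the fully pooled model, one intercept and one slope
# shared by all counties.
with pm.Model() as pooled_model:
    alpha = pm.Normal('alpha', mu=0, sd=100**2)
    beta = pm.Normal('beta', mu=0, sd=100**2)
    eps = pm.Uniform('eps', lower=0, upper=100)
    radon_est = alpha + beta * data.floor.values
    radon_like = pm.Normal('radon_like', mu=radon_est, sd=eps,
                           observed=data.log_radon)
    pooled_trace = pm.sample(2000, step=pm.NUTS(), progressbar=False)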

Unpooled measurements: separate regressions

But what if we are interested in whether different counties actually have different relationships (slope) and different base-rates of radon (intercept)? Then you might say "OK then, I'll just estimate \(n\) (number of counties) different regressions -- one for each county". In math-speak that model would be:

\[radon_{i, c} = \alpha_{c} + \beta_{c}*\text{floor}_{i, c} + \epsilon_c\]

Note that we added the subindex \(c\) so we are estimating \(n\) different \(\alpha\)s and \(\beta\)s -- one for each county.

[figure: unpooled model diagram]

This is the extreme opposite model; where above we assumed all counties are exactly the same, here we are saying that they share no similarities whatsoever. As we show below, this type of model can be very noisy when we have little data per county, as is the case in this data set.

Partial pooling: Hierarchical Regression, aka the best of both worlds

Fortunately, there is a middle ground to both of these extremes. Specifically, we may assume that while \(\alpha\)s and \(\beta\)s are different for each county as in the unpooled case, the coefficients all share similarity. We can model this by assuming that each individual coefficient comes from a common group distribution:

\[\alpha_{c} \sim \mathcal{N}(\mu_{\alpha}, \sigma_{\alpha}^2)\] \[\beta_{c} \sim \mathcal{N}(\mu_{\beta}, \sigma_{\beta}^2)\]

We thus assume the intercepts \(\alpha\) and slopes \(\beta\) to come from a normal distribution centered around their respective group mean \(\mu\) with a certain variance \(\sigma^2\), the values (or rather posteriors) of which we also estimate. That's why this is called multilevel, hierarchical, or partial-pooling modeling.

[figure: hierarchical model diagram]

How do we estimate such a complex model you might ask? Well, that's the beauty of Probabilistic Programming -- we just formulate the model we want and press our Inference Button(TM).

(Note that the above is not a complete Bayesian model specification, as we haven't defined priors or hyperpriors (i.e. priors for the group distribution, \(\mu\) and \(\sigma\)). These will be used in the model implementation below but would only distract here.)

Probabilistic Programming

Unpooled/non-hierarchical model

To really highlight the effect of the hierarchical linear regression we'll first estimate the non-hierarchical, unpooled Bayesian model from above (separate regressions). For each county we estimate a completely separate model. As we have no prior information on what the intercept or slope could be, we will be using a normal distribution centered around 0 with a wide standard deviation to describe both. We'll assume the measurements are normally distributed with noise \(\epsilon\), on which we place a uniform distribution.

In [3]:
indiv_traces = {}
for county_name in county_names:
    # Select subset of data belonging to county
    c_data = data.ix[data.county == county_name]
    c_log_radon = c_data.log_radon
    c_floor_measure = c_data.floor.values
    
    with pm.Model() as individual_model:
        # Intercept prior (variance == sd**2)
        a = pm.Normal('alpha', mu=0, sd=100**2)
        # Slope prior
        b = pm.Normal('beta', mu=0, sd=100**2)
    
        # Model error prior
        eps = pm.Uniform('eps', lower=0, upper=100)
    
        # Linear model
        radon_est = a + b * c_floor_measure
    
        # Data likelihood
        radon_like = pm.Normal('radon_like', mu=radon_est, sd=eps, observed=c_log_radon)

        # Inference button (TM)!
        step = pm.NUTS()
        trace = pm.sample(2000, step=step, progressbar=False)
    
    # keep trace for later analysis
    indiv_traces[county_name] = trace

Hierarchical Model

Instead of creating models separately, the hierarchical model creates group parameters that consider the counties not as completely different but as having an underlying similarity. These distributions are subsequently used to influence the distribution of each county's \(\alpha\) and \(\beta\).

In [7]:
with pm.Model() as hierarchical_model:
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_alpha', mu=0., sd=100**2)
    sigma_a = pm.Uniform('sigma_alpha', lower=0, upper=100)
    mu_b = pm.Normal('mu_beta', mu=0., sd=100**2)
    sigma_b = pm.Uniform('sigma_beta', lower=0, upper=100)
    
    # Intercept for each county, distributed around group mean mu_a
    # Above we just set mu and sd to a fixed value while here we
    # plug in a common group distribution for all a and b (which are
    # vectors of length n_counties).
    a = pm.Normal('alpha', mu=mu_a, sd=sigma_a, shape=n_counties)
    # Slope for each county, distributed around group mean mu_b
    b = pm.Normal('beta', mu=mu_b, sd=sigma_b, shape=n_counties)
    
    # Model error
    eps = pm.Uniform('eps', lower=0, upper=100)
    
    # Model prediction of radon level
    # a[county_idx] translates to a[0, 0, 0, 1, 1, ...],
    # we thus link multiple household measures of a county
    # to its coefficients.
    radon_est = a[county_idx] + b[county_idx] * data.floor.values
    
    # Data likelihood
    radon_like = pm.Normal('radon_like', mu=radon_est, sd=eps, observed=data.log_radon)
In [8]:
# Inference button (TM)!
with hierarchical_model:
    start = pm.find_MAP()
    step = pm.NUTS(scaling=start)
    hierarchical_trace = pm.sample(2000, step, start=start, progressbar=False)
In [9]:
# Plot the trace of the hierarchical model -- its sampled values -- from iteration 500 onwards (right column)
# and the corresponding marginal posteriors (left column)
pm.traceplot(hierarchical_trace[500:]);

The marginal posteriors in the left column are highly informative. mu_a tells us the group mean (log) radon levels. mu_b tells us that having no basement decreases radon levels significantly (no mass above zero). We can also see by looking at the marginals for a that there are quite some differences in radon levels between counties (each 'rainbow' color corresponds to a single county); the different widths are related to how much confidence we have in each parameter estimate -- the more measurements per county, the higher our confidence will be.

Posterior Predictive Check

The Root Mean Square Deviation

To find out which of the models explains the data better we can calculate the Root Mean Square Deviation (RMSD). This posterior predictive check revolves around recreating the data based on the parameters found at different moments in the chain. The recreated or predicted values are subsequently compared to the real data points; the model that predicts values closer to the original data is considered the better one. Thus, the lower the RMSD, the better.

When computing the RMSD (code not shown) we get the following result:

  • individual/non-hierarchical model: 0.13
  • hierarchical model: 0.08
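
A minimal sketch of how such an RMSD could be computed for the hierarchical model (hypothetical code -- the post's actual implementation is not shown and may differ, e.g. by also simulating measurement noise from eps):

# Hypothetical sketch: predict radon levels from each post-burn-in posterior
# sample and compare the predictions to the observed values.
a = hierarchical_trace['alpha'][500:]   # shape (n_samples, n_counties)
b = hierarchical_trace['beta'][500:]
preds = a[:, county_idx] + b[:, county_idx] * data.floor.values
rmsd = np.sqrt(np.mean((preds - data.log_radon.values) ** 2))

The RMSD for the individual models can be computed analogously by looping over indiv_traces.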

As these numbers show, the hierarchical model performs better than the non-hierarchical model at predicting the radon values. Following this, we'll plot some example counties showing the actual radon measurements, the hierarchical predictions, and the non-hierarchical predictions.

In [*]:
selection = ['CASS', 'CROW WING', 'FREEBORN']
fig, axis = plt.subplots(1, 3, figsize=(12, 6), sharey=True, sharex=True)
axis = axis.ravel()
for i, c in enumerate(selection):
    c_data = data.ix[data.county == c]
    c_data = c_data.reset_index(drop = True)
    z = list(c_data['county_code'])[0]

    xvals = np.linspace(-0.2, 1.2)
    for a_val, b_val in zip(indiv_traces[c]['alpha'][500::10], indiv_traces[c]['beta'][500::10]):
        axis[i].plot(xvals, a_val + b_val * xvals, 'b', alpha=.1)
    axis[i].plot(xvals, indiv_traces[c]['alpha'][500::10].mean() + indiv_traces[c]['beta'][500::10].mean() * xvals, 
                 'b', alpha=1, lw=2., label='individual')
    for a_val, b_val in zip(hierarchical_trace['alpha'][500::10][z], hierarchical_trace['beta'][500::10][z]):
        axis[i].plot(xvals, a_val + b_val * xvals, 'g', alpha=.1)
    axis[i].plot(xvals, hierarchical_trace['alpha'][500::10][z].mean() + hierarchical_trace['beta'][500::10][z].mean() * xvals, 
                 'g', alpha=1, lw=2., label='hierarchical')
    axis[i].scatter(c_data.floor + np.random.randn(len(c_data))*0.01, c_data.log_radon, 
                    alpha=1, color='k', marker='.', s=80, label='original data')
    axis[i].set_xticks([0,1])
    axis[i].set_xticklabels(['basement', 'no basement'])
    axis[i].set_ylim(-1, 4)
    axis[i].set_title(c)
    if not i%3:
        axis[i].legend()
        axis[i].set_ylabel('log radon level')

In the above plot we have the data points in black of three selected counties. The thick lines represent the mean estimate of the regression line of the individual (blue) and hierarchical model (in green). The thinner lines are regression lines of individual samples from the posterior and give us a sense of how variable the estimates are.

When looking at the county 'CASS' we see that the non-hierarchical estimation is strongly biased: as this county's data contains only households with a basement, the estimated regression produces the nonsensical result of a giant negative slope, meaning that we would expect negative radon levels in a house without a basement!

Moreover, in the example counties 'CROW WING' and 'FREEBORN' the non-hierarchical model appears to react more strongly than the hierarchical model to the existence of outliers in the dataset ('CROW WING': no basement, upper right; 'FREEBORN': basement, upper left). Assuming that there should be more radon gas measurable in households with basements than in those without, the non-hierarchical model for 'CROW WING' seems off. By having the group distribution constrain the coefficients we get meaningful estimates in all cases, as we apply what we learn from the group to the individuals and vice versa.

Shrinkage

Shrinkage describes the process by which our estimates are "pulled" towards the group mean as a result of the common group distribution -- county coefficients very far away from the group mean have very low probability under the normality assumption, so moving them closer to the group mean gives them higher probability. In the non-hierarchical model every county is allowed to differ completely from the others by just using each county's data, resulting in a model more prone to outliers (as shown above).

In [87]:
hier_a = hierarchical_trace['alpha'][500:].mean(axis=0)
hier_b = hierarchical_trace['beta'][500:].mean(axis=0)
indv_a = [indiv_traces[c]['alpha'][500:].mean() for c in county_names]
indv_b = [indiv_traces[c]['beta'][500:].mean() for c in county_names]
In [96]:
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, xlabel='Intercept', ylabel='Floor Measure', 
                     title='Hierarchical vs. Non-hierarchical Bayes', 
                     xlim=(0, 3), ylim=(-3, 3))

ax.scatter(indv_a,indv_b, s=26, alpha=0.4, label = 'non-hierarchical')
ax.scatter(hier_a,hier_b, c='red', s=26, alpha=0.4, label = 'hierarchical')
for i in range(len(indv_b)):  
    ax.arrow(indv_a[i], indv_b[i], hier_a[i] - indv_a[i], hier_b[i] - indv_b[i], 
             fc="k", ec="k", length_includes_head=True, alpha=0.4, head_width=.04)
ax.legend();

In the shrinkage plot above we show the coefficients of each county's non-hierarchical posterior mean (blue) and the hierarchical posterior mean (red). To show the effect of shrinkage on a single coefficient pair (alpha and beta) we connect the blue and red points belonging to the same county by an arrow. Some non-hierarchical posteriors are so far out that we couldn't display them in this plot (it would make the axes too wide). Interestingly, all hierarchical posteriors of the floor measure seem to be around -0.6, indicating that having a basement in almost all counties is a clear indicator for heightened radon levels. The intercept (which we take as a proxy for the type of soil) appears to differ among counties. This information would have been difficult to find if we had only used the non-hierarchical model.

Critically, many effects that look quite large and significant in the non-hierarchical model actually turn out to be much smaller when we take the group distribution into account (this point can also be seen clearly in plot In[12] of Chris' NB). Shrinkage can thus be viewed as a form of smart regularization that helps reduce false positives!

Connections to Frequentist statistics

This type of hierarchical, partial-pooling model is known as a random effects model in frequentist terms. Interestingly, if we placed uniform priors on the group mean and variance in the above model, the resulting Bayesian model would be equivalent to a random effects model. One might imagine that the difference between a model with uniform or wide normal hyperpriors should not have a huge impact. However, Gelman encourages the use of weakly-informative priors (like we used above) over flat priors.

Summary

In this post, co-authored by Danne Elbers, we showed how a multi-level hierarchical Bayesian model gives the best of both worlds when we have multiple sets of measurements we expect to have similarity. The naive approach either pools all data together and ignores the individual differences, or treats each set as completely separate leading to noisy estimates, as shown above. By assuming that each individual data set (each county in our case) is distributed according to a group distribution -- which we simultaneously estimate -- we benefit from increased statistical power and smart regularization via the shrinkage effect. Probabilistic Programming in PyMC then makes Bayesian estimation of this model trivial.

As a follow-up we could also include other states into our model. For this we could add yet another layer to the hierarchy where each state is pooled at the country level. Finally, readers of my blog will notice that we didn't use glm() here as it does not play nice with hierarchical models yet.

Acknowledgements

Thanks to Imri Sofer for feedback and teaching us about the connections to random-effects models and Dan Dillon for useful comments on an earlier draft.

~

by Thomas Wiecki at March 17, 2014 01:00 PM

March 12, 2014

Paul Ivanov

stem-and-leaf plots in a tweet


Summary: I describe stem plots, how to read them, and how to make them in Python, using 140 characters.

My friend @JarrodMillman, whose office is across the hall, is teaching a computational statistics course that involves a fair amount of programming. He's been grading these homeworks semi-automatically - with python scripts that pull the students' latest changes from GitHub, run some tests, spit out the grade to a JSON file for the student, check it in, and update a master JSON file that's only accessible to Jarrod. It's been fun periodically tagging along and watching his suite of little programs develop. He came in the other day and said "Do you know of any stem plot implementation in python? I found a few, and I'm using one that's ok, but it looks too complicated."

For those unfamiliar - a stem plot, or stem-and-leaf plot, is a more detailed kind of histogram. On the left you have the stem, which is a prefix to all entries on the right. To the right of the stem, each entry takes up one space just like a bar chart, but still retains information about its actual value.

So a stem plot of the numbers 31, 41, 59, 26, 53, 58 looks like this:

 2|6
 3|1
 4|1
 5|389

That last line is hard to parse for the uninitiated. There are three entries to the right of the 50 stem, and these three entries, 3, 8, and 9, are how the numbers 53, 58, and 59 are concisely represented in a stem plot.

As an instructor, you can quickly get a sense of the distribution of grades, without fearing the binning artifacts caused by standard histograms. A stem plot can reveal subtle patterns in the data that are easy to miss with the usual grading histograms that have a binwidth of 10. Take this distribution, for example:

70:XXXXXXX
80:XXXXXXXXXXX
90:XXXXXXX

Below are two stem plots which have the same profile as the above, but tell a different story:

 7|7888999
 8|01123477899
 9|3467888

Above is a class that has a rather typical grade distribution that sort of clumps together. But a histogram of the same shape might come from data like this:

 7|0000223
 8|78888999999
 9|0255589

This is a class with 7 students clearly struggling compared to the rest.

So here's the code for making a stem plot in Python using NumPy. stem() expects an array or list of integers, and prints all stems that span the range of the data provided.

from __future__ import print_function
import numpy as np
def stem(d):
    "A stem-and-leaf plot that fits in a tweet by @ivanov"
    l,t=np.sort(d),10
    O=range(l[0]-l[0]%t,l[-1]+11,t)
    I=np.searchsorted(l,O)
    for e,a,f in zip(I,I[1:],O): print('%3d|'%(f/t),*(l[e:a]-f),sep='')

Yes, it isn't pretty; a fair amount of code golfing went into making this work. It is a good example of the kind of code you should not write, especially since I had a little bit of fun with the variable names, using characters that look similar to others, especially in sans-serif typefaces (lI10O). Nevertheless, it's kind of fun to fit this much functionality into 140 characters.

Here's my original tweet:

You can test it by running it on some generated data:

>>> data = np.random.poisson(355, 113)
>>> data
array([367, 334, 317, 351, 375, 372, 350, 352, 350, 344, 359, 355, 358,
   389, 335, 361, 363, 343, 340, 337, 378, 336, 382, 344, 359, 366,
   368, 327, 364, 365, 347, 328, 331, 358, 370, 346, 325, 332, 387,
   355, 359, 342, 353, 367, 389, 390, 337, 364, 346, 346, 346, 365,
   330, 363, 370, 388, 380, 332, 369, 347, 370, 366, 372, 310, 348,
   355, 408, 349, 326, 334, 355, 329, 363, 337, 330, 355, 367, 333,
   298, 387, 342, 337, 362, 337, 378, 326, 349, 357, 338, 349, 366,
   339, 362, 371, 357, 358, 316, 336, 374, 336, 354, 374, 366, 352,
   374, 339, 336, 354, 338, 348, 366, 370, 333])
>>> stem(data)
 29|8
 30|
 31|067
 32|566789
 33|00122334456666777778899
 34|02234466667788999
 35|001223445555577888999
 36|12233344556666677789
 37|0000122444588
 38|0277899
 39|0
 40|8

If you prefer to have spaces between entries, take out the sep='' from the last line.

>>> stem(data)
 29| 8
 30|
 31| 0 6 7
 32| 5 6 6 7 8 9
 33| 0 0 1 2 2 3 3 4 4 5 6 6 6 6 7 7 7 7 7 8 8 9 9
 34| 0 2 2 3 4 4 6 6 6 6 7 7 8 8 9 9 9
 35| 0 0 1 2 2 3 4 4 5 5 5 5 5 7 7 8 8 8 9 9 9
 36| 1 2 2 3 3 3 4 4 5 5 6 6 6 6 6 7 7 7 8 9
 37| 0 0 0 0 1 2 2 4 4 4 5 8 8
 38| 0 2 7 7 8 9 9
 39| 0
 40| 8

To skip over empty stems, add e!=a and in front of print. This will remove the 300 stem from the output (useful for data with lots of gaps).
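
For clarity, here is what stem() looks like with both tweaks applied (spaces between the leaves and empty stems skipped) -- a sketch assembled from the two modifications described above, not code from the original tweet:

from __future__ import print_function
import numpy as np
def stem(d):
    "A stem-and-leaf plot with spaced leaves that skips empty stems"
    l,t=np.sort(d),10
    O=range(l[0]-l[0]%t,l[-1]+11,t)
    I=np.searchsorted(l,O)
    for e,a,f in zip(I,I[1:],O): e!=a and print('%3d|'%(f/t),*(l[e:a]-f))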

>>> stem(data)
 29| 8
 31| 0 6 7
 32| 5 6 6 7 8 9
 33| 0 0 1 2 2 3 3 4 4 5 6 6 6 6 7 7 7 7 7 8 8 9 9
 34| 0 2 2 3 4 4 6 6 6 6 7 7 8 8 9 9 9
 35| 0 0 1 2 2 3 4 4 5 5 5 5 5 7 7 8 8 8 9 9 9
 36| 1 2 2 3 3 3 4 4 5 5 6 6 6 6 6 7 7 7 8 9
 37| 0 0 0 0 1 2 2 4 4 4 5 8 8
 38| 0 2 7 7 8 9 9
 39| 0
 40| 8

Thanks for reading.

by Paul Ivanov at March 12, 2014 07:00 AM

March 11, 2014

Jake Vanderplas

Frequentism and Bayesianism: A Practical Introduction

One of the first things a scientist hears about statistics is that there are two different approaches: frequentism and Bayesianism. Despite their importance, many scientific researchers never have the opportunity to learn the distinctions between them and the different practical approaches that result. The purpose of this post is to synthesize the philosophical and pragmatic aspects of the frequentist and Bayesian approaches, so that scientists like myself might be better prepared to understand the types of data analysis people do.

I'll start by addressing the philosophical distinctions between the views, and from there move to discussion of how these ideas are applied in practice, with some Python code snippets demonstrating the difference between the approaches.

Frequentism vs. Bayesianism: a Philosophical Debate

Fundamentally, the disagreement between frequentists and Bayesians concerns the definition of probability.

For frequentists, probability only has meaning in terms of a limiting case of repeated measurements. That is, if I measure the photon flux \(F\) from a given star (we'll assume for now that the star's flux does not vary with time), then measure it again, then again, and so on, each time I will get a slightly different answer due to the statistical error of my measuring device. In the limit of a large number of measurements, the frequency of any given value indicates the probability of measuring that value. For frequentists probabilities are fundamentally related to frequencies of events. This means, for example, that in a strict frequentist view, it is meaningless to talk about the probability of the true flux of the star: the true flux is (by definition) a single fixed value, and to talk about a frequency distribution for a fixed value is nonsense.

For Bayesians, the concept of probability is extended to cover degrees of certainty about statements. Say a Bayesian claims to measure the flux \(F\) of a star with some probability \(P(F)\): that probability can certainly be estimated from frequencies in the limit of a large number of repeated experiments, but this is not fundamental. The probability is a statement of my knowledge of what the measurement result will be. For Bayesians, probabilities are fundamentally related to our own knowledge about an event. This means, for example, that in a Bayesian view, we can meaningfully talk about the probability that the true flux of a star lies in a given range. That probability codifies our knowledge of the value based on prior information and/or available data.

The surprising thing is that this arguably subtle difference in philosophy leads, in practice, to vastly different approaches to the statistical analysis of data. Below I will give a few practical examples of the differences in approach, along with associated Python code to demonstrate the practical aspects of the resulting methods.

Frequentist and Bayesian Approaches in Practice: Counting Photons

Here we'll take a look at an extremely simple problem, and compare the frequentist and Bayesian approaches to solving it. There's necessarily a bit of mathematical formalism involved, but I won't go into too much depth or discuss too many of the subtleties. If you want to go deeper, you might consider — please excuse the shameless plug — taking a look at chapters 4-5 of our textbook.

The Problem: Simple Photon Counts

Imagine that we point our telescope to the sky, and observe the light coming from a single star. For the time being, we'll assume that the star's true flux is constant with time, i.e. that it has a fixed value \(F_{\rm true}\) (we'll also ignore effects like sky noise and other sources of systematic error). We'll assume that we perform a series of \(N\) measurements with our telescope, where the \(i^{\rm th}\) measurement reports the observed photon flux \(F_i\) and error \(e_i\).

The question is, given this set of measurements \(D = \{F_i,e_i\}\), what is our best estimate of the true flux \(F_{\rm true}\)?

(Gratuitous aside on measurement errors: We'll make the reasonable assumption that errors are Gaussian. In a Frequentist perspective, \(e_i\) is the standard deviation of the results of a single measurement event in the limit of repetitions of that event. In the Bayesian perspective, \(e_i\) is the standard deviation of the (Gaussian) probability distribution describing our knowledge of that particular measurement given its observed value)

Here we'll use Python to generate some toy data to demonstrate the two approaches to the problem. Because the measurements are number counts, a Poisson distribution is a good approximation to the measurement process:

In [1]:
# Generating some simple photon count data
import numpy as np
from scipy import stats
np.random.seed(1)  # for repeatability

F_true = 1000  # true flux, say number of photons measured in 1 second
N = 50 # number of measurements
F = stats.poisson(F_true).rvs(N)  # N measurements of the flux
e = np.sqrt(F)  # errors on Poisson counts estimated via square root

Now let's make a simple visualization of the "measured" data:

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.errorbar(F, np.arange(N), xerr=e, fmt='ok', ecolor='gray', alpha=0.5)
ax.vlines([F_true], 0, N, linewidth=5, alpha=0.2)
ax.set_xlabel("Flux");ax.set_ylabel("measurement number");

These measurements each have a different error \(e_i\) which is estimated from Poisson statistics using the standard square-root rule. In this toy example we already know the true flux \(F_{\rm true}\), but the question is this: given our measurements and errors, what is our best estimate of the true flux?

Let's take a look at the frequentist and Bayesian approaches to solving this.

Frequentist Approach to Simple Photon Counts

We'll start with the classical frequentist maximum likelihood approach. Given a single observation \(D_i = (F_i, e_i)\), we can compute the probability distribution of the measurement given the true flux \(F_{\rm true}\) given our assumption of Gaussian errors:

\[ P(D_i~|~F_{\rm true}) = \frac{1}{\sqrt{2\pi e_i^2}} \exp{\left[\frac{-(F_i - F_{\rm true})^2}{2 e_i^2}\right]} \]

This should be read "the probability of \(D_i\) given \(F_{\rm true}\) equals ...". You should recognize this as a normal distribution with mean \(F_{\rm true}\) and standard deviation \(e_i\).

We construct the likelihood function by computing the product of the probabilities for each data point:

\[\mathcal{L}(D~|~F_{\rm true}) = \prod_{i=1}^N P(D_i~|~F_{\rm true})\]

Here \(D = \{D_i\}\) represents the entire set of measurements. Because the value of the likelihood can become very small, it is often more convenient to instead compute the log-likelihood. Combining the previous two equations and computing the log, we have

\[\log\mathcal{L} = -\frac{1}{2} \sum_{i=1}^N \left[ \log(2\pi e_i^2) + \frac{(F_i - F_{\rm true})^2}{e_i^2} \right]\]

What we'd like to do is determine \(F_{\rm true}\) such that the likelihood is maximized. For this simple problem, the maximization can be computed analytically (i.e. by setting \(d\log\mathcal{L}/dF_{\rm true} = 0\)). This results in the following observed estimate of \(F_{\rm true}\):

\[ F_{\rm est} = \frac{\sum w_i F_i}{\sum w_i};~~w_i = 1/e_i^2 \]

Notice that in the special case of all errors \(e_i\) being equal, this reduces to

\[ F_{\rm est} = \frac{1}{N}\sum_{i=1}^N F_i \]

That is, in agreement with intuition, \(F_{\rm est}\) is simply the mean of the observed data when errors are equal.

We can go further and ask what the error of our estimate is. In the frequentist approach, this can be accomplished by fitting a Gaussian approximation to the likelihood curve at maximum; in this simple case this can also be solved analytically. It can be shown that the standard deviation of this Gaussian approximation is:

\[ \sigma_{\rm est} = \left(\sum_{i=1}^N w_i \right)^{-1/2} \]

These results are fairly simple calculations; let's evaluate them for our toy dataset:

In [3]:
w = 1. / e ** 2
print("""
      F_true = {0}
      F_est  = {1:.0f} +/- {2:.0f} (based on {3} measurements)
      """.format(F_true, (w * F).sum() / w.sum(), w.sum() ** -0.5, N))

      F_true = 1000
      F_est  = 998 +/- 4 (based on 50 measurements)
      

We find that for 50 measurements of the flux, our estimate has an error of about 0.4% and is consistent with the input value.

Bayesian Approach to Simple Photon Counts

The Bayesian approach, as you might expect, begins and ends with probabilities. It recognizes that what we fundamentally want to compute is our knowledge of the parameters in question, i.e. in this case,

\[ P(F_{\rm true}~|~D) \]

Note that this formulation of the problem is fundamentally contrary to the frequentist philosophy, which says that probabilities have no meaning for model parameters like \(F_{\rm true}\). Nevertheless, within the Bayesian philosophy this is perfectly acceptable.

To compute this result, Bayesians next apply Bayes' Theorem, a fundamental law of probability:

\[ P(F_{\rm true}~|~D) = \frac{P(D~|~F_{\rm true})~P(F_{\rm true})}{P(D)} \]

Though Bayes' theorem is where Bayesians get their name, it is not this law itself that is controversial, but the Bayesian interpretation of probability implied by the term \(P(F_{\rm true}~|~D)\).

Let's take a look at each of the terms in this expression:

  • \(P(F_{\rm true}~|~D)\): The posterior, or the probability of the model parameters given the data: this is the result we want to compute.
  • \(P(D~|~F_{\rm true})\): The likelihood, which is proportional to the \(\mathcal{L}(D~|~F_{\rm true})\) in the frequentist approach, above.
  • \(P(F_{\rm true})\): The model prior, which encodes what we knew about the model prior to the application of the data \(D\).
  • \(P(D)\): The data probability, which in practice amounts to simply a normalization term.

If we set the prior \(P(F_{\rm true}) \propto 1\) (a flat prior), we find

\[P(F_{\rm true}|D) \propto \mathcal{L}(D|F_{\rm true})\]

and the Bayesian probability is maximized at precisely the same value as the frequentist result! So despite the philosophical differences, we see that (for this simple problem at least) the Bayesian and frequentist point estimates are equivalent.

But What About the Prior?

You'll noticed that I glossed over something here: the prior, \(P(F_{\rm true})\). The prior allows inclusion of other information into the computation, which becomes very useful in cases where multiple measurement strategies are being combined to constrain a single model (as is the case in, e.g. cosmological parameter estimation). The necessity to specify a prior, however, is one of the more controversial pieces of Bayesian analysis.

A frequentist will point out that the prior is problematic when no true prior information is available. Though it might seem straightforward to use a noninformative prior like the flat prior mentioned above, there are some surprising subtleties involved. It turns out that in many situations, a truly noninformative prior does not exist! Frequentists point out that the subjective choice of a prior, which necessarily biases your result, has no place in statistical data analysis.

A Bayesian would counter that frequentism doesn't solve this problem, but simply skirts the question. Frequentism can often be viewed as simply a special case of the Bayesian approach for some (implicit) choice of the prior: a Bayesian would say that it's better to make this implicit choice explicit, even if the choice might include some subjectivity.

Photon Counts: the Bayesian approach

Leaving these philosophical debates aside for the time being, let's address how Bayesian results are generally computed in practice. For a one parameter problem like the one considered here, it's as simple as computing the posterior probability \(P(F_{\rm true}~|~D)\) as a function of \(F_{\rm true}\): this is the distribution reflecting our knowledge of the parameter \(F_{\rm true}\). But as the dimension of the model grows, this direct approach becomes increasingly intractable. For this reason, Bayesian calculations often depend on sampling methods such as Markov Chain Monte Carlo (MCMC).

I won't go into the details of the theory of MCMC here. Instead I'll show a practical example of applying an MCMC approach using Dan Foreman-Mackey's excellent emcee package. Keep in mind here that the goal is to generate a set of points drawn from the posterior probability distribution, and to use those points to determine the answer we seek.

To perform this MCMC, we start by defining Python functions for the prior \(P(F_{\rm true})\), the likelihood \(P(D~|~F_{\rm true})\), and the posterior \(P(F_{\rm true}~|~D)\), noting that none of these need be properly normalized. Our model here is one-dimensional, but to handle multi-dimensional models we'll define the model in terms of an array of parameters \(\theta\), which in this case is \(\theta = [F_{\rm true}]\):

In [4]:
def log_prior(theta):
    return 1  # flat prior

def log_likelihood(theta, F, e):
    return -0.5 * np.sum(np.log(2 * np.pi * e ** 2)
                         + (F - theta[0]) ** 2 / e ** 2)

def log_posterior(theta, F, e):
    return log_prior(theta) + log_likelihood(theta, F, e)

Now we set up the problem, including generating some random starting guesses for the multiple chains of points.

In [5]:
ndim = 1  # number of parameters in the model
nwalkers = 50  # number of MCMC walkers
nburn = 1000  # "burn-in" period to let chains stabilize
nsteps = 2000  # number of MCMC steps to take

# we'll start at random locations between 0 and 2000
starting_guesses = 2000 * np.random.rand(nwalkers, ndim)

import emcee
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[F, e])
sampler.run_mcmc(starting_guesses, nsteps)

sample = sampler.chain  # shape = (nwalkers, nsteps, ndim)
sample = sampler.chain[:, nburn:, :].ravel()  # discard burn-in points

If this all worked correctly, the array sample should contain a series of 50000 points drawn from the posterior. Let's plot them and check:

In [6]:
# plot a histogram of the sample
plt.hist(sample, bins=50, histtype="stepfilled", alpha=0.3, normed=True)

# plot a best-fit Gaussian
F_fit = np.linspace(975, 1025)
pdf = stats.norm(np.mean(sample), np.std(sample)).pdf(F_fit)

plt.plot(F_fit, pdf, '-k')
plt.xlabel("F"); plt.ylabel("P(F)")
Out[6]:
<matplotlib.text.Text at 0x1075c7510>

We end up with a sample of points drawn from the (normal) posterior distribution. The mean and standard deviation of this posterior are the counterparts of the frequentist maximum likelihood estimate and its error above:

In [7]:
print("""
      F_true = {0}
      F_est  = {1:.0f} +/- {2:.0f} (based on {3} measurements)
      """.format(F_true, np.mean(sample), np.std(sample), N))

      F_true = 1000
      F_est  = 998 +/- 4 (based on 50 measurements)
      

We see that as expected for this simple problem, the Bayesian approach yields the same result as the frequentist approach!

Discussion

Now, you might come away with the impression that the Bayesian method is unnecessarily complicated, and in this case it certainly is. Using an Affine Invariant Markov Chain Monte Carlo Ensemble sampler to characterize a one-dimensional normal distribution is a bit like using the Death Star to destroy a beach ball, but I did this here because it demonstrates an approach that can scale to complicated posteriors in many, many dimensions, and can provide nice results in more complicated situations where an analytic likelihood approach is not possible.

As a side note, you might also have noticed one little sleight of hand: at the end, we use a frequentist approach to characterize our posterior samples! When we computed the sample mean and standard deviation above, we were employing a distinctly frequentist technique to characterize the posterior distribution. The pure Bayesian result for a problem like this would be to report the posterior distribution itself (i.e. its representative sample), and leave it at that. That is, in pure Bayesianism the answer to a question is not a single number with error bars; the answer is the posterior distribution over the model parameters!

Adding a Dimension: Exploring a more sophisticated model

Let's briefly take a look at a more complicated situation, and compare the frequentist and Bayesian results yet again. Above we assumed that the star was static: now let's assume that we're looking at an object which we suspect has some stochastic variation — that is, it varies with time, but in an unpredictable way (a Quasar is a good example of such an object).

We'll propose a simple 2-parameter Gaussian model for this object: \(\theta = [\mu, \sigma]\) where \(\mu\) is the mean value, and \(\sigma\) is the standard deviation of the variability intrinsic to the object. Thus our model for the probability of the true flux at the time of each observation looks like this:

\[ F_{\rm true} \sim \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[\frac{-(F - \mu)^2}{2\sigma^2}\right]\]

Now, we'll again consider \(N\) observations each with their own error. We can generate them this way:

In [8]:
np.random.seed(42)  # for reproducibility
N = 100  # we'll use more samples for the more complicated model
mu_true, sigma_true = 1000, 15  # stochastic flux model

F_true = stats.norm(mu_true, sigma_true).rvs(N)  # (unknown) true flux
F = stats.poisson(F_true).rvs()  # observed flux: true flux plus Poisson errors.
e = np.sqrt(F)  # root-N error, as above

Varying Photon Counts: The Frequentist Approach

The resulting likelihood is the convolution of the intrinsic distribution with the error distribution, so we have

\[\mathcal{L}(D~|~\theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi(\sigma^2 + e_i^2)}}\exp\left[\frac{-(F_i - \mu)^2}{2(\sigma^2 + e_i^2)}\right]\]

Analogously to above, we can analytically maximize this likelihood to find the best estimate for \(\mu\):

\[\mu_{est} = \frac{\sum w_i F_i}{\sum w_i};~~w_i = \frac{1}{\sigma^2 + e_i^2} \]

And here we have a problem: the optimal value of \(\mu\) depends on the optimal value of \(\sigma\). The results are correlated, so we can no longer use straightforward analytic methods to arrive at the frequentist result.

Nevertheless, we can use numerical optimization techniques to determine the maximum likelihood value. Here we'll use the optimization routines available within Scipy's optimize submodule:

In [9]:
def log_likelihood(theta, F, e):
    return -0.5 * np.sum(np.log(2 * np.pi * (theta[1] ** 2 + e ** 2))
                         + (F - theta[0]) ** 2 / (theta[1] ** 2 + e ** 2))

# maximize likelihood <--> minimize negative likelihood
def neg_log_likelihood(theta, F, e):
    return -log_likelihood(theta, F, e)

from scipy import optimize
theta_guess = [900, 5]
theta_est = optimize.fmin(neg_log_likelihood, theta_guess, args=(F, e))
print("""
      Maximum likelihood estimate for {0} data points:
          mu={theta[0]:.0f}, sigma={theta[1]:.0f}
      """.format(N, theta=theta_est))
Optimization terminated successfully.
         Current function value: 502.839505
         Iterations: 58
         Function evaluations: 114

      Maximum likelihood estimate for 100 data points:
          mu=999, sigma=19
      

This maximum likelihood value gives our best estimate of the parameters \(\mu\) and \(\sigma\) governing our model of the source. But this is only half the answer: we need to determine how confident we are in this answer, that is, we need to compute the error bars on \(\mu\) and \(\sigma\).

There are several approaches to determining errors in a frequentist paradigm. We could, as above, fit a normal approximation to the maximum likelihood and report the covariance matrix (here we'd have to do this numerically rather than analytically). Alternatively, we can compute statistics like \(\chi^2\) and \(\chi^2_{\rm dof}\) and use standard tests to determine confidence limits, which also depends on strong assumptions about the Gaussianity of the likelihood. We might alternatively use randomized sampling approaches such as Jackknife or Bootstrap, which maximize the likelihood for randomized samples of the input data in order to explore the degree of certainty in the result.
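
For example, the first route could be sketched as follows (my own hypothetical code, not from the original post): a BFGS fit returns an approximate inverse Hessian of the negative log-likelihood, whose diagonal gives approximate parameter variances.

# Hypothetical sketch: error bars from the approximate inverse Hessian of the
# negative log-likelihood at its minimum (itself only an approximation, so
# treat these numbers loosely).
res = optimize.minimize(neg_log_likelihood, theta_guess, args=(F, e), method='BFGS')
mu_err, sigma_err = np.sqrt(np.diag(res.hess_inv))
print("mu    = {0:.0f} +/- {1:.0f}".format(res.x[0], mu_err))
print("sigma = {0:.0f} +/- {1:.0f}".format(abs(res.x[1]), sigma_err))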

All of these would be valid techniques to use, but each comes with its own assumptions and subtleties. Here, for simplicity, we'll use the basic bootstrap resampler found in the astroML package:

In [10]:
from astroML.resample import bootstrap

def fit_samples(sample):
    # sample is an array of size [n_bootstraps, n_samples]
    # compute the maximum likelihood for each bootstrap.
    return np.array([optimize.fmin(neg_log_likelihood, theta_guess,
                                   args=(F, np.sqrt(F)), disp=0)
                     for F in sample])

samples = bootstrap(F, 1000, fit_samples)  # 1000 bootstrap resamplings

Now in a similar manner to what we did above for the MCMC Bayesian posterior, we'll compute the sample mean and standard deviation to determine the errors on the parameters.

In [11]:
mu_samp = samples[:, 0]
sig_samp = abs(samples[:, 1])

print " mu    = {0:.0f} +/- {1:.0f}".format(mu_samp.mean(), mu_samp.std())
print " sigma = {0:.0f} +/- {1:.0f}".format(sig_samp.mean(), sig_samp.std())
 mu    = 999 +/- 4
 sigma = 18 +/- 5

I should note that there is a huge literature on the details of bootstrap resampling, and there are definitely some subtleties of the approach that I am glossing over here. One obvious piece is that there is potential for errors to be correlated or non-Gaussian, neither of which is reflected by simply finding the mean and standard deviation of each model parameter. Nevertheless, I trust that this gives the basic idea of the frequentist approach to this problem.

Varying Photon Counts: The Bayesian Approach

The Bayesian approach to this problem is almost exactly the same as it was in the previous problem, and we can set it up by slightly modifying the above code.

In [12]:
def log_prior(theta):
    # sigma needs to be positive.
    if theta[1] <= 0:
        return -np.inf
    else:
        return 0

def log_posterior(theta, F, e):
    return log_prior(theta) + log_likelihood(theta, F, e)

# same setup as above:
ndim, nwalkers = 2, 50
nsteps, nburn = 2000, 1000

starting_guesses = np.random.rand(nwalkers, ndim)
starting_guesses[:, 0] *= 2000  # start mu between 0 and 2000
starting_guesses[:, 1] *= 20    # start sigma between 0 and 20

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[F, e])
sampler.run_mcmc(starting_guesses, nsteps)

sample = sampler.chain  # shape = (nwalkers, nsteps, ndim)
sample = sampler.chain[:, nburn:, :].reshape(-1, 2)

Now that we have the samples, we'll use a convenience routine from astroML to plot the traces and the contours representing one and two standard deviations:

In [13]:
from astroML.plotting import plot_mcmc
fig = plt.figure()
ax = plot_mcmc(sample.T, fig=fig, labels=[r'$\mu$', r'$\sigma$'], colors='k')
ax[0].plot