
Sunday, August 4, 2013

How to support both Python 2 and 3

I'll start with the conclusion: making a backwards incompatible version of a language is a terrible idea, and it was a bad mistake. The mistake was somewhat corrected over the years by eventually adding features to both Python 2.7 and 3.3 that allow running a single code base on both Python versions --- an approach which, as I show below, was discouraged by both Guido and the official Python documents (though the latest docs do mention it)... Nevertheless, a single code base fixes pretty much all the problems and it actually is fun to use Python again. The rest of this post explains my conclusion in detail. My hope is that it will be useful to other Python projects as a source of tips and examples for supporting both Python 2 and 3, as well as to future language designers, as a reminder to keep languages backwards compatible.

When Python 3.x got released, it was pretty much a new language, backwards incompatible with Python 2.x, as it was not possible to run the same source code in both versions. I was extremely unhappy about this situation, because I simply didn't have time to port all my Python code to a new language.

I read the official documentation about how the transition should be done, quoting:

You should have excellent unit tests with close to full coverage.

  1. Port your project to Python 2.6.
  2. Turn on the Py3k warnings mode.
  3. Test and edit until no warnings remain.
  4. Use the 2to3 tool to convert this source code to 3.0 syntax. Do not manually edit the output!
  5. Test the converted source code under 3.0.
  6. If problems are found, make corrections to the 2.6 version of the source code and go back to step 3.
  7. When it's time to release, release separate 2.6 and 3.0 tarballs (or whatever archive form you use for releases).

I've also read Guido's blog post, which repeats the above list and adds an encouraging comment:

Python 3.0 will break backwards compatibility. Totally. We're not even aiming for a specific common subset.

In other words, one has to maintain a Python 2.x code base, then run the 2to3 tool to get it converted. If you want to develop using Python 3.x, you can't, because all code must be developed in 2.x. As to the actual porting, Guido says in the above post:

If the conversion tool and the forward compatibility features in Python 2.6 work out as expected, steps (2) through (6) should not take much more effort than the typical transition from Python 2.x to 2.(x+1).

So sometime in 2010 or 2011 I started porting SymPy, which is now a pretty large code base (sloccount says over 230,000 lines of code, and in January 2010 it said almost 170,000 lines). I remember spending a few full days on it, and I just gave up, because it wasn't a matter of changing a few things, but of changing pretty fundamental things inside the code base, and one cannot do it half-way --- one has to get all the way through and then polish it up. We ended up using one full Google Summer of Code project for it; you can read the final report. I should mention that we use metaclasses and other things that make such porting harder. Conclusion: this was definitely not "the typical transition from Python 2.x to 2.(x+1)".

Ok, after months of hard work by a lot of people, we finally have a Python 2.x code base that can be translated using the 2to3 tool, and the result works and the tests pass in Python 3.x.

The next problem is that Python 3.x is pretty much like a ghetto -- you can use it as a user, but you can't develop in it. The 2to3 translation takes over 5 minutes on my laptop, so any interactivity is gone. It is true that the tool can cache results, so the next pass is somewhat faster, but in practice this still turns out to be much, much worse than any compilation of C or Fortran programs (done for example with cmake), both in terms of time and in terms of robustness. And I am not even talking about the pip and setup.py issues around calling 2to3. What a big mess... Programming should be fun, but this is not fun.
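
(For reference, the setup.py hook that drove this looked roughly like the sketch below; this is a from-memory illustration of the distribute/setuptools use_2to3 option, not SymPy's actual setup.py.)

```python
from setuptools import setup

setup(
    name="mypackage",        # hypothetical project name
    version="0.1",
    packages=["mypackage"],
    use_2to3=True,           # ask setuptools/distribute to run 2to3 over the sources at build time
)
```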

I'll be honest: this situation killed a lot of my enthusiasm for Python as a platform. I learned modern Fortran in the meantime, and with admiration I noticed that it still compiles old F77 programs without modification, and I even managed to compile a 40 year old pre-F77 code with only minimal modifications (I had to port the code to F77). Yet modern Fortran is pretty much a completely different language, with all the fancy features that one would want. Together with my colleagues I created the fortran90.org website, where you can compare Python/NumPy side by side with modern Fortran: it's pretty much a 1:1 translation with similar syntax (for numerical code), except that you need to add types, of course. Yet Fortran is fully backwards compatible. What a pleasure to work with!

Fast forward to last week. A heroic effort by Sean Vig, who ported SymPy to a single code base (#2318), was merged. Earlier this year, similar pull requests by other people converted the NumPy (#3178, #3191, #3201, #3202, #3203, #3205, #3208, #3216, #3223, #3226, #3227, #3231, #3232, #3235, #3236, #3237, #3238, #3241, #3242, #3244, #3245, #3248, #3249, #3257, #3266, #3281, ...) and SciPy (#397) code bases as well. Now all these projects have just one code base and it works in all Python versions (2.x and 3.x) without the need to call the 2to3 tool.

With a single code base, programming in Python is fun again. You can choose any Python version, be it 2.x or 3.x, and simply submit a patch. The patch is then tested with Travis-CI against all supported Python versions, so we know it works in all of them. Installation has been simplified (no need to call any 2to3 tools and no more hacks to get setup.py working).

In other words, this is how it should be: you write your code once, and you can use any supported language version to run it, compile it, or develop in it. But for some reason, this obvious solution has been discouraged by Guido and other Python documents, as seen above. I just looked up the latest official Python docs, and they are no longer upfront negative about a single code base. But they still do not recommend this approach as the one to use. So let me fix that: I do recommend a single code base as the solution.
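
To give a flavor of what a single code base means in practice, here is a minimal illustrative sketch of the kind of compatibility idioms involved (a toy example of my own, not code taken from SymPy or NumPy):

```python
from __future__ import print_function, division

import sys

PY3 = sys.version_info[0] >= 3

if PY3:
    string_types = (str,)
else:
    string_types = (basestring,)   # basestring only exists on Python 2

def describe(obj):
    # runs unchanged on both Python 2 and 3
    if isinstance(obj, string_types):
        return "text: " + obj
    return "value: " + repr(obj)

print(describe("hello"), describe(7 / 2))
```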

The newest Python documentation from the last paragraph also mentions:

Regardless of which approach you choose, porting is not as hard or time-consuming as you might initially think.

Well, I encourage you to browse through the pull requests that I linked to above for SymPy, NumPy or SciPy. I think it is very time consuming, and that's just converting from the 2to3 workflow to a single code base, which is the easy part. The hard part was to actually get SymPy to work with Python 3 (as I discussed above, that took a couple of months of hard work), and I am pretty sure it was pretty hard to port NumPy and SciPy as well.

The docs also say:

It [a single code base] does lead to code that is not entirely idiomatic Python

That is true, but our experience has been that with every Python version we drop, we also get to delete lots of ugly hacks from our code base. This has been true for dropping support for 2.3, 2.4 and 2.5, and I expect it will also be true for dropping 2.6 and especially 2.7, when we can simply use the Python 3.x syntax. So not a big deal overall.

To sum this blog post up: as far as I am concerned, pretty much all the problems with supporting Python 2.x and 3.x are fixed by having a single code base. You can read the pull requests above to see how we implemented things (metaclasses and other fancy stuff...). Python is still pretty much the same language: you write your code, you use a Python version of your choice, and things just work. Not a big deal overall. The official documentation should be fixed to recommend this approach, and deprecate the other approaches.

I think that Python is great and I hope it will be used more in the future.

Written with StackEdit.

Monday, July 1, 2013

My impressions from the SciPy 2013 conference

I have attended the SciPy 2013 conference in Austin, Texas. Here are my impressions.

Number one is the fact that the IPython notebook was used by pretty much everyone. I use it a lot myself, but I didn't realize how ubiquitous it has become. It is quickly becoming the standard now. The IPython notebook uses Markdown, which in my opinion is better than ReST. The way to remember the "[]()" syntax for links is that in regular text you put links into () parentheses, so you do the same in Markdown and prepend [] with the text of the link, for example [SymPy](http://sympy.org). The other way to remember is that [] feels more serious and is thus used for the text of the link. I stressed several times to +Fernando Perez and +Brian Granger how awesome it would be to have interactive widgets in the notebook. Fortunately that was pretty much preaching to the choir, as that's one of the first things they plan to implement good foundations for, and I just can't wait to use that.

It is now clear that the IPython notebook is the way to store computations that I want to share with other people, or to use as a "lab notebook" for myself, so that I can remember what exactly I did to obtain the results (for example how exactly I obtained some figures from raw data). In other words --- instead of having sets of scripts and manual bash commands that have to be executed in a particular order to do what I want, I just use an IPython notebook and put everything in there.

Number two is how big the conference has become since the last time I attended (a couple of years ago), yet it still has a friendly feeling. Unfortunately, I had to miss a lot of talks due to scheduling conflicts (there were three parallel sessions), so I look forward to seeing them on video.

+Aaron Meurer and I gave the SymPy tutorial (see the link for videos and other tutorial materials). It's been nice to finally meet +Matthew Rocklin (a very active SymPy contributor) in person. He also had an interesting presentation about symbolic matrices + LAPACK code generation. +Jason Moore presented PyDy. It's been a great pleasure for us to invite +David Li (still a high school student) to attend the conference and give a presentation about his work on sympygamma.com and live.sympy.org.

It was nice to meet the Julia guys, +Jeff Bezanson and +Stefan Karpinski. I contributed the Fortran benchmarks on Julia's website some time ago, but I had the feeling that a lot of them are quite artificial and not very meaningful; I think Jeff and Stefan confirmed my feeling. Julia seems to have quite an interesting type system and multiple dispatch, which SymPy should learn from.

I met the VTK guys +Matthew McCormick and +Pat Marion. One of the keynotes was given by +Will Schroeder from Kitware about publishing. I remember him stressing the importance of managing dependencies well, as well as of using a BSD-like license (as opposed to viral licenses like the GPL or LGPL), and noting that opensource has pretty much won (i.e. it is now clear that that is the way to go).

I had great discussions with +Francesc Alted, +Andy Terrel, +Brett Murphy, +Jonathan Rocher, +Eric Jones, +Travis Oliphant, +Mark Wiebe, +Ilan Schnell, +Stéfan van der Walt, +David Cournapeau, +Anthony Scopatz, +Paul Ivanov, +Michael Droettboom, +Wes McKinney, +Jake Vanderplas, +Kurt Smith, +Aron Ahmadia, +Kyle Mandli, +Benjamin Root and others.


It's also been nice to have a chat with +Jason Vertrees and other guys from Schrödinger.

One other thing that I realized last week at the conference is that pretty much everyone agreed that NumPy should act as the default way to represent memory (no matter whether the array was created in Fortran or other code) and to allow manipulations on it. Faster libraries like Blaze or ODIN should then hook themselves up into NumPy using multiple dispatch. SymPy would then also hook itself up so that it can be used with array operations natively. Currently SymPy does work with NumPy (see our tests for some examples of what works), but the solution is a bit fragile (it is not possible to override NumPy behavior, but because NumPy supports general objects, we simply give it SymPy objects and things mostly work).
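
As a small illustration of that last point, here is a toy example of mine (not from the SymPy test suite) showing NumPy happily operating on an object array of SymPy expressions:

```python
import numpy as np
import sympy as sp

x, y = sp.symbols('x y')

# an object array holding SymPy expressions; NumPy delegates the arithmetic
# to the objects themselves, element by element
a = np.array([x, y, x*y], dtype=object)
print(a * 2 + 1)    # elementwise SymPy expressions: 2*x + 1, 2*y + 1, 2*x*y + 1
print(np.sum(a))    # the symbolic sum of the elements
```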

Similarly, I would like to create multiple dispatch in the SymPy core itself, so that other (faster) libraries for symbolic manipulation can hook themselves up and have their own (faster) multiplication, expansion or series expansion called instead of the SymPy default one implemented in pure Python.
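
The dispatch idea itself is simple; a hand-rolled sketch (purely illustrative, with hypothetical names, and not how the SymPy core is actually structured) might look like this:

```python
# map (type of left operand, type of right operand) -> specialized implementation
_mul_registry = {}

def register_mul(type_a, type_b):
    def decorator(func):
        _mul_registry[(type_a, type_b)] = func
        return func
    return decorator

def mul(a, b):
    # dispatch on the types of *both* arguments, falling back to the generic version
    impl = _mul_registry.get((type(a), type(b)), _generic_mul)
    return impl(a, b)

def _generic_mul(a, b):
    # slow, pure Python default
    return a * b

@register_mul(int, int)
def _fast_int_mul(a, b):
    # stand-in for a fast specialized routine provided by an external library
    return a * b

print(mul(3, 4), mul(2.0, 5))
```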

Other blog posts from the conference:

Monday, June 4, 2012

How to convert scanned images to pdf

From time to time I need to convert scanned documents to a pdf format.


Usage scenario 1: I scan part of a book (i.e. some article) on a school's scanner that sends me 10 big separate color pdf files (one pdf per page). I want to get one nice, small (black and white) pdf file with all the pages.


Usage scenario 2: I download a web form, print it, fill it in, sign it, scan it on my own scanner using Gimp and now I want to convert the image into a nice pdf file (either color or black & white) to send back over email.

Solution: I save the original files (be it pdf or png) into a folder and use git to track them. Then I create a simple script that converts them to the final format (view it as a pipeline). Often I need to tweak one or two parameters in the pipeline.

Here is a script for scenario 1:

And here for scenario 2:

There can be several surprises along the way. From my experience:

  • If I convert png directly to tiff, sometimes the resolution can be wrong. The solution is to always convert to ppm (color) or pbm (black and white) first, which is just a simple file format containing the raw pixels. This is the "starting" format (so first I need to convert the initial pdf or png into ppm/pbm) and then do anything else. That proved to be very robust.
  • The tiff2pdf utility proved to be the most robust way to convert an image to a pdf. All other ways that I have tried failed in one way or another (resolution, positioning, paper format and other things were wrong...). It can create multi-page pdf files, set the paper format (US Letter, A4, ...) and so on.
  • The convert utility (from ImageMagick) is a robust tool for cropping images, converting color to black and white (using a threshold, for example) and other things, as long as the image is first converted to ppm/pbm. In principle it can also produce pdf files, but that didn't work well for me.
  • I sometimes use the unpaper program in the pipeline for some automatic polishing of the images.
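
A minimal Python sketch of such a pipeline, in the spirit of the scripts above (not the actual scripts; it assumes ImageMagick's convert, plus unpaper, tiffcp and tiff2pdf are installed, and the file names and threshold are made up), could look roughly like this:

```python
import glob
import subprocess

def run(cmd):
    print(" ".join(cmd))
    subprocess.check_call(cmd)

tiffs = []
for i, png in enumerate(sorted(glob.glob("scan_*.png"))):
    pbm = "page_%02d.pbm" % i
    clean = "page_%02d_clean.pbm" % i
    tif = "page_%02d.tif" % i
    # always go through pbm (raw black & white pixels) first; converting
    # directly to tiff sometimes gets the resolution wrong
    run(["convert", png, "-threshold", "60%", pbm])
    run(["unpaper", pbm, clean])          # optional automatic cleanup
    run(["convert", clean, tif])
    tiffs.append(tif)

# merge the pages into one multi-page tiff, then let tiff2pdf produce the final pdf
run(["tiffcp"] + tiffs + ["book.tif"])
run(["tiff2pdf", "-p", "letter", "-o", "book.pdf", "book.tif"])
```
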
In general, I am happy with my solution. So far I have always been able to get what I needed using this "pipeline" method.

Thursday, January 26, 2012

When double precision is not enough

I was doing a finite element (FE) calculation and I needed the sum of the lowest 7 eigenvalues of a symmetric matrix (which comes from the FE assembly) to converge to at least 1e-8 accuracy (so that I could check a calculation done by another solver of mine that calculates the same thing but doesn't use FE). In reality I wanted the value rounded to 8 decimal digits to be correct, so I really needed 1e-9 accuracy (it's ok if it is, let's say, 2e-9, but not ok if it is 9e-9). With my FE solver, I couldn't get it to converge to better than roughly 5e-7, no matter how hard I tried. Now what?

When doing the convergence study, I take a good mesh and keep increasing "p" (the polynomial order) until it converges. For my particular problem, it is fully converged at about p=25 (the solver supports orders up to 64). Increasing "p" further does not increase the accuracy anymore, and the accuracy stays at the level of 5e-7 for the sum of the lowest 7 eigenvalues. For optimal meshes it converges at p=25, for non-optimal meshes it converges at higher "p", but in all cases it doesn't get below 5e-7.

I know from experience that for simpler problems, the FE solver can easily converge to 1e-10 or better using double precision. So I know it is doable; now the question is what the problem is. There are a few possible reasons:

  • The FE quadrature is not accurate enough
  • The condition number of the matrix is high, thus LAPACK doesn't return very accurate eigenvalues
  • Bug in the assembly/solver (like single/double corruption in Fortran, or some other subtle bug)

When using the same solver with a simpler potential, it converged nicely to 1e-10, which suggests there is no bug in the assembly or the solver itself. It is possible that the quadrature is not accurate enough, but again, if it converges for a simple problem, that is probably not it either. So it seems it is the ill-conditioned matrix that causes this. So I printed the residuals (which I simply calculated in Fortran using the matrix and the eigenvectors returned by LAPACK), and they only showed 1e-9. For simpler problems, they easily go down to 1e-14. So that must be it. How do we fix it?

Obviously by making the matrix less ill-conditioned. The ill-conditioning is caused by the mesh for this problem (the ratio of the longest to the shortest element is 1e9), but for my problem I really needed such a mesh. So the other option is to increase the precision of the real numbers.

In Fortran all real variables are declared as real(dp), where dp is an integer defined at a single place in the project. There are several ways to define it, but its value is 8 for gfortran and it means double precision. So I increased it to 16 (quadruple precision) and recompiled. Now the whole program calculates in quadruple precision (more than 30 significant digits). I had to recompile LAPACK using the "-fdefault-real-8" gfortran option, which promotes all double precision numbers to quadruple precision, and I used the "d" versions (double precision, now promoted to quadruple) of the LAPACK routines.
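
(As an aside, the same "pick the precision in one place" idea can be sketched in Python/NumPy; this is only a loose analogy of the Fortran dp parameter, not the code from this post, and NumPy's longdouble is typically 80-bit extended precision rather than true quadruple:)

```python
import numpy as np

# analogue of Fortran's "dp" parameter: the working precision is chosen in ONE place
DTYPE = np.float64      # switch to np.longdouble to redo the run in extended precision

def assemble(n):
    # toy stand-in for the FE assembly: a Hilbert-like, badly conditioned matrix
    i = np.arange(1, n + 1, dtype=DTYPE)
    return 1 / (i[:, None] + i[None, :] - 1)

A = assemble(12)
print(A.dtype, np.linalg.cond(A.astype(np.float64)))
```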

I reran the calculation --- and suddenly the LAPACK residuals are around 1e-13, and the solver converges to 1e-10 easily (for the sum of the lowest 7 eigenvalues). Problem solved.

Turning my Fortran program to quadruple precision is as easy as changing one variable and recompiling. Turning LAPACK to quadruple precision is easy with a single gfortran flag (LAPACK uses the old F77 syntax for double precision; if it used real(dp), I would simply change that as I did for my program). The whole calculation got at least 10x slower with quadruple precision. The reason is that the gfortran runtime uses the libquadmath library, which implements quadruple precision in software (current CPUs only support double precision natively).

I actually discovered a few bugs in my program (typically some constants in older code didn't use the "dp" syntax, but had double precision hardwired). The compiler warns about all such cases, where real variables have incompatible precision.

It is amazing how easy it is to work with different precisions in Fortran (literally just one change and a recompile). How could this be done in C++? This Wikipedia page suggests that "long double" is only 80-bit in most cases (quadruple is 128-bit), but gcc offers __float128, so it seems I would have to manually change all "double" declarations to "__float128" in the whole C++ program (this could be done with a single "sed" command).

Thursday, November 18, 2010

Google Code vs GitHub for hosting opensource projects

Cython is now considering options for where to move its main (Mercurial) repository, and Robert Bradshaw (one of the main Cython developers) has asked me about my experience with Google Code and GitHub, since we use both with SymPy.

Google Code is older, and it was the first service that provided a free, virtually unlimited number of projects that you could easily and immediately set up. At that time (4 years ago?) that was something unheard of. However, the GitHub guys have in the meantime not only made this available too, but also implemented features that (as far as I know) no one else offers at all, in particular hosting your own pages at your own domain (but on GitHub's servers; some examples are sympy.org and docs.sympy.org), commenting on git branches and pull requests before the code gets merged in (I am 100% convinced that this is the right approach, as opposed to commenting on the code after it gets in), easy forking of the repository, and simply more social features that Google Code doesn't have.

I believe that managing an opensource project is mainly a social activity, and GitHub's social features really make so many things easier. From this point of view, GitHub is clearly the best choice today.

I think there is only one (but potentially big) problem with GitHub: its issue tracker is very bad compared to the Google Code one. For that reason (and also because we already use it), we keep the SymPy issues at Google Code.

The above are the main things to consider. Now there are some little things to keep in mind, which I will briefly touch on below. Google Code doesn't support git and blocks access from Cuba and other countries. With Google Code, when you want to change the front page, you need to be an admin, while at GitHub I simply give push access to all sympy developers, so anyone just pushes a patch to this repository: https://github.com/sympy/sympy.github.com, and it automatically appears on our front page (sympy.org). With Google Code we had to write long pages (in our docs) about how to send patches; with GitHub we just say "send us a pull request" and point to http://help.github.com/pull-requests/. In other words, GitHub takes care of teaching people how to use git and how to send patches, and we can concentrate on reviewing the patches and pushing them in.

Wiki pages at GitHub are maintained in git, and GitHub provides the web frontend to them as opensource, so there is no vendor lock-in. Anyone with a GitHub account can modify our wiki pages, while the Google Code pages can only be modified by people that I add to the Google Code project. That forced us to install MediaWiki on my Linode server (hosted at linode.com, which by the way is an excellent VPS hosting service that I have been using for a couple of years already and can fully recommend), and I had to manage it all the time. Now we are moving our pages to the GitHub wiki, so that I have one less thing to worry about.

So as you can see, I, as an admin, have fewer things to worry about, as GitHub now manages everything for me, while with Google Code I had to manage lots of things on my Linodes.

One other thing to consider is that GitHub is primarily for git, but they also provide svn and hg access (both push and pull; they translate the repository automatically between git and svn/hg). I never really used it much, so I don't know how stable it is. As I wrote before, I think that git is the best tool now for maintaining a project, and I think that GitHub is now the best choice to host it (except for the issue tracker, where Google Code is better).

Sunday, October 31, 2010

git has won

I switched to git from mercurial about two years ago. See here why I switched and here my experience after 4 months. Back then I was unsure whether git would win, but I thought it had bigger momentum. Well, I think that now it's quite clear that git has already won. Pretty much everybody that I collaborate with is using git now.

I use GitHub every day, and now, thanks to GitHub pull requests, I think it's the best collaboration platform out there (compared to Google Code, Sourceforge, Bitbucket or Launchpad).

I think it's partly because the GitHub guys have a clear vision of what has to be done to make collaboration easier, and they do it. But more importantly, git branches are the way to go, as are other git features that were "right" from the beginning (branches, interactive rebase, and so on), while other VCSs like bzr and mercurial either don't have them, or are getting them in a form that is hard to get used to (for example mercurial uses "mercurial queues", and I think that is the totally wrong approach).

Anyway, this is just my own personal opinion. I'll be happy to discuss it in the comments, if you disagree.

Monday, July 19, 2010

Theoretical Physics Reference Book

Today I fulfilled my old dream --- I just created my first book! Here is how it looks:

More images here.

Here is the source code of the book: http://github.com/certik/theoretical-physics. The repository contains a 'master' branch with the code and a 'gh-pages' branch with the generated html pages, which are hosted at GitHub at the URL theoretical-physics.net.

Then I published the book at Lulu: http://www.lulu.com/product/hardcover/theoretical-physics-reference/11612144. I wanted a hardcover book, so I set up a project at Lulu, used some Lulu templates for the cover, and that was it. Lulu's price for the book is $19.20 (166 black & white pages, hardcover); I can then set my own price and the rest of the money probably goes to me. I set the price to $20, because Lulu has free shipping for orders of $20 or more. You can also download the pdf (for free) at the above link (or just use my git repository). So far this didn't cost me anything.

I then ordered the book myself (just like anybody else would, at the above address) and it arrived today. It's a regular hardcover book. Beautiful; you can browse the pictures above. It smells delicious (that you have to believe me). And all that it cost me was $19.20.

As for the contents, you can browse it online at theoretical-physics.net; essentially it's most of the physics notes that I have collected over the years. I'd like to treat books like software --- release early, release often. This is my first release and I would call it beta. The main purpose of it was to see if everything goes through, how long it takes (the date inside the book is July 4, 2010; I created and ordered it on July 5 and got the physical book on July 19) and what the quality is (excellent). I also wanted to see how the pages correspond to the pdf (you can see for yourself in the photos; click on the picasa link above).

Now I need to improve the first pages a bit, as well as the last pages, improve the index, write some foreword and so on. I also need to think about how to better organize the contents and generally improve it. I also need to figure out some versioning scheme; so far this is version 0.1. I think I'll do edition 1, edition 2, edition 3, and so on, and whenever I feel that I have added enough new content, I'll just publish it as a new edition. So if you want to buy it, I suggest waiting for my 1.0 version, which will have the mentioned improvements.

It'd also be cool to have all the editions online somehow and to create nice webpages for it (currently theoretical-physics.net points directly to the book html itself).

So far the book is just text. I still need to figure out how to handle pictures and also whether or not to use program examples (in Python, using sympy, scipy, etc.). So far I am inclined not to put any program code in, as then I don't need to maintain it.

Overall I am very pleased with the quality; up to the minor issues that I mentioned above, everything else ended up just fine. I think we have come a long way from the discovery of the printing press. Anybody can now create a book for free, and if you want to hold a hardcopy in your hands, it costs around $20. You don't need to order certain quantities of books, nor partner with some publisher, etc. I think that's just awesome.

Sunday, December 13, 2009

ESCO 2010 conference

An interesting conference, the 2nd European Seminar on Coupled Problems (ESCO), will be held June 28 -- July 2, 2010 in Pilsen, Czech Republic. Among the topics are solving PDEs and applications, and using Python for scientific computing. In particular, Gaël Varoquaux is the keynote speaker.

It was later announced that the SciPy 2010 conference is going to be held at the same time, which is really unfortunate. But here are some reasons why you should consider going to ESCO 2010 instead:
  • If you like numerical calculations (finite elements, differences, volumes, ...), solving partial differential equations and other problems, and also programming in Python together with C/C++ or Fortran, you will have a chance to meet some of the top people in the field. The SciPy conference usually has people who solve PDEs (e.g. SciPy 09 had about 6), but ESCO 2010 will have about 60. So ESCO wins.

  • Robert Cimrman, whom you probably know from the scipy and numpy mailing lists, and who is also the author of the sfepy FEM package in Python, lives in Pilsen, so he'll gladly show you some good Pilsen pubs. SciPy 2010 is going to be in Austin, and while Austin has cool pubs too (I must be fair, I liked them when I was there for the Sage 08 days), it's just not comparable: the beer is better in Pilsen, it's a historic city, and there are more pubs.

  • Pilsen is close to Prague, so you will have the chance to visit it. You should walk around the old town, have a couple of beers, etc. Here you can see some photos that Gaël took when we met in Prague. Again, this is incomparable with Austin.

  • It is held in the Pilsner Urquell Brewery. When Pavel Šolín announced that at the SciPy 09 conference, Jarrod asked "Ah, in a beer pub?". So let me be clear. The word pilsner (a type of beer) comes from the Czech city of Pilsen (Plzeň in Czech). Pilsner Urquell is not some beer pub (even Reno, where I live now, has a beer pub), it's The Brewery. Austin is a cool place (and Texas steaks are really good), but as you can see now, it absolutely cannot compete with Pilsen.


If you have time, I can fully recommend going to ESCO 2010.

Monday, August 24, 2009

SciPy 2009 Conference

I attended the SciPy 2009 conference last week at Caltech.

When I first presented SymPy at the SciPy 07 conference exactly 2 years ago, it was just something that we had started, so no one really used it. Last week, there were already 6 other people at the conference who have contributed one or more patches to SymPy: Robert Kern, Andrew Straw, Pauli Virtanen, Brian Granger, Bill Flynn and Luke Peterson.

I gave a SymPy tutorial and the main presentation, and Luke gave a PyDy + SymPy lightning talk (seek to 16:04).

I also gave a lightning talk about my experience with designing a Traits GUI for FEM (seek to 5:35).

My advisor Pavel Solin gave a talk about Hermes and FEMhub and other things that we do in our group in Reno.

Besides that, it was awesome to meet all the guys from the scientific Python community, meet old friends and also get to know other people that I only knew from the lists. I was pleased to meet people there who solve PDEs using Python; we had many fruitful discussions together, and as a result I have already created FEMhub spkg packages for FiPy, with others to follow. Our aim is to create a nice interface to all of them, so that we can easily test a problem in any PDE package and see how it performs.

Overall, my impression is very positive. I am very glad I chose Python as my main language many years ago; all the essential parts are now getting there, e.g. numerics (numpy, scipy, ...), symbolics (sympy, Sage, ...), 2D plotting (matplotlib, ...), 3D plotting (mayavi), GUIs (Traits UI, which supports both GTK and Qt on Linux and native widgets on Mac and Windows, and the Sage notebook for the web), an excellent documentation tool with equation support (Sphinx), lots of supporting libraries like sparse solvers, and a very easy way to wrap C/C++ code and speed up critical parts of the Python code using Cython. It's not that each of those libraries is the best in the world --- in fact, not a single one is --- but together as an ecosystem, plus the high level of (free) support on the lists for all of those libraries, this in my opinion makes Python the number one choice for scientific computing, together with C, C++ and sometimes Fortran for CPU intensive tasks and/or legacy libraries.

Sunday, May 10, 2009

My experience with running an opensource project

Nir Aides, the author of the excellent winpdb debugger, sent me the following email on September 21, 2008. I asked him if I could copy his email and reply in the form of a blog post (so that other people can comment and join the discussion) and he agreed. It took me almost a year to reply, but I made it. :)

Hi Ondrej,

How are you?

I am about to publish a new free software project, a new simple PHP framework, and I am interested in your advice.

You started SymPy and were able to make other people join you and develop it with you.
How did you do it?
How did it happen?
Did you actively call for other people or they spontaneously showed interest and joined you?
Are the other major contributor people who were your friends before you started the project?
Did you need to create or manage the project in a particular way to make it attractive to other people?
Are there things you are aware of that promote collaboration or demote it?

I was never successful in doing the same with Winpdb, which while it became reasonably popular, no one has ever joined me to develop it, except for a notable tutorial contribution by Chris Lasher which was developed independently.

Now with the new project, I am wondering what are my chances of making other people try it and take it on. On the one hand it is a new and fresh code base in an interesting field, on the other hand, why would anyone bother to spend their energy on this new project when they have Symfony or Drupal?

What do you think?

BTW, Ohloh believes you have a median of 19,000 lines of changed code per month since the start of their log. Can this be true? Is this humanly possible? According to it SymPy has over 1,000,000 lines of code? I can't understand these numbers. Winpdb has about 25,000 lines after 3 years of development. And from my experience 1,000,000 lines of code projects need about 20-50 full time developers to work on for 2-5 years which is about 40-250 man years. And as if this is not enough you are listed as owner in a dozen other projects in Google code and have enough time to become an awarded scientist. How is this possible?

http://www.ohloh.net/p/sympy/contributors/

BTW2, do you still use Winpdb? If you find yourself using it less, can you say what are the reasons, or what it would take to make it more useful?

BTW3, How is SymPy doing?

Cheers,
Nir



So my most honest answer to the question of how to run a successful opensource project is: I don't know.

But nevertheless I have tried to summarize some of my ideas, experience and guidelines that I try to follow; maybe it will be useful to you, Nir, or to anyone else.

First of all, there has to be a public mailing list (easily accessible), a public bug tracker, a nice webpage, easy-to-find downloads, frequent releases (once a month is good, but in the worst case at least 4 times a year) and a set of guidelines to follow in order to contribute. That's a must; if the project doesn't have the above, it's almost impossible for it to become successful. However, that is just a start, just a playground. There are still many projects that have the above and yet totally fail to attract developers.

So I think the most important principle is that I always think about how to involve other people in what I do. If I have some plan in my head for how to do something, e.g. how to move some things forward, I always create exact steps and put them into issues, or onto our mailing list, so that each step can be done by someone who is completely new to sympy. So I try to look at things from other people's perspective and think: ok, I quite like this SymPy project and I'd like to get this done (for example a new release, or something fixed, or implemented), but I have no idea how to start and what exactly needs to be done.

So what I try to do when someone comes to our list and asks for something is to create a new issue for it and think about how I would fix it if I had time. Then I write the necessary steps in the issue and invite the submitter to fix it, offering to help by explaining anything and guiding them. Now there are two things that can happen. Either the submitter has the time and the will to go forward, in which case he starts wrestling with it and, whenever he has some code or a question, I need to find time, review it and offer some way out. Or the submitter is too busy, in which case the instructions simply rest in the issue, and the next time someone asks for the feature, the instructions are already there. I don't have estimates of how frequent either case is.

When I am working on something myself, I try not to code privately, but also put up issues first and put the steps needed in the issues, so that it's easy for other people to join in.

In general, the most precious thing for me is the fact that someone else sat down at their computer and wrote a patch. So I do everything possible to get new (or more) people interested in the development. Some people think that only super programmers can do a decent job and that it's useless to invest time in people who may have just started with Python. They are wrong. Among the SymPy developers (around 65 people in total have contributed patches so far at the time of writing this post), we have all kinds of people. We have people from high school, we have a retired US army engineer, we have physicists, mathematicians, biologists, engineers, teachers, or just hobbyists who do it for fun. Unfortunately, we do not have many women (I think no patch that made it into sympy was contributed by a woman, but I may be wrong), so if anyone has any ideas how to get more women involved, let me know (I know we have several women fans, so that's a good start:). We have people for whom sympy was the first open source project they ever contributed to, and people who are new to Python.

Many times the first patch that a new potential developer submits is not perfect; usually it's faster for me to write it myself than to help with the first patch, but my rule is to always help the submitter get it done. Sometimes he sends a second patch, or a third, and usually it needs less and less work on my side, and it already pays off, because he is then able to fix things himself if he discovers a bug, and sympy has just won one more contributor.

So I came to the conclusion that all that is needed is enthusiasm. You don't even have to know Python (you can learn all these things on the way) and you can still do useful things for us and really save us time.

To answer another question from Nir's email, SymPy has about 130,000 lines of code plus roughly another 20,000 lines of tests, so I think those stats are wrong. The changed-lines-of-code number is in my opinion also wrong; we usually have about 250 new patches per release (this depends on how often we release and other things).

Yes, I am involved in a couple of other projects, e.g. Debian, Sage, ipython, scipy, hpfem.org (and a couple more), basically everything that has to do with numerical simulation and Python, but my activity there varies. The most time consuming thing in the last couple of years was definitely school. I was finishing my master's in Theoretical Physics in Prague and then moved to Nevada/Reno, where I just finished the first semester of my PhD in Chemical Physics, and sometimes it was just crazy: for example, I finished teaching at 7pm and, instead of going home to sleep, stayed in my office, fixed 10 sympy issues that were holding off a release, finished at 1am, went home (by bike, since I don't have a car yet), slept a couple of hours and then did just school again for a week; other people reviewed the issues in the meantime, and then I made the release (instead of sleeping again). In the last semester it was not unusual that I got home at 1am every weekday, then slept most of Saturday to catch up; on Sunday I did some laundry and shopping, and the rest of the time I did grading and homework for all my classes and teaching, with no time for anything else (no friends, no girls, no rest, no hobby, no opensource stuff, nothing). So sometimes one has to work pretty hard to get through it, but fortunately it's finally behind me; if all goes well, I should be just doing research from now on and have a real life too. Also, I am sorry I didn't manage to reply sooner. :)

To answer the other questions:
Are the other major contributor people who were your friends before you started the project?
No, not a single major contributor was my friend before I started the project. Every single one of them became a developer through the procedure I described above, i.e. they first showed up on the list or in the issues, and maybe even their very first patch was not a high quality one (and if I were stupid and arrogant, or didn't see the big potential, I would just have ignored them). But when given a chance, they became extremely good developers, and sympy would simply not be here without them.

Did you actively call for other people or they spontaneously showed interest and joined you?
I very much encourage everyone to contribute, but the initial interest must come from them, i.e. they at least have to show up on the mailing list or in the issues, so that I know about them. But once I know they are interested in some issue, yes, I try to invite them to fix it, with my help.

One observation I made is that I always have to think in the spirit of "how to earn new money", not "how to save the money I already have"; applied to sympy, that means thinking about how to get new developers, how to develop the new great things, etc. Even if I am super busy, as I was, I still have to think this way. Once I start thinking about how to conserve and preserve what we already have, I am done, finished, and that's the road to hell.

If I am open, positive and full of energy, I can see people joining me and we can do great things together. It probably sounds obvious, but it was not for me: for example, some people I worked with started their own projects when I got busy and started to compete instead of helping sympy out. I felt betrayed, after so much work that I had invested into it, and started to become protective. And then I realised that's wrong. I can never stop other people from doing what they want to do. If they want to have their own project, they will have it. If they don't want to help sympy out, they won't (and, what is more important, there is nothing wrong with either of those). It's that simple, and being protective only makes things worse.

There is also the question of the license that you use for the project; one should basically only choose between BSD (maybe also MIT or Apache), LGPL and GPL (there are also several versions of the GPL licenses). Unfortunately the fact is that there are people who will never contribute code under a permissive BSD license (because it does not protect their work enough), and there are also other people who really want the code to be BSD (or another permissive license) so they can sell it, don't need to consult lawyers about what they are or aren't allowed to do, and can combine it with any other code (opensource or not). It also depends on whether one wants to combine (and distribute) other codes together. So choosing a license is also important. I believe that for sympy BSD is the best and for other projects (like Sage) GPL is the best; one has to decide on a case by case basis. For Winpdb, I would make it BSD too, since you can get more people using it.

To conclude, SymPy is a little more than 2 years old, it has been a great ride so far, and more things are coming: this summer we have 5 Google Summer of Code students, people are starting to use it in their research, and we plan to use it in the codes of our group here in Reno too, so things look promising. I am really glad we managed to build such a community, so that when I am busy, as I was last semester, other people help out with patches, reviews and other things and the project doesn't stall, and now that I am rid of my school duties, we can move things forward a lot.

So maybe you can get inspired by some of the ideas above. I am also interested in any discussion about this (feel free to post a comment below, or send me an email, or just write to a sympy list about what you think).

Tuesday, March 17, 2009

Newtonian Mechanics with SymPy

Luke Peterson from UC Davis came to visit me in Reno and we spent the last weekend hacking on the Python Dynamics package that uses SymPy to calculate equations of motion for basically any rigid body system.

On Friday we did some preliminary work, mostly on the paper, and Luke showed me his rolling torus demo that he did with the proprietary autolev package. We set ourselves the goal of getting this implemented in SymPy by the time Luke leaves, and then we went to the Atlantis casino together with my boss Pavel and other guys from the Desert Research Institute, where I had my favourite meal here: a big burger, fries and a beer.

On Saturday we started to code and had a couple of lines of the autolev torus script working. Then we went on a bike ride from Reno to California. I took some pictures with Luke's iPhone:


Those mountains are in California and we went roughly to the snow line level and back:

This is the Nevada side:


That was fun. Then we worked hard, and by the evening we had a dot product and a cross product working, so we went to an Irish pub to have a couple of beers and I had my burger as usual.

On Sunday we spent the whole day and evening coding and we got the equations of motion working. On Monday we worked very hard again:



and fixed some remaining nasty bugs. I taught Luke to use git, so our code is at http://github.com/hazelnusse/pydy; for the time being we call it pydy, and after we polish everything, we'll probably put it into sympy/physics/pydy.py. If you run rollingtorus.py, you get this plot of the trajectory of the torus in a plane:

It's basically what happens if you throw a coin on the table: the model takes into account moments of inertia, yaw (heading), lean, spin and the x-y motion in the plane. Depending on the initial conditions, you can get many different trajectories, for example:

or:


This is very exciting, as the code is very short, and most of the things that Luke needs are needed for all the other applications of sympy, e.g. good printing of equations and vectors (both in the terminal and in LaTeX), C code generation, fast handling of expressions, a nice ipython terminal for experimentation, plotting, etc.

Together with the atomic physics package that we started to develop with Brian, sympy will soon be able to cover some basic areas of physics. Other areas are general relativity (there is some preliminary code in examples/advanced/relativity.py) and quantum field theory with Feynman diagrams; for that we need someone enthusiastic who needs this for his/her research --- if you are interested, drop me an email; you can come to Reno (or work remotely) and we can get it done.

My vision is that sympy should be able to handle all areas of physics. It needs good assumptions (if you want to help out, please help us test Fabian's patches here) and then a faster core; we have a pretty good optional Cython core here, so we'll be merging it after the new assumptions are in place. Then sympy should have basic modules for most areas of physics so that one can get started really quickly. From our experience so far in sympy/physics, those modules will not be big, as most of the functionality is not module specific.

Thursday, March 5, 2009

SIAM 2009 conference in Miami, part I

I am at the SIAM Conference on Computational Science and Engineering (CSE09) and it is awesome. Right now, I am writing from the 50th floor with Pearu Peterson (f2py), Brian Granger (ipython), Fernando Perez (ipython) and John Hunter (matplotlib); I took videos of them and, with their permission, posted them to YouTube. The view from the balcony is spectacular. My own room is on the 15th floor of the Hilton hotel and I thought, man, this is high, but then I visited Fernando and John in their 50th floor apartment, and our 20-story hotel looks like a small hut.

As usual, I met lots of old friends and made some new ones. I liked the electronic structure session on Wednesday and the Python session today. I was also working very hard to get mayavi2 working in Sage in time for my presentation, and it seems I have finally made it.