Python vs Stata

Now and then I am asked what I think about Python for science, and I must admit I have sometimes given a biased answer: that Python is for children. I could never get over Python’s indentation requirement. I am more reluctant to pass judgement these days. Recently I wrote a Perl script for a young research assistant and asked him to complete it and harvest some data for me. He went away, talked to his peers and came back with a broken Python script instead. I did the adult thing and sat down with him to learn enough Python to fix it and get him started.

The young research assistant made progress in Python and added it to his personal toolkit. He wrote a script for me which almost worked, but some plausibility checks on the data made me suspicious of its quality and I had to troubleshoot. By that time I had almost climbed the Python learning curve, and I decided it would be faster to learn Python and rewrite the script than to learn Python and read the assistant’s script. So I did. It went well and I came to appreciate some of Python’s features: it paints great pictures, it draws on PHP, MySQL, object orientation, R etc. and hides A LOT of the more complex things one usually has to master before programming for a living. For example, you don’t need to learn regular expressions to remove HTML tags, you just sip a BeautifulSoup.

So now that I had learned the kids’ language I thought I’d use it to paint some network graphs for a paper I am writing, and then I thought of doing some arithmetic as well. I got in trouble right away. I used numpy and drew 1000 numbers from N(0,1). When I computed the mean of the generated data points it wasn’t zero! So I did this 1000 times and took the mean of all the experiments. The mean I want to be zero often comes out at -3 or -9 on average after 1000 experiments! Here is the code, try it. Maybe your machine does better:

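import numpy as np

mu, sigma = 0, 1
ss = []  # one sample mean per experiment
for i in range(1000):  # 1000 experiments
    s = np.random.normal(mu, sigma, 1000)  # 1000 draws from N(0, 1)
    ss.append(np.mean(s))  # record this experiment's mean

ssmu = np.mean(ss)  # mean of the 1000 sample means
sssigma = np.std(ss, ddof=1)  # their sample standard deviation

print(ssmu, sssigma)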

I ran this on a MacBook Air with macOS Sierra 10.12.2 and the Python which comes with Anaconda. In contrast, I couldn’t get Stata 14 on the same machine to make errors as grave as Python’s. Try it yourself:

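set more off
set obs 1000
gen mu = .

* 1000 experiments, one sample mean stored per row of mu
forval i = 1(1)1000 {
    keep mu
    gen t = rnormal(0, 1)
    sum t
    replace mu = r(mean) in `i'
}

* mean and standard deviation of the 1000 sample means
sum mu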

If precision matters, Python ain’t the right thing to be using, at least on my kind of hardware. If you do use Python, rerun your paper several times and take the mean of the results before you claim you’ve found something that matters.
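If you want a yardstick for your own runs, here is a minimal sketch of the same experiment with a fixed seed, so that two runs on two machines can be compared number for number. It also prints what standard sampling theory would lead one to expect: a single mean of 1000 draws from N(0,1) scatters around zero with a standard deviation of about 1/sqrt(1000), and the mean of 1000 such means with about 1/sqrt(1000*1000). The function name `mean_of_means` and the seed value are my own hypothetical choices, and `np.random.default_rng` assumes NumPy 1.17 or newer, which may not be what your Anaconda ships.

```python
import numpy as np


def mean_of_means(n_experiments=1000, n_draws=1000, seed=12345):
    """Run repeated N(0, 1) experiments; return (mean, sd) of the sample means."""
    # The seed is an arbitrary, hypothetical value; any fixed integer works.
    rng = np.random.default_rng(seed)
    # One row per experiment, one column per draw.
    draws = rng.normal(0.0, 1.0, size=(n_experiments, n_draws))
    means = draws.mean(axis=1)
    return means.mean(), means.std(ddof=1)


n_experiments, n_draws = 1000, 1000
grand_mean, sd_of_means = mean_of_means(n_experiments, n_draws)

# Sampling theory: a single sample mean has sd 1/sqrt(n_draws); the mean
# of all the sample means has sd 1/sqrt(n_experiments * n_draws).
expected_sd_of_means = 1 / np.sqrt(n_draws)
expected_sd_of_grand_mean = 1 / np.sqrt(n_experiments * n_draws)

print("grand mean:", grand_mean)
print("sd of sample means:", sd_of_means, "expected about", expected_sd_of_means)
print("expected sd of grand mean:", expected_sd_of_grand_mean)
```

A grand mean within a few multiples of the last printed number of zero is what chance alone would produce. Anything far outside that range points at the script rather than at the random number generator.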

If you run these on your machine send me the results of your experiment.
