h5py to replace npz?!

Apparently .npz files have a size limit of ~4 GB (they're zip archives underneath, and the original zip format caps member sizes at 4 GB), which can be a problem when dealing with large datasets ($$600^3$$ grids). So how do I write binary files of this size?! Looks like I need to migrate to HDF5…lol. Here's a read/write example similar to npz:

import numpy as np
import h5py

a = np.random.rand(1000, 1000)
f = h5py.File('/tmp/myfile.hdf5', 'w')  # open for writing
f['a'] = a  # <-- save the array as dataset 'a'
f.close()

Now that the file is written, let's open it back up:

import h5py

f = h5py.File('/tmp/myfile.hdf5', 'r')  # open read-only
f.keys()   # shows all datasets in the file
b = f['a']
# pull the data you need (see below) *before* calling f.close()
f.close()

But the odd thing is, you can't directly access the data by just 'b'. Instead, 'b' is a Dataset object with a ton of information attached, like b.shape, b.dtype, etc. etc., all found in the h5py docs. To get the actual array you simply have to use

b.value

and that will output the data of the file. (Equivalently, b[()] does the same thing, and newer h5py releases have removed .value entirely, so b[()] is the safer habit. Either way, read the data before closing the file, since the dataset reads lazily from disk.)
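Here's a minimal round-trip sketch of the above; the dataset name 'a' and the temp-file path are just placeholders, and the `with` blocks are my own tweak to sidestep the read-before-close gotcha:

```python
import os
import tempfile

import numpy as np
import h5py

path = os.path.join(tempfile.gettempdir(), 'demo.hdf5')
a = np.arange(12.0).reshape(3, 4)

# write: each key becomes an HDF5 dataset, much like a field in an npz
with h5py.File(path, 'w') as f:
    f['a'] = a

# read: f['a'] is a lazy Dataset object; [()] pulls the full array
with h5py.File(path, 'r') as f:
    dset = f['a']
    shape = dset.shape   # metadata is available without reading the data
    b = dset[()]         # the actual numpy array
```

Using `with` closes the file automatically once the data has been copied out.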


customizing matplotlib’s legend()

I am constantly needing to customize the legend() in python. Whether it's changing the marker size, the line width, or the font size, I'm constantly looking things up. There's an easy way to set this at the beginning of your script; as an example:

from matplotlib import rcParams

params = {'legend.fontsize': 16,
          'legend.linewidth': 1.5,
          'legend.markerscale': 3}
rcParams.update(params)
You could easily add other options such as legend.handlelength (the length of the legend lines), fonts, etc. etc. All options can be found via matplotlib's legend() API site.
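One caveat worth noting: newer matplotlib releases dropped the 'legend.linewidth' rcParam (updating with an unknown key raises a KeyError there), so a version-safe sketch sticks to the keys that still exist:

```python
import matplotlib
from matplotlib import rcParams

params = {'legend.fontsize': 16,    # legend text size in points
          'legend.markerscale': 3}  # legend markers drawn 3x their plot size
rcParams.update(params)
```

Every legend() call after this picks up the new defaults without per-call keyword arguments.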

mass-weighted Sobolev/HSML column values

Ken suggested I revisit the particles per grid-cell comparison - i.e. Grid vs. Sob, Sob vs. HSML, and Grid vs. HSML. In order to do this so that each grid corresponds to one value, I have to take the mass-weighted average of the Sobolev/HSML values within a given grid-cell:
\begin{equation*} \langle\Sigma_{\rm Sob}\rangle_{\rm MW} = \frac{\sum \left(\Sigma_{\rm Sob\ or\ HSML} \times {\rm mass}\right)}{\sum {\rm mass}} \end{equation*}
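The averaging itself is a one-liner with numpy's bincount. A sketch under made-up inputs (the cell indices, column values, and masses below are placeholders, not the actual gal_reduction arrays):

```python
import numpy as np

# each particle carries a grid-cell index, a Sobolev (or HSML) column value, and a mass
cell  = np.array([0, 0, 1, 1, 1])            # which grid cell each particle falls in
sigma = np.array([2.0, 4.0, 1.0, 3.0, 5.0])  # per-particle Sobolev/HSML column
mass  = np.array([1.0, 1.0, 2.0, 1.0, 1.0])  # per-particle mass

# <Sigma>_MW = sum(sigma * mass) / sum(mass), accumulated per cell
num = np.bincount(cell, weights=sigma * mass)
den = np.bincount(cell, weights=mass)
sigma_mw = num / den
# cell 0: (2*1 + 4*1)/(1+1) = 3.0 ; cell 1: (1*2 + 3*1 + 5*1)/(2+1+1) = 2.5
```

bincount does the per-cell accumulation in one pass, so there's no Python loop over grid cells.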

I then have a single \(\Sigma\) value per grid-cell and can make a direct comparison:

[Figure: per-pixel comparison of Sobolev/HSML vs. grid column densities]
(plot made with gal_reduction/tools/sigmacompare.py via PIXELPLOT==1)

In an ideal situation, each line would lie on the one-to-one dotted black line. Unfortunately both Sobolev and HSML values underestimate the grid values. The good news is that there isn't much difference due to the resolution. We might have to examine more galaxies within our sims in a similar fashion to see if this underprediction takes place at the same surface densities; if that is the case we can easily incorporate some sort of correction factor. But that leads to the question - how many galaxies do we have to look at?

In making this plot I learned how to stick the legend outside of the plot. It’s as simple as passing bbox_to_anchor to the legend() call:
legend(bbox_to_anchor=(0.1, 1.05), loc=3, shadow=True, fancybox=True)
This is a tiny bit tricky in that the anchor is attached to the loc. So if loc=3 then the bottom-left corner of the legend box is pinned to the anchor point. Also, the x,y coordinates are in axes coordinates by default (0-1 spanning the axes), so (0.1, 1.05) sits just above the top of the plot. This is thanks to matplotlib and Stack Overflow.

Next up was Ken's concern about the histograms I was sending him. Further inspection showed that simply passing 'bins=50' is not necessarily the best way to represent the data. By default hist() takes the min & max of the data and splits that range into N bins. The problem here is that if you're comparing two datasets that span different ranges, the bin sizes will differ. To circumvent this issue I simply specified the bin size manually by altering my histoit() function a tiny bit:
# assumes pylab-style imports (where, log10, arange, hist)
def histoit(values, COLOR, LABEL, HLS, BINSIZE=0.1):
    indexes = where(values > 0)[0]  # prevents -inf from log10(0)
    vals = log10(values[indexes])
    # fixed-width bin edges; the +BINSIZE keeps the top edge from being dropped
    binner = arange(min(vals), max(vals) + BINSIZE, BINSIZE)
    hist(vals, color=COLOR, log=True,
         bins=binner,
         label=LABEL, histtype='step', lw=1.5, ls=HLS)
With this quick fix the data looks much more comparable. Here's the before and after:

[Figures: before-and-after histogram comparison]

Better, right?
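The same fixed-bin-width idea applies any time two datasets span different ranges: build one shared set of edges covering both, and the counts become directly comparable bar-for-bar. A numpy-only sketch (the arrays are placeholders):

```python
import numpy as np

a = np.array([0.5, 1.1, 1.2, 2.9])
b = np.array([1.0, 1.05, 3.4, 7.9])

BINSIZE = 1.0
lo = min(a.min(), b.min())
hi = max(a.max(), b.max())
edges = np.arange(lo, hi + BINSIZE, BINSIZE)  # one edge set covering both datasets

counts_a, _ = np.histogram(a, bins=edges)
counts_b, _ = np.histogram(b, bins=edges)
# counts_a and counts_b now share identical bins, so the comparison is fair
```

Passing the same `edges` array as `bins=` to hist() gives the identical behavior when plotting.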


And last but not least, I've plotted the gas particle positions for this galaxy at all three resolutions for visual inspection:

[Figure: gas particle positions at all three resolutions]

We go from a slightly compressed cloud, to a fluffy pancake, to a regular pancake!


Progress bar….

I got tired of waiting for code to finish and having no idea how far along it was. Stack Exchange provided an elegant solution, which I modified:

import time, sys

def update_progress(step, datalen):
    progress = float(step)/float(datalen)
    barLength = 10  # modify this to change the length of the progress bar
    status = ""
    if progress < 0:
        progress = 0
        status = "Halt...\r\n"
    if step >= datalen - 1:
        progress = 1
        status = "Done...\r\n"
    block = int(round(barLength * progress))
    # the leading \r returns the cursor to the start of the line each call
    text = "\rPercent: [%s] %2.2f%% %s" % ("#" * block +
                                           "-" * (barLength - block),
                                           progress * 100, status)
    sys.stdout.write(text)
    sys.stdout.flush()

datalen = 150
for i in range(datalen):
    time.sleep(0.1)  # just to slow it down!
    update_progress(i, datalen)

UPDATE (10/31/13)

here’s a cython version (it assumes import sys and import numpy as np at module level):

cdef void update_progress(int step, int datalen):
    cdef float progress = float(step)/float(datalen)
    cdef int barLength = 10  # modify this to change the length of the progress bar
    status = ""
    if progress < 0:
        progress = 0
        status = "Halt...\r\n"
    if step >= datalen - 1:
        progress = 1
        status = "Done...\r\n"
    cdef int block = int(np.round(barLength * progress))
    text = "\rPercent: [%s] %2.2f%% %s" % ("#" * block +
                                           "-" * (barLength - block),
                                           progress * 100, status)
    sys.stdout.write(text)
    sys.stdout.flush()