Scaling


Hi Everybody,

I have been noticing some "lagging" in individual Frealign refinement jobs. When a new round of refinement is submitted to the cluster, the submitted refinement tasks differ greatly in execution time. So even though all jobs start running at roughly the same time, the first one can be done before the last one is halfway through.
I tried to investigate this and noticed that the shift files are all created at roughly the same time, but the time it takes to get the output for the first particle differs greatly. After that, the time per particle seems to be roughly the same for all jobs. My question is: what happens in this "lag" phase? Is there a process in the refinement that might explain this behaviour?

I attached a plot of the runtimes to demonstrate this. It shows a first round of MODE 4 (hence the very long execution time), followed by reconstruction and then multiple cycles of refinement and reconstruction. The Y axis is the run time in seconds; the X axis is the "job index", which is basically ordered by starting time.

I haven't examined this extensively, but the effect seems to be stronger with higher parallelization (400 - 600 cores). According to the people running the cluster, it shouldn't be a problem of disk I/O (my first thought). The storage uses GPFS with ~25 GBytes/sec.

Thanks very much for any insights!

Lukas

This sounds to me as if this is a disk limitation. When running 400 jobs, every one of them has to read in a 3D reference map. If these are reasonably big, e.g. 256 x 256 x 256 pixels, this amounts to about 25 GB in total. The maps are read in one line at a time, so there will be a lot of seeking that may significantly slow data transfer. The 25 GBytes/sec performance listed for GPFS disks likely only applies when reading single big files.
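
For scale, here is that arithmetic as a quick Python snippet (the 32-bit float assumption is mine, but it is typical for density maps):

```python
# Back-of-the-envelope for the aggregate read volume. Assumptions are mine:
# 32-bit float voxels, and every job reads the full reference map once.
jobs = 400
dim = 256
bytes_per_voxel = 4                      # single-precision floats

map_bytes = dim ** 3 * bytes_per_voxel   # one 256^3 map
total_bytes = jobs * map_bytes           # aggregate over all jobs
print(f"per map: {map_bytes / 1e6:.0f} MB, all jobs: {total_bytes / 1e9:.1f} GB")
# -> per map: 67 MB, all jobs: 26.8 GB
```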

In reply to by niko

Dear Niko,

thanks a lot for the quick reply!

It sounds like a reasonable explanation, but then I don't entirely understand why this does not seem to affect the reconstruction step as much.

What kind of parallelization do you use, if I may ask? Is there a reasonable number of cores to use?

Cheers
Lukas

In reply to by lukater

Parallelization in Frealign is very low tech: all that is done is to run multiple independent jobs of Frealign that deal with a specified chunk of the data. The reconstruction step does not involve reading reference maps. Each reconstruction job only reads the particle images needed for the chunk of the reconstruction that the job deals with. There is probably a sweet spot for the number of jobs versus resources used that depends on the size of the dataset and the size of the particle images. You probably already know what a reasonable number of CPUs is in your case since you have tried different numbers. If a job takes only a minute to complete once it starts running properly but it takes also one minute to get started, I would say you should reduce the number of CPUs by 50% or so.
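
To make "low tech" concrete, here is a minimal sketch of that kind of chunking; the function and its interface are made up for illustration, not Frealign's actual scripts:

```python
# Illustrative only: split N particles into contiguous per-job ranges,
# the way a driver script might assign work to independent Frealign jobs.
def particle_chunks(n_particles: int, n_jobs: int) -> list[tuple[int, int]]:
    """Return inclusive (first, last) particle ranges, one per job."""
    base, extra = divmod(n_particles, n_jobs)
    ranges, start = [], 1
    for j in range(n_jobs):
        size = base + (1 if j < extra else 0)  # spread the remainder
        ranges.append((start, start + size - 1))
        start += size
    return ranges

# e.g. 100000 particles over 400 jobs -> 250 particles per job
print(particle_chunks(100000, 400)[:3])  # [(1, 250), (251, 500), (501, 750)]
```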

In reply to by niko

Hi Niko,

After talking to the IT guys, who assured me that disk speed should in no way be a problem (it is now 200 Gb/s; the system has been updated), I ran a quick test. I made a script that starts about 1k tasks on individual cores, each of which takes a 300 MB map, copies it to /dev/shm/$randomname on the node and copies it back to $workingdirectory/$randomname, producing ~1 TB of data. This took less than 10 seconds to execute, which was indeed much faster than I had assumed. Would you agree that disk speed is therefore likely not the issue? Or am I missing something?

My initial idea was to rewrite some of the Frealign scripts to copy the references once per round to each node's /dev/shm (we have 28 cores per node, so that should help with any possible bottleneck). Unfortunately, given this test, it seems like that would not help me reduce the lags I am experiencing (which are sometimes up to 10-15 minutes).
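
For reference, the per-node staging I had in mind would look roughly like this (the file names and the lock/sentinel scheme are just a sketch, not existing Frealign code):

```python
# Sketch: the first job on a node copies the reference into /dev/shm,
# the other 27 jobs wait for the copy to finish. No error handling.
import os
import shutil
import time

def stage_reference(src: str, node_copy: str = "/dev/shm/reference.mrc") -> str:
    sentinel = node_copy + ".done"
    lock = node_copy + ".lock"
    try:
        # O_EXCL lets exactly one job per node win the right to copy
        fd = os.open(lock, os.O_CREAT | os.O_EXCL)
        os.close(fd)
        shutil.copyfile(src, node_copy)
        open(sentinel, "w").close()          # signal the copy is complete
    except FileExistsError:
        while not os.path.exists(sentinel):  # another job is copying
            time.sleep(1)
    return node_copy                         # path Frealign would read from
```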

Thanks a lot!

Lukas

In reply to by lukater

Your test shows how fast the disks and the network are, but it probably does not test what happens when maps are read one line at a time. It may indeed help if you copy the reference map to a local scratch disk and then change the script so that Frealign reads the map from the local disk.
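
A quick way to check the effect of the access pattern would be something like this sketch (the file name is made up; on a local disk the page cache will hide most of the difference, so to reproduce the load you would want many copies of this running against GPFS at once):

```python
# Rough micro-benchmark, not a Frealign tool: time many small reads
# vs. one bulk read of the same file, mimicking line-by-line map I/O.
import time

def read_time(path: str, chunk: int) -> float:
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk):
            pass
    return time.perf_counter() - start

path = "reference.mrc"          # hypothetical test file, e.g. a 67 MB map
line = 256 * 4                  # one map line: 256 voxels * 4 bytes = 1 KB
print(f"1 KB reads: {read_time(path, line):.2f} s")
print(f"bulk reads: {read_time(path, 64 * 1024 * 1024):.2f} s")
```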