how to use mp services

Hi,
I am trying to get the multiprocessor reconstruction to work on my cluster (just the reconstruction part, not the refinement part). In my testing, I haven't seen an appreciable improvement in the reconstruction time. I am using frealign_v8.08 with the precompiled binary frealign_v8_mp.exe. Below is my frealign job.
#!/bin/bash
#MOAB -l nodes=1:ppn=8
#MOAB -l walltime=4:00:00

export NCPUS=8

cd /panfs/storage.local/imb/stagg/sstagg/10mar29b/frealign_test

### START FREALIGN ###
frealign.exe << EOF > frealign.combine_1_8.out
I,1,F,F,F,F,0,T,F,F,0,T,4
88,0,1.840,0.07,0.00,100,50,5,10,0
1 1 1 1 1
1, 3000
D7
1.00,1.84,10.00,75.00,2.70,300.00,0.00,0.00
10.00,100.00,10.00,100.00,0.00
start.hed
match
params.iter000.par
params.001.par
shift.par
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
working_1_8.hed
weights
odd
even
phasediffs
pointspread
EOF

The output from this job indicated that frealign recognized that 8 cpus were available. The average particles processed per second for this run was 2.65 ptcls/sec. For comparison, with only a single processor, the timing was 2.33 ptcls/sec. I am wondering why I'm not seeing more substantial speedup? Unlike a previous post on the BB, both jobs produce good reconstructions. Any help will be appreciated.

Thanks,
Scott

Hi Scott,

I don't have direct experience with frealign_v8 and OpenMP threading, but I think this was implemented to speed up large reconstructions (e.g. virus), and if your box size is not very big, it could be that you won't see much of a speed up, because the overhead of running (say) 8 threads negates any speedup due to parallelisation. What is your box size? Can you try with a larger box size to see if you see a speed up?

Also, I may be missing something but if you're testing just the reconstruction part, can't you use IFLAG=0, rather than 1?

HTH,
Alexis

In reply to by Alexis

Well, that was the problem. Since RI is right under IFLAG, I was seeing it as 0 instead of 1. With IFLAG properly set, I now get 18.8 ptcls/sec for the 1 proc version and 41.7 ptcls/sec for the 8 proc version. To answer your question, my box size is 144, but it is binned by two. I'm trying to work everything out on the binned version before going to the 288 pixel stack.

Am I correct in assuming that the parallelization only works on the processors within 1 node? Or will it also work well across multiple nodes?

Thanks,
Scott

In reply to by sstagg

Hi Scott,

Glad to hear that's sorted. When you go to smaller pixel sizes, I expect the speedup ratio to improve a little bit. In the meantime, you could go down to 2 or 4 threads (NCPUS), to save resources.
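To put the rates quoted above in perspective, here is a small sketch that turns two particles-per-second figures into a speedup ratio and a per-thread efficiency. The function names speedup and efficiency are made up for this illustration and are not part of frealign; the numbers are the ones Scott reported.

```shell
#!/bin/bash
# Hypothetical helpers (not part of frealign) for comparing throughput.

# speedup SERIAL_RATE PARALLEL_RATE
# Prints how many times faster the parallel run was.
speedup() {
    awk -v s="$1" -v p="$2" 'BEGIN { printf "%.2f\n", p / s }'
}

# efficiency SERIAL_RATE PARALLEL_RATE NTHREADS
# Prints the speedup as a percentage of the ideal NTHREADS-fold speedup.
efficiency() {
    awk -v s="$1" -v p="$2" -v n="$3" \
        'BEGIN { printf "%.0f%%\n", 100 * p / (s * n) }'
}

# Rates reported in this thread: 18.8 ptcls/sec (1 thread),
# 41.7 ptcls/sec (8 threads).
speedup 18.8 41.7        # prints 2.22
efficiency 18.8 41.7 8   # prints 28%
```

With a speedup of about 2.2 on 8 threads, most of the reserved cores are not contributing much at this box size, which is why dropping to 2 or 4 threads may be a reasonable trade-off.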

The parallelisation you are playing with at the moment (OpenMP) by setting NCPUS etc. only works with shared memory. In other words, it does not allow parallelisation over several nodes each with their own memory.

Frealign currently does not allow the reconstruction step to be parallelised over several nodes. It may in a future version though.

So I think the work flow for parallel refinement in frealign 8 is to do a parameter refinement over many nodes, followed by reconstruction on a single node with multiple threads.

Cheers,
Alexis

In reply to by Alexis

Hi Alexis,

I would like to ask you for some help with running the reconstruction with multiple threads. If I understood correctly, you can do a parallel reconstruction if it is run on one node with e.g. 4 processors. We have adapted the FREALIGN example script for multiprocessor refinement to our group's setup, but we are still running the reconstruction on only one processor. Unfortunately, this takes more time than the whole refinement.

What we've tried so far is to add "export NCPUS=4" to the beginning of the mreconstruct-script, which is itself opened by the mrefine_sge-script using "id=`qsub -l nodes=1:ppn=4 [...]".

We would really appreciate it if you could help us with this.

Thanks a lot,

Michael

In reply to by msaur

> If I understood correctly, you can do a parallel reconstruction, if it is run on one node with e.g. 4 processors.

That's correct.

Here are a few hints which may help you:

  • To know whether frealign is working in parallel (i.e. with multiple threads), look for the following line near the beginning of frealign's output:
    Parallel processing: NCPUS =          4
  • You need to be using a frealign executable which was compiled with the OpenMP options, otherwise no threading is possible. If you are using an executable from this website, check you're using one of the ones with "_mp" in the name.
  • The way frealign knows to run multiple threads is by checking the environment when it starts up. If it finds that either of the environment variables OMP_NUM_THREADS or NCPUS is set to a value greater than 1, then that number of threads will be used, and the message I mentioned above is printed out.
  • So you need to make sure that the node to which you submit the reconstruction job has one of those environment variables set appropriately.
  • If you're using qsub to get that job started on a node, then adding the following option to your qsub command may do that for you "-v OMP_NUM_THREADS=4" (see qsub documentation for more info). Alternatively, you can add the line "setenv OMP_NUM_THREADS 4" near the beginning of the script.
  • Another thing to consider is the queuing system in place on your cluster. If you are using an SGE system (which I'm assuming you are since you mention the mrefine_sge script), you may need to add another option to your qsub command to tell the queuing system to reserve the appropriate number of cores for the job. On our SGE setup for example, we have a parallel environment called "multi", which is set up so that jobs are allocated the requested number of cores by the queuing system. Therefore, when we submit a job to be run on 4 cores on a node, we add the following option to our qsub command: "-pe multi 4" (again check man qsub for more info)
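
For bash users, the hints above could be sketched as follows. The export lines are the bash equivalent of the csh setenv line, and the function threads_in_log is a hypothetical helper (not something shipped with frealign) that looks for the "Parallel processing" line in an output file. It assumes the line has the same layout as the one quoted above.

```shell
#!/bin/bash
# Bash equivalent of "setenv OMP_NUM_THREADS 4"; put this near the top
# of the reconstruction script, before frealign is started.
export OMP_NUM_THREADS=4
export NCPUS=4

# threads_in_log LOGFILE
# Hypothetical helper: prints the thread count frealign reported in
# LOGFILE, or returns non-zero if no "Parallel processing" line exists
# (for example when a non-_mp executable was used).
threads_in_log() {
    awk '/Parallel processing: NCPUS =/ { print $NF; found = 1; exit }
         END { exit !found }' "$1"
}
```

After a run, something like threads_in_log frealign.combine_1_8.out || echo "no threading reported" would then tell you whether the threaded code path was actually taken.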

I hope this helps. If you're still not having any luck, maybe give us more details - the scripts you used might be helpful for example, as well as more details about what type of environment you're using.

Good luck!
Alexis

In reply to by Alexis

Thank you very much for your answer! I forwarded the information to our bioinformatics guy, and let's see what he can do. The information I gave you was a little bit misleading, since we use Torque and the mrefine_sge script was modified for Torque.

Thanks again,

Michael

In reply to by Alexis

Hi Alexis,
I'm a colleague of Michael's and am trying to get frealign to work in our environment.
I changed the reconstruction script to use the _mp version, so now I can read in the log file that it will be executed on 4 cores. Our problem is that the script starts only one process, but this process uses 400% of the CPU. So it looks to me as if all processes are being executed by the same core?! Also, the reconstruction time is nearly the same, so there is no increase in speed.

What I have done is set the environment variables OMP_NUM_THREADS=4 and NCPUS=4. One difference from your example script is that I use the bash shell.

Another difference is that we don't use SGE; we use TORQUE as our PBS system. For the reconstruction, I reserve one node with 4 processors, so normally it shouldn't be a problem to spread the job over all 4 cores.

In reply to by wuermchen

Hi there,

If the log file says that 4 threads were used and you're seeing 400% usage, then everything is fine and you are running the reconstruction in parallel over 4 cores. What makes you think only one core is being used?

The only troubling thing is that you say that the reconstruction time is "nearly" the same. Please have a look further up this thread, where Scott was seeing the same thing, but he was using IFLAG=1, not IFLAG=0... Maybe the same thing is happening to you?

In reply to by Alexis

Hi Alexis,
thanks for the fast reply.
I checked the IFLAG, it is set to 0 for the reconstruction.
The thing is, only one process gets started on the node. When I use top to show my processes on the node, only one frealign_v8_mp.exe process is running, at 400% CPU. That, and the fact that the reconstruction time is nearly the same, makes me think the processes get started on only one core.
I expected 4 different processes on this node, each using 100% CPU.

I hope you have some other ideas. Some other tools need a file saying how many cores per node are available. Does this parallelization need something like this? If you need further information, let me know.

Kind regards
Mario

In reply to by wuermchen

Hi Mario,

If frealign says near the beginning that it will be using 4 threads, then it definitely is.

It is correct that only 1 process is started, but it has 4 threads, as evidenced by top showing you 400%, not 100%. Admittedly, top behaves differently depending on the flavour/version of Unix you're using, but I still can't think of why it would show 400% if that process didn't have 4 threads. Some versions of top have an option -t or -H, which shows threads on separate lines... Maybe you can try this (check "man top" to see whether the option is available and what letter it is) to reassure yourself that there are indeed 4 threads running.

Just to rephrase the point: you should not expect 4 different processes, you should expect one process with 4 threads.
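
If your version of top does not offer a per-thread view, a rough alternative on Linux is to ask the kernel directly how many threads a process has. This is only a sketch: it assumes a Linux node with /proc mounted, and thread_count is a name made up for this example.

```shell
#!/bin/bash
# thread_count PID
# Hypothetical helper: prints the number of threads of process PID,
# read from the "Threads:" field of /proc/PID/status (Linux only).
thread_count() {
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

# Example with the current shell, which is always running:
thread_count $$
```

On the compute node you would pass the PID that top shows for the frealign process instead of $$. A threaded reconstruction should then report a count greater than 1, even though only one process is listed.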

There could be a number of reasons that the reconstruction isn't much faster with 4 threads than it was with 1 thread only. Chief suspects would be the box size (unless the box size is large, I wouldn't expect big speedups) and the number of images going into the reconstruction.

Hope this helps. If you still have doubts regarding the speedup, maybe attach the inputs and outputs to frealign.

Best regards,
Alexis