how to use mp services
Hi,
I am trying to get the multiprocessor reconstruction to work on my cluster (just the reconstruction part, not the refinement part). In my testing, I haven't seen an appreciable improvement in the reconstruction time. I am using frealign_v8.08 with the precompiled binary frealign_v8_mp.exe. Below is my frealign job.
#!/bin/bash
#MOAB -l nodes=1:ppn=8
#MOAB -l walltime=4:00:00
export NCPUS=8
cd /panfs/storage.local/imb/stagg/sstagg/10mar29b/frealign_test
### START FREALIGN ###
frealign.exe << EOF > frealign.combine_1_8.out
I,1,F,F,F,F,0,T,F,F,0,T,4
88,0,1.840,0.07,0.00,100,50,5,10,0
1 1 1 1 1
1, 3000
D7
1.00,1.84,10.00,75.00,2.70,300.00,0.00,0.00
10.00,100.00,10.00,100.00,0.00
start.hed
match
params.iter000.par
params.001.par
shift.par
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
working_1_8.hed
weights
odd
even
phasediffs
pointspread
EOF
The output from this job indicated that frealign recognized that 8 cpus were available. The average particles processed per second for this run was 2.65 ptcls/sec. For comparison, with only a single processor, the timing was 2.33 ptcls/sec. I am wondering why I'm not seeing more substantial speedup? Unlike a previous post on the BB, both jobs produce good reconstructions. Any help will be appreciated.
Thanks,
Scott
OpenMP frealign
Hi Scott,
I don't have direct experience with frealign_v8 and OpenMP threading, but I think this was implemented to speed up large reconstructions (e.g. virus), and if your box size is not very big, it could be that you won't see much of a speed up, because the overhead of running (say) 8 threads negates any speedup due to parallelisation. What is your box size? Can you try with a larger box size to see if you see a speed up?
Also, I may be missing something but if you're testing just the reconstruction part, can't you use IFLAG=0, rather than 1?
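A quick way to double-check which IFLAG a job script really passes, rather than reading it off by eye, is to print the second comma-separated field of the first line after the heredoc opener. This is only a sketch: it assumes the layout of the script posted above (card 1 directly after "<< EOF", with IFLAG as its second field), and the helper name get_iflag is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the IFLAG value found in a FREALIGN job script.
# Assumption: the first line after the "<< EOF" heredoc opener is card 1,
# and IFLAG is the second comma-separated field on that line.
get_iflag() {
    awk -F, '
        found       { print $2; exit }   # first line after the heredoc opener
        /<<[ ]*EOF/ { found = 1 }        # remember that the opener was seen
    ' "$1"
}
```

Calling get_iflag on the job script (for example, get_iflag my_job.sh, where my_job.sh is a placeholder name) prints the value without any risk of confusing it with RI on the line below.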
HTH,
Alexis
Well, that was the problem.
In reply to OpenMP frealign by Alexis
Well, that was the problem. Since RI (which is 0) sits right under IFLAG in the input, I was reading IFLAG as 0 when it was actually 1. With IFLAG set to 0, I now get 18.8 ptcls/sec for the 1 proc version and 41.7 ptcls/sec for the 8 proc version. To answer your question, my box size is 144, but it is binned by two. I'm trying to work everything out on the binned version before going to the 288 pixel stack.
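To put the timings in this thread on a common footing, here is a small sketch that turns two throughput figures into a speedup ratio and a per-thread efficiency. The function name and argument order are my own choices, not anything frealign provides.

```shell
#!/bin/bash
# Sketch: speedup and parallel efficiency from two throughputs (ptcls/sec).
# Arguments: serial throughput, parallel throughput, number of threads.
speedup() {
    awk -v serial="$1" -v parallel="$2" -v threads="$3" 'BEGIN {
        s = parallel / serial
        printf "speedup %.2f, efficiency %.0f%%\n", s, 100 * s / threads
    }'
}

speedup 2.33 2.65 8    # IFLAG=1 runs: speedup 1.14, efficiency 14%
speedup 18.8 41.7 8    # IFLAG=0 runs: speedup 2.22, efficiency 28%
```

So even with IFLAG corrected, 8 threads give a bit more than a 2x speedup at this box size.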
Am I correct in assuming that the parallelization only works on the processors in 1 node? In other words, will it work well across multiple nodes?
Thanks,
Scott
Hi Scott, Glad to hear that's
In reply to Well, that was the problem. by sstagg
Hi Scott,
Glad to hear that's sorted. When you go to smaller pixel sizes, I expect the speedup ratio to improve a little bit. In the meantime, you could go down to 2 or 4 threads (NCPUS), to save resources.
The parallelisation you are playing with at the moment (OpenMP) by setting NCPUS etc. only works with shared memory. In other words, it does not allow parallelisation over several nodes each with their own memory.
Frealign currently does not allow the reconstruction step to be parallelised over several nodes. It may in a future version though.
So I think the workflow for parallel refinement in frealign 8 is to do a parameter refinement over many nodes, followed by a reconstruction on a single node with multiple threads.
Cheers,
Alexis
Hi Alexis,I would like to
In reply to Hi Scott, Glad to hear that's by Alexis
Hi Alexis,
I would like to ask you for some help with running the reconstruction with multiple threads. If I understood correctly, you can do a parallel reconstruction if it is run on one node with e.g. 4 processors. So far we have adapted the FREALIGN example script for multiprocessor refinement to our group's setup, but we are still running the reconstruction on only one processor. Unfortunately this takes more time than the whole refinement.
What we've tried so far is to add "export NCPUS=4" to the beginning of the mreconstruct-script, which is itself launched by the mrefine_sge-script using "id=`qsub -l nodes=1:ppn=4 [...]".
We would really appreciate it if you could help us with this.
Thanks a lot,
Michael
If I understood correctly,
In reply to Hi Alexis,I would like to by msaur
That's correct.
Here are a few hints which may help you:
I hope this helps. If you're still not having any luck, maybe give us more details - the scripts you used might be helpful for example, as well as more details about what type of environment you're using.
Good luck!
Alexis
Thank you very much for your
In reply to If I understood correctly, by Alexis
Thank you very much for your answer! I forwarded the information to our bioinformatics guy, so let's see what he can do. The information I gave you was a little misleading: we use Torque, and the mrefine_sge-script was modified for Torque.
Thanks again,
Michael
Hi Alexis, I'm the colleague
In reply to If I understood correctly, by Alexis
Hi Alexis,
I'm Michael's colleague and I am trying to get frealign to work in our environment.
I changed the reconstruction script to use the _mp version, and the log file now reports that it will be executed on 4 cores. Our problem is that the script starts only one process, but this process uses 400% of the CPU resources. So it looks to me as if all the processes are being executed by the same core?! The reconstruction time is also nearly the same, so there is no speedup.
I set the environment variables OMP_NUM_THREADS=4 and NCPUS=4. One difference from your example script is that I use the bash shell.
Another difference is that we don't use SGE; we use TORQUE as our PBS. For the reconstruction, I reserve one node with 4 processors, so normally it shouldn't be a problem to spread the job over all 4 cores.
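For reference, the top of a bash job script for this kind of setup might look roughly like the sketch below. It keeps NCPUS and OMP_NUM_THREADS in sync with the number of cores requested from the scheduler. The use of PBS_NUM_PPN is an assumption: some Torque versions export it inside a job and others may not, so the sketch falls back to 1 thread when it is unset.

```shell
#!/bin/bash
#PBS -l nodes=1:ppn=4

# Sketch only. PBS_NUM_PPN (cores per node granted to the job) is assumed to be
# exported by the scheduler; if it is not set, fall back to a single thread.
threads="${PBS_NUM_PPN:-1}"

# Set both variables mentioned in this thread to the same value.
export NCPUS="$threads"
export OMP_NUM_THREADS="$threads"

echo "Requesting $threads OpenMP thread(s) for the reconstruction"
```

Deriving the thread count from the scheduler rather than hard-coding 4 avoids asking for more threads than the job was actually granted.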
Hi there, If the log file
In reply to Hi Alexis, I'm the colleague by wuermchen
Hi there,
If the log file says that 4 threads were used and you're seeing 400% usage, then everything is fine and you are running the reconstruction in parallel over 4 cores. What makes you think only one core is being used?
The only troubling thing is that you say that the reconstruction time is "nearly" the same. Please have a look further up this thread, where Scott was seeing the same thing, but he was using IFLAG=1, not IFLAG=0... Maybe the same thing is happening to you?
Hi Alexis, thanks for the
In reply to Hi there, If the log file by Alexis
Hi Alexis,
thanks for the fast reply.
I checked the IFLAG, it is set to 0 for the reconstruction.
The thing is, only one process gets started on the node. When I use top to show my processes on the node, only one frealign_v8_mp.exe process is running, at 400% CPU. That, together with the reconstruction time being nearly the same, makes me think that everything is being run on only one core.
I expected 4 separate processes, each using 100% CPU.
I hope you have another idea. Some other tools need a file stating how many cores per node are available. Does this parallelization need something like that? If you need further information, let me know.
Kind regards
Mario
1 process, 4 threads.
In reply to Hi Alexis, thanks for the by wuermchen
Hi Mario,
If frealign says near the beginning that it will be using 4 threads, then it definitely is.
It is correct that only 1 process is started, but it has 4 threads, as evidenced by top showing you 400%, not 100%. Admittedly, top behaves differently depending on the flavour/version of unix you're using, but still I can't think of why it would show 400% if that process didn't have 4 threads. Some versions of top have an option (-t or -H) that shows threads on separate lines. Maybe you can try this (check "man top" to see whether the option is available and which letter it is) to reassure yourself that there are indeed 4 threads running.
Just to rephrase the point: you should not expect 4 different processes, you should expect one process with 4 threads.
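If top's thread view is not available, the thread count of a running process can also be read directly on Linux. The sketch below is Linux-only and assumes /proc is mounted; thread_count is a made-up helper name.

```shell
#!/bin/bash
# Linux-only sketch: print the number of threads of the process with the given PID.
# Reads the "Threads:" field that the kernel reports in /proc/<pid>/status.
thread_count() {
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

# Demo on the current shell; substitute the PID of the frealign process
# (as shown by top) to check the reconstruction job.
thread_count $$
```

A single process reporting 4 threads here is consistent with one line in top showing around 400% CPU.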
There could be a number of reasons that the reconstruction isn't much faster with 4 threads than it was with 1 thread only. Chief suspects would be the box size (unless the box size is large, I wouldn't expect big speedups) and the number of images going into the reconstruction.
Hope this helps. If you still have doubts regarding the speedup, maybe attach the inputs and outputs to frealign.
Best regards,
Alexis