Problems with Frealign on PBS Cluster


Hi Niko and Alexis,

I recently started working with Frealign (v9.11), which has so far done an excellent job of separating a simulated dataset on my local machine (6 cores). I especially liked the easy setup and the organization with the mparameters file and the .par files. Now I would like to scale up the process and have therefore moved to a large PBS cluster (HLRN, https://www.hlrn.de/home/view). Here, I installed Frealign and started running it. I have the following files in my folder:

RNAP_80.par # already refined parameters
RNAP_80.mrc # refined volume
DATA/dc0_RNAP_stack.mrc # particle stack, 3348 particles in total

To start, I am asking for just 1 class in mode 1, with start_process: 81. I did not add anything to qsub_string_re* and instead modified the code in mult_refine.com and mult_reconstruct.com to automatically add the #PBS parameters shown below to the generated PBS files (a sketch of this modification follows). I also adjusted mult_reconstruct.com and mult_refine.com to use "msub" instead of "qsub", which is necessary on this cluster. However, I still experience problems: jobs are submitted to the queue but are terminated after a few seconds.
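
Roughly, the effect of my edits corresponds to the following standalone sketch (this is not the actual code inside mult_refine.com/mult_reconstruct.com, and the glob pattern for the generated scripts is only an example):
"""
#!/bin/csh -f
# Sketch only: illustrates the effect of my edits, i.e. prepending the #PBS
# directives to each generated job script and submitting it with msub.
cat > scratch/pbs_header << EOF
#PBS -V
#PBS -l feature=smp2
#PBS -l nodes=1:ppn=40
#PBS -N FA
#PBS -j eo
#PBS -l walltime=24:00:00
EOF
foreach pbs_script ( scratch/pbs*_*.com )
  head -n 1 $pbs_script    >  ${pbs_script}.tmp   # keep the #!/bin/csh line
  cat scratch/pbs_header   >> ${pbs_script}.tmp   # insert the #PBS directives
  tail -n +2 $pbs_script   >> ${pbs_script}.tmp   # keep the original commands
  mv ${pbs_script}.tmp $pbs_script
  msub $pbs_script                                # this cluster requires msub
end
"""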

However, when I submit a single PBS job from the scratch folder myself (e.g. "msub scratch/pbs1_1.com"), it works fine. I added the necessary #PBS parameters to the file for easier handling. See here:
scratch/pbs1_1.com:
"""
#!/bin/csh -f
#PBS -V
#PBS -l feature=smp2
#PBS -l nodes=1:ppn=40
#PBS -N FA
#PBS -j eo
#PBS -l walltime=24:00:00

cd /gfs2/work/bebkrupp/FA_sim3k
/gfs1/work/bebkrupp/SOFT/FREALIGN/frealign_v9.11/bin/mult_reconstruct_n.com 1 3348 80 1
"""

In this way I am able to generate parameter files or reconstruct *_n1.mrc volumes, which are placed in the scratch folder. However, executing frealign_run_refine does not run properly. From frealign.log I get:

"""Starting refinement...
Cycle 80: reconstructing particles 1 to 3348 on Wed Jul 6 15:38:20 CEST 2016
Job hannover crashed Wed Jul 6 15:40:12 CEST 2016
Logfile /gfs2/work/bebkrupp/FA_sim3k/scratch/RNAP_mult_reconstruct_r1_n1.log
Final lines:

Terminating..."""

For refinement the same error occurs.

In scratch/monitor_frealign.log I get the following:
"""tail: cannot open `/gfs2/work/bebkrupp/FA_sim3k/scratch/RNAP_mult_reconstruct_r1_n1.log' for reading: No such file or directory
cat: /gfs2/work/bebkrupp/FA_sim3k/scratch/pid_temp.log: No such file or directory"""

And for refinement:
"""tail: cannot open `/gfs2/work/bebkrupp/FA_sim3k/scratch/RNAP_mult_refine_n_r1.log_1_3348' for reading: No such file or directory
tail: cannot open `RNAP_mult_refine_n_r2.log_1_3348' for reading: No such file or directory
tail: cannot open `RNAP_mult_refine_n_r3.log_1_3348' for reading: No such file or directory
cat: /gfs2/work/bebkrupp/FA_sim3k/scratch/pid_temp.log: No such file or directory"""

It seems like the big jobs (refinement and reconstruction) run smoothly, but the bookkeeping of their output files in the main script experiences some problems. Do you have any idea what could cause this problem?

Thanks in advance for any advice and help.

All the best,
Ferdinand

Thanks for the detailed description. It looks like the files in /gfs2/work/bebkrupp/FA_sim3k/scratch are inaccessible. Is this file system mounted on all nodes? If necessary, a scratch directory can be specified in the mparameters file. Maybe a directory on /gfs1/work/bebkrupp/ could be used?
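
For example, something like the following line in mparameters (adjust the path to a directory that is available on all nodes):

scratch_dir              /gfs1/work/bebkrupp/scratch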

In reply to by niko

I tried to set scratch_dir /gfs1/work/bebkrupp/scratch/ but I still get the same error:

Cycle 81: refining particles 1 to 3348, class 1 Thu Jul 14 15:30:23 CEST 2016
Cycle 81: refining particles 1 to 3348, class 2 Thu Jul 14 15:30:23 CEST 2016
Cycle 81: refining particles 1 to 3348, class 3 Thu Jul 14 15:30:24 CEST 2016
Job hannover crashed Thu Jul 14 15:32:15 CEST 2016
Logfile /gfs1/work/bebkrupp/SOFT/FREALIGN/scratch/RNAP_mult_refine_n_r1.log_1_3348 RNAP_mult_refine_n_r2.log_1_3348 RNAP_mult_refine_n_r3.log_1_3348
Final lines:

The file system is mounted on all nodes.
Thanks for your help.

In reply to by Ferdinand Krupp

The next thing to try is to disable the monitoring job. In the Frealign bin folder, edit the file monitor_frealign.com to add the line

exit

after the first line. The beginning of monitor_frealign.com should then look like

#!/bin/csh -f
exit
#
#   Control script to monitor Frealign jobs
#

This will disable the monitoring job that detects whether a job has crashed. Sometimes this script does not work properly and kills jobs even though they are running fine. Please note that with the monitoring job disabled, the frealign_kill command will also not work, and you will have to kill jobs on the cluster manually.
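
On a Torque/Moab setup like yours, the scheduler's own commands can be used for this, for example (the exact commands depend on the cluster configuration):

showq -u $USER       # or: qstat -u $USER    (list your jobs)
canceljob <job_id>   # or: qdel <job_id>     (kill a specific job)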

In reply to by niko

Thanks for your help.
I also tried logging in on a single node with 40 cores. Here Frealign runs, but only at low speed: I managed to finish 3 rounds in 10 hours (with 150k particles and an 80x80 window size). I am not sure what the problem is here, but I moved back to my local machine, which isn't too bad after all: there I have 8 cores and good speed.