Problems with Frealign on PBS Cluster


Hi Niko and Alexis,

I recently started working with Frealign (v9.11), which has so far done an excellent job of separating a simulated dataset on my local machine (6 cores). I especially liked the easy setup and the organization with the mparameters file and the .par files. Now I would like to scale up the process and have therefore moved to a large PBS cluster (HLRN, https://www.hlrn.de/home/view). Here, I installed Frealign and started running it. I have the following files in my folder:

RNAP_80.par # already refined parameters
RNAP_80.mrc # refined volume
DATA/dc0_RNAP_stack.mrc # particle stack, 3348 particles in total

To start, I am asking for just 1 class in mode 1, with start_process: 81. I did not add anything to qsub_string_re* and instead modified the code in mult_refine.com and mult_reconstruct.com to automatically add the #PBS parameters shown below to the generated PBS files (a sketch of this modification follows). I also adjusted mult_reconstruct.com and mult_refine.com to use "msub" instead of "qsub", which is necessary on this cluster. However, I still experience problems: jobs are submitted to the queue but are terminated after a few seconds.
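
Roughly, the effect of my edits corresponds to the following standalone sketch (this is not the actual code inside mult_refine.com/mult_reconstruct.com, and the glob pattern for the generated scripts is only an example):
"""
#!/bin/csh -f
# Sketch only: illustrates the effect of my edits, i.e. prepending the #PBS
# directives to each generated job script and submitting it with msub.
cat > scratch/pbs_header << EOF
#PBS -V
#PBS -l feature=smp2
#PBS -l nodes=1:ppn=40
#PBS -N FA
#PBS -j eo
#PBS -l walltime=24:00:00
EOF
foreach pbs_script ( scratch/pbs*_*.com )
  head -n 1 $pbs_script    >  ${pbs_script}.tmp   # keep the #!/bin/csh line
  cat scratch/pbs_header   >> ${pbs_script}.tmp   # insert the #PBS directives
  tail -n +2 $pbs_script   >> ${pbs_script}.tmp   # keep the original commands
  mv ${pbs_script}.tmp $pbs_script
  msub $pbs_script                                # this cluster requires msub
end
"""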

However, when I submit a single PBS job from the scratch folder myself (e.g. "msub scratch/pbs1_1.com"), it works fine. I added the necessary #PBS parameters to the file for easier handling. See here:
scratch/pbs1_1.com:
"""
#!/bin/csh -f
#PBS -V
#PBS -l feature=smp2
#PBS -l nodes=1:ppn=40
#PBS -N FA
#PBS -j eo
#PBS -l walltime=24:00:00

cd /gfs2/work/bebkrupp/FA_sim3k
/gfs1/work/bebkrupp/SOFT/FREALIGN/frealign_v9.11/bin/mult_reconstruct_n.com 1 3348 80 1
"""

In this way I am able to generate parameter files or reconstruct *_n1.mrc volumes, which are placed in the scratch folder. However, executing frealign_run_refine does not run properly. From frealign.log I get:

"""Starting refinement...
Cycle 80: reconstructing particles 1 to 3348 on Wed Jul 6 15:38:20 CEST 2016
Job hannover crashed Wed Jul 6 15:40:12 CEST 2016
Logfile /gfs2/work/bebkrupp/FA_sim3k/scratch/RNAP_mult_reconstruct_r1_n1.log
Final lines:

Terminating..."""

For refinement the same error occurs.

In scratch/monitor_frealign.log I get the following:
"""tail: cannot open `/gfs2/work/bebkrupp/FA_sim3k/scratch/RNAP_mult_reconstruct_r1_n1.log' for reading: No such file or directory
cat: /gfs2/work/bebkrupp/FA_sim3k/scratch/pid_temp.log: No such file or directory"""

And for refinement:
"""tail: cannot open `/gfs2/work/bebkrupp/FA_sim3k/scratch/RNAP_mult_refine_n_r1.log_1_3348' for reading: No such file or directory
tail: cannot open `RNAP_mult_refine_n_r2.log_1_3348' for reading: No such file or directory
tail: cannot open `RNAP_mult_refine_n_r3.log_1_3348' for reading: No such file or directory
cat: /gfs2/work/bebkrupp/FA_sim3k/scratch/pid_temp.log: No such file or directory"""

It seems like the big jobs (refinement and reconstruction) run smoothly, but the bookkeeping of their output files in the main script experiences some problems. Do you have any idea what could cause this problem?

Thanks in advance for any advice and help.

All the best,
Ferdinand

Thanks for the detailed description. It looks like the files in /gfs2/work/bebkrupp/FA_sim3k/scratch are inaccessible. Is this file system mounted on all nodes? If necessary, a scratch directory can be specified in the mparameters file. Maybe a directory on /gfs1/work/bebkrupp/ could be used?
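
For example, something like the following line in mparameters (adjust the path to a directory that is available on all nodes):

scratch_dir              /gfs1/work/bebkrupp/scratch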

In reply to by niko

I tried to set scratch_dir /gfs1/work/bebkrupp/scratch/ but I still get the same error:

Cycle 81: refining particles 1 to 3348, class 1 Thu Jul 14 15:30:23 CEST 2016
Cycle 81: refining particles 1 to 3348, class 2 Thu Jul 14 15:30:23 CEST 2016
Cycle 81: refining particles 1 to 3348, class 3 Thu Jul 14 15:30:24 CEST 2016
Job hannover crashed Thu Jul 14 15:32:15 CEST 2016
Logfile /gfs1/work/bebkrupp/SOFT/FREALIGN/scratch/RNAP_mult_refine_n_r1.log_1_3348 RNAP_mult_refine_n_r2.log_1_3348 RNAP_mult_refine_n_r3.log_1_3348
Final lines:

The file system is mounted on all nodes.
Thanks for your help.

In reply to by Ferdinand Krupp

The next thing to try is to disable the monitoring job. In the Frealign bin folder, edit the file monitor_frealign.com to add the line

exit

after the first line. The beginning of monitor_frealign.com should then look like

#!/bin/csh -f
exit
#
#   Control script to monitor Frealign jobs
#

This will disable the monitoring job that detects whether a job has crashed. Sometimes this script does not work properly and kills jobs even though they are running fine. Please note that with the monitoring job disabled, the frealign_kill command will also not work, and you will have to kill jobs on the cluster manually.
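
On a Torque/Moab setup like yours, the scheduler's own commands can be used for this, for example (the exact commands depend on the cluster configuration):

showq -u $USER       # or: qstat -u $USER    (list your jobs)
canceljob <job_id>   # or: qdel <job_id>     (kill a specific job)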

In reply to by niko

Thanks for your help.
I also tried logging in on a single node with 40 cores. Here Frealign runs, but only at low speed: I managed to finish 3 rounds in 10 hours (with 150k particles and an 80x80 window size). I am not sure what the problem is here, but I moved back to my local machine, which isn't too bad after all: there I have 8 cores and good speed.