Problems with Frealign on PBS Cluster
Hi Niko and Alexis,
I recently started working with Frealign (v9.11), which has so far done an excellent job of separating a simulated dataset on my local machine (6 cores). I especially liked the easy setup and the organization with the mparameters and .par files. Now I would like to scale up the process and have therefore moved to a large PBS cluster (HLRN, https://www.hlrn.de/home/view). I installed Frealign there and started running it. I have the following files in my folder:
RNAP_80.par              # already refined parameters
RNAP_80.mrc              # refined volume
DATA/dc0_RNAP_stack.mrc  # particle stack, 3348 particles in total
To begin with, I am asking for just 1 class in mode 1, with start_process: 81. I did not add anything to qsub_string_re* and instead modified the code in mult_refine.com and mult_reconstruct.com to automatically add the #PBS parameters (as below) to the pbs files. I also adjusted mult_reconstruct.com and mult_refine.com to call "msub" instead of "qsub", which is necessary on this cluster. However, I still experienced problems: jobs are submitted to the queue and then terminated after a few seconds.
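The qsub-to-msub substitution was essentially the following (a minimal sketch; the exact occurrences of "qsub" inside the scripts may differ between Frealign versions):
"""
# replace the submission command in the Frealign driver scripts,
# keeping backups of the originals
sed -i.bak 's/qsub/msub/g' mult_refine.com mult_reconstruct.com
"""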
However, when I submit a single pbs job from the scratch folder (e.g. "msub scratch/pbs1_1.com"), it works fine. I added the necessary #PBS parameters to the file for easier handling. See here:
scratch/pbs1_1.com file
"""
#!/bin/csh -f
#PBS -V
#PBS -l feature=smp2
#PBS -l nodes=1:ppn=40
#PBS -N FA
#PBS -j eo
#PBS -l walltime=24:00:00
cd /gfs2/work/bebkrupp/FA_sim3k
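# reconstruct particles 1 to 3348 at cycle 80, class 1
# (argument order assumed here from the frealign.log lines quoted below)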
/gfs1/work/bebkrupp/SOFT/FREALIGN/frealign_v9.11/bin/mult_reconstruct_n.com 1 3348 80 1
"""
In this way I am able to generate parameter files and reconstruct *_n1.mrc volumes, which are placed in the scratch folder. However, frealign_run_refine does not manage to run properly. From frealign.log I get:
"""Starting refinement...
Cycle 80: reconstructing particles 1 to 3348 on Wed Jul 6 15:38:20 CEST 2016
Job hannover crashed Wed Jul 6 15:40:12 CEST 2016
Logfile /gfs2/work/bebkrupp/FA_sim3k/scratch/RNAP_mult_reconstruct_r1_n1.log
Final lines:
Terminating..."""
The same error occurs for refinement.
In scratch/monitor_frealign.log I get the following:
"""tail: cannot open `/gfs2/work/bebkrupp/FA_sim3k/scratch/RNAP_mult_reconstruct_r1_n1.log' for reading: No such file or directory
cat: /gfs2/work/bebkrupp/FA_sim3k/scratch/pid_temp.log: No such file or directory"""
And for refinement:
"""tail: cannot open `/gfs2/work/bebkrupp/FA_sim3k/scratch/RNAP_mult_refine_n_r1.log_1_3348' for reading: No such file or directory
tail: cannot open `RNAP_mult_refine_n_r2.log_1_3348' for reading: No such file or directory
tail: cannot open `RNAP_mult_refine_n_r3.log_1_3348' for reading: No such file or directory
cat: /gfs2/work/bebkrupp/FA_sim3k/scratch/pid_temp.log: No such file or directory"""
It seems like the big jobs (refinement and reconstruction) themselves run smoothly, but the book-keeping of their output files in the main script runs into problems. Do you have any idea what could cause this problem?
Thanks in advance for any advice and help.
All the best,
Ferdinand
Thanks for the detailed description. It looks like the files in /gfs2/work/bebkrupp/FA_sim3k/scratch are inaccessible. Is this file system mounted on all nodes? If necessary, a scratch directory can be specified in the mparameters file. Maybe a directory on /gfs1/work/bebkrupp/ could be used?
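For example, the line in mparameters could look like this (the path here is just an illustration):
"""
scratch_dir          /gfs1/work/bebkrupp/scratch
"""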
In reply to Thanks for the detailed by niko
I tried to set scratch_dir to /gfs1/work/bebkrupp/scratch/, but I still get the same error:
"""
Cycle 81: refining particles 1 to 3348, class 1 Thu Jul 14 15:30:23 CEST 2016
Cycle 81: refining particles 1 to 3348, class 2 Thu Jul 14 15:30:23 CEST 2016
Cycle 81: refining particles 1 to 3348, class 3 Thu Jul 14 15:30:24 CEST 2016
Job hannover crashed Thu Jul 14 15:32:15 CEST 2016
Logfile /gfs1/work/bebkrupp/SOFT/FREALIGN/scratch/RNAP_mult_refine_n_r1.log_1_3348 RNAP_mult_refine_n_r2.log_1_3348 RNAP_mult_refine_n_r3.log_1_3348
Final lines:
"""
The file system is mounted on all nodes.
Thanks for your help.
In reply to I tried to set scratch_dir by Ferdinand Krupp
The next thing to try is to disable the monitoring job. In the Frealign bin folder, edit the file monitor_frealign.com to add the line "exit" after the first line. The beginning of monitor_frealign.com should then look like this:
"""
#!/bin/csh -f
exit
"""
This will disable the monitoring job that detects if a job crashed. Sometimes this script does not work properly and kills jobs even if they are running properly. Please note that with the monitoring job not running, the frealign_kill command will also not work and you will have to kill jobs on the cluster manually.
In reply to The next thing to try is to by niko
Thanks for the advice. I added "exit" to monitor_frealign.com.
Unfortunately, that also did not work out.
In reply to Thanks, for the advice. I by Ferdinand Krupp
I am sorry that this did not work. There must be another problem. Maybe you could try using SSH instead?
In reply to I am sorry that this did not by niko
Thanks for your help.
I also tried logging in on a single node with 40 cores. There Frealign runs, but only at low speed: I managed to finish 3 rounds in 10 hours (with 150k particles and an 80x80 window size). I am not sure what the problem is here, but I have moved back to my local machine, which isn't too bad after all. Here I have 8 cores and good speed.