Frealign MLA on a cluster

Hello,

We are attempting to run Frealign Maximum Likelihood classification on our cluster, and have run into some issues. We can run Frealign with a single reference without any issues, and multi-class runs work locally, but multi-class runs on the cluster crash in the initial cycle. Here are the final lines of a log file:

Cycle 0: reconstructing particles 1 to 15684 on Wed Nov 19 13:55:05 EST 2014
Cycle 0: reconstruction for particles 1 to 15684, ref 1, finished Wed Nov 19 14:10:16 EST 2014
Cycle 0: reconstruction for particles 15685 to 31368, ref 1, finished Wed Nov 19 14:10:16 EST 2014
Cycle 0: reconstruction for particles 31369 to 41823, ref 1, finished Wed Nov 19 14:10:17 EST 2014
Cycle 0: reconstruction for particles 1 to 15684, ref 2, finished Wed Nov 19 14:10:22 EST 2014
Cycle 0: reconstruction for particles 15685 to 31368, ref 2, finished Wed Nov 19 14:10:22 EST 2014
Cycle 0: reconstruction for particles 31369 to 41823, ref 2, finished Wed Nov 19 14:10:22 EST 2014
Cycle 0: reconstruction for particles 1 to 15684, ref 3, finished Wed Nov 19 14:10:27 EST 2014
Cycle 0: reconstruction for particles 15685 to 31368, ref 3, finished Wed Nov 19 14:10:27 EST 2014
Cycle 0: reconstruction for particles 31369 to 41823, ref 3, finished Wed Nov 19 14:10:27 EST 2014
Cycle 0: merging 3D dump files for ref 1 Wed Nov 19 14:10:27 EST 2014
Cycle 0: merging 3D dump files for ref 2 Wed Nov 19 14:10:27 EST 2014
Cycle 0: merging 3D dump files for ref 3 Wed Nov 19 14:10:27 EST 2014
Job 10127 crashed.
Logfile /panfs/storage.local/imb/stroupe/mcjohnson2/pre40S_20140507_Final/ATP_neg/Frealign/run6_MLA/scratch/merge_3d_r1.log

We are currently using v9.08 but are soon upgrading to v9.09.

Does anyone have any experience regarding this?

Thank you,

Matt

Could you please re-launch the job and then kill the monitor_frealign job that should be running on the head node? If the job runs without the monitor job, it may be a timing issue where files get written by one node but are not immediately updated through NFS for all other nodes.
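
The timing issue described above can be illustrated with a generic polling loop. This is only a minimal sketch of the idea of waiting for a file written on one node to become visible on another; it is not how the Frealign scripts handle it, and the function name `wait_for_file` and its default values are assumptions made for illustration.

```python
import os
import time


def wait_for_file(path, timeout=60.0, poll=1.0):
    """Return True once `path` exists and is non-empty, False on timeout.

    Hypothetical helper: on a shared file system a file written by one
    node may take a moment to appear on the others, so we poll for it
    instead of assuming it is there immediately.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if os.path.getsize(path) > 0:
                return True
        except OSError:
            pass  # file not visible on this node yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

A monitoring process that checks for output files once and gives up would report a crash in the situation described, whereas a loop like this tolerates a short delay.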

In reply to by niko

Thank you for your help. After re-launching the job and then killing monitor_frealign as you suggest, the other jobs (mult_refine, mult_refine_n, frealign_v9, etc) do indeed continue running. Do you have any suggestions on what we might do to solve this issue?

In reply to by mcjohnson

We found a similar problem on some systems and I posted an updated version of Frealign yesterday (please see download page). Please try it and see if this resolves your problem. If not, it might be a problem specific to your cluster. What type of cluster do you run Frealign on?

In reply to by niko

It is a PBS cluster (I think). Scheduling is handled by MOAB, backend by Torque. The network is connected by Infiniband. We'll try the new version ASAP.

In reply to by niko

Hello again Niko,

We're now using v9.09, and as before ML classification will run locally, but not on our cluster. Now the job crashes very early. I'll post our input/output below: can you tell us if Frealign should run correctly with these settings?

In working directory at start:
start.mrc (particles)
run6_1_r1.mrc
run6_1_r1.par
mparameters
submit.job

Submission script (submit.job)

#!/bin/bash
#MOAB -l nodes=8
#MOAB -l walltime=96:00:00
cd /workingdirectorypath
frealign_run_refine

mparameters:

Control parameter file to run Frealign
======================================

This file must be kept in the project working directory from which the refinement scripts are launched.

Note: Please make sure that project and scratch directories (if specified) are accessible by all sub-processes that are run on cluster nodes.

# Computer-specific setting
cluster_type PBS ! Set to "sge", "lsf", "slurm", "pbs" or "condor" when running on an SGE, LSF, SLURM, PBS or CONDOR cluster, otherwise set to "none".
nprocessor_ref 8 ! Number of CPUs to use during refinement.
nprocessor_rec 8 ! Number of CPUs to use during reconstruction.
mem_per_cpu 2048 ! Memory available per CPU (in MB).

# Refinement-specific parameters
MODE 1 ! 1, 2, 3 or 4. Refinement mode, normally 1. Set to 2 for additional search.
start_process 2 ! First cycle to execute. Output files from previous cycle (n-1) required.
end_process 10 ! Last cycle to execute.
res_high_refinement 8.0 ! High-resolution limit for particle alignment.
res_high_class 10.0 ! High-resolution limit to calculate class membership (OCC).
thresh_reconst 0.0 ! Particles with scores below this value will not be included in the reconstruction.
nclasses 3 ! Number of classes to use.

# Search-specific parameters
res_search 30.0 ! High-resolution limit for orientational search.
thresh_refine 50.0 ! Mode 4: Score threshold above which search will not be performed.
DANG 200.0 ! Mode 3 and 4: Angular step for orientational search.
ITMAX 200 ! Mode 2 and 4: Number of repetitions of grid search with random starting angles.
Bsearch 2000.0 ! B-factor filtering (when > 0) applied during search.

# Dataset-specific parameters
data_input run6 ! Root name for parameter and map files.
raw_images start
image_contrast N ! N or P. Set to N if particles are dark on bright background, otherwise set to P.
outer_radius 155.0 ! Outer radius of spherical particle mask in Angstrom.
inner_radius 0.0 ! Inner radius of spherical particle mask in Angstrom.
mol_mass 1200.0 ! Molecular mass in kDa of particle or helical segment.
Symmetry C1 ! Symmetry of particle.
pix_size 1.26 ! Pixel size of particle in Angstrom.
dstep 8.3 ! Pixel size of detector in micrometer.
Aberration 2.7 ! Spherical aberration coefficient in millimeter.
Voltage 300.0 ! Beam acceleration voltage in kilovolt.
Amp_contrast 0.07 ! Amplitude contrast.

# Expert parameters (for expert users)
XSTD 0.0 ! Tighter masking of 3D map (XSTD > 0) or particles (XSTD < 0).
PBC 20.0 ! Discriminate particles with different scores during reconstruction. Small values (5 - 10) discriminate more than large values (50 - 100).
parameter_mask "1 1 1 1 1" ! Five flags of 0 or 1 (e.g. 1 1 1 1 1). Determines which parameters are refined.
refineangleinc 4 ! When larger than 1: Alternate between refinement of OCC and OCC + angles.
refineshiftinc 4 ! When larger than 1: Alternate between refinement of OCC and OCC + angles + shifts.
res_reconstruction 6.0 ! High-resolution limit of reconstruction. Normally set to Nyquist limit.
res_low_refinement 350.0 ! Low-resolution limit for particle alignment. Set to particle dimension or larger.
FMAG F ! T or F. Set to T to refine particle magnification. Not recommended in most cases.
FDEF F ! T or F. Set to T to refine defocus per micrograph. Not recommended in most cases.
FASTIG F ! T or F. Set to T to refine astigmatism. Not recommended in most cases.
FPART F ! T or F. Set to T to refine defocus for each particle. Not recommended in most cases.
FFILT T ! T or F. Set to T to apply optimal filter to reconstruction. Recommended in most cases.
FMATCH F ! T or F. Set to T to output matching projections. Only needed for diagnostics.
FBEAUT F ! T or F. Set to T to apply symmetry also in real space. Not needed in most cases.
FBOOST F ! T or F. Set to T to allow potential overfitting during refinement. Not recommended in most cases.
RBfactor 0.0 ! B-factor sharpening (when < 0) applied during refinement. Not recommended in most cases.
mp_cpus 1 ! Number of CPUs to use for each reconstruction job.
restart_after_crash F ! T or F. Set to T to restart job if it crashes.
delete_scratch T ! Delete intermediate files in scratch directory.
qsub_string_ref "" ! String to add to cluster jobs submitted for refinement (only for SGE and PBS clusters).
qsub_string_rec "" ! String to add to cluster jobs submitted for reconstruction (only for SGE and PBS clusters).
first_particle
last_particle
frealign_bin_dir
scratch_dir

# Janelia-specific parameters
night_queue F ! T or F. Set to T to use night queue.
reschedule_if_qw F ! T or F. Set to T to reschedule jobs on the normal queue if some end up waiting on night queue.

(cont'd below)

In reply to by mcjohnson

When this job is submitted, it crashed almost right away, with the following outputs:
run6.job.e8134580
run6.job.o8134580
frealign.log
scratch
cluster_type.log
monitor_frealign.log
mparameters_run
mult_refine.log
pid.log

These logs read:

more run6.job.e8134580
module: Command not found.

more run6.job.o8134580
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Starting refinement...
[1] 23662
[2] 23663

more frealign.log
Starting refinement...

more scratch/cluster_type.log
PBS

more scratch/monitor_frealign.log
(file is empty)

more scratch/mparameters_run
(same as input mparameters)

more scratch/mult_refine.log
awk: (FILENAME=- FNR=1) warning: error writing standard output (Broken pipe)

more scratch/pid.log
23663 monitor_frealign.log

Any help you could give us would be greatly appreciated, thank you.

In reply to by mcjohnson

I wonder why you have a submit script and cannot run frealign_run_refine directly from your head node. Have you tried this?

In reply to by niko

AH! Thank you! It is running now, after modifying qsub_string_ref and qsub_string_rec in mparameters.

I had attempted this before, and it failed with errors indicating the jobs were being rejected, which is why I attempted submitting as a script.

It seems that to run on our cluster "qsub_string_ref" and "qsub_string_rec" need to contain the correct formatting arguments regarding queue/node/cpus. I had tried to do this with v9.08 and could not, but it is running fine now.

In reply to by niko

"-q quename_q"

It seems our system can't accept jobs that aren't assigned to a specific queue. I haven't tried to optimize the nodes/processors yet to minimize processing time, but might later via

"-l nodes=X:ppn=Y"

What is the optimal way to handle this? Can we use multiple nodes/processors for each individual job?
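
For reference, the two settings discussed above can be composed as follows. This is a minimal sketch, assuming the options are simply placed between the quotes of `qsub_string_ref` and `qsub_string_rec`; the helper names and the queue name `myqueue` are hypothetical, and the values must be replaced with ones valid on the local cluster.

```python
def build_qsub_string(queue, nodes=None, ppn=None):
    """Compose the option string: a queue, plus optional node/CPU request."""
    parts = ["-q", queue]
    if nodes is not None and ppn is not None:
        # Resource request in the "-l nodes=X:ppn=Y" form quoted in the thread
        parts += ["-l", f"nodes={nodes}:ppn={ppn}"]
    return " ".join(parts)


def mparameters_lines(qsub_string):
    """Format the two mparameters entries that carry the option string."""
    keys = ("qsub_string_ref", "qsub_string_rec")
    return [f'{key} "{qsub_string}"' for key in keys]


# "myqueue" is a placeholder queue name, not one taken from the thread
for line in mparameters_lines(build_qsub_string("myqueue", nodes=1, ppn=1)):
    print(line)
```

Both keys take the same string here only for simplicity; they can differ if refinement and reconstruction jobs need different resources.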

In reply to by mcjohnson

All parallelization is normally handled by the scripts. In some cases it might be useful to use the openMP version of Frealign. You can change the mp_cpus key in mparameters for this. Since things are running now, I suggest you go with what you have and see how it works before doing any more adjustments.

In reply to by niko

We are now having a different issue, and I'm not sure if it is related or not.

This data set was previously processed with Relion and then v9.08 without issue.

However, the initial reconstruction using v9.09 is very bad - it appears as if it were using the incorrect angles and/or shifts, with a nonsense toroid-shaped map resulting. This happens even when using stack/par/mparameters combinations that had previously run fine in v9.08.

This issue occurs when using both frealign_run_refine and frealign_calc_reconstructions.

In reply to by mcjohnson

Going from v9.08 to v9.09 should work without problems. Have you checked the angles and shifts that v9.09 produces after one round of refinement? Are they similar to the input? If not, something must be read in incorrectly.

In reply to by niko

After one round of refinement, the angles/shifts have changed significantly from the input, and the "toroid" nonsense map becomes a snowball, as expected if the original angles/shifts were being read incorrectly, but refined correctly given a bad input.

The magnitude of the changes is quite large for the angles: SHX and SHY change by only 6 px, but the angles change dramatically, at times by more than 360 degrees (i.e. some angles are now bizarrely high numbers such as -417 and -476).

The differences in angles between rounds 1 and 2 also tend to cluster around certain values, which definitely indicates some pattern in how the files are being read incorrectly.

For each of phi/theta/psi, the majority of particles changed by about 0 or 360 degrees; however, many other particles changed by about +/-57 degrees, +/-114 degrees, or -300 degrees, and some particles changed by about -417 or -476 degrees.
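
Since angles that differ by a multiple of 360 degrees describe the same orientation, it can help to wrap the differences before comparing them. The following is a minimal sketch of that step using the values reported above; the function name `wrap_angle` is an assumption made for illustration.

```python
def wrap_angle(delta_deg):
    """Map an angle difference in degrees onto the interval [-180, 180)."""
    return (delta_deg + 180.0) % 360.0 - 180.0


# Approximate angle changes reported in the post above
reported = [0.0, 360.0, 57.0, -57.0, 114.0, -114.0, -300.0, -417.0, -476.0]

for delta in reported:
    print(f"{delta:7.1f} -> {wrap_angle(delta):7.1f}")
```

After wrapping, 360 becomes 0, -417 becomes -57, -476 becomes -116 and -300 becomes 60, so the very large values are equivalent to much smaller changes.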

In reply to by mcjohnson

Yes, this sounds like a compatibility issue with the input parameter file. You may have to copy your old parameter file to a new one that has all the columns precisely lined up as they are in the output file you obtained from the new Frealign run.

To be sure, also check your mparameters file again to see that you correctly copied over all the dataset and microscope parameters from the old mparameters file you used with v9.08.

In reply to by niko

Finally cracked this nut, thanks!

The issue was a mismatch between the magnification in the .par, and the detector pixel size in mparameters.
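
A quick consistency check for this kind of mismatch is sketched below. It assumes the usual relation pixel size (Angstrom) = detector step (micrometer) x 10,000 / magnification; the function names and the tolerance are hypothetical, and the magnification actually stored in the .par file is not given in this thread.

```python
def pixel_size_angstrom(dstep_um, magnification):
    """Pixel size at the specimen implied by detector step and magnification."""
    return dstep_um * 1.0e4 / magnification


def expected_magnification(dstep_um, pix_size_angstrom):
    """Magnification the .par file should contain for the given mparameters."""
    return dstep_um * 1.0e4 / pix_size_angstrom


def is_consistent(par_mag, dstep_um, pix_size, rel_tol=0.01):
    """True if the .par magnification matches pix_size within rel_tol."""
    implied = pixel_size_angstrom(dstep_um, par_mag)
    return abs(implied - pix_size) <= rel_tol * pix_size


# Values from the mparameters file posted above: dstep 8.3, pix_size 1.26
print(round(expected_magnification(8.3, 1.26)))
```

With dstep 8.3 and pix_size 1.26, the magnification column would need to be close to 65873 for the two files to agree.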

Everything is running smoothly now!