Frealign Control script crashes after start

Hi all,

I am seeing a strange crash in the control script distributed with the most recent Frealign version: it submits the jobs (which run fine), waits a few minutes, and then crashes.

(I am running on an SGE cluster.)

What log files would help to diagnose the issue?

Axel

Hi Axel,

I have been experiencing the same problem, but only if I start more than one run with the control script at the same time. So I have been starting one run with the control script and starting the rest by calling mult_refine.com or mult_search.com directly. Is this also when your crashes happen?
I couldn't find anything in the log files either.

-Clarisse

In reply to cvdfeltz

Even if only one job is run, it crashes.

Here is the output from frealign.log:

Starting refinement...
Cycle 0: reconstructing particles 1 to 1920 on Sat May 9 09:23:16 PDT 2015
Cycle 0: reconstructing particles 1921 to 3840 on Sat May 9 09:23:16 PDT 2015
Cycle 0: reconstructing particles 3841 to 5760 on Sat May 9 09:23:16 PDT 2015
Cycle 0: reconstructing particles 5761 to 7680 on Sat May 9 09:23:16 PDT 2015
Cycle 0: reconstructing particles 7681 to 9600 on Sat May 9 09:23:16 PDT 2015
Cycle 0: reconstructing particles 9601 to 11520 on Sat May 9 09:23:16 PDT 2015
Cycle 0: reconstructing particles 11521 to 13440 on Sat May 9 09:23:16 PDT 2015
Cycle 0: reconstructing particles 13441 to 15360 on Sat May 9 09:23:17 PDT 2015
Cycle 0: reconstructing particles 15361 to 17280 on Sat May 9 09:23:17 PDT 2015
Cycle 0: reconstructing particles 17281 to 19200 on Sat May 9 09:23:17 PDT 2015
Cycle 0: reconstructing particles 19201 to 21120 on Sat May 9 09:23:17 PDT 2015
Cycle 0: reconstructing particles 21121 to 23040 on Sat May 9 09:23:17 PDT 2015
Cycle 0: reconstructing particles 23041 to 24960 on Sat May 9 09:23:17 PDT 2015
Cycle 0: reconstructing particles 24961 to 26880 on Sat May 9 09:23:17 PDT 2015
Cycle 0: reconstructing particles 26881 to 28800 on Sat May 9 09:23:17 PDT 2015
Cycle 0: reconstructing particles 28801 to 30720 on Sat May 9 09:23:17 PDT 2015
Cycle 0: reconstructing particles 30721 to 32640 on Sat May 9 09:23:17 PDT 2015
Cycle 0: reconstructing particles 32641 to 34560 on Sat May 9 09:23:17 PDT 2015
Cycle 0: reconstructing particles 34561 to 36480 on Sat May 9 09:23:17 PDT 2015
Cycle 0: reconstructing particles 36481 to 38400 on Sat May 9 09:23:17 PDT 2015
Cycle 0: reconstructing particles 38401 to 40320 on Sat May 9 09:23:17 PDT 2015
Cycle 0: reconstructing particles 40321 to 42240 on Sat May 9 09:23:18 PDT 2015
Cycle 0: reconstructing particles 42241 to 44160 on Sat May 9 09:23:18 PDT 2015
Cycle 0: reconstructing particles 44161 to 46080 on Sat May 9 09:23:18 PDT 2015
Cycle 0: reconstructing particles 46081 to 48000 on Sat May 9 09:23:18 PDT 2015
Cycle 0: reconstructing particles 48001 to 49920 on Sat May 9 09:23:18 PDT 2015
Cycle 0: reconstructing particles 49921 to 51840 on Sat May 9 09:23:18 PDT 2015
Cycle 0: reconstructing particles 51841 to 53760 on Sat May 9 09:23:18 PDT 2015
Cycle 0: reconstructing particles 53761 to 55680 on Sat May 9 09:23:18 PDT 2015
Cycle 0: reconstructing particles 55681 to 57600 on Sat May 9 09:23:18 PDT 2015
Cycle 0: reconstructing particles 57601 to 59520 on Sat May 9 09:23:18 PDT 2015
Cycle 0: reconstructing particles 59521 to 61440 on Sat May 9 09:23:18 PDT 2015
Cycle 0: reconstructing particles 61441 to 63360 on Sat May 9 09:23:18 PDT 2015
Cycle 0: reconstructing particles 63361 to 65280 on Sat May 9 09:23:18 PDT 2015
Cycle 0: reconstructing particles 65281 to 67200 on Sat May 9 09:23:19 PDT 2015
Cycle 0: reconstructing particles 67201 to 69120 on Sat May 9 09:23:19 PDT 2015
Cycle 0: reconstructing particles 69121 to 71040 on Sat May 9 09:23:19 PDT 2015
Cycle 0: reconstructing particles 71041 to 72960 on Sat May 9 09:23:19 PDT 2015
Cycle 0: reconstructing particles 72961 to 74880 on Sat May 9 09:23:19 PDT 2015
Cycle 0: reconstructing particles 74881 to 76800 on Sat May 9 09:23:19 PDT 2015
Cycle 0: reconstructing particles 76801 to 78720 on Sat May 9 09:23:19 PDT 2015
Cycle 0: reconstructing particles 78721 to 80640 on Sat May 9 09:23:19 PDT 2015
Cycle 0: reconstructing particles 80641 to 82560 on Sat May 9 09:23:19 PDT 2015
Cycle 0: reconstructing particles 82561 to 84480 on Sat May 9 09:23:19 PDT 2015
Cycle 0: reconstructing particles 84481 to 86400 on Sat May 9 09:23:19 PDT 2015
Cycle 0: reconstructing particles 86401 to 88320 on Sat May 9 09:23:19 PDT 2015
Cycle 0: reconstructing particles 88321 to 90240 on Sat May 9 09:23:19 PDT 2015
Cycle 0: reconstructing particles 90241 to 92160 on Sat May 9 09:23:19 PDT 2015
Cycle 0: reconstructing particles 92161 to 94080 on Sat May 9 09:23:20 PDT 2015
Cycle 0: reconstructing particles 94081 to 96000 on Sat May 9 09:23:20 PDT 2015
Cycle 0: reconstructing particles 96001 to 97920 on Sat May 9 09:23:20 PDT 2015
Cycle 0: reconstructing particles 97921 to 99800 on Sat May 9 09:23:20 PDT 2015
Job not crashed Sat May 9 09:24:20 PDT 2015
Logfile /net/em-stor1/abrilot/tf30/Mar152015/frealign2/scratch/indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates indicates
Final lines:

Terminating...
Cycle : reconstructing particles 0 to on Sat May 9 09:26:11 PDT 2015

In reply to Axel

I ran a big job, with a reconstruction that takes an hour or so, on the Brandeis SGE cluster but could not reproduce this error. Can you please make sure you have the latest version installed (download the latest version from the web page and compare it with your installed version)? If you still run into this problem, there might be an issue with files in the scratch directory that do not get updated fast enough. This appears to be an issue on some file systems. If you think this is the problem, you can try editing the file monitor_frealign.com in the Frealign bin directory: just add an "exit" at the beginning of the script. This will stop the monitoring script that kills Frealign. Please note that the command frealign_kill will no longer work, so if a job crashes or does not terminate normally, you should make sure that no Frealign jobs are left running.
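For anyone who wants to script the monitor_frealign.com edit described above, here is a minimal Python sketch. It rests on assumptions not confirmed in this thread: that the script may begin with a "#!" interpreter line (in which case the "exit" is placed directly after it so the script still launches), and that keeping a ".orig" backup next to it is acceptable. The function name disable_monitor is made up for illustration.

```python
import shutil
from pathlib import Path


def disable_monitor(script_path):
    """Insert an 'exit' line at the top of a script, after any '#!' line.

    Returns True if the file was changed, False if it already starts
    with 'exit'. A backup copy named '<script>.orig' is written first.
    """
    path = Path(script_path)
    lines = path.read_text().splitlines(keepends=True)

    # Assumption: if the first line is an interpreter line, keep it first.
    insert_at = 1 if lines and lines[0].startswith("#!") else 0

    # Do nothing if the script has already been disabled.
    if len(lines) > insert_at and lines[insert_at].strip() == "exit":
        return False

    # Keep an untouched copy so the change is easy to revert.
    shutil.copy2(path, path.with_name(path.name + ".orig"))

    # Make sure the interpreter line ends with a newline before inserting.
    if insert_at and not lines[0].endswith("\n"):
        lines[0] += "\n"

    lines.insert(insert_at, "exit\n")
    path.write_text("".join(lines))
    return True


# Example call (the path is hypothetical; use your own Frealign bin directory):
# disable_monitor("/path/to/frealign/bin/monitor_frealign.com")
```

To re-enable monitoring, copy the ".orig" backup over the edited script.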

In reply to niko

I came up with a workaround: I copied Frealign 9.09 into the Frealign 9.08 bin directory, so that I could use the working 9.08 control scripts with the latest 9.09 version.

Thanks for checking. I suspect it may have something to do with our cluster, or perhaps I made a mistake in my mparameters.

Axel

In reply to niko

If you still run into this problem, there might be an issue with files that do not get updated fast enough in the scratch directory. This appears to be an issue on some file systems. If you think this is the problem you can try editing the file monitor_frealign.com in the Frealign bin directory. Just add an "exit" at the beginning of the script. This will stop the monitoring script that kills Frealign.

This solved my problem on a PBS cluster. The qstat routine implemented in monitor_frealign is not working on this system, so every run is killed immediately.
Thanks for the idea of exiting the monitor script right at the start.
Any drawbacks besides the non-functional frealign_kill?
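
On the drawback already mentioned in this thread (with the monitor disabled, frealign_kill stops working, so leftover jobs have to be found by hand), here is a minimal Python sketch of one way to look for them. It assumes that leftover processes have "frealign" somewhere in their command line, which may not hold for every script in the distribution, and it only sees processes on the machine where it runs. Jobs on other cluster nodes still need to be checked through the scheduler. The function names are made up for illustration.

```python
import subprocess


def find_frealign_processes(ps_output):
    """Return (pid, command) pairs whose command line mentions 'frealign'.

    Expects text in the format printed by 'ps -eo pid,args'.
    """
    matches = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        # The header line is skipped because its first field is not a number.
        if len(parts) == 2 and parts[0].isdigit():
            if "frealign" in parts[1].lower():
                matches.append((int(parts[0]), parts[1]))
    return matches


def list_local_frealign_processes():
    """Run ps on the local machine and filter its output."""
    result = subprocess.run(
        ["ps", "-eo", "pid,args"],
        capture_output=True,
        text=True,
        check=True,
    )
    return find_frealign_processes(result.stdout)


# Example use on a node where a run may have been left behind:
# for pid, command in list_local_frealign_processes():
#     print(pid, command)
```

The sketch only lists candidates. Deciding which ones to kill is left to you, since the match is by name and could include unrelated processes.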