ctffind4 crashes sometimes depending on compiler version and cpu

we have a dataset of 860 .mrc files.

with the compiled binary from your site, some of the 860 processing jobs died.

therefore i recompiled your source with the intel 2013 suite (mkl, static).
DEST=${SW_HOME}/${PKG_NAME}/${PKG_VER}
FC=ifort F77=ifort \
./configure --prefix=${DEST} --enable-static
make -j ${BUILD_CPUS} 2>&1 | tee make.log

we can process at 2 sites:
export OMP_NUM_THREADS=1

1.) site 1 (scientific linux 6.5, AMD Opteron(tm) Processor 6380, there we made the executable)
of the 860 processes, about 10 die (always the same datafiles) with
forrtl: error (73): floating divide by zero

2.) site 2 (debian 7 linux, Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz)
of the 860 processes, about 23 die (always the same datafiles at site 2, but different ones from site 1),
also with the same divide by zero error

then i used the intel 2015 compiler, but the only difference is that on
- site 1: now only 3 of the 860 die (were already part of the 10 before)
- site 2: no change, the same processes/datafiles as the intel 2013 version die

gdb backtrace:
#0 0x000000000168c31b in raise ()
#1 0x00000000017ed6c5 in abort ()
#2 0x00000000016aee07 in for__signal_handler ()
#3 <signal handler called>
#4 0x0000000001898e12 in images_MP_zerofloatandnormalise_ ()
#5 0x00000000018730a0 in ctffind_IP_main_ ()
#6 0x000000000186d6a3 in MAIN__ ()
#7 0x000000000040054e in main ()

Thanks for the bug report. I am away from work until December 30th, but should be able to investigate & fix this then.

Is it possible that these images have zero variance (i.e. are all zeroes, or all ones)? The divide by zero is probably occurring when there is a division by the variance of the image. Or maybe they have very, very small variances? If you find such a feature in those images, you may be able to come up with a workaround until I fix this.

Alexis
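
The zero-variance hypothesis can be tested without ctffind. The following is a minimal sketch, not part of ctffind: it assumes little-endian MRC files with the standard 1024-byte header and one of the common pixel modes (0, 1, 2 or 6), and the function name is_constant_mrc is hypothetical. An image has zero variance exactly when its minimum and maximum pixel values are equal.

```python
import array
import struct
import sys

# MRC mode -> array typecode; only these common real-valued modes are handled.
MODE_TYPECODES = {0: "b", 1: "h", 2: "f", 6: "H"}


def is_constant_mrc(path):
    """Return True if every pixel in the MRC file has the same value."""
    with open(path, "rb") as fh:
        header = fh.read(1024)
        # Words 1-4 of the header: nx, ny, nz, mode (assumed little-endian).
        nx, ny, nz, mode = struct.unpack("<4i", header[:16])
        # Size of the extended header, which sits between header and data.
        (nsymbt,) = struct.unpack("<i", header[92:96])
        if mode not in MODE_TYPECODES:
            raise ValueError(f"unsupported MRC mode {mode}")
        fh.seek(1024 + nsymbt)
        pixels = array.array(MODE_TYPECODES[mode])
        pixels.frombytes(fh.read(nx * ny * nz * pixels.itemsize))
    if sys.byteorder == "big":
        pixels.byteswap()  # file is assumed little-endian
    return min(pixels) == max(pixels)


if __name__ == "__main__":
    # Print the names of all constant (zero-variance) images given as arguments.
    for name in sys.argv[1:]:
        if is_constant_mrc(name):
            print(name)
```

Running this over all 860 files and comparing the printed names with the list of crashing jobs would show whether blank images account for the failures. It does not detect the "very, very small variance" case.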

In reply to by Alexis

time for fixing is no problem, ctffind3 still runs fine.

i don't think it's zero variance in the files, since the crashing files are different depending on the site. so i was a little bit clueless how this can happen between different cpus.

wolfgang

In reply to by Alexis

hi alexis,

thank you for your support.
i sent you some problem data by email.

zip compressing the folder containing the problem images gives a rate of 95%!
the good images are about 20% compression rate.
so there must be something wrong with the problem images.

cheers,
wolfgang
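
The compression heuristic above can be applied per file rather than per folder. This is a rough sketch under assumptions: it uses DEFLATE via Python's zlib module (the same family of compression zip normally uses), it reads the "rate" as the fraction of bytes saved, and the 0.9 cutoff and the name space_saving are hypothetical choices, not values from the thread.

```python
import sys
import zlib


def space_saving(path, chunk_size=1 << 20):
    """Fraction of bytes removed by DEFLATE (near 1.0 means highly redundant data)."""
    compressor = zlib.compressobj()
    raw = packed = 0
    with open(path, "rb") as fh:
        # Stream the file in chunks so large micrographs need little memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            raw += len(chunk)
            packed += len(compressor.compress(chunk))
    packed += len(compressor.flush())
    return 1.0 - packed / raw if raw else 0.0


if __name__ == "__main__":
    SUSPICIOUS = 0.9  # hypothetical cutoff; tune it against known-good images
    for name in sys.argv[1:]:
        saving = space_saving(name)
        if saving > SUSPICIOUS:
            print(f"{name}: {saving:.0%} saved, possibly blank")
```

A very high saving only says the file is highly redundant. It is a cheap first filter, not proof that an image is blank.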

In reply to by wlmo

Hi Wolfgang,

Thanks for sending the problem images. They are blank (i.e. every pixel is 0.0000), which explains the crash.

I will add a check in ctffind so that if the image is blank, it will crash out straight after opening the file.

If you find a problem with other images which are not blank, do let me know.

Alexis

In reply to by Alexis

Hi Alexis

We noticed a similar situation here. CTFFIND4 v4.0.7 works on both AMD and Intel processors, while the latest version, CTFFIND4 v4.0.8, only works on Intel processors.

Thanks,
Ming

In reply to by Ming

Hi Ming,

I expect the pre-compiled binary from the website would only work on Intel processors with AVX instruction sets. Running it on AMD or on older Intel processors should require you to compile from source.

Can you detail a bit more what you were testing exactly?

Thanks
Alexis
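
Whether a given Linux machine advertises AVX can be checked from /proc/cpuinfo. The sketch below assumes an x86 Linux host, where the kernel lists CPU features on lines starting with "flags"; the function names are hypothetical. It only reports what the kernel advertises and says nothing about whether a particular binary will actually run on that CPU.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_avx(cpuinfo_path="/proc/cpuinfo"):
    """Return True if the kernel reports the 'avx' feature flag."""
    with open(cpuinfo_path) as fh:
        return "avx" in cpu_flags(fh.read())


if __name__ == "__main__":
    print("AVX advertised:", has_avx())
```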

In reply to by Alexis

Hi Ming,

Ctffind 4.0.9, now available, is distributed as a binary which should run on older processors without AVX instruction support. Please see the 'compat' tar ball.

Thanks
Alexis

i made a debug version and now the backtrace is (site 2):
(gdb) bt
#0 0x000000000168c31b in raise ()
#1 0x00000000017ed6c5 in abort ()
#2 0x00000000016aee07 in for__signal_handler ()
#3 <signal handler called>
#4 0x0000000001898c22 in images::zerofloatandnormalise (self=Cannot access memory at address 0x1002
) at core/images_core.f90:331
#5 0x00000000018730a0 in ctffind_IP_main_ ()
#6 0x000000000186d6a3 in MAIN__ ()
#7 0x000000000040054e in main ()

i don't know fortran, but could it be:
self%real_values = self%real_values / self%GetSigmaOfValues() * new_standard_deviation
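
To illustrate why a line of this shape fails on a constant image, here is a small Python sketch of the same idea. It is not the ctffind code: the function name and the mean-subtraction step are assumptions based only on the routine's name, and only the division by sigma mirrors the Fortran line quoted above. For a constant image sigma is exactly 0, so the division cannot succeed and a guard is needed.

```python
import math


def zero_float_and_normalise(values, new_standard_deviation=1.0):
    """Shift values to zero mean, then rescale to the requested standard deviation."""
    mean = sum(values) / len(values)
    centred = [v - mean for v in values]
    sigma = math.sqrt(sum(c * c for c in centred) / len(centred))
    if sigma == 0.0:
        # Without this guard the division below is a divide by zero.
        raise ValueError("image has zero variance; cannot normalise")
    return [c / sigma * new_standard_deviation for c in centred]
```

For example, zero_float_and_normalise([1.0, 3.0]) returns [-1.0, 1.0], while a list of identical values raises ValueError instead of dividing by zero.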