ctffind4 crashes sometimes depending on compiler version and CPU
We have a dataset of 860 .mrc files.
With the compiled binary from your site, some of the 860 processing jobs died.
Therefore I recompiled your source with the Intel 2013 suite (MKL, static):
DEST=${SW_HOME}/${PKG_NAME}/${PKG_VER}
FC=ifort F77=ifort \
./configure --prefix=${DEST} --enable-static
make -j ${BUILD_CPUS} 2>&1 | tee make.log
We can process at 2 sites:
export OMP_NUM_THREADS=1
1.) Site 1 (Scientific Linux 6.5, AMD Opteron(tm) Processor 6380; this is where we built the executable)
About 10 of the 860 processes die (always the same data files) with
forrtl: error (73): floating divide by zero
2.) Site 2 (Debian 7 Linux, Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz)
About 23 of the 860 processes die (always the same data files at site 2, but different ones than at site 1), with the same divide-by-zero error.
Then I used the Intel 2015 compiler, but the only difference is that on
- Site 1: now only 3 of the 860 die (these were already among the 10 before)
- Site 2: no change, the same processes/data files die as with the Intel 2013 version
gdb backtrace:
#0 0x000000000168c31b in raise ()
#1 0x00000000017ed6c5 in abort ()
#2 0x00000000016aee07 in for__signal_handler ()
#3  &lt;signal handler called&gt;
#4 0x0000000001898e12 in images_MP_zerofloatandnormalise_ ()
#5 0x00000000018730a0 in ctffind_IP_main_ ()
#6 0x000000000186d6a3 in MAIN__ ()
#7 0x000000000040054e in main ()
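A debug rebuild along these lines (the flags and variables here are only a sketch, not the exact configure we used) should make frame #4 resolve to a source file and line number:
# debug variant of the same build; -g -O0 -traceback are assumed ifort flags
FC=ifort F77=ifort FCFLAGS="-g -O0 -traceback" FFLAGS="-g -O0 -traceback" \
./configure --prefix=${DEST}-debug --enable-static
make -j ${BUILD_CPUS} 2>&1 | tee make_debug.log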
thanks for the bug report
Thanks for the bug report. I am away from work until December 30th, but should be able to investigate & fix this then.
Is it possible that these images have zero variance (i.e. are all zeroes, or all ones)? The divide by zero is probably occurring when there is a division by the variance of the image. Or maybe they have very, very small variances? If you find such a feature in those images, you may be able to come up with a workaround until I fix this.
Alexis
time for fixing is no
In reply to thanks for the bug report by Alexis
Time for fixing is no problem, ctffind3 still runs fine.
I don't think it's zero variance in the files, since the crashing files are different depending on the site. So I was a little clueless as to how this can happen between different CPUs.
wolfgang
Hi Wolfgang, Would you be
In reply to time for fixing is no by wlmo
Hi Wolfgang,
Would you be willing to share a few micrographs which made ctffind crash? If so, could you please get in touch with me via email (rohoua@janelia.hhmi.org) so we can coordinate the logistics of this?
Many thanks
Alexis
hi alexis, thank you for your
In reply to Hi Wolfgang, Would you be by Alexis
Hi Alexis,
Thank you for your support.
I sent you some problem data by email.
Zip-compressing the folder containing the problem images gives a compression rate of 95%!
The good images have a compression rate of about 20%.
So there must be something wrong with the problem images.
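A quick screen based on that observation (just a sketch, not a tested tool) would be to list each micrograph's compression ratio and flag the ones that shrink to almost nothing:
# flag micrographs that compress suspiciously well (blank images are
# nearly all zeroes, so gzip shrinks them to a small fraction of the original)
for f in *.mrc; do
    packed=$(gzip -c "$f" | wc -c)
    orig=$(stat -c %s "$f")
    echo "$f: compressed size is $((100 * packed / orig))% of the original"
done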
cheers,
wolfgang
blank images
In reply to hi alexis, thank you for your by wlmo
Hi Wolfgang,
Thanks for sending the problem images. They are blank (i.e. every pixel is 0.0000), which explains the crash.
I will add a check in ctffind so that if the image is blank, it will crash out straight after opening the file.
If you find a problem with other images which are not blank, do let me know.
Alexis
ctffind4 v4.0.7 and v4.0.8
In reply to thanks for the bug report by Alexis
Hi Alexis
We noticed similar situations here. CTFFIND4 v4.0.7 works on both AMD and Intel processors, while the latest version, CTFFIND4 v4.0.8, only works on Intel processors.
Thanks,
Ming
Hi Ming, I expect the
In reply to ctffind4 v4.0.7 and v4.0.8 by Ming
Hi Ming,
I expect the pre-compiled binary from the website would only work on Intel processors with AVX instruction sets. Running it on AMD or on older Intel processors will likely require you to compile from source.
Can you detail a bit more what you were testing exactly?
Thanks
Alexis
Hi Ming, Ctffind 4.0.9, now
In reply to Hi Ming, I expect the by Alexis
Hi Ming,
Ctffind 4.0.9, now available, is distributed as a binary which should run on older processors without AVX instruction support. Please see the 'compat' tar ball.
Thanks
Alexis
i made a debug version and
I made a debug version and now the backtrace is (site 2):
(gdb) bt
#0 0x000000000168c31b in raise ()
#1 0x00000000017ed6c5 in abort ()
#2 0x00000000016aee07 in for__signal_handler ()
#3  &lt;signal handler called&gt;
#4 0x0000000001898c22 in images::zerofloatandnormalise (self=Cannot access memory at address 0x1002
) at core/images_core.f90:331
#5 0x00000000018730a0 in ctffind_IP_main_ ()
#6 0x000000000186d6a3 in MAIN__ ()
#7 0x000000000040054e in main ()
I don't know Fortran, but could it be:
self%real_values = self%real_values / self%GetSigmaOfValues() * new_standard_deviation
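If GetSigmaOfValues() returns 0.0 for a blank image, that division would raise exactly the error above. A standalone sketch of the same pattern with a guard (illustrative only, not the actual ctffind source; the names and the check are assumptions):
! standalone sketch, not the ctffind source: the same normalisation
! pattern, plus a guard against a (near-)zero standard deviation
program normalise_guard
    implicit none
    real, dimension(6) :: real_values
    real :: mean, sigma, new_standard_deviation

    real_values = 0.0                 ! a blank image: every pixel is 0.0000
    new_standard_deviation = 1.0

    mean  = sum(real_values) / size(real_values)
    sigma = sqrt(sum((real_values - mean)**2) / size(real_values))

    if (sigma .lt. tiny(sigma)) then
        write(*,*) 'blank image (zero variance) - skipping normalisation'
    else
        real_values = real_values / sigma * new_standard_deviation
    end if
end program normalise_guard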