GPU frealign with GeForce GTX 580?


I'm having problems getting the GPU version of frealign working. It compiles without error, but running frealign_v8.exe_ref always results in crashes, with a variety of Segmentation faults, Aborts, and silent crashes. The errors, when any errors are reported in the logs, are memory-related: "free(): invalid next size (normal)" or "double free or corruption (!prev)", for example. The errors are not consistent, even with the same input data (which would tend to reinforce the hypothesis of memory errors). Most but not all of the errors produce core files.

I'm running on Red Hat 6.0 servers with 8 x GeForce GTX 580 cards and no display. I'm using CUDA 3.2 and gcc 4.4.5. I am using GeFREALIGNv8.06_110514 downloaded from this site. I see similar errors across all nodes in our cluster, so I am fairly sure it's not a hardware problem.

Any advice on how to begin troubleshooting these errors? Thanks!
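One way to start is to tally which crash signatures show up across repeated runs on the same input, so that it is at least clear how inconsistent the failures are. The following is a minimal sketch, not part of frealign: the patterns are based on the messages quoted above, and the function names and the idea of passing log text in a dictionary are assumptions made for illustration.

```python
import re
from collections import Counter

# Signatures based on the messages quoted in the post above.
# The labels are illustrative; extend the table as new messages appear.
SIGNATURES = {
    "heap corruption (invalid next size)": re.compile(r"free\(\): invalid next size"),
    "double free": re.compile(r"double free or corruption"),
    "segmentation fault": re.compile(r"segmentation fault", re.IGNORECASE),
    "abort": re.compile(r"\bAbort(ed)?\b"),
}


def classify_log(text):
    """Count how many lines of one log match each crash signature."""
    counts = Counter()
    for line in text.splitlines():
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


def summarize(logs):
    """Count runs per signature.

    logs maps a run name (hypothetical, e.g. a log file name) to its text.
    A run whose log matches nothing is counted as a silent crash.
    """
    summary = Counter()
    for text in logs.values():
        found = classify_log(text)
        if found:
            summary.update(found.keys())
        else:
            summary["silent (no message)"] += 1
    return summary
```

Feeding it the logs from, say, ten identical runs gives a per-signature run count. If the distribution changes from batch to batch on identical input, that supports the memory-corruption hypothesis rather than a deterministic bug in one code path.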

I guess it may be related to the version of gcc. Try gcc 3.4, because CUDA 3.2 may not be compatible with gcc 4.4; CUDA and gcc versions that were not meant to work together are often incompatible. I always suggest using CUDA 3.2, gcc 3.4, g++ 3.4, and g77 (compatible with gcc 3.4). Even though the version numbers are lower, this does not affect performance.
Also try rebooting your computer if you see any weird errors. I have had a lot of experience with bad GPUs: about 20% of GTX GPUs are bad, at least for scientific computation, and most errors can be solved temporarily by rebooting. After several days of running, the errors come back. The answer from the vendor is that GTX cards are for games, not computation, and they suggest we buy Tesla cards!

In reply to xueming

Thanks for the suggestion. I haven't had time to try installing and using an older version of gcc yet. The GPU cluster I'm running on is a shared resource, and I can't change the installed software, especially going backward. That complicates my ability to test older gcc versions quite a bit.

And that leads to a question. Do you have any plans for GeFREALIGN to support newer versions of gcc and CUDA? CUDA 3.2 is fairly new, but gcc 3.4 is about six years old at this point.

I'm also debugging the specific crash I'm seeing, and I'll post more here if I find anything.

In reply to djo

Thanks for all your help! Unfortunately, there are currently no plans to update GeFREALIGN. We might try to switch over to a compiler that supports execution on GPUs. However, the processing of our data on traditional CPU clusters has been quite adequate for us so far. Therefore, the development of an improved GPU version does not have the highest priority at the moment.

I will update GeFrealign later once I have time.
CUDA 3.2 doesn't support gcc 4; NVIDIA always suggests using gcc 3.4. A newer version doesn't mean better or faster, just more useless features (at least for me) :).
If you install gcc 3.4, you don't need to uninstall the current gcc. They can coexist on the same system. That's what most people do.

Maybe you can try the newest CUDA, 4.2; it may work. I haven't tried it myself, though, and I'm not sure whether CUDA 4.2 works with gcc 4.4.5. gcc is updated so fast that CUDA cannot keep up with it.
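On a shared cluster it can help to check which gcc versions are already installed side by side before asking an administrator for anything. The sketch below is only an illustration: the executable names in CANDIDATES are guesses (distributions name side-by-side compilers differently), and the target version (3, 4) simply reflects the suggestion above.

```python
import re
import shutil
import subprocess

# Hypothetical executable names; adjust to whatever your system uses.
CANDIDATES = ["gcc", "gcc34", "gcc-3.4"]


def parse_gcc_version(version_output):
    """Extract (major, minor) from the first line of `gcc --version` output."""
    first_line = version_output.splitlines()[0] if version_output else ""
    match = re.search(r"(\d+)\.(\d+)", first_line)
    return (int(match.group(1)), int(match.group(2))) if match else None


def installed_compilers(candidates=CANDIDATES):
    """Map the path of each candidate found on PATH to its (major, minor)."""
    found = {}
    for name in candidates:
        path = shutil.which(name)
        if path is None:
            continue  # this name is not installed
        result = subprocess.run([path, "--version"], capture_output=True, text=True)
        found[path] = parse_gcc_version(result.stdout)
    return found


def matching(found, wanted=(3, 4)):
    """Return the paths whose version equals the wanted (major, minor)."""
    return [path for path, version in found.items() if version == wanted]
```

Calling `matching(installed_compilers())` returns the paths of any gcc 3.4 installs that are already present. An empty list means a 3.4 compiler would have to be installed alongside the existing one.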

I've continued to work intermittently on trying to get the GPU version of frealign working on our GPU cluster with nothing more than occasional ambiguous successes. While I solved the specific problems I reported in my first post, I am continuing to see various kinds of segmentation faults, hangs, and silent crashes when running frealign for useful lengths of time. At this point I am still not sure whether the problems I am seeing are due to hardware or software issues.

So at this point I am wondering whether anyone other than the author has had success running frealign on GPU clusters, and if so, which GPUs they are running. I see other threads on this forum from people having problems, but none of them have posted follow-ups reporting success. I'm particularly interested in the use of "gaming" cards vs. high-end cards intended for computation (e.g., Tesla cards).

In reply to djo

I am sorry to say that the GPU version of Frealign has caused problems for many people and these problems have never really been sorted out. The GPU version was programmed by the Cheng lab at UCSF and I have not been able to look into it myself to be able to fix things.

I think that with the new multi-core CPUs it becomes less urgent to use GPUs. We are not using the GPU version in the lab because the processing speed has never been a bottleneck for us.