2011-10-28

Yet another GCC e500v2 bug bites the Debian powerpcspe port

I just spent probably 10 hours poring over the libffi assembly by hand and under GDB, completely sure that a segmentation fault was caused by a bug in there.  I finally figured out that the stack (or the unwind data or something) was being overwritten, and that it was happening inside the innermost function.

After literally single-stepping through about 8000 lines of ASM and not being able to find anything wrong with the libffi parts, I finally got fed up.  I ripped the testcase out and created a basic C++ function that sets up the same data-structures without using any assembly at all, and that failed too!!!

The most annoying thing I ran into while debugging was that GDB would tell me the stack was garbage, but if I actually followed the stack pointers by hand it all looked perfect.  For example, this is what GDB gave me under my stripped-down testcase:

  (gdb) bt
  #0  closure_test_fn1 (cif=<value optimized out>, resp=0xbffff46c,
      args=<value optimized out>, userdata=<value optimized out>)
      at unwindtestfunc.cc:39
  #1  0x00000001 in ?? ()
  #2  0x00000001 in ?? ()
  Backtrace stopped: previous frame inner to this frame (corrupt stack?)



And yet when I printed the saved return addresses by hand, it looked like a valid stack:
  (gdb) print (void (*)(void))*($r1 + 4)
  $1 = (void (*)(void))
       0x10000970 <closure_test_fn1(ffi_cif*, void*, void**, void*)+320>
  (gdb) print (void (*)(void))*(*$r1 + 4)
  $2 = (void (*)(void)) 0x100006a0 <main()+324>


AAARRRGGGHHH!!!!  I hate GCC bugs!!!


The issue apparently crops up only when building with "-Os" (not with "-O2", which is almost the same), so there's probably a really stupid bug hanging around somewhere, but the stack itself looks fine and I don't understand the C++ unwind data-structures well enough to track it down.

So I filed GCC PR target/50906: GCC miscompiles e500v2 floating-point code that uses exceptions. This was causing the libffi testsuite to fail miserably with a SIGSEGV in "unwindtest.cc" when built with "-Os".

This is not exactly the first major issue that mainline GCC has had with this sub-architecture port: see PR44169, PR44364, and PR44606. It's further worth noting that despite our pleas to Freescale, it ended up being an IBM developer (Alan Modra from Australia) who helped us get those previous bugs solved.

I'm hopeful that this will be a relatively obvious bug and therefore very easily solved.

Cheers,
Kyle Moffett
