This is the mail archive of the cygwin-developers mailing list for the Cygwin project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Bad codegen in pthread_mutex causing 100% cpu spin loop related to inline asm ilockcmpexchg


    Hello,

  I've been on the track of a problem in pthread_mutex that showed up during a
recent libstdc++ testsuite run.  The notes I have here all relate to a
self-built version of the cygwin DLL from 20090421, but I've verified that the
code has compiled exactly the same way in 1.7.0-48.

  The attached testcase, compiled by:

g++-4 -v pthread7-rope.cc -o pthread7-rope.exe --save-temps

gets stuck in a cpu-hogging spin loop for me, once or twice out of ten times
at the command line, or almost every time if run under GDB.  It turns out that
the hogging thread is constructing a pthread_mutex object:

Thread 5323 (thread 1800.0xab0):
#0  pthread_mutex (this=0x100bf858, attr=0x0)
    at /gnu/winsup/src/winsup/cygwin/thread.h:126
#1  0x610d3310 in pthread_mutex::init (mutex=0x1007a43c, attr=0x0,
    initializer=0x14) at /gnu/winsup/src/winsup/cygwin/thread.cc:2688
#2  0x610d6fa3 in pthread_mutex_lock (mutex=0x1007a43c)
    at /gnu/winsup/src/winsup/cygwin/thread.cc:2738
#3  0x610b2078 in _sigfe () from /usr/bin/cygwin1.dll

and it's stuck in a very tight spinloop:

Program received signal SIGTRAP, Trace/breakpoint trap.
[Switching to thread 1800.0xab0]
pthread_mutex (this=0x100bf858, attr=0x0)
    at /gnu/winsup/src/winsup/cygwin/winbase.h:55
55		": "=a" (__res), "=q" (t) : "1" (t), "q" (v), "0" (c): "cc");
126	  do
55		": "=a" (__res), "=q" (t) : "1" (t), "q" (v), "0" (c): "cc");
126	  do
55		": "=a" (__res), "=q" (t) : "1" (t), "q" (v), "0" (c): "cc");
126	  do
55		": "=a" (__res), "=q" (t) : "1" (t), "q" (v), "0" (c): "cc");
  [ etc. ]

  The relevant code is the final line of the c-tor:

pthread_mutex::pthread_mutex (pthread_mutexattr *attr) :
          [ . . . ]
{
          [ . . . ]
  mutexes.insert (this);
}

which is being inlined from these two functions:

(winbase.h)
50	ilockcmpexch (volatile long *t, long v, long c)
51	{
52	  register int __res;
53	  __asm__ __volatile__ ("\n\
54		lock cmpxchgl %3,(%1)\n\
55		": "=a" (__res), "=q" (t) : "1" (t), "q" (v), "0" (c): "cc");
56	  return __res;
57	}

(thread.cc)
121	template <class list_node> inline void
122	List_insert (list_node *&head, list_node *node)
123	{
124	  if (!node)
125	    return;
126	  do
127	    node->next = head;
128	  while (InterlockedCompareExchangePointer (&head, node, node->next) !=
node->next);
129	}

... which gets compiled into this loop at -O2:

	; esi = &head
0x610cf9a7 <pthread_mutex+135>:	mov    $0x611edc98,%esi
	; ecx = head
0x610cf9ac <pthread_mutex+140>:	mov    0x611edc98,%ecx
	; [ scheduled insn from earlier code ]
0x610cf9b2 <pthread_mutex+146>:	mov    %eax,0x1c(%ebx)
	; do ...
	; edx = &head
0x610cf9b5 <pthread_mutex+149>:	mov    %esi,%edx
	; eax = head
0x610cf9b7 <pthread_mutex+151>:	mov    %ecx,%eax
	; InterlockedCompareExchangePointer (&head, node, node->next)
0x610cf9b9 <pthread_mutex+153>:	lock cmpxchg %ebx,(%edx)
	; ... while ( != node->next);
0x610cf9bd <pthread_mutex+157>:	cmp    %eax,%ecx
	; ... while ( != node->next);
0x610cf9bf <pthread_mutex+159>:	jne    0x610cf9b5 <pthread_mutex+149>
	; node->next = head;
0x610cf9c1 <pthread_mutex+161>:	mov    %ecx,0x24(%ebx)

  As you can see the store to node->next has been sunk until after the
do..while loop.  eax is being set to the value of the head pointer rather than
 node->next under the assumption that the head pointer will not change before
we fall out the bottom of the loop so the value will be the same.

  That assumption doesn't hold however, as the whole point of using a locked
insn is that other threads might write the head pointer in between when we
copy it into our node's next pointer and when we then update it to point to
our new node.  The idea is to loop back if it has changed, reset our node to
point now to the new head, and try again.  This fails badly if the store to
node->next is moved out the bottom of the loop, because then the compare part
of the cmpxchgl fails, and we loop round and try again without reloading the
new head pointer into node->next for comparison; instead, it restores the
cached value from ecx that it plans to test it against after the cmpxchg.

  Result, we repeatedly compare against the same stale value and loop around
forever.

  We need it to know that head may have changed after the asm.  It may be
volatile in the scope of ilockcmpexch, but in the scope of List_insert it's
just an ordinary list_node* that isn't expected to change.  I think that at
least one part of the problem is an incorrect asm constraint; my reading of
them is that they say:

53	  __asm__ __volatile__ ("\n\
54		lock cmpxchgl %3,(%1)\n\
55		": "=a" (__res), "=q" (t) : "1" (t), "q" (v), "0" (c): "cc");

operand 0: OUT __res, constraint "=a" -> write-only, %eax
operand 1: OUT t,     constraint "=q" -> write-only, %eax/%ebx/%ecx/%edx
operand 2: IN  t,     constraint "1"  -> same as operand 1.
operand 3: IN  v,     constraint "q"  -> %eax/%ebx/%ecx/%edx
operand 4: IN  c,     constraint "0"  -> same as operand 0.
operand 5: CLOBBER cc

  Note how the assembly loop above reloads %edx (IN 2 t) and %eax (IN 4 c)
every time round.  However it does not realise that head (*t) has changed, so
does not reload it before rewriting the value to node->next, and that's what
allows the store to fall out the bottom of the loop.  Our operand 1 tells GCC
that the value of the pointer t has changed, when it hasn't, rather than that
the value it points to may have changed.

  I see that there's been a bit of back and forth over what to do with these
asms before, so I'm cautious to rush at suggesting a solution.

  This is from the source file gcc generates when compiling thread.o as part
of a cygwin dll build; it corresponds exactly to the code from 0x610cf9a7 to
0x610cf9c1 in the disassembly above.

	movl	$__ZN13pthread_mutex7mutexesE+8, %esi	 #, pretmp.1732
LVL141:
	movl	__ZN13pthread_mutex7mutexesE+8, %ecx	 # mutexes.head, pretmp.1729
	movl	%eax, 28(%ebx)	 # <variable>.mutextype, <variable>.type
	.p2align 4,,7
L186:
LBB1351:
LBB1352:
LBB1353:
LBB1354:
LBB1355:
LBB1356:
LBB1357:
	.loc 2 53 0
	movl	%esi, %edx	 # pretmp.1732, t
LVL142:
	movl	%ecx, %eax	 # pretmp.1729, __res
LVL143:
/APP
 # 53 "/gnu/winsup/src/winsup/cygwin/winbase.h" 1
	
	lock cmpxchgl %ebx,(%edx)	 # this, t
	
 # 0 "" 2
/NO_APP
LBE1357:
LBE1356:
LBE1355:
	.loc 3 126 0
	cmpl	%eax, %ecx	 # __res, pretmp.1729
	jne	L186	 #,
	movl	%ecx, 36(%ebx)	 # pretmp.1729, <variable>.next


  This is what happens when I change the first output operand from (t) to (*t)
to try and tell the compiler the pointer value may have changed:

extern __inline__ long
ilockcmpexch (volatile long *t, long v, long c)
{
  register int __res;
  __asm__ __volatile__ ("\n\
	lock cmpxchgl %3,(%1)\n\
	": "=a" (__res), "=q" (*t) : "1" (t), "q" (v), "0" (c): "cc");
  return __res;
}

	movl	$__ZN13pthread_mutex7mutexesE+8, %esi	 #, pretmp.1732
LVL127:
	movl	%eax, 28(%ebx)	 # <variable>.mutextype, <variable>.type
	.p2align 4,,7
L180:
LBB1373:
LBB1374:
LBB1375:
LBB1376:
	.loc 3 127 0
	movl	__ZN13pthread_mutex7mutexesE+8, %eax	 # mutexes.head,
LBB1377:
LBB1379:
LBB1381:
	.loc 2 53 0
	movl	%esi, %ecx	 # tmp68,
LBE1381:
LBE1379:
LBE1377:
	.loc 3 127 0
	movl	%eax, -16(%ebp)	 #, D.28606
LBB1384:
LBB1378:
LBB1380:
	.loc 2 53 0
/APP
 # 53 "/gnu/winsup/src/winsup/cygwin/winbase.h" 1
	
	lock cmpxchgl %ebx,(%ecx)	 # this,
	
 # 0 "" 2
/NO_APP
LBE1380:
LBE1378:
LBE1384:
	.loc 3 126 0
	cmpl	%eax, -16(%ebp)	 # __res, D.28606
LVL128:
LBB1385:
LBB1383:
LBB1382:
	.loc 2 53 0
	movl	%ecx, __ZN13pthread_mutex7mutexesE+8	 # tmp68,
LBE1382:
LBE1383:
LBE1385:
	.loc 3 126 0
	jne	L180	 #,
	movl	-16(%ebp), %eax	 # D.28606,
LVL129:
	movl	%eax, 36(%ebx)	 #, <variable>.next
LBE1376:

  It's a bit better but still not right; it has divined that the of head (*t)
will change and need reloading, but rather than storing it into node->next it
is for some unknown reason storing it into a temporary stack slot and then
moving it from there to node->next after the end of the loop!  This also isn't
good, because it means our cmpxchg might rewrite the head pointer to point to
our new node, and then another thread jump in and see that and inspect our new
node (and its next pointer) before we've got out the bottom of the loop and
set that next pointer to link onto the head of the original chain.

  So, the only form that I could get to generate what looks like correct
source is by adding back the "memory" clobber, which was already removed once
for causing odd behaviour.  The only thing I'm doing different this time is
combining that with also altering the output operand #1 from (t) to (*t).
That give me this source:

extern __inline__ long
ilockcmpexch (volatile long *t, long v, long c)
{
  register int __res;
  __asm__ __volatile__ ("\n\
	lock cmpxchgl %3,(%1)\n\
	": "=a" (__res), "=q" (*t) : "1" (t), "q" (v), "0" (c): "memory", "cc");
  return __res;
}

	movl	$__ZN13pthread_mutex7mutexesE+8, %edi	 #, pretmp.1730
	movl	%eax, 28(%ebx)	 # <variable>.mutextype, <variable>.type
	.p2align 4,,7
L180:
LBB1373:
LBB1374:
LBB1375:
LBB1376:
	.loc 3 127 0
	movl	__ZN13pthread_mutex7mutexesE+8, %edx	 # mutexes.head, D.28606
LBB1377:
LBB1379:
LBB1381:
	.loc 2 53 0
	movl	%edi, %ecx	 # tmp68,
LBE1381:
LBE1379:
LBE1377:
	.loc 3 127 0
	movl	%edx, 36(%ebx)	 # D.28606, <variable>.next
LBB1384:
LBB1378:
LBB1380:
	.loc 2 53 0
	movl	%edx, %eax	 # __res,
LVL126:
/APP
 # 53 "/gnu/winsup/src/winsup/cygwin/winbase.h" 1
	
	lock cmpxchgl %ebx,(%ecx)	 # this,
	
 # 0 "" 2
/NO_APP
LBE1380:
LBE1378:
LBE1384:
	.loc 3 126 0
	cmpl	%eax, 36(%ebx)	 # __res, <variable>.next
LBB1385:
LBB1383:
LBB1382:
	.loc 2 53 0
	movl	%ecx, __ZN13pthread_mutex7mutexesE+8	 # tmp68,
LBE1382:
LBE1383:
LBE1385:
	.loc 3 126 0
	jne	L180	 #,


  This looks fully correct to me: head pointer is reloaded at the top of each
loop and written to node->next before we do the cmpxchg to try and replace it;
only after that we check if the new value of head does indeed no longer match
the original value which is correctly in node->next.  The store has been put
back where it belongs inside the loop.

  Anyway, I'm in the middle of a GCC bootstrap that means I won't be able to
do any testing of new DLLs for a few hours yet, but I thought I'd post this to
get the ball rolling and show what I've found.  Maybe somebody will spot
something I've analyzed wrong.  An alternative option might be to just rip
these inlines right out and rely on the system library versions of the
InterlockedXxx functions from kernel32.dll.  It's also worth pointing out that
in our final pre-processed code, the definition of
InterlockedCompareExchangePointer is not simply a #define for ilockcmpexch,
but gets wrapped in a bunch of casts, and comes out looking like so:

template <class list_node> inline void
List_insert (list_node *&head, list_node *node)
{
  if (!node)
    return;
  do
    node->next = head;
  while ((PVOID)ilockcmpexch((LONG volatile *)
(&head),(LONG)(node),(LONG)(node->next)) != node->next);
}

... and maybe we're doing something bad with using C-style casts in C++ here
that's fooling the optimisers.  I'm pretty sure that output operand is wrong
in the asm, but there may be more than one factor at play here, which might
explain why I couldn't get completely the right code without that "memory"
clobber.  Or might be entirely unrelated.

  Any suggestions?

    cheers,
      DaveK

// 2003-05-03  Loren J. Rittle <rittle@labs.mot.com> <ljrittle@acm.org>
//
// Copyright (C) 2003, 2004, 2005, 2006, 2007, 2009
// Free Software Foundation, Inc.
//
// This file is part of the GNU ISO C++ Library.  This library is free
// software; you can redistribute it and/or modify it under the
// terms of the GNU General Public License as published by the
// Free Software Foundation; either version 3, or (at your option)
// any later version.
//
// This library is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License along
// with this library; see the file COPYING3.  If not see
// <http://www.gnu.org/licenses/>.

// { dg-do run { target *-*-freebsd* *-*-netbsd* *-*-linux* *-*-solaris* *-*-cygwin *-*-darwin* alpha*-*-osf* mips-sgi-irix6* } }
// { dg-options "-pthread" { target *-*-freebsd* *-*-netbsd* *-*-linux* alpha*-*-osf* mips-sgi-irix6* } }
// { dg-options "-pthreads" { target *-*-solaris* } }

#include <ext/rope>
#include <cstring>
#include <pthread.h>
//#include <testsuite_hooks.h>

#define VERIFY(x) 

const int max_thread_count = 4;
const int max_loop_count = 10000;

typedef __gnu_cxx::rope<char, std::allocator<char> > rope_type;
rope_type foo2;
rope_type foo4;

void* thread_main(void *) 
{
  // To see a problem with gcc 3.3 and before, set a break point here.
  // Single step through c_str implementation, call sched_yield after
  // capture of NULL __old_c_string in any thread.  Single step
  // through another thread past that same point.  Now, one thread
  // will receive a bad pointer return.  Adding dummy sched_yield
  // should never change program semantics under POSIX threads.
  const char* data4 = foo4.c_str();

  // Please note that the memory leak in the rope implementation with
  // this test case, existed before and after fixing this bug...
  bool test __attribute__((unused)) = true;
  VERIFY( !std::strcmp (data4, "barbazbonglehellohellohello") );
  return 0;
}

int
main()
{
  bool test __attribute__((unused)) = true;

  pthread_t tid[max_thread_count];

#if defined(__sun) && defined(__svr4__) && _XOPEN_VERSION >= 500
  pthread_setconcurrency (max_thread_count);
#endif

  rope_type foo;
  foo += "bar";
  foo += "baz";
  foo += "bongle";
  const char* data = foo.c_str();
  VERIFY( !std::strcmp (data, "barbazbongle") );

  const char* data2;
  {
    foo2 += "bar2";
    foo2 += "baz2";
    foo2 += "bongle2";
    data2 = foo2.c_str();
    VERIFY( !std::strcmp (data2, "bar2baz2bongle2") );
  }

  rope_type foo3 ("hello");
  const char* data3 = foo3.c_str();
  VERIFY( !std::strcmp (data3, "hello") );

  for (int j = 0; j < max_loop_count; j++)
    {
      foo4 = foo;
      foo4 += foo3;
      foo4 += foo3;
      foo4 += foo3;

      for (int i = 0; i < max_thread_count; i++)
	pthread_create (&tid[i], NULL, thread_main, 0);

      for (int i = 0; i < max_thread_count; i++)
	pthread_join (tid[i], NULL);
    }

  VERIFY( !std::strcmp (data, "barbazbongle") );
  VERIFY( !std::strcmp (data2, "bar2baz2bongle2") );

  return 0;
}

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]