This is the mail archive of the cygwin mailing list for the Cygwin project.



RE: cygwin 1.5.20-1, spinning pdksh, 100% CPU


> -----Original Message-----
> From: cygwin-owner@cygwin.com 
> [mailto:cygwin-owner@cygwin.com] On Behalf Of Ernie Coskrey
> Sent: Tuesday, July 31, 2007 3:40 PM
> To: cygwin@cygwin.com
> Subject: cygwin 1.5.20-1, spinning pdksh, 100% CPU
> 
>  
> I've run into a problem with cygwin 1.5.20-1 and pdksh 
> 5.2.14.  We've got a pdksh.exe process that is spinning, 
> using all the CPU.
>  
> This scenario is very hard to reproduce, but has happened on 
> our test systems occasionally.  It occurred recently, and I 
> currently have gdb attached to the process and have the 
> symbols loaded.  I see that pdksh is continually calling 
> "sigsuspend()", which is immediately returning from 
> cancelable_wait due to the fact that the signal_arrived event 
> is set.  I also see that pdksh is waiting for a subprocess to 
> complete, and has a handle to the PID of that process - 
> however the process has long since terminated.
>  
> It appears that something went wrong during delivery of SIGCHLD.
>  
> I've got two questions related to this:
>  
> - have there been changes between 1.5.20-1 and 1.5.24-2, or 
> the latest snapshot, that might have fixed this issue?  We've 
> done some limited testing with 1.5.24-2 and haven't seen this 
> happen yet, but as I said, it only happens rarely.
> - is there anything I can look at in gdb to help identify 
> what the issue is?
>  
> Any suggestions would be appreciated!
>  
> ---------
> Ernie Coskrey 

I've discovered an interesting piece of information that I think is
related to this.  I'm hoping this might ring a bell with someone on the
list.

With a breakpoint set in handle_sigsuspend just after the
cancelable_wait() call, _main_tls->stack[] contains the following
entries:

    0x6109186f  0x4132ac

0x6109186f is "sigdelayed()", which is the routine that should have been
called to deliver the signal and reset the signal_arrived event.
0x4132ac is j_waitj (in pdksh).

So, somehow, when this problem occurs, "sigdelayed" gets pushed onto the
stack *before* j_waitj does.  So _sigbe never calls sigdelayed, and the
signal_arrived event is never reset.

I don't think there's ever a case where sigdelayed should be at
_main_tls->stack[0].  Whatever put it there is, I believe, the cause of
this problem.
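
To make the ordering argument concrete, here is a toy model of a
last-in, first-out list of return addresses.  It is not Cygwin's code:
the helper names (pop_return, next_return) are hypothetical, and the
"healthy" order is only my assumption about what a normal delivery would
look like.  The two addresses are the ones seen in gdb.

```python
# Toy model only: NOT Cygwin's implementation.  It mimics a LIFO list of
# return addresses, like the _main_tls->stack[] entries seen in gdb.

SIGDELAYED = 0x6109186F  # address that gdb resolves to sigdelayed()
J_WAITJ = 0x4132AC       # address that gdb resolves to j_waitj (pdksh)


def pop_return(stack):
    """Remove and return the most recently pushed address (hypothetical helper)."""
    return stack.pop()


def next_return(stack):
    """Return the address the next pop would yield, without removing it."""
    return stack[-1]


# Order observed in the spinning process: sigdelayed at index 0,
# j_waitj pushed on top of it.
observed = [SIGDELAYED, J_WAITJ]

# Order ASSUMED for a healthy delivery: sigdelayed pushed last, so it is
# the first address to be popped.
assumed_healthy = [J_WAITJ, SIGDELAYED]

if __name__ == "__main__":
    print("observed: next return is", hex(next_return(observed)))
    print("assumed healthy: next return is", hex(next_return(assumed_healthy)))
```

In the observed order the first address popped is j_waitj, so in this
model control goes back to pdksh without passing through sigdelayed.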

Ernie Coskrey


