difficult SIGCHLD / pipe hang

Joe Buehler jbuehler@hekimian.com
Mon Mar 31 20:09:00 GMT 2003


I am still trying to trace down a nasty ksh93 regression test failure.
The script that fails looks like this:

echo 0
print | cat >/dev/null &
wait $!

The problem appears to be timing-related.  What happens is that
the "cat" process and the "print" process are in a race to exit.
If "print" is fast enough, "cat" will get a SIGCHLD right around
the time its read of the pipe from "print" returns EOF.  This causes
the "cat" process to hang.

Note that both "cat" and "print" are built-ins in the version of ksh
being tested.  They are both extremely fast -- everything is running
in forked ksh processes, there is no exec() to run cat.

Here is a trace of the failure:

# process 246 is the "print" process -- it starts exit() here
   266  567270 [main] ksh 246 do_exit: do_exit (0)
   172  567442 [main] ksh 246 sig_dispatch_pending: pending_signals 0
   169  567611 [main] ksh 246 sig_dispatch_pending: no need to wake anything up
   168  567779 [main] ksh 246 set_sigcatchers: last 1, old 0, cur 0x6DE24420, cur 0x1
   145  567924 [main] ksh 246 void: 0x6DE24420 = signal (20, 0x1)
   141  568065 [main] ksh 246 sig_dispatch_pending: pending_signals 0
   176  568241 [main] ksh 246 sig_dispatch_pending: no need to wake anything up
   180  568421 [main] ksh 246 void: 0x0 = signal (1, 0x1)
   184  568605 [main] ksh 246 sig_dispatch_pending: pending_signals 0
   154  568759 [main] ksh 246 sig_dispatch_pending: no need to wake anything up
   141  568900 [main] ksh 246 void: 0x1 = signal (2, 0x1)
   155  569055 [main] ksh 246 sig_dispatch_pending: pending_signals 0
   134  569189 [main] ksh 246 sig_dispatch_pending: no need to wake anything up
   175  569364 [main] ksh 246 void: 0x1 = signal (3, 0x1)
   175  569539 [main] ksh 246 fhandler_base::close: closing '/dev/null' handle 0x178
   181  569720 [main] ksh 246 fhandler_base::close: closing '/dev/pipew' handle 0xD8
   190  569910 [main] ksh 246 fhandler_base::close: closing '/dev/pipew' handle 0xE8
# process 620 is "cat" -- it gets eof here
  5788  569923 [read_pipe] ksh 620 fhandler_base::read: returning 0, binary mode
   215  570125 [main] ksh 246 fhandler_base::close: closing '/var/run/utmp' handle 0x158
   250  570375 [main] ksh 246 fhandler_base::close: closing '/tmp/j2' handle 0x130
   189  570564 [main] ksh 246 sigproc_terminate: entering
   186  570750 [main] ksh 246 proc_terminate: nchildren 0, nzombies 0
# 620 unwinding in read()
   906  570829 [main] ksh 620 readv: 0 = readv (0, 0x22E090, 1), errno 0
   241  570991 [main] ksh 246 proc_terminate: leaving
   411  571402 [main] ksh 246 __to_clock_t: dwHighDateTime 0, dwLowDateTime 468750
   198  571600 [main] ksh 246 __to_clock_t: total 00000000 0000002E
   179  571779 [main] ksh 246 __to_clock_t: dwHighDateTime 0, dwLowDateTime 312500
   159  571938 [main] ksh 246 __to_clock_t: total 00000000 0000001F
  8805  580743 [main] ksh 246 _pinfo::exit: Calling ExitProcess 0
# 620 notices that a child died and sends itself SIGCHLD
# it is never heard from after that and gdb cannot be attached
# to the process to figure out what it thinks is going on...
16429  587258 [proc] ksh 620 proc_subproc: args: 2, 0
   343  587601 [proc] ksh 620 proc_subproc: pid 246[0] terminated, handle 0x4C, nchildren 1, nzombies 0
   233  587834 [proc] ksh 620 proc_subproc: zombifying [0], pid 246, handle 0x4C, nchildren 1
   218  588052 [proc] ksh 620 proc_subproc: returning 1
   265  588317 [proc] ksh 620 sig_send: pid 620, signal 20, its_me 1
   182  588499 [proc] ksh 620 sig_send: Not waiting for sigcomplete.  its_me 1 signal 20
    10  588509 [sig] ksh 620 wait_sig: awake
   345  588854 [sig] ksh 620 wait_sig: processing signal 20
   184  589038 [sig] ksh 620 wait_sig: Got signal 20
   163  589201 [sig] ksh 620 sig_handle: signal 20
   157  589358 [sig] ksh 620 sig_handle: signal 20, about to call 0x6DE24420
  -305  589053 [proc] ksh 620 sig_send: returning 0 from sending signal 20
   636  589689 [proc] ksh 620 wait_subproc: looping
# last message from 620
     3  589692 [sig] ksh 620 setup_handler: suspending mainthread

The "suspending mainthread" message is suspicious -- a test that does not hang
does not emit that message.  It goes something like this:

   402  527314 [proc] ksh 687 proc_subproc: args: 2, 0
   -11  527303 [main] ksh 687 geterrno_from_win_error: windows error 2 == errno 2
   312  527615 [proc] ksh 687 proc_subproc: pid 688[0] terminated, handle 0x100, nchildren 1, nzombies 0
   386  528001 [proc] ksh 687 proc_subproc: zombifying [0], pid 688, handle 0x100, nchildren 1
   239  528240 [proc] ksh 687 proc_subproc: returning 1
   255  528495 [proc] ksh 687 sig_send: pid 687, signal 20, its_me 1
   462  528957 [proc] ksh 687 sig_send: Not waiting for sigcomplete.  its_me 1 signal 20
    19  528976 [sig] ksh 687 wait_sig: awake
   414  529390 [sig] ksh 687 wait_sig: processing signal 20
   170  529560 [sig] ksh 687 wait_sig: Got signal 20
   180  529740 [sig] ksh 687 sig_handle: signal 20
     7  529747 [main] ksh 687 symlink_info::check: not a symlink
   346  530093 [main] ksh 687 symlink_info::check: 0 = symlink.check (q:\.hekimian.com\cygwin-shared\bin\cat.exe, 0x22D2C0) (0xA)
  -166  529927 [sig] ksh 687 sig_handle: signal 20, about to call 0x6DE24420
   598  530525 [main] ksh 687 path_conv::check: root_dir(q:\), this->path(q:\.hekimian.com\cygwin-shared\bin\cat.exe), set_has_acls(0)
   288  530813 [proc] ksh 687 sig_send: returning 0 from sending signal 20
   239  531052 [proc] ksh 687 wait_subproc: looping
# get this message instead of "suspending mainthread"
     7  531059 [sig] ksh 687 interruptible: pc 0x610DE709, h 0x61000000, interruptible 0, testvalid 0
   602  531661 [sig] ksh 687 interruptible: pc 0x6105FB7C, h 0x61000000, interruptible 0, testvalid 0
   514  532175 [sig] ksh 687 interruptible: pc 0x65B9063F, h 0x65B80000, interruptible 1, testvalid 0
   468  532643 [sig] ksh 687 proc_subproc: args: 3, 1
   310  532953 [sig] ksh 687 proc_subproc: clear waiting threads
   282  533235 [sig] ksh 687 proc_subproc: finished clearing
   263  533498 [sig] ksh 687 proc_subproc: returning 1
   271  533769 [sig] ksh 687 interrupt_setup: armed signal_arrived 0x224, res 1
   284  534053 [sig] ksh 687 setup_handler: signal successfully delivered
   463  534516 [sig] ksh 687 setup_handler: returning 1

Any expert advice appreciated -- the Cygwin code in this area is complex.
-- 
Joe Buehler



More information about the Cygwin-developers mailing list