This is the mail archive of the
cygwin-developers@cygwin.com
mailing list for the Cygwin project.
difficult SIGCHLD / pipe hang
- From: Joe Buehler <jbuehler at hekimian dot com>
- To: Cygwin Developers <cygwin-developers at cygwin dot com>
- Date: Mon, 31 Mar 2003 15:09:23 -0500
- Subject: difficult SIGCHLD / pipe hang
- Organization: Spirent Communications, Inc.
- Reply-to: jbuehler at hekimian dot com
I am still trying to trace down a nasty ksh93 regression test failure.
The script that fails looks like this:
echo 0
print | cat >/dev/null &
wait $!
The problem appears to be timing-related. What happens is that
the "cat" process and the "print" process are in a race to exit.
If "print" is fast enough, "cat" will get a SIGCHLD right around
the time its read of the pipe from "print" returns EOF. This causes
the "cat" process to hang.
Note that both "cat" and "print" are built-ins in the version of ksh
being tested. They are both extremely fast -- everything is running
in forked ksh processes, there is no exec() to run cat.
Here is a trace of the failure:
# process 246 is the "print" process -- it starts exit() here
266 567270 [main] ksh 246 do_exit: do_exit (0)
172 567442 [main] ksh 246 sig_dispatch_pending: pending_signals 0
169 567611 [main] ksh 246 sig_dispatch_pending: no need to wake anything up
168 567779 [main] ksh 246 set_sigcatchers: last 1, old 0, cur 0x6DE24420, cur 0x1
145 567924 [main] ksh 246 void: 0x6DE24420 = signal (20, 0x1)
141 568065 [main] ksh 246 sig_dispatch_pending: pending_signals 0
176 568241 [main] ksh 246 sig_dispatch_pending: no need to wake anything up
180 568421 [main] ksh 246 void: 0x0 = signal (1, 0x1)
184 568605 [main] ksh 246 sig_dispatch_pending: pending_signals 0
154 568759 [main] ksh 246 sig_dispatch_pending: no need to wake anything up
141 568900 [main] ksh 246 void: 0x1 = signal (2, 0x1)
155 569055 [main] ksh 246 sig_dispatch_pending: pending_signals 0
134 569189 [main] ksh 246 sig_dispatch_pending: no need to wake anything up
175 569364 [main] ksh 246 void: 0x1 = signal (3, 0x1)
175 569539 [main] ksh 246 fhandler_base::close: closing '/dev/null' handle 0x178
181 569720 [main] ksh 246 fhandler_base::close: closing '/dev/pipew' handle 0xD8
190 569910 [main] ksh 246 fhandler_base::close: closing '/dev/pipew' handle 0xE8
# process 620 is "cat" -- it gets eof here
5788 569923 [read_pipe] ksh 620 fhandler_base::read: returning 0, binary mode
215 570125 [main] ksh 246 fhandler_base::close: closing '/var/run/utmp' handle 0x158
250 570375 [main] ksh 246 fhandler_base::close: closing '/tmp/j2' handle 0x130
189 570564 [main] ksh 246 sigproc_terminate: entering
186 570750 [main] ksh 246 proc_terminate: nchildren 0, nzombies 0
# 620 unwinding in read()
906 570829 [main] ksh 620 readv: 0 = readv (0, 0x22E090, 1), errno 0
241 570991 [main] ksh 246 proc_terminate: leaving
411 571402 [main] ksh 246 __to_clock_t: dwHighDateTime 0, dwLowDateTime 468750
198 571600 [main] ksh 246 __to_clock_t: total 00000000 0000002E
179 571779 [main] ksh 246 __to_clock_t: dwHighDateTime 0, dwLowDateTime 312500
159 571938 [main] ksh 246 __to_clock_t: total 00000000 0000001F
8805 580743 [main] ksh 246 _pinfo::exit: Calling ExitProcess 0
# 620 notices that a child died and sends itself SIGCHLD
# it is never heard from after that and gdb cannot be attached
# to the process to figure out what it thinks is going on...
16429 587258 [proc] ksh 620 proc_subproc: args: 2, 0
343 587601 [proc] ksh 620 proc_subproc: pid 246[0] terminated, handle 0x4C, nchildren 1, nzombies 0
233 587834 [proc] ksh 620 proc_subproc: zombifying [0], pid 246, handle 0x4C, nchildren 1
218 588052 [proc] ksh 620 proc_subproc: returning 1
265 588317 [proc] ksh 620 sig_send: pid 620, signal 20, its_me 1
182 588499 [proc] ksh 620 sig_send: Not waiting for sigcomplete. its_me 1 signal 20
10 588509 [sig] ksh 620 wait_sig: awake
345 588854 [sig] ksh 620 wait_sig: processing signal 20
184 589038 [sig] ksh 620 wait_sig: Got signal 20
163 589201 [sig] ksh 620 sig_handle: signal 20
157 589358 [sig] ksh 620 sig_handle: signal 20, about to call 0x6DE24420
-305 589053 [proc] ksh 620 sig_send: returning 0 from sending signal 20
636 589689 [proc] ksh 620 wait_subproc: looping
# last message from 620
3 589692 [sig] ksh 620 setup_handler: suspending mainthread
The "suspending mainthread" message is suspicious -- a test that does not hang
does not emit that message. It goes something like this:
402 527314 [proc] ksh 687 proc_subproc: args: 2, 0
-11 527303 [main] ksh 687 geterrno_from_win_error: windows error 2 == errno 2
312 527615 [proc] ksh 687 proc_subproc: pid 688[0] terminated, handle 0x100, nchildren 1, nzombies 0
386 528001 [proc] ksh 687 proc_subproc: zombifying [0], pid 688, handle 0x100, nchildren 1
239 528240 [proc] ksh 687 proc_subproc: returning 1
255 528495 [proc] ksh 687 sig_send: pid 687, signal 20, its_me 1
462 528957 [proc] ksh 687 sig_send: Not waiting for sigcomplete. its_me 1 signal 20
19 528976 [sig] ksh 687 wait_sig: awake
414 529390 [sig] ksh 687 wait_sig: processing signal 20
170 529560 [sig] ksh 687 wait_sig: Got signal 20
180 529740 [sig] ksh 687 sig_handle: signal 20
7 529747 [main] ksh 687 symlink_info::check: not a symlink
346 530093 [main] ksh 687 symlink_info::check: 0 = symlink.check (q:\.hekimian.com\cygwin-shared\bin\cat.exe, 0x22D2C0) (0xA)
-166 529927 [sig] ksh 687 sig_handle: signal 20, about to call 0x6DE24420
598 530525 [main] ksh 687 path_conv::check: root_dir(q:\), this->path(q:\.hekimian.com\cygwin-shared\bin\cat.exe), set_has_acls(0)
288 530813 [proc] ksh 687 sig_send: returning 0 from sending signal 20
239 531052 [proc] ksh 687 wait_subproc: looping
# get this message instead of "suspending mainthread"
7 531059 [sig] ksh 687 interruptible: pc 0x610DE709, h 0x61000000, interruptible 0, testvalid 0
602 531661 [sig] ksh 687 interruptible: pc 0x6105FB7C, h 0x61000000, interruptible 0, testvalid 0
514 532175 [sig] ksh 687 interruptible: pc 0x65B9063F, h 0x65B80000, interruptible 1, testvalid 0
468 532643 [sig] ksh 687 proc_subproc: args: 3, 1
310 532953 [sig] ksh 687 proc_subproc: clear waiting threads
282 533235 [sig] ksh 687 proc_subproc: finished clearing
263 533498 [sig] ksh 687 proc_subproc: returning 1
271 533769 [sig] ksh 687 interrupt_setup: armed signal_arrived 0x224, res 1
284 534053 [sig] ksh 687 setup_handler: signal successfully delivered
463 534516 [sig] ksh 687 setup_handler: returning 1
Any expert advice appreciated -- the Cygwin code in this area is complex.
--
Joe Buehler