This is the mail archive of the cygwin-apps mailing list for the Cygwin project.



Re: ITP: rxvt-unicode-X


I have now (almost) finished my Unicode support hook for rxvt on 
cygwin, at least as far as Unicode operation is concerned.
There were a few more obstacles to overcome, which I describe below in 
case anyone is interested :)

A few problems remain:
* If I start rxvt in NON-Unicode mode, 8 bit input doesn't work. This 
  also happens with the unpatched rxvt-unicode 6.0 (compiled from the 
  source archive), but it works in Charles' package, so I would hope 
  that the patch is applicable to the package without injecting this 
  error.
* The wchar_t type on cygwin is only "unsigned short", which raises a 
  minor problem with Unicode characters beyond 16 bits; my patch now 
  maps such characters to the Unicode replacement character U+FFFD on 
  output. Substituting a sufficiently wide type might work but would 
  require more subtle modifications to the code.
* Charles pointed out that an application can call setlocale multiple 
  times, switching encoding dynamically, and that rxvt actually does 
  that (although I didn't understand for what purpose). Anyway, 
  a proper substitute for setlocale that mimics this behaviour is 
  still missing in my patch library.
* Suspected remaining handling bug in 'draw_string' as described below.
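
The U+FFFD mapping mentioned in the second point can be sketched as follows. This is only an illustration, not the code in uwc.zip: the type name wchar16 and the function narrow_code_point are made up here, and wchar16 merely stands in for a 16-bit wchar_t.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit wchar_t such as cygwin's "unsigned short". */
typedef uint16_t wchar16;

/* U+FFFD, the Unicode replacement character. */
#define REPLACEMENT_CHAR 0xFFFDu

/*
 * Narrow a full Unicode code point to one 16-bit unit.
 * Code points above U+FFFF do not fit, so they are replaced by U+FFFD.
 * (Illustrative helper only; the name is an assumption.)
 */
static wchar16 narrow_code_point(uint32_t cp)
{
    if (cp > 0xFFFFu)
        return (wchar16)REPLACEMENT_CHAR;
    return (wchar16)cp;
}
```

With this helper, U+00FC and U+2027 pass through unchanged, while something like U+1D11E comes out as U+FFFD.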

To apply the patch, please unzip the uwc.zip archive in the rxvt 
src subdirectory. Then invoke the uwc script which applies the patch 
generically, by substituting the respective function names in the 
source files. The final "return NOCHAR" fix described below still has 
to be applied manually, sorry.
The patch can be downloaded from <http://towo.net/mined/cygwin/uwc.zip>

Thomas


------------------------------------------------------------------------
Now about the problems I had:
* First, I had to remove one more bug in my wide character replacement 
  functions in order to avoid an occasional crash. Alright.
* Then, Unicode input still would not work. I found that I had indeed 
  overlooked one function that needs to be replaced: XwcLookupString.
  The code in rxvt (command.C) has an alternative invocation of 
  Xutf8LookupString which carries the comment "// currently disabled, 
  doesn't seem to work, nor is useful".
  It turns out that it is in fact very useful in making input work; the 
  reason the disabled rxvt code could not work is that the return 
  values are not handled properly.
* Finally, there was some occasional weird display garbage remaining 
  which I am describing below in some detail because there is some 
  really buggy rxvt code involved.

When displaying a long string on the screen, it may happen that 
rxvt splits a single UTF-8 character across successive fills of an 
internal buffer. (I could not observe this on Linux, however, where 
the buffer always seems to be long enough to hold the complete 
output, whereas on cygwin it seems to have a maximum length of 257 bytes.)

Then at the end of the buffer, rxvt invokes mbrtowc with an incomplete 
UTF-8 sequence:

mbrtowc (& wc, C3 BC E2, 3, & ps) -> 2, wc = FC
mbrtowc (& wc, E2, 1, & ps) -> -1, wc unchanged
now the continuation of E2, combining to E2 80 A7, the dot symbol U+2027:
mbrtowc (& wc, 80 A7 C3 A4 C3 B6 C3 9F ..., 257, & ps) -> -1, wc unchanged
mbrtowc (& wc, A7 C3 A4 C3 B6 C3 9F E2 ..., 256, & ps) -> -1, wc unchanged
mbrtowc (& wc, C3 A4 C3 B6 C3 9F E2 87 ..., 255, & ps) -> 2 wc = E4

The display produced is "üâ§ä" instead of "ü‧ä".

A sample program xwrite.c demonstrating the bug is included in uwc.zip 
(only if the "return NOCHAR" fix below has not yet been applied).


When I further analysed the mbrtowc function (on Linux, where it works), 
it turned out that it keeps track of an incomplete UTF-8 sequence and 
automatically completes it when the continuation bytes are passed in 
a later call. Some comments in the rxvt source also suggest that 
rxvt might even depend on this undocumented behaviour. So I 
reimplemented it in my cygwin mbrtowc replacement, but the display 
bug remained. It finally turned out that rxvt does not need this 
"feature" (or rather bug, as it's not documented), at least not for 
screen display.
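
To illustrate what "maintaining a state of incomplete UTF-8" means, here is a minimal sketch of a decoder that carries a cut-off sequence over to the next call. It is not the replacement from uwc.zip; utf8_state, utf8_decode and the two return codes are names assumed for this example, and the sketch does not reject overlong forms or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder state: bits collected so far and the number of
   continuation bytes still expected. Zero-initialise before first use. */
typedef struct {
    uint32_t partial;
    int      pending;
} utf8_state;

/* Assumed return codes for this sketch. */
#define UTF8_INCOMPLETE ((size_t)-2)  /* input ended inside a character */
#define UTF8_INVALID    ((size_t)-1)  /* malformed byte encountered     */

/*
 * Decode at most one character from s[0..n).
 * Returns the number of bytes consumed by this call and stores the
 * character in *out; returns UTF8_INCOMPLETE if all n bytes were consumed
 * but the character is not finished yet (the state remembers the prefix);
 * returns UTF8_INVALID on a malformed byte (the state is reset).
 */
static size_t utf8_decode(uint32_t *out, const unsigned char *s, size_t n,
                          utf8_state *st)
{
    size_t i = 0;

    while (i < n) {
        unsigned char b = s[i++];

        if (st->pending == 0) {
            /* Expecting a lead byte or plain ASCII. */
            if (b < 0x80) {
                *out = b;
                return i;
            } else if ((b & 0xE0) == 0xC0) {
                st->partial = b & 0x1F; st->pending = 1;
            } else if ((b & 0xF0) == 0xE0) {
                st->partial = b & 0x0F; st->pending = 2;
            } else if ((b & 0xF8) == 0xF0) {
                st->partial = b & 0x07; st->pending = 3;
            } else {
                return UTF8_INVALID;      /* stray continuation byte etc. */
            }
        } else {
            /* Expecting a continuation byte 10xxxxxx. */
            if ((b & 0xC0) != 0x80) {
                st->partial = 0; st->pending = 0;
                return UTF8_INVALID;
            }
            st->partial = (st->partial << 6) | (uint32_t)(b & 0x3F);
            if (--st->pending == 0) {
                *out = st->partial;
                return i;
            }
        }
    }
    return UTF8_INCOMPLETE;
}
```

Fed with the byte sequence from the trace above, the lone E2 at the end of the first buffer is remembered, and a second call starting at 80 A7 completes it to U+2027 after consuming two bytes.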

So I checked the invocations of mbrtowc in rxvt, in command.C and 
menubar.C. I suspected the latter because the call is inside a function 
called 'draw_string', which quite clearly suggests that it is used 
for screen display, but that was not the case.
Rather, it turned out that the function 'next_char' in command.C 
handles screen output, which is really weird (the function carries the 
comment "// read the next octet").
The function has the return option
      if (len == (size_t)-1) {
        return *cmdbuf_ptr++;
with the comment 
"// the _occasional_ latin1 character is allowed to slip through"; 
now this sounds mega-weird - why should something that's not right 
be allowed to slip through? Anyway, replacing this with just
      if (len == (size_t)-1) {
        return NOCHAR;
finally solves the display problem and there we are with a working 
rxvt-unicode on cygwin.
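
The effect of that one-line change can be shown with a small standalone model. This is not the rxvt code: next_char_model, utf8_complete and the NOCHAR value used here are assumptions for illustration, and the model simply leaves a cut-off sequence unconsumed when it reports NOCHAR.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed sentinel for "no character available"; rxvt has its own definition. */
#define NOCHAR 0xFFFFFFFFu

/*
 * Stateless helper (hypothetical): if s[0..n) starts with a complete,
 * well-formed UTF-8 sequence, store the code point in *out and return its
 * length; return 0 if the sequence is malformed or cut off by the buffer end.
 * Overlong forms are not checked in this sketch.
 */
static size_t utf8_complete(const unsigned char *s, size_t n, uint32_t *out)
{
    size_t need, i;
    uint32_t cp;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) { *out = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; cp = s[0] & 0x07; }
    else return 0;

    if (n < need)
        return 0;                       /* cut off at the end of the buffer */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                   /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *out = cp;
    return need;
}

/*
 * Model of the two policies on a decoding failure:
 *   slip_through != 0: hand the raw byte on as if it were Latin-1
 *   slip_through == 0: report NOCHAR and leave the bytes where they are
 */
static uint32_t next_char_model(const unsigned char *buf, size_t len,
                                size_t *pos, int slip_through)
{
    uint32_t cp;
    size_t used;

    if (*pos >= len)
        return NOCHAR;
    used = utf8_complete(buf + *pos, len - *pos, &cp);
    if (used == 0) {
        if (slip_through)
            return buf[(*pos)++];       /* the byte ends up on screen as garbage */
        return NOCHAR;
    }
    *pos += used;
    return cp;
}
```

For a buffer ending in C3 BC E2, the slip-through policy yields FC and then E2 (shown as "â"), while the NOCHAR policy yields FC and then reports that no character is available.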


A remaining issue might be 'draw_string' in menubar.C; I don't know 
what its purpose is.


The re-implementation of the setlocale functionality in my replacement 
function, which you correctly pointed out, is still pending.


------------------------------------------------------------------------

