This is the mail archive of the cygwin-developers mailing list for the Cygwin project.


Re: charset changes

From: Andy Koppe <andy dot koppe at gmail dot com>
To: cygwin-developers at cygwin dot com
Date: Sat, 6 Feb 2010 13:56:41 +0000
Subject: Re: charset changes
References: <416096c61001230305x20619d39x55e3a46b428ba@mail.gmail.com> <4B6C474B.7090600@towo.net> <20100205215047.GX28659@calimero.vinschen.de> <416096c61002051514w5bb56b0bj5baeb0c65c7aece@mail.gmail.com> <4B6CB86E.5050904@towo.net> <416096c61002052220rafdb361kec907336ca5b3889@mail.gmail.com> <20100206104024.GY28659@calimero.vinschen.de>

Corinna Vinschen:
>> Other systems usually have a 32-bit wchar, though. I can see three
>> ways to tackle the issue, but none of them entirely satisfactory. When
>> encountering a 4-byte sequence in __gb18030_mbtowc that maps to a
>> non-BMP char (and hence a UTF-16 surrogate pair):
>> 1. Just report an invalid sequence. BMP-only support would probably
>> still cover most practical needs.
>> 2. Write the high surrogate and report that one byte less than
>> actually seen has been consumed. On the next mbtowc call, ignore the
>> input, write the low surrogate, and report that 1 byte has been
>> consumed. Unfortunately this scheme falls down if the user feeds in
>> the bytes one-by-one, as Corinna previously found when handling UTF-8
>> like this.
>> 3. Write the high surrogate and report the actual number of bytes
>> consumed. On the next call, write the low surrogate, and return 0 to
>> indicate that no bytes have been consumed. Trouble is, a return value
>> of 0 from mbrtowc is supposed to indicate that a null character has
>> been found. While uses within Cygwin could be changed to recognise
>> string end by instead looking at the character actually written, this
>> would lead to truncated strings in applications.

I just found that approach 3 ends up delaying the low surrogate until
the first byte of the next character is passed to mbtowc. For keyboard
input at least, that's bad.


> Can't we just carry over the surrogate pair handling from __utf8_mbtowc()
> in newlib/libc/stdlib/mbtowc_r.c? What's the stumbling block exactly?
> Do you have an example?

__utf8_mbtowc writes the UTF-16 high surrogate after seeing only three
bytes of a four-byte sequence. It can do that because the first three
bytes of a UTF-8 sequence contain all the bits needed for the high
surrogate. When it's called with the fourth byte, it writes the low
surrogate and returns 1 to indicate it's consumed 1 byte (unless the
fourth byte is invalid, in which case it returns -1).
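
To illustrate the bit layout, here is a minimal sketch. It is not the
actual newlib code, and the helper names are made up for this example.
It assumes the bytes have already been validated as part of a four-byte
UTF-8 sequence for a code point at or above U+10000.

```c
#include <stdint.h>

/* Hypothetical helper, not taken from newlib: compute the UTF-16 high
   surrogate from the first three bytes of a four-byte UTF-8 sequence. */
static uint16_t utf8_high_surrogate(uint8_t b0, uint8_t b1, uint8_t b2)
{
  /* Bits 20..6 of the code point; bits 5..0 live in the fourth byte. */
  uint32_t partial = ((uint32_t)(b0 & 0x07) << 18)
                   | ((uint32_t)(b1 & 0x3F) << 12)
                   | ((uint32_t)(b2 & 0x3F) << 6);

  /* Subtracting 0x10000 does not touch the low 10 bits, and the shift
     discards them, so the missing fourth byte cannot affect the result. */
  return (uint16_t)(0xD800 + ((partial - 0x10000) >> 10));
}

/* Hypothetical helper: the low surrogate needs the low four payload bits
   of the third byte plus the six payload bits of the fourth byte. */
static uint16_t utf8_low_surrogate(uint8_t b2, uint8_t b3)
{
  return (uint16_t)(0xDC00 | ((uint32_t)(b2 & 0x0F) << 6) | (b3 & 0x3F));
}
```

For U+207FF, which is F0 A0 9F BF in UTF-8, the first helper yields D841
from F0 A0 9F alone, and the second yields DFFF once BF arrives.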

That approach fits nicely with the mbrtowc spec, but I don't think it
can be used for GB18030, because there the first three bytes of a
four-byte sequence do not necessarily determine all the bits of the
high surrogate.

For example (all in hex):

U+207FF  GB18030: 95 33 D1 33  UTF-16: D841 DFFF
U+20800  GB18030: 95 33 D1 34  UTF-16: D842 DC00

The first three GB18030 bytes are the same, yet the high UTF-16
surrogate is different.
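
The two mappings above can be reproduced with a minimal sketch. The
helper names are hypothetical and this is not the Cygwin or newlib
implementation. It assumes the standard linear mapping of the GB18030
four-byte range 90 30 81 30 .. E3 32 9A 35 onto U+10000 .. U+10FFFF.
The fourth byte is a base-10 digit of the linear index, whereas the
high surrogate changes every 1024 code points, so a surrogate boundary
can fall between two values of the fourth byte.

```c
#include <stdint.h>

/* Hypothetical helper: decode a four-byte GB18030 sequence from the
   supplementary-plane range. Returns -1 for bytes outside that range. */
static int32_t gb18030_supp_to_cp(uint8_t b1, uint8_t b2,
                                  uint8_t b3, uint8_t b4)
{
  if (b1 < 0x90 || b1 > 0xE3 || b2 < 0x30 || b2 > 0x39
      || b3 < 0x81 || b3 > 0xFE || b4 < 0x30 || b4 > 0x39)
    return -1;

  /* Mixed-radix index: the bytes act as digits in bases 10, 126 and 10. */
  uint32_t lin = (((uint32_t)(b1 - 0x90) * 10 + (uint32_t)(b2 - 0x30)) * 126
                  + (uint32_t)(b3 - 0x81)) * 10
                 + (uint32_t)(b4 - 0x30);

  if (lin > 0xFFFFF)            /* beyond U+10FFFF */
    return -1;
  return (int32_t)(0x10000 + lin);
}

/* Hypothetical helpers: split a supplementary code point into UTF-16. */
static uint16_t utf16_high_from_cp(int32_t cp)
{
  return (uint16_t)(0xD800 + (((uint32_t)cp - 0x10000) >> 10));
}

static uint16_t utf16_low_from_cp(int32_t cp)
{
  return (uint16_t)(0xDC00 + (((uint32_t)cp - 0x10000) & 0x3FF));
}
```

With these, 95 33 D1 33 decodes to U+207FF (D841 DFFF) and 95 33 D1 34
decodes to U+20800 (D842 DC00), so the high surrogate cannot be known
until the fourth byte has been seen.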

Andy
