This is the mail archive of the cygwin-developers mailing list for the Cygwin project.


Re: charset changes


Corinna Vinschen:
>> Other systems usually have a 32-bit wchar, though. I can see three
>> ways to tackle the issue, but none of them entirely satisfactory. When
>> encountering a 4-byte sequence in __gb18030_mbtowc that maps to a
>> non-BMP char (and hence a UTF-16 surrogate pair):
>> 1. Just report an invalid sequence. BMP-only support would probably
>> still cover most practical needs.
>> 2. Write the high surrogate and report that one byte less than
>> actually seen has been consumed. On the next mbtowc call, ignore the
>> input, write the low surrogate, and report that 1 byte has been
>> consumed. Unfortunately this scheme falls down if the user feeds in
>> the bytes one-by-one, as Corinna previously found when handling UTF-8
>> like this.
>> 3. Write the high surrogate and report the actual number of bytes
>> consumed. On the next call, write the low surrogate, and return 0 to
>> indicate that no bytes have been consumed. Trouble is, a return value
>> of 0 from mbrtowc is supposed to indicate that a null character has
>> been found. While uses within Cygwin could be changed to recognise
>> string end by instead looking at the character actually written, this
>> would lead to truncated strings in applications.

I just found that approach 3 ends up delaying the low surrogate until
the first byte of the next character is passed to mbtowc. For keyboard
input at least, that's bad.


> Can't we just carry over the surrogate pair handling from __utf8_mbtowc()
> in newlib/libc/stdlib/mbtowc_r.c?  What's the stumbling block exactly?
> Do you have an example?

__utf8_mbtowc writes the UTF-16 high surrogate after seeing only three
bytes of a four-byte sequence. It can do that because the first three
bytes of a UTF-8 sequence contain all the bits needed for the high
surrogate. When it's called with the fourth byte, it writes the low
surrogate and returns 1 to indicate it's consumed 1 byte. (Unless the
fourth byte is invalid, in which case it returns -1.)
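
To make the bit layout concrete, here is a minimal sketch of that split. It
is not the newlib code; the helper names utf8_high_surrogate and
utf8_low_surrogate are made up for illustration, and validation of the
continuation bytes is left out. The first three bytes carry the top 15 bits
of the code point, and the high surrogate depends only on bits 10 and up,
so it can be computed before the fourth byte arrives.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: high surrogate from the first three bytes of a
   four-byte UTF-8 sequence (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx). */
uint16_t utf8_high_surrogate(unsigned char b0, unsigned char b1,
                             unsigned char b2)
{
    /* Top 15 of the 21 code point bits: 3 + 6 + 6. */
    uint32_t top15 = ((uint32_t)(b0 & 0x07) << 12)
                   | ((uint32_t)(b1 & 0x3F) << 6)
                   | (uint32_t)(b2 & 0x3F);

    /* Precondition: a four-byte lead byte and a code point >= U+10000. */
    assert((b0 & 0xF8) == 0xF0);
    assert(top15 >= 0x400);

    /* (cp - 0x10000) >> 10, with cp's low 6 bits not yet known:
       ((top15 << 6) - 0x10000) >> 10 == (top15 - 0x400) >> 4. */
    return (uint16_t)(0xD800 + ((top15 - 0x400) >> 4));
}

/* Hypothetical helper: low surrogate from the third and fourth bytes.
   It needs the low 10 code point bits: 4 from b2 and 6 from b3. */
uint16_t utf8_low_surrogate(unsigned char b2, unsigned char b3)
{
    return (uint16_t)(0xDC00 | ((uint32_t)(b2 & 0x0F) << 6) | (b3 & 0x3F));
}
```

For U+207FF (UTF-8: F0 A0 9F BF) this gives D841 after three bytes and DFFF
once the fourth byte is seen. For U+20800 (UTF-8: F0 A0 A0 80) it gives D842
and DC00. Note that the third UTF-8 byte already differs between the two.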

That approach fits nicely with the mbrtowc spec, but I don't think it
can be used for GB18030, because there the first three bytes of a
four-byte sequence do not necessarily determine all the bits of the high
surrogate.

For example (all in hex):

U+207FF  GB18030: 95 33 D1 33  UTF-16: D841 DFFF
U+20800  GB18030: 95 33 D1 34  UTF-16: D842 DC00

The first three GB18030 bytes are the same, yet the high UTF-16
surrogate is different.
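
The two rows can be reproduced with a small sketch, assuming the usual
linear mapping of the supplementary planes in GB18030 (90 30 81 30 is
U+10000, and the bytes act as digits with radixes 10, 126 and 10). The
function names gb18030_supp_decode and utf16_surrogates are hypothetical,
not anything in newlib or Cygwin.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: decode a four-byte GB18030 sequence in the
   supplementary-plane range. Returns 0 if the bytes fall outside it. */
uint32_t gb18030_supp_decode(const unsigned char b[4])
{
    if (b[0] < 0x90 || b[0] > 0xE3 || b[1] < 0x30 || b[1] > 0x39 ||
        b[2] < 0x81 || b[2] > 0xFE || b[3] < 0x30 || b[3] > 0x39)
        return 0;

    /* Mixed-radix number: the last byte is the least significant digit. */
    uint32_t linear = (((uint32_t)(b[0] - 0x90) * 10 + (b[1] - 0x30)) * 126
                       + (b[2] - 0x81)) * 10 + (b[3] - 0x30);

    if (linear > 0xFFFFF)       /* beyond U+10FFFF */
        return 0;
    return 0x10000 + linear;
}

/* Split a supplementary-plane code point into a UTF-16 surrogate pair. */
void utf16_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    uint32_t v = cp - 0x10000;
    *high = (uint16_t)(0xD800 + (v >> 10));
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));
}
```

Because the fourth byte is a base-10 digit rather than a fixed group of
bits, stepping it from 33 to 34 can carry into bit 10 of the code point,
which is exactly what happens between U+207FF and U+20800.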

Andy

