This is the mail archive of the cygwin-developers mailing list for the Cygwin project.



Re: Console codepage setting via chcp?


On Sep 26 17:43, Andy Koppe wrote:
> 2009/9/26 Corinna Vinschen:
> > If we stick to UTF-8 exclusively we *have* to create the convmv-like
> > tool which allows "broken" filenames to be converted from the
> > \016\377\x notation to the UTF-8 \c2\x or \c3\x notation.
> 
> What's the \016\377\x notation? \016 is ^N, but the \377 isn't UTF-8,
> so is that an additional scheme?

Oh, I thought I had mentioned it.  The \016\377\x is the multibyte
sequence which gets created from a lone U+DCxx UTF-16 value in
sys_cp_wcstombs.  See below.

> The way I understand it though, if filenames were always treated as
> UTF8 by the system calls, then ^N would never be needed, because
> invalid UTF8 is encoded as U+DCxx when converting to UTF16, while
> UTF16-to-UTF8 is always valid (unless Windows filenames contain
> invalid UTF16 in the first place ...).

A single U+DCxx is allowed in Windows filenames, but it is invalid
UTF-16, since it's the second half of a surrogate.  Such a lone second
half of a surrogate has no valid UTF-8 representation.  Same as the
first half alone.  A stand-alone value 0xD800 <= x <= 0xDFFF is only
valid in UCS-2, not in UTF-16.
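
To make the distinction concrete, here is a minimal sketch in C of a check
for unpaired surrogates in a sequence of 16-bit code units.  The function
names are hypothetical and are not part of newlib or Cygwin; the ranges
are the standard UTF-16 surrogate ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* High (first-half) surrogates are 0xD800..0xDBFF. */
int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

/* Low (second-half) surrogates are 0xDC00..0xDFFF. */
int is_low_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/* Return the index of the first unpaired surrogate in s[0..n), or -1 if
   the sequence is well-formed UTF-16.  Hypothetical helper. */
long find_lone_surrogate(const uint16_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (is_high_surrogate(s[i])) {
            if (i + 1 < n && is_low_surrogate(s[i + 1]))
                i++;            /* valid pair: skip the low half too */
            else
                return (long)i; /* high half with nothing to pair with */
        } else if (is_low_surrogate(s[i])) {
            return (long)i;     /* low half without a preceding high half */
        }
    }
    return -1;
}
```

A filename containing a single 0xDC80 unit would be reported at its index,
while the pair 0xD83D 0xDE00 would pass.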

Therefore a lone surrogate half results in an encoding error in the
newlib function __utf8_wctomb right now, and __utf8_mbtowc refuses to
translate a lone surrogate half in UCS-2 encoding to a wchar_t value.

Do you propose to change __utf8_mbtowc/__utf8_wctomb to allow UCS-2
encoding as well?

This is no problem for __utf8_mbtowc, but in __utf8_wctomb it's not
possible to convert surrogate pairs to correct UTF-8 *and* lone
surrogate first halves to UCS-2, at least not without a lot of additional
effort.  The reason is that the first byte returned when the first half
is read is >= 0xf0.  When the function is called for the second half and
it turns out there is no second half, then the already returned 0xf0
byte is suddenly wrong.  And the wctomb functions have no read-ahead
functionality.
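
The read-ahead problem can be illustrated with a simplified converter.
This is only a sketch under assumptions, not the newlib code: the names
sketch_wctomb and struct sketch_state are hypothetical, and the real
function keeps its state in an mbstate_t.  It converts one 16-bit unit
per call and emits the 4-byte lead byte as soon as it sees a first half.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state: a nonzero 'pending' holds the partial
   code point computed from a high surrogate seen on the previous call. */
struct sketch_state { uint32_t pending; };

/* Convert one UTF-16 code unit to UTF-8 with no read-ahead.  Returns the
   number of bytes written to out (at most 3), or -1 on an encoding error. */
int sketch_wctomb(unsigned char *out, uint16_t wc, struct sketch_state *st)
{
    assert(out != NULL && st != NULL);
    if (st->pending) {
        uint32_t cp = st->pending;
        st->pending = 0;
        if (wc < 0xDC00 || wc > 0xDFFF)
            return -1;  /* too late: the lead byte was already returned */
        cp += wc & 0x3FF;
        out[0] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (wc >= 0xD800 && wc <= 0xDBFF) {
        /* First half: the lead byte depends only on this half, so it is
           emitted now and the rest is remembered for the next call. */
        st->pending = 0x10000 + ((uint32_t)(wc & 0x3FF) << 10);
        out[0] = (unsigned char)(0xF0 | (st->pending >> 18));
        return 1;
    }
    if (wc >= 0xDC00 && wc <= 0xDFFF)
        return -1;      /* lone second half */
    if (wc < 0x80) {
        out[0] = (unsigned char)wc;
        return 1;
    }
    if (wc < 0x800) {
        out[0] = (unsigned char)(0xC0 | (wc >> 6));
        out[1] = (unsigned char)(0x80 | (wc & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (wc >> 12));
    out[1] = (unsigned char)(0x80 | ((wc >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (wc & 0x3F));
    return 3;
}
```

Feeding 0xD83D followed by 0x0041 returns one byte (0xf0) from the first
call and -1 from the second, at which point the caller already holds a
lead byte that can no longer be taken back.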

For that reason, I invented the aforementioned \016\377\x sequence
to represent lone surrogate second halves.
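
As a rough sketch of that scheme, assuming the trailing byte x is simply
the low byte of the U+DCxx value (an assumption, not confirmed here), the
escape and the convmv-like rewrite to \c2\x or \c3\x could look like
this.  Both function names are hypothetical and this is not the
sys_cp_wcstombs code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Escape a lone second half U+DCxx as the three bytes 0x0E 0xFF xx.
   ASSUMPTION: the trailing byte is the low byte of the code unit.
   Returns 3 on success, -1 if wc is not in U+DC00..U+DCFF. */
int escape_lone_low_surrogate(unsigned char out[3], uint16_t wc)
{
    assert(out != NULL);
    if ((wc & 0xFF00) != 0xDC00)
        return -1;
    out[0] = 0x0E;                        /* \016, ASCII SO (^N) */
    out[1] = 0xFF;                        /* \377, never valid in UTF-8 */
    out[2] = (unsigned char)(wc & 0xFF);  /* assumed meaning of x */
    return 3;
}

/* Convmv-like step (assumed): rewrite an escaped byte x in 0x80..0xFF as
   the two-byte UTF-8 encoding of U+00xx, whose lead byte is 0xC2 or 0xC3.
   Returns 2 on success, -1 for bytes below 0x80. */
int escaped_byte_to_utf8(unsigned char out[2], unsigned char x)
{
    assert(out != NULL);
    if (x < 0x80)
        return -1;
    out[0] = (unsigned char)(0xC0 | (x >> 6));
    out[1] = (unsigned char)(0x80 | (x & 0x3F));
    return 2;
}
```

Under these assumptions U+DCE9 is escaped as 0x0e 0xff 0xe9, and the
rewrite step turns the 0xe9 byte into 0xc3 0xa9.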

The only other alternative would be to revert all the surrogate pair
handling changes and to allow only UCS-2 again, thus giving up
support for Unicode values >= U+10000.

> I vote for the proposal here, with added fence-sitting in the form of
> a CYGWIN option called 'filename_charset' (or some such) taking
> precedence over LC_ALL/LC_CTYPE/LANG.
> 
> With that, setting 'CYGWIN=fncset:UTF-8' would yield
> http://cygwin.com/ml/cygwin-developers/2009-09/msg00050.html.

No, please.  I was glad to get rid of CYGWIN=codepage:[ansi|oem].
The simple workaround is not to set LC_ALL/LC_CTYPE/LANG at all,
or to set it to "xx_XX.UTF-8".


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

